keewis (Member) commented Jun 3, 2025

Note: This is a highly experimental PR, and a lot remains to be done before it is usable.

This tries to use the RangeMOCIndex class from EOPF-DGGS/healpix-geo#31 instead of the standard pandas index.

In theory, this should allow opening and decoding datasets up to level 29 without having to keep the cell ids in memory (and quite possibly we can skip loading the cell ids altogether if we see that we have $12 \cdot 4^{\mathrm{level}}$ cells, i.e. the full domain).
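As a rough illustration of the cell-count check mentioned above, here is a minimal sketch in plain Python. The helper names `full_domain_size` and `is_full_domain` are hypothetical and not part of xdggs or healpix-geo; the only thing taken from the comment is the formula $12 \cdot 4^{\mathrm{level}}$ and the maximum level of 29.

```python
def full_domain_size(level: int) -> int:
    """Number of HEALPix cells covering the whole sphere at `level`.

    Hypothetical helper: 12 base cells, each split into 4 per level.
    """
    if not 0 <= level <= 29:
        raise ValueError(f"level must be between 0 and 29, got {level}")
    return 12 * 4**level


def is_full_domain(n_cells: int, level: int) -> bool:
    """True if the number of cells matches the full sphere at `level`.

    Assumption: the cell ids are unique, so a matching count implies
    that every cell of the level is present.
    """
    return n_cells == full_domain_size(level)


if __name__ == "__main__":
    for level in (0, 1, 10, 29):
        print(level, full_domain_size(level))
```

At level 29 the count is on the order of $3.5 \cdot 10^{18}$, which is why materializing the cell ids is not an option there.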

The current state does not support dask (so the cell ids will still be kept in memory), but this is definitely something that will be added as the PR progresses – I still have to understand how the coordinate transform indexes work in xarray.

cc @benbovy

benbovy (Member) commented Jun 4, 2025

Nice!

Not sure this PR fully closes #143, though, since this is Healpix specific.

> The current state does not support dask (so the cell ids will still be kept in memory), but this is definitely something that will be added as the PR progresses – I still have to understand how the coordinate transform indexes work in xarray.

I'll take a stab at adding coordinate transforms in a dggs-agnostic way in another PR (at least for lat/lon auxiliary coordinates); hopefully this will provide a good baseline for adding a lazy coordinate for cell ids here.

keewis (Member, Author) commented Jun 11, 2025

@benbovy, this appears to work properly now. The current behavior is:

  • if we detect a full domain (obj.sizes["cells"] == 12 * 4**level), we use a special constructor and don't look at the values
  • if the data of cell_ids is not a dask array, we use RangeMOCIndex.from_cell_ids
  • else we create a RangeMOCIndex for each chunk, then reduce using RangeMOCIndex.union
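The three cases above can be sketched with a toy stand-in. Everything below is an assumption made for illustration: `ToyRanges`, `full_domain`, `build_index`, and the representation as half-open integer ranges are hypothetical and do not reflect the actual RangeMOCIndex implementation. Only the method names `from_cell_ids` and `union` are borrowed from the comment, and a list of Python lists stands in for the chunks of a dask array.

```python
from dataclasses import dataclass
from functools import reduce


def _merge(ranges):
    """Sort half-open (start, stop) ranges and fuse overlapping or adjacent ones."""
    merged = []
    for start, stop in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return tuple(merged)


@dataclass(frozen=True)
class ToyRanges:
    """Hypothetical stand-in for a range-based index (not RangeMOCIndex)."""

    ranges: tuple

    @classmethod
    def full_domain(cls, level):
        # Case 1: a single range covers everything; no cell id values are read.
        return cls(((0, 12 * 4**level),))

    @classmethod
    def from_cell_ids(cls, cell_ids):
        # Case 2: compress explicit cell ids into ranges.
        return cls(_merge((i, i + 1) for i in cell_ids))

    def union(self, other):
        return ToyRanges(_merge(self.ranges + other.ranges))

    @property
    def size(self):
        return sum(stop - start for start, stop in self.ranges)


def build_index(chunks, level):
    """Dispatch between the three cases; `chunks` is a list of lists of cell ids."""
    n_cells = sum(len(chunk) for chunk in chunks)
    if n_cells == 12 * 4**level:
        return ToyRanges.full_domain(level)
    if len(chunks) == 1:
        # Stand-in for the in-memory (non-dask) case.
        return ToyRanges.from_cell_ids(chunks[0])
    # Case 3: one index per chunk, then reduce pairwise with union.
    per_chunk = [ToyRanges.from_cell_ids(chunk) for chunk in chunks]
    return reduce(ToyRanges.union, per_chunk)
```

With this toy model, `build_index([[0, 1, 2], [3, 7]], level=1)` yields the ranges `((0, 4), (7, 8))`. The per-chunk reduction only ever holds one chunk of cell ids plus the compact ranges at a time, which is the point of the third case.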

Missing functionality (compared to PandasIndex) is isel with arrays and sel, because I couldn't figure out how to implement those on MOCs so far.

I'll look into adding a few more tests before this is ready for merging.

keewis (Member, Author) commented Jun 16, 2025

I think I addressed the review comments (and I'll punt the decision on optional vs required dependencies to a different PR), so this should be ready for another round of reviews?

keewis (Member, Author) commented Jun 30, 2025

Indexing by a numpy array is currently broken, but for reasons unrelated to xdggs: dask.array.Array does not appear to support indexing by a numpy array if the array itself is bigger than memory, which is quite confusing. I'll have to figure out whether this is actually an issue with my environment. Funnily enough, indexing by a single-chunk dask array does work.

keewis (Member, Author) commented Jul 1, 2025

I've opened dask/dask#11998 to fix the above error.

keewis (Member, Author) commented Aug 25, 2025

I'm working on getting the macOS CI to run (there are no osx-arm64 healpix-geo builds on conda-forge yet), but other than that this should be ready, @benbovy.
