
Commit ab5d91b

add section on rechunking to chunk optimisation section of tutorial (#730)
* add section on rechunking to chunk optimisation section of tutorial
* proofreading
1 parent a8561fc commit ab5d91b

File tree

1 file changed: +39, -0 lines changed


docs/tutorial.rst

Lines changed: 39 additions & 0 deletions
@@ -1288,6 +1288,45 @@ bytes within chunks of an array may improve the compression ratio, depending on
the structure of the data, the compression algorithm used, and which compression
filters (e.g., byte-shuffle) have been applied.

.. _tutorial_rechunking:

Changing chunk shapes (rechunking)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sometimes you are not free to choose the initial chunking of your input data, or
you might have data saved with chunking which is not optimal for the analysis you
have planned. In such cases it can be advantageous to re-chunk the data. For small
datasets, or when the mismatch between input and output chunks is small
such that only a few chunks of the input dataset need to be read to create each
chunk in the output array, it is sufficient to simply copy the data to a new array
with the desired chunking, e.g. ::

    >>> a = zarr.zeros((10000, 10000), chunks=(100, 100), dtype='uint16', store='a.zarr')
    >>> b = zarr.array(a, chunks=(100, 200), store='b.zarr')

If the chunk shapes are mismatched, however, a simple copy can lead to non-optimal data
access patterns and incur a substantial performance hit when using
file-based stores. One of the most pathological examples is
switching from column-based chunking to row-based chunking, e.g. ::

    >>> a = zarr.zeros((10000, 10000), chunks=(10000, 1), dtype='uint16', store='a.zarr')
    >>> b = zarr.array(a, chunks=(1, 10000), store='b.zarr')

which will require every chunk in the input dataset to be read repeatedly when creating
each output chunk. If the entire array fits within memory, this is simply resolved
by forcing the entire input array into memory as a numpy array before converting
back to zarr with the desired chunking. ::

    >>> a = zarr.zeros((10000, 10000), chunks=(10000, 1), dtype='uint16', store='a.zarr')
    >>> b = a[...]
    >>> c = zarr.array(b, chunks=(1, 10000), store='c.zarr')

For datasets which have mismatched chunks and which do not fit in memory, a
more sophisticated approach to rechunking, such as that offered by the
`rechunker <https://github.com/pangeo-data/rechunker>`_ package and discussed
`here <https://medium.com/pangeo/rechunker-the-missing-link-for-chunked-array-analytics-5b2359e9dc11>`_,
may offer a substantial improvement in performance.
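
For illustration only, rechunking the pathological example above with
``rechunker`` might look something like the following sketch (not run as part of
this tutorial; the ``max_mem`` value and the store paths are arbitrary choices,
and the plan-then-execute usage follows the rechunker documentation) ::

    >>> from rechunker import rechunk
    >>> a = zarr.zeros((10000, 10000), chunks=(10000, 1), dtype='uint16', store='a.zarr')
    >>> # build a plan converting column chunks to row chunks, keeping the
    >>> # memory used by each task below max_mem and staging intermediate
    >>> # chunks in a temporary store
    >>> plan = rechunk(a, target_chunks=(1, 10000), max_mem='256MB',
    ...                target_store='c.zarr', temp_store='tmp.zarr')
    >>> c = plan.execute()  # carry out the copy (via dask by default)

Roughly speaking, the temporary store allows the copy to proceed in two passes,
so that chunks of the input do not need to be read repeatedly.
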
.. _tutorial_sync:

Parallel computing and synchronization
