@@ -1288,6 +1288,45 @@ bytes within chunks of an array may improve the compression ratio, depending on
the structure of the data, the compression algorithm used, and which compression
filters (e.g., byte-shuffle) have been applied.
+ .. _tutorial_rechunking:
+
+ Changing chunk shapes (rechunking)
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ Sometimes you are not free to choose the initial chunking of your input data, or
+ you might have data saved with chunking which is not optimal for the analysis you
+ have planned. In such cases it can be advantageous to re-chunk the data. For small
+ datasets, or when the mismatch between input and output chunks is small
+ such that only a few chunks of the input dataset need to be read to create each
+ chunk in the output array, it is sufficient to simply copy the data to a new array
+ with the desired chunking, e.g. ::
+
+ >>> a = zarr.zeros((10000, 10000), chunks=(100,100), dtype='uint16', store='a.zarr')
+ >>> b = zarr.array(a, chunks=(100, 200), store='b.zarr')
+
+ If the chunk shapes are mismatched, however, a simple copy can lead to non-optimal data
+ access patterns and incur a substantial performance hit when using
+ file-based stores. One of the most pathological examples is
+ switching from column-based chunking to row-based chunking, e.g. ::
+
+ >>> a = zarr.zeros((10000, 10000), chunks=(10000, 1), dtype='uint16', store='a.zarr')
+ >>> b = zarr.array(a, chunks=(1,10000), store='b.zarr')
+
+ which will require every chunk in the input data set to be read repeatedly when creating
+ each output chunk (here, all 10,000 column chunks must be read to assemble each of the
+ 10,000 row chunks). If the entire array will fit within memory, this is simply resolved
+ by forcing the entire input array into memory as a numpy array before converting
+ back to zarr with the desired chunking. ::
+
+ >>> a = zarr.zeros((10000, 10000), chunks=(10000, 1), dtype='uint16', store='a.zarr')
+ >>> b = a[...]
+ >>> c = zarr.array(b, chunks=(1,10000), store='c.zarr')
+
+ For data sets which have mismatched chunks and which do not fit in memory, a
+ more sophisticated approach to rechunking, such as that offered by the
+ `rechunker <https://github.com/pangeo-data/rechunker>`_ package and discussed
+ `here <https://medium.com/pangeo/rechunker-the-missing-link-for-chunked-array-analytics-5b2359e9dc11>`_,
+ may offer a substantial improvement in performance.
+
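+ As a rough sketch only, rechunking the pathological example above out-of-core with
+ rechunker might look something like the following, where ``'c.zarr'``, ``'tmp.zarr'``
+ and the ``max_mem`` limit are illustrative choices rather than part of the examples
+ above, and the exact call signature should be checked against the rechunker
+ documentation::
+
+ >>> # 'a.zarr' is the column-chunked array created above
+ >>> import zarr
+ >>> from rechunker import rechunk
+ >>> source = zarr.open('a.zarr')
+ >>> # build a rechunking plan, staging intermediate data in a temporary store
+ >>> plan = rechunk(source, target_chunks=(1, 10000), max_mem='500MB',
+ ...               target_store='c.zarr', temp_store='tmp.zarr')
+ >>> # execute the plan (by default this runs via dask)
+ >>> result = plan.execute()
+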
.. _tutorial_sync:
Parallel computing and synchronization