Shuffling with HDF5Pump gets slower with more blobs in file (shuffle-speed / blob)
When shuffling h5 files with HDF5Pump (and writing them with HDF5Sink), the time needed per blob increases with the number of blobs in the input file.
For example, with an input h5 file of 3k blobs, 200 blobs take around one second (about 5 ms per blob), while with an input file of 1.5 million blobs, 200 blobs take around 2 minutes (about 600 ms per blob).
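To make the scaling explicit, the timings quoted above can be turned into per-blob costs. This is only arithmetic on the two rough measurements from this report; the helper name `per_blob_ms` is made up for illustration.

```python
def per_blob_ms(seconds, n_blobs=200):
    """Convert the wall time for a batch of blobs into milliseconds per blob."""
    return seconds * 1000.0 / n_blobs


# Rough timings from this report: 200 blobs per StatusBar interval.
small_ms = per_blob_ms(1.0)     # input file with 3k blobs
large_ms = per_blob_ms(120.0)   # input file with 1.5 million blobs

slowdown = large_ms / small_ms        # how much slower each blob gets
size_ratio = 1_500_000 / 3_000        # how much larger the input file is

print(f"{small_ms:.0f} ms/blob vs {large_ms:.0f} ms/blob")
print(f"slowdown x{slowdown:.0f} for a file x{size_ratio:.0f} larger")
```

So the per-blob time grows by a factor of about 120 while the file grows by a factor of 500. With only two data points this does not pin down the scaling law, but it is clearly not constant per blob.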
The files use a chunk size of 32 along axis_0 and gzip compression at level 1.
We need to check where this slowdown comes from.
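One candidate cause worth ruling out (a guess, not a confirmed diagnosis) is chunk cache misses: with shuffled access, consecutive blobs rarely fall in the same 32-row chunk, and in a large file the chunks no longer fit in any cache. The sketch below is a pure-Python toy model, not HDF5 itself. The LRU cache and its capacity of 64 chunks are assumptions chosen for illustration only.

```python
import random
from collections import OrderedDict


def count_chunk_loads(order, chunk_size=32, cache_chunks=64):
    """Count how many chunks must be loaded for a given row access order.

    Toy model: an LRU cache holding `cache_chunks` chunks (assumed value).
    """
    cache = OrderedDict()
    loads = 0
    for row in order:
        chunk = row // chunk_size
        if chunk in cache:
            cache.move_to_end(chunk)       # cache hit: mark as recently used
        else:
            loads += 1                     # cache miss: chunk is read and unpacked
            cache[chunk] = True
            if len(cache) > cache_chunks:
                cache.popitem(last=False)  # evict the least recently used chunk
    return loads


def compare(n_blobs, n_read=200, seed=0):
    """Return (sequential_loads, shuffled_loads) for the first n_read blobs."""
    rng = random.Random(seed)
    shuffled = list(range(n_blobs))
    rng.shuffle(shuffled)
    return (count_chunk_loads(range(n_read)),
            count_chunk_loads(shuffled[:n_read]))


seq_small, shuf_small = compare(3_000)
seq_large, shuf_large = compare(1_500_000)
print("3k blobs:   sequential", seq_small, "shuffled", shuf_small)
print("1.5M blobs: sequential", seq_large, "shuffled", shuf_large)
```

In this model the shuffled read of the large file loads close to one chunk per blob, while the small file benefits from cache hits. Whether this accounts for the full slowdown observed here would have to be confirmed by profiling.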
Example code for shuffling:
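import km3pipe as kp
import km3modules as km

# filepath_input, filepath_output, complib, complevel and chunksize are set elsewhere.
pipe = kp.Pipeline(timeit=True)
# Report progress and memory usage every 200 blobs.
pipe.attach(km.common.StatusBar, every=200)
pipe.attach(km.common.MemoryObserver, every=200)
# Read the blobs in shuffled order and write them to a new file.
pipe.attach(kp.io.hdf5.HDF5Pump, filename=filepath_input, shuffle=True, reset_index=True)
pipe.attach(kp.io.hdf5.HDF5Sink, filename=filepath_output, complib=complib, complevel=complevel, chunksize=chunksize, flush_frequency=1000)
pipe.drain()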
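To see where the time goes, one option is to run the drain step under the standard-library profiler. The sketch below uses a stand-in workload so that it runs without km3pipe; `profile_call` and `stand_in_workload` are hypothetical helper names, not part of km3pipe.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def stand_in_workload(n):
    """Placeholder for the real work (here: a sum of squares)."""
    return sum(i * i for i in range(n))


result, report = profile_call(stand_in_workload, 1000)
print(report)
```

In the shuffling script, `pipe.drain` would take the place of `stand_in_workload`, for example `profile_call(pipe.drain)`. Comparing the reports for the 3k and the 1.5 million blob file should show which calls account for the extra time per blob.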
Example files:
1.5 million blobs:
/home/saturn/capn/mppi033h/Data/input_images/ORCA_2016_115l/tight_0_100b_t_bg_classifier/data_splits/xyzc/xyzc_tight_0_100b_bg_classifier_dataset_train_0.h5
3k blobs:
/home/saturn/capn/mppi033h/Data/input_images/ORCA_2016_115l/tight_0_100b_t_bg_classifier/tau-CC/3-100GeV/xyzc/JTE_KM3Sim_gseagen_tau-CC_3_4-100GeV-2_0E8-1bin-3_0gspec_ORCA115_9m_2016_999_xyzc.h5