Skipping blobs in pipeline leads to missing data in hdf5
Summary
Skipping a blob in a pipeline can lead to missing data once the resulting HDF5 file is read and written again.
Environment
- KM3Pipe version (km3pipe --version): 8.27.5
- Python version (python --version): 3.6.8
- OS (uname -a): Linux w1037 4.15.0-50-generic #54-Ubuntu SMP Mon May 6 18:46:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Steps to reproduce
Even this minimal example is quite complex, unfortunately :-) Here is the code I'm using:
import km3pipe as kp
import numpy as np


class DummyPump(kp.Module):
    """Generate some dummy data."""

    def configure(self):
        self.i = 0

    def process(self, blob):
        blob = kp.Blob()
        blob["x"] = kp.NDArray(np.ones((1, 3, 3)) * self.i, h5loc="x")
        blob["y"] = kp.Table.from_dict({"aa": self.i * 2.}, h5loc="y")
        self.i += 1
        return blob


class Skipper(kp.Module):
    """Skip the second blob."""

    def configure(self):
        self.i = 0

    def process(self, blob):
        # self.i is incremented before the check, so the second blob is dropped
        self.i += 1
        if self.i == 2:
            self.cprint("skipped")
            return
        else:
            return blob


def gen(outfile, infile=None, skip_one_blob=False):
    pipe = kp.Pipeline()
    if infile is None:
        pipe.attach(DummyPump)
    else:
        pipe.attach(kp.io.HDF5Pump, filename=infile)
    if skip_one_blob:
        pipe.attach(Skipper)
    pipe.attach(kp.io.HDF5Sink, filename=outfile)
    pipe.drain(5)
Step 1
First, I generate an HDF5 file with the DummyPump, containing an NDArray ("x") and a Table ("y"), by calling gen("base.h5").
The file looks like this:
ptdump base.h5
/ (RootGroup) 'KM3NeT'
/group_info (Table(5,), fletcher32, shuffle, zlib(5)) 'Group Info'
/x (EArray(5, 3, 3), fletcher32, shuffle, zlib(5)) 'Unnamed NDArray'
/x_indices (Table(5,), fletcher32, shuffle, zlib(5)) 'Indices'
/y (Table(5,), fletcher32, shuffle, zlib(5)) 'Generic Table'
Step 2
Next, I read this file back in but skip the second blob (group_id 1) using the Skipper:
gen("skipped.h5", infile="base.h5", skip_one_blob=True)
The file looks like this:
ptdump skipped.h5
/ (RootGroup) 'KM3NeT'
/group_info (Table(4,), fletcher32, shuffle, zlib(5)) 'GroupInfo'
/x (EArray(4, 3, 3), fletcher32, shuffle, zlib(5)) 'Unnamed NDArray'
/x_indices (Table(4,), fletcher32, shuffle, zlib(5)) 'Indices'
/y (Table(4,), fletcher32, shuffle, zlib(5)) 'Y'
The group_id/indices look like this:
import h5py
f = h5py.File("skipped.h5", "r")
f["x_indices"]["index"]
# --> array([0, 1, 2, 3])
f["y"]["group_id"]
# --> array([0, 2, 3, 4])
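The mismatch between the two arrays above can be made explicit with a small helper. This is only a sketch for illustration: `missing_group_ids` is a hypothetical name, not part of km3pipe, and the values are copied by hand from the output above rather than read from the file. It also assumes that IDs normally run contiguously from 0.

```python
def missing_group_ids(group_ids):
    """Return the IDs absent from an otherwise contiguous 0..max range."""
    present = {int(g) for g in group_ids}
    if not present:
        return []
    return [g for g in range(max(present) + 1) if g not in present]


# Values copied from the skipped.h5 output above (not read from disk here).
x_index = [0, 1, 2, 3]
y_group_id = [0, 2, 3, 4]

print(missing_group_ids(y_group_id))  # [1]
print(missing_group_ids(x_index))     # []
```

So the table keeps the gap left by the skipped blob, while the index column of the NDArray is contiguous. Whether that difference is what triggers the behaviour in Step 3 is a guess on my part.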
Step 3
Next, I simply read the skipped file with the HDF5 pump and immediately write it out again with the sink:
gen("read_again.h5", "skipped.h5")
ptdump reveals that the resulting file still has 4 rows for y (as it should) but only 3 rows for x:
ptdump read_again.h5
/ (RootGroup) 'KM3NeT'
/group_info (Table(4,), fletcher32, shuffle, zlib(5)) 'GroupInfo'
/x (EArray(3, 3, 3), fletcher32, shuffle, zlib(5)) 'Unnamed NDArray'
/x_indices (Table(3,), fletcher32, shuffle, zlib(5)) 'Indices'
/y (Table(4,), fletcher32, shuffle, zlib(5)) 'Y'
Is this intended? It seems to me that one entry in x got dropped for no reason.
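For anyone trying to reproduce this, a rough consistency check on the row counts might look like the sketch below. It assumes that every blob contributes exactly one row to each dataset (true for this example, not in general), and the counts are typed in from the ptdump listings rather than read from the files. `inconsistent_datasets` is a hypothetical helper, not a km3pipe function.

```python
def inconsistent_datasets(row_counts, reference="group_info"):
    """Return the datasets whose row count differs from the reference dataset."""
    expected = row_counts[reference]
    return {name: n for name, n in row_counts.items() if n != expected}


# Row counts taken from the ptdump listings of the three files.
base = {"group_info": 5, "x": 5, "x_indices": 5, "y": 5}
skipped = {"group_info": 4, "x": 4, "x_indices": 4, "y": 4}
read_again = {"group_info": 4, "x": 3, "x_indices": 3, "y": 4}

for name, counts in [("base", base), ("skipped", skipped), ("read_again", read_again)]:
    print(name, inconsistent_datasets(counts))
# base {}
# skipped {}
# read_again {'x': 3, 'x_indices': 3}
```

Only read_again.h5 is flagged, which matches the observation that the entry is lost on the second read and write, not when the blob is skipped.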