Skipping blobs in pipeline leads to missing data in hdf5

Summary

Skipping a blob in a pipeline can cause data to go missing once the resulting HDF5 file is read in and written out again.

Environment

KM3Pipe version (km3pipe --version): 8.27.5
Python version (python --version): 3.6.8
OS: (uname -a) Linux w1037 4.15.0-50-generic #54-Ubuntu SMP Mon May 6 18:46:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Steps to reproduce

Even this minimal example is quite complex, unfortunately :-) Here is the code I'm using:

import km3pipe as kp
import numpy as np

class DummyPump(kp.Module):
    """ Generate some dummy data. """
    def configure(self):
        self.i = 0

    def process(self, blob):
        blob = kp.Blob()
        blob["x"] = kp.NDArray(np.ones((1, 3, 3))*self.i, h5loc="x")
        blob["y"] = kp.Table.from_dict({"aa": self.i * 2.}, h5loc="y")
        self.i += 1
        return blob


class Skipper(kp.Module):
    """ Skip the third blob. """
    def configure(self):
        self.i = 0

    def process(self, blob):
        self.i += 1
        if self.i == 2:
            self.cprint("skipped")
            return
        else:
            return blob


def gen(outfile, infile=None, skip_one_blob=False):
    pipe = kp.Pipeline()
    if infile is None:
        pipe.attach(DummyPump)
    else:
        pipe.attach(kp.io.HDF5Pump, filename=infile)
    if skip_one_blob:
        pipe.attach(Skipper)
    pipe.attach(kp.io.HDF5Sink, filename=outfile)
    pipe.drain(5)

Step 1

First, I generate an HDF5 file using the DummyPump, containing an NDArray ("x") and a Table ("y"), with this: gen("base.h5").

The file looks like this:

ptdump base.h5

/ (RootGroup) 'KM3NeT'
/group_info (Table(5,), fletcher32, shuffle, zlib(5)) 'Group Info'
/x (EArray(5, 3, 3), fletcher32, shuffle, zlib(5)) 'Unnamed NDArray'
/x_indices (Table(5,), fletcher32, shuffle, zlib(5)) 'Indices'
/y (Table(5,), fletcher32, shuffle, zlib(5)) 'Generic Table'

Step 2

Next, I read in this file but skip the second blob (group_id 1) using the Skipper with this: gen("skipped.h5", infile="base.h5", skip_one_blob=True).

The file looks like this:

ptdump skipped.h5

/ (RootGroup) 'KM3NeT'
/group_info (Table(4,), fletcher32, shuffle, zlib(5)) 'GroupInfo'
/x (EArray(4, 3, 3), fletcher32, shuffle, zlib(5)) 'Unnamed NDArray'
/x_indices (Table(4,), fletcher32, shuffle, zlib(5)) 'Indices'
/y (Table(4,), fletcher32, shuffle, zlib(5)) 'Y'

The group_id/indices look like this:

import h5py

f = h5py.File("skipped.h5", "r")
f["x_indices"]["index"]
# --> array([0, 1, 2, 3])
f["y"]["group_id"]
# --> array([0, 2, 3, 4])
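
To make the mismatch explicit: the index values of x_indices run contiguously from 0 to 3, while the group_id column of y keeps the original numbering and so has a gap where the blob was skipped. Here is a minimal sketch for spotting such gaps, using plain Python lists filled with the values printed above. The helper name missing_group_ids is hypothetical and not part of KM3Pipe.

```python
def missing_group_ids(group_ids):
    """Return the IDs absent from an otherwise contiguous range of group IDs."""
    present = set(int(g) for g in group_ids)
    if not present:
        return []
    return [g for g in range(min(present), max(present) + 1) if g not in present]


x_index = [0, 1, 2, 3]     # values of f["x_indices"]["index"] in skipped.h5
y_group_id = [0, 2, 3, 4]  # values of f["y"]["group_id"] in skipped.h5

print(missing_group_ids(x_index))     # []
print(missing_group_ids(y_group_id))  # [1]
```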

Step 3

Next, I simply read the skipped file in with the HDF5 pump and immediately write it out again with the sink, like this: gen("read_again.h5", "skipped.h5")

ptdump reveals that the resulting file still has 4 rows for y (as it should) but only 3 rows for x:

ptdump read_again.h5

/ (RootGroup) 'KM3NeT'
/group_info (Table(4,), fletcher32, shuffle, zlib(5)) 'GroupInfo'
/x (EArray(3, 3, 3), fletcher32, shuffle, zlib(5)) 'Unnamed NDArray'
/x_indices (Table(3,), fletcher32, shuffle, zlib(5)) 'Indices'
/y (Table(4,), fletcher32, shuffle, zlib(5)) 'Y'
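
The row counts in the ptdump listings of skipped.h5 and read_again.h5 can also be compared mechanically. The sketch below does this with the numbers copied by hand from those listings. The helper name lost_rows is hypothetical and not part of KM3Pipe.

```python
def lost_rows(before, after):
    """Map each dataset name to the number of rows it lost between two files."""
    return {
        name: n_before - after.get(name, 0)
        for name, n_before in before.items()
        if after.get(name, 0) < n_before
    }


# Leading dimensions copied from the ptdump listings of the two files.
skipped = {"group_info": 4, "x": 4, "x_indices": 4, "y": 4}
read_again = {"group_info": 4, "x": 3, "x_indices": 3, "y": 4}

print(lost_rows(skipped, read_again))  # {'x': 1, 'x_indices': 1}
```

With real files, the two dictionaries could presumably be filled from the leading dimension of each dataset, for example via the shape attribute of an h5py dataset.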

Is this intended? It seems to me that one entry in x got dropped for no reason...
