Data Provenance in the Pipeline
Data provenance can be managed by the Pipeline
itself. I think kp.io.hdf5.HDF5MetadataService
should be integrated directly into the pipeline, and every Sink
should write that metadata into the target output file.
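A minimal sketch of what such a service could look like; note that the class and method names here are hypothetical illustrations, not the actual km3pipe API:

```python
class ProvenanceService:
    """Collects flat str -> str metadata from every module in a pipeline.

    Hypothetical sketch -- not the real kp.io.hdf5.HDF5MetadataService.
    """

    def __init__(self):
        self._meta = {}

    def record(self, module_name, key, value):
        # Prefix each key with the module name so entries from different
        # modules cannot collide (mirroring the h5info output below).
        self._meta["{} {}".format(module_name, key)] = str(value)

    def as_dict(self):
        # A Sink could dump this dict into the output file's attributes.
        return dict(self._meta)


service = ProvenanceService()
service.record("HDF5Sink", "use_jppy", False)
service.record("HDF5Sink", "n_rows_expected", 10000)
```

Every Sink would then only need to serialize as_dict() into its target file.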
Currently the data type is limited to a simple str -> str mapping
due to the limitations of the HDF5 attribute structure. Maybe we could store JSON as the value.
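Storing JSON strings as values would keep the str -> str constraint while still allowing structured data. A quick round-trip sketch in plain Python (no HDF5 involved; the "HDF5Sink" key is just an example):

```python
import json

# The attribute store only ever sees str -> str: the structured value
# is serialised to a JSON string before being stored.
meta = {}
meta["HDF5Sink"] = json.dumps({"use_jppy": False, "n_rows_expected": 10000})

# Reading it back restores the original structure.
restored = json.loads(meta["HDF5Sink"])
```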
Currently, h5info shows the metadata like this:
> h5info mcv5.0.DAT000626.propa.sirene.jte.jchain.aanet.625.root.h5
HDF5 Meta Data
--------------
format_version: 5.1
ignore_hits: True
jpp: 10585M
km3pipe: 8.4
n_rows_expected: 10000
origin: mcv5.0.DAT000626.propa.sirene.jte.jchain.aanet.625.root
pytables: 3.4.4
use_jppy: False
I think we should divide this into stages somehow, like this (when inspecting h5info foo.h5):
Meta Data (note that this is not HDF5-specific)
=========
Stage 1
-------
origin: mcv5.0.DAT000626.propa.sirene.jte.jchain.aanet.625.root
cmd: tohdf5 --ignore-hits mcv5.0.DAT000626.propa.sirene.jte.jchain.aanet.625.root
jpp: 10585M
km3pipe: 8.4
pytables: 3.4.4
HDF5Sink use_jppy: False
HDF5Sink ignore_hits: True
HDF5Sink n_rows_expected: 10000
HDF5Sink format_version: 5.1
Stage 2
-------
origin: mcv5.0.DAT000626.propa.sirene.jte.jchain.aanet.625.root.h5
cmd: awesome_script.py -a 23 -x 5 -f mcv5.0.DAT000626.propa.sirene.jte.jchain.aanet.625.root.h5 -o foo.h5
km3pipe: 8.4
pytables: 3.4.4
AnAwesomeModule param_1: 500
AnAwesomeModule bar: 42
SomeOtherModule narf: 'whatever'
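The staged layout above could be represented as a simple list of flat dicts, one per processing step, and rendered back into the proposed h5info-style view. A rough sketch (filenames and keys shortened, all names hypothetical):

```python
# Each pipeline run appends one stage; a stage is just a flat
# str -> str mapping, matching the current attribute limitation.
stages = [
    {"origin": "foo.root", "km3pipe": "8.4", "HDF5Sink use_jppy": "False"},
    {"origin": "foo.root.h5", "AnAwesomeModule param_1": "500"},
]


def render(stages):
    """Render the stages in the proposed h5info-style layout."""
    lines = ["Meta Data", "========="]
    for i, stage in enumerate(stages, 1):
        header = "Stage {}".format(i)
        lines += [header, "-" * len(header)]
        lines += ["{}: {}".format(k, v) for k, v in stage.items()]
    return "\n".join(lines)
```

Appending a new stage on every run would keep the full processing history without touching earlier entries.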
Just brainstorming ;)
Edited by Tamas Gal