Best track?

changed milestone to %v1.0

mentioned in issue #38 (closed)

Seems fine to me, that's the only one that I need ;) Do you want to include rec_stages slicing in strategy as well @tgal ?

Yeah well, you and @zaly are the experts on these best track strategies so far, so please feel free to add your proposals ;)

If mc_tracks would contain proper non-empty rec_stages entries, it would be quite straightforward. Otherwise hmm, maybe some index matching would work.

Yes, I suggest we add some cleaning in the tracks (trivial one: remove non-reconstructed tracks). I think it is also good to make sure that the tracks selected are fully reconstructed (following whatever reconstruction stages the user wants)... But that is under discussion... at the moment, we still need to explore the best reco strategies, so please @pkalaczynski do not expect a fast implementation as we would like to make this best reco method as general as possible (for the time being, since km3net has not yet agreed on a common definition of best reco track!).

No problem, just monitoring the progress ;)

I agree. Something like "filter all tracks with stages [x, y, z]" would of course still be needed, I guess? ;)

Yep, exactly. We want it to be user-friendly, but also let people do complex stuff as well. And once again as a reminder, this should be applicable not just on tracks, but also their respective mc_tracks, mc_hits, hits etc. ;)

A relevant issue is here: https://git.km3net.de/working_groups/comp_soft/issues/38

Note that we should keep in mind that we can have a mixture of reconstructed tracks inside the same vector from different reconstruction algorithms. At the moment we did not mix different reconstruction algorithms in the same processing pipeline, but this might (and will) be the case in future.

This means that there is a requirement to also always check the rec_type of each entry.

The data structure (flat vector) which is provided by the aanet data format is clearly not ideal for this.

I would like to add something, if I may. Maybe this is already possible with basic numpy, but: At times one might need a mask for the selected "best tracks" as much as the best tracks and their information. My use case for example: I have a run processed with the standard reco and one with an alternative one (deep learning in this case). In order to compare the same events to one another I have to extract a "is reconstructed" mask from the standard reco first.

Is there an easy way to do it?

If a mask is also returned by such a "best track" function (or maybe even an own function) it could be used by the user for the mc_tracks and hits as well.

I would like to add something, if I may

Yes please! Always! :)

My use case for example: I have a run processed with the standard reco and one with an alternative one (deep learning in this case). In order to compare the same events to one another I have to extract a "is reconstructed" mask from the standard reco first.

This boils down to our overall goal: providing a natural way of slicing/indexing the KM3NeT datastructures in a whole, just as they were numpy arrays.

Is there an easy way to do it?

Kind of... with a few caveats We are working on this very hard, every day and it's a tough task to implement it so that it's both user-friendly and has maximum performance.

We face multiple problems, but one of the biggest issues is of course that we are not dealing with homogeneous data like 2D matrices (with named columns) but variable lengths of arrays inside variable lengths of arrays inside of variable length arrays.

Anyways, we are on the right track and working on the branch https://git.km3net.de/km3py/km3io/-/tree/refactor-offline which you can try as you want (pip install git+https://git.km3net.de/km3py/km3io.git@refactor-offline).

Here are some examples of the slicing/indexing/masking feature:

>>> import km3io
>>> km3io.version
'0.9.1.dev72+g5dee01f'
>>> f = km3io.OfflineReader("datav5.40.jorcarec.aanet.00006585.root")
>>> tracks = f.events.tracks
>>> tracks
<Branch[tracks]: 141098 elements>
>>> f.events.n_tracks
array([128, 128, 128, ..., 128, 128, 120])
>>> mask = f.events.n_tracks > 0  # boolean mask for events with at least 1 track
>>> mask
<ChunkedArray [True True True True True True True ...] at 0x7facfe227c10>
>>> pos_x_of_every_first_track = tracks[mask].pos_x[:, 0]
>>> pos_x_of_every_first_track
<ChunkedArray [29.758563081224473 44.87933353564635 63.4742112731849 10.348203834859914 -10.186394233913711 1.574926331368094 43.82500580340671 ...] at 0x7facfc1a6d00>
>>> pos_x_of_every_first_track.shape
(141053,)
>>> selected_tracks = tracks[mask]
>>> selected_tracks
<Branch[tracks]: 141053 elements>
>>> selected_hits = f.events.hits[mask]
>>> selected_hits
<Branch[hits]: 141053 elements>

As you can see, the slicing is more natural and can be chained, although we still need to fix issues so don't use this in production yet.

The problem with the best_track() mask is, that it needs to be a two-dimensional mask (event_idx+track_idx) which is not very well handled by these underlying ChunkedArrays in uproot. I opened an issue end of January (https://github.com/scikit-hep/awkward-array/issues/229) and the answer from Jim himself (the creator of uproot) is:

I made a wrong design choice by introducing ObjectArray: it delays evaluation of the (expensive) unsplit-list deserialization by introducing an opaque type that doesn't work like any other data. In Awkward 1.0, the unsplit-list deserialization will be over an order of magnitude faster (no longer the very noticeable performance hit) as shown on page 14/16 of this talk, and the Methods and ObjectArray mechanisms will be replaced by a single non-opaque mechanism as shown in this section of the Coffea demo.

So, not a bug, but exactly the sort of confusing feature that will be eliminated by transitioning to Awkward 1.0.

Recently I also experimented converting these ChunkedArrays (which are ObjectArrays behind the scenes and not "well-sliceable") to AwkwardArrays using the not-yet-released awkward1 library (which is going to replace awkward soon) and there the slicing works as expected. However, we will deal with the issue that those arrays are not lazy, which means that they will eventually need to load all the data in memory. This might not be a problem for high(est) level data though. You can see the issue here: https://github.com/scikit-hep/uproot/issues/461

All in all I can say that we are close to a nice API, but it's still a lot of work. @zaly is also working hard on the test part, to boil the hell out of it by trying to figure out all use-cases which cause problems with our new approach to the indexing/masking/slicing.

Meanwhile, let us know what you think and if you have any ideas, don't hesitate to create issues or comment wherever you like! I am confident that the numpy-style high-level analysis is the way to go.

I'd love to see pandas-like operations and if we had 2D shaped data (matrices) it would be my first choice, as it is for KM3Pipe with the thin kp.Table-class for event-by-event readout, since there you can shape every branch into a 2D columnar data independently (one table for Hits, one for Tracks etc.), but in case of km3io the user wants to deal with the whole file at once, which makes it awkward.

Thanks for your detailed answer.

Very much appreciate all of your works!

Your example already looks pretty much like what I am dreaming about :)

Just a tiny update: @zaly is working on the best_track implementation and meanwhile the slicing has been implemented to work around with the uproot3 limitations using the new awkward1 library.

Feel free to test it!

As always: the latest master is updated in Lyon on every git push (module load python/3.7.5).

changed the description

Hi, does a "best_track_selector" exists by now? I could really use one :)

Not sure if a dedicated function is available yet, but if you want to just have tracks with all reco stages and with their best fits, you can typically just use masks like in this example:

for f in FNAMES:
    print("Opening ", f)
    r = ki.OfflineReader(f)
    if len(r.events)>1: # only non-empty files
        premask = [rstgs.size>1 for rstgs in r.events.tracks.id]
        mask = [[1, 2, 5, 3, 5, 4] == list(s[0]) for s in r.events.tracks.rec_stages[premask]]
        VARS[0].extend([sum(tots) for tots in r.events.hits.tot[premask][mask]])

Not sure if that's the best way, but it works and it's rather fast.

I don't know about the current status, @zaly do you have some news on this?

Hey, sorry I completely forgot answering here.

@lmaderer @pkalaczynski @dguderian I implemented a prototype some time ago but I have not tested it with heavy files (so I have no idea about the performance). Please go ahead and try it out and let me know your feedback, I will really appreciate it :)

This is how to proceed:

Here is some documentation:

best_track(tracks, strategy="first", rec_type=None, rec_stages=None):
    """best track selection based on different strategies

    Parameters
    ----------
    tracks : class km3io.offline.OfflineBranch
        the tracks branch.
    strategy : str
        the trategy desired to select the best tracks. It is either: 
            - "first" : to select the first tracks.
            - "rec_stages": to select all tracks with a specific reconstruction stages.
            This requires the user to specify the reconstruction stages in rec_stages input.
            Example: best_track(my_tracks, strategy="rec_stages", rec_stages=[1, 2, 3, 4, 5]).
            - "default": to select the best tracks (the first ones) corresponding to a specific
            reconstruction algorithm (JGandalf, Jshowerfit, etc). This requires rec_type input.
            Example: best_track(my_tracks, strategy="default", rec_type="JPP_RECONSTRUCTION_TYPE").
    rec_type : str, optional
        reconstruction type as defined in the official KM3NeT-Dataformat.
    rec_stages : list, optional
        list of the reconstruction stages. Example: [1, 2, 3, 4, 5]

    Returns
    -------
    class km3io.offline.OfflineBranch
        tracks class with the desired "best tracks" selection.
    """

and here is how to proceed:


import km3io
from km3io.tools import best_track

tracks = km3io.OfflineReader(my_file).events.tracks

# ""
best_reco = best_track(tracks, strategy="your_strategy", ...)

in the case of strategy = "default", to get the reconstruction type for the input rec_type, you can check:

from km3io.tools import rec_types

rec_types()

this should return all the names of the reconstruction types as defined in the official KM3NeT-Dataformat.

This is not necessarily the final implementation, so any feedback from you is very appreciated.

Thanks ;)

added awaiting feedback label

Thanks @zaly I decided to give it a try. There are still some problems with empty tracks which gives an IndexError. So we need to exclude those. Here is an example using a file from the ORCA DU4 production which has 3 events without any reconstructed tracks:

>>> rootfile = km3io.OfflineReader("mcv5.40.mupage_10G.sirene.jterbr00005974.jorcarec.aanet.821.root")
>>> sum(rootfile.events.n_tracks == 0)
3

Here is the full traceback:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
~/.virtualenvs/km3net/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/.virtualenvs/km3net/lib/python3.7/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    392                         if cls is not object \
    393                                 and callable(cls.__dict__.get('__repr__')):
--> 394                             return _repr_pprint(obj, self, cycle)
    395 
    396             return _default_pprint(obj, self, cycle)

~/.virtualenvs/km3net/lib/python3.7/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    682     """A pprint that just redirects to the normal repr function."""
    683     # Find newlines and replace them with p.break_()
--> 684     output = repr(obj)
    685     lines = output.splitlines()
    686     with p.group():

~/Dev/km3io/km3io/rootio.py in __repr__(self)
    198 
    199     def __repr__(self):
--> 200         length = len(self)
    201         return "<{}[{}]: {} element{}>".format(self.__class__.__name__,
    202                                                self._mapper.name, length,

~/Dev/km3io/km3io/rootio.py in __len__(self)
    178                 unfold_indices(
    179                     self._branch[self._keymap['id']].lazyarray(
--> 180                         basketcache=BASKET_CACHE), self._index_chain))
    181 
    182     def __iter__(self):

~/Dev/km3io/km3io/tools.py in unfold_indices(obj, indices)
     29     for depth, idx in enumerate(indices):
     30         try:
---> 31             obj = obj[idx]
     32         except IndexError:
     33             print(

~/Dev/km3io/km3io/patches.py in new_getitem(self, item)
     12         return ak1.Array(self)[item]
     13     else:
---> 14         return old_getitem(self, item)
     15 
     16 

~/.virtualenvs/km3net/lib/python3.7/site-packages/awkward/array/chunked.py in __getitem__(self, where)
    428                 chunk = self._chunks[chunkid][(slice(local_start, local_stop, step),)]
    429                 if len(chunk) > 0:
--> 430                     chunk = chunk[(slice(None),) + tail]
    431                     if len(chunk) > 0:
    432                         chunks.append(chunk)

~/.virtualenvs/km3net/lib/python3.7/site-packages/awkward/array/jagged.py in __getitem__(self, where)
    616                         head = counts + head
    617                     if not self.numpy.bitwise_and(0 <= head, head < counts).all():
--> 618                         raise IndexError("index {0} is out of bounds for jagged min size {1}".format(original_head, counts.min()))
    619                     node = node._content[node._starts + head]
    620                 else:

IndexError: index 0 is out of bounds for jagged min size 0

@tgal

Thank you for your feedback!

I added a quick fix to mask empty tracks. I performed a quick test to see the performance of best_track on orca4 data as you did, here are the results:

import km3io as ki
from km3io.tools import best_track, rec_types
ki.version
>>> '0.10.1.dev17+g76ab9e1'

f = ki.OfflineReader("/home/zineb/km3net/orca4data/mc_data/data/reco/datav6.0test.jchain.aanet.00005971.root")

f.events
>>> <OfflineBranch[events]: 145028 elements>

tracks = f.events.tracks[f.events.n_tracks > 0]
len(tracks)
>>> 145007

sum(f.events.n_tracks == 0)
>>> 21

best_track(f.events, strategy="first")
>>> <OfflineBranch[tracks]: 145007 elements>

%timeit best_track(f.events, strategy="first")
>>> 9.39 ms ± 449 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit best_track(f.events, strategy="rec_stages", rec_stages=[1, 3, 5, 4])
>>> 452 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit best_track(f.events, strategy="default", rec_type="JPP_RECONSTRUCTION_TYPE")
>>> 17.7 s ± 439 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Thanks! My only concern is that you lose the correlation between the best track and the event ID. Not sure how to solve that?

Hum ... maybe we need to add an option to just return the "entire" mask to apply to the tracks/events class (after the best track selection) for those who need the corresponding events ids?

I am not sure I see a trivial solution to this at the moment! I will think about it.

Yeah, that sounds interesting!

I think the mask would do fine (at least in my case), as I do pretty much that myself at the moment ;).

I am not sure that this is as trivial as it sounds like because the shape (and the type) of masks for tracks is completely different from the shape (and the type) of the masks for events (because tracks data is always nested while events data is almost always flat!). So even if we get a mask for tracks using best_track, it won't necessarily be useful for events class.

But that would be just a 1D mask to filter events, right? I think that's already good enough

Yes, the mask for events would be a 1D, but, as you already know, the mask for tracks won't be a 1D .... So I am not sure what we want this function to do precisely ...

So if we say that an option to return masks is needed for best_track, we need to choose one of the following options:

return both masks for tracks and events (no classes are returned), and let the users do whatever they want with them.
return best tracks class and the mask for events (as an option), maybe this is what you have in mind?
return best tracks class, and the masks for tracks as an option and the mask for events as a separate option?

we can also rely on the user to apply the mask events.tracks[events.n_tracks > 0] and use ONLY the reconstructed events as input for best_tracks (since it doesn't make sense to talk about best_track selection if there isn't any reconstruction). When the mask is not applied, there will always be an IndexError raised, but I can raise a custom error to remind the user to apply the mask ...

PS: I am just thinking out loud here. I am personally in favour of doing the minimum operations possible on data and giving more flexibility to the user.

My approach is from the users point of view. If best_track(tracks, ...rectype, etc.) is executed, the user (in my point of view) assumes to retrieve the best track (a single one) for each event. I think that a 1D mask which also shows for which event there actually id a track is very natural and further selections (based on the events) can be made easily.

I don't think that the user is interested in the position of the track inside the track vector, but they definitely need access to the track parameters, which in fact can be quite annoying when there is still a nested structure of tracks per event.

We should probably think about returning a flat "vector of tracks" with all data extracted.

I probably caused some confusion here:

best_track already returns flat data! (except for the strategy "rec_stages" where some nesting is still allowed for those who want to do multiplicity studies, which can be moved to another function btw ...).

Alright, so main conclusion: add an option to return the 1D mask for events to match the tracks with their corresponding event ids!

I am wondering if we could just reuse the dtype of the events.tracks[] entries that would make it quite easy to populate a numpy recarray where the dtype is something like pos_x, pos_y, ..., rec_type etc. (all single values)

Ah ok, so it's flat already

Yea, so I don't think we need to construct a recarray ... I think I just confused you by mentioning the nested masks... .. Sorry

so here is a quick implementation of the events mask option:


def best_track(events, strategy="default", rec_type=None, rec_stages=None, event_mask=False):
    """best track selection based on different strategies

    Parameters
    ----------
    events : class km3io.offline.OfflineBranch
        the events branch.
    strategy : str
        the trategy desired to select the best tracks. It is either: 
            - "first" : to select the first tracks.
            - "rec_stages": to select all tracks with a specific reconstruction stages.
            This requires the user to specify the reconstruction stages in rec_stages input.
            Example: best_track(my_tracks, strategy="rec_stages", rec_stages=[1, 2, 3, 4, 5]).
            - "default": to select the best tracks (the first ones) corresponding to a specific
            reconstruction algorithm (JGandalf, Jshowerfit, etc). This requires rec_type input.
            Example: best_track(my_tracks, strategy="default", rec_type="JPP_RECONSTRUCTION_TYPE").
    rec_type : str, optional
        reconstruction type as defined in the official KM3NeT-Dataformat.
    rec_stages : list, optional
        list of the reconstruction stages. Example: [1, 2, 3, 4, 5]
    event_mask : bool, optional
        True: to return the mask to apply to the events class to match the best track selected with
        their corresponding event ID.

    Returns
    -------
    class km3io.offline.OfflineBranch
        tracks class with the desired "best tracks" selection.

    Raises
    ------
    ValueError
        ValueError raised when an invalid strategy is requested.
    """
    options = ['first', 'rec_stages', 'default']
    if strategy not in options:
        raise ValueError("{} not in {}".format(strategy, options))

    mask = events.n_tracks > 0
    tracks = events.tracks[mask]
    if strategy == "first":
        out = tracks[:, 0]

    elif strategy == "rec_stages" and rec_stages is not None:
        out = tracks[mask(tracks.rec_stages, rec_stages)]

    elif strategy == "default" and rec_type is not None:
        out = tracks[tracks.rec_type == reconstruction[rec_type]][
            tracks.lik == ak1.max(tracks.lik, axis=1)][:, 0]

    if event_mask:
        out = (out, mask)

    return out

TBH here, I really don't like the idea of returning the mask, because this mask is really trivial and I think it must be left to the user to ALWAYS provide a subset of the events class to the best_track function where all the non-reconstructed events have been removed (ie: where the mask events.n_tracks >0 have been applied on events). This actually will make the function best_track more general and it can be applied to any subset of events the user would like (with the condition events.n_tracks >a where a>0), in which case the events ids of the subset used and the best tracks selected will ALWAYS match!

Alright, now you might think "yea but in the previous implementation, we could still use best_track with any subset of events and it will still work". To this, I reply YES of course! BUT, if a user is not very experienced, and they use best_track function, they will be returned a subset of their input without necessarily knowing why, unless they look at the source code. While if we impose to the user to construct from the beginning their own subset, the best_track method becomes transparent and there will be no ambiguity on the tracks and their corresponding events IDS.

So here is an alternative:

def best_track(events, strategy="default", rec_type=None, rec_stages=None):
    """best track selection based on different strategies

    Parameters
    ----------
    events : class km3io.offline.OfflineBranch
        a subset of reconstructed events where `events.n_tracks > 0` is always true. 
    strategy : str
        the trategy desired to select the best tracks. It is either: 
            - "first" : to select the first tracks.
            - "rec_stages": to select all tracks with a specific reconstruction stages.
            This requires the user to specify the reconstruction stages in rec_stages input.
            Example: best_track(my_tracks, strategy="rec_stages", rec_stages=[1, 2, 3, 4, 5]).
            - "default": to select the best tracks (the first ones) corresponding to a specific
            reconstruction algorithm (JGandalf, Jshowerfit, etc). This requires rec_type input.
            Example: best_track(my_tracks, strategy="default", rec_type="JPP_RECONSTRUCTION_TYPE").
    rec_type : str, optional
        reconstruction type as defined in the official KM3NeT-Dataformat.
    rec_stages : list, optional
        list of the reconstruction stages. Example: [1, 2, 3, 4, 5]

    Returns
    -------
    class km3io.offline.OfflineBranch
        tracks class with the desired "best tracks" selection.

    Raises
    ------
    ValueError
        ValueError raised when an invalid strategy is requested.
    """
    options = ['first', 'rec_stages', 'default']
    if strategy not in options:
        raise ValueError("{} not in {}".format(strategy, options))
    
    if any(events.n_tracks == 0):
        raise ValueError("'events' should not contain empty tracks. Consider applying the mask: events.n_tracks>0")

    tracks = events.tracks
    if strategy == "first":
        out = tracks[:, 0]

    elif strategy == "rec_stages" and rec_stages is not None:
        out = tracks[mask(tracks.rec_stages, rec_stages)]

    elif strategy == "default" and rec_type is not None:
        out = tracks[tracks.rec_type == reconstruction[rec_type]][
            tracks.lik == ak1.max(tracks.lik, axis=1)][:, 0]

    return out

I don't really care about backwards compatibility until we reach 1.0, so we can do whatever we want ;) From 1.0 on we strictly follow SemVer (we actually also do now, since SemVer says that everything is breaking until 1.0 ).

Given that any(events.n_tracks == 0) has no significant impact on the performance, I think it's a good solution. I agree that we should not make best_track too smart...

After a long talk with @tgal he asked me to provide a reco file containing both recos to test best_track on, it can be found here:

/sps/km3net/users/lnauta/mcfspec.gsg_elec-CC_10-100GeV.sirene.jte.jchain.jsh.aanet.12.root

@lnauta Thank you :)

Thanks @lnauta

So we have two problems:

There are two best_track() functions (one in km3io.offline and one in km3io.tools)
km3io.offline.best_track() does not seem to accept a "single event"

In [16]: km3io.offline.best_track(f.events.tracks)
Out[16]: <OfflineBranch[tracks]: 3248 elements>

In [17]: km3io.offline.best_track(f.events[0].tracks)
Out[17]: IndexError while accessing an item from '<ChunkedArray [[1 2 3 ... 1334 1335 1336] [1 2 3 ... 122 123 124] [1 2 3 ... 952 953 954] ... [1 2 3 ... 90 91 92] [1 2 3 ... 1904 1905 1906] [1 2 3 ... 1338 1339 1340]] at 0x7f57e7e41b10>' at depth 1 ((slice(None, None, None), 0)) using the index chain [0, (slice(None, None, None), 0)]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
~/dev/km3io/venv/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/dev/km3io/venv/lib/python3.7/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    397                         if cls is not object \
    398                                 and callable(cls.__dict__.get('__repr__')):
--> 399                             return _repr_pprint(obj, self, cycle)
    400
    401             return _default_pprint(obj, self, cycle)

~/dev/km3io/venv/lib/python3.7/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    687     """A pprint that just redirects to the normal repr function."""
    688     # Find newlines and replace them with p.break_()
--> 689     output = repr(obj)
    690     for idx,output_line in enumerate(output.splitlines()):
    691         if idx:

~/dev/km3io/km3io/rootio.py in __repr__(self)
    198
    199     def __repr__(self):
--> 200         length = len(self)
    201         return "<{}[{}]: {} element{}>".format(self.__class__.__name__,
    202                                                self._mapper.name, length,

~/dev/km3io/km3io/rootio.py in __len__(self)
    178                 unfold_indices(
    179                     self._branch[self._keymap['id']].lazyarray(
--> 180                         basketcache=BASKET_CACHE), self._index_chain))
    181
    182     def __iter__(self):

~/dev/km3io/km3io/tools.py in unfold_indices(obj, indices)
     29     for depth, idx in enumerate(indices):
     30         try:
---> 31             obj = obj[idx]
     32         except IndexError:
     33             print(

IndexError: too many indices for array

@zaly fyi, Lodewijk is processing the thing event-by-event, so in his case best_track() will receive the tracks branch of a single event each time.

Alright, this made me realise that I did not delete the km3io.offline.best_track(..), all the functions are in tools. Sorry for that!

So I updated the master branch now, all the functions are in tools.

If we want best_track to work with a single event, I will have to update it. (I didn't think about this case (oups!), because I thought we will process all events in a file in one go).

@lnauta I am thinking about redesigning the processing of the files in your pipeline. I think it would be better to wrap the km3io readout of the whole file in one module and do the iteration per-file, instead of per-event. What do you think?

That sounds good to me. For me it the important part is that I can read it in with MONA at the end, which requires it to be a flat TTree, so as long as the conversion from some format to root in a flat TTree is possible it's good! But I will need some help with making the pipes, I'm still learning how to do this.

Sure, no problem. I can show you an example. It’s basically the creation of a single kp.Table for a file per iteration and then dumping it into the summary file via the HDFSink.

@tgal @lnauta

Just a follow up:

best_track has been adapted for the case of a tree with one event only. (it is available on the master branch, in km3io.tools)
the strategy "rec_stages" has been removed from the function best_track for consistency. Since the case for strategy='rec_stages' was more for a multiplicity study, a new function was created specifically for this purpose: get_multiplicity(tracks, rec_stages).

Please let me know if you have any requests or if you have noticed other bugs.

I really appreciate all your contributions/feedback and I hope I am providing what you guys need for your analysis :)

Thanks for the update. I use the rec_stages to make sure that all stages have been done in the reconstruction like: km3io.tools.best_track(ev.tracks, strategy="rec_stages", rec_stages=all_gandalf_stages)

Should I switch to this get_multiplicity method? I am still working on rewriting my code to be on file by file level as described above, it's not finished yet.

@lnauta I am a bit confused here, I interpret your reply in two ways:

1- If you just want to select ALL the tracks in an event, that were reconstructed following certain reconstruction stages of interest (for example [1, 2, 3, 4, 5]), then you can either,

a) use the get_multiplicty method: get_multiplicity(ev.tracks, [1, 2, 3, 4, 5]). Be careful here, the order of the reconstruction stages is important :)

fully_reco_tracks = km3io.tools.get_multiplicity(ev.tracks, [1, 2, 3, 4, 5])

b) create a mask using the function km3io.tools.mask(ev.tracks, [1, 2, 3, 4, 5]), and then apply this mask on your ev.tracks like:

my_mask = km3io.tools.mask(ev.tracks, [1, 2, 3, 4, 5])
fully_reco_tracks = ev.tracks[my_mask]

PS: get_multiplicity function already does this for you, but I am just sharing all available options in km3io.tools ;)

2- If after selecting all tracks that were reconstructed following the stages: [1, 2, 3, 4, 5], you want to select the best one among these tracks (either with strategy=default or strategy=first), then you will have to pass fully_reco_tracks to best_track as an input, and use the strategy of your interest.

my_best_tracks = km3io.tools.best_track(fully_reco_tracks, strategy="my_strategy", ..)

I hope this helps :) if you have issues applying this, please let me know ;)

Thanks @zaly ! Looks good to me :)

@lnauta here is an example how to process a file one-by-one as a whole. You can then use the best_track function to get all the tracks in one shot.

The basic idea is to create an event mask (just a boolean numpy array) which is your first cut (and probably only) cut. Then apply this cut mask to extract equal lengths of parameters like the positions of the reconstructed tracks etc. In addition to that, I'd recommennd to create an extra table where you put the header information. For this, Mona probably needs to be adjusted but you can also create columns of the table with the same values for each file.

So here is an example (I wrote that down blindly, so expect typos and whatnot, but it shows the basic workflow

import km3pipe as kp
import km3io


class FilesPump(kp.Module):
    def configure(self):
        self.filenames = self.require("filenames")
        self.opener = self.get("opener", default=open)
        self.blobs = self._blob_generator()

    def process(self, blob):
        return next(self.blobs)

    def _blob_generator(self):
        for filename in self.filenames:
            blob = kp.Blob()
            fobj = self.opener(filename)
            blob["filename"] = filename
            blob["fobj"] = fobj
            yield blob
            fobj.close()

    def __iter__(self):
        return self

    def __next__(self):
        return next(self.blobs)


class SummaryExtractor(kp.Module):
    def configure(self):
        self.all_gandalf_stages = [1, 2, 3, 4, 5]

    def process(self, blob):
        f = blob["fobj"]
        filename = blob["filename"]
        self.cprint(f"Processing {filename}")

        event_mask = f.events.n_tracks > 0

        best_tracks = km3io.tools.best_track(f.events[event_mask].tracks,
                                             strategy="rec_stages",
                                             rec_stages=self.all_gandalf_stages)

        summary_data = {}

        for q in "xyz":
            param = f"pos_{q}"
            summary_data[param] = best_tracks[param]

        blob["Summary"] = kp.Table(summary_data, h5loc="/summary")

        return blob


pipe = kp.Pipeline()
pipe.attach(FilesPump, filenames=["a.root", "b.root"], opener=km3io.OfflineFile)
pipe.attach(SummaryExtractor)
pipe.attach(HDF5Sink)
pipe.drain()

@tgal @lnauta

Please note that the strategy="rec_stages" is no longer in best_track, here is the alternative:

event_mask = f.events.n_tracks > 0

tracks_mask = km3io.tools.mask(f.events[event_mask].tracks.rec_stages, self.all_gandalf_stages)

best_tracks = km3io.tools.best_track(f.events[event_mask].tracks[tracks_mask],
                                     strategy="YOUR_STRATEGY", ...)

Thanks for the quick reply, I just encountered this.

This worked, however I encountered the following bug using "default" strategy:

Pipeline and module initialisation took 0.005s (CPU 0.003s).
2020-06-30 11:53:05 ++ SummaryExtractor: Processing data/mcfspec.gsg_elec-CC_10-100GeV.sirene.jte.jchain.jsh.aanet.12.root
Traceback (most recent call last):
  File "pipe-tam.py", line 71, in <module>
    pipe.drain()
  File "/project/antares/public_student_software/venvs/km3pipe-v9-alpha17/lib/python3.7/site-packages/thepipe/core.py", line 423, in drain
    return self._drain(cycles)
  File "/project/antares/public_student_software/venvs/km3pipe-v9-alpha17/lib/python3.7/site-packages/thepipe/core.py", line 372, in _drain
    new_blob = module(blob_to_send)
  File "/project/antares/public_student_software/venvs/km3pipe-v9-alpha17/lib/python3.7/site-packages/thepipe/core.py", line 178, in __call__
    return self.process(*args, **kwargs)
  File "pipe-tam.py", line 52, in process
    strategy="default")
  File "/data/antares/users/ljnauta/nikhef_mc/py_packages/km3io/km3io/tools.py", line 326, in best_track
    return out
UnboundLocalError: local variable 'out' referenced before assignment
Closing remaining open files:dump.h5...done

Using the "first" strategy, this did not occur and the code provided by @tgal and @zaly runs succesfully and about 10x faster:

Pipeline and module initialisation took 0.004s (CPU 0.002s).
2020-06-30 11:55:22 ++ SummaryExtractor: Processing data/mcfspec.gsg_elec-CC_10-100GeV.sirene.jte.jchain.jsh.aanet.12.root
2020-06-30 11:56:48 ++ km3pipe.io.hdf5.HDF5Sink.HDF5Sink: HDF5 file written to: dump.h5
============================================================
1 cycles drained in 86.081316s (CPU 85.065551s). Memory peak: 797.94 MB
  wall  mean: 85.133203s  medi: 85.133203s  min: 85.133203s  max: 85.133203s  std: 0.000000s
  CPU   mean: 85.017494s  medi: 85.017494s  min: 85.017494s  max: 85.017494s  std: 0.000000s

@lnauta Just a quick question, which input did you use for rec_type=??? with strategy="default" ?

I am just asking because I want to reproduce the error with your file ;)

@lnauta

I examined the file you provided previously:

import km3io as ki
from km3io.tools import best_track, mask

f = "/sps/km3net/users/lnauta/mcfspec.gsg_elec-CC_10-100GeV.sirene.jte.jchain.jsh.aanet.12.root"

events = ki.OfflineReader(f).events

full_events = events[events.n_tracks>0]

mask = mask(full_events.tracks.rec_stages, [1, 2, 3, 4, 5])

full_reco_tracks = full_events.tracks[mask]

Now this full_reco_tracks is supposed to contain all tracks that were reconstructed following the stages [1, 2, 3, 4, 5]. BUT if these stages don't even exist, you will simply have empty tracks (this is what caused best_track to raise an error, which then made your code fail..)! Here is how you can check how many tracks per event were constructed following these stages:

from km3io.tools import count_nested

count_nested(full_reco_tracks.rec_stages, axis=1)
>>> <Array [0, 0, 0, 0, 0, 0, ... 0, 0, 0, 0, 0, 0] type='3248 * int64'>

any(count_nested(full_reco_tracks.rec_stages, axis=1 != 0)
>>> False

Now you see that there isn't ANY track that was reconstructed following the stages [1, 2, 3, 4, 5]!

After looking a little bit in your file, I see that the reconstruction stages of most tracks is actually [1, 2, 5, 3, 5, 4] (just a reminder here: the order is really important). So now let's try with these reco stages:

import km3io as ki
from km3io.tools import best_track, mask, count_nested


f = "/sps/km3net/users/lnauta/mcfspec.gsg_elec-CC_10-100GeV.sirene.jte.jchain.jsh.aanet.12.root"

events = ki.OfflineReader(f).events

full_events = events[events.n_tracks>0]

mask = mask(full_events.tracks.rec_stages, [1, 2, 5, 3, 5, 4])

full_reco_tracks = full_events.tracks[mask]

count_nested(full_reco_tracks.rec_stages, axis=1)

>>> <Array [1, 1, 1, 1, 1, 1, ... 1, 1, 1, 1, 1, 1] type='3248 * int64'>

# So now we are sure we are not providing empty tracks to `best_track`

best_tracks = best_track(full_reco_tracks, strategy="default", rec_type="JPP_RECONSTRUCTION_TYPE")

>>> <OfflineBranch[tracks]: 3248 elements>

and just a sanity check

best_tracks.rec_stages[0]
>>> <Array [1, 2, 5, 3, 5, 4] type='6 * int64'>

Concerning the performance, yes I am not surprised at all that strategy="first" is much faster than strategy="default"`, in the first one, we only select the very first track per event, but in the second one, we apply multiple masks (that are not trivial to create), we also play a little bit with data structures (switching from awkward data structures to awkward1.Array when necessary for nested data etc etc) ;)

Please let me know if you have other questions :)

Great detailed explanation!

I was not complete in my reply, sorry for that: I had a private communication with tgal and there my old code which is not included here took ~900s to run. With the code provided in this issue it runs in 85s, which is the 10x speedup mentioned.

For the bug I posted, I used no rec_type, so that defaults to None and combined with the strategy="default" it crashes and gives the error mentioned. So there is no measured runtime in this case. And I use the gandalf stages [1, 2, 5, 3, 5, 4].

Oh I see!

rec_type must always be specified when strategy="default" is used ;)

If you want to know which rec_type are available, you can check using km3io.tools.rec_types() :)

Alright, so now the only missing thing is to be able to extract the best track for a single event (plus some improved error messages, see second paste):

In [15]: import km3io                                                                                                                                             

In [16]: r = km3io.OfflineReader("/sps/km3net/repo/mc/atm_neutrino/KM3NeT_-00000001_20171212/v5.1/reco/mcv5.1.genhen_anumuNC.sirene.jte.jchain.aashower.7.root")  

In [17]: tracks = r.events[0].tracks                                                                                                                              

In [18]: km3io.tools.best_track(r.events.tracks[0], rec_type="AANET_RECONSTRUCTION_TYPE")                                                                         
Out[18]: <OfflineBranch[tracks]: 1 element>

In [19]: import km3io                                                                                                                                             

In [20]: r = km3io.OfflineReader("/sps/km3net/repo/mc/atm_neutrino/KM3NeT_-00000001_20171212/v5.1/reco/mcv5.1.genhen_anumuNC.sirene.jte.jchain.aashower.7.root")  

In [21]: bt = km3io.tools.best_track(r.events.tracks[0], rec_type="AANET_RECONSTRUCTION_TYPE")                                                                    

In [22]: bt.dir_z                                                                                                                                                 
IndexError while accessing an item from '<ChunkedArray [[-0.10754227138649619 -0.10754227138649619 -0.10754227138649619 ... -0.6308105310435222 -0.17766528872119944 -0.35759773527112715] [-0.30371903873031214 -0.30371903873031214 -0.30371903873031214 ... -0.15904016029640894 0.1351056571910767 0.13460939165148503] [0.0 6.123233995736766e-17 0.9095965159498551 ... -0.3708422859253514 -0.04866654358741362 -0.04349134599131605] ... [0.5465738369509385 0.5465738369509385 0.5465738369509385 ... 0.8999494563076141 0.8999822351372327 0.899991871118423] [0.5046321338332148 0.5046321338332148 0.5046321338332148 ... -0.10543751926994467 -0.10544439906788888 -0.10509217036083282] [0.5438504854415898 0.5438504854415898 0.5438504854415898 ... 0.07688482645387848 0.07698433982082072 0.07689795397716859]] at 0x7f6d93c83e10>' at depth 2 ([False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False  True False False False False False False False False
 False False False False]) using the index chain [0, array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True]), array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False]), 0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-22-6292038f82cc> in <module>
----> 1 bt.dir_z

~/dev/km3pipe/venv/lib/python3.7/site-packages/km3io/rootio.py in __getattribute__(self, attr)
    123 
    124         if attr in self._keymap.keys():  # intercept branch key lookups
--> 125             return self.__getkey__(attr)
    126 
    127         return object.__getattribute__(self, attr)

~/dev/km3pipe/venv/lib/python3.7/site-packages/km3io/rootio.py in __getkey__(self, key)
    152                 self._awkward_cache[cache_key] = ak.from_iter(out)
    153             out = self._awkward_cache[cache_key]
--> 154         return unfold_indices(out, self._index_chain)
    155 
    156     def __getitem__(self, item):

~/dev/km3pipe/venv/lib/python3.7/site-packages/km3io/tools.py in unfold_indices(obj, indices)
     29     for depth, idx in enumerate(indices):
     30         try:
---> 31             obj = obj[idx]
     32         except IndexError:
     33             print(

IndexError: boolean index did not match indexed array along dimension 0; dimension is 14 but corresponding boolean dimension is 52

here is an error which is not so user-friendly:

In [23]: bt = km3io.tools.best_track(r.events.tracks[0])                                                                                                          
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-23-84676f55b048> in <module>
----> 1 bt = km3io.tools.best_track(r.events.tracks[0])

~/dev/km3pipe/venv/lib/python3.7/site-packages/km3io/tools.py in best_track(tracks, strategy, rec_type)
    324                 tracks.lik == ak1.max(tracks.lik, axis=1)][:, 0]
    325 
--> 326     return out
    327 
    328 

UnboundLocalError: local variable 'out' referenced before assignment

@tgal

let me take a look!

mentioned in issue #62 (closed)

@tgal

Alright, so now the only missing thing is to be able to extract the best track for a single event

I think best_track fails here because track.usr raises an error with this file, it is not read by km3io. I opened #62 (closed) to solve it because I don't think it's related to best_track.

here is an error which is not so user-friendly:

Indeed, this is not user-friendly! will be corrected! Thanks for pointing it out ;)

@tgal

best_track works now for both one event and multiple events.

After today's discussion, I added the right mask to select the longest reconstructed tracks as well.


In [2]: import km3io as ki

In [3]: from km3net_testdata import data_path

In [5]: f = ki.OfflineReader(data_path("offline/km3net_offline.root"))

In [6]: trk = f.events[0].tracks

In [7]: trks = f.events.tracks

In [20]: bt = ki.tools.best_track(trk, strategy="default", rec_type="JPP_RECONSTRUCTION_TYPE")

In [21]: bt.dir_z
Out[21]: array([-0.87288522])

In [22]: bt.dir_x
Out[22]: array([0.03687111])

In [23]: bt.dir_y
Out[23]: array([-0.48653048])

In [24]: bt.pos_x
Out[24]: array([445.835396])

In [25]: bt = ki.tools.best_track(trks, strategy="default", rec_type="JPP_RECONSTRUCTION_TYPE")

In [26]: bt.dir_z
Out[26]: <Array [[-0.873], [-0.835, ... [-0.74]] type='10 * option[var * float64]'>

In [27]: bt.dir_x
Out[27]: <Array [[0.0369], [-0.175, ... 0.124], [0.172]] type='10 * option[var * float64]'>

In [28]: bt.dir_y
Out[28]: <Array [[-0.487], [0.521], ... 0.168], [0.651]] type='10 * option[var * float64]'>

and


In [6]: bt = ki.tools.best_track(trk)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-7d12b2b91c1e> in <module>
----> 1 bt = ki.tools.best_track(trk)

~/km3net/km3net/km3io/km3io/tools.py in best_track(tracks, strategy, rec_type)
    317 
    318     if strategy == "default" and rec_type is None:
--> 319         raise KeyError(
    320             "rec_type must be provided when the default strategy is used.")
    321 

KeyError: 'rec_type must be provided when the default strategy is used.'

I hope this helps, let me know if anything else is needed ;)

I like it! But I wouldn't raise a KeyError but a ValueError (https://docs.python.org/3/library/exceptions.html#ValueError). KeyError is meant for accessing non-existent keys of a dict-like object.

I quickly fixed it with one go including a slight restructure of the tests (divided up into smaller tests): 43c8d08c

Let's close this...

closed

Best track?

Designs

Child items 0

Activity