Best track?

changed milestone to %v1.0

mentioned in issue #38 (closed)

Seems fine to me, that's the only one that I need ;) Do you want to include rec_stages slicing in strategy as well @tgal ?

Yeah well, you and @zaly are the experts on these best track strategies so far, so please feel free to add your proposals ;)

If mc_tracks would contain proper non-empty rec_stages entries, it would be quite straightforward. Otherwise hmm, maybe some index matching would work.

Yes, I suggest we add some cleaning in the tracks (trivial one: remove non-reconstructed tracks). I think it is also good to make sure that the tracks selected are fully reconstructed (following whatever reconstruction stages the user wants)... But that is under discussion... at the moment, we still need to explore the best reco strategies, so please @pkalaczynski do not expect a fast implementation as we would like to make this best reco method as general as possible (for the time being, since km3net has not yet agreed on a common definition of best reco track!).

No problem, just monitoring the progress ;)

I agree. Something like "filter all tracks with stages [x, y, z]" would of course still be needed, I guess? ;)

Yep, exactly. We want it to be user-friendly, but also let people do complex stuff as well. And once again as a reminder, this should be applicable not just on tracks, but also their respective mc_tracks, mc_hits, hits etc. ;)

A relevant issue is here: https://git.km3net.de/working_groups/comp_soft/issues/38

Note that we should keep in mind that we can have a mixture of reconstructed tracks inside the same vector from different reconstruction algorithms. At the moment we did not mix different reconstruction algorithms in the same processing pipeline, but this might (and will) be the case in future.

This means that there is a requirement to also always check the rec_type of each entry.

The data structure (flat vector) which is provided by the aanet data format is clearly not ideal for this.

I would like to add something, if I may. Maybe this is already possible with basic numpy, but: At times one might need a mask for the selected "best tracks" as much as the best tracks and their information. My use case for example: I have a run processed with the standard reco and one with an alternative one (deep learning in this case). In order to compare the same events to one another I have to extract a "is reconstructed" mask from the standard reco first.

Is there an easy way to do it?

If a mask is also returned by such a "best track" function (or maybe even an own function) it could be used by the user for the mc_tracks and hits as well.

I would like to add something, if I may

Yes please! Always! :)

My use case for example: I have a run processed with the standard reco and one with an alternative one (deep learning in this case). In order to compare the same events to one another I have to extract a "is reconstructed" mask from the standard reco first.

This boils down to our overall goal: providing a natural way of slicing/indexing the KM3NeT datastructures in a whole, just as they were numpy arrays.

Is there an easy way to do it?

Kind of... with a few caveats We are working on this very hard, every day and it's a tough task to implement it so that it's both user-friendly and has maximum performance.

We face multiple problems, but one of the biggest issues is of course that we are not dealing with homogeneous data like 2D matrices (with named columns) but variable lengths of arrays inside variable lengths of arrays inside of variable length arrays.

Anyways, we are on the right track and working on the branch https://git.km3net.de/km3py/km3io/-/tree/refactor-offline which you can try as you want (pip install git+https://git.km3net.de/km3py/km3io.git@refactor-offline).

Here are some examples of the slicing/indexing/masking feature:

>>> import km3io
>>> km3io.version
'0.9.1.dev72+g5dee01f'
>>> f = km3io.OfflineReader("datav5.40.jorcarec.aanet.00006585.root")
>>> tracks = f.events.tracks
>>> tracks
<Branch[tracks]: 141098 elements>
>>> f.events.n_tracks
array([128, 128, 128, ..., 128, 128, 120])
>>> mask = f.events.n_tracks > 0  # boolean mask for events with at least 1 track
>>> mask
<ChunkedArray [True True True True True True True ...] at 0x7facfe227c10>
>>> pos_x_of_every_first_track = tracks[mask].pos_x[:, 0]
>>> pos_x_of_every_first_track
<ChunkedArray [29.758563081224473 44.87933353564635 63.4742112731849 10.348203834859914 -10.186394233913711 1.574926331368094 43.82500580340671 ...] at 0x7facfc1a6d00>
>>> pos_x_of_every_first_track.shape
(141053,)
>>> selected_tracks = tracks[mask]
>>> selected_tracks
<Branch[tracks]: 141053 elements>
>>> selected_hits = f.events.hits[mask]
>>> selected_hits
<Branch[hits]: 141053 elements>

As you can see, the slicing is more natural and can be chained, although we still need to fix issues so don't use this in production yet.

The problem with the best_track() mask is, that it needs to be a two-dimensional mask (event_idx+track_idx) which is not very well handled by these underlying ChunkedArrays in uproot. I opened an issue end of January (https://github.com/scikit-hep/awkward-array/issues/229) and the answer from Jim himself (the creator of uproot) is:

I made a wrong design choice by introducing ObjectArray: it delays evaluation of the (expensive) unsplit-list deserialization by introducing an opaque type that doesn't work like any other data. In Awkward 1.0, the unsplit-list deserialization will be over an order of magnitude faster (no longer the very noticeable performance hit) as shown on page 14/16 of this talk, and the Methods and ObjectArray mechanisms will be replaced by a single non-opaque mechanism as shown in this section of the Coffea demo.

So, not a bug, but exactly the sort of confusing feature that will be eliminated by transitioning to Awkward 1.0.

Recently I also experimented converting these ChunkedArrays (which are ObjectArrays behind the scenes and not "well-sliceable") to AwkwardArrays using the not-yet-released awkward1 library (which is going to replace awkward soon) and there the slicing works as expected. However, we will deal with the issue that those arrays are not lazy, which means that they will eventually need to load all the data in memory. This might not be a problem for high(est) level data though. You can see the issue here: https://github.com/scikit-hep/uproot/issues/461

All in all I can say that we are close to a nice API, but it's still a lot of work. @zaly is also working hard on the test part, to boil the hell out of it by trying to figure out all use-cases which cause problems with our new approach to the indexing/masking/slicing.

Meanwhile, let us know what you think and if you have any ideas, don't hesitate to create issues or comment wherever you like! I am confident that the numpy-style high-level analysis is the way to go.

I'd love to see pandas-like operations and if we had 2D shaped data (matrices) it would be my first choice, as it is for KM3Pipe with the thin kp.Table-class for event-by-event readout, since there you can shape every branch into a 2D columnar data independently (one table for Hits, one for Tracks etc.), but in case of km3io the user wants to deal with the whole file at once, which makes it awkward.

Thanks for your detailed answer.

Very much appreciate all of your works!

Your example already looks pretty much like what I am dreaming about :)

Just a tiny update: @zaly is working on the best_track implementation and meanwhile the slicing has been implemented to work around with the uproot3 limitations using the new awkward1 library.

Feel free to test it!

As always: the latest master is updated in Lyon on every git push (module load python/3.7.5).

changed the description

Hi, does a "best_track_selector" exists by now? I could really use one :)

Not sure if a dedicated function is available yet, but if you want to just have tracks with all reco stages and with their best fits, you can typically just use masks like in this example:

for f in FNAMES:
    print("Opening ", f)
    r = ki.OfflineReader(f)
    if len(r.events)>1: # only non-empty files
        premask = [rstgs.size>1 for rstgs in r.events.tracks.id]
        mask = [[1, 2, 5, 3, 5, 4] == list(s[0]) for s in r.events.tracks.rec_stages[premask]]
        VARS[0].extend([sum(tots) for tots in r.events.hits.tot[premask][mask]])

Not sure if that's the best way, but it works and it's rather fast.

I don't know about the current status, @zaly do you have some news on this?

Hey, sorry I completely forgot answering here.

@lmaderer @pkalaczynski @dguderian I implemented a prototype some time ago but I have not tested it with heavy files (so I have no idea about the performance). Please go ahead and try it out and let me know your feedback, I will really appreciate it :)

This is how to proceed:

Here is some documentation:

best_track(tracks, strategy="first", rec_type=None, rec_stages=None):
    """best track selection based on different strategies

    Parameters
    ----------
    tracks : class km3io.offline.OfflineBranch
        the tracks branch.
    strategy : str
        the trategy desired to select the best tracks. It is either: 
            - "first" : to select the first tracks.
            - "rec_stages": to select all tracks with a specific reconstruction stages.
            This requires the user to specify the reconstruction stages in rec_stages input.
            Example: best_track(my_tracks, strategy="rec_stages", rec_stages=[1, 2, 3, 4, 5]).
            - "default": to select the best tracks (the first ones) corresponding to a specific
            reconstruction algorithm (JGandalf, Jshowerfit, etc). This requires rec_type input.
            Example: best_track(my_tracks, strategy="default", rec_type="JPP_RECONSTRUCTION_TYPE").
    rec_type : str, optional
        reconstruction type as defined in the official KM3NeT-Dataformat.
    rec_stages : list, optional
        list of the reconstruction stages. Example: [1, 2, 3, 4, 5]

    Returns
    -------
    class km3io.offline.OfflineBranch
        tracks class with the desired "best tracks" selection.
    """

and here is how to proceed:


import km3io
from km3io.tools import best_track

tracks = km3io.OfflineReader(my_file).events.tracks

# ""
best_reco = best_track(tracks, strategy="your_strategy", ...)

in the case of strategy = "default", to get the reconstruction type for the input rec_type, you can check:

from km3io.tools import rec_types

rec_types()

this should return all the names of the reconstruction types as defined in the official KM3NeT-Dataformat.

This is not necessarily the final implementation, so any feedback from you is very appreciated.

Thanks ;)

Best track?

Designs

Child items ...

Activity