We all agree that there is no single "best track" algorithm as every reconstruction method can spit out it's own vector of tracks with no constraints on which one is the best track. More over, the single vector can hold different types of reconstruction algorithms by design.
Consider a track reconstruction which always(!) generates a best up-going and a best down-going track, in this particular order. In this case you should never take always the first, but check the first two and decide on additional parameters etc. on top of that, it might be that tracks from other reconstruction algorithms are present, so it's not straight forward.
So we decided to provide a single function with the signature e.g. best_track(tracks, strategy). Both are mandatory and strategy is the algorithm selector.
A very first implementation is best_track(tracks, strategy="first") which will take the first track in each event. This is currently almost always the best strategy.
The implementation is with the upcoming slicing magic something like this:
If mc_tracks would contain proper non-empty rec_stages entries, it would be quite straightforward. Otherwise hmm, maybe some index matching would work.
Yes, I suggest we add some cleaning in the tracks (trivial one: remove non-reconstructed tracks). I think it is also good to make sure that the tracks selected are fully reconstructed (following whatever reconstruction stages the user wants)... But that is under discussion... at the moment, we still need to explore the best reco strategies, so please @pkalaczynski do not expect a fast implementation as we would like to make this best reco method as general as possible (for the time being, since km3net has not yet agreed on a common definition of best reco track!).
Yep, exactly. We want it to be user-friendly, but also let people do complex stuff as well. And once again as a reminder, this should be applicable not just on tracks, but also their respective mc_tracks, mc_hits, hits etc. ;)
Note that we should keep in mind that we can have a mixture of reconstructed tracks inside the same vector from different reconstruction algorithms. At the moment we did not mix different reconstruction algorithms in the same processing pipeline, but this might (and will) be the case in future.
This means that there is a requirement to also always check the rec_type of each entry.
The data structure (flat vector) which is provided by the aanet data format is clearly not ideal for this.
I would like to add something, if I may. Maybe this is already possible with basic numpy, but: At times one might need a mask for the selected "best tracks" as much as the best tracks and their information. My use case for example:
I have a run processed with the standard reco and one with an alternative one (deep learning in this case). In order to compare the same events to one another I have to extract a "is reconstructed" mask from the standard reco first.
Is there an easy way to do it?
If a mask is also returned by such a "best track" function (or maybe even an own function) it could be used by the user for the mc_tracks and hits as well.
My use case for example: I have a run processed with the standard reco and one with an alternative one (deep learning in this case). In order to compare the same events to one another I have to extract a "is reconstructed" mask from the standard reco first.
This boils down to our overall goal: providing a natural way of slicing/indexing the KM3NeT datastructures in a whole, just as they were numpy arrays.
Is there an easy way to do it?
Kind of... with a few caveats We are working on this very hard, every day and it's a tough task to implement it so that it's both user-friendly and has maximum performance.
We face multiple problems, but one of the biggest issues is of course that we are not dealing with homogeneous data like 2D matrices (with named columns) but variable lengths of arrays inside variable lengths of arrays inside of variable length arrays.
Here are some examples of the slicing/indexing/masking feature:
>>>importkm3io>>>km3io.version'0.9.1.dev72+g5dee01f'>>>f=km3io.OfflineReader("datav5.40.jorcarec.aanet.00006585.root")>>>tracks=f.events.tracks>>>tracks<Branch[tracks]:141098elements>>>>f.events.n_tracksarray([128,128,128,...,128,128,120])>>>mask=f.events.n_tracks>0# boolean mask for events with at least 1 track>>>mask<ChunkedArray[TrueTrueTrueTrueTrueTrueTrue...]at0x7facfe227c10>>>>pos_x_of_every_first_track=tracks[mask].pos_x[:,0]>>>pos_x_of_every_first_track<ChunkedArray[29.75856308122447344.8793335356463563.474211273184910.348203834859914-10.1863942339137111.57492633136809443.82500580340671...]at0x7facfc1a6d00>>>>pos_x_of_every_first_track.shape(141053,)>>>selected_tracks=tracks[mask]>>>selected_tracks<Branch[tracks]:141053elements>>>>selected_hits=f.events.hits[mask]>>>selected_hits<Branch[hits]:141053elements>
As you can see, the slicing is more natural and can be chained, although we still need to fix issues so don't use this in production yet.
The problem with the best_track() mask is, that it needs to be a two-dimensional mask (event_idx+track_idx) which is not very well handled by these underlying ChunkedArrays in uproot. I opened an issue end of January (https://github.com/scikit-hep/awkward-array/issues/229) and the answer from Jim himself (the creator of uproot) is:
I made a wrong design choice by introducing ObjectArray: it delays evaluation of the (expensive) unsplit-list deserialization by introducing an opaque type that doesn't work like any other data. In Awkward 1.0, the unsplit-list deserialization will be over an order of magnitude faster (no longer the very noticeable performance hit) as shown on page 14/16 of this talk, and the Methods and ObjectArray mechanisms will be replaced by a single non-opaque mechanism as shown in this section of the Coffea demo.
So, not a bug, but exactly the sort of confusing feature that will be eliminated by transitioning to Awkward 1.0.
Recently I also experimented converting these ChunkedArrays (which are ObjectArrays behind the scenes and not "well-sliceable") to AwkwardArrays using the not-yet-released awkward1 library (which is going to replace awkward soon) and there the slicing works as expected. However, we will deal with the issue that those arrays are not lazy, which means that they will eventually need to load all the data in memory. This might not be a problem for high(est) level data though. You can see the issue here: https://github.com/scikit-hep/uproot/issues/461
All in all I can say that we are close to a nice API, but it's still a lot of work. @zaly is also working hard on the test part, to boil the hell out of it by trying to figure out all use-cases which cause problems with our new approach to the indexing/masking/slicing.
Meanwhile, let us know what you think and if you have any ideas, don't hesitate to create issues or comment wherever you like! I am confident that the numpy-style high-level analysis is the way to go.
I'd love to see pandas-like operations and if we had 2D shaped data (matrices) it would be my first choice, as it is for KM3Pipe with the thin kp.Table-class for event-by-event readout, since there you can shape every branch into a 2D columnar data independently (one table for Hits, one for Tracks etc.), but in case of km3io the user wants to deal with the whole file at once, which makes it awkward.
Just a tiny update: @zaly is working on the best_track implementation and meanwhile the slicing has been implemented to work around with the uproot3 limitations using the new awkward1 library.
Feel free to test it!
As always: the latest master is updated in Lyon on every git push (module load python/3.7.5).
Not sure if a dedicated function is available yet, but if you want to just have tracks with all reco stages and with their best fits, you can typically just use masks like in this example:
for f in FNAMES: print("Opening ", f) r = ki.OfflineReader(f) if len(r.events)>1: # only non-empty files premask = [rstgs.size>1 for rstgs in r.events.tracks.id] mask = [[1, 2, 5, 3, 5, 4] == list(s[0]) for s in r.events.tracks.rec_stages[premask]] VARS[0].extend([sum(tots) for tots in r.events.hits.tot[premask][mask]])
Not sure if that's the best way, but it works and it's rather fast.
@lmaderer@pkalaczynski@dguderian
I implemented a prototype some time ago but I have not tested it with heavy files (so I have no idea about the performance). Please go ahead and try it out and let me know your feedback, I will really appreciate it :)
This is how to proceed:
Here is some documentation:
best_track(tracks,strategy="first",rec_type=None,rec_stages=None):"""best track selection based on different strategies Parameters ---------- tracks : class km3io.offline.OfflineBranch the tracks branch. strategy : str the trategy desired to select the best tracks. It is either: - "first" : to select the first tracks. - "rec_stages": to select all tracks with a specific reconstruction stages. This requires the user to specify the reconstruction stages in rec_stages input. Example: best_track(my_tracks, strategy="rec_stages", rec_stages=[1, 2, 3, 4, 5]). - "default": to select the best tracks (the first ones) corresponding to a specific reconstruction algorithm (JGandalf, Jshowerfit, etc). This requires rec_type input. Example: best_track(my_tracks, strategy="default", rec_type="JPP_RECONSTRUCTION_TYPE"). rec_type : str, optional reconstruction type as defined in the official KM3NeT-Dataformat. rec_stages : list, optional list of the reconstruction stages. Example: [1, 2, 3, 4, 5] Returns ------- class km3io.offline.OfflineBranch tracks class with the desired "best tracks" selection."""