I was wondering if there is a faster way to get the number of hit DOMs among the triggered hits of an event. I used the following because I could not come up with a way to evaluate this in array from.
hits = f.events.hitshit_doms_cut = 5no_hit_doms = np.array([len(np.unique(evt.dom_id[evt.trig!=0])) for evt in hits])hit_event_selection_mask = no_hit_doms >= hit_doms_cut
So, step by step this is: get only the triggered hits, get the dom_id's of them, only use unique entries among them, count the length of these unique entries. Then loop over every event.
Any suggestions?
p.s. Where can I add tags in this? Would be a "discussion" I guess.
Designs
Child items
...
Show closed items
Linked items
0
Link issues together to show that they're related.
Learn more.
yeah. not super large though:
/sps/km3net/users/guderian/track_quality_output/reconstructed/string/ARCA_42/DU1/stretching/real/stretching_new/-0.03_run8422.root.aanet.root
Alright, this will be a bit technical, but since there is no existing bridge to np.unique which is performant, I decided to write my own implementation in a Numba JITted function and utilise the Awkward1 package which works with Numba.
This is currently your implementation using the Python list comprehension and the file you provided. I takes 14 seconds on my super desktop:
And here is how to make it ~18x faster using Awkward1 and Numba:
importkm3ioimportawkward1asakimportnumpyasnpimportnumbaasnbfilename="/home/tgal/data/tmp/-0.03_run8422.root.aanet.root"f=km3io.OfflineReader(filename)@nb.jit(nopython=True)defunique(array,dtype=np.int64):"""Return the unique elements of an array with a given dtype"""n=len(array)out=np.empty(n,dtype)last=array[0]entry_idx=0out[entry_idx]=lastforiinrange(1,n):current=array[i]ifcurrent==last:# shortcut for sorted arrayscontinuealready_present=Falseforjinrange(i):ifcurrent==out[j]:already_present=Truebreakifnotalready_present:entry_idx+=1out[entry_idx]=currentlast=currentreturnout[:entry_idx+1]@nb.jit(nopython=True)defuniquecount(array,dtype=np.int64):"""Count the number of unique elements in a jagged Awkward1 array."""out=np.empty(len(array),dtype)foriinrange(len(array)):out[i]=len(unique(array[i]))returnout
Notice that there is no dynamic dtype determination support of Awkward1 arrays inside a JITted function, so you have to pass in the dtype manually. You can pass dtype=np.int16 to the uniquecount() since we do not have more than 32k DOMs. On the other hand, it will not have any significant impact on the performance, so you can stick to the default
I agree with adding this to km3io (if it's not added to awkward1), I personally use this a lot in my analysis and I can only imagine that it's useful for others as well.
However, I would modify the script so that it returns an awkward1 Array for consistency?
I can add it to my to do list of offline functions if you don't have time.
In general it is good to provide the most basic structure you can, for a low level algorithm. So I'd propose to stick to a numpy array, the interface is the same, so the user can work with it even if they expect an awkward one. Also an awkward1.Array doesn't really make sense here because the output is 1D for both functions, so it's unnecessary overhead.
However, if we want to extend the unique function to work for jagged arrays, of course awkward1.Array is the right type.
Alright, I added km3io.tools.unique and km3io.tools.uniquecount. Implementing it into awkward1.Array is a large undertaking since one also needs to implement it into their C++ level code including proper bindings. The numba implementation is simple, robust and as fast as the C++ code, so let's stick to this.