Best way to get number of hit DOMs

added discussion performance labels

First, what kind of file are you looking at? I am wondering since triggered hits are usually part of the online data and can be read separately.

Oh sorry, it is offline data: a reconstructed event file. This is supposed to be a quality cut on the number of hit DOMs for an analysis.

Do you have a path?

I could not find a largish offline file on my computer with actual triggered hits in it.

yeah. not super large though: /sps/km3net/users/guderian/track_quality_output/reconstructed/string/ARCA_42/DU1/stretching/real/stretching_new/-0.03_run8422.root.aanet.root

Alright, this will be a bit technical, but since there is no existing bridge to np.unique which is performant, I decided to write my own implementation in a Numba JITted function and utilise the Awkward1 package which works with Numba.

This is your current implementation, using a Python list comprehension on the file you provided. It takes 14 seconds on my super desktop:

import km3io
import numpy as np

filename = "/home/tgal/data/tmp/-0.03_run8422.root.aanet.root"
f = km3io.OfflineReader(filename)

%%time
hits = f.events.hits
hit_doms_cut = 5
no_hit_doms = np.array([len(np.unique(evt.dom_id[evt.trig!=0])) for evt in hits])
print(no_hit_doms)
hit_event_selection_mask = no_hit_doms >= hit_doms_cut 
print(hit_event_selection_mask)

Output:

[3 3 7 ... 3 4 5]
[False False  True ... False False  True]
CPU times: user 14.1 s, sys: 29 µs, total: 14.1 s
Wall time: 14.1 s
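For readers without access to the file, here is a minimal sketch of the same per-event logic on toy data. The `events` list and the helper name `count_triggered_doms` are assumptions made up for illustration; they are not part of km3io, and the values are not real detector data.

```python
import numpy as np

# Hypothetical toy events: one (dom_id, trig) pair of arrays per event.
events = [
    (np.array([10, 10, 11, 12]), np.array([1, 0, 4, 0])),
    (np.array([20, 21, 21, 22, 23]), np.array([2, 2, 2, 1, 16])),
    (np.array([30, 31]), np.array([0, 0])),
]


def count_triggered_doms(dom_id, trig):
    """Number of distinct DOM IDs among hits with a non-zero trigger mask."""
    return len(np.unique(dom_id[trig != 0]))


# Same structure as the list comprehension above: one count per event,
# followed by a boolean selection mask.
hit_doms_cut = 2
n_hit_doms = np.array([count_triggered_doms(d, t) for d, t in events])
mask = n_hit_doms >= hit_doms_cut
print(n_hit_doms)
print(mask)
```

The third toy event has no triggered hit at all, so its count is zero. Any faster replacement should handle that case as well.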

And here is how to make it ~18x faster using Awkward1 and Numba:

import km3io
import awkward1 as ak
import numpy as np
import numba as nb

filename = "/home/tgal/data/tmp/-0.03_run8422.root.aanet.root"
f = km3io.OfflineReader(filename)

@nb.jit(nopython=True)
def unique(array, dtype=np.int64):
    """Return the unique elements of an array with a given dtype"""
    n = len(array)
    out = np.empty(n, dtype)
    if n == 0:  # e.g. an event without any triggered hit
        return out
    last = array[0]
    entry_idx = 0
    out[entry_idx] = last
    for i in range(1, n):
        current = array[i]
        if current == last:  # shortcut for sorted arrays
            continue
        already_present = False
        # Only compare against the entries written so far; the rest of
        # `out` is uninitialised memory from np.empty.
        for j in range(entry_idx + 1):
            if current == out[j]:
                already_present = True
                break
        if not already_present:
            entry_idx += 1
            out[entry_idx] = current
        last = current
    return out[:entry_idx+1]


@nb.jit(nopython=True)
def uniquecount(array, dtype=np.int64):
    """Count the number of unique elements in a jagged Awkward1 array."""
    out = np.empty(len(array), dtype)
    for i in range(len(array)):
        out[i] = len(unique(array[i]))
    return out

Note that Awkward1 arrays do not support dynamic dtype determination inside a JITted function, so you have to pass in the dtype manually. You can pass dtype=np.int16 to uniquecount(), since we do not have more than 32k DOMs. On the other hand, it will not have any significant impact on the performance, so you can stick to the default.
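As a quick sanity check of the int16 remark, the sketch below looks at the range of np.int16 and at the memory footprint of a small array of counts. The `counts` values are toy numbers chosen for illustration.

```python
import numpy as np

# Largest value representable by a signed 16-bit integer.
max_int16 = np.iinfo(np.int16).max

# Toy per-event DOM counts (illustrative values only).
counts = np.array([3, 3, 7, 3, 4, 5], dtype=np.int64)
as_int16 = counts.astype(np.int16)

# The values survive the conversion unchanged; only the storage size differs.
same_values = np.array_equal(counts, as_int16)
print(max_int16, same_values, counts.nbytes, as_int16.nbytes)
```

The smaller dtype only reduces the size of the output array. It does not change how much work the inner loops do.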

Here is the performance of the implementation:

%%timeit
hits = f.events.hits
hit_doms_cut = 5
dom_ids = ak.Array(f.events.hits.dom_id)
trig_dom_ids = dom_ids[f.events.hits.trig != 0]
unique_dom_ids = uniquecount(trig_dom_ids)
print(unique_dom_ids)
hit_event_selection_mask = unique_dom_ids >= hit_doms_cut
print(hit_event_selection_mask)

yields:

[3 3 7 ... 3 4 5]
[False False  True ... False False  True]
803 ms ± 5.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Hope this helps.

Edit: all this is just an example and needs further testing of course...
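One possible way to do that testing is to compare the scan against np.unique on random input. The sketch below uses a plain-Python port of the first-occurrence scan, so it runs without Numba. The name `unique_reference` is hypothetical, and the port drops the shortcut for repeated neighbours, so it illustrates the idea and is not the JITted function itself.

```python
import numpy as np


def unique_reference(array, dtype=np.int64):
    """Plain-Python port of the first-occurrence unique scan (hypothetical helper)."""
    n = len(array)
    out = np.empty(n, dtype)
    n_unique = 0
    for i in range(n):
        current = array[i]
        seen = False
        # Only look at the entries that have already been written to `out`.
        for j in range(n_unique):
            if current == out[j]:
                seen = True
                break
        if not seen:
            out[n_unique] = current
            n_unique += 1
    return out[:n_unique]


rng = np.random.default_rng(42)
n_checked = 0
for _ in range(200):
    size = rng.integers(0, 30)  # includes empty arrays
    sample = rng.integers(0, 10, size=size)
    result = unique_reference(sample)
    # np.unique returns sorted values, the scan keeps first-occurrence order,
    # so the comparison is done on the sorted result.
    assert np.array_equal(np.sort(result), np.unique(sample))
    n_checked += 1
print(n_checked)
```

The same loop could be pointed at the JITted unique() to check it against np.unique, including the empty-array case.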

added awaiting feedback label

Awesome!

I will try and get back to you.

Thank you very much :)

It does give the expected result.

I will be using it more so in case something comes up I will tell you.

Do you think this can be somehow useful as an implementation in km3io or km3pipe?

Glad that helped!

Hmm good question, I think the best place is awkward1 itself. I don't know if I have time to make a proper pull request though.

Meanwhile I guess we can include it in km3io. I'll leave this open until we have decided.

removed awaiting feedback label

added enhancement label

@tgal

I agree with adding this to km3io (if it's not added to awkward1). I personally use this a lot in my analysis, and I can only imagine that it's useful for others as well.

However, I would modify the script so that it returns an awkward1 Array for consistency?

I can add it to my to-do list of offline functions if you don't have time.

In general, a low-level algorithm should return the most basic structure it can. So I'd propose to stick to a numpy array: the interface is the same, so the user can work with it even if they expect an awkward one. Also, an awkward1.Array doesn't really make sense here because the output is 1D for both functions, so it's unnecessary overhead.

However, if we want to extend the unique function to work for jagged arrays, of course awkward1.Array is the right type.
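To make the jagged case concrete, here is a rough sketch of what such an extension could look like. It assumes the jagged array is given as a flat content array plus an offsets array. The function name `unique_jagged` and its signature are hypothetical, and it uses np.unique per sublist for brevity instead of the Numba scan.

```python
import numpy as np


def unique_jagged(content, offsets):
    """Per-sublist unique values of a jagged array (hypothetical helper).

    Sublist i is content[offsets[i]:offsets[i + 1]]. The result is returned
    in the same layout as a (new_content, new_offsets) pair.
    """
    pieces = []
    new_offsets = np.zeros(len(offsets), dtype=np.int64)
    for i in range(len(offsets) - 1):
        sub = np.unique(content[offsets[i]:offsets[i + 1]])  # sorted unique values
        pieces.append(sub)
        new_offsets[i + 1] = new_offsets[i] + len(sub)
    new_content = np.concatenate(pieces) if pieces else content[:0]
    return new_content, new_offsets


content = np.array([10, 10, 11, 20, 21, 21, 22])
offsets = np.array([0, 3, 3, 7])  # sublists: [10, 10, 11], [], [20, 21, 21, 22]
new_content, new_offsets = unique_jagged(content, offsets)
print(new_content)
print(new_offsets)
```

The output here is jagged again (one sublist of unique values per input sublist), which is the situation where an awkward1.Array would be the natural return type.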

created merge request !31 (merged) to address this issue

mentioned in merge request !31 (merged)

Alright, I added km3io.tools.unique and km3io.tools.uniquecount. Implementing it into awkward1.Array is a large undertaking since one also needs to implement it into their C++ level code including proper bindings. The numba implementation is simple, robust and as fast as the C++ code, so let's stick to this.

closed via merge request !31 (merged)

mentioned in commit 58991928

Available in release 0.13.0

Nice, thanks!
