Large Update to OrcaSong, mostly usability improvements.

- Implemented a better parser and implemented config files as an input to OrcaSong. - Got rid of the ORCA_Geo_115lines.txt file, the detector geometry is now read from a .detx file that needs to be supplied with the input file. - Improved general readability of the code - fixed run_id readout - added option to make no timecut

Large Update to OrcaSong, mostly usability improvements.
e7e92063 · ViaFerrata · 234a4a98 · e7e92063 · e7e92063 · e7e92063
Commit e7e92063 authored 6 years ago by ViaFerrata
--- a/detx_files/orca_115strings_av23min20mhorizontal_18OMs_alt9mvertical_v1.detx
+++ b/detx_files/orca_115strings_av23min20mhorizontal_18OMs_alt9mvertical_v1.detx
--- a/docs/api.rst
+++ b/docs/api.rst
@@ -18,10 +18,13 @@ OrcaSong: Main Framework
 .. autosummary::
  :toctree: api

+  parse_input
+  parser_check_input
  calculate_bin_edges
  calculate_bin_edges_test
+  skip_event
+  data_to_images
  main
-  parse_input


 ``orcasong.file_to_hits``: Extracting event information

--- a/docs/getting_started.rst
+++ b/docs/getting_started.rst
@@ -28,7 +28,7 @@ In order to use :code:`tohdf5`, you need to have loaded a jpp version first. A r

    /sps/km3net/users/mmoser/setenvAA_jpp9_cent_os7.sh

-
+Additionally, you need to have a python environment on Lyon, where you have installed km3pipe (e.g. use a pyenv).
 Then, the usage of :code:`tohdf5` is quite easy::

    ~$: tohdf5 -o testfile.h5 /sps/km3net/users/kmcprod/JTE_NEMOWATER/withMX/muon-CC/3-100GeV/JTE.KM3Sim.gseagen.muon-CC.3-100GeV-9.1E7-1bin-3.0gspec.ORCA115_9m_2016.99.root
@@ -138,41 +138,106 @@ After pulling the OrcaSong repo to your local harddisk you first need to install

    ~/orcasong$: pip install .

-Before you can start to use OrcaSong, you need to provide a file that contains the XYZ positions of the DOMs to OrcaSong.
+Before you can start to use OrcaSong, you need a .detx detector geometry file that corresponds to your input files.
 OrcaSong is currently producing event "images" based on a 1 DOM / XYZ-bin assumption. This image generation is done
 automatically, based on the number of bins (n_bins) for each dimension XYZ that you supply as an input and based on the
-DOM XYZ position file. An examplary file for the DOM positions can be found in the folder /orcasong of the OrcaSong
-repo, "ORCA_Geo_115lines.txt". Currently, this file is hardcoded as an input for OrcaSong, so if you want to use another
-detector geometry, you should include your .txt file in the main() function in "data_to_images.py".
-You can generate this .txt file by taking the .det (not .detx!) file, e.g.::
+.detx file which contains the DOM positions.
+
+If your .detx file is not contained in the OrcaSong/detx_files folder, please add it to the repository!
+Currently, only the 115l ORCA 2016 detx file is available.

-    /afs/in2p3.fr/throng/km3net/detectors/orca_115strings_av23min20mhorizontal_18OMs_alt9mvertical_v1.det
+At this point, you're finally ready to use OrcaSong, it can be executed as follows::

-Then, you need to look for the :code:`OM_cluster_data` lines::
+    ~/orcasong$: python data_to_images.py testfile.h5 geofile.detx

-    OM_cluster_data: 1  -86.171 116.880 196.500 -1.57080 0.00000 1.57080
-    OM_cluster_data: 2  -86.171 116.880 187.100 -1.57080 0.00000 1.57080
-    ...
+OrcaSong will then generate a hdf5 file with images into the Results folder, e.g. Results/4dTo4d/h5/xyzt.

-Here, the first column is the dom_id and the second, third and fourth column is the XYZ position.
-You need to copy this information into a .txt file, such that it can be read by OrcaSong. One could automate this such
-that OrcaSong looks for the correct lines in the .det file automatically, however, multiple (old) conventions exist for
-the .det file structure, so it may be a bit tedious. Nevertheless, contributions are very welcome! :)
+The configuration options of OrcaSong can be found by calling the help::
+
+    ~/orcasong$: python data_to_images.py -h
+    Main OrcaSong code which takes raw simulated .h5 files and the corresponding .detx detector file as input in
+    order to generate 2D/3D/4D histograms ('images') that can be used for CNNs.
+    The input file can be calibrated or not (e.g. contains pos_xyz of the hits).
+    Makes only 4D histograms ('images') by default.
+
+    Usage:
+        data_to_images.py [options] FILENAME DETXFILE
+        data_to_images.py (-h | --help)
+
+    Options:
+        -h --help                       Show this screen.
+
+        -c CONFIGFILE                   Load all options from a config file (.toml format).
+
+        --n_bins N_BINS                 Number of bins that are used in the image making for each dimension, e.g. (x,y,z,t).
+                                        [default: 11,13,18,60]
+
+        --det_geo DET_GEO               Which detector geometry to use for the binning, e.g. 'Orca_115l_23m_h_9m_v'.
+                                        [default: Orca_115l_23m_h_9m_v]
+
+        --do2d                          If 2D histograms, 'images', should be created.
+
+        --do2d_plots                    If 2D pdf plot visualizations of the 2D histograms should be created, cannot be called if do2d=False.
+
+        --do2d_plots_n N                For how many events the 2D plot visualizations should be made.
+                                        OrcaSong will exit after reaching N events. [default: 10]
+
+        --do3d                          If 3D histograms, 'images', should be created.
+
+        --dont_do4d                     If 4D histograms, 'images', should NOT be created.
+
+        --do4d_mode MODE                What dimension should be used in the 4D histograms as the 4th dim.
+                                        Available: 'time', 'channel_id'. [default: time]
+
+        --timecut_mode MODE             Defines what timecut mode should be used in hits_to_histograms.py.
+                                        At the moment, these cuts are only optimized for ORCA 115l neutrino events!
+                                        Currently available:
+                                        'timeslice_relative': Cuts out the central 30% of the snapshot.
+                                        'trigger_cluster': Cuts based on the mean of the triggered hits.
+                                        The timespan for this cut can be chosen in --timecut_timespan.
+                                        'None': No timecut.
+                                        [default: trigger_cluster]
+
+        --timecut_timespan TIMESPAN     Only used with timecut_mode 'trigger_cluster'.
+                                        Defines the timespan of the trigger_cluster cut.
+                                        Currently available:
+                                        'all': [-350ns, 850ns] -> 20ns / bin (60 bins)
+                                        'tight-1': [-250ns, 500ns] -> 12.5ns / bin
+                                        'tight-2': [-150ns, 200ns] -> 5.8ns / bin
+                                        [default: tight-1]
+
+        --do_mc_hits                    If only the mc_hits (no BG hits!) should be used for the image processing.
+
+        --data_cut_triggered            If non-triggered hits should be thrown away for the images.
+
+        --data_cut_e_low E_LOW          Cut events that are lower than the specified threshold value in GeV.
+
+        --data_cut_e_high E_HIGH        Cut events that are higher than the specified threshold value in GeV.
+
+        --data_cut_throw_away FRACTION  Throw away a random fraction (percentage) of events. [default: 0.00]
+
+        --prod_ident PROD_IDENT         Optional int identifier for the used mc production.
+                                        This is useful, if you use events from two different mc productions,
+                                        e.g. the 1-5GeV & 3-100GeV Orca 2016 MC. The prod_ident int will be saved in
+                                        the 'y' dataset of the output file of OrcaSong. [default: 1]

-At this point, you're finally ready to use OrcaSong, it can be executed as follows::

-    python data_to_images.py testfile.h5

 OrcaSong will then generate a hdf5 file with images into the Results folder, e.g. Results/4dTo4d/h5/xyzt.

-The configuration options of OrcaSong can be found in the :code:`main()` function.
+Alternatively, they can also be found in the docs of the :code:`main()` function:

 .. currentmodule:: orcasong.data_to_images
 .. autosummary::
    main

-In the future, these configurations should probably be relocated to a dedicated .config file. Again, contributions and
-thoughts are very welcome!
+Other than parsing every information to orcasong via the console, you can also load a .toml config file::
+
+    ~/orcasong$: python data_to_images.py -c config.toml testfile.h5 geofile.detx
+
+Please checkout the config.toml file in the main folder of the OrcaSong repo in order to get an idea about
+the structure of the config file.
+
 If anything is still unclear after this introduction just tell me in the deep_learning channel on chat.km3net.de or
 write me an email at michael.m.moser@fau.de, such that I can improve this guide!


--- a/orcasong/ORCA_Geo_115lines.txt
+++ b/orcasong/ORCA_Geo_115lines.txt
--- a/orcasong/config.toml
+++ b/orcasong/config.toml
+# filename: config.toml
+# A config file for OrcaSong with multiple configurations.
+# Outcomment the config that you want to use!
+# More info about the .toml format at https://github.com/toml-lang/toml
+
+### All available options with some dummy values
+# --n_bins = '11,13,18,60'
+# --det_geo = 'Orca_115l_23m_h_9m_v'
+# --do2d = false
+# --do2d_plots = false
+# --do2d_plots_n = 10
+# --do3d = false
+# --do4d = true
+# --do4d_mode = 'time'
+# --timecut_mode = 'trigger_cluster'
+# --timecut_timespan = 'tight_1'
+# --do_mc_hits = false
+# --data_cut_triggered = false
+# --data_cut_e_low = 3
+# --data_cut_e_high = 100
+# --data_cut_throw_away = 0.00
+# --prod_ident = 1
+
+
+####----- Collection of configurations for ORCA 115l -----####
+
+###--- 3-100GeV ---###
+
+## XYZ-T
+
+--n_bins = '11,13,18,60'
+--det_geo = 'Orca_115l_23m_h_9m_v'
+--do4d_mode = 'time'
+--timecut_mode = 'trigger_cluster'
+--timecut_timespan = 'tight_1'
+--prod_ident = 1
+
+## XYZ-C
+
+#--n_bins = '11,13,18,31'
+#--det_geo = 'Orca_115l_23m_h_9m_v'
+#--do4d_mode = 'channel_id'
+#--timecut_mode = 'trigger_cluster'
+#--timecut_timespan = 'tight_1'
+#--prod_ident = 1
+
+###--- 3-100GeV ---###
+
+###--- 1-5GeV ---###
+
+### only take < 3GeV events, info throw away: elec-CC 0.25, muon-CC 0.25, elec-NC: 0.00
+
+## XYZ-T
+
+#--n_bins = '11,13,18,60'
+#--det_geo = 'Orca_115l_23m_h_9m_v'
+#--do4d_mode = 'time'
+#--timecut_mode = 'trigger_cluster'
+#--timecut_timespan = 'tight_1'
+#--data_cut_e_high = 3
+#--data_cut_throw_away = 0.25 # elec-CC 0.25, muon-CC 0.25, elec-NC: 0.00
+#--prod_ident = 2
+
+## XYZ-C
+
+#--n_bins = '11,13,18,31'
+#--det_geo = 'Orca_115l_23m_h_9m_v'
+#--do4d_mode = 'channel_id'
+#--timecut_mode = 'trigger_cluster'
+#--timecut_timespan = 'tight_1'
+#--data_cut_e_high = 3
+#--data_cut_throw_away = 0.25 # elec-CC 0.25, muon-CC 0.25, elec-NC: 0.00
+#--prod_ident = 2
+
+###--- 1-5GeV ---###
+
+####----- Collection of configurations for ORCA 115l -----####
--- a/orcasong/data_to_images.py
+++ b/orcasong/data_to_images.py
--- a/orcasong/file_to_hits.py
+++ b/orcasong/file_to_hits.py
@@ -45,7 +45,6 @@ def get_time_residual_nu_interaction_mean_triggered_hits(time_interaction, hits_
        Array with trigger flags that specifies if the hit is triggered or not.

    """
-
    hits_time_triggered = hits_time[triggered == 1]
    t_mean_triggered = np.mean(hits_time_triggered, dtype=np.float64)
    time_residual_vertex = t_mean_triggered - time_interaction
@@ -53,7 +52,7 @@ def get_time_residual_nu_interaction_mean_triggered_hits(time_interaction, hits_
    return time_residual_vertex


-def get_event_data(event_blob, geo, do_mc_hits, use_calibrated_file, data_cuts, do4d, prod_ident):
+def get_event_data(event_blob, geo, do_mc_hits, data_cuts, do4d, prod_ident):
    """
    Reads a km3pipe blob which contains the information for one event and returns a hit array and a track array
    that contains all relevant information of the event.
@@ -68,9 +67,6 @@ def get_event_data(event_blob, geo, do_mc_hits, use_calibrated_file, data_cuts,
    do_mc_hits : bool
        Tells the function of the hits (mc_hits + BG) or the mc_hits only should be parsed.
        In the case of mc_hits, the dom_id needs to be calculated thanks to the jpp output.
-    use_calibrated_file : bool
-        Specifies if a calibrated file is used as an input for the event_blob.
-        If False, the hits of the event_blob are calibrated based on the geo parameter.
    data_cuts : dict
        Specifies if cuts should be applied.
        Contains the keys 'triggered' and 'energy_lower/upper_limit' and 'throw_away_prob'.
@@ -79,6 +75,9 @@ def get_event_data(event_blob, geo, do_mc_hits, use_calibrated_file, data_cuts,
        In the case of 'channel_id', this information needs to be included in the event_hits as well.
    prod_ident : int
        Optional int that identifies the used production, more documentation in the docs of the main function.
+    geo : None/kp.Geometry
+        If none, the event_blob should already contain the calibrated hit information (e.g. pos_xyz).
+        Else, a km3pipe Geometry instance that contains the geometry information of the detector.

    Returns
    -------
@@ -91,11 +90,16 @@ def get_event_data(event_blob, geo, do_mc_hits, use_calibrated_file, data_cuts,
        vertex_pos_x, vertex_pos_y, vertex_pos_z, time_residual_vertex, (prod_ident)].

    """
-    p = get_primary_track_index(event_blob)
+    p = get_primary_track_index(event_blob) # TODO fix for mupage/random_noise
+
+    # parse track, contains event information like mc_energy
+
+    if 'Header' in event_blob: # if Header exists in file, take run_id from it. Else take it from RawHeader.
+        run_id = event_blob['Header'].start_run.run_id.astype('float32')
+    else:
+        run_id = event_blob['RawHeader'][0][0].astype('float32')

-    # parse tracks [event_id, particle_type, energy, isCC, bjorkeny, dir_x/y/z, time]
    event_id = event_blob['EventInfo'].event_id[0]
-    run_id = event_blob['Header'].start_run.run_id.astype('float32') # todo check if already f32
    particle_type = event_blob['McTracks'][p].type
    energy = event_blob['McTracks'][p].energy
    is_cc = event_blob['McTracks'][p].is_cc
@@ -106,12 +110,9 @@ def get_event_data(event_blob, geo, do_mc_hits, use_calibrated_file, data_cuts,
    time_interaction = event_blob['McTracks'][p].time

    # parse hits [x,y,z,time]
-    if do_mc_hits is True:
-        hits = event_blob["McHits"]
-    else:
-        hits = event_blob["Hits"]
+    hits = event_blob['Hits'] if do_mc_hits is False else event_blob['McHits']

-    if use_calibrated_file is False:
+    if 'pos_x' not in event_blob['Hits'].dtype.names: # check if blob already calibrated
        hits = geo.apply(hits)

    if data_cuts['triggered'] is True:
@@ -127,7 +128,7 @@ def get_event_data(event_blob, geo, do_mc_hits, use_calibrated_file, data_cuts,
    # save collected information to arrays event_track and event_hits
    track = [event_id, particle_type, energy, is_cc, bjorkeny, dir_x, dir_y, dir_z, time_track, run_id,
             vertex_pos_x, vertex_pos_y, vertex_pos_z, time_residual_vertex]
-    if prod_ident is not None: track.append(np.full(event_id.shape, prod_ident))
+    if prod_ident is not None: track.append(prod_ident)

    event_track = np.array(track, dtype=np.float64)


--- a/orcasong/hits_to_histograms.py
+++ b/orcasong/hits_to_histograms.py
@@ -65,12 +65,16 @@ def get_time_parameters(event_hits, mode=('trigger_cluster', 'all'), t_start_mar
        t_start = t_mean - t_start_margin * t_diff
        t_end = t_mean + t_end_margin * t_diff

+    elif mode[0] is None:
+        t_start = np.amin(t)
+        t_end = np.amax(t)
+
    else: raise ValueError('Time cut modes other than "first_triggered" or "timeslice_relative" are currently not supported.')

    return t_start, t_end


-def compute_4d_to_2d_histograms(event_hits, x_bin_edges, y_bin_edges, z_bin_edges, n_bins, all_4d_to_2d_hists, timecut, event_track, do2d_pdf, pdf_2d_plots):
+def compute_4d_to_2d_histograms(event_hits, x_bin_edges, y_bin_edges, z_bin_edges, n_bins, all_4d_to_2d_hists, timecut, event_track, do2d_plots, pdf_2d_plots):
    """
    Computes 2D numpy histogram 'images' from the 4D data and appends the 2D histograms to the all_4d_to_2d_hists list,
    [xy, xz, yz, xt, yt, zt].
@@ -89,7 +93,7 @@ def compute_4d_to_2d_histograms(event_hits, x_bin_edges, y_bin_edges, z_bin_edge
        Tuple that defines what timecut should be used in hits_to_histograms.
    event_track : ndarray(ndim=2)
        Contains the relevant mc_track info for the event in order to get a nice title for the pdf histos.
-    do2d_pdf : bool
+    do2d_plots : bool
        If True, generate 2D matplotlib pdf histograms.
    pdf_2d_plots : mpl.backends.backend_pdf.PdfPages/None
        Either a mpl PdfPages instance or None.
@@ -117,7 +121,7 @@ def compute_4d_to_2d_histograms(event_hits, x_bin_edges, y_bin_edges, z_bin_edge
                               np.array(hist_yt[0], dtype=np.uint8),
                               np.array(hist_zt[0], dtype=np.uint8)))

-    if do2d_pdf:
+    if do2d_plots:
        # Format in classical numpy convention: x along first dim (vertical), y along second dim (horizontal)
        # Need to take that into account in convert_2d_numpy_hists_to_pdf_image()
        # transpose to get typical cartesian convention: y along first dim (vertical), x along second dim (horizontal)

--- a/requirements.txt
+++ b/requirements.txt
@@ -2,3 +2,5 @@ numpy
 h5py
 matplotlib
 km3pipe
+docopt
+toml