+++
date = "2020-10-27T14:53:59Z"
title = "Data"
type = "article"
draft = true
+++
---
* Processing
* Data formats
---
# Our Data
## Data taking
Data processing follows a tier-based approach: an initial filter for photon patterns related to particle interactions (the triggering of photon "hits") creates the first, event-based data level.
In a second step, the events are processed by applying calibration, particle reconstruction and data analysis methods, which leads to enhanced data sets
and requires a high-performance computing infrastructure for the flexible application of modern data processing and data mining techniques.
For physics analyses, derivatives of these enriched data sets are generated and their information is reduced to low-volume, high-level data which can be analysed and integrated locally into the analysis workflow of the
scientist. To keep the data interpretable, a full Monte Carlo simulation of the data generation and processing chain, starting at the
primary data level, is run to generate reference simulated data for cross-checks at all processing stages and for the statistical interpretation of the particle measurements.
![Overview over data levels](/figures/Data_levels.gif "Overview over data levels")
### Event data processing
Photon-related information is written to ROOT-based tree-like data structures and accumulated during a predefined data taking time range of usually several hours (so-called data runs) before being transferred to high-performance computing (HPC) clusters.
Processed event data sets at the second level represent input to physics analyses, e.g. regarding neutrino oscillation and particle properties, and studies of atmospheric and cosmic neutrino generation. Enriching the data to this end involves probabilistic interpretation of temporal and spatial photon distributions for the reconstruction of event properties in both measured and simulated data, and requires high-performance computing capabilities.
Access to data at this level is restricted to collaboration members due to the intense use of computing resources, the large volume and complexity of the data and the members' primary exploitation right of KM3NeT data. However, data at this stage is already converted to HDF5 format as a less customized hierarchical format. This format choice increases interoperability and facilitates the application of data analysis software packages used e.g. in machine learning and helps to pave the way to wider collaborations within the scientific community utilizing KM3NeT data.
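Since the level-2 files are plain HDF5, they can be inspected with generic tooling outside the KM3NeT software stack. The following minimal sketch, using `h5py`, lists the contents of a file and reads one table; the file name and the dataset path are placeholders, as the actual group layout depends on the processing chain.

```python
import h5py

FILENAME = "km3net_level2_sample.h5"  # placeholder file name

with h5py.File(FILENAME, "r") as f:
    # Print all groups and datasets to discover which tables the file contains.
    f.visit(print)
    # "events" is a hypothetical dataset path; use one of the paths printed above.
    if "events" in f:
        events = f["events"][()]   # read the table as a NumPy structured array
        print(events.dtype.names)  # column names of the event table
```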
### High level data and data derivatives
#### Summary formats and high-level data
As information on particle type, properties and direction is relevant for
the majority of physics analyses, a high-level summary format has been designed to
reduce the complex event information to simplified arrays which allow an event data set to be represented as a table-like data structure.
Although this already leads to a reduced data volume, these neutrino data sets are still dominated by atmospheric muon events at a ratio of about $10^{6} :1$. Since, for many analyses, atmospheric muons are considered background events to both astrophysics and oscillation studies, publication of low-volume general-purpose neutrino data sets requires further event filtering. Here, the optimal filter criteria usually depend on the properties of the expected flux of the signal neutrinos and are determined using the simulated event sets.
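As an illustration of such a filtering step, the sketch below applies cuts on a summary table with `pandas`; the file name, the column names and the cut values are hypothetical and would in practice be tuned on the simulated event sets.

```python
import pandas as pd

# High-level summary table ("one row per event"); the file name is a placeholder.
events = pd.read_csv("summary_events.csv")

# Hypothetical columns: a classifier score close to 1 for atmospheric muons
# and a reconstruction quality estimate; cut values are examples only.
MUON_SCORE_CUT = 0.05
QUALITY_CUT = 0.5

neutrino_candidates = events[
    (events["muon_score"] < MUON_SCORE_CUT) & (events["reco_quality"] > QUALITY_CUT)
]
print(f"kept {len(neutrino_candidates)} of {len(events)} events")
```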
## Open data sets and formats
As all of the following data is published, inter alia, via the Open Data Center (ODC), each data set is enriched with metadata following the [KM3OpenResource description](Datamodels.md#resource-description).
### Particle event tables
#### Data generation
For particle event publication, the full information of a reconstructed event from a data level 2 file is reduced to a "one row per event" format by selecting the relevant parameters from the level 2 files. The event and parameter selection, metadata annotation and conversion of the parameters to the intended output format are performed using the *km3pipe* software. The prototype provenance recording has also been included in this software, so that the output of the pipeline already includes the relevant metadata as well as provenance information. The software allows writing the data to several formats, including text-based formats and HDF5, which are the two relevant formats used in this demonstrator.
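In production this reduction is a *km3pipe* pipeline; the generic `pandas` sketch below only illustrates the principle of selecting a few parameters per reconstructed event into a flat table, with made-up column names and values.

```python
import pandas as pd

# Hypothetical level-2 event information already loaded into a DataFrame,
# e.g. via the KM3NeT reading tools; all column names and values are placeholders.
level2 = pd.DataFrame(
    {
        "event_id": [101, 102],
        "time": [1.6e9, 1.6e9 + 42.0],
        "dir_z": [-0.7, 0.2],
        "energy": [15.2, 230.0],
        "muon_score": [0.9, 0.02],
    }
)

# Keep one row per event with only the parameters relevant for publication.
columns = ["event_id", "time", "dir_z", "energy", "muon_score"]
event_table = level2[columns]
event_table.to_csv("open_events.csv", index=False)
```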
#### Data description
**Scientific use**
Particle event samples can be used in both astrophysics analyses and neutrino oscillation studies, see the [KM3NeT science targets](ScienceTargets.md). Therefore, the data must be made available in a format suitable for the Virtual Observatory as well as for particle physics studies.
**Metadata**
The events, from which relevant *parameters* like particle direction, time, energy and classification parameters are selected for generation of the event table, are enriched with the following metadata.
| Metadata type | content |
| ------------- | ------- |
| *Provenance* information | list of processing steps (referenced by identifier) |
| *Parameter* description | parameter name, unit (SI), type, description, identifier |
| *Data taking* metadata | start/stop time, detector, event selection info |
| *Publication* metadata | publisher, owner, creation date, version, description |
#### Technical specification
##### Data structure
The general data structure is an event list which can be displayed as a flat table with parameters for one event filling one row. Each event row contains an [event identifier](Datamodels.md#particle-event-identifiers).
##### File format
For the tabulated event data, various output formats are used, depending on the platform used for publication and the requirements for interoperability. The formats defined here at the moment are not exclusive and might be extended in the future according to specific requests from the research community.
For HDF5 output, various options exist to store metadata, as several tables can be written to the same file and both the file itself and each table can hold additional information as attributes. Therefore, metadata that should be easy for the user to find and read has been stored in a separate "header" table, while metadata that is more relevant for the machine-based interpretation of the data has been stored as attributes.
In the case of a text-based table, CSV files are generated that are accompanied by a metadata file; a minimal sketch of both conventions follows the table below.
| output format | provenance | parameters | data taking | publication |
| ------------- | ---------- | ---------- | ----------- | ----------- |
| hdf5 | file header | table header | table header | "header" table |
| csv table | metadata file | metadata file | metadata file | metadata file |
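The sketch below illustrates both conventions under stated assumptions: the event table plus a human-readable "header" table and machine-oriented attributes in one HDF5 file written with `pandas`/PyTables, and a JSON metadata file accompanying the CSV table; all file names, keys and metadata values are placeholders.

```python
import json
import pandas as pd

# Event table from the previous sketch, plus example publication metadata.
event_table = pd.read_csv("open_events.csv")
publication_meta = {"publisher": "KM3NeT", "version": "1.0"}  # example values

# HDF5 convention: the event table and a "header" table in one file, with
# machine-oriented metadata attached as attributes of the event table.
with pd.HDFStore("open_events.h5", mode="w") as store:
    store.put("events", event_table, format="table")
    store.put("header", pd.DataFrame([publication_meta]))
    store.get_storer("events").attrs.parameters = {"energy": {"unit": "GeV"}}  # example

# CSV convention: the table itself is accompanied by a separate metadata file.
with open("open_events.meta.json", "w") as f:
    json.dump(publication_meta, f, indent=2)
```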
##### Interfaces
**VO server**
If the neutrino set is relevant for astrophysics analyses, a text file is generated and the metadata mapped to the [resource description format](https://dachs-doc.readthedocs.io/tutorial.html#the-resource-descriptor) required by the DaCHS software, with the [simple cone search (SCS)](https://ivoa.net/documents/cover/ConeSearch-20080222.html) protocol applied to it. In the ODC, the event sample is recorded as a KM3OpenResource pointing to the service endpoints of the VO server. Thus, the data set is findable both through the VO registry and the ODC and accessible through VO-offered access protocols.
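A published cone-search service can then be queried with standard VO client tooling, for example with `pyvo`; the service URL below is a placeholder and the returned columns depend on the published table.

```python
import pyvo

# Placeholder URL of a KM3NeT simple cone search endpoint.
SCS_URL = "https://vo.example.org/km3net/events/scs.xml"

service = pyvo.dal.SCSService(SCS_URL)
# Search a 5-degree cone around an example sky position (RA, Dec in degrees).
results = service.search(pos=(180.0, -45.0), radius=5.0)
print(results.to_table())
```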
**KM3NeT Open Data Server**
In the current test setup, event files that are not easily interpretable in an astrophysics context, like the test sample from the ORCA detector containing mostly atmospheric muons, are stored on the server and registered as KM3OpenResource. While this practice is acceptable now for the relatively small data sets, the design of the server also allows it, in the future, to point to external data sources and to interface with storage locations of extended data samples.
### Multimessenger alerts
#### Data generation
Data generation and scientific use have been described in [the Multimessenger section](Multimessenger.md). The output of the online reconstruction chain is an array of parameters for the identified event as a JSON *key: value* dictionary, which is then annotated with the relevant metadata to match the [VOEvent specifications](https://ivoa.net/documents/VOEvent/20110711/index.html).
#### Data description
The event information can, depending on its specific use, be divided into the following data or metadata categories.
| (Meta)data type | content |
| ------------- | ------- |
| Event identification | event identifier, detector |
| Event description | type of triggers, IsRealAlert |
| Event coordinates | time, rightascension, declination, longitude, latitude |
| Event properties | flavor, multiplicity, energy, neutrino type, error box 50%, 90% (TOC), reconstruction quality, probability to be neutrino, probability for astrophysical origin, ranking |
| Publication metadata | publisher, contact |
#### Technical specification
##### Data structure & format
The VOEvent is stored as an XML file which contains the central sections *WhereWhen*, *Who*, *What*, *How* and *Why*; a minimal skeleton is sketched after the table below.
##### VO Event specifications
| Section | Description | (Meta)data |
| ------- | ----------- | ---------- |
| `<Who>` | Publication metadata | including VOEvent stream identifier |
| `<WhereWhen>` | Space-time coordinates | event coordinates offered in UTC (time) and FK5 (equatorial coordinates) and detector location |
| `<What>` | Additional parameters | event properties, event identifier |
| `<How>` | Additional information | description of the alert type |
| `<Why>` | Scientific context | details on the alert procedure |
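In KM3NeT the alert message is assembled by the online chain; the sketch below only builds a minimal, illustrative VOEvent skeleton with Python's standard `xml.etree.ElementTree`. A real alert additionally carries the VOEvent XML namespace and schema references, and every identifier and value shown here is a placeholder.

```python
import xml.etree.ElementTree as ET

# Root element with a placeholder IVORN and the "test" role.
voevent = ET.Element(
    "VOEvent",
    {"ivorn": "ivo://km3net.example/alerts#0001", "role": "test", "version": "2.0"},
)

who = ET.SubElement(voevent, "Who")                 # publication metadata
ET.SubElement(who, "AuthorIVORN").text = "ivo://km3net.example"  # placeholder

what = ET.SubElement(voevent, "What")               # event properties and identifier
ET.SubElement(what, "Param", {"name": "prob_neutrino", "value": "0.87"})  # example

ET.SubElement(voevent, "WhereWhen")                 # UTC time and FK5 coordinates
ET.SubElement(voevent, "How")                       # description of the alert type
ET.SubElement(voevent, "Why")                       # details on the alert procedure

print(ET.tostring(voevent, encoding="unicode"))
```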
##### Interfaces
Alerts are received and sent via the [GCN](https://gcn.gsfc.nasa.gov/). The alert data are the neutrino candidates in VOEvent format, which is the standard data format for experiments to report and communicate observed transient celestial events and which facilitates follow-ups. The alert distribution is done via [Comet](https://comet.transientskp.org/en/stable/index.html), an implementation of the VOEvent transport protocol.
Beyond this, other receivers can be implemented but are less convenient, e.g. the TNS for optical alerts, the ZTF/LSST broker for optical transients, or the Fermi flare advocate for Fermi blazar outbursts.
For public alerts, KM3NeT will also submit notices and circulars (with a human in the loop) for dissemination.
### Supplementary services and data derivatives
#### Data generation
Providing context information on a broader scale in the form of e.g. sensitivity services and instrument response functions alongside the VO-published data sets is still under investigation and highly dependent on the specific information. Therefore, additional metadata for the interpretation of the format is required.
#### Data description
**Scientific use**
Models and theoretical background information used in the analysis are provided, e.g. accompanying data sets (as for the ANTARES example dataset), to statistically interpret the data sets. Alternatively, probability functions for theoretical predictions or drawn from simulations are considered for publication, including e.g. instrument response functions.
**Metadata**
Metadata here must be case specific:
* Description of the *structure of the data* (e.g. binned data, formula), which will be indicated by a content descriptor [ktype](Datamodels.md#ktype) and accompanied by type-specific additional metadata
* Description of the *basic data set* from which the information is derived, its scope in time and relevant constraints on the basic domain, e.g. a description of the simulation sample
* Description of all relevant *parameters*
#### Technical specification
##### Data structure & format
The data is provided as a CSV table or JSON, with the relevant metadata provided alongside the data in a separate text file or in a header section.
##### Interfaces
Interpretation of the plot or service data is provided using the *openkm3* package, which loads the data as a KM3OpenResource from the ODC and interprets it according to the *ktype*. The relevant data can then be accessed either as an array or, where applicable, be directly rendered to a plot using [matplotlib](https://matplotlib.org/), which can then be edited further.
### Acoustic hydrophone data
#### Data generation
Acoustic data acquisition as described in [the sea science section](SeaScience.md#acoustic-data) offers a continuous stream of digitized acoustic data that undergoes a filtering process according to the scientific target of the audio data. At this point, the raw acoustic data before filtering can be offered as example data and to researchers interested in sea science. Snippets of acoustic data with a duration of a few minutes are produced at a fixed interval and, after format conversion, directly offered from a data server integrated into the acoustic data acquisition system and made accessible through a REST API. Integrating this data stream into the open science system therefore provides a good example for demonstrating the use of a data stream offered externally to the ODC with a growing number of individual data sets.
#### Data description
**Scientific use**
The hydrophone data can be used, after triggering and filtering, for acoustic neutrino detection, detector positioning calibration and identification of marine acoustic signals, e.g. originating from whales. In the unfiltered form, the acoustic data might primarily be of interest for sea science.
**Metadata**
* *Publication metadata* is added during record creation at the ODC
* *Instrumentation & data taking settings* are offered for each data package through a separate endpoint (/info) of the REST API.
#### Technical specification
##### Data structure & format
Each data package contains the same audio data, recorded in a custom binary format (raw) and converted to [wave](http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html) and [mp3](https://mpeg.chiariglione.org/) audio files. Additionally, statistical properties of the audio snippet are offered in a separate stream; a minimal access sketch follows the table below.
| format | endpoint | description | return format |
| ------ | -------- | ----------- | ------------- |
| raw | /raw | custom binary format | application/km3net-acoustic |
| mp3 | /mp3 | mpeg encoded data | audio/mpeg |
| wave | /wav | wave format data | application/octet-stream |
| psd | /psd | array with mean, median, 75% and 95% quantile | application/json |
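A minimal client sketch, assuming a placeholder REST base URL and the endpoints listed above, could look as follows; it reads the instrumentation information and the spectral statistics as JSON and stores one wave snippet to disk.

```python
import requests

BASE_URL = "https://acoustics.example.km3net.de/stream"  # placeholder base URL

# Instrumentation and data-taking settings for the current data package.
info = requests.get(f"{BASE_URL}/info").json()
print(info)

# Statistical properties (mean, median, 75% and 95% quantile) as JSON.
psd = requests.get(f"{BASE_URL}/psd").json()
print(psd)

# Download the wave-formatted audio snippet.
response = requests.get(f"{BASE_URL}/wav")
with open("snippet.wav", "wb") as f:
    f.write(response.content)
```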
##### Interfaces
For each file, a KM3OpenResource is registered in the ODC. All resources belonging to the same data type are grouped using the [KM3ResourceStream](Datamodels.md#datamodels) metadata class, pointing to all resources of the data stream through the kid unique identifier. All streams belonging to the acoustic data service are grouped as a KM3ResourceCollection. Thus,
each single resource can be addressed individually while the logical connection between the resources is preserved.
The data is directly accessible through the ODC webpage views or using *openkm3* as a client from a Python interface.
+++
date = "2020-10-27T14:53:59Z"
title = "Software"
type = "article"
draft = true
+++
---
* Python
* Software@Git <Git>
* Docker containers <Docker>
---
# Our Software
## @GitLab
KM3NeT uses a self-hosted GitLab instance as the main platform to develop and
discuss software, analysis tools, papers and other private or collaborative
creations. GitLab offers professional and advanced features to keep track of the
development history, and its rich feature set makes it easy to exchange and archive
thoughts and ideas. The continuous integration (CI) that is part of the
GitLab distribution proves to be a powerful automation tool and is utilised to
generate consistently up-to-date test reports, documentation and software
releases in a transparent way. The CI pipeline is triggered every time
changes are pushed to a project. Each job runs in an isolated Docker container,
which makes the jobs fully reproducible.
![CI Pipelines of the km3pipe projects showing a failing test in a Python 3.6 environment](/figures/ci-pipelines.png)
In the case of test reports, for example, failing tests are signalled in merge
requests and prevent changes that break them from being applied accidentally. The
documentation is also built in a dedicated pipeline job and is published to the
web upon successful generation. A tight integration of the documentation into
the software projects is mandatory and greatly improves how up to date it stays.
The KM3NeT GitLab server is accessible to the public, but only projects which are
marked as _global_ are visible to a regular visitor without a KM3NeT account.
Visitors can download these projects and all their public branches and access the issues,
documentation and wiki; however, they are not allowed to collaborate, i.e. to
comment or contribute in any way. To circumvent this problem, open source
projects are mirrored to an as yet unofficial GitHub group
(https://github.com/KM3NeT) where everyone with a GitHub account is allowed to
interact.
## Docker
Due to the huge variety of operating systems, languages and frameworks, the
number of possible system configurations has grown rapidly in the past decades.
Operating-system-level virtualisation is one of the most successful techniques
to tackle this problem and allows the conservation of environments, making them
interoperable and reproducible in an almost system-agnostic way. KM3NeT utilises
Docker (https://www.docker.com) for this task, which is the most popular
containerisation solution with high interoperability. Docker containers run with
negligible performance overhead and create an isolated environment in a fully
reproducible manner, regardless of the host system as long as Docker itself is
supported (Linux, macOS and Windows).
These containers are used in the GitLab CI to run test suites in many different
configurations. Python-based projects, for example, can easily be tested under
different Python versions.
### List of accessible docker images
ToDo: add list here!
## Python environment
KM3NeT develops open source Python software for accessing and working with data taken by the detector, produced in simulations or generated in other analysis pipelines, e.g. event reconstructions, as well as a number of other data types like metadata, provenance history and environmental data. The software follows the Semantic Versioning 2.0 (https://semver.org) conventions and releases are automatically triggered on the GitLab CI by annotated Git tags. These releases, including alpha and beta releases, are uploaded to the publicly accessible Python Package Index, which is the main repository of software for the Python programming language. The installation of these packages is as simple as executing `pip install PACKAGE_NAME`. Additionally, the packages can also be installed directly from the GitLab repositories, for example in the case of experimental branches.
### Preferred python packages
The general philosophy behind all Python packages is to build a bridge to commonly used open source scientific tools, libraries and frameworks. While the common base is built on [NumPy](https://numpy.org), the de facto standard for scientific, numerical computing in Python, other popular packages from the [SciPy stack](https://www.scipy.org) are highly preferred. Examples are [Matplotlib](https://matplotlib.org) to create publication-quality plots, [Pandas](https://pandas.pydata.org) which is used to work with tabular data, [Astropy](https://www.astropy.org) for astronomical calculations or [numba](http://numba.pydata.org) for high-performance low-level optimisations.
### Preferred formats for interoperability
The output format is preferably CSV and JSON to maximise interoperability. For larger or more complex datasets, two additional formats are supported. [HDF5](https://www.hdfgroup.org), a widely used data format in science that is accessible from many popular computer languages, is used to store data from every tier, including uncalibrated low-level data and high-level reconstruction summaries. Additionally, and mainly for astronomical data, the [FITS](https://fits.gsfc.nasa.gov) data format is considered if required, due to its high popularity among astronomers.
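As a minimal illustration of these interoperable formats, the sketch below writes the same small table to CSV, HDF5 and FITS using `astropy.table`; the column names and values are made up for the example.

```python
from astropy.table import Table

# A tiny example event table with made-up values.
table = Table({"event_id": [1, 2], "energy": [12.3, 45.6], "declination": [-45.0, 10.2]})

table.write("events.csv", format="ascii.csv", overwrite=True)
table.write("events.h5", format="hdf5", path="events", overwrite=True)  # requires h5py
table.write("events.fits", format="fits", overwrite=True)
```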
### Python interface to KM3NeT data
In addition to offering services and data through the KM3NeT Open Data Center, the [openkm3](https://open-data.pages.km3net.de/openkm3/) Python client was developed to use open data directly in Python from a local computer and within e.g. Jupyter notebooks. It interlinks with the ODC REST API and allows querying the metadata of the resources and collections. It also offers functions to interpret the data according to its KM3NeT type description ([ktype](Datamodels.md#ktype)), e.g. returning tables in a required format. These interface options will be expanded according to the requirements of data integrated into the ODC.
In addition to that, basic functions relevant for astrophysics are offered in the [km3astro](https://km3py.pages.km3net.de/km3astro/) package. As development of the Python environment is an ongoing process, the number of packages offered for KM3NeT data interpretation will surely grow in the future.
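A short usage sketch of the client could look as follows; the `KM3Store` entry point and the resource identifier should be treated as assumptions here and checked against the openkm3 documentation.

```python
# Sketch of accessing an ODC resource with openkm3; the KM3Store entry point
# and the resource identifier below are illustrative assumptions.
from openkm3.store import KM3Store

store = KM3Store()
store.list()                                  # overview of available collections and resources
resource = store.get("km3_example_resource")  # placeholder identifier
```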