
Changing outline to match deliverable TOC · 4a157a81
Jutta Schnabel authored 4 years ago


Dataformats.md 11.72 KiB

Title: Open data formats
Author: Jutta
status: review
Topics:
    - Creation of the open data
    - Data format description
    - different example data: antares events, plots, orca events, acoustic

Access and Archiving

Open data sets and formats

As all of the following data is published, inter alia, via the Open Data Center, the data sets are all enriched with metadata following the KM3OpenResource description.

Particle event tables

Data generation

For particle event publication, the full information of each reconstructed event in the data level 2 files is reduced to a "one row per event" format by selecting the relevant parameters. Event and parameter selection, metadata annotation and conversion of the parameters to the intended output format are performed using the km3pipe software. Prototype provenance recording has also been included in this software, so that the output of the pipeline already includes the relevant metadata as well as provenance information. The software can write the data to several formats, including text-based formats and hdf5, which are the two formats used in this demonstrator.
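As a rough illustration of the reduction step, the following sketch flattens nested event records into one row per event. It uses only the Python standard library and is not the km3pipe API; the layout of the input records and all parameter names (`event_id`, `reco`, `zenith`, ...) are assumptions made for this example.

```python
# Hypothetical level 2 records: nested and richer than what is published.
level2_events = [
    {"event_id": 1, "time": 1500000000.0,
     "reco": {"zenith": 0.3, "azimuth": 1.2, "energy": 12.0},
     "hits": [101, 102, 103]},
    {"event_id": 2, "time": 1500000060.0,
     "reco": {"zenith": 2.1, "azimuth": 0.4, "energy": 150.0},
     "hits": [201, 202]},
]

# Output column -> key path inside the nested record (assumed names).
SELECTION = {
    "event_id": ("event_id",),
    "time": ("time",),
    "zenith": ("reco", "zenith"),
    "azimuth": ("reco", "azimuth"),
    "energy": ("reco", "energy"),
}


def flatten_event(event, selection=SELECTION):
    """Return one flat row holding only the selected parameters."""
    row = {}
    for column, path in selection.items():
        value = event
        for key in path:
            value = value[key]
        row[column] = value
    return row


def to_event_table(events, selection=SELECTION):
    """Reduce a list of nested events to a 'one row per event' table."""
    return [flatten_event(event, selection) for event in events]


table = to_event_table(level2_events)
```

Every row keeps the event identifier, while information that is not selected (here the hypothetical `hits` list) is dropped.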

Data description

Scientific use

Particle event samples can be used both in astrophysics analyses and in neutrino oscillation studies, see the KM3NeT science targets. Therefore, the data must be made available in a format suitable for the Virtual Observatory as well as for particle physics studies.

Metadata

The event table, for which relevant parameters such as particle direction, time, energy and classification parameters are selected from the events, is enriched with the following metadata.

Metadata type	content
Provenance information	list of processing steps (referenced by identifier)
Parameter description	parameter name, unit (SI), type, description, identifier
Data taking metadata	start/stoptime, detector, event selection info
Publication metadata	publisher, owner, creation date, version, description

Technical specification

Data structure

The general data structure is an event list which can be displayed as a flat table with parameters for one event filling one row. Each event row contains an event identifier.

File format

For the tabulated event data, various output formats are used depending on the platform used for publication and the requirements for interoperability. The formats currently defined here are not exclusive and might be extended in the future according to specific requests from the research community.

For hdf5 files as output, various options exist to store metadata, as several tables can be written to the same file, and each table and the file itself can hold additional information as attributes. Therefore, metadata that should be easy for the user to find and read is stored in a separate "header" table, while metadata that is more relevant for machine-based interpretation of the data is stored as attributes.

In the case of a text-based table, csv files are generated that are accompanied by a metadata file.

output format	provenance	parameters	data taking	publication
hdf5	file header	table header	table header	"header" table
csv table	metadata file	metadata file	metadata file	metadata file
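For the text-based case, a minimal sketch of writing a csv table together with its metadata file could look as follows. The JSON encoding of the metadata file and all keys and values are assumptions for illustration; the source only states that a metadata file accompanies the csv file.

```python
import csv
import io
import json


def write_event_table(rows, metadata, csv_file, meta_file):
    """Write the event rows as csv and the metadata as a JSON side file."""
    columns = list(rows[0].keys())
    writer = csv.DictWriter(csv_file, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
    json.dump(metadata, meta_file, indent=2)


rows = [
    {"event_id": 1, "energy": 12.0},
    {"event_id": 2, "energy": 150.0},
]

# One block per metadata type of the table above (hypothetical content).
metadata = {
    "provenance": ["step-0001", "step-0002"],
    "parameters": {
        "event_id": {"unit": "", "type": "int", "description": "event identifier"},
        "energy": {"unit": "GeV", "type": "float", "description": "reconstructed energy"},
    },
    "data_taking": {"detector": "example-detector", "start": "2020-01-01", "stop": "2020-01-02"},
    "publication": {"publisher": "KM3NeT", "version": "0.1"},
}

# In-memory buffers stand in for the two output files.
csv_buffer, meta_buffer = io.StringIO(), io.StringIO()
write_event_table(rows, metadata, csv_buffer, meta_buffer)
```

Passing open file objects instead of the buffers would write the same content to disk.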

Interfaces

VO server

If the neutrino set is relevant for astrophysics analyses, a text file is generated and the metadata is mapped to the resource description format required by the DaCHS software, with the Simple Cone Search (SCS) protocol applied to it. In the ODC, the event sample is recorded as a KM3OpenResource pointing to the service endpoints of the VO server. Thus, the data set is findable both through the VO registry and the ODC, and accessible through VO-offered access protocols.
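To illustrate what a cone search request involves, the sketch below builds an SCS query URL from the three standard parameters `RA`, `DEC` and `SR` (all in decimal degrees) and reproduces the cone selection locally with a great-circle separation. The base URL is a placeholder and not the address of the KM3NeT VO server.

```python
import math
from urllib.parse import urlencode


def cone_search_url(base_url, ra_deg, dec_deg, radius_deg):
    """Build a Simple Cone Search GET request URL."""
    query = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}{query}"


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation of two sky positions (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def in_cone(event_ra, event_dec, ra_deg, dec_deg, radius_deg):
    """True if the event direction lies inside the search cone."""
    return angular_separation_deg(event_ra, event_dec, ra_deg, dec_deg) <= radius_deg


# Placeholder endpoint, for illustration only.
url = cone_search_url("https://vo.example.org/scs", 10.5, -30.0, 2.0)
```

A VO client would send this request and receive the matching events as a VOTable; `in_cone` shows the selection the service applies to each event position.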

KM3NeT Open Data Server

In the current test setup, event files that are not easily interpretable in an astrophysics context, like the test sample from the ORCA detector containing mostly atmospheric muons, are stored on the server and registered as KM3OpenResource. While this practice is acceptable for the currently relatively small data sets, the design of the server also allows pointing to external data sources and interfacing with storage locations of extended data samples in the future.

Multimessenger alerts

Data generation

Data generation and scientific use have been described in the Multimessenger section. The output of the online reconstruction chain is an array of parameters for the identified event, stored as a JSON dictionary of key: value pairs, which is then annotated with the relevant metadata to match the VOEvent specifications.

Data description

The event information can, depending on its specific use, be divided into the following data or metadata categories.

(Meta)data type	content
Event identification	event identifier, detector
Event description	type of triggers, IsRealAlert
Event coordinates	time, right ascension, declination, longitude, latitude
Event properties	flavor, multiplicity, energy, neutrino type, error box 50%, 90% (TOC), reconstruction quality, probability to be neutrino, probability for astrophysical origin, ranking
Publication metadata	publisher, contact

Technical specification

Data structure & format

The VOEvent is stored as an XML file whose central sections are WhereWhen, Who, What, How and Why.
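The mapping from a flat alert dictionary to these five sections can be sketched with the Python standard library as shown below. This is a simplified illustration and has not been validated against the VOEvent 2.0 schema; the dictionary keys, the identifiers and the choice to mark non-real alerts with the role `test` are assumptions, not the KM3NeT implementation.

```python
import xml.etree.ElementTree as ET

VOE_NS = "http://www.ivoa.net/xml/VOEvent/v2.0"


def build_voevent(alert, ivorn, author_ivorn):
    """Map a flat alert dictionary onto the five central VOEvent sections."""
    ET.register_namespace("voe", VOE_NS)
    # Assumption: alerts that are not real are flagged with the role "test".
    role = "observation" if alert["IsRealAlert"] else "test"
    root = ET.Element(f"{{{VOE_NS}}}VOEvent",
                      {"ivorn": ivorn, "role": role, "version": "2.0"})

    who = ET.SubElement(root, "Who")
    ET.SubElement(who, "AuthorIVORN").text = author_ivorn

    what = ET.SubElement(root, "What")
    for name in ("event_id", "energy", "multiplicity"):
        ET.SubElement(what, "Param", {"name": name, "value": str(alert[name])})

    where_when = ET.SubElement(root, "WhereWhen")
    location = ET.SubElement(ET.SubElement(where_when, "ObsDataLocation"),
                             "ObservationLocation")
    coords = ET.SubElement(location, "AstroCoords",
                           {"coord_system_id": "UTC-FK5-GEO"})
    instant = ET.SubElement(ET.SubElement(coords, "Time", {"unit": "s"}),
                            "TimeInstant")
    ET.SubElement(instant, "ISOTime").text = alert["time"]
    value = ET.SubElement(ET.SubElement(coords, "Position2D", {"unit": "deg"}),
                          "Value2")
    ET.SubElement(value, "C1").text = str(alert["right_ascension"])
    ET.SubElement(value, "C2").text = str(alert["declination"])

    ET.SubElement(ET.SubElement(root, "How"), "Description").text = alert["alert_type"]
    ET.SubElement(ET.SubElement(root, "Why"), "Description").text = alert["procedure"]
    return ET.tostring(root, encoding="unicode")


# Hypothetical alert as it might leave the online reconstruction chain.
alert = {
    "event_id": "evt-0001", "IsRealAlert": False,
    "time": "2020-01-01T00:00:00",
    "right_ascension": 120.5, "declination": -30.25,
    "energy": 42.0, "multiplicity": 1,
    "alert_type": "example alert type", "procedure": "example alert procedure",
}
xml_text = build_voevent(alert, "ivo://example.org/km3net#evt-0001",
                         "ivo://example.org/km3net")
```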

VOEvent specifications

Section	Description	(Meta)data
`<Who>`	Publication metadata	including VOEvent stream identifier
`<WhereWhen>`	Space-time coordinates	event coordinates offered in UTC (time) and FK5 (equatorial coordinates) and detector location
`<What>`	Additional parameters	event properties, event identifier
`<How>`	Additional information	description of the alert type
`<Why>`	Scientific context	details on the alert procedure

Interfaces

Alerts are received and sent via the GCN. The alert data are the neutrino candidates in VOEvent format, the standard data format for experiments to report and communicate observed transient celestial events, which facilitates follow-up observations. The alerts are distributed via Comet, an implementation of the VOEvent Transport Protocol.

Beyond this, other receivers can also be implemented but are less convenient, e.g. the TNS for optical alerts, the ZTF/LSST broker for optical transients, and the Fermi flare advocates for Fermi blazar outbursts.

For public alerts, KM3NeT will also submit notices and circulars (with a human in the loop) for dissemination.

Supplementary services and data derivatives

Data generation

Providing context information on a broader scale in the form of e.g. sensitivity services and instrument response functions alongside the VO-published data sets is still under investigation and highly dependent on the specific information. Therefore, additional metadata for the interpretation of the format is required.

Data description

Scientific use

Models and theoretical background information used in the analysis are provided, e.g. as accompanying data sets (as for the ANTARES example data set), to allow statistical interpretation of the data sets. Alternatively, probability functions for theoretical predictions, drawn from simulations, are considered for publication, including e.g. instrument response functions.

Metadata

Metadata here must be case specific:

    - Description of the structure of the data (e.g. binned data, formula), which is indicated by a content descriptor ktype and accompanied by type-specific additional metadata
    - Description of the basic data set from which the information is derived, its scope in time and relevant restrictions to the basic domain, e.g. a description of the simulation sample
    - Description of all relevant parameters

Technical specification

Data structure & format

The data is provided as a csv table or as json, with the relevant metadata provided alongside the data, either in a separate text file or in a header section.

Interfaces

Interpretation of the plot or service data is provided by the openkm3 package, which loads the data as KM3OpenResource from the ODC and interprets it according to the ktype. The relevant data can then be accessed either as an array or, where applicable, rendered directly to a plot using matplotlib, which can then be edited further.
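The idea of interpreting data according to a content descriptor can be sketched as a simple dispatch table. This is not the openkm3 API: the ktype values, the resource layout and the function names are hypothetical and only illustrate the principle.

```python
def interpret_binned(data):
    """Return bin centres and counts for binned data."""
    edges, counts = data["edges"], data["counts"]
    if len(edges) != len(counts) + 1:
        raise ValueError("binned data needs one more edge than counts")
    centres = [(low + high) / 2 for low, high in zip(edges[:-1], edges[1:])]
    return {"x": centres, "y": counts}


def interpret_table(data):
    """Return the named columns of a column-oriented table."""
    return {name: list(values) for name, values in data["columns"].items()}


# Hypothetical ktype values; the real descriptors are defined by KM3NeT.
INTERPRETERS = {
    "example.binned": interpret_binned,
    "example.table": interpret_table,
}


def interpret(resource):
    """Select the interpreter matching the ktype of the resource."""
    ktype = resource["ktype"]
    if ktype not in INTERPRETERS:
        raise ValueError(f"no interpreter registered for ktype {ktype!r}")
    return INTERPRETERS[ktype](resource["data"])


resource = {
    "ktype": "example.binned",
    "data": {"edges": [0.0, 1.0, 2.0, 4.0], "counts": [5, 3, 1]},
}
result = interpret(resource)
```

The returned arrays could then be handed to a plotting library; new data structures only require registering one more interpreter.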

Acoustic hydrophone data

Data generation

Acoustic data acquisition, as described in the sea science section, offers a continuous stream of digitized acoustic data that will undergo a filtering process according to the scientific target of the audio data. At this point, the raw acoustic data before filtering can be offered as example data and to researchers interested in sea science. Snippets of acoustic data with a duration of a few minutes are produced at a fixed interval and, after format conversion, offered directly from a data server that is integrated in the acoustic data acquisition system and accessible through a REST API. Integrating this data stream in the open science system is therefore a good example to demonstrate the use of a data stream that is offered externally to the ODC and has a growing number of individual data sets.

Data description

Scientific use

The hydrophone data can be used, after triggering and filtering, for acoustic neutrino detection, detector positioning calibration and identification of marine acoustic signals, e.g. originating from whales. In the unfiltered form, the acoustic data might primarily be of interest for sea science.

Metadata

    - Publication metadata is added during record creation at the ODC.
    - Instrumentation & data taking settings are offered for each data package through a separate endpoint (/info) of the REST API.

Technical specification

Data structure & format

Each data package consists of the same audio data, recorded in a custom binary format (raw), which is converted to wave and mp3 audio files. Additionally, statistical properties of the audio snippet are offered in a separate stream.

format	endpoint	description	return format
raw	/raw	custom binary format	application/km3net-acoustic
mp3	/mp3	mpeg encoded data	audio/mpeg
wave	/wav	wave format data	application/octet-stream
psd	/psd	array with mean, median, 75% and 95% quantile	application/json
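A client can use the table above to build request URLs and to check the content type of a response. The sketch below encodes the table as a lookup; the base URL is a placeholder, and how an individual data package is addressed is not specified here, so that part is left out.

```python
# Format -> (endpoint path, returned content type), taken from the table above.
ENDPOINTS = {
    "raw": ("/raw", "application/km3net-acoustic"),
    "mp3": ("/mp3", "audio/mpeg"),
    "wave": ("/wav", "application/octet-stream"),
    "psd": ("/psd", "application/json"),
}


def endpoint_url(base_url, data_format):
    """Join the server base URL with the endpoint path of a format."""
    path, _ = ENDPOINTS[data_format]
    return base_url.rstrip("/") + path


def is_expected_content_type(data_format, content_type):
    """Compare a response Content-Type header with the documented one."""
    _, expected = ENDPOINTS[data_format]
    # Ignore optional parameters such as "; charset=utf-8".
    return content_type.split(";")[0].strip().lower() == expected


# Placeholder server address, for illustration only.
mp3_url = endpoint_url("https://acoustic.example.org/api/", "mp3")
```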

Interfaces

For each file, a KM3OpenResource is registered in the ODC. All resources belonging to the same data type are grouped using KM3ResourceStream as the metadata class, which points to all resources of the data stream through their unique identifier (kid). All streams belonging to the acoustic data service are grouped as a KM3ResourceCollection. Thus, each single resource can be addressed individually while the logical connection between the resources is preserved.
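The grouping by identifier can be pictured with a small data model. The classes below are simplified stand-ins, not the ODC metadata classes themselves; their fields, the identifiers and the URLs are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Stand-in for a KM3OpenResource: one registered file."""
    kid: str
    url: str


@dataclass
class ResourceStream:
    """Stand-in for a KM3ResourceStream: resources of one data type."""
    kid: str
    data_type: str
    resource_kids: list = field(default_factory=list)


@dataclass
class ResourceCollection:
    """Stand-in for a KM3ResourceCollection: all streams of a service."""
    kid: str
    stream_kids: list = field(default_factory=list)


def resources_in_collection(collection, streams, resources):
    """Resolve a collection to its resources by following kid references."""
    found = []
    for stream_kid in collection.stream_kids:
        for resource_kid in streams[stream_kid].resource_kids:
            found.append(resources[resource_kid])
    return found


# Hypothetical registry content, keyed by kid.
resources = {
    "res-1": Resource("res-1", "https://acoustic.example.org/api/mp3"),
    "res-2": Resource("res-2", "https://acoustic.example.org/api/psd"),
}
streams = {
    "stream-mp3": ResourceStream("stream-mp3", "mp3", ["res-1"]),
    "stream-psd": ResourceStream("stream-psd", "psd", ["res-2"]),
}
collection = ResourceCollection("acoustic", ["stream-mp3", "stream-psd"])
members = resources_in_collection(collection, streams, resources)
```

Because streams and collections hold only identifiers, a new file is added by registering one resource and appending its kid to the matching stream.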

The data is directly accessible through the ODC webpage views or from a Python interface using openkm3 as client.