ROOT. A data analysis framework (Андрей Савченко, LVEE-2014)
- Андрей Савченко
Modern high energy physics (HEP) demands a high-performance, large-scale data mining toolkit. This talk introduces one such tool, the ROOT data analysis framework. A brief overview of its ample features is provided, and some performance and architecture details are discussed.
High energy physics (HEP) is well known not only for fundamental research but also for driving cutting-edge technology as a byproduct of its challenging demands. The WWW was born at CERN[1], Grid technology was nurtured in the scientific environment, and free tools for petabyte-scale data processing are developed there.
Today HEP experiments produce petabytes of data and need a tool to process and physically analyse this data. Such a tool has been available since 1995 and is known as ROOT[2]. It is licensed under the LGPL and is developed by world-renowned scientific centers (CERN, FermiLab, BNL, etc.). ROOT is an object-oriented C++ framework designed for large-scale data analysis and mining, and for storing and analyzing petabytes of data in an efficient way[3].
Any instance of a C++ class can be stored in a ROOT file in a machine-independent compressed binary format. In ROOT, the TTree object container is optimized for statistical data analysis over very large data sets by using vertical (column-wise) data storage techniques[4]. This container uses buckets (called baskets in ROOT) for each tree branch, where each bucket is a contiguous region of the file. This makes it efficient to extract data subsets, e.g. if only several values from each event are needed for the current analysis[5]. These containers can span a large number of files on local disks, the web, or a number of different shared file systems.
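The vertical storage idea can be illustrated with a minimal standard C++ sketch. It is a conceptual illustration only and does not use the ROOT API: the class `ColumnarTree` and the helper `BranchMean` are hypothetical names, and compression and file I/O are left out. Each branch keeps its values in its own contiguous buffer, so an analysis that needs one variable never touches the others.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Conceptual sketch only: this is NOT the ROOT TTree API.
// Every branch (column) owns a contiguous buffer of values.
class ColumnarTree {
public:
    // Register a branch; hypothetical helper name.
    void AddBranch(const std::string& name) {
        branches_.emplace(name, std::vector<double>());
    }

    // Append one event: exactly one value per registered branch.
    // Throws std::out_of_range if the event lacks a branch value.
    void Fill(const std::map<std::string, double>& event) {
        for (auto& kv : branches_) {
            kv.second.push_back(event.at(kv.first));
        }
        ++entries_;
    }

    // Access a single branch without scanning the other columns.
    const std::vector<double>& Branch(const std::string& name) const {
        return branches_.at(name);
    }

    std::size_t Entries() const { return entries_; }

private:
    std::map<std::string, std::vector<double>> branches_;
    std::size_t entries_ = 0;
};

// Mean of one branch: reads only that branch's buffer.
double BranchMean(const ColumnarTree& tree, const std::string& name) {
    const std::vector<double>& column = tree.Branch(name);
    assert(!column.empty());
    double sum = 0.0;
    for (double v : column) sum += v;
    return sum / static_cast<double>(column.size());
}
```

With a row-wise layout the same mean would require reading every full event record. Here the cost scales with the size of the one branch that is requested.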
To analyze this data, the user can choose from a wide set of mathematical and statistical functions, including linear algebra classes, numerical algorithms such as integration and minimization, and various methods for performing regression analysis (fitting). In particular, the RooFit package[6] allows the user to perform complex data modeling and fitting, while the RooStats library provides abstractions and implementations for advanced statistical tools. Multivariate classification methods based on machine learning techniques and neural networks are available via the TMVA package[7].
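As a rough idea of what the simplest kind of fitting involves, the following sketch performs an unweighted least-squares straight-line fit in closed form using only the C++ standard library. It is not RooFit or any other ROOT fitting API. The names `LineFit` and `FitLine` are hypothetical, and real fitting tools additionally handle uncertainties, arbitrary models and iterative minimization.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Result of fitting y = slope * x + intercept.
struct LineFit {
    double slope;
    double intercept;
};

// Unweighted least-squares line fit via the normal equations.
// Assumes x and y have equal length >= 2 and x is not constant.
LineFit FitLine(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    const double n = static_cast<double>(x.size());
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sx += x[i];
        sy += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    const double denom = n * sxx - sx * sx;  // zero only if all x are equal
    assert(denom != 0.0);
    LineFit fit;
    fit.slope = (n * sxy - sx * sy) / denom;
    fit.intercept = (sy - fit.slope * sx) / n;
    return fit;
}
```

For points lying exactly on a line, such as (0, 1), (1, 3), (2, 5), (3, 7), the fit recovers the slope and intercept of that line.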
The histogram classes, which provide binning of one- and multi-dimensional data, are a central piece of these analysis tools. Results can be saved in vector formats like PostScript, PDF or LaTeX with Metafont, or in bitmap formats like JPG, PNG or GIF[4].
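Binning of one-dimensional data can be sketched as follows. This is a standalone illustration in standard C++, not a ROOT histogram class; `Histogram1D` is a hypothetical name. The layout with a dedicated underflow bin at index 0 and an overflow bin at index nbins + 1 follows the convention used by ROOT's 1D histograms, while everything else (fixed-width bins, plain counts, NaN treated as underflow) is a simplifying assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-width 1D histogram with underflow (bin 0) and
// overflow (bin nbins + 1) bins. Conceptual sketch only.
class Histogram1D {
public:
    Histogram1D(std::size_t nbins, double low, double high)
        : low_(low), high_(high), counts_(nbins + 2, 0) {
        assert(nbins > 0 && low < high);
    }

    // Map a value to its bin index in [0, nbins + 1].
    std::size_t FindBin(double x) const {
        const std::size_t nbins = counts_.size() - 2;
        if (!(x >= low_)) return 0;        // underflow (also catches NaN)
        if (x >= high_) return nbins + 1;  // overflow; upper edge is exclusive
        const double width = (high_ - low_) / static_cast<double>(nbins);
        std::size_t bin = 1 + static_cast<std::size_t>((x - low_) / width);
        if (bin > nbins) bin = nbins;      // guard against rounding near the edge
        return bin;
    }

    // Increment the count of the bin that contains x.
    void Fill(double x) { ++counts_[FindBin(x)]; }

    unsigned long BinContent(std::size_t bin) const { return counts_.at(bin); }

private:
    double low_;
    double high_;
    std::vector<unsigned long> counts_;
};
```

Keeping out-of-range entries in separate bins means that no filled value is silently lost, which matters when normalizing a distribution.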
Users typically create their analysis macros step by step, making use of the interactive C++ interpreter Cling (which is based on Clang and supersedes the older CINT project), while running over small data samples. Once the development is finished, they can run these macros at full compiled speed over large data sets, using on-the-fly compilation, or by creating a stand-alone C++ program using ROOT libraries. Bindings for Python and Ruby, as well as integration with R and Mathematica, are available.
Finally, if HPC clusters are present, the user can reduce the execution time of intrinsically parallel tasks (e.g. data mining in HEP) by using PROOF, which takes care of optimally distributing the work over the available resources in a transparent way[4]. Grid and AFS[8] support is also available.
Besides high energy physics, ROOT is also widely used in many other scientific fields, such as astronomy and biology, as well as in finance and medicine[9], and may be used in any other field requiring spectral analysis or advanced histogram facilities.
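Event-level tasks parallelize well because each event can be processed independently and the partial results merged afterwards. The sketch below shows only that split-and-merge idea, with a simple static partition in standard C++. It is not PROOF's scheduling algorithm or API, and the names `EntryRange`, `SplitEntries` and `SumInRanges` are hypothetical. The ranges are processed sequentially here; in a real setup each one would be handed to a separate worker.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range of entry indices [begin, end).
struct EntryRange {
    std::size_t begin;
    std::size_t end;
};

// Split `total` entries into `workers` near-equal contiguous ranges.
// The first (total % workers) ranges get one extra entry.
std::vector<EntryRange> SplitEntries(std::size_t total, std::size_t workers) {
    assert(workers > 0);
    std::vector<EntryRange> ranges;
    const std::size_t base = total / workers;
    const std::size_t extra = total % workers;
    std::size_t begin = 0;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t size = base + (w < extra ? 1 : 0);
        ranges.push_back({begin, begin + size});
        begin += size;
    }
    return ranges;
}

// Compute one partial sum per range, then merge the partial results.
// Each inner loop is the work a single worker would do on its own.
double SumInRanges(const std::vector<double>& values,
                   const std::vector<EntryRange>& ranges) {
    double total = 0.0;
    for (const EntryRange& r : ranges) {
        double partial = 0.0;
        for (std::size_t i = r.begin; i < r.end; ++i) partial += values[i];
        total += partial;  // merge step
    }
    return total;
}
```

A static split like this is only balanced when all entries cost about the same to process, which is one reason a dedicated system for distributing the work is useful.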
Notes and comments