Wolfgang von Rueden
Rosy Mondardini
CERN
An increasing number of scientific and commercial disciplines require intensive computation and access to large shared databases - tens of terabytes (TB) today, growing to petabytes (PB) by the middle of the decade. High Energy Physics (HEP) is a representative example: thousands of physicists from hundreds of research institutes, laboratories and universities worldwide collaborate to design, build and operate detectors located at CERN in Europe or at U.S. laboratories such as FNAL and SLAC (see Resources). To obtain physics results, they need to pool their computing, storage and networking resources to analyze up to several petabytes of data, sharing large databases and computational resources among centers distributed across Europe and in several other countries.
At present, the available computational capacity is installed in different geographic locations, and only subsets of data are available at specific sites, where data analysis is performed according to data availability. Such a "static" approach is difficult to sustain in an environment where the data access pattern is generally unpredictable and depends on changing scientific interest. The amount of data this community will create within the next five years will make it necessary to utilize national and regional computing facilities to their fullest capacities. At the same time, exploring these large amounts of data with centralized computing resources alone will not be an option.
The Grid paradigm seems to contain most of the elements to answer the distributed computing requirements of the HEP community. HEP, in turn, is an almost ideal field in which to experiment and demonstrate such technology for both data- and computation-intensive applications. The Grid will clearly become an instrumental element of the HEP computing strategy over the next decade, and several projects are already addressing these issues both in Europe and in the USA.
CERN, the European Organization for Nuclear Research, is playing a special role in such developments to address the needs of the Large Hadron Collider (LHC), the world's most powerful particle accelerator, which is now under construction and will start operation in 2007 (see Resources). The computing facility for LHC will be implemented as a global computation- and data-intensive Grid, and the ongoing development and prototyping work involves many scientific institutes and industrial partners, coordinated by CERN. The project, named LCG (after LHC Computing Grid), will be integrated with several European national computational Grid activities, and will collaborate closely with other projects led by CERN and involved in advanced Grid technology, such as the EU DataGrid project.
LHC Computing Requirements
The LHC is the next research instrument in Europe's particle physics armory, and it will let scientists probe deeper into matter than has ever before been possible. Due to start operations in 2007, it is a particle accelerator that will ultimately collide beams of protons at an energy of 14 TeV. Beams of lead nuclei will also be accelerated, smashing together with a collision energy of 1,150 TeV. As well as having the highest energy of any accelerator in the world, the LHC will also have the most intense beams, with collisions happening so fast (40 million times a second) that particles from one collision will still be traveling through the detector when the next collision happens. Four experiments (ATLAS, CMS, LHCb and ALICE) will use huge detectors to study what happens at the collision points.
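The timing claim above can be checked with simple arithmetic. The following is a minimal sketch, not taken from the experiments' own documentation: the function names are illustrative, and it assumes evenly spaced collisions and particles moving at no more than the speed of light.

```python
# Bunch-crossing arithmetic for the collision rate quoted in the text.
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre
COLLISION_RATE_HZ = 40_000_000        # "40 million times a second"


def crossing_interval_ns(rate_hz: float) -> float:
    """Time between successive collisions, in nanoseconds."""
    return 1e9 / rate_hz


def light_travel_m(interval_ns: float) -> float:
    """Upper bound on the distance a particle covers in one interval."""
    return SPEED_OF_LIGHT_M_PER_S * interval_ns * 1e-9


interval = crossing_interval_ns(COLLISION_RATE_HZ)
distance = light_travel_m(interval)
print(f"{interval:.0f} ns between collisions, "
      f"at most {distance:.2f} m travelled in that time")
```

Under these assumptions the interval is 25 ns, in which even light covers only about 7.5 m. Any detector layer farther than that from the collision point is still being crossed by particles from one collision when the next one occurs.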
The experiments will each produce a few PB of data per year. In particular, ATLAS and CMS each expect to produce more than 1 PB/year of raw data and approximately 200 TB of Event Summary Data (ESD), resulting from the first reconstruction pass. Analysis data will be reduced to a few tens of TB, and event tag data (short event indexes) will be a few TB. ALICE foresees around 2 PB/year of raw data, combining the heavy ion data with the proton data. LHCb will generate about 4 PB/year of data covering all stages of raw acquisition and simulation processing.
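To give a feeling for the combined scale, the per-experiment figures above can be added up. This is only a rough sketch: it assumes decimal units (1 PB = 1000 TB), treats the ATLAS and CMS numbers as lower bounds (raw data plus ESD only), and mixes figures with different scope, since the LHCb number also includes simulation. The averaging over a full calendar year is our simplification; the accelerator does not run continuously, so real peak rates would be higher.

```python
# Rough yearly data volume per experiment, using the figures quoted above.
# Assumption: decimal units, 1 PB = 1000 TB = 10**15 bytes.
TB_PER_PB = 1000
SECONDS_PER_YEAR = 365 * 24 * 3600

# Values in TB/year. ATLAS and CMS: "more than 1 PB" of raw data plus
# roughly 200 TB of ESD, so these two entries are lower bounds.
annual_tb = {
    "ATLAS": 1 * TB_PER_PB + 200,
    "CMS": 1 * TB_PER_PB + 200,
    "ALICE": 2 * TB_PER_PB,
    "LHCb": 4 * TB_PER_PB,
}

total_tb = sum(annual_tb.values())
total_pb = total_tb / TB_PER_PB

# Average rate if the yearly volume were spread evenly over a calendar year
# (1 TB = 1,000,000 MB in decimal units).
avg_mb_per_s = total_tb * 1_000_000 / SECONDS_PER_YEAR

print(f"total: at least {total_pb:.1f} PB/year, "
      f"average about {avg_mb_per_s:.0f} MB/s")
```

With these inputs the total comes to at least 8.4 PB per year, or a year-round average of a few hundred MB/s.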
The raw data are generated at a single location (CERN) where the accelerator and experiments are hosted, but the computational capacity required to analyze them implies that the analysis must be performed at geographically distributed centers. It will require some 10 PB of disk storage and computing power that is roughly the equivalent of 200,000 of today's fastest PC processors. Even taking into account the continuing increase in storage densities and processor performance, this will be a very large and complex computing system. About two thirds of the computing capacity will be installed in regional computing centers spread across Europe, America and Asia.
About 5,000 researchers will collaborate on the four experiments, from countries all around the world. Their data access and handling requirements can be summarized as follows:
- All physicists should have, at least in principle, similar access to the data, irrespective of their location.
- Access to the data should be transparent, and as efficient as possible; the same should hold for access to the computational resources needed to process the data.
The computing facility for LHC will thus be implemented as a global computation- and data-intensive Grid, with the goal of integrating large, geographically distributed storage and computing resources into a virtual computing center.
The scale of the problem requires the innovative use of large scale distributed computing resources and mass storage management systems to organize a hierarchical distribution of the data between random access secondary storage (disk) and serial tertiary storage (such as magnetic tape). The data transfer requires high-speed point-to-point replication facilities, and a system that checks and maintains data consistency. Duplicated data should be used to help balance the search and network loads according to the response of the site. Consequently, the computing model adopted by the LHC collaborations is distributed, and it will be implemented via a hierarchy of "Regional Centers." Different Regional Centers may focus on specific areas of functionality and tasks. Some will specialize in simulation, whereas others will take part in specific reconstruction tasks.
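The two ideas in this paragraph, checking replica consistency and using duplicated data to balance load according to site response, can be illustrated with a small sketch. It is not the LHC or EDG software: the `Replica` class, the site names, the use of SHA-256 checksums and the "fastest consistent site wins" rule are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Replica:
    site: str           # hypothetical site name
    response_ms: float  # most recently measured response time of the site
    data: bytes         # stand-in for the stored file contents


def checksum(data: bytes) -> str:
    """Content fingerprint used to compare copies (SHA-256 is our choice)."""
    return hashlib.sha256(data).hexdigest()


def consistent_replicas(replicas, expected):
    """Keep only replicas whose contents match the reference checksum."""
    return [r for r in replicas if checksum(r.data) == expected]


def pick_replica(replicas, expected):
    """Choose the consistent replica at the fastest-responding site."""
    good = consistent_replicas(replicas, expected)
    if not good:
        raise LookupError("no consistent replica available")
    return min(good, key=lambda r: r.response_ms)


master = b"event summary data"
replicas = [
    Replica("site-a", 120.0, master),
    Replica("site-b", 35.0, b"corrupted copy"),  # fast but inconsistent
    Replica("site-c", 60.0, master),
]
chosen = pick_replica(replicas, checksum(master))
print(chosen.site)
```

Here `site-b` responds fastest but fails the consistency check, so the request is directed to `site-c`. A production system would also need to repair or retire the inconsistent copy, which this sketch leaves out.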
Basic ideas and building blocks of the model are common to the four collaborations, and their description can be found in the documents of the MONARC common project.
Data Challenges
The realization of a complex of interrelated data samples stored at different centers and accessible online both locally and remotely constitutes a formidable technical and organizational challenge. Smooth operation of the whole system will be mandatory once real data begin to flow from the experimental apparatus, in order to produce reliable physics results in a timely fashion. To prepare for this, the four LHC collaborations have planned large-scale tests called data challenges.
A data challenge is a major exercise involving several tens of institutes and hundreds of physicists. In such tests, large samples of data are simulated via advanced simulation programs and analyzed as if they were coming from the real experiment. The aim is to validate the computing models, software and data models, and to ensure the correctness of all technical choices.
The four collaborations have gained experience in the distributed production and analysis of data, and some have already performed fairly large-scale data challenges with traditional tools. The plan is to progressively increase complexity, while making as much use as possible of the Grid middleware being developed in the context of several Grid projects, especially the EU DataGrid. Based on the above scenario, the following computing- and data-intensive activities should be made Grid-aware:
- Data transfers among Regional Centers and CERN
- Production of simulated data
- Reconstruction of real/simulated data
- Data analysis.
Over the past months, special data challenges have been carried out in collaboration with the European DataGrid project to demonstrate that Grid technology can be used to implement and effectively operate an integrated service in an internationally distributed environment.
The EU DataGrid (EDG) is one of the chief projects developing such technology in the world. Its goal is to develop, implement and exploit a large-scale data- and compute-oriented Grid, developing the necessary middleware software in collaboration with some of the leading centers of competence in Grid technology in Europe and elsewhere. The EDG software relies upon the widely known Globus and Condor technologies.
The project recently passed its second-year review very successfully. At the beginning of its third year of activity, the project is demonstrating middleware to facilitate secure access to massive amounts of data in a universal global name space, to move and replicate data at high speeds from one geographical site to another, and to manage the synchronization of remote data copies. Novel software is being developed so that strategies for automated wide-area data replication and distribution adapt to dynamic usage patterns. Seamless and efficient integration of distributed storage resources will be enabled by a generic interface to the different mass storage management systems in use at different sites. Several important performance and reliability issues associated with the use of tertiary storage are being addressed as well.
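The idea of replication that adapts to usage patterns can be sketched with a deliberately simple policy. This is not the EDG replication software, whose actual strategies are not described here; the `ReplicationPlanner` class, the threshold rule and the site and dataset names are hypothetical and chosen only to show the principle: a site that keeps reading a dataset remotely eventually receives its own copy.

```python
from collections import Counter


class ReplicationPlanner:
    """Toy policy: copy a dataset to a site after `threshold` remote reads."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.remote_reads = Counter()  # (site, dataset) -> remote read count
        self.holdings = {}             # dataset -> set of sites with a copy

    def register(self, dataset: str, site: str) -> None:
        """Record that `site` now holds a copy of `dataset`."""
        self.holdings.setdefault(dataset, set()).add(site)

    def record_access(self, site: str, dataset: str) -> bool:
        """Record one read; return True if it triggers a new replica."""
        if site in self.holdings.get(dataset, set()):
            return False  # local read, no replication needed
        self.remote_reads[(site, dataset)] += 1
        if self.remote_reads[(site, dataset)] >= self.threshold:
            self.register(dataset, site)
            del self.remote_reads[(site, dataset)]
            return True
        return False


planner = ReplicationPlanner(threshold=3)
planner.register("run-001", "CERN")
triggers = [planner.record_access("site-x", "run-001") for _ in range(4)]
print(triggers)
print(sorted(planner.holdings["run-001"]))
```

In this run the third remote read from `site-x` triggers a replica, and the fourth read is then served locally. A realistic policy would also weigh storage cost, network bandwidth and the eviction of copies that are no longer used.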
Throughout 2002, LHC experiments successfully operated distributed data production in the EDG testbed, which currently consists of more than twenty sites spread around Europe. An ATLAS/EDG Task Force performed a controlled test with EDG software, redoing simulation already performed without it, and a subsequent evaluation report set the priorities for the direction of the EDG developments. In October, CMS asked to perform part of its data challenges with EDG software. A CMS/EDG Task Force was set up, and during the first production some 50,000 simulated events were produced in a weekend for CMS physics studies using facilities in CERN, Italy, CNRS, NIKHEF and the UK.
Conclusions
Impressive progress has been made over the last few years in the development of computing- and data-intensive Grids, as a response to the requirements of several scientific communities. In particular, the viability of Grid technology as a solution to HEP problems has recently been demonstrated at CERN thanks to the collaboration among different development projects, such as LCG and the EU DataGrid.
The results of dedicated "data challenge" tests, performed in collaboration with the scientists involved in the future LHC experiments, have helped to identify a number of improvements to the DataGrid software and to the way the experiment applications are managed.
This fertile collaboration will continue for the remaining lifetime of the EDG project (until the end of 2003), and a new proposal (EGEE) has been submitted to the European Commission within the 6th Framework program to provide a large-scale Grid infrastructure to the scientific community. The work will build on the current major assets of the project, i.e. a set of widely deployed middleware, close links with the applications, and a large Grid development team composed of experienced groups distributed across Europe.
Acknowledgements
This article has been produced with significant contributions from the DataGrid Project Office (DataGrid is funded by the European Union - contract IST-2000-25182).