Frank Sommers
Autospaces
One's bookshelf is like the Earth's crust: New books tend to settle in the place of old ones, pushing less recent books increasingly further away from the surface - much the same way erosion deposits layers upon layers of sediments. On occasion, though, some books just keep coming back to the fore of the pile: They seem to possess the strength and resilience of igneous rocks that form the backbones of mountains capable of shouldering the weight of time.
High Performance Mass Storage and Parallel I/O: Technologies and Applications, edited by Hai Jin, Toni Cortes, and Rajkumar Buyya, and published jointly by the IEEE Press and John Wiley and Sons in 2002, will likely belong to that "igneous" book category. As the "Red Book" (Readings in Database Systems, 3rd Edition, edited by Michael Stonebraker and Joseph Hellerstein, Morgan Kaufmann, 1998) claims a rightful place on the bookshelf of every learned data management professional, this book might be referred to as the "Green Book" of storage and I/O, and it, too, will occupy a prominent reef in a computer professional's library.
The book collects 45 influential research papers on parallel I/O and mass storage from the last 14 years. The editors selected 41 papers from existing conference proceedings and scholarly publications, and commissioned 4 additional papers especially for this book. The authors of several papers updated their work for inclusion in this volume. The result is 670 pages of top-notch systems research, sandwiched between a foreword by David Patterson, an introductory essay by the editors, and an index.
Different readers will find the book useful for different reasons, and even those reasons might change over time. On one level, it is possible to use the volume as a course book for a graduate, and even undergraduate, course on storage and I/O: The key concepts a semester-long course would cover are illustrated from multiple angles in the articles. For a course use, the book could especially be useful as a supplement to another text book, such as John May's recently published Parallel I/O for High Performance Computing (Morgan Kaufmann Publishers, 2000). Equally, designers and builders of high-performance computing systems will find the book's chapters to be helpful refreshers of past university courses and conference presentations, and will learn much additional insight as well.
The editors organized the content into nine conceptual categories, about equally divided between storage and I/O. The categories, and the articles within each category, are organized, not chronologically, but so that a development of ideas emerges. That makes reading the book cover to cover an interesting and rewarding excercise.

High Performance Mass Storage and Parallel I/O:
Technologies and Applications. Hai Jin, Toni Cortes, and Rajkumar
Buyya, editors. IEEE Press and John Wiley and Sons, 2002.
RAID and variations
The book commences with a section titled "Introduction to Redundant Disk Array Architectures." The chronologically oldest paper in the book, "A case for redundant arrays of inexpensive disks (RAID)" (David Patterson, Garth Gibson, and Randy Katz, 1988) contains the classic definition of RAID. It was given the 1998 ACM SIGMOD most influential paper award from the SIGMOD conference proceedings 10 years earlier. As Richard Wagner's Overture to the Meistersinger seeds that opera's key thematic material, this paper both organizes previous work and presages much future mass storage systems research. Another piece by the UC Berkeley RAID research teams follows, "Disk system architectures for high performance computing,"(Randy Katz, Garth Gibson, and David Patterson, 1990) discussing the structure of magnetic disks and including further observations on RAID.
The section's two remaining chapters elaborate on the use of RAID: "The performance of parity placements in disk arrays" (Edward K. Lee and Randy Katz, 1993) evaluates different strategies for placing parity data on RAID disks, and discusses the performance implications of those strategies, focusing mainly on relatively large request sizes (several 100 kB) for parity data. Jai Menon's (IBM Fellow and head of storage systems research at IBM's Almaden lab) article "A performance comparison of RAID-5 and log-structured arrays" (1995), focuses on RAID performance on transactional workloads (many small reads and writes), and compares that to the
performance of LSA that stores data in compressed form and writes updates to new disk locations, instead of in place of old data.
The book's next section is devoted to "Advanced Disk Array Architectures." The first article, "Parity Logging: Overcoming the small write problem in redundant disk arrays" (Daniel Stodolsky, Garth Gibson, and Mark Holland, 1990) addresses the small write problem for RAID and proposes the use of journalling techniques to overcome that bottleneck. A key benefit of reading this article is further insight into the performance of RAID, especially RAID 5 in the presence of transaction-type workloads. The following article describes an extension of RAID 5 to distributed systems: "Distributed RAID - A new multiple copy algorithm" (Michael Stonebraker and Gerhard Schloss, 1990). A "redundant array of distributed disks" (RADD) helps keep data alive by providing multiple redundant copies of data items on a network, and is proposed as an alternative high-availability solution to hot standbys or plain old multicopy redundancy.
The book's next article, "The HP AutoRAID Hierarchical Storage System" (John Wilkes, Richard Golding, Carl Staelin, and Tim Sullivan, 1996), demonstrates the importance of automating the configuration of complex disk array systems. The article presents a good example of how RAID can form part of a more inclusive storage framework. As the previous articles show, a system's workload characteristics greatly impact RAID performance (transaction-type workload with many small writes vs. "supercomputer I/O" with large sequential accesses). Not only human error in configuring RAID, but possibly changing workload features over a system's lifetime can lead to sub-optimal I/O. This article outlines a storage hierarchy where a higher level keeps two data copies for full redundancy and high performance, while the lower level features a RAID 5 system. The system automates data migration between the two levels.
Revisiting the topic of Jai Menon's earlier article in the volume is "Scalable distributed log structured arrays" (Witold Litwin and Jai Menon, 2000). Whereas the earlier article contains a comparison of LSA with RAID 5, this article is the book's first departure from discussing RAID, and instead focuses on extending LSA to multiple storage nodes, effectively making LSA suitable for storage area networks. One benefit of distributing an LSA structure is that parts of a distributed file can then be scanned parallel, and a file can also exceed the size limit imposed on single-node resident LSA files by several orders of magnitude. Distributed LSA (or LSA*) also provides high availability for LSA data.
High availability underpins the next article as well. In RAID systems, high data availability is attained often by having online spare disks so that data on failed disk drives can be quickly reconstructed on the spares. Menon's next article in the book, "Comparison of sparing alternatives for disk arrays" (Jai Menon, 1992) returns to the subject of RAID 5, and examines how different strategies for managing spare disk drives ("sparing") impacts performance.
A common method of reducing the overhead incurred during parity updates in RAID is to introduce a nonvolatile write cache to minimize write latency: Once a write is committed to the cache, it is deemed complete. The system can copy the contents of that cache (whether parity information or data) to disk in the background. The process of copying data from cache to disk is termed destaging. In "Destage algorithms for disk arrays with non-volatile caches" (1998), Anujan Varma and Quinn Jacobson discuss the performance of alternative algorithms for that operation under different workload characteristics.
Keeping data alive
Having presented a series extensions to RAID and analyses of RAID's performance, the book's next section, "Fault-Tolerance Issues in Disk Arrays," collects papers on RAID's ability to ensure the high availability of data. An overview of failure conditions, and the requirements for correcting failures is presented in "Failure correction techniques for large disk arrays" (Garth A. Gibson, Lisa Hellerstein, Richard M. Karp, Randy H. Katz, and David A. Patterson, 1989). The article outlines redundancy strategies to ensure data availability under a variety of failure conditions. It also reviews failure terminology, and outlines redundancy metrics for RAID.
With the increase in the size of a disk array, the probability of multiple concurrent disk failures increases as well. "Tolerating multiple failures in RAID architectures with optimal storage and uniform declustering" (Guillermo A. Alvarez, Walter A. Burkhard, and Falviu Cristian, 1997), presents a method that masks an arbitrary number of failures. This paper extends the information dispersal algorithm (IDA), developed by Michael Rabin ("Efficient dispersal of information for security, load balancing, and fault tolerance" Journal of the ACM, 36(2), 1989) to data declustering in RAID. IDA is a type of error-correcting algorithm that improves data availability by dispersing the contents of a file into chunks, each stored at a different location, such that a subset of those pieces allows the file's reconstruction. For example, a file's length may be 10,000 bytes, and that file may then be divided into 14 pieces, each piece having 1,000 bytes. With the additional codes in each piece, any 10 chunks suffice in reconstructing the entire file. DATUM, presented in this paper, applies a similar scheme to handling multiple failures in RAID.
The next paper deals with "Parity declustering for continuous operation in redundant disk arrays" (Mark Holland and Garth A. Gibson, 1992). Parity declustering aims to reduce the load on surviving disks during the reconstruction of a failed disk's contents, yielding higher overall system throughput and making it valuable in servers requiring continuous operation (e.g., multimedia servers). The paper places the proposed strategy in context, and contrasts the notions of "clustering" vs. "declustering," (the latter meaning a scheme where redundant information is distributed over more disks than would be required as a minimum). The authors also review other declustering strategies, such as mirrored declustering, interleaved declustering, and chained declustering (for information about the first two, see "A comparison of high-availability media recovery techniques", SIGMOD Record, 18:2, 1989; for an expose on chained declustering, see "Chained Declustering: A new availability strategy for multiprocessor database machines," Hui-I Hsiao and David DeWitt, Proceedings of the 6th International Data Engineering Conference, 1990). Finally, they present a data layout strategy for parity declustering.
"The EVENODD code and its generalization, " (Mario Blaum, Jim Brady, Joshua Bruck, Jai Menon, and Alexander Vardy, 2000) is an extended version of the important earlier paper on EVENODD that appeared as "EVENODD: An efficient scheme for tolerating double disk failures in RAID architectures" (IEEE Transactions on Computers, 1995). EVENODD requires only parity hardware present in typical RAID 5 controllers, and performs only exclusive OR computations. The paper presents an encoding and decoding mechanism, and extends previous work to multiple parities, tolerating more than 2 disk failures.
Performance matters
The next five articles are about performance, gathered under the heading "Caching and Prefetching." Many high performance I/O systems employ nonvolatile caches to improve write performance. However, placing a caching component in front of a highly redundant disk array introduces a single point of failure for the system. In order to improve MTTF in those situations, some I/O systems use primary and backup caches. But nonvolatile cache components are also many times more expensive than magnetic disks. In order to improve the reliability of a disk array system, and to reduce overall system cost, "RAPID-Cache - A reliable and inexpensive write cache for disk I/O systems" (Yiming Hu, Qing Yang, and Tycho Nightingale, 1999) proposes a two-level hierarchy whereby a small nonvolatile RAM (NVRAM) sits on top of a log disk. The backup cache has similar write performance than the NVRAM, but it's read performance is poorer. However, that lower read performance does not hurt overall system performance, since the backup cache is used only when the NVRAM fails. The article presents simulation results, as well as a cost-reliability analysis of RAPID-Cache.
Traditionally, I/O systems are passive - they merely respond to an application's read and write requests. "Informed prefetching and caching," (R. Hugo Patterson, Garth A. Gibson, Eka Ginting, Daniel Stodolsky, and Jim Zelenka, 1999) explores a scheme where an application works in tandem with an I/O system, giving hints to the latter about the application's I/O characteristics. Armed with that information, the I/O system can assume a more active role. For instance, it can reconfigure itself in order to deliver better performance. The article focuses on both prefetching and caching in the presence of hints provided by an application, and the authors observe performance improvements between 20%-42%, depending on the type of application. They conclude the article with six performance evaluations performed on I/O-intensive applications, such as the scientific visualization tool XDataSlice, the speech recognition system Sphynx, and the Postgres RDBMS.
The notion of using an application's I/O history in predicting I/O workload, and thereby improving prefetching, is carried on in "Practical prefetching techniques for multiprocessor file systems," (David Kotz and Carla Schlatter Ellis, 1991). They describe the sorts of patterns to predict I/O workload and enumerate ways to implement pattern predictors. The paper concludes with an evaluation of the effectiveness of taking those patterns into account in a practical system.
This section's two concluding articles focus on collective parallel I/O on distributed memory machines. The first article, "Design issues of a cooperative cache with no coherence problems," (Toni Cortes, Sergi Girona, and Jesus Labarta, 1997) explores shared memory in the context of a parallel and distributed file system. In that scenario, multiple nodes cooperate to build a global file system cache, each node using a portion of its own memory. That cooperation leads to better file system performance, since both the cache's size and hit ratio improve. The article describes solutions to ensuring a cache's coherence in the presence of independent nodes. It also discusses load balancing and failure tolerance in support of a global cache. The file system used in this article's context is the Parallel and Distributed File System (PAFS). The authors conclude with a comparison of PAFS vis-a-vis the cooperative caching system of xFS (the file system used in the NOW project, see below), another file system using effective cooperating caching algorithms.
In "Collective buffering: Improving parallel I/O performance," (Bill Nitzberg and Virginia Lo, 1997), the authors present four algorithms they designed in support of building a collective buffer in distributed shared memory machines. They state the key challenge of parallel I/O as being the task of a mapping memory resident data to a disk layout where a file's content spreads over multiple nodes, and conclude that a key requirement in improving parallel I/O is the "coalescing of small, uncoordinated data access into larger, coordinated accesses." The algorithms they evaluate facilitate that aim by providing collective buffering. Because of their high effectiveness, these algorithms are especially useful for the emerging MPI-2 I/O standard.
File systems for parallel I/O
Of special interest to readers in the cluster computing community are the book's five papers devoted to "Parallel File Systems." One overriding theme in this section is that often these file systems need tuning in order to meet the specific needs of an application's workload. How much of that tuning capability they expose impacts their practical benefit, but also increases the complexity of their use.
The section opens with a survey of the "The Vesta parallel file system," (Peter F. Corbett and Dror G. Feitelson, 1996). IBM's AIX Parallel I/O File System, offered on the SP2, is based on Vesta. While many previous parallel file systems hide how a file is distributed (striped) across nodes, Vesta provides user interface-level access to a file's parallel structure. Vesta also aims to facilitate file access via different types of decomposition of a file's data (i.e., data might be accessed by row, and also by column). A Vesta file has a 2-dimensional structure, called a cell. A cell is defined as "a container where data can be deposited, or alternatively, as virtual I/O nodes that are then mapped to available physical I/O nodes." Thus, one file dimension is the cell dimension, and the other is the data within the cell. As a file is written, a parameter specifies how many cells it can occupy. Given that the number of cells determines the amount of parallelism a file uses, that features is useful especially in the presence of small files, since distributing a small file over many nodes results in potentially poor access to that file. The article describes many other Vesta features, and offers an evaluation of the file system's performance using a parallel sorting application as an example.
"The Zebra striped network file system," (John H. Hartman and John K. Ousterhout, 1995), describes an extension to log structured file systems. In such a system, "each client forms its new data for all files into a sequential log that it stripes across the storage servers." That batching of writes from each node results in much improved performance, especially in the presence of many small writes. Zebra also takes advantage of RAID-style redundancy techniques: It can recover from a single server failure without loss of information (but not from multiple concurrent server failures). File block creations, updates, and deletes are communicated between nodes in the form of deltas. The paper gives an overview of log-structured file systems, and then outlines Zebra's components (including a "stripe cleaner" to reclaim unused space). It concludes with a performance evaluation focusing on small-file writes.
Many parallel file systems hide the striping and other data distribution strategies from client applications in order to simplify the interface application programmers must be aware of. However, that facade also makes it difficult to experiment with, and explore, distributed file system characteristics in general. "PPFS: A High-Performance Portable Parallel File System" (James V. Huber, Jr., Christopher L. Elford, Daniel A. Reed, Andrew A. Chien, and David S. Blumenthal), which is an updated version of a 1995 original paper, defines a parallel file system that lends itself to experimentation. It is a rich user-level library implemented on top of an operating system's native file system, effectively interposing itself between that OS and an application program. The article outlines PPFS's design philosophy, and provides two benchmark application results, one from a read-intensive gene sequencing application, and the other based on a write-intensive electron scattering code from plasma physics.
"The Global File system" (Steven R. Soltis, Thomas M. Ruwart, Grant M. Erickson, Kenneth W. Preslan, and Matthew T. O'Keefe, 1996) is a cluster-based distributed file system. Instead of focusing on serving a large number of clients, GFS aims to deliver high performance to a relatively small number of systems. Its target applications are those requiring high bandwidth and large amounts of storage place, such as multimedia information systems. GFS takes advantage of device locks to ensure data consistency, and outlines a file access pattern that is somewhat similar to how processes access shared memory in an SMP. In GFS, clients obtain a lock when they read or write data on shared network storage, and those locks correspond to device-level locks on the storage system; in the prototype implementation, those locks are implemented with the SCSI Dlock command. Because that design allows GFS to atomically modify data, GFS clients can remain unaware of each other's presence. To provide a unified logical storage place, GFS collects storage devices into a network storage pool; subpools divide network storage pools based on device type. GFS enables clients to export file systems to machines not directly connected to a storage pool: a GFS client can act as an NFS server, and can even serve HTTP clients, enabling access to the storage pool via the Web.
"Serverless network file systems" (1996, Thomas E. Anderson, Michael D. Dahlin, Jeanna M. Neefe Matthews, David A. Patterson, Drew S. Roselli, and Randolph Y. Wang) describes xFS, the file system for the network of workstations (NOW) system. NOW defines a cooperation mechanism for independent workstations. For storage, that implies that there is no central server, and that file system services are provided collectively by the workstations: Any workstation on the network can store, cache, and access any block of data. Since any machine on the network can fail as well, xFS also provides for the redundancy of data. xFS incorporates previous research and extends it to enable a serverless mode of operation. The paper discusses RAID, log-structured storage, Zebra, different multiprocessor cache consistency mechanisms, and then focuses on how that research is adapted to xFS's unique needs. For instance, xFS distributes file system management services among the workstations: It attempts to assign a file used by a client to a manager located on that client machine. The paper describes a simulation study showing that co-locating a file's management with that file's client improves locality, and thereby significantly reduces network-bound client requests. Data distribution in xFS is achieved by a software implementation of RAID, utilizing a Zebra-like log-based striping. In
addition, xFS employs a cooperative caching mechanism that uses portions of a client's memory to build a shared cache.
The book's remaining chapters focus on different paradigms of parallel I/O, and present various I/O programming models and applications. I will survey those chapters in the concluding part of this book review.
Appendix: Table of contents
| I. Introduction to Disk Array Architectures | |
|---|---|
|
A case for redundant arrays of inexpensive disks (RAID). |
David Patterson, Garth Gibson, and Randy Katz |
Disk System Architectures for High Performance Computing |
Randy Katz, Garth Gibson, and David Patterson |
The performance of parity placements in disk arrays |
Edward K. Lee and Randy Katz |
A performance comparison of RAID-5 and log-structured arrays. |
Jai Menon |
| II. Advanced Disk Array Architectures | |
Parity Logging: Overcoming the small write problem in redundant disk arrays. |
Daniel Stodolsky, Garth Gibson, and Mark Holland |
Distributed RAID - A new multiple copy algorithm. |
Michael Stonebraker and Gerhard Schloss |
The HP AutoRAID Hierarchical Storage System |
John Wilkes, Richard Golding, Carl Staelin, and Tim Sullivan |
Scalable distributed log structured arrays |
Witold Litwin and Jai Menon |
Comparison of sparing alternatives for disk arrays |
Jai Mennon |
Destage algorithms for disk arrays with non-volatile caches |
Anujan Varma and Quinn Jacobson |
| III. Fault-Tolerance Issues in Disk Arrays | |
Failure Correction Techniques for Large Disk Arrays |
Garth A. Gibson, Lisa Hellerstein, Richard M. Karp, Randy H. Katz, and David A. Patterson |
Tolerating multiple failures in RAID architectures with optimal storage and uniform declustering |
Guillermo A. Alvarez, Walter A. Burkhard, and Falviu Cristian |
Parity declustering for continuous operation in redundant disk arrays |
Mark Holland and Garth A. Gibson |
The EVENODD code and its generalization |
Mario Blaum, Jim Brady, and Joshua Bruck, Jai Menon, and Alexander Vardy |
| IV. Caching and prefetching | |
RAPID-Cache - A reliable and inexpensive write cache for disk I/O systems. |
Yiming Hu, Qing Yang, and Tycho Nightingale |
Informed prefetching and caching |
R. Hugo Patterson, Garth A. Gibson, Eka Ginting, Daniel Stodolsky, and Jim Zelenka |
Practical prefetching techniques for multiprocessor file systems |
David Kotz and Carla Schlatter Ellis |
Design issues of a cooperative cache with no coherence problems |
Toni Cortes, Sergi Girona, and Jesus Labarta |
Collective buffering: Improving parallel I/O performance |
Bill Nitzberg and Virginia Lo |
| V. Parallel file systems | |
The Vesta parallel file system. |
Peter F. Corbett and Dror G. Feitelson |
The Zebra striped network file system |
John H. Hartman and John K. Ousterhout |
PPFS: A High-Performance Portable File System |
James V. Huber, Jr., Christopher L. Elford, Daniel A. Reed, Andrew A. Chien, and David S. Blumenthal |
The Global File system |
Steven R. Soltis, Thomas M. Ruwart, Grant M. Erickson, Kenneth W. Preslan, and Matthew T. O'Keefe |
Serverless Network File systems |
Thomas E. Anderson, Michael D. Dahlin, Jeanna M. Neefe Matthews, David A. Patterson, Drew S. Roselli, and Randolph Y. Wang |
| VI. Parallel I/O Systems | |
| Parallel I/O Subsystems in Massively Parallel Supercomputers | Dror G. Feitelson, Peter F. Corbett, Sandra Johnson Baylor, and yarsun Hsu |
RAID-II: A high-bandwidth network file server |
Ann. L Chervenak, Ken Shirriff, John H. Hartman, Ethan L. Miller, Srinivasan Seshan, Randy H. Katz, Ken Lutz, David A. Patterson, Edward K. Lee, Peter M. Chen, and Garth A. Gibson |
Petal: Distributed Virtual Disks. |
Edward K. Lee and Chandramohan A. Thekkath |
|
A cost-effective, high-bandwidth storage architecture. |
Garth A. Gibson, David F. Nagle, Khalil Amiri, Jeff Butler, Fay W. Chang, Howard Gobioff, Charles Hardin, Erik Riedel, David Rochberg, and Jim Zelenka |
RAID-x: A new distributed disk array for I/O-centric cluster computing |
Kai Hwang, Hai Jin, and Roy S. C. Ho |
Designing a self-maintaining storage system |
Satoshi Asami, Nisha Talagala, and David A. Patterson |
|
Modeling and evaluation of fibre channel storage area networks |
Xavier Molero, Federico Silla, Vincente Santonja, and Jose Duato |
| VII. Parallel I/O Programming Paradigms | |
Overview of the MPI-IO parallel I/O interface |
Peter Corbett, Dror Feitelson, Sam Fineberg, Yarsun Hsu, Bill Nitzberg, Jean-Pierre Prost, Marc Snir, Bernard Traversat, and Parkson Wong |
Disk resident arrays: An array-oriented I/O library for out-of-core computations |
Ian Foster and Jarek Nieplocha |
Active disks: Programming model, algorithms and evaluation |
Anurag Acharya, Mustafa Uysal, and Joel Saltz |
Disk-directed I/O for MIMD multiprocessors |
David Kotz |
| VIII. Parallel I/O Applications and Environments | |
Applications-driven parallel I/O: |
Nicholas P. Galbreath, William D. Gropp, and David M. Livine |
Comparing Multimedia Storage Architectures |
Benoit A. Gennart and Roger D. Hersch |
High availability in clustered multimedia servers |
Renu Tewari, Daniel M. Dias, Rajat Mukherjee, and Harrick M. Vin |
|
An architecture for a scalable high-performance digital library |
R. Grossman, X. Qin, W. Xu, H. Hulen, and T. Tyler |
|
I/O requirements for scientific applications: An evolutionary view |
Evgenia Smirni, Ruth A Aydt, Adrew A. Chien, Daniel Reed |
Mitra: A scalable continuous media server |
Shahram Ghandeharizadeh, Roger Zimmermann, Weifeng Shi, Reza Rejaie, Douglas J. Ierardi, and Ta-Wei Li |
| IX. Emerging Technologies and future trends | |
An introduction to the InfiniBand Architecture |
Gregory F. Pfister |
XML, Hyper-media, and Fortran I/O | Dror G. Feitelson and Tomer Klainer |
I/O Programming paradigms: Past and future |
Mahmut Taylan Kandemir and Alok Choudhary |
Scientific applications using parallel I/O |
Ron Oldfield and David Kotz |
Resource:
Michael Stonebraker and Josheph Hellerstein, Readings in Database Systems, 3rd Edition, Morgan Kaufmann Publishers, 1998
John M. May, Parallel I/O for High Performance Computing, Morgan Kaufmann Publishers, 2000
A comparison of high-availability media recovery techniques, SIGMOD Record, 1998
Chained declustering: A new availability strategy for multiprocessor database machines, Proceedings of the 6th International Data Engineering Conference, 1990
An interesting Web page on EVENODD demonstrates that algorithm in the context of reconstructing image data
The Error Correcting Codes Page
- Login to post comments
t
g