The ATOLL System Area Network (SAN)

David Slogsnat
Patrick R. Haspel
Holger Froening
Ulrich Bruening
University of Mannheim

Cluster computing has become a promising trend in high-performance computing. Judging on the basis of their system architectures alone, personal computers (PCs) may not be the perfect building blocks for cluster systems. It is rather their low price, and thus their good cost-to-performance ratio, that makes PCs so popular components of a cluster.

The standard interconnection networks of PCs, Fast Ethernet and Gigabit Ethernet, do not deliver the performance required by most parallel applications. That lack of performance is the driving force behind system area networks (SANs). While Quadrics QsNet is the most sophisticated SAN available, it is also expensive. Myrinet is another very successful representative of system area networks.

The ATOLL SAN is the result of an effort to design a network interface controller that integrates all required components of high-performance networking (including switches) into a single chip, yielding a better price/performance ratio than today's system area solutions. In the remainder of this article, we first give an overview of the basic architecture of ATOLL. Then we describe how we tested the ATOLL chip. Following that, we present the performance of ATOLL and compare it to Myrinet. Finally, we give some concluding remarks.

Basic ATOLL Architecture

The ATOLL chip integrates a switch, network ports, and network interfaces on a single chip (see Figure 1). It consists mainly of a self-routing 8x8 crossbar switch: four of its ports are link ports connecting to the interconnection network, and the other four are network ports connecting to the host ports. The four host ports are completely replicated devices that directly support up to four processes.



Figure 1: The ATOLL architecture

The host ports can be mapped into the user's address space to give the user direct access to the communication device (user-level communication). User-level communication does not involve the operating system in each send/receive operation, and thus significantly reduces communication latency.

The standard PCI-X interface is used as the connection to the host system. The PCI-X core connects the four host ports to the PCI-X bus, providing a maximum bandwidth to the host of 800MBytes/s (100MHz x 8 Bytes). Slower bus specifications are also supported, e.g. PCI-X 66 and PCI 66/33 with a data width of 64 bits.

The network port converts the 64-bit data stream to and from the host port into a byte-wide stream for the interconnect system. The link port adds the link-level protocol to the message and controls the transmission from link port to link port through the link cable. The byte stream is automatically partitioned into Link Packets (LIPs) of 64 bytes each, and each LIP is extended by a CRC so that the receiver can verify the transmission. If an error is detected, the LIP is automatically retransmitted between the two endpoints of the link cable.
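The partitioning and retransmission described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say which CRC ATOLL uses or whether the 64 bytes include it, so the sketch uses CRC-32 from `zlib` as a stand-in and treats the 64 bytes as payload. The names `to_lips` and `receive` are hypothetical.

```python
import zlib

LIP_PAYLOAD = 64  # bytes per Link Packet, as stated in the article


def to_lips(message: bytes):
    """Split a message into (payload, crc) pairs of at most 64 payload bytes."""
    lips = []
    for off in range(0, len(message), LIP_PAYLOAD):
        payload = message[off:off + LIP_PAYLOAD]
        # CRC-32 is a stand-in; the real ATOLL CRC is not specified here.
        lips.append((payload, zlib.crc32(payload)))
    return lips


def receive(lips, corrupt_once=frozenset()):
    """Reassemble a message, retransmitting any LIP whose CRC check fails.

    corrupt_once holds the indices of LIPs whose first transmission is damaged.
    """
    out = bytearray()
    retransmissions = 0
    for i, (payload, crc) in enumerate(lips):
        wire = payload
        if i in corrupt_once:
            # Simulate a transmission error by inverting the first byte.
            wire = bytes([payload[0] ^ 0xFF]) + payload[1:]
        if zlib.crc32(wire) != crc:
            retransmissions += 1
            wire = payload  # the retransmitted LIP arrives intact
        out.extend(wire)
    return bytes(out), retransmissions


message = bytes(range(200))
lips = to_lips(message)
data, retries = receive(lips, corrupt_once={1})
print(len(lips), retries, data == message)
```

Because the check and the retransmission happen per LIP, a single damaged packet costs one extra 64-byte transfer on that link rather than a resend of the whole message.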

One link cable houses a sending and a receiving data channel, each 9 bits wide, providing a bidirectional interconnect between two nodes. The bandwidth is directly related to the ATOLL on-chip clock, and thus results in 250MBytes/s for each direction on one bidirectional link, summing up to an aggregate bandwidth of 2GBytes/s for all four links. The typical cluster constructed using four links will be a grid or torus, as depicted in Figure 2.
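The link bandwidth figures follow from simple arithmetic, sketched below. Two values are assumptions inferred from the text rather than stated in it: a core clock of 250 MHz, and one data byte transferred per clock tick (that is, 8 of the 9 channel bits carrying data).

```python
CLOCK_MHZ = 250      # assumed: inferred from 250 MBytes/s at one byte per tick
BYTES_PER_TICK = 1   # assumed: 8 of the 9 channel bits carry data
DIRECTIONS = 2       # each link has a sending and a receiving channel
LINKS = 4            # link ports per ATOLL chip


def link_bandwidth_mbytes(clock_mhz: float) -> float:
    """Peak bandwidth of one link direction in MBytes/s."""
    return clock_mhz * BYTES_PER_TICK


per_direction = link_bandwidth_mbytes(CLOCK_MHZ)
per_link = per_direction * DIRECTIONS
aggregate = per_link * LINKS
print(per_direction, per_link, aggregate)
```

Under these assumptions the aggregate over all four links is 2000 MBytes/s, which is well above the 800 MBytes/s the host bus can deliver; the remaining link capacity is available for traffic that is only passing through the node's switch.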



Figure 2: ATOLL system diagram

The bisection bandwidth of this 16-node configuration is 8 x 500MBytes/s = 4GBytes/s, and the longest path is 4 hops. Each hop adds merely 27 clock ticks, or about 100ns, to the pipelined message transport time. That interconnect structure is scalable to a high number of nodes (16 x 16 = 256 or more) because adding nodes also adds crossbar switches for interconnectivity.
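The hop count and the number of links crossing the bisection can be checked with a short script. The sketch below models the 16-node configuration as a 4 x 4 torus; the 250 MHz clock used for the per-hop delay is an assumption, and the function names are illustrative.

```python
from collections import deque
from itertools import product


def torus_neighbors(node, n):
    """The four neighbors of a node in an n x n torus."""
    x, y = node
    return [((x + 1) % n, y), ((x - 1) % n, y),
            (x, (y + 1) % n), (x, (y - 1) % n)]


def diameter(n):
    """Longest shortest path (in hops) in an n x n torus, found by BFS."""
    start = (0, 0)  # the torus is symmetric, so one start node suffices
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in torus_neighbors(node, n):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return max(dist.values())


def bisection_links(n):
    """Count the links cut when the torus is split into left and right halves."""
    cut = set()
    for node in product(range(n), repeat=2):
        for nxt in torus_neighbors(node, n):
            if (node[0] < n // 2) != (nxt[0] < n // 2):
                cut.add(frozenset((node, nxt)))
    return len(cut)


LINK_MBYTES = 500   # both directions of one link, from the article
CLOCK_MHZ = 250     # assumed core clock
HOP_TICKS = 27      # per-hop delay, from the article

hops = diameter(4)
links = bisection_links(4)
print(hops, links, links * LINK_MBYTES, HOP_TICKS / CLOCK_MHZ * 1000)
```

For the 4 x 4 case this gives 4 hops, 8 bisection links, 4000 MBytes/s of bisection bandwidth, and 108 ns per hop at the assumed clock.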

Testing Development Timeline

Testing began with the delivery of the dies from the MPW run in June 2002. For three selected bare dies, measurements were performed using a wafer prober. These included IDDQ tests for an improved current estimation. Additionally, the diode characteristics were measured for all combinations of VDD3 (3.3 Volt power supply), VDD2 (1.8 Volt power supply) and GND (ground). All three dies passed this first test.

With the data collected from measurements on bare dies, the packaged dies were qualified. Only those packages that passed the diode characteristic/IDDQ test were assembled on PCBs. The loss rate of packaged dies was about 25%.

On the assembled PCBs, only diode characteristic tests were performed. The manufacturer had already done this as part of its electrical test, and the purpose of our in-house test was only to verify those results. A loss rate of 0% confirms the quality of the PCB production and assembly.

The next goal was to test the behavior of the voltage supervisor responsible for the reset signal. Power was applied to each card to be used, and the behavior of the reset controller was observed.

Cards that passed this test were ready for a first use in a PC environment. The test PC was a standard dual Pentium III at 1GHz running SuSE Linux. It was equipped with an extender for the 66MHz/64-bit PCI bus. During the start-up of the system, certain PCI configuration cycles were monitored. The boot was successful, and the /proc/pci file reported an ATOLL card in the system.

The next step was to set up the core clock and to initialize the device. During this step, it turned out that chipsets usually generate 32-bit bursts rather than 64-bit PCI accesses, and it was not possible to generate true 64-bit PCI cycles with programmed I/O (PIO). However, ATOLL's PCI interface accepts only 64-bit PCI accesses or double 32-bit bursts. We found a method using FPU move instructions that produces these bursts on the PCI bus. With the core clock up and running, the ATOLL was set up to perform PIO transfers, which succeeded. The first message on an ATOLL was delivered!

As the next step, endurance tests using the PIO mode were performed. In the meantime, cards were qualified regarding link functionality and operating speed. In addition, the software was extended to support DMA transfers. As soon as the DMA test software had been developed and tested, the endurance tests were expanded to cover both transfer modes. Then the fine tuning of the software began, with the focus on optimizing it for low latency and high bandwidth.

After the fine tuning, the PALMS software was developed based on the test software.

PALMS: The Software on ATOLL

PALMS (Palms Atoll Library and Management Software) is a message-based communication system for the ATOLL interconnect. The objective of this software package is to provide a reliable, high-bandwidth and low-latency communication path between nodes within a cluster.

Figure 3 illustrates the different parts of the PALMS package:



Figure 3: Overview of PALMS

The PALMS Library provides the PALMS Application Programming Interface (API) to the software layers on top of it. The API may be used by user applications directly, but more often it will be used by middleware libraries, such as MPI, PVM, or sockets. All interactions of a user application with PALMS take place via this API. The library implements direct user-space communication with the ATOLL device.

The ATOLL Driver performs basic control operations on the ATOLL network interface. It has control over all ATOLL resources and is responsible for assigning host ports to applications.

The Daemon performs basic cluster management tasks: it constantly explores the cluster topology, and it creates and adapts routing tables. It is generally responsible for all fault-tolerance and deadlock-resolution measures.
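As an illustration of what creating and adapting routing tables can involve, the sketch below derives shortest-path routes from a discovered topology with a breadth-first search and recomputes them after a link failure. It is not the PALMS algorithm, which is not described here, and it does not address deadlock avoidance. The topology format, the node names, and `build_routes` are all hypothetical.

```python
from collections import deque


def build_routes(topology, source):
    """Return {destination: [output port at each hop]} along shortest paths."""
    routes = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for port, neighbor in sorted(topology[node].items()):
            if neighbor not in routes:
                routes[neighbor] = routes[node] + [port]
                queue.append(neighbor)
    return routes


# Hypothetical 4-node ring: each entry maps a link port to the node behind it.
topology = {
    "n0": {0: "n1", 1: "n3"},
    "n1": {0: "n2", 1: "n0"},
    "n2": {0: "n3", 1: "n1"},
    "n3": {0: "n0", 1: "n2"},
}

routes = build_routes(topology, "n0")
print(routes["n1"], routes["n2"])

# Adapt to a failed link between n0 and n1 by removing it and recomputing.
del topology["n0"][0]
del topology["n1"][1]
rerouted = build_routes(topology, "n0")
print(rerouted["n1"])
```

After the simulated failure, traffic from n0 to n1 is routed the long way around the ring, which is the kind of adaptation a topology-exploring daemon has to perform.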

There is also a graphical front-end to the daemon, which provides the cluster administrator with information about the state of the cluster and possible problems. The PALMS software package is still in development. The most important part of it, the PALMS API itself, is basically complete, with one major exception: we are not done with the performance optimizations yet. For example, PIO is implemented only experimentally and is therefore not yet used by the API. We expect the PIO mode to decrease small-message latencies further. That said, we expect the current performance results to be very close to those of the final PALMS release.

Performance Results

We compared the performance of ATOLL at the PALMS 1.2.1 API with the performance of the Myrinet 2 MB PCI-X card, using Myricom's GM 2.0 software (see Resources). As a benchmark, we used NetPIPE 3.7, which was developed by the Ames Laboratory (see Resources). NetPIPE comes with an interface to GM. In cooperation with Dr. Dave Turner from Ames Lab, we added an interface to the PALMS API. NetPIPE uses simple ping-pong messages of different sizes to determine message latencies and bandwidths.
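The ping-pong principle can be sketched as follows. This is not NetPIPE's code and does not use PALMS or GM: a local socket pair stands in for the network, one thread echoes every message back, and the one-way latency is taken as half the average round-trip time. The function names are illustrative, and the bandwidth here is reported in units of 10^6 bits per second.

```python
import socket
import threading
import time


def recv_exact(sock, n):
    """Receive exactly n bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def echo(sock, size, repeats):
    """The 'pong' side: send every received message straight back."""
    for _ in range(repeats):
        sock.sendall(recv_exact(sock, size))


def ping_pong(size, repeats=200):
    """Return (one-way latency in us, bandwidth in Mb/s) for one message size."""
    a, b = socket.socketpair()
    worker = threading.Thread(target=echo, args=(b, size, repeats))
    worker.start()
    payload = bytes(size)
    start = time.perf_counter()
    for _ in range(repeats):
        a.sendall(payload)
        recv_exact(a, size)
    elapsed = time.perf_counter() - start
    worker.join()
    a.close()
    b.close()
    one_way_s = elapsed / repeats / 2  # half of the average round trip
    return one_way_s * 1e6, size * 8 / one_way_s / 1e6


if __name__ == "__main__":
    for size in (8, 64, 1024, 8192):
        latency_us, mbits = ping_pong(size)
        print(f"{size:6d} bytes  {latency_us:9.2f} us  {mbits:10.1f} Mb/s")
```

Sweeping the message size in this way yields one latency value and one bandwidth value per size, which is the shape of the results reported below.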

Our test system consists of two machines with Supermicro P4DL6 mainboards, the ServerWorks GC-LE chipset, 512 MB RAM, and dual P4 Xeon 1.8 GHz processors. The Myrinet NICs are plugged into the standard 133MHz PCI-X bus and are directly interconnected with an optical fiber. The ATOLL NICs are directly connected as well. Due to some problems with the PCI-X IP core, they run at a reduced PCI-X speed of 100 MHz. Since the clock frequency of the ATOLL core can be set by the driver, we present results for two different clock speeds: 243 MHz and 300 MHz.

The following tables show the performance of ATOLL (with clock frequencies of 243 MHz and 300 MHz) and Myrinet for different message sizes:

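ATOLL with 243 MHz

| Message size (bytes) | Latency (us), ATOLL | Latency (us), Myrinet | Bandwidth (Mb/s), ATOLL | Bandwidth (Mb/s), Myrinet | Speedup over Myrinet |
|---|---|---|---|---|---|
| 8 | 4.88 | 7.52 | 9.4 | 8 | +54% |
| 64 | 5.46 | 8.01 | 90 | 60 | +47% |
| 256 | 6.72 | 9.55 | 290 | 204 | +42% |
| 1k | 11.11 | 14.08 | 703 | 554 | +27% |
| 8k | 48.76 | 48.39 | 1281 | 1291 | -1% |
| 256k | 1237.85 | 1072.79 | 1615 | 1864 | -16% |
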
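ATOLL with 300 MHz

| Message size (bytes) | Latency (us), ATOLL | Latency (us), Myrinet | Bandwidth (Mb/s), ATOLL | Bandwidth (Mb/s), Myrinet | Speedup over Myrinet |
|---|---|---|---|---|---|
| 8 | 4.75 | 7.52 | 13 | 8 | +58% |
| 64 | 5.32 | 8.01 | 92 | 60 | +51% |
| 256 | 6.35 | 9.55 | 307 | 204 | +50% |
| 1k | 10.17 | 14.08 | 768 | 554 | +38% |
| 8k | 42.39 | 48.39 | 1474 | 1291 | +14% |
| 256k | 1101.84 | 1072.79 | 1815 | 1864 | -1% |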

We now discuss the results for 243 MHz. As can be seen, ATOLL has an excellent small message latency of 4.88 us, which is much better than Myrinet's 7.52 us. On the other hand, Myrinet has a higher maximum bandwidth, which can be observed when sending large messages. ATOLL outperforms Myrinet for message sizes up to about 8KB. For larger messages, Myrinet is faster than ATOLL.
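The relation between the latency, bandwidth, and speedup figures for 243 MHz can be checked with a few lines of Python. The formulas below are inferred from the published numbers, not taken from the benchmark: the speedup is assumed to be the latency ratio minus one, and one Mb is assumed to be 2^20 bits. They reproduce the reported values for sizes up to 8k to within rounding, but a few entries (for example, the 256k speedups and the 8-byte ATOLL bandwidth at 243 MHz) do not follow them exactly.

```python
# (size in bytes, ATOLL latency in us, Myrinet latency in us), 243 MHz results
ROWS_243 = [
    (8, 4.88, 7.52),
    (64, 5.46, 8.01),
    (256, 6.72, 9.55),
    (1024, 11.11, 14.08),
    (8192, 48.76, 48.39),
]


def speedup_percent(atoll_us, myrinet_us):
    """Assumed definition: how much longer Myrinet takes than ATOLL, in percent."""
    return (myrinet_us / atoll_us - 1.0) * 100.0


def bandwidth_mbits(size_bytes, latency_us):
    """Bandwidth in Mb/s, assuming 1 Mb = 2**20 bits (inferred, not stated)."""
    return size_bytes * 8 / (latency_us * 1e-6) / 2**20


for size, atoll_us, myrinet_us in ROWS_243:
    print(size,
          round(speedup_percent(atoll_us, myrinet_us)),
          round(bandwidth_mbits(size, atoll_us), 1))
```

The crossover near 8k is visible directly in the latencies: there the two networks take almost the same time per message, so the computed speedup is close to zero.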

With ATOLL chips running at 300 MHz, small-message latency is only slightly better than with the 243 MHz version, because small-message latency is dominated by the PCI-X cycles that are required to set up a DMA transfer. But the higher clock frequency pays off for larger messages, since the available bandwidth on the link between two ATOLLs scales directly with the clock frequency. At 300 MHz, the maximum NetPIPE bandwidth is 16% higher than for the 243 MHz ATOLL.



Figure 4: Performance comparisons of ATOLL and Myrinet

As can be seen in the graph of Figure 4, the 300 MHz ATOLL has the same maximum bandwidth as Myrinet, which is reached for 96kB packets. For larger messages, ATOLL's bandwidth drops slightly and falls behind Myrinet's performance. We are currently investigating this performance drop.

Conclusions

The ATOLL SAN architecture shows a significant improvement in the price/performance ratio compared to well-established SANs, such as Myrinet. During the development and implementation of the ATOLL SAN architecture (including link cables, PCB, package and ASIC), performance and price were both kept in mind. As shown in the performance benchmark results, increasing the ATOLL SAN clock frequency directly improves the bandwidth, so pure back-end optimizations would improve the peak bandwidth further.

Unfortunately, the PCI-X frequency of the ATOLL SAN prototype is limited to 100 MHz due to inefficiencies during place and route. That, of course, reduces the overall performance. Although the ATOLL SAN was also intended to be backward compatible with 32-bit PCI systems, we discovered a functional bug that forces ATOLL to work with 64-bit PCI systems only. Both problems can easily be fixed for volume production. Due to our limited resources, medium-scale clusters have not been benchmarked yet.

Having observed the promising capacity of the implemented architectural features, we are committed to further research in this area and are currently developing our next-generation communication device. Our research group is open to collaboration.

Acknowledgements

Thanks go to Synopsys for the donation of the PCI-X core. Also, we would like to thank Dave Turner for his help with the NetPIPE benchmark.



Resources

Computer Architecture Group, University of Mannheim

Ulrich Bruening, Holger Froening, Patrick R. Schulz, and Lars Rzymianowicz, ATOLL: Performance and Cost Optimization of a SAN Interconnect, IASTED Conference: Parallel and Distributed Computing and Systems (PDCS), 2002.

The ATOLL Network

NetPIPE Benchmark

Myricom

Quadrics

PCI-X
