Database of Case Studies Achieved Using CADP

Performance Evaluation of MPI on CC-NUMA Architectures

Organisation: INRIA Grenoble - Rhône-Alpes / VASY project-team and BULL SAS (FRANCE)

Method: LOTOS
Interactive Markov Chains

Tools used: CADP (Construction and Analysis of Distributed Processes)

Domain: Distributed Systems.

Period: 2008-2009

Size: 797 lines of LOTOS, 1044 lines of C code
450 lines of regular alternation-free mu-calculus
1,452,856 states

Description: In the context of the MULTIVAL project, VASY studied together with BULL the MPI software layer and MPI benchmark applications to be run on FAME2 (Flexible Architecture for Multiple Environments), a CC-NUMA (Cache Coherent - Non Uniform Memory Access) multiprocessor architecture developed at BULL for teraflop mainframes and petaflop computing.

A particular MPI benchmark (the so-called ping-pong protocol) was considered, with the objective to predict the performance of this benchmark on FAME2 machines, in particular to estimate the latency of send/receive operations in different topologies, different software implementations of the MPI primitives, and different cache coherency protocols. The ping-pong benchmark consists of two parallel processes, which send to each other a data packet k times. Two variants of the benchmark were specified using a combination of LOTOS code (to describe the behavior of processes) and C code (to describe the data structures of memory and caches). Several configurations were studied, for two data packets and k varying between 1 and 10 (which corresponds to a normal termination of the benchmark session) or being infinite (which corresponds to cyclic execution).

For each configuration, several safety and liveness temporal properties were specified in regular alternation-free mu-calculus, and successfully verified on the corresponding graph using the EVALUATOR model checker of CADP. Additionally, it was successfully checked, using the BISIMULATOR equivalence checker of CADP, that the two variants were branching equivalent to each other (when observing only send/receive actions), and that each configuration with k iterations was smaller (modulo the branching preorder) than the configurations with k+1 iterations and k infinite. Finally, the LOTOS specification was extended with Markov delays and a performance analysis was carried out using the BCG_MIN, DETERMINATOR, and BCG_STEADY tools of CADP in order to predict the latency of the send/receive operations and of the various types of transfers between memory and (local or remote) caches. The results obtained by numerical analysis were close enough to the experimental measures to assess the suitability of the model.

In a second step, further configurations of the ping-pong benchmark were studied, by specifying two different implementations of the send/receive primitives (based on linked lists with locks and based on lock-free buffers, respectively) and two different cache coherency protocols (in which a variable written by a process becomes either owned by that process, or shared between that process and the previous owner, respectively). To reflect the real system behaviour more accurately, the model was improved by decomposing the read/write accesses in two request/response phases. The performance analysis also allowed to estimate the number of cache misses corresponding to each instruction, which indicates that the combination of the second send/receive protocol (based on lock-free buffers) and the first cache coherency protocol (in which a variable written by a process becomes owned by that process) provides the best performance among the configurations considered.

Conclusions: A method was proposed to analyze the impact of cache coherence protocol on the MPI library performance in a CC-NUMA architecture. The goal of this method is to help software teams optimize the MPI implementation during the hardware design phase and to analyze measurements afterwards. It is based on formal modeling that integrates functional and performance behaviors, using the IMC extension of LOTOS supported by the CADP toolbox. The model's functional behavior is formally verified, which cannot be done in classical simulation approaches.

The software aspect (benchmark and MPI library primitives), the hardware aspect (cache protocol and topological mapping), and the performance aspect (delay transitions) are described in separate processes. Their integration is done by composition, but each aspect can be refined independently. An abstraction of the hardware cache protocol allows to easily capture the fact that an access latency in one process depends on all cache states in the system. This is allowed by the compositional method of building the model and by the distinction of immediate transitions from delay transitions in IMC theory. The method enables to compute overall latency and above all cache miss count per variable, in different software and hardware configurations. These detailed figures allow to understand the impact of each access in the overall latency. The model is relatively small and its formal verification and performance evaluation take only a few minutes. Thus, even such an abstract model of a complex large-scale architecture has the potential to compare and analyze benchmark behavior.

Publications: [Chehaibar-Zidouni-Mateescu-09] Ghassan Chehaibar, Meriem Zidouni, and Radu Mateescu. "Modeling Multiprocessor Cache Protocol Impact on MPI Performance". Proceedings of the 2009 IEEE International Workshop on Quantitative Evaluation of Large-Scale Systems and Technologies QuEST'09, IEEE Computer Society Press, May 2009.
Available on-line from
Ghassan Chehaibar
Senior Engineer
Platforms Hardware R&D - Architecture & Verification
Bull SAS / Architect of an Open World (TM)
rue Jean Jaurès, BP 68
F-78340 Les Clayes-sous-Bois
Tel: +33 1 30 80 78 25

Further remarks: This case study, amongst others, is described on the CADP Web site:

Last modified: Tue Sep 8 18:14:48 2015.

Back to the CADP case studies page