# Impact of memory contention on communications [![SWH](https://archive.softwareheritage.org/badge/origin/https://gitlab.inria.fr/pswartva/memory-contention/)](https://archive.softwareheritage.org/browse/origin/?origin_url=https://gitlab.inria.fr/pswartva/memory-contention) [![SWH](https://archive.softwareheritage.org/badge/swh:1:dir:579f4e1d0e641536102f536117153d01fe1fca6c/)](https://archive.softwareheritage.org/swh:1:dir:579f4e1d0e641536102f536117153d01fe1fca6c;origin=https://gitlab.inria.fr/pswartva/memory-contention;visit=swh:1:snp:a7633c0f3cdcdec7078afbbc104eb3670f1ced6a;anchor=swh:1:rev:e2d788f5718386c818f0aa07e826fd9e8c6b4870)

Benchmark suite to evaluate the impact of memory contention on communications (and vice versa).

## Requirements

- Communications are made with the MPI API, so you need an MPI library. If you use MadMPI (from [NewMadeleine](http://pm2.gforge.inria.fr/newmadeleine/)), build NewMadeleine with the profile `pukabi+madmpi-mini.conf`.
- If you want to measure frequencies, [LIKWID](https://hpc.fau.de/research/tools/likwid/) can be used, but there is also a version of our code which does not need it.
- If you want to measure the impact of a task-based runtime system, [StarPU](https://starpu.gitlabpages.inria.fr/) is required.
- Some computing benchmarks use the Intel MKL library.
- [hwloc](https://www.open-mpi.org/projects/hwloc/) is used to bind threads.

## Available programs

- `bench_openmp` measures interference between communications and computations when OpenMP is used to parallelize computations.
- `bench_openmp_likwid`: same as `bench_openmp`, but also measures frequencies with LIKWID.
- `bench_openmp_freq`: same as `bench_openmp`, but also measures frequencies by reading the content of `/proc` files.
- `bench_starpu`: same as `bench_openmp`, but uses StarPU to parallelize computations.
- `uncore_get` and `uncore_set` use LIKWID to respectively get and set the uncore frequencies of the sockets.

Build each program with `make`:

```bash
make
```

## Benchmarking

You can then choose the computing benchmark (`--compute_bench={stream,prime,cursor,scalar,scalar_avx}`) and the communication benchmark (`--bench={bandwidth,latency}`). All available options are listed in the help of the program (`--help`).

Examples of executions:

```bash
# Number of cores and NUMA nodes of the machine:
ncores=$(hwloc-calc all -N core)
nbnuma=$(hwloc-calc all -N node)
last_numa_node=$(($nbnuma-1))

# Latency benchmark, ping-pongs handled by the last thread:
for i in $(seq 1 $((ncores-1))); do
    mpirun -DOMP_NUM_THREADS=$i -DOMP_PROC_BIND=true -DOMP_PLACES=cores hwloc-bind --cpubind core:0-$((i-1)) ./bench_openmp --compute_bench=prime --ping_thread=last &> latency_thread_last_$((i))_threads.out
done

# Same, but with the bandwidth benchmark:
for i in $(seq 1 $((ncores-1))); do
    mpirun -DOMP_NUM_THREADS=$i -DOMP_PROC_BIND=true -DOMP_PLACES=cores hwloc-bind --cpubind core:0-$((i-1)) ./bench_openmp --compute_bench=prime --ping_thread=last --bench=bandwidth &> bandwidth_thread_last_$((i))_threads.out
done

# StarPU version with the STREAM benchmark (ADD and SCALE kernels disabled),
# ping-pongs handled by the first thread:
for i in $(seq 2 $((ncores-2))); do
    mpirun -DSTARPU_NCPU=$i -DSTARPU_MAIN_THREAD_BIND=1 -DSTARPU_WORKERS_CPUID=1- -DSTARPU_MAIN_THREAD_CPUID=$((ncores-2)) -DSTARPU_MPI_THREAD_CPUID=$((ncores-1)) ./bench_starpu --compute_bench=stream --no_stream_add --no_stream_scale --ping_thread=first &> latency_thread_first_main_last_mpi_last_$((i-1))_threads.out
done
```

Environment variables and `hwloc-bind` are used to correctly bind threads to cores. See `bench_suite.example.sh` for an example of how all combinations of parameters can be launched.
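Before launching long series of runs, it can be worth checking that the requested binding is actually applied. The following is only a minimal sanity-check sketch (not part of the suite), relying on the standard hwloc tools and on the standard `OMP_DISPLAY_AFFINITY` variable (OpenMP 5.0); the `core:0-3` range and the four threads are arbitrary examples:

```bash
# Print the cpuset that a given hwloc-bind invocation actually applies:
hwloc-bind --cpubind core:0-3 hwloc-bind --get

# Ask the OpenMP runtime to report where each thread ends up (OpenMP >= 5.0):
mpirun -DOMP_NUM_THREADS=4 -DOMP_PROC_BIND=true -DOMP_PLACES=cores -DOMP_DISPLAY_AFFINITY=TRUE hwloc-bind --cpubind core:0-3 ./bench_openmp --compute_bench=prime --ping_thread=last

# While a benchmark is running, list the binding of bound processes and their threads:
hwloc-ps -t
```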
## Plotting

Scripts are in the `plot` folder and require Python with Matplotlib. `plot_comm_comp.py` is the main script. It plots computing benchmark results and communication performance on the same graph, mainly as a function of the number of cores. For instance:

```bash
python3 plot_comm_comp.py bandwidth_thread_last_* --per-core --top=10000 --stream-top=15000 --o=bandwidth_thread_last.png --network-bandwidth --title="Network Bandwidth and STREAM Benchmark"
```

The Python module `comm_comp` provides classes to parse the outputs of the benchmarking programs and to generate plots. For instance:

```python
import glob

from comm_comp import *

# Parse all the output files of the benchmarking programs:
parser = FilesParser(glob.glob("copy_*_threads.out"))

# Computation results (COPY kernel), alone and while communications are running:
results_copy_alone = parser.flatten_results['comp']['alone']['copy']['time']['avg']
results_copy_with_comms = parser.flatten_results['comp']['with_comm']['copy']['time']['avg']
# Communication results (latency), alone and while computations are running:
results_comm_alone = parser.flatten_results['comm']['alone']['lat']['med']
results_comm_with_comp = parser.flatten_results['comm']['with_comp']['lat']['med']

graph = CommCompGraph(parser.x_values, parser.x_type, CommCompGraphCommType.LATENCY, parser.compute_bench_type, CompMetric.TIME, "Network Latency (64 MB)", "COPY")
graph.add_comp_curve(results_copy_alone, "alone", CommCompGraphCurveType.ALONE, False)
graph.add_comp_curve(results_copy_with_comms, "while Ping-Pongs", CommCompGraphCurveType.PARALLEL, True)
graph.add_comm_curve(results_comm_alone, "alone", CommCompGraphCurveType.ALONE, False, display_line=False)
graph.add_comm_curve(results_comm_with_comp, "while Computations", CommCompGraphCurveType.PARALLEL, True)

# Upper limits of the two y-axes:
graph.comm_top_limit = 7
graph.comp_top_limit = 45

graph.show()
```

## Measuring if a program is CPU- or memory-bound

We use [pmu-tools](https://github.com/andikleen/pmu-tools), a wrapper on top of the `perf` tool.

To get a summary for the whole program execution:

```bash
toplev.py --global -l3 -v --csv ";" -o /tmp/pmu.csv -- program
```

You can also get a temporal view:

```bash
toplev.py -l3 -x, -I 100 -o pmu.csv -- program
tl-barplot.py pmu.csv
```

## Original STREAM benchmark

From http://www.cs.virginia.edu/stream/

The output is slightly modified to ease the plotting.

How to bench:

```bash
# Without binding:
for i in $(seq 1 56); do OMP_NUM_THREADS=$i ./stream; done

# Binding in logical order, ignoring hyper-threading:
for i in $(seq 1 28); do OMP_NUM_THREADS=$i OMP_PROC_BIND=true OMP_PLACES=cores hwloc-bind --cpubind core:0-$((i-1)) ./stream; done

# Binding in logical order, binding memory and ignoring hyper-threading:
for i in $(seq 1 28); do OMP_NUM_THREADS=$i OMP_PROC_BIND=true OMP_PLACES=cores hwloc-bind --membind node:0 --cpubind core:0-$((i-1)) ./stream; done
```
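If you want to keep one output file per thread count (e.g. to parse the results afterwards), a possible variant of the second loop above is sketched below; the `stream_bound_*_threads.out` names simply mirror the naming scheme used for the other benchmarks and are an assumption, not something the plotting scripts necessarily expect:

```bash
# Same binding as above, but with one output file per thread count:
for i in $(seq 1 28); do
    OMP_NUM_THREADS=$i OMP_PROC_BIND=true OMP_PLACES=cores hwloc-bind --cpubind core:0-$((i-1)) ./stream &> stream_bound_$((i))_threads.out
done
```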