# Impact of memory contention on communications

[![SWH](https://archive.softwareheritage.org/badge/origin/https://gitlab.inria.fr/pswartva/memory-contention/)](https://archive.softwareheritage.org/browse/origin/?origin_url=https://gitlab.inria.fr/pswartva/memory-contention)
[![SWH](https://archive.softwareheritage.org/badge/swh:1:dir:579f4e1d0e641536102f536117153d01fe1fca6c/)](https://archive.softwareheritage.org/swh:1:dir:579f4e1d0e641536102f536117153d01fe1fca6c;origin=https://gitlab.inria.fr/pswartva/memory-contention;visit=swh:1:snp:a7633c0f3cdcdec7078afbbc104eb3670f1ced6a;anchor=swh:1:rev:e2d788f5718386c818f0aa07e826fd9e8c6b4870)

Benchmark suite to evaluate the impact of memory contention on communications
(and vice versa).


## Requirements

- Communications are made with the MPI API, so you need an MPI library. If you
  use MadMPI (from [NewMadeleine](http://pm2.gforge.inria.fr/newmadeleine/)),
  build NewMadeleine with the profile `pukabi+madmpi-mini.conf` (see the sketch
  after this list).
- If you want to measure frequencies,
  [LIKWID](https://hpc.fau.de/research/tools/likwid/) can be used, but a
  variant of the code that does not require it is also provided.
- If you want to measure the impact of a task-based runtime system,
  [StarPU](https://starpu.gitlabpages.inria.fr/) is required.
- Some computing benchmarks use the Intel MKL library.
- [hwloc](https://www.open-mpi.org/projects/hwloc/) is used to bind threads.
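To build NewMadeleine with this profile, the sequence may look as follows. This
is only a sketch: the repository URL, the install prefix and the
`pm2-build-packages` helper are assumptions about the PM2 software suite;
refer to the NewMadeleine documentation for the authoritative steps. Only the
profile name comes from this README.

```bash
# Assumed repository location and build helper (verify against the
# NewMadeleine documentation):
git clone https://gitlab.inria.fr/pm2/pm2.git
cd pm2
# Build with the profile required by this benchmark suite:
./pm2-build-packages ./pukabi+madmpi-mini.conf --prefix=$HOME/newmadeleine
```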


## Available programs

- `bench_openmp` measures interference between communications and computations
  when OpenMP is used to parallelize computations.
- `bench_openmp_likwid`: same as `bench_openmp`, but also measures frequencies
  with LIKWID.
- `bench_openmp_freq`: same as `bench_openmp`, but also measures frequencies
  by reading `/proc` files.
- `bench_starpu`: same as `bench_openmp`, but uses StarPU to parallelize
  computations.
- `uncore_get` and `uncore_set` use LIKWID to respectively get and set the
  uncore frequencies of the sockets.

Build each program with `make`:
```bash
make <program>
```

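For instance, to build both the OpenMP and the StarPU benchmarks:

```bash
make bench_openmp bench_starpu
```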



## Benchmarking

You can then choose the computing benchmark
(`--compute_bench={stream,prime,cursor,scalar,scalar_avx}`) and the
communication benchmark (`--bench={bandwidth,latency}`). All available options
are listed in the program's help (`--help`).
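A minimal invocation may look as follows; this sketch assumes a standard
`mpirun` launcher with two processes and leaves every other option at its
default:

```bash
# List all available options:
./bench_openmp --help

# Prime-number computations competing with a latency ping-pong:
mpirun -n 2 ./bench_openmp --compute_bench=prime --bench=latency
```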

Example executions:
```bash
ncores=$(hwloc-calc all -N core)
nbnuma=$(hwloc-calc all -N node)
last_numa_node=$(($nbnuma-1))

# Latency benchmark (the default): prime computations with OpenMP on cores 0..i-1,
# ping-pong thread placed on the last core:
for i in $(seq 1 $((ncores-1)));
do
    mpirun -DOMP_NUM_THREADS=$i -DOMP_PROC_BIND=true -DOMP_PLACES=cores hwloc-bind --cpubind core:0-$((i-1)) ./bench_openmp --compute_bench=prime --ping_thread=last &> latency_thread_last_$((i))_threads.out
done

# Same runs, but measuring bandwidth instead of latency:
for i in $(seq 1 $((ncores-1)));
do
    mpirun -DOMP_NUM_THREADS=$i -DOMP_PROC_BIND=true -DOMP_PLACES=cores hwloc-bind --cpubind core:0-$((i-1)) ./bench_openmp --compute_bench=prime --ping_thread=last --bench=bandwidth &> bandwidth_thread_last_$((i))_threads.out
done

# StarPU version: STREAM benchmark with the add and scale kernels disabled,
# workers from core 1, main and MPI threads pinned to the last two cores:
for i in $(seq 2 $((ncores-2)));
do
    mpirun -DSTARPU_NCPU=$i -DSTARPU_MAIN_THREAD_BIND=1 -DSTARPU_WORKERS_CPUID=1- -DSTARPU_MAIN_THREAD_CPUID=$((ncores-2)) -DSTARPU_MPI_THREAD_CPUID=$((ncores-1)) ./bench_starpu --compute_bench=stream --no_stream_add --no_stream_scale --ping_thread=first &> latency_thread_first_main_last_mpi_last_$((i-1))_threads.out;
done
```

Environment variables and `hwloc-bind` are used to correctly bind threads to
cores. See `bench_suite.example.sh` for how all combinations of parameters
can be launched.
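To double-check a binding, `hwloc-bind` can report the cpuset it actually
applied; a small sketch (the core range is only illustrative):

```bash
# Bind to the first four cores, then print the resulting binding:
hwloc-bind --cpubind core:0-3 -- hwloc-bind --get
```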



## Plotting

Scripts are in the `plot` folder and require Python with Matplotlib.

`plot_comm_comp.py` is the main script. It plots computing benchmark results
and communication performance on the same graph, mainly as a function of the
number of cores.

For instance:
```bash
python3 plot_comm_comp.py bandwidth_thread_last_* --per-core --top=10000 --stream-top=15000 --o=bandwidth_thread_last.png --network-bandwidth --title="Network Bandwidth and STREAM Benchmark"
```

The Python module `comm_comp` provides classes to parse the outputs of the
benchmarking programs and to generate plots. For instance:
```python
import glob
from comm_comp import *

# Parse all matching result files:
parser = FilesParser(glob.glob("copy_*_threads.out"))

# Average computation times and median latencies, alone and in parallel:
results_copy_alone = parser.flatten_results['comp']['alone']['copy']['time']['avg']
results_copy_with_comms = parser.flatten_results['comp']['with_comm']['copy']['time']['avg']
results_comm_alone = parser.flatten_results['comm']['alone']['lat']['med']
results_comm_with_comp = parser.flatten_results['comm']['with_comp']['lat']['med']

# Build the graph and add the four curves:
graph = CommCompGraph(parser.x_values, parser.x_type, CommCompGraphCommType.LATENCY, parser.compute_bench_type, CompMetric.TIME, "Network Latency (64 MB)", "COPY")
graph.add_comp_curve(results_copy_alone, "alone", CommCompGraphCurveType.ALONE, False)
graph.add_comp_curve(results_copy_with_comms, "while Ping-Pongs", CommCompGraphCurveType.PARALLEL, True)
graph.add_comm_curve(results_comm_alone, "alone", CommCompGraphCurveType.ALONE, False, display_line=False)
graph.add_comm_curve(results_comm_with_comp, "while Computations", CommCompGraphCurveType.PARALLEL, True)

# Upper limits of the communication and computation axes:
graph.comm_top_limit = 7
graph.comp_top_limit = 45
graph.show()
```

## Measuring whether a program is CPU- or memory-bound

We use [pmu-tools](https://github.com/andikleen/pmu-tools), a wrapper around the `perf` tool.

To get a summary over the whole program execution:
```bash
toplev.py --global -l3 -v --csv ";" -o /tmp/pmu.csv -- program
```

You can also get a temporal view:
```bash
toplev.py -l3 -x, -I 100 -o pmu.csv -- program
tl-barplot.py pmu.csv
```
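For example, applied to one of the benchmarks of this suite (the program and
its options are only illustrative):

```bash
# Temporal top-down analysis of the OpenMP benchmark, sampled every 100 ms:
toplev.py -l3 -x, -I 100 -o pmu.csv -- ./bench_openmp --compute_bench=stream
tl-barplot.py pmu.csv
```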


## Original STREAM benchmark

From http://www.cs.virginia.edu/stream/

The output is slightly modified to ease plotting. How to run the benchmark:
```bash
# Without binding:
for i in $(seq 1 56); do OMP_NUM_THREADS=$i ./stream; done

# Binding in logical order, ignoring hyper-threading:
for i in $(seq 1 28); do OMP_NUM_THREADS=$i OMP_PROC_BIND=true OMP_PLACES=cores hwloc-bind --cpubind core:0-$((i-1)) ./stream; done

# Binding in logical order, binding memory and ignoring hyper-threading:
for i in $(seq 1 28); do OMP_NUM_THREADS=$i OMP_PROC_BIND=true OMP_PLACES=cores hwloc-bind --membind node:0 --cpubind core:0-$((i-1)) ./stream; done
```
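The hard-coded counts (56 hardware threads, 28 physical cores) correspond to
the machine these examples were written for; on another machine they can be
derived with hwloc, as in the earlier examples:

```bash
ncores=$(hwloc-calc all -N core)   # physical cores
npus=$(hwloc-calc all -N pu)       # hardware threads (for the unbound variant)
for i in $(seq 1 $ncores); do OMP_NUM_THREADS=$i OMP_PROC_BIND=true OMP_PLACES=cores hwloc-bind --cpubind core:0-$((i-1)) ./stream; done
```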