Commit 07f3b168 authored by Ludovic Courtès

article: Comment the new benchmark results.

* article/content.tex (Running the Benchmarks): Write.
parent 358f96d2
@@ -67,7 +67,7 @@ Autoconf, Automake, and Libtool.
Should we run the latest revision of libchop, dated 2016, or should we
rather run the 2006 revision that was used at the time the paper was
written? The latest libchop revision is available as a
GNU~Guix package. Unfortunately, the benchmarking scripts mentioned
above are stuck in 2006--2007, so to speak: they rely on libchop
programming interfaces that changed after that time, and they also
@@ -104,22 +104,121 @@ eventually presented.
The resulting dependency graph, i.e., the packages needed to build this
libchop revision, is of course more complex. It is shown in
Figure~\ref{fig:dependencies} for reference (the reader is invited to
zoom in or use a high-resolution printer).
Section~\ref{sec:reproducing} will get back to this graph.
\section{Running the Benchmarks}
Section 4.2 of the original paper\supercite{courtes06:storage} evaluates
the efficiency and computational cost of different storage pipelines, on
different file sets, each involving a variety of compression techniques.
\subsection{Input File Sets}
Figure~3 of the original article describes the three file sets used as
input of the evaluation. Of these three file sets, only the first one
could be recovered precisely: it is a set of source code files that are
publicly available
from \url{} and in the
Software Heritage archive. The two other file sets were not publicly
available. Based on the information available, we decided to
use \emph{similar} file sets that are publicly available. For the
``Ogg Vorbis'' file set, we chose freely redistributable files available
from \url{}. For
the ``mailbox'' file set, we chose the mbox-formatted archive for
the \texttt{} mailing list.

\begin{table}
  \centering
  \caption{File sets.}
  \label{tab:file-sets}
  \begin{tabular}{lrrr}
    \hline
    \textbf{Name} & \textbf{Size} & \textbf{Files} & \textbf{Average Size} \\
    \hline
    Lout (versions 3.20 to 3.29) & 76 MiB & 5,853 & 13 KiB \\
    Ogg Vorbis files & 32 MiB & 10 & 3 MiB \\
    mbox-formatted mailbox & 8 MiB & 1 & 8 MiB \\
    \hline
  \end{tabular}
\end{table}

Table~\ref{tab:file-sets} summarizes the file sets used in this
replication. This is an informal description, but rest assured:
Section~\ref{sec:reproducing} will explain the \emph{executable
specification} of these file sets that accompanies this article.
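
As a quick sanity check, the ``Average Size'' column follows from the
other two columns, assuming it is simply the total size divided by the
number of files (an assumption made here; the table does not say how it
was computed):
\[
  \frac{76 \times 1024\,\mathrm{KiB}}{5853} \approx 13.3\,\mathrm{KiB},
  \qquad
  \frac{32\,\mathrm{MiB}}{10} = 3.2\,\mathrm{MiB},
  \qquad
  \frac{8\,\mathrm{MiB}}{1} = 8\,\mathrm{MiB},
\]
which matches the rounded values of 13~KiB, 3~MiB, and 8~MiB.
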
\subsection{Evaluation Results}

\begin{table}
  \centering
  \caption{Storage pipeline configurations benchmarked.}
  \label{tab:configurations}
  \begin{tabular}{lccccc}
    \hline
    \textbf{Config.} & \textbf{Single Instance?} & \textbf{Chopping Algo.} & \textbf{Block Size} & \textbf{Input Zipped?} & \textbf{Blocks Zipped?} \\
    \hline
    A1 & no & --- & --- & yes & --- \\
    A2 & yes & --- & --- & yes & --- \\
    B1 & yes & Manber's & 1024 B & no & no \\
    B2 & yes & Manber's & 1024 B & no & yes \\
    B3 & yes & fixed-size & 1024 B & no & yes \\
    C & yes & fixed-size & 1024 B & yes & no \\
    \hline
  \end{tabular}
\end{table}

As in the original article, we benchmarked the configurations listed
in Table~\ref{tab:configurations}. Running the benchmarking scripts
using the libchop revision packaged earlier revealed a crash for some of
the configurations. Fortunately, that problem had been fixed in later
revisions of libchop, and we were able to ``backport'' a simpler fix to
that revision (most likely, the bug depended on other factors such as
the CPU architecture and libc version and did not show up back in 2006).
The original benchmarks were run on a PowerPC G4 machine running GNU/Linux.
This time, we ran them on an x86\_64 machine with an Intel i7 CPU at
2.6~GHz (the author playfully started looking for a G4 so that even the
\emph{hardware} setup could be reproduced, but eventually gave up). The
benchmarking results in Figure~5 of the original
paper\supercite{courtes06:storage} were squashed into a single,
hard-to-read chart. Here we present them as two separate figures:
Figure~\ref{fig:size} shows the space savings (ratio of the resulting
data size to the input data size) and Figure~\ref{fig:throughput} shows
the throughput of each storage pipeline, for each data set.
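
The two plotted quantities can be written down explicitly. The symbols
$S_{\mathrm{in}}$ (input data size), $S_{\mathrm{out}}$ (resulting data
size), and $t$ (elapsed time) are notation introduced here for
illustration. Defining throughput as input size over elapsed time is an
assumption, since the text above does not spell it out:
\[
  r = \frac{S_{\mathrm{out}}}{S_{\mathrm{in}}},
  \qquad
  T = \frac{S_{\mathrm{in}}}{t}.
\]
A ratio $r < 1$ means the pipeline stored less data than it was given,
and a larger $T$ means the pipeline processed its input faster.
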
\caption{Ratio of the resulting data size to the input data size (lower is better).}
The space savings in Figure~\ref{fig:size} are about the same as in the
original article, with one exception: the ``mailbox'' file set has
noticeably better space savings in configurations A1 and C. This could
be due to the mailbox file chosen in this replication exhibiting more
redundancy; or it could be due to today's zlib implementation having
different defaults, such as a larger compression buffer, allowing it to
achieve better compression.
\caption{Throughput for each storage pipeline and each file set (higher is better).}
The throughput shown in Figure~\ref{fig:throughput} is, not
surprisingly, an order of magnitude higher than that measured on the
2006-era hardware. The CPU cost of the configurations relative to one
another is close to that of the original paper, though the differences
are less pronounced.
For example, the throughput for B2 is only half that of A1 in this
replication, whereas it was about a third in the original paper. There
can be several factors explaining this, such as today's compiler
producing better code for the implementation of the ``chopper'' based on
Manber's algorithm in libchop, or very low input/output costs on today's
hardware (using a solid-state device today compared to a spinning hard
disk drive back then).
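
The shift in relative cost can be made explicit. Writing $T_X$ for the
throughput of configuration~X (notation introduced here, not taken from
the original paper) and using only the approximate figures quoted above:
\[
  \frac{T_{\mathrm{B2}}}{T_{\mathrm{A1}}} \approx \frac{1}{2}
  \quad \mbox{(this replication)},
  \qquad
  \frac{T_{\mathrm{B2}}}{T_{\mathrm{A1}}} \approx \frac{1}{3}
  \quad \mbox{(original paper)}.
\]
Assuming throughput is input size per unit of time, B2 now takes roughly
twice as long as A1 to process the same input, versus roughly three
times as long in 2006.
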
Overall, the analysis in Section~4.2.2 of the original paper remains
valid today. The evaluation of the CPU cost is, as we saw,
sensitive to changes in the underlying hardware. Nevertheless, the main
performance characteristics of the different configurations observed in
2006 still hold.
\section{Reproducing this Article}