Commit 00fadc90 authored by Ludovic Courtès

article: Last changes before submission.

* article/content.tex: Fix typos, improve wording.
* article/bibliography.bib: Add "janssen20:gwl".
parent 3d76d50a
article/bibliography.bib
@@ -113,6 +113,13 @@
@misc{akhlaghi19:template,
author = {Mohammad Akhlaghi},
title = {Reproducible Paper Template},
year = 2019,
-url = {https://gitlab.com/makhlaghi/reproducible-paper}
+howpublished = {\url{https://gitlab.com/makhlaghi/reproducible-paper}},
+note = {Accessed 2020/04/28}
}
+@misc{janssen20:gwl,
+author = {Janssen, Roel and Wurmus, Ricardo and others},
+title = {{GNU Guix Workflow Language}},
+howpublished = {\url{https://www.guixwl.org/}},
+note = {Accessed 2020/04/28}
+}
article/content.tex
@@ -41,9 +41,9 @@ paper\supercite{courtes06:storage}.
\end{itemize}
The code of libchop itself was published as free software in 2007 and
-continued to evolved in the following
-years\supercite{courtes20:libchop}. As of these writings, there have
-been no changes its source code since 2016.
+continued to evolve in the following
+years\supercite{courtes20:libchop}. As of this writing, there have
+been no changes to its source code since 2016.
\section{Building the Source Code}
\label{sec:building}
@@ -76,13 +76,13 @@ rather run the 2006 revision that was used at the time the paper was
written? The latest libchop revision is available as a
GNU~Guix package. Unfortunately, the benchmarking scripts mentioned
above are stuck in 2006--2007, so to speak: they require libchop
-programming interface that changed after that time, and they also
+programming interfaces that changed after that time, and they also
require interfaces specific to Guile~1.8, the version that was current
at the time (the latest version of Guile today is 3.0.2; it has
seen \textit{three} major versions since 2006).
-The author set out to use libchop, Guile, and G-Wrap from 2006, but
-reusing as much as possible of today's software packages apart from
+The author chose to use libchop, Guile, and G-Wrap from 2006, but
+reusing as many as possible of today's software packages apart from
these. The good thing with the Autotools is that a user building from a
source tarball does not need to install the Autotools. However, no
release of libchop had been published as a source tarball back then, and
@@ -95,6 +95,7 @@ incompatible. Fortunately, the ``downgrade cascade'' stops here.
\begin{figure}[ht!]
\caption{Dependency graph for the 2006 revision of libchop.}
\label{fig:dependencies}
+\hspace{-0.5cm}
\includegraphics[width=.7\paperwidth]{libchop-graph}
\end{figure}
@@ -108,31 +109,31 @@ was chosen as dating to right before the submission of the paper for
the European Dependable Computing Conference (EDCC), where it was
eventually presented.
-The resulting dependency graph---packages needed to be build this
+The resulting dependency graph---packages needed to build this
libchop revision---is of course more complex. It is shown in
Figure~\ref{fig:dependencies} for reference (the reader is invited to
-zoom in or use a high-resolution printer).
+zoom in or use a high-resolution printer). It is interesting to see
+that it is a unique blend of vintage 2006 packages with 2020 software.
Section~\ref{sec:reproducing} will get back to this graph.
\section{Running the Benchmarks}
Section 4.2 of the original paper\supercite{courtes06:storage} evaluates
-the efficiency and computational cost of different storage pipelines, on
+the efficiency and computational cost of several storage pipelines, on
different file sets, each involving a variety of compression techniques.
\subsection{Input File Sets}
Figure~3 of the original article describes the three file sets used as
input of the evaluation. Of these three file sets, only the first one
-could be recovered precisely: it is a set of publicly available source
-code files available
+could be recovered precisely: it is source code publicly available
from \url{https://download.savannah.gnu.org/releases/lout} and in the
Software Heritage archive. The two other file sets were not publicly
-available. With the information available, we decided to
+available. With the information given in the paper, we decided to
use \emph{similar} file sets, publicly available this time. For the
``Ogg Vorbis'' file set, we chose freely-redistributable files available
from \url{https://archive.org/download/nine_inch_nails_the_slip/}. For
-the ``mailbox'' file set, we chose the mbox-formatted archive for
+the ``mailbox'' file set, we chose an mbox-formatted monthly archive of
the \texttt{guix-devel@gnu.org} mailing list.
\begin{table}
@@ -149,14 +150,15 @@ the \texttt{guix-devel@gnu.org} mailing list.
Table~\ref{tab:file-sets} summarizes the file sets used in this
replication. This is an informal description, but rest assured:
-Section~\ref{sec:reproducing} will explain the \emph{executable
-specification} of these file sets that accompanies this article.
+Section~\ref{sec:reproducing} will explain the ``executable
+specification'' of these file sets that accompanies this article.
\subsection{Evaluation Results}
\begin{table}
\caption{Storage pipeline configurations benchmarked.}
\label{tab:configurations}
+\hspace{-0.5cm}
\begin{tabular}{c|c|c|c|c|c}
\textbf{Config.} & \textbf{Single Instance?} & \textbf{Chopping Algo.} & \textbf{Block Size} & \textbf{Input Zipped?} & \textbf{Blocks Zipped?} \\
\hline
@@ -173,20 +175,20 @@ Like in the original article, we benchmarked the configurations listed
in Table~\ref{tab:configurations}. Running the benchmarking scripts
using the libchop revision packaged earlier revealed a crash for some of
the configurations. Fortunately, that problem had been fixed in later
-revisions of libchop, and we were able to ``backport'' a simpler fix to
-that revision (most likely, the bug depended on other factors such as
+revisions of libchop, and we were able to ``backport'' a small fix to
+our revision (most likely, the bug depended on other factors such as
the CPU architecture and libc version and did not show up back in 2006).
The original benchmarks ran on a PowerPC G4 machine running GNU/Linux.
This time, we ran them on an x86\_64 machine with an Intel i7 CPU at
2.6~GHz (the author playfully started looking for a G4 so that even the
-\emph{hardware} setup could be reproduced, but eventually gave up). The
+\emph{hardware} setup could be replicated, but eventually gave up). The
benchmarking results in Figure~5 of the original
paper\supercite{courtes06:storage} were squashed in a single,
hard-to-read chart. Here we present them as two separate figures:
Figure~\ref{fig:size} shows the space savings (ratio of the resulting
data size to the input data size) and Figure~\ref{fig:throughput} shows
-the throughput of each storage pipeline, for each data set.
+the throughput of each storage pipeline, for each file set.
\begin{figure}[ht!]
\caption{Ratio of the resulting data size to the input data size (lower is better).}
@@ -194,9 +196,9 @@
\includegraphics[width=.7\paperwidth]{charts/size.pdf}
\end{figure}
-The space savings in Figure~\ref{fig:size} are about the same of the
+The space savings in Figure~\ref{fig:size} are about the same as in the
original article, with one exception: the ``mailbox'' file set has
-noticeably better space savings in configurations A1 and C. This could
+noticeably better space savings in configurations A1 and C this time. This could
be due to the mailbox file chosen in this replication exhibiting more
redundancy; or it could be due to today's zlib implementation having
different defaults, such as a larger compression buffer, allowing it to
@@ -221,7 +223,7 @@ hardware (using a solid-state device today compared to a spinning hard
disk drive back then).
Overall, the analysis in Section~4.2.2 of the original paper remains
-valid today. The part of evaluation of the CPU cost is, as we saw,
+valid today. The part of the evaluation that relates to the CPU cost is, as we saw,
sensitive to changes in the underlying hardware. Nevertheless, the main
performance characteristics of the different configurations observed in
2006 remain valid today.
@@ -230,7 +232,7 @@ performance characteristics of the different configurations observed in
\label{sec:reproducing}
We were able to replicate experimental results obtained thirteen years
-ago, observing only non-significant variations. Yet, this replication
+ago, observing non-significant variations. Yet, this replication
work highlighted the weaknesses of the original work, which fall into
three categories:
@@ -243,7 +245,7 @@ fully automated thanks to the scripts mentioned earlier, but the figure
that appeared in the 2006 paper was made ``by hand'' from the output
produced by the script.
\item
-Lack of a way to redeploy the software pipeline: the 2006 article did
+Lack of a way to redeploy the software stack: the 2006 article did
not contain references to software revisions and version numbers, let
alone a way to automatically deploy the software stack.
\end{enumerate}
@@ -254,7 +256,7 @@ It should come as no surprise that the author, who has been working on
reproducible software deployment issues for several years now, felt the
need to address the software deployment issue. The original paper
lacked references to the software. Figure~\ref{fig:dependencies} here
-does provide much information, but how useful is it to someone trying to
+provides much information, but how useful is it to someone trying to
redeploy this software stack? Sure it contains version and dependency
information, but it says nothing about configuration and build flags,
about patches that were applied, and so on. It also lacks information
@@ -271,7 +273,7 @@ versions that were needed were added to the new Guix-Past channel.
By specifying the commits of Guix and Guix-Past of interest, one can
build the complete software stack of this article. For example, the
instructions below build the 2006 revision of libchop along with its
-dependencies, deploying pre-built binaries when they are available:
+dependencies, downloading pre-built binaries if they are available:
\begin{verbatim}
git clone https://gitlab.inria.fr/lcourtes-phd/edcc-2006-redone
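A minimal sketch of how the pinned stack can then be built, assuming the repository's top-level \texttt{channels.scm} and \texttt{guix.scm} files; this is illustrative, not the elided remainder of the verbatim block:

    cd edcc-2006-redone
    # Run the Guix revision pinned in channels.scm and build the
    # package graph defined by guix.scm; pre-built binaries
    # (substitutes) are downloaded when available.
    guix time-machine -C channels.scm -- build -f guix.scm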
@@ -311,7 +313,7 @@ computation, benchmarks like those discussed in this paper \emph{are}
computations, and in fact, the process that produced the PDF file you
are reading is yet another computation.
-We set out to describe this whole pipeline as a single dependency graph
+The author set out to describe this whole pipeline as a single dependency graph
whose sink is the \LaTeX{} build process that produces this PDF. The
end result is that, from a checkout of the \texttt{edcc-2006-redone}
repository, this PDF, \emph{and everything it depends on} (software,
@@ -333,11 +335,11 @@ dependency graph and, for each node, its associated build process.
For the purposes of this article, we had to bridge the gap from the
benchmarking scripts to the actual plots by implementing a parser of the
script's standard output that would then feed it to Guile-Charting, the
-library used to produce the charts. They are chained together in the
+library used to produce the plots. They are chained together in the
top-level \texttt{guix.scm} file. The graph in
Figure~\ref{fig:dependencies} is also produced automatically as part of
the build process, using the channels specified
-in \texttt{channels.scm}. Thus, it is guaranteed to correspond
-precisely to the software stack used to produce the benchmark results in
+in \texttt{channels.scm}. Thus, it is guaranteed to describe
+precisely the software stack used to produce the benchmark results in
this document.
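As a sketch, a \texttt{channels.scm} file evaluates to a list of channel records; the commit hashes below are placeholders and the Guix-Past URL is an assumption, so this is illustrative rather than the repository's actual pins:

    ;; channels.scm sketch: pin Guix and the Guix-Past channel.
    ;; Both commit hashes are placeholders, not the actual pins.
    (list (channel
            (name 'guix)
            (url "https://git.savannah.gnu.org/git/guix.git")
            (commit "0123456789abcdef0123456789abcdef01234567"))
          (channel
            (name 'guix-past)
            (url "https://gitlab.inria.fr/guix-hpc/guix-past")
            (commit "0123456789abcdef0123456789abcdef01234567")))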
@@ -345,7 +347,7 @@ What about the input data? Guix \texttt{origin} records allow us to
declare data that is to be downloaded, along with the cryptographic hash
of its content---a form of \emph{content addressing}, which is the most
precise way to refer to data, independently of its storage location and
-transport. The three file sets in Figure~\ref{fig:file-sets} are
+transport. The three file sets in Table~\ref{tab:file-sets} are
encoded as \texttt{origin}s and downloaded if they are not already
available locally.
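For illustration, one such \texttt{origin} might be written as the sketch below; the Lout tarball name and the sha256 value are placeholders (Guix checks the declared hash against the downloaded content):

    ;; Sketch of an origin record for one input file set.
    ;; The file name and content hash below are placeholders.
    (use-modules (guix packages) (guix download))

    (define lout-file-set
      (origin
        (method url-fetch)
        (uri (string-append "https://download.savannah.gnu.org/"
                            "releases/lout/lout-3.39.tar.gz"))
        (sha256
         (base32
          "0000000000000000000000000000000000000000000000000000"))))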
@@ -356,8 +358,8 @@ pipeline as a fully-specified, executable and reproducible computation,
could certainly be applied to a wide range of scientific articles. We
think that, at least conceptually, it could very much represent the
``gold standard'' of reproducible scientific articles. Nevertheless,
-there are two points that deserve further discussion: handling input
-data, dealing with inherently non-deterministic byproducts, and dealing
+there are three points that deserve further discussion: handling input
+data, dealing with non-deterministic computations, and dealing
with expensive computations.
Our input file sets were easily handled using the standard
@@ -366,8 +368,9 @@ easily downloaded. This data is copied as content-addressed items in
the ``store'', which would be unsuitable or at least inconvenient for
large data sets. Probably some ``out-of-band'' mechanism would need to
be sought for those data sets---similar to how Git-Annex provides
-support for ``out-of-band'' data storage while remaining integrated with
-Git. The developers of the Guix Workflow Language (GWL), which is used
+``out-of-band'' data storage integrated with
+Git. As an example, the developers of the Guix Workflow
+Language\supercite{janssen20:gwl} (GWL), which is used
for bioinformatics workflows over large data sets, chose to treat each
process and its data outside standard Guix mechanisms.
@@ -382,8 +385,8 @@ there is some sort of an ``impedance mismatch''. It would be
interesting to see whether explicit support for non-deterministic
processes would be useful.
-Last, our approach does not lend itself well to long-running
-computations that require high-performance computing resources. Again,
+Last, the approach does not mesh with long-running
+computations that require high-performance computing (HPC) resources. Again,
some mechanism is needed to bridge between these necessarily out-of-band
computations and the rest of the framework. The GWL provides
preliminary answers to this question.
@@ -391,23 +394,23 @@ preliminary answers to this question.
\section{Conclusion}
We are glad to report that we were able to replicate the experimental
-results that appear in our thirteen year-old article and that its
+results that appear in our thirteen-year-old article and that its
conclusions in this area still hold\supercite{courtes06:storage}. But
really, truth be told, the replication was also an excuse to prototype
an \emph{end-to-end reproducible scientific pipeline}---from source code
to PDF.
-The idea has previously been explored at different from different angles
+The idea has previously been explored from different angles,
notably with Hinsen's ActivePapers
framework\supercite{hinsen11:activepapers}, Akhlaghi's reproducible paper
template\supercite{akhlaghi19:template}, and by combining literate
programming with Org-Mode and version control with
Git\supercite{stanisic:reproducible}. Akhlaghi's template is one of the
-few efforts to consider software deployment a part of the broader
+few efforts to consider software deployment as part of the broader
scientific authoring pipeline. However, software deployed with the
template relies on host software such as a compilation tool chain,
making it non-self-contained; it also lacks the provenance tracking and
-reproducibility that come with the functional deployment model
+reproducibility benefits that come with the functional deployment model
implemented in Guix.
We hope our work could serve as the basis of a reproducible paper