Commit b3aeb3d9 authored by Ludovic Courtès

article: Write last sections.

* article/content.tex (Reproducing this Article): Write.
(Conclusion): Write.
* article/bibliography.bib: Add entries, use 'howpublished' for web
parent 383e40e9
@@ -13,12 +13,106 @@
@misc{courtes20:libchop,
  author = {Courtès, Ludovic},
  title = {Libchop},
  howpublished = {\url{}},
  note = {Accessed 2020/04/28},
  year = 2020
}

@misc{lord20:arch,
  author = {Lord, Tom and others},
  title = {{GNU Arch}},
  howpublished = {\url{}},
  note = {Accessed 2020/04/28},
  year = 2020
}

@inproceedings{dolstra04:nix,
  title = {{Nix}: A Safe and Policy-Free System for Software Deployment},
  author = {Dolstra, Eelco and de Jonge, Merijn and Visser, Eelco},
  affiliation = {Utrecht University},
  booktitle = {Proceedings of the 18th Large Installation System Administration Conference ({LISA})},
  pages = {79--92},
  publisher = {{USENIX}},
  year = 2004,
  month = nov,
  address = {Atlanta, Georgia, {USA}}
}

@inproceedings{courtes15:reproducible,
  title = {Reproducible and User-Controlled Software Environments in {HPC} with {Guix}},
  author = {Courtès, Ludovic and Wurmus, Ricardo},
  url = {},
  booktitle = {2nd International Workshop on Reproducibility in Parallel Computing ({RepPar})},
  address = {Vienna, Austria},
  year = 2015,
  month = aug,
  keywords = {hpc; reproducible research; package management},
  pdf = {},
  hal_id = {hal-01161771}
}

@inproceedings{courtes17:staging,
  title = {Code Staging in {GNU Guix}},
  author = {Courtès, Ludovic},
  url = {},
  booktitle = {16th {ACM SIGPLAN} International Conference on Generative Programming: Concepts and Experiences ({GPCE'17})},
  address = {Vancouver, Canada},
  year = 2017,
  month = oct,
  doi = {10.1145/3136040.3136045},
  keywords = {Functional languages; Source code generation; Software deployment; Scheme; Code staging; System administration},
  pdf = {},
  hal_id = {hal-01580582}
}

@article{schulte12:literate,
  author = {Eric Schulte and Dan Davison and Thomas Dye and Carsten Dominik},
  title = {A Multi-Language Computing Environment for Literate Programming and Reproducible Research},
  journal = {Journal of Statistical Software, Articles},
  volume = {46},
  number = {3},
  year = {2012},
  issn = {1548-7660},
  pages = {1--24},
  doi = {10.18637/jss.v046.i03},
  url = {}
}

@inproceedings{stanisic:reproducible,
  title = {Effective Reproducible Research with {Org-Mode} and {Git}},
  author = {Stanisic, Luka and Legrand, Arnaud},
  url = {},
  booktitle = {1st International Workshop on Reproducibility in Parallel Computing},
  address = {Porto, Portugal},
  year = {2014},
  month = aug,
  keywords = {Reproducibility of Results},
  pdf = {},
  hal_id = {hal-01083205}
}

@article{hinsen11:activepapers,
  author = {Konrad Hinsen},
  title = {A data and code model for reproducible research and executable papers},
  journal = {Procedia Computer Science},
  volume = {4},
  pages = {579--588},
  year = 2011,
  publisher = {Elsevier {BV}},
  doi = {10.1016/j.procs.2011.04.061},
  url = {}
}

@misc{akhlaghi19:template,
  author = {Mohammad Akhlaghi},
  title = {Reproducible Paper Template},
  year = 2019,
  url = {}
}
@@ -2,9 +2,14 @@
This article reports on the effort to reproduce the results shown
in \textit{Storage Tradeoffs in a Collaborative Backup Service for
Mobile Devices}\supercite{courtes06:storage}, an article published in
2006, more than thirteen years ago. The article presented the design of
the storage layer of such a backup service. It included an evaluation
of the efficiency and performance of several storage pipelines, which is
the experiment we replicate here.
Additionally, this article describes a way to capture the complete
dependency graph of this article and the software and data it refers to,
making it fully reproducible, end to end.
\section{Getting the Source Code}
@@ -41,6 +46,7 @@
years\supercite{courtes20:libchop}. As of this writing, there have
been no changes to its source code since 2016.
\section{Building the Source Code}
Libchop is written in C and accessible from Scheme thanks to bindings
for GNU~Guile, an implementation of the Scheme programming language.
@@ -222,3 +228,200 @@
performance characteristics of the different configurations observed in
\section{Reproducing this Article}
We were able to replicate experimental results obtained thirteen years
ago, observing only non-significant variations. Yet, this replication
work highlighted the weaknesses of the original work, which fall into
three categories:
\begin{itemize}
\item Lack of a properly referenced public archive of the input data.
\item Gaps in the document authoring pipeline: running the benchmarks was
  fully automated thanks to the scripts mentioned earlier, but the figure
  that appeared in the 2006 paper was made ``by hand'' from the output
  produced by the script.
\item Lack of a way to redeploy the software pipeline: the 2006 article
  did not contain references to software revisions and version numbers,
  let alone a way to automatically deploy the software stack.
\end{itemize}
\subsection{Deploying Software}
It should come as no surprise that the author, who has been working on
reproducible software deployment for several years now, felt the need
to address the software deployment issue. The original paper lacked
references to the software. Figure~\ref{fig:dependencies} here
does provide much information, but how useful is it to someone trying to
redeploy this software stack? Sure, it contains version and dependency
information, but it says nothing about configuration and build flags,
about patches that were applied, and so on. It also lacks information
about dependencies that are considered implicit such as the compiler
tool chain. This calls for a \emph{formal and executable specification}
of the software stack.
GNU~Guix is a set of tools supporting \emph{reproducible software
deployment}\supercite{courtes15:reproducible}, building upon the
functional deployment model\supercite{dolstra04:nix}. As mentioned in
Section~\ref{sec:building}, we defined the whole software stack as Guix
packages: most of them pre-existed in the main Guix channel, and the old
versions that were needed were added to the new Guix-Past channel.
By specifying the commits of Guix and Guix-Past of interest, one can
build the complete software stack of this article. For example, the
instructions below build the 2006 revision of libchop along with its
dependencies, deploying pre-built binaries when they are available:
\begin{verbatim}
git clone
cd edcc-2006-redone
guix time-machine -C channels.scm -- build libchop@0.0
\end{verbatim}
The file \texttt{channels.scm} above lists the commits of Guix and
Guix-Past to be used. Thus, recording the commit
of \texttt{edcc-2006-redone} that was used \emph{is all it takes to
refer unambiguously to this whole software stack}.
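
For reference, a \texttt{channels.scm} file has roughly the following
shape. This is a sketch only: the commit IDs below are placeholders, not
the revisions actually pinned by \texttt{edcc-2006-redone}.

\begin{verbatim}
;; Sketch of a channels.scm file.  The commit strings are
;; placeholders standing in for the actual pinned revisions.
(list (channel
        (name 'guix)
        (url "https://git.savannah.gnu.org/git/guix.git")
        (commit "<pinned-guix-commit>"))
      (channel
        (name 'guix-past)
        (url "https://gitlab.inria.fr/guix-hpc/guix-past")
        (commit "<pinned-guix-past-commit>")))
\end{verbatim}

Passing such a file to \texttt{guix time-machine -C} makes Guix jump to
exactly these channel revisions before running the requested command.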
The key differences compared to a ``container image''
are \emph{provenance tracking} and \emph{reproducibility}. Guix has a
complete view of the package dependency graph; for example,
Figure~\ref{fig:dependencies} is the result of running:
\begin{verbatim}
guix time-machine -C channels.scm -- graph libchop@0.0 \
  | dot -Tpdf > graph.pdf
\end{verbatim}
Furthermore, almost all the packages Guix provides are bit-reproducible:
building a package at different times or on different machines gives the
exact same binaries (there is a small minority of exceptions, often
packages that record build timestamps).
Last, each package's source code is automatically looked up in Software
Heritage should its nominal upstream location become unreachable.
\subsection{Reproducible Computations}
Often enough, software deployment is treated as an activity of its own,
separate from computations and from document authoring. But really,
this separation is arbitrary: a software build process \emph{is} a
computation, benchmarks like those discussed in this paper \emph{are}
computations, and in fact, the process that produced the PDF file you
are reading is yet another computation.
We set out to describe this whole pipeline as a single dependency graph
whose sink is the \LaTeX{} build process that produces this PDF. The
end result is that, from a checkout of the \texttt{edcc-2006-redone}
repository, this PDF, \emph{and everything it depends on} (software,
benchmarking results, plots) can be produced by running:
\begin{verbatim}
guix time-machine -C channels.scm -- build -f article/guix.scm
\end{verbatim}
The files \texttt{guix.scm} and \texttt{article/guix.scm} describe the
dependency graph above libchop. Conceptually, they are similar to a
makefile and, in fact, part of \texttt{article/guix.scm} is a translation
of the makefile of the ReScience article template. Using the Scheme
programming interfaces of Guix and its support for \textit{code
staging}, which allows users to write code staged for eventual
execution\supercite{courtes17:staging}, these files describe the
dependency graph and, for each node, its associated build process.
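
To give an idea, a node of that graph might look like the following
sketch. It is simplified and hypothetical: the file and package names
are illustrative, not the actual contents of \texttt{article/guix.scm}.

\begin{verbatim}
;; Hypothetical sketch of one graph node: a G-expression
;; describing a build process that runs LaTeX over the article
;; source to produce the PDF.
(use-modules (guix gexp) (gnu packages tex))

(computed-file "article.pdf"
  (with-imported-modules '((guix build utils))
    #~(begin
        (use-modules (guix build utils))
        ;; Bring the source into the build environment.
        (copy-file #$(local-file "content.tex") "content.tex")
        (setenv "PATH" #$(file-append texlive "/bin"))
        (invoke "pdflatex" "content.tex")
        ;; The build result becomes this node's output.
        (copy-file "content.pdf" #$output))))
\end{verbatim}

The quasiquote-like \texttt{\#\textasciitilde} and
\texttt{\#\$} operators are the staging constructs: host-side values
such as packages and files are spliced into code that runs later, in
the isolated build environment.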
For the purposes of this article, we had to bridge the gap from the
benchmarking scripts to the actual plots by implementing a parser of the
script's standard output that feeds the extracted data to Guile-Charting, the
library used to produce the charts. They are chained together in the
top-level \texttt{guix.scm} file. The graph in
Figure~\ref{fig:dependencies} is also produced automatically as part of
the build process, using the channels specified
in \texttt{channels.scm}. Thus, it is guaranteed to correspond
precisely to the software stack used to produce the benchmark results in
this document.
What about the input data? Guix \texttt{origin} records allow us to
declare data that is to be downloaded, along with the cryptographic hash
of its content---a form of \emph{content addressing}, which is the most
precise way to refer to data, independently of its storage location and
transport. The three file sets in Figure~\ref{fig:file-sets} are
encoded as \texttt{origin}s and downloaded if they are not already
available locally.
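
Concretely, such an \texttt{origin} looks along these lines; the URL
and hash below are placeholders, not those of the actual file sets.

\begin{verbatim}
;; Sketch of an origin declaring one input file set.  Requires the
;; (guix packages) and (guix download) modules; the URL and the
;; base32-encoded SHA256 content hash are placeholders.
(origin
  (method url-fetch)
  (uri "https://example.org/input-file-set.tar.gz")
  (sha256
   (base32 "<base32-encoded-content-hash>")))
\end{verbatim}

Because the declaration carries the content hash, Guix verifies
whatever it downloads against it, regardless of where the bytes
actually come from.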
The techniques described above to encode the complete document authoring
pipeline as a fully-specified, executable, and reproducible computation
could certainly be applied to a wide range of scientific articles. We
think that, at least conceptually, it could very much represent the
``gold standard'' of reproducible scientific articles. Nevertheless,
there are three points that deserve further discussion: handling input
data, dealing with inherently non-deterministic byproducts, and dealing
with expensive computations.
Our input file sets were easily handled using the standard
Guix \texttt{origin} mechanism because they are relatively small and
easily downloaded. This data is copied as content-addressed items in
the ``store'', which would be unsuitable or at least inconvenient for
large data sets. Probably some ``out-of-band'' mechanism would need to
be sought for those data sets---similar to how Git-Annex provides
support for ``out-of-band'' data storage while remaining integrated with
Git. The developers of the Guix Workflow Language (GWL), which is used
for bioinformatics workflows over large data sets, chose to treat each
process and its data outside standard Guix mechanisms.
The second issue is non-deterministic byproducts like the performance
data of Figure~\ref{fig:throughput}. That information is inherently
non-deterministic: the actual throughput varies from run to run and from
machine to machine. The functional model implemented in
Guix\supercite{dolstra04:nix} is designed for deterministic build
processes. While it is entirely possible to include non-deterministic
build processes in the dependency graph without any practical issues,
there is some sort of an ``impedance mismatch''. It would be
interesting to see whether explicit support for non-deterministic
processes would be useful.
Last, our approach does not lend itself well to long-running
computations that require high-performance computing resources. Again,
some mechanism is needed to bridge between these necessarily out-of-band
computations and the rest of the framework. The GWL provides
preliminary answers to this question.
\section{Conclusion}

We are glad to report that we were able to replicate the experimental
results that appear in our thirteen year-old article and that its
conclusions in this area still hold\supercite{courtes06:storage}. But
really, truth be told, the replication was also an excuse to prototype
an \emph{end-to-end reproducible scientific pipeline}---from source code
to PDF.
The idea has previously been explored from different angles,
notably with Hinsen's ActivePapers
framework\supercite{hinsen11:activepapers}, Akhlaghi's reproducible paper
template\supercite{akhlaghi19:template}, and by combining literate
programming with Org-Mode and version control with
Git\supercite{stanisic:reproducible}. Akhlaghi's template is one of the
few efforts to consider software deployment a part of the broader
scientific authoring pipeline. However, software deployed with the
template relies on host software such as a compilation tool chain,
making it non-self-contained; it also lacks the provenance tracking and
reproducibility that come with the functional deployment model
implemented in Guix.
We hope our work could serve as the basis of a reproducible paper
template in the spirit of Akhlaghi's. We are aware that, in its current
form, our reproducible pipeline requires a relatively high level of Guix
expertise---although, to be fair, it should be compared with the wide
variety of programming languages and tools conventionally used for
similar purposes. We think that, with more experience, common build
processes and idioms could be factorized as libraries and high-level
programming constructs, making it more approachable.
It is interesting to see that a single Git commit
identifier---a \emph{content address}---is enough to refer to the whole
pipeline leading to this article!
We look forward to a future where reproducible scientific pipelines
become commonplace.