# Orchestra
Orchestra is a set of tools meant to help with performing experimental campaigns in computer science. It provides you with simple tools to:
+ Organize a manual experimental workflow, leveraging git and lfs through a simple interface.
+ Collaborate with other people on a single experimental campaign.
+ Execute pieces of code on remote hosts such as clusters or clouds, in one line.
+ Automate the execution of batches of experiments and the presentation of the results through a clean web UI.
A lot of advanced tools exist on the net to handle similar situations. Most of them target very complicated workflows, e.g. DAGs of tasks. Those tools are very powerful but lack the simplicity needed by newcomers. Here, we propose a limited but very simple tool to handle one of the most common situations of experimental campaigns: the repeated execution of an experiment over variations of parameters.
In particular, we include three tools:
+ expegit: a tool to organize your experimental campaign results in a git repository using git-lfs (Large File Storage).
+ runaway: a tool to execute code on distant hosts, parameterized with easy-to-use file templates.
+ orchestra: a tool to automate the use of the two previous tools on large campaigns.
## Quick Start
The preferred way to use Orchestra is through the web interface, but if you prefer the command line, after setting up the tools you should be able to go through the following steps.
Assuming you have a repository containing your experimental code, create a run handle:
```bash
$ cd my_experiment
$ ln -s my_script.py run && chmod +x run
$ git add run && git commit -m "Adds orchestra run handle" && git push
```
Next, we let orchestra create a campaign repository at a new address for us:
```bash
$ orchestra init ssh://git@gitlab.com/user/my_experiment.git ssh://git@gitlab.com/user/my_new_campaign.git .
```
Now, we run a batch of experiments. In the parameters string, `¤` separates parameter groups and `;` separates alternatives within a group (a trailing `;` adds an empty alternative), so orchestra generates one run per combination:
```bash
$ orchestra jobs run my-cluster "" " --param¤first;second¤--flag;"
orchestra: using git 2.18.0 and lfs 2.4.2
orchestra: generating jobs
orchestra: no commit hash encountered. Continuing with HEAD ...
orchestra: parameters --param¤first;second¤--flag; encountered
orchestra: start run with parameters ' --param first --flag'
orchestra: start run with parameters ' --param first '
orchestra: start run with parameters ' --param second --flag'
orchestra: start run with parameters ' --param second '
orchestra: waiting for tasks completion
orchestra: tasks completed
```
After the runs end, you should be able to browse your results at
`https://gitlab.com/user/my_new_campaign/tree/master/excs` or, in your local repository, at `./my_new_campaign/excs`.
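Since the campaign is a plain git repository using LFS, you can also inspect the results from any other machine with standard git tooling. A minimal sketch, assuming `git` and `git-lfs` are installed:
```bash
# One-time LFS setup for your user, then clone the campaign repository;
# LFS-stored result files are fetched along with the regular files.
$ git lfs install
$ git clone ssh://git@gitlab.com/user/my_new_campaign.git
$ ls my_new_campaign/excs    # one folder per execution
```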
## Going further
As we can see in the previous examples, Orchestra handles two issues you face in an experimental campaign:
+ The storage of the results, which is done through a _campaign_ git repository.
+ The (asynchronous) execution of experiments on remote hosts, which is parameterized by profiles.
Those two issues are addressed by two lower-level tools:
+ __Expegit__: a command-line tool to manage campaign repositories
+ __Runaway__: a command-line tool to execute code on remote hosts
In essence, the __Orchestra__ tool is just a frontend that automates those two lower-level tools. Let's dive deeper into them.
### Managing experimental campaign results with Expegit
Expegit is a command-line tool that lets you use a git repository to store the results of your experimental campaign.
As you may know, git is not meant to handle large binary files, which your results will be full of. To handle those, we rely on [Large File Storage](https://git-lfs.github.com/), which avoids storing large binary result files as diffs. In essence, _Expegit_ just wraps some git commands in an application to provide a comprehensive workflow for experimental campaigns. You can think of it as something similar to gitflow, but for experimental campaigns.
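For intuition, the kind of git plumbing expegit wraps looks roughly like the following; this is only a sketch of how LFS tracking works in plain git, not the exact commands expegit issues:
```bash
# A .gitattributes entry tells git to store matching files with LFS
# instead of as regular blobs; expegit's data/lfs folders (shown below)
# rely on exactly this mechanism.
git lfs install
echo "* filter=lfs diff=lfs merge=lfs -text" > data/lfs/.gitattributes
git add data/lfs/.gitattributes
git commit -m "Track everything under data/lfs with LFS"
```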
We assume that you are writing the code of your experiment inside another repository, which we will call the
_experiment repository_. To initialize the campaign repository in a local, empty repository, simply run:
```bash
expegit init -p ssh://git@gitlab.com/user/my_experiment.git local-empty-campaign-repository
```
That's it! By doing that, you have created a few folders in your local campaign repository:
```
local-empty-campaign-repository/ # Your local campaign repository
├── xprp/ # Your experiment repository as a git submodule, with some experimental code of yours
│ ├── experiment.sh
│ └── ...
├── excs/ # An empty folder that will contain experiment executions
│ └── ...
└── ...
```
The `-p` flag pushes your changes to the origin, so if you navigate to your remote campaign repository (e.g. `https://gitlab.com/user/my_campaign`), you should see the same structure.
Now, let's create a new experiment:
```bash
expegit exec new -p 7371ecc9232fd03ea0ecece7ab9db12171ef9d6f -- param_1 --param2="test" --param3
```
Wow, what does that mean? We create a new _execution_ of the experiment, which corresponds to the experiment
repository at commit `7371ecc9232fd03ea0ecece7ab9db12171ef9d6f`. Moreover, we give a list of parameters
`param_1 --param2="test" --param3`, which can be retrieved and fed to your script at execution time. By doing
this, you have created a new folder in the `excs` directory:
```
campaign-repository/
├── excs/
│ └── 89cfd4a9-f42b-482a-9340-c5d762ea6f73 # The execution folder, which is given an identifier
│ ├── data/ # The data folder where experimental data will be stored
│ │ └── lfs/ # The special lfs folder; everything it contains is stored with LFS
│ │ └── .gitattributes
│ ├── experiment.sh # Your experimental code at 7371ecc9232fd03ea0ecece7ab9db12171ef9d6f
│ └── ...
└── ...
```
As you can see, your experimental script `experiment.sh` must be written so as to store experimental data directly in `data` for files that don't need LFS, and in `data/lfs` for files that do.
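For illustration, a hypothetical `experiment.sh` could end like this (the file names are made up):
```bash
#!/usr/bin/env bash
# ... run the actual experiment, producing metrics.csv and weights.bin ...

# Small text outputs go directly under data/: they are stored as regular
# git objects and diff nicely.
cp metrics.csv data/
# Large binary artifacts go under data/lfs/: the .gitattributes there
# makes git store them with LFS.
cp weights.bin data/lfs/
```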
You can now run your experiment on your own, retrieving the parameters from expegit:
```bash
cd campaign-repository/excs/89cfd4a9-f42b-482a-9340-c5d762ea6f73
./experiment.sh $(expegit exec params 89cfd4a9-f42b-482a-9340-c5d762ea6f73)
```
Finally, we finish the execution by running:
```bash
cd ..
expegit finish -p 89cfd4a9-f42b-482a-9340-c5d762ea6f73
```
This command deletes unnecessary files, such as all the non-result files from the experiment, and pushes
the results to the remote. If you now visit `https://gitlab.com/user/my_campaign/tree/master/excs/89cfd4a9-f42b-482a-9340-c5d762ea6f73`,
you should see the results generated by your script. Your execution is now over, and you can repeat the process :)
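Since each step is a plain command, the process scripts easily. A minimal sketch creating one execution per parameter value, assuming the commands behave as shown above (capturing the generated execution identifier is not covered here):
```bash
# Create one execution of the experiment per parameter value; each one
# can then be run and finished exactly as in the previous steps.
for value in 0.1 0.5 1.0; do
    expegit exec new -p 7371ecc9232fd03ea0ecece7ab9db12171ef9d6f -- --param="$value"
done
```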
### Executing code on remote hosts with Runaway
Runaway is a second command-line tool which runs a script on a remote host that you can reach via ssh. It
basically automates the upload of the files, the execution of the code, and the fetching of the results.
Let's take an example. Imagine we have a directory containing a python script and some modules it uses:
```
my-experiment/
├── src/
│ └── some modules ...
├── script.py
└── ...
```
To run `script.py` on our cluster, we run:
```bash
$ cd my-experiment
$ runaway cluster script.py -- --param=1 --flag
```
By doing this, we perform three things:
+ If the code is not already there, the content of the `my-experiment` folder is sent to our cluster
+ `script.py --param=1 --flag` is executed on the remote
+ The files created in the `my-experiment` folder on the cluster are fetched back to our local directory
A few remarks:
+ Unless we use a `-v` argument, the stderr, stdout and exit code of the runaway command mirror those of the
remote execution (see the sketch after this list)
+ Since only the content of the folder is sent, the resources used by our code should either be in the folder, or
already exist and be accessible on the cluster we execute on.
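Because of the first remark, runaway composes well with ordinary shell logic. A small sketch:
```bash
# runaway mirrors the remote stdout, stderr and exit code, so it can be
# scripted like any local command.
if runaway cluster script.py -- --param=1 --flag > out.log 2> err.log; then
    echo "remote run succeeded"
else
    echo "remote run failed"
fi
```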
How is the cluster parameterized, by the way? We write our own execution profiles as `.yml` files stored in
`~/.runaway`. For example, a profile to execute on our own computer would look like this:
```yaml
# Name of the ssh config to use. Must be defined in your ~/.ssh/config.
ssh_config: localhost
# Path to the host directory in which to store the code.
host_directory: /home/user/Executions
# Bash commands to execute before the script.
before_execution:
- echo Preparing Execution
- echo Executed on $HOSTNAME
- export PATH="/home/user/.local/share/anaconda3/bin:$PATH"
- chmod +x $SCRIPT_NAME
# Bash command to execute the script. The following environment variables are replaced at run time:
# + `$SCRIPT_NAME`: the file name of the script.
# + `$SCRIPT_ARGS`: the arguments of the script.
execution:
- echo Starting Execution
- source /home/user/.bashrc; $SCRIPT_NAME $SCRIPT_ARGS
# Bash commands to execute after the script.
after_execution:
- echo 'Cleaning Execution'
```
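To make the variable replacement concrete: with `$SCRIPT_NAME` set to `script.py` and `$SCRIPT_ARGS` set to `--param=1 --flag`, the `execution` stage above amounts to running the following on the host (this illustrates the substitution only, not runaway's internals):
```bash
# The execution stage of the profile above, after variable replacement:
echo Starting Execution
source /home/user/.bashrc; script.py --param=1 --flag
```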
As we can see, the remote must be accessible via ssh and configured in your `.ssh/config`. Another example of such a
profile, to run on a Slurm cluster, would be:
```yaml
# Name of the ssh config to use. Must be defined in your ~/.ssh/config.
ssh_config: cluster-ext
# Path to the host directory in which to put the code.
host_directory: /home/user/Executions
# Bash commands to execute before the script.
before_execution:
- echo Preparing Execution
- echo Executed on $HOSTNAME
- module load slurm
- module load language/intelpython/3.6
# Bash command to execute the script. The following environment variables are replaced at run time:
# + `$SCRIPT_NAME`: the file name of the script.
# + `$SCRIPT_ARGS`: the arguments of the script.
execution:
- echo Starting Execution
- srun -p long-queue --verbose $SCRIPT_NAME $SCRIPT_ARGS
# Bash commands to execute after the script.
after_execution:
- echo 'Cleaning Execution'
```
As you can imagine, sending and fetching the whole folder back and forth can be resource-consuming. Runaway offers two
ways to mitigate this:
+ Code Reuse: The code is sent to the remote as a tarball archive, which is compared with the archives already existing on the
remote. If the remote already contains this archive, the sending is skipped and the existing one is used directly.
By default, runaway cleans the whole remote execution at the end of the run (`--leave=nothing`), but you can leave either
the code or the code plus results with `--leave=code` and `--leave=everything` respectively.
+ Files Ignoring: When packing the files to be sent to the remote, files and folders can be ignored via the
`.sendignore` file. It is a simple text file containing patterns and globs, in the same way as a
`.gitignore` file. For example, if you are at the root of a git repository, adding your `.git` folder to the
`.sendignore` may be a good idea. The same ignoring can be configured for the fetching of the results in the
`.fetchignore` file; if you are in an expegit execution, you can configure it to only fetch the `data` folder,
for example (see the sketch after this list). Beware not to ignore your `.fetchignore` in your `.sendignore`!
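A minimal sketch of the two files for a git repository used with an expegit execution; the patterns are illustrative and depend on your layout:
```bash
# Don't pack version-control data when sending code to the remote.
cat > .sendignore <<'EOF'
.git
EOF
# Skip the source files when fetching, so that effectively only the
# data folder comes back. Note that .fetchignore itself is not listed
# in .sendignore, as warned above.
cat > .fetchignore <<'EOF'
src
*.py
EOF
```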
### Automating Expegit and Runaway with Orchestra
The two previous tools handle the results and the execution of a single experiment run. In an
experimental campaign, we are generally interested in automating this to run large batches of experiments.
Orchestra automates the creation and execution of expegit executions through a simple interface. You can choose
to use orchestra via the command line only, but the preferred way is the web UI:
```bash
$ orchestra -v gui
```
This launches an application that allows you to generate batches of executions, and to monitor and access the results.
Every image (plots, gifs, ...) is rendered in the execution report at `https://localhost:8088/executions/{id}`.
This allows you to quickly check the results of a single experiment. As you may expect, the results are automatically
synchronized with the remote repository.
Moreover, thanks to the use of git to store the results, the same remote repository can receive results from multiple
Orchestra instances. This allows you to execute experiments from different machines which may not have access to the
same computational resources. One such case arises when multiple people collaborate on an experimental campaign with
access to different resources. This quickly becomes relevant with clusters, since your number of simultaneous
executions may be limited by your user account. With Orchestra, you can multiply this limit by running executions from
the computers of the different people working on your campaign.
## Install
To install the orchestra tools, simply go to the [release page](https://gitlab.inria.fr/apere/orchestra/wikis/home) of the wiki.
There you'll be able to download the binaries of the three tools, which you only need to make available in your `PATH`. As of
today, only Linux and macOS binaries are available.
## Todo
Orchestra is far from complete. Here are some ideas which may be implemented in the future:
+ Automating the execution of campaign-wide analyses when new results arrive
+ Automating the generation of runs based on the existing results, using exploration, bayesian optimization or
diversity search
+ ...
## About
The Orchestra tools are written in Rust. Though this choice was driven by pure curiosity at first, the language proved
quite handy for producing reliable tools. Beyond its strong emphasis on speed and memory safety, Rust also provides a
set of nice abstractions that enforce good practices in software development (ownership, Options, Results, etc.).
Moreover, when the right library is available, writing code in Rust takes no longer than in a scripting language such as
Python. Because the language is young, such high-level libraries do not yet exist for `git` and `ssh`, which are
consequently managed through subprocesses.