Release V0.3.0

Again, this release mostly revolves around runaway. A lot of new features were implemented!

Scheduler sub-command

The most important feature of this release is the new sched sub-command, which lets a separate program generate the execution parameters to be run by runaway, based on the results of the parameters already executed. Put differently, users can now implement custom online hyper-parameter optimization or exploration procedures while relying on runaway's ability to run the experiments.

The program used as a scheduler must implement the following protocol. The scheduler and the library communicate over stdin and stdout. Every exchange is initiated by the library, which sends a request on the program's stdin. The program processes this request and answers with a response on its stdout. Requests should be processed synchronously by the program, and no particular order of requests should be assumed (any necessary book-keeping must be done on the program side). Both requests and responses are encoded as UTF-8 JSON strings.

The possible requests of the protocol are the following:

Request string (written by runaway to program stdin) | Definition
{"GET_PARAMETERS_REQUEST": {}} | Sent by runaway to request a parameter string to execute
{"RECORD_OUTPUT_REQUEST": {"parameters": "some params", "features": "[0.5, 0.5]"} } | Sent by runaway to record the output of a parameter string
{"SHUTDOWN_REQUEST": {}} | Sent by runaway to request a program shutdown

The possible responses of the protocol are the following:

Response string (read by runaway on program stdout) | Definition
{"GET_PARAMETERS_RESPONSE": {"parameters": "some params"}} | Sent by the program to answer positively to a parameter request
{"NOT_READY_RESPONSE": {}} | Sent by the program to answer negatively to a parameter request (if more records are needed, for instance)
{"RECORD_OUTPUT_RESPONSE": {}} | Sent by the program to answer positively to a record request
{"SHUTDOWN_RESPONSE": {}} | Sent by the program to answer positively to a shutdown request (just before exit)
{"ERROR_RESPONSE": {"message": "some error message" }} | Sent by the program on any error
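To make the protocol more concrete, here is a minimal sketch of what a scheduler program could look like in Python. It is only an illustration, not a reference implementation. It assumes that each request arrives as one JSON string per line, which the protocol description does not state. The names new_state, handle and serve, the max_pending limit and the --lr parameter string are all hypothetical.

```python
import json
import random
import sys


def new_state(max_pending=4):
    """Book-keeping kept on the scheduler side, as the protocol requires."""
    return {"pending": 0, "max_pending": max_pending, "records": [], "running": True}


def handle(request, state):
    """Turn one decoded request into a response dictionary."""
    kind, body = next(iter(request.items()))
    if kind == "GET_PARAMETERS_REQUEST":
        # Answer negatively while too many results are still outstanding.
        if state["pending"] >= state["max_pending"]:
            return {"NOT_READY_RESPONSE": {}}
        state["pending"] += 1
        # Hypothetical parameter string: a randomly drawn learning rate.
        parameters = "--lr={:.4f}".format(random.uniform(0.0001, 0.1))
        return {"GET_PARAMETERS_RESPONSE": {"parameters": parameters}}
    if kind == "RECORD_OUTPUT_REQUEST":
        state["records"].append((body["parameters"], body["features"]))
        state["pending"] = max(0, state["pending"] - 1)
        return {"RECORD_OUTPUT_RESPONSE": {}}
    if kind == "SHUTDOWN_REQUEST":
        state["running"] = False
        return {"SHUTDOWN_RESPONSE": {}}
    return {"ERROR_RESPONSE": {"message": "unknown request: " + kind}}


def serve(stdin, stdout):
    """Answer requests one at a time until a shutdown request arrives."""
    state = new_state()
    for line in stdin:  # assumption: one JSON request per line
        if not line.strip():
            continue
        try:
            response = handle(json.loads(line), state)
        except Exception as error:  # any failure becomes an ERROR_RESPONSE
            response = {"ERROR_RESPONSE": {"message": str(error)}}
        stdout.write(json.dumps(response) + "\n")
        stdout.flush()  # flush so the response is not held in a buffer
        if not state["running"]:
            break


if __name__ == "__main__":
    serve(sys.stdin, sys.stdout)
```

A real scheduler would replace the random draw with its own optimization or exploration logic, using the recorded (parameters, features) pairs to choose the next parameter string.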

Context Carrying Handles

Before this release, each acquired node was exposed to executions through a fixed number of identical handles (ssh connections) to the actual node. On a cluster with one node of 64 cores, 64 executions would have been granted the same handle to run their code. This was a problem, since there was no way to restrict an execution to a single core or a single GPU, for instance.

In this release, we introduce context carrying handles. In essence, every handle given to an execution carries a context: a specific environment variable, RUNAWAY_HANDLE_ID, is set to a different value for each handle. This makes it possible to tell the handles apart and, as a result, to give different resources to different executions.
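As an illustration, an experiment script could use RUNAWAY_HANDLE_ID to pin itself to one GPU. The sketch below assumes that the handle identifiers were chosen to be GPU indices, which is a decision left to the template author. The helper name gpu_environment is hypothetical, and CUDA_VISIBLE_DEVICES is the standard CUDA variable, not something runaway sets.

```python
def gpu_environment(environ):
    """Return a copy of `environ` restricted to the GPU named by the handle id."""
    handle_id = environ.get("RUNAWAY_HANDLE_ID")
    if handle_id is None:
        raise RuntimeError("RUNAWAY_HANDLE_ID is not set")
    restricted = dict(environ)
    # Assumption: the handle id is the index of the GPU this execution may use.
    restricted["CUDA_VISIBLE_DEVICES"] = handle_id
    return restricted


if __name__ == "__main__":
    # Demo with a hand-written environment instead of the real one.
    demo = gpu_environment({"RUNAWAY_HANDLE_ID": "1"})
    print(demo["CUDA_VISIBLE_DEVICES"])  # prints 1
```

In a real script, the function would be called with os.environ and the result passed to whatever launches the GPU workload.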

Templating reborn

To support context carrying handles (CCH), we had to rethink our templating approach. Templates must now follow this form:

---
name: 
  localhost
ssh_configuration: 
  localhost
node_proxycommand: 
  "ssh -A -l user localhost -W $RUNAWAY_NODE_ID:22"
start_allocation:
  - "export RUNAWAY_NODES='localhost remotehost'"
  - "export RUNAWAY_JOB_ID=911"
cancel_allocation:
  - "echo Revoke $RUNAWAY_JOB_ID"
allocation_duration: 
  1
get_node_handles:
  - "export RUNAWAY_NODE_HANDLES='1 2'"
directory: 
  /projets/flowers/alex/executions
execution:
  - $RUNAWAY_COMMAND 

The first important thing to notice is that the procedures now accept any valid bash syntax. The lines of a script are executed in sequence in the very same pty, so environment variables and the current working directory carry over from line to line. With this feature shipped, we no longer saw a need for separate before_execution and after_execution procedures. There is now a single execution procedure that holds the execution logic. Runaway automatically sets the RUNAWAY_SCRIPT_PATH and RUNAWAY_SCRIPT_ARGUMENTS variables to the actual script path and arguments.

Also, the outputs of the procedures are no longer parsed to extract information such as the job id. This information is read from environment variables instead. For instance, the start_allocation procedure must set the RUNAWAY_NODES variable to a white-space-separated list of host names, which runaway will then read.

Dynamic number of handles

The template section executions_per_nodes disappears in favor of a procedure get_node_handle, which is called on every node in RUNAWAY_NODES. During this procedure, the variable RUNAWAY_HANDLES must be set to a white-space-separated list of handle identifiers. This makes it possible to deliver a dynamic set of handles depending on the actual node: one handle per thread, one handle per GPU, and so on. Later on, when the execution procedure is called on a handle, the RUNAWAY_HANDLE_ID variable is set by the library.

Generalized environment variables

Environment variables can now be used in several places in the program:

  • First, the runaway cli captures every environment variable prefixed with RUNAWAY_ that exists in the shell it is executed from. This means, among other things, that a template can be parameterized at run-time through environment variables.
  • The context (current working directory and environment variables) is inherited between procedures. start_allocation inherits from the host shell, as explained above. cancel_allocation and get_allocation_nodes inherit from start_allocation, and execution inherits from get_allocation_nodes.
  • Runaway sets many environment variables by itself, which you can reuse later: RUNAWAY_NODE_ID, RUNAWAY_HANDLE_ID, RUNAWAY_STDOUT, RUNAWAY_STDERR, RUNAWAY_ECODE...
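The capture described in the first point amounts to a prefix filter on the environment. The following sketch shows the idea only and is not how runaway itself is implemented. The function name capture_runaway_variables is hypothetical.

```python
import os


def capture_runaway_variables(environ, prefix="RUNAWAY_"):
    """Keep only the variables whose name starts with the given prefix."""
    return {name: value for name, value in environ.items() if name.startswith(prefix)}


if __name__ == "__main__":
    # Print the variables that would be visible to the template procedures.
    for name, value in sorted(capture_runaway_variables(os.environ).items()):
        print(name + "=" + value)
```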

Remote folders

In runaway, the options --remote-folders and --remotes-file are now available to specify the location where data will be extracted on the remote.

Template strings

In runaway, the options ARGUMENTS, --arguments-file, --output-folders, --outputs-file, --remote-folders and --remotes-file now accept new-style template strings. These strings specify a product of sets using the following syntax:

  • A set of elements is written like { 'element_1'; 'element_2'}
  • The product is written as set_A + set_B + set_C

So for instance, the template string '--param1=' + {'A'; 'B'} +{' --flag'; ''} will generate the following parameter strings:

  • --param1=A --flag
  • --param1=A
  • --param1=B --flag
  • --param1=B
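The expansion above is a plain cartesian product followed by string concatenation. The sketch below reproduces it with itertools.product. It does not parse the template syntax itself: the sets are written directly as Python lists, a bare string such as '--param1=' is treated as a set with one element, and the function name expand is hypothetical.

```python
from itertools import product


def expand(*sets):
    """Concatenate one element from each set, for every combination."""
    return ["".join(combination) for combination in product(*sets)]


if __name__ == "__main__":
    # Equivalent of '--param1=' + {'A'; 'B'} + {' --flag'; ''}
    for parameters in expand(["--param1="], ["A", "B"], [" --flag", ""]):
        print(parameters)
```

The rightmost set varies fastest, which gives the same order as the list above.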

Environment interpolation

Moreover, the options --output-folders, --outputs-file, --remote-folders and --remotes-file also accept pattern strings in which runaway interpolates environment variables.

For instance, ./batch/$RUNAWAY_UUID would yield a different path for each execution (RUNAWAY_UUID is a unique identifier given to every execution):

  • ./batch/a421yjr189
  • ./batch/hg9gh34090
  • ...

This lets you do some fun sorting of results. For instance, you can sort the results by exit code using ./batch/$RUNAWAY_ECODE/$RUNAWAY_UUID. Note that an environment variable must already exist at the point where it is used. For instance, you can't use RUNAWAY_ECODE in a pattern string for --remote-folders, since that would require knowing the exit code of the execution before it was extracted and run.
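The behaviour can be imitated with Python's string.Template, which uses the same $NAME notation. This is only an analogy to show why a missing variable is a problem, not runaway's own interpolation code. The function name interpolate and the exit code value are made up for the example.

```python
from string import Template


def interpolate(pattern, variables):
    """Replace every $NAME in the pattern; a missing name raises KeyError."""
    return Template(pattern).substitute(variables)


if __name__ == "__main__":
    finished = {"RUNAWAY_UUID": "a421yjr189", "RUNAWAY_ECODE": "0"}
    print(interpolate("./batch/$RUNAWAY_ECODE/$RUNAWAY_UUID", finished))
    # Before the execution has run, the exit code is not known yet.
    not_started = {"RUNAWAY_UUID": "a421yjr189"}
    try:
        interpolate("./batch/$RUNAWAY_ECODE/$RUNAWAY_UUID", not_started)
    except KeyError as missing:
        print("cannot interpolate, missing variable:", missing)
```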

Onsite option

If you intend to use runaway from the same platform that you target for executions, you can reduce data transfer by enabling the --on-local flag. If you use this flag, the remote folders will be overridden to follow the output folder.

Enhancements

Various enhancements were made along the way:

  • Logging was refactored to bring better debugging options to the user.
  • The following release binary is now a fully static binary. You can copy paste it to any x86_64 linux platform and it should work!
  • Timers were re-implemented using a timer wheel, which gave a huge performance improvement.
  • Ctrl-C handling was broken, possibly from its very beginning (it was working at some point, but probably because of some side effect :s). It is now working as expected.
  • And some more!

Binary: