  • V0.3.0
    Release V0.3.0


    Again, this release mostly revolves around runaway. A lot of new features were implemented!

    Scheduler sub-command

    The most important feature in this release is the new sched sub-command. It lets a separate program generate the parameters that runaway executes next, based on the results of the parameters already executed. Put differently, users can implement custom online hyper-parameter optimization or exploration procedures while relying on runaway to run the experiments.

    The program used as scheduler must implement the following protocol. The scheduler and the library communicate over stdin and stdout. The library initiates every exchange by writing a request to the program's stdin; the program handles the request and answers with a response on its stdout. Requests should be handled synchronously, and no particular order of requests should be assumed (any necessary book-keeping must be done on the program side). Both requests and responses are encoded as UTF-8 JSON strings.

    The possible requests are the following. Each request string (written by runaway to the program's stdin) is followed by its definition:

    • {"GET_PARAMETERS_REQUEST": {}}
      Sent by runaway to request a parameter string to execute.
    • {"RECORD_OUTPUT_REQUEST": {"parameters": "some params", "features": "[0.5, 0.5]"} }
      Sent by runaway to record the output of a parameter string.
    • {"SHUTDOWN_REQUEST": {}}
      Sent by runaway to request a program shutdown.

    The possible responses are the following. Each response string (read by runaway on the program's stdout) is followed by its definition:

    • {"GET_PARAMETERS_RESPONSE": {"parameters": "some params"}}
      Sent by the program to answer positively to a parameter request.
    • {"NOT_READY_RESPONSE": {}}
      Sent by the program to answer negatively to a parameter request (if more records are needed, for instance).
    • {"RECORD_OUTPUT_RESPONSE": {}}
      Sent by the program to answer positively to a record request.
    • {"SHUTDOWN_RESPONSE": {}}
      Sent by the program to answer positively to a shutdown request (just before exit).
    • {"ERROR_RESPONSE": {"message": "some error message" }}
      Sent by the program on any error.
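
The request/response exchange above can be sketched as a small Python program. This is only an illustration under stated assumptions, not a reference scheduler: the candidate list PENDING, the names handle_request and main, and the one-request-per-line framing are all hypothetical, since the notes do not say how messages are delimited.

```python
import json
import sys

# Hypothetical candidates; a real scheduler would derive these from RECORDS.
PENDING = ["--lr=0.1", "--lr=0.01"]
RECORDS = []


def handle_request(line):
    """Return the JSON response string for one JSON request string."""
    try:
        request = json.loads(line)
        # Each request is an object with a single key naming the request kind.
        (kind, payload), = request.items()
        if kind == "GET_PARAMETERS_REQUEST":
            if PENDING:
                response = {"GET_PARAMETERS_RESPONSE": {"parameters": PENDING.pop(0)}}
            else:
                response = {"NOT_READY_RESPONSE": {}}
        elif kind == "RECORD_OUTPUT_REQUEST":
            RECORDS.append((payload["parameters"], payload["features"]))
            response = {"RECORD_OUTPUT_RESPONSE": {}}
        elif kind == "SHUTDOWN_REQUEST":
            response = {"SHUTDOWN_RESPONSE": {}}
        else:
            response = {"ERROR_RESPONSE": {"message": "unknown request: " + kind}}
    except (ValueError, AttributeError, KeyError, TypeError) as error:
        # Malformed JSON or a missing field is reported through the protocol.
        response = {"ERROR_RESPONSE": {"message": repr(error)}}
    return json.dumps(response)


def main():
    # Assumption: one request per line on stdin, one response per line on stdout.
    for line in sys.stdin:
        if not line.strip():
            continue
        response = handle_request(line)
        sys.stdout.write(response + "\n")
        sys.stdout.flush()
        if "SHUTDOWN_RESPONSE" in json.loads(response):
            break
```

Calling main() at the end of the file would turn the sketch into a program that answers requests synchronously and keeps all book-keeping (PENDING, RECORDS) on its own side, as the protocol requires.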

    Context Carrying Handles

    Before this release, for any acquired node, a fixed number of identical handles to the actual node (ssh connections) were used by executions. Assuming a cluster with one node of 64 cores, 64 executions would have been granted the same handle to run their code. This was a problem, since there was no way to restrict one execution to a single core or a single GPU, for instance.

    In this release, we introduce context carrying handles. In essence, every handle given to an execution carries a context: a specific environment variable, RUNAWAY_HANDLE_ID, is set to a different value for each handle. This makes it possible to tell the handles apart, and therefore to give different resources to different executions.
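
As an illustration of how an execution could use this variable, the Python sketch below pins a process to one GPU. It assumes that the handle identifiers are GPU indices such as "0" or "1", which is a choice made by the template author and not something runaway imposes; the function name environment_for_handle is hypothetical. CUDA_VISIBLE_DEVICES is the standard CUDA variable for restricting visible devices.

```python
def environment_for_handle(environ):
    """Return a copy of `environ` that restricts the process to one GPU."""
    handle_id = environ["RUNAWAY_HANDLE_ID"]
    pinned = dict(environ)
    # Assumption: the handle identifier is a GPU index, e.g. "0" or "1".
    pinned["CUDA_VISIBLE_DEVICES"] = str(int(handle_id))
    return pinned
```

A wrapper script could pass the result as the env argument of subprocess.run when launching the actual experiment.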

    Templating reborn

    To support context carrying handles, we had to rethink our templating approach. The templates must now follow this form:

    ---
    name: 
      localhost
    ssh_configuration: 
      localhost
    node_proxycommand: 
      "ssh -A -l user localhost -W $RUNAWAY_NODE_ID:22"
    start_allocation:
      - "export RUNAWAY_NODES='localhost remotehost'"
      - "export RUNAWAY_JOB_ID=911"
    cancel_allocation:
      - "echo Revoke $RUNAWAY_JOB_ID"
    allocation_duration: 
      1
    get_node_handles:
      - "export RUNAWAY_NODE_HANDLES='1 2'"
    directory: 
      /projets/flowers/alex/executions
    execution:
      - $RUNAWAY_COMMAND 

    The first important thing to notice is that the procedures now accept any valid bash syntax. The lines of a script are executed in sequence in the same pty, so environment variables and the current working directory carry over from one line to the next. With this feature in place, we no longer saw a need for the before_execution and after_execution procedures; only the execution procedure remains, and it represents the whole execution logic. Runaway automatically sets the RUNAWAY_SCRIPT_PATH and RUNAWAY_SCRIPT_ARGUMENTS variables to the actual script path and arguments. Also, the outputs of the procedures are no longer parsed to extract information such as the job id. That information is read from environment variables instead. For instance, the start_allocation procedure must set the RUNAWAY_NODES variable to a whitespace-separated list of host names, which runaway then reads.

    Dynamic number of handles

    The template section executions_per_nodes is replaced by a procedure get_node_handle, which is called on every node in RUNAWAY_NODES. During this procedure, the variable RUNAWAY_HANDLES is set to a whitespace-separated list of handle identifiers. This makes it possible to deliver a dynamic set of handles that depends on the actual node: one handle per thread, or one handle per GPU, and so on. Later on, when the execution procedure is called on a handle, the library sets the RUNAWAY_HANDLE_ID variable.

    Generalized environment variables

    Environment variables can now be used in several places in the program:

    • First, the runaway cli captures every environment variable prefixed with RUNAWAY_ that exists in the shell it is executed from. This means, among other things, that a template can be parameterized at run-time through environment variables.
    • The context (current working directory and environment variables) is inherited between procedures. start_allocation inherits from the host shell, as explained above. cancel_allocation and get_allocation_nodes inherit from start_allocation, and execution inherits from get_allocation_nodes.
    • Runaway sets many environment variables by itself, which you can reuse later: RUNAWAY_NODE_ID, RUNAWAY_HANDLE_ID, RUNAWAY_STDOUT, RUNAWAY_STDERR, RUNAWAY_ECODE...
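
The capture and inheritance rules above can be pictured with a short Python sketch. It only models the behaviour described in these notes and is not runaway's implementation; the function names and the variable RUNAWAY_PARTITION are hypothetical.

```python
def capture_runaway_variables(environ, prefix="RUNAWAY_"):
    """Keep only the variables whose name starts with the given prefix."""
    return {name: value for name, value in environ.items() if name.startswith(prefix)}


def inherit(parent_context, exported):
    """Context of the next procedure: the parent's variables plus new exports."""
    child = dict(parent_context)
    child.update(exported)
    return child


# RUNAWAY_PARTITION is a made-up variable set by the user in the host shell.
host_shell = {"HOME": "/home/user", "RUNAWAY_PARTITION": "gpu"}
start_allocation = inherit(capture_runaway_variables(host_shell), {"RUNAWAY_JOB_ID": "911"})
```

Here start_allocation ends up holding both the variable captured from the host shell and the one exported by the procedure itself, which is what later procedures would inherit.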

    Remote folders

    In runaway, the options --remote-folders and --remotes-file are now available to specify the location where data will be extracted on the remote.

    Template strings

    In runaway, the options ARGUMENTS, --arguments-file, --output-folders, --outputs-file, --remote-folders and --remotes-file now accept new-style template strings. These strings specify a product of sets using the following syntax:

    • A set of elements is written as { 'element_1'; 'element_2'}
    • The product is written as set_A + set_B + set_C

    So for instance, the template string '--param1=' + {'A'; 'B'} + {' --flag'; ''} will generate the following parameter strings:

    • --param1=A --flag
    • --param1=A
    • --param1=B --flag
    • --param1=B
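
The expansion above is a Cartesian product followed by concatenation. The Python sketch below reproduces it with itertools.product; it does not parse the template syntax itself, and the list-of-lists representation and the name expand are assumptions made for the illustration.

```python
from itertools import product


def expand(sets):
    """Concatenate one element from each set, for every combination."""
    return ["".join(parts) for parts in product(*sets)]


# '--param1=' + {'A'; 'B'} + {' --flag'; ''}, written as a list of sets.
# A bare string in the template is treated as a set with one element.
template = [["--param1="], ["A", "B"], [" --flag", ""]]

for parameters in expand(template):
    print(repr(parameters))
```

The number of generated strings is the product of the set sizes, here 1 x 2 x 2 = 4.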

    Environment interpolation

    Moreover, the options --output-folders, --outputs-file, --remote-folders and --remotes-file can also accept pattern strings for which runaway will interpolate environment variables.

    For instance, ./batch/$RUNAWAY_UUID would yield a different path for each execution (RUNAWAY_UUID is a unique identifier given to every execution):

    • ./batch/a421yjr189
    • ./batch/hg9gh34090
    • ...

    This lets you sort results in convenient ways. For instance, you can sort the results by exit code using ./batch/$RUNAWAY_ECODE/$RUNAWAY_UUID. Note that the environment variable must exist at the point where you use it. For instance, you can't use RUNAWAY_ECODE in a pattern string for --remote-folders, since that would require knowing the exit code of the execution before it was extracted and run.
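
The interpolation and its "variable must exist" constraint can be mimicked in Python with string.Template, whose substitute method raises KeyError when a variable is missing. This is only an analogy for the behaviour described here, not how runaway implements it; the function name interpolate is hypothetical.

```python
from string import Template


def interpolate(pattern, environ):
    """Replace $NAME placeholders; fail if a variable is not defined yet."""
    return Template(pattern).substitute(environ)


# After the execution has finished, both variables are available.
finished = {"RUNAWAY_UUID": "a421yjr189", "RUNAWAY_ECODE": "0"}
output_folder = interpolate("./batch/$RUNAWAY_ECODE/$RUNAWAY_UUID", finished)

# Before the execution has run, RUNAWAY_ECODE does not exist.
try:
    interpolate("./batch/$RUNAWAY_ECODE", {"RUNAWAY_UUID": "a421yjr189"})
    remote_folder_ok = True
except KeyError:
    remote_folder_ok = False
```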

    Onsite option

    If you intend to use runaway from the same platform that you target for executions, you can reduce data transfer by enabling the --on-local flag. If you use this flag, the remote folders will be overridden to follow the output folder.

    Enhancement

    Various enhancements were made along the way:

    • Logging was refactored to give the user better debugging options.
    • The following release binary is now a fully static binary. You can copy it to any x86_64 linux platform and it should work!
    • Timers were re-implemented using a timer wheel, which gave a huge improvement in performance.
    • Ctrl-C handling was broken, possibly from the very beginning (it worked at some point, but probably only because of some side effect). It now works as expected.
    • And some more!

    Binary:

  • V0.2.0
    Release V0.2.0

    Release 0.2.0

    This release mainly revolves around Runaway. Several features that were asked for were brought in:

    • Ctrl-C handling: If you press Ctrl-C during an allocation, the allocation is now revoked before runaway closes.
    • Interactive output: When running a remote command with runaway, the outputs (stderr and stdout) are now printed in the terminal as they appear on the remote end. This is closer to what users are used to when working with ssh.
    • Shell completion: Completion for runaway was added. It can be generated by running runaway install-completion. Only bash and zsh are supported for now. The zsh completion can even complete profile names.
    • Leave tars option: To debug the .*ignore files transfer optimization, you can use the --leave-tars option, which leaves the transfer files on the local end. This lets you inspect the size and content of the send and fetch transfers.
    • Config indent: A lot of editors transform tabs into spaces, which broke the profiles. The config parser now accepts any combination of spaces and tabs as indentation for clauses.
    • Automatic fetchignore: To optimize the transfers, you can now either write both a .sendignore and a .fetchignore yourself, or write only the .sendignore and let runaway derive the .fetchignore. Importantly, in this scenario all the result files must be listed (ignored) in the .sendignore for the automatic generation to work well.

    A few bug fixes and changes:

    • The default is now to remove all data on the remote end after execution. This avoids clutter on the remote end when users do not clean it up themselves.
    • A bug in the ssh exec command was addressed. An overflow of the receive window occurred on call to flush() in asynchronous exec. When executing a large number of concurrent execs on the same remote, this caused several commands to observe a Transport Read error. This was fixed.
    • The number of threads used to drive the futures is now set to the number of threads on the computer, rather than 1.

    Binaries:

  • V0.1.2
    Release V0.1.2

    Hotfix 0.1.2

    • When using the runaway batch sub-command, the parameters string was not taken into account, due to a typo in the name of the argument. This is now fixed.
    • The ; character was used in the parameters string to separate set blocks, but it was meant to be &. This is now the case.

    Binaries:

  • V0.1.1
    Release V0.1.1

    Hotfix 0.1.1

    • When using the runaway exec sub-command, we check for the existence of a folder named after the send archive hash on the remote end. The wrong path was being checked, which caused an error. This hotfix corrects the path.

    Binaries:

  • V0.1.0
    Release V0.1.0

    Release 0.1.0

    Most of the work for this version was focused on a new implementation of the backbone library liborchestra featuring:

    • A shift to futures-based concurrency in the whole codebase. From the ssh connection to the resource allocation, slot acquisition, and repository interactions, every blocking operation involved in an execution was made non-blocking. This makes it possible to run as many executions concurrently as the different resources (ssh connections, schedulers, nodes) allow, rather than as many as the task scheduler allows.

    • A concurrent model of cluster schedulers. The scheduling of executions, which was once deferred to remote schedulers such as Slurm, is now managed concurrently from the library. This allows the library to substitute for the platform queue, locally and concurrently.

    • A fine-grained slot acquisition model. The placement of execution processes on nodes, which was once deferred to the scheduler, is now managed by the library. This means that we can acquire 10 nodes and place any number (e.g. the number of threads) of execution processes on every single node, independently and concurrently. This allows for a much more intensive use of the acquired resources, while retaining the benefit of managing executions one by one.

    • Much more :)

    We also implemented a new version of runaway-cli using it.

    Binaries: