swarm resource discovery
The maximum number of concurrent jobs is statically configured. This is problematic when:
- the size of the swarm changes (one node goes down)
- the qualif instance of allgo is running its own jobs
When the available resources are exceeded, new jobs fail immediately with docker.errors.APIError: 500 Server Error: Internal Server Error ("b'no resources available to schedule container'")
The controller should:
- handle this exception and reschedule the job
- detect the available resources (docker info) on startup, on errors (the above exception), and periodically.
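A minimal sketch of the two controller behaviours listed above, under stated assumptions. The names Controller, slots_from_info, is_no_resources_error, and the per-job CPU/memory sizes are hypothetical and do not come from the allgo code base. The sketch assumes that get_info returns a dict shaped like the output of docker info (for example the info() method of a docker-py client), using only the NCPU and MemTotal fields, and that start_job wraps the call that currently raises docker.errors.APIError. It matches on the error text rather than on the exception class so that it runs without docker-py installed.

```python
import time

# Text of the scheduling failure quoted in this issue.
NO_RESOURCES_MSG = "no resources available to schedule container"


def is_no_resources_error(exc):
    """Return True if the exception text carries the swarm scheduling failure."""
    return NO_RESOURCES_MSG in str(exc)


def slots_from_info(info, cpus_per_job=1, mem_per_job=2 * 1024**3):
    """Estimate how many jobs fit, from a `docker info`-style dict.

    NCPU and MemTotal are totals reported by the Docker API; the per-job
    sizes are made-up defaults used only for illustration.
    """
    return min(info["NCPU"] // cpus_per_job, info["MemTotal"] // mem_per_job)


class Controller:
    """Hypothetical job controller that discovers its own concurrency limit."""

    def __init__(self, get_info, start_job, refresh_period=60.0,
                 clock=time.monotonic):
        self.get_info = get_info
        self.start_job = start_job
        self.refresh_period = refresh_period
        self.clock = clock
        self.running = 0
        self.pending = []
        self.refresh()  # detection on startup

    def refresh(self):
        """Re-read the swarm resources and recompute the job limit."""
        self.max_jobs = slots_from_info(self.get_info())
        self.last_refresh = self.clock()

    def submit(self, job):
        self.pending.append(job)

    def job_finished(self):
        self.running -= 1

    def tick(self):
        """Start as many pending jobs as fit; call this from the main loop."""
        if self.clock() - self.last_refresh >= self.refresh_period:
            self.refresh()  # periodic detection
        while self.pending and self.running < self.max_jobs:
            job = self.pending.pop(0)
            try:
                self.start_job(job)
            except Exception as exc:
                if not is_no_resources_error(exc):
                    raise
                # Reschedule instead of failing the job, re-detect the
                # resources, and wait for the next tick before retrying.
                self.pending.insert(0, job)
                self.refresh()
                break
            self.running += 1
```

Note that docker info reports totals, not what other clients of the swarm (such as the qualif instance) are consuming, so the computed limit can still be too optimistic. That is why the sketch keeps the exception path and stops starting jobs until the next tick rather than retrying in a tight loop.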
Also, the config for BIG_MEM or LONG jobs should state the number of slots reserved for the low-demanding apps rather than for the high-demanding apps.
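One possible reading of this proposal, as a sketch: once the job limit is detected dynamically, a fixed reservation for low-demanding apps stays meaningful when the swarm shrinks or grows, while the cap on BIG_MEM or LONG jobs follows from it. The function name can_start and all of its parameters are hypothetical.

```python
def can_start(job_is_heavy, running_total, running_heavy, max_jobs,
              reserved_light):
    """Admission check when `reserved_light` slots are kept for light apps.

    `job_is_heavy` stands for a BIG_MEM or LONG job; `max_jobs` is the
    currently detected limit on concurrent jobs.
    """
    if running_total >= max_jobs:
        return False  # swarm is full for everyone
    if not job_is_heavy:
        return True  # light jobs may use any free slot
    # Heavy jobs may fill everything except the slots reserved for light apps.
    return running_heavy < max_jobs - reserved_light
```

With this shape, a swarm that drops to a single slot with one slot reserved accepts no heavy job at all, but light apps keep running.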
related to #121 (closed), #123