controller/controller.py · 0e301e74f21e19de90087aebb191354d9ff81dcb · allgo / allgo

refactor the management of swarm/sandbox resources · 0e301e74

BAIRE Anthony authored Nov 14, 2017

- add SwarmAbstractionClient: a class that extends docker.Client and
  hides the API differences between the docker remote API and the
  swarm API, so that a single docker engine can be used like a swarm

- add SharedSwarmClient: a class that extends SwarmAbstractionClient,
  monitors the swarm health and its resources (cpu/mem), and manages
  the resource allocation.
  - resources are partitioned in groups (to allow reserving resources
    for higher priority jobs)
  - two SharedSwarmClient instances can work together over TCP in a
    master/slave configuration (to allow the production and
    qualification platforms to use the same swarm without any
    interference)

- the controller is modified to:
  - use SharedSwarmClient to:
    - wait for the end of a job (in place of DockerWatcher)
    - manage resource reservation (LONG_APPS vs. BIGMEM_APPS vs.
      normal apps) and monitor swarm health (fix #124)
    - NOTE: resources of the swarm and sandbox are now managed
      separately (2 instances of SharedSwarmClient), whereas they
      were managed globally before (which was suboptimal)
  - rely on SwarmAbstractionClient to compute the cpu quotas
  - store the container_id of jobs in the DB (fix #128); this is a
    prerequisite for renaming apps in the future
  - store the class of the job (normal vs. long app) in the container
    name (for the resource management with SharedSwarmClient)
  - read the configuration from a yaml file (/vol/ro/config.yml) for:
    - cpu/mem quotas
    - swarm resource allocation policy
    - master/slave configuration
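
The group-based resource partitioning described above could be sketched
roughly as follows. This is only an illustration, not the
SharedSwarmClient implementation: `ResourceGroup`, `GroupAllocator`, the
group names, the capacities and the policy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ResourceGroup:
    """Hypothetical CPU/memory budget set aside for some classes of jobs."""
    cpu: float            # CPU cores assigned to the group
    mem: int              # memory assigned to the group, in bytes
    used_cpu: float = 0.0
    used_mem: int = 0

    def fits(self, cpu: float, mem: int) -> bool:
        """True if the request fits in what is left of this group."""
        return (self.used_cpu + cpu <= self.cpu
                and self.used_mem + mem <= self.mem)


class GroupAllocator:
    """Hypothetical allocator: each job class draws from an ordered list of groups."""

    def __init__(self, groups, policy):
        self.groups = groups    # group name -> ResourceGroup
        self.policy = policy    # job class -> group names, tried in order

    def reserve(self, job_class, cpu, mem):
        """Charge the first group that fits; return its name, or None."""
        for name in self.policy[job_class]:
            group = self.groups[name]
            if group.fits(cpu, mem):
                group.used_cpu += cpu
                group.used_mem += mem
                return name
        return None

    def release(self, name, cpu, mem):
        """Give back resources previously charged to the named group."""
        group = self.groups[name]
        group.used_cpu -= cpu
        group.used_mem -= mem


GIB = 1024 ** 3
allocator = GroupAllocator(
    groups={
        "shared": ResourceGroup(cpu=4, mem=8 * GIB),
        "reserved": ResourceGroup(cpu=2, mem=4 * GIB),
    },
    policy={
        "long": ["shared"],                # long jobs only use the shared pool
        "normal": ["shared", "reserved"],  # normal jobs may overflow into the reserve
    },
)
```

With this assumed policy, a long job that fills the "shared" group cannot
starve normal jobs, because the "reserved" group is never offered to the
"long" class.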

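For the cpu quotas, Docker expresses a CPU limit as a CFS quota of CPU
time per scheduling period (the `cpu_quota` and `cpu_period` container
settings, both in microseconds). A minimal sketch of that arithmetic,
assuming a hypothetical helper `cpu_quota_for` that is not taken from
SwarmAbstractionClient:

```python
DEFAULT_CPU_PERIOD = 100_000  # microseconds; Docker's default CFS period


def cpu_quota_for(cpus: float, cpu_period: int = DEFAULT_CPU_PERIOD) -> int:
    """Return the CFS quota (microseconds per period) capping a container at `cpus` cores."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    # round() avoids off-by-one results caused by floating-point error
    return round(cpus * cpu_period)
```

For example, under these assumptions a job limited to 1.5 cores would be
started with `cpu_period=100000` and `cpu_quota=cpu_quota_for(1.5)`.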