allgo issues
https://gitlab.inria.fr/allgo/allgo/-/issues

**short jobs vs. long jobs** (#121, BAIRE Anthony, 2018-04-10)
https://gitlab.inria.fr/allgo/allgo/-/issues/121

We have an issue with long jobs: if all workers are busy with long jobs, then no new jobs are launched.
* short jobs should have priority over long jobs
* at least two worker slots should be reserved exclusively for short jobs
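The reservation rule above can be expressed as a simple admission check. This is a hypothetical sketch (the names and signature are illustrative, not AllGo code), assuming a fixed pool of worker slots of which two are kept for short jobs:

```python
# Hypothetical sketch of the reservation rule: out of `total_slots`
# worker slots, `reserved_short` slots are usable only by short jobs.
RESERVED_SHORT = 2

def can_launch(job_is_short, running_long, running_short, total_slots,
               reserved_short=RESERVED_SHORT):
    """Return True if a new job may be launched under the reservation rule."""
    used = running_long + running_short
    if used >= total_slots:
        return False                      # no free slot at all
    if job_is_short:
        return True                       # short jobs may use any free slot
    # long jobs must always leave `reserved_short` slots free for short jobs
    return used < total_slots - reserved_short

# With 4 slots and 2 reserved, a third long job must wait,
# while a short job is still admitted:
print(can_launch(False, running_long=2, running_short=0, total_slots=4))  # False
print(can_launch(True,  running_long=2, running_short=0, total_slots=4))  # True
```

The check deliberately lets short jobs fill the whole pool; only long jobs are throttled.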
**How do we tell short and long jobs apart?**
* (smart solution) using the job inputs (files + parameters), do some machine learning to estimate the duration of the job, and feed the data to a scheduler that understands priorities
* (quick solution) ask the user which queue to use (in the job submission form), and implement a job timeout according to the priority, e.g.:
  * **interactive** -> highest priority, max duration: 1 min
  * **standard** -> normal priority, max duration: 10 min
  * **batch** -> low priority, max duration: ?? hours

**allow cancelling jobs** (#125, BAIRE Anthony, 2018-04-10)
https://gitlab.inria.fr/allgo/allgo/-/issues/125

related to #108

**store container ids into the DB** (#128, BAIRE Anthony, 2018-04-10)
https://gitlab.inria.fr/allgo/allgo/-/issues/128

The controller identifies containers by their names (prod-job-XXXX-XXXX / prod-sandbox-XXXX), which is nicely human-readable, but this is a blocker for renaming webapps/users (or destroying a webapp and recreating another one with the same name).
The container names should just be indicative (and not be treated as a key); the association should be made by storing the container ID in the database, so that it is agnostic to the webapp name.
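The ID-as-key idea can be sketched with a toy schema. This is not the actual AllGo database layout; the table and function names are hypothetical, and sqlite stands in for the real DB:

```python
# Minimal sketch (not the actual AllGo schema): key jobs by the Docker
# container ID instead of the container name.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, container_id TEXT UNIQUE)")

def register_job(job_id, container_id):
    # The container ID (not the name) is what the controller stores;
    # the name "prod-job-<id>-<webapp>" stays purely indicative.
    db.execute("INSERT INTO jobs (id, container_id) VALUES (?, ?)",
               (job_id, container_id))

def find_job(container_id):
    row = db.execute("SELECT id FROM jobs WHERE container_id = ?",
                     (container_id,)).fetchone()
    return row[0] if row else None

register_job(9188, "9320ae9f8919")
print(find_job("9320ae9f8919"))  # 9188 -- the lookup survives any webapp rename
```

Because the lookup never touches the name, renaming a webapp (or recreating one with the same name) cannot confuse the controller.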
Related to #123: any unknown container (job or sandbox) should be detected and deleted immediately.

**swarm resource discovery** (#124, BAIRE Anthony, 2018-04-10)
https://gitlab.inria.fr/allgo/allgo/-/issues/124

The maximal number of concurrent jobs is statically configured. This is problematic when:
- the size of the swarm changes (e.g. one node goes down)
- the qualif instance of allgo is running its own jobs
When the available resources are exceeded, new jobs fail immediately with `docker.errors.APIError: 500 Server Error: Internal Server Error ("b'no resources available to schedule container'")`
The controller should:
- handle this exception and reschedule the job
- detect the available resources (`docker info`) on startup, on errors (the above exception) and periodically.
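The first point (catch the error and reschedule instead of failing the job) can be sketched as a retry wrapper. This is a hypothetical illustration: `APIError` here stands in for `docker.errors.APIError`, and `start_container` is a placeholder for the controller's real launch call:

```python
# Sketch of the reschedule-on-failure idea. `APIError` stands in for
# docker.errors.APIError; `start_container` is a hypothetical launcher.
import time

class APIError(Exception):
    pass

def start_with_reschedule(start_container, retries=3, delay=0.0):
    """Call start_container(); on a 'no resources' APIError, back off and
    retry instead of failing the job immediately."""
    for attempt in range(retries):
        try:
            return start_container()
        except APIError as exc:
            if "no resources available" not in str(exc):
                raise                     # unrelated error: propagate
            time.sleep(delay)             # back off, then reschedule
    raise APIError("no resources available after %d attempts" % retries)

# Simulated swarm: the first two attempts find no free node.
attempts = []
def fake_start():
    attempts.append(1)
    if len(attempts) < 3:
        raise APIError("500 Server Error: no resources available to schedule container")
    return "container-started"

print(start_with_reschedule(fake_start))  # container-started
```

In the real controller the retry delay would be non-zero, and a periodic `docker info` poll would refresh the slot count between attempts.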
Also, the config for BIG_MEM or LONG jobs should state the number of slots reserved for the low-demanding apps rather than for the high-demanding apps.
related to #121, #123

**change the docker_name of the 'hufa' app to 'facedetector' (+ change the docker images and the directory name)** (#51, MAUPETIT Charly, 2018-03-20, 0.5)
https://gitlab.inria.fr/allgo/allgo/-/issues/51

**disable/secure docker remote API from outside** (#36, MAUPETIT Charly, 2018-03-20, 0.5)
https://gitlab.inria.fr/allgo/allgo/-/issues/36

**add TLS to docker swarm cluster** (#26, MAUPETIT Charly, 2018-03-20, assigned to MAUPETIT Charly)
https://gitlab.inria.fr/allgo/allgo/-/issues/26

**switch the infrastructure over to Anthony's work (connection to the Docker sockets)** (#53, MAUPETIT Charly, 2018-03-20, 0.5)
https://gitlab.inria.fr/allgo/allgo/-/issues/53

**monitoring of worker activity** (#54, MAUPETIT Charly, 2018-03-20)
https://gitlab.inria.fr/allgo/allgo/-/issues/54

A container had crashed and was hogging 99% of the CPU, preventing other jobs from running.

**implement logs in the ssh container** (#130, BAIRE Anthony, 2017-09-13)
https://gitlab.inria.fr/allgo/allgo/-/issues/130

To ease debugging: there are so many places where things may go wrong (sshd, pam-mysql, ssh-authorized-keys, ssh, the sandbox sshd, the sandbox root account).
A debugging flag would also be useful, because `ssh -vvv WEBAPP@ssh-allgo.inria.fr` enables verbosity only in the first ssh connection (not in the second).
See https://support.inria.fr/Ticket/Display.html?id=79762

**swarm resources not freed** (#123, BAIRE Anthony, 2017-06-28)
https://gitlab.inria.fr/allgo/allgo/-/issues/123
(to be investigated)
We observed old terminated jobs that remain in the swarm, even when the prod-controller is restarted. This means that the controller is unaware of these jobs and considers the resources available. ~~The most likely explanation is that the job was destroyed in the db while being executed (maybe through the REST API, because the job destroy button is broken).~~
~~This is very likely a duplicate of #108.~~
Update: the jobs are still present in the db; it is definitely not related to #108.
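Ghost containers like the ones in the listings below could be spotted by cross-checking the names reported by the swarm against the jobs the controller tracks. A minimal sketch, assuming the `prod-job-<id>-<webapp>` naming convention (the function name is hypothetical):

```python
# Sketch: cross-check the containers reported by `docker ps` against the
# jobs the controller knows about, to spot "ghost" containers.
import re

JOB_NAME = re.compile(r"^prod-job-(\d+)-")

def find_orphans(container_names, known_job_ids):
    """Return container names whose job id is not tracked by the controller."""
    orphans = []
    for name in container_names:
        m = JOB_NAME.match(name)
        if m and int(m.group(1)) not in known_job_ids:
            orphans.append(name)
    return orphans

names = ["prod-job-9188-gatbcompiler", "prod-job-8725-massiccc",
         "prod-swarm-node", "prod-job-8619-gatbcompiler"]
print(find_orphans(names, known_job_ids={9188}))
# ['prod-job-8725-massiccc', 'prod-job-8619-gatbcompiler']
```

Note that per #128 the matching should eventually be done on container IDs rather than names; name matching is only a stopgap for diagnosing the current state.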
<code>
root@worker0:~# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9320ae9f8919 cargo.irisa.fr:8000/allgo/prod/webapp/gatbcompiler:1.0 "/bin/sh -c '\n ..." 33 minutes ago Up 33 minutes 22/tcp prod-job-9188-gatbcompiler
8edbab79c05a swarm:1.2.6 "/swarm join --add..." 12 hours ago Up 12 hours 2375/tcp prod-swarm-node
274d849ad015 allgo/swarm-proxy "socat TCP-LISTEN:..." 12 hours ago Up 12 hours prod-swarm-proxy
17d89c96ec05 cargo.irisa.fr:8000/allgo/prod/webapp/massiccc:1.0 "/bin/sh -c '\n ..." 4 days ago Exited (255) 2 days ago 22/tcp prod-job-8725-massiccc
1d892b44bc32 6cf910dfd599 "/bin/sh -c '\n ..." 9 days ago Exited (255) 2 days ago 22/tcp prod-job-8619-gatbcompiler
</code>
<code>
root@worker1:~# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e7443b8c5867 cargo.irisa.fr:8000/allgo/prod/webapp/gatbcompiler:1.0 "/bin/sh -c '\n ..." 29 minutes ago Up 29 minutes 22/tcp prod-job-9185-gatbcompiler
d3469eddc570 cargo.irisa.fr:8000/allgo/prod/webapp/gatbcompiler:1.0 "/bin/sh -c '\n ..." 31 minutes ago Up 31 minutes 22/tcp prod-job-9179-gatbcompiler
b54ec624329c swarm:1.2.6 "/swarm join --add..." 12 hours ago Up 12 hours 2375/tcp prod-swarm-node
510beace80ff allgo/swarm-proxy "socat TCP-LISTEN:..." 12 hours ago Up 12 hours prod-swarm-proxy
d644bcf2a1b3 cargo.irisa.fr:8000/allgo/prod/webapp/massiccc:1.0 "/bin/sh -c '\n ..." 7 days ago Exited (137) 2 days ago prod-job-8691-massiccc
ce3d92c1d85e 6cf910dfd599 "/bin/sh -c '\n ..." 9 days ago Exited (137) 2 days ago prod-job-8617-gatbcompiler
</code>

**job destroy is broken** (#108, BAIRE Anthony, 2017-06-21, 0.6)
https://gitlab.inria.fr/allgo/allgo/-/issues/108

NoMethodError in Job.archived -> self.datafile no longer exists
There is a fair amount of dead code to remove here. The job destroy procedure is really not clear. Furthermore, there must be some interaction with the controller in case the job is currently running (it needs to be killed).
See also #76 (which is a different bug).

**reduce docker image size** (#46, MAUPETIT Charly, 2017-06-01)
https://gitlab.inria.fr/allgo/allgo/-/issues/46

Reduce the size of the docker images before pushing them to the registry: remove the temporary files and "squash" the docker layers.
I ran a test (going through a .tar); it can be worthwhile above a certain image size.

**Request for a RENATER certificate** (#74, Charles Deltel, 2017-06-01, 0.5)
https://gitlab.inria.fr/allgo/allgo/-/issues/74

**the 'scp' command must be available in every sandbox** (#88, BAIRE Anthony, 2017-06-01, 0.5)
https://gitlab.inria.fr/allgo/allgo/-/issues/88

**nginx container fails to start if the rails container is not yet started** (#100, BAIRE Anthony, 2017-06-01, 0.5, assigned to BAIRE Anthony)
https://gitlab.inria.fr/allgo/allgo/-/issues/100

Since the docker legacy links were replaced by docker networks, the nginx container may fail at startup (race condition) if the rails container is not yet up.
Allgo instance: backend qualif-rails:8080, frontend worker0.irisa.fr:443

<code>
nginx: [emerg] host not found in upstream "qualif-rails" in /etc/nginx/sites-enabled/default:97
nginx: configuration file /etc/nginx/nginx.conf test failed
</code>

**implement a hard (global) memory limit for the jobs** (#103, BAIRE Anthony, 2017-06-01, 0.5, assigned to BAIRE Anthony)
https://gitlab.inria.fr/allgo/allgo/-/issues/103

(to be refined later when we implement resources reservation)

**squash user -> 500 for the datastore** (#110, BAIRE Anthony, 2017-06-01, 0.5)
https://gitlab.inria.fr/allgo/allgo/-/issues/110

**upload unpushable images from sid to the registry** (#111, BAIRE Anthony, 2017-06-01, 0.5)
https://gitlab.inria.fr/allgo/allgo/-/issues/111

cargo.irisa.fr:5000/allgo/blockcluster:1.0
cargo.irisa.fr:5000/allgo/fasst:1.0
cargo.irisa.fr:5000/allgo/gatbcompiler:1.1
cargo.irisa.fr:5000/allgo/mgda:1.0
cargo.irisa.fr:5000/allgo/trackingmecasys:1.0
**backup all previous machines (sid cargo woody worker1)** (#112, BAIRE Anthony, 2017-06-01, 0.5)
https://gitlab.inria.fr/allgo/allgo/-/issues/112