# allgo issues
https://gitlab.inria.fr/allgo/allgo/-/issues (updated 2022-05-16)

## Add init script for runners
https://gitlab.inria.fr/allgo/allgo/-/issues/236 (BERJON Matthieu, updated 2022-05-16)

Making an init script available to runner users would be useful in case the runner machine reboots.

## timeout when pushing big images
https://gitlab.inria.fr/allgo/allgo/-/issues/172 (BAIRE Anthony, updated 2018-06-07)

When a user commits a sandbox, the controller pushes the new image to the registry, but if the image has lots of changes (~hundreds of MB or maybe GB), we experience push failures because of socket timeouts.
```
2018-Jun-06 16:45:11 ERROR controller unable to push version id 219 (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/response.py", line 186, in read
data = self._fp.read(amt)
File "/usr/lib/python3.4/http/client.py", line 500, in read
return super(HTTPResponse, self).read(amt)
File "/usr/lib/python3.4/http/client.py", line 529, in readinto
return self._readinto_chunked(b)
File "/usr/lib/python3.4/http/client.py", line 614, in _readinto_chunked
chunk_left = self._read_next_chunk_size()
File "/usr/lib/python3.4/http/client.py", line 552, in _read_next_chunk_size
line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib/python3.4/socket.py", line 371, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/allgo-docker/controller.py", line 101, in report_error
yield
File "/opt/allgo-docker/controller.py", line 1390, in _process
yield from self.run_in_executor(docker_check_error, self.ctrl.sandbox.push, image, tag)
File "/opt/allgo-docker/controller.py", line 336, in run_in_executor
return (yield from run())
File "/usr/lib/python3.4/asyncio/tasks.py", line 472, in _wait_for_one
return f.result() # May raise f.exception().
File "/usr/lib/python3.4/asyncio/futures.py", line 277, in result
raise self._exception
File "/usr/lib/python3.4/concurrent/futures/thread.py", line 54, in run
result = self.fn(*self.args, **self.kwargs)
File "/opt/allgo-docker/controller.py", line 76, in docker_check_error
for elem in func(*k, stream=True, **kw):
File "/usr/lib/python3/dist-packages/docker/client.py", line 217, in _stream_helper
data = reader.read(1)
File "/usr/lib/python3/dist-packages/urllib3/response.py", line 201, in read
raise ReadTimeoutError(self._pool, None, 'Read timed out.')
urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.
2018-Jun-06 16:45:11 ERROR controller task <controller.PushManager object at 0x7fe9b52b9dd8> 219 unhandled exception
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/response.py", line 186, in read
data = self._fp.read(amt)
File "/usr/lib/python3.4/http/client.py", line 500, in read
return super(HTTPResponse, self).read(amt)
File "/usr/lib/python3.4/http/client.py", line 529, in readinto
return self._readinto_chunked(b)
File "/usr/lib/python3.4/http/client.py", line 614, in _readinto_chunked
chunk_left = self._read_next_chunk_size()
File "/usr/lib/python3.4/http/client.py", line 552, in _read_next_chunk_size
line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib/python3.4/socket.py", line 371, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/allgo-docker/controller.py", line 286, in _done
hnd.cur.result()
File "/usr/lib/python3.4/asyncio/futures.py", line 277, in result
raise self._exception
File "/usr/lib/python3.4/asyncio/tasks.py", line 235, in _step
result = coro.send(value)
File "/opt/allgo-docker/controller.py", line 1390, in _process
yield from self.run_in_executor(docker_check_error, self.ctrl.sandbox.push, image, tag)
File "/opt/allgo-docker/controller.py", line 336, in run_in_executor
return (yield from run())
File "/usr/lib/python3.4/asyncio/tasks.py", line 472, in _wait_for_one
return f.result() # May raise f.exception().
File "/usr/lib/python3.4/asyncio/futures.py", line 277, in result
raise self._exception
File "/usr/lib/python3.4/concurrent/futures/thread.py", line 54, in run
result = self.fn(*self.args, **self.kwargs)
File "/opt/allgo-docker/controller.py", line 76, in docker_check_error
for elem in func(*k, stream=True, **kw):
File "/usr/lib/python3/dist-packages/docker/client.py", line 217, in _stream_helper
data = reader.read(1)
File "/usr/lib/python3/dist-packages/urllib3/response.py", line 201, in read
raise ReadTimeoutError(self._pool, None, 'Read timed out.')
urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.
2018-Jun-06 16:45:11 DEBUG controller task scheduled <controller.PushManager object at 0x7fe9b52b9dd8> 219
2018-Jun-06 16:45:11 INFO controller push from the sandbox cargo.irisa.fr:8000/allgo/prod/webapp/magritpoc:1
```
Fortunately the controller is resilient and retries the push immediately (and users do not complain ;-)). However, the upload takes a long time to complete (e.g. ~40 min for magritpoc:1).
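That retry behaviour, together with the client-side timeout that can only be set globally, can be sketched as follows. The helper names and the 600-second value are illustrative, not the actual controller code:

```python
import time

def make_push_client(timeout=600):
    """Build the low-level docker-py client with a generous client-wide
    timeout (docker-py offers no per-request override)."""
    import docker  # requires docker-py; imported lazily in this sketch
    return docker.APIClient(base_url="unix://var/run/docker.sock",
                            timeout=timeout)

def push_with_retry(push, image, tag, attempts=5, delay=1.0):
    """Retry `push(image, tag)` on transient errors, mirroring the
    controller's observed behaviour (hypothetical helper)."""
    last_exc = None
    for _ in range(attempts):
        try:
            return push(image, tag)
        except Exception as exc:  # e.g. urllib3.exceptions.ReadTimeoutError
            last_exc = exc
            time.sleep(delay)
    raise last_exc
```

The log excerpt below shows the repeated retries until the push finally goes through.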
```
2018-Jun-06 16:38:07 INFO controller sandbox 'magritpoc' is now in state 'IDLE'
2018-Jun-06 16:45:11 ERROR controller unable to push version id 219 (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 16:45:11 ERROR controller task <controller.PushManager object at 0x7fe9b52b9dd8> 219 unhandled exception
2018-Jun-06 16:45:11 INFO controller push from the sandbox cargo.irisa.fr:8000/allgo/prod/webapp/magritpoc:1
2018-Jun-06 16:46:12 ERROR controller unable to push version id 219 (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 16:46:12 ERROR controller task <controller.PushManager object at 0x7fe9b52b9dd8> 219 unhandled exception
2018-Jun-06 16:46:12 ERROR controller unable to pull version id 219 to swarm (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 16:46:12 ERROR controller unable to pull version id 219 to swarm (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 16:46:12 INFO controller push from the sandbox cargo.irisa.fr:8000/allgo/prod/webapp/magritpoc:1
2018-Jun-06 16:47:12 ERROR controller unable to push version id 219 (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 16:47:12 ERROR controller task <controller.PushManager object at 0x7fe9b52b9dd8> 219 unhandled exception
2018-Jun-06 16:47:12 ERROR controller unable to pull version id 219 to swarm (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 16:47:12 INFO controller push from the sandbox cargo.irisa.fr:8000/allgo/prod/webapp/magritpoc:1
2018-Jun-06 16:54:55 ERROR controller unable to push version id 219 (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 16:54:55 ERROR controller task <controller.PushManager object at 0x7fe9b52b9dd8> 219 unhandled exception
2018-Jun-06 16:54:55 ERROR controller unable to pull version id 219 to swarm (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 16:54:55 INFO controller push from the sandbox cargo.irisa.fr:8000/allgo/prod/webapp/magritpoc:1
2018-Jun-06 16:55:58 ERROR controller unable to push version id 219 (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 16:55:58 ERROR controller task <controller.PushManager object at 0x7fe9b52b9dd8> 219 unhandled exception
2018-Jun-06 16:55:58 ERROR controller unable to pull version id 219 to swarm (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 17:12:05 INFO controller push from the sandbox cargo.irisa.fr:8000/allgo/prod/webapp/magritpoc:1
2018-Jun-06 17:13:14 ERROR controller unable to push version id 219 (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 17:13:14 ERROR controller task <controller.PushManager object at 0x7fe9b52b9dd8> 219 unhandled exception
2018-Jun-06 17:13:14 ERROR controller unable to pull version id 219 to swarm (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 17:13:14 INFO controller push from the sandbox cargo.irisa.fr:8000/allgo/prod/webapp/magritpoc:1
2018-Jun-06 17:14:14 ERROR controller unable to push version id 219 (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 17:14:14 ERROR controller task <controller.PushManager object at 0x7fe9b52b9dd8> 219 unhandled exception
2018-Jun-06 17:14:14 ERROR controller unable to pull version id 219 to swarm (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 17:14:14 INFO controller push from the sandbox cargo.irisa.fr:8000/allgo/prod/webapp/magritpoc:1
2018-Jun-06 17:15:14 ERROR controller unable to push version id 219 (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 17:15:14 ERROR controller task <controller.PushManager object at 0x7fe9b52b9dd8> 219 unhandled exception
2018-Jun-06 17:15:14 ERROR controller unable to pull version id 219 to swarm (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 17:15:14 INFO controller push from the sandbox cargo.irisa.fr:8000/allgo/prod/webapp/magritpoc:1
2018-Jun-06 17:16:14 ERROR controller unable to push version id 219 (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 17:16:14 ERROR controller task <controller.PushManager object at 0x7fe9b52b9dd8> 219 unhandled exception
2018-Jun-06 17:16:14 ERROR controller unable to pull version id 219 to swarm (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 17:16:15 INFO controller push from the sandbox cargo.irisa.fr:8000/allgo/prod/webapp/magritpoc:1
2018-Jun-06 17:17:15 ERROR controller unable to push version id 219 (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 17:17:15 ERROR controller task <controller.PushManager object at 0x7fe9b52b9dd8> 219 unhandled exception
2018-Jun-06 17:17:15 ERROR controller unable to pull version id 219 to swarm (urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.)
2018-Jun-06 17:17:15 INFO controller push from the sandbox cargo.irisa.fr:8000/allgo/prod/webapp/magritpoc:1
2018-Jun-06 17:17:44 INFO controller pull to the swarm cargo.irisa.fr:8000/allgo/prod/webapp/magritpoc:1
```
Increasing the timeout of docker-py requests (in the APIClient constructor) should be sufficient. Note: it is not possible to tune the timeout on a per-request basis (the timeout is global :-/).

## implement logs in the ssh container
https://gitlab.inria.fr/allgo/allgo/-/issues/130 (BAIRE Anthony, updated 2017-09-13)

To ease debugging: there are so many places where things may go wrong (sshd, pam-mysql, ssh-authorized-keys, ssh, the sandbox sshd, the sandbox root account).
A debugging flag would also be useful, because `ssh -vvv WEBAPP@ssh-allgo.inria.fr` enables verbosity only in the first ssh connection (not in the second).
See https://support.inria.fr/Ticket/Display.html?id=79762

## make allgo reachable from both domains: allgo.inria.fr & allgo.irisa.fr
https://gitlab.inria.fr/allgo/allgo/-/issues/129 (BAIRE Anthony, updated 2018-09-12)

After discussion, it does seem logical for both domain names to work in parallel (rather than with a redirection, as is currently the case):
- allgo.irisa.fr
- allgo.inria.fr
Both would point to the same infrastructure, and therefore to the same pages.
We would add the INRIA, CNRS and RENNES 1 logos to the footer of each.
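Serving both domains from one instance could rely on Django's sites framework (mentioned in the update below); a configuration sketch, with illustrative site ids:

```python
# settings.py (fragment, sketch): enable the sites framework
INSTALLED_APPS = [
    "django.contrib.staticfiles",  # ... existing apps (illustrative)
    "django.contrib.sites",
]

# One Site row per domain would then be created (data migration or admin):
#   Site(id=1, domain="allgo.inria.fr", name="allgo (Inria)")
#   Site(id=2, domain="allgo.irisa.fr", name="allgo (Irisa)")
#
# and request-dependent bits (footer logos, absolute URLs) resolved with:
#   from django.contrib.sites.shortcuts import get_current_site
#   site = get_current_site(request)  # matched against the Host header
```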
**Update 2018-09:** the django [sites framework](https://docs.djangoproject.com/en/2.1/ref/contrib/sites/#module-django.contrib.sites) should be suitable for implementing this (each app would be associated with one site_id).

## fix webapp deletion vs. user deletion vs. job deletion
https://gitlab.inria.fr/allgo/allgo/-/issues/127 (BAIRE Anthony, updated 2022-05-16)

The webapp-job, webapp-user and user-job associations have `dependent: :delete_all` or `dependent: :destroy`, which is problematic because:
* if a webapp is deleted by its owner, then all its jobs are immediately deleted, which is wrong because jobs belonging to other users should remain on the server (for one month)
* datastore files are not deleted in the process
* running jobs are ignored in the process
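A soft-deletion scheme addressing these points could look like the following sketch (the field names and dict-based records are hypothetical, not the actual schema):

```python
def soft_delete_webapp(webapp):
    """Mark a webapp DELETED and free its short_name for reuse
    (sketch: the renaming convention is illustrative)."""
    webapp["state"] = "DELETED"
    webapp["short_name"] = "{}--deleted-{}".format(webapp["short_name"],
                                                   webapp["id"])

def purgeable(webapp, jobs):
    """A DELETED webapp row may be dropped once no job references it."""
    return (webapp["state"] == "DELETED"
            and not any(j["webapp_id"] == webapp["id"] for j in jobs))
```

A periodic task would then drop the rows for which `purgeable` becomes true, once the jobs' retention period has expired.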
Deleted webapps and users should be kept in the database (with a status `DELETED`) until all jobs referring to them are purged. The short_name should also be changed in the process, so that it can be reused for another app.

## swarm resources not freed
https://gitlab.inria.fr/allgo/allgo/-/issues/123 (BAIRE Anthony, updated 2017-06-28)

(to be investigated)
We observed old terminated jobs that remain in the swarm, even when the prod-controller is restarted. This means that the controller is unaware of these jobs and considers the resources available. ~~The most likely explanation is that the job was destroyed in the db while being executed (maybe through the REST API, because the job destroy button is broken).~~
~~This is very likely a duplicate of #108.~~
Update: the jobs are still present in the db, it is definitely not related to #108
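A periodic reconciliation sweep could surface such leaked containers by diffing each worker's `prod-job-<id>-<name>` containers against the jobs the controller knows about (sketch; the naming convention follows the listings below):

```python
import re

JOB_NAME = re.compile(r"^prod-job-(\d+)-")

def leaked_jobs(container_names, known_job_ids):
    """Return job ids present on a worker but unknown to the controller
    (hypothetical helper; assumes the prod-job-<id>-<app> naming)."""
    seen = set()
    for name in container_names:
        m = JOB_NAME.match(name)
        if m:
            seen.add(int(m.group(1)))
    return sorted(seen - set(known_job_ids))
```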
```
root@worker0:~# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9320ae9f8919 cargo.irisa.fr:8000/allgo/prod/webapp/gatbcompiler:1.0 "/bin/sh -c '\n ..." 33 minutes ago Up 33 minutes 22/tcp prod-job-9188-gatbcompiler
8edbab79c05a swarm:1.2.6 "/swarm join --add..." 12 hours ago Up 12 hours 2375/tcp prod-swarm-node
274d849ad015 allgo/swarm-proxy "socat TCP-LISTEN:..." 12 hours ago Up 12 hours prod-swarm-proxy
17d89c96ec05 cargo.irisa.fr:8000/allgo/prod/webapp/massiccc:1.0 "/bin/sh -c '\n ..." 4 days ago Exited (255) 2 days ago 22/tcp prod-job-8725-massiccc
1d892b44bc32 6cf910dfd599 "/bin/sh -c '\n ..." 9 days ago Exited (255) 2 days ago 22/tcp prod-job-8619-gatbcompiler
```
```
root@worker1:~# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e7443b8c5867 cargo.irisa.fr:8000/allgo/prod/webapp/gatbcompiler:1.0 "/bin/sh -c '\n ..." 29 minutes ago Up 29 minutes 22/tcp prod-job-9185-gatbcompiler
d3469eddc570 cargo.irisa.fr:8000/allgo/prod/webapp/gatbcompiler:1.0 "/bin/sh -c '\n ..." 31 minutes ago Up 31 minutes 22/tcp prod-job-9179-gatbcompiler
b54ec624329c swarm:1.2.6 "/swarm join --add..." 12 hours ago Up 12 hours 2375/tcp prod-swarm-node
510beace80ff allgo/swarm-proxy "socat TCP-LISTEN:..." 12 hours ago Up 12 hours prod-swarm-proxy
d644bcf2a1b3 cargo.irisa.fr:8000/allgo/prod/webapp/massiccc:1.0 "/bin/sh -c '\n ..." 7 days ago Exited (137) 2 days ago prod-job-8691-massiccc
ce3d92c1d85e 6cf910dfd599 "/bin/sh -c '\n ..." 9 days ago Exited (137) 2 days ago prod-job-8617-gatbcompiler
```

## allow replaying jobs
https://gitlab.inria.fr/allgo/allgo/-/issues/115 (BAIRE Anthony, updated 2022-05-16)

Put a button "replay this job" on jobs#show and on the webapp#show demo panel (which is actually an example rather than a demo); useful for relaunching a job that failed (e.g. when debugging a sandbox).
prerequisites
* we need a way to recover the original input files in case the app modifies them (related to #219)
~~side note:~~
* ~~the output files of the demo should be generated by allgo rather than uploaded by the developer~~ (may no longer be relevant with jupyter)

## OOM killer logging
https://gitlab.inria.fr/allgo/allgo/-/issues/106 (BAIRE Anthony, updated 2018-09-12)

possible solutions:
- parse dmesg
- wait all processes (reaper) -> will catch all unwaited children, but useless when the parent process uses waitpid() and ignores its result
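The dmesg-parsing option could start from a sketch like this (kernel message formats vary across versions, so the pattern here is a guess and would need checking against the actual workers):

```python
import re

# Typical kernel OOM line (format varies by kernel version):
#   Out of memory: Kill process 12345 (myapp) score 567 or sacrifice child
OOM_LINE = re.compile(r"Out of memory: Kill process (\d+) \(([^)]+)\)")

def oom_victims(dmesg_output):
    """Yield (pid, comm) pairs for OOM-killed processes found in dmesg."""
    for line in dmesg_output.splitlines():
        m = OOM_LINE.search(line)
        if m:
            yield int(m.group(1)), m.group(2)
```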
## mount the job dir as /work instead of /tmp
https://gitlab.inria.fr/allgo/allgo/-/issues/97 (BAIRE Anthony, updated 2022-04-26)

Job containers should have two external volumes:
* `/work` mounted from the job dir (in the datastore)
* `/tmp` mounted from a temporary directory
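Wiring up the two mounts might look like this, in docker-py's binds style (the function name and host paths are illustrative):

```python
def job_binds(job_dir, tmp_dir):
    """Map host directories into the job container (docker-py 'binds'
    format; hypothetical helper, not the actual controller code)."""
    return {
        job_dir: {"bind": "/work", "mode": "rw"},  # job dir from the datastore
        tmp_dir: {"bind": "/tmp", "mode": "rw"},   # scratch space on the host
    }
```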
-> users should be advised to write temporary files into /tmp, because (depending on the storage backend used by the docker daemon) we may have performance issues if the app uses the docker filesystem too much

## allgo.log is an annoyance for the user writing an entrypoint
https://gitlab.inria.fr/allgo/allgo/-/issues/92 (BAIRE Anthony, updated 2021-01-19)

In previous versions of allgo, the allgo.log file was created *after* the entrypoint was started (thanks to a race condition). We no longer have that race condition: allgo.log is now always created *before* the entrypoint starts.
This is a disturbance if the user makes a wildcard expansion like `for fich in * ; do ...`.
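For what it's worth, renaming the log to a dot-prefixed name such as `.allgo.log` would hide it from such wildcards, since neither shell globs nor Python's `glob` match dot-prefixed names by default. A quick check:

```python
import glob
import os
import tempfile

# A job dir containing one input file and a dot-prefixed log;
# '*' should only match the visible input file.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "input.txt"), "w").close()
    open(os.path.join(d, ".allgo.log"), "w").close()
    names = sorted(os.path.basename(p)
                   for p in glob.glob(os.path.join(d, "*")))

print(names)  # the dotfile is excluded
```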
I am wondering what is the most appropriate solution. There are multiple possibilities:
* rename `allgo.log` as `.allgo.log`
* put the input and output files in separate directories (not backward compatible)

## reduce docker image size
https://gitlab.inria.fr/allgo/allgo/-/issues/46 (MAUPETIT Charly, updated 2017-06-01)

Reduce the size of docker images before pushing them to the registry.
Remove temporary files and "squash" the docker layers.

I ran a test (going through a .tar file); it can be worthwhile beyond a certain image size.