sandbox/webapp/webapp_versions state machine
The current state machines:
- webapps.status
- 0 -> initial state
- 1 -> first sandbox committed (app now testable)
- 2 -> app published (and pushed to the registry)
- webapps.ipaddress -> 'NOT NULL' means that the sandbox is running
- webapp_versions -> no state, entry is created once the new version is committed and pushed
- the first sandbox (sandbox_XXXXX) is testable, but the next ones (update_XXXXX) are not
- errors are scattered into several log files (log/webapp_create_container.log, log/sidekiq_create_image.log, log/sidekiq_webapp_update_container.log) and the user has no feedback
- transitions are written to the db after the sidekiq job is done -> the same sidekiq job may be started multiple times concurrently (and sidekiq keeps repeating it until it fails)
- allgo allows overwriting a docker tag, which is problematic because:
- it requires synchronising immediately all the swarm nodes onto the new image
- a job created with an image may be run on a different image (if committed in between)
- it requires careful handling of race conditions when sid talks to the registry, since it may push and pull the same image concurrently (eg: if a sandbox is created immediately after overwriting an image, the pull (before creating the new sandbox) must run after the push (after committing the old sandbox), otherwise the new image is overwritten by the previous image from the registry)
- each sandbox operation (start, stop, ...) implies a sequence of docker commands; if any of these commands fails, the sandbox may be left in an unusable state
- bug: it is not possible to run a job while a webapp is being upgraded
Suggested changes:
- new webapps/versions should be immediately visible to the user (no "your app will show in a few minutes" messages). Rationale:
- in practice, creating the job takes time
- the jobs are launched in the background anyway
- there should be a state machine for the sandbox:
- Webapp.sandbox_state: integer
- Webapp.sandbox_error: plain string or NULL if no error
- Transitions by the rails app:
- IDLE -> STARTING
- STARTING+ERROR -> STARTING ("retry" button)
- RUNNING -> STOPPING
- STOPPING+ERROR -> STOPPING
- Transitions by the docker controller (outside rails):
- STARTING -> RUNNING
- STARTING -> STARTING+ERROR
- STOPPING -> IDLE
- STOPPING -> STOPPING+ERROR
- The controller should make the distinction between:
- permanent errors (that should be reported immediately to the user)
- transient errors (caused by a platform problem)
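The proposed sandbox state machine could be sketched as follows. This is a minimal plain-Ruby illustration, not the actual allgo code: the class name `SandboxStateMachine`, the method names, and the integer values in `STATE_CODES` are assumptions; only the states, the transitions, and the `sandbox_state` / `sandbox_error` pair come from the proposal above.

```ruby
# Hypothetical sketch of the proposed sandbox state machine.
# The pair (state, error) mirrors Webapp.sandbox_state / Webapp.sandbox_error.
class SandboxStateMachine
  InvalidTransition = Class.new(StandardError)

  # Integer codes for Webapp.sandbox_state (the numeric values are assumed).
  STATE_CODES = { idle: 0, starting: 1, running: 2, stopping: 3 }.freeze

  attr_reader :state, :error

  def initialize(state = :idle, error = nil)
    @state = state
    @error = error
  end

  # --- transitions by the rails app ---

  # IDLE -> STARTING
  def start!
    move(:idle, :starting)
  end

  # RUNNING -> STOPPING
  def stop!
    move(:running, :stopping)
  end

  # STARTING+ERROR -> STARTING and STOPPING+ERROR -> STOPPING ("retry" button)
  def retry_after_error!
    unless @error && in_progress?
      raise InvalidTransition, "nothing to retry in state #{@state}"
    end
    @error = nil
    self
  end

  # --- transitions by the docker controller ---

  # STARTING -> RUNNING and STOPPING -> IDLE
  def succeeded!
    unless in_progress? && @error.nil?
      raise InvalidTransition, "no operation in progress in state #{@state}"
    end
    @state = (@state == :starting ? :running : :idle)
    self
  end

  # STARTING -> STARTING+ERROR and STOPPING -> STOPPING+ERROR
  def failed!(message)
    unless in_progress? && @error.nil?
      raise InvalidTransition, "no operation in progress in state #{@state}"
    end
    @error = message
    self
  end

  def state_code
    STATE_CODES.fetch(@state)
  end

  private

  def in_progress?
    %i[starting stopping].include?(@state)
  end

  def move(from, to)
    unless @state == from && @error.nil?
      raise InvalidTransition, "cannot go from #{@state} to #{to}"
    end
    @state = to
    self
  end
end
```

In this sketch the rails app only records the intent (`start!`, `stop!`, `retry_after_error!`) and the controller reports the outcome (`succeeded!`, `failed!`), so any transition not listed above raises `InvalidTransition` instead of silently changing the state.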
- remove webapps.status. Rationale:
- the 0->1 transition is not useful (the sandbox is committed at job creation anyway)
- the 1->2 transition should rather be replaced with a 'published' flag on WebappVersion
- published status should be stored in WebappVersion (add a 'published' boolean field) rather than in Webapp. Rationale: a new version may not be usable immediately, and the owner may prefer to keep it unpublished for some time
- app should be listed in 'apps' if any of:
- app is public and has at least one version marked as published
- current user is the app's owner
- the job creation form should display:
- all versions if the current user is the app's owner (Note: unpublished versions should be marked as such, eg: "1.0" vs "1.1 (not published)")
- only published versions otherwise
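The listing and version-selection rules above can be written as two small predicates. The sketch below uses plain-Ruby `Struct` stand-ins rather than the real Rails models; the field names (`owner`, `is_public`, `versions`, `number`) and the function names are assumptions made for illustration.

```ruby
# Hypothetical stand-ins for the Rails models (field names are assumed).
WebappVersion = Struct.new(:number, :published)
Webapp = Struct.new(:name, :owner, :is_public, :versions)

# An app is listed in 'apps' if the current user owns it, or if it is
# public and has at least one version marked as published.
def listed?(app, user)
  return true if app.owner == user

  app.is_public && app.versions.any?(&:published)
end

# Labels offered in the job creation form: the owner sees every version
# (unpublished ones are marked), other users see published versions only.
def selectable_versions(app, user)
  if app.owner == user
    app.versions.map do |v|
      v.published ? v.number : "#{v.number} (not published)"
    end
  else
    app.versions.select(&:published).map(&:number)
  end
end
```

For example, with versions "1.0" (published) and "1.1" (unpublished), the owner would be offered both entries, while any other user would be offered "1.0" only.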
- docker images should have a revision number (not shown to the user):
- eg: 1.0r0, 1.0r1, ...
- add a 'revision' field to WebappVersion (pointing to the current revision) and to Job
- when a new docker image is committed, a new tag is created ==> no name conflict
- discard old docker tags once they are no longer referenced by the WebappVersion and by any waiting Job
- no need to repull from the registry when starting a new job (because the image is not expected to have been changed)
- having separate revision numbers would make it possible to implement rollbacks
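The revision scheme can be illustrated with two helpers: one that builds the internal tag, and one that decides which tags may be discarded. This is only a sketch; the function names and the way references are passed in (plain arrays of tag strings) are assumptions, not part of the proposal.

```ruby
# Builds the internal docker tag for a version/revision pair, eg "1.0r1".
def revision_tag(version, revision)
  "#{version}r#{revision}"
end

# Returns the tags that may be discarded: those referenced neither by the
# current revision of a WebappVersion nor by any waiting Job.
def discardable_tags(existing_tags, current_tags, waiting_job_tags)
  existing_tags - current_tags - waiting_job_tags
end
```

With tags 1.0r0, 1.0r1 and 1.0r2, where 1.0r2 is the current revision and a waiting job still references 1.0r1, only 1.0r0 would be discarded. Because a committed image always gets a fresh tag, an existing tag is never rewritten, which is what removes the need to re-pull before starting a job.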
- sandbox images should be committed with the ":sandbox" tag rather than ":1.0" (and it should be listed as "sandbox" in the job creation form)
- Note: "sandbox" becomes a reserved word
- running test jobs (on sid) should be possible for app upgrades too (and the container name should always be 'sandbox_')
- users should be able to have multiple active sandboxes concurrently
- add user operation: sandbox rollback (to drop the current sandbox)