SSH connections through a jump host are not reliable
Here is an error I got when running Enos on G5K with Enoslib 8.1.3 (using the automatic SSH jump host), with 4 hosts:
INFO:enoslib.log:[G5k] Waiting for the end of deployment [D-805892fc-1b04-4afb-8f99-246002926c8f]
PLAY [all] *********************************************************************************************************************************************************************************************************
TASK [Run dhcp on the nodes] ***************************************************************************************************************************************************************************************
fatal: [gros-44-kavlan-4.nancy.grid5000.fr]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host\r\nConnection closed by UNKNOWN port 65535", "unreachable": true}
changed: [gros-45-kavlan-4.nancy.grid5000.fr]
changed: [gros-30-kavlan-4.nancy.grid5000.fr]
changed: [gros-41-kavlan-4.nancy.grid5000.fr]
PLAY RECAP *********************************************************************************************************************************************************************************************************
gros-30-kavlan-4.nancy.grid5000.fr : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
gros-41-kavlan-4.nancy.grid5000.fr : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
gros-44-kavlan-4.nancy.grid5000.fr : ok=0 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0
gros-45-kavlan-4.nancy.grid5000.fr : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
ERROR:enoslib.api:Unreachable hosts: [_AnsibleExecutionRecord(host='gros-44-kavlan-4.nancy.grid5000.fr', status='UNREACHABLE', task='Run dhcp on the nodes', payload={'unreachable': True, 'msg': 'Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host\r\nConnection closed by UNKNOWN port 65535', 'changed': False})]
[_AnsibleExecutionRecord(host='gros-44-kavlan-4.nancy.grid5000.fr',
status='UNREACHABLE', task='Run dhcp on the nodes', payload={'unreachable':
True, 'msg': 'Failed to connect to the host via ssh:
kex_exchange_identification: Connection closed by remote host\r\nConnection
closed by UNKNOWN port 65535', 'changed': False})]
(CRITICAL cli.py:109)
When I relaunched the script (which reuses the same G5K job), everything went fine, so this was a transient error.
It may have happened because the host took a while to become reachable, because there were too many connections open on the jump host, or because of some other issue.
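If the failure really is transient, one possible mitigation is to retry the failing step with an increasing delay. The snippet below is only a minimal sketch of that idea in plain Python: `retry_transient` and its parameters are hypothetical names, not part of Enoslib or Enos, and since I am not sure which exception class Enoslib raises for unreachable hosts, it catches any `Exception` by default.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_transient(
    action: Callable[[], T],
    retries: int = 5,
    initial_delay: float = 2.0,
    backoff: float = 2.0,
    exceptions: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds or `retries` attempts have failed.

    Hypothetical helper, not an Enoslib API. The delay between attempts
    starts at `initial_delay` seconds and is multiplied by `backoff`
    after each failure.
    """
    delay = initial_delay
    for attempt in range(1, retries + 1):
        try:
            return action()
        except exceptions:
            if attempt == retries:
                # Out of attempts: let the caller see the last error.
                raise
            # Give the node (or the jump host) some time before retrying.
            sleep(delay)
            delay *= backoff
    raise ValueError("retries must be at least 1")
```

The failing step would then be wrapped in a callable, for instance `retry_transient(lambda: run_dhcp_on_nodes(roles))`, where `run_dhcp_on_nodes` stands for whatever function triggers the Ansible run (a placeholder name). This only hides the symptom. It does not explain why the jump host closed the connection.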
@vparolgu reported a similar issue with Enoslib directly, where it can fail when gathering facts: