Problem

The other day one of my playbooks reported a problem connecting to one of the switches. What seemed odd was that this playbook had run against the same box without any problems ten minutes earlier.

I double-checked my password and all the permissions, and everything seemed to be in order. As a test, I ran the same playbook against other devices, and it worked like a charm. It was time to turn on debugging and full verbosity (-vvvv).

Here's what I found in the logs:

2017-10-08 11:24:01,589 p=30940 u=przemek |  PLAYBOOK: arista_test.yml ******************************************************************************************
2017-10-08 11:24:01,589 p=30940 u=przemek |  1 plays in arista_test.yml
2017-10-08 11:24:01,624 p=30940 u=przemek |  PLAY [Playbook used to collect output from the show commands] ******************************************************
2017-10-08 11:24:01,668 p=30940 u=przemek |  META: ran handlers
2017-10-08 11:24:01,670 p=30940 u=przemek |  TASK [eos_command] *************************************************************************************************
2017-10-08 11:24:01,670 p=30940 u=przemek |  task path: /data/ansible/blog/arista_test.yml:10
2017-10-08 11:24:01,833 p=30955 u=przemek |  re-using existing socket for ansible@veos01.lab:22
2017-10-08 11:24:31,859 p=30955 u=przemek |  number of connection attempts exceeded, unable to connect to control socket
2017-10-08 11:24:31,859 p=30955 u=przemek |  persistent_connect_interval=1, persistent_connect_retries=30
2017-10-08 11:24:31,999 p=30940 u=przemek |  Using module file /usr/lib/python2.7/site-packages/ansible/modules/network/eos/eos_command.py
2017-10-08 11:24:32,167 p=30940 u=przemek |  fatal: [veos01.lab]: FAILED! => {
    "changed": false,
    "err": "[Errno 111] Connection refused",
    "failed": true,
    "invocation": {
        "module_args": {
            "auth_pass": null,
            "authorize": null,
            "commands": [
                "show version"
            ],
            "host": null,
            "interval": 1,
            "match": "all",
            "password": "VALUE_SPECIFIED_IN_NO_LOG_PARAMETER",
            "port": 443,
            "provider": null,
            "retries": 10,
            "ssh_keyfile": null,
            "timeout": null,
            "transport": "cli",
            "use_ssl": true,
            "username": "ansible",
            "validate_certs": true,
            "wait_for": null
        }
    },
    "msg": "unable to connect to socket"
}

We got the error "[Errno 111] Connection refused", but interestingly we can also see the message below:

"msg": "unable to connect to socket"

Starting with Ansible 2.3, network modules support persistent SSH connections. The above message suggests that something may have gone wrong with the socket that Ansible opened and left in place.
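In a long log the socket-related entries are easy to miss. As a rough sketch, they can be filtered out with a few lines of Python 3. The regular expression assumes the log layout shown in the excerpt above, and the function name socket_related_lines is made up for this example.

```python
import re

# Assumed layout, copied from the excerpt above:
# "<date> <time> p=<pid> u=<user> |  <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\S+ \S+) p=(?P<pid>\d+) u=(?P<user>\S+) \|\s+(?P<msg>.*)$"
)

# Substrings taken from the log excerpt above.
SOCKET_HINTS = (
    "re-using existing socket",
    "unable to connect to control socket",
)


def socket_related_lines(log_text):
    """Return (timestamp, message) pairs for entries that mention the socket."""
    hits = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match and any(hint in match.group("msg") for hint in SOCKET_HINTS):
            hits.append((match.group("ts"), match.group("msg")))
    return hits
```

Feeding it the contents of the Ansible log file returns only the timestamped entries about the persistent connection, which makes the 30-second gap between "re-using existing socket" and the failure easy to spot.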

We just need to find out the name of the socket and its location. When a playbook is run with the -vvvv argument, Ansible tells us which socket has been created or reused for the device, as seen below:

PLAYBOOK: arista_test.yml ******************************************************************************************
1 plays in arista_test.yml

PLAY [Playbook used to collect output from the show commands] ******************************************************
META: ran handlers

TASK [eos_command] *************************************************************************************************
task path: /data/ansible/blog/arista_test.yml:10
<veos01.lab> connection transport is cli
<veos01.lab> using connection plugin network_cli
<veos01.lab> socket_path: /home/przemek/.ansible/pc/9cbf669de1
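When many hosts are involved, the socket path can be extracted from the verbose output instead of being read by eye. The following is a minimal Python 3 sketch that assumes the "<host> socket_path: <path>" line format shown above. The helper name socket_paths is hypothetical.

```python
import re

# Assumed line format, copied from the -vvvv output above:
# "<veos01.lab> socket_path: /home/przemek/.ansible/pc/9cbf669de1"
SOCKET_PATH = re.compile(r"^<(?P<host>[^>]+)> socket_path: (?P<path>\S+)$")


def socket_paths(verbose_output):
    """Map each host to the socket path reported in the verbose output."""
    paths = {}
    for line in verbose_output.splitlines():
        match = SOCKET_PATH.match(line.strip())
        if match:
            paths[match.group("host")] = match.group("path")
    return paths
```

Given the output above, this returns a dictionary with a single entry for veos01.lab.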

Let's quickly check if this socket is indeed at the given location:

[przemek@quasar ~]$ cd /home/przemek/.ansible/pc/
[przemek@quasar pc]$ ls -l
total 0
srwxrwxrwx 1 przemek przemek 0 Oct 08 11:13 9cbf669de1

It is there all right. It also appears to have been created more than 10 minutes before the playbook's run, which is longer than the default timeout for the SSH sockets opened by Ansible. It would seem that something got stuck here.
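The same check can be scripted. This is a minimal Python 3 sketch that lists Unix sockets in a directory whose modification time is older than a given age. The function name stale_sockets and the 600-second threshold are assumptions chosen to match the ten-minute observation above. They are not Ansible defaults.

```python
import os
import stat
import time


def stale_sockets(directory, max_age_seconds, now=None):
    """Return paths of Unix sockets in `directory` older than `max_age_seconds`."""
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        info = os.lstat(path)  # lstat so that symlinks are not followed
        if stat.S_ISSOCK(info.st_mode) and now - info.st_mtime > max_age_seconds:
            stale.append(path)
    return stale


if __name__ == "__main__":
    # Directory taken from the socket_path shown earlier; 600 s is an assumed threshold.
    pc_dir = os.path.expanduser("~/.ansible/pc")
    if os.path.isdir(pc_dir):
        for path in stale_sockets(pc_dir, 600):
            print(path)
```

Regular files in the directory are ignored, so only entries that `ls -l` would show with a leading "s" are reported.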

Solution

The solution is very simple: delete the stale socket!

[przemek@quasar pc]$ rm 9cbf669de1
[przemek@quasar pc]$ ls
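If the cleanup is ever automated, it is worth guarding against deleting the wrong file. The Python 3 sketch below removes a path only when it really is a Unix socket. The helper name remove_if_socket is hypothetical, and nothing here is part of Ansible itself.

```python
import os
import stat


def remove_if_socket(path):
    """Delete `path` only if it is a Unix socket; return True if it was removed."""
    try:
        mode = os.lstat(path).st_mode
    except FileNotFoundError:
        return False  # already gone, nothing to do
    if not stat.S_ISSOCK(mode):
        return False  # refuse to touch regular files or directories
    os.unlink(path)
    return True
```

Calling it with the socket path from the verbose output has the same effect as the rm command above, but returns False instead of deleting anything that is not a socket.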

Re-run the playbook:

[przemek@quasar blog]$ ansible-playbook arista_test.yml

PLAY [arista] ********************************************************************************************************************

TASK [Playbook used to collect output from the show commands] ********************************************************************
 [WARNING]: argument transport has been deprecated and will be removed in a future version

ok: [10.50.0.2]

PLAY RECAP ***********************************************************************************************************************
10.50.0.2                  : ok=1    changed=0    unreachable=0    failed=0

Great, we're back in the game.

I've got to say, troubleshooting some of the connectivity issues in Ansible isn't always easy, but logging and debug messages have definitely improved in the newer releases.

Since dealing with this issue, I also found out that the Ansible docs have a section on clearing out persistent connections. It can be found at the following link:

http://docs.ansible.com/ansible/2.4/network_debug_troubleshooting.html#clearing-out-persistent-connections