Introduction

In this short post I'll introduce you to a lesser known type of Ansible loop: the "until" loop. This loop is used for retrying a task until a certain condition is met.

To use this loop you essentially need to add three arguments to your task:

  • until - the condition that must be met for the loop to stop. That is, Ansible will keep executing the task until the expression used here evaluates to true.
  • retries - specifies how many times we want to run the task before Ansible gives up.
  • delay - the delay, in seconds, between retries.

As an example, the task below will keep sending a GET request to the specified URL until the "status" key in the response is equal to "READY". We ask Ansible to make 10 attempts in total, with a delay of 1 second between each attempt. If the condition in until is still not met after the final attempt, the task is marked as failed.

  - name: Wait until web app status is "READY"
    uri:
      url: "{{ app_url }}/status"
    register: app_status
    until: app_status.json.status == "READY"
    retries: 10
    delay: 1

What's so cool about this loop is that you can use it to actively check the result of a given task before proceeding to other tasks.

This is different from using the when task argument, for instance, where we only execute a task IF a condition is met. Here the condition MUST be met before we execute the next task.

One is conditional execution, usually based on a static check, e.g. the existence of a package or feature, or the value of a pre-defined variable. The other pauses execution until a condition is met, failing the task if it isn't, to ensure the desired state is in place before proceeding.
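
To make the contrast concrete, here's a minimal sketch of both patterns side by side. The package name, service name and flag variable are made up for illustration only:

  # 'when': run the task only IF the condition already holds (evaluated once).
  - name: Install tcpdump only on lab hosts
    package:
      name: tcpdump
      state: present
    when: lab_host | default(false)

  # 'until': re-run the task UNTIL the condition holds (checked after every attempt).
  - name: Wait until the web app service reports as active
    command: systemctl is-active webapp
    register: svc_state
    until: svc_state.stdout == "active"
    retries: 10
    delay: 3
    changed_when: no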

Some scenarios where the until loop could be useful:

  • Making sure a web app service came up before progressing with the Playbook.
  • Checking the status of a long running asynchronous task via an API endpoint.
  • Waiting for a routing protocol adjacency to come up.
  • Waiting for the system to converge, e.g. routing in networking.
  • Checking if a Docker container is reporting as healthy.
  • Retrying a service that might take multiple attempts to come up fully.

Basically, there are a lot of use cases for the until loop :)

Note that some of the above can also be achieved with the wait_for module, which is a bit more specialized. The wait_for module can check the status of ports, files and processes, among other things. Have a look at the link in References if you want to find out more.
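
For instance, if all we care about is whether something is listening on a TCP port, a task roughly like this (port number borrowed from the example app used later in this post) would block until the port opens, with no until expression needed:

  - name: Wait until something is listening on the app port
    wait_for:
      host: 127.0.0.1
      port: 5010
      timeout: 60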

Examples

We now know what the until loop is, how to use it, and where it could be useful. Next we'll run through some examples to give you a better intuition of how one would go about using it in Playbooks.

Setup details

Details of the setup used for the examples:

  • Python 3.8.5
  • Ansible 2.9.10 running in Python virtual environment
  • Python libraries listed in requirements.txt in the GitHub repository for this post
  • Docker engine
  • Docker container named "veos:4.18.10M" built with vrnetlab and "vEOS-lab-4.18.10M.vmdk" image

Example 1 - Polling web app status via API

In the first example I have a Playbook that gets the content of the home page of a web app. The twist is that this web app takes some time to fully come up. Fortunately there is an API endpoint that we can query to check if the app is ready to accept requests.

We'll take advantage of the until loop to keep polling the status until we get the green light to proceed.

until_web_app.yml

---
- name: "PLAY 1. Use 'until' to wait for Web APP to come up."
  hosts: local
  connection: local
  gather_facts: no

  vars:
    app_url: "http://127.0.0.1:5010"

  tasks:
  - name: "TASK 1.1. Start Web app (async 20 keeps up app in bg for 20 secs)."
    command: python flask_app/main.py
    async: 20
    poll: 0
    changed_when: no

  - name: "TASK 1.2. Retrieve Web app home page (should fail)."
    uri:
      url: "{{ app_url }}"
    register: app_hp
    ignore_errors: yes

  - name: "TASK 1.3. Display HTTP code returned by home page."
    debug:
      msg: "Web app returned {{ app_hp.status }} HTTP code"

  - name: "TASK 1.4. Wait until GET to 'status' returns 'READY'."
    uri:
      url: "{{ app_url }}/status"
    register: app_status
    until: app_status.json.status == "READY"
    retries: 10
    delay: 1
    
  - name: "TASK 1.5. Retrieve Web app home page (should succeed now)."
    uri:
      url: "{{ app_url }}"
    register: app_hp

  - name: "TASK 1.6. Display HTTP code and body returned by home page."
    debug:
      msg: 
        - "Web app returned {{ app_hp.status }} HTTP code"
        - "Web page content: {{ lookup('url', app_url) }}"

Let's have a look at the interesting bits in this Playbook.

I built this app with an API endpoint that returns the status of the service in a JSON payload. This can be either "NOT_READY" or "READY".

  • In TASK 1.1 we launch a small Flask Web App that takes 10 seconds to fully come up. I use the async argument here to trick Ansible into keeping it running in the background for 20 seconds, otherwise the Playbook would get stuck on this task.

  • In TASK 1.2 we get an error while retrieving the home page because the app is not ready yet.

  • In TASK 1.4 we use the until loop to keep querying the status endpoint until the returned value equals "READY". Only when this task succeeds will we proceed to the next task, now knowing that our chance of succeeding is much higher.

  • In TASK 1.5 we retrieve the home page again, which should now succeed; its contents are displayed in TASK 1.6.

A lot of Web API services expose some kind of status or healthcheck endpoint, so this example shows a very useful pattern that we can reuse elsewhere.
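
One small caveat: the expression in until is evaluated against whatever got registered, so if the status endpoint can briefly return a body without the expected key while the app is warming up, it may be worth guarding the check. A slightly more defensive variation of TASK 1.4 could look like this:

  - name: "Wait until GET to 'status' returns 'READY' (defensive variant)."
    uri:
      url: "{{ app_url }}/status"
    register: app_status
    until: app_status.json is defined and app_status.json.status | default('') == "READY"
    retries: 10
    delay: 1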

If you're curious, you can find the code of the Flask app in the GitHub repository together with the playbook.

And this is the output from the Playbook run:

(venv) przemek@quark:~/netdev/repos/ans_unt$ ansible-playbook -i hosts.yml until_web_app.yml 

PLAY [PLAY 1. Use 'until' to wait for Web APP to come up.] *********************************************************************************************************************************************

TASK [TASK 1.1. Start Web app (async 20 keeps up app in bg for 20 secs).] ******************************************************************************************************************************
ok: [localhost]

TASK [TASK 1.2. Retrieve Web app home page (should fail).] *********************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "content": "", "content_length": "0", "content_type": "text/html; charset=utf-8", "date": "Sun, 22 Nov 2020 16:47:37 GMT", "elapsed": 0, "msg": "Status code was 503 and not [200]: HTTP Error 503: SERVICE UNAVAILABLE", "redirected": false, "server": "Werkzeug/1.0.1 Python/3.8.5", "status": 503, "url": "http://127.0.0.1:5010"}
...ignoring

TASK [TASK 1.3. Display HTTP code returned by home page.] **********************************************************************************************************************************************
ok: [localhost] => {
    "msg": "Web app returned 503 HTTP code"
}

TASK [TASK 1.4. Wait until GET to 'status' returns 'READY'.] *******************************************************************************************************************************************
FAILED - RETRYING: TASK 1.4. Wait until GET to 'status' returns 'READY'. (10 retries left).
FAILED - RETRYING: TASK 1.4. Wait until GET to 'status' returns 'READY'. (9 retries left).
FAILED - RETRYING: TASK 1.4. Wait until GET to 'status' returns 'READY'. (8 retries left).
FAILED - RETRYING: TASK 1.4. Wait until GET to 'status' returns 'READY'. (7 retries left).
FAILED - RETRYING: TASK 1.4. Wait until GET to 'status' returns 'READY'. (6 retries left).
FAILED - RETRYING: TASK 1.4. Wait until GET to 'status' returns 'READY'. (5 retries left).
FAILED - RETRYING: TASK 1.4. Wait until GET to 'status' returns 'READY'. (4 retries left).
ok: [localhost]

TASK [TASK 1.5. Retrieve Web app home page (should succeed now).] **************************************************************************************************************************************
ok: [localhost]

TASK [TASK 1.6. Display HTTP code and body returned by home page.] *************************************************************************************************************************************
ok: [localhost] => {
    "msg": [
        "Web app returned 200 HTTP code",
        "Web page content: Service ready for use."
    ]
}

PLAY RECAP *********************************************************************************************************************************************************************************************
localhost                  : ok=6    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=1   

Example 2 - Wait for BGP to establish before retrieving peer routes

In the world of networking we often encounter situations where some kind of adjacency, be it BFD, PIM or BGP, has to be established before we can retrieve the information we're interested in.

To illustrate this I wrote a Playbook that waits for BGP peerings to come up before checking the routes we receive from neighbors.

Bear in mind this is simplified for use in an example; in the real world you might need to add more checks to ensure routing information between peers has been fully exchanged.

until_eos_net.yml

---
- name: "PLAY 1. Use 'until' to wait for BGP sessions to establish."
  hosts: veos_net
  gather_facts: no

  tasks: 
  - name: "TASK 1.1. Record peer IPs for use in 'until' task."
    eos_command:
      commands: 
      - command: show ip bgp summary
        output: json
    register: init_bgp_sum

  - name: "TASK 1.2. Forcefully reset BGP sessions."
    eos_command:
      commands: clear ip bgp neighbor *

  - name: "TASK 1.3. Use 'until' to wait for all BGP sessions to establish."
    eos_command:
      commands:
      - command: show ip bgp summary
        output: json
    register: u_bgp_sum
    until: u_bgp_sum.stdout.0.vrfs.default.peers[item.key].peerState == "Established"
    retries: 15
    delay: 1
    loop: "{{ init_bgp_sum.stdout.0.vrfs.default.peers | dict2items }}"
    loop_control:
      label: "{{ item.key }}"

  - name: "TASK 1.4. Retrieve neighbor routes."
    eos_command:
      commands:
      - command: "show ip bgp neighbors {{ item.key }} routes"
        output: json
    register: nbr_routes
    loop: "{{ init_bgp_sum.stdout.0.vrfs.default.peers | dict2items }}"
    loop_control:
      label: "{{ item.key }}"

  - name: "TASK 1.5. Display neighbor routes."
    debug:
      msg: 
        - "{{ ''.center(80, '=') }}"
        - "Neighbor: {{ nbr.item.key }}"
        - "{{ nbr.stdout.0.vrfs.default.bgpRouteEntries.keys() | list }}"
        - "{{ ''.center(80, '=') }}"
    loop: "{{ nbr_routes.results }}"
    loop_control:
      loop_var: nbr
      label: "{{ nbr.item.key }}"

Again, we'll look more closely at the tasks that do something interesting.

  • In TASK 1.1 we record the output of show ip bgp summary, which we'll use to iterate over the list of BGP neighbors.

  • In TASK 1.3 we have the core of our logic.

    • Using the until loop we keep checking the status of each peer until all of them report the "Established" state.
    • During each retry the output of show ip bgp summary is recorded in the u_bgp_sum variable.
    • To add to the fun, the until loop runs inside a standard outer loop. The outer loop feeds the peer IPs to the until task, making it easier to access the data structure recorded in u_bgp_sum.
  • In TASK 1.4 we get the routes received from each neighbor, knowing that all of the peerings are now established. These routes are displayed in TASK 1.5.

Waiting for convergence, or for an adjacency to come up, is another use case that comes up often. Hopefully this example illustrates how we can handle it.

You can also see here that the until loop happily cooperates with a standard loop, allowing us to handle even more use cases.

Below is the result of running this Playbook.

(venv) przemek@quark:~/netdev/repos/ans_unt$ ansible-playbook -i hosts.yml until_eos_net.yml 

PLAY [PLAY 1. Use 'until' to wait for BGP sessions to establish.] **************************************************************************************************************************************

TASK [TASK 1.1. Record peer IPs for use in 'until' task.] **********************************************************************************************************************************************
ok: [veos01]

TASK [TASK 1.2. Forcefully reset BGP sessions.] ********************************************************************************************************************************************************
ok: [veos01]

TASK [TASK 1.3. Use 'until' to wait for all BGP sessions to establish.] ********************************************************************************************************************************
FAILED - RETRYING: TASK 1.3. Use 'until' to wait for all BGP sessions to establish. (15 retries left).
FAILED - RETRYING: TASK 1.3. Use 'until' to wait for all BGP sessions to establish. (14 retries left).
FAILED - RETRYING: TASK 1.3. Use 'until' to wait for all BGP sessions to establish. (13 retries left).
ok: [veos01] => (item=10.0.13.2)
ok: [veos01] => (item=10.0.12.2)
FAILED - RETRYING: TASK 1.3. Use 'until' to wait for all BGP sessions to establish. (15 retries left).
FAILED - RETRYING: TASK 1.3. Use 'until' to wait for all BGP sessions to establish. (14 retries left).
ok: [veos01] => (item=10.1.11.2)

TASK [TASK 1.4. Retrieve neighbor routes.] *************************************************************************************************************************************************************
ok: [veos01] => (item=10.0.13.2)
ok: [veos01] => (item=10.0.12.2)
ok: [veos01] => (item=10.1.11.2)

TASK [TASK 1.5. Display neighbor routes.] **************************************************************************************************************************************************************
ok: [veos01] => (item=10.0.13.2) => {
    "msg": [
        "================================================================================",
        "Neighbor: 10.0.13.2",
        [
            "192.168.0.0/25",
            "192.168.1.0/25",
            "192.168.4.0/24",
            "192.168.7.0/24",
            "192.168.6.0/24",
            "10.50.255.3/32",
            "192.168.5.0/24"
        ],
        "================================================================================"
    ]
}
ok: [veos01] => (item=10.0.12.2) => {
    "msg": [
        "================================================================================",
        "Neighbor: 10.0.12.2",
        [
            "10.50.255.2/32",
            "192.168.0.0/25"
        ],
        "================================================================================"
    ]
}
ok: [veos01] => (item=10.1.11.2) => {
    "msg": [
        "================================================================================",
        "Neighbor: 10.1.11.2",
        [],
        "================================================================================"
    ]
}

PLAY RECAP *********************************************************************************************************************************************************************************************
veos01                     : ok=5    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Example 3 - Polling health status of Docker container

Docker is everywhere these days. There are a lot of tools, like docker-compose, that make running containers easier. But we can also use Ansible to manage our containers.

Many containers these days come with built-in health checks, which the Docker engine can use to report on the health of a given container.

In this example I'll show you how we can talk to Docker to get the container status from inside an Ansible Playbook. Our goal is to launch 4 containers with virtual routers that we want to dynamically add to the Ansible inventory. But we only want to do that once all of them have fully come up.

until_docker.yml

---
- name: "PLAY 1. Provision lab with virtual routers."
  hosts: local
  connection: local
  gather_facts: no
    
  tasks:

  - name: "TASK 1.1. Bring up virtual router containers."
    docker_container:
      name: "{{ item }}"
      image: "{{ vr_image_name }}"
      privileged: yes
    register: cont_data
    loop: "{{ vnodes }}"
    loop_control:
      pause: 10

  - name: "TASK 1.2. Wait for virtual routers to finish booting."
    docker_container_info:
      name: "{{ item }}"
    register: cont_check
    until: cont_check.container.State.Health.Status == 'healthy'
    retries: 15
    delay: 25
    loop: "{{ vnodes }}"

  - name: "TASK 1.3. Auto discover device IPs and add to inventory group."
    set_fact:
      dyn_grp: "{{ dyn_grp | combine({cont_name: {'ansible_host': cont_ip_add }}) }}"
    vars:
      cont_ip_add: "{{ item.container.NetworkSettings.IPAddress }}"
      cont_name: '{{ item.container.Name | replace("/", "") }}'
      dyn_grp: {}
    loop: "{{ cont_data.results }}"
    loop_control:
      label: "{{ cont_name }}"

  - name: "TASK 1.4. Dynamically create hosts.yml inventory."
    copy:
      content: "{{ dyn_inv | to_nice_yaml }}"
      dest: ./lab_hosts.yml
    vars:
      dyn_inv:
        "{{ {'all': {'children': {inv_name: {'hosts': dyn_grp}}}} }}"    

Of interest here are mostly TASK 1.1 and TASK 1.2. The remaining tasks deal with generating and saving the inventory, but I wanted to leave them here to provide context.

Let's have a look at the first two tasks then.

  • In TASK 1.1 we loop over the container names recorded in the vnodes var and launch a container for each entry. I added a 10 second pause between launching each container to avoid overwhelming my local Docker engine.

  • In TASK 1.2 we've got our until loop inside a standard loop. In the until loop we ask Docker for info on the container whose name is fed in from the outer loop. Then we check if the health status is "healthy". We'll keep retrying until we get the status we want, or, if we exceed the number of retries, the task will fail.

You might wonder how I chose the values for the retries and delay arguments. These are completely arbitrary and depend on the machine and containers that you're running. In my case I know from running these by hand that it takes some time for all containers to come up, so 15 retries with a 25 second delay fits my case well.
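
If you'd rather not hard-code these numbers, one option is to turn them into variables with sensible defaults, so a slower machine can override them from the inventory or the command line. A sketch of TASK 1.2 written that way (the vr_boot_retries and vr_boot_delay variable names are made up):

  - name: "Wait for virtual routers to finish booting."
    docker_container_info:
      name: "{{ item }}"
    register: cont_check
    until: cont_check.container.State.Health.Status == 'healthy'
    retries: "{{ vr_boot_retries | default(15) }}"
    delay: "{{ vr_boot_delay | default(25) }}"
    loop: "{{ vnodes }}"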

Now you can see that you can have Ansible poll the status of your containers. Pretty cool, right?

To finish off, here's the result of this playbook being executed.

(venv) przemek@quark:~/netdev/repos/ans_unt$ ansible-playbook -i hosts.yml until_docker.yml 

PLAY [PLAY 1. Provision lab with virtual routers.] *****************************************************************************************************************************************************

TASK [TASK 1.1. Bring up virtual router containers.] ***************************************************************************************************************************************************
changed: [localhost] => (item=spine1)
changed: [localhost] => (item=spine2)
changed: [localhost] => (item=leaf1)
changed: [localhost] => (item=leaf2)

TASK [TASK 1.2. Wait for virtual routers to finish booting.] *******************************************************************************************************************************************
FAILED - RETRYING: TASK 1.2. Wait for virtual routers to finish booting. (15 retries left).
FAILED - RETRYING: TASK 1.2. Wait for virtual routers to finish booting. (14 retries left).
FAILED - RETRYING: TASK 1.2. Wait for virtual routers to finish booting. (13 retries left).
FAILED - RETRYING: TASK 1.2. Wait for virtual routers to finish booting. (12 retries left).
FAILED - RETRYING: TASK 1.2. Wait for virtual routers to finish booting. (11 retries left).
FAILED - RETRYING: TASK 1.2. Wait for virtual routers to finish booting. (10 retries left).
FAILED - RETRYING: TASK 1.2. Wait for virtual routers to finish booting. (9 retries left).
ok: [localhost] => (item=spine1)
FAILED - RETRYING: TASK 1.2. Wait for virtual routers to finish booting. (15 retries left).
ok: [localhost] => (item=spine2)
ok: [localhost] => (item=leaf1)
FAILED - RETRYING: TASK 1.2. Wait for virtual routers to finish booting. (15 retries left).
ok: [localhost] => (item=leaf2)

TASK [TASK 1.3. Auto discover device IPs and add to inventory group.] **********************************************************************************************************************************
ok: [localhost] => (item=spine1)
ok: [localhost] => (item=spine2)
ok: [localhost] => (item=leaf1)
ok: [localhost] => (item=leaf2)

TASK [TASK 1.4. Dynamically create hosts.yml inventory.] ***********************************************************************************************************************************************
changed: [localhost]

PLAY RECAP *********************************************************************************************************************************************************************************************
localhost                  : ok=4    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Conclusion

Adding the Ansible until loop to your toolset will open up some new possibilities. The ability to keep polling until a certain condition is met is powerful and will allow you to add logic to your Playbooks that might otherwise be difficult to achieve.

I hope that my examples helped illustrate the value of the until loop and that you found this post useful.

Thanks for reading!

References