Introduction

In this short post I explain what YAML aliases and anchors are. I then show you how to stop PyYAML from using them when serializing data structures.

While references are absolutely fine to use in YAML files meant for programmatic consumption, I find that they sometimes confuse humans, especially those who have never seen them before. For this reason I tend to disable anchors and aliases when saving data to YAML files meant for human consumption.


YAML aliases and anchors

The YAML specification has a provision for preserving information about nodes pointing to the same data. This means that if some data is referenced in multiple places in your data structure, the YAML dumper will:

  • add an anchor to the first occurrence
  • replace any subsequent occurrences of that data with aliases

Now, what do these anchors and aliases look like?

&id001 - an example of an anchor, placed at the first occurrence of the data
*id001 - an example of an alias, replacing each subsequent occurrence of the data

This might be easier to see with a concrete example. Below we have information about some interfaces. You can see that Ethernet1 has the anchor &id001 next to its properties key, while Ethernet2 has just the alias *id001 next to its properties key.

Ethernet1:
  description: Uplink to core-1
  mtu: 9000
  properties: &id001
  - pim
  - ptp
  - lldp
  speed: 1000
Ethernet2:
  description: Uplink to core-2
  mtu: 9000
  properties: *id001
  speed: 1000

When we load this data in Python and print it, we get the following:

{'Ethernet1': {'description': 'Uplink to core-1',
               'mtu': 9000,
               'properties': ['pim', 'ptp', 'lldp'],
               'speed': 1000},
 'Ethernet2': {'description': 'Uplink to core-2',
               'mtu': 9000,
               'properties': ['pim', 'ptp', 'lldp'],
               'speed': 1000}}

The anchor &id001 is gone and the alias *id001 was expanded into ['pim', 'ptp', 'lldp'].
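
For reference, here's a minimal snippet that produces the output above. I'm assuming the YAML document was saved to a file named interfaces.yml (the filename is my choice):

# load_interfaces.py
from pprint import pprint

import yaml

# Load the YAML document shown above
with open("interfaces.yml") as fin:
    interfaces = yaml.safe_load(fin)

# pprint shows that the alias has already been expanded into a plain list
pprint(interfaces)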

When does YAML use anchors and aliases

The Dumper used by PyYAML can recognize Python variables and data structures pointing to the same object. This often happens with deeply nested dictionaries whose keys refer to the same piece of data. A lot of APIs in the world of Network Automation use such dictionaries.

It's worth pointing out that section 3.1.1 of the YAML spec requires anchors and aliases to be used when serializing multiple references to the same node (data object). I will show how to override this behaviour, but it's good to know where it comes from.

I wrote two, nearly identical, programs creating the data structure from the beginning of this post. These will help us understand when PyYAML adds anchors and aliases.

Program #1:

# yaml_diff_ids.py
import yaml


interfaces = dict(
    Ethernet1=dict(description="Uplink to core-1", speed=1000, mtu=9000),
    Ethernet2=dict(description="Uplink to core-2", speed=1000, mtu=9000),
)

interfaces["Ethernet1"]["properties"] = ["pim", "ptp", "lldp"]
interfaces["Ethernet2"]["properties"] = ["pim", "ptp", "lldp"]

# Show IDs referenced by "properties" key
print("Ethernet1 properties object id:", id(interfaces["Ethernet1"]["properties"]))
print("Ethernet2 properties object id:", id(interfaces["Ethernet2"]["properties"]))

# Dump YAML to stdout
print("\n##### Resulting YAML:\n")
print(yaml.safe_dump(interfaces))

Program #1 output:

Ethernet1 properties object id: 41184424
Ethernet2 properties object id: 41182536

##### Resulting YAML:

Ethernet1:
  description: Uplink to core-1
  mtu: 9000
  properties:
  - pim
  - ptp
  - lldp
  speed: 1000
Ethernet2:
  description: Uplink to core-2
  mtu: 9000
  properties:
  - pim
  - ptp
  - lldp
  speed: 1000

Program #2:

# yaml_same_ids.py
import yaml


interfaces = dict(
    Ethernet1=dict(description="Uplink to core-1", speed=1000, mtu=9000),
    Ethernet2=dict(description="Uplink to core-2", speed=1000, mtu=9000),
)

prop_vals = ["pim", "ptp", "lldp"]

interfaces["Ethernet1"]["properties"] = prop_vals
interfaces["Ethernet2"]["properties"] = prop_vals

# Show IDs referenced by "properties" key
print("Ethernet1 properties object id:", id(interfaces["Ethernet1"]["properties"]))
print("Ethernet2 properties object id:", id(interfaces["Ethernet2"]["properties"]))

# Dump YAML to stdout
print("\n##### Resulting YAML:\n")
print(yaml.safe_dump(interfaces))

Program #2 output:

Ethernet1 properties object id: 13329416
Ethernet2 properties object id: 13329416

##### Resulting YAML:

Ethernet1:
  description: Uplink to core-1
  mtu: 9000
  properties: &id001
  - pim
  - ptp
  - lldp
  speed: 1000
Ethernet2:
  description: Uplink to core-2
  mtu: 9000
  properties: *id001
  speed: 1000

So, two nearly identical programs, two data structures containing identical data, but two different YAML dumps.

What caused this difference? It all comes down to a tiny change in the way we assigned values to the properties keys:

Program #1

interfaces["Ethernet1"]["properties"] = ["pim", "ptp", "lldp"]
interfaces["Ethernet2"]["properties"] = ["pim", "ptp", "lldp"]

Program #2

prop_vals = ["pim", "ptp", "lldp"]

interfaces["Ethernet1"]["properties"] = prop_vals
interfaces["Ethernet2"]["properties"] = prop_vals

In Program #1 we created two new lists and assigned a reference to each of the relevant properties keys. The lists look the same but are actually two completely separate objects.

In Program #2 we first created a list and assigned it to the prop_vals variable. We then assigned prop_vals to each of the properties keys. This means that each of the keys now references the same list object.

We also asked Python to give us the IDs of the objects referenced by the properties keys. Here we can see that the IDs in Program #1 indeed differ, while they're the same in Program #2:

Program #1 IDs:

Ethernet1 properties object id: 41184424
Ethernet2 properties object id: 41182536

Program #2 IDs:

Ethernet1 properties object id: 13329416
Ethernet2 properties object id: 13329416

And that's it. That's how PyYAML knows it should use an anchor and aliases to represent the first and subsequent references to the same object.
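
If you prefer not to compare raw id() values, Python's is operator performs the same identity check. Here's a minimal illustration (the variable names are mine):

# Two separate lists with equal contents: equal (==) but not identical (is)
list_a = ["pim", "ptp", "lldp"]
list_b = ["pim", "ptp", "lldp"]
print(list_a == list_b, list_a is list_b)  # True False

# Two names bound to the same list object: equal and identical
list_c = ["pim", "ptp", "lldp"]
list_d = list_c
print(list_c == list_d, list_c is list_d)  # True True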

For completeness, here's an example of loading the YAML file with references that we just dumped:

import yaml


with open("yaml_files/interfaces_same_ids.yml") as fin:
    interfaces = yaml.safe_load(fin)

# Show IDs referenced by "properties" key
print("Ethernet1 properties object id:", id(interfaces["Ethernet1"]["properties"]))
print("Ethernet2 properties object id:", id(interfaces["Ethernet2"]["properties"]))

IDs of the loaded properties keys:

Ethernet1 properties object id: 19630664
Ethernet2 properties object id: 19630664

As you can see the IDs are the same, so the information about the properties keys referencing the same object was preserved.

YAML dump - don’t use anchors and aliases

You now know what anchors and aliases are, what they're used for and where they come from. It's now time to show you how to stop PyYAML from using them during the dump operation.

PyYAML does not have a built-in setting for disabling this default behaviour. Fortunately, there are two ways in which we can prevent references from being used:

  1. Force all data objects to have unique IDs by using the copy.deepcopy() function
  2. Override the ignore_aliases() method in the PyYAML Dumper class

Method 1 might require source code modifications in multiple places and could be slow when copying large compound objects.

Method 2 only requires a few lines of code to define a custom dumper class. This can then be used alongside the standard PyYAML dumpers.

In any case, have a look at both and decide which one fits your case better.

Using copy.deepcopy() function

The Python standard library provides us with the copy.deepcopy() function, which returns a copy of an object, recursively copying any objects found within it.

As we've seen already, the PyYAML serializer uses anchors and aliases when it finds references to the same object. By applying deepcopy() during object assignment we ensure all of these objects have unique IDs. The end result? No YAML references in the final dump.
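
As a quick illustration of what deepcopy() does (the dictionary here is mine, not part of the original programs), the copy has the same contents but its nested list is a brand new object:

from copy import deepcopy

original = {"properties": ["pim", "ptp", "lldp"]}
copied = deepcopy(original)

print(copied == original)  # True - same contents
print(copied["properties"] is original["properties"])  # False - the nested list was copied too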

Program #2, modified to use deepcopy():

# yaml_same_ids_deep_copy.py
from copy import deepcopy

import yaml


interfaces = dict(
    Ethernet1=dict(description="Uplink to core-1", speed=1000, mtu=9000),
    Ethernet2=dict(description="Uplink to core-2", speed=1000, mtu=9000),
)

prop_vals = ["pim", "ptp", "lldp"]

interfaces["Ethernet1"]["properties"] = deepcopy(prop_vals)
interfaces["Ethernet2"]["properties"] = deepcopy(prop_vals)

# Show IDs referenced by "properties" key
print("Ethernet1 properties object id:", id(interfaces["Ethernet1"]["properties"]))
print("Ethernet2 properties object id:", id(interfaces["Ethernet2"]["properties"]))

# Dump YAML to stdout
print("\n##### Resulting YAML:\n")
print(yaml.safe_dump(interfaces))

Result:

Ethernet1 properties object id: 19775848
Ethernet2 properties object id: 19823048

##### Resulting YAML:

Ethernet1:
  description: Uplink to core-1
  mtu: 9000
  properties:
  - pim
  - ptp
  - lldp
  speed: 1000
Ethernet2:
  description: Uplink to core-2
  mtu: 9000
  properties:
  - pim
  - ptp
  - lldp
  speed: 1000

We passed prop_vals to deepcopy() during each assignment, resulting in two new copies of that data. In the output we have two different IDs even though we reused prop_vals. The final YAML representation has no references, which is exactly what we wanted.

Overriding ignore_aliases() method

To completely disable generation of YAML references we can subclass the Dumper class and override its ignore_aliases() method.

Class definition, borrowed from Issue #103 posted on the PyYAML GitHub repository:

class NoAliasDumper(yaml.SafeDumper):
    def ignore_aliases(self, data):
        return True

You could also monkey-patch the actual Dumper class, but I think this solution is safer and more elegant.
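
For comparison, a monkey-patch would look roughly like the sketch below. Note that it modifies yaml.SafeDumper globally for the whole process, which is why I consider the subclass approach safer:

import yaml

# Monkey-patch: every subsequent yaml.safe_dump() call stops emitting anchors and aliases
yaml.SafeDumper.ignore_aliases = lambda self, data: True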

We’ll now take NoAliasDumper and use it to modify Program #2:

# yaml_same_ids_custom_dumper.py
import yaml


class NoAliasDumper(yaml.SafeDumper):
    def ignore_aliases(self, data):
        return True


interfaces = dict(
    Ethernet1=dict(description="Uplink to core-1", speed=1000, mtu=9000),
    Ethernet2=dict(description="Uplink to core-2", speed=1000, mtu=9000),
)

prop_vals = ["pim", "ptp", "lldp"]

interfaces["Ethernet1"]["properties"] = prop_vals
interfaces["Ethernet2"]["properties"] = prop_vals

# Show IDs referenced by "properties" key
print("Ethernet1 properties object id:", id(interfaces["Ethernet1"]["properties"]))
print("Ethernet2 properties object id:", id(interfaces["Ethernet2"]["properties"]))

# Dump YAML to stdout
print("\n##### Resulting YAML:\n")
print(yaml.dump(interfaces, Dumper=NoAliasDumper))

Output:

Ethernet1 properties object id: 19455080
Ethernet2 properties object id: 19455080

##### Resulting YAML:

Ethernet1:
  description: Uplink to core-1
  mtu: 9000
  properties:
  - pim
  - ptp
  - lldp
  speed: 1000
Ethernet2:
  description: Uplink to core-2
  mtu: 9000
  properties:
  - pim
  - ptp
  - lldp
  speed: 1000

Perfect, the properties keys reference the same object but the dumped YAML no longer uses anchors and aliases. This is exactly what we needed.

Note that I replaced yaml.safe_dump with yaml.dump in the above example. This is because we need to explicitly pass our modified Dumper class. However, NoAliasDumper inherits from the yaml.SafeDumper class, so we still get the same protection we do when using yaml.safe_dump.
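
The same Dumper argument works when writing straight to a file. A minimal sketch, with the output filename being my own choice:

# Write the alias-free YAML straight to a file
with open("interfaces_no_aliases.yml", "w") as fout:
    yaml.dump(interfaces, fout, Dumper=NoAliasDumper)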

Conclusion

This brings us to the end of the post. I hope I helped you understand what the &id001 and *id001 markers found in YAML files are and where they come from. You now also know how to stop PyYAML from using anchors and aliases when serializing data structures, should you ever need to.

References:

  1. PyYAML GitHub repository. Issue #103 Disable Aliases/Anchors: https://github.com/yaml/pyyaml/issues/103
  2. YAML specification. Section 3.1.1. Dump: https://yaml.org/spec/1.2/spec.html#id2762313
  3. YAML specification. Section 6.9.2. Node Anchors: https://yaml.org/spec/1.2/spec.html#id2785586
  4. YAML specification. Section 7.1. Alias Nodes: https://yaml.org/spec/1.2/spec.html#id2786196
  5. GitHub repo with resources for this post: https://github.com/progala/ttl255.com/tree/master/yaml/anchors-and-aliases