Python: how to find duplicated values in a YAML file for a specific key

I have a yaml file like this:

-
    ip: 1.1.1.1
    status: Active
    type: 'typeA'
-
    ip: 1.1.1.1
    status: Disabled
    type: 'typeA'
-
    ip: 2.2.2.2
    status: Active
    type: 'typeC'
-
    ip: 3.3.3.3
    status: Active
    type: 'typeB'
-
    ip: 3.3.3.3
    status: Active
    type: 'typeC'
-
    ip: 2.2.2.2
    status: Active
    type: 'typeC'
-

I want to find any duplicate IPs whose type is also the same.

For example, IP 1.1.1.1 has two entries and both types are typeA, so it should be reported. But IP 3.3.3.3's two entries have different types, so it should not be.

Expected output:

IP 1.1.1.1, typeA duplicate
IP 2.2.2.2, typeC duplicate

  • You don’t indicate why there is a comma after the IP address in the first line of the output, but not on the second line. Is that determined by the value for key type?

  • @Anthon you mean in the expected output? That's a typo, sir. Since that's just the output, it doesn't matter much. But I really appreciate it. I have edited the question and added the comma.

Install PyYAML using pip install pyyaml, then run the Python script below, replacing myyaml.yaml with the name of your YAML file.

import yaml

with open('myyaml.yaml', 'r') as file:
    data = yaml.safe_load(file)

# Remember the last type seen for each IP so repeats can be detected.
ip_type_map = {}

for entry in data:
    if entry and 'ip' in entry and 'type' in entry:
        ip, entry_type = entry['ip'], entry['type']
        if ip in ip_type_map and entry_type == ip_type_map[ip]:
            # Same IP with the same type seen before: report the duplicate.
            print(f"IP {ip}, {entry_type} duplicate")
        else:
            ip_type_map[ip] = entry_type
    else:
        print("Invalid entry in YAML data.")

There are a few ways you can do this. All of them require loading the YAML with a parser; use a library like PyYAML: https://pypi.org/project/PyYAML/

import yaml

with open('yaml_file.yml', 'r') as stream:
    file = yaml.safe_load(stream)

First way: loop over each row, noting the (ip, type) pairs you have already seen in a list, along with the indexes of any duplicates. This preserves the original list order.

rows_to_remove = []
rows_seen = []
for idx, row in enumerate(file):
    if not row:
        # Skip the empty entry produced by the trailing "-" in the YAML.
        continue
    if (row['ip'], row['type']) in rows_seen:
        rows_to_remove.append(idx)
        continue
    rows_seen.append((row['ip'], row['type']))

# Pop from the end first so the earlier indexes are not shifted by each removal.
for row_idx in reversed(rows_to_remove):
    file.pop(row_idx)
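
To check what is left, you can loop over the list afterwards (a small usage sketch; the field names match the question's YAML, and the trailing "-" in it parses to None):

for row in file:
    if row:  # skip the empty entry from the trailing "-"
        print(f"{row['ip']} {row['type']} {row['status']}")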

Second way: create a new list using a set and a list comprehension.

unique_rows = [dict(t) for t in {tuple(d.items()) for d in file if d}]

This loops over each entry in the list, turning each dictionary into a tuple of its items and storing it in a set, which does not allow duplicates. Each unique entry is then turned back into a dictionary and stored in a list. Note that this compares every key (including status) and, because sets are unordered, does not preserve the original order.
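
If you only want to compare ip and type (as the question asks) rather than every key, a regular dict keyed on that pair keeps the first entry for each pair and preserves the original order. A sketch under that assumption:

# Keep the first entry seen for each (ip, type) pair; dicts preserve insertion order.
unique_by_ip_type = {}
for d in file:
    if d:  # skip the empty entry from the trailing "-"
        unique_by_ip_type.setdefault((d['ip'], d['type']), d)

deduped = list(unique_by_ip_type.values())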

I’ll point out that these questions have already been answered and could be found with some searching.

How can I parse a YAML file in Python

Remove duplicate dict in list in Python
