ID pairing and unique pair count

I am writing R code that should analyze two columns, P1 and P2, which both contain ID codes, together with the corresponding PAIR column.


  1. I want each individual ID code to be used in only one pair, although the same ID code may appear in both P1 and P2 (just in different rows).

  2. Further, I want to exclude logical duplicates. For example, the pair “X30112_X30101” is a duplicate of “X30101_X30112”.

  3. In the long run I am actually looking for the maximum number of pairs, which is tricky: each ID code can only be used once, but the data show that one ID code can be paired with several others (1:n).
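As a minimal sketch of requirement 2, the order-independent key can be built directly from P1 and P2 instead of splitting the PAIR string. The sample IDs and the `KEY` column name below are hypothetical and only mimic the “X30112” style mentioned above.

```r
# Hypothetical sample data; the IDs only imitate the format from the question.
df <- data.frame(
  P1 = c("X30112", "X30101", "X30205", "X30101"),
  P2 = c("X30101", "X30112", "X30300", "X30205"),
  stringsAsFactors = FALSE
)

# Order-independent key: the ID that sorts first always goes on the left.
# pmin()/pmax() work element-wise on character vectors.
df$KEY <- paste(pmin(df$P1, df$P2), pmax(df$P1, df$P2), sep = "_")

# Keep only the first row per key, so mirrored pairs are counted once.
dedup <- df[!duplicated(df$KEY), ]
print(dedup)
```

This only removes mirrored duplicates; it does not yet enforce that each ID code is used once.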

Unfortunately, I lack the experience to describe this better, and I think it might be a combinatorial problem. I would be grateful for any kind of help.
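To illustrate the combinatorial side, here is a rough sketch of an exhaustive search for the largest set of rows in which no ID code appears twice. The function name `max_pairs` and the toy data are assumptions made for illustration. The search is exponential in the worst case, so it is only practical for small inputs; larger data would need a dedicated maximum-matching algorithm.

```r
# Return the row indices of a largest set of pairs in which no ID is reused.
# Exhaustive search: for each row, try both skipping it and keeping it.
max_pairs <- function(p1, p2, rows = seq_along(p1)) {
  if (length(rows) == 0) return(integer(0))
  first <- rows[1]
  rest  <- rows[-1]

  # Option A: skip the first row.
  skip <- max_pairs(p1, p2, rest)

  # Option B: keep the first row and drop later rows that share one of its IDs.
  used <- c(p1[first], p2[first])
  compatible <- rest[!(p1[rest] %in% used | p2[rest] %in% used)]
  keep <- c(first, max_pairs(p1, p2, compatible))

  if (length(keep) >= length(skip)) keep else skip
}

# Toy data (hypothetical): "A" could pair with "B" or "C", and "B" with "D".
p1 <- c("A", "A", "B")
p2 <- c("B", "C", "D")
best <- max_pairs(p1, p2)
print(data.frame(P1 = p1[best], P2 = p2[best]))
```

On this toy input the search drops the A/B row and keeps the other two, since that yields two pairs instead of one.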

What have I tried so far?

So far I have only managed to solve 1) on a simpler data frame, which somewhat worked with this code:

    # Sample data: df dataframe
    df <- data.frame(
      P1 = c("A", "B", "C", "W"),
      P2 = c("W", "X", "Y", "A"),
      PAIR = c("A_W", "B_X", "C_Y", "W_A")
    )

    # Function to normalize and sort pairs
    normalize_and_sort <- function(pair) {
      elements <- unlist(strsplit(pair, "[_\\.]"))
      sorted_pair <- paste(sort(elements), collapse = "_")
      return(sorted_pair)
    }

    # Normalize and sort the pairs and keep unique pairs
    unique_pairs_df <- data.frame(PAIR = unique(sapply(df$PAIR, normalize_and_sort)))

    # Print the unique_pairs_df
    print(unique_pairs_df)
    #   PAIR
    # 1  A_W
    # 2  B_X
    # 3  C_Y

But this did not work with my actual data frame, maybe because my ID codes also contain numbers.

  • For example, take data.frame(P1 = c("A", "A", "B"), P2 = c("B", "C", "D")). If you kept the first row, you would need to remove the other two rows. However, if you removed the first row, you could keep the other two rows. When you say “looking for the maximum count of pairs”, does that mean you would prefer to remove the first row, since that results in a higher pair count?
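The row-order dependence described in this example can be reproduced with a small sketch. The helper `greedy_pairs` is hypothetical: it simply walks the rows in order and keeps a row only if neither ID has been used yet.

```r
# Greedy selection: keep a row only if both of its IDs are still unused.
greedy_pairs <- function(p1, p2) {
  used <- character(0)
  keep <- logical(length(p1))
  for (i in seq_along(p1)) {
    if (!(p1[i] %in% used) && !(p2[i] %in% used)) {
      keep[i] <- TRUE
      used <- c(used, p1[i], p2[i])
    }
  }
  which(keep)
}

p1 <- c("A", "A", "B")
p2 <- c("B", "C", "D")

# Top-to-bottom order keeps the first row, which blocks the other two.
n_forward <- length(greedy_pairs(p1, p2))

# Processing the same rows bottom-to-top keeps the two other rows instead.
n_reverse <- length(greedy_pairs(rev(p1), rev(p2)))

print(c(forward = n_forward, reverse = n_reverse))
```

The differing counts show that a simple first-come selection does not guarantee the maximum number of pairs.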
