Pair-cluster across many variables, respecting a pre-existing grouping variable

I have a tibble with an id column, a G grouping variable, and 300 numeric variables.

I want a method that clusters the rows so that each row is paired with exactly one other row from the same group (the same value of G). Spare rows in groups with an odd number of rows can be left out of the clusters.

So, if a group has 4 rows, there will be 2 clusters of 2. If it has 5 rows, there will be 2 clusters of 2 and one spare row.

I think I like the Mahalanobis distance for clustering but I am open to an alternative proposal.
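
For reference, the Mahalanobis distance between two rows x_i and x_j, with S the sample covariance matrix of the numeric variables, is:

$$
d_M(x_i, x_j) = \sqrt{(x_i - x_j)^{\top} S^{-1} (x_i - x_j)}
$$

One caveat with 300 variables: if there are fewer rows than variables, S is singular and has no ordinary inverse, so a pseudo-inverse or a regularised covariance estimate would be needed in place of S^{-1}.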

I think that a diagnostic variable holding the intra-cluster Mahalanobis distance of each pair could help, too.
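
To make the desired output concrete, here is a minimal sketch in Python with NumPy (not R, so it does not operate on a tibble directly). It is an illustration under stated assumptions, not a definitive method: the function name `pair_within_groups` is hypothetical, the covariance is estimated once from all rows pooled across groups, a pseudo-inverse stands in for the inverse covariance, and the pairing is greedy (closest remaining pair first), which is simpler than an optimal non-bipartite matching and does not necessarily minimise the total distance.

```python
import numpy as np


def pair_within_groups(X, groups):
    """Greedily pair rows within each group by Mahalanobis distance.

    Returns (cluster, pair_dist). cluster[i] is an integer pair label, or -1
    for a spare row. pair_dist[i] is the distance between row i and its
    partner, or NaN for a spare row (the diagnostic variable).
    """
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)

    # Assumption: one covariance matrix from all rows, pooled across groups.
    # The pseudo-inverse copes with a singular covariance, which is likely
    # when there are about as many variables as rows, or more.
    VI = np.linalg.pinv(np.atleast_2d(np.cov(X, rowvar=False)))

    cluster = np.full(len(X), -1)
    pair_dist = np.full(len(X), np.nan)
    next_label = 0

    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)

        # Pairwise Mahalanobis distances between the rows of this group.
        diff = X[idx][:, None, :] - X[idx][None, :, :]
        d2 = np.einsum("ijk,kl,ijl->ij", diff, VI, diff)
        D = np.sqrt(np.maximum(d2, 0.0))
        np.fill_diagonal(D, np.inf)  # a row cannot be paired with itself

        # Greedy step: take the closest remaining pair, then retire both rows.
        for _ in range(len(idx) // 2):
            i, j = np.unravel_index(np.argmin(D), D.shape)
            cluster[idx[[i, j]]] = next_label
            pair_dist[idx[[i, j]]] = D[i, j]
            D[[i, j], :] = np.inf
            D[:, [i, j]] = np.inf
            next_label += 1

    return cluster, pair_dist


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(9, 5))                # 9 rows, 5 numeric variables
    G = np.array(["a"] * 4 + ["b"] * 5)        # group "b" has one spare row
    cluster, pair_dist = pair_within_groups(X, G)
    for g, c, d in zip(G, cluster, pair_dist):
        print(g, c, d)
```

The two returned arrays map onto two new columns: the pair label and the intra-pair distance. Because the pairing is greedy, a poor pair late in a group shows up as a large value in the distance column, which is where the diagnostic is useful.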

Technically speaking, MatchIt does something very similar, but it imposes a binary classification on the rows. I don’t want to need such a classification.
