Grouping tags based on a specific pattern

Question 1

A text snippet like

04040p0015 Macro drive object / Macro DO SERVO, VECTOR, HLA, SERVO_AC, VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, TM15DI_DO, TM120, TM150,

has to be parsed into the four groups

04040
p0015
Macro drive object / Macro DO
SERVO, VECTOR, HLA, SERVO_AC, VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, TM15DI_DO, TM120, TM150,

The first and second groups represent different kinds of IDs, the third group is a title, and the fourth group contains the tags. Each tag consists of two or more capital letters, followed by zero or one underscore (_), followed by zero or more capital letters, followed by a comma.

The regex

([0-9]+)([rp][0-9]{4,})(.*)([A-Z]{2,}_?[A-Z,0-9]{2,},)

returns

04040
p0015
Macro drive object / Macro DO SERVO, VECTOR, HLA, SERVO_AC, VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, TM15DI_DO, TM120,
TM150,

i.e., it gets the first two groups right, but fails to correctly separate the last two groups.
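
The result can be reproduced with a short script. This is only a minimal sketch: it assumes Python's re module (no regex engine is named above), and the variable names are illustrative. It points at two likely causes: the greedy (.*) consumes as much text as it can, and the last group has no repetition, so it can hold only a single tag.

```python
import re

# Sample line from the question, written without trailing whitespace.
text = (
    "04040p0015 Macro drive object / Macro DO SERVO, VECTOR, HLA, SERVO_AC, "
    "VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, "
    "TM15DI_DO, TM120, TM150,"
)

# The regex from the question, unchanged.
original = re.compile(r"([0-9]+)([rp][0-9]{4,})(.*)([A-Z]{2,}_?[A-Z,0-9]{2,},)")

groups = original.search(text).groups()

# Group 3 is greedy, so it swallows the title and every tag but the last;
# group 4 matches exactly one tag because nothing repeats it.
for number, value in enumerate(groups, start=1):
    print(number, repr(value))
```

Note also that the character class [A-Z,0-9] contains a literal comma, so that part of the pattern can match commas as well as letters and digits.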

What is wrong with the regex?

Answer

You may use this regex to get the desired four capture groups:

(\d+)([rp]\d{4,})(.*?)\s+((?:[A-Z]\w+,\s+)+)


RegEx Details:

(\d+): 1st group, captures one or more digits
([rp]\d{4,}): 2nd group, captures r or p followed by four or more digits
(.*?): 3rd group, captures zero or more of any character (lazy, so it stops as early as possible)
\s+: one or more whitespace characters (not captured)
((?:[A-Z]\w+,\s+)+): 4th group, captures one or more words that each start with an uppercase letter, continue with one or more word characters, and are followed by a comma and one or more whitespace characters
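
As a rough illustration of how the four groups might be consumed, here is a minimal Python sketch. The helper name parse_line and the VARIANT pattern are assumptions for the example, not part of the regex above. One thing to watch: because every tag must be followed by \s+, the last tag is dropped when the input ends right after the final comma. The hypothetical variant accepts end of input as well.

```python
import re

# The pattern from the answer, unchanged.
ANSWER = re.compile(r"(\d+)([rp]\d{4,})(.*?)\s+((?:[A-Z]\w+,\s+)+)")

# Hypothetical variant: after each comma, allow whitespace OR end of input.
VARIANT = re.compile(r"(\d+)([rp]\d{4,})(.*?)\s+((?:[A-Z]\w+,(?:\s+|$))+)")

# Sample line from the question, written without trailing whitespace.
SAMPLE = (
    "04040p0015 Macro drive object / Macro DO SERVO, VECTOR, HLA, SERVO_AC, "
    "VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, "
    "TM15DI_DO, TM120, TM150,"
)


def parse_line(line, pattern=VARIANT):
    """Return (id1, id2, title, tags), or None if the line does not match."""
    m = pattern.search(line)
    if m is None:
        return None
    id1, id2, title, raw_tags = m.groups()
    # Split the captured tag run on commas and drop the empty trailing piece.
    tags = [t for t in re.split(r",\s*", raw_tags) if t]
    # The lazy third group keeps the leading space, so strip it here.
    return id1, id2, title.strip(), tags


if __name__ == "__main__":
    print(parse_line(SAMPLE))           # variant: all tags captured
    print(parse_line(SAMPLE, ANSWER))   # answer regex: last tag missing here
```

If the input always has a newline or space after the final comma, the answer's regex captures every tag as it stands and the variant is unnecessary.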
