A text snippet like
04040p0015 Macro drive object / Macro DO SERVO, VECTOR, HLA, SERVO_AC, VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, TM15DI_DO, TM120, TM150,
has to be parsed into the four groups
- 04040
- p0015
- Macro drive object / Macro DO
- SERVO, VECTOR, HLA, SERVO_AC, VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF,
R_INF, B_INF, TM31, TM15DI_DO, TM120, TM150,
Here, the first and second groups represent different kinds of IDs, the third group is a title, and the fourth group contains the tags. Each tag consists of 2 or more capital letters, followed by zero or one _, followed by zero or more capital letters, followed by a comma.
([0-9]+)([rp][0-9]{4,})(.*)([A-Z]{2,}_?[A-Z,0-9]{2,},)
returns
- 04040
- p0015
- Macro drive object / Macro DO SERVO, VECTOR, HLA, SERVO_AC, VECTOR_AC, SERVO_I_AC,
VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, TM15DI_DO, TM120,
- TM150,
i.e., it gets the first two groups right, but fails to correctly separate the last two groups.
What is wrong with the regex?
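A quick way to see what goes wrong is to run the original pattern. The following is a minimal sketch in Python (an assumption, since the question does not name a regex flavor). The greedy (.*) first consumes the rest of the line and then backtracks only as far as needed for the last group to match once, so only the final tag is left for group 4.

```python
import re

# Sample line from the question, split across string literals for readability.
text = (
    "04040p0015 Macro drive object / Macro DO SERVO, VECTOR, HLA, SERVO_AC, "
    "VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, "
    "TM15DI_DO, TM120, TM150,"
)

# The pattern from the question, unchanged.
original = r"([0-9]+)([rp][0-9]{4,})(.*)([A-Z]{2,}_?[A-Z,0-9]{2,},)"

groups = re.search(original, text).groups()

# Group 3 is greedy, so it swallows every tag except the last one.
print(repr(groups[2]))
# Group 4 matches a single tag only, because the pattern has no repetition around it.
print(repr(groups[3]))
```

Two things combine here: (.*) is greedy, and the fourth group describes just one tag rather than a repeated sequence of tags.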
You may use this regex to get the desired four capture groups:
(\d+)([rp]\d{4,})(.*?)\s+((?:[A-Z]\w+,\s+)+)
RegEx Details:
- (\d+): 1st group to capture 1+ digits
- ([rp]\d{4,}): 2nd group to match text starting with r or p followed by 4+ digits
- (.*?): 3rd group to match and capture 0 or more of any characters (lazy)
- \s+: 1+ whitespaces
- ((?:[A-Z]\w+,\s+)+): 4th group to match & capture words starting with an upper case letter and followed by a comma and 1+ whitespaces
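The pattern can be tried out with a short Python sketch. The function name parse_line and the choice of Python are assumptions for illustration. Note that the pattern requires whitespace after every comma, so the sketch appends a newline to make sure the last tag is captured even when the input ends directly after "TM150,".

```python
import re

# The suggested pattern, unchanged.
PATTERN = re.compile(r"(\d+)([rp]\d{4,})(.*?)\s+((?:[A-Z]\w+,\s+)+)")


def parse_line(text):
    """Return (id1, id2, title, tags), or None if the pattern does not match.

    parse_line is a hypothetical helper, not part of any library.
    """
    # Assumption: add a trailing newline so the final tag is followed by
    # whitespace, which the pattern requires after each comma.
    match = PATTERN.search(text + "\n")
    if match is None:
        return None
    id1, id2, title, tag_blob = match.groups()
    # Split the captured tag run into individual tags, dropping empty pieces.
    tags = [tag.strip() for tag in tag_blob.split(",") if tag.strip()]
    return id1, id2, title.strip(), tags


sample = (
    "04040p0015 Macro drive object / Macro DO SERVO, VECTOR, HLA, SERVO_AC, "
    "VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, "
    "TM15DI_DO, TM120, TM150,"
)

result = parse_line(sample)
print(result)
```

The third group is lazy, so it stops at the first position from which a whitespace run followed by one or more "WORD, " items can match. That is why "DO" stays in the title: it is not followed by a comma.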
Are “tags” always single words?
Yes. I have updated the question accordingly.