Amount of unique values in a column [closed]

Let’s say there are a lot of lines in a column which is like:

/aba72?mprag=ABCDD&radgroup=$%VDFVFid%7D&bobuecontext=$%7678XEL_CONTEXT%7D&do_id=$%7BMI%7D

I want to find out how many of those do_ids are unique. can be done in SQl or bash

Couldn’t find out how to do it

  • 1

    “how many of those do_ids are unique” -> So what are you expecting for output here? Just a number (like 10 for example)? Are these lines in a text file? What will you be using to iterate over these and perform a count? What about a value that appears more than once, is this counted for each occurrence or only for each unique occurrence? Finally you need to put in some effort into solving your own problem and show that effort along with the result and what you still need.

    – 

  • Yes, a number. yes in a text file. I just need the count. only unique. i am thinking the logic can be: do_id and the left side should be removed so only the IDs will be left and only after then it’s easy. Counting can be done on excel too. But can not perform the first part.

    – 




  • Is this something you will be wanting to automate or you need this just one time so amount of manual effort does not really matter? If it is just the once you can use a regular expression to capture the values, copy them to a new file or Excel or something for further processing.

    – 

  • I need it once. there are 30000 lines. What type of regular expression can help me?

    – 

  • 1

    Please show multiple lines and your desired out in your question (no comment here).

    – 

You can do this using pure Bash features:

  • Regex matching inside [[ ]]
  • Associative arrays (declare -A)
declare -A SEEN
while read -r LINE; do
    if [[ $LINE =~ (^|&)do_id=([^\&]+) ]]; then
        VAL=${BASH_REMATCH[2]}
        if [[ '5.2.0' < $BASH_VERSION ]]; then
            (( ++SEEN[$VAL] ))
        else
            # Workaround for a bug in Bash fixed in 5.2.0
            (( ++SEEN[${VAL@Q}] ))
        fi
    fi
done < lines.txt

for SEEN_VAL in "${!SEEN[@]}"; do
    # Values with counts
    printf -- '%s (count: %d)\n' "$SEEN_VAL" \
        "${SEEN[$SEEN_VAL]}"
    # Just the bare values
    #printf -- '%s\n' "$SEEN_VAL"
done

Prints lines with the value of do_id:

$%7BMQ%7D (count: 1)
$%7BMI%7D (count: 2)

Caveat: do_id cannot occur more than once per line.

Decoding the URL-encoded values is clearly outside the scope of this answer, so I’m not gonna do it.

Leave a Comment