Improving efficiency of a bash script reading data from a large gml file

For various reasons, I have been tasked with importing data from a large gml file into a MySQL database. The gml files I am looking at are the ones at https://use-land-property-data.service.gov.uk/datasets/inspire/download.

Basically, all I am interested in is the coordinates of each object (polygon). I can import the coordinates as they appear in the gml file (lat1 long1 lat2 long2…). I also need to include the maximum and minimum latitude and longitude as four separate columns (these need to be calculated; they are not included in the file).

My attempt so far is as follows:

#!/bin/bash

gmlfile=$1
gmldir="/home/user.name/gml_files"

gml_short=$(echo $gmlfile | sed 's/\.gml//')

echo "INSERT IGNORE INTO gmldata.uk (region,coordinates,maxLat,maxLong,minLat,minLong) VALUES" > gml_sql_query.sql

xmllint --xpath "//*[local-name()='posList']" "$gmldir/$gmlfile" | cut -d">" -f2 | cut -d"<" -f1 > "$gmldir/$gml_short-coords.txt" #Dump all coordinates to coordinate file

while read coordinates; do

    latitude=$(echo $coordinates | tr ' ' '\n' | paste -d' ' - - | cut -d" " -f1) #odd elements
    longitude=$(echo $coordinates | tr ' ' '\n' | paste -d' ' - - | cut -d" " -f2) #even elements

    max_lat=$(echo "$latitude" | tr " " '\n' | sort -n | tail -1)
    min_lat=$(echo "$latitude" | tr " " '\n' | sort -n | head -1)
    max_long=$(echo "$longitude" | tr " " '\n' | sort -n | tail -1)
    min_long=$(echo "$longitude" | tr " " '\n' | sort -n | head -1)


    echo "('$gml_short','$coordinates','$max_lat','$max_long','$min_lat','$min_long')," >> gml_sql_query.sql

done < "$gmldir/$gml_short-coords.txt"

sed -i '$s/,$/;/' gml_sql_query.sql #Replace last , with ;

mysql < gml_sql_query.sql

The xmllint line and the mysql import itself are both fast. Populating the “gml_sql_query.sql” file is really, really slow: around 1,000 lines every 30 seconds. Every one of those gml files has tens of thousands of coordinate lines (polygons), so this is taking a long, long time. Which parts of my script are the most inefficient, and is there any way to speed this up?

Two asides:

  1. I attempted to have lat_long and lat_process (etc.) variables that did most of the piping up front, with just a final pipe to pick out the first/second column. I didn’t find that this gave any improvement.
  2. I tried using arrays, but since the values are decimals every comparison has to go through bc, and this was also slow (a sketch of a less process-heavy alternative follows this code):
    mapfile -t lat_array <<< "$latitude"
    max_lat=0
    min_lat=9999999.0
    for num in "${lat_array[@]}"; do
        if [[ $(echo "$num > $max_lat" | bc -l) -eq 1 ]]; then
            max_lat=$num
        fi

        if [[ $(echo "$num < $min_lat" | bc -l) -eq 1 ]]; then
            min_lat=$num
        fi
    done

    mapfile -t long_array <<< "$longitude"
    max_long=0
    min_long=9999999.0
    for num in "${long_array[@]}"; do
        if [[ $(echo "$num > $max_long" | bc -l) -eq 1 ]]; then
            max_long=$num
        fi

        if [[ $(echo "$num < $min_long" | bc -l) -eq 1 ]]; then
            min_long=$num
        fi
    done
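
    To put the cost in perspective: every one of those comparisons spawns a command substitution running an echo | bc pipeline, and there are two comparisons per value, so the process count grows with the number of coordinates. Doing the min/max for a whole list in a single awk call avoids nearly all of that. A minimal sketch (not part of my script above), assuming "$latitude" holds one value per line as produced earlier:

    # one awk process per list instead of one bc pipeline per comparison
    read -r min_lat max_lat < <(
        awk 'NR==1 { min = max = $1 }
             { if ($1 > max) max = $1; if ($1 < min) min = $1 }
             END { print min, max }' <<< "$latitude"
    )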

EDIT: example input data:

525873.122 104945.339 525873.13 104945.234 525872.916 104949.942 525873.308 104950.628 525874.582 104951.118 525875.268 104951.804 525875.301 104951.902 525871.972 104955.36 525847.95 104958.6 525824.152 104961.864 525821.7 104962.2 525798.7 104964.55 525778.14 104966.53 525773.63 104966.97 525778.15 104958.65 525781.863 104952.578 525783.98 104948.99 525797.06 104948.84 525819.82 104948.3 525845.38 104948.07 525856.52 104948.02 525871.5 104945.55 525871.69 104945.52 525872.5 104945.4 525873.122 104945.339
525661.75 105117.9 525685.085 105118.801 525684.99 105119.75 525684.5 105124.8 525684.47 105125.2 525681.819 105124.974 525680.35 105124.85 525671.99 105124.09 525664.24 105123.45 525661.29 105123.2 525661.75 105117.9
525685.1 105118.65 525686.35 105118.85 525698.949 105119.35 525700.4 105119.45 525701.699 105119.5 525703.15 105119.5 525709.949 105119.85 525709.29 105128.05 525707.25 105127.85 525699.55 105127.15 525697.9 105127.05 525689.8 105126.35 525685.819 105126 525684.686 105125.905 525684.406 105125.882 525684.47 105125.2 525684.5 105124.8 525684.99 105119.75 525685.1 105118.65

Desired MySQL insert statement:

INSERT IGNORE INTO gmldata.uk (region,coordinates,maxLat,maxLong,minLat,minLong) VALUES
('Adur_District_Council','525873.122 104945.339 525873.13 104945.234 525872.916 104949.942 525873.308 104950.628 525874.582 104951.118 525875.268 104951.804 525875.301 104951.902 525871.972 104955.36 525847.95 104958.6 525824.152 104961.864 525821.7 104962.2 525798.7 104964.55 525778.14 104966.53 525773.63 104966.97 525778.15 104958.65 525781.863 104952.578 525783.98 104948.99 525797.06 104948.84 525819.82 104948.3 525845.38 104948.07 525856.52 104948.02 525871.5 104945.55 525871.69 104945.52 525872.5 104945.4 525873.122 104945.339','525875.301','104966.97','525773.63','104945.234'),
('Adur_District_Council','525661.75 105117.9 525685.085 105118.801 525684.99 105119.75 525684.5 105124.8 525684.47 105125.2 525681.819 105124.974 525680.35 105124.85 525671.99 105124.09 525664.24 105123.45 525661.29 105123.2 525661.75 105117.9','525685.085','105125.2','525661.29','105117.9'),
('Adur_District_Council','525685.1 105118.65 525686.35 105118.85 525698.949 105119.35 525700.4 105119.45 525701.699 105119.5 525703.15 105119.5 525709.949 105119.85 525709.29 105128.05 525707.25 105127.85 525699.55 105127.15 525697.9 105127.05 525689.8 105126.35 525685.819 105126 525684.686 105125.905 525684.406 105125.882 525684.47 105125.2 525684.5 105124.8 525684.99 105119.75 525685.1 105118.65','525709.949','105128.05','525684.406','105118.65');

  • Unfortunately you need to start over and rewrite at least that while-read loop in awk; see why-is-using-a-shell-loop-to-process-text-considered-bad-practice. You should also copy/paste the rest of it into shellcheck.net and fix the issues it’ll tell you about. While that will make your script much more robust, it won’t affect the performance much; it’ll still be extremely slow until you get rid of that while-read loop.

  • For each line in the coordinates file, you spawn 20 child processes. Most of them are doing close to nothing, but creating a child process costs resources. If you want to stick with bash, eliminate all child processes from the loop body. Alternatively, throw away the script and write it again in e.g. Ruby, Python, Perl, C++, Java or whatever you feel most familiar with.

  • The while loop is slow primarily because of the 24 subshells invoked on each pass through the loop; the mapfile approach still performs 8 subshell invocations for each array entry processed by the for loops. While there are likely ways to reduce the number of subshell invocations, the fact that you’re dealing with non-integers complicates matters (from a bash perspective). Using another language/tool (e.g. awk, perl, python) is going to be substantially faster.

  • Thanks guys for your super-fast responses already! @EdMorton, I added a reproducible example to the body above. I had a suspicion that reading text from a while loop was the cause of a lot of the issues. I guess I don’t really have a solid grounding in any scripting language and am just trying my best, so any tips at all, however simple, would be useful for me in learning about bash efficiency.




  • @joe.mse the important thing to know about bash efficiency is that bash is a shell, and a shell is a tool to manipulate (create/destroy) files and processes and to sequence calls to other tools. It is NOT a tool to manipulate text, so don’t expect it to be easy to use or efficient for that purpose. Other tools that bash can call, such as awk (the general-purpose text manipulation tool that exists on all Unix systems, as it’s a mandatory POSIX tool), exist to manipulate text.




Using any awk:

$ cat tst.sh
#!/usr/bin/env bash

gmlfile="Adur_District_Council.gml"
gml_short="${gmlfile%%.gml}"

awk -v q="'" -v gml_short="$gml_short" '
    BEGIN {
        OFS = q "," q   # each comma in a print now closes one quoted value and opens the next
        print "INSERT IGNORE INTO gmldata.uk (region,coordinates,maxLat,maxLong,minLat,minLong) VALUES"
        ORS = ""        # suppress automatic newlines; "sep" below supplies the row separators
    }
    {
        # seed the min/max from the first lat/long pair, then scan the remaining pairs
        max_lat = min_lat = $1
        max_lon = min_lon = $2
        for ( i=3; i<NF; i+=2 ) {
            lat = $i
            lon = $(i+1)
            max_lat = (lat > max_lat ? lat : max_lat)
            min_lat = (lat < min_lat ? lat : min_lat)
            max_lon = (lon > max_lon ? lon : max_lon)
            min_lon = (lon < min_lon ? lon : min_lon)
        }
        # "sep" is empty before the first row and ",\n" before every later one,
        # so the final row can be closed with ";" in the END block
        print sep "(" q gml_short, $0, max_lat, max_lon, min_lat, min_lon q ")"
        sep = ",\n"
    }
    END {
        print ";\n"
    }
' coords.txt

$ cat coords.txt
525873.122 104945.339 525873.13 104945.234 525872.916 104949.942 525873.308 104950.628 525874.582 104951.118 525875.268 104951.804 525875.301 104951.902 525871.972 104955.36 525847.95 104958.6 525824.152 104961.864 525821.7 104962.2 525798.7 104964.55 525778.14 104966.53 525773.63 104966.97 525778.15 104958.65 525781.863 104952.578 525783.98 104948.99 525797.06 104948.84 525819.82 104948.3 525845.38 104948.07 525856.52 104948.02 525871.5 104945.55 525871.69 104945.52 525872.5 104945.4 525873.122 104945.339
525661.75 105117.9 525685.085 105118.801 525684.99 105119.75 525684.5 105124.8 525684.47 105125.2 525681.819 105124.974 525680.35 105124.85 525671.99 105124.09 525664.24 105123.45 525661.29 105123.2 525661.75 105117.9
525685.1 105118.65 525686.35 105118.85 525698.949 105119.35 525700.4 105119.45 525701.699 105119.5 525703.15 105119.5 525709.949 105119.85 525709.29 105128.05 525707.25 105127.85 525699.55 105127.15 525697.9 105127.05 525689.8 105126.35 525685.819 105126 525684.686 105125.905 525684.406 105125.882 525684.47 105125.2 525684.5 105124.8 525684.99 105119.75 525685.1 105118.65

$ ./tst.sh
INSERT IGNORE INTO gmldata.uk (region,coordinates,maxLat,maxLong,minLat,minLong) VALUES
('Adur_District_Council','525873.122 104945.339 525873.13 104945.234 525872.916 104949.942 525873.308 104950.628 525874.582 104951.118 525875.268 104951.804 525875.301 104951.902 525871.972 104955.36 525847.95 104958.6 525824.152 104961.864 525821.7 104962.2 525798.7 104964.55 525778.14 104966.53 525773.63 104966.97 525778.15 104958.65 525781.863 104952.578 525783.98 104948.99 525797.06 104948.84 525819.82 104948.3 525845.38 104948.07 525856.52 104948.02 525871.5 104945.55 525871.69 104945.52 525872.5 104945.4 525873.122 104945.339','525875.301','104966.97','525773.63','104945.234'),
('Adur_District_Council','525661.75 105117.9 525685.085 105118.801 525684.99 105119.75 525684.5 105124.8 525684.47 105125.2 525681.819 105124.974 525680.35 105124.85 525671.99 105124.09 525664.24 105123.45 525661.29 105123.2 525661.75 105117.9','525685.085','105125.2','525661.29','105117.9'),
('Adur_District_Council','525685.1 105118.65 525686.35 105118.85 525698.949 105119.35 525700.4 105119.45 525701.699 105119.5 525703.15 105119.5 525709.949 105119.85 525709.29 105128.05 525707.25 105127.85 525699.55 105127.15 525697.9 105127.05 525689.8 105126.35 525685.819 105126 525684.686 105125.905 525684.406 105125.882 525684.47 105125.2 525684.5 105124.8 525684.99 105119.75 525685.1 105118.65','525709.949','105128.05','525684.406','105118.65');

So your full script would become (untested):

#!/usr/bin/env bash

gmlfile="$1"
gmldir="/home/user.name/gml_files"
gml_short="${gmlfile%%.gml}"

xmllint --xpath "//*[local-name()='posList']" "$gmldir/$gmlfile" | cut -d">" -f2 | cut -d"<" -f1 > "$gmldir/$gml_short-coords.txt" #Dump all coordinates to coordinate file

awk -v q="'" -v gml_short="$gml_short" '
    BEGIN {
        OFS = q "," q
        print "INSERT IGNORE INTO gmldata.uk (region,coordinates,maxLat,maxLong,minLat,minLong) VALUES"
        ORS = ""
    }
    {
        max_lat = min_lat = $1
        max_lon = min_lon = $2
        for ( i=3; i<NF; i+=2 ) {
            lat = $i
            lon = $(i+1)
            max_lat = (lat > max_lat ? lat : max_lat)
            min_lat = (lat < min_lat ? lat : min_lat)
            max_lon = (lon > max_lon ? lon : max_lon)
            min_lon = (lon < min_lon ? lon : min_lon)
        }
        print sep "(" q gml_short, $0, max_lat, max_lon, min_lat, min_lon q ")"
        sep = ",\n"
    }
    END {
        print ";\n"
    }
' "$gmldir/$gml_short-coords.txt" > gml_sql_query.sql

mysql < gml_sql_query.sql
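
You’d run it the same way as your original, passing just the gml file name as the argument, e.g. (assuming the script is saved as tst.sh and Adur_District_Council.gml sits in $gmldir):

$ ./tst.sh Adur_District_Council.gml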

or if you’d rather not have that apparently temporary gml_sql_query.sql file:

$ cat tst.sh
#!/usr/bin/env bash

gmlfile="$1"
gmldir="/home/user.name/gml_files"
gml_short="${gmlfile%%.gml}"

xmllint --xpath "//*[local-name()='posList']" "$gmldir/$gmlfile" | cut -d">" -f2 | cut -d"<" -f1 > "$gmldir/$gml_short-coords.txt" #Dump all coordinates to coordinate file

mysql < <(
    awk -v q="'" -v gml_short="$gml_short" '
        BEGIN {
            OFS = q "," q
            print "INSERT IGNORE INTO gmldata.uk (region,coordinates,maxLat,maxLong,minLat,minLong) VALUES"
            ORS = ""
        }
        {
            max_lat = min_lat = $1
            max_lon = min_lon = $2
            for ( i=3; i<NF; i+=2 ) {
                lat = $i
                lon = $(i+1)
                max_lat = (lat > max_lat ? lat : max_lat)
                min_lat = (lat < min_lat ? lat : min_lat)
                max_lon = (lon > max_lon ? lon : max_lon)
                min_lon = (lon < min_lon ? lon : min_lon)
            }
            print sep "(" q gml_short, $0, max_lat, max_lon, min_lat, min_lon q ")"
            sep = ",\n"
        }
        END {
            print ";\n"
        }
    ' "$gmldir/$gml_short-coords.txt"
)
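
If you want a quick sanity check after the import, you can compare the number of coordinate lines with the number of rows that ended up in the table. This is only a rough check (the counts can legitimately differ if INSERT IGNORE skipped duplicates) and it assumes the same non-interactive mysql invocation your script already relies on:

$ wc -l < /home/user.name/gml_files/Adur_District_Council-coords.txt
$ mysql -N -e "SELECT COUNT(*) FROM gmldata.uk WHERE region = 'Adur_District_Council'"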
