Awk not finding all matches to index

Question 1

I have two pipe-delimited files, each with two columns. I want to append the second column of the second file to every line of the first file whose second column matches the first column of the second file. My awk command finds only some of the matches, and I cannot work out why. In the test data set, it should match every line whose second column is g1001.t1.

test.addgene.txt file content:

ptg000013l  AUGUSTUS    gene    7594135 7594636 0.57    +   .   ID=g1000;|
ptg000013l  AUGUSTUS    mRNA    7594135 7594636 0.57    +   .   ID=g1000.t1;Parent=g1000;|g1000
ptg000013l  AUGUSTUS    start_codon 7594135 7594137 .   +   0   ID=g1000.t1.start1;Parent=g1000.t1;|g1000.t1
ptg000013l  AUGUSTUS    CDS 7594135 7594312 0.6 +   0   ID=g1000.t1.CDS1;Parent=g1000.t1;|g1000.t1
ptg000013l  AUGUSTUS    exon    7594135 7594312 .   +   .   ID=g1000.t1.exon1;Parent=g1000.t1;|g1000.t1
ptg000013l  AUGUSTUS    intron  7594313 7594367 0.68    +   .   ID=g1000.t1.intron1;Parent=g1000.t1;|g1000.t1
ptg000013l  AUGUSTUS    CDS 7594368 7594636 0.68    +   2   ID=g1000.t1.CDS2;Parent=g1000.t1;|g1000.t1
ptg000013l  AUGUSTUS    exon    7594368 7594636 .   +   .   ID=g1000.t1.exon2;Parent=g1000.t1;|g1000.t1
ptg000013l  AUGUSTUS    stop_codon  7594634 7594636 .   +   0   ID=g1000.t1.stop1;Parent=g1000.t1;|g1000.t1
ptg000013l  AUGUSTUS    gene    7594770 7599695 0.46    +   .   ID=g1001;|
ptg000013l  AUGUSTUS    mRNA    7594770 7599695 0.46    +   .   ID=g1001.t1;Parent=g1001;|g1001
ptg000013l  AUGUSTUS    start_codon 7594770 7594772 .   +   0   ID=g1001.t1.start1;Parent=g1001.t1;|g1001.t1
ptg000013l  AUGUSTUS    CDS 7594770 7594848 0.9 +   0   ID=g1001.t1.CDS1;Parent=g1001.t1;|g1001.t1
ptg000013l  AUGUSTUS    exon    7594770 7594848 .   +   .   ID=g1001.t1.exon1;Parent=g1001.t1;|g1001.t1
ptg000013l  AUGUSTUS    intron  7594849 7599270 0.8 +   .   ID=g1001.t1.intron1;Parent=g1001.t1;|g1001.t1
ptg000013l  AUGUSTUS    CDS 7599271 7599695 0.48    +   2   ID=g1001.t1.CDS2;Parent=g1001.t1;|g1001.t1
ptg000013l  AUGUSTUS    exon    7599271 7599695 .   +   .   ID=g1001.t1.exon2;Parent=g1001.t1;|g1001.t1
ptg000013l  AUGUSTUS    stop_codon  7599693 7599695 .   +   0   ID=g1001.t1.stop1;Parent=g1001.t1;|g1001.t1
ptg000013l  AUGUSTUS    gene    7611253 7611658 0.68    +   .   ID=g1002;|
ptg000013l  AUGUSTUS    mRNA    7611253 7611658 0.68    +   .   ID=g1002.t1;Parent=g1002;|g1002
ptg000013l  AUGUSTUS    start_codon 7611253 7611255 .   +   0   ID=g1002.t1.start1;Parent=g1002.t1;|g1002.t1
ptg000013l  AUGUSTUS    CDS 7611253 7611390 0.72    +   0   ID=g1002.t1.CDS1;Parent=g1002.t1;|g1002.t1
ptg000013l  AUGUSTUS    exon    7611253 7611390 .   +   .   ID=g1002.t1.exon1;Parent=g1002.t1;|g1002.t1
ptg000013l  AUGUSTUS    intron  7611391 7611439 0.78    +   .   ID=g1002.t1.intron1;Parent=g1002.t1;|g1002.t1
ptg000013l  AUGUSTUS    CDS 7611440 7611658 0.84    +   0   ID=g1002.t1.CDS2;Parent=g1002.t1;|g1002.t1
ptg000013l  AUGUSTUS    exon    7611440 7611658 .   +   .   ID=g1002.t1.exon2;Parent=g1002.t1;|g1002.t1
ptg000013l  AUGUSTUS    stop_codon  7611656 7611658 .   +   0   ID=g1002.t1.stop1;Parent=g1002.t1;|g1002.t1

test.names.txt file content:

g1001.t1|sorting nexin-4
g10010.t1|2-methoxy-6-polyprenyl-1,4-benzoquinol methylase [EC:2.1.1.201]
g10012.t2|small nuclear ribonucleoprotein D3
g10013.t1|tetratricopeptide repeat protein 4
g10024.t1|ATP-binding cassette, subfamily C (CFTR/MRP), member 4
g10027.t1|synaptosomal-associated protein 29
g10032.t1|serine/threonine-protein phosphatase PP1 catalytic subunit [EC:3.1.3.16]
g10033.t1|ligand of Numb protein X 1/2 [EC:2.3.2.27]
g10034.t1|PAX-interacting protein 1
g10038.t1|zinc finger SWIM domain-containing protein 7
g10041.t1|neuronal cell adhesion molecule
g10045.t1|peptidyl-tRNA hydrolase, PTH2 family [EC:3.1.1.29]
g10060.t1|endonuclease G, mitochondrial
g1007.t2|protocadherin-16/23
g10072.t1|fatty acid synthase, animal type [EC:2.3.1.85]
g10078.t1|cathepsin B [EC:3.4.22.1]
g1009.t1|gem associated protein 8
g10090.t1|KRAB domain-containing zinc finger protein
g1010.t1|translation initiation factor 3 subunit K
g10117.t1|kinetochore protein NDC80
g1012.t1|T-complex protein 1 subunit epsilon

Code used:

awk 'BEGIN { FS=OFS="|"; }; NR==FNR{a[$2]=$1} ($1 in a){print a[$1], $0}' test.addgene.txt test.names.txt

The resulting output contains only the stop_codon line:

ptg000013l  AUGUSTUS    stop_codon  7599693 7599695 .   +   0   ID=g1001.t1.stop1;Parent=g1001.t1;|g1001.t1|sorting nexin-4

Expected output:

ptg000013l  AUGUSTUS    start_codon 7594770 7594772 .   +   0   ID=g1001.t1.start1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l  AUGUSTUS    CDS 7594770 7594848 0.9 +   0   ID=g1001.t1.CDS1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l  AUGUSTUS    exon    7594770 7594848 .   +   .   ID=g1001.t1.exon1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l  AUGUSTUS    intron  7594849 7599270 0.8 +   .   ID=g1001.t1.intron1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l  AUGUSTUS    CDS 7599271 7599695 0.48    +   2   ID=g1001.t1.CDS2;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l  AUGUSTUS    exon    7599271 7599695 .   +   .   ID=g1001.t1.exon2;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l  AUGUSTUS    stop_codon  7599693 7599695 .   +   0   ID=g1001.t1.stop1;Parent=g1001.t1;|g1001.t1|sorting nexin-4

Checking the input file for odd characters shows nothing unexpected:

LC_ALL=C sed -n l test.addgene.txt

ptg000013l\tAUGUSTUS\tgene\t7594135\t7594636\t0.57\t+\t.\tID=g1000;|$
ptg000013l\tAUGUSTUS\tmRNA\t7594135\t7594636\t0.57\t+\t.\tID=g1000.t1;Parent=g1\
000;|g1000$
ptg000013l\tAUGUSTUS\tstart_codon\t7594135\t7594137\t.\t+\t0\tID=g1000.t1.start\
1;Parent=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tCDS\t7594135\t7594312\t0.6\t+\t0\tID=g1000.t1.CDS1;Parent\
=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\texon\t7594135\t7594312\t.\t+\t.\tID=g1000.t1.exon1;Parent\
=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tintron\t7594313\t7594367\t0.68\t+\t.\tID=g1000.t1.intron1\
;Parent=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tCDS\t7594368\t7594636\t0.68\t+\t2\tID=g1000.t1.CDS2;Paren\
t=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\texon\t7594368\t7594636\t.\t+\t.\tID=g1000.t1.exon2;Parent\
=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tstop_codon\t7594634\t7594636\t.\t+\t0\tID=g1000.t1.stop1;\
Parent=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tgene\t7594770\t7599695\t0.46\t+\t.\tID=g1001;|$
ptg000013l\tAUGUSTUS\tmRNA\t7594770\t7599695\t0.46\t+\t.\tID=g1001.t1;Parent=g1\
001;|g1001$
ptg000013l\tAUGUSTUS\tstart_codon\t7594770\t7594772\t.\t+\t0\tID=g1001.t1.start\
1;Parent=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tCDS\t7594770\t7594848\t0.9\t+\t0\tID=g1001.t1.CDS1;Parent\
=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\texon\t7594770\t7594848\t.\t+\t.\tID=g1001.t1.exon1;Parent\
=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tintron\t7594849\t7599270\t0.8\t+\t.\tID=g1001.t1.intron1;\
Parent=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tCDS\t7599271\t7599695\t0.48\t+\t2\tID=g1001.t1.CDS2;Paren\
t=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\texon\t7599271\t7599695\t.\t+\t.\tID=g1001.t1.exon2;Parent\
=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tstop_codon\t7599693\t7599695\t.\t+\t0\tID=g1001.t1.stop1;\
Parent=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tgene\t7611253\t7611658\t0.68\t+\t.\tID=g1002;|$
ptg000013l\tAUGUSTUS\tmRNA\t7611253\t7611658\t0.68\t+\t.\tID=g1002.t1;Parent=g1\
002;|g1002$
ptg000013l\tAUGUSTUS\tstart_codon\t7611253\t7611255\t.\t+\t0\tID=g1002.t1.start\
1;Parent=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\tCDS\t7611253\t7611390\t0.72\t+\t0\tID=g1002.t1.CDS1;Paren\
t=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\texon\t7611253\t7611390\t.\t+\t.\tID=g1002.t1.exon1;Parent\
=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\tintron\t7611391\t7611439\t0.78\t+\t.\tID=g1002.t1.intron1\
;Parent=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\tCDS\t7611440\t7611658\t0.84\t+\t0\tID=g1002.t1.CDS2;Paren\
t=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\texon\t7611440\t7611658\t.\t+\t.\tID=g1002.t1.exon2;Parent\
=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\tstop_codon\t7611656\t7611658\t.\t+\t0\tID=g1002.t1.stop1;\
Parent=g1002.t1;|g1002.t1$

This is a test data set, and I am not looking for an answer specific to matching g1001.t1. I am hoping someone can suggest ways to troubleshoot why awk is not finding and printing all matches. (This is on a MacBook Pro.)

Answer

In the NR==FNR block you overwrite a["g1001.t1"] on every line of the first file that has that key, so only the last such line, the stop_codon one, is still stored in a["g1001.t1"] when the second file is read.
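A minimal sketch of that overwrite, using three made-up input lines rather than the real data (the values first, second, third and the key k are purely illustrative). Every line carries the same key in field 2, so each assignment replaces the previous value and only the last one is left at the end:

```shell
# Hypothetical three-line input; every line has the same key "k" in field 2.
out=$(printf 'first|k\nsecond|k\nthird|k\n' |
  awk -F'|' '{ a[$2] = $1 }    # same key each time, so the old value is replaced
             END { for (key in a) print key, a[key] }')
echo "$out"
```

Running the same kind of END loop against the real first file is one way to check what an array actually holds before the second file is read.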

The logic seems bass-ackwards; you probably instead want to read the names into memory and then add them to any lines from the addgene file where $2 is identical to $1 from the names file.

awk 'BEGIN { FS=OFS="|" }
    NR==FNR { a[$1]=$2; next }
    ($2 in a) { print $0, a[$2] }' test.names.txt test.addgene.txt

Also notice the next in the NR==FNR case to prevent the script from falling through and accidentally printing out something from the first input file as well; there will obviously also be a minor efficiency improvement.
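To illustrate the fall-through, here is a small sketch with two contrived one-line files (first.txt and second.txt are hypothetical names, and the data is chosen so that a value in the first file also happens to be a key). It is only meant to show the mechanism, not something that occurs in the test data above:

```shell
# Hypothetical one-line files; in first.txt the value "x" is also a key.
tmp=$(mktemp -d)
printf 'x|x\n' > "$tmp/first.txt"
printf 'row|x\n' > "$tmp/second.txt"

# Without next, the first file's own line also reaches the ($2 in a) rule.
without_next=$(awk 'BEGIN { FS=OFS="|" }
    NR==FNR { a[$1]=$2 }
    ($2 in a) { print $0, a[$2] }' "$tmp/first.txt" "$tmp/second.txt")

# With next, only lines from the second file are tested and printed.
with_next=$(awk 'BEGIN { FS=OFS="|" }
    NR==FNR { a[$1]=$2; next }
    ($2 in a) { print $0, a[$2] }' "$tmp/first.txt" "$tmp/second.txt")

printf 'without next:\n%s\nwith next:\n%s\n' "$without_next" "$with_next"
rm -rf "$tmp"
```

Without next, the line from first.txt is printed as well; with next, only the line from second.txt appears.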

Example output:

ptg000013l      AUGUSTUS        start_codon     7594770 7594772 .       +             0       ID=g1001.t1.start1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l      AUGUSTUS        CDS     7594770 7594848 0.9     +       0             ID=g1001.t1.CDS1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l      AUGUSTUS        exon    7594770 7594848 .       +       .             ID=g1001.t1.exon1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l      AUGUSTUS        intron  7594849 7599270 0.8     +       .             ID=g1001.t1.intron1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l      AUGUSTUS        CDS     7599271 7599695 0.48    +       2             ID=g1001.t1.CDS2;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l      AUGUSTUS        exon    7599271 7599695 .       +       .             ID=g1001.t1.exon2;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l      AUGUSTUS        stop_codon      7599693 7599695 .       +             0       ID=g1001.t1.stop1;Parent=g1001.t1;|g1001.t1|sorting nexin-4

If you want to print all lines but add the name where one is available, the change is small: in the ($2 in a) clause, modify $0 instead of printing (for example, append OFS and a[$2] to it), and then add an unconditional print at the end of the script (conventionally written simply as 1).
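A sketch of that variant follows. To keep it runnable on its own it uses small stand-in files (names.txt and addgene.txt in a temporary directory, with shortened lines modelled on the test data); with the real data you would pass test.names.txt and test.addgene.txt instead:

```shell
# Small stand-in files (hypothetical names and shortened lines) so the sketch is self-contained.
tmp=$(mktemp -d)
printf 'g1001.t1|sorting nexin-4\n' > "$tmp/names.txt"
printf '%s\n' \
  'ID=g1001;|' \
  'ID=g1001.t1.CDS1;Parent=g1001.t1;|g1001.t1' \
  'ID=g1002.t1.CDS1;Parent=g1002.t1;|g1002.t1' > "$tmp/addgene.txt"

out=$(awk 'BEGIN { FS=OFS="|" }
    NR==FNR { a[$1]=$2; next }        # first file: remember the name for each ID
    ($2 in a) { $0 = $0 OFS a[$2] }   # second file: append the name when one is known
    1                                 # print every line, decorated or not
' "$tmp/names.txt" "$tmp/addgene.txt")
echo "$out"
rm -rf "$tmp"
```

Only the middle line gains the |sorting nexin-4 suffix; the other two are printed unchanged.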
