I have two pipe-delimited files, each with two columns. Whenever the second column of the first file matches the first column of the second file, I want to append the second column of the second file to the matching line. My awk command seems to identify only some of the matches and I cannot understand why. In the test data set, it should match every line whose second column is g1001.t1.
test.addgene.txt file content:
ptg000013l AUGUSTUS gene 7594135 7594636 0.57 + . ID=g1000;|
ptg000013l AUGUSTUS mRNA 7594135 7594636 0.57 + . ID=g1000.t1;Parent=g1000;|g1000
ptg000013l AUGUSTUS start_codon 7594135 7594137 . + 0 ID=g1000.t1.start1;Parent=g1000.t1;|g1000.t1
ptg000013l AUGUSTUS CDS 7594135 7594312 0.6 + 0 ID=g1000.t1.CDS1;Parent=g1000.t1;|g1000.t1
ptg000013l AUGUSTUS exon 7594135 7594312 . + . ID=g1000.t1.exon1;Parent=g1000.t1;|g1000.t1
ptg000013l AUGUSTUS intron 7594313 7594367 0.68 + . ID=g1000.t1.intron1;Parent=g1000.t1;|g1000.t1
ptg000013l AUGUSTUS CDS 7594368 7594636 0.68 + 2 ID=g1000.t1.CDS2;Parent=g1000.t1;|g1000.t1
ptg000013l AUGUSTUS exon 7594368 7594636 . + . ID=g1000.t1.exon2;Parent=g1000.t1;|g1000.t1
ptg000013l AUGUSTUS stop_codon 7594634 7594636 . + 0 ID=g1000.t1.stop1;Parent=g1000.t1;|g1000.t1
ptg000013l AUGUSTUS gene 7594770 7599695 0.46 + . ID=g1001;|
ptg000013l AUGUSTUS mRNA 7594770 7599695 0.46 + . ID=g1001.t1;Parent=g1001;|g1001
ptg000013l AUGUSTUS start_codon 7594770 7594772 . + 0 ID=g1001.t1.start1;Parent=g1001.t1;|g1001.t1
ptg000013l AUGUSTUS CDS 7594770 7594848 0.9 + 0 ID=g1001.t1.CDS1;Parent=g1001.t1;|g1001.t1
ptg000013l AUGUSTUS exon 7594770 7594848 . + . ID=g1001.t1.exon1;Parent=g1001.t1;|g1001.t1
ptg000013l AUGUSTUS intron 7594849 7599270 0.8 + . ID=g1001.t1.intron1;Parent=g1001.t1;|g1001.t1
ptg000013l AUGUSTUS CDS 7599271 7599695 0.48 + 2 ID=g1001.t1.CDS2;Parent=g1001.t1;|g1001.t1
ptg000013l AUGUSTUS exon 7599271 7599695 . + . ID=g1001.t1.exon2;Parent=g1001.t1;|g1001.t1
ptg000013l AUGUSTUS stop_codon 7599693 7599695 . + 0 ID=g1001.t1.stop1;Parent=g1001.t1;|g1001.t1
ptg000013l AUGUSTUS gene 7611253 7611658 0.68 + . ID=g1002;|
ptg000013l AUGUSTUS mRNA 7611253 7611658 0.68 + . ID=g1002.t1;Parent=g1002;|g1002
ptg000013l AUGUSTUS start_codon 7611253 7611255 . + 0 ID=g1002.t1.start1;Parent=g1002.t1;|g1002.t1
ptg000013l AUGUSTUS CDS 7611253 7611390 0.72 + 0 ID=g1002.t1.CDS1;Parent=g1002.t1;|g1002.t1
ptg000013l AUGUSTUS exon 7611253 7611390 . + . ID=g1002.t1.exon1;Parent=g1002.t1;|g1002.t1
ptg000013l AUGUSTUS intron 7611391 7611439 0.78 + . ID=g1002.t1.intron1;Parent=g1002.t1;|g1002.t1
ptg000013l AUGUSTUS CDS 7611440 7611658 0.84 + 0 ID=g1002.t1.CDS2;Parent=g1002.t1;|g1002.t1
ptg000013l AUGUSTUS exon 7611440 7611658 . + . ID=g1002.t1.exon2;Parent=g1002.t1;|g1002.t1
ptg000013l AUGUSTUS stop_codon 7611656 7611658 . + 0 ID=g1002.t1.stop1;Parent=g1002.t1;|g1002.t1
test.names.txt file content:
g1001.t1|sorting nexin-4
g10010.t1|2-methoxy-6-polyprenyl-1,4-benzoquinol methylase [EC:2.1.1.201]
g10012.t2|small nuclear ribonucleoprotein D3
g10013.t1|tetratricopeptide repeat protein 4
g10024.t1|ATP-binding cassette, subfamily C (CFTR/MRP), member 4
g10027.t1|synaptosomal-associated protein 29
g10032.t1|serine/threonine-protein phosphatase PP1 catalytic subunit [EC:3.1.3.16]
g10033.t1|ligand of Numb protein X 1/2 [EC:2.3.2.27]
g10034.t1|PAX-interacting protein 1
g10038.t1|zinc finger SWIM domain-containing protein 7
g10041.t1|neuronal cell adhesion molecule
g10045.t1|peptidyl-tRNA hydrolase, PTH2 family [EC:3.1.1.29]
g10060.t1|endonuclease G, mitochondrial
g1007.t2|protocadherin-16/23
g10072.t1|fatty acid synthase, animal type [EC:2.3.1.85]
g10078.t1|cathepsin B [EC:3.4.22.1]
g1009.t1|gem associated protein 8
g10090.t1|KRAB domain-containing zinc finger protein
g1010.t1|translation initiation factor 3 subunit K
g10117.t1|kinetochore protein NDC80
g1012.t1|T-complex protein 1 subunit epsilon
Code used:
awk 'BEGIN { FS=OFS="|"; }; NR==FNR{a[$2]=$1} ($1 in a){print a[$1], $0}' test.addgene.txt test.names.txt
Resulting output only matched the line containing stop_codon:
ptg000013l AUGUSTUS stop_codon 7599693 7599695 . + 0 ID=g1001.t1.stop1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
Expected output:
ptg000013l AUGUSTUS start_codon 7594770 7594772 . + 0 ID=g1001.t1.start1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS CDS 7594770 7594848 0.9 + 0 ID=g1001.t1.CDS1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS exon 7594770 7594848 . + . ID=g1001.t1.exon1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS intron 7594849 7599270 0.8 + . ID=g1001.t1.intron1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS CDS 7599271 7599695 0.48 + 2 ID=g1001.t1.CDS2;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS exon 7599271 7599695 . + . ID=g1001.t1.exon2;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS stop_codon 7599693 7599695 . + 0 ID=g1001.t1.stop1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
Looking for odd characters in the input file shows nothing unexpected:
LC_ALL=C sed -n l test.addgene.txt
ptg000013l\tAUGUSTUS\tgene\t7594135\t7594636\t0.57\t+\t.\tID=g1000;|$
ptg000013l\tAUGUSTUS\tmRNA\t7594135\t7594636\t0.57\t+\t.\tID=g1000.t1;Parent=g1\
000;|g1000$
ptg000013l\tAUGUSTUS\tstart_codon\t7594135\t7594137\t.\t+\t0\tID=g1000.t1.start\
1;Parent=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tCDS\t7594135\t7594312\t0.6\t+\t0\tID=g1000.t1.CDS1;Parent\
=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\texon\t7594135\t7594312\t.\t+\t.\tID=g1000.t1.exon1;Parent\
=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tintron\t7594313\t7594367\t0.68\t+\t.\tID=g1000.t1.intron1\
;Parent=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tCDS\t7594368\t7594636\t0.68\t+\t2\tID=g1000.t1.CDS2;Paren\
t=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\texon\t7594368\t7594636\t.\t+\t.\tID=g1000.t1.exon2;Parent\
=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tstop_codon\t7594634\t7594636\t.\t+\t0\tID=g1000.t1.stop1;\
Parent=g1000.t1;|g1000.t1$
ptg000013l\tAUGUSTUS\tgene\t7594770\t7599695\t0.46\t+\t.\tID=g1001;|$
ptg000013l\tAUGUSTUS\tmRNA\t7594770\t7599695\t0.46\t+\t.\tID=g1001.t1;Parent=g1\
001;|g1001$
ptg000013l\tAUGUSTUS\tstart_codon\t7594770\t7594772\t.\t+\t0\tID=g1001.t1.start\
1;Parent=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tCDS\t7594770\t7594848\t0.9\t+\t0\tID=g1001.t1.CDS1;Parent\
=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\texon\t7594770\t7594848\t.\t+\t.\tID=g1001.t1.exon1;Parent\
=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tintron\t7594849\t7599270\t0.8\t+\t.\tID=g1001.t1.intron1;\
Parent=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tCDS\t7599271\t7599695\t0.48\t+\t2\tID=g1001.t1.CDS2;Paren\
t=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\texon\t7599271\t7599695\t.\t+\t.\tID=g1001.t1.exon2;Parent\
=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tstop_codon\t7599693\t7599695\t.\t+\t0\tID=g1001.t1.stop1;\
Parent=g1001.t1;|g1001.t1$
ptg000013l\tAUGUSTUS\tgene\t7611253\t7611658\t0.68\t+\t.\tID=g1002;|$
ptg000013l\tAUGUSTUS\tmRNA\t7611253\t7611658\t0.68\t+\t.\tID=g1002.t1;Parent=g1\
002;|g1002$
ptg000013l\tAUGUSTUS\tstart_codon\t7611253\t7611255\t.\t+\t0\tID=g1002.t1.start\
1;Parent=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\tCDS\t7611253\t7611390\t0.72\t+\t0\tID=g1002.t1.CDS1;Paren\
t=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\texon\t7611253\t7611390\t.\t+\t.\tID=g1002.t1.exon1;Parent\
=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\tintron\t7611391\t7611439\t0.78\t+\t.\tID=g1002.t1.intron1\
;Parent=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\tCDS\t7611440\t7611658\t0.84\t+\t0\tID=g1002.t1.CDS2;Paren\
t=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\texon\t7611440\t7611658\t.\t+\t.\tID=g1002.t1.exon2;Parent\
=g1002.t1;|g1002.t1$
ptg000013l\tAUGUSTUS\tstop_codon\t7611656\t7611658\t.\t+\t0\tID=g1002.t1.stop1;\
Parent=g1002.t1;|g1002.t1$
This is a test data set and I am not looking for an answer specific to matching g1001.t1. Hopefully someone can suggest methods to help me troubleshoot the problem of awk not finding and printing all matches. (This is on a MacBook Pro.)
You are replacing the value of a["g1001.t1"] each time in NR==FNR so you only end up capturing the stop_codon line in a["g1001.t1"].
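One way to see this overwrite for yourself is to dump the array in an END block. The following is only a minimal sketch on a made-up three-line sample; the file name demo.overwrite.txt, the row labels, and the variable overwrite_result are placeholders, not part of the real data.

```shell
# Hypothetical miniature of test.addgene.txt: three rows share the key g1001.t1.
printf '%s\n' 'rowA|g1001.t1' 'rowB|g1001.t1' 'rowC|g1001.t1' > demo.overwrite.txt

# Same assignment as in the question: every row overwrites a[$2],
# so once the file has been read only the last row is left under that key.
overwrite_result=$(awk 'BEGIN { FS = "|" }
    { a[$2] = $1 }
    END { for (k in a) print k " -> " a[k] }' demo.overwrite.txt)

echo "$overwrite_result"
```

Printing the array contents like this is a general troubleshooting step: if a key holds fewer rows than you expect, the array is being built the wrong way round.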
The logic seems bass-ackwards; you probably instead want to read the names into memory and then add them to any lines from the addgene file where $2 is identical to $1 from the names file.
awk 'BEGIN { FS=OFS="|" }
NR==FNR { a[$1]=$2; next }
($2 in a) { print $0, a[$2] }' test.names.txt test.addgene.txt
Also notice the next in the NR==FNR case to prevent the script from falling through and accidentally printing out something from the first input file as well; there will obviously also be a minor efficiency improvement.
Example output:
ptg000013l AUGUSTUS start_codon 7594770 7594772 . + 0 ID=g1001.t1.start1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS CDS 7594770 7594848 0.9 + 0 ID=g1001.t1.CDS1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS exon 7594770 7594848 . + . ID=g1001.t1.exon1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS intron 7594849 7599270 0.8 + . ID=g1001.t1.intron1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS CDS 7599271 7599695 0.48 + 2 ID=g1001.t1.CDS2;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS exon 7599271 7599695 . + . ID=g1001.t1.exon2;Parent=g1001.t1;|g1001.t1|sorting nexin-4
ptg000013l AUGUSTUS stop_codon 7599693 7599695 . + 0 ID=g1001.t1.stop1;Parent=g1001.t1;|g1001.t1|sorting nexin-4
If you wanted to print all lines but add the name decoration where available, the change should be obvious; simply change the value of $0 in the ($2 in a) clause, and then add an unconditional print at the end of the script (conventionally simply 1).
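A sketch of that print-everything variant might look like the following. It runs on small made-up inputs; the file names demo.names.txt and demo.addgene.txt, the row labels, and the variable all_lines are placeholders for illustration only.

```shell
# Hypothetical miniature inputs; names and contents are placeholders.
printf '%s\n' 'g1001.t1|sorting nexin-4' > demo.names.txt
printf '%s\n' 'geneRow|' 'mrnaRow|g1001' 'cdsRow|g1001.t1' > demo.addgene.txt

all_lines=$(awk 'BEGIN { FS = OFS = "|" }
    NR == FNR { a[$1] = $2; next }      # first file: remember id -> name
    ($2 in a) { $0 = $0 OFS a[$2] }     # second file: append the name when the id is known
    1                                   # print every line, decorated or not
' demo.names.txt demo.addgene.txt)

echo "$all_lines"
```

Lines whose second column has no entry in the names file pass through unchanged, and only the matching line gains a third column.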