I am using Bash, which I am very unfamiliar with. I have 2 source files.
One of them (name: clusters.txt) looks like:
Cluster 10: WP_1.2 WP_1.1 WP_1.4 ......
Cluster 15: WP_2.1 WP_1.4 WP_1.3 ......
In short, every line corresponds to a cluster which has a sequence of IDs (each ID looks like XY_123.4).
The second file (name: sequences.fasta) looks like:
>WP_1.1 some dummy text...
>WP_1.2 some more text...
>WP_1.3 some more text...
>WP_1.4 some more text...
>WP_2.1 some more text...
>WP_2.2 some more text...
In short, every line that starts with a “>” sign is a sequence.
What I need to do is take every “sequence” of a “cluster” and create a separate fasta file with them. For example, for Cluster 10, I need to create:
>WP_1.1 some dummy text...
>WP_1.2 some more text...
>WP_1.4 some more text...
I tried using grep in a loop, but that was extremely resource-intensive and did not perform well.
All help is much appreciated. Thanks in advance.
You can try this awk script:
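#! /usr/bin/awk -f
# Split each line into the cluster label ($1) and its space-separated IDs ($2).
BEGIN { FS=": " }
# Name the output after the cluster (e.g. Cluster_10), turn the IDs into an anchored
# alternation with literal dots, and run one awk per cluster over sequences.fasta.
{ gsub(" ", "_", $1); gsub(/\./, "[.]", $2); gsub(" ", "|", $2); system("/usr/bin/awk '/^>(" $2 ")( |$)/' sequences.fasta >" $1) }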
This assumes you have a reasonable number of lines in clusters.txt, because it invokes a new awk for each one.
chmod +x myscript
./myscript clusters.txt
Do you need to do it in bash? Which language are you familiar with?
Please update the question with the names and contents of all the files you wish to create (making sure they match the provided inputs).
Please update the question with the (grep) code you’ve tried; you’ve tagged the question with r … what r code have you attempted?
Do you want a “Bash” or an “R” solution?
“that was extremely costly”: Could it be that you implemented it inefficiently? How can we improve your solution if you don’t show us how you solved it?