Finding correct lines out of a file and using them to create new files [closed]

I am using Bash, which I am very unfamiliar with. I have two source files.

One of them (name: clusters.txt) looks like:

Cluster 10: WP_1.2 WP_1.1 WP_1.4 ......

Cluster 15: WP_2.1 WP_1.4 WP_1.3 ......

In short, every line corresponds to a cluster which has a sequence of IDs (each ID looks like XY_123.4).

The second file (name: sequences.fasta) looks like:

>WP_1.1 some dummy text...

>WP_1.2 some more text...

>WP_1.3 some more text...

>WP_1.4 some more text...

>WP_2.1 some more text...

>WP_2.2 some more text...

In short, every line that starts with a “>” sign is a sequence.

What I need to do is take every “sequence” of a “cluster” and create a separate fasta file with them. For example, for Cluster 10, I need to create:

>WP_1.1 some dummy text...

>WP_1.2 some more text...

>WP_1.4 some more text...

I tried using grep in a loop, but that was extremely costly on resources and did not perform acceptably.

All help is much appreciated. Thanks in advance.

Do you need to do it in Bash? Which language are you familiar with?

Please update the question with the names and contents of all the files you wish to create (making sure they match the provided inputs).

Please update the question with the (grep) code you’ve tried. You’ve tagged the question with r; what R code have you attempted?

Do you want a “Bash” or an “R” solution?

“that was extremely costly”: Could it be that you implemented it inefficiently? How can we improve your solution if you don’t show us how you solved it?

You can try this awk script:

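#! /usr/bin/awk -f

# Split each line into the cluster name ($1) and its space-separated ID list ($2).
BEGIN { FS=": " }
# Skip blank lines; turn the IDs into an alternation with literal dots, anchored to the header start.
NF { gsub(" ", "_", $1); gsub(/\./, "[.]", $2); gsub(" ", "|", $2); system("/usr/bin/awk '/^>(" $2 ")( |$)/' sequences.fasta >" $1) }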

This assumes you have a reasonable number of lines in clusters.txt, because the script invokes a new awk process for each one.

chmod +x myscript
./myscript clusters.txt
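
If clusters.txt is large, starting one awk per cluster may become the bottleneck. A possible alternative is a single awk run that reads clusters.txt first, remembers which output files each ID belongs to, and then streams sequences.fasta once. The following is only a sketch under some assumptions: the function name split_clusters, the optional output directory argument, and the .fasta suffix on the output names are made up for illustration, and it assumes that any lines following a “>” header belong to that sequence.

```shell
#!/usr/bin/env bash
# Sketch: split a FASTA file into one file per cluster in a single awk run.
# split_clusters is a hypothetical helper name, not part of the script above.
split_clusters() {
    local clusters=$1 fasta=$2 outdir=${3:-.}
    awk -v outdir="$outdir" '
        # First file (clusters.txt): map every ID to the cluster files it belongs to.
        NR == FNR {
            if (!/^Cluster /) next            # skip blank or unexpected lines
            name = $1 "_" $2                  # "Cluster" "10:" -> "Cluster_10:"
            sub(/:$/, "", name)
            file = outdir "/" name ".fasta"   # assumed naming scheme
            printf "" > file; close(file)     # create or truncate the output file
            for (i = 3; i <= NF; i++)
                files[$i] = files[$i] file "\n"
            next
        }
        # Second file (sequences.fasta): a header line selects the target files.
        /^>/ {
            id = substr($1, 2)
            n = (id in files) ? split(files[id], targets, "\n") - 1 : 0
        }
        # Append the header and any following non-blank lines to every target.
        n > 0 && NF {
            for (j = 1; j <= n; j++) {
                print >> (targets[j])
                close(targets[j])             # avoid running out of open files
            }
        }
    ' "$clusters" "$fasta"
}
```

Called as split_clusters clusters.txt sequences.fasta, this should write Cluster_10.fasta, Cluster_15.fasta and so on into the current directory, with sequences in the order they appear in sequences.fasta. IDs are compared as exact strings rather than as regular expressions, so WP_1.1 cannot accidentally match WP_1.10. Closing the file after every write keeps the number of open files low at the cost of some speed.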
