Finding correct lines out of a file and using them to create new files [closed]

I am using Bash, which I am very unfamiliar with. I have 2 source files.

One of them (name: clusters.txt) looks like:

Cluster 10: WP_1.2 WP_1.1 WP_1.4 ......

Cluster 15: WP_2.1 WP_1.4 WP_1.3 ......

In short, every line corresponds to a cluster which has a sequence of IDs (each ID looks like XY_123.4).

The second file (name: sequences.fasta) looks like:

>WP_1.1 some dummy text...

>WP_1.2 some more text...

>WP_1.3 some more text...

>WP_1.4 some more text...

>WP_2.1 some more text...

>WP_2.2 some more text...

In short, every line that starts with a “>” sign is a sequence.

What I need to do is take every “sequence” of a “cluster” and create a separate fasta file with them. For example, for Cluster 10, I need to create:

>WP_1.1 some dummy text...

>WP_1.2 some more text...

>WP_1.4 some more text...

I tried using grep in a loop, but that was extremely costly on resources and too slow to be usable.

All help is much appreciated. Thanks in advance.

  • Do you need to do it in Bash? What language are you familiar with?

  • Please update the question with the names and contents of all the files you wish to create (making sure they match the provided inputs).

  • Please update the question with the (grep) code you’ve tried. You’ve tagged the question with r; what R code have you attempted?

  • Do you want a “Bash” or an “R” solution?

  • “that was extremely costly”: could it be that you implemented it inefficiently? How can we improve your solution if you don’t show us how you solved it?

You can try this awk script

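#! /usr/bin/awk -f

BEGIN { FS=": " }
{
    gsub(" ", "_", $1)                           # output file name, e.g. Cluster_10
    gsub(/\./, "[.]", $2); gsub(" ", "|", $2)     # IDs -> regex alternation with literal dots
    system("/usr/bin/awk '/^>(" $2 ")( |$)/' sequences.fasta >" $1)   # match whole IDs on header lines
}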

This assumes clusters.txt has a reasonable number of lines, because the script starts a new awk process, and rereads sequences.fasta, for each one. Save it as myscript and run it like this:

chmod +x myscript
./myscript clusters.txt
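
If clusters.txt is large, starting one awk per cluster may become slow. As a rough sketch of an alternative (not tested on real data), a single awk process can read clusters.txt first, remember which clusters each ID belongs to, and then stream sequences.fasta once. The sample inputs, the sequence lines (AAAA, CCCC, ...) and the Cluster_N.fasta output names are assumptions made up for illustration; the question only shows header lines and does not name the output files.

```shell
#!/bin/sh
# Sketch: build one FASTA file per cluster with a single awk process.
# Sample inputs are written to a scratch directory so the example is self-contained.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > clusters.txt <<'EOF'
Cluster 10: WP_1.2 WP_1.1 WP_1.4
Cluster 15: WP_2.1 WP_1.4 WP_1.3
EOF

# The sequence lines below are invented sample data.
cat > sequences.fasta <<'EOF'
>WP_1.1 some dummy text...
AAAA
>WP_1.2 some more text...
CCCC
>WP_1.3 some more text...
GGGG
>WP_1.4 some more text...
TTTT
>WP_2.1 some more text...
ACGT
>WP_2.2 some more text...
TGCA
EOF

awk '
    # First file (clusters.txt): map each ID to the output file(s) it belongs to.
    NR == FNR {
        split($0, parts, ": ")
        out = parts[1]
        gsub(/ /, "_", out)
        out = out ".fasta"                 # assumed naming: Cluster_10.fasta
        n = split(parts[2], ids, " ")
        for (i = 1; i <= n; i++)
            files[ids[i]] = files[ids[i]] " " out
        next
    }
    # Second file (sequences.fasta): a header line selects the current targets.
    /^>/ {
        id = substr($1, 2)                 # ">WP_1.1" -> "WP_1.1"
        nt = (id in files) ? split(files[id], targets, " ") : 0
    }
    # Header and sequence lines alike go to every target of the current record.
    {
        for (i = 1; i <= nt; i++)
            print > (targets[i])
    }
' clusters.txt sequences.fasta

cat Cluster_10.fasta
```

Records land in each output file in the order they appear in sequences.fasta, and an ID listed in several clusters is copied to each of them. One caveat: this keeps one output file open per cluster, so with thousands of clusters some awk implementations may hit the open-file limit; gawk is documented to work around that, while for other awks you may need to append with >> and call close() after each write.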
