BOL: bash script to extract sequence by ids !

BioScripts
Abhimanyu Singh

Use a Perl one-liner, grep, and seqtk subseq to extract the FASTA records whose header line carries one of the desired gene ids:

# Create test input:

cat > in.fasta <<EOF
>BGI_novel_T016697 Solyc03g033550.3.1
CTGACGTATACAATTAAGCCGCG
>BGI_novel_T016313 Solyc03g025570.2.1
TTCAAGTGTTAGTTTCACATCAT
>BGI_novel_T018109 Solyc03g080075.1.1
GCAAGGGAAAGAAGTATTACTAG
>BGI_novel_T016817 BGI_novel_G001220
GCCCAAGTCATAGGTAGTGCCTG
>BGI_novel_T016141 Solyc03g007600.3.1
ACGTACGTACGTACGTACGTACG
EOF

cat > gene_ids.txt <<EOF
Solyc03g033550.3.1
Solyc03g080075.1.1
Solyc00g256710.2.1
Solyc01g010890.3.1
EOF

# Extract ids and gene ids into a tsv file:
perl -lne '@f = /^>(\S+)\s+(\S+)/ and print join "\t", @f;' in.fasta > ids_gene_ids.tsv

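# Select ids that correspond to the desired gene ids.
# -F matches each gene id as a literal string (so "." is not a regex wildcard); -w rejects partial-word matches:
grep -F -w -f gene_ids.txt ids_gene_ids.tsv | cut -f1 > ids.selected.txt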

# Extract the FASTA sequences that correspond to the desired gene ids:
seqtk subseq in.fasta ids.selected.txt > out.fasta

cat out.fasta
Output:

>BGI_novel_T016697 Solyc03g033550.3.1
CTGACGTATACAATTAAGCCGCG
>BGI_novel_T018109 Solyc03g080075.1.1
GCAAGGGAAAGAAGTATTACTAG
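
Two of the four gene ids in gene_ids.txt have no matching record in in.fasta, and the pipeline above drops them silently. A minimal sketch of a check that lists such ids is given below. The output file name gene_ids.missing.txt is an assumption made for illustration, and the two input files are recreated here only so the snippet runs on its own. For the test data it should list Solyc00g256710.2.1 and Solyc01g010890.3.1.

```shell
# Stand-in for ids_gene_ids.tsv as produced by the Perl one-liner above.
printf '%s\t%s\n' \
  BGI_novel_T016697 Solyc03g033550.3.1 \
  BGI_novel_T016313 Solyc03g025570.2.1 \
  BGI_novel_T018109 Solyc03g080075.1.1 \
  BGI_novel_T016817 BGI_novel_G001220 \
  BGI_novel_T016141 Solyc03g007600.3.1 > ids_gene_ids.tsv

# Stand-in for the list of requested gene ids.
printf '%s\n' \
  Solyc03g033550.3.1 \
  Solyc03g080075.1.1 \
  Solyc00g256710.2.1 \
  Solyc01g010890.3.1 > gene_ids.txt

# First file: remember every gene id seen in column 2 of the tsv.
# Second file: print each requested gene id that was never seen.
# gene_ids.missing.txt is a hypothetical output name.
awk -F'\t' 'FILENAME == ARGV[1] { seen[$2] = 1; next }
            !($1 in seen)' ids_gene_ids.tsv gene_ids.txt > gene_ids.missing.txt

cat gene_ids.missing.txt
```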

Note that seqtk can be installed, for example, using conda.
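
Where installing seqtk is not convenient, the same selection can probably be done with awk alone. The sketch below is an alternative to the pipeline above, not the original method. It assumes the gene id is always the second whitespace-separated field of the header line, and it compares that field exactly rather than by pattern. The output name out_awk.fasta is made up for illustration, and a shortened copy of the test input is recreated so the snippet is self-contained.

```shell
# Recreate a small test input so this sketch runs on its own.
cat > in.fasta <<'EOF'
>BGI_novel_T016697 Solyc03g033550.3.1
CTGACGTATACAATTAAGCCGCG
>BGI_novel_T016313 Solyc03g025570.2.1
TTCAAGTGTTAGTTTCACATCAT
>BGI_novel_T018109 Solyc03g080075.1.1
GCAAGGGAAAGAAGTATTACTAG
EOF

cat > gene_ids.txt <<'EOF'
Solyc03g033550.3.1
Solyc03g080075.1.1
Solyc00g256710.2.1
EOF

# First file (gene_ids.txt): store each non-empty gene id as a key.
# Second file (in.fasta): on each header line, set keep to 1 if the
# second field is a wanted gene id, else 0. The bare "keep" pattern
# then prints the header and the sequence lines that follow it.
# out_awk.fasta is a hypothetical output name.
awk 'FILENAME == ARGV[1] { if ($1 != "") wanted[$1] = 1; next }
     /^>/                { keep = ($2 in wanted) }
     keep' gene_ids.txt in.fasta > out_awk.fasta

cat out_awk.fasta
```

Because the keep flag stays set until the next header line, this version also handles sequences that are wrapped over several lines.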
