Question: Extract the numeric values from a multi-FASTA sequence file.

Alok Prajapati
3883 days ago

I have a multi-FASTA sequence file (~12 GB) whose header lines carry coordinate information:

> chr13-/454-4567654 (2347645)
AGTGACTGACTGAAGTGACTGA

> chr14-/524-8367954 (6535786)
AGTGACTGAAGTGACTGA

Each FASTA header always contains one or more continuous runs of digits, for example 13, 454, 4567654 and 2347645 in the first header above. Everything else is either letters or special characters. How can I extract these numbers and store them in an array?

Answers

Hi Alok,

You can try the following Perl script on your dataset. It extracts the numeric values from each line and prints them separated by tabs, so the chromosome number, the coordinates and the value in parentheses end up in separate columns of outFile.

usage : perl extractNumber.pl infileName > outFile

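use strict;
use warnings;

# The input FASTA file is the first command-line argument.
my $filename = shift @ARGV or die "Usage: perl extractNumber.pl infileName > outFile\n";
open(my $fh, '<', $filename) or die "Could not open file '$filename': $!";

while (my $row = <$fh>) {
    chomp $row;
    next unless $row =~ /^>/;          # only header lines carry the coordinates
    my @all_nums = $row =~ /(\d+)/g;   # e.g. (13, 454, 4567654, 2347645)
    print join("\t", @all_nums), "\n"; # tab-separated, no trailing tab
}
close $fh;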
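
The question asks for the numbers in an array rather than on standard output, so here is a minimal sketch of that variant. The helper name extract_numbers and the @records array are made up for illustration, and the sample lines are copied from the question instead of being read from the 12 GB file.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of digits found in one header line.
sub extract_numbers {
    my ($header) = @_;
    return $header =~ /(\d+)/g;
}

# Sample lines taken from the question; a real run would read the file line by line.
my @lines = (
    '> chr13-/454-4567654 (2347645)',
    'AGTGACTGACTGAAGTGACTGA',
    '> chr14-/524-8367954 (6535786)',
    'AGTGACTGAAGTGACTGA',
);

my @records;    # one array reference per header line
for my $line (@lines) {
    next unless $line =~ /^>/;    # skip sequence lines
    push @records, [ extract_numbers($line) ];
}

# Print the first and the last number of each header, tab-separated.
printf "%s\t%s\n", $_->[0], $_->[-1] for @records;
```

Each element of @records is an array reference, so $records[0][1] is the second number of the first header. Only the header numbers are kept in memory, not the sequences.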

Wish you all the best for your research work. 

Thanks

 


Hi Alok,

You can try this Perl one-liner. It prints the numbers from each header line, separated by tabs:

perl -nle 'next unless /^>/; print join("\t", /\d+/g);' infile

Thanks