BOL: Question: Extract the numeric values from the multiple FASTA sequence file.

Alok Prajapati
4019 days ago

I have a multi-FASTA sequence file (~12 GB) whose header lines carry coordinate information:

> chr13-/454-4567654 (2347645)
AGTGACTGACTGAAGTGACTGA

> chr14-/524-8367954 (6535786)
AGTGACTGAAGTGACTGA

Each FASTA header line always contains one or more continuous runs of digits, like 13-/454-4567654 (2347645) in this case. Everything else is either letters or other special characters. How can I extract the numbers and store them in an array?

Answers

Hi Alok,

You can try the following Perl script on your dataset. It extracts the numeric values and writes the chromosome, start and end coordinates to outFile, separated by tabs.

Usage: perl extractNumber.pl infileName > outFile

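use strict;
use warnings;

# The input file is the first command-line argument.
die "Usage: perl extractNumber.pl infileName > outFile\n" unless @ARGV;
my $filename = $ARGV[0];
open(my $fh, '<:encoding(UTF-8)', $filename) or die "Could not open file '$filename': $!";

while (my $row = <$fh>) {
    chomp $row;
    next unless $row =~ /^>/;         # skip sequence lines; only headers carry numbers
    my @all_nums = $row =~ /(\d+)/g;  # e.g. (13, 454, 4567654, 2347645)
    print join("\t", @all_nums), "\n";
}
close($fh);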
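
The script above prints the numbers. If you want to keep them in an array, as you asked, a minimal sketch could look like the one below. The sub name header_numbers and the in-memory sample lines are only for illustration (they are my assumptions, not part of your data); for the real file you would feed it lines read from the filehandle, one at a time.

```perl
use strict;
use warnings;

# Return every run of digits found in a FASTA header line, in order.
# Returns an empty list for lines that are not headers.
sub header_numbers {
    my ($line) = @_;
    return () unless $line =~ /^>/;
    return $line =~ /(\d+)/g;
}

# Hypothetical in-memory sample; for a real file, read line by line
# from a filehandle instead.
my @lines = (
    '> chr13-/454-4567654 (2347645)',
    'AGTGACTGACTGAAGTGACTGA',
    '> chr14-/524-8367954 (6535786)',
    'AGTGACTGAAGTGACTGA',
);

my @records;    # one array reference per header line
for my $line (@lines) {
    my @nums = header_numbers($line);
    push @records, \@nums if @nums;
}

# $records[0] now holds (13, 454, 4567654, 2347645)
print join("\t", @$_), "\n" for @records;
```

Only the header lines are kept in memory, so this should stay manageable even for a large file, as long as the number of records is not extreme.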

Wish you all the best for your research work.

Thanks

Jitendra Narayan 4018 days ago

Hi Alok,

You can try this Perl one-liner, which prints the runs of digits from each header line, separated by tabs:

perl -nle 'print join("\t", /\d+/g) if /^>/' infile

Thanks

Rahul Nayak 4012 days ago
