Question: Unable to index fasta file with samtools

Radha Agarkar
2454 days ago

I am unable to index a FASTA file with samtools. It returns a "different line length in sequence" error message. How can I fix it?

nc@radha[Downloads] samtools faidx dedup.genome.scf.fasta_assembly.fna []
[fai_build_core] different line length in sequence 'jcf7180000001219'.
Could not build fai index dedup.genome.scf.fasta_assembly.fna.fai
nc@radha[Downloads] more dedup.genome.scf.fasta_assembly.fna| grep ">" | wc -l
336

Answers
1

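The sequence lines in your FASTA file are of unequal length. samtools faidx expects every sequence line within a record, except the last one, to have the same length. SeqKit can rewrap the sequences to a fixed line width, which fixes this. Running seqkit without arguments prints its help; the last command below does the rewrapping: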

urbe@urbo214b[Tools] ~/Tools/seqkit/seqkit/seqkit
SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Version: 0.4.5

Author: Wei Shen <shenwei356@gmail.com>

Documents : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite: https://doi.org/10.1371/journal.pone.0163962

Usage:
seqkit [command]

Available Commands:
  common      find common sequences of multiple files by id/name/sequence
  faidx       create FASTA index file
  fq2fa       covert FASTQ to FASTA
  fx2tab      covert FASTA/Q to tabular format (with length/GC content/GC skew)
  grep        search sequences by pattern(s) of name or sequence motifs
  head        print first N FASTA/Q records
  locate      locate subsequences/motifs
  rename      rename duplicated IDs
  replace     replace name/sequence by regular expression
  rmdup       remove duplicated sequences by id/name/sequence
  sample      sample sequences by number or proportion
  seq         transform sequences (revserse, complement, extract ID...)
  shuffle     shuffle sequences
  sliding     sliding sequences, circular genome supported
  sort        sort sequences by id/name/sequence/length
  split       split sequences into files by id/seq region/size/parts
  stat        simple statistics of FASTA files
  subseq      get subsequences by region/gtf/bed, including flanking sequences
  tab2fx      covert tabular format to FASTA/Q format
  version     print version information and check for update

Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^([^\s]+)\s?")
  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

urbe@urbo214b[Tools]  ~/Tools/seqkit/seqkit/seqkit seq -w 70 seq.fna > seq_corrected.fna
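
To see which records are affected before or after rewrapping, a small awk check can help. This is only a sketch, assuming a POSIX shell and awk; check_fasta_wrap is a hypothetical helper name, not part of samtools or SeqKit. It approximates the faidx rule (all sequence lines in a record have the same length, except a possibly shorter last line) and does not reproduce samtools' own check exactly.

```shell
# Print the ID of every FASTA record whose sequence lines are unevenly wrapped.
# Hypothetical helper for illustration; usage: check_fasta_wrap file.fna
check_fasta_wrap() {
    awk '
        /^>/ {                          # a header line starts a new record
            name = substr($1, 2)        # ID: text after ">" up to the first whitespace
            width = 0; short = 0; bad = 0
            next
        }
        bad { next }                    # this record has already been reported
        {
            len = length($0)
            if (width == 0) {
                width = len             # the first sequence line sets the expected width
            } else if (short || len > width) {
                bad = 1                 # a line follows a short line, or is too long
                print name
            }
            if (len < width) short = 1  # a short line is only allowed as the last line
        }
    ' "$1"
}
```

In the question's case, check_fasta_wrap dedup.genome.scf.fasta_assembly.fna would be expected to list at least 'jcf7180000001219'. After rewrapping with seqkit seq -w 70 as shown above, the check should print nothing for the corrected file, and samtools faidx seq_corrected.fna should then be able to build the index.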