I am unable to index a FASTA file with samtools. It returns a "different line length in sequence" error message. How can I fix it?
nc@radha[Downloads] samtools faidx dedup.genome.scf.fasta_assembly.fna
[fai_build_core] different line length in sequence 'jcf7180000001219'.
Could not build fai index dedup.genome.scf.fasta_assembly.fna.fai
nc@radha[Downloads] more dedup.genome.scf.fasta_assembly.fna| grep ">" | wc -l
336
Answers
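The sequence lines in your FASTA file are of unequal lengths. samtools faidx needs every sequence line within a record, except the last one, to have the same length. SeqKit is a good tool for this kind of correction: it can rewrite the file with a fixed line width (see the -w, --line-width flag in its help output):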
urbe@urbo214b[Tools] ~/Tools/seqkit/seqkit/seqkit
SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation
Version: 0.4.5
Author: Wei Shen <shenwei356@gmail.com>
Documents : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite: https://doi.org/10.1371/journal.pone.0163962
Usage:
seqkit [command]
Available Commands:
  common      find common sequences of multiple files by id/name/sequence
  faidx       create FASTA index file
  fq2fa       covert FASTQ to FASTA
  fx2tab      covert FASTA/Q to tabular format (with length/GC content/GC skew)
  grep        search sequences by pattern(s) of name or sequence motifs
  head        print first N FASTA/Q records
  locate      locate subsequences/motifs
  rename      rename duplicated IDs
  replace     replace name/sequence by regular expression
  rmdup       remove duplicated sequences by id/name/sequence
  sample      sample sequences by number or proportion
  seq         transform sequences (revserse, complement, extract ID...)
  shuffle     shuffle sequences
  sliding     sliding sequences, circular genome supported
  sort        sort sequences by id/name/sequence/length
  split       split sequences into files by id/seq region/size/parts
  stat        simple statistics of FASTA files
  subseq      get subsequences by region/gtf/bed, including flanking sequences
  tab2fx      covert tabular format to FASTA/Q format
  version     print version information and check for update
Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^([^\s]+)\s?")
  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)
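The help output above only lists the options, so here is a rough sketch of the actual fix: rewrite every record with one fixed line width, then index the rewritten file. With SeqKit, the seq command together with the -w flag listed above should do this (something like `seqkit seq -w 60 input.fna > output.fna`; I have not checked this against version 0.4.5). If SeqKit is not installed, the same rewrapping can be done with plain awk. The function name `rewrap_fasta` and the output file name `rewrapped.fna` below are made up for illustration.

```shell
# Rewrap a FASTA file so that every sequence line has the same width.
# Hypothetical helper, not part of samtools or SeqKit.
#   $1 = input FASTA file
#   $2 = line width (optional, defaults to 60)
rewrap_fasta() {
    awk -v w="${2:-60}" '
        # Print the sequence collected so far in chunks of w characters.
        function emit(   i) {
            for (i = 1; i <= length(seq); i += w) print substr(seq, i, w)
            seq = ""
        }
        { sub(/\r$/, "") }                    # drop Windows line endings
        /^>/ { emit(); print; next }          # header: flush previous record, keep header as is
        { gsub(/[ \t]/, ""); seq = seq $0 }   # sequence line: strip blanks and accumulate
        END { emit() }                        # flush the last record
    ' "$1"
}
```

Usage would look like this, writing to a new file rather than overwriting the original:

    rewrap_fasta dedup.genome.scf.fasta_assembly.fna 60 > rewrapped.fna
    samtools faidx rewrapped.fna

Only the line breaks inside the sequences change; the headers and the bases themselves are left untouched, so the count of 336 records should stay the same.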