LAMSA: fast split read alignment with long approximate matches

Jit — Tue, 15 May 2018 04:44:42 -0500

LAMSA (Long Approximate Matches-based Split Aligner) is a novel split alignment approach that is fast and handles structural variation (SV) events well. It is well suited to aligning long reads (thousands of base pairs or longer). LAMSA takes advantage of the rareness of SVs to implement a specifically designed two-step strategy. First, LAMSA splits the read into relatively long fragments and aligns them co-linearly, which resolves small variations and sequencing errors and mitigates the effect of repeats. The fragment alignments are then fed to a sparse dynamic programming (SDP)-based split alignment step that handles large or non-co-linear variants. We benchmarked LAMSA on simulated and real datasets with various read lengths and sequencing error rates; the results demonstrate that it is substantially faster than the state-of-the-art long read aligners while also handling various categories of SVs well. LAMSA is open source and free for non-commercial use. LAMSA is mainly designed by Bo Liu & Yan Gao and developed by Yan Gao at the Center for Bioinformatics, Harbin Institute of Technology, China.
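The two-step idea can be illustrated with a small Python sketch. This is not LAMSA's code: the function names are hypothetical, the fragment spacing is an assumed reading of the --seed-len and --seed-inv options (a gap of seed-inv bases between consecutive fragments), and a simple greedy grouping by diagonal stands in for the SDP step described above.

```python
def seed_starts(read_len, seed_len=50, seed_inv=100):
    """Start offsets of seeding fragments along a read.

    Assumption: seed_inv is the gap between the end of one fragment and
    the start of the next, so starts are seed_len + seed_inv apart.
    """
    step = seed_len + seed_inv
    return list(range(0, read_len - seed_len + 1, step))


def split_colinear(hits, max_shift=100):
    """Group fragment hits into co-linear segments.

    hits: list of (read_pos, ref_pos) pairs sorted by read_pos.
    A hit joins the current segment when it lies further along the
    reference and its diagonal (ref_pos - read_pos) moved by less than
    max_shift; otherwise it opens a new segment (a candidate split).
    This greedy rule is a stand-in for SDP, not the real algorithm.
    """
    segments = []
    for read_pos, ref_pos in hits:
        if segments:
            last_read, last_ref = segments[-1][-1]
            shift = abs((ref_pos - read_pos) - (last_ref - last_read))
            if shift < max_shift and ref_pos > last_ref:
                segments[-1].append((read_pos, ref_pos))
                continue
        segments.append([(read_pos, ref_pos)])
    return segments


# Toy read of 1000 bp whose second half maps 500 bp further downstream,
# as it would if the read spanned a 500 bp deletion.
starts = seed_starts(1000)
hits = [(s, 10000 + s + (500 if s > 450 else 0)) for s in starts]
segments = split_colinear(hits)
for seg in segments:
    print(seg[0], "...", seg[-1], "fragments:", len(seg))
```

Because SVs are rare, most fragments fall on one diagonal and are handled together; only the jump between the two segments needs the more expensive split-alignment treatment.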

Address of the bookmark: https://github.com/hitbc/LAMSA

Comment by Jit

Jit — Fri, 13 Jul 2018 17:49:52 -0500

➜ LAMSA git:(master) ./lamsa aln

Usage: lamsa aln [options]

Algorithm options:

-t --thread [INT] Number of threads. [1]
-l --seed-len [INT] Length of seeding fragments. [50]
-i --seed-inv [INT] Distance between neighboring seeding fragments. [100]
-p --max-loci [INT] Maximum allowed number of seeding fragments' hits. [200]
-V --SV-len [INT] Expected maximum length of SV. [10000]
-v --ovlp-rat [FLOAT] Minimum overlapping ratio to cluster two skeletons or alignment records. [0.70]
-s --max-skel [INT] Maximum number of skeletons that are reserved in a cluster. [10]
-R --max-reg [INT] Maximum allowed length of unaligned read part to trigger a bwt-based query. [300]
-k --bwt-kmer [INT] Length of BWT-seed. [19]
-f --fastest Use GEM-mapper's fastest mode(--fast-mapping=0). [false]

Scoring options:

-m --match-sc [INT] Match score for SW-alignment. [1]
-M --mis-pen [INT] Mismatch penalty for SW-alignment. [3]
-O --open-pen [INT(,INT,INT,INT)] Gap open penalty for SW-alignment (end2end-global: insertion, deletion, one-end-extend: insertion, deletion). [5(,5,5,5)]
-E --ext-pen [INT(,INT,INT,INT)] Gap extension penalty for SW-alignment (end2end-global: insertion, deletion, one-end-extend: insertion, deletion). [2(,2,2,2)]
-w --band-width [INT] Band width for banded-SW. [10]
-b --end-bonus [INT] Penalty for end-clipping. [5]

Read options:

-e --err-rate [FLOAT] Maximum error rate of read. [0.04]
-d --diff-rate [FLOAT] Maximum length difference ratio between read and reference. [0.04]
-x --mis-rate [FLOAT] Maximum error rate of mismatch within reads. [0.04]

-T --read-type [STR] Specifiy the type of reads and set multiple parameters unless overriden.
[null] (Illumina Moleculo)
pacbio (PacBio SMRT): -i25 -l50 -m1 -M1 -O1,1,2,2 -E1,1,1,1 -w200 -b0 -e0.30 -d0.30
ont2d (Oxford Nanopore): -i25 -l50 -m1 -M1 -O1,1,1,1 -E1,1,1,1 -w100 -b0 -e0.25 -d0.10

Output options:

-r --max-out [INT] Maximum number of output records for a specific split read region. [10]
-g --gap-split [INT] Minimum length of gap that causes a split-alignment. [100]
-S --soft-clip Use soft clipping for supplementary alignment. [false]
-C --comment Append FASTQ comment to SAM output. [false]
-o --output [STR] Output file (SAM format). [stdout]

-h --help Print this short usage.
-H --HELP Print a detailed usage.
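To see what a --read-type preset changes, the two preset strings listed under -T can be expanded into a flag-to-value table. The Python sketch below only parses the strings exactly as printed in the usage text; the helper name expand_preset and the overrides argument are hypothetical, and the rule that explicit values win over the preset is an assumption based on the "unless overriden" wording.

```python
import re

# Preset strings copied verbatim from the usage text above.
PRESETS = {
    "pacbio": "-i25 -l50 -m1 -M1 -O1,1,2,2 -E1,1,1,1 -w200 -b0 -e0.30 -d0.30",
    "ont2d": "-i25 -l50 -m1 -M1 -O1,1,1,1 -E1,1,1,1 -w100 -b0 -e0.25 -d0.10",
}


def expand_preset(name, overrides=None):
    """Return {flag letter: value string} for a read-type preset.

    Assumption: explicitly given values take priority over the preset.
    """
    params = {}
    for token in PRESETS[name].split():
        match = re.fullmatch(r"-([A-Za-z])(.+)", token)
        params[match.group(1)] = match.group(2)
    params.update(overrides or {})
    return params


pacbio = expand_preset("pacbio")
ont_narrow = expand_preset("ont2d", {"w": "50"})
print(pacbio)
print(ont_narrow)
```

Compared with the defaults, both presets shrink the seed interval (-i) from 100 to 25, widen the alignment band (-w), and raise the tolerated error rate (-e), which matches the higher error rates of PacBio and Oxford Nanopore reads.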
