BOL: Related items

ARC: pipeline which facilitates iterative, reference guided de novo assemblies

Jit — Thu, 26 Jul 2018 09:20:26 -0500

ARC is a pipeline which facilitates iterative, reference guided de novo assemblies with the intent of:

Reducing time in analysis and increasing accuracy of results by only considering those reads which should assemble together.
Reducing/removing reference bias as compared to mapping based approaches.

The software is designed to work in situations where a whole-genome assembly is not the objective, but rather when the researcher wishes to assemble discrete 'targets' contained within next-generation shotgun sequence data. ARC simplifies the traditionally difficult problem of assembly by breaking the reads into small, manageable subsets which can then be assembled quickly and efficiently in parallel. Applications include those in which the researcher wishes to de novo assemble specific content and a set of semi-similar reference targets is available to initialize the assembly process.

https://ibest.github.io/ARC/

Address of the bookmark: https://ibest.github.io/ARC/

MITOS: improved de novo metazoan mitochondrial genome annotation

Jit — Fri, 26 Oct 2018 08:25:39 -0500

MITOS allows automatic annotation of metazoan mitochondrial genomes. It is a pipeline designed to compute a consistent de novo annotation of mitogenomic sequences. The software allows for systematic error screening, standardisation of gene names and gene boundary designations, and anticodon labelling of tRNAs, and it provides the means to assess the validity of a gene assignment.

Address of the bookmark: http://mitos.bioinf.uni-leipzig.de/index.py

Integrative Meta-Assembly Pipeline (IMAP): Chromosome-level genome assembler combining multiple de novo assemblies

Jit — Sat, 31 Aug 2019 11:30:41 -0500

Chromosome-level genome assembler combining multiple de novo assemblies

https://github.com/jkimlab/IMAP

Address of the bookmark: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0221858

nanofilt: Filtering and trimming of long read sequencing data

Jit — Mon, 30 Jul 2018 12:01:52 -0500

Filtering on quality and/or read length, and optional trimming after passing filters.
Reads from stdin, writes to stdout.

Intended to be used:

directly after fastq extraction
prior to mapping
in a stream between extraction and mapping
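
The streaming use described above can be tried on a tiny synthetic FASTQ. This is a minimal sketch, not the author's workflow: the file names and the two reads are made up, and the thresholds are chosen only to fit the toy data. If NanoFilt is not installed, the sketch falls back to an awk stand-in that mimics the length filter only (it does no quality filtering and is not part of NanoFilt).

```shell
# Synthetic two-read FASTQ (made-up reads): one 4-base read, one 8-base read.
printf '@short\nACGT\n+\nIIII\n@long\nACGTACGT\n+\nIIIIIIII\n' > demo.fastq

if command -v NanoFilt >/dev/null 2>&1; then
  # -q: minimum average read quality; -l: minimum read length.
  # NanoFilt reads FASTQ on stdin and writes passing reads to stdout.
  NanoFilt -q 10 -l 6 < demo.fastq > demo.filtered.fastq
else
  # Stand-in when NanoFilt is absent: length filter only, no quality check.
  awk 'NR % 4 == 1 {h = $0}
       NR % 4 == 2 {s = $0}
       NR % 4 == 3 {p = $0}
       NR % 4 == 0 {if (length(s) >= 6) print h "\n" s "\n" p "\n" $0}' \
      demo.fastq > demo.filtered.fastq
fi
```

On real data the same call sits in the middle of a pipe, for example `gunzip -c reads.fastq.gz | NanoFilt -q 10 -l 500 | gzip > filtered.fastq.gz` (hypothetical file names).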

https://github.com/wdecoster/nanofilt

Address of the bookmark: https://github.com/wdecoster/nanofilt

LncPipe: A Nextflow-based pipeline for comprehensive analyses of long non-coding RNAs from RNA-seq datasets

LEGE — Fri, 17 Sep 2021 01:57:02 -0500

The pipeline was developed on the popular workflow framework Nextflow and is composed of four core procedures: read alignment, assembly, identification and quantification. It offers several distinctive features, such as a well-designed lncRNA annotation strategy, optimized computational efficiency, diversified classification and an interactive analysis report. LncPipe gives users additional control: interrupting the pipeline, resetting parameters from the command line, modifying the main script directly and resuming analysis from a previous checkpoint.

Ref https://www.lncrnablog.com/lncpipe-a-nextflow-based-pipeline-for-identification-and-analysis-of-long-non-coding-rnas-from-rna-seq-data/

Address of the bookmark: https://github.com/likelet/LncPipe

TULIP - The Uncorrected Long read Integration Pipeline

Jit — Thu, 23 Nov 2017 09:30:01 -0600

Running TULIP (The Uncorrected Long-read Integration Process), version 0.4 late 2016 (European eel)

TULIP currently consists of two Perl scripts, tulipseed.perl and tulipbulb.perl. These are very much intended as prototypes, and additional components and/or implementations are likely to follow.
Tulipseed takes as input alignment files of long reads to sparse short seeds, and outputs a graph and scaffold structures. Tulipbulb adds long read sequencing data to these.

https://github.com/Generade-nl/TULIP

Address of the bookmark: https://github.com/Generade-nl/TULIP

TULIP - The Uncorrected Long read Integration Pipeline

Jit — Tue, 15 May 2018 09:06:37 -0500

TULIP currently consists of two Perl scripts, tulipseed.perl and tulipbulb.perl. These are very much intended as prototypes, and additional components and/or implementations are likely to follow. Tulipseed takes as input alignment files of long reads to sparse short seeds, and outputs a graph and scaffold structures.

Address of the bookmark: https://github.com/Generade-nl/TULIP

Run the miniasm assembler on nanopore reads!

Jit — Mon, 18 Dec 2017 04:07:50 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.

Find the details of repeats within the reads:

fq2fa ONT_A.fastq ONT_A.fasta

minimap2 -xava-ont ONT_A.fasta ONT_A.fasta -t10 -X > AONT.paf

awk '{if($1==$6){print}}' AONT.paf > AONTself.paf

awk '$5=="-"' AONTself.paf | awk '{print $1}'| sort|uniq > invertedrepeat.list
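
The two awk filters above rely on the PAF column layout: column 1 is the query name, column 5 the relative strand and column 6 the target name. A minimal sketch on a synthetic PAF file shows what each step keeps; the read names and coordinates below are made up for illustration.

```shell
# Synthetic PAF (tab-separated); read names and coordinates are made up.
printf 'readA\t1000\t0\t200\t-\treadA\t1000\t700\t900\t180\t200\t60\n'    > demo.paf
printf 'readA\t1000\t0\t500\t+\treadB\t1200\t100\t600\t450\t500\t60\n'   >> demo.paf
printf 'readB\t1200\t50\t300\t+\treadB\t1200\t800\t1050\t230\t250\t60\n' >> demo.paf

# Step 1: keep only hits where query and target are the same read (self hits).
awk '$1 == $6' demo.paf > demo_self.paf

# Step 2: of those, keep minus-strand hits and list each read name once.
awk '$5 == "-" {print $1}' demo_self.paf | sort -u > demo_inverted.list

cat demo_inverted.list
```

Here readA carries a self hit on the minus strand, so it is the only name written to the list; readB has a same-strand self hit (a direct repeat) and is kept in the self-hit file only.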

Generate a few palindrome and repeat plots (highlighting only repeats larger than 10, 20 and 30 kb):

minidot -f 5 -m 30000 AONTself.paf > AONTself30000.eps
sed 's/_template_pass_FAH31515//' AONTself30000.eps > AONTself30000final.eps

minidot -f 5 -m 20000 AONTself.paf > AONTself20000.eps
sed 's/_template_pass_FAH31515//' AONTself20000.eps > AONTself20000final.eps

minidot -f 5 -m 10000 AONTself.paf > AONTself10000.eps
sed 's/_template_pass_FAH31515//' AONTself10000.eps > AONTself10000final.eps
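
The three plot-and-rename pairs above differ only in the minimum match length, so they can also be written as a loop. The sketch below is a dry run under that assumption: it writes the same six commands to a hypothetical script, plot_commands.sh, for review instead of executing minidot.

```shell
# Dry run: collect the plotting commands in a script instead of running them.
: > plot_commands.sh
for min_len in 10000 20000 30000; do
  prefix="AONTself${min_len}"
  # Dot plot showing only matches of at least min_len bases.
  printf 'minidot -f 5 -m %s AONTself.paf > %s.eps\n' \
    "$min_len" "$prefix" >> plot_commands.sh
  # Strip the run-specific suffix from the read labels in the plot.
  printf "sed 's/_template_pass_FAH31515//' %s.eps > %sfinal.eps\n" \
    "$prefix" "$prefix" >> plot_commands.sh
done
cat plot_commands.sh
```

Once the printed commands look right, `sh plot_commands.sh` would run them, provided minidot is on the PATH and AONTself.paf exists.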

Assemble with miniasm:

miniasm -f ONT_A.fasta AONT.paf > AONT.gfa
grep '^S' AONT.gfa |awk '{print ">"$2"\n"$3}' > AONT_miniasm.fasta
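
The GFA-to-FASTA step works because each segment (S) line of a GFA file holds the segment name in column 2 and its sequence in column 3. A minimal sketch on a synthetic GFA, with made-up unitig names and sequences, does the same conversion in a single awk call:

```shell
# Synthetic GFA: two segment (S) lines and one link (L) line; all values are made up.
printf 'S\tutg000001l\tACGTACGT\tLN:i:8\n'    > demo.gfa
printf 'S\tutg000002l\tGGCC\tLN:i:4\n'       >> demo.gfa
printf 'L\tutg000001l\t+\tutg000002l\t-\t2M\n' >> demo.gfa

# Keep S lines only; print the name as a FASTA header and the sequence below it.
awk '$1 == "S" {print ">" $2 "\n" $3}' demo.gfa > demo_unitigs.fasta

cat demo_unitigs.fasta
```

Matching on `$1 == "S"` rather than `grep '^S'` avoids picking up any other record type whose line happens to start with the letter S.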

minimap2 -xasm10 AONT_miniasm.fasta AONT_miniasm.fasta -t1 -X > AONT_miniasm.paf

awk '{if($1==$6){print}}' AONT_miniasm.paf > AONT_miniasm_self.paf

minidot -f 5 -m 10000 AONT_miniasm_self.paf > AONT_miniasm_self10000.eps

Enjoy the assembly!

TAREAN: A computational tool for identification and characterization of satellite DNA from unassembled short reads

Surabhi Chaudhary — Tue, 15 May 2018 02:53:11 -0500

TAndem REpeat ANalyzer (TAREAN) is a computational pipeline for unsupervised identification of satellite repeats from unassembled sequence reads. The pipeline uses low-pass whole genome sequence reads and performs graph-based clustering on them. The resulting clusters, representing all types of repeats, are then examined for the presence of circular structures, and putative satellite repeats are reported.

How to use TAREAN:

Install a local instance of the pipeline using its source code, available from the Bitbucket repository.
Use the public Galaxy-based server at https://repeatexplorer-elixir.cerit-sc.cz/. The server is provided within the framework of the ELIXIR CZ project and is maintained by CESNET and CERIT-SC. Simple registration is required to use this service.

Development of TAREAN was supported by the ELIXIR CZ research infrastructure project (MEYS Grant No: LM2015047).

References

Novak, P., Avila Robledillo, L., Koblizkova, A., Vrbova, I., Neumann, P., Macas, J. (2017). TAREAN: a computational tool for identification and characterization of satellite DNA from unassembled short reads. Nucleic Acids Res., doi:10.1093/nar/gkx257

Address of the bookmark: https://bitbucket.org/petrnovak/repex_tarean

BLASR: Mapping single molecule sequencing reads using Basic Local Alignment with Successive Refinement (BLASR): Theory and Application

Jit — Wed, 23 May 2018 06:54:32 -0500

BLASR (Basic Local Alignment with Successive Refinement) maps Single Molecule Sequencing (SMS) reads that are thousands to tens of thousands of bases long, where the divergence between the read and the genome is dominated by insertion and deletion errors.

Here is how I use blasr to align PacBio reads to the contigs (target.fasta). The file “target.fasta.sa” is the suffix array of “target.fasta”, generated by sawriter.

blasr query.fa ./target.fasta -sa ./target.fasta.sa -bestn 40 -maxScore -500 -m 4 -nproc 24 -out target.m4 -maxLCPLength 15

The output format option "-m 4" generates the alignment coordinates. It is not fully documented, but I can explain that to you.
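
As a rough illustration of working with the "-m 4" output, the sketch below filters a synthetic file by percent similarity. It assumes the space-separated column order qName, tName, score, percentSimilarity, qStrand, qStart, qEnd, qLength, tStrand, tStart, tEnd, tLength, mapQV; that order is an assumption here and should be checked against the blasr version in use. The read names, contig name, values and the 85% threshold are all made up.

```shell
# Synthetic m4-style lines (assumed column order, made-up values).
printf 'read1/0_5000 contig7 -21000 91.5 0 10 4990 5000 0 100 5080 6000 254\n'  > demo.m4
printf 'read2/0_3000 contig7 -600 78.2 0 0 400 3000 1 2000 2400 6000 254\n'    >> demo.m4

# Keep alignments with at least 85% similarity (column 4) and report
# read name, contig name and the aligned interval on the contig (columns 10-11).
awk '$4 >= 85 {print $1, $2, $10, $11}' demo.m4 > demo_hits.txt

cat demo_hits.txt
```

Only the first synthetic alignment passes the threshold, so the output lists read1 with its interval on contig7.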

I use a 24-core server with 48 GB of RAM for the alignment. It took about 2 to 3 hours to align 3G of PacBio reads to 10^6 short-read contigs with a mean length of 3.5 kbp.

Address of the bookmark: http://bix.ucsd.edu/projects/blasr/