BOL: Related items

List of bioinformatics workflow management tools !

Rahul Nayak — Sat, 20 Mar 2021 00:15:25 -0500

Here are list of Workflow Managers

BigDataScript – A cross-system scripting language for working with big data pipelines in computer systems of different sizes and capabilities. [ paper-2014 | web ]
Bpipe – A small language for defining pipeline stages and linking them together to make pipelines. [ web ]
Common Workflow Language – a specification for describing analysis workflows and tools that are portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments. [ web ]
Cromwell – A Workflow Management System geared towards scientific workflows. [ web ]
Galaxy – a popular open-source, web-based platform for data intensive biomedical research. Has several features, from data analysis to workflow management to visualization tools. [ paper-2018 | web ]
Nextflow (recommended) – A fluent DSL modelled around the UNIX pipe concept, that simplifies writing parallel and scalable pipelines in a portable manner. [ paper-2018 | web ]
Ruffus – Computation Pipeline library for python widely used in science and bioinformatics. [ paper-2010 | web ]
SeqWare – Hadoop Oozie-based workflow system focused on genomics data analysis in cloud environments. [ paper-2010 | web ]
Snakemake – A workflow management system in Python that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment. [ paper-2018 | web ]
Workflow Descriptor Language – Workflow standard developed by the Broad. [ web ]

ECTOOLS: Long Read Correction and other Correction tools

Jit — Fri, 05 Jan 2018 04:02:22 -0600

Long Read Correction and other Correction tools

This package is a loose collection of scripts. To run the correction
routine see the section below. Descriptions of the other scripts
are at the bottom of this file.

Contact: gurtowsk@cshl.edu

In short, the correction algorithm takes as input the unitigs from a short read assembly and uses them to correct long read data. More background information for the algorithm can be found:
http://schatzlab.cshl.edu/presentations/2013-06-18.PBUserMeeting.pdf

Address of the bookmark: https://github.com/jgurtowski/ectools

Metassembler: merging and optimizing de novo genome assemblies

Rahul Nayak — Tue, 08 May 2018 04:52:33 -0500

Metassembler combines multiple whole genome de novo assemblies into a combined consensus assembly using the best segments of the individual assemblies.

Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We present our metassembler algorithm that merges multiple assemblies of a genome into a single superior sequence.

Address of the bookmark: https://sourceforge.net/projects/metassembler/?source=directory

DNA Nucleotide Counter

Neel — Fri, 12 Oct 2018 04:37:01 -0500

DNA Nucleotide Counter is delivered in a DNA Baser package together with other free molecular biology tools. Download the package and double click it. The programs inside the package will be extracted to the destination folder (specified by you). Go to the destination folder and double click the program you want to use.

It installs in any computer even if you don't have administrator rights!

Address of the bookmark: http://www.dnabaser.com/download/DNA-Counter/index.html

Shasta long read assembler

Jit — Tue, 14 Jan 2020 06:47:07 -0600

The goal of the Shasta long read assembler is to rapidly produce accurate assembled sequence using as input DNA reads generated by Oxford Nanopore flow cells.

Computational methods used by the Shasta assembler include:

Using a run-length representation of the read sequence. This makes the assembly process more resilient to errors in homopolymer repeat counts, which are the most common type of errors in Oxford Nanopore reads.
Using in some phases of the computation a representation of the read sequence based on markers, a fixed subset of short k-mers (k ≈ 10).

More at https://chanzuckerberg.github.io/shasta/index.html

Address of the bookmark: https://github.com/chanzuckerberg/shasta

Frequent parameters for bioinformatics tools !

BioStar — Tue, 27 Oct 2020 19:42:32 -0500

Third party executable parameters and options.

Trimmomatic

“ILLUMINACLIP:...:2:30:10”

“LEADING:15”

“TRAILING:15”

“SLIDINGWINDOW:4:20”

“MINLEN:20”

“TOPHRED33”

Filtlong

--min_length 500

--min_mean_q 85

--min_window_q 65

FastQ Screen

--aligner bowtie2' (bwa for PacBio)

--subset 1000 (for PacBio)

SPAdes

--careful

--disable-gzip-output

--cov-cutoff auto

--phred-offset 33

HGAP

Pbalign.task_options.min_accuracy: 70

Pbalign.task_options.no_split_subreads: false

Genomic_consensus.task_options.min_confidence: 40

falcon_ns.task_options.HGAP_GenomeLength_str:

6000000

Pbcoretools.task_options.read_length: 0

Genomic_consensus.task_options.use_score: 0

Pbalign.task_options.min_length: 50

Pbalign.task_options.algorithm_options: --minMatch 12

--bestn 10 --minPctSimilarity 70.0

Pbalign.task_options.hit_policy: randombest

Pbcoretools.task_options.other_filters: rq >= 0.7

Pbalign.task_options.concordant: false

Genomic_consensus.task_options.min_coverage: 5

falcon_ns.task_options.HGAP_SeedCoverage_str: 30

falcon_ns.task_options.HGAP_AggressiveAsm_bool: false

Genomic_consensus.task_options.algorithm: best

falcon_ns.task_options.HGAP_SeedLengthCutoff_str: -1

Genomic_consensus.task_options.diploid: false

MeDuSa

-random 100

Prokka

--usegenus

--force

--addgenes

--rfam

--rawproduct

cmsearch (taxonomy, 16S)

--rfam

--noali

blastn (taxonomy, 16S)

-evalue 1E-10

blastn (MLST)

-ungapped

-dust no

-evalue 1E-20

-word_size 32

-culling_limit 2

-perc_identity 95

blastp (VF)

-culling_limit 2

RGI (ABR)

--input_type contig

bowtie2 (mapping)

--sensitive

minimap2 (mapping)

-a

-x map-ont

samtools mpileup (SNP detection)

-uRI

bcftools call (SNP detection)

--variants-only

--skip-variants indels

--output-type v

--ploidy 1

-c

SNPsift filter (SNP detection)

"( QUAL >= 30 ) & (( na FILTER ) | (FILTER = 'PASS')) &

( DP >= 20 ) & ( MQ >= 20 )"

SNPeff ann (SNP detection)

-nodownload

-no-intron

-no-downstream

-no SPLICE_SITE_REGION

-upDownStreamLen 250

bcftools consensus

(phylogenetic tree)

--haplotype 1

fasttreemp

-nt

-boot 100

roary

-e

-n

-cd 100

-g 100000

Bioinformatics tools for telomere to telomere assembly !

BioStar — Tue, 17 Aug 2021 13:17:09 -0500

● Merfin – k-mer-based assembly and variant calling evaluation for improved consensus accuracy (Arang Rhie)
● PanGenie – algorithm that leverages a pangenome reference built from haplotype-resolved genome assemblies in conjunction with k-mer count information from raw, short-read sequencing data to genotype a wide spectrum of genetic variation (Tobias Marschall)
● SQANTI3 – an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline (Rocío Amorín de Hegedüs @rocioadh)
● tama (Transcriptome Annotation by Modular Algorithms) – software designed for processing Iso-Seq data and other long-read transcriptome data (Richard Kuo @GenomeRIK)
● pbaa (PacBio Amplicon Analysis) – separates complex mixtures of amplicon targets from genomic samples to cluster and generate high-quality consensus sequences from HiFi reads (Zev Kronenberg @zevkronenberg)
● bellerophon – analyzes MHC typing and other low-complexity gene amplicon data; performs allele calling while detecting polymorphic sites within the sequences and removing potential chimeric sequence variants (Yuanyuan Cheng @Yuanyuan929)
● svpack – tools for filtering, comparing, and annotating structural variant (SV) calls in VCF format (Aaron Wenger)
● JumboDB – tool for de Bruijn graph construction (Anton Bankevich @AntonBankevich)
● uLTRA – tool for splice alignment of long transcriptomic reads to a genome, guided by a database of exon annotations. (Kristoffer Sahlin @krsahlin)
● LeafGo – workflow to rapidly produce high-quality de novo plant genomes (Luca Ermini @ermini_luca)

Reference:

https://www.pacb.com/blog/young-investigators-share-stellar-science-career-advice-and-bioinformatics-tools-at-smrt-leiden-2021/

Upset plots !

BioStar — Fri, 24 Mar 2023 22:30:23 -0500

Upset plots are a type of visualization used to analyze the intersection of sets or categories. They are particularly useful for displaying data with multiple categories and analyzing their overlaps.

In an upset plot, each row represents a category or set, and each column represents a data point. The length of the bar for each category indicates the number of data points that belong to that category. The plot also shows the intersections between categories, represented by overlapping bars.

Upset plots are useful for visualizing complex data with multiple categories and intersections, and can help identify patterns and relationships between categories. They are often used in fields such as bioinformatics, where they can be used to analyze gene expression data or to compare the results of different experimental conditions.

https://jokergoo.github.io/ComplexHeatmap-reference/book/upset-plot.html#example-with-the-genomic-regions

Address of the bookmark: https://jokergoo.github.io/ComplexHeatmap-reference/book/upset-plot.html#example-with-the-genomic-regions

BioKit: a set of tools dedicated to bioinformatics, data visualisation

Neel — Tue, 18 Jun 2024 02:04:39 -0500

BioKit is a set of tools dedicated to bioinformatics, data visualisation (biokit.viz), access to online biological data (e.g. UniProt, NCBI thanks to bioservices). It also contains more advanced tools related to data analysis (e.g., biokit.stats). Since R is quite common in bioinformatics, we also provide a convenient module to run R inside your Python scripts or shell (:mod:biokit.rtools module).

Address of the bookmark: https://biokit.readthedocs.io/en/latest/index.html

MetaPred2CS

Manisha Mishra — Fri, 03 Mar 2017 05:15:07 -0600

MetaPred2CS Web server is a meta-predictor based on Support Vector Machine (SVM) that combines 6 individual sequence based protein-protein interaction prediction methods to predict prokaryotic two-component system protein-protein interactions (PPIs). The methods implemented in MetaPred2CS are 2 co-evolutionary methods: in-silico two hybrid (i2h) and mirror tree (MT) methods and 4 genomics context based methods: phylogenetic profiling (PP), gene fusion (GF), gene neighbourhood (GN) and and gene operon methods (GO).

http://metapred2cs.ibers.aber.ac.uk/

Address of the bookmark: https://github.com/martinjvickers/MetaPred2CS