BOL: Related items

ENCODE3: A collection of research articles and related content describing the Encyclopedia of DNA Elements, its datasets and tools.

Shruti Paniwala — Sat, 08 Aug 2020 08:25:21 -0500

How cells, tissues and organisms interpret the information encoded in the genome has vital implications for our understanding of development, health and disease. Launched in 2003, the ENCyclopedia Of DNA Elements (ENCODE) project has the aim of mapping the functional elements in the human genome (later expanded to include model organisms).

During the first phase of ENCODE, published in 2007, microarray-based technologies were used to detect regions associated with transcription factors, certain histone modifications and open chromatin within a pre-specified 1% of the human genome.

ENCODE’s second phase saw a switch to sequencing-based technologies, the addition of new assay types and the analysis of functional elements genome-wide, described in a collection of research articles in 2012.

The Encyclopedia paper of ENCODE 3, published in Nature, gives an overview of the various assays that were performed in human and mouse cell lines and tissues and describes a Registry of human and mouse candidate cis-regulatory elements (cCREs).

More at https://www.nature.com/immersive/d42859-020-00027-2/index.html

Ancient whole genome duplication (WGD) detection tools !

Rahul Nayak — Sun, 07 Mar 2021 00:32:44 -0600

There are two methods for ancient WGD detection, one is collinearity analysis, and the other is based on the Ks distribution map. Among them, Ks is defined as the average number of synonymous substitutions at each synonymous site, and there is also a Ka corresponding to it, which refers to the average number of non-synonymous substitutions at each non-synonymous site.

At present, some people have posted articles about the analysis process of WGD. I searched for the keyword "wgd pipeline" and found the following:

GenoDup: https:// github.com/MaoYafei/GenoDup-Pipeline
https://peerj.com/articles/6303/
WGDdetector: https:// github.com/yongzhiyang2 012/WGDdetector
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2670-3
wgd: https:// github.com/arzwa/wgd
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1142-2#Sec1
https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-017-0399-x
GeNoGAP https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1142-2
https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-017-0399-x
https://github.com/dfguan/purge_dups
https://www.biorxiv.org/content/10.1101/2020.01.24.917997v1

This article introduces the usage of wgd.

Wgd cannot be installed directly with bioconda at present, so it is a little troublesome to install, because it depends on a lot of software. wgd depends on the following software

BLAST
MCL
MUSCLE/MAFFT/PRANK
PAML
PhyML/FastTree
i-ADHoRe

But the good news is that most of the software it depends on can be installed with bioconda

conda create -n wgd python=3.5 blast mcl muscle mafft prank paml fasttree cmake libpng mpi=1.0=mpich
conda activate wgd

Here mpi=1.0=mpich is selected, because i-adhore depends on mpich. If openmpi is installed, an error will appear while loading shared libraries: libmpi_cxx.so.40: cannot open shared object file: No such file or directory

After that, the installation is much simpler

git clone https://github.com/arzwa/wgd.git
cd wgd
pip install .
pip install git+https://github.com/arzwa/wgd.git
For i-ADHoRe, you need to register at http:// bioinformatics.psb.ugent.be /webtools/i-adhore/licensing/Agree to the license to download i-ADHoRe-3.0

Since my miniconda3 installed ~/opt/, the installation path is so~/opt/miniconda3/envs/wgd/

tar -zxvf i-adhore-3.0.01.tar.gz
cd i-adhore-3.0.01
mkdir -p build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=~/opt/miniconda3/envs/wgd/
make -j 4
make insatall

Take the sugarcane genome Saccharum spontaneum L as an example. The genome is 8-ploid with 32 chromosomes (2n = 4x8 = 32)

Download the tutorial for CDS and GFF annotation files

mkdir -p wgd_tutorial && cd wgd_tutorial
wget http://www.life.illinois.edu/ming/downloads/Spontaneum_genome/Sspon.v20190103.cds.fasta.gz
wget http://www.life.illinois.edu/ming/downloads/Spontaneum_genome/Sspon.v20190103.gff3.gz
gunzip *.gz

First conda activate wgdstart our analysis environment, and then start the analysis

Step 1 : Use to wgd mclidentify homologous genes in the genome

wgd mcl -n 20 --cds --mcl -s Sspon.v20190103.cds.fasta -o Sspon_cds.out

Step 2 : Use to wgd ksdbuild Ks distribution

wgd ksd --n_threads 80 Sspon_cds.out/Sspon.v20190103.cds.fasta.blast.tsv.mcl Sspon.v20190103.cds.fasta

Step 3 : If the quality of the genome is good, then wgd syncollinearity analysis can be used . It can help us find the collinearity block in the genome and the corresponding anchor point

wgd syn --feature gene --gene_attribute ID \
-ks wgd_ksd/Sspon.v20190103.cds.fasta.ks.tsv \
Sspon.v20190103.gff3 Sspon_cds.out/Sspon.v20190103.cds.fasta.blast.tsv.mcl

For more reading - There are 9 sub-modules in WGD

kde: KDE fitting to the Ks distribution
ksd: Ks distribution construction
mcl: BLASP comparison of All-vs-ALl + MCL classification analysis.
mix: Hybrid modeling of Ks distribution.
pre: preprocess the CDS file
syn: Call I-ADHoRe 3.0 to use GFF files for collinearity analysis
viz: draw histogram and density plot
wf1: Ks standard analysis procedure of the whole genome paranome (paranome), call mcl, ksd and syn
wf2: Ks standard analysis procedure of one-vs-one homologous gene (ortholog), call wcl and kSD

RagTag: a collection of software tools for scaffolding and improving modern genome assemblies

Jit — Sat, 11 Sep 2021 00:28:14 -0500

RagTag is a collection of software tools for scaffolding and improving modern genome assemblies. Tasks include:

Homology-based misassembly correction
Homology-based assembly scaffolding and patching
Scaffold merging

Address of the bookmark: https://github.com/malonge/RagTag

Bioinformatics tools for genome assembly !

BioStar — Mon, 24 Jul 2023 07:04:26 -0500

There are numerous genome assembly tools available, each with its strengths and weaknesses. Here is a list of some widely used genome assembly tools as of my last update in September 2021:

SPAdes: An assembler specifically designed for single-cell and multi-cell bacterial genomes, as well as small eukaryotic genomes.
ABySS: A parallelized assembler for large genomes that uses de Bruijn graphs.
Velvet: Another de Bruijn graph-based assembler optimized for short-read sequencing data.
SOAPdenovo: A de Bruijn graph-based assembler designed for short reads, widely used for assembling large and complex genomes.
MaSuRCA: A hybrid assembler that combines data from multiple sequencing technologies, such as Illumina and PacBio.
Canu: A long-read assembler optimized for PacBio and Oxford Nanopore sequencing data.
Flye: A long-read assembler suitable for bacterial and small eukaryotic genomes.
SMARTdenovo: An assembler designed for long reads, particularly suited for PacBio data.
SPAdes Long Read (SPAdesLR): An extension of SPAdes for long-read data, such as those from PacBio or Nanopore.
Minia: An assembler optimized for low memory consumption, suitable for small and medium-sized genomes.
Unicycler: A hybrid assembler that combines short and long reads for circular bacterial genome assembly.
wtdbg2: A de Bruijn graph assembler for long reads, efficient for very large genomes.
Shasta: A long-read assembler that uses the Overlap-Layout-Consensus approach, suitable for PacBio and Nanopore data.
Sparc: An assembler designed to handle noisy long reads from Nanopore sequencing.
CANA: An assembler for metagenomic data, particularly for complex and diverse microbial communities.
Ra Assembler: A metagenome assembler for long reads, designed for highly complex metagenomic samples.

Please note that the field of bioinformatics is constantly evolving, and new assembly tools may have emerged since my last update. Additionally, the performance of these tools can vary depending on the characteristics of the sequencing data and the genome being assembled. When selecting an assembly tool, consider the specific requirements of your project, the available data types, and the computational resources at your disposal. Always refer to the respective tool's documentation and publications for the most up-to-date information and recommendations.

Libraries or management tools for high throughput sequencing data

LEGE — Fri, 04 Oct 2024 02:45:06 -0500

GATB Library. The Genome Analysis Toolbox with de-Bruijn graph. A large part of tools developed by the GenScale team are based on this library.
These methods enable the analysis of data sets of any size on multi-core desktop computers, including very huge amount of reads data coming from any kind of organisms such as bacteria, plants, animals and even complex samples (e.g. metagenomes). Among them are (the full is available here: https://gatb.inria.fr/software/):
LRez: C++ Library and toolkit for the barcode-based management and indexation of linked-read datasets.

Variant calling and/or genotyping

DiscoSNP++ and discoSnpRAD: Reference-free small variant discovery (SNPs and indels)
MindTheGap: Detection and assembly of large insertion variants
TakeABreak: reference-free inversion discovery tool
SVJedi: Structural Variant genotyper with long read data
SVJedi-graph: Structural Variant genotyper with long read data using a variation graph

Sequence assembly

MinYS: reference-guided genome assembly in metagenomics data
MTG-link: local assembly tool for linked-read data
Minia: De novo short read assembler
de-novo pipeline: de-novo assembly pipeline (error correction / contigs / scaffolding) for genomes and meta-genomes
Mapsembler2: Targeted assembly (not maintained)

Managing k-mers & indexation

findere: simple strategy for speeding up queries and for reducing false positive calls from any Approximate Membership Query data structure.
- fimpera extends findere adding the abundance information.
kmtricks: modular tool suite for counting kmers, and constructing Bloom filters or kmer matrices, for large collections of sequencing data.
kmindex is a tool for indexing and querying sequencing samples. It is built on top of kmtricks.
back to sequences: Find sequences (reads, unitigs, genes) related to a set of kmers in large datasets, in a matter of seconds.
Backpack Quotient Filter: k-mer indexing data structure with abundance
short read connector: Detect similar reads from potentially large read set
DSK: Count K-mer in sequences

Pangenome graph manipulation

Pancat: Pangenome Comparison and Analysis Toolkit
GFAGraphs: a Python library to handle pangenome graph files in GFA format.

Comparative metagenomics with k-mers

Simka and SimkaMin: Comparative metagenomics for large-scale datasets
Comparead & Commet: comparison of metagenomic datasets

Species and bacterial strains identification

ORI: software using long nanopore reads to identify bacteria present in a sample at the strain level
StrainFLAIR: STRAIN-level proFiLing using vArIation gRaph

General-purpose sequencing data manipulation

GASSST: long read mapper
Leon: short read compressor (now included in GATB-core)
Bloocoo: short read corrector
BCALM: Construct compacted de Bruijn graphs (unitigs)

Protein Structure

A_Purva: Contact Map Overlap solver
MD-Jeep: Distance Geometry solver
CSA: Comparative Structural Alignment

Workflow

SLICEE: parallel execution of bioinformatics workflows

Comparative Genomics

CASSIS: detection of rearrangement breakpoints
PLAST: intensive bank-to-bank sequence comparison
DRJBreakpointFinder: detection and precise localization of excision sites in proviral segments

ClinCNV: Detection of copy number changes in Germline/Trio/Somatic contexts in NGS data

Neel — Thu, 16 Jan 2020 23:16:02 -0600

ClinCNV detects CNVs in germline and somatic context in NGS data (targeted and whole-genome). We work in cohorts, so it makes sense to try ClinCNV if you have more than 10 samples (recommended amount - 40 since we estimate variances from the data). By "cohort" we mean samples sequenced with the same enrichment kit with approximately the same depth (ie 1x WGS and 30x WGS better be analysed in separate runs of ClinCNV). Of course it is better if your samples were sequenced within the same sequencing facility.

Address of the bookmark: https://github.com/imgag/ClinCNV

kSNP3.0: SNP detection and phylogenetic analysis of genomes without genome alignment or reference genome

Jit — Fri, 08 Dec 2017 16:48:40 -0600

Sept. 20, 2017 Version 3.1 released. Major upgrade. Version 3.1 fixes the problems with SNP annotation that arose when NCBI discontinued use of GI numbers. Please read carefully the Preface (page 3) and the File of annotated genomes section (pages 9-10) in the version 3.1 User Guide. Thanks to Tom Slezak for revsing the get_genbank_file3 script and to Tod Stuber (USDA) for testing version 3.1 even though he doesn't need the annotation feature. All users are encouraged to upgrade to version 3.1.

Address of the bookmark: https://sourceforge.net/projects/ksnp/files/

ReMILO: reference assisted misassembly detection algorithm using short and long reads.

Jit — Fri, 06 Jul 2018 04:27:49 -0500

ReMILO, a reference assisted misassembly detection algorithm that uses both short reads and PacBio SMRT long reads. ReMILO aligns the initial short reads to both the contigs and reference genome, and then constructs a novel data structure called red-black multipositional de Bruijn graph to detect misassemblies. In addition, ReMILO also aligns the contigs to long reads and find their differences from the long reads to detect more misassemblies.

Address of the bookmark: https://github.com/songc001/remilo

HiTE: a fast and accurate dynamic boundary adjustment approach for full-length Transposable Elements detection and annotation in Genome Assemblies

LEGE — Sat, 20 Sep 2025 09:34:04 -0500

HiTE is a Python software that uses a dynamic boundary adjustment approach to detect and annotate full-length Transposable Elements in Genome Assemblies. In comparison to other tools, HiTE demonstrates superior performance in detecting a greater number of full-length TEs.

panHiTE

We have developed panHiTE, a comprehensive and accurate pipeline for TE detection in large-scale population genomes. It has been successfully applied to hundreds of plant population genomes, demonstrating its effectiveness and scalability.

For detailed instructions, please refer to the panHiTE tutorial.

Address of the bookmark: https://github.com/CSU-KangHu/HiTE

Genomics for Bioinformatician

Jitendra Narayan — Sat, 20 Jul 2013 07:03:00 -0500

Genomics is the study of the genomes of organisms. The field includes intensive efforts to determine the entire DNA sequence of organisms and fine-scale genetic mapping efforts. The field also includes studies of intragenomic phenomena such as heterosis, epistasis, pleiotropy and other interactions between loci and alleles within the genome. In contrast, the investigation of the roles and functions of single genes is a primary focus of molecular biology or genetics and is a common topic of modern medical and biological research. Research of single genes does not fall into the definition of genomics unless the aim of this genetic, pathway, and functional information analysis is to elucidate its effect on, place in, and response to the entire genome's networks.

Genomics was established by Fred Sanger when he first sequenced the complete genomes of a virus and a mitochondrion. His group established techniques of sequencing, genome mapping, data storage, and bioinformatic analyses in the 1970-1980s. A major branch of genomics is still concerned with sequencing the genomes of various organisms, but the knowledge of full genomes has created the possibility for the field of functional genomics, mainly concerned with patterns of gene expression during various conditions. The most important tools here are microarrays and bioinformatics. Study of the full set of proteins in a cell type or tissue, and the changes during various conditions, is called proteomics. A related concept is materiomics, which is defined as the study of the material properties of biological materials (e.g. hierarchical protein structures and materials, mineralized biological tissues, etc.) and their effect on the macroscopic function and failure in their biological context, linking processes, structure and properties at multiple scales through a materials science approach. The actual term 'genomics' is thought to have been coined by Dr. Tom Roderick, a geneticist at the Jackson Laboratory (Bar Harbor, ME) over beer at a meeting held in Maryland on the mapping of the human genome in 1986.

The outcome of almost two years of intense discussions with literally hundreds of scientists and members of the public, has three major areas of focus: Genomics to Biology, Genomics to Health, and Genomics to Society.

Genomics to Biology:
The human genome sequence provides foundational information that now will allow development of a comprehensive catalog of all of the genome's components, determination of the function of all human genes, and deciphering of how genes and proteins work together in pathways and networks.

Genomics to Health:
Completion of the human genome sequence offers a unique opportunity to understand the role of genetic factors in health and disease, and to apply that understanding rapidly to prevention, diagnosis, and treatment. This opportunity will be realized through such genomics-based approaches as identification of genes and pathways and determining how they interact with environmental factors in health and disease, more precise prediction of disease susceptibility and drug response, early detection of illness, and development of entirely new therapeutic approaches.

Genomics to Society:
Just as the HGP has spawned new areas of research in basic biology and in health, it has created new opportunities in exploring the ethical, legal, and social implications (ELSI) of such work. These include defining policy options regarding the use of genomic information in both medical and non-medical settings and analysis of the impact of genomics on such concepts as race, ethnicity, kinship, individual and group identity, health, disease, and "normality" for traits and behaviors.

This vision for the future of genomics is not just about the NHGRI. It encompasses the whole field of genomics, including the work of all the other Institutes and Centers at the NIH and of a number of other federal agencies. All of the NIH Institutes are already taking full advantage of the sequence and will apply its data to the better understanding of both rare and common diseases, almost all of which have a genetic component. A recent example of the way that the HGP and the knowledge and new technologies it has spawned are already facilitating science is the extremely rapid sequencing by groups in Canada and at the Centers for Disease Control and Prevention (CDC) in Atlanta of the genome of the virus that causes Severe Acute Respiratory Syndrome (SARS). The sequencing of the SARS virus genome provides insight into this new and deadly disease at a speed never before possible in science. In turn, this should lead to the rapid development of diagnostic tests and, in time, vaccines and effective treatments.

Links for the addition material available on Net

Genomes and genomics:

Bioinformatics and Genomics:

Structural genomics tutorial:

Comparative Genomics Tutorial:

GENOME TUTORIAL:

Tools and resources for identifying protein families, domains and motifs

Bioinformatics Tools
Tips, Tutorials, and Terminology for Using Selected Resources in Genome Database Guide:

A Web-Based Comparative Genomics Tutorial for Investigating Microbial Genomes:

Free Online Tutorials Teach Anyone How to Use Genome Databases:

Circos to create concise, explanatory, unique and print-ready visualizations of your data:

Genomics and Comparative Genomics Learning Module:

Computational Challenges in Comparative Genomics

A Tutorial:

A Comparative Genomics Resource for Grains:

PLAZA: A Comparative Genomics Resource to Study Gene and Genome Evolution in Plants:

VISTA :

Software for Genomics

Artemis Artemis is a free genome viewer and annotation tool that allows visualization of sequence features and the results of analyses within the context of the sequence, and its six-frame translation.
Chromas It will display and prints chromatogram files from ABI automated DNA sequencers, and Staden SCF files which the analysis programs for ALF, Li-Cor and Visible Genetics OpenGene sequencers can create.
Glimmer A system for finding genes in microbial DNA, especially the genomes of bacteria and archaea.Glimmer (Gene Locator and Interpolated Markov Modeler) uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DN
Glimmer HMM A fast and accurate gene finder based on a GHMM architecture, developed specifically for eukaryotes. It incorporates splice site models adapted from the GeneSplicer program and uses interpolated Markov models for evaluating the coding regions.
Glimmer M A gene finder derived from Glimmer, but developed specifically for eukaryotes. It is based on a dynamic programming algorithm that considers all combinations of possible exons for inclusion in a gene model and chooses the best of these combinations. The d
MUMmer MUMmer is a system for rapidly aligning entire genomes, whether in complete or draft form.
pDRAW pDRAW32 is being developed as a free time hobby project. It is far from finished, but as it has reached a point where it could be helpful for many labs, it is now available to the scientific community.
Sequin Sequin is a stand-alone software tool developed by the NCBI for submitting and updating entries to the GenBank, EMBL, or DDBJ sequence databases. It is capable of handling simple submissions that contain a single short mRNA sequence, and complex submissio
Staden The Staden Package consists of a series of tools for DNA sequence preparation (pregap4), assembly (gap4), editing (gap4) and DNA/protein sequence analysis (spin).

For more software @ http://bioinformaticsonline.com/bookmarks/view/926/list-of-popular-bioinformatics-softwaretools