BOL: Related items

Hapsembler: An Assembler for Highly Polymorphic Genomes

Jit — Tue, 22 May 2018 04:09:53 -0500

Hapsembler is a haplotype-specific genome assembly toolkit that is designed for genomes that are rich in SNPs and other types of polymorphism. Hapsembler can be used to assemble reads from a variety of platforms including Illumina and Roche/454. http://compbio.cs.toronto.edu/hapsembler/

Address of the bookmark: http://compbio.cs.toronto.edu/hapsembler/

Bioinformatics tools developed for Oxford Nanopore data analysis !

biogeek — Wed, 27 Dec 2017 20:47:30 -0600

MinION is the only portable real-time device for DNA and RNA sequencing. Each consumable flow cell can now generate 10–20 Gb of DNA sequence data. Ultra-long read lengths are possible (hundreds of kb) as you can choose your fragment length. One of the technical advantages of ONT data is the read length, which offers great prospects for genome assembly. Generally, assemblers are based on several different types of algorithms, such as greedy, overlap-layout-consensus (OLC), de Bruijn graph (DBG), and string graph.

List of analysis tools developed for Oxford Nanopore data

BWA
Fast nanopore data tuned alignment tool
https://github.com/lh3/bwa

GraphMap
Mapper for long and error-prone reads
https://github.com/isovic/graphmap

LAST
Nanopore tuned alignment tool
http://last.cbrc.jp/

LINKS
Software tool for long read scaffolding
https://github.com/warrenlr/LINKS/

marginAlign
Tools to align nanopore reads to a reference
https://github.com/benedictpaten/marginAlign

minoTour
Real time analysis tools
http://minotour.nottingham.ac.uk/

nanoCORR
Error-correction tool for nanopore sequence data
https://github.com/jgurtowski/nanocorr

NanoOK
Software for nanopore data, quality and error profiles
https://documentation.tgac.ac.uk/display/NANOOK/NanoOK

Nanopolish
Nanopore analysis and genome assembly software
https://github.com/jts/nanopolish

nanopore
Variant-detection tool for nanopore sequence data
https://github.com/mitenjain/nanopore

Nanocorrect
Error-correction tool for nanopore sequence data
https://github.com/jts/nanocorrect/

npReader
Real-time conversion and analysis of nanopore reads
https://github.com/mdcao/npReader

poRe
Tool for analyzing and visualizing nanopore data
https://sourceforge.net/p/rpore/wiki/Home/

PoreSeq
Error-correction and variant-calling software
https://github.com/tszalay/poreseq

Poretools
Nanopore sequence analysis and visualization software
https://github.com/arq5x/poretools

SSPACE-LongRead
Genome scaffolding tool
http://www.baseclear.com/genomics/bioinformatics/basetools/SSPACE-longread

SMIS
Genome scaffolding tool
https://sourceforge.net/projects/phusion2/files/smis/

List of assemblers for Oxford Nanopore MinION long reads

LQS
DALIGNER, Celera OLC Nanocorrect,
Nanopolish corrector
https://github.com/jts/nanopolish

PBcR
HGAP or BLASR, Celera OLC
PBcR corrector
http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR
–
Canu
MHAP, Celera OLC
Canu corrector
https://github.com/marbl/canu

Falcon
String graph, Celera OLC
Falcon corrector
https://github.com/PacificBiosciences/falcon

Miniasm
OLC
https://github.com/lh3/miniasm

ra-integrate
OLC
https://github.com/mariokostelac/ra-integrate/

ALLPATHS-LG
de Bruijn graph
ALLPATHS-L corrector
https://www.broadinstitute.org/software/allpaths-lg/blog/?page_id=12

SPAdes
de Bruijn graph
SPAdes corrector
http://bioinf.spbau.ru/spades

Protocol for De novo Genome Assembly using Illumina Reads

BioStar — Sat, 16 Jan 2021 21:42:11 -0600

In this protocol, we address and describe the de novo assembly method for small to medium-sized genomes.

What is de novo genome assembly?
The method of taking a large number of short DNA sequences and placing them back together to create a reflection of the original chromosomes from which the DNA originated relates to genome assembly. No previous knowledge of the source DNA sequence length, structure or composition is inferred by De novo genome assemblies. The DNA of the target organism is split up into millions of tiny parts and read on a sequencing computer in a genome sequencing experiment. Depending on the sequencing system used, these "reads" range from 20 to 1000 nucleotide base pairs (bp) in length. Usually, length reads of 36 - 150 bp are produced for Illumina style short read sequencing. These reads can be either “single ended” as described above or “paired end.”

Why genome assembly?
In basic research into why and how they live, as well as in applied topics, identifying the DNA sequence of an organism is useful. Awareness of a DNA sequence may be useful in virtually any biological research because of the relevance of DNA to living things. For example, it may be used in medicine to classify, diagnose and eventually improve genetic disorder therapies. Similarly, pathogens study can lead to treatments for infectious diseases.

Raw NGS data
Reads can be saved as a Fasta file as text or in a FastQ file with their attributes. FastQ is the most common read file format since this is what the Illumina sequencing pipeline creates. This will henceforth be the subject of our conversation.

In a nutshell the protocol:
Get the sequence file(s) read from the sequencing machine (s).
Look at the readings - have an idea of what you have and what the standard is like.
If required, raw data cleanup/quality trimming.
Choose an adequate parameter set for assembly.
Assemble the data into scaffolds/contigs.
Examine the assembly performance and determine the efficiency of the assembly.

Read Quality Control:
Check the qualiy with fastQC.
Script
https://bioinformaticsonline.com/snippets/view/42540/install-fastqc-using-conda

Quality trimming/cleanup of read files.
This function trims adapters, barcodes and other contaminants from the reads.
Script
https://bioinformaticsonline.com/snippets/view/42542/trimmomatic-command

Genome Assembly:
The object of this portion of the protocol is to explain the method of assembling the reads trimmed by quality into draft contigs.

spades.py -1 illumina_R1.fastq.gz -2 illumina_R2.fastq.gz --careful --cov-cutoff auto -o result_of_spades_assembly_all_illumina

A significant range of short-read assemblers are available. Everyone with strengths and disadvantages of their own.
Some of the assemblers available include:
Velvet
SOAP-denovo
MIRA
ALLPATHS

Next step is to assess the suitability and what to do with a draft package of contiguous details for the remainder of the study now. Few stuff you can note about the contigs you just created: They're the draft Contigs. Any mis-assemblies can occur.

Mis-assembly checking and assembly metric tools:
QUAST - Quality assessment tool for genome assembly http://bioinf.spbau.ru/quast
Mauve assembly metrics - http://code.google.com/p/ngopt/wiki/How_To_Score_Genome_Assemblies_with_Mauve
InGAP-SV - https://sites.google.com/site/nextgengenomics/ingap and http://ingap.sourceforge.net/
inGAP is also useful for finding structural variants between genomes from read mappings.

Genome finishing tools:
Semi-automated gap fillers:
Gap filler - http://www.baseclear.com/landingpages/basetools-a-wide-range-of-bioinformatics-solutions/gapfiller/

IMAGE (V2) - http://sourceforge.net/apps/mediawiki/image2/index.php?title=Main_Page

Genome visualisers and editors:
Artemis - http://www.sanger.ac.uk/resources/software/artemis/
IGV - http://www.broadinstitute.org/igv/

Automated and semi automated annotation tools:
Prokka - https://github.com/tseemann/prokka
RAST - http://www.nmpdr.org/FIG/wiki/view.cgi/FIG/RapidAnnotationServer
JCVI Annotation Service - http://www.jcvi.org/cms/research/projects/annotation-service/

Frequent command use for the analysis are at:

https://bioinformaticsonline.com/blog/view/38765/list-of-tools-frequently-used-while-genome-assembly
https://bioinformaticsonline.com/pages/view/42275/frequent-parameters-for-bioinformatics-tools

Binding Site Prediction in Protein !

Poonam Mahapatra — Wed, 25 Apr 2018 04:35:57 -0500

The interaction between proteins and other molecules is fundamental to all biological functions. In this section we include tools that can assist in prediction of interaction sites on protein surface and tools for predicting the structure of the intermolecular complex formed between two or more molecules (docking).

Pockets Identification

CASTp

Automatic Identification of pockets and cavities in proteins structure, and quantitation of their volumes using Delaunay triangulation. Available also as PyMOL plugin

Pocket-Finder

Automatic identification of pockets and cavities in proteins structure, and quantitation of their volumes.

PocketPicker

Grid-based technique for the analysis of protein pockets. PocketPicker available as a plugin for PyMOL

Binding Site Prediction

ConSurf

Identification of functional regions in proteins by surface-mapping of phylogenetic information

CRESCENDO

Identification protein interaction sites. It uses sequence conservation patterns in homologous proteins to distinguish between residues that are conserved due to structural restraints from those due to functional restraints.

Ligand Binding Sites

3DLigandSite

The server utilizes protein-structure prediction to provide structural models of the binding site. Ligands bound to structures are superimposed onto the model and use to predict the binding site.

FINDSITE

A threading-based method for ligand-binding site prediction and functional annotation based on binding-site similarity across superimposed groups of threading templates.

LIGSITE^csc

Prediction of binding site by pocket identification using the Connolly surface and degree of conservation

metaPocketA meta server for ligand-binding site prediction. metaPocket use LIGSITE^csc, PASS, Q-SiteFinder and SURFNET

NCBI Datasets pages

BioStar — Wed, 12 Jul 2023 06:29:31 -0500

Update! Assembly and Genome record pages now redirect to new NCBI Datasets pages. NCBI Datasets is a new resource that makes it easier to find and download genome data. Learn more: https://ncbiinsights.ncbi.nlm.nih.gov/2023/07/11/ncbi-datasets-genome-assembly-pages/ #NCBICGR

Effective July 10, 2023, NCBI’s Assembly and Genome record pages now redirect to new NCBI Datasets pages. As previously announced, these updates are part of our ongoing effort to modernize and improve your user experience. NCBI Datasets is a new resource that makes it easier to find and download genome data.  

The following pages have been updated:

The NCBI Assembly record pages now redirect to the new NCBI Datasets Genome record pages that describe assembled genomes and provide links to related NCBI tools such as Genome Data Viewer and BLAST. 
The NCBI Genome record pages now redirect to the NCBI Datasets Taxonomy record pages that provide a taxonomy-focused portal to genes, genomes, and additional NCBI resources.

During this transition, you will have the option to return to the legacy Genome and Assembly record pages. We will remove the legacy pages in early 2024. 

Step-by-Step Guide to Running Genome Assembly

Abhi — Fri, 13 Dec 2024 11:35:55 -0600

Genome assembly is a critical process in bioinformatics, enabling the reconstruction of an organism's genome from short DNA sequence reads. Whether you’re working on a new microbial genome or a complex eukaryotic organism, this guide will walk you through the steps of genome assembly using state-of-the-art tools and best practices.

What is Genome Assembly?

Genome assembly involves piecing together short DNA sequence reads generated by sequencing platforms (e.g., Illumina, PacBio, Oxford Nanopore) into longer, contiguous sequences called contigs. This can be performed as:

De Novo Assembly: Without a reference genome.
Reference-Guided Assembly: Using a reference genome to guide the assembly process.

Step 1: Preparing Your Data

Before starting the assembly, ensure that your raw sequencing data is high quality.

Input Data
- Short Reads: Illumina sequencing generates short, accurate reads ideal for scaffolding.
- Long Reads: PacBio and Nanopore sequencing provide long reads for resolving repetitive regions.
Quality Control (QC)
Use tools like FastQC or MultiQC to assess the quality of your reads:

fastqc reads.fastq multiqc .

Look for issues like low-quality bases, adapter contamination, or overrepresented sequences.
Read Trimming and Filtering
Trim low-quality bases and adapters using Trimmomatic or Cutadapt:

trimmomatic PE reads_R1.fastq reads_R2.fastq trimmed_R1.fastq trimmed_R2.fastq \ ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36

Step 2: Choosing an Assembly Strategy

Select an assembly strategy based on your data type:

Short-Read Assemblers:
- SPAdes: Popular for microbial genomes.
- Velvet: Fast for smaller genomes.
Long-Read Assemblers:
- Canu: Ideal for long-read datasets.
- Flye: Versatile for small and large genomes.
Hybrid Assemblers:
- MaSuRCA: Combines short and long reads.
- Unicycler: Optimized for bacterial genomes.

Step 3: Running the Assembly

3.1. SPAdes (Short-Read Assembly)

SPAdes is an excellent choice for small genomes, such as bacteria.

spades.py -1 trimmed_R1.fastq -2 trimmed_R2.fastq -o spades_output

The output includes assembled contigs (contigs.fasta) and scaffolds (scaffolds.fasta).

3.2. Canu (Long-Read Assembly)

Canu is designed for high-error long reads from PacBio or Nanopore.

canu -p genome -d canu_output genomeSize=4.7m -nanopore-raw reads.fastq

The output will be in canu_output/genome.contigs.fasta.

3.3. Hybrid Assembly with Unicycler

Unicycler combines short and long reads for improved assemblies.

unicycler -1 trimmed_R1.fastq -2 trimmed_R2.fastq -l long_reads.fastq -o unicycler_output

Step 4: Assessing Assembly Quality

After assembly, evaluate its quality using the following tools:

QUAST
QUAST generates assembly statistics, such as N50, genome size, and GC content:

quast contigs.fasta -o quast_output
BUSCO
BUSCO checks genome completeness by identifying conserved genes:

busco -i contigs.fasta -o busco_output -l fungi_odb10 -m genome
Assembly Graph Visualization
Visualize assembly graphs with Bandage:

Bandage load assembly_graph.gfa

Step 5: Post-Assembly Steps

Polishing
Improve assembly accuracy using tools like Pilon (for short reads) or Racon (for long reads).

racon long_reads.fasta mapped_reads.sam contigs.fasta > polished_contigs.fasta
Scaffolding
Link contigs into scaffolds using tools like SSPACE or Opera-LG if required.
Annotation
Annotate the assembled genome using Prokka for prokaryotes or Maker for eukaryotes.

prokka --outdir annotation_output --prefix genome contigs.fasta

Step 6: Sharing and Archiving

Submit to Public Repositories
Share your assembly in databases like NCBI GenBank, ENA, or DDBJ.
Metadata Preparation
Include detailed metadata for your submission, such as organism name, sequencing platform, and coverage.

Best Practices

Always perform quality checks at each stage to ensure data integrity.
Use multiple tools to cross-validate results when working with complex genomes.
Document parameters and software versions for reproducibility.

Conclusion

Genome assembly is a powerful process that transforms raw sequencing data into a coherent representation of an organism’s genome. By following this step-by-step guide, you can successfully assemble genomes and uncover valuable biological insights. Whether you’re assembling a microbial genome or tackling the complexities of a eukaryotic genome, these tools and strategies will set you on the path to success.

Research Associate Bioinformatics in IISc Recruitment 2020

Tue, 23 Jun 2020 21:53:34 -0500

Research Associate Bioinformatics in IISc Recruitment 2020

Essential Qualifications: Ph.D. (Bioinformatics/ Biophysics/ Biotechnology or any other stream of biological/ physical sciences) with a minimum of two publications in reputed peer reviewed journals in the area of structural bioinformatics or biophysics or biomolecular modeling/ simulation.

Job description: Development of bioinformatics tools and algorithms/software for structure based analysis of biomolecular systems. Programmatic access to major biomolecular databases using APIs Knowledge based prediction and analysis of biomolecular structure, function and interactions. Docking/simulations for inhibitor design.

Desirable Qualifications (Research Associate/s): i) Strong computer programming skills (in Python/PERL/PHP or C++ or object oriented database management systems like MySQL etc or scripting languages under LINUX/UNIX environment).

ii) Extensive experience in computational analysis of biomolecular structure/interactions and usage of advanced biomolecular simulation softwares. iii) Adequate knowledge of major databases, webservers and softwares in the area of biomolecular structure/function and drug design. iv) Familiarity with Parallel Programming environments and experience in usage of high-end HPC clusters.

The candidates must highlight their experience in above mentioned fields/topics in their CV. Initial appointment will be for a period of 1 year, subject to extension after review of performance.

Emoluments: As per DST, GOI norms and commensurate with experience.

More at https://www.iisc.ac.in/positions-open/

Pollard Lab

Fri, 25 Sep 2020 20:20:50 -0500

We are a bioinformatics research lab focused on developing novel methods and using them to study genome evolution, organization, and regulation. Our mission is to decode biomedical knowledge that is missed without rigorous statistical approaches.

http://docpollard.org/

Tools

http://docpollard.org/resources/software/

Bioinfo Lab

Fri, 04 Mar 2022 00:17:00 -0600

The Institute of Bioinformatics conducts internationally renowned research and provides profound education in bioinformatics. Its research focuses on development and application of machine learning and statistical methods in biology and medicine.

Contact:
Computer Science Building (Science Park 3)
Altenberger Str. 69, A-4040 Linz, Austria
Tel. +43 732 2468 4520 / Fax +43 732 2468 4539
E-mail secretary@bioinf.jku.at

http://www.bioinf.jku.at/

The Role of lncRNA in Bioinformatics: Unlocking the Secrets of the Genome

LEGE — Sat, 07 Dec 2024 02:09:47 -0600

In the intricate dance of molecular biology, long non-coding RNAs (lncRNAs) have emerged as key players, capturing the interest of researchers worldwide. These RNA molecules, once dismissed as "junk," have proven to be vital in the regulation of gene expression, cellular processes, and the progression of diseases. The intersection of lncRNA studies and bioinformatics is transforming our understanding of these enigmatic molecules, offering profound insights into their structure, function, and therapeutic potential.

What Are lncRNAs?

lncRNAs are RNA transcripts longer than 200 nucleotides that do not code for proteins. Despite their non-coding nature, they play diverse roles in gene regulation, including chromatin remodeling, transcriptional control, and post-transcriptional processing. Unlike messenger RNAs (mRNAs), lncRNAs often function as scaffolds, decoys, or guides in cellular machinery, influencing biological processes such as cell differentiation, immune response, and even cancer metastasis.

Challenges in lncRNA Research

Identifying and understanding lncRNAs pose unique challenges:

High Sequence Variability: Unlike protein-coding genes, lncRNAs exhibit low sequence conservation across species, making functional predictions difficult.
Low Expression Levels: lncRNAs are often expressed at low levels, complicating their detection in transcriptomic data.
Diverse Functions: The multifunctional nature of lncRNAs requires advanced computational tools to decipher their roles in complex networks.

Bioinformatics: A Crucial Ally in lncRNA Research

Bioinformatics bridges the gap between raw biological data and meaningful insights, making it indispensable in lncRNA research. Here’s how:

1. Identification and Annotation

High-throughput sequencing technologies like RNA-seq generate vast amounts of data. Bioinformatics tools such as StringTie, Cufflinks, and HISAT2 help assemble and annotate lncRNAs from this data. Additionally, databases like NONCODE, LNCipedia, and Ensembl provide curated repositories of lncRNA sequences and annotations.

2. Functional Prediction

Bioinformatics algorithms predict the potential functions of lncRNAs by analyzing their interactions with DNA, RNA, and proteins. Tools like LncRNA2Function and RIblast utilize sequence motifs and secondary structure predictions to hypothesize about the roles of specific lncRNAs.

3. Network Construction

lncRNAs often act as regulatory hubs. Bioinformatics platforms such as Cytoscape enable the visualization of lncRNA-mediated networks, elucidating their roles in pathways like cell cycle regulation and apoptosis.

4. Epigenetic Studies

lncRNAs are known to interact with chromatin-modifying complexes, influencing gene expression epigenetically. Tools like ChIP-seq and ATAC-seq, combined with computational pipelines, identify these interactions and map them to the genome.

5. Clinical Applications

Bioinformatics aids in the discovery of lncRNA biomarkers for diseases like cancer and neurodegenerative disorders. Machine learning models analyze differential expression profiles, helping prioritize lncRNAs with therapeutic potential.

Case Study: lncRNAs in Cancer Research

lncRNAs such as HOTAIR and MALAT1 have been implicated in cancer progression. Bioinformatics analyses have revealed their roles in promoting metastasis and altering the tumor microenvironment. For example, transcriptome analysis in cancer patients identifies lncRNA expression signatures, enabling precision medicine approaches.

Future Directions

The fusion of bioinformatics with experimental biology is unlocking the secrets of lncRNAs. Advances in artificial intelligence, single-cell sequencing, and structural modeling promise to overcome current limitations. Here are some promising directions:

Integrative Analysis: Combining multi-omics data to understand the interplay of lncRNAs with other biomolecules.
CRISPR Screens: Leveraging bioinformatics to design CRISPR-based functional screens for lncRNAs.
Therapeutic Development: Using bioinformatics to design lncRNA-based therapeutics, including antisense oligonucleotides and RNA interference tools.

Conclusion

lncRNAs are the hidden gems of the genome, and bioinformatics is the key to unearthing their full potential. As research progresses, lncRNAs could pave the way for novel diagnostics, targeted therapies, and personalized medicine, revolutionizing our approach to complex diseases.

The journey into the world of lncRNAs is only beginning, and bioinformatics will continue to play a pivotal role in decoding these molecular mysteries. Whether you’re a researcher, clinician, or bioinformatics enthusiast, the study of lncRNAs offers a fascinating frontier of discovery.

BOL: Related items

Hapsembler: An Assembler for Highly Polymorphic Genomes

Bioinformatics tools developed for Oxford Nanopore data analysis !

Protocol for De novo Genome Assembly using Illumina Reads

Binding Site Prediction in Protein !

Pockets Identification

Binding Site Prediction

NCBI Datasets pages

The following pages have been updated:

Step-by-Step Guide to Running Genome Assembly

What is Genome Assembly?

Step 1: Preparing Your Data

Step 2: Choosing an Assembly Strategy

Step 3: Running the Assembly

3.1. SPAdes (Short-Read Assembly)

3.2. Canu (Long-Read Assembly)

3.3. Hybrid Assembly with Unicycler

Step 4: Assessing Assembly Quality

Step 5: Post-Assembly Steps

Step 6: Sharing and Archiving

Best Practices

Conclusion

Research Associate Bioinformatics in IISc Recruitment 2020

Pollard Lab

Bioinfo Lab

The Role of lncRNA in Bioinformatics: Unlocking the Secrets of the Genome

What Are lncRNAs?

Challenges in lncRNA Research

Bioinformatics: A Crucial Ally in lncRNA Research

1. Identification and Annotation

2. Functional Prediction

3. Network Construction

4. Epigenetic Studies

5. Clinical Applications

Case Study: lncRNAs in Cancer Research

Future Directions

Conclusion

NCBI Datasets pages