BOL: All site blogs

Protocol for De novo Genome Assembly using Illumina Reads

BioStar — Sat, 16 Jan 2021 21:42:11 -0600

In this protocol, we address and describe the de novo assembly method for small to medium-sized genomes.

What is de novo genome assembly?
Genome assembly is the process of taking a large number of short DNA sequences and putting them back together to create a representation of the original chromosomes from which the DNA originated. De novo genome assembly assumes no prior knowledge of the length, layout or composition of the source DNA sequence. In a genome sequencing experiment, the DNA of the target organism is broken up into millions of small pieces and read on a sequencing machine. Depending on the sequencing technology used, these "reads" range from 20 to 1000 nucleotide base pairs (bp) in length. Illumina-style short-read sequencing typically produces reads of 36 - 150 bp. These reads can be either “single-end” (one read per DNA fragment) or “paired-end” (one read from each end of the fragment).

Why genome assembly?
Determining the DNA sequence of an organism is useful in basic research into why and how organisms live, as well as in applied subjects. Because of the importance of DNA to living things, knowledge of a DNA sequence may be useful in practically any biological research. For example, in medicine it can be used to identify, diagnose and potentially develop treatments for genetic diseases. Similarly, research into pathogens may lead to treatments for infectious diseases.

Raw NGS data
Reads can be stored as plain sequence text in a FASTA file or, together with their per-base quality scores, in a FASTQ file. FASTQ is the most common read file format because it is what the Illumina sequencing pipeline produces. It is the format we focus on for the rest of this protocol.
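To make the FASTQ layout concrete, here is a minimal Python sketch of a reader. It assumes the common four-lines-per-record layout (header starting with "@", sequence, separator starting with "+", quality string) with no line wrapping. The function name read_fastq and the toy reads are made up for illustration; for real projects an established parser is usually the safer choice.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header.strip())
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch in: " + header.strip())
        yield header[1:].strip(), sequence, quality


# Toy input standing in for a real (uncompressed) FASTQ file
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!II\n")
records = list(read_fastq(example))
for read_id, sequence, quality in records:
    print(read_id, len(sequence))
```

For gzip-compressed files such as illumina_R1.fastq.gz, the same function should work on a handle opened with gzip.open(path, "rt").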

The protocol in a nutshell:
Obtain the sequence read file(s) from the sequencing machine(s).
Look at the reads - get an idea of what you have and what the quality is like.
If required, clean up and quality-trim the raw data.
Choose an appropriate parameter set for the assembly.
Assemble the data into contigs/scaffolds.
Examine the output of the assembly and assess its quality.

Read Quality Control:
Check the quality with FastQC.
Script
https://bioinformaticsonline.com/snippets/view/42540/install-fastqc-using-conda

Quality trimming/cleanup of read files.
This step trims adapters, barcodes and other contaminants from the reads.
Script
https://bioinformaticsonline.com/snippets/view/42542/trimmomatic-command
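As a rough illustration of what quality trimming does, the Python sketch below removes low-quality bases from the 3' end of a read. It is not Trimmomatic's implementation, only the same idea in its simplest form. It assumes Phred+33 quality encoding, and the function names and the threshold of 20 are illustrative choices.

```python
from typing import List, Tuple


def phred33_scores(quality: str) -> List[int]:
    """Decode a Phred+33 quality string into integer scores (assumed encoding)."""
    return [ord(char) - 33 for char in quality]


def trim_trailing(sequence: str, quality: str, min_quality: int = 20) -> Tuple[str, str]:
    """Cut bases off the 3' end while their quality is below min_quality."""
    scores = phred33_scores(quality)
    end = len(sequence)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


# 'I' decodes to 40 and '#' to 2, so the last two bases are removed
trimmed = trim_trailing("ACGTAC", "IIII##")
print(trimmed)
```

Reads that become very short after trimming are usually discarded as well; that step is left out here to keep the sketch small.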

Genome Assembly:
The aim of this section of the protocol is to describe how to assemble the quality-trimmed reads into draft contigs.

spades.py -1 illumina_R1.fastq.gz -2 illumina_R2.fastq.gz --careful --cov-cutoff auto -o result_of_spades_assembly_all_illumina

A wide range of short-read assemblers is available, each with its own strengths and weaknesses.
Some of the assemblers available include:
Velvet
SOAP-denovo
MIRA
ALLPATHS

The next step is to assess the quality of the assembly and decide what to do with the draft set of contigs for the remainder of the study. A few things to note about the contigs you just created: they are draft contigs, and they may contain mis-assemblies.

Mis-assembly checking and assembly metric tools:
QUAST - Quality assessment tool for genome assembly http://bioinf.spbau.ru/quast
Mauve assembly metrics - http://code.google.com/p/ngopt/wiki/How_To_Score_Genome_Assemblies_with_Mauve
InGAP-SV - https://sites.google.com/site/nextgengenomics/ingap and http://ingap.sourceforge.net/
inGAP is also useful for finding structural variants between genomes from read mappings.
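One of the contiguity metrics reported by tools such as QUAST is N50: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The Python sketch below computes it from a list of contig lengths. The function name and the toy lengths are made up for illustration.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig that brings the running sum to half the total
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


toy_lengths = [80, 70, 50, 40, 30, 20]  # total 290, half is 145
print(n50(toy_lengths))
```

N50 only describes contiguity. A mis-assembled genome can still have a high N50, which is why the mis-assembly checks above matter.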

Genome finishing tools:
Semi-automated gap fillers:
Gap filler - http://www.baseclear.com/landingpages/basetools-a-wide-range-of-bioinformatics-solutions/gapfiller/

IMAGE (V2) - http://sourceforge.net/apps/mediawiki/image2/index.php?title=Main_Page

Genome visualisers and editors:
Artemis - http://www.sanger.ac.uk/resources/software/artemis/
IGV - http://www.broadinstitute.org/igv/

Automated and semi automated annotation tools:
Prokka - https://github.com/tseemann/prokka
RAST - http://www.nmpdr.org/FIG/wiki/view.cgi/FIG/RapidAnnotationServer
JCVI Annotation Service - http://www.jcvi.org/cms/research/projects/annotation-service/

Commands frequently used for the analysis are listed at:

https://bioinformaticsonline.com/blog/view/38765/list-of-tools-frequently-used-while-genome-assembly
https://bioinformaticsonline.com/pages/view/42275/frequent-parameters-for-bioinformatics-tools

10 NGS services companies around the globe !

BioStar — Sun, 22 Nov 2020 23:56:17 -0600

The global NGS services market is expected to reach USD 13.1 billion by 2025. Here are the top 10 NGS services companies to look for –

1. Illumina, Inc. (U.S.)

Illumina, Inc. was founded in 1998 and is headquartered in San Diego, U.S. Illumina, Inc. is one of the leading players in DNA sequencing and array-based technologies, serving customers in the research, clinical, and applied markets. The company offers products for applications in the life sciences, oncology, reproductive health, agriculture, and other emerging segments. The company serves government laboratories, genomic research centers and academic institutions, as well as pharmaceutical, biotechnology, agrigenomics and commercial molecular diagnostics laboratories and consumer genomics companies. Illumina, Inc. has a geographic presence in North America, Europe, Latin America, Asia-Pacific, and other regions.

2. QIAGEN N.V. (Netherlands)

QIAGEN N.V. was incorporated in 1986 and is headquartered in Venlo, The Netherlands. The company provides Sample to Insight solutions that transform biological samples into molecular insights. QIAGEN provides its workflow to customers in molecular diagnostics, assay technologies, bioservices and automation systems. The company’s genome services are suitable for custom/tailored projects that allow access to genomic sequence information. The company markets its products in more than 100 countries across the Americas, Europe, Asia, Australia, and the Middle East & Africa through its subsidiaries and channel partners.

3. PerkinElmer, Inc. (U.S.)

PerkinElmer, Inc. was founded in 1947 and is headquartered in Waltham, Massachusetts, U.S. PerkinElmer, Inc. offers products, services and solutions for the diagnostics, food, environmental, industrial, life sciences research and laboratory services markets. The company offers comprehensive genetic testing solutions that help to provide insight into the complex nature of rare and inherited diseases. Its subsidiaries include Caliper Life Sciences, Improvision, Viacell Inc. and ViaCord LLC, among many others. The company has facilities in Europe (France, Germany, and Belgium), the U.S. and Asia (China, India, and Japan).

4. Eurofins Scientific SE (Luxembourg)

Eurofins Scientific SE was founded in 1987 and is headquartered in Luxembourg, Europe. The company offers a portfolio of over 130,000 analytical methods and performs more than 150 million assays each year to establish the safety, identity, composition, authenticity, origin, traceability, and purity of biological substances and products, as well as to carry out human diagnostic services. The company has a geographic presence across 39 countries in Europe, North and South America, and Asia-Pacific.

5. GATC Biotech AG (Germany)

GATC Biotech AG was founded in 1990 and is headquartered in Constance, Germany. The company provides DNA and RNA sequencing and bioservices solutions to academic and industrial customers. It also provides next generation sequencing services including genomes, targeted (re)-sequencing, human sample sequencing, transcriptomes, metagenomes, regulomes, pre-sequencing, NGS barcode labels, and next generation sequencing technologies, as well as bioservices including tools, pipelines and workflows, compute resources, data analysis reports, and case studies. GATC Biotech AG operates as a subsidiary of Eurofins Scientific SE. It offers its products through distributors in Italy, Japan, Portugal, Spain, and the Czech Republic.

6. Macrogen, Inc. (South Korea)

Macrogen, Inc. was founded in 1997 and is headquartered in Seoul, South Korea. Macrogen, Inc. provides next generation sequencing services such as whole genome, de novo, exome, targeted, transcriptomics, metagenome, and epigenome sequencing. The company also provides a variety of services such as oligo synthesis, database construction, genome research, and bioservices analysis system consulting services. Macrogen, Inc. provides genome research services in Korea and internationally.

7. Genotypic Technology Pvt. Ltd. (India)

Genotypic Technology Pvt. Ltd. was incorporated in 1998 and is headquartered in Bangalore, India. Genotypic Technology is the first Genomics service provider in India providing Microarray, Next Generation Sequencing (NGS), Bioservices and solutions to domestic/ international pharma, biotech companies and academia. The company provides its services for protocol optimization, probe designing, array layouts, project designing, and nucleic acid analysis to in-depth analysis. Genotypic Technology has its geographic presence in North America, Europe, Asia Pacific, Middle East & Africa, and Latin America.

8. GENEWIZ, Inc. (U.S.)

GENEWIZ, Inc. was founded in 1999 and is headquartered in South Plainfield, New Jersey, U.S. The company is a leading provider of research services in the fields of Next Generation Sequencing, Sanger DNA sequencing, sequencing of bacteria and phage, gene synthesis, DNA cloning, genomics including mutation analysis, single nucleotide polymorphism, and bioservices. GENEWIZ, Inc. has a geographic presence in the U.S., China, Germany, France, Japan, and the U.K.

9. Beijing Genomics Institute (China)

Beijing Genomics Institute (BGI) is the world’s largest genomics organization and non-profit research institution that was founded in 1999 and is headquartered in Shenzhen, China. The Company provides a wide range of commercial next generation sequencing services and genetic tests for medical institutions, agricultural and environmental applications. The Company operates all across the globe through its subsidiaries, namely, BGI China (Mainland), BGI Asia Pacific, BGI Americas (North and South America) and BGI Europe (Europe and Africa).

10. SciGenom Labs Pvt. Ltd (India)

SciGenom Labs Pvt. Ltd was founded in 2010 and is headquartered in Cochin, India with offices in Chennai & Hyderabad in India, and San Francisco in the U.S. It is a Genomics R&D services company that provides genomic sequencing and NGS services to life sciences and healthcare businesses globally as well as academic and government institutions in India.

Popular mentions – MedGenome (India), DNA Link, Inc. (South Korea), Otogenetics Corporation (U.S.), Novogene Corporation (China), LGC Limited (U.K.), CD Genomics (U.S.), SeqLL, LLC (U.S.)

Tools and Method for Haplotype phasing !

Manisha Mishra — Fri, 04 Sep 2020 20:41:40 -0500

Huge amounts of genotype data are being produced with recent technological advances, both from increasingly comprehensive and inexpensive genome-wide SNP microarrays and from ever more accessible whole-genome and whole-exome sequencing methods. The vast amount of information contained in these data, however, is best exploited through phased haplotypes, which identify the alleles that are co-located on the same chromosome. Since sequence and SNP array data normally take the form of unphased genotypes, one does not directly observe on which of the two parental chromosomes, or haplotypes, a particular allele falls. Fortunately, new advances in both computational and laboratory methods promise improved determination of haplotype phase. The following tools are useful:

Arlequin: http://cmpg.unibe.ch/software/arlequin3/

BEAGLE: http://faculty.washington.edu/browning/beagle/beagle.html

fastPHASE: http://stephenslab.uchicago.edu/software.html

GENEHUNTER: http://linkage.rockefeller.edu/soft/gh/

The Genome Analysis Toolkit:

http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit

IMPUTE2: https://mathgen.stats.ox.ac.uk/impute/impute_v2.html

MACH: http://www.sph.umich.edu/csg/abecasis/MACH/

MERLIN: http://www.sph.umich.edu/csg/abecasis/Merlin/

PHASE: http://stephenslab.uchicago.edu/software.html

PL-EM: http://www.people.fas.harvard.edu/~junliu/plem/

“Read-backed phasing” algorithm: http://www.broadinstitute.org/gsa/wiki/index.php/Read-backed_phasing_algorithm

SHAPE-IT: http://www.griv.org/shapeit/
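To see why unphased genotypes are ambiguous, the Python sketch below enumerates every pair of haplotypes that is consistent with a single unphased genotype. It assumes biallelic sites coded as the number of alternate alleles (0, 1 or 2), and the function name is made up for illustration. It shows the problem the tools above solve, not how any of them works internally.

```python
from itertools import product
from typing import List, Sequence, Tuple

Haplotype = Tuple[int, ...]


def compatible_haplotype_pairs(genotype: Sequence[int]) -> List[Tuple[Haplotype, Haplotype]]:
    """List the unordered haplotype pairs that add up to the given genotype."""
    if any(count not in (0, 1, 2) for count in genotype):
        raise ValueError("genotype entries must be 0, 1 or 2")
    het_sites = [i for i, count in enumerate(genotype) if count == 1]
    pairs = set()
    for assignment in product((0, 1), repeat=len(het_sites)):
        # Homozygous sites are fixed: 0 -> allele 0 on both, 2 -> allele 1 on both
        hap_a = [count // 2 for count in genotype]
        hap_b = list(hap_a)
        # Heterozygous sites get one allele on each haplotype, in either order
        for site, allele in zip(het_sites, assignment):
            hap_a[site] = allele
            hap_b[site] = 1 - allele
        pairs.add(tuple(sorted((tuple(hap_a), tuple(hap_b)))))
    return sorted(pairs)


# Two heterozygous sites: the alternate alleles may sit on the same
# chromosome or on different ones, and the genotype alone cannot tell.
for pair in compatible_haplotype_pairs([1, 1]):
    print(pair)
```

With k heterozygous sites there are 2^(k-1) possible pairs (for k of at least 1), which is why phasing methods draw on extra information such as population haplotype frequencies, pedigrees or sequencing reads.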

Software for genome assembly !

LEGE — Sun, 30 Aug 2020 09:51:38 -0500

List of bioinformatics tools/Software Website References for genome assembly:

1 Falcon https://github.com/PacificBiosciences/pb-assembly

2 Canu assembler http://canu.readthedocs.io/en/latest/index.html

3 Miniasm assembler https://github.com/lh3/miniasm

4 PBJelly scaffolding tool https://sourceforge.net/projects/pb-jelly/

5 ARCS scaffolding tool https://github.com/bcgsc/arcs

6 Redundans reduction and scaffolding tool https://github.com/Gabaldonlab/redundans

7 Arrow error correction https://github.com/PacificBiosciences/GenomicConsensus

8 PILON error correction https://github.com/broadinstitute/pilon/wiki

9 BUSCO single copy gene markers http://busco.ezlab.org/

10 Bandage graph assembly viewer https://rrwick.github.io/Bandage/

11 Gepard dotter http://cube.univie.ac.at/gepard

12 MUMmer aligner and plotter http://mummer.sourceforge.net/

Perl one-liner for beginners !

BioStar — Fri, 24 Jul 2020 05:58:28 -0500

I often use the following arguments to perl:

-e Makes the line of code be executed instead of a script
-n Forces your line to be called in a loop. Allows you to take lines from the diamond operator (or stdin)
-p Forces your line to be called in a loop. Prints $_ at the end

This counts the number of quotation marks in each line and prints it

perl -ne '$cnt = tr/"//;print "$cnt\n"' inputFileName.txt

Prepend a string, followed by a tab, to each line

perl -pe 's/(.*)/string\t$1/' inFile > outFile

Append a blank line after each line (double-space the file)

perl -pe 's/$/\n/' all.sent.classOnly > all.sent.classOnly.sep

Replace all occurrences of pattern1 (e.g. [0-9]) with pattern2

perl -p -i.bak -w -e 's/pattern1/pattern2/g' inputFile

Go through file and only print words that do not have any uppercase letters.

perl -ne 'print unless m/[A-Z]/' allWords.txt > allWordsOnlyLowercase.txt

Go through file, split each line at spaces and print the words one per line.

perl -lne 'print join("\n", split(/ /, $_))' someText.txt > wordsPerLine.txt

Delete every character that is not a letter, white space or line end (replace with nothing)

perl -pe 's/[^a-zA-Z\s]//g' text_withSpecial.txt > text_lettersOnly.txt

Convert all uppercase letters to lowercase

perl -pe 'tr/A-Z/a-z/' textWithUpperCase.txt > textwithoutuppercase.txt

Print only the second column of the data, using tab as the separator

perl -lne '@F = split("\t", $_); print $F[1]' columnFileWithTabs.txt > justSecondColumn.txt

One-Liner: Sort lines by their length

perl -e 'print sort {length $a <=> length $b} <>' textFile

One-Liner: Print second column, unless it contains a number

perl -lane 'print $F[1] unless $F[1] =~ m/[0-9]/' wordCounts.txt

Useful links to therapy, disease, drug and drug-target network data:

Jit — Mon, 01 Jun 2020 11:47:51 -0500

DrugBank:

a bioinformatics-cheminformatics resource combining detailed drug data with comprehensive drug target information, with >4900 drug entries (~3500 experimental) and >1500 non-redundant protein entries http://www.drugbank.ca/

Drug-Target Network:

network data of 890 drugs and 394 target human proteins http://www.nature.com/nbt/journal/v25/n10/suppinfo/nbt1338_S1.html

Drug-Therapy Network:

three layers of drug-therapy networks according to the ATC classification http://www.biomedcentral.com/1471-2210/8/5/additional/

FDA Orange Book:

approved drug products with therapeutic equivalence evaluations http://www.fda.gov/cder/ob/

IDdb:

Thomson Investigational Drugs database including information on 107000 patents, 25000 investigational drugs and 80000 chemical structures http://scientific.thomson.com/products/iddb/

OMIM:

a knowledgebase of human genes and genetic disorders http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim

PDTD:

3D drug target structure database with a target identification option http://www.dddc.ac.cn/pdtd/

Predicted drug targets:

a set of 1383 predicted drug targets http://www.biomedcentral.com/1471-2105/8/353/additional/

Protein ligand network:

a network of 4208 ligands and ~15000 binding sites http://pbil.kaist.ac.kr/~parkkw/Lnet/

TDR Targets Database:

identification and ranking of targets against neglected tropical diseases http://tdrtargets.org/

Therapeutic Target Database:

lists >1500 therapeutic targets, disease conditions and corresponding drugs http://xin.cz3.nus.edu.sg/group/cjttd/ttd.asp

New Machine Learning Packages in R

Rahul Nayak — Fri, 27 Mar 2020 12:11:21 -0500

Machine Learning

autokeras v1.0.1: Implements an interface to AutoKeras, an open source software library for automated machine learning. See README for an example.

MTPS v0.1.9: Implements functions to predict simultaneous multiple outcomes based on revised stacking algorithms as described in Xing et al. (2019). See the vignette to get started.

quanteda.textmodels v0.9.1: Implements methods for scaling models and classifiers based on sparse matrix objects representing textual data. It includes implementations of the Laver et al. (2003) wordscores model, the Perry & Benoit (2017) class affinity scaling model, and the Slapin & Proksch (2008) wordfish model. See the vignette to get started.

SeqDetect v1.0.7: Implements the automaton model found in Krleža, Vrdoljak & Brčić (2019) to detect and process sequences. See the vignette for examples and theory.

studyStrap v1.0.0: Implements multi-study learning algorithms such as Merging, Study-Specific Ensembling (Trained-on-Observed-Studies Ensemble), the Study Strap, and the Covariate-Matched Study Strap, and offers over 20 similarity measures. See Kishida, et al. (2019) for background and the vignette for how to use the package.

Coronavirus COVID-19 Testing Sites In India

Neel — Mon, 16 Mar 2020 16:13:41 -0500

COVID-19 is a new illness that can affect your lungs and airways. It's caused by a virus called coronavirus.

Stay at home if you have coronavirus symptoms

Stay at home if you have either:

a high temperature – you feel hot to touch on your chest or back
a new, continuous cough – this means you've started coughing repeatedly

DO NOT TAKE

Ibuprofen

https://amp.theguardian.com/world/2020/mar/14/anti-inflammatory-drugs-may-aggravate-coronavirus-infection

How to avoid catching and spreading coronavirus (social distancing)

Everyone should do what they can to stop coronavirus spreading.

It is particularly important for people who:

are 70 or over
have a long-term condition
are pregnant
have a weakened immune system

Below are the 52 Coronavirus COVID-19 Testing sites/locations in India.

State: Andhra Pradesh

Sri Venkateswara Institute of Medical Sciences, Tirupati
Andhra Medical College, Visakhapatnam, Andhra Pradesh
GMC, Anantapur, AP

Union Territory: Andaman & Nicobar Islands

Regional Medical Research Centre, Port Blair, Andaman and Nicobar

State: Assam

Gauhati Medical College, Guwahati
Regional Medical Research Center, Dibrugarh

State: Bihar

Rajendra Memorial Research Institute of Medical Sciences, Patna

Union Territory: Chandigarh

Post Graduate Institute of Medical Education & Research, Chandigarh

State: Chhattisgarh

All India Institute of Medical Sciences, Raipur

Union Territory: Delhi-NCT

All India Institute of Medical Sciences, Delhi
National Centre for Disease Control, Delhi

State: Gujarat

BJ Medical College, Ahmedabad
M.P.Shah Government Medical College, Jamnagar

State: Haryana

Pt. B.D. Sharma Post Graduate Inst. of Med. Sciences, Rohtak, Haryana
BPS Govt Medical College, Sonipat

State: Himachal Pradesh

Indira Gandhi Medical College, Shimla, Himachal Pradesh
Dr.Rajendra Prasad Govt. Med. College, Kangra, Tanda, HP

Union Territory: Jammu and Kashmir

Sher‐e‐Kashmir Institute of Medical Sciences, Srinagar
Government Medical College, Jammu

State: Jharkhand

MGM Medical College, Jamshedpur

State: Karnataka

Bangalore Medical College & Research Institute, Bangalore
National Institute of Virology Field Unit Bangalore
Mysore Medical College & Research Institute, Mysore
Hassan Inst. of Med. Sciences, Hassan, Karnataka
Shimoga Inst. of Med. Sciences, Shivamogga, Karnataka

State: Kerala

National Institute of Virology Field Unit, Kerala
Govt. Medical College, Thiruvananthapuram, Kerala
Govt. Medical College, Kozhikode, Kerala

State: Madhya Pradesh

All India Institute of Medical Sciences, Bhopal
National Institute of Research in Tribal Health (NIRTH), Jabalpur

State: Meghalaya

NEIGRI of Health and Medical Sciences, Shillong, Meghalaya

State: Maharashtra

Indira Gandhi Government Medical College, Nagpur
Kasturba Hospital for Infectious Diseases, Mumbai

State: Manipur

J N Inst. of Med. Sciences Hospital, Imphal‐East, Manipur

State: Odisha

Regional Medical Research Center, Bhubaneswar

Union Territory: Puducherry

Jawaharlal Institute of Postgraduate Medical Education & Research, Puducherry

State: Punjab

Government Medical College, Patiala, Punjab
Government Medical College, Amritsar

State: Rajasthan

Sawai Man Singh, Jaipur
Dr. S.N Medical College, Jodhpur
Jhalawar Medical College, Jhalawar, Rajasthan
SP Med. College, Bikaner, Rajasthan

State: Tamil Nadu

King’s Institute of Preventive Medicine & Research, Chennai
Government Medical College, Theni

State: Tripura

Government Medical College, Agartala

State: Telangana

Gandhi Medical College, Secunderabad

State: Uttar Pradesh

King George’s Medical University, Lucknow
Institute of Medical Sciences, Banaras Hindu University, Varanasi
Jawaharlal Nehru Medical College, Aligarh

State: Uttarakhand

Government Medical College, Haldwani

State: West Bengal

National Institute of Cholera and Enteric Diseases, Kolkata
IPGMER, Kolkata

Explore taxdump files !

Jit — Sat, 08 Feb 2020 04:44:55 -0600

This is an extract of taxdump-readme.txt to be found at 
ftp://ftp.ncbi.nih.gov/pub/taxonomy/

The content of the archive
--------------------------

It may look like this:

delnodes.dmp
division.dmp
gencode.dmp
merged.dmp
names.dmp
nodes.dmp
readme.txt

The readme.txt file gives a brief description of the *.dmp files. These files
contain taxonomic information and are briefly described below. Each file
stores one record per line, and each line is terminated by "\t|\n"
(tab, vertical bar, and newline) characters. Each record consists of one
or more fields delimited by "\t|\t" (tab, vertical bar, and tab) characters.
A brief description of field position and meaning for each file follows.
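The record layout described above can be split with a few lines of Python. This is a minimal sketch: the function name parse_dmp_line is made up, and the sample line only illustrates the names.dmp layout.

```python
from typing import List


def parse_dmp_line(line: str) -> List[str]:
    """Split one *.dmp record into its fields."""
    line = line.rstrip("\n")
    # Drop the record terminator: a tab and a vertical bar before the newline
    if line.endswith("\t|"):
        line = line[:-2]
    # Fields are separated by tab, vertical bar, tab
    return [field.strip() for field in line.split("\t|\t")]


# Illustrative names.dmp-style record: tax_id, name_txt, unique name, name class
sample = "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n"
fields = parse_dmp_line(sample)
print(fields)
```

Empty fields, such as the unique name in this sample, come back as empty strings, so the field positions stay aligned with the descriptions below.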

nodes.dmp
---------

This file represents taxonomy nodes. The description for each node includes 
the following fields:

	tax_id					-- node id in GenBank taxonomy database
 	parent tax_id				-- parent node id in GenBank taxonomy database
 	rank					-- rank of this node (superkingdom, kingdom, ...) 
 	embl code				-- locus-name prefix; not unique
 	division id				-- see division.dmp file
 	inherited div flag  (1 or 0)		-- 1 if node inherits division from parent
 	genetic code id				-- see gencode.dmp file
 	inherited GC  flag  (1 or 0)		-- 1 if node inherits genetic code from parent
 	mitochondrial genetic code id		-- see gencode.dmp file
 	inherited MGC flag  (1 or 0)		-- 1 if node inherits mitochondrial gencode from parent
 	GenBank hidden flag (1 or 0)            -- 1 if name is suppressed in GenBank entry lineage
 	hidden subtree root flag (1 or 0)       -- 1 if this subtree has no sequence data yet
 	comments				-- free-text comments and citations
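Since every node stores its parent tax_id, the lineage of a taxon can be recovered by following parent links up to the root. The Python sketch below uses only the first two fields of each record and assumes the root node lists itself as its own parent. The three toy records use made-up ids and are truncated to three fields; real nodes.dmp lines carry all the fields listed above.

```python
from typing import Dict, Iterable, List


def split_fields(line: str) -> List[str]:
    """Split one *.dmp record into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_parents(lines: Iterable[str]) -> Dict[int, int]:
    """Map each tax_id to its parent tax_id (fields 1 and 2 of nodes.dmp)."""
    parents = {}
    for line in lines:
        fields = split_fields(line)
        parents[int(fields[0])] = int(fields[1])
    return parents


def lineage(tax_id: int, parents: Dict[int, int]) -> List[int]:
    """Return the tax_ids on the path from tax_id up to the root."""
    path = [tax_id]
    while parents[path[-1]] != path[-1]:  # assumed: the root is its own parent
        if len(path) > len(parents):
            raise ValueError("cycle detected in parent links")
        path.append(parents[path[-1]])
    return path


# Toy records with made-up ids, not real taxonomy entries
toy_nodes = [
    "1\t|\t1\t|\tno rank\t|\n",
    "10\t|\t1\t|\tgenus\t|\n",
    "100\t|\t10\t|\tspecies\t|\n",
]
parents = load_parents(toy_nodes)
print(lineage(100, parents))
```

To run this on the real file, pass an open handle to nodes.dmp into load_parents. The dictionary then holds one entry per taxonomy node.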

names.dmp
---------
Taxonomy names file has these fields:

	tax_id					-- the id of node associated with this name
	name_txt				-- name itself
	unique name				-- the unique variant of this name if name not unique
	name class				-- (synonym, common name, ...)

division.dmp
------------
Divisions file has these fields:
	division id				-- taxonomy database division id
	division cde				-- GenBank division code (three characters)
	division name				-- e.g. BCT, PLN, VRT, MAM, PRI...
	comments

gencode.dmp
-----------
Genetic codes file:

	genetic code id				-- GenBank genetic code id
	abbreviation				-- genetic code name abbreviation
	name					-- genetic code name
	cde					-- translation table for this genetic code
	starts					-- start codons for this genetic code

delnodes.dmp
------------
Deleted nodes (nodes that existed but were deleted) file field:

	tax_id					-- deleted node id

merged.dmp
----------
Merged nodes file fields:

	old_tax_id                              -- id of nodes which has been merged
	new_tax_id                              -- id of nodes which is result of merging
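Tax ids stored in older datasets may have been merged or deleted since. A small Python sketch of how delnodes.dmp and merged.dmp could be used to bring an old id up to date follows. The function names are made up, and the toy records use invented ids.

```python
from typing import Dict, Iterable, List, Optional, Set


def split_fields(line: str) -> List[str]:
    """Split one *.dmp record into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_merged(lines: Iterable[str]) -> Dict[int, int]:
    """Map old_tax_id to new_tax_id from merged.dmp records."""
    merged = {}
    for line in lines:
        old_id, new_id = split_fields(line)[:2]
        merged[int(old_id)] = int(new_id)
    return merged


def load_deleted(lines: Iterable[str]) -> Set[int]:
    """Collect the deleted tax_ids from delnodes.dmp records."""
    return {int(split_fields(line)[0]) for line in lines}


def resolve_tax_id(tax_id: int, merged: Dict[int, int], deleted: Set[int]) -> Optional[int]:
    """Return the current id for tax_id, or None if the node was deleted."""
    if tax_id in deleted:
        return None
    return merged.get(tax_id, tax_id)


# Toy records with invented ids
merged = load_merged(["5\t|\t10\t|\n"])
deleted = load_deleted(["7\t|\n"])
print(resolve_tax_id(5, merged, deleted))
```

An id that appears in neither file is returned unchanged, on the assumption that it is still current.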

Linux advantages

Rahul Agarwal — Thu, 30 Jan 2020 06:27:29 -0600

https://www.forbes.com/sites/jasonevangelho/2018/07/30/ditching-windows-heres-how-ubuntu-updates-your-pc-and-why-its-better/#7aa6fa5f7c23

https://www.forbes.com/sites/jasonevangelho/2018/07/23/5-reasons-you-should-switch-from-windows-to-linux-right-now/#70c74923777b