BOL: BioStar's blogs

Common steps for reads mapping !

BioStar — Thu, 09 Mar 2023 02:48:02 -0600

Mapping reads to a reference genome is an essential step in many types of genomic analysis, such as variant calling and gene expression analysis. Here are some general steps to follow for mapping reads to a genome:

Choose a read mapper: There are many read mappers available, such as BWA, Bowtie, and HISAT2. Choose a mapper that is appropriate for your type of data and research question.
Index the reference genome: Before mapping reads, the reference genome needs to be indexed. This involves creating an index of the genome sequence that allows the mapper to quickly find matches to the reads. Most mappers have their own indexing tools.
Prepare the read data: The reads should be in a format that is compatible with the mapper. Most mappers accept FASTQ files (often gzip-compressed), and some also accept FASTA or unaligned BAM input. Depending on the quality of the data, it may need to be filtered or trimmed before mapping.
Run the mapper: The mapper is run with the command-line interface or using a graphical user interface. The specific command depends on the mapper being used, but typically involves specifying the input data, reference genome, and output file format.
Evaluate the mapping results: After the mapping is complete, the results should be evaluated. This includes assessing the quality of the mapping, such as the mapping rate, the number of mapped reads, and the mapping quality score.
Post-processing: Depending on the analysis being performed, post-processing of the mapped reads may be necessary. This can include filtering reads based on quality, removing duplicate reads, and calling variants.
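
As a rough illustration of the evaluation step, the sketch below computes a mapping rate from SAM records. It assumes the mapper wrote standard tab-separated SAM text; the function name mapping_summary, the min_mapq cutoff of 20 and the example records are made up for illustration and do not come from any particular mapper.

```python
from typing import Iterable

# SAM FLAG bits defined by the SAM specification
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def mapping_summary(sam_lines: Iterable[str], min_mapq: int = 20) -> dict:
    """Count primary records, mapped records and records with MAPQ >= min_mapq."""
    total = mapped = confident = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, through its primary record
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
            if mapq >= min_mapq:
                confident += 1
    rate = mapped / total if total else 0.0
    return {"total": total, "mapped": mapped,
            "confident": confident, "mapping_rate": rate}


# Hypothetical records: two mapped reads, one unmapped read, one secondary hit
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t200\t5\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r1\t256\tchr2\t50\t0\t4M\t*\t0\t0\tACGT\tIIII",
]
print(mapping_summary(example))
```

For paired-end data each mate is a separate SAM record, so the counts here are per record rather than per fragment.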

Overall, mapping reads to a reference genome is a complex process that requires careful consideration of the type of data, the research question, and the specific mapper being used.

Common methods to discover tandem repeats

BioStar — Thu, 09 Mar 2023 02:40:52 -0600

Tandem repeats are DNA sequences that are repeated in a contiguous manner in the genome. These sequences are often used as genetic markers and are important in many areas of genetics and genomics research. Here are some methods for discovering tandem repeats in genomes:

Tandem Repeats Finder: Tandem Repeats Finder (TRF) is a software tool that identifies tandem repeats in DNA sequences. It is available for free download and does not require the repeat pattern or its size to be specified in advance. The tool models repeats by the percent identity and the frequency of indels between adjacent copies, and applies statistically based recognition criteria.
RepeatMasker: RepeatMasker is a program that screens DNA sequences for interspersed repeats and low-complexity regions. It works by comparing the input sequence to a library of known repeats, and its output also reports simple (tandem) repeats, so it is better suited to annotating and masking known repeats than to discovering novel tandem repeats.
PCR-based methods: Polymerase chain reaction (PCR) can be used to amplify and detect tandem repeats in genomic DNA. PCR primers are designed to flank the tandem repeat region, and amplification of the target DNA fragment can be visualized on a gel. This method can be useful for detecting novel tandem repeats and for genotyping.
Southern blotting: Southern blotting is a classic method for detecting DNA fragments in a sample. It can be used to detect tandem repeats by digesting genomic DNA with a restriction enzyme, separating the fragments by gel electrophoresis, and then probing the blot with a tandem repeat-specific probe.
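
The PCR-based approach can be illustrated in silico: the amplicon length between two flanking primers, minus the flanking sequence, divided by the repeat unit length gives the copy number. The sketch below is a minimal example under strong assumptions: primers match the template exactly, only the forward strand is searched, and all sequences and function names are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def in_silico_amplicon(template: str, forward: str, reverse: str):
    """Return the region spanned by both primers, or None if a primer is absent.

    Assumes exact primer matches on the forward strand only.
    """
    start = template.find(forward)
    if start == -1:
        return None
    rev_site = reverse_complement(reverse)  # where the reverse primer anneals
    end = template.find(rev_site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]


def estimate_copies(amplicon_len: int, flank_len: int, unit_len: int) -> float:
    """Copy number = (amplicon length - total flanking length) / unit length."""
    return (amplicon_len - flank_len) / unit_len


# Hypothetical locus: a CAG repeat with seven copies between two 10 bp flanks
left, right = "ATGCTTGACC", "GGTACTTCAA"
template = "TT" + left + "CAG" * 7 + right + "AA"
forward = left
reverse = reverse_complement(right)

amplicon = in_silico_amplicon(template, forward, reverse)
print(len(amplicon), estimate_copies(len(amplicon), len(left) + len(right), 3))
```

On a gel, the same arithmetic is applied to the fragment size estimated from the band position, so the result is only as precise as the size estimate.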

Overall, a combination of these methods can be used to comprehensively identify tandem repeats in genomes.

Chromosome breakpoint - a breakup to remember

BioStar — Tue, 07 Mar 2023 13:31:54 -0600

Chromosome breakpoint refers to the physical location where a chromosome is broken and rearranged. Chromosome breakage can occur spontaneously or be induced by environmental factors such as radiation, chemicals, or viruses. The rearrangement of genetic material resulting from a chromosome breakpoint can have important consequences, including the development of genetic diseases, chromosomal abnormalities, or cancer.

Chromosome breakpoints can occur in two ways: interstitial or terminal. Interstitial breakpoints occur within the chromosome, while terminal breakpoints occur at the end of the chromosome. Terminal breakpoints can lead to the loss of genetic material, whereas interstitial breakpoints can result in the duplication or deletion of genetic material.

Chromosome breakpoints can be detected using a variety of techniques, including cytogenetic analysis, fluorescence in situ hybridization (FISH), and molecular methods such as polymerase chain reaction (PCR) and next-generation sequencing (NGS). These techniques can also help identify the exact location of the breakpoint and the nature of the rearrangement, such as translocations, inversions, deletions, or duplications.

Translocations are one of the most common types of chromosome rearrangements caused by breakpoints. In a translocation, genetic material is exchanged between two different chromosomes, resulting in a balanced or unbalanced distribution of genetic material. Unbalanced translocations can cause genetic diseases or developmental abnormalities, while balanced translocations can be inherited without any apparent phenotypic effects.

Inversions occur when a chromosome segment is inverted, resulting in a change in the order of genetic material. Inversions can be pericentric, involving the centromere, or paracentric, not involving the centromere. Inversions can cause genetic diseases or phenotypic effects if they disrupt the function of essential genes or regulatory elements.
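
The pericentric/paracentric distinction reduces to whether the two breakpoints fall on opposite sides of the centromere. A minimal sketch, assuming the centromere is represented by a single coordinate and using made-up positions:

```python
def classify_inversion(start: int, end: int, centromere: int) -> str:
    """Classify an inversion by its breakpoints relative to the centromere.

    start and end are the two breakpoint coordinates; centromere is a single
    representative position (a simplification, real centromeres span a region).
    """
    if start > end:
        start, end = end, start
    if start < centromere < end:
        return "pericentric"  # inverted segment includes the centromere
    return "paracentric"      # both breakpoints lie on the same arm


# Hypothetical coordinates on a chromosome with its centromere at position 50
print(classify_inversion(10, 90, 50))
print(classify_inversion(60, 90, 50))
```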

Deletions and duplications are caused by interstitial breakpoints that result in the loss or gain of genetic material. Deletions can cause genetic diseases or developmental abnormalities if they involve essential genes or regulatory elements. Duplications can also have phenotypic effects, depending on the location and size of the duplicated segment.

Chromosome breakpoints can also be involved in the formation of complex chromosomal rearrangements, such as ring chromosomes or dicentric chromosomes. These complex rearrangements can have important clinical implications, as they can cause genetic diseases or cancer.

In conclusion, chromosome breakpoints are important genetic events that can lead to the rearrangement of genetic material and have important clinical implications. The detection and characterization of chromosome breakpoints using cytogenetic, molecular, and genomic methods are essential for the diagnosis, prognosis, and treatment of genetic diseases and cancer. Further research is needed to understand the molecular mechanisms underlying chromosome breakage and to develop new therapies targeting these events.

Bioinformatics tools to explore SSRs in genomes !

BioStar — Tue, 07 Mar 2023 13:06:15 -0600

There are several bioinformatics tools that can be used to explore Simple Sequence Repeats (SSRs), which are also known as microsatellites. Here are a few examples:

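MISA: MISA (MIcroSAtellite identification tool) is available as a command-line Perl script and as a web application (MISA-web). It can be used to analyze nucleotide sequences from various organisms and identifies perfect SSRs as well as compound SSRs, in which neighbouring repeats are interrupted by a limited number of bases.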
SSR Locator: SSR Locator is a web-based tool that identifies SSRs in both DNA and RNA sequences. It can identify perfect, compound, and imperfect SSRs, and can also filter out low complexity regions.
SciRoKo: SciRoKo is a software tool that can identify SSRs in DNA sequences. It can be used to analyze genomic and transcriptomic sequences from various organisms and can identify perfect, compound, and imperfect SSRs.
Primer3: Primer3 is a general-purpose PCR primer design program, available as a command-line tool and through web interfaces. It does not detect SSRs itself, but it is commonly used downstream of SSR finders to design primers in the sequence flanking each repeat.
QDD: QDD (Quick Detection of Duplication) is a software tool that can identify SSRs in DNA sequences and can also identify duplicate loci. It can be used to analyze genomic and transcriptomic sequences from various organisms.
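
To make the idea of a perfect SSR concrete, here is a minimal regular-expression sketch. It is not the algorithm of any of the tools above; the minimum repeat counts per motif length are illustrative thresholds, and the function name and test sequence are made up.

```python
import re

# Illustrative thresholds: motif length -> minimum number of copies
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_perfect_ssrs(seq: str, min_repeats: dict = MIN_REPEATS) -> list:
    """Return (start, end, motif, copies) for perfect SSRs, 0-based, end-exclusive."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # One motif of unit_len bases followed by at least min_rep - 1 more copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit ("AA", "ATAT", ...)
            if any(motif == motif[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            copies = len(match.group(0)) // unit_len
            hits.append((match.start(), match.end(), motif, copies))
    return sorted(hits)


# Hypothetical sequence with a mono-, a di- and a trinucleotide repeat
sequence = "GGC" + "A" * 12 + "CGG" + "AT" * 7 + "GCC" + "CAG" * 5 + "TTG"
for hit in find_perfect_ssrs(sequence):
    print(hit)
```

Compound and imperfect SSRs need extra logic (merging nearby hits, allowing mismatches), which is where the dedicated tools differ from one another.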

These are just a few examples of the many bioinformatics tools available for exploring SSRs. Depending on your specific needs and research questions, you may find that other tools are more appropriate for your analysis.

Bioinformatics tools for telomere to telomere assembly !

BioStar — Tue, 17 Aug 2021 13:17:09 -0500

● Merfin – k-mer-based assembly and variant calling evaluation for improved consensus accuracy (Arang Rhie)
● PanGenie – algorithm that leverages a pangenome reference built from haplotype-resolved genome assemblies in conjunction with k-mer count information from raw, short-read sequencing data to genotype a wide spectrum of genetic variation (Tobias Marschall)
● SQANTI3 – an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline (Rocío Amorín de Hegedüs @rocioadh)
● tama (Transcriptome Annotation by Modular Algorithms) – software designed for processing Iso-Seq data and other long-read transcriptome data (Richard Kuo @GenomeRIK)
● pbaa (PacBio Amplicon Analysis) – separates complex mixtures of amplicon targets from genomic samples to cluster and generate high-quality consensus sequences from HiFi reads (Zev Kronenberg @zevkronenberg)
● bellerophon – analyzes MHC typing and other low-complexity gene amplicon data; performs allele calling while detecting polymorphic sites within the sequences and removing potential chimeric sequence variants (Yuanyuan Cheng @Yuanyuan929)
● svpack – tools for filtering, comparing, and annotating structural variant (SV) calls in VCF format (Aaron Wenger)
● JumboDB – tool for de Bruijn graph construction (Anton Bankevich @AntonBankevich)
● uLTRA – tool for splice alignment of long transcriptomic reads to a genome, guided by a database of exon annotations. (Kristoffer Sahlin @krsahlin)
● LeafGo – workflow to rapidly produce high-quality de novo plant genomes (Luca Ermini @ermini_luca)

Reference:

https://www.pacb.com/blog/young-investigators-share-stellar-science-career-advice-and-bioinformatics-tools-at-smrt-leiden-2021/

Protocol for De novo Genome Assembly using Illumina Reads

BioStar — Sat, 16 Jan 2021 21:42:11 -0600

In this protocol, we address and describe the de novo assembly method for small to medium-sized genomes.

What is de novo genome assembly?
Genome assembly is the process of taking a large number of short DNA sequences and putting them back together to create a representation of the original chromosomes from which the DNA originated. De novo genome assembly assumes no prior knowledge of the source DNA sequence length, layout or composition. In a genome sequencing experiment, the DNA of the target organism is broken up into millions of small pieces and read on a sequencing machine. Depending on the sequencing system used, these "reads" range from 20 to 1000 nucleotide base pairs (bp) in length. Illumina short-read sequencing typically produces reads of 36 - 150 bp. These reads can be either “single ended” as described above or “paired end.”

Why genome assembly?
Determining the DNA sequence of an organism is useful in basic research into why and how it lives, as well as in applied subjects. Because DNA is central to living things, knowledge of a DNA sequence can be useful in virtually any biological research. For example, in medicine it can be used to identify, diagnose and eventually develop treatments for genetic disorders. Similarly, research into pathogens can lead to treatments for infectious diseases.

Raw NGS data
Reads can be stored as plain text in a FASTA file, or in a FASTQ file together with their quality scores. FASTQ is the most common read file format since this is what the Illumina sequencing pipeline produces. It is the format assumed in the rest of this protocol.
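
For orientation, a FASTQ record is four lines: a header starting with "@", the sequence, a separator starting with "+", and one quality character per base. The sketch below reads such records and decodes the qualities, assuming the Phred+33 encoding used by current Illumina pipelines and no line wrapping; the function names and example reads are made up.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if (not header.startswith("@") or not plus.startswith("+")
                or len(seq) != len(qual)):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:].rstrip("\n"), seq, qual


def phred_scores(quality: str, offset: int = 33) -> list:
    """Decode a quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


# Two hypothetical reads
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!5?I\n"
for name, seq, qual in read_fastq(io.StringIO(fastq_text)):
    scores = phred_scores(qual)
    print(name, seq, scores, sum(scores) / len(scores))
```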

In a nutshell the protocol:
Obtain the sequence read file(s) from the sequencing machine(s).
Look at the reads - get an idea of what you have and what the quality is like.
If required, clean up and quality-trim the raw data.
Choose an appropriate set of parameters for the assembly.
Assemble the data into contigs/scaffolds.
Examine the output of the assembly and assess its quality.

Read Quality Control:
Check the quality with FastQC.
Script
https://bioinformaticsonline.com/snippets/view/42540/install-fastqc-using-conda

Quality trimming/cleanup of read files.
This step trims adapters, barcodes and other contaminants from the reads.
Script
https://bioinformaticsonline.com/snippets/view/42542/trimmomatic-command
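
To show what quality trimming does to a single read, here is a simplified sliding-window trimmer. It is only in the spirit of Trimmomatic's SLIDINGWINDOW step and is not its implementation: this sketch cuts the read at the start of the first window whose mean quality falls below the threshold. The window size of 4, the threshold of 20 and the example read are illustrative assumptions.

```python
def sliding_window_trim(seq: str, qual: str, window: int = 4,
                        min_avg: float = 20, offset: int = 33):
    """Cut a read at the first window whose mean Phred quality is below min_avg."""
    scores = [ord(char) - offset for char in qual]  # Phred+33 assumed
    if not scores:
        return seq, qual
    cut = len(seq)
    # Reads shorter than the window are judged on a single, shorter window
    for i in range(max(len(scores) - window + 1, 1)):
        current = scores[i:i + window]
        if sum(current) / len(current) < min_avg:
            cut = i
            break
    return seq[:cut], qual[:cut]


# Hypothetical read: six high-quality bases followed by four poor ones
print(sliding_window_trim("ACGTACGTAC", "IIIIII####"))  # ('ACGTA', 'IIIII')
```

For paired-end data, both mates have to be trimmed and filtered together so that the two files stay in sync, which dedicated trimmers take care of.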

Genome Assembly:
The aim of this section of the protocol is to describe the process of assembling the quality-trimmed reads into draft contigs.

spades.py -1 illumina_R1.fastq.gz -2 illumina_R2.fastq.gz --careful --cov-cutoff auto -o result_of_spades_assembly_all_illumina

A wide range of short-read assemblers is available, each with its own strengths and weaknesses.
Other available assemblers include:
Velvet
SOAP-denovo
MIRA
ALLPATHS

Now that you have a draft set of contigs, the next step is to assess their quality and decide what to do with them for the rest of the study. A few things to note about the contigs you have just produced: they are draft contigs, and they may contain mis-assemblies.

Mis-assembly checking and assembly metric tools:
QUAST - Quality assessment tool for genome assembly http://bioinf.spbau.ru/quast
Mauve assembly metrics - http://code.google.com/p/ngopt/wiki/How_To_Score_Genome_Assemblies_with_Mauve
InGAP-SV - https://sites.google.com/site/nextgengenomics/ingap and http://ingap.sourceforge.net/
inGAP is also useful for finding structural variants between genomes from read mappings.
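
One of the basic contiguity metrics reported by assembly assessment tools is N50: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the assembly size. A minimal sketch, with hypothetical function names and made-up contig lengths:

```python
def contig_lengths(fasta_lines) -> list:
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0  # start a new record
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths) -> int:
    """Length of the contig at which the cumulative sum reaches half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


print(n50([100, 80, 60, 40, 20]))  # 80
print(n50(contig_lengths([">a", "ACGT", "AC", ">b", "GG"])))  # 6
```

N50 only describes contiguity. It says nothing about correctness, which is why mis-assembly checks against a reference or read mappings are still needed.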

Genome finishing tools:
Semi-automated gap fillers:
Gap filler - http://www.baseclear.com/landingpages/basetools-a-wide-range-of-bioinformatics-solutions/gapfiller/

IMAGE (V2) - http://sourceforge.net/apps/mediawiki/image2/index.php?title=Main_Page

Genome visualisers and editors:
Artemis - http://www.sanger.ac.uk/resources/software/artemis/
IGV - http://www.broadinstitute.org/igv/

Automated and semi automated annotation tools:
Prokka - https://github.com/tseemann/prokka
RAST - http://www.nmpdr.org/FIG/wiki/view.cgi/FIG/RapidAnnotationServer
JCVI Annotation Service - http://www.jcvi.org/cms/research/projects/annotation-service/

Frequently used commands for the analysis are at:

https://bioinformaticsonline.com/blog/view/38765/list-of-tools-frequently-used-while-genome-assembly
https://bioinformaticsonline.com/pages/view/42275/frequent-parameters-for-bioinformatics-tools

10 NGS services companies around the globe !

BioStar — Sun, 22 Nov 2020 23:56:17 -0600

The global NGS services market is expected to reach USD 13.1 billion by 2025. Here are the top 10 NGS services companies to look for –

1. Illumina, Inc. (U.S.)

Illumina, Inc. was founded in 1998 and is headquartered in San Diego, U.S. Illumina, Inc. is one of the leading players in DNA sequencing and array-based technologies, serving customers in the research, clinical, and applied markets. The company offers products for applications in the life sciences, oncology, reproductive health, agriculture, and other emerging segments. The company serves government laboratories, genomic research centers, academic institutions as well as pharmaceutical, biotechnology, agrigenomics, commercial molecular diagnostics laboratories and consumer genomics companies. Illumina, Inc. has its geographic presence in North America, Europe, Latin America, Asia-Pacific, and others.

2. QIAGEN N.V. (Netherlands)

QIAGEN N.V. was incorporated in 1986 and is headquartered in Venlo, The Netherlands. The Company is engaged in providing Sample to Insight solutions that transform biological samples into molecular insights. QIAGEN provides its workflow to customers in molecular diagnostics, assay technologies, bioservices and automation systems. The company’s genome services are suitable for custom/tailored projects that allow access to genomic sequence information. The Company markets its products in more than 100 countries across the Americas, Europe, Asia, Australia, and the Middle East & Africa through its subsidiaries and channel partners.

3. PerkinElmer, Inc. (U.S.)

PerkinElmer, Inc. was founded in 1947 and is headquartered in Waltham, Massachusetts, U.S. PerkinElmer, Inc. offers products, services and solutions for the diagnostics, food, environmental, industrial, life sciences research and laboratory services markets. The company offers comprehensive genetic testing solutions that help to provide insight into the complex nature of rare and inherited diseases. Some of the subsidiaries of the company are Caliper Life Sciences, Improvision, Viacell Inc., ViaCord LLC, among many others. The company has its facilities located in Europe (France, Germany, and Belgium), the U.S. and Asia (China, India, and Japan).

4. Eurofins Scientific SE (Luxembourg)

Eurofins Scientific SE was founded in 1987 and is headquartered in Luxembourg, Europe. The company offers a portfolio of over 130,000 analytical methods and more than 150 million assays performed each year to establish the safety, identity, composition, authenticity, origin, traceability, and purity of biological substances and products, as well as carry out human diagnostic services. The company has its geographic presence across 39 countries in Europe, North and South America, and Asia-Pacific.

5. GATC Biotech AG (Germany)

GATC Biotech AG was founded in 1990 and is headquartered in Constance, Germany. The company provides DNA and RNA sequencing and bioservices solutions to academic and industrial customers. It also provides next generation sequencing services including genomes, targeted (re)-sequencing, human sample sequencing, transcriptomes, metagenomes, regulomes, pre-sequencing, NGS barcode labels, and next generation sequencing technologies; and bioservices, including tools, pipelines and workflows, compute resources, data analysis reports, and case studies. GATC Biotech AG operates as a subsidiary of Eurofins Scientific SE. It offers its products through distributors in Italy, Japan, Portugal, Spain, and the Czech Republic.

6. Macrogen, Inc. (South Korea)

Macrogen, Inc. was founded in 1997 and is headquartered in Seoul, South Korea. Macrogen, Inc. provides next generation sequencing services such as whole genome, de novo, exome, targeted, transcriptomics, metagenome, and epigenome sequencing. The company also provides a variety of services such as oligo synthesis, database construction, genome research, and bioservices analysis system consulting services. Macrogen, Inc. provides genome research services in Korea and internationally.

7. Genotypic Technology Pvt. Ltd. (India)

Genotypic Technology Pvt. Ltd. was incorporated in 1998 and is headquartered in Bangalore, India. Genotypic Technology is the first Genomics service provider in India providing Microarray, Next Generation Sequencing (NGS), Bioservices and solutions to domestic/ international pharma, biotech companies and academia. The company provides its services for protocol optimization, probe designing, array layouts, project designing, and nucleic acid analysis to in-depth analysis. Genotypic Technology has its geographic presence in North America, Europe, Asia Pacific, Middle East & Africa, and Latin America.

8. GENEWIZ, Inc. (U.S.)

GENEWIZ, Inc. was founded in 1999 and is headquartered in South Plainfield, New Jersey, U.S. The company is a leading provider of research services in the field of Next Generation Sequencing, Sanger DNA sequencing, sequencing of bacteria and phage, gene synthesis, DNA cloning, genomics including mutation analysis, single nucleotide polymorphism, and bioservices. GENEWIZ, Inc. has its geographic presence in the U.S., China, Germany, France, Japan, and the U.K.

9. Beijing Genomics Institute (China)

Beijing Genomics Institute (BGI) is the world’s largest genomics organization and non-profit research institution that was founded in 1999 and is headquartered in Shenzhen, China. The Company provides a wide range of commercial next generation sequencing services and genetic tests for medical institutions, agricultural and environmental applications. The Company operates all across the globe through its subsidiaries, namely, BGI China (Mainland), BGI Asia Pacific, BGI Americas (North and South America) and BGI Europe (Europe and Africa).

10. SciGenom Labs Pvt. Ltd (India)

SciGenom Labs Pvt. Ltd was founded in 2010 and is headquartered in Cochin, India with offices in Chennai & Hyderabad in India, and San Francisco in the U.S. It is a Genomics R&D services company that provides genomic sequencing and NGS services to life sciences and healthcare businesses globally as well as academic and government institutions in India.

Popular mentions – MedGenome (India), DNA Link, Inc. (South Korea), Otogenetics Corporation (U.S.), Novogene Corporation (China), LGC Limited (U.K.), CD Genomics (U.S.), SeqLL, LLC (U.S.)

Perl one-liner for beginners !

BioStar — Fri, 24 Jul 2020 05:58:28 -0500

I often use the following arguments to perl:

-e Makes the line of code be executed instead of a script
-n Forces your line to be called in a loop. Allows you to take lines from the diamond operator (or stdin)
-p Forces your line to be called in a loop. Prints $_ at the end

This counts the number of quotation marks in each line and prints it

perl -ne '$cnt = tr/"//;print "$cnt\n"' inputFileName.txt

Adds string to each line, followed by tab

perl -pe 's/(.*)/string\t$1/' inFile > outFile

Append a new line to each line

perl -pe 's/$/\n/' all.sent.classOnly > all.sent.classOnly.sep

Replace all occurrences of pattern1 (e.g. [0-9]) with pattern2

perl -p -i.bak -w -e 's/pattern1/pattern2/g' inputFile

Go through file and only print words that do not have any uppercase letters.

perl -ne 'print unless m/[A-Z]/' allWords.txt > allWordsOnlyLowercase.txt

Go through file, split line at each space and print words one per line.

perl -lne 'print join("\n", split(/ /, $_))' someText.txt > wordsPerLine.txt

Delete every character that is not a letter, white space or line end (replace with nothing)

perl -pe 's/[^a-zA-Z\s]+//g' text_withSpecial.txt > text_lettersOnly.txt

Convert all uppercase letters to lowercase

perl -pe 'tr/A-Z/a-z/' textWithUpperCase.txt > textwithoutuppercase.txt

Print only the second column of the data when using tab as a separator

perl -lne '@F = split("\t", $_); print $F[1]' columnFileWithTabs.txt > justSecondColumn.txt

One-Liner: Sort lines by their length

perl -e 'print sort {length $a <=> length $b} <>' textFile

One-Liner: Print second column, unless it contains a number

perl -lane 'print $F[1] unless $F[1] =~ m/[0-9]/' wordCounts.txt

List of tools frequently used while genome assembly

BioStar — Tue, 22 Jan 2019 09:39:02 -0600

List of tools frequently used while genome assembly:

I have used the following assemblers

Spades (v. 3.10.1)
CANU (v. 1.6)
Unicycler (v. 0.4.1)
Miniasm (v. 0.2-r137-dirty)

I have used the following mappers

minimap2 (v. 2.0rc1-r232)
minimap (v. 0.2-r124-dirty)
bwa (v. 0.7.12-r1039)

I have used the following polishing tools

Racon (v. not available)
Pilon (v. 1.18)
Nanopolish (v. 0.8.3)

I have used the following tools to assess genome assembly characteristics

ANI.pl (https://github.com/chjp/ANI)
CheckM (v. 1.0.7)
Prokka (v. 1.12)
QUAST (v. 2.3)
mummer (v. not available)

If you have any ideas or superior tools we have missed please let us know in the comments.