BOL: Related items

Salzberg lab

Mon, 15 May 2017 05:14:01 -0500

We are a computational biology lab that develops novel methods for analysis of DNA and RNA sequences. Our research includes software for aligning and assembling RNA-seq data, whole-genome assembly, and microbiome analysis. We work closely with biomedical scientists to apply these methods to current problems arising in a broad spectrum of biological and medical research areas. We’re also part of the Center for Computational Biology, a group of 20+ faculty members and their labs at Johns Hopkins working on computational, statistical, and mathematical methods that can turn massive genomic data sets into biologically and clinically useful information.

https://salzberg-lab.org/

SIMBA: a web tool for managing bacterial genome assembly generated by Ion PGM sequencing technology

Abhimanyu Singh — Tue, 23 May 2017 05:28:56 -0500

SIMBA, SImple Manager for Bacterial Assemblies, is a Web interface for managing assembly projects of bacterial genomes. SIMBA was created to help bioinformaticians assemble bacterial genomes sequenced with Next-Generation Sequencing (NGS) platforms quickly, easily and effectively. SIMBA is also an open-source tool, i.e., it can be freely downloaded, shared and modified.

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1344-7

Address of the bookmark: http://ufmg-simba.sourceforge.net/

Phased Human Genome Assembly !

Rahul Nayak — Mon, 08 Oct 2018 09:10:54 -0500

The new publicly available assembly (PacBio HG00733) has the fewest gaps of any human genome assembly, with more than half of the genome contained in gapless sequence at least 27 Mb long. The primary contig assembly is 2.89 Gb long and consists of 865 contigs that were assembled with PacBio data generated with the company’s Sequel® System. Using the FALCON-Unzip assembler, maternal and paternal haplotypes were resolved over more than 80% of the genome. Maternal and paternal haplotype blocks were then further phased using Hi-C technology and the FALCON-Phase method developed in collaboration with Phase Genomics. The genome was then de novo scaffolded using Phase Genomics’ Proximo Hi-C platform, resulting in the first chromosome-scale diploid assembly of a single individual accomplished with only two technologies. More specific details about the assembly are included on the PacBio blog.

The data are available under the following NCBI accession IDs: BioProject (PRJNA483067), assembly (RBJD00000000), and sequence data (SRP155659).

Additional Resources

Interactive map showcasing global initiatives underway to generate reference-quality human genome assemblies for diverse populations
BioReport Podcast on the value of ethnic-specific reference genomes
Nature Reviews Genetics paper from NHGRI: Prioritizing diversity in human genomics research
Article in The Journal of Precision Medicine: “Minority Report – Ethnic Diversity and the Real Promise for Precision Medicine”
Article in Bio-IT World: “Genomic Data Standards Are a Necessity”
NHGRI Project Award: High Quality Human and Non-Human Primate Genome Assemblies

More details are available on the PacBio website:

Blog post: Data Release: Highest-Quality, Most Contiguous Individual Human Genome Assembly to Date
Blog post: For Reference-Grade Human Genome Assemblies, SMRT Sequencing Yields Optimal Results
Webinar: Assembling High-Quality Human Reference Genomes for Global Populations
FALCON-Phase press release and article preprint
PacBio research focus webpage about Human Population Genetics

Ref: https://stockguru.com/2018/10/08/pacific-biosciences-releases-highest-quality-most-contiguous-individual-human-genome-assembly-to-date/

RaGOO: Fast Reference-Guided Scaffolding of Genome Assembly Contigs

BioJoker — Wed, 17 Apr 2019 19:45:22 -0500

Alonge M, Soyk S, Ramakrishnan S, Wang X, Goodwin S, Sedlazeck FJ, Lippman ZB, Schatz MC: Fast and accurate reference-guided scaffolding of draft genomes. bioRxiv 2019.

RaGOO is a tool for coalescing genome assembly contigs into pseudochromosomes via minimap2 alignments to a closely related reference genome. The tool focuses on practicality and therefore has the following features:

Good performance. On a MacBook Pro using Arabidopsis data, pseudochromosome construction takes less than a minute and the whole pipeline with SV calling takes ~2 minutes.
Intact ordering and orienting of contigs.
Chimeric contig correction
GFF lift-over
Structural variant calling with an integrated version of Assemblytics
Confidence scores associated with the grouping, localization, and orientation for each contig.

Address of the bookmark: https://github.com/malonge/RaGOO

MitoFinder

Neel — Tue, 29 Aug 2023 02:13:01 -0500

Allio, R., Schomaker-Bastos, A., Romiguier, J., Prosdocimi, F., Nabholz, B., & Delsuc, F. (2020) Mol Ecol Resour. 20, 892-905.

MitoFinder is a pipeline to assemble mitochondrial genomes and annotate mitochondrial genes from trimmed read sequencing data.

MitoFinder is also designed to find and annotate mitochondrial sequences in existing genomic assemblies (generated from HiFi/PacBio/Nanopore/Illumina sequencing data...).

MitoFinder is distributed under the license.

Address of the bookmark: https://github.com/RemiAllio/MitoFinder

Peregrine & SHIMMER Genome Assembly Toolkit

Abhi — Thu, 16 Dec 2021 02:50:19 -0600

Peregrine is a fast genome assembler for accurate long reads (length > 10 kb, accuracy > 99%). It can assemble a human genome from 30x reads within 20 CPU hours, from reads to polished consensus. It uses Sparse HIerarchical MiniMizER (SHIMMER) for fast read-to-read overlapping without the quadratic comparisons used in other OLC assemblers.
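To give a feel for the minimizer idea that SHIMMER builds on, here is a minimal sketch of plain window minimizers in Python. It is not Peregrine's hierarchical scheme: the function name, the parameters k and w, and the use of lexicographic order (real tools usually order k-mers by a hash) are all assumptions made for illustration.

```python
def minimizers(seq, k=5, w=4):
    """Return sorted (position, k-mer) pairs of window minimizers.

    Every window of w consecutive k-mers contributes its smallest k-mer
    (lexicographic order, leftmost on ties). Illustrative only; this is
    plain minimizer sampling, not the SHIMMER hierarchy.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


# Two reads that overlap share minimizer k-mers, so candidate overlaps can
# be found by looking up shared minimizers instead of comparing all pairs.
read_a = "ACGTACGTAC"
print(minimizers(read_a, k=3, w=3))
```

Because only a subset of k-mers is kept per read, indexing reads by their minimizers is one way to avoid all-against-all comparisons.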

Address of the bookmark: https://github.com/cschin/Peregrine

VCF Compare !

Rahul Nayak — Wed, 19 Jan 2022 10:30:14 -0600

Compare two BWA mapping methods with the online hg18-mapped data.

We first perform a rapid inspection of the different BAM files using samtools flagstat. Illumina provided chr21 read mappings obtained with their GA IIx deep sequencing platform (<ftp://webdata:webdata@ussd-ftp.illumina.com/Data/SequencingRuns/NA18507_GAIIx_100_chr21.bam>, aligned to the b36/hg18 reference genome).

Address of the bookmark: https://wiki.bits.vib.be/index.php/NGS_Exercise.6#compare_aln_.26_mem_results_with_vcf-compare

Steps to find all the repeats in the genome !

Neel — Thu, 31 Aug 2023 02:43:28 -0500

To find repeats of length 2 to 9 in a genome, you can annotate the genome with RepeatMasker and then filter its output by length with a script. RepeatMasker has no "-length" option, so the length range is applied when parsing the output rather than on the command line. Here's a step-by-step guide:

Install RepeatMasker: First, you need to install RepeatMasker on your system. You can download it from the RepeatMasker website.

Prepare the genome sequence: Make sure you have the genome sequence in FASTA format. Let's assume the file is named "genome.fasta".

Run RepeatMasker:

./RepeatMasker -pa <threads> -nolow -norna -no_is -div <divergence> -lib RepeatMaskerLib.embl -gff -xsmall -small -poly -species <species> -dir <output_dir> genome.fasta

Replace the following placeholders with appropriate values:

<threads>: The number of processors/threads you want to use for parallel processing.
<divergence>: The maximum percent divergence from the consensus sequence; only repeats below this value are masked.
<species>: The name of the species you are analyzing.
<output_dir>: The directory where you want the output files to be saved.

Analyze the output: RepeatMasker will generate several output files, including a .out file. You can parse this file to extract the information you need. There is a Perl tool called "one_code_to_find_them_all.pl" that can help you parse RepeatMasker output files. You can download it from the source provided.

Use the provided Perl script: Once you have the "one_code_to_find_them_all.pl" script, you can run it to conveniently parse the RepeatMasker output files. Here's an example of how to use it:

perl one_code_to_find_them_all.pl --rm <rm_out_file> --length <length_file>

Replace <rm_out_file> with the path to your RepeatMasker .out file, and <length_file> with the path to a file containing the lengths of the reference elements.

This script will generate several output files, including .log.txt and .copynumber.csv, which contain quantitative information about the identified repeat elements.

Remember to adjust the parameters and options according to your specific needs and the characteristics of your genome.
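If you only need a length filter on the RepeatMasker annotations, a small script over the .out file may be enough. The sketch below is one possible way to do it in Python, assuming the usual whitespace-separated .out layout (score, divergence, deletions, insertions, query name, begin, end, ...) and assuming that "length" means the span of the hit in the query sequence. The function name and the sample rows are made up for illustration.

```python
def filter_rm_out(lines, min_len=2, max_len=9):
    """Yield (query, begin, end, repeat_name, repeat_class) for hits whose
    span in the query sequence is between min_len and max_len bases."""
    for line in lines:
        fields = line.split()
        # Data rows start with an integer Smith-Waterman score; header and
        # blank lines do not, so they are skipped here.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        if min_len <= end - begin + 1 <= max_len:
            yield fields[4], begin, end, fields[9], fields[10]


# Hypothetical rows in .out layout, used only to show the call.
example = [
    "   SW   perc perc perc  query     position in query     matching  repeat",
    "score   div. del. ins.  sequence  begin end   (left)    repeat    class/family",
    "",
    "   21    0.0  0.0  0.0  chr1  100  106  (9894) +  (AT)n  Simple_repeat  1  7  (0)  1",
    "  463    1.3  0.6  1.7  chr1  500  967  (9033) C  L1     LINE/L1  (10)  468  1  2",
]
for hit in filter_rm_out(example):
    print(hit)
```

For a real run, pass an open file handle for the .out file instead of the example list.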

Steps to find palindrome in genomes !

BioStar — Thu, 09 Mar 2023 02:56:54 -0600

In DNA, palindromes are sequences that read the same 5' to 3' on both strands, i.e., sequences that equal their own reverse complement (for example, GAATTC). They can be present in genomes and have various biological functions. Here are some methods for discovering palindromes in genomes:

Direct sequence search: One of the simplest ways to discover palindromes is to search the genome sequence directly for palindromic sequences using pattern matching tools, such as regular expressions or string algorithms. This approach can be useful for discovering simple palindromes, but may miss more complex palindromic structures.
Dot plot analysis: Dot plot analysis is a graphical method that can be used to identify palindromic regions in a genome. It involves plotting the genome sequence against itself, including reverse-complement matches, and examining the diagonal patterns that emerge. Palindromic regions appear as short lines running perpendicular to the main diagonal.
Restriction enzyme analysis: Some restriction enzymes, such as EcoRI and HindIII, recognize palindromic sequences and cleave DNA at these sites. By digesting the genome with these enzymes and examining the resulting fragments, palindromic regions can be identified.
Long-read sequencing: Long-read sequencing technologies, such as PacBio and Oxford Nanopore, can generate reads that span entire palindromic regions. By mapping these reads to the genome, palindromic regions can be identified and characterized.
Comparative genomics: Comparing the genomes of related species can also reveal palindromic regions that are conserved across evolutionarily divergent lineages. This approach can help identify functional palindromes that are under selective pressure.

Overall, the discovery of palindromic sequences in genomes can be accomplished using a variety of methods, each with their own advantages and limitations. A combination of these methods can provide a comprehensive understanding of the palindromic landscape of a genome.
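As an illustration of the direct sequence search, the following Python sketch scans a sequence for windows that equal their own reverse complement. It is a brute-force example under stated assumptions: the function names and the default length limits are hypothetical, and only exact palindromes without a spacer are reported.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_palindromes(seq, min_len=4, max_len=12):
    """Return sorted (start, sequence) pairs of reverse-complement palindromes.

    Only even lengths are tested, because a window of odd length cannot
    equal its own reverse complement. Windows containing characters other
    than A, C, G, T (such as N) are skipped.
    """
    seq = seq.upper()
    hits = []
    first_even = min_len + (min_len % 2)
    for length in range(first_even, max_len + 1, 2):
        for start in range(len(seq) - length + 1):
            window = seq[start:start + length]
            if set(window) <= set("ACGT") and window == reverse_complement(window):
                hits.append((start, window))
    return sorted(hits)


# The EcoRI site GAATTC embedded in a longer palindromic stretch.
print(find_palindromes("TTGAATTCAA", min_len=6, max_len=6))
```

This exhaustive scan is fine for short sequences or small length ranges; for whole genomes a dedicated tool or an indexed approach would be more practical.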

Step-by-Step Guide to Detect piRNAs Using Bioinformatics

Abhi — Fri, 13 Dec 2024 11:41:46 -0600

Piwi-interacting RNAs (piRNAs) are a class of small non-coding RNAs that play crucial roles in silencing transposable elements and regulating gene expression, particularly in germline cells. Detecting piRNAs involves identifying their unique characteristics, such as size, sequence motifs, and association with Piwi proteins, from high-throughput RNA sequencing data.

This blog provides a comprehensive step-by-step guide to detect piRNAs using bioinformatics tools and workflows.

Step 1: Prepare Your Data

Obtain RNA Sequencing Data
Acquire raw small RNA-seq data in FASTQ format. Datasets can be sourced from repositories like NCBI SRA, EMBL-EBI, or specific small RNA sequencing projects.
Quality Control (QC)
Use FastQC to assess the quality of raw reads:

fastqc reads.fastq

Evaluate the per-base quality, adapter content, and overrepresented sequences.
Trimming and Adapter Removal
Use tools like Cutadapt or Trim Galore! to remove adapters and low-quality bases:

cutadapt -a TGGAATTCTCGGGTGCCAAGG -o trimmed_reads.fastq reads.fastq

Ensure the remaining reads are of high quality for downstream analysis.

Step 2: Map Reads to the Genome

Mapping reads to the reference genome is crucial for identifying piRNA loci.

Reference Genome Preparation
Download the genome assembly of your organism from databases like Ensembl, UCSC Genome Browser, or NCBI.
Align Reads
Use Bowtie or STAR for small RNA alignment:

bowtie -v 1 -k 1 --best genome_index trimmed_reads.fastq -S aligned_reads.sam
- -v 1: Allows at most one mismatch.
- -k 1 with --best: Reports a single alignment per read, chosen as the best one.
- -S: Writes the output in SAM format.
Convert SAM to BAM
Convert and sort alignments using SAMtools:

samtools view -Sb aligned_reads.sam | samtools sort -o sorted_reads.bam

Step 3: Identify Small RNAs

piRNAs are characterized by their size (24–32 nt) and strand bias.

Extract Reads by Size
Use tools like BEDtools or custom scripts to filter reads between 24 and 32 nt:

bedtools bamtofastq -i sorted_reads.bam -fq all_reads.fastq
seqkit seq -m 24 -M 32 all_reads.fastq > piRNA_size_reads.fastq
Check for Sequence Bias
piRNAs often have a strong bias for a uridine at the 5’ end (1U bias). Use tools like WebLogo to visualize sequence motifs.
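If you prefer a custom script for the size filter and the 1U check, the sketch below shows one way to do both in plain Python. It assumes an uncompressed FASTQ with exactly four lines per record, and it counts T at the first position because sequenced reads report uridine as T. The function names and the demo reads are hypothetical.

```python
import io
from collections import Counter


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip()
        yield header, seq, qual


def pirna_sized(records, min_len=24, max_len=32):
    """Keep records whose read length falls in the piRNA size range."""
    return [r for r in records if min_len <= len(r[1]) <= max_len]


def five_prime_composition(records):
    """Return the fraction of each base at read position 1."""
    counts = Counter(seq[0].upper() for _, seq, _ in records if seq)
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()} if total else {}


# Tiny made-up FASTQ: two 26-nt reads starting with T and one 21-nt read.
demo = io.StringIO(
    "@r1\n" + "T" + "A" * 25 + "\n+\n" + "I" * 26 + "\n"
    "@r2\n" + "T" + "C" * 25 + "\n+\n" + "I" * 26 + "\n"
    "@r3\n" + "A" * 21 + "\n+\n" + "I" * 21 + "\n"
)
kept = pirna_sized(read_fastq(demo))
print(len(kept), five_prime_composition(kept))
```

A T fraction at position 1 well above the other bases is consistent with the 1U bias described above.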

Step 4: Detect Ping-Pong Signature

The ping-pong amplification loop is a hallmark of piRNA biogenesis, characterized by a 10 nt overlap between piRNAs on opposite strands.

Generate Overlap Statistics
Use the piPipes tool or custom scripts to calculate overlap:

python ping_pong_overlap.py sorted_reads.bam
Visualize Overlap Distribution
Plot the distribution of overlaps to confirm the presence of the 10 nt ping-pong signature.
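The text does not spell out what a custom overlap script would contain, so here is a minimal sketch of one possible calculation in Python. It assumes reads are given as BED-style (chrom, start, end, strand) tuples with 0-based, half-open coordinates (for example converted with bedtools bamtobed), and it counts pairs of plus-strand and minus-strand reads whose 5' ends overlap by a given number of bases. The function name and the default range are assumptions.

```python
from collections import Counter


def ping_pong_overlaps(reads, max_overlap=20):
    """Return {overlap_length: number of read pairs} for opposite-strand
    reads whose 5' ends overlap by that many nucleotides.

    Read lengths are not checked, so max_overlap should stay below the
    shortest read length (piRNA-sized reads are 24 nt or longer).
    """
    plus = Counter()   # 5' end of plus-strand reads: the start coordinate
    minus = Counter()  # 5' end of minus-strand reads: the last base (end - 1)
    for chrom, start, end, strand in reads:
        if strand == "+":
            plus[(chrom, start)] += 1
        else:
            minus[(chrom, end - 1)] += 1
    hist = {}
    for overlap in range(1, max_overlap + 1):
        # A minus-strand 5' end at pos + overlap - 1 overlaps a plus-strand
        # 5' end at pos by exactly `overlap` bases.
        hist[overlap] = sum(
            n * minus[(chrom, pos + overlap - 1)]
            for (chrom, pos), n in plus.items()
        )
    return hist


# Made-up example: one plus/minus pair whose 5' ends overlap by 10 nt.
reads = [("chr1", 100, 126, "+"), ("chr1", 84, 110, "-")]
hist = ping_pong_overlaps(reads)
print(hist[10])
```

A clear peak at an overlap of 10 relative to neighboring values is the pattern expected from ping-pong amplification.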

Step 5: Annotate piRNA Clusters

piRNAs are often generated from genomic clusters.

Cluster Identification
Use tools like proTRAC or PIRANHA to identify piRNA-producing clusters:

proTRAC.pl -s sorted_reads.bam -g genome.fa -o clusters
Annotate Genomic Regions
Annotate the identified clusters using gene annotation files (GTF/GFF). Tools like BEDtools intersect can help associate piRNA clusters with genes or transposable elements:

bedtools intersect -a clusters.bed -b genome_annotation.gtf > annotated_clusters.bed

Step 6: Functional Analysis

Functional analysis of piRNAs can uncover their targets and regulatory roles.

Predict piRNA Targets
Use tools like IntaRNA or RNAhybrid to predict interactions between piRNAs and potential target mRNAs:

RNAhybrid -t target_transcripts.fa -q piRNAs.fa > piRNA_targets.txt
Enrichment Analysis
Perform GO or KEGG enrichment analysis of target genes using tools like g:Profiler or DAVID.

Step 7: Validation and Visualization

Validate piRNA Candidates
Cross-check the identified piRNAs against known piRNA databases, such as piRBase or piRNAdb.
Visualize Results
- Use IGV (Integrative Genomics Viewer) to visualize piRNA alignment and clusters on the genome.
- Generate heatmaps or circos plots to present piRNA distributions.

Step 8: Share and Publish Findings

Archive Data
Submit sequencing data to public repositories like SRA or GEO with metadata specifying piRNA-related experiments.
Publish Results
Share findings in journals or conferences, emphasizing novel piRNA candidates, target genes, or regulatory mechanisms.

Conclusion

Detecting piRNAs involves a combination of computational and analytical methods to identify these unique small RNAs and their roles in gene regulation and transposable element suppression. By following this step-by-step guide, you can confidently navigate the complexities of piRNA detection and contribute to the growing understanding of their biological significance.