BOL: Related items

KAST

Neel — Wed, 23 Feb 2022 08:28:36 -0600

Performs alignment-free k-tuple frequency comparisons between sequences. Input can be two files (e.g. a reference and a query) or a single file, in which case pairwise comparisons are made.
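
To illustrate what an alignment-free k-tuple comparison involves, here is a minimal Python sketch. It is not KAST's implementation: the function names and the choice of Euclidean distance are assumptions made for this example, so check the KAST README for the measures it actually supports.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(seq, k):
    """Return the relative frequency of every k-mer in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def euclidean_distance(freq_a, freq_b):
    """Euclidean distance between two k-mer frequency profiles."""
    kmers = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a.get(x, 0.0) - freq_b.get(x, 0.0)) ** 2
                    for x in kmers))


if __name__ == "__main__":
    # Hypothetical toy sequences, not real data.
    reference = "ACGTACGTACGT"
    query = "ACGTTCGTACGA"
    d = euclidean_distance(kmer_frequencies(reference, 3),
                           kmer_frequencies(query, 3))
    print(f"3-mer Euclidean distance: {d:.4f}")
```

Identical sequences give a distance of 0, and the distance grows as the k-mer profiles diverge.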

Address of the bookmark: https://github.com/martinjvickers/KAST

ALE: Assembly Likelihood Estimator

BioStar — Wed, 08 Mar 2023 01:39:33 -0600

Just import the assembly, BAM file and ALE scores into IGV. You can convert the .ale file to a set of .wig files with ale2wiggle.py, and IGV can read those directly. Depending on your genome size, you may want to convert the .wig files to the BigWig format.

Address of the bookmark: https://github.com/sc932/ALE

Steps to find all the repeats in the genome !

Neel — Thu, 31 Aug 2023 02:43:28 -0500

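To find repeats of 2 to 9 bases in a genome, you can run RepeatMasker with the "-length" option and then parse its output with a Perl script[0]. Check ./RepeatMasker -h first to confirm that your RepeatMasker release accepts a length option. Here's a step-by-step guide: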

Install RepeatMasker: First, you need to install RepeatMasker on your system. You can download it from the RepeatMasker website[0].

Prepare the genome sequence: Make sure you have the genome sequence in a FASTA file format. Let's assume the file is named "genome.fasta".

Run RepeatMasker: Use a command along these lines, where the upper-case words are placeholders:

./RepeatMasker -pa THREADS -nolow -norna -no_is -div DIVERGENCE -lib RepeatMaskerLib.embl -gff -xsmall -small -poly -species SPECIES -dir OUTPUT_DIR -length MIN-MAX genome.fasta

Replace the following placeholders with appropriate values:

THREADS: The number of processors/threads you want to use for parallel processing.
DIVERGENCE: The divergence value for the species you are analyzing. You can find divergence values for different species in the RepeatMasker documentation[0].
SPECIES: The name of the species you are analyzing.
OUTPUT_DIR: The directory where you want the output files to be saved.
MIN and MAX: The minimum and maximum lengths of the repeats you want to find (in this case, 2 and 9).

Analyze the output: RepeatMasker will generate several output files, including a .out file. You can parse this file to extract the information you need. There is a Perl tool called "one_code_to_find_them_all.pl" that can help you parse RepeatMasker output files[0]. You can download it from the source provided.

Use the provided Perl script: Once you have the "one_code_to_find_them_all.pl" script, you can run it to conveniently parse the RepeatMasker output files. Here's an example of how to use it:

perl one_code_to_find_them_all.pl --rm RM_OUT_FILE --length LENGTH_FILE

Replace RM_OUT_FILE with the path to your RepeatMasker .out file, and LENGTH_FILE with the path to a file containing the lengths of the reference elements.

This script will generate several output files, including .log.txt and .copynumber.csv, which contain quantitative information about the identified repeat elements.

Remember to adjust the parameters and options according to your specific needs and the characteristics of your genome.
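
As a quick cross-check that does not depend on RepeatMasker, short tandem repeats with a unit of 2 to 9 bases can also be located with a regular expression. The sketch below is a different, much simpler technique than the pipeline above; the function name, the min_copies threshold and the toy sequence are assumptions made for illustration.

```python
import re


def find_tandem_repeats(seq, min_unit=2, max_unit=9, min_copies=3):
    """Return (start, end, unit, copies) for perfect tandem repeats.

    Only exact, uninterrupted copies are reported. The same locus can be
    reported more than once (for example as 'AC' x 6 and 'ACAC' x 3), and
    homopolymer runs show up as repeats of units such as 'AA'.
    """
    seq = seq.upper()
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # A unit of unit_len bases followed by at least min_copies - 1 copies.
        pattern = re.compile(
            r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for m in pattern.finditer(seq):
            copies = (m.end() - m.start()) // unit_len
            hits.append((m.start(), m.end(), m.group(1), copies))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical toy sequence: five copies of 'AC' between flanks.
    for hit in find_tandem_repeats("GGGACACACACACTTT"):
        print(hit)
```

Coordinates are 0-based and end-exclusive, so they can be used directly as Python slice bounds.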

Tools to assess the quality of your assembled genome !

LEGE — Thu, 08 Aug 2024 23:31:18 -0500

FASTA VALIDATOR + SEQKIT RMDUP: FASTA validation
GENOMETOOLS GT GFF3VALIDATOR: GFF3 validation
ASSEMBLATHON STATS: Assembly statistics
GENOMETOOLS GT STAT: Annotation statistics
NCBI FCS ADAPTOR: Adaptor contamination pass/fail
NCBI FCS GX: Foreign organism contamination pass/fail
BUSCO: Gene-space completeness estimation
TIDK: Telomere repeat identification
LAI: Continuity of repetitive sequences
KRAKEN2: Taxonomy classification
HIC CONTACT MAP: Alignment and visualisation of HiC data
MUMMER → CIRCOS + DOTPLOT & MINIMAP2 → PLOTSR: Synteny analysis
MERQURY: K-mer completeness, consensus quality and phasing assessment

Step-by-Step Guide to Running Genome Assembly

Abhi — Fri, 13 Dec 2024 11:35:55 -0600

Genome assembly is a critical process in bioinformatics, enabling the reconstruction of an organism's genome from short DNA sequence reads. Whether you’re working on a new microbial genome or a complex eukaryotic organism, this guide will walk you through the steps of genome assembly using state-of-the-art tools and best practices.

What is Genome Assembly?

Genome assembly involves piecing together short DNA sequence reads generated by sequencing platforms (e.g., Illumina, PacBio, Oxford Nanopore) into longer, contiguous sequences called contigs. This can be performed as:

De Novo Assembly: Without a reference genome.
Reference-Guided Assembly: Using a reference genome to guide the assembly process.
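
The core idea of joining reads into contigs can be shown with a toy overlap merge. This is only a sketch under simplifying assumptions (error-free reads, exact suffix-prefix overlaps, hypothetical function names); real assemblers use overlap or de Bruijn graphs and handle sequencing errors.

```python
def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge_reads(a, b, min_len=3):
    """Merge b onto the end of a if they overlap, else return None."""
    n = overlap_length(a, b, min_len)
    if n == 0:
        return None
    return a + b[n:]


if __name__ == "__main__":
    # Two hypothetical reads sharing the three bases 'TAC'.
    print(merge_reads("ACGTAC", "TACGGA"))
```

Repeating this merge greedily over many reads gives a naive de novo assembler; it breaks down on repeats, which is one reason long reads help.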

Step 1: Preparing Your Data

Before starting the assembly, ensure that your raw sequencing data is high quality.

Input Data
- Short Reads: Illumina sequencing generates short, accurate reads, well suited to base-level accuracy and later polishing.
- Long Reads: PacBio and Nanopore sequencing provide long reads for resolving repetitive regions.
Quality Control (QC)
Use tools like FastQC or MultiQC to assess the quality of your reads:

fastqc reads.fastq
multiqc .

Look for issues like low-quality bases, adapter contamination, or overrepresented sequences.
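
To make "low-quality bases" concrete, the sketch below computes the mean Phred score of each read in a FASTQ string. It assumes four-line records and Phred+33 encoding (the usual encoding for current Illumina data); the function name and the toy records are hypothetical, and this is not how FastQC computes its reports.

```python
def mean_phred_per_read(fastq_text, offset=33):
    """Map read name -> mean Phred quality for four-line FASTQ records."""
    lines = [ln.strip() for ln in fastq_text.strip().splitlines()]
    means = {}
    for i in range(0, len(lines) - 3, 4):
        header, _seq, _plus, qual = lines[i:i + 4]
        # Each quality character encodes one score as ASCII code - offset.
        scores = [ord(c) - offset for c in qual]
        name = header[1:].split()[0]
        means[name] = sum(scores) / len(scores) if scores else 0.0
    return means


if __name__ == "__main__":
    # Two made-up reads: one high quality ('I' = 40), one very low ('!' = 0).
    example = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n"
    for name, q in mean_phred_per_read(example).items():
        print(name, q)
```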
Read Trimming and Filtering
Trim low-quality bases and adapters using Trimmomatic or Cutadapt:

trimmomatic PE reads_R1.fastq reads_R2.fastq \
  trimmed_R1.fastq unpaired_R1.fastq \
  trimmed_R2.fastq unpaired_R2.fastq \
  ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36
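
The SLIDINGWINDOW:4:20 step can be pictured as follows: scan the read from the 5' end and cut once the average quality in a 4-base window drops below 20. The Python sketch below is a simplified illustration of that idea, not a reimplementation of Trimmomatic, whose exact cut position may differ; the function name and the example values are assumptions.

```python
def sliding_window_trim(seq, quals, window=4, min_avg=20):
    """Cut the read at the start of the first window whose mean quality
    falls below min_avg. Reads shorter than the window are returned as-is."""
    for start in range(0, len(seq) - window + 1):
        window_quals = quals[start:start + window]
        if sum(window_quals) / window < min_avg:
            return seq[:start], quals[:start]
    return seq, quals


if __name__ == "__main__":
    # Hypothetical read: six good bases followed by four poor ones.
    read = "ACGTACGTAC"
    scores = [30] * 6 + [2] * 4
    print(sliding_window_trim(read, scores))
```

A read that ends up shorter than the MINLEN value (36 in the command above) would then be discarded.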

Step 2: Choosing an Assembly Strategy

Select an assembly strategy based on your data type:

Short-Read Assemblers:
- SPAdes: Popular for microbial genomes.
- Velvet: Fast for smaller genomes.
Long-Read Assemblers:
- Canu: Ideal for long-read datasets.
- Flye: Versatile for small and large genomes.
Hybrid Assemblers:
- MaSuRCA: Combines short and long reads.
- Unicycler: Optimized for bacterial genomes.

Step 3: Running the Assembly

3.1. SPAdes (Short-Read Assembly)

SPAdes is an excellent choice for small genomes, such as bacteria.

spades.py -1 trimmed_R1.fastq -2 trimmed_R2.fastq -o spades_output

The output includes assembled contigs (contigs.fasta) and scaffolds (scaffolds.fasta).

3.2. Canu (Long-Read Assembly)

Canu is designed for high-error long reads from PacBio or Nanopore.

canu -p genome -d canu_output genomeSize=4.7m -nanopore-raw reads.fastq

The output will be in canu_output/genome.contigs.fasta.

3.3. Hybrid Assembly with Unicycler

Unicycler combines short and long reads for improved assemblies.

unicycler -1 trimmed_R1.fastq -2 trimmed_R2.fastq -l long_reads.fastq -o unicycler_output

Step 4: Assessing Assembly Quality

After assembly, evaluate its quality using the following tools:

QUAST
QUAST generates assembly statistics, such as N50, genome size, and GC content:

quast contigs.fasta -o quast_output
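
For readers unfamiliar with these statistics, here is a small sketch of how total length, N50 and GC content can be computed from a list of contig sequences. It is an illustration only, not QUAST's code; the function name and the toy contigs are assumptions.

```python
def assembly_stats(contigs):
    """Return total length, N50 and GC percentage for contig strings."""
    lengths = sorted((len(c) for c in contigs), reverse=True)
    total = sum(lengths)
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        # N50: length of the contig at which the running sum of the
        # length-sorted contigs first reaches half of the assembly.
        if running * 2 >= total:
            n50 = length
            break
    gc = sum(c.upper().count("G") + c.upper().count("C") for c in contigs)
    gc_percent = 100.0 * gc / total if total else 0.0
    return {"total_length": total, "n50": n50, "gc_percent": gc_percent}


if __name__ == "__main__":
    # Made-up contigs with lengths 10, 6, 5, 4, 3 and 2.
    toy = ["GC" * 5, "AT" * 3, "ATGCA", "ATAT", "GGG", "AT"]
    print(assembly_stats(toy))
```
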
BUSCO
BUSCO checks genome completeness by identifying conserved genes:

busco -i contigs.fasta -o busco_output -l fungi_odb10 -m genome
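
BUSCO reports its result as a one-line summary of the form C:..%[S:..%,D:..%],F:..%,M:..%,n:... The sketch below parses such a line into numbers so that runs can be compared. The function name is hypothetical and the example values are invented; newer BUSCO releases may append further fields, which this pattern ignores.

```python
import re

# C = complete, S = single-copy, D = duplicated, F = fragmented,
# M = missing, n = number of BUSCO groups searched.
SUMMARY_RE = re.compile(
    r"C:(?P<complete>[\d.]+)%\[S:(?P<single>[\d.]+)%,"
    r"D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(line):
    """Parse a BUSCO one-line summary into a dict of numbers."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("no BUSCO summary found in line")
    result = {key: float(value) for key, value in match.groupdict().items()}
    result["total"] = int(match.group("total"))
    return result


if __name__ == "__main__":
    # Invented example values, not a real BUSCO result.
    print(parse_busco_summary("C:95.0%[S:93.5%,D:1.5%],F:2.0%,M:3.0%,n:758"))
```
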
Assembly Graph Visualization
Visualize assembly graphs with Bandage:

Bandage load assembly_graph.gfa

Step 5: Post-Assembly Steps

Polishing
Improve assembly accuracy using tools like Pilon (for short reads) or Racon (for long reads).

racon long_reads.fasta mapped_reads.sam contigs.fasta > polished_contigs.fasta
Scaffolding
Link contigs into scaffolds using tools like SSPACE or Opera-LG if required.
Annotation
Annotate the assembled genome using Prokka for prokaryotes or Maker for eukaryotes.

prokka --outdir annotation_output --prefix genome contigs.fasta

Step 6: Sharing and Archiving

Submit to Public Repositories
Share your assembly in databases like NCBI GenBank, ENA, or DDBJ.
Metadata Preparation
Include detailed metadata for your submission, such as organism name, sequencing platform, and coverage.
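
Coverage is usually reported as total sequenced bases divided by genome size. A minimal sketch of that arithmetic follows; the function name and the example figures (one million 150 bp reads over a 4.7 Mb genome) are assumptions for illustration.

```python
def estimated_coverage(num_reads, mean_read_length, genome_size):
    """Estimated coverage = (reads x mean read length) / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * mean_read_length / genome_size


if __name__ == "__main__":
    cov = estimated_coverage(1_000_000, 150, 4_700_000)
    print(f"Estimated coverage: {cov:.1f}x")
```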

Best Practices

Always perform quality checks at each stage to ensure data integrity.
Use multiple tools to cross-validate results when working with complex genomes.
Document parameters and software versions for reproducibility.

Conclusion

Genome assembly is a powerful process that transforms raw sequencing data into a coherent representation of an organism’s genome. By following this step-by-step guide, you can successfully assemble genomes and uncover valuable biological insights. Whether you’re assembling a microbial genome or tackling the complexities of a eukaryotic genome, these tools and strategies will set you on the path to success.

Genomic architecture surrounding the fusion site of human chromosome 2

LEGE — Tue, 04 Mar 2025 12:26:29 -0600

The article "Genomic Structure and Evolution of the Ancestral Chromosome Fusion Site in 2q13–2q14.1 and Paralogous Regions on Other Human Chromosomes" (https://pmc.ncbi.nlm.nih.gov/articles/PMC187548/) explores the genomic architecture surrounding the fusion site of human chromosome 2. This fusion event is a key evolutionary marker distinguishing humans from other great apes, as humans have 46 chromosomes while chimpanzees, gorillas, and orangutans possess 48. The fusion occurred through an end-to-end joining of two ancestral chromosomes, which remain separate in nonhuman primates.

Key Findings:

Chromosomal Fusion and Its Molecular Signature:
- The fusion site is located at 2q13–2q14.1 and is characterized by degenerate telomeric sequences appearing interstitially, indicating the historical head-to-head joining of ancestral chromosomes.
- Despite being a signature of a past fusion event, these telomeric repeats are no longer functional and have undergone sequence degradation over time.
Extensive Duplications in the Surrounding Genomic Region:
- The study identifies large-scale segmental duplications flanking the fusion site, with several of these regions duplicated and scattered across multiple chromosomes.
- These duplications are predominantly located in subtelomeric and pericentromeric regions, suggesting their role in genomic instability and chromosomal evolution.
Paralogous Regions and Their Evolutionary Relationships:
- A 168-kilobase (kb) segment near the fusion site has 98%–99% sequence identity with three regions on chromosome 9 (9pter, 9p11.2, and 9q13).
- Another 67-kb region distal to the fusion site shows a high degree of homology to sequences in chromosome 22qter.
- Additionally, a 100-kb segment exhibits 96% sequence identity with a region in chromosome 2q11.2.
Comparative Genomics and Evolutionary Implications:
- By comparing the duplicated sequences and their arrangement in primates, the researchers traced the order of duplication events leading to their present distribution.
- The presence of specific repetitive elements within these duplicated segments serves as evolutionary markers that help infer their historical rearrangements.
- Some of these duplicated regions are associated with chromosomal inversion breakpoints, potentially contributing to evolutionary changes in primates.
- Recurrent structural rearrangements in these regions have been linked to human chromosomal disorders.
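
The head-to-head telomeric signature described in the key findings can be pictured as a run of the telomere repeat TTAGGG immediately followed by a run of its reverse complement, CCCTAA. The sketch below searches for that idealized pattern. It is a toy: the real fusion site is degenerate, so exact matching like this would miss it, and the function name and test sequence are assumptions.

```python
import re

# Two or more forward telomere repeats directly followed by two or more
# reverse-complement repeats, i.e. two arrays pointing at each other.
JUNCTION_RE = re.compile(r"((?:TTAGGG){2,})((?:CCCTAA){2,})")


def find_head_to_head_junctions(seq):
    """Return 0-based positions where a TTAGGG array meets a CCCTAA array."""
    return [m.end(1) for m in JUNCTION_RE.finditer(seq.upper())]


if __name__ == "__main__":
    # Synthetic sequence: flank, 3 forward repeats, 3 reverse repeats, flank.
    synthetic = "ACGT" + "TTAGGG" * 3 + "CCCTAA" * 3 + "GGCC"
    print(find_head_to_head_junctions(synthetic))
```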

Conclusions and Implications:

The findings provide valuable insights into the structural evolution of human chromosome 2, which played a crucial role in human speciation.
Understanding these segmental duplications and their evolutionary trajectories sheds light on genomic instability, which may contribute to human genetic diseases.
The study highlights how large-scale chromosomal rearrangements, such as fusion and duplication, have influenced the evolutionary divergence of humans from other primates.

This research advances our understanding of human genome evolution and offers a foundation for studying the effects of structural variants in genetic disorders.

MITObim - mitochondrial baiting and iterative mapping

Rahul Nayak — Tue, 08 May 2018 04:15:25 -0500

This document contains instructions on how to use the MITObim pipeline described in Hahn et al. 2013. The full article can be found here. Kindly cite the article if you are using MITObim in your work. The pipeline was originally developed for Illumina data, but thanks to the versatility of the MIRA assembler, MITObim in principle also supports data from the Ion Torrent, 454 and PacBio sequencing platforms.

Below you can find a few basic tutorials for how to run MITObim, and I encourage you to give them a try with the test data that comes with this repo, just to make sure everything is running smoothly on your system. It'll only take a few minutes and will potentially save you a lot of time down the line.

I provide further examples here as Jupyter notebooks. Get in touch if you feel like sharing your particular MITObim solution and I'd be happy to put it up here, too!

Address of the bookmark: https://github.com/chrishah/MITObim

chromoMap-An R package for Interactive visualization and mapping of human chromosomes

Rahul Nayak — Mon, 25 Jun 2018 17:22:24 -0500

chromoMap is an R package that provides interactive, configurable and elegant graphical visualization of the human chromosomes, allowing users to map chromosome elements (like genes, SNPs etc.) on the chromosome plot. It introduces a special plot, the "chromosome heatmap", that, in addition to mapping elements, can visualize the data associated with chromosome elements (like gene expression) in the form of heat colors, which can be highly advantageous for scientific interpretation and research work. Because of the enormous size of the chromosomes, it is impractical to visualize each element on the same plot. chromoMap plots therefore provide a magnified view of each chromosome location to render additional information and visualization specific to that location. You can map thousands of genes and view all mappings easily. Users can investigate detailed information about the mappings (like gene names or total genes mapped at a location) or view the magnified single- or double-stranded view of the chromosome at a location, showing each mapped element in sequential order (as you will see in the demos below). Not only that, the plots can be saved as HTML documents that can be customized and shared easily. In addition, you can include them in R Markdown or in R Shiny applications.

https://cran.r-project.org/web/packages/chromoMap/index.html

Scientists map 17,294 proteins produced in human body

Jit — Thu, 29 May 2014 01:57:55 -0500

Indian scientists missed the genomic profiling bus, but they have more than made up for it by creating the first human proteome map, which is an extension of the genomic study. Until now, there has been no direct equivalent of the genome map for the human proteome. Recently, however, two groups presented mass spectrometry-based analyses of human tissues, body fluids and cells that map the large majority of the human proteome.

The Indian scientists working in Bangalore, along with their American counterparts, have mapped more than 17,000 proteins in 30 organs of the human body. Just like the human genome was sequenced around the turn of the millennium, this is an equivalent mapping of the human proteome.

The researchers estimate that there are around 20,500 proteins in the human body. They have profiled 17,294 of them, which accounts for around 84% of the total. The team also traced around 2,500 of the 3,000 proteins that had been categorised as "missing proteins".

The work, carried out by a group of Indian scientists together with Johns Hopkins University, was published in the renowned journal Nature ( http://www.nature.com/nature/journal/v509/n7502/full/nature13302.html ). Of the 72 people who worked on the project, 46 are Indians.

Reference:

http://www.nature.com/nature/journal/v509/n7502/full/nature13302.html

http://www.proteinatlas.org/ -The antibody-based Human Protein Atlas programme

http://www.humanproteomemap.org/ -Proteogenomic analysis by identifying translated proteins from annotated pseudogenes, non-coding RNAs and untranslated regions.

https://www.proteomicsdb.org/ -Assembled protein evidence for 18,097 genes in ProteomicsDB

MOSAIK: A Hash-Based Algorithm for Accurate Next-Generation Sequencing Short-Read Mapping

Neel — Fri, 20 May 2016 18:53:49 -0500

MOSAIK is a stable, sensitive and open-source program for mapping second- and third-generation sequencing reads to a reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific Biosciences SMRT. Indeed, MOSAIK was the only aligner to provide consistent mappings for all the generated data (sequencing technologies, low-coverage and exome) in the 1000 Genomes Project. To provide highly accurate alignments, MOSAIK employs a hash clustering strategy coupled with the Smith-Waterman algorithm. This method is well suited to capturing mismatches as well as short insertions and deletions. To support the growing interest in larger structural variant (SV) discovery, MOSAIK provides explicit support for handling known-sequence SVs, e.g. mobile element insertions (MEIs), as well as generating outputs tailored to aid in SV discovery.
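
The Smith-Waterman step mentioned above is a local alignment by dynamic programming. The sketch below computes only the best local alignment score, with a linear gap penalty. It is a textbook illustration, not MOSAIK's code: the scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions, and the hash clustering stage is not shown.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, which makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Hypothetical read and reference differing by one deleted base.
    print(smith_waterman_score("ACGTACGT", "ACGACGT"))
```

Keeping only two rows of the matrix is enough for the score; recovering the alignment itself would require the full matrix and a traceback.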

Address of the bookmark: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090581