BOL: Related items

Steps to find all the repeats in the genome!

Neel — Thu, 31 Aug 2023 02:43:28 -0500

To find repeats of length 2 to 9 in a genome using a Perl script, you can use the RepeatMasker tool with the "--length" option[0]. Here's a step-by-step guide:

Install RepeatMasker: First, you need to install RepeatMasker on your system. You can download it from the RepeatMasker website[0].

Prepare the genome sequence: Make sure you have the genome sequence in a FASTA file format. Let's assume the file is named "genome.fasta".

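Run RepeatMasker: Run it on the genome with a command of the following form:

./RepeatMasker -pa THREADS -nolow -norna -no_is -div DIVERGENCE -lib RepeatMaskerLib.embl -gff -xsmall -small -poly -species SPECIES -dir OUTPUT_DIR -length MIN-MAX genome.fasta

Replace the following placeholders with appropriate values:

THREADS: The number of processors/threads you want to use for parallel processing.
DIVERGENCE: The maximum percent divergence from the repeat consensus; RepeatMasker masks only repeats below this value. Guidance on choosing it for different species is in the RepeatMasker documentation[0].
SPECIES: The name of the species you are analyzing.
OUTPUT_DIR: The directory where you want the output files to be saved.
MIN and MAX: The minimum and maximum lengths of the repeats you want to find (in this case, 2 and 9).

Before running, check these flags against the option list in RepeatMasker's built-in help for your installed version: -nolow tells RepeatMasker not to mask low-complexity DNA or simple repeats, which is the class that repeats with 2 to 9 bp units fall into, and -length may not be available in your version.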

Analyze the output: RepeatMasker will generate several output files, including a .out file. You can parse this file to extract the information you need. There is a Perl tool called "one_code_to_find_them_all.pl" that can help you parse RepeatMasker output files[0]. You can download it from the source provided.
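
If you only need the short simple repeats from the .out file, parsing it by hand is also possible. The Python sketch below is a rough illustration, not part of RepeatMasker or one_code_to_find_them_all.pl. It assumes the usual whitespace-separated .out layout (score, three percentage columns, query name, begin, end, remaining bases, strand, repeat name, class/family, ...) and that simple repeats are named like (CA)n. The function name and the sample rows are made up for the example.

```python
import re

# Assumption: simple repeats carry names like "(CA)n" in the repeat-name column.
MOTIF = re.compile(r"^\(([ACGT]+)\)n$")

def simple_repeats_by_unit(lines, min_unit=2, max_unit=9):
    """Return (query, begin, end, motif) for simple repeats whose unit length is in range."""
    hits = []
    for line in lines:
        fields = line.split()
        # Data rows start with an integer score; header and blank lines do not.
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        query, begin, end = fields[4], int(fields[5]), int(fields[6])
        match = MOTIF.match(fields[9])
        if match and min_unit <= len(match.group(1)) <= max_unit:
            hits.append((query, begin, end, match.group(1)))
    return hits

# Hypothetical rows in the assumed layout (not real RepeatMasker output).
example = [
    "   SW   perc perc perc  query  position in query  matching  repeat  position in repeat",
    "score   div. del. ins.  sequence  begin end (left)  repeat  class/family  begin end (left) ID",
    "",
    "  463  1.3  0.6  1.7  chr1  101  153  (999847)  +  (TAACCC)n  Simple_repeat  1  52  (0)  1",
    "  250  0.0  0.0  0.0  chr1  400  421  (999579)  +  (CA)n  Simple_repeat  1  22  (0)  2",
    " 3612 11.4 21.5  1.3  chr1  900 1500  (998500)  C  L1MC5a  LINE/L1  (2) 5000 4400  3",
]
for hit in simple_repeats_by_unit(example):
    print(hit)
```

To use it on a real file, pass an open file handle instead of the example list.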

Use the provided Perl script: Once you have the "one_code_to_find_them_all.pl" script, you can run it to conveniently parse the RepeatMasker output files. Here's an example of how to use it:

perl one_code_to_find_them_all.pl --rm RM_OUT_FILE --length LENGTH_FILE

Replace RM_OUT_FILE with the path to your RepeatMasker .out file, and LENGTH_FILE with the path to a file containing the lengths of the reference elements.

This script will generate several output files, including .log.txt and .copynumber.csv, which contain quantitative information about the identified repeat elements.

Remember to adjust the parameters and options according to your specific needs and the characteristics of your genome.
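
For a quick cross-check that does not depend on RepeatMasker at all, perfect tandem repeats with a unit of 2 to 9 bases can be found directly with a regular expression. The Python sketch below is a different and much simpler technique than RepeatMasker: it only reports exact copies, and the function names, the minimum of three copies, and the demo sequence are assumptions made for illustration.

```python
import re

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def find_tandem_repeats(seq, min_unit=2, max_unit=9, min_copies=3):
    """Return (start, unit, copies) for perfect tandem repeats.

    The shortest unit that repeats is tried first, so a run such as AAAAAA
    is reported with the two-base unit AA.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits

# Hypothetical demo record; replace with open("genome.fasta") for a real file.
demo = [">chr_demo", "GGCATATATATCC", "ACGACGACGTT"]
for name, seq in read_fasta(demo):
    for start, unit, copies in find_tandem_repeats(seq):
        print(name, start, unit, copies)
```

Start positions are 0-based. Imperfect or interrupted repeats are missed by design, so treat the result as a sanity check rather than a full annotation.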

Tools to assess the quality of your assembled genome!

LEGE — Thu, 08 Aug 2024 23:31:18 -0500

FASTA VALIDATOR + SEQKIT RMDUP: FASTA validation
GENOMETOOLS GT GFF3VALIDATOR: GFF3 validation
ASSEMBLATHON STATS: Assembly statistics
GENOMETOOLS GT STAT: Annotation statistics
NCBI FCS ADAPTOR: Adaptor contamination pass/fail
NCBI FCS GX: Foreign organism contamination pass/fail
BUSCO: Gene-space completeness estimation
TIDK: Telomere repeat identification
LAI: Continuity of repetitive sequences
KRAKEN2: Taxonomy classification
HIC CONTACT MAP: Alignment and visualisation of HiC data
MUMMER → CIRCOS + DOTPLOT & MINIMAP2 → PLOTSR: Synteny analysis
MERQURY: K-mer completeness, consensus quality and phasing assessment

Step-by-Step Guide to Running Genome Assembly

Abhi — Fri, 13 Dec 2024 11:35:55 -0600

Genome assembly is a critical process in bioinformatics, enabling the reconstruction of an organism's genome from short DNA sequence reads. Whether you’re working on a new microbial genome or a complex eukaryotic organism, this guide will walk you through the steps of genome assembly using state-of-the-art tools and best practices.

What is Genome Assembly?

Genome assembly involves piecing together DNA sequence reads generated by sequencing platforms (e.g., Illumina, PacBio, Oxford Nanopore) into longer, contiguous sequences called contigs. This can be performed as:

De Novo Assembly: Without a reference genome.
Reference-Guided Assembly: Using a reference genome to guide the assembly process.

Step 1: Preparing Your Data

Before starting the assembly, ensure that your raw sequencing data is high quality.

Input Data
- Short Reads: Illumina sequencing generates short, accurate reads well suited to assembling small genomes and to polishing.
- Long Reads: PacBio and Nanopore sequencing provide long reads for resolving repetitive regions.
Quality Control (QC)
Use tools like FastQC or MultiQC to assess the quality of your reads:

fastqc reads.fastq
multiqc .

Look for issues like low-quality bases, adapter contamination, or overrepresented sequences.
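
To make the idea of "low-quality bases" concrete, here is a small Python sketch that computes the mean Phred score per read from a FASTQ file. It is not how FastQC works internally; it assumes standard four-line FASTQ records with Phred+33 quality encoding, and the function names and the threshold of 20 are illustrative choices.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines (4 lines per record)."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        next(it)  # the '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual

def mean_phred(quality, offset=33):
    """Average Phred score of one quality string (assumes Phred+33 encoding)."""
    return sum(ord(char) - offset for char in quality) / len(quality)

def low_quality_fraction(lines, threshold=20):
    """Fraction of reads whose mean Phred score is below the threshold."""
    means = [mean_phred(qual) for _, _, qual in read_fastq(lines) if qual]
    return sum(m < threshold for m in means) / len(means) if means else 0.0

# Hypothetical two-read example; use open("reads.fastq") for a real file.
demo = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGT", "+", "!!!!"]
print(low_quality_fraction(demo))
```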
Read Trimming and Filtering
Trim low-quality bases and adapters using Trimmomatic or Cutadapt:

trimmomatic PE reads_R1.fastq reads_R2.fastq \
  trimmed_R1.fastq unpaired_R1.fastq trimmed_R2.fastq unpaired_R2.fastq \
  ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36
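
The SLIDINGWINDOW:4:20 and MINLEN:36 settings can be pictured with a short Python sketch: slide a 4-base window along the read, cut where the window's average quality first drops below 20, and discard reads that end up shorter than the minimum length. This is a simplified illustration of the idea, not Trimmomatic's exact algorithm; it assumes Phred+33 qualities, and the function name is made up.

```python
def trim_read(seq, qual, window=4, min_avg=20, min_len=36, offset=33):
    """Cut a read at the first window whose mean quality is below min_avg.

    Returns (sequence, quality) after trimming, or None if the trimmed read
    is shorter than min_len. Simplified sketch, not Trimmomatic's own logic.
    """
    scores = [ord(char) - offset for char in qual]
    keep = len(seq)
    for start in range(len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < min_avg:
            keep = start  # cut at the start of the first failing window
            break
    if keep < min_len:
        return None
    return seq[:keep], qual[:keep]

# Hypothetical read: six high-quality bases followed by four very poor ones.
print(trim_read("ACGTACGTAC", "IIIIII!!!!", min_len=5))
```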

Step 2: Choosing an Assembly Strategy

Select an assembly strategy based on your data type:

Short-Read Assemblers:
- SPAdes: Popular for microbial genomes.
- Velvet: Fast for smaller genomes.
Long-Read Assemblers:
- Canu: Ideal for long-read datasets.
- Flye: Versatile for small and large genomes.
Hybrid Assemblers:
- MaSuRCA: Combines short and long reads.
- Unicycler: Optimized for bacterial genomes.

Step 3: Running the Assembly

3.1. SPAdes (Short-Read Assembly)

SPAdes is an excellent choice for small genomes, such as bacteria.

spades.py -1 trimmed_R1.fastq -2 trimmed_R2.fastq -o spades_output

The output includes assembled contigs (contigs.fasta) and scaffolds (scaffolds.fasta).

3.2. Canu (Long-Read Assembly)

Canu is designed for high-error long reads from PacBio or Nanopore.

canu -p genome -d canu_output genomeSize=4.7m -nanopore-raw reads.fastq

The output will be in canu_output/genome.contigs.fasta.
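
Since the command above needs a genomeSize estimate, it can help to check roughly how much coverage your reads give for that size. The Python sketch below converts a size written with a k, m, or g suffix (as in genomeSize=4.7m) to bases and divides the total read length by it. The function names and the example numbers are hypothetical.

```python
def parse_genome_size(text):
    """Convert a size such as '4.7m', '500k' or '1200' to a number of bases."""
    multipliers = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}
    text = text.strip().lower()
    if text and text[-1] in multipliers:
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(float(text))

def estimated_coverage(read_lengths, genome_size):
    """Total sequenced bases divided by the genome size."""
    return sum(read_lengths) / genome_size

# Hypothetical dataset: 23,500 reads of 10 kb against a 4.7 Mb genome.
size = parse_genome_size("4.7m")
print(size, estimated_coverage([10_000] * 23_500, size))
```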

3.3. Hybrid Assembly with Unicycler

Unicycler combines short and long reads for improved assemblies.

unicycler -1 trimmed_R1.fastq -2 trimmed_R2.fastq -l long_reads.fastq -o unicycler_output

Step 4: Assessing Assembly Quality

After assembly, evaluate its quality using the following tools:

QUAST
QUAST generates assembly statistics, such as N50, genome size, and GC content:

quast contigs.fasta -o quast_output
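
For readers unfamiliar with N50, here is a minimal Python sketch of how it and GC content can be computed from contig sequences. It is meant to show what the numbers mean, not to reproduce QUAST; the function names and sample lengths are made up.

```python
def n50(lengths):
    """Length of the contig at which the running total reaches half the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly

def gc_content(contigs):
    """Fraction of G and C bases over all contigs."""
    total = sum(len(contig) for contig in contigs)
    if total == 0:
        return 0.0
    gc = sum(contig.upper().count("G") + contig.upper().count("C") for contig in contigs)
    return gc / total

# Hypothetical contig lengths.
print(n50([80, 70, 50, 40, 30, 20]))
print(gc_content(["GGCC", "AATT"]))
```
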
BUSCO
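BUSCO checks genome completeness by identifying conserved genes. Choose the lineage dataset (-l) that matches your organism; fungi_odb10 is used here only as an example: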

busco -i contigs.fasta -o busco_output -l fungi_odb10 -m genome
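
BUSCO reports its result as a one-line summary of the form C:..%[S:..%,D:..%],F:..%,M:..%,n:.. (complete, single-copy, duplicated, fragmented, missing, total genes searched). If you want those numbers in a script, a sketch like the following Python function can pull them out. The summary string in the example is invented, and the function name is hypothetical.

```python
import re

def parse_busco_summary(line):
    """Extract the C/S/D/F/M percentages and n from a BUSCO one-line summary."""
    scores = {}
    for key, value in re.findall(r"([CSDFMn]):(\d+(?:\.\d+)?)", line):
        scores[key] = int(value) if key == "n" else float(value)
    return scores

# Invented example line, not a real result.
print(parse_busco_summary("C:89.0%[S:85.8%,D:3.2%],F:6.9%,M:4.1%,n:3023"))
```
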
Assembly Graph Visualization
Visualize assembly graphs with Bandage:

Bandage load assembly_graph.gfa
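
Before opening a graph in Bandage, it can be useful to get a quick count of what is in the GFA file. The Python sketch below assumes GFA version 1 records: tab-separated lines where S lines hold a segment name and sequence (or * with an LN:i: length tag) and L lines are links. The function name and the tiny example graph are made up.

```python
def summarize_gfa(lines):
    """Count segments, total segment length and links in GFA1 text."""
    segments = {}
    links = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            name, seq = fields[1], fields[2]
            length = len(seq)
            if seq == "*":
                # Sequence omitted: fall back to the optional LN:i:<length> tag.
                length = 0
                for tag in fields[3:]:
                    if tag.startswith("LN:i:"):
                        length = int(tag[5:])
            segments[name] = length
        elif fields[0] == "L":
            links += 1
    return {
        "segments": len(segments),
        "total_length": sum(segments.values()),
        "links": links,
    }

# Hypothetical two-segment graph; use open("assembly_graph.gfa") for a real file.
demo = ["H\tVN:Z:1.0", "S\t1\tACGT", "S\t2\t*\tLN:i:10", "L\t1\t+\t2\t-\t0M"]
print(summarize_gfa(demo))
```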

Step 5: Post-Assembly Steps

Polishing
Improve assembly accuracy using tools like Pilon (for short reads) or Racon (for long reads).

racon long_reads.fasta mapped_reads.sam contigs.fasta > polished_contigs.fasta
Scaffolding
Link contigs into scaffolds using tools like SSPACE or Opera-LG if required.
Annotation
Annotate the assembled genome using Prokka for prokaryotes or Maker for eukaryotes.

prokka --outdir annotation_output --prefix genome contigs.fasta

Step 6: Sharing and Archiving

Submit to Public Repositories
Share your assembly in databases like NCBI GenBank, ENA, or DDBJ.
Metadata Preparation
Include detailed metadata for your submission, such as organism name, sequencing platform, and coverage.

Best Practices

Always perform quality checks at each stage to ensure data integrity.
Use multiple tools to cross-validate results when working with complex genomes.
Document parameters and software versions for reproducibility.

Conclusion

Genome assembly is a powerful process that transforms raw sequencing data into a coherent representation of an organism’s genome. By following this step-by-step guide, you can successfully assemble genomes and uncover valuable biological insights. Whether you’re assembling a microbial genome or tackling the complexities of a eukaryotic genome, these tools and strategies will set you on the path to success.

Genomic architecture surrounding the fusion site of human chromosome 2

LEGE — Tue, 04 Mar 2025 12:26:29 -0600

The article "Genomic Structure and Evolution of the Ancestral Chromosome Fusion Site in 2q13–2q14.1 and Paralogous Regions on Other Human Chromosomes" (https://pmc.ncbi.nlm.nih.gov/articles/PMC187548/) explores the genomic architecture surrounding the fusion site of human chromosome 2. This fusion event is a key evolutionary marker distinguishing humans from other great apes, as humans have 46 chromosomes while chimpanzees, gorillas, and orangutans possess 48. The fusion occurred through an end-to-end joining of two ancestral chromosomes, which remain separate in nonhuman primates.

Key Findings:

Chromosomal Fusion and Its Molecular Signature:
- The fusion site is located at 2q13–2q14.1 and is characterized by degenerate telomeric sequences appearing interstitially, indicating the historical head-to-head joining of ancestral chromosomes.
- Despite being a signature of a past fusion event, these telomeric repeats are no longer functional and have undergone sequence degradation over time.
Extensive Duplications in the Surrounding Genomic Region:
- The study identifies large-scale segmental duplications flanking the fusion site, with several of these regions duplicated and scattered across multiple chromosomes.
- These duplications are predominantly located in subtelomeric and pericentromeric regions, suggesting their role in genomic instability and chromosomal evolution.
Paralogous Regions and Their Evolutionary Relationships:
- A 168-kilobase (kb) segment near the fusion site has 98%–99% sequence identity with three regions on chromosome 9 (9pter, 9p11.2, and 9q13).
- Another 67-kb region distal to the fusion site shows a high degree of homology to sequences in chromosome 22qter.
- Additionally, a 100-kb segment exhibits 96% sequence identity with a region in chromosome 2q11.2.
Comparative Genomics and Evolutionary Implications:
- By comparing the duplicated sequences and their arrangement in primates, the researchers traced the order of duplication events leading to their present distribution.
- The presence of specific repetitive elements within these duplicated segments serves as evolutionary markers that help infer their historical rearrangements.
- Some of these duplicated regions are associated with chromosomal inversion breakpoints, potentially contributing to evolutionary changes in primates.
- Recurrent structural rearrangements in these regions have been linked to human chromosomal disorders.
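
The head-to-head telomeric signature described under the first finding can be illustrated with a toy search: an array of the vertebrate telomere repeat TTAGGG followed closely by an array of its reverse complement CCCTAA. The Python sketch below looks only for perfect copies, whereas the real fusion site is degenerate, so it is a teaching example and not the method used in the article. The function names, thresholds, and test sequence are assumptions.

```python
import re

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def find_head_to_head(seq, motif="TTAGGG", min_copies=2, max_gap=10):
    """Return (end_of_forward_array, start_of_inverted_array) for each junction.

    A junction is a run of the motif followed, within max_gap bases,
    by a run of its reverse complement. Only perfect copies are matched.
    """
    seq = seq.upper()
    forward = re.compile("(?:%s){%d,}" % (motif, min_copies))
    inverted = re.compile("(?:%s){%d,}" % (reverse_complement(motif), min_copies))
    junctions = []
    for fwd in forward.finditer(seq):
        inv = inverted.search(seq, fwd.end())
        if inv and inv.start() - fwd.end() <= max_gap:
            junctions.append((fwd.end(), inv.start()))
    return junctions

# Made-up sequence with three forward and three inverted repeats.
toy = "ACGT" + "TTAGGG" * 3 + "CCCTAA" * 3 + "GGCC"
print(find_head_to_head(toy))
```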

Conclusions and Implications:

The findings provide valuable insights into the structural evolution of human chromosome 2, which played a crucial role in human speciation.
Understanding these segmental duplications and their evolutionary trajectories sheds light on genomic instability, which may contribute to human genetic diseases.
The study highlights how large-scale chromosomal rearrangements, such as fusion and duplication, have influenced the evolutionary divergence of humans from other primates.

This research advances our understanding of human genome evolution and offers a foundation for studying the effects of structural variants in genetic disorders.

SVEngine: Allele Specific and Haplotype Aware Structural Variants Simulator

Jit — Sat, 04 Jul 2020 05:52:34 -0500

SVEngine (Structural Variants Engine)

SVEngine is a multi-purpose and self-contained simulator for whole genome scale spike-in of thousands of SV events of various types in both single-sample and matched sample scenarios.
SVEngine takes as input reference contigs in FASTA files, variant meta distributions as specified in META files (see Manual) or specific variant information as specified in VAR files (see Manual), and NEWICK files for specifying clonal phylogenetic trees in cancer.
SVEngine outputs altered contigs in FASTA files, spiked-in variants in VAR files (see Manual), simulated short reads in FASTQ files, and aligned short reads in BAM files.

Address of the bookmark: https://bitbucket.org/charade/svengine/src/master/

jackalope: A swift, versatile phylogenomic and high-throughput sequencing simulator

Abhimanyu Singh — Fri, 26 Jul 2019 00:58:12 -0500

jackalope simply and efficiently simulates (i) variants from reference genomes and (ii) reads from both Illumina and Pacific Biosciences (PacBio) platforms. It can either read reference genomes from FASTA files or simulate new ones. Genomic variants can be simulated using summary statistics, phylogenies, Variant Call Format (VCF) files, and coalescent simulations, the latter of which can include selection, recombination, and demographic fluctuations. jackalope can simulate single, paired-end, or mate-pair Illumina reads, as well as reads from Pacific Biosciences. These simulations include sequencing errors, mapping qualities, multiplexing, and optical/PCR duplicates. All outputs can be written to standard file formats.
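
To give a feel for what read simulation with sequencing errors means, here is a toy Python sketch that samples fixed-length reads from a reference and introduces substitution errors at a given rate. It is not jackalope's interface or error model; the function name, parameters, and reference string are invented for illustration.

```python
import random

def simulate_reads(reference, n_reads, read_len, error_rate=0.0, seed=0):
    """Sample reads uniformly from the reference and add substitution errors.

    Returns a list of (start, read) tuples. read_len must not exceed the
    reference length. Toy model: substitutions only, one fixed error rate.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        read = list(reference[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(read)))
    return reads

# Made-up reference sequence.
ref = "ACGTTGCAAGGCTTAACCGG"
for start, read in simulate_reads(ref, n_reads=3, read_len=8, error_rate=0.1):
    print(start, read)
```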

A swift, versatile phylogenomic and high-throughput sequencing simulator https://jackalope.lucasnell.com

Address of the bookmark: https://github.com/lucasnell/jackalope

MGRA: Breakpoint graphs and ancestral genome reconstructions

Jit — Tue, 25 Jul 2017 08:48:25 -0500

MGRA (Multiple Genome Rearrangements and Ancestors) is a tool for reconstruction of ancestor genomes and evolutionary history of extant genomes.

It takes as an input a set of genomes represented as sequences of genes (or synteny blocks) and produces such sequences for ancestral genomes at the internal nodes of the phylogenetic tree.

The phylogenetic tree may also be specified completely or partially; in the latter case, MGRA can reconstruct conserved ancestral regions (CARs) of the ancestral genome of interest.

Since version 2, MGRA supports gene insertions and deletions in addition to genome rearrangements and allows the input genomes to have different gene content.

It can also reconstruct the most plausible phylogenetic tree based on the rearrangement characters.
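
To illustrate what "genomes represented as sequences of genes" and "rearrangement characters" look like in practice, the Python sketch below counts breakpoints between two signed gene orders, where a sign marks the strand of each gene or synteny block. This is a far simpler measure than the breakpoint-graph analysis MGRA performs and is not its algorithm; the function names and gene orders are hypothetical, and chromosome ends are ignored.

```python
def adjacencies(order):
    """Set of adjacencies in a signed gene order on one linear chromosome.

    The adjacency (x, y) read on one strand equals (-y, -x) on the other,
    so each pair is stored in a single canonical form.
    """
    result = set()
    for x, y in zip(order, order[1:]):
        result.add(min((x, y), (-y, -x)))
    return result

def breakpoints(genome_a, genome_b):
    """Number of adjacencies of genome_a that are absent from genome_b."""
    return len(adjacencies(genome_a) - adjacencies(genome_b))

# Hypothetical genomes: B differs from A by an inversion of genes 2 and 3.
genome_a = [1, 2, 3, 4, 5]
genome_b = [1, -3, -2, 4, 5]
print(breakpoints(genome_a, genome_b))
```

A single inversion disrupts the two adjacencies at its ends, which is why the count is small for closely related gene orders and grows as genomes diverge.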

Address of the bookmark: http://mgra.cblab.org/

Genomicus: genome browser that enables users to navigate in genomes in several dimensions

Jit — Sat, 18 Nov 2017 16:10:16 -0600

Genomicus is a genome browser that enables users to navigate in genomes in several dimensions: linearly along chromosome axes, transversally across different species, and chronologically along evolutionary time.

Once a query gene has been entered, it is displayed in its genomic context in parallel to the genomic context of all its orthologous and paralogous copies in all the other sequenced metazoan genomes. Moreover, Genomicus stores and displays the predicted ancestral genome structure in all the ancestral species within the phylogenetic range of interest.

All the data on extant species displayed in this browser are from Ensembl.

Address of the bookmark: http://genomicus.biologie.ens.fr/genomicus-90.01/cgi-bin/search.pl

Scripts for the analysis of HGT in genome sequence data.

Jit — Wed, 29 Nov 2017 16:44:10 -0600

Scripts for the analysis of HGT in genome sequence data

Address of the bookmark: https://github.com/reubwn/hgt

kSNP3.0: SNP detection and phylogenetic analysis of genomes without genome alignment or reference genome

Jit — Fri, 08 Dec 2017 16:48:40 -0600

Sept. 20, 2017 Version 3.1 released. Major upgrade. Version 3.1 fixes the problems with SNP annotation that arose when NCBI discontinued use of GI numbers. Please read carefully the Preface (page 3) and the File of annotated genomes section (pages 9-10) in the version 3.1 User Guide. Thanks to Tom Slezak for revising the get_genbank_file3 script and to Tod Stuber (USDA) for testing version 3.1 even though he doesn't need the annotation feature. All users are encouraged to upgrade to version 3.1.

Address of the bookmark: https://sourceforge.net/projects/ksnp/files/