BOL: Related items

Steps to find all the repeats in the genome !

Neel — Thu, 31 Aug 2023 02:43:28 -0500

To find repeats of length 2 to 9 in a genome, you can use the RepeatMasker tool with the "-length" option[0] and then parse its output with a Perl script. Here's a step-by-step guide:

Install RepeatMasker: First, you need to install RepeatMasker on your system. You can download it from the RepeatMasker website[0].

Prepare the genome sequence: Make sure you have the genome sequence in a FASTA file format. Let's assume the file is named "genome.fasta".

Run RepeatMasker: Use the following command to analyze the genome:

./RepeatMasker -pa THREADS -nolow -norna -no_is -div DIVERGENCE -lib RepeatMaskerLib.embl -gff -xsmall -small -poly -species SPECIES -dir OUTPUT_DIR -length MIN-MAX genome.fasta

Replace the following placeholders with appropriate values:

THREADS: The number of processors/threads you want to use for parallel processing.
DIVERGENCE: The divergence value for the species you are analyzing. You can find divergence values for different species in the RepeatMasker documentation[0].
SPECIES: The name of the species you are analyzing.
OUTPUT_DIR: The directory where you want the output files to be saved.
MIN and MAX: The minimum and maximum lengths of the repeats you want to find (in this case, 2 and 9).

Before running, confirm with "./RepeatMasker -h" that your installed version supports each option above, -length in particular.

Analyze the output: RepeatMasker will generate several output files, including a .out file. You can parse this file to extract the information you need. There is a Perl tool called "one_code_to_find_them_all.pl" that can help you parse RepeatMasker output files[0]. You can download it from the source provided.
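
If you only need a quick look at the .out file before reaching for a dedicated tool, a few lines of Python are enough. The sketch below is a minimal illustration, not part of RepeatMasker or of "one_code_to_find_them_all.pl". It assumes the usual whitespace-separated .out column order (score, divergence, deletions, insertions, query, begin, end, left, strand, repeat name, class/family, ...) and that simple repeats are named like "(CA)n" with class "Simple_repeat". The function names and the example lines are made up for illustration.

```python
import re

def parse_rm_out(lines):
    """Yield one dict per annotation line of a RepeatMasker .out file."""
    for line in lines:
        fields = line.split()
        # Header and blank lines do not start with an integer score.
        if len(fields) < 14 or not fields[0].isdigit():
            continue
        yield {
            "query": fields[4],
            "start": int(fields[5]),
            "end": int(fields[6]),
            "strand": "-" if fields[8] == "C" else "+",
            "repeat": fields[9],
            "class_family": fields[10],
        }

def simple_repeats_by_unit(records, min_unit=2, max_unit=9):
    """Keep Simple_repeat hits such as "(CA)n" whose unit length is in range."""
    kept = []
    for rec in records:
        match = re.fullmatch(r"\(([ACGTN]+)\)n", rec["repeat"])
        if rec["class_family"] == "Simple_repeat" and match:
            if min_unit <= len(match.group(1)) <= max_unit:
                kept.append(rec)
    return kept

# Made-up example lines in the assumed .out column order.
EXAMPLE = """\
   SW   perc perc perc  query     position in query    matching  repeat        position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat    class/family  begin end (left)  ID

  252   0.0  0.0  0.0   chr1      101   130   (9870)  + (CA)n     Simple_repeat     1   30    (0)   1
  310  12.5  1.0  0.0   chr1      400   700   (9300)  C L1_example LINE/L1       (10)  900   600   2
   21   0.0  0.0  0.0   chr1      900   912   (9088)  + (A)n      Simple_repeat     1   13    (0)   3
"""

hits = simple_repeats_by_unit(parse_rm_out(EXAMPLE.splitlines()))
for hit in hits:
    print(hit["query"], hit["start"], hit["end"], hit["repeat"])
```

For a real file, pass the open file handle to parse_rm_out instead of the example lines.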

Use the provided Perl script: Once you have the "one_code_to_find_them_all.pl" script, you can run it to conveniently parse the RepeatMasker output files. Here's an example of how to use it:

perl one_code_to_find_them_all.pl --rm RM_OUT_FILE --length LENGTHS_FILE

Replace RM_OUT_FILE with the path to your RepeatMasker .out file, and LENGTHS_FILE with the path to a file containing the lengths of the reference elements.

This script will generate several output files, including .log.txt and .copynumber.csv, which contain quantitative information about the identified repeat elements.

Remember to adjust the parameters and options according to your specific needs and the characteristics of your genome.
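
As a cross-check that does not depend on any tool option, perfect tandem repeats with a unit of 2 to 9 bases can also be located directly with a regular expression. This is a minimal sketch and not how RepeatMasker works internally. It only finds exact copies, so interrupted or mutated repeats are missed. The function name and the default of at least three copies are assumptions chosen for illustration.

```python
import re

def find_tandem_repeats(seq, min_unit=2, max_unit=9, min_copies=3):
    """Return (start, end, unit, copies) for perfect tandem repeats.

    start is 0-based and end is exclusive.
    """
    # A lazily matched unit of min_unit..max_unit bases, followed by
    # at least (min_copies - 1) further exact copies of that unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    results = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        # Skip homopolymer runs such as AAAAAA reported with unit "AA".
        if len(set(unit)) == 1:
            continue
        copies = (match.end() - match.start()) // len(unit)
        results.append((match.start(), match.end(), unit, copies))
    return results

for start, end, unit, copies in find_tandem_repeats(
        "GGTCACACACACATTGATAGATAGATAGCC"):
    print(start, end, unit, copies)
```

For a whole genome you would call the function once per FASTA record and add the record name to each result.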

Tools to assess the quality of your assembled genome !

LEGE — Thu, 08 Aug 2024 23:31:18 -0500

FASTA VALIDATOR + SEQKIT RMDUP: FASTA validation
GENOMETOOLS GT GFF3VALIDATOR: GFF3 validation
ASSEMBLATHON STATS: Assembly statistics
GENOMETOOLS GT STAT: Annotation statistics
NCBI FCS ADAPTOR: Adaptor contamination pass/fail
NCBI FCS GX: Foreign organism contamination pass/fail
BUSCO: Gene-space completeness estimation
TIDK: Telomere repeat identification
LAI: Continuity of repetitive sequences
KRAKEN2: Taxonomy classification
HIC CONTACT MAP: Alignment and visualisation of HiC data
MUMMER → CIRCOS + DOTPLOT & MINIMAP2 → PLOTSR: Synteny analysis
MERQURY: K-mer completeness, consensus quality and phasing assessment

Step-by-Step Guide to Running Genome Assembly

Abhi — Fri, 13 Dec 2024 11:35:55 -0600

Genome assembly is a critical process in bioinformatics, enabling the reconstruction of an organism's genome from short DNA sequence reads. Whether you’re working on a new microbial genome or a complex eukaryotic organism, this guide will walk you through the steps of genome assembly using state-of-the-art tools and best practices.

What is Genome Assembly?

Genome assembly involves piecing together DNA sequence reads generated by sequencing platforms (e.g., Illumina, PacBio, Oxford Nanopore) into longer, contiguous sequences called contigs. This can be performed as:

De Novo Assembly: Without a reference genome.
Reference-Guided Assembly: Using a reference genome to guide the assembly process.
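
To make the idea of "piecing reads together" concrete, here is a toy de novo assembler that repeatedly merges the two reads with the longest suffix-prefix overlap. It is a teaching sketch only. Real assemblers such as those discussed later use far more elaborate graph-based methods and handle sequencing errors, reverse complements, and repeats, none of which this code attempts. The function names and the minimum overlap of 3 are assumptions for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two reads with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
    return contigs

reads = ["ATGGCGTGCA", "GTGCAATCCG", "ATCCGTTAGC"]
print(greedy_assemble(reads))
```

The three example reads overlap by five bases each, so they collapse into a single contig. Reads with no overlap are returned unchanged as separate contigs.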

Step 1: Preparing Your Data

Before starting the assembly, ensure that your raw sequencing data is high quality.

Input Data
- Short Reads: Illumina sequencing generates short, accurate reads ideal for scaffolding.
- Long Reads: PacBio and Nanopore sequencing provide long reads for resolving repetitive regions.
Quality Control (QC)
Use tools like FastQC or MultiQC to assess the quality of your reads:

fastqc reads.fastq
multiqc .

Look for issues like low-quality bases, adapter contamination, or overrepresented sequences.
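
To see what "low-quality bases" means in the file itself, the sketch below decodes the quality line of each FASTQ record and flags reads with a low mean Phred score. It is a simplified illustration, not a replacement for FastQC. It assumes plain four-line records with Phred+33 encoding, and the threshold of 20 and the function names are hypothetical choices.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = lines[i:i + 4]
        yield header[1:], seq, qual

def phred_scores(quality, offset=33):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(char) - offset for char in quality]

def low_quality_reads(records, min_mean=20):
    """Return names of reads whose mean Phred score is below min_mean."""
    flagged = []
    for name, _seq, qual in records:
        scores = phred_scores(qual)
        if scores and sum(scores) / len(scores) < min_mean:
            flagged.append(name)
    return flagged

# Two made-up reads: the second has mostly very low quality scores.
EXAMPLE = """\
@read1
ACGTACGT
+
IIIIIIII
@read2
ACGTACGT
+
II##!!##
"""
print(low_quality_reads(read_fastq(EXAMPLE.splitlines())))
```
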
Read Trimming and Filtering
Trim low-quality bases and adapters using Trimmomatic or Cutadapt:

trimmomatic PE reads_R1.fastq reads_R2.fastq \
  trimmed_R1.fastq unpaired_R1.fastq trimmed_R2.fastq unpaired_R2.fastq \
  ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36
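
The SLIDINGWINDOW:4:20 and MINLEN:36 settings can be pictured with a small sketch: slide a 4-base window along the read, cut where the window's mean quality first drops below 20, then discard reads that end up too short. This is a simplified approximation for illustration. Trimmomatic's own implementation may differ in detail, and the function names here are hypothetical.

```python
def sliding_window_trim(seq, quals, window=4, min_mean=20):
    """Cut the read at the start of the first window whose mean quality
    falls below min_mean; return the kept (sequence, qualities)."""
    for start in range(0, len(quals) - window + 1):
        chunk = quals[start:start + window]
        if sum(chunk) / window < min_mean:
            return seq[:start], quals[:start]
    return seq, quals

def passes_minlen(seq, min_len=36):
    """True if the trimmed read is long enough to keep."""
    return len(seq) >= min_len

# Made-up read whose quality decays towards the 3' end.
seq = "ACGTACGTACGT"
quals = [38, 38, 37, 36, 35, 30, 28, 12, 10, 8, 5, 2]
trimmed_seq, trimmed_quals = sliding_window_trim(seq, quals)
print(trimmed_seq, passes_minlen(trimmed_seq))
```

In this example the read is cut after six bases, which is far below the MINLEN of 36, so the read would be dropped.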

Step 2: Choosing an Assembly Strategy

Select an assembly strategy based on your data type:

Short-Read Assemblers:
- SPAdes: Popular for microbial genomes.
- Velvet: Fast for smaller genomes.
Long-Read Assemblers:
- Canu: Ideal for long-read datasets.
- Flye: Versatile for small and large genomes.
Hybrid Assemblers:
- MaSuRCA: Combines short and long reads.
- Unicycler: Optimized for bacterial genomes.

Step 3: Running the Assembly

3.1. SPAdes (Short-Read Assembly)

SPAdes is an excellent choice for small genomes, such as bacteria.

spades.py -1 trimmed_R1.fastq -2 trimmed_R2.fastq -o spades_output

The output includes assembled contigs (contigs.fasta) and scaffolds (scaffolds.fasta).

3.2. Canu (Long-Read Assembly)

Canu is designed for high-error long reads from PacBio or Nanopore.

canu -p genome -d canu_output genomeSize=4.7m -nanopore-raw reads.fastq

The output will be in canu_output/genome.contigs.fasta.

3.3. Hybrid Assembly with Unicycler

Unicycler combines short and long reads for improved assemblies.

unicycler -1 trimmed_R1.fastq -2 trimmed_R2.fastq -l long_reads.fastq -o unicycler_output

Step 4: Assessing Assembly Quality

After assembly, evaluate its quality using the following tools:

QUAST
QUAST generates assembly statistics, such as N50, genome size, and GC content:

quast contigs.fasta -o quast_output
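
If N50 is unfamiliar, it helps to compute it by hand once. The sketch below is a minimal illustration of the statistic, not QUAST's code: sort contigs from longest to shortest and report the length of the contig at which the running total reaches half of the assembly size. The function names and example lengths are made up.

```python
def n50(lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, reaches half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

contig_lengths = [100, 80, 60, 40, 20]
print(n50(contig_lengths))
print(gc_content("ATGC"))
```

Here the total is 300, and the two longest contigs (100 + 80) already cover more than half of it, so the N50 is 80.
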

BUSCO
BUSCO checks genome completeness by identifying conserved genes:

busco -i contigs.fasta -o busco_output -l fungi_odb10 -m genome

Assembly Graph Visualization
Visualize assembly graphs with Bandage:

Bandage load assembly_graph.gfa

Step 5: Post-Assembly Steps

Polishing
Improve assembly accuracy using tools like Pilon (for short reads) or Racon (for long reads).

racon long_reads.fasta mapped_reads.sam contigs.fasta > polished_contigs.fasta

Scaffolding
Link contigs into scaffolds using tools like SSPACE or Opera-LG if required.
Annotation
Annotate the assembled genome using Prokka for prokaryotes or Maker for eukaryotes.

prokka --outdir annotation_output --prefix genome contigs.fasta

Step 6: Sharing and Archiving

Submit to Public Repositories
Share your assembly in databases like NCBI GenBank, ENA, or DDBJ.
Metadata Preparation
Include detailed metadata for your submission, such as organism name, sequencing platform, and coverage.

Best Practices

Always perform quality checks at each stage to ensure data integrity.
Use multiple tools to cross-validate results when working with complex genomes.
Document parameters and software versions for reproducibility.

Conclusion

Genome assembly is a powerful process that transforms raw sequencing data into a coherent representation of an organism’s genome. By following this step-by-step guide, you can successfully assemble genomes and uncover valuable biological insights. Whether you’re assembling a microbial genome or tackling the complexities of a eukaryotic genome, these tools and strategies will set you on the path to success.

HiTE: a fast and accurate dynamic boundary adjustment approach for full-length Transposable Elements detection and annotation in Genome Assemblies

LEGE — Sat, 20 Sep 2025 09:34:04 -0500

HiTE is a Python software that uses a dynamic boundary adjustment approach to detect and annotate full-length Transposable Elements in Genome Assemblies. In comparison to other tools, HiTE demonstrates superior performance in detecting a greater number of full-length TEs.

panHiTE

We have developed panHiTE, a comprehensive and accurate pipeline for TE detection in large-scale population genomes. It has been successfully applied to hundreds of plant population genomes, demonstrating its effectiveness and scalability.

For detailed instructions, please refer to the panHiTE tutorial.

Address of the bookmark: https://github.com/CSU-KangHu/HiTE

simNGS and simLibrary – Software for Simulating Next-Gen Sequencing Data

Jit — Tue, 28 Nov 2017 06:49:11 -0600

simNGS is software for simulating observations from Illumina sequencing machines using the statistical models behind the AYB base-calling software. By default, observations only incorporate noise due to sequencing and do not incorporate effects from more esoteric sources of noise that may be present in real data ("dust", bubbles, merged clusters, sequence-heterogeneous clusters, etc). Many of these additional sources may optionally be applied.

simNGS takes fasta format sequences and a file describing the covariance of noise between bases and cycles observed in an actual run of the machine, randomly generates noisy intensities representing the signals for the sequence at each cycle and calculates likelihoods for all possible base calls.
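
The general idea of turning a sequence into noisy per-cycle intensities and then scoring every possible base call can be sketched in a few lines. This is a deliberately simplified toy and not simNGS's model. It assumes one intensity channel per base, independent Gaussian noise with a single standard deviation, and no covariance between bases or cycles. All names and parameter values are hypothetical.

```python
import math
import random

BASES = "ACGT"

def simulate_intensities(seq, sigma=0.2, rng=None):
    """One list of four noisy channel intensities per sequencing cycle."""
    rng = rng or random.Random()
    return [
        [(1.0 if base == channel else 0.0) + rng.gauss(0.0, sigma)
         for channel in BASES]
        for base in seq
    ]

def base_likelihoods(intensity, sigma=0.2):
    """Normalised likelihood of each base given one cycle's intensities."""
    weights = []
    for candidate in BASES:
        # Squared distance between observed and expected intensities.
        sq_err = sum(
            (obs - (1.0 if channel == candidate else 0.0)) ** 2
            for obs, channel in zip(intensity, BASES)
        )
        weights.append(math.exp(-sq_err / (2 * sigma ** 2)))
    total = sum(weights)
    return {base: w / total for base, w in zip(BASES, weights)}

def call_bases(intensities, sigma=0.2):
    """Pick the most likely base at every cycle."""
    calls = []
    for intensity in intensities:
        probs = base_likelihoods(intensity, sigma)
        calls.append(max(probs, key=probs.get))
    return "".join(calls)

rng = random.Random(0)  # fixed seed so the example is repeatable
signal = simulate_intensities("ACGTTGCA", sigma=0.05, rng=rng)
print(call_bases(signal, sigma=0.05))
```

Raising sigma makes the channels overlap more, so the likelihoods flatten and miscalls start to appear.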

Address of the bookmark: https://www.ebi.ac.uk/goldman-srv/simNGS/

sim3C: Read-pair simulation of 3C-based sequencing methodologies (HiC, Meta3C, DNase-HiC)

Jit — Tue, 13 Nov 2018 07:25:38 -0600

Required python modules

biopython
intervaltree
numpy
scipy
tqdm
PyYAML

Address of the bookmark: https://github.com/cerebis/sim3C

04- Informatics Approach to Cancer - Interview with Dr. Joel Saltz

Mon, 07 Oct 2013 14:35:43 -0500

For additional information visit http://www.cancerquest.org/joel-saltz-interview. Dr. Joel Saltz is a Professor in the Departments of Pathology, Biostatistics and Bioinformatics, and Mathematics and Computer Science at Emory University. Dr. Saltz's research on bioinformatics spans several disciplines. One project involves applying computer analysis to medical imaging to yield better results for patients. As an example, a computer program may be able to help doctors detect small cancers in a CT scan or mammogram. In this interview segment, Dr. Saltz discusses the informatics approach to cancer. To learn more about cancer and watch additional interviews, please visit the CancerQuest website at http://www.cancerquest.org.

Oldest Hominin DNA Sequenced

Surajeet — Fri, 27 Dec 2013 19:58:31 -0600

Matthias Meyer and his team from the Max Planck Institute for Evolutionary Anthropology in Leipzig, Germany, have developed new techniques for retrieving and sequencing highly degraded ancient DNA. They then joined forces with Juan-Luis Arsuaga and applied the new techniques to a cave bear from the Sima de los Huesos site. After this success, the researchers sampled two grams of bone powder from a hominin thigh bone from the cave. They extracted its DNA and sequenced the genome of the mitochondria or mtDNA, a small part of the genome that is passed down along the maternal line and occurs in many copies per cell. The researchers then compared this ancient mitochondrial DNA with Neandertals, Denisovans, present-day humans, and apes.

From the missing mutations in the old DNA sequences the researchers calculated that the Sima hominin lived about 400,000 years ago. They also found that it shared a common ancestor with the Denisovans, an extinct archaic group from Asia related to the Neandertals, about 700,000 years ago. "The fact that the mtDNA of the Sima de los Huesos hominin shares a common ancestor with Denisovan rather than Neandertal mtDNAs is unexpected since its skeletal remains carry Neandertal-derived features," says Matthias Meyer. Considering their age and Neandertal-like features, the Sima hominins were likely related to the population ancestral to both Neandertals and Denisovans. Another possibility is that gene flow from yet another group of hominins brought the Denisova-like mtDNA into the Sima hominins or their ancestors.

Reference

http://www.sciencedaily.com/releases/2013/12/131204132018.htm

Landry Lab

Thu, 17 Jul 2014 14:33:57 -0500

EVOLUTIONARY AND INTEGRATIVE CELL BIOLOGY

Our research is at the crossroads of cell biology, ecological genomics, systems biology, molecular evolution and population genetics. We study the architecture and evolution of protein and signalling networks.

More at http://landrylab.ibis.ulaval.ca/

Yannick Wurm Lab

Thu, 07 Aug 2014 18:02:37 -0500

Evolutionary genomics of social insects. Extensive theoretical work has explained how and why complex societies evolve. However, little is known about the genes and molecular mechanisms responsible for social phenotypes. We have been identifying genes and mechanisms involved in the evolution of insect societies using modern genomics tools (Illumina, RNAseq, RADseq...). For example, we recently:

1. sequenced and analyzed the genome of the invasive red fire ant Solenopsis invicta (PNAS 2011)

2. discovered that a fundamental social trait in this species (how many queens are accepted in the colony) is determined by variants of a social chromosome (Nature 2013).

3. described the gene expression changes that occur in a virgin queen when she is given the opportunity of replacing her mother (Mol Ecol 2010).

Homepage: http://yannick.poulet.org/