BOL: Related items

Steps to find all the repeats in the genome!

Neel — Thu, 31 Aug 2023 02:43:28 -0500

To find repeats of length 2 to 9 in a genome, you can run the RepeatMasker tool with the "--length" option[0] and then parse its output with a Perl script. Here's a step-by-step guide:

Install RepeatMasker: First, you need to install RepeatMasker on your system. You can download it from the RepeatMasker website[0].

Prepare the genome sequence: Make sure you have the genome sequence in a FASTA file format. Let's assume the file is named "genome.fasta".

Run RepeatMasker: Run the following command on the genome file:

./RepeatMasker -pa <threads> -nolow -norna -no_is -div <divergence> -lib RepeatMaskerLib.embl -gff -xsmall -small -poly -species <species> -dir <output_dir> -length <min_length>-<max_length> genome.fasta

Replace the following placeholders with appropriate values:

<threads>: The number of processors/threads you want to use for parallel processing.
<divergence>: The divergence value for the species you are analyzing. You can find divergence values for different species in the RepeatMasker documentation[0].
<species>: The name of the species you are analyzing.
<output_dir>: The directory where you want the output files to be saved.
<min_length> and <max_length>: The minimum and maximum lengths of the repeats you want to find (in this case, 2 and 9).

Analyze the output: RepeatMasker will generate several output files, including a .out file. You can parse this file to extract the information you need. There is a Perl tool called "one_code_to_find_them_all.pl" that can help you parse RepeatMasker output files[0]. You can download it from the source provided.
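As a rough illustration of what parsing the .out file involves, here is a minimal Python sketch (it is not the Perl tool mentioned above). It assumes the usual whitespace-separated .out layout, in which the query name, start, end, strand, repeat name and class/family are the 5th, 6th, 7th, 9th, 10th and 11th columns, and that simple repeats are named like (CA)n. The names parse_rm_out and simple_repeats, and the sample lines, are made up for this example.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class RepeatHit(NamedTuple):
    query: str
    start: int
    end: int
    strand: str
    name: str
    family: str


def parse_rm_out(lines: Iterable[str]) -> Iterator[RepeatHit]:
    """Yield one RepeatHit per annotation line of a RepeatMasker .out file."""
    for line in lines:
        fields = line.split()
        # Header and blank lines do not start with a numeric score; skip them.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        yield RepeatHit(fields[4], int(fields[5]), int(fields[6]),
                        fields[8], fields[9], fields[10])


# Assumption: simple repeats are named "(MOTIF)n", e.g. "(CA)n".
MOTIF = re.compile(r"^\(([A-Z]+)\)n$")


def simple_repeats(hits, min_unit=2, max_unit=9):
    """Keep hits whose repeat unit is min_unit..max_unit bases long."""
    for hit in hits:
        match = MOTIF.match(hit.name)
        if match and min_unit <= len(match.group(1)) <= max_unit:
            yield hit


# Hypothetical sample in .out layout (values invented for illustration).
SAMPLE = """\
   SW   perc perc perc  query  position in query   matching repeat        position in repeat
score   div. del. ins.  sequence begin end (left)  repeat   class/family  begin end (left) ID

   18  10.0  0.0  0.0  chr1  101  130  (9870) +  (CA)n  Simple_repeat  1  30  (0)  1
  250  12.5  1.0  0.5  chr1  500  800  (9200) C  AluSx  SINE/Alu  (10)  300  1  2
   30   0.0  0.0  0.0  chr2   10   60  (100)  +  (A)n   Simple_repeat  1  51  (0)  3
"""

if __name__ == "__main__":
    for hit in simple_repeats(parse_rm_out(SAMPLE.splitlines())):
        print(hit.query, hit.start, hit.end, hit.name)
```

To use it on real output, pass an open file handle for the .out file to parse_rm_out instead of the sample lines.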

Use the provided Perl script: Once you have the "one_code_to_find_them_all.pl" script, you can run it to conveniently parse the RepeatMasker output files. Here's an example of how to use it:

perl one_code_to_find_them_all.pl --rm <rm_output_file> --length <length_file>

Replace <rm_output_file> with the path to your RepeatMasker .out file, and <length_file> with the path to a file containing the lengths of the reference elements.

This script will generate several output files, including .log.txt and .copynumber.csv, which contain quantitative information about the identified repeat elements.

Remember to adjust the parameters and options according to your specific needs and the characteristics of your genome.
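For short tandem repeats specifically, a direct scan of the FASTA file is another option. The sketch below is not RepeatMasker's algorithm; it is a plain regular-expression search that assumes exact (perfect) tandem copies, a repeat unit of 2 to 9 bases and at least three copies. The names read_fasta and find_tandem_repeats are illustrative only.

```python
import re


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_tandem_repeats(seq, min_unit=2, max_unit=9, min_copies=3):
    """Yield (start, end, unit, copies) for perfect tandem repeats.

    Coordinates are 0-based and end-exclusive. The lazy quantifier makes the
    regex prefer the shortest unit, so (CA)n is reported as CA, not CACA.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        # A unit such as "AA" is really a mononucleotide run; skip it.
        if len(set(unit)) == 1:
            continue
        copies = (match.end() - match.start()) // len(unit)
        yield match.start(), match.end(), unit, copies


if __name__ == "__main__":
    for start, end, unit, copies in find_tandem_repeats("GGCACACACATT"):
        print(start, end, unit, copies)
```

With a real genome, open genome.fasta, loop over read_fasta(handle) and call find_tandem_repeats on each sequence. Imperfect repeats (with mismatches or indels) are not detected by this approach.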

collinearity: scripts to parse and analyse MCScanX collinearity output

Jit — Wed, 29 Nov 2017 16:47:54 -0600

Scripts to parse and analyse MCScanX collinearity output.

Address of the bookmark: https://github.com/reubwn/collinearity

Circleator

Jit — Sun, 25 Jun 2017 18:04:32 -0500

The Charm City Circleator, or Circleator for short, is a Perl-based visualization tool developed at the Institute for Genome Sciences in the University of Maryland's School of Medicine. Circleator produces circular plots of genome-associated data.

Common uses of the tool include:

Displaying the sequence and/or genes in a GenBank flat file.
Highlighting differences and/or similarities in gene content between related organisms.
Comparing SNPs and indels between closely-related strains or serovars.
Comparing gene expression values across multiple samples or timepoints.
Visualizing coverage plots of RNA-Seq read alignments.

Key Features

Circleator...

Builds on BioPerl and the input file formats that it supports, including:
- GenBank flat files, GFF, FASTA
Accepts a number of other commonly-used datatypes and file formats:
- BSR and TRF output, SAM/BAM files, VCF-encoded SNPs, tab-delimited files
Outputs publication-ready figures in the SVG (Scalable Vector Graphics) format.
Requires only a single configuration file whose layout mirrors that of the figure itself.
- Predefined configuration files and "track" types are supplied for common datasets.
- Advanced features allow limited analyses to be performed as a figure is drawn.
Includes an extensive set of regression tests.
Offers a prototype web-based GUI (under the "Ringmaster" project.)

Address of the bookmark: https://github.com/jonathancrabtree/Circleator

Perlbrew: admin-free perl installation management tool.

Jit — Wed, 12 Jul 2017 03:53:08 -0500

perlbrew is an admin-free perl installation management tool. The latest version listed here is 0.79; see the release notes for Release 0.79.

Copy & Paste this line into your terminal:

\curl -L https://install.perlbrew.pl | bash

Or, if your system does not have curl but something else:

# Linux
\wget -O - https://install.perlbrew.pl | bash

# FreeBSD
\fetch -o- https://install.perlbrew.pl | sh

If you prefer to install with cpan, there are two steps:

sudo cpan App::perlbrew
perlbrew init

If it is installed with cpan, the perlbrew executable should be installed as /usr/bin/perlbrew or /usr/local/bin/perlbrew. Every user who wants to use perlbrew must first run perlbrew init.

The default perlbrew root directory is ~/perl5/perlbrew, which can be changed by setting the PERLBREW_ROOT environment variable before installation and initialization. For more advanced installation options, please read the perlbrew documentation.

Address of the bookmark: https://perlbrew.pl/

VCFtools: perform common tasks with VCF files such as file validation, file merging, intersecting, complements

Rahul Nayak — Tue, 07 Aug 2018 10:01:46 -0500

VCFtools contains a Perl API (Vcf.pm) and a number of Perl scripts that can be used to perform common tasks with VCF files such as file validation, file merging, intersecting, complements, etc. The Perl tools support all versions of the VCF specification (3.2, 3.3, 4.0, 4.1 and 4.2); nevertheless, users are encouraged to use the latest versions, VCFv4.1 or VCFv4.2. VCFtools in general has been used mainly with diploid data, but the Perl tools aim to support polyploid data as well. Run any of the Perl scripts with the --help switch to obtain more help.
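To see which of these versions a given file declares, you can look at its first line, which has the form ##fileformat=VCFv4.2. The following Python sketch is only an illustration and is not part of VCFtools; vcf_version and describe are hypothetical helper names, and the version lists are taken from the paragraph above.

```python
# Versions named in the VCFtools description above.
ALL_VERSIONS = ("3.2", "3.3", "4.0", "4.1", "4.2")
RECOMMENDED = ("4.1", "4.2")


def vcf_version(first_line):
    """Return the version from a '##fileformat=VCFvX.Y' line, or None."""
    prefix = "##fileformat=VCFv"
    line = first_line.strip()
    return line[len(prefix):] if line.startswith(prefix) else None


def describe(first_line):
    """Summarise how the declared version relates to the lists above."""
    version = vcf_version(first_line)
    if version is None:
        return "not a VCF fileformat line"
    if version in RECOMMENDED:
        return "VCFv%s (recommended)" % version
    if version in ALL_VERSIONS:
        return "VCFv%s (supported, consider 4.1 or 4.2)" % version
    return "VCFv%s (not in the supported list)" % version


if __name__ == "__main__":
    print(describe("##fileformat=VCFv4.0"))
```

For a compressed file, read the first line through Python's gzip module in text mode before passing it to describe.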

Many of the Perl scripts require that the VCF files are compressed by bgzip and indexed by tabix (both tools are part of the tabix package). The VCF files can be compressed and indexed using the following commands:

bgzip my_file.vcf
tabix -p vcf my_file.vcf.gz
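A file compressed with plain gzip instead of bgzip cannot be indexed by tabix, so it can help to check the file first. The Python sketch below is an assumption-laden illustration, not part of VCFtools or tabix: it checks only the first 16 bytes for a gzip header with the extra-field flag set and a "BC" subfield in the position where bgzip normally writes it. The function names are hypothetical.

```python
def looks_like_bgzf(header: bytes) -> bool:
    """Return True if the bytes look like the start of a BGZF block."""
    return (
        len(header) >= 16
        and header[:3] == b"\x1f\x8b\x08"   # gzip magic number + deflate
        and bool(header[3] & 0x04)          # FEXTRA flag is set
        and header[12:14] == b"BC"          # BGZF subfield identifier
    )


def is_bgzf_file(path) -> bool:
    """Check the first block header of the file at the given path."""
    with open(path, "rb") as handle:
        return looks_like_bgzf(handle.read(16))
```

For a file produced by the bgzip command above, is_bgzf_file("my_file.vcf.gz") should return True, while a file made with ordinary gzip should return False.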

Address of the bookmark: http://vcftools.sourceforge.net/perl_module.html

ACANA: An accurate and consistent alignment tool for DNA sequences

Jit — Wed, 06 Dec 2017 09:45:29 -0600

ACANA is an accurate and consistent alignment tool for DNA sequences. ACANA is specifically designed for aligning sequences that share only some moderately conserved regions and/or have a high frequency of long insertions or deletions. It attempts to combine the best of local and global alignment algorithms in searching for evolutionarily related regions of sequences in order to achieve the best alignment. ACANA is also robust to small changes in alignment parameters, particularly the gap extension score. As an accurate alignment tool, ACANA is particularly useful in comparative sequence analysis for identifying conserved functional regulatory elements.

Address of the bookmark: https://www.niehs.nih.gov/research/resources/software/biostatistics/acana/index.cfm

SMASH: An alignment-free tool to find and visualise rearrangements between pairs of DNA sequences

Jit — Thu, 21 Dec 2017 08:26:57 -0600

SMASH is a completely alignment-free method to find and visualise rearrangements between pairs of DNA sequences. The detection is based on relative compression, namely using an FCM (finite-context model), also known as a Markov model, of high context order (typically 20). The method is implemented in a tool (also called SMASH). For visualization, SMASH outputs an SVG image, with an ideogram output architecture, where the patterns are represented with several HSV values (only value varies).

[Image: information maps between human and chimpanzee for the several chromosomes]

Address of the bookmark: https://github.com/pratas/smash

Heap: a highly sensitive and accurate SNP detection tool for low-coverage high-throughput sequencing data

Jit — Thu, 19 Apr 2018 08:06:03 -0500

Heap enables robustly sensitive and accurate calling of SNPs, particularly with low-coverage NGS data, which must be aligned to the reference genome sequences in advance. To reduce false positive SNPs, Heap determines genotypes and calls SNPs at each site except for sites at either end of reads or sites containing a minor allele supported by only one read. Performance comparison with existing tools showed that Heap achieved the highest F-scores with low-coverage (7X) restriction-site associated DNA sequencing reads of sorghum and rice individuals. This will facilitate cost-effective GWAS and GP studies in this NGS era. Code and documentation of Heap are freely available from https://github.com/meiji-bioinf/heap and our web site (http://bioinf.mind.meiji.ac.jp/lab/en/tools.html).

Address of the bookmark: https://github.com/meiji-bioinf/heap

Porechop: tool for finding and removing adapters from Oxford Nanopore reads

Rahul Nayak — Tue, 29 May 2018 07:33:44 -0500

Porechop is a tool for finding and removing adapters from Oxford Nanopore reads. Adapters on the ends of reads are trimmed off, and when a read has an adapter in its middle, it is treated as chimeric and chopped into separate reads. Porechop performs thorough alignments to effectively find adapters, even at low sequence identity.

Porechop also supports demultiplexing of Nanopore reads that were barcoded with the Native Barcoding Kit, PCR Barcoding Kit or Rapid Barcoding Kit.

Address of the bookmark: https://github.com/rrwick/Porechop

HiGlass: a tool for exploring genomic contact matrices and tracks.

Jit — Mon, 11 Jun 2018 09:44:49 -0500

HiGlass is a tool for exploring genomic contact matrices and tracks. Please take a look at the examples and documentation for a description of the ways that it can be configured to explore and compare contact matrices. To load private data, HiGlass can be run locally within a Docker container. The Hi-C data in the examples is from Rao et al. (2014).

Address of the bookmark: http://higlass.io/