BOL: Related items

Steps to find all the repeats in the genome!

Neel — Thu, 31 Aug 2023 02:43:28 -0500

To find repeats of length 2 to 9 in a genome, you can run the RepeatMasker tool and then parse its output with a Perl script[0]. Here's a step-by-step guide:

Install RepeatMasker: First, you need to install RepeatMasker on your system. You can download it from the RepeatMasker website[0].

Prepare the genome sequence: Make sure you have the genome sequence in a FASTA file format. Let's assume the file is named "genome.fasta".

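Run RepeatMasker: Run the following command, where the values in angle brackets are placeholders:

./RepeatMasker -pa <threads> -nolow -norna -no_is -div <divergence> -lib RepeatMaskerLib.embl -gff -xsmall -small -poly -species <species> -dir <output_dir> -length <min>-<max> genome.fasta

Replace the following placeholders with appropriate values:

<threads>: The number of processors/threads you want to use for parallel processing.
<divergence>: The divergence cutoff for your analysis; with -div, only repeats less than this percentage diverged from their consensus are masked. See the RepeatMasker documentation[0] for guidance on choosing a value.
<species>: The name of the species you are analyzing.
<output_dir>: The directory where you want the output files to be saved.
<min> and <max>: The minimum and maximum lengths of the repeats you want to find (in this case, 2 and 9).

Check the command against the output of RepeatMasker -help for your installed version before running it: the -length option is reproduced here as posted and is not among the options RepeatMasker normally documents, and -nolow tells RepeatMasker not to mask low-complexity DNA or simple repeats, which is where repeats with such short units are usually reported.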

Analyze the output: RepeatMasker will generate several output files, including a .out file. You can parse this file to extract the information you need. There is a Perl tool called "one_code_to_find_them_all.pl" that can help you parse RepeatMasker output files[0]. You can download it from the source provided.
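For a quick look at the .out file before reaching for a dedicated tool, the parsing step can be sketched in a few lines of Python. This is only a minimal sketch, not part of RepeatMasker or of one_code_to_find_them_all.pl: the function name parse_simple_repeats and the sample rows are made up for illustration, and it assumes the usual whitespace-separated .out layout in which column 5 is the query sequence, columns 6 and 7 are the begin and end positions, column 10 is the repeat name (for example "(CA)n") and column 11 is the class/family.

```python
import re

# Simple repeats are usually named like "(CA)n"; capture the repeat unit.
MOTIF_RE = re.compile(r"^\(([ACGTN]+)\)n$")


def parse_simple_repeats(lines, min_unit=2, max_unit=9):
    """Return (query, begin, end, motif) for simple repeats whose unit
    length lies between min_unit and max_unit (inclusive)."""
    hits = []
    for line in lines:
        fields = line.split()
        # Data rows start with an integer score; header and blank lines do not.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        query, begin, end = fields[4], int(fields[5]), int(fields[6])
        name, rep_class = fields[9], fields[10]
        match = MOTIF_RE.match(name)
        if rep_class != "Simple_repeat" or match is None:
            continue
        motif = match.group(1)
        if min_unit <= len(motif) <= max_unit:
            hits.append((query, begin, end, motif))
    return hits


# Hypothetical rows in the .out layout, used only to exercise the parser.
sample = """\
   SW   perc perc perc  query     position in query    matching  repeat         position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat    class/family   begin  end (left)  ID

   18    0.0  0.0  0.0  chr1        100   121   (879) +  (CA)n     Simple_repeat      1   22    (0)   1
  250   12.5  1.0  0.0  chr1        300   450   (550) C  AluY      SINE/Alu        (10)  290   140    2
   21    4.0  0.0  0.0  chr1        600   640   (360) +  (ATTTTGCCAGT)n Simple_repeat  1   41    (0)   3
""".splitlines()

for hit in parse_simple_repeats(sample):
    print(hit)
```

On a real run you would pass the lines of your own .out file instead of the sample rows, for example by opening the file and handing the file object to parse_simple_repeats.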

Use the provided Perl script: Once you have the "one_code_to_find_them_all.pl" script, you can run it to conveniently parse the RepeatMasker output files. Here's an example of how to use it:

perl one_code_to_find_them_all.pl --rm <rm_out_file> --length <length_file>

Replace <rm_out_file> with the path to your RepeatMasker .out file, and <length_file> with the path to a file containing the lengths of the reference elements.

This script will generate several output files, including .log.txt and .copynumber.csv, which contain quantitative information about the identified repeat elements.

Remember to adjust the parameters and options according to your specific needs and the characteristics of your genome.
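As a cross-check that does not depend on RepeatMasker at all, perfect tandem repeats with a unit of 2 to 9 bases can also be scanned for directly with a regular expression. The sketch below is a different, much simpler technique than the one described in this post: it only finds exact, uninterrupted copies, and the function name find_short_tandem_repeats and the default of at least three copies are assumptions made for illustration.

```python
import re


def find_short_tandem_repeats(seq, min_unit=2, max_unit=9, min_copies=3):
    """Return (start, end, unit, copies) for perfect tandem repeats.

    Coordinates are 1-based and inclusive. The unit is matched lazily, so
    the shortest repeating unit is reported (ACACAC is "AC", not "ACAC").
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        # Skip homopolymer runs, which would otherwise show up as "AA", "TT", ...
        if len(set(unit)) == 1:
            continue
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start() + 1, match.end(), unit, copies))
    return hits


# Toy sequence: five copies of AC and three copies of GATC.
toy = "TTG" + "AC" * 5 + "GGT" + "GATC" * 3 + "A"
print(find_short_tandem_repeats(toy))
# prints [(4, 13, 'AC', 5), (17, 28, 'GATC', 3)]
```

For a genome in "genome.fasta" you would apply the function to each sequence record in turn; imperfect or interrupted repeats need a dedicated tool.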

collinearity: scripts to parse and analyse MCScanX collinearity output

Jit — Wed, 29 Nov 2017 16:47:54 -0600

Scripts to parse and analyse MCScanX collinearity output.

Address of the bookmark: https://github.com/reubwn/collinearity

Circleator

Jit — Sun, 25 Jun 2017 18:04:32 -0500

The Charm City Circleator, or Circleator for short, is a Perl-based visualization tool developed at the Institute for Genome Sciences in the University of Maryland's School of Medicine. Circleator produces circular plots of genome-associated data.

Common uses of the tool include:

Displaying the sequence and/or genes in a GenBank flat file.
Highlighting differences and/or similarities in gene content between related organisms.
Comparing SNPs and indels between closely-related strains or serovars.
Comparing gene expression values across multiple samples or timepoints.
Visualizing coverage plots of RNA-Seq read alignments.

Key Features

Circleator...

Builds on BioPerl and the input file formats that it supports, including:
- GenBank flat files, GFF, FASTA
Accepts a number of other commonly-used datatypes and file formats:
- BSR and TRF output, SAM/BAM files, VCF-encoded SNPs, tab-delimited files
Outputs publication-ready figures in the SVG (Scalable Vector Graphics) format.
Requires only a single configuration file whose layout mirrors that of the figure itself.
- Predefined configuration files and "track" types are supplied for common datasets.
- Advanced features allow limited analyses to be performed as a figure is drawn.
Includes an extensive set of regression tests.
Offers a prototype web-based GUI (under the "Ringmaster" project).

Address of the bookmark: https://github.com/jonathancrabtree/Circleator

Perlbrew: admin-free perl installation management tool.

Jit — Wed, 12 Jul 2017 03:53:08 -0500

perlbrew is an admin-free Perl installation management tool. When this bookmark was saved, the latest version was 0.79; see the Release 0.79 notes.

Copy & Paste this line into your terminal:

\curl -L https://install.perlbrew.pl | bash

Or, if your system does not have curl but something else:

# Linux
\wget -O - https://install.perlbrew.pl | bash

# FreeBSD
\fetch -o- https://install.perlbrew.pl | sh

If you prefer to install with cpan, there are two steps:

sudo cpan App::perlbrew
perlbrew init

If it is installed with cpan, the perlbrew executable should be installed as /usr/bin/perlbrew or /usr/local/bin/perlbrew. Every user who wants to use perlbrew must first run perlbrew init.

The default perlbrew root directory is ~/perl5/perlbrew, which can be changed by setting the PERLBREW_ROOT environment variable before installation and initialization. For a more advanced installation process, please read the perlbrew documentation.

Address of the bookmark: https://perlbrew.pl/

VCFtools: perform common tasks with VCF files such as file validation, file merging, intersecting, complements

Rahul Nayak — Tue, 07 Aug 2018 10:01:46 -0500

VCFtools contains a Perl API (Vcf.pm) and a number of Perl scripts that can be used to perform common tasks with VCF files such as file validation, file merging, intersecting, complements, etc. The Perl tools support all versions of the VCF specification (3.2, 3.3, 4.0, 4.1 and 4.2); nevertheless, users are encouraged to use the latest versions, VCFv4.1 or VCFv4.2. VCFtools in general has been used mainly with diploid data, but the Perl tools aim to support polyploid data as well. Run any of the Perl scripts with the --help switch to obtain more help.

Many of the Perl scripts require that the VCF files are compressed by bgzip and indexed by tabix (both tools are part of the tabix package). The VCF files can be compressed and indexed using the following commands:

bgzip my_file.vcf
tabix -p vcf my_file.vcf.gz
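Because bgzip writes a gzip-compatible format, a compressed VCF can still be read with any gzip-aware reader when you only need a quick sanity check. The following Python sketch is not part of VCFtools: the function name count_variants_per_chrom is hypothetical, and the demo writes a tiny stand-in file with Python's gzip module rather than with bgzip so that it runs on its own.

```python
import gzip
import os
import tempfile
from collections import Counter


def count_variants_per_chrom(path):
    """Count data lines per chromosome in a gzip/bgzip-compressed VCF."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Lines starting with '#' are meta-information or the column header.
            if line.startswith("#") or not line.strip():
                continue
            chrom = line.split("\t", 1)[0]
            counts[chrom] += 1
    return counts


if __name__ == "__main__":
    # Stand-in for my_file.vcf.gz, written with gzip purely for the demo.
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.vcf.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write("##fileformat=VCFv4.2\n")
        out.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
        out.write("chr1\t10\t.\tA\tG\t.\t.\t.\n")
        out.write("chr1\t20\t.\tC\tT\t.\t.\t.\n")
        out.write("chr2\t5\t.\tG\tA\t.\t.\t.\n")
    print(dict(count_variants_per_chrom(demo_path)))
```

Random access by region still needs the tabix index; this sketch only streams the file from start to end.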

Address of the bookmark: http://vcftools.sourceforge.net/perl_module.html

maftools: Summarize, Analyze and Visualize MAF Files

Neel — Wed, 23 Dec 2020 05:29:33 -0600

With advances in cancer genomics, Mutation Annotation Format (MAF) is being widely accepted and used to store detected somatic variants. The Cancer Genome Atlas (TCGA) project has sequenced over 30 different cancers, with a sample size of over 200 for each cancer type. The resulting somatic variant data are stored in MAF. This package attempts to summarize, analyze, annotate and visualize MAF files in an efficient manner, from either TCGA sources or any in-house studies, as long as the data is in MAF format.

Address of the bookmark: https://www.bioconductor.org/packages/release/bioc/vignettes/maftools/inst/doc/maftools.html

vcfR: a package to manipulate and visualize VCF data in R

Jit — Thu, 25 Oct 2018 09:05:59 -0500

VcfR is an R package intended to allow easy manipulation and visualization of variant call format (VCF) data. Functions are provided to rapidly read from and write to VCF files. Once VCF data is read into R a parser function extracts matrices from the VCF data for use with typical R functions. This information can then be used for quality control or other purposes. Additional functions provide visualization of genomic data. Once processing is complete data may be written to a VCF file or converted into other popular R objects (e.g., genlight, DNAbin). VcfR provides a link between VCF data and the R environment connecting familiar software with genomic data.

Address of the bookmark: https://github.com/knausb/vcfR

Quip: Aggressive compression of FASTQ, SAM and BAM files.

Neel — Tue, 24 May 2022 06:31:48 -0500

This will help us reduce the amount of drive space we take up and decrease data transfer times.

Quip compresses next-generation sequencing data with extreme prejudice. It supports input and output in the FASTQ and SAM/BAM formats, compressing large datasets to as little as 15% of their original size.

Address of the bookmark: https://github.com/dcjones/quip

gapFinisher: A reliable gap filling pipeline for SSPACE-LongRead scaffolder output

Rahul Nayak — Thu, 14 May 2020 15:13:30 -0500

gapFinisher processes SSPACE-LongRead output to fill gaps after scaffolding. It is based on the controlled use of a previously published gap filling tool, FGAP, and works on all standard Linux/UNIX command lines.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6733440/

Address of the bookmark: https://github.com/kammoji/gapFinisher

Bioinformatics: Introduction to PERL

Archana Malhotra — Thu, 11 Jul 2013 09:49:37 -0500

This course is aimed at those new to programming and provides an introduction to programming using Perl. By the end of this course, attendees should be able to write simple Perl programs and to understand more complex Perl programs written by others. The course will be taught using the online Learning Perl materials created by Sofia Robb of the University of California Riverside. Further information is available.