BOL: Related items

List of tools frequently used in genome assembly

BioStar — Tue, 22 Jan 2019 09:39:02 -0600

I have used the following assemblers

SPAdes (v. 3.10.1)
Canu (v. 1.6)
Unicycler (v. 0.4.1)
miniasm (v. 0.2-r137-dirty)

I have used the following mappers

minimap2 (v. 2.0rc1-r232)
minimap (v. 0.2-r124-dirty)
bwa (v. 0.7.12-r1039)

I have used the following polishing tools

Racon (v. not available)
Pilon (v. 1.18)
Nanopolish (v. 0.8.3)

I have used the following tools to assess genome assembly characteristics

ANI.pl (https://github.com/chjp/ANI)
CheckM (v. 1.0.7)
Prokka (v. 1.12)
QUAST (v. 2.3)
MUMmer (v. not available)

If you have any ideas, or know of better tools we have missed, please let us know in the comments.

mutatrix: a population genome simulator which generates simulated genomes.

Jit — Tue, 28 Jan 2020 04:06:58 -0600

Genome simulation across a population with zeta-distributed allele frequencies, SNPs, insertions, deletions, and multi-nucleotide polymorphisms.

More at https://github.com/ekg/mutatrix

./mutatrix -S sample -P test/ -p 2 -n 10 reference.fasta

Address of the bookmark: https://github.com/ekg/mutatrix

Biological databases !

BioStar — Wed, 12 Feb 2020 01:16:29 -0600

Nowadays there are many genomics databases available around the world. This bookmark was created to collect their links in one place ...

ftp://ftp.ncbi.nih.gov/genomes/

https://hgdownload.soe.ucsc.edu/downloads.html

Address of the bookmark: ftp://ftp.ncbi.nih.gov/genomes/

odgi: optimized dynamic genome/graph implementation

Abhimanyu Singh — Tue, 01 Feb 2022 23:42:21 -0600

odgi provides an efficient and succinct dynamic DNA sequence graph model, as well as a host of algorithms that allow the use of such graphs in bioinformatic analyses.

Careful encoding of graph entities allows odgi to efficiently compute and transform pangenomes with minimal overhead. odgi implements a dynamic data structure that leverages multi-core CPUs and can be updated on the fly.

The edges and path steps are recorded as deltas between the current node id and the target node id, where the node id corresponds to the rank in the global array of nodes. Graphs built from biological data sets tend to have local partial order and, when sorted, the deltas will be small. This allows them to be compressed with a variable-length integer representation, resulting in a small in-memory footprint at the cost of packing and unpacking.

The RAM and computational savings are substantial. In partially ordered regions of the graph, most deltas will require only a single byte.
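
The delta idea described above can be sketched in a few lines of Python. This is only an illustration of the general technique (delta plus variable-length integer encoding), not odgi's actual code: the function names are hypothetical, and the zigzag mapping used here to handle negative deltas is an assumption, since the text does not say how odgi stores the sign.

```python
def zigzag(delta: int) -> int:
    """Map a signed delta to a non-negative integer (0, -1, 1, -2 -> 0, 1, 2, 3).

    Assumption: the source text does not say how signs are stored.
    """
    return 2 * delta if delta >= 0 else -2 * delta - 1


def unzigzag(z: int) -> int:
    """Inverse of zigzag()."""
    return z // 2 if z % 2 == 0 else -((z + 1) // 2)


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer with 7 payload bits per byte."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a single varint produced by encode_varint()."""
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return value


def encode_delta(source_rank: int, target_rank: int) -> bytes:
    """Store an edge as the varint-encoded delta between two node ranks."""
    return encode_varint(zigzag(target_rank - source_rank))


def decode_delta(source_rank: int, data: bytes) -> int:
    """Recover the target rank from the source rank and the stored delta."""
    return source_rank + unzigzag(decode_varint(data))


if __name__ == "__main__":
    # Neighbouring ranks (a well-sorted graph) versus a long-range edge.
    for target in (1001, 998, 500000):
        packed = encode_delta(1000, target)
        print(target, len(packed), decode_delta(1000, packed))
```

With this scheme any delta between -64 and 63 fits in one byte, which is the situation the text describes for partially ordered regions; the price is the packing and unpacking work done on every access.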

Address of the bookmark: https://github.com/pangenome/odgi

Genomicus: genome browser that enables users to navigate in genomes in several dimensions

Jit — Mon, 28 Feb 2022 23:27:37 -0600

Genomicus is a genome browser that enables users to navigate in genomes in several dimensions: linearly along chromosome axes, transversally across different species, and chronologically along evolutionary time.

Once a query gene has been entered, it is displayed in its genomic context in parallel to the genomic context of all its orthologous and paralogous copies in all the other sequenced metazoan genomes. Moreover, Genomicus stores and displays the predicted ancestral genome structure in all the ancestral species within the phylogenetic range of interest.

All the data on extant species displayed in this browser are from Ensembl.

Summary statistics of Genomicus version 105.01:


Number of extant species	200
Number of extant genes	4303993
Number of ancestral species	196
Number of ancestral genes	4624213
Number of ancestral synteny blocks	83342

Address of the bookmark: https://www.genomicus.bio.ens.psl.eu/genomicus-105.01/cgi-bin/search.pl

Perl one-liners for bioinformaticians!

Abhimanyu Singh — Fri, 30 May 2014 05:49:07 -0500

With the emergence of NGS technologies and sequencing data, most bioinformaticians munge and wrangle massive amounts of genomic text. Although there are several "standardized" file formats (FASTQ, SAM, VCF, etc.) and tools for manipulating them (FASTX-Toolkit, samtools, vcftools, etc.), there are still times when knowing a few Perl one-liners is extremely helpful.

Perl one-liners are small and awesome Perl programs that fit in a single line of code and do one thing really well. These things include changing line spacing, numbering lines, doing calculations, converting and substituting text, deleting and printing certain lines, parsing logs, editing files in-place, doing statistics, carrying out system administration tasks, updating a bunch of files at once, and many more. Perl one-liners will make you a shell warrior. Anything that took you minutes to solve will now take you seconds!

perl -pe '$\="\n"'
#double space a file

perl -pe '$_ .= "\n" unless /^$/'
#double space a file except blank lines

perl -pe '$_.="\n"x7'
#Append 7 newlines to every line (7 blank lines between lines)

perl -ne 'print unless /^$/'
#remove all blank lines

perl -lne 'print if length($_) < 20'
#print all lines with length less than 20.

perl -00 -pe ''
#Collapse runs of consecutive blank lines into a single blank line

perl -00 -pe '$_.="\n"x4'
#Append 4 extra newlines after each paragraph (widens the blank-line gap between paragraphs)

perl -pe '$_ = "$. $_"'
#Number all lines in a file

perl -pe '$_ = ++$a." $_" if /./'
#Number only non-empty lines in a file

perl -ne 'print ++$a." $_" if /./'
#Number and print only non-empty lines in a file

perl -pe '$_ = ++$a." $_" if /regex/'
#Number only lines that match a pattern

perl -ne 'print ++$a." $_" if /regex/'
#Number and print only lines that match a pattern

perl -ne 'printf "%-5d %s", $., $_ if /regex/'
#Print lines that match a pattern, prefixed with their line number left-aligned in a 5-character column (perl -ne 'printf "%-5d %s", $., $_' : for all the lines)

perl -le 'print scalar(grep{/./}<>)'
#prints the total number of non-empty lines in a file

perl -lne '$a++ if /regex/; END {print $a+0}'
#print the total number of lines that matches the pattern

perl -alne 'print scalar @F'
#Print the number of fields (words) in each line

perl -alne '$t += @F; END { print $t}'
#Find total number of words in the file

perl -alne 'map { /regex/ && $t++ } @F; END { print $t }'
#find total number of fields that match the pattern

perl -lne '/regex/ && $t++; END { print $t }'
#Find total number of lines that match a pattern

perl -le '$n = 20; $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $m'
#will calculate the GCD of two numbers.

perl -le '$a = $n = 20; $b = $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $a*$b/$m'
#will calculate the LCM (least common multiple) of 20 and 35.

perl -le '$n=10; $min=5; $max=15; $, = " "; print map { int(rand($max-$min))+$min } 1..$n'
#Generates 10 random integers from 5 to 14 (the upper bound 15 is excluded).

perl -le 'print map { ("a".."z","0".."9")[rand 36] } 1..8'
#Generates an 8-character password from the letters a-z and the digits 0-9.

perl -le 'print map { ("a","t","g","c")[rand 4] } 1..20'
#Generates a random nucleotide sequence 20 bases long.

perl -le 'print "a"x50'
#Generate a string of 50 "a" characters

perl -le 'print join ", ", map { ord } split //, "hello world"'
#Prints the ASCII value of each character in the string "hello world".

perl -le '@ascii = (99, 111, 100, 105, 110, 103); print pack("C*", @ascii)'
#converts ascii values into character strings.

perl -le '@odd = grep {$_ % 2 == 1} 1..100; print "@odd"'
#Generates an array of odd numbers.

perl -le '@even = grep {$_ % 2 == 0} 1..100; print "@even"'
#Generate an array of even numbers

perl -lpe 'y/A-Za-z/N-ZA-Mn-za-m/' file
#ROT13-encode the entire file (shift every letter by 13 positions)

perl -nle 'print uc'
#Convert all text to uppercase:

perl -nle 'print lc'
#Convert text to lowercase:

perl -nle 'print ucfirst lc'
#Lowercase each line, then uppercase its first character

perl -ple 'y/A-Za-z/a-zA-Z/'
#Convert upper case to lower case and vice versa

perl -ple 's/(\w+)/\u$1/g'
#Capitalize the first letter of every word (title case)

perl -pe 's|\n|\r\n|'
#Convert unix new lines into DOS new lines:

perl -pe 's|\r\n|\n|'
#Convert DOS newlines into unix new line

perl -pe 's|\n|\r|'
#Convert unix newlines into classic Mac OS (CR-only) newlines:

perl -pe '/regexp/ && s/foo/bar/'
#Replace the first foo with bar on every line that matches regexp.
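
The GCD and LCM one-liners earlier in this list are dense, so here is the same logic (Euclid's algorithm) spelled out in Python as a readable sketch. The function names are chosen for illustration only.

```python
def gcd(m: int, n: int) -> int:
    """Euclid's algorithm: replace (m, n) with (n, m % n) until n is 0."""
    while n:
        m, n = n, m % n
    return m


def lcm(a: int, b: int) -> int:
    """The LCM follows from the GCD: a * b / gcd(a, b)."""
    return a * b // gcd(a, b)


if __name__ == "__main__":
    # Same inputs as the Perl one-liners.
    print(gcd(35, 20))
    print(lcm(20, 35))
```

For 20 and 35 the loop runs through (35, 20), (20, 15), (15, 5), (5, 0), so the GCD is 5 and the LCM is 20*35/5 = 140.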

Reference/Sources:

http://genomics-array.blogspot.in/2010/11/some-unixperl-oneliners-for.html

http://genomespot.blogspot.com/2013/08/a-selection-of-useful-bash-one-liners.html

http://biowize.wordpress.com/2012/06/15/command-line-magic-for-your-gene-annotations/

http://genomics-array.blogspot.com/2010/11/some-unixperl-oneliners-for.html

http://bioexpressblog.wordpress.com/2013/04/05/split-multi-fasta-sequence-file/

LASTZ

Abhi — Mon, 18 Apr 2016 04:41:55 -0500

LASTZ is a pairwise aligner for DNA sequences. Originally designed to handle sequences the size of human chromosomes and from different species, it is also useful for sequences produced by NGS technologies such as Roche 454.

More at http://www.bx.psu.edu/~rsharris/lastz/

Thesis: http://www.bx.psu.edu/~rsharris/rsharris_phd_thesis_2007.pdf

Address of the bookmark: http://www.bx.psu.edu/~rsharris/lastz/

Alignment of closely related whole genomes/scaffolds

Rahul Nayak — Fri, 29 Jan 2016 10:37:27 -0600

The relative ease and low cost of current-generation sequencing technologies have led to a dramatic increase in the number of sequenced genomes for species across the tree of life. This increasing volume of data requires tools that can quickly compare multiple whole-genome sequences, millions of base pairs in length, to aid in the study of populations, pan-genomes, and genome evolution. This bookmark has been created to report new tools for whole-genome alignment.

Please report new whole-genome alignment tools in the comments section.

Address of the bookmark: http://www.cs.utoronto.ca/~brudno/721.full.pdf

YASS :: genomic similarity search tool

Jit — Mon, 02 May 2016 09:26:00 -0500

YASS is a genomic similarity search tool for nucleic acid (DNA/RNA) sequences in FASTA or plain text format; it produces local pairwise alignments. Like most heuristic pairwise local alignment tools for DNA sequences (FASTA, BLAST, PATTERNHUNTER, BLASTZ/LASTZ, LAST ...), YASS uses seeds to detect potential similarity regions, and then tries to extend them into local alignments. It uses multiple transition-constrained spaced seeds, which make it possible to find fuzzier repeats such as those in non-coding DNA/RNA. Another simple but interesting feature is that you can specify the seed pattern used in the search step (as provided, for example, by iedera).

Main features of YASS are:

multiple, possibly overlapping seeds and a new hit criterion to ensure a good sensitivity/selectivity trade-off
transition-constrained spaced seeds to improve sensitivity (transition mutations are purine to purine [A<->G] or pyrimidine to pyrimidine [C<->T])
using different scoring schemes with bit-score and E-value evaluated according to the sequence background frequencies
parameterizable output filter for low complexity repeats
reporting of various alignment statistical parameters (mutation bias along triplets, transition/transversion)
post-processing step to group gapped alignments
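
To make the idea of a transition-constrained spaced seed concrete, here is a small Python sketch that checks whether a seed pattern accepts two aligned windows. It is not YASS code. The pattern symbols are an assumption based on the convention I recall from the YASS documentation ('#' = match required, '@' = match or transition, '-' = any pair); check the tool's help output before relying on it. The function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def is_transition(a: str, b: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def seed_hits(seed: str, x: str, y: str) -> bool:
    """Return True if the seed pattern accepts the aligned windows x and y.

    Assumed symbols: '#' match required, '@' match or transition,
    '-' any pair of bases (don't-care position).
    """
    if not (len(seed) == len(x) == len(y)):
        raise ValueError("seed and both windows must have the same length")
    for symbol, a, b in zip(seed, x.upper(), y.upper()):
        if symbol == "#":
            if a != b:
                return False
        elif symbol == "@":
            if a != b and not is_transition(a, b):
                return False
        elif symbol != "-":
            raise ValueError(f"unknown seed symbol: {symbol!r}")
    return True


if __name__ == "__main__":
    seed = "##@-#"
    print(seed_hits(seed, "ACGTA", "ACATA"))  # G/A at the '@' position is a transition
    print(seed_hits(seed, "ACGTA", "ACCTA"))  # G/C at the '@' position is a transversion
```

Allowing transitions at the '@' positions lets a seed survive the most common kind of point mutation, which is how such seeds gain sensitivity over seeds that demand exact matches at every position.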

Address of the bookmark: http://bioinfo.lifl.fr/yass/

methylKit

Jit — Fri, 03 Jun 2016 10:09:29 -0500

methylKit is an R package for DNA methylation analysis and annotation from high-throughput bisulfite sequencing. The package is designed to deal with sequencing data from RRBS and its variants, but also target-capture methods such as Agilent SureSelect methyl-seq. In addition, methylKit can deal with base-pair resolution data for 5hmC obtained from TAB-seq or oxBS-seq. It can also handle whole-genome bisulfite sequencing data if the proper input format is provided.

Address of the bookmark: https://github.com/al2na/methylKit