BOL: Related items

Tbl2asn: a command-line program that automates the creation of sequence records for submission to GenBank

Poonam Mahapatra — Mon, 29 May 2017 07:37:08 -0500

Tbl2asn is a command-line program that automates the creation of sequence records for submission to GenBank. It uses many of the same functions as Sequin but is driven generally by data files. Tbl2asn generates .sqn files for submission to GenBank. Additional manual editing is not required before submission.

Tbl2asn is available by anonymous FTP. Copy the right version for your platform, then uncompress the file, rename it to "tbl2asn", and set the permissions, as necessary for the platform.

Address of the bookmark: https://www.ncbi.nlm.nih.gov/genbank/tbl2asn2/

Rebaler: program for conducting reference-based assemblies using long reads.

Jit — Tue, 18 Sep 2018 07:52:41 -0500

Rebaler is a program for conducting reference-based assemblies using long reads. It relies mainly on minimap2 for alignment and Racon for making consensus sequences.

I made Rebaler for bacterial genomes (specifically for the task of testing basecallers). It should in principle work for non-bacterial genomes as well, but I haven't tested it.

Address of the bookmark: https://github.com/rrwick/Rebaler

DISTRUCT: a program for the graphical display of population structure

Abhimanyu Singh — Mon, 25 Mar 2019 03:33:44 -0500

distruct is a program that can be used to graphically display results produced by the genetic clustering program structure or by other similar programs. The figures produced by distruct display individual membership coefficients in the same form as used in "Genetic structure of human populations" Science 298: 2381-2385 (2002). Various options enable the user to control the left-to-right printing order of populations, the bottom-to-top printing order of clusters, colors, and other graphical details. [Example]

[Download software package (includes the manual)] (you will be directed first to a registration page and we would very much appreciate if you register)
[Download manual]
[Download software note from Molecular Ecology Notes 4: 137-138 (2004)]

To use the UNIX versions, unzip and untar the files in an appropriate directory using

gunzip filename.tar.gz; tar xvf filename.tar

where "filename.tar.gz" is the downloaded file. Winzip will unzip the Windows version. Run the program by typing

./distruct

in UNIX or

distruct

from a DOS prompt in Windows. It will produce a figure using the data that are represented in the Central/South Asia K=5 plot in Science 298: 2381-2385 (2002).

Please send comments or problems with distruct to Noah Rosenberg.

October 15, 2014 — Users of Distruct may also find CLUMPP and CLUMPAK of interest.

Address of the bookmark: https://rosenberglab.stanford.edu/distruct.html

chromosight: Computer vision based program for pattern recognition in chromosome (Hi-C) contact maps

Jit — Mon, 23 Mar 2020 06:20:04 -0500

Python package to detect chromatin loops (and other patterns) in Hi-C contact maps.

Stable version with pip:

pip3 install --user chromosight

Stable version with conda:

conda install -c bioconda -c conda-forge chromosight

or, if you want to get the latest development version:

pip3 install --user -e git+https://github.com/koszullab/chromosight.git@master#egg=chromosight

Address of the bookmark: https://github.com/koszullab/Chromosight

Tools for Searching Repeats And Palindromic Sequences

Radha Agarkar — Sat, 21 May 2016 22:32:25 -0500

What are genomic interspersed repeats?

In the mid-1960s scientists discovered that many genomes contain stretches of highly repetitive DNA sequences (see Reassociation Kinetics Experiments, and C-Value Paradox). These sequences were later characterized and placed into five categories:

Simple Repeats - Duplications of simple sets of DNA bases (typically 1-5bp) such as A, CA, CGG etc.
Tandem Repeats - Typically found at the centromeres and telomeres of chromosomes, these are duplications of more complex 100-200 base sequences.
Segmental Duplications - Large blocks of 10-300 kilobases that have been copied to another region of the genome.
Interspersed Repeats
Processed Pseudogenes, Retrotranscripts, SINES - Non-functional copies of RNA genes which have been reintegrated into the genome with the assistance of a reverse transcriptase.
DNA Transposons
Retrovirus Retrotransposons
Non-Retrovirus Retrotransposons ( LINES )
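
As a rough illustration of the first category, the sketch below scans a sequence for simple repeats with a regular expression. It is not how any of the tools mentioned in this entry work; the function name find_simple_repeats and the threshold of at least three copies are assumptions made only for this example.

```python
import re

# A unit of 1-5 bases (shortest unit tried first), followed by at least
# two more copies of that same unit. The 3-copy minimum is an assumption.
SIMPLE_REPEAT = re.compile(r"(([ACGT]{1,5}?)\2{2,})")

def find_simple_repeats(seq):
    """Return (start, unit, copies) for each simple-repeat run found.

    start is a 0-based offset into seq.
    """
    hits = []
    for m in SIMPLE_REPEAT.finditer(seq.upper()):
        run, unit = m.group(1), m.group(2)
        hits.append((m.start(), unit, len(run) // len(unit)))
    return hits

print(find_simple_repeats("TTCACACACAGG"))  # [(2, 'CA', 4)]
```

Real repeat finders tolerate mismatches and indels within a run; this exact-match pattern does not.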

Currently, up to 50% of the human genome is known to be repetitive, and this number is expected to increase as detection methods improve.

In genetics, by contrast, the term palindrome refers to a sequence of nucleotides along a DNA (deoxyribonucleic acid) or RNA (ribonucleic acid) strand that contains the same series of nitrogenous bases regardless of the direction from which the strand is analyzed. Akin to a language palindrome, in which a word or phrase is spelled the same left-to-right as right-to-left (e.g., the word RADAR or the phrase "able was I ere I saw elba"), a genetic palindrome reads the same whether the nucleic acid strand is read starting from the 3' (three prime) end or the 5' (five prime) end.

Recent research on palindromes centers on understanding palindrome formation during gene amplification. Other studies have attempted to relate palindrome formation to molecular mechanisms involved in double stranded breaks and in the formation of inverted repeats. Assisted by high speed computers, other groups of scientists link palindrome formation to the conservation of genetic information.

Related to the direction of transcription by RNA polymerase, DNA strands have upstream and downstream termini defined by differing chemical groups at each end. The ends of each strand of DNA or RNA are termed the 5' end (a phosphate group bound to the 5' carbon of the sugar) and the 3' end (a hydroxyl group on the 3' carbon) to indicate a polarity within the molecule. Using the letters A, T, C, G to represent the nitrogenous bases adenine, thymine, cytosine, and guanine found in DNA, and the letters A, U, C, G to represent the nitrogenous bases adenine, uracil, cytosine, and guanine found in RNA (note that uracil in RNA replaces the thymine found in DNA), geneticists usually represent DNA by a series of base codes (e.g., 5' AATCGGATTGCA 3'). The base codes are usually arranged from the 5' end to the 3' end.

Because of specific base pairing in DNA (i.e., adenine (A) always bonds with thymine (T) and cytosine (C) always bonds with guanine (G)), the complementary strand to the sequence 5' AATCGGATTGCA 3' would be 3' TTAGCCTAACGT 5'.

With palindromes, the sequences on the complementary strands read the same in either direction. For example, a sequence of 5' GAATTC 3' on one strand would be complemented by a 3' CTTAAG 5' strand. When either strand is read from its 5' end, the sequence is GAATTC. Another example of a palindrome would be the sequence 5' CGAAGC 3' that, when reversed, still reads CGAAGC.
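
The base-pairing examples above can be checked with a short Python sketch. The helper names (complement, reverse_complement, is_reverse_palindrome, reads_same_reversed) are made up for this illustration and are not taken from any of the tools listed here. The two palindrome examples in the text correspond to two different checks: GAATTC equals its own reverse complement, while CGAAGC simply reads the same when its letters are reversed on a single strand.

```python
# Watson-Crick pairing for DNA: A<->T, C<->G.
PAIRS = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Base-by-base complement, written in the same left-to-right order."""
    return seq.upper().translate(PAIRS)

def reverse_complement(seq):
    """The opposite strand, read from its own 5' end."""
    return complement(seq)[::-1]

def is_reverse_palindrome(seq):
    """True if both strands read the same 5' to 3' (e.g. GAATTC)."""
    return seq.upper() == reverse_complement(seq)

def reads_same_reversed(seq):
    """True if one strand reads the same in both directions (e.g. CGAAGC)."""
    return seq.upper() == seq.upper()[::-1]

print(complement("AATCGGATTGCA"))       # TTAGCCTAACGT
print(is_reverse_palindrome("GAATTC"))  # True
print(reads_same_reversed("CGAAGC"))    # True
print(is_reverse_palindrome("CGAAGC"))  # False
```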

Palindromes are important sequences within nucleic acids. Often they are the site of binding for specific enzymes (e.g., restriction endonucleases) designed to cut the DNA strands at specific locations (i.e., at palindromes).

Palindromes may arise from breakage and chromosomal inversions that form inverted repeats that complement each other. When a palindrome results from an inversion, it is often referred to as an inverted repeat. For example, the sequence 5' CGAAGC 3', if inverted (reversed 180°), still reads CGAAGC.

The European Molecular Biology Open Software Suite (EMBOSS) includes some basic tools for finding tandem repeats and inverted repeats (see B.6.22. Applications in group Nucleic:repeats). There are many on-line services providing the EMBOSS tools, for example:

Wageningen Bioinformatics Webportal EMBOSS explorer
Mobyle@Pasteur
Soaplab2 Web Services at Vital-IT

For more sophisticated repeat finding you will want to look at tools using Repbase for example:

Other nucleotide repeat finding methods found by a couple of web searches:

GenBank release 257.0 is now available!

Neel — Wed, 23 Aug 2023 00:23:23 -0500

GenBank release 257.0 is now available! This release has 25.10 trillion bases and 3.69 billion records. Learn more: https://ncbiinsights.ncbi.nlm.nih.gov/2023/08/21/genbank-release-257/

GenBank release 257.0 (8/15/2023) is now available on the NCBI FTP site. This release has 25.10 trillion bases and 3.69 billion records.

The current release has:

246,119,175 traditional records containing 2,112,058,517,945 base pairs of sequence data
2,631,493,489 WGS records containing 22,294,446,104,543 base pairs of sequence data
686,271,945 bulk-oriented TSA records containing 646,176,166,908 base pairs of sequence data
124,421,006 bulk-oriented TLS records containing 48,289,699,026 base pairs of sequence data
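
The headline totals can be reproduced from the four per-division figures above. This is only an arithmetic check in Python; the dictionary layout and variable names are chosen for this example.

```python
# (records, base pairs) per division, copied from the list above.
release_257 = {
    "traditional": (246_119_175, 2_112_058_517_945),
    "WGS": (2_631_493_489, 22_294_446_104_543),
    "TSA": (686_271_945, 646_176_166_908),
    "TLS": (124_421_006, 48_289_699_026),
}

total_records = sum(records for records, _ in release_257.values())
total_bases = sum(bases for _, bases in release_257.values())

print(f"{total_records / 1e9:.2f} billion records")  # 3.69 billion records
print(f"{total_bases / 1e12:.2f} trillion bases")    # 25.10 trillion bases
```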

minimap2: A versatile pairwise aligner for genomic and spliced nucleotide sequences

Jit — Wed, 20 Jun 2018 07:55:29 -0500

git clone https://github.com/lh3/minimap2
cd minimap2 && make
# long sequences against a reference genome
./minimap2 -a test/MT-human.fa test/MT-orang.fa > test.sam
# create an index first and then map
./minimap2 -d MT-human.mmi test/MT-human.fa
./minimap2 -a MT-human.mmi test/MT-orang.fa > test.sam
# use presets (no test data)
./minimap2 -ax map-pb ref.fa pacbio.fq.gz > aln.sam  # PacBio genomic reads
./minimap2 -ax map-ont ref.fa ont.fq.gz > aln.sam  # Oxford Nanopore genomic reads
./minimap2 -ax sr ref.fa read1.fa read2.fa > aln.sam  # short genomic paired-end reads
./minimap2 -ax splice ref.fa rna-reads.fa > aln.sam  # spliced long reads
./minimap2 -ax splice -k14 -uf ref.fa reads.fa > aln.sam  # Nanopore Direct RNA-seq
./minimap2 -cx asm5 asm1.fa asm2.fa > aln.paf  # intra-species asm-to-asm alignment
./minimap2 -x ava-pb reads.fa reads.fa > overlaps.paf  # PacBio read overlap
./minimap2 -x ava-ont reads.fa reads.fa > overlaps.paf  # Nanopore read overlap
# man page for detailed command line options
man ./minimap2.1

Address of the bookmark: https://github.com/lh3/minimap2

COSINE: non-seeding method for mapping long noisy sequences

Jit — Fri, 26 Oct 2018 00:41:59 -0500

Third-generation sequencing (TGS) technologies are highly promising, but the long and noisy reads they produce are difficult to align using existing algorithms. Here, we present COSINE, a conceptually new method designed specifically for aligning long reads contaminated by a high level of errors.

Address of the bookmark: https://github.com/SUwonglab/COSINE

Gepard: allows the calculation of dotplots even for large sequences like chromosomes or bacterial genomes

Jit — Mon, 26 Aug 2019 11:38:30 -0500

Gepard (German: "cheetah", Backronym for "GEnome PAir - Rapid Dotter") allows the calculation of dotplots even for large sequences like chromosomes or bacterial genomes. Reference: Krumsiek J, Arnold R, Rattei T. Gepard: A rapid and sensitive tool for creating dotplots on genome scale. Bioinformatics 2007; 23(8): 1026-8. PMID: 17309896

http://cube.univie.ac.at/gepard

Address of the bookmark: https://github.com/univieCUBE/gepard

Sequence Tube Maps: displays multiple genomic sequences in the form of a tube map

Jit — Wed, 11 Mar 2020 01:12:06 -0500

A JavaScript module for the visualization of genomic sequence graphs. It automatically generates a "tube map"-like visualization of sequence graphs which have been created with vg. (https://github.com/vgteam/vg)

Link to working demo: https://vgteam.github.io/sequenceTubeMap/

Address of the bookmark: https://github.com/vgteam/sequenceTubeMap