BOL: Related items

chromeister: An ultra fast, heuristic approach to detect conserved signals in extremely large pairwise genome comparisons.

Jit — Thu, 03 Feb 2022 04:01:55 -0600

USAGE:

-query: sequence A in fasta format
-db: sequence B in fasta format
-out: output matrix
-kmer: Integer k>1 (default 32). Use 32 for chromosomes and genomes and 16 for small bacteria.
-diffuse: Integer z>0 (default 4). Use 4 for everything; for large plant genomes you can try 1.
-dimension: Size of the output matrix and plot. Integer d>0 (default 1000). Use 1000 for everything except full-genome comparisons, where 2000 is recommended.
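The options above can be assembled into a command line programmatically. The following is a minimal sketch, not part of chromeister itself: the helper name `build_chromeister_command` is hypothetical, and the executable name `CHROMEISTER` is an assumption that should be checked against your installation. Only the flags and defaults listed in the usage text are used.

```python
def build_chromeister_command(query, db, out, kmer=32, diffuse=4,
                              dimension=1000, executable="CHROMEISTER"):
    """Return an argument list for a chromeister run.

    The executable name is an assumption; the flags and their
    defaults are taken from the usage text above.
    """
    # Enforce the ranges stated in the usage text.
    if kmer <= 1:
        raise ValueError("kmer must be an integer > 1")
    if diffuse <= 0:
        raise ValueError("diffuse must be an integer > 0")
    if dimension <= 0:
        raise ValueError("dimension must be an integer > 0")
    return [
        executable,
        "-query", query,
        "-db", db,
        "-out", out,
        "-kmer", str(kmer),
        "-diffuse", str(diffuse),
        "-dimension", str(dimension),
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_chromeister_command("seqA.fasta", "seqB.fasta", "dotplot.mat")
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the executable name and file paths match your setup.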

Address of the bookmark: https://github.com/estebanpw/chromeister

Tools for Searching Repeats And Palindromic Sequences

Radha Agarkar — Sat, 21 May 2016 22:32:25 -0500

What are genomic interspersed repeats?

In the mid-1960s, scientists discovered that many genomes contain stretches of highly repetitive DNA sequences (see Reassociation Kinetics Experiments and C-Value Paradox). These sequences were later characterized and placed into five categories:

Simple Repeats - Duplications of simple sets of DNA bases (typically 1-5 bp) such as A, CA, CGG, etc.
Tandem Repeats - Typically found at the centromeres and telomeres of chromosomes, these are duplications of more complex 100-200 base sequences.
Segmental Duplications - Large blocks of 10-300 kilobases that have been copied to another region of the genome.
Interspersed Repeats
  Processed Pseudogenes, Retrotranscripts, SINEs - Non-functional copies of RNA genes which have been reintegrated into the genome with the assistance of a reverse transcriptase.
  DNA Transposons
  Retrovirus Retrotransposons
  Non-Retrovirus Retrotransposons (LINEs)

Currently, up to 50% of the human genome is known to be repetitive, and this number is expected to increase as detection methods improve.
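As an illustration of the simplest category, the sketch below finds perfect simple repeats (units of 1-5 bp copied back to back) with a regular expression. The function name and the `min_copies` threshold are assumptions made for this example. It only detects exact copies, whereas dedicated repeat finders also handle imperfect ones.

```python
import re


def find_simple_repeats(seq, min_copies=3):
    """Return (start, unit, copies) for perfect repeats of 1-5 bp units.

    A simplistic sketch: only exact, uninterrupted copies are reported.
    """
    # The lazy quantifier prefers the shortest unit; the back-reference
    # then requires at least (min_copies - 1) further copies of it.
    pattern = re.compile(r"([ACGT]{1,5}?)\1{%d,}" % (min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


if __name__ == "__main__":
    print(find_simple_repeats("TTCACACACAGG"))
```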

In genetics, by contrast, the term palindrome refers to a sequence of nucleotides along a DNA (deoxyribonucleic acid) or RNA (ribonucleic acid) molecule in which the series of nitrogenous bases reads the same in the 5' (five prime) to 3' (three prime) direction on one strand as it does in the 5' to 3' direction on the complementary strand. This is akin to a language palindrome, in which a word or phrase is spelled the same left-to-right as right-to-left (e.g., the word RADAR or the phrase "able was I ere I saw elba").

Recent research on palindromes centers on understanding palindrome formation during gene amplification. Other studies have attempted to relate palindrome formation to molecular mechanisms involved in double-stranded breaks and in the formation of inverted repeats. Assisted by high-speed computers, other groups of scientists link palindrome formation to the conservation of genetic information.

In relation to the direction of transcription by RNA polymerase, DNA strands have upstream and downstream termini defined by differing chemical groups at each end. The ends of each strand of DNA or RNA are termed the 5' end (a phosphate group bound to the 5' carbon of the sugar) and the 3' end (a hydroxyl group on the 3' carbon) to indicate a polarity within the molecule. The letters A, T, C, and G represent the nitrogenous bases adenine, thymine, cytosine, and guanine found in DNA, and the letters A, U, C, and G represent adenine, uracil, cytosine, and guanine found in RNA (uracil in RNA replaces the thymine found in DNA). Geneticists usually represent DNA as a series of base codes (e.g., 5' AATCGGATTGCA 3'), arranged from the 5' end to the 3' end.

Because of specific base pairing in DNA (i.e., adenine (A) always bonds with thymine (T), and cytosine (C) always bonds with guanine (G)), the complementary strand to the sequence 5' AATCGGATTGCA 3' would be 3' TTAGCCTAACGT 5'.
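The pairing rule can be written as a lookup table. This is a minimal sketch with hypothetical function names: `complement` returns the partner strand base by base (so the result reads 3' to 5' when the input is written 5' to 3'), and `reverse_complement` flips it so that it also reads 5' to 3'.

```python
# Watson-Crick pairing for DNA: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq):
    """Base-by-base complement; reads 3'->5' if seq is written 5'->3'."""
    return "".join(PAIRS[base] for base in seq.upper())


def reverse_complement(seq):
    """Complementary strand written in the usual 5'->3' direction."""
    return complement(seq)[::-1]


if __name__ == "__main__":
    print(complement("AATCGGATTGCA"))          # partner strand, 3'->5'
    print(reverse_complement("AATCGGATTGCA"))  # partner strand, 5'->3'
```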

With palindromes, the sequences on the complementary strands read the same when each is read in its own 5' to 3' direction. For example, a sequence of 5' GAATTC 3' on one strand is complemented by 3' CTTAAG 5' on the other. When either strand is read from its 5' end, the sequence is GAATTC. By contrast, the sequence 5' CGAAGC 3' still reads CGAAGC when reversed on a single strand, which makes it a palindrome in the literal, language-like sense; its complementary strand, read 5' to 3', is GCTTCG.
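The two-strand definition translates into a one-line test: a sequence is palindromic if it equals its own reverse complement. The sketch below uses hypothetical function names and also includes the literal, single-strand reversal test for comparison.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_dna_palindrome(seq):
    """True if seq equals its reverse complement (two-strand palindrome)."""
    seq = seq.upper()
    reverse_complement = "".join(PAIRS[base] for base in reversed(seq))
    return seq == reverse_complement


def is_literal_palindrome(seq):
    """True if seq reads the same backwards on a single strand."""
    seq = seq.upper()
    return seq == seq[::-1]


if __name__ == "__main__":
    for candidate in ("GAATTC", "CGAAGC"):
        print(candidate,
              is_dna_palindrome(candidate),
              is_literal_palindrome(candidate))
```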

Palindromes are important sequences within nucleic acids. They are often the binding sites for specific enzymes (e.g., restriction endonucleases) that cut the DNA strands at specific locations (i.e., at palindromes).
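To locate candidate sites of this kind, one can slide a fixed-length window along a sequence and keep the windows that equal their own reverse complement. This is a rough sketch under stated assumptions: the function name and the default window length of 6 are chosen for illustration, and no real enzyme's recognition rules are modeled.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def find_palindromic_sites(seq, length=6):
    """Return (position, site) for each window equal to its reverse complement."""
    seq = seq.upper()
    sites = []
    for start in range(len(seq) - length + 1):
        window = seq[start:start + length]
        reverse_complement = "".join(PAIRS[base] for base in reversed(window))
        if window == reverse_complement:
            sites.append((start, window))
    return sites


if __name__ == "__main__":
    print(find_palindromic_sites("AAGAATTCTT"))
```

Positions are 0-based offsets from the 5' end of the input sequence.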

Palindromes may arise from breakage and chromosomal inversions that form inverted repeats that complement each other. When a palindrome results from an inversion, it is often referred to as an inverted repeat. For example, the sequence 5' CGAAGC 3', if inverted (reversed 180°), still reads CGAAGC.

The European Molecular Biology Open Software Suite (EMBOSS) includes some basic tools for finding tandem repeats and inverted repeats (see B.6.22. Applications in group Nucleic:repeats). There are many on-line services providing the EMBOSS tools, for example:

Wageningen Bioinformatics Webportal EMBOSS explorer
Mobyle@Pasteur
Soaplab2 Web Services at Vital-IT

For more sophisticated repeat finding you will want to look at tools using Repbase for example:

Other nucleotide repeat finding methods found by a couple of web searches:

gVolante: Completeness Assessment of Genome/Transcriptome Sequences

Jit — Tue, 06 Aug 2019 21:37:56 -0500

gVolante provides an online interface for completeness assessment of user’s original or publicly available sequence datasets as well as for browsing results of completeness assessment performed on publicly available genome and transcriptome assemblies.

Address of the bookmark: https://gvolante.riken.jp/

pbalign: maps PacBio reads to reference sequences and saves alignments to a BAM file

Jit — Thu, 24 May 2018 10:06:52 -0500

pbalign aligns PacBio reads to reference sequences, filters aligned reads according to user-specified filtering criteria, and converts the output to either the SAM format or the PacBio Compare HDF5 (e.g., .cmp.h5) format. The output Compare HDF5 file will be compatible with Quiver if the --forQuiver option is specified.

Address of the bookmark: https://github.com/PacificBiosciences/pbalign

COSINE: non-seeding method for mapping long noisy sequences

Jit — Fri, 26 Oct 2018 00:41:59 -0500

Third-generation sequencing (TGS) technologies are highly promising, but the long and noisy reads they produce are difficult to align using existing algorithms. Here, we present COSINE, a conceptually new method designed specifically for aligning long reads contaminated by a high level of errors.

Address of the bookmark: https://github.com/SUwonglab/COSINE

IgBLAST 1.17 is now available with improved identification of productive V gene sequences

Jit — Sun, 01 Nov 2020 16:52:58 -0600

A new release of IgBLAST (1.17), the popular package for classifying and analyzing immunoglobulin and T cell receptor sequences, is now available on the web and from the FTP site. The updated package is better at identifying productive V gene sequences. We added a new field, “V frame shift”, to the IgBLAST output to indicate whether the V gene translation frame contains a frame shift. We have also updated the definition of a productive V(D)J sequence to now exclude those with internal frame shifts.

See the new IgBLAST manual on the NCBI GitHub site for more information on setting up and running IgBLAST.

If you have any questions or concerns, please email us at blast-help@ncbi.nlm.nih.gov

doubletrouble: identify duplicated genes from whole-genome protein sequences and classify

LEGE — Tue, 05 Mar 2024 00:23:49 -0600

doubletrouble aims to identify duplicated genes from whole-genome protein sequences and classify them based on their modes of duplication. The duplication modes are i. segmental duplication (SD); ii. tandem duplication (TD); iii. proximal duplication (PD); iv. transposed duplication (TRD); and v. dispersed duplication (DD). Transposon-derived duplicates (TRD) can be further subdivided into rTRD (retrotransposon-derived duplication) and dTRD (DNA transposon-derived duplication). If users want a simpler classification scheme, duplicates can also be classified into SD- and SSD-derived (small-scale duplication) gene pairs. Besides classifying gene pairs, users can also classify genes, so that each gene is assigned a unique mode of duplication. Users can also calculate substitution rates per substitution site (i.e., Ka and Ks) from duplicate pairs, find peaks in Ks distributions with Gaussian Mixture Models (GMMs), and classify gene pairs into age groups based on Ks peaks.

Address of the bookmark: https://bioconductor.org/packages/release/bioc/html/doubletrouble.html

ansible: Simple, agentless IT automation that anyone can use

Rahul Nayak — Wed, 17 Apr 2019 21:41:04 -0500

Ansible is a universal language, unraveling the mystery of how work gets done. Turn tough tasks into repeatable playbooks. Roll out enterprise-wide protocols with the push of a button. Give your team the tools to automate, solve, and share.

Address of the bookmark: https://www.ansible.com/

Scalpel

Shruti Paniwala — Wed, 20 Aug 2014 02:07:58 -0500

A team from Cold Spring Harbor Laboratory has released an algorithm, called Scalpel, for finding insertions and deletions in next-generation sequencing data sets. Scalpel, which is open source and available for download on SourceForge, outperformed the popular tools GATK HaplotypeCaller and SOAPindel in test runs on both simulated and real whole human exomes.

Like other indel callers, Scalpel works by performing de novo assembly of regions of interest, so that misalignment to the reference genome cannot obscure the presence of an insertion or deletion. Scalpel's innovation is to repeatedly check its assembly before comparing to the reference genome, to account for simple sequence repeats that are a regular source of error in indel calling. When Scalpel assembles an exon, it collects reads that map to that exon (including partial matches), splits them into k-mers, and creates a de Bruijn graph to span the exon; however, if it detects repeats in the map, it iteratively increases the size of the k-mers by one base until the repeats are eliminated. This ensures that the final assembly of the exon is highly accurate while minimizing compute time.
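The k-mer selection loop described above can be sketched as follows. This is a simplified illustration, not Scalpel's actual implementation: it assumes that a "repeat" shows up as a cycle in a graph whose nodes are k-mers and whose edges connect consecutive k-mers within a read, and all function names and the `k_start`/`k_max` parameters are hypothetical.

```python
def build_graph(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())
        for current, following in zip(kmers, kmers[1:]):
            graph[current].add(following)
    return graph


def has_cycle(graph):
    """Iterative depth-first search; a cycle stands in for a repeat here."""
    unvisited, in_progress, done = 0, 1, 2
    state = {node: unvisited for node in graph}
    for start in graph:
        if state[start] != unvisited:
            continue
        state[start] = in_progress
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if state[nxt] == in_progress:
                    return True  # back edge: the path revisits a k-mer
                if state[nxt] == unvisited:
                    state[nxt] = in_progress
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                state[node] = done
                stack.pop()
    return False


def choose_k(reads, k_start, k_max):
    """Increase k by one base until the graph is free of cycles."""
    for k in range(k_start, k_max + 1):
        if not has_cycle(build_graph(reads, k)):
            return k
    return None  # no repeat-free k found in the allowed range


if __name__ == "__main__":
    print(choose_k(["ACGTAC", "GTACGG"], k_start=3, k_max=6))
```

In the toy reads, the 3-mer ACG occurs twice in the underlying sequence, so k=3 yields a cycle and the loop moves on to the next k.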

The Cold Spring Harbor team's validation of Scalpel, published over the weekend in Nature Methods, compares Scalpel's performance on a live whole exome against HaplotypeCaller and SOAPindel. The donor is an individual with serious neurological disorders, which may be linked to a high incidence of indels. One thousand indels from this individual's exome, called by one or more of the informatics pipelines, were selected for focused resequencing. This resequencing revealed a 77% true positive rate for Scalpel calls, dramatically better than the rates for either of the competing tools; Scalpel performed especially well with indels longer than five base pairs, a traditional weak point for indel callers.

Finally, the authors demonstrate Scalpel's use on a large set of genetic data from nearly 600 families who donated samples to the Simons Simplex Collection, a project of the Simons Foundation Autism Research Initiative. Scalpel found a very high enrichment for indels in children affected by autism, compared with their unaffected siblings, a pattern that persisted even after excluding common variants.

Picard

Neel — Fri, 29 Apr 2016 08:21:54 -0500

Picard is a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF. These file formats are defined in the Hts-specs repository. See especially the SAM specification and the VCF specification.

Note that the information on this page is targeted at end-users. For developers, the source code, building instructions and implementation/development resources are available on GitHub.

The Picard toolkit is open-source under the MIT license and free for all uses.

Enjoy!

Address of the bookmark: http://broadinstitute.github.io/picard/