BOL: Related items

GenBank release 257.0 is now available!

Neel — Wed, 23 Aug 2023 00:23:23 -0500

GenBank release 257.0 is now available! This release has 25.10 trillion bases and 3.69 billion records. Learn more: https://ncbiinsights.ncbi.nlm.nih.gov/2023/08/21/genbank-release-257/

GenBank release 257.0 (8/15/2023) is now available on the NCBI FTP site. This release has 25.10 trillion bases and 3.69 billion records.

The current release has:

246,119,175 traditional records containing 2,112,058,517,945 base pairs of sequence data
2,631,493,489 WGS records containing 22,294,446,104,543 base pairs of sequence data
686,271,945 bulk-oriented TSA records containing 646,176,166,908 base pairs of sequence data
124,421,006 bulk-oriented TLS records containing 48,289,699,026 base pairs of sequence data
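
As a quick sanity check, the four category figures above can be summed to confirm the headline totals. This is a minimal sketch: the dictionary and variable names are illustrative, and the numbers are copied directly from the list.

```python
# Per-category (records, base pairs) figures copied from the release summary above.
release_257 = {
    "traditional": (246_119_175, 2_112_058_517_945),
    "WGS": (2_631_493_489, 22_294_446_104_543),
    "TSA": (686_271_945, 646_176_166_908),
    "TLS": (124_421_006, 48_289_699_026),
}

total_records = sum(records for records, _ in release_257.values())
total_bases = sum(bases for _, bases in release_257.values())

# Express the totals in the units used by the announcement.
print(f"{total_records:,} records (~{total_records / 1e9:.2f} billion)")
print(f"{total_bases:,} bases (~{total_bases / 1e12:.2f} trillion)")
```

The sums round to 3.69 billion records and 25.10 trillion bases, matching the announcement.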

minimap2: A versatile pairwise aligner for genomic and spliced nucleotide sequences

Jit — Wed, 20 Jun 2018 07:55:29 -0500

git clone https://github.com/lh3/minimap2
cd minimap2 && make

# long sequences against a reference genome
./minimap2 -a test/MT-human.fa test/MT-orang.fa > test.sam

# create an index first and then map
./minimap2 -d MT-human.mmi test/MT-human.fa
./minimap2 -a MT-human.mmi test/MT-orang.fa > test.sam

# use presets (no test data)
./minimap2 -ax map-pb ref.fa pacbio.fq.gz > aln.sam        # PacBio genomic reads
./minimap2 -ax map-ont ref.fa ont.fq.gz > aln.sam          # Oxford Nanopore genomic reads
./minimap2 -ax sr ref.fa read1.fa read2.fa > aln.sam       # short genomic paired-end reads
./minimap2 -ax splice ref.fa rna-reads.fa > aln.sam        # spliced long reads
./minimap2 -ax splice -k14 -uf ref.fa reads.fa > aln.sam   # Nanopore Direct RNA-seq
./minimap2 -cx asm5 asm1.fa asm2.fa > aln.paf              # intra-species asm-to-asm alignment
./minimap2 -x ava-pb reads.fa reads.fa > overlaps.paf      # PacBio read overlap
./minimap2 -x ava-ont reads.fa reads.fa > overlaps.paf     # Nanopore read overlap

# man page for detailed command line options
man ./minimap2.1
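
Several of the commands above write PAF output (aln.paf, overlaps.paf). The following is a minimal sketch of reading one PAF line in Python, assuming the standard 12 mandatory tab-separated columns and ignoring any optional tags. The names PafRecord, parse_paf_line and block_identity are hypothetical helpers, and the example line is made up rather than real minimap2 output.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The 12 mandatory PAF columns (optional trailing tags are ignored)."""
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int        # number of matching residues
    block_length: int   # alignment block length
    mapq: int


def parse_paf_line(line: str) -> PafRecord:
    """Split one PAF line on tabs and convert the numeric columns to int."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError(f"expected at least 12 columns, got {len(fields)}")
    return PafRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]), fields[4],
        fields[5], int(fields[6]), int(fields[7]), int(fields[8]),
        int(fields[9]), int(fields[10]), int(fields[11]),
    )


def block_identity(rec: PafRecord) -> float:
    """Rough identity estimate: matching residues over alignment block length."""
    return rec.matches / rec.block_length if rec.block_length else 0.0


# Made-up example line for illustration only.
example = "read1\t1000\t10\t990\t+\tchrM\t16569\t200\t1185\t950\t1000\t60\ttp:A:P"
rec = parse_paf_line(example)
print(rec.query_name, rec.target_name, rec.strand, f"{block_identity(rec):.3f}")
```

The ratio of column 10 to column 11 is only a coarse measure, since the block length also counts gaps.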

Address of the bookmark: https://github.com/lh3/minimap2

COSINE: non-seeding method for mapping long noisy sequences

Jit — Fri, 26 Oct 2018 00:41:59 -0500

Third-generation sequencing (TGS) technologies are highly promising, but the long, noisy reads they produce are difficult to align using existing algorithms. Here, we present COSINE, a conceptually new method designed specifically for aligning long reads that contain a high level of errors.

Address of the bookmark: https://github.com/SUwonglab/COSINE

LTR_Finder: an efficient program for finding full-length LTR retrotransposons in genome sequences

Neel — Sun, 13 Jan 2019 07:05:53 -0600

LTR_Finder is an efficient program for finding full-length LTR retrotransposons in genome sequences.

The program first constructs all exact match pairs with a suffix-array-based algorithm and extends them into long, highly similar pairs. The Smith-Waterman algorithm is then used to adjust the ends of LTR pair candidates and obtain alignment boundaries. These boundaries are re-adjusted using supporting information from the TG..CA box and TSRs (target site repeats), and reliable LTRs are selected. Next, LTR_FINDER tries to identify the PBS (primer binding site), PPT (polypurine tract) and RT (reverse transcriptase) inside LTR pairs using built-in alignment and counting modules. RT identification includes a dynamic programming step to handle frameshifts. For other protein domains, LTR_FINDER calls ps_scan (from PROSITE, http://www.expasy.org/prosite/) to locate the cores of important enzymes if they occur.
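
To illustrate the Smith-Waterman step mentioned above, here is a minimal sketch of local alignment scoring in Python. It is not LTR_FINDER's implementation: the function name, the scoring values (match 2, mismatch -1, gap -2) and the linear gap penalty are assumptions chosen for illustration.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic programming matrix, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared core "ACGT" scores 4 matches x 2 = 8; the flanks do not align.
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))  # prints 8
```

Recovering the actual alignment boundaries, as the program does for LTR ends, would additionally require keeping the full matrix and tracing back from the best-scoring cell.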

Address of the bookmark: https://github.com/xzhub/LTR_Finder

Miropeats: discovers regions of sequence similarity amongst any set of DNA sequences

Poonam Mahapatra — Mon, 26 Aug 2019 17:55:24 -0500

Miropeats discovers regions of sequence similarity amongst any set of DNA sequences and then presents this similarity information graphically. Sequence similarity searching is a very general tool that forms the basis of many different biological sequence analyses but it is limited by the verbosity of traditional alignment presentation styles. Miropeats enhances the utility of conventional DNA sequence comparisons when looking at long lengths of sequence similarity by summarizing extensive large scale sequence similarities on a single page of graphics. The latest version of Miropeats can be used as a general pairwise alignment program or in its traditional role sorting out a big mess of overlapping or similar regions.

Address of the bookmark: http://www.littlest.co.uk/software/bioinf/old_packages/miropeats/

Coronavirus Resources!

Neel — Wed, 25 Mar 2020 17:11:33 -0500

2019nCoVR features comprehensive integration of genomic and proteomic sequences, together with their metadata, from GISAID, NCBI, NMDC and CNCB/NGDC. It also incorporates a wide range of relevant information, including scientific literature, news, and popular science articles, and provides visualization functions for genome variation analysis results based on all collected 2019-nCoV strains.

Annotation

https://bigd.big.ac.cn/ncov/variation/annotation

Genome Warehouse

https://bigd.big.ac.cn/gwh/browse/index

Released Genome

https://bigd.big.ac.cn/ncov/release_genome

Download data

ftp://download.big.ac.cn/Genome/Viruses/Coronaviridae/

Raw data

https://bigd.big.ac.cn/gsa/browse/run/?tag=Coronaviridae

Address of the bookmark: https://bigd.big.ac.cn/ncov/about

Tiara: deep learning-based classification system for eukaryotic sequences

Rahul Nayak — Mon, 14 Mar 2022 23:02:11 -0500

With a large number of metagenomic datasets becoming available, eukaryotic metagenomics emerged as a new challenge. The proper classification of eukaryotic nuclear and organellar genomes is an essential step toward a better understanding of eukaryotic diversity.

Address of the bookmark: https://academic.oup.com/bioinformatics/article/38/2/344/6375939

Basics of BLAST Programs!

BioStar — Fri, 26 Jul 2024 06:04:26 -0500

The Basic Local Alignment Search Tool (BLAST) is a powerful bioinformatics program used to compare an input sequence (such as DNA, RNA, or protein sequences) against a database of sequences to find regions of similarity. Developed by the National Center for Biotechnology Information (NCBI), BLAST is widely used for identifying species, finding functional and evolutionary relationships between sequences, and predicting the function of novel sequences.

Key Features of BLAST:
1. Sequence Comparison: BLAST searches for local alignments between the query sequence and sequences in a database. It identifies regions of similarity, which can help infer functional and evolutionary relationships.

2. Speed and Efficiency: BLAST uses heuristic algorithms, making it faster than exhaustive search methods, suitable for large-scale database searches.

3. Versatility: There are several versions of BLAST for different types of sequence comparisons:
- blastn: Compares a nucleotide query sequence against a nucleotide sequence database.
- blastp: Compares a protein query sequence against a protein sequence database.
- blastx: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
- tblastn: Compares a protein query sequence against a nucleotide sequence database translated in all reading frames.
- tblastx: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

4. Scoring and E-value: BLAST results are scored based on the quality and length of the alignments. The E-value (expect value) is the number of alignments with an equal or better score that one would expect to find by chance in a database of the same size; lower E-values indicate more significant matches.

5. Output Formats: BLAST provides results in various formats, including plain text, HTML, XML, and JSON, making it adaptable for different types of analyses and integrations with other tools.
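
The relationship between score and E-value in point 4 can be sketched with the Karlin-Altschul formula, E = K * m * n * exp(-lambda * S), where S is the raw alignment score and m and n are the query and database lengths. The snippet below is an illustration under stated assumptions, not BLAST's own code: the function names are hypothetical, the lambda and K values are illustrative placeholders (real values depend on the scoring matrix and gap penalties), and BLAST itself uses effective rather than raw lengths.

```python
import math


def bit_score(raw_score: float, lam: float, k_param: float) -> float:
    """Normalize a raw score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k_param)) / math.log(2)


def expect_value(raw_score: float, query_len: int, db_len: int,
                 lam: float, k_param: float) -> float:
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)."""
    return k_param * query_len * db_len * math.exp(-lam * raw_score)


# Illustrative parameters only (assumed, not taken from the text above).
lam, k_param = 0.267, 0.041
query_len, db_len = 300, 1_000_000_000

for raw in (50, 100, 200):
    bits = bit_score(raw, lam, k_param)
    e_val = expect_value(raw, query_len, db_len, lam, k_param)
    print(f"raw={raw:>3}  bits={bits:6.1f}  E={e_val:.2e}")
```

Because E shrinks exponentially with the score but grows only linearly with database size, a modest increase in score outweighs a large increase in the search space.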

Applications of BLAST:
- Genomic Research: Identifying genes, understanding genetic diversity, and mapping genome sequences.
- Protein Function Prediction: Inferring the function of unknown proteins by comparing them to known protein sequences.
- Evolutionary Studies: Exploring evolutionary relationships between organisms by comparing their genetic material.
- Medical Research: Identifying pathogens, understanding disease mechanisms, and developing treatments by comparing sequences of interest.

Overall, BLAST is an essential tool in bioinformatics, offering a reliable and efficient way to analyze and interpret biological sequence data.

MACSE: Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons

Jit — Mon, 18 Feb 2019 04:21:50 -0600

MACSE aligns coding NT sequences with respect to their AA translation while allowing NT sequences to contain multiple frameshifts and/or stop codons. MACSE is hence the first automatic solution to align protein-coding gene datasets containing non-functional sequences (pseudogenes) without disrupting the underlying codon structure. It has also proved useful in detecting undocumented frameshifts in public database sequences and in aligning next-generation sequencing reads/contigs against a reference coding sequence.

For further details about the underlying algorithm see the original publication:
MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons.
Vincent Ranwez, Sébastien Harispe, Frédéric Delsuc, Emmanuel JP Douzery
PLoS One 2011, 6(9): e22594.

Address of the bookmark: https://bioweb.supagro.inra.fr/macse/index.php?menu=releases

FastANI: fast alignment-free computation of whole-genome Average Nucleotide Identity (ANI)

Jit — Fri, 13 Jul 2018 17:27:01 -0500

FastANI is developed for fast alignment-free computation of whole-genome Average Nucleotide Identity (ANI). ANI is defined as the mean nucleotide identity of orthologous gene pairs shared between two microbial genomes. FastANI supports pairwise comparison of both complete and draft genome assemblies. Its underlying procedure follows a workflow similar to that described by Goris et al. 2007. However, it avoids expensive sequence alignments and uses Mashmap as its MinHash-based sequence mapping engine to compute the orthologous mappings and alignment identity estimates. Based on our experiments with complete and draft genomes, its accuracy is on par with a BLAST-based ANI solver and it achieves a speedup of two to three orders of magnitude. It is therefore useful for pairwise ANI computation over a large number of genome pairs. More details about its speed, accuracy and potential applications are described here: "High-throughput ANI Analysis of 90K Prokaryotic Genomes Reveals Clear Species Boundaries".
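
The MinHash idea referred to above can be sketched in a few lines of Python. This is not FastANI's or Mashmap's implementation: it is a simplified bottom-s MinHash over whole sequences, with hypothetical function names, no reverse-complement (canonical k-mer) handling, and the Mash-style conversion from a Jaccard estimate j to an identity estimate, 1 + ln(2j / (1 + j)) / k.

```python
import hashlib
import math


def kmers(seq: str, k: int) -> set:
    """All distinct k-mers of seq (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def bottom_sketch(seq: str, k: int, s: int) -> set:
    """Keep the s smallest k-mer hashes as the sequence's sketch."""
    return set(sorted(_hash(km) for km in kmers(seq, k))[:s])


def minhash_jaccard(sketch_a: set, sketch_b: set, s: int) -> float:
    """Estimate Jaccard similarity from the s smallest hashes of the union."""
    merged = sorted(sketch_a | sketch_b)[:s]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in sketch_a and h in sketch_b)
    return shared / len(merged)


def mash_identity(jaccard: float, k: int) -> float:
    """Convert a Jaccard estimate to an approximate identity in [0, 1]."""
    if jaccard <= 0.0:
        return 0.0
    return max(0.0, 1.0 + math.log(2 * jaccard / (1 + jaccard)) / k)


# Toy sequences for illustration only.
k, s = 5, 16
seq_a = "ACGTTGCATGCCGATAGCTAGCTAGGCTAACGTAGCTAGCATCGA"
seq_b = seq_a[:20] + "T" + seq_a[21:]  # one substitution
j = minhash_jaccard(bottom_sketch(seq_a, k, s), bottom_sketch(seq_b, k, s), s)
print(f"Jaccard estimate: {j:.3f}, identity estimate: {mash_identity(j, k):.3f}")
```

Sketch-based estimates like this trade accuracy for speed; on real genomes much larger k and sketch sizes would be used.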

Address of the bookmark: https://github.com/ParBLiSS/FastANI