BOL: Related items

PANDASEQ is a program to align Illumina reads, optionally with PCR primers embedded in the sequence, and reconstruct an overlapping sequence.

BioStar — Fri, 21 Sep 2018 10:19:52 -0500

Development packages for zlib and libbz2 are needed, as well as a standard compiler environment. On Ubuntu, these can be installed via:

sudo apt-get install build-essential libtool automake zlib1g-dev libbz2-dev pkg-config

On macOS, the Apple Developer tools and Fink (or MacPorts or Homebrew) must be installed, then:

sudo fink install bzip2-dev pkgconfig

Address of the bookmark: https://github.com/neufeld/pandaseq

FastGT: an alignment-free method for calling common SNVs directly from raw sequencing reads

Jit — Tue, 28 Jan 2020 03:27:33 -0600

FastGT is a program package for whole-genome genotyping of genome variants directly from raw sequencing reads. It is written in C and runs on Linux. FastGT uses a list of variant-specific k-mer pairs that are unique in the human genome, counts the frequency of these k-mers in the sequencing data, and predicts the genotype. All this takes less than 1 hour on an average low-cost Linux server.
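
The k-mer pair idea can be illustrated with a small Python sketch. This is not FastGT's code or its statistical model: the function names, the `min_count` threshold, and the simple presence/absence rule are assumptions made for illustration, and reverse-complement k-mers are ignored.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer that occurs in a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def naive_genotype(counts, ref_kmer, alt_kmer, min_count=2):
    """Call a genotype from the counts of one variant-specific k-mer pair.

    Hypothetical rule: an allele is 'seen' if its k-mer occurs at least
    min_count times. A real caller would model coverage statistically.
    """
    ref_seen = counts[ref_kmer] >= min_count
    alt_seen = counts[alt_kmer] >= min_count
    if ref_seen and alt_seen:
        return "het"
    if ref_seen:
        return "hom_ref"
    if alt_seen:
        return "hom_alt"
    return "no_call"
```

For example, counting 5-mers in reads that carry both `ACGTA` and `ACGCA` at least twice each would give a heterozygous call under this rule.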

http://bioinfo.ut.ee/FastGT/

https://github.com/bioinfo-ut/GenomeTester4/

Address of the bookmark: http://bioinfo.ut.ee/FastGT/

KAD: Assessing genome assemblies using K-mer copies in assemblies and K-mer abundance in Illumina reads

Jit — Fri, 19 Jun 2020 07:34:12 -0500

KAD is designed for evaluating the nucleotide-level accuracy of genome assemblies. Briefly, k-mer abundances are quantified for both the sequencing reads and the assembly sequences. Comparing the two values yields a single value per k-mer, the K-mer Abundance Difference (KAD), which indicates how well the assembly matches the read data for that k-mer.

In the KAD definition, c is the count of a k-mer in the reads, m is the mode of the read k-mer counts, and n is the copy number of the k-mer in the assembly.
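
The formula itself is not reproduced in this bookmark. As an assumption to be checked against the KAD repository, the definition is taken here to be KAD = log2((c + m) / (m × (n + 1))). A minimal Python sketch under that assumption:

```python
import math

def kad(c, m, n):
    """K-mer Abundance Difference for one k-mer.

    c: count of the k-mer in the reads
    m: mode of the read k-mer counts (must be > 0)
    n: copy number of the k-mer in the assembly
    Assumed formula: log2((c + m) / (m * (n + 1))).
    """
    return math.log2((c + m) / (m * (n + 1)))
```

Under this formula, a single-copy k-mer seen at the modal depth (c = m, n = 1) scores 0, a k-mer at modal depth that is missing from the assembly (n = 0) scores 1, and a k-mer present once in the assembly but absent from the reads (c = 0) scores -1.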

Address of the bookmark: https://github.com/liu3zhenlab/KAD

Steps to find palindromes in genomes

BioStar — Thu, 09 Mar 2023 02:56:54 -0600

In DNA, a palindrome is a sequence of nucleotides that reads the same 5' to 3' on both strands, i.e. it equals its own reverse complement (for example, GAATTC). Palindromes are present in genomes and have various biological functions. Here are some methods for discovering palindromes in genomes:

Direct sequence search: One of the simplest ways to discover palindromes is to search the genome sequence directly for palindromic sequences using pattern matching tools, such as regular expressions or string algorithms. This approach can be useful for discovering simple palindromes, but may miss more complex palindromic structures.
Dot plot analysis: A dot plot is a graphical method for identifying palindromic regions in a genome. The genome sequence is plotted against itself, with reverse-complement matches included, and the resulting patterns are examined. Palindromic regions (inverted repeats) appear as lines running perpendicular to the main diagonal.
Restriction enzyme analysis: Some restriction enzymes, such as EcoRI and HindIII, recognize palindromic sequences and cleave DNA at these sites. By digesting the genome with these enzymes and examining the resulting fragments, palindromic regions can be identified.
Long-read sequencing: Long-read technologies, such as PacBio and Oxford Nanopore, generate reads that can span entire palindromic regions. By mapping these reads to the genome, palindromic regions can be identified and characterized.
Comparative genomics: Comparing the genomes of related species can also reveal palindromic regions that are conserved across evolutionarily divergent lineages. This approach can help identify functional palindromes that are under selective pressure.
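
The direct sequence search from the list above can be sketched in a few lines of Python. The sketch treats a palindrome as a window that equals its own reverse complement, only looks for perfect palindromes without a spacer, and uses hypothetical function names and length limits chosen for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_palindromes(seq, min_len=4, max_len=12):
    """Return (start, sequence) for every perfect palindromic window.

    Only even lengths are scanned: no base is its own complement, so a
    perfect reverse-complement palindrome cannot have odd length.
    Windows containing anything other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    first_even = min_len + (min_len % 2)  # round up to an even length
    hits = []
    for length in range(first_even, max_len + 1, 2):
        for start in range(len(seq) - length + 1):
            window = seq[start:start + length]
            if set(window) <= set("ACGT") and window == reverse_complement(window):
                hits.append((start, window))
    return hits
```

This brute-force scan is quadratic in the window range and is only practical for short sequences or small length limits; for whole genomes a dedicated tool is the better choice.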

Overall, palindromic sequences in genomes can be discovered using a variety of methods, each with its own advantages and limitations. A combination of these methods can provide a comprehensive understanding of the palindromic landscape of a genome.

Trimmomatic: A flexible read trimming tool for Illumina NGS data

Jit — Fri, 15 Apr 2016 05:58:53 -0500

Paired End:

java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

This will perform the following:

Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10)
Remove leading low quality or N bases (below quality 3) (LEADING:3)
Remove trailing low quality or N bases (below quality 3) (TRAILING:3)
Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15)
Drop reads shorter than 36 bases (MINLEN:36)
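
To make the SLIDINGWINDOW and MINLEN steps above concrete, here is a simplified Python sketch. It is not Trimmomatic's implementation: it cuts at the start of the first window whose average quality falls below the threshold, leaves reads shorter than the window untouched, and uses hypothetical function names.

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 FASTQ quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]

def sliding_window_cut(qualities, window=4, min_avg=15):
    """Return how many bases to keep from the 5' end of the read.

    Simplification: the read is cut at the start of the first window
    whose mean quality is below min_avg.
    """
    if len(qualities) < window:
        return len(qualities)
    for start in range(len(qualities) - window + 1):
        chunk = qualities[start:start + window]
        if sum(chunk) / window < min_avg:
            return start
    return len(qualities)

def trim_read(seq, qual, window=4, min_avg=15, min_len=36):
    """Apply the window cut, then drop the read if it is too short."""
    keep = sliding_window_cut(phred33_scores(qual), window, min_avg)
    if keep < min_len:
        return None  # read is discarded, as MINLEN would do
    return seq[:keep], qual[:keep]
```

A 40-base read with uniformly high quality passes through unchanged, while a read whose second half is low quality is cut to fewer than 36 bases and dropped.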

More at http://www.usadellab.org/cms/?page=trimmomatic

Address of the bookmark: http://www.usadellab.org/cms/?page=trimmomatic

cutadapt

Radha Agarkar — Fri, 13 May 2016 04:54:50 -0500

Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.

Cleaning your data in this way is often required: Reads from small-RNA sequencing contain the 3’ sequencing adapter because the read is longer than the molecule that is sequenced. Amplicon reads start with a primer sequence. Poly-A tails are useful for pulling out RNA from your sample, but often you don’t want them to be in your reads.

Cutadapt helps with these trimming tasks by finding the adapter or primer sequences in an error-tolerant way. It can also modify and filter reads in various ways. Adapter sequences can contain IUPAC wildcard characters. Paired-end reads and even colorspace data are also supported. If you want, you can also just demultiplex your input data, without removing adapter sequences at all.
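
The idea of error-tolerant 3' adapter removal can be sketched in Python as follows. This is an illustration, not cutadapt's algorithm: it counts substitutions only (no insertions or deletions, no IUPAC wildcards), and the function name, the default error rate, and the minimum overlap are assumptions.

```python
def trim_3prime_adapter(read, adapter, max_error_rate=0.1, min_overlap=3):
    """Remove a 3' adapter, allowing a fraction of mismatches.

    Tries each start position from left to right. The adapter may be
    complete or cut off by the end of the read. A position matches if
    the number of mismatches is at most max_error_rate * overlap,
    rounded down.
    """
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(adapter), len(read) - start)
        segment = read[start:start + overlap]
        mismatches = sum(1 for a, b in zip(segment, adapter) if a != b)
        if mismatches <= int(max_error_rate * overlap):
            return read[:start]
    return read
```

With a 13-base adapter and a 10% error rate, one mismatch is tolerated in a full-length adapter match, while a short partial match at the very end of the read must be exact.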

Cutadapt comes with an extensive suite of automated tests and is available under the terms of the MIT license.

If you use cutadapt, please cite DOI:10.14806/ej.17.1.200.

Address of the bookmark: https://cutadapt.readthedocs.io/en/stable/installation.html#quickstart

WgSim

Jit — Thu, 23 Jun 2016 07:26:49 -0500

Read simulator

Wgsim is a small tool for simulating sequence reads from a reference genome. It is able to simulate diploid genomes with SNPs and insertion/deletion (INDEL) polymorphisms, and simulate reads with uniform substitution sequencing errors. It does not generate INDEL sequencing errors, but this can be partly compensated by simulating INDEL polymorphisms.

Wgsim outputs the simulated polymorphisms, and writes the true read coordinates as well as the number of polymorphisms and sequencing errors in read names. One can evaluate the accuracy of a mapper or a SNP caller with wgsim_eval.pl that comes with the package.
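
A toy Python sketch of the core idea, sampling reads from a reference and adding uniform substitution errors, is shown below. It is not wgsim: it produces single-end reads only, simulates no polymorphisms, and the read-name layout (`read_<start>_<errors>`) is a hypothetical stand-in for the information wgsim writes into read names.

```python
import random

def simulate_reads(reference, n_reads, read_len, error_rate=0.02, seed=0):
    """Sample reads uniformly from reference with substitution errors.

    Returns a list of (name, sequence) tuples. The name records the
    0-based start position and the number of introduced errors.
    """
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        bases = list(reference[start:start + read_len])
        n_errors = 0
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # substitute with one of the three other bases
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
                n_errors += 1
        reads.append((f"read_{start}_{n_errors}", "".join(bases)))
    return reads
```

Because the true origin and error count are stored in the name, the output of a mapper can be compared against them, which is the same principle the evaluation script mentioned above relies on.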

Address of the bookmark: https://github.com/lh3/wgsim

BIMA V3: an aligner customized for mate pair library sequencing

Abhimanyu Singh — Wed, 14 Dec 2016 15:20:00 -0600

Summary: Mate pair library sequencing is an effective and economical method for detecting genomic structural variants and chromosomal abnormalities. Unfortunately, the mapping and alignment of mate pair read pairs to a reference genome is a challenging and time-consuming process for most NGS alignment programs. Large insert sizes, introduction of library preparation protocol artifacts (biotin junction reads, paired-end read contamination, chimeras, etc.), and presence of structural variant breakpoints within reads increase mapping and alignment complexity. We describe an algorithm that is up to 20 times faster and 25% more accurate than popular NGS alignment programs when processing mate pair sequencing.
Availability: http://bioinformaticstools.mayo.edu/research/bima/
Contact: vasmatzis.george@mayo.edu

Address of the bookmark: http://bioinformatics.oxfordjournals.org/content/early/2014/02/12/bioinformatics.btu078.full.pdf

Meraculous: De Novo Genome Assembly with Short Paired-End Reads

Jit — Tue, 07 Nov 2017 04:36:10 -0600

We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (de Bruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3158087/

NxRepair: error correction in de novo assemblies using Nextera Mate Pair Reads

BioStar — Thu, 24 Jan 2019 10:35:12 -0600

NxRepair is a Python module that automatically detects large structural errors in de novo assemblies using Nextera mate pair reads. The detector will break a contig at the site of an identified misassembly and will generate a new fasta file containing both the corrected contigs and the correct, unaffected contigs.

https://nxrepair.readthedocs.io/en/latest/tutorial.html

nxrepair aligned_matepairs.bam assemblyfasta.fasta error_locations.csv new_fasta.fasta

Address of the bookmark: https://github.com/rebeccaroisin/nxrepair