BOL: Related items

CrossMap: a program for convenient conversion of genome coordinates

Jit — Thu, 31 May 2018 06:00:47 -0500

CrossMap is a program for convenient conversion of genome coordinates (or annotation files) between different assemblies, such as Human hg18 (NCBI36) <> hg19 (GRCh37) or Mouse mm9 (MGSCv37) <> mm10 (GRCm38). It supports the most commonly used file formats, including SAM/BAM, Wiggle/BigWig, BED, GFF/GTF and VCF. CrossMap is designed to lift over genome coordinates between assemblies. It is not a program for aligning sequences to a reference genome. We do not recommend using CrossMap to convert genome coordinates between species.
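To make the idea of "lifting over" a coordinate concrete, here is a minimal Python sketch. It is not CrossMap's implementation: it assumes a hypothetical list of ungapped aligned blocks (old start, new start, length), similar in spirit to the blocks of a chain file, and the block values are invented for illustration.

```python
from typing import List, Optional, Tuple

# Each block is (old_start, new_start, length): an ungapped stretch that is
# the same in both assemblies. The values below are invented toy data.
Block = Tuple[int, int, int]


def lift_over(pos: int, blocks: List[Block]) -> Optional[int]:
    """Map a 0-based position from the old assembly to the new one.

    Returns None when the position falls between blocks, i.e. when it
    has no counterpart in the new assembly.
    """
    for old_start, new_start, length in blocks:
        if old_start <= pos < old_start + length:
            return new_start + (pos - old_start)
    return None


# Toy example: 50 bases were inserted in the new assembly after old
# position 99, and old positions 150-199 were deleted.
toy_blocks = [(0, 0, 100), (100, 150, 50), (200, 200, 300)]
```

A position inside a deleted region returns None, which corresponds to a coordinate that cannot be converted.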

Address of the bookmark: http://crossmap.sourceforge.net

assemblytics: delta file to analyze alignments of an assembly to another assembly or a reference genome

Jit — Thu, 14 Jun 2018 07:31:00 -0500

1. Download and install MUMmer.
2. Align your assembly to a reference genome using nucmer (from the MUMmer package):
    $ nucmer -maxmatch -l 100 -c 500 REFERENCE.fa ASSEMBLY.fa -prefix OUT
   Consult the MUMmer manual if you encounter problems.
3. Optional: gzip the delta file to speed up upload (usually 2-4X faster):
    $ gzip OUT.delta
   Then use the OUT.delta.gz file for upload.
4. Upload the .delta or .delta.gz file to Assemblytics.

Important: Use only contigs rather than scaffolds from the assembly. This will prevent false positives when the number of Ns in the scaffolded sequence does not match perfectly to the distance in the reference.

The unique sequence length required represents an anchor for determining if a sequence is unique enough to safely call variants from, which is an alternative to the mapping quality filter for read alignment.

Address of the bookmark: http://assemblytics.com/

LINKS scaffolder bloomfilter setting !

Jit — Fri, 15 Jun 2018 10:39:54 -0500

➜ bin git:(master) ✗ ls -l
total 68
drwxrwxr-x 3 urbe urbe 4096 Jun 15 12:15 lib
-rwxrwxrwx 1 urbe urbe 65141 Jun 15 17:13 LINKS
➜ bin git:(master) ✗ pwd
/home/urbe/Tools/LINKS_1.8.6/bin

➜ bloomfilter git:(master) ✗ swig -Wall -c++ -perl5 BloomFilter.i
➜ bloomfilter git:(master) ✗ g++ -c BloomFilter_wrap.cxx -I/home/urbe/anaconda3/lib/perl5/5.22.0/x86_64-linux-thread-multi/CORE/ -fPIC -Dbool=char -O3
BloomFilter_wrap.cxx:1892:30: fatal error: ../BloomFilter.hpp: No such file or directory
compilation terminated.
➜ bloomfilter git:(master) ✗ cd swig
➜ swig git:(master) ✗ g++ -c BloomFilter_wrap.cxx -I/home/urbe/anaconda3/lib/perl5/5.22.0/x86_64-linux-thread-multi/CORE/ -fPIC -Dbool=char -O3
In file included from BloomFilter_wrap.cxx:1877:0:
../BloomFilter.hpp: In member function ‘void BloomFilter::loadHeader(FILE*)’:
../BloomFilter.hpp:141:59: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
fread(&header, sizeof(struct FileHeader), 1, file);
^
➜ swig git:(master) ✗ g++ -Wall -shared BloomFilter_wrap.o -o BloomFilter.so -O3
➜ swig git:(master) ✗ cd ..
➜ bloomfilter git:(master) ✗ cd ..
➜ lib git:(master) ✗ cd ..
➜ bin git:(master) ✗ ./LINKS
Usage: ./LINKS [v1.8.6]
-f sequences to scaffold (Multi-FASTA format, required)
-s file-of-filenames, full path to long sequence reads or MPET pairs [see below] (Multi-FASTA/fastq format, required)
-m MPET reads (default -m 1 = yes, default = no, optional)
! DO NOT SET IF NOT USING MPET. WHEN SET, LINKS WILL EXPECT A SPECIAL FORMAT UNDER -s
! Paired MPET reads in their original outward orientation <- -> must be separated by ":"
>template_name
ACGACACTATGCATAAGCAGACGAGCAGCGACGCAGCACG:ATATATAGCGCACGACGCAGCACAGCAGCAGACGAC
-d distance between k-mer pairs (ie. target distances to re-scaffold on. default -d 4000, optional)
Multiple distances are separated by comma. eg. -d 500,1000,2000,3000
-k k-mer value (default -k 15, optional)
-t step of sliding window when extracting k-mer pairs from long reads (default -t 2, optional)
Multiple steps are separated by comma. eg. -t 10,5
-o offset position for extracting k-mer pairs (default -o 0, optional)
-e error (%) allowed on -d distance e.g. -e 0.1 == distance +/- 10% (default -e 0.1, optional)
-l minimum number of links (k-mer pairs) to compute scaffold (default -l 5, optional)
-a maximum link ratio between two best contig pairs (default -a 0.3, optional)
*higher values lead to least accurate scaffolding*
-z minimum contig length to consider for scaffolding (default -z 500, optional)
-b base name for your output files (optional)
-r Bloom filter input file for sequences supplied in -s (optional, if none provided will output to .bloom)
NOTE: BLOOM FILTER MUST BE DERIVED FROM THE SAME FILE SUPPLIED IN -f WITH SAME -k VALUE
IF YOU DO NOT SUPPLY A BLOOM FILTER, ONE WILL BE CREATED (.bloom)
-p Bloom filter false positive rate (default -p 0.001, optional; increase to prevent memory allocation errors)
-x Turn off Bloom filter functionality (-x 1 = yes, default = no, optional)
-v Runs in verbose mode (-v 1 = yes, default = no, optional)

Error: Missing mandatory options -f and -s.
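The -k, -d, -t and -o options in the usage text above describe how k-mer pairs are taken from long reads. The following Python sketch illustrates that idea only; it is not LINKS's code. It assumes the distance is measured from the first base of the first k-mer to the last base of the second, and the function name and return layout are hypothetical.

```python
from typing import Iterator, Tuple


def kmer_pairs(read: str, k: int = 15, distance: int = 4000,
               step: int = 2, offset: int = 0) -> Iterator[Tuple[str, str, int]]:
    """Yield (first_kmer, second_kmer, start) for each sampled k-mer pair.

    Assumption (not taken from LINKS): `distance` spans from the first
    base of the first k-mer to the last base of the second k-mer.
    """
    last_start = len(read) - distance
    # Slide along the read in increments of `step`, starting at `offset`.
    for start in range(offset, last_start + 1, step):
        first = read[start:start + k]
        second = read[start + distance - k:start + distance]
        yield first, second, start
```

A larger step samples fewer pairs per read, which in this sketch reduces the amount of data to store at the cost of fewer potential links.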

ERROR fixed

perl: symbol lookup error: /home/urbe/Tools/LINKS_new/bin/./lib/bloomfilter/swig/BloomFilter.so: undefined symbol: Perl_Gthr_key_ptr
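The -p option in the usage text above notes that increasing the false positive rate can prevent memory allocation errors. A rough way to see why is the textbook Bloom filter sizing formula, sketched below in Python. This is the generic formula, not necessarily how LINKS sizes its filter, and the function names are hypothetical.

```python
import math


def bloom_filter_bits(n_items: int, fpr: float) -> int:
    """Textbook optimal bit count for n_items at false positive rate fpr."""
    return math.ceil(-n_items * math.log(fpr) / (math.log(2) ** 2))


def optimal_hash_count(n_items: int, n_bits: int) -> int:
    """Textbook optimal number of hash functions for the given size."""
    return max(1, round(n_bits / n_items * math.log(2)))


# Illustrative comparison for one billion k-mers (an invented figure).
for fpr in (0.001, 0.01):
    bits = bloom_filter_bits(1_000_000_000, fpr)
    print(f"p={fpr}: about {bits / 8 / 1e9:.2f} GB")
```

Under this formula, relaxing the rate from 0.001 to 0.01 shrinks the filter by about a third.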

GFinisher: a new strategy to refine and finish bacterial genome assemblies

Jit — Thu, 26 Jul 2018 09:31:55 -0500

GFinisher is an application for the refinement and finishing of prokaryotic genome assemblies. It uses the bias of GC skew to identify assembly errors and organizes the contigs/scaffolds against reference genomes.

java -Xms2G -Xmx4G -jar GenomeFinisher.jar  \
    -i target_contigs.fasta  \
    -ds alternative_assemblies.fasta -ref reference.fasta  \
    -o outputDirectory
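GC skew, the signal mentioned in the description above, is commonly defined as (G - C) / (G + C) over a window of sequence. The Python sketch below computes it for non-overlapping windows. It only illustrates the quantity; it is not GFinisher's algorithm, and the window size and function name are assumptions.

```python
from typing import List


def gc_skew(seq: str, window: int = 1000) -> List[float]:
    """Return (G - C) / (G + C) for consecutive non-overlapping windows.

    Windows without any G or C get a skew of 0.0. A trailing partial
    window is ignored.
    """
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews
```

In many bacterial chromosomes the sign of the skew changes near the replication origin and terminus, so an unexpected sign change inside a contig can hint at a misassembly.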

Address of the bookmark: http://gfinisher.sourceforge.net

MITOS: improved de novo metazoan mitochondrial genome annotation

Jit — Fri, 26 Oct 2018 08:25:39 -0500

Allows automatic annotation of metazoan mitochondrial genomes. MITOS is a pipeline designed to compute a consistent de novo annotation of the mitogenomic sequences. The software allows for a systematic error screening, the standardisation of gene name and gene boundary designation, anticodon labelling of tRNAs, and provides the means for the assessment of the validity of a gene assignment.

Address of the bookmark: http://mitos.bioinf.uni-leipzig.de/index.py

ASCIIGenome: genome browser based on command line interface and designed for running from console terminals.

Neel — Fri, 09 Nov 2018 13:50:04 -0600

ASCIIGenome is a genome browser with a command-line interface, designed for running from console terminals.

Since ASCIIGenome does not require a graphical interface, it is particularly useful for quickly visualizing genomic data on remote servers, while offering flexibility similar to popular GUI viewers like IGV.

Documentation is at readthedocs/asciigenome.

Address of the bookmark: https://github.com/dariober/ASCIIGenome

Merqury: reference-free quality and phasing assessment for genome assemblies

Jit — Sat, 06 Jun 2020 05:38:34 -0500

Often, genome assembly projects have Illumina whole-genome sequencing reads available for the assembled individual. The k-mer spectrum of this read set can be used to evaluate assembly quality independently, without the need for a high-quality reference. Merqury provides a set of tools for this purpose.

https://github.com/marbl/meryl
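The idea of checking an assembly against the k-mers of a read set can be sketched in a few lines of Python. This is a toy illustration, not how Merqury or meryl work internally: it counts k-mers in memory, ignores reverse complements, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def count_kmers(seqs: Iterable[str], k: int) -> Counter:
    """Count every k-mer (forward strand only) in the given sequences."""
    counts: Counter = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def assembly_only_kmers(assembly: Iterable[str], reads: Iterable[str],
                        k: int) -> Set[str]:
    """Return assembly k-mers that never occur in the reads.

    Such k-mers are candidates for assembly errors, since the reads
    do not support them.
    """
    read_kmers = count_kmers(reads, k)
    asm_kmers = count_kmers(assembly, k)
    return {kmer for kmer in asm_kmers if kmer not in read_kmers}
```

The fewer assembly-only k-mers there are relative to all assembly k-mers, the better the reads support the assembly.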

Address of the bookmark: https://github.com/marbl/merqury

ARCS: scaffolding genome drafts with linked reads

Jit — Mon, 17 Dec 2018 17:40:28 -0600

ARCS requires two input files:

Draft assembly FASTA file
Interleaved linked reads file (the barcode sequence is expected in the BX tag of the read header or in the form "@readname_barcode"; run Long Ranger basic on raw Chromium reads to produce this interleaved file)
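The two barcode layouts described above can be told apart with a small Python helper. This is a sketch for checking your own read headers, not code from ARCS. It assumes the BX tag is written as BX:Z:<barcode> in the header comment, and the function name and example headers are hypothetical.

```python
import re
from typing import Optional

# Assumed tag layout: "BX:Z:<barcode>" somewhere in the header line.
_BX_TAG = re.compile(r"BX:Z:(\S+)")


def barcode_from_header(header: str) -> Optional[str]:
    """Extract a barcode from a FASTQ header line, or return None.

    Tries the BX tag first, then falls back to the "@readname_barcode"
    form, taking the text after the last underscore of the read name.
    """
    match = _BX_TAG.search(header)
    if match:
        return match.group(1)
    fields = header.lstrip("@").split()
    if fields and "_" in fields[0]:
        return fields[0].rsplit("_", 1)[1]
    return None
```

Reads for which this returns None carry no recognizable barcode in either layout.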

Address of the bookmark: https://github.com/bcgsc/ARCS/

CANU genome assembly parameters !

Rahul Nayak — Mon, 07 Jan 2019 08:40:37 -0600

Choose the appropriate parameters and run Canu. The assembly will take about an hour. You can use two cores (parameter -maxThreads=2). Since we compute on a single Amazon server, disable the cluster option by setting useGrid=false. For your own project, these settings should be discussed with a local computing guru. The parameters in square brackets [] are optional; the symbol | stands for "or".

usage:   canu [-correct | -trim | -assemble | -trim-assemble] \
              [-s <assembly-specifications-file>] \
               -p <assembly-prefix> \
               -d <assembly-directory> \
               genomeSize=<number>[g|m|k] \
               -maxThreads=2 \
               useGrid=false \
              [other-options] \
               read_file.fastq.gz

A default Canu run usually produces a high-quality assembly; an example of a command that was used for testing can be found below. However, there are still many parameters that can be tweaked. For example, depending on whether we want to assemble haplotypes separately or smash them together, we can alter the error correction process.

canu -p test_asmbl \
     -d asm_test3 \
     genomeSize=2m \
     -maxThreads=2 useGrid=false \
     -pacbio-raw ~/pacbio/dna/sample_reads.fastq.gz

There is a brilliant section in the documentation about parameter tweaking.

The output directory will contain many files. The most interesting ones are:

*.correctedReads.fasta.gz : file containing the input sequences after correction, trimmed and split based on consensus evidence
*.trimmedReads.fastq : file containing the sequences after correction and final trimming
*.layout : file containing information about read inclusion in the final assembly
*.gfa : file containing the assembly graph produced by Canu
*.contigs.fasta : file containing everything that could be assembled and is part of the primary assembly

The basic stats of assembly can be read from reports generated by the assembler, or calculated using standard UNIX command line tools.
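As an alternative to command line tools, basic statistics such as contig count, total length and N50 can be computed with a short Python script. The sketch below is one way to do it; the function names are hypothetical, and the commented-out file path is only an assumption based on the -d and -p values used in the example command.

```python
from typing import Iterable, List


def contig_lengths(fasta_lines: Iterable[str]) -> List[int]:
    """Return the length of each record in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: Iterable[int]) -> int:
    """Length L such that contigs of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return 0


# Hypothetical usage; the path is an assumption, adjust it to your run:
# with open("asm_test3/test_asmbl.contigs.fasta") as handle:
#     lengths = contig_lengths(handle)
#     print(len(lengths), sum(lengths), n50(lengths))
```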

More at https://canu.readthedocs.io/en/latest/faq.html

Evaluation of genome assembly software based on long reads

BioStar — Fri, 01 Feb 2019 11:55:54 -0600

Third-generation sequencing (TGS) technologies have been used to produce highly accurate de novo assemblies of hundreds of microbial genomes and highly contiguous reconstructions of many dozens of plant and animal genomes, enabling new insights into evolution and sequence diversity. They have also been applied to resequencing analyses, to create detailed maps of structural variations in many species. In addition, these new technologies have been used to fill in many of the gaps in the human reference genome.

In this report, we compare and evaluate several genome assembly programs based on TGS technology. The experiments were performed on 4 reference genomes and the results were evaluated with the QUAST software. The 11 programs evaluated are: Celera Assembler, Falcon, Miniasm, Newbler, SGA Assembler, Smartdenovo, Abruijn, Ra, DBG2OLC, Spades and Cerulean. The first 8 use only long reads, while the last 3 can merge long and short reads.