BOL: Related items

Scripts for the analysis of HGT in genome sequence data.

Jit — Wed, 29 Nov 2017 16:44:10 -0600

Scripts for the analysis of HGT in genome sequence data

Address of the bookmark: https://github.com/reubwn/hgt

Mugsy: multiple whole genome alignment tool

Jit — Fri, 08 Dec 2017 17:41:14 -0600

Mugsy is a multiple whole genome aligner. Mugsy uses Nucmer for pairwise alignment, a custom graph-based segmentation procedure for identifying collinear regions, and the segment-based progressive multiple alignment strategy from Seqan::TCoffee. Mugsy accepts draft genomes in the form of multi-FASTA files and does not require a reference genome.

To cite Mugsy, use:

Angiuoli SV and Salzberg SL. Mugsy: Fast multiple alignment of closely related whole genomes. Bioinformatics 2011 27(3):334-4

Address of the bookmark: http://mugsy.sourceforge.net/

Delta: a new Web-based 3D genome visualization and analysis platform

Jit — Wed, 20 Dec 2017 08:49:55 -0600

Delta is an integrative visualization and analysis platform for visually annotating and exploring the 3D physical architecture of genomes. Delta takes a Hi-C or ChIA-PET contact matrix as input and predicts the topologically associating domains and chromatin loops in the genome. It then generates a physical 3D model that represents a plausible consensus 3D structure of the genome. Delta features a highly interactive visualization tool that integrates genome topology and physical structure with extensive genome annotation by juxtaposing the 3D model with diverse genomic assay outputs.

https://github.com/zhangzhwlab/delta

Address of the bookmark: https://github.com/zhangzhwlab/delta

MGcV: the microbial genomic context viewer for comparative genome analysis

Jit — Mon, 29 Jan 2018 04:55:46 -0600

MGcV is an interactive web-based visualization tool tailored to facilitate small-scale genome analysis. To start using MGcV:

Supply your genes/genomic segments/phylogenetic tree of interest in the input-box by
- selecting the type of identifier and pasting identifiers (one per line)
- or by using the gene ID search tool
- or with the BLAST search tool
Click "Visualize context".

Consult the documentation to learn more about MGcV.

Address of the bookmark: http://mgcv.cmbi.ru.nl/

Carefully opt for human reference genome

biogeek — Tue, 18 Feb 2020 07:43:32 -0600

Heng Li describes several issues with the human reference genome files distributed by various resources and recommends the following compressed FASTA file as the hg38/GRCh38 human reference genome.

If you map reads to GRCh38 or hg38, use the following:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

There are several other versions of GRCh37/GRCh38. What’s wrong with them? The post collects the potential issues.

More at http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use
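
The recommended file name contains "no_alt", which indicates that ALT contigs are left out of that analysis set. As a rough check on some other reference FASTA, the following Python sketch lists sequence names that look like ALT contigs. It assumes UCSC-style names in which ALT contigs end in "_alt"; the function names and the tiny in-memory demo are hypothetical illustrations, not part of any tool mentioned here.

```python
import gzip
import io


def fasta_names(handle):
    """Yield the first word of every FASTA header line in a text handle."""
    for line in handle:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                yield parts[0]


def find_alt_contigs(handle, suffix="_alt"):
    """Return names that look like ALT contigs (assumed naming rule: ends in suffix)."""
    return [name for name in fasta_names(handle) if name.endswith(suffix)]


def open_fasta(path):
    """Open a plain or gzip-compressed FASTA file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


# Tiny in-memory example instead of a multi-gigabyte reference file.
demo = io.StringIO(">chr1 test\nACGT\n>chr1_KI270762v1_alt\nACGT\n>chrM\nACGT\n")
alt_names = find_alt_contigs(demo)
print(alt_names)
```

For a real file, pass the handle returned by `open_fasta(path)` to `find_alt_contigs`; an empty list suggests, under the naming assumption above, that the file carries no ALT contigs.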

Address of the bookmark: http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads

Rahul Nayak — Fri, 11 May 2018 05:07:45 -0500

MECAT is an ultra-fast mapping, error correction, and de novo assembly tool for single-molecule sequencing (SMRT) reads. MECAT employs novel alignment and error correction algorithms that are much more efficient than state-of-the-art aligners and error correction tools. MECAT can be used to assemble large genomes de novo. For example, on a 32-thread computer with a 2.0 GHz CPU, MECAT takes 9.5 days to assemble a human genome from 54x SMRT data, which is 40 times faster than the current PBcR-Mhap pipeline. MECAT's performance was compared with the PBcR-Mhap pipeline, FALCON, and Canu (v1.3) on five real datasets. The quality of the assembled contigs produced by MECAT is the same as or better than that of the PBcR-Mhap pipeline and FALCON.

https://www.nature.com/articles/nmeth.4432

Address of the bookmark: https://github.com/xiaochuanle/MECAT

assemblytics: delta file to analyze alignments of an assembly to another assembly or a reference genome

Jit — Thu, 14 Jun 2018 07:31:00 -0500

- Download and install MUMmer.
- Align your assembly to a reference genome using nucmer (from the MUMmer package):
  $ nucmer -maxmatch -l 100 -c 500 REFERENCE.fa ASSEMBLY.fa -prefix OUT
  Consult the MUMmer manual if you encounter problems.
- Optional: Gzip the delta file to speed up upload (usually 2-4X faster):
  $ gzip OUT.delta
  Then use the OUT.delta.gz file for upload.
- Upload the .delta or .delta.gz file to Assemblytics.

Important: Use only contigs rather than scaffolds from the assembly. This will prevent false positives when the number of Ns in the scaffolded sequence does not match perfectly to the distance in the reference.

The unique sequence length required represents an anchor for determining if a sequence is unique enough to safely call variants from, which is an alternative to the mapping quality filter for read alignment.

http://assemblytics.com/
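
If only a scaffolded assembly is at hand, one way to follow the contigs-only advice is to cut each scaffold at its runs of Ns before aligning. The Python sketch below illustrates that idea and is not part of Assemblytics or MUMmer. The function name `split_scaffold`, the `min_gap` threshold, and the `_contigN` naming are assumptions made for the example.

```python
import re


def split_scaffold(name, sequence, min_gap=1):
    """Split one scaffold into contigs at runs of at least min_gap N bases.

    Returns a list of (contig_name, contig_sequence) tuples.
    Shorter N runs are kept inside the contig unchanged.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    pieces = [piece for piece in gap.split(sequence) if piece]
    return [(f"{name}_contig{i}", piece) for i, piece in enumerate(pieces, start=1)]


# Demo scaffold: a 10-base gap (split here) and a single N (kept, below min_gap).
scaffold = "ACGT" + "N" * 10 + "GGCCNTTAA"
contigs = split_scaffold("scaffold1", scaffold, min_gap=10)
for contig_name, contig_seq in contigs:
    print(f">{contig_name}\n{contig_seq}")
```

The right `min_gap` depends on how the scaffolder encodes gaps, so it is worth checking the assembler's documentation before choosing a value.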

Address of the bookmark: http://assemblytics.com/

LINKS scaffolder bloomfilter setting !

Jit — Fri, 15 Jun 2018 10:39:54 -0500

➜ bin git:(master) ✗ ls -l
total 68
drwxrwxr-x 3 urbe urbe 4096 Jun 15 12:15 lib
-rwxrwxrwx 1 urbe urbe 65141 Jun 15 17:13 LINKS
➜ bin git:(master) ✗ pwd
/home/urbe/Tools/LINKS_1.8.6/bin

➜ bloomfilter git:(master) ✗ swig -Wall -c++ -perl5 BloomFilter.i
➜ bloomfilter git:(master) ✗ g++ -c BloomFilter_wrap.cxx -I/home/urbe/anaconda3/lib/perl5/5.22.0/x86_64-linux-thread-multi/CORE/ -fPIC -Dbool=char -O3
BloomFilter_wrap.cxx:1892:30: fatal error: ../BloomFilter.hpp: No such file or directory
compilation terminated.
➜ bloomfilter git:(master) ✗ cd swig
➜ swig git:(master) ✗ g++ -c BloomFilter_wrap.cxx -I/home/urbe/anaconda3/lib/perl5/5.22.0/x86_64-linux-thread-multi/CORE/ -fPIC -Dbool=char -O3
In file included from BloomFilter_wrap.cxx:1877:0:
../BloomFilter.hpp: In member function ‘void BloomFilter::loadHeader(FILE*)’:
../BloomFilter.hpp:141:59: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
fread(&header, sizeof(struct FileHeader), 1, file);
^
➜ swig git:(master) ✗ g++ -Wall -shared BloomFilter_wrap.o -o BloomFilter.so -O3
➜ swig git:(master) ✗ cd ..
➜ bloomfilter git:(master) ✗ cd ..
➜ lib git:(master) ✗ cd ..
➜ bin git:(master) ✗ ./LINKS
Usage: ./LINKS [v1.8.6]
-f sequences to scaffold (Multi-FASTA format, required)
-s file-of-filenames, full path to long sequence reads or MPET pairs [see below] (Multi-FASTA/fastq format, required)
-m MPET reads (default -m 1 = yes, default = no, optional)
! DO NOT SET IF NOT USING MPET. WHEN SET, LINKS WILL EXPECT A SPECIAL FORMAT UNDER -s
! Paired MPET reads in their original outward orientation <- -> must be separated by ":"
>template_name
ACGACACTATGCATAAGCAGACGAGCAGCGACGCAGCACG:ATATATAGCGCACGACGCAGCACAGCAGCAGACGAC
-d distance between k-mer pairs (ie. target distances to re-scaffold on. default -d 4000, optional)
Multiple distances are separated by comma. eg. -d 500,1000,2000,3000
-k k-mer value (default -k 15, optional)
-t step of sliding window when extracting k-mer pairs from long reads (default -t 2, optional)
Multiple steps are separated by comma. eg. -t 10,5
-o offset position for extracting k-mer pairs (default -o 0, optional)
-e error (%) allowed on -d distance e.g. -e 0.1 == distance +/- 10% (default -e 0.1, optional)
-l minimum number of links (k-mer pairs) to compute scaffold (default -l 5, optional)
-a maximum link ratio between two best contig pairs (default -a 0.3, optional)
*higher values lead to least accurate scaffolding*
-z minimum contig length to consider for scaffolding (default -z 500, optional)
-b base name for your output files (optional)
-r Bloom filter input file for sequences supplied in -s (optional, if none provided will output to .bloom)
NOTE: BLOOM FILTER MUST BE DERIVED FROM THE SAME FILE SUPPLIED IN -f WITH SAME -k VALUE
IF YOU DO NOT SUPPLY A BLOOM FILTER, ONE WILL BE CREATED (.bloom)
-p Bloom filter false positive rate (default -p 0.001, optional; increase to prevent memory allocation errors)
-x Turn off Bloom filter functionality (-x 1 = yes, default = no, optional)
-v Runs in verbose mode (-v 1 = yes, default = no, optional)

Error: Missing mandatory options -f and -s.

The rebuild above fixed this error:

perl: symbol lookup error: /home/urbe/Tools/LINKS_new/bin/./lib/bloomfilter/swig/BloomFilter.so: undefined symbol: Perl_Gthr_key_ptr
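
The -k, -d, -t, -o and -e options in the usage text describe how k-mer pairs are drawn from long reads. The following Python sketch illustrates that idea only and is not LINKS's own code. The function names are hypothetical, and measuring the distance from the start of the first k-mer to the start of the second is an assumption that may differ from the convention LINKS uses internally.

```python
def kmer_pairs(read, k=15, distance=4000, step=2, offset=0):
    """Extract (position, first_kmer, second_kmer) tuples from one long read.

    Assumption: `distance` is measured between the start positions of the
    two k-mers. `step` is the sliding-window step, `offset` the first position.
    """
    pairs = []
    last_start = len(read) - distance - k  # last position where both k-mers fit
    for i in range(offset, last_start + 1, step):
        first = read[i:i + k]
        second = read[i + distance:i + distance + k]
        pairs.append((i, first, second))
    return pairs


def within_tolerance(observed, distance, error=0.1):
    """True if an observed distance lies within distance +/- error * distance."""
    return abs(observed - distance) <= error * distance


# Toy read with small parameters so the result is easy to inspect.
demo_pairs = kmer_pairs("ACGTACGTACGT", k=3, distance=5, step=2)
for position, first, second in demo_pairs:
    print(position, first, second)
```

A read shorter than `distance + k` yields no pairs, which matches the intuition that a target distance can only be sampled from reads long enough to span it.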

Converting a VCF into a FASTA given some reference !

Jit — Fri, 20 Jul 2018 10:03:53 -0500

Samtools/BCFtools (Heng Li) provides a Perl script, vcfutils.pl, whose vcf2fq function (lines 469-528) does this.

Others have modified this script to convert indels as well, for example vcf2fq.pl by David Eccles:

./vcf2fq.pl -f <input.fasta> <all-site.vcf> > <output.fastq>

https://github.com/gringer/bioinfscripts/blob/master/vcf2fq.pl

https://github.com/lh3/samtools/blob/master/bcftools/vcfutils.pl
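
To make the basic idea concrete, here is a minimal Python sketch that substitutes SNP alleles from VCF lines into a reference sequence. It is not a port of vcfutils.pl or vcf2fq.pl. It assumes tab-separated VCF records, applies only the first ALT allele of single-base substitutions, and ignores genotypes, quality filters and indels. The function name `apply_snps` and the demo data are hypothetical.

```python
def apply_snps(reference, vcf_lines):
    """Return a copy of `reference` (dict: name -> sequence) with SNPs applied.

    Only single-base REF/ALT records are used; everything else is skipped.
    """
    seqs = {name: list(seq) for name, seq in reference.items()}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        alt = alt.split(",")[0]  # assumption: take the first ALT allele only
        if len(ref) != 1 or len(alt) != 1 or alt in ".*":
            continue  # skip indels, symbolic alleles and sites without an ALT
        index = int(pos) - 1  # VCF positions are 1-based
        if seqs[chrom][index].upper() != ref.upper():
            raise ValueError(f"REF mismatch at {chrom}:{pos}")
        seqs[chrom][index] = alt
    return {name: "".join(bases) for name, bases in seqs.items()}


reference = {"chr1": "ACGTACGT"}
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t2\t.\tC\tT\t50\tPASS\t.",   # SNP: applied
    "chr1\t5\t.\tA\tAG\t50\tPASS\t.",  # insertion: skipped in this sketch
]
consensus = apply_snps(reference, vcf)
for name, seq in consensus.items():
    print(f">{name}\n{seq}")
```

The REF check guards against pairing a VCF with the wrong reference, which is an easy mistake when several builds of the same genome are in use.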

P_RNA_scaffolder: a fast and accurate genome scaffolder using paired-end RNA-sequencing reads

BioStar — Fri, 07 Sep 2018 05:19:06 -0500

P_RNA_scaffolder is a novel scaffolding tool that uses paired-end RNA-seq reads to scaffold genome fragments. The method is suitable for most genomes. The program can use Illumina paired-end RNA-sequencing reads from the target species. The method provides a practical alternative to existing mate-pair-based or protein-based approaches (for instance, PEP_scaffolder) for scaffolding genome sequences. Its most important feature is improving the completeness of gene regions and long-coding gene regions (for instance, circRNA).

Address of the bookmark: http://www.fishbrowser.org/software/P_RNA_scaffolder/#