BOL: Related items

Illumina-based assembly pipeline steps

Surabhi Chaudhary — Fri, 10 Dec 2021 06:22:54 -0600

Illumina

Merge re-sequenced FastQ files (cat)
Read QC (FastQC)
Adapter trimming (fastp)
Removal of host reads (Kraken 2; optional)
Variant calling
1. Read alignment (Bowtie 2)
2. Sort and index alignments (SAMtools)
3. Primer sequence removal (iVar; amplicon data only)
4. Duplicate read marking (picard; optional)
5. Alignment-level QC (picard, SAMtools)
6. Genome-wide and amplicon coverage QC plots (mosdepth)
7. Choice of multiple variant calling and consensus sequence generation routes (iVar variants and consensus; default for amplicon data || BCFTools, BEDTools; default for metagenomics data)
  - Variant annotation (SnpEff, SnpSift)
  - Consensus assessment report (QUAST)
  - Lineage analysis (Pangolin)
  - Clade assignment, mutation calling and sequence quality checks (Nextclade)
  - Individual variant screenshots with annotation tracks (ASCIIGenome)
8. Intersect variants across callers (BCFTools)
De novo assembly
1. Primer trimming (Cutadapt; amplicon data only)
2. Choice of multiple assembly tools (SPAdes || Unicycler || minia)
  - Blast to reference genome (blastn)
  - Contiguate assembly (ABACAS)
  - Assembly report (PlasmidID)
  - Assembly assessment report (QUAST)
Present QC and visualisation for raw read, alignment, assembly and variant calling results (MultiQC)
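
As an illustration of the first step (merging re-sequenced FastQ files with cat), here is a minimal Python sketch. It relies on the fact that concatenated gzip streams form a valid gzip file, so a byte-wise merge behaves like `cat run1.fastq.gz run2.fastq.gz > merged.fastq.gz`. The function names and file names are hypothetical and not part of any pipeline.

```python
import gzip
import shutil
from pathlib import Path
from typing import Iterable


def merge_fastq_gz(parts: Iterable[Path], merged: Path) -> None:
    """Concatenate gzipped FastQ files byte for byte (same effect as `cat`)."""
    with open(merged, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def count_reads(path: Path) -> int:
    """Count reads, assuming the common 4-lines-per-record FastQ layout."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4
```

Comparing `count_reads` on the merged file with the sum over the input files is a cheap sanity check before running read QC.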

Bactopia: a flexible pipeline for complete analysis of bacterial genomes

Abhi — Sat, 08 Jun 2024 16:25:08 -0500

Bactopia is a flexible pipeline for complete analysis of bacterial genomes. The goal of Bactopia is to process your data with a broad set of tools, so that you can get to the fun part of analyses quicker!

Bactopia was inspired by Staphopia, a workflow we (Tim Read and myself) released that is targeted towards Staphylococcus aureus genomes. Using what we learned from Staphopia and user feedback, Bactopia was developed from scratch with usability, portability, and speed in mind from the start.

Bactopia uses Nextflow to manage the workflow, allowing for support of many types of environments (e.g. cluster or cloud). Bactopia allows for the usage of many public datasets as well as your own datasets to further enhance the analysis of your sequencing. Bactopia only uses software packages available from Bioconda and Conda-Forge to make installation as simple as possible for all users.

To highlight the use of Bactopia and Bactopia Tools, we performed an analysis of 1,664 public Lactobacillus genomes, focusing on Lactobacillus crispatus, a species that is a common part of the human vaginal microbiome. The results from this analysis are published in mSystems under the title "Bactopia: a flexible pipeline for complete analysis of bacterial genomes".

Address of the bookmark: https://bactopia.github.io/latest/

Bioinformatician’s Pocket Reference !!

RAJESH DETROJA — Sun, 08 Jun 2014 09:56:58 -0500

It is amusing how the brains of bioinformaticians work! Spending days learning a new programming language feels like much more fun than a five-minute chat with the neighbours (unless under special circumstances!) in our own mother tongue. Today every bioinformatician keeps more than a few languages and core IT toolkits on their plate. It has become mandatory to be able to mould different code snippets into our own custom workflows, so keeping syntax at our fingertips has become essential. Although Google is the best way to get a syntax problem solved, it is not a bad idea to keep reference sheets on our smartphones or stick some printed sheets on the back of the door, the old-fashioned way!!

Address of the bookmark: http://infoplatter.wordpress.com/2014/04/06/bioinformaticians-pocket-reference/

Perl Special Vars Quick Reference

Abhimanyu Singh — Tue, 07 Feb 2017 05:08:47 -0600

`$_`	The default or implicit variable.
`@_`	Subroutine parameters.
`$a` `$b`	sort comparison routine variables.
`@ARGV`	The command-line args.
Regular Expressions
`$1`, `$2`, ...	Regexp parenthetical capture holders.
`$&`	Last successful match (degrades performance).
`${^MATCH}`	Similar to `$&` without performance penalty. Requires /p modifier.
`` $` ``	Prematch for last successful match string (degrades performance).
`${^PREMATCH}`	Similar to `` $` `` without performance penalty. Requires `/p` modifier.
`$'`	Postmatch for last successful match string (degrades performance).
`${^POSTMATCH}`	Similar to `$'` without performance penalty. Requires `/p` modifier.
`$+`	Last paren match.
`$^N`	Last closed paren match (last submatch).
`@+`	Offsets of ends of successful submatches in scope.
`@-`	Offsets of starts of successful submatches in scope.
`%+`	Like `@+`, but for named submatches.
`%-`	Like `@-`, but for named submatches.
`$^R`	Last regexp (?{code}) result.
`${^RE_DEBUG_FLAGS}`	Current value of regexp debugging flags. See `use re 'debug';`
`${^RE_TRIE_MAXBUF}`	Control memory allocations for RE optimizations for large alternations.
Encoding
`${^ENCODING}`	The object reference to the Encode object, used to convert the source code to Unicode.
`${^OPEN}`	Internal use: \0 separated Input / Output layer information.
`${^UNICODE}`	Read-only Unicode settings.
`${^UTF8CACHE}`	State of the internal UTF-8 offset caching code.
`${^UTF8LOCALE}`	Indicates whether UTF8 locale was detected at startup.
IO and Separators
`$.`	Current line number (or record number) of most recent filehandle.
`$/`	Input record separator.
`$|`	Output autoflush. 1=autoflush, 0=default. Applies to currently selected handle.
`$,`	Output field separator (lists)
`$\`	Output record separator.
`$"`	Output list separator. (interpolated lists)
`$;`	Subscript separator. (Use a real multidimensional array instead.)
Formats
`$%`	Page number for currently selected output channel.
`$=`	Current page length.
`$-`	Number of lines left on page.
`$~`	Format name.
`$^`	Name of top-of-page format.
`$:`	Format line break characters
`$^L`	Form feed (default "\f").
`$^A`	Format Accumulator
Status Reporting
`$?`	Child error. Status code of most recent system call or pipe.
`$!`	Operating System Error. (What just went 'bang'?)
`%!`	Error number hash
`$^E`	Extended Operating System Error (Extra error explanation).
`$@`	Eval error.
`${^CHILD_ERROR_NATIVE}`	Native status returned by the last pipe close, backtick (`` ) command, successful call to wait() or waitpid(), or from the system() operator.
IDs and Process Information
`$$`	Process ID
`$<`	Real user id of process.
`$>`	Effective user id of process.
`$(`	Real group id of process.
`$)`	Effective group id of process.
`$0`	Program name.
`$^O`	Operating System name.
Perl Status Info
`$]`	Perl version as a decimal number (for example 5.008001); older numeric counterpart of `$^V`.
`$^C`	Current value of flag associated with -c switch.
`$^D`	Current value of debugging flags
`$^F`	Maximum system file descriptor.
`$^I`	Value of the -i (inplace edit) switch.
`$^M`	Emergency Memory pool.
`$^P`	Internal variable for debugging support.
`$^R`	Last regexp (?{code}) result.
`$^S`	Exceptions being caught. (eval)
`$^T`	Base time of program start.
`$^V`	Perl version.
`$^W`	Status of -w switch
`${^WARNING_BITS}`	Current set of warning checks enabled by `use warnings;`
`$^X`	Perl executable name.
`${^GLOBAL_PHASE}`	Current phase of the Perl interpreter.
`$^H`	Internal use only: Hook into Lexical Scoping.
`%^H`	Internal use only: Useful to implement scoped pragmas.
`${^TAINT}`	Taint mode read-only flag.
`${^WIN32_SLOPPY_STAT}`	If true on Windows `stat()` won't try to open the file.
Command Line Args
`ARGV`	Filehandle iterates over files from command line (see also `<>`).
`$ARGV`	Name of current file when reading `<>`.
`@ARGV`	List of command line args.
`ARGVOUT`	Output filehandle for -i switch
Miscellaneous
`@F`	Autosplit (-a mode) recipient.
`@INC`	List of library paths.
`%INC`	Keys are filenames, values are paths to modules included via `use`, `require`, or `do`.
`%ENV`	Hash containing current environment variables
`%SIG`	Signal handlers.
`$[`	Array and substr first element (Deprecated!).

See perlvar for detailed descriptions of each of these (and a few more) special variables.

kSNP3.0: SNP detection and phylogenetic analysis of genomes without genome alignment or reference genome

Jit — Fri, 08 Dec 2017 16:48:40 -0600

Sept. 20, 2017: Version 3.1 released. Major upgrade. Version 3.1 fixes the problems with SNP annotation that arose when NCBI discontinued use of GI numbers. Please read carefully the Preface (page 3) and the File of annotated genomes section (pages 9-10) in the version 3.1 User Guide. Thanks to Tom Slezak for revising the get_genbank_file3 script and to Tod Stuber (USDA) for testing version 3.1 even though he doesn't need the annotation feature. All users are encouraged to upgrade to version 3.1.

Address of the bookmark: https://sourceforge.net/projects/ksnp/files/

assemblytics: delta file to analyze alignments of an assembly to another assembly or a reference genome

Jit — Thu, 14 Jun 2018 07:31:00 -0500

1. Download and install MUMmer.
2. Align your assembly to a reference genome using nucmer (from the MUMmer package):
   $ nucmer -maxmatch -l 100 -c 500 REFERENCE.fa ASSEMBLY.fa -prefix OUT
   Consult the MUMmer manual if you encounter problems.
3. Optional: Gzip the delta file to speed up upload (usually 2-4X faster):
   $ gzip OUT.delta
   Then use the OUT.delta.gz file for upload.
4. Upload the .delta or .delta.gz file to Assemblytics (http://assemblytics.com/).

Important: Use only contigs rather than scaffolds from the assembly. This will prevent false positives when the number of Ns in the scaffolded sequence does not match perfectly to the distance in the reference.

The unique sequence length required represents an anchor for determining if a sequence is unique enough to safely call variants from, which is an alternative to the mapping quality filter for read alignment.
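
The delta file written by nucmer is plain text, so it can be sanity-checked before upload. The sketch below is a minimal reader, assuming the usual delta layout: a `>ref qry ref_len qry_len` line per sequence pair, then seven-field alignment headers (reference start and end, query start and end, then three error counts), each followed by indel offsets ending in `0`. The class and function names are hypothetical; this is not the parser Assemblytics itself uses.

```python
import gzip
from dataclasses import dataclass
from typing import Iterable, Iterator, TextIO


@dataclass
class Alignment:
    ref: str
    qry: str
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int
    errors: int

    @property
    def ref_span(self) -> int:
        """Number of reference bases covered by this alignment."""
        return abs(self.ref_end - self.ref_start) + 1


def open_delta(path: str) -> TextIO:
    """Open OUT.delta or OUT.delta.gz as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def parse_delta(lines: Iterable[str]) -> Iterator[Alignment]:
    """Yield one Alignment per seven-field alignment header line."""
    ref = qry = None
    for line in lines:
        fields = line.split()
        if line.startswith(">"):
            # New reference/query pair: ">ref qry ref_len qry_len"
            ref, qry = fields[0][1:], fields[1]
        elif len(fields) == 7 and ref is not None:
            r_start, r_end, q_start, q_end, errors = map(int, fields[:5])
            yield Alignment(ref, qry, r_start, r_end, q_start, q_end, errors)
        # Anything else (file header, indel offsets) is skipped in this sketch.
```

For example, `sum(a.ref_span for a in parse_delta(open_delta("OUT.delta.gz")))` gives a rough total of aligned reference bases (overlapping alignments are counted more than once).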

Address of the bookmark: http://assemblytics.com/

npScarf: Scaffolding and Completing Assemblies in Real-time Fashion

Jit — Tue, 23 May 2017 04:53:29 -0500

npScarf (jsa.np.npscarf) is a program that scaffolds and completes draft genome assemblies in real time with Oxford Nanopore sequencing. The pipeline can run on a computing cluster as well as on a laptop computer for microbial datasets. It also facilitates the real-time analysis of positional information such as gene ordering and the detection of genes from mobile elements (plasmids and genomic islands).

Complete paper at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5321748/

Address of the bookmark: https://github.com/mdcao/npScarf

WAAFLE: a Workflow to Annotate Assemblies and Find LGT Events.

Neel — Thu, 23 Sep 2021 14:31:06 -0500

Lateral gene transfer (LGT) is an important mechanism for genome diversification in microbial communities, including the human microbiome. While methods exist to identify LGTs from sequenced isolate genomes, identifying LGTs from community metagenomes remains an open problem. To address this, we developed WAAFLE: a Workflow to Annotate Assemblies and Find LGT Events.

Address of the bookmark: http://huttenhower.sph.harvard.edu/waafle

MIX: Combining multiple assemblies from NGS data

Rahul Nayak — Tue, 08 May 2018 04:58:05 -0500

Mix is a tool that combines two or more draft assemblies without relying on a reference genome. Its goal is to reduce contig fragmentation and thus speed up genome finishing. The proposed algorithm builds an extension graph where vertices represent extremities of contigs and edges represent existing alignments between these extremities. These alignment edges are used for contig extension. The resulting output assembly corresponds to a path in the extension graph that maximizes the cumulative contig length.
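
To make the extension-graph idea concrete, here is a toy Python sketch. It is not the Mix implementation: it assumes each contig has two extremities ("L" and "R"), treats alignments as undirected edges between extremities of different contigs, ignores overlap lengths, and finds the best path by brute force, which is only feasible for a handful of contigs. All names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

End = Tuple[str, str]  # (contig name, "L" or "R")


def other_end(end: End) -> End:
    """Return the opposite extremity of the same contig."""
    name, side = end
    return (name, "R" if side == "L" else "L")


def best_path(lengths: Dict[str, int],
              alignments: List[Tuple[End, End]]) -> Tuple[int, List[str]]:
    """Find the contig path with the largest cumulative contig length."""
    # Undirected adjacency between contig extremities.
    adj: Dict[End, List[End]] = {}
    for a, b in alignments:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)

    best: Tuple[int, List[str]] = (0, [])

    def walk(entry: End, used: Set[str], total: int, order: List[str]) -> None:
        nonlocal best
        name = entry[0]
        used = used | {name}
        total += lengths[name]
        order = order + [name]
        if total > best[0]:
            best = (total, order)
        # Traverse the contig, then follow alignments leaving its far end.
        for nxt in adj.get(other_end(entry), []):
            if nxt[0] not in used:
                walk(nxt, used, total, order)

    for name in lengths:
        for side in ("L", "R"):
            walk((name, side), set(), 0, [])
    return best
```

With contigs A (100), B (50), C (70) and D (10), where the right end of A aligns to the left ends of both B and D and the right end of B aligns to the left end of C, the sketch prefers the chain A, B, C over the shorter extension through D.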

The Mix algorithm, approach and results were published in BMC Bioinformatics: http://www.biomedcentral.com/1471-2105/14/S15/S16

Address of the bookmark: https://github.com/cbib/MIX

KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies

Jit — Fri, 06 Jul 2018 03:36:45 -0500

KAT is a suite of tools that analyse jellyfish hashes or sequence files (fasta or fastq) using kmer counts. The following tools are currently available in KAT:

hist: Creates a histogram of k-mer occurrences from a sequence file. Adds metadata in output for easy plotting.
gcp: K-mer GC Processor. Creates a matrix of the number of K-mers found given a GC count and a K-mer count.
comp: K-mer comparison tool. Creates a matrix of shared K-mers between two (or three) sequence files or hashes.
sect: SEquence Coverage estimator Tool. Estimates the coverage of each sequence in a file using K-mers from another sequence file.
blob: Given reads and an assembly, calculates both the read and assembly K-mer coverage along with GC% for each sequence in the assembly.
filter: Filtering tools. Contains tools for filtering k-mer hashes and FastQ/A files:
- kmer: Produces a k-mer hash containing only k-mers within specified coverage and GC tolerances.
- seq: Filters a sequence file based on whether or not the sequences contain k-mers within a provided hash.
plot: Plotting tools. Contains several plotting tools to visualise and compare K-mer distributions. The following plot tools are available:
- density: Creates a density plot from a matrix created with the "comp" tool. Typically this is used to compare two K-mer hashes produced by different NGS reads.
- profile: Creates a K-mer coverage plot for a single sequence. Takes in FASTA coverage output from the "sect" tool.
- spectra-cn: Creates a stacked histogram using a matrix created with the "comp" tool. Typically this is used to compare a jellyfish hash produced from a read set to a jellyfish hash produced from an assembly. The plot shows the amount of distinct K-mers absent, as well as the copy number variation present within the assembly.
- spectra-hist: Creates a K-mer spectra plot for a set of K-mer histograms produced either by jellyfish-histo or kat-histo.
- spectra-mx: Creates a K-mer spectra plot for a set of K-mer histograms that are derived from selected rows or columns in a matrix produced by the "comp" tool.
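
To show what the "hist" and "gcp" outputs represent, here is a small pure-Python sketch. It is an illustration only, not how KAT works internally: it counts strand-specific k-mers in a single string (no canonical k-mers, no jellyfish hashes), and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def count_kmers(seq: str, k: int) -> Counter:
    """Count every k-mer in seq, skipping k-mers that contain N."""
    seq = seq.upper()
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(kmer for kmer in kmers if "N" not in kmer)


def kmer_histogram(counts: Counter) -> Dict[int, int]:
    """Map occurrence count -> number of distinct k-mers seen that often."""
    return dict(sorted(Counter(counts.values()).items()))


def gc_matrix(counts: Counter) -> Dict[Tuple[int, int], int]:
    """Map (GC bases in k-mer, occurrence count) -> number of distinct k-mers."""
    matrix: Counter = Counter()
    for kmer, n in counts.items():
        gc = kmer.count("G") + kmer.count("C")
        matrix[(gc, n)] += 1
    return dict(matrix)
```

For the toy sequence `ACGTACGT` with k = 4, `kmer_histogram` reports three k-mers seen once and one k-mer (`ACGT`) seen twice; on real read sets the same histogram is what reveals the error peak and the coverage peaks.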

In addition, KAT contains a python script for analysing the mathematical distributions present in the K-mer spectra in order to determine how much content is present in each peak.

This README only contains some brief details of how to install and use KAT. For more extensive documentation please visit: https://kat.readthedocs.org/en/latest/

https://academic.oup.com/bioinformatics/article/33/4/574/2664339

Address of the bookmark: https://github.com/TGAC/KAT