Log in

Our Sponsors



Download BioinformaticsOnline(BOL) Apps in your chrome browser.




Latest activity

  • To create a fresh environment for chromatiblock to run in do: conda create --name chromatiblock conda activate chromatiblock conda install chromatiblock --channel conda-forge --channel bioconda Then in future to run chromatiblock you can...
  • BioStar posted to the wire 1596 days ago
    Enter newline at the end of the file $ sed -i -e '$a\' file #NewLine #Sed #Line #New
  • Sibelia: A comparative genomics tool: It assists biologists in analysing the genomic variations that correlate with pathogens, or the genomic changes that help microorganisms adapt in different environments. Sibelia will also be helpful for the...
    Comments
    • BioStar 1529 days ago
      Basic usage
      ===========
      In this manual it is assumed that "Sibelia" is properly installed and the
      directory with "Sibelia.py" is in your "PATH" environment variable or input
      files are in the same folder with "Sibelia" executable.
      
      Directory "examples/Sibelia" contains two sets of bacterial genomes. The easiest
      way to run "Sibelia" is to type:
      
      	Sibelia -s loose <input FASTA file(s)>
      
      For example:
      
      	Sibelia -s loose Helicobacter_pylori.fasta
      
      Important note -- "Sibelia" requires some free space on HDD to run. If you
      experience any problems and see error messages that mention temporary files,
      try to change directory used for temporary files (see section "Directory for
      temporary files").
      
      Above commands will run "Sibelia" on the file "Helicobacter_pylori.fasta" with
      the "loose" simplification parameters. There are two another simplification
      parameters sets, called "fine" and "far". To run "Sibelia" on "Helicobacter_pylori.fasta"
      with "fine" parameters set, type:
      
      	Sibelia -s fine Helicobacter_pylori.fasta
      
      The difference between "loose" and "fine" set is that "loose" usually produces
      fewer blocks, but longer. And it may lose some small synteny blocks 
      (shorter than 15 000 BP), while "fine" option produces more blocks (but shorter)
      and their coverage is worse. Usually "loose" is the best choice, but if you do
      not want to lose information about small-scale rearrangements,  use "fine". See
      also section "Output description" for detailed depiction of the output format.
      The "far" parameters set can be used for analysis of distantly-related genomes.
      However, usage of this set requires more computational time and space and may
      be not appropriate in case of many genomes.
      
      If you are not satisfied by the results (poor coverage, for example), try to set
      simplification parameters manually (see section "Fine tuning"). 
      
      By default, Sibelia filters out synteny blocks shorter than 5 000 BP. You can
      change this behaviour, see section "Minimum block size".
      
      You also may be interested in blocks that occur exactly once in each input
      sequence, such blocks are used as input for MGRA algorithm, for example. To get
      such output use option "-a":
      
      	Sibelia -s loose -a Helicobacter_pylori.fasta
      
      Synteny blocks are visualized with an interactive diagram (see section "d3"
      visualization"). The blocks are also can be visualized with "Circos" (see 
      section "Circos" visualization"). While "Circos" is better for publications,
      "d3" diagram is more suitable for analysis. Please note that sequences that do
      not contain any synteny blocks instances are not shown on these diagrams.
      
      Genomes from the "examples/Sibelia" dir were taken from [5, 6]. Note that you
      can specify multiple FASTA files, just separate them with spaces.
      
      Technical parameters
      ====================
      
      Directory for output files
      --------------------------
      Default directory = "." You can change this by setting cmd parameter:
      
      	-o <dir name> or --outdir <dir name>
      
      By default, "Sibelia" places output files in current working directory. Setting
      this parameter will change output directory.
      
      Directory for temporary files
      -----------------------------
      Default directory is output directory. You can change this by setting cmd parameter:
      
      	-t <dir name> or --tempdir <dir name>
      
      "Sibelia" creates some temporary files while running. By default these files
      are placed in the output directory. If you want to place temporary files in
      another folder due to some reasons, use this parameter. Although the files
      exist only for a very short period of time, they can be quite big -- ~20*N
      bytes, where N is the total size of all input genomes. You can also use switch
      
      	-r or --inram
      	
      that will force "Sibelia" to not create any temporary files and store all it's
      data in RAM.
      
      Output description
      ==================
      By default, "Sibelia" produces following files: 
      
      1. Blocks coordinates
      2. Genomes represented as permutations of the synteny blocks
      3. Coverage report
      4. Files for generating a "Circos" picture
      5. Interactive html-diagram of synteny blocks
      
      There are also optional output files:
      
      1. Sequences file
      2. Dot file with resulting de Bruijn graph
      
      All these files are described below in details.
      
      Blocks coordinates
      ------------------
      File name = "blocks_coords.txt". First part of this file lists input sequences,
      their IDs, sizes and descriptions. IDs are just index numbers of sequences (in
      the same order as they apper in the input files).
      
      Second part of the file describes synteny blocks in sections separated by 
      dashes. Each section starts with the synteny block ID. Then all instances
      of this block are listed in tabular format. Each row of a table depicts
      one instance of this synteny block. Columns of the table designate following:
      
      1. Seq_id -- ID of the sequence, that the instance belongs to
      2. Strand -- strand of the synteny block instance, either '+' or '-'. Input
      sequences are treated as positive strands
      3. Start -- one-based index of the first base pair of the instance
      4. End -- one-based index of the last base pair of the instance
      5. Length -- length of the instance of the synteny block
      
      Note that all coordinates are given relatively to POSITIVE strand of the
      sequence. If an instance of a synteny block is located on the positive strand,
      then start < end, otherwise start > end. 
      
      If you wish, you can obtain coordinates in GFF format. If you use the flag:
      
      	-gff
      
      Then coordinates of synteny blocks are listed in file "blocks_coords.gff". For
      description of this format, see:
      
      	https://cgwb.nci.nih.gov/FAQ/FAQformat.html#format3
      
      Each record represents different copies of a synteny block. Copies having the
      same number in the "tag" field (last column) are instances of the same synteny
      block.
      
      Genomes permutations
      --------------------
      File name = "genomes_permutations.txt".
      
      This file contains input sequences represented as permutations of the synteny
      block. It has two lines for each input sequence:
      
      1. Header line -- FASTA header of the sequence (starting with '>')
      2. Genome line -- sequence of synteny blocks instances as they appear on the 
      positive strand of the sequence. Each instance is represented as a signed
      integer, where '+' sign depicts direct version of the block, and '-' depicts
      reversed block
      
      Coverage report
      ---------------
      File name = "coverage_report.txt".
      
      The file describes portion of the genomes, that found synteny block cover.
      First part of the file describes input sequencess (see "Blocks coordinates"
      section). Second part of the file is a table with the following columns:
      
      1. Degree -- multiplicity of a synteny block. For example, if a synteny block
      has degree = 3, then the are three instances of this block in the input
      sequence
      2. Count -- number of synteny blocks with a given degree
      3. Total -- portion of all the input sequences that are covered by the blocks
      with a given degree
      4. Seq <n> -- portion of the sequence with id <n> that is covered by blocks
      with a given degree
      
      Here is an example of such table from a report file.
      
      | Degree | Count | Total   | Seq 1   | Seq 2   | Seq 3   | Seq 4   |
      | :----- | :---: | :-----: | :-----: | :-----: | :-----: | :-----: |
      | 2	 | 11	 | 3.82%   | 6.59%   | 2.41%   | 2.96%   | 3.30%   |
      | 3	 | 4	 | 1.68%   | 2.24%   | 2.19%   | 1.44%   | 0.85%   |
      | 4	 | 21	 | 91.93%  | 91.34%  | 94.71%  | 87.54%  | 94.53%  |
      | All	 | 36	 | 95.66%  | 97.44%  | 97.98%  | 90.67%  | 96.89%  |
      
      This table contains one row for each degree (2, 3, 4) and one ("All") row for
      the overall coverage. It means that there are 11 blocks with degree = 2, i.e.
      11 * 2 instances, and they cover 3.82% of all four genomes, 6.59% of Seq 1 and
      2.41% of Seq 2. And also there are 21 synteny blocks with degree = 4, i.e.
      4 * 21 instances and they cover 91.93% of all genomes. All the blocks cover
      95.66% of all the input sequences.
      
      Sequences file
      --------------
      File name = "blocks_sequences.fasta". By default this file is not written. To
      output this file, set cmd parameter:
      
      	-q or --sequencesfile
      
      This FASTA file contains sequences of instances of the synteny block. Each
      sequence has header in following format:
      
      	>Seq="<header>",Strand=<sign>,Block_id=<block id>,Start=<x>,End=<y>
      
      Where "<header>" is a header of the FASTA sequence where the block instance
      is located. Other fields are described in section "Coordinates file".
      Sequences of synteny blocks are also written in SAM format in the file
      "blocks_sequences.sam".
      
      "d3" visualization
      ------------------
      File name = "d3_blocks_diagram.html".
      
      "Sibelia" generates an interactive html diagram that shows found synteny blocks.
      Coordinates follow the same convention as described in section "Coordinates 
      file".
      
      "Circos" visualization
      ----------------------
      You can visualize synteny blocks with a colorful circular diagram by using
      the "Circos" software [3]. Files for generating such diagram are written in
      directory "circos" inside the output directory. To generate "Circos" diagram
      do following:
      
      1. Download and install "Circos" software
      2. Run "Sibelia"
      3. Run "Circos" in "circos" directory
      
      For example of such diagrams (generated from "Helicobacter_pylori.fasta),
      see "examples/Sibelia/Helicobacter_pylori/circos/circos.png". Also note that
      such diagrams can become very piled with larger number of genomes. To overcome
      this, plot only big blocks, see section "Minimum block size". Blocks located
      on the positive strand are colored green, while blocks from negative strand
      are red.
      
      By default, "Sibelia" plots only blocks obtained after the last stage.
      You can also view blocks at the intermediate stages by using switch:
      
      	-v or --visualize
      
      On the resulting diagram the outermost circle shows blocks obtained at the
      first stage, then the second stage and so on. Please note that this option
      slows down the computation.
      
      Resulting de Bruijn graph
      -------------------------
      File name = "de_bruijn_graph.dot". By default this file is not written. To
      output this file, set cmd parameter:
      
      	-g or --graphfile
      	
      If you are a curious person, you can also view condensed de Bruijn graph that
      is used for generating synteny blocks. To understand the graph, see [1].
      Condensed means that only bifurcations in the graph are plotted and 
      non-branching paths are collapsed into a single edge. Blue edges are generated
      from positive strand and red edges are from negative strand respectively. Note
      that this graph is generated for K = min(Kn, MinimumBlockSize) or for value of
      "--lastk" cmd parameter if it is set, where Kn is the value of K used for the
      last stage (see section "Fine tuning").
      
      If one is interested in graph output only, he or she can use the "--noblocks"
      option:
      
      	--noblocks
      
      In this case Sibelia doesn't compute the synteny blocks and doesn't ouput them,
      but can output the graph. For example, to get only non-modified compressed de
      Bruijn grahp for k=25, one can use the following command line:
      
      	Sibelia -k run.stage --noblocks -g -m 25 <input_file>
      
      Where "run.stage" contains single number "0".
      
      Fine tuning
      ===========
      Here we will describe parameters that can affect computation results.
      
      Minimum block size
      ------------------
      Default value = 5000. To change this value, set cmd parameter:
      
      	-m <integer> or --minblocksize <integer>
      
      If you are interested only in big synteny blocks, like > 100 000 BP, set
      this parameter to an appropriate value.
      
      Outputting only shared blocks
      -----------------------------
      Default = not set. Add flag to cmd parameters to set:
      
      	-a or --sharedonly
      
      Output only blocks that occur exactly once in each input sequence. This option
      assumes that all input genomes contain single chromosome.
      
      Postprocessing
      --------------
      By default, synteny blocks are postprocessed after computation by gluing
      "stripes" consisting of the same synteny blocks. For example, if each 
      occurence of synteny block 1 is followed by synteny block 2 and vice a versa,
      their directions are consistent, then they are "glued" together to form a
      single synteny block. Postprocessing could be turned off by specifying flag:
      
      	--nopostprocess
      
      Output blocks from all stages
      -----------------------------
      "Sibelia" performs computations in multiple stages. Every stage produces it's
      own synteny blocks. You can get blocks from all stages by specifying flag:
      
      	--allstages
      
      Files "blocks_coordsN.(txt|gff)" will contain their coordinates, where N is the
      number of stage. Zero corresponds to blocks obtained without any simplification.
      
      Boundaries correction
      ---------------------
      Algorithm of "Sibelia" depends on presence of solid k-mers within syntenic
      regions. If such regions contain variations close to their borders, they will
      be truncated. In case of two genomes, some synteny blocks (of multiplicity two)
      could be corrected using local alinment algorithms. Use flag:
      
      	--correctboundaries
      
      This flag is used by "C-Sibelia".
      
      Parameters set
      --------------
      Default value = not set. To select the parameters set, use cmd parameter:
      
      	-s <loose|fine|far> or --parameters <loose|fine>
      
      This option is incompatible with "-k", you must specify one of these, not both.
      Approach used in "Sibelia" is parameter dependent. To understand the details,
      please see the next section and [1]. The "loose" option produces longer blocks
      and better coverage, while "fine" can capture small-scale rearrangements, for
      example, inversions of size < 15 000 BP. The "far" set is for distant genomes.
      
      Custom parameters set
      ---------------------
      Default value = not set. To specify the file that contains custom parameters
      set, use cmd parameter:
      
      	-k <file name> or --stagefile <file name>
      
      The algorithm consists of several stages of computations. Each stage has two 
      parameters, K and D. Let's call K-mer a substring of length K. At each stage
      "Sibelia" constructs so called de Bruijn graph, graph of K-mers that occur
      in the genome, and simplifies it by removing special type of undirected cycles
      called "bulges", see [1] for more details.
      
      Graph is a good model for describing the algorithm, but to understand "physical
      meaning" of the parameters it is useful to consider operations that are 
      actually performed with the genome behind the graph model. Suppose that 
      somewhere in the genome exist two pairs of K-mers K1 and K2:
      
      1st pair: ... K1 ABCD K2 ...  
      2nd pair: ... K1 FGHE K2 ...  
      
      If the distance between K1 and K2 within each pair is less than D, then "Sibelia"
      replaces FGHE with ABCD to obtain longer "synteny block":
      
      1st pair: ... K1 ABCD K2 ...  
      2nd pair: ... K1 ABCD K2 ...  
      
      More concrete example. Suppose that K = 3, D = 5 and somewhere in the genome we
      find:
      
      1st pair: ... act gaga ggc ...  
      2nd pair: ... act gatg ggc ...  
      
      As we see, distance between "act" and "ggc" is less than 5 nucleotides so we
      replace "gatg" by "gaga":
      
      1st pair: ... act gaga ggc ...  
      2nd pair: ... act gaga ggc ...  
      
      "Sibelia" keeps track of all changes so it is able to locate original locations
      of the synteny blocks obtained by the simplification. This process continues 
      step by step, we start with small values of K to obtain longer K-mers shared
      between synteny regions and then increase K and D. The "loose" parameters set
      has 4 stages:
      
      | K        | D         |
      | :------- | --------: |
      | 30       | 150       |
      | 100      | 1000      |
      | 1000     | 5000      |
      | 5000     | 15000     |
      
      The "fine" set consists of 3 stages and it's final values are less:
      
      | K        | D         |
      | :------- | --------: |
      | 30       | 150       |
      | 100      | 500       |
      | 500      | 1500      |
      
      The "far" set is for distant genomes:
      
      | K        | D         |
      | :------- | --------: |
      | 15       | 120       |
      | 100      | 500       |
      | 500      | 1500      |
      
      As you can see, "loose" set is more aggressive -- at it's final stage it glues
      together 5000-mers that are separated from each other by at most 15000 symbols.
      Although this description is very simplified and lacks many important technical
      details, it is enough to infer your own parameter set. Stage file that you may
      use to specify your own parameters has following simple format:
      
      M  
      K1 D1  
      K2 D2  
      ...  
      KM KM  
      
      Where M is the number of stages. So, running with the stage file:
      
      4  
      30 150  
      100 1000  
      1000 5000  
      5000 15000  
      
      Is equivalent to running with the -s "loose" cmd option. As you may notice, the
      algorithm relies on exact K-mers shared between the genomes. If input genomes
      doesn't have such shared substrings, then "Sibelia" won't be able to locate the
      synteny blocks.
      
      If you cannot find synteny blocks with the default parameters, try to start
      with smaller values of K (~20), increase D values or vary number of stages.
      
      Last value of K
      ---------------
      In "Sibelia" de Bruijn graph is constructed (N + 1) times, where N is the
      number of simplification stages. De Bruijn graph which is constructed last is
      used for inferring synteny blocks. Value of K for this grahp is determined by
      min(Kn, MinimumBlockSize)  where Kn is the value of K used for the last stage.
      You can override this value by setting the "--lastk" parameter:
      
         --lastk <integer > 1>
      
      
      Maximum number of iterations
      ----------------------------
      Default value = 4. Tho change this value, set cmd parameter:
      
      	-i <integer> or --maxiterations <integer>
      
      Maximum number of iterations during a stage of simplification. Increasing
      this parameter may slightly improve coverage.
      
      References
      ==========
      1. Ilya Minkin, Nikolay Vyahhi, Son Pham. "SyntenyFinder: A Synteny Blocks 
      Generation and Genome Comparison Tool" (poster), WABI 2012
      http://bioinf.spbau.ru/sites/default/files/SyntenyFinder.pdf
      2. Max A. Alekseyev and Pavel A. Pevzner. "Breakpoint graphs and ancestral
      genome reconstructions", Genome Res. 2009. 19: 943-957.
      3. Circos. http://circos.ca
      4. D3. http://d3js.org/
      5. Helicobacter pylori. http://www.ncbi.nlm.nih.gov/genome/169
      6. Staphylococcus aureus. http://www.ncbi.nlm.nih.gov/genome/154
  • DBT-Junior Research Fellowship (DBT-JRF) in Biotechnology (2020)
  • mixtureS that can de novo identify bacterial strains from shotgun reads of a clonal or metagenomic sample, without prior knowledge about the strains and their variations. Tested on 243 simulated datasets and 195 experimental datasets, mixtureS...
  • biogeek posted a new ad in the ResearchLabs PLANT COMPUTATIONAL GENOMICS LAB – JILL WEGRZYN 1597 days ago
  • Methods in Molecular Biology https://www.springer.com/series/7651 Accept bioinformatics tools, such as ART etc
  • BioStar created a new bio-script Bash script to handle Multifasta files 1601 days ago
  • SqueezeMeta is a full automatic pipeline for metagenomics/metatranscriptomics, covering all steps of the analysis. SqueezeMeta includes multi-metagenome support allowing the co-assembly of related metagenomes and the retrieval of individual genomes...
  • SHAMAN is a shiny application for differential analysis of metagenomic data (16S, 18S, 23S, 28S, ITS and WGS) including bioinformatics treatment of raw reads for targeted metagenomics, statistical analysis and results visualization with a large...
  • BioStar posted to the wire 1605 days ago
    Useful Gene Ontology dataset at NCBI ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ #Data #NCBI #Genes
  • ClueGO is a Cytoscape plug-in that visualizes the non-redundant biological terms for large clusters of genes in a functionally grouped network and it can be used in combination with GOlorize. The identifiers can be uploaded from a text file or...
  • proActiv is an R package that estimates promoter activity from RNA-Seq data. proActiv uses aligned reads and genome annotations as input, and provides absolute and relative promoter activity as output. The package can be used to identify active...
  • PyParanoid is a pipeline for rapid identification of homologous gene families in a set of genomes - a central task of any comparative genomics analysis. The "gold standard" for identifying homologs is to use reciprocal best hits (RBHs) which depends...
  • kallisto can process either single-end or paired-end reads. The default running mode is paired-end and requires an even number of FASTQ files represented as pairs, e.g. kallisto quant -i index -o output pairA_1.fastq pairA_2.fastq...
  • PANNZER (Protein ANNotation with Z-scoRE) is a fully automated service for functional annotation of prokaryotic and eukaryotic proteins of unknown function. PANNZER (Protein ANNotation with Z-scoRE) is a fully automated service for...
  • SEASTAR (Systematic Evaluation of Alternative STArt site in RNA) is a software package for Transcription Start Site (TSS) identification and quantification using only RNA-seq data. It assembles novel TSSs based only on RNA-Seq data and merges them...
  • Rahul Agarwal posted to the wire 1606 days ago