tfsites.FindTfSitesAlteredBySequenceVariation

tfsites.AnnotateAndVisualizeInSilicoSnvs v1

Author(s): Joe Solvason

Contact: Joe Solvason (solvason@eng.ucsd.edu)

Adapted as a GenePattern Module by: Ted Liefeld (jliefeld@cloud.ucsd.edu)

Task Type: Transcription factor analysis

LSID: urn:lsid:genepattern.org:module.analysis:00443

Introduction

AnnotateAndVisualizeInSilicoSnvs reports the effects of all possible in silico single-nucleotide variants (SNVs) in a given sequence. Possible SNV effects include increasing (or optimizing) the affinity/score of a binding site, decreasing (or sub-optimizing) the affinity/score of a binding site, deleting a binding site, or creating a binding site.

The in silico SNV analysis is performed on one transcription factor, but the binding sites of multiple different transcription factors can be displayed on the plot. Each binding site is labeled with the TF name and a unique binding site ID. If a relative affinity/score dataset is provided for a transcription factor, each of its binding sites is also labeled with its affinity/score, and the intensity of the binding site’s color is proportional to that affinity/score.

Methodology

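For every nucleotide in the sequence, all possible SNVs are made. For each SNV, we determine its effect, if any, on the binding sites in the sequence. Each SNV can have one of four effects on a binding site: increasing (optimizing) its affinity/score, decreasing (sub-optimizing) its affinity/score, deleting the binding site, or creating a new binding site.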

If an optimization threshold is provided by the user, then we report only the binding sites that have an increased affinity/score with a fold change greater than or equal to the threshold. Similarly, if a sub-optimization threshold is provided, then we report only the binding sites that have a decreased affinity/score with a fold change less than or equal to the threshold.
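The enumeration and thresholding described above can be sketched as follows. This is a minimal illustration and not the module's code: the function names and effect labels are hypothetical, a missing site is represented by `None`, and fold change is assumed to be the alternate affinity divided by the reference affinity, which is the direction implied by the threshold rules.

```python
from typing import Iterator, Optional, Tuple

NUCLEOTIDES = "ACGT"


def all_snvs(sequence: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (0-indexed position, reference nt, alternate nt) for every possible SNV."""
    for position, ref in enumerate(sequence.upper()):
        for alt in NUCLEOTIDES:
            if alt != ref:
                yield position, ref, alt


def classify_snv(
    ref_affinity: Optional[float],
    alt_affinity: Optional[float],
    opt_threshold: Optional[float] = None,
    subopt_threshold: Optional[float] = None,
) -> Optional[str]:
    """Return a hypothetical effect label, or None if nothing is reported.

    None for an affinity means the k-mer is not a binding site.
    Reference affinities are assumed to be positive.
    """
    if ref_affinity is None and alt_affinity is None:
        return None  # no site before or after the SNV
    if ref_affinity is None:
        return "created"
    if alt_affinity is None:
        return "deleted"

    fold_change = alt_affinity / ref_affinity  # assumed direction: alternate / reference
    if fold_change > 1:
        # Report only increases that reach the optimization threshold, if one is set.
        if opt_threshold is None or fold_change >= opt_threshold:
            return "increased"
    elif fold_change < 1:
        # Report only decreases at or below the sub-optimization threshold, if one is set.
        if subopt_threshold is None or fold_change <= subopt_threshold:
            return "decreased"
    return None
```

A sequence of length n yields 3n SNVs, so a 4-nucleotide sequence produces 12 candidates to classify.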

From the list of all identified SNV effects, the module generates an image of the sequence that contains a table of all possible alternate nucleotides. Each cell in the table is colored according to the mutation type of the SNV. If the SNV has no effect, its background is grey. If a SNV has multiple effects, its background is white.
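The cell-coloring rule can be expressed as a small lookup. In this sketch only the grey and white cases come from the description above. The palette for single effects and the effect labels are hypothetical, and "multiple effects" is assumed to mean more than one distinct effect type.

```python
from typing import Dict, List

# Hypothetical palette: the colors the module actually uses for each
# mutation type are not stated in this document.
EFFECT_COLORS: Dict[str, str] = {
    "increased": "green",
    "decreased": "orange",
    "created": "blue",
    "deleted": "red",
}


def cell_background(effects: List[str], palette: Dict[str, str] = EFFECT_COLORS) -> str:
    """Pick the background color of one alternate-nucleotide cell."""
    distinct = set(effects)
    if not distinct:
        return "grey"   # the SNV has no effect
    if len(distinct) > 1:
        return "white"  # the SNV has multiple effects
    return palette[next(iter(distinct))]
```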

To find and plot all putative binding sites, we iterate across every k-mer in the DNA sequence and identify those that conform to the binding site definition for each transcription factor. The user can also choose to plot all de novo binding sites created by SNVs, in addition to existing putative binding sites.
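A minimal sketch of the k-mer scan, assuming the binding site definition is an IUPAC string such as `CANNTG` and that a site on the minus strand is one that matches the reverse complement of the definition. The helper names are illustrative and are not taken from the module.

```python
import re
from typing import Iterator, Tuple

# Standard IUPAC nucleotide codes expressed as regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(definition: str) -> str:
    """Reverse-complement a DNA or IUPAC string."""
    return definition.translate(COMPLEMENT)[::-1]


def iupac_regex(definition: str) -> "re.Pattern[str]":
    """Compile an IUPAC definition into a regular expression."""
    return re.compile("".join(IUPAC[code] for code in definition.upper()))


def find_sites(sequence: str, definition: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (0-indexed start, k-mer, direction) for every k-mer matching the definition."""
    k = len(definition)
    forward = iupac_regex(definition)
    reverse = iupac_regex(reverse_complement(definition.upper()))
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if forward.fullmatch(kmer):
            yield start, kmer, "+"
        elif reverse.fullmatch(kmer):
            yield start, kmer, "-"
```

For example, `list(find_sites("AACATATGTT", "CANNTG"))` finds the single site `CATATG` starting at position 2. De novo sites could be found the same way by rescanning the k-mers that overlap each SNV after the substitution is applied.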

The image can be output in one of two ways: (1) zoomed in on a portion of the sequence or (2) with the entire sequence separated into windows. If the sequence is longer than 500 nucleotides, it is automatically separated into windows, and each window is written to a separate file. The maximum size of each window is 500 nucleotides.
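One way to picture the windowing is with consecutive, non-overlapping windows of at most 500 nucleotides. This is an assumption made for illustration: the module may choose window boundaries differently, and `split_into_windows` is a hypothetical helper.

```python
from typing import List, Tuple


def split_into_windows(seq_length: int, max_window: int = 500) -> List[Tuple[int, int]]:
    """Return 0-indexed, end-exclusive (start, end) windows covering the whole sequence."""
    return [
        (start, min(start + max_window, seq_length))
        for start in range(0, seq_length, max_window)
    ]
```

Under this scheme a 1,200-nucleotide sequence gives three windows, (0, 500), (500, 1000), and (1000, 1200), while a sequence of exactly 500 nucleotides stays in a single window.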

Parameters

* indicates required parameter

Inputs and Outputs

Other Parameters

Plotting Parameters

Input Files

  1. DNA sequence(s) to annotate (.tsv)
    • Columns:
      • Sequence Name: name of the DNA sequence
      • Sequence: the sequence
Sequence Name     Sequence
ZRS               AACTTTAATGCCTATGTTTGATTTGAAGTCATAGCATAAAAGGTAACATAAGCAACATCCTGACCAATTATCCAAACCATCCAGACATCCCTGAATGGC...
Hand2             CACCACTGGGTGATCCATAGTATGGAATATTTTTATGAGAAACAGCCACATAACATGTACCTGTTAATGTAGGCTTTGTGTTTATTTGCAATAGCAGAG...
  2. PBM or PFM reference data for SNV analysis (.tsv)
    • Columns:
      • PBM Kmer: the sequence of every possible k-mer
      • PBM Relative Affinity: the relative affinity of each k-mer normalized to the k-mer with the highest MFI
PBM Kmer     PBM Relative Affinity
AAAAAAAA     0.15
AAAAAAAC     0.11
AAAAAAAG     0.13
AAAAAAAT     0.13
AAAAAACA     0.12
  3. TF information (.tsv)
    • Columns:
      • TF Name: name of the transcription factor
      • Binding Site Definition: minimal IUPAC binding site definition for transcription factor
      • Color: binding site color on the output visualization
      • PBM Reference Data: relative affinity data obtained from DefineTfSites.from.PBM (optional)
      • PFM Reference Data: relative score data obtained from DefineTfSites.from.PFM (optional)
TF Name     Binding Site Definition     Color     PBM Reference Data     PFM Reference Data
ETS         NNGGAWNN                    blue      input_ets-pbm.tsv
HOX         NYNNTNAA                    gold      input_hox-pbm.tsv
HAND        CANNTG                      pink
  4. all TF reference data (.tsv)
    • Columns:
      • PBM Kmer: the sequence of every possible k-mer
      • PBM Relative Affinity: the relative affinity of each k-mer normalized to the k-mer with the highest MFI
PBM Kmer     PBM Relative Affinity
AAAAAAAA     0.55
AAAAAAAC     0.56
AAAAAAAG     0.54
AAAAAAAT     0.54
AAAAAACA     0.56
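
The input tables above can be read with Python's standard `csv` module. The sketch below assumes tab-delimited files with a single header row and the column names listed above. The loader names are hypothetical and not part of the module.

```python
import csv
import io
from typing import Dict, List, Optional, TextIO


def load_reference(handle: TextIO) -> Dict[str, float]:
    """Map each k-mer to its relative affinity (or score) from a two-column table."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    return {row[0].strip(): float(row[1]) for row in reader if row}


def load_tf_info(handle: TextIO) -> List[Dict[str, Optional[str]]]:
    """Read the TF information table; blank optional columns become None."""
    tfs = []
    for row in csv.DictReader(handle, delimiter="\t"):
        tfs.append({
            "name": row["TF Name"].strip(),
            "definition": row["Binding Site Definition"].strip(),
            "color": row["Color"].strip(),
            "pbm": (row.get("PBM Reference Data") or "").strip() or None,
            "pfm": (row.get("PFM Reference Data") or "").strip() or None,
        })
    return tfs


# In-memory example using two rows from the PBM table shown above.
example = io.StringIO("PBM Kmer\tPBM Relative Affinity\nAAAAAAAA\t0.15\nAAAAAAAC\t0.11\n")
affinities = load_reference(example)
```

Reading the reference table by column position rather than by header name lets the same loader handle PFM score tables, whose headers may differ.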

Output Files

  1. SNV effects output table (.tsv)
    • Note: if PFM reference data is provided instead of PBM reference data, then the columns Reference Affinity and Alternate Affinity will instead be labeled Reference Score and Alternate Score
    • Columns:
      • Sequence Name: name of the sequence being analyzed
      • Kmer ID: unique ID given to the binding site
      • Start Position (0-indexed): position at which the binding site starts
      • Position (0-indexed): position of the SNV
      • Reference Nucleotide: the nucleotide in the original sequence at the SNV position
      • Alternate Nucleotide: the nucleotide introduced by the SNV
      • Reference Kmer: the k-mer in the original sequence
      • Alternate Kmer: the k-mer after the SNV is applied
      • Site Direction: direction of the binding site (+ if it follows the given IUPAC definition, - if it follows the reverse complement of the IUPAC definition)
      • Reference Affinity: the affinity of the reference binding site
      • Alternate Affinity: the affinity of the alternate binding site
      • Fold Change: the ratio of Alternate Affinity to Reference Affinity
      • SNV Effect: the type of SNV effect
  2. annotated sequence image(s) (.png)
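
Because both position columns are 0-indexed, the offset of an SNV inside its k-mer is simply Position minus Start Position. The sketch below uses that relationship to rebuild an alternate k-mer from a reference k-mer. It assumes k-mers are reported in the orientation of the input sequence, which this document does not confirm, and `apply_snv_to_kmer` is a hypothetical helper.

```python
def apply_snv_to_kmer(reference_kmer: str, site_start: int, snv_position: int, alt_nt: str) -> str:
    """Substitute alt_nt into reference_kmer at the SNV's offset within the site."""
    offset = snv_position - site_start  # both coordinates are 0-indexed
    if not 0 <= offset < len(reference_kmer):
        raise ValueError("SNV position lies outside the binding site")
    return reference_kmer[:offset] + alt_nt + reference_kmer[offset + 1:]
```

For a site `CATATG` that starts at position 2, an SNV to `C` at position 4 falls at offset 2 and gives `CACATG`.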

Example Data

Example input data is available on GitHub.

Version Comments