tfsites.VisualizeTFSitesOnSequences

tfsites.AnnotateAndVisualizeTFSites v1

Author(s): Joe Solvason

Contact: Joe Solvason (solvason@eng.ucsd.edu)

Adapted as a GenePattern Module by: Ted Liefeld (jliefeld@cloud.ucsd.edu)

Task Type: Transcription factor analysis

LSID: urn:lsid:genepattern.org:module.analysis:00442

Introduction

AnnotateAndVisualizeTFSites annotates transcription factor (TF) binding sites across a DNA sequence. Multiple transcription factors can be analyzed at once. Each binding site is labeled with the TF name and a unique binding site ID. If a relative affinity/score dataset is provided for a transcription factor, each of its sites is also labeled with its affinity/score, and the intensity of the binding site’s color is proportional to that value.
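One way to make color intensity proportional to affinity is to blend the TF's base color with white. The sketch below is an illustration only: the function name `shade` and the use of 8-bit RGB tuples are assumptions, and the module may compute its colors differently.

```python
def shade(base_rgb, affinity):
    """Blend white toward base_rgb in proportion to affinity (0.0 to 1.0).

    Hypothetical helper: an affinity of 1.0 gives the full base color,
    and an affinity of 0.0 gives white.
    """
    a = max(0.0, min(1.0, affinity))  # clamp to the valid range
    return tuple(round(255 - (255 - channel) * a) for channel in base_rgb)


BLUE = (0, 0, 255)
strong_site = shade(BLUE, 1.0)  # full-intensity blue
weak_site = shade(BLUE, 0.2)    # pale blue, close to white
```

Blending with white rather than changing opacity keeps overlapping sites readable, but either choice satisfies the "intensity proportional to affinity" description.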

Methodology

We iterate across every k-mer in the DNA sequence and identify those that conform to the binding site definition for each transcription factor. For each binding site, we report its sequence, position, TF name, affinity/score (if PBM/PFM data is given), direction (“+” if it follows the given binding site definition and “-” if it follows the reverse complement of the binding site definition), and a unique ID.
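The scan described above can be sketched as follows. This is a minimal illustration, not the module's source: the function names are assumptions, unique IDs and affinity lookup are omitted, and a k-mer that matches both strands (as with a palindromic definition) is simply reported as "+".

```python
# IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Complement of each IUPAC code (for example R = A/G pairs with Y = C/T).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(pattern):
    """Reverse complement of a sequence or IUPAC definition."""
    return pattern.translate(COMPLEMENT)[::-1]


def matches(kmer, pattern):
    """True if every base of kmer is allowed by the IUPAC code at that position."""
    return len(kmer) == len(pattern) and all(
        base in IUPAC[code] for base, code in zip(kmer, pattern)
    )


def find_sites(sequence, tf_name, pattern):
    """Slide a window of len(pattern) over sequence and collect matching k-mers."""
    rc_pattern = reverse_complement(pattern)
    k = len(pattern)
    sites = []
    for pos in range(len(sequence) - k + 1):
        kmer = sequence[pos:pos + k]
        if matches(kmer, pattern):
            direction = "+"          # follows the definition as given
        elif matches(kmer, rc_pattern):
            direction = "-"          # follows its reverse complement
        else:
            continue
        sites.append(
            {"tf": tf_name, "kmer": kmer, "position": pos, "direction": direction}
        )
    return sites


# First 30 nucleotides of the Hand2 example sequence, scanned for ETS sites.
hand2_prefix = "CACCACTGGGTGATCCATAGTATGGAATAT"
ets_sites = find_sites(hand2_prefix, "ETS", "NNGGAWNN")
```

On this prefix the scan finds TGATCCAT at position 10 with direction "-" and ATGGAATA at position 21 with direction "+", which agrees with the first two rows of the example output table.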

Using the binding sites identified in the DNA sequence, an image of the DNA sequence and all annotated binding sites is generated. Each binding site is plotted as a polygon that points in the direction of the site (right for positive, left for negative, and straight for a palindromic sequence).
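As a rough illustration of the polygon shapes, the sketch below returns vertex lists that a plotting library could draw. Everything here is an assumption made for the example (the function names, the `tip` and `height` parameters, and the half-open coordinate convention); the module's actual drawing code may differ.

```python
def is_palindromic(kmer):
    """True if a plain A/C/G/T k-mer equals its own reverse complement."""
    complement = str.maketrans("ACGT", "TGCA")
    return kmer == kmer.translate(complement)[::-1]


def site_polygon(start, end, direction, y=0.0, height=1.0, tip=0.5):
    """Return (x, y) vertices for a site spanning [start, end).

    direction "+" points right, "-" points left, and any other value
    (used here for palindromes) gives a plain rectangle.
    """
    top, mid = y + height, y + height / 2
    if direction == "+":
        return [(start, y), (end - tip, y), (end, mid), (end - tip, top), (start, top)]
    if direction == "-":
        return [(end, y), (start + tip, y), (start, mid), (start + tip, top), (end, top)]
    return [(start, y), (end, y), (end, top), (start, top)]


right_arrow = site_polygon(21, 29, "+")
left_arrow = site_polygon(10, 18, "-")
box = site_polygon(40, 46, "palindrome")
```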

The image can be output in one of two ways: (1) zoomed in on a portion of the sequence or (2) with the entire sequence separated into windows. If the sequence is longer than 500 nucleotides, it is automatically separated into windows that are output as separate files. The maximum size of each window is 500 nucleotides.
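The windowing rule can be sketched as below. The function name is hypothetical, and the sketch assumes non-overlapping, 0-indexed, half-open windows; how the module treats a binding site that straddles a window boundary is not stated here.

```python
def split_into_windows(seq_length, max_window=500):
    """Return (start, end) pairs covering 0..seq_length in windows of at most max_window."""
    return [
        (start, min(start + max_window, seq_length))
        for start in range(0, seq_length, max_window)
    ]


windows = split_into_windows(1200)
```

A 1200-nucleotide sequence yields three windows under these assumptions, with the last one shorter than the rest, while a sequence of 500 nucleotides or fewer yields a single window.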

Parameters

* indicates required parameter

Inputs and Outputs

Plotting Parameters

Input File(s)

  1. DNA sequence(s) to annotate (.tsv)
    • Columns:
      • Sequence Name: name of the DNA sequence
      • Sequence: the sequence
Sequence Name	    Sequence
ZRS                 AACTTTAATGCCTATGTTTGATTTGAAGTCATAGCATAAAAGGTAACATAAGCAACATCCTGACCAATTATCCAAACCATCCAGACATCCCTGAATGGC...
Hand2               CACCACTGGGTGATCCATAGTATGGAATATTTTTATGAGAAACAGCCACATAACATGTACCTGTTAATGTAGGCTTTGTGTTTATTTGCAATAGCAGAG...
  2. TF information (.tsv)
    • Columns:
      • TF Name: name of the transcription factor
      • Binding Site Definition: minimal IUPAC binding site definition for the transcription factor
      • Color: binding site color on the output visualization
      • PBM Reference Data: relative affinity data obtained from DefineTfSites.from.PBM (optional)
      • PFM Reference Data: relative score data obtained from DefineTfSites.from.PFM (optional)
TF Name     Binding Site Definition     Color     PBM Reference Data           PFM Reference Data
ETS         NNGGAWNN                    blue      input_ets-pbm.tsv    
HOX         NYNNTNAA                    gold      input_hox-pbm.tsv
HAND        CANNTG                      pink	       
  3. all TF reference data (.tsv)
    • Can provide more than one file
    • Columns for PBM Reference Data
      • PBM Kmer: the sequence of every possible k-mer
      • PBM Relative Affinity: the relative affinity of each k-mer normalized to the k-mer with the highest MFI

ETS

PBM Kmer     PBM Relative Affinity
AAAAAAAA     0.15
AAAAAAAC     0.11
AAAAAAAG     0.13
AAAAAAAT     0.13
AAAAAACA     0.12

HOX

PBM Kmer     PBM Relative Affinity
AAAAAAAA     0.55
AAAAAAAC     0.56
AAAAAAAG     0.54
AAAAAAAT     0.54
AAAAAACA     0.56
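A reference table like the ones above can be read into a lookup from k-mer to relative affinity. The sketch assumes a tab-delimited file whose header row uses exactly the column names shown; the function name `load_pbm_reference` is hypothetical and is not part of the module.

```python
import csv
import io


def load_pbm_reference(handle):
    """Read a PBM reference table and return {k-mer: relative affinity}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["PBM Kmer"]: float(row["PBM Relative Affinity"]) for row in reader}


# The first rows of the ETS example table, written as tab-separated text.
ets_example = (
    "PBM Kmer\tPBM Relative Affinity\n"
    "AAAAAAAA\t0.15\n"
    "AAAAAAAC\t0.11\n"
    "AAAAAAAG\t0.13\n"
)
ets_affinity = load_pbm_reference(io.StringIO(ets_example))
affinity = ets_affinity.get("AAAAAAAC")  # None would mean the k-mer is absent
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper.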

Output File(s)

  1. TF sites output table (.tsv)
    • Columns:
      • Sequence Name: name of the sequence
      • Kmer ID: unique ID associated with each k-mer
      • Kmer: sequence of the k-mer
      • Position (0-indexed): starting position of the k-mer, where counting begins at zero
      • PBM Affinity: relative affinity of the k-mer (optional)
      • PFM Score: relative score of the k-mer (optional)
      • Site Direction: direction of the binding site (+ if it follows the given IUPAC definition, or - if it follows the reverse complement of the IUPAC definition)
      • Duplicate Kmer IDs: list of k-mer IDs for k-mers that have the same sequence
Sequence Name    Kmer ID     Kmer          Position (0-indexed)    PBM Affinity   PFM Score    Site Direction    Duplicate Kmer IDs
Hand2            Hand2:1     TGATCCAT      10                      0.11                         -	
Hand2            Hand2:2     ATGGAATA      21                      0.10                         +                Hand2:2,Hand2:22
Hand2            Hand2:3     GAGGAACT      113                     0.13                         +	
Hand2            Hand2:4     ATGGATTC      154                     0.10                         +	
Hand2            Hand2:5     TGATCCTA      343                     0.10                         -	
Hand2            Hand2:6     AATTCCAT      493                     0.11                         -                Hand2:6,Hand2:17
  2. annotated sequence image(s) (.png)
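The Duplicate Kmer IDs column of the output table groups sites that share the same k-mer sequence. A minimal sketch of that grouping is shown below. The function name is hypothetical, and the Hand2:22 row is inferred from the Duplicate Kmer IDs value of Hand2:2 in the example table rather than shown there.

```python
from collections import defaultdict


def duplicate_ids(sites):
    """Map each k-mer ID to a comma-separated list of IDs sharing its sequence.

    sites is a list of (kmer_id, kmer) pairs. Sites whose sequence occurs
    only once get an empty string, as in the example table.
    """
    ids_by_kmer = defaultdict(list)
    for kmer_id, kmer in sites:
        ids_by_kmer[kmer].append(kmer_id)
    return {
        kmer_id: ",".join(ids_by_kmer[kmer]) if len(ids_by_kmer[kmer]) > 1 else ""
        for kmer_id, kmer in sites
    }


sites = [
    ("Hand2:1", "TGATCCAT"),
    ("Hand2:2", "ATGGAATA"),
    ("Hand2:22", "ATGGAATA"),  # inferred row, see the note above
]
duplicates = duplicate_ids(sites)
```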

Example Data

Example input data is available on GitHub.

Version Comments