Author(s): Joe Solvason
Contact: Joe Solvason (solvason@eng.ucsd.edu)
Adapted as a GenePattern Module by: Ted Liefeld (jliefeld@cloud.ucsd.edu)
Task Type: Transcription factor analysis
LSID: urn:lsid:genepattern.org:module.analysis:00442
AnnotateAndVisualizeTfSites annotates transcription factor binding sites across a DNA sequence. Multiple transcription factors can be analyzed. Each binding site is labeled with the TF name and a unique binding site ID. If a relative affinity/score dataset is provided for a transcription factor, the affinity/score of this site will be labeled and the intensity of the binding site’s color will be proportional to the affinity/score.
We iterate across every k-mer in the DNA sequence and identify those that conform to the binding site definition for each transcription factor. For each binding site, we report its sequence, position, TF name, affinity/score (if PBM/PFM data is given), direction (“+” if it follows the given binding site definition and “-” if it follows the reverse complement of the binding site definition), and a unique ID.
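The scan described above can be sketched in Python as follows. This is a minimal illustration, not the module's actual source: the function names (`find_sites`, `matches`, `reverse_complement`), the dictionary layout, and the `<sequence name>:<n>` ID numbering are assumptions modeled on the example output. A k-mer that matches in both orientations is reported as "+" here, which is also an assumption.

```python
# IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def matches(kmer, definition):
    """True if every base of kmer is allowed by the IUPAC definition."""
    return len(kmer) == len(definition) and all(
        base in IUPAC[code] for base, code in zip(kmer, definition)
    )


def find_sites(seq_name, seq, tf_name, definition):
    """Slide a window of len(definition) across seq and collect binding sites."""
    seq = seq.upper()
    k = len(definition)
    sites = []
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        if matches(kmer, definition):
            direction = "+"   # follows the definition as given
        elif matches(reverse_complement(kmer), definition):
            direction = "-"   # follows the reverse complement of the definition
        else:
            continue
        sites.append({
            "id": f"{seq_name}:{len(sites) + 1}",  # hypothetical ID scheme
            "tf": tf_name,
            "kmer": kmer,
            "position": pos,                       # 0-indexed start
            "direction": direction,
        })
    return sites


# First 39 bases of the Hand2 example sequence, scanned for ETS sites.
hand2_prefix = "CACCACTGGGTGATCCATAGTATGGAATATTTTTATGAG"
sites = find_sites("Hand2", hand2_prefix, "ETS", "NNGGAWNN")
for site in sites:
    print(site["id"], site["kmer"], site["position"], site["direction"])
```

On this prefix the sketch finds TGATCCAT at position 10 ("-") and ATGGAATA at position 21 ("+"), which agrees with the first two rows of the example output table.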
Using the binding sites identified in the DNA sequence, an image of the DNA sequence and all annotated binding sites is generated. Each binding site is plotted as a polygon that points in the direction of the site (right for positive, left for negative, and straight for a palindromic sequence).
The image can be output in one of two ways: (1) zoom into a portion of the sequence or (2) separate the entire sequence into windows. If the sequence is longer than 500 nucleotides, it will automatically be separated into windows, each written to a separate file. The maximum size for each window is 500 nucleotides.
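The windowing rule can be illustrated with a short Python sketch. The function name `split_into_windows` and the 0-indexed, end-exclusive coordinate convention are assumptions made for this example; only the 500-nucleotide maximum comes from the description above.

```python
def split_into_windows(seq_length, window_size=500):
    """Return (start, end) pairs covering the sequence, end-exclusive.

    Every window holds at most window_size bases; the last one may be shorter.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [
        (start, min(start + window_size, seq_length))
        for start in range(0, seq_length, window_size)
    ]


# A 1200-nucleotide sequence needs three images; a 300-nucleotide one needs one.
long_windows = split_into_windows(1200)
short_windows = split_into_windows(300)
print(long_windows)
print(short_windows)
```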
* indicates required parameter
Zoom indicates the region of the DNA sequence to visualize, given a start and end coordinate, which can be specified using the zoom range option below. Windows will output the entire DNA sequence into separate images. The size of the window, or the number of bases plotted per window, can be specified by the window size option below. Default = None
output image format = Windows. Default = 500
output image format = Zoom. Default = None
Default = 200
Sequence Name: name of the DNA sequence
Sequence: the sequence
Sequence Name  Sequence
ZRS            AACTTTAATGCCTATGTTTGATTTGAAGTCATAGCATAAAAGGTAACATAAGCAACATCCTGACCAATTATCCAAACCATCCAGACATCCCTGAATGGC...
Hand2          CACCACTGGGTGATCCATAGTATGGAATATTTTTATGAGAAACAGCCACATAACATGTACCTGTTAATGTAGGCTTTGTGTTTATTTGCAATAGCAGAG...
TF Name: name of the transcription factor
Binding Site Definition: minimal IUPAC binding site definition for the transcription factor
Color: binding site color on the output visualization
PBM Reference Data: relative affinity data obtained from DefineTfSites.from.PBM (optional)
PFM Reference Data: relative score data obtained from DefineTfSites.from.PFM (optional)
TF Name  Binding Site Definition  Color  PBM Reference Data  PFM Reference Data
ETS      NNGGAWNN                 blue   input_ets-pbm.tsv
HOX      NYNNTNAA                 gold   input_hox-pbm.tsv
HAND     CANNTG                   pink
PBM Kmer: the sequence of every possible k-mer
PBM Relative Affinity: the relative affinity of each k-mer normalized to the k-mer with the highest MFI
ETS
PBM Kmer  PBM Relative Affinity
AAAAAAAA  0.15
AAAAAAAC  0.11
AAAAAAAG  0.13
AAAAAAAT  0.13
AAAAAACA  0.12
HOX
PBM Kmer  PBM Relative Affinity
AAAAAAAA  0.55
AAAAAAAC  0.56
AAAAAAAG  0.54
AAAAAAAT  0.54
AAAAAACA  0.56
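As an illustration of how such a reference table might be used, the following Python sketch loads a two-column, tab-separated PBM table into a dictionary and looks up the relative affinity of a k-mer. The tab delimiter, the single header row, and the names `load_pbm` and `lookup_affinity` are assumptions for this example. How the module itself handles k-mers that are missing from the table, or sites on the "-" strand, is not stated in this document, so the sketch simply returns `None` for an unknown k-mer.

```python
import csv
import io


def load_pbm(handle):
    """Read a tab-separated PBM table (k-mer, relative affinity) into a dict."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row (assumed to be present)
    table = {}
    for row in reader:
        if len(row) < 2:
            continue    # ignore blank or incomplete lines
        table[row[0].upper()] = float(row[1])
    return table


def lookup_affinity(table, kmer):
    """Return the relative affinity for kmer, or None if it is not listed."""
    return table.get(kmer.upper())


# The first rows of the ETS example table, written as tab-separated text.
ets_text = (
    "PBM Kmer\tPBM Relative Affinity\n"
    "AAAAAAAA\t0.15\n"
    "AAAAAAAC\t0.11\n"
    "AAAAAAAG\t0.13\n"
)
ets_table = load_pbm(io.StringIO(ets_text))
print(lookup_affinity(ets_table, "AAAAAAAC"))
print(lookup_affinity(ets_table, "CCCCCCCC"))
```

With a real file, `load_pbm(open(path, newline=""))` would take the place of the `io.StringIO` wrapper.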
Sequence Name: name of the sequence
Kmer ID: unique ID associated with each k-mer
Kmer: sequence of the k-mer
Position (0-indexed): starting position of the k-mer, where counting begins at zero
PBM Affinity: relative affinity of the k-mer (optional)
PFM Score: relative score of the k-mer (optional)
Site Direction: direction of the binding site (+ if it follows the given IUPAC or - if it follows the reverse complement of the IUPAC)
Duplicate Kmer IDs: list of k-mer IDs for k-mers that have the same sequence
Sequence Name  Kmer ID  Kmer      Position (0-indexed)  PBM Affinity  PFM Score  Site Direction  Duplicate Kmer IDs
Hand2          Hand2:1  TGATCCAT  10                    0.11                     -
Hand2          Hand2:2  ATGGAATA  21                    0.10                     +               Hand2:2,Hand2:22
Hand2          Hand2:3  GAGGAACT  113                   0.13                     +
Hand2          Hand2:4  ATGGATTC  154                   0.10                     +
Hand2          Hand2:5  TGATCCTA  343                   0.10                     -
Hand2          Hand2:6  AATTCCAT  493                   0.11                     -               Hand2:6,Hand2:17
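The Duplicate Kmer IDs column can be derived by grouping sites on their k-mer sequence. The Python sketch below shows one way to do that. The function name `annotate_duplicates` and the dictionary keys are hypothetical; listing a site's own ID alongside its duplicates follows the pattern seen in the example rows.

```python
from collections import defaultdict


def annotate_duplicates(sites):
    """Add a 'duplicates' field listing the IDs of all sites sharing a k-mer.

    Sites whose k-mer sequence occurs only once get an empty string.
    """
    ids_by_kmer = defaultdict(list)
    for site in sites:
        ids_by_kmer[site["kmer"]].append(site["id"])

    annotated = []
    for site in sites:
        shared = ids_by_kmer[site["kmer"]]
        duplicates = ",".join(shared) if len(shared) > 1 else ""
        annotated.append({**site, "duplicates": duplicates})
    return annotated


# Hand2:22 is not shown in the example table; its k-mer is inferred from
# being listed as a duplicate of Hand2:2.
example_sites = [
    {"id": "Hand2:1", "kmer": "TGATCCAT"},
    {"id": "Hand2:2", "kmer": "ATGGAATA"},
    {"id": "Hand2:22", "kmer": "ATGGAATA"},
]
for site in annotate_duplicates(example_sites):
    print(site["id"], site["kmer"], site["duplicates"])
```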
ZRS sequence:
Hand2 sequence:
Example input data is available on GitHub.