Author(s): Joe Solvason, Simran Jandu
Contact: Joe Solvason (solvason@ucsd.edu)
Adapted as a GenePattern Module by: Ted Liefeld (jliefeld@cloud.ucsd.edu)
Task Type: Transcription factor analysis
LSID: urn:lsid:genepattern.org:module.analysis:00442
VisualizeTfSitesOnSequences annotates transcription factor binding sites across a DNA sequence. Multiple transcription factors can be analyzed. Each binding site is labeled with a unique binding site ID and its start and end positions. If reference data is provided for a transcription factor, the affinity/score of each site is labeled and the intensity of the binding site’s color is proportional to the affinity/score.
Transcription factor information can be given in multiple ways. It can be provided in the TF information file and/or in the batch motif input file. For batch motif data, there are several features that can be customized, including the minimum score and site color.
The two reference data types that can be provided are affinity (i.e. PBM) and score (i.e. PWM). To find predicted binding sites for affinity data, we iterate across every k-mer in the DNA sequence and identify those that conform to the core binding site definition for each transcription factor. For PWM data, a binding site definition can also be used, but it is not required; if none is provided, the minimum score defines a predicted binding site. For each binding site, we report its sequence, TF name, matrix ID (if using PWM data), start position, end position, reference data type, value (if reference data is given), direction (“+” if it follows the given binding site definition and “-” if it follows the reverse complement of the binding site definition), and a unique ID. Binding sites for TFs given in the TF information file and for TFs given in the PWM data file are written to two separate output files.
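The k-mer scan against a core site definition can be sketched as follows. This is a minimal illustration and not the module's actual code: the function names are hypothetical, and reporting the forward-strand k-mer with start and end swapped for "-" sites is an inference from the example output rows later in this document. A k-mer that matches the definition in both orientations is given no direction.

```python
# Hypothetical sketch of scanning a sequence for an IUPAC core site.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def matches(kmer, core):
    """True if every base of kmer is allowed by the IUPAC code at that position."""
    return len(kmer) == len(core) and all(
        base in IUPAC[code] for base, code in zip(kmer, core)
    )


def find_sites(sequence, core):
    """Return (kmer, start, end, direction) tuples with 1-indexed positions."""
    k = len(core)
    sites = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        fwd = matches(kmer, core)
        rev = matches(reverse_complement(kmer), core)
        if fwd and rev:
            sites.append((kmer, i + 1, i + k, ""))   # both orientations: no direction
        elif fwd:
            sites.append((kmer, i + 1, i + k, "+"))
        elif rev:
            sites.append((kmer, i + k, i + 1, "-"))  # assumed: start/end swapped
    return sites


print(find_sites("TTCAGGATAGTT", "NNGGAWNN"))  # [('CAGGATAG', 3, 10, '+')]
print(find_sites("TTCTATCCTGTT", "NNGGAWNN"))  # [('CTATCCTG', 10, 3, '-')]
```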
Using the list of binding sites predicted in the DNA sequence, an image of the DNA sequence and all annotated binding sites is generated. Each binding site is plotted as a polygon that points in the direction of the site (right for positive, left for negative, and no direction for a palindrome sequence).
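One way to turn a site's direction into a polygon outline is sketched below. The geometry is an assumption made for illustration (nucleotide n occupies the x-interval from n-1 to n, and the arrow tip is one nucleotide wide); `site_polygon` is a hypothetical name, and the module's actual drawing code may differ.

```python
def site_polygon(start, end, direction, height=1.0, tip=1.0):
    """Return (x, y) vertices for a binding site glyph.

    start/end are 1-indexed positions; they may be given in either order.
    direction is "+", "-", or "" (no direction, drawn as a rectangle).
    """
    left = min(start, end) - 1   # left edge of the first nucleotide
    right = max(start, end)      # right edge of the last nucleotide
    mid = height / 2
    if direction == "+":         # pentagon pointing right
        return [(left, 0.0), (right - tip, 0.0), (right, mid),
                (right - tip, height), (left, height)]
    if direction == "-":         # pentagon pointing left
        return [(right, 0.0), (left + tip, 0.0), (left, mid),
                (left + tip, height), (right, height)]
    return [(left, 0.0), (right, 0.0), (right, height), (left, height)]
```

The returned vertex list can be passed to any plotting library that draws closed polygons.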
If the user wishes to analyze only a portion of the sequence, a zoom range can be specified. If the sequence is longer than 500 nucleotides, it is automatically split into 500-bp windows that are written as separate files. In addition, the individual files are appended together to create a single output file covering the entire sequence.
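The 500-bp windowing can be pictured with a short sketch. `split_windows` is a hypothetical helper; how the module treats binding sites that straddle a window boundary is not described here and is not addressed by this sketch.

```python
def split_windows(seq_len, window=500):
    """Return 1-indexed, inclusive (start, end) ranges covering the sequence."""
    return [
        (start, min(start + window - 1, seq_len))
        for start in range(1, seq_len + 1, window)
    ]


print(split_windows(1200))  # [(1, 500), (501, 1000), (1001, 1200)]
print(split_windows(500))   # [(1, 500)]
```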
* indicates required parameter
Default = 0.7
Default = grey
GenerateMotifDatabase. Default = None
Default = pfm
Default = False
Default = 0.25,0.25,0.25,0.25
Default = False
.svg in addition to .png. For manuscript preparation, .svg format is preferable. Default = 150
.svg files. Default = None
Sequence Name: name of the DNA sequence
Sequence: the sequence

Sequence Name  Sequence
ZRS            AACTTTAATGCCTATGTTTGATTTGAAGTCATAGCATAAAAGGTAACATAAGCAACATCCTGACCAATTATCCAAACCATCCAGACATCCCTGAATGGC...
TF Name: name of the transcription factor
Color: binding site color on the output visualization
Core Site: minimal IUPAC binding site definition for the transcription factor
Affinity Data: relative affinity data obtained from NormalizeTfDnaAffinityData, or relative score data (optional)
Minimum Affinity: minimum affinity a binding site must have to be plotted (optional)

TF Name  Color  Core Site  Affinity Data         Minimum Affinity
ETS      blue   NNGGAWNN   input_ets1-pbm.tsv
HOX      gold   NYNNTNAA   input_hoxa13-pbm.tsv  0.12
HAND     pink   CANNTG
ETS
Kmer      Relative Affinity
AAAAAAAA  0.15
AAAAAAAC  0.11
AAAAAAAG  0.13
AAAAAAAT  0.13
AAAAAACA  0.12

HOX
Kmer      Relative Affinity
AAAAAAAA  0.55
AAAAAAAC  0.56
AAAAAAAG  0.54
AAAAAAAT  0.54
AAAAAACA  0.56
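As an illustration of how such a table might be used, the sketch below loads a two-column k-mer table and applies an optional minimum affinity. It assumes a tab-delimited file with one header row, and it treats the threshold as inclusive; both are assumptions, and the function names are hypothetical.

```python
import csv
import io


def load_affinity_table(handle):
    """Read a two-column (k-mer, relative affinity) table into a dict."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the "Kmer / Relative Affinity" header row
    return {row[0]: float(row[1]) for row in reader if row}


def keep_site(value, minimum=None):
    """Keep a site unless a minimum is set and the value is missing or below it."""
    if minimum is None:
        return True
    return value is not None and value >= minimum


# In-memory stand-in for an affinity file, using rows from the ETS example.
example = "Kmer\tRelative Affinity\nAAAAAAAA\t0.15\nAAAAAAAC\t0.11\nAAAAAAAG\t0.13\n"
ets = load_affinity_table(io.StringIO(example))
kept = [kmer for kmer, value in ets.items() if keep_site(value, minimum=0.12)]
print(kept)  # ['AAAAAAAA', 'AAAAAAAG']
```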
>MA1113.3 PBX2
A [ 4925 26620 225 24368 27245 27259 704 2298 25945 ]
C [ 19645 629 588 2266 574 754 453 23894 848 ]
G [ 1585 1710 317 817 343 569 327 555 352 ]
T [ 3441 637 28466 2145 1434 1014 28112 2849 2451 ]
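A count matrix in this format can be parsed and used to score a k-mer roughly as follows. The scoring scheme shown (log-odds against an assumed uniform background of 0.25 per base, a pseudocount of 1, and a score rescaled between the matrix minimum and maximum) is a common convention chosen for illustration; this document does not state how the module computes its scores, and all function names are hypothetical.

```python
import math

PFM_TEXT = """>MA1113.3 PBX2
A [ 4925 26620 225 24368 27245 27259 704 2298 25945 ]
C [ 19645 629 588 2266 574 754 453 23894 848 ]
G [ 1585 1710 317 817 343 569 327 555 352 ]
T [ 3441 637 28466 2145 1434 1014 28112 2849 2451 ]
"""


def parse_jaspar(text):
    """Return (matrix_id, tf_name, counts), where counts maps base -> list of counts."""
    lines = [line.strip() for line in text.strip().splitlines()]
    matrix_id, tf_name = lines[0][1:].split()[:2]
    counts = {}
    for line in lines[1:]:
        base, values = line.split("[")
        counts[base.strip()] = [float(v) for v in values.rstrip("]").split()]
    return matrix_id, tf_name, counts


def log_odds(counts, background=0.25, pseudocount=1.0):
    """Convert counts to per-position log2-odds scores (assumed scheme)."""
    pwm = []
    for i in range(len(counts["A"])):
        total = sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
        pwm.append({
            b: math.log2((counts[b][i] + pseudocount) / total / background)
            for b in "ACGT"
        })
    return pwm


def relative_score(kmer, pwm):
    """Rescale the k-mer's summed score to the range 0 (worst) to 1 (best)."""
    score = sum(column[base] for column, base in zip(pwm, kmer))
    lowest = sum(min(column.values()) for column in pwm)
    highest = sum(max(column.values()) for column in pwm)
    return (score - lowest) / (highest - lowest)


matrix_id, tf_name, counts = parse_jaspar(PFM_TEXT)
pwm = log_odds(counts)
consensus = "".join(max("ACGT", key=lambda b: counts[b][i]) for i in range(len(pwm)))
print(matrix_id, tf_name, consensus)  # MA1113.3 PBX2 CATAAATCA
```

Under this scheme a k-mer would count as a predicted site when its relative score meets the chosen minimum score.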
Sequence Name: name of the sequence
TF Name: name of the transcription factor
Matrix ID: PWM ID from JASPAR (optional)
Kmer ID: unique ID associated with each k-mer
Kmer: sequence of the k-mer
Start Position (1-indexed): starting position of the k-mer, where counting begins at one
End Position (1-indexed): ending position of the k-mer, where counting begins at one
Ref Data Type: either ‘Affinity’ or ‘Score’ depending on type of data being used (optional)
Value: relative affinity or score of the k-mer (optional)
Site Direction: direction of the binding site
Duplicate Kmer IDs: list of k-mer IDs for k-mers that have the same sequence

tf info table
Sequence Name  TF Name  Kmer ID  Kmer      Start Position (1-indexed)  End Position (1-indexed)  Ref Data Type  Value  Site Direction  Duplicate Kmer IDs
ZRS            ETS      ETS:1    CTATCCTG  335                         328                       Affinity       0.15   -
ZRS            ETS      ETS:2    TTTTCCCC  432                         425                       Affinity       0.14   -               ETS:1,ETS:20
ZRS            HOX      HOX:1    TTTAATAT  323                         316                       Affinity       0.75   -
ZRS            HOX      HOX:2    TTTATGAC  415                         408                       Affinity       0.84   -
ZRS            HAND     HAND:1   CAGATG    416                         421

pfm data
Sequence Name  TF Name  Matrix ID  Kmer ID          Kmer       Start Position (1-indexed)  End Position (1-indexed)  Ref Data Type  Value  Site Direction  Duplicate Kmer IDs
ZRS            PBX2     MA1113.3   PBX2:MA1113.3:1  CATAAACCA  365                         357                       Score          0.88   -
ZRS            PBX2     MA1113.3   PBX2:MA1113.3:2  CATAAAATA  413                         405                       Score          0.84   -
Example input data is available here.