Author(s): Joe Solvason, Maggie Ma
Contact: Joe Solvason (solvason@ucsd.edu)
Adapted as a GenePattern Module by: Ted Liefeld (jliefeld@cloud.ucsd.edu)
Task Type: Transcription factor analysis
LSID: urn:lsid:genepattern.org:module.analysis:00479
tfsites.CompareAcrossSeqsFromGenomicVariants takes in the genomic coordinates of variants to map how ref/alt sequence changes impact the function of TF binding sites. In biomedical applications, comparisons can be made between reference and alternate alleles that are associated with diseases or changes in gene expression.
Reference and alternate sequences will be extracted based on the genomic coordinates provided. The size of the sequence extracted will depend on the window size, which is based on the length of the binding sites being analyzed. Each pair of ref/alt sequences will be analyzed separately.
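The window extraction described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the function name `extract_windows`, the use of 1-based positions, and the choice of `site_len - 1` flanking bases on each side (so that every site-length k-mer overlapping the variant is covered) are all assumptions made for the example.

```python
def extract_windows(chrom_seq, pos, ref, alt, site_len):
    """Return (ref_window, alt_window) around a variant.

    Assumptions (not taken from the module): `pos` is 1-based, and the
    window keeps `site_len - 1` flanking bases on each side of the allele.
    """
    start = pos - 1  # 0-based index of the first reference base
    if chrom_seq[start:start + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    flank = site_len - 1
    left = chrom_seq[max(0, start - flank):start]
    right = chrom_seq[start + len(ref):start + len(ref) + flank]
    return left + ref + right, left + alt + right


# Toy sequence standing in for a chromosome; real input would come from a genome.
toy_chrom = "ACGTACGTGGAAGTACGT"
ref_win, alt_win = extract_windows(toy_chrom, pos=9, ref="G", alt="A", site_len=4)
```

With a 4 bp site, each window is 7 bp long, and the ref and alt windows differ only at the variant position.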
Enhancers to be compared are labeled with one of three groups: wild-type, control, or test. Control sequences carry sequence variation but retain the same function (i.e., the sequence changes do not alter enhancer activity). Test sequences carry sequence variation and drive differential enhancer activity. A control group, when available, helps identify binding sites that arise uniquely in the sequence variants that drive differential expression; sites that arise in both control and test groups are less informative. If a wild-type sequence is provided, the analysis reports how every binding site relates to the wild-type.
The scan_seqs function scans every provided enhancer sequence for binding sites. There are 3 methods to predict binding sites: score-pwm, core-only and core-affinity.
A multiple sequence alignment of all enhancers is provided as input to the tool. When predicting binding sites within a single sequence, the tool ignores '-' characters. For example, if a PWM is 8 bp long, the tool tests each contiguous 8-mer that contains only A, T, G, or C. Any '-' characters are skipped until a total of 8 nucleotides is reached. However, if a '-' resides within a k-mer (for example, the 8-mer G-AGGAAGT, which spans 9 alignment columns but contains 8 bp of real sequence), the start position may be either the G or the '-'. This allows the tool to map orthologous binding sites across sequences that contain indels.
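A simplified sketch of gap-aware k-mer extraction is shown below. The function name `gapped_kmers` is hypothetical, and the sketch only starts k-mers at nucleotide columns; it does not reproduce how the tool treats a start position that falls on a '-' column.

```python
def gapped_kmers(aligned_seq, k):
    """Yield (alignment_start, kmer) pairs, skipping '-' columns.

    Hypothetical helper: k-mers start only at nucleotide columns and are
    kept only if they consist solely of A, C, G, or T.
    """
    for start, base in enumerate(aligned_seq):
        if base == "-":
            continue
        kmer = []
        for ch in aligned_seq[start:]:
            if ch == "-":
                continue  # gaps do not count toward the k nucleotides
            kmer.append(ch)
            if len(kmer) == k:
                break
        if len(kmer) == k and set(kmer) <= set("ACGT"):
            yield start, "".join(kmer)


kmers = list(gapped_kmers("G-AGGAAGTC", 8))
```

Reporting the alignment column rather than the ungapped offset is what lets the same site be compared across aligned sequences that differ by an indel.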
Finally, compare seqs collapses binding sites that appear in the same location across multiple sequences. Binding sites whose scores or affinities are altered across genetic variants are reported. Gain-of-function binding sites are either de novo (the binding site is in the test sequence but not in the wild-type or controls) or increase (the binding site is present in all sequences, but the test sequence score or affinity is increased). Loss-of-function binding sites are either deletion (the binding site is absent from the test sequence but present in the wild-type or controls) or decrease (the binding site is present in all sequences, but the test sequence score or affinity is decreased).
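The four categories can be illustrated with a small sketch that compares one test value against one reference value at the same location. The function name `classify_change` and the use of `None` for an absent site are assumptions for the example; how the module combines several controls, or whether it applies a minimum-change threshold, is not shown here.

```python
def classify_change(ref_score, test_score):
    """Label how a site differs between a reference and a test sequence.

    Hypothetical helper: a score of None means no site was predicted at
    that location in that sequence.
    """
    if ref_score is None and test_score is None:
        return "absent"
    if ref_score is None:
        return "de novo"    # gain of function: site only in test
    if test_score is None:
        return "deletion"   # loss of function: site lost in test
    if test_score > ref_score:
        return "increase"   # gain of function: stronger site in test
    if test_score < ref_score:
        return "decrease"   # loss of function: weaker site in test
    return "unchanged"


labels = [
    classify_change(None, 0.4),
    classify_change(0.2, 0.6),
    classify_change(0.5, None),
    classify_change(0.7, 0.3),
]
```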
* indicates required parameter
*Either TF affinity information (.tsv) or a motif input file (JASPAR format), or both, must be provided.
tfsites.GenerateMotifDatabase
module.0.1
.0.8
Chr: chromosome
Pos: position
Ref: reference allele
Alt: alternate allele
Hypothesis: specify whether gof/lof/both/na
Chr   Pos       Ref  Alt  Hypothesis
chr2  65502802  G    A    gof
chr9  21776615  T    G    gof
chr9  21842300  G    C    lof
chr8  27337045  A    T    lof
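As an illustration, a table in this layout could be read as follows. The sketch assumes whitespace-delimited columns and a header row beginning with `Chr`; the `Variant` type and `parse_variants` function are hypothetical names, not part of the module.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    hypothesis: str


VALID_HYPOTHESES = {"gof", "lof", "both", "na"}


def parse_variants(text):
    """Parse whitespace-delimited variant rows, skipping the header row."""
    variants = []
    for line in text.strip().splitlines():
        fields = line.split()
        if not fields or fields[0] == "Chr":
            continue  # blank line or header
        chrom, pos, ref, alt, hypothesis = fields
        if hypothesis not in VALID_HYPOTHESES:
            raise ValueError(f"unknown hypothesis: {hypothesis}")
        variants.append(Variant(chrom, int(pos), ref, alt, hypothesis))
    return variants


example = """
Chr Pos Ref Alt Hypothesis
chr2 65502802 G A gof
chr9 21842300 G C lof
"""
variants = parse_variants(example)
```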
TF Name: name of the transcription factor
Core Site: minimal IUPAC binding site definition for transcription factor
Affinity Data (optional): name of the relative affinity data file
TF Name   Core Site  Affinity Data
ETS       NNGGAWNN   input_ets1-pbm.tsv
ETS-only  NNGGAWNN
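To show what an IUPAC core site such as NNGGAWNN means in practice, the sketch below matches a core definition against a sequence using the standard IUPAC nucleotide codes. The function names are hypothetical, and the sketch scans the forward strand only; whether and how the module handles the reverse strand is not covered here.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_core(kmer, core):
    """True if every base of kmer is allowed by the IUPAC code at that position."""
    return len(kmer) == len(core) and all(
        base in IUPAC[code] for base, code in zip(kmer, core)
    )


def find_core_sites(seq, core):
    """Return (0-based start, kmer) for each forward-strand match to core."""
    k = len(core)
    return [
        (i, seq[i:i + k])
        for i in range(len(seq) - k + 1)
        if matches_core(seq[i:i + k], core)
    ]


sites = find_core_sites("TTACGGAAGTCC", "NNGGAWNN")
```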
>MA1113.3 PBX2
A [ 4925 26620 225 24368 27245 27259 704 2298 25945 ]
C [ 19645 629 588 2266 574 754 453 23894 848 ]
G [ 1585 1710 317 817 343 569 327 555 352 ]
T [ 3441 637 28466 2145 1434 1014 28112 2849 2451 ]
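For readers unfamiliar with this layout, the sketch below parses a single JASPAR-format entry into per-base count lists and derives the most frequent base at each position. The names `parse_jaspar` and `consensus` are hypothetical, and this is not the module's score-pwm method, which is not described here.

```python
def parse_jaspar(text):
    """Parse one JASPAR-format entry into (matrix_id, name, counts)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].lstrip(">").split(maxsplit=1)
    matrix_id = header[0]
    name = header[1] if len(header) > 1 else ""
    counts = {}
    for ln in lines[1:]:
        base, rest = ln.split("[", 1)
        counts[base.strip()] = [int(x) for x in rest.rstrip("]").split()]
    return matrix_id, name, counts


def consensus(counts):
    """Return the most frequent base at each motif position."""
    length = len(counts["A"])
    return "".join(
        max("ACGT", key=lambda b: counts[b][i]) for i in range(length)
    )


entry = """>MA1113.3 PBX2
A [ 4925 26620 225 24368 27245 27259 704 2298 25945 ]
C [ 19645 629 588 2266 574 754 453 23894 848 ]
G [ 1585 1710 317 817 343 569 327 555 352 ]
T [ 3441 637 28466 2145 1434 1014 28112 2849 2451 ]
"""
matrix_id, name, counts = parse_jaspar(entry)
motif = consensus(counts)
```

Each row holds the counts for one base across the 9 motif positions, so every list in `counts` has the same length.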
Example input data is available on GitHub.