tfsites.CompareAcrossSeqsFromGenomicVariants

tfsites.CompareAcrossSeqsFromGenomicVariants v1

Author(s): Joe Solvason, Maggie Ma

Contact: Joe Solvason (solvason@ucsd.edu)

Adapted as a GenePattern Module by: Ted Liefeld (jliefeld@cloud.ucsd.edu)

Task Type: Transcription factor analysis

LSID: urn:lsid:genepattern.org:module.analysis:00479

Introduction

tfsites.CompareAcrossSeqsFromGenomicVariants takes the genomic coordinates of variants and maps how reference/alternate (ref/alt) sequence changes affect the function of transcription factor (TF) binding sites. In biomedical applications, comparisons can be made between reference and alternate alleles that are associated with disease or with changes in gene expression.

Methodology

Reference and alternate sequences will be extracted based on the genomic coordinates provided. The size of the sequence extracted will depend on the window size, which is based on the length of the binding sites being analyzed. Each pair of ref/alt sequences will be analyzed separately.
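The extraction step can be illustrated with a minimal sketch; this is not the module's actual code. It assumes 1-based variant positions, a genome dictionary in the format described under Input Files, and a flank of (site length - 1) bases on each side so that every k-mer overlapping the variant is covered. The function name `extract_ref_alt` and the `site_len` parameter are hypothetical.

```python
def extract_ref_alt(genome, chrom, pos, ref, alt, site_len):
    """Return (ref_seq, alt_seq) windows around a variant.

    Assumptions (not confirmed by the module documentation):
    - `pos` is 1-based and points at the first base of `ref`
    - the flank on each side is site_len - 1 bases
    """
    seq = genome[chrom]
    start = pos - 1  # convert to a 0-based index
    observed = seq[start:start + len(ref)]
    if observed.upper() != ref.upper():
        raise ValueError(
            f"Reference allele {ref!r} does not match genome base(s) {observed!r}"
        )
    flank = site_len - 1
    left = seq[max(0, start - flank):start]
    right = seq[start + len(ref):start + len(ref) + flank]
    return left + ref + right, left + alt + right


# Toy genome in the same shape as the genome file example
genome = {"chr1": "ACGTATTAGCCTAGAGATCA"}
ref_seq, alt_seq = extract_ref_alt(genome, "chr1", 10, "C", "T", site_len=4)
print(ref_seq, alt_seq)  # TAGCCTA TAGTCTA
```

With a 4 bp site, the window is 7 bp long, so each of the four 4-mers that overlap the variant is present in both the ref and alt sequence.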

Enhancers to be compared are assigned to one of three groups: wild-type, control, or test. Control sequences contain sequence variation but retain the same function (i.e., the sequence changes do not alter enhancer activity). Test sequences contain sequence variation and drive differential enhancer activity. A control group, when available, is helpful because it shows which binding sites arise uniquely in the sequence variants that drive differential expression; sites that arise in both the control and test groups are less informative. If a wild-type sequence is provided, the analysis reports how every binding site relates to the wild-type.

The scan_seqs function scans every provided enhancer sequence for binding sites. There are 3 methods to predict binding sites: score-pwm, core-only and core-affinity.

  1. Score-pwm calculates a binding score using a PWM and predicts binding sites whose score exceeds a given threshold. Multiple PWMs can be supplied in a single analysis, which allows high-throughput screening of PWMs that could potentially explain differential enhancer activity.
  2. Core-only uses IUPAC nomenclature to define a binding site core (e.g. GGAW, where W = A/T). This method reports de novo sites and deletions.
  3. Core-affinity also uses IUPAC nomenclature to define a binding site core, but additionally integrates affinity datasets (e.g. MITOMI, PBM). This enables reporting not only whether a sequence variant creates a de novo site or a deletion, but also the affinity of each site. In addition, this method reports sequence variants that drive substantial changes in affinity.
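The core-only idea can be sketched as follows. This is an illustrative example rather than the module's implementation: the helper names `iupac_to_regex` and `find_core_sites` are hypothetical, and only the forward strand is scanned here, since the documentation does not state how strands are handled.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(core):
    """Compile an IUPAC core into a regex that finds overlapping matches."""
    body = "".join(IUPAC[code] for code in core.upper())
    # The lookahead lets consecutive matches overlap each other
    return re.compile(f"(?=({body}))")


def find_core_sites(seq, core):
    """Return a list of (start, kmer) for every forward-strand match."""
    pattern = iupac_to_regex(core)
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq.upper())]


ref_sites = find_core_sites("ACGGACTT", "GGAW")  # no GGAA/GGAT present
alt_sites = find_core_sites("ACGGAATT", "GGAW")  # C>A creates GGAA
print(ref_sites, alt_sites)  # [] [(2, 'GGAA')]
```

In this toy pair, the site is found only in the alt sequence, which is the situation the text describes as de novo.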

A multiple sequence alignment of all enhancers is provided as input to the tool. When predicting binding sites within a single sequence, the tool ignores '-' characters. For example, if a PWM is 8 bp long, the tool tests each contiguous 8-mer that contains only A, C, G, or T. Any '-' characters are skipped until a total of 8 nucleotides is reached. However, if a '-' resides within a k-mer (for example, G-AGGAAGT, which spans 9 alignment columns but contains 8 bp of real sequence), the start position may be either the G or the '-'. This allows the tool to map orthologous binding sites across sequences that contain indels.
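A simplified sketch of gap-aware k-mer enumeration is shown below. It is an assumption-laden illustration, not the module's code: the function name `gapped_kmers` is hypothetical, and start columns that fall on a '-' are not enumerated separately here. Each k-mer is reported with the alignment columns it spans, which is what makes sites comparable across aligned sequences.

```python
def gapped_kmers(aligned_seq, k):
    """Yield (start_col, end_col, kmer) for each run of k nucleotides.

    Gap characters ('-') are skipped, so a k-mer may span more than
    k alignment columns. Windows with non-ACGT symbols are dropped.
    """
    # Alignment columns that hold a real character rather than a gap
    cols = [i for i, ch in enumerate(aligned_seq) if ch != "-"]
    for j in range(len(cols) - k + 1):
        window = cols[j:j + k]
        kmer = "".join(aligned_seq[c] for c in window).upper()
        if set(kmer) <= set("ACGT"):
            yield window[0], window[-1], kmer


kmers = list(gapped_kmers("G-AGGAAGT", 8))
print(kmers)  # [(0, 8, 'GAGGAAGT')]
```

The single 8-mer spans alignment columns 0 through 8 (nine columns) while containing eight real bases, matching the example in the text.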

Finally, compare seqs collapses binding sites that appear at the same location across multiple sequences. Binding sites whose scores or affinities are altered across genetic variants are reported. Gain-of-function binding sites are either de novo (the binding site is present in the test sequence but not in the wild-type or controls) or increase (the binding site is present in all sequences, but the test sequence score or affinity is increased). Loss-of-function binding sites are either deletion (the binding site is absent from the test sequence but present in the wild-type or controls) or decrease (the binding site is present in all sequences, but the test sequence score or affinity is decreased).
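The four categories above can be expressed as a small decision function. This is a sketch under stated assumptions: it compares a single reference value against a single test value, uses `None` to mean that no site was predicted, and applies no minimum change threshold. The function name `classify_site` and the extra labels "absent" and "unchanged" are hypothetical.

```python
def classify_site(ref_score, test_score):
    """Classify a site by comparing reference and test scores or affinities.

    `None` means that no binding site was predicted at this location.
    """
    if ref_score is None and test_score is None:
        return "absent"
    if ref_score is None:
        return "de novo"    # gain of function
    if test_score is None:
        return "deletion"   # loss of function
    if test_score > ref_score:
        return "increase"   # gain of function
    if test_score < ref_score:
        return "decrease"   # loss of function
    return "unchanged"


# Map each reported category to the direction of its effect
EFFECT = {"de novo": "gof", "increase": "gof", "deletion": "lof", "decrease": "lof"}

label = classify_site(None, 0.42)
print(label, EFFECT[label])  # de novo gof
```

A real analysis would likely require a minimum difference before calling an increase or decrease; that threshold is left out because the documentation does not specify one here.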

Parameters

* indicates required parameter

Input and Outputs

*Either tf affinity information (.tsv) or motif input file (JASPAR format) or both must be provided.

Other Parameters

Input Files

  1. genome file (.pkl)
    • Contains a Python dictionary object in the following format: {'chr1': 'ACGTATTAGCCTAGAGATCA', …}
  2. variant file (.tsv)
    • Assumes header is present
    • Columns:
      • Chr: chromosome
      • Pos: position
      • Ref: reference allele
      • Alt: alternate allele
      • Hypothesis: one of gof, lof, both, or na
Chr	Pos	Ref	Alt	Hypothesis
chr2	65502802	G	A	gof
chr9	21776615	T	G	gof
chr9	21842300	G	C	lof
chr8	27337045	A	T	lof
  3. tf affinity information (.tsv)
    • Assumes header is present
    • Columns:
      • TF Name: name of the transcription factor
      • Core Site: minimal IUPAC binding site definition for the transcription factor
      • Affinity Data (optional): name of the relative affinity data file
TF Name    Core Site    Affinity Data
ETS        NNGGAWNN     input_ets1-pbm.tsv    
ETS-only   NNGGAWNN
  4. motif input file (JASPAR format)
    • Can provide multiple PWMs
>MA1113.3	PBX2
A  [  4925  26620    225  24368  27245  27259    704   2298  25945 ]
C  [ 19645    629    588   2266    574    754    453  23894    848 ]
G  [  1585   1710    317    817    343    569    327    555    352 ]
T  [  3441    637  28466   2145   1434   1014  28112   2849   2451 ]
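
To show how a matrix in this format might be read and used for score-pwm, here is a minimal sketch. The parser handles only the bracketed layout shown above. The scoring scheme (log-odds with a pseudocount of 1 and a uniform 0.25 background) is an assumption for illustration and is not taken from the module; the names `parse_jaspar` and `log_odds_score` are hypothetical.

```python
import math
import re


def parse_jaspar(text):
    """Parse JASPAR-format text into {motif_id: {"name": ..., "counts": ...}}."""
    motifs = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split(None, 1)  # ID, then optional TF name
            current = {"name": parts[1] if len(parts) > 1 else parts[0], "counts": {}}
            motifs[parts[0]] = current
        elif current is not None:
            base = line[0].upper()
            numbers = re.findall(r"\d+(?:\.\d+)?", line)
            current["counts"][base] = [float(x) for x in numbers]
    return motifs


def log_odds_score(counts, kmer, pseudocount=1.0, background=0.25):
    """Sum per-position log2 odds of the k-mer under the count matrix."""
    score = 0.0
    for i, base in enumerate(kmer.upper()):
        total = sum(counts[b][i] for b in "ACGT")
        freq = (counts[base][i] + pseudocount) / (total + 4 * pseudocount)
        score += math.log2(freq / background)
    return score


jaspar_text = """>MA1113.3 PBX2
A  [  4925  26620    225  24368  27245  27259    704   2298  25945 ]
C  [ 19645    629    588   2266    574    754    453  23894    848 ]
G  [  1585   1710    317    817    343    569    327    555    352 ]
T  [  3441    637  28466   2145   1434   1014  28112   2849   2451 ]
"""
motifs = parse_jaspar(jaspar_text)
counts = motifs["MA1113.3"]["counts"]
# CATAAATCA is the highest-count base at each position of this matrix
print(log_odds_score(counts, "CATAAATCA") > log_odds_score(counts, "GGGGGGGGG"))
```

A site would then be called when its score exceeds the chosen threshold; how the module normalizes scores before applying that threshold is not specified here.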

Output Files

Example Data

Example input data is available on GitHub.

References

Version Comments