tfsites.CompareTfSitesAcrossSequences

tfsites.CompareTfSitesAcrossSequences v1

Author(s): Joe Solvason

Contact: Joe Solvason (solvason@eng.ucsd.edu)

Adapted as a GenePattern Module by: Ted Liefeld (jliefeld@cloud.ucsd.edu)

Task Type: Transcription factor analysis

LSID: urn:lsid:genepattern.org:module.analysis:00448

Introduction

tfsites.CompareTfSitesAcrossSequences takes a multiple sequence alignment of two or more enhancers and maps how sequence variation affects the function of transcription factor (TF) binding sites. In biomedical applications, comparisons can be made between reference and alternate alleles that are associated with diseases or changes in gene expression, to determine which binding sites are lost, gained, or changed across genetic variants of enhancers. In evolutionary applications, the tool can be used to determine which binding sites are lost, gained, or changed within a particular clade of species.

Methodology

Enhancers to be compared are assigned to groups: wild-type, control, and test. Control sequences contain sequence variation but retain the same function (i.e., the sequence changes do not alter enhancer activity). Test sequences contain sequence variation and drive differential enhancer activity. A control group, when available, is helpful because it shows which binding sites arise uniquely in the sequence variants that drive differential expression; sites that arise in both control and test groups are less informative. If a wild-type sequence is provided, the analysis reports how all binding sites relate to the wild-type.

The scan_seqs function scans every provided enhancer sequence for binding sites. There are 3 methods to predict binding sites: score-pwm, core-only and core-affinity.

  1. Score-pwm calculates a binding score using a position weight matrix (PWM) and predicts binding sites whose score exceeds a given threshold. Many PWMs can be supplied in a single analysis, which enables high-throughput screening of PWMs that could potentially explain differential enhancer activity.
  2. Core-only uses IUPAC nomenclature to define a binding site core (e.g., GGAW, where W = A/T). This method reports de novo sites and deletions.
  3. Core-affinity also uses IUPAC nomenclature to define a binding site core, but additionally integrates affinity datasets (e.g., MITOMI, PBM). It reports not only whether a sequence variant creates a de novo site or a deletion, but also the affinity of each site. In addition, this method reports sequence variants that drive substantial changes in affinity.
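The IUPAC core matching used by the core-only and core-affinity methods can be illustrated with a short Python sketch. This is not the module's own code: the function name find_core_sites is hypothetical, and the sketch assumes forward-strand matching on an ungapped sequence with 0-based positions.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_core_sites(seq, core):
    """Return 0-based start positions where the IUPAC core matches seq.

    Hypothetical helper: forward strand only, no gap handling.
    """
    pattern = "".join(IUPAC[c] for c in core.upper())
    # A lookahead is used so that overlapping matches are all reported.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", seq.upper())]

# GGAW matches GGAA at position 2 and GGAT at position 6.
print(find_core_sites("TTGGAAGGATCC", "GGAW"))  # [2, 6]
```

A site found in a test sequence but at no corresponding position in the wild-type or control sequences would then be a candidate de novo site.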

A multiple sequence alignment of all enhancers is supplied to the tool. When predicting binding sites within a single sequence, the tool ignores '-' (gap) characters. For example, if a PWM is 8 bp long, the tool tests each contiguous 8-mer that contains only A, C, G, or T. Any '-' characters are skipped until a total of 8 nucleotides is reached. However, if a '-' lies within a k-mer (for example, the 8-mer G-AGGAAGT, which spans 9 alignment columns but contains 8 bp of real sequence), the tool allows the start position to be either the G or the '-'. This allows the tool to map orthologous binding sites across sequences that contain indels.
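The gap-skipping behavior can be sketched as follows. This is a simplified illustration rather than the module's implementation: the function name gapped_kmers is hypothetical, and windows here begin only at real nucleotides, so the alternative of starting at a '-' column is not reproduced.

```python
def gapped_kmers(aligned_seq, k):
    """Return a list of (alignment_start, kmer) pairs for one aligned sequence.

    Each k-mer holds k real nucleotides; '-' characters inside the window
    are skipped. alignment_start is the 0-based alignment column of the
    first nucleotide. Hypothetical helper for illustration only.
    """
    results = []
    for start, ch in enumerate(aligned_seq):
        if ch == "-":
            continue  # simplification: windows start only at nucleotides
        kmer = []
        for c in aligned_seq[start:]:
            if c != "-":
                kmer.append(c)
                if len(kmer) == k:
                    break
        if len(kmer) == k:  # drop incomplete windows at the sequence end
            results.append((start, "".join(kmer)))
    return results

# The gapped window G-AGGAAGT spans 9 columns but yields the 8-mer GAGGAAGT.
print(gapped_kmers("G-AGGAAGTC", 8))  # [(0, 'GAGGAAGT'), (2, 'AGGAAGTC')]
```

Recording positions in alignment coordinates rather than sequence coordinates is what lets sites from different sequences be compared column by column.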

Finally, the compare seqs step collapses binding sites that appear in the same location across multiple sequences. Binding sites whose scores or affinities are altered across genetic variants are reported. Gain-of-function binding sites are either de novo (a binding site is present in the test sequence but not in the wild-type or controls) or increase (a binding site is present in all sequences, but the test sequence score or affinity is increased). Loss-of-function binding sites are either deletion (a binding site is absent from the test sequence but present in the wild-type or controls) or decrease (a binding site is present in all sequences, but the test sequence score or affinity is decreased).
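The four categories can be expressed as a small decision function. This is a sketch under stated assumptions, not the module's logic: the name classify_site is hypothetical, a missing site is represented as None, and "increased" or "decreased" is assumed to mean above or below every wild-type and control value, with no minimum-change threshold.

```python
def classify_site(test, references):
    """Classify one aligned binding-site position.

    test: score or affinity in the test sequence, or None if no site.
    references: scores or affinities at the same position in the
        wild-type and control sequences (None where no site is present).
    Hypothetical helper; the comparison rule is an assumption.
    """
    present = [r for r in references if r is not None]
    if test is None:
        return "deletion" if present else "no site"
    if not present:
        return "de novo"
    if len(present) < len(references):
        # Site present in only some reference sequences: not covered
        # by the four categories, so left unclassified in this sketch.
        return "unclassified"
    if test > max(present):
        return "increase"
    if test < min(present):
        return "decrease"
    return "unchanged"

print(classify_site(0.8, [None, None]))  # de novo
print(classify_site(None, [0.5, 0.6]))   # deletion
print(classify_site(0.9, [0.5, 0.6]))    # increase
print(classify_site(0.2, [0.5, 0.6]))    # decrease
```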

Parameters

* indicates required parameter

Input and Outputs

*Either tf affinity information (.tsv) or motif input file (JASPAR format) or both must be provided.

Other Parameters

Input Files

  1. enhancer alignment table data (.tsv)
  2. enhancer functional group table (.tsv)
  3. tf affinity information (.tsv)
  4. motif input file (JASPAR format)

Output Files

Example Data

Example input data is available on GitHub.

References

Version Comments