Module Documentation

STREAM.DetectDifferentiallyExpressedGenes


LSID
urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00402
Author(s)
Huidong Chen, Massachusetts General Hospital, wrapped as a module by Ted Liefeld, Mesirov Lab, UCSD School of Medicine.
Contact(s)

Algorithm and scientific questions: <Huidong.Chen at mgh dot harvard dot edu>

Module wrapping issues: Ted Liefeld <jliefeld at cloud dot ucsd dot edu>


Introduction

STREAM (Single-cell Trajectories Reconstruction, Exploration And Mapping) is an interactive pipeline capable of disentangling and visualizing complex branching trajectories from both single-cell transcriptomic and epigenomic data. Within GenePattern, STREAM is implemented as a collection of modules that cover the entire STREAM processing pipeline, so that individual steps can be performed interactively for data exploration.

STREAM.DetectDifferentiallyExpressedGenes is used to detect differentially expressed genes between pairs of branches.

Algorithm

For each pair of branches 𝐡𝑖 and 𝐡𝑗 and each gene E, the gene expression values across cells from both branches are scaled to the range [0,1]. For gene expression 𝐸𝑖 from 𝐡𝑖 and gene expression 𝐸𝑗 from 𝐡𝑗, we first calculate their mean values. We then check that the fold change between the mean values is above a specified threshold (by default, a log2 fold change > 0.25). The Mann–Whitney U test is then used to test whether 𝐸𝑖 is greater than 𝐸𝑗 or 𝐸𝑖 is less than 𝐸𝑗. Because the statistic U can be approximated by a normal distribution for large samples, and U depends on the specific dataset, we standardize U to a Z-score to make it comparable between datasets. For small samples where this test is underpowered (<20 cells per branch), we report only the fold change to qualitatively evaluate the difference between 𝐸𝑖 and 𝐸𝑗. Genes with a Z-score or fold change greater than the specified threshold (2.0 by default) are considered differentially expressed between the two branches.

Formally:

$$z = \frac{U - m_U}{\sigma_U}$$

where $m_U$ and $\sigma_U$ are the mean and standard deviation of U:

$$m_U = \frac{n_i n_j}{2}, \qquad \sigma_U = \sqrt{\frac{n_i n_j}{12}\left((n+1) - \sum_{l=1}^{k}\frac{t_l^3 - t_l}{n(n-1)}\right)}$$

Here $n = n_i + n_j$, $n_i$ and $n_j$ are the numbers of cells in each branch, $t_l$ is the number of cells sharing rank $l$, and $k$ is the number of distinct ranks.
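The standardization of U described above can be illustrated with a short, dependency-free Python sketch. It is not the module's own code: the function names `average_ranks` and `mann_whitney_z` are hypothetical, the inputs are assumed to be plain lists of already scaled expression values for one gene in two branches, and returning 0.0 when every value is tied is an assumption made here to avoid a division by zero.

```python
from collections import Counter
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def mann_whitney_z(expr_i, expr_j):
    """Standardize the Mann-Whitney U statistic of expr_i versus expr_j."""
    n_i, n_j = len(expr_i), len(expr_j)
    n = n_i + n_j
    pooled = list(expr_i) + list(expr_j)
    ranks = average_ranks(pooled)
    # U for branch i: its rank sum minus the smallest possible rank sum.
    u = sum(ranks[:n_i]) - n_i * (n_i + 1) / 2
    m_u = n_i * n_j / 2
    # Tie correction: t cells share each distinct value.
    tie_term = sum(t**3 - t for t in Counter(pooled).values()) / (n * (n - 1))
    sigma_u = sqrt(n_i * n_j / 12 * ((n + 1) - tie_term))
    if sigma_u == 0:
        return 0.0  # assumption: all values tied, so no evidence of a difference
    return (u - m_u) / sigma_u
```

A positive value indicates that expression in the first branch tends to be higher, and swapping the two arguments flips the sign.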

References

Chen, H., Albergante, L., Hsu, J.Y., Lareau, C.A., Lo Bosco, G., Guan, J., Zhou, S., Gorban, A.N., Bauer, D.E., Aryee, M.J., Langenau, D.M., Zinovyev, A., Buenrostro, J.D., Yuan, G.C. & Pinello, L. Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM. Nature Communications 10, 1903 (2019).

Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20-31 (2016).

Pinello Lab STREAM GitHub Repository

Input Files

  1. data file *
    A STREAM pkl file containing an annotated AnnData matrix of gene expression data.

Output Files

  1. <output filename>_stream_result.pkl
    Output file in STREAM AnnData extended pickle (.pkl) file format suitable for passing to the next step of the STREAM analysis.
  2. de_genes/de_genes_S#_S#_and_S#_S#.pdf
    Bar plot of differentially expressed genes between branches of the trajectories.
  3. de_genes/de_genes_greater_S#_S#_and_S#_S#.tsv
    TSV of upregulated genes between branches of the trajectories. Columns include z-score, log fold change, mean up, mean down, p value and q value.
  4. de_genes/de_genes_less_S#_S#_and_S#_S#.tsv
    TSV of downregulated genes between branches of the trajectories. Columns include z-score, log fold change, mean up, mean down, p value and q value.
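
The TSV outputs can be inspected outside GenePattern with a few lines of Python. The sketch below is illustrative only: `load_de_table` and `top_rows` are hypothetical helper names, and because the exact column headers are not listed here, the column to sort by is passed in by the caller rather than hard-coded.

```python
import csv


def load_de_table(path):
    """Read a tab-separated table and return (header, rows) as lists of strings."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        rows = [row for row in reader if row]  # skip blank lines
    return header, rows


def top_rows(header, rows, column, count):
    """Return the `count` rows with the largest numeric value in `column`."""
    idx = header.index(column)
    return sorted(rows, key=lambda row: float(row[idx]), reverse=True)[:count]
```

Printing `header` after loading one of the downloaded files shows the actual column names, one of which can then be passed to `top_rows`.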

Example Data

Example data for the STREAM workflow can be downloaded from Dropbox: Stream Example Data
Ref: Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20-31 (2016).

Example data for this specific step can be found at stream_epg_result.pkl

Requirements

GenePattern 3.9.11 or later (dockerized).

Parameters

Inputs and Outputs

Name Description
data file* A STREAM pkl file containing an annotated AnnData matrix of gene expression data.
output filename* The output filename prefix.

Differential Expression

Name Description
root The starting node.
preference The preferred nodes. Branches containing the specified nodes are placed toward the top of the subway plot; the higher a node is ranked, the closer to the top its branch appears. e.g. S3,S4.
percentile expr Between 0 and 100. Specifies the percentile of gene expression values greater than 0 used to filter out extreme expression values.
use precomputed If True, the previously computed scaled gene expression will be used.
cutoff zscore The z-score cutoff used for the Mann–Whitney U test.
cutoff logfc The log-transformed fold change cutoff between a pair of branches.
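
As a rough illustration of how percentile expr, cutoff zscore and cutoff logfc could interact, the sketch below caps and scales expression values and then applies the two cutoffs. It is one reading of the description in the Algorithm section, not the module's implementation: the function names, the linear-interpolation percentile, the `pseudocount` that guards against taking the logarithm of zero, and the use of absolute values for the comparisons are all assumptions made for this example. Expression values are assumed to be non-negative.

```python
from math import log2


def percentile(sorted_values, q):
    """Linear-interpolation percentile (0 <= q <= 100) of a non-empty sorted list."""
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def scale_expression(values, percentile_expr):
    """Cap values at a percentile of the positive ones, then scale to [0, 1]."""
    positive = sorted(v for v in values if v > 0)
    if not positive:
        return [0.0 for _ in values]
    cap = percentile(positive, percentile_expr)
    return [min(v, cap) / cap for v in values]


def passes_cutoffs(mean_i, mean_j, z, n_i, n_j,
                   cutoff_zscore, cutoff_logfc,
                   min_cells=20, pseudocount=1e-6):
    """Apply the fold-change cutoff first, then the z-score cutoff."""
    # pseudocount is a hypothetical guard against log2(0) for unexpressed genes.
    logfc = log2((mean_i + pseudocount) / (mean_j + pseudocount))
    if abs(logfc) <= cutoff_logfc:
        return False
    if min(n_i, n_j) < min_cells:
        return True  # small samples: only the fold change is evaluated
    return abs(z) > cutoff_zscore
```

For example, `scale_expression([0, 1, 2, 3, 100], 50)` caps the outlier at the median of the positive values (2.5) before scaling, so a single extreme cell no longer compresses the remaining values toward zero.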

Plotting

Parameters controlling the output figures.
Name Description
num genes The number of genes to plot.
figure height Figure height as used in matplotlib plots. Default=8.
figure width Figure width as used in matplotlib plots. Default=8.

* - required