Module Documentation

STREAM.SeedEPGStructure


LSID
urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00398
Author(s)
Huidong Chen, Massachusetts General Hospital, wrapped as a module by Ted Liefeld, Mesirov Lab, UCSD School of Medicine.
Contact(s)

Algorithm and scientific questions: <Huidong.Chen at mgh dot harvard dot edu>

Module wrapping issues: Ted Liefeld <jliefeld at cloud dot ucsd dot edu>


Introduction

STREAM (Single-cell Trajectories Reconstruction, Exploration And Mapping) is an interactive pipeline capable of disentangling and visualizing complex branching trajectories from both single-cell transcriptomic and epigenomic data. Within GenePattern, STREAM is implemented as a collection of modules that cover the entire STREAM processing pipeline, so that individual steps can be performed interactively for data exploration.

STREAM.SeedEPGStructure is used to seed the initial elastic principal graph prior to starting the trajectory learning process.

Algorithm

Elastic principal graphs are structured data approximators, consisting of vertices connected by edges. The vertices are embedded into the space of the data, minimizing the mean squared distance (MSD) to the data points, similarly to k-means. Unlike unstructured k-means, the edges connecting the vertices are used to define an elastic energy term. The elastic energy term penalizes edge stretching and bending of branches, and it is minimized together with the MSD.
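As a rough illustration of how the MSD and the elastic penalties combine, the sketch below computes a simplified energy for a small graph. The function name `elastic_energy`, the weights `lam` and `mu` and their default values, and the exact form of the bending term (squared distance of each node with two or more neighbors from the centroid of those neighbors) are assumptions made for this example. They are not taken from the module or from the ElPiGraph implementation.

```python
def elastic_energy(points, nodes, edges, lam=0.01, mu=0.1):
    """Simplified elastic energy: MSD + stretching penalty + bending penalty.

    points: list of coordinate tuples (the data)
    nodes:  list of coordinate tuples (graph vertices)
    edges:  list of (i, j) index pairs into `nodes`
    lam, mu: illustrative weights (hypothetical values)
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Approximation term: mean squared distance from each point to its nearest node.
    msd = sum(min(sqdist(p, v) for v in nodes) for p in points) / len(points)

    # Stretching penalty: sum of squared edge lengths.
    stretch = sum(sqdist(nodes[i], nodes[j]) for i, j in edges)

    # Bending penalty: for every node with at least two neighbors, the squared
    # distance between the node and the centroid of its neighbors.
    neighbors = {i: [] for i in range(len(nodes))}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    bend = 0.0
    for i, nbrs in neighbors.items():
        if len(nbrs) >= 2:
            dim = len(nodes[i])
            centroid = [sum(nodes[n][d] for n in nbrs) / len(nbrs) for d in range(dim)]
            bend += sqdist(nodes[i], centroid)

    return msd + lam * stretch + mu * bend
```

With the data points placed exactly on the nodes, a straight three-node chain scores lower than a bent one, because the bent chain has longer edges and is the only one that incurs a bending penalty.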

The principal graph inference is based on a greedy optimization procedure that may lead to local minima. STREAM therefore uses the STREAM.SeedEPGStructure module as an initialization step that improves the quality of the inferred solutions and speeds up convergence. First, cells are clustered in the low-dimensional space; k-means is used by default, and affinity propagation (ap) and spectral clustering (sc) are available as alternatives. A minimum spanning tree (MST) is then constructed over the resulting centroids using Kruskal’s algorithm. This tree serves as the initial tree structure for the ElPiGraph procedure.
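The seeding steps described above (cluster, take centroids, connect them with an MST) can be sketched in a few lines of plain Python. This is a minimal illustration and not the module's code: the helper names `kmeans`, `kruskal_mst` and `seed_tree` are hypothetical, the k-means is a basic Lloyd iteration with random initialization, and the MST is built over the complete graph of centroids weighted by Euclidean distance.

```python
import random


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=50, seed=0):
    """Basic Lloyd's k-means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sqdist(p, centroids[c]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (unchanged if the group is empty).
        new = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[c]
            for c, g in enumerate(groups)
        ]
        if new == centroids:
            break
        centroids = new
    return centroids


def kruskal_mst(nodes):
    """Kruskal's algorithm on the complete graph over `nodes`.

    Edges are sorted by squared Euclidean distance, which gives the same
    order as sorting by Euclidean distance.
    """
    parent = list(range(len(nodes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    candidates = sorted(
        (sqdist(nodes[i], nodes[j]), i, j)
        for i in range(len(nodes))
        for j in range(i + 1, len(nodes))
    )
    tree = []
    for _, i, j in candidates:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so it cannot close a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree


def seed_tree(points, k, seed=0):
    """Cluster the points, then connect the centroids with an MST."""
    centroids = kmeans(points, k, seed=seed)
    return centroids, kruskal_mst(centroids)
```

Calling `seed_tree(points, k)` returns `k` centroids and `k - 1` edges, which together play the role of the initial tree handed to the graph-learning step.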

References

H Chen, L Albergante, JY Hsu, CA Lareau, GL Bosco, J Guan, S Zhou, AN Gorban, DE Bauer, MJ Aryee, DM Langenau, A Zinovyev, JD Buenrostro, GC Yuan, L Pinello. Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM. Nature Communications, volume 10, Article number: 1903 (2019).

Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20-31 (2016).

Pinello Lab   STREAM Github Repository

Input Files

  1. data file *
    A STREAM pkl file containing an annotated AnnData matrix of gene expression data.

Output Files

  1. <output filename>_stream_result.pkl
    Output file in STREAM AnnData extended pickle (.pkl) file format suitable for passing to the next step of the STREAM analysis.
  2. <output filename>_variable_genes.png
    Plot of genes against the fitted curve (if select variable genes is selected).

Example Data

Example data for the STREAM workflow can be downloaded from Dropbox: Stream Example Data
Ref: Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20-31 (2016).

An input file suitable for this step is available at dimred_stream_result.pkl

Requirements

GenePattern 3.9.11 or later (dockerized).

Parameters

Inputs and Outputs

Name Description
data file* A STREAM pkl file containing an annotated AnnData matrix of gene expression data.
output filename* The output filename prefix.

Structure Learning

Name Description
percent neighbor cells* Neighbor percentage. The percentage of points used as neighbors for spectral clustering.
num clusters* Number of clusters (only valid when 'clustering' is specified as 'Spectral Clustering' or 'K-Means').
damping* Damping factor (between 0.5 and 1) for affinity propagation.
preference percentile* Preference percentile (between 0 and 100). The percentile of the input similarities for affinity propagation.
max clusters* Maximum number of clusters allowed (only valid when 'clustering' is specified as 'affinity propagation').
clustering* Clustering method used to infer the initial nodes. Choose from affinity propagation, K-Means clustering, or Spectral Clustering.
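To make the 'preference percentile' parameter more concrete, the sketch below shows one way a percentile of the input similarities could be turned into an affinity propagation preference. It assumes that similarities are negative squared Euclidean distances between distinct pairs of points and that the percentile is linearly interpolated. Both choices, and the function name `preference_from_percentile`, are assumptions for illustration and may differ from what the module does internally.

```python
def preference_from_percentile(points, percentile):
    """Return the given percentile (0-100) of pairwise similarities.

    Similarity is taken to be the negative squared Euclidean distance
    between two distinct points (an assumption for this sketch).
    """
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")

    sims = sorted(
        -sum((x - y) ** 2 for x, y in zip(a, b))
        for i, a in enumerate(points)
        for b in points[i + 1:]
    )

    # Linear interpolation between the two closest ranks.
    rank = (len(sims) - 1) * percentile / 100
    lo = int(rank)
    hi = min(lo + 1, len(sims) - 1)
    return sims[lo] + (sims[hi] - sims[lo]) * (rank - lo)
```

Lower percentiles give a more negative preference, which in affinity propagation generally leads to fewer clusters; higher percentiles tend to produce more.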

Plotting

Parameters controlling the output figures.
Name Description
num components* The number of components to be plotted.
component x* Component used for the x axis.
component y* Component used for the y axis.
figure height Figure height as used in matplotlib plots. Default=8.
figure width Figure width as used in matplotlib plots. Default=8.
figure legend num columns* The number of columns that the legend has.

* - required