Module Documentation

STREAM.DimensionReduction


LSID
urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00396
Author(s)
Huidong Chen, Massachusetts General Hospital; wrapped as a module by Ted Liefeld, Mesirov Lab, UCSD School of Medicine.
Contact(s)

Algorithm and scientific questions: <Huidong.Chen at mgh dot harvard dot edu>

Module wrapping issues: Ted Liefeld <jliefeld at cloud dot ucsd dot edu>


Introduction

STREAM (Single-cell Trajectories Reconstruction, Exploration And Mapping) is an interactive pipeline capable of disentangling and visualizing complex branching trajectories from both single-cell transcriptomic and epigenomic data. Within GenePattern, STREAM is implemented as a collection of modules that cover the entire STREAM processing pipeline, so that individual steps can be performed interactively for data exploration.

STREAM.DimensionReduction is used to reduce the dimensionality of the dataset to be used in the downstream analysis.

 

Algorithm

Each cell can be thought of as a vector in a multi-dimensional vector space in which each component is the expression level of a gene. Typically, even after feature selection, each cell still has hundreds of components, making it difficult to reliably assess similarity or distances between cells, a problem often referred to as the curse of dimensionality. To mitigate this problem, starting from the genes selected in the previous step, we project cells to a lower-dimensional space using a non-linear dimensionality reduction method called Modified Locally Linear Embedding (MLLE).
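The distance problem described above can be illustrated with a small, self-contained sketch. It is not part of STREAM: it draws synthetic points uniformly at random (an assumption made purely for illustration) and measures how much farther the farthest point is from a query point than the nearest one. The function name relative_contrast and all parameter values are hypothetical.

```python
import math
import random


def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (d_max - d_min) / d_min for distances from one query point.

    Points are drawn uniformly from the unit hypercube. This is synthetic
    data used only to illustrate distance concentration, not expression data.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


# Compare a low-dimensional space with a space of hundreds of dimensions.
contrast_low = relative_contrast(n_points=200, n_dims=2)
contrast_high = relative_contrast(n_points=200, n_dims=500)
print(f"2 dimensions:   {contrast_low:.3f}")
print(f"500 dimensions: {contrast_high:.3f}")
```

In this toy setting the contrast is much smaller in 500 dimensions than in 2, meaning that the nearest and farthest points become hard to tell apart. This is the effect that projecting cells to a few components is meant to mitigate.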

Several alternative dimension reduction methods are also supported: spectral embedding, UMAP, and PCA. By default, this module uses MLLE.

For large datasets, spectral embedding runs faster than MLLE while preserving a similarly compact structure. Lowering the percent neighbor cells parameter (0.1 by default) also speeds up this step on large datasets.
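A minimal sketch of how a fractional neighbor setting could translate into a neighbor count. It assumes that the value 0.1 denotes a fraction of the total number of cells (that is, 10%) and that the count is obtained by rounding. The function name neighbors_from_fraction is hypothetical, and the module's actual conversion may differ.

```python
def neighbors_from_fraction(n_cells: int, fraction: float = 0.1) -> int:
    """Convert a fraction of cells into a neighbor count (illustrative only).

    Assumption: the neighbor count is the rounded product of the number of
    cells and the fraction, clamped to the range [1, n_cells - 1].
    """
    if n_cells < 2:
        raise ValueError("at least two cells are needed to define neighbors")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    k = round(n_cells * fraction)
    return max(1, min(k, n_cells - 1))


# With 1000 cells, halving the fraction halves the neighborhood size.
k_default = neighbors_from_fraction(1000, 0.1)
k_reduced = neighbors_from_fraction(1000, 0.05)
print(k_default, k_reduced)
```

Under these assumptions, a smaller fraction gives each cell a smaller neighborhood to process, which is consistent with the speed-up described above.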

By default, the number of components to keep is set to 3. For biological processes with a simple bifurcation or a linear trajectory, keeping only two components is recommended.

References

Chen, H., Albergante, L., Hsu, J. Y., Lareau, C. A., Lo Bosco, G., Guan, J., Zhou, S., Gorban, A. N., Bauer, D. E., Aryee, M. J., Langenau, D. M., Zinovyev, A., Buenrostro, J. D., Yuan, G. C. & Pinello, L. Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM. Nature Communications 10, 1903 (2019).

Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20-31 (2016).

Pinello Lab STREAM GitHub Repository

Input Files

  1. data file *
    A STREAM pkl file containing an annotated AnnData matrix of gene expression data.

Output Files

  1. <output filename>_stream_result.pkl
    Output file in STREAM AnnData extended pickle (.pkl) file format suitable for passing to the next step of the STREAM analysis.
  2. <output filename>_stddev_dotplot.png
    Plot of the standard deviation of the dimensions.
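
As a rough illustration of what the standard deviation plot summarizes, the sketch below computes the spread of each component of a small embedding. The toy coordinates and the function name component_stddevs are hypothetical, and the use of the population standard deviation is an assumption; the module may compute the plotted values differently.

```python
import statistics
from typing import List, Sequence


def component_stddevs(embedding: Sequence[Sequence[float]]) -> List[float]:
    """Return the population standard deviation of each column (component).

    Each row is one cell and each column is one reduced dimension.
    """
    if not embedding:
        raise ValueError("embedding must contain at least one cell")
    columns = zip(*embedding)  # transpose rows (cells) into columns (components)
    return [statistics.pstdev(column) for column in columns]


# Toy embedding: 3 cells, 3 components (values invented for illustration).
toy_embedding = [
    [0.0, 1.0, 0.5],
    [2.0, 1.0, 0.5],
    [4.0, 3.0, 0.5],
]
stddevs = component_stddevs(toy_embedding)
print([round(s, 3) for s in stddevs])
```

A component whose standard deviation is close to zero carries little variation across cells, which may help when deciding how many components to keep.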

Example Data

Example data can be downloaded from Dropbox: Stream Example Data
Ref: Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20-31 (2016).

An input file suitable for this step is available at preprocessed_stream_result.pkl

Requirements

GenePattern 3.9.11 or later (dockerized).

Parameters

Inputs and Outputs

Name - Description
data file* - A STREAM pkl file containing an annotated AnnData matrix of gene expression data.
output filename* - The output filename prefix.

Dimension Reduction

Name - Description
percent neighbor cells - The percentage of neighbor cells (only valid when 'mlle', 'se', or 'umap' is specified).
num components to keep - The number of components to keep in the resulting dataset.
feature - Feature used for dimension reduction. Choose from ['var_genes','top_pcs','all']. 'var_genes': most variable genes. 'top_pcs': top principal components. 'all': all genes.
method - Method used for dimension reduction. Choose from ['mlle','se','umap','pca'].
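
The constraints in the parameter descriptions above can be expressed as a small validation sketch. It is not the module's code: the function name check_dimension_reduction_params and the returned dictionary are hypothetical. The choices and defaults (MLLE, 3 components, 0.1 neighbor fraction) are taken from this document, with 'se' included as a method because it is named under percent neighbor cells. Treating the neighbor setting as ignored for 'pca' is an assumption.

```python
from typing import Dict, Optional, Union

# Choices as named in the parameter descriptions of this document.
METHODS = ("mlle", "se", "umap", "pca")
FEATURES = ("var_genes", "top_pcs", "all")
NEIGHBOR_METHODS = ("mlle", "se", "umap")  # methods that use percent neighbor cells

ParamValue = Union[str, int, float, None]


def check_dimension_reduction_params(
    feature: str,
    method: str = "mlle",
    num_components: int = 3,
    percent_neighbor_cells: Optional[float] = 0.1,
) -> Dict[str, ParamValue]:
    """Validate a parameter combination and return it as a dictionary."""
    if feature not in FEATURES:
        raise ValueError(f"feature must be one of {FEATURES}")
    if method not in METHODS:
        raise ValueError(f"method must be one of {METHODS}")
    if num_components < 1:
        raise ValueError("num_components must be at least 1")
    if method in NEIGHBOR_METHODS:
        if percent_neighbor_cells is None or not 0 < percent_neighbor_cells <= 1:
            raise ValueError("percent_neighbor_cells must be in (0, 1]")
    else:
        # Assumption: the neighbor setting does not apply to this method.
        percent_neighbor_cells = None
    return {
        "feature": feature,
        "method": method,
        "num_components": num_components,
        "percent_neighbor_cells": percent_neighbor_cells,
    }


default_params = check_dimension_reduction_params("var_genes")
pca_params = check_dimension_reduction_params("top_pcs", method="pca", num_components=2)
print(default_params)
print(pca_params)
```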

Plotting

Parameters controlling the output figures.
Name - Description
num components to plot - Number of components to be plotted.
component x - Component used for the x axis.
component y - Component used for the y axis.
figure height - Figure height as used in matplotlib plots. Default=8.
figure width - Figure width as used in matplotlib plots. Default=8.

* - required