Module Documentation

STREAM.FeatureSelection


LSID
urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00396
Author(s)
Huidong Chen, Massachusetts General Hospital; wrapped as a module by Ted Liefeld, Mesirov Lab, UCSD School of Medicine.
Contact(s)

Algorithm and scientific questions: Huidong Chen <Huidong.Chen at mgh dot harvard dot edu>

Module wrapping issues: Ted Liefeld <jliefeld at cloud dot ucsd dot edu>


Introduction

STREAM (Single-cell Trajectories Reconstruction, Exploration And Mapping) is an interactive pipeline capable of disentangling and visualizing complex branching trajectories from both single-cell transcriptomic and epigenomic data. Within GenePattern, STREAM is implemented as a collection of modules that together cover the entire STREAM processing pipeline, so that individual steps can be run interactively for data exploration.

STREAM.FeatureSelection identifies the features to be used in the downstream analysis. Two types of features can be used:

  • Variable genes
  • Top principal components

For transcriptomic data (single-cell RNA-seq or qPCR), the input of STREAM is a gene expression matrix in which rows represent genes and columns represent cells. Each entry contains an adjusted gene expression value (after library size normalization and log2 transformation, typically performed using the STREAM.Preprocessing module).
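To make the expected input values concrete, here is a minimal sketch of library size normalization followed by a log2 transformation on a genes-by-cells matrix. The function name, the choice of the median library size as the scaling target, and the pseudocount of 1 are assumptions made for illustration; the STREAM.Preprocessing module may use different settings.

```python
import math
import statistics


def normalize_and_log2(counts, target_size=None):
    """Scale each cell to a common library size, then take log2(x + 1).

    counts: list of rows, one row per gene, one column per cell.
    Assumes every cell has at least one count (non-zero library size).
    """
    n_cells = len(counts[0])
    # Library size = total counts observed in each cell (column sum).
    lib_sizes = [sum(row[c] for row in counts) for c in range(n_cells)]
    if target_size is None:
        # Assumption: rescale every cell to the median library size.
        target_size = statistics.median(lib_sizes)
    return [
        [math.log2(row[c] * target_size / lib_sizes[c] + 1) for c in range(n_cells)]
        for row in counts
    ]


# Two genes (rows) measured in two cells (columns).
adjusted = normalize_and_log2([[1, 3], [3, 1]])
```

In practice this step is already done by the time a file reaches STREAM.FeatureSelection; the sketch only shows what "adjusted gene expression value" means.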

 

Algorithm

By default, the most variable genes are selected as features. For each gene, the mean and standard deviation of its expression are calculated across all cells. A non-parametric local regression method (LOESS) is then used to fit the relationship between the mean and standard deviation values. Genes that lie significantly above the fitted curve are selected as variable genes.
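The selection described above can be sketched in plain Python as follows. This is an illustrative approximation, not the module's implementation: `lowess_fit` is a simplified local linear regression with tricube weights and no robustness iterations, and the names `select_variable_genes`, `frac`, `pct` and `n_genes` are hypothetical stand-ins for the loess fraction, percentile and num genes parameters. How the module itself combines percentile and num genes is assumed here.

```python
import math
import statistics


def lowess_fit(x, y, frac=0.1):
    """Fit y ~ x by local linear regression with tricube weights.

    Simplified sketch: no robustness iterations, quadratic running time.
    """
    n = len(x)
    k = max(2, min(n, math.ceil(frac * n)))  # points used in each local fit
    fitted = []
    for x0 in x:
        window = sorted(range(n), key=lambda j: abs(x[j] - x0))[:k]
        h = abs(x[window[-1]] - x0) or 1.0  # window half-width (avoid zero)
        w = [(1 - (abs(x[j] - x0) / h) ** 3) ** 3 for j in window]
        sw = sum(w)
        mx = sum(wi * x[j] for wi, j in zip(w, window)) / sw
        my = sum(wi * y[j] for wi, j in zip(w, window)) / sw
        sxx = sum(wi * (x[j] - mx) ** 2 for wi, j in zip(w, window))
        sxy = sum(wi * (x[j] - mx) * (y[j] - my) for wi, j in zip(w, window))
        slope = sxy / sxx if sxx > 1e-12 else 0.0  # fall back to weighted mean
        fitted.append(my + slope * (x0 - mx))
    return fitted


def percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def select_variable_genes(matrix, gene_names, frac=0.1, pct=95, n_genes=None):
    """matrix: one row per gene, one column per cell (adjusted expression)."""
    means = [statistics.fmean(row) for row in matrix]
    stds = [statistics.pstdev(row) for row in matrix]
    fitted = lowess_fit(means, stds, frac)
    # Distance above the fitted mean/standard-deviation curve.
    residuals = [s - f for s, f in zip(stds, fitted)]
    ranked = sorted(range(len(matrix)), key=lambda i: residuals[i], reverse=True)
    if n_genes is not None:
        chosen = ranked[:n_genes]  # fixed number of top-ranked genes
    else:
        cutoff = percentile(residuals, pct)
        chosen = [i for i in ranked if residuals[i] > cutoff]
    return [gene_names[i] for i in chosen]
```

Calling `select_variable_genes(matrix, names, frac=0.1, pct=95)` keeps the genes whose distance above the curve exceeds the 95th percentile, while passing `n_genes=500` would instead keep the 500 genes furthest above the curve.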

Alternatively, users can also perform PCA on the scaled matrix and select the top principal components based on the variance ratio elbow plot.
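As a rough illustration of reading a variance ratio elbow plot, the sketch below converts per-component variances into variance ratios and locates the point where the curve bends most sharply. The helper names and the "largest gap below the chord" heuristic are assumptions introduced here; in the module, the user inspects the plot and sets the number of principal components manually.

```python
def variance_ratios(component_variances):
    """Fraction of the total variance explained by each principal component."""
    total = sum(component_variances)
    return [v / total for v in component_variances]


def elbow_index(ratios):
    """Index of the component where the curve lies furthest below the
    straight line joining its first and last points (a common heuristic)."""
    n = len(ratios)
    if n < 3:
        return 0
    slope = (ratios[-1] - ratios[0]) / (n - 1)
    # Vertical gap between the chord and the curve at each component.
    gaps = [ratios[0] + slope * i - r for i, r in enumerate(ratios)]
    return max(range(n), key=lambda i: gaps[i])


# Hypothetical variances of the first seven principal components.
ratios = variance_ratios([50, 30, 10, 4, 3, 2, 1])
elbow = elbow_index(ratios)
```

Components up to and including the elbow carry most of the variance; whether to keep the elbow component itself is a judgment call that the plot is meant to support.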

References

H Chen, L Albergante, JY Hsu, CA Lareau, GL Bosco, J Guan, S Zhou, AN Gorban, DE Bauer, MJ Aryee, DM Langenau, A Zinovyev, JD Buenrostro, GC Yuan, L Pinello. Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM. Nature Communications 10, 1903 (2019).

Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20-31 (2016).

Pinello Lab STREAM GitHub Repository


Input Files

  1. data file *
    A STREAM pkl file containing an annotated AnnData matrix of gene expression data.

Output Files

  1. <output filename>_stream_result.pkl
    Output file in STREAM AnnData extended pickle (.pkl) file format, suitable for passing to the next step, STREAM.DimensionReduction.
  2. <output filename>_variable_genes.png
    Plot of genes against the fitted curve (produced if find variable genes is selected).
  3. <output filename>_variable_genes.png
    Plot of principal components against variance ratio (produced if PCA is selected).

Example Data

Example data for the STREAM workflow can be downloaded from Dropbox: Stream Example Data
Ref: Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20-31 (2016).

An input file suitable for this step is available as filtered_stream_result.pkl.

Requirements

GenePattern 3.9.11 or later (dockerized).

Parameters

Inputs and Outputs

Name Description
data file* A STREAM pkl file containing an annotated AnnData matrix of gene expression data.
output filename* The output filename prefix.

Select Variable Genes

Parameters used if variable genes are to be selected as the feature.
Name Description
find variable genes Whether to find variable genes and add them to the output pkl object (True/False).
loess fraction Between 0 and 1. The fraction of the data used when estimating each y-value in the LOWESS function.
percentile Between 0 and 100. The percentile used to select genes. Genes are ordered by their distance from the fitted curve.
num genes The number of genes to select. Genes are ordered by their distance from the fitted curve.

Principal Component Analysis

Parameters used if PCA components are to be selected as the feature.
Name Description
find principal components Whether to perform a principal component analysis (PCA) (True/False).
feature The features, chosen from the genes in the dataset, used for principal component analysis. If None, all the genes will be used. If 'var_genes', the most variable genes obtained from select variable genes will be used.
num principal components The number of principal components.
max principal components The maximum number of principal components used for the variance ratio plot.
first principal component If True, the first principal component will be included (True/False).
use precomputed If True, the PCA results from a previous computation will be used (True/False).

Plotting

Parameters controlling the output figures.
Name Description
figure height Figure height as used in matplotlib plots. Default=8.
figure width Figure width as used in matplotlib plots. Default=8.

* - required