Module Documentation

STREAM.Preprocess


LSID
urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00396
Author(s)
Huidong Chen, Massachusetts General Hospital; wrapped as a module by Ted Liefeld, Mesirov Lab, UCSD School of Medicine.
Contact(s)

Algorithm and scientific questions: Huidong Chen <Huidong.Chen at mgh dot harvard dot edu>

Module wrapping issues: Ted Liefeld <jliefeld at cloud dot ucsd dot edu>


Introduction

STREAM (Single-cell Trajectories Reconstruction, Exploration And Mapping) is an interactive pipeline capable of disentangling and visualizing complex branching trajectories from both single-cell transcriptomic and epigenomic data. Within GenePattern, STREAM is implemented as a collection of modules that together cover the entire STREAM processing pipeline, so that individual steps can be run interactively for data exploration.

STREAM.Preprocess is used to normalize and filter single-cell transcriptomic data and format it for analysis using the STREAM pipeline.

To prepare data for the follow-on STREAM modules, we typically first normalize the raw gene expression values by library size. The normalized values are then log-transformed, and the mitochondrial genes are removed.
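The three preparation steps above can be sketched in plain Python. This is an illustrative sketch, not the module's implementation: the function names, the `target_sum` scaling constant, the log2(x + 1) transform, and the case-insensitive "mt-" gene-name prefix are all assumptions made for the example.

```python
import math

def normalize_by_library_size(counts, target_sum=1_000_000):
    """Scale each cell (row) so its counts sum to target_sum (assumed constant)."""
    normalized = []
    for row in counts:
        library_size = sum(row)
        # Cells with no reads are left as all zeros instead of dividing by zero.
        scale = target_sum / library_size if library_size > 0 else 0.0
        normalized.append([value * scale for value in row])
    return normalized

def log_transform(matrix):
    """Apply log2(x + 1) to every value (assumed form of the log transform)."""
    return [[math.log2(value + 1.0) for value in row] for row in matrix]

def remove_mitochondrial_genes(matrix, gene_names, prefix="mt-"):
    """Drop gene columns whose name starts with prefix (assumed naming convention)."""
    keep = [i for i, name in enumerate(gene_names)
            if not name.lower().startswith(prefix)]
    filtered = [[row[i] for i in keep] for row in matrix]
    return filtered, [gene_names[i] for i in keep]

# Toy data: 2 cells (rows) x 3 genes (columns); gene names are illustrative only.
gene_names = ["geneA", "mt-geneB", "geneC"]
counts = [[3, 1, 0],
          [0, 2, 2]]

normalized = normalize_by_library_size(counts, target_sum=4)
logged = log_transform(normalized)
matrix, kept_genes = remove_mitochondrial_genes(logged, gene_names)
```

In the module itself, each of these steps is switched on or off with the normalize, log transform, and remove mitochondrial genes parameters.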

Algorithm

With this module, we can filter out cells based on several cell-centric metrics, including the minimum number of genes expressed, the minimum percentage of genes expressed, and the minimum number of read counts for one cell.

We can also filter out genes based on gene-centric metrics, including the minimum number of cells expressing one gene, the minimum percentage of cells expressing one gene, and the minimum number of read counts for one gene.
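A minimal sketch of how such thresholds might be applied is shown below. The helper name `passing_indices`, its argument names, and the choice to express the percentage threshold as a fraction between 0 and 1 are assumptions made for illustration; the module's actual implementation may differ. A value counts as "expressed" when it is greater than the expression cutoff.

```python
def passing_indices(vectors, expr_cutoff, min_num=None, min_fraction=None,
                    min_count=None):
    """Return indices of the vectors that pass every threshold that is set.

    Each vector is either one cell (its values across genes) or one gene
    (its values across cells), so the same helper serves both filters.
    """
    kept = []
    for index, values in enumerate(vectors):
        n_expressed = sum(1 for value in values if value > expr_cutoff)
        fraction = n_expressed / len(values) if values else 0.0
        if min_num is not None and n_expressed < min_num:
            continue
        if min_fraction is not None and fraction < min_fraction:
            continue
        if min_count is not None and sum(values) < min_count:
            continue
        kept.append(index)
    return kept

# Toy data: 3 cells (rows) x 4 genes (columns).
matrix = [
    [5, 0, 3, 2],
    [0, 0, 1, 0],
    [4, 0, 0, 6],
]

# Cell filtering: keep cells expressing at least 2 genes with at least 5 reads.
kept_cells = passing_indices(matrix, expr_cutoff=1.0, min_num=2, min_count=5)
filtered = [matrix[i] for i in kept_cells]

# Gene filtering: transpose so each vector is one gene across the kept cells.
gene_columns = list(zip(*filtered))
kept_genes = passing_indices(gene_columns, expr_cutoff=1.0, min_num=2)
```

Here the second cell is dropped because none of its values exceed the cutoff, and only the first and last genes are expressed in both remaining cells.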

References

Chen, H., Albergante, L., Hsu, J. Y., Lareau, C. A., Bosco, G. L., Guan, J., Zhou, S., Gorban, A. N., Bauer, D. E., Aryee, M. J., Langenau, D. M., Zinovyev, A., Buenrostro, J. D., Yuan, G. C. & Pinello, L. Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM. Nature Communications 10, 1903 (2019).

Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20-31 (2016).

Pinello Lab STREAM GitHub Repository

Input Files

  1. data file *
    A csv, tsv or STREAM pkl file containing single cell transcriptomic data.
  2. cell label file
    A tsv file containing cell labels.
  3. cell label color file
    A tsv file containing colors (as hex values, e.g. #FF0000) to use for cell labels when creating stream plots.
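As an illustration of the cell label color file, the sketch below parses tab-separated text and checks that every color is a hex value of the form #RRGGBB. The two-column layout (label, then color, with no header row), the label names, and the function name `read_label_colors` are assumptions made for this example; check an example file for the exact layout the module expects.

```python
import csv
import io
import re

# Matches colors such as #FF0000 (six hex digits after '#').
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")

def read_label_colors(handle):
    """Parse 'label<TAB>color' rows (assumed layout) into a dict."""
    colors = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row!r}")
        label, color = row
        if not HEX_COLOR.fullmatch(color):
            raise ValueError(f"not a #RRGGBB hex color: {color!r}")
        colors[label] = color
    return colors

# Hypothetical file content with two made-up labels.
example = io.StringIO("typeA\t#FF0000\ntypeB\t#00FF00\n")
label_colors = read_label_colors(example)
```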

Output Files

  1. <output filename>_stream_result.pkl
    Output file in STREAM AnnData extended pickle (.pkl) file format suitable for passing to the next step of the STREAM analysis.

Example Data

Example data can be downloaded from Dropbox: STREAM Example Data
Ref: Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20-31 (2016).

Requirements

GenePattern 3.9.11 or later (dockerized).

Parameters

Inputs and Outputs

Name Description
data file* A STREAM pkl file containing an annotated AnnData matrix of gene expression data.
cell label file A tsv file containing cell labels.
cell label color file A tsv file containing cell label colors.
output filename* The output filename prefix.

Cell Filtering

Name Description
min percent genes The minimum percentage of genes expressed to keep a cell.
min count genes The minimum number of read counts for one cell to keep a cell.

Gene Filtering

Name Description
min num cells The minimum number of cells expressing a gene.
min percent cells The minimum percentage of cells expressing a gene to keep a gene.
min count cells The minimum number of read counts for one gene to keep a gene.

Other Preprocessing

Name Description
expression cutoff The expression cutoff used to determine if a gene is expressed. If expression is greater than expr_cutoff, the gene is considered 'expressed'.
normalize Normalize the data, True/False
log transform Log transform the dataset, True/False
remove mitochondrial genes Remove mitochondrial genes, True/False

* - required