ScanpyUtilities GenePattern Module

LSID

urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00382

Author(s)

Tom Sherman, Fertig Lab, Johns Hopkins University, wrapped as a module by Ted Liefeld, Mesirov Lab, UCSD School of Medicine.

Contact(s)

Algorithm and scientific questions: Tom Sherman <tsherma4 at jhu dot edu>

Module wrapping issues: Ted Liefeld <liefeld at cloud dot ucsd dot edu>

Introduction

Scanpy is a scalable toolkit for analyzing single-cell gene expression data. It includes preprocessing, visualization, clustering, pseudotime and trajectory inference, and differential expression testing. The Python-based implementation efficiently deals with datasets of more than one million cells.

Some common single-cell preprocessing steps require the entire dataset to be loaded into memory, which can be an issue on less powerful machines or with larger datasets. It is therefore useful to expose scanpy as a GenePattern module: the initial steps that reduce the dataset size can be offloaded to larger compute resources before extensive, interactive visualization such as might be done in a GenePattern notebook.

This module exposes many functions from scanpy version 1.3.3 to be used as a GenePattern module.

Algorithm

This module was created to support a workflow that roughly follows a standard preprocessing pipeline outlined in https://github.com/theislab/single-cell-tutorial/blob/master/latest_notebook/Case-study_Mouse-intestinal-epithelium_1903.ipynb with one notable exception. Rather than "manually" labeling cell types, we use the R package garnett (also included in this module) to automatically label the cell types based on a list of provided marker genes. Most of the preprocessing steps are done with the Python package scanpy [8], which is the default package used for visualization and for recording all the information learned from the analysis.

References

Wolf, F.A., et al. (2018). SCANPY: large-scale single-cell gene expression data analysis. Genome Biology, 19, 15.

Lun, A.T.L., et al. (2016). A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res., 5, 2122.

https://github.com/theislab/single-cell-tutorial/blob/master/latest_notebook/Case-study_Mouse-intestinal-epithelium_1903.ipynb

Census of Immune Cells: Single-Cell Workflow with CoGAPS

https://github.com/theislab/scanpy

Input Files

Data File *
Data file containing single-cell count data in h5, h5ad, loom, or mtx format.

Cell type marker file
A text file describing the marker genes for each cell type. This should be in the format accepted by the R package garnett as defined in https://cole-trapnell-lab.github.io/garnett/docs/.

Output Files

<output basename>.h5ad
Output file in AnnData h5ad format.

Example Data

Sample data from the Human Cell Atlas Census of Immune Cells is available at https://preview.data.humancellatlas.org/.

Requirements

GenePattern 3.9.11 or later (dockerized)

Parameters

Name	Description
Input and Output
data file*	A file containing single-cell data. The h5ad, loom, and mtx file formats are accepted. An h5 file may also be provided if a genome is provided as well; the genome is used when converting the file to h5ad format.
output basename*	Base filename for the output file.
genome	When converting a 10x-formatted HDF5 file, the module internally calls read_10x_h5 from the scanpy package. This function expects a genome argument that specifies the name of the data set in the HDF5 file, e.g. 'GRCh38'.
Annotation
Add count information to the data file. This generates a new h5ad file that records the total counts in each cell and for each gene, the number of genes expressed in each cell, and the number of cells in which each gene is expressed. For cells, the new annotations are called n_counts, log_counts, and n_genes; for genes, they are called n_counts and n_cells.
Name	Description
annotate*	Whether to perform annotation True/False.
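The annotations described above can be illustrated with a minimal pure-Python sketch. It is not the module's implementation (the module works on AnnData objects through scanpy); the toy matrix, the function names, and the use of the natural logarithm for log_counts are assumptions made for illustration.

```python
import math

# Toy count matrix: rows are cells, columns are genes (hypothetical values).
toy_counts = [
    [0, 3, 1],
    [2, 0, 0],
    [5, 1, 4],
]

def annotate_cells(matrix):
    """Return n_counts, log_counts and n_genes for every cell (row)."""
    annotations = []
    for row in matrix:
        n_counts = sum(row)
        annotations.append({
            "n_counts": n_counts,
            # Assumption: log_counts is the natural log of n_counts.
            "log_counts": math.log(n_counts) if n_counts > 0 else float("-inf"),
            "n_genes": sum(1 for value in row if value > 0),
        })
    return annotations

def annotate_genes(matrix):
    """Return n_counts and n_cells for every gene (column)."""
    annotations = []
    for column in zip(*matrix):
        annotations.append({
            "n_counts": sum(column),
            "n_cells": sum(1 for value in column if value > 0),
        })
    return annotations

cell_annotations = annotate_cells(toy_counts)
gene_annotations = annotate_genes(toy_counts)
```

These per-cell and per-gene values are the quantities that the cell and gene filtering thresholds are compared against.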
Cell Filtering
Filter out cells based on the following thresholds.
Name	Description
cells min counts	Filter out cells with fewer total counts than min counts
cells max counts	Filter out cells with more total counts than max counts
cells min genes	Filter out cells with fewer than min genes expressed
cells max genes	Filter out cells with more than max genes expressed
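As a rough sketch of how the four cell thresholds combine, the following pure-Python example keeps a cell only if it passes every threshold that is set, reading each min as a lower bound and each max as an upper bound. The function name, its keyword arguments, and the toy matrix are hypothetical; the module itself applies the filters through scanpy.

```python
def filter_cells(matrix, min_counts=None, max_counts=None,
                 min_genes=None, max_genes=None):
    """Return the row indices of cells that pass all thresholds that are set.

    matrix: list of rows, one row of gene counts per cell.
    A threshold left as None is not applied.
    """
    kept = []
    for index, row in enumerate(matrix):
        n_counts = sum(row)                               # total counts in the cell
        n_genes = sum(1 for value in row if value > 0)    # genes expressed in the cell
        if min_counts is not None and n_counts < min_counts:
            continue
        if max_counts is not None and n_counts > max_counts:
            continue
        if min_genes is not None and n_genes < min_genes:
            continue
        if max_genes is not None and n_genes > max_genes:
            continue
        kept.append(index)
    return kept

# Toy count matrix: rows are cells, columns are genes (hypothetical values).
toy_counts = [
    [0, 3, 1],
    [2, 0, 0],
    [5, 1, 4],
]

# Keep cells with at least 3 total counts and at most 2 expressed genes.
kept_cells = filter_cells(toy_counts, min_counts=3, max_genes=2)
```

The same pattern applies to genes by iterating over the columns of the matrix instead of the rows.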
Gene Filtering
Filter out genes based on the following thresholds.
Name	Description
genes min counts	Filter out genes with fewer total counts than min counts
genes max counts	Filter out genes with more total counts than max counts
genes min cells	Filter out genes expressed in fewer than min cells
genes max cells	Filter out genes expressed in more than max cells
Cell Type Identification
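Label cell types automatically with the R package garnett, based on the marker genes listed in the marker file. As part of this step, garnett converts the gene names provided in the marker file to the gene ids used as the index in the data.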
Name	Description
cell type marker file	A text file describing the marker genes for each cell type. This should be in the format accepted by the R package garnett as defined in https://cole-trapnell-lab.github.io/garnett/docs/.
gene annotation database	This can be either "org.Hs.eg.db" (https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html) or "org.Mm.eg.db" (https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html). This annotation database allows this module (using the R package garnett) to convert the gene names we've provided in the marker file to the gene ids we've used as the index in our data.
Normalization
Normalization consists of the following steps. First, clusters for normalization are generated by running PCA, computing neighbors, and running Louvain clustering. Next, size factors are computed from the clusters with scran. Finally, the data are normalized using the size factors and transformed with log(D+1).
Name	Description
normalize*	Whether to perform normalization True/False. This step is performed after filtering if filtering is on.
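The final normalization step can be sketched as follows, assuming the size factors have already been computed (in the module this is done by scran; here they are simply passed in). The function name and the toy values are hypothetical, and the natural logarithm is assumed for log(D+1).

```python
import math

def normalize_counts(matrix, size_factors):
    """Divide each cell's counts by its size factor, then apply log(D + 1).

    matrix: list of rows, one row of gene counts per cell.
    size_factors: one positive number per cell (assumed precomputed).
    """
    if len(matrix) != len(size_factors):
        raise ValueError("need exactly one size factor per cell")
    if any(factor <= 0 for factor in size_factors):
        raise ValueError("size factors must be positive")
    return [
        [math.log1p(value / factor) for value in row]
        for row, factor in zip(matrix, size_factors)
    ]

# Two cells, two genes; the second cell was sequenced about twice as deeply.
normalized = normalize_counts([[1, 3], [0, 4]], [1.0, 2.0])
```

Dividing by the size factor puts cells with different sequencing depths on a comparable scale before the log transform compresses the range of the counts.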
High Variance Genes
Filter and subset the data to retain only the N most variable genes.
Name	Description
n high variance genes	Subset to the top N highly variable genes.
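To illustrate the idea of subsetting to the top N variable genes, here is a simplified sketch that ranks genes by plain population variance. This is a stand-in technique: scanpy's own highly variable gene selection uses a normalized dispersion measure, so the ranking below is not what the module computes. All names and values are hypothetical.

```python
from statistics import pvariance

def top_variable_genes(matrix, n_top):
    """Return the column indices of the n_top genes with the largest variance."""
    variances = [pvariance(column) for column in zip(*matrix)]
    ranked = sorted(range(len(variances)),
                    key=lambda gene: variances[gene],
                    reverse=True)
    # Report the selected genes in their original column order.
    return sorted(ranked[:n_top])

def subset_genes(matrix, gene_indices):
    """Keep only the selected gene columns in every cell row."""
    return [[row[gene] for gene in gene_indices] for row in matrix]

# Toy matrix: 3 cells by 3 genes; the first gene is constant across cells.
toy_matrix = [
    [1, 0, 5],
    [1, 2, 0],
    [1, 4, 7],
]

selected = top_variable_genes(toy_matrix, n_top=2)
reduced = subset_genes(toy_matrix, selected)
```

The constant gene carries no information for separating cells, which is why variance-based selection drops it first.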
Dimension Reduction
Whether to compute UMAP True/False.
Name	Description
compute tsne	Whether to compute TSNE True/False.

* - required
