ScanpyUtilities

LSID
urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00382
Author(s)
Tom Sherman, Fertig Lab, Johns Hopkins University, wrapped as a module by Ted Liefeld, Mesirov Lab, UCSD School of Medicine.
Contact(s)

Algorithm and scientific questions: Tom Sherman <tsherma4  at  jhu dot edu>

Module wrapping issues:  Ted Liefeld <liefeld at cloud dot ucsd dot edu>


Introduction

Scanpy is a scalable toolkit for analyzing single-cell gene expression data. It includes preprocessing, visualization, clustering, pseudotime and trajectory inference and differential expression testing. The Python-based implementation efficiently deals with datasets of more than one million cells.

Some common single-cell preprocessing steps require the entire dataset to be loaded into memory, which can be a problem on less powerful machines or with larger datasets. Exposing Scanpy as a GenePattern module lets these initial, size-reducing steps be offloaded to larger compute resources before extensive, interactive visualization such as might be done in a GenePattern notebook.

This module exposes many functions from scanpy version 1.3.3 to be used as a GenePattern module.

 

Algorithm

This module was created to support a workflow that roughly follows a standard preprocessing pipeline outlined in https://github.com/theislab/single-cell-tutorial/blob/master/latest_notebook/Case-study_Mouse-intestinal-epithelium_1903.ipynb with one notable exception. Rather than "manually" labeling cell types, we use the R package garnett (also included in this module) to automatically label the cell types based on a list of provided marker genes. Most of the preprocessing steps are done with the Python package scanpy [8]. This will be the default package used for visualization and recording all the information learned from our analysis.

References

Wolf, F.A., et al. (2018). SCANPY: large-scale single-cell gene expression data analysis. Genome Biology, 19:15.

Lun, A.T.L., et al. (2016). A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res., 5, 2122.

https://github.com/theislab/single-cell-tutorial/blob/master/latest_notebook/Case-study_Mouse-intestinal-epithelium_1903.ipynb

Census of Immune Cells: Single-Cell Workflow with CoGAPS

https://github.com/theislab/scanpy

Input Files

  1. Data File *
    Data file containing single-cell counts data in h5, h5ad, loom or mtx format.
  2. Cell type marker file
    A text file describing the marker genes for each cell type. This should be in the format accepted by the R package garnett as defined in https://cole-trapnell-lab.github.io/garnett/docs/.

Output Files

  1. <output basename>.h5ad
    Output file in AnnData h5ad format.

Example Data

Sample data from the Human Cell Atlas Census of Immune Cells is available at https://preview.data.humancellatlas.org/. 

Requirements

GenePattern 3.9.11 or later (dockerized)

Parameters

Input and Output

Name Description
data file* A file containing single-cell data. h5ad, loom and mtx file formats are accepted. An h5 file may also be provided, in which case a genome must also be provided; it is used when converting the file to h5ad format.
output basename* Base filename for the output file.
genome When converting a 10x formatted HDF5 file, the module will internally call read_10x_h5 from the scanpy package. This function expects a genome argument which specifies the name of the data set in the HDF5 file, e.g. 'GRCh38'.
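
The way the input format and the genome parameter interact can be sketched as follows. This is a minimal illustration, not the module's actual code: the extension-to-reader mapping and the helper names pick_reader and load are hypothetical, and only the scanpy functions read_10x_h5, read_h5ad, read_loom and read_mtx are taken from the scanpy package itself.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the name of a scanpy reader.
READERS = {
    ".h5": "read_10x_h5",
    ".h5ad": "read_h5ad",
    ".loom": "read_loom",
    ".mtx": "read_mtx",
}


def pick_reader(data_file, genome=None):
    """Return (reader_name, keyword_arguments) for the given input file."""
    suffix = Path(data_file).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"unsupported file format: {suffix!r}")
    kwargs = {}
    if suffix == ".h5":
        # A 10x HDF5 file needs the name of the genome data set inside it.
        if genome is None:
            raise ValueError("a genome is required to read a 10x h5 file")
        kwargs["genome"] = genome
    return READERS[suffix], kwargs


def load(data_file, genome=None):
    """Read the file into an AnnData object (requires scanpy to be installed)."""
    import scanpy as sc  # imported lazily so pick_reader works without scanpy

    name, kwargs = pick_reader(data_file, genome)
    return getattr(sc, name)(data_file, **kwargs)
```

Under these assumptions, pick_reader("pbmc.h5", genome="GRCh38") selects read_10x_h5 with the genome keyword, while an h5ad, loom or mtx file needs no genome at all.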

Annotation

Add count information to the data file. This generates a new h5ad file that adds the total number of counts for each cell and each gene, as well as the number of genes expressed in each cell and the number of cells in which each gene is expressed. For cells the new annotations are called n_counts, log_counts and n_genes; for genes they are n_counts and n_cells.
Name Description
annotate* Whether to perform annotation True/False.
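
The annotations described above can be illustrated on a tiny count matrix. This is a sketch in plain Python rather than the module's implementation: lists of lists stand in for the AnnData matrix (rows are cells, columns are genes), and it assumes that log_counts is the natural logarithm of n_counts, which the description does not state explicitly.

```python
import math


def annotate(counts):
    """Compute per-cell and per-gene annotations for a cells x genes matrix."""
    cells = []
    for row in counts:
        total = sum(row)
        cells.append({
            "n_counts": total,
            # Assumption: log_counts is the natural log of the total counts.
            "log_counts": math.log(total) if total > 0 else float("-inf"),
            "n_genes": sum(1 for c in row if c > 0),
        })
    genes = []
    for column in zip(*counts):  # iterate over genes
        genes.append({
            "n_counts": sum(column),
            "n_cells": sum(1 for c in column if c > 0),
        })
    return cells, genes


# Three cells (rows) by three genes (columns).
example = [
    [0, 3, 1],
    [2, 0, 0],
    [5, 5, 0],
]
cell_annotations, gene_annotations = annotate(example)
```

Here the first cell has 4 counts spread over 2 expressed genes, and the first gene has 7 counts across 2 cells.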

Cell Filtering

Filter out cells based on the following thresholds.
Name Description
cells min counts Filter out cells with fewer total counts than min counts
cells max counts Filter out cells with more total counts than max counts
cells min genes Filter out cells with fewer than min genes expressed
cells max genes Filter out cells with more than max genes expressed
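
A minimal sketch of how the four cell thresholds combine, again using plain Python lists as a stand-in for the real data. The function names are hypothetical, and treating an unset threshold as "no limit" is an assumption about the module's behavior.

```python
def keep_cell(row, min_counts=None, max_counts=None, min_genes=None, max_genes=None):
    """Return True if a cell (one row of counts) passes every threshold that is set."""
    total = sum(row)
    expressed = sum(1 for c in row if c > 0)
    if min_counts is not None and total < min_counts:
        return False
    if max_counts is not None and total > max_counts:
        return False
    if min_genes is not None and expressed < min_genes:
        return False
    if max_genes is not None and expressed > max_genes:
        return False
    return True


def filter_cells(counts, **thresholds):
    """Keep only the rows (cells) that pass keep_cell."""
    return [row for row in counts if keep_cell(row, **thresholds)]


example = [
    [0, 3, 1],   # 4 counts, 2 genes expressed
    [2, 0, 0],   # 2 counts, 1 gene expressed
    [5, 5, 9],   # 19 counts, 3 genes expressed
]
filtered = filter_cells(example, min_counts=3, max_counts=10)
```

With min counts 3 and max counts 10, only the first cell is retained in this example.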

Gene Filtering

Filter out genes based on the following thresholds.
Name Description
genes min counts Filter out genes with fewer total counts than min counts
genes max counts Filter out genes with more total counts than max counts
genes min cells Filter out genes expressed in fewer than min cells
genes max cells Filter out genes expressed in more than max cells
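
Gene filtering works on the columns of the matrix rather than the rows. The sketch below is illustrative only: filter_genes is a hypothetical helper, and it assumes that a threshold left unset imposes no limit.

```python
def filter_genes(counts, gene_names, min_counts=None, max_counts=None,
                 min_cells=None, max_cells=None):
    """Drop genes (columns) that fall outside any threshold that is set."""
    keep = []
    for j, column in enumerate(zip(*counts)):
        total = sum(column)
        n_cells = sum(1 for c in column if c > 0)
        passes = (
            (min_counts is None or total >= min_counts)
            and (max_counts is None or total <= max_counts)
            and (min_cells is None or n_cells >= min_cells)
            and (max_cells is None or n_cells <= max_cells)
        )
        if passes:
            keep.append(j)
    filtered = [[row[j] for j in keep] for row in counts]
    return filtered, [gene_names[j] for j in keep]


example = [
    [0, 3, 1],
    [2, 0, 0],
    [5, 5, 0],
]
matrix, names = filter_genes(example, ["A", "B", "C"], min_cells=2)
```

Gene C is expressed in a single cell, so a min cells value of 2 removes it and leaves genes A and B.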

Cell Type Identification

Convert (using the R package garnett) the gene names we've provided in the marker file to the gene ids we've used as the index in our data.
Name Description
cell type marker file A text file describing the marker genes for each cell type. This should be in the format accepted by the R package garnett as defined in https://cole-trapnell-lab.github.io/garnett/docs/.
gene annotation database This can be either "org.Hs.eg.db" (https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html) or "org.Mm.eg.db" (https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html). This annotation database allows this module (using the R package garnett) to convert the gene names we've provided in the marker file to the gene ids we've used as the index in our data.

Normalization

Normalization consists of the following steps. First it generates clusters for normalization by running PCA, computing neighbors and running Louvain clustering. Next it computes size factors of the clusters with scran. Finally it normalizes the data using the size factors and applies a log(D+1) transform.
Name Description
normalize* Whether to perform normalization True/False. This step is performed after filtering if filtering is on.
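
The final step, scaling by size factors and applying log(D+1), can be sketched as below. This does not reproduce scran's cluster-based pooling: as a loudly labeled stand-in, the hypothetical helper library_size_factors uses each cell's total counts divided by the mean total, which is only a rough proxy for the size factors scran would compute.

```python
import math


def library_size_factors(counts):
    """Stand-in size factors (NOT scran): total counts per cell / mean total."""
    totals = [sum(row) for row in counts]
    mean_total = sum(totals) / len(totals)
    return [t / mean_total for t in totals]


def normalize(counts, size_factors):
    """Divide each cell by its size factor, then apply log(D + 1)."""
    return [
        [math.log1p(c / sf) for c in row]
        for row, sf in zip(counts, size_factors)
    ]


# The second cell was sequenced twice as deeply as the first.
example = [
    [1, 3],
    [2, 6],
]
factors = library_size_factors(example)
normalized = normalize(example, factors)
```

After normalization the two cells in this example have identical values, because they differ only in sequencing depth.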

High Variance Genes

Filter and subset the data to retain only the N most variable genes.
Name Description
n high variance genes Subset to the top N highly variable genes.
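
Subsetting to the top N variable genes can be illustrated as follows. This sketch ranks genes by plain population variance, which is an assumption made for simplicity; scanpy's own highly variable gene selection uses a dispersion-based measure, so the genes chosen by the module may differ. The helper name top_variable_genes is hypothetical.

```python
from statistics import pvariance


def top_variable_genes(matrix, gene_names, n_top):
    """Keep the n_top genes (columns) with the largest variance across cells."""
    variances = [pvariance(column) for column in zip(*matrix)]
    ranked = sorted(range(len(gene_names)), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:n_top])  # preserve the original gene order
    subset = [[row[j] for j in keep] for row in matrix]
    return subset, [gene_names[j] for j in keep]


example = [
    [1, 0, 5],
    [1, 10, 5],
    [1, 20, 6],
]
subset, names = top_variable_genes(example, ["A", "B", "C"], n_top=2)
```

Gene A is constant across cells, so it is the one dropped when N is 2.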

Dimension Reduction

Whether to compute UMAP True/False.
Name Description
compute tsne Whether to compute TSNE True/False.

* - required