Algorithm and scientific questions: Tom Sherman <tsherma4 at jhu dot edu>
Module wrapping issues: Ted Liefeld <liefeld at cloud dot ucsd dot edu>
Scanpy is a scalable toolkit for analyzing single-cell gene expression data. It includes preprocessing, visualization, clustering, pseudotime and trajectory inference and differential expression testing. The Python-based implementation efficiently deals with datasets of more than one million cells.
Some common single-cell preprocessing steps require the entire dataset to be loaded into memory, which can be an issue on less powerful machines or with larger datasets. This makes it useful to expose Scanpy as a GenePattern module, where execution can be offloaded to larger compute resources. The module can then perform the initial steps that reduce the dataset size before extensive, interactive visualization such as might be done in a GenePattern notebook.
This module exposes many functions from scanpy version 1.3.3 to be used as a GenePattern module.
This module was created to support a workflow that roughly follows the standard preprocessing pipeline outlined in https://github.com/theislab/single-cell-tutorial/blob/master/latest_notebook/Case-study_Mouse-intestinal-epithelium_1903.ipynb with one notable exception. Rather than "manually" labeling cell types, we use the R package garnett (also included in this module) to automatically label the cell types based on a list of provided marker genes. Most of the preprocessing steps are done with the Python package scanpy. Scanpy is also the default package used for visualization and for recording all the information learned from our analysis.
Wolf, F.A., et al. (2018). SCANPY: large-scale single-cell gene expression data analysis. Genome Biology. 19:15
Lun, A.T.L., et al. (2016). A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res., 5, 2122.
Sample data from the Human Cell Atlas Census of Immune Cells is available at https://preview.data.humancellatlas.org/.
GenePattern 3.9.11 or later (dockerized)
Input and Output
|data file*||A file containing single-cell data. h5ad, loom and mtx file formats are accepted. An h5 file may also be provided if a genome is also provided, which will be used when converting to h5ad format.|
|output basename*||Base filename for the output file.|
|genome||When converting a 10x formatted HDF5 file, the module will internally call read_10x_h5 from the scanpy package. This function expects a genome argument which specifies the name of the data set in the HDF5 file, e.g. 'GRCh38'.|
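The input rules above can be sketched as a small dispatch helper. This is only an illustration and not the module's actual code: the helper name choose_reader and the READERS table are hypothetical, and the reader names are those exposed by recent scanpy releases (the helper returns the names without calling scanpy).

```python
import os

# Hypothetical mapping from file extension to the name of a scanpy reader.
READERS = {
    ".h5ad": "read_h5ad",
    ".loom": "read_loom",
    ".mtx": "read_mtx",
    ".h5": "read_10x_h5",
}


def choose_reader(data_file, genome=None):
    """Return (reader_name, keyword_arguments) for a given input file."""
    ext = os.path.splitext(data_file)[1].lower()
    if ext not in READERS:
        raise ValueError(f"unsupported file format: {ext!r}")
    if ext == ".h5":
        # 10x HDF5 input needs the genome (data set name) to locate the matrix.
        if not genome:
            raise ValueError("a genome is required for 10x h5 files")
        return READERS[ext], {"genome": genome}
    return READERS[ext], {}
```

For example, choose_reader("pbmc.h5", genome="GRCh38") selects read_10x_h5 with the genome keyword, while the same call without a genome raises a ValueError.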
|Add count information to the data file. This generates a new h5ad file that records the total counts for each cell and each gene, as well as the number of genes expressed in each cell and the number of cells in which each gene is expressed. For cells the new annotations are called n_counts, log_counts and n_genes; for genes they are n_counts and n_cells.|
|annotate*||Whether to perform annotation True/False.|
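As a rough illustration of what these annotations contain, the sketch below computes them with NumPy on a small dense cells-by-genes matrix. The function name annotate_counts is hypothetical, and treating log_counts as the natural log of n_counts is an assumption; the module itself works on h5ad files through scanpy.

```python
import numpy as np


def annotate_counts(X):
    """Compute per-cell and per-gene annotations for a cells-by-genes matrix."""
    X = np.asarray(X)
    expressed = X > 0  # True where a gene has at least one count in a cell
    cell_counts = X.sum(axis=1)
    cells = {
        "n_counts": cell_counts,               # total counts per cell
        "log_counts": np.log(cell_counts),     # assumption: natural log
        "n_genes": expressed.sum(axis=1),      # genes expressed per cell
    }
    genes = {
        "n_counts": X.sum(axis=0),             # total counts per gene
        "n_cells": expressed.sum(axis=0),      # cells expressing each gene
    }
    return cells, genes


counts = np.array([[3, 0, 1],
                   [0, 0, 5],
                   [2, 2, 2]])
cells, genes = annotate_counts(counts)
```

Note that a cell with zero total counts would get a log_counts of negative infinity in this sketch.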
|Filter out cells based on the following thresholds.|
|cells min counts||Filter out cells with fewer total counts than min counts|
|cells max counts||Filter out cells with more total counts than max counts|
|cells min genes||Filter out cells with fewer than min genes expressed|
|cells max genes||Filter out cells with more than max genes expressed|
|Filter out genes based on the following thresholds.|
|genes min counts||Filter out genes with fewer total counts than min counts|
|genes max counts||Filter out genes with more total counts than max counts|
|genes min cells||Filter out genes expressed in fewer than min cells|
|genes max cells||Filter out genes expressed in more than max cells|
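The cell and gene thresholds can be pictured as boolean masks over a cells-by-genes count matrix. The following is a minimal NumPy sketch, not the module's implementation: the names within and filter_matrix are hypothetical, and applying the cell filters before the gene filters is an assumption about ordering.

```python
import numpy as np


def within(values, lower=None, upper=None):
    """Boolean mask of values inside [lower, upper]; None disables a bound."""
    keep = np.ones(values.shape, dtype=bool)
    if lower is not None:
        keep &= values >= lower
    if upper is not None:
        keep &= values <= upper
    return keep


def filter_matrix(X, cells_min_counts=None, cells_max_counts=None,
                  cells_min_genes=None, cells_max_genes=None,
                  genes_min_counts=None, genes_max_counts=None,
                  genes_min_cells=None, genes_max_cells=None):
    """Filter cells, then genes, of a dense cells-by-genes count matrix."""
    X = np.asarray(X)
    # Cell filters: total counts and number of expressed genes per cell.
    keep_cells = (within(X.sum(axis=1), cells_min_counts, cells_max_counts)
                  & within((X > 0).sum(axis=1), cells_min_genes, cells_max_genes))
    X = X[keep_cells]
    # Gene filters are evaluated on the cells that survived (an assumption).
    keep_genes = (within(X.sum(axis=0), genes_min_counts, genes_max_counts)
                  & within((X > 0).sum(axis=0), genes_min_cells, genes_max_cells))
    return X[:, keep_genes], keep_cells, keep_genes


counts = np.array([[3, 0, 1],
                   [0, 0, 5],
                   [2, 2, 2]])
filtered, keep_cells, keep_genes = filter_matrix(
    counts, cells_min_genes=2, genes_min_cells=2)
```

Here the second cell expresses only one gene and is dropped, after which the second gene is expressed in a single remaining cell and is dropped as well.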
Cell Type Identification
|Convert (using the R package garnett) the gene names we've provided in the marker file to the gene ids we've used as the index in our data.|
|cell type marker file||A text file describing the marker genes for each cell type. This should be in the format accepted by the R package garnett as defined in https://cole-trapnell-lab.github.io/garnett/docs/.|
|gene annotation database||This can be either "org.Hs.eg.db" (https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html) or "org.Mm.eg.db" (https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html). This annotation database allows this module (using the R package garnett) to convert the gene names we've provided in the marker file to the gene ids we've used as the index in our data.|
|Normalization consists of the following steps. First, it generates clusters for normalization by running PCA, computing neighbors and running Louvain clustering. Next, it computes size factors of the clusters with scran. Finally, it normalizes the data using the size factors and applies a log(D+1) transform.|
|normalize*||Whether to perform normalization True/False. This step is performed after filtering if filtering is on.|
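The last normalization step, scaling by size factors followed by log(D+1), can be sketched as below. The function name normalize_log1p is hypothetical, and the example uses simple library-size factors only as a stand-in for the scran estimates the module computes.

```python
import numpy as np


def normalize_log1p(X, size_factors):
    """Divide each cell (row) by its size factor, then apply log(D + 1)."""
    X = np.asarray(X, dtype=float)
    size_factors = np.asarray(size_factors, dtype=float)
    if np.any(size_factors <= 0):
        raise ValueError("size factors must be positive")
    scaled = X / size_factors[:, None]  # broadcast one factor per cell
    return np.log1p(scaled)


counts = np.array([[2, 2],
                   [4, 4]])
# Stand-in size factors: library size relative to the mean library size.
totals = counts.sum(axis=1)
size_factors = totals / totals.mean()
normalized = normalize_log1p(counts, size_factors)
```

In this toy case both cells have the same expression profile at different sequencing depths, so they become identical after normalization.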
High Variance Genes
|Filter and subset the data to retain only the N most variable genes.|
|n high variance genes||Subset to the top N highly variable genes.|
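Subsetting to the top N genes can be illustrated with a deliberately simplified ranking. The sketch below ranks genes by plain variance across cells, which is an assumption made for brevity and not the ranking scanpy itself uses; the function name top_variable_genes is hypothetical.

```python
import numpy as np


def top_variable_genes(X, n_top):
    """Keep the n_top genes (columns) with the largest variance across cells."""
    X = np.asarray(X, dtype=float)
    variances = X.var(axis=0)
    # Stable sort on negated variance: highest first, ties keep column order.
    order = np.argsort(-variances, kind="stable")
    keep = np.sort(order[:n_top])  # restore original gene order
    return X[:, keep], keep


expression = np.array([[1, 0, 5],
                       [1, 2, 0],
                       [1, 4, 10]])
subset, kept_genes = top_variable_genes(expression, n_top=2)
```

The first gene is constant across cells, so it is the one removed when n_top is 2.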
|compute umap||Whether to compute UMAP True/False.|
|compute tsne||Whether to compute TSNE True/False.|
* - required