Scanpy and scvi-tools tutorial


A table of contents is viewable in the Google Colab sidebar.

Package installation

Dataset preprocessing

Murine spleen and lymph nodes measured using CITE-seq (~30k cells)

Load data

Scanpy has a number of methods to load data from popular formats, including files output by Cell Ranger.

AnnData exploration

AnnData provides a convenient way to store expression data with cell- and gene-level metadata.

Cell-level (resp. gene-level) metadata is accessible with adata.obs (resp. adata.var), which returns a pd.DataFrame.

Quality control

Normalization

Here we run a short series of steps that keeps the raw count data for later use alongside the log-normalized data.
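As a rough illustration of what such steps compute, the sketch below keeps a copy of the raw counts and derives log-normalized values from them using only the Python standard library. The toy matrix, the `target_sum` of 10,000, the `layers` dictionary, and the helper name `log_normalize` are assumptions made for this example; in Scanpy the corresponding functions are `sc.pp.normalize_total` and `sc.pp.log1p`.

```python
import math


def log_normalize(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    normalized = []
    for row in counts:
        total = sum(row)
        # Guard against empty cells to avoid dividing by zero.
        factor = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(value * factor) for value in row])
    return normalized


# Toy count matrix: 2 cells x 3 genes (hypothetical values).
raw_counts = [[1, 0, 3], [10, 0, 30]]

# Keep an untouched copy of the counts for count-based models such as scVI.
layers = {"counts": [row[:] for row in raw_counts]}

# Derive the log-normalized values used by the standard workflow.
log_norm = log_normalize(raw_counts)
```

The two toy cells differ tenfold in sequencing depth but have the same gene proportions, so they end up with the same log-normalized profile while the stored counts stay unchanged.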

Standard workflow

The standard workflow consists of

  1. Feature-wise scaling
  2. Principal components analysis
  3. Neighbor graph computation
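Steps 1 and 3 above can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not Scanpy's implementation: the helper names and data are hypothetical, PCA (step 2) is omitted for brevity, and neighbors are therefore computed directly on the scaled features. In Scanpy the three steps correspond to `sc.pp.scale`, `sc.tl.pca`, and `sc.pp.neighbors`.

```python
import math


def scale_features(matrix):
    """Z-score each feature (column): zero mean, unit sample variance."""
    n_cells = len(matrix)
    n_features = len(matrix[0])
    scaled = [[0.0] * n_features for _ in range(n_cells)]
    for j in range(n_features):
        column = [row[j] for row in matrix]
        mean = sum(column) / n_cells
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / (n_cells - 1))
        for i in range(n_cells):
            # Constant features are set to zero instead of dividing by zero.
            scaled[i][j] = (matrix[i][j] - mean) / std if std > 0 else 0.0
    return scaled


def nearest_neighbors(matrix, k):
    """Return, for each cell, the indices of its k nearest cells (Euclidean)."""
    neighbors = []
    for i, row in enumerate(matrix):
        distances = [
            (math.dist(row, other), j)
            for j, other in enumerate(matrix)
            if j != i
        ]
        neighbors.append([j for _, j in sorted(distances)[:k]])
    return neighbors


# Toy data: 4 cells x 2 features forming two obvious pairs (hypothetical values).
expression = [[0.0, 1.0], [0.1, 1.1], [5.0, 9.0], [5.1, 9.2]]
scaled = scale_features(expression)
graph = nearest_neighbors(scaled, k=1)
```

Scaling puts both features on a comparable footing before distances are computed, so the feature with the larger raw range does not dominate the neighbor graph.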

scVI workflow

scVI is a model in the scvi-tools package that provides functionality that parallels many of the Scanpy functions including:

scVI excels at learning integrated low-dimensional latent spaces, and it scales well in both the number of datasets to integrate and the number of cells. scVI runs much faster with a discrete GPU, which makes Colab a convenient place to process datasets with scVI.

Setup anndata

setup_anndata alerts scvi-tools models to the location of relevant data in the AnnData object. As scvi-tools models require the count data, we need to specify the layer containing the counts. We'd also like to perform integration over the 'batch' metadata, so we use the batch_key parameter.

See the scvi-tools documentation for a full description of this function.

For visual confirmation of how scvi-tools registered this information, we run view_anndata_setup.

Train model and store outputs

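import scvi

# Build the scVI model on the registered AnnData object; use_cuda=True requests GPU training
model = scvi.model.SCVI(adata, use_cuda=True)
# Fit the model to the count data
model.train()
# Save the trained model to disk, replacing any earlier save at this path
model.save("spleen_lymph_cite_scvi", overwrite=True)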

Exploratory analysis

Differential expression

scVI DE

The change mode follows the protocol described in Boyeau et al.

We are comparing $h_{1g}$, the "decoded" expression of gene $g$ in cell type 1, with $h_{2g}$, the "decoded" expression of $g$ in cell type 2.

The hypotheses are:

$$ M^g_1: |f(h_{1g}, h_{2g})| > \delta $$

$$ M^g_0: |f(h_{1g}, h_{2g})| \leq \delta $$

where $\delta$ is a user-defined effect size threshold, and $f$ is the log fold change of expression.

DE "significance" between cell types 1 and 2 for each gene can then be based on the Bayes factors:

$$ \text{Natural log Bayes factor for gene } g \text{ in cell types 1 and 2} = \ln(BF^g_{10}) = \ln\left(\frac{p(M^g_1 | x_1, x_2)}{p(M^g_0 | x_1, x_2)}\right) $$

Note that scvi-tools returns the natural logarithm of the Bayes factor.
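To make the hypotheses and the Bayes factor concrete, here is a minimal sketch that estimates $p(M^g_1 | x_1, x_2)$ as the fraction of paired posterior samples whose absolute log fold change exceeds $\delta$, and then takes the natural log of the odds. It is an illustration, not the scvi-tools implementation: the function names, the toy samples, the use of base-2 logarithms for $f$, and the value `delta=0.25` are all assumptions.

```python
import math


def log_fold_change(h1, h2):
    """f(h1, h2): log2 fold change between two decoded expression values (assumed base 2)."""
    return math.log2(h1) - math.log2(h2)


def ln_bayes_factor(h1_samples, h2_samples, delta=0.25):
    """Estimate ln(BF_10) for one gene from paired samples of h_1g and h_2g."""
    pairs = list(zip(h1_samples, h2_samples))
    n_de = sum(abs(log_fold_change(h1, h2)) > delta for h1, h2 in pairs)
    p_m1 = n_de / len(pairs)  # Monte Carlo estimate of p(M_1 | x_1, x_2)
    p_m0 = 1.0 - p_m1         # M_0 is the complement of M_1
    if p_m1 == 0.0:
        return -math.inf
    if p_m0 == 0.0:
        return math.inf
    return math.log(p_m1 / p_m0)


# Toy posterior samples for one gene (hypothetical values).
h1 = [4.0, 4.0, 4.0, 1.0]
h2 = [1.0, 1.0, 1.0, 1.0]
estimate = ln_bayes_factor(h1, h2)
```

Three of the four sample pairs exceed the threshold, so the estimated probability of $M^g_1$ is 0.75 and the result is $\ln(0.75 / 0.25) = \ln 3$. A value of 0 means the two hypotheses are equally supported, and larger positive values favor differential expression.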

Scanpy DE

Scanpy offers access to the Wilcoxon rank-sum test and t-test (default), among others.

Visualization

We filter and sort the scVI DE results to focus on genes that are sufficiently expressed, significant, and have a high log fold change (LFC).
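The filtering and sorting described above might look like the following sketch, written on plain dictionaries so it runs without any dependencies. The gene names, values, and thresholds are hypothetical, and the keys `lfc_mean`, `bayes_factor`, and `non_zeros_proportion1` are assumed to match the columns of the DE results table; check them against your own output before reusing this.

```python
# Toy DE results for one cell type versus the rest (hypothetical values).
de_results = [
    {"gene": "GeneA", "lfc_mean": 2.1, "bayes_factor": 3.5, "non_zeros_proportion1": 0.60},
    {"gene": "GeneB", "lfc_mean": 3.0, "bayes_factor": 1.0, "non_zeros_proportion1": 0.50},
    {"gene": "GeneC", "lfc_mean": 4.0, "bayes_factor": 5.0, "non_zeros_proportion1": 0.02},
    {"gene": "GeneD", "lfc_mean": 2.8, "bayes_factor": 4.2, "non_zeros_proportion1": 0.30},
    {"gene": "GeneE", "lfc_mean": -1.5, "bayes_factor": 6.0, "non_zeros_proportion1": 0.70},
]


def top_markers(rows, min_lfc=0.0, min_bayes_factor=3.0, min_expressed=0.1):
    """Keep upregulated, significant, sufficiently expressed genes; sort by LFC."""
    kept = [
        row
        for row in rows
        if row["lfc_mean"] > min_lfc
        and row["bayes_factor"] > min_bayes_factor
        and row["non_zeros_proportion1"] > min_expressed
    ]
    return sorted(kept, key=lambda row: row["lfc_mean"], reverse=True)


markers = top_markers(de_results)
marker_genes = [row["gene"] for row in markers]
```

Here GeneB is dropped for a low Bayes factor, GeneC for being detected in too few cells, and GeneE for being downregulated, which leaves GeneD and GeneA in that order.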

Computing a dendrogram is useful for Scanpy's visualization functions.

Embedding density

scANVI for label transfer

We've used our previous workflow to explore and annotate our dataset. Now consider the case where we run more samples and would like to automate the annotation of these new cells. scANVI was designed just for this -- automated label transfer.

Preprocessing

We have a bit of a class imbalance problem, so we only allow scANVI to "see" a subset of the labels such that the classes have better balance.
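One simple way to do this, sketched below under stated assumptions, is to keep at most a fixed number of labeled cells per class and mark every other cell as unlabeled. The helper name `mask_labels`, the cap of 5 cells per class, the toy labels, and the fixed seed are hypothetical; `'unknown'` is used as the unlabeled category to match the model call that follows.

```python
import random
from collections import defaultdict


def mask_labels(labels, n_per_class, unlabeled="unknown", seed=0):
    """Keep at most `n_per_class` labels per class; mark the rest as unlabeled."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    masked = [unlabeled] * len(labels)
    for label, indices in indices_by_class.items():
        # Sample without replacement; small classes keep all of their labels.
        keep = rng.sample(indices, min(n_per_class, len(indices)))
        for index in keep:
            masked[index] = label
    return masked


# Toy, heavily imbalanced annotation (hypothetical cell types and counts).
labels = ["B cell"] * 50 + ["T cell"] * 8 + ["NK cell"] * 3
seed_labels = mask_labels(labels, n_per_class=5)
```

After masking, the model sees 5 B cells, 5 T cells, and 3 NK cells as labeled examples instead of a 50:8:3 split, and the remaining 48 cells are treated as unlabeled.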

Train model

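# Build the scANVI model; cells labeled 'unknown' are treated as unlabeled
model = scvi.model.SCANVI(adata, unlabeled_category="unknown")
# Train with 20 epochs in the semi-supervised phase
model.train(n_epochs_semisupervised=20)
# Save the trained model to disk
model.save("spleen_lymph_scanvi")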

scVI in R

https://www.scvi-tools.org/en/stable/user_guide/notebooks/scvi_in_R.html