scNT-seq human hematopoiesis dynamics
In this tutorial, we will demonstrate how to perform absolute total RNA
velocity analyses with the metabolic labeling datasets that was reported
in the dynamo Cell paper. This notebook reproduces the total RNA
velocity analyses reported in figure 3B of the original cell study. Note
that dynamo is equiped with the processed hematopoiesis dataset produced
following the steps presented in this notebook. You can download the
processed data via dyn.sample_data.hematopoiesis
.
%%capture
import dynamo as dyn
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
First let us download the raw hematopoiesis data via
dyn.sample_data.hematopoiesis_raw
. Thenew
and total
layers
correspond to the labeled RNA and total RNA and will be used for total
RNA velocity analyses estimation. Note that in order to ensure the
reproducibility, we will also included the umap embedding in
.obsm['X_umap']
generated previously.
adata_hsc_raw = dyn.sample_data.hematopoiesis_raw()
adata_hsc_raw
AnnData object with n_obs × n_vars = 1947 × 26193
obs: 'batch', 'cell_type', 'time'
var: 'gene_name_mapping'
uns: 'genes_to_use'
obsm: 'X_umap'
layers: 'new', 'spliced', 'total', 'unspliced'
We use the monocle recipe to preprocess the adata
object.
Importantly, to ensure the reproducibility, we also use a predefined
gene list that includes the highly variable genes and known markers
genes based on previous reports (Paul, cell,
2015, Weintrab, Science,
2020, etc). The gene list
is stored in adata_hsc_raw.uns["genes_to_use"]
of the hematopoiesis
raw dataset came with dynamo.
Note that the time
key indicates the RNA metabolic labeling
duration. Since we labeled cells from the same harvest time point for a
single time pulse, so the experiment_type is set to be “one-shot”
(although we labeled cells at day 7 for 3 hours while cells at day 10
for 5 hours).
selected_genes_to_use = adata_hsc_raw.uns["genes_to_use"]
preprocessor = dyn.pp.Preprocessor(force_gene_list=selected_genes_to_use)
preprocessor.config_monocle_recipe(adata_hsc_raw, n_top_genes=len(selected_genes_to_use))
preprocessor.preprocess_adata_monocle(
adata_hsc_raw,
tkey="time",
experiment_type="one-shot",
)
|-----> convert ensemble name to official gene name
|-----? Your adata object uses non-official gene names as gene index.
Dynamo is converting those names to official gene names.
Error: The requests_cache python module is required to use request caching.
See - https://requests-cache.readthedocs.io/en/latest/user_guide.html#installation
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-13000...done.
querying 13001-14000...done.
querying 14001-15000...done.
querying 15001-16000...done.
querying 16001-17000...done.
querying 17001-18000...done.
querying 18001-19000...done.
querying 19001-20000...done.
querying 20001-21000...done.
querying 21001-22000...done.
querying 22001-23000...done.
querying 23001-24000...done.
querying 24001-25000...done.
querying 25001-26000...done.
querying 26001-26193...done.
Finished.
4 input query terms found dup hits:
[('ENSG00000229425', 2), ('ENSG00000249738', 2), ('ENSG00000260788', 2), ('ENSG00000278903', 3)]
66 input query terms found no hit:
['ENSG00000112096', 'ENSG00000168078', 'ENSG00000189144', 'ENSG00000203812', 'ENSG00000215271', 'ENS
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
|-----> [Preprocessor-monocle] completed [57.7065s]
adata_hsc_raw.var.use_for_pca.sum()
1754
dyn.tl.reduceDimension(adata_hsc_raw)
|-----> retrieve data for non-linear dimension reduction...
|-----? adata already have basis umap. dimension reduction umap will be skipped!
set enforce=True to re-performing dimension reduction.
|-----> Start computing neighbor graph...
|-----------> X_data is None, fetching or recomputing...
|-----> fetching X data from layer:None, basis:pca
|-----> method arg is None, choosing methods automatically...
|-----------> method ball_tree selected
|-----> [UMAP] completed [0.1294s]
Estimate RNA velocity with the Model 2
In general, dynamo supports two major models for estimating kinetic parameters and RNA velocity for tscRNA-seq data. The Model 2 doesn’t consider RNA splicing while Monocle 3 does (see Fig. SI2. A).
Note that we also use labeling time to group cells for gene expression
smoothing via dyn.tl.moments
.
dyn.tl.moments(adata_hsc_raw, group="time")
|-----> calculating first/second moments...
|-----> [moments calculation] completed [11.1526s]
Since we actually have unsplicing/splicing data in our adata, dynamo’s
preprocess module automatically recognizes this and then tag the
adata
to have both splicing and labeling information. In order to
use Model 2, here we purposely set has_splicing
to be false, which
then considers labeling data (new/total) while ignores
unsplicing/splicing information.
Note that in order to ensure the reproducibility, we set
one_shot_method="sci_fate", model="deterministic"
but running with
default parameters will give you very similar results.
adata_hsc_raw.uns["pp"]["has_splicing"] = False
dyn.tl.dynamics(adata_hsc_raw, group="time", one_shot_method="sci_fate", model="deterministic");
|-----> calculating first/second moments...
|-----> [moments calculation] completed [2.7040s]
estimating gamma: 100%|████████████████████| 1754/1754 [00:07<00:00, 242.37it/s]
estimating alpha: 100%|██████████████████| 1754/1754 [00:00<00:00, 96611.98it/s]
|-----> calculating first/second moments...
|-----> [moments calculation] completed [1.8486s]
estimating gamma: 100%|████████████████████| 1754/1754 [00:05<00:00, 348.18it/s]
estimating alpha: 100%|██████████████████| 1754/1754 [00:00<00:00, 95851.69it/s]
Next, because we actually quantified both the labeling and splicing information, we used the second formula that involves both splicing and labeling data to define total RNA velocity (\(\dot{r} = n / (1 - e^{-rt}) \cdot r - \gamma s\)) where \(r, n, t, \gamma, s\) are total RNA, new RNA, labeling time, splicing rate and spliced RNA respectively.
Once the high-dimensional total RNA velocities are calculated, we will
then projected them to two-dimensional UMAP space and visualized with
the streamline plot, using dynamo with default parameters
(dyn.tl.cell_velocities
).
adata_hsc_raw.obs.time.unique()
array([3, 5])
adata_hsc_raw
AnnData object with n_obs × n_vars = 1947 × 21595
obs: 'batch', 'cell_type', 'time', 'nGenes', 'nCounts', 'pMito', 'pass_basic_filter', 'total_Size_Factor', 'initial_total_cell_size', 'spliced_Size_Factor', 'initial_spliced_cell_size', 'unspliced_Size_Factor', 'initial_unspliced_cell_size', 'Size_Factor', 'initial_cell_size', 'new_Size_Factor', 'initial_new_cell_size', 'ntr'
var: 'gene_name_mapping', 'nCells', 'nCounts', 'query', 'scopes', '_id', '_score', 'symbol', 'pass_basic_filter', 'log_m', 'score', 'log_cv', 'frac', 'use_for_pca', 'ntr', 'time_3_alpha', 'time_3_beta', 'time_3_gamma', 'time_3_half_life', 'time_3_alpha_b', 'time_3_alpha_r2', 'time_3_gamma_b', 'time_3_gamma_r2', 'time_3_gamma_logLL', 'time_3_delta_b', 'time_3_delta_r2', 'time_3_bs', 'time_3_bf', 'time_3_uu0', 'time_3_ul0', 'time_3_su0', 'time_3_sl0', 'time_3_U0', 'time_3_S0', 'time_3_total0', 'time_3_beta_k', 'time_3_gamma_k', 'time_5_alpha', 'time_5_beta', 'time_5_gamma', 'time_5_half_life', 'time_5_alpha_b', 'time_5_alpha_r2', 'time_5_gamma_b', 'time_5_gamma_r2', 'time_5_gamma_logLL', 'time_5_bs', 'time_5_bf', 'time_5_uu0', 'time_5_ul0', 'time_5_su0', 'time_5_sl0', 'time_5_U0', 'time_5_S0', 'time_5_total0', 'time_5_beta_k', 'time_5_gamma_k', 'use_for_dynamics'
uns: 'genes_to_use', 'pp', 'velocyto_SVR', 'feature_selection', 'PCs', 'explained_variance_ratio_', 'pca_mean', 'neighbors', 'dynamics'
obsm: 'X_umap', 'X_pca', 'X'
layers: 'new', 'spliced', 'total', 'unspliced', 'X_total', 'X_unspliced', 'X_spliced', 'X_new', 'M_u', 'M_uu', 'M_s', 'M_us', 'M_t', 'M_tt', 'M_n', 'M_tn', 'M_ss', 'M_nn', 'velocity_N', 'velocity_T'
obsp: 'distances', 'connectivities', 'moments_con'
We have two time points in hsc dataset. Here we split the dataset based on time points and prepare data for calculation next.
pca_genes = adata_hsc_raw.var.use_for_pca
new_expr = adata_hsc_raw[:, pca_genes].layers["M_n"]
time_3_gamma = adata_hsc_raw[:, pca_genes].var.time_3_gamma.astype(float)
time_5_gamma = adata_hsc_raw[:, pca_genes].var.time_5_gamma.astype(float)
t = adata_hsc_raw.obs.time.astype(float)
M_s = adata_hsc_raw.layers["M_s"][:, pca_genes]
time_3_cells = adata_hsc_raw.obs.time == 3
time_5_cells = adata_hsc_raw.obs.time == 5
Next, we will calculate total RNA velocity
according to
def alpha_minus_gamma_s(new, gamma, t, M_s):
# equation: alpha = new / (1 - e^{-rt}) * r
alpha = new.A.T / (1 - np.exp(-gamma.values[:, None] * t.values[None, :])) * gamma.values[:, None]
gamma_s = gamma.values[:, None] * M_s.A.T
alpha_minus_gamma_s = alpha - gamma_s
return alpha_minus_gamma_s
time_3_velocity_n = alpha_minus_gamma_s(new_expr[time_3_cells, :], time_3_gamma, t[time_3_cells], M_s[time_3_cells, :])
time_5_velocity_n = alpha_minus_gamma_s(new_expr[time_5_cells, :], time_5_gamma, t[time_5_cells], M_s[time_5_cells, :])
velocity_n = adata_hsc_raw.layers["velocity_N"].copy()
valid_velocity_n = velocity_n[:, pca_genes].copy()
valid_velocity_n[time_3_cells, :] = time_3_velocity_n.T
valid_velocity_n[time_5_cells, :] = time_5_velocity_n.T
velocity_n[:, pca_genes] = valid_velocity_n.copy()
adata_hsc_raw.layers["velocity_alpha_minus_gamma_s"] = velocity_n.copy()
The results are stored in
adata_hsc_raw.layers["velocity_alpha_minus_gamma_s"]
, which can be
further projected to low dimension space for visualization.
dyn.tl.cell_velocities(
adata_hsc_raw,
enforce=True,
X=adata_hsc_raw.layers["M_t"],
V=adata_hsc_raw.layers["velocity_alpha_minus_gamma_s"],
method="cosine",
);
Now let us plot the total RNA stream line plot and visualize the PF4 gene expression on the UMAP space with default parameters. This reproduces total RNA velocity streamline plot in Figure 3B and Figure 3C.
dyn.pl.streamline_plot(
adata_hsc_raw,
color=["batch", "cell_type", "PF4"],
ncols=4,
basis="umap",
)

Here we can also visualize the total RNA phase diagram in Figure 3E using dynamo with default settings