Dynamo: Mapping Vector Field of Single Cells¶
Understanding how gene expression in single cells progresses over time is vital for revealing the mechanisms governing cell fate transitions. RNA velocity, which infers immediate changes in gene expression by comparing levels of new (unspliced) versus mature (spliced) transcripts (La Manno et al. 2018), represents an important advance to these efforts. A key question remaining is whether it is possible to predict the most probable cell state backward or forward over arbitrary timescales. To this end, we introduce an inclusive model (termed Dynamo) capable of predicting cell states over extended time periods, which incorporates promoter state switching, transcription, splicing, translation and RNA/protein degradation by taking advantage of scRNA-seq and the co-assay of transcriptome and proteome. We also implement scSLAM-seq by extending SLAM-seq to plate-based scRNA-seq (Hendriks et al. 2018; Erhard et al. 2019; Cao, Zhou, et al. 2019) and augment the model by explicitly incorporating the metabolic labelling of nascent RNA. We show that through careful design of labelling experiments and an efficient mathematical framework, the entire kinetic behavior of a cell from this model can be robustly and accurately inferred. Aided by the improved framework, we show that it is possible to analytically reconstruct the transcriptomic vector field from sparse and noisy vector samples generated by single cell experiments. The analytically reconstructed vector field further enables global mapping of potential landscapes that reflect the relative stability of a given cell state, and the minimal transition time and most probable paths between any cell states in the state space. This work thus foreshadows the possibility of predicting long-term trajectories of cells during a dynamic process instead of short-time velocity estimates. Our methods are implemented as an open source tool, dynamo.
Discussion¶
Please use the GitHub issue tracker to report coding-related issues of dynamo. For community discussion of novel usage cases, analysis tips and biological interpretations of dynamo, please join our public Slack workspace: dynamo-discussion (only a working email address is required on the Slack side).
Contribution¶
If you want to contribute to the development of dynamo, please check out the contribution instructions: Contribution
10 minutes to dynamo¶
Welcome to dynamo!
Dynamo is a computational framework that includes an inclusive model of expression dynamics with scSLAM-seq / multi-omics, vector field reconstruction and potential landscape mapping.
Why dynamo¶
Dynamo currently provides a complete solution (see below) to analyze expression dynamics of conventional scRNA-seq or time-resolved metabolic labeling based scRNA-seq. It aspires to become a leading tool for continuously integrating the most exciting developments in machine learning, systems biology, information theory, stochastic physics, etc. to model, understand and interpret datasets generated from various cutting-edge single cell genomics techniques (development of dynamo 2/3 is under way). We hope those models, understandings and interpretations not only facilitate your research but may also eventually lead to new biological discoveries. Dynamo has a strong community, so you will feel supported whether you are a newcomer to computational biology or a veteran researcher who wants to contribute to dynamo's development.
How to install¶
Dynamo requires Python 3.6 or later.
Dynamo has been released to PyPI; you can install the PyPI version via:
pip install dynamo-release
To install the newest version of dynamo, you can git clone our repo and then pip install:
git clone https://github.com/aristoteleo/dynamo-release.git
pip install dynamo-release/ --user
Note that the --user
flag is used to install the package to your home directory, in case you don't have root privileges.
Alternatively, you can install dynamo when you are in the dynamo-release folder by directly using python's setup install:
git clone https://github.com/aristoteleo/dynamo-release.git
cd dynamo-release/
python setup.py install --user
Or you can install dynamo from source, using the following script:
pip install git+https://github.com/aristoteleo/dynamo-release
In order to ensure dynamo runs properly, your python environment needs to satisfy dynamo's dependencies. We provide a helper function for you to check the versions of all of dynamo's dependencies.
import dynamo as dyn
dyn.get_all_dependencies_version()
Architecture of dynamo¶
Dynamo has a few standard modules like most other single cell analysis toolkits (Scanpy, Monocle or Seurat), for example, data loading (dyn.read*), preprocessing (dyn.pp.*), tool analysis (dyn.tl.*), and plotting (dyn.pl.*). Modules specific to dynamo include:

- a comprehensive estimation framework (dyn.est.*) of expression dynamics that includes:
  - conventional single cell RNA-seq (scRNA-seq) modeling (dyn.est.csc.*) for standard RNA velocity estimation and more;
  - time-resolved metabolic labeling based scRNA-seq modeling (dyn.est.tsc.*) for labeling-based RNA velocity estimation and more;
- vector field reconstruction and vector calculus (dyn.vf.*);
- cell fate prediction (dyn.pd.*);
- creation of movies of cell fate predictions (dyn.mv.*);
- stochastic simulation of various metabolic labeling experiments (dyn.sim.*);
- integration with external tools built by us or others (dyn.ext.*);
- and more.
Typical workflow¶
A typical workflow in dynamo is similar to most other single cell analysis toolkits (Scanpy, Monocle or Seurat), including steps like importing dynamo (import dynamo as dyn), loading data (dyn.read*), preprocessing (dyn.pp.*), tool analysis (dyn.tl.*) and plotting (dyn.pl.*). To get the best of dynamo though, you need to use the dyn.vf.*, dyn.pd.* and dyn.mv.* modules.
Import dynamo as:
import dynamo as dyn
We provide a few nice visualization defaults for different purposes:
dyn.configuration.set_figure_params('dynamo', background='white') # jupyter notebooks
dyn.configuration.set_figure_params('dynamo', background='black') # presentation
dyn.configuration.set_pub_style() # manuscript
Load data¶
Dynamo relies on anndata for data IO. You can read your own data via read, read_loom, read_h5ad, read_h5 or load_NASC_seq, etc.:
adata = dyn.read(filename)
Dynamo also comes with a few built-in sample datasets so you can familiarize yourself with dynamo before analyzing your own dataset. For example, you can load the Dentate Gyrus example dataset:
adata = dyn.sample_data.DentateGyrus()
There are many sample datasets available. You can check other available datasets via dyn.sample_data.*
.
To process the scSLAM-seq data, please refer to the NASC-seq analysis pipeline. We are also working on a command line tool for this and will release it in due time. For processing splicing data, you can either use the velocyto command line interface or bustools from the Pachter lab.
Preprocess data¶
After loading data, you are ready to perform some preprocessing. You can run the recipe_monocle function, which uses a similar but generalized strategy from Monocle 3 to normalize all datasets in different layers (the spliced and unspliced, or new (i.e. metabolically labelled) and total mRNAs, or others), followed by feature selection and PCA dimension reduction.
dyn.pp.recipe_monocle(adata)
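The details of recipe_monocle differ, but its gist (size-factor normalization, log transform, then PCA) can be sketched with plain numpy on a toy cells-by-genes count matrix (this is an illustrative sketch, not dynamo's implementation, which also performs feature selection and handles sparse layers):

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(100, 50)).astype(float)  # toy cells x genes counts

# Size-factor normalization: scale each cell to the median total count.
totals = counts.sum(axis=1, keepdims=True)
norm = counts / totals * np.median(totals)

# Log-transform, center, then PCA via SVD.
logged = np.log1p(norm)
centered = logged - logged.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pcs = U[:, :30] * S[:30]  # cells x 30 PCA coordinates

print(pcs.shape)
```
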
Learn dynamics¶
Next you will want to estimate the kinetic parameters of expression dynamics and then learn the velocity values for all genes that pass some filters (selected feature genes, by default) across cells. The dyn.tl.dynamics function does all the hard work for you:
dyn.tl.dynamics(adata)
This implicitly calls dyn.tl.moments first:
dyn.tl.moments(adata)
which calculates the first and second moments (and sometimes the covariance between different layers) of the expression data. The first and second moments are basically the mean and uncentered variance of gene expression, calculated based on local smoothing via a nearest neighbour graph constructed in the reduced PCA space from the spliced or total mRNA expression of single cells.
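The idea of kNN-based moments can be sketched in a few lines of numpy (a brute-force toy version; dynamo uses an efficient kNN index and connectivity weights):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))  # toy cells x genes matrix

# Brute-force k nearest neighbours in this space.
k = 15
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
nbrs = np.argsort(d2, axis=1)[:, :k]  # each cell's k neighbours (incl. itself)

# First moment: neighbourhood mean; second moment: neighbourhood mean of squares.
M1 = X[nbrs].mean(axis=1)             # locally smoothed expression
M2 = (X[nbrs] ** 2).mean(axis=1)      # uncentered second moment
uncentered_var = M2 - M1 ** 2         # per-cell variance estimate (non-negative)

print(M1.shape)
```
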
It then performs the following steps:

- checks the data you have and determines the experimental type automatically: either conventional scRNA-seq; a kinetics, degradation or one-shot single-cell metabolic labelling experiment; or a CITE-seq or REAP-seq co-assay, etc.;
- learns the velocity for each feature gene using either the original deterministic model based on a steady-state assumption from the seminal RNA velocity work, or a few new methods, including the stochastic (default) or negative binomial method for conventional scRNA-seq, and the kinetic, degradation or one-shot models for metabolic labeling based scRNA-seq.

The latter methods are based on moment equations. All these methods use all or part of the output from dyn.tl.moments(adata).
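For intuition, the deterministic model underlying conventional RNA velocity writes unspliced/spliced dynamics as du/dt = α − βu and ds/dt = βu − γs, so spliced velocity is βu − γs, which vanishes at steady state. A toy simulation (illustrative parameters, not dynamo's estimation code):

```python
# Toy deterministic RNA model: du/dt = alpha - beta*u, ds/dt = beta*u - gamma*s.
# alpha: transcription rate; beta: splicing rate; gamma: degradation rate.
alpha, beta, gamma = 10.0, 1.0, 0.5

# Simple Euler integration toward steady state.
u, s, dt = 0.0, 0.0, 0.01
for _ in range(10000):
    u += dt * (alpha - beta * u)
    s += dt * (beta * u - gamma * s)

# At steady state u* = alpha/beta and s* = alpha/gamma, so the velocity vanishes.
velocity = beta * u - gamma * s
print(round(u, 2), round(s, 2))
```
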
Kinetic estimation of conventional scRNA-seq and metabolic labeling based scRNA-seq is often tricky and has a lot of pitfalls. Sometimes you may even observe undesired backward vector flow. You can evaluate the confidence of gene-wise velocity via:
dyn.tl.gene_wise_confidence(adata, group='group', lineage_dict={'Progenitor': ['terminal_cell_state']})
and filter those low-confidence genes for downstream velocity vector analysis, etc. (see more details in the FAQ).
Dimension reduction¶
By default, we use the umap algorithm for dimension reduction.
dyn.tl.reduceDimension(adata)
If the requested reduced dimension already exists, dynamo won't touch it unless you set enforce=True.
dyn.tl.reduceDimension(adata, basis='umap', enforce=True)
Velocity vectors¶
We need to project the velocity vectors onto a low dimensional embedding for later visualization. To get there, we can either use the default correlation/cosine kernel or our novel Itô kernel.
dyn.tl.cell_velocities(adata)
The above function projects and evaluates velocity vectors on the umap space, but you can also operate on other bases, for example the pca space:
dyn.tl.cell_velocities(adata, basis='pca')
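The essence of the correlation/cosine kernel is that a cell's high-dimensional velocity is converted into transition probabilities toward neighbours whose displacement aligns with it, and the projected vector is the expected displacement in the embedding. A toy numpy sketch of that idea (made-up data and kernel sharpness; not dynamo's exact kernel):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 20))   # toy high-dimensional expression states
V = rng.normal(size=(n, 20))   # toy velocity vectors
E = rng.normal(size=(n, 2))    # toy low-dimensional embedding (e.g. umap)

# k nearest neighbours per cell (brute force for the sketch).
k = 10
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)
nbrs = np.argsort(d2, axis=1)[:, :k]

proj = np.zeros((n, 2))
for i in range(n):
    delta = X[nbrs[i]] - X[i]  # displacements toward neighbours
    cos = (delta @ V[i]) / (
        np.linalg.norm(delta, axis=1) * np.linalg.norm(V[i]) + 1e-12
    )
    p = np.exp(cos / 0.1)      # sharpen alignment into transition probabilities
    p /= p.sum()
    # Expected embedding displacement, minus the uniform baseline drift.
    proj[i] = p @ (E[nbrs[i]] - E[i]) - (E[nbrs[i]] - E[i]).mean(axis=0)

print(proj.shape)
```
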
You can check the confidence of cell-wise velocity to understand how reliable the recovered velocity is across cells via:
dyn.tl.cell_wise_confidence(adata)
Obviously dynamo doesn't stop here. The really exciting part of dynamo lies in the fact that it learns a functional form of the vector field in the full transcriptomic space, which can then be used to predict cell fates and map the single cell potential landscape.
Vector field reconstruction¶
In classical physics, including fluidics and aerodynamics, velocity and acceleration vector fields are used as fundamental tools to describe the motion or external forces of objects, respectively. By analogy, RNA velocities or protein accelerations estimated from single cells can be regarded as sparse samples of the velocity (La Manno et al. 2018) or acceleration vector field (Gorin, Svensson, and Pachter 2019) defined on the gene expression space.
In general, a vector field can be defined as a vector-valued function f that maps any point x (a cell's expression state) in a D-dimensional domain Ω (the gene expression system with D transcripts/proteins) to a vector y (for example, the velocity or acceleration for different genes or proteins); that is, f(x) = y.
To formally define the problem of velocity vector field learning, we consider a set of measured cells with pairs of current and estimated future expression states. The difference between the predicted future state and current state for each cell corresponds to its velocity vector. We note that the measured single-cell velocity (conventional RNA velocity) is sampled from a smooth, differentiable vector field f that maps from xi to yi on the entire domain. Normally, single cell velocity measurements are the result of biased, noisy and sparse sampling of the entire state space; thus the goal of velocity vector field reconstruction is to robustly learn a mapping function f that outputs yj given any point xj on the domain, based on the observed data, with certain smoothness constraints (Jiayi Ma et al. 2013). In the ideal scenario, the mapping function f should recover the true velocity vector field on the entire domain and predict the true dynamics in regions of expression space that are not sampled. To reconstruct the vector field function in dynamo, you can simply use the following function to do all the heavy lifting:
dyn.vf.VectorField(adata)
By default, it learns the vector field in the pca space, but you can of course learn it in any basis, even the original gene expression space.
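The flavor of learning a smooth function from sparse, noisy velocity samples can be illustrated with a small kernel ridge regression standing in for sparseVFC (toy attractor data, made-up bandwidth and ridge parameters; dynamo's actual algorithm additionally handles outliers and sparsity):

```python
import numpy as np

rng = np.random.default_rng(3)

# Ground-truth vector field f(x) = -x (a point attractor at the origin),
# observed as noisy velocity samples at random states.
X = rng.uniform(-2, 2, size=(300, 2))
V = -X + 0.05 * rng.normal(size=X.shape)

# Gaussian-kernel ridge regression: learn a smooth f_hat(x) = sum_j K(x, x_j) W_j.
def gram(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2)

K = gram(X, X)
W = np.linalg.solve(K + 1e-2 * np.eye(len(X)), V)  # ridge-regularized weights

def f_hat(x):
    return gram(np.atleast_2d(x), X) @ W

# The reconstructed field predicts velocity at an unseen state.
pred = f_hat(np.array([[1.0, -1.0]]))[0]  # approximately (-1.0, 1.0)
print(pred)
```
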
Characterize vector field topology¶
Since we learn the vector field function of the data, we can then characterize the topology of the full vector field space. For example, we are able to identify:

- the fixed points (attractors, saddles, etc.), which may correspond to terminal cell types or progenitors;
- the nullclines and separatrices of a recovered dynamic system, which may formally define the dynamical behaviour or the boundaries of cell types in gene expression space.
Again, you only need to run the following function to get all this information:
dyn.vf.topography(adata, basis='umap')
Map potential landscape¶
The concept of a potential landscape is widely appreciated across various biological disciplines, for example the adaptive landscape in population genetics, the protein-folding funnel landscape in biochemistry, and the epigenetic landscape in developmental biology. In the context of cell fate transitions, for example differentiation or carcinogenesis, a potential landscape not only offers an intuitive description of the global dynamics of the biological process but also provides key insights for understanding multistability and the transition rates between different cell types, as well as for quantifying the optimal paths of cell fate transitions.
The classical definition of a potential function in physics requires a gradient system (no curl or cyclic dynamics), which is often not applicable to open biological systems. In dynamo we therefore provide several ways to quantify the potential of single cells by decomposing the vector field into gradient and curl parts, etc. The recommended method is built on the Hodge decomposition on simplicial complexes (a sparse directional graph) constructed based on the learned vector field function, which provides a fruitful analogy of gradient, curl and harmonic (cyclic) flows on a manifold:
dyn.ext.ddhodge(adata)
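The core of the graph Hodge idea, extracting the gradient (potential) part of a flow on a graph, can be sketched on a tiny directed graph: find the potential ψ whose edge differences best explain the observed edge flow by least squares; the residual is the cyclic part. This is a minimal illustration of the principle, not the ddhodge implementation:

```python
import numpy as np

# A tiny graph: a 3-cycle (0->1->2->0) plus one extra edge (1->3).
edges = [(0, 1), (1, 2), (2, 0), (1, 3)]
flow = np.array([1.0, 1.0, 1.0, 2.0])  # observed flow along each edge

n = 4
# Incidence matrix B: one row per edge, -1 at source, +1 at target.
B = np.zeros((len(edges), n))
for e, (i, j) in enumerate(edges):
    B[e, i], B[e, j] = -1.0, 1.0

psi, *_ = np.linalg.lstsq(B, flow, rcond=None)  # potential, up to a constant
grad_part = B @ psi                             # gradient (acyclic) component
curl_part = flow - grad_part                    # remaining cyclic component

# The pure cycle carries no gradient flow, so its flow survives as curl.
print(np.round(curl_part, 3))  # -> [1. 1. 1. 0.]
```
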
In addition, we and others have proposed various strategies to decompose stochastic differential equations into gradient and curl components from first principles. The gradient part can then be used to define the potential.
Although an analytical decomposition of the reconstructed vector field is challenging, we are able to use a numerical algorithm we recently developed for this purpose. This approach uses a least action method under the A-type stochastic integration (Shi et al. 2012) to globally map the potential landscape Ψ(x) (Tang et al. 2017), taking the vector field function f(x) as input.
dyn.vf.Potential(adata)
Visualization¶
In two or three dimensions, a streamline plot can be used to visualize the paths cells will follow if released in different regions of the gene expression state space under a steady flow field. Although we do not currently support this, for vector fields that change over time similar methods, for example streaklines, pathlines, timelines, etc., can be used to visualize the evolution of single cells or cell populations.
In dynamo, we have three standard visual representations of vector fields: the cell-wise and grid quiver plots, and the streamline plot. Another intuitive way to visualize the structure of a vector field is the so-called line integral convolution (LIC) method (Cabral and Leedom 1993), which works by adding random black-and-white paint sources on the vector field and letting the flowing particles pick up texture, so that points on the same streamline have similar intensity. We rely on yt's annotate_line_integral_convolution function to visualize the LIC of the vector field reconstructed by dynamo.
dyn.pl.cell_wise_vectors(adata, color=colors, ncols=3)
dyn.pl.grid_vectors(adata, color=colors, ncols=3)
dyn.pl.streamline_plot(adata, color=colors, ncols=3)
dyn.pl.line_integral_conv(adata)
Note that colors here is a list or str that can be either a column name in .obs or gene names.
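The grid quiver representation boils down to averaging cell-wise vectors into coarse spatial bins of the embedding. A toy numpy sketch of that binning (random data; not dynamo's plotting code, which also applies kernel smoothing):

```python
import numpy as np

rng = np.random.default_rng(4)
pos = rng.uniform(0, 1, size=(500, 2))  # toy cell embedding coordinates
vec = rng.normal(size=(500, 2))         # toy projected velocity per cell

# Average cell-wise vectors into a coarse grid (the idea behind grid quiver plots).
nbins = 10
ix = np.clip((pos * nbins).astype(int), 0, nbins - 1)
grid_v = np.zeros((nbins, nbins, 2))
counts = np.zeros((nbins, nbins, 1))
for (gx, gy), v in zip(ix, vec):
    grid_v[gx, gy] += v
    counts[gx, gy] += 1
grid_v = grid_v / np.maximum(counts, 1)  # mean vector per occupied grid cell

print(grid_v.shape)
```
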
To visualize the topological structure of the reconstructed 2D vector fields, we provide the dyn.pl.topography
function in dynamo.
dyn.vf.VectorField(adata, basis='umap')
dyn.pl.topography(adata)
Plotting functions in dynamo are designed to be extremely flexible. For example, you can combine different types of dynamo plots (when you visualize only one item per plot function):
import matplotlib.pyplot as plt
fig1, f1_axes = plt.subplots(ncols=2, nrows=2, constrained_layout=True, figsize=(12, 10))
f1_axes[0, 0] = dyn.pl.cell_wise_vectors(adata, color='umap_ddhodge_potential', pointsize=0.1, alpha = 0.7, ax=f1_axes[0, 0], quiver_length=6, quiver_size=6, save_show_or_return='return')
f1_axes[0, 1] = dyn.pl.grid_vectors(adata, color='speed_umap', ax=f1_axes[0, 1], quiver_length=12, quiver_size=12, save_show_or_return='return')
f1_axes[1, 0] = dyn.pl.streamline_plot(adata, color='divergence_pca', ax=f1_axes[1, 0], save_show_or_return='return')
f1_axes[1, 1] = dyn.pl.topography(adata, color='acceleration_umap', ax=f1_axes[1, 1], save_show_or_return='return')
plt.show()
The above creates a 2x2 plot that puts cell_wise_vectors, grid_vectors, streamline_plot and topography plots together.
Compatibility¶
Dynamo is fully compatible with velocyto, scanpy and scvelo, so you can use your loom or anndata object as input for dynamo. The velocity vector samples estimated by either velocyto or scvelo can also be directly used to reconstruct the functional form of the vector field and to map the potential landscape in the entire expression space.
API¶
Import dynamo as:
import dynamo as dyn
Data IO¶
(See more at the anndata docs)

Read .h5ad-formatted hdf5 file. 

Read .h5ad-formatted hdf5 file. 

Read .loom-formatted hdf5 file. 
Preprocessing (pp)¶

This function is partly based on the Monocle R package (https://github.com/cole-trapnell-lab/monocle3). 

Call cell cycle positions for cells within the population. 
Estimation (est)¶
Note
Classes in est are used internally by Tools. See our estimation classes here: estimation
Tools (tl)¶
kNN and moments of expressions

Function to search nearest neighbors of the adata object. 

Function to calculate mutual nearest neighbor graph across specific data layers. 

Calculate kNN based first and second moments (including uncentered covariance) for 
Kinetic parameters and RNA/protein velocity

Inclusive model of expression dynamics considers splicing, metabolic labeling and protein translation. 
Dimension reduction

Compute a low dimension reduction projection of an anndata object first with PCA, followed by non-linear dimension reduction methods 

This function is a pure Python implementation of the DDRTree algorithm. 

This function is a pure Python implementation of the PSL algorithm. 
Clustering

Apply hdbscan to cluster cells in the space defined by basis. 

Cluster cells based on vector field features. 
Velocity projection

Compute transition probability and project high dimension velocity vector to existing low dimension embedding. 

Confidently compute transition probability and project high dimension velocity vectors to existing low dimension embeddings using progenitor and mature cell group priors. 
Velocity metrics

Calculate the cellwise velocity confidence metric. 

Diagnostic measure to identify genes that contribute to "wrong" directionality of the vector flow. 
Markov chain

Apply the diffusion map algorithm on the transition matrix built from the Itô kernel. 

Compute stationary distribution of cells using the transition matrix. 

Find the state distribution of a Markov process. 

Find the expected returning time. 
Markers and differential expressions

Identify genes with strong spatial autocorrelation with Moran’s I test. 

Find marker genes for each group of cells based on gene expression or velocity values as specified by the layer. 

Find marker genes between two groups of cells based on gene expression or velocity values as specified by the layer. 

Filter cluster deg (Moran’s I test) results and retrieve top markers for each cluster. 

Differential genes expression tests using generalized linear regressions. 
Converter

Convert adata to a loom object. We may save_fig to a temp directory automatically; we may write an on-the-fly converter which doesn't involve saving and reading files. 
Vector field (vf)¶
Vector field reconstruction
Note
The vector field class is used internally by vf.VectorField. See our vector field classes here: vector field

Apply the sparseVFC (vector field consensus) algorithm to learn a functional form of the vector field from random samples with outliers on the entire space, robustly and efficiently. 

Learn a function of high dimensional vector field from sparse single cell samples in the entire space robustly. 
Vector field topology

Map the topography of the single cell vector field in (first) two dimensions. 
Beyond RNA velocity

Calculate the speed for each cell with the reconstructed vector field function. 

Calculate divergence for each cell with the reconstructed vector field function. 

Calculate curl for each cell with the reconstructed vector field function. 

Calculate acceleration for each cell with the reconstructed vector field function. 

Calculate curvature for each cell with the reconstructed vector field function. 

Calculate torsion for each cell with the reconstructed vector field function. 
Acceleration field

Compute RNA acceleration field via reconstructed vector field and project it to low dimensional embeddings. 
Vector field ranking

Rank gene’s absolute, positive, negative speed by different cell groups. 

Rank gene’s absolute, positive, negative divergence by different cell groups. 

Rank gene’s absolute, positive, negative acceleration by different cell groups. 

Rank gene’s absolute, positive, negative curvature by different cell groups. 
Single cell potential: three approaches

Calculate the fixed points of the (learned) vector field function. 

Calculate the gradient of the (learned) vector field function for the least action path (LAP) symbolically. 

Calculate the action of the path based on the (reconstructed) vector field function and diffusion matrix (Eq. 
The diffusion matrix can be variable dependent. 


It calculates the minimized action value given an initial path, ODE, and diffusion matrix. 

Function to map out the pseudopotential landscape. 

A deterministic map of Waddington's epigenetic landscape for cell fate specification (Sudin Bhattacharya, Qiang Zhang and Melvin E. Andersen). 

Align potential values so all pathpotentials end up at same global min and then generate potential surface with interpolation on a grid. 

Calculate the action of a path by path integral, using Wang's method. 

Calculate least action paths based on methods from Jin Wang and colleagues (http://www.pnas.org/cgi/doi/10.1073/pnas.1017017108). 

Calculate the rate to convert from one cell state to another cell state by taking the optimal path. 

Calculate the MFPT (mean first passage time) to convert from one cell state to another cell state by taking the optimal path. 

Map the potential landscape with the algorithm developed by Ao. 

Function to calculate the Q matrix by a least squares method. 
Stochastic processes

Calculate the diffusion matrix from the estimated velocity vectors and the reconstructed vector field. 
Vector field graph

A class for manipulating the graph created from the transition matrix, built from the (reconstructed) vector field. 
Prediction (pd)¶

Predict the historical and future cell transcriptomic states over arbitrary time scales. 

Calculate the lineage (fate) bias of states whose trajectories are predicted. 

Estimate the transition probability between cell types using the method of vector field integration. 
Plotting (pl)¶
Preprocessing

Plot the basic statistics (nGenes, nCounts and pMito) of each category of adata. 

Plot the fraction of each category of data used in the velocity estimation. 

Plot selected feature genes on top of the mean vs. 

Plot the cumulative variance explained by the principal components. 

Plot the (labeled) expression values of genes across different groups (time points). 
Cell cycle staging

Plot a heatmap of cells ordered by cell cycle position 
Scatter base

Plot an embedding as points. 
Phase diagram: conventional scRNAseq

Draw the phase portrait, expression values and velocity on the low dimensional embedding. 
Kinetic models: labeling based scRNAseq

Plot the data and fitting of different metabolic labeling experiments. 
Kinetics

Plot the gene expression dynamics over time (pseudotime or inferred real time) as kinetic curves. 

Plot the gene expression dynamics over time (pseudotime or inferred real time) in a heatmap. 

Plot the gene expression dynamics over time (pseudotime or inferred real time) in a heatmap. 
Dimension reduction

Scatter plot with pca basis. 

Scatter plot with tsne basis. 

Scatter plot with umap basis. 

Scatter plot with trimap basis. 
Neighbor graph

Plot nearest neighbor graph of cells used to embed data into low dimension space. 

Plot a summarized cell type (state) transition graph. 
Vector field plots: velocities and accelerations

Plot the velocity or acceleration vector of each cell. 

Plot the velocity or acceleration vector of each cell on a grid. 

Plot the velocity vector of each cell. 

Visualize vector field with quiver, streamline and line integral convolution (LIC), using velocity estimates on a grid from the associated data. 

Plot the energy and energy change rate over each optimization iteration. 
Vector field topology

Plots the flow field with line thickness proportional to speed. 

Plot fixed points stored in the VectorField2D class. 

Plot nullclines stored in the VectorField2D class. 

Plot separatrix on phase portrait. 

Plots a trajectory on a phase portrait. 

Plot the streamline, fixed points (attractor / saddles), nullcline, separatrices of a recovered dynamic system for single cells. 
Beyond RNA velocity

Scatter plot with cells colored by the estimated velocity speed (and other information if provided). 

Scatter plot with cells colored by the estimated divergence (and other information if provided). 

Scatter plot with cells colored by the estimated curl (and other information if provided). 

Scatter plot with cells colored by the estimated curvature (and other information if provided). 

Scatter plot with pca basis. 

Plot the Jacobian matrix for each cell as a heatmap. 
Potential landscape

Plot the quasi-potential landscape. 
Cell fate

Plot the lineage (fate) bias of cells states whose vector field trajectories are predicted. 
Save figures

Save a figure from pyplot. 
Movie (mv)¶

Animating cell fate commitment prediction via reconstructed vector field function. 

Animating cell fate commitment prediction via reconstructed vector field function. 
Simulation (sim)¶
Simple ODE vector field simulation

The ODE model for the famous Pu.1-Gata.1-like network motif with self-activation and mutual inhibition. 

The ODE model for the neurogenesis system that was used in benchmarking Monocle 2, Scribe and dynamo (here), originally from Xiaojie Qiu et al. 

Right hand side (rhs) for toggle ODEs. 

Network used in the potential landscape paper from Ying et al. 
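The self-activation/mutual-inhibition motif behind these models can be written with Hill kinetics; a hypothetical minimal version (illustrative parameters, not dynamo's calibrated model) shows the bistability that underlies two attracting cell states:

```python
import numpy as np

# Toy two-gene motif with self-activation and mutual inhibition (Hill kinetics).
# Parameters are illustrative, not dynamo's calibrated values.
a, b, k, n = 1.0, 1.0, 1.0, 4

def rhs(x):
    x1, x2 = x
    dx1 = a * x1**n / (1 + x1**n) + b / (1 + x2**n) - k * x1
    dx2 = a * x2**n / (1 + x2**n) + b / (1 + x1**n) - k * x2
    return np.array([dx1, dx2])

# Euler integration: two mirrored initial states land in two attractors.
def integrate(x0, dt=0.01, steps=5000):
    x = np.array(x0, float)
    for _ in range(steps):
        x = x + dt * rhs(x)
    return x

hi_lo = integrate([2.0, 0.1])  # settles with gene 1 high, gene 2 low
lo_hi = integrate([0.1, 2.0])  # the mirror-image attractor
print(np.round(hi_lo, 2), np.round(lo_hi, 2))
```
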
Gillespie simulation

A simulator of RNA dynamics that includes RNA bursting, transcription, metabolic labeling, splicing, translation and RNA/protein degradation. 

Simulate the gene expression dynamics via a deterministic ODE model. 

Sample N points from the dim dimension gene expression space while restricting the values to be between min_val and max_val. 

Function to evaluate the vector field related reference quantities vs. 
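A Gillespie simulation draws exponentially distributed waiting times from the total reaction propensity and picks each reaction with probability proportional to its rate. A minimal one-gene birth/death version (toy rates; far simpler than dynamo's simulator, which also covers bursting, labeling and splicing):

```python
import random

# Minimal Gillespie simulation of one gene: transcription at rate alpha,
# degradation of each transcript at rate gamma.
def gillespie(alpha, gamma, t_end, seed=0):
    rng = random.Random(seed)
    t, x, traj = 0.0, 0, [(0.0, 0)]
    while t < t_end:
        rates = [alpha, gamma * x]        # birth and death propensities
        total = sum(rates)
        t += rng.expovariate(total)       # waiting time to the next reaction
        if rng.random() < rates[0] / total:
            x += 1                        # transcription event
        else:
            x -= 1                        # degradation event
        traj.append((t, x))
    return traj

traj = gillespie(alpha=20.0, gamma=1.0, t_end=200.0)
# The long-run copy number fluctuates around the steady state alpha/gamma = 20.
tail = [x for t, x in traj if t > 50.0]
mean = sum(tail) / len(tail)
print(round(mean, 1))
```
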
External (ext)¶

Modeling Latent Flow Structure using Hodge Decomposition based on the creation of sparse diffusion graph from the reconstructed vector field function. 

Apply Scribe to calculate causal network from spliced/unspliced, metabolic labeling based and other “real” time series datasets. 

Calculate mutual information (as well as Pearson correlation) of genes between two different layers. 

Reconstruction of regulatory networks (Cao et al., Nature Biotechnology, 2020) from TFs to other target genes. 
Utilities¶
Package versions

Adapted from answer 2 in https://stackoverflow.com/questions/40428931/package-for-listing-version-of-packages-used-in-a-jupyter-notebook 
Clean up adata

Clean up adata before saving it to a file. 
Figures configuration

Set resolution/size, styling and format of figures. 

Formatting helper function that can be used to save publishable figures. 
Class¶
Estimation¶
Conventional scRNA-seq (est.csc)¶

class csc.ss_estimation(U=None, Ul=None, S=None, Sl=None, P=None, US=None, S2=None, conn=None, t=None, ind_for_proteins=None, model='stochastic', est_method='gmm', experiment_type='deg', assumption_mRNA=None, assumption_protein='ss', concat_data=True, cores=1, **kwargs)¶

The class that estimates parameters with input data.
 Parameters

- U (ndarray or sparse csr_matrix) – A matrix of unspliced mRNA counts.
- Ul (ndarray or sparse csr_matrix) – A matrix of unspliced, labeled mRNA counts.
- S (ndarray or sparse csr_matrix) – A matrix of spliced mRNA counts.
- Sl (ndarray or sparse csr_matrix) – A matrix of spliced, labeled mRNA counts.
- P (ndarray or sparse csr_matrix) – A matrix of protein counts.
- US (ndarray or sparse csr_matrix) – A matrix of second moments of unspliced/spliced gene expression counts for conventional or NTR velocity.
- S2 (ndarray or sparse csr_matrix) – A matrix of second moments of spliced gene expression counts for conventional or NTR velocity.
- conn (ndarray or sparse csr_matrix) – The connectivity matrix that can be used to calculate the first/second moments of the data.
- t (ndarray) – A vector of time points.
- ind_for_proteins (ndarray) – A 1D vector of the indices in the U, Ul, S, Sl layers that correspond to the row names in the protein or X_protein key of the .obsm attribute.
- experiment_type (str) – Labelling experiment type. Available options are: (1) 'deg': degradation experiment; (2) 'kin': synthesis experiment; (3) 'one-shot': one-shot kinetic experiment; (4) 'mix_std_stm': a mixed steady state and stimulation labeling experiment.
- assumption_mRNA (str) – Parameter estimation assumption for mRNA. Available options are: (1) 'ss': pseudo steady state; (2) None: kinetic data with no assumption.
- assumption_protein (str) – Parameter estimation assumption for protein. Available options are: (1) 'ss': pseudo steady state.
- concat_data (bool (default: True)) – Whether to concatenate data.
- cores (int (default: 1)) – Number of cores to run the estimation. If cores is set to be > 1, multiprocessing will be used to parallelize the parameter estimation.
 Returns

- t (ndarray) – A vector of time points.
- data (dict) – A dictionary with uu, ul, su, sl, p as its keys.
- extyp (str) – Labelling experiment type.
- asspt_mRNA (str) – Parameter estimation assumption for mRNA.
- asspt_prot (str) – Parameter estimation assumption for protein.
- parameters (dict) – A dictionary with alpha, beta, gamma, eta, delta as its keys: alpha, transcription rate; beta, RNA splicing rate; gamma, spliced mRNA degradation rate; eta, translation rate; delta, protein degradation rate.

concatenate_data()¶

Concatenate available data into a single matrix. See "concat_time_series_matrices" for details.

fit(intercept=False, perc_left=None, perc_right=5, clusters=None, one_shot_method='combined')¶

Fit the input data to estimate all or a subset of the parameters.
 Parameters
intercept (bool) – If using steady state assumption for fitting, then: True – the linear regression is performed with an unfixed intercept; False – the linear regression is performed with a fixed zero intercept.
perc_left (float (default: None)) – The percentage of samples included in the linear regression in the left tail. If set to None, then all the samples are included.
perc_right (float (default: 5)) – The percentage of samples included in the linear regression in the right tail. If set to None, then all the samples are included.
clusters (list) – A list of n clusters, each element is a list of indices of the samples which belong to this cluster.

fit_alpha_oneshot
(t, U, beta, clusters=None)¶ Estimate alpha with the one-shot data.
 Parameters
 Returns
alpha – A numpy array with the dimension of n_genes x clusters.
 Return type
ndarray

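The one-shot estimator can be illustrated with the standard first-order kinetics solution. A minimal sketch (not dynamo's implementation; the function name is hypothetical): solving du/dt = alpha - beta*u with u(0) = 0 gives u(t) = alpha/beta * (1 - exp(-beta*t)), which can be inverted for alpha given a single labeling time point.

```python
import numpy as np

def alpha_oneshot(t, u, beta):
    # Invert u(t) = alpha/beta * (1 - exp(-beta*t)), the solution of
    # du/dt = alpha - beta*u with u(0) = 0, for the transcription rate alpha.
    return u * beta / (1 - np.exp(-beta * t))

# Synthetic check: forward-simulate u at t = 1, then invert.
alpha_true, beta, t = 2.0, 0.5, 1.0
u = alpha_true / beta * (1 - np.exp(-beta * t))
alpha_hat = alpha_oneshot(t, u, beta)  # recovers alpha_true
```

Vectorizing over genes (u and beta as arrays) works unchanged thanks to numpy broadcasting.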
fit_beta_gamma_lsq
(t, U, S)¶ Estimate beta and gamma with the degradation data using the least squares method.
 Parameters
t (
ndarray
) – A vector of time points.U (
ndarray
) – A matrix of unspliced mRNA counts. Dimension: genes x cells.S (
ndarray
) – A matrix of spliced mRNA counts. Dimension: genes x cells.
 Returns
beta (
ndarray
) – A vector of betas for all the genes.gamma (
ndarray
) – A vector of gammas for all the genes.u0 (float) – Initial value of u.
s0 (float) – Initial value of s.

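The degradation least-squares estimation above can be sketched with the analytical solutions of the splicing ODEs. This is an illustrative re-implementation under stated assumptions (function names are hypothetical, and dynamo's actual optimizer setup may differ): after transcription stops, du/dt = -beta*u and ds/dt = beta*u - gamma*s have closed-form solutions that can be fit directly.

```python
import numpy as np
from scipy.optimize import least_squares

def deg_solution(t, u0, s0, beta, gamma):
    # Analytical solution of du/dt = -beta*u, ds/dt = beta*u - gamma*s
    # (valid for beta != gamma).
    u = u0 * np.exp(-beta * t)
    s = s0 * np.exp(-gamma * t) + beta * u0 / (gamma - beta) * (
        np.exp(-beta * t) - np.exp(-gamma * t))
    return u, s

def fit_beta_gamma_lsq_sketch(t, u_obs, s_obs):
    # Least-squares fit of (u0, s0, beta, gamma) to degradation time-series data.
    def residual(p):
        u, s = deg_solution(t, *p)
        return np.concatenate([u - u_obs, s - s_obs])
    p0 = [u_obs[0], s_obs[0], 1.0, 0.5]  # initial guess: u0, s0, beta, gamma
    return least_squares(residual, p0, bounds=(1e-6, np.inf)).x

t = np.linspace(0, 5, 20)
u_obs, s_obs = deg_solution(t, 1.0, 1.0, 0.8, 0.3)  # synthetic noiseless data
u0, s0, beta, gamma = fit_beta_gamma_lsq_sketch(t, u_obs, s_obs)
```

Note that beta is identified from the unspliced decay alone, while gamma additionally requires the spliced counts.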
fit_gamma_nosplicing_lsq
(t, L)¶ Estimate gamma with the degradation data using the least squares method when there is no splicing data.
 Parameters
t (
ndarray
) – A vector of time points.L (
ndarray
) – A matrix of labeled mRNA counts. Dimension: genes x cells.
 Returns
gamma (
ndarray
) – A vector of gammas for all the genes.l0 (float) – The estimated value for the initial spliced, labeled mRNA count.

fit_gamma_steady_state
(u, s, intercept=True, perc_left=None, perc_right=5, normalize=True)¶ Estimate gamma using linear regression based on the steady state assumption.
 Parameters
u (
ndarray
or sparse csr_matrix) – A matrix of unspliced mRNA counts. Dimension: genes x cells.s (
ndarray
or sparse csr_matrix) – A matrix of spliced mRNA counts. Dimension: genes x cells.intercept (bool) – If using steady state assumption for fitting, then: True – the linear regression is performed with an unfixed intercept; False – the linear regression is performed with a fixed zero intercept.
perc_left (float) – The percentage of samples included in the linear regression in the left tail. If set to None, then all the left samples are excluded.
perc_right (float) – The percentage of samples included in the linear regression in the right tail. If set to None, then all the samples are included.
normalize (bool) – Whether to first normalize the data.
 Returns
k (float) – The slope of the linear regression model, which is gamma under the steady state assumption.
b (float) – The intercept of the linear regression model.
r2 (float) – Coefficient of determination or r square for the extreme data points.
all_r2 (float) – Coefficient of determination or r square for all data points.

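The steady-state regression above can be sketched in a few lines. This is a minimal illustration (not dynamo's implementation; the function name is hypothetical): at steady state u ≈ (gamma/beta)·s, so the slope of a zero-intercept regression of u on the "extreme" cells (highest-expression tail) estimates gamma in units where beta = 1.

```python
import numpy as np

def fit_gamma_steady_state_sketch(u, s, perc_right=5):
    # Select extreme cells: top perc_right percent of spliced expression.
    thresh = np.percentile(s, 100 - perc_right)
    mask = s >= thresh
    # Zero-intercept least squares: k = <u*s> / <s*s> on the extreme cells.
    return np.sum(u[mask] * s[mask]) / np.sum(s[mask] ** 2)

rng = np.random.default_rng(0)
s = rng.uniform(0, 10, 1000)
u = 0.4 * s  # noiseless steady state with gamma/beta = 0.4
k = fit_gamma_steady_state_sketch(u, s)  # recovers 0.4
```

With noisy data, restricting to extreme cells reduces the bias from cells that have not reached steady state.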
fit_gamma_stochastic
(est_method, u, s, us, ss, perc_left=None, perc_right=5, normalize=True)¶ Estimate gamma using GMM (generalized method of moments) or the negative binomial (negbin) distribution, based on the steady state assumption.
 Parameters
est_method (str {‘gmm’, ‘negbin’}) – The estimation method to be used when using the stochastic model.
Available options when the model is ‘ss’ include:
(1) ‘gmm’: a generalized method of moments approach based on master equations, similar to the “moment” model in the excellent scVelo package; (2) ‘negbin’: a method that models steady state RNA expression as a negative binomial distribution, also built upon master equations. Note that these methods require using extreme data points (except negbin, which uses all data points) for estimation. Extreme data points are defined as data from cells whose expression of unspliced / spliced or new / total RNA, etc. is in the top or bottom 5%, for example. linear_regression only considers the mean of RNA species (based on the deterministic ordinary differential equations), while moment-based methods (gmm, negbin) consider both the first moment (mean) and the second moment (uncentered variance) of RNA species (based on the stochastic master equations). The above methods are all (generalized) linear regression based. In addition to the estimated parameters (including RNA half-life), R-squared (either just for extreme data points or all data points) and the log-likelihood of the fitting are returned, which will be used for the transition matrix and velocity embedding. All est_method options use least squares to estimate optimal parameters, with a Latin hypercube sampler for initial sampling.
u (
ndarray
or sparse csr_matrix) – A matrix of unspliced mRNA counts. Dimension: genes x cells.s (
ndarray
or sparse csr_matrix) – A matrix of spliced mRNA counts. Dimension: genes x cells.us (
ndarray
or sparse csr_matrix) – A matrix of uncentered second moments <u·s> of unspliced/spliced mRNA. Dimension: genes x cells.ss (
ndarray
or sparse csr_matrix) – A matrix of uncentered second moments <s·s> of spliced mRNA. Dimension: genes x cells.perc_left (float) – The percentage of samples included in the linear regression in the left tail. If set to None, then all the left samples are excluded.
perc_right (float) – The percentage of samples included in the linear regression in the right tail. If set to None, then all the samples are included.
normalize (bool) – Whether to first normalize the data.
 Returns
k (float) – The slope of the linear regression model, which is gamma under the steady state assumption.
b (float) – The intercept of the linear regression model.
r2 (float) – Coefficient of determination or r square for the extreme data points.
all_r2 (float) – Coefficient of determination or r square for all data points.

get_exist_data_names
()¶ Get the names of all the data that are not ‘None’.

get_n_genes
(key=None, data=None)¶ Get the number of genes.

set_parameter
(name, value)¶ Set the value for the specified parameter.
 Parameters
name (string) – The name of the parameter. E.g. ‘beta’.
value (
ndarray
) – A vector of values for the parameter to be set to.

solve_alpha_mix_std_stm
(t, ul, beta, clusters=None, alpha_time_dependent=True)¶ Estimate the steady state transcription rate and analytically calculate the stimulation transcription rate given beta and steady state alpha for a mixed steady state and stimulation labeling experiment.
This approach assumes the same constant beta or gamma for both the steady state and stimulation periods.
 Parameters
t (list or numpy.ndarray) – Time period for stimulation state labeling for each cell.
ul – A vector of labeled RNA amounts in each cell.
beta (numpy.ndarray) – A vector of splicing rates for genes.
clusters (list) – A list of n clusters, each element is a list of indices of the samples which belong to this cluster.
alpha_time_dependent (bool) – Whether or not to model the stimulation alpha rate as a time dependent variable.
 Returns
alpha_std, alpha_stm – The constant steady state transcription rate (alpha_std) and the time-dependent or time-independent (determined by alpha_time_dependent) stimulation transcription rate (alpha_stm).
 Return type
numpy.ndarray, numpy.ndarray

class
csc.
velocity
(alpha=None, beta=None, gamma=None, eta=None, delta=None, t=None, estimation=None)¶ The class that computes RNA/protein velocity given the kinetic parameters.
 Parameters
alpha (
ndarray
) – A matrix of transcription rate.beta (
ndarray
) – A vector of splicing rate constant for each gene.gamma (
ndarray
) – A vector of spliced mRNA degradation rate constant for each gene.eta (
ndarray
) – A vector of protein synthesis rate constant for each gene.delta (
ndarray
) – A vector of protein degradation rate constant for each gene.t (
ndarray
or None (default: None)) – A vector of the measured time points for cells.estimation (
ss_estimation
) – An instance of the estimation class. If this is not None, the parameters will be taken from this class instead of the input arguments.

get_n_cells
()¶ Get the number of cells if the parameter alpha is given.
 Returns
n_cells – The second dimension of the alpha matrix, if alpha is given.
 Return type
int

get_n_genes
()¶ Get the number of genes.
 Returns
n_genes – The first dimension of the alpha matrix, if alpha is given. Or, the length of beta, gamma, eta, or delta, if they are given.
 Return type
int

vel_p
(S, P)¶ Calculate the protein velocity.
 Parameters
S (
ndarray
or sparse csr_matrix) – A matrix of spliced mRNA counts. Dimension: genes x cells.P (
ndarray
or sparse csr_matrix) – A matrix of protein counts. Dimension: genes x cells.
 Returns
V – Each column of V is a velocity vector for the corresponding cell. Dimension: genes x cells.
 Return type
ndarray
or sparse csr_matrix

vel_s
(U, S)¶ Calculate the spliced mRNA velocity.
 Parameters
U (
ndarray
or sparse csr_matrix) – A matrix of unspliced mRNA counts. Dimension: genes x cells.S (
ndarray
or sparse csr_matrix) – A matrix of spliced mRNA counts. Dimension: genes x cells.
 Returns
V – Each column of V is a velocity vector for the corresponding cell. Dimension: genes x cells.
 Return type
ndarray
or sparse csr_matrix

vel_u
(U)¶ Calculate the unspliced mRNA velocity.
 Parameters
U (
ndarray
or sparse csr_matrix) – A matrix of unspliced mRNA counts. Dimension: genes x cells.
 Returns
V – Each column of V is a velocity vector for the corresponding cell. Dimension: genes x cells.
 Return type
ndarray
or sparse csr_matrix
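The three velocity methods above follow the kinetic model whose parameters are listed in the class docstring. A minimal dense-ndarray sketch (not dynamo's implementation: parameters are passed explicitly here rather than stored on the class, and sparse csr_matrix inputs would need scipy.sparse handling):

```python
import numpy as np

def vel_u(alpha, beta, U):
    # Unspliced velocity: transcription minus splicing.
    return alpha - beta[:, None] * U

def vel_s(beta, gamma, U, S):
    # Spliced velocity: splicing minus spliced-mRNA degradation.
    return beta[:, None] * U - gamma[:, None] * S

def vel_p(eta, delta, S, P):
    # Protein velocity: translation minus protein degradation.
    return eta[:, None] * S - delta[:, None] * P

alpha = np.full((2, 3), 1.0)                 # genes x cells transcription rates
beta = np.array([0.5, 0.5])                  # per-gene splicing rates
gamma = np.array([0.2, 0.2])                 # per-gene degradation rates
U = np.ones((2, 3)); S = np.ones((2, 3))
V = vel_s(beta, gamma, U, S)                 # genes x cells velocity matrix
```

Each column of V is the velocity vector of one cell, matching the Returns description above.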
Time-resolved metabolic labeling based scRNA-seq (est.tsc)¶
Base class: a general estimation framework

class
tsc.
kinetic_estimation
(param_ranges, x0_ranges, simulator)¶ A general parameter estimation framework for all types of time-series data.
 Parameters
param_ranges (
ndarray
) – An n-by-2 numpy array containing the lower and upper ranges of n parameters (and initial conditions if not fixed).x0_ranges (
ndarray
) – Lower and upper bounds for initial conditions for the integrators. To fix a parameter, set its lower and upper bounds to the same value.simulator (
utils_kinetic.Linear_ODE
) – An instance of a python class which solves ODEs. It should have properties ‘t’ (k time points, 1d numpy array), ‘x0’ (initial conditions for m species, 1d numpy array), and ‘x’ (solution, k-by-m array), as well as two functions: integrate (numerical integration) and solve (analytical method).

fit_lsq
(t, x_data, p0=None, n_p0=1, bounds=None, sample_method='lhs', method=None, normalize=True)¶ Fit time-series data using least squares.
 Parameters
t (
ndarray
) – A numpy array of n time points.x_data (
ndarray
) – An m-by-n numpy array of m species, each having n values for the n time points.p0 (
numpy.ndarray
, optional, default: None) – Initial guesses of parameters. If None, a random number is generated within the bounds.n_p0 (int, optional, default: 1) – Number of initial guesses.
bounds (tuple, optional, default: None) – Lower and upper bounds for parameters.
sample_method (str, optional, default: lhs) – Method used for sampling initial guesses of parameters: lhs: latin hypercube sampling; uniform: uniform random sampling.
method (str or None, optional, default: None) – Method used for solving ODEs. See options in simulator classes. If None, default method is used.
normalize (bool, optional, default: True) – Whether or not to normalize values in x_data across species, so that large values do not dominate the optimizer.
 Returns
popt (
ndarray
) – Optimal parameters.cost (float) – The cost function evaluated at the optimum.

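The multi-start strategy described by the p0/n_p0/sample_method parameters can be sketched as follows. This is an illustrative re-implementation (function name and model are hypothetical, not dynamo's code): draw n_p0 initial guesses inside the bounds with Latin hypercube sampling, run least squares from each, and keep the lowest-cost fit.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.stats import qmc

def fit_lsq_sketch(t, x_data, model, bounds, n_p0=8, seed=0):
    # Multi-start least squares with Latin hypercube sampling of p0.
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    sampler = qmc.LatinHypercube(d=len(lo), seed=seed)
    p0s = qmc.scale(sampler.random(n_p0), lo, hi)  # scale [0,1)^d into bounds
    best = None
    for p0 in p0s:
        res = least_squares(lambda p: model(t, p) - x_data, p0, bounds=(lo, hi))
        if best is None or res.cost < best.cost:
            best = res
    return best.x, best.cost

# Toy model: exponential decay x(t) = x0 * exp(-gamma*t), p = (x0, gamma).
model = lambda t, p: p[0] * np.exp(-p[1] * t)
t = np.linspace(0, 4, 15)
x_data = model(t, (2.0, 0.7))
popt, cost = fit_lsq_sketch(t, x_data, model, ([0.1, 0.01], [5.0, 5.0]))
```

Multiple starts guard against local minima in nonlinear kinetic models; the returned popt/cost pair mirrors the Returns section above.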
test_chi2
(t, x_data, species=None, method='matrix', normalize=True)¶ Perform a Pearson’s chi-square test. The statistic is computed as: sum_i (O_i - E_i)^2 / E_i, where O_i is the data and E_i is the model prediction.
The data can be either 1. stratified moments: ‘t’ is an array of k distinct time points, ‘x_data’ is an m-by-k matrix of data, where m is the number of species; or 2. raw data: ‘t’ is an array of k time points for k cells, ‘x_data’ is an m-by-k matrix of data, where m is the number of species. Note that if the method is ‘numerical’, t has to be monotonically increasing.
If not all species are included in the data, use ‘species’ to specify the species of interest.
 Returns
p (float)
The p-value of a one-tailed chi-square test.
c2 (float)
The chi-square statistic.
df (int)
Degrees of freedom.
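The test statistic and p-value can be computed directly. A minimal sketch (not dynamo's implementation; the choice df = number of data points minus number of fitted parameters is a common convention and an assumption here):

```python
import numpy as np
from scipy.stats import chi2

def chi2_test_sketch(obs, exp, n_params):
    # Pearson chi-square statistic: sum_i (O_i - E_i)^2 / E_i.
    c2 = float(np.sum((obs - exp) ** 2 / exp))
    df = obs.size - n_params            # assumed degrees-of-freedom convention
    p = chi2.sf(c2, df)                 # one-tailed p-value (survival function)
    return p, c2, df

obs = np.array([10.0, 8.0, 6.0, 4.0])   # observed data
exp = np.array([9.5, 8.2, 6.3, 3.8])    # model predictions
p, c2, df = chi2_test_sketch(obs, exp, n_params=2)
```

A large p-value indicates the model predictions are consistent with the data.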
Deterministic models via analytical solution of ODEs

class
tsc.
Estimation_DeterministicDeg
(beta=None, gamma=None, x0=None)¶ An estimation class for degradation (with splicing) experiments. Order of species: <unspliced>, <spliced>

fit_lsq
(t, x_data, p0=None, n_p0=1, bounds=None, sample_method='lhs', method=None, normalize=True)¶ Fit time-series data using least squares.
 Parameters
t (
ndarray
) – A numpy array of n time points.x_data (
ndarray
) – An m-by-n numpy array of m species, each having n values for the n time points.p0 (
numpy.ndarray
, optional, default: None) – Initial guesses of parameters. If None, a random number is generated within the bounds.n_p0 (int, optional, default: 1) – Number of initial guesses.
bounds (tuple, optional, default: None) – Lower and upper bounds for parameters.
sample_method (str, optional, default: lhs) – Method used for sampling initial guesses of parameters: lhs: latin hypercube sampling; uniform: uniform random sampling.
method (str or None, optional, default: None) – Method used for solving ODEs. See options in simulator classes. If None, default method is used.
normalize (bool, optional, default: True) – Whether or not to normalize values in x_data across species, so that large values do not dominate the optimizer.
 Returns
popt (
ndarray
) – Optimal parameters.cost (float) – The cost function evaluated at the optimum.

test_chi2
(t, x_data, species=None, method='matrix', normalize=True)¶ Perform a Pearson’s chi-square test. The statistic is computed as: sum_i (O_i - E_i)^2 / E_i, where O_i is the data and E_i is the model prediction.
The data can be either 1. stratified moments: ‘t’ is an array of k distinct time points, ‘x_data’ is an m-by-k matrix of data, where m is the number of species; or 2. raw data: ‘t’ is an array of k time points for k cells, ‘x_data’ is an m-by-k matrix of data, where m is the number of species. Note that if the method is ‘numerical’, t has to be monotonically increasing.
If not all species are included in the data, use ‘species’ to specify the species of interest.
 Returns
p (float)
The p-value of a one-tailed chi-square test.
c2 (float)
The chi-square statistic.
df (int)
Degrees of freedom.


class
tsc.
Estimation_DeterministicDegNosp
(gamma=None, x0=None)¶ An estimation class for degradation (without splicing) experiments.

fit_lsq
(t, x_data, p0=None, n_p0=1, bounds=None, sample_method='lhs', method=None, normalize=True)¶ Fit time-series data using least squares.
 Parameters
t (
ndarray
) – A numpy array of n time points.x_data (
ndarray
) – An m-by-n numpy array of m species, each having n values for the n time points.p0 (
numpy.ndarray
, optional, default: None) – Initial guesses of parameters. If None, a random number is generated within the bounds.n_p0 (int, optional, default: 1) – Number of initial guesses.
bounds (tuple, optional, default: None) – Lower and upper bounds for parameters.
sample_method (str, optional, default: lhs) – Method used for sampling initial guesses of parameters: lhs: latin hypercube sampling; uniform: uniform random sampling.
method (str or None, optional, default: None) – Method used for solving ODEs. See options in simulator classes. If None, default method is used.
normalize (bool, optional, default: True) – Whether or not to normalize values in x_data across species, so that large values do not dominate the optimizer.
 Returns
popt (
ndarray
) – Optimal parameters.cost (float) – The cost function evaluated at the optimum.

test_chi2
(t, x_data, species=None, method='matrix', normalize=True)¶ Perform a Pearson’s chi-square test. The statistic is computed as: sum_i (O_i - E_i)^2 / E_i, where O_i is the data and E_i is the model prediction.
The data can be either 1. stratified moments: ‘t’ is an array of k distinct time points, ‘x_data’ is an m-by-k matrix of data, where m is the number of species; or 2. raw data: ‘t’ is an array of k time points for k cells, ‘x_data’ is an m-by-k matrix of data, where m is the number of species. Note that if the method is ‘numerical’, t has to be monotonically increasing.
If not all species are included in the data, use ‘species’ to specify the species of interest.
 Returns
p (float)
The p-value of a one-tailed chi-square test.
c2 (float)
The chi-square statistic.
df (int)
Degrees of freedom.


class
tsc.
Estimation_DeterministicKinNosp
(alpha, gamma, x0=0)¶ An estimation class for kinetics (without splicing) experiments with the deterministic model. Order of species: <r>

fit_lsq
(t, x_data, p0=None, n_p0=1, bounds=None, sample_method='lhs', method=None, normalize=True)¶ Fit time-series data using least squares.
 Parameters
t (
ndarray
) – A numpy array of n time points.x_data (
ndarray
) – An m-by-n numpy array of m species, each having n values for the n time points.p0 (
numpy.ndarray
, optional, default: None) – Initial guesses of parameters. If None, a random number is generated within the bounds.n_p0 (int, optional, default: 1) – Number of initial guesses.
bounds (tuple, optional, default: None) – Lower and upper bounds for parameters.
sample_method (str, optional, default: lhs) – Method used for sampling initial guesses of parameters: lhs: latin hypercube sampling; uniform: uniform random sampling.
method (str or None, optional, default: None) – Method used for solving ODEs. See options in simulator classes. If None, default method is used.
normalize (bool, optional, default: True) – Whether or not to normalize values in x_data across species, so that large values do not dominate the optimizer.
 Returns
popt (
ndarray
) – Optimal parameters.cost (float) – The cost function evaluated at the optimum.

test_chi2
(t, x_data, species=None, method='matrix', normalize=True)¶ Perform a Pearson’s chi-square test. The statistic is computed as: sum_i (O_i - E_i)^2 / E_i, where O_i is the data and E_i is the model prediction.
The data can be either 1. stratified moments: ‘t’ is an array of k distinct time points, ‘x_data’ is an m-by-k matrix of data, where m is the number of species; or 2. raw data: ‘t’ is an array of k time points for k cells, ‘x_data’ is an m-by-k matrix of data, where m is the number of species. Note that if the method is ‘numerical’, t has to be monotonically increasing.
If not all species are included in the data, use ‘species’ to specify the species of interest.
 Returns
p (float)
The p-value of a one-tailed chi-square test.
c2 (float)
The chi-square statistic.
df (int)
Degrees of freedom.


class
tsc.
Estimation_DeterministicKin
(alpha, beta, gamma, x0=array([0.0, 0.0]))¶ An estimation class for kinetics experiments with the deterministic model. Order of species: <unspliced>, <spliced>

fit_lsq
(t, x_data, p0=None, n_p0=1, bounds=None, sample_method='lhs', method=None, normalize=True)¶ Fit time-series data using least squares.
 Parameters
t (
ndarray
) – A numpy array of n time points.x_data (
ndarray
) – An m-by-n numpy array of m species, each having n values for the n time points.p0 (
numpy.ndarray
, optional, default: None) – Initial guesses of parameters. If None, a random number is generated within the bounds.n_p0 (int, optional, default: 1) – Number of initial guesses.
bounds (tuple, optional, default: None) – Lower and upper bounds for parameters.
sample_method (str, optional, default: lhs) – Method used for sampling initial guesses of parameters: lhs: latin hypercube sampling; uniform: uniform random sampling.
method (str or None, optional, default: None) – Method used for solving ODEs. See options in simulator classes. If None, default method is used.
normalize (bool, optional, default: True) – Whether or not to normalize values in x_data across species, so that large values do not dominate the optimizer.
 Returns
popt (
ndarray
) – Optimal parameters.cost (float) – The cost function evaluated at the optimum.

test_chi2
(t, x_data, species=None, method='matrix', normalize=True)¶ Perform a Pearson’s chi-square test. The statistic is computed as: sum_i (O_i - E_i)^2 / E_i, where O_i is the data and E_i is the model prediction.
The data can be either 1. stratified moments: ‘t’ is an array of k distinct time points, ‘x_data’ is an m-by-k matrix of data, where m is the number of species; or 2. raw data: ‘t’ is an array of k time points for k cells, ‘x_data’ is an m-by-k matrix of data, where m is the number of species. Note that if the method is ‘numerical’, t has to be monotonically increasing.
If not all species are included in the data, use ‘species’ to specify the species of interest.
 Returns
p (float)
The p-value of a one-tailed chi-square test.
c2 (float)
The chi-square statistic.
df (int)
Degrees of freedom.

Stochastic models via matrix form of moment equations

class
tsc.
Estimation_MomentDeg
(beta=None, gamma=None, x0=None, include_cov=True)¶ An estimation class for degradation (with splicing) experiments. Order of species: <unspliced>, <spliced>, <uu>, <ss>, <us> Order of parameters: beta, gamma

fit_lsq
(t, x_data, p0=None, n_p0=1, bounds=None, sample_method='lhs', method=None, normalize=True)¶ Fit time-series data using least squares.
 Parameters
t (
ndarray
) – A numpy array of n time points.x_data (
ndarray
) – An m-by-n numpy array of m species, each having n values for the n time points.p0 (
numpy.ndarray
, optional, default: None) – Initial guesses of parameters. If None, a random number is generated within the bounds.n_p0 (int, optional, default: 1) – Number of initial guesses.
bounds (tuple, optional, default: None) – Lower and upper bounds for parameters.
sample_method (str, optional, default: lhs) – Method used for sampling initial guesses of parameters: lhs: latin hypercube sampling; uniform: uniform random sampling.
method (str or None, optional, default: None) – Method used for solving ODEs. See options in simulator classes. If None, default method is used.
normalize (bool, optional, default: True) – Whether or not to normalize values in x_data across species, so that large values do not dominate the optimizer.
 Returns
popt (
ndarray
) – Optimal parameters.cost (float) – The cost function evaluated at the optimum.

test_chi2
(t, x_data, species=None, method='matrix', normalize=True)¶ Perform a Pearson’s chi-square test. The statistic is computed as: sum_i (O_i - E_i)^2 / E_i, where O_i is the data and E_i is the model prediction.
The data can be either 1. stratified moments: ‘t’ is an array of k distinct time points, ‘x_data’ is an m-by-k matrix of data, where m is the number of species; or 2. raw data: ‘t’ is an array of k time points for k cells, ‘x_data’ is an m-by-k matrix of data, where m is the number of species. Note that if the method is ‘numerical’, t has to be monotonically increasing.
If not all species are included in the data, use ‘species’ to specify the species of interest.
 Returns
p (float)
The p-value of a one-tailed chi-square test.
c2 (float)
The chi-square statistic.
df (int)
Degrees of freedom.


class
tsc.
Estimation_MomentDegNosp
(gamma=None, x0=None)¶ An estimation class for degradation (without splicing) experiments. Order of species: <r>, <rr>

fit_lsq
(t, x_data, p0=None, n_p0=1, bounds=None, sample_method='lhs', method=None, normalize=True)¶ Fit time-series data using least squares.
 Parameters
t (
ndarray
) – A numpy array of n time points.x_data (
ndarray
) – An m-by-n numpy array of m species, each having n values for the n time points.p0 (
numpy.ndarray
, optional, default: None) – Initial guesses of parameters. If None, a random number is generated within the bounds.n_p0 (int, optional, default: 1) – Number of initial guesses.
bounds (tuple, optional, default: None) – Lower and upper bounds for parameters.
sample_method (str, optional, default: lhs) – Method used for sampling initial guesses of parameters: lhs: latin hypercube sampling; uniform: uniform random sampling.
method (str or None, optional, default: None) – Method used for solving ODEs. See options in simulator classes. If None, default method is used.
normalize (bool, optional, default: True) – Whether or not to normalize values in x_data across species, so that large values do not dominate the optimizer.
 Returns
popt (
ndarray
) – Optimal parameters.cost (float) – The cost function evaluated at the optimum.

test_chi2
(t, x_data, species=None, method='matrix', normalize=True)¶ Perform a Pearson’s chi-square test. The statistic is computed as: sum_i (O_i - E_i)^2 / E_i, where O_i is the data and E_i is the model prediction.
The data can be either 1. stratified moments: ‘t’ is an array of k distinct time points, ‘x_data’ is an m-by-k matrix of data, where m is the number of species; or 2. raw data: ‘t’ is an array of k time points for k cells, ‘x_data’ is an m-by-k matrix of data, where m is the number of species. Note that if the method is ‘numerical’, t has to be monotonically increasing.
If not all species are included in the data, use ‘species’ to specify the species of interest.
 Returns
p (float)
The p-value of a one-tailed chi-square test.
c2 (float)
The chi-square statistic.
df (int)
Degrees of freedom.


class
tsc.
Estimation_MomentKin
(a, b, alpha_a, alpha_i, beta, gamma, include_cov=True)¶ An estimation class for kinetics experiments. Order of species: <unspliced>, <spliced>, <uu>, <ss>, <us>

fit_lsq
(t, x_data, p0=None, n_p0=1, bounds=None, sample_method='lhs', method=None, normalize=True)¶ Fit time-series data using least squares.
 Parameters
t (
ndarray
) – A numpy array of n time points.x_data (
ndarray
) – An m-by-n numpy array of m species, each having n values for the n time points.p0 (
numpy.ndarray
, optional, default: None) – Initial guesses of parameters. If None, a random number is generated within the bounds.n_p0 (int, optional, default: 1) – Number of initial guesses.
bounds (tuple, optional, default: None) – Lower and upper bounds for parameters.
sample_method (str, optional, default: lhs) – Method used for sampling initial guesses of parameters: lhs: latin hypercube sampling; uniform: uniform random sampling.
method (str or None, optional, default: None) – Method used for solving ODEs. See options in simulator classes. If None, default method is used.
normalize (bool, optional, default: True) – Whether or not to normalize values in x_data across species, so that large values do not dominate the optimizer.
 Returns
popt (
ndarray
) – Optimal parameters.cost (float) – The cost function evaluated at the optimum.

test_chi2
(t, x_data, species=None, method='matrix', normalize=True)¶ Perform a Pearson’s chi-square test. The statistic is computed as: sum_i (O_i - E_i)^2 / E_i, where O_i is the data and E_i is the model prediction.
The data can be either 1. stratified moments: ‘t’ is an array of k distinct time points, ‘x_data’ is an m-by-k matrix of data, where m is the number of species; or 2. raw data: ‘t’ is an array of k time points for k cells, ‘x_data’ is an m-by-k matrix of data, where m is the number of species. Note that if the method is ‘numerical’, t has to be monotonically increasing.
If not all species are included in the data, use ‘species’ to specify the species of interest.
 Returns
p (float)
The p-value of a one-tailed chi-square test.
c2 (float)
The chi-square statistic.
df (int)
Degrees of freedom.


class
tsc.
Estimation_MomentKinNosp
(a, b, alpha_a, alpha_i, gamma)¶ An estimation class for kinetics experiments. Order of species: <r>, <rr>

fit_lsq
(t, x_data, p0=None, n_p0=1, bounds=None, sample_method='lhs', method=None, normalize=True)¶ Fit time-series data using least squares.
 Parameters
t (
ndarray
) – A numpy array of n time points.x_data (
ndarray
) – An m-by-n numpy array of m species, each having n values for the n time points.p0 (
numpy.ndarray
, optional, default: None) – Initial guesses of parameters. If None, a random number is generated within the bounds.n_p0 (int, optional, default: 1) – Number of initial guesses.
bounds (tuple, optional, default: None) – Lower and upper bounds for parameters.
sample_method (str, optional, default: lhs) – Method used for sampling initial guesses of parameters: lhs: latin hypercube sampling; uniform: uniform random sampling.
method (str or None, optional, default: None) – Method used for solving ODEs. See options in simulator classes. If None, default method is used.
normalize (bool, optional, default: True) – Whether or not to normalize values in x_data across species, so that large values do not dominate the optimizer.
 Returns
popt (
ndarray
) – Optimal parameters.cost (float) – The cost function evaluated at the optimum.

test_chi2
(t, x_data, species=None, method='matrix', normalize=True)¶ Perform a Pearson’s chi-square test. The statistic is computed as: sum_i (O_i - E_i)^2 / E_i, where O_i is the data and E_i is the model prediction.
The data can be either 1. stratified moments: ‘t’ is an array of k distinct time points, ‘x_data’ is an m-by-k matrix of data, where m is the number of species; or 2. raw data: ‘t’ is an array of k time points for k cells, ‘x_data’ is an m-by-k matrix of data, where m is the number of species. Note that if the method is ‘numerical’, t has to be monotonically increasing.
If not all species are included in the data, use ‘species’ to specify the species of interest.
 Returns
p (float)
The p-value of a one-tailed chi-square test.
c2 (float)
The chi-square statistic.
df (int)
Degrees of freedom.

Mixture models for kinetic / degradation experiments

class
tsc.
Lambda_NoSwitching
(model1, model2, alpha=None, lambd=None, gamma=None, x0=None, beta=None)¶ An estimation class with the mixture model. If beta is None, it is assumed that the data does not have the splicing process.

fit_lsq
(t, x_data, p0=None, n_p0=1, bounds=None, sample_method='lhs', method=None, normalize=True)¶ Fit time-series data using least squares.
 Parameters
t (
ndarray
) – A numpy array of n time points.x_data (
ndarray
) – An m-by-n numpy array of m species, each having n values for the n time points.p0 (
numpy.ndarray
, optional, default: None) – Initial guesses of parameters. If None, a random number is generated within the bounds.n_p0 (int, optional, default: 1) – Number of initial guesses.
bounds (tuple, optional, default: None) – Lower and upper bounds for parameters.
sample_method (str, optional, default: lhs) – Method used for sampling initial guesses of parameters: lhs: latin hypercube sampling; uniform: uniform random sampling.
method (str or None, optional, default: None) – Method used for solving ODEs. See options in simulator classes. If None, default method is used.
normalize (bool, optional, default: True) – Whether or not to normalize values in x_data across species, so that large values do not dominate the optimizer.
 Returns
popt (
ndarray
) – Optimal parameters.cost (float) – The cost function evaluated at the optimum.

test_chi2
(t, x_data, species=None, method='matrix', normalize=True)¶ Perform a Pearson’s chi-square test. The statistic is computed as: sum_i (O_i - E_i)^2 / E_i, where O_i is the data and E_i is the model prediction.
The data can be either 1. stratified moments: ‘t’ is an array of k distinct time points, ‘x_data’ is an m-by-k matrix of data, where m is the number of species; or 2. raw data: ‘t’ is an array of k time points for k cells, ‘x_data’ is an m-by-k matrix of data, where m is the number of species. Note that if the method is ‘numerical’, t has to be monotonically increasing.
If not all species are included in the data, use ‘species’ to specify the species of interest.
 Returns
p (float)
The p-value of a one-tailed chi-square test.
c2 (float)
The chi-square statistic.
df (int)
Degrees of freedom.


class
tsc.
Mixture_KinDeg_NoSwitching
(model1, model2, alpha=None, gamma=None, x0=None, beta=None)¶ An estimation class with the mixture model. If beta is None, it is assumed that the data does not have the splicing process.

fit_lsq
(t, x_data, p0=None, n_p0=1, bounds=None, sample_method='lhs', method=None, normalize=True)¶ Fit time-series data using least squares.
 Parameters
t (ndarray) – A numpy array of n time points.
x_data (ndarray) – An m-by-n numpy array of m species, each having n values for the n time points.
p0 (numpy.ndarray, optional, default: None) – Initial guesses of parameters. If None, a random number is generated within the bounds.
n_p0 (int, optional, default: 1) – Number of initial guesses.
bounds (tuple, optional, default: None) – Lower and upper bounds for parameters.
sample_method (str, optional, default: lhs) – Method used for sampling initial guesses of parameters: lhs: Latin hypercube sampling; uniform: uniform random sampling.
method (str or None, optional, default: None) – Method used for solving ODEs. See options in simulator classes. If None, the default method is used.
normalize (bool, optional, default: True) – Whether or not to normalize values in x_data across species, so that large values do not dominate the optimizer.
 Returns
popt (ndarray) – Optimal parameters.
cost (float) – The cost function evaluated at the optimum.
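The multi-start strategy behind this kind of least-squares fit can be sketched as follows; the exponential-decay model and the `fit_lsq_sketch` helper are illustrative assumptions, not dynamo's implementation:

```python
# A hedged sketch of the multi-start least-squares strategy: sample several
# initial parameter guesses within the bounds, optimize each, keep the best.
# The exponential-decay model and `fit_lsq_sketch` are illustrative only.
import numpy as np
from scipy.optimize import least_squares

def fit_lsq_sketch(t, x_data, model, bounds, n_p0=5, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    best = None
    for _ in range(n_p0):
        # uniform random initial guess; dynamo defaults to Latin hypercube sampling
        p0 = lo + rng.random(lo.size) * (hi - lo)
        res = least_squares(lambda p: model(t, p) - x_data, p0, bounds=(lo, hi))
        if best is None or res.cost < best.cost:
            best = res
    return best.x, best.cost

# toy usage: recover x0 and gamma from noise-free x(t) = x0 * exp(-gamma * t)
t = np.linspace(0, 5, 50)
x = 2.0 * np.exp(-0.7 * t)
popt, cost = fit_lsq_sketch(t, x, lambda t, p: p[0] * np.exp(-p[1] * t),
                            bounds=([0, 0], [10, 10]))
```

Running several starts guards against the optimizer getting stuck in a local minimum of the cost surface.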

test_chi2
(t, x_data, species=None, method='matrix', normalize=True)¶ Perform a Pearson’s chi-square test. The statistic is computed as: sum_i (O_i - E_i)^2 / E_i, where O_i is the data and E_i is the model prediction.
The data can be either 1. stratified moments: ‘t’ is an array of k distinct time points, ‘x_data’ is an m-by-k matrix of data, where m is the number of species; or 2. raw data: ‘t’ is an array of k time points for k cells, ‘x_data’ is an m-by-k matrix of data, where m is the number of species. Note that if the method is ‘numerical’, t has to be monotonically increasing.
If not all species are included in the data, use ‘species’ to specify the species of interest.
 Returns
p (float)
The p-value of a one-tailed chi-square test.
c2 (float)
The chi-square statistic.
df (int)
Degrees of freedom.

Vector field¶
Vector field class¶

class
dynamo.vf.
vectorfield
(X=None, V=None, Grid=None, **kwargs)[source]¶ Initialize the VectorField class.
 Parameters
X ('np.ndarray' (dimension: n_obs x n_features)) – Original data.
V ('np.ndarray' (dimension: n_obs x n_features)) – Velocities of cells in the same order and dimension of X.
Grid ('np.ndarray') – The function that returns the diffusion matrix, which can be dependent on the variables (for example, genes).
M ('int' (default: None)) – The number of basis functions used to approximate the vector field. By default it is calculated as min(len(X), int(1500 * np.log(len(X)) / (np.log(len(X)) + np.log(100)))), so that any dataset with fewer than about 900 data points (cells) uses the full data for vector field reconstruction, while any larger dataset uses at most 1500 data points.
a (float (default: 5)) – Parameter of the model of outliers. We assume the outliers obey a uniform distribution, and the volume of the outliers’ variation space is a.
beta (float (default: None)) – Parameter of the Gaussian kernel, k(x, y) = exp(-beta * ||x - y||^2). If None, a rule-of-thumb bandwidth will be computed automatically.
ecr (float (default: 1e-5)) – The minimum threshold of the energy change rate in the iteration process.
gamma (float (default: 0.9)) – Percentage of inliers in the samples. This is an initial value for the EM iteration, and it is not important.
lambda (float (default: 3)) – Represents the trade-off between the goodness of data fit and regularization.
minP (float (default: 1e-5)) – The posterior probability matrix P may be singular for matrix inversion. We set the minimum value of P as minP.
MaxIter (int (default: 500)) – Maximum number of iterations.
theta (float (default: 0.75)) – Defines what counts as an inlier. If the posterior probability that a sample is an inlier is larger than theta, it is regarded as an inlier.
div_cur_free_kernels (bool (default: False)) – A logical flag to determine whether the divergence-free or curl-free kernels will be used for learning the vector field.
sigma ('int') – Bandwidth parameter.
eta ('int') – Combination coefficient for the divergence-free or the curl-free kernels.
seed (int or 1-d array_like, optional (default: 0)) – Seed for RandomState. Must be convertible to 32-bit unsigned integers. Used in sampling control points. Default is 0 to ensure consistency between different runs.

fit
(normalize=False, method='SparseVFC', **kwargs)[source]¶ Robustly learn a function of the vector field from sparse single-cell samples over the entire space. Reference: Regularized vector field learning with sparse approximation for mismatch removal, Ma, Jiayi, et al., Pattern Recognition.
 Parameters
normalize ('bool' (default: False)) – Logical flag to determine whether to normalize the data to have zero mean and unit covariance. This is often required for raw datasets (for example, raw UMI counts and RNA velocity values in high dimension), but it is normally not required for low-dimensional embeddings from PCA or other nonlinear dimension reduction methods.
method ('string') – Method that is used to reconstruct the vector field functionally. Currently only SparseVFC is supported, but other improved approaches are under development.
 Returns
VecFld – A dictionary which contains X, Y, beta, V, C, P, VFCIndex, where V = f(X), P is the posterior probability and VFCIndex is the indexes of the inliers found by VFC.
 Return type
dict

get_Jacobian
(method='analytical', input_vector_convention='row', **kwargs)[source]¶ Get the Jacobian of the vector field function. If method is ‘analytical’: the analytical Jacobian will be returned, and it always takes row vectors as input no matter what input_vector_convention is.
If method is ‘numerical’: if input_vector_convention is ‘row’, fjac takes row vectors as input; otherwise the input should be an array of column vectors. Note that the returned Jacobian behaves exactly the same if the input is a 1d array.
The column vector convention is slightly faster than the row vector convention, so the matrix in the row vector convention is converted into the column vector convention under the hood.
No matter the method and input vector convention, the returned Jacobian is of the following format:
df_1/dx_1  df_1/dx_2  df_1/dx_3  …
df_2/dx_1  df_2/dx_2  df_2/dx_3  …
df_3/dx_1  df_3/dx_2  df_3/dx_3  …
…          …          …
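The numerical option can be illustrated with a generic central-difference Jacobian in the row-vector convention; `numerical_jacobian` is a hypothetical helper for illustration, not dynamo's internal routine:

```python
# A generic central-difference Jacobian in the row-vector convention described
# above; `numerical_jacobian` is an illustrative helper, not dynamo's internal
# implementation. `f` maps a 1d state vector to a 1d velocity vector.
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))   # J[i, j] = df_i / dx_j
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.asarray(f(x + dx)) - np.asarray(f(x - dx))) / (2 * eps)
    return J

# usage: f(x, y) = (x*y, x + y^2) has Jacobian [[y, x], [1, 2y]]
J = numerical_jacobian(lambda v: np.array([v[0] * v[1], v[0] + v[1] ** 2]),
                       [2.0, 3.0])
```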

evaluate
(CorrectIndex, VFCIndex, siz)[source]¶ Evaluate the precision, recall and corrRate of the sparseVFC algorithm.
 Parameters
CorrectIndex ('List') – Ground truth indexes of the correct vector field samples.
VFCIndex ('List') – Indexes of the correct vector field samples learned by VFC.
siz ('int') – Number of initial matches.
 Returns
A tuple of (precision, recall, corrRate): precision and recall of VFC, and the percentage of initial correct matches.
See also: sparseVFC()
Release notes¶
Information to be added.
Reference¶
 Qiu19
Qi Qiu, Peng Hu, et al. (2019), Massively parallel, time-resolved single-cell RNA sequencing with scNT-seq, bioRxiv.
 Qi19
Xiaojie Qiu et al. (2019), Mapping vector field of single cells, bioRxiv.
 Qiu18
Xiaojie Qiu et al. (2019), Inferring causal gene regulatory networks from coupled single-cell expression dynamics using Scribe, Cell Systems.
 Qiu17
Xiaojie Qiu et al. (2017), Reversed graph embedding resolves complex single-cell trajectories, Nature Methods.
 Trapnell14
Cole Trapnell et al. (2014), The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells, Nature Biotechnology.
 Bergen19
Volker Bergen et al. (2020), Generalizing RNA velocity to transient cell states through dynamical modeling, Nature Biotechnology.
 Melsted19
Páll Melsted et al. (2019), Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv.
 Gorin19
Gennady Gorin et al. (2019), RNA velocity and protein acceleration from single-cell multi-omics experiments, Genome Biology.
 Manno18
La Manno et al. (2018), RNA velocity of single cells, Nature.
 Wolf18
Wolf et al. (2018), SCANPY: large-scale single-cell gene expression data analysis, Genome Biology.
Acknowledgement¶
We would like to sincerely thank the developers of velocyto (La Manno Gioele and others), scanpy (Alex Wolf and others) and scvelo (Volker Bergen and others) for their amazing tools, which demonstrate the best practice of scientific programming in Python. Dynamo takes various technical inspiration from those packages and provides full compatibility with them. Velocity estimations from either velocyto or scvelo can be used as input in dynamo to learn the functional form of the vector field, to predict cell fate over an extended time period, and to map global cell state potential.
Zebrafish pigmentation¶
This tutorial uses data from Saunders, et al (2019). Special thanks also go to Lauren for the tutorial improvement.
In this study, the authors profiled thousands of neural crestderived cells from trunks of postembryonic zebrafish. These cell classes include pigment cells, multipotent pigment cell progenitors, peripheral neurons, Schwann cells, chromaffin cells and others. These cells were collected during an active period of postembryonic development, which has many similarities to fetal and neonatal development in mammals, when many of these cell types are migrating and differentiating as the animal transitions into its adult form. This study also explores the role of thyroid hormone (TH), a common endocrine factor, on the development of these different cell types.
Such developmental and other dynamical processes are especially suitable for dynamo analysis, as dynamo is designed to accurately estimate the direction and magnitude of expression dynamics (RNA velocity), predict the entire lineage trajectory of any initial cell state (vector field), characterize the structure (vector field topology) of the full gene expression space, as well as fate commitment potential (single cell potential).
[ ]:
# get the latest pypi version
# to get the latest version on github and other installation approaches, see:
# https://dynamo-release.readthedocs.io/en/latest/ten_minutes_to_dynamo.html#how-to-install
!pip install dynamo-release --upgrade --quiet
Import the package and silence some warning information (mostly the is_categorical_dtype warning from anndata):
[1]:
import warnings
warnings.filterwarnings('ignore')
import dynamo as dyn
This is like R’s sessionInfo(), which helps you to debug version-related bugs, if any.
[2]:
dyn.get_all_dependencies_version()
package           version
dynamo-release    0.95.2
umap-learn        0.4.6
anndata           0.7.4
cvxopt            1.2.3
hdbscan           0.8.26
loompy            3.0.6
matplotlib        3.3.0
numba             0.51.0
numpy             1.19.1
pandas            1.1.1
pynndescent       0.4.8
python-igraph     0.8.2
scikit-learn      0.23.2
scipy             1.5.2
seaborn           0.9.0
setuptools        49.6.0
statsmodels       0.11.1
tqdm              4.48.2
trimap            1.0.12
numdifftools      0.9.39
colorcet          2.0.2
Emulate the ggplot2 plotting style with a white background:
[3]:
dyn.configuration.set_figure_params('dynamo', background='white')
Load data¶
Dynamo comes with a few built-in sample datasets so you can familiarize yourself with dynamo before analyzing your own dataset. You can read your own data via read, read_loom, read_h5ad, read_h5 (powered by the anndata package) or load_NASC_seq, etc. Here I just load the zebrafish sample data that comes with dynamo. This dataset has 4181 cells and 16940 genes. Its .obs attribute also includes condition and batch information from the original study (you should also store such information in your .obs attribute, which is essentially a pandas DataFrame; see more at anndata). Cluster, Cell_type, and umap coordinates that were originally analyzed with Monocle 3 are also provided.
[4]:
adata = dyn.sample_data.zebrafish()
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
After loading data, you are ready to perform some preprocessing. You can run the recipe_monocle function, which uses a similar but generalized strategy from Monocle 3 to normalize all datasets in different layers (the spliced and unspliced, or new (i.e. metabolically labelled) and total mRNAs, or others), followed by feature selection, log1p transformation of the data and PCA dimension reduction. recipe_monocle also does a few additional steps, which include:
converting Ensembl gene IDs to official gene names and setting them as .var_names if needed.
calculating the number of expressed genes (nGenes), total expression values (nCounts), and the percentage of total mitochondrial gene values (pMito) for each cell, and saving them to .obs.
detecting your experiment type (conventional scRNA-seq or time-resolved metabolic labeling datasets) and setting the proper layers (i.e. ignoring some unconventional layers provided by the users) to be size-factor normalized, log1p transformed, etc.
making cell (.obs_names) and gene names (.var_names) unique.
saving data in .layers as csr sparse matrices for the purpose of memory efficiency.
adding collapsed new, total and unspliced, spliced layers from the uu, ul, su, sl layers of a metabolic labeling experiment.
calculating each cell’s cell cycle stage score.
calculating the new-to-total ratio (ntr) for each gene and cell.
Note that by default, we don’t filter any cells or genes for your adata object to avoid the trouble of losing your favorite genes/cells. However, if your dataset is huge, we recommend filtering them by setting keep_filtered_cells=False, keep_filtered_genes=False
in recipe_monocle
.
[5]:
dyn.pp.recipe_monocle(adata)
[5]:
AnnData object with n_obs × n_vars = 4181 × 16940
obs: 'split_id', 'sample', 'Size_Factor', 'condition', 'Cluster', 'Cell_type', 'umap_1', 'umap_2', 'batch', 'nGenes', 'nCounts', 'pMito', 'use_for_pca', 'spliced_Size_Factor', 'initial_spliced_cell_size', 'unspliced_Size_Factor', 'initial_unspliced_cell_size', 'initial_cell_size', 'ntr'
var: 'pass_basic_filter', 'score', 'log_m', 'log_cv', 'use_for_pca', 'ntr'
uns: 'velocyto_SVR', 'pp_norm_method', 'PCs', 'explained_variance_ratio_', 'pca_fit', 'feature_selection'
obsm: 'X_pca', 'X'
layers: 'spliced', 'unspliced', 'X_spliced', 'X_unspliced'
RNA velocity with parallelism¶
RNA velocity (\(\frac{ds}{dt}\)) for conventional scRNA-seq is just \(\frac{ds}{dt} = \beta u - \gamma s\) (where \(u\) and \(s\) are the unspliced and spliced mRNA respectively, \(\beta\) is the splicing rate, generally assumed to be 1, and \(\gamma\) is the degradation rate, which is what we need to estimate). To estimate gamma for conventional scRNA-seq data, we provide three approaches: deterministic, stochastic and negbin. The first is equivalent to velocyto’s implementation or scvelo’s deterministic mode, while the second is scvelo’s stochastic mode. Negative binomial is a novel method from us that relies on the negative binomial formulation of the gene expression distribution at steady state. Furthermore, we support multicore parallelism of gamma estimation so you can analyze your large single-cell datasets with dynamo efficiently.
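The deterministic steady-state idea can be sketched as follows: at steady state \(\beta u = \gamma s\), so with \(\beta\) fixed to 1, regressing u on s over high-expression cells approximates gamma. All names below are illustrative, not dynamo's internals:

```python
# A hedged sketch of the deterministic steady-state estimate of gamma: at
# steady state beta * u = gamma * s, so with beta fixed to 1 the slope of u
# against s over high-expression ("extreme") cells approximates gamma, and
# velocity = u - gamma * s. All names here are illustrative, not dynamo's.
import numpy as np

def fit_gamma_deterministic(u, s, perc=95):
    # restrict to cells in the upper expression range, where the
    # steady-state assumption is most plausible
    mask = s >= np.percentile(s, perc)
    # linear regression through the origin: slope = sum(u*s) / sum(s^2)
    return np.sum(u[mask] * s[mask]) / np.sum(s[mask] ** 2)

rng = np.random.default_rng(0)
s = rng.uniform(0, 10, 500)                  # spliced counts (toy)
u = 0.5 * s + rng.normal(0, 0.05, 500)       # unspliced; true gamma/beta = 0.5
gamma = fit_gamma_deterministic(u, s)
velocity = u - gamma * s                     # ds/dt = beta * u - gamma * s, beta = 1
```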
dyn.tl.dynamics
function combines gamma estimation and velocity calculation in one shot. Furthermore, it implicitly calls dyn.tl.moments
first, and then performs the following steps:
checks the data you have and determines the experimental type automatically: either conventional scRNA-seq, a kinetics, degradation or one-shot single-cell metabolic labelling experiment, or a CITE-seq or REAP-seq co-assay, etc.
learns the velocity for each feature gene using either the original deterministic model based on a steady-state assumption from the seminal RNA velocity work, or a few new methods, including the stochastic (default) or negative binomial method for conventional scRNA-seq, or kinetic, degradation or one-shot models for metabolic labeling based scRNA-seq.
The latter methods are based on moment equations, which consider both the mean and the uncentered variance of gene expression. The moment-based models require calculation of the first and second moments of the expression data, which relies on the cell nearest-neighbour graph constructed in the reduced PCA space from the spliced or total mRNA expression.
[6]:
dyn.tl.dynamics(adata, model='stochastic', cores=3)
# or dyn.tl.dynamics(adata, model='deterministic')
# or dyn.tl.dynamics(adata, model='stochastic', est_method='negbin')
[6]:
AnnData object with n_obs × n_vars = 4181 × 16940
obs: 'split_id', 'sample', 'Size_Factor', 'condition', 'Cluster', 'Cell_type', 'umap_1', 'umap_2', 'batch', 'nGenes', 'nCounts', 'pMito', 'use_for_pca', 'spliced_Size_Factor', 'initial_spliced_cell_size', 'unspliced_Size_Factor', 'initial_unspliced_cell_size', 'initial_cell_size', 'ntr'
var: 'pass_basic_filter', 'score', 'log_m', 'log_cv', 'use_for_pca', 'ntr', 'beta', 'gamma', 'half_life', 'alpha_b', 'alpha_r2', 'gamma_b', 'gamma_r2', 'gamma_logLL', 'delta_b', 'delta_r2', 'uu0', 'ul0', 'su0', 'sl0', 'U0', 'S0', 'total0', 'use_for_dynamics'
uns: 'velocyto_SVR', 'pp_norm_method', 'PCs', 'explained_variance_ratio_', 'pca_fit', 'feature_selection', 'dynamics'
obsm: 'X_pca', 'X'
layers: 'spliced', 'unspliced', 'X_spliced', 'X_unspliced', 'M_u', 'M_uu', 'M_s', 'M_us', 'M_ss', 'velocity_S'
obsp: 'moments_con'
Next we perform dimension reduction (by default, UMAP) and visualize the UMAP embedding of cells. The provided Cell_type information is also used to color cells. To get cluster/cell type information for your own data, dynamo also provides facilities to perform clustering and marker gene detection. By default we use HDBSCAN for clustering. The HDBSCAN package was also developed by Leland McInnes, the developer of UMAP. You may cluster your single cells in UMAP space (set basis='umap' instead of the default pca in HDBSCAN). See more discussion about this here.
For marker gene detection, please check the functions in the Markers and differential expressions section of our API. A more detailed tutorial dedicated to this will be released soon.
[7]:
dyn.tl.reduceDimension(adata)
dyn.pl.umap(adata, color='Cell_type')
<Figure size 600x400 with 0 Axes>
Kinetic estimation of conventional scRNA-seq and metabolic labeling based scRNA-seq is often tricky and has a lot of pitfalls. Sometimes you may even observe undesired backward vector flow. You can evaluate the confidence of gene-wise velocity via:
dyn.tl.gene_wise_confidence(adata, group='group', lineage_dict={'Progenitor': ['terminal_cell_state']})
Here group is the column for the group information of cells in .obs. lineage_dict is a dictionary indicating broad lineage information, in which each key points to a progenitor group while its values (a list) are the possible terminal cell groups, all from the group column.
In the following, let us have a look at the phase diagrams of some genes that have velocity calculated. You will see that the pvalb1 gene has a strange phase diagram, with a few cells having high spliced expression values but extremely low unspliced expression values. This kind of phase space may point to improper intron capture of those genes during library preparation or sequencing, and such genes should never be used for velocity projection and vector field analysis. A tutorial with details for identifying those genes, evaluating the confidence of velocity estimation and then correcting (briefly mentioned below) the RNA velocity results will be released soon.
[8]:
dyn.pl.phase_portraits(adata, genes=adata.var_names[adata.var.use_for_dynamics][:4], figsize=(6, 4), color='Cell_type')
Velocity projection¶
In order to visualize the velocity vectors, we need to project the high-dimensional velocity vectors of cells to a lower dimension (although dynamo also enables you to visualize raw gene-pair velocity vectors, see below). The projection involves calculating a transition matrix between cells for local averaging of velocity vectors in low dimension. There are three methods to calculate the transition matrix: kmc, cosine, or pearson. kmc is our new approach to learn the transition matrix via diffusion approximation or an Itô kernel. cosine and pearson are the methods used in the original velocyto or scvelo implementations. Kernels based on the vector field reconstructed in high dimension are also possible, and may be more suitable because of their robustness and smoothness. We will show you how to do that in another tutorial soon!
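The correlation-based kernels can be sketched roughly as follows: for each cell, compare its velocity vector with the displacement toward every other cell and turn the correlations into transition probabilities. This is a simplified illustration (all-pairs neighbors, no sqrt transform, hypothetical helper name), not dynamo's exact implementation:

```python
# A rough sketch of the correlation-based kernels: for each cell, compare its
# velocity vector with the displacement to every other cell, turn the
# correlations into transition probabilities, and use those for local
# averaging of velocities on the embedding. Simplified (all-pairs neighbors,
# no sqrt transform); not dynamo's exact implementation.
import numpy as np

def transition_matrix(X, V, kernel="cosine", sigma=0.05):
    n = X.shape[0]
    P = np.zeros((n, n))
    for i in range(n):
        D = X - X[i]                      # displacements toward all other cells
        if kernel == "pearson":
            D = D - D.mean(axis=1, keepdims=True)   # center across dimensions
        norms = np.linalg.norm(D, axis=1) * np.linalg.norm(V[i])
        corr = np.where(norms > 0, D @ V[i] / np.maximum(norms, 1e-12), 0.0)
        P[i] = np.exp(corr / sigma)       # sharpen correlations into weights
        P[i, i] = 0.0                     # no self-transition
        P[i] /= P[i].sum()
    return P

# toy: cell 0 sits between cells 1 (right) and 2 (left) and moves right,
# so its transition probability toward cell 1 should dominate
X = np.array([[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
V = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
P = transition_matrix(X, V)
```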
[9]:
dyn.tl.cell_velocities(adata, method='pearson', other_kernels_dict={'transform': 'sqrt'})
calculating transition matrix via pearson kernel with sqrt transform.: 100%██████████ 4181/4181 [00:10<00:00, 409.52it/s]
projecting velocity vector to low dimensional embedding...: 100%██████████ 4181/4181 [00:01<00:00, 3947.53it/s]
[9]:
AnnData object with n_obs × n_vars = 4181 × 16940
obs: 'split_id', 'sample', 'Size_Factor', 'condition', 'Cluster', 'Cell_type', 'umap_1', 'umap_2', 'batch', 'nGenes', 'nCounts', 'pMito', 'use_for_pca', 'spliced_Size_Factor', 'initial_spliced_cell_size', 'unspliced_Size_Factor', 'initial_unspliced_cell_size', 'initial_cell_size', 'ntr'
var: 'pass_basic_filter', 'score', 'log_m', 'log_cv', 'use_for_pca', 'ntr', 'beta', 'gamma', 'half_life', 'alpha_b', 'alpha_r2', 'gamma_b', 'gamma_r2', 'gamma_logLL', 'delta_b', 'delta_r2', 'uu0', 'ul0', 'su0', 'sl0', 'U0', 'S0', 'total0', 'use_for_dynamics', 'use_for_transition'
uns: 'velocyto_SVR', 'pp_norm_method', 'PCs', 'explained_variance_ratio_', 'pca_fit', 'feature_selection', 'dynamics', 'neighbors', 'umap_fit', 'grid_velocity_umap'
obsm: 'X_pca', 'X', 'X_umap', 'velocity_umap'
layers: 'spliced', 'unspliced', 'X_spliced', 'X_unspliced', 'M_u', 'M_uu', 'M_s', 'M_us', 'M_ss', 'velocity_S'
obsp: 'moments_con', 'connectivities', 'distances', 'pearson_transition_matrix'
You can check the confidence of cell-wise velocity to understand how reliable the recovered velocity is across cells, or even correct velocity based on some prior:
dyn.tl.cell_wise_confidence(adata, basis='pca')
dyn.tl.confident_cell_velocities(adata, group='group', lineage_dict={'Progenitor': ['terminal_cell_state']},)
There are three methods implemented for calculating the cell-wise velocity confidence metric. By default it uses the jaccard index, which measures how well each velocity vector meets the geometric constraints defined by the local neighborhood structure. The Jaccard index is calculated as the fraction of the number of the intersected set of nearest neighbors from each cell at the current expression state (X) and that from the future expression state (X + V) over the number of the union of these two sets.
The cosine or correlation method is similar to that used by scVelo.
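The Jaccard-index idea can be sketched with brute-force nearest neighbors; `jaccard_confidence` is a hypothetical helper for illustration, not dynamo's implementation:

```python
# A sketch of the Jaccard-index confidence: overlap between a cell's k nearest
# neighbors at its current state X and at its extrapolated state X + V.
# Brute-force neighbor search for clarity; `jaccard_confidence` is an
# illustrative helper, not dynamo's implementation.
import numpy as np

def jaccard_confidence(X, V, k=3):
    def knn(query):
        d = np.linalg.norm(X - query, axis=1)
        return set(np.argsort(d)[:k])     # indices of the k nearest cells
    conf = np.empty(X.shape[0])
    for i in range(X.shape[0]):
        now, future = knn(X[i]), knn(X[i] + V[i])
        conf[i] = len(now & future) / len(now | future)
    return conf

# toy: five cells on a line, all drifting slowly to the right
X = np.arange(5, dtype=float).reshape(-1, 1)
V = np.full((5, 1), 0.1)
conf = jaccard_confidence(X, V, k=2)
```

A velocity vector that points along the local data manifold keeps its neighborhood nearly unchanged and thus scores close to 1.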
Next let us visualize the projected RNA velocity. We can see that the recovered RNA velocity flow shows a nice transition from proliferating progenitors to pigment progenitors, which then bifurcate into either melanophore or iridophore on the left. In the middle, the proliferating progenitors bifurcate upward into either chromaffin, neuron or satellite glia cells. On the right, the proliferating progenitors bifurcate into Schwann cell precursors, which then become either Schwann cells or other glia. At the bottom, some proliferating progenitors choose to become an unknown cell lineage. In addition, the xanthophore cells seem to be an outlier group at the top, indicating they have a different lineage path compared to melanophore or iridophore pigment cells. The transcriptional discontinuity from multipotent progenitors to xanthophore cells may also imply that their lineage trajectory is more rapid compared to that of melanophore or iridophore pigment cells.
[10]:
dyn.pl.cell_wise_vectors(adata, color=['Cell_type'], basis='umap', show_legend='on data', quiver_length=6, quiver_size=6, pointsize=0.1, show_arrowed_spines=False)
<Figure size 600x400 with 0 Axes>
[11]:
dyn.pl.streamline_plot(adata, color=['Cell_type'], basis='umap', show_legend='on data', show_arrowed_spines=True)
<Figure size 600x400 with 0 Axes>
Note that, if you pass x='gene_a', y='gene_b' to cell_wise_vectors, grid_vectors or streamline_plot, you can visualize the raw gene-pair velocity flows. gene_a and gene_b need to have velocity calculated (i.e. use_for_dynamics in .var is True for those genes).
Reconstruct vector field¶
In classical physics, including fluidics and aerodynamics, velocity and acceleration vector fields are used as fundamental tools to describe the motion of or external force on objects, respectively. By analogy, RNA velocities or protein accelerations estimated from single cells can be regarded as sparse samples of the velocity (La Manno et al. 2018) or acceleration vector field (Gorin, Svensson, and Pachter 2019) defined on the gene expression space.
In general, a vector field can be defined as a vector-valued function f that maps any point (or cell’s expression state) x in a domain Ω of dimension D (or the gene expression system with D transcripts / proteins) to a vector y (for example, the velocity or acceleration for different genes or proteins); that is, f(x) = y.
To formally define the problem of velocity vector field learning, we consider a set of measured cells with pairs of current and estimated future expression states. The difference between the predicted future state and the current state of each cell corresponds to its velocity. We suppose that the measured single-cell velocities are sampled from a smooth, differentiable vector field f that maps from xi to yi over the entire domain. Normally, single-cell velocity measurements are the result of biased, noisy and sparse sampling of the entire state space; thus the goal of velocity vector field reconstruction is to robustly learn a mapping function f that outputs yj given any point xj in the domain, based on the observed data, with certain smoothness constraints (Jiayi Ma et al. 2013). In an ideal scenario, the mapping function f should recover the true velocity vector field on the entire domain and predict the true dynamics in regions of expression space that are not sampled. To reconstruct the vector field function in dynamo, you can simply use the following function to do all the heavy lifting:
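A minimal sketch of this style of smoothing regression is a ridge-regularized Gaussian-RBF fit that maps states to velocities and can be evaluated anywhere in the domain. SparseVFC additionally includes an EM-based outlier model and sparse control points, which are omitted here; names are illustrative:

```python
# A minimal sketch of this style of vector field learning: ridge-regularized
# Gaussian-RBF regression mapping states x to velocities v = f(x), evaluable
# anywhere in the domain. SparseVFC additionally uses an EM outlier model and
# sparse control points, omitted here; `fit_rbf_field` is illustrative only.
import numpy as np

def fit_rbf_field(X, V, beta=1.0, lam=1e-3):
    # kernel matrix K[i, j] = exp(-beta * ||x_i - x_j||^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    C = np.linalg.solve(np.exp(-beta * sq) + lam * np.eye(len(X)), V)
    def f(x):
        k = np.exp(-beta * ((X - x) ** 2).sum(-1))
        return k @ C                      # smooth interpolated velocity at x
    return f

# toy: rotational field v = (-y, x) sampled at random points in [-1, 1]^2
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 2))
V = np.column_stack([-X[:, 1], X[:, 0]])
f = fit_rbf_field(X, V)
v = f(np.array([0.5, 0.0]))
```

Because f is an explicit smooth function, it can be queried at unsampled states, which is exactly what enables the downstream Jacobian, divergence and trajectory analyses.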
[12]:
# you can set `verbose = 1/2/3` to obtain different levels of running information of vector field reconstruction
dyn.vf.VectorField(adata, basis='umap', M=1000, pot_curl_div=True)
Constructing diffusion graph from reconstructed vector field: 4181it [03:22, 20.61it/s]
Calculating 2D curl: 100%██████████ 4181/4181 [00:00<00:00, 11763.55it/s]
Calculating divergence: 100%██████████ 4181/4181 [00:00<00:00, 10544.14it/s]
Vector field reconstruction is blazingly efficient and scales linearly with the cell number and dimensions, so you can do vector field reconstruction for hundreds of thousands of cells in PCA space in a matter of minutes. How good is your vector field reconstruction? We have several metrics to quantify that, and we will provide a detailed tutorial on them in a couple of days. The easiest way, though, is to check the energy / energy change rate to see whether they are decreasing and gradually stabilizing during the vector field learning process:
dyn.pl.plot_energy(adata)
Characterize vector field topology¶
Since we learn the vector field function of the data, we can then characterize the topology of the full vector field space. For example, we are able to identify
the fixed points (attractors/saddles, etc.), which may correspond to terminal cell types or progenitors;
nullclines and separatrices of a recovered dynamic system, which may formally define the dynamical behaviour or the boundaries of cell types in gene expression space.
Note that we use the name topography instead of topology in the tools and plot modules because we figured the 2D full vector field plot (instead of just domains with cells, as visualized by the streamline_plot function), with those fixed points, nullclines, etc., looks like a topography plot. Enlighten us if you have a better idea. And see also more discussion here.
When we reconstruct a 2D vector field (which is the case above), we automatically characterize the vector field topology. Let us take a look at the fixed points identified by dynamo for this system.
[13]:
dyn.pl.topography(adata, basis='umap', background='white', color=['ntr', 'Cell_type'], streamline_color='black', show_legend='on data', frontier=True)
There are a lot of fixed points identified by dynamo. Some of them are less confident than others, and we use the filled color of each node to represent the confidence. The shape of the node also has meaning: half circles are saddle points, while full circles are stable fixed points (the eigenvalues of the Jacobian matrix at those points are all negative, based on the reconstructed vector field). The color of the digits in each node indicates the type of fixed point:
\(\color{black}{\text{black}}\): absorbing fixed points;
\(\color{red}{\text{red}}\): emitting fixed points;
\(\color{blue}{\text{blue}}\): unstable fixed points.
We notice that, interestingly, node 6 corresponds to an emitting fixed point, which makes sense as it is located in the domain of the progenitor cell state; on the other hand, nodes 70, 44, 14 and 72 are absorbing fixed points, each corresponding to the melanophore, iridophore, unknown or xanthophore terminal cell type state. Lastly, nodes 20 and 29 are unstable fixed points (saddle points), each corresponding to the bifurcation point of the iridophore and melanophore lineages or that of the neuron and satellite glia lineages.
So overall this topology analysis did a pretty good job!
The concept of a potential landscape is widely appreciated across various biological disciplines, for example the adaptive landscape in population genetics, the protein-folding funnel landscape in biochemistry, and the epigenetic landscape in developmental biology. In the context of cell fate transitions, for example differentiation, carcinogenesis, etc., a potential landscape not only offers an intuitive description of the global dynamics of the biological process but also provides key insights for understanding multistability and the transition rates between different cell types, as well as for quantifying the optimal paths of cell fate transitions.
The classical definition of a potential function in physics requires gradient systems (no curl/cyclic part), so it is often not applicable to open biological systems. In dynamo we provide several ways to quantify the potential of single cells by decomposing the vector field into gradient, curl parts, etc., and using the gradient part to define the potential. The recommended method is built on the Hodge decomposition on simplicial complexes (a sparse directional graph) constructed based on the learned vector field function, which provides a fruitful analogy of gradient, curl and harmonic (cyclic) flows on a manifold.
The single cell potential (in fact, it is the negative of the potential here, to match the common usage of pseudotime so that small values correspond to the progenitor state and large values to terminal cell states) estimated by dynamo can be regarded as a replacement for pseudotime. Since dynamo utilizes velocity, which consists of both the direction and the magnitude of cell dynamics, the potential should be more relevant to real time and is intrinsically directional (so you don’t need to orient the trajectory).
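The gradient-component idea can be sketched on a toy graph: find a per-cell scalar potential whose differences along edges best match the edge flows, solved as a least-squares problem on the incidence matrix. This is an illustrative simplification of the graph-based Hodge approach, not its implementation:

```python
# An illustrative simplification of the graph-based potential: find a scalar
# potential psi per cell whose differences along directed edges best match the
# edge "flows" derived from velocities (the gradient part of a Hodge-style
# decomposition). Solved as least squares on the incidence matrix; names and
# setup are toy assumptions, not the ddhodge implementation.
import numpy as np

def graph_potential(n_nodes, edges, flows):
    # incidence matrix B: one row per edge (i -> j), -1 at i and +1 at j
    B = np.zeros((len(edges), n_nodes))
    for r, (i, j) in enumerate(edges):
        B[r, i], B[r, j] = -1.0, 1.0
    # minimize ||B @ psi - flows||^2; psi is defined only up to a constant
    psi, *_ = np.linalg.lstsq(B, np.asarray(flows, dtype=float), rcond=None)
    return psi - psi.min()                # anchor the progenitor end at zero

# toy chain 0 -> 1 -> 2 with unit flow: potential increases along the chain
psi = graph_potential(3, [(0, 1), (1, 2)], [1.0, 1.0])
```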
[14]:
dyn.pl.umap(adata, color='umap_ddhodge_potential', frontier=True)
<Figure size 600x400 with 0 Axes>
Here we can check a few genes from figure 3 (si 5) of Saunders, et al (2019) to see their expression dynamics over time. As expected, we can see that mitfa expression declined only marginally with melanophore differentiation yet decreased markedly with the transition from progenitor to iridophore, as expected (Curran et al., 2010). pax3a was expressed in pigment progenitors and decreased across pseudotime in melanophores, whereas expression of tfec, a transcription factor expressed in iridophores (Lister et al., 2011), increased over pseudotime. The melanin synthesis enzyme genes dct and tyrp1b, as well as pmel, encoding a melanosome-associated transmembrane protein, all increased over pseudotime in melanophores. In iridophores, gpnmb and pnp4a showed elevated expression late in pseudotime, as expected (Curran et al., 2010; Higdon et al., 2013).
[15]:
import numpy as np
fig3_si5 = ['mitfa', 'pax3a', 'tfec', 'dct', 'alx4b', 'tyrp1b', 'gpnmb', 'pmela', 'pnp4a']
dyn.pl.scatters(adata, x=np.repeat('umap_ddhodge_potential', 9), pointsize=0.25, alpha=0.8, y=fig3_si5, layer='X_spliced', color='Cell_type',
ncols=3, background='white', figsize=(7, 4))
Beyond RNA velocity¶
Here let us take a glimpse of how dynamo can go beyond RNA velocity analysis by taking advantage of the analytical vector field function it learns. We will first project the RNA velocity to PCA space and then reconstruct the vector field function in the PCA space, followed by calculating the curl (computed in the 2-dimensional UMAP space by default, since curl is only defined in 2/3 dimensions), divergence, acceleration and curvature. These calculations are incredibly efficient (on the order of seconds for ten thousand cells in 30 PCs) because they are evaluated analytically from the reconstructed vector field function.
curl: a quantity characterizing the infinitesimal rotation of a cell state based on the reconstructed vector field.
- in 2D, curl is a scalar; in 3D, curl is a vector.
- if the rotation is clockwise, the 2D curl is negative, and vice versa.
- combined with the expression of cell cycle markers, curl analysis can help reveal whether a cell is going through a strong cell cycle process.
divergence: a quantity characterizing the local "outgoingness" of a cell, i.e. the extent to which more field vectors exit an infinitesimal region of space than enter it.
- positive values mean the cell is moving out to become other cells, or its movement toward other cells is speeding up, and vice versa.
- divergence analysis can be used to reveal progenitor (source) or terminal (sink) cell states.
acceleration: the derivative of the velocity vector.
- if a cell speeds up (which normally happens when cells exit the cell cycle and start to commit), the acceleration will be positive, and vice versa.
- RNA acceleration is a vector like the RNA velocity vector, so you can plot an acceleration field just like a velocity field (that is why our vector flow related plotting functions are named cell_wise_vectors and grid_vectors, to support plotting both velocity and acceleration fields (see below)). Here the norm of the acceleration over all PC components in each cell will be calculated and visualized (like the speed/magnitude of the velocity vector).
curvature: a quantity characterizing the curviness of a cell's vector field trajectory.
- if a progenitor develops into multiple lineages, some of those paths will have high curvature (like making a turn at a crossroads while driving a car).
- genes that strongly contribute to the curvature correspond to regulatory genes steering the cell fate.
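These definitions can be checked on a toy 2D field. This is a minimal sketch, not dynamo's API: once the vector field is an analytical function f(x), curl and divergence follow directly from its Jacobian, which is why the per-cell evaluations are so cheap.

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of a 2D vector field at point x."""
    J = np.zeros((2, 2))
    for k in range(2):
        dx = np.zeros(2)
        dx[k] = eps
        J[:, k] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

rotation = lambda x: np.array([-x[1], x[0]])  # counterclockwise rotation
source = lambda x: np.array([x[0], x[1]])     # outward "source" flow

x0 = np.array([0.3, 0.7])
J_rot, J_src = jacobian(rotation, x0), jacobian(source, x0)

curl_rot = J_rot[1, 0] - J_rot[0, 1]  # dv/dx - du/dy; positive = counterclockwise
div_src = np.trace(J_src)             # positive divergence = a source-like state
print(curl_rot, div_src)  # ≈ 2.0, 2.0
```

The rotational field has positive curl and zero divergence (a cycling state, e.g. cell cycle), while the source field has zero curl and positive divergence (a progenitor-like state).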
[16]:
dyn.tl.cell_velocities(adata, basis='pca')
dyn.vf.VectorField(adata, basis='pca')
dyn.vf.speed(adata, basis='pca')
dyn.vf.curl(adata, basis='umap')
dyn.vf.divergence(adata, basis='pca')
dyn.vf.acceleration(adata, basis='pca')
dyn.vf.curvature(adata, basis='pca')
Using existing pearson_transition_matrix found in .obsp.
projecting velocity vector to low dimensional embedding...: 100%██████████ 4181/4181 [00:01<00:00, 3505.59it/s]
Calculating 2D curl: 100%██████████ 4181/4181 [00:00<00:00, 11446.05it/s]
Calculating divergence: 100%██████████ 4181/4181 [00:00<00:00, 6807.54it/s]
Calculating acceleration: 100%██████████ 4181/4181 [00:00<00:00, 124851.45it/s]
Calculating acceleration: 100%██████████ 4181/4181 [00:00<00:00, 166839.99it/s]
Calculating curvature: 100%██████████ 4181/4181 [00:00<00:00, 48393.99it/s]
Integrative analysis¶
We can integrate the quantities above to fully characterize the regulatory mechanisms during zebrafish pigmentation.
A separate tutorial is needed to fully explore these analyses, but let's take a quick look at the results. We can see that:
from cell speed and acceleration, progenitors generally have low speed, as befits a metastable cell state. However, the transition of pigment progenitors and proliferating progenitors speeds up after committing to a particular lineage, for example the iridophore, melanophore or Schwann cell lineage.
from cell divergence, the progenitors (pigment progenitors and proliferating progenitors) function like a source with high divergence, while melanophores/iridophores/chromaffin/Schwann cells, as well as other cell types, function like sinks with significantly lower divergence.
from cell curvature, strong curvature is apparent where cells make fate decisions (at the bifurcation point of the iridophore and melanophore lineages, or that of the neuron and satellite glia lineages). Curvature is also artificially strong where the velocity is noisy.
[17]:
import matplotlib.pyplot as plt
fig1, f1_axes = plt.subplots(ncols=2, nrows=2, constrained_layout=True, figsize=(12, 8))
f1_axes
f1_axes[0, 0] = dyn.pl.cell_wise_vectors(adata, color='speed_pca', pointsize=0.5, alpha = 0.7, ax=f1_axes[0, 0], quiver_length=6, quiver_size=6, save_show_or_return='return')
f1_axes[0, 1] = dyn.pl.grid_vectors(adata, color='divergence_pca', ax=f1_axes[0, 1], quiver_length=12, quiver_size=12, save_show_or_return='return')
f1_axes[1, 0] = dyn.pl.streamline_plot(adata, color='acceleration_pca', ax=f1_axes[1, 0], save_show_or_return='return')
f1_axes[1, 1] = dyn.pl.streamline_plot(adata, color='curvature_pca', ax=f1_axes[1, 1], save_show_or_return='return')
plt.show()
Emulate the ggplot2 plotting style with a black background, and get ready for a cool presentation!
[18]:
dyn.configuration.set_figure_params('dynamo', background='black')
[19]:
fig1, f1_axes = plt.subplots(ncols=2, nrows=2, constrained_layout=True, figsize=(12, 8))
f1_axes
f1_axes[0, 0] = dyn.pl.cell_wise_vectors(adata, color='speed_pca', pointsize=0.1, alpha = 0.7, ax=f1_axes[0, 0], quiver_length=6, quiver_size=6, save_show_or_return='return', background='black')
f1_axes[0, 1] = dyn.pl.grid_vectors(adata, color='divergence_pca', ax=f1_axes[0, 1], quiver_length=12, quiver_size=12, save_show_or_return='return', background='black')
f1_axes[1, 0] = dyn.pl.streamline_plot(adata, color='acceleration_pca', ax=f1_axes[1, 0], save_show_or_return='return', background='black')
f1_axes[1, 1] = dyn.pl.streamline_plot(adata, color='curvature_pca', ax=f1_axes[1, 1], save_show_or_return='return', background='black')
plt.show()
Animate fate transition¶
Before we go, let us have some fun animating cell fate commitment predictions via the reconstructed vector field function. This cool application will hopefully also convince you that vector field reconstruction enables amazing analyses that were hardly imaginable before. With these and many other possibilities in single cell genomics, the prospect of biology finally becoming a discipline as quantitative as physics and mathematics has never been so promising!
To animate cell fate prediction, we need to first select some progenitor cells as initial cell states.
[20]:
progenitor = adata.obs_names[adata.obs.Cell_type.isin(['Proliferating Progenitor', 'Pigment Progenitor'])]
len(progenitor)
[20]:
1194
Then we predict the cell fate trajectories by integrating the vector field function forward from those initial cell states.
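Conceptually, fate prediction amounts to solving the ODE dx/dt = f(x), where f is the learned vector field, from each initial state. Here is a minimal toy sketch of that idea (not dynamo's solver), using a hypothetical 2D field with a stable fixed point at (1, 1) standing in for a terminal cell state:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy vector field dx/dt = f(x) with an attractor at (1, 1), standing in
# for a learned vector field whose attractor is a terminal cell state.
def vector_field(t, x):
    return np.array([1.0 - x[0], 1.0 - x[1]])

x0 = np.array([0.0, 0.0])  # an "initial progenitor" state
sol = solve_ivp(vector_field, t_span=(0, 10), y0=x0,
                t_eval=np.linspace(0, 10, 100))
print(sol.y[:, -1])  # trajectory ends near the attractor (1, 1)
```

`dyn.pd.fate` does the analogous integration with the reconstructed vector field, for many initial cells at once (and in parallel via `cores`).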
[21]:
dyn.pd.fate(adata, basis='umap', init_cells=progenitor, interpolation_num=100, direction='forward',
inverse_transform=False, average=False, cores=3)
[21]:
AnnData object with n_obs × n_vars = 4181 × 16940
obs: 'split_id', 'sample', 'Size_Factor', 'condition', 'Cluster', 'Cell_type', 'umap_1', 'umap_2', 'batch', 'nGenes', 'nCounts', 'pMito', 'use_for_pca', 'spliced_Size_Factor', 'initial_spliced_cell_size', 'unspliced_Size_Factor', 'initial_unspliced_cell_size', 'initial_cell_size', 'ntr', 'umap_ddhodge_div', 'umap_ddhodge_potential', 'curl_umap', 'divergence_umap', 'speed_pca', 'divergence_pca', 'acceleration_pca', 'curvature_pca'
var: 'pass_basic_filter', 'score', 'log_m', 'log_cv', 'use_for_pca', 'ntr', 'beta', 'gamma', 'half_life', 'alpha_b', 'alpha_r2', 'gamma_b', 'gamma_r2', 'gamma_logLL', 'delta_b', 'delta_r2', 'uu0', 'ul0', 'su0', 'sl0', 'U0', 'S0', 'total0', 'use_for_dynamics', 'use_for_transition'
uns: 'velocyto_SVR', 'pp_norm_method', 'PCs', 'explained_variance_ratio_', 'pca_fit', 'feature_selection', 'dynamics', 'neighbors', 'umap_fit', 'grid_velocity_umap', 'VecFld_umap', 'VecFld', 'grid_velocity_pca', 'VecFld_pca', 'curvature_pca', 'fate_umap'
obsm: 'X_pca', 'X', 'X_umap', 'velocity_umap', 'velocity_umap_SparseVFC', 'X_umap_SparseVFC', 'velocity_pca', 'velocity_pca_SparseVFC', 'X_pca_SparseVFC', 'acceleration_pca'
layers: 'spliced', 'unspliced', 'X_spliced', 'X_unspliced', 'M_u', 'M_uu', 'M_s', 'M_us', 'M_ss', 'velocity_S', 'acceleration'
obsp: 'moments_con', 'connectivities', 'distances', 'pearson_transition_matrix', 'umap_ddhodge'
Furthermore, we need to prepare a matplotlib axes as the background of the animation; the animated components of each frame will then be plotted on top of it. Here I use the topography plot as the background, but you can use other plots if you like.
[22]:
%%capture
fig, ax = plt.subplots()
ax = dyn.pl.topography(adata, color='Cell_type', ax=ax, save_show_or_return='return')
The dyn.mv.* module provides functionalities to create the components needed to produce an animation describing the estimated speed of a set of cells at each time point, their movement in gene expression space, and the long-range trajectories predicted by the reconstructed vector field function. Thus it provides an intuitive visual understanding of RNA velocity, speed, acceleration, and cell fate commitment in action!
[23]:
%%capture
instance = dyn.mv.StreamFuncAnim(adata=adata, color='Cell_type', ax=ax)
Lastly, let us embed the animation into our notebook. Note that here I have to set the animation.embed_limit rc parameter to a large value (in MB) to ensure all frames of the animation will be embedded in this notebook.
[24]:
import matplotlib
matplotlib.rcParams['animation.embed_limit'] = 2**128 # Ensure all frames will be embedded.
from matplotlib import animation
import numpy as np
anim = animation.FuncAnimation(instance.fig, instance.update, init_func=instance.init_background,
frames=np.arange(100), interval=100, blit=True)
from IPython.display import HTML
HTML(anim.to_jshtml()) # embedding to jupyter notebook.
[24]: