dynamo.pp.recipe_monocle

dynamo.pp.recipe_monocle(adata, reset_X=False, tkey=None, t_label_keys=None, experiment_type=None, normalized=None, layer=None, total_layers=None, splicing_total_layers=False, X_total_layers=False, genes_use_for_norm=None, genes_to_use=None, genes_to_append=None, genes_to_exclude=None, exprs_frac_for_gene_exclusion=1, method='pca', num_dim=30, sz_method='median', scale_to=None, norm_method=None, pseudo_expr=1, feature_selection='SVR', n_top_genes=2000, maintain_n_top_genes=True, relative_expr=True, keep_filtered_cells=None, keep_filtered_genes=None, keep_raw_layers=None, scopes=None, fc_kwargs=None, fg_kwargs=None, sg_kwargs=None, copy=False, feature_selection_layer='X')[source]

The monocle style preprocessing recipe.

This function is partly based on the Monocle R package (https://github.com/cole-trapnell-lab/monocle3).

Parameters:
  • adata (AnnData) – an AnnData object.

  • reset_X (bool) –

    whether to let dynamo reset adata.X based on the layers stored in your experiment. A key feature of dynamo is visualizing RNA velocity vector flows, which requires suitable data onto which the high-dimensional RNA velocity vectors will be projected.

    1. For kinetics experiments, we recommend using the total layer as adata.X;

    2. For degradation/conventional scRNA-seq experiments, we recommend using the splicing layer as adata.X.

    If you are not sure, set reset_X to True to apply these defaults. Defaults to False.

  • tkey (Optional[str]) – the column key for the labeling time of cells in .obs. Used for labeling-based scRNA-seq data (conventional scRNA-seq data will also be supported). Note that tkey will be saved to adata.uns['pp']['tkey'] and used in dyn.tl.dynamics, where, when group is None, tkey will also be used for calculating the 1st/2nd moment or covariance. We recommend using hours as the unit of time. Defaults to None.

  • t_label_keys (Union[List[str], str, None]) – the column key(s) for the labeling time label of cells in .obs. Used for either “conventional” or “labeling based” scRNA-seq data. Not used for now; tkey is implicitly assumed to be t_label_key (however, tkey should just be the time of the experiment). Defaults to None.

  • experiment_type (Optional[str]) –

    experiment type for labeling single cell RNA-seq. Available options are:

    1. ‘conventional’: conventional single-cell RNA-seq experiment; if experiment_type is None and there is only splicing data, this will be set to conventional;

    2. ‘deg’: chase/degradation experiment. Cells are first labeled for an extended period, followed by chase;

    3. ‘kin’: pulse/synthesis/kinetics experiment. Cells are labeled for different durations in a time series;

    4. ‘one-shot’: one-shot kinetic experiment. Cells are only labeled for a short pulse duration.

    Other possible experiments include:

    5. ‘mix_pulse_chase’ or ‘mix_kin_deg’: a mixture chase experiment in which the entire experiment lasts for a certain period of time, with an initial pulse followed by washing out at different time points but chasing cells at the same time point. This type of labeling strategy was adopted in the scEU-seq paper. For this kind of experiment, we need to assume non-steady-state dynamics;

    6. ‘mix_std_stm’.

    Defaults to None.

  • normalized (Optional[bool]) – if you have already normalized your data (or have already run recipe_monocle), set this to True to avoid renormalizing your data. By default it is None, and the first 20 values of adata.X (if adata.X is sparse) or its first column will be checked to determine whether the data are already normalized. This check only works for UMI-based or read-count data. Defaults to None.

  • layer (Optional[str]) – the layer(s) to be normalized. If not supplied, all layers will be used, including RNA (X, raw) or spliced, unspliced, protein, etc. Defaults to None.

  • total_layers (Union[bool, List[str], None]) – the layer(s) that can be summed up to get the total mRNA, for example [“spliced”, “unspliced”], [“uu”, “ul”, “su”, “sl”] or [“total”]. If total_layers is True, it will be set to total or [“uu”, “ul”, “su”, “sl”], depending on whether you have labeling data without splicing, or both labeling and splicing data. Defaults to None.

  • splicing_total_layers (bool) – whether to also normalize spliced / unspliced layers by the size factor from total RNA. Defaults to False.

  • X_total_layers (bool) – whether to also normalize adata.X by the size factor from total RNA. Defaults to False.

  • genes_use_for_norm (Optional[List[str]]) – a list of gene names that will be used to calculate the total RNA for each cell and then the size factor for normalization. This is often very useful when you want to use only the host genes to normalize the dataset in a virus infection experiment (e.g. CMV or SARS-CoV-2 infection). Defaults to None.

  • genes_to_use (Optional[List[str]]) – a list of gene names that will be used as the feature genes for downstream analysis. Defaults to None.

  • genes_to_append (Optional[List[str]]) – a list of gene names that will be appended to the feature genes list for downstream analysis. Defaults to None.

  • genes_to_exclude (Optional[List[str]]) – a list of gene names that will be excluded from the feature genes list for downstream analysis. Defaults to None.

  • exprs_frac_for_gene_exclusion (float) – the minimal fraction of a gene's counts relative to the total counts across cells that will be used to filter genes. The default of 1 means that no genes are filtered; set it to a smaller value such as 0.005 to remove some highly expressed housekeeping genes. Defaults to 1.

  • method (str) – the linear dimension reduction method to be used. Defaults to “pca”.

  • num_dim (int) – the number of linear dimensions to reduce to. Defaults to 30.

  • sz_method (str) – the method used to calculate the expected total reads / UMI used in size factor calculation. Only mean-geometric-mean-total / geometric and median are supported. When median is used, locfunc will be replaced with np.nanmedian. Defaults to “median”.

  • scale_to (Optional[float]) – the final total expression that each cell will be scaled to. Defaults to None.

  • norm_method (Optional[str]) – the method to normalize the data. Can be any numpy function or Freeman_Tukey. By default, only .X will be size normalized and log1p transformed while data in other layers will only be size factor normalized. Defaults to None.

  • pseudo_expr (int) – a pseudocount added to the gene expression value before log/log2 normalization. Defaults to 1.

  • feature_selection (str) – the sorting method used to select genes: dispersion, SVR or Gini index. Defaults to “SVR”.

  • n_top_genes (int) – how many top genes, ranked by the scoring method (specified by feature_selection), will be selected as feature genes. Defaults to 2000.

  • maintain_n_top_genes (bool) – whether to ensure that n_top_genes feature genes (2000 by default) are selected no matter what genes_to_use, genes_to_append, etc. specify. The only exception is when genes_to_use is supplied together with n_top_genes. Defaults to True.

  • relative_expr (bool) – whether we need to divide gene expression values first by size factor before normalization. Defaults to True.

  • keep_filtered_cells (Optional[bool]) – whether to keep cells that don’t pass the filtering in the returned adata object. Defaults to None.

  • keep_filtered_genes (Optional[bool]) – whether to keep genes that don’t pass the filtering in the returned adata object. Defaults to None.

  • keep_raw_layers (Optional[bool]) – whether to keep layers with raw measurements in the returned adata object. Defaults to None.

  • scopes (Union[str, Iterable, None]) – scopes are needed when you use non-official gene names as your gene indices (adata.var_names). This argument specifies the types of identifiers of the input qterms, either as a list or as comma-separated fields, e.g. “entrezgene”, “entrezgene,symbol”, [“ensemblgene”, “symbol”]. Refer to the official MyGene.info docs (https://docs.mygene.info/en/latest/doc/query_service.html#available_fields) for the full list of fields. Defaults to None.

  • fc_kwargs (Optional[dict]) – other parameters passed to the filter_cells function. Defaults to None.

  • fg_kwargs (Optional[dict]) – other parameters passed to the filter_genes function. Defaults to None.

  • sg_kwargs (Optional[dict]) – other parameters passed to the select_genes function. Defaults to None.

  • copy (bool) – whether to return a new deep copy of adata instead of updating adata object passed in arguments. Defaults to False.

  • feature_selection_layer (Union[List[str], ndarray, array, str]) – name of layers to apply feature selection. Defaults to DKM.X_LAYER.
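
How the normalization parameters above (sz_method, relative_expr, pseudo_expr) interact can be illustrated with a minimal pure-Python sketch. It is not dynamo's implementation: it assumes that a cell's size factor is its total count divided by the median (or geometric mean) of the totals across cells, and that the transform is a natural log of the size-normalized value plus the pseudocount. The helper names size_factors and normalize_cell and the toy counts are hypothetical.

```python
import math
import statistics


def size_factors(cell_totals, sz_method="median"):
    """Hypothetical helper: one size factor per cell (assumed formula, not dynamo's code)."""
    if sz_method == "median":
        center = statistics.median(cell_totals)
    elif sz_method == "geometric":
        # Geometric mean of the per-cell totals.
        center = math.exp(statistics.fmean([math.log(t) for t in cell_totals]))
    else:
        raise ValueError(f"unsupported sz_method: {sz_method!r}")
    return [t / center for t in cell_totals]


def normalize_cell(counts, size_factor, pseudo_expr=1, relative_expr=True):
    """Hypothetical helper: divide by the size factor (if relative_expr), then log-transform."""
    scaled = [c / size_factor for c in counts] if relative_expr else list(counts)
    return [math.log(v + pseudo_expr) for v in scaled]


# Toy data: three cells with two genes each (made-up counts that differ only in depth).
cells = [[2, 2], [4, 4], [8, 8]]
totals = [sum(c) for c in cells]   # per-cell total counts
sfs = size_factors(totals)         # median-based size factors
normalized = [normalize_cell(c, sf) for c, sf in zip(cells, sfs)]
```

In this toy case the three cells differ only in sequencing depth, so dividing by the size factor gives them identical profiles before the log transform. With pseudo_expr=1 the transform in this sketch is equivalent to log1p.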

Raises:
  • ValueError – time key does not exist in adata.obs.

  • ValueError – provided experiment type is invalid.

  • Exception – no genes pass basic filter.

  • Exception – no cells pass basic filter.

  • Exception – genes_to_use contains genes that are not found in adata.

  • ValueError – provided layer(s) is invalid.

  • ValueError – genes_to_append contains invalid genes.

Return type:

Optional[AnnData]

Returns:

A new, updated AnnData object if copy is True; otherwise None. In the object, Size_Factor, normalized expression values, X, reduced dimensions, etc. are updated.
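
A typical call might look like the sketch below. It uses only arguments that appear in the signature above; the chosen values, the RECIPE_KWARGS name, the preprocess wrapper and the "time" column name are illustrative assumptions. dynamo is imported inside the function so that the snippet can be loaded without the package installed.

```python
# Illustrative argument set; every key is a parameter from the signature above.
RECIPE_KWARGS = {
    "tkey": "time",            # assumed name of the labeling-time column in adata.obs
    "experiment_type": "kin",  # pulse/synthesis/kinetics labeling experiment
    "reset_X": True,           # let dynamo pick the recommended layer for adata.X
    "n_top_genes": 2000,       # number of feature genes to select
    "num_dim": 30,             # number of linear dimensions to keep
    "copy": True,              # return a new AnnData instead of None
}


def preprocess(adata, **overrides):
    """Hypothetical wrapper around dyn.pp.recipe_monocle."""
    import dynamo as dyn  # imported lazily; requires dynamo to be installed

    kwargs = {**RECIPE_KWARGS, **overrides}
    return dyn.pp.recipe_monocle(adata, **kwargs)
```

Because copy is True here, preprocess returns the updated AnnData object; passing copy=False as an override makes the call return None, as described in the Returns section.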