dynamo.tl.find_group_markers

dynamo.tl.find_group_markers(adata, group, genes=None, layer=None, exp_frac_thresh=None, log2_fc_thresh=None, qval_thresh=0.05, de_frequency=1, subset_control_vals=None)

Find marker genes for each group of cells based on gene expression or velocity values as specified by the layer.

Tests each gene for differential expression between the cells in one group and the cells from all other groups using the Mann-Whitney U test. It also calculates the fraction of cells with non-zero expression, the log2 fold change, and the specificity (calculated as 1 minus the Jensen-Shannon distance between the distribution of the percentage of cells with expression across all groups and the hypothetical perfect distribution in which only the test group of cells has expression).

In addition, the rank-biserial correlation (rbc) and qval are calculated. The rank-biserial correlation assesses the relationship between a dichotomous (two-level) categorical variable and an ordinal variable, so it can only be used with two-level categorical variables. It is closely related to the non-parametric Mann-Whitney U test, which compares two independent groups on an ordinal variable; the Mann-Whitney U test is preferable when comparing independent groups. qval is calculated using the Benjamini-Hochberg adjustment.
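The per-gene statistics described above can be illustrated with a small pure-Python sketch. This is not dynamo's implementation: the helper names (`rankdata`, `rank_biserial`, `js_distance`, `specificity`) are hypothetical, the rank-biserial sign convention shown (positive when the test group ranks higher) is one common choice, and normalizing the per-group expression fractions to sum to 1 before computing the Jensen-Shannon distance is an assumption.

```python
from math import log2, sqrt


def rankdata(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_biserial(test_vals, control_vals):
    """Rank-biserial correlation derived from the Mann-Whitney U statistic."""
    n1, n2 = len(test_vals), len(control_vals)
    ranks = rankdata(list(test_vals) + list(control_vals))
    # U for the test group: its rank sum minus the smallest possible rank sum.
    u_test = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    return 2 * u_test / (n1 * n2) - 1


def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions."""
    def kl(a, b):
        return sum(x * log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))


def specificity(frac_expressing, test_index):
    """1 - JS distance to the 'only the test group expresses' distribution."""
    total = sum(frac_expressing)
    if total == 0:
        return 0.0  # assumption: a gene expressed nowhere has no specificity
    observed = [f / total for f in frac_expressing]
    perfect = [1.0 if i == test_index else 0.0
               for i in range(len(frac_expressing))]
    return 1 - js_distance(observed, perfect)


# A gene expressed only in group 0 is perfectly specific to group 0.
print(specificity([0.9, 0.0, 0.0], test_index=0))
# Every test value exceeds every control value, so the correlation is maximal.
print(rank_biserial([5, 6, 7], [1, 2, 3]))
```

Deriving the correlation from U makes the link between the two quantities explicit: the U test supplies the p-value, while the rank-biserial correlation serves as an effect size on the same ranks.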

Note that this function is designed to be general: you can use the total, new, unspliced, velocity, or other layers to identify differentially expressed genes.

This function is adapted from https://github.com/KarlssonG/nabo/blob/master/nabo/_marker.py and Monocle 3 alpha.

Parameters:
  • adata (AnnData) – An AnnData object.

  • group (str) – The column key/name that identifies the grouping information (for example, clusters that correspond to different cell types or different time points) of cells. This will be used for calculating group-specific genes.

  • genes (Optional[List[str]]) – The list of genes used to subset the data before testing. If None, all genes will be used. Defaults to None.

  • layer (Optional[str]) – The layer from which data is retrieved for the differential expression test. If None, .X is used. Defaults to None.

  • exp_frac_thresh (Optional[float]) – The minimum fraction of cells expressing a gene for it to proceed to the differential expression test. If None, it is set to 0.1 unless layer is velocity related (e.g. velocity_S), in which case it is set to 0. Defaults to None.

  • log2_fc_thresh (Optional[float]) – The minimum log2 fold change for a gene to proceed to the differential expression test. If None, it is set to 1 unless layer is velocity related (e.g. velocity_S), in which case it is set to 0. Defaults to None.

  • qval_thresh (float) – The qval threshold below which a gene is considered significant. Defaults to 0.05.

  • de_frequency (int) – The minimum number of clusters against which a gene must be significantly differentially expressed for it to qualify as a marker. Defaults to 1.

  • subset_control_vals (Optional[bool]) – Whether to subset the top-ranked control values. If None, this is set to True when layer is not a velocity-, acceleration-, or curvature-related layer, and to False otherwise. When layer is not one of those layers, the control values are sorted by absolute value. Defaults to None.
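The default-resolution rules described for exp_frac_thresh and log2_fc_thresh, and the Benjamini-Hochberg q-values that qval_thresh is compared against, can be sketched as follows. This is a minimal pure-Python illustration, not dynamo's implementation: `resolve_thresholds` and `bh_adjust` are hypothetical helpers, and detecting a velocity-related layer by a "velocity" name prefix is an assumption.

```python
def resolve_thresholds(layer=None, exp_frac_thresh=None, log2_fc_thresh=None):
    """Fill in thresholds left as None, following the documented defaults."""
    # Assumption: a layer counts as velocity related if its name starts
    # with "velocity" (e.g. "velocity_S").
    velocity_like = layer is not None and layer.startswith("velocity")
    if exp_frac_thresh is None:
        exp_frac_thresh = 0.0 if velocity_like else 0.1
    if log2_fc_thresh is None:
        log2_fc_thresh = 0.0 if velocity_like else 1.0
    return exp_frac_thresh, log2_fc_thresh


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    qvals = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        qvals[i] = running_min
    return qvals


print(resolve_thresholds())              # expression-like layer (.X)
print(resolve_thresholds("velocity_S"))  # velocity-related layer
qvals = bh_adjust([0.01, 0.04, 0.03, 0.005])
print([q < 0.05 for q in qvals])         # compare against qval_thresh
```

Explicitly passed thresholds are never overridden; only values left as None fall back to the layer-dependent defaults.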

Raises:
  • ValueError – Gene list does not overlap with genes in adata.

  • ValueError – group is invalid.

  • ValueError – .obs[group] does not contain enough groups.

Return type:

AnnData

Returns:

An updated AnnData object with a new cluster_markers entry in the .uns attribute, which includes a concatenated pandas DataFrame of the differential expression results for all groups and a dictionary whose keys are cluster numbers and whose values are lists of marker genes for the corresponding clusters.
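A minimal usage sketch, using only the signature and return value documented above. The wrapper `summarize_markers` is hypothetical, and it assumes the cluster_markers entry is dict-like; its inner key names are not stated here, so the sketch only lists whatever keys are present together with their types.

```python
def summarize_markers(adata, group, layer=None):
    """Run the marker search and report what is stored in .uns."""
    # Imported lazily so that defining this helper does not require dynamo.
    import dynamo as dyn

    adata = dyn.tl.find_group_markers(adata, group=group, layer=layer)
    markers = adata.uns["cluster_markers"]
    # Assumption: the entry is dict-like; list each key with its value's type.
    return {key: type(value).__name__ for key, value in markers.items()}
```

Calling `summarize_markers(adata, group="Cell_type")` on an AnnData object whose .obs has a Cell_type column (a hypothetical column name) would show which entry holds the DataFrame and which holds the per-cluster marker dictionary.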