dynamo.tl.two_groups_degs

dynamo.tl.two_groups_degs(adata, genes=None, layer=None, group=None, test_group=None, control_groups=None, X_data=None, exp_frac_thresh=None, log2_fc_thresh=None, qval_thresh=0.05, subset_control_vals=None)

Find marker genes between two groups of cells based on gene expression or velocity as specified by the layer.

Tests each gene for differential expression between the cells in one group and the cells in one or more other groups via the Mann-Whitney U test. It also calculates the fraction of cells with non-zero expression, the log2 fold change, and the specificity (calculated as 1 - Jensen-Shannon distance between the distribution of the percentage of cells with expression across all groups and the hypothetical perfect distribution in which only the current group of cells has expression).

In addition, the rank-biserial correlation (rbc) and qval are calculated. The rank-biserial correlation assesses the relationship between a dichotomous (two-level) categorical variable and an ordinal variable, and can only be used with dichotomous categorical variables. It is closely related to the non-parametric Mann-Whitney U test, which compares two independent groups on an ordinal variable; the Mann-Whitney U test is preferable when comparing independent groups. qval is calculated using the Benjamini-Hochberg adjustment.
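The statistics named above can be sketched in plain Python. This is an illustrative sketch only, not dynamo's implementation: the helper names (rank_average, mann_whitney_u, rank_biserial, benjamini_hochberg, js_distance, specificity) are hypothetical, the sign convention of rbc and the base-2 logarithm in the Jensen-Shannon distance are assumptions, and the Mann-Whitney p-values themselves are taken as given (in practice they could come from a routine such as scipy.stats.mannwhitneyu).

```python
import math


def rank_average(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(test_vals, control_vals):
    """U statistic of the test group, computed from its rank sum."""
    n1 = len(test_vals)
    ranks = rank_average(list(test_vals) + list(control_vals))
    return sum(ranks[:n1]) - n1 * (n1 + 1) / 2


def rank_biserial(test_vals, control_vals):
    """rbc = 2U / (n1 * n2) - 1 (assumed sign: +1 when test > control everywhere)."""
    u = mann_whitney_u(test_vals, control_vals)
    return 2 * u / (len(test_vals) * len(control_vals)) - 1


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        qvals[idx] = running_min
    return qvals


def js_distance(p, q):
    """Jensen-Shannon distance with base-2 logs, so the result lies in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(max(0.0, (kl(p, mid) + kl(q, mid)) / 2))


def specificity(frac_expressing, group_index):
    """1 - JS distance to the 'only this group expresses the gene' distribution.

    Assumes at least one group has a non-zero expressing fraction.
    """
    total = sum(frac_expressing)
    observed = [f / total for f in frac_expressing]
    perfect = [1.0 if i == group_index else 0.0 for i in range(len(observed))]
    return 1 - js_distance(observed, perfect)
```

Under these conventions, a gene whose test-group values all exceed the control values gets an rbc of 1, and a gene expressed in only the test group gets a specificity of 1.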

Parameters:
  • adata (AnnData) – An AnnData object.

  • genes (Optional[List[str]]) – The list of genes used to subset the data for the differential expression test. If None, all genes will be used.

  • layer (Optional[str]) – The layer from which the data for the differential expression test is retrieved. If None, .X is used.

  • group (Optional[str]) – The column key/name that identifies the grouping information (for example, clusters that correspond to different cell types or different time points) of cells. This will be used for calculating group-specific genes.

  • test_group (Optional[str]) – The group name from group for which markers are to be found.

  • control_groups (Optional[List[str]]) – The list of group name(s) from group against which the markers are tested.

  • X_data (Optional[ndarray]) – The user-supplied data that will be used directly for marker gene detection. Defaults to None.

  • exp_frac_thresh (Optional[float]) – The minimum fraction of cells with expression required for a gene to proceed to the differential expression test. If layer is not a velocity-related layer (such as velocity_S), exp_frac_thresh is set to 0.1 by default; otherwise it is set to 0. Defaults to None.

  • log2_fc_thresh (Optional[str]) – The minimum log2 fold change required for a gene to proceed to the differential expression test. If layer is not a velocity-related layer (such as velocity_S), log2_fc_thresh is set to 1 by default; otherwise it is set to 0. Defaults to None.

  • qval_thresh (float) – The maximal threshold of qval to be considered as significant genes. Defaults to 0.05.

  • subset_control_vals (Optional[bool]) – Whether to subset the top-ranked control values. When subset_control_vals is None, it is set to True if layer is not a velocity-, acceleration-, or curvature-related layer, and to False otherwise. When layer is not one of these layers, the control values are sorted by absolute value. Defaults to None.
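The layer-dependent defaults described for exp_frac_thresh and log2_fc_thresh can be illustrated with a small sketch. The helper names below (is_velocity_layer, resolve_thresholds) are hypothetical and not part of dynamo, and the rule used to decide whether a layer is velocity related (its name starts with "velocity") is an assumption made for illustration.

```python
def is_velocity_layer(layer):
    """Assumption: a layer is velocity related if its name starts with 'velocity'."""
    return layer is not None and layer.startswith("velocity")


def resolve_thresholds(layer=None, exp_frac_thresh=None, log2_fc_thresh=None):
    """Fill in None thresholds following the documented defaults.

    Non-velocity layers: exp_frac_thresh = 0.1, log2_fc_thresh = 1.
    Velocity-related layers: both thresholds default to 0.
    Explicitly passed values are left unchanged.
    """
    velocity = is_velocity_layer(layer)
    if exp_frac_thresh is None:
        exp_frac_thresh = 0.0 if velocity else 0.1
    if log2_fc_thresh is None:
        log2_fc_thresh = 0.0 if velocity else 1.0
    return exp_frac_thresh, log2_fc_thresh
```

For example, resolve_thresholds() yields the expression-layer defaults, while resolve_thresholds("velocity_S") relaxes both thresholds to 0, presumably because velocity values can be negative and a fold-change cutoff is less meaningful for them.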

Raises:

ValueError – X_data is provided but genes does not correspond to its columns.

Return type:

DataFrame

Returns:

A pandas DataFrame of the differential expression analysis result between the two groups.