dynamo.pp.select_genes_by_pearson_residuals

dynamo.pp.select_genes_by_pearson_residuals(adata, layer=None, theta=100, clip=None, n_top_genes=2000, batch_key=None, chunksize=1000, check_values=True, inplace=True)

Gene selection and normalization based on [Lause21].

This function selects genes based on the variance of their Pearson residuals. It expects raw counts as input.

Parameters:
  • adata (AnnData) – an annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.

  • layer (Optional[str]) – the layer to perform gene selection on.

  • theta (float) – the negative binomial overdispersion parameter theta for Pearson residuals. Higher values correspond to less overdispersion (var = mean + mean^2/theta), and theta=np.inf corresponds to a Poisson model.

  • clip (Optional[float]) – the threshold that determines if and how residuals are clipped. If None (the default), residuals are clipped to the interval [-sqrt(n), sqrt(n)], where n is the number of cells in the dataset. If a scalar c, residuals are clipped to the interval [-c, c]. Set clip=np.inf for no clipping.

  • n_top_genes (int) – the number of highly-variable genes to keep.

  • batch_key (Optional[str]) – the key identifying batches. If given, highly-variable genes are selected within each batch separately and then merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. Genes are first sorted by the number of batches in which they are an HVG. Ties are broken by the median rank (across batches) based on within-batch residual variance.

  • chunksize (int) – the number of genes processed at once while computing the Pearson residual variance. Choosing a smaller value reduces the required memory.

  • check_values (bool) – whether to check that counts in the selected layer are integers. If True, a warning is issued when non-integer values are found.

  • inplace (bool) – whether to place results in adata or return them.
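
The roles of theta, clip, and chunksize described above can be illustrated with a small NumPy sketch. This is not dynamo's implementation: the function names are hypothetical, and the expected counts are assumed to follow the null model of [Lause21] (per-cell total times per-gene total, divided by the grand total). The sketch also assumes every cell and gene has at least one count.

```python
import numpy as np


def pearson_residual_variance(counts, theta=100.0, clip=None, chunksize=1000):
    """Per-gene variance of clipped Pearson residuals (illustrative sketch)."""
    counts = np.asarray(counts, dtype=float)
    n_cells, n_genes = counts.shape

    # Default clipping interval is [-sqrt(n), sqrt(n)], n = number of cells.
    if clip is None:
        clip = np.sqrt(n_cells)

    cell_sums = counts.sum(axis=1, keepdims=True)  # shape (n_cells, 1)
    gene_sums = counts.sum(axis=0, keepdims=True)  # shape (1, n_genes)
    total = counts.sum()

    variances = np.empty(n_genes)
    # Process genes in chunks so only an (n_cells, chunksize) block of
    # residuals is held in memory at a time.
    for start in range(0, n_genes, chunksize):
        stop = min(start + chunksize, n_genes)
        # Assumed null model: expected count = cell total * gene total / total.
        mu = cell_sums @ gene_sums[:, start:stop] / total
        # Negative binomial variance: var = mean + mean^2 / theta.
        # With theta = np.inf the second term vanishes (Poisson model).
        residuals = (counts[:, start:stop] - mu) / np.sqrt(mu + mu**2 / theta)
        variances[start:stop] = np.clip(residuals, -clip, clip).var(axis=0)
    return variances


def select_top_genes(counts, n_top_genes, **kwargs):
    """Indices of the n_top_genes genes with the largest residual variance."""
    variances = pearson_residual_variance(counts, **kwargs)
    return np.argsort(variances)[::-1][:n_top_genes]


# Toy data: 200 cells x 50 genes of Poisson noise, with gene 0 made
# strongly bimodal so that it stands out as highly variable.
rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(200, 50))
counts[:100, 0] = rng.poisson(1.0, size=100)
counts[100:, 0] = rng.poisson(40.0, size=100)
top = select_top_genes(counts, n_top_genes=5)
```

Because each gene's residuals depend only on the cell totals, that gene's total, and the grand total, the chunk size changes memory use but not the resulting variances.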

Return type:

Optional[Tuple[DataFrame, DataFrame]]

Returns:

If inplace is True, adata is updated in place and nothing is returned. Otherwise, the adata object and the selected highly-variable genes are returned.
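
The batch merging rule described for batch_key (sort by the number of batches in which a gene is an HVG, then break ties by median rank) can be sketched as follows. This is a hypothetical helper, not dynamo's code. It assumes the per-batch residual variances are already available as an array of shape (n_batches, n_genes), and that the median rank is taken over all batches, which the description above does not spell out.

```python
import numpy as np


def merge_batch_hvgs(batch_variances, n_top_genes):
    """Merge per-batch HVG rankings (illustrative sketch).

    batch_variances: array of shape (n_batches, n_genes) holding the
    within-batch residual variance of each gene.
    """
    batch_variances = np.asarray(batch_variances, dtype=float)

    # Rank genes within each batch: rank 0 is the highest residual variance.
    order = np.argsort(-batch_variances, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(order.shape[0])[:, None]
    ranks[rows, order] = np.arange(order.shape[1])[None, :]

    # Number of batches in which each gene is among the top n_top_genes.
    n_batches_hvg = (ranks < n_top_genes).sum(axis=0)
    # Assumption: median rank is computed across all batches.
    median_rank = np.median(ranks, axis=0)

    # np.lexsort uses the last key as the primary key: more batches first,
    # then lower median rank.
    merged = np.lexsort((median_rank, -n_batches_hvg))
    return merged[:n_top_genes]


# Two batches, four genes, keep two genes.
variances = np.array([
    [5.0, 4.0, 1.0, 0.5],  # batch 0: genes 0 and 1 are the top two
    [4.0, 0.2, 6.0, 0.1],  # batch 1: genes 2 and 0 are the top two
])
selected = merge_batch_hvgs(variances, n_top_genes=2)
# Gene 0 is an HVG in both batches. Genes 1 and 2 tie at one batch each,
# and gene 2 wins on median rank (1.0 versus 1.5), so selected is [0, 2].
```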