{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Gene ID → Gene Symbol conversion in `dynamo`\n", "\n", "This notebook is a **practical companion** for a very common preprocessing step in single-cell workflows:\n", "converting feature identifiers (e.g. Ensembl gene IDs like `ENSG00000141510`) into **human-readable gene symbols**\n", "(e.g. `TP53`) and storing the result in your `AnnData` object.\n", "\n", "Why this matters:\n", "\n", "- Many public datasets (and some pipelines) ship with **Ensembl IDs** as `adata.var_names`.\n", "- Downstream steps like marker inspection, gene set scoring, plotting, and cross-dataset integration are usually easier\n", "  when **gene symbols** are used consistently.\n", "- Some methods/tools also expect symbols, or at least benefit from a standardized identifier space.\n", "\n", "In this tutorial we demonstrate two typical scenarios:\n", "\n", "1. **Human (Hematopoiesis)**: convert Ensembl IDs in a `dynamo.sample_data` dataset.\n", "2. **Zebrafish**: convert Ensembl IDs (often with version suffixes like `.1`) and show how to specify the\n", "   expected ID “scope” / database release.\n", "\n", "> **Tip:** ID conversion is never perfectly lossless. Always check the **mapping rate** and decide how to handle\n", "> duplicates after conversion (two Ensembl IDs mapping to the same symbol).\n", "\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Setup\n", "\n", "We start by enabling IPython's `autoreload` extension and importing `dynamo` as `dyn`.\n", "The two setup cells appear right after the introduction to Section 2.\n", "\n", "If you run into font warnings on a server (common on Linux/HPC), you can ignore them; they do not affect the analysis.\n", "\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Load an example dataset (human)\n", "\n", "Here we use `dynamo.sample_data.hematopoiesis_raw()` as a compact, real-world example.\n", "\n", "- It contains raw count matrices and metadata needed for typical preprocessing.\n", "- In many datasets like this, genes can be indexed by **Ensembl IDs** instead of symbols.\n", "\n", "We'll inspect `adata.var_names` first to understand what identifier system we are starting from.\n", "\n" ] },
{ "cell_type": "code", "execution_count": 1, "id": "711fc94a-a16f-4ac8-9ed8-bf99d6be8cbe", "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] },
{ "cell_type": "code", "execution_count": 2, "id": "4c75efe0-825e-481a-9eaf-554fb6b7460e", "metadata": { "scrolled": true }, "outputs": [], "source": [ "import dynamo as dyn" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 3. `convert2gene_symbol`\n", "\n", "`dynamo.pp.convert2gene_symbol` performs **batch ID mapping** and returns a table containing at least:\n", "\n", "- `query`: the ID used for the query (the table index; often the Ensembl ID **without** version suffix)\n", "- `symbol`: the mapped gene symbol\n", "\n", "Under the hood, the conversion relies on an identifier query service (the same *concept* as MyGene.info-style\n", "“batch query” APIs), where the key idea is that you must tell the service what kind of IDs you are providing via the\n", "`scopes` argument (e.g. `ensembl.gene` or `ensembl.transcript`).\n", "\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Test the conversion\n", "\n", "A typical safe pattern is:\n", "\n", "1. Store the “query-ready” ID in `adata.var['query']` (often after stripping the version suffix).\n", "2. Call `convert2gene_symbol(...)`.\n", "3. Merge the returned table back into `adata.var`.\n", "4. Subset to successfully mapped genes (optional but common).\n", "5. Set `adata.var_names = adata.var['symbol']`.\n", "\n", "After you set `adata.var_names` to symbols, consider also keeping the original IDs in a separate column (e.g.\n", "`adata.var['ensembl_id']`) for traceability.\n", "\n" ] },
{ "cell_type": "code", "execution_count": 5, "id": "9c846c87-d8d7-4e79-aa59-39272f1caf26", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "|-----> Auto-detected species: human\n", "|-----> Conversion finished. Found 2/2 symbols.\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>symbol</th><th>_score</th></tr>\n",
"    <tr><th>query</th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>ENSG00000167286</th><td>CD3D</td><td>1.0</td></tr>\n",
"    <tr><th>ENSG00000156738</th><td>MS4A1</td><td>1.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " symbol _score\n", "query \n", "ENSG00000167286 CD3D 1.0\n", "ENSG00000156738 MS4A1 1.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dyn.preprocessing.convert2gene_symbol(\n", "    ['ENSG00000167286', 'ENSG00000156738'],  # ensembl_release=109,\n", ")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Zebrafish example (when IDs are `ENSDARG...`)\n", "\n", "For non-human datasets, pay extra attention to **species** and **annotation version**.\n", "\n", "Zebrafish Ensembl gene IDs typically start with `ENSDARG`. Many pipelines also append a version suffix (e.g. `.1`),\n", "which should be stripped before conversion. The ID used in the example below carries no suffix.\n", "\n", "Depending on your pipeline, you may also want to pass an organism-specific **Ensembl release** via `ensembl_release`\n", "(or otherwise match the annotation build you used for quantification). If the mapping looks unexpectedly poor, the most\n", "common causes are:\n", "\n", "- using the wrong release / annotation build\n", "- providing transcript IDs while querying as gene IDs (or vice versa)\n", "- keeping the version suffix\n", "\n", "The goal of this section is not to claim one “correct” release universally, but to show the pattern for **making the\n", "mapping explicit** and reproducible.\n", "\n" ] },
{ "cell_type": "code", "execution_count": 4, "id": "2cb9846f-c158-44fb-8d86-ab96fd9a4b01", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "|-----> Auto-detected species: zebrafish\n", "|-----> Conversion finished. Found 1/1 symbols.\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>symbol</th><th>_score</th></tr>\n",
"    <tr><th>query</th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>ENSDARG00000035558</th><td>gps2</td><td>1.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " symbol _score\n", "query \n", "ENSDARG00000035558 gps2 1.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dyn.preprocessing.convert2gene_symbol(\n", "    ['ENSDARG00000035558'], ensembl_release=77,\n", ")" ] },
{ "cell_type": "code", "execution_count": 6, "id": "4e4cdc9f-d936-4348-972a-c763ef78120f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "|-----> Downloading raw hematopoiesis adata\n", "|-----> Downloading data to ./data/hematopoiesis_raw.h5ad\n", "|-----> File ./data/hematopoiesis_raw.h5ad already exists.\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>gene_name_mapping</th></tr>\n",
"    <tr><th>gene_id</th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>ENSG00000000003</th><td>None</td></tr>\n",
"    <tr><th>ENSG00000000005</th><td>None</td></tr>\n",
"    <tr><th>ENSG00000000419</th><td>None</td></tr>\n",
"    <tr><th>ENSG00000000457</th><td>None</td></tr>\n",
"    <tr><th>ENSG00000000460</th><td>None</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " gene_name_mapping\n", "gene_id \n", "ENSG00000000003 None\n", "ENSG00000000005 None\n", "ENSG00000000419 None\n", "ENSG00000000457 None\n", "ENSG00000000460 None" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata = dyn.sample_data.hematopoiesis_raw()\n", "adata.var.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 6. (Optional) Automatic ID conversion during preprocessing\n", "\n", "You do not always have to convert IDs by hand. When the gene index holds Ensembl IDs, dynamo's monocle preprocessing\n", "pipeline converts them to gene symbols as part of the run, as the log output of the cell below shows. If your\n", "identifiers are already standardized, you can simply proceed with your preferred preprocessing recipe.\n", "\n", "Here we use `dyn.pp.Preprocessor` with the monocle recipe (`config_monocle_recipe` followed by\n", "`preprocess_adata_monocle`), which performs typical steps (filtering, normalization, feature selection, PCA)\n", "and is commonly used in dynamo workflows.\n", "\n", "> Note: If you change `var_names` after preprocessing, you may break assumptions in downstream cached results.\n", "> In practice, it's best to do ID conversion **before** running the main preprocessing pipeline.\n", "\n" ] },
{ "cell_type": "code", "execution_count": 7, "id": "fca3ddff-eba1-4c16-b002-0d31715c47f8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "|-----> Running monocle preprocessing pipeline...\n", "|-----> convert ensemble name to official gene name\n", "|-----? Your adata object uses non-official gene names as gene index. \n", "Dynamo is converting those names to official gene names.\n", "|-----> Auto-detected species: human\n", "|-----> Conversion finished. Found 24635/26193 symbols.\n", "|-----------> filtered out 0 outlier cells\n", "|-----------> filtered out 23299 outlier genes\n", "|-----> PCA dimension reduction\n", "|-----> X_pca to obsm in AnnData Object.\n", "|-----> [Preprocessor-monocle] completed [1.6145s]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/fernandozeng/Desktop/analysis/dynamo-release/dynamo/preprocessing/utils.py:911: RuntimeWarning: invalid value encountered in divide\n", " var_ntr = adata.layers[\"new\"].sum(0) / adata.layers[\"total\"].sum(0)\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>gene_name_mapping</th><th>query</th><th>scopes</th><th>symbol</th><th>_score</th><th>nCells</th><th>nCounts</th><th>pass_basic_filter</th><th>log_m</th><th>log_cv</th><th>score</th><th>frac</th><th>use_for_pca</th><th>ntr</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>TSPAN6</th><td>None</td><td>ENSG00000000003</td><td>ENSG00000000003</td><td>TSPAN6</td><td>1.0</td><td>16</td><td>16.0</td><td>False</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0.000004</td><td>False</td><td>0.238095</td></tr>\n",
"    <tr><th>TNMD</th><td>None</td><td>ENSG00000000005</td><td>ENSG00000000005</td><td>TNMD</td><td>1.0</td><td>0</td><td>0.0</td><td>False</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0.000000</td><td>False</td><td>0.000000</td></tr>\n",
"    <tr><th>DPM1</th><td>None</td><td>ENSG00000000419</td><td>ENSG00000000419</td><td>DPM1</td><td>1.0</td><td>161</td><td>180.0</td><td>True</td><td>-3.630229</td><td>2.004751</td><td>0.019326</td><td>0.000046</td><td>True</td><td>0.601476</td></tr>\n",
"    <tr><th>SCYL3</th><td>None</td><td>ENSG00000000457</td><td>ENSG00000000457</td><td>SCYL3</td><td>1.0</td><td>76</td><td>79.0</td><td>False</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0.000024</td><td>False</td><td>0.364486</td></tr>\n",
"    <tr><th>C1orf112</th><td>None</td><td>ENSG00000000460</td><td>ENSG00000000460</td><td>C1orf112</td><td>1.0</td><td>73</td><td>81.0</td><td>False</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0.000020</td><td>False</td><td>0.500000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " gene_name_mapping query scopes symbol \\\n", "TSPAN6 None ENSG00000000003 ENSG00000000003 TSPAN6 \n", "TNMD None ENSG00000000005 ENSG00000000005 TNMD \n", "DPM1 None ENSG00000000419 ENSG00000000419 DPM1 \n", "SCYL3 None ENSG00000000457 ENSG00000000457 SCYL3 \n", "C1orf112 None ENSG00000000460 ENSG00000000460 C1orf112 \n", "\n", " _score nCells nCounts pass_basic_filter log_m log_cv \\\n", "TSPAN6 1.0 16 16.0 False NaN NaN \n", "TNMD 1.0 0 0.0 False NaN NaN \n", "DPM1 1.0 161 180.0 True -3.630229 2.004751 \n", "SCYL3 1.0 76 79.0 False NaN NaN \n", "C1orf112 1.0 73 81.0 False NaN NaN \n", "\n", " score frac use_for_pca ntr \n", "TSPAN6 NaN 0.000004 False 0.238095 \n", "TNMD NaN 0.000000 False 0.000000 \n", "DPM1 0.019326 0.000046 True 0.601476 \n", "SCYL3 NaN 0.000024 False 0.364486 \n", "C1orf112 NaN 0.000020 False 0.500000 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preprocessor = dyn.pp.Preprocessor()\n", "preprocessor.config_monocle_recipe(\n", "    adata,\n", "    n_top_genes=2000,\n", ")\n", "preprocessor.preprocess_adata_monocle(\n", "    adata,\n", "    tkey=\"time\",\n", "    experiment_type=\"one-shot\",\n", ")\n", "adata.var.head()" ] },
{ "cell_type": "code", "execution_count": null, "id": "b44acca5-9b5e-49f2-9cae-98d87f8998de", "metadata": {}, "outputs": [], "source": [] } ],
"metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.18" } }, "nbformat": 4, "nbformat_minor": 5 }