{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Gene ID → Gene Symbol conversion in `dynamo`\n",
    "\n",
    "This notebook is a **practical companion** for a very common preprocessing step in single-cell workflows:\n",
    "converting feature identifiers (e.g. Ensembl gene IDs like `ENSG00000141510`) into **human-readable gene symbols**\n",
    "(e.g. `TP53`) and storing the result in your `AnnData` object.\n",
    "\n",
    "Why this matters:\n",
    "\n",
    "- Many public datasets (and some pipelines) ship with **Ensembl IDs** as `adata.var_names`.  \n",
    "- Downstream steps like marker inspection, gene set scoring, plotting, and cross-dataset integration are usually easier\n",
    "  when **gene symbols** are used consistently.\n",
    "- Some methods/tools also expect symbols, or at least benefit from a standardized identifier space.\n",
    "\n",
    "In this tutorial we demonstrate two typical scenarios:\n",
    "\n",
    "1. **Human (Hematopoiesis)**: convert Ensembl IDs in a `dynamo.sample_data` dataset.\n",
    "2. **Zebrafish**: convert Ensembl IDs (often with version suffixes like `.1`) and show how to specify the\n",
    "   expected ID “scope” / database release.\n",
    "\n",
    "> **Tip:** ID conversion is never perfectly lossless. Always check the **mapping rate** and how you want to handle\n",
    "> duplicates after conversion (two Ensembl IDs mapping to the same symbol).\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Setup\n",
    "\n",
    "We start by importing `dynamo` and setting plotting defaults.  \n",
    "This follows the style of the official dynamo tutorials: keep the notebook reproducible, and make figures look\n",
    "consistent across machines (especially important when sharing results).\n",
    "\n",
    "If you run into font warnings on a server (common on Linux/HPC), you can ignore them; it won't affect the analysis.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Load an example dataset (human)\n",
    "\n",
    "Here we use `dynamo.sample_data.hematopoiesis_raw()` as a compact, real-world example.\n",
    "\n",
    "- It contains raw count matrices and metadata needed for typical preprocessing.\n",
    "- In many datasets like this, genes can be indexed by **Ensembl IDs** instead of symbols.\n",
    "\n",
    "We'll inspect `adata.var_names` first to understand what identifier system we are starting from.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "711fc94a-a16f-4ac8-9ed8-bf99d6be8cbe",
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "4c75efe0-825e-481a-9eaf-554fb6b7460e",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import dynamo as dyn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. `convert2gene_symbol` \n",
    "\n",
    "`dynamo.pp.convert2gene_symbol` performs **batch ID mapping** and returns a table containing at least:\n",
    "\n",
    "- `query`: the ID used for the query (often the Ensembl ID **without** version suffix)\n",
    "- `symbol`: the mapped gene symbol\n",
    "\n",
    "Under the hood, the conversion relies on an identifier query service (the same *concept* as MyGene.info-style\n",
    "“batch query” APIs), where the key idea is you must tell the service what kind of IDs you are providing via the\n",
    "`scopes` argument (e.g. `ensembl.gene` or `ensembl.transcript`). \n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Convert test\n",
    "\n",
    "A typical safe pattern is:\n",
    "\n",
    "1. Store the “query-ready” ID in `adata.var['query']` (often stripping version suffix).\n",
    "2. Call `convert2gene_symbol(...)`.\n",
    "3. Merge the returned table back into `adata.var`.\n",
    "4. Subset to successfully mapped genes (optional but common).\n",
    "5. Set `adata.var_names = adata.var['symbol']`.\n",
    "\n",
    "After you set `adata.var_names` to symbols, consider also keeping the original IDs in a separate column (e.g.\n",
    "`adata.var['ensembl_id']`) for traceability.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "9c846c87-d8d7-4e79-aa59-39272f1caf26",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "|-----> Auto-detected species: human\n",
      "|-----> Conversion finished. Found 2/2 symbols.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>symbol</th>\n",
       "      <th>_score</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>query</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ENSG00000167286</th>\n",
       "      <td>CD3D</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000156738</th>\n",
       "      <td>MS4A1</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                symbol  _score\n",
       "query                         \n",
       "ENSG00000167286   CD3D     1.0\n",
       "ENSG00000156738  MS4A1     1.0"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dyn.preprocessing.convert2gene_symbol(\n",
    "    ['ENSG00000167286','ENSG00000156738'],#ensembl_release=109,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Zebrafish example (when IDs are `ENSDARG...`)\n",
    "\n",
    "For non-human datasets, pay extra attention to **species** and **annotation version**.\n",
    "\n",
    "Zebrafish Ensembl gene IDs typically start with `ENSDARG`. Many pipelines also append a version suffix (e.g. `.1`),\n",
    "so we strip it before conversion.\n",
    "\n",
    "Depending on your pipeline, you may also want to pass an organism-specific **Ensembl release** (or otherwise match the\n",
    "annotation build you used for quantification). If the mapping looks unexpectedly poor, the most common causes are:\n",
    "\n",
    "- using the wrong release / annotation build\n",
    "- providing transcript IDs while querying as gene IDs (or vice versa)\n",
    "- keeping the version suffix\n",
    "\n",
    "The goal of this section is not to claim one “correct” release universally, but to show the pattern for **making the\n",
    "mapping explicit** and reproducible.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "2cb9846f-c158-44fb-8d86-ab96fd9a4b01",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "|-----> Auto-detected species: zebrafish\n",
      "|-----> Conversion finished. Found 1/1 symbols.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>symbol</th>\n",
       "      <th>_score</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>query</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ENSDARG00000035558</th>\n",
       "      <td>gps2</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   symbol  _score\n",
       "query                            \n",
       "ENSDARG00000035558   gps2     1.0"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dyn.preprocessing.convert2gene_symbol(\n",
    "    ['ENSDARG00000035558'],ensembl_release=77,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "4e4cdc9f-d936-4348-972a-c763ef78120f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "|-----> Downloading raw hematopoiesis adata\n",
      "|-----> Downloading data to ./data/hematopoiesis_raw.h5ad\n",
      "|-----> File ./data/hematopoiesis_raw.h5ad already exists.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>gene_name_mapping</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gene_id</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ENSG00000000003</th>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000000005</th>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000000419</th>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000000457</th>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSG00000000460</th>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                gene_name_mapping\n",
       "gene_id                          \n",
       "ENSG00000000003              None\n",
       "ENSG00000000005              None\n",
       "ENSG00000000419              None\n",
       "ENSG00000000457              None\n",
       "ENSG00000000460              None"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata = dyn.sample_data.hematopoiesis_raw()\n",
    "adata.var.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. (Optional) ID conversion automatically in preprocess\n",
    "\n",
    "Once your gene identifiers are standardized, you can proceed with your preferred preprocessing recipe.\n",
    "\n",
    "Here we show `dyn.pp.recipe_monocle`, which performs typical steps (filtering, normalization, feature selection, PCA)\n",
    "and is commonly used in dynamo workflows.\n",
    "\n",
    "> Note: If you change `var_names` after preprocessing, you may break assumptions in downstream cached results.\n",
    "> In practice, it's best to do ID conversion **before** running the main preprocessing pipeline.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "fca3ddff-eba1-4c16-b002-0d31715c47f8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "|-----> Running monocle preprocessing pipeline...\n",
      "|-----> convert ensemble name to official gene name\n",
      "|-----? Your adata object uses non-official gene names as gene index. \n",
      "Dynamo is converting those names to official gene names.\n",
      "|-----> Auto-detected species: human\n",
      "|-----> Conversion finished. Found 24635/26193 symbols.\n",
      "|-----------> filtered out 0 outlier cells\n",
      "|-----------> filtered out 23299 outlier genes\n",
      "|-----> PCA dimension reduction\n",
      "|-----> <insert> X_pca to obsm in AnnData Object.\n",
      "|-----> [Preprocessor-monocle] completed [1.6145s]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/fernandozeng/Desktop/analysis/dynamo-release/dynamo/preprocessing/utils.py:911: RuntimeWarning: invalid value encountered in divide\n",
      "  var_ntr = adata.layers[\"new\"].sum(0) / adata.layers[\"total\"].sum(0)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>gene_name_mapping</th>\n",
       "      <th>query</th>\n",
       "      <th>scopes</th>\n",
       "      <th>symbol</th>\n",
       "      <th>_score</th>\n",
       "      <th>nCells</th>\n",
       "      <th>nCounts</th>\n",
       "      <th>pass_basic_filter</th>\n",
       "      <th>log_m</th>\n",
       "      <th>log_cv</th>\n",
       "      <th>score</th>\n",
       "      <th>frac</th>\n",
       "      <th>use_for_pca</th>\n",
       "      <th>ntr</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>TSPAN6</th>\n",
       "      <td>None</td>\n",
       "      <td>ENSG00000000003</td>\n",
       "      <td>ENSG00000000003</td>\n",
       "      <td>TSPAN6</td>\n",
       "      <td>1.0</td>\n",
       "      <td>16</td>\n",
       "      <td>16.0</td>\n",
       "      <td>False</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000004</td>\n",
       "      <td>False</td>\n",
       "      <td>0.238095</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TNMD</th>\n",
       "      <td>None</td>\n",
       "      <td>ENSG00000000005</td>\n",
       "      <td>ENSG00000000005</td>\n",
       "      <td>TNMD</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>False</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>DPM1</th>\n",
       "      <td>None</td>\n",
       "      <td>ENSG00000000419</td>\n",
       "      <td>ENSG00000000419</td>\n",
       "      <td>DPM1</td>\n",
       "      <td>1.0</td>\n",
       "      <td>161</td>\n",
       "      <td>180.0</td>\n",
       "      <td>True</td>\n",
       "      <td>-3.630229</td>\n",
       "      <td>2.004751</td>\n",
       "      <td>0.019326</td>\n",
       "      <td>0.000046</td>\n",
       "      <td>True</td>\n",
       "      <td>0.601476</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>SCYL3</th>\n",
       "      <td>None</td>\n",
       "      <td>ENSG00000000457</td>\n",
       "      <td>ENSG00000000457</td>\n",
       "      <td>SCYL3</td>\n",
       "      <td>1.0</td>\n",
       "      <td>76</td>\n",
       "      <td>79.0</td>\n",
       "      <td>False</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000024</td>\n",
       "      <td>False</td>\n",
       "      <td>0.364486</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C1orf112</th>\n",
       "      <td>None</td>\n",
       "      <td>ENSG00000000460</td>\n",
       "      <td>ENSG00000000460</td>\n",
       "      <td>C1orf112</td>\n",
       "      <td>1.0</td>\n",
       "      <td>73</td>\n",
       "      <td>81.0</td>\n",
       "      <td>False</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000020</td>\n",
       "      <td>False</td>\n",
       "      <td>0.500000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         gene_name_mapping            query           scopes    symbol  \\\n",
       "TSPAN6                None  ENSG00000000003  ENSG00000000003    TSPAN6   \n",
       "TNMD                  None  ENSG00000000005  ENSG00000000005      TNMD   \n",
       "DPM1                  None  ENSG00000000419  ENSG00000000419      DPM1   \n",
       "SCYL3                 None  ENSG00000000457  ENSG00000000457     SCYL3   \n",
       "C1orf112              None  ENSG00000000460  ENSG00000000460  C1orf112   \n",
       "\n",
       "          _score  nCells  nCounts  pass_basic_filter     log_m    log_cv  \\\n",
       "TSPAN6       1.0      16     16.0              False       NaN       NaN   \n",
       "TNMD         1.0       0      0.0              False       NaN       NaN   \n",
       "DPM1         1.0     161    180.0               True -3.630229  2.004751   \n",
       "SCYL3        1.0      76     79.0              False       NaN       NaN   \n",
       "C1orf112     1.0      73     81.0              False       NaN       NaN   \n",
       "\n",
       "             score      frac  use_for_pca       ntr  \n",
       "TSPAN6         NaN  0.000004        False  0.238095  \n",
       "TNMD           NaN  0.000000        False  0.000000  \n",
       "DPM1      0.019326  0.000046         True  0.601476  \n",
       "SCYL3          NaN  0.000024        False  0.364486  \n",
       "C1orf112       NaN  0.000020        False  0.500000  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "preprocessor = dyn.pp.Preprocessor()\n",
    "preprocessor.config_monocle_recipe(\n",
    "    adata,\n",
    "    n_top_genes=2000\n",
    ")\n",
    "preprocessor.preprocess_adata_monocle(\n",
    "   \tadata,\n",
    "    tkey=\"time\",\n",
    "    experiment_type=\"one-shot\",\n",
    ")\n",
    "adata.var.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b44acca5-9b5e-49f2-9cae-98d87f8998de",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}