{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9144495b-2433-4bb9-9b6f-6e282ea07891",
   "metadata": {},
   "source": [
    "# 3 LINCS SCIPLEX GENE MATCHING\n",
    "\n",
    "**Requires**\n",
    "* `'lincs_full_smiles.h5ad'`\n",
    "* `'sciplex_raw_chunk_{i}.h5ad'` with $i \\in \\{0,1,2,3,4\\}$\n",
    "\n",
    "**Output**\n",
    "* `'sciplex3_matched_genes_lincs.h5ad'`\n",
    "* `lincs`: `'sciplex3_lincs_genes.h5ad'`\n",
    "* `sciplex`: `'lincs_full_smiles_sciplex_genes.h5ad'`\n",
    "\n",
    "## Description\n",
    "\n",
    "This notebook matches genes between the LINCS and SciPlex datasets and creates three new datasets:\n",
    "\n",
    "### Created datasets\n",
    "\n",
    "- **`sciplex3_matched_genes_lincs.h5ad`**: Contains **SciPlex observations**. **Genes are limited to the union** of the genes shared between LINCS and SciPlex and the genes that are highly variable in SciPlex.\n",
    "\n",
    "- **`sciplex3_lincs_genes.h5ad`**: Contains **SciPlex data**, filtered to include **only the genes that are shared with the LINCS dataset** (strict intersection, 977 genes).\n",
    "\n",
    "- **`lincs_full_smiles_sciplex_genes.h5ad`**: Contains **LINCS data**, filtered to include **only the genes that are shared with the SciPlex dataset**.\n",
    "\n",
    "To create these datasets, we match the genes between the two datasets as follows:\n",
    "\n",
    "### Gene Matching\n",
    "\n",
    "1. **Gene ID Assignment**: SciPlex gene IDs are standardized to Ensembl gene IDs by stripping the version suffix from the `id` column (keeping the part before the first `.`). LINCS gene symbols are mapped to Ensembl gene IDs with a predefined mapping (`symbols_dict.json`), falling back to **sfaira** if that file is not available.\n",
    "\n",
    "2. **Identifying Shared Genes**: We then compute the intersection of the gene IDs (`gene_id`) in LINCS and SciPlex. Both datasets are filtered to retain only these shared genes.\n",
    "\n",
    "3. **Reindexing**: The LINCS dataset is reindexed to match the order of genes in the SciPlex dataset.\n",
    "\n"
   ]
  },
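  {
   "cell_type": "markdown",
   "id": "a3f1d2c4-gene-matching-sketch",
   "metadata": {},
   "source": [
    "The three gene-matching steps can be sketched on toy data. This is a minimal illustration under stated assumptions, not the notebook's actual pipeline: the tables `sciplex_var` and `lincs_var`, the version suffixes, and the `UNMAPPED` symbol are hypothetical stand-ins for `adata_sciplex.var` and `adata_lincs.var`.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy stand-ins for `adata_sciplex.var` and `adata_lincs.var` (hypothetical values).\n",
    "sciplex_var = pd.DataFrame(\n",
    "    {'id': ['ENSG00000000003.14', 'ENSG00000000005.5', 'ENSG00000000419.12']},\n",
    "    index=['TSPAN6', 'TNMD', 'DPM1'],\n",
    ")\n",
    "lincs_var = pd.DataFrame(index=pd.Index(['DPM1', 'TSPAN6', 'UNMAPPED'], name='symbol'))\n",
    "symbols_dict = {'DPM1': 'ENSG00000000419', 'TSPAN6': 'ENSG00000000003'}\n",
    "\n",
    "# Step 1: strip the version suffix (SciPlex) and map symbols to IDs (LINCS).\n",
    "sciplex_var['gene_id'] = sciplex_var['id'].str.split('.').str[0]\n",
    "lincs_var['gene_id'] = lincs_var.index.map(symbols_dict)\n",
    "\n",
    "# Step 2: keep only the genes that occur in both tables.\n",
    "sciplex_shared = sciplex_var[sciplex_var['gene_id'].isin(lincs_var['gene_id'])]\n",
    "lincs_shared = lincs_var[lincs_var['gene_id'].isin(sciplex_var['gene_id'])]\n",
    "\n",
    "# Step 3: reorder the LINCS genes to follow the SciPlex gene order.\n",
    "lincs_shared = lincs_shared.reset_index().set_index('gene_id').loc[sciplex_shared['gene_id']]\n",
    "print(lincs_shared['symbol'].tolist())  # ['TSPAN6', 'DPM1']\n",
    "```\n",
    "\n",
    "Symbols without an entry in the mapping receive a missing `gene_id` and therefore drop out of the intersection."
   ]
  },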
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "9a33a003-9ca0-4994-955c-305852e4d354",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/requests/__init__.py:104: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.2.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n",
      "  RequestsDependencyWarning)\n",
      "2023-08-19 10:31:31.638164: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA\n",
      "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
      "2023-08-19 10:31:34.020338: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory\n",
      "2023-08-19 10:31:34.020465: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory\n",
      "2023-08-19 10:31:34.020477: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "scanpy==1.9.1 anndata==0.8.0 umap==0.5.3 numpy==1.21.6 scipy==1.7.3 pandas==1.3.5 scikit-learn==1.0.2 statsmodels==0.13.2 pynndescent==0.5.6\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "import sys\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import sfaira\n",
    "import warnings\n",
    "\n",
    "from chemCPA.paths import DATA_DIR, PROJECT_DIR\n",
    "\n",
    "pd.set_option('display.max_columns', 100)\n",
    "\n",
    "# `__file__` is undefined inside a Jupyter kernel; fall back to the working\n",
    "# directory there (assumes the notebook is started from its own folder).\n",
    "try:\n",
    "    current_dir = os.path.dirname(os.path.abspath(__file__))\n",
    "except NameError:\n",
    "    current_dir = os.getcwd()\n",
    "root_dir = os.path.dirname(current_dir)\n",
    "sys.path.append(root_dir)\n",
    "import logging\n",
    "\n",
    "logging.basicConfig(level=logging.INFO)\n",
    "from notebook_utils import suppress_output\n",
    "\n",
    "import scanpy as sc\n",
    "with suppress_output():\n",
    "    sc.set_figure_params(dpi=80, frameon=False)\n",
    "    sc.logging.print_header()\n",
    "    warnings.filterwarnings('ignore')\n",
    "\n",
    "# Use a timestamped log format when running as a Python script.\n",
    "# The root logger is already configured above, so another basicConfig call\n",
    "# would be a no-op; update the existing handlers instead.\n",
    "if not any('ipykernel' in arg for arg in sys.argv):\n",
    "    formatter = logging.Formatter(\n",
    "        fmt='%(asctime)s - %(levelname)s - %(message)s',\n",
    "        datefmt='%Y-%m-%d %H:%M:%S'\n",
    "    )\n",
    "    for handler in logging.getLogger().handlers:\n",
    "        handler.setFormatter(formatter)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "d3c097de-7254-43e5-89e3-d3095e45f270",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The autoreload extension is already loaded. To reload it, use:\n",
      "  %reload_ext autoreload\n"
     ]
    }
   ],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c6073db-2c0a-413c-b4b5-17dcef7e064c",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Load data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3bcf6d3b-64c3-48cd-987b-f984c0e76ddd",
   "metadata": {},
   "source": [
    "Load lincs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "9a72ac06-5233-4e7a-ba63-79786d6d2c31",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "adata_lincs = sc.read(DATA_DIR/'lincs_full_smiles.h5ad' )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05efc7d8-b2ac-4fb5-bfd5-0938e5b80b1a",
   "metadata": {},
   "source": [
    "Load sciplex "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "b365aa7a-9957-4359-977d-dafe400df570",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n",
      "  [AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],\n"
     ]
    }
   ],
   "source": [
    "from tqdm import tqdm\n",
    "from chemCPA.paths import DATA_DIR, PROJECT_DIR\n",
    "from raw_data.datasets import sciplex\n",
    "\n",
    "# Load and concatenate chunks\n",
    "adatas_sciplex = []\n",
    "logging.info(\"Starting to load in sciplex data\")\n",
    "\n",
    "# Get paths to all sciplex chunks\n",
    "chunk_paths = sciplex()\n",
    "\n",
    "# Load chunks with progress bar\n",
    "for chunk_path in tqdm(chunk_paths, desc=\"Loading sciplex chunks\"):\n",
    "    tqdm.write(f\"Loading {os.path.basename(chunk_path)}\")\n",
    "    adatas_sciplex.append(sc.read(chunk_path))\n",
    "    \n",
    "adata_sciplex = adatas_sciplex[0].concatenate(adatas_sciplex[1:])\n",
    "logging.info(\"Sciplex data loaded\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f5c24ac-3b55-40f2-abee-22286b4c6d16",
   "metadata": {},
   "source": [
    "Add `gene_id` to SciPlex by stripping the version suffix from the Ensembl ID in `var.id`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "b72d957d-18eb-4994-875b-0bdd9db254c9",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata_sciplex.var['gene_id'] = adata_sciplex.var.id.str.split('.').str[0]\n",
    "adata_sciplex.var['gene_id'].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f0caf549-00e6-4d18-bb19-6c344c9f62c9",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Get gene ids from symbols via sfaira"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "065cb35d-ce63-4152-9fcb-ca939295bc29",
   "metadata": {},
   "source": [
    "Load the symbol-to-ID mapping from `symbols_dict.json`; if the file is missing, fall back to a sfaira genome container"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "11109df9-7724-4658-8865-7bb19ec98e3f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "try:\n",
    "    # Load the JSON file with the symbol-to-ID mapping\n",
    "    with open(DATA_DIR / 'symbols_dict.json') as json_file:\n",
    "        symbols_dict = json.load(json_file)\n",
    "except FileNotFoundError:\n",
    "    logging.info(\"No symbols_dict.json found, falling back to sfaira\")\n",
    "    genome_container = sfaira.versions.genomes.GenomeContainer(organism=\"homo_sapiens\", release=\"82\")\n",
    "    symbols_dict = genome_container.symbol_to_id_dict\n",
    "    # Extend symbols dict with unknown symbol\n",
    "    symbols_dict.update({'PLSCR3':'ENSG00000187838'})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d071d09e-dc42-4872-8cf5-88ad20f35fc2",
   "metadata": {},
   "source": [
    "Identify genes that are shared between LINCS and SciPlex"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "c9aba778-ade2-4bb1-b0e9-2e8bc7a95602",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# For lincs\n",
    "adata_lincs.var['gene_id'] = adata_lincs.var_names.map(symbols_dict)\n",
    "adata_lincs.var['in_sciplex'] = adata_lincs.var.gene_id.isin(adata_sciplex.var.gene_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "d79a3562-60ad-4795-823d-bdaed56e3fb5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# For SciPlex\n",
    "adata_sciplex.var['in_lincs'] = adata_sciplex.var.gene_id.isin(adata_lincs.var.gene_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f4c3d35-0070-40d8-a87a-dce02883cf6e",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Preprocess sciplex dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce8ecbc7-7d41-4e38-8817-f8c8d01ad29f",
   "metadata": {},
   "source": [
    "See `sciplex3.ipynb`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45442825-0f56-42bd-88f9-48fd0468010c",
   "metadata": {},
   "source": [
    "The original CPA implementation required subsetting the data due to scaling limitations.  \n",
    "In this version, we expect to be able to handle the full SciPlex dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "68ff6c9b-6e46-402c-96aa-a2da57af9c79",
   "metadata": {},
   "outputs": [],
   "source": [
    "SUBSET = False\n",
    "\n",
    "if SUBSET: \n",
    "    sc.pp.subsample(adata_sciplex, fraction=0.5, random_state=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "363f0cf5-340c-4589-bc95-de8a1e22fbbe",
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pp.normalize_per_cell(adata_sciplex)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "fb7a8ae8-e5db-4c8e-bf0d-965f7c8e4dbe",
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pp.log1p(adata_sciplex)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "aecbc3c6-3882-442c-8b84-ce163f704b84",
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pp.highly_variable_genes(adata_sciplex, n_top_genes=1032, subset=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc91a1a0-2011-4834-afb6-278206d15e71",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Combine HVG with lincs genes\n",
    "\n",
    "Union of genes that are considered highly variable and those that are shared with lincs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "761a5c25-2947-4f66-ab8c-c33a1b713444",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2000"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "((adata_sciplex.var.in_lincs) | (adata_sciplex.var.highly_variable)).sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db01d26e-e0f7-44a0-a8b8-380223049f81",
   "metadata": {},
   "source": [
    "Subset to that union of genes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "6e6328cd-cd03-4e72-85dc-b72eed2632f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata_sciplex = adata_sciplex[:, (adata_sciplex.var.in_lincs) | (adata_sciplex.var.highly_variable)].copy()"
   ]
  },
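  {
   "cell_type": "markdown",
   "id": "b7e2a9d1-gene-union-sketch",
   "metadata": {},
   "source": [
    "The gene selection is a boolean union of two per-gene flags. The sketch below illustrates this on a hypothetical four-gene table; only the column names `in_lincs` and `highly_variable` are taken from the notebook, while the gene names and flag values are made up.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical gene annotations mirroring the two flag columns of `adata_sciplex.var`.\n",
    "var = pd.DataFrame(\n",
    "    {'in_lincs': [True, False, True, False],\n",
    "     'highly_variable': [True, True, False, False]},\n",
    "    index=['g1', 'g2', 'g3', 'g4'],\n",
    ")\n",
    "\n",
    "# A gene is kept if it is shared with LINCS, highly variable, or both.\n",
    "keep = var['in_lincs'] | var['highly_variable']\n",
    "print(keep.sum())                # 3\n",
    "print(var.index[keep].tolist())  # ['g1', 'g2', 'g3']\n",
    "```\n",
    "\n",
    "Genes that carry both flags are counted once, so the size of the union is at most the sum of the two flag counts."
   ]
  },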
  {
   "cell_type": "markdown",
   "id": "b25d10e9-6e1e-4a13-a512-3580fc1295c8",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Create additional metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "985349fe-37bf-4efc-9765-d612d8d440c8",
   "metadata": {},
   "source": [
    "Normalise dose values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "62b9e529-ca45-4f04-b2d6-75dd246aa36c",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata_sciplex.obs['dose_val'] = adata_sciplex.obs.dose.astype(float) / np.max(adata_sciplex.obs.dose.astype(float))\n",
    "adata_sciplex.obs.loc[adata_sciplex.obs['product_name'].str.contains('Vehicle'), 'dose_val'] = 1.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "ed4ea831-0650-4b6b-bc09-7fbaf5f004b7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.001    153013\n",
       "0.010    147670\n",
       "0.100    141828\n",
       "1.000    139266\n",
       "Name: dose_val, dtype: int64"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata_sciplex.obs['dose_val'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4908bb2-4fd0-40d5-a6d7-24e68bcf9bb0",
   "metadata": {},
   "source": [
    "Keep only the first word of `product_name` and rename vehicle entries to `control`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "71393716-2328-41d8-a077-ef8fc435bf61",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata_sciplex.obs['product_name'] = [x.split(' ')[0] for x in adata_sciplex.obs['product_name']]\n",
    "adata_sciplex.obs.loc[adata_sciplex.obs['product_name'].str.contains('Vehicle'), 'product_name'] = 'control'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6bcd577f-500f-409e-bdc9-d1acac3dc583",
   "metadata": {},
   "source": [
    "Create a copy of `product_name` with column name `condition`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "b1cbcd93-43cb-4740-8c84-2331ccb4b066",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata_sciplex.obs['condition'] = adata_sciplex.obs.product_name.copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d148e98-c819-4455-92b4-ccc5cf2a46de",
   "metadata": {},
   "source": [
    "Add combinations of drug (`condition`), dose (`dose_val`), and cell_type (`cell_type`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "05249c77-612d-4236-af95-12cf2d2aefdf",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make `condition` categorical and rename the category `(+)-JQ1` to `JQ1`\n",
    "adata_sciplex.obs[\"condition\"] = adata_sciplex.obs[\"condition\"].astype('category').cat.rename_categories({\"(+)-JQ1\": \"JQ1\"})\n",
    "adata_sciplex.obs['drug_dose_name'] = adata_sciplex.obs.condition.astype(str) + '_' + adata_sciplex.obs.dose_val.astype(str)\n",
    "adata_sciplex.obs['cov_drug_dose_name'] = adata_sciplex.obs.cell_type.astype(str) + '_' + adata_sciplex.obs.drug_dose_name.astype(str)\n",
    "adata_sciplex.obs['cov_drug'] = adata_sciplex.obs.cell_type.astype(str) + '_' + adata_sciplex.obs.condition.astype(str)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58850330-62b6-4d2c-a533-1cf238663805",
   "metadata": {},
   "source": [
    "Add a `control` column with value `1` where only the vehicle was used"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "5b7be27d-e6b8-42d6-b21b-400ddd5b3641",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata_sciplex.obs['control'] = [1 if x == 'control_1.0' else 0 for x in adata_sciplex.obs.drug_dose_name.values]"
   ]
  },
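  {
   "cell_type": "markdown",
   "id": "c5d8f0e3-metadata-sketch",
   "metadata": {},
   "source": [
    "The metadata steps of this section can be reproduced on a small table. The sketch below is a simplified illustration: the three observations and the drug name `Drug_A` are hypothetical, and the `product_name` shortening step is skipped because the toy names contain no spaces. Column names follow the notebook.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical observations standing in for `adata_sciplex.obs`.\n",
    "obs = pd.DataFrame({\n",
    "    'cell_type': ['A549', 'A549', 'MCF7'],\n",
    "    'product_name': ['Vehicle', 'Drug_A', 'Drug_A'],\n",
    "    'dose': [0.0, 10000.0, 100.0],\n",
    "})\n",
    "\n",
    "# Scale doses by the maximum dose and pin vehicle cells to 1.0.\n",
    "obs['dose_val'] = obs['dose'].astype(float) / obs['dose'].astype(float).max()\n",
    "is_vehicle = obs['product_name'].str.contains('Vehicle')\n",
    "obs.loc[is_vehicle, 'dose_val'] = 1.0\n",
    "obs.loc[is_vehicle, 'product_name'] = 'control'\n",
    "obs['condition'] = obs['product_name']\n",
    "\n",
    "# Build the combined labels and the control flag.\n",
    "obs['drug_dose_name'] = obs['condition'] + '_' + obs['dose_val'].astype(str)\n",
    "obs['cov_drug_dose_name'] = obs['cell_type'] + '_' + obs['drug_dose_name']\n",
    "obs['control'] = (obs['drug_dose_name'] == 'control_1.0').astype(int)\n",
    "print(obs[['cov_drug_dose_name', 'control']])\n",
    "```\n",
    "\n",
    "Because vehicle cells are assigned `dose_val = 1.0`, the label `control_1.0` identifies exactly the control observations."
   ]
  },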
  {
   "cell_type": "markdown",
   "id": "c06409b8-d30e-45a1-82aa-784fa5c2f1b0",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Compute DE genes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "7753ba03-c908-4011-8365-2d871574cd56",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A549\n",
      "WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
      "  df[key] = c\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
      "  df[key] = c\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
      "  df[key] = c\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
      "  df[key] = c\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
      "  df[key] = c\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  self.stats[group_name, 'names'] = self.var_names[global_indices]\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  self.stats[group_name, 'scores'] = scores[global_indices]\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  self.stats[group_name, 'pvals'] = pvals[global_indices]\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:421: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  foldchanges[global_indices]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MCF7\n",
      "WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
      "  df[key] = c\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
      "  df[key] = c\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
      "  df[key] = c\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
      "  df[key] = c\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  self.stats[group_name, 'names'] = self.var_names[global_indices]\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  self.stats[group_name, 'scores'] = scores[global_indices]\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  self.stats[group_name, 'pvals'] = pvals[global_indices]\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:421: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  foldchanges[global_indices]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "K562\n",
      "WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
      "  df[key] = c\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
      "  df[key] = c\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
      "  df[key] = c\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
      "  df[key] = c\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  self.stats[group_name, 'names'] = self.var_names[global_indices]\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  self.stats[group_name, 'scores'] = scores[global_indices]\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  self.stats[group_name, 'pvals'] = pvals[global_indices]\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:421: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  foldchanges[global_indices]\n"
     ]
    }
   ],
   "source": [
    "from chemCPA.helper import rank_genes_groups_by_cov\n",
    "\n",
    "rank_genes_groups_by_cov(adata_sciplex, groupby='cov_drug', covariate='cell_type', control_group='control', key_added='all_DEGs')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "4d9f098d-a04b-4407-a3a2-5041cfb480ec",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A549\n",
      "WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
      "  df[key] = c\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
      "  df[key] = c\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
      "  df[key] = c\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
      "  df[key] = c\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  self.stats[group_name, 'names'] = self.var_names[global_indices]\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  self.stats[group_name, 'scores'] = scores[global_indices]\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  self.stats[group_name, 'pvals'] = pvals[global_indices]\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]\n",
      "/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:421: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`\n",
      "  foldchanges[global_indices]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MCF7\n",
      "WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "K562\n",
      "WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'\n"
     ]
    }
   ],
   "source": [
    "adata_subset = adata_sciplex[:, adata_sciplex.var.in_lincs].copy()\n",
    "rank_genes_groups_by_cov(adata_subset, groupby='cov_drug', covariate='cell_type', control_group='control', key_added='lincs_DEGs')\n",
    "adata_sciplex.uns['lincs_DEGs'] = adata_subset.uns['lincs_DEGs']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "428882cd-4e02-4af0-b281-51210aafbf79",
   "metadata": {},
   "source": [
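    "### Map each unique `cov_drug_dose_name` to its computed DEGs, independent of the dose value\n",
    "\n",
    "The DEGs above were computed per `cov_drug`, i.e. without the dose. To look them up per `cov_drug_dose_name`, first create a mapping from each name with dose to the corresponding name without dose."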
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "238338ab-4950-4c29-94c5-2d2c4b5738ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "cov_drug_dose_unique = adata_sciplex.obs.cov_drug_dose_name.unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "08aea617-1179-43e1-8391-d38eeee3b748",
   "metadata": {},
   "outputs": [],
   "source": [
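    "# Strip the trailing '_<dose>' token: '<cov_drug>_<dose>' -> '<cov_drug>'\n",
    "remove_dose = lambda s: '_'.join(s.split('_')[:-1])\n",
    "cov_drug = pd.Series(cov_drug_dose_unique).apply(remove_dose)\n",
    "# Map each name with dose to its dose-free counterpart\n",
    "dose_no_dose_dict = dict(zip(cov_drug_dose_unique, cov_drug))"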
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c78a6e80-5012-442f-b7bf-a6c581da92dd",
   "metadata": {},
   "source": [
    "### Re-key the DEG dicts by `cov_drug_dose_name`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "b6594da7-64c0-4212-8cca-58d247b2cc5f",
   "metadata": {},
   "outputs": [],
   "source": [
    "uns_keys = ['all_DEGs', 'lincs_DEGs']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "d5c73b35-31d3-4814-a5c0-658b23f1d0a1",
   "metadata": {},
   "outputs": [],
   "source": [
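    "for uns_key in uns_keys:\n",
    "    new_DEGs_dict = {}\n",
    "\n",
    "    # DEGs as computed above, keyed by `cov_drug` (no dose)\n",
    "    df_DEGs = pd.Series(adata_sciplex.uns[uns_key])\n",
    "\n",
    "    for key, value in dose_no_dose_dict.items():\n",
    "        # Skip control conditions, which serve as the reference group\n",
    "        if 'control' in key:\n",
    "            continue\n",
    "        # Every dose of a drug shares the DEGs of its dose-free `cov_drug`\n",
    "        new_DEGs_dict[key] = df_DEGs.loc[value]\n",
    "    adata_sciplex.uns[uns_key] = new_DEGs_dict"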
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "f713118a-514c-4cdb-b887-7118508ee37c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AnnData object with n_obs × n_vars = 581777 × 2000\n",
       "    obs: 'cell_type', 'dose', 'dose_character', 'dose_pattern', 'g1s_score', 'g2m_score', 'pathway', 'pathway_level_1', 'pathway_level_2', 'product_dose', 'product_name', 'proliferation_index', 'replicate', 'size_factor', 'target', 'vehicle', 'batch', 'n_counts', 'dose_val', 'condition', 'drug_dose_name', 'cov_drug_dose_name', 'cov_drug', 'control'\n",
       "    var: 'id', 'num_cells_expressed-0-0', 'num_cells_expressed-1-0', 'num_cells_expressed-1', 'gene_id', 'in_lincs', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'\n",
       "    uns: 'log1p', 'hvg', 'all_DEGs', 'lincs_DEGs'"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata_sciplex"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d23ef784-5747-46de-a9bd-d5d869ff8042",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Create sciplex splits\n",
    "\n",
    "This is not the right configuration for the experiments we want, but it is sufficient for now."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6acf9d6e-d1af-4544-8021-8b9f4185d938",
   "metadata": {
    "tags": []
   },
   "source": [
    "### OOD in Pathways"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "4b70dba8-132a-41fa-8958-78812063b738",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DNA damage & DNA repair                  6640\n",
       "Epigenetic regulation                    6093\n",
       "Tyrosine kinase signaling                5846\n",
       "Protein folding & Protein degradation    3863\n",
       "Neuronal signaling                       3635\n",
       "Antioxidant                              3616\n",
       "HIF signaling                            3501\n",
       "Metabolic regulation                     3470\n",
       "Focal adhesion signaling                 3450\n",
       "Nuclear receptor signaling               3420\n",
       "JAK/STAT signaling                       3155\n",
       "Apoptotic regulation                     3141\n",
       "TGF/BMP signaling                        2794\n",
       "PKC signaling                            2778\n",
       "Cell cycle regulation                    2237\n",
       "Other                                       0\n",
       "Vehicle                                     0\n",
       "Name: pathway_level_1, dtype: int64"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata_sciplex.obs['split_ho_pathway'] = 'train'  # reset\n",
    "\n",
    "ho_drugs = [\n",
    "    # selection of drugs from various pathways\n",
    "    \"Azacitidine\",\n",
    "    \"Carmofur\",\n",
    "    \"Pracinostat\",\n",
    "    \"Cediranib\",\n",
    "    \"Luminespib\",\n",
    "    \"Crizotinib\",\n",
    "    \"SNS-314\",\n",
    "    \"Obatoclax\",\n",
    "    \"Momelotinib\",\n",
    "    \"AG-14361\",\n",
    "    \"Entacapone\",\n",
    "    \"Fulvestrant\",\n",
    "    \"Mesna\",\n",
    "    \"Zileuton\",\n",
    "    \"Enzastaurin\",\n",
    "    \"IOX2\",\n",
    "    \"Alvespimycin\",\n",
    "    \"XAV-939\",\n",
    "    \"Fasudil\",\n",
    "]\n",
    "\n",
    "ho_drug_pathway = adata_sciplex.obs['condition'].isin(ho_drugs)\n",
    "adata_sciplex.obs.loc[ho_drug_pathway, 'pathway_level_1'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "65e41d95-3d6a-400b-b3d6-142161773d4d",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "57639"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ho_drug_pathway.sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "3ce7605f-8fef-4c7d-9b62-be3879bd2991",
   "metadata": {},
   "outputs": [],
   "source": [
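    "# Mark cells treated with a held-out drug at dose_val == 1.0 as out-of-distribution\n",
    "adata_sciplex.obs.loc[ho_drug_pathway & (adata_sciplex.obs['dose_val'] == 1.0), 'split_ho_pathway'] = 'ood'\n",
    "\n",
    "# Randomly assign 15% of the remaining (non-OOD) cells to the test split\n",
    "test_idx = sc.pp.subsample(adata_sciplex[adata_sciplex.obs['split_ho_pathway'] != 'ood'], .15, copy=True).obs.index\n",
    "adata_sciplex.obs.loc[test_idx, 'split_ho_pathway'] = 'test'"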
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "89cf167d-67bc-4603-b9f8-a73dd9980280",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>condition</th>\n",
       "      <th>AG-14361</th>\n",
       "      <th>Alvespimycin</th>\n",
       "      <th>Azacitidine</th>\n",
       "      <th>Carmofur</th>\n",
       "      <th>Cediranib</th>\n",
       "      <th>Crizotinib</th>\n",
       "      <th>Entacapone</th>\n",
       "      <th>Enzastaurin</th>\n",
       "      <th>Fasudil</th>\n",
       "      <th>Fulvestrant</th>\n",
       "      <th>IOX2</th>\n",
       "      <th>Luminespib</th>\n",
       "      <th>Mesna</th>\n",
       "      <th>Momelotinib</th>\n",
       "      <th>Obatoclax</th>\n",
       "      <th>Pracinostat</th>\n",
       "      <th>SNS-314</th>\n",
       "      <th>XAV-939</th>\n",
       "      <th>Zileuton</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pathway_level_1</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Antioxidant</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3616</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Apoptotic regulation</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3141</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cell cycle regulation</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2237</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>DNA damage &amp; DNA repair</th>\n",
       "      <td>3401</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3239</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Epigenetic regulation</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3151</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2942</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Focal adhesion signaling</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3450</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>HIF signaling</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3501</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>JAK/STAT signaling</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3155</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Metabolic regulation</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3470</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Neuronal signaling</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3635</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Nuclear receptor signaling</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3420</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PKC signaling</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2778</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Protein folding &amp; Protein degradation</th>\n",
       "      <td>0</td>\n",
       "      <td>1858</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2005</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TGF/BMP signaling</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2794</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Tyrosine kinase signaling</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3060</td>\n",
       "      <td>2786</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "condition                              AG-14361  Alvespimycin  Azacitidine  \\\n",
       "pathway_level_1                                                              \n",
       "Antioxidant                                   0             0            0   \n",
       "Apoptotic regulation                          0             0            0   \n",
       "Cell cycle regulation                         0             0            0   \n",
       "DNA damage & DNA repair                    3401             0            0   \n",
       "Epigenetic regulation                         0             0         3151   \n",
       "Focal adhesion signaling                      0             0            0   \n",
       "HIF signaling                                 0             0            0   \n",
       "JAK/STAT signaling                            0             0            0   \n",
       "Metabolic regulation                          0             0            0   \n",
       "Neuronal signaling                            0             0            0   \n",
       "Nuclear receptor signaling                    0             0            0   \n",
       "PKC signaling                                 0             0            0   \n",
       "Protein folding & Protein degradation         0          1858            0   \n",
       "TGF/BMP signaling                             0             0            0   \n",
       "Tyrosine kinase signaling                     0             0            0   \n",
       "\n",
       "condition                              Carmofur  Cediranib  Crizotinib  \\\n",
       "pathway_level_1                                                          \n",
       "Antioxidant                                   0          0           0   \n",
       "Apoptotic regulation                          0          0           0   \n",
       "Cell cycle regulation                         0          0           0   \n",
       "DNA damage & DNA repair                    3239          0           0   \n",
       "Epigenetic regulation                         0          0           0   \n",
       "Focal adhesion signaling                      0          0           0   \n",
       "HIF signaling                                 0          0           0   \n",
       "JAK/STAT signaling                            0          0           0   \n",
       "Metabolic regulation                          0          0           0   \n",
       "Neuronal signaling                            0          0           0   \n",
       "Nuclear receptor signaling                    0          0           0   \n",
       "PKC signaling                                 0          0           0   \n",
       "Protein folding & Protein degradation         0          0           0   \n",
       "TGF/BMP signaling                             0          0           0   \n",
       "Tyrosine kinase signaling                     0       3060        2786   \n",
       "\n",
       "condition                              Entacapone  Enzastaurin  Fasudil  \\\n",
       "pathway_level_1                                                           \n",
       "Antioxidant                                     0            0        0   \n",
       "Apoptotic regulation                            0            0        0   \n",
       "Cell cycle regulation                           0            0        0   \n",
       "DNA damage & DNA repair                         0            0        0   \n",
       "Epigenetic regulation                           0            0        0   \n",
       "Focal adhesion signaling                        0            0     3450   \n",
       "HIF signaling                                   0            0        0   \n",
       "JAK/STAT signaling                              0            0        0   \n",
       "Metabolic regulation                            0            0        0   \n",
       "Neuronal signaling                           3635            0        0   \n",
       "Nuclear receptor signaling                      0            0        0   \n",
       "PKC signaling                                   0         2778        0   \n",
       "Protein folding & Protein degradation           0            0        0   \n",
       "TGF/BMP signaling                               0            0        0   \n",
       "Tyrosine kinase signaling                       0            0        0   \n",
       "\n",
       "condition                              Fulvestrant  IOX2  Luminespib  Mesna  \\\n",
       "pathway_level_1                                                               \n",
       "Antioxidant                                      0     0           0   3616   \n",
       "Apoptotic regulation                             0     0           0      0   \n",
       "Cell cycle regulation                            0     0           0      0   \n",
       "DNA damage & DNA repair                          0     0           0      0   \n",
       "Epigenetic regulation                            0     0           0      0   \n",
       "Focal adhesion signaling                         0     0           0      0   \n",
       "HIF signaling                                    0  3501           0      0   \n",
       "JAK/STAT signaling                               0     0           0      0   \n",
       "Metabolic regulation                             0     0           0      0   \n",
       "Neuronal signaling                               0     0           0      0   \n",
       "Nuclear receptor signaling                    3420     0           0      0   \n",
       "PKC signaling                                    0     0           0      0   \n",
       "Protein folding & Protein degradation            0     0        2005      0   \n",
       "TGF/BMP signaling                                0     0           0      0   \n",
       "Tyrosine kinase signaling                        0     0           0      0   \n",
       "\n",
       "condition                              Momelotinib  Obatoclax  Pracinostat  \\\n",
       "pathway_level_1                                                              \n",
       "Antioxidant                                      0          0            0   \n",
       "Apoptotic regulation                             0       3141            0   \n",
       "Cell cycle regulation                            0          0            0   \n",
       "DNA damage & DNA repair                          0          0            0   \n",
       "Epigenetic regulation                            0          0         2942   \n",
       "Focal adhesion signaling                         0          0            0   \n",
       "HIF signaling                                    0          0            0   \n",
       "JAK/STAT signaling                            3155          0            0   \n",
       "Metabolic regulation                             0          0            0   \n",
       "Neuronal signaling                               0          0            0   \n",
       "Nuclear receptor signaling                       0          0            0   \n",
       "PKC signaling                                    0          0            0   \n",
       "Protein folding & Protein degradation            0          0            0   \n",
       "TGF/BMP signaling                                0          0            0   \n",
       "Tyrosine kinase signaling                        0          0            0   \n",
       "\n",
       "condition                              SNS-314  XAV-939  Zileuton  \n",
       "pathway_level_1                                                    \n",
       "Antioxidant                                  0        0         0  \n",
       "Apoptotic regulation                         0        0         0  \n",
       "Cell cycle regulation                     2237        0         0  \n",
       "DNA damage & DNA repair                      0        0         0  \n",
       "Epigenetic regulation                        0        0         0  \n",
       "Focal adhesion signaling                     0        0         0  \n",
       "HIF signaling                                0        0         0  \n",
       "JAK/STAT signaling                           0        0         0  \n",
       "Metabolic regulation                         0        0      3470  \n",
       "Neuronal signaling                           0        0         0  \n",
       "Nuclear receptor signaling                   0        0         0  \n",
       "PKC signaling                                0        0         0  \n",
       "Protein folding & Protein degradation        0        0         0  \n",
       "TGF/BMP signaling                            0     2794         0  \n",
       "Tyrosine kinase signaling                    0        0         0  "
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.crosstab(adata_sciplex.obs.pathway_level_1, adata_sciplex.obs['condition'][adata_sciplex.obs.condition.isin(ho_drugs)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "3325a1e0-dd95-4e53-a773-99df4b463767",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "train    483951\n",
       "test      85403\n",
       "ood       12423\n",
       "Name: split_ho_pathway, dtype: int64"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata_sciplex.obs['split_ho_pathway'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "dfecfaa3-55c2-4d7d-872b-0e0208eac6a6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Fasudil         966\n",
       "IOX2            913\n",
       "Mesna           884\n",
       "Entacapone      868\n",
       "Fulvestrant     836\n",
       "Zileuton        822\n",
       "Carmofur        767\n",
       "AG-14361        759\n",
       "Azacitidine     736\n",
       "Enzastaurin     694\n",
       "Pracinostat     658\n",
       "SNS-314         547\n",
       "Cediranib       528\n",
       "Momelotinib     487\n",
       "XAV-939         479\n",
       "Crizotinib      464\n",
       "Luminespib      405\n",
       "Obatoclax       404\n",
       "Alvespimycin    206\n",
       "Name: condition, dtype: int64"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata_sciplex[adata_sciplex.obs.split_ho_pathway == 'ood'].obs.condition.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "a591e3d0-c1dd-4723-879f-76b37b16b962",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "control         1964\n",
       "ENMD-2076        914\n",
       "RG108            604\n",
       "GSK-LSD1         596\n",
       "Altretamine      573\n",
       "                ... \n",
       "Luminespib       236\n",
       "Patupilone       228\n",
       "Flavopiridol     207\n",
       "Epothilone       181\n",
       "YM155            112\n",
       "Name: condition, Length: 188, dtype: int64"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata_sciplex[adata_sciplex.obs.split_ho_pathway == 'test'].obs.condition.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff1a2cd4-2a68-4f67-8e13-46fe8fe06c42",
   "metadata": {
    "tags": []
   },
   "source": [
    "### OOD drugs in epigenetic regulation, tyrosine kinase signaling, and cell cycle regulation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "244d46ca-9ff8-4e4c-b225-c1a26c84b8da",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Epigenetic regulation                    147875\n",
       "Tyrosine kinase signaling                 85503\n",
       "JAK/STAT signaling                        70922\n",
       "DNA damage & DNA repair                   60042\n",
       "Cell cycle regulation                     53952\n",
       "Other                                     19980\n",
       "Nuclear receptor signaling                19940\n",
       "Protein folding & Protein degradation     19191\n",
       "Metabolic regulation                      17989\n",
       "Neuronal signaling                        14071\n",
       "Antioxidant                               13414\n",
       "Apoptotic regulation                      13141\n",
       "Vehicle                                   13004\n",
       "HIF signaling                              9279\n",
       "PKC signaling                              8804\n",
       "TGF/BMP signaling                          8774\n",
       "Focal adhesion signaling                   5896\n",
       "Name: pathway_level_1, dtype: int64"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata_sciplex.obs['pathway_level_1'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "448e9822-947a-49ed-b80b-1485c60a218b",
   "metadata": {
    "tags": []
   },
   "source": [
    "___\n",
    "\n",
    "#### Tyrosine kinase signaling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "aac8694c-1c90-40b5-870e-ff1e41fb8527",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PD98059         3763\n",
       "AG-490          3533\n",
       "Motesanib       3363\n",
       "TGX-221         3358\n",
       "Ki8751          3347\n",
       "                ... \n",
       "Fedratinib         0\n",
       "Filgotinib         0\n",
       "Flavopiridol       0\n",
       "Fluorouracil       0\n",
       "control            0\n",
       "Name: condition, Length: 188, dtype: int64"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata_sciplex.obs.loc[adata_sciplex.obs.pathway_level_1.isin([\"Tyrosine kinase signaling\"]),'condition'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "cbce2aca-0b65-456c-bf26-f01a982b2e99",
   "metadata": {},
   "outputs": [],
   "source": [
    "tyrosine_drugs = adata_sciplex.obs.loc[adata_sciplex.obs.pathway_level_1.isin([\"Tyrosine kinase signaling\"]),'condition'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "a7f03a94-0ee5-4e84-9367-020a0b20988e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Default: every cell starts in the training split.\n",
    "adata_sciplex.obs['split_tyrosine_ood'] = 'train'\n",
    "\n",
    "# Hold out a random 20% of the tyrosine kinase signaling cells as the test split.\n",
    "test_idx = sc.pp.subsample(adata_sciplex[adata_sciplex.obs.pathway_level_1.isin([\"Tyrosine kinase signaling\"])], fraction=0.2, copy=True).obs.index\n",
    "adata_sciplex.obs.loc[test_idx, 'split_tyrosine_ood'] = 'test'\n",
    "\n",
    "# Label the five held-out drugs 'ood' last, so this overrides any 'test' labels set above.\n",
    "adata_sciplex.obs.loc[adata_sciplex.obs.condition.isin([\"Cediranib\", \"Crizotinib\", \"Motesanib\", \"BMS-754807\", \"Nintedanib\"]), 'split_tyrosine_ood'] = 'ood'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "a6386e14-7463-4f08-8ea0-d6991b9e3af1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "train    552761\n",
       "ood       14880\n",
       "test      14136\n",
       "Name: split_tyrosine_ood, dtype: int64"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata_sciplex.obs.split_tyrosine_ood.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "8cc683c9-5fdb-47a1-b057-e357b93442a9",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>condition</th>\n",
       "      <th>AC480</th>\n",
       "      <th>AG-490</th>\n",
       "      <th>BMS-536924</th>\n",
       "      <th>BMS-754807</th>\n",
       "      <th>Bosutinib</th>\n",
       "      <th>Cediranib</th>\n",
       "      <th>Crizotinib</th>\n",
       "      <th>Dasatinib</th>\n",
       "      <th>Glesatinib?(MGCD265)</th>\n",
       "      <th>KW-2449</th>\n",
       "      <th>Ki8751</th>\n",
       "      <th>Lapatinib</th>\n",
       "      <th>Linifanib</th>\n",
       "      <th>Motesanib</th>\n",
       "      <th>Nilotinib</th>\n",
       "      <th>Nintedanib</th>\n",
       "      <th>PD173074</th>\n",
       "      <th>PD98059</th>\n",
       "      <th>Pelitinib</th>\n",
       "      <th>Regorafenib</th>\n",
       "      <th>Rigosertib</th>\n",
       "      <th>SL-327</th>\n",
       "      <th>Sorafenib</th>\n",
       "      <th>TAK-901</th>\n",
       "      <th>TGX-221</th>\n",
       "      <th>Temsirolimus</th>\n",
       "      <th>Tie2</th>\n",
       "      <th>Trametinib</th>\n",
       "      <th>Vandetanib</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>split_tyrosine_ood</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ood</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2676</td>\n",
       "      <td>0</td>\n",
       "      <td>3060</td>\n",
       "      <td>2786</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3363</td>\n",
       "      <td>0</td>\n",
       "      <td>2995</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>test</th>\n",
       "      <td>645</td>\n",
       "      <td>728</td>\n",
       "      <td>582</td>\n",
       "      <td>0</td>\n",
       "      <td>491</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>491</td>\n",
       "      <td>656</td>\n",
       "      <td>580</td>\n",
       "      <td>641</td>\n",
       "      <td>603</td>\n",
       "      <td>678</td>\n",
       "      <td>0</td>\n",
       "      <td>639</td>\n",
       "      <td>0</td>\n",
       "      <td>702</td>\n",
       "      <td>723</td>\n",
       "      <td>620</td>\n",
       "      <td>502</td>\n",
       "      <td>377</td>\n",
       "      <td>678</td>\n",
       "      <td>658</td>\n",
       "      <td>419</td>\n",
       "      <td>620</td>\n",
       "      <td>453</td>\n",
       "      <td>647</td>\n",
       "      <td>443</td>\n",
       "      <td>560</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>train</th>\n",
       "      <td>2597</td>\n",
       "      <td>2805</td>\n",
       "      <td>2318</td>\n",
       "      <td>0</td>\n",
       "      <td>1945</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2047</td>\n",
       "      <td>2527</td>\n",
       "      <td>2452</td>\n",
       "      <td>2706</td>\n",
       "      <td>2435</td>\n",
       "      <td>2487</td>\n",
       "      <td>0</td>\n",
       "      <td>2448</td>\n",
       "      <td>0</td>\n",
       "      <td>2588</td>\n",
       "      <td>3040</td>\n",
       "      <td>2306</td>\n",
       "      <td>2182</td>\n",
       "      <td>1562</td>\n",
       "      <td>2521</td>\n",
       "      <td>2413</td>\n",
       "      <td>1649</td>\n",
       "      <td>2738</td>\n",
       "      <td>1780</td>\n",
       "      <td>2616</td>\n",
       "      <td>2031</td>\n",
       "      <td>2294</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "condition           AC480  AG-490  BMS-536924  BMS-754807  Bosutinib  \\\n",
       "split_tyrosine_ood                                                     \n",
       "ood                     0       0           0        2676          0   \n",
       "test                  645     728         582           0        491   \n",
       "train                2597    2805        2318           0       1945   \n",
       "\n",
       "condition           Cediranib  Crizotinib  Dasatinib  Glesatinib?(MGCD265)  \\\n",
       "split_tyrosine_ood                                                           \n",
       "ood                      3060        2786          0                     0   \n",
       "test                        0           0        491                   656   \n",
       "train                       0           0       2047                  2527   \n",
       "\n",
       "condition           KW-2449  Ki8751  Lapatinib  Linifanib  Motesanib  \\\n",
       "split_tyrosine_ood                                                     \n",
       "ood                       0       0          0          0       3363   \n",
       "test                    580     641        603        678          0   \n",
       "train                  2452    2706       2435       2487          0   \n",
       "\n",
       "condition           Nilotinib  Nintedanib  PD173074  PD98059  Pelitinib  \\\n",
       "split_tyrosine_ood                                                        \n",
       "ood                         0        2995         0        0          0   \n",
       "test                      639           0       702      723        620   \n",
       "train                    2448           0      2588     3040       2306   \n",
       "\n",
       "condition           Regorafenib  Rigosertib  SL-327  Sorafenib  TAK-901  \\\n",
       "split_tyrosine_ood                                                        \n",
       "ood                           0           0       0          0        0   \n",
       "test                        502         377     678        658      419   \n",
       "train                      2182        1562    2521       2413     1649   \n",
       "\n",
       "condition           TGX-221  Temsirolimus  Tie2  Trametinib  Vandetanib  \n",
       "split_tyrosine_ood                                                       \n",
       "ood                       0             0     0           0           0  \n",
       "test                    620           453   647         443         560  \n",
       "train                  2738          1780  2616        2031        2294  "
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.crosstab(adata_sciplex.obs.split_tyrosine_ood, adata_sciplex.obs['condition'][adata_sciplex.obs.condition.isin(tyrosine_drugs)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "2fa637e1-a444-4235-8c5d-0c8b5acc1b9a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>dose_val</th>\n",
       "      <th>0.001</th>\n",
       "      <th>0.010</th>\n",
       "      <th>0.100</th>\n",
       "      <th>1.000</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>split_tyrosine_ood</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ood</th>\n",
       "      <td>4226</td>\n",
       "      <td>4118</td>\n",
       "      <td>3822</td>\n",
       "      <td>2714</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>test</th>\n",
       "      <td>3928</td>\n",
       "      <td>3930</td>\n",
       "      <td>3590</td>\n",
       "      <td>2688</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>train</th>\n",
       "      <td>144859</td>\n",
       "      <td>139622</td>\n",
       "      <td>134416</td>\n",
       "      <td>133864</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "dose_val             0.001   0.010   0.100   1.000\n",
       "split_tyrosine_ood                                \n",
       "ood                   4226    4118    3822    2714\n",
       "test                  3928    3930    3590    2688\n",
       "train               144859  139622  134416  133864"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.crosstab(adata_sciplex.obs.split_tyrosine_ood, adata_sciplex.obs.dose_val)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c16410d8-57f6-4958-8aec-db63f5acbfd2",
   "metadata": {
    "tags": []
   },
   "source": [
    "___\n",
    "\n",
    "#### Epigenetic regulation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "226d2855-8739-4bf4-bab0-eaac30ffe7b7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RG108           3715\n",
       "Tubastatin      3710\n",
       "GSK-LSD1        3688\n",
       "SRT2104         3687\n",
       "Tacedinaline    3664\n",
       "                ... \n",
       "Fulvestrant        0\n",
       "G007-LK            0\n",
       "GSK1070916         0\n",
       "Gandotinib         0\n",
       "control            0\n",
       "Name: condition, Length: 188, dtype: int64"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata_sciplex.obs.loc[adata_sciplex.obs.pathway_level_1.isin([\"Epigenetic regulation\"]),'condition'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "bf8532c2-e843-4d6e-87bb-a12dd3333d27",
   "metadata": {},
   "outputs": [],
   "source": [
    "epigenetic_drugs = adata_sciplex.obs.loc[adata_sciplex.obs.pathway_level_1.isin([\"Epigenetic regulation\"]),'condition'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "a3548623-3991-49fe-add3-aed28f6a3ee5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Default: every cell starts in the training split.\n",
    "adata_sciplex.obs['split_epigenetic_ood'] = 'train'\n",
    "\n",
    "# Hold out a random 20% of the epigenetic regulation cells as the test split.\n",
    "test_idx = sc.pp.subsample(adata_sciplex[adata_sciplex.obs.pathway_level_1.isin([\"Epigenetic regulation\"])], fraction=0.2, copy=True).obs.index\n",
    "adata_sciplex.obs.loc[test_idx, 'split_epigenetic_ood'] = 'test'\n",
    "\n",
    "# Label the five held-out drugs 'ood' last, so this overrides any 'test' labels set above.\n",
    "adata_sciplex.obs.loc[adata_sciplex.obs.condition.isin([\"Azacitidine\", \"Pracinostat\", \"Trichostatin\", \"Quisinostat\", \"Tazemetostat\"]), 'split_epigenetic_ood'] = 'ood'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "fed7945c-8e2b-44a8-860e-38fea4fac1b4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "train    540070\n",
       "test      26538\n",
       "ood       15169\n",
       "Name: split_epigenetic_ood, dtype: int64"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata_sciplex.obs.split_epigenetic_ood.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "0474d068-1bb1-4a9a-b8f3-f6661054f30b",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>condition</th>\n",
       "      <th>JQ1</th>\n",
       "      <th>A-366</th>\n",
       "      <th>AR-42</th>\n",
       "      <th>Abexinostat</th>\n",
       "      <th>Anacardic</th>\n",
       "      <th>Azacitidine</th>\n",
       "      <th>BRD4770</th>\n",
       "      <th>Belinostat</th>\n",
       "      <th>CUDC-101</th>\n",
       "      <th>CUDC-907</th>\n",
       "      <th>Dacinostat</th>\n",
       "      <th>Decitabine</th>\n",
       "      <th>Divalproex</th>\n",
       "      <th>Droxinostat</th>\n",
       "      <th>EED226</th>\n",
       "      <th>Entinostat</th>\n",
       "      <th>GSK</th>\n",
       "      <th>GSK-LSD1</th>\n",
       "      <th>Givinostat</th>\n",
       "      <th>ITSA-1</th>\n",
       "      <th>M344</th>\n",
       "      <th>MC1568</th>\n",
       "      <th>Mocetinostat</th>\n",
       "      <th>PCI-34051</th>\n",
       "      <th>PFI-1</th>\n",
       "      <th>Panobinostat</th>\n",
       "      <th>Pracinostat</th>\n",
       "      <th>Quisinostat</th>\n",
       "      <th>RG108</th>\n",
       "      <th>Resminostat</th>\n",
       "      <th>Resveratrol</th>\n",
       "      <th>SRT1720</th>\n",
       "      <th>SRT2104</th>\n",
       "      <th>SRT3025</th>\n",
       "      <th>Selisistat</th>\n",
       "      <th>Sirtinol</th>\n",
       "      <th>Sodium</th>\n",
       "      <th>TMP195</th>\n",
       "      <th>Tacedinaline</th>\n",
       "      <th>Tazemetostat</th>\n",
       "      <th>Trichostatin</th>\n",
       "      <th>Tubastatin</th>\n",
       "      <th>Tucidinostat</th>\n",
       "      <th>UNC0379</th>\n",
       "      <th>UNC0631</th>\n",
       "      <th>UNC1999</th>\n",
       "      <th>Valproic</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>split_epigenetic_ood</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ood</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3151</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2942</td>\n",
       "      <td>2354</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3639</td>\n",
       "      <td>3083</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>test</th>\n",
       "      <td>625</td>\n",
       "      <td>645</td>\n",
       "      <td>623</td>\n",
       "      <td>582</td>\n",
       "      <td>728</td>\n",
       "      <td>0</td>\n",
       "      <td>743</td>\n",
       "      <td>581</td>\n",
       "      <td>661</td>\n",
       "      <td>519</td>\n",
       "      <td>518</td>\n",
       "      <td>491</td>\n",
       "      <td>647</td>\n",
       "      <td>652</td>\n",
       "      <td>645</td>\n",
       "      <td>716</td>\n",
       "      <td>690</td>\n",
       "      <td>686</td>\n",
       "      <td>631</td>\n",
       "      <td>544</td>\n",
       "      <td>611</td>\n",
       "      <td>655</td>\n",
       "      <td>385</td>\n",
       "      <td>591</td>\n",
       "      <td>618</td>\n",
       "      <td>517</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>701</td>\n",
       "      <td>649</td>\n",
       "      <td>655</td>\n",
       "      <td>583</td>\n",
       "      <td>779</td>\n",
       "      <td>605</td>\n",
       "      <td>690</td>\n",
       "      <td>669</td>\n",
       "      <td>710</td>\n",
       "      <td>511</td>\n",
       "      <td>747</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>718</td>\n",
       "      <td>453</td>\n",
       "      <td>686</td>\n",
       "      <td>664</td>\n",
       "      <td>686</td>\n",
       "      <td>728</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>train</th>\n",
       "      <td>2412</td>\n",
       "      <td>2751</td>\n",
       "      <td>2278</td>\n",
       "      <td>2331</td>\n",
       "      <td>2876</td>\n",
       "      <td>0</td>\n",
       "      <td>2886</td>\n",
       "      <td>2444</td>\n",
       "      <td>2548</td>\n",
       "      <td>1898</td>\n",
       "      <td>1998</td>\n",
       "      <td>1866</td>\n",
       "      <td>2581</td>\n",
       "      <td>2545</td>\n",
       "      <td>2624</td>\n",
       "      <td>2669</td>\n",
       "      <td>2911</td>\n",
       "      <td>3002</td>\n",
       "      <td>2474</td>\n",
       "      <td>2282</td>\n",
       "      <td>2543</td>\n",
       "      <td>2761</td>\n",
       "      <td>1593</td>\n",
       "      <td>2350</td>\n",
       "      <td>2589</td>\n",
       "      <td>2056</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3014</td>\n",
       "      <td>2670</td>\n",
       "      <td>2317</td>\n",
       "      <td>2487</td>\n",
       "      <td>2908</td>\n",
       "      <td>2405</td>\n",
       "      <td>2684</td>\n",
       "      <td>2872</td>\n",
       "      <td>2787</td>\n",
       "      <td>2067</td>\n",
       "      <td>2917</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2992</td>\n",
       "      <td>1800</td>\n",
       "      <td>2595</td>\n",
       "      <td>2890</td>\n",
       "      <td>2683</td>\n",
       "      <td>2812</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "condition              JQ1  A-366  AR-42  Abexinostat  Anacardic  Azacitidine  \\\n",
       "split_epigenetic_ood                                                            \n",
       "ood                      0      0      0            0          0         3151   \n",
       "test                   625    645    623          582        728            0   \n",
       "train                 2412   2751   2278         2331       2876            0   \n",
       "\n",
       "condition             BRD4770  Belinostat  CUDC-101  CUDC-907  Dacinostat  \\\n",
       "split_epigenetic_ood                                                        \n",
       "ood                         0           0         0         0           0   \n",
       "test                      743         581       661       519         518   \n",
       "train                    2886        2444      2548      1898        1998   \n",
       "\n",
       "condition             Decitabine  Divalproex  Droxinostat  EED226  Entinostat  \\\n",
       "split_epigenetic_ood                                                            \n",
       "ood                            0           0            0       0           0   \n",
       "test                         491         647          652     645         716   \n",
       "train                       1866        2581         2545    2624        2669   \n",
       "\n",
       "condition              GSK  GSK-LSD1  Givinostat  ITSA-1  M344  MC1568  \\\n",
       "split_epigenetic_ood                                                     \n",
       "ood                      0         0           0       0     0       0   \n",
       "test                   690       686         631     544   611     655   \n",
       "train                 2911      3002        2474    2282  2543    2761   \n",
       "\n",
       "condition             Mocetinostat  PCI-34051  PFI-1  Panobinostat  \\\n",
       "split_epigenetic_ood                                                 \n",
       "ood                              0          0      0             0   \n",
       "test                           385        591    618           517   \n",
       "train                         1593       2350   2589          2056   \n",
       "\n",
       "condition             Pracinostat  Quisinostat  RG108  Resminostat  \\\n",
       "split_epigenetic_ood                                                 \n",
       "ood                          2942         2354      0            0   \n",
       "test                            0            0    701          649   \n",
       "train                           0            0   3014         2670   \n",
       "\n",
       "condition             Resveratrol  SRT1720  SRT2104  SRT3025  Selisistat  \\\n",
       "split_epigenetic_ood                                                       \n",
       "ood                             0        0        0        0           0   \n",
       "test                          655      583      779      605         690   \n",
       "train                        2317     2487     2908     2405        2684   \n",
       "\n",
       "condition             Sirtinol  Sodium  TMP195  Tacedinaline  Tazemetostat  \\\n",
       "split_epigenetic_ood                                                         \n",
       "ood                          0       0       0             0          3639   \n",
       "test                       669     710     511           747             0   \n",
       "train                     2872    2787    2067          2917             0   \n",
       "\n",
       "condition             Trichostatin  Tubastatin  Tucidinostat  UNC0379  \\\n",
       "split_epigenetic_ood                                                    \n",
       "ood                           3083           0             0        0   \n",
       "test                             0         718           453      686   \n",
       "train                            0        2992          1800     2595   \n",
       "\n",
       "condition             UNC0631  UNC1999  Valproic  \n",
       "split_epigenetic_ood                              \n",
       "ood                         0        0         0  \n",
       "test                      664      686       728  \n",
       "train                    2890     2683      2812  "
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.crosstab(adata_sciplex.obs.split_epigenetic_ood, adata_sciplex.obs['condition'][adata_sciplex.obs.condition.isin(epigenetic_drugs)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "7fbc8c54-82e7-40bb-b61e-c7c0e30c6717",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>dose_val</th>\n",
       "      <th>0.001</th>\n",
       "      <th>0.010</th>\n",
       "      <th>0.100</th>\n",
       "      <th>1.000</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>split_tyrosine_ood</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ood</th>\n",
       "      <td>4226</td>\n",
       "      <td>4118</td>\n",
       "      <td>3822</td>\n",
       "      <td>2714</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>test</th>\n",
       "      <td>3928</td>\n",
       "      <td>3930</td>\n",
       "      <td>3590</td>\n",
       "      <td>2688</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>train</th>\n",
       "      <td>144859</td>\n",
       "      <td>139622</td>\n",
       "      <td>134416</td>\n",
       "      <td>133864</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "dose_val             0.001   0.010   0.100   1.000\n",
       "split_tyrosine_ood                                \n",
       "ood                   4226    4118    3822    2714\n",
       "test                  3928    3930    3590    2688\n",
       "train               144859  139622  134416  133864"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.crosstab(adata_sciplex.obs.split_tyrosine_ood, adata_sciplex.obs.dose_val)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f5dc02f-c298-46a3-b79d-39b414f87d0f",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true,
    "tags": []
   },
   "source": [
    "__________\n",
    "\n",
    "#### Cell cycle regulation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "b97eade0-5807-41b7-b657-809cb7f9b930",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ENMD-2076       5757\n",
       "BMS-265246      3274\n",
       "Roscovitine     3254\n",
       "Aurora          3036\n",
       "MK-5108         3006\n",
       "                ... \n",
       "Fedratinib         0\n",
       "Filgotinib         0\n",
       "Fluorouracil       0\n",
       "Fulvestrant        0\n",
       "control            0\n",
       "Name: condition, Length: 188, dtype: int64"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata_sciplex.obs.loc[adata_sciplex.obs.pathway_level_1.isin([\"Cell cycle regulation\"]),'condition'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "04af6f98-298d-4630-9d53-1cd4892ed0d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "cell_cycle_drugs = adata_sciplex.obs.loc[adata_sciplex.obs.pathway_level_1.isin([\"Cell cycle regulation\"]),'condition'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "2c47f50d-2178-4122-90b3-bb4202cc9f36",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata_sciplex.obs['split_cellcycle_ood'] = 'train'  \n",
    "\n",
    "test_idx = sc.pp.subsample(adata_sciplex[adata_sciplex.obs.pathway_level_1.isin([\"Cell cycle regulation\"])], .20, copy=True).obs.index\n",
    "adata_sciplex.obs.loc[test_idx, 'split_cellcycle_ood'] = 'test'\n",
    "\n",
    "adata_sciplex.obs.loc[adata_sciplex.obs.condition.isin([\"SNS-314\", \"Flavopiridol\", \"Roscovitine\"]), 'split_cellcycle_ood'] = 'ood'  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "4845abcb-fbe7-4206-9d6b-19d9a937bd6b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "train    565503\n",
       "test       9376\n",
       "ood        6898\n",
       "Name: split_cellcycle_ood, dtype: int64"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata_sciplex.obs.split_cellcycle_ood.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "ac569309-1243-4af2-8b06-3b3416a6e4d7",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>condition</th>\n",
       "      <th>AMG-900</th>\n",
       "      <th>Alisertib</th>\n",
       "      <th>Aurora</th>\n",
       "      <th>BMS-265246</th>\n",
       "      <th>Barasertib</th>\n",
       "      <th>CYC116</th>\n",
       "      <th>Danusertib</th>\n",
       "      <th>ENMD-2076</th>\n",
       "      <th>Epothilone</th>\n",
       "      <th>Flavopiridol</th>\n",
       "      <th>GSK1070916</th>\n",
       "      <th>Hesperadin</th>\n",
       "      <th>JNJ-7706621</th>\n",
       "      <th>MK-5108</th>\n",
       "      <th>MLN8054</th>\n",
       "      <th>PHA-680632</th>\n",
       "      <th>Patupilone</th>\n",
       "      <th>Roscovitine</th>\n",
       "      <th>SNS-314</th>\n",
       "      <th>Tozasertib</th>\n",
       "      <th>ZM</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>split_cellcycle_ood</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ood</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1407</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3254</td>\n",
       "      <td>2237</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>test</th>\n",
       "      <td>545</td>\n",
       "      <td>428</td>\n",
       "      <td>616</td>\n",
       "      <td>679</td>\n",
       "      <td>463</td>\n",
       "      <td>570</td>\n",
       "      <td>469</td>\n",
       "      <td>1140</td>\n",
       "      <td>230</td>\n",
       "      <td>0</td>\n",
       "      <td>512</td>\n",
       "      <td>356</td>\n",
       "      <td>590</td>\n",
       "      <td>590</td>\n",
       "      <td>478</td>\n",
       "      <td>450</td>\n",
       "      <td>290</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>424</td>\n",
       "      <td>546</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>train</th>\n",
       "      <td>2165</td>\n",
       "      <td>1673</td>\n",
       "      <td>2420</td>\n",
       "      <td>2595</td>\n",
       "      <td>1958</td>\n",
       "      <td>2381</td>\n",
       "      <td>1927</td>\n",
       "      <td>4617</td>\n",
       "      <td>991</td>\n",
       "      <td>0</td>\n",
       "      <td>1990</td>\n",
       "      <td>1593</td>\n",
       "      <td>2398</td>\n",
       "      <td>2416</td>\n",
       "      <td>1866</td>\n",
       "      <td>1731</td>\n",
       "      <td>1191</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1596</td>\n",
       "      <td>2170</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "condition            AMG-900  Alisertib  Aurora  BMS-265246  Barasertib  \\\n",
       "split_cellcycle_ood                                                       \n",
       "ood                        0          0       0           0           0   \n",
       "test                     545        428     616         679         463   \n",
       "train                   2165       1673    2420        2595        1958   \n",
       "\n",
       "condition            CYC116  Danusertib  ENMD-2076  Epothilone  Flavopiridol  \\\n",
       "split_cellcycle_ood                                                            \n",
       "ood                       0           0          0           0          1407   \n",
       "test                    570         469       1140         230             0   \n",
       "train                  2381        1927       4617         991             0   \n",
       "\n",
       "condition            GSK1070916  Hesperadin  JNJ-7706621  MK-5108  MLN8054  \\\n",
       "split_cellcycle_ood                                                          \n",
       "ood                           0           0            0        0        0   \n",
       "test                        512         356          590      590      478   \n",
       "train                      1990        1593         2398     2416     1866   \n",
       "\n",
       "condition            PHA-680632  Patupilone  Roscovitine  SNS-314  Tozasertib  \\\n",
       "split_cellcycle_ood                                                             \n",
       "ood                           0           0         3254     2237           0   \n",
       "test                        450         290            0        0         424   \n",
       "train                      1731        1191            0        0        1596   \n",
       "\n",
       "condition              ZM  \n",
       "split_cellcycle_ood        \n",
       "ood                     0  \n",
       "test                  546  \n",
       "train                2170  "
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.crosstab(adata_sciplex.obs.split_cellcycle_ood, adata_sciplex.obs['condition'][adata_sciplex.obs.condition.isin(cell_cycle_drugs)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "b93f29bf-a79f-40aa-87dd-4bd736cc8fa7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>dose_val</th>\n",
       "      <th>0.001</th>\n",
       "      <th>0.010</th>\n",
       "      <th>0.100</th>\n",
       "      <th>1.000</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>split_cellcycle_ood</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ood</th>\n",
       "      <td>2165</td>\n",
       "      <td>1774</td>\n",
       "      <td>1457</td>\n",
       "      <td>1502</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>test</th>\n",
       "      <td>2673</td>\n",
       "      <td>2429</td>\n",
       "      <td>2329</td>\n",
       "      <td>1945</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>train</th>\n",
       "      <td>148175</td>\n",
       "      <td>143467</td>\n",
       "      <td>138042</td>\n",
       "      <td>135819</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "dose_val              0.001   0.010   0.100   1.000\n",
       "split_cellcycle_ood                                \n",
       "ood                    2165    1774    1457    1502\n",
       "test                   2673    2429    2329    1945\n",
       "train                148175  143467  138042  135819"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.crosstab(adata_sciplex.obs.split_cellcycle_ood, adata_sciplex.obs.dose_val)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "41ba1c76-85a8-4398-a637-5354ac5cfb18",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['split_ho_pathway',\n",
       " 'split_tyrosine_ood',\n",
       " 'split_epigenetic_ood',\n",
       " 'split_cellcycle_ood']"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[c for c in adata_sciplex.obs.columns if 'split' in c]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "697e1caf-86a1-46e5-af03-76213446dfe2",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true,
    "tags": []
   },
   "source": [
    "### Further splits\n",
    "\n",
    "**We omit these split as we design our own splits - for referece this is commented out for the moment**\n",
    "\n",
    "Also a split which sees all data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "97654e4f-4801-42df-94e4-b3abc596dff2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# adata.obs['split_all'] = 'train'\n",
    "# test_idx = sc.pp.subsample(adata, .10, copy=True).obs.index\n",
    "# adata.obs.loc[test_idx, 'split_all'] = 'test'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "07984033-dc32-4cad-91c1-650e3a2926e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# adata.obs['ct_dose'] = adata.obs.cell_type.astype('str') + '_' + adata.obs.dose_val.astype('str')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef98d16d-aa0e-4b39-9675-c7b0132510c9",
   "metadata": {},
   "source": [
    "Round robin splits: dose and cell line combinations will be held out in turn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "4492d665-b265-4c8c-9251-6d7598551116",
   "metadata": {},
   "outputs": [],
   "source": [
    "# i = 0\n",
    "# split_dict = {}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "84f59288-4925-4d9c-92e3-f2f0611910da",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# # single ct holdout\n",
    "# for ct in adata.obs.cell_type.unique():\n",
    "#     for dose in adata.obs.dose_val.unique():\n",
    "#         i += 1\n",
    "#         split_name = f'split{i}'\n",
    "#         split_dict[split_name] = f'{ct}_{dose}'\n",
    "        \n",
    "#         adata.obs[split_name] = 'train'\n",
    "#         adata.obs.loc[adata.obs.ct_dose == f'{ct}_{dose}', split_name] = 'ood'\n",
    "        \n",
    "#         test_idx = sc.pp.subsample(adata[adata.obs[split_name] != 'ood'], .16, copy=True).obs.index\n",
    "#         adata.obs.loc[test_idx, split_name] = 'test'\n",
    "        \n",
    "#         display(adata.obs[split_name].value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "23d5c2bd-ebcc-4da8-a0b8-c04fab040c44",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# # double ct holdout\n",
    "# for cts in [('A549', 'MCF7'), ('A549', 'K562'), ('MCF7', 'K562')]:\n",
    "#     for dose in adata.obs.dose_val.unique():\n",
    "#         i += 1\n",
    "#         split_name = f'split{i}'\n",
    "#         split_dict[split_name] = f'{cts[0]}+{cts[1]}_{dose}'\n",
    "        \n",
    "#         adata.obs[split_name] = 'train'\n",
    "#         adata.obs.loc[adata.obs.ct_dose == f'{cts[0]}_{dose}', split_name] = 'ood'\n",
    "#         adata.obs.loc[adata.obs.ct_dose == f'{cts[1]}_{dose}', split_name] = 'ood'\n",
    "        \n",
    "#         test_idx = sc.pp.subsample(adata[adata.obs[split_name] != 'ood'], .16, copy=True).obs.index\n",
    "#         adata.obs.loc[test_idx, split_name] = 'test'\n",
    "        \n",
    "#         display(adata.obs[split_name].value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "e722a203-eeba-4e85-a542-33d8783afec7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# # triple ct holdout\n",
    "# for dose in adata.obs.dose_val.unique():\n",
    "#     i += 1\n",
    "#     split_name = f'split{i}'\n",
    "\n",
    "#     split_dict[split_name] = f'all_{dose}'\n",
    "#     adata.obs[split_name] = 'train'\n",
    "#     adata.obs.loc[adata.obs.dose_val == dose, split_name] = 'ood'\n",
    "\n",
    "#     test_idx = sc.pp.subsample(adata[adata.obs[split_name] != 'ood'], .16, copy=True).obs.index\n",
    "#     adata.obs.loc[test_idx, split_name] = 'test'\n",
    "\n",
    "#     display(adata.obs[split_name].value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "34f21c22-1979-484a-92bb-9f6fc8b71fd2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# adata.uns['all_DEGs']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "615fa85a-417d-4c37-8530-f0129132fc4f",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Save adata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "319f177a-549f-4424-a56e-51af6535c48e",
   "metadata": {},
   "source": [
    "Reindex the lincs dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "1353837b-9cf8-46f0-9529-1340ba033f2f",
   "metadata": {},
   "outputs": [],
   "source": [
    "sciplex_ids = pd.Index(adata_sciplex.var.gene_id)\n",
    "\n",
    "lincs_idx = [sciplex_ids.get_loc(_id) for _id in adata_lincs.var.gene_id[adata_lincs.var.in_sciplex]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "id": "92990709-882a-4ed6-b216-df893b4dcea2",
   "metadata": {},
   "outputs": [],
   "source": [
    "non_lincs_idx = [sciplex_ids.get_loc(_id) for _id in adata_sciplex.var.gene_id if not adata_lincs.var.gene_id.isin([_id]).any()]\n",
    "\n",
    "lincs_idx.extend(non_lincs_idx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "id": "edda556e-1ae9-47ad-906f-bf12d94dccef",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata_sciplex = adata_sciplex[:, lincs_idx].copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "id": "eea089c3-f96a-420c-af98-576d3a34bd1c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "fname = PROJECT_DIR/'datasets'/'sciplex3_matched_genes_lincs.h5ad'\n",
    "\n",
    "sc.write(fname, adata_sciplex)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44fb969d-2a45-41a3-b138-eda1ea8e7238",
   "metadata": {},
   "source": [
    "Check that it worked"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "582e2283-6100-4de2-995b-b7befddd0a92",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AnnData object with n_obs × n_vars = 581777 × 2000\n",
       "    obs: 'cell_type', 'dose', 'dose_character', 'dose_pattern', 'g1s_score', 'g2m_score', 'pathway', 'pathway_level_1', 'pathway_level_2', 'product_dose', 'product_name', 'proliferation_index', 'replicate', 'size_factor', 'target', 'vehicle', 'batch', 'n_counts', 'dose_val', 'condition', 'drug_dose_name', 'cov_drug_dose_name', 'cov_drug', 'control', 'split_ho_pathway', 'split_tyrosine_ood', 'split_epigenetic_ood', 'split_cellcycle_ood'\n",
       "    var: 'id', 'num_cells_expressed-0-0', 'num_cells_expressed-1-0', 'num_cells_expressed-1', 'gene_id', 'in_lincs', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'\n",
       "    uns: 'all_DEGs', 'hvg', 'lincs_DEGs', 'log1p'"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sc.read(fname)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bdd00494-7d01-41c9-9a05-cf2921e68393",
   "metadata": {},
   "source": [
    "## Subselect to shared only shared genes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "425a4946-12a1-42c3-ab11-41673606be1e",
   "metadata": {},
   "source": [
    "Subset to shared genes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "id": "a0005d9c-42e1-4ec1-b7f5-fa9d96037d5d",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "adata_lincs = adata_lincs[:, adata_lincs.var.in_sciplex].copy() "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "id": "1b0831d3-c6c2-4dbd-b137-8ca27e4e0e52",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata_sciplex = adata_sciplex[:, adata_sciplex.var.in_lincs].copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "e0fcb5f4-46a1-4f4b-ab41-7cefdc26abfe",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['DDR1', 'PAX8', 'RPS5', 'ABCF1', 'SPAG7', 'RHOA', 'RNPS1', 'SMNDC1',\n",
       "       'ATP6V0B', 'RPS6',\n",
       "       ...\n",
       "       'P4HTM', 'SLC27A3', 'TBXA2R', 'RTN2', 'TSTA3', 'PPARD', 'GNA11',\n",
       "       'WDTC1', 'PLSCR3', 'NPEPL1'],\n",
       "      dtype='object', length=977)"
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata_lincs.var_names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "id": "8825ee35-daab-4625-93f9-764ced4ef32f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['DDR1', 'PAX8', 'RPS5', 'ABCF1', 'SPAG7', 'RHOA', 'RNPS1', 'SMNDC1',\n",
       "       'ATP6V0B', 'RPS6',\n",
       "       ...\n",
       "       'P4HTM', 'SLC27A3', 'TBXA2R', 'RTN2', 'TSTA3', 'PPARD', 'GNA11',\n",
       "       'WDTC1', 'PLSCR3', 'NPEPL1'],\n",
       "      dtype='object', name='index', length=977)"
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata_sciplex.var_names"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c059ac22-e464-40a9-8021-cbb4d8a10aba",
   "metadata": {},
   "source": [
    "## Save adata objects with shared genes only\n",
    "Index of lincs has also been reordered accordingly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "id": "4ea8321a-694c-4522-a931-38d51990b5a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "fname = PROJECT_DIR/'datasets'/'sciplex3_lincs_genes.h5ad'\n",
    "\n",
    "sc.write(fname, adata_sciplex)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36596257-fb0c-479c-8868-996a25affeae",
   "metadata": {},
   "source": [
    "____"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "id": "881fb5d3-1c04-4ecd-8ea3-aeda1e3baf57",
   "metadata": {},
   "outputs": [],
   "source": [
    "fname_lincs = PROJECT_DIR/'datasets'/'lincs_full_smiles_sciplex_genes.h5ad'\n",
    "\n",
    "sc.write(fname_lincs, adata_lincs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "89dca192",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "ad25c9354f8cefdf5a943c25e67813a21d2807e3af4d6d0915e47390a83b57ce"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.12"
  },
  "toc-autonumbering": false
 },
 "nbformat": 4,
 "nbformat_minor": 5
}