{
"cells": [
{
"cell_type": "markdown",
"id": "9144495b-2433-4bb9-9b6f-6e282ea07891",
"metadata": {},
"source": [
"# 3 LINCS SCIPLEX GENE MATCHING\n",
"\n",
"**Requires**\n",
"* `'lincs_full_smiles.h5ad'`\n",
"* `'sciplex_raw_chunk_{i}.h5ad'` with $i \\in \\{0,1,2,3,4\\}$\n",
"\n",
"**Output**\n",
"* `'sciplex3_matched_genes_lincs.h5ad'`\n",
"* `lincs`: `'sciplex3_lincs_genes.h5ad'`\n",
"* `sciplex`: `'lincs_full_smiles_sciplex_genes.h5ad'`\n",
"\n",
"\n",
"\n",
"## Description \n",
"\n",
"The goal of this notebook is to match and merge genes between the LINCS and SciPlex datasets, resulting in the creation of three new datasets:\n",
"\n",
"### Created datasets\n",
"\n",
"- **`sciplex3_matched_genes_lincs.h5ad`**: Contains **SciPlex observations**. **Genes are limited to the intersection** of the genes found in both LINCS and SciPlex datasets, and or highly variable genes in sciplex.\n",
"\n",
"\n",
"- **`sciplex3_lincs_genes.h5ad`**: Contains **SciPlex data**, but filtered to include **only the genes that are shared with the LINCS dataset**. (strict intersection, 977 genes)\n",
"\n",
"- **`lincs_full_smiles_sciplex_genes.h5ad`**: Contains **LINCS data**, but filtered to include **only the genes that are shared with the SciPlex dataset**.\n",
"\n",
"\n",
"\n",
"To create these datasets, we need to match the genes between the two datasets, which is done as follows:\n",
"\n",
"### Gene Matching\n",
"\n",
"1. **Gene ID Assignment**: SciPlex gene names are standardized to Ensembl gene IDs by extracting the primary identifier and using either **sfaira** or a predefined mapping (`symbols_dict.json`). The LINCS dataset is already standardized.\n",
"\n",
"2. **Identifying Shared Genes**: We then compute the intersection of the gene IDs (`gene_id`) inside LINCS and SciPlex. Both datasets are then filtered to retain only these shared genes.\n",
"\n",
"3. **Reindexing**: The LINCS dataset is reindexed to match the order of genes in the SciPlex dataset.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "9a33a003-9ca0-4994-955c-305852e4d354",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/requests/__init__.py:104: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.2.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n",
" RequestsDependencyWarning)\n",
"2023-08-19 10:31:31.638164: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA\n",
"To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
"2023-08-19 10:31:34.020338: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory\n",
"2023-08-19 10:31:34.020465: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory\n",
"2023-08-19 10:31:34.020477: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"scanpy==1.9.1 anndata==0.8.0 umap==0.5.3 numpy==1.21.6 scipy==1.7.3 pandas==1.3.5 scikit-learn==1.0.2 statsmodels==0.13.2 pynndescent==0.5.6\n"
]
}
],
"source": [
"import os\n",
"import sys\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"import sfaira\n",
"import warnings\n",
"os.getcwd()\n",
"\n",
"from chemCPA.paths import DATA_DIR, PROJECT_DIR\n",
"\n",
"pd.set_option('display.max_columns', 100)\n",
"\n",
"root_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))\n",
"sys.path.append(root_dir)\n",
"import logging\n",
"\n",
"logging.basicConfig(level=logging.INFO)\n",
"from notebook_utils import suppress_output\n",
"\n",
"import scanpy as sc\n",
"with suppress_output():\n",
" sc.set_figure_params(dpi=80, frameon=False)\n",
" sc.logging.print_header()\n",
" warnings.filterwarnings('ignore')\n",
"\n",
"# logging.info is visible when running as python script \n",
"if not any('ipykernel' in arg for arg in sys.argv):\n",
" logging.basicConfig(\n",
" level=logging.INFO,\n",
" format='%(asctime)s - %(levelname)s - %(message)s',\n",
" datefmt='%Y-%m-%d %H:%M:%S'\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "d3c097de-7254-43e5-89e3-d3095e45f270",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The autoreload extension is already loaded. To reload it, use:\n",
" %reload_ext autoreload\n"
]
}
],
"source": [
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "markdown",
"id": "0c6073db-2c0a-413c-b4b5-17dcef7e064c",
"metadata": {
"tags": []
},
"source": [
"## Load data"
]
},
{
"cell_type": "markdown",
"id": "3bcf6d3b-64c3-48cd-987b-f984c0e76ddd",
"metadata": {},
"source": [
"Load lincs"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9a72ac06-5233-4e7a-ba63-79786d6d2c31",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"adata_lincs = sc.read(DATA_DIR/'lincs_full_smiles.h5ad' )"
]
},
{
"cell_type": "markdown",
"id": "05efc7d8-b2ac-4fb5-bfd5-0938e5b80b1a",
"metadata": {},
"source": [
"Load sciplex "
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "b365aa7a-9957-4359-977d-dafe400df570",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n",
" [AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],\n"
]
}
],
"source": [
"from tqdm import tqdm\n",
"from chemCPA.paths import DATA_DIR, PROJECT_DIR\n",
"from raw_data.datasets import sciplex\n",
"\n",
"# Load and concatenate chunks\n",
"adatas_sciplex = []\n",
"logging.info(\"Starting to load in sciplex data\")\n",
"\n",
"# Get paths to all sciplex chunks\n",
"chunk_paths = sciplex()\n",
"\n",
"# Load chunks with progress bar\n",
"for chunk_path in tqdm(chunk_paths, desc=\"Loading sciplex chunks\"):\n",
" tqdm.write(f\"Loading {os.path.basename(chunk_path)}\")\n",
" adatas_sciplex.append(sc.read(chunk_path))\n",
" \n",
"adata_sciplex = adatas_sciplex[0].concatenate(adatas_sciplex[1:])\n",
"logging.info(\"Sciplex data loaded\")"
]
},
{
"cell_type": "markdown",
"id": "0f5c24ac-3b55-40f2-abee-22286b4c6d16",
"metadata": {},
"source": [
"Add gene_id to sciplex"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b72d957d-18eb-4994-875b-0bdd9db254c9",
"metadata": {},
"outputs": [],
"source": [
"adata_sciplex.var['gene_id'] = adata_sciplex.var.id.str.split('.').str[0]\n",
"adata_sciplex.var['gene_id'].head()"
]
},
{
"cell_type": "markdown",
"id": "f0caf549-00e6-4d18-bb19-6c344c9f62c9",
"metadata": {
"tags": []
},
"source": [
"### Get gene ids from symbols via sfaira"
]
},
{
"cell_type": "markdown",
"id": "065cb35d-ce63-4152-9fcb-ca939295bc29",
"metadata": {},
"source": [
"Load genome container with sfaira"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "11109df9-7724-4658-8865-7bb19ec98e3f",
"metadata": {},
"outputs": [],
"source": [
"try: \n",
" # load json file with symbol to id mapping\n",
" import json\n",
" with open(DATA_DIR/ 'symbols_dict.json') as json_file:\n",
" symbols_dict = json.load(json_file)\n",
"except: \n",
" logging.info(\"No symbols_dict.json found, falling back to sfaira\")\n",
" genome_container = sfaira.versions.genomes.GenomeContainer(organism=\"homo_sapiens\", release=\"82\")\n",
" symbols_dict = genome_container.symbol_to_id_dict\n",
" # Extend symbols dict with unknown symbol\n",
" symbols_dict.update({'PLSCR3':'ENSG00000187838'})"
]
},
{
"cell_type": "markdown",
"id": "d071d09e-dc42-4872-8cf5-88ad20f35fc2",
"metadata": {},
"source": [
"Identify genes that are shared between lincs and trapnell"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "c9aba778-ade2-4bb1-b0e9-2e8bc7a95602",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# For lincs\n",
"adata_lincs.var['gene_id'] = adata_lincs.var_names.map(symbols_dict)\n",
"adata_lincs.var['in_sciplex'] = adata_lincs.var.gene_id.isin(adata_sciplex.var.gene_id)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "d79a3562-60ad-4795-823d-bdaed56e3fb5",
"metadata": {},
"outputs": [],
"source": [
"# For trapnell\n",
"adata_sciplex.var['in_lincs'] = adata_sciplex.var.gene_id.isin(adata_lincs.var.gene_id)"
]
},
{
"cell_type": "markdown",
"id": "7f4c3d35-0070-40d8-a87a-dce02883cf6e",
"metadata": {
"tags": []
},
"source": [
"## Preprocess sciplex dataset"
]
},
{
"cell_type": "markdown",
"id": "ce8ecbc7-7d41-4e38-8817-f8c8d01ad29f",
"metadata": {},
"source": [
"See `sciplex3.ipynb`"
]
},
{
"cell_type": "markdown",
"id": "45442825-0f56-42bd-88f9-48fd0468010c",
"metadata": {},
"source": [
"The original CPA implementation required to subset the data due to scaling limitations. \n",
"In this version we expect to be able to handle the full sciplex dataset."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "68ff6c9b-6e46-402c-96aa-a2da57af9c79",
"metadata": {},
"outputs": [],
"source": [
"SUBSET = False\n",
"\n",
"if SUBSET: \n",
" sc.pp.subsample(adata_sciplex, fraction=0.5, random_state=42)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "363f0cf5-340c-4589-bc95-de8a1e22fbbe",
"metadata": {},
"outputs": [],
"source": [
"sc.pp.normalize_per_cell(adata_sciplex)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "fb7a8ae8-e5db-4c8e-bf0d-965f7c8e4dbe",
"metadata": {},
"outputs": [],
"source": [
"sc.pp.log1p(adata_sciplex)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "aecbc3c6-3882-442c-8b84-ce163f704b84",
"metadata": {},
"outputs": [],
"source": [
"sc.pp.highly_variable_genes(adata_sciplex, n_top_genes=1032, subset=False)"
]
},
{
"cell_type": "markdown",
"id": "dc91a1a0-2011-4834-afb6-278206d15e71",
"metadata": {
"tags": []
},
"source": [
"### Combine HVG with lincs genes\n",
"\n",
"Union of genes that are considered highly variable and those that are shared with lincs"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "761a5c25-2947-4f66-ab8c-c33a1b713444",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2000"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"((adata_sciplex.var.in_lincs) | (adata_sciplex.var.highly_variable)).sum()"
]
},
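{
"cell_type": "markdown",
"id": "sketch-hvg-lincs-union",
"metadata": {},
"source": [
"The difference between the union computed above and the strict intersection with lincs can be sketched on a toy `var` table. This is only an illustration: the gene names and flags are made up, and only the column names `in_lincs` and `highly_variable` come from the cells above.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy var table; both boolean columns are hypothetical stand-ins\n",
"var = pd.DataFrame(\n",
"    {'in_lincs': [True, True, False, False],\n",
"     'highly_variable': [True, False, True, False]},\n",
"    index=['g1', 'g2', 'g3', 'g4'],\n",
")\n",
"\n",
"# Union: genes shared with lincs OR flagged as highly variable\n",
"union_mask = var['in_lincs'] | var['highly_variable']\n",
"# Strict intersection with lincs: shared genes only\n",
"strict_mask = var['in_lincs']\n",
"\n",
"print(int(union_mask.sum()), int(strict_mask.sum()))  # prints: 3 2\n",
"```\n",
"\n",
"A gene carrying both flags (here `g1`) is counted once, so the union can be smaller than the sum of the two gene sets."
]
},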
{
"cell_type": "markdown",
"id": "db01d26e-e0f7-44a0-a8b8-380223049f81",
"metadata": {},
"source": [
"Subset to that union of genes"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "6e6328cd-cd03-4e72-85dc-b72eed2632f7",
"metadata": {},
"outputs": [],
"source": [
"adata_sciplex = adata_sciplex[:, (adata_sciplex.var.in_lincs) | (adata_sciplex.var.highly_variable)].copy()"
]
},
{
"cell_type": "markdown",
"id": "b25d10e9-6e1e-4a13-a512-3580fc1295c8",
"metadata": {
"tags": []
},
"source": [
"### Create additional meta data "
]
},
{
"cell_type": "markdown",
"id": "985349fe-37bf-4efc-9765-d612d8d440c8",
"metadata": {},
"source": [
"Normalise dose values"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "62b9e529-ca45-4f04-b2d6-75dd246aa36c",
"metadata": {},
"outputs": [],
"source": [
"adata_sciplex.obs['dose_val'] = adata_sciplex.obs.dose.astype(float) / np.max(adata_sciplex.obs.dose.astype(float))\n",
"adata_sciplex.obs.loc[adata_sciplex.obs['product_name'].str.contains('Vehicle'), 'dose_val'] = 1.0"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "ed4ea831-0650-4b6b-bc09-7fbaf5f004b7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.001 153013\n",
"0.010 147670\n",
"0.100 141828\n",
"1.000 139266\n",
"Name: dose_val, dtype: int64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adata_sciplex.obs['dose_val'].value_counts()"
]
},
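{
"cell_type": "markdown",
"id": "sketch-dose-normalisation",
"metadata": {},
"source": [
"The dose normalisation above can be reproduced on a toy `obs` table. This is a minimal sketch under assumptions: the drug names, the raw doses, and the vehicle dose of `0.0` are hypothetical; only the column names follow the cells above.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy obs table; doses and product names are hypothetical\n",
"obs = pd.DataFrame({\n",
"    'dose': [10.0, 100.0, 1000.0, 10000.0, 0.0],\n",
"    'product_name': ['DrugA', 'DrugA', 'DrugB', 'DrugB', 'Vehicle'],\n",
"})\n",
"\n",
"# Divide every dose by the largest dose in the table\n",
"obs['dose_val'] = obs['dose'].astype(float) / obs['dose'].astype(float).max()\n",
"# Vehicle-only cells are assigned dose_val 1.0 so they all share one label\n",
"obs.loc[obs['product_name'].str.contains('Vehicle'), 'dose_val'] = 1.0\n",
"\n",
"print(obs['dose_val'].tolist())\n",
"```\n",
"\n",
"Because vehicle cells are set to `1.0`, the count for `dose_val == 1.0` includes both the highest drug dose and the controls."
]
},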
{
"cell_type": "markdown",
"id": "f4908bb2-4fd0-40d5-a6d7-24e68bcf9bb0",
"metadata": {},
"source": [
"Change `product_name`"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "71393716-2328-41d8-a077-ef8fc435bf61",
"metadata": {},
"outputs": [],
"source": [
"adata_sciplex.obs['product_name'] = [x.split(' ')[0] for x in adata_sciplex.obs['product_name']]\n",
"adata_sciplex.obs.loc[adata_sciplex.obs['product_name'].str.contains('Vehicle'), 'product_name'] = 'control'"
]
},
{
"cell_type": "markdown",
"id": "6bcd577f-500f-409e-bdc9-d1acac3dc583",
"metadata": {},
"source": [
"Create copy of `product_name` with column name `control`"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "b1cbcd93-43cb-4740-8c84-2331ccb4b066",
"metadata": {},
"outputs": [],
"source": [
"adata_sciplex.obs['condition'] = adata_sciplex.obs.product_name.copy()"
]
},
{
"cell_type": "markdown",
"id": "0d148e98-c819-4455-92b4-ccc5cf2a46de",
"metadata": {},
"source": [
"Add combinations of drug (`condition`), dose (`dose_val`), and cell_type (`cell_type`)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "05249c77-612d-4236-af95-12cf2d2aefdf",
"metadata": {},
"outputs": [],
"source": [
"# make column of dataframe to categorical \n",
"adata_sciplex.obs[\"condition\"] = adata_sciplex.obs[\"condition\"].astype('category').cat.rename_categories({\"(+)-JQ1\": \"JQ1\"})\n",
"adata_sciplex.obs['drug_dose_name'] = adata_sciplex.obs.condition.astype(str) + '_' + adata_sciplex.obs.dose_val.astype(str)\n",
"adata_sciplex.obs['cov_drug_dose_name'] = adata_sciplex.obs.cell_type.astype(str) + '_' + adata_sciplex.obs.drug_dose_name.astype(str)\n",
"adata_sciplex.obs['cov_drug'] = adata_sciplex.obs.cell_type.astype(str) + '_' + adata_sciplex.obs.condition.astype(str)"
]
},
{
"cell_type": "markdown",
"id": "58850330-62b6-4d2c-a533-1cf238663805",
"metadata": {},
"source": [
"Add `control` columns with vale `1` where only the vehicle was used"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "5b7be27d-e6b8-42d6-b21b-400ddd5b3641",
"metadata": {},
"outputs": [],
"source": [
"adata_sciplex.obs['control'] = [1 if x == 'control_1.0' else 0 for x in adata_sciplex.obs.drug_dose_name.values]"
]
},
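{
"cell_type": "markdown",
"id": "sketch-labels-control-flag",
"metadata": {},
"source": [
"How the combined labels and the `control` flag relate can be sketched on a toy `obs` table. The rows are hypothetical, and the flag is computed with a vectorised comparison instead of the list comprehension used above; both mark exactly the rows whose `drug_dose_name` equals `'control_1.0'`.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy obs table; cell types, drugs, and doses are hypothetical\n",
"obs = pd.DataFrame({\n",
"    'cell_type': ['A549', 'A549', 'MCF7'],\n",
"    'condition': ['control', 'DrugA', 'DrugA'],\n",
"    'dose_val': [1.0, 0.1, 1.0],\n",
"})\n",
"\n",
"# Build the combined labels from drug, dose, and cell type\n",
"obs['drug_dose_name'] = obs['condition'].astype(str) + '_' + obs['dose_val'].astype(str)\n",
"obs['cov_drug_dose_name'] = obs['cell_type'].astype(str) + '_' + obs['drug_dose_name']\n",
"obs['cov_drug'] = obs['cell_type'].astype(str) + '_' + obs['condition'].astype(str)\n",
"\n",
"# Flag vehicle-only rows: they are the only ones labelled 'control_1.0'\n",
"obs['control'] = (obs['drug_dose_name'] == 'control_1.0').astype(int)\n",
"\n",
"print(obs[['cov_drug_dose_name', 'control']].values.tolist())\n",
"```\n",
"\n",
"The flag relies on vehicle cells having `dose_val` set to `1.0` beforehand; a control row with any other dose value would not be flagged."
]
},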
{
"cell_type": "markdown",
"id": "c06409b8-d30e-45a1-82aa-784fa5c2f1b0",
"metadata": {
"tags": []
},
"source": [
"## Compute DE genes"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "7753ba03-c908-4011-8365-2d871574cd56",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"A549\n",
"WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'names'] = self.var_names[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'scores'] = scores[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'pvals'] = pvals[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:421: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" foldchanges[global_indices]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"MCF7\n",
"WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'names'] = self.var_names[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'scores'] = scores[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'pvals'] = pvals[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:421: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" foldchanges[global_indices]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"K562\n",
"WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'names'] = self.var_names[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'scores'] = scores[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'pvals'] = pvals[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:421: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" foldchanges[global_indices]\n"
]
}
],
"source": [
"from chemCPA.helper import rank_genes_groups_by_cov\n",
"\n",
"rank_genes_groups_by_cov(adata_sciplex, groupby='cov_drug', covariate='cell_type', control_group='control', key_added='all_DEGs')"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "4d9f098d-a04b-4407-a3a2-5041cfb480ec",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"A549\n",
"WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'names'] = self.var_names[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'scores'] = scores[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'pvals'] = pvals[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:421: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" foldchanges[global_indices]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"MCF7\n",
"WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'names'] = self.var_names[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'scores'] = scores[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'pvals'] = pvals[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:421: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" foldchanges[global_indices]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"K562\n",
"WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/anndata/_core/anndata.py:1235: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
" df[key] = c\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'names'] = self.var_names[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'scores'] = scores[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'pvals'] = pvals[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]\n",
"/nfs/staff-hdd/hetzell/miniconda3/envs/chemical_CPA/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:421: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`\n",
" foldchanges[global_indices]\n"
]
}
],
"source": [
"adata_subset = adata_sciplex[:, adata_sciplex.var.in_lincs].copy()\n",
"rank_genes_groups_by_cov(adata_subset, groupby='cov_drug', covariate='cell_type', control_group='control', key_added='lincs_DEGs')\n",
"adata_sciplex.uns['lincs_DEGs'] = adata_subset.uns['lincs_DEGs']"
]
},
{
"cell_type": "markdown",
"id": "428882cd-4e02-4af0-b281-51210aafbf79",
"metadata": {},
"source": [
"### Map all unique `cov_drug_dose_name` to the computed DEGs, independent of the dose value\n",
"\n",
"Create mapping between names with dose and without dose"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "238338ab-4950-4c29-94c5-2d2c4b5738ad",
"metadata": {},
"outputs": [],
"source": [
"cov_drug_dose_unique = adata_sciplex.obs.cov_drug_dose_name.unique()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "08aea617-1179-43e1-8391-d38eeee3b748",
"metadata": {},
"outputs": [],
"source": [
"remove_dose = lambda s: '_'.join(s.split('_')[:-1])\n",
"cov_drug = pd.Series(cov_drug_dose_unique).apply(remove_dose)\n",
"dose_no_dose_dict = dict(zip(cov_drug_dose_unique, cov_drug))"
]
},
{
"cell_type": "markdown",
"id": "c78a6e80-5012-442f-b7bf-a6c581da92dd",
"metadata": {},
"source": [
"### Compute new dicts for DEGs"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "b6594da7-64c0-4212-8cca-58d247b2cc5f",
"metadata": {},
"outputs": [],
"source": [
"uns_keys = ['all_DEGs', 'lincs_DEGs']"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "d5c73b35-31d3-4814-a5c0-658b23f1d0a1",
"metadata": {},
"outputs": [],
"source": [
"for uns_key in uns_keys:\n",
" new_DEGs_dict = {}\n",
"\n",
" df_DEGs = pd.Series(adata_sciplex.uns[uns_key])\n",
"\n",
" for key, value in dose_no_dose_dict.items():\n",
" if 'control' in key:\n",
" continue\n",
" new_DEGs_dict[key] = df_DEGs.loc[value]\n",
" adata_sciplex.uns[uns_key] = new_DEGs_dict"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "f713118a-514c-4cdb-b887-7118508ee37c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"AnnData object with n_obs × n_vars = 581777 × 2000\n",
" obs: 'cell_type', 'dose', 'dose_character', 'dose_pattern', 'g1s_score', 'g2m_score', 'pathway', 'pathway_level_1', 'pathway_level_2', 'product_dose', 'product_name', 'proliferation_index', 'replicate', 'size_factor', 'target', 'vehicle', 'batch', 'n_counts', 'dose_val', 'condition', 'drug_dose_name', 'cov_drug_dose_name', 'cov_drug', 'control'\n",
" var: 'id', 'num_cells_expressed-0-0', 'num_cells_expressed-1-0', 'num_cells_expressed-1', 'gene_id', 'in_lincs', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'\n",
" uns: 'log1p', 'hvg', 'all_DEGs', 'lincs_DEGs'"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adata_sciplex"
]
},
{
"cell_type": "markdown",
"id": "d23ef784-5747-46de-a9bd-d5d869ff8042",
"metadata": {
"tags": []
},
"source": [
"## Create sciplex splits\n",
"\n",
"This is not the right configuration fot the experiments we want but for the moment this is okay"
]
},
{
"cell_type": "markdown",
"id": "6acf9d6e-d1af-4544-8021-8b9f4185d938",
"metadata": {
"tags": []
},
"source": [
"### OOD in Pathways"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "4b70dba8-132a-41fa-8958-78812063b738",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"DNA damage & DNA repair 6640\n",
"Epigenetic regulation 6093\n",
"Tyrosine kinase signaling 5846\n",
"Protein folding & Protein degradation 3863\n",
"Neuronal signaling 3635\n",
"Antioxidant 3616\n",
"HIF signaling 3501\n",
"Metabolic regulation 3470\n",
"Focal adhesion signaling 3450\n",
"Nuclear receptor signaling 3420\n",
"JAK/STAT signaling 3155\n",
"Apoptotic regulation 3141\n",
"TGF/BMP signaling 2794\n",
"PKC signaling 2778\n",
"Cell cycle regulation 2237\n",
"Other 0\n",
"Vehicle 0\n",
"Name: pathway_level_1, dtype: int64"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adata_sciplex.obs['split_ho_pathway'] = 'train' # reset\n",
"\n",
"ho_drugs = [\n",
" # selection of drugs from various pathways\n",
" \"Azacitidine\",\n",
" \"Carmofur\",\n",
" \"Pracinostat\",\n",
" \"Cediranib\",\n",
" \"Luminespib\",\n",
" \"Crizotinib\",\n",
" \"SNS-314\",\n",
" \"Obatoclax\",\n",
" \"Momelotinib\",\n",
" \"AG-14361\",\n",
" \"Entacapone\",\n",
" \"Fulvestrant\",\n",
" \"Mesna\",\n",
" \"Zileuton\",\n",
" \"Enzastaurin\",\n",
" \"IOX2\",\n",
" \"Alvespimycin\",\n",
" \"XAV-939\",\n",
" \"Fasudil\",\n",
"]\n",
"\n",
"ho_drug_pathway = adata_sciplex.obs['condition'].isin(ho_drugs)\n",
"adata_sciplex.obs.loc[ho_drug_pathway, 'pathway_level_1'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "65e41d95-3d6a-400b-b3d6-142161773d4d",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"57639"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ho_drug_pathway.sum()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "3ce7605f-8fef-4c7d-9b62-be3879bd2991",
"metadata": {},
"outputs": [],
"source": [
"adata_sciplex.obs.loc[ho_drug_pathway & (adata_sciplex.obs['dose_val'] == 1.0), 'split_ho_pathway'] = 'ood'\n",
"\n",
"test_idx = sc.pp.subsample(adata_sciplex[adata_sciplex.obs['split_ho_pathway'] != 'ood'], .15, copy=True).obs.index\n",
"adata_sciplex.obs.loc[test_idx, 'split_ho_pathway'] = 'test'"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "89cf167d-67bc-4603-b9f8-a73dd9980280",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" condition | \n",
" AG-14361 | \n",
" Alvespimycin | \n",
" Azacitidine | \n",
" Carmofur | \n",
" Cediranib | \n",
" Crizotinib | \n",
" Entacapone | \n",
" Enzastaurin | \n",
" Fasudil | \n",
" Fulvestrant | \n",
" IOX2 | \n",
" Luminespib | \n",
" Mesna | \n",
" Momelotinib | \n",
" Obatoclax | \n",
" Pracinostat | \n",
" SNS-314 | \n",
" XAV-939 | \n",
" Zileuton | \n",
"
\n",
" \n",
" pathway_level_1 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Antioxidant | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 3616 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" Apoptotic regulation | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 3141 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" Cell cycle regulation | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2237 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" DNA damage & DNA repair | \n",
" 3401 | \n",
" 0 | \n",
" 0 | \n",
" 3239 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" Epigenetic regulation | \n",
" 0 | \n",
" 0 | \n",
" 3151 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2942 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" Focal adhesion signaling | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 3450 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" HIF signaling | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 3501 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" JAK/STAT signaling | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 3155 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" Metabolic regulation | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 3470 | \n",
"
\n",
" \n",
" Neuronal signaling | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 3635 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" Nuclear receptor signaling | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 3420 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" PKC signaling | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2778 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" Protein folding & Protein degradation | \n",
" 0 | \n",
" 1858 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2005 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" TGF/BMP signaling | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2794 | \n",
" 0 | \n",
"
\n",
" \n",
" Tyrosine kinase signaling | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 3060 | \n",
" 2786 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"condition AG-14361 Alvespimycin Azacitidine \\\n",
"pathway_level_1 \n",
"Antioxidant 0 0 0 \n",
"Apoptotic regulation 0 0 0 \n",
"Cell cycle regulation 0 0 0 \n",
"DNA damage & DNA repair 3401 0 0 \n",
"Epigenetic regulation 0 0 3151 \n",
"Focal adhesion signaling 0 0 0 \n",
"HIF signaling 0 0 0 \n",
"JAK/STAT signaling 0 0 0 \n",
"Metabolic regulation 0 0 0 \n",
"Neuronal signaling 0 0 0 \n",
"Nuclear receptor signaling 0 0 0 \n",
"PKC signaling 0 0 0 \n",
"Protein folding & Protein degradation 0 1858 0 \n",
"TGF/BMP signaling 0 0 0 \n",
"Tyrosine kinase signaling 0 0 0 \n",
"\n",
"condition Carmofur Cediranib Crizotinib \\\n",
"pathway_level_1 \n",
"Antioxidant 0 0 0 \n",
"Apoptotic regulation 0 0 0 \n",
"Cell cycle regulation 0 0 0 \n",
"DNA damage & DNA repair 3239 0 0 \n",
"Epigenetic regulation 0 0 0 \n",
"Focal adhesion signaling 0 0 0 \n",
"HIF signaling 0 0 0 \n",
"JAK/STAT signaling 0 0 0 \n",
"Metabolic regulation 0 0 0 \n",
"Neuronal signaling 0 0 0 \n",
"Nuclear receptor signaling 0 0 0 \n",
"PKC signaling 0 0 0 \n",
"Protein folding & Protein degradation 0 0 0 \n",
"TGF/BMP signaling 0 0 0 \n",
"Tyrosine kinase signaling 0 3060 2786 \n",
"\n",
"condition Entacapone Enzastaurin Fasudil \\\n",
"pathway_level_1 \n",
"Antioxidant 0 0 0 \n",
"Apoptotic regulation 0 0 0 \n",
"Cell cycle regulation 0 0 0 \n",
"DNA damage & DNA repair 0 0 0 \n",
"Epigenetic regulation 0 0 0 \n",
"Focal adhesion signaling 0 0 3450 \n",
"HIF signaling 0 0 0 \n",
"JAK/STAT signaling 0 0 0 \n",
"Metabolic regulation 0 0 0 \n",
"Neuronal signaling 3635 0 0 \n",
"Nuclear receptor signaling 0 0 0 \n",
"PKC signaling 0 2778 0 \n",
"Protein folding & Protein degradation 0 0 0 \n",
"TGF/BMP signaling 0 0 0 \n",
"Tyrosine kinase signaling 0 0 0 \n",
"\n",
"condition Fulvestrant IOX2 Luminespib Mesna \\\n",
"pathway_level_1 \n",
"Antioxidant 0 0 0 3616 \n",
"Apoptotic regulation 0 0 0 0 \n",
"Cell cycle regulation 0 0 0 0 \n",
"DNA damage & DNA repair 0 0 0 0 \n",
"Epigenetic regulation 0 0 0 0 \n",
"Focal adhesion signaling 0 0 0 0 \n",
"HIF signaling 0 3501 0 0 \n",
"JAK/STAT signaling 0 0 0 0 \n",
"Metabolic regulation 0 0 0 0 \n",
"Neuronal signaling 0 0 0 0 \n",
"Nuclear receptor signaling 3420 0 0 0 \n",
"PKC signaling 0 0 0 0 \n",
"Protein folding & Protein degradation 0 0 2005 0 \n",
"TGF/BMP signaling 0 0 0 0 \n",
"Tyrosine kinase signaling 0 0 0 0 \n",
"\n",
"condition Momelotinib Obatoclax Pracinostat \\\n",
"pathway_level_1 \n",
"Antioxidant 0 0 0 \n",
"Apoptotic regulation 0 3141 0 \n",
"Cell cycle regulation 0 0 0 \n",
"DNA damage & DNA repair 0 0 0 \n",
"Epigenetic regulation 0 0 2942 \n",
"Focal adhesion signaling 0 0 0 \n",
"HIF signaling 0 0 0 \n",
"JAK/STAT signaling 3155 0 0 \n",
"Metabolic regulation 0 0 0 \n",
"Neuronal signaling 0 0 0 \n",
"Nuclear receptor signaling 0 0 0 \n",
"PKC signaling 0 0 0 \n",
"Protein folding & Protein degradation 0 0 0 \n",
"TGF/BMP signaling 0 0 0 \n",
"Tyrosine kinase signaling 0 0 0 \n",
"\n",
"condition SNS-314 XAV-939 Zileuton \n",
"pathway_level_1 \n",
"Antioxidant 0 0 0 \n",
"Apoptotic regulation 0 0 0 \n",
"Cell cycle regulation 2237 0 0 \n",
"DNA damage & DNA repair 0 0 0 \n",
"Epigenetic regulation 0 0 0 \n",
"Focal adhesion signaling 0 0 0 \n",
"HIF signaling 0 0 0 \n",
"JAK/STAT signaling 0 0 0 \n",
"Metabolic regulation 0 0 3470 \n",
"Neuronal signaling 0 0 0 \n",
"Nuclear receptor signaling 0 0 0 \n",
"PKC signaling 0 0 0 \n",
"Protein folding & Protein degradation 0 0 0 \n",
"TGF/BMP signaling 0 2794 0 \n",
"Tyrosine kinase signaling 0 0 0 "
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(adata_sciplex.obs.pathway_level_1, adata_sciplex.obs['condition'][adata_sciplex.obs.condition.isin(ho_drugs)])"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "3325a1e0-dd95-4e53-a773-99df4b463767",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"train 483951\n",
"test 85403\n",
"ood 12423\n",
"Name: split_ho_pathway, dtype: int64"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adata_sciplex.obs['split_ho_pathway'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "dfecfaa3-55c2-4d7d-872b-0e0208eac6a6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Fasudil 966\n",
"IOX2 913\n",
"Mesna 884\n",
"Entacapone 868\n",
"Fulvestrant 836\n",
"Zileuton 822\n",
"Carmofur 767\n",
"AG-14361 759\n",
"Azacitidine 736\n",
"Enzastaurin 694\n",
"Pracinostat 658\n",
"SNS-314 547\n",
"Cediranib 528\n",
"Momelotinib 487\n",
"XAV-939 479\n",
"Crizotinib 464\n",
"Luminespib 405\n",
"Obatoclax 404\n",
"Alvespimycin 206\n",
"Name: condition, dtype: int64"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adata_sciplex[adata_sciplex.obs.split_ho_pathway == 'ood'].obs.condition.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "a591e3d0-c1dd-4723-879f-76b37b16b962",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"control 1964\n",
"ENMD-2076 914\n",
"RG108 604\n",
"GSK-LSD1 596\n",
"Altretamine 573\n",
" ... \n",
"Luminespib 236\n",
"Patupilone 228\n",
"Flavopiridol 207\n",
"Epothilone 181\n",
"YM155 112\n",
"Name: condition, Length: 188, dtype: int64"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adata_sciplex[adata_sciplex.obs.split_ho_pathway == 'test'].obs.condition.value_counts()"
]
},
{
"cell_type": "markdown",
"id": "ff1a2cd4-2a68-4f67-8e13-46fe8fe06c42",
"metadata": {
"tags": []
},
"source": [
"### OOD drugs in epigenetic regulation, Tyrosine kinase signaling, cell cycle regulation"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "244d46ca-9ff8-4e4c-b225-c1a26c84b8da",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Epigenetic regulation 147875\n",
"Tyrosine kinase signaling 85503\n",
"JAK/STAT signaling 70922\n",
"DNA damage & DNA repair 60042\n",
"Cell cycle regulation 53952\n",
"Other 19980\n",
"Nuclear receptor signaling 19940\n",
"Protein folding & Protein degradation 19191\n",
"Metabolic regulation 17989\n",
"Neuronal signaling 14071\n",
"Antioxidant 13414\n",
"Apoptotic regulation 13141\n",
"Vehicle 13004\n",
"HIF signaling 9279\n",
"PKC signaling 8804\n",
"TGF/BMP signaling 8774\n",
"Focal adhesion signaling 5896\n",
"Name: pathway_level_1, dtype: int64"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adata_sciplex.obs['pathway_level_1'].value_counts()"
]
},
{
"cell_type": "markdown",
"id": "448e9822-947a-49ed-b80b-1485c60a218b",
"metadata": {
"tags": []
},
"source": [
"___\n",
"\n",
"#### Tyrosine signaling"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "aac8694c-1c90-40b5-870e-ff1e41fb8527",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PD98059 3763\n",
"AG-490 3533\n",
"Motesanib 3363\n",
"TGX-221 3358\n",
"Ki8751 3347\n",
" ... \n",
"Fedratinib 0\n",
"Filgotinib 0\n",
"Flavopiridol 0\n",
"Fluorouracil 0\n",
"control 0\n",
"Name: condition, Length: 188, dtype: int64"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adata_sciplex.obs.loc[adata_sciplex.obs.pathway_level_1.isin([\"Tyrosine kinase signaling\"]),'condition'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "cbce2aca-0b65-456c-bf26-f01a982b2e99",
"metadata": {},
"outputs": [],
"source": [
"tyrosine_drugs = adata_sciplex.obs.loc[adata_sciplex.obs.pathway_level_1.isin([\"Tyrosine kinase signaling\"]),'condition'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "a7f03a94-0ee5-4e84-9367-020a0b20988e",
"metadata": {},
"outputs": [],
"source": [
"adata_sciplex.obs['split_tyrosine_ood'] = 'train' \n",
"\n",
"test_idx = sc.pp.subsample(adata_sciplex[adata_sciplex.obs.pathway_level_1.isin([\"Tyrosine kinase signaling\"])], .20, copy=True).obs.index\n",
"adata_sciplex.obs.loc[test_idx, 'split_tyrosine_ood'] = 'test'\n",
"\n",
"adata_sciplex.obs.loc[adata_sciplex.obs.condition.isin([\"Cediranib\", \"Crizotinib\", \"Motesanib\", \"BMS-754807\", \"Nintedanib\"]), 'split_tyrosine_ood'] = 'ood' "
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "a6386e14-7463-4f08-8ea0-d6991b9e3af1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"train 552761\n",
"ood 14880\n",
"test 14136\n",
"Name: split_tyrosine_ood, dtype: int64"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adata_sciplex.obs.split_tyrosine_ood.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "8cc683c9-5fdb-47a1-b057-e357b93442a9",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" condition | \n",
" AC480 | \n",
" AG-490 | \n",
" BMS-536924 | \n",
" BMS-754807 | \n",
" Bosutinib | \n",
" Cediranib | \n",
" Crizotinib | \n",
" Dasatinib | \n",
" Glesatinib?(MGCD265) | \n",
" KW-2449 | \n",
" Ki8751 | \n",
" Lapatinib | \n",
" Linifanib | \n",
" Motesanib | \n",
" Nilotinib | \n",
" Nintedanib | \n",
" PD173074 | \n",
" PD98059 | \n",
" Pelitinib | \n",
" Regorafenib | \n",
" Rigosertib | \n",
" SL-327 | \n",
" Sorafenib | \n",
" TAK-901 | \n",
" TGX-221 | \n",
" Temsirolimus | \n",
" Tie2 | \n",
" Trametinib | \n",
" Vandetanib | \n",
"
\n",
" \n",
" split_tyrosine_ood | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" ood | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2676 | \n",
" 0 | \n",
" 3060 | \n",
" 2786 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 3363 | \n",
" 0 | \n",
" 2995 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" test | \n",
" 645 | \n",
" 728 | \n",
" 582 | \n",
" 0 | \n",
" 491 | \n",
" 0 | \n",
" 0 | \n",
" 491 | \n",
" 656 | \n",
" 580 | \n",
" 641 | \n",
" 603 | \n",
" 678 | \n",
" 0 | \n",
" 639 | \n",
" 0 | \n",
" 702 | \n",
" 723 | \n",
" 620 | \n",
" 502 | \n",
" 377 | \n",
" 678 | \n",
" 658 | \n",
" 419 | \n",
" 620 | \n",
" 453 | \n",
" 647 | \n",
" 443 | \n",
" 560 | \n",
"
\n",
" \n",
" train | \n",
" 2597 | \n",
" 2805 | \n",
" 2318 | \n",
" 0 | \n",
" 1945 | \n",
" 0 | \n",
" 0 | \n",
" 2047 | \n",
" 2527 | \n",
" 2452 | \n",
" 2706 | \n",
" 2435 | \n",
" 2487 | \n",
" 0 | \n",
" 2448 | \n",
" 0 | \n",
" 2588 | \n",
" 3040 | \n",
" 2306 | \n",
" 2182 | \n",
" 1562 | \n",
" 2521 | \n",
" 2413 | \n",
" 1649 | \n",
" 2738 | \n",
" 1780 | \n",
" 2616 | \n",
" 2031 | \n",
" 2294 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"condition AC480 AG-490 BMS-536924 BMS-754807 Bosutinib \\\n",
"split_tyrosine_ood \n",
"ood 0 0 0 2676 0 \n",
"test 645 728 582 0 491 \n",
"train 2597 2805 2318 0 1945 \n",
"\n",
"condition Cediranib Crizotinib Dasatinib Glesatinib?(MGCD265) \\\n",
"split_tyrosine_ood \n",
"ood 3060 2786 0 0 \n",
"test 0 0 491 656 \n",
"train 0 0 2047 2527 \n",
"\n",
"condition KW-2449 Ki8751 Lapatinib Linifanib Motesanib \\\n",
"split_tyrosine_ood \n",
"ood 0 0 0 0 3363 \n",
"test 580 641 603 678 0 \n",
"train 2452 2706 2435 2487 0 \n",
"\n",
"condition Nilotinib Nintedanib PD173074 PD98059 Pelitinib \\\n",
"split_tyrosine_ood \n",
"ood 0 2995 0 0 0 \n",
"test 639 0 702 723 620 \n",
"train 2448 0 2588 3040 2306 \n",
"\n",
"condition Regorafenib Rigosertib SL-327 Sorafenib TAK-901 \\\n",
"split_tyrosine_ood \n",
"ood 0 0 0 0 0 \n",
"test 502 377 678 658 419 \n",
"train 2182 1562 2521 2413 1649 \n",
"\n",
"condition TGX-221 Temsirolimus Tie2 Trametinib Vandetanib \n",
"split_tyrosine_ood \n",
"ood 0 0 0 0 0 \n",
"test 620 453 647 443 560 \n",
"train 2738 1780 2616 2031 2294 "
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(adata_sciplex.obs.split_tyrosine_ood, adata_sciplex.obs['condition'][adata_sciplex.obs.condition.isin(tyrosine_drugs)])"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "2fa637e1-a444-4235-8c5d-0c8b5acc1b9a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" dose_val | \n",
" 0.001 | \n",
" 0.010 | \n",
" 0.100 | \n",
" 1.000 | \n",
"
\n",
" \n",
" split_tyrosine_ood | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" ood | \n",
" 4226 | \n",
" 4118 | \n",
" 3822 | \n",
" 2714 | \n",
"
\n",
" \n",
" test | \n",
" 3928 | \n",
" 3930 | \n",
" 3590 | \n",
" 2688 | \n",
"
\n",
" \n",
" train | \n",
" 144859 | \n",
" 139622 | \n",
" 134416 | \n",
" 133864 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"dose_val 0.001 0.010 0.100 1.000\n",
"split_tyrosine_ood \n",
"ood 4226 4118 3822 2714\n",
"test 3928 3930 3590 2688\n",
"train 144859 139622 134416 133864"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(adata_sciplex.obs.split_tyrosine_ood, adata_sciplex.obs.dose_val)"
]
},
{
"cell_type": "markdown",
"id": "c16410d8-57f6-4958-8aec-db63f5acbfd2",
"metadata": {
"tags": []
},
"source": [
"____\n",
"\n",
"#### Epigenetic regulation"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "226d2855-8739-4bf4-bab0-eaac30ffe7b7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RG108 3715\n",
"Tubastatin 3710\n",
"GSK-LSD1 3688\n",
"SRT2104 3687\n",
"Tacedinaline 3664\n",
" ... \n",
"Fulvestrant 0\n",
"G007-LK 0\n",
"GSK1070916 0\n",
"Gandotinib 0\n",
"control 0\n",
"Name: condition, Length: 188, dtype: int64"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adata_sciplex.obs.loc[adata_sciplex.obs.pathway_level_1.isin([\"Epigenetic regulation\"]),'condition'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "bf8532c2-e843-4d6e-87bb-a12dd3333d27",
"metadata": {},
"outputs": [],
"source": [
"epigenetic_drugs = adata_sciplex.obs.loc[adata_sciplex.obs.pathway_level_1.isin([\"Epigenetic regulation\"]),'condition'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "a3548623-3991-49fe-add3-aed28f6a3ee5",
"metadata": {},
"outputs": [],
"source": [
"adata_sciplex.obs['split_epigenetic_ood'] = 'train' \n",
"\n",
"test_idx = sc.pp.subsample(adata_sciplex[adata_sciplex.obs.pathway_level_1.isin([\"Epigenetic regulation\"])], .20, copy=True).obs.index\n",
"adata_sciplex.obs.loc[test_idx, 'split_epigenetic_ood'] = 'test'\n",
"\n",
"adata_sciplex.obs.loc[adata_sciplex.obs.condition.isin([\"Azacitidine\", \"Pracinostat\", \"Trichostatin\", \"Quisinostat\", \"Tazemetostat\"]), 'split_epigenetic_ood'] = 'ood' "
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "fed7945c-8e2b-44a8-860e-38fea4fac1b4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"train 540070\n",
"test 26538\n",
"ood 15169\n",
"Name: split_epigenetic_ood, dtype: int64"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adata_sciplex.obs.split_epigenetic_ood.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "0474d068-1bb1-4a9a-b8f3-f6661054f30b",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" condition | \n",
" JQ1 | \n",
" A-366 | \n",
" AR-42 | \n",
" Abexinostat | \n",
" Anacardic | \n",
" Azacitidine | \n",
" BRD4770 | \n",
" Belinostat | \n",
" CUDC-101 | \n",
" CUDC-907 | \n",
" Dacinostat | \n",
" Decitabine | \n",
" Divalproex | \n",
" Droxinostat | \n",
" EED226 | \n",
" Entinostat | \n",
" GSK | \n",
" GSK-LSD1 | \n",
" Givinostat | \n",
" ITSA-1 | \n",
" M344 | \n",
" MC1568 | \n",
" Mocetinostat | \n",
" PCI-34051 | \n",
" PFI-1 | \n",
" Panobinostat | \n",
" Pracinostat | \n",
" Quisinostat | \n",
" RG108 | \n",
" Resminostat | \n",
" Resveratrol | \n",
" SRT1720 | \n",
" SRT2104 | \n",
" SRT3025 | \n",
" Selisistat | \n",
" Sirtinol | \n",
" Sodium | \n",
" TMP195 | \n",
" Tacedinaline | \n",
" Tazemetostat | \n",
" Trichostatin | \n",
" Tubastatin | \n",
" Tucidinostat | \n",
" UNC0379 | \n",
" UNC0631 | \n",
" UNC1999 | \n",
" Valproic | \n",
"
\n",
" \n",
" split_epigenetic_ood | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" ood | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 3151 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2942 | \n",
" 2354 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 3639 | \n",
" 3083 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" test | \n",
" 625 | \n",
" 645 | \n",
" 623 | \n",
" 582 | \n",
" 728 | \n",
" 0 | \n",
" 743 | \n",
" 581 | \n",
" 661 | \n",
" 519 | \n",
" 518 | \n",
" 491 | \n",
" 647 | \n",
" 652 | \n",
" 645 | \n",
" 716 | \n",
" 690 | \n",
" 686 | \n",
" 631 | \n",
" 544 | \n",
" 611 | \n",
" 655 | \n",
" 385 | \n",
" 591 | \n",
" 618 | \n",
" 517 | \n",
" 0 | \n",
" 0 | \n",
" 701 | \n",
" 649 | \n",
" 655 | \n",
" 583 | \n",
" 779 | \n",
" 605 | \n",
" 690 | \n",
" 669 | \n",
" 710 | \n",
" 511 | \n",
" 747 | \n",
" 0 | \n",
" 0 | \n",
" 718 | \n",
" 453 | \n",
" 686 | \n",
" 664 | \n",
" 686 | \n",
" 728 | \n",
"
\n",
" \n",
" train | \n",
" 2412 | \n",
" 2751 | \n",
" 2278 | \n",
" 2331 | \n",
" 2876 | \n",
" 0 | \n",
" 2886 | \n",
" 2444 | \n",
" 2548 | \n",
" 1898 | \n",
" 1998 | \n",
" 1866 | \n",
" 2581 | \n",
" 2545 | \n",
" 2624 | \n",
" 2669 | \n",
" 2911 | \n",
" 3002 | \n",
" 2474 | \n",
" 2282 | \n",
" 2543 | \n",
" 2761 | \n",
" 1593 | \n",
" 2350 | \n",
" 2589 | \n",
" 2056 | \n",
" 0 | \n",
" 0 | \n",
" 3014 | \n",
" 2670 | \n",
" 2317 | \n",
" 2487 | \n",
" 2908 | \n",
" 2405 | \n",
" 2684 | \n",
" 2872 | \n",
" 2787 | \n",
" 2067 | \n",
" 2917 | \n",
" 0 | \n",
" 0 | \n",
" 2992 | \n",
" 1800 | \n",
" 2595 | \n",
" 2890 | \n",
" 2683 | \n",
" 2812 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"condition JQ1 A-366 AR-42 Abexinostat Anacardic Azacitidine \\\n",
"split_epigenetic_ood \n",
"ood 0 0 0 0 0 3151 \n",
"test 625 645 623 582 728 0 \n",
"train 2412 2751 2278 2331 2876 0 \n",
"\n",
"condition BRD4770 Belinostat CUDC-101 CUDC-907 Dacinostat \\\n",
"split_epigenetic_ood \n",
"ood 0 0 0 0 0 \n",
"test 743 581 661 519 518 \n",
"train 2886 2444 2548 1898 1998 \n",
"\n",
"condition Decitabine Divalproex Droxinostat EED226 Entinostat \\\n",
"split_epigenetic_ood \n",
"ood 0 0 0 0 0 \n",
"test 491 647 652 645 716 \n",
"train 1866 2581 2545 2624 2669 \n",
"\n",
"condition GSK GSK-LSD1 Givinostat ITSA-1 M344 MC1568 \\\n",
"split_epigenetic_ood \n",
"ood 0 0 0 0 0 0 \n",
"test 690 686 631 544 611 655 \n",
"train 2911 3002 2474 2282 2543 2761 \n",
"\n",
"condition Mocetinostat PCI-34051 PFI-1 Panobinostat \\\n",
"split_epigenetic_ood \n",
"ood 0 0 0 0 \n",
"test 385 591 618 517 \n",
"train 1593 2350 2589 2056 \n",
"\n",
"condition Pracinostat Quisinostat RG108 Resminostat \\\n",
"split_epigenetic_ood \n",
"ood 2942 2354 0 0 \n",
"test 0 0 701 649 \n",
"train 0 0 3014 2670 \n",
"\n",
"condition Resveratrol SRT1720 SRT2104 SRT3025 Selisistat \\\n",
"split_epigenetic_ood \n",
"ood 0 0 0 0 0 \n",
"test 655 583 779 605 690 \n",
"train 2317 2487 2908 2405 2684 \n",
"\n",
"condition Sirtinol Sodium TMP195 Tacedinaline Tazemetostat \\\n",
"split_epigenetic_ood \n",
"ood 0 0 0 0 3639 \n",
"test 669 710 511 747 0 \n",
"train 2872 2787 2067 2917 0 \n",
"\n",
"condition Trichostatin Tubastatin Tucidinostat UNC0379 \\\n",
"split_epigenetic_ood \n",
"ood 3083 0 0 0 \n",
"test 0 718 453 686 \n",
"train 0 2992 1800 2595 \n",
"\n",
"condition UNC0631 UNC1999 Valproic \n",
"split_epigenetic_ood \n",
"ood 0 0 0 \n",
"test 664 686 728 \n",
"train 2890 2683 2812 "
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(adata_sciplex.obs.split_epigenetic_ood, adata_sciplex.obs['condition'][adata_sciplex.obs.condition.isin(epigenetic_drugs)])"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "7fbc8c54-82e7-40bb-b61e-c7c0e30c6717",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dose_val 0.001 0.010 0.100 1.000\n",
"split_tyrosine_ood \n",
"ood 4226 4118 3822 2714\n",
"test 3928 3930 3590 2688\n",
"train 144859 139622 134416 133864"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(adata_sciplex.obs.split_tyrosine_ood, adata_sciplex.obs.dose_val)"
]
},
{
"cell_type": "markdown",
"id": "3f5dc02f-c298-46a3-b79d-39b414f87d0f",
"metadata": {
"jp-MarkdownHeadingCollapsed": true,
"tags": []
},
"source": [
"__________\n",
"\n",
"#### Cell cycle regulation"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "b97eade0-5807-41b7-b657-809cb7f9b930",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ENMD-2076 5757\n",
"BMS-265246 3274\n",
"Roscovitine 3254\n",
"Aurora 3036\n",
"MK-5108 3006\n",
" ... \n",
"Fedratinib 0\n",
"Filgotinib 0\n",
"Fluorouracil 0\n",
"Fulvestrant 0\n",
"control 0\n",
"Name: condition, Length: 188, dtype: int64"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adata_sciplex.obs.loc[adata_sciplex.obs.pathway_level_1.isin([\"Cell cycle regulation\"]),'condition'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "04af6f98-298d-4630-9d53-1cd4892ed0d8",
"metadata": {},
"outputs": [],
"source": [
"cell_cycle_drugs = adata_sciplex.obs.loc[adata_sciplex.obs.pathway_level_1.isin([\"Cell cycle regulation\"]),'condition'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "2c47f50d-2178-4122-90b3-bb4202cc9f36",
"metadata": {},
"outputs": [],
"source": [
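"# Default: every cell belongs to the training set\n",
"adata_sciplex.obs['split_cellcycle_ood'] = 'train'\n",
"\n",
"# Hold out a random 20% of the cell-cycle-regulation cells as the test set\n",
"test_idx = sc.pp.subsample(adata_sciplex[adata_sciplex.obs.pathway_level_1.isin([\"Cell cycle regulation\"])], .20, copy=True).obs.index\n",
"adata_sciplex.obs.loc[test_idx, 'split_cellcycle_ood'] = 'test'\n",
"\n",
"# Three drugs are held out entirely as OOD (this overrides any 'test' label assigned above)\n",
"adata_sciplex.obs.loc[adata_sciplex.obs.condition.isin([\"SNS-314\", \"Flavopiridol\", \"Roscovitine\"]), 'split_cellcycle_ood'] = 'ood'"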
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "4845abcb-fbe7-4206-9d6b-19d9a937bd6b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"train 565503\n",
"test 9376\n",
"ood 6898\n",
"Name: split_cellcycle_ood, dtype: int64"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adata_sciplex.obs.split_cellcycle_ood.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "ac569309-1243-4af2-8b06-3b3416a6e4d7",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"condition AMG-900 Alisertib Aurora BMS-265246 Barasertib \\\n",
"split_cellcycle_ood \n",
"ood 0 0 0 0 0 \n",
"test 545 428 616 679 463 \n",
"train 2165 1673 2420 2595 1958 \n",
"\n",
"condition CYC116 Danusertib ENMD-2076 Epothilone Flavopiridol \\\n",
"split_cellcycle_ood \n",
"ood 0 0 0 0 1407 \n",
"test 570 469 1140 230 0 \n",
"train 2381 1927 4617 991 0 \n",
"\n",
"condition GSK1070916 Hesperadin JNJ-7706621 MK-5108 MLN8054 \\\n",
"split_cellcycle_ood \n",
"ood 0 0 0 0 0 \n",
"test 512 356 590 590 478 \n",
"train 1990 1593 2398 2416 1866 \n",
"\n",
"condition PHA-680632 Patupilone Roscovitine SNS-314 Tozasertib \\\n",
"split_cellcycle_ood \n",
"ood 0 0 3254 2237 0 \n",
"test 450 290 0 0 424 \n",
"train 1731 1191 0 0 1596 \n",
"\n",
"condition ZM \n",
"split_cellcycle_ood \n",
"ood 0 \n",
"test 546 \n",
"train 2170 "
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(adata_sciplex.obs.split_cellcycle_ood, adata_sciplex.obs['condition'][adata_sciplex.obs.condition.isin(cell_cycle_drugs)])"
]
},
{
"cell_type": "code",
"execution_count": 53,
"id": "b93f29bf-a79f-40aa-87dd-4bd736cc8fa7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dose_val 0.001 0.010 0.100 1.000\n",
"split_cellcycle_ood \n",
"ood 2165 1774 1457 1502\n",
"test 2673 2429 2329 1945\n",
"train 148175 143467 138042 135819"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(adata_sciplex.obs.split_cellcycle_ood, adata_sciplex.obs.dose_val)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"id": "41ba1c76-85a8-4398-a637-5354ac5cfb18",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['split_ho_pathway',\n",
" 'split_tyrosine_ood',\n",
" 'split_epigenetic_ood',\n",
" 'split_cellcycle_ood']"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[c for c in adata_sciplex.obs.columns if 'split' in c]"
]
},
{
"cell_type": "markdown",
"id": "697e1caf-86a1-46e5-af03-76213446dfe2",
"metadata": {
"jp-MarkdownHeadingCollapsed": true,
"tags": []
},
"source": [
"### Further splits\n",
"\n",
"**We omit these splits because we design our own; the code is kept below, commented out, for reference.**\n",
"\n",
"A split that sees all data:"
]
},
{
"cell_type": "code",
"execution_count": 55,
"id": "97654e4f-4801-42df-94e4-b3abc596dff2",
"metadata": {},
"outputs": [],
"source": [
"# adata.obs['split_all'] = 'train'\n",
"# test_idx = sc.pp.subsample(adata, .10, copy=True).obs.index\n",
"# adata.obs.loc[test_idx, 'split_all'] = 'test'"
]
},
{
"cell_type": "code",
"execution_count": 56,
"id": "07984033-dc32-4cad-91c1-650e3a2926e3",
"metadata": {},
"outputs": [],
"source": [
"# adata.obs['ct_dose'] = adata.obs.cell_type.astype('str') + '_' + adata.obs.dose_val.astype('str')"
]
},
{
"cell_type": "markdown",
"id": "ef98d16d-aa0e-4b39-9675-c7b0132510c9",
"metadata": {},
"source": [
"Round-robin splits: each combination of dose and cell line is held out in turn."
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "4492d665-b265-4c8c-9251-6d7598551116",
"metadata": {},
"outputs": [],
"source": [
"# i = 0\n",
"# split_dict = {}"
]
},
{
"cell_type": "code",
"execution_count": 58,
"id": "84f59288-4925-4d9c-92e3-f2f0611910da",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# # single ct holdout\n",
"# for ct in adata.obs.cell_type.unique():\n",
"# for dose in adata.obs.dose_val.unique():\n",
"# i += 1\n",
"# split_name = f'split{i}'\n",
"# split_dict[split_name] = f'{ct}_{dose}'\n",
" \n",
"# adata.obs[split_name] = 'train'\n",
"# adata.obs.loc[adata.obs.ct_dose == f'{ct}_{dose}', split_name] = 'ood'\n",
" \n",
"# test_idx = sc.pp.subsample(adata[adata.obs[split_name] != 'ood'], .16, copy=True).obs.index\n",
"# adata.obs.loc[test_idx, split_name] = 'test'\n",
" \n",
"# display(adata.obs[split_name].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "23d5c2bd-ebcc-4da8-a0b8-c04fab040c44",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# # double ct holdout\n",
"# for cts in [('A549', 'MCF7'), ('A549', 'K562'), ('MCF7', 'K562')]:\n",
"# for dose in adata.obs.dose_val.unique():\n",
"# i += 1\n",
"# split_name = f'split{i}'\n",
"# split_dict[split_name] = f'{cts[0]}+{cts[1]}_{dose}'\n",
" \n",
"# adata.obs[split_name] = 'train'\n",
"# adata.obs.loc[adata.obs.ct_dose == f'{cts[0]}_{dose}', split_name] = 'ood'\n",
"# adata.obs.loc[adata.obs.ct_dose == f'{cts[1]}_{dose}', split_name] = 'ood'\n",
" \n",
"# test_idx = sc.pp.subsample(adata[adata.obs[split_name] != 'ood'], .16, copy=True).obs.index\n",
"# adata.obs.loc[test_idx, split_name] = 'test'\n",
" \n",
"# display(adata.obs[split_name].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "e722a203-eeba-4e85-a542-33d8783afec7",
"metadata": {},
"outputs": [],
"source": [
"# # triple ct holdout\n",
"# for dose in adata.obs.dose_val.unique():\n",
"# i += 1\n",
"# split_name = f'split{i}'\n",
"\n",
"# split_dict[split_name] = f'all_{dose}'\n",
"# adata.obs[split_name] = 'train'\n",
"# adata.obs.loc[adata.obs.dose_val == dose, split_name] = 'ood'\n",
"\n",
"# test_idx = sc.pp.subsample(adata[adata.obs[split_name] != 'ood'], .16, copy=True).obs.index\n",
"# adata.obs.loc[test_idx, split_name] = 'test'\n",
"\n",
"# display(adata.obs[split_name].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "34f21c22-1979-484a-92bb-9f6fc8b71fd2",
"metadata": {},
"outputs": [],
"source": [
"# adata.uns['all_DEGs']"
]
},
{
"cell_type": "markdown",
"id": "615fa85a-417d-4c37-8530-f0129132fc4f",
"metadata": {
"tags": []
},
"source": [
"## Save adata"
]
},
{
"cell_type": "markdown",
"id": "319f177a-549f-4424-a56e-51af6535c48e",
"metadata": {},
"source": [
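"Reorder the SciPlex genes so that the genes shared with LINCS come first, in LINCS order, followed by the remaining SciPlex genes."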
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "1353837b-9cf8-46f0-9529-1340ba033f2f",
"metadata": {},
"outputs": [],
"source": [
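"sciplex_ids = pd.Index(adata_sciplex.var.gene_id)\n",
"\n",
"# Positions in the SciPlex gene index of the LINCS genes that also occur in SciPlex, in LINCS order\n",
"lincs_idx = [sciplex_ids.get_loc(_id) for _id in adata_lincs.var.gene_id[adata_lincs.var.in_sciplex]]"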
]
},
{
"cell_type": "code",
"execution_count": 63,
"id": "92990709-882a-4ed6-b216-df893b4dcea2",
"metadata": {},
"outputs": [],
"source": [
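"# Positions of the SciPlex genes that are absent from LINCS (set lookup avoids a scan per gene)\n",
"lincs_gene_ids = set(adata_lincs.var.gene_id)\n",
"non_lincs_idx = [sciplex_ids.get_loc(_id) for _id in adata_sciplex.var.gene_id if _id not in lincs_gene_ids]\n",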
"\n",
"lincs_idx.extend(non_lincs_idx)"
]
},
{
"cell_type": "code",
"execution_count": 64,
"id": "edda556e-1ae9-47ad-906f-bf12d94dccef",
"metadata": {},
"outputs": [],
"source": [
"adata_sciplex = adata_sciplex[:, lincs_idx].copy()"
]
},
{
"cell_type": "code",
"execution_count": 65,
"id": "eea089c3-f96a-420c-af98-576d3a34bd1c",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"fname = PROJECT_DIR/'datasets'/'sciplex3_matched_genes_lincs.h5ad'\n",
"\n",
"sc.write(fname, adata_sciplex)"
]
},
{
"cell_type": "markdown",
"id": "44fb969d-2a45-41a3-b138-eda1ea8e7238",
"metadata": {},
"source": [
"Check that it worked"
]
},
{
"cell_type": "code",
"execution_count": 66,
"id": "582e2283-6100-4de2-995b-b7befddd0a92",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"AnnData object with n_obs × n_vars = 581777 × 2000\n",
" obs: 'cell_type', 'dose', 'dose_character', 'dose_pattern', 'g1s_score', 'g2m_score', 'pathway', 'pathway_level_1', 'pathway_level_2', 'product_dose', 'product_name', 'proliferation_index', 'replicate', 'size_factor', 'target', 'vehicle', 'batch', 'n_counts', 'dose_val', 'condition', 'drug_dose_name', 'cov_drug_dose_name', 'cov_drug', 'control', 'split_ho_pathway', 'split_tyrosine_ood', 'split_epigenetic_ood', 'split_cellcycle_ood'\n",
" var: 'id', 'num_cells_expressed-0-0', 'num_cells_expressed-1-0', 'num_cells_expressed-1', 'gene_id', 'in_lincs', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'\n",
" uns: 'all_DEGs', 'hvg', 'lincs_DEGs', 'log1p'"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sc.read(fname)"
]
},
{
"cell_type": "markdown",
"id": "bdd00494-7d01-41c9-9a05-cf2921e68393",
"metadata": {},
"source": [
"## Subselect to shared genes only"
]
},
{
"cell_type": "markdown",
"id": "425a4946-12a1-42c3-ab11-41673606be1e",
"metadata": {},
"source": [
"Subset to shared genes"
]
},
{
"cell_type": "code",
"execution_count": 67,
"id": "a0005d9c-42e1-4ec1-b7f5-fa9d96037d5d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"adata_lincs = adata_lincs[:, adata_lincs.var.in_sciplex].copy() "
]
},
{
"cell_type": "code",
"execution_count": 68,
"id": "1b0831d3-c6c2-4dbd-b137-8ca27e4e0e52",
"metadata": {},
"outputs": [],
"source": [
"adata_sciplex = adata_sciplex[:, adata_sciplex.var.in_lincs].copy()"
]
},
{
"cell_type": "code",
"execution_count": 69,
"id": "e0fcb5f4-46a1-4f4b-ab41-7cefdc26abfe",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['DDR1', 'PAX8', 'RPS5', 'ABCF1', 'SPAG7', 'RHOA', 'RNPS1', 'SMNDC1',\n",
" 'ATP6V0B', 'RPS6',\n",
" ...\n",
" 'P4HTM', 'SLC27A3', 'TBXA2R', 'RTN2', 'TSTA3', 'PPARD', 'GNA11',\n",
" 'WDTC1', 'PLSCR3', 'NPEPL1'],\n",
" dtype='object', length=977)"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adata_lincs.var_names"
]
},
{
"cell_type": "code",
"execution_count": 70,
"id": "8825ee35-daab-4625-93f9-764ced4ef32f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['DDR1', 'PAX8', 'RPS5', 'ABCF1', 'SPAG7', 'RHOA', 'RNPS1', 'SMNDC1',\n",
" 'ATP6V0B', 'RPS6',\n",
" ...\n",
" 'P4HTM', 'SLC27A3', 'TBXA2R', 'RTN2', 'TSTA3', 'PPARD', 'GNA11',\n",
" 'WDTC1', 'PLSCR3', 'NPEPL1'],\n",
" dtype='object', name='index', length=977)"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adata_sciplex.var_names"
]
},
{
"cell_type": "markdown",
"id": "c059ac22-e464-40a9-8021-cbb4d8a10aba",
"metadata": {},
"source": [
"## Save adata objects with shared genes only\n",
"The gene order of the two objects matches, as the `var_names` outputs above show."
]
},
{
"cell_type": "code",
"execution_count": 71,
"id": "4ea8321a-694c-4522-a931-38d51990b5a0",
"metadata": {},
"outputs": [],
"source": [
"fname = PROJECT_DIR/'datasets'/'sciplex3_lincs_genes.h5ad'\n",
"\n",
"sc.write(fname, adata_sciplex)"
]
},
{
"cell_type": "markdown",
"id": "36596257-fb0c-479c-8868-996a25affeae",
"metadata": {},
"source": [
"____"
]
},
{
"cell_type": "code",
"execution_count": 72,
"id": "881fb5d3-1c04-4ecd-8ea3-aeda1e3baf57",
"metadata": {},
"outputs": [],
"source": [
"fname_lincs = PROJECT_DIR/'datasets'/'lincs_full_smiles_sciplex_genes.h5ad'\n",
"\n",
"sc.write(fname_lincs, adata_lincs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89dca192",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"interpreter": {
"hash": "ad25c9354f8cefdf5a943c25e67813a21d2807e3af4d6d0915e47390a83b57ce"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
},
"toc-autonumbering": false
},
"nbformat": 4,
"nbformat_minor": 5
}