# 3 LINCS SCIPLEX GENE MATCHING

**Requires**
* `'lincs_full_smiles.h5ad'`
* `'sciplex_raw_chunk_{i}.h5ad'` with $i \in \{0,1,2,3,4\}$

**Output**
* `'sciplex3_matched_genes_lincs.h5ad'`
* `'sciplex3_lincs_genes.h5ad'` (SciPlex data, restricted to LINCS genes)
* `'lincs_full_smiles_sciplex_genes.h5ad'` (LINCS data, restricted to SciPlex genes)



## Description 

The goal of this notebook is to match and merge genes between the LINCS and SciPlex datasets, resulting in the creation of three new datasets:

### Created datasets

- **`sciplex3_matched_genes_lincs.h5ad`**: Contains **SciPlex observations**. **Genes are limited to the union** of the genes shared between LINCS and SciPlex and the highly variable genes in SciPlex (2000 genes in total).


- **`sciplex3_lincs_genes.h5ad`**: Contains **SciPlex data**, but filtered to include **only the genes that are shared with the LINCS dataset**. (strict intersection, 977 genes)

- **`lincs_full_smiles_sciplex_genes.h5ad`**: Contains **LINCS data**, but filtered to include **only the genes that are shared with the SciPlex dataset**.



To create these datasets, we need to match the genes between the two datasets, which is done as follows:

### Gene Matching

1. **Gene ID Assignment**: SciPlex Ensembl gene IDs are obtained by stripping the version suffix from the `id` column (everything after the first `.`). LINCS gene symbols are mapped to Ensembl gene IDs using a predefined mapping (`symbols_dict.json`) or, if that file is missing, **sfaira**.

2. **Identifying Shared Genes**: We then compute the intersection of the gene IDs (`gene_id`) inside LINCS and SciPlex. Both datasets are then filtered to retain only these shared genes.

3. **Reindexing**: The SciPlex genes are reordered so that the shared genes come first, in the order in which they appear in LINCS. Both datasets then list the shared genes in the same order.
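
The three steps can be illustrated on toy data. The sketch below is not the notebook's code: the two small tables and the `symbol_to_id` mapping are made up for illustration, and the common gene order is taken from LINCS, as in the `lincs_idx` cell further down.

```python
import pandas as pd

# Hypothetical toy inputs (not real notebook data).
sciplex_var = pd.DataFrame(
    {"id": ["ID_A.14", "ID_B.5", "ID_C.12"]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)
lincs_symbols = pd.Index(["GENE_C", "GENE_A", "GENE_X"])
symbol_to_id = {"GENE_A": "ID_A", "GENE_C": "ID_C"}

# Step 1: gene ID assignment.
# SciPlex: drop the version suffix. LINCS: map symbols to IDs (unmapped symbols become NaN).
sciplex_var["gene_id"] = sciplex_var["id"].str.split(".").str[0]
lincs_var = pd.DataFrame(index=lincs_symbols)
lincs_var["gene_id"] = lincs_symbols.map(symbol_to_id)

# Step 2: flag the genes that occur in both tables.
sciplex_var["in_lincs"] = sciplex_var["gene_id"].isin(lincs_var["gene_id"])
lincs_var["in_sciplex"] = lincs_var["gene_id"].isin(sciplex_var["gene_id"])

# Step 3: bring the shared SciPlex genes into the LINCS order.
shared_in_lincs_order = lincs_var.loc[lincs_var["in_sciplex"], "gene_id"].tolist()
sciplex_shared = sciplex_var.set_index("gene_id").loc[shared_in_lincs_order]
```

After step 3, `sciplex_shared` and the filtered LINCS table list the same gene IDs in the same order, which is the property the saved files rely on.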



In [1]:
import os
import sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sfaira
import warnings
os.getcwd()

from chemCPA.paths import DATA_DIR, PROJECT_DIR

pd.set_option('display.max_columns', 100)

root_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(root_dir)
import logging

logging.basicConfig(level=logging.INFO)
from notebook_utils import suppress_output

import scanpy as sc
with suppress_output():
    sc.set_figure_params(dpi=80, frameon=False)
    sc.logging.print_header()
    warnings.filterwarnings('ignore')

# logging.info is visible when running as python script 
if not any('ipykernel' in arg for arg in sys.argv):
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S'
    )



scanpy==1.9.1 anndata==0.8.0 umap==0.5.3 numpy==1.21.6 scipy==1.7.3 pandas==1.3.5 scikit-learn==1.0.2 statsmodels==0.13.2 pynndescent==0.5.6


In [2]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Load data

Load lincs

In [3]:
adata_lincs = sc.read(DATA_DIR/'lincs_full_smiles.h5ad' )

Load sciplex 

In [4]:
from tqdm import tqdm
from chemCPA.paths import DATA_DIR, PROJECT_DIR
from raw_data.datasets import sciplex

# Load and concatenate chunks
adatas_sciplex = []
logging.info("Starting to load in sciplex data")

# Get paths to all sciplex chunks
chunk_paths = sciplex()

# Load chunks with progress bar
for chunk_path in tqdm(chunk_paths, desc="Loading sciplex chunks"):
    tqdm.write(f"Loading {os.path.basename(chunk_path)}")
    adatas_sciplex.append(sc.read(chunk_path))
    
adata_sciplex = adatas_sciplex[0].concatenate(adatas_sciplex[1:])
logging.info("Sciplex data loaded")



Add gene_id to sciplex

In [5]:
adata_sciplex.var['gene_id'] = adata_sciplex.var.id.str.split('.').str[0]
adata_sciplex.var['gene_id'].head()

### Get gene IDs from symbols

Load the symbol-to-ID mapping from `symbols_dict.json`, falling back to a sfaira genome container if the file is missing

In [6]:
import json

try:
    # Load the JSON file with the symbol-to-ID mapping
    with open(DATA_DIR / 'symbols_dict.json') as json_file:
        symbols_dict = json.load(json_file)
except FileNotFoundError:
    logging.info("No symbols_dict.json found, falling back to sfaira")
    genome_container = sfaira.versions.genomes.GenomeContainer(organism="homo_sapiens", release="82")
    symbols_dict = genome_container.symbol_to_id_dict
    # Extend symbols dict with unknown symbol
    symbols_dict.update({'PLSCR3': 'ENSG00000187838'})
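
The load-or-fall-back pattern used in this cell can be tried in isolation. The following is a minimal sketch, assuming a hypothetical helper `load_symbol_map` and a stand-in `toy_fallback` in place of the sfaira lookup; the gene names and IDs in the JSON file are placeholders.

```python
import json
import tempfile
from pathlib import Path


def load_symbol_map(path, fallback):
    """Return the symbol -> gene ID mapping stored at `path`, or `fallback()` if the file is missing."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return fallback()


def toy_fallback():
    # Stand-in for the sfaira lookup; the single entry is the one the notebook adds by hand.
    return {"PLSCR3": "ENSG00000187838"}


with tempfile.TemporaryDirectory() as tmp:
    mapping_path = Path(tmp) / "symbols_dict.json"

    # File does not exist yet: the fallback is used.
    from_fallback = load_symbol_map(mapping_path, toy_fallback)

    # File exists: its content wins and the fallback is never called.
    mapping_path.write_text(json.dumps({"GENE_A": "ID_A"}))
    from_file = load_symbol_map(mapping_path, toy_fallback)
```

Catching only `FileNotFoundError` keeps other problems, such as a malformed JSON file, visible instead of silently switching to the fallback.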

Identify genes that are shared between LINCS and SciPlex (the Trapnell dataset)

In [7]:
# For lincs
adata_lincs.var['gene_id'] = adata_lincs.var_names.map(symbols_dict)
adata_lincs.var['in_sciplex'] = adata_lincs.var.gene_id.isin(adata_sciplex.var.gene_id)

In [8]:
# For sciplex
adata_sciplex.var['in_lincs'] = adata_sciplex.var.gene_id.isin(adata_lincs.var.gene_id)

## Preprocess sciplex dataset

See `sciplex3.ipynb`

The original CPA implementation required subsetting the data because of scaling limitations.   
In this version we expect to be able to handle the full SciPlex dataset.

In [9]:
SUBSET = False

if SUBSET: 
    sc.pp.subsample(adata_sciplex, fraction=0.5, random_state=42)

In [10]:
sc.pp.normalize_per_cell(adata_sciplex)

In [11]:
sc.pp.log1p(adata_sciplex)

In [12]:
sc.pp.highly_variable_genes(adata_sciplex, n_top_genes=1032, subset=False)

### Combine HVG with lincs genes

Union of genes that are considered highly variable and those that are shared with lincs

In [13]:
((adata_sciplex.var.in_lincs) | (adata_sciplex.var.highly_variable)).sum()

2000
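
The size of the union follows from inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|. If exactly 1032 genes are flagged as highly variable and 977 genes are shared with LINCS (the count printed near the end of this notebook), a union of 2000 would imply an overlap of 9 genes; that is an inference from the printed numbers, not a value the notebook reports. A toy sketch with made-up boolean masks:

```python
# Hypothetical per-gene flags for five genes.
in_lincs = [True, True, False, False, True]
highly_variable = [False, True, True, False, False]

# Element-wise OR gives the union mask, element-wise AND the overlap.
union_mask = [a or b for a, b in zip(in_lincs, highly_variable)]
overlap_mask = [a and b for a, b in zip(in_lincs, highly_variable)]

n_union = sum(union_mask)
n_overlap = sum(overlap_mask)

# Inclusion-exclusion ties the three counts together.
assert n_union == sum(in_lincs) + sum(highly_variable) - n_overlap
```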

Subset to that union of genes

In [14]:
adata_sciplex = adata_sciplex[:, (adata_sciplex.var.in_lincs) | (adata_sciplex.var.highly_variable)].copy()

### Create additional meta data 

Normalise dose values

In [15]:
adata_sciplex.obs['dose_val'] = adata_sciplex.obs.dose.astype(float) / np.max(adata_sciplex.obs.dose.astype(float))
adata_sciplex.obs.loc[adata_sciplex.obs['product_name'].str.contains('Vehicle'), 'dose_val'] = 1.0

In [16]:
adata_sciplex.obs['dose_val'].value_counts()

0.001    153013
0.010    147670
0.100    141828
1.000    139266
Name: dose_val, dtype: int64
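
The normalisation divides every dose by the largest dose and then overwrites the vehicle rows with `1.0`. A minimal sketch in plain Python, assuming hypothetical raw doses of 10 to 10000 and a vehicle dose of 0:

```python
# Hypothetical (product_name, raw dose) pairs; the values are illustrative only.
raw = [
    ("Vehicle", 0.0),
    ("Azacitidine", 10.0),
    ("Azacitidine", 100.0),
    ("Azacitidine", 1000.0),
    ("Azacitidine", 10000.0),
]

max_dose = max(dose for _, dose in raw)

# Scale to (0, 1]; vehicle rows are set to 1.0 regardless of their raw dose.
dose_val = [1.0 if "Vehicle" in name else dose / max_dose for name, dose in raw]
```

Because vehicle cells end up with the same `dose_val` as the highest dose, the count for `1.000` above includes them, and they have to be told apart by name rather than by dose.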

Change `product_name`

In [17]:
adata_sciplex.obs['product_name'] = [x.split(' ')[0] for x in adata_sciplex.obs['product_name']]
adata_sciplex.obs.loc[adata_sciplex.obs['product_name'].str.contains('Vehicle'), 'product_name'] = 'control'

Create copy of `product_name` with column name `condition`

In [18]:
adata_sciplex.obs['condition'] = adata_sciplex.obs.product_name.copy()

Add combinations of drug (`condition`), dose (`dose_val`), and cell_type (`cell_type`)

In [19]:
# make column of dataframe to categorical 
adata_sciplex.obs["condition"] = adata_sciplex.obs["condition"].astype('category').cat.rename_categories({"(+)-JQ1": "JQ1"})
adata_sciplex.obs['drug_dose_name'] = adata_sciplex.obs.condition.astype(str) + '_' + adata_sciplex.obs.dose_val.astype(str)
adata_sciplex.obs['cov_drug_dose_name'] = adata_sciplex.obs.cell_type.astype(str) + '_' + adata_sciplex.obs.drug_dose_name.astype(str)
adata_sciplex.obs['cov_drug'] = adata_sciplex.obs.cell_type.astype(str) + '_' + adata_sciplex.obs.condition.astype(str)

Add a `control` column with value `1` where only the vehicle was used

In [20]:
adata_sciplex.obs['control'] = [1 if x == 'control_1.0' else 0 for x in adata_sciplex.obs.drug_dose_name.values]
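
The composite keys are plain string concatenations, and the control flag is a string comparison against `'control_1.0'`. A small sketch with three made-up cells shows the resulting values:

```python
# Hypothetical cells; only the three columns used for the keys are included.
cells = [
    {"cell_type": "A549", "condition": "control", "dose_val": 1.0},
    {"cell_type": "A549", "condition": "JQ1", "dose_val": 0.1},
    {"cell_type": "K562", "condition": "JQ1", "dose_val": 1.0},
]

for cell in cells:
    cell["drug_dose_name"] = f"{cell['condition']}_{cell['dose_val']}"
    cell["cov_drug_dose_name"] = f"{cell['cell_type']}_{cell['drug_dose_name']}"
    cell["cov_drug"] = f"{cell['cell_type']}_{cell['condition']}"
    # Only vehicle cells (renamed to 'control', dose_val forced to 1.0) match this key.
    cell["control"] = 1 if cell["drug_dose_name"] == "control_1.0" else 0
```

The comparison depends on the float `1.0` being rendered as the string `"1.0"`, so any change to how `dose_val` is formatted would also have to be reflected in the key.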

## Compute DE genes

In [21]:
from chemCPA.helper import rank_genes_groups_by_cov

rank_genes_groups_by_cov(adata_sciplex, groupby='cov_drug', covariate='cell_type', control_group='control', key_added='all_DEGs')

A549
MCF7
K562


In [22]:
adata_subset = adata_sciplex[:, adata_sciplex.var.in_lincs].copy()
rank_genes_groups_by_cov(adata_subset, groupby='cov_drug', covariate='cell_type', control_group='control', key_added='lincs_DEGs')
adata_sciplex.uns['lincs_DEGs'] = adata_subset.uns['lincs_DEGs']

A549
MCF7
K562


### Map all unique `cov_drug_dose_name` to the computed DEGs, independent of the dose value

Create mapping between names with dose and without dose

In [23]:
cov_drug_dose_unique = adata_sciplex.obs.cov_drug_dose_name.unique()

In [24]:
remove_dose = lambda s: '_'.join(s.split('_')[:-1])
cov_drug = pd.Series(cov_drug_dose_unique).apply(remove_dose)
dose_no_dose_dict = dict(zip(cov_drug_dose_unique, cov_drug))
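
`remove_dose` drops everything after the last underscore, so underscores inside a cell type or drug name are preserved. A short sketch with made-up keys, including a check that it agrees with `str.rsplit`:

```python
def remove_dose(name):
    """Drop the trailing '_<dose>' part of a cov_drug_dose_name."""
    return "_".join(name.split("_")[:-1])


# Hypothetical keys in the '<cell_type>_<drug>_<dose>' format.
names = ["A549_Azacitidine_0.1", "MCF7_control_1.0", "K562_SNS-314_0.001"]
dose_no_dose = {name: remove_dose(name) for name in names}

# Splitting once from the right is an equivalent formulation.
assert all(remove_dose(name) == name.rsplit("_", 1)[0] for name in names)
```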

### Compute new dicts for DEGs

In [25]:
uns_keys = ['all_DEGs', 'lincs_DEGs']

In [26]:
for uns_key in uns_keys:
    new_DEGs_dict = {}

    df_DEGs = pd.Series(adata_sciplex.uns[uns_key])

    for key, value in dose_no_dose_dict.items():
        if 'control' in key:
            continue
        new_DEGs_dict[key] = df_DEGs.loc[value]
    adata_sciplex.uns[uns_key] = new_DEGs_dict
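
The loop re-keys the DEG dictionary from `cov_drug` to `cov_drug_dose_name`, so every dose of a drug points to the same gene list, and control entries are skipped. A toy sketch with hypothetical gene lists:

```python
# Hypothetical DEG lists keyed by '<cell_type>_<drug>' (no dose).
degs_by_cov_drug = {
    "A549_JQ1": ["GENE_A", "GENE_B"],
    "A549_control": [],
}

# Mapping from keys with dose to keys without dose.
dose_no_dose = {
    "A549_JQ1_0.1": "A549_JQ1",
    "A549_JQ1_1.0": "A549_JQ1",
    "A549_control_1.0": "A549_control",
}

# Every dose-specific key receives the dose-independent list; controls are dropped.
degs_by_cov_drug_dose = {
    key: degs_by_cov_drug[value]
    for key, value in dose_no_dose.items()
    if "control" not in key
}
```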

In [27]:
adata_sciplex

AnnData object with n_obs × n_vars = 581777 × 2000
    obs: 'cell_type', 'dose', 'dose_character', 'dose_pattern', 'g1s_score', 'g2m_score', 'pathway', 'pathway_level_1', 'pathway_level_2', 'product_dose', 'product_name', 'proliferation_index', 'replicate', 'size_factor', 'target', 'vehicle', 'batch', 'n_counts', 'dose_val', 'condition', 'drug_dose_name', 'cov_drug_dose_name', 'cov_drug', 'control'
    var: 'id', 'num_cells_expressed-0-0', 'num_cells_expressed-1-0', 'num_cells_expressed-1', 'gene_id', 'in_lincs', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg', 'all_DEGs', 'lincs_DEGs'

## Create sciplex splits

This is not the right configuration for the experiments we want, but it is acceptable for the moment.

### OOD in Pathways

In [28]:
adata_sciplex.obs['split_ho_pathway'] = 'train'  # reset

ho_drugs = [
    # selection of drugs from various pathways
    "Azacitidine",
    "Carmofur",
    "Pracinostat",
    "Cediranib",
    "Luminespib",
    "Crizotinib",
    "SNS-314",
    "Obatoclax",
    "Momelotinib",
    "AG-14361",
    "Entacapone",
    "Fulvestrant",
    "Mesna",
    "Zileuton",
    "Enzastaurin",
    "IOX2",
    "Alvespimycin",
    "XAV-939",
    "Fasudil",
]

ho_drug_pathway = adata_sciplex.obs['condition'].isin(ho_drugs)
adata_sciplex.obs.loc[ho_drug_pathway, 'pathway_level_1'].value_counts()

DNA damage & DNA repair                  6640
Epigenetic regulation                    6093
Tyrosine kinase signaling                5846
Protein folding & Protein degradation    3863
Neuronal signaling                       3635
Antioxidant                              3616
HIF signaling                            3501
Metabolic regulation                     3470
Focal adhesion signaling                 3450
Nuclear receptor signaling               3420
JAK/STAT signaling                       3155
Apoptotic regulation                     3141
TGF/BMP signaling                        2794
PKC signaling                            2778
Cell cycle regulation                    2237
Other                                       0
Vehicle                                     0
Name: pathway_level_1, dtype: int64

In [29]:
ho_drug_pathway.sum()

57639

In [30]:
adata_sciplex.obs.loc[ho_drug_pathway & (adata_sciplex.obs['dose_val'] == 1.0), 'split_ho_pathway'] = 'ood'

test_idx = sc.pp.subsample(adata_sciplex[adata_sciplex.obs['split_ho_pathway'] != 'ood'], .15, copy=True).obs.index
adata_sciplex.obs.loc[test_idx, 'split_ho_pathway'] = 'test'
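
The split logic is: held-out drugs at the highest dose become `ood`, then 15% of the remaining cells are drawn at random as `test`, and everything else stays `train`. A minimal sketch in plain Python, assuming made-up drug names and the standard `random` module instead of `sc.pp.subsample`:

```python
import random

rng = random.Random(0)  # fixed seed so the toy split is reproducible

ho_drugs = {"DrugA"}  # hypothetical held-out drug

# 20 cells per (drug, dose) pair plus 20 control cells at dose_val 1.0.
cells = [
    {"condition": drug, "dose_val": dose}
    for drug in ("DrugA", "DrugB")
    for dose in (0.1, 1.0)
    for _ in range(20)
] + [{"condition": "control", "dose_val": 1.0} for _ in range(20)]

# Step 1: held-out drugs at the highest dose are out-of-distribution.
for cell in cells:
    is_ood = cell["condition"] in ho_drugs and cell["dose_val"] == 1.0
    cell["split"] = "ood" if is_ood else "train"

# Step 2: sample 15% of the non-ood cells as the test set.
candidates = [i for i, cell in enumerate(cells) if cell["split"] != "ood"]
for i in rng.sample(candidates, len(candidates) * 15 // 100):
    cells[i]["split"] = "test"

counts = {name: sum(c["split"] == name for c in cells) for name in ("train", "test", "ood")}
```

Lower doses of the held-out drugs remain available for training and testing, which matches the `ood` counts printed below being much smaller than the total number of cells for those drugs.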

In [31]:
pd.crosstab(adata_sciplex.obs.pathway_level_1, adata_sciplex.obs['condition'][adata_sciplex.obs.condition.isin(ho_drugs)])

pathway_level_1 \ condition,AG-14361,Alvespimycin,Azacitidine,Carmofur,Cediranib,Crizotinib,Entacapone,Enzastaurin,Fasudil,Fulvestrant,IOX2,Luminespib,Mesna,Momelotinib,Obatoclax,Pracinostat,SNS-314,XAV-939,Zileuton
Antioxidant,0,0,0,0,0,0,0,0,0,0,0,0,3616,0,0,0,0,0,0
Apoptotic regulation,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3141,0,0,0,0
Cell cycle regulation,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2237,0,0
DNA damage & DNA repair,3401,0,0,3239,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Epigenetic regulation,0,0,3151,0,0,0,0,0,0,0,0,0,0,0,0,2942,0,0,0
Focal adhesion signaling,0,0,0,0,0,0,0,0,3450,0,0,0,0,0,0,0,0,0,0
HIF signaling,0,0,0,0,0,0,0,0,0,0,3501,0,0,0,0,0,0,0,0
JAK/STAT signaling,0,0,0,0,0,0,0,0,0,0,0,0,0,3155,0,0,0,0,0
Metabolic regulation,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3470
Neuronal signaling,0,0,0,0,0,0,3635,0,0,0,0,0,0,0,0,0,0,0,0


In [32]:
adata_sciplex.obs['split_ho_pathway'].value_counts()

train    483951
test      85403
ood       12423
Name: split_ho_pathway, dtype: int64

In [33]:
adata_sciplex[adata_sciplex.obs.split_ho_pathway == 'ood'].obs.condition.value_counts()

Fasudil         966
IOX2            913
Mesna           884
Entacapone      868
Fulvestrant     836
Zileuton        822
Carmofur        767
AG-14361        759
Azacitidine     736
Enzastaurin     694
Pracinostat     658
SNS-314         547
Cediranib       528
Momelotinib     487
XAV-939         479
Crizotinib      464
Luminespib      405
Obatoclax       404
Alvespimycin    206
Name: condition, dtype: int64

In [34]:
adata_sciplex[adata_sciplex.obs.split_ho_pathway == 'test'].obs.condition.value_counts()

control         1964
ENMD-2076        914
RG108            604
GSK-LSD1         596
Altretamine      573
                ... 
Luminespib       236
Patupilone       228
Flavopiridol     207
Epothilone       181
YM155            112
Name: condition, Length: 188, dtype: int64

### OOD drugs in epigenetic regulation, Tyrosine kinase signaling, cell cycle regulation

In [35]:
adata_sciplex.obs['pathway_level_1'].value_counts()

Epigenetic regulation                    147875
Tyrosine kinase signaling                 85503
JAK/STAT signaling                        70922
DNA damage & DNA repair                   60042
Cell cycle regulation                     53952
Other                                     19980
Nuclear receptor signaling                19940
Protein folding & Protein degradation     19191
Metabolic regulation                      17989
Neuronal signaling                        14071
Antioxidant                               13414
Apoptotic regulation                      13141
Vehicle                                   13004
HIF signaling                              9279
PKC signaling                              8804
TGF/BMP signaling                          8774
Focal adhesion signaling                   5896
Name: pathway_level_1, dtype: int64

___

#### Tyrosine kinase signaling

In [36]:
adata_sciplex.obs.loc[adata_sciplex.obs.pathway_level_1.isin(["Tyrosine kinase signaling"]),'condition'].value_counts()

PD98059         3763
AG-490          3533
Motesanib       3363
TGX-221         3358
Ki8751          3347
                ... 
Fedratinib         0
Filgotinib         0
Flavopiridol       0
Fluorouracil       0
control            0
Name: condition, Length: 188, dtype: int64

In [37]:
tyrosine_drugs = adata_sciplex.obs.loc[adata_sciplex.obs.pathway_level_1.isin(["Tyrosine kinase signaling"]),'condition'].unique()

In [38]:
adata_sciplex.obs['split_tyrosine_ood'] = 'train'  

test_idx = sc.pp.subsample(adata_sciplex[adata_sciplex.obs.pathway_level_1.isin(["Tyrosine kinase signaling"])], .20, copy=True).obs.index
adata_sciplex.obs.loc[test_idx, 'split_tyrosine_ood'] = 'test'

adata_sciplex.obs.loc[adata_sciplex.obs.condition.isin(["Cediranib", "Crizotinib", "Motesanib", "BMS-754807", "Nintedanib"]), 'split_tyrosine_ood'] = 'ood'  
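
In this cell the order of the assignments matters: the test cells are sampled from all cells of the pathway first, and the `ood` label is written afterwards, so it overrides any test label given to a held-out drug. A toy sketch of that ordering, assuming made-up drug names and the standard `random` module in place of `sc.pp.subsample`:

```python
import random

rng = random.Random(0)

pathway_drugs = ["DrugA", "DrugB", "DrugC", "DrugD"]  # hypothetical pathway members
ood_drugs = {"DrugD"}                                 # hypothetical held-out drug

# 25 cells per pathway drug plus 50 cells from other pathways.
conditions = [drug for drug in pathway_drugs for _ in range(25)] + ["Other"] * 50
split = ["train"] * len(conditions)

# 1) Sample 20% of the pathway cells (held-out drugs included) as test.
in_pathway = [i for i, cond in enumerate(conditions) if cond in pathway_drugs]
for i in rng.sample(in_pathway, len(in_pathway) * 20 // 100):
    split[i] = "test"

# 2) Mark the held-out drugs as ood; this overwrites earlier test labels.
for i, cond in enumerate(conditions):
    if cond in ood_drugs:
        split[i] = "ood"

n_test = split.count("test")
n_ood = split.count("ood")
```

One consequence is that the final test set can be smaller than 20% of the pathway cells, because sampled cells of held-out drugs are relabelled.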

In [39]:
adata_sciplex.obs.split_tyrosine_ood.value_counts()

train    552761
ood       14880
test      14136
Name: split_tyrosine_ood, dtype: int64

In [40]:
pd.crosstab(adata_sciplex.obs.split_tyrosine_ood, adata_sciplex.obs['condition'][adata_sciplex.obs.condition.isin(tyrosine_drugs)])

split_tyrosine_ood \ condition,AC480,AG-490,BMS-536924,BMS-754807,Bosutinib,Cediranib,Crizotinib,Dasatinib,Glesatinib?(MGCD265),KW-2449,Ki8751,Lapatinib,Linifanib,Motesanib,Nilotinib,Nintedanib,PD173074,PD98059,Pelitinib,Regorafenib,Rigosertib,SL-327,Sorafenib,TAK-901,TGX-221,Temsirolimus,Tie2,Trametinib,Vandetanib
ood,0,0,0,2676,0,3060,2786,0,0,0,0,0,0,3363,0,2995,0,0,0,0,0,0,0,0,0,0,0,0,0
test,645,728,582,0,491,0,0,491,656,580,641,603,678,0,639,0,702,723,620,502,377,678,658,419,620,453,647,443,560
train,2597,2805,2318,0,1945,0,0,2047,2527,2452,2706,2435,2487,0,2448,0,2588,3040,2306,2182,1562,2521,2413,1649,2738,1780,2616,2031,2294


In [41]:
pd.crosstab(adata_sciplex.obs.split_tyrosine_ood, adata_sciplex.obs.dose_val)

split_tyrosine_ood \ dose_val,0.001,0.010,0.100,1.000
ood,4226,4118,3822,2714
test,3928,3930,3590,2688
train,144859,139622,134416,133864


____

#### Epigenetic regulation

In [42]:
adata_sciplex.obs.loc[adata_sciplex.obs.pathway_level_1.isin(["Epigenetic regulation"]),'condition'].value_counts()

RG108           3715
Tubastatin      3710
GSK-LSD1        3688
SRT2104         3687
Tacedinaline    3664
                ... 
Fulvestrant        0
G007-LK            0
GSK1070916         0
Gandotinib         0
control            0
Name: condition, Length: 188, dtype: int64

In [43]:
epigenetic_drugs = adata_sciplex.obs.loc[adata_sciplex.obs.pathway_level_1.isin(["Epigenetic regulation"]),'condition'].unique()

In [44]:
adata_sciplex.obs['split_epigenetic_ood'] = 'train'  

test_idx = sc.pp.subsample(adata_sciplex[adata_sciplex.obs.pathway_level_1.isin(["Epigenetic regulation"])], .20, copy=True).obs.index
adata_sciplex.obs.loc[test_idx, 'split_epigenetic_ood'] = 'test'

adata_sciplex.obs.loc[adata_sciplex.obs.condition.isin(["Azacitidine", "Pracinostat", "Trichostatin", "Quisinostat", "Tazemetostat"]), 'split_epigenetic_ood'] = 'ood'  

In [45]:
adata_sciplex.obs.split_epigenetic_ood.value_counts()

train    540070
test      26538
ood       15169
Name: split_epigenetic_ood, dtype: int64

In [46]:
pd.crosstab(adata_sciplex.obs.split_epigenetic_ood, adata_sciplex.obs['condition'][adata_sciplex.obs.condition.isin(epigenetic_drugs)])

split_epigenetic_ood \ condition,JQ1,A-366,AR-42,Abexinostat,Anacardic,Azacitidine,BRD4770,Belinostat,CUDC-101,CUDC-907,Dacinostat,Decitabine,Divalproex,Droxinostat,EED226,Entinostat,GSK,GSK-LSD1,Givinostat,ITSA-1,M344,MC1568,Mocetinostat,PCI-34051,PFI-1,Panobinostat,Pracinostat,Quisinostat,RG108,Resminostat,Resveratrol,SRT1720,SRT2104,SRT3025,Selisistat,Sirtinol,Sodium,TMP195,Tacedinaline,Tazemetostat,Trichostatin,Tubastatin,Tucidinostat,UNC0379,UNC0631,UNC1999,Valproic
ood,0,0,0,0,0,3151,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2942,2354,0,0,0,0,0,0,0,0,0,0,0,3639,3083,0,0,0,0,0,0
test,625,645,623,582,728,0,743,581,661,519,518,491,647,652,645,716,690,686,631,544,611,655,385,591,618,517,0,0,701,649,655,583,779,605,690,669,710,511,747,0,0,718,453,686,664,686,728
train,2412,2751,2278,2331,2876,0,2886,2444,2548,1898,1998,1866,2581,2545,2624,2669,2911,3002,2474,2282,2543,2761,1593,2350,2589,2056,0,0,3014,2670,2317,2487,2908,2405,2684,2872,2787,2067,2917,0,0,2992,1800,2595,2890,2683,2812


In [47]:
pd.crosstab(adata_sciplex.obs.split_epigenetic_ood, adata_sciplex.obs.dose_val)


__________

#### Cell cycle regulation

In [48]:
adata_sciplex.obs.loc[adata_sciplex.obs.pathway_level_1.isin(["Cell cycle regulation"]),'condition'].value_counts()

ENMD-2076       5757
BMS-265246      3274
Roscovitine     3254
Aurora          3036
MK-5108         3006
                ... 
Fedratinib         0
Filgotinib         0
Fluorouracil       0
Fulvestrant        0
control            0
Name: condition, Length: 188, dtype: int64

In [49]:
cell_cycle_drugs = adata_sciplex.obs.loc[adata_sciplex.obs.pathway_level_1.isin(["Cell cycle regulation"]),'condition'].unique()

In [50]:
adata_sciplex.obs['split_cellcycle_ood'] = 'train'  

test_idx = sc.pp.subsample(adata_sciplex[adata_sciplex.obs.pathway_level_1.isin(["Cell cycle regulation"])], .20, copy=True).obs.index
adata_sciplex.obs.loc[test_idx, 'split_cellcycle_ood'] = 'test'

adata_sciplex.obs.loc[adata_sciplex.obs.condition.isin(["SNS-314", "Flavopiridol", "Roscovitine"]), 'split_cellcycle_ood'] = 'ood'  

In [51]:
adata_sciplex.obs.split_cellcycle_ood.value_counts()

train    565503
test       9376
ood        6898
Name: split_cellcycle_ood, dtype: int64

In [52]:
pd.crosstab(adata_sciplex.obs.split_cellcycle_ood, adata_sciplex.obs['condition'][adata_sciplex.obs.condition.isin(cell_cycle_drugs)])

split_cellcycle_ood \ condition,AMG-900,Alisertib,Aurora,BMS-265246,Barasertib,CYC116,Danusertib,ENMD-2076,Epothilone,Flavopiridol,GSK1070916,Hesperadin,JNJ-7706621,MK-5108,MLN8054,PHA-680632,Patupilone,Roscovitine,SNS-314,Tozasertib,ZM
ood,0,0,0,0,0,0,0,0,0,1407,0,0,0,0,0,0,0,3254,2237,0,0
test,545,428,616,679,463,570,469,1140,230,0,512,356,590,590,478,450,290,0,0,424,546
train,2165,1673,2420,2595,1958,2381,1927,4617,991,0,1990,1593,2398,2416,1866,1731,1191,0,0,1596,2170


In [53]:
pd.crosstab(adata_sciplex.obs.split_cellcycle_ood, adata_sciplex.obs.dose_val)

split_cellcycle_ood \ dose_val,0.001,0.010,0.100,1.000
ood,2165,1774,1457,1502
test,2673,2429,2329,1945
train,148175,143467,138042,135819


In [54]:
[c for c in adata_sciplex.obs.columns if 'split' in c]

['split_ho_pathway',
 'split_tyrosine_ood',
 'split_epigenetic_ood',
 'split_cellcycle_ood']

### Further splits

**We omit these splits because we design our own; for reference, they are kept commented out for the moment.**

Also a split which sees all data:

In [55]:
# adata.obs['split_all'] = 'train'
# test_idx = sc.pp.subsample(adata, .10, copy=True).obs.index
# adata.obs.loc[test_idx, 'split_all'] = 'test'

In [56]:
# adata.obs['ct_dose'] = adata.obs.cell_type.astype('str') + '_' + adata.obs.dose_val.astype('str')

Round robin splits: dose and cell line combinations will be held out in turn.

In [57]:
# i = 0
# split_dict = {}

In [58]:
# # single ct holdout
# for ct in adata.obs.cell_type.unique():
#     for dose in adata.obs.dose_val.unique():
#         i += 1
#         split_name = f'split{i}'
#         split_dict[split_name] = f'{ct}_{dose}'
        
#         adata.obs[split_name] = 'train'
#         adata.obs.loc[adata.obs.ct_dose == f'{ct}_{dose}', split_name] = 'ood'
        
#         test_idx = sc.pp.subsample(adata[adata.obs[split_name] != 'ood'], .16, copy=True).obs.index
#         adata.obs.loc[test_idx, split_name] = 'test'
        
#         display(adata.obs[split_name].value_counts())

In [59]:
# # double ct holdout
# for cts in [('A549', 'MCF7'), ('A549', 'K562'), ('MCF7', 'K562')]:
#     for dose in adata.obs.dose_val.unique():
#         i += 1
#         split_name = f'split{i}'
#         split_dict[split_name] = f'{cts[0]}+{cts[1]}_{dose}'
        
#         adata.obs[split_name] = 'train'
#         adata.obs.loc[adata.obs.ct_dose == f'{cts[0]}_{dose}', split_name] = 'ood'
#         adata.obs.loc[adata.obs.ct_dose == f'{cts[1]}_{dose}', split_name] = 'ood'
        
#         test_idx = sc.pp.subsample(adata[adata.obs[split_name] != 'ood'], .16, copy=True).obs.index
#         adata.obs.loc[test_idx, split_name] = 'test'
        
#         display(adata.obs[split_name].value_counts())

In [60]:
# # triple ct holdout
# for dose in adata.obs.dose_val.unique():
#     i += 1
#     split_name = f'split{i}'

#     split_dict[split_name] = f'all_{dose}'
#     adata.obs[split_name] = 'train'
#     adata.obs.loc[adata.obs.dose_val == dose, split_name] = 'ood'

#     test_idx = sc.pp.subsample(adata[adata.obs[split_name] != 'ood'], .16, copy=True).obs.index
#     adata.obs.loc[test_idx, split_name] = 'test'

#     display(adata.obs[split_name].value_counts())

In [61]:
# adata.uns['all_DEGs']

## Save adata

Reorder the SciPlex genes so that the genes shared with LINCS come first, in LINCS order

In [62]:
sciplex_ids = pd.Index(adata_sciplex.var.gene_id)

lincs_idx = [sciplex_ids.get_loc(_id) for _id in adata_lincs.var.gene_id[adata_lincs.var.in_sciplex]]

In [63]:
lincs_gene_ids = set(adata_lincs.var.gene_id)
non_lincs_idx = [sciplex_ids.get_loc(_id) for _id in adata_sciplex.var.gene_id if _id not in lincs_gene_ids]

lincs_idx.extend(non_lincs_idx)
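
The effect of the two index lists can be seen on a toy example. The sketch below uses made-up gene IDs and only `pd.Index.get_loc`; it assumes the IDs are unique, since `get_loc` returns a single integer position only in that case.

```python
import pandas as pd

# Hypothetical gene IDs: SciPlex order and LINCS order differ, 'g9' exists only in LINCS.
sciplex_ids = pd.Index(["g3", "g1", "g5", "g2", "g4"])
lincs_ids = ["g2", "g9", "g1", "g4"]

# Positions of the shared genes in SciPlex, listed in LINCS order.
shared = [g for g in lincs_ids if g in sciplex_ids]
lincs_idx = [sciplex_ids.get_loc(g) for g in shared]

# Positions of the SciPlex-only genes, kept in their SciPlex order.
lincs_set = set(lincs_ids)
non_lincs_idx = [sciplex_ids.get_loc(g) for g in sciplex_ids if g not in lincs_set]

# Shared genes first (LINCS order), remaining genes afterwards.
order = lincs_idx + non_lincs_idx
reordered = sciplex_ids[order]
```

Indexing the columns of an AnnData object with such a list of positions, as done in the next cell, yields the same ordering for the expression matrix and `var`.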

In [64]:
adata_sciplex = adata_sciplex[:, lincs_idx].copy()

In [65]:
fname = PROJECT_DIR/'datasets'/'sciplex3_matched_genes_lincs.h5ad'

sc.write(fname, adata_sciplex)

Check that it worked

In [66]:
sc.read(fname)

AnnData object with n_obs × n_vars = 581777 × 2000
    obs: 'cell_type', 'dose', 'dose_character', 'dose_pattern', 'g1s_score', 'g2m_score', 'pathway', 'pathway_level_1', 'pathway_level_2', 'product_dose', 'product_name', 'proliferation_index', 'replicate', 'size_factor', 'target', 'vehicle', 'batch', 'n_counts', 'dose_val', 'condition', 'drug_dose_name', 'cov_drug_dose_name', 'cov_drug', 'control', 'split_ho_pathway', 'split_tyrosine_ood', 'split_epigenetic_ood', 'split_cellcycle_ood'
    var: 'id', 'num_cells_expressed-0-0', 'num_cells_expressed-1-0', 'num_cells_expressed-1', 'gene_id', 'in_lincs', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'all_DEGs', 'hvg', 'lincs_DEGs', 'log1p'

## Subselect to shared genes only

Subset to shared genes

In [67]:
adata_lincs = adata_lincs[:, adata_lincs.var.in_sciplex].copy() 

In [68]:
adata_sciplex = adata_sciplex[:, adata_sciplex.var.in_lincs].copy()

In [69]:
adata_lincs.var_names

Index(['DDR1', 'PAX8', 'RPS5', 'ABCF1', 'SPAG7', 'RHOA', 'RNPS1', 'SMNDC1',
       'ATP6V0B', 'RPS6',
       ...
       'P4HTM', 'SLC27A3', 'TBXA2R', 'RTN2', 'TSTA3', 'PPARD', 'GNA11',
       'WDTC1', 'PLSCR3', 'NPEPL1'],
      dtype='object', length=977)

In [70]:
adata_sciplex.var_names

Index(['DDR1', 'PAX8', 'RPS5', 'ABCF1', 'SPAG7', 'RHOA', 'RNPS1', 'SMNDC1',
       'ATP6V0B', 'RPS6',
       ...
       'P4HTM', 'SLC27A3', 'TBXA2R', 'RTN2', 'TSTA3', 'PPARD', 'GNA11',
       'WDTC1', 'PLSCR3', 'NPEPL1'],
      dtype='object', name='index', length=977)
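
The two printed indices agree at the head and tail, but the middle is elided. A full comparison could be sketched as follows; `first_order_mismatch` is a hypothetical helper, not part of the notebook or of scanpy.

```python
def first_order_mismatch(ids_a, ids_b):
    """Return the first position where the two gene lists differ, or None if they are identical."""
    for pos, (a, b) in enumerate(zip(ids_a, ids_b)):
        if a != b:
            return pos
    if len(ids_a) != len(ids_b):
        # One list is a strict prefix of the other.
        return min(len(ids_a), len(ids_b))
    return None


# Toy usage with made-up gene IDs.
same = first_order_mismatch(["g1", "g2", "g3"], ["g1", "g2", "g3"])
swapped = first_order_mismatch(["g1", "g2", "g3"], ["g1", "g3", "g2"])
```

In the notebook, the same check could be applied to `list(adata_lincs.var.gene_id)` and `list(adata_sciplex.var.gene_id)` before saving.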

## Save adata objects with shared genes only
The SciPlex genes were reordered above, so both objects list the shared genes in the same order

In [71]:
fname = PROJECT_DIR/'datasets'/'sciplex3_lincs_genes.h5ad'

sc.write(fname, adata_sciplex)

____

In [72]:
fname_lincs = PROJECT_DIR/'datasets'/'lincs_full_smiles_sciplex_genes.h5ad'

sc.write(fname_lincs, adata_lincs)