chemCPA / preprocessing /README.md
github-actions[bot]
HF snapshot
a48f0ae

Preprocessing

This folder contains preprocessing notebooks that convert the raw data to datasets that may be used for training.

Description

Briefly:

  1. The first notebook cleans up the LINCS dataset, computes DEGS, and splits.
  2. The second notebook adds the SMILES information to the LINCS dataset.
  3. The third notebook finds matching genes between LINCS and SciPlex-3 datasets, and creates datasets with only subsets of genes that match in some way.
  4. The fourth notebook adds the SMILES information to the SciPlex-3 dataset
  5. The fifth notebook creates various sub-datasets with varying observations and train/test/ood splits
  6. The sixth notebook computes a baseline dataset
  7. The seventh notebook computes embedding files for all datasets

For more details read the notebooks.

Clarifcation on the avaialable datasets

The lincs_full.h5ad as a combination of the available L1000 datasets from phase 1 and phase 2, available here:

  • L1000 Connectivity Map Phase I: GSE70138
  • L1000 Connectivity Map Phase II: GSE92742

We combined the data which is only available in the .gctx together with its metadata, cf. <...>_inst_info.txt, and make it available as scanpy compatible .h5ad object.

Note that not all perturbations correspond to small molecules which is why we subsetted the data to only contain perturbation types trt_cp and ctl_vehicle, resulting in a total of 1034271 observations.

The provided data is normalised.

For the training on the LINCS data, we ignored the treatment time, adata_lincs_full.obs["pert_time"].

Preprocess data

The data preprocessing should run thorugh with the provided files. For the matching of genes between LINCS and the SciPlex-3 data in 3_lincs_sciplex_gene_matching.ipynb, we provide a symbols_dict which replaces the matching via sfaira. Note that you have to execute 4_sciplex_SMILES.ipynb for both gene sets. The same notebook also contains multiple check against the trapnell_final_V7.h5ad file to make sure that the SMILES are correctly matched. You could ignore these and implement your own solution. For this, we provide drug_dict.json file.

Files produced