Causal AI Scientist: Facilitating Causal Data Science with Large Language Models

Causal AI Scientist (CAIS) is an LLM-powered tool for generating data-driven answers to natural language causal queries. It takes three inputs: a natural language query (for example, "Does participating in a job training program lead to higher income?"), an accompanying dataset, and a description of that dataset. CAIS then frames a suitable causal estimation problem by selecting appropriate treatment and outcome variables. It selects a suitable method for causal effect estimation, implements it, runs diagnostic tests, and finally interprets the numerical results in the context of the original query.

This repo includes instructions on both using the tool to perform causal analysis on a dataset of interest and reproducing results from our paper.

Note: This repository is a work in progress and will be updated with additional instructions and files.

Getting Started

🔧 Environment Installation

Prerequisites:

Python 3.10 (create a new conda environment first)
Required Python libraries (specified in requirements.txt)

Step 1: Copy the example configuration

cp .env.example .env

Step 2: Create Python 3.10 environment

# Create a new conda environment with Python 3.10
conda create -n auto_causal python=3.10
conda activate auto_causal
pip install -r requirements.txt

Step 3: Set up the auto_causal library

pip install -e .
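
After the steps above, a quick sanity check can confirm that the active interpreter is Python 3.10 and that the package can be found. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the import name auto_causal is an assumption based on the library name used in this README.

```python
import importlib.util
import sys

REQUIRED_PYTHON = (3, 10)  # version named in the prerequisites


def python_version_ok(version_info=sys.version_info, required=REQUIRED_PYTHON):
    """Return True if the major.minor version matches `required`."""
    return tuple(version_info[:2]) == tuple(required)


def package_installed(name):
    """Return True if the import system can locate a top-level package `name`."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Python 3.10 active:", python_version_ok())
    # Assumption: the editable install exposes a package named "auto_causal".
    print("auto_causal importable:", package_installed("auto_causal"))
```

If either check prints False, re-activate the conda environment and repeat the install steps.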

Dataset Information

All datasets used to evaluate CAIS and the baseline models are available in the data/ directory. Specifically:

all_data: Folder containing all CSV files from the QRData and real-world study collections.
synthetic_data: Folder containing all CSV files corresponding to synthetic datasets.
qr_info.csv: Metadata for QRData files. For each file, this includes the filename, description, causal query, reference causal effect, intended inference method, and additional remarks.
real_info.csv: Metadata for the real-world datasets.
synthetic_info.csv: Metadata for the synthetic datasets.
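
Before running CAIS, it can help to look at what a metadata file actually contains. The sketch below reads a CSV with the standard library and reports its column names and row count. It makes no assumption about the exact column names in qr_info.csv, and the function name summarize_metadata is hypothetical.

```python
import csv


def summarize_metadata(path):
    """Return (column_names, row_count) for a metadata CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # fieldnames is read from the header row; it is None for an empty file.
        columns = list(reader.fieldnames or [])
        row_count = sum(1 for _ in reader)
    return columns, row_count
```

For instance, calling summarize_metadata("data/qr_info.csv") from the repository root should list the metadata fields and the number of entries in the file.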

Run

To execute CAIS, run:

python main/run_cais.py \
    --metadata_path {path_to_metadata} \
    --data_dir {path_to_data_folder} \
    --output_dir {output_folder} \
    --output_name {output_filename} \
    --llm_name {llm_name}

Args:

metadata_path (str): Path to the CSV file containing the queries, dataset descriptions, and data file names
data_dir (str): Path to the folder containing the data in CSV format
output_dir (str): Path to the folder where the output JSON results will be saved
output_name (str): Name of the JSON file where the outputs will be saved
llm_name (str): Name of the LLM to be used (e.g., 'gpt-4', 'claude-3', etc.)

For example:

python main/run_cais.py \
    --metadata_path "data/qr_info.csv" \
    --data_dir "data/all_data" \
    --output_dir "output" \
    --output_name "results_qr_4o" \
    --llm_name "gpt-4o-mini"
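
To run CAIS on several metadata files in one go, the command can be assembled and launched from a small Python script. This is a sketch under stated assumptions: it uses only the flags documented above, the pairing of each metadata file with a data folder is inferred from the Dataset Information section, and the output names and helper functions are hypothetical. It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys


def build_cais_command(metadata_path, data_dir, output_dir, output_name, llm_name):
    """Assemble the run_cais.py invocation as an argument list."""
    return [
        sys.executable, "main/run_cais.py",
        "--metadata_path", metadata_path,
        "--data_dir", data_dir,
        "--output_dir", output_dir,
        "--output_name", output_name,
        "--llm_name", llm_name,
    ]


# (metadata file, data folder, output name); the pairing is an assumption
# inferred from the dataset descriptions, and the output names are made up.
JOBS = [
    ("data/qr_info.csv", "data/all_data", "results_qr"),
    ("data/real_info.csv", "data/all_data", "results_real"),
    ("data/synthetic_info.csv", "data/synthetic_data", "results_synthetic"),
]


def run_all(jobs, llm_name, output_dir="output", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for metadata_path, data_dir, output_name in jobs:
        cmd = build_cais_command(metadata_path, data_dir, output_dir, output_name, llm_name)
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the batch at the first failing run.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(JOBS, llm_name="gpt-4o-mini")
```

Passing dry_run=False from the repository root executes the runs one after another.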

Reproducing paper results

Instructions for reproducing the paper results will be added soon.

⚠️ Important Notes:

Keep your .env file secure and never commit it to version control

License

Distributed under the MIT License. See LICENSE for more information.