Causal AI Scientist: Facilitating Causal Data Science with Large Language Models
Causal AI Scientist (CAIS) is an LLM-powered tool that generates data-driven answers to natural language causal queries. It takes three inputs: a natural language query (for example, "Does participating in a job training program lead to higher income?"), a dataset, and a description of that dataset. CAIS frames a causal estimation problem by selecting appropriate treatment and outcome variables, chooses a suitable estimation method, implements it, runs diagnostic tests, and interprets the numerical results in the context of the original query.
This repo includes instructions on both using the tool to perform causal analysis on a dataset of interest and reproducing results from our paper.
Note: This repository is a work in progress and will be updated with additional instructions and files.
Getting Started
🔧 Environment Installation
Prerequisites:
- Python 3.10 (create a new conda environment first)
- Required Python libraries (specified in requirements.txt)
Step 1: Copy the example configuration
cp .env.example .env
Step 2: Create Python 3.10 environment
# Create a new conda environment with Python 3.10
conda create -n auto_causal python=3.10
conda activate auto_causal
pip install -r requirements.txt
Step 3: Set up the auto_causal library
pip install -e .
Dataset Information
All datasets used to evaluate CAIS and the baseline models are available in the data/ directory. Specifically:
- all_data: Folder containing all CSV files from the QRData and real-world study collections.
- synthetic_data: Folder containing all CSV files corresponding to synthetic datasets.
- qr_info.csv: Metadata for QRData files. For each file, this includes the filename, description, causal query, reference causal effect, intended inference method, and additional remarks.
- real_info.csv: Metadata for the real-world datasets.
- synthetic_info.csv: Metadata for the synthetic datasets.
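To inspect one of these metadata files before running CAIS, a minimal sketch like the one below may help. It uses only the Python standard library and deliberately makes no assumption about the column names in the CSV. The helpers `load_metadata` and `summarize` are hypothetical and are not part of the repository.

```python
import csv
from pathlib import Path


def load_metadata(path):
    """Read a metadata CSV into a list of dictionaries keyed by the header row."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def summarize(rows):
    """Return (row_count, column_names) for the loaded metadata rows."""
    columns = list(rows[0].keys()) if rows else []
    return len(rows), columns


if __name__ == "__main__":
    # Path taken from the dataset list above; adjust it if your checkout differs.
    metadata_file = Path("data/qr_info.csv")
    if metadata_file.exists():
        count, columns = summarize(load_metadata(metadata_file))
        print(f"{count} rows; columns: {columns}")
    else:
        print(f"{metadata_file} not found")
```

Printing the column names first is a cheap way to confirm that the file you pass as the metadata path actually contains the query, description, and filename fields described above.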
Run
To execute CAIS, run
python main/run_cais.py \
--metadata_path {path_to_metadata} \
--data_dir {path_to_data_folder} \
--output_dir {output_folder} \
--output_name {output_filename} \
--llm_name {llm_name}
Args:
- metadata_path (str): Path to the CSV file containing the queries, dataset descriptions, and data file names
- data_dir (str): Path to the folder containing the data in CSV format
- output_dir (str): Path to the folder where the output JSON results will be saved
- output_name (str): Name of the JSON file where the outputs will be saved
- llm_name (str): Name of the LLM to be used (e.g., 'gpt-4', 'claude-3', etc.)
For example:
python main/run_cais.py \
--metadata_path "data/qr_info.csv" \
--data_dir "data/all_data" \
--output_dir "output" \
--output_name "results_qr_4o" \
--llm_name "gpt-4o-mini"
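If you want to run CAIS over several metadata files in one go, the command above can be assembled programmatically. The sketch below is one possible wrapper, not part of the repository: `build_cais_command` and `RUNS` are hypothetical names, the pairing of synthetic_info.csv with the synthetic_data folder is an assumption based on the dataset descriptions, and `results_synthetic_4o` is an illustrative output name. Only the flags documented above are used.

```python
import subprocess
import sys


def build_cais_command(metadata_path, data_dir, output_dir, output_name, llm_name):
    """Assemble the argument list for main/run_cais.py from the documented flags."""
    return [
        sys.executable, "main/run_cais.py",
        "--metadata_path", metadata_path,
        "--data_dir", data_dir,
        "--output_dir", output_dir,
        "--output_name", output_name,
        "--llm_name", llm_name,
    ]


# Hypothetical batch of (metadata file, data folder, output name) triples.
# The second entry assumes synthetic_info.csv describes files in synthetic_data.
RUNS = [
    ("data/qr_info.csv", "data/all_data", "results_qr_4o"),
    ("data/synthetic_info.csv", "data/synthetic_data", "results_synthetic_4o"),
]


def main(dry_run=True):
    for metadata_path, data_dir, output_name in RUNS:
        cmd = build_cais_command(
            metadata_path, data_dir, "output", output_name, "gpt-4o-mini"
        )
        if dry_run:
            # Show the command without executing it.
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if the run exits non-zero.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    main(dry_run=True)
```

The script defaults to a dry run that only prints each command, so you can check the paths before spending API calls; call `main(dry_run=False)` to execute them from the repository root.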
Reproducing paper results
Instructions for reproducing the paper's results will be added soon.
⚠️ Important Notes:
- Keep your .env file secure and never commit it to version control
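One quick way to guard against committing the file by accident is to confirm that .env is listed in .gitignore. The sketch below is a simplified check under stated assumptions: the helper `env_is_ignored` is hypothetical, it only recognizes a few literal patterns, and it does not implement full gitignore semantics such as negation or nested ignore files.

```python
from pathlib import Path

# Literal patterns this simplified check treats as covering a top-level .env file.
ENV_PATTERNS = {".env", "/.env", ".env*"}


def env_is_ignored(gitignore_text):
    """Return True if any non-comment line is one of the known .env patterns."""
    for line in gitignore_text.splitlines():
        pattern = line.strip()
        if not pattern or pattern.startswith("#"):
            continue  # Skip blank lines and comments.
        if pattern in ENV_PATTERNS:
            return True
    return False


if __name__ == "__main__":
    gitignore = Path(".gitignore")
    text = gitignore.read_text(encoding="utf-8") if gitignore.exists() else ""
    if env_is_ignored(text):
        print(".env is listed in .gitignore")
    else:
        print("warning: .env is not listed in .gitignore")
```

For an authoritative answer inside a Git repository, `git check-ignore .env` applies the real matching rules; the script above is only a lightweight pre-check.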
License
Distributed under the MIT License. See LICENSE for more information.