## This file contains the functions that can be used to create prompts for generating synthetic data contexts.


def generate_data_summary(df, n_cont_vars, n_bin_vars, method, cutoff=None) -> str:
    """
    Generate a summary of the input dataset. The summary lists the column headings
    for the continuous, binary, treatment, and outcome variables. It also describes the method
    used to generate the dataset and includes a basic statistical summary.

    Args:
        df (pd.DataFrame): The input dataset
        n_cont_vars (int): Number of continuous variables in the dataset
        n_bin_vars (int): Number of binary variables in the dataset
        method (str): The method used to generate the dataset
        cutoff (float, None): The cutoff value for RDD data

    Returns:
        str: Summary of the (raw) dataset.
    """
    # Covariates are named X1..Xn: continuous columns first, then binary columns.
    continuous_vars = [f"X{i}" for i in range(1, n_cont_vars + 1)]
    binary_vars = [f"X{i}" for i in range(n_cont_vars + 1, n_cont_vars + n_bin_vars + 1)]

    information = "The dataset contains the following **continuous covariates**: " + ", ".join(continuous_vars) + ".\n"
    information += "The dataset contains the following **binary covariates**: " + ", ".join(binary_vars) + ".\n"
    information += "The **outcome variable** is Y.\n"
    information += "The **treatment variable** is D.\n"

    # Method-specific notes. Adjacent string literals are used instead of
    # backslash continuations so that no stray indentation ends up in the text.
    if method == "encouragement":
        information += ("This is an encouragement design where Z is the instrument, "
                        "i.e., the initial treatment assignment.\n")
    elif method == "IV":
        information += "This is an IV design where Z is the instrument.\n"
    elif method == "rdd":
        information += f"The running variable is running_X, and the cutoff is {cutoff}.\n"
    elif method == "did_twfe":
        information += ("This is a staggered Difference in Differences design where D indicates "
                        "whether or not the unit is treated at time t. Similarly, year denotes "
                        "the time at which the data was measured.\n")
    elif method == "did_canonical":
        information += ("This is a canonical Difference in Differences design where D indicates "
                        "whether or not the unit is treated at time t. Similarly, post is a binary "
                        "variable indicating pre- or post-intervention time points; "
                        "post = 1 indicates post-intervention time points.\n")

    information += "Here is the statistical summary of the variables:\n" + str(df.describe(include='all')) + "\n"
    return information


def create_prompt(summary, method, domain, history):
    """
    Creates a prompt for the OpenAI API to generate a context for the given dataset.

    Args:
        summary (str): Summary of the dataset
        method (str): The method used to generate the dataset
        domain (str): The domain of the dataset
        history (str): Previous contexts that have been used. We use this to avoid overlap in contexts

    Returns:
        str: The prompt text.
    """
    # Human-readable names for the supported methods. An unknown method raises KeyError.
    method_names = {"encouragement": "Encouragement Design", "did_twfe": "Difference in Differences with Two-Way Fixed Effects",
                    "did_canonical": "Canonical Difference in Differences", "IV": "Instrumental Variable",
                    "multi_rct": "Multi-Treatment Randomized Control Trial", "rdd": "Regression Discontinuity Design",
                    "observational": "Observational", "rct": "Randomized Control Trial", "frontdoor": "Front-Door Causal Inference"}

    # Optional guidance per domain. Domains not listed here get an empty guidance section.
    domain_guides = {
        "education": "Education data often includes student performance, school-level features, socioeconomic background, and intervention types like tutoring or online classes.",
        "healthcare": "Healthcare data may include treatments, diagnoses, hospital visits, recovery outcomes, or demographic details.",
        "labor": "Labor datasets typically include income, education, job type, employment history, and training programs.",
        "policy": "Policy evaluation data may track program participation, regional differences, economic impact, and public outcomes like housing, safety, or benefits."
    }

    prompt = f"""
You are a helpful assistant generating realistic, domain-specific contexts for synthetic datasets.
The current dataset is designed for **{method_names[method]}** studies in the **{domain}** domain.
### Dataset Summary
{summary}
### Previously Used Contexts (avoid duplication)
{history}
### Domain-Specific Guidance
{domain_guides.get(domain, '')}
---
### Your Tasks:
1. Propose a **realistic real-world scenario** that fits a {method_names[method]} study in the {domain} domain. Mention whether the data was collected from a randomized trial, policy rollout, or real-world observation.
2a. Assign **realistic and concise variable names** in snake_case. Map original variable names like `"X1"` to names like `"education_years"`.
2b. Provide a **one-line natural-language description for each variable** (e.g., `education_years: total years of formal schooling completed by the individual.`). Use newline-separated key-value format.
3. Write a **paragraph** describing the dataset's background: who collected it, what was studied, why, and how.
4. Write a **natural language causal question** the dataset could answer. The question should:
- Relate implicitly to the treatment and outcome
- Avoid any statistical or causal terminology
- Avoid naming variables directly
- Feel like it belongs in a news article or report
5. Write a **1-2 sentence summary** capturing the dataset's overall intent and contents.
---
Return your output as a JSON object with the following keys:
- "variable_labels": {{ "X1": "education_years", ... }}
- "description": "<paragraph>"
- "question": "<causal question>"
- "summary": "<summary>"
- "domain": "<domain>"
Return only a valid JSON object. Do not include any markdown, explanations, or extra text.
"""
    return prompt


def filter_question(question):
    """
    Build a prompt that asks the model to remove explicit mentions of variables from the question.

    Args:
        question (str): The original causal query

    Returns:
        str: The prompt asking for the filtered causal query
    """
    # The single {} placeholder is filled with the question via str.format below.
    prompt = """
You are a helpful assistant. Help me filter this causal query.
The query is: {}
The query should not provide information on what variables one needs to consider in the course of causal analysis.
For example,
Bad question: "What is the effect of the training program on job outcomes considering education and experience?"
Good question: "What is the effect of the training program on job outcomes?"
If the question is already filtered, return it as is.
Return only the filtered query. Do not say anything else.
"""
    return prompt.format(question)
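

# --- Illustrative sketch, not part of the original module ---
# create_prompt asks the model to return a JSON object with five keys. A caller
# will probably want to check the reply before using it. The helper below is a
# hypothetical example of such a check: its name, the EXPECTED_KEYS constant, and
# the choice to raise ValueError are assumptions, not something this file defines.
import json

EXPECTED_KEYS = ("variable_labels", "description", "question", "summary", "domain")


def parse_context_response(raw_response):
    """
    Parse the JSON reply requested by create_prompt and check that all keys are present.

    Args:
        raw_response (str): Raw text returned by the model

    Returns:
        dict: The parsed context

    Raises:
        ValueError: If the reply is not a JSON object or lacks an expected key
    """
    try:
        context = json.loads(raw_response)
    except json.JSONDecodeError as err:
        raise ValueError(f"Response is not valid JSON: {err}") from err
    if not isinstance(context, dict):
        raise ValueError("Response must be a JSON object")
    # Report every missing key at once so a retry prompt can mention all of them.
    missing = [key for key in EXPECTED_KEYS if key not in context]
    if missing:
        raise ValueError("Response is missing keys: " + ", ".join(missing))
    return context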