How AI Can Help You Build Better Experiments
For hands-on training in applying AI to experimental design, treatment development, validation, and analysis, join Professor Crabtree for Using AI to Build Better Experiments on July 14–17.
If you run experiments, you know where the time goes.
Most of it goes not to the analysis but to the setup: designing treatments, checking wording, coding responses, and validating measures. That work is slow, repetitive, and, until recently, frustratingly difficult to speed up.
Large language models (LLMs) can now compress the distance between an idea and a testable experiment. They don’t replace your judgment and taste, but they handle the plumbing so that you can spend more time on the design decisions that actually matter. Here are three concrete applications, followed by the question every researcher should be grappling with: how do we use these tools in a way that increases (not decreases) scientific rigor?
1. Generating Experimental Materials at Scale
Consider a standard survey experiment with vignettes that vary by political party and legislator experience. Traditionally you write each version by hand. A 2×2 design means four drafts that differ only in small parts. Every factor you add multiplies the drafting work.
By integrating an LLM into R, you can define the factors and let the code generate every condition. Start with the simplest case:
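library(dplyr)  # mutate()
library(purrr)  # map2_chr()

# call_llm() is assumed to be a helper defined elsewhere that sends a prompt
# to an LLM API and returns the reply as a single string.

# Define factors
party <- c("Democratic", "Republican")    # 2 levels
experience <- c("first-term", "veteran")  # 2 levels

# Create all 2×2 = 4 combinations
factorial_design <- expand.grid(
  party = party,
  experience = experience,
  stringsAsFactors = FALSE  # keep the levels as plain character strings
)

# Generate a vignette for each condition
factorial_vignettes <- factorial_design |>
  mutate(
    vignette = map2_chr(party, experience, ~{
      # .x is the party, .y is the experience level
      prompt <- paste(
        "Write a 100-word vignette about a", .y, .x,
        "senator considering a climate policy bill.",
        "Be neutral and focus on deliberation."
      )
      call_llm(prompt, temp = 0.3)
    })
  )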
Now add a third factor, the framing tone of the vignette, along with a third experience level. The only things that change are the input vectors and the mapping function:
library(dplyr)  # mutate()
library(purrr)  # pmap_chr()

# Add a third factor
party <- c("Democratic", "Republican")                        # 2 levels
experience <- c("first-term", "second-term", "veteran")       # 3 levels
framing <- c("neutral", "optimistic", "cautious", "urgent")   # 4 levels

# Create all 2×3×4 = 24 combinations
factorial_design <- expand.grid(
  party = party,
  experience = experience,
  framing = framing,
  stringsAsFactors = FALSE  # keep the levels as plain character strings
)

# Switch from map2_chr (2 args) to pmap_chr (any number of args)
factorial_vignettes <- factorial_design |>
  mutate(
    vignette = pmap_chr(list(party, experience, framing), ~{
      # ..1 is the party, ..2 the experience level, ..3 the framing tone
      prompt <- paste(
        "Write a 100-word vignette about a", ..2, ..1,
        "senator considering a climate policy bill.",
        paste0("Tone: ", ..3, "."),  # paste0 avoids a stray space before the period
        "Focus on deliberation."
      )
      call_llm(prompt, temp = 0.3)
    })
  )
That’s it. Twenty-four conditions from a handful of input vectors.
Scalability matters for reasons beyond convenience. Experimental results can be sensitive to specific phrasing choices—a problem called wording effects. One solution is to generate multiple stylistic variants for each condition and then randomize across them. That lets you average over those effects in a way that handwritten vignettes rarely allow. Researchers have wanted to do this for years, particularly since Porter and Velez’s Political Analysis paper, but the bottleneck was limited time and the lack of a toolkit.
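The randomization step can be sketched in base R. This is a minimal illustration under stated assumptions, not the pipeline from the paper or the course: the vignette strings are placeholders where LLM-generated variants would go, and the names `n_variants`, `variant_bank`, and `assign_variants()` are hypothetical.

```r
# Sketch: randomize respondents across conditions AND stylistic variants.
# Placeholder strings stand in for LLM-generated vignette text.

n_variants <- 3  # hypothetical number of stylistic variants per condition

# One row per (party, experience, variant) combination: 2 x 2 x 3 = 12 rows
variant_bank <- expand.grid(
  party = c("Democratic", "Republican"),
  experience = c("first-term", "veteran"),
  variant = seq_len(n_variants),
  stringsAsFactors = FALSE
)
variant_bank$vignette <- paste0(
  "[variant ", variant_bank$variant, ": ",
  variant_bank$experience, " ", variant_bank$party, " senator]"
)

# Assign each respondent one row of the bank uniformly at random
assign_variants <- function(n_respondents, bank, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # fixed seed makes the assignment reproducible
  rows <- sample(nrow(bank), n_respondents, replace = TRUE)
  out <- bank[rows, , drop = FALSE]
  rownames(out) <- NULL
  out$respondent_id <- seq_len(n_respondents)
  out
}

assignments <- assign_variants(1200, variant_bank, seed = 42)

# Check balance: respondents per condition, pooled over variants
table(assignments$party, assignments$experience)
```

Because the variant is assigned at random within each condition, comparing conditions while pooling over variants averages out variant-specific wording. Keeping the `variant` column in the data also leaves open the option of modeling it explicitly.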
2. Automated Text Classification with Validity Checks
Open-ended responses are often the most valuable data you collect and the most painful to code. Manual coding is slow, expensive, and requires inter-rater reliability processes that can take longer than the analysis itself.
LLMs can classify text quickly and cheaply. The key technique is few-shot prompting: you provide labeled examples and ask the model to generalize.
# Build a few-shot sentiment prompt: three labeled examples, then the new text
few_shot_prompt <- function(text) {
  paste(
    "Classify the sentiment of political texts as Positive, Negative, or Neutral.",
    "\nExamples:",
    "\nText: 'The new policy will create thousands of jobs.'",
    "Classification: Positive",
    "\nText: 'The scandal has damaged public trust in government.'",
    "Classification: Negative",
    "\nText: 'The committee met to discuss the proposal.'",
    "Classification: Neutral",
    "\nNow classify this text:",
    "\nText:", text,
    "\nClassification:"
  )
}
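A model's reply is free text, so it helps to map each reply onto the allowed label set before analysis. The following is a minimal sketch, not the course's implementation: `normalize_label()`, `classify_texts()`, and `fake_llm()` are hypothetical names, and `fake_llm()` is a stand-in so the example runs without an API. In real use you would pass a prompt builder such as `few_shot_prompt()` and your own model-calling function.

```r
allowed_labels <- c("Positive", "Negative", "Neutral")

# Map a raw model reply onto exactly one allowed label, or NA if the reply
# mentions zero labels or more than one.
normalize_label <- function(raw, labels = allowed_labels) {
  hits <- labels[vapply(
    labels,
    function(l) grepl(l, raw, ignore.case = TRUE),
    logical(1)
  )]
  if (length(hits) == 1) hits else NA_character_
}

# Classify a vector of texts. `prompt_fn` builds the prompt and `llm_fn`
# sends it to a model; both are arguments so the sketch assumes no
# particular API.
classify_texts <- function(texts, prompt_fn, llm_fn) {
  raw <- vapply(texts, function(t) llm_fn(prompt_fn(t)), character(1))
  data.frame(
    text = texts,
    raw_output = unname(raw),
    label = unname(vapply(raw, normalize_label, character(1))),
    stringsAsFactors = FALSE
  )
}

# Stand-in for a real model call, for demonstration only
fake_llm <- function(prompt) {
  if (grepl("jobs", prompt)) " positive.\n" else "Could be Negative or Neutral"
}

results <- classify_texts(
  c("The plan adds jobs.", "The vote was delayed."),
  prompt_fn = function(t) paste("Classify:", t),
  llm_fn = fake_llm
)
results
```

Rows where `label` is `NA` are replies the parser could not resolve. Keeping `raw_output` alongside the label lets a human coder review those cases instead of having them silently dropped.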
With just three examples, LLMs can match or exceed trained human coders on many tasks. That does not mean you can skip validation, and you shouldn't. I advocate checking all four types of construct validity:
- Face validity: do the labels look right to a domain expert?
- Content validity: does the coding scheme cover the full concept?
- Convergent validity: does the LLM agree with human coders on a gold-standard subset?
- Discriminant validity: are classifications independent of things they should not correlate with?
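The convergent validity check can be made concrete with two standard statistics: raw agreement and Cohen's kappa, which corrects agreement for chance as (p_observed - p_expected) / (1 - p_expected). The sketch below uses base R only. The labels are made up for illustration and are not results from any study, and `cohens_kappa()` is a hypothetical helper name.

```r
# Illustrative labels only; not real data
human <- c("Positive", "Negative", "Neutral", "Positive", "Negative",
           "Neutral", "Positive", "Neutral", "Negative", "Positive")
llm   <- c("Positive", "Negative", "Neutral", "Positive", "Neutral",
           "Neutral", "Positive", "Neutral", "Negative", "Negative")

# Cohen's kappa for two coders labeling the same items
cohens_kappa <- function(a, b) {
  stopifnot(length(a) == length(b))
  lv <- sort(unique(c(a, b)))  # shared label set, so the table is square
  tab <- table(factor(a, levels = lv), factor(b, levels = lv))
  p <- tab / sum(tab)
  p_observed <- sum(diag(p))                  # share of items where coders agree
  p_expected <- sum(rowSums(p) * colSums(p))  # agreement expected by chance
  (p_observed - p_expected) / (1 - p_expected)
}

agreement <- mean(human == llm)
kappa_hat <- cohens_kappa(human, llm)

# Confusion matrix: shows which categories the two coders disagree on
table(human = human, llm = llm)
```

The confusion matrix is often more informative than either summary number, because it shows whether disagreements concentrate in one category.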
The LLM replaces labor. It does not replace your expert judgment.
3. Chatbots as Experimental Treatments
The most powerful application of LLMs in experimental work is moving beyond static vignettes entirely. Instead of showing respondents fixed text, you deploy an AI chatbot that engages in real-time conversation: responding to objections, adapting its arguments, and personalizing the treatment for each participant.
In my own research, we have tested whether AI-driven personalized persuasion can shift attitudes on contested topics. The treatment is delivered conversationally in a way that would be impossible with static text. But this is also where the ethical stakes become most serious.
When an AI interacts directly with participants, questions about informed consent, manipulation, and transparency are not peripheral. They are design constraints. They need to be resolved before deployment, not after.
Keeping It Rigorous
AI lowers the cost of generating experimental materials. It does not lower the standards for validating them.
Reproducibility requires documentation. Many LLM APIs do not support user-specified seeds, and where a seed parameter exists it may not guarantee identical outputs. Models get updated without notice. Outputs are stochastic. The practical response is to archive everything: the exact model version, every prompt, every parameter (especially temperature), and every generated output. Treat this as your new replication standard.
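One way to make that archive routine is to log every call as it happens. The following is a minimal base R sketch, not a prescribed standard: `log_llm_call()` is a hypothetical helper, the model string is a made-up placeholder, and the log is written to a temporary directory only so the example runs anywhere.

```r
# Append one record per LLM call to a log data frame
log_llm_call <- function(log, model, prompt, params, output) {
  record <- data.frame(
    timestamp = format(Sys.time(), "%Y-%m-%dT%H:%M:%S%z"),
    model = model,
    prompt = prompt,
    # Flatten the parameter list into one "name=value;name=value" string
    params = paste(names(params), unlist(params), sep = "=", collapse = ";"),
    output = output,
    stringsAsFactors = FALSE
  )
  rbind(log, record)  # rbind() ignores a NULL log on the first call
}

llm_log <- NULL
llm_log <- log_llm_call(
  llm_log,
  model = "example-model-2025-01-01",  # hypothetical version string
  prompt = "Write a 100-word vignette about a veteran Democratic senator.",
  params = list(temp = 0.3, max_tokens = 200),
  output = "[generated text would go here]"
)

# Save the log next to the replication materials (tempdir() used for the demo)
archive_path <- file.path(tempdir(), "llm_log.rds")
saveRDS(llm_log, archive_path)
restored <- readRDS(archive_path)
```

In a real project the log would live in the replication archive rather than a temporary directory, and each generated treatment or classification would carry a reference back to its log row.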
Human review is non-negotiable. Every AI-generated treatment should be read before it reaches a participant. Every classification scheme should be validated against human coding. LLMs produce fluent, confident text—but fluency is not accuracy. Build human checkpoints into every stage of the pipeline.
The Course
In Using AI to Build Better Experiments (July 14–17), we walk through the full experimental lifecycle: generating and validating treatments, building chatbot-based experiments, automating text analysis, conducting power analysis, and drafting preregistrations. The emphasis is practical—you write code, generate materials, and build real workflows.
If you want to integrate AI into your experimental research without sacrificing rigor, I hope you will join us.

