The Black Box Problem with Generative AI Workflows
To learn more about integrating generative AI into reproducible analytical workflows, join Professor Desmarais for Advanced Machine Learning and Applied AI Workflows on May 13–15.
When researchers hear “black box problem” in the context of machine learning, they typically think of a single model whose predictions are hard to explain. Can we tell which features drove the output? Can we visualize the relationships? This framing has produced a rich literature on interpretability tools like SHAP values, partial dependence plots, and permutation importance.
A different kind of black box problem has emerged with the rise of large language models (LLMs), a prominent form of generative AI. It occurs when these models are used within a larger analytical workflow: for data annotation, preprocessing, feature extraction, or synthetic data generation. In these cases, the challenge is not about interpreting a single model’s prediction. It is about whether the outputs at each stage can be validated, whether errors can be traced as they propagate through the pipeline, and whether the workflow is documented well enough for others to reproduce it.
Validation: Treat Every AI Step Like a Measurement Instrument
Validation should not happen only at the end of the research pipeline. Each step warrants its own assessment, because downstream models will inherit and potentially amplify systematic errors introduced upstream. For example, if an LLM miscodes categories 15% of the time, a downstream classifier will train on those corrupted labels, and no amount of cross-validation on the final model will catch the problem, because the held-out folds are scored against the same corrupted labels. How to validate LLM outputs will vary by task, but we can often use the same techniques we would apply if humans were performing the task. If an LLM is coding data, for instance, we should compare its labels to a human-annotated gold standard, just as we would for any team of coders.
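As a minimal sketch of the gold-standard comparison, the Python function below computes raw agreement and Cohen's kappa between LLM labels and human labels for the same items. The function name, the return format, and the choice of kappa as the agreement statistic are illustrative assumptions, not a method prescribed by this article; other tasks may call for other metrics.

```python
from collections import Counter


def agreement_report(llm_labels, gold_labels):
    """Compare LLM labels with a human gold standard, item by item.

    Hypothetical helper for illustration: both arguments are equal-length
    sequences of category labels for the same items, in the same order.
    """
    if len(llm_labels) != len(gold_labels) or len(gold_labels) == 0:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(gold_labels)
    pairs = list(zip(gold_labels, llm_labels))

    # Observed agreement: share of items where the two labels match.
    observed = sum(g == l for g, l in pairs) / n

    # Chance agreement, from each coder's marginal label frequencies.
    llm_counts = Counter(llm_labels)
    gold_counts = Counter(gold_labels)
    expected = sum(llm_counts[c] * gold_counts[c] for c in gold_counts) / (n * n)

    # Cohen's kappa. When both coders use a single identical label, kappa is
    # undefined; returning 1.0 in that case is a convention assumed here.
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)

    # Disagreements keyed by (gold label, LLM label), to show which
    # categories the model confuses rather than only how often it errs.
    confusions = Counter((g, l) for g, l in pairs if g != l)

    return {"n": n, "accuracy": observed, "kappa": kappa, "confusions": confusions}
```

The confusion counts matter as much as the summary statistics: errors concentrated in one category pair are systematic and will bias a downstream model differently from errors spread evenly across categories.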
Error Tracing: When Something Goes Wrong, Can We Find It?
In a conventional pipeline, debugging is straightforward. Each transformation can be inspected, intermediate data files checked, and the problem isolated. When a generative model sits in the middle, the mapping from input to output becomes opaque, and it can be harder to find errors. Useful strategies include logging raw model outputs alongside processed versions (e.g., saving entailment scores in zero-shot classification tasks), running sensitivity analyses that compare final results with and without the AI-generated component, and testing whether conclusions hold up across alternative prompts, different LLMs, or versions of the same LLM. If swapping one prompt for another changes 20% of the labels, that is information worth having before publication.
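The prompt-sensitivity check described above can be sketched as a pairwise comparison of label sets. The Python function below assumes the same items have already been labeled under each variant (an alternative prompt, a different LLM, or a different model version) and that the labels are stored in the same item order; the function name and the variant names in the usage note are hypothetical.

```python
from itertools import combinations


def label_flip_rates(labels_by_variant):
    """Share of items whose label differs between each pair of variants.

    labels_by_variant maps a variant name to the list of labels that
    variant produced for the same items, in the same order.
    """
    rates = {}
    for (name_a, labels_a), (name_b, labels_b) in combinations(
        labels_by_variant.items(), 2
    ):
        if len(labels_a) != len(labels_b) or len(labels_a) == 0:
            raise ValueError(
                f"{name_a} and {name_b} must label the same non-empty item set"
            )
        changed = sum(a != b for a, b in zip(labels_a, labels_b))
        rates[(name_a, name_b)] = changed / len(labels_a)
    return rates
```

For example, calling `label_flip_rates({"prompt_v1": labels_1, "prompt_v2": labels_2})` returns a dictionary with one entry keyed by `("prompt_v1", "prompt_v2")`. A high flip rate does not say which variant is right, only that the labels are sensitive to a choice that should then be reported and validated.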
Reproducibility: The Underappreciated Challenge
A traditional script-based workflow can be rerun and will produce the same results. A workflow that depends on an LLM may not, because the model’s behavior can change between versions and even identical prompts can produce different outputs across runs. Important decisions to make early include whether to cache outputs, fix random seeds where possible, or treat the generative step as a source of uncertainty to be quantified via compositional estimators.
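One way to implement output caching is sketched below in Python: each response is stored on disk under a hash of the model identifier, prompt, and parameters, so a rerun reads the saved output instead of calling the model again. The `call_model` argument is a hypothetical stand-in for whatever API client a project uses, and the file layout is an assumption made for illustration.

```python
import hashlib
import json
from pathlib import Path


def cache_key(model, prompt, params):
    """Deterministic key for one (model, prompt, parameters) combination."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(call_model, model, prompt, params, cache_dir):
    """Return the cached output if one exists; otherwise call the model once.

    call_model is a hypothetical function (model, prompt, params) -> str
    supplied by the caller, for example a thin wrapper around an API client.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(model, prompt, params)}.json"

    if path.exists():
        # Reruns read the stored response, so results stay fixed even if
        # the hosted model later changes or samples differently.
        return json.loads(path.read_text(encoding="utf-8"))["output"]

    output = call_model(model, prompt, params)
    record = {"model": model, "prompt": prompt, "params": params, "output": output}
    path.write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return output
```

Caching fixes the results of one run; it does not remove the underlying variability. Changing the model identifier, the prompt, or any parameter produces a new key, so the cache can sit alongside sensitivity checks rather than replace them.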
One practical recommendation is to embed all prompts in scripts that call models through their APIs, rather than working through chat interfaces. A script records the exact prompt text, model version, and parameters in code that can be versioned, shared, and rerun. A chat session does not. This also makes it straightforward to save outputs, log responses, and run the same prompts against updated models to assess sensitivity. At minimum, recording the exact model version, prompt text, and parameter settings as part of the research documentation is essential. Without this, a workflow is not reproducible in any meaningful sense.
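A minimal sketch of this kind of record keeping, in Python, is to append one JSON line per model call containing the model version, the exact prompt, the parameters, the raw response, and the processed value derived from it. The field names and helper functions below are assumptions for illustration; the actual API call is left out because it depends on the provider.

```python
import json
from datetime import datetime, timezone


def build_run_record(model_version, prompt, params, raw_output, parsed_value):
    """Bundle everything needed to document one model call."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,  # exact version string, not just a family name
        "prompt": prompt,                # full prompt text as sent
        "params": params,                # e.g. temperature, max tokens
        "raw_output": raw_output,        # unprocessed model response
        "parsed_value": parsed_value,    # what the pipeline extracted from it
    }


def append_record(log_path, record):
    """Append one record to a JSON Lines log (one JSON object per line)."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False, sort_keys=True) + "\n")


def read_records(log_path):
    """Load all records back, for auditing or for rerunning the prompts."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Keeping the raw output next to the parsed value means a suspicious label can be traced back to the response that produced it, and the stored prompts can be replayed against an updated model to assess sensitivity.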
Why This Matters Now
A growing body of research relies on LLMs, and the standards for evaluating that work are still being established. Researchers who build validation and documentation into their workflows will be ahead of the curve. Those who treat the generative model as a convenient shortcut, without subjecting it to the same scrutiny as any other tool, risk producing results that cannot be verified or reproduced.
The black box problem has not gone away. It has moved from the model to the workflow. The tools for addressing it are not exotic. They are the same principles of measurement validation, error analysis, and documentation that rigorous research has always required, applied to a new and powerful class of tools.

