Welo Data Unveils New Framework for Evaluating Causal Reasoning in Large Language Models

New York, NY – Welo Data today released a new research study examining how different scoring methods and prompt structures influence the evaluation of causal reasoning in large language models (LLMs). Titled “A Novel Multi-Select Framework for Evaluating Causal Reasoning in LLMs,” the study introduces a framework for testing how well LLMs identify multiple causes, including confounders and contributing factors, across a range of prompt types.

Unlike many existing benchmarks that rely on binary or single-answer formats, this study uses multi-select questions—requiring models to identify all valid answers from a list of options. The research team evaluated eight state-of-the-art LLMs using 576 prompts spanning three causal question types and two prompting formats (standard and chain-of-thought). Five distinct scoring metrics were applied to assess performance from multiple angles: precision, recall, F1 score, complement, and a "trapdoor" metric that penalizes any incorrect selection.
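To illustrate, a multi-select response can be scored in just a few lines. The sketch below is not the study's actual code; it shows one plausible way to compute precision, recall, F1, and an all-or-nothing "trapdoor" score (here assumed to return recall when no incorrect option is selected and zero otherwise — the release does not specify the exact formula, nor does it define the complement metric, so both details are assumptions).

```python
# Hypothetical scoring sketch for multi-select evaluation.
# Not the study's implementation; metric details are assumptions.

def score_multiselect(selected, correct):
    """Score one multi-select response against the gold answer set."""
    selected, correct = set(selected), set(correct)
    true_pos = len(selected & correct)
    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / len(correct) if correct else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # "Trapdoor" (assumed form): a single incorrect choice
    # nullifies the response; otherwise credit equals recall.
    trapdoor = recall if selected <= correct else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "trapdoor": trapdoor}

# A model that picks one wrong option scores zero on trapdoor
# even though its precision and recall are reasonable.
scores = score_multiselect({"A", "B", "C"}, {"A", "B", "D", "E"})
```

Under a metric like this, an over-selecting model can look strong on recall yet collapse to zero on the trapdoor score, which is the kind of divergence the study's ranking comparisons surface.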

“This research shows just how much our conclusions about AI performance depend on the lens we use,” said Dr. David Harper, Data Scientist at Welo Data and lead author of the study. “If we're serious about building high-performing reasoning models, we need evaluation methods that reflect the complexity of the real world.”

Key findings from the study include:

  • Scoring methods significantly impact model rankings. Metrics like precision and recall reveal different strengths, and complement scoring offered the clearest separation among models. The trapdoor metric—where a single incorrect choice nullifies a response—exposed over-selection tendencies in some models. Of the eight models tested, only two maintained the same ranking tier across all five metrics.
  • Models struggle with causal completeness. Most models demonstrated conservative selection strategies on causal questions—averaging just 2.7–3.6 selections versus the 5.5 expected. In contrast, models adopted overly liberal strategies when identifying confounders, selecting up to 4.7 options when, on average, only 2.5 were correct.
  • Chain-of-thought prompting showed no benefit in multi-select tasks. Contrary to previous research suggesting it improves reasoning, this study found no consistent performance improvement when chain-of-thought instructions were added.
  • Behavioral patterns vary by model family. Some models exhibited more cautious, precision-oriented selection behavior, while others were more liberal and risk-prone—particularly when evaluating non-causal relationships.

This study underscores a single, practical takeaway: evaluation methods must mirror the specific reasoning behaviors we want LLMs to demonstrate. As these models take on high-stakes tasks—such as diagnosing diseases, flagging fraudulent transactions, and guiding scientific discovery—organizations need tests that expose how a model actually reasons, not merely whether it selects the correct answer, to ensure safe and dependable deployment.

For more information, visit welodata.ai.

About Welo Data

Welo Data, a division of Welocalize, stands at the forefront of the AI training data industry, delivering exceptional data quality and security. Supported by a global network of over 500,000 AI training professionals and domain experts, along with cutting-edge technological infrastructure, Welo Data fulfills the growing demand for dependable training data across diverse AI applications. Its service offerings span a variety of critical areas, including data annotation and labeling, large language model (LLM) enhancement, data collection and generation, and relevance and intent assessment. Welo Data's technical expertise ensures that datasets are not only accurate but also culturally aligned, tackling significant AI development challenges like minimizing model bias and improving inclusivity. Its NIMO (Network Identity Management and Operations) framework guarantees the highest level of accuracy and quality in AI training data by leveraging advanced workforce assurance methods. welodata.ai