Measuring quality of general reasoning

Alexandru Marcoci (Lead / Corresponding author), Margaret E. Webb, Luke Rowe, Ashley Barnett, Tamar Primoratz, Ariel Kruger, Benjamin Stone, Michael L. Diamond, Morgan Saletta, Tim van Gelder, Simon Dennis

Research output: Contribution to journal > Conference article > peer-review



Machine learning models that automatically assess reasoning quality are trained on human-annotated written products. These “gold-standard” corpora are typically created by prompting annotators to choose, using a forced choice design, which of two products presented side by side is more convincing, contains stronger evidence or would be adopted by more people. Despite the growing popularity of forced choice designs for assessing quality of reasoning (QoR), no study to date has established the validity and reliability of such a method. In two studies, we simultaneously presented two products of reasoning to participants and asked them to identify which product was “better justified” through a forced choice design. We investigated the criterion validity and inter-rater reliability of the forced choice protocol by assessing the relationship between QoR, measured using the forced choice protocol, and accuracy in objectively answerable problems, using naive raters sampled from MTurk (Study 1) and experts (Study 2), respectively. In both studies, products that were closer to the correct answer and products generated by larger teams were consistently preferred. Experts were substantially better at picking the reasoning products that corresponded to accurate answers. Perhaps the most surprising finding was just how rapidly raters made judgements regarding reasoning: on average, both novices and experts made reliable decisions in under 15 seconds. We conclude that forced choice is a valid and reliable method of assessing QoR.

Original language: English
Pages (from-to): 3229-3235
Number of pages: 7
Journal: Proceedings of the Annual Meeting of the Cognitive Science Society
Publication status: Published - 2022
Event: 44th Annual Meeting of the Cognitive Science Society: Cognitive Diversity 2022 - Metro Toronto Conference Centre and Online, Toronto, Canada
Duration: 27 Jul 2022 to 30 Jul 2022


  • forced choice
  • quality of reasoning
  • Reasoning

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Human-Computer Interaction
  • Cognitive Neuroscience

