TY - JOUR
T1 - Measuring quality of general reasoning
AU - Marcoci, Alexandru
AU - Webb, Margaret E.
AU - Rowe, Luke
AU - Barnett, Ashley
AU - Primoratz, Tamar
AU - Kruger, Ariel
AU - Stone, Benjamin
AU - Diamond, Michael L.
AU - Saletta, Morgan
AU - van Gelder, Tim
AU - Dennis, Simon
N1 - Funding Information:
This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), under Contract 16122000002. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the United States Government. The United States Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
Publisher Copyright:
© 2022 The Author(s). This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY)
PY - 2022
Y1 - 2022
AB - Machine learning models that automatically assess reasoning quality are trained on human-annotated written products. These “gold-standard” corpora are typically created by prompting annotators to choose, in a forced choice design, which of two products presented side by side is more convincing, contains stronger evidence, or would be adopted by more people. Despite the increasing popularity of forced choice designs for assessing quality of reasoning (QoR), no study to date has established the validity and reliability of such a method. In two studies, we simultaneously presented two products of reasoning to participants and asked them to identify, through a forced choice design, which product was 'better justified'. We investigated the criterion validity and inter-rater reliability of the forced choice protocol by assessing the relationship between QoR, measured using the forced choice protocol, and accuracy on objectively answerable problems, using naive raters sampled from MTurk (Study 1) and experts (Study 2), respectively. In both studies, products that were closer to the correct answer and products generated by larger teams were consistently preferred. Experts were substantially better at picking the reasoning products that corresponded to accurate answers. Perhaps the most surprising finding was just how rapidly raters made judgements regarding reasoning: on average, both novices and experts made reliable decisions in under 15 seconds. We conclude that forced choice is a valid and reliable method of assessing QoR.
KW - forced choice
KW - quality of reasoning
KW - reasoning
UR - http://www.scopus.com/inward/record.url?scp=85146417936&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85146417936
SN - 1069-7977
VL - 44
SP - 3229
EP - 3235
JO - Proceedings of the Annual Meeting of the Cognitive Science Society
JF - Proceedings of the Annual Meeting of the Cognitive Science Society
T2 - 44th Annual Meeting of the Cognitive Science Society
Y2 - 27 July 2022 through 30 July 2022
ER -