Lexical Recall or Logical Reasoning: Probing the Limits of Reasoning Abilities in Large Language Models

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Despite increasing interest in the reasoning abilities of Large Language Models (LLMs), existing work falls short of assessing logical ability independently of lexical memory. We address this gap with Mystery-Zebra, a robust two-part benchmark (4,290 puzzles) that challenges the logical abstraction abilities of LLMs in two setups: (1) a lexical obfuscation setup tests how much LLMs depend on lexical content, based on two canonical grid puzzles widely spread on the Internet; (2) a set of new grid puzzles in 42 different sizes and 12 difficulty levels tests how a puzzle's formal difficulty affects LLMs. We evaluate open- and closed-weight LLMs on both parts of the benchmark. The results on the second part suggest that model size (up to 70B parameters) has only a minor influence when solving newly generated puzzles; performance instead relates mainly to the number of items in a puzzle. The results on the first part suggest that the applied obfuscation strategies help mitigate the effect of logic puzzles appearing in LLM training data, showing a drastic drop in performance on obfuscated versions of well-known puzzles. In addition, we conduct a case study on the first part of the benchmark, predicting the positions of single items, which reveals that the reasoning abilities of LLMs are largely limited to a few consecutive reasoning steps.
Original language: English
Title of host publication: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics
Subtitle of host publication: Volume 1: Long Papers
Place of Publication: Texas, USA
Publisher: Association for Computational Linguistics
Pages: 13532–13557
Number of pages: 26
ISBN (Print): 979-8-89176-251-0
Publication status: Published - Jul 2025
Event: The 63rd Annual Meeting of the Association for Computational Linguistics - Austria Center Vienna, Vienna, Austria
Duration: 27 Jul 2025 – 1 Aug 2025
https://2025.aclweb.org/

Conference

Conference: The 63rd Annual Meeting of the Association for Computational Linguistics
Abbreviated title: ACL 2025
Country/Territory: Austria
City: Vienna
Period: 27/07/25 – 01/08/25

