Evaluation of LLMs on Abstract Reasoning

Gaël Gendron, Qiming Bao, Michael Witbrock, Gillian Dobbie
Large Language Models, Abstract Reasoning, Evaluation, Out-of-distribution Generalization

Overview

This project was the first attempt to evaluate large language models (LLMs) on a wide variety of abstract reasoning tasks, including the Abstraction and Reasoning Corpus (ARC-AGI) and Raven’s Progressive Matrices (RPM), among others. We build a large benchmark of text-based abstract reasoning tasks and evaluate the performance of LLMs on them. We show that LLMs cannot generalize to unseen reasoning chains and fail to adapt to out-of-distribution settings.

Key Takeaways

New Benchmark: We build a large benchmark of abstract reasoning tasks converted to a text format suitable for LLMs.
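
As a minimal sketch of what a text-based task instance might look like, the snippet below serializes an ARC-style grid task into a prompt. The toy grids and the prompt wording are illustrative only, not the benchmark's actual serialization format.

```python
# Minimal sketch: turning an ARC-style grid task into a text prompt for an LLM.
# The example grids and the prompt wording are illustrative, not the benchmark's
# actual serialization format.

def grid_to_text(grid: list[list[int]]) -> str:
    """Render a 2D grid of integers as rows of space-separated values."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def build_prompt(train_pairs, test_input) -> str:
    """Concatenate demonstration input/output pairs followed by the query input."""
    parts = []
    for i, (inp, out) in enumerate(train_pairs, start=1):
        parts.append(f"Example {i} input:\n{grid_to_text(inp)}")
        parts.append(f"Example {i} output:\n{grid_to_text(out)}")
    parts.append(f"Test input:\n{grid_to_text(test_input)}")
    parts.append("Test output:")
    return "\n\n".join(parts)

# Toy task: the hidden transformation mirrors each row.
train_pairs = [([[1, 0], [0, 2]], [[0, 1], [2, 0]])]
test_input = [[3, 0], [0, 4]]
print(build_prompt(train_pairs, test_input))
```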

Evaluation of LLMs: We evaluate the performance of large language models on these tasks, including the strongest models available at the time (GPT-4, LLaMA-2, etc.), with several prompting methods (direct, chain-of-thought, few-shot, self-refinement, code-refinement) and with LoRA fine-tuning.
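
For the fine-tuning setting, the sketch below shows how LoRA adapters can be attached to an open model with the Hugging Face PEFT library. The checkpoint name and LoRA hyperparameters (rank, alpha, target modules) are illustrative assumptions, not necessarily the configuration used in the project.

```python
# Minimal sketch of a LoRA fine-tuning setup with Hugging Face PEFT.
# Checkpoint and hyperparameters are illustrative, not the project's exact settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only adapter weights are trainable
model.print_trainable_parameters()

# The wrapped model can then be trained on the benchmark's (prompt, answer) pairs
# with a standard causal language-modeling loss.
```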

Impact of Training and Fine-tuning: We show that prompting methods alone are not sufficient to adapt LLMs to abstract reasoning tasks, and that fine-tuning, while improving performance, does not lead to better generalization.
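
The generalization gap can be quantified by comparing accuracy on task types seen during fine-tuning with accuracy on held-out task types. Below is a minimal sketch under assumed placeholders: the `predict` callable and the toy task lists stand in for a real fine-tuned model and the benchmark splits.

```python
# Minimal sketch of measuring the generalization gap: exact-match accuracy on task
# types seen during fine-tuning vs. held-out (out-of-distribution) task types.
# `predict` and the toy task lists are hypothetical placeholders.
from typing import Callable

def exact_match_accuracy(predict: Callable[[str], str], tasks: list[dict]) -> float:
    """Fraction of tasks whose generated answer exactly matches the target."""
    correct = sum(predict(t["prompt"]).strip() == t["target"].strip() for t in tasks)
    return correct / len(tasks)

# Toy placeholders standing in for a fine-tuned model and the two evaluation splits.
def predict(prompt: str) -> str:
    return "0 1\n2 0"

seen_tasks = [{"prompt": "...", "target": "0 1\n2 0"}]
held_out_tasks = [{"prompt": "...", "target": "5 5\n5 5"}]

print(f"in-distribution:     {exact_match_accuracy(predict, seen_tasks):.2%}")
print(f"out-of-distribution: {exact_match_accuracy(predict, held_out_tasks):.2%}")
```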

Future Directions

Our results show that LLMs cannot generalize to unseen reasoning chains and fail to adapt to out-of-distribution settings. This highlights the need for more robust and generalizable models for abstract reasoning, a problem we tackle in our other projects.