Evaluation of LLMs on Abstract Reasoning

Gaël Gendron, Qiming Bao, Michael Witbrock, Gillian Dobbie
Large Language Models, Abstract Reasoning, Evaluation, Out-of-distribution Generalization

Overview

This project was the first attempt to evaluate large language models (LLMs) on a wide variety of abstract reasoning tasks, including the Abstraction and Reasoning Corpus (ARC-AGI) and Raven’s Progressive Matrices (RPM), among others. We build a large benchmark of text-based abstract reasoning tasks and evaluate the performance of LLMs on them. We show that LLMs cannot generalize to unseen reasoning chains and fail to adapt to out-of-distribution settings.

Key Takeaways

New Benchmark: We build a large benchmark of abstract reasoning tasks converted to a text format suitable for LLMs.
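
As a minimal sketch of what a text-based task instance might look like, the snippet below serializes an ARC-style grid task into a prompt. The toy grids and the prompt wording are illustrative only, not the benchmark's actual serialization format.

```python
# Minimal sketch: turning an ARC-style grid task into a text prompt for an LLM.
# The example grids and the prompt wording are illustrative, not the benchmark's
# actual serialization format.

def grid_to_text(grid: list[list[int]]) -> str:
    """Render a 2D grid of integers as rows of space-separated values."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def build_prompt(train_pairs, test_input) -> str:
    """Concatenate demonstration input/output pairs followed by the query input."""
    parts = []
    for i, (inp, out) in enumerate(train_pairs, start=1):
        parts.append(f"Example {i} input:\n{grid_to_text(inp)}")
        parts.append(f"Example {i} output:\n{grid_to_text(out)}")
    parts.append(f"Test input:\n{grid_to_text(test_input)}")
    parts.append("Test output:")
    return "\n\n".join(parts)

# Toy task: the hidden transformation mirrors each row.
train_pairs = [([[1, 0], [0, 2]], [[0, 1], [2, 0]])]
test_input = [[3, 0], [0, 4]]
print(build_prompt(train_pairs, test_input))
```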

Evaluation of LLMs: We evaluate the performance of large language models on these tasks, including the strongest models available at the time (GPT-4, LLaMA-2, etc.), with several prompting methods (direct, chain-of-thought, few-shot, self-refinement, code-refinement) and with LoRA fine-tuning.
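
For the fine-tuning setting, the sketch below shows how LoRA adapters can be attached to an open model with the Hugging Face PEFT library. The checkpoint name and LoRA hyperparameters (rank, alpha, target modules) are illustrative assumptions, not necessarily the configuration used in the project.

```python
# Minimal sketch of a LoRA fine-tuning setup with Hugging Face PEFT.
# Checkpoint and hyperparameters are illustrative, not the project's exact settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only adapter weights are trainable
model.print_trainable_parameters()

# The wrapped model can then be trained on the benchmark's (prompt, answer) pairs
# with a standard causal language-modeling loss.
```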

Impact of Training and Fine-tuning: We show that prompting methods alone are not sufficient to adapt LLMs to abstract reasoning tasks, and that fine-tuning, while improving performance, does not lead to better generalization.
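
The generalization gap can be quantified by comparing accuracy on task types seen during fine-tuning with accuracy on held-out task types. Below is a minimal sketch under assumed placeholders: the `predict` callable and the toy task lists stand in for a real fine-tuned model and the benchmark splits.

```python
# Minimal sketch of measuring the generalization gap: exact-match accuracy on task
# types seen during fine-tuning vs. held-out (out-of-distribution) task types.
# `predict` and the toy task lists are hypothetical placeholders.
from typing import Callable

def exact_match_accuracy(predict: Callable[[str], str], tasks: list[dict]) -> float:
    """Fraction of tasks whose generated answer exactly matches the target."""
    correct = sum(predict(t["prompt"]).strip() == t["target"].strip() for t in tasks)
    return correct / len(tasks)

# Toy placeholders standing in for a fine-tuned model and the two evaluation splits.
def predict(prompt: str) -> str:
    return "0 1\n2 0"

seen_tasks = [{"prompt": "...", "target": "0 1\n2 0"}]
held_out_tasks = [{"prompt": "...", "target": "5 5\n5 5"}]

print(f"in-distribution:     {exact_match_accuracy(predict, seen_tasks):.2%}")
print(f"out-of-distribution: {exact_match_accuracy(predict, held_out_tasks):.2%}")
```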

Future Directions

Our results show that LLMs cannot generalize to unseen reasoning chains and fail to adapt to out-of-distribution settings. This highlights the need for more robust and generalizable models for abstract reasoning, a problem we tackle in our other projects.