LLMs excel at inductive reasoning but struggle with deductive tasks, new research finds

Large language models (LLMs) have shown impressive performance on various reasoning and problem-solving tasks. However, questions remain about how these reasoning abilities work and where their limits lie.

In a new study, researchers at the University of California, Los Angeles (UCLA) and Amazon conducted an extensive examination of the capabilities of LLMs in deductive and inductive reasoning. Their findings show that while LLMs can be very good at inferring the rules of a task from solved examples, they are limited when it comes to following explicit instructions. The findings may have important implications for how we use LLMs in applications that require reasoning.

Inductive vs. Deductive Reasoning

Reasoning can be broadly divided into two types: deductive and inductive. Deductive reasoning, often described as “top-down” logic, starts from a general principle or rule and applies it to draw specific conclusions. For example, if you are given the formula for converting a temperature from Celsius to Fahrenheit, you can use it to convert any new measurement.
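
To make the deductive direction concrete, here is a minimal sketch (an illustration, not from the paper): the conversion rule is given up front, and answering any new case is simply a matter of applying it.

```python
# Deductive reasoning in miniature: the rule is known in advance,
# and each answer follows by applying it to a specific input.
def celsius_to_fahrenheit(celsius: float) -> float:
    """Apply the known conversion rule F = C * 9/5 + 32."""
    return celsius * 9 / 5 + 32

print(celsius_to_fahrenheit(100))  # 212.0
print(celsius_to_fahrenheit(37))   # 98.6
```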

Inductive reasoning, on the other hand, takes a bottom-up approach. It involves observing specific cases or examples and drawing general conclusions or patterns from them. For example, you might look at different Celsius and Fahrenheit readings on a thermometer and try to derive the formula that converts one to the other.
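
The inductive direction runs the other way. In the sketch below (again an illustration rather than anything from the paper), only paired thermometer readings are available; the linear rule is inferred from two of them and then checked against the rest.

```python
# Inductive reasoning in miniature: only observations are given,
# and the general rule (a linear mapping) is inferred from them.
readings = [(0.0, 32.0), (100.0, 212.0), (37.0, 98.6)]  # (celsius, fahrenheit)

# Hypothesize slope and intercept from the first two observations,
# then verify the hypothesis against the remaining data.
(c0, f0), (c1, f1) = readings[0], readings[1]
slope = (f1 - f0) / (c1 - c0)   # 1.8
intercept = f0 - slope * c0     # 32.0

assert all(abs(slope * c + intercept - f) < 1e-9 for c, f in readings)
print(f"F = {slope} * C + {intercept}")  # F = 1.8 * C + 32.0
```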

Both types of reasoning are essential to intelligence, but they involve different cognitive processes. And while LLMs are often assessed on their reasoning ability, most research does not make a clear distinction between their inductive and deductive capabilities.

A new framework for testing LLM reasoning

The Amazon and UCLA researchers designed a series of experiments to evaluate LLMs’ inductive and deductive reasoning capabilities. To ensure a fair and consistent comparison, the experiments used a similar task structure across different contexts, with each context specifically emphasizing deductive or inductive reasoning.

Deductive vs. inductive reasoning (source: arXiv)

For example, in a math task, the researchers tested the LLMs' ability to apply a given mathematical function to solve problems (deductive reasoning) and their ability to derive the underlying mathematical function from a series of input-output examples (inductive reasoning).

To further distinguish between inductive and deductive reasoning, the researchers developed SolverLearner, a two-step framework that isolates and evaluates the inductive reasoning process in LLMs.

SolverLearner first asks the LLM to generate a function that maps input data points to their corresponding output values, based solely on a set of input-output examples. This step focuses on the LLM's ability to learn the underlying pattern or rule from the data.

In the second step, SolverLearner uses an external code interpreter to execute the proposed function on new test data. This separation ensures that the LLM is not involved in applying the function, preventing its deductive reasoning ability from influencing the evaluation of its inductive reasoning.

SolverLearner framework (source: arXiv)

“By focusing on inductive reasoning and setting aside LLM-based deductive reasoning, we can isolate LLM inductive reasoning in its pure form and investigate it via SolverLearner,” the researchers write.
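
Conceptually, such a two-step pipeline could be wired up along the following lines. This is a hedged sketch under assumptions, not the authors' implementation: `ask_llm` is a hypothetical placeholder for whatever chat-completion API is used, and the prompt wording is illustrative only.

```python
# A sketch of a SolverLearner-style pipeline (illustrative, not the authors' code).
def ask_llm(prompt: str) -> str:
    # Hypothetical stand-in for a call to an LLM provider.
    raise NotImplementedError("call your LLM API here")

def induce_function(examples: list[tuple[str, str]]) -> str:
    """Step 1: the LLM proposes a Python function from input-output pairs only."""
    shots = "\n".join(f"f({inp!r}) -> {out!r}" for inp, out in examples)
    prompt = (
        "Write a Python function `f(x)` consistent with these examples. "
        "Return only code.\n" + shots
    )
    return ask_llm(prompt)

def evaluate_induced_function(code: str, tests: list[tuple[str, str]]) -> float:
    """Step 2: an external interpreter executes the proposed function on held-out
    test cases, so the LLM plays no part in applying the rule it induced."""
    namespace: dict = {}
    exec(code, namespace)  # assumes the generated code is run in a sandbox
    f = namespace["f"]
    return sum(f(inp) == out for inp, out in tests) / len(tests)
```

The key design choice is that the second step never consults the model again: correctness is judged purely by executing the induced function, so deductive ability cannot mask or inflate the inductive result.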

LLMs show contrasting strengths in inductive and deductive reasoning

The researchers used SolverLearner to evaluate the inductive and deductive reasoning capabilities of GPT-3.5 and GPT-4 in various tasks, including syntactic reasoning, arithmetic operations, and spatial reasoning.

The results showed that both LLMs consistently demonstrated remarkable inductive reasoning skills, achieving near-perfect accuracy on tasks that required them to learn from examples and infer the underlying mapping function.

However, the LLMs had difficulty applying specific rules or instructions, especially when those instructions involved scenarios that rarely appeared in their training data. This was especially true for counterfactual reasoning tasks that deviate from conventional cases. For example, the LLMs performed well on deductive reasoning with base-10 arithmetic, but performed very poorly on unconventional numerical bases such as 11 and 9.
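
To see why an unconventional base is a genuinely different rule rather than a cosmetic change, consider the small sketch below: the same digits yield different answers once the carrying rule changes. The helper `add_in_base` is illustrative and not taken from the study.

```python
# Arithmetic in an unconventional base follows different carrying rules,
# which is what makes it a counterfactual test for a model trained mostly on base 10.
def add_in_base(a: str, b: str, base: int) -> str:
    """Add two numbers written in the given base and return the sum in that base."""
    total = int(a, base) + int(b, base)
    digits = ""
    while total:
        digits = "0123456789ab"[total % base] + digits
        total //= base
    return digits or "0"

print(add_in_base("5", "4", 10))  # 9
print(add_in_base("5", "4", 9))   # 10  (five plus four carries over in base 9)
print(add_in_base("5", "4", 11))  # 9
```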

The findings suggest that LLMs may be better at learning from examples and discovering patterns in data than at following explicit instructions. This has important implications for using LLMs in real-world settings: while a model may initially appear to follow logical instructions, there is a high probability that it is merely reproducing patterns observed during training, which means its performance will degrade as soon as the examples it sees deviate from its training distribution.

On the other hand, SolverLearner provides a framework that ensures that the model learns the correct rules that map the input to the output. However, SolverLearner is only applicable in settings where a verification mechanism such as a code interpreter is available.

This study is a sobering reminder that we still have much to learn about the capabilities of these increasingly common black boxes.