Author: Denis Avetisyan
A new framework combines the power of large language models with graph-based scene understanding to enable more robust and efficient long-term task planning for embodied AI agents.
GiG leverages a Graph-in-Graph memory architecture and lookahead planning to improve the performance of LLM-based agents in complex environments.
Despite advances in reasoning, Large Language Models deployed as embodied agents struggle with long-horizon planning due to context limitations and environmental constraints. This paper introduces ‘Embodied Task Planning via Graph-Informed Action Generation with Large Language Model’, a novel framework, GiG, that structures agent memory using a Graph-in-Graph architecture and bounded lookahead to improve planning robustness. By encoding environmental states into action-connected graph embeddings and retrieving relevant past experiences, GiG enables agents to ground decisions in structural patterns and project grounded actions. Can this approach unlock more efficient and reliable embodied AI capable of tackling complex, real-world tasks?
Beyond Pattern Matching: The Limits of Scale in Large Language Models
Large language models demonstrate a remarkable capacity for identifying and replicating patterns within vast datasets, enabling them to generate human-quality text and perform tasks like translation with impressive fluency. However, this proficiency often plateaus when confronted with challenges demanding genuine reasoning or intricate planning. While adept at recognizing correlations, these models frequently falter when required to extrapolate beyond learned associations, solve novel problems, or navigate scenarios necessitating multiple sequential steps. The core limitation isn’t a lack of data, but rather a fundamental architectural constraint; LLMs, at their heart, are sophisticated pattern matchers, not symbolic manipulators capable of true cognitive processes like deduction, abstraction, or goal-directed behavior. This distinction highlights a crucial gap between statistical learning and the hallmarks of intelligence, suggesting that simply increasing model size will not resolve inherent difficulties with complex thought.
The pursuit of ever-larger transformer models, while initially successful, is increasingly hampered by diminishing returns. Simply increasing the number of parameters doesn’t guarantee improved performance, particularly when faced with tasks demanding more than pattern recognition. These models struggle to integrate external, grounded knowledge (information not present within their training data) and lack the capacity for iterative refinement, a crucial element in complex problem-solving. Consequently, challenges such as planning, common-sense reasoning, and adapting to novel situations prove difficult, as the models tend to rely on statistical correlations rather than genuine understanding. This limitation highlights the need for architectural innovations that go beyond sheer scale and focus on incorporating mechanisms for explicit knowledge representation and dynamic, iterative processing.
The limitations of simply increasing the size of transformer models are driving research toward novel architectures that move beyond pattern recognition alone. These emerging systems prioritize the incorporation of explicit knowledge representation – essentially, building in facts and relationships about the world – and combining this with dynamic planning capabilities. This means the model doesn’t just respond to a prompt, but actively constructs a sequence of actions or inferences to achieve a goal, much like a human solving a complex problem. Rather than relying solely on statistical correlations learned from massive datasets, these architectures aim to create systems capable of reasoning, adapting to new information, and tackling tasks that require more than just recalling memorized patterns. This shift promises to unlock capabilities in areas like robotics, scientific discovery, and complex decision-making where LLMs currently falter.
Augmenting Knowledge: Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) mitigates the inherent knowledge limitations of Large Language Models (LLMs) by supplementing their pre-trained parameters with information retrieved at inference time. LLMs, while proficient in language understanding and generation, possess a fixed knowledge base established during training, leading to potential inaccuracies or omissions when addressing queries requiring up-to-date or specialized information. RAG systems address this by first identifying relevant documents or data points from an external knowledge source – such as a vector database, document repository, or the internet – based on the user’s input. This retrieved information is then concatenated with the original prompt and fed to the LLM, enabling it to generate responses grounded in external, dynamically accessed knowledge, rather than solely relying on its internal parameters.
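The retrieve-then-concatenate pipeline described above can be sketched in a few lines. This is a deliberately toy illustration, not the paper's implementation: the "retriever" here scores documents by word overlap as a stand-in for embedding similarity, and the final prompt would be sent to an LLM rather than printed.

```python
import re

def tokens(text: str) -> set:
    """Lowercase word tokens, used as a toy stand-in for embeddings."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, documents: list, k: int = 2) -> list:
    """Rank documents by shared-token overlap with the query and
    return the top k (a real system would use vector similarity)."""
    q = tokens(query)
    scored = sorted(documents,
                    key=lambda d: len(q & tokens(d)),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, documents: list) -> str:
    """Concatenate retrieved context with the original query, as a
    RAG pipeline does before calling the LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = ["GiG structures agent memory as a graph of graphs.",
        "Bounded lookahead projects the next few actions.",
        "The weather today is sunny."]
print(build_prompt("How does GiG structure memory?", docs))
```

The key property is that the LLM's parameters are untouched; only the prompt changes, which is why RAG can surface information newer than the model's training cutoff.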
The efficacy of Retrieval-Augmented Generation (RAG) pipelines is directly correlated with the quality of information retrieved prior to generation. While simple keyword searches can be utilized, unstructured or semi-structured data often yields suboptimal results due to ambiguity and lack of semantic understanding. Employing structured knowledge representations, such as knowledge graphs or relational databases, allows for more precise and nuanced retrieval based on relationships and entities. This improved retrieval quality directly translates to more relevant context being provided to the Large Language Model (LLM), increasing the likelihood of accurate, consistent, and informative generated outputs. Specifically, structured data enables the system to identify not just documents containing keywords, but documents containing information related to the query through defined relationships, thereby mitigating the impact of lexical mismatch and improving recall.
Employing a Knowledge Graph (KG) as the retrieval mechanism in a Retrieval-Augmented Generation (RAG) system improves information access by representing knowledge as interconnected entities and relationships. This structured approach contrasts with traditional keyword-based search, allowing for semantic understanding of queries and more precise identification of relevant information. The graph structure enables the system to traverse relationships between concepts, identifying information that may not contain the exact query terms but is contextually relevant. Consequently, utilizing a KG boosts both precision – minimizing the retrieval of irrelevant documents – and recall – maximizing the retrieval of all relevant documents – resulting in LLM-generated responses that are better informed, more accurate, and less prone to hallucination.
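To make the traversal idea concrete, here is a minimal sketch of graph-based retrieval under assumed structure (subject-relation-object triples; the triples themselves are invented examples, not from the paper). Facts one or two hops from a query entity are returned even when they share no keywords with the query.

```python
from collections import deque

# Hypothetical triples for illustration only.
TRIPLES = [
    ("GiG", "uses", "Graph-in-Graph memory"),
    ("GiG", "uses", "bounded lookahead"),
    ("Graph-in-Graph memory", "stores", "past experiences"),
    ("bounded lookahead", "projects", "action outcomes"),
]

def neighbors(entity):
    """Yield (triple, adjacent_entity) pairs touching an entity."""
    for s, r, o in TRIPLES:
        if s == entity:
            yield (s, r, o), o
        elif o == entity:
            yield (s, r, o), s

def retrieve_facts(seed: str, max_hops: int = 2) -> list:
    """Breadth-first walk collecting triples within max_hops of the
    seed entity, so related facts surface without keyword overlap."""
    seen, facts = {seed}, []
    frontier = deque([(seed, 0)])
    while frontier:
        entity, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for triple, nxt in neighbors(entity):
            if triple not in facts:
                facts.append(triple)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    return facts
```

Asking for facts about "GiG" at one hop returns only its direct relations; at two hops, the walk also pulls in what its components store and project, which is exactly the recall gain the paragraph describes.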
ReAct and Embodied Task Execution: Planning Through Interaction
The ReAct framework implements a cyclical process of agent interaction with an environment. This loop consists of alternating “action” and “observation” phases; the agent executes an action, then receives observational feedback detailing the consequences of that action within the environment. Crucially, this feedback is not simply used for reward calculation, but is directly incorporated into the agent’s reasoning process, allowing it to revise its internal plan and select subsequent actions based on the observed results. This iterative refinement is distinct from traditional planning approaches and enables agents to adapt to unexpected circumstances and dynamic environments, improving robustness and task completion rates.
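The alternating action-observation loop can be sketched as follows. This is an illustrative skeleton, not the ReAct authors' code: the LLM's reasoning step is replaced by a scripted `policy`, and the environment is a one-dimensional toy world.

```python
def react_loop(policy, env_step, goal_reached, max_steps=10):
    """Alternate action and observation, appending both to a running
    context that conditions the next decision (the ReAct pattern)."""
    context = []
    for _ in range(max_steps):
        action = policy(context)        # reason over history, pick action
        observation = env_step(action)  # execute, receive feedback
        context.append((action, observation))
        if goal_reached(observation):
            break
    return context

# Toy environment: reach position 3 by moving right.
pos = {"x": 0}

def env_step(action):
    if action == "right":
        pos["x"] += 1
    return pos["x"]

trace = react_loop(policy=lambda ctx: "right",
                   env_step=env_step,
                   goal_reached=lambda obs: obs == 3)
```

The point of the structure is that `context` grows with each cycle, so a real policy (an LLM prompt) can condition on everything observed so far rather than on the initial state alone.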
Embodied Task Planning necessitates an agent’s ability to actively perceive its environment through sensor data, interpret that data to understand the current state, and then reason about possible actions and their likely outcomes within that state. This process moves beyond static planning by requiring continuous adaptation to dynamic conditions; the agent must not simply follow a pre-defined plan but must monitor its execution, recognize deviations from the expected trajectory, and adjust its subsequent actions accordingly. Successful embodied planning, therefore, fundamentally depends on integrating perception, interpretation, and reasoning to enable agents to operate effectively in complex and unpredictable physical spaces.
Bounded Lookahead (BL) enhances embodied task planning by predicting the immediate consequences of an agent’s actions, facilitating more informed decision-making. This projection often utilizes a State-Transition Graph to model possible state changes and track progress toward a goal. The GiG framework incorporates BL to achieve state-of-the-art performance on established benchmarks; specifically, it demonstrates a Pass@1 rate improvement of up to 37% when evaluated on the Robotouille Asynchronous dataset, indicating a significant advancement in successful task completion compared to previous methods.
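A depth-limited search over a state-transition graph conveys the bounded-lookahead idea. The mechanics and the cooking-themed states below are assumptions for illustration (loosely echoing the Robotouille domain), not the GiG implementation.

```python
# Hypothetical state-transition graph: state -> {action: next_state}.
TRANSITIONS = {
    "raw":     {"chop": "chopped", "cook": "burnt"},
    "chopped": {"cook": "cooked"},
    "cooked":  {"plate": "served"},
    "burnt":   {},
    "served":  {},
}

def best_first_action(state: str, goal: str, depth: int):
    """Depth-limited lookahead: return the first action of a path
    that reaches the goal within `depth` steps, else None."""
    if depth == 0 or state == goal:
        return None
    for action, nxt in TRANSITIONS[state].items():
        if nxt == goal or best_first_action(nxt, goal, depth - 1):
            return action
    return None
```

With a horizon of three the agent sees that chopping (not cooking) leads to "served"; with a horizon of two the goal is out of range and no action is preferred, which is the trade-off a bounded horizon makes in exchange for tractability.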
Recursive Planning with ReCAP: Navigating Complexity Through Iteration
The ReCAP framework addresses the challenges of complex task completion through a robust recursive planning process coupled with backtracking capabilities. This allows an agent to not simply pursue a single course of action, but to systematically explore multiple potential solution paths, branching out to consider alternatives at each step. Crucially, when an initial path proves unsuccessful – encountering an obstacle or leading to a dead end – the system doesn’t halt; instead, it intelligently backtracks, revisiting previous decision points to explore previously untried options. This iterative process of planning, execution, and recovery from failures is fundamental to ReCAP’s ability to navigate intricate environments and achieve successful outcomes even when faced with unforeseen circumstances, offering a significant advantage over methods that rely on a single, inflexible plan.
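The plan-fail-backtrack pattern can be sketched with a recursive search over alternatives. This is illustrative only; ReCAP's actual branching is driven by an LLM, whereas here a hand-built graph with one dead end forces the backtrack.

```python
# Hypothetical environment: state -> [(action, next_state)].
GRAPH = {
    "start":  [("door_a", "locked"), ("door_b", "hall")],
    "locked": [],                      # dead end: forces backtracking
    "hall":   [("stairs", "goal")],
    "goal":   [],
}

def plan(state: str, goal: str, visited=frozenset()):
    """Recursively extend the plan; a branch that fails returns None
    and the caller falls back to the next untried alternative."""
    if state == goal:
        return []
    for action, nxt in GRAPH[state]:
        if nxt in visited:
            continue                   # avoid revisiting explored states
        sub = plan(nxt, goal, visited | {state})
        if sub is not None:            # branch succeeded: prepend action
            return [action] + sub
    return None                        # every branch failed: backtrack
```

The planner first tries `door_a`, hits the locked dead end, and backtracks to `door_b`, returning a complete plan rather than halting at the first failure.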
ReCAP’s ability to navigate complex tasks stems from its innovative use of a context tree, a dynamic record of the agent’s interactions with its environment. This tree doesn’t simply log actions; it meticulously catalogs both the actions taken and the subsequent observations resulting from those actions. By preserving this detailed history, ReCAP gains a crucial advantage in decision-making, allowing it to assess the consequences of previous choices and refine its strategy accordingly. The context tree effectively functions as a form of experience replay, enabling the agent to ‘remember’ what has worked – and importantly, what hasn’t – in similar situations, leading to more robust and adaptable behavior. This historical awareness is key to overcoming challenges and achieving success in intricate, multi-step tasks.
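A context tree of this kind might be represented as below. The node shape is an assumption for illustration (the paper's structure may differ): each node stores one action and its resulting observation, children branch on alternative follow-ups, and walking back to the root recovers the history that led to any node.

```python
class ContextNode:
    """One step in a context tree: an action, the observation it
    produced, a parent link, and children for alternative branches."""

    def __init__(self, action=None, observation=None, parent=None):
        self.action, self.observation = action, observation
        self.parent, self.children = parent, []

    def record(self, action, observation):
        """Append a new step under this node and return it."""
        child = ContextNode(action, observation, parent=self)
        self.children.append(child)
        return child

    def history(self):
        """Walk parent links to the root, returning the
        (action, observation) trail, oldest step first."""
        trail, node = [], self
        while node.parent is not None:
            trail.append((node.action, node.observation))
            node = node.parent
        return list(reversed(trail))

root = ContextNode()
step1 = root.record("open fridge", "fridge is open")
step2 = step1.record("take milk", "holding milk")
```

Because failed branches remain in the tree as siblings, the agent can inspect what was already tried from any decision point, which is the experience-replay behavior described above.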
The integration of Chain-of-Thoughts (CoT) prompting with the ReCAP planning framework enables a level of sophisticated reasoning crucial for tackling complex challenges. This synergy facilitates not merely action, but deliberate, step-by-step problem-solving, evidenced by near-perfect success rates – measured as Pass@1 – on the ALFWorld benchmark using both Qwen3 and DeepSeek models. Performance gains extend to more demanding robotic manipulation tasks; notably, the approach achieves a 22% improvement on the Robotouille Synchronous environment and a 37% improvement on the Asynchronous variant, both utilizing Qwen3-235B. Further refinement through the incorporation of experience memory yields additional gains, boosting Pass@1 rates by 15% with Qwen3-30B and 7% with Gemini-2.5-Flash-Lite, demonstrating the robustness and adaptability of this combined methodology.
The pursuit of robust embodied agents, as demonstrated by GiG, benefits from a relentless focus on distilling complexity. The framework’s Graph-in-Graph memory architecture, combined with bounded lookahead, exemplifies this principle; it isn’t about adding more layers of computation, but about intelligently structuring existing information for efficient retrieval and planning. As Robert Tarjan once stated, “The best algorithms are the ones you never have to think about.” GiG, by prioritizing clarity in its memory representation and planning horizon, moves closer to this ideal. The system’s efficiency isn’t a byproduct of intricate design, but rather a testament to the power of streamlined information processing – a deliberate subtraction of unnecessary complication.
Where Does This Leave Us?
The presented framework, while a step toward mitigating the inherent fallibility of large language models in extended planning scenarios, does not, of course, solve the problem. It sidesteps, with a certain elegant complexity, the issue of genuine understanding. The reliance on graph-informed action generation, and the bounded lookahead, are merely pragmatic concessions: acknowledgements that true foresight remains elusive. One suspects the gains observed are less about planning and more about skillfully managing the illusion of it.
Future work will inevitably focus on scaling these graph architectures, and refining the experience retrieval mechanisms. But such incremental improvements risk obscuring a fundamental truth: a system predicated on retrieving and recombining existing data cannot, by definition, transcend the limitations of that data. A truly robust agent will require something beyond clever indexing: a capacity for genuine abstraction, for forming novel representations untethered to prior experience.
Perhaps the most fruitful avenue for investigation lies not in refining the planning process itself, but in questioning its necessity. If the goal is not to simulate intelligence, but to achieve competent action, then a simpler, more reactive approach, one that embraces imperfection and prioritizes immediate adaptation, may ultimately prove more effective. The pursuit of perfect plans, it seems, is often a distraction from the business of simply doing.
Original article: https://arxiv.org/pdf/2601.21841.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-01-31 19:25