Predicting What You’ll Do Next: A New Leap in User Interface Design

Author: Denis Avetisyan


Researchers have developed a new model that anticipates user actions by learning from their complete interaction history, offering a path toward more intuitive and efficient interfaces.

LongNAPs predict user actions by analyzing a complete history of multimodal context – encompassing nearly 1,800 hours of screen activity from 20 users over a month – and are trained to maximize similarity between predicted and actual future actions as evaluated by a large language model.

LongNAP, a long-context next action predictor, leverages multimodal behavioral data to significantly outperform existing models in anticipating user intent.

Truly proactive AI requires anticipating user intent beyond simple prompts, demanding reasoning over complete interaction histories. In ‘Learning Next Action Predictors from Human-Computer Interaction’, we address this challenge by formalizing next action prediction (NAP) and introducing LongNAP, a long-context model that learns from multimodal user data to forecast future actions. LongNAP significantly outperforms supervised and prompted baselines – achieving 79% and 39% improvements, respectively – and generalizes across users, correctly predicting user trajectories with up to 26% accuracy. Given these promising results, can learning from the full context of user behavior unlock a new era of genuinely anticipatory and helpful AI systems?


The Illusion of Context: Why Standard NAP Systems Fail

Conventional Next Action Prediction (NAP) systems frequently encounter limitations when processing extensive interaction histories, resulting in a loss of critical contextual information. These systems, often designed to anticipate immediate subsequent actions, struggle to effectively retain and utilize data from interactions further removed in time. This presents a significant challenge because human behavior is rarely isolated to the present moment; instead, it is deeply rooted in a complex web of past events and decisions. Consequently, the inability to leverage these longer-term dependencies hinders a NAP system’s capacity to accurately predict and respond to user needs, potentially leading to irrelevant or unhelpful suggestions as the nuance of previous exchanges fades from consideration.

Robust reasoning isn’t simply about processing the most recent information; it fundamentally depends on the ability to synthesize knowledge from a potentially lengthy and complex history of events. Standard Next Action Prediction (NAP) models, however, often struggle with this crucial task, encountering significant limitations when attempting to integrate extended sequences of past interactions. These models frequently prioritize the immediacy of recent data, inadvertently discarding potentially vital contextual cues embedded in earlier exchanges. This creates a bottleneck in their ability to accurately predict appropriate subsequent actions, as nuanced understanding often requires drawing connections across a broader temporal scope than these systems are presently equipped to handle. Consequently, achieving truly effective reasoning in interactive systems demands innovative approaches capable of reliably accessing and leveraging these extended histories.

Many current next action prediction systems exhibit a pronounced bias toward recent interactions, inadvertently diminishing the influence of potentially crucial information embedded within earlier exchanges. This prioritization of recency stems from architectural limitations and training methodologies, leading to models that struggle to maintain a comprehensive understanding of extended conversational histories. Consequently, actions predicted based solely on immediate context may lack the nuance or foresight derived from a broader, more complete awareness of past events. The result is a system that, while proficient at responding to the present, often overlooks vital cues and dependencies established earlier in the interaction, hindering its ability to formulate truly informed and strategically sound next actions.

LongNAP effectively leverages a significant portion of its historical context to anticipate user actions, as demonstrated by its retrieval patterns for a representative user query.

LongNAP: A Patch, Not a Paradigm Shift

LongNAP functions as a next action predictor by analyzing extensive user interaction histories to forecast subsequent actions. This is achieved by considering a significantly larger context window than typical models, allowing it to identify patterns and dependencies spanning multiple turns of interaction. By incorporating a comprehensive record of past observations and reasoning steps, LongNAP aims to improve prediction accuracy in complex, multi-step tasks where understanding the evolution of the interaction is crucial. The system doesn’t rely solely on the immediately preceding action, but instead integrates information from a broader historical record to anticipate the user’s likely next step.

LongNAP utilizes a BM25 Retriever as its primary mechanism for accessing relevant historical data. BM25, a ranking function used to estimate the relevance of documents to a given search query, efficiently identifies past reasoning traces and observational data from the extended interaction history. This retrieval process involves scoring each historical trace based on its term frequency-inverse document frequency (TF-IDF) similarity to the current context. The highest-scoring traces are then selected and provided as input to the prediction model, forming the knowledge base upon which future action predictions are based. This approach allows LongNAP to leverage a significantly larger context window than traditional methods, improving prediction accuracy by incorporating a richer understanding of past interactions.
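To make the retrieval step concrete, here is a minimal, self-contained sketch of BM25 scoring over tokenized interaction traces. This is not LongNAP's implementation – the function, token traces, and parameter values (`k1`, `b`) are illustrative stand-ins using the standard Okapi BM25 formula.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document (a list of tokens) against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency for each distinct query term.
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        dl = len(doc)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores

# Rank hypothetical past interaction traces against the current context.
traces = [["open", "email", "reply"],
          ["search", "flights", "compare", "prices"],
          ["write", "report", "draft"]]
query = ["search", "flights"]
scores = bm25_scores(query, traces)
best = max(range(len(traces)), key=scores.__getitem__)  # the flight-search trace
```

The highest-scoring traces would then be concatenated into the model's context window as retrieved history.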

LongNAP refines information retrieval through the combined application of Temporal Decay and Maximal Marginal Relevance (MMR). Temporal Decay prioritizes recent interactions by assigning higher weights to more recent reasoning traces and observations, acknowledging their increased relevance to current action prediction. Simultaneously, MMR ensures diversity within the retrieved context by penalizing redundancy; it selects items that are both relevant to the query and dissimilar to those already included in the retrieved set. This combination prevents the model from being overly focused on a narrow subset of past interactions and promotes a more comprehensive understanding of the user’s behavioral patterns.
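The interplay of the two mechanisms can be sketched as follows. This is an illustrative toy, not the paper's code: the half-life, the trade-off weight `lam`, and the use of Jaccard token overlap as a similarity proxy are all assumptions.

```python
def jaccard(a, b):
    """Token-set overlap, a stand-in for whatever trace similarity is used."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def decayed_mmr(traces, relevance, ages, k, half_life=7.0, lam=0.7):
    """Select k traces: decay relevance by age, then pick greedily with MMR."""
    # Temporal decay: halve a trace's weight every `half_life` days of age.
    rel = [r * 0.5 ** (age / half_life) for r, age in zip(relevance, ages)]
    selected, remaining = [], list(range(len(traces)))
    while remaining and len(selected) < k:
        def mmr(i):
            # Penalize similarity to anything already selected.
            redundancy = max((jaccard(traces[i], traces[j]) for j in selected),
                             default=0.0)
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected

traces = [["search", "flights", "paris"],
          ["search", "flights", "paris", "dates"],
          ["edit", "budget", "spreadsheet"]]
# Trace 1 is nearly a duplicate of trace 0, so MMR skips it in favor of
# the less relevant but novel trace 2.
picked = decayed_mmr(traces, relevance=[0.9, 0.8, 0.6], ages=[1, 2, 3], k=2)
```

Note how the redundancy penalty keeps the retrieved set from collapsing onto several near-identical copies of the same episode.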

LongNAP predicts user actions by first reasoning to retrieve relevant past trajectories based on current observations, then reasoning to predict future steps, optimizing its performance via comparison to ground truth and reinforcement learning with GRPO (Shao et al., 2024b).
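The cited GRPO algorithm replaces a learned value network with group-relative reward normalization: several rollouts are sampled per context, and each rollout's advantage is its reward standardized against the group. A minimal sketch of that advantage computation (the reward values below are hypothetical LLM-judge scores, not from the paper):

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize each rollout's reward by the
    mean and standard deviation of its sampling group (no value network)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical judge scores for one group of sampled action predictions.
judge_scores = [0.10, 0.42, 0.38, 0.10]
adv = grpo_advantages(judge_scores)
```

Predictions judged more similar to the ground-truth future actions than their group peers receive positive advantages and are reinforced; the rest are suppressed.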

NAPsack: Gathering the Fuel for the Algorithm

NAPsack is a system designed to automatically gather and annotate user interaction data, forming the foundation for training and assessing LongNAP models. The pipeline operates passively, collecting data from user activity without requiring active participation or explicit labeling from users. Collected interactions include both visual elements, such as screenshots or application windows, and associated textual data like user input or application logs. This data is then processed to create a labeled dataset suitable for supervised learning tasks, specifically focusing on understanding and predicting long-form user intentions and actions. The resulting dataset provides a scalable and efficient means to develop and benchmark LongNAP’s performance in complex, multi-step interaction scenarios.

NAPsack employs Vision-Language Models (VLMs) to process user interaction data by extracting and annotating both visual and textual content. These VLMs analyze screenshots captured during user sessions to identify visual elements such as buttons, icons, and displayed text. Simultaneously, they interpret any textual input provided by the user, or present within the interface. The output of this analysis is structured annotation data, linking specific visual regions to corresponding textual descriptions or actions, effectively creating a machine-readable representation of the user’s interaction. This automated process allows NAPsack to generate labels for large volumes of interaction data without requiring manual annotation.

Automated labeling within the NAPsack pipeline substantially decreases the resource requirements traditionally associated with dataset creation for interaction modeling. Manual annotation of visual and textual data is a time-consuming and expensive process; NAPsack’s use of Vision-Language Models circumvents this limitation by providing programmatic labeling capabilities. This automation facilitates the generation of datasets containing millions of examples, scaling data collection beyond the practical limits of manual efforts. The resulting large-scale datasets, coupled with the inherent consistency of machine-generated labels, contribute to improved model training and evaluation, specifically for LongNAP and similar interaction-based systems.
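The compression idea behind this pipeline – keep only frames where something actually happened – can be sketched with a toy filter. The frame schema, hash-based screen diff, and event strings below are all hypothetical; a real system would use perceptual diffing of screenshots.

```python
def compress_stream(frames):
    """Keep only 'action frames': frames carrying input events, or whose
    screen content changed since the last kept frame."""
    kept, last_screen = [], None
    for frame in frames:
        if frame["events"] or frame["screen_hash"] != last_screen:
            kept.append(frame)
            last_screen = frame["screen_hash"]
    return kept

stream = [
    {"t": 0, "screen_hash": "a1", "events": ["click(120, 44)"]},
    {"t": 1, "screen_hash": "a1", "events": []},          # idle, unchanged
    {"t": 2, "screen_hash": "a1", "events": []},          # idle, unchanged
    {"t": 3, "screen_hash": "b7", "events": []},          # screen changed
    {"t": 4, "screen_hash": "b7", "events": ["type('q')"]},
]
kept = compress_stream(stream)  # idle duplicates dropped
```

Filtering of this kind is what makes month-long, ~1,800-hour recordings tractable for downstream annotation.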

NAPsack passively collects and annotates human-computer interaction data by compressing screenshots and input events to retain only meaningful action frames.

The LLM-Judge: An Automated Arbiter of Prediction Quality

A critical component of evaluating LongNAP’s performance lies in the implementation of an LLM-Judge, a specialized large language model meticulously trained to discern the accuracy and relevance of predicted actions. This automated assessment system moves beyond simple pass/fail metrics, providing a nuanced evaluation of each prediction’s quality. The LLM-Judge operates as an objective arbiter, consistently applying predefined criteria to a vast dataset of LongNAP’s outputs, thereby enabling continuous refinement and improvement of the model’s predictive capabilities. This approach ensures a rigorous and scalable evaluation process, crucial for advancing the field of long-term behavior prediction.

To rigorously evaluate the performance of predictive models, an automated assessment system, termed the LLM-Judge, was implemented. This system leverages the capabilities of a large language model to provide an objective and consistent evaluation of prediction quality, moving beyond subjective human assessments. The LLM-Judge analyzes predicted actions against expected outcomes, generating a quantifiable score that reflects accuracy and relevance. Crucially, this automated feedback loop allows for continuous improvement of the model; errors are flagged, patterns are identified, and the model can be iteratively refined based on the LLM-Judge’s impartial evaluations. This objective analysis ensures that progress isn’t hampered by bias and accelerates the development of more reliable predictive capabilities.
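Mechanically, an LLM-judge reduces to building a scoring prompt and parsing a number out of the model's reply. The rubric wording below is invented for illustration – the paper's actual judging prompt is not reproduced here – and the model call itself is omitted.

```python
import re

def build_judge_prompt(predicted, actual):
    """Hypothetical rubric asking the judge for a similarity score in [0, 1]."""
    return (
        "Rate how similar the predicted next actions are to the actions the "
        "user actually took, on a scale from 0.0 (unrelated) to 1.0 (identical). "
        "Reply with just the number.\n"
        f"Predicted: {predicted}\nActual: {actual}\nScore:"
    )

def parse_judge_score(reply):
    """Extract the first number from the judge's reply, clamped to [0, 1]."""
    m = re.search(r"\d*\.?\d+", reply)
    if m is None:
        return None
    return min(max(float(m.group()), 0.0), 1.0)

prompt = build_judge_prompt("open calendar, create event",
                            "open calendar, check email")
score = parse_judge_score("The similarity is 0.6 because the first step matches.")
```

Averaging such scores over many held-out examples yields an automated metric like the 0.38 LLM-Judge score reported for LongNAP.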

Evaluations reveal LongNAP significantly outperforms existing methods in predicting user actions, as quantified by an LLM-Judge score of 0.38. This represents a substantial improvement over the supervised finetuning (SFT) baseline, achieving a 79% increase in accuracy assessment, and a more than twofold improvement – a 106% increase – when contrasted with zero-shot prompting techniques. The LLM-Judge score offers a robust, automated metric for gauging prediction quality, highlighting LongNAP’s enhanced capability to accurately anticipate user behavior compared to conventional approaches and establishing a new benchmark for performance in this domain.

LongNAP exhibits a notable capacity for anticipating user actions, as evidenced by its performance metrics focused on successful predictions within a limited number of attempts. Specifically, the model achieves a Pass@1 rate of 17.1%, meaning it correctly forecasts the user’s next step as its top prediction nearly 17% of the time. Expanding on this, the Pass@20 rate reaches 36.3%, indicating that within its top 20 predictions, LongNAP accurately identifies the user’s subsequent action over a third of the time. These rates collectively demonstrate the model’s effectiveness in learning and generalizing user behavior patterns, suggesting a strong foundation for building more intuitive and responsive AI assistants.
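The Pass@k metric quoted above has a simple empirical form: the fraction of examples for which at least one of the first k sampled predictions is judged correct. A minimal sketch with made-up correctness data:

```python
def pass_at_k(results, k):
    """Empirical Pass@k: share of examples where at least one of the first
    k sampled predictions was judged correct."""
    hits = sum(1 for attempts in results if any(attempts[:k]))
    return hits / len(results)

# Each inner list: per-attempt correctness for one example's predictions.
results = [
    [True,  False, False],   # solved on the first attempt
    [False, False, True],    # solved only by the third attempt
    [False, False, False],   # never solved
    [False, True,  False],   # solved on the second attempt
]
p1 = pass_at_k(results, 1)   # top prediction only
p3 = pass_at_k(results, 3)   # any of the top three
```

Pass@1 = 17.1% and Pass@20 = 36.3% in the paper correspond to this quantity at k = 1 and k = 20.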

LongNAP achieves substantial performance gains, exceeding the strongest baseline by at least 39.4% as measured by the LLM-Judge’s assessment of similarity to ground-truth future actions (averaged across 20 user-specific models).

The pursuit of predictive modeling, as evidenced by LongNAP’s attempt to anticipate user actions, feels predictably…optimistic. It’s a familiar cycle: a clever framework promising to solve the chaos of human-computer interaction. LongNAP, with its long-context approach and multimodal data, undoubtedly achieves a performance boost, but one suspects production environments will quickly reveal edge cases and unforeseen user behaviors. As Claude Shannon observed, “The most important thing in communication is to have something to say.” All the context in the world won’t help if the ‘something’ is predicting what a user thought they wanted, versus what they actually do. It’s elegant, sure, but everything new is just the old thing with worse docs.

What Comes Next?

LongNAP demonstrates a predictable advance – longer contexts generally yield marginally better predictions, at exponentially increasing computational cost. The real question isn’t whether models can ingest a user’s entire digital life, but whether doing so offers a return on investment beyond a statistically significant, yet practically negligible, improvement in anticipatory accuracy. Tests are a form of faith, not certainty; production will invariably reveal edge cases where extended history becomes noise, not signal.

The current focus on multimodal integration feels particularly optimistic. The assumption that correlating mouse movements with application launches unlocks some fundamental truth about user intent ignores the chaos of daily work. Humans are wonderfully inconsistent. Models that demand coherence will inevitably fail to account for interruptions, distractions, and the simple desire to occasionally do things randomly.

Future work will likely revolve around resource management – distillation, pruning, and other techniques to force these behemoths into something resembling practical deployment. But the core issue remains: automation will not save anyone. It will merely shift the burden of error from human mistake to algorithmic opacity. The interesting failures, as always, will be far more instructive than the carefully curated successes.


Original article: https://arxiv.org/pdf/2603.05923.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
