A new research paper introduces MemoryWAM, an efficient world action model with persistent memory designed for long-horizon robot manipulation tasks. Published on arXiv in June 2026, the work addresses a fundamental trade-off in world action models (WAMs): efficient methods typically condition on only a few recent observations and struggle in non-Markovian environments, where the right action depends on events that are no longer visible in those observations, while methods that retain long-term history incur prohibitive time and space costs.
MemoryWAM employs a hybrid memory structure that integrates three types of information: recent frames for fine-grained short-term context, event boundary anchor frames capturing key transition moments, and compact 'gist tokens' that summarize long-range historical information. A custom attention mechanism simultaneously retrieves detailed short-term context and highly compressed long-term context, significantly reducing inference latency and GPU memory usage while supporting memory-dependent decision-making.
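The three memory types described above can be illustrated with a small sketch. This is not the paper's implementation: the class name `HybridMemory`, the parameters `recent_size` and `gist_chunk`, the caller-supplied `is_event_boundary` flag, and the use of mean pooling to form gist tokens are all assumptions chosen for illustration. The source does not describe how event boundaries are detected or how gist tokens are computed.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List

Vector = List[float]  # stand-in for a frame embedding


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


@dataclass
class HybridMemory:
    """Illustrative hybrid memory (hypothetical, not the paper's code)."""

    recent_size: int = 4  # assumed number of recent frames kept in full detail
    gist_chunk: int = 4   # assumed number of evicted frames summarized per gist token
    recent: Deque[Vector] = field(default_factory=deque)
    anchors: List[Vector] = field(default_factory=list)
    gists: List[Vector] = field(default_factory=list)
    _pending: List[Vector] = field(default_factory=list)

    def add(self, frame: Vector, is_event_boundary: bool = False) -> None:
        # Event boundary frames are stored permanently as anchors.
        if is_event_boundary:
            self.anchors.append(frame)
        self.recent.append(frame)
        # Frames that fall out of the recent window are compressed, not dropped.
        if len(self.recent) > self.recent_size:
            self._pending.append(self.recent.popleft())
            if len(self._pending) == self.gist_chunk:
                # Assumption: a gist token is the mean of the evicted frames.
                self.gists.append(mean_pool(self._pending))
                self._pending = []

    def context(self) -> List[Vector]:
        # Compressed long-range context first, then anchors, then recent detail.
        # An anchor still inside the recent window appears twice in this sketch.
        return self.gists + self.anchors + list(self.recent)


mem = HybridMemory(recent_size=4, gist_chunk=4)
for i in range(12):
    mem.add([float(i), float(i)], is_event_boundary=(i == 5))

print(len(mem.context()))  # 7 context entries for 12 observed frames
```

In this toy run, 12 frames reduce to 7 context entries (2 gist tokens, 1 anchor, 4 recent frames). The recent window and anchor set aside, the context grows by one token per `gist_chunk` evicted frames rather than one per frame, which is the kind of saving the hybrid structure is meant to provide.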
The model was evaluated on a series of long-horizon, memory-dependent manipulation tasks in both simulation and real-world environments. According to the paper, MemoryWAM significantly outperforms strong Vision-Language-Action (VLA) models and several WAM baselines on these tasks while keeping inference latency and GPU memory usage low.
This research is a step toward robots that can operate effectively in complex, real-world environments where tasks require retaining information across extended sequences. The hybrid memory approach offers a practical answer to the problem of scaling context length in embodied AI systems, which has been a major bottleneck for deploying foundation models in physical robotics.

