Research | June 9, 2026 | Embodied Global Team

MemoryVLA++ Enables VLA Models with Human-Like Memory and Imagination for Robotic Manipulation

Researchers introduce MemoryVLA++, a temporal modeling framework that equips Vision-Language-Action models with memory and imagination capabilities, enabling robots to maintain context across long-horizon manipulation tasks and anticipate future states.

MemoryVLA++: Bridging Memory and Imagination in Robotic Manipulation

Researchers from multiple institutions have introduced MemoryVLA++, a full temporal modeling framework that equips Vision-Language-Action (VLA) models with memory and imagination capabilities for robotic manipulation.

The Temporal Modeling Challenge

Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore struggle with long-horizon, temporally dependent tasks.
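
To make the limitation concrete, here is a minimal toy sketch (not taken from the paper) of a task in which the camera view looks the same before and after a button press. The task, the observation string, and the two policy functions are hypothetical names invented for illustration.

```python
from typing import List


def memoryless_policy(obs: str) -> str:
    # Sees only the current frame, which looks identical before and after the press.
    return "press" if obs == "button_visible" else "stop"


def history_policy(obs: str, past_actions: List[str]) -> str:
    # Also conditions on what was already done, so it can tell the two cases apart.
    if "press" in past_actions:
        return "stop"
    return "press" if obs == "button_visible" else "stop"


def rollout(use_history: bool, steps: int = 3) -> List[str]:
    actions: List[str] = []
    for _ in range(steps):
        obs = "button_visible"  # the observation never changes in this toy task
        if use_history:
            action = history_policy(obs, actions)
        else:
            action = memoryless_policy(obs)
        actions.append(action)
    return actions


print(rollout(use_history=False))  # ['press', 'press', 'press']
print(rollout(use_history=True))   # ['press', 'stop', 'stop']
```

The memoryless policy keeps pressing because nothing in the current observation tells it the press already happened, which is the kind of failure that motivates adding memory.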

Cognitive-Inspired Architecture

Inspired by cognitive science, MemoryVLA++ draws from three human cognitive mechanisms:

  1. Working Memory: Buffers short-lived context from current observations
  2. Episodic Memory: Preserves experiences from past interactions
  3. Internal Models: Imagines possible future state evolution
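
As a rough analogy only, the three mechanisms can be pictured as three small data structures. The sketch below assumes a fixed-size buffer for working memory, a similarity-searched store for episodic memory, and a constant-velocity extrapolation for the internal model. All class and function names are hypothetical, and none of this reflects how MemoryVLA++ actually implements them.

```python
import math
from collections import deque
from typing import Deque, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class WorkingMemory:
    """Short-lived context: only the most recent `capacity` features survive."""

    def __init__(self, capacity: int = 4) -> None:
        self.buffer: Deque[Vector] = deque(maxlen=capacity)

    def write(self, feature: Vector) -> None:
        self.buffer.append(feature)  # the oldest entry is dropped when full


class EpisodicMemory:
    """Long-lived store: keeps every feature and retrieves by similarity."""

    def __init__(self) -> None:
        self.entries: List[Vector] = []

    def write(self, feature: Vector) -> None:
        self.entries.append(list(feature))

    def retrieve(self, query: Vector, k: int = 1) -> List[Vector]:
        ranked = sorted(self.entries, key=lambda e: cosine(query, e), reverse=True)
        return ranked[:k]


def imagine_next(previous: Vector, current: Vector) -> Vector:
    """Toy internal model: assume the state keeps changing at the same rate."""
    return [c + (c - p) for p, c in zip(previous, current)]
```

The split matters because the two memories answer different questions: the buffer says what just happened, while the store can surface a relevant experience from much earlier in the episode.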

Technical Implementation

The framework consists of several key components:

  • A pretrained Vision-Language Model (VLM) encodes current observations into perceptual and cognitive tokens
  • A Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics from past interactions
  • A world model imagines future states in a denoising latent space
  • A diffusion action expert predicts temporally consistent action sequences
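
The data flow between these four components might be sketched as follows. This is a minimal stand-in under strong assumptions: every function here (encode, imagine, act, control_step) and the MemoryBank class are hypothetical placeholders built from plain arithmetic rather than the learned networks the article describes, and the order of operations is a guess from the component list above.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

Vector = List[float]


def encode(observation: Vector) -> Tuple[Vector, Vector]:
    """Stand-in for the VLM: returns (perceptual, cognitive) tokens."""
    perceptual = list(observation)                     # low-level detail: a copy
    cognitive = [sum(observation) / len(observation)]  # high-level summary: the mean
    return perceptual, cognitive


@dataclass
class MemoryBank:
    perceptual: List[Vector] = field(default_factory=list)
    cognitive: List[Vector] = field(default_factory=list)

    def store(self, p: Vector, c: Vector) -> None:
        self.perceptual.append(p)
        self.cognitive.append(c)

    def recall(self, fallback: Vector) -> Vector:
        """Average the stored perceptual tokens; use `fallback` when empty.
        Cognitive tokens are stored but not used by this toy recall."""
        if not self.perceptual:
            return list(fallback)
        n = len(self.perceptual)
        return [sum(v[i] for v in self.perceptual) / n for i in range(len(fallback))]


def imagine(context: Vector, steps: int = 4, seed: int = 0) -> Vector:
    """Toy denoising: start from noise and move halfway to the context each step."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in context]
    for _ in range(steps):
        z = [zi + 0.5 * (ci - zi) for zi, ci in zip(z, context)]
    return z


def act(current: Vector, recalled: Vector, imagined: Vector,
        horizon: int = 3) -> List[Vector]:
    """Stand-in for the action expert: a chunk of `horizon` waypoints moving
    from the current state toward a blend of memory and imagination."""
    target = [(r + f) / 2 for r, f in zip(recalled, imagined)]
    return [[c + (t / horizon) * (g - c) for c, g in zip(current, target)]
            for t in range(1, horizon + 1)]


def control_step(observation: Vector, bank: MemoryBank) -> List[Vector]:
    p, c = encode(observation)
    recalled = bank.recall(fallback=p)  # read the past before writing the present
    imagined = imagine(p)
    actions = act(p, recalled, imagined)
    bank.store(p, c)
    return actions
```

Calling control_step repeatedly with the same MemoryBank makes each new action chunk depend on everything stored so far, which is the property the component list is meant to provide.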

Experimental Results

MemoryVLA++ was evaluated across 5 simulation benchmarks and 3 categories of real-robot tasks:

Task Type                      Performance Gain
General Manipulation           +9%
Memory-Dependent Tasks         +26%
Imagination-Dependent Tasks    +28%

The method performs strongly on all five simulation benchmarks (LIBERO, SimplerEnv, MIKASA-Robo, CALVIN, and LIBERO-Plus), results that support the effectiveness of full temporal modeling with memory and imagination.

Implications for Embodied AI

This research is a step toward robots that can maintain coherent task context over extended periods, handle interruptions gracefully, and plan ahead. These capabilities are essential for real-world deployment in complex environments.