XPENG Releases World Model Technical Report, Powering VLA 2.0 Model R&D And Verification

Summary By: eMotoX

XPENG has released its X-World Technical Report, which details the development and deployment of a generative world model built specifically for autonomous driving. X-World uses video diffusion to produce real-time, multi-view video simulations that are controllable and continuous: given historical multi-camera footage and specified vehicle actions, it generates the future driving scenes that follow. The model is already embedded in XPENG’s autonomous driving ecosystem, where it supports closed-loop simulation, online reinforcement learning, and data synthesis, in particular the research and validation of the company’s VLA 2.0 autonomous driving model.
The report highlights the limitations of traditional simulation methods, which often rely on 3D Gaussian Splatting and struggle to simulate scenarios that deviate significantly from recorded trajectories, such as abrupt lane changes or detours. That shortfall has forced a reliance on costly and time-consuming real-world testing. In response, XPENG’s team developed X-World as a “real-world simulator” that generates physically plausible future video sequences while remaining highly controllable and stable.
The model’s architecture combines a video variational autoencoder with a DiT-based (Diffusion Transformer) latent space denoiser, enabling efficient long-sequence video generation with consistent perspectives across seven cameras. A key technical achievement of X-World is its ability to maintain cross-view 3D consistency and accurately follow specified driving actions over extended video sequences. The model operates in a streaming autoregressive manner, generating future frames progressively and interactively, which makes it particularly suited to closed-loop testing and reinforcement learning.
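To make the streaming, closed-loop idea concrete, here is a minimal Python sketch. It is not XPENG’s implementation: every name in it (`toy_world_model`, `toy_policy`, `closed_loop_rollout`, `context_len`) is hypothetical, each "frame" is reduced to one number per camera, and the "world model" is a trivial arithmetic stand-in rather than a diffusion model. It only illustrates the loop structure the report describes, in which a policy picks an action from the latest generated views, the model generates the next multi-view frame from recent history plus that action, and the result is fed back in.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

NUM_CAMERAS = 7  # the report describes seven camera views

Frame = List[float]           # assumption: one toy "latent" value per camera
Action = Tuple[float, float]  # assumption: (steering, acceleration)


def toy_world_model(history: List[Frame], action: Action) -> Frame:
    """Hypothetical stand-in for the generative model: next multi-view frame."""
    steering, accel = action
    last = history[-1]
    # Toy dynamics only: every camera value shifts by the same action-dependent
    # amount, so all views stay mutually consistent by construction.
    return [v + accel + 0.1 * steering for v in last]


def toy_policy(frame: Frame) -> Action:
    """Hypothetical stand-in for the driving policy under test."""
    mean = sum(frame) / len(frame)
    return (0.0, -0.5 * mean)  # no steering; accelerate to pull the mean toward 0


def closed_loop_rollout(
    world_model: Callable[[List[Frame], Action], Frame],
    policy: Callable[[Frame], Action],
    initial_history: List[Frame],
    steps: int,
    context_len: int = 4,
) -> List[Frame]:
    """Generate frames one step at a time, feeding each output back as input."""
    # A bounded window keeps the per-step cost constant for long rollouts.
    context: Deque[Frame] = deque(initial_history, maxlen=context_len)
    generated: List[Frame] = []
    for _ in range(steps):
        action = policy(context[-1])                # policy reacts to the latest views
        frame = world_model(list(context), action)  # model conditions on history + action
        context.append(frame)                       # streaming: output becomes new history
        generated.append(frame)
    return generated


if __name__ == "__main__":
    history = [[1.0] * NUM_CAMERAS, [1.0] * NUM_CAMERAS]
    frames = closed_loop_rollout(toy_world_model, toy_policy, history, steps=3)
    print(len(frames), len(frames[0]))
```

Because the policy’s actions depend on generated frames rather than on a recorded log, the rollout can leave any recorded trajectory, which is the property the report contrasts with replay-based simulation.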
XPENG’s two-stage training process first transforms a pre-trained video generation model into a controllable multi-camera system, then refines it into a streaming simulator using novel architectures and learning techniques. According to the report, experimental results show that X-World delivers high-quality, multi-view video generation that closely aligns with real-world driving behaviours.
Beyond its technical sophistication, X-World serves as a foundational platform for XPENG’s VLA 2.0 system, providing scalable and reproducible testing environments for developing and verifying autonomous driving policies. By simulating diverse driving scenarios, the model supports more efficient regression testing and interactive learning, potentially reducing reliance on expensive real-world trials. The release of the technical report marks a significant step towards practical, AI-driven simulation tools that could accelerate the advancement of autonomous vehicle technologies.
