In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model
designed for unified multimodal understanding and generation across text, image,
video, and audio. All modalities are trained from scratch under a unified next-grou...