arXiv (Cornell University)
Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models
May 2025 • Xiaomeng Tan, Yanzhao Yang, Hancheng Ye, Jialin Zheng, Bizhe Bai, Xinyi Wang, Hao Jia, Tao Chen
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general-purpose robot control through natural language instructions. However, their high inference cost, stemming from large-scale token computation and autoregressive decoding, poses significant challenges for real-time deployment and edge applications. While prior work has primarily focused on architectural optimization, we take a different perspective by identifying a dual form of redundancy in VLA models: (i) high similarity across conse…
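The redundancy the abstract points to, high similarity between consecutive inference steps, suggests a simple gating idea: when the current step's token representations barely differ from the previous step's, the cached action can be reused instead of rerunning the decoder. The sketch below is an illustration of that general idea under our own assumptions (the function names, the mean-cosine-similarity criterion, and the threshold are hypothetical, not the paper's actual mechanism):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two token embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def maybe_reuse_action(prev_tokens, curr_tokens, prev_action, threshold=0.98):
    """Hypothetical action-reuse gate.

    If the mean per-token similarity between consecutive steps exceeds
    `threshold`, skip decoding and return the cached action; otherwise
    signal that a fresh forward pass is needed.
    """
    sims = [cosine_sim(p, c) for p, c in zip(prev_tokens, curr_tokens)]
    if float(np.mean(sims)) >= threshold:
        return prev_action, True   # reuse cached action
    return None, False             # caller must run the full model

# Toy usage: nearly identical token sets trigger reuse; dissimilar ones do not.
rng = np.random.default_rng(0)
prev = rng.normal(size=(4, 8))
print(maybe_reuse_action(prev, prev * 1.001, "grasp")[1])
print(maybe_reuse_action(prev, -prev, "grasp")[1])
```

A real system would additionally bound how many consecutive steps may reuse an action, so that slow scene drift cannot accumulate unchecked.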