SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications Paper • 2303.15446 • Published Mar 27, 2023 • 1
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding Paper • 2406.09418 • Published Jun 13, 2024 • 1
Perception Encoder: The best visual embeddings are not at the output of the network Paper • 2504.13181 • Published Apr 17 • 34
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding Paper • 2504.13180 • Published Apr 17 • 19
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos Paper • 2506.05349 • Published Jun 5 • 24
A Culturally-diverse Multilingual Multimodal Video Benchmark & Model Paper • 2506.07032 • Published Jun 8
Video-CoM: Interactive Video Reasoning via Chain of Manipulations Paper • 2511.23477 • Published 27 days ago • 2