- Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds
Weihao Tan, Xiangyang Li, Yunhao Fang, Heyuan Yao, Shi Yan, Hao Luo, Tenglong Ao, Huihui Li, Hongbin Ren, Bairen Yi, Yujia Qin, Bo An, Libin Liu, Guang Shi — 2025
- Questioning the Stability of Visual Question Answering
Amir Rosenfeld, Neta Glazer, Ethan Fetaya — 2025
- OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
Yanqing Liu, Xianhang Li, Letian Zhang, Zirui Wang, Zeyu Zheng, Yuyin Zhou, Cihang Xie — 2025
- Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans — 2025
- The Term 'Agent' Has Been Diluted Beyond Utility and Requires Redefinition
Brinnae Bent — 2025
- Vision Language Models are Biased
An Vo, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, Daeyoung Kim — 2025
- GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
GLM-4.5 Team: Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, et al. — 2025
- Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents
Han Lin, Jaemin Cho, Amir Zadeh, Chuan Li, Mohit Bansal — 2025
- Seed1.5-VL Technical Report
Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, et al. — 2025
- Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan — 2025
- Harnessing the Universal Geometry of Embeddings
Rishi Jha, Collin Zhang, Vitaly Shmatikov, John X. Morris — 2025
- Transfer between Modalities with MetaQueries
Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, Ji Hou, Saining Xie — 2025
- Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, Junxian He — 2025
- Rethinking Visual Layer Selection in Multimodal LLMs
Haoran Chen, Junyan Lin, Xinhao Chen, Yue Fan, Xin Jin, Hui Su, Jianfeng Dong, Jinlan Fu, Xiaoyu Shen — 2025
- Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, Christoph Feichtenhofer — 2025
- Scaling Laws for Native Multimodal Models
Mustafa Shukor, Enrico Fini, Victor Guilherme Turrisi da Costa, Matthieu Cord, Joshua Susskind, Alaaeldin El-Nouby — 2025
- Science-T2I: Addressing Scientific Illusions in Image Synthesis
Jialuo Li, Wenhao Chai, Xingyu Fu, Haiyang Xu, Saining Xie — 2025
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models
Xu Ma, Peize Sun, Haoyu Ma, Hao Tang, Chih-Yao Ma, Jialiang Wang, Kunpeng Li, Xiaoliang Dai, Yujun Shi, Xuan Ju, Yushi Hu, Artsiom Sanakoyeu, Felix Juefei-Xu, Ji Hou, Junjiao Tian, Tao Xu, Tingbo Hou, Yen-Cheng Liu, Zecheng He, Zijian He, Matt Feiszli, Peizhao Zhang, Peter Vajda, Sam Tsai, Yun Fu — 2025
- Scaling Language-Free Visual Representation Learning
David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, Saining Xie — 2025
- A Decade's Battle on Dataset Bias: Are We There Yet?
Zhuang Liu, Kaiming He — 2024
- Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models
Samuel Stevens, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su — 2025
- Pretrained Transformers as Universal Computation Engines
Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch — 2021
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein — 2025
- Are Vision Language Models Texture or Shape Biased and Can We Steer Them?
Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Bianca Lamm, Muhammad Jehanzeb Mirza, Margret Keuper, Janis Keuper — 2024
- Open Problems in Mechanistic Interpretability
Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath — 2025
- Why Do We Need Weight Decay in Modern Deep Learning?
Francesco D'Angelo, Maksym Andriushchenko, Aditya Varre, Nicolas Flammarion — 2023
- Vision-Language Models Do Not Understand Negation
Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip Torr, Yoon Kim, Marzyeh Ghassemi — 2025
- ICONS: Influence Consensus for Vision-Language Data Selection
Xindi Wu, Mengzhou Xia, Rulin Shao, Zhiwei Deng, Pang Wei Koh, Olga Russakovsky — 2024
- The GAN is dead; long live the GAN! A Modern GAN Baseline
Yiwen Huang, Aaron Gokaslan, Volodymyr Kuleshov, James Tompkin — 2025
- The Unbearable Slowness of Being: Why do we live at 10 bits/s?
Jieyu Zheng, Markus Meister — 2024
- Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
Yohan Mathew, Ollie Matthews, Robert McCarthy, Joan Velja, Christian Schroeder de Witt, Dylan Cope, Nandi Schoots — 2024
- Analyzing (In)Abilities of SAEs via Formal Languages
Abhinav Menon, Manish Shrivastava, David Krueger, Ekdeep Singh Lubana — 2024
- MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, Zhuang Liu — 2024