MAR 2025 - Multimodal Algorithmic Reasoning

About MAR 2025

In this workshop, we plan to gather researchers working in neural algorithmic learning, multimodal reasoning, and cognitive models of intelligence to showcase their cutting-edge research, discuss the latest challenges, and bring to the forefront problems in perception and language modeling that are often overlooked but are pivotal to achieving true artificial general intelligence. An emphasis of this workshop is the emerging topic of multimodal algorithmic reasoning, where a reasoning agent is required to automatically deduce new algorithms or procedures for solving real-world tasks. Examples include algorithms that use multimodal foundation models for analysis, synthesis, and planning; new approaches to solving challenging vision-and-language mathematical (Olympiad-type) reasoning problems; deriving winning strategies in multimodal games; and procedures for using tools in robotic manipulation. We hope to dive deep into this exciting topic at the intersection of multimodal learning and cognitive science to understand what we have achieved thus far in machine intelligence and what we are lacking in relation to the human way of thinking, through talks from outstanding researchers and faculty that could inspire the audience to search for the missing rungs on the ladder to true intelligence.

Where

Room 207 A-D, Music City Center, Nashville, TN, USA

When

1:40 PM - 6:00 PM CST on June 11, 2025

Keynote Speakers

Cordelia Schmid

INRIA

Heng Ji

UIUC

Rishabh Agarwal

Meta & McGill University

Brenden Lake

NYU

MAR 2025 Schedule

[in Nashville local time (CST)]

01:40 PM

Opening Remarks Anoop Cherian

[video]

01:45 PM

Keynote Cordelia Schmid

Multi-stage reasoning for video understanding & scene generation.

[video]

02:15 PM

Keynote Heng Ji

Multimodal Reasoning for Drug Discovery.

02:45 PM

Spotlight Paper Presentation

Zhenhailong Wang et al., Visually Descriptive Language Model for Vector Graphics Reasoning.

Sadegh Rahmaniboldaji et al., Human vs. Machine Minds: Ego-Centric Action Recognition Compared.

Yiqiao Huang et al., Autonomous Multimodal Reasoning via Implicit Chain-of-Vision.

Zhangquan Chen et al., VisRL: Intention-Driven Visual Perception via Reinforced Reasoning.

03:15 PM

Coffee Break

03:35 PM

Spotlight Paper Presentation

Tan-Hanh Pham et al., SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging.

Aniket Rajiv Didolkar et al., CTRL-O: Language-Controllable Object-Centric Visual Representation Learning.

Mohammadmostafa Rostamkhani et al., Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions.

Rabiul Awal et al., WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation.

04:00 PM

Keynote Brenden Lake

How to represent “goals” in minds and machines: Goals as reward-producing programs.

04:30 PM

Keynote Rishabh Agarwal

The Bitter Lesson for RL: Verification as the Key to Reasoning LLMs.

[video]

05:00 PM

Closing Remarks Kuan-Chuan Peng

05:05 PM - 06:00 PM

Poster Session

All the accepted papers.

Submissions

[Call for Contributions PDF]

Call for Contributions

Deep learning–powered AI systems have rapidly advanced in their data modeling capabilities, yielding compelling applications that often seem to rival human intelligence. Despite these impressive achievements, questions remain about whether these systems possess the foundational elements of general intelligence, or whether they simply excel at task-specific computations without human-like understanding. Addressing these questions calls for new methods of both developing and assessing such models.

In this workshop, we aim to bring together researchers working in neural algorithmic learning, multimodal reasoning, and cognitive models of intelligence to showcase cutting-edge research, tackle current challenges, and highlight critical yet underexplored problems in perception and language modeling, issues at the core of achieving true artificial general intelligence. A key focus is on the emerging field of multimodal algorithmic reasoning, which explores neural representations of algorithms to devise novel solutions for real-world tasks. These tasks span a wide range of areas, including multimodal alignment; algorithms over foundation models for analysis, synthesis, or planning; mathematical problem-solving; procedural learning in robotic manipulation; and more.

Our goal is to delve deeply into this exciting intersection of multimodal algorithmic learning and cognitive science, reflecting on the current progress in machine intelligence while examining the gaps that distinguish it from human cognition. Through talks by leading researchers and faculty, we aim to inspire participants to explore the "missing rungs" on the ladder to true intelligence.

We invite you to submit high-quality papers that propose innovative approaches, theoretical insights, or practical applications advancing this exciting field, and that foster meaningful discussions and collaborations.

Important Dates

Paper submission deadline: ~~March 12~~ March 19, 2025 (11:59pm PDT)
~~Rebuttal (optional): March 25-26 2025.~~
Notification to authors: April 3, 2025.
Camera-ready deadline: ~~April 7, 2025 (11:59pm PDT)~~ April 14, 2025.

Topics

We invite submissions of high-quality research papers on topics related to multimodal algorithmic reasoning. The topics for MAR 2025 include, but are not limited to:

Multimodal machine reasoning.
Algorithmic reasoning in vision, including program synthesis, planning, and procedural learning.
Neural architectures and approaches for mathematical reasoning.
Architectures for aligning/integrating multimodal foundation models, including vision, language, audio, and 3D content.
Architectures for solving abstract multimodal reasoning/language-based IQ puzzles, e.g., using sketches, diagrams, audio-visual clips.
New tasks, datasets, benchmarks, and models for multimodal reasoning including algorithmic reasoning, neuro-symbolic reasoning, abstract reasoning, and mathematical reasoning.
Extreme generalization to new tasks and few-shot concept induction.
Synthetic data and automatic verification for reasoning.
Multimodal agents, including programmable agents, tool-use agents, etc., for reasoning tasks.
Position papers on novel perspectives to understand AI and human problem solving.
Studies comparing AI and human problem solving skills, including but not limited to:

Perspectives from psychology, neuroscience, and educational science.
Children's cognitive development.
Limitations of large vision-and-language models.

Submission Instructions

We have two tracks for paper submissions:

Track 1: Papers with IEEE/CVF workshop proceedings (≤ 8 pages)
Track 2: Papers without workshop proceedings (≤ 8 pages)

For track 1, we invite only original, previously unpublished papers; dual submissions are not allowed. The page limits above exclude references. Papers accepted to track 2 will not be included in the proceedings but will be publicly shared on the workshop website. Submissions to track 2 can be novel or ongoing work (limited to 4 pages) or accepted or previously published papers (limited to 8 pages), both excluding references.

All submissions are handled via the workshop’s CMT website.
Submissions should be made in PDF format and should follow the official CVPR 2025 template and guidelines.
Papers accepted in track 1 will be part of the CVPR 2025 workshop proceedings.
Authors may upload optional supplementary materials containing additional details, videos, images, etc. in a separate zip file (max 50MB); the deadline for submitting supplementary materials is the same as that for the main paper.
All submissions should maintain author anonymity and should abide by the CVPR 2025 conference guidelines for double-blind review.
Accepted papers will be presented as oral, spotlight, or poster presentations. At least one author of each accepted submission must present the paper at the workshop in person.
Presentation of accepted papers at our workshop will follow the same policy as that for accepted papers at the CVPR 2025 main conference.
Accepted papers will be made publicly accessible on the workshop website shortly after the camera-ready deadline. CVPR 2025 will provide the official proceedings of the accepted papers.
By submitting a paper to the MAR 2025 workshop for review, the authors agree that they are willing and able to serve as reviewers of MAR 2025 workshop submissions if needed (as decided by the MAR 2025 workshop organizing team).

Contact

Email: smart101@googlegroups.com

Accepted Papers

[Workshop Proceedings]

All the accepted papers will be presented in the poster session. The poster boards #443 - #470 in Exhibit Hall D at the venue are assigned to the MAR 2025 workshop at 5pm - 6pm on June 11, 2025. The number in front of each paper is the assigned poster board number.

Spotlight Papers

[445] Visually Descriptive Language Model for Vector Graphics Reasoning.
Wang, Zhenhailong; Hsu, Joy; Wang, Xingyao; Huang, Kuan-Hao; Li, Manling; Wu, Jiajun; Ji, Heng
[supplement]

[446] Human vs. Machine Minds: Ego-Centric Action Recognition Compared.
Rahmaniboldaji, Sadegh; Rybansky, Filip; Vuong, Quoc; Guerin, Frank; Gilbert, Andrew

[447] Autonomous Multimodal Reasoning via Implicit Chain-of-Vision.
Huang, Yiqiao; Chen, Zhaorun; Qi, He; Zhang, Haopeng; Yu, Hanchao; Zhao, Zhuokai

[448] VisRL: Intention-Driven Visual Perception via Reinforced Reasoning.
Chen, Zhangquan; Luo, Xufang; Li, Dongsheng

[449] SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging.
Pham, Tan-Hanh; Bui, Trong-Duong; Quang, Minh Luu; Pham, Tan Huong; Ngo, Chris; Hy, Truong Son

[450] CTRL-O: Language-Controllable Object-Centric Visual Representation Learning.
Didolkar, Aniket Rajiv; Zadaianchuk, Andrii; Awal, Md Rabiul; Seitzer, Maximilian; Gavves, Efstratios; Agrawal, Aishwarya

[451] Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions.
Rostamkhani, Mohammadmostafa; Ansari, Baktash; Sabzevari, Hoorieh; Rahmani, Farzan; Eetemadi, Sauleh
[supplement]

[452] WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation.
Awal, Rabiul; Massoud, Mahsa; Li, Zichao; Feizi, Aarash; Wang, Suyuchen; Pal, Christopher; Agrawal, Aishwarya; Vazquez, David; Reddy, Siva; Rodriguez, Juan; Taslakian, Perouz; Gella, Spandana; Rajeswar, Sai

Poster Papers

[453] Exemplar Masking for Multimodal Incremental Learning.
Lee, Yi-Lun; Lee, Chen-Yu; Chiu, Wei-Chen; Tsai, Yi-Hsuan
[supplement]

[454] Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?
Park, Simon; Panigrahi, Abhishek; Cheng, Yun; Yu, Dingli; Goyal, Anirudh; Arora, Sanjeev

[455] Cap2Aug: Caption guided Image data Augmentation.
Roy, Aniket; Shah, Anshul; Shah, Ketul; Roy, Anirban; Chellappa, Rama

[456] Comparison Visual Instruction Tuning.
Lin, Wei; Mirza, Jehanzeb; Doveh, Sivan; Feris, Rogerio; Giryes, Raja; Hochreiter, Sepp; Karlinsky, Leonid
[supplement]

[457] Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding.
Kumar, Akash; Kira, Zsolt; Rawat, Yogesh S.

[458] STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding.
Garg, Aaryan; Kumar, Akash; Rawat, Yogesh S.

[459] CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography.
Fang, I-Sheng; Chen, Jun-Cheng

[460] Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder.
Li, Siting; Koh, Pang Wei; Du, Simon
[supplement]

[461] Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models.
Khan, Md Azim; Gangopadhyay, Aryya; Wang, Jianwu

[462] OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance.
Wang, Chaoyi; Li, Baoqing; Di, Xinhan

[463] Controlling Multimodal LLMs via Reward-guided Decoding.
Mañas, Oscar; D'Oro, Pierluca; Sinha, Koustuv; Romero-Soriano, Adriana; Drozdzal, Michal; Agrawal, Aishwarya

MAR 2025 Venue

Music City Center, Nashville, TN, USA

MAR 2025 will be held at Room 207 A-D, Music City Center, Nashville, TN, USA at 1:40 PM - 6:00 PM CST on June 11, 2025.

Sponsor

Organizers

[Contact Email: smart101@googlegroups.com]

Program Committee

Asim Kadav	Adobe
Bimsara Pathiraja	Arizona State University
Boqi Chen	University of North Carolina at Chapel Hill
Changdi Yang	Northeastern University
Eunice Yiu	University of California, Berkeley
Hengyi Wang	Rutgers University
Ishan Dave	S. V. National Institute of Technology
Jiahao Zhang	Australian National University
Juntao Tan	Rutgers University
Junwen Chen	Amazon
Malitha Gunawardhana	University of Auckland
Moitreya Chatterjee	Mitsubishi Electric Research Laboratories
Ravindu Nagasinghe	Stony Brook University
Shijie Wang	Brown University
Siddharth Nagar Nayak	Massachusetts Institute of Technology
Tingfeng Li	Rutgers University
Tyler Zhu	Princeton University
Wenqing Wang	Northeastern University
Xinyi Yang	Salesforce Research
Yao Ni	Australian National University
Yunhe Gao	Stanford University
Zhang Dong	Amazon
Zhicheng Zheng	Princeton University
Zhuowei Li	Rutgers University
Ziyang Wang	University of North Carolina at Chapel Hill
Ziyang Luo	Hong Kong Baptist University
