Multimodal Algorithmic Reasoning

(MAR)

In Conjunction with the IEEE/CVF Conference on Computer Vision and Patten Recognition 2024

9:00 AM - 12:30 PM PDT on June 17, 2024

About MAR 2024

About MAR 2024

In this workshop, we plan to gather researchers working in neural algorithmic learning, multimodal reasoning, and cognitive models of intelligence to showcase their cutting-edge research, discuss the latest challenges, as well as bring to the forefront problems in perception and language modeling that are often overlooked but are pivotal in achieving true artificial general intelligence. An emphasis of this workshop is on the emerging topic of multimodal algorithmic reasoning, where a reasoning agent is required to automatically deduce new algorithms/procedures for solving real-world tasks, e.g., algorithms that use multimodal foundational models for analysis, synthesis, and planning, new approaches towards solving challenging vision-and-language mathematical (Olympiad type) reasoning problems, deriving winning strategies in multimodal games, procedures for using tools in robotic manipulation, etc. We hope to deep dive into this exciting topic at the intersection of multimodal learning and cognitive science to understand what we have achieved thus far in machine intelligence and what we are lacking in relation to the human way of thinking -- through talks from outstanding researchers and faculty that could inspire the audience to search for the missing rungs on the ladder to true intelligence.

A second focus of MAR 2024 is to nudge the vision community to make progress on building neural networks that have human-like intelligence abilities for abstraction, inference, and generalization. To this end, we propose the SMART-101 challenge as part of this workshop. This challenge is based on the Simple Multimodal Algorithmic Reasoning Task and the SMART-101 dataset consisting of vision-and-language algorithmic reasoning puzzles. The puzzles demonstrate the need for algorithmic reasoning on abstract visual puzzles, which we believe would be a great test bed for evaluating multimodal large language models, which is a topic of great interest and excitement in computer vision currently. Overall, we strongly believe the combination of multimodal abstraction and language reasoning combined with perspectives from cognition makes our workshop unique among other workshops at CVPR 2024.

Where

Seattle Convention Center (exact room TBD)

When

9:00AM - 12:30PM PDT, Monday, June 17, 2024

Keynote Speakers

[More info about keynote speakers will be updated here]

Petar Velickovic

Petar Veličković

Google DeepMind

Tom Griffiths

Tom Griffiths

Princeton University

Pushmeet Kohli

Pushmeet Kohli

Google DeepMind

Lijuan Wang

Lijuan Wang

Microsoft GenAI

MAR 2024 Schedule

[in Seattle local time (Pacific Time)]

[More info about the schedule will be updated here]

Paper Track: Submission Instructions

We welcome four sub-tracks for paper submissions:
  1. Short papers for IEEE/CVF workshop proceedings (≤ 4 pages)
  2. Long papers for IEEE/CVF workshop proceedings (≤ 8 pages)
  3. Papers without proceedings (≤ 8 pages), and
  4. Previously published papers (≤ 8 pages).
Please submit your paper(s) at the MAR 2024 @ CVPR 2024 CMT website.  

Papers accepted in sub-track 3 will only be shared on the workshop website and will not have any IEEE/CVF proceedings. Authors of such papers may thus choose to re-submit these papers elsewhere. We plan to accept only a very limited number of previously accepted papers (without proceedings) in sub-track 4 if our final program schedule permits. The page limits described above are excluding the references.

The paper submissions must be in pdf format and must use the official CVPR 2024 templates. All submissions must be anonymous and conform to the CVPR standards for double-blind review. The deadline for submitting the (optional) supplementary material is the same as the main paper deadline. The (optional) supplementary material should be submitted as a separate file from the main paper. You will be able to select the type of the paper (among the four types described above) during the submission.

The accepted papers will be presented as either an oral, spotlight, or poster presentation. At least one author of each accepted submission must present the paper at the workshop. The presentation of the accepted papers at MAR 2024 will follow the same policy as that for the accepted papers of CVPR 2024.  

Submission deadline (both main paper & (optional) supplementary material): March 11 17, 2024 (11:59PM PDT)
Notification to authors: April 5, 2024.
Camera ready deadline:  April 14, 2024 (11:59PM PDT).


Paper Track: Topics

The topics for MAR 2024 include, but are not limited to:
  • Multimodal cognition and learning.
  • Foundation models of intelligence, including vision, language, and other modalities.
  • Large language models, vision, and cognition including children’s cognition.
  • Artificial general intelligence / general-purpose problem solving architectures.
  • Neural architectures for solving vision & language or language-based IQ puzzles.
  • Embodiment and AI.
  • Large language models, neuroscience, and vision.
  • Functional and algorithmic / procedural learning in vision.
  • Abstract visual-language reasoning, e.g., using sketches, diagrams, etc.
  • Perceptual reasoning and decision making.
  • New vision-and-language abstract reasoning tasks and datasets.
  • Vision-and-language applications.


SMART-101 Challenge Track: Participation Instructions

In the last couple of years, we have seen dramatic improvements in the reasoning abilities of multimodal and large language models. In this SMART-101 CVPR 2024 challenge, we attempt to understand these abilities of large deep models through solving visuo-linguistic puzzles that need basic mathematical and algorithmic skills; these skills even young children seem to possess and can solve the puzzles without much difficulty. A thorough empirical analysis of such abilities of multimodal LLMs is the basic premise of our CVPR 2023 paper titled Are Deep Neural Networks SMARTer than Second Graders? This paper introduces the Simple Multimodal Algorithmic Reasoning Task (SMART) and the SMART-101 dataset. Building upon the efforts in the paper, this SMART-101 CVPR-2024 challenge is an attempt at bringing research interest into this important topic to understand where we stand in the race towards achieving true Artificial General Intelligence (AGI). Specifically, the goals of this competition are three-fold, towards understanding:
(i) how well do state-of-the-art multimodal LLMs abstract data, attend to key details, and generalize their knowledge to solve new problems?
(ii) how fluid are they in acquiring new skills? and
(iii) how effective are they in the use of language for visual reasoning?

Through the state-of-the-art AI models submitted by the participants of this challenge, we hope to learn and understand where we stand in real AGI abilities, and more importantly, clearly answer if the current AI is at least better than second graders in mathematical/algorithmic abilities.

The SMART challenge involves solving visuo-linguistic puzzles designed specifically for children in the 6–8 age group. The puzzles are taken from the Math Kangaroo Olympiad -- a popular international children's Olympiad that uses a multiple choice answer selection format. Most of the puzzles have an image and a text question, and five answer options of which only one option is the correct answer to the puzzle. Participant submissions to the challenge will be evaluated against a private test set. The solution to each puzzle needs a mix of various basic mathematical and algorithmic reasoning skills, involving basic arithmetic, algebra, spatial reasoning, logical reasoning, measuring, path tracing, pattern matching, and counting.

Important dates for the SMART-101 Challenge Track:  
  • Challenge open: March 28, 2024.
  • Challenge close: June 7, 2024 (11:59PM PDT).
  • ArXiv challenge paper deadline to be considered for awards: June 7, 2024 (11:59PM PDT).
  • Public winner announcement:  June 17, 2024.
Instructions for participating in the SMART-101 challenge:  
  • The challenge is hosted on Eval.AI. Please read the instructions at this link for the submission guidelines.
  • The challenge participants are required to make arXiv submissions detailing their approach as well as make their implementation publicly available on Github to be considered for the prizes. Note that the participant’s arXiv submissions will not be part of the workshop proceedings.
  • Winners of the challenge are determined both by the performance on the leaderboard over a private test set as well as the quality of the proposed method (as detailed in their arXiv submission and reviewed by a panel). Please see the details on the challenge website.
  • Prizes will be awarded on the day of the workshop.
  • The contact for the SMART-101 challenge: smart101@googlegroups.com.



Accepted Papers

  • CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update.
    Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, Qing Li.

  • Spatial Representations in Multimodal AI Systems.
    Scott O. Murray, Bridget Leonard.

  • Using Language-Aligned Gesture Embeddings for Understanding Gestures Accompanying Math Terms.
    Tristan Maidment, Purav J. Patel, Erin A. Walker, Adriana Kovashka.

  • What does CLIP know about peeling a banana?
    Claudia Cuttano, Gabriele Rosi, Gabriele Trivigno, Giuseppe Averta.

  • Transformers meet Neural Algorithmic Reasoners.
    Wilfried Bounsi, Borja Ibarz, Andrew J. Dudzik, Jessica Hamrick, Larisa Markeeva, Alex Vitvitskyi, Razvan Pascanu, Petar Veličković.

  • Task Navigator: Decomposing Complex Tasks for Multimodal Large Language Models.
    Feipeng Ma, Yizhou Zhou, Yueyi Zhang, Siying Wu, Zheyu Zhang, Zilong He, Fengyun Rao, Xiaoyan Sun.

  • MIRACLE: Multimodal Image-text Retrieval and Analysis for Contextual Long-form Evaluation.
    Md Messal Monem Miah, Agent Chatterjee, Arindam Mitra, Huang Ruihong, Man Luo.

  • The Deep Equilibrium Algorithmic Reasoner.
    Dobrik G. Georgiev, Pietro Lió, Davide Buffelli.

  • Multi-Explainable TemporalNet: An Interpretable Multimodal Approach using Temporal Convolutional Network for User-level Depression Detection.
    Anas Zafar, Danyal Aftab, Rizwan Qureshi, Yaofeng Wang, Hong Yan.

  • Math-Search: A Benchmark for Multi-hop Visual Reasoning over Plots.
    Pulkit Madan, Sanjay Haresh, Apratim Bhattacharyya, Litian Liu, Reza Pourreza, Sunny P Panchal, Roland Memisevic.

  • ViTA: An Efficient Video-to-Text Algorithm using VLM for RAG-based Video Analysis System.
    Md Adnan Arefeen, Biplob Debnath, Md Yusuf Sarwar Uddin, Srimat Chakradhar.

  • Autonomous Evaluation and Refinement of Digital Agents.
    Jiayi Pan, Yichi Zhang, Nicholas A. Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr.



MAR 2024 Venue

Seattle Convention Center

MAR 2024 will be held at the Seattle Convention Center at 9:00 AM - 12:30 PM PDT on June 17, 2024.

Sponsors

Organizers

Anoop Cherian

Anoop Cherian

Mitsubishi Electric Research Laboratories (MERL)

Kuan-Chuan Peng

Kuan-Chuan Peng

Mitsubishi Electric Research Laboratories (MERL)

Suhas Lohit

Suhas Lohit

Mitsubishi Electric Research Laboratories (MERL)

Honglu Zhou

Honglu Zhou

Salesforce Research

Moitreya Chatterjee

Moitreya Chatterjee

Mitsubishi Electric Research Laboratories (MERL)

Tim Marks

Tim Marks

Mitsubishi Electric Research Laboratories (MERL)

Joanna Matthiesen

Joanna Matthiesen

Math Kangaroo USA



Program Committee

Alessandro Salatiello    University of Tübingen
Anshul Shah Apple
Ashish Singh University of Massachusetts Amherst
Asim Kadav NEC Labs
Can Qin Northeastern University
Deep A. Patel NEC Laboratories America, Inc.
Haoqi Hu Amazon LLC.
Jihoon Chung Princeton
Michael J. Jones Mitsubishi Electric Research Laboratories
Sameer Khurana Mitsubishi Electric Research Laboratories
Shu Zhang Salesforce Research
Sourya Basu University of Illinois at Urbana-Champaign
Tyler Zhu Princeton University
Yizhou Wang Northeastern University