4M: a next-generation framework for training multimodal foundation models

An open-source training framework to advance multimodal AI.


Large Language Models (LLMs) like OpenAI’s ChatGPT have changed how humans interact with technology. These models, trained on vast amounts of text data from the internet, excel at understanding and generating human-like language.

Now, researchers are focused on the next step: multimodal models that can process different data types, including text, images, sound, and biological or environmental information.

Researchers at EPFL (École Polytechnique Fédérale de Lausanne) in Switzerland have made significant advances in this area. In partnership with Apple, EPFL’s Visual Intelligence and Learning Laboratory (VILAB) has developed 4M, a cutting-edge neural network capable of handling a wide range of tasks and data types.

The transition from language-only models to multimodal models has several pitfalls. Training a single model to handle different types of data, such as text, images, and sound, introduces performance trade-offs, and most attempts have struggled to maintain performance when processing varied inputs. Further, combining these disparate data types into a single, coherent system without losing crucial information has proved challenging.

However, 4M promises to change this. Unlike previous systems, which were often limited to language and a single type of input, 4M can interpret and process several types of data at once. For example, the model could understand the concept of an “orange” not only through the word “orange” (as traditional language models do) but also by interpreting an image of an orange and understanding its texture through touch-based sensors.
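To make that idea concrete, the toy sketch below (written in PyTorch) shows one common way such a system can be organized: each modality is projected into a shared embedding space so that a single transformer backbone attends over word tokens, image patches, and touch-sensor readings together. This is an illustrative example only; the class and parameter names are hypothetical and it is not 4M’s actual code or API.

```python
# Illustrative sketch (not 4M's actual API): different modalities are mapped
# into a shared embedding space so one transformer can attend across them.
import torch
import torch.nn as nn

EMBED_DIM = 64

class ToyMultimodalEncoder(nn.Module):
    def __init__(self, vocab_size=1000, image_patch_dim=48, touch_dim=8):
        super().__init__()
        # Each modality gets its own lightweight projection into the shared space...
        self.text_embed = nn.Embedding(vocab_size, EMBED_DIM)
        self.image_proj = nn.Linear(image_patch_dim, EMBED_DIM)
        self.touch_proj = nn.Linear(touch_dim, EMBED_DIM)
        # ...but all resulting tokens flow through one shared transformer backbone.
        layer = nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_ids, image_patches, touch_readings):
        tokens = torch.cat([
            self.text_embed(text_ids),        # word tokens, e.g. "orange"
            self.image_proj(image_patches),   # patches from a photo of an orange
            self.touch_proj(touch_readings),  # readings from a touch sensor
        ], dim=1)
        return self.backbone(tokens)          # one model reasons over all of them

model = ToyMultimodalEncoder()
fused = model(
    text_ids=torch.randint(0, 1000, (1, 5)),
    image_patches=torch.randn(1, 16, 48),
    touch_readings=torch.randn(1, 4, 8),
)
print(fused.shape)  # torch.Size([1, 25, 64]): all modalities in one token sequence
```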

Assistant Professor Amir Zamir, head of the VILAB, explains why this breakthrough matters: “One common criticism of LLMs is that their knowledge is not grounded because they only rely on language data. Moving to multimodal modeling, we can integrate additional senses — like sight and touch — to create a more holistic, realistic understanding of the world.”

Despite the encouraging developments, creating 4M presented several problems. One key challenge was getting the model to represent knowledge consistently across the various modalities. Instead of seamlessly integrating text, image, and sensory data, early versions of multimodal models like 4M often produced disjointed solutions, with separate sets of parameters solving different parts of the problem.

Zamir speculates that “under the hood, models may be using a kind of ‘cheat’ — running multiple independent models that seem unified but are essentially functioning as separate entities.”

This represents one of the major impediments to developing an AI system that can achieve a unified, comprehensive understanding of the world.
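The contrast Zamir describes can be sketched in code. The hypothetical example below (not taken from the 4M paper) compares two designs: one where each modality runs through its own independent sub-network, so the system only appears unified, and one where a single set of shared parameters processes every modality jointly.

```python
# Hypothetical illustration of "separate entities" vs. a shared backbone.
import torch
import torch.nn as nn

class SeparateTowers(nn.Module):
    """The 'cheat': independent sub-models per modality, no shared computation."""
    def __init__(self, dim=64):
        super().__init__()
        self.text_net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.image_net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, text, image):
        # Each input only ever touches its own parameters.
        return self.text_net(text), self.image_net(image)

class SharedBackbone(nn.Module):
    """The goal: one set of parameters that processes every modality jointly."""
    def __init__(self, dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, text, image):
        # Concatenated tokens pass through the same weights, so knowledge is shared.
        return self.backbone(torch.cat([text, image], dim=1))

text, image = torch.randn(1, 5, 64), torch.randn(1, 16, 64)
print(SeparateTowers()(text, image)[0].shape)  # torch.Size([1, 5, 64])
print(SharedBackbone()(text, image).shape)     # torch.Size([1, 21, 64])
```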

The VILAB team is working to develop 4M into a cohesive, scalable model. It is intended as an open-source, extensible architecture that professionals and experts in fields ranging from climate science to healthcare can adapt to their own needs.

Oguzhan Fatih Kar and Roman Bachmann, doctoral assistants at VILAB and co-authors of the 4M research, emphasize the potential for open-source collaboration: “The whole point of open-sourcing 4M is to allow other researchers to tailor it with their own data and specific applications. This could be a turning point for many industries, from climate modeling to biomedical research.”

Although 4M’s adaptability to different domains is exciting, roadblocks remain: the researchers are still working toward a model that performs reliably in the real world.

Reflecting on the future of AI, Zamir draws an intriguing parallel between human cognition and artificial intelligence. “As humans, we learn through our five senses and language. Our AI systems today, however, are mainly trained on text, lacking sensory input. Our goal is to reverse that — to develop a system that combines language and sensory data to model the physical world more accurately and efficiently.”

As 4M continues to evolve, it represents a significant step toward bridging the gap between human-like understanding and AI. The possibilities for multimodal AI are vast, and it has the potential to transform industries ranging from healthcare to environmental science. While much work remains, the EPFL team’s research offers a promising glimpse into a future where AI can perceive and understand the world in ways closer to how humans experience it.

As multimodal AI continues to develop, the impact on sectors such as climate science, healthcare, and autonomous systems could be profound, marking a new chapter in artificial intelligence research and applications.

The model is described in a groundbreaking research paper presented at NeurIPS 2024, the premier conference for neural information processing systems.
