Meta introduces Chameleon, a state-of-the-art multimodal model

As competition in generative AI shifts to multimodal models, Meta has released a preview of what could be its answer to the models released by frontier labs. Chameleon, the new model family, is designed to be natively multimodal rather than to assemble components trained for different modalities.

Although Meta has not yet released the models, its reported experiments show that Chameleon delivers state-of-the-art performance on several tasks, including image captioning and visual question answering (VQA), while remaining competitive on text-only tasks.

Chameleon's architecture can unlock new AI applications that require a deep understanding of both visual and textual information.

Multimodal models with early fusion

A popular way to create multimodal models is to patch together models that have been trained for different modalities. This approach is called “late fusion”: the AI system receives inputs in different modalities, encodes them with separate models, and then merges the encodings for inference. While late fusion works well, it limits the models’ ability to integrate information across modalities and to generate sequences of interleaved images and text.
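
As a rough illustration of what late fusion looks like in code, here is a hypothetical PyTorch sketch: each modality gets its own toy encoder, and their outputs are only concatenated at the very end for a downstream prediction head. The module names, dimensions, and task are made-up assumptions for the example, not Meta’s or anyone else’s implementation.

```python
# Minimal late-fusion sketch: modality-specific encoders, merged only at the end.
import torch
import torch.nn as nn

class LateFusionModel(nn.Module):
    def __init__(self, d_model=256, num_classes=10):
        super().__init__()
        # Separate, modality-specific encoders (toy stand-ins).
        self.text_encoder = nn.LSTM(input_size=300, hidden_size=d_model, batch_first=True)
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, d_model),
        )
        # Fusion happens only after each modality has been encoded independently.
        self.head = nn.Linear(2 * d_model, num_classes)

    def forward(self, text_embeddings, image):
        _, (h, _) = self.text_encoder(text_embeddings)   # h: (1, B, d_model)
        text_feat = h[-1]                                 # (B, d_model)
        image_feat = self.image_encoder(image)            # (B, d_model)
        fused = torch.cat([text_feat, image_feat], dim=-1)
        return self.head(fused)

model = LateFusionModel()
logits = model(torch.randn(2, 12, 300), torch.randn(2, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 10])
```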

Chameleon uses an early-fusion token-based mixed-modal architecture, meaning it is designed from the ground up to learn from an interwoven mix of images, text, code and other modalities. Chameleon transforms images into discrete tokens, just like language models do with words. It also uses a unified vocabulary consisting of text, code and image tokens. This makes it possible to apply the same transformer architecture to sequences containing both image and text tokens.
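
To make the idea concrete, here is a hedged PyTorch sketch of early fusion over a unified vocabulary: discrete image codes (such as a VQ-style tokenizer might produce) are shifted past an assumed text vocabulary so that text and image tokens share one ID space and pass through a single causal transformer. The vocabulary sizes, sequence lengths, and layer settings are illustrative assumptions, not Chameleon’s actual configuration.

```python
# Early fusion over a unified token vocabulary (illustrative sketch).
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000        # assumed text/code vocabulary size
IMAGE_CODEBOOK = 8_192     # assumed size of the image tokenizer's codebook
UNIFIED_VOCAB = TEXT_VOCAB + IMAGE_CODEBOOK

def image_to_tokens(image_codes: torch.Tensor) -> torch.Tensor:
    """Shift discrete image codes into the unified vocabulary so they can
    sit in the same sequence as text tokens."""
    return image_codes + TEXT_VOCAB

class UnifiedTransformer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(UNIFIED_VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, UNIFIED_VOCAB)

    def forward(self, tokens):  # tokens: (B, T) mixed text and image IDs
        T = tokens.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(T)
        hidden = self.backbone(self.embed(tokens), mask=causal_mask)
        return self.lm_head(hidden)   # next-token logits over the unified vocab

# A single interleaved sequence: some text tokens followed by an image's token block.
text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))
image_ids = image_to_tokens(torch.randint(0, IMAGE_CODEBOOK, (1, 64)))
sequence = torch.cat([text_ids, image_ids], dim=1)

logits = UnifiedTransformer()(sequence)   # (1, 80, UNIFIED_VOCAB)
```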

According to the researchers, Chameleon is most comparable to Google Gemini, which also uses a token-based early-fusion approach. However, Gemini uses separate image decoders in the generation phase, while Chameleon is an end-to-end model that both processes and generates tokens.

“Chameleon's unified token space enables seamless reasoning and generation of interleaved image and text sequences, without the need for modality-specific components,” the researchers write.

Chameleon’s encoding and decoding logic (source: arXiv)

Although early fusion is very attractive, it poses significant challenges in training and scaling the model. To overcome these challenges, the researchers used a series of architectural tweaks and training techniques. In their paper, they share details about the different experiments and their effects on the model.

Chameleon’s training takes place in two phases, using a dataset containing 4.4 trillion tokens of text, image-text pairs, and sequences of interleaved text and images. The researchers trained 7-billion- and 34-billion-parameter versions of Chameleon for more than 5 million hours on Nvidia A100 80GB GPUs.

Chameleon in action

According to the experiments reported in the paper, Chameleon can perform a diverse set of text-only and multimodal tasks. On visual question answering (VQA) and image captioning benchmarks, Chameleon-34B achieves state-of-the-art performance, outperforming models like Flamingo, IDEFICS, and LLaVA-1.5.

According to the researchers, Chameleon matches the performance of other models with “far fewer in-context training examples and with smaller model sizes, both in pre-trained and fine-tuned model evaluations.”

One of the disadvantages of multimodality is a drop in performance on single-modality requests. For example, vision-language models tend to perform worse on text-only prompts. But Chameleon remains competitive on text-only benchmarks, matching models like Mixtral 8x7B and Gemini-Pro on commonsense reasoning and reading comprehension tasks.

Interestingly, Chameleon can unlock new possibilities for mixed-modal reasoning and generation, especially when prompts call for mixed-modal responses with interleaved text and images. Experiments with human-evaluated responses show that users generally preferred the multimodal documents generated by Chameleon.
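
One way to picture what a mixed-modal response means in practice is sketched below: assuming the unified-vocabulary convention from the earlier example, a generated token stream can be split back into text spans and image spans, with each image span handed to the image tokenizer’s decoder for pixel reconstruction. The boundary value and tokens-per-image count are illustrative assumptions, not Chameleon’s actual settings.

```python
# Splitting an interleaved output stream into text and image spans (sketch).
from typing import List, Tuple

TEXT_VOCAB = 32_000            # assumed boundary between text IDs and image IDs
IMAGE_TOKENS_PER_IMAGE = 1024  # assumed fixed number of tokens per generated image

def split_mixed_stream(token_ids: List[int]) -> List[Tuple[str, List[int]]]:
    """Group a generated token stream into ('text', ids) and ('image', ids) spans."""
    spans: List[Tuple[str, List[int]]] = []
    for tid in token_ids:
        kind = "text" if tid < TEXT_VOCAB else "image"
        same_span = spans and spans[-1][0] == kind
        image_full = kind == "image" and spans and len(spans[-1][1]) >= IMAGE_TOKENS_PER_IMAGE
        if same_span and not image_full:
            spans[-1][1].append(tid)
        else:
            spans.append((kind, [tid]))
    return spans

# Example: 5 text tokens, then one "image" worth of tokens, then 5 more text tokens.
stream = list(range(5)) + [TEXT_VOCAB + 7] * IMAGE_TOKENS_PER_IMAGE + list(range(5))
for kind, ids in split_mixed_stream(stream):
    # Text spans would go to the text detokenizer; image spans would go to the
    # image tokenizer's decoder to reconstruct pixels.
    print(kind, len(ids))  # text 5, image 1024, text 5
```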

In the past week, both OpenAI and Google unveiled new models that deliver rich multimodal experiences. However, they have not released many details about those models. If Meta continues to follow its playbook and releases the weights for Chameleon, it could become an open alternative to private models.

Early fusion can also inspire new directions of research into more advanced models, especially as more modalities are added to the mix. For example, robotics startups are already experimenting with integrating language models into robotics control systems. It will be interesting to see how early fusion might improve robotics foundation models as well.

“Chameleon represents a significant step toward realizing the vision of unified foundation models capable of flexibly reasoning over and generating multimodal content,” the researchers write.