Apple researchers reveal MM1 series of multimodal language models
Apple has introduced a family of multimodal language models (MLMs) that it calls MM1, with up to 30 billion parameters. Multimodal AI models can process a combination of text, images and videos as input, and provide a similar combination as output.
Traditionally, large language models have been trained on vast corpora of text. MLMs benefit from a wider variety of training sources and from the ability to process information across multiple modalities.
The significant aspect of the paper is that Apple researchers have detailed both the architectural choices and the training regimen used to develop the MM1 family of models, research that has the potential to improve the performance of all MLMs. Among the insights: the resolution of the images used to train a model has a significant impact on its performance; caption data improves performance, with synthetic captions providing an additional boost; the ratio of the modalities in the training mix matters; and diversity in the pre-training data is important for improving performance.
Capabilities of MM1
MM1 can follow custom formatting, count objects in an image, perform optical character recognition on particular portions of an image, respond to common-sense and world-knowledge queries, and perform basic mathematical functions. Three different types of pre-training data were used: images accompanied by captions, images interleaved with text, and text-only documents. The researchers also found that pre-training the visual encoder can improve performance.
How MM1 fares against competition
Apple has evaluated the performance of its MM1 family of models on benchmarks, which are industry-standard tests. The researchers claim that the MM1 models with three, seven and 30 billion parameters outperform EMU2, Flamingo and IDEFICS when it comes to captioning images and answering visual questions. The models have demonstrated competitive performance across 12 established multimodal benchmarks. The researchers have called for better benchmarks for MLMs.
The research paper has been posted on arXiv, and anticipates that MLMs will become the next generation of foundational AI models after large language models (LLMs).
Apple is catching up to competition when it comes to AI
Apple is lagging behind competitors when it comes to integrating generative AI into its products and services. Microsoft has integrated OpenAI tech into several of its services, including Bing search, Copilot and Microsoft Office. Google is also expected to deeply integrate AI into its services and hardware products going forward, including the Android ecosystem. Apple CEO Tim Cook hinted at deeper AI integrations during a call with investors earlier in the year, with deep AI integrations expected in iOS 18. Apple has been acquiring AI startups along with their talent for years now, the latest acquisition being that of Canada-based DarwinAI.