
Monday, 19 August 2024

VITA: Revolutionizing Multimodal Inputs with Open-Source Technology

Presentational View

Introduction

AI technology has been evolving rapidly in recent years, and major new capabilities have emerged, such as multimodal understanding and the collaborative development enabled by open-source models. Models can now analyse multiple types of data, including text, images, audio, and video, simultaneously, which makes them genuinely useful in real-world applications.

Yet despite this progress, open-source models often falter when trying to integrate these multimodal capabilities seamlessly. The challenge goes beyond maintaining high performance across diverse modalities: it also means providing a smooth user experience without excessive computational cost. VITA tries to fix that, combining powerful multimodal comprehension with advanced interactive capabilities.

Who Developed VITA?

VITA was developed by a joint team from Tencent Youtu Lab, NJU, XMU, and CASIA, together with a number of researchers and engineers. Tencent Youtu Lab is known for its leading computer vision and AI research, while NJU, XMU, and CASIA are top Chinese research institutions with deep expertise in AI. The collaboration sought to push the frontiers of multimodal AI and to democratize sophisticated capabilities through open source. VITA was developed with two key goals: (1) to build a multimodal understanding model that is easy to use, taking raw multimodal data in and returning predictions, and (2) to provide a natural, interactive multimodal experience.

What is VITA?

VITA is the first open-source Multimodal Large Language Model (MLLM) capable of simultaneously processing and analysing video, image, text, and audio modalities. It was developed to deliver a richer multimodal interactive experience, ahead of any other open-source project targeting this level of seamless integration between multimodal understanding and interaction.

Key Features of VITA

  • Multimodal Input: video, image, text, and audio inputs can be processed simultaneously, making the model highly versatile.
  • Natural Interaction: the model supports non-awakening interaction and audio interrupt, so you can speak to it without a wake word and cut in while it is answering, which makes interaction with the agent feel more natural.
  • Bilingual Support: VITA has strong, stable baseline capabilities for multilingual understanding in both English and Chinese.
  • Benchmark Performance: its strong results across many unimodal and multimodal benchmarks show that the model is effective and robust.

Capabilities and/or Use Cases of VITA

VITA's combination of multimodal understanding and interaction enables a number of different applications. Here are a few of the key use cases:

  • Video Question Answering: given questions about the content of a video, VITA can provide precise answers grounded in that video.
  • Video Captioning: the model can generate descriptive textual captions for a wide variety of video content, improving accessibility and insight.
  • Event Detection and Localisation: VITA can detect and localise events in videos, which could be used for surveillance or monitoring applications.
  • Multimodal Interaction: the model can take input from a variety of sources, allowing humans and machines to interact in a more natural manner.

How VITA Works (Architecture/Design)

VITA is designed to handle different types of inputs smoothly. At its core, VITA uses separate encoders for visual, audio, and text inputs. The visual encoder processes both images and videos, treating videos as a series of frames. The audio encoder handles speech, while text input is managed directly by the language model. These encoders connect to specific MLP connectors that align the different input types with the language model’s feature space.
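To make this concrete, here is a minimal sketch, not VITA's actual code, of how modality-specific encoders could feed MLP connectors that project their features into the language model's token space. The module names, dimensions, and the two-layer connector design are assumptions for illustration.

```python
# Sketch of a multimodal front end: separate encoders plus MLP connectors
# that map visual and audio features into the LLM's hidden dimension.
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Projects encoder features into the language model's hidden dimension."""
    def __init__(self, in_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.proj(features)

class MultimodalFrontEnd(nn.Module):
    """Wraps a vision encoder and an audio encoder; videos are treated as frame sequences."""
    def __init__(self, vision_encoder, audio_encoder, vision_dim, audio_dim, llm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.audio_encoder = audio_encoder
        self.vision_connector = MLPConnector(vision_dim, llm_dim)
        self.audio_connector = MLPConnector(audio_dim, llm_dim)

    def forward(self, frames=None, audio=None):
        tokens = []
        if frames is not None:                       # images or video frames: (B, T, C, H, W)
            b = frames.shape[0]
            feats = self.vision_encoder(frames.flatten(0, 1))   # encode every frame
            feats = feats.reshape(b, -1, feats.shape[-1])       # regroup per sample
            tokens.append(self.vision_connector(feats))
        if audio is not None:                        # e.g. log-mel spectrogram features
            tokens.append(self.audio_connector(self.audio_encoder(audio)))
        # These embeddings are concatenated with the text embeddings before the LLM.
        return torch.cat(tokens, dim=1) if tokens else None
```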

Architecture of VITA
source - https://arxiv.org/pdf/2408.05211

A key innovation in VITA’s design is the use of state tokens to tell apart different types of input queries. The model is trained to recognize three states: (1) for effective query audio, (2) for background noise or non-query audio, and (3) for text queries. This allows VITA to filter out background noise and respond only to intentional queries without needing a wake word.
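As a rough illustration, the snippet below sketches how predicted state tokens could gate whether the model responds; the token values and helper names are assumptions, not VITA's actual implementation.

```python
# Hypothetical state tokens gating the response: only effective audio
# queries and text queries trigger generation; noise is silently dropped.
EFFECTIVE_AUDIO = "<1>"   # speech that is an actual query
NOISY_AUDIO     = "<2>"   # background noise / non-query audio
TEXT_QUERY      = "<3>"   # plain text query

def should_respond(predicted_state: str) -> bool:
    """Only effective audio queries and text queries deserve an answer."""
    return predicted_state in (EFFECTIVE_AUDIO, TEXT_QUERY)

def handle_input(model, query_tokens, predicted_state: str):
    if not should_respond(predicted_state):
        return None          # treated as background noise: no wake word, no reply
    return model.generate(query_tokens)
```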

To enable audio interrupt functionality, VITA uses a duplex deployment scheme with two models running simultaneously. One model generates responses to the current query, while the other continuously listens for new audio input. If a new query is detected, the listening model interrupts the response generation, gathers the historical context, and seamlessly transitions to answering the latest query. This innovative approach allows VITA to provide a more natural and fluid interaction experience, similar to human conversation dynamics.
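The following sketch illustrates the duplex idea under stated assumptions: one model instance streams the current answer while a second screens incoming audio, and an effective new query interrupts generation and restarts it with updated context. The class and the `stream`/`is_effective_query` helpers are hypothetical, not VITA's deployment code.

```python
# Duplex sketch: a responder streams tokens while a monitor listens for
# new queries; a detected query aborts the stale answer and restarts.
import queue
import threading

class DuplexAgent:
    def __init__(self, responder, monitor):
        self.responder = responder          # model instance that generates answers
        self.monitor = monitor              # model instance that screens incoming audio
        self.interrupt = threading.Event()
        self.audio_in = queue.Queue()
        self.pending_query = None

    def listen(self):
        """Runs in a background thread: classify audio chunks and flag real queries."""
        while True:
            chunk = self.audio_in.get()
            if self.monitor.is_effective_query(chunk):        # assumed helper
                self.pending_query = chunk
                self.interrupt.set()

    def answer(self, query, history):
        """Stream an answer, restarting with the newest query if interrupted."""
        while True:
            self.interrupt.clear()
            interrupted = False
            for token in self.responder.stream(query, history):   # assumed streaming API
                if self.interrupt.is_set():
                    interrupted = True        # a newer query arrived: drop this answer
                    break
                yield token
            if not interrupted:
                return
            history = history + [("user", query)]     # keep the conversational context
            query = self.pending_query                 # switch to the latest query
```

In practice, `listen` would run in a background thread (for example via `threading.Thread(target=agent.listen, daemon=True).start()`) while `answer` streams tokens to the user.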

Innovative Techniques Powering the VITA Model

The VITA model builds on a number of cutting-edge AI and ML techniques to improve its multimodal capabilities and interactivity. Key methodologies include:

  1. Sparse Mixture of Experts (SMoE): VITA is based on the Mixtral 8x7B model, which uses a sparse mixture-of-experts architecture. This saves computational resources by activating only a fraction of the experts for each input.
  2. Instruction Tuning: a high-quality instructional text corpus is used to tune the model. This improves VITA’s language understanding and lets it handle more nuanced linguistic inputs.
  3. Multimodal Alignment: to reduce the representation gap between text and other modalities such as images and audio, VITA uses multimodal alignment, ensuring the model can be trained effectively on many kinds of data.
  4. Dynamic Patching Strategy: VITA employs a dynamic patching method for high-resolution images, which helps it capture fine-grained local details. This lets the model attend to important regions of an image and improves its visual understanding (a minimal sketch follows this list).
  5. Two-Stage Multi-Task Learning: VITA uses a two-stage multi-task learning strategy to equip the language model with visual and audio capabilities. Training on several tasks in sequence improves generalization across modalities.
  6. Automatic Speech Recognition (ASR): the model includes ASR for understanding and transcribing spoken language into text, which is essential both for audio input processing and for interactive, voice-based queries.
  7. Text-to-Speech (TTS) Tool: VITA uses an external TTS tool, the GPT-SoVITS text-to-speech framework, which allows the model’s natural-language responses to be spoken aloud.
  8. Multi-Stage Training Pipeline: VITA’s development follows a multi-stage training pipeline consisting of LLM instruction tuning, multimodal alignment, and multimodal instruction tuning. This staged methodology equips the model to handle different tasks and input types.
  9. Data Concatenation: to save computation, VITA concatenates training data so that sequences across training batches have a uniform length, which streamlines training and reduces computational burden (see the second sketch after this list).
  10. Voice Activity Detection (VAD): VITA uses Silero VAD to recognize human speech. This filters out background noise so that only meaningful audio queries are processed.
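
As a rough illustration of the dynamic patching idea in item 4, the sketch below tiles a high-resolution image into base-resolution crops (the tile grid adapts to the image size) and keeps a downscaled global view. The 448-pixel base size, the tile limit, and the helper itself are assumptions, not VITA's exact recipe.

```python
# Dynamic patching sketch: adapt the tile grid to the image, keep a global thumbnail.
import math
from PIL import Image

def dynamic_patches(img: Image.Image, base: int = 448, max_tiles: int = 12):
    """Return a global thumbnail plus local tiles covering the full image."""
    w, h = img.size
    cols = max(1, min(max_tiles, math.ceil(w / base)))
    rows = max(1, min(max_tiles // cols, math.ceil(h / base)))
    resized = img.resize((cols * base, rows * base))
    tiles = [
        resized.crop((c * base, r * base, (c + 1) * base, (r + 1) * base))
        for r in range(rows) for c in range(cols)
    ]
    thumbnail = img.resize((base, base))      # coarse global context
    return [thumbnail] + tiles                # every crop goes through the vision encoder
```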
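And for item 9, here is a minimal packing sketch under the assumption that "data concatenation" means greedily packing several short tokenized samples into one fixed-length sequence so every batch element has the same length; the 6144-token context and the padding scheme are illustrative assumptions.

```python
# Packing sketch: concatenate tokenized samples into uniform-length sequences.
from typing import List

def pack_samples(samples: List[List[int]], context_len: int = 6144, pad_id: int = 0):
    """Greedily pack tokenized samples into fixed-length sequences."""
    packed, current = [], []
    for tokens in samples:
        if current and len(current) + len(tokens) > context_len:
            packed.append(current + [pad_id] * (context_len - len(current)))
            current = []
        current.extend(tokens[:context_len])   # truncate any overly long sample
    if current:
        packed.append(current + [pad_id] * (context_len - len(current)))
    return packed                              # every sequence has length context_len
```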

These techniques, together with the design choices covered in earlier sections, collectively contribute to VITA’s advanced capabilities.

Performance Evaluation

VITA was evaluated on both language and multimodal tasks, showing strong performance across a comprehensive set of benchmarks. As the table below shows, VITA outperformed the official Mixtral 8x7B Instruct model by a wide margin on language tasks, especially in Chinese: it beat the baseline by 3.38 points on C-EVAL, with similar gains on other Chinese benchmarks. It also retained good performance on English tasks and improved by 11.67 percentage points on the mathematical reasoning benchmark GSM8K.

Comparison with official Mixtral 8x7B Instruct
source - https://arxiv.org/pdf/2408.05211

On multimodal understanding, VITA achieved competitive performance against both open-source and proprietary models. The figure below illustrates its image and video understanding. On image benchmarks such as MME, OCRBench, and HallusionBench, VITA compares favourably with specialised open-source models like LLaVA-Next. On video understanding there is still a gap to strong video-specialised models such as LLaVA-Next-Video, but VITA holds up well against other general-purpose multimodal models on this task even without extensive video-specific fine-tuning.

Evaluation on image and video understanding
source - https://arxiv.org/pdf/2408.05211

Beyond these core benchmarks, VITA also performed well on audio processing tasks. On the ASR benchmarks used to test it (Wenetspeech and Librispeech), it achieved impressive results for both Chinese and English speech recognition, and further audio-processing tests attested to its generalisation capabilities. In sum, although there is still ground to cover, especially against the top proprietary models, these experiments leave us optimistic about VITA as an open-source multimodal model.

How to Access and Use VITA

VITA’s full repository, including training code, deployment code, and model weights, is available on GitHub and can be run locally or tried through online demos. The model is released under an open-source license whose terms allow researchers and developers to use and adapt it as needed.

Limitations And Future Work

The VITA model has several limitations that need to be addressed in future work. Its foundation capabilities still show a significant gap compared with proprietary models and require further optimisation. The way the corpus of noisy audio samples is generated may also be oversimplified, leading to more frequent misclassification and calling for subtler handling. Finally, the current implementation depends on an external text-to-speech (TTS) tool rather than a TTS module integrated end-to-end with the LLM, which limits real-time interaction. The researchers plan to tackle these areas in future work to improve the model’s performance, robustness, and interactive experience.

Conclusion

VITA is a game changer among open-source multimodal AI models. It addresses one of the field’s most persistent headaches, seamless multimodal integration, and goes further with innovative interaction capabilities that set a new benchmark for what open-source approaches can do. There is much more to come as the AI community builds on this foundation, given how far applications and capabilities have advanced in just a few short years.


Source
Research paper: https://arxiv.org/abs/2408.05211
Research paper (PDF): https://arxiv.org/pdf/2408.05211
Project details: https://vita-home.github.io/
GitHub repo: https://github.com/VITA-MLLM/VITA
