Thursday, 29 August 2024

How Mistral-NeMo-Minitron 8B Achieves Top Accuracy with Model Compression

Introduction

Model compression techniques have advanced rapidly in recent years, opening the door to smaller, more efficient models that retain the capabilities of their larger counterparts. These techniques are critical for deploying large language models in low-resource settings, where they allow faster processing with lower energy consumption. Still, delivering the full functionality of a large model in a resource-constrained environment is challenging, and this has held back real-world deployment on edge devices. Mistral-NeMo-Minitron 8B addresses these challenges with sophisticated pruning and knowledge distillation, which preserve output quality even on limited hardware. This brings us a step closer to the goal of AI assistance for everyone.

Who Developed This Model?

This model was jointly developed by NVIDIA and Mistral AI. NVIDIA is a leader in AI and GPU technology, and Mistral AI specializes in AI model development and optimization. Their goal was a language model that is both highly accurate and resource-efficient, with an emphasis on reducing computational cost.

What is Mistral-NeMo-Minitron?

Mistral-NeMo-Minitron is an 8B-parameter large language model. It is a pruned and distilled version of the 12B Mistral NeMo model, optimized for a range of natural language generation tasks, and it strikes a good balance between size and performance.

Key Features of Mistral-NeMo-Minitron

  • Pruning & Distillation: Reduces the size of the model using structured pruning followed by knowledge distillation, keeping it accurate and fast.
  • Leading accuracy: Achieves the highest accuracy among similarly sized models on 9 benchmarks and performs well across a variety of natural language processing (NLP) tasks.
  • Efficient architecture: An embedding size of 4096 and 32 attention heads, designed for low latency and high throughput.
  • Advanced Techniques: Uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE), which improve the model's efficiency and performance.

Capabilities/Use Cases

  • Wide Applications: Well suited to chatbots and virtual assistants, especially where real-time processing matters.
  • Healthcare: Helps doctors with note-taking and other documentation and supports interaction with patients, improving the accuracy and efficiency of healthcare.
  • Legal: Automates the drafting and review of legal documents, saving time and effort on routine legal tasks.
  • Finance: Powers intelligent financial assistants that improve customer service and offer timely, relevant financial guidance.

How Does Mistral-NeMo-Minitron Work?

Mistral-NeMo-Minitron uses a compression strategy that combines weight pruning with knowledge distillation to shrink the model without losing performance. As shown in the figure below, it starts with a pretrained model (like Mistral-NeMo-12B) and goes through an important 'teacher correction' step. This means fine-tuning the teacher model on the dataset that will be used for distillation, which fixes any mismatch between that dataset and the data the teacher was originally trained on. The corrected teacher then kicks off the compression process.

High-level overview of our proposed pruning and distillation approach
source - https://arxiv.org/pdf/2408.11796

Next up is pruning, where the model's size gets cut down by removing less important weights. Unlike the step-by-step approach in the figure below, Mistral-NeMo-Minitron (MN-Minitron-8B for short) uses a one-shot pruning method. For MN-Minitron-8B, this means reducing the hidden dimension from 5120 to 4096 and the MLP hidden dimension from 14336 to 11520, while keeping the number of attention heads and the model depth the same. The pruned model then goes through distillation, learning to copy the behavior of the bigger teacher model. The distillation process uses a forward KL divergence loss on the teacher and student logits, training the student to produce outputs similar to the teacher's.
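The distillation objective described above can be sketched in a few lines. This is a minimal illustration rather than the authors' training code: the function names and the toy logits are made up for this example, and it scores a single token position instead of batched sequences.

```python
import math


def log_softmax(logits):
    """Return log-probabilities for a list of raw logits (numerically stable)."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


def forward_kl(teacher_logits, student_logits):
    """Forward KL divergence KL(teacher || student) for one token position."""
    log_p = log_softmax(teacher_logits)  # teacher distribution
    log_q = log_softmax(student_logits)  # student distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy logits over a 4-token vocabulary (hypothetical numbers).
teacher = [2.0, 1.0, 0.1, -1.0]
close_student = [1.9, 1.1, 0.0, -0.9]
far_student = [-1.0, 0.1, 1.0, 2.0]

loss_close = forward_kl(teacher, close_student)
loss_far = forward_kl(teacher, far_student)
print(loss_close < loss_far)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two distributions diverge, which is why minimizing it pulls the student's outputs toward the teacher's.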

Pruning and distillation process as per original paper
source - https://arxiv.org/pdf/2408.11796

A big win for Mistral-NeMo-Minitron is its efficiency with training data and compute. The MN-Minitron-8B model hits top performance using just 380B tokens for distillation, compared to the 15T tokens used to train the original Llama 3.1 8B model. This huge cut in training data, along with maintaining or even boosting performance on some benchmarks (like GSM8k and HumanEval), shows how effective this compression technique is. The result is a smaller model that not only matches but sometimes beats the capabilities of its larger counterpart, while needing far fewer computational resources for training and inference.
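The figures in this section can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below estimates parameter counts from the pruned dimensions and compares the token budgets. Several values are assumptions that do not appear in this article: 40 layers, 8 key/value heads, a head dimension of 128, a vocabulary of 131072 tokens, a three-matrix gated MLP, and untied input and output embeddings. Normalization weights are ignored, so the totals are approximate.

```python
def decoder_params(hidden, mlp_hidden, layers=40, q_heads=32, kv_heads=8,
                   head_dim=128, vocab=131072):
    """Rough weight count for a GQA decoder (defaults are assumed values)."""
    attn = hidden * head_dim * (q_heads + 2 * kv_heads)  # Q, K and V projections
    attn += q_heads * head_dim * hidden                  # attention output projection
    mlp = 3 * hidden * mlp_hidden                        # gate, up and down matrices
    embeddings = 2 * vocab * hidden                      # input + output embeddings
    return layers * (attn + mlp) + embeddings


# Width dimensions before and after pruning, as quoted in the text.
teacher_params = decoder_params(hidden=5120, mlp_hidden=14336)
student_params = decoder_params(hidden=4096, mlp_hidden=11520)

# Token budgets quoted in the text: 15T for Llama 3.1 8B, 380B for distillation.
token_ratio = 15_000_000_000_000 / 380_000_000_000

print(f"teacher: {teacher_params / 1e9:.2f}B parameters")
print(f"student: {student_params / 1e9:.2f}B parameters")
print(f"token budget ratio: {token_ratio:.1f}x")
```

Under these assumptions the estimate comes out at roughly 12.2B parameters for the teacher and 8.4B for the pruned student, and the distillation run uses about 39 times fewer tokens than the 15T-token training run.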

Techniques Used in Building Mistral-NeMo-Minitron

  • Teacher Correction: Fine-tuning the teacher model on the target dataset. This brings the teacher's training distribution closer to the data used for distillation.
  • Structured pruning: Removing whole blocks of weights (such as neurons or embedding channels) rather than individual elements. This decreases model size while preserving the network structure.
  • Importance Estimation: Using activations collected on a small calibration dataset to estimate how much each component contributes to the model. This leads to better-informed pruning decisions that keep what is critical to the model.
  • Knowledge Distillation: Training the smaller student model to mimic the teacher's outputs. This lets the compact model retain much of the teacher's behavior.
  • Forward KL Divergence Loss: Used here for logit-based distillation only. Minimizing this loss pushes the student's output distribution toward the teacher's.
  • Single-shot Pruning: Performing pruning in one step rather than iteratively. This approach reduces computational cost while achieving effective results.

Each technique contributes to creating a compact, efficient model. The model maintains or surpasses the performance of its counterpart.
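To make the pruning steps above concrete, here is a toy sketch of activation-based importance estimation followed by one-shot structured pruning of an MLP's hidden neurons. It is an illustration under stated assumptions, not the paper's implementation: the scoring rule (mean absolute activation), the function names, and the tiny weight matrices are all hypothetical.

```python
def neuron_importance(w_in, calibration_batch):
    """Score each hidden neuron by its mean absolute activation over the batch.

    w_in has one row per hidden neuron; each row holds that neuron's input weights.
    """
    scores = [0.0] * len(w_in)
    for x in calibration_batch:
        for j, row in enumerate(w_in):
            activation = sum(w * xi for w, xi in zip(row, x))
            scores[j] += abs(activation)
    return [s / len(calibration_batch) for s in scores]


def prune_neurons(w_in, w_out, scores, keep):
    """Keep the `keep` highest-scoring neurons in a single pruning step."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original neuron order
    new_w_in = [w_in[j] for j in kept]                      # drop whole rows
    new_w_out = [[row[j] for j in kept] for row in w_out]   # drop matching columns
    return new_w_in, new_w_out, kept


# Toy layer: 2 inputs -> 4 hidden neurons -> 2 outputs (made-up weights).
w_in = [[1.0, 0.0], [0.0, 0.01], [0.5, 0.5], [0.0, 0.0]]
w_out = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
calibration = [[1.0, 2.0], [-1.0, 1.0], [2.0, 0.0]]

scores = neuron_importance(w_in, calibration)
pruned_in, pruned_out, kept = prune_neurons(w_in, w_out, scores, keep=2)
print(kept)
```

Because entire neurons are removed, both weight matrices stay dense and simply become smaller, which is what distinguishes structured pruning from zeroing out individual weights.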

Performance Evaluation 

Mistral-NeMo-Minitron-8B leads its class in many areas. On the MMLU benchmark it scores 69.5%, which is better than Llama 3.1 8B, a similarly sized model. It also outperforms its teacher model, Mistral NeMo 12B, despite having far fewer parameters. This demonstrates the effectiveness of combining pruning with distillation.

Comparison of Minitron models to similarly-sized SoTA open models on various benchmarks
source - https://arxiv.org/pdf/2408.11796

Mistral-NeMo-Minitron-8B does well on other tasks too. It scores 80.4% on Winogrande, 64.4% on the ARC Challenge, 83.0% on HellaSwag, and 47.6% on TruthfulQA. On the GSM8k benchmark it attains 58.5%, which is better than both Llama 3.1 8B and its teacher. It also improves on tasks that require code generation, where the HumanEval score rises from 23.8% to 36.2%. These findings indicate that the approach used to build the Minitron model works well.

How Mistral NeMo-Minitron 8B Stacks Up Against Rivals

When you compare the Mistral NeMo-Minitron 8B with the Gemma 7B, it’s pretty clear that the former shines in both accuracy and computational efficiency. The Mistral NeMo-Minitron 8B uses advanced pruning and knowledge distillation techniques, which help it achieve high accuracy without racking up huge computational costs. On the flip side, the Gemma 7B, while lightweight and great for tasks like question answering and summarization, doesn’t quite match the versatility and efficiency of the Mistral NeMo-Minitron 8B. Plus, the Mistral NeMo-Minitron 8B is optimized for low latency and high throughput, making it perfect for real-time applications—a feature that really sets it apart from the Gemma 7B.

When you stack it up against the Mistral 7B and Llama 3.1 8B, the Mistral NeMo-Minitron 8B still comes out on top. The Mistral 7B is high-performing and great for real-time applications, but it doesn’t offer the same level of accuracy and efficiency as the Mistral NeMo-Minitron 8B. The Llama 3.1 8B, on the other hand, is notable for its multilingual capabilities and long context length, making it excellent for dialogue-based applications. However, the Mistral NeMo-Minitron 8B’s blend of high accuracy, low computational cost, and versatility across multiple benchmarks—including language understanding, common sense reasoning, and coding—makes it a more well-rounded choice.

So, the Mistral NeMo-Minitron 8B really sets itself apart from its competitors with its unique mix of accuracy, efficiency, and versatility. While the Gemma 7B and Mistral 7B focus on specific tasks and the Llama 3.1 8B excels in multilingual dialogue, the Mistral NeMo-Minitron 8B’s advanced pruning and knowledge distillation techniques make it a superior choice for a wide range of applications. Its optimization for low latency and high throughput further enhances its suitability for real-time applications, making it an ideal model for various use cases, from language understanding to coding.

How to Access and Use This Model?

The Mistral-NeMo-Minitron 8B model is available for download on Hugging Face. It can be run locally on GPU-accelerated systems or deployed as an NVIDIA NIM microservice with a standard API. The model weights are openly available and commercially usable under the NVIDIA Open Model License Agreement. For detailed instructions on how to use the model, refer to the GitHub repository. All links are provided at the end of this article for readers who want to learn more about this model.
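For local use, a minimal sketch with the Hugging Face transformers library might look like the following. It assumes that transformers, torch, and accelerate are installed, that a GPU with enough memory for an 8B model in bfloat16 is available, and that the installed transformers version supports this checkpoint. The `generate` helper and the prompt are made up for illustration; the model ID is taken from the Hugging Face link in the Source list.

```python
MODEL_ID = "nvidia/Mistral-NeMo-Minitron-8B-Base"


def generate(prompt, max_new_tokens=32):
    """Load the model and complete `prompt`. Downloads the weights on first use."""
    # Imported lazily so the heavy dependencies are only needed when generating.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # assumed precision; adjust for your hardware
        device_map="auto",           # requires the accelerate package
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (commented out because it downloads several gigabytes of weights):
# print(generate("Complete the paragraph: our solar system is"))
```

Since this is a base model rather than an instruction-tuned one, it works best with completion-style prompts like the one in the example call.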

Limitations and Future Work

The Mistral-NeMo-Minitron 8B model is a major improvement on previous models, but it still has limitations. It was trained on data that contains toxic language and societal biases, and these issues can surface, and may become more prominent, in the outputs of the model and of models fine-tuned from it. Future work is expected to focus on reducing this limitation.

Conclusion

The Mistral-NeMo-Minitron 8B model combines strong accuracy with efficiency in a compact form factor. Built by NVIDIA and Mistral AI using advanced pruning and knowledge distillation techniques, it is both powerful and accessible for a variety of uses. It will be interesting to see its impact.

Source
NVIDIA blog: https://developer.nvidia.com/blog/mistral-nemo-minitron-8b-foundation-model-delivers-unparalleled-accuracy/
Research paper: https://arxiv.org/pdf/2408.11796
Model weights on Hugging Face: https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Base
Minitron GitHub repo: https://github.com/NVlabs/Minitron


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.
