
Saturday, 24 August 2024

Phi-3.5: Microsoft’s Efficient, Multilingual, and Secure Open-Source SLMs


Introduction

Small Language Models (SLMs) are reduced versions of large language models that can perform specialized tasks with fewer parameters and computational resources than the original comprehensive models. Improved SLMs are showing advancements by utilizing efficient architectures, domain-specific training, and novel methods such as knowledge distillation and transfer learning, which enable them to achieve high accuracy despite having fewer parameters.

Despite their advantages, SLMs face challenges such as limited context understanding, weaker factual accuracy, and a reduced capacity to store large amounts of knowledge. Phi-3.5 promises substantial progress on these fronts, introducing state-of-the-art training techniques, enhanced safety measures, and multilingual support, all aimed at addressing these challenges head-on.

Phi-3.5 was developed by Microsoft, with contributions from researchers and engineers across the company's AI and Research division.

What is Phi-3.5?

Phi-3.5 is a family of high-quality, multilingual SLMs: fast, powerful, and cost-effective. It beats existing models of similar and even larger size on language, reasoning, coding, and math benchmarks. Phi-3.5 comes in several variants, each tailored to different kinds of tasks and uses.

Phi-3.5 Quality vs. Size graph among SLMs
source - Microsoft tech community website

Model Variants

  • Phi-3.5-mini: A lightweight model with 3.8 billion parameters, designed for multilingual support and long-context tasks up to 128K tokens.
  • Phi-3.5-vision: A multimodal model with 4.2 billion parameters. Tailored for image understanding, OCR, and diagram interpretation.
  • Phi-3.5-MoE: A Mixture-of-Experts model with 16 experts and 6.6 billion active parameters. Provides high performance, reduced latency, and robust safety measures.

Key Features of Phi-3.5

  • Multilingual support: Supports more than 20 languages, making it useful for users across the globe.
  • Long-context understanding: Handles contexts of up to 128K tokens. Ideal for summarizing or translating long documents and meeting transcripts.
  • Strong performance: Generally earns competitive or even top marks on benchmarks against models of similar and larger sizes.
  • Safe by design: Supervised fine-tuning and preference optimization make it safe and trustworthy.
  • Scalable architecture: The Mixture-of-Experts variant activates only the necessary parameters, reducing computational load.

Capabilities/Use Cases of Phi-3.5

  • Language Understanding: It’s great at tasks that need language understanding and reasoning.
  • Image Understanding: The Phi-3.5-vision variant is especially good at understanding images, OCR, and diagrams.
  • Long Document Processing: Perfect for summarizing long documents and meetings, and for answering questions about lengthy texts.
  • Advanced Reasoning: It excels in reasoning tasks, even outperforming many larger models.
  • Content Generation: It can generate content for various uses, like language translation and solving complex problems.

How Phi-3.5 Works: Its Clever Design and Training 

Phi-3.5 leverages cutting-edge techniques and an innovative architecture to achieve strong results. An interesting feature is the Mixture-of-Experts (MoE) design within the Phi-3.5-MoE variant. It only activates certain parameters during inference, allowing scalability, efficiency and easy maintenance. This selective engagement helps Phi-3.5 handle lengthy contextual tasks like summarizing long documents or retrieving information within them. Phi-3.5-vision also includes an image encoder, connector and projector for processing both text and visuals, exhibiting multimodal potential. 
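
To make the selective-activation idea concrete, below is a minimal, illustrative sketch of top-k expert routing in PyTorch. This is not Microsoft's implementation; the dimensions, the top-2-of-16 routing, and all names are assumptions chosen to mirror the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Illustrative Mixture-of-Experts layer: a learned gate routes each
    token to its top-k experts, so only a fraction of the layer's
    parameters is active for any given token."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)  # the router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (n_tokens, d_model)
        scores = self.gate(x)                      # (n_tokens, n_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e           # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = TopKMoELayer()
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```

With 16 experts and top-2 routing, each token touches only 2 of the 16 expert blocks per layer, which is how Phi-3.5-MoE keeps its active parameter count at 6.6B despite a much larger total.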

Phi-3.5 applies various enhanced training approaches for optimized performance and security. Supervised fine-tuning, proximal policy optimization, and direct preference optimization guide the model to follow instructions appropriately and avoid harmful outputs. It can comprehend contexts of up to 128K tokens, making it suitable for complex, document-heavy systems. The models are trained on a blend of open-source and proprietary datasets so that they stay helpful, risk-averse, and reliable across uses.
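
As a rough illustration of one of these alignment steps, the sketch below implements the direct preference optimization (DPO) objective of Rafailov et al. (2023), which nudges the model to prefer a chosen response over a rejected one relative to a frozen reference model. This is the generic textbook formulation, not Microsoft's exact recipe; the inputs are assumed to be precomputed per-response log-probabilities.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is the summed log-probability of a complete response
    under the trainable policy or the frozen reference model; beta
    controls how far the policy may drift from the reference."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # Maximize the log-sigmoid of the scaled margin difference.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy example with made-up log-probabilities for two preference pairs.
print(dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
               torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -10.4])))
```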

With sophisticated techniques and thoughtful architecture, Phi-3.5 surpasses earlier versions in capabilities while solving many of their issues. This renders it a highly capable and adaptable AI. 

Performance Evaluation with Other Models

The Phi-3.5 family of models, including Phi-3.5-MoE, Phi-3.5-mini, and Phi-3.5-vision, has shown impressive performance across various benchmarks, often outperforming larger models. As the table below shows, with only 6.6B active parameters, Phi-3.5-MoE achieves results comparable to or better than much larger models. It excels in language understanding, math, and reasoning tasks, often surpassing bigger models in reasoning.

Phi-3.5-MoE Model Quality
source - Microsoft tech community website

Phi-3.5-mini, despite its compact size of 3.8B parameters, is remarkably efficient and performs well. The table below shows that it matches or exceeds the performance of larger models across key benchmarks. Its multilingual capabilities are particularly noteworthy: it shows significant improvements over its predecessor, Phi-3-mini, especially in languages like Arabic, Dutch, Finnish, Polish, Thai, and Ukrainian, with 25-50% performance boosts.

Phi-3.5-mini Model Quality
source - Microsoft tech community website

Phi-3.5-vision introduces advanced capabilities for multi-frame image understanding and reasoning. The table below demonstrates significant performance improvements on numerous single-image benchmarks. For instance, it boosted MMMU performance from 40.4 to 43.0 and improved MMBench performance from 80.5 to 81.9. Additionally, the document understanding benchmark TextVQA saw an increase from 70.9 to 72.0. These improvements highlight the model's enhanced ability to process and understand visual information across various tasks.

Phi-3.5-vision Tasks Benchmark
source - Microsoft tech community website

The Phi-3.5 models have undergone extensive testing across various domains. Phi-3.5-MoE was evaluated on multilingual tasks and showed competitive performance even against models with far more active parameters. Phi-3.5-mini excelled in long-context understanding tasks like long document summarization, long document-based QA, and information retrieval, outperforming models of similar or larger size. Phi-3.5-vision also showed improvements in multi-frame image comparison, multi-image summarization/storytelling, and video summarization. These comprehensive evaluations underscore the versatility and efficiency of the Phi-3.5 model family across language processing, reasoning, long-context understanding, and visual tasks, making the models highly capable and cost-effective options for various AI applications.

Comparison of Phi-3.5-MoE, Gemma-2 27B, and Mixtral 8x22B

Phi-3.5-MoE stands out with its Mixture-of-Experts (MoE) architecture, which activates only two of its 16 experts per token during inference. This makes it efficient and good at handling long-context tasks such as long document summarization and information retrieval.

Gemma-2 27B has 27 billion parameters and uses a text-to-text decoder-only architecture. It has features like sliding window attention and grouped-query attention. It is excellent at text generation and summarization but is not as efficient as Phi-3.5-MoE.

Mixtral 8x22B uses a sparse Mixture-of-Experts (SMoE) architecture with 141 billion total parameters, of which only 39 billion are active during inference. It performs well in multilingual understanding and math reasoning. However, its 64K-token context window is shorter than Phi-3.5-MoE's 128K tokens, which limits its effectiveness on long-context tasks.

Phi-3.5-MoE uses supervised fine-tuning, proximal policy optimization, and direct preference optimization. These methods ensure precise instruction adherence and robust safety measures. It is trained on 4.9 trillion tokens using 512 H100 GPUs. This makes it highly efficient and capable.

Gemma-2 27B is trained on 13 trillion tokens using Google Cloud TPUs. It also uses supervised fine-tuning, knowledge distillation, and RLHF for optimization. However, its dense architecture offers less flexibility than Phi-3.5-MoE's MoE approach.

Mixtral 8x22B is optimized for fine-tuning scenarios and supports a wide range of tasks. However, it lacks the comprehensive safety measures and efficiency of Phi-3.5-MoE, and its sparse activation pattern may complicate deployment and fine-tuning.

In short, Phi-3.5-MoE stands out for its innovative Mixture-of-Experts architecture, which enhances scalability and efficiency. It handles contexts of up to 128K tokens, supports over 20 languages, and its robust safety measures and advanced training techniques ensure precise instruction adherence and reliability, making it a highly capable and versatile solution. For long-context processing, multilingual support, and robust safety, Phi-3.5-MoE is the best choice; for text generation and summarization, Gemma-2 27B may be more suitable; and Mixtral 8x22B excels in multilingual understanding and math reasoning.

How to Access and Use Phi-3.5

Phi-3.5 models are available on Hugging Face and in Microsoft Azure AI Studio. They are open source, released under the permissive MIT license, which allows broad commercial and research use. Instructions for local usage and online demos are available in the corresponding GitHub repositories, and the Hugging Face model cards provide detailed documentation and hands-on examples.
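
As a quick-start sketch, the snippet below loads Phi-3.5-mini-instruct with the Hugging Face transformers library, following the general pattern from the model card. The device placement, dtype, and generation settings here are assumptions to adapt to your own hardware.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3.5-mini-instruct"

# trust_remote_code is required because the repo ships custom model code.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",     # spread layers across available devices (assumed setup)
    torch_dtype="auto",    # pick the checkpoint's native precision
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the advantages of small language models."},
]
output = pipe(messages, max_new_tokens=256, return_full_text=False, do_sample=False)
print(output[0]["generated_text"])
```

The MoE and vision variants load analogously (the vision model additionally needs an image processor for its inputs); see their Hugging Face model cards for variant-specific examples.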

Limitations and Future Work

The Phi-3.5 family of models, including the mini, MoE, and vision variants, has greatly improved long-context processing and language capabilities for small models. But the new versions also have limitations. The models are not as effective in low-resource languages as they are in English, a shortcoming that still needs to be addressed. And while the MoE model delivers impressive performance with its 6.6 billion active parameters, its Mixture-of-Experts design adds complexity that can complicate future deployment and day-to-day modification.

The next step is to broaden the training data to cover more languages and to tailor the models to new application domains. Furthermore, significant safety and ethical questions around AI use need to be addressed to ensure true robustness in real-world situations. Together, these directions should keep future generations of the Phi-3.5 family competitive and suitable for a wider range of uses and users.

Conclusion 

Phi-3.5 is an epoch-making achievement in the field of small language models. It combines high performance with multilingual support and strong safety measures, tackling a host of problems that SLMs commonly face. Phi-3.5 sets a new benchmark for SLMs, providing practical intelligence and solutions across a range of applications. Its arrival is an important step toward making AI technologies more useful and accessible to everyone.


Source
Microsoft tech community post: https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/discover-the-new-multi-lingual-high-quality-phi-3-5-slms/ba-p/4225280
HF Phi-3.5-mini: https://huggingface.co/microsoft/Phi-3.5-mini-instruct
HF Phi-3.5-MoE: https://huggingface.co/microsoft/Phi-3.5-MoE-instruct
HF Phi-3.5-vision: https://huggingface.co/microsoft/Phi-3.5-vision-instruct


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.
