Friday 24 May 2024

Aya 23: New Open-Source Multilingual Language Models by Cohere

Introduction

Multilingual Language Models (MLLMs) are opening a new frontier in artificial intelligence, transforming how we interact across the globe’s tapestry of languages and making AI accessible to people around the world. Their rise has been marked by significant advances in Natural Language Processing (NLP): these models are designed to understand, interpret, and generate text in multiple languages, breaking down language barriers and fostering global communication.

Yet the path of progress is not without its challenges. One of the most pressing is the performance disparity across languages, particularly for those less commonly spoken. Most progress in large language modeling has been English-centric, leading to models that perform poorly outside of a handful of languages. This remains a significant hurdle for multilingual language models.

Aya 23 emerges as a beacon in this landscape, addressing these challenges head-on. It is a product of Cohere for AI, the non-profit research arm of the Canadian enterprise AI startup Cohere, whose mission is to democratize language AI and make it accessible and useful across industries. Aya 23 was developed with the vision of creating a model that not only understands but also generates language with high accuracy and fluency across multiple languages, a powerful multilingual large language model capable of serving a significant portion of the world’s population.

What is Aya 23?

Aya 23 is a sophisticated family of multilingual language models (MLLMs) that serves 23 languages, thereby expanding the horizons of language modeling to nearly half of the world’s population. 

Model Variants

The Aya 23 family comprises two main variants, each designed to cater to different needs:

  1. Aya-23-8B: Tailored for the everyday developer, this variant features 8 billion parameters. It is optimized for generating accurate, contextually relevant text across supported languages and requires fewer resources than the larger model.
  2. Aya-23-35B: With 35 billion parameters, this variant offers enhanced performance for complex multilingual tasks, maintaining consistency and coherence in the generated text.

Key Features of Aya 23

Aya 23 boasts several unique features that set it apart:

  • It is designed to significantly enhance multilingual capabilities in NLP.
  • It supports 23 languages, including Arabic, Chinese (simplified & traditional), Czech, Dutch, English, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Turkish, Ukrainian, and Vietnamese.
  • It outperforms its predecessor, Aya 101, as well as other widely used models like Gemma, Mistral, and Mixtral on an extensive range of discriminative and generative tasks.

    Figure: Multilingual benchmark results

  • It features an optimized transformer architecture and Instruction Fine-Tuning (IFT), enabling it to follow human instructions effectively and generate text with high accuracy and coherence.

Capabilities/Use Case of Aya 23

Aya 23 is not just a multilingual language model; it’s a tool that can revolutionize various sectors with its high precision and extensive linguistic coverage. Here are some plausible use cases:

  • Advanced Translation Services: With its ability to understand and generate text in 23 languages, Aya 23 can be used to build advanced translation services. It can provide more accurate and contextually relevant translations than traditional models, making cross-language communication seamless.
  • Customer Support: Aya 23 can be integrated into customer support systems to provide multilingual support. It can understand customer queries in various languages and generate appropriate responses, improving the efficiency and effectiveness of customer service.
  • Language Learning Applications: Aya 23 can be used in language learning applications to provide accurate translations and language exercises. It can help users learn new languages more effectively.
  • Multilingual Chatbots: Aya 23 can power chatbots that can interact with users in multiple languages. This can enhance user experience and make the chatbots more user-friendly.

How does Aya 23 work?/ Architecture/Design

Aya 23 surpasses its predecessor, Aya 101, on many tasks. While Aya 101 was a generative language model proficient in 101 languages, Aya 23 adopts a more focused strategy: it prioritizes depth, dedicating greater computational capacity to a smaller set of languages during pre-training. This approach not only enhances the model’s performance but also pairs a strong pre-trained model with the multilingual Aya collection of instruction data to form a robust multilingual large language model.

The core architecture of Aya 23 is a refined decoder-only Transformer. At each step, the model attends over the tokens seen so far to work out the intent and context of the input and to predict the next token, which allows Aya 23 to deliver more accurate responses than models built on older methodologies. This decoder-only design is integral to Aya 23, enabling it to comprehend and produce fluent text across a multitude of languages.
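
To make the decoder-only idea concrete, here is a minimal NumPy sketch of causal self-attention, in which each token may attend only to itself and to earlier tokens. It illustrates the general mechanism rather than Cohere’s implementation, and it omits refinements such as multiple heads, positional encodings, and feed-forward layers that the real model uses.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # similarity of every query to every key
    # Causal mask: position i may only attend to positions <= i, which is what
    # lets a decoder-only model generate text one token at a time, left to right.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over allowed positions
    return weights @ v                              # context-aware token representations

# Toy usage: 5 tokens with a 16-dimensional embedding and an 8-dimensional head.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
w_q, w_k, w_v = (rng.normal(size=(16, 8)) for _ in range(3))
print(causal_self_attention(x, w_q, w_k, w_v).shape)  # (5, 8)
```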

Training and Fine-tuning

All Aya 23 base models are trained with Fax, a JAX-based distributed training framework, on TPU v4 chips. A combination of parallelism strategies is used to achieve high training throughput.
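
Fax is not described in detail here, so the snippet below is only a generic JAX sketch of one such parallelism strategy, plain data parallelism, in which each accelerator processes a shard of the batch and gradients are averaged across devices before the update. The toy loss and learning rate are illustrative assumptions, not details from the paper.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    # Toy regression loss standing in for the actual language-modeling objective.
    preds = batch["x"] @ params["w"]
    return jnp.mean((preds - batch["y"]) ** 2)

def train_step(params, batch):
    grads = jax.grad(loss_fn)(params, batch)
    # Average gradients across devices so every replica applies the same update.
    grads = jax.lax.pmean(grads, axis_name="devices")
    return jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)

# pmap replicates the step across local accelerators, one batch shard per device.
p_train_step = jax.pmap(train_step, axis_name="devices")

n_dev = jax.local_device_count()
params = jax.tree_util.tree_map(lambda x: jnp.stack([x] * n_dev), {"w": jnp.zeros((4, 1))})
batch = {"x": jnp.ones((n_dev, 8, 4)), "y": jnp.ones((n_dev, 8, 1))}
params = p_train_step(params, batch)  # one synchronized update across all devices
```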

The pre-trained models are fine-tuned using multilingual instruction data. The fine-tuning datasets combine a range of approaches to improve data availability, including multilingual templates, human annotations, translated data, and synthetic data. The models are fine-tuned for 13,200 update steps using an 8192 context length with data packing enabled.
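
Data packing concatenates several short instruction examples into one fixed-length sequence so that little of the 8192-token window is wasted on padding. The sketch below shows a simple greedy packer for illustration; the exact packing scheme and separator tokens used for Aya 23 are not specified here and should be treated as assumptions.

```python
from typing import List

CONTEXT_LEN = 8192  # context length used during Aya 23 fine-tuning

def pack_examples(tokenized_examples: List[List[int]], eos_id: int) -> List[List[int]]:
    """Greedily concatenate tokenized examples into windows of at most CONTEXT_LEN tokens."""
    packed, current = [], []
    for example in tokenized_examples:
        example = example + [eos_id]                # mark the boundary between examples
        if current and len(current) + len(example) > CONTEXT_LEN:
            packed.append(current)                  # window is full, start a new one
            current = []
        current.extend(example[:CONTEXT_LEN])       # truncate any single overlong example
    if current:
        packed.append(current)
    return packed

# Toy usage: three short "tokenized" examples end up packed into a single window.
print([len(seq) for seq in pack_examples([[1, 2, 3], [4, 5], [6]], eos_id=0)])  # [9]
```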

The examples used to instruction-tune Aya 23 are formatted using special tokens to include extra information. This formatting is used both during instruction-tuning and inference.
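
In practice this formatting is handled by the tokenizer’s chat template published with the weights on Hugging Face, so the special tokens never need to be typed by hand. The short sketch below assumes the transformers library and access to the CohereForAI/aya-23-8B repository; the exact tokens rendered are defined by the tokenizer configuration rather than by this article.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-23-8B")

# A single-turn conversation; the chat template wraps it in the model's control tokens.
messages = [{"role": "user", "content": "Translate to Hindi: 'Language should never be a barrier.'"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return the formatted string instead of token ids
    add_generation_prompt=True,  # append the marker that tells the model to reply
)
print(prompt)  # shows the special tokens inserted around the user turn
```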

Performance Evaluation with Other Models

Numerous evaluations were conducted to assess the performance of Aya 23.

For discriminative tasks, as illustrated in the graph under the 'Key Features' section above, the Aya 23 models were tested on challenges such as XWinograd, XCOPA, and XStoryCloze. The larger Aya-23-35B variant demonstrated its superiority with an impressive average accuracy of 70.8%, while Aya-23-8B led its size class with an average accuracy of 67.6%.

For general language comprehension, the models were evaluated on the multilingual MMLU benchmark. As shown in the table above, Aya-23-8B stood out among its peers, recording an average accuracy of 48.2% across languages, while Aya-23-35B edged out Mixtral-8x7B-Instruct with an average accuracy of 58.2% versus 57.1%.

Multilingual mathematical reasoning was assessed with the MGSM benchmark, shown in the table above, where both Aya 23 variants outshone their respective baselines. Aya-23-8B scored an average of 36.6 across seven languages, and Aya-23-35B surpassed Mixtral-8x7B-Instruct-v0.1 with a score of 53.7.

In generative tasks such as translation and summarization, the Aya 23 models also excelled. Aya-23-8B achieved an average spBleu score of 37.2 in translation and a RougeL score of 27.5 in summarization. Aya-23-35B outperformed Mixtral-8x7B by 7.8 spBleu in translation (40.4 versus 32.6) and by 23.8 RougeL in summarization (30.9 versus 7.1).

The Multilingual Edge: Aya-23-8B’s Superior Performance Landscape

When we examine Aya-23-8B alongside Mistral-7B-Instruct-v0.2 and Gemma-1.1-7B-it, distinct differences become apparent. Mistral-7B-Instruct-v0.2 represents an instruction fine-tuned iteration of the Mistral-7B-v0.2 model, featuring a 32k context window and incorporating Grouped-Query Attention and Byte-fallback BPE tokenizer, but it does not utilize Sliding-Window Attention. Conversely, Gemma-1.1-7B-it is an instruction fine-tuned model that leverages the architectures, data, and training methodologies of the Gemini models. It has been trained on 6T tokens from web documents, mathematics, and code, predominantly in English, and is characterized as a lightweight, decoder-only large language model trained with an innovative RLHF method.

In contrast, Aya-23-8B stands out as a multilingual instruction-tuned language model that supports 23 languages, drawing from Cohere’s Command model framework. Performance-wise, Aya-23-8B surpasses both Mistral-7B-Instruct-v0.2 and Gemma-1.1-7B-it across a broad spectrum of discriminative and generative tasks. Remarkably, it achieves this feat despite its relatively smaller size, outperforming larger models in over half of the languages it supports.

Aya-23-8B distinguishes itself with its multilingual capabilities and its proficiency across diverse tasks. It caters to 23 languages, thereby extending state-of-the-art language modeling to nearly half of the global population. This positions Aya-23-8B as a formidable asset for multilingual language processing endeavors. Its distinctive features and capabilities underscore its role as a pivotal development in the spheres of AI and multilingual language models.

Access and Usage

The model is open-source, with weights available on Hugging Face for both the 8B and 35B variants. It can be run locally or tried online via the demo link, offering flexibility for developers and researchers.
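
As a concrete starting point, the sketch below loads the 8B weights with the Hugging Face transformers library and generates a reply locally. It assumes a GPU with enough memory for the 8B model in half precision, and the generation settings (max_new_tokens, temperature) are illustrative choices rather than recommendations from Cohere.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/aya-23-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Build the prompt with the model's chat template, then generate a completion.
messages = [{"role": "user", "content": "Écris un court poème sur la pluie."}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.3)
# Decode only the newly generated tokens, skipping the prompt and special tokens.
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```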

If you are interested in learning more about this AI model, all relevant links are provided under the 'Source' section at the end of this article.

Limitations and Future Work

While Aya 23 is a significant advancement in multilingual language models, it is important to acknowledge its limitations. The model supports 23 languages, a small fraction of the roughly 7,000 languages spoken worldwide. Its coverage is confined to the languages included during pre-training and skews toward languages predominantly spoken in certain geographical areas, leaving many Asian and African languages under-represented.

As the team builds on the groundwork established by the original Aya model, future work will aim to broaden linguistic coverage and improve performance for the many languages not yet supported.

Conclusion

Aya 23 exemplifies the power of MLLMs to transcend language barriers, envisioning a future where AI-facilitated communication is as effortless and intuitive as conversing in our native language. By prioritizing depth over breadth, it delivers precise and contextually appropriate text generation across 23 languages.


Source
Cohere technical report page: https://cohere.com/research/papers/aya-command-23-8b-and-35b-technical-report-2024-05-23
Technical report (PDF): https://drive.google.com/file/d/1YKBPo61pnl97C1c_1C2ZVOnPhqf7MLSc/view
Aya 23 demo on Hugging Face Spaces: https://huggingface.co/spaces/CohereForAI/aya-23
Weights for Aya-23-35B: https://huggingface.co/CohereForAI/aya-23-35B
Weights for Aya-23-8B: https://huggingface.co/CohereForAI/aya-23-8B


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.
