
Sunday 7 May 2023

MPT-7B: A transformer model trained from scratch on 1T tokens of text and code

[Image: symbolic illustration of the MPT-7B transformer model]

Introduction

MPT-7B is a new transformer model developed by MosaicML, an AI research and development company that wants to democratize AI by making it more accessible and efficient for businesses and developers. The model was trained from scratch on 1 trillion tokens of text and code on the MosaicML platform in just 9.5 days, with zero human intervention.


What is the MosaicML Pretrained Transformer (MPT)?

The MosaicML Pretrained Transformer is a decoder-only transformer that follows the GPT-style architecture.
Unlike LLaMA, whose weights are released for research use only, MPT is licensed for commercial use.

Features of MPT-7B

The model was trained on the MosaicML platform in roughly 9.5 days at a cost of approximately $200k, with zero human intervention. It matches the quality of LLaMA's 7-billion-parameter model. Because the weights are commercially licensed, businesses can deploy it as a private language model for different use cases. Finetuned variants such as a chatbot and a story writer have also been released and can serve as starting points for such deployments.

Architecture Improvements

MPT has performance-optimized layer implementations and architecture changes that increase training stability. These modifications let customers train their models with greater efficiency, without diverging loss spikes. The model can be served using standard Hugging Face pipelines as well as NVIDIA's FasterTransformer.
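As a quick illustration of the Hugging Face serving path mentioned above, here is a minimal sketch of loading MPT-7B with the standard `transformers` pipeline. The model ID, the `trust_remote_code` flag, and the GPT-NeoX tokenizer follow the public model card, but exact arguments (dtype, device placement) are assumptions that may vary with your `transformers` version and hardware.

```python
# Minimal sketch: loading MPT-7B through a standard Hugging Face pipeline.
import torch
import transformers

model_name = "mosaicml/mpt-7b"

# MPT reuses the GPT-NeoX tokenizer rather than shipping its own.
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,   # assumption: bf16 so the 7B weights fit on a single GPU
    trust_remote_code=True,       # MPT uses custom modeling code hosted alongside the weights
)

pipe = transformers.pipeline(
    "text-generation", model=model, tokenizer=tokenizer, device=0
)
print(pipe("MosaicML is", max_new_tokens=50)[0]["generated_text"])
```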

Chatbot Functionality

Thanks to its capacity to handle inputs of up to 65k tokens, the chatbot can generate responses grounded in far more context. While it shares similar features with LLaMA-based chatbots, it offers a wider range of output options. MPT-7B has three finetuned variants that showcase the versatility and potential of the MPT series.

The first variant, MPT-7B-Instruct, is a commercially usable instruction-following model that has been finetuned on a dataset derived from Dolly and HH-RLHF.

The second variant, MPT-7B-Chat, is a chatbot-like model for dialogue generation that has been built by finetuning MPT-7B on multiple datasets.

The third variant, MPT-7B-StoryWriter-65k+, is designed to read and write stories with exceptionally long contexts: its context window of 65,000+ tokens, with the ability to extrapolate beyond it, allows it to work with long-form documents such as novels. The entire novel "The Great Gatsby" was used to test the StoryWriter model's ability to attend to different parts of a document and generate content based on it.

These variants can be applied to diverse downstream tasks and demonstrate the cutting-edge features of MPT-7B, such as FlashAttention for fast training and inference, ALiBi for finetuning and extrapolating to long context lengths, and open-source training code for maximum efficiency.
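To make the Instruct variant concrete, here is a hedged sketch of how one might prompt it. The Dolly-style instruction template below is an assumption based on the datasets it was finetuned on; the model card is the authority on the exact format.

```python
# Sketch: prompting MPT-7B-Instruct with a Dolly-style template (assumed format).
from transformers import AutoModelForCausalLM, AutoTokenizer

INSTRUCT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n{instruction}\n### Response:\n"
)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b-instruct", trust_remote_code=True
)

prompt = INSTRUCT_TEMPLATE.format(instruction="Explain ALiBi in two sentences.")
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```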

Benefits of using MPT over other open-source models

MPT has been trained on one trillion tokens and has handled inputs of up to 84k tokens. Its architecture improvements allow for greater efficiency during training without loss spikes. Additionally, the chatbot examples provided by MosaicML offer responses grounded in larger contexts than other open-source models.
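Because ALiBi replaces positional embeddings, the usable context length is a configuration value rather than a hard architectural limit. The sketch below shows how such an override might look for the StoryWriter variant, following the pattern on the MosaicML model cards; the `max_seq_len` field name and the 84k figure are assumptions about that interface, and such long inputs require a large amount of GPU memory.

```python
# Sketch: raising the context window of MPT-7B-StoryWriter via a config override.
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "mosaicml/mpt-7b-storywriter"

config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
config.max_seq_len = 83968  # extrapolate past the 65k training length (assumed field name)

model = AutoModelForCausalLM.from_pretrained(
    model_name, config=config, trust_remote_code=True
)
```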

Comparison with other models

MPT-7B is more efficient than many other models at handling extremely long inputs, thanks to Attention with Linear Biases (ALiBi), which replaces positional embeddings and removes the hard limit on context length. The image below illustrates that the zero-shot accuracy of MPT-7B is better than that of many other open-source models on academic tasks. While LLaMA may be better in certain cases due to its resources, MPT provides better results overall.

[Image: zero-shot accuracy of MPT-7B compared with other open-source models]

MPT-7B-Instruct vs. MPT-7B-Chat

The Instruct model is finetuned on instruction demonstrations, while the Chat model is finetuned on conversational samples. Instruct is the simpler of the two and is best suited to asking questions or giving one-off tasks, while the Chat model is better suited to day-to-day, multi-turn conversations. Both models can be downloaded from Hugging Face, and the demo Spaces can be duplicated into your own account.
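The practical difference shows up in how prompts are assembled: Instruct takes a single instruction, while Chat expects a rendered conversation. The ChatML-style markers in this sketch are an assumption about the Chat variant's finetuning format; consult the mosaicml/mpt-7b-chat model card for the exact template.

```python
# Sketch: rendering a multi-turn conversation for the Chat variant (assumed ChatML-style format).
def build_chat_prompt(turns):
    """Render (role, message) pairs into a single prompt string."""
    rendered = ""
    for role, message in turns:
        rendered += f"<|im_start|>{role}\n{message}<|im_end|>\n"
    rendered += "<|im_start|>assistant\n"  # cue the model to respond as the assistant
    return rendered

prompt = build_chat_prompt([
    ("system", "You are a helpful assistant."),
    ("user", "Summarize what makes MPT-7B different from LLaMA-7B."),
])
print(prompt)
```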

Advantages of MPT

MPT is licensed for commercial use, unlike popular models such as LLaMA whose weights carry a research-only license. It has been trained on a large amount of data (1 trillion tokens), making it comparable in quality to top models like LLaMA. It can handle extremely long inputs (up to 65k tokens). It has been optimized for both fast training and inference using techniques such as FlashAttention and FasterTransformer.
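For the inference-speed point, the optimized attention path is typically switched on through the model config. The `attn_config` override below mirrors the pattern shown on the MosaicML model cards, but the field names are assumptions, and the triton kernel requires a compatible GPU plus a triton install.

```python
# Sketch: enabling the FlashAttention-style triton kernel for faster inference (assumed config fields).
import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "mosaicml/mpt-7b"

config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
config.attn_config["attn_impl"] = "triton"  # use the optimized kernel instead of default torch attention

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")
```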

Future Goals

MPT is an open-source, commercially usable LLM, and MosaicML plans to keep innovating on it. There is not yet much clarity about future datasets or goals, but more good releases are expected.

How to access the MPT-7B model?

MosaicML has released the dataset used to finetune the Instruct model, which can be accessed immediately. They have also released a Hugging Face Space for Instruct where users can try it out easily, and a chat interface on Hugging Face Spaces where users can talk to the Chat model. Check out their blog for more analysis and a detailed understanding of what they are trying to do, and play around with the chatbot demo available on Hugging Face. All links are provided in the 'Source' section at the end of this article.

Conclusion

MPT-7B has been rigorously evaluated on a variety of benchmarks and consistently meets the high-quality bar set by LLaMA-7B. MPT-7B is a significant step forward in the development of open language models, and its future development plans are exciting. MosaicML has released three finetuned models in addition to the base MPT-7B, which showcase the versatility and potential of the MPT series. It will be interesting to see how MPT-7B and other open models continue to evolve and improve in the future.

Source
Blog: https://www.mosaicml.com/blog/mpt-7b
Hugging Face: https://huggingface.co/mosaicml
Demo (chat): https://huggingface.co/spaces/mosaicml/mpt-7b-chat
Demo (instruct): https://huggingface.co/spaces/mosaicml/mpt-7b-instruct

Read More: Discover how StarCoder LLM can help you generate high-quality code and increase productivity.
