Introduction
Red Pajama is a new project that aims to create an open-source Llama model. Llama is currently the closest rival to OpenAI's GPT models, but it has two major drawbacks: it is controlled by Meta, and it is not licensed for commercial use. Red Pajama aims to solve this problem by creating a high-quality model that is also available for commercial use.
About Red Pajama
Red Pajama is a project to create leading, fully open-source models. It is a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute. The team has already put together a data set of 1.2 trillion tokens, which is on par with the data Llama was trained on. The biggest issue with open-source models is the quality gap between them and closed models like GPT-4. Even the best open models are often not available for commercial use, and Red Pajama aims to change that.
Three Stages of Red Pajama
There are three stages in the creation of the Red Pajama model: pre-training data, base models trained at scale on this data, and fine-tuning for specific tasks.
- Pre-training data
The pre-training data needs to be both high quality and broad in coverage. Technical advances only go so far here; data quality is everything. The team says it has a few terabytes of such data that can be easily downloaded. That is a lot of data for a consumer computer, but it can be useful to research groups with access to large amounts of computing power.
- Base models
Base models are trained at scale on the pre-training data set.
- Instruction Tuning
Instruction tuning means taking a set of instruction examples, such as "write Python code to count from one to a hundred", and rating the model's output to improve its usability and safety. Llama's 7-billion-parameter model is particularly valuable to the open-source community, and it is trained for much longer than other models to ensure the best quality at that size. Llama and all of its derivatives are available only for non-commercial research purposes. Red Pajama, however, aims to create a fully open-source reproduction of Llama that would be available for commercial applications.
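To make the idea of rated instruction examples more concrete, here is a minimal Python sketch. The record layout, the 1 to 5 rating scale, the prompt template, and the function names are all assumptions made for illustration; the post does not say how Red Pajama or OpenChatKit actually store or format their instruction data.

```python
from dataclasses import dataclass


@dataclass
class InstructionExample:
    """One instruction-tuning record (hypothetical schema, not Red Pajama's)."""
    instruction: str  # what the user asks for
    response: str     # a candidate answer to the instruction
    rating: int       # human rating, assumed here to run from 1 (worst) to 5 (best)


def to_training_text(example: InstructionExample) -> str:
    """Join instruction and response into one training string (assumed template)."""
    return (
        f"### Instruction:\n{example.instruction}\n\n"
        f"### Response:\n{example.response}"
    )


def keep_well_rated(examples, min_rating=4):
    """Keep only the examples whose rating meets the assumed threshold."""
    return [ex for ex in examples if ex.rating >= min_rating]


examples = [
    InstructionExample(
        instruction="Write Python code to count from one to a hundred.",
        response="for i in range(1, 101):\n    print(i)",
        rating=5,
    ),
    InstructionExample(
        instruction="Write Python code to count from one to a hundred.",
        response="print(100)",
        rating=1,
    ),
]

# Only the well-rated answer survives and becomes fine-tuning text.
training_texts = [to_training_text(ex) for ex in keep_well_rated(examples)]
```

In this sketch the ratings are used only to filter examples before fine-tuning. That is one simple way to use them, not necessarily the one any particular project follows.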
Red Pajama Base Data Set
The Red Pajama base dataset contains 1.2 trillion tokens carefully filtered for quality. There are seven data slices in total:
- CommonCrawl: This slice comprises five dumps of CommonCrawl, processed using the CCNet pipeline. The pages have been filtered by several quality filters, including a linear classifier that selects pages similar to those found on Wikipedia.
- C4: This slice includes the standard C4 dataset.
- GitHub: This slice comprises data from GitHub, filtered by license and quality. It helps the model with programming tasks.
- arXiv: This slice includes scientific articles with boilerplate removed.
- Books: This slice includes a corpus of open books that have been deduplicated based on content similarity.
- Wikipedia: This slice includes a subset of Wikipedia pages with boilerplate removed.
- StackExchange: This slice includes a subset of popular websites under StackExchange, covering a wide variety of questions and answers, with boilerplate removed.
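Deduplicating "based on content similarity", as mentioned for the Books slice, can be pictured with a small sketch. The version below compares documents by the Jaccard similarity of their word trigrams. This is a stand-in technique chosen for illustration: the function names, the trigram size, and the 0.8 threshold are assumptions, and the post does not say which similarity measure Red Pajama uses.

```python
def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word n-grams ("shingles") in a text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs, threshold=0.8):
    """Keep a document only if it is not too similar to one already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        current = shingles(doc)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(doc)
            kept_shingles.append(current)
    return kept


books = [
    "It was the best of times, it was the worst of times, it was the age of wisdom",
    "It was the best of times, it was the worst of times, it was the age of wisdom.",
    "Call me Ishmael. Some years ago, never mind how long precisely",
]

# The second text differs from the first only by a trailing period, so it is dropped.
unique_books = deduplicate(books)
```

This pairwise comparison is quadratic in the number of documents, so it only works for toy inputs. A corpus of this size would need a far more scalable approach.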
During data processing, the Red Pajama team meticulously preprocesses and filters each data slice. They tune the quality filters carefully so that each slice aligns with the token count presented in the LLaMA paper by Meta AI. The largest slice is CommonCrawl, followed by C4.
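One way to picture tuning a quality filter toward a token count is to choose the score cutoff that keeps just enough tokens. The sketch below assumes every document already has a quality score and a token count; the function name `pick_threshold` and all the numbers are made up for illustration, and the post does not describe the procedure Red Pajama actually follows.

```python
def pick_threshold(docs, target_tokens):
    """Return the strictest quality threshold that still keeps at least target_tokens.

    docs is a list of (quality_score, token_count) pairs; a document is kept
    when its score is greater than or equal to the threshold.
    """
    kept_tokens = 0
    # Walk from the highest-scoring document down, accumulating tokens.
    for score, tokens in sorted(docs, reverse=True):
        kept_tokens += tokens
        if kept_tokens >= target_tokens:
            return score
    return None  # even keeping every document falls short of the target


# Hypothetical (score, token_count) pairs, invented for this example.
scored_docs = [(0.95, 400), (0.80, 300), (0.60, 500), (0.30, 800)]

threshold = pick_threshold(scored_docs, target_tokens=1000)
kept = [doc for doc in scored_docs if doc[0] >= threshold]
```

With these made-up numbers, the two best documents only provide 700 tokens, so the cutoff has to drop far enough to admit the third one as well.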
Training a Strong Base Model
As part of the INCITE program, with support from the Oak Ridge Leadership Computing Facility, Red Pajama is training a full suite of models, with the first becoming available in the coming weeks. Red Pajama also plans to release instruction-tuned versions of its models using hundreds of thousands of high-quality natural user instructions received via OpenChatKit.
Conclusion
Red Pajama is an ambitious project that aims to bridge the gap between open-source and closed models by creating a high-quality, commercially viable open-source Llama model. With a collaboration between leading research institutes and a data set of 1.2 trillion tokens, Red Pajama has the potential to revolutionize the AI industry. It represents a significant step towards democratizing access to high-quality AI models. There is a lot of excitement to come, and we can’t wait to see it.