Overview:
Databricks has released Dolly 2.0, an open-source, instruction-following LLM. It has been fine-tuned on a transparent, human-generated dataset that is itself open sourced and licensed for commercial use.
What is Dolly 2.0?
- Dolly 2.0 is an open-source instruction following LLM.
- It is trained on a human-generated instruction dataset licensed for research and commercial use.
- The model is based on EleutherAI's Pythia model family and has 12 billion parameters.
- Databricks fine-tuned EleutherAI's pythia-12b to produce Dolly 2.0. The fine-tuning used their new instruction dataset, which they claim yields better instruction following than the original Dolly, which was trained on the synthetic Alpaca dataset.
- The model requires significant hardware to run due to its size; a minimal inference sketch follows this list.
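For orientation, here is a minimal inference sketch using the Hugging Face transformers pipeline. It assumes the dolly-v2-12b checkpoint published on the Hugging Face Hub and a GPU with enough memory to hold the 12B weights in bfloat16 (roughly 24 GB); the prompt is just an example.

```python
# Minimal inference sketch for Dolly 2.0 (assumes the databricks/dolly-v2-12b
# checkpoint on the Hugging Face Hub and a sufficiently large GPU).
import torch
from transformers import pipeline

# trust_remote_code is needed because the checkpoint ships its own
# instruction-following text-generation pipeline.
generate_text = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

res = generate_text("Explain the difference between pretraining and instruction tuning.")
print(res)
```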
About the Dataset
- The dataset, called Databricks Dolly 15K, contains 15,000 high-quality human-generated prompt-response pairs specifically designed for instruction tuning large language models.
- The dataset contains natural, expressive training records that represent a wide range of behaviors, from brainstorming and content generation to information extraction and summarization.
- The records were written by Databricks employees, and most tasks have high-quality, long-form answers.
- The dataset is released under the Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) license, which means anyone can use, modify, or extend it for any purpose, including commercial applications; see the loading sketch after this list.
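The records can be inspected directly with the Hugging Face datasets library. The sketch below assumes the dataset is published on the Hub as databricks/databricks-dolly-15k with instruction, context, response, and category fields.

```python
# Quick look at databricks-dolly-15k (assumes the dataset is available on the
# Hugging Face Hub under databricks/databricks-dolly-15k).
from collections import Counter
from datasets import load_dataset

ds = load_dataset("databricks/databricks-dolly-15k", split="train")

print(len(ds))                  # roughly 15,000 records
print(ds[0])                    # fields: instruction, context, response, category
print(Counter(ds["category"]))  # brainstorming, summarization, open_qa, ...
```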
Commercial Use
- Many existing instruction-following models (and the synthetic datasets they are trained on) prohibit commercial use, so Databricks created this new dataset because they wanted to produce an open-source model that can be used commercially.
- Databricks is open sourcing the entirety of Dolly 2.0, including the training code, dataset, and model weights, making it suitable for commercial use.
- Any organization can create, own, and customize powerful LLMs that can talk to people, without paying for API access or sharing data with third parties; a simplified fine-tuning sketch follows this list.
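To illustrate what owning and customizing such a model can look like, here is a heavily simplified fine-tuning sketch, not the official Databricks training code (which lives in the databrickslabs/dolly repository). It formats the dolly-15k records into prompt/response text and runs a standard causal-language-modeling Trainer over a smaller Pythia checkpoint; the base model, prompt template, and hyperparameters are illustrative assumptions.

```python
# Simplified instruction fine-tuning sketch in the spirit of Dolly 2.0.
# The base checkpoint, prompt template, and hyperparameters are assumptions,
# not the settings used by Databricks.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "EleutherAI/pythia-2.8b"  # smaller sibling of pythia-12b, for illustration
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def to_text(example):
    # Join instruction, optional context, and response into one training string.
    context = f"\n{example['context']}" if example["context"] else ""
    return {"text": f"### Instruction:\n{example['instruction']}{context}\n\n"
                    f"### Response:\n{example['response']}"}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=1024)

tokenized = (dataset.map(to_text)
                    .map(tokenize, remove_columns=dataset.column_names + ["text"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="dolly-style-finetune",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```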
Few Drawbacks
- According to the GitHub page for the dataset, Wikipedia passages were used while developing prompts and responses. This means that any biases present in Wikipedia could potentially be reflected in the final dataset.
- Additionally, some of the individuals involved in creating the dataset were not native English speakers, which could result in inconsistencies.
- Furthermore, the demographic makeup of the team that created the dataset could introduce biases specific to their backgrounds.
Important Download Links
- Model weights: https://huggingface.co/databricks/dolly-v2-12b
- Dataset: https://huggingface.co/datasets/databricks/databricks-dolly-15k
- Training code: https://github.com/databrickslabs/dolly
Conclusion (Encouraging Innovation)
The open-source dataset and models encourage community research and innovation that will help ensure everyone benefits from advances in artificial intelligence technology.
Databricks hopes that Dolly and the open-source dataset will act as the seed for a multitude of follow-on works, which may serve to bootstrap even more powerful language models. It is not meant to be state of the art, but rather a good model at following instructions.