
Friday, 28 March 2025

Fin-R1's Financial Reasoning: Excels in Financial Table & Conversation AI


Introduction

Financial AI systems are transforming how we perceive and interact with financial data. These machine learning- and natural language-based systems are designed to support everything from predicting market trends to automating financial reporting. The principal challenge in building such systems lies in ensuring they can reason soundly over data while articulating financial insights in plain terms.

Fin-R1 is a major improvement in this direction, providing us with a domain-specific large language model that's designed for financial reasoning. With a new architecture and a rigorous training regimen, it aims to address some of the important problems in the financial sector. The emphasis in the development of Fin-R1 is to enhance AI's capacity to understand and process complex financial information, creating potential for more stable and effective applications in finance.

Who developed Fin-R1?

Fin-R1 was developed by SUFE-AIFLM Lab, the AI powerhouse of Shanghai University of Finance and Economics. They've built an agile yet strong model, which is meant to turbocharge financial decision-making with advanced AI.

What is Fin-R1?

Fin-R1 is a new large language model designed specifically for financial reasoning. The authors introduce its architecture, a specially constructed high-quality financial reasoning dataset, and a two-stage training procedure combining supervised fine-tuning and reinforcement learning.

Unique Key Features of Fin-R1

Fin-R1 has several qualities that set it apart:

  • Built for Financial Reasoning: It is made specifically to think through complicated problems about money and finance.
  • Small but Strong: With only 7 billion parameters, it is cheaper to run because it needs far less compute, yet it still performs remarkably well.
  • Better at Tricky Money Questions: Its two-step training, especially the second step using reinforcement learning (RL) with GRPO, helps it handle very detailed and complex financial reasoning.
  • Performs Well in Tests: Fin-R1 does great on benchmarks that focus on understanding financial tables (FinQA) and answering financial questions in conversation (ConvFinQA); it is among the best in these areas.
  • Addresses Financial Pain Points: It is designed to address key challenges in the financial industry, including fragmented financial data, uncontrollable reasoning logic, and weak business generalization.

Unique Use Cases of Fin-R1

Fin-R1 has a number of distinct applications in the financial industry:

  • Deeper Financial Analysis: Its robust reasoning ability can be utilized for detailed analysis of financial information, such as interpreting financial statements and deriving important conclusions.
  • Automated Financial Computations: The model is capable of executing intricate financial computations, possibly simplifying processes and minimizing errors.
  • Enhanced Financial Compliance: Its capacity to comprehend and reason about financial rules can help ensure compliance and identify prospective risks.
  • Smart Risk Management: Through analysis of financial information and recognition of patterns, Fin-R1 can help with streamlined and precise risk assessment and management.
  • ESG Analysis: The model can be utilized to assess firms based on environmental, social, and governance considerations in order to guide sustainable investment choices.
  • Robo-advisory: It can use its reasoning and analytic abilities towards devising smarter, personalized robo-advisory solutions.
  • Code Generation for Financial Analysis: It has some ability to understand code and can potentially generate financial code to carry out specific operational tasks.
  • English Financial Calculation and Communication: Because it was trained on English financial data, it can support cross-language financial calculation and communication.

Architecture/ Workflow of Fin-R1

Fin-R1's architecture and functionality are established around a two-stage process (as shown in the figure below): Data Generation and Model Training. The Data Generation stage is devoted to building a high-quality financial reasoning dataset referred to as Fin-R1-Data. Data from open-source and proprietary financial datasets is fed to DeepSeek-R1, which distills preliminary reasoning steps. A strict two-stage data filtering process then follows to guarantee the accuracy and logical consistency of the resulting dataset. The first filter, Answer Check, verifies the correctness of the produced answers using rule-based techniques and Qwen2.5-72B-Instruct as an LLM-as-judge. The second filter, Reasoning Selection, assesses the quality of the reasoning paths with Qwen2.5-72B-Instruct according to specified criteria. Fin-R1-Data is made up of varied categories, with large segments devoted to financial non-reasoning business knowledge (50.4%) and financial reasoning business knowledge (27.5%), in addition to financial expertise (21.9%) and a minimal amount of financial code (0.2%).

The pipeline for constructing Fin-R1
source - https://arxiv.org/pdf/2503.16252
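
To make the Answer Check filter concrete, here is a minimal, hypothetical sketch of how a rule-based check could be combined with an LLM-as-judge. The prompt wording and the `query_judge` helper (standing in for a call to Qwen2.5-72B-Instruct) are illustrative assumptions, not the paper's actual implementation.

    # Hypothetical sketch of the Answer Check filter described above.

    def rule_based_match(predicted: str, reference: str) -> bool:
        """Cheap first pass: exact match after normalising case/whitespace."""
        return predicted.strip().lower() == reference.strip().lower()

    def llm_judge_match(predicted: str, reference: str, query_judge) -> bool:
        """Fallback: ask an LLM judge whether the answers are equivalent."""
        prompt = (
            "You are a strict grader. Reply only 'yes' or 'no'.\n"
            f"Reference answer: {reference}\n"
            f"Candidate answer: {predicted}\n"
            "Are these answers semantically equivalent?"
        )
        return query_judge(prompt).strip().lower().startswith("yes")

    def answer_check(sample: dict, query_judge) -> bool:
        """Keep a distilled sample only if its answer passes either check."""
        pred, ref = sample["generated_answer"], sample["reference_answer"]
        return rule_based_match(pred, ref) or llm_judge_match(pred, ref, query_judge)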

The next Model Training phase fine-tunes the model in a two-step process: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). The process starts with SFT, in which a base model, Qwen2.5-7B-Instruct, is trained on the high-quality Fin-R1-Data to improve its capacity to conduct financial reasoning and produce structured outputs such as 'think' and 'answer' tags. Based on this, the model is subjected to RL with the Group Relative Policy Optimization (GRPO) algorithm. This RL phase uses a double reward function to further optimize the performance of the model. The Format Reward induces the model to strictly follow the given output format with the 'think' and 'answer' tags. At the same time, the Accuracy Reward, which is tested using Qwen2.5-Max, judges the semantic correctness of the final answer in the 'answer' tags. This two-step training paradigm, utilizing a well-designed dataset and focused reinforcement learning, allows Fin-R1 to develop robust financial reasoning skills.
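
A minimal sketch of what that dual reward might look like in code. The tag format follows the 'think'/'answer' structure described above, while the regex details and the `judge_semantic_match` helper (standing in for the Qwen2.5-Max check) are assumptions for illustration only.

    import re

    THINK_ANSWER = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

    def format_reward(completion: str) -> float:
        """1.0 only if the output strictly follows the think/answer format."""
        return 1.0 if THINK_ANSWER.match(completion.strip()) else 0.0

    def accuracy_reward(completion: str, reference: str, judge_semantic_match) -> float:
        """1.0 if the <answer> content is judged semantically correct."""
        m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if m is None:
            return 0.0
        return 1.0 if judge_semantic_match(m.group(1).strip(), reference) else 0.0

    def total_reward(completion: str, reference: str, judge) -> float:
        return format_reward(completion) + accuracy_reward(completion, reference, judge)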

Performance Evaluation of Fin-R1

The Fin-R1 model has been comprehensively tested against a number of important financial benchmarks, outlined in the table below. Of particular note, Fin-R1 showed state-of-the-art performance on certain financial reasoning tasks. On FinQA, a benchmark for numerical reasoning over financial data, Fin-R1 scored 76.0. This score ranks it first, beating other models tested, such as DeepSeek-R1 (71.0), Qwen-2.5-32B-Instruct (72.0), and even the much larger DeepSeek-R1-Distill-Llama-70B (68.0). On the ConvFinQA benchmark, which probes chain-of-thought numerical reasoning in conversational financial question answering, Fin-R1 also achieved a top score of 85.0, once again beating DeepSeek-R1 (82.0) and other rival models.

Evaluation results in different financial benchmarks.
source - https://arxiv.org/pdf/2503.16252

Over a wider set of financial benchmarks, including Ant_Finance, TFNS, and Finance-Instruct-500K, Fin-R1 recorded an average of 75.2. This average ranked Fin-R1 second overall among the models tested, remarkable given its compact 7B parameter size. Notably, Fin-R1 beat every other model in its size category and even outperformed the larger 70B DeepSeek-R1-Distill-Llama-70B (69.2) by a significant margin of 6 points. The fairly narrow gap of only 3.0 points between Fin-R1 and the much bigger DeepSeek-R1 (78.2) further highlights Fin-R1's effectiveness and efficiency on financial tasks. These findings matter to the financial industry: they suggest Fin-R1 is a strong yet efficient solution for difficult financial reasoning tasks, and perhaps a cost-saving alternative to significantly larger models.

DeepSeek-R1 vs Qwen-2.5-32B-Instruct vs Fin-R1

DeepSeek-R1, Qwen-2.5-32B-Instruct, and Fin-R1 represent different design philosophies for improving the reasoning capabilities of large language models. DeepSeek-R1 utilizes reinforcement learning to improve chain-of-thought reasoning with self-verification, whereas Qwen-2.5-32B-Instruct, a strong 32-billion-parameter transformer bolstered with innovations such as RoPE and SwiGLU, performs well on long contexts, multilingual tasks, and structured outputs. Fin-R1, by contrast, is fine-tuned for financial reasoning and uses a two-stage training method (supervised fine-tuning on a custom financial reasoning dataset, followed by reinforcement learning with a dual reward scheme) in a highly efficient 7B architecture that achieves state-of-the-art performance on industry benchmarks.

In situations where domain-specific financial understanding is the priority, such as automated financial reasoning, risk management, and regulatory compliance, Fin-R1 is the best choice because of its task-specific training and efficient deployment. On the other hand, setups that require broader, multi-faceted language comprehension or massive long-context processing may prefer Qwen-2.5-32B-Instruct, with DeepSeek-R1 still a top contender for research and use cases that depend on clear chain-of-thought reasoning.

How to use and access Fin-R1 model

Users can get Fin-R1 as a free model on the Hugging Face Model Hub and GitHub. These sites contain complete guides and simple steps to install and use it. You can clone the repository or download the model weights directly, then integrate Fin-R1 into your projects with the Hugging Face Transformers library, following the provided examples for using and fine-tuning it. You can find all relevant links at the end of this article.
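
As a rough sketch of what that integration might look like with the Transformers library (the model id comes from the Hugging Face link below; the prompt and generation settings are illustrative):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "SUFE-AIFLM-Lab/Fin-R1"  # see the Hugging Face link below
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    messages = [{"role": "user", "content":
                 "A bond pays a 5% annual coupon on a $1,000 face value. "
                 "What is the annual coupon payment?"}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(inputs, max_new_tokens=512)
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))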

Limitations and Future Directions

Fin-R1 is limited by the fact that it was trained primarily on FinQA and ConvFinQA, which makes it harder for the model to handle the many other financial scenarios that exist. It operates on text only, so it cannot interpret inputs such as charts. Furthermore, evaluation so far has largely covered single-answer questions. The authors plan to train it on more data, extend it to images, and apply it more widely in real-world finance to help manage risk and adhere to regulations.

Conclusion

Fin-R1's strong performance in financial reasoning represents a great leap forward for AI to manage sophisticated financial data. Its accuracy and efficiency show the potential of AI to revolutionize financial analysis, making it more reliable and accessible. This breakthrough opens the door to more intelligent, more informed financial decision-making in multiple applications.


Source
Research document: https://arxiv.org/pdf/2503.16252
Hugging Face: https://huggingface.co/SUFE-AIFLM-Lab/Fin-R1/blob/main/README_en.md 
GitHub Repo: https://github.com/SUFE-AIFLM-Lab/Fin-R1/blob/main/README_en.md


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Tuesday, 18 March 2025

Gemma 3: Open Multimodal AI with Increased Context Window


Introduction

Everyone working on Artificial Intelligence (AI) wants to make it really good at understanding things, thinking, and talking to people. Because of this shared goal, AI keeps getting better, continually pushing what computers can accomplish. Yet this thrilling evolution is hindered by challenges: model size constraints limit mass deployment, more languages must be supported to cater to a wide range of people, and models are expected to handle and interpret multiple types of data, such as text and images, with ease.

In addition, making AI work on complicated tasks that involve extensive contextual information remains of utmost importance. Overcoming such challenges and pushing AI forward is Gemma 3, an important development built on cutting-edge optimization and improvement approaches in transformer architectures. Its goals are to enhance efficiency, increase contextual awareness, and optimize language generation and processing.

What is Gemma 3?

Gemma 3 is Google's latest set of light and cutting-edge open models. Interestingly, it brings multimodality to the Gemma family, which means some versions can now process and understand images and text.

Model Variants

The models come in four sizes: 1 billion (1B), 4 billion (4B), 12 billion (12B), and a solid 27 billion (27B) parameters. These provide a range of abilities designed for varying hardware limitations and performance requirements. Gemma 3 models are available in both base (pre-trained) and instruction-tuned versions, suitable for a broad range of use cases, from fine-tuning for highly specialized tasks to serving as general-purpose conversation agents that can follow instructions well.

Key Features That Define Gemma 3

Gemma 3 has a powerful array of features that make it stand out and enhance its functions:

  • Multimodality: The 4B, 12B, and 27B implementations include a vision encoder (SigLIP-based), which allows them to handle images as well as text. This provides scope for applications that can examine visual material along with text. The vision encoder supports square images of size 896x896 pixels.
  • Increased Context Window: The 4B, 12B, and 27B models have a hugely increased context window of 128,000 tokens, which eclipses that of their predecessor as well as many other open models. The 1B model has a context window of 32,000 tokens. The increased context enables the models to process and work with much greater amounts of information.
  • Wide Multilingual Coverage: Gemma 3 has pre-trained coverage for a staggering collection of more than 140 languages for the 4B, 12B, and 27B models. This adds to an enhanced data blend and the powerful Gemini 2.0 tokenizer. The 1B model mainly covers English. The Gemini 2.0 tokenizer, with 262,000 entries, has improved representation and balance across languages, with Chinese, Japanese, and Korean seeing big benefits.
  • Function Calling: Gemma 3 supports function calling and structured output, allowing developers to create AI-based workflows and smart agent experiences through interaction with external APIs and tools.
  • Optimized Quantized Models: Official quantized versions of Gemma 3 are readily available, reducing model size and computation requirements while maintaining high accuracy. These come in per-channel int4, per-block int4, and switched fp8 formats.

Use Cases of Gemma 3

Gemma 3 power also paves the way for a host of exciting future use cases:

  • Single-Accelerator Development: Gemma 3's architecture allows interactive experiences to run effortlessly on a single GPU or TPU, putting heavy-hitting AI in the hands of smaller development groups and independent developers.
  • Globally Accessible Applications Development: The wide-ranging support for over 140 languages can help develop truly global applications — so you can communicate with users in their own languages with ease.
  • Revolutionizing Visual and Textual Reasoning: With the ability to interpret images, text, and short videos, Gemma 3 can enable interactive and intelligent applications, including image-based Q&A and advanced content analysis.
  • Tackling Harder Problems with Extended Context: The extended context window is crucial for use cases such as summarization of long documents, code analysis of large codebases, or having more contextualized and coherent long conversations.
  • Workflows Automated with Function Calling: Gemma 3's support for function calling and structured output enables easy communication with external APIs and tools, perfect for automating tasks and building smart agent experiences.
  • Bringing Edge AI to Low-Compute Devices: Thanks to the quantized models and the emphasis on efficiency, Gemma 3 can be deployed on low-compute hardware, bringing advanced AI capabilities to everyday devices like phones, laptops, and workstations.
  • Creating Custom AI Solutions: Since Gemma 3 is an open model, developers are free to customize and optimize it to suit their needs and specific industry, enabling creativity and the evolution of extremely tailored AI solutions.

How Gemma 3 Achieves Its Capabilities

Gemma 3 starts with a decoder-only transformer framework and adds a major innovation: a 5:1 interleaving of local and global self-attention layers. This design element successfully reduces the memory requirements of the KV-cache at inference time, which is highly useful for managing longer context lengths. Local attention layers focus on 1024-token ranges, while global attention layers cover the whole context, enabling fast long-sequence processing.
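
As a toy illustration of that interleaving, here is a hedged sketch; only the 5:1 ratio and the 1024-token local span come from the report, while the layer count and helper function are hypothetical.

    # Toy sketch of a 5:1 local/global attention layout.
    LOCAL_SPAN = 1024  # tokens each local layer attends over

    def attention_plan(num_layers: int, ratio: int = 5):
        """Yield (layer, kind): one global layer after every `ratio` local ones."""
        for i in range(num_layers):
            if (i + 1) % (ratio + 1) == 0:
                yield i, "global"  # attends over the entire context
            else:
                yield i, "local"   # sliding window of LOCAL_SPAN tokens

    for idx, kind in attention_plan(12):
        print(idx, kind)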

To improve inference scalability, Gemma 3 utilizes Grouped-Query Attention (GQA) and QK-norm. For multimodal support in the larger models, it uses a 400-million-parameter SigLIP encoder that converts images into 256 vision embeddings, which are kept consistent and frozen during training. Non-standard images are processed at inference using the Pan & Scan algorithm, which crops and resizes them.

The language model maps these image embeddings into soft tokens and employs different attention mechanisms by modality: text uses one-way causal attention, while images receive full bidirectional attention so all parts of an image can be analyzed at once.

Lastly, Gemma 3 is pre-trained with knowledge distillation over an enlarged dataset containing additional multilingual and image-text examples, taking advantage of the increased vocabulary of the Gemini 2.0 tokenizer. An innovative post-training recipe, consisting of enhanced knowledge distillation and reinforcement learning fine-tuning, further strengthens its capabilities in domains such as math, reasoning, chat, instruction following, and multilingual comprehension.

Performance Evaluation

One of the most important ways the abilities of Gemma 3 are measured is its showing in human preference tests, for example, as reported on the LMSys Chatbot Arena and illustrated in the table below. In this arena, language models compete in blind side-by-side evaluations judged by human evaluators, producing Elo scores that act as a direct measure of user preference. Gemma 3 27B IT has shown a very competitive ranking against a variety of other well-known models, both open and closed-source. Most interestingly, it scores among the leading competitors, reflecting a strong preference by human evaluators in direct comparison with other important language models in the field. This reflects Gemma 3's capacity to produce answers that are highly regarded by human users in conversational applications.

Evaluation of Gemma 3 27B IT model in the Chatbot Arena
source - https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf

Apart from explicit human preference, Gemma 3's abilities are also stringently tested on a range of standard academic benchmarks, as illustrated in the table below. These cover a wide-ranging set of competencies, from language comprehension and code writing to mathematical reasoning and question answering. When comparing the performance of Gemma 3 instruction-tuned (IT) models to earlier versions of Gemma and Google's Gemini models, it is clear that the newest generation performs well on these varied tasks. While direct numerical comparisons are best left to the detailed tables, the general tendency indicates that the Gemma 3 models exhibit significant improvements and competitive performance across a variety of proven tests designed to probe different dimensions of language model intelligence. This indicates concrete improvements in Gemma 3's fundamental capabilities.

Performance of instruction fine-tuned (IT) models compared to earlier versions
source - https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf

In addition, the testing of Gemma 3 is also done on other vital areas like handling long context, where metrics such as RULER and MRCR are utilized to measure performance with longer sequence lengths. The models are also tested on multiple multilingual tasks to confirm their competence across many languages. Furthermore, stringent safety tests are performed to comprehend and avoid possible harms, such as measurements of policy break rates and understanding about sensitive areas. Lastly, the memorization ability of the models is tested to comprehend how much they replicate training data. These varied tests cumulatively present a detailed picture of the strengths and areas of improvement for Gemma 3.

How to Access and Use Gemma 3

Accessing and using Gemma 3 is designed for developer convenience and offers multiple integration methods, including:

  • Testing in your browser with Google AI Studio and fetching an API key
  • Easily downloading models from the Hugging Face Hub that supports pre-trained and instruction-tuned options with help from the Transformers library
  • Running locally with intuitive tools such as Ollama, downloading via Kaggle, or running on CPU using Gemma.cpp and llama.cpp
  • Taking advantage of MLX for Apple Silicon hardware
  • Prototyping fast via the NVIDIA API Catalog
  • Deployment at scale on Vertex AI, and
  • One-click deployment of a particular model on Hugging Face Endpoints.

Gemma 3 is made available as an open model to facilitate easy public use. Particular information on its licensing model is usually available on the platforms that host the models.
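
For a quick, text-only taste via the Transformers library, a minimal sketch might look like this (the model id comes from the Gemma 3 release collection linked below; the prompt and settings are illustrative):

    from transformers import pipeline

    # Smallest instruction-tuned variant; larger ones also accept images.
    pipe = pipeline("text-generation", model="google/gemma-3-1b-it")

    messages = [{"role": "user", "content":
                 "Explain in two sentences why long context windows matter."}]
    out = pipe(messages, max_new_tokens=128)
    print(out[0]["generated_text"][-1]["content"])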

Areas for Future Exploration

One area for future work, while already a strong point of Gemma 3, is further optimization of performance and memory usage, which would be particularly helpful for the multimodal models and for supporting even more resource-constrained environments. Although Pan & Scan works around the vision encoder's fixed inference input resolution to a degree, handling varying image aspect ratios and resolutions could be further improved. Continued development is also likely in extending multilingual support and performance to an even greater selection of languages.

Conclusion

Gemma 3 provides effective performance for its scale and makes advanced capabilities widely accessible. Its addition of multimodality and a significant jump in context window address significant shortcomings. Its robust multilingual capability opens up new global possibilities, and the emphasis on efficiency and availability across diverse platforms, such as quantized models, will make it easier to adopt.


Source
Blog: https://blog.google/technology/developers/gemma-3/
Tech report: https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf
Developer: https://developers.googleblog.com/en/introducing-gemma3/
Gemma 3 Variants: https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Wednesday, 12 March 2025

OpenManus: Learn Customizable AI Agents, the Open-Source Framework


Introduction

Artificial intelligence is changing the world, and leading the charge are AI agents. But the furious rate of improvement in this sector is too commonly held back by lack of access to advanced frameworks. Much of the innovation in cutting-edge solutions sits behind invitation-only systems and proprietary licenses that limit broader innovation and cooperation between researchers and developers. OpenManus emerges as a solution to this problem, offering a completely open and community-driven AI agent platform aimed at removing these roadblocks and enabling mass creativity.

OpenManus avoids the restriction of limited membership by providing an entirely open-source framework that requires no invitation for participation. Its approach is based on the belief that innovation in AI is maximized through sharing and the collaborative advancement of ideas. By disrupting the elitism inherent in current AI technology, OpenManus aims to empower a larger population to participate in and benefit from cutting-edge advancements in AI agent technology.

OpenManus is designed by an internationally diverse group of researchers, developers, and technology enthusiasts working under the umbrella of the OpenManus organization. The community effort aggregates input from universities, freelance developers, and forward-thinking technology innovators, all brought together by a common passion for democratizing AI. As described by the slogan "empowerment through openness," this effort leads the way in cutting-edge reinforcement learning methods and simplifies integration so that leading-edge AI functionality is made both available and further developed continuously through community-driven innovation.

What Is OpenManus?

OpenManus is an open-source AI agent framework—a platform engineered to build, deploy, and experiment with intelligent, autonomous agents. It is a tool that integrates the latest in natural language processing with sophisticated reinforcement learning techniques, all wrapped in a simple architecture.

Key Features of OpenManus

OpenManus has a few significant features, focusing on openness and community-based development.

  • The framework has a simple yet customisable implementation, allowing for extensibility to suit specific needs.
  • It is designed to take advantage of large language models (LLMs) such as GPT-4o in order to execute tasks upon user input, through a process of taking in input, running tasks using tools and APIs, giving feedback, and keeping context.
  • The OpenManus-RL sub-project emphasizes a focus on investigating reinforcement learning (RL)-based tuning techniques for LLM agents, with future possibilities of incorporating RL fine-tuned models.

These characteristics combine to make up the skeleton of a framework that is not only accessible but also customisable within a heterogeneous developer community.

Unique Capabilities and Real-World Benefits 

OpenManus offers significant capabilities across research and commercial applications:

  • The OpenManus-RL repository underscores the framework’s commitment to exploring reinforcement learning, with the potential for enhancing responsiveness through learning.
  • Customisability allows tailoring for specific needs in various domains.
  • Open, community-driven nature fosters idea exchange, algorithm sharing, and development. OpenManus’s versatility and adaptability make it a promising foundation for diverse sectors and emerging challenges.

How Does OpenManus Work?

Understanding how OpenManus operates shows how its well-crafted architecture delivers efficiency and scalability. At a high level, the system is composed of several fundamental components:

Input/Query Layer: This portion receives and preprocesses input data—either a natural language query or task instruction.
NLP Processing Module: The module uses strong language models to convert human input into a form that can be read by the system.
Decision Making & Reinforcement Learning Engine: The centerpiece of OpenManus, this module decides the best response through feedback on the fly. Its reinforcement learning algorithms enable the agent to learn and optimize its decision-making in real time.
Action/Response Layer: Lastly, this layer aggregates the results into a logical output, returning accurate and contextually relevant responses.

This structure not only makes the system highly transparent but also allows components to be updated and optimized independently.
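
To make that flow concrete, here is a purely hypothetical sketch of how the four layers might compose into an agent loop; none of the function names below come from the OpenManus codebase.

    # Hypothetical agent loop mirroring the four layers described above.

    def preprocess(user_input: str) -> str:           # Input/Query Layer
        return user_input.strip()

    def parse_intent(text: str, llm) -> dict:         # NLP Processing Module
        return llm.extract_intent(text)

    def decide(intent: dict, policy) -> str:          # Decision/RL Engine
        return policy.best_action(intent)

    def respond(action: str, tools) -> str:           # Action/Response Layer
        return tools.execute(action)

    def agent_step(user_input, llm, policy, tools):
        intent = parse_intent(preprocess(user_input), llm)
        action = decide(intent, policy)
        result = respond(action, tools)
        policy.update(intent, action, result)         # RL feedback loop
        return result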

OpenManus Vs Manus AI

If we consider the realm of AI agents, there's a fascinating divide between the two: OpenManus and Manus AI.

Manus AI is similar to a smooth, business-oriented alternative. You have to be invited to use it, so it has an air of exclusivity. It guarantees to be simple and pleasant to use, with tools already prepared and that cooperate nicely. The best thing about Manus is that it can be used immediately. It's web-based, so you don't have to be a technical guru to install it. They also claim that they'll provide you with official support should you require it. This is perfect for individuals who desire something that simply works and is stable, out of the box. Currently, it's free to test (beta), but they intend to charge for it in the future, like a subscription. This is to say it's likely for people who are fine with paying for a service that is simple to use and supported by a company. It's a solid option if you just need something straightforward and you know you'll be able to get assistance from whoever is behind it.

OpenManus takes a considerably different path. It promotes sharing and openness to everybody. Rooted in the concepts of MetaGPT, it allows anyone to use its AI agent technology: you don't require an invitation, and no money is spent. OpenManus is all about being open and modifiable. It provides you with a simple setup that can be customized extensively. Since it's open and community-driven, it feels authentic and suits those who enjoy working together and generating new ideas. However, this freedom comes at the cost of needing to be slightly technical: you should have some knowledge of Python, conda, and how to set up API keys. So it's really for individuals who enjoy being in charge, wish to change things profoundly, and enjoy belonging to a community that continues to grow and evolve. In short, choosing between OpenManus and Manus comes down to what matters to you. Do you prefer something that is easy and backed by a company? Or something open, that you can modify yourself, and that is developed by a community?

How to Access and Use OpenManus

All guides, updates, and resources are posted on the OpenManus website and blog. The code itself, including the reinforcement learning project and the main framework, is hosted on GitHub. You can install OpenManus either locally on your own machine or in the cloud, with instructions clearly outlined on GitHub. Since it is open-source, OpenManus is free to use and modify, even commercially, making it readily available for business and research purposes alike.

Future Development Plan

This is what's coming next for improving OpenManus:

  • Improved Task Planning: The team would like to make the AI agent more intelligent at planning and performing extremely complex tasks; think of instructing it to create a step-by-step plan for large projects, not merely small ones.
  • Live Demos: The team wants live demos that show directly what OpenManus is capable of, so people can see how useful it is.
  • Session Replays: The team wants to add a feature for replaying older agent sessions, so you can see what the AI did and how it went about it, similar to rewatching a game to learn from it.
  • Better Learning Models: The team is exploring reinforcement learning to make OpenManus even more effective; it's similar to training it with rewards to do things better. This applies to the OpenManus-RL project.
  • Improved Methods of Success Measurement: The team needs quality benchmarks to actually observe how effectively OpenManus performs. These tests will show precisely how much it has improved and where it still needs work.

Conclusion

In an age of technology that is commonly marked by exclusivity and limited access to innovation, OpenManus is a shining example of transparency, openness, and true accessibility. Through the methodical deconstruction of the old obstacles inherent in invite-only models and proprietary limitations, this revolutionary open-source system goes beyond being simply another AI utility.


Source
OpenManus Website: https://openmanus.org/
OpenManus Blog: https://openmanus.org/blog/introduction-to-openmanus
OpenManus vs Manus AI: https://openmanus.org/blog/openmanus-vs-manus-comparison
OpenManus GitHub Repo: https://github.com/openmanus-ai/openmanus
OpenManus-RL GitHub Repo: https://github.com/OpenManus/OpenManus-RL


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Wednesday, 5 March 2025

How Claude 3.7 Sonnet Merges Logic, Neural Power, and AI Safety


Introduction

Stimulating new concepts in AI, such as hybrid reasoning, extended thinking, self-awareness, and improved safety, are truly testing the limits of what these machines can accomplish. Hybrid reasoning merges the ability of neural networks with the standard symbolic approaches in order to make it simpler to solve harder problems. And then there's extended thinking, which gives rise to greater contemplation and improved accuracy. Self-reflection is really about allowing AI to reflect back on its own processes so it can create considered and well-thought-out responses. And of course, increased safety protocols are essential to ensure that AI is ethical, reducing bias and preventing the creation of harmful content.

But even today's AI models have some problems to overcome. They tend to get confused by complex contexts, their logic is sometimes a black box, and there's always a risk of producing unethical results. Today's progress tries to address these problems directly by enhancing the ability to reason, enhancing transparency, and incorporating ethical barriers. By combining these technologies, we're trying to build AI that's not only trustworthy, but also transparent and human-aligned.

Meet Anthropic's Claude 3.7 Sonnet! It's a reflection of all these advancements and truly the next generation of AI development. By bringing these innovations together, it's capable of going beyond the limitations of previous models, delivering considerate and ethical AI.

What is Claude 3.7 Sonnet?

Claude 3.7 Sonnet is a sophisticated AI system with hybrid thinking (combining symbolic approaches and neural networks) and extended reasoning. Its architecture plans its reasoning before producing output, helping guarantee appropriate, contextualized, and differentiated responses. Claude 3.7 Sonnet is an elegant tool for many disparate types of complex problems.

Key Features of Claude 3.7 Sonnet

  • Clear Thought Process: This feature gives you a peek into how the AI thinks, so you can follow along with its decision-making.
  • Increased Output Capacity: Now supports up to 128K tokens (in beta), perfect for tackling demanding projects like coding and content creation.
  • Improved Safety Features: Comes with advanced protection against harmful content and prompt injection, boasting an impressive 88% success rate.
  • Blended Reasoning Model: Combines symbolic reasoning with neural networks to tackle complex problems more effectively.
  • Adaptive Capabilities: Shows better ability to scale actions dynamically, adjusting to changing tasks and inputs.

Capabilities and Use Cases

Claude 3.7 Sonnet displays some remarkable tricks:

  • Great at Coding: It handles complicated code, maps out updates, and can spit out code ready to use. That means stuff like automated cleanup of code and clever code review.
  • Intelligent Problem-Solver: Claude is able to manage work that requires perpetual fine-tuning, so it is beneficial for tasks such as identifying cybersecurity dangers or conducting scientific experiments.
  • Solving Challenging Problems: It processes difficult problems, and this may be useful for individualized education or examining legal briefs.
  • Flexible and Bettering: It learns from its own experiences and continues to refine its approaches, which is ideal for maximizing logistics or delivering custom-tailored healthcare.

How Claude 3.7 Sonnet Works

Claude 3.7 Sonnet unites two strong methods: fast neural networks and the power of symbolic logic. This union is further amplified by a special 'extended thinking mode' that allows Claude to test various lines of reasoning, making it more precise on math, science, and instruction-following tasks. In this process, Claude produces 'thinking' content blocks that expose its inner thought process, working through these insights before generating a final answer. This openness gives users better insight into how Claude reaches a decision.

Structurally, Claude 3.7 Sonnet is agentic: it can perform tasks iteratively and respond to changes in its environment in order to meet predetermined objectives. A prime example is Claude Code, where it handles coding operations such as file editing and testing on its own. Its ability to scale compute at test time also lets the model pursue multiple lines of thought simultaneously, resulting in better solutions and greater robustness in practical applications. Users can also manage thinking resources by allocating a 'thinking budget', balancing speed, expense, and solution quality.

This extended output capability can be enabled with the anthropic-beta header output-128k-2025-02-19, paired with a larger thinking budget to accommodate deeper reasoning while ensuring sufficient tokens remain for the final response. This design allows Claude 3.7 Sonnet to work on significant engineering projects directly in a terminal, showcasing its coding prowess.
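
As a minimal sketch of that API surface (based on Anthropic's documented extended-thinking interface; the prompt, token budgets, and exact model id are illustrative choices):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=64000,
        # Separate 'thinking budget' for the visible reasoning phase.
        thinking={"type": "enabled", "budget_tokens": 16000},
        # Beta header from above unlocks up to 128K output tokens.
        extra_headers={"anthropic-beta": "output-128k-2025-02-19"},
        messages=[{"role": "user", "content": "Plan a refactor of a legacy module."}],
    )

    for block in response.content:
        if block.type == "thinking":
            print("[thinking]", block.thinking[:200], "...")
        elif block.type == "text":
            print("[answer]", block.text)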

Performance Evaluation

Claude 3.7 Sonnet has very strong performance on major benchmark tests and beats other models in several critical areas. It performed very well on SWE-bench Verified, which tests whether it performs well at solving actual software issues, and performed very well on TAU-bench, which examines how artificial intelligence agents perform at difficult tasks that relate to users and tools. These findings indicate that Claude 3.7 Sonnet is the leader in coding and agent capacities, a major leap towards solving real and complex problems.

Claude 3.7 Sonnet performance on various benchmarks
source - https://www.anthropic.com/news/claude-3-7-sonnet 

Recent real-world tests support Claude 3.7 Sonnet's coding abilities, with companies such as Cognition, Vercel, and Canva demonstrating how it excels. Cognition discovered it quite good at organizing code changes and staying up-to-date, while Vercel highlighted its precision in complicated workflows. Canva also highlighted that Claude always outputs code ready for production with excellent design and fewer errors. These consistent outcomes of multiple evaluations confirm the value of the model to developers who require good and credible AI assistance.

Claude 3.7 Sonnet excels across various tasks.
source - https://www.anthropic.com/news/claude-3-7-sonnet 

Other than coding assessments, Claude 3.7 Sonnet is great at adhering to instructions, overall reasoning, and navigating various kinds of tasks. Its deep thinking capability actually enhances its performance in math and science. In fact, it outperformed all the other models in Pokémon gaming test evaluations, flaunting superior agent skills and enhanced goal clarity. Safety tests confirm that Claude 3.7 Sonnet satisfies the ASL-2 safety standard, and continuous efforts are being made to enhance its safety features and address any weaknesses.

How to Access and Use Claude 3.7 Sonnet

You can readily access Claude 3.7 Sonnet across various platforms. If you are an AI enthusiast, you can see its capabilities on the easy-to-use Claude.ai. For researchers and coders who want to go deeper, the Anthropic API is an excellent option for bespoke integration. Companies can seamlessly integrate the model into their workflows through tools like Amazon Bedrock and Google Cloud's Vertex AI, enhancing them with high-powered AI capabilities.

Limitations and Future Work

Claude 3.7 Sonnet, though sophisticated, is not perfect. The observable thought process sometimes has errors and possible weaknesses. Extended thinking is very computationally intensive. Ongoing work seeks to make safety more refined, efficiency better, and reasoning fidelity higher.

Conclusion

Claude 3.7 Sonnet is a major advancement in AI that puts together intelligent reasoning, more in-depth thinking, and robust safety features. Claude 3.7 Sonnet is notable for its transparency and adaptability, providing assistance in the realms of coding, learning, and customized health care. With further advancement of AI, Claude 3.7 Sonnet indicates how it can amplify human capabilities without betraying human ethics.

Source
Website: https://www.anthropic.com/news/claude-3-7-sonnet 
visible-extended-thinking: https://www.anthropic.com/research/visible-extended-thinking
extended-thinking: https://docs.anthropic.com/en/docs/build-with-claude/extended-thinking



Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Friday, 7 February 2025

DeepSeek AI's Janus-Pro: A Leap in Multimodal Learning


Introduction

Advancements in multimodal understanding and model scaling are pushing Artificial Intelligence toward ever more advanced and versatile systems. Recent years have witnessed landmark advances in AI's capacity to process information from various data modalities, whether textual or image-oriented, stepping out of unimodal confines. This has been largely fueled by scaling AI models into billions of parameters and by improved training methods that yield better learning efficiency.

Challenges remain. Seamlessly combining multimodal understanding with strong content generation, while guaranteeing stability and high-quality output, is difficult to achieve. Optimizing training strategies to overcome issues with training data quality and bottlenecks in unified multimodal architectures is under intense research and development. Janus-Pro represents an outstanding achievement that builds on this trend and confronts these challenges in multimodal AI.

What is Janus-Pro?

Janus-Pro is an integrated multimodal model capable of understanding as well as generating content across modalities. The model adopts model scaling and refined training and introduces essential architectural and strategic optimisations. It includes variants such as the 7B-parameter Janus-Pro-7B and the 1B-parameter Janus-Pro-1B, catering to varied computational needs and deployment environments. Each variant is engineered to provide an optimal balance between high performance and resource efficiency.

Key Features of Janus-Pro

Janus-Pro improves functionality and efficiency through several distinctive features:

  • Unified Multimodal Architecture: It applies a single, unified architecture to multimodal understanding and generation. It simplifies the design and increases efficiency since separate modality-specific pipelines are no longer needed.
  • Rich representations from multimodal fusion: Janus-Pro combines information from a variety of different modalities in order to form comprehensive multimodal representations. This deep fusion facilitates an understanding of context and makes it possible to create content that fluidly incorporates both textual and visual elements.
  • Scalable and Data-Efficient Design: Designed to scale, Janus-Pro uses the benefits of more significant datasets and computational resources while keeping the increased demands proportionally low, without losing learning efficiency even when less data is present.
  • Resource-Efficient High Performance: Janus-Pro achieves strong performance with fairly small model sizes, like the 7B parameter Janus-Pro-7B and the 1B parameter Janus-Pro-1B, and does so at high performance without using much computation.

Capabilities and Use Cases of Janus-Pro

Janus-Pro's distinct capabilities enable diverse applications:

  • Multimodal Coherent Narrative Content Creation: Integrating textual and visual content to produce more engaging and richer outputs, such as comprehensive reports with illustrations or illustrated storybooks.
  • Improved Contextual Data Understanding: Improving comprehension in tasks such as image captioning, extending beyond the mere description of objects to provide richer contextual information by integrating visual and textual cues.
  • Interactive AI Systems with Multimodal Response: Facilitating the development of AI assistants that can process and respond using both text and images, making user interaction more natural and engaging.
  • Cross-Modal Information Retrieval: Retrieving information across modalities, enabling reverse-lookup tasks such as finding pictures from their written descriptions or producing written summaries for an image.

How does Janus-Pro work?

Janus-Pro's multimodal processing is built around decoupled visual encoders and a single shared Transformer backbone. For multimodal understanding, it uses the SigLIP encoder to obtain image representations aligned with text, which are then projected into the Large Language Model (LLM) embedding space. For visual generation, a VQ tokenizer maps images into discrete codes, which a generation adaptor projects from codebook embeddings into the LLM's embedding space. The shared Transformer backbone then processes these combined feature sequences to produce coherent, contextually aware outputs across modalities.

Architecture of Janus-Pro
source - https://github.com/deepseek-ai/Janus/blob/main/janus_pro_tech_report.pdf

The architecture further includes a mixture-of-modality experts router that dynamically assigns experts based on the input modality; the cross-modal attention mechanisms allow the model to learn inter-modality relationships. All these components work together in perfect harmony and enable Janus-Pro to efficiently perform both understanding and generation tasks over different data types.
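
A purely illustrative sketch of the decoupled pathways described above; every module name is a hypothetical stand-in rather than DeepSeek's actual code.

    # Illustrative sketch of Janus-Pro's decoupled visual pathways.

    def encode_for_understanding(image, siglip_encoder, und_adaptor):
        """Understanding path: SigLIP features projected into the LLM space."""
        features = siglip_encoder(image)   # continuous image features
        return und_adaptor(features)       # -> LLM embedding space

    def encode_for_generation(image, vq_tokenizer, gen_adaptor):
        """Generation path: discrete VQ codes projected into the LLM space."""
        codes = vq_tokenizer(image)        # discrete codebook ids
        return gen_adaptor(codes)          # -> LLM embedding space

    def forward(text_embeddings, image, task, backbone, enc):
        """One shared Transformer backbone processes the fused sequence."""
        if task == "understand":
            img = encode_for_understanding(image, enc["siglip"], enc["und"])
        else:
            img = encode_for_generation(image, enc["vq"], enc["gen"])
        return backbone(text_embeddings + img)  # concatenated sequences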

Advanced Techniques used for Janus Pro model

Janus-Pro features a set of advanced techniques. Each of them contributes to making the model work better and with greater efficiency.

  • Decoupled Visual Encoding: Decoupling visual encoding pathways for understanding and generation enhances performance by removing the inherent conflict between representational needs for the two different tasks.
  • Coherent Output via the Autoregressive Framework: Adhering to an autoregressive framework makes it possible to predict the next state of information, essential for producing coherent, contextually relevant outputs.
  • Dedicated Visual Encoders for Optimization by Task: SigLIP is used as the encoder to understand and VQ tokenizer is used to generate while optimizing the feature extraction and visual representation specific to the task at hand.
  • Cross-Modal Attention Mechanisms for Information Fusion: Integrates attention mechanisms across modality to deepen understanding and leverages relationships, thereby facilitating useful information fusion from cross-modal cues.
  • Optimized Multi-Stage Training Strategy for Efficiency and Performance: A refined multi-stage training approach, including extended Stage I pixel dependence modeling, focused Stage II text-to-image training, and adjusted Stage III data ratios, enhances computational efficiency and overall performance.
  • Data Scaling for Enhanced Learning: Scaling training data for both understanding and generation, including adding 90M samples for Stage II and incorporating synthetic aesthetic data, improves model generalization and text-to-image stability.
  • Model Scaling for Better Convergence: Scaling Janus-Pro up to 7B parameters can benefit from a larger language model, resulting in faster convergence as well as performance on both understanding and generation tasks.
  • Modality Adaptors: Two adaptors project image features and codebook embeddings, respectively, into the language model's embedding space, enabling a unified architecture for both modalities.
  • Unified transformer architecture for coherent processing: This is achieved using a single unified transformer architecture where the shared backbone enables coherent processing of concatenated multimodal feature sequences to generate contextually relevant outputs across modalities.

Performance Evaluation with other models

Janus-Pro is tested on how well it can turn text into images. The test, called the GenEval benchmark, looks at things like producing one or two objects, counting items, matching colors correctly, and placing objects in the right spot. The table below shows that Janus-Pro-7B scored 80%. This is higher than other methods like Transfusion (63%), SD3-Medium (74%), and DALL-E 3 (67%). This means Janus-Pro follows instructions very well when creating images from text.

Evaluation of text-to-image generation ability on GenEval benchmark.
source - https://github.com/deepseek-ai/Janus/blob/main/janus_pro_tech_report.pdf

The table below shows Janus-Pro's results on another test called DPG-Bench. This test checks how well a model works with long and detailed prompts. Janus-Pro scored 84.19, which is better than all the other methods. The test looks at overall picture consistency, clear object representation, detail accuracy, and understanding of relationships between items. Janus-Pro did very well on all these parts.

Performances on DPG-Bench
source - https://github.com/deepseek-ai/Janus/blob/main/janus_pro_tech_report.pdf

Janus-Pro was also tested on several other benchmarks for multimodal understanding. Tests like GQA, POPE, MME, SEED, MMB, and MMMU show that it performs strongly compared to top models. The results confirm that Janus-Pro is good at both understanding and creating images in many different situations.

How to Access and Use This Model?

Janus-Pro-7B is available on Hugging Face, with easy access for developers and researchers. You can view its capabilities in the interactive demo space on Hugging Face. The project's GitHub repository provides code and resources for those who want to dig deeper and experiment more. The model is meant for research purposes only; commercial use details can be found in the GitHub repository's licensing information.
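
The project README describes loading the model with the repository's own `janus` package; a sketch along those lines is below, but treat it as an approximation and check the repo for the current entry points.

    import torch
    from transformers import AutoModelForCausalLM
    from janus.models import VLChatProcessor  # installed from the Janus repo

    model_path = "deepseek-ai/Janus-Pro-7B"   # see the Hugging Face link below
    processor = VLChatProcessor.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
    ).eval()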

Limitations

Janus-Pro, though a step ahead in multimodal AI, still has weaknesses and limitations according to the source. Its input resolution is fixed at 384 x 384 pixels, which affects fine-grained applications such as OCR. In image-to-text generation, this resolution combined with reconstruction losses means images can suffer from missing details. Earlier versions utilized low-quality real-world training data, so the text-to-image system produced visually poor, unstable outputs. The synthetic aesthetic data used in Janus-Pro addresses this instability and improves aesthetic quality, but its use introduces possible concerns about the model's diversity and real-world applicability. Adjustments to the data ratio during fine-tuning also point to a trade-off between optimizing visual generation and multimodal understanding.

Conclusion

Janus-Pro exemplifies the incredible progress in multimodal AI achieved through scaling and improved training. Its ability to effectively understand and generate content across modalities gives credence to these advancements. Scaling up model capacity enables learning of complex multimodal relationships, while refined training methodologies enhance learning efficiency and generalization. That synergistic mixture is essential for developing more sophisticated and intelligent multimodal AI that can better understand and interact with the world.


Source
Project details: https://github.com/deepseek-ai/Janus
Research paper: https://github.com/deepseek-ai/Janus/blob/main/janus_pro_tech_report.pdf
Hugging Face: https://huggingface.co/deepseek-ai/Janus-Pro-7B
Trial: https://huggingface.co/spaces/deepseek-ai/Janus-Pro-7B


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Saturday, 25 January 2025

DeepSeek-R1: Enhanced Reasoning via Reinforcement Learning


Introduction


The artificial intelligence field keeps pushing machines toward new capabilities, and its most sought-after advancement is AI systems that can reason well. Today's LLMs work wonderfully at recognizing patterns and making statistical predictions but may fail on problems that depend on logical deduction, commonsense understanding, and complex problem solving. This gap between pattern recognition and true reasoning limits the range of potential LLM applications.

DeepSeek-R1 is an innovative approach that attacks this challenge head-on. It uses RL to train LLMs to become more capable reasoners. This is one giant leap in the pursuit of AI systems that do not merely process information but understand and reason about it.

Model Variants

DeepSeek-R1 has several variants, each with different characteristics and uses. The base model, DeepSeek-R1-Zero, is trained with large-scale reinforcement learning directly on the base model, without preliminary supervised fine-tuning. It has 671B total parameters, 37B activated per token, and a 128K context length. DeepSeek-R1 builds upon R1-Zero, addressing its limitations via a multi-stage training pipeline that improves reasoning performance. There are also smaller, dense models distilled from DeepSeek-R1, which reach better performance than training models of that size directly with RL. Together, the variants cover everything from exploring pure RL on foundation models to the final refined DeepSeek-R1 and efficient distilled models.

Key Features of DeepSeek-R1

  • Explicit Focus on Reasoning Ability: A core strength of DeepSeek-R1 is its use of reinforcement learning specifically to train reasoning. While many LLMs rely primarily on supervised learning, RL trains the model to produce answers that are not just correct but also accompanied by coherent, well-reasoned explanations, building robust reasoning skills.

    Example of DeepSeek-R1's Thinking Ability
    source - https://chat.deepseek.com/

  • Emergent Chain-of-Thought Reasoning: While nearly any model can be prompted into exhibiting chain-of-thought behavior, DeepSeek-R1's training procedure causes this behavior to emerge on its own. The model has learned to produce explanations as part of its reasoning process rather than only when specific prompting methods are used, which yields more robust and coherent chain-of-thought behavior.
  • Emphasis on Transparency and Explainability: DeepSeek-R1 also emphasizes transparency and explainability by explicitly training the model to give explanations. The model can thus lay out its reasoning process transparently for the user, fostering trust and supporting better debugging and analysis.
  • Generalization Benefits from RL: Even though training focuses on reasoning, general language tasks have been observed to improve under large-scale RL training. This indicates that training for reasoning yields synergistic benefits for other language abilities.

Reinforcement Learning of DeepSeek-R1

Reinforcement learning (RL) is a machine learning technique in which an agent learns to make optimal decisions in an environment based on feedback received as rewards or penalties; unlike supervised learning, it does not rely on labelled examples. Over the past few decades, RL has grown significantly with the rise of deep learning and greater computing power. Reinforcement learning is crucial for DeepSeek-R1, particularly DeepSeek-R1-Zero, because it enables the model to learn reasoning without prior supervised fine-tuning. This direct use of RL helps the model learn to explain its thinking step by step, known as 'chain-of-thought reasoning', and shows how RL can make AI much better at complex reasoning.
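
To make this reward-driven setup concrete, here is a minimal Python sketch of the kind of rule-based reward signals the DeepSeek-R1 paper describes for R1-Zero: an accuracy reward for a correct final answer and a format reward for wrapping the reasoning in think tags. The function names, tag pattern, and equal weighting are illustrative assumptions, not DeepSeek's code.

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if reasoning is wrapped in <think>...</think> followed by an
    answer, else 0.0. Illustrative rule, not DeepSeek's implementation."""
    pattern = r"^<think>.*?</think>\s*\S"
    return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the text after the closing think tag contains the reference."""
    answer_part = completion.split("</think>")[-1]
    return 1.0 if reference.strip() in answer_part else 0.0

def total_reward(completion: str, reference: str) -> float:
    # Hypothetical equal weighting of the two rule-based signals.
    return format_reward(completion) + accuracy_reward(completion, reference)

sample = "<think>2 + 2 = 4, so the answer is 4.</think> 4"
print(total_reward(sample, "4"))  # -> 2.0
```

Because both signals are computed by simple rules rather than a learned reward model, the setup is cheap to scale and hard to reward-hack, which is part of what makes pure-RL training on a base model feasible.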

Capabilities and Use Cases of DeepSeek-R1


DeepSeek-R1's new approach to reasoning opens up unique applications, pushing the boundaries of AI. Its key capabilities include:
  • Pioneering Reasoning Research via Pure RL: DeepSeek-R1 provides a groundbreaking research platform by showing that effective reasoning can develop without initial supervised fine-tuning, offering new insights into how reasoning emerges in LLMs. The availability of both the base and improved models allows direct study of the different training methods.
  • Transforming Education: Excellent performance on educational tests suggests DeepSeek-R1's potential to change educational applications. This includes improving AI-driven search, enhancing data analysis tools for education, and creating better question-answering systems.
  • Enabling Custom Model Development: The open-source nature of DeepSeek-R1 and its models allows developers to fine-tune them for very specific reasoning tasks, enabling custom AI solutions for areas like scientific research and complex data analysis.
These are just some examples, and as DeepSeek-R1 improves, we can expect even more new uses.

Technological Advancements

DeepSeek-R1 introduces a training paradigm in which reinforcement learning (RL) is applied directly to the base model without initial supervised fine-tuning (SFT), enabling fully autonomous development of reasoning skills, as in DeepSeek-R1-Zero. That base variant uses the Group Relative Policy Optimization (GRPO) algorithm, a specialized RL method, to explore chain-of-thought (CoT) reasoning and complex problem-solving. The process nurtures self-verification, reflection, and the production of extended CoTs, showing that LLM reasoning can be enhanced without a preliminary SFT stage. With rewards reinforcing the validity and format of structured reasoning, an intrinsic self-evolution process emerges in which the model allocates more computation to harder problems, and behaviors such as reflection and diverse problem-solving strategies arise spontaneously.
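
The part of GRPO that makes a separate critic (value model) unnecessary is its group-relative advantage: several completions are sampled per prompt, and each completion's reward is normalized against its own group. Below is a minimal sketch of that computation; the full GRPO objective also includes a clipped policy ratio and a KL penalty, which are omitted here.

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each sampled completion's reward
    against the mean and std of its own group, so no learned critic is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]

# Example: rule-based rewards for a group of four completions of one prompt.
print(grpo_advantages([1.0, 0.0, 2.0, 1.0]))  # [0.0, -1.41..., 1.41..., 0.0]
```

Completions that beat their group average get a positive advantage and are reinforced; those that fall below it are suppressed, which is what drives behaviors like longer, more careful chains of thought to emerge.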

Building on R1-Zero, DeepSeek-R1 uses a multi-stage training pipeline. It begins with a 'cold-start' stage on a high-quality, curated dataset of long CoT examples, generated via few-shot prompting, direct prompting for detailed answers with reflection, and human annotation of R1-Zero outputs. The model is then improved through a reasoning-focused RL stage, followed by rejection sampling and SFT for general-purpose tasks. Finally, an RL stage aligns the model with human preferences, addressing R1-Zero's limitations such as poor readability and language mixing. DeepSeek also uses distillation, transferring reasoning patterns learned by larger models to smaller, more efficient ones. Remarkably, the distilled models outperform counterparts trained directly with RL and improve on current open-source models. This integrated combination of RL, cold-start data, and distillation is one of the most effective strategies for instilling strong reasoning ability in LLMs.
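
Distillation in this pipeline is essentially supervised fine-tuning of a small student on reasoning traces generated by the larger model. Here is a hedged sketch of that idea using the Hugging Face transformers library; the student checkpoint, the "traces.jsonl" file, and the hyperparameters are illustrative assumptions, not DeepSeek's actual recipe.

```python
# Distillation-as-SFT sketch: fine-tune a small student on chain-of-thought
# traces produced by a stronger teacher (e.g. DeepSeek-R1).
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "Qwen/Qwen2.5-7B" stands in for a student base model (illustrative choice).
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Each record: {"prompt": ..., "trace": "<think>...</think> final answer"},
# where the trace was generated by the teacher model.
with open("traces.jsonl") as f:
    for line in f:
        record = json.loads(line)
        inputs = tokenizer(record["prompt"] + record["trace"],
                           return_tensors="pt", truncation=True, max_length=4096)
        # Standard causal-LM loss over the concatenated prompt + trace.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Note there is no RL anywhere in this loop; the paper's finding is that imitating the teacher's reasoning traces alone beats applying RL directly to the small model.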

Performance Evaluation with Other Models

DeepSeek-R1's performance is evaluated rigorously across a range of reasoning benchmarks and tasks. The comparison reported in the paper between DeepSeek-R1, OpenAI-o1-1217, and OpenAI-o1-mini on mathematics, coding, and logical reasoning benchmarks (see the table below) shows that DeepSeek-R1 frequently matches or exceeds its counterparts, indicating the complexity of problems the model can actually handle.

Comparison between DeepSeek-R1 and other representative models.
source - https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf

More importantly, DeepSeek-R1 achieved an 87.6% length-controlled win rate on AlpacaEval 2.0 and a 92.3% win rate on ArenaHard for open-ended generation, showing how well it responds to non-exam-oriented questions. It also held a much larger lead over DeepSeek-V3 on long-context benchmarks, indicating substantially improved long-context understanding. Notably, the distilled versions, especially the 32B and 70B models, set new records for dense models on reasoning benchmarks. For example, DeepSeek-R1-Distill-Qwen-7B scored 55.5% on AIME 2024, beating QwQ-32B-Preview.

These distilled models (such as the DeepSeek-R1-Distill-Qwen variants from 1.5B to 32B) were further evaluated on reasoning benchmarks, showing notable improvements over other open-source and even some closed-source models. For instance, the 14B distilled model outperformed QwQ-32B-Preview across all metrics, and the 32B and 70B models significantly exceeded o1-mini on most benchmarks. These findings indicate that distilling the reasoning patterns of a large model gives better results than training smaller base models directly with reinforcement learning.

How to Access and Use DeepSeek-R1

The DeepSeek-R1 model can be accessed and used in multiple ways. Users can chat with it online at the DeepSeek website or call it through the API offered by the DeepSeek Platform, which is compatible with the OpenAI API. For those wanting to run the model locally, instructions are provided in the DeepSeek-V3 repository. Moreover, the lightweight distilled variants of DeepSeek-R1 can be served with tools such as vLLM and SGLang, like other popular models. The official GitHub repository links to the research paper, the downloadable models, and the evaluation results.
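
Because the Platform API is OpenAI-compatible, the standard openai Python client can simply be pointed at DeepSeek's endpoint. A minimal sketch follows; the base URL and the 'deepseek-reasoner' model name follow DeepSeek's public API documentation at the time of writing, so verify them before use.

```python
from openai import OpenAI

# OpenAI-compatible endpoint; requires an API key from the DeepSeek Platform.
client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY",
                base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-reasoner",  # DeepSeek-R1 on the platform, per the docs
    messages=[{"role": "user", "content": "If 3x + 5 = 20, what is x?"}],
)
print(response.choices[0].message.content)
```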

Limitations 

DeepSeek-R1 still needs improvement: its general capability currently falls short of DeepSeek-V3 on tasks such as function calling, multi-turn conversation, complex role-playing, and producing consistent JSON output. It is optimized for Chinese and English, so it may mix languages when handling queries in other languages, and it mostly falls back to English for reasoning and responses. Moreover, DeepSeek-R1 is quite sensitive to prompting: few-shot prompting can degrade its performance, so zero-shot prompting is the recommended approach, as illustrated below. Finally, DeepSeek-R1 has not yet improved over DeepSeek-V3 in software engineering, because of the cost of evaluating software engineering tasks within the reinforcement learning (RL) process.
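
In practice, that means stating the task directly and pinning down the output format in the prompt itself, with no worked examples. A small illustrative sketch (the wording is an assumption, not an official template):

```python
# Zero-shot prompting, per the paper's recommendation: describe the task
# directly and specify the output format, without few-shot examples.
prompt = (
    "Solve the problem below. Think step by step, then put the final answer "
    "on its own line after 'Answer:'.\n\n"
    "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"
)
```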

Future Work

Future work will leverage longer chain-of-thought (CoT) reasoning to improve function calling, multi-turn conversation, and role-playing, and will better handle queries in languages other than Chinese and English. In future releases, software engineering performance will be improved by applying rejection sampling to relevant data or by running asynchronous evaluations during the RL process. The aim of this work is to improve the robustness and versatility of DeepSeek-R1 across more tasks.

Conclusion

DeepSeek-R1 leverages a novel reinforcement learning paradigm that yields emergent chain-of-thought reasoning and improved explainability, making it better at solving tough problems and communicating its solutions. A significant contribution is the introduction of distilled models, which make sophisticated AI reasoning feasible on resource-constrained devices and thus expand its use cases. The open-source nature of the DeepSeek-R1 models empowers the community to explore and develop ever more powerful reasoning AI across science, education, software development, and everyday problem-solving.




Source
Website: https://api-docs.deepseek.com/news/news250120
Research Paper: https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf
GitHub Repo: https://github.com/deepseek-ai/DeepSeek-R1
Model weights of Variants: https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d
Try chat model: https://chat.deepseek.com/


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.
