
Friday, 26 September 2025

DeepSeek-V3.1-Terminus: Inside Its Superior Agentic AI Capabilities

Presentational View

Introduction

Previously, progress in AI was measured in raw power; today it is about building specialized, reliable tools that expand what people can actually do. Among the most useful of these are Search Agents and Code Agents. Search Agents serve as the model's bridge to the real-time world, retrieving live data, researching solutions, and grounding the model in current information. Code Agents, meanwhile, are transforming software development by acting as tireless assistants that can write, debug, and maintain intricate codebases. The real paradigm shift comes with their combination: an autonomous agent that can research an unfamiliar programming problem and then implement the solution, significantly accelerating the entire development cycle.

This powerful pairing has been undercut by nagging challenges: inconsistent results, language errors in multilingual code, and inefficient tool use that hold back genuinely autonomous workflows. How can an AI reliably locate the most current API documentation and then seamlessly apply it within a sophisticated, terminal-driven project? That is exactly the problem this new AI model is designed to address. By emphasizing stability, improving bilingual fidelity, and optimizing agentic tool use, it aims to behave like a genuinely competent developer agent. This new model is DeepSeek-V3.1-Terminus.

What is DeepSeek-V3.1-Terminus?

DeepSeek-V3.1-Terminus is a large language model defined by a series of targeted improvements over its predecessor, DeepSeek-V3.1. It retains the robust architectural core of the DeepSeek V3 lineage as a large hybrid reasoning model, but is intended to be a more trustworthy and refined instrument for demanding real-world tasks.

Key Features of DeepSeek-V3.1-Terminus

The Terminus release is characterized by a number of distinguishing features that tackle key pain points in deploying AI models, turning incremental gains into operational advantages.

  • Refined Stability and Language Consistency: A key aim of this release was to address user feedback about output quality. The model produces more stable and reliable output across a wide range of tasks than the prior version. In particular, language consistency is improved: instances of mixed Chinese-English (CN/EN) text and of random or abnormal characters in the model's output have been largely eliminated.
  • Optimized Agentic Workflow and Tool Use: This release focuses on optimizing agentic capabilities. Both the integrated Code Agent and Search Agent have been improved in performance and efficiency. These upgrades change how the model accomplishes tasks that rely on external tools and code generation, making it considerably better suited to complex coding and agent tasks.
  • Native Structured Tool Calling: Beyond agent-style tasks, the model supports native structured tool calling, invoking external tools through a defined schema rather than free-form text (a minimal sketch follows this list).
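Because the API is OpenAI-compatible (see the access section below), structured tool calling can be exercised with a standard client. The following is a minimal sketch only: the base URL, the model identifier, and the get_weather tool schema are illustrative assumptions rather than values taken from the Terminus documentation.

# Minimal sketch of a structured tool call against an OpenAI-compatible endpoint.
# Base URL, model name, and the tool schema are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",          # hypothetical tool, for illustration only
        "description": "Return current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-chat",              # assumed identifier; check the API docs
    messages=[{"role": "user", "content": "What's the weather in Hangzhou?"}],
    tools=tools,
)

# If the model decides a tool is needed, it returns a structured call
# (function name plus JSON arguments) instead of free-form text.
print(resp.choices[0].message.tool_calls)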

Use Cases of DeepSeek-V3.1-Terminus

With its specialized improvements, DeepSeek-V3.1-Terminus is well suited for several practical scenarios where robustness, precision and agentic execution are paramount.

  • High-Fidelity Bilingual Document Generation: The model's improved language consistency reduces mixed Chinese-English text and extraneous characters. It is particularly useful for producing accurate, reliable, and compliant reports, contracts, or technical documentation in bilingual (Chinese/English) contexts where output quality underpins user trust and formal review.
  • Robust Autonomous Execution of Terminal-Based Workflows: Thanks to more stable outcomes stemming from the agent improvements, the model's Terminal-bench score rose from 31.3 (DeepSeek-V3.1) to 36.7 (Terminus). This makes it a strong fit for managing and executing complex, multi-step workflows in a command-line environment or other terminal-oriented tasks where reliably following a sequence of actions is critical to completing the mission.
  • More Efficient General Agentic Tool Use: Optimization efforts, including changes to the Search Agent's template and tool set, yielded an increase of more than 28% in the BrowseComp (agentic tool use) score, from 30.0 to 38.5. This efficiency benefits information-research and operational workflows that chain external tools such as search and browsing in a pragmatic, reliable way.
  • Specialized Resolution of Multilingual Software Bugs: The improvement on the SWE-bench Multilingual benchmark, from 54.5 to 57.8, makes the model a suitable choice for workflows involving complex coding across languages. It can act as the core engine that automates the analysis, debugging, and application of bug fixes in repository-style workflows spanning multiple programming languages.

How Does DeepSeek-V3.1-Terminus Work?

DeepSeek-V3.1-Terminus has the same general architecture as its forerunner, DeepSeek-V3. It is a massive hybrid reasoning model with a staggering 671 billion total parameters, 37 billion of which are active at any one moment. This enables it to operate in both thinking and non-thinking modes, so that problem-solving can be tackled in a flexible manner.

For expert users, the key control is the reasoning_enabled boolean parameter, which governs the model's reasoning behavior. It lets a developer switch the deeper reasoning paths on or off, optimizing for speed on easier tasks or depth on harder ones. The Terminus update does not fundamentally alter this underlying architecture; it focuses on honing the layers above it, namely the robustness of the output and the effectiveness of the built-in Code and Search agents.
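To make the control concrete, here is a minimal sketch of toggling the reasoning path per request through an OpenAI-compatible client. Passing reasoning_enabled via extra_body, along with the model identifier, is an assumption about how a given provider exposes the flag; consult the API documentation for the exact parameter placement.

# Sketch: toggling the deeper reasoning path on an OpenAI-compatible endpoint.
# The extra_body placement and model name are assumptions, not confirmed values.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

def ask(question: str, deep_reasoning: bool) -> str:
    resp = client.chat.completions.create(
        model="deepseek-chat",                              # assumed identifier
        messages=[{"role": "user", "content": question}],
        extra_body={"reasoning_enabled": deep_reasoning},   # assumed field location
    )
    return resp.choices[0].message.content

print(ask("What is 17 * 23?", deep_reasoning=False))                  # fast path
print(ask("Prove that sqrt(2) is irrational.", deep_reasoning=True))  # thinking path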

Performance Evaluation

The real measure of the DeepSeek-V3.1-Terminus update is its performance relative to the prior DeepSeek-V3.1 model on benchmarks built around complex, agentic tasks. The most substantial improvement was on BrowseComp, which measures general agentic tool use: the score rose from 30.0 to 38.5, an increase of almost 28%. This matters because it indicates greater efficiency in agentic workflows that require the AI to engage with external tools such as browsers and search APIs, and it suggests the Search Agent optimizations were more than minor adjustments, making the model a more capable autonomous agent.

Performance Benchmark
source - https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus

Another area of significant improvement was Terminal-bench, where the score rose from 31.3 to 36.7. This benchmark is critical because it evaluates models on terminal-based coding and agentic tasks, the equivalent of performing actions in a command-line environment. The gain strongly indicates improved stability and agent performance, and it points to better results on tasks that require executing precise commands in a specific order.

Finally, the model showed solid progress in multilingual coding, with the SWE-bench Multilingual score improving from 54.5 to 57.8. This benchmark directly tracks the model's ability to reason through software engineering tasks across numerous programming languages. The improvement gives strong confidence in the model's ability to support complex coding workflows, particularly in contemporary development environments where multilingual repositories are common.

Competitive Landscape and Key Differentiators

Four popular models can be compared: DeepSeek-V3.1-Terminus, Kimi K2-Instruct-0905, GLM-4.5, and Qwen2.5-Max, each taking a different route to state-of-the-art performance. Although they all take advantage of MoE architectures to achieve efficiency, their underlying philosophies and training methods differ substantially.

Qwen2.5-Max is scale-centered, using more than 20 trillion pre-training tokens plus RLHF to deliver strong general-purpose reasoning. At the other extreme, Kimi K2-Instruct-0905 is a far more specialized model, trained with reinforcement learning to perform better as an agent in both coding and tool use, reaching 69.2% on SWE-bench Verified. GLM-4.5 aims for a holistic fusion of reasoning, coding, and agentic ability, performing well on terminal-based tasks while maintaining a high average tool-calling success rate.

DeepSeek-V3.1-Terminus carves out its niche through architectural innovation and knowledge transfer. Its major distinguishing features are the auxiliary-loss-free load-balancing scheme in its MoE architecture and the Multi-Token Prediction (MTP) training objective, which boosts both performance and inference speed. Just as importantly, its training includes knowledge distillation from the long-chain-of-thought DeepSeek-R1 model, directly incorporating advanced reasoning patterns. This emphasis on efficiency and distilled intelligence lets it deliver top results on hard math and coding tests with a significantly lower training budget, in contrast to competitors that lean on sheer data scale or narrow agentic specialization.

How to Access and Use this Model

DeepSeek offers a variety of access methods to suit different needs. You can interact with the model online through the DeepSeek App, the web interface, or the API. Developers integrating it into their own applications can use the DeepSeek API directly or go through OpenRouter, which exposes an OpenAI-compatible completion API. To run the model locally, download the open-source weights from Hugging Face and refer to the DeepSeek-V3 GitHub repository for details on the model structure and the updated inference demo code in its inference folder. Importantly, the model is released under the MIT License for both academic and commercial use, which makes it accessible for a wide range of projects.
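As a concrete starting point for local use, the sketch below pulls the open weights with the huggingface_hub library, using the repository id listed in the sources; note that the full checkpoint is extremely large, so this is illustrative rather than a casual download.

# Sketch: downloading the open-source weights from Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V3.1-Terminus",
    local_dir="./DeepSeek-V3.1-Terminus",   # destination folder of your choice
)
print("Weights downloaded to", local_dir)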

Future Work and Upcoming Updates

Looking ahead, DeepSeek-AI plans further updates to DeepSeek-V3.1-Terminus. The developers have been candid about a known issue in the current checkpoint: the self_attn.o_proj parameters do not conform to the UE8M0 FP8 scale data format. A fix is in progress and will be delivered in a future update.

Conclusion

DeepSeek-V3.1-Terminus is not just another incremental step in the AI arms race. By doubling down on stability, removing linguistic artifacts, and supercharging its agentic capacity, the model has a distinct identity as a reliable workhorse for complicated, automated workflows.


Sources:
Terminus doc: https://api-docs.deepseek.com/news/news250922
HuggingFace model weights: https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus
OpenRouter: https://openrouter.ai/deepseek/deepseek-v3.1-terminus



Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Monday, 15 September 2025

How Qwen3-Next Processes 1M Tokens With Blazing Inference Speed

Presentational View

Introduction

The domain of artificial intelligence is in a period of rapid evolution defined by ever more efficient, accessible, architecturally sophisticated, and intelligent processing. This pace is evident in the emergence of high-sparsity Mixture-of-Experts (MoE) models, which, together with innovations such as Multi-Token Prediction (MTP), advanced hybrid attention mechanisms, and more stable optimizations, are palpably changing how we build and engage with AI.

A new AI model is now gaining traction as a significant contributor to this evolution. By presenting an ultra-efficient architecture that minimizes active parameters and optimizes inference speed, it represents a significant step toward democratizing advanced AI capabilities. Most importantly, its combination of a hybrid attention mechanism, high-sparsity MoE, and MTP addresses the primary issues of long-context processing, computational cost, and inference latency, enabling more nimble and easily deployable AI. This new model is called Qwen3-Next.

What is Qwen3-Next?

Qwen3-Next is a state-of-the-art Mixture-of-Experts (MoE) large language model designed to deliver near-frontier performance while remaining highly efficient. It contains 80 billion parameters, yet during inference it intelligently activates only about 3 billion of them. The result is vastly reduced compute requirements, greater throughput, and very robust overall capability.

Model Variants

Qwen3-Next is released in two variants for different operational purposes, both built on the same efficient, advanced architecture.

  • Qwen3-Next-80B-A3B-Instruct: This version produces immediate, streamlined outputs. It is best for instruction-following and typical conversational prompts, where quick, direct answers are required.
  • Qwen3-Next-80B-A3B-Thinking: This variant is built for more complex, deliberative reasoning and operates in Thinking Mode, producing step-by-step solutions. It is better suited to tasks that demand deeper analytical or mathematical reasoning, offering a more deliberate way of answering difficult prompts.

Key Features of Qwen3-Next

Qwen3-Next combines a number of notable features that together represent the latest advances in large language models on efficiency, scale, and performance without losing sight of computational cost.

  • Training and inference efficiency: Qwen3-Next was trained with less than 10% of the GPU hours required for Qwen3-32B, a massive reduction in cost. For users, it delivers more than 10x the inference throughput of Qwen3-32B on contexts longer than 32K tokens, which translates into more responsive applications.
  • Exceptional context length: Qwen3-Next has a native context window of 262,144 tokens, meaning it can ingest and understand large amounts of information in a single interaction. Using the YaRN method, the context can be extended to roughly 1 million tokens, which makes it well suited to complete, thorough exploration of long documents (a configuration sketch appears after this list).
  • Native Multi-Token Prediction (MTP): Qwen3-Next implements an MTP mechanism that is trained end-to-end, significantly increasing inference speed while also improving overall model performance, so users get faster interactions without sacrificing output quality.
  • Ultra-High-Sparsity MoE Architecture: The model is built around a very low activation ratio, using roughly 3.7% of its 80 billion total parameters at any one time. This high-sparsity design is the bedrock of strong performance at low computational cost.
  • Improved Structural Stability: The architecture includes several key optimizations, such as Zero-Centered RMSNorm and an attention output gating mechanism, that improve stability during pre-training and fine-tuning and yield a more reliable model.
  • Unique Hybrid Attention Mechanism: The model features a novel hybrid attention design that mixes linear-attention layers (Gated DeltaNet) with standard gated attention layers. This hybrid is what lets it handle extremely long sequences of text while maintaining a high degree of recall compared with conventional attention configurations.
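As a concrete illustration of the YaRN extension mentioned above, the sketch below edits a local copy of the checkpoint's config.json. The field names and the scaling factor follow the convention Qwen has used in earlier releases and are assumptions here; the model card documents the officially recommended values.

# Sketch: enabling YaRN to push the native 262,144-token window toward ~1M tokens.
# The rope_scaling field names and factor are assumptions based on prior Qwen releases.
import json

config_path = "Qwen3-Next-80B-A3B-Instruct/config.json"   # local checkpoint copy

with open(config_path) as f:
    config = json.load(f)

config["rope_scaling"] = {
    "rope_type": "yarn",
    "factor": 4.0,                                # 262,144 x 4 is roughly 1M tokens
    "original_max_position_embeddings": 262144,
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)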

Use Cases of Qwen3-Next

Given its characteristics and performance, the most distinctive use cases for Qwen3-Next include:

  • Real-Time Legal or Scientific Document Analysis on Edge Devices: Autonomously processing entire legal depositions, research papers, and substantial technical specifications on local workstations to extract insights, summarize findings, and cross-reference them, all without depending on cloud resources.
  • Highly Efficient, Large-Scale Codebase Intelligence: Providing robust code review, refactoring recommendations, and bug detection across vast repositories of code by reasoning over the entire codebase's context with low latency and computational cost.
  • Hyper-Specialized Adaptive AI Agents: Building AI systems that switch between a rapid, factual response mode (the Instruct variant) and deliberate, reflective reasoning with an explicit thought process (the Thinking variant) for highly specialized applications such as financial analysis, engineering design, and strategic planning.
  • Advanced Mathematical and Logical Proof Generation: Generating long, verifiable formal proofs (e.g., in Lean 4) in which the AI produces sophisticated proof chains and decomposes subgoals that human experts can inspect and verify in real time.

How Does Qwen3-Next Work?

Qwen3-Next offers a fascinating look at what architectural innovation makes possible, combining several components into an efficient, high-performance design. At its heart is the Hybrid Attention Mechanism, which combines Gated DeltaNet with Gated Attention. Gated DeltaNet is used in most layers and provides an efficient way to process extremely long sequences, avoiding the quadratic scaling of traditional attention while remaining strong at in-context learning. The more conventional Gated Attention layers compare far more favorably on recall, addressing a major weakness of pure linear attention. Together, the hybrid guarantees both speed on long contexts and strong recall.

Hybrid Architecture
source - https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list

Alongside its attention mechanism, Qwen3-Next uses a High-Sparsity Mixture-of-Experts (MoE) architecture. Rather than invoking all 80 billion parameters for each inference step, only a fraction (roughly 3 billion) is actually used. This is done by routing each input to a chosen subset of 10 routed experts plus 1 shared expert out of a pool of 512 experts. This sparse activation dramatically reduces computational overhead (FLOPs per token) while preserving the model's immense capacity and enabling specialization across tasks. Other improvements include Multi-Token Prediction (MTP), an end-to-end optimized technique that speeds up inference by predicting multiple tokens at once, and stabilizing techniques such as Zero-Centered RMSNorm that keep training and deployment reliable.
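The routing idea is easier to see in code. The sketch below is a conceptual illustration of high-sparsity top-k routing with a shared expert; the hidden size, the simple softmax router, and the per-token loop are illustrative simplifications, not Qwen3-Next's actual implementation.

# Conceptual sketch of high-sparsity MoE routing: each token runs through a small
# top-k subset of routed experts plus one always-active shared expert.
import torch
import torch.nn.functional as F

hidden, num_experts, top_k = 2048, 512, 10     # illustrative size; 512 experts, top-10 routing
tokens = torch.randn(4, hidden)                # a batch of 4 token embeddings

router = torch.nn.Linear(hidden, num_experts, bias=False)
experts = torch.nn.ModuleList(torch.nn.Linear(hidden, hidden) for _ in range(num_experts))
shared_expert = torch.nn.Linear(hidden, hidden)

scores = F.softmax(router(tokens), dim=-1)                 # (4, 512) routing probabilities
weights, idx = torch.topk(scores, k=top_k, dim=-1)         # keep only the top-10 experts per token
weights = weights / weights.sum(dim=-1, keepdim=True)      # renormalize over the chosen experts

routed = torch.zeros_like(tokens)
for t in range(tokens.size(0)):                            # only 10 of 512 experts run per token
    for w, e in zip(weights[t], idx[t]):
        routed[t] += w * experts[int(e)](tokens[t])

output = shared_expert(tokens) + routed                    # shared expert + sparse routed mixture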

Performance Evaluations in Comparison to Other Models

Qwen3-Next performs extremely well on a variety of benchmarks, illustrating its efficiency and power relative to its predecessors and other sophisticated models. 

RULER Benchmark
source - https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list

On the RULER benchmark, Qwen3-Next-80B-A3B-Instruct demonstrates its ability by outperforming Qwen3-30B-A3B-Instruct-2507 at every tested length and exceeding Qwen3-235B-A22B-Instruct-2507 for contexts below 256K tokens. On the full 1M-token RULER evaluation, Qwen3-Next-80B-A3B-Instruct scores a very competitive 91.8 average accuracy against 92.5 for Qwen3-235B-A22B-Instruct-2507, while activating far fewer parameters, underscoring its efficiency on ultra-long contexts.

Performance benchmark of Thinking model
source -  https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking

The Qwen3-Next-80B-A3B-Thinking model also far exceeds Qwen3-30B-A3B-Thinking-2507 across a multitude of benchmarks, including difficult reasoning tasks such as AIME25 and HMMT25. In addition, it routinely outperforms the proprietary Gemini-2.5-Flash-Thinking on multiple benchmarks, underlining its reasoning prowess. On the 1M-token RULER benchmark with sparse attention, Qwen3-Next-80B-A3B-Thinking reaches an average accuracy of 95.90, similar to Qwen3-235B-A22B-Thinking-2507 under the same setting; both models demonstrate strong long-context reasoning.

Competitive Landscape

The AI landscape has moved away from "bigger is better" toward smart, efficient design. Kimi K2 targets massive scale; GLM-4.5 targets holistic capability; gpt-oss-20b targets edge devices. Qwen3-Next is not claiming to beat every rival outright, nor is it chasing a single niche like some competitors. Its distinguishing characteristic is a core philosophy of radical efficiency.

Its technical advantage lies in a novel combination of architectural innovations. The biggest practical advantage is the ability to process a context window of up to 1 million tokens, several times larger than the 128K-256K limits of its main competitors. This difference gives Qwen3-Next a genuine edge in the increasingly common practice of analyzing massive documents and datasets.

This radical efficiency is directly tied to bringing the latest advances in AI to the masses. By producing a model that can run effectively on a single GPU, or even lean on a CPU, Qwen3-Next puts near-frontier performance within reach without hundreds of thousands of dollars of hardware. It shows that the most recent evolution of AI is not simply about making models smarter, but the continued effort to make them practically usable and available to all.

How to Access and Utilize Qwen3-Next

The model weights are publicly available on Hugging Face, a premier platform for hosting AI models. Deployment and usage instructions, especially with performant frameworks such as SGLang and vLLM that take advantage of Multi-Token Prediction (MTP), are provided in the Hugging Face repositories and the official Qwen.ai blog. Qwen3-Next is open source, in line with the broader trend of making capable AI tools publicly available for research and commercial use.
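For orientation, here is a minimal serving sketch with vLLM, one of the frameworks recommended for exploiting MTP. The tensor-parallel size and sampling settings are assumptions that depend on the available hardware.

# Sketch: running Qwen3-Next with vLLM (one of the recommended inference frameworks).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    tensor_parallel_size=4,          # assumption: adjust to the GPUs actually available
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Summarize the key ideas behind hybrid attention."], params)
print(outputs[0].outputs[0].text)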

Limitations and/or Future Work

While Qwen3-Next is a major step forward, it does have some limitations. The static implementation of YaRN, although it enables ultra-long context extension, applies the same scaling factor across all input lengths, which could affect performance on shorter texts. The highly useful Multi-Token Prediction (MTP) mechanism is not yet generally supported in Hugging Face Transformers, so specialized inference frameworks such as SGLang or vLLM are needed for maximum efficiency. Finally, it is worth noting that even when prompted in other languages, the model will most likely conduct its internal thinking process in English before producing the final answer in the requested language.

Conclusion

Qwen3-Next is a landmark that sets a new standard for what intelligent architectural design can achieve. It is not merely an incremental improvement but a breakthrough, especially in the trade-off between computational expense and sophisticated capability. Minor limitations, such as YaRN's behavior on shorter texts, are certainly present, but the overall package offered by Qwen3-Next is a vision of AI that is intelligent, inherently efficient, and universally accessible.


Sources:
Tech blog: https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list
Qwen3-Next-80B-A3B-Instruct : https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct
Qwen3-Next-80B-A3B-Thinking : https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Tuesday, 9 September 2025

Kimi K2 0905: Enhanced Frontend Coding with Agentic AI Capabilities

Presentational View

Introduction

The philosophy of the Kimi K2 series has been to push the cutting edge of agentic capability, establishing it as an action-oriented agent meant to accomplish things, not merely talk about them.

Building on this base, the latest version goes further in making AI an effective collaborator. One of the biggest hurdles in AI-aided development has been getting functional code and good design from the same model; this release tackles it head-on by improving both the look and feel and the functionality of frontend code. To truly succeed, an AI also needs to fit seamlessly into existing digital environments, and this Kimi K2 variant accomplishes that with high-quality tool-calling and broad API compatibility.

What is Kimi K2 0905? 

Kimi K2-Instruct-0905, or Kimi K2 0905 for short, is the newest and most capable version in the Kimi K2 family. It is a cutting-edge Mixture-of-Experts (MoE) language model specifically designed and optimized for advanced agentic capabilities. Whereas most LLMs simply generate text, Kimi K2 0905 was designed to be a doer: a model that can use tools, reason through complex problems, and work independently to carry multi-step projects from beginning to end.

Key Features of Kimi K2 0905

Kimi K2 0905 has many upgrades that enhance its agentic capabilities.

  • Extended Context Length: Kimi K2 0905 has a 256K-token context window, a substantial upgrade that extends its reach on long-horizon tasks and lets it hold more actionable information in view at once.
  • Improved frontend coding: This version brings several upgrades to frontend development. It produces better-looking, more usable interfaces and generates more solid, functional code that is easier to incorporate into existing projects.
  • Tool-calling and API compatibility: Kimi K2 0905 makes its own autonomous decisions about when to call the tools available to it. It also offers OpenAI/Anthropic-compatible APIs for straightforward integration, mapping temperature for Anthropic-style requests with the formula real_temperature = request_temperature × 0.6 (see the sketch below).
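The temperature mapping in the last bullet is simple enough to express directly; the helper below is a trivial sketch of that formula (the name kimi_real_temperature is ours, not part of any SDK).

def kimi_real_temperature(request_temperature: float) -> float:
    """Map the requested temperature to the value the endpoint actually applies."""
    return request_temperature * 0.6

print(kimi_real_temperature(1.0))  # a request of 1.0 is applied as 0.6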

Capabilities and use cases for Kimi K2 0905 

The real potential of Kimi K2 0905 lies not just in its generative features but in its ability to deliver high-volume, high-impact work. It acts less like a tool and more like an autonomous teammate.

  • The Autonomous Business Catalyst: The agent can start from a vague business idea and orchestrate execution of an entire go-to-market strategy: conducting market research, shaping product strategy, integrating supplier APIs, creating a brand, and building and launching a complete e-commerce store with its initial marketing campaigns. It is the ultimate "doer" agent, pairing strategic insight with multi-tool execution.
  • The Autonomous DevOps & SRE: In this use case, the model acts as an always-on software engineer. It can autonomously read a production bug report, use git commands to access the codebase, orchestrate testing and logging tools, reproduce the bug, fix the code, and update the ticket in Jira once the issue is resolved. What sets this apart is that it handles the entire software-maintenance lifecycle at machine speed, rather than relying on humans to intervene at key points.
  • The Proactive Cybersecurity Incident Responder: Acting as an automated security practitioner, the agent can receive a threat alert, run security-specific tools to analyze it, decide in near real time to isolate a compromised asset by calling cloud infrastructure APIs, and then patch and remediate the vulnerability. It can subsequently document the incident, delivering significant value through its speed and its orchestration of specialized security tools during an incident.
  • The End-to-End Product Prototyping Agent: This agent moves rapidly from a simple idea to a functional, deployed prototype. It independently researches and assembles third-party APIs (for example, payment gateways or maps), writes the necessary frontend and backend code with an eye for aesthetics and usability, and handles everything needed for deployment.

Performance Evaluation

Kimi K2 0905's capabilities are substantiated by strong results on several difficult industry benchmarks. A standout was SWE-Bench Verified, where the model reached an accuracy of 69.2% ± 0.63. This benchmark is unusually meaningful because it measures an agent's success at resolving real software issues with verified bug fixes. A high score here shows the model can serve as a dependable, end-to-end automated tool for software development and maintenance, diagnosing complex defects and proposing verifiable patches with accuracy and speed, thereby reducing the time and resources spent on manual debugging.

Evaluation Results
source - https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905

Another impressive result was on the SWE-Dev benchmark, with an accuracy of 66.6% ± 0.72. This benchmark evaluates a model's ability to build new features or code from high-level specifications. SWE-Dev is deliberately made harder by removing the test files that could provide indirect hints, so the model must design and implement its own solutions. Maintaining such a high accuracy under those restrictions is all the more impressive; it shows Kimi K2 is not only effective at code generation but also capable of genuinely autonomous, creative development.

Beyond these, Kimi K2 0905 performs well elsewhere. On SWE-Bench Multilingual it scored 55.9% ± 0.72, indicating strong adaptability for global development. On Multi-SWE-Bench it scored a respectable 33.5% ± 0.28, showing it can manage multiple interdependent tasks and reason over the long term. And on Terminal-Bench it scored 44.5% ± 2.03, confirming its proficiency at automating command-line operations, an important skill for an agent that controls systems directly.

How to Get and Use Kimi K2 0905

Kimi K2 0905 is available to researchers and developers via the Hugging Face Hub, with a dedicated model card for Kimi-K2-Instruct-0905. The model can be run locally through the transformers library, and the model card includes code snippets to get started. The entire Kimi K2 project is open source, with its resources and code on the official MoonshotAI GitHub repository.
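As a rough starting point for local experimentation, the sketch below loads the checkpoint with transformers. Kimi K2 is a very large MoE model, so in practice this requires substantial multi-GPU sharding; device_map="auto" and the generation settings are placeholders, and the model card's own snippets should take precedence.

# Minimal local-inference sketch; settings are placeholders, not a tuned recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moonshotai/Kimi-K2-Instruct-0905"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto", torch_dtype="auto"
)

messages = [{"role": "user", "content": "Draft a minimal REST endpoint in Flask."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
out = model.generate(inputs.to(model.device), max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))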

Conclusion

We have spoken with AI as a conversationalist and accessed information from it as an information retriever for years. Kimi K2 0905 anchors the shift of AI to become an effective team member—an independent agent that can be given high-level objectives and then perform the intricate, multi-step, multi-tool processes needed to succeed.


Sources:
Blog: https://moonshotai.github.io/Kimi-K2/
Kimi-K2-Instruct model card: https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905
GitHub Repo: https://github.com/MoonshotAI/Kimi-K2


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Monday, 25 August 2025

Nemotron Nano 2: How NVIDIA Achieves High-Speed Reasoning AI

Presentational View

Introduction

The constant quest for ever more powerful artificial intelligence has led to an interesting development in model structure: a movement toward hybrid models that combine the best aspects of different architectures to overcome current shortcomings. One of the biggest challenges has been balancing a model's reasoning power, its speed, and the computational expense of running it. A new AI model is designed to provide advanced reasoning with greatly increased throughput, bringing advanced AI within reach for deployment on more readily available high-performance hardware. The new model is called Nemotron Nano 2.

What is Nemotron Nano 2?

Nemotron Nano 2 is a family of accurate and efficient hybrid Mamba-Transformer reasoning models developed by NVIDIA, the global provider of artificial intelligence and accelerated computing. The models are purpose-built to speed up reasoning output while matching or exceeding the accuracy of comparably sized state-of-the-art models. This combination of high speed and strong reasoning is exactly what suits them to a new generation of AI applications.

Model Variants

The Nemotron Nano 2 line comprises three models, each designed for slightly different requirements but all inheriting the core advantages of the architecture and all supporting a 128K context length.

  • NVIDIA-Nemotron-Nano-9B-v2: An aligned and pruned model with around 8.89 billion parameters. It is a general-purpose reasoning and chat model, best suited to AI agents and instruction following. Its distinct feature is that it can produce a "reasoning trace" before its final output. Its knowledge cutoff is September 2024.
  • NVIDIA-Nemotron-Nano-9B-v2-Base: The pruned base model, also with approximately 8.89 billion parameters, intended for further fine-tuning. Its data freshness extends to May 1, 2025.
  • NVIDIA-Nemotron-Nano-12B-v2-Base: The unpruned base model, with around 12.31 billion parameters. It was pre-trained on an immense 20 trillion tokens of data and shares the May 1, 2025 data-freshness cutoff.

Key Features of Nemotron Nano 2

Nemotron Nano 2 ships with innovative features that make it both powerful and practical.

  • Hybrid Mamba-Transformer Architecture: By replacing the majority of the traditional self-attention layers with Mamba-2 layers, the models significantly reduce the time required to process information, particularly the long thinking traces produced during reasoning.
  • Reasoning Budget Control: This feature lets users cap the number of thinking tokens the model spends before producing its final response, keeping answers well structured without excessive preamble (a conceptual sketch follows this list).
  • Native Tool-Calling: The models have in-built tool-calling, so that they can interface with external tools and APIs to complete a broader set of tasks.
  • Multilingual Capabilities: Nemotron Nano 2 is a multilingual model, handling English, Spanish, French, German, Japanese, and Italian.
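The budget-control idea can be sketched conceptually as a two-phase generation: let the model think up to a token budget, then close the thinking block and request the final answer. The </think> delimiter and the generate callable below are placeholders for illustration; NVIDIA's model card documents the exact client-side recipe.

# Conceptual sketch of reasoning-budget control (placeholders, not NVIDIA's API).
def generate_with_budget(generate, prompt: str, thinking_budget: int) -> str:
    # Phase 1: let the model think, but stop at the budget or at the delimiter.
    trace = generate(prompt, max_new_tokens=thinking_budget, stop=["</think>"])
    # Phase 2: close the thinking block and ask only for the final answer.
    return generate(prompt + trace + "</think>", max_new_tokens=512)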

Real-World Applications and Capabilities

The architecture and operating characteristics of Nemotron Nano 2 enable a range of valuable practical uses that need fast responses, deep reasoning, and controlled output length.

  • Optimized Real-Time Decision Support: Thanks to its compression, Nemotron Nano 2 can run demanding applications on accessible NVIDIA-accelerated hardware. Its hybrid layout guarantees reasoning speed, and Budget Control keeps explanations short and well structured. This suits workflows such as computer-aided diagnosis or real-time control, where time-efficient yet auditable reasoning is essential.
  • Code Debugging and Math Support: Given its large base of specialized math and code training data, such as the 133-billion-token Nemotron-CC-Math-v1 dataset, the model makes a particularly effective debugging and tutoring assistant. It can produce step-by-step explanations of complex STEM problems, and Reasoning Budget Control lets it scale the depth of those explanations, giving a succinct overview of simpler steps and a deeper thought process for harder concepts.
  • Automated Compliance Reports and Audit Trails: In heavily regulated industries, producing easy-to-understand records is essential. With its 128K context window, Nemotron Nano 2 can digest huge regulatory documents and generate compliance reports. Because the length of the reasoning output is controlled, the resulting audit trail stays concise and meets strict formatting requirements, streamlining review.

How does Nemotron Nano 2 Work?

One reason Nemotron Nano 2 performs so well is its innovative design. The models use a hybrid Mamba-Transformer architecture, referred to as Nemotron-Hybrid, which replaces most of the computationally demanding self-attention layers with Mamba-2 layers. This is especially helpful for speeding up inference on long sequences, as in complex reasoning tasks. The Nemotron-Nano-12B-v2-Base, for example, consists of 62 layers, of which 28 are Mamba-2 layers and only 6 are self-attention layers.

Hybrid Mamba-Transformer architecture
source - https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf
A sophisticated compression strategy allows this powerful model to run on a single GPU (NVIDIA A10G). It entailed two core procedures: pruning, which removed less important model structures such as layers and hidden dimensions, and distillation, which retrained the pruned model with the original model as a teacher to recover much of the initial accuracy and decision-making strength.
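For readers unfamiliar with distillation, the sketch below shows the textbook form of the objective used in this kind of accuracy recovery: the pruned student is trained to match the teacher's output distribution via a temperature-scaled KL divergence. This is a generic formulation, not NVIDIA's exact training code.

# Generic distillation objective: student matches the teacher's softened distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 as is standard for distillation
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2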

Training and Alignment

The model's robustness is the direct result of a carefully tailored, computationally intensive training run. Training in FP8 precision was a deliberate trade-off between numerical accuracy and computational throughput, and it enabled training at very large scale (20 trillion tokens). That efficiency made room for a curriculum-learning approach in which the model was exposed to progressively higher-quality data, building key strengths in a predictable order rather than from a random presentation of data. The reliance on synthetically generated data, particularly for specialized and multilingual tasks, is a modern, data-driven way to compensate for the scarcity of high-quality human-labeled data and to build a genuinely adaptable foundation model from the ground up.

On top of this foundation, the alignment process that follows turns a rough predictive model into a refined, user-friendly tool. Supervised Fine-Tuning (SFT) on carefully curated post-training datasets teaches the model to follow subtle instructions across many domains. A final, crucial RLHF stage then hones the model's behavior so that its outputs are not merely correct but also helpful, safe, and consistent with the subtle norms of conversation with people. This two-stage fine-tuning is what moves a model from research artifact to a deployable product fit for real-world use.

Performance Evaluation

Nemotron Nano 2 does not merely claim efficiency; it achieves leading-edge performance on many industry-standard tests. The Nemotron-Nano-12B-v2-Base model is especially strong in mathematical reasoning, scoring an excellent 91.66% on GSM8K CoT and 83.54% on MATH, demonstrating that it can handle complicated mathematical tasks and operate on advanced mathematical concepts.

Accuracy of Nemotron-Nano-V2-Base models versus existing SoTA models
source - https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf

Another area where the models excel is the management of long-context information. The 12B Base and 9B Base models scored 84.74% and 82.22%, respectively, on RULER-128K, which tests how well a model retains and retrieves information over long sequences of text. This is a key capability for applications that work with large files, conversation histories, or reports, where the models also deliver up to 6x higher inference throughput in generation-heavy workloads.

Comparison of Nemotron Nano 2 and Qwen3-8B in terms of accuracy and throughput
source - https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf

Beyond these tests, Nemotron Nano 2 performs well across many others. The 9B Base model is particularly strong on multilingual math problems, and the models also score well on code-generation tasks such as HumanEval+ and MBPP+. Moreover, operating as a reasoner, the 9B-v2 model posts strong results among open small models on benchmarks such as AIME25 (72.1%) and GPQA (64.0%).

How to Access and Use Nemotron Nano 2?

Nemotron Nano 2 is broadly available to the developer and research community. The models are published under the NVIDIA Open Model License, which allows commercial use, and the majority of the training data is openly available on Hugging Face. They are optimized for NVIDIA GPU-accelerated systems such as the A10G and A100 series and are designed to run on Linux. They work with the most widely used runtime engines, including Hugging Face Transformers, TRT-LLM, and vLLM. Note: users running vLLM are advised to add the flag --mamba_ssm_cache_dtype float32 to preserve output quality and avoid performance degradation.
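For vLLM users, the sketch below shows the same recommendation in Python form; passing mamba_ssm_cache_dtype as a keyword argument is an assumption about recent vLLM versions, and the CLI flag quoted above is the documented fallback.

# Sketch: loading Nemotron Nano 2 in vLLM with the recommended cache dtype.
# Keyword-argument support is assumed; use the CLI flag if it is not recognized.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    mamba_ssm_cache_dtype="float32",   # keeps the Mamba-2 state cache in FP32
)
out = llm.generate(["Explain the Pythagorean theorem step by step."],
                   SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)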

Limitations

Although the Nemotron Nano 2 models are highly sophisticated, they have notable limitations. The main one concerns hardware efficiency: deploying the 12.31-billion-parameter model on a single NVIDIA A10G GPU is only possible after substantial compression via pruning and distillation, which shows how high the resource requirements of the uncompressed base model are. In addition, the models are heavily optimized for NVIDIA GPU-accelerated systems, which limits their usability on other hardware platforms. Finally, their knowledge is not exhaustive; their factual coverage is bounded by the training data, with a cutoff of September 2024 for the 9B-v2 model and May 1, 2025 for the base models.

Conclusion

NVIDIA Nemotron Nano 2 is an innovative Mamba-Transformer hybrid that overturns the age-old trade-off between reasoning power, speed, and cost. By delivering state-of-the-art performance at several times the throughput, and by fitting on a single GPU through judicious compression, it makes advanced AI practical and affordable to deploy. It is an emerging template for bringing capable, economical reasoning models into real-world applications.


Source
Tech blog: https://research.nvidia.com/labs/adlr/NVIDIA-Nemotron-Nano-2/
Research paper: https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf
Nano-12B-v2-Base : https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base
Nano-9B-v2 : https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2
Nano-9B-v2-Base  : https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2-Base


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Tuesday, 5 August 2025

Google's MLE-STAR: Winning with Real-Time Web Search

Presentational View

Introduction

In the never-ending competition for AI dominance, the real bottleneck is no longer merely building larger models but getting them to function optimally, stably, and at the frontier of innovation. We are in a period of advanced Machine Learning Engineering (MLE) agents, autonomous systems that promise to automate the laborious work of developing and tuning AI. But too many of these agents have been operating with one hand tied behind their back, hobbled by the static, frequently outdated knowledge of their underlying language models. They use old maps to navigate a world that is constantly changing.

This dependence on prior knowledge is a brake on innovation. The challenge has been not only to construct an agent that can write code, but one that can learn and adapt in real time, as an expert human would. It must solve problems with the precision of an experienced engineer, not the broad brushstrokes of a generalist.

To address this critical gap, a new architecture has emerged from Google's research labs. It is an agent built to operate on the live edge of machine learning. By incorporating real-time web search for the most current models, applying a new approach of focused code refinement, and including a set of automated quality checks, this agent represents a qualitative breakthrough. The new paradigm is called MLE-STAR.

What is MLE-STAR?

MLE-STAR is a sophisticated autonomous agent that recasts machine learning construction as a focused code optimization problem. In contrast to precursors that accessed a static body of knowledge, MLE-STAR is a dynamic system. It uses real-time web search to find and apply state-of-the-art solutions, producing high-performing Python code custom-designed for a massive range of data types, from images and text to tabular and audio data.

Key Features of MLE-STAR

From an engineering perspective, MLE-STAR has several distinctive features that power it:

  • Live Web Model Search: The agent taps into the live, global stream of AI development to ensure the models it employs are not merely good ones but the state-of-the-art models actually suited to the task at hand.
  • Targeted Code Refinement: Rather than making broad, general changes, the agent locates the specific code blocks that truly control performance and concentrates its effort on refining them.
  • Automated Advanced Ensembling: It does not merely propose advanced ensemble strategies; it automatically implements and refines them.
  • Broad Task Generalization: MLE-STAR is a genuinely general framework that can handle a nearly limitless set of tasks, from classification to denoising, across any data type, without manually crafted examples.
  • Built-In Code Reliability: MLE-STAR includes implicit quality checks that yield reliable, trustworthy code, automatically detecting and fixing fatal issues such as bugs, data leakage, and misuse of data.
  • Novel Solution Development: The agent is designed to create genuinely novel solutions rather than simply repeating patterns from its training data.

Use Cases and Capabilities of MLE-STAR

From a business and strategic perspective, these technical capabilities translate into the following benefits:

  • Market Agility and Innovation: For any organization, the ability to rapidly develop high-performance solutions to new data problems is a decisive competitive advantage. MLE-STAR shortens development time and therefore expands the opportunity to innovate.
  • Optimizing Existing Investment: Rather than spending heavily on disruptive redesigns, organizations can deploy MLE-STAR to make well-targeted, high-leverage improvements to their existing ML systems, extracting the most value from infrastructure they already own.
  • Securing a Competitive Edge: In industries like finance or medicine, where narrow margins of error have enormous ramifications, the agent's automated ensembling provides a direct path to better performance.
  • De-risking AI Deployment: Defective AI models carry real risk. By automatically catching major errors such as data leakage and bugs, MLE-STAR helps ensure that deployed models are high-performing, reliable, and trustworthy, reducing the risk of poor outcomes and reputational damage.

How Does MLE-STAR Work?

MLE-STAR works through a sophisticated, multi-stage process capable of producing strong, high-performance machine learning models. Initial solution generation through web search kick-starts the process: using Google Search, a retriever agent (A_retriever) finds relevant, state-of-the-art models and code examples based on the task description provided by the user. A second agent, A_init, then generates simple Python scripts for each retrieved model, and the scripts are evaluated to identify the top performers. These highest-performing scripts are then merged into a strong initial solution, usually a simple averaging ensemble, by the A_merger agent.

Overview of MLE-STAR
source - https://arxiv.org/pdf/2506.15692

The heart of MLE-STAR's workflow is the iterative refinement of code blocks. During this phase, a nested loop iteratively refines the initial solution. In the outer loop, an A_abl agent conducts an ablation study to determine the code block that matters most for performance, and an A_extractor agent selects it for refinement. In the inner loop, a planning agent, A_planner, suggests various strategies to improve the targeted block, which are implemented by a coding agent, A_coder. The solution is updated only when a modification leads to improved performance.

MLE-STAR iteratively proposes effective ensemble strategies
source - https://arxiv.org/pdf/2506.15692

After this, MLE-STAR applies a novel ensemble step, proposing and refining different strategies for combining the strongest candidate solutions into a final, more powerful ensemble model. Throughout the whole process, a suite of robustness modules, including a debugging agent (A_debugger), a data-leakage checker (A_leakage), and a data-usage checker (A_data), continuously validates the code to ensure reliability and correctness.
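The nested refinement loop can be summarized in a few lines of pseudocode-style Python. The function names below are placeholders standing in for the agents described above, not the actual ADK interfaces.

# Simplified sketch of the outer ablation loop and inner refinement loop.
def refine_solution(solution, ablation_agent, planner, coder, evaluate,
                    outer_steps=4, inner_steps=3):
    best_score = evaluate(solution)
    for _ in range(outer_steps):
        target_block = ablation_agent(solution)           # most performance-critical block
        for plan in planner(solution, target_block, n=inner_steps):
            candidate = coder(solution, target_block, plan)
            score = evaluate(candidate)
            if score > best_score:                        # keep only changes that improve the score
                solution, best_score = candidate, score
    return solution, best_score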

Performance Evaluation with Other Models

In competitive machine learning, only one thing counts: results. MLE-STAR's performance was tested on MLE-Bench-Lite, a benchmark comprising 22 Kaggle competitions from real-world domains, the ultimate proving ground for ML performance. The results were not merely positive; they were dominant.

Main results from MLE-bench Lite
source - https://arxiv.org/pdf/2506.15692

MLE-STAR won a medal in an impressive 63.6% of the competitions, and 36% of its medals were gold, a rate that exceeds that of expert human practitioners. This shows a capability not just to compete, but to win.

Model usage (%) on image classification competitions
source - https://arxiv.org/pdf/2506.15692

Compared against its competitors, MLE-STAR's design strengths stand out starkly. Thanks to its ability to tap newer architectures such as EfficientNet, it left AIDE, an agent reliant on older internal models such as ResNet, well behind, taking 37% of image-classification medals to AIDE's 26%. It also handily outperformed specialist agents such as DS-Agent (constrained by a manually built case bank) and generalist agents such as gpt-4o and OpenHands, which achieved medal rates of only 6.1% and 12.1% respectively on the same test. That performance gap is not simply a number; it is evidence that a specialized, dynamic, and robust architecture is the key to state-of-the-art results.

The Specialist's Edge

MLE-STAR's superior performance proves a key design principle: the benefit of a specialist tool over a general-purpose one. While capable generalist agents such as OpenHands or models such as gpt-4o (employed with MLAB) can try to perform machine learning tasks, they are like a Swiss Army knife attempting surgery. They do not possess the specialist architecture necessary for the highly specific challenges of competitive machine learning.

This specialist advantage is built directly into its design. Its focused code-block optimization achieves a deeper, more effective refinement than the general approaches of other MLE agents such as AIDE. Most importantly, its built-in robustness modules, including the data-leakage checker, address machine-learning-specific failure modes that generalist developer agents are not designed to catch. This deliberate emphasis on MLE's distinctive pain points, coupled with a flexible architecture that scales beyond the manually curated bounds of agents like DS-Agent, is exactly what produces such a large performance gap and sustains its competitive advantage.

How to Access and Use MLE-STAR

For those who want to see what MLE-STAR can do, it is open-sourced on GitHub. The agent is built with the Agent Development Kit (ADK). To use MLE-STAR, a user provides a description of the task and the datasets involved; the agent then carries out the laborious machine learning work and produces an executable Python solution script. Note that MLE-STAR is currently intended for research use only, and users are responsible for ensuring that any models or content the agent retrieves do not violate the relevant licensing restrictions.

Limitations and Future Work

Currently, MLE-STAR's biggest limitation is its research-use-only status, which places responsibility on the user to comply with licensing for any models or content the agent retrieves. Another possible limitation is that, because the underlying LLM is trained on public data, some generated solutions may not be entirely original; similar code may already have been posted, for example on a Kaggle user forum.

Looking ahead, MLE-STAR's design points to exciting future work. Because it relies on web search, the agent should improve naturally as newer, stronger state-of-the-art models become publicly available. One potential enhancement is more direct human involvement: letting users supply descriptions of models they want the agent to use, or to search for even newer models and refinement strategies.

Conclusion

For developers, researchers, and companies, MLE-STAR offers a vision of a world where the barriers to building high-impact AI solutions are dramatically lowered, paving the way for a new generation of innovation across nearly every industry. The AI journey has always been about doing more, and with MLE-STAR that journey takes a large and exciting step forward.


Sources:
Tech blog: https://research.google/blog/mle-star-a-state-of-the-art-machine-learning-engineering-agents/
Research paper: https://arxiv.org/pdf/2506.15692
GitHub Repo: https://github.com/google/adk-samples/tree/main/python/agents/machine-learning-engineering


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Wednesday, 30 July 2025

GLM-4.5: Unifying Reasoning, Coding, and Agentic Work

Presentational View

Introduction

Breakthroughs in agentic AI and coding models are leading to more advanced and autonomous systems. These models have evolved into proactive agents that can reason, plan, and perform complex, multi-step actions. But obstacles remain. One key challenge has been fragmentation of capabilities: models tend to excel at reasoning, coding, or agentic work, but rarely all three at once. The result is clumsy, inefficient setups that juggle multiple specialist models.

A new model is designed to solve this very issue by integrating reasoning, coding, and agentic functions into one complete system. By combining these fundamentals, it aims to satisfy the sophisticated needs of intelligent agent applications, ushering in a new age of AI that is more productive, powerful, and seamlessly integrated. This new model is known as GLM-4.5.

The Visionaries Behind the Model

The GLM-4.5 series is the creation of Zhipu, an artificial intelligence company that grew out of the technological advancements of Tsinghua University's Computer Science Department and whose mission is to teach machines to think like humans. The philosophy behind GLM-4.5 was to build one comprehensive system that unites reasoning, coding, and agentic capabilities, an ambitious aim taken up to meet the growing sophistication of intelligent agent applications.

What is GLM-4.5?

GLM-4.5 is a series of cutting-edge, open-source AI models designed to act as one system for reasoning, coding, and agentic work. It is built to handle the complex needs of contemporary intelligent agent applications by offering an extensive and cohesive set of skills.

Model Variants

The GLM-4.5 line consists of two foundation models, each designed for different use cases while sharing a common design of combined capabilities and a hybrid mode of thinking.

  • GLM-4.5 (The Flagship): This behemoth has an impressive 355 billion total parameters and 32 billion active parameters. Its huge 128k context length allows very long, rich interactions, and an FP8 variant (GLM-4.5-FP8) is available for more efficient inference. Its API cost is 60 cents per 1 million input tokens and $2.20 per 1 million output tokens.
  • GLM-4.5-Air (The Efficient Compact): This model is for users who value efficiency without giving up much power. It has 106 billion total parameters with 12 billion active parameters and the same 128k context length, and it too has an FP8 variant (GLM-4.5-Air-FP8). The Air model's API cost is very low at 20 cents per 1 million input tokens and $1.10 per 1 million output tokens, making it highly cost-effective; a quick cost calculation for a sample workload follows this list.
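To make the pricing concrete, here is a small sketch of the cost arithmetic for a hypothetical workload of 5 million input tokens and 1 million output tokens per month, using the per-million-token prices quoted above. The workload figures are illustrative assumptions, not benchmarks.

# Estimate monthly API cost from the published per-million-token prices (USD).
# The workload figures below are hypothetical.
PRICES = {
    "GLM-4.5":     {"input": 0.60, "output": 2.20},   # $ per 1M tokens
    "GLM-4.5-Air": {"input": 0.20, "output": 1.10},
}

def monthly_cost(model: str, input_tokens: float, output_tokens: float) -> float:
    p = PRICES[model]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]

for model in PRICES:
    cost = monthly_cost(model, input_tokens=5_000_000, output_tokens=1_000_000)
    print(f"{model}: ${cost:.2f} per month")
# GLM-4.5: $5.20 per month;  GLM-4.5-Air: $2.10 per month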

Key Features of GLM-4.5

GLM-4.5 is filled with cutting-edge features that set it apart from the rest.

  • Hybrid Thinking Modes: Both models employ a dynamic hybrid reasoning design. They can switch between a 'thinking' mode for sophisticated reasoning and tool use, and a 'non-thinking' mode for fast, direct answers, depending on the complexity of the task.
  • Optimized for Agentic Tasks: GLM-4.5 is natively optimized as a foundation model for agentic work. It supports native function calling and has recorded the highest average tool-calling success rate of 90.6% when compared against the likes of Claude-4-Sonnet, Kimi K2, and Qwen3-Coder (a minimal function-calling sketch follows this feature list).

    Average Tool Calling Success Rate
    source - https://z.ai/blog/glm-4.5

  • Novel MoE Architecture: GLM-4.5 adopts a novel Mixture-of-Experts (MoE) architecture. Unlike other MoE models, which favor width, GLM-4.5 goes deeper (more layers) while staying thinner (smaller hidden dimension and fewer routed experts). The design follows from the observation that deeper models reason better.
  • Innovative Reinforcement Learning Infrastructure ('slime'): One of GLM-4.5's main technical strengths is its tailor-made, open-sourced Reinforcement Learning (RL) infrastructure called 'slime'. 'slime' is designed for extremely fast training and has a hybrid architecture that is flexible enough to accommodate both synchronous and asynchronous training. This is especially important for advanced agentic RL where data generation may become a bottleneck.
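As promised above, here is a minimal sketch of native function calling through an OpenAI-compatible endpoint, which the access section notes the Z.ai platform provides. The base URL, model identifier, and the weather tool are assumptions for illustration, not confirmed values; check the provider documentation before use.

# Minimal function-calling sketch against an OpenAI-compatible endpoint.
# The base_url, model name, and the example tool are assumptions for illustration.
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.z.ai/v1",   # hypothetical endpoint; consult the official docs
    api_key="YOUR_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",        # hypothetical tool the model may choose to call
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.5",                  # assumed model identifier
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
)

# If the model decided to call the tool, the structured arguments come back as JSON.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))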

Capabilities and Use Cases of GLM-4.5

The integrated design of GLM-4.5 opens up a wide range of sophisticated uses.

  • End-to-End Full-Stack Development: The framework can automatically produce complete web applications, from frontend coding to backend deployment and database handling.
    Use Case: An e-commerce site could be built using GLM-4.5 to quickly prototype and deploy a full-fledged e-commerce site, with an easy-to-use interface, product database, and payment gateway, all from a single set of high-level specifications.
  • Sophisticated Artifact Creation: In addition to regular code, the model may create advanced, standalone artifacts.
    Use Case: A game designer might create the full code for an interactive mini-game such as Flappy Bird, or a physicist might develop a working physics simulation right inside the development platform.
  • Sophisticated Frontend and Visual Design: GLM-4.5 is also great at designing beautifully crafted frontend interfaces in different forms.
    Use Case: A UI/UX designer may have the model create complex SVG graphics, such as a detailed drawing of a butterfly, or build a responsive, visually polished web page using HTML and Python.
  • Agent-Augmented Content Creation: The model may utilize its agentic tools to create rich content.
    Use Case: A business analyst may assign GLM-4.5 to develop a complete slide deck for a market analysis report. The model would employ its web search feature to collect current market information and then create the presentation, including charts and editable HTML code.

Training and Architecture

GLM-4.5's strong performance rests on its new architectural design. The decision to prioritise depth over width gives the model an edge in reasoning ability. Its MoE layers use loss-free balance routing and sigmoid gates. The self-attention component uses Grouped-Query Attention with partial RoPE, running 96 attention heads over a 5120 hidden dimension, a configuration that delivers large gains on reasoning benchmarks. QK-Norm stabilises the attention logits, and the Muon optimizer speeds up convergence. For faster inference, a Multi-Token Prediction (MTP) layer is added to enable speculative decoding.
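To illustrate the Grouped-Query Attention idea mentioned above, the sketch below shows the core mechanism: several query heads share a single key/value head, shrinking the KV cache. The head counts and dimensions are small toy values chosen for the example, not GLM-4.5's real configuration, and partial RoPE and QK-Norm are omitted for brevity.

# Minimal Grouped-Query Attention sketch (toy dimensions, illustrative only).
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    """x: (batch, seq, hidden). Each group of query heads shares one K/V head."""
    b, t, d = x.shape
    head_dim = d // n_q_heads
    q = (x @ wq).view(b, t, n_q_heads, head_dim).transpose(1, 2)    # (b, hq, t, hd)
    k = (x @ wk).view(b, t, n_kv_heads, head_dim).transpose(1, 2)   # (b, hkv, t, hd)
    v = (x @ wv).view(b, t, n_kv_heads, head_dim).transpose(1, 2)
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)    # broadcast shared K/V to each query group
    v = v.repeat_interleave(group, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(b, t, d)

# Toy usage: 8 query heads sharing 2 key/value heads over a 512-dim hidden state.
d, hq, hkv = 512, 8, 2
x = torch.randn(1, 16, d)
wq = torch.randn(d, d)
wk = torch.randn(d, d * hkv // hq)
wv = torch.randn(d, d * hkv // hq)
print(grouped_query_attention(x, wq, wk, wv, hq, hkv).shape)   # torch.Size([1, 16, 512])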

Slime - RL Infrastructure
source - https://z.ai/blog/glm-4.5

Beyond its architecture, GLM-4.5's capabilities are the direct result of an enormous, state-of-the-art multi-stage training process. Pre-training consumed an astounding 22 trillion tokens, split into a 15-trillion-token general corpus and a 7-trillion-token corpus concentrated on code and reasoning. This base was then refined with a decisive post-training stage built around reinforcement learning (RL) to develop elite agentic and reasoning capabilities. For reasoning, the model underwent a single-stage RL run at full context length, following a difficulty-based curriculum. For agentic work, it was trained on verifiable domains such as software engineering and information-seeking Q&A, where execution-based feedback ensures practical value. All of this is powered by slime, the innovative RL infrastructure with a decoupled, agent-first design and mixed-precision data generation (FP8 rollouts for speed, BF16 training for stability) that addresses common training bottlenecks.

Performance Evaluation

Thoroughly tested on 12 industry benchmarks, GLM-4.5 posted an outstanding aggregate score of 63.2, placing 3rd among all proprietary and open-source models. Its lighter sibling, GLM-4.5-Air, also scored a strong 59.8, offering a cost-to-performance ratio that makes high-end AI more affordable.

Overall performance on 12 benchmarks covering agentic, reasoning, and coding tasks
source - https://z.ai/blog/glm-4.5

The model's agentic capability is its defining characteristic, supported by a best-in-class 90.6% tool-calling success rate, a key statistic for dependable automation. On agentic benchmarks such as TAU-bench and BFCL v3 it consistently outperformed peers such as GPT-4. This strength extends to coding, where it not only recorded leading win rates over Kimi K2 (53.9%) and Qwen3-Coder (80.8%) on agentic coding tasks but also beat GPT-4 on real-world problems such as SWE-bench Verified.

Agentic coding in Real-World Development Scenarios
source - https://z.ai/blog/glm-4.5

This real-world power is founded on elite-level reasoning. GLM-4.5 shows state-of-the-art performance on challenging reasoning tests, matching top Google and Anthropic models on tough math and science problems such as AIME24 and MATH 500, evidence that its novel deep-network architecture has translated into enhanced reasoning ability.

How to Access and Use GLM-4.5

GLM-4.5 is designed to be easy to access. You can reach it via the Z.ai API platform, which provides OpenAI-compatible interfaces, and via the Z AI chatbot. For local deployment, the open weights of the base and hybrid reasoning models, including the FP8 variants, are hosted on Hugging Face and ModelScope, and the models integrate with mainstream inference frameworks such as vLLM and SGLang. Importantly, GLM-4.5 is open-source under a permissive MIT license that allows commercial use and secondary development, encouraging a thriving innovation ecosystem. Developers' main resource is the GitHub repository, which contains everything needed for local deployment and integration.
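For readers who want to try local inference, here is a minimal sketch using vLLM's offline Python API with the GLM-4.5-Air weights listed in the sources below. The parallelism setting is an assumption; a model of this size needs substantial GPU memory, so confirm hardware requirements and any GLM-specific options in the GitHub repository first.

# Minimal local-inference sketch with vLLM's offline API (illustrative only).
# tensor_parallel_size is an assumption; adjust to your GPU count and memory.
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-4.5-Air",   # Hugging Face weights linked in the sources below
    tensor_parallel_size=8,
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain what a Mixture-of-Experts layer does."], params)
print(outputs[0].outputs[0].text)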

Limitations and Future Work

While GLM-4.5 is a major leap towards unified AI, the journey to human-level capability across all areas remains ongoing. The developers acknowledge that although the model goes a long way toward unifying diverse capabilities, full proficiency in every task is a goal for future versions; in particular, there are 'further optimization opportunities' in agentic coding tasks relative to certain competitors. Moreover, while the reinforcement learning curriculum is effective, broadening it to more complex, real-world scenarios could make the model even more adaptable.

Conclusion

Open-source availability, combined with strong performance and pricing, makes GLM-4.5 a very tempting option for developers, researchers, and businesses alike, and a natural foundation for building the smarter, more capable, and more autonomous systems of tomorrow. With GLM-4.5 released, it is fair to say the future belongs not only to AI at massive scale, but to intelligent, integrated, and accessible design.


Sources:
Tech blog: https://z.ai/blog/glm-4.5
GitHub Repo: https://github.com/zai-org/GLM-4.5
Model collections: https://huggingface.co/collections/zai-org/glm-45-687c621d34bda8c9e4bf503b
Base Model Weight: https://huggingface.co/zai-org/GLM-4.5
Air Model Weight: https://huggingface.co/zai-org/GLM-4.5-Air


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.
