
Tuesday, 5 August 2025

Google's MLE-STAR: Winning with Real-Time Web Search

Presentational View

Introduction

In the never-ending competition for AI dominance, the real bottleneck is now not merely about building larger models but getting them to function—optimally, stably, and at the frontier of innovation. We're in a period of advanced Machine Learning Engineering (MLE) agents, autonomous systems that vow to mechanize the laborious task of developing and tweaking AI. But too many of these agents have been operating with one hand behind their back, hobbled by the static, frequently outdated knowledge of their fundamental language models. They use old maps to navigate a world that's constantly changing.

This dependence on prior knowledge is a brake on innovation. The challenge has not only been to construct an agent that can write code but one that can learn and adapt in real-time, as an expert human would do. It must solve problems with the precision of an experienced engineer and not the brushstrokes of a generalist.

To address this critical business requirement, a new architecture has been developed out of the research labs at Google. It's an agent built to run on the live edge of machine learning. By incorporating real-time web search for the most current models, using a new approach of focused code refinement, and including a set of automated quality tests, this agent is a qualitative breakthrough. This new paradigm is referred to as MLE-STAR.

What is MLE-STAR?

MLE-STAR is a sophisticated autonomous agent that recasts machine learning construction as a focused code optimization problem. In contrast to precursors that accessed a static body of knowledge, MLE-STAR is a dynamic system. It uses real-time web search to find and apply state-of-the-art solutions, producing high-performing Python code custom-designed for a massive range of data types, from images and text to tabular and audio data.

Key Features of MLE-STAR

From an engineering standpoint, MLE-STAR is powered by a set of distinctive, concrete features:

  • Live Web Model Search: The agent taps into the live, global stream of AI development through web search, ensuring the models it employs are not merely good ones but the state-of-the-art models actually suited to the task at hand.
  • Targeted Code Refinement: Rather than making broad, generic changes, the agent identifies the code components that truly drive performance and concentrates its refinement effort on those components.
  • Automated Advanced Ensembling: The agent does not just propose sophisticated ensemble strategies; it implements and iterates on them automatically.
  • Broad Task Generalization: MLE-STAR is a genuinely general framework that handles a wide range of tasks, from classification to denoising, across data types, without requiring manually crafted examples.
  • Built-in Code Reliability: MLE-STAR includes implicit quality assurance that detects and corrects critical issues such as bugs, data leakage, and misuse of the provided data, producing code that is reliable and trustworthy.
  • Novel Solution Development: The agent is designed to generate genuinely novel solutions rather than simply repeating common patterns from its training data.

Use Cases and Capabilities of MLE-STAR

From a business and strategic perspective, these technical capabilities translate into concrete value:

  • Market Agility and Innovation: For any organization, the ability to rapidly develop high-performance solutions to new data problems is a decisive competitive advantage. MLE-STAR shortens development time and thereby widens the window for innovation.
  • Optimizing Present Investment: Rather than spending heavily on a disruptive redesign of existing ML systems, organizations can deploy MLE-STAR to make well-targeted, high-leverage improvements to what they already have, extracting more value from existing infrastructure.
  • Securing a Competitive Edge: In industries like finance or medicine, where narrow margins of error carry enormous consequences, the agent's automated ensembling provides a direct path to better performance.
  • De-risking AI Deployment: Defective AI models are a serious liability. By automatically detecting major errors such as data leakage and bugs, MLE-STAR helps ensure that deployed models are high-performing, reliable, and trustworthy, reducing the risk of poor outcomes and reputational damage.

How Does MLE-STAR Work?

MLE-STAR works through a sophisticated, multi-stage process for developing robust, high-performance machine learning models. The process begins with Initial Solution Generation through Web Search. Using Google Search, a retriever agent (A_retriever) pulls in relevant, state-of-the-art models and their example code based on the task description provided by the user. A second agent, A_init, then generates a simple Python script for every retrieved model, and these scripts are evaluated to identify the top performers. The highest-performing scripts are then merged by the A_merger agent into a strong initial solution, usually a simple average ensemble.
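
To make this first stage concrete, the sketch below mirrors the retriever, initial-coder, and merger hand-off described above. The helper functions and prompts are illustrative stand-ins for the paper's LLM-backed agents and evaluation harness, not the actual implementation.

```python
# Illustrative sketch of MLE-STAR's initial solution generation stage.
# `call_llm` and `run_and_score` are hypothetical placeholders for the paper's
# LLM-backed agents and its script-evaluation harness.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to the underlying language model."""
    raise NotImplementedError

def run_and_score(script: str, dataset_path: str) -> float:
    """Placeholder: execute a generated script and return its validation metric."""
    raise NotImplementedError

def initial_solution(task_description: str, dataset_path: str, top_k: int = 3) -> str:
    # A_retriever: web search for task-relevant, state-of-the-art models.
    model_cards = call_llm(
        f"Search the web and list effective models with example code for: {task_description}"
    ).split("\n---\n")

    # A_init: draft one simple training script per retrieved model, then score each.
    scored = []
    for card in model_cards:
        script = call_llm(f"Write a runnable Python training script using:\n{card}")
        scored.append((run_and_score(script, dataset_path), script))

    # A_merger: combine the best-performing scripts, typically as a simple average ensemble.
    best = [s for _, s in sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]]
    return call_llm(
        "Merge these scripts into one solution that averages their predictions:\n\n"
        + "\n\n".join(best)
    )
```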

Overview of MLE-STAR
source - https://arxiv.org/pdf/2506.15692

The heart of MLE-STAR's workflow is the Iterative Refinement of Code Blocks. In this phase, a nested loop iteratively refines the initial solution. In the outer loop, an ablation agent (A_abl) runs an ablation study to determine which code block matters most for performance, and an extractor agent (A_extractor) selects it for refinement. In the inner loop, a planning agent (A_planner) proposes strategies to improve the targeted block, which a coding agent (A_coder) then implements. The solution is updated only when a modification actually improves performance.
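
A pseudocode-style sketch of that nested loop, under the same illustrative assumptions as the previous snippet, might look like this:

```python
# Illustrative sketch of the outer (ablation) and inner (refinement) loops.
# `call_llm` and `run_and_score` are placeholders standing in for the LLM agents
# and the script-evaluation harness.

def call_llm(prompt: str) -> str: ...
def run_and_score(script: str, dataset_path: str) -> float: ...

def refine(solution: str, dataset_path: str, outer_steps: int = 4, inner_steps: int = 3) -> str:
    best_score = run_and_score(solution, dataset_path)
    for _ in range(outer_steps):
        # A_abl + A_extractor: find the code block whose variation moves the metric most.
        target_block = call_llm(
            "Run an ablation analysis of this script and return the single most "
            f"performance-critical code block:\n{solution}"
        )
        for _ in range(inner_steps):
            # A_planner proposes a strategy; A_coder applies it to only that block.
            plan = call_llm(f"Propose one concrete improvement for this block:\n{target_block}")
            candidate = call_llm(
                "Rewrite the script, changing only the target block according to the plan.\n"
                f"Script:\n{solution}\nBlock:\n{target_block}\nPlan:\n{plan}"
            )
            score = run_and_score(candidate, dataset_path)
            # Keep a modification only if it measurably improves validation performance.
            if score > best_score:
                best_score, solution = score, candidate
    return solution
```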

MLE-STAR iteratively proposes effective ensemble strategies
source - https://arxiv.org/pdf/2506.15692

Finally, MLE-STAR applies a Novel Ensemble Method, proposing and refining increasingly sophisticated strategies for combining the strong candidate solutions into a final, stronger ensemble model. Throughout the whole process, a suite of Robustness Modules, a debugging module (A_debugger), a data leakage checker (A_leakage), and a data usage checker (A_data), continuously validates the code, reinforcing reliability and correctness.
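
To give a flavor of how these checkers can slot in, the sketch below frames them as a simple gate applied to every candidate script. The prompts and the `call_llm` helper are illustrative; in the actual agent the modules also propose corrections rather than merely passing or failing code.

```python
# Illustrative sketch of framing the robustness modules as a gate on candidate scripts.
# The prompts and `call_llm` helper are placeholders; the real modules also repair code.

def call_llm(prompt: str) -> str: ...

def passes_robustness_checks(script: str, task_description: str) -> bool:
    checks = {
        "debugger": "Will this script run end-to-end without errors?",
        "leakage": "Does this script leak validation or test information into training?",
        "data_usage": f"Does this script use all data sources provided for: {task_description}?",
    }
    for name, question in checks.items():
        verdict = call_llm(f"{question}\n\nScript:\n{script}\n\nAnswer PASS or FAIL with a reason.")
        if verdict.strip().upper().startswith("FAIL"):
            return False  # in MLE-STAR, the corresponding module would then fix the issue
    return True
```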

Performance Evaluation with Other Models

In competitive machine learning, there is only one thing that counts: results. MLE-STAR's performance was tested on MLE-Bench-Lite, a benchmark that includes 22 Kaggle competitions from real-world domains, the ultimate proving ground for ML performance. The results were not merely positive; they were decisive.

Main results from MLE-bench Lite
source - https://arxiv.org/pdf/2506.15692

MLE-STAR won a medal in an impressive 63.6% of the competitions. Better still, 36% of its medals were gold, a level of performance that consistently exceeds that of expert human practitioners. This demonstrates an ability not merely to compete, but to win.

Model usage (%) on image classification competitions
source - https://arxiv.org/pdf/2506.15692

When compared against its competitors, MLE-STAR's design strengths stand out starkly. It left AIDE, an agent that relies on older internal knowledge and defaults to models such as ResNet, well behind, taking 37% of image classification medals to AIDE's 26%, thanks to its ability to reach for newer architectures such as EfficientNet. It also handily outperformed specialist agents such as DS-Agent (constrained by a manually curated case bank) and generalist agents such as gpt-4o and OpenHands, which achieved medal rates of only 6.1% and 12.1% respectively on the same benchmark. That performance gap is not just a number; it is evidence that a specialized, dynamic, and robust architecture is the key to state-of-the-art results.

The Specialist's Edge

MLE-STAR's superior performance proves a key design principle: the benefit of a specialist tool over a general-purpose one. While capable generalist agents such as OpenHands or models such as gpt-4o (employed with MLAB) can try to perform machine learning tasks, they are like a Swiss Army knife attempting surgery. They do not possess the specialist architecture necessary for the highly specific challenges of competitive machine learning.

This specialist advantage is built directly into its design. Its targeted code-block optimisation digs deeper and delivers more effective improvements than the broad-brush approaches of other MLE agents such as AIDE. Most importantly, its built-in robustness modules, including the data leakage checker, address machine-learning-specific failure modes that generalist developer agents are simply not designed to catch. This deliberate focus on MLE's distinctive pain points, combined with a flexible architecture that scales beyond the manually curated bounds of agents like DS-Agent, is exactly what produces such a large performance gap and underpins its competitive advantage.

How to Access and Use MLE-STAR

For those who want to see what MLE-STAR can do, it is open-sourced on GitHub and built with the Agent Development Kit (ADK). To use MLE-STAR, a user provides a description of the task and the datasets involved; the agent then takes over the laborious machine learning work and produces an executable Python solution script. Note that MLE-STAR is currently intended for research use only, and users are responsible for ensuring that any models or content retrieved by the agent do not violate the relevant licensing restrictions.

Limitations and Future Work

Currently, MLE-STAR's biggest limitation is its research-use-only status, which places responsibility on the user to comply with the licenses of any models or content the agent retrieves. Another possible limitation is that, because the underlying LLM is trained on public data, some generated solutions may not be entirely original; similar code may already have been posted publicly, for example on a Kaggle forum.

Looking ahead, MLE-STAR's design points to exciting future work. Because it relies on web search, its performance should improve naturally as newer state-of-the-art models become publicly available. One potential improvement is more direct human involvement: allowing users to supply descriptions of candidate models directly, so the agent can search for even newer models or better refinement strategies.

Conclusion

For developers, researchers, and companies, MLE-STAR points to a world where the cost of entry for building impactful AI solutions is greatly reduced, paving the way for a new generation of innovation across nearly every industry. The story of AI has always been one of steadily doing more, and with MLE-STAR we have taken a large and exciting step forward.


Sources:
Tech blog: https://research.google/blog/mle-star-a-state-of-the-art-machine-learning-engineering-agents/
Research paper: https://arxiv.org/pdf/2506.15692
GitHub Repo: https://github.com/google/adk-samples/tree/main/python/agents/machine-learning-engineering


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Wednesday, 30 July 2025

GLM-4.5: Unifying Reasoning, Coding, and Agentic Work

Presentational View

Introduction

Breakthroughs in agentic AI and coding models are leading to more advanced and autonomous systems. These models have evolved into proactive agents that can reason, plan, and perform complex, multi-step actions. But obstacles remain. One main challenge has been the fragmentation of capabilities: models tend to excel at reasoning, coding, or acting as an agent, but rarely all three at once. This has resulted in clumsy and inefficient setups that require juggling many specialist models.

A new model has been designed to solve this very issue by integrating reasoning, coding, and agentic functions into one complete system. By combining these fundamentals, it aims to satisfy the sophisticated needs of intelligent agent applications, ushering in a new age of AI that is more productive, powerful, and seamlessly integrated. This new model is known as GLM-4.5.

The Visionaries Behind the Model

The GLM-4.5 series is the creation of Zhipu AI, an artificial intelligence company that grew out of Tsinghua University's Computer Science Department and whose mission is to teach machines to think like humans. The underlying philosophy behind GLM-4.5 was to develop one comprehensive system that integrates reasoning, coding, and agentic capabilities, an ambitious aim taken up to meet the increasing sophistication of intelligent agent applications.

What is GLM-4.5?

GLM-4.5 is a series of cutting-edge, open-source AI models designed to serve as one system for reasoning, coding, and agentic work. It is built to handle the complex needs of contemporary intelligent agent applications by offering an extensive and cohesive set of skills.

Model Variants

The GLM-4.5 line consists of two foundation models, each designed for different use cases while sharing the same unified capabilities and hybrid thinking design.

  • GLM-4.5 (The Flagship): This behemoth has an impressive 355 billion total parameters and 32 billion active parameters. Its large 128k context length supports very long and rich interactions. For more efficient inference, an FP8 variant (GLM-4.5-FP8) is available. Its API cost is 60 cents per 1 million input tokens and $2.20 per 1 million output tokens.
  • GLM-4.5-Air (The Efficient Compact): This model is for users who value efficiency without sacrificing much on power. It has 106 billion total parameters with 12 billion active parameters and also has a 128k context length. There is also an FP8 variant (GLM-4.5-Air-FP8) for this model. The API cost for the Air model is very low at 20 cents per 1 million input tokens and $1.10 per 1 million output tokens, rendering it very cost-effective.

Key Features of GLM-4.5

GLM-4.5 is filled with cutting-edge features that set it apart from the rest.

  • Hybrid Thinking Modes: The two models each employ a dynamic hybrid reasoning model. They are able to alternate between a 'thinking' mode for sophisticated reasoning and tool employment, and a 'non-thinking' mode for fast, direct answers as per the complexity of the task.
  • Optimized for Agentic Tasks: GLM-4.5 is natively optimized as a foundation model for agentic tasks. It supports native function calling and recorded the highest average tool-calling success rate, 90.6%, when compared with the likes of Claude-4-Sonnet, Kimi K2, and Qwen3-Coder.

    Average Tool Calling Success Rate
    source - https://z.ai/blog/glm-4.5

  • Novel MoE Architecture: GLM-4.5 follows a novel Mixture-of-Experts (MoE) architecture. Unlike many MoE models, which scale width, GLM-4.5 goes deeper (more layers) while staying narrower (smaller hidden dimension and fewer routed experts). This design followed from the observation that deeper models show stronger reasoning ability.
  • Innovative Reinforcement Learning Infrastructure ('slime'): One of GLM-4.5's main technical strengths is its tailor-made, open-sourced Reinforcement Learning (RL) infrastructure called 'slime'. 'slime' is designed for extremely fast training and has a hybrid architecture that is flexible enough to accommodate both synchronous and asynchronous training. This is especially important for advanced agentic RL where data generation may become a bottleneck.

Capabilities and Use Cases of GLM-4.5

The integrated design of GLM-4.5 opens up a wide range of sophisticated uses.

  • End-to-End Full-Stack Development: The framework can automatically produce complete web applications, from frontend coding to backend deployment and database handling.
    Use Case: An e-commerce site could be built using GLM-4.5 to quickly prototype and deploy a full-fledged e-commerce site, with an easy-to-use interface, product database, and payment gateway, all from a single set of high-level specifications.
  • Sophisticated Artifact Creation: In addition to regular code, the model may create advanced, standalone artifacts.
    Use Case: A game designer might create the full code for an interactive mini-game such as Flappy Bird, or a physicist might develop a working physics simulation right inside the development platform.
  • Sophisticated Frontend and Visual Design: GLM-4.5 is also great at designing beautifully crafted frontend interfaces in different forms.
    Use Case: A UI/UX designer may have the model create complex SVG graphics, i.e., a detailed drawing of a butterfly, or create a responsive and visually good-looking web page utilizing HTML and Python.
  • Agent-Augmented Content Creation: The model may utilize its agentic tools to create rich content.
    Use Case: A business analyst may assign GLM-4.5 to develop a complete slide deck for a market analysis report. The model would employ its web search feature to collect current market information and then create the presentation, including charts and editable HTML code.

Training and architecture

GLM-4.5's strong performance rests on its architectural design. Its strategy of prioritizing depth over width gives the model an advantage in reasoning ability. Its MoE layers use loss-free balance routing and sigmoid gates. The self-attention component uses Grouped-Query Attention with partial RoPE, with 96 attention heads on a 5120 hidden dimension, a configuration that yields sizeable gains on reasoning benchmarks. QK-Norm stabilizes the attention logits, and the Muon optimizer speeds up convergence. For faster inference, a Multi-Token Prediction (MTP) layer is added to enable speculative decoding.
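
As a concrete illustration of one of these ingredients, the snippet below sketches what sigmoid-gated top-k expert routing can look like in PyTorch. The expert count, top-k value, and tensor shapes are illustrative placeholders, not GLM-4.5's actual configuration.

```python
# Minimal sketch of sigmoid-gated top-k MoE routing (illustrative sizes, not GLM-4.5's real config).
import torch
import torch.nn as nn

class SigmoidTopKRouter(nn.Module):
    def __init__(self, hidden_dim: int = 5120, n_experts: int = 64, top_k: int = 8):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # Sigmoid gates score each expert independently, unlike a softmax over experts.
        scores = torch.sigmoid(self.gate(x))                   # [tokens, n_experts]
        weights, idx = torch.topk(scores, self.top_k, dim=-1)  # route each token to its top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)  # normalize the selected gate values
        return weights, idx

router = SigmoidTopKRouter()
w, i = router(torch.randn(4, 5120))  # 4 tokens -> per-token expert weights and indices
```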

Slime - RL Infrastructure
source - https://z.ai/blog/glm-4.5

In addition to its architecture, GLM-4.5's capabilities are the direct result of an enormous, state-of-the-art multi-stage training process. Pre-training consumed an astounding 22 trillion tokens, split into a 15-trillion-token general corpus followed by a 7-trillion-token corpus focused on code and reasoning. This base was then refined with a decisive post-training phase built around reinforcement learning (RL) to develop elite agentic and reasoning capabilities. For reasoning, the model was trained in a single RL stage at the full context length, using a difficulty-based curriculum. For agentic work, it was trained on verifiable domains such as software engineering and information-seeking Q&A, where execution-based feedback was used to guarantee practical value. All of this is powered by slime, an innovative RL infrastructure with a decoupled, agent-first design and mixed-precision rollouts (FP8 for data generation to accelerate training, BF16 for the training pass to ensure stability) that addresses common training bottlenecks.

Performance Evaluation

Thoroughly tested on 12 industry benchmarks, GLM-4.5 achieved an outstanding aggregate score of 63.2, ranking 3rd among all proprietary and open-source models. Its lighter sibling, GLM-4.5-Air, also scored a high 59.8, offering a cost-to-performance ratio that makes high-end AI more affordable.

Overall performance on 12 benchmarks covering agentic, reasoning, and coding tasks
source - https://z.ai/blog/glm-4.5

The model's agentic capability is its defining characteristic, backed by a best-in-class 90.6% tool-calling success rate, a key statistic for dependable automation. On agentic benchmarks such as TAU-bench and BFCL v3, it consistently outperformed peers such as GPT-4. This strength carries into coding, where it not only recorded leading win rates over Kimi K2 (53.9%) and Qwen3-Coder (80.8%) on agentic coding tasks but also beat GPT-4 on real-world problems such as SWE-bench Verified.

Agentic coding in Real-World Development Scenarios
source - https://z.ai/blog/glm-4.5

This real-world power is founded on an elite-level of reasoning. GLM-4.5 exhibits state-of-the-art performance on challenging reasoning tests, matching the performance of top Google and Anthropic models on tough math and science problems such as AIME24 and MATH 500. This is evidence that the model's novel deep-network architecture has effectively translated to enhanced reasoning ability.

How to Access and Usage

GLM-4.5 is intended to be access-friendly. You can access it via the Z.ai API platform, which provides OpenAI-compatible interfaces, and the Z AI chatbot. For people who want local deployment, the open weights of the base and hybrid reasoning models, including the FP8 variants, are hosted on Hugging Face and ModelScope. The models integrate with mainstream inference frameworks such as vLLM and SGLang. Importantly, GLM-4.5 is open-source with a permissive MIT license for commercial use and secondary development, encouraging a thriving innovation ecosystem. Developers' main resource is the GitHub repository, which includes all the information needed for local deployment and integration.
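
For orientation, a call through the OpenAI-compatible interface might look roughly like the sketch below. The base URL, model identifier, and the thinking-mode toggle are assumptions to be verified against the official Z.ai documentation.

```python
# Hedged sketch of calling GLM-4.5 through an OpenAI-compatible endpoint.
# The base_url, model name, and the `thinking` field are assumptions; consult the Z.ai docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_ZAI_API_KEY",
    base_url="https://api.z.ai/api/paas/v4/",  # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="glm-4.5",  # or "glm-4.5-air" for the compact variant
    messages=[{"role": "user", "content": "Draft a project plan for a small web app."}],
    extra_body={"thinking": {"type": "enabled"}},  # assumed toggle for the hybrid 'thinking' mode
)
print(response.choices[0].message.content)
```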

Limitations and Future Work

While GLM-4.5 is a major leap toward unified AI, the journey to human-level capability in all areas remains underway. The developers acknowledge that although the model goes a long way toward unifying diverse capabilities, full proficiency across all tasks is an aspiration for subsequent versions. In particular, there are 'further optimization opportunities' on agentic coding tasks relative to certain competitors. Moreover, while the reinforcement learning curriculum is effective, broadening it to more complex, real-world situations could make the model even more adaptable.

Conclusion

Open availability, strong performance, and competitive pricing together make GLM-4.5 a very tempting option for developers, researchers, and businesses alike. Teams adopting GLM-4.5 can use it to build the smarter, more capable, and more autonomous systems of tomorrow. With its release, it is reasonable to conclude that the future belongs not only to AI at massive scale, but to intelligent, integrated, accessible design.


Source:
Tech blog: https://z.ai/blog/glm-4.5
GitHub Repo: https://github.com/zai-org/GLM-4.5
Model collections: https://huggingface.co/collections/zai-org/glm-45-687c621d34bda8c9e4bf503b
Base Model Weights: https://huggingface.co/zai-org/GLM-4.5
Air Model Weight: https://huggingface.co/zai-org/GLM-4.5-Air


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Friday, 25 July 2025

Qwen3-Coder: Open-Source AI for Deep Codebase Understanding

Presentational View

Introduction

The world of agentic coding is transforming rapidly, moving from basic code completion to sophisticated AI agents that understand entire codebases and can autonomously complete complex software engineering tasks. This paradigm promises to change how developers work, but realizing it requires far larger context windows and reliable multi-step reasoning. Qwen3-Coder understands this dilemma and is poised to tackle it head-on. More than a supportive co-pilot, Qwen3-Coder is an independent agent at the cutting edge of code understanding and tooling integration, pointing toward a future of more productive, higher-quality software development.

Development and Contributors

Qwen3-Coder is the flagship agentic model developed by the Qwen team, born of an effort to create an AI that behaves like an excellent software architect. The goal was not simply a better assistant, but a fundamental transformation in how developers leverage AI to manage and execute complex projects.

What is Qwen3-Coder?

Qwen3-Coder is an incredibly sophisticated, open-source AI model for agentic software development. It is meant to act as an advanced AI agent that can autonomously parse, plan, and execute complex coding projects by effectively comprehending the full context of a software project. 

Key Features of Qwen3-Coder

Qwen3-Coder is full of an arsenal of distinct features that make it diverge from its ancestors and contemporaries:

  • Effective Mixture-of-Experts (MoE) Architecture: Employs an enormous 480B-parameter MoE architecture that activates only 35B parameters per query, balancing enormous capability with computational efficiency.
  • Massive Long-Context Window: Natively supports a 256K token window, extendable to 1M tokens, providing a deep, repository-wide code understanding.
  • Streamlined Non-thinking Mode: Outputs straight, clean output without overt reasoning blocks so it can be used directly within automated scripts and toolchains.
  • Specialized Tooling & Open Integration: Augmented by the Qwen Code CLI and open to the OpenAI SDK, providing both a specialized feel and wide compatibility with community tools.

Capabilities/Uses of Qwen3-Coder

The unique characteristics of Qwen3-Coder translate into a variety of compelling capabilities and practical uses in real life that may change the way you develop software:

  • Self-Driving Large-Scale Codebase Optimization and Refactoring: Imagine needing to migrate a large project to a new framework. Qwen3-Coder is capable of planning and executing such complex, large-scale refactoring tasks across an entire codebase, improving performance and making smarter edits across thousands of interrelated files.
  • Smart Software Bug Repair and Vulnerability Repair: The model can perform sophisticated self-service debugging. It can identify, understand and repair complex bugs or security vulnerabilities that require deep contextual understanding and an iterative approach to resolving problems in an entire system.
  • Automated Pull Requests and Sophisticated Rebase Management: Qwen3-Coder can take a large amount of overhead off the developer by automating routine repository operations. It can also autonomously review, resolve conflicts in, and merge even very complex or large pull requests.
  • Fast and Autonomous Feature Prototyping: The system can create functioning code for new features or proof-of-concept applications with little human effort. For example, it has shown it can rapidly create simple SaaS landing pages with animations.
  • Improved Code Documentation and Complete Test Suite Generation: Qwen3-Coder is able to automate the frequently time-consuming process of documentation and testing for large codebases. This involves producing complete, precise documentation and coming up with strong, context-sensitive test suites that handle inferred edge cases.

How Does Qwen3-Coder Work

From an engineering perspective, Qwen3-Coder's strength is derived from a highly optimized architecture and disciplined training schedule. Its base is a Mixture-of-Experts (MoE) model, a deliberate decision that permits enormous parameter scale (480B) while being computationally practical by only engaging a portion of experts (35B) for each input. The model's remarkable competence was nurtured by pre-training on an enormous 7.5 trillion token dataset with a high 70% code ratio. Of paramount importance, the quality of training data was progressively developed through iteratively using its predecessor, Qwen2.5-Coder, to reword and clean noisy data—a high-level feedback loop that reflects a focus on data-driven AI development. That foundation is also supported by state-of-the-art reinforcement learning. The model makes use of scaled Code RL on 'hard-to-solve, easy-to-verify' problems as well as a new Long-Horizon RL (Agent RL) to learn challenging, multi-step engineering tasks. This agentic finetuning is underpinned by an enormous parallel system operating 20,000 independent environments, allowing the model to learn from a huge and varied array of feedback signals and thereby sharpening its autonomous problem-solving abilities.
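
To make the 'hard-to-solve, easy-to-verify' idea concrete, the sketch below shows a minimal execution-based reward: a candidate program earns reward only if its unit tests pass. This is not Qwen's training code; it merely illustrates the verify-by-running principle that such Code RL relies on.

```python
# Conceptual sketch of an execution-based reward for 'hard-to-solve, easy-to-verify' code RL.
# This is not Qwen's training pipeline; it only illustrates the verify-by-running idea.
import subprocess
import tempfile

def execution_reward(candidate_code: str, test_code: str, timeout_s: int = 10) -> float:
    """Return 1.0 if the candidate passes its unit tests, else 0.0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

# During RL, many such rewards would be gathered in parallel across sandboxed
# environments (the post mentions a system running 20,000 of them) and used to
# update the policy with a standard policy-gradient objective.
```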

Performance Evaluation with Other Models

When assessed on benchmarks, Qwen3-Coder sits among the best models, open-source and proprietary alike. For real-world performance with limited tuning, look at SWE-Bench Verified: Qwen3-Coder achieved state-of-the-art results for an open model without any test-time scaling. That distinction matters, because it reflects out-of-the-box performance on agentic coding, tool use, and the processing of feedback, all of which feed directly into the quality of completed tasks.

Qwen3-Coder : Performance Benchmarks
source - https://github.com/QwenLM/Qwen3-Coder

SWE-Bench Verified is a particularly demanding benchmark that measures success on real software engineering tasks requiring multi-turn planning, tool use, and the processing of feedback. Qwen3-Coder's scores are not only strong in their own right; they are directly competitive with, and in places surpass, results reported for proprietary models such as Claude Sonnet 4 and GPT-4.1 on reference evaluations (e.g., Spider, Aider). For developers or teams considering an AI coding agent, the results speak volumes: Qwen3-Coder is not simply an open-source alternative; it is seriously challenging the top tier in agentic coding, tool use, and browser use, and should not be overlooked.

Key differentiators with Kimi K2

While Qwen3-Coder shares some commonalities with other powerful models like Kimi K2, several design differences give it greater reach. Qwen3-Coder's native 256K token window, extendable to 1M tokens, far exceeds Kimi K2's 128K context length, which matters whenever a task spans a very large codebase. A second key distinction is the internal data-improvement loop: Qwen3-Coder leveraged its predecessor, Qwen2.5-Coder, to rewrite and clean its training data, a significant mechanism for raising data quality. From a practical development standpoint, Qwen3-Coder's streamlined non-thinking output, OpenAI SDK compatibility, and purpose-built Qwen Code CLI add up to a usability story that goes beyond Kimi K2's emphasis on multi-tool orchestration alone.

How to Access and Use Qwen3-Coder

Developers can easily access Qwen3-Coder and responsibly integrate it into systems and applications. Because the model is open-source and commercially usable, its weights can be downloaded for local use, and installation instructions are centralized in the official Qwen3-Coder GitHub repository. For most integrations, the easiest and fastest path is its support for the OpenAI SDK: this is a familiar standard, so the barrier to entry is low for anyone comfortable in their application's language.
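
As a rough illustration, a locally served copy of the model (for example behind an OpenAI-compatible inference server such as vLLM) could be called as below. The endpoint URL and model identifier are assumptions for illustration; check the repository for the exact names.

```python
# Hedged sketch: because Qwen3-Coder supports the OpenAI SDK protocol, a locally served
# copy behind an OpenAI-compatible server can be called like this.
# The endpoint URL and model id are assumptions; verify them against the official repo.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-480B-A35B-Instruct",  # assumed repo id; check Hugging Face
    messages=[
        {"role": "system", "content": "You are an autonomous coding agent."},
        {"role": "user", "content": "Refactor this function and add unit tests: ..."},
    ],
)
print(resp.choices[0].message.content)
```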

Qwen Code CLI
source - https://github.com/QwenLM/qwen-code

To fully realize the agentic potential, you will want the Qwen Code CLI, an open-source command-line interface built on Node.js (v20+) and designed specifically for the model. The CLI has a built-in parser and function-calling format tuned to the code model, giving you a more defined, specialized workflow for multi-step logical coding tasks.

Limitations and Future Work

Though Qwen3-Coder is an impressive tool, developers should note some practical considerations. As new technology, the supporting Qwen Code CLI may have temporary stability problems. On the hardware side, although long-context functionality is a flagship feature, some configurations may hit out-of-memory situations and need a reduced context length to run smoothly. In addition, in the aggressively competitive market of agentic models, benchmarks indicate that on certain tasks, such as SWE-bench Verified, competitors like Kimi K2-Instruct still hold an edge, pointing to areas for ongoing improvement.

The Qwen team has great ambitions for the future. First, the plan is to keep improving the coding agent so it can take on longer and more cumbersome software engineering tasks, letting developers achieve even more. The team also intends to release additional model sizes that strike a better balance between performance and cost-effective deployment. The most exciting long-term vision, however, is exploring the agent's capacity for self-improvement. A tool that evolves, learning to perform ever more complicated tasks on its own, hints at a truly exciting future for agentic AI.

Conclusion

Qwen3-Coder is a watershed moment. As open-source software, it gives developers and organizations access to state-of-the-art agentic AI for building better software more quickly. The coding of tomorrow is not just about writing lines, but about creating intelligent systems, and Qwen3-Coder is one of the architects of that future.


Source:
Blog: https://qwenlm.github.io/blog/qwen3-coder/
Qwen Code CLI: https://github.com/QwenLM/qwen-code
GitHub Repo: https://github.com/QwenLM/Qwen3-Coder
Model Weights: https://huggingface.co/collections/Qwen/qwen3-coder-687fc861e53c939e52d52d10 


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Tuesday, 22 July 2025

MediPhi: Microsoft’s Specialized SLMs for Low-Resource Clinical AI

Presentational View

Introduction

Specialized fields like healthcare present a classic paradox: building ever-larger AI models is becoming increasingly impractical exactly where accuracy and efficiency matter most. Huge Large Language Models (LLMs) offer impressive capabilities but are operationally impractical for real-time clinical use because of unsustainably high computational cost and latency. Top-tier clinical data is limited and sensitive, and useful real-world AI solutions often depend on scarce professional context. MediPhi, a family of Small Language Models (SLMs), is designed to fill this gap. Rather than assuming that bigger is always better, MediPhi represents a strategic pivot toward cost-effective, highly specific, computationally efficient models. It is a value-driven approach aimed not only at research projects but at real-world AI deployment in clinical Natural Language Processing (NLP), where it adds tangible value.

This forward-thinking approach reflects a clear vision to cut through the bottlenecks that continue to stall AI in medicine. The relatively high barriers to entry for LLMs, given their costs and data requirements, have delayed the widespread adoption of clinical AI in direct patient care. With its modular architecture of compact models, MediPhi aims to democratize access to clinical AI. This approach enhances reproducibility, creates the conditions for sustained improvement, and may ultimately enable advanced NLP tools to be incorporated directly into clinical workflows, fundamentally redefining productivity and patient care.

Development and Contributors

The collection of MediPhi models was created by Microsoft Healthcare & Life Sciences in collaboration with partners from Microsoft Research Montréal and IKIM, University Hospital Essen. The primary inspiration for creating MediPhi was to present a new framework for converting SLMs into high-performance clinical models by immediately addressing the cost and latency constraints of LLMs that limit their application in patient care environments.

What is MediPhi?

MediPhi consists of seven specialized Small Language Models (SLMs) with 3.8 billion parameters each based on the Phi3.5-mini-instruct base model. The models are carefully fine-tuned for clinical NLP applications and are mainly designed for research purposes in English. Microsoft highly recommends that all model outputs should be validated by a medical professional since the models are solely for research purposes and should be thoroughly reviewed for accuracy, safety, and fairness before any deployment.

MediPhi's Variants

The MediPhi Model Family is a general-purpose toolkit with multiple different variants, each with its own specific relevance:

  • MediPhi-Instruct: The ultimate, premier model, trained to carry out full-clinical NLP jobs utilizing the large-scale MediFlow synthetic instruction dataset.
  • MediPhi: The single generalist expert, produced by combining the five task-specific experts with the novel BreadCrumbs method, strikes a good balance of strong average performance across the CLUE+ benchmark.
  • Five Domain-Specific MediPhi Experts: These models constitute the base, each being fine-tuned on separate medical corpora before being combined back with the base model to retain general capabilities:

  1. MediPhi-PubMed: Fine-tuned on 48 billion tokens of PubMed scientific documents and abstracts, this expert is best suited to take advantage of biomedical literature.
  2. MediPhi-Clinical: Trained on open-source clinical documents (patient summaries, doctor-patient conversations), this variant is designed for tasks associated with immediate patient care.
  3. MediPhi-MedWiki: Trained on Medical Wikipedia, this specialist had the highest average improvement across individual specialists (3.2%), presumably because it learned from wide educational content.
  4. MediPhi-Guidelines: Trained on guidelines from established health organisations, prioritising structured, authoritative knowledge.
  5. MediPhi-MedCode: Trained on medical coding corpora, this specialist achieved a staggering 44% relative improvement in ICD-10 coding compared to its base model, even surpassing GPT-4-0125 by 14% on this particular task.

Key Features of MediPhi

MediPhi possesses a number of central characteristics that identify it as a unique value proposition:

  • MediPhi demonstrates a modular and streamlined structure. Each SLM in the collection can be adapted as a high-performance clinical model with minimal computational cost compared to LLMs.
  • MediPhi retains general capabilities. By using advanced merging techniques like SLERP, MediPhi mitigates 'catastrophic forgetting', so each merged model keeps the base Phi3.5 skills such as instruction following and long-context handling.
  • MediPhi has increased safety and groundedness. MediPhi-Instruct retains the safety protocols in its base model and is even more grounded, effectively refusing to respond to potentially harmful queries from both the clinician and patient vantage points.
  • MediPhi's license is commercially permissive. As the first high-performance SLM collection with an MIT license in the medical domain, MediPhi provides additional pathways for research and commercial use cases.

Capabilities and Use Cases of MediPhi

MediPhi's distinctive configuration presents exciting real-world use cases:

  • Automated Medical Coding: With the accuracy of its MediPhi-MedCode specialist, MediPhi is ideal for automating and improving medical documentation and billing.
  • Custom Clinical NLP Solutions: Developers can quickly build custom summarisation, named entity recognition (NER), and relation extraction tools using the modular structure and commercial licence.
  • Specialized Knowledge Extraction: MediPhi-PubMed and MediPhi-Guidelines knowledge specialists provide an extremely accurate, context-aware information extraction mechanism from complex medical texts.
  • A New Benchmark for Robust Medical AI: Performing nearly as well as models twice its size, MediPhi is an excellent basis for research on robust and efficient medical AI.

How MediPhi Achieves its Specialisation

MediPhi's development represents a real-world example of model specialisation achieved via a sophisticated two-step process. 

Method for clinical SLMs as illustration includes two steps
source - https://arxiv.org/pdf/2505.10717

The process starts with Continual Pre-Training, in which domain-specific knowledge is instilled into the base Phi3.5-mini-instruct model. This knowledge was incorporated via Domain Adaptation Pre-Training (DAPT) on large text corpora (e.g., PubMed), and via a newer approach called Pre-Instruction Tuning (PIT), in which the model first learns from instruction data for related tasks (e.g., summarisation, NER) and then trains on both the instruction data and the pre-training data in a combined stage. After this first phase comes an important Model Merging step. Each of the five fine-tuned experts is merged back with the base model separately using Spherical Linear Interpolation (SLERP) to preserve general capabilities, and the five experts are then merged together into the general MediPhi SLM using the BreadCrumbs technique, which an evolutionary search identified as the best merging operator.
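
To make the merging step concrete, the snippet below sketches spherical linear interpolation between a base-model weight tensor and an expert's fine-tuned counterpart. It is a minimal illustration of the SLERP operation only; the real pipeline applies it per parameter with tuned interpolation factors.

```python
# Minimal sketch of SLERP between a base-model tensor and an expert's fine-tuned tensor.
# Real merges apply this per parameter with tuned factors; this only shows the math.
import torch

def slerp(w_base: torch.Tensor, w_expert: torch.Tensor, t: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    a, b = w_base.flatten().float(), w_expert.flatten().float()
    a_n, b_n = a / (a.norm() + eps), b / (b.norm() + eps)
    omega = torch.acos(torch.clamp(torch.dot(a_n, b_n), -1.0, 1.0))  # angle between weight vectors
    if omega.abs() < 1e-6:
        merged = (1 - t) * a + t * b  # nearly parallel: fall back to linear interpolation
    else:
        merged = (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)
    return merged.reshape(w_base.shape).to(w_base.dtype)
```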

The final stage of development is Clinical Alignment, which specializes the model for clinical tasks. Here the developers used MediFlow, a unique large-scale synthetic dataset of 2.5 million high-quality instruction examples spanning 14 clinical task categories. The unified MediPhi model underwent Supervised Fine-Tuning (SFT) on MediFlow, followed by Direct Preference Optimization (DPO) on a smaller curated dataset. Although this rigorous, multi-stage process was computationally expensive, taking approximately 12,000 GPU hours, it is what gives MediPhi its highly specialized and robust performance.

Performance Evaluation

The assessment on the extended CLUE+ benchmark highlights both the efficiency and power of MediPhi. 

Performances on CLUE+ of other medical LLMs compared with MediPhi models
source - https://arxiv.org/pdf/2505.10717

The flagship model, MediPhi-Instruct (3.8B) achieved an average accuracy of 43.4% - a significant 18.9% relative increase from the base model. This shows the power of the specialized training and alignment process, with considerable enhancements to tasks such as identifying social determinants of health and question-answering on radiology reports. The individual experts performed well too; the MedCode model outperformed the substantially larger GPT-4-0125 model in ICD-10 coding, and the MedWiki expert showed the most consistent increases across the datasets.

Performances on new CLUE+ subset of datasets for MediPhi as well as other medical LLMs
source - https://arxiv.org/pdf/2505.10717

From a competitive standpoint, MediPhi punches well above its weight. Although MediPhi-Instruct is less than half the size of comparable models such as Meta-Llama-3-8B-Instruct (8B), it delivers nearly equal performance (less than a 1% difference) on the CLUE+ benchmark and outperforms all LLaMA3 models on 4 datasets (it leads LLaMA3 by 29.2% on ICD-10-CM coding). Compared with other medical models such as Llama3-Med42-8B, MediPhi delivers a larger overall uplift over its base model, with improvements on 9-11 datasets versus 5 for Med42. The evidence stacks up: MediPhi strikes an excellent balance of size, efficiency, and quality, setting a new reference point for what specialised SLMs can achieve.

MediPhi vs. MedGemma

The comparison below highlights the different strategies Microsoft and Google have taken in medical AI with MediPhi and MedGemma, respectively. MediPhi is, in essence, a set of precision-crafted 'surgical instruments' for text-only clinical NLP tasks. MedGemma, by contrast, is a multimodal diagnostic machine that handles both text and medical images. The choice is ultimately strategic: MediPhi offers surgical precision for text-based tasks, while MedGemma offers a broader base for multimodal applications.

MediPhi vs. MedGemma
data driven comparison

How to Access and Use MediPhi

Getting started with MediPhi is straightforward for developers and researchers. The full model set, including the lead model MediPhi-Instruct and all of its domain-specific counterparts, is published on Hugging Face and can be loaded locally through the transformers library with PyTorch. The models work best with prompts in a particular chat format (<|system|>, <|user|>, <|assistant|>). On the hardware side, the models default to Flash Attention, which requires recent NVIDIA GPUs (A100/H100), but users on older GPUs can set attn_implementation="eager". Importantly, the whole collection is released under a commercially permissive MIT license, a key step that encourages both academic research and the creation of commercial clinical tools.
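
A minimal loading sketch, assuming the repository id and chat format described above, might look like this; verify the exact model id and prompt template on the Hugging Face model card.

```python
# Hedged sketch of loading a MediPhi variant with transformers.
# The repo id "microsoft/MediPhi-Instruct" is assumed from the collection naming; check the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/MediPhi-Instruct"  # assumed id
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # use "flash_attention_2" on A100/H100-class GPUs
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a careful clinical NLP assistant. Research use only."},
    {"role": "user", "content": "Extract the diagnoses mentioned in: 'Patient presents with type 2 diabetes mellitus.'"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```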

Limitations

Despite these exciting developments, it is important to articulate MediPhi's limitations. The model's development was computationally expensive, which has implications for reproducibility. The core MediFlow dataset has its own constraints: it is currently English-only and does not cover multi-turn conversational use. Although the merging process was designed to preserve general capabilities, the highly specialized training may still limit performance on non-medical tasks and multilingual support. Furthermore, like all language models, MediPhi can produce inaccurate information or reflect societal biases, and very long conversations may degrade performance. Microsoft emphasizes that these models are for research purposes only, and any use in a high-risk context such as direct medical advice must be properly evaluated and closely scrutinized by a qualified medical professional.

Conclusion

MediPhi is a great example of one potentially very important idea: the future of clinical AI may lie not in a single, ultimate model for the field, but in a broad assortment of specialized tools. A toolkit of many such models helps optimize efficiency and brings us closer to a future in which AI augments care and operations, not only in medicine but in many other adaptable, portable contexts.




Source 
Tech report: https://arxiv.org/pdf/2505.10717
MediPhi Model: https://huggingface.co/microsoft/MediPhi
MediPhi Variants: https://huggingface.co/collections/microsoft/mediphi-6835cb830830dfc6784d0a50



Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Monday, 14 July 2025

Kimi K2: Open-Weight Agentic RL for Autonomous Tool Use

Presentational View

Introduction

The development of AI has reached a fateful turning point. We have learned to train models that can converse with breathtaking facility, yet the real chokepoint to progress is no longer language; it is action. The goal has changed from creating AI that can explain a solution to creating AI that can act on it by itself. This transition to 'agentic' AI, however, is beset with difficulty. Building an intelligent model that can consistently coordinate a set of digital tools toward a goal, without requiring constant human involvement, has been the daunting task standing between us and the next generation of smart automation.

This is exactly the challenge that Kimi K2 meets head-on. It is not a gradual upgrade to an existing chatbot but a fundamental shift in AI design, built from the ground up to 'do things, not just talk.' Kimi K2 is a significant addition to the AI space, offering a real blueprint for the future of capable, action-enabled digital agents.

The Visionaries Behind The Model

Kimi K2 is the flagship product of Moonshot AI, a company demonstrating that ground-breaking AI transcends borders. Their vision is simple, but bold - move from passive AI chatbots to active, action-oriented agents. They do this with a training philosophy they call the 'Era of Experience', whereby a model learns and improves from self-initiated interactions as it tries to break the ceiling of human-curated data and discover new abilities. 

What is Kimi K2?

Kimi K2 is a state-of-the-art one trillion-parameter open-weight coding model that is specifically optimized for autonomous problem-solving. It uses a sparse Mixture-of-Experts (MoE) architecture which gives it enormous power and efficiency by only utilizing 32 billion parameters per query.

Model Variants

Moonshot AI has published two different variants to address different requirements:

  • Kimi-K2-Base: The base, pre-trained model. To researchers and developers, its value is in offering a robust, fully customizable sandbox to develop highly customized solutions.
  • Kimi-K2-Instruct: The post-trained, polished model. To product creators and engineers, its worth is in its availability for instant, drop-in integration, providing a 'reflex-grade' agentic experience fine-tuned for velocity and rapid decision-making.

Key Features of Kimi K2

Kimi K2's architecture is a masterclass in intentional engineering, distinguishing it from general-purpose models.

  • Massive Scale with Smart Sparsity: A one trillion parameter model that effectively engages only 32 billion parameters, coupling massive power with pragmatic computational expense.
  • Expansive 128,000-Token Context: Offers a large memory for comprehending and carrying out intricate, multi-step activities.
  • Direct Reinforcement Learning of Tool Use: A new training method that gets the model intrinsically capable of action, instead of merely thinking about it.
  • Autonomous Multi-Tool Orchestration: Its fundamental ability to plan and conduct elaborate workflows with many tools without step-by-step direction.

Capabilities and Use Cases of Kimi K2

The real gauge of Kimi K2 is what it is capable of doing. It transcends theory into functional, high-impact implementation.

  • Zero-Scripting Automation: A developer can simply give Kimi K2 their tools and tell it what to do. The model independently figures out the 'how' without requiring brittle, complicated workflow scripts.
  • End-to-End Data Analysis: For a given dataset, it can conduct statistical tests such as two-way ANOVA on its own, produce sophisticated visualizations such as violin plots, and assemble the results into a completely interactive webpage, all controlled through a sequence of self-contained IPython calls.
  • Complex Project Planning: In a stunning demonstration, it organized a full concert tour by executing 17 smooth tool calls over a range of services—ranging from search and calendar management to flight and restaurant booking—within a single, integrated session.
  • Autonomous Software Engineering: Kimi K2 works in a command-line environment directly. It may refactor a codebase systematically from Flask to Rust while executing performance benchmarks, or drive the development and debugging loop for a JavaScript project automatically.

How Does Kimi K2 Work? An Architectural Deep Dive

Kimi K2's architecture is an engineering feat in resolving issues of extraordinary scale. Its Mixture-of-Experts (MoE) design with 384 experts is well conceived: for each token, 8 specialized experts are engaged along with a 'shared' expert. This allows a great degree of task-specific knowledge while the shared expert provides global coherence and context. The architecture was a critical advantage, but training it on 15.5 trillion tokens raised a key stability issue: 'exploding attention logits', the bane of anyone training extremely large transformers. The team responded with a novel optimizer variant, MuonClip, which rescales the query and key projection weights to control the logits directly at the source. The result was fully stable training with no loss spikes, a considerable contribution to the field of large-model development.
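
The rescaling idea can be pictured roughly as below. The threshold value and the even square-root split of the shrink factor between the query and key projections are assumptions for illustration, not Moonshot's published recipe.

```python
# Conceptual sketch of the qk-clip idea attributed to MuonClip: after an optimizer step,
# if the largest observed attention logit exceeds a threshold, shrink the query/key
# projection weights at the source. Threshold and the even split are illustrative assumptions.
import torch

@torch.no_grad()
def qk_clip(w_q: torch.Tensor, w_k: torch.Tensor, max_logit_observed: float, tau: float = 100.0):
    if max_logit_observed > tau:
        gamma = tau / max_logit_observed  # total shrink factor needed on the logits
        scale = gamma ** 0.5              # split evenly between query and key weights
        w_q.mul_(scale)
        w_k.mul_(scale)
    return w_q, w_k
```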

Beyond the deliberate architecture, the model's agentic capability comes from a complex and purposeful training strategy. Its performance is not a happy accident of emergence but the result of a two-part approach. First, the model was trained on a vast synthetic distillation of thousands of real-world, tool-using episodes generated across hundreds of domains; this establishes the action-oriented baseline. Second, its general reinforcement learning system uses a unique 'self-judging' mechanism that lets the model act as its own critic, providing scalable, rubric-based feedback on its execution of tasks, even tasks for which there is no simple way to verify success. This connects directly to the 'Era of Experience' philosophy: the model works through context-dependent activities for itself and builds robust competence from its own self-generated interactions with the world.
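
One way to picture the self-judging mechanism is an LLM-graded rubric that converts an agent trajectory into a scalar reward, as in the hedged sketch below; the rubric dimensions, prompt, and `call_llm` helper are illustrative, not Moonshot's actual implementation.

```python
# Illustrative sketch of a rubric-based self-judging reward for tasks without a ground-truth
# checker. The rubric fields and `call_llm` helper are assumptions for illustration only.
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a call to the judging model (here, the policy model itself)."""
    raise NotImplementedError

RUBRIC = ["goal_completion", "tool_use_correctness", "efficiency", "faithfulness"]

def self_judge_reward(task: str, trajectory: str) -> float:
    prompt = (
        f"Task: {task}\nAgent trajectory:\n{trajectory}\n"
        f"Score each criterion from 0 to 1 and reply as JSON with keys: {', '.join(RUBRIC)}."
    )
    scores = json.loads(call_llm(prompt))                  # the model critiques its own rollout
    return sum(scores[k] for k in RUBRIC) / len(RUBRIC)    # scalar reward for the RL update
```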

Performance Evaluation: A New Open-Weight Standard

Benchmark results are the primary ground truth for any engineer or data scientist, and Kimi K2's performance sets a new standard for open-weight models. A 65.8% single-attempt accuracy on SWE-bench is remarkable. SWE-bench measures a model's ability to understand and fix real bugs and issues from GitHub repositories, giving a true sense of practical utility. Kimi K2's score emphatically outperforms other open models such as DeepSeek V3 (38.8% single-attempt) and comes within sight of the best proprietary models, a result that will reverberate across the open-weight space. In practice, it means developers can have much higher confidence in letting Kimi K2 autonomously manage more complex software maintenance.

Kimi K2 : Performance Evaluation
source - https://moonshotai.github.io/Kimi-K2/

Its 53.7% Pass@1 on LiveCodeBench v6, a benchmark that reflects real-world code generation, is just as impressive. Importantly, it outperformed well-known models such as Claude Opus 4 (47.4%), suggesting its superiority is not confined to a particular agentic niche but extends to broader practical coding. It also turned in impressive logic and STEM results, scoring a stunning 97.4% on MATH-500 and a leading 69.6% Avg@64 on AIME 2024. Taken together, these results show a solid, complete reasoning foundation, giving users confidence that Kimi K2 is not a single-purpose tool but is equipped for a variety of demanding technical tasks.

Kimi K2's Agentic Edge

At a high level, Kimi K2 and Qwen3 are rooted in the same modern architecture: both are sparse MoE models, both use reinforcement learning (RL), both support large 128,000-token context windows, and both are open-weight.

However, their differences reveal a core divergence in design philosophy rooted in their intended uses. Kimi K2 is a purpose-built specialist, trained extensively on agentic data, with reinforcement learning targeted squarely at the mechanics of tool use. Qwen3, by contrast, is a generalist, built from a massive multilingual general dataset, with reinforcement learning applied to reasoning broadly. Their user experiences differ accordingly: Qwen3 gives developers explicit control through its 'Hybrid Thinking Modes', while Kimi K2 is intended for a higher level of autonomy, performing complex tasks without step-by-step instruction.

This targeted specialization yields a decisive advantage for Kimi K2 in its home domain of autonomous agentic coding. On the SWE-bench Verified benchmark the results are stark: Kimi K2 clearly outperforms Qwen3. Ultimately, Kimi K2 performs distinctly better at its primary mission of operating as an autonomous agent that independently executes complex software engineering workflows.

How to Access and Use Kimi K2

Moonshot AI has made this tremendous technology astonishingly accessible to everyone from individual developers to big companies. As an open-weight model, its weights are easily accessible on Hugging Face, complete with local deployment instructions on its GitHub repository. It is controlled by a business-friendly Modified MIT License permitting full commercial utilization, with only an attribution requirement at enormous scale. This legal certainty, combined with revolutionary API prices of only $0.60 per million input tokens and $2.50 for output, radically reduces the cost barrier. This strategic blend of openness and low prices democratizes access to top-shelf agentic capacity, enabling mass-market feasibility for previously unaffordable large-scale AI applications.

Current Limitations

Moonshot AI has been deservedly open regarding Kimi K2's present limitations. The model at times is overly wordy on challenging reasoning problems or breaks down when definitions of tools are ambiguous. This would indicate prompt engineering and explicit tool schemas continue to be essential for optimizing performance. Its most obvious limitation is that capabilities regarding vision are not yet available. This reflects a conscious strategic decision to excel at text-based agentic intelligence initially, with multimodality being projected as a future extension.

Conclusion

Kimi K2 fulfills the vision of agentic AI by providing top performance in an open, potentially affordable format. This democratization of power is its biggest contribution. For developers, it is a new, strong building block with which to build applications that act. For businesses, it is a compelling path to affordable intelligent automation. Kimi K2 marks a definitive and inspiring roadmap to the next generation of digital collaborators—an AI that doesn't just understand our world but engages in it.


Source
Blog: https://moonshotai.github.io/Kimi-K2/
Base Variant: https://huggingface.co/moonshotai/Kimi-K2-Base
Instruct Variant : https://huggingface.co/moonshotai/Kimi-K2-Instruct
kimi-k2 Variants: https://huggingface.co/collections/moonshotai/kimi-k2-6871243b990f2af5ba60617d 
GitHub Repo: https://github.com/MoonshotAI/Kimi-K2



Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.
