
Friday, 30 January 2026

How Open-Source Kimi K2.5 Swarm Beats GPT-5.2 and Claude Opus 4.5

Presentational View

Introduction

With the dawn of the AI Agent era, a new class of models is changing how complex workflows get done. Consider an AI that does not process search queries sequentially but uses swarm parallel execution, spinning up a team of sub-agents to work on gargantuan research or data tasks at the same time. For developers, a model that can view and fix its own frontend display is a paradigm shift: no longer just code generation, but visual debugging, where the AI scrutinizes the UI pixel by pixel. With high-level strategic thinking, such a model is no longer just answering questions; it plans, reasons, and acts on long-term goals with a sophistication that challenges even the strongest proprietary models.

It shines in tightly coupled visual and text processing, especially in its ability to view and fix its own frontend output, going beyond simple code generation to pixel-perfect visual debugging. Whether choreographing large-scale simulations or computing the ROI of open-weights adoption, its capacity for complex, self-contained workflows makes it an attractive option for anyone looking for an AI that can do true multi-step problem-solving rather than simple text prediction. This new AI model is named ‘Kimi K2.5’.

What is Kimi K2.5?

Kimi K2.5 is a 1-trillion-parameter multimodal model created by Moonshot AI that serves as a self-directed agent. It is a Mixture-of-Experts (MoE) system that combines native visual intelligence with advanced reasoning, allowing it to perform tasks from vibe coding to academic research without the latency usually associated with massive dense models.

Key Features of Kimi K2.5

  • Swarming Agent Capability: In contrast to conventional sequential agent models, Kimi K2.5 can independently spawn up to 100 sub-agents and invoke up to 1,500 tools within a single operation. By executing in parallel, it breaks big jobs down and runs the pieces together, dramatically reducing time to completion.
  • Built-in Multimodal Architecture: Kimi K2.5 was trained as a natively multimodal model on mixed input data sources. Because visual and textual data are integrated during training, rather than processed separately and merged later as in most other systems, the model understands complex visual data and its relationship to text more deeply.
  • Kimi Code and Visual Debugging: Using its vision capability, Kimi K2.5 delivers code-to-visual functionality with very high accuracy. It can also visually inspect its rendered output, pixel by pixel, for layout shifts and errors, and then self-correct its own code (see the sketch after this list).
  • High-Level Strategic Planning: Through extensive deep thinking, Kimi K2.5 generates internal thought traces to identify and plan multi-step workflows, reason through the logic, and coordinate its sub-agents before executing any of the planned actions.
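
The render-inspect-repair loop behind the visual debugging feature can be approximated with an ordinary automation script. Below is a minimal, hypothetical sketch: it uses Playwright to render a page and capture a screenshot, then hands the image to a placeholder ask_kimi_to_review function standing in for a multimodal call to the model. The function name and the repair loop are illustrative assumptions, not Moonshot AI's actual implementation.

```python
# Hypothetical visual-debugging loop: render, screenshot, ask the model, patch, repeat.
# ask_kimi_to_review() is a placeholder for a multimodal API call to Kimi K2.5.
from pathlib import Path
from playwright.sync_api import sync_playwright

def ask_kimi_to_review(screenshot: bytes, html: str) -> str | None:
    """Placeholder: send the screenshot plus the current HTML to the model and
    return corrected HTML, or None if no layout issues are found."""
    raise NotImplementedError

def visual_debug(html_path: Path, max_rounds: int = 3) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        for _ in range(max_rounds):
            page.goto(html_path.resolve().as_uri())
            shot = page.screenshot(full_page=True)           # pixels the model will inspect
            fixed = ask_kimi_to_review(shot, html_path.read_text())
            if fixed is None:                                # model reports the layout is clean
                break
            html_path.write_text(fixed)                      # apply the model's patch and re-render
        browser.close()
```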

Use Cases of Kimi K2.5

  • Financial Modeling & Data Analytics: Acting as an AI Excel agent, the Kimi model can build complex formulas, pivot tables, and dynamic charts that stay in sync with the underlying data as it evolves, automating a large share of the heavy lifting in financial modelling.
  • Vibe Coding & Prototyping: Designers and developers can upload abstract mood-board images or screenshots and have the Kimi model generate a polished, interactive website layout together with the code that implements it, closing the gap between aesthetic intent and technical implementation.
  • Deep Research & Synthesis: Leveraging its swarm architecture, the Kimi model performs strongly on due diligence and competitive-intelligence research. It synthesizes findings from hundreds of diverse sources into a single structured report, and it does so much faster than a human analyst.
  • Professional Document Generation: Kimi goes beyond basic text generation, giving organizations the ability to create LaTeX-ready PDF documents and board-level or academically structured presentation slides that are ready to present as delivered.
  • Visual Software Engineering: Kimi offers engineering teams a closed-loop, automated full-stack workflow: writing and reviewing code against technical designs, then rendering and debugging the visual output.

How Does Kimi K2.5 Work?

Internally, Kimi K2.5 is based on a behemoth 1-trillion-parameter Mixture-of-Experts (MoE) model that sparsely activates only 32 billion parameters per token. This sparse backbone is combined with the MoonViT vision encoder for direct visual understanding and optimized with the MuonClip optimizer to maintain stability at this unprecedented scale.

Representative Trajectories demonstrating Kimi K2.5 Agent Swarm in action
source - https://www.kimi.com/blog/kimi-k2-5.html

The system's key architectural innovation is its shift from single-agent scaling to a self-directed Agent Swarm, powered by Parallel-Agent Reinforcement Learning (PARL). Rather than running a linear pipeline, a learnable orchestrator independently breaks gargantuan tasks into parallelizable parts, commanding as many as 100 sub-agents to perform up to 1,500 synchronized tool calls at once. This approach lets the model use a deep Thinking Mode for self-correction while significantly cutting end-to-end processing time compared with conventional linear models.
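
Conceptually, the orchestrator's fan-out/fan-in pattern resembles ordinary concurrent programming. The sketch below is an illustrative approximation only, not PARL itself: plan_subtasks and run_subagent are hypothetical placeholders for the learned decomposition and for individual sub-agent tool-calling sessions.

```python
# Illustrative fan-out/fan-in of sub-agents with asyncio; a stand-in for the
# learned PARL orchestrator described above, not Moonshot AI's implementation.
import asyncio

async def run_subagent(subtask: str) -> str:
    """Placeholder: one sub-agent working a slice of the task with its own tool calls."""
    await asyncio.sleep(0.1)            # simulate tool latency
    return f"findings for: {subtask}"

def plan_subtasks(task: str, n: int = 8) -> list[str]:
    """Placeholder for the orchestrator's learned task decomposition."""
    return [f"{task} :: part {i}" for i in range(n)]

async def orchestrate(task: str) -> str:
    subtasks = plan_subtasks(task)
    # Fan out: all sub-agents run concurrently instead of sequentially.
    results = await asyncio.gather(*(run_subagent(s) for s in subtasks))
    # Fan in: synthesize partial findings into one report (here, a simple join).
    return "\n".join(results)

print(asyncio.run(orchestrate("competitive landscape for open-weights LLMs")))
```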

Future Horizons: Enhancing the Swarm

An exciting potential enhancement of the Agent Swarm architecture is Federated Swarm Learning. Rather than operating only from centralized clusters, imagine the PARL orchestrator distributing sub-agents across secure local edge devices. Sensitive, localized data (such as proprietary codebases or patient records) could then be processed on-site by specialized edge agents while still benefiting from the swarm's combined reasoning ability. Such an advance could open the door to large-scale, compliant workflows in privacy-critical fields such as life sciences and law without sacrificing data sovereignty.

Another avenue for improvement is moving the multimodal backbone from static analysis to real-time streaming perception, which could redefine active monitoring. Could a model eventually watch live user interactions or feeds such as market ticker data, and ship UI hot-fixes or deploy financial strategies without the latency of uploading files? Paired with an Episodic Swarm Memory, in which the orchestrator retains successful task decompositions for each user across sessions, the platform would keep improving: the system becomes more effective with every completed project.

Performance Evaluation 

Notably, Kimi K2.5 has posted remarkably strong benchmark results, often beating recognized industry leaders. In the Humanity's Last Exam benchmark, which assesses highly advanced reasoning across a wide range of subjects, Kimi K2.5 achieved a remarkable 50.2%, exceeding the proprietary leaders GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro.

Software engineering Benchmarks
source - https://www.kimi.com/blog/kimi-k2-5.html

The model's standing in software engineering was underscored by a 76.8% score on SWE-bench Verified, a benchmark that rates a coding assistant's ability to resolve actual GitHub issues, placing it among the very best. On the BrowseComp benchmark, which tests an agent's ability to traverse the web and retrieve relevant information, Kimi K2.5 scored 78.4% using the Agent Swarm, underscoring the model's strength in dynamic information retrieval.

Agent Swarm Benchmark
source - https://www.kimi.com/blog/kimi-k2-5.html

Beyond these headline results, Kimi K2.5 has excelled on MMMU-Pro (multimodal understanding) and MathVision, performing on par with or better than state-of-the-art models on visual reasoning. Its ability to cut execution time by 4.5x on large-scale operations via parallel swarming reaffirms its design strengths.

How to Access and Use Kimi K2.5

Kimi K2.5 is easily accessible through various means. For direct use, it can be accessed through Kimi.com (Web & App) and the Moonshot Open Platform API. For developers and researchers who value data sovereignty or local development, the open-weights model can be downloaded from Hugging Face. The model is supported by inference engines such as vLLM and SGLang, and it is also quantizable (INT4) for use on consumer-grade hardware such as NVIDIA 4090s, although a cluster is recommended for optimal use.
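
For local experimentation with the open weights, a minimal offline-inference sketch with vLLM might look like the following. It assumes the Hugging Face repo id moonshotai/Kimi-K2.5 from the Sources section, that your vLLM build supports this architecture, and that you have enough GPU memory (or a quantized variant); treat it as a starting point rather than a verified recipe.

```python
# Minimal vLLM offline-inference sketch for the open-weights release.
# Assumes the moonshotai/Kimi-K2.5 checkpoint and sufficient GPU memory;
# adjust tensor_parallel_size (and consider a quantized variant) for your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="moonshotai/Kimi-K2.5",
    tensor_parallel_size=8,        # spread the MoE weights across 8 GPUs
    trust_remote_code=True,        # custom model code may be required
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(
    ["Plan a three-step research workflow for comparing open-weights LLM licenses."],
    params,
)
print(outputs[0].outputs[0].text)
```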

Limitations 

However, Kimi K2.5 also has limitations. Video understanding is still considered an experimental API, and high-resolution image inputs can be quite costly in terms of the number of tokens used. Furthermore, in certain setups, the Thinking Mode is temporarily incompatible with certain APIs, such as the $web_search API, and users have to switch modes depending on whether they require heavy reasoning or just browsing.

Conclusion

Kimi K2.5 is a remarkable open-source model that is quite capable and ahead of the curve in the emerging class of multimodal, agentic AI models. It democratizes access to a trillion-parameter MoE model and brings swarm intelligence to the open-weights community. This makes it possible for biotech researchers and policy planners alike to create systems that not only speak but act.


Sources:
Blog: https://www.kimi.com/blog/kimi-k2-5.html
Document Guide: https://platform.moonshot.ai/docs/guide/kimi-k2-5-quickstart
Model Weights: https://huggingface.co/moonshotai/Kimi-K2.5


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Sunday, 18 January 2026

MedGemma 1.5: Mastering 3D Medical Imaging and EHR Analysis

Presentational View

Introduction

Artificial Intelligence (AI) in healthcare is quickly evolving from automating simple clinical tasks toward supporting complex clinical decision making. Today's medical workflows require more than static, one-off checks to adequately evaluate a patient's complete status and pathology.

Historically, traditional models have struggled with the dynamic, long-term nature of care delivery. Assessing a patient's trajectory means combining historical context with likely future progression, which adds considerable complexity. MedGemma 1.5 offers a new way to approach this element of patient care through advanced interpretive capabilities for multimodal volumetric datasets. By integrating 3D data with clinical text, it gives medical professionals a broadly applicable data-integration tool for holistic, evidence-based approaches to patient care.

What is MedGemma 1.5?

MedGemma 1.5 is an open multimodal generative AI system built on the Gemma 3 architecture and targeted specifically at understanding medical text and image modalities. Unlike previous models of similar capacity, version 1.5 is designed specifically for high-dimensional data such as 3D scans and whole-slide images, at a compute-friendly 4B parameter size.

Key Features of MedGemma 1.5

  • High-Dimensional Imaging Support: The model goes beyond 2D imagery to interpret 3D volumetric data such as Computed Tomography and Magnetic Resonance Imaging scans, allowing depth and volume assessments that flat images cannot provide.
  • Whole-Slide Histopathology Image Integration: It allows simultaneous interpretation of several patches from whole-slide images, a fundamental advance for pathology: the model synthesizes information across a large tissue sample rather than viewing small, isolated segments.
  • Temporal and Spatial Reasoning: The model supports longitudinal assessment, comparing current and historical chest X-rays to track disease state over time. Its anatomical localization via bounding boxes lets it pinpoint specific findings within a radiograph with much higher detail and accuracy.
  • Structured Clinical Data Extraction: A key advantage is its ability to parse unstructured medical records and extract structured insights, such as values and units from lab reports, reflecting superior comprehension of Electronic Health Records.
  • Seamless Speech-to-Text Integration: It is designed to be natively compatible with MedASR, a specialized medical speech-to-text model, enabling advanced, explicitly reasoned workflows driven directly by voice medical dictation.

    MedASR Integration with MedGemma 1.5
    source - https://research.google/blog/next-generation-medical-image-interpretation-with-medgemma-1-5-and-medical-speech-to-text-with-medasr/

Use Cases for MedGemma 1.5

  • Volumetric 3D Radiology Analysis: This represents a major evolution from earlier API-based systems: multiple slices of CT or MRI data can be supplied at once to obtain immediate, automated radiological findings.
  • Longitudinal Disease Monitoring: Developers can build software that automatically compares a patient's current and past chest X-ray images, helping evaluate in near real time whether a disease is stable or progressing, a comparison doctors previously performed manually.
  • Real-Time Anatomical Localization: The model can produce bounding boxes around anatomical structures or pathological findings in real time during live review, which is very useful for highlighting regions of interest in radiographs as they are read.
  • Automated Pathology Triage: Pathologists can harness the power of the model to examine various patches of a whole slide image together to arrive at a diagnosis, thereby efficiently working on large histology image datasets.
  • Offline Clinical Decision Support: Since it has a very compute-efficient size of 4B, deployment on-device for offline triaging and record parsing is possible. This will be particularly useful in low-connectivity environments and many other scenarios where cloud processing simply is not possible because of stringent data privacy requirements.

How Does MedGemma 1.5 Work?

MedGemma 1.5 is built on the Gemma 3 decoder-only transformer architecture, adapted to the stringent multimodal requirements of the medical environment. The vision component is the SigLIP image encoder, which converts image inputs into features that the large language model (LLM) component uses for medical inference. To handle long patient histories and high-dimensional inputs, the model applies Grouped-Query Attention (GQA), allowing a context window of at least 128K tokens.
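
As a rough illustration of how a developer might call the open weights, here is a minimal sketch using the Hugging Face transformers image-text-to-text pipeline with the google/medgemma-1.5-4b-it checkpoint listed in the Sources. It assumes you have accepted the Health AI Developer Foundations terms for this gated checkpoint and that the pipeline supports it; the image URL and prompt are placeholders, and the output indexing should be verified against the model card before use.

```python
# Minimal sketch: prompting MedGemma 1.5 for a structured read of a chest X-ray.
# Assumes the google/medgemma-1.5-4b-it checkpoint (see Sources) works with the
# standard transformers image-text-to-text pipeline; verify against the model card.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/medgemma-1.5-4b-it",
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.org/chest_xray.png"},  # placeholder image
            {"type": "text", "text": "Describe the key findings and give bounding boxes for any abnormality."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=256)
# For chat-style inputs the pipeline returns the full conversation; the last
# message is the assistant's reply (confirm the exact format for this checkpoint).
print(out[0]["generated_text"][-1]["content"])
```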

MedGemma as a developer tool
source - https://research.google/blog/next-generation-medical-image-interpretation-with-medgemma-1-5-and-medical-speech-to-text-with-medasr/

This architecture is better understood in practice from the flow chart describing the intended use of MedGemma as a developer tool. The journey of this operational workflow begins with use case definition, where specific clinical objectives are identified, and then involves model selection from the MedGemma collection to match those objectives. It then advances through a crucial step of validation and adaptation to ensure the model fits the purpose in the intended clinical setting, culminating in scaling on Google Cloud by making use of Vertex AI and Model Garden to take the prototype to the production stage of the medical AI application.

Future Horizons: Dynamic & Federated AI

Looking ahead, the smooth integration of MedGemma 1.5 with MedASR heralds a direction toward real-time, multimodal feedback loops. Can we envision a system where a clinician's spoken dictation during image review generates not only a report but also an immediate, active signal for learning? This would allow such a model to dynamically adjust its bounding boxes or diagnostic summaries based on spoken corrections, turning what is currently static validation into a conversational fine-tuning process that continually refines clinical reasoning without manual curation of data.

Moreover, this model's architecture is compute-efficient and primed for deployment with federated learning. The model could update its weights on sensitive, high-dimensional volumetric data with training distributed across decentralized networks of hospitals, without that data ever leaving the secure local environment. This would not only solve some very critical issues in data sovereignty but also allow institution-specific adaptation at scale, creating a self-evolving ecosystem of medical AI that becomes more robust and representative demographically with every deployment.

Performance Evaluation

MedGemma 1.5's performance is a huge step forward in spatial understanding, especially anatomical localization. On the Chest ImaGenome dataset, a benchmark that measures how well an algorithm can locate a specific finding on a radiograph, version 1.5 of MedGemma reportedly reached an Intersection over Union (IoU) of 38%. That is an absolute jump of 35 percentage points over its predecessor's 3% IoU, a clear sign that the system has matured from a pure classification tool into one with strong spatial understanding.

Benchmark -  several forms of Medical Image Interpretation
source - https://research.google/blog/next-generation-medical-image-interpretation-with-medgemma-1-5-and-medical-speech-to-text-with-medasr/

Electronic Health Record comprehension shows similarly large gains. For extracting structured data from unstructured medical reports, the model reached a 78% retrieval macro F1 score, an 18-point improvement over the predecessor's 60% on that task. And on EHRQA, a question-answering benchmark for medical documents, MedGemma 1.5 reached 90% accuracy, up 22 points from the original model's 68%.

Benchmark - Medical Text Tasks
source - https://research.google/blog/next-generation-medical-image-interpretation-with-medgemma-1-5-and-medical-speech-to-text-with-medasr/

Further testing reaffirmed the model's technical soundness. Radiology classification improved by a healthy 14% on MRI evidence detection and a further 3% on CT accuracy. In medical reasoning, it scored 69% on the MedQA benchmark, beating the previous best of 64%. Most important of all, the generative fidelity of its histopathology reporting (estimated through ROUGE-L) rose dramatically from a near-zero 0.02 to 0.49.

How to Access and Use It?

The MedGemma GitHub repo is the central place for code, inference Jupyter notebooks, and fine-tuning tutorials. The model weights are hosted on Hugging Face and are also available through the Google Cloud Model Garden. Although the model can be used for research and commercial purposes, it must be used under the Health AI Developer Foundations terms of use. The license framework notably supports on-premises use on private infrastructure.

Limitations

It should be remembered that MedGemma 1.5 is a developer-level tool, not a medical device. Its outputs must be validated and verified by a professional, and it should not be used to rule out a medical condition or disease. Developers need to take particular care to confirm that the model generalizes well to non-public datasets of medical concepts. Future work will likely focus on further strengthening the multimodal front.

Conclusion

By combining compute efficiency, high-dimensional imaging, and temporal awareness into one efficient solution, MedGemma 1.5 gives developers and engineers working in health tech the keys to build care pathways that finally understand patient trajectories. For those developing next-generation health tech, this solution opens a gateway from fragmented data and complex records to clarity.


Sources:
Blog: https://research.google/blog/next-generation-medical-image-interpretation-with-medgemma-1-5-and-medical-speech-to-text-with-medasr/
Model Details: https://developers.google.com/health-ai-developer-foundations/medgemma/model-card
Developer Guide: https://developers.google.com/health-ai-developer-foundations/medgemma
Model Weight: https://huggingface.co/google/medgemma-1.5-4b-it
GitHub Repo: https://github.com/google-health/medgemma


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Friday, 9 January 2026

MiniMax-M2.1: Automating Office Workflows with Agentic Intelligence

Presentational View

Introduction

Today, modern AI systems are no longer assessed strictly in terms of accuracy or parameter count. Increasingly, what matters is how well a system functions inside a simulated software environment, interacts with a fractured toolchain, and maintains long-running autonomous processes. Modern models are increasingly built around new, intersecting capabilities: scaling to massive parallelism across isolated software environments, acting as a self-governing agent in everyday office software, carrying deep language-specific tooling knowledge, and producing functional software artifacts that also maintain a polished aesthetic.

MiniMax-M2.1 is designed to flourish amid such friction. Its architecture marks an evolution from conventional scripting intelligence toward models resilient to real-world conditions: varied languages, compiled ecosystems, long-horizon tasks, and visually intensive applications. Instead of optimizing for narrow applications, it is designed to perform well under concurrency, context pressure, and agent orchestration, all of which directly affect how AI is used in production development tools and technical creative work.

What is MiniMax-M2.1?

MiniMax-M2.1 is an advanced sparse MoE language model tailored specifically to the intricate tasks of software development. It is a major upgrade over the former version, M2, emphasizing execution over raw reasoning. The new version is built to handle high concurrency, multilingual coding, and long sequences of commands.

Key Features of MiniMax-M2.1

The value that MiniMax-M2.1 brings is based on its unique engineering skills that cover specific issues in software development.

  • Granular Linguistic Infrastructure: While other models are content to model code irrespective of language, M2.1 has the nuance to examine the plumbing of compiled languages. It integrates well with the fragmented ecosystems of non-Python build systems, supporting framework identification for Java (JUnit/TestNG), JavaScript (Jest/Mocha), and Go (testify), and handling complicated dependency resolution, such as semantic versioning in Cargo and compilation and linking managed by Maven.
  • Self-Governed Digital Employee Workflows: This model goes beyond the scope of the IDE. It can fully automate office tasks without human intervention, integrating communication tools with project-management tools, searching internal company servers for data, and even consulting teammates when it gets blocked.
  • Aesthetic-Driven Vibe Development: M2.1 brings a skill that many models, especially backend-heavy ones, tend to lack: taste. It shines as a Vibe Coding performer, delivering advanced creative apps. It can also engineer intricate 3D simulations with over 7,000 instances, accurately handling refractions and collisions, and it understands mobile subtleties such as fluid click-to-wake animations on iOS and gyroscope-driven animations on Android.
  • Resilient Context Management: In complex tasks the context tends to become cluttered. M2.1 is designed to resist IQ degradation even when historical thinking content is stripped out by agent scaffolds. Composite instruction-constraint support lets it blend system requests, user requests, and specification files (e.g., Agents.md) while staying on track with the logic.

Use Cases of MiniMax-M2.1

The capabilities of MiniMax-M2.1 translate into formidable use cases that solve systemic inefficiencies in enterprise and creative environments.

  • Supply Chain Security Remediation: If a vulnerability appears in a library of a compiled-language project, the model can trace the entire project structure to find the dependency. It automatically creates a fix, parses fragmented linker errors to debug the patch, and even optimizes the code for performance gains before deployment.
  • Global Release Validation: The model can act as an automated quality-assurance system ahead of major retail events, running huge test suites over massive codebases across thousands of isolated environments and checking complex dependency logic across fragmented toolchains in seconds instead of hours.
  • Legacy System Bridging: When an organization uses older software that does not have APIs, the model bridges it. It can automate glue work: processing equipment requests coming in via emails, accessing and searching legacy internal servers through emulated keystrokes for pricing, and automatically updating procurement spreadsheets.
  • Precision Digital Twins: Field technicians would be able to use mobile applications driven by M2.1 to visualize high-fidelity three-dimensional simulations of industrial machines. The model would depict them using thousands of instances and physics to enable users to simulate stress tests using native gestures on the mobile device’s screen.
  • Visual Compliance Auditing: In the role of an Agent-as-a-Verifier, the software actively tracks applications in banking or in the fintech industry. It points out even the slightest errors in the intricate UI components like trading widgets and sliders through the verification of both the aesthetic stability (vibe) and the underlying logic.

How Does MiniMax-M2.1 Work?

The sparse MoE architecture of MiniMax-M2.1 has 230 billion total parameters but activates only 10 billion per inference step. The goal of this sparsity is to give the model the deep thinking of a large model and the speed of a smaller one while keeping the conversational flow of a long-running agent, achieved through an aggressive sparsity ratio of 23:1.

The model's training is driven by Workflow Realism. Unlike previous models trained on pre-packaged snippets, M2.1 was trained on over 100,000 real-world scenarios obtained from GitHub: fully fledged projects with varied build systems, package managers, and CI/CD setups. Practicing in high-concurrency containerized sandboxes capable of spawning 5,000 environments in 10 seconds teaches the model to reason about its environment, interpreting unexpected tool results and its own thoughts inside <think>...</think> tags before acting.
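
The interleaved-thinking loop can be pictured as ordinary text plumbing: keep the reasoning inside the tags for the model's own context, but separate it from the action the scaffold actually executes. The regex and helper below are an illustrative sketch of that separation, not MiniMax's scaffold code.

```python
# Illustrative handling of interleaved <think>...</think> blocks in an agent loop:
# keep the reasoning in the transcript (removing it degrades performance, per the
# limitations section below) while executing only the non-thinking part of the turn.
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(response: str) -> tuple[str, str]:
    """Return (reasoning_text, actionable_text) from one model turn."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(response))
    actionable = THINK_RE.sub("", response).strip()
    return reasoning, actionable

turn = "<think>The failing test points at semver resolution in Cargo.toml.</think>run: cargo update -p serde"
reasoning, action = split_thinking(turn)
print(action)       # -> run: cargo update -p serde
# 'reasoning' stays in the conversation history rather than being discarded.
```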

The final architectural pillar is Context Resilience. MiniMax-M2.1 addresses a common weakness of production agents, whose performance degrades as reasoning traces are deleted by scaffold management. The model continues to show strong intelligence even when those traces are trimmed, and it stays on course with the constraints laid out in the Agents.md specification file.

Evaluation of Performance Relative to Other Models

In the SWE-bench Multilingual evaluation shown in the table below, MiniMax-M2.1 achieved a record 72.5, beating Claude Sonnet 4.5 at 68.0. This test matters because it validates the model's ability to resolve actual GitHub problems written in languages beyond Python, handling the heavy dependency and compilation requirements of production-level Java and Rust projects.

Software Engineering Benchmark
source - https://github.com/MiniMax-AI/MiniMax-M2.1

In the VIBE (Visual & Interactive Benchmark for Execution) challenge shown in the table below, M2.1's aggregate score was 88.6, an enormous improvement over the previous version's 67.5. Most significantly, on the VIBE-iOS subset it scored 88.0, more than doubling M2's 39.5. It clearly stands out in its ability to build fully functional applications with proper UI.

VIBE aggregate benchmark
source - https://github.com/MiniMax-AI/MiniMax-M2.1

In addition, M2.1 achieved a 49.4% pass rate on Multi-SWE-Bench, ranking first among open-source models, and improved its long-horizon tool use on Toolathlon from 16.7 to 43.5. On performance-oriented benchmarks such as SWE-Perf, it self-optimized code with an average performance improvement of 3.1%.

Access and Use of MiniMax-M2.1

MiniMax-M2.1 is released as an open-weight model under the Modified-MIT License, so commercial use is unrestricted and the model remains accessible without legal limitations. Check Hugging Face, ModelScope, or the GitHub repository for instructions and download links to the model weights for self-hosted deployment. For production environments, it is designed to work with high-throughput inference stacks such as vLLM, SGLang, and Transformers. Additionally, the MiniMax Open Platform provides an API for easy access to MiniMax-M2.1 as a managed service.
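
For self-hosted production use, one common pattern is to serve the open weights with vLLM's OpenAI-compatible server and talk to it with a standard client. The sketch below assumes the MiniMaxAI/MiniMax-M2.1 checkpoint from the Sources, a locally started server on the default port, and hardware able to hold the 230B-parameter weights; treat the exact serve flags as a starting point rather than a verified recipe.

```python
# Sketch: querying a locally hosted MiniMax-M2.1 through vLLM's OpenAI-compatible API.
# Start the server first (example command; adjust parallelism to your hardware):
#   vllm serve MiniMaxAI/MiniMax-M2.1 --tensor-parallel-size 8 --trust-remote-code
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local server, no real key needed

resp = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.1",   # must match the model name the server was started with
    messages=[
        {"role": "system", "content": "You are a build-system assistant."},
        {"role": "user", "content": "The Maven build fails with an unresolved dependency. Outline a fix plan."},
    ],
    temperature=0.3,
)
print(resp.choices[0].message.content)
```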

Limitations

Although a huge improvement over previous versions, MiniMax-M2.1 has limitations users need to understand. An important technical constraint is its reliance on Interleaved Thinking: performance and apparent intelligence may deteriorate if agent scaffolds or users strip out the reasoning content enclosed in <think>...</think> tags during multi-turn dialogue. Certain discrepancies also remain in the current API; reported issues include multimodal submissions not yet being implemented and some parameters, such as presence and rate settings, being unimplemented or ignored. In real-world use it can over-explore, for example repeatedly reading the same files or rerunning the same tests. Lastly, although very competitive, it still lags slightly behind top frontier models on some specialized programming skills.

Conclusion

MiniMax-M2.1 bridges the digital and the functional by understanding both graphic feel and the complexity of compiled languages. Its strength lies in execution realism: depth, awareness, agency, and interaction. In short, it was made for engineers who need an AI they can actually ship with.

Sources:
Blog: https://www.minimax.io/news/minimax-m21
Guide document: https://www.minimax.io/news/m21-multilingual-and-multi-task-coding-with-strong-general
Model Weight: https://huggingface.co/MiniMaxAI/MiniMax-M2.1
GitHub Repo: https://github.com/MiniMax-AI/MiniMax-M2.1


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Saturday, 27 December 2025

How GLM-4.7 Preserves Logic in Multi-Turn Engineering Workflows

Presentational View

Introduction

The true strength of AI today lies in its capacity to maintain deeper-level logic across multi-turn conversations, preserving the architectural choices made early in a project as requirements change. Such a stateful system is an incredibly powerful tool, much needed by technical managers working on long-term projects. Just as important is the ability to work not merely in disconnected outputs but across everything in between, from frontend and backend integration serving one overall aim to generating high-quality deliverables such as presentation slides and web UIs.

These capabilities are no longer on the horizon. GLM-4.7 exemplifies this changeover: a fully controllable model designed from the ground up to carry out self-contained tasks. It combines stateful thinking, the ability to keep the complete logic of an undertaking in working memory, with unmatched reliability.

What is GLM-4.7?

GLM-4.7 is an agentic Mixture-of-Experts (MoE) large language model created by Z.ai (Zhipu AI). It has been designed to go beyond answering questions and work toward task completion involving more than one step. Unlike many language models, GLM-4.7 was built as an execution-oriented AI system that can comprehend requirements, break down solutions, and integrate technologies.

Key Features of GLM-4.7

GLM-4.7 presents several industry-first features that set it apart from traditional LLMs:

  • Preserved Thinking: This is a major leap forward for the GLM line; it enables the model to preserve logic trees across multi-turn conversations without any extra work. Instead of having to rebuild the logic behind every message in a long-horizon process, the model remembers the reasoning it applied in earlier turns.
  • Vibe Coding (UI/UX Excellence): This feature goes beyond functional coding and aims for aesthetic stability. GLM-4.7 produces professional-grade graphics, lifting 16:9 PPT layout compatibility to a whopping 91% (compared with the predecessor's 52%). Aesthetic output is polished to the point that web pages and ready-to-use slides require very little manual adjustment.
  • Interleaved Thinking: Unlike models that respond impulsively, GLM-4.7 thinks before every response and tool call. This ensures high compliance with complex instructions and lowers the error rate when orchestrating multiple external tools.
  • Turn-level Thinking Control: This provides fine-grained control over per-turn latency and reasoning depth. You can turn thinking off for short queries to get faster responses, or turn it on for complex problem-solving within the same conversation.

Use Cases of GLM-4.7

  • Single-Objective Software Delivery Through to the End Game: GLM-4.7 is very helpful in environments where one targeted description must be translated into an entire, functional result. Because the model does more than generate individual bits of code, it can break down requirements, harmonize interfaces, and integrate both frontend and backend aspects.
  • Evolution of Long-Horizon Projects with Stable Constraints: For projects worked on over a number of iterations, GLM-4.7 can retain architectural constraints and design decisions defined in the initial phases as active context in subsequent phases. This is effective in projects where requirements are defined across several iterations.
  • High Reliability Tool and API Orchestration: GLM-4.7 can be used under conditions that include frequent interaction with several tools or APIs. It can work well with uncertain or incomplete tool results for multi-step workflows and reach a correct final state using minimal human involvement.
  • Agentic Development and Maintenance Workflows: It comes with native support for agent frameworks like Claude Code, Cline, or Roo Code, making it capable of performing high-frequency iterations, or repeat work, related to auto-refactor, test, or documentation routines.

How Does GLM-4.7 Work?

The GLM-4.7 model retains the general architecture, execution, and training setup of previous models in the GLM-4 series, specifically GLM-4.5 and GLM-4.6. The architecture is a Mixture-of-Experts with 355B total parameters and 32B active per token, designed to provide large reasoning capacity without dense activation. The model follows a hybrid reasoning scheme, with thinking, non-thinking, and interleaved-reasoning modes in which it plans before each response and before each tool call. These are made possible by architectural stabilizers such as attention-logit normalization through QK-Norm, along with the Muon optimizer for faster optimization during large-scale training. Pre-training spans roughly 15 trillion general-purpose tokens plus about 7 trillion reasoning- and code-focused tokens, the same pipeline that earlier GLM-4 models used to build large-context reasoning, tool use, and agent-like workflows.

Preserved Thinking
source - https://github.com/zai-org/GLM-4.5/

What is specifically unique to GLM-4.7 is how it extends these inherited capabilities into a more stateful, execution-focused system. The model includes Preserved Thinking, so internal reasoning blocks are preserved across multi-turn dialogues instead of being recalculated or discarded in favor of short-run logic. This is combined with turn-level thinking controls that let users adjust the amount of reasoning applied within a given session. Training is further supported by the slime reinforcement-learning framework, which separates agentic rollout computation from model training and keeps GPU utilization high while optimizing complex-task learning. For inference, a Multi-Token Prediction (MTP) layer supports speculative decoding, improving throughput while preserving reasoning integrity. Together these elements turn GLM-4.7 from a model that merely reasons into one that preserves and leverages its reasoning throughout its operational lifespan, which is its primary point of technical divergence from its predecessors.
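
As a rough mental model of Preserved Thinking at the scaffold level, the snippet below keeps each turn's reasoning block in the conversation history instead of stripping it before the next request. The message structure and the "reasoning" field name are illustrative assumptions about how a scaffold might store traces, not Z.ai's actual wire format.

```python
# Illustrative scaffold-side state for preserved thinking: reasoning blocks are
# retained in the transcript across turns rather than discarded. Field names
# (e.g. "reasoning") are assumptions for the sketch, not the Z.ai API schema.
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str
    content: str
    reasoning: str | None = None   # the model's internal thinking for this turn

@dataclass
class PreservedConversation:
    turns: list[Turn] = field(default_factory=list)

    def add_assistant(self, content: str, reasoning: str) -> None:
        # Keep the reasoning alongside the answer instead of dropping it.
        self.turns.append(Turn("assistant", content, reasoning))

    def context_for_next_request(self) -> list[dict]:
        # Re-send prior reasoning so earlier architectural decisions stay "live".
        return [
            {"role": t.role, "content": t.content, **({"reasoning": t.reasoning} if t.reasoning else {})}
            for t in self.turns
        ]

conv = PreservedConversation()
conv.turns.append(Turn("user", "Design a REST API for invoices."))
conv.add_assistant("Proposed /invoices with cursor pagination.", "Chose cursors over offsets for stable ordering.")
print(conv.context_for_next_request())
```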

Future Horizons: Adaptive Logic and Collaborative Agency 

The future of adaptive logical decision making looks transformative and ambitious. Starting from today's stateful reasoning, what would Adaptive Logic Life Cycles look like? Could future iterations identify the critical architectural decisions that should be retained long term and distinguish them from lesser choices that can safely be retired automatically? Differentiating the two would let the model self-scale to larger projects, balancing how quickly context accumulates against the cost of carrying it. Further, imagine applying this to Cross-Session Continuity, where project logic remains intact across different environments within clearly established boundaries. That would move us beyond a single-session worker model toward a collaborative working environment in which multiple engineers benefit from a shared reasoning state throughout long-duration work.

Future improvements to execution may include linking the reasoning process more closely with Artifact Validation. For example, systems could automatically check a generated interface or integration against structural constraints or pre-stated acceptance criteria before it is approved, reducing rework later in the development cycle. A vision of Multi-Agent Collaboration under a unified reasoning framework supports this progression: specialized agents for design, implementation, and verification operating under appropriate control and oversight. The outcome could be autonomous completion of project tasks that more closely mirrors how real engineers behave, creating AI that not only acts but develops and regulates itself alongside increasingly complex development cycles.

Performance Evaluation with Other Models

GLM-4.7's strength challenges, and at times outperforms, both open-weight models and the best proprietary models. On high-level reasoning, GLM-4.7 scored an astonishing 42.8% on Humanity's Last Exam (HLE), a remarkable improvement over its previous version, GLM-4.6, which scored only 17.2%. More significantly, GLM-4.7 edges out GPT-5.1 High (42.7%) and DeepSeek-V3.2 (40.8%) on HLE.

Comprehensive Benchmark Comparison (GLM-4.7 vs. Frontier Models)
source - https://z.ai/blog/glm-4.7

On programming proficiency, the model attained 73.8% accuracy on SWE-bench Verified, an essential benchmark for assessing real-world programming ability. That is a 5.8-point gain over GLM-4.6, placing it ahead of DeepSeek-V3.2 (73.1%). In the SWE-bench Multilingual dataset, it rose to 66.7% accuracy, a gigantic 12.9-point gain over the previous model.

A professional coding evaluation (WebDev)
source -  https://docs.z.ai/guides/llm/glm-4.7

Beyond those headlines, GLM-4.7 excels at using interactive tools. On τ²-Bench it scored 87.4 overall, beating both Claude Sonnet 4.5 (87.2) and GPT-5.1-High (82.7). It also topped the open-source list in the professional Code Arena and scored 84.9 on LiveCodeBench-v6, proving itself more than a code generation tool: an elite coding assistant.

How to Access and Use GLM-4.7?

The GLM-4.7 model is designed to be easily accessible. The model weights, available in BF16 and FP8 precision, can be downloaded from Hugging Face and ModelScope for local deployment using industry-standard frameworks such as vLLM and SGLang.

For anyone preferring a managed service, the model is also fully accessible through the Z.ai API, which provides an OpenAI-compatible interface. It is available commercially through the GLM Coding Plan, priced at roughly 1/7th the cost of Claude. The GitHub link in the Sources section below has all the information needed to install and run it.
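
Given the OpenAI-compatible interface mentioned above, a request might look like the sketch below. The base URL, the model id string, and the extra_body "thinking" switch are illustrative assumptions standing in for whatever the Z.ai documentation actually specifies; check the official guide before relying on any of them.

```python
# Hypothetical sketch of calling GLM-4.7 through an OpenAI-compatible endpoint.
# The base_url, model id string, and the "thinking" extra_body field are assumptions
# for illustration; consult the Z.ai API docs for the real parameter names.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.z.ai/api/paas/v4",   # placeholder endpoint
    api_key="YOUR_ZAI_API_KEY",
)

resp = client.chat.completions.create(
    model="glm-4.7",                            # placeholder model id
    messages=[
        {"role": "system", "content": "You are a senior backend engineer."},
        {"role": "user", "content": "Refactor the invoice service to use cursor pagination."},
    ],
    extra_body={"thinking": {"type": "enabled"}},  # assumed turn-level thinking switch
)
print(resp.choices[0].message.content)
```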

Limitations 

Although GLM-4.7 exhibits strong agentic capabilities, the MoE deployment strategy still has to be planned carefully for optimal efficiency, even with preserved reasoning. Furthermore, preserved reasoning introduces new considerations around managing context and cost for long reasoning sessions. Future versions will likely improve compression or set clearer boundaries for retained reasoning.

Conclusion 

GLM-4.7 represents a significant paradigm shift for small- to medium-sized AI models: no longer systems that merely respond, but systems that can execute, remember, and deliver. Its preserved reasoning, task focus, and demonstrated performance signal the dawn of controllable systems capable of taking genuine engineering initiative without the costs of frontier-scale systems. GLM-4.7 brings efficiency as well as a new paradigm for integrating humans and AI systems.


Sources:
Blog: https://z.ai/blog/glm-4.7
Guide document: https://docs.z.ai/guides/llm/glm-4.7
Model Weight: https://huggingface.co/zai-org/GLM-4.7
GitHub Repo: https://github.com/zai-org/GLM-4.5/


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Tuesday, 23 December 2025

NVIDIA Nemotron 3: Scaling Hybrid Mamba to 1M Tokens

Presentational View

Introduction

Hybrid Mamba-Transformer models appear to be a game-changing way to overcome the quadratic scaling constraints of dense attention, pairing state-space models (SSMs) for long-range memory with Transformers for detail-oriented structuring tasks. Meanwhile, training methodologies are moving past strict supervision: models develop reasoning skills across code, mathematics, and tool-use environments through joint Reinforcement Learning (RL) approaches such as concurrent multi-environment RL with verifiable rewards (RLVR) using NeMo Gym, while novel data-synthesis schemes such as InfiniByte cross-breed different scientific fields to produce reasoning trajectories unlikely to appear naturally on the Web.

Nemotron 3 pushes this frontier by integrating a sparse hybrid architecture, synthetic data, and reinforcement-learning alignment in a completely controllable, open-weights setting. Instead of chasing sheer size, Nemotron 3 shows that long-horizon reasoning, throughput, and agentic stability at a level more typical of much larger systems are viable for small- to mid-scale models, providing a blueprint for logically consistent, efficient, real-time AI that works even within the resource constraints of the enterprise, as the next few sections explore.

What is Nemotron 3?

Nemotron 3 is a family of Sparse Hybrid Mixture-of-Experts (MoE) large language models optimized for the accuracy-to-compute frontier. Unlike previous generations that relied on dense hybrid structures, Nemotron 3 utilizes a granular expert routing system that allows it to scale parameter counts into the hundreds of billions while maintaining the inference cost of much smaller models.

Model Variants

Three size variants of the Nemotron 3 AI models are available, allowing for large-scale production with differing reasoning abilities.

  • Nemotron 3 Nano: This is a model with roughly 31.6 billion total parameters, of which about 3.2 billion are active on each forward pass. It has been optimized for high-speed processing applications such as debugging software or deploying locally on high-performance computers.
  • Nemotron 3 Super: The Nemotron 3 Super is a mid-sized model containing approximately 100 billion total parameters. It uses a latent Mixture-of-Experts (MoE) design with 10 billion active parameters to achieve greater precision for automating IT assistance and supporting multi-agent collaboration.
  • Nemotron 3 Ultra: The flagship of the Nemotron 3 line, the Ultra has approximately 500 billion total parameters and is engineered for the largest, most complicated enterprise workloads. It employs NVFP4 (4-bit floating point) to achieve a favorable cost-to-accuracy ratio on state-of-the-art Blackwell-generation hardware.

Key Features of Nemotron 3

Nemotron 3 maintains its uniqueness through a number of exclusive technological innovations, which emphasize control and performance:

  • 1-Million-Token Context Support: The model adds a long-context phase at the end of pretraining to handle up to 1M tokens, outperforming existing models such as Qwen3 on the RULER tasks.
  • Granular MoE Routing: Rather than the conventional 8 or 16 experts found in other models' MoE layers, Nemotron 3 Nano relies on 128 routed experts plus 1 shared expert, activating just 6 of them per token.
  • Multi-Token Prediction (MTP): Super and Ultra models include MTP layers, which predict multiple future tokens in one step, boosting throughput for structured outputs and long reasoning chains.
  • Hardware-Aware Design: The design natively targets NVIDIA H200 and Blackwell GPUs and adopts the NVFP4 format for maximum inference throughput with minimal accuracy loss.
  • Controllable Reasoning: An enable_thinking flag lets users expose or suppress the model's internal reasoning trace, which can be a requirement in domains such as legal and scientific work (see the sketch after this list).
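
How such a flag might be wired into a standard Hugging Face workflow is sketched below. It assumes the Nano BF16 checkpoint listed in the Sources and that the chat template accepts the enable_thinking keyword named above; both points should be confirmed against the model card.

```python
# Hypothetical sketch: toggling the enable_thinking flag via the chat template.
# Assumes the nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 checkpoint (see Sources)
# and that its chat template honors enable_thinking; verify on the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)

messages = [{"role": "user", "content": "Derive the closed form of the sum of the first n squares."}]

# enable_thinking=True asks the template to include the internal reasoning trace;
# set it to False for terse answers in latency-sensitive settings.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, enable_thinking=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```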

Use Cases for Nemotron 3

The flexibility of Nemotron 3 makes possible a wide variety of high-value applications in various fields:

  • Enterprise IT & Automation: The Super model is specifically tailored for automating IT tickets and teamwork involving multiple agents, in which the workload has to be handled both quickly and precisely.
  • Software Engineering & Local Debugging: Because the Nano model activates only about 3.2B parameters per token, developers can run it on local machines for code completion, transpilation, and debugging without the latency of cloud APIs.
  • STEM & Scientific Research: By utilizing the InfiniByte data set, it is highly adept at interdisciplinary problem-solving for physics, chemistry, and high-level math concepts and applications.
  • Agentic Tool Use: These models can be fine-tuned on targeted data such as Nemotron-Agentic-v1, after which they can engage in multi-turn dialogue, analyzing complex tasks, calling external tools, and interpreting the tools' outputs.

How does Nemotron 3 work?

The model uses a Sparse Hybrid MoE architecture that combines Mamba-2 layers (for linear-time processing of huge context windows) with Transformer layers using Grouped-Query Attention (which preserve the fine-grained structure needed for high accuracy), gaining the strengths of both. The two layer types are tied together by a custom granular MoE design with 128 routed experts: a learned MLP router directs each token to its top six experts, so only the parameters specialized for that token are activated, maximizing output from a focused set of experts.
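
A toy version of that top-k routing step, written with NumPy, is shown below; it illustrates the select-6-of-128 idea plus one always-on shared expert, but the real router, load balancing, and expert networks are of course far more involved.

```python
# Toy top-k expert routing in the spirit of the granular MoE described above:
# 128 routed experts plus 1 shared expert, with only 6 routed experts per token.
import numpy as np

NUM_EXPERTS, TOP_K, D_MODEL = 128, 6, 64
rng = np.random.default_rng(0)
router_w = rng.normal(size=(D_MODEL, NUM_EXPERTS))          # learned router (single layer here)
experts = rng.normal(size=(NUM_EXPERTS, D_MODEL, D_MODEL)) * 0.02
shared_expert = rng.normal(size=(D_MODEL, D_MODEL)) * 0.02

def moe_forward(token: np.ndarray) -> np.ndarray:
    logits = token @ router_w                                # router scores for all 128 experts
    top = np.argsort(logits)[-TOP_K:]                        # pick the 6 best experts for this token
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()   # softmax over the selected experts
    routed = sum(w * (token @ experts[i]) for w, i in zip(weights, top))
    return routed + token @ shared_expert                    # shared expert always contributes

token = rng.normal(size=D_MODEL)
print(moe_forward(token).shape)                              # (64,), same width as the input token
```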

Nemotron 3 hybrid architecture.
source - https://developer.nvidia.com/blog/inside-nvidia-nemotron-3-techniques-tools-and-data-that-make-it-efficient-and-accurate/

For the Super and Ultra models, the construction differs, using a Latent MoE. Instead of routing on distinct token embeddings, each expert operates on a latent representation, giving each specialist access to four times more expert tokens than before. This lets the model achieve significantly higher knowledge density without a corresponding increase in inference time.

Performance Evaluation

The results for Nemotron 3 Nano clearly demonstrate considerable efficiency gains. In standard testing, Nemotron 3 Nano 30B-A3B scored 78.05% on HumanEval (0-shot) and 92.34% on GSM8K (8-shot), as shown in the accuracy tables of the technical report. Importantly, it matches and often outperforms much larger or more complex models such as GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507.

Accuracy and throughput comparisons
source - https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf

In terms of inference throughput, an imperative criterion for real-time tasks, Nemotron 3 Nano delivers 3.3 times higher throughput than Qwen3-30B-A3B and 2.2 times higher than GPT-OSS-20B on heavy input/output workloads (8K input, 16K output) on a single H200 GPU. The gap widens further on long-context tasks, where the model has beaten its competitors on RULER tests at context lengths up to 1M tokens.

Nemotron 3 Nano evaluations across a broad suite of established benchmarks
source - https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf

Supplemental assessments also show strong general knowledge and tool use. The model scored 78.56% on MMLU (5-shot) and 53.8% on the Berkeley Function Calling Leaderboard, validating its readiness for complex multi-step tasks. It also showed standout mathematical capability, scoring 78.63% on MATH-500 with advanced reasoning protocols.

How to Access and Use Nemotron 3

Nemotron 3 models can be obtained in different ways to suit both cloud-native and local-first developers. The weights for the Base, BF16, and FP8 models can be accessed on the Hugging Face model hub in the nvidia/nemotron-3 namespace. For more advanced applications, the models can be obtained through NVIDIA NIM (microservices), which is the optimized inference API. Instructions for executing the models locally can be obtained from the GitHub repos and the NVIDIA Research webpage. Nemotron 3 models come under the NVIDIA Open Model License. Though applications in research and commercial applications are encouraged in general, one still has to refer to the model card page for specifics.

Limitations 

Nemotron 3 also has certain limitations. Handling a 1M-token context requires a great deal of memory, going well beyond the roughly 256K tokens that typical consumer setups can manage. A review of the training data also shows an imbalance toward 'male' and 'White' identifiers, a common issue for large foundation models that calls for per-prompt bias examination. Looking ahead to the first half of 2026, coverage is planned for Super (100B) and Ultra (500B), finalizing Nemotron 3's NVFP4 standardization of Latent MoE models to scale up reasoning capability.

Possible Technological Advancements and Future Directions

There are many ways in which Nemotron 3 could continue to evolve by incorporating new technology into its existing system. Dynamic hardware-aware routing would lift the limits of statically bounded expert activation, adapting to the changing complexity of a task and the amount of available system memory. That flexibility at inference time would let workloads scale across different kinds of infrastructure, especially within the enterprise.

Another new direction is recursive synthetic logic evolution. This involves the iterative creation of reasoning scenarios based on observed gaps within a model’s internal reasoning traces using synthetic data pipelines. This self-correcting feedback loop would allow for the improvement of infrequent yet complex failure modes, which are difficult to capture with human-created training datasets alone. Neural symbolic verification of reasoning chains and the use of formal solvers should be added to ensure compliance with regulatory and logical constraints.

Over time, it is also possible to improve the ability of efficient hybrid systems to perform reasoning tasks that require working with continuously fed data sources (for instance, video and sensor data) through the integration of multi-modal state-space layers. Doing this will allow these systems to perform similar scaling operations as what is done today with large amounts of text.

Conclusion

For the expert, the value is not only in the benchmark results, but also in the controllability – the possibility of turning reasoning traces on and off and leveraging data recipes such as InfiniByte for specific tasks that can never be addressed by natural data. This is an AI model that is as efficient as it is smart.

Sources:
Research: https://research.nvidia.com/labs/nemotron/Nemotron-3/
News: https://nvidianews.nvidia.com/news/nvidia-debuts-nemotron-3-family-of-open-models
Blog : https://developer.nvidia.com/blog/inside-nvidia-nemotron-3-techniques-tools-and-data-that-make-it-efficient-and-accurate/
Tech document: https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf
Nemotron 3 collections: https://huggingface.co/collections/nvidia/nvidia-nemotron-v3
Nano Base-BF16: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
Nano A3B-BF16: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Nano A3B-FP8:  https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.
