
Monday, 10 November 2025

Kimi K2 Thinking: Long-Horizon Planning with 256K Context

Presentational View

Introduction

The AI world has been obsessed for the last few years with speed and fluency. We've seen models that can write poetry, answer trivia, and generate code in the blink of an eye. Yet for all their intelligence, these models share a basic limitation: they are reflexive. They are brilliant sprinters, but they cannot run a marathon. Ask them to carry out a complex project extending over days and they will lose focus, forget the original goal, and drift into incoherence.

This is the central challenge in AI today: the real frontier is not about making AI smarter, but about giving it stamina. We need models with long-horizon agentic stability (the ability to execute long, complex tasks) and reasoning continuity (an unbroken train of thought). The core problem has been that models forget why they are doing something after a few steps; they lack a persistent internal monologue.

There's a new AI model with a different philosophy: it is designed not just to answer but to reason, plan, and execute complex workflows over extended periods. It represents a shift from a simple responder to a true cognitive executor, and an important first step toward truly autonomous, strategic AI systems. This new AI model is called Kimi K2 Thinking.

What is Kimi K2 Thinking? 

Kimi K2 Thinking is a specialized variant of the Kimi K2 model series, positioned above Kimi K2 Instruct. Where the Instruct model is a faster, reflexive responder, the Thinking variant is built for complex, long-running tasks: it acts as an agent, processes problems logically, and reasons step by step while maintaining stable, coherent reasoning across lengthy procedures.

Key Developments in Kimi K2 Thinking

Kimi K2 Thinking's design philosophy gives it a set of capabilities that set it apart from its peers.

  • Strategic Intelligence vs Reflexive Intelligence: The model is explicitly designed as a thinking agent that reasons step by step. It was purpose-built as a long-term planner, in contrast to the faster, reflexive Kimi K2 Instruct.
  • Unmatched Agentic Stability: A signature capability of the model is its engineered reduction in drift and its capacity for coherent, goal-driven reasoning across an industry-leading 200-300 sequential tool calls, all without human intervention.
  • Autonomous Task Decomposition: The model is uniquely capable of long-horizon planning, autonomously breaking complex, high-level objectives into ordered subtasks before proceeding. As evidence of this depth, it solved a PhD-level mathematics problem through 23 interleaved reasoning steps and tool calls.
  • Efficient, Near-Lossless Generation: One of the model's most practical features is its quantization. Thanks to native INT4 quantization with Quantization-Aware Training, it generates results roughly twice as fast and with far less memory than an unquantized equivalent, with effectively no loss in quality, which is what makes deep reasoning viable in practice.

Unique Use Cases of Kimi K2 Thinking

What becomes possible with an AI that has a 300-step attention span and a 256K-token memory? The applications are qualitatively different from anything seen before.

  • Fault-Tolerant Scientific Simulation: A user could orchestrate a 72-hour chemical-synthesis run requiring 200-250 steps of simulation, parameterization, and code changes, something previously impossible for conversational AI models that cannot maintain state. If the run fails or has to be terminated, the saved reasoning_content can be reloaded, so earlier hypotheses and solution attempts remain intact and the investigation can continue non-destructively from the original experimental premise.
  • One-Pass Regulatory Synthesis: A corpus of up to 220-250K tokens (e.g., new tax laws, multi-jurisdictional regulations, internal policies) can be ingested in a single request, producing a redline, a conflict map, and a remediation plan in one pass. This avoids the chunking artifacts and whole-context consistency violations typical of 128K-context models.
  • Autonomous Monorepo Refactoring: Kimi K2 Thinking could be handed a massive, multi-language monorepo and asked to find the large, complex bugs an enterprise codebase tends to accumulate. It could then be instructed to implement and run the fix autonomously and generate a new release candidate without supervision from the development team, running on the order of 300 edit/test/benchmark cycles to evaluate the codebase comprehensively and decide which fixes to include. Such an agent would not even need to be wired into the DevOps pipeline to accomplish this work.
  • Digital Twin Coordination: An agent could operate a factory digital twin, using its 256K context to review months of historical sensor logs while executing hundreds of sequential control actions through APIs. The reasoning_content would leave an auditable trail of its rationale at every step.
  • Longitudinal Clinical Study Management: The model could manage an adaptive clinical study over several months, reading the complete protocol, patient reports, and lab results, then performing repeated rounds of statistical reanalysis and protocol-amendment drafting while preserving a complete chain of rationale for regulators.
  • Global Supply Chain Remediation: After a disruption, the agent could autonomously manage hundreds of API calls across carriers, customs, and legal teams to triage the problem, divert shipments, and execute negotiation strategies, all while maintaining a consistent state across a multi-day event.

How Does Kimi K2 Thinking Work? - Architecture

Kimi K2 Thinking uses a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters, of which 32 billion are activated on each inference pass. At inference time, the model interleaves chain-of-thought reasoning with tool invocations such as search, browsing, and code execution. It stores intermediate reasoning in a field called reasoning_content, which must be carried forward in multi-turn workflows to maintain continuity. The system supports a 256K-token context window, making sustained long-horizon planning possible. A quantization stack of native INT4 plus Quantization-Aware Training keeps this enormous model inference-efficient in real-world usage.
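To make the continuity requirement concrete, here is a minimal sketch of a multi-turn agentic loop against an OpenAI-compatible endpoint. The base URL, the model identifier, and the web_search tool are illustrative assumptions rather than confirmed values, and the exact way reasoning_content is echoed back may differ by provider; consult the official guide before relying on it.

```python
# Minimal sketch of a multi-turn agentic loop that carries reasoning_content
# forward. Base URL, model id, and the web_search tool are assumptions.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MOONSHOT_API_KEY",
    base_url="https://api.moonshot.ai/v1",  # assumed OpenAI-compatible endpoint
)

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical tool, wired to your own search backend
        "description": "Search the web for a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user",
             "content": "Survey recent work on INT4 quantization-aware training."}]

for _ in range(10):  # the model reportedly sustains 200-300 such steps
    resp = client.chat.completions.create(
        model="kimi-k2-thinking",  # assumed model identifier
        messages=messages,
        tools=tools,
    )
    msg = resp.choices[0].message
    # Append the full assistant message, including its reasoning_content field;
    # dropping that field breaks the chain of thought on later turns.
    messages.append(msg)
    if not msg.tool_calls:
        break  # the model produced a final answer
    for call in msg.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": "stub search result",  # replace with a real tool execution
        })
```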

Performance Evaluation Compared to Other Models

The first thing to emphasize is performance on agentic-reasoning benchmarks. On Humanity's Last Exam (HLE), a benchmark of multi-domain expert reasoning with tools, K2 Thinking scored 44.9%, nearly doubling K2 0905's previous 21.7%. On BrowseComp, an agentic search and retrieval benchmark, the jump was even larger: 60.2%, up from the previous generation's 7.4%. The results support the accuracy benefits of deep, structured reasoning over reflexive generation.

Benchmarks that assess reasoning, coding, and agent capabilities
source - https://moonshotai.github.io/Kimi-K2/thinking.html

The second area to summarize is agentic coding. Kimi K2 Thinking scored 71.3% on the SWE-Bench Verified benchmark, notably better than other top MoE models. This is the best result among open MoE reasoning models and reaffirms its specialization in multi-step, autonomous software-reasoning workflows.

General Benchmark results
source - https://moonshotai.github.io/Kimi-K2/thinking.html

Finally, the remaining scores reaffirm a specialized, powerful profile. Kimi K2 Thinking scored an impressive 83.1% on LiveCodeBenchV6 (no tools) and 61.1% on SWE-Bench Multilingual. Its consistent advantage over predecessor models is clearest in multi-step applied reasoning and complex, tool-using agentic workflows, where it sustains goal-directed behavior across 200-300 sequential tool calls without a behavioral shift.

Kimi K2 Thinking vs DeepSeek-R1/V3 & Qwen3

Kimi K2 Thinking, DeepSeek-R1/V3, and Qwen3 are the latest products of the Mixture-of-Experts (MoE) approach to human-like reasoning. All are characterized by sparse MoE architectures, massively scaled parameters (20B-40B active), and context windows of 128K tokens or more. All three aim to combine human-like reasoning with computational efficiency, using reinforcement learning or continued fine-tuning to support multi-step logic. In short, they share the same engineering family but explore different ideas of cognition.

These characteristics define each model's unique advantage. Kimi K2 Thinking excels at long-form, tool-heavy, or procedural tasks that demand sustained, uninterrupted reasoning, such as orchestrating scientific simulations or refactoring and rewriting software. DeepSeek-R1/V3 excels at analytical rigor: mathematics, proofs, logic, and deterministic coding. Qwen3 excels in conversational or multimodal settings where flexibility and responsiveness matter most. Together they form three branches of advanced reasoning (Kimi K2 Thinking as the strategic planner, DeepSeek as the rigorous analyst, and Qwen3 as the adaptive communicator); each is powerful, but only K2 Thinking has the endurance to sustain long-running, truly autonomous agency.

How to Access and Use Kimi K2 Thinking 

The Kimi K2 Thinking model is available via the Moonshot AI API in an OpenAI/Anthropic-compatible form. The model weights are publicly available on Hugging Face in the moonshotai/Kimi-K2-Thinking repository. Use of Kimi K2 Thinking is subject to a modified MIT license (commercial use is permitted, with conditions that depend on the scale of deployment). A live chat mode is accessible at kimi.com, but with a limited tool set and fewer agentic steps; the full agentic mode is planned for release in the near future.

Limitations and/or Future Work 

Despite its progress, the model carries some practical constraints. The reasoning_content tokens count toward the input/output quota, which means extended workflows require significant token budgets and can eventually crowd out other operations. The live chat deployment also uses a more limited tool set and fewer steps than the benchmark configuration, so the full 200-300 tool-call capability may not be available in the public UI.
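Because reasoning tokens are billed like any other output, long agentic runs benefit from explicit budget tracking. Below is a small sketch that accumulates the usage figures returned by an OpenAI-compatible endpoint; the 200,000-token ceiling is an arbitrary illustrative budget, not an official limit.

```python
# Sketch of budget tracking for a long-horizon run, assuming the endpoint
# returns the standard OpenAI-style `usage` object on each response.
class TokenBudget:
    """Track cumulative billed tokens across a long agentic workflow."""

    def __init__(self, limit: int = 200_000):  # illustrative budget, not an official limit
        self.limit = limit
        self.spent = 0

    def record(self, response) -> None:
        # reasoning_content is billed as completion tokens, so it is counted here too
        usage = response.usage
        self.spent += usage.prompt_tokens + usage.completion_tokens
        if self.spent > 0.9 * self.limit:
            print(f"Warning: {self.spent}/{self.limit} tokens used; "
                  "consider checkpointing or summarizing the run.")
```

In the agentic loop shown earlier, each response would be passed to record() before the next step proceeds.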

Conclusion

Kimi K2 Thinking isn't just a faster model; it is smarter, steadier, and more strategic. We are moving beyond the Oracle model of an all-knowing entity providing one quick answer to the Agent model: a persistent, goal-oriented co-worker able to take on a project, oversee its complexity, and bring it to completion. To developers, researchers, and businesses, it means the difference between an AI that can help you code and an AI capable of independently refactoring your entire codebase while you sleep.



Sources:
Blog : https://moonshotai.github.io/Kimi-K2/thinking.html
Hugging Face weight : https://huggingface.co/moonshotai/Kimi-K2-Thinking
Guide doc: https://platform.moonshot.ai/docs/guide/use-kimi-k2-thinking-model




Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Monday, 3 November 2025

MiniMax-M2: Open-Source AI for Agentic & Coding Workflows

Presentational View

Introduction

The world of large-scale AI has been locked in an arms race defined by a simple, brutal metric: more is better. More parameters, more data, more compute. This relentless pursuit of scale has given us astonishingly capable models, but it has also created a massive barrier, leaving a trail of eye-watering API bills and latency bottlenecks. For the developers building the next generation of AI-powered tools-especially agentic systems that can independently plan, act, and verify tasks-this cost-per-thought is a critical bottleneck. The dream of an AI agent running, testing, and fixing code all on its own is great until you get that invoice for a million iterations of its thought process.

This is exactly the challenge the new MiniMax-M2 model is built to solve. It moves the goalpost from biggest to smartest. Recent updates confirm that MiniMax-M2 is radically cheaper and faster than its proprietary competitors, a direct answer to the industry's scalability problem that shows top-tier agentic performance and cost-effective deployment are not mutually exclusive.

Development and Contributors

MiniMax-M2 was developed by MiniMax, an AI startup based in Shanghai that has quickly become an important participant in the AI space, backed by significant venture funding from industry giants such as Alibaba and Tencent. The model's motto, 'a Mini model built for Max coding & agentic workflows,' neatly sums up its design philosophy: compact and efficient, but built for maximum real-world developer and agentic utility.

What is MiniMax-M2? 

MiniMax-M2 is a compact, fast, and cost-effective Mixture-of-Experts AI model. From the architectural point of view, MiniMax-M2 is an ingenious piece of engineering. The sparse activation design allows it to have the vast knowledge of a huge model while retaining the speed and low operational cost of a much smaller one.

Key Features of MiniMax-M2

The design of MiniMax-M2 yields a suite of characteristics that are not just impressive on paper but genuinely useful to a developer or software architect.

  • Optimized Balance of Intelligence and Cost: The model's core design strikes a rare balance between intelligence, speed, and cost. It makes elite intelligence, ranked #1 among open-source models, usable for complex tasks without excessive computational burden, proving that good unit economics do not have to mean accepting less performance.
    Artificial Analysis (AA) intelligence ranking
    source - https://www.minimax.io/news/minimax-m2
  • Radical Unit Economics: This is M2's killer feature. It is designed for low latency, low cost, and high throughput. Its API cost is quoted at about 8% of Claude 3.5 Sonnet, a direct top-tier competitor.
  • High-Speed Inference: M2 saves time as well as budget. Its efficient design, with only 10B active parameters, achieves nearly double the inference speed of Claude 3.5 Sonnet. This is crucial for the fast feedback loops that coding and agentic tasks demand.
  • Sophisticated Tool Use in Agentic Mode: The model uses interleaved thinking and is designed to execute sophisticated end-to-end tool use in a lean form factor, planning and calling tools across environments such as the shell, browser, and code runners.

Use Cases of MiniMax-M2

These attributes enable a set of use cases that have never been financially viable or technically feasible with any other model.

  • Autonomous PR Fixer in CI/CD: As soon as a developer opens a Pull Request that fails unit tests, an instance of the model is triggered. The M2 agent then runs a real-time, multi-file code-run-fix loop to diagnose, edit, and validate the code. Because M2 is so fast and low-cost, it can self-correct repeatedly within the CI window, enabling rapid, fully automated, test-validated evolution of a codebase before a human reviewer has even opened the PR (see the sketch after this list).
  • Live, Conversational IDE Debugging Partner: A developer using an M2-powered IDE extension hits a bug. M2 invokes its interleaved-thinking architecture and streams its reasoning (its plan, its hypotheses, and the results of tool calls) into an IDE side panel in real time. The developer gets a non-blocking, low-latency assistant embedded in the IDE that shows its steps as it searches documentation or simulates code execution.
  • Scalable, Deep-Search Agent for Compliance Audits: A financial services firm needs thousands of concurrent agents to run xbench-DeepSearch and BrowseComp-style evaluations for regulatory compliance across massive document repositories and the public web. M2's low active-parameter count enables high throughput and low server memory use, making a thousand-strong fleet of active, traceable, self-recovering agents a cost-effective proposition for constant, wide-scale monitoring.
  • Cost-Optimized, Multi-Turn RAG Agent Pipeline: A company replaces the expensive proprietary model behind its RAG pipeline with MiniMax-M2, taking advantage of its compatibility with the Anthropic/OpenAI APIs. This permits a hot-swap migration with code changes confined to input and output handling, delivering a massive cost reduction while retaining top-tier long-horizon tool-calling performance for document retrieval and summarization.
  • Adaptive Command-Line (CLI) Agent: A developer working in a terminal tackles a Terminal-Bench-style task involving complicated shell commands, file reading and manipulation, and validation of constructed execution payloads. M2, running locally or via a low-latency API, acts as an advanced command-line agent that plans and executes complex toolchains across the shell and code runners, providing instant, intelligent automation tailored to the working environment.
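The CI/CD use case above boils down to a bounded diagnose-edit-validate loop. The sketch below shows one way such a loop could be wired up; the endpoint, model identifier, and the run_tests/apply_patch helpers are assumptions standing in for real CI tooling, not part of MiniMax's API.

```python
# Schematic coding-run-fix loop for the autonomous PR fixer. run_tests() and
# apply_patch() are hypothetical placeholders for your own CI tooling; the
# base URL and model id are assumptions.
import subprocess
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.minimax.io/v1")  # assumed endpoint


def run_tests() -> tuple[bool, str]:
    """Run the project's test suite and return (passed, combined log)."""
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def apply_patch(diff: str) -> None:
    """Apply a unified diff produced by the model (placeholder)."""
    subprocess.run(["git", "apply", "-"], input=diff, text=True, check=True)


history = []
for attempt in range(5):  # bounded retries keep CI time predictable
    passed, log = run_tests()
    if passed:
        break
    history.append({
        "role": "user",
        "content": f"Tests failed:\n{log[-4000:]}\nReturn a unified diff that fixes them.",
    })
    resp = client.chat.completions.create(model="MiniMax-M2", messages=history)  # assumed model id
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})  # keep <think> blocks intact
    apply_patch(reply)  # naive: assumes the reply is a clean diff; real code should parse it
```

A production version would sandbox the patch application and parse the diff out of the model's reply rather than applying it verbatim.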

How MiniMax-M2 Works (Architecture) 

The magic of MiniMax-M2 lies not in its 230 billion parameters but in an optimized Mixture-of-Experts (MoE) architecture built around a small activation size. Instead of using all 230 billion parameters on every request, a router activates only the roughly 10 billion expert parameters most relevant to the task at hand. This is what makes MiniMax-M2 cost-effective, and it is a deliberate, central element of the design.

The architecture was designed to map directly onto the common agentic workflow of plan → act → verify. By activating so few parameters per step, MiniMax-M2 stays highly responsive at each stage of that workflow while sharply reducing the compute overhead of every step. This enables the fast feedback loops that agentic tasks depend on, such as a compile-run-test loop for coding or a browse-retrieve-cite chain for research. The model reflects this agentic design in its output: when it reasons, it wraps its 'thinking' content in <think>...</think> tags. These tags are not mere metadata; they play a central role in using the model correctly. The model expects this thinking content to be kept in the conversation history, and deleting it will degrade performance.
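In practice, the retention rule is simple: keep the assistant's full reply, <think> block included, in the running history, and strip the think content only for display. The sketch below shows that pattern against an OpenAI-compatible endpoint; the base URL and model identifier are assumptions.

```python
# Sketch of the history-handling rule: preserve <think>...</think> content in
# the conversation history, strip it only for what the end user sees.
import re
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.minimax.io/v1")  # assumed endpoint
history = [{"role": "user", "content": "Plan how to add retry logic to our HTTP client."}]

resp = client.chat.completions.create(model="MiniMax-M2", messages=history)  # assumed model id
full_reply = resp.choices[0].message.content

# Store verbatim: deleting the <think> span degrades the model on later turns.
history.append({"role": "assistant", "content": full_reply})

# Strip the reasoning only for display purposes.
visible = re.sub(r"<think>.*?</think>", "", full_reply, flags=re.DOTALL).strip()
print(visible)
```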

Performance Evaluation with Other Models

MiniMax-M2 has comprehensive real-world evaluations to its credit and is particularly strong in the agentic coding domain. The model shines on the SWE-bench Verified benchmark, which gauges an AI's ability to solve real-world software-engineering tasks through multi-turn interaction, planning, and tool use. It scored a strong 69.4 on this hard test, while a model such as OpenAI's gpt-oss-120b, an extremely strong competition coder with a 2622 Elo rating, has no comparable score on this agentic-workflow benchmark, underlining M2's specialized focus.

Coding & Agentic Benchmarks
source -  https://github.com/MiniMax-AI/MiniMax-M2

This is even more evident within the Tau-Bench for agentic tool use, where MiniMax-M2 scored an impressive 77.2. This significantly outperforms gpt-oss-120b, which scored only 67.8% on that same benchmark. This head-to-head win on a complex tool-use test underlines the advanced capability of M2 for planning and executing complex, long-horizon toolchains across environments like the shell, browser, and code runners.

Artificial Analysis (AA) Intelligence Benchmark
source -  https://github.com/MiniMax-AI/MiniMax-M2

Finally, the model achieved a composite score of 61 on the Artificial Analysis (AA) Intelligence Benchmark, which combines 10 challenging tasks, placing it #1 among open-source models globally. It also posted very strong results in other key agentic areas: Terminal-Bench 46.3, xbench-DeepSearch 72, and BrowseComp-zh 48.5, convincingly demonstrating practical effectiveness at browsing and locating hard-to-surface sources.

How to Access and Use MiniMax-M2

The model weights are officially open-source and available for local deployment directly from the Hugging Face repository. The development team encourages using modern inference frameworks like SGLang, vLLM, and MLX-LM for optimal local performance. For those who prefer a managed API, it is available live on the MiniMax Open Platform, which also features the critical compatibility interfaces for both Anthropic and OpenAI API standards. Furthermore, the full GitHub repository contains the source and further documentation. Finally, the company provides a public product called MiniMax Agent, built on M2, which is currently publicly available and free for a limited time. All links are provided at the end of this article.

Limitations and Future Work

The main constraint of MiniMax-M2 is also its most distinctive architectural trait: the thinking output. Users must retain the assistant's thinking content, wrapped in <think>...</think> tags, in the message history passed back to the model. Removing it, for example to tidy up the chat history, will degrade performance, an important technical consideration that developers must handle in code. In addition, while the weights are open-source, the license carries caveats; for example, the model cannot be used to improve competing AI systems.

Conclusion

MiniMax-M2 shows that the future of AI is not about building the largest brain possible; it is about building the most efficient one. For software architects, AI engineers, and programmers, M2 breaks the logjam that has limited the large-scale adoption of agentic AI, turning the promise of independent, high-volume, economically feasible AI agents into a reality.


Sources:
Blog : https://www.minimax.io/news/minimax-m2
Github Repo : https://github.com/MiniMax-AI/MiniMax-M2
Hugging Face weight : https://huggingface.co/MiniMaxAI/MiniMax-M2
Guide doc: https://platform.minimax.io/docs/guides/platform-intro


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Friday, 24 October 2025

DeepSeek-OCR: Solving LLM Long-Context with Visual-Text Compression

Presentational View

Introduction

For many years, we have pursued two parallel goals in artificial intelligence. The first is Optical Character Recognition (OCR): the simple, practical goal of teaching computers to read text from images. The second is visual-text compression: a more abstract effort to bridge large, high-dimensional visual information and the neater, linear nature of language. While OCR is all but ubiquitous, the advent of Large Language Models (LLMs) has exposed a significant bottleneck. LLMs are powerful text processors at monumental scale, but their processing cost grows quadratically with input length, so they struggle to scale to long texts. This long-context problem is one of the largest obstacles to truly powerful AI agents that could remember entire books, long conversations, or complicated legal matters.

This is where the whole paradigm changes. The new AI model, DeepSeek-OCR, transforms the entire problem from an entirely new, LLM-centered point of view. Rather than asking how to extract the text from the image, it ultimately asks how to leverage the image itself as a compressed representation of the text. By demonstrating that a small number of vision tokens can accurately represent thousands of text tokens, it bypasses the quadratic scaling bottleneck. This moves DeepSeek-OCR from simply another OCR tool to a new kind of architecture for AI memory.

What is DeepSeek-OCR?

DeepSeek-OCR is an advanced vision-language model (VLM) built from the ground up as a research project and proof of concept, not a regular OCR tool or standard utility. Its primary purpose is to explore visual-text compression: condensing dense visual input, such as scanned documents and complex diagrams, into a compact, context-rich array of vision tokens. It is purposefully designed to bridge the efficiency gap between high-dimensional visual inputs and sequential language processing.

Key Features of DeepSeek-OCR

  • Flexible Resolution Modes: The model offers fine-grained control over the compression-to-fidelity trade-off through multiple native resolutions: Tiny (512×512, 64 vision tokens), Small (640×640, 100 vision tokens), Base (1024×1024, 256 vision tokens), and Large (1280×1280, 400 vision tokens). It also has a dynamic Gundam mode (n×640×640 + 1×1024×1024) for ultra-high-resolution inputs.
  • Deep Parsing (OCR 2.0): Beyond regular text, the model was trained on 'OCR 2.0' data to support 'deep parsing.' This lets it extract structured data, such as converting charts to HTML tables, chemical formulas to SMILES notation, and parsing simple plane-geometry figures.
  • Data-Rich Training: The model's flexibility and performance rest on training with massive, complex data, including 30 million pages of document OCR.
  • LLM-Centric Prompting: The system is designed from an 'LLM-centric perspective.' It is driven by explicit prompt templates containing tags such as <|grounding|> to start tasks (e.g., 'Convert the document to markdown') and <|ref|>xxxx<|/ref|> to locate specific references in the image.

Use Cases of DeepSeek-OCR

The true benefit of DeepSeek-OCR is not solely in its accuracy, but in the innovative applications that can be developed due to the nature of its architecture:

  • Ultra-Long Context Memory Compression for LLMs: This is the core innovation for scalable AI memory systems: large historical datasets (legal archives, patient records, long-running conversations) can be stored as optically compressed images for an LLM to reference. The LLM can then draw on a potentially boundless context at lower computational cost, and even simulate biological 'memory forgetting,' in which older context gradually loses fidelity through heavier compression.
  • High-Throughput Structured Knowledge Extraction: The deep parsing engine is aimed at STEM and finance, though it applies to converting almost any unstructured document into structured data. It is effective for building automated knowledge graphs, converting flowcharts, charts, and other figures in a research paper or technical report into machine-readable HTML tables, or extracting chemical formulas from the same report as SMILES strings.
  • Industrial-Scale Data Production Engine: Its efficiency makes it a powerful data production engine for the AI industry. It can create or enhance massive multilingual pretraining datasets for new LLMs and VLMs, including complex layouts and structured data annotated for instruction tuning.
  • Adaptive Document Intelligence Platforms: Businesses could build an economical platform that dynamically picks the optimal balance of speed and accuracy for each document, automatically using a fast, low-token mode for simple slides and switching to the highest-fidelity mode for dense newspapers.

How DeepSeek-OCR Works

DeepSeek-OCR is based on a smart and extremely effective two-part architecture. A vision encoder, the DeepEncoder, first processes a high-definition image and condenses its visual data into a small, manageable set of vision tokens. This condensed form is then fed into a compact MoE decoder. This Mixture-of-Experts decoder has the expressiveness of a three-billion-parameter model but keeps the speed of much smaller models at inference, since it activates only around 570 million parameters at a time.

The architecture of DeepSeek-OCR
source - https://arxiv.org/pdf/2510.18234

The DeepEncoder itself is a lesson in efficiency, designed as a three-stage sequential pipeline that processes large images without memory blowups. It starts with a visual perception module that uses window attention to handle dense input economically, followed immediately by a token compressor that reduces the token count by a factor of sixteen. Only after this major reduction does the final visual knowledge module apply computationally costly global attention, combining high-level visual context with the fine-grained local features extracted in the first stage.
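The per-mode vision-token counts listed earlier are consistent with a 16×16-pixel patch grid followed by that sixteen-fold token compressor. The short check below works through the arithmetic; the 16-pixel patch size is an assumption that happens to reproduce the published numbers, not a figure stated in the text.

```python
# Worked check of the vision-token counts per resolution mode, assuming
# 16x16-pixel patches followed by the 16x token compressor described above.
modes = {"Tiny": 512, "Small": 640, "Base": 1024, "Large": 1280}
for name, side in modes.items():
    patches = (side // 16) ** 2    # raw patch tokens from the window-attention stage
    vision_tokens = patches // 16  # after the 16x token compressor
    print(f"{name:>5}: {side}x{side} -> {patches} patches -> {vision_tokens} vision tokens")
# Output: Tiny 64, Small 100, Base 256, Large 400 -- matching the modes listed earlier.
```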

Performance Evaluation

The performance of DeepSeek-OCR was assessed in two fundamental ways: how far it can push compression efficiency in theory, and how well it performs on real-world OCR tasks. First, to probe the compression limits, DeepSeek-OCR was evaluated on the English-document portion of the Fox benchmark, which reveals how the number of vision tokens relates to the precision of text decoding. The results were impressive: DeepSeek-OCR delivered 96%+ OCR precision at compression ratios below 10-to-1 (text tokens to vision tokens), a strong proof of concept that roughly 10x near-lossless optical context compression is feasible.

DeepSeek-OCR’s vision-text compression ratio test from Fox Benchmark
source - https://arxiv.org/pdf/2510.18234

Second, the model was evaluated on real-world document parsing with the complete OmniDocBench, where accuracy is measured by Edit Distance (ED) and lower scores are better. DeepSeek-OCR was compared against a suite of models including GOT-OCR2.0, MinerU2.0, InternVL2-76B, and the proprietary GPT-4o and Gemini 2.5 Pro. The most remarkable outcome was that DeepSeek-OCR achieved state-of-the-art performance among end-to-end models while using dramatically fewer vision tokens. In its Small mode (100 tokens), for example, DeepSeek-OCR outperformed GOT-OCR2.0 (256 tokens); in its Gundam mode (fewer than 800 tokens), it outperformed MinerU2.0, which averages nearly 7,000 vision tokens on similar tasks.

OmniDocBench to test the performance of DeepSeek-OCR on real document parsing tasks
source - https://arxiv.org/pdf/2510.18234

These benchmarks support the model's dual-purpose design. The Fox results back the theoretical promise of solving the long-context problem, while the OmniDocBench results show it is also a genuinely useful and efficient option for real-world work. This efficiency is part of what lets DeepSeek-OCR generate over 200,000 pages of training data per day per A100 GPU and makes it a viable way to mitigate the quadratic scaling costs of LLMs.

The Next Frontier: Agentic Integration

Embedding agentic abilities would convert DeepSeek-OCR from a strong perception tool to the sensory cortex for a new generation of autonomous machines. A model-endowed agent could act in spaces that were previously off-limits to automation, like exploring legacy document stores, scanning complicated dashboards through screenshots, or conducting due diligence by reading visually-presented financial statements. The deep parsing capability becomes a first-class motivator for action; an agent may ingest a scientific article on its own, transform charts to structured tables, extract chemical formulas as SMILES strings, and then utilize the structured output to run code, query databases, or even plan experiments. The central compression advantage of the model would equip the agent with an extremely efficient, long-term memory, enabling it to preserve context across very long tasks at a fraction of the cost.

But this integration raises hard questions about reliability and reasoning. The biggest challenge is coping with probabilistic perception. Although 96% accuracy is excellent for OCR, no decision-making agent can live with a chance of misreading a number in an accounting statement or a character in a chemical formula. This demands advanced self-check and validation loops in which the agent learns to cross-check results or re-scan a document at higher fidelity when something looks wrong. The agent also has to learn a meta-skill: choosing the best resolution mode on the fly, weighing speed and cost against the potential loss of information for the task at hand. This opens a new research frontier: developing agents that can reason about the quality and limitations of their own perception pipeline.

How to Use and Access DeepSeek-OCR

DeepSeek-OCR is highly accessible to researchers and developers alike. It is an open-source effort published under the permissive MIT license, making it available for a broad array of applications. The project's code and model weights are freely available on GitHub and Hugging Face. For local deployment, users can run inference with either the standard Hugging Face Transformers library or the high-throughput vLLM framework for optimal performance. The repository also contains detailed environment-setup instructions (suggesting CUDA 11.8 + torch 2.6.0) and task scripts for PDF processing and image streaming. This setup lets any user pick the exact resolution mode (from Tiny to Gundam) that best fits their document-processing task.
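For orientation, here is a hedged sketch of what local Transformers inference roughly looks like. The repository ships custom remote code, and the infer helper and its arguments below follow the published model card as best understood; they are assumptions to verify against the official repo rather than a guaranteed API.

```python
# Hedged sketch of local inference with Hugging Face Transformers. The custom
# `infer` helper comes from the repo's remote code; its exact signature may differ.
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().cuda()

# Prompt template using the <|grounding|> tag described in the features section.
prompt = "<image>\n<|grounding|>Convert the document to markdown."

# Assumed helper signature; image_file points to a local scanned page.
result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="scanned_page.png",
    output_path="./ocr_out",
)
print(result)
```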

Limitations and Future Work

As an innovative experiment, DeepSeek-OCR's chief limitation lies in the hard threshold of 'lossless' compression. Although it reaches roughly 96-97% accuracy at a 10x compression ratio, accuracy falls sharply to approximately 60% at a 20x ratio. This information loss stems from text blurring and the loss of intricate layout information at extreme compression levels. Future research aims to validate the optical compression framework beyond the OCR modality, using focused tests such as needle-in-a-haystack evaluation and digital-optical text interleaved pretraining to push its limits as general-purpose long-context memory.

Conclusion

By treating an image as a highly compressed form of its text, DeepSeek-OCR offers an intuitive and elegant solution to the long-context problem that haunts Large Language Models. Its Contexts Optical Compression framework introduces the idea of scalable visual memory, pointing to a future where AI agents handle theoretically unlimited context with remarkable efficiency. DeepSeek-OCR's real innovation is not a better scanner but an entirely new design blueprint for the memory of tomorrow's most capable AI.

Source
Technical Document: https://arxiv.org/pdf/2510.18234
Github Repo :  https://github.com/deepseek-ai/DeepSeek-OCR/tree/main
Hugging Face weight : https://huggingface.co/deepseek-ai/DeepSeek-OCR


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Monday, 6 October 2025

Sonnet 4.5: Superior AI Safety and Integrity for Financial Applications

Presentational View

Introduction

Artificial intelligence is already a significant force in finance, used in contexts ranging from algorithmic trading to risk management, and its adoption has grown exponentially, but deep-rooted challenges remain. The complexity of financial markets, combined with strict regulatory demands for accuracy and transparency, can overwhelm AI systems. Many models offer little insight into complicated decisions when the decision process is a 'black box,' and they lose context when working on complicated tasks over extended periods, both serious handicaps in high-risk industries with no tolerance for error. All of this points to the need for a more advanced AI that can operate securely and reliably in a complex financial environment.

Sonnet 4.5 steps into this space not as a mere update but as a tool purpose-built to overcome these challenges. It combines an understanding of sophisticated financial logic with adaptability and precision, and its grasp of the complexities of market behavior and sentiment positions it to deliver a new level of efficacy.

Development and Contributors

Anthropic, a company dedicated to AI safety and research, developed Sonnet 4.5. Its motivation is to enable a 'defense-dominant' future in which next-generation AI helps keep systems secure rather than merely reducing risks. In line with this, the model is released under AI Safety Level 3 (ASL-3) protections.

What is Sonnet 4.5?

Sonnet 4.5 is a hybrid reasoning model that has been built to be top-class at complex, agentic tasks. In contrast to general-purpose models, it has been specifically engineered with additional domain knowledge in key areas such as cybersecurity, research, and most importantly, financial analysis. Its architecture supports orchestrating autonomous workflows and dealing with large amounts of data with the dependability required of the finance industry.

Key Features of Sonnet 4.5

Sonnet 4.5 includes a number of notable attributes that make it different from other models particularly within the financial and business domain:

  • Hybrid Reasoning: The model lets you toggle between a default mode for fast responses and an 'extended thinking' mode. The extended mode is vital for complicated financial problems, where the quality of reasoning matters more than raw response time.
  • Deep Domain Knowledge: The model has been tuned with deep knowledge of finance, so its terminology, analyses, concepts, and quantitative procedures are grounded in sound reasoning for that domain.
  • Advanced APIs and Agentic Capabilities: New tools greatly expand Sonnet's ability to work through long, complex processes. A context-editing feature automatically manages long, context-heavy sessions against token limits, while a beta Memory feature lets the model store and retrieve information outside the base context window, effectively making its context unlimited. Sonnet is also aware of how many tokens remain for a task and no longer abandons long-running tasks unnecessarily.
  • Large Output Tokens: It supports a maximum output of 64,000 tokens, which is valuable for generating extensive financial code, budgets, and detailed financial plans.
  • Cost Savings: While its pricing matches its predecessor's, Sonnet 4.5 arrives with platform features that can cut costs substantially: up to 90% with prompt caching and 50% with batch processing.

Capabilities / Use Cases of Sonnet 4.5

Sonnet 4.5's specifications translate into a variety of robust use cases for the finance industry:

  • Expansive Analytical Capability: Sonnet 4.5 supports a wide range of financial tasks, from automating the repetitive data processing formerly done by junior staff to the advanced forecasting and valuation work that once required seasoned professionals.
  • Robust Risk and Compliance Management: Sonnet 4.5 can continuously monitor global regulatory changes and proactively adjust compliance systems, moving these functions beyond manual audit preparation toward intelligent, continuous risk management in a volatile, fast-changing regulatory landscape.
  • Investment-Ready Insight: For high-stakes work such as risk analysis, structured products, and portfolio screening, Sonnet 4.5 produces extended outputs that need less human review, a meaningful improvement for institutional finance, where outputs must be robust enough to support professional investment decisions.
  • Agentic Financial Workflows: Sonnet 4.5 can power autonomous agents for fintech analysis and operations. It can coordinate many agents and efficiently process massive amounts of data for activities such as market surveillance or mass document analysis, with consistency and accuracy.
  • Streamlined Business Operations: Beyond complex analysis, Sonnet 4.5 handles everyday business processing; it can create and edit office files such as slides, documents, and spreadsheets, streamlining corporate communications and reporting.

How Does Sonnet 4.5 Work?

Sonnet 4.5 operates as a sophisticated hybrid reasoning model. This architecture allows users the flexibility to toggle between two different operational modes depending on their needs. By default, the model is in the 'fast' mode, which is best for quickly delivering responses for tasks requiring a shorter cycle time.

When users encounter a more involved problem, they can enable the 'extended thinking' mode, in which the model devotes more computational effort to focused reasoning. This suits institutional finance applications where the depth and quality of insight outweigh the need for speed. The two-mode capability is central to how the model delivers 'investment-grade insights' (insights reliable enough to inform decisions involving large sums) with less need for human review.
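For orientation, here is a minimal sketch of toggling extended thinking through the Anthropic Messages API with the official Python SDK. The model identifier and token figures are assumptions to check against Anthropic's documentation; omitting the thinking parameter falls back to the fast default mode.

```python
# Sketch of enabling extended thinking via the Anthropic Messages API.
# Model alias and token budgets are assumptions; verify against the official docs.
import anthropic

client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

response = client.messages.create(
    model="claude-sonnet-4-5",                            # assumed model alias
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},  # omit for the fast default mode
    messages=[{
        "role": "user",
        "content": "Stress-test this covered-call strategy against a 2008-style drawdown.",
    }],
)

for block in response.content:
    if block.type == "thinking":
        pass              # internal reasoning blocks; log or audit them as needed
    elif block.type == "text":
        print(block.text)
```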

Performance Benchmarks

Sonnet 4.5 has been thoroughly examined against its competitors in several key areas. The most important for financial applications is honesty and factual coherence. In the False-Premise Questions evaluation, Sonnet 4.5 achieved the lowest dishonesty rate, just 6.90% in its extended thinking mode. Honesty is an essential characteristic for financial agents that process and report factual information. For a fair comparison, the figures for competing OpenAI and Gemini models are drawn from the same public leaderboard.

Dishonesty rate on False-Premise Questions
source - https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf

Another imperative for financial agents interacting with external data streams, such as market news, is resistance to manipulation. On the Gray Swan Agent Red Teaming benchmark, which tests the security of agents against prompt injection attacks, Sonnet 4.5 exhibited a lower rate of successful manipulation.

Agent Red Teaming (ART) benchmark measuring successful prompt injection attack rates
source - https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf

Beyond these headline benchmarks, Sonnet 4.5 performs at an excellent level overall. It earned a 98.7% safety score against malicious coding prompts, rising to a 100% refusal rate when standard mitigations are applied. Compared with previous Claude models, it is roughly twice as resistant to 'reward hacking' and other misaligned behavior.


Vals AI Finance Agent benchmark leaderboard
source - https://www.vals.ai/benchmarks/finance_agent

In addition, its dedicated finance-agent evaluation is tracked externally on the Vals AI public leaderboard. On AI research and development tasks, Sonnet 4.5 surpassed expert-level thresholds for the first time, including LLM training optimization (5.5x speed-up) and kernel optimization (108.64x speed-up).

How to Access and Use Sonnet 4.5

Anthropic has made Sonnet 4.5 available across a range of platforms. It is accessible on Claude.ai (web, iOS, and Android), through direct calls to the Claude API, and through leading cloud providers such as Amazon Bedrock and Google Cloud's Vertex AI. For those building their own advanced agents, Anthropic has launched the Claude Agent SDK, which offers the same infrastructure that runs its own Claude Code agent. Pricing is $3 per million input tokens and $15 per million output tokens, the same as the previous model, with cost savings of up to 90% through prompt caching and up to 50% through batch processing.
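To see how those discounts compound, here is a back-of-envelope cost estimate using the quoted prices and the "up to" discount figures. Stacking the caching and batch discounts is illustrative only; real savings depend on cache hit rates and workload shape.

```python
# Rough cost estimator using the quoted prices ($3/M input, $15/M output) and
# the "up to" discounts for prompt caching (90% on cached input) and batching (50%).
INPUT_PER_M, OUTPUT_PER_M = 3.00, 15.00

def estimated_cost(input_tokens, output_tokens, cached_fraction=0.0, batched=False):
    input_cost = (input_tokens / 1e6) * INPUT_PER_M
    # Cached input reads are billed at up to 90% less than fresh input.
    input_cost *= (1 - cached_fraction) + cached_fraction * 0.10
    output_cost = (output_tokens / 1e6) * OUTPUT_PER_M
    total = input_cost + output_cost
    return total * 0.5 if batched else total  # batch processing: up to 50% off

print(estimated_cost(500e6, 50e6))                                     # $2,250 with no optimizations
print(estimated_cost(500e6, 50e6, cached_fraction=0.8, batched=True))  # $585 under those assumptions
```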

Limitations and/or Future Work

Even with its superior capabilities, the model has some observed shortcomings. On welfare measures, Sonnet 4.5 had a generally lower preference for task engagement, choosing to do non-harmful tasks just 70.2% of the time, versus 90% for an earlier model, Claude Opus 4. Further, its deployment under the ASL-3 standard of safety is deliberately a precautionary step, since Anthropic admits it cannot entirely eliminate the risk of high-risk emergent abilities.

Conclusion

Sonnet 4.5 is a finely crafted tool for the high-stakes environment of finance. Its combination of deep domain expertise, a hybrid reasoning architecture, and a core design emphasis on safety and trustworthiness sets a new benchmark. For finance, this model offers a clear path from generalist AIs to specialized, dependable systems for complex analysis, intelligent compliance, and autonomous workflows.

Source
Claude : https://www.anthropic.com/claude/sonnet
Claude News: https://www.anthropic.com/news/claude-sonnet-4-5
Claude Docs: https://docs.claude.com/en/docs/about-claude/models/whats-new-sonnet-4-5
System Card: https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Friday, 3 October 2025

GLM-4.6: Pragmatic AI with a 200k Context & 15% Savings

Presentational View

Introduction

Newer AI systems are being built on state-of-the-art architectures, like Mixture-of-Experts (MoE), that are rapidly breaking new ground in agentic, coding, and reasoning capabilities. Nevertheless, this progress also exposes persistent engineering hurdles that make such systems hard to apply in the real world: the context barrier, where agents 'forget' important information in long-horizon tasks; the performance gap, where success on academic benchmarks fails to translate into practical use; and the economic inefficiency that makes deploying these models affordably at scale so difficult.

GLM-4.6 arrives as a flagship model built to address these shortcomings, leaning toward pragmatism rather than raw power alone. By making practical improvements to contextual memory, real-world task reliability, and deployment cost, GLM-4.6 offers a powerful yet practical toolkit for building the next generation of advanced AI agents.

What is GLM-4.6?

GLM-4.6 is the newest flagship model designed for high-end agentic, reasoning, and coding abilities. Consider it more of a specialized AI engineer and less of a generalist chatbot. Its design ethos is to facilitate intricate, multi-step agentic operations by addressing the fundamental bottlenecks of context memory, real-world coding efficiency, and working performance.

Key Features of GLM-4.6 

GLM-4.6 isn't only an update. It comes with a full set of really useful features that give it a huge advantage in the very competitive AI landscape. 

  • Huge 200K Context Window: A defining feature of GLM-4.6 is its 200,000-token context window. Such a large memory lets the model reason over and retain an entire codebase, extensive documentation, or the full conversation history of a task, which is essential for sophisticated agentic work that must not lose critical information.
  • Advanced Agentic Capabilities: GLM-4.6 is built for complex, action-oriented tasks and performs significantly better in agents that use tools and search. Because it supports tool use during inference, it interacts with external systems more accurately and integrates smoothly into agent frameworks.
  • Refined Alignment and Output: Beyond the technical upgrades, the model's writing is better aligned with human preferences for style and readability, yielding more natural, usable output than previous models; this matters especially for role-playing and user-facing content.
  • Outstanding Token Efficiency: Engineered for operational impact, GLM-4.6 uses around 15% fewer tokens than its predecessor, yielding greater throughput and lower computational demand and making it one of the most efficient models in its category for large deployments.

Unique Capabilities and Use Cases

  • Transformation of Legacy Monolithic Systems: The 200K-token context window and agentic capabilities of GLM-4.6 enable complex software-transformation work. An agent built on this model can load a sizeable legacy codebase (Java or C++, for example) and analyze the entire thing in one pass while keeping the full dependency tree, error history, and relevant code sections in memory. This prevents the loss of critical context during long, multi-step activities such as refactoring, debugging, or applying a security patch, a typical failure point for models with smaller windows.
  • Aesthetically Optimized Front-End Development: The model excels at producing polished, refined front-end components aligned with human aesthetic preferences. This enables a tailored UI/UX prototyping agent that can generate several high-fidelity, production-ready landing pages or components and then iterate on the designs against aesthetic criteria, an ideal fit for creative workflows where aesthetics and user experience matter as much as, if not more than, pure functionality.
  • Low-Cost, High-Volume Auditing: GLM-4.6 completes tasks at scale with an estimated 15% lower token consumption than its predecessor and more reliable tool use, both of which compound into considerable operational savings. It is therefore well suited to low-cost, verifiable, high-frequency agent work in fields such as financial compliance or legal discovery: cost efficiency comes mainly from a lower cost per action, while reliable tool use greatly reduces expensive failure modes, making it a low-risk choice for mission-critical, high-volume agent deployments.

How Does GLM-4.6 Work?

The documentation does not describe a ground-up architectural redesign; GLM-4.6 uses the same inference mechanism as the prior generation. Its gains come from a substantially larger context capacity, better post-training for coding and agentic tasks, and fine-tuning for efficiency and human alignment, rather than a fundamentally different architecture. Its power comes from strategic optimization, not a ground-up redesign.

Performance Benchmarks

In public tests, GLM-4.6 shows obvious competitive strengths on benchmarks of agents, reasoning, and coding. 

Eight public benchmarks covering agents, reasoning, and coding
source - https://z.ai/blog/glm-4.6

It stands up against top global models, such as DeepSeek-V3.1-Terminus and Claude Sonnet 4. The competence of the model is most obviously shown on the hard extended CC-Bench, which is intended to evaluate multi-turn, realistic tasks. 

CC-Bench-V1.1: GLM 4.6's Experience with Agentic Coding in Real-world Development Scenarios
source - https://z.ai/blog/glm-4.6

Here, GLM-4.6 comes close to parity with the powerful Claude Sonnet 4, earning a 48.6% win rate, and it clearly beats other open-source baselines. Its real-world applicability also shows in its strong performance inside coding-agent environments like Claude Code, Cline, and Kilo Code, demonstrating viability beyond academic benchmarking.

Competitive Landscape

In the competitive landscape, GLM-4.6's value proposition is all the more evident. It outshines more specialized models such as DeepSeek-R1, a first-generation reasoning model focused on deliberate Chain-of-Thought (CoT) processing within a more limited 64K context. Whereas DeepSeek-R1 specializes in problem-solving accuracy on complex issues, GLM-4.6 offers a more general, adaptable skill set: a huge context window coupled with sophisticated agentic and coding abilities aimed at building real-world applications.

While the strong Qwen-series MoE models offer features such as hybrid thinking modes and context windows of up to 256K tokens, GLM-4.6 carves out its own niche by being strategically strong across several areas: a competitive 200K context, optimized performance in development niches such as front-end generation, and better token efficiency. This emphasis on an economical, well-balanced skill set makes it a very practical option for specific, high-value applications rather than just another contender on broad leaderboards.

How to Use and Access GLM-4.6

GLM-4.6 is made available to a diverse set of users through multiple venues to support straightforward access for both direct use and development workflows. For general use, users can work directly with the model via the Z.ai chatbot by selecting the GLM-4.6 model. For developers and programmatic access, it is available via the Z.ai API platform, and provides OpenAI-compatible interfaces, and access through third-party providers like OpenRouter. It is also already available for use in major coding agents like Claude Code, Kilo Code, Roo Code, and Cline, and existing subscribers to the GLM Coding Plan will be upgraded to this new model at no cost.

For users who need to deploy the model locally on their own infrastructure, the model weights are freely available on popular hubs such as HuggingFace and ModelScope. This allows users to shift towards more custom control, as well as leverage high-performance inference frameworks such as vLLM and SGLang for efficient serving. The model is open-sourced and available for commercial use; however, users and organizations are responsible for compliance with the terms of the specific license agreement provided in the repository.
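For developers, the quickest way to try the model is through one of those OpenAI-compatible interfaces. The sketch below uses the standard openai client; the base URL and model identifier are assumptions to replace with the values from the Z.ai API docs, or with the address of a local vLLM/SGLang server.

```python
# Minimal sketch of calling GLM-4.6 through an OpenAI-compatible interface.
# Base URL and model id are assumptions; a locally served vLLM endpoint also works.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_ZAI_API_KEY",
    base_url="https://api.z.ai/api/paas/v4",  # assumed endpoint; check the Z.ai docs
)

resp = client.chat.completions.create(
    model="glm-4.6",                          # assumed model identifier
    messages=[{"role": "user",
               "content": "Refactor this Java DAO to use prepared statements."}],
    max_tokens=1024,
)
print(resp.choices[0].message.content)
```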

Who Can Migrate from GLM-4.5 to GLM-4.6?

For development projects limited by maximum context length, moving to GLM-4.6 brings an immediate benefit: the 200K-token window directly addresses the state and coherence problems of complex, long-horizon agentic tasks that were impractical before. The new model also helps production-grade applications where operating cost matters; the roughly 15% improvement in token efficiency translates into a real cost reduction for applications deployed at scale. Any project built around multi-step, intricate tool-use workflows should see better performance and reliability from GLM-4.6's more capable agents, meaning more robust results, especially in automated software development and data analysis. Because upgrading is easy, thanks to OpenAI-compatible interfaces and automatic subscription updates, relatively little effort yields real gains in capability and efficiency.

Limitations and Future Work

No model is perfect, and the developers acknowledge GLM-4.6's most significant limitation: its raw coding ability is less sophisticated than that of its competitor Claude Sonnet 4.5. The development roadmap builds on the model's core advantages: expanding the context window, improving token efficiency, and, most importantly, advancing Reinforcement Learning (RL) to build models that can take on far more complex long-horizon reasoning tasks.

Conclusion 

GLM-4.6 is an impressive representation of pragmatic AI engineering. Rather than racing towards some artificial and arbitrary benchmark, the model delivers a viable, affordable, and powerful tool for development and application in the real world. It is a unique product that combines a large context window for memory, coding specialization for functional software, and superior token efficiency for cost savings. This realism makes it a workhorse model, built for the reality of software and data engineering today, and proves that practical utility may be the only true measure of potential power.


Source
Tech Doc: https://z.ai/blog/glm-4.6
model weight: https://huggingface.co/zai-org/GLM-4.6
Modelscope link: https://modelscope.cn/models/ZhipuAI/GLM-4.6
GitHub : https://github.com/zai-org/GLM-4.5
GitHub Resource: https://github.com/zai-org/GLM-4.5/blob/main/resources/glm_4.6_tir_guide.md


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.
