
Thursday, 19 June 2025

Microsoft's GUI-Actor: A New Coordinate-Free Method for AI Agents

Presentational View

Introduction

The quest to build AI that can operate software as fluently as a human is one of the most practical frontiers in technology. Graphical User Interface (GUI) agents promise to automate much of our digital lives, yet their development has been held back by a fundamental problem: visual grounding. For years, the task of getting an AI to reliably link a command such as 'save file' to the right icon was handled by having it output explicit screen coordinates. That approach has always been the weak link of automation: it is brittle, it breaks at the first change in layout, and it fails to capture the dynamic nature of modern interfaces.

GUI-Actor, an important breakthrough, rethinks this paradigm at its foundation. Rather than forcing an AI to think in brittle numerical coordinates, it introduces a coordinate-free architecture that mirrors human perception: we simply see a target and act on it. This more natural approach offers an intuitive, resilient, and generalisable route to truly autonomous interaction, and GUI-Actor's architecture goes a long way toward solving these core challenges while setting a template for the next generation of AI agents.

Development and Contributors

GUI-Actor is the outcome of a collective effort, created chiefly by researchers at Microsoft Research, with contributions from Nanjing University and the University of Illinois Urbana-Champaign. The team observed that humans do not compute exact screen coordinates before acting; they notice the target and click on it. This innate human behavior guided the design of GUI-Actor, which moves beyond conventional coordinate-based grounding toward a more natural and efficient interaction paradigm for AI agents.

What is GUI-Actor?

GUI-Actor is a Vision-Language-Action (VLA) model that learns to interpret natural language instructions and act on software interfaces autonomously, working directly from screenshots. It is a Vision-Language Model (VLM) augmented with an attention-based action head and a dedicated grounding token, allowing it to observe and act in GUI environments without ever computing explicit screen coordinates.

Model Variants 

GUI-Actor comes in several variants to accommodate the different performance needs and computational budgets of developers. They are distinguished mainly by their scale and training method.

  • Core Models: Available in 2B, 3B, and 7B sizes, the core models build on strong Vision-Language Model (VLM) backbones such as Qwen2-VL and Qwen2.5-VL to achieve state-of-the-art performance.
  • LiteTrain version: For maximum efficiency, developers can use the GUI-Actor-LiteTrain option, which freezes the parameters of the base VLM and trains only the newly introduced action head (roughly 100M learnable parameters in the 7B model). This delivers effective GUI grounding without altering or retraining the general-purpose knowledge already in the VLM; a minimal sketch of this setup appears after this list.
  • Grounding Verifier: Separate from the core and LiteTrain models is GUI-Actor-Verifier-2B, a lightweight module that can be paired with any GUI-Actor model to check proposed action locations. It acts as a refinement layer over the candidate actions the main model produces.
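
To make the LiteTrain idea concrete, here is a minimal PyTorch-style sketch of freezing a backbone and training only an action head. The function and argument names are illustrative assumptions, not the repository's actual API.

```python
import torch

def build_litetrain_optimizer(vlm_backbone: torch.nn.Module,
                              action_head: torch.nn.Module,
                              lr: float = 1e-4) -> torch.optim.Optimizer:
    # Freeze every parameter of the pretrained backbone so its
    # general-purpose knowledge is left untouched.
    for param in vlm_backbone.parameters():
        param.requires_grad = False

    # Only the newly added action-head parameters (~100M in the 7B setup,
    # per the article) receive gradient updates.
    trainable = [p for p in action_head.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```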

Key Features of GUI-Actor

GUI-Actor's design offers valuable advantages for digital interaction automation:

  • Human-Like Grounding: Its characteristic feature is a new coordinate-free method. It bypasses the restrictions of conventional coordinate generation by aligning language instructions with screen areas directly.
  • Handles Ambiguity Well: GUI interactions tend to be ambiguous (any portion of a large button is an acceptable click target). The model learns to accept all portions of the target element as correct, avoiding over-penalisation and enhancing learning.
  • Decision Refining with a Grounding Verifier: Performance is further improved by an optional verifier that serves as a final check, confirming that a suggested action properly aligns with the user's intention.
  • Effective Candidate Generation: The model can surface several candidate action regions in a single forward pass at no extra computational cost, increasing the likelihood of hitting the right target (see the sketch below).
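
Here is a hedged sketch of how several candidates can be read off a single attention map without additional forward passes; the tensor shape and function name are assumptions made for illustration.

```python
import torch

def top_k_candidate_patches(attention_map: torch.Tensor, k: int = 5):
    """Pick the k highest-scoring patch indices from one attention map.

    `attention_map` is assumed to be a 1-D tensor of per-patch scores
    produced by a single forward pass of the action head.
    """
    scores, indices = torch.topk(attention_map, k)
    # Each index identifies an image patch; mapping it back to pixel
    # coordinates (e.g., the patch centre) is left to the caller.
    return list(zip(indices.tolist(), scores.tolist()))

# Example: 1,024 patches scored in one pass, take the 3 best candidates.
candidates = top_k_candidate_patches(torch.rand(1024), k=3)
```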

Capabilities and Use Cases of GUI-Actor

GUI-Actor turns its distinctive architecture into strong real-world capabilities:

  • End-to-End Workflow Automation: Converts high-level natural language instructions to direct, human-readable actions on any application's interface, automating intricate digital workflows solely from rendered screenshots.
  • Cross-Platform System Integration: Runs as a single, cohesive agent on various operating systems (Windows, macOS) and environments (mobile, web), reducing the necessity for platform-specific automation scripts and lessening development overhead.
  • Proven Success in Complicated Situations: Demonstrates practical success in complicated, multi-step workflows, achieving a top-ranked task success rate on real-world challenges such as OSWorld-W that require completing ambiguous, multi-step tasks.
  • High-Confidence Action Execution: Utilizes a two-phase process in which its attention head generates several action candidates and an extremely efficient grounding verifier determines the most likely one, infusing that vital layer of trustworthiness with a low computational cost.
  • Elegant Management of UI Uncertainty: Its multi-patch supervision during training teaches it that an entire element (e.g., a whole button) is a legitimate target, in keeping with how interfaces are actually designed, and avoids the brittleness of single-point prediction frameworks.

Technical Details

GUI-Actor's core innovation is its coordinate-free design. It is built on top of existing Vision-Language Models (VLMs) but adds a dedicated grounding token to the vocabulary. When the model processes an instruction and a screenshot, this token serves as a conceptual anchor for the action: the model is trained to produce the token instead of coordinate strings, and the token's final hidden state acts as a query that picks out the appropriate visual target, effectively turning grounding from a regression problem into an attention-based alignment problem.

Overview of GUI-Actor
source - https://www.arxiv.org/pdf/2506.03143

The action head that performs this alignment is attention-based. First, visual patch features from the screenshot pass through a self-attention layer so that related patches (e.g., different parts of the same button) can share contextual information. The grounding token's representation and the patch features are then projected into a shared space, and attention computed between the token and the patches yields an attention map highlighting the most appropriate screen region for the action. Training uses multi-patch supervision, in which every image patch inside a ground-truth bounding box is treated as a positive sample, giving a dense and spatially aware learning signal.
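
The following is a simplified PyTorch sketch of such an attention-based action head with a multi-patch loss, written under our own assumptions about tensor shapes and loss formulation; it illustrates the idea rather than reproducing the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionActionHead(nn.Module):
    """Scores every visual patch against the grounding token's hidden state
    instead of regressing x/y coordinates. Layer sizes are illustrative."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        # Self-attention lets neighbouring patches (parts of the same
        # button, say) share context before scoring.
        self.patch_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj_token = nn.Linear(dim, dim)
        self.proj_patch = nn.Linear(dim, dim)

    def forward(self, token_state, patch_feats):
        # token_state: (B, dim) hidden state of the grounding token
        # patch_feats: (B, N, dim) visual patch features from the VLM
        patch_feats, _ = self.patch_self_attn(patch_feats, patch_feats, patch_feats)
        q = self.proj_token(token_state).unsqueeze(1)                      # (B, 1, dim)
        k = self.proj_patch(patch_feats)                                   # (B, N, dim)
        scores = (q @ k.transpose(1, 2)).squeeze(1) / k.shape[-1] ** 0.5   # (B, N)
        return scores  # attention map over patches

def multi_patch_loss(scores, target_mask):
    # target_mask: (B, N) float tensor with 1 for every patch inside the
    # ground-truth bounding box; all of them are treated as positives.
    log_probs = F.log_softmax(scores, dim=-1)
    target = target_mask / target_mask.sum(dim=-1, keepdim=True)
    return -(target * log_probs).sum(dim=-1).mean()
```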

For high reliability, a lightweight grounding verifier serves as the final decision layer. This independent, compactly trained VLM module takes a candidate location proposed by the action head, marks it on the screenshot, and predicts a 'True' or 'False' label according to whether the marked area satisfies the instruction's intent. At inference time, the top candidates from the attention map are checked by the verifier in order until the first one that meets a high-confidence criterion is selected. This refinement step significantly improves accuracy at little computational cost.
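
A minimal sketch of that selection loop is shown below; the `verify` callable stands in for the verifier model, and the threshold value and fallback behaviour are assumptions.

```python
from typing import Callable, Iterable, Optional, Tuple

def select_verified_action(candidates: Iterable[Tuple[int, float]],
                           verify: Callable[[int], float],
                           threshold: float = 0.9) -> Optional[int]:
    """Walk candidates in attention-score order and return the first one the
    verifier accepts with high confidence; fall back to the top-ranked one."""
    best_patch = None
    for patch_index, _attn_score in candidates:
        if best_patch is None:
            best_patch = patch_index               # remember the fallback
        # `verify` would mark this region on the screenshot and return the
        # probability that it matches the instruction.
        if verify(patch_index) >= threshold:
            return patch_index
    return best_patch
```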

Performance Evaluation

The strongest validation of GUI-Actor's performance comes from the ScreenSpot-Pro benchmark, a difficult testbed featuring high-resolution interfaces and pronounced domain shifts. As shown in the table below, the GUI-Actor-7B model posted a strong average accuracy of 40.7%, rising to 44.2% with the addition of its grounding verifier. The result is especially significant because it surpasses much larger models, such as UI-TARS-72B at 38.1%. The relevance of this test lies in its out-of-distribution nature: by performing well on professional software unseen during training, GUI-Actor demonstrates stronger generalization of its spatial-semantic alignment and shows that its coordinate-free approach is more resilient and scalable than standard coordinate-generation methods.

Performance comparison on ScreenSpot-Pro
source - https://www.arxiv.org/pdf/2506.03143

In a second significant test on the established ScreenSpot benchmark, which spans a wide variety of mobile, desktop, and web interfaces, GUI-Actor again delivered industry-leading performance. As the table below shows, the GUI-Actor-7B model achieved an average accuracy of 88.3%, rising to 89.7% with the verifier. This places it in the elite class of models, surpassing top 72B-parameter models such as Aguvis-72B and UGround-V1-72B, and competing closely with UI-TARS-7B. The test matters because it confirms the model's performance in the digital environments users engage with every day and shows that its attention-based mechanism is not just resilient but also adaptable across GUI structures and platforms.

Performance comparison on ScreenSpot
source - https://www.arxiv.org/pdf/2506.03143

Beyond these headline benchmarks, additional experiments confirm GUI-Actor's strengths. The model consistently outperforms rivals on ScreenSpot-v2, a refined and more accurate version of its predecessor. It also demonstrates remarkable sample efficiency, reaching its peak accuracy with only about 60% of the training data used by baseline models, which testifies to the strength of its explicit spatial supervision. Additionally, the lightweight version with a frozen VLM backbone still compares favorably with fully fine-tuned models, showing that it can improve existing VLMs without expensive retraining. Finally, in a live online evaluation on the OSWorld-W benchmark, GUI-Actor achieved the highest task success rate, confirming its practical usability and strong performance on difficult, multi-step tasks.

GUI-Actor Vs ShowUI Vs UI-TARS

Among today's leading GUI agents, GUI-Actor, ShowUI, and UI-TARS represent distinct technical philosophies. UI-TARS follows the traditional, and often brittle, path of generating explicit text coordinates for actions. ShowUI carves its niche by prioritizing efficiency, intelligently reducing visual data from screenshots to accelerate processing. GUI-Actor, however, sets itself apart by pioneering a coordinate-free, human-like methodology that fundamentally changes how an agent perceives its target.

This architectural difference is key. GUI-Actor’s use of an "attention head" to directly highlight a target is inherently more robust than the fragile single-point prediction of UI-TARS and provides a more direct perception-to-action link than the pre-processing focus of ShowUI. This advantage allows GUI-Actor to excel where others struggle, particularly on new or complex interfaces, establishing its coordinate-free design as a leading choice for building the next generation of truly capable and intuitive GUI agents.

Accessibility and Licensing

GUI-Actor is an open-source initiative designed for wide accessibility. The code, models, and related data are publicly hosted in the microsoft/GUI-Actor GitHub repository, which is the main portal for setup guidance and project resources. Pre-trained checkpoints of different sizes, including the verifier, are published on Hugging Face for easy integration. Released under an MIT license, GUI-Actor is suitable for both research and commercial use, inviting free usage and contributions from the AI community.
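
For reference, a hedged sketch of fetching the published 7B checkpoint with the huggingface_hub library is shown below; the actual inference workflow should follow the instructions in the GitHub repository.

```python
from huggingface_hub import snapshot_download

# Downloads the published 7B checkpoint locally; running the model itself
# should follow the setup steps in the microsoft/GUI-Actor repository.
local_dir = snapshot_download(repo_id="microsoft/GUI-Actor-7B-Qwen2.5-VL")
print(local_dir)
```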

Limitations

While GUI-Actor improves on existing solutions, it has its own limitations. Its reliance on the VLM's fixed-size visual patches (e.g., 28x28 pixels) can make precise interaction with very small UI elements difficult, for example icons smaller than about 10x10 pixels. This could be problematic for high-precision control in professional environments such as CAD packages. And, like all current GUI agents, it faces the broader challenge of generalising to entirely new or dynamic situations not covered by its training data.

Conclusion

GUI-Actor marks a turning point in GUI automation. By moving past the constraints of coordinate-based systems, its human-inspired strategy, built on an attention-based action head and a light verifier, is more intuitive, more robust, and more efficient. Its state-of-the-art results, particularly on difficult real-world benchmarks, underscore its strong capabilities.


Source
Project details: https://microsoft.github.io/GUI-Actor/
Tech paper: https://www.arxiv.org/pdf/2506.03143
GitHub Repo: https://github.com/microsoft/GUI-Actor
Model Weights: https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2.5-VL


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

Wednesday, 11 June 2025

Mistral AI Magistral: Elite Reasoning, Fast Throughput, Open-Source

Presentational View

Introduction

From basic task automation to sophisticated cognitive processes that begin to resemble human deliberation, artificial intelligence has traveled an astonishing distance. Amid this fast-paced progress, we have seen the emergence of AI agents and systems that do not merely process information but are beginning to reason about it. This transition from predictive text generation to systematic, step-by-step problem-solving is a turning point in the effort toward artificial general intelligence.

For years, the development of AI reasoning models has been hindered by major obstacles. Early models tended to be too general and lacked the deep specialization needed for domain-specific problems, leaving them capable generalists in a world that increasingly demands experts. They also lacked transparency, presenting conclusions from a 'black box' that made their outputs hard to trust or audit, a major hurdle to adoption in high-stakes, regulated domains. In addition, authentic multilingual reasoning still lagged behind, with most models unable to keep their logic consistent once they worked outside English.

It is here, at the point where progress meets challenge, that Mistral AI presents its revolutionary model, Magistral. Magistral is not an incremental advance; it is a direct answer to these enduring constraints, designed to deliver profound expertise, provable transparency, and solid multilingual flexibility, thus advancing the boundary of what is possible for AI.

What is Magistral?

Magistral is a pioneering reasoning model, carefully crafted to excel at domain-specific, transparent, and multilingual reasoning. It is designed to amplify human thinking, tackling complex problems with a degree of precision and deliberation that sets a new benchmark.

Model Variants

In acknowledgment of the varied requirements of the AI community, Mistral AI released Magistral in two forms: Magistral Small, a high-end 24-billion-parameter version, and Magistral Medium, a more powerful enterprise-oriented model. This dual release reflects a central philosophy of enabling real-world reasoning while encouraging a loop of iterative improvement driven by community and enterprise feedback.

Key Features of Magistral

Magistral separates itself with a set of advanced features engineered for better, real-world reasoning:

  • Transparent, Step-by-Step Reasoning: Optimized for multi-step reasoning, the model gives a transparent, easily traceable thought process in the user's own language, so its conclusions are completely auditable and simple to trust.
  • Unparalleled Velocity and Productivity: Magistral Medium delivers token throughput up to 10 times faster than most competitors, particularly with "Flash Answers" in the Le Chat interface, enabling real-time reasoning at practical scale.
  • High-Fidelity Multilingual Reasoning: One of the key design principles is to reason natively in many languages, such as English, French, Spanish, German, Italian, Arabic, and others, so that the chain-of-thought and the final answer can be preserved in the user's language.
  • Unexpectedly Robust Multimodal Capabilities: Strikingly, Magistral performs strongly on multimodal tests even though it was trained on text-only data, suggesting that its reasoning mechanism transfers across data types.

Capabilities and Use Cases of Magistral

Magistral's deep capabilities open up uses where accuracy, depth, and clarity are an absolute requirement:

  • Problem-Solving: Perfect for any task requiring intensive thinking and detail beyond ordinary LLMs, from sophisticated financial projections to complex planning of software development.
  • Business Strategy and Operations: Business-oriented, it can address sophisticated tasks such as multi-factor risk modeling or determining optimum logistics under diverse constraints.
  • Auditable AI for Regulated Industries: Lawyers, finance professionals, and healthcare providers can use Magistral's traceable reasoning to satisfy strict compliance needs since each conclusion is able to be proven step-by-step.
  • Advanced Code and Systems Engineering: The model shines at augmenting development pipelines, from high-level architecture planning to sophisticated data engineering work requiring external tools and APIs, and thus serves as a formidable tool for constructing agentic systems.
  • Creative and Content Partnership: Initial trials find it to be a first-rate creative collaborator, able to create coherent and, when wanted, wonderfully quirky stories for storytelling and content purposes.

How does Magistral Work?

Magistral's performance rests on a sophisticated technical architecture built on its forebears, Mistral Small 3 and Mistral Medium 3. As shown in the figure below, the two models took different training paths. Magistral Medium was trained with an RL-only approach from scratch, a major departure from pipelines that rely on data distilled from larger models.

Overview of the filtering, training and RL stages
source - https://mistral.ai/static/research/magistral.pdf

By comparison, Magistral Small was 'cold-started' with Supervised Fine-Tuning (SFT) before being further trained with the same RL process. At the center of this RL phase lies a highly scalable pipeline built around an adapted version of the Group Relative Policy Optimization (GRPO) algorithm. Technical adjustments, including the removal of the KL-divergence term and the use of a 'Clip-Higher' strategy, relax the training constraints and encourage the model to explore.
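
As a rough illustration of these ideas, the sketch below shows textbook group-relative advantages and a PPO-style clipped objective with an asymmetric ("Clip-Higher") range and no KL term; the epsilon values and normalisation details are assumptions, not Magistral's exact settings.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages: normalise each reward against its own group.

    `rewards` has shape (groups, samples_per_group); each row is a group of
    completions sampled for the same prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

def clipped_policy_objective(log_ratio, advantages,
                             eps_low: float = 0.2, eps_high: float = 0.28):
    """PPO-style clipped objective with a 'Clip-Higher' asymmetric range.

    A larger upper bound (eps_high > eps_low) lets low-probability tokens
    grow more freely, encouraging exploration. No KL penalty is added,
    mirroring the reported removal of the KL term.
    """
    ratio = log_ratio.exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * advantages
    return torch.minimum(unclipped, clipped).mean()
```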

A central part of the training is reward shaping, where model responses are scored along four dimensions: format, correctness, length, and language consistency. Reward is granted specifically for mathematical or code correctness, while a soft penalty discourages overly long responses. To preserve multilingual fidelity, an additional reward is given when both the thinking process and the final response remain in the user's input language.
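
A toy composition of those four signals might look like the sketch below; the weights, penalty slope, and field names are illustrative assumptions, as the paper does not publish this exact function.

```python
def shaped_reward(response: dict) -> float:
    """Toy combination of the four reward dimensions described above.

    `response` is a hypothetical dict summarising one model completion.
    """
    reward = 0.0
    reward += 0.1 if response["follows_format"] else -1.0        # format
    reward += 1.0 if response["is_correct"] else 0.0             # correctness
    # Soft length penalty: only responses over the budget are penalised.
    overflow = max(0, response["num_tokens"] - response["length_budget"])
    reward -= 0.001 * overflow                                   # length
    # Thinking and answer stay in the user's input language.
    reward += 0.1 if response["language_consistent"] else 0.0    # consistency
    return reward
```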

The whole process is orchestrated by a distributed framework that runs Trainers, Generators, and Verifiers in a loop. Generators produce text completions, Verifiers score them against the reward criteria, and Trainers use the results to update the model. A notable innovation of this pipeline is that the generators run asynchronously, operating at full throughput without stalling the trainers, which maximizes efficiency and performance.
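
That asynchronous decoupling can be illustrated with a toy asyncio loop, shown below; it mimics generators feeding a trainer through a queue and is purely a conceptual sketch under our own simplifications, not Mistral's infrastructure.

```python
import asyncio

async def generator(queue: asyncio.Queue, gen_id: int):
    # Produces completions continuously; it never waits for the trainer.
    while True:
        await queue.put(f"completion-from-generator-{gen_id}")
        await asyncio.sleep(0.02)  # stand-in for decoding time

async def trainer(queue: asyncio.Queue, verify):
    # Consumes completions as they arrive; the verifier assigns a reward,
    # and a gradient step would follow (omitted in this toy loop).
    while True:
        completion = await queue.get()
        _reward = verify(completion)

async def main():
    queue: asyncio.Queue = asyncio.Queue(maxsize=64)
    tasks = [asyncio.create_task(generator(queue, i)) for i in range(4)]
    tasks.append(asyncio.create_task(trainer(queue, verify=lambda c: 1.0)))
    await asyncio.sleep(0.5)  # let the toy loop run briefly
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)

asyncio.run(main())
```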

Performance Evaluation

Magistral's performance on a variety of metrics cements its place as an important emerging leader in the space of reasoning AI.

Results of Magistral Medium trained solely with RL
source - https://mistral.ai/static/research/magistral.pdf

Magistral Medium registered a remarkable 73.6% (pass@1) on the AIME-24 benchmark, a whopping 50% improvement in accuracy from its base model, Mistral Medium 3. With majority voting, its accuracy on AIME-24 jumped to 90.0%, putting it strongly on par with models such as DeepSeek-R1-Zero. In addition, on the text portion of Humanity's Last Exam, Magistral Medium scored 9.0, a bit better than DeepSeek-R1. It also performed strongly on other tests, including GPQA and LiveCodeBench v5.

Performance of Magistral Small across various benchmarks.
source - https://mistral.ai/static/research/magistral.pdf

Magistral Small also performed well, attaining 70.7% on AIME-24 and 83.3% using majority voting. Interestingly, the combination of SFT on reasoning traces followed by RL training for Magistral Small resulted in a gain of more than 5 points on different benchmarks over SFT or RL individually. This flatly contradicts earlier research conclusions claiming RL alone may not significantly improve smaller models.

In addition to quantitative metrics, Magistral's RL training on text-only data surprisingly retained, and even extended, its multimodal comprehension, instruction following, and function-calling abilities. The model also displayed excellent cross-domain generalization, performing strongly on tasks outside its main training domain (e.g., gains in code performance resulting from math-only training).

For multilingual tasks, although Magistral Medium kept high-fidelity reasoning across different languages, it saw a modest performance drop of 4.3-9.9% on multilingual versions of the AIME 2024 benchmark relative to its English performance. Importantly, this drop mirrors that of the base model, and the model still carries out both its reasoning and its final answer in the input language.

How to Use and Access Magistral

Mistral AI has made Magistral widely available to developers and businesses alike. Magistral Small is an open-weight model released under the permissive Apache 2.0 license and downloadable from Hugging Face. Once quantized, it fits on a single RTX 4090 GPU or a 32GB MacBook, putting strong reasoning within reach of solo developers. A preview of Magistral Medium is available in Mistral AI's conversational platform, Le Chat, and through the API on La Plateforme, with integrations on major cloud marketplaces such as Amazon SageMaker, IBM WatsonX, Azure AI, and Google Cloud Marketplace.
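
As a hedged example, the open weights can be served locally with an inference engine such as vLLM; the snippet below is a minimal sketch, and depending on the vLLM version, Mistral-specific loading options may be required (check the model card for the recommended settings).

```python
from vllm import LLM, SamplingParams

# Minimal sketch: load the open-weight Magistral Small checkpoint with vLLM.
# Some vLLM versions may need Mistral-format options (e.g., tokenizer_mode);
# see the Hugging Face model card for the exact recommended configuration.
llm = LLM(model="mistralai/Magistral-Small-2506")
params = SamplingParams(temperature=0.7, max_tokens=2048)

prompt = "Think step by step: what is the sum of the first 50 odd numbers?"
for output in llm.generate([prompt], params):
    print(output.outputs[0].text)
```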

Limitations and Future Work

Mistral AI is open about Magistral's current limits. A practical limitation is its context window; although it can handle 128k tokens, performance tends to degrade on tasks that demand sustained focus beyond 40k tokens. As noted, there is also some drop in performance on translated reasoning tests relative to English, which points to an area for future optimization. Going forward, Mistral AI aims to push the boundaries of what is achievable with RL. Its research agenda includes investigating better loss functions, realizing the promise of bootstrapping models on their own reasoning traces, and extending these techniques to advanced tool use, seamless multimodality, and more capable AI agents.

Conclusion

Magistral is more than an incremental advance; it is a fundamental shift in AI reasoning. Its pioneering RL-driven training is a technical innovation in its own right, demonstrating that compact models can deliver top-tier, explainable performance. For accountability-driven industries, it provides the auditable, step-by-step reasoning that turns AI from an impenetrable 'black box' into a trusted collaborator. Magistral offers a compelling vision of a future in which AI does not merely deliver answers but collaborates with a clarity that inspires genuine trust and genuinely extends our own capacities. Mistral AI is certainly at the vanguard.

Source
Blog: https://mistral.ai/news/magistral
Tech document: https://mistral.ai/static/research/magistral.pdf
Model: https://huggingface.co/mistralai/Magistral-Small-2506


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.
