Agent S: An Open Source Agent Framework for Operating Computers Like Humans


General Introduction

Agent S is an open source framework developed by Simular AI that lets agents operate computers the way humans do, through a graphical user interface (GUI). It uses multimodal large language models and experience-based learning to perform tasks such as browsing the web, editing documents, and using software. The project is open-sourced on GitHub and has an active developer community. The Agent S1 paper was accepted at ICLR 2025, and Agent S2 was released in March 2025, reportedly outperforming comparable systems from OpenAI and Anthropic. It supports macOS, Windows, and Linux and is suitable for office automation, software testing, and AI research.


Function List

Graphical User Interface (GUI) operation: Simulates mouse and keyboard input to interact with computer software.
Task planning: Splits complex tasks into small steps and executes them automatically.
Learning from experience: Learns from past tasks to improve efficiency.
Cross-platform support: Runs on macOS (with a one-click installation package), Windows, and Linux.
Multimodal input: Combines screenshots and interface elements for precise operation.
Open source customization: Source code and documentation are provided and can be freely adapted by developers.
Knowledge base updates: Experience data is updated continuously at runtime to improve the agent's capability.

Usage Guide

Agent S is an open source tool aimed at developers, and installing and using it requires some programming experience. The steps and feature notes below help you get started quickly.

Installation process

Preparing the environment
- Install Python 3.9 through 3.12.
- Install Git for downloading code.
- Optional: Prepare a virtual machine (such as VMware) for testing or isolating the environment.
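
The prerequisites above can be checked before installing anything. The following is a minimal sketch, not part of Agent S: the helper names `python_version_supported` and `check_prerequisites` are made up for illustration, and the version range is simply the 3.9 through 3.12 range stated in the list.

```python
import shutil
import sys

# Supported range taken from the prerequisite list: Python 3.9 through 3.12.
MIN_VERSION = (3, 9)
MAX_VERSION = (3, 12)


def python_version_supported(version):
    """Return True if the (major, minor) part of `version` is in range."""
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


def check_prerequisites():
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = []
    if not python_version_supported(sys.version_info):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python {found} is outside the 3.9 to 3.12 range")
    if shutil.which("git") is None:
        problems.append("git was not found on PATH")
    return problems


if __name__ == "__main__":
    for line in check_prerequisites() or ["All prerequisites look fine."]:
        print(line)
```
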
Download the code
- Open a terminal and run:
```
git clone https://github.com/simular-ai/Agent-S.git
```
- Go to the project directory:
```
cd Agent-S
```

Installing dependencies

Create a virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # macOS/Linux
venv\Scripts\activate     # Windows

Install the core library:
```
pip install gui-agents
```

Set environment variables (for example, API keys):

export OPENAI_API_KEY=<your_key>
export ANTHROPIC_API_KEY=<your_key>
export HF_TOKEN=<your_hugging_face_token>
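
Before launching the agent it can help to confirm which of these variables are actually set. This is a small sketch under the assumption that only the three variable names shown above matter; which ones are really required depends on the model provider you use, and `missing_variables` is a hypothetical helper, not an Agent S function.

```python
import os

# Variable names taken from the export commands above.
API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "HF_TOKEN")


def missing_variables(names=API_KEY_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All listed variables are set.")
```
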

Starting Agent S
- Run Agent S1 or S2:
```
agent_s1  # run Agent S1
agent_s2  # run Agent S2
```
- Once launched, enter the task to get started.

Main Functions

Graphical User Interface (GUI) operation

Description: Simulates human operation using screenshots and interface recognition.
Procedure:
1. Run agent_s2.
2. Enter the task: "Open Notepad and type 'Hello'."
3. Agent S2 locates the Notepad icon, clicks it to open the application, and types the text.
4. Press Ctrl+C to stop at any time.

Task planning

Description: Breaks complex tasks into small steps and completes them one at a time.
Procedure:
1. Enter the task: "Send an e-mail to a friend."
2. Agent S2 automates the process: it opens the mail program, creates a new message, fills in the content, and clicks Send.
3. You can follow the log for each step in the terminal.
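
The decompose-then-execute pattern described above can be pictured with a toy example. This sketch is purely illustrative and does not reflect Agent S's internal classes: `Step`, `Plan`, `plan_email_task`, and `run_plan` are hypothetical names, the planner is hard-coded to the four e-mail steps listed above, and the executor is a stand-in callable rather than real GUI control.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    description: str
    status: str = "pending"  # becomes "ok" or "failed" after execution


@dataclass
class Plan:
    task: str
    steps: List[Step] = field(default_factory=list)


def plan_email_task(task: str) -> Plan:
    """Hypothetical planner that returns the four steps named in the text."""
    descriptions = [
        "open the mail program",
        "create a new message",
        "fill in the content",
        "click send",
    ]
    return Plan(task=task, steps=[Step(d) for d in descriptions])


def run_plan(plan: Plan, execute: Callable[[Step], bool]) -> List[str]:
    """Execute steps in order, stop at the first failure, return log lines."""
    log = []
    for number, step in enumerate(plan.steps, start=1):
        step.status = "ok" if execute(step) else "failed"
        log.append(f"[{number}/{len(plan.steps)}] {step.description}: {step.status}")
        if step.status == "failed":
            break
    return log


if __name__ == "__main__":
    plan = plan_email_task("Send an e-mail to a friend.")
    # Stand-in executor that always succeeds; a real one would drive the GUI.
    for line in run_plan(plan, execute=lambda step: True):
        print(line)
```
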

Learning from experience

Description: Records how each task was carried out and uses that record to optimize later operations.
Procedure:
1. After a task is completed, the experience is saved in the gui_agents/kb folder.
2. Running a similar task again is more efficient.
3. Developers can inspect the knowledge base files to see what was learned.
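
To make the idea of "save experience, then reuse it for similar tasks" concrete, here is a minimal sketch. It is an assumption-laden toy: the JSON layout, the word-overlap similarity, and the helper names `save_experience` and `most_similar` are invented for illustration and are not the format or retrieval method used in gui_agents/kb.

```python
import json
from pathlib import Path


def save_experience(kb_file: Path, task: str, steps: list) -> None:
    """Append one finished task and its steps to a JSON knowledge-base file."""
    records = json.loads(kb_file.read_text()) if kb_file.exists() else []
    records.append({"task": task, "steps": steps})
    kb_file.write_text(json.dumps(records, indent=2))


def most_similar(kb_file: Path, task: str):
    """Return the stored record sharing the most words with `task`, or None."""
    if not kb_file.exists():
        return None
    records = json.loads(kb_file.read_text())
    query = set(task.lower().split())

    def overlap(record):
        # Jaccard similarity between the two word sets.
        words = set(record["task"].lower().split())
        union = query | words
        return len(query & words) / len(union) if union else 0.0

    best = max(records, key=overlap, default=None)
    return best if best is not None and overlap(best) > 0 else None
```

A retrieved record could then be handed to the planner as a hint, which is one plausible way a repeated task becomes cheaper than the first run.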

Featured Functions

Cross-platform support

Description: Supports the three major operating systems.
Procedure:
1. Windows requires pywin32 and pywinauto.
2. macOS requires pyobjc; install it with pip install pyobjc.
3. On Linux, check pyautogui compatibility; permissions may need to be adjusted.
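
The per-platform requirements above can be looked up programmatically. This is a sketch only: the package names are the ones mentioned in the list, while the `PLATFORM_PACKAGES` table and `packages_for` helper are hypothetical and not part of Agent S.

```python
import platform

# Package names as mentioned in the list above. platform.system() reports
# macOS as "Darwin". On Linux the text only says to check pyautogui.
PLATFORM_PACKAGES = {
    "Windows": ["pywin32", "pywinauto"],
    "Darwin": ["pyobjc"],
    "Linux": ["pyautogui"],
}


def packages_for(system=None):
    """Return the packages noted for the given (or current) operating system."""
    system = platform.system() if system is None else system
    return PLATFORM_PACKAGES.get(system, [])


if __name__ == "__main__":
    print(platform.system(), "->", ", ".join(packages_for()) or "no entry")
```
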

Multimodal input

Description: Combines image and interface data to improve operating accuracy.
Procedure:
1. Enter the task: "Search for 'weather' in your browser."
2. Agent S2 analyzes the screen, finds the browser window, and enters the search term.
3. The results are displayed automatically.

Knowledge Base Download

Description: Agent S2 uses a pre-trained knowledge base and supports offline operation.
Procedure:
1. On first launch, the knowledge base is downloaded automatically from GitHub Releases.
2. To download it manually, for example:
```
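from gui_agents.utils import download_kb_data  # import path assumed from the project README; verify for your version
download_kb_data(version="s2", release_tag="v0.2.2", download_dir="kb_data", platform="linux")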
```
3. The knowledge base is stored in the kb_data folder.

Advanced Configuration

Integrating Perplexica Search

Description: Improves Agent S's ability to retrieve knowledge from the web.
Procedure:
1. Install Docker Desktop and start it.
2. Download Perplexica:
```
cd Perplexica
git submodule update --init
```
3. Rename sample.config.toml to config.toml and fill in the required API keys.
4. Start the service:
```
docker compose up -d
```
5. Set the Perplexica URL:
```
export PERPLEXICA_URL=http://localhost:<port>/api/search
```
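
If the agent is launched from a Python script rather than a shell, the same variable can be set in code. This is a minimal sketch: `perplexica_url` and `configure` are hypothetical helpers, the URL format is the one shown above, and the port is whatever your Perplexica configuration exposes (the source does not state it).

```python
import os


def perplexica_url(port, host="localhost"):
    """Build the search endpoint URL in the format shown above."""
    port = int(port)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f"http://{host}:{port}/api/search"


def configure(port, environ=None):
    """Store the URL under PERPLEXICA_URL in the given environment mapping."""
    env = os.environ if environ is None else environ
    env["PERPLEXICA_URL"] = perplexica_url(port)
    return env["PERPLEXICA_URL"]
```

Note that setting `os.environ` only affects the current Python process and any child processes it starts, so the agent has to be started from that same process for the value to take effect.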

Custom Models

Description: Supports multiple large models and custom endpoints.

Procedure:

Use a Claude model:

agent_s2 --model claude-3-7-sonnet-20250219

Use a Hugging Face endpoint:

agent_s2 --endpoint_provider "huggingface" --endpoint_url "<endpoint_url>/v1/"

Notes

The first run requires an internet connection to download dependencies and the knowledge base.
Linux users should avoid Conda environments, which may interfere with pyatspi.
Detailed documentation is available in README.md and models.md.

Application Scenarios

Office automation
Agent S can automatically fill out forms and send emails, reducing repetitive work.
Software testing
It simulates user operations to test software stability on different systems.
AI research
Researchers use it to explore the technical principles of agent-computer interaction.

Q&A

What is the difference between Agent S2 and S1?
S2 is an upgraded version of S1 with better performance and support for more benchmarks, such as OSWorld and AndroidWorld.
Do I need to be connected all the time?
Internet access is required for the initial installation and the knowledge base download; after that, it can run offline.
How do I contact community support?
Join the Discord server (https://discord.gg/E2XfsK9fPV) or submit an issue on GitHub.

Agent S2 Technical Details Announced: A Compositional AI Framework for General-Purpose Computer Use

Building agents that can use computers as skillfully as humans is one of the key challenges on the road to artificial general intelligence (AGI). Such tasks range from open-ended digital tasks to navigating unfamiliar applications through graphical user interfaces (GUIs), in a problem space that is large, noisy, and highly dynamic. The technical paper on Agent S2 was recently released; it proposes a modular framework and reports leading performance on multiple computer-use benchmarks.

The code for Agent S2 was open-sourced earlier. The newly released technical paper (available on arXiv) gives an in-depth look at the system's core concepts and architectural design. Simular AI, the research team behind the system, has also published an introductory article for non-specialist readers.

Agent S2 Overview: A Compositional Agent Design

The core design idea of Agent S2 is to decompose complex computer-use tasks. Rather than relying on a single large model for planning, action, and screen understanding, it assigns these responsibilities to a generalist planning module and to specialized execution and grounding modules (specialists). This compositional architecture is meant to mimic how teams of human experts work: high-level planners, low-level executors, and interface interaction specialists working together.

Agent S2 architecture diagram: a generalist planner combined with specialized grounding modules.

Key features of Agent S2 include:

Mixture of Grounding (MoG). A set of grounding experts (visual, textual, and structural) is used to locate GUI elements accurately. For example, working with a spreadsheet may rely on structured data, while clicking a button relies on visual grounding. This design decouples grounding from planning, which reduces the complexity of the problem and better matches what general-purpose reasoning models and specialized visual grounding models are currently trained for.
Proactive Hierarchical Planning (PHP). The framework dynamically adjusts and refines its plan based on feedback from the environment instead of rigidly following a preset script. This lets the agent respond more flexibly to unexpected situations.
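
The routing idea behind MoG can be sketched as a simple dispatch table. This is an illustration under stated assumptions, not the paper's implementation: `GroundingRequest`, the three expert functions, and `ground` are hypothetical names, the modality is assumed to be chosen upstream by the planner, and each "expert" just returns a description instead of real coordinates or cell references.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class GroundingRequest:
    element: str   # natural-language description of the target
    modality: str  # "visual", "textual", or "structural"


# Hypothetical experts. Real ones would return coordinates, text spans, or
# cell references; here each only reports which expert handled the request.
def visual_expert(request: GroundingRequest) -> str:
    return f"visual: locate '{request.element}' in the screenshot"


def textual_expert(request: GroundingRequest) -> str:
    return f"textual: locate '{request.element}' via on-screen text"


def structural_expert(request: GroundingRequest) -> str:
    return f"structural: locate '{request.element}' via table or cell structure"


EXPERTS: Dict[str, Callable[[GroundingRequest], str]] = {
    "visual": visual_expert,
    "textual": textual_expert,
    "structural": structural_expert,
}


def ground(request: GroundingRequest) -> str:
    """Route the request to the expert registered for its modality."""
    try:
        expert = EXPERTS[request.modality]
    except KeyError:
        raise ValueError(f"no grounding expert for {request.modality!r}") from None
    return expert(request)


if __name__ == "__main__":
    print(ground(GroundingRequest("Submit button", "visual")))
    print(ground(GroundingRequest("cell B2", "structural")))
```

Keeping the planner unaware of how each expert works is what makes the two concerns separable: a new expert can be registered in the table without touching the planning code.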

Benchmark Results: Cross-Platform Performance Leader

According to the paper, Agent S2 sets a new performance record on the widely used OSWorld benchmark. OSWorld mainly evaluates an AI agent's ability to complete diverse tasks such as file management, software use, and information retrieval in a simulated operating system environment.

OSWorld Benchmark Success Rate Comparison.

In addition, Agent S2 generalizes well to other platforms:

WindowsAgentArena. This benchmark focuses on complex interaction tasks in the Windows environment. Agent S2 improved on the previous state-of-the-art (SOTA) result by 52.8%.
AndroidWorld. This benchmark measures the ability to complete tasks on the Android mobile operating system. Agent S2 also outperforms the previous SOTA result here, with a 16.5% improvement.

Success rates on OSWorld show that Agent S2 outperforms previous methods.

Success rates on WindowsAgentArena.

Design Innovation: Synergy Between MoG and PHP

Many existing computer-use agents struggle in real-world applications because they understand interface elements inaccurately (the "grounding" problem) or because their execution is too rigid. Agent S2 addresses both issues through its two core designs:

Mixture of Grounding (MoG). The MoG mechanism routes each action to the most suitable grounding expert based on the current interaction. For example, identifying and manipulating a spreadsheet cell might invoke a structural expert, while clicking a visually distinctive button switches to a visual grounding model. Separating grounding from high-level task planning splits one complex problem into two simpler subproblems that are easier to model.
Proactive Hierarchical Planning (PHP). The PHP module lets the agent continuously adapt its subgoals and action plan as new observations arrive from the environment. This mirrors how people re-evaluate and revise their plans when the situation changes mid-task.
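
The replanning behavior described for PHP can be pictured as a loop that rebuilds the remaining plan after every subgoal, not only after a failure. The sketch below is a toy under stated assumptions: `proactive_plan` and the `demo` environment (a mail task interrupted by a popup) are hypothetical and stand in for the real planner and GUI.

```python
def proactive_plan(goal, observe, replan, act, max_steps=10):
    """Run subgoals one at a time, replanning from a fresh observation each time."""
    history = []
    subgoals = replan(goal, observe(), history)
    while subgoals and len(history) < max_steps:
        current = subgoals[0]
        act(current)
        history.append(current)
        # Proactive step: rebuild the remaining plan from what is now on screen.
        subgoals = replan(goal, observe(), history)
    return history


def demo():
    """Toy environment: opening the mail program triggers an unexpected popup."""
    state = {"popup": False, "done": []}
    base_plan = ["open mail", "compose", "send"]

    def observe():
        return {"popup": state["popup"], "done": list(state["done"])}

    def act(subgoal):
        state["done"].append(subgoal)
        if subgoal == "open mail":
            state["popup"] = True
        if subgoal == "dismiss popup":
            state["popup"] = False

    def replan(goal, observation, history):
        remaining = [s for s in base_plan if s not in observation["done"]]
        if observation["popup"]:
            remaining.insert(0, "dismiss popup")
        return remaining

    return proactive_plan("send an e-mail", observe, replan, act)


if __name__ == "__main__":
    print(demo())
```

In the demo, the popup was not part of the initial plan; it is handled only because the plan is rebuilt from the latest observation after each subgoal.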

Example: Agent S2 self-corrects during an interaction, switching from a visual grounding expert to a textual grounding expert.

Scalability and Error Recovery

The paper reports that on tasks requiring longer sequences of operations, Agent S2's compositional architecture scales better than monolithic models. Its dynamic adaptation and self-correction let it adjust its strategy when an initial action does not have the desired effect, which improves the completion rate of complex tasks. Monolithic models tend to fail more often on long-horizon tasks because of accumulated errors or rigid planning.
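
A back-of-the-envelope model shows why accumulated errors hurt long tasks and why recovery helps. This is a simplified illustration, not an analysis from the paper: it assumes every step succeeds independently with the same probability `p`, and that a failed step can simply be retried a fixed number of times. Both function names are hypothetical.

```python
def success_without_recovery(p: float, n: int) -> float:
    """Probability that all n steps succeed when any failure ends the task."""
    return p ** n


def success_with_retries(p: float, n: int, retries: int) -> float:
    """Same task, but each step may be retried up to `retries` extra times."""
    per_step = 1 - (1 - p) ** (retries + 1)
    return per_step ** n


if __name__ == "__main__":
    for n in (5, 15, 50):
        plain = success_without_recovery(0.95, n)
        recovered = success_with_retries(0.95, n, retries=1)
        print(f"{n:>2} steps: no recovery {plain:.3f}, one retry {recovered:.3f}")
```

Under these assumptions the gap between the two curves widens as the task gets longer, which matches the qualitative claim that error correction matters most for long sequences.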

Why Agent S2 maintains performance on long-horizon tasks: adaptive navigation, adaptive interaction, and error correction.

Beyond the Desktop: Generalization to the Android Platform

Although Agent S2 was developed primarily as an agent for desktop environments, its framework also generalizes well to mobile environments. Its leading performance on the AndroidWorld benchmark demonstrates that core concepts such as MoG and PHP apply to different types of GUI environments.

Agent S2 achieves leading results on the AndroidWorld smartphone-use benchmark.

Advances in Modular Agents

The Agent S2 results suggest that compositional design is not just an architectural choice but may be an effective way to build agents that operate computers in a robust, human-like manner. This work opens up new possibilities for future research in AI planning, grounding, and multimodal coordination.

Interested readers are encouraged to consult the detailed technical paper and the related open source code.
