General Introduction
Agent S is an open source framework developed by Simular AI that lets agents operate computers as humans do, through a graphical user interface (GUI). It uses multimodal large language models and experience-based learning to perform tasks such as browsing the web, editing documents, and using software. The project is open-sourced on GitHub and has an active developer community. The Agent S1 paper was accepted at ICLR 2025, and Agent S2 was released in March 2025, reported to outperform comparable computer-use agents from OpenAI and Anthropic. It supports macOS, Windows, and Linux and is suitable for office automation, software testing, and AI research.
Function List
- Graphical User Interface (GUI) operation: Simulates mouse and keyboard input to interact with computer software.
- Task planning: Splits complex tasks into small steps and automates their execution.
- Learning from experience: Learning from historical tasks to increase efficiency.
- Cross-platform support: Available on macOS (One-click installation package), Windows and Linux.
- Multi-modal inputs: Combine screen images and interface elements for precise operation.
- Open Source Customization: Source code and documentation are provided and can be freely adapted by the developer.
- Knowledge base update: Continuous update of experience data at runtime to improve intelligence.
Using Help
Agent S is an open source tool for developers, and installing and using it requires some programming background. The steps and feature walkthroughs below will help you get started quickly.
Installation process
- Preparing the environment
- Install Python 3.9 through 3.12.
- Install Git for downloading code.
- Optional: Prepare a virtual machine (such as VMware) for testing or isolating the environment.
- Download Code
- Open a terminal and run:
git clone https://github.com/simular-ai/Agent-S.git
- Go to the project directory:
cd Agent-S
- Install dependencies
- Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate # macOS/Linux
venv\Scripts\activate # Windows
- Install the core library:
pip install gui-agents
- Set environment variables (e.g. API keys):
export OPENAI_API_KEY=<your key>
export ANTHROPIC_API_KEY=<your key>
export HF_TOKEN=<your Hugging Face token>
- Starting Agent S
- Run Agent S1 or S2:
agent_s1 # Run Agent S1
agent_s2 # Run Agent S2
- Once launched, enter the task to get started.
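Before launching, it can help to confirm that the interpreter and environment variables match the steps above. The following is a minimal sketch, not part of Agent S: the helper names `python_supported` and `missing_keys` are hypothetical, and which API keys are actually required depends on the model provider you choose.

```python
import os
import sys

# Supported interpreter range quoted in the installation steps (3.9 through 3.12).
MIN_VERSION = (3, 9)
MAX_VERSION = (3, 12)

# Environment variables named in the installation steps.
KEY_NAMES = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "HF_TOKEN")


def python_supported(version_info=sys.version_info):
    """Return True if the major.minor version lies within 3.9 to 3.12."""
    return MIN_VERSION <= tuple(version_info[:2]) <= MAX_VERSION


def missing_keys(environ):
    """Return the names from KEY_NAMES that are unset or empty in environ."""
    return [name for name in KEY_NAMES if not environ.get(name)]


if __name__ == "__main__":
    print("Python version supported:", python_supported())
    print("Unset variables:", missing_keys(os.environ))
```

Passing the environment in as a plain mapping keeps the check easy to try with sample values before touching the real shell configuration.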
Main Functions
Graphical User Interface (GUI) operation
- Functional Description: Simulates human operations through screenshots and interface recognition.
- Procedure:
- Run:
agent_s2
- Enter the task: "Open Notepad and type 'Hello'."
- Agent S2 locates the Notepad icon, clicks it to open the application, and types the text.
- Press Ctrl+C to stop at any time.
Task planning
- Functional Description: Breaks down complex tasks into small steps and completes them incrementally.
- Procedure:
- Enter the task: "Send an e-mail to a friend."
- Agent S2 automates the process: it opens the mail program, creates a new message, fills in the content, and clicks send.
- Users can view the logs for each step in the terminal.
Learning from experience
- Functional Description: Records how each task was carried out and uses that record to optimize subsequent operations.
- Procedure:
- After a task is completed, the experience is saved in the gui_agents/kb folder.
- Running similar tasks again will be more efficient.
- Developers can inspect the knowledge base files to see what has been learned.
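The idea of saving experience and reusing it for similar tasks can be illustrated with a toy store. This is a sketch under stated assumptions, not Agent S's actual knowledge base format: the `ExperienceStore` class, the JSON layout, and the string-similarity lookup are all hypothetical stand-ins.

```python
import json
from difflib import SequenceMatcher
from pathlib import Path


class ExperienceStore:
    """Toy experience memory: maps a task description to the steps that worked."""

    def __init__(self, path):
        self.path = Path(path)
        # Load earlier records if the file already exists.
        self.records = json.loads(self.path.read_text()) if self.path.exists() else {}

    def save(self, task, steps):
        """Record the steps used for a task and persist them to disk."""
        self.records[task] = list(steps)
        self.path.write_text(json.dumps(self.records, indent=2))

    def recall(self, task, threshold=0.6):
        """Return the steps of the most similar stored task, or None if none is close."""
        best_task, best_score = None, 0.0
        for known in self.records:
            score = SequenceMatcher(None, task.lower(), known.lower()).ratio()
            if score > best_score:
                best_task, best_score = known, score
        return self.records[best_task] if best_score >= threshold else None
```

A new task that closely resembles a stored one gets its earlier steps back as a starting point, while an unrelated task returns `None` and would be planned from scratch.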
Featured Functions
Cross-platform support
- Functional Description: Support for three major operating systems.
- Procedure:
- Windows requires pywin32 and pywinauto.
- macOS requires pyobjc; install it with pip install pyobjc.
- Linux: check pyautogui compatibility; permissions may need to be adjusted.
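The per-platform requirements above can be expressed as a small lookup keyed on the value of `platform.system()`. This is a minimal sketch: the helper names `extra_packages` and `pip_command` are hypothetical, and Linux maps to an empty list because the steps above only ask you to check pyautogui compatibility there.

```python
import platform

# Extra packages per operating system, as listed in the steps above.
PLATFORM_PACKAGES = {
    "Windows": ["pywin32", "pywinauto"],
    "Darwin": ["pyobjc"],  # platform.system() reports macOS as "Darwin"
    "Linux": [],           # nothing extra to install; verify pyautogui works
}


def extra_packages(system=None):
    """Return the extra packages for `system` (defaults to the current OS)."""
    system = system or platform.system()
    return PLATFORM_PACKAGES.get(system, [])


def pip_command(system=None):
    """Build a pip install command line, or an empty string if nothing is needed."""
    packages = extra_packages(system)
    return "pip install " + " ".join(packages) if packages else ""
```

For example, `pip_command("Darwin")` yields `pip install pyobjc`, matching the macOS step.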
Multi-modal inputs
- Functional Description: Combine image and interface data to improve operational accuracy.
- Procedure:
- Enter the task: "Search for 'weather' in your browser."
- Agent S2 analyzes the screen, finds the browser window, and enters a search term.
- The results are displayed automatically.
Knowledge Base Download
- Functional Description: Agent S2 uses a pre-trained knowledge base and supports offline operation.
- Procedure:
- Automatically downloads the knowledge base from GitHub Releases on first launch.
- Manual download example:
download_kb_data(version="s2", release_tag="v0.2.2", download_dir="kb_data", platform="linux")
- The knowledge base is stored in the kb_data folder.
Advanced Configuration
Integrating Perplexica Search
- Functional Description: Enhances the web knowledge retrieval capability of Agent S.
- Procedure:
- Install Docker Desktop and start it.
- Download Perplexica:
cd Perplexica
git submodule update --init
- Rename sample.config.toml to config.toml and fill in the API keys as needed.
- Start the service:
docker compose up -d
- Set the Perplexica URL:
export PERPLEXICA_URL=http://localhost:<port>/api/search
Custom Models
- Functional Description: Support for multiple large models and custom endpoints.
- Procedure:
- Use a Claude model:
agent_s2 --model claude-3-7-sonnet-20250219
- Use a Hugging Face endpoint:
agent_s2 --endpoint_provider "huggingface" --endpoint_url "<endpoint URL>/v1/"
Notes
- The first run requires an internet connection to download dependencies and the knowledge base.
- Linux users should avoid Conda environments, which may interfere with pyatspi.
- Detailed documentation is available in README.md and models.md.
Application Scenarios
- Office automation
Agent S can automatically fill out forms and send emails to reduce repetitive work.
- Software testing
Simulate user operations and test the stability of the software on different systems.
- AI research
Researchers use it to explore the technical principles of agent-computer interaction.
QA
- What is the difference between Agent S2 and S1?
S2 is an upgraded version of S1 with better performance and support for more benchmarks, such as OSWorld and AndroidWorld.
- Do I need to be connected all the time?
Internet access is required for the first installation and for downloading the knowledge base; after that, it can run offline.
- How do I contact community support?
Join the Discord server (https://discord.gg/E2XfsK9fPV) or submit an issue on GitHub.
Agent S2 Technical Details Announced: A Compositional AI Framework for General-Purpose Computer Operations
Building agents capable of using computers as skillfully as humans is one of the key challenges on the road to artificial general intelligence (AGI). Such tasks range from open-ended digital tasks to navigating unfamiliar applications through graphical user interfaces (GUIs), in a problem space that is large, noisy, and highly dynamic. Recently, the technical paper on Agent S2 was officially released. It proposes a modular framework and achieves leading performance on multiple computer-use benchmarks.
The code for Agent S2 had already been open sourced. The technical paper, available on arXiv, provides an in-depth look at the system's core concepts and architectural design. Simular AI, the research team behind the system, has also previously published an introductory article for non-specialist readers.
Agent S2 Overview: A Compositional Agent Design
The core design idea of Agent S2 is to decompose complex computer-use tasks. Rather than relying on a single large model to handle planning, action, and screen understanding, it assigns these responsibilities to a generalist planning module and specialized execution and grounding modules (specialists). This compositional architecture is intended to mimic how teams of human experts work: high-level planners, low-level executors, and interface interaction specialists working in tandem.
Agent S2 architecture diagram: a generalist planner combined with specialized grounding modules.
Key features of Agent S2 include:
- Mixture of Grounding (MoG). Uses a set of grounding experts (covering visual, textual, and structured information extraction) to accurately locate GUI elements. For example, working with a spreadsheet may rely on structured data, while clicking a button relies on visual localization. This design decouples grounding from planning, which reduces the complexity of the problem and better matches what current general-purpose reasoning models and specialized visual grounding models are trained for.
- Proactive Hierarchical Planning (PHP). The framework dynamically adjusts and refines its plans based on environmental feedback rather than rigidly following a preset script, which lets the agent respond more flexibly to unanticipated situations.
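The routing idea behind Mixture of Grounding can be sketched as a dispatch from action type to grounding expert. This is only an illustration under stated assumptions: the `Action` type, the action labels, the three expert functions, and the static lookup table are all hypothetical, and how Agent S2 actually selects an expert is not described in this article.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str    # hypothetical labels: "click", "select_text", "edit_cell"
    target: str  # natural-language description of the GUI element


def visual_expert(action):
    """Stand-in for a visual grounding model that finds elements in a screenshot."""
    return f"visual: locate '{action.target}' in the screenshot"


def textual_expert(action):
    """Stand-in for a text-based expert that finds spans of on-screen text."""
    return f"textual: find text span '{action.target}'"


def structural_expert(action):
    """Stand-in for an expert that works on structured data such as spreadsheet cells."""
    return f"structural: address cell '{action.target}'"


# Static routing table; a real system could choose the expert dynamically.
ROUTES = {
    "click": visual_expert,
    "select_text": textual_expert,
    "edit_cell": structural_expert,
}


def ground(action):
    """Send the action to the matching expert, falling back to visual grounding."""
    expert = ROUTES.get(action.kind, visual_expert)
    return expert(action)
```

Because the planner only emits an `Action` and never deals with pixel coordinates or cell addresses, grounding can be swapped or extended without touching the planning logic, which is the decoupling the MoG description emphasizes.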
Benchmark Results: Cross-Platform Performance Leader
Data from the paper show that Agent S2 set a new performance record on the widely used OSWorld benchmark. OSWorld mainly evaluates the ability of AI agents to accomplish diverse tasks such as file management, software usage, and information retrieval in a simulated operating system environment.
OSWorld Benchmark Success Rate Comparison.
In addition, Agent S2 also shows good generalization on other platforms:
- WindowsAgentArena. A benchmark focused on complex interaction tasks in the Windows environment. Agent S2 improved on the previous state-of-the-art (SOTA) result by 52.8%.
- AndroidWorld. A benchmark that measures the ability to complete tasks on the Android mobile operating system. Agent S2 also outperforms the previous SOTA result here, with a 16.5% improvement.
Success rate performance at OSWorld shows that Agent S2 outperforms previous methods.
Success rate performance on WindowsAgentArena.
Design Innovation: Synergies between MoG and PHP
The main challenges many existing computer-use agents face in real-world applications stem from an inaccurate understanding of interface elements (the "grounding" problem) or from overly rigid execution. Agent S2 counters these issues through its two core designs:
- Mixture of Grounding (MoG). The MoG mechanism routes each action to the most appropriate expert model based on the current interaction. For example, selecting and editing a spreadsheet cell might invoke an expert based on structural analysis, while clicking a visually distinctive button is handled by a visual grounding model. Separating grounding from high-level task planning breaks one complex problem into two simpler subproblems that are easier to model.
- Proactive Hierarchical Planning (PHP). The PHP module lets the agent continuously adapt its subgoals and action plans in response to new observations of the environment. This mimics how humans re-evaluate and revise their plans when the situation changes in the middle of a task.
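The replanning behavior described for PHP can be sketched as a loop that asks the planner for a fresh plan after every observation instead of walking a fixed script. This is a toy illustration, not the paper's algorithm: `run_with_replanning`, `toy_planner`, and `toy_executor` are hypothetical names, and the failure is simulated.

```python
def run_with_replanning(goal, plan_fn, execute_fn, max_steps=20):
    """Execute one subgoal at a time, revising the remaining plan after each result."""
    history = []
    plan = plan_fn(goal, history)
    steps = 0
    while plan and steps < max_steps:
        subgoal = plan[0]
        observation = execute_fn(subgoal)
        history.append((subgoal, observation))
        # The plan is rebuilt from the goal and everything observed so far.
        plan = plan_fn(goal, history)
        steps += 1
    return history


def toy_planner(goal, history):
    """Return the subgoals still to do, swapping in a fallback after a failure."""
    done = {subgoal for subgoal, observation in history if observation == "ok"}
    steps = ["open mail app", "compose message", "click send"]
    if ("click send", "failed") in history:
        steps[2] = "press send shortcut"  # revised subgoal after the failed click
    return [step for step in steps if step not in done]


def toy_executor(subgoal):
    """Simulated environment in which clicking the send button always fails."""
    return "failed" if subgoal == "click send" else "ok"


history = run_with_replanning("Send an e-mail to a friend", toy_planner, toy_executor)
```

In this run the third subgoal fails, the planner replaces it with a keyboard shortcut, and the task still completes; a fixed script would have stopped or repeated the failing click.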
Example: Agent S2 self-corrects during an interaction, switching from a visual grounding expert to a textual grounding expert.
Scalability and Error Recovery
The results show that, in tasks requiring longer sequences of operations, the compositional architecture of Agent S2 scales better than monolithic models. Its dynamic adaptation and self-correction allow it to adjust its strategy when initial actions do not have the desired effect, which improves the completion rate of complex tasks. Monolithic models tend to fail more often on long-horizon tasks because of cumulative errors or rigid planning.
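Why cumulative errors hurt long tasks, and why self-correction helps, can be seen in a simplified probability model. The model is an assumption made here for illustration and does not come from the paper: every step is treated as succeeding independently with the same probability, and a failed step may be retried a fixed number of times. The numbers are illustrative, not measured results.

```python
def chain_success(p_step, n_steps, retries=0):
    """Probability that all n_steps succeed when each attempt succeeds
    independently with probability p_step and each step allows `retries` retries."""
    # A step fails only if the first attempt and every retry all fail.
    p_with_retries = 1 - (1 - p_step) ** (retries + 1)
    return p_with_retries ** n_steps


no_correction = chain_success(0.95, 20)               # roughly 0.36
one_retry = chain_success(0.95, 20, retries=1)        # roughly 0.95
```

Under these assumptions, a 95% per-step success rate drops to about a one-in-three chance of finishing a 20-step task, while a single corrective retry per step brings the overall chance back to about 95%.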
Reasons why Agent S2 maintains performance in long-horizon tasks: adaptive navigation, interaction and error correction mechanisms.
Beyond the desktop environment: generalized performance on the Android platform
Although the primary development target of Agent S2 is agents for desktop environments, its framework design also generalizes well to mobile environments. Its leading performance on the AndroidWorld benchmark demonstrates that core concepts such as MoG and PHP apply to different types of GUI environments.
Agent S2 achieves leadership in AndroidWorld smartphone usage benchmarks.
Advances in modular agents
The Agent S2 results suggest that compositional design is not merely an architectural choice but may be an effective way to build agents that operate computers in a robust, human-like manner. This work opens up new possibilities for future research in AI planning, grounding, and multimodal coordination.
Interested readers are encouraged to consult the detailed technical paper and the related open source code.