AI Agent: The Ultimate Weapon in the AIGC Field

baoshi.rao


AI Agents are arguably the most exciting direction in large models today, hailed as 'the next battle in large models', 'the final killer product', and the start of 'the Agent-centric era that will usher in a new industrial revolution'. On November 7th, OpenAI's first developer conference (OpenAI DevDay) ignited the AI Agent trend. OpenAI released an early form of AI Agent product called GPTs, along with the corresponding creation tool, GPT Builder. Users simply describe the desired functionality to GPT Builder in conversation, and it generates a customized GPT. These personalized GPTs can be tailored to daily life, specific tasks, work, or home use. OpenAI also released a range of new APIs (including vision, image generation with DALL·E 3, and speech) together with the new Assistants API, making it easier for developers to build their own customized GPTs. Bill Gates recently published an article stating that within five years AI Agents will become ubiquitous, with every user having a personalized AI Agent. Users will no longer need different apps for different needs; they will simply tell their Agent in everyday language what they want to do. [1]

GPTs

Within a week of release, more than 17,500 GPTs had already been created.

So, what exactly is an AI Agent? Why is it so important that it has garnered such high attention in the industry, with some scholars even asserting that 'the development of Agent Stores in the U.S. could widen the gap between Chinese and American large models' [2]?

In computer science and artificial intelligence, the term 'agent' generally refers to an 'intelligent agent': a software or hardware entity that exhibits one or more intelligent characteristics (such as autonomy, reactivity, sociality, proactiveness, deliberativeness, and cognitive abilities) within a certain environment [3].

OpenAI defines an AI Agent as a system driven by a large language model as its 'brain,' capable of autonomous understanding, perception, planning, memory, and tool usage to automate the execution of complex tasks [4]. The basic framework of an AI Agent is illustrated below:


Basic Framework of LLM-driven Agent [5]

The framework has four main modules: memory, planning, tool use, and action:

(1) Memory. The memory module is responsible for storing information, including past interactions, learned knowledge, and even temporary task information. For an intelligent agent, an effective memory mechanism ensures it can draw on past experiences and knowledge when facing new or complex situations. For example, a chatbot with memory capabilities can remember user preferences or previous conversation content, thereby providing a more personalized and coherent communication experience. It is divided into short-term memory and long-term memory:

a. Short-term memory: All in-context learning relies on the model's short-term memory, which is bounded by the context window.

b. Long-term memory: This lets the agent retain and recall an effectively unlimited amount of information over extended periods, typically by storing it in an external vector database with fast retrieval. Examples include the large volumes of data and knowledge accumulated in a specific industry. As long-term memory accumulates, the agent becomes more capable, gaining industry depth, personalization, and specialized skills.

(2) Planning. The planning module works in two phases: pre-planning and post-reflection. Pre-planning covers forecasting future actions and making decisions. For instance, when executing a complex task, the agent decomposes the main objective into smaller, manageable sub-goals so that it can plan an efficient series of steps toward the desired outcome. In post-reflection, the agent reviews the plans it has formulated, reflects on errors and shortcomings, and incorporates the lessons learned into refined plans. These lessons are folded into long-term memory, helping the agent avoid the same mistakes in the future and update its understanding of the world.

(3) Tool Use. The tool use module covers the agent's ability to leverage external resources or tools to perform tasks. The agent learns to call external APIs to obtain information missing from the model weights, such as real-time data, code execution capabilities, or access to proprietary information sources, thereby compensating for the LLM's inherent weaknesses. For instance, since LLM training data is not updated in real time, tools can be used to access the internet for the latest information or to run specialized software that analyzes large datasets. With numerous digital and intelligent tools already available on the market, agents can often operate tools more adeptly and efficiently than humans. By invoking different APIs or tools, they can accomplish complex tasks and produce high-quality outputs, which is a significant characteristic and advantage of intelligent agents.

(4) Action. The action module is the part of the agent that actually executes decisions or responses. Faced with diverse tasks, the agent system maintains a comprehensive set of action strategies from which it selects the operations it needs during decision-making, such as memory retrieval, reasoning, learning, and programming.

Overall, these four modules work together to enable agents to act and make decisions in a broader range of scenarios, executing complex tasks in a more intelligent and efficient manner.[6]
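As an illustration only, the interplay of the four modules can be sketched in a few lines of Python. Everything here is a stand-in assumed for the example: the Memory class, the two toy tools, and the hard-coded plan are hypothetical, and simple keyword matching replaces both the LLM and the vector database that a real agent would use.

```python
from __future__ import annotations

from collections import deque
from typing import Callable


class Memory:
    """Short-term memory is a bounded window; long-term memory keeps everything."""

    def __init__(self, window: int = 3) -> None:
        self.short_term: deque[str] = deque(maxlen=window)
        self.long_term: list[str] = []

    def store(self, entry: str) -> None:
        self.short_term.append(entry)
        self.long_term.append(entry)

    def recall(self, keyword: str) -> list[str]:
        # Keyword match stands in for vector-database retrieval.
        return [e for e in self.long_term if keyword.lower() in e.lower()]


# Tool use: external capabilities the model itself lacks (both tools are hypothetical).
TOOLS: dict[str, Callable[[str], str]] = {
    "word_count": lambda text: str(len(text.split())),
    "uppercase": lambda text: text.upper(),
}


def plan(goal: str, memory: Memory) -> list[tuple[str, str]]:
    """Planning: decompose the goal into (tool, input) steps, skipping tools that failed before."""
    steps = [("word_count", goal), ("translate", goal), ("uppercase", goal)]
    failed = {entry.split()[1] for entry in memory.recall("lesson:")}
    return [step for step in steps if step[0] not in failed]


def act(step: tuple[str, str], memory: Memory) -> str | None:
    """Action: execute one step through a tool and record the outcome."""
    name, arg = step
    tool = TOOLS.get(name)
    if tool is None:
        # Post-reflection: turn the failure into a lesson kept in long-term memory.
        memory.store(f"lesson: {name} is unavailable")
        return None
    result = tool(arg)
    memory.store(f"{name} -> {result}")
    return result


def run_agent(goal: str, memory: Memory) -> list[str]:
    """Plan once, then act on every step, keeping only the successful results."""
    outcomes = (act(step, memory) for step in plan(goal, memory))
    return [r for r in outcomes if r is not None]


memory = Memory()
print(run_agent("ship the release notes", memory))  # first run hits the missing tool
print(plan("ship the release notes", memory))       # the next plan omits the failed step
```

The first run records that the hypothetical translate tool is unavailable, and the second call to plan leaves that step out. This is a deliberately crude version of reflection feeding back into planning through long-term memory.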

Large model-based Agents not only provide everyone with personalized AI assistants that enhance capabilities but also transform the paradigm of human-machine collaboration, leading to deeper human-machine integration. The evolution of generative AI's intelligent revolution has so far presented three modes of human-machine collaboration:

(1) Embedding mode. Users interact with AI through language, using prompts to set goals, and then AI assists users in achieving these goals. For example, ordinary users input prompts to generative AI to create novels, musical works, 3D content, etc. In this mode, AI acts as a tool for executing commands, while humans take on the roles of decision-makers and commanders.

(2) Copilot mode. In this mode, humans and AI collaborate more like partners, jointly participating in workflows and each playing their part. AI integrates into the workflow, from providing suggestions to assisting in completing various stages of the process. For instance, in software development, AI can help programmers write code, detect errors, or optimize performance. Humans and AI work together in this process, complementing each other's capabilities. AI acts more like a knowledgeable partner rather than just a tool.

In fact, Microsoft first introduced the concept of Copilot on GitHub in 2021. GitHub Copilot is an AI service that assists developers in writing code. By May 2023, with the support of large language models, Copilot underwent a comprehensive upgrade, launching solutions like Dynamics 365 Copilot, Microsoft 365 Copilot, and Power Platform Copilot, while promoting the philosophy that "Copilot represents a completely new way of working." As work benefits from such assistance, life similarly requires a "Copilot." Li Zhifei, founder of Mobvoi, believes the optimal role for large models is to serve as humanity's "Copilot."

(3) Agent mode. Humans define objectives and provide the necessary resources (e.g., computing power); the AI then handles most tasks independently, while humans supervise the process and evaluate the final outcomes. In this mode, AI fully demonstrates the interactive, autonomous, and adaptive characteristics of intelligent agents and functions much like an independent actor, while humans primarily assume supervisory and evaluative roles.

Three Modes of Human-AI Collaboration[7]

Given the capabilities provided by the agent's four main modules (memory, planning, action, and tool use), the Agent mode is undoubtedly more efficient than the Embedding or Copilot modes and may become the primary mode of human-machine collaboration in the future.

Under the Agent-driven model of human-machine collaboration, every ordinary individual has the potential to become a super individual. A super individual has their own AI team and automated task workflows, and builds more intelligent and automated collaborative relationships with other super individuals through Agents. The industry is already actively exploring one-person companies and super individuals. On GitHub, there are Agent-based automated team projects such as GPTeam, which uses large models to create multiple agents with specific roles and functions that collaborate to achieve predefined goals. For example, Dev-GPT is a multi-agent collaborative team for automated development and operations, with roles such as Product Manager Agent, Developer Agent, and Operations Agent. Such a multi-agent team can support the normal operations of a startup marketing company, effectively functioning as a one-person company. Another example is NexusGPT, which claims to be the world's first AI freelancer platform. [8] The platform integrates various AI-native data from open-source databases and features over 800 AI agents with specialized skills. On it, you can find experts in different fields, such as designers, consultants, and sales representatives, and employers can select an AI agent at any time to help complete various tasks.

Agent-Based Collaboration

AI Agent is redefining software. Bill Gates believes that AI Agents will completely disrupt the software industry, impacting how we use software and how we write it.[9]

AI Agents will shift the paradigm of software architecture from process-oriented to goal-oriented. Existing software (including apps) uses predefined instructions, logic, rules, and heuristic algorithms to fix its processes so that the outcomes meet user expectations, and users follow step-by-step instructions to achieve their goals. This process-oriented architecture offers high reliability and determinism. However, a process-oriented architecture can only be applied to vertical domains and cannot be adopted universally across all fields. Balancing standardization and customization has therefore become one of the key challenges facing the SaaS industry.

Software Architecture Paradigm Shift[10]

The AI Agent paradigm is transitioning function development from human-led to AI-driven approaches. With large language models as the technical infrastructure and Agents as the core product form, traditional software's predefined commands, logic, rules, and heuristic algorithms are evolving into goal-oriented autonomous generation by intelligent agents. This shift means that while previous architectures could only handle limited-scope tasks, future architectures will solve problems across unlimited domains.[11]
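The contrast between the two paradigms can be made concrete with a toy example. The sketch below is not how an LLM-based agent works internally: it swaps in a classical breadth-first search over hand-written actions to stand in for goal-oriented generation. All action names and facts are hypothetical.

```python
from __future__ import annotations

from collections import deque

# Each action: (name, facts required beforehand, facts added afterwards).
# All names are hypothetical and exist only for this illustration.
ACTIONS = [
    ("fetch_data", frozenset(), frozenset({"data"})),
    ("clean_data", frozenset({"data"}), frozenset({"clean"})),
    ("build_chart", frozenset({"clean"}), frozenset({"chart"})),
    ("write_summary", frozenset({"clean"}), frozenset({"summary"})),
]


def fixed_process() -> list[str]:
    """Process-oriented: the developer hard-codes the one supported sequence."""
    return ["fetch_data", "clean_data", "build_chart"]


def plan_for(goal: frozenset[str]) -> list[str] | None:
    """Goal-oriented: breadth-first search for a shortest action sequence that reaches the goal."""
    start: frozenset[str] = frozenset()
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if goal <= state:  # every requested fact has been produced
            return path
        for name, needs, adds in ACTIONS:
            nxt = state | adds
            if needs <= state and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None  # no combination of known actions reaches the goal


print(fixed_process())
print(plan_for(frozenset({"summary"})))
print(plan_for(frozenset({"chart", "summary"})))
```

The fixed process can only ever produce a chart, whereas the goal-driven planner assembles a different sequence for each requested outcome from the same building blocks, and reports failure (None) when the goal lies outside what its actions can reach.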


RPA (Robotic Process Automation) vs. APA (Agentic Process Automation) comparison[13]

Take ChatDev, the first "LLM+Agent" SaaS product from Facewall Intelligence, as an example. This AI-powered software development platform operates like a fully automated development company staffed entirely by AI Agents, which play roles such as CEO, CTO, Development Manager, Product Manager, Tester, and Supervisor. Users simply describe their requirements to the CEO Agent. The AI CEO then autonomously orchestrates the entire development lifecycle and ultimately delivers both the software product and its complete source code through fully automated processes.[14] This approach promises to reduce production costs, enhance customization, and usher in a "3D printing" era for software development.

AI Agents are a crucial driving force in turning artificial intelligence into fundamental infrastructure. Throughout technological history, the ultimate destiny of a technology has been to become infrastructure. Electricity and cloud computing are examples: as unnoticeable as air, yet indispensable. This transformation typically passes through three stages:

(1) Innovation and Development. New technologies are invented and begin to be applied.

(2) Popularization and Application. As technologies mature, they are widely adopted across various fields, profoundly impacting society and the economy.

(3) Infrastructure Stage. When a technology becomes nearly ubiquitous, it transitions into infrastructure, an integral part of daily life.

It is widely acknowledged that AI will become the infrastructure of future society, and AI Agents are accelerating this process. This is due to their cost-effective software production and adaptability to diverse tasks and environments, coupled with their ability to learn and optimize performance. These traits enable Agents to support a broad range of industries and societal activities.

Overview of AI Agent Applications [15]

Moving forward, AI Agents may evolve in two parallel directions:

(1) Human-Assistive Agents. Focused on tool-like attributes, assisting humans by executing various tasks.

(2) Human-Like Agents. Emphasizing autonomy, long-term memory, and anthropomorphic traits, leaning toward human or superhuman characteristics.

From the perspective of technological optimization and implementation, the development of AI Agents also faces several bottlenecks:

First, as OpenAI's GPTs show, the limited complex-reasoning ability and high latency of LLMs still keep Agent applications from truly maturing. Addressing this remains a key direction for engineering optimization and technological breakthroughs in the industry.

Second, multi-agent development still faces significant challenges. Multi-agent systems are a highly complex academic research direction, and as agents begin to proliferate in the consumer market, they have also become an important technical reality. For example, Stanford's virtual town study simulates 25 agents. After the town framework was open-sourced, however, developer tests showed that a single Agent could consume $20 worth of tokens per day because of the extensive memory and action processing required. This cost exceeds that of many human workers, so both Agent frameworks and LLM inference need further optimization.
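To put these figures in perspective, the arithmetic can be worked through directly. The sketch below uses only the two numbers quoted above (25 agents, $20 per agent per day); the function name and the 30-day month are assumptions added for illustration.

```python
AGENTS = 25                    # agents in the virtual town (figure from the text)
COST_PER_AGENT_PER_DAY = 20.0  # USD per agent per day (figure from the text)


def simulation_cost(agents: int, cost_per_agent_per_day: float, days: int = 1) -> float:
    """Total token cost in USD for running `agents` agents for `days` days."""
    return agents * cost_per_agent_per_day * days


daily = simulation_cost(AGENTS, COST_PER_AGENT_PER_DAY)        # 25 * 20 = 500.0
monthly = simulation_cost(AGENTS, COST_PER_AGENT_PER_DAY, 30)  # assumes a 30-day month: 15000.0
print(f"daily: ${daily:,.0f}, monthly: ${monthly:,.0f}")
```

At that rate, keeping the full 25-agent town running would cost roughly $500 per day, which illustrates why cost optimization matters before such systems can scale.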

Overcoming the challenges of multi-agent development is a crucial prerequisite for establishing a future Agent Society. Multi-agent collaboration can give rise to the most advanced form of technological social system: the Agent Society. Such a society is complex, dynamic, self-organizing, and adaptive, enabling cooperation, competition, and continuous evolution. Within this system, agents can execute complex and flexible tasks in response to goals and environmental changes, engaging in high-level, multi-dimensional interaction and collaboration with humans and other agents. The Agent Society not only helps humanity explore and expand both the physical and virtual worlds but also enhances and extends human capabilities and experiences.

Meanwhile, these development trends suggest that AI Agents may face multifaceted challenges such as security and privacy concerns, ethical and accountability issues, as well as economic and social employment impacts.

(1) Security and privacy are critical attributes of intelligent agents, essential for their stable operation and the protection of users and society. These two factors directly influence the trustworthiness and controllability of AI agents. If AI agents encounter vulnerabilities, suffer attacks, or experience data leaks, they may cause harm to users or society. For instance, shortly after the release of OpenAI's GPTs, security vulnerabilities emerged, leading to the leakage of user-uploaded data.

(2) Ethics and accountability are core principles of intelligent agents, determining their values and objectives, as well as their respect and protection for users and society. These principles directly affect the credibility and controllability of intelligent agents. If agents exhibit unfairness, opacity, or unreliability, they may provoke rejection from users or society. Accountability is also a key issue for intelligent agents; unclear or unjust responsibility attribution in human-agent collaboration can lead to severe consequences.

(3) Economic and social employment impact. A significant challenge in future work will be the competition between humans and intelligent agents. For example, the emergence of the AI freelancer platform NexusGPT poses a threat to traditional freelancers. In future workplace collaborations, more and more intelligent agents will appear. Employers, considering efficiency and benefits, may minimize human labor input. As intelligent agent technology matures, we must proactively consider the long-term effects of these technological developments on society and individual careers.