Yoav Goldberg, Nov 24, 2024
This piece started with a pair of twitter and bluesky posts:
let's talk about "agents" (in the LLM sense). there's a lot of buzz around "multi-agent" systems where agents collaborate but... i don't really get how it differs from a thinking of a single agent with multiple modes of operation. what are the benefits of modeling as multi-agent?
— (((ل()(ل() 'yoav))))👾 (@yoavgo) November 23, 2024
https://bsky.app/profile/yoavgo.bsky.social/post/3lbl6skxowk27
Many of the instances that are currently called multi-agent systems might as well be described as modular single-agent systems, or as multi-expert systems. But there is such a thing as "true" multi-agent systems, and it is important to be clear about the differentiating property, so that we are incentivized to explore it more. The minimal differentiating property in my view is agents with private state that persists across calls. A more advanced case (actor system) is concerned with truly asynchronous agents.
LLMs brought with them plenty of talk about "LLM agents" / "AI Agents" / "language agents". They are indeed very interesting. There is also plenty of talk about "multi-agent" systems. But what is a multi-agent system, and what separates a multi-agent system from a complex single-agent system? I was quite confused by that, and went to twitter and bluesky for answers, which I received (you are encouraged to look at the threads).
Many (really, MANY!) of the replies were about notions that I consider to be irrelevant to multi-agency. This post discusses them. But some of the comments and discussion made me reflect and figure out what is a true distinguishing factor between multi-agent and complex-single-agent, and this post discusses this as well. It may be trivial in retrospect, but it took me a while to figure out and the replies to the threads suggest it might not be obvious to others either, so worth spelling out at some length.
The talk about "multi-agent systems" or "a coalition of agents" and similar descriptions suggests some form of complex, even social, behavior and dynamics, which may or may not be present in the current systems (spoiler: mostly they are not). By calling our current simple systems "multi-agent" we (a) suggest they have more power than they really have; and (b) miss out on the power a "real" multi-agent system does have. Also (c) we may introduce complexity where it is not really needed.
And now, let's dive in. Before we ask what distinguishes a true multi-agent system, let's start by asking what is a single LLM agent.
While there is no clear single definition of what an "agent" is (even if we restrict ourselves to the realm of LLM-powered agents), there is a shared core that I think most people will agree on. However, I realized this shared core lacks a crucial component, making it ambiguous when we try to move from single-agent to multi-agent systems: a communication pattern. We'll get to it. For now, let's start with two definitions from 2023 papers that define agents: LLM Agents for Task Planning and Tool Usage (TPTU)1 and Cognitive Architecture Agents (CoALA)2.
The CoALA paper defines "language agents" as "an AI system that uses LLMs to interact with the world". The TPTU paper defines an agent as "a program that employs AI techniques to perform tasks that typically require human-like intelligence". Both definitions are very broad, and each definition implicitly assumes the other one. TPTU then lists a set of abilities that such an agent should have:
- perception (observing some environment)
- task planning (planning and executing multiple steps required to complete the task)
- tool use (to interact with the world)
- learning/reflection/memory from feedback (adapting the plans and/or reacting to change)
- summarization (of the memory of experiences so far, to condense them).3
The CoALA paper suggests a similar set of abilities under somewhat different names. In both cases, the agent receives a request (that "requires intelligence") to perform, and then performs a series of steps that involve internal monologue, interacting with an environment and using tools in order to perform the goal. The series of steps may involve interacting with other agents, for example by delegating tasks to them. The part that is left ambiguous / undefined is the nature of this interaction, which I will discuss below. I believe most people working on LLM-agents (or Language Agents as others prefer to call them) will subscribe to this characterization of an agent.
In practice, such agents are implemented in terms of LLM calls: a base (or system) prompt that defines a behavior or an ability, and describes a set of available tools, and an execution mechanism which performs repeated prompting and tool execution while maintaining the LLM's chat history and interpreting the tool-use instructions.
An agent is often characterized by:
- a prompt that defines a behavior.
- a set of tools that can be called.
- ability to perform a multi-step process by repeated prompting and maintaining some memory/state between the different calls.
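The three ingredients above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular framework's API: `fake_llm`, `run_agent`, the `CALL`/`FINAL` reply convention and the `add` tool are all hypothetical names invented for the example, and the scripted `fake_llm` stands in for a real model call.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a real LLM call: it reads the chat history and
# either requests a tool ("CALL <tool> <args>") or answers ("FINAL <text>").
def fake_llm(history: List[str]) -> str:
    last = history[-1]
    if last.startswith("TOOL_RESULT "):
        return "FINAL " + last.split(" ", 1)[1]
    return "CALL add 2 3"

# Hypothetical tool: adds two whitespace-separated integers.
def add(args: str) -> str:
    a, b = args.split()
    return str(int(a) + int(b))

def run_agent(system_prompt: str,
              tools: Dict[str, Callable[[str], str]],
              task: str,
              llm: Callable[[List[str]], str] = fake_llm,
              max_steps: int = 5) -> str:
    # The agent's state is its chat history; here it lives only inside this call.
    history = [system_prompt, "USER " + task]
    for _ in range(max_steps):
        reply = llm(history)
        history.append(reply)
        if reply.startswith("FINAL "):
            return reply[len("FINAL "):]
        # Interpret the tool-use instruction and feed the result back.
        _, tool_name, tool_args = reply.split(" ", 2)
        history.append("TOOL_RESULT " + tools[tool_name](tool_args))
    return "gave up"

answer = run_agent("You are a calculator agent. Tools: add.", {"add": add}, "what is 2+3?")
print(answer)
```

Note that the history is a local variable of `run_agent`: it is created when the call starts and discarded when it returns.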
Given this characterization of an agent, it may be tempting to define any system that has multiple such agents "working together" as a "multi-agent system". I think this is wrong, though, as in many cases the distinction between multi-agent and single-agent becomes very arbitrary, and boils down to implementation choices and code organization choices that do not really change the system in any fundamental way.
For example, consider a "teach me about a topic" agent which first calls a "retrieval agent" that will fetch documents about the topic, then calls a "relevance agent" that will filter the returned documents, and finally a "summarization agent" which will produce a summary based on the documents that passed the relevance judgements. Is this a multi-agent system? What if the retrieval agent was also in charge of filtering? And what if we also added the summarization function to it? Is it still a multi-agent system, or a single-agent system with a set of skills and a control strategy? What if instead of "retrieval agent", "filtering agent" and "summarization agent" we would think of them as a "retrieval tool", "filtering tool" and "summarization tool" that are called by a single agent? Will this change the system's behavior in any fundamental way? I would argue that these would just be notational variants, but hey, we just moved from multi-agent to single-agent and have pretty much the same system with the same expressive power.
Furthermore, consider that the "summarization agent" (or summarization tool) may work by creating three different "summarization sub-agents", each producing its own summary, and then merging their resulting summaries somehow, producing a final summary. Is it now a multi-agent system? Or just a single agent using an ensemble approach? I'd argue that it's the latter, and there is nothing gained from the "multi-agent" nature of the system. These are not "true multi-agent systems".
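To make the "notational variant" point concrete, here is a toy sketch in which the same three functions are wired up once under "agent" names and once as "tools" of a single agent. Everything here is hypothetical: the tiny `CORPUS`, the keyword-based `filter_relevant`, and the string-joining `summarize` stand in for real retrieval and LLM calls.

```python
from typing import List

# Toy corpus; a real system would query a search backend.
CORPUS = ["cats purr", "cats sleep a lot", "stock prices fell"]

def retrieve(topic: str) -> List[str]:
    # Returns candidate documents, some of which are irrelevant.
    return list(CORPUS)

def filter_relevant(topic: str, docs: List[str]) -> List[str]:
    # Keeps only documents that mention the topic.
    return [d for d in docs if topic in d]

def summarize(docs: List[str]) -> str:
    # Placeholder "summary": just joins the documents.
    return "; ".join(docs)

# Variant A: "multi-agent" framing, three "agents" called in sequence.
def teach_multi_agent(topic: str) -> str:
    retrieval_agent = retrieve
    relevance_agent = filter_relevant
    summarization_agent = summarize
    return summarization_agent(relevance_agent(topic, retrieval_agent(topic)))

# Variant B: "single agent with tools" framing.
def teach_single_agent(topic: str) -> str:
    tools = {"retrieve": retrieve, "filter": filter_relevant, "summarize": summarize}
    docs = tools["retrieve"](topic)
    docs = tools["filter"](topic, docs)
    return tools["summarize"](docs)

print(teach_multi_agent("cats"))
print(teach_single_agent("cats"))
```

Both variants compute exactly the same thing; only the names differ.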
Why do I care what things are called? So what if we call it multi-agent in some cases and complex single-agent in another? Because I think this has implications for how we think about these systems, and, more importantly, how we do not think about them. "Multi-agent" sounds cool, but if we get to a conclusion that it is just a notational variant of something else, then we may dismiss it as "just a hype name" and may miss out on some real opportunities. Names help us categorize things, and then develop structures and theories around them. They also shape our thinking about the objects in the named category. In this sense, using the "wrong name" is harmful: for example, if we conflate "ensembles" and "multi-agent", calling them both "multi-agent", we miss out on the ensemble theories and ways of thinking. Similarly, if we conflate "multi-agent" and "multi-expert", calling them both "multi-agent", we may lose sight of the distinguishing aspects of multi-agent systems that do make them more powerful or more interesting.
Indeed, when I asked on twitter and on bluesky about the difference between multi-agent and single-agent systems, a very common response was that they are just ways of organizing code. This was also my initial thought, but I came to realize that it is wrong, and that this view is reductive.
The following cases are often mentioned as benefits of, or reasons for, a multi-agent system. However, I argue that while they are all useful, they are neither necessary nor sufficient for a system to be considered multi-agent.
Modularity is important. It is convenient to break a complex process into subtasks, break a complex module into sub-modules, define clear boundaries between different sub-processes, etc. For example, if we need a system that can do two separate things (say "code" and "search the web") we can implement each of these abilities in its own sub-module, define clear interfaces (input and output formats) for each of the modules, and then call each of them separately. We can then call each of these modules "an agent" or "a sub-agent", and call them separately, maintain them separately, swap them separately, etc. The two sub-modules (sub-agents?) can be developed by different groups or at different organizations. Changes to one will not affect the other (as long as the interfaces are preserved). The different agents can also call each other (for example the coding agent may ask the web-search agent to help it find documentation), or a third module ("agent") can be used to synchronize between them. This is nice software engineering, and makes a lot of sense. But I would argue that breaking down a system into modules like this is not sufficient to make it a "multi-agent" system, even if the modules do call each other. This is just a modular system.
Another common theme is to assign to each agent a combination of "an expertise", a "system prompt" that defines the expertise and "a set of tools" to help realize the expertise. So if you have two different kinds of expertise (again, think "coding" and "searching the web") each will have its own prompt and associated set of tools. This concern is similar to the modularity one, and follows the same "separation of concerns" philosophy, but now stems from a different reason: the limitations of current technology. Current LLMs are still not that great when faced with very long, multi-objective prompts, and do not work well when the prompt specifies a very large set of tools. Thus, it makes sense to implement the separate processes as different sub-modules (or "sub-agents") each specializing in its own task. (This also has the regular software engineering benefit of decoupling, and helps ensure that changes to the "coding" part of the prompt won't interfere with the performance of the "web search" ability).
However, this kind of specialization does not necessitate a multi-agent system. First, there is something to be said for the benefit of the two abilities being aware of each other, as Graham argues in a recent blog post.4 Second, there are solutions that do not require separation across agent boundaries. A single agent may have more than one expertise. It may know both how to code and how to search (each implemented in an isolated function with a separate prompt) and decide which function to call at each step. Or a clever tool-choice mechanism may be devised to include the right set of tools in the prompt at the right time. Indeed, the TPTUv2 paper5 deals exactly with this: how to work around such LLM limitations in a single agent system.
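As a rough sketch of the "single agent, several kinds of expertise" idea, the snippet below keeps one prompt and one tool subset per expertise and assembles only the relevant pair for each request. All names (`Expertise`, `choose_expertise`, `build_prompt`, `run_python`, `web_search`) are hypothetical, and the keyword router stands in for what would more plausibly be an LLM call. This is not the mechanism proposed in the TPTU-v2 paper, just an illustration of the general shape.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class Expertise:
    prompt: str                             # expertise-specific system prompt
    tools: Dict[str, Callable[[str], str]]  # only the tools this expertise needs

def run_python(src: str) -> str:
    return f"ran: {src}"            # placeholder for a real code-execution tool

def web_search(query: str) -> str:
    return f"results for: {query}"  # placeholder for a real search tool

EXPERTISE = {
    "coding": Expertise("You write and run code.", {"run_python": run_python}),
    "search": Expertise("You search the web.", {"web_search": web_search}),
}

def choose_expertise(request: str) -> str:
    # Hypothetical router; in practice this could itself be an LLM call.
    return "search" if request.lower().startswith("find") else "coding"

def build_prompt(request: str) -> str:
    # Only the chosen expertise's prompt and tools reach the LLM,
    # keeping the prompt short and the tool list small.
    exp = EXPERTISE[choose_expertise(request)]
    tool_list = ", ".join(sorted(exp.tools))
    return f"{exp.prompt} Tools: {tool_list}. Task: {request}"

print(build_prompt("Find the docs for asyncio.Queue"))
print(build_prompt("Write a function that reverses a list"))
```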
For both modularity and specialization, I will call the resulting systems "Multi-expertise" or "Multi-role" and not "Multi-agent".
Another recurring answer about the supposed benefit of multi-agent systems is that each agent can be powered by a different LLM (an agent handling sensitive information may use a local version, an agent handling easier tasks may use a cheaper online LLM API, and one handling harder tasks may use an expensive online LLM API). This is also not a real distinction. We can easily consider a single agent that calls different LLMs on different occasions. There is no reason whatsoever to couple each agent to a single LLM.
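A single agent choosing a model per call might look like the following sketch. The three backends and the `pick_model` policy are hypothetical placeholders; in practice each function would wrap a real client.

```python
from typing import Callable

# Hypothetical model backends; each returns a tagged echo for illustration.
def local_model(prompt: str) -> str:
    return f"[local] {prompt}"

def cheap_api_model(prompt: str) -> str:
    return f"[cheap-api] {prompt}"

def strong_api_model(prompt: str) -> str:
    return f"[strong-api] {prompt}"

def pick_model(sensitive: bool, hard: bool) -> Callable[[str], str]:
    # Sensitive data stays local; otherwise pay for the strong model
    # only when the task is hard.
    if sensitive:
        return local_model
    return strong_api_model if hard else cheap_api_model

def agent_step(prompt: str, sensitive: bool = False, hard: bool = False) -> str:
    # One agent, one step, with the model chosen per call.
    return pick_model(sensitive, hard)(prompt)

print(agent_step("summarize this patient record", sensitive=True))
print(agent_step("fix a typo"))
print(agent_step("prove this lemma", hard=True))
```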
"In a multi-agent system, I can call different agents in parallel". Well, but a single-agent system can also use parallelism. For example if it calls two tools, or calls the same tool with two different inputs, there is no reason to do it sequentially (unless the output of one call is part of the input to the other). It can call them in parallel. Similarly, if it has two skills it can invoke them both in parallel, and if it has "sub-agents" for modularity, it can also call them in parallel. Parallelism does not necessary mean multi-agent.6
"Each agent can have its own opinion, and we can use the output of several agents to form a consensus opinion, which will be more accurate than the sum of its parts." This also goes by the name "Wisdom of the Crowds". While I agree with the mechanism and its utility, it---like the previous ones---does not necessitate nor characterize a multi-agent system. This is just an ensemble, and we should treat it as that. We can do many forms of ensembling also in a single-agent system. We can even run things in parallel if we want to speed things up.
- modular is not multi-agent.
- multi-role is not multi-agent.
- multiple subsets of tools is not multi-agent.
- multi-prompt is not multi-agent.
- multi-model is not multi-agent.
- parallel-execution is not multi-agent.
- multiple-opinions is not multi-agent.
I listed many things that are often brought up as multi-agentic, and claimed they are not multi-agentic. What, then, is a multi-agent system? And why is having multiple agents not enough? The crux of the matter is that the definitions of a "language agent" above focused on the kinds of tasks being performed ("requiring intelligence") and on the mechanisms for performing them ("using planning, tool use, ..."), but they neglected a crucial component, which is having a state that persists across invocations.
For me, a core property of an agent is a mutable state which is kept between calls. So if your "multi-agent" system can be easily implemented by making each agent a function that doesn't have access to global variables, or by making each agent an immutable object, then it is not "really" a multi-agent system, it is just a modular system.7
In the context of LLM-based agents, the mutable agent state is the conversation history. For example, a ReAct style8 agent repeatedly observes, plans, and acts, where each such step grows its memory (conversation history, prompt), and each step is conditioned on the outputs of the previous steps. However, just having two ReAct-style agents does not yet make it a multi-agent system. The question is the scope in which the state is maintained.
Consider for example a ReAct-style coding agent. It receives a task, and starts to perform it. It takes multiple ReAct steps, growing its history each time. In the middle of the process, it realizes it needs some documentation, so it asks the web-search agent to find it. The web-search agent then performs its own ReAct steps and tool-calling, until it finds the documentation, which is then returned to the coding agent. The coding agent then picks up where it stopped, and continues coding with the new information. It then realizes it needs documentation for some other package, so it asks the web-search agent again. The web-search agent starts a brand new search process where it searches for the docs, until it returns them. The coding agent then proceeds, completes its task, and returns to its caller.
While two "agents" communicated, in terms of software this is just function calling. When a function is called, it is allocated its own stack frame with its own state, which is maintained until the function ends, and then the stack frame (and the state) is cleared. While the function is working, it may suspend its execution to call other functions, and then resume based on their outputs. Each function call starts a fresh state, updates it while it is working, and clears it when done. The next call will start a new history. This is how the "agents" in the above example behaved (and this is also how the agents described in the TPTU paper are defined, and how many current "multi-LLM-agent" systems work).
What is missing here is a mutable state that is kept between function calls. For example, if in the second call to the search-agent the search-agent did not start with a fresh history but rather with all its history from previous calls, then it would be a multi-agent system. Having the history kept like that may be beneficial (for example, the agent will be able to respond to followup requests like "this is great, can you find more of the same" or "can you elaborate on the second option") but also carries risks (having the previous request in the prompt may confuse it in cases where the history does not actually matter for the followup request). But this state persistence between calls to the same agent is a core part of being a multi-agent system.
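The difference between the two behaviors can be shown with a deliberately tiny sketch. Both `search_agent_call` and `SearchAgent` are hypothetical; instead of searching, they simply return the history the agent would condition on, which is the part that matters here.

```python
from typing import List

def search_agent_call(request: str) -> List[str]:
    # Function-call style: the history is a local variable, created fresh on
    # each call and discarded on return, like a stack frame.
    history = [request]
    return history

class SearchAgent:
    # Agent style: the history is private state that outlives each call.
    def __init__(self) -> None:
        self._history: List[str] = []

    def handle(self, request: str) -> List[str]:
        self._history.append(request)
        # Return a copy of what the agent conditions on for this request.
        return list(self._history)

first = "find docs for the requests package"
followup = "this is great, can you find more of the same"

search_agent_call(first)
print(search_agent_call(followup))  # sees only the followup

agent = SearchAgent()
agent.handle(first)
print(agent.handle(followup))       # sees the first request and the followup
```

Only the second version can make sense of "more of the same", because only it still knows what "the same" refers to.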
To give another example: consider a system with a coder and a verifier. The coder codes, then starts a loop in which it: shows the code to the verifier, breaks the loop if the verifier verifies, and fixes according to the verifier comments if it does not. Now, there are two options:
- The verifier starts afresh in each turn, and verifies the current version of the code detached from all previous interactions. This option can be implemented as just function calling (or with the verifier being an object with immutable state, or with a new verifier object being created by the coder on each turn). This will not have the power of a "true" multi-agent system. The verifier is just a function call / stateless tool.
- In its second execution, the verifier looks not only at the code, but also at the history: its previous comments to the coder, and maybe a copy it kept of the previous version of the code, and maybe traces of a static analysis it ran on the code before providing its previous comments. And then it takes these into account in its verification process.
Option 2 cannot be (easily) implemented as just function calling or with an immutable/fresh object, and does have the extra power of a multi-agent system.
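Here is one way the two options could look in code. It is a toy sketch: `find_issues` is a hypothetical string check standing in for an LLM review or static analysis, and the list of `drafts` stands in for the coder's successive fixes.

```python
from typing import List

def find_issues(code: str) -> List[str]:
    # Toy check standing in for an LLM review or a static analysis pass.
    issues = []
    if "TODO" in code:
        issues.append("unfinished TODO")
    if "eval(" in code:
        issues.append("uses eval")
    return issues

class StatelessVerifier:
    # Option 1: every turn starts afresh; equivalent to a plain function call.
    def verify(self, code: str) -> List[str]:
        return find_issues(code)

class StatefulVerifier:
    # Option 2: keeps a private history of its own earlier comments.
    def __init__(self) -> None:
        self._past_issues: List[List[str]] = []

    def verify(self, code: str) -> List[str]:
        issues = find_issues(code)
        seen_before = {i for turn in self._past_issues for i in turn}
        comments = [f"{i} (already raised earlier)" if i in seen_before else i
                    for i in issues]
        self._past_issues.append(issues)
        return comments

def coder_loop(verifier, drafts: List[str]) -> List[List[str]]:
    # Show each draft to the verifier; stop once it has no comments.
    transcript = []
    for code in drafts:
        comments = verifier.verify(code)
        transcript.append(comments)
        if not comments:
            break
    return transcript

drafts = ["x = eval(s)  # TODO", "x = eval(s)", "x = int(s)"]
print(coder_loop(StatelessVerifier(), drafts))
print(coder_loop(StatefulVerifier(), drafts))
```

On the second draft the stateful verifier can point out that the `eval` issue was already raised, which the stateless one has no way of knowing.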
To be part of a "true" multi-agent system, then, at least some of the agents should:
- have a private state, not observable by other agents.
- this state must persist between different invocations of the agent and affect its behavior. (of course, we assume that the invocations may change the internal state).
Note that both are needed: if there is no private state, and the state is accessible by other agents, then this agent can be considered as part of the other agent. On the other hand, if the state only persists within an invocation, and will not affect future interactions, then this is just like function calling, or like accessing a stateless tool.
Can't these conditions be implemented also in a single-agent system? Yes, but with extra complexity, and I would argue that a system that implements this is a multi-agent system.9
Observe also what is not part of the requirements: different LLMs, different toolsets, different system prompts, different skillsets, or different personas. Indeed the different agents may be copies of one another. We just need them to have (a) private state not observable by other agents; and (b) state persistence across calls to the same agent. For example, consider "negotiating agents" talking with each other, observing their own (private) accepted ranges as well as their own and the other agents' proposals and responses. At the first step, each agent has an internal dialog in which it chooses a random range, and then they negotiate. This can already lead to interesting multi-agent behavior, and may be part of an interesting simulation.
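A toy version of such a negotiation, with no LLM involved, might look as follows. The `Buyer` and `Seller` classes, the price ranges, and the fixed step of 10 are all assumptions made for illustration. The point is only that each side holds a private limit the other never sees, and that each side's state persists across the calls made to it.

```python
import random
from typing import Optional

class Buyer:
    def __init__(self, rng: random.Random) -> None:
        self._max_price = rng.randint(50, 100)  # private: never shown to the seller
        self._last_offer = 0                    # persists across calls

    def next_offer(self) -> Optional[int]:
        # Raise the offer by 10 each round, never above the private maximum.
        # Returns None to walk away once the maximum has been offered.
        if self._last_offer >= self._max_price:
            return None
        self._last_offer = min(self._last_offer + 10, self._max_price)
        return self._last_offer

class Seller:
    def __init__(self, rng: random.Random) -> None:
        self._min_price = rng.randint(30, 80)   # private: never shown to the buyer
        self._offers_seen = []                  # persists across calls

    def accepts(self, offer: int) -> bool:
        self._offers_seen.append(offer)
        return offer >= self._min_price

def negotiate(buyer: Buyer, seller: Seller) -> Optional[int]:
    # The two sides only exchange offers and accept/reject answers.
    offer = buyer.next_offer()
    while offer is not None:
        if seller.accepts(offer):
            return offer
        offer = buyer.next_offer()
    return None

rng = random.Random(0)
buyer, seller = Buyer(rng), Seller(rng)
print("deal at:", negotiate(buyer, seller))
```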
As I remarked above, each sub-module, or each sub-expert, maintaining its own state that persists across invocations qualifies for me as a multi-agent system and gives rise to interesting behaviors. However, as I also remarked above, by sharing the state between experts and reducing modularity, we may get a single-agent system that might be even stronger than the multi-agent one! (This is actually not that surprising. Consider yourself performing a task that requires multiple expertise on your own, vs. delegating parts of it to others. In the delegating case, even if the others are more qualified or talented than you, the end-product might suffer because they may not see the entire picture of the project and all the interactions the way you would if you did it on your own).
There is another aspect of multi-agentic systems, though, that will be significantly harder to replicate in a single-agent system: true asynchronicity.
When discussing multi-agent LLM systems, many people bring up "the Actor model" as a way to implement them. The fact is, they are likely implementing a weak version of the Actor model, or don't fully understand it, as the Actor model has a core property that current multi-agent LLM systems lack: each Actor (agent) is independent and asynchronous. That is, when an Actor sends a message to another Actor, it does not wait for a reply. Likewise, messages from other Actors (agents) may arrive while the Actor is performing another task, and the Actor may choose to respond to, ignore or reject them. Meanwhile, the Actors all act on a shared environment and may affect it in an async way. This adds a lot of complexity to the system, a complexity that is currently ignored by most LLM agents. Indeed, it is hard to build systems that operate robustly in this way (but when one succeeds, the system may do very interesting things). One place that does make use of such async modeling is when agents are actors in a non-turn-based multi-player game. This is also a situation that should be considered if we really want AI agents to model swarm or crowd behavior or dynamics.
I highlighted two aspects of "true" multi-agent systems that I find to be lacking in the current discourse about multi-Language-Agent systems: private state that persists across invocations, and a true async execution model. Both of these can lead to interesting behaviors and dynamics that are not easily obtainable in systems that do not have these properties. My feeling is that both of them are under-explored, and offer interesting opportunities both in research and applications. I hope that by highlighting these differences between "multi-expert" and "multi-agent" systems, we may better understand what we may be missing in our current systems, and find opportunities for improvements (or decide that the extra complexity is not worth it, but at least do so consciously).
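The fire-and-forget messaging described above can be sketched with `asyncio` queues as mailboxes. This is a toy illustration, not a full Actor-model implementation: the `Actor` class, the `"stop"` message convention, and the list used as a shared environment are all assumptions made for the example.

```python
import asyncio
from typing import List

class Actor:
    def __init__(self, name: str, env: List[str]) -> None:
        self.name = name
        self.inbox: asyncio.Queue = asyncio.Queue()  # mailbox for incoming messages
        self.env = env       # shared environment that all actors may modify
        self._handled = 0    # private state, persists for the actor's lifetime

    def send(self, other: "Actor", msg: str) -> None:
        # Fire-and-forget: enqueue the message and return without awaiting a reply.
        other.inbox.put_nowait((self.name, msg))

    async def run(self) -> None:
        # Process messages whenever they arrive, until told to stop.
        while True:
            sender, msg = await self.inbox.get()
            if msg == "stop":
                break
            self._handled += 1
            self.env.append(f"{self.name} handled {msg!r} from {sender}")

async def main() -> List[str]:
    env: List[str] = []
    a, b = Actor("a", env), Actor("b", env)
    tasks = [asyncio.create_task(a.run()), asyncio.create_task(b.run())]
    a.send(b, "ping")   # neither sender blocks on the other
    b.send(a, "pong")
    a.send(b, "stop")
    b.send(a, "stop")
    await asyncio.gather(*tasks)
    return env

env = asyncio.run(main())
print(sorted(env))
```

A real system would also need each actor to decide what to do with messages that arrive mid-task (respond, ignore, reject), which is where most of the complexity lives.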
Footnotes
- [1] TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage
- [3] I am not convinced summarization is really integral to being an agent, but that's what they use.
- [5] TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Systems
- [6] There is a form of parallelism that does call for a multi-agent system, which is interesting and which I will discuss at the end (see "Actors, the final frontier"). But current systems by and large do not use this form of parallelism, as it will really complicate things.
- [7] This is not quite that simple: if a function is passed a reference to a mutable environment and changes it, or if it has access to an environment-writing tool that allows it to store information between calls, then it can simulate the global variable (and cross-call memory) this way. We can also implement a function that always returns a pair of a result and a state, coupled with an execution architecture that keeps track of the returned state and passes it as a parameter in the next call. This is actually a common technique for storing state between calls in scalable systems. It's software, everything can be implemented as everything else. But I think you get the idea: let's assume this does not happen, and if it does, consider it a cheat.
- [8] ReAct: Synergizing Reasoning and Acting in Language Models
- [9] Note however that a system implementing (2) but not (1) is a single-agent system which may be stronger, in some ways, than a multi-agent system (though it pays in complexity and reduced modularity). This might be a bit of a mess and maybe harder to design (as it introduces higher coupling between modules), but may also have benefits in terms of information access and information flow (for example, it may be beneficial for the documentation-search agent to know the entire recent history of the coding agent, as it will help it decide which search terms to use).