Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. A promising but largely under-explored area is their potential to facilitate human coordination with many agents. Such capabilities would be useful in domains including disaster response, urban planning, and real-time strategy scenarios. In this work, we introduce (1) a real-time strategy game benchmark designed to evaluate these abilities and (2) a novel framework we term HIVE. HIVE empowers a single human to coordinate swarms of up to 2,000 agents using natural language dialog with an LLM. We present promising results on this multi-agent benchmark, with our hybrid approach solving tasks such as coordinating agent movements, exploiting unit weaknesses, leveraging human annotations, and understanding terrain and strategic points. However, our findings also highlight critical limitations of current models, including difficulties in processing spatial visual information and challenges in formulating long-term strategic plans. This work sheds light on the potential and limitations of LLMs in human-swarm coordination, paving the way for future research in this area.
Large Language Models (LLMs) are revolutionizing how we interact with artificial intelligence, and one exciting frontier is their ability to coordinate multiple agents in complex scenarios. Enter HIVE (Hybrid Intelligence for Vast Engagements), a new framework that bridges human strategy and AI execution in real-time environments.
HIVE works by taking natural language instructions from humans and transforming them into detailed operational plans for controlling thousands of agents simultaneously. Think of it as a translator that converts your high-level strategic thoughts into tactical instructions that AI agents can understand and execute.
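The translation idea can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' implementation: `translate_command` stands in for the actual LLM call, and all names (`Order`, `expand_plan`, the formation logic) are invented for the example. The point is the two-stage shape: one natural-language command becomes one structured plan, which is then fanned out into per-unit orders for thousands of agents.

```python
import math
from dataclasses import dataclass

@dataclass
class Order:
    unit_id: int
    action: str                    # e.g. "move", "hold"
    target: tuple[float, float]    # map coordinates

def translate_command(command: str) -> dict:
    """Stand-in for the LLM call: map a natural-language command to a
    structured plan. A real system would prompt an LLM and parse its output."""
    if "surround" in command:
        return {"action": "move", "formation": "circle", "center": (50.0, 50.0)}
    return {"action": "hold", "formation": "none", "center": (0.0, 0.0)}

def expand_plan(plan: dict, unit_ids: list[int]) -> list[Order]:
    """Fan a single high-level plan out into one order per unit."""
    orders = []
    n = len(unit_ids)
    for i, uid in enumerate(unit_ids):
        if plan["formation"] == "circle":
            # Spread units evenly on a circle around the plan's center point.
            angle = 2 * math.pi * i / n
            cx, cy = plan["center"]
            target = (cx + 10 * math.cos(angle), cy + 10 * math.sin(angle))
        else:
            target = plan["center"]
        orders.append(Order(uid, plan["action"], target))
    return orders

orders = expand_plan(translate_command("surround the base"), list(range(2000)))
print(len(orders))  # one order per unit: 2000
```

Keeping the LLM responsible only for the compact plan, while deterministic code expands it into per-unit orders, is what makes controlling thousands of units with a single natural-language command tractable.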
To put HIVE through its paces, we created a comprehensive benchmark testing five essential capabilities: coordination, weakness exploitation, spatial awareness, terrain usage, and strategic planning. Our research not only demonstrates HIVE’s potential for enhancing human decision-making but also reveals important insights about current LLM limitations, including their challenges with visual-spatial reasoning and long-term strategy.
We present HIVE—outlined in the image above—a novel framework enabling natural language control of thousands of units in strategy games through human-AI collaboration. HIVE translates high-level human commands into detailed operational plans using Large Language Models (LLMs).
HIVE operates through three key components:
The game features:
We evaluated HIVE across five core capabilities:

- Coordination of agent movements
- Exploitation of unit weaknesses
- Spatial awareness
- Terrain usage
- Strategic planning
The map used to test each capability can be seen in the image below.
The evaluation of each capability, per model, is shown in the image below.
In this work, we present a new challenge for LLMs as human assistants: controlling up to two thousand units in a strategy game. We propose a new framework, HIVE, that allows a player to give high-level commands that an LLM translates into a long-term plan controlling the behavior of each unit. We showed that generalist LLMs such as Claude Sonnet and GPT-4o can handle such tasks but remain sensitive to slight changes in the player's prompts. Complementary experiments showed that HIVE requires human help to reach its best performance, and that generalist LLMs' visual ability to read an out-of-distribution map for terrain and landmark locations still needs improvement. This work opens many interesting avenues for improving LLMs' capacity to collaborate with humans, such as strengthening their map-reading abilities, reducing their sensitivity to prompts, and improving their long-term planning.
@misc{anne2024harnessinglanguagecoordinationframework,
  title={Harnessing Language for Coordination: A Framework and Benchmark for LLM-Driven Multi-Agent Control},
  author={Timothée Anne and Noah Syrkis and Meriem Elhosni and Florian Turati and Franck Legendre and Alain Jaquier and Sebastian Risi},
  year={2024},
  eprint={2412.11761},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2412.11761},
}