What if you could talk to a swarm of 70 robots like you talk to a friend?

Usually, if you want a fleet of drones to fly in the shape of a giant star, a human has to spend hours drawing precise maps and typing in complicated math codes. It is slow, difficult, and a little bit boring.

But a new system called CLIPSwarm is changing the game. Instead of drawing maps, scientists can now just type "a circle outline" or "the contour of a drop," and the robots figure out where to stand all by themselves.

Pablo Pueyo

CLIPSwarm paves the way and is the first step to creating these formations autonomously, which is the main contribution of this work.

How CLIPSwarm "Talks" to Robots

The Super-Librarian Brain

To understand a text command, the system uses a smart "brain" called CLIP. Think of CLIP like a super-powered digital librarian that has looked at hundreds of millions of pictures from the internet and read their captions. It has a very good idea of what a "drop" or a "circle" should look like.
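To make the "librarian" idea a bit more concrete: CLIP-style models turn a piece of text and a picture into lists of numbers (embeddings), and the match between them is typically measured with cosine similarity. The sketch below only illustrates that comparison. The three-number vectors are invented for the example and are not real CLIP outputs, and the function name is our own.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up "fingerprints" (assumption): real embeddings have hundreds of numbers.
text_circle = [0.9, 0.1, 0.2]   # pretend embedding of the text "a circle outline"
image_ring = [0.8, 0.2, 0.1]    # pretend embedding of a ring-shaped formation
image_blob = [0.1, 0.9, 0.3]    # pretend embedding of a messy blob

print(cosine_similarity(text_circle, image_ring))  # close to 1: good match
print(cosine_similarity(text_circle, image_blob))  # much lower: poor match
```

A higher number means the picture of the formation "sounds like" the text to the model.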

The Training Process

Step 1

Scientists create 1,600 different "practice" formations for the robots. To help the AI "see" the shapes, they use a convex hull, which is like stretching a giant rubber band around the outside of the robots to see the overall outline.
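Here is a minimal sketch of the rubber-band idea, assuming each robot is just an (x, y) point. It uses Andrew's monotone chain, a standard convex hull algorithm. The article does not say which algorithm CLIPSwarm uses, so treat this as an illustration only, with made-up robot positions.

```python
def cross(o, a, b):
    """Positive if the turn o -> a -> b is counter-clockwise."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Return the hull corners in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would make the outline bend inward.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]

# Six hypothetical robots: four on the corners, two inside.
robots = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
hull = convex_hull(robots)
print(hull)  # [(0, 0), (4, 0), (4, 4), (0, 4)]
```

The two robots in the middle do not touch the rubber band, so they do not appear in the outline.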

Step 2

The system uses a Monte Carlo Particle Filter. This is like a high-speed guessing game where the computer tries thousands of random positions, keeps the best ones, and throws away the "losers" that don't look like the target shape.
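The guessing game can be sketched as a simple "score, keep the best, copy with a little noise" loop. This is a simplified stand-in written in the spirit of a particle filter, not the paper's actual implementation. The scoring function below (how close the robots sit to a ring) replaces CLIP Similarity, and every name and setting (`toy_score`, `keep`, `sigma`, the particle count) is an assumption made for the example.

```python
import math
import random

def toy_score(formation, radius=1.0):
    """Stand-in for CLIP Similarity: higher when robots sit near a ring."""
    error = sum(abs(math.hypot(x, y) - radius) for x, y in formation) / len(formation)
    return 1.0 / (1.0 + error)

def random_formation(rng, n_robots):
    """One random guess: every robot dropped somewhere in a 4 x 4 square."""
    return [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(n_robots)]

def jitter(rng, formation, sigma):
    """Copy a formation, nudging each robot by a small random amount."""
    return [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma)) for x, y in formation]

def search(n_particles=100, n_robots=10, n_rounds=50, keep=0.2, sigma=0.1, seed=0):
    rng = random.Random(seed)
    particles = [random_formation(rng, n_robots) for _ in range(n_particles)]
    best_per_round = []
    for _ in range(n_rounds):
        # Rank all guesses from best to worst.
        particles.sort(key=toy_score, reverse=True)
        best_per_round.append(toy_score(particles[0]))
        survivors = particles[: max(1, int(keep * n_particles))]
        # Survivors stay unchanged; the "losers" are replaced by noisy copies.
        particles = survivors + [
            jitter(rng, rng.choice(survivors), sigma)
            for _ in range(n_particles - len(survivors))
        ]
    return best_per_round

history = search()
print(f"first round best: {history[0]:.3f}, last round best: {history[-1]:.3f}")
```

Because the best guesses are always carried over untouched, the top score in this sketch can never go down from one round to the next.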

The Impressive Results

In just 50 rounds of guessing, the robots get much smarter.

For a circle shape, their "accuracy score" (scientists call this CLIP Similarity) climbed from 0.311 to 0.342.
That is a 9.97% improvement!

For a water drop shape, the score jumped even higher—from 0.271 to 0.312.
That is a 15.1% boost in how well the robots understood the mission.
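For readers who want to check the percentages, here is the arithmetic as a short sketch. It uses only the four scores quoted above, and the helper name is our own.

```python
def percent_improvement(before, after):
    """Relative change from `before` to `after`, as a percentage."""
    return (after - before) / before * 100.0

circle = percent_improvement(0.311, 0.342)
drop = percent_improvement(0.271, 0.312)

print(f"circle: {circle:.2f}%")  # circle: 9.97%
print(f"drop: {drop:.1f}%")      # drop: 15.1%
```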

Most of this "learning" happens incredibly fast, usually within the first 15 of those 50 rounds.

The Reality Check: Not Perfect Artists Yet

The Concave Challenge

Because the robots only look at the "rubber band" outline of the group, they struggle with "concave" shapes—which are shapes that curve inward, like a bite taken out of a cookie or the windows on a house.
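A tiny experiment shows why the rubber band hides the bite. The sketch below, with made-up robot positions, builds the outline of a plain square and of the same square with a dent pushed into its top edge, then compares them. The hull routine is a compact version of Andrew's monotone chain, chosen for the example rather than taken from the paper.

```python
def hull(points):
    """Convex hull corners, counter-clockwise (Andrew's monotone chain)."""
    pts = sorted(set(points))

    def half(seq):
        out = []
        for p in seq:
            # Pop while the last two points and p do not turn counter-clockwise.
            while len(out) >= 2 and (
                (out[-1][0] - out[-2][0]) * (p[1] - out[-2][1])
                - (out[-1][1] - out[-2][1]) * (p[0] - out[-2][0])
            ) <= 0:
                out.pop()
            out.append(p)
        return out[:-1]

    return half(pts) + half(pts[::-1])

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
# Same square, plus robots forming a "bite" that dips inward from the top edge.
bitten = square + [(1, 4), (2, 2), (3, 4)]

print(hull(square) == hull(bitten))  # True
```

Both formations produce exactly the same outline, so anything that judges the formation by its outline alone cannot tell them apart.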

The Computer vs. Human Vision Problem

The team also noticed that sometimes the AI gives a high score to a shape that looks right to a computer but looks like a mess to a human.

The Next Frontier

For now, these 70 robots are stuck working in 2D on a flat surface, but the goal is to get them flying through the sky in 3D soon!

Source: "CLIPSwarm: Converting text into formations of robots," Pablo Pueyo, Eduardo Montijano, Ana C. Murillo, and Mac Schwager. (Presented at ICRA 2023 Workshop on Multi-Robot Learning). See also arXiv:2311.11047v1.