Imitation learning by U-Net for a multi-player strategic game
by Kha Vo
Kaggle write-up, similar to this post but with more community comments
Introduction
After 3 months of competing, from July to September 2020, my team “KhaVo Dan Gilles Robga Tung” and I finished high (rank 8/1143) in the world-class AI programming challenge “Halite by Two Sigma” held on Kaggle. In this post I would like to present our team’s solution (co-written by all members: Kha Vo, Dan Grunberg, Gilles Vandewiele, Rob Gardiner, and Tung M Phung). Many thanks to my teammates for the great time we spent together. We would also like to thank Two Sigma and Kaggle for hosting this wonderful competition. Personally, after competing in more than 10 competitions back-to-back over the last two years, I found this competition really special, and we enjoyed it immensely:
- Supportive Kaggle staff, huge thanks to DJ Sterling, Myles O’Neill, Addison Howard, and especially Sam Harris for constantly replying to bug reports and improving Kaggle simulation environment.
- The game replay animation is great, letting us enjoy matches much like watching competitive sports.
- The matchmaking system is generally fine and relatively fair.
- No need to be scared, worried, or desperate about an LB shake-up.
- Competing against a different set of competitors on Kaggle, who I’m sure are very strong programmers.
- No discussion brawls, scandals, or cheating.
- Normal ranking points and medals, which is a huge motivation.
If there was any problem with this competition, it would be that only 4 halite-threaded swag items were given to the top 4 teams. We need more swag!

Generally speaking, our best bot consists of a core imitation-learning model that directly predicts each ship’s immediate action in a centralized fashion. This core backbone is equipped with fine-tuned heuristics for some key policies, such as base spawning, ship converting, and base protecting, which directly override the outputs of the model when needed. We will show how a data-driven approach can mimic behaviours of other bots that are extremely hard to code.
After reading, a new reader can expect to understand:
- What this competition is about
- How challenging it is to hard-code a bot that competes well
- How a data-driven approach can massively mitigate this challenge and produce an outstanding result
The Competition Game Rules
The full game rules are described on the competition webpage. We only extract some key rules as follows.
- There are 4 players (battle royale style) competing in each “episode”, or game.
- Each player starts with 5000 halite and 1 ship.
- At each time step, a ship can choose from 6 actions: NORTH, EAST, SOUTH, WEST, CONVERT, and None (stay idle).
- Each ship can CONVERT itself into a shipyard (or base). The conversion fee is 500 halite. The other actions, moving or staying idle, are free.
- If a ship stays idle, it mines 25% of the halite from the cell it occupies and adds it to its “cargo”. Its cargo is unlimited.
- A base cannot move, and can choose from 2 actions: SPAWN and None (do nothing). SPAWN means the base spawns a new ship on its position. The SPAWN fee is 500 halite.
- Ships need to move back to a base to deposit their cargo. The deposited halite is added to the bank. This is similar to StarCraft.
- Halite spent on SPAWN or CONVERT must come from the bank (except that a ship converting itself may also spend its own cargo).
- At any time step, multiple ships (from all teams, including friendly ships) can step onto the same position. In this case, the lightest ship (the one with the least cargo) destroys all the others. In case of a tie, all ships are destroyed.
- A ship stepping on an enemy base will destroy that base.
- If multiple ships step onto a base, ship-ship collisions are resolved first, then the ship-base collision. The full turn resolution order is described in the full game rules.
- Halite on the map is randomly and symmetrically distributed at the start of the game.
- Halite in each cell regenerates by 2% per turn, up to a maximum of 500 halite.
- The game lasts a maximum of 400 turns.
- At the end of the game, player ranking (1,2,3,4) is calculated based on final banked halite.
Phew, sorry for making you read those boring rules. Just watch this game and you’ll see how a game flows!

Manual Bot
As we can see, despite the simple game rules, matches are extremely competitive. Hundreds of ships fight fiercely for each halite cell. The logic needed to make a bot work well is certainly not simple, as it must take into account multiple aspects of the game. For simplicity, we will describe some traits that contribute to a good bot:
- How to move all ships to various cells to optimize halite per turn.
- How to decide which area to go to, i.e., avoiding areas densely populated by enemies.
- When and how to send a ship to deposit, i.e., overly heavy ships are prone to being destroyed, but returning too frequently decreases the halite mined.
- How to detect an isolated enemy ship, and how to move our herd of light ships to kill it.
- When to spawn new ships. Spawning is a crucial factor, as spawning too many or too few greatly affects the game dynamics: mining focus and halite saving (few ships) versus attack focus and domination (many ships).
- When and where to convert a ship into a base.
- How to track and rank opponents in order to pick attack targets or avoid collisions.
- …
Each of the points above caused us a great headache. It is much like a parameter-tuning task, except the “parameter” is not simply a set of numbers that can vary, but a whole mess of programming code.
Anyway, we managed to build some very decent variants of manual bots that worked considerably well on the live leaderboard (LB). Wait! What is an LB? Let me quickly explain that as well.
How the live leaderboard ranking works
The most interesting feature of this competition is the live LB ranking. Everybody can submit their own bots every day. Each bot starts with 600 points representing its “skill” level. Bots from all teams are continually drawn against others of similar skill to form 4-player battle-royale matches (all 4 bots must come from 4 different teams). When a match finishes, the 4 bots are ranked (1, 2, 3, 4) by their final halite. The 1st and 2nd place teams gain some points, and the 3rd and 4th place teams lose some. As you can imagine, a very good bot will win most of its matches and keep increasing its skill points until it meets bots of similarly high skill and loses. Matches across the whole pool of thousands of bots are played restlessly until the deadline, three months after the competition began. In other words, this matching system is similar to an Elo rating in chess, if one is familiar with chess ratings. A one-week-old bot may play several hundred matches, guaranteeing its skill score converges on the LB.
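For intuition, here is a toy Elo-style update that treats a 4-player match as all pairwise duels decided by placement. Kaggle’s actual skill update rule is not published; the K-factor, the 400 scale, and the pairwise treatment are all assumptions for illustration.

```python
# A minimal Elo-style sketch of how such a rating system could behave.
# Kaggle's actual update rule is not published; constants are illustrative.
def expected_score(r_a, r_b):
    """Probability that a bot rated r_a beats a bot rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_ratings(ratings, placements, k=32.0):
    """Treat a 4-player match as all pairwise duels decided by placement.

    ratings:    list of 4 current skill scores
    placements: list of 4 final ranks (1 = best)
    """
    new = list(ratings)
    for i in range(4):
        for j in range(4):
            if i == j:
                continue
            actual = 1.0 if placements[i] < placements[j] else 0.0
            new[i] += k * (actual - expected_score(ratings[i], ratings[j]))
    return new

# Example: a 600-rated newcomer wins a match against three 650-rated bots.
print(update_ratings([600, 650, 650, 650], [1, 2, 3, 4]))
```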
How a manual approach causes such a headache
Now let us go back to the manual bot. To resolve the various questions mentioned, one needs a bot “framework” to begin programming with. One typical framework is shown below.

At each time step, we know everything about the game, including the positions of all our/enemy ships/bases, how much halite/cargo everyone has, and also information from past time steps. That means we also know how opponents moved in previous steps, and we can aggregate some game statistics as well, such as how much halite each enemy has gained per turn and how many kills they have committed. Using all this information, one can solve a two-stage assignment problem to assign a move to each ship.
- Stage 1: Assign a mission to each ship, with input parameters taken from the global information and output parameters indicating the properties of the mission, e.g., which enemy to attack.
- Stage 2: Assign a move to each ship based on the mission from stage 1. For instance, a “mining” mission for ship A with target cell (12, 17) can simply guide ship A from its current position, say (10, 10), to the adjacent cell that is one step closer to (12, 17), i.e., (11, 10) or (10, 11). A minimal sketch of this stepping logic is shown below.
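Concretely, stage 2 reduces to picking a neighbouring cell that shrinks the toroidal distance to the target. Below is a minimal sketch of that stepping logic; the 21x21 size and the axis convention (NORTH = +y) are assumptions for illustration, and the real bot layers collision scoring on top.

```python
# Minimal sketch: step one cell closer to a target on the wrap-around board.
# Axis convention (NORTH = +y) and board size are illustrative assumptions.
SIZE = 21

def signed_wrap_delta(a, b, size=SIZE):
    """Shortest signed offset from coordinate a to b on a torus axis."""
    d = (b - a) % size
    return d - size if d > size // 2 else d

def candidate_moves(pos, target):
    """Moves (subset of NORTH/SOUTH/EAST/WEST) that reduce toroidal distance."""
    dx = signed_wrap_delta(pos[0], target[0])
    dy = signed_wrap_delta(pos[1], target[1])
    moves = []
    if dx > 0: moves.append("EAST")
    if dx < 0: moves.append("WEST")
    if dy > 0: moves.append("NORTH")
    if dy < 0: moves.append("SOUTH")
    return moves or ["None"]  # already on target

print(candidate_moves((10, 10), (12, 17)))  # ['EAST', 'NORTH']
```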
The headache begins. Seeing a board state with more than 30 ships, how do we assign the optimal mission to each ship? Which ships should mine, which should attack, harass, deposit…? Even if we manage to assign, say, ships 1, 2, 3, 4 to attack an enemy ship (5), how do we chase the prey effectively? Now, let us go through each key mission for a manual bot.
How to mine
- An optimization assignment problem assigns each ship a target cell to go to. The problem is solved globally for all ships (see the sketch after this list).
- The score of each cell for each ship is determined by the average halite per turn for that ship to reach that cell and return to the nearest base. The algorithm is explained in this great notebook from our team.
- That score is partly affected by a “dominance” factor: a per-cell score indicating how many enemy ships surround it.
- The cells of our own shipyards are scored by the ship’s current cargo divided by its distance to that cell. For an early mining boost, we use cargo = 0 so that ships do not return too frequently early in the game.
- A shipyard can be targeted by at most 4 friendly ships.
- When the assignment problem is done, each ship claims its target and begins moving toward it.
- The moving procedure is handled by another assignment problem: for each ship, each adjacent “immediate target” cell gets a score. This second assignment problem avoids self-collisions.
- Enemy attacking and collisions are all considered in this moving assignment problem.
- Detecting campers (empty enemy ships idling on a zero-halite cell next to our shipyard) and destroying them.
- A subset of our ships can be assigned to attack an enemy ship.
- The max_ship calculation and whether to spawn are extremely important. We have a decent scheme using logistic regression with some basic features, plus some rules.
- Predicting the next enemy moves: using avoidance considerations and attacking options for each enemy ship, we can predict the most probable next enemy moves, and this is integrated into our move-assignment problem. But this is surely not as good as an ML approach for navigating a local region.
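To make the first bullet concrete, here is a hedged sketch of the global ship-to-cell assignment using scipy’s linear sum assignment. The halite-per-turn score is heavily simplified; our real scores also fold in the dominance map, return trips, and shipyard targets.

```python
# Hedged sketch of the global ship-to-cell mining assignment. The score is
# a simplification; real scores also use dominance and return distances.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_targets(ship_positions, cell_halite, size=21):
    """Assign each ship a target cell, maximising a halite-per-turn score.

    cell_halite: dict mapping (x, y) -> halite on that cell.
    """
    cells = [(x, y) for x in range(size) for y in range(size)]

    def dist(a, b):  # toroidal Manhattan distance
        return sum(min(abs(u - v), size - abs(u - v)) for u, v in zip(a, b))

    # score[i, j]: rough halite per turn for ship i travelling to cell j
    score = np.zeros((len(ship_positions), len(cells)))
    for i, ship in enumerate(ship_positions):
        for j, cell in enumerate(cells):
            travel = dist(ship, cell) + 1  # +1 turn to mine
            score[i, j] = cell_halite[cell] * 0.25 / travel

    rows, cols = linear_sum_assignment(-score)  # negate to maximise
    return {ship_positions[i]: cells[j] for i, j in zip(rows, cols)}
```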
The general framework above contains a lot of customized tricks and tuning. We also worked on the following ideas:
- Camping opponents’ shipyards with our own ships is possible, but needs to be applied in a smart way, i.e., we must stop camping once we know the opponent has a counter-camping tactic.
- “Global-memory” variables that capture the full history of moves, positions, missions, targets, game stats… of all our ships and every aspect of the game.
- Attacking strategy: how do we use a fixed set of ships to kill 1 enemy ship in the shortest time? This toy problem needs to be mastered as a black box, and will be discussed later.
- Predicting the final halite of all teams at steps 300-340: this serves as a factor in deciding which enemy to launch the end-game base raid on. The prediction uses a simple regression model with some basic in-game stats, such as the number of ships/bases, remaining halite, … (a hedged sketch appears after this list).
There are many more aspects of our current best manual bot, but the general ideas are as mentioned.
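For the end-game prediction mentioned above, a plain linear regression already illustrates the idea. The feature list and the training arrays below are placeholder stand-ins, not our actual scraped data or exact model.

```python
# Hedged sketch of the end-game halite predictor. Feature choice and model
# (plain linear regression) illustrate the idea, not our exact code.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Placeholder rows standing in for stats scraped from finished replays:
# [step, n_ships, n_bases, banked_halite, cargo, map_halite]
X_train = rng.random((1000, 6))
y_train = rng.random(1000)  # placeholder final banked halite

model = LinearRegression().fit(X_train, y_train)

# At steps 300-340, predict each opponent's final halite and pick the
# biggest end-game threat as the base-raid target.
opponents = rng.random((3, 6))
raid_target = int(np.argmax(model.predict(opponents)))
```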
How to launch an effective attack
Chasing and isolating a single enemy ship is not as simple as it may seem; it requires some analytical problem solving. We will quickly show how to scan for a scene in which we have a high chance of killing off an enemy ship. In the general case, if we randomly assign a bunch of ships to attack an enemy ship by just moving towards the prey without considering anything else, they may chase… forever. So we came up with a detailed tactic for scanning for a potential prey and chasing it with a minimal number of ships. Let us look at the figure below.

To ensure a certain kill, we need at least one ship occupying each coloured quadrant from the prey’s perspective. The predators move towards the prey by selecting the immediate move that approaches the nearest red lines. With this tactic, the prey will surely be encircled, and we might need one extra ship to secure the kill. Note that if any quadrant contains no ship, the chase will mostly be unsuccessful. Using this tactic, one can easily estimate the “threat” to any ship. A cool trick is that a team can “borrow” a ship from a third team to kill the prey: use 3 ships to close the range and chase the victim to the end of a tunnel where another enemy is waiting.
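As an illustration of the quadrant rule, here is a hedged sketch of the feasibility check: a chase is only launched when every diagonal quadrant around the prey contains a hunter. The wedge boundaries and the handling of hunters sitting exactly on a diagonal are assumptions for illustration.

```python
# Hedged sketch of the chase-feasibility check: a kill is likely only if
# every diagonal quadrant around the prey contains at least one hunter.
SIZE = 21

def signed_delta(a, b, size=SIZE):
    d = (b - a) % size
    return d - size if d > size // 2 else d

def quadrant(prey, hunter):
    """Diagonal quadrant of hunter relative to prey: N, E, S or W wedge.
    Returns None on the exact diagonals (such a hunter straddles two wedges)."""
    dx = signed_delta(prey[0], hunter[0])
    dy = signed_delta(prey[1], hunter[1])
    if abs(dx) == abs(dy):
        return None
    if abs(dy) > abs(dx):
        return "N" if dy > 0 else "S"
    return "E" if dx > 0 else "W"

def kill_is_feasible(prey, hunters):
    covered = {quadrant(prey, h) for h in hunters}
    return {"N", "E", "S", "W"} <= covered

print(kill_is_feasible((10, 10), [(10, 14), (14, 10), (10, 6), (6, 10)]))  # True
```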
Manually Controlling Aggressiveness
Next, we would like to quickly describe how we control the “aggressiveness” of our bot throughout the game. Warning: this is a highly customised tactic specific to this game, and it requires massive programming and tuning. You can skip this part and jump to the next section without losing your grip on this post.
- The aggressiveness is the maximum fraction of our ships that have 0 halite and are in the ATTACK_SHIP state.
- For the first 50 steps of the game, our bot has 0 aggressiveness. After that, the value alternates between 0.8 and 0.3 every 60 steps (see the sketch at the end of this list).
- From step 320, the aggressiveness is fixed at 0.3. We found that a changing aggressiveness works better than a fixed value; we tried a dozen combinations and finally chose 0.8, 0.3, and 60 as the parameters.
- In most cases, our ships only attack opponent ships carrying more halite (we don’t attack opponent ships with equal halite). The exception is when an empty enemy ship is very close to one of our bases: one of our empty ships is assigned to chase it, not to die together with it, just to push it far from our bases.
- In normal attacking cases, an empty opponent ship is targeted by at most 1 of our ships; a non-empty opponent ship is targeted by at most 4 of our ships (all four carrying less halite than the target). An exception is the end game, when protecting our bases.
- The opponent ships to be targeted (i.e. the preys) with high priority are:
- those that are not empty and have only 0 or 1 safe moves. (At most 4 of our ships attack these preys.)
- those “near” our bases. If they are empty, at most 1 of our ships attacks them; if they are not empty, at most 4. How “near” is near enough to attack is determined by our number of friendly ships.
- The preys with low priority are those that are not empty and in our vicinity. If our ships have no good mining spots to collect from, they go attack these preys.
- Only empty ships (i.e. ships with 0 halite) are allowed to attack bases.
- Only attack a base of a player whose “worth” is high enough. A player’s worth is computed from his number of ships, bases, halite, cargo, and the game step; the detail is in the function below. We only attack opponents whose worth is > 0.8× our worth and > our worth - 2000.
```python
# function to compute worth
if self.board.step < 200:
    worth = halite + nship * 500 + nyard * 1000 + cargo * 0.8
elif self.board.step < 300:
    worth = halite + nship * 400 + nyard * 400 + cargo * 0.8
elif self.board.step < 370:
    worth = halite + nship * 100 + nyard * 100 + cargo * 0.9
else:
    worth = halite + cargo * 0.9
```
- If one of our ships is next to a suitable opponent base, we command the ship to attack it. A suitable opponent base is one that:
- satisfies the “worth condition” described above.
- has an owner who either has less than 500 halite (so he cannot spawn a ship on that shipyard to counter our attack) or does not want to spawn to counter an attack (we record this information from the previous steps of the match).
- From step 360, empty ships that cannot collect a significant amount of halite before the end of the game are commanded to attack enemy bases.
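The alternating schedule described above fits in a few lines. This is a sketch under the stated parameters (0 for the opening, then 0.8/0.3 alternating every 60 steps, 0.3 from step 320); the exact phase alignment is an assumption.

```python
# Sketch of our aggressiveness schedule: the maximum fraction of empty
# ships allowed to take the ATTACK_SHIP role at a given step.
def max_attacker_fraction(step):
    if step < 50:
        return 0.0   # no aggression in the opening
    if step >= 320:
        return 0.3   # fixed for the end game
    # Alternate between 0.8 and 0.3 every 60 steps after the opening;
    # which value comes first is an illustrative assumption.
    phase = (step - 50) // 60
    return 0.8 if phase % 2 == 0 else 0.3
```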
Protecting Bases
One important aspect of the game is protecting bases. Enemies can destroy our bases by deliberately crashing into them; in these situations, both the base and the enemy ship are destroyed. We have two approaches to protecting a base. First, a friendly ship can constantly stand on the base until a depositing ship arrives and becomes the new protector, pushing the old protector out to mine. Second, a more complex refactored protector strategy ensures that one of our ships is always closer to each of our bases than the closest enemy ship: if the closest enemy ship is at distance d from our base, our designated ship may not leave the radius d - 1. This is the improved version of the first approach, although our mining efficiency decreases by roughly 17% when 2 out of 12 ships act as protectors. It also has the drawback that it cannot defend against attacks by multiple ships. On the other hand, the first approach has the advantage of always preventing enemy ships from coming close to our base territories, if the opponent’s tactic is simply to scan for unprotected bases. A sketch of the radius rule follows.
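A minimal sketch of that radius rule, assuming a toroidal-Manhattan distance function `dist` is supplied by the caller:

```python
# Sketch of the refactored protector rule: keep one friendly ship strictly
# closer to each base than the closest enemy ship.
def protector_radius(base, enemy_ships, dist):
    """Maximum distance the designated protector may wander from the base.

    dist: a toroidal-Manhattan distance function. If the nearest enemy is
    d away, the protector must stay within radius d - 1.
    """
    if not enemy_ships:
        return None  # no threat; the protector is free to mine
    d = min(dist(base, e) for e in enemy_ships)
    return max(d - 1, 0)
```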

Machine Learning Bot: Imitation Learning by Semantic Segmentation
During the competition, we were inspired by some machine learning approaches discussed publicly on the forum. We decided to try this direction, and personally, we believed it would give a significant boost, because we always trust in data and the magical methods humanity has built on top of it. Inspired by this starter notebook by [david], we managed to make a data-driven machine learning bot work nicely! The high-level steps are listed below.
- Continually collect match replays from top teams on the live LB.
- “Imitate” their bots by learning the actions of all ships given a board state. Basically, each match gives us 400 turns corresponding to 400 images to learn from. Thousands of matches suffice for a model to generalise.
- The machine learning model is inspired by semantic segmentation (SS) in computer vision. In SS, we are given raw images as input and must label the object class of each pixel. We formulate the multi-agent board as an image the size of the board (n_features x 21 x 21), where each pixel represents a cell carrying various features (n_features), e.g., the occupying ship, amount of halite, distance to the nearest base, how many enemy ships are around, …
- The output of the task is another image of size (5 x 21 x 21), such that each pixel represents an action (depth 5) of whatever object occupies it (an empty cell still produces an action, but since there is no ship on it, we simply ignore it).

Feature Engineering
Feature engineering efforts were minimal. Besides some basic game feature layers such as ship/base positions and cell halite, we hand-crafted some informative features such as a dominance map, threat estimation, and last action. The features are listed below, with the embedded data augmentation explained where due (a minimal encoding sketch follows the discussion below).
- Ship presence (4 layers for 4 teams, with our own team fixed at a known layer index and the enemy layer indices randomly swapped)
- Base presence (4 layers)
- Empty ship presence (4 layers)
- Ship cargo (4 layers)
- Overall halite of player (4 layers)
- Manhattan distance to nearest friendly base (4 layers)
- Manhattan distance to nearest enemy base (4 layers)
- Self/enemy dominance map, computed by convolution of self/enemy ship positions (4 layers)
- Last action of each ship (4 layers)
- Threat estimation of each ship (4 layers, each is the count of nearby lighter enemy ships with distance<=2)
- Map halite (1 layer)
- Game step (1 layer, all pixels same value of normalised game step)
We also tried combining each group of 3 enemy layers into one “all enemies” layer and removing the separate per-enemy layers. Results varied. We think the model can exploit information about specific enemies to produce more reliable actions. For instance, multiple teams may try to occupy the same cell while no one dares to go in first, forming a “limbo” state; separate enemy layers can help the model command our ship to break the deadlock.
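To make the input formulation concrete, here is a hedged sketch of packing a board state into the feature tensor. Only a handful of the layers above are shown, and the `board` dict layout is a placeholder rather than the kaggle-environments API.

```python
# Hedged sketch of packing a board state into the (n_features, 21, 21)
# input tensor. The `board` dict layout is a placeholder for illustration.
import numpy as np

def encode_board(board, size=21):
    x = np.zeros((6, size, size), dtype=np.float32)
    x[0] = np.asarray(board["map_halite"]) / 500.0  # cell halite, normalised
    for (cx, cy), team in board["ships"].items():   # ship presence, 4 teams
        x[1 + team, cy, cx] = 1.0
    x[5] = board["step"] / 400.0                    # game step, broadcast
    return x
```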
Data Augmentation and Test-time Augmentation
- The 3 enemy layers (for each feature) are randomly swapped.
- The whole n_features x 21 x 21 input image and the output label are randomly flipped/rotated. When doing so, it is important to change direction-related pixel values/labels, i.e., label pixels and last-action pixels (e.g., NORTH becomes SOUTH under a vertical flip).
- Random toroidal crops (in a wrap-around fashion) to enrich the training data.
- Expanding the 21x21 view into a 32x32 view to resolve the edge problem during learning: pixels at the edges would otherwise lose information from the opposite edge of the wrap-around board.
- Random dropout: remove a few friendly ships.
- TTA: apply the same flip/rotate/toroidal-crop techniques (two of these augmentations are sketched below).
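Two of these augmentations sketched in code: a toroidal shift (any crop of the wrapped board is equivalent to a roll) and a vertical flip with its NORTH/SOUTH label permutation. The action channel ordering is an assumption.

```python
# Hedged sketch of two augmentations: a toroidal shift and a vertical flip
# with its label permutation. Channel ordering here is an assumption.
import numpy as np

ACTIONS = ["NORTH", "EAST", "SOUTH", "WEST", "NONE"]  # assumed channel order

def toroidal_shift(x, y, dx, dy):
    """Roll input features x (C, H, W) and label y (5, H, W) by one offset."""
    x = np.roll(x, shift=(dy, dx), axis=(1, 2))
    y = np.roll(y, shift=(dy, dx), axis=(1, 2))
    return x, y

def vertical_flip(x, y):
    """Flip the board upside down and swap the NORTH/SOUTH label channels
    (direction-encoded feature layers would need the same fix, omitted here)."""
    x = x[:, ::-1, :].copy()
    y = y[:, ::-1, :].copy()
    n, s = ACTIONS.index("NORTH"), ACTIONS.index("SOUTH")
    y[[n, s]] = y[[s, n]]
    return x, y
```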

The model
- We employed catalyst (https://github.com/catalyst-team/catalyst) as the main training framework and found it extremely convenient. It gave us error-free training and all the conveniences of checkpointing, scheduling, optimizers, … like TensorFlow, while still allowing PyTorch’s dynamic coding style and free modification of the network architecture.
- The n_features x 21 x 21 image is passed through a U-Net semantic segmentation model with a modified EfficientNet (EffNet) encoder. The predicted actions (NORTH, EAST, SOUTH, WEST, None) are passed into a linear sum assignment function for the next-action recommendation. The model originally predicted CONVERT and SPAWN too, but we found a heuristic handled these better.
- The encoder is a modified EfficientNet influenced by the ALASKA2 steganalysis challenge, which involved detecting many single-pixel artefacts. The Halite board requires single-pixel classification, so we must ensure only stride-1 convolutions at the start of the network, so as not to lose detail in the early feature maps. Qishen Ha posted a discussion on how to modify EffNet to do this. We further modified this with 5 stride-1 Conv2D layers before the EffNet-B0. This encoder is placed in a very shallow, depth-2 U-Net.
- Data and training: more than 100,000 matches were scraped during the challenge. Most of our submissions used about 3,000 of the top matches, a set that changed daily. Some players were easier to imitate than others.
- Models used 20-30 channels of data, including ship and base positions, distances to bases, dominance matrices, attack ships with no cargo, etc. David NQ’s starter kernel was an invaluable resource that inspired us.
- We didn’t have a better metric for how good a move was than improving Dice and cross-entropy loss.
- Augmentation included random 32x32 crops of the 21x21 toroidal space. Rotations and flips of the 5-channel directional arrays involved permuting the NSEW channels.
- Surprisingly, the network could learn how to create attack formations, send ships back to base in the last steps, and mine very efficiently.
- Agents trained on games from 1330+ rated bots were able to score 1400+ with added heuristics. (A toy sketch of the network’s shape contract follows this list.)
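As promised above, here is a deliberately tiny PyTorch sketch of the shape contract: a stride-1 stem, a shallow U-Net, and per-pixel logits over 5 actions. Our real encoder was the modified EfficientNet-B0 described above; this toy module only mirrors the input/output shapes on the expanded 32x32 board.

```python
# Toy sketch of the shape contract only: stride-1 stem, shallow U-Net,
# per-pixel logits over 5 actions. Not our actual EffNet-B0 encoder.
import torch
import torch.nn as nn

class TinyHaliteUNet(nn.Module):
    def __init__(self, n_features=24, n_actions=5):
        super().__init__()
        self.stem = nn.Sequential(  # stride-1 convs keep per-pixel detail
            nn.Conv2d(n_features, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.down = nn.Sequential(
            nn.MaxPool2d(2), nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        )
        self.up = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.head = nn.Conv2d(128, n_actions, 1)  # concat(skip, up) -> logits

    def forward(self, x):
        s = self.stem(x)                 # (B, 64, 32, 32)
        d = self.down(s)                 # (B, 128, 16, 16)
        u = self.up(d)                   # (B, 64, 32, 32)
        return self.head(torch.cat([s, u], dim=1))  # (B, 5, 32, 32)

logits = TinyHaliteUNet()(torch.zeros(1, 24, 32, 32))
print(logits.shape)  # torch.Size([1, 5, 32, 32])
```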
We would like to visualise some feature channels (the first 100 steps of a specific match). Some have flickering effects (channels 9, 10, 11, counted from left to right and top to bottom) due to the enemy-swapping data augmentation used in training: we have 3 enemies, and the model should not care which enemy goes into layer 9, 10, or 11, so we randomly swap them across time steps.

Results
Since the machine learning (ML) bot only produces action probabilities, a simple assignment solver must be used to assign an action to each ship, such that no two friendly ships collide on the same cell, while maximising the joint probability of all ships’ actions (a hedged sketch follows). This pure ML bot alone got us to around rank 20 on the LB, which we considered a huge success.
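A hedged sketch of that decoding step: ships are rows, candidate destination cells are columns, and maximising summed log-probabilities under a one-ship-per-cell constraint rules out friendly collisions. The axis convention and the large penalty for unreachable cells are illustrative details.

```python
# Hedged sketch: turn per-ship action probabilities into non-colliding
# moves via linear sum assignment over destination cells.
import numpy as np
from scipy.optimize import linear_sum_assignment

SIZE = 21
OFFSETS = {"NORTH": (0, 1), "EAST": (1, 0), "SOUTH": (0, -1),
           "WEST": (-1, 0), "None": (0, 0)}  # NORTH = +y assumed

def decode_actions(ships, probs):
    """ships: list of (x, y) positions; probs[i][action]: model probability."""
    cells = sorted({((x + dx) % SIZE, (y + dy) % SIZE)
                    for x, y in ships for dx, dy in OFFSETS.values()})
    cost = np.full((len(ships), len(cells)), 1e9)  # huge cost = unreachable
    for i, (x, y) in enumerate(ships):
        for action, (dx, dy) in OFFSETS.items():
            j = cells.index(((x + dx) % SIZE, (y + dy) % SIZE))
            cost[i, j] = -np.log(probs[i][action] + 1e-9)
    rows, cols = linear_sum_assignment(cost)  # one ship per cell, min cost
    moves = {}
    for i, j in zip(rows, cols):
        x, y = ships[i]
        dx = (cells[j][0] - x) % SIZE
        dy = (cells[j][1] - y) % SIZE
        dx = dx - SIZE if dx > SIZE // 2 else dx
        dy = dy - SIZE if dy > SIZE // 2 else dy
        moves[(x, y)] = next((a for a, o in OFFSETS.items() if o == (dx, dy)),
                             "None")
    return moves

# Two nearby ships share the same preferences; the solver deconflicts them.
p = {"NORTH": 0.6, "EAST": 0.1, "SOUTH": 0.1, "WEST": 0.1, "None": 0.1}
print(decode_actions([(5, 5), (5, 6)], [p, p]))
```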
To climb 10 more ranks and reach the top 10, we needed to combine some manual heuristics to ease the burden of learning. For instance, the ML bot need not learn when to spawn, or where and when a ship should convert into a shipyard. Although those immediate actions can be predicted by the ML bot, using them directly is not recommended, due to the lack of high-level complex human reasoning. The ML bot is adept at another, more subtle skill: navigating ships in a fiercely competitive, dense area. Light ships frequently accompany each other, fighting together for mining territory or attacking. Manual coding of this is possible, but personally, to a data scientist, it is a nightmare!
Let us take a look at how the ML bot plays. The figure below shows some stats of our top ML bot compared to the other top teams we imitate. We can see that our ML bot achieves more or less the averaged behaviour of all teams throughout the whole game.

Some points:
- Team “convex” has the best average halite reward after step 300, thanks to its aggressiveness earlier in the game (number of ships with 0 halite).
- Team “robiland” has quite a low reward but significantly more ships, trying to completely “dominate” the game.
- Late peaks in the third plot for some teams indicate an end-game base raid, a tactic that can massively damage any opponent if all of their bases are destroyed.
- Team “leukocyte” mostly focuses on mining missions and truly wants peace; it doesn’t want to spawn many ships to attack other people. This strategy is also competitive if one manages a very good dodging technique to avoid being killed plus efficient mining tactics, and then just tries to survive the crazy blitz.
- Our manual bot cannot come close to the ML bot in mining efficiency, but its versatility in attacking is outstanding, since we can adaptively adjust the number of empty ships at various stages of the game based on the situation.
Now, let’s have a look at one specific game.

We won the game above against 3 strong opponents. The final halite of the top 3 teams is very close, as shown in the last plot, “Halite”. We won this game because superior mining efficiency in the early game gave us an advantage, and our kill rate was second only to… the 4th-placed team. This game demonstrates that being too aggressive is not recommended. Also, the 2nd and 3rd placed teams had many more ships than us, but in the end we won, thanks to a great “Nash equilibrium” reached specifically in this game.

On the other hand, in another game where we finished 3rd, we did not do well in battle. Our ship count was still competitive until mid-game (around step 200), then gradually dropped due to the advanced manual attacking strategies of opponents that our ML bot cannot fully imitate. Losing too many ships left us in a terrible state at step 300, when the two top teams began to focus purely on mining, depicted by the late peaks in cargo per ship, and we couldn’t compete anymore.
Conclusions
As a team, we have had a great time together learning, discussing, enriching our knowledge, sharpening our skills, making new friends, and competing with talented people around the world. We are very proud of this result, given that most of us are seasoned data scientists on Kaggle who might not find this competition comfortable to compete in. There is also the interesting fact that no ML bot finished in the top 10 of the previous 3 Halite competitions, and that many top data scientists on Kaggle shied away from this competition due to its programming-skill-based nature. This type of competition is completely different from most others, because the fully public LB demands visible, progressive improvement. It’s like playing an open-hand card game.
We hope that our solution can inspire as many data scientists and researchers as possible in the future. The power of machine learning is something immeasurable and… magical. When we realised that our data-driven bot could copy the complex attacking tactics of other teams, our jaws literally dropped, given how hard we had struggled to hand-code an attacking tactic without any great success.
Personally, I would like to again send my thanks to my great teammates. Everyone enjoyed the competition, and will enjoy the (maybe) gold medal. It would be sweet for all of us if it comes true. I especially like this type of competition because of its LB transparency, similar to some Santa competitions. I also wrote a topic about a year ago, titled The Elegance of Policy Gradient in Reinforcement Learning, to demonstrate my passion for this area.
Finally, we’ll show a typical full match animation in which we beat 3 strong opponents on the LB (divided into 4 animations: steps 0-99, 100-199, 200-299, and 300-399). Have fun!




Thanks for reading, and comments are wholeheartedly welcomed!