How AI beat the world champion at the game of Go

artificial intelligence google deepmind reinforcement learning Aug 07, 2023

Since the release of ChatGPT a few months ago, almost everyone is aware of the impact that AI has, and will continue to have, on our lives. However, today I want to tell you about a milestone that marked a turning point in the development of AI, and yet went almost unnoticed.

In 2016, an AI algorithm called AlphaGo defeated Lee Sedol, the world champion of the ancient Chinese board game Go. You may think that this is nothing new. After all, in 1997 a program called Deep Blue defeated the world chess champion, Garry Kasparov. But that’s not the case. Go is one of the most complex and difficult games to master, one in which methods that “just search” for the best move simply do not work. In this article we will see what makes Go so difficult to solve, how a team of researchers found a disruptive solution to master it, and how a single move made by AlphaGo challenged 2,500 years of knowledge about the game.




Go is a really challenging game

Go is a game of influence. The board is a 19x19 grid, and the players take turns placing stones on the intersections. The goal is to surround (capture) as much empty territory with our stones as possible. If a surrounded area contains enemy stones, our score increases even more.

What makes Go so challenging is that, in a given board state, the number of possible moves is orders of magnitude higher than in chess. In fact, there are more possible board states in the game of Go than atoms in the known universe.
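We can sanity-check that claim with a quick back-of-the-envelope calculation. Each of the 361 intersections can be empty, black, or white, which gives 3^361 as a loose upper bound on the number of board configurations:

```python
# Each of the 361 intersections of a 19x19 board can be empty, black, or
# white, so 3^361 is an upper bound on the number of board configurations
# (most of them are illegal positions, but the legal count is still astronomical).
go_upper_bound = 3 ** (19 * 19)

# A common order-of-magnitude estimate for atoms in the observable universe.
atoms_in_universe = 10 ** 80

print(len(str(go_upper_bound)))            # a 173-digit number
print(go_upper_bound > atoms_in_universe)  # True
```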

Traditional programs use search mechanisms to evaluate how good the available moves are, based on all the possible paths that the game can take after making them. However, the search space is so immense that a computer trying to evaluate every possible move would never complete the task.


How DeepMind mastered the game of Go

DeepMind’s approach to solving the game of Go consisted of using the latest advancements in neural networks and parallel processing to drastically reduce the search space of these algorithms, focusing only on the promising moves.

To achieve this, the researchers used a strategy where they combined two neural networks with different roles: a policy network that picks the moves and a value network that evaluates board positions. The more likely a board position is to lead us to victory, the higher its value. To produce their outputs, both neural networks take as input a representation of the board state.

Neural networks are function approximators capable of learning internal representations of the dynamics of a given problem (in our case, the game of Go). By training on the appropriate data, we can make the policy network select the most effective actions and the value network produce an accurate estimate of how good a given board state is. The researchers at DeepMind achieved this through a multi-step process.
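To make the division of labor concrete, here is a minimal sketch of the two-network interface. The real AlphaGo networks were deep convolutional networks trained on rich board features; below, a single randomly initialized linear layer stands in for each, so the numbers are meaningless, but the inputs and outputs have the right shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
BOARD = 19 * 19  # 361 intersections

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy stand-ins for the two networks (one linear layer each); the real
# AlphaGo networks were deep convolutional nets, but the interface is the same.
W_policy = rng.normal(scale=0.01, size=(BOARD, BOARD))
W_value = rng.normal(scale=0.01, size=BOARD)

def policy_network(board):
    """Map a board encoding to a probability distribution over the 361 moves."""
    return softmax(board @ W_policy)

def value_network(board):
    """Map a board encoding to an expected outcome in [-1, 1]."""
    return np.tanh(board @ W_value)

board = np.zeros(BOARD)                  # empty board: +1 black, -1 white, 0 empty
probs = policy_network(board)
print(probs.shape, round(probs.sum(), 6))  # (361,) 1.0 -- a distribution over moves
print(value_network(board))                # 0.0 for this untrained net on an empty board
```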


Learning from expert moves

The first step consisted of training the policy network to predict the moves that an expert would select. To achieve this, the researchers trained the neural network on a dataset of 30 million board positions, each paired with the move a human expert played in it. After this first round of training, the policy network was able to predict the expert’s move reasonably well (55.7% of the time).
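As a sketch of what this supervised phase looks like, the snippet below performs one cross-entropy gradient step on a toy linear "policy network". The board encoding, learning rate, and expert move here are illustrative assumptions; the real model was a deep convolutional network trained on the full 30-million-example dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
BOARD = 361

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

W = rng.normal(scale=0.01, size=(BOARD, BOARD))  # toy linear "policy network"
lr = 0.1

def sl_step(board, expert_move):
    """One supervised step: raise the probability of the expert's move
    (cross-entropy loss, gradient computed by hand for a linear model)."""
    global W
    p = softmax(board @ W)
    grad = np.outer(board, p)        # d(loss)/dW for every move...
    grad[:, expert_move] -= board    # ...with the expert's move singled out
    W -= lr * grad
    return p[expert_move]

board = rng.choice([-1.0, 0.0, 1.0], size=BOARD)   # a made-up board encoding
before = sl_step(board, expert_move=72)
after = softmax(board @ W)[72]
print(before, "->", after)  # the probability of the expert's move increases
```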


Reinforcement Learning

Once the policy network can predict the moves an expert would make, the second training phase begins. In it, our policy network will learn to choose even better actions through experience, following a learning paradigm known as Reinforcement Learning.

Under this paradigm, the policy network plays against an opponent, and each time it makes a move, it receives a reward that measures how good or bad its choice is. Those rewards guide its future choices towards moves that help it win.

In this case, the researchers had the neural network play against itself, saving different versions of it throughout the learning process. In each game, the latest version of the neural network played against a randomly selected earlier version.
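The core of this phase is the REINFORCE policy-gradient update: once a game finishes, every move the network made is reinforced if it won and discouraged if it lost. Below is a minimal sketch with the same kind of toy linear policy; the trajectory and outcome are invented, and a real implementation would play full games and subtract a baseline to reduce variance:

```python
import numpy as np

rng = np.random.default_rng(1)
BOARD = 361

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

W = rng.normal(scale=0.01, size=(BOARD, BOARD))  # toy linear policy
lr = 0.01

def reinforce_update(trajectory, outcome):
    """REINFORCE: after a finished game, make every move the network played
    more likely if it won (outcome=+1) and less likely if it lost (outcome=-1)."""
    global W
    for board, move in trajectory:
        p = softmax(board @ W)
        grad_log = -np.outer(board, p)   # gradient of log pi(move | board)...
        grad_log[:, move] += board       # ...for this linear model
        W += lr * outcome * grad_log

# One fake "game": two (board, move) pairs, followed by a win.
traj = [(rng.choice([-1.0, 0.0, 1.0], size=BOARD), 60),
        (rng.choice([-1.0, 0.0, 1.0], size=BOARD), 301)]
b0, m0 = traj[0]
before = softmax(b0 @ W)[m0]
reinforce_update(traj, outcome=+1)
after = softmax(b0 @ W)[m0]
print(before, "->", after)  # winning moves become more probable
```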

Once the training process ends, the resulting neural network beats the version from the previous learning stage (trained using expert moves) 80% of the time. Furthermore, the policy network at this stage was tested against the previous state-of-the-art Go program, Pachi. This program used sophisticated simulation techniques to find the best available moves. AlphaGo’s policy network won 85% of the games against Pachi.


Training the Value Network

The Value Network estimates how good a board position is for the player, and reflects the expected outcome of the game when we play from it.

To train the value network to accurately predict the value of a given board state, the researchers collected a dataset of 30 million positions extracted from the previous self-play phase. The value network was trained to correctly estimate the value of each board position based on the final outcome of the game.
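In effect this is a regression problem: push the network’s prediction for each stored position towards the game’s final result (+1 for a win, -1 for a loss). Here is a minimal sketch with a toy linear "value network" and a tiny synthetic dataset standing in for the 30 million self-play positions:

```python
import numpy as np

rng = np.random.default_rng(0)
BOARD = 361

w = np.zeros(BOARD)  # toy linear "value network"
lr = 0.001

def value(board):
    return np.tanh(board @ w)

def train_value(positions, outcomes, epochs=200):
    """Squared-error regression: push value(board) towards the game's
    final outcome (+1 win, -1 loss)."""
    global w
    for _ in range(epochs):
        for board, z in zip(positions, outcomes):
            v = value(board)
            # gradient of (v - z)^2 for v = tanh(board @ w)
            w -= lr * 2 * (v - z) * (1 - v**2) * board

# Tiny synthetic dataset: four made-up positions with known game outcomes.
positions = [rng.choice([-1.0, 0.0, 1.0], size=BOARD) for _ in range(4)]
outcomes = [+1.0, -1.0, +1.0, -1.0]
train_value(positions, outcomes)
print([round(value(b), 2) for b in positions])  # values move towards the outcomes
```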


Monte-Carlo Tree Search (MCTS)

So far, the researchers had neural networks trained to select moves and evaluate board positions. Just by letting the trained policy network pick the moves, AlphaGo would already have been a strong Go program. But using these neural networks to extend classical tree search algorithms is what made it good enough to beat the human world champion.

The resulting program selects actions by lookahead search, following the procedure illustrated in the diagram from the original Nature article.

On each turn, the program searches for the best move by simulating a large number of games. It builds a tree in which each branch is a move and each node is the game state resulting from making that move. In each simulation, the program traverses one path through this tree, playing the game from its current state until completion.

To choose the moves in the simulation, AlphaGo combines three pieces of information: the probability with which our policy network would choose each move, the evaluation that the value network makes of the resulting game state, and a measure of the empirically obtained results when taking that action in previous simulations (with each simulation, those results change, altering the actions that are chosen).

Upon completing the simulations, AlphaGo chooses to actually execute (not simulate, but move in the real game) the move that has been chosen most often in the simulations (the most visited branch starting from the root).
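The selection rule inside the simulations is often written as picking the move that maximizes Q + u, where Q is the averaged evaluation and u is an exploration bonus proportional to the policy prior and shrinking with the visit count. The sketch below shows that rule and the final most-visited choice; the three candidate moves and their numbers are invented for illustration:

```python
import math

def select_move(prior, Q, N, c_puct=1.0):
    """PUCT-style selection used inside the simulations: balance the policy
    prior, the averaged evaluations Q, and an exploration bonus that decays
    as a move's visit count N grows."""
    total = sum(N.values())
    def score(m):
        u = c_puct * prior[m] * math.sqrt(total) / (1 + N[m])
        return Q[m] + u
    return max(prior, key=score)

def final_move(N):
    """After all simulations finish, actually play the most-visited move."""
    return max(N, key=N.get)

# Toy root position with three candidate moves (numbers are illustrative only).
prior = {"A": 0.5, "B": 0.3, "C": 0.2}   # policy-network probabilities
Q = {"A": 0.10, "B": 0.40, "C": 0.05}    # averaged evaluations so far
N = {"A": 90, "B": 8, "C": 2}            # visit counts so far

print(select_move(prior, Q, N))  # "B": promising but under-visited, so explored next
print(final_move(N))             # "A": the move actually played is the most visited
```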


AlphaGo vs Lee Sedol

Armed with this new program, DeepMind researchers challenged Lee Sedol, the world champion, to a 5-game Go match between March 9 and 15, 2016. The matches were broadcast live in Korean, Chinese, Japanese, and English, with accompanying expert commentary. Before the first match, many world-class Go players and public commentators were skeptical of the abilities of AlphaGo, as many algorithms had previously failed to pull off a feat like this. However, after the first few moves, it was clear that the program could play at the level of high-rank players, and the mood turned to silent expectation.

AlphaGo won four out of the five games, defeating the world champion and pulling off a feat that was not expected to happen for at least another decade. Following the match, the Korea Baduk Association bestowed upon AlphaGo the highest Go grandmaster rank: an honorary 9 dan.


Move 37

During the second match, something very unexpected happened. Move 37, played by AlphaGo, left many commentators speechless. At first, some of them believed that the AI had made a mistake, but as the game progressed, it became clear that the move was part of an exquisitely planned strategy by AlphaGo, which led it to victory. Michael Redmond, a professional player with the rank of 9 dan, stated that it was a move no human would have played. AlphaGo has opened a window to new strategies and insights that escape the 2,500 years of accumulated human knowledge of the game.

Perhaps move 37 is the perfect example of what we can expect in the coming decades of AI: new ways of doing things that challenge all our assumptions and force us to review what we take for granted.
