From Deep Blue to AlphaZero

Twenty years ago, IBM impressed the world with Deep Blue, a specialized chess-playing system that won a match against the best chess player of all time, Garry Kasparov. Last year, Google's DeepMind repeated the feat in the game of Go with AlphaGo, AlphaGo Zero and, most recently, AlphaZero. How did this evolution happen, and what are the key success factors behind these remarkable events in the history of AI?

Deep Blue

First, a little bit of history and background theory. The history of chess-playing AI goes back to 1957, when IBM research scientist Alex Bernstein wrote the first complete chess-playing program and ran it on an IBM 704. But it was only 40 years later, in 1997, that a computer-based solution reached a level high enough to surpass the best human performance. AI-based solutions for game playing did not change much during this period. The best practice was to use opening books for the first dozen moves, and then the Minimax algorithm for the search. A chess position offers on average about 35 legal moves, and a game can easily run 50 moves deep, which makes it computationally impossible for any computer to explore all possibilities until the end of the game (35^50 is comparable to the number of atoms in the observable universe). Because the search tree is so wide, even today this algorithm would only reach a depth of about 10 levels within the time allotted for a move decision. Alpha-Beta pruning was developed to discard the subtrees that cannot influence the choice of the best move in the Minimax algorithm. This helped the search reach deeper levels (up to about 18 levels from the current position). Once this depth is reached, an evaluation function is needed to assess the strength of the resulting positions and grade the candidate moves, as sketched below. Evaluation functions are very complex, and this is where humans typically excel. Garry Kasparov, for example, had extraordinarily sophisticated evaluation capabilities, better than any other player.
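
To make the search idea concrete, here is a minimal Python sketch of depth-limited Minimax with Alpha-Beta pruning; evaluate(), legal_moves() and the position object are hypothetical placeholders for a real engine's evaluation function, move generator, and board representation.

```python
# Minimal sketch of depth-limited Minimax with Alpha-Beta pruning.
# evaluate(), legal_moves() and the 'position' object are hypothetical
# placeholders for a real chess engine's components.

def alphabeta(position, depth, alpha, beta, maximizing):
    if depth == 0 or position.is_terminal():
        return evaluate(position)                     # static evaluation at the leaf
    if maximizing:
        best = float("-inf")
        for move in legal_moves(position):
            best = max(best, alphabeta(position.play(move), depth - 1,
                                       alpha, beta, False))
            alpha = max(alpha, best)
            if alpha >= beta:                         # beta cut-off: this subtree cannot matter
                break
        return best
    else:
        best = float("inf")
        for move in legal_moves(position):
            best = min(best, alphabeta(position.play(move), depth - 1,
                                       alpha, beta, True))
            beta = min(beta, best)
            if beta <= alpha:                         # alpha cut-off
                break
        return best
```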


IBM developed a specialized machine with 30 nodes, each node equipped with 16 specialized chess processing units capable of running the search in hardware. Combined, these units were able to evaluate 100-200 million positions per second. IBM also implemented a non-uniform software search algorithm capable of exploring some branches more deeply than others. The evaluation function was hand-crafted with the help of a grandmaster and took into account about 8'000 different features, such as rook traps, the bishop pair, wasted pawns, blocked opponents, mobility, pawn structure, and so on. The features were weighted dynamically, since the importance of most features depends on the stage of the game and the overall number of pieces on the board, as the toy example below illustrates. In addition, an opening book of about 4'000 positions was used, and an extended book of 700'000 grandmaster games was consulted by the evaluation algorithm to assign bonuses to moves already frequently played by grandmasters.
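
As a toy illustration, not Deep Blue's actual code, here is how a hand-crafted evaluation function with stage-dependent weights might look; the feature names, weight values, and the extract_features() helper are invented for the example.

```python
# Toy hand-crafted evaluation with stage-dependent weights (illustrative only).
MIDGAME_WEIGHTS = {"material": 1.00, "mobility": 0.10, "pawn_structure": 0.30,
                   "king_safety": 0.50, "bishop_pair": 0.25}
ENDGAME_WEIGHTS = {"material": 1.00, "mobility": 0.05, "pawn_structure": 0.60,
                   "king_safety": 0.10, "bishop_pair": 0.40}

def evaluate(position):
    features = extract_features(position)             # hypothetical feature extractor
    phase = position.remaining_material_fraction()    # 1.0 at the start, near 0.0 in the endgame
    score = 0.0
    for name, value in features.items():
        # Interpolate each weight between its midgame and endgame value
        weight = phase * MIDGAME_WEIGHTS[name] + (1 - phase) * ENDGAME_WEIGHTS[name]
        score += weight * value
    return score
```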

The final success of this solution (a 3.5-2.5 win against Kasparov) was the result of several factors:

  • specialized hardware,
  • hybrid software-hardware, non-uniform search based on Alpha-Beta pruning, and
  • complex hand-crafted and fine-tuned evaluation function.

Ever since then, computers have been better at playing chess than the best humans, always using the same mechanisms. Only the hardware has been commoditized, so that now even a normal CPU delivers sufficient computational performance.

Fast-forward to 2016 and 2017. First, AlphaGo (by Google's DeepMind) won against Lee Sedol, the human champion in the game of Go (Chinese: 围棋, wéiqí). Then came AlphaGo Zero, which learns to play the game entirely through self-play, without any human assistance, and which after only 36 hours of training surpassed the original AlphaGo champion software (trained over several months) and went on to beat it 100-0. Finally, AlphaZero disrupted the whole computer-chess industry by beating the best chess software after a trivially short training time.

So, how does this work?

AlphaGo Zero

First, about the hardware. 2017 was marked by another achievement in the development of artificial intelligence solutions: Google significantly improved its specialized chip for matrix multiplications, which are essential for deep learning workloads. The Tensor Processing Unit (TPU) uses a very different architecture from CPUs and GPUs, featuring a Matrix Multiplication Unit with 65'536 arithmetic units that share data directly with one another. This, in conjunction with quantization, which is essentially the conversion of floating-point numbers into integers, greatly accelerates inference with deep neural networks (NN), as the sketch below illustrates.
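
As a rough illustration of why this helps, here is a minimal NumPy sketch of 8-bit quantization: map float tensors to int8 with a shared scale factor, multiply in integer arithmetic, and rescale the result back to floats. A real TPU pipeline is far more involved; this only shows the principle.

```python
import numpy as np

def quantize(x):
    """Map a float tensor to int8 plus a single float scale factor."""
    scale = np.max(np.abs(x)) / 127.0
    return np.round(x / scale).astype(np.int8), scale

def quantized_matmul(a, b):
    qa, sa = quantize(a)
    qb, sb = quantize(b)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)    # integer multiply-accumulate
    return acc.astype(np.float32) * sa * sb            # rescale back to float

a = np.random.randn(4, 8).astype(np.float32)
b = np.random.randn(8, 3).astype(np.float32)
print(np.max(np.abs(quantized_matmul(a, b) - a @ b)))  # small approximation error
```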

Now about the algorithms and the software. In short, here is how it works:

First, a convolutional neural network (CNN) is instantiated with randomly initialized weights, taking a tensor of 19 x 19 x 17 units as its input. The 19 x 19 dimensions correspond to the board size, and 17 is the number of feature planes: the current and seven previous positions of each player's stones (16 planes), plus one plane indicating whose turn it is. This CNN has 20 residual blocks, each block consisting of 2 convolutional layers with their own normalization and activation functions. The residual (skip) connections help the gradients backpropagate and thus avoid the vanishing gradient problem. On top of this CNN “tower” sit two “heads”: one to predict the next move (the policy) and the other to predict the winner (the value). A sketch of this shape follows below.
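
Here is a minimal PyTorch sketch of that tower-plus-two-heads shape. It is an illustration based on the description above, not DeepMind's original implementation; the 256-filter width and the exact head layouts are assumptions for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)                    # skip connection eases gradient flow

class PolicyValueNet(nn.Module):
    def __init__(self, blocks=20, channels=256):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(17, channels, 3, padding=1),
                                  nn.BatchNorm2d(channels), nn.ReLU())
        self.tower = nn.Sequential(*[ResidualBlock(channels) for _ in range(blocks)])
        # Policy head: a probability over 19*19 board points plus the pass move
        self.policy_head = nn.Sequential(nn.Conv2d(channels, 2, 1), nn.Flatten(),
                                         nn.Linear(2 * 19 * 19, 19 * 19 + 1))
        # Value head: a single number in [-1, 1] predicting the winner
        self.value_head = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Flatten(),
                                        nn.Linear(19 * 19, 256), nn.ReLU(),
                                        nn.Linear(256, 1), nn.Tanh())

    def forward(self, x):                          # x: (batch, 17, 19, 19)
        h = self.tower(self.stem(x))
        return self.policy_head(h), self.value_head(h)
```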

The training of this network is driven by the Monte-Carlo Tree Search (MCTS) algorithm, which is based on random sampling and simulation. At the beginning of training, MCTS will propose very simple moves. When it reaches a move that has not been encountered in previous MCTS runs, it consults the neural network for an estimate. Early in the training process the neural network will propose very naive predictions, but using them, MCTS plays the game out to the end and memorizes the outcome. The outcome, together with the played positions, is then used to train the parameters of the network's layers. The weights are trained by gradient descent, optimizing the standard loss functions (mean-squared error for the value and cross-entropy for the policy, with L2 regularization), as sketched below. The weights are then updated, and the MCTS + NN prediction cycle starts again. I omit quite a lot of details here, especially on the MCTS hyperparameters such as the degree of exploration, the upper confidence bound, the use of noise, etc., but the original paper is listed in the sources below for anyone who would like to explore further.
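
Continuing the PyTorch sketch above, the combined objective might look like this; the function name alphazero_loss and the weight-decay value are assumptions, and the L2 term is applied via the optimizer rather than inside the loss function.

```python
import torch
import torch.nn.functional as F

def alphazero_loss(policy_logits, value, mcts_policy, outcome):
    # (z - v)^2: how well the value head predicted the final game outcome
    value_loss = F.mse_loss(value.squeeze(-1), outcome)
    # Cross-entropy between the MCTS visit-count distribution and the predicted policy
    policy_loss = -(mcts_policy * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    return value_loss + policy_loss

# L2 regularization (c * ||theta||^2) enters via the optimizer's weight decay, e.g.:
# optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
```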

For each move during training, MCTS performs 1'600 simulations, guided by the NN's own predictions, to select the best move; the selection rule used inside each simulation is sketched below.
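
This is a hedged illustration of the PUCT selection formula from the AlphaGo Zero paper; the Node fields (visit_count, value_sum, prior, children) and the exploration constant C_PUCT are assumed names and values, not the original implementation.

```python
import math

C_PUCT = 1.5   # degree of exploration (a tunable hyperparameter)

def select_child(node):
    """Pick the child move maximizing Q + U, balancing value and exploration."""
    total_visits = sum(child.visit_count for child in node.children.values())
    best_score, best_move = -float("inf"), None
    for move, child in node.children.items():
        q = child.value_sum / child.visit_count if child.visit_count else 0.0
        u = C_PUCT * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        if q + u > best_score:
            best_score, best_move = q + u, move
    return best_move
```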


After three hours of self-play training, AlphaGo Zero plays at the level of a human beginner. After 19 hours of training, it plays like a human master. After 36 hours, it beats the best existing Go-playing software, and after 70 hours of self-play training it discovers game strategies and positions previously unknown to human players.

From AlphaGo Zero to AlphaZero

And the latest achievement, the ultimate game-playing machine: AlphaZero. As of December 2017, Google's AlphaZero is undisputed in chess as well. Based on the Go-specialized implementation, it is now a generic algorithm that can also be applied to other deterministic games such as chess and shogi. The differences from AlphaGo Zero are minimal: the NN predicts the expected game outcome (including draws) rather than only a winning probability, the training data is no longer augmented by rotating and reflecting board positions, and a few similar adjustments. In other respects, AlphaZero is a very close sibling of AlphaGo Zero.

After only 4 hours of self-play training, it beat Stockfish, one of the strongest chess-playing programs today.

Conclusion

The King is dead, long live the King!

Alpha-Beta pruning, opening books, hand-crafted evaluation functions, etc. Yes, all of that was very good. But as of now, reinforcement learning based on deep neural networks and MCTS has become the undisputed winner in the field of deterministic games like chess and Go.

The promise of this technology for the future is huge: combining it with other, more traditional search algorithms will make a significant impact in many other domains of our civilization, from biotechnology and medicine to robotics and logistics.

And by learning from the past, we can be sure that new methods, and innovative combinations of old ones, will continue to emerge and impress the world. Definitely worth following.

Sasha Lazarevic,
January 2018

Sources:

DeepMind research paper, Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (2017)
DeepMind research paper, Mastering the game of Go without human knowledge (2017)
DeepMind research paper, Mastering the game of Go with deep neural networks and tree search (2016)
Google blog post, An in-depth look at Google's TPU (2017)
IBM research paper, Deep Blue (2002)