Playing with Monte-Carlo Tree Search

This interactive article provides an accessible explanation of how Monte-Carlo Tree Search (MCTS) works. Tic-Tac-Toe, Go and Sokoban are provided to show the procedure of MCTS.

Authors

  • Yunlong Zhao | Southern University of Science and Technology
  • Chengpeng Hu | Southern University of Science and Technology
  • Jialin Liu | Southern University of Science and Technology

Published

Dec. 1, 2023

"Click" indicates interactive elements.

Contents:

  I. Introduction
  II. Games
  III. Algorithm
  IV. Practice
  V. Conclusion

I. Introduction

Monte-Carlo Tree Search (MCTS) is a tree-based search policy that aims to achieve optimal performance in sequential decision problems such as games [1]. In a sequential decision problem, a single action taken by the agent affects not only the immediate outcome but also long-term future outcomes, and a series of actions is made over a time horizon. Because it requires little domain knowledge and is simple to implement, MCTS has achieved significant success in such problems, including Go [2], video games [3], and chemical synthesis [4].

MCTS leverages tree search to address situations where locally optimal actions may not lead to globally optimal performance. Its four core components are Selection, Expansion, Simulation, and Backpropagation. The basic procedure of MCTS builds a tree whose nodes represent visited states and repeatedly selects the node that appears most likely to lead to an optimal outcome. The selected node is then expanded according to the available action space, and a simulation is run from the newly added node. Once the simulation terminates, its result is propagated back to update the statistics of the nodes along the trajectory. The resulting game tree is used to identify the optimal action at each state. MCTS addresses the exploration-exploitation dilemma with selection strategies, such as the upper confidence bound (UCB) [5], that balance exploitation and exploration within a given computational budget.

II. Games

Two classic two-player board games, Tic-Tac-Toe and Go, and a single-player video game, Sokoban, are used as illustrative examples. Figure 1 shows screenshots of the three games.

Figure 1: Tic-Tac-Toe (left), Go (center) and Sokoban (right).

  • Tic-Tac-Toe is a two-player competitive board game with a limited length of gameplay. Players take turns placing their own marker (usually X or O) on a $3\times3$ grid, as presented in Figure 1 (left). A player wins once three of their markers are connected vertically, horizontally, or diagonally. If all squares are filled and neither player has won, the game ends in a draw. Due to the presence of an opponent, applying MCTS to this game involves min-max reasoning and opponent modelling.
  • Go is a two-player competitive board game, as illustrated in Figure 1 (center). Two players take turns placing black and white stones on the board, starting from an empty board with the black player moving first. If a player's stones encircle the opponent's stones, the surrounded stones are removed from the board. The objective of Go is to control more territory on the board than the opponent by strategically placing stones. A $7 \times 7$ board is used as the interactive example in this article, following the logical rules [6].
  • Sokoban is a one-player puzzle game in which the player must strategically push boxes to designated positions in order to win. The game is set on a fixed board made up of different tiles, as demonstrated in Figure 1 (right). The player can move through empty spaces and push boxes onto spaces that are not blocked by walls or other boxes. In this game, the state representation is more complex than in Tic-Tac-Toe, and the potentially unlimited number of steps poses a deep-exploration challenge.

III. Algorithm

Modelling

MCTS is typically used to solve sequential decision problems, which are often formulated as a Markov decision process (MDP). The four components of an MDP are the state space $S$, the action space $A$, the transition model $P$ and the reward function $R$. An MCTS policy $\pi$ takes an action in a given state $s$, aiming to maximise the long-term expected reward, and then obtains a reward $r$ and the next state of the environment according to the transition model. For example, in Tic-Tac-Toe, as shown in Figure 1 (left), the state is the game board, represented as a node in the game tree, and the placement of a marker by a player is an action taken by that player.
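To make the formulation concrete, the sketch below writes Tic-Tac-Toe as an MDP in Python, using the board encoding introduced later in this article (0 for an empty cell, 1 and -1 for the two players). The function names and the flattened-board type are illustrative assumptions, not the code behind the interactive demo.

    # A minimal sketch of the Tic-Tac-Toe MDP, assuming the encoding used in
    # this article: 0 = empty, 1 = first player, -1 = second player.
    from typing import List, Tuple

    State = Tuple[int, ...]            # the 3x3 board, flattened to 9 cells

    WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
                 (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

    def actions(s: State) -> List[int]:
        """Action space A(s): indices of empty cells where a marker may be placed."""
        return [i for i, v in enumerate(s) if v == 0]

    def transition(s: State, a: int, player: int) -> State:
        """Transition model P: placing a marker deterministically yields the next state."""
        board = list(s)
        board[a] = player
        return tuple(board)

    def reward(s: State) -> int:
        """Reward R: +1 if the first player has three in a row, -1 for the second, else 0."""
        for i, j, k in WIN_LINES:
            if s[i] != 0 and s[i] == s[j] == s[k]:
                return s[i]
        return 0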

The interactive interface is implemented based on [7]. The game interfaces and data representations of Tic-Tac-Toe, Go and Sokoban are presented below. You can activate a game by hovering your mouse pointer over its image.

In Tic-Tac-Toe, the game state is represented as a $3 \times 3$ matrix.

You can select whether you want to play as the first or second player by clicking the corresponding "Human" or "AI" button. The objective of the game is to connect three of your pieces in a row, column, or diagonal. The right panel presents the current state of the Tic-Tac-Toe game as a $3 \times 3$ matrix, where unoccupied cells are shown as empty spaces. Your pieces and those of the AI opponent are denoted by "1" and "-1", respectively.
Click "Reset Game" button to start a new game.
Figure 2: An introduction to Tic-Tac-Toe and its representation.

Go shares a similar representation structure with Tic-Tac-Toe, using a $7 \times 7$ matrix in which 1 represents a white stone, -1 represents a black stone, and 0 represents an empty intersection.

Click on the intersections of the lines to alternately place black and white stones. The objective of Go is to control more territory on the board than the opponent by strategically placing stones.
Click "Reset Game" button to start a new game.

Figure 3: An introduction to Go and its representation.

The Sokoban game is represented by an $n \times n$ matrix, where $n$ represents the size of the game board. The matrix uses 0, 1, 2, 3, and 5 to indicate empty spaces, walls, boxes, targets, and the player, respectively.
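As a concrete illustration of this encoding, the snippet below defines a small, hypothetical level; the layout is made up for illustration and is not one of the levels used in the demo.

    # A hypothetical 5x5 level in the matrix encoding described above:
    # 0 = empty, 1 = wall, 2 = box, 3 = target, 5 = player.
    level = [
        [1, 1, 1, 1, 1],
        [1, 5, 2, 3, 1],
        [1, 0, 0, 0, 1],
        [1, 0, 2, 3, 1],
        [1, 1, 1, 1, 1],
    ]
    # The player (5) must push each box (2) onto a target (3); walls (1) block movement.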

Click the game panel and use the W, S, A, D, or arrow keys on your keyboard to move the avatar. The objective is to push all the boxes onto the green points (targets) to win the game.
Click "Reset Game" button to start a new game.

Figure 4: An introduction to Sokoban and its representation.



Figure 5: Main steps of MCTS (Figure 2 from [1]). Nodes and branches represent states and actions, respectively. A node typically stores statistics such as its win and visit counts.

MCTS follows the workflow illustrated in Figure 5, which involves four main steps: Selection, Expansion, Simulation, and Backpropagation. Each step is discussed in detail in the following subsections, and interactive components are provided to demonstrate them intuitively.

Selection

Starting from the root node, which usually represents the current state, MCTS traverses the tree to identify the node that is expected to lead to an optimal solution, as estimated by the selection strategy. In doing so, MCTS faces the well-known "exploration-exploitation dilemma": whether to choose the action that currently appears optimal, or to explore less-visited parts of the tree that may lead to better solutions. Three selection strategies are described as follows.

  • A random strategy selects a child node uniformly at random.
  • An $\epsilon$-greedy strategy selects a node uniformly at random with probability $\epsilon$; otherwise, with probability $1-\epsilon$, it chooses the node with the highest average reward.
  • The upper confidence bound (UCB) [5] accounts for the uncertainty of the expected reward of an action. The average reward and visit count of each action form its upper confidence bound, as given in Eq. (1).

$$ UCB = x_i + c\sqrt{\frac{\ln{n}}{n_i}} \qquad (1)$$ where $x_i$ is the average reward of action $i$, $c$ is the exploration coefficient, $n$ is the total visit count of the parent node and $n_i$ is the number of times action $i$ has been selected. The node with the highest UCB value is selected. UCB is the most commonly used selection strategy in MCTS; the tree search that uses it is known as Upper Confidence Bound for Trees (UCT). It balances exploration and exploitation by considering both the reward and the visit count of a node: $x_i$ summarises the experience collected so far, e.g., how good the action has been, while $c\sqrt{\frac{\ln{n}}{n_i}}$ accounts for the visit frequency. A less frequently visited node receives a larger bonus and is therefore more likely to be chosen, so that more of the unexplored area is visited. The coefficient $c$ controls the importance of exploration.
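The snippet below is a minimal, direct translation of Eq. (1) into Python, assuming only per-action totals (accumulated reward and visit counts) rather than any particular tree data structure; the numbers in the worked example are made up.

    import math

    def ucb(total_reward: float, n_i: int, n: int, c: float = 1.4) -> float:
        """UCB value of action i: average reward plus an exploration bonus."""
        if n_i == 0:
            return float("inf")        # unvisited actions are tried first
        x_i = total_reward / n_i       # averaged reward of action i
        return x_i + c * math.sqrt(math.log(n) / n_i)

    # Example with n = 12 total visits: action A has reward 3 over 10 visits,
    # action B has reward 1 over 2 visits.
    #   ucb(3, 10, 12) ~ 0.30 + 0.70 = 1.00
    #   ucb(1, 2, 12)  ~ 0.50 + 1.56 = 2.06
    # The rarely visited action B gets the larger bonus and is selected.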

The Selection algorithm is shown below.

              \begin{algorithm}
              \caption{Selection}
              \begin{algorithmic}
              \PROCEDURE{Select}{$root$}
              \STATE Set $node$ $\gets$ $root$
              \WHILE {$node$ is not leaf and $node$ is not fully explored}
              \STATE $node$ $\gets$ Node with the highest UCB value (Eq. (1)) among $node.children$
              \ENDWHILE
              \STATE \Return $node$
              \ENDPROCEDURE
              \end{algorithmic}
              \end{algorithm}
            

You can try out the Selection procedure in the interactive panel below.

Click the "Random board" button to randomly generate a Tic-Tac-Toe game board.
Click the "Continue" button for the Selection step in the MCTS.
The node currently selected is highlighted with a red border.
In the tree panel, you can drag the canvas, scroll the wheel the zoom in and out. When moving the cursor to the tree node, a detailed panel will be shown in the upper left area.






Figure 6: Procedure of Selection.

Expansion

Expansion is the process of exploring new states and adding new nodes to the tree. An expandable node is one that still has untried actions. When a node is selected for Expansion, the algorithm creates new child nodes based on the available actions. If the selected node has no untried actions, the selection strategy continues searching until an expandable or terminal node is found. The number of nodes expanded per iteration usually depends on the available action space and the computational budget.

The Expansion algorithm is shown below.

              \begin{algorithm}
              \caption{Expansion}
              \begin{algorithmic}
              \PROCEDURE{Expand}{node}
              \STATE Sample $a$ from unvisited actions of $node$
              \STATE $new\_node$ $\gets$ $node$ after taking action $a$
              \STATE $node.children.append(new\_node)$
              \STATE \Return $new\_node$
              \ENDPROCEDURE
              \end{algorithmic}
              \end{algorithm}
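For concreteness, the sketch below mirrors the Expansion pseudocode in Python. The Node container and its field names are illustrative assumptions; `transition` is any function mapping a state and an action to the next state, and `actions` lists the legal actions of a state.

    import random
    from dataclasses import dataclass, field
    from typing import Any, List, Optional

    @dataclass
    class Node:
        state: Any
        parent: Optional["Node"] = None
        action: Optional[Any] = None
        children: List["Node"] = field(default_factory=list)
        untried: List[Any] = field(default_factory=list)   # actions not expanded yet
        visits: int = 0
        value: float = 0.0

    def expand(node: Node, transition, actions) -> Node:
        """Pop an unvisited action, apply it, and attach the resulting child node.

        Assumes the node still has untried actions, as guaranteed by Selection.
        """
        a = node.untried.pop(random.randrange(len(node.untried)))
        next_state = transition(node.state, a)
        child = Node(state=next_state, parent=node, action=a,
                     untried=actions(next_state))
        node.children.append(child)
        return child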
            

You can test the Expansion algorithm in the interactive panel below.

Click the "Random board" button to randomly generate a Tic-Tac-Toe game board.
Click "Continue" button for the Expansion step in the MCTS.
The selected node to be expanded is highlighted with a pink border. The expanded node is highlighted with a green border. In the tree panel, you can drag the canvas, scroll the wheel the zoom in and out. When moving the cursor to the tree node, a detailed panel will be shown in the upper left area.






Figure 7: Procedure of Expansion.

Simulation

Simulation is a crucial component of the MCTS algorithm: it rolls out gameplays from the expanded node. The choice of Simulation policy has a large impact on the final result. A common choice is a random policy, which, owing to its simplicity, selects actions uniformly at random at successive states. The Simulation continues until a terminal state is reached. The results of the simulation, such as the final state or a reward value, are then collected and used to update the values of the node. However, the random policy can be highly inefficient, which has led to the development of more advanced techniques such as rule-based simulation and learning-based methods [4][8].

The Simulation algorithm is shown below.

              \begin{algorithm}
              \caption{Simulation}
              \begin{algorithmic}
              \PROCEDURE{Simulate}{node}
              \STATE $r \gets 0$
              \WHILE{not terminated}
                \STATE Sample $a$ from available action space of $node$
                \STATE $new\_node$ $\gets$ $node$ after taking action $a$
                \STATE $r \gets r+evaluate(new\_node)$
                \STATE $node$ $\gets$ $new\_node$
              \ENDWHILE
              \STATE \Return $r$
              \ENDPROCEDURE
              \end{algorithmic}
              \end{algorithm}
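The sketch below is a minimal instance of such a random-policy Simulation for Tic-Tac-Toe, using the board encoding introduced earlier (0 empty, 1 and -1 for the two players); the exact evaluation used in the interactive demo may differ.

    import random

    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

    def rollout(board, player):
        """Random-policy simulation: returns +1 / -1 for a win, 0 for a draw."""
        board = list(board)
        while True:
            for i, j, k in LINES:
                if board[i] != 0 and board[i] == board[j] == board[k]:
                    return board[i]                      # somebody has three in a row
            empty = [i for i, v in enumerate(board) if v == 0]
            if not empty:
                return 0                                 # full board, no winner: draw
            board[random.choice(empty)] = player         # random policy
            player = -player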
            

You can try out the Simulation process in the interactive panel below.

Click the "Random board" button to randomly generate a Tic-Tac-Toe game board.
Click "Continue" button for the Simulation step in the MCTS.
The node being simulated is connected with a dotted line. In the tree panel, you can drag the canvas, scroll the wheel the zoom in and out. When moving the cursor to the tree node, a detailed panel will be shown in the upper left area.









Figure 8: Procedure of Simulation.

Backpropagation

During the Backpropagation phase of MCTS, the statistics gathered during Simulation are propagated back up the tree along the path from the expanded node to the root. This involves updating the visit counts and values, such as rewards and wins, of all nodes on that path. Consider Tic-Tac-Toe, where players A and B take turns making moves. When MCTS determines the next move for player A, it seeks the action that leads to a win with the highest probability from the current state. The tree is built with alternating layers representing the two players, so some expandable states are ones in which player B moves next. If a simulation of future moves ends with A winning, the visit count of every node on the visited path is incremented by one. Notably, since Tic-Tac-Toe is a two-player competitive game, the value update differs from the single-player case: if player A wins the simulated game, the win counts of nodes belonging to player A are incremented by one, while those of nodes belonging to player B are decremented. All updates are limited to the nodes on the path from the node expanded for the Simulation back to the root.

The Backpropagation algorithm is shown below.

              \begin{algorithm}
              \caption{Backpropagation}
              \begin{algorithmic}
              \PROCEDURE{Backpropagation}{$node$, $value$}

                \STATE $node.visit$ $\gets$ $node.visit+1$

                \IF{node is root}
                  \RETURN
                \ENDIF

                \STATE $node.value$ $\gets$ $node.value+value$
                \STATE Backpropagation(node.parent, value)

              \ENDPROCEDURE
              \end{algorithmic}
              \end{algorithm}
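Because the pseudocode above leaves the two-player sign handling implicit, the sketch below spells out the update described in the text: visit counts increase along the whole path, while the win statistic is incremented for nodes of the winning player and decremented for nodes of the losing player. The node fields (player, wins, visits, parent) are illustrative assumptions, not the demo's exact data structure.

    def backpropagate(node, winning_player):
        while node is not None:
            node.visits += 1
            if winning_player == node.player:        # this node belongs to the winner
                node.wins += 1
            elif winning_player == -node.player:     # this node belongs to the loser
                node.wins -= 1
            # a draw (winning_player == 0) changes only the visit count
            node = node.parent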
            

You can test the backpropagation algorithm in the interactive panel below.

Click the "Random board" button to randomly generate a Tic-Tac-Toe game board.
Click "Continue" button for the backpropagation step in MCTS.
The node being back propagated is highlighted with a blue border. In the tree panel, you can drag the canvas, scroll the wheel the zoom in and out. When moving the cursor to the tree node, a detailed panel will be shown in the upper left area.






Figure 9: Procedure of Backpropagation.

By integrating the four essential components of Selection, Expansion, Simulation, and Backpropagation, a succinct and cohesive algorithm for the complete MCTS process can be constructed. Finally, the child node with the highest visit count is selected as the best move, since it has been selected and evaluated most often during tree growth, which tends to yield robust performance.

              \begin{algorithm}
              \caption{MCTS}
              \begin{algorithmic}
              \PROCEDURE{Mcts}{$root$}
              \WHILE{not end\_condition}
                \STATE $node$ $\gets$ Select($root$)
                \STATE $node$ $\gets$ Expand($node$)
                \STATE $value$ $\gets$ Simulate($node$)
                \STATE Backpropagation($node$, $value$)
              \ENDWHILE
              \STATE \Return Node with the highest visit count in $root.children$
              \ENDPROCEDURE
              \end{algorithmic}
              \end{algorithm}
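Putting the four steps together, the following self-contained Python sketch runs the full MCTS loop on Tic-Tac-Toe with a random simulation policy and the UCB rule from Eq. (1). The reward scheme (1 for a win, 0.5 for a draw, 0 for a loss) and the exploration coefficient are illustrative choices, not the exact settings of the interactive demo.

    import math
    import random

    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

    def winner(board):
        for i, j, k in LINES:
            if board[i] != 0 and board[i] == board[j] == board[k]:
                return board[i]
        return 0

    def legal_actions(board):
        return [i for i, v in enumerate(board) if v == 0]

    def terminal(board):
        return winner(board) != 0 or not legal_actions(board)

    class Node:
        def __init__(self, board, player, parent=None, action=None):
            self.board, self.player = board, player      # player to move at this node
            self.parent, self.action = parent, action
            self.children, self.visits, self.value = [], 0, 0.0
            self.untried = [] if terminal(board) else legal_actions(board)

    def ucb(child, c=1.4):                               # Eq. (1)
        return (child.value / child.visits
                + c * math.sqrt(math.log(child.parent.visits) / child.visits))

    def select(node):                                    # Selection
        while not node.untried and node.children:
            node = max(node.children, key=ucb)
        return node

    def expand(node):                                    # Expansion
        if not node.untried:
            return node                                  # terminal node: nothing to add
        a = node.untried.pop(random.randrange(len(node.untried)))
        board = list(node.board)
        board[a] = node.player
        child = Node(board, -node.player, parent=node, action=a)
        node.children.append(child)
        return child

    def simulate(node):                                  # Simulation (random policy)
        board, player = list(node.board), node.player
        while not terminal(board):
            a = random.choice(legal_actions(board))
            board[a] = player
            player = -player
        return winner(board)                             # +1, -1 or 0 (draw)

    def backpropagate(node, result):                     # Backpropagation
        while node is not None:
            node.visits += 1
            # credit the node from the viewpoint of the player who just moved into it
            mover = -node.player
            node.value += 1.0 if result == mover else 0.5 if result == 0 else 0.0
            node = node.parent

    def mcts(board, player, iterations=1000):
        root = Node(tuple(board), player)
        for _ in range(iterations):
            node = select(root)
            node = expand(node)
            result = simulate(node)
            backpropagate(node, result)
        return max(root.children, key=lambda ch: ch.visits).action

    if __name__ == "__main__":
        print("Suggested opening move:", mcts([0] * 9, player=1, iterations=2000))

As in the pseudocode, the move finally returned is the root child with the highest visit count rather than the highest average value, which tends to be the more robust choice.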
            

An example of a good move

In the depicted game-state, the AI agent is in a disadvantageous position. Can it find a good move?
Try to figure out which action the AI agent will select: A, B, C or D?

Figure 10: An interactive evaluation of a Tic-Tac-Toe game state.

IV. Practice

MCTS is a powerful algorithm that can effectively handle various types of games, including adversarial games such as Tic-Tac-Toe and Go, as well as puzzle games such as Sokoban. In the following sections, we will delve into how MCTS works in these three games and provide interactive components to illustrate its effectiveness.

For the Tic-Tac-Toe game, choose "Human" or "AI" to decide who goes first. During the player's turn, click on the highlighted area of the board to make a move. During the AI's turn, you can select one of three decision-making modes for the AI: random, greedy, or MCTS. For MCTS, you can slide the slider to select the number of iterations; the higher the number of iterations, the better the decisions the AI tends to make. In the MCTS panel on the right, choose "Next step" to view the MCTS steps one at a time, or select "Next iteration" to see the situation after each iteration of Selection, Expansion, Simulation, and Backpropagation. The tree displays the current MCTS information. The possible future actions of the opponent are estimated during the search, which is a case of opponent modelling. You can also select "Visualise last step" to jump directly to the final step and each node's value. Finally, click "Make play" to take the AI's action.

Select the first player by clicking "Human" or "AI". Detailed information about MCTS will be shown in the panel below. Click "Next step" and "Next iteration" to step through the procedure of MCTS. Click "Visualise last step" to show the final tree. Click "Make play" to take the AI's action.
In the tree panel, you can drag the canvas and scroll the mouse wheel to zoom in and out. When you move the cursor over a tree node, a detailed panel is shown in the upper-left area.

Figure 11: MCTS for Tic-Tac-Toe.

In Sokoban, selecting "Auto Play" lets MCTS run automatically; at each step, the search tree selects the best action it has found. Alternatively, you can use "Make MCTS Move" to view each step of the search in detail in the right panel. You can also play manually by moving the mouse over the game panel and using the W, A, S, D or arrow keys.

Detailed information about MCTS will be shown in the panel below. Click "Next step" and "Next iteration" to step through the procedure of MCTS. Click "Visualise last step" to show the final tree. Click "Make play" to take the AI's action.
Notably, the human player can also use the W, S, A, D or arrow keys on the keyboard to move the avatar after the AI's moves.
In the tree panel, you can drag the canvas and scroll the mouse wheel to zoom in and out. When you move the cursor over a tree node, a detailed panel is shown in the upper-left area.






Figure 12: MCTS for Sokoban.

An interactive example of Go is shown below. For simplicity, we use a $7\times7$ board and follow the logical rules [6]: if a connected group of stones of one colour does not reach any empty point, it is removed from the board. The goal of the game is to cover more area than the opponent. In this example, the AI plays black stones and the human player plays white stones. The AI can either play randomly or use MCTS to make its moves. During the construction of the search tree, the AI performs opponent modelling, estimating the potential effects of the opponent's decisions on the progression of the game.

By adjusting the number of iterations in MCTS, the AI's playing strength changes accordingly. Notably, running MCTS with 3,000 iterations, which offers the strongest play, may take around 10 seconds.
In the tree panel, you can drag the canvas and scroll the mouse wheel to zoom in and out. When you move the cursor over a tree node, a detailed panel is shown in the upper-left area.






Figure 13: MCTS for Go.

V. Conclusion

In summary, this article provides a comprehensive explanation of the basic procedure of MCTS. Its four main components, Selection, Expansion, Simulation, and Backpropagation, are described in detail. To demonstrate the workflow of MCTS in both two-player and single-player cases, three classic games, Tic-Tac-Toe, Go and Sokoban, are used. MCTS has been successfully applied to game AI, such as Go [2] and video games [3]. Beyond these game-based environments, MCTS has also demonstrated its capability in black-box optimisation [9], robotics [10] and real-world scheduling problems [11]. Readers are referred to [1,2] for more about MCTS, its variants and applications.

References

  1. A survey of monte carlo tree search methods Browne, C.B., Powley, E., Whitehouse, D., Lucas, S.M., Cowling, P.I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S. and Colton, S., 2012. IEEE Transactions on Computational Intelligence and AI in Games, vol. 4(1), pp. 1--43. IEEE.
  2. Mastering the game of Go without human knowledge Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., Driessche, G.v.d., Graepel, T. and Hassabis, D., 2017. Nature, vol. 550(7676), pp. 354--359. Nature Publishing Group.
  3. General video game AI: Competition, challenges and opportunities Perez-Liebana, D., Samothrakis, S., Togelius, J., Schaul, T. and Lucas, S.M., 2016. Thirtieth AAAI Conference on Artificial Intelligence, pp. 1--3.
  4. Planning chemical syntheses with deep neural networks and symbolic AI Segler, M.H., Preuss, M. and Waller, M.P., 2018. Nature, vol. 555(7698), pp. 604--610. Nature Publishing Group.
  5. Bandit based Monte-carlo planning Kocsis, L. and Szepesvari, C., 2006. European conference on Machine Learning, pp. 282--293.
  6. The game of Go http://tromp.github.io/go.html
  7. Visualization of MCTS algorithm applied to Tic-tac-toe vgarciasc. https://github.com/vgarciasc/mcts-viz/
  8. Knowledge-based fast evolutionary MCTS for general video game playing Perez, D., Samothrakis, S. and Lucas, S., 2014. 2014 IEEE Conference on Computational Intelligence and Games, pp. 1--8.
  9. Learning search space partition for black-box optimization using monte carlo tree search Wang, L., Fonseca, R. and Tian, Y., 2020. Advances in Neural Information Processing Systems, vol. 33, pp. 19511--19522.
  10. Dec-MCTS: Decentralized planning for multi-robot active perception Best, G., Cliff, O.M., Patten, T., Mettu, R.R. and Fitch, R., 2019. The International Journal of Robotics Research, vol. 38(2-3), pp. 316--337.
  11. An effective MCTS-based algorithm for minimizing makespan in dynamic flexible job shop scheduling problem Li, K., Deng, Q., Zhang, L., Fan, Q., Gong, G. and Ding, S., 2021. Computers & Industrial Engineering, vol. 155, pp. 10721.