Artificial intelligence has achieved human-level and, in some cases, superhuman performance in a variety of contexts such as image recognition and natural language processing. Yet the most famous achievements in artificial intelligence have occurred in board games. Events like IBM's Deep Blue beating chess grandmaster Garry Kasparov and DeepMind's AlphaGo beating world-class Go player Lee Sedol captured the public imagination about the future of artificial intelligence. These applications of cutting-edge methods in AI seem far removed from economics.

However, Igami (2017) argues that the algorithms underlying Deep Blue, Bonanza, and AlphaGo are equivalent to econometric methods used to estimate dynamic structural models. For instance, Deep Blue can be interpreted as an approximate value function combined with numerical backward induction on a truncated game tree. Deep Blue's approximate value function was hand-tuned by computer scientists and grandmaster chess players, so Igami (2017) labels it a "calibrated value function." Bonanza, a computer program that sparked the AI community's interest in the Japanese board game Shogi, instead estimates an approximate value function from a large database of games and uses that value function to determine the optimal action. In particular, Igami (2017) argues that the procedure Bonanza follows (logistic regression to estimate a value function, then backward induction on a truncated game tree) is analogous to Rust (1987)'s nested fixed point algorithm. The only difference is that Rust (1987) considers an infinite-horizon problem and therefore uses value function iteration, rather than backward induction, to determine the optimal action.
Finally, AlphaGo consists of four distinct components. First, it uses a policy network, a deep neural network trained to predict the play of professional players; this is simply a policy function estimated from a large database of Go games. Second, it uses a value network, another deep neural network estimated on a synthetic dataset of Go games generated by having the policy network play itself. Third, reinforcement learning techniques are used to improve the play of the policy network and the accuracy of the value network. Finally, the strategy actually played by AlphaGo is an ensemble that combines the recommended strategy from the policy network, the implied optimal strategy of the value network, and a simulated optimal strategy produced by a Monte Carlo tree search algorithm. Igami (2017) notes that AlphaGo's flexible approximation of the true policy function via the policy network can be related to Hotz and Miller's (1993) two-step method for the estimation of conditional choice probabilities. Moreover, AlphaGo's use of the policy network to estimate the value network is analogous to Hotz, Miller, Sanders, and Smith's (1994) use of conditional choice simulation for the estimation of structural parameters.
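The analogy can be made concrete with a toy two-step sketch in the spirit of Hotz and Miller (1993): first estimate conditional choice probabilities from observed play (the "policy network" step), then simulate forward play under the estimated policy to approximate values (the "value network" step). Everything below, including the data-generating process, the binary action space, and the reward function, is a hypothetical illustration, not AlphaGo's or Igami (2017)'s actual procedure:

```python
import math
import random

def fit_logit(data, lr=0.1, epochs=200):
    """Step 1: estimate conditional choice probabilities by logistic
    regression, P(action = 1 | state) = sigmoid(a + b * state),
    fit to observed (state, action) pairs by stochastic gradient ascent."""
    a, b = 0.0, 0.0
    for _ in range(epochs):
        for s, y in data:
            p = 1.0 / (1.0 + math.exp(-(a + b * s)))
            a += lr * (y - p)
            b += lr * (y - p) * s
    return a, b

def simulated_value(state, params, reward, horizon=10, n_sims=500, rng=None):
    """Step 2: approximate the value of `state` by averaging simulated
    payoffs from forward play under the estimated choice probabilities,
    the way self-play under the policy generates value-network targets."""
    rng = rng or random.Random(0)
    a, b = params
    total = 0.0
    for _ in range(n_sims):
        s, payoff = state, 0.0
        for _ in range(horizon):
            p = 1.0 / (1.0 + math.exp(-(a + b * s)))
            action = 1 if rng.random() < p else 0
            payoff += reward(s, action)
            s += 1 if action else -1  # action 1 moves the state up
        total += payoff
    return total / n_sims
```

A usage example under this hypothetical setup: fit the policy on data where action 1 is observed in high states, then evaluate `simulated_value(0, params, lambda s, a: float(a))`. The point of the two-step structure is the one made above: the conditional choice probabilities are estimated flexibly first, and values are then recovered by simulation rather than by solving the full dynamic program.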
Taken together, these parallels suggest that economists can draw upon cutting-edge techniques from applications of AI that appear, at first glance, unrelated to economics. It turns out that under the hood of algorithms such as Deep Blue, Bonanza, and AlphaGo are concepts that are familiar to economists.
alphago reinforcement learning structural estimation
Reviewed by Ashesh on .