Research paper
Nash Q-learning agents in Hotelling’s model: Reestablishing equilibrium

https://doi.org/10.1016/j.cnsns.2021.105805

Highlights

  • An agent-based simulation under Hotelling’s model setting is conducted.

  • Two agents use the Nash Q-learning mechanism for adaptation.

  • Under a quadratic consumer cost function, agents learn an aggressive market strategy.

  • The Principle of Minimum Differentiation can be justified based on repeated interactions.

Abstract

This paper examines adaptive agents’ behavior in a stochastic dynamic version of Hotelling’s location model. We conduct an agent-based numerical simulation under Hotelling’s setting with two agents who use the Nash Q-learning mechanism for adaptation. This allows us to explore what alterations this technique brings compared to the original analytic solution of the famous static game-theoretic model with strong assumptions imposed on players. We discover that under Nash Q-learning and a quadratic consumer cost function, agents with a high enough valuation of future profits learn behavior similar to an aggressive market strategy. Both agents make similar products and engage in a price war to eliminate their opponent from the market. This behavior closely resembles the Principle of Minimum Differentiation from Hotelling’s original paper with linear consumer costs, even though the quadratic consumer cost function would result in maximum differentiation of production in the original model. Thus, the Principle of Minimum Differentiation can be justified based on repeated interactions of the agents and long-run optimization.

Introduction

Some of the most influential economic models stand on game theory. Game-theoretic concepts have been famously used in the past to show that in a society of self-interested individuals, the tragedy of the commons arises [15], or that non-dictatorial voting methods are subject to strategic voting [7]. Game theory allows us to formulate and analyze problems that involve decision making in competitive or cooperative environments and offers solution concepts such as the Nash equilibrium. Based on game-theoretic models, we can draw conditional conclusions about the behavior of real economic actors. Such models often rely on strong assumptions such as agents’ perfect rationality and complete and perfect information. However, humans are neither perfectly rational nor do they have perfect and complete information available. Thus, in addition to finding the Nash equilibrium in games, we should in the first place ask how, and whether at all, boundedly rational agents or agents without perfect information come to play the Nash equilibrium.

Experimental economics provides a few studies that partly tackle these issues. In the experiments presented there, participants played certain games multiple times. It has been shown that over time most of the experiment participants got closer to the Nash equilibrium in the Beauty Contest game [17], and similar conclusions were obtained for bargaining games [22]. These results indicate that some underlying adaptive processes could, at least in some games, enable agents to converge towards the Nash equilibrium during repeated interactions. Recently, [2] built a dynamical model of experimental oligopoly games with the Cournot-Nash outcome as a stationary state of the model with two types of agents: adaptive agents that adjust their behavior to increase their profit and agents with imitative behavior. The authors suggest that their model is capable of qualitatively reproducing the outcomes of experimental oligopoly games.

With the rise in modern computers’ operational capacity, new techniques for analyzing economic systems have emerged. For instance, using numerical simulations, [21] study a Cournot duopoly model with heterogeneous competitors, applying bifurcation analysis and further analyzing the stability switching curves. The authors suggest stability conditions for the unique Nash equilibrium and draw conclusions about the stability of the economy. Another such technique is ‘agent-based simulation’, which consists of software agents placed in a virtual environment and of the environment itself. Agents interact with each other and/or with the environment, and from their micro behavior a global behavioral pattern can emerge. The rules that guide the agents’ behavior range from simple heuristics to more complex, possibly adaptive ones. For example, Waltman and Kaymak [24] use Q-learning to model firms in a repeated Cournot oligopoly game, and [12] study differentiated market dynamics for agents imitating the behavior of more successful agents. Nagel and Vriend [18] apply learning direction theory to analyze agents in an oligopolistic environment with restricted information, and Golman and Page [8] study basins of attraction and equilibrium selection under different learning rules. Nakov and Nuño [19] use mechanisms similar to reinforcement learning to simulate learning of economic agents on stock markets, and Lahkar and Seymour [13] apply reinforcement learning to show that agents in a population game revise mixed strategies. An overview of learning methods can be found in [3], [6]. This paper analyzes a learning method inspired by reinforcement learning called Nash Q-learning [11].

Utilizing an agent-based simulation and the reinforcement learning methodology, we explore how adaptive agents without perfect information behave in Hotelling’s location model [10] with quadratic consumer cost functions. A comparative analysis between the theoretical findings and the results of the agent-based simulation is provided. Additionally, we evaluate the suitability of reinforcement learning for use in economic agent-based simulations and compare it to other learning methods.

Hotelling’s location model is a microeconomic model presented by Harold Hotelling in 1929. The author found that two rational producers in the same market should make their products as similar as possible [10]. This phenomenon is called the Principle of Minimum Differentiation. Nevertheless, it has been shown that Hotelling’s conclusions regarding minimum differentiation are invalid and that, based on Hotelling’s argumentation, “nothing can be said about the tendency of both sellers to agglomerate at the center of the market” [4, p. 1145]. Slightly modified versions of the location model with different consumer cost functions have been proposed, under which the Principle of Maximum Differentiation [4] and the Principle of General Differentiation [5] can be justified. However, Hanaki et al. [9] analytically and numerically show that for $n$, $n \geq 2$, boundedly rational players following a myopic best-reply strategy, the players spend most of the time around the center of Hotelling’s street, which could re-establish Hotelling’s Principle of Minimum Differentiation. Similarly, Matsumura et al. [16] show that minimum differentiation could be realized with evolutionary dynamics. Also, according to Bester et al. [1, p. 165], there are infinitely many mixed strategies in Hotelling’s location game, in which “coordination failure invalidates the Principle of ‘Maximum Differentiation’ and firms may even locate at the same point”.

The simulation contains two self-interested agents competing in a location model framework. The agents have no previous knowledge of the game or of their opponent. The agents’ main challenge is to communicate their respective preferences through mutual interaction, learn about the game pay-offs, and try to find the Nash equilibrium strategy profile of the game. In every round of the simulation, agents have to choose what direction to move (location change) and what price to charge (price change). After taking their actions, they receive information about their opponent’s action and thus also about the current state of the game. They also receive positive or negative feedback based on how well they played in that particular round. The feedback is constructed in compliance with Hotelling’s profit function. Agents can see their own as well as their opponent’s profit. The Nash Q-learning algorithm by Hu and Wellman [11] is used to guide our agents’ adaptive behavior. Since Hotelling’s location model contains convenient symmetries, agents learn not only from their own experience but also from their opponent’s experience. That is, both agents model their opponent as if they were the opponent.
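For concreteness, the following minimal Python sketch illustrates the core of the Nash Q-learning update of Hu and Wellman [11] in a two-player setting. It is not the paper’s actual implementation: the state and action encoding, the hyperparameter values, and the restriction to pure-strategy stage-game equilibria are our illustrative assumptions.

    import numpy as np
    from collections import defaultdict

    class NashQAgent:
        """Minimal two-player Nash Q-learning sketch (after Hu and Wellman [11]).

        The agent keeps Q-tables for BOTH players over joint actions and updates
        them with the value of a stage-game Nash equilibrium in the next state.
        For simplicity, a pure-strategy equilibrium is selected (an assumption;
        the original algorithm allows mixed equilibria)."""

        def __init__(self, n_actions, alpha=0.1, gamma=0.9):
            self.n_actions = n_actions
            self.alpha, self.gamma = alpha, gamma
            # Q[i][state][a0, a1] is player i's estimated value when player 0
            # plays action a0 and player 1 plays action a1.
            self.Q = [defaultdict(lambda: np.zeros((n_actions, n_actions)))
                      for _ in range(2)]

        def _pure_nash_value(self, state):
            """Payoff pair of some pure-strategy Nash equilibrium of the stage
            game defined by the current Q-values in `state` (zeros if none)."""
            Q0, Q1 = self.Q[0][state], self.Q[1][state]
            for a0 in range(self.n_actions):
                for a1 in range(self.n_actions):
                    if Q0[a0, a1] >= Q0[:, a1].max() and Q1[a0, a1] >= Q1[a0, :].max():
                        return Q0[a0, a1], Q1[a0, a1]
            return 0.0, 0.0

        def update(self, state, a0, a1, rewards, next_state):
            """Nash Q step: Q_i <- (1 - alpha) Q_i + alpha (r_i + gamma NashQ_i)."""
            nash_values = self._pure_nash_value(next_state)
            for i, nash_val in enumerate(nash_values):
                q = self.Q[i][state]
                q[a0, a1] = (1 - self.alpha) * q[a0, a1] \
                    + self.alpha * (rewards[i] + self.gamma * nash_val)

Because both players’ Q-tables are maintained by each agent, the symmetry mentioned above (learning from the opponent’s experience) amounts to feeding the opponent’s observed reward into the opponent’s table in the same update.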

The paper proceeds as follows. Section 1 provides details of Hotelling’s location model. In Section 2, we theoretically discuss the learning methods. In Section 3, we describe technical details of the implementation, and Section 4 summarizes and interprets the important results of our simulation-based analysis. The penultimate Section 5 addresses several technical issues of our pioneering approach that might introduce open questions for future research. Finally, in Section 6, we conclude the paper with a summary of the crucial findings. The complete code is available on GitHub.

Section snippets

Hotelling’s location model

The location model by Hotelling [10] introduces a strategic game between two producers. The game can be divided into two stages. In the second stage (short-run), producers compete in prices given a fixed pair of locations $c_1, c_2 \in [a, b]$. In the first stage (long-run), producers compete in locations, given that prices instantly adjust to the Nash equilibrium. Then a combination of prices and locations $(p_l^*, p_r^*, c_l^*, c_r^*)$ is a pure strategy Nash equilibrium if $p_l^*, p_r^*$ are the Nash equilibrium
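For reference, the short-run (price) stage under quadratic consumer costs can be written down as follows; the normalization of the street to $[0,1]$, the uniform unit density of consumers, the zero production costs, and the transport cost parameter $t$ are standard textbook assumptions used here for illustration and need not match the paper’s exact notation. A consumer at location $x$ buys from the producer with the lower total cost of price plus quadratic transport cost, so with producers located at $c_1 < c_2$ and charging prices $p_1, p_2$, the indifferent consumer $\bar{x}$ solves

    p_1 + t(\bar{x} - c_1)^2 = p_2 + t(\bar{x} - c_2)^2
    \quad\Longrightarrow\quad
    \bar{x} = \frac{c_1 + c_2}{2} + \frac{p_2 - p_1}{2t(c_2 - c_1)},

and the short-run profits are $\pi_1 = p_1 \bar{x}$ and $\pi_2 = p_2 (1 - \bar{x})$. The per-round feedback described above is constructed in compliance with profit functions of this form.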

Reinforcement learning

Many times, humans and animals alike have no explicit teacher who would show them how to act in an unknown environment. Nevertheless, by experimenting with the available actions and recognizing how those actions influence the world they perceive, humans and animals are able to form conclusions about cause and effect, about the consequences of actions, and about what to do in order to achieve their goals [23]. According to the authors, agents in a reinforcement learning setting learn how to choose
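This general idea can be made concrete with the standard single-agent Q-learning update, the building block that the Nash Q-learning mechanism used in this paper generalizes to two players. The sketch below is illustrative only; the learning rate, discount factor, and exploration scheme are assumptions, not the simulation’s actual settings.

    import random
    from collections import defaultdict

    # Q maps a state to a dict of action -> estimated long-run value (default 0.0).
    Q = defaultdict(lambda: defaultdict(float))

    def q_learning_update(state, action, reward, next_state, alpha=0.1, gamma=0.9):
        """Standard single-agent Q-learning update (cf. [23]):
        Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
        best_next = max(Q[next_state].values(), default=0.0)
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

    def epsilon_greedy(state, actions, epsilon=0.1):
        """Explore with probability epsilon, otherwise pick the highest-valued action."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[state][a])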

Implementation

Hotelling’s original model is a differential game, where the seller’s location and price are continuous variables. For simulation purposes, we discretized the price and location space. Thus, the game played by our agents is no longer smooth, and there can be multiple equilibria. Moreover, the original game is played only once. Since learning happens iteratively, we added a time dimension to our simulation. The agents cannot freely select their location in a single step but have to travel
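As an illustration of such a discretization, the sketch below implements a per-round step on finite location and price grids with single-cell moves and profits given by the standard quadratic-cost demand split shown earlier. The grid sizes, price range, movement limits, and tie-breaking rule are hypothetical choices made for exposition and do not reproduce the simulation’s actual parameters.

    import numpy as np

    # Illustrative discretization: locations and prices live on finite grids, and
    # per round each agent may only move one grid cell and adjust its price by one tick.
    LOCATIONS = np.linspace(0.0, 1.0, 21)   # discretized street [0, 1]
    PRICES = np.linspace(0.0, 2.0, 21)      # discretized price grid
    T = 1.0                                  # quadratic transport cost parameter

    def profits(loc_idx, price_idx):
        """Per-round profits for both agents under quadratic consumer costs."""
        c = LOCATIONS[list(loc_idx)]
        p = PRICES[list(price_idx)]
        left, right = (0, 1) if c[0] <= c[1] else (1, 0)
        if c[left] == c[right]:
            # Identical products: the cheaper firm serves the whole market, ties are split.
            share_left = 0.5 if p[left] == p[right] else float(p[left] < p[right])
        else:
            # Indifferent consumer under quadratic costs, clipped to the street.
            x_bar = (c[left] + c[right]) / 2 \
                + (p[right] - p[left]) / (2 * T * (c[right] - c[left]))
            share_left = float(np.clip(x_bar, 0.0, 1.0))
        out = np.empty(2)
        out[left], out[right] = p[left] * share_left, p[right] * (1 - share_left)
        return tuple(out)

    def step(loc_idx, price_idx, moves, price_moves):
        """Apply per-agent moves in {-1, 0, +1} to location and price indices."""
        new_loc = tuple(int(np.clip(l + m, 0, len(LOCATIONS) - 1))
                        for l, m in zip(loc_idx, moves))
        new_price = tuple(int(np.clip(p + m, 0, len(PRICES) - 1))
                          for p, m in zip(price_idx, price_moves))
        return new_loc, new_price, profits(new_loc, new_price)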

Asymptotic behavior

First, we summarize the asymptotic behavior of key metrics for our agents. We observe the development of prices, locations, and profits in our simulation. From the locations, we also calculate the distance between the agents at every time step. The development of all parameters is captured for 30,000 time steps. To smooth the raw data, we use rolling averages of various lengths according to the given situation. Let us now concentrate only on profits. We take profits from all 100 simulation runs for
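One simple way to produce such smoothed profit curves is sketched below; averaging across runs before applying the rolling mean, and the window length of 500 steps, are our illustrative choices rather than the exact procedure used in the paper.

    import numpy as np

    def smoothed_mean_profit(profits, window=500):
        """Average a (runs x time_steps) profit array across runs, then apply a
        rolling mean with the given window (window length is illustrative)."""
        mean_over_runs = profits.mean(axis=0)                      # shape: (time_steps,)
        kernel = np.ones(window) / window
        return np.convolve(mean_over_runs, kernel, mode="valid")   # rolling average

    # Example: 100 runs of 30,000 time steps, as in the simulations reported here.
    profits = np.random.rand(100, 30_000)   # placeholder data for illustration only
    curve = smoothed_mean_profit(profits)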

Technical issues

We take Hotelling’s location model [10] and cut its continuous parameters into discrete pieces in order to be able to run a discrete simulation. This is technically straightforward to accomplish. However, it brings the following issues:

  • i.

    On the one hand, there is surely a Nash equilibrium in a finite game [20]; on the other hand, this equilibrium may be non-unique and depends on the way we cut the parameters of the differential game.

  • ii.

    Due to non-uniqueness of the Nash equilibrium, our

Conclusion

This paper aims to examine whether simulated adaptive behavior in Hotelling’s location model without perfect information brings different results from the theoretical model. Originally, Hotelling’s model with linear consumer costs supported the Principle of Minimum Differentiation, where the sellers meet in the middle of the street. However, the model was shown to be invalid [4], and valid alternatives with an adjusted consumer cost function have been proposed [5]. Nevertheless, the adjustments

CRediT authorship contribution statement

Jan Vainer: Methodology, Software, Validation, Formal analysis, Data curation, Writing - original draft, Visualization, Investigation. Jiri Kukacka: Conceptualization, Resources, Supervision, Writing - review & editing, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

Jiri Kukacka gratefully acknowledges financial support from the Charles University PRIMUS program [project PRIMUS/19/HUM/17] and from the Charles University UNCE program [project UNCE/HUM/035].

References (25)

  • T. Chmura et al.

    Learning in experimental 2x2 games

    Games Econ Behav

    (2012)
  • N. Economides

    Minimal and maximal product differentiation in Hotelling’s duopoly

    Econ Lett

    (1986)