Brief paper
Fully probabilistic design of strategies with estimator☆
Introduction
On the paper context and main result. This brief paper focuses on a technical problem related to a prescriptive theory of dynamic decision making (DM). The theory is dubbed fully probabilistic design (FPD) of decision strategies.¹ It generalises methodologies developed in connection with (adaptive) control theory (Åström and Wittenmark, 1994, Bertsekas, 2017) and Markov decision processes (Puterman, 2005). Since its initial publication (Kárný, 1996), it has been broadly elaborated (Kárný & Guy, 2006), axiomatised (Kárný, 2020), applied (Kárný et al., 2006, Quinn et al., 2003) and used for supporting decision makers (Guy et al., 2018, Kárný, 2021, Kárný and Guy, 2012, Zugarová and Guy, 2020).
The paper deals with the evergreen problem known as dual control (Feldbaum, 1961, Klenske and Hennig, 2016, Mesbah, 2018) or the exploration–exploitation dichotomy (Besbes, Gur, & Zeevi, 2019). It concerns the balance between random explorative actions, which support parameter estimation, and actions moving the closed control loop to the desired state.² The main contribution of the paper is an optimised feedback that “naturally” diminishes exploration (gained in learning) as the learning progresses.
On the addressed technical problem. Any estimation serves decision making, seen as the aim-focused selection and use of actions. The agent (the decision maker or the action selector, referred to as “it”) acts under uncertainty. The inspected agent uses FPD. FPD models the closed-loop behaviour by the joint probability density (pd). The behaviour consists of all considered uncertain variables. The inspected estimation arises when the behaviour includes a parameter unknown to the agent. Treating this parameter as a random variable coincides with Bayesianism (Berger, 1985).
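For orientation, the joint pd of such a behaviour can be written as below. The notation is assumed here, not recovered from the paper: Θ is the unknown parameter with prior pd f(Θ), a_t and s_t are the action and the closed-loop state at time t = 1, ..., T, 𝖬 is the environment model and 𝖲 denotes the strategy's decision rules, which may use the past data 𝒟_{t-1} but not Θ itself (the natural conditions of control). Under this assumption, 𝖲 cancels in Bayes' rule, so the posterior pd of Θ is updated by the model alone.

```latex
\begin{aligned}
f(\Theta, \mathcal{D}_T)
  &= f(\Theta) \prod_{t=1}^{T} \mathsf{M}(s_t \mid a_t, s_{t-1}, \Theta)\,
     \mathsf{S}(a_t \mid \mathcal{D}_{t-1}),
  \qquad \mathcal{D}_t = (\mathcal{D}_{t-1}, a_t, s_t),\\
f(\Theta \mid \mathcal{D}_t)
  &= \frac{\mathsf{M}(s_t \mid a_t, s_{t-1}, \Theta)\, f(\Theta \mid \mathcal{D}_{t-1})}
          {\int \mathsf{M}(s_t \mid a_t, s_{t-1}, \vartheta)\,
           f(\vartheta \mid \mathcal{D}_{t-1})\, \mathrm{d}\vartheta}.
\end{aligned}
```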
The FPD-optimal strategy minimises the Kullback–Leibler divergence (KLD) (Kullback & Leibler, 1951) of the behaviour's pd to its ideal twin, which expresses the DM aims. In estimation, the parameter estimates are (a part of) the agent's actions. The wish to obtain good estimates of the unknown parameter is the agent's generic aim. The key question is: which ideal pd expresses this wish? A universal conversion of a usual loss into the ideal pd exists (Prop. 3 in Kárný, 2020). It often violates the dictum (Kárný & Guy, 2019): select an ambitious but reachable ideal pd! Our solution meets this dictum and leads to the main result mentioned above.
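A minimal sketch of FPD for a single decision may help fix ideas. It assumes finitely many actions a and states s, a known environment model M(s|a), and an ideal pd factorised as an ideal model M^I(s|a) times an ideal strategy S^I(a); all names and the static setting are illustrative assumptions, not the paper's estimator. The KLD of M(s|a)S(a) to M^I(s|a)S^I(a) equals Σ_a S(a)[ln(S(a)/S^I(a)) + ω(a)], with ω(a) the KLD of M(·|a) to M^I(·|a), so its minimiser is S^O(a) ∝ S^I(a)exp(−ω(a)) and the minimum is −ln γ, where γ is the normalising constant.

```python
import math


def kld(p, q):
    """Kullback-Leibler divergence D(p||q) of two finite pds (0 ln 0 = 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def fpd_one_step(model, ideal_model, ideal_strategy):
    """Static (single decision) FPD with a known environment model.

    model[a][s]:       M(s | a), environment model
    ideal_model[a][s]: M^I(s | a), ideal twin of the model
    ideal_strategy[a]: S^I(a), ideal pd of actions
    Returns the strategy minimising D(M S || M^I S^I) and that minimum.
    """
    # omega(a): divergence of the model to its ideal twin for action a
    omega = [kld(m, mi) for m, mi in zip(model, ideal_model)]
    weights = [si * math.exp(-w) for si, w in zip(ideal_strategy, omega)]
    gamma = sum(weights)
    return [w / gamma for w in weights], -math.log(gamma)


# Hypothetical numbers: two actions, binary state; the ideal prefers s = 1.
model = [[0.9, 0.1], [0.4, 0.6]]
ideal_model = [[0.2, 0.8], [0.2, 0.8]]
ideal_strategy = [0.5, 0.5]            # no a priori preference among actions
strategy, d_min = fpd_one_step(model, ideal_model, ideal_strategy)
print([round(p, 3) for p in strategy], round(d_min, 3))
```

The optimal strategy stays randomised: it shifts probability towards the action whose model is closer to the ideal one instead of selecting it deterministically.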
Layout. Section 2 recalls FPD, embeds it into slightly generalised dynamic programming and advocates the use of closed-loop states. The core Section 3 solves FPD with an estimator: it proposes the relevant ideal pd and finds the FPD-optimal estimator. Section 4 summarises properties of the proposed strategy and outlines open problems.
Notation. The set of values of a quantity is denoted by a dedicated symbol, defined if needed. Sans-serif fonts denote mappings. Dedicated superscripts refer to the ideal pd and to optimality, respectively. A dedicated symbol defines by assigning; ∝ denotes proportionality; another symbol marks interim objects. The time subscript of a function is dropped if the function argument carries it. The text prefers mnemonic identifiers.
Fully probabilistic design
FPD deals with the closed DM loop formed by an agent and its environment. The agent's actions, taken at discrete time moments, influence the transitions of the closed-loop state from one time moment to the next. The inspected transition model depends on an unknown, time-invariant parameter. The closed-loop states are gradually observed and constructed. A fixed, known initial state implicitly conditions all used pds. The case with internal states is left aside to keep the paper simple.
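Handling the unknown, time-invariant parameter as a random variable means that the agent updates its pd of the parameter after each observed transition by Bayes' rule. A minimal sketch, assuming a finite set of candidate parameter values and a hypothetical model of a binary state: action 1 makes the next state equal to 1 with probability θ, while action 0 yields a fair coin flip that carries no information about θ. The model, the values and the observed history are all invented for illustration.

```python
def bayes_update(prior, likelihoods):
    """One step of Bayes' rule over a finite set of parameter values.

    prior[k]:       probability of the k-th parameter value before the datum
    likelihoods[k]: model probability of the observed datum under that value
    """
    unnormalised = [p * lik for p, lik in zip(prior, likelihoods)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def prob_state_one(action, theta):
    """Hypothetical model: action 1 gives s_t = 1 with probability theta,
    action 0 gives a fair coin flip, i.e. no information on theta."""
    return theta if action == 1 else 0.5


thetas = [0.2, 0.5, 0.8]                 # candidate parameter values
posterior = [1.0 / 3.0] * 3              # uniform prior pd
history = [(1, 1), (0, 0), (1, 1), (1, 0), (1, 1)]   # hypothetical (a_t, s_t)
for action, state in history:
    lik = [prob_state_one(action, th) if state == 1
           else 1.0 - prob_state_one(action, th) for th in thetas]
    posterior = bayes_update(posterior, lik)

print([round(p, 3) for p in posterior])
```

The step with action 0 multiplies all candidates by the same likelihood and so leaves the posterior unchanged; only the informative actions move it. This is a toy view of why explorative actions support parameter estimation.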
FPD with estimator
Section 3.1 constructs the FPD-optimal strategy. It relies on a slight extension of stochastic dynamic programming (Bertsekas, 2017) that minimises⁴ the strategy-dependent expectation of the -dependent additive loss , . The optimal strategy minimises the expectation of the loss The dependence of the loss on makes the optimised functional
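FPD itself can be solved by such a backward recursion, because the logarithm of the ratio of the closed-loop pd to its ideal twin is an additive loss. A minimal sketch for the simplest case, assuming a known (not parameter-dependent) Markov transition model on finitely many states and actions, a finite horizon, and an ideal pd built from an ideal model and an ideal decision rule; the names are hypothetical and the paper's estimator-specific recursion is not reproduced. Each step computes ω_t(a, s) = Σ_{s'} M(s'|a,s)[ln(M(s'|a,s)/M^I(s'|a,s)) − ln γ_{t+1}(s')], the decision rule S_t(a|s) ∝ S^I(a|s)exp(−ω_t(a,s)) and its normaliser γ_t(s), with γ = 1 after the horizon.

```python
import math


def fpd_recursion(model, ideal_model, ideal_rule, horizon):
    """Backward FPD recursion for a finite controlled Markov chain.

    model[s][a][s2]:       M(s2 | a, s), known transition model
    ideal_model[s][a][s2]: M^I(s2 | a, s), its ideal twin
    ideal_rule[s][a]:      S^I(a | s), ideal decision rule
    Returns optimal decision rules rules[t][s][a], t = 0..horizon-1, and
    the minimal KLD, -ln gamma_0(s), for each initial state s.
    """
    n_states = len(model)
    gamma = [1.0] * n_states                  # gamma after the horizon is 1
    rules = [None] * horizon
    for t in reversed(range(horizon)):
        rules_t, gamma_t = [], []
        for s in range(n_states):
            # omega(a, s): expected log-ratio plus the cost-to-go -ln gamma
            omega = [sum(p * (math.log(p / q) - math.log(g))
                         for p, q, g in zip(model[s][a], ideal_model[s][a], gamma)
                         if p > 0.0)
                     for a in range(len(model[s]))]
            weights = [si * math.exp(-o) for si, o in zip(ideal_rule[s], omega)]
            total = sum(weights)
            rules_t.append([w / total for w in weights])
            gamma_t.append(total)
        rules[t], gamma = rules_t, gamma_t
    return rules, [-math.log(g) for g in gamma]


# Hypothetical two-state, two-action example: the ideal prefers state 1.
model = [[[0.9, 0.1], [0.5, 0.5]],
         [[0.6, 0.4], [0.1, 0.9]]]
ideal_model = [[[0.1, 0.9], [0.1, 0.9]],
               [[0.1, 0.9], [0.1, 0.9]]]
ideal_rule = [[0.5, 0.5], [0.5, 0.5]]
rules, values = fpd_recursion(model, ideal_model, ideal_rule, horizon=3)
print(rules[0], values)
```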
On the proposed strategy
The novel FPD-optimal strategy with estimator:
- ✓ respects both the knowledge collected in the posterior pd (8) and the influence of the parameter estimate on -driven DM via the function (16), which is the expected (weighted) divergence of the environment model to its ideal twin;
- ✓ due to the previous property, correlates usual actions with estimates more deeply⁵
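The function (16) itself is not reproduced in this excerpt. As a hypothetical illustration of an expected divergence of the environment model to its ideal twin, the sketch below averages, for each action, the KLD of a parameter-dependent model to its ideal twin over a finite posterior pd. The averaging over the posterior, the finite parameter set and all names are assumptions; any additional weighting used in (16) is left out.

```python
import math


def kld(p, q):
    """D(p||q) for finite pds, with the convention 0 ln 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def expected_divergence(posterior, models, ideal_model):
    """Posterior expectation of D(M(.|a, theta) || M^I(.|a)) for each action a.

    posterior[k]:      posterior probability of the k-th parameter value
    models[k][a][s]:   M(s | a, theta_k), parameter-dependent model
    ideal_model[a][s]: M^I(s | a), ideal twin of the model
    """
    return [sum(w * kld(models[k][a], ideal_model[a])
                for k, w in enumerate(posterior))
            for a in range(len(ideal_model))]


# Hypothetical example: two parameter values, two actions, binary state.
models = [[[0.5, 0.5], [0.8, 0.2]],     # theta_0
          [[0.5, 0.5], [0.2, 0.8]]]     # theta_1
ideal_model = [[0.1, 0.9], [0.1, 0.9]]   # the ideal prefers s = 1
vague = expected_divergence([0.5, 0.5], models, ideal_model)
sharp = expected_divergence([0.05, 0.95], models, ideal_model)
print(vague, sharp)
```

With the vague posterior, action 0 (whose effect does not depend on the parameter) has the smaller averaged divergence; with the sharper posterior, action 1 does. In this toy setting, the knowledge held in the posterior thus changes which actions look preferable.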
Acknowledgments
MŠMT ČR LTC18075 and EU-COST Action CA16228 support this research.
References (42)
- Dual adaptive model predictive control. Automatica (2017).
- Towards fully probabilistic control design. Automatica (1996).
- Axiomatisation of fully probabilistic design revisited. Systems & Control Letters (2020).
- Fully probabilistic control design. Systems & Control Letters (2006).
- Preference elicitation within framework of fully probabilistic design of decision strategies.
- Model predictive control: Recent developments and future promise. Automatica (2014).
- Stochastic model predictive control with active uncertainty learning: A survey on dual control. Annual Reviews in Control (2018).
- Bayesian system identification.
- Adaptive control (1994).
- Information and exponential families in statistical theory (1978).
- Statistical decision theory and Bayesian analysis.
- Dynamic programming and optimal control.
- Optimal exploration–exploitation in a multi-armed bandit problem with nonstationary rewards. Stochastic Systems.
- Action-constrained Markov decision processes with Kullback–Leibler cost.
- Theory of dual control. Automation and Remote Control.
- Online Markov decision processes with Kullback–Leibler control cost.
- Lazy fully probabilistic design: Application potential.
- Affective decision-making in ultimatum game: Responder.
- Linear theory for control of nonlinear stochastic systems. Physical Review Letters.
- Towards on-line tuning of adaptive-agent’s multivariate meta-parameter. International Journal of Machine Learning and Cybernetics.
- Optimized Bayesian dynamic advising: Theory and algorithms.
Miroslav Kárný, Ing. (M.Sc.) in theoretical cybernetics, Czech Technical University (CTU) Prague, 1973; CSc (Ph.D.) 1978, DrSc (DSc) 1990, both in technical cybernetics at the Institute of Information Theory and Automation, the Czechoslovak Academy of Sciences, which has employed him since 1973 in the Department of Adaptive Systems. Research: conceptual, theoretical and algorithmic aspects of adaptive systems based on Bayesian dynamic decision making and its fully probabilistic extension. Teaching: the advanced course on dynamic decision making, CTU, since 1991; supervision of 13 defended Ph.D. students (+10 co-supervisions) and numerous B.Sc. and M.Sc. theses and research projects. Publications: 1 monograph, 6 edited books, 420 works (10 chapters, 120 articles); for the list and many preprints after 1989, see http://www.utia.cz/people/karny.
☆ The material in this paper was not presented at any conference. This paper was recommended for publication in revised form by Associate Editor Chanying Li under the direction of Editor Miroslav Krstic.