Elsevier

Automatica

Volume 141, July 2022, 110269

Brief paper
Fully probabilistic design of strategies with estimator

https://doi.org/10.1016/j.automatica.2022.110269

Abstract

The axiomatic fully probabilistic design (FPD) of decision strategies strictly extends Bayesian decision making (DM) theory. FPD also models the closed decision loop by a joint probability density (pd) of all inspected random variables, referred to as behaviour. Unlike the usual DM, FPD expresses DM aims via an ideal pd of behaviours. Its optimal strategy minimises the Kullback–Leibler divergence (KLD) of the joint, strategy-dependent, pd of behaviours to its ideal twin. A range of FPD results has confirmed its theoretical and practical strength. Curiously, no guide exists on how to select a specific ideal pd for an estimator design. The paper offers one. It advocates the use of the closed-loop state notion and generalises dynamic programming so that FPD is its special case. Primarily, it provides an explorative optimised feedback that “naturally” diminishes exploration (gained in learning) as the learning progresses.

Introduction

On the paper context and main result. This brief paper focuses on a technical problem related to a prescriptive theory of dynamic decision making (DM). The theory is dubbed fully probabilistic design (FPD) of decision strategies. It generalises methodologies developed in connection with (adaptive) control theory (Åström and Wittenmark, 1994; Bertsekas, 2017) and Markov decision processes (Puterman, 2005). Since its initial publication (Kárný, 1996), it has been broadly elaborated (Kárný & Guy, 2006), axiomatised (Kárný, 2020), applied (Kárný et al., 2006; Quinn et al., 2003) and used for supporting decision makers (Guy et al., 2018; Kárný, 2021; Kárný and Guy, 2012; Zugarová and Guy, 2020).

The paper deals with the evergreen problem known as dual control (Feldbaum, 1961; Klenske and Hennig, 2016; Mesbah, 2018) or the exploration–exploitation dichotomy (Besbes, Gur, & Zeevi, 2019). It concerns balancing random explorative actions, which support parameter estimation, with actions that move the closed control loop to the desired state. The main contribution of the paper is an optimised feedback that “naturally” diminishes exploration (gained in learning) as the learning progresses.

On the addressed technical problem. Any estimation serves decision making, seen as the aim-focused selection and use of actions. The agent – the decision maker or the action selector, referred to as “it” – acts under uncertainty. The inspected agent uses FPD. FPD models the closed-loop behaviour by a joint probability density (pd). The behaviour consists of all considered uncertain variables. The inspected estimation arises when the behaviour includes a parameter unknown to the agent. Handling this parameter as a random variable coincides with Bayesianism (Berger, 1985).

The FPD-optimal strategy minimises the Kullback–Leibler divergence (KLD) (Kullback & Leibler, 1951) of the behaviours’ pd to its ideal twin, which expresses the DM aims. In estimation, the parameter estimates are (a part of) the agent’s actions. The wish to obtain good estimates of the unknown parameter is the agent’s generic aim. The key question is: What ideal pd expresses this wish? A universal conversion of a usual loss into the ideal pd exists, see Prop. 3 in Kárný (2020). It often violates the dictum (Kárný & Guy, 2019): Select an ambitious but reachable ideal pd! Our solution meets this dictum and leads to the mentioned main result.
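The KLD criterion can be illustrated on a toy discrete case. The sketch below is not taken from the paper: the ideal pd, the three candidate closed-loop pds and the names `strategy_A` to `strategy_C` are made-up assumptions, and a real FPD design optimises over strategies rather than picking from a fixed list.

```python
import math

def kld(f, g):
    """Kullback-Leibler divergence D(f||g) of two discrete pds given as lists."""
    total = 0.0
    for fi, gi in zip(f, g):
        if fi > 0.0:
            if gi == 0.0:
                return math.inf  # f puts mass where the ideal pd forbids it
            total += fi * math.log(fi / gi)
    return total

# Hypothetical ideal pd over four possible behaviours.
ideal = [0.70, 0.20, 0.05, 0.05]

# Hypothetical closed-loop pds of behaviours, each induced by another strategy.
candidates = {
    "strategy_A": [0.25, 0.25, 0.25, 0.25],
    "strategy_B": [0.60, 0.25, 0.10, 0.05],
    "strategy_C": [0.10, 0.10, 0.40, 0.40],
}

# The FPD criterion prefers the strategy whose closed-loop pd is KLD-closest
# to the ideal pd (note the argument order: closed-loop pd first, ideal second).
divergences = {name: kld(pd, ideal) for name, pd in candidates.items()}
best = min(divergences, key=divergences.get)
print(best, divergences[best])
```

The argument order matters because the KLD is asymmetric: the expectation is taken with respect to the strategy-dependent pd, not the ideal one.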

Layout. Section 2 recalls FPD, embeds it into a slightly generalised dynamic programming and advocates the use of closed-loop states. The core Section 3 solves FPD with an estimator. It proposes the relevant ideal pd and finds the FPD-optimal estimator. Section 4 summarises properties of the proposed strategy and outlines open problems.

Notation

$\boldsymbol{x}$ marks the set of $x$s, defined if needed. Sanserif fonts denote mappings. The superscripts $^{i}$, $^{o}$ refer to the ideal pd and optimality, respectively. The symbol $:=$ defines by assigning; $\propto$ is proportionality; $\tilde{\ }$ marks interim objects. The time subscript $t \in \boldsymbol{t}$ of a function $\mathsf{f}_t$ on $\boldsymbol{x}$ drops if the function argument has it, $\mathsf{f}(x_t) := \mathsf{f}_t(x_t)$. The text prefers mnemonic identifiers.

Section snippets

Fully probabilistic design

FPD deals with the closed DM loop. An agent and its environment form it. The agent’s actions $a_t \in \boldsymbol{a}$, at time moments tagged by $t \in \boldsymbol{t} := \{1, \ldots, h\}$, $h < \infty$, influence transitions of states $s_{t-1} \in \boldsymbol{s}$ to states $s_t \in \boldsymbol{s}$. The inspected transition model depends on an unknown, time-invariant, parameter $p \in \boldsymbol{p}$. The closed-loop states $(s_t)_{t \in \boldsymbol{t}}$ are gradually observed and constructed. A fixed, known initial state $s_0$ implicitly conditions all used pds. The case with internal states is left aside to keep the paper simple.
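Treating the unknown parameter as a random variable means that each observed transition updates a pd over the parameter by Bayes' rule. The following minimal sketch assumes a finite grid of parameter values and a hypothetical two-state transition model; the function names, the grid and the data record are illustrative assumptions, not objects from the paper.

```python
def model(s_next, a, s_prev, p):
    """Hypothetical transition pd m(s_next | a, s_prev, p) on states {0, 1}.

    The action a in {0, 1} names the target state and succeeds with
    probability p; s_prev is kept only to mirror the general signature.
    """
    return p if s_next == a else 1.0 - p

def bayes_update(posterior, s_next, a, s_prev, grid):
    """One step of Bayes' rule on a finite parameter grid."""
    unnormalised = [w * model(s_next, a, s_prev, p)
                    for w, p in zip(posterior, grid)]
    norm = sum(unnormalised)
    return [u / norm for u in unnormalised]

grid = [0.2, 0.5, 0.8]               # assumed finite set of parameter values
posterior = [1 / 3, 1 / 3, 1 / 3]    # flat prior pd on the grid

# Made-up record of (s_prev, a, s_next); the action succeeds in 4 of 5 steps.
data = [(0, 1, 1), (1, 0, 0), (0, 1, 1), (1, 1, 0), (0, 0, 0)]
for s_prev, a, s_next in data:
    posterior = bayes_update(posterior, s_next, a, s_prev, grid)

print(dict(zip(grid, posterior)))
```

With this record the posterior mass moves towards the largest grid value, since the recorded actions mostly succeed.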

FPD with estimator

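Section 3.1 constructs the FPD-optimal strategy. It relies on a slight extension of stochastic dynamic programming (Bertsekas, 2017) that minimises the strategy-dependent expectation $\mathsf{E}_{\mathsf{s}}[\mathsf{L}_{\mathsf{s}}]$ of the $\mathsf{s}$-dependent additive loss $\mathsf{L}_{\mathsf{s}}(b)$, $b \in \boldsymbol{b}$. The optimal strategy $\mathsf{s}^{o}$ minimises the expectation of the loss

$$\mathsf{L}_{\mathsf{s}}(b) := \sum_{t \in \boldsymbol{t}} \mathsf{l}_{\mathsf{r}_t}(s_{t-1}), \quad \mathsf{l}_{\mathsf{r}_t}: \boldsymbol{s} \to (-\infty, \infty), \qquad \mathsf{s}^{o} \in \operatorname{Arg\,min}_{\mathsf{s}} \mathsf{E}_{\mathsf{s}}[\mathsf{L}_{\mathsf{s}}] := \operatorname{Arg\,min}_{\mathsf{s}} \int_{\boldsymbol{b}} \mathsf{L}_{\mathsf{s}}(b)\, \mathsf{c}_{\mathsf{s}}(b)\, \mathrm{d}b.$$

The dependence of the loss $\mathsf{L}_{\mathsf{s}}$ on $\mathsf{s}$ makes the optimised functional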
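For orientation, the sketch below shows the backward recursion usually associated with FPD when the transition model is fully known, on finite state and action sets. It is not the paper's estimator design: the known-model setting, the tables `MODEL`, `IDEAL_MODEL` and `IDEAL_RULE` and the function name are assumptions made for illustration. The per-step loss depends on the randomised decision rule itself, through the term log(r / r^i), which is the kind of rule-dependent loss that plain dynamic programming has to be extended for.

```python
import math

def fpd_backward(model, ideal_model, ideal_rule, horizon):
    """Backward FPD recursion for a known model on finite sets.

    model[a][s][s2], ideal_model[a][s][s2]: transition pds m and m^i.
    ideal_rule[s][a]: ideal decision rule r^i.
    Returns (rules, gamma0); rules[t][s][a] is the optimised randomised rule
    at step t (0-based), and -log(gamma0[s]) is the minimal KLD from state s.
    """
    n_a, n_s = len(model), len(model[0])
    gamma = [1.0] * n_s                    # terminal condition
    rules = []
    for _ in range(horizon):               # runs from the last step backwards
        new_gamma, rule = [], []
        for s in range(n_s):
            weights = []
            for a in range(n_a):
                # omega(a, s): expected log-ratio including the future value.
                omega = sum(
                    m * math.log(m / (ideal_model[a][s][s2] * gamma[s2]))
                    for s2, m in enumerate(model[a][s]) if m > 0.0
                )
                weights.append(ideal_rule[s][a] * math.exp(-omega))
            norm = sum(weights)
            new_gamma.append(norm)
            rule.append([w / norm for w in weights])
        gamma = new_gamma
        rules.insert(0, rule)
    return rules, gamma

# Hypothetical two-state, two-action example; the ideal model prefers state 1.
MODEL = [[[0.9, 0.1], [0.6, 0.4]],         # action 0
         [[0.3, 0.7], [0.2, 0.8]]]         # action 1
IDEAL_MODEL = [[[0.1, 0.9], [0.1, 0.9]],
               [[0.1, 0.9], [0.1, 0.9]]]
IDEAL_RULE = [[0.5, 0.5], [0.5, 0.5]]

rules, gamma0 = fpd_backward(MODEL, IDEAL_MODEL, IDEAL_RULE, horizon=3)
min_kld_from_state0 = -math.log(gamma0[0])
print(rules[0][0], min_kld_from_state0)
```

The optimised rule is randomised: it reweights the ideal rule by exp(-omega), so actions are favoured smoothly rather than selected by a hard minimum.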

On the proposed strategy

The novel FPD-optimal strategy with estimator:

  • respects both the knowledge collected in the posterior pd $\mathsf{p}_{t-1}$ (8) and the influence of the parameter estimate $\hat{p}$ on the $\alpha$-driven DM via the function $\Delta_t(\alpha, \hat{p}, s)$ (16), which is the expected (weighted) divergence of the environment model $\mathsf{m}_t(\tilde{s}|\alpha, s, p)$ to its ideal twin $\mathsf{m}^{i}_t(\tilde{s}|\alpha, s, \hat{p})$;

  • correlates, due to the previous property, usual actions $\alpha_t$ with estimates $\hat{p}_t$ more deeply
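The first property can be made concrete with a small numerical sketch of a divergence of the described kind. Several simplifications here are assumptions and not the paper's definition (16): the expectation is an unweighted average over a posterior pd on a finite parameter grid, the ideal twin is taken to be the same model with the estimate plugged in, and the usual action and the state are held fixed. All names and numbers are hypothetical.

```python
import math

def kld(f, g):
    """KLD D(f||g) of two discrete pds with strictly positive g."""
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g) if fi > 0.0)

def next_state_pd(p):
    """Hypothetical model m(. | alpha, s, p) on two states, (alpha, s) fixed."""
    return [1.0 - p, p]

def delta(p_hat, posterior, grid):
    """Posterior expectation of D(m(.|alpha,s,p) || m^i(.|alpha,s,p_hat)).

    Assumption: the ideal twin equals the model with p_hat plugged in.
    """
    ideal = next_state_pd(p_hat)
    return sum(w * kld(next_state_pd(p), ideal)
               for w, p in zip(posterior, grid))

grid = [0.2, 0.5, 0.8]                # assumed finite parameter grid
posterior = [0.02, 0.27, 0.71]        # made-up posterior pd on the grid

# Evaluate every candidate estimate; a small value marks an estimate whose
# ideal twin stays close to the models the posterior considers plausible.
scores = {p_hat: delta(p_hat, posterior, grid) for p_hat in grid}
best_estimate = min(scores, key=scores.get)
print(best_estimate, scores)
```

In this toy setting the score vanishes only when the posterior concentrates on the chosen estimate, so residual posterior uncertainty shows up as a strictly positive value.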

Acknowledgments

MŠMT ČR LTC18075 and EU-COST Action CA16228 support this research.

References (42)

  • Berger, J.O. (1985). Statistical decision theory and Bayesian analysis.
  • Bertsekas, D.P. (2017). Dynamic programming and optimal control.
  • Besbes, O. et al. (2019). Optimal exploration–exploitation in a multi-armed bandit problem with nonstationary rewards. Stochastic Systems.
  • Bušič, A. et al. Action-constrained Markov decision processes with Kullback-Leibler cost.
  • Feldbaum, A.A. (1961). Theory of dual control. Automation and Remote Control.
  • Guan, P. et al. Online Markov decision processes with Kullback-Leibler control cost.
  • Guy, T.V. et al. Lazy fully probabilistic design: Application potential.
  • Homolová, J. et al. Affective decision-making in ultimatum game: Responder.
  • Kappen, H.J. (2005). Linear theory for control of nonlinear stochastic systems. Physical Review Letters.
  • Kárný, M. (2021). Towards on-line tuning of adaptive-agent’s multivariate meta-parameter. International Journal of Machine Learning and Cybernetics.
  • Kárný, M. et al. (2006). Optimized Bayesian dynamic advising: Theory and algorithms.

Miroslav Kárný, Ing. (M.Sc.) in theoretical cybernetics, Czech Technical University (CTU) Prague, 1973; CSc (Ph.D.) 1978, DrSc (DSc) 1990, both in technical cybernetics at the Institute of Information Theory and Automation, the Czechoslovak Academy of Sciences, which has employed him since 1973 in the Department of Adaptive Systems. Research: conceptual, theoretical and algorithmic aspects of adaptive systems based on Bayesian dynamic decision making and its fully probabilistic extension. Teaching: the advanced course on dynamic decision making, CTU since 1991; supervision of 13 defended Ph.D. students (+10 co-supervision) and numerous B.Sc. and M.Sc. theses and research projects. Publications: 1 monograph, 6 edited books, 420+ works (10 chapters, 120 articles); for the list and many preprints after 1989 see http://www.utia.cz/people/karny.

The material in this paper was not presented at any conference. This paper was recommended for publication in revised form by Associate Editor Chanying Li under the direction of Editor Miroslav Krstic.