Balancing Exploitation and Exploration via Fully Probabilistic Design of Decision Policies

Kárný, Miroslav; Hůla, František

doi:https://dx.doi.org/10.5220/0007587208570864

Number of the records: 1

SYSNO ASEP	0503817
Document Type	C - Proceedings Paper (int. conf.)
R&D Document Type	Conference Paper
Title	Balancing Exploitation and Exploration via Fully Probabilistic Design of Decision Policies
Author(s)	Kárný, Miroslav (UTIA-B) ; Hůla, František (UTIA-B)
Number of authors	2
Source Title	Proceedings of the 11th International Conference on Agents and Artificial Intelligence, 2. - Setúbal : SciTePress, 2019 / Rocha A. ; Steels L. ; van den Herik J. - ISBN 978-989-758-350-6
Pages	pp. 857-864
Number of pages	8 pp.
Publication form	Print - P
Event	International Conference on Agents and Artificial Intelligence
Event date	19.02.2019 - 21.02.2019
Event location	Praha
Country	CZ - Czech Republic
Event type	WRD
Language	eng - English
Country	PT - Portugal
Keywords	exploitation ; exploration ; adaptive systems ; Bayesian estimation ; fully probabilistic design ; Markov decision process
Subject RIV	BC - Control Systems Theory
OECD category	Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
R&D Projects	GA16-09848S GA ČR - Czech Science Foundation (CSF)
	GA18-15970S GA ČR - Czech Science Foundation (CSF)
Institutional support	UTIA-B - RVO:67985556
EID SCOPUS	85064837601
DOI	10.5220/0007587208570864
Annotation	Adaptive decision making learns an environment model that serves the design of a decision policy. The policy-generated actions influence both the acquired reward and the future knowledge. The optimal policy properly balances exploitation with exploration. The inherent curse of dimensionality of decision making under incomplete knowledge prevents the realisation of the optimal design. This has stimulated repeated attempts to reach this balance at least approximately. Usually, either: (a) the exploitative reward is enriched by a part reflecting the exploration quality and a feasible approximate certainty-equivalent design is made, or (b) an explorative random noise is added to the purely exploitative actions. This paper avoids the inauspicious (a) and improves (b) by employing the non-standard fully probabilistic design (FPD) of decision policies, which naturally generates random actions. Monte-Carlo experiments confirm its achieved quality. The quality stems from methodological contributions, which include: (i) an improvement of the relation between FPD and standard Markov decision processes, and (ii) a design of an adaptive tuning of an FPD parameter. The latter is also suitable for tuning the temperature in both simulated annealing and the Boltzmann machine.
Workplace	Institute of Information Theory and Automation
Contact	Markéta Votavová, votavova@utia.cas.cz, Tel.: 266 052 201.
Year of Publishing	2020
