The following information was submitted:
Transactions: WSEAS TRANSACTIONS ON MATHEMATICS
Transactions ID Number: 89-330
Full Name: Raul Montes-de-Oca
Position: Doctor (Researcher)
Age: ON
Sex: Male
Address: Universidad Autonoma Metropolitana-Iztapalapa, Av. San Rafael Atlixco 186, 09340 Mexico, D.F.
Country: MEXICO
Tel: 58044654
Tel prefix: 52 55
Fax: 58044660
E-mail address: momr@xanum.uam.mx
Other E-mails: none
Title of the Paper: When are the value iteration maximizers close to an optimal stationary policy of a discounted Markov decision process? Closing the gap between the Borel space theory and actual computations
Authors as they appear in the Paper: Raul Montes-de-Oca, Enrique Lemus-Rodriguez
Email addresses of all the authors: momr@xanum.uam.mx, elemus@anahuac.mx
Number of paper pages: 10
Abstract: Markov decision processes (MDPs) have been used repeatedly in Economics and Engineering, but they still appear to be far from achieving their full potential because of the computational difficulties inherent to the subject, in particular the usual impossibility of finding explicit optimal solutions. Value iteration is an elegant, theoretical method of approximating an optimal solution, frequently mentioned in Economics when MDPs are used. To extend its use and benefits, an improved understanding of its convergence is still needed, even if this may not appear to be the case. For instance, the corresponding convergence properties of the policies are still not well understood. In this paper we further analyze this issue: if value iteration yields at the $N$-th iteration a stationary policy $f_N$ such that the expected total discounted rewards of an optimal stationary policy $f^*$ and of $f_N$ are close, are the corresponding actions $f^*(x)$ and $f_N(x)$ necessarily close for each state $x$? To our knowledge this question is still largely open. This paper studies when it is possible to stop the value iteration algorithm so that the corresponding maximizer stationary policy $f_N$ approximates an optimal policy both in the total discounted reward and in the action space (uniformly over the state space). The action space is assumed to be a compact set and the reward function bounded. An ergodicity condition on the transition probability law and a structural condition on the reward function are needed. Under these conditions, an upper bound on the number of steps needed in the value iteration algorithm, such that the corresponding maximizer is a uniform approximation of the optimal policy, is obtained.
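For readers who want to experiment with the stopping question raised in the abstract, the following is a minimal sketch, not the paper's algorithm, of value iteration with the standard discounted-reward stopping rule on a finite MDP. The function name `value_iteration` and the arrays `P` and `r` are illustrative assumptions; the paper itself works with Borel state spaces and compact action sets, where the finer question of closeness of $f_N(x)$ to $f^*(x)$ in the action space arises.

```python
import numpy as np

def value_iteration(P, r, beta, eps):
    """Value iteration for a finite discounted MDP (illustrative sketch).

    P[a, x, y] : transition probability from state x to y under action a
    r[a, x]    : one-step reward for action a in state x
    beta       : discount factor in (0, 1)
    eps        : desired accuracy in expected total discounted reward
    """
    n_actions, n_states = r.shape
    v = np.zeros(n_states)
    while True:
        # Q(a, x) = r(x, a) + beta * sum_y P(y | x, a) * v(y)
        q = r + beta * np.einsum('axy,y->ax', P, v)
        v_new = q.max(axis=0)
        # Standard stopping rule: when successive iterates differ by less than
        # eps * (1 - beta) / (2 * beta) in sup-norm, the greedy (maximizer)
        # policy f_N is eps-optimal in discounted reward. Whether f_N(x) is
        # also close to f*(x) for every x is the question studied in the paper.
        if np.max(np.abs(v_new - v)) < eps * (1 - beta) / (2 * beta):
            return q.argmax(axis=0), v_new
        v = v_new
```

A usage example under these assumptions: for a 2-action, 3-state MDP with `P` of shape (2, 3, 3), `r` of shape (2, 3), `beta = 0.9` and `eps = 1e-4`, the call `value_iteration(P, r, 0.9, 1e-4)` returns the maximizer policy $f_N$ (as an array of action indices) together with the final value iterate.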
Keywords: Markov decision process, Compact action space, Bounded reward, Expected total discounted reward, Approximation of optimal policies by means of value iteration policies
EXTENSION of the file: .pdf
Special (Invited) Session: Value iteration and action epsilon-approximation of optimal policies in discounted Markov decision processes
Organizer of the Session: 697-320
How Did you learn about congress: none
IP ADDRESS: 148.230.10.16