Introduction to Reinforcement Learning (PENDING)
In this introductory post we discuss the mathematical model of a Markov Decision Process (MDP) and some elementary ways to solve it in both the offline and online settings.
1. The Mathematical Model
Reinforcement Learning is a technique for solving, i.e. finding an optimal strategy for, a particular class of problems called “Markov Decision Processes” (MDPs). This class is apt for modeling many real-world situations that we associate with “learning from the environment”. Take traffic, for example: with time we get accustomed to its patterns and adjust our route so as to reduce our travel time. The key aspect is interaction with the environment and learning from it (understanding its dynamics, and then logically using them to our advantage). The “logically using” part is theoretically settled; there is not much left to do (as we will see) [1]. The central problem is understanding the dynamics of the environment (which are random) while also exploiting them. This too has to be done cleverly: one doesn't just sit for days observing traffic (and fail to get paid for one's job!) and only then start exploiting. That is bad for two reasons. First, it surely isn't optimal. Second, who knows, you might have set out to observe on a very unusual and misleading day! In practice we balance exploration and exploitation.
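To make the exploration/exploitation balance concrete, here is a minimal sketch of the traffic example using an epsilon-greedy rule. This is only an illustration and not a method developed in this post: the function name `epsilon_greedy_commute`, the three routes, their mean travel times, the Gaussian noise and the value of `epsilon` are all made-up assumptions.

```python
import random

def epsilon_greedy_commute(mean_times, epsilon=0.1, days=2000, seed=0):
    """Choose one route per day and keep a running average of its travel time.

    mean_times: hypothetical true mean travel time (minutes) of each route,
                unknown to the commuter and used only to simulate the traffic.
    """
    rng = random.Random(seed)
    n = len(mean_times)
    counts = [0] * n        # how often each route has been taken
    estimates = [0.0] * n   # running average travel time per route
    for _ in range(days):
        if rng.random() < epsilon or 0 in counts:
            # Explore: try a random route (also used until every route is tried once).
            route = rng.randrange(n)
        else:
            # Exploit: take the route that currently looks fastest.
            route = min(range(n), key=lambda r: estimates[r])
        # Assumed noise model: Gaussian with a 5 minute standard deviation.
        travel_time = rng.gauss(mean_times[route], 5.0)
        counts[route] += 1
        # Incremental update of the running average.
        estimates[route] += (travel_time - estimates[route]) / counts[route]
    return estimates, counts

estimates, counts = epsilon_greedy_commute([30.0, 25.0, 40.0])
best_route = min(range(len(estimates)), key=lambda r: estimates[r])
print(best_route, counts)
```

With a small `epsilon` the commuter mostly takes the route that looks best so far, but still samples the others now and then, so a single misleading day cannot lock in a bad choice.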
1.1 Markov Chain
Before we introduce the notion of an MDP, we begin with its natural predecessor: a Markov Chain.
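As a rough preview, a Markov Chain can be pictured as a set of states together with fixed probabilities of moving from one state to the next, where the next state depends only on the current one. The sketch below simulates such a chain for the traffic example; the two states (`"light"`, `"heavy"`) and all the transition probabilities are hypothetical values chosen purely for illustration.

```python
import random

# Hypothetical transition probabilities: TRANSITIONS[s][t] is the
# probability of moving from state s to state t in one step.
TRANSITIONS = {
    "light": {"light": 0.8, "heavy": 0.2},
    "heavy": {"light": 0.4, "heavy": 0.6},
}

def simulate_chain(start, steps, seed=0):
    """Sample a path of the chain; each step looks only at the current state."""
    rng = random.Random(seed)
    state = start
    path = [state]
    for _ in range(steps):
        next_states = list(TRANSITIONS[state])
        weights = [TRANSITIONS[state][s] for s in next_states]
        # Draw the next state according to the current state's row.
        state = rng.choices(next_states, weights=weights)[0]
        path.append(state)
    return path

path = simulate_chain("light", steps=10)
print(path)
```

Note that there are no decisions here: the chain simply evolves on its own. An MDP adds actions (and rewards) on top of this picture.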
Notes
[1] That is to say, when we have full knowledge of the environment (MDP) we know the best strategy. In practice, along the lines of exploration and exploitation, how we use this knowledge can vary and so one might say it is not “done”. Take what you will.