Some Reinforcement Learning: The Greedy and Explore-Exploit Algorithms for the Multi-Armed Bandit Framework in Python
In this article, the multi-armed bandit problem and a few algorithms to solve it are discussed. This problem appeared as a lab assignment in the edX course DAT257x: Reinforcement Learning Explained by Microsoft. The problem description is taken from the assignment itself.
The Problem Statement and Some Theory
Given a set of actions with unknown reward distributions, maximize the cumulative reward by taking the actions sequentially, one action at each timestep, obtaining a reward immediately after each action. This is the classical explore-exploit dilemma in reinforcement learning: in order to find the optimal action, one needs to explore all the actions, but not too much; at the same time, one needs to exploit the best action found so far by exploring.
The following figure defines the problem mathematically and shows the exploration-exploitation dilemma in the general setting of reinforcement learning problems with sequential decisions under incomplete information.
The following figure shows a motivating application of the multi-armed bandit problem in drug discovery. Given a set of experimental drugs (each of which can be considered an arm in the bandit framework) to be applied to a set of patients sequentially, with a reward of 1 if a patient survives after the application of the drug and 0 if the patient dies, the goal is to save as many patients as we can.
The following figures show the naive and a few variants of the greedy algorithms for maximizing the cumulative rewards.
 The naive Round-Robin algorithm simply chooses every action once to complete a round and then repeats the rounds. It is obviously far from optimal, since it explores too much and exploits too little.
 The greedy algorithm always chooses the arm with the maximum average reward observed so far, with the drawback that it may lock on to a suboptimal action forever.
 The epsilon-greedy and optimistic-greedy algorithms are variants of the greedy algorithm that try to overcome this drawback. Epsilon-greedy chooses an action uniformly at random with probability ε, whereas optimistic greedy initializes the estimated reward for each action to a high value R, in order to prevent locking on to a suboptimal action.
The following code, from the GitHub repository of the same course, shows how the basic bandit framework can be defined:
import numpy as np
import sys

### Interface
class Environment(object):
    def reset(self):
        raise NotImplementedError('Inheriting classes must override reset.')

    def actions(self):
        raise NotImplementedError('Inheriting classes must override actions.')

    def step(self):
        raise NotImplementedError('Inheriting classes must override step')

class ActionSpace(object):
    def __init__(self, actions):
        self.actions = actions
        self.n = len(actions)

### BanditEnv Environment
class BanditEnv(Environment):
    def __init__(self, num_actions=10, distribution="bernoulli", evaluation_seed=387):
        super(BanditEnv, self).__init__()

        self.action_space = ActionSpace(range(num_actions))
        self.distribution = distribution

        np.random.seed(evaluation_seed)

        self.reward_parameters = None
        if distribution == "bernoulli":
            self.reward_parameters = np.random.rand(num_actions)
        elif distribution == "normal":
            self.reward_parameters = (np.random.randn(num_actions),
                                      np.random.rand(num_actions))
        elif distribution == "heavy-tail":
            self.reward_parameters = np.random.rand(num_actions)
        else:
            print("Please use a supported reward distribution")
            sys.exit(0)

        if distribution != "normal":
            self.optimal_arm = np.argmax(self.reward_parameters)
        else:
            self.optimal_arm = np.argmax(self.reward_parameters[0])

    def reset(self):
        self.is_reset = True
        return None

    def compute_gap(self, action):
        if self.distribution != "normal":
            gap = np.absolute(self.reward_parameters[self.optimal_arm] -
                              self.reward_parameters[action])
        else:
            gap = np.absolute(self.reward_parameters[0][self.optimal_arm] -
                              self.reward_parameters[0][action])
        return gap

    def step(self, action):
        self.is_reset = False

        valid_action = True
        if action is None or action < 0 or action >= self.action_space.n:
            print("Algorithm chose an invalid action; reset reward to -inf")
            reward = float("-inf")
            gap = float("inf")
            valid_action = False

        if self.distribution == "bernoulli":
            if valid_action:
                reward = np.random.binomial(1, self.reward_parameters[action])
                gap = self.reward_parameters[self.optimal_arm] - \
                      self.reward_parameters[action]
        elif self.distribution == "normal":
            if valid_action:
                reward = self.reward_parameters[0][action] + \
                         self.reward_parameters[1][action] * np.random.randn()
                gap = self.reward_parameters[0][self.optimal_arm] - \
                      self.reward_parameters[0][action]
        elif self.distribution == "heavy-tail":
            if valid_action:
                reward = self.reward_parameters[action] + np.random.standard_cauchy()
                gap = self.reward_parameters[self.optimal_arm] - \
                      self.reward_parameters[action]  # HACK to compute expected gap
        else:
            print("Please use a supported reward distribution")
            sys.exit(0)

        return (None, reward, self.is_reset, '')

#Policy interface
class Policy:
    #num_actions: (int) Number of arms [indexed by 0 ... num_actions-1]
    def __init__(self, num_actions):
        self.num_actions = num_actions

    def act(self):
        pass

    def feedback(self, action, reward):
        pass
Now, in order to implement an algorithm, we just need to extend (inherit from) the Policy base class (interface) and implement the functions act() and feedback() for that algorithm (policy).
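As an illustration, a greedy policy might be sketched as follows. It is written standalone here so it runs on its own; in the course framework it would subclass Policy, and the incremental per-arm bookkeeping is my own choice, not the course's reference solution:

```python
import numpy as np

class GreedyPolicy:
    """Greedy: always pick the arm with the highest estimated mean reward."""
    def __init__(self, num_actions):
        self.num_actions = num_actions
        self.total_rewards = np.zeros(num_actions)  # sum of observed rewards per arm
        self.total_counts = np.zeros(num_actions)   # number of pulls per arm

    def act(self):
        # Estimated mean reward per arm (0 for arms never pulled yet).
        estimates = np.divide(self.total_rewards, self.total_counts,
                              out=np.zeros(self.num_actions),
                              where=self.total_counts > 0)
        return int(np.argmax(estimates))

    def feedback(self, action, reward):
        self.total_rewards[action] += reward
        self.total_counts[action] += 1
```

Note that because the estimate of an unpulled arm is 0, a lucky early reward can make this policy lock on to one arm forever, which is exactly the drawback discussed above.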
In order to theoretically analyze the greedy algorithms and find algorithms with better performance guarantees, let's define the regret as the gap between the total expected reward obtained by the optimal policy (which knows the reward distributions) and the cumulative expected reward of the actions chosen by a given algorithm, as shown in the following figure. Maximizing the cumulative reward is hence equivalent to minimizing the regret.
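For analysis purposes (where the true means are known), the regret of an action sequence can be computed with a quick sketch like this; the helper name is mine:

```python
import numpy as np

def cumulative_regret(true_means, actions):
    """Regret after each step: per-step gap between the best arm's expected
    reward and the expected reward of the arm actually chosen, accumulated."""
    true_means = np.asarray(true_means, dtype=float)
    best = true_means.max()                        # expected reward of the optimal arm
    gaps = best - true_means[np.asarray(actions)]  # per-step expected loss
    return np.cumsum(gaps)
```

Always pulling the best arm yields zero regret at every step, while any fixed suboptimal choice accumulates regret linearly.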
Given that greedy exploits too much and epsilon-greedy explores too much, it can be shown that all these greedy variants have regret linear in the number of timesteps T.
Also, Lai and Robbins proved a theoretical lower bound on the regret that is logarithmic in the number of timesteps T.
 The next figure shows the reward distributions for a 5-armed bandit framework.
 If we take an action A_i, we get a reward of 1 with probability p_i and a reward of 0 with probability 1 - p_i.
 Hence, the reward of each arm A_i is distributed as a Bernoulli random variable with parameter p_i, i.e., R_i ~ B(p_i), i = 0..4.
 We first generate the parameters of the distribution for each arm randomly.
 As we can see, arm 2 has the highest p_i, so choosing arm 2 gives the highest probability of a reward of 1 at any timestep.
 Hence, arm 2 is the most promising action, and an optimal policy should choose it at every timestep.
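This setup can be reproduced in a few lines; the seed below is an arbitrary choice of mine for reproducibility, so the resulting optimal arm will generally differ from the one in the figure:

```python
import numpy as np

np.random.seed(42)               # arbitrary seed, for reproducibility only
p = np.random.rand(5)            # one Bernoulli parameter p_i per arm
optimal_arm = int(np.argmax(p))  # the arm an optimal policy would always pull
reward = np.random.binomial(1, p[optimal_arm])  # one simulated pull: 1 or 0
```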
Now, let's assume that we don't know the p_i values, and let's use the Greedy and the naive Round-Robin algorithms to maximize the cumulative reward over 10000 timesteps.
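A minimal driver loop for such an experiment might look like the sketch below; the helper name run_bandit and the round-robin closure are mine, not the course's code:

```python
import numpy as np

def run_bandit(p, act, feedback, horizon=10000, seed=0):
    """At each timestep: ask the policy for an arm, draw a Bernoulli(p[arm])
    reward, feed it back to the policy, and accumulate the total reward."""
    rng = np.random.RandomState(seed)
    total = 0
    for _ in range(horizon):
        arm = act()
        reward = rng.binomial(1, p[arm])
        feedback(arm, reward)
        total += reward
    return total

# Round-robin as the simplest possible policy: just cycle through the arms.
state = {"t": 0}
def round_robin_act():
    arm = state["t"] % 3
    state["t"] += 1
    return arm

total_reward = run_bandit([0.2, 0.5, 0.8], round_robin_act,
                          lambda a, r: None, horizon=3000)
```

With equal time on each arm, the expected total is the average of the p_i times the horizon, well below what always pulling the best arm would give.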
As can be seen from the following animations and figures:
 The greedy algorithm locks on to arm 0 and never finds the optimal action.
 The round-robin algorithm chooses the actions uniformly and also never finds the optimal action.
 Both greedy and round-robin have regret linear in the number of timesteps.
 In this case, the greedy algorithm, even though stuck on a suboptimal action, still performs relatively better than round-robin.
Greedy
Round-Robin
 The next figure shows the reward distributions for a 10-armed bandit framework.
 If we take an action A_i, we get a reward of 1 with probability p_i and a reward of 0 with probability 1 - p_i.
 Hence, the reward of each arm A_i is distributed as a Bernoulli random variable with parameter p_i, i.e., R_i ~ B(p_i), i = 0..9.
 We first generate the parameters of the distribution for each arm randomly.
 As we can see, arm 6 has the highest p_i, so choosing arm 6 gives the highest probability of a reward of 1 at any timestep.
 Hence, arm 6 is the most promising action, and an optimal policy should choose it at every timestep.
Again, let's assume that we don't know the p_i values, and let's use the Epsilon-Greedy (with different values of the hyperparameter ε) and the Optimistic-Greedy (with different values of the hyperparameter R) algorithms to maximize the cumulative reward over 10000 timesteps.
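The ε-greedy action choice can be sketched as below; `estimates` holds the current mean-reward estimate of each arm, and optimistic-greedy differs only in initializing every estimate to the high value R instead of 0:

```python
import numpy as np

def epsilon_greedy_act(estimates, epsilon, rng=np.random):
    """With probability epsilon explore a uniformly random arm;
    otherwise exploit the arm with the highest current estimate."""
    if rng.rand() < epsilon:
        return int(rng.randint(len(estimates)))  # explore
    return int(np.argmax(estimates))             # exploit
```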
As can be seen from the following animations and figures:
 The epsilon-greedy algorithm behaves exactly like greedy when ε = 0 and completely randomly when ε = 1; in both of these extreme cases the ε-greedy algorithm has linear regret.
 With ε = 0.1 and ε = 0.15 the algorithm eventually finds the optimal arm 6 and has sublinear regret.
 The optimistic-greedy algorithm behaves exactly like greedy when R = 0 and keeps over-exploring for a very long time when R = 10000; in both of these cases the optimistic-greedy algorithm has linear regret.
 With R = 3 and R = 5 the algorithm eventually finds the optimal arm 6 and has sublinear regret (w.r.t. timesteps).
Epsilon-Greedy
ε = 0
ε = 0.05
ε = 0.1
ε = 0.15
ε = 1.0
Optimistic Greedy
R = 0
R = 1
R = 3
R = 5
R = 10000
The next figure shows two algorithms (UCB and Bayesian Thompson-Beta posterior sampling) that achieve logarithmic regret.
 The next figure shows the reward distributions for a 10-armed bandit framework.
 If we take an action A_i, we get a reward of 1 with probability p_i and a reward of 0 with probability 1 - p_i.
 Hence, the reward of each arm A_i is distributed as a Bernoulli random variable with parameter p_i, i.e., R_i ~ B(p_i), i = 0..9.
 We first generate the parameters of the distribution for each arm randomly.
 As we can see, arm 4 has the highest p_i, so choosing arm 4 gives the highest probability of a reward of 1 at any timestep.
 Hence, arm 4 is the most promising action, and an optimal policy should choose it at every timestep.
Again, let's assume that we don't know the p_i values; this time we implement and use the UCB1 and Thompson-Beta algorithms to maximize the cumulative reward over 10000 timesteps.
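The UCB1 arm choice can be sketched as follows (a standalone sketch, not the course's reference code): play every arm once, then pick the arm maximizing the estimated mean plus an optimism bonus sqrt(2 ln t / n_i) that shrinks as arm i is pulled more often.

```python
import numpy as np

def ucb1_act(total_rewards, counts, t):
    """UCB1: untried arms first, then argmax of mean + sqrt(2 ln t / n_i)."""
    counts = np.asarray(counts, dtype=float)
    if np.any(counts == 0):
        return int(np.argmin(counts))            # play each arm at least once
    means = np.asarray(total_rewards, dtype=float) / counts
    bonus = np.sqrt(2.0 * np.log(t) / counts)    # optimism under uncertainty
    return int(np.argmax(means + bonus))
```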
As can be seen from the following animations and figures:
 Both algorithms find the optimal arm 4 pretty quickly, without much exploration.
 Both achieve logarithmic regret when the (unknown) reward distribution is Bernoulli.
 For posterior sampling with Thompson-Beta, each arm's reward estimate is sampled from its Beta posterior, the conjugate prior for the unknown Bernoulli-distributed rewards, starting from the non-informative flat prior Beta(1, 1).
 The posterior is R_i ~ Beta(a_i, b_i), where a_i - 1 is the number of times arm A_i yielded a reward of 1 and b_i - 1 the number of times it yielded a reward of 0, as shown in the figure.
 Thompson-Beta sampling gives linear regret when the (unknown) reward distribution is normal.
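One Thompson-Beta decision can be sketched as below (a minimal sketch under the flat Beta(1, 1) prior; the successes/failures bookkeeping is assumed to be kept by the caller):

```python
import numpy as np

def thompson_beta_act(successes, failures, rng=np.random):
    """After s_i rewards of 1 and f_i rewards of 0, arm i's posterior is
    Beta(s_i + 1, f_i + 1); sample one draw per arm and play the largest."""
    samples = rng.beta(np.asarray(successes) + 1.0,
                       np.asarray(failures) + 1.0)
    return int(np.argmax(samples))
```

The randomness in the posterior draws is what provides the exploration: an arm pulled only a few times has a wide posterior and occasionally produces the largest sample.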
UCB
Thompson Beta
Thompson Normal