Reinforcement Learning -- I want to play another game.

Tianzhu Qin

This report uses the SARSA($\lambda$) method to solve the variant of the problem posed in the previous question.

Environment Introduction

Consider a variant of the reinforcement learning problem that is identical to the one in the previous question, except that the player does not incur a cost by investing effort but can invest high effort only if she has sufficient energy.

The player starts with a given energy level $B$, which is decremented by $c_H$ whenever the player invests high effort in a round. If the energy level would become negative after subtracting $c_H$, it is set to $0$.

In each round, an amount of energy $a$ is added to the player's energy level, independently with probability $p$.

The player can invest high effort in a round only if her energy level is at least $c_H$.
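
To make the energy dynamics explicit, one round can be summarized by the update below (my own formalization of the rules above; the cap at the initial level $B$ matches the implementation in the code further down):

$$B_{t+1} = \min\bigl(\max\bigl(B_t - c_H\,\mathbf{1}[a_t = \text{high}],\, 0\bigr) + a\,Z_t,\ B\bigr), \qquad Z_t \sim \mathrm{Bernoulli}(p),$$

where $\mathbf{1}[a_t = \text{high}]$ indicates that high effort was invested in round $t$.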

The player receives a prize of value $R$ if winning the game and this is the only reward received.

Use the following setting of parameters:

$p_H=0.55$

$p_L=0.45$

$c_H=1$

$R=1000$

$B=10$

$a=2$

$p=0.2$

step size $\eta=0.001$

Discount parameter $\gamma=0.9$

0. Preparing the TwoPlayerGame environment (for P3)

Here I build a TwoPlayerGame class that defines the environment: the state is two-dimensional, with the first component being the energy level (0~10) and the second the game state (-3~3, stored internally as 0~6); the step function returns the (stochastic) next state, the reward, and the done flag for a given action; and the reset function restarts the game. All environment attributes are set in this cell, together with the $\epsilon$-greedy policy function. A short usage sketch follows the code cell below.

In [1]:
import numpy as np

n_states = 7       # game states 0..6, encoding the score difference -3..3 (0 = loss, 6 = win)
n_actions = 2      # 0 = low effort, 1 = high effort

low = 0
high = 1

d = 3              # the game state ranges from -d to d (mapped to 0..6 above)
p_h = 0.55         # probability of winning a round with high effort
p_l = 0.45         # probability of winning a round with low effort

c_h = 1            # energy cost of high effort

RETURN = 1000      # prize R for winning the game

b = 10             # initial (and maximum) energy level B
a = 2              # energy recharge amount
p = 0.2            # recharge probability per round
step = 0.001       # step size eta
gamma = 0.9        # discount factor
lam = 0.9          # trace-decay parameter lambda

episodes = 500000

def win_or_loss(p):
    # Return 1 (win) with probability p, otherwise 0 (loss)
    return np.random.choice([1, 0], p=[p, 1 - p])

def epsilon_greedy_policy(q_values, b, epsilon=0.1):
    # High effort is only available when the current energy level b is at least c_h
    if b < c_h:
        return low
    if np.random.binomial(1, epsilon) == 1:
        return np.random.choice([low, high])
    else:
        # Break ties among greedy actions uniformly at random
        return np.random.choice([action_ for action_, value_ in enumerate(q_values) if value_ == np.max(q_values)])

class TwoPlayerGame:
    def __init__(self, initial_state):
        self.initial_state = initial_state
        self.state = self.initial_state
        self.reward = 0.0
        self.is_terminal = False

    def step(self, action):
        i, j = self.state  # i = energy level, j = game state

        # Energy: pay c_h if high effort is used, recharge by a with probability p,
        # and clip to the range [0, b]
        energy = min(max(i - c_h * action, 0) + a * np.random.binomial(1, p), b)
        # Win or lose the round with probability p_h (high effort) or p_l (low effort)
        if win_or_loss(p_h if action else p_l):
            self.state = (energy, min(j + 1, n_states - 1))
        else:
            self.state = (energy, max(j - 1, 0))

        if self.state[1] == n_states - 1:   # win: collect the prize and terminate
            self.reward = RETURN
            self.is_terminal = True
        elif self.state[1] == 0:            # loss: terminate with zero reward
            self.is_terminal = True

        return self.state, self.reward, self.is_terminal

    def reset(self):
        self.state = self.initial_state
        self.reward = 0.0
        self.is_terminal = False
        return self.state
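
As a quick usage sketch (not an original notebook cell), the environment defined above can be exercised as follows; the variable names `state`, `next_state`, `reward`, and `done` here are illustrative, and the cell above is assumed to have been run:

# Hypothetical interaction with the environment defined in the cell above
game = TwoPlayerGame((b, 3))                 # full energy (10) and a tied score (game state 3)
state = game.reset()                         # -> (10, 3)
next_state, reward, done = game.step(high)   # invest high effort for one round
# next_state[0] is the new energy level (0..10), next_state[1] the game state (0..6);
# reward is RETURN only once the game state reaches 6 (a win), and done flags either terminal state.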
In [2]:
def main():
    qsa = np.zeros((b + 1, n_states, n_actions))   # action values Q(energy, game state, action)
    esa = np.zeros_like(qsa)                       # eligibility traces
    game = TwoPlayerGame((b, 3))                   # start with full energy and a tied score
    episodes = 500000
    for index in range(episodes):
        s = game.reset()
        esa[:] = 0.0                               # reset the eligibility traces for the new episode
        act = epsilon_greedy_policy(qsa[s], s[0])
        done = False
        while not done:
            s_p, r, done = game.step(act)                     # take one step
            a_p = epsilon_greedy_policy(qsa[s_p], s_p[0])     # choose the next action
            theta = r + gamma * qsa[s_p][a_p] - qsa[s][act]   # TD error
            esa[s][act] += 1                                  # accumulating trace
            # Backward view: update every state-action value in proportion to its trace,
            # then decay all traces by gamma * lambda
            for i in range(b + 1):
                for j in range(n_states):
                    for k in range(n_actions):
                        qsa[i, j, k] += step * esa[i, j, k] * theta
                        esa[i, j, k] *= gamma * lam
            s = s_p
            act = a_p
        if index % 100000 == 0 and index != 0:
            print('%s Iteration(s):\n' % index, np.argmax(qsa, axis=2))

    print("\n At last, Optimal Policy is\n", np.argmax(qsa, axis=2),
          "\n The related action value:\n", qsa)

if __name__ == '__main__':
    main()
100000 Iteration(s):
 [[0 0 0 0 0 0 0]
 [0 1 1 1 0 0 0]
 [0 0 1 1 1 1 0]
 [0 1 0 1 1 0 0]
 [0 1 0 1 1 0 0]
 [0 1 1 1 1 0 0]
 [0 1 0 1 0 1 0]
 [0 1 1 0 1 1 0]
 [0 1 0 1 1 0 0]
 [0 0 1 0 1 1 0]
 [0 0 1 1 1 0 0]]
200000 Iteration(s):
 [[0 0 0 0 0 0 0]
 [0 1 1 1 0 0 0]
 [0 0 1 1 1 1 0]
 [0 1 0 1 1 0 0]
 [0 1 0 1 1 0 0]
 [0 1 1 1 1 0 0]
 [0 1 0 1 0 1 0]
 [0 1 1 0 1 1 0]
 [0 1 0 1 1 1 0]
 [0 0 1 0 1 1 0]
 [0 0 1 1 1 0 0]]
300000 Iteration(s):
 [[0 0 0 0 0 0 0]
 [0 1 1 1 0 0 0]
 [0 0 1 1 1 1 0]
 [0 1 0 1 1 0 0]
 [0 1 0 1 1 0 0]
 [0 1 1 1 1 0 0]
 [0 1 0 1 0 1 0]
 [0 1 1 0 1 1 0]
 [0 1 0 1 1 1 0]
 [0 0 1 1 1 1 0]
 [0 1 1 1 1 0 0]]
400000 Iteration(s):
 [[0 0 0 0 0 0 0]
 [0 1 1 1 0 0 0]
 [0 0 1 1 1 1 0]
 [0 1 0 1 1 0 0]
 [0 1 0 1 1 1 0]
 [0 1 1 1 1 0 0]
 [0 1 0 1 0 1 0]
 [0 1 1 0 1 1 0]
 [0 1 0 1 1 1 0]
 [0 1 1 1 1 1 0]
 [0 1 1 1 1 0 0]]

 At last, Optimal Policy is
 [[0 0 0 0 0 0 0]
 [0 1 1 1 0 0 0]
 [0 0 1 1 1 1 0]
 [0 1 1 1 1 0 0]
 [0 1 0 1 1 1 0]
 [0 1 1 1 1 0 0]
 [0 1 0 1 0 1 0]
 [0 1 1 0 1 1 0]
 [0 1 1 1 1 1 0]
 [0 1 1 1 1 1 0]
 [0 1 1 1 1 1 0]] 
 The related action value:
 [[[  0.           0.        ]
  [ 50.03080226   0.        ]
  [118.70933321   0.        ]
  [219.27493925   0.        ]
  [379.94588404   0.        ]
  [641.54429293   0.        ]
  [  0.           0.        ]]

 [[  0.           0.        ]
  [ 26.93599062  63.97283106]
  [ 66.07691495 131.58393122]
  [207.27582614 242.72966172]
  [396.36296505 362.01649403]
  [652.52942375 504.99903956]
  [  0.           0.        ]]

 [[  0.           0.        ]
  [ 63.36158953  24.60673349]
  [121.09578837 143.05658172]
  [212.10888679 252.66807126]
  [311.46617705 423.91300702]
  [515.91402123 735.30896415]
  [  0.           0.        ]]

 [[  0.           0.        ]
  [ 44.25502125  75.10963968]
  [140.87005565 153.93699264]
  [223.19031893 269.95557116]
  [390.08672339 467.73113634]
  [680.92865872 448.53471963]
  [  0.           0.        ]]

 [[  0.           0.        ]
  [ 48.31978757  79.36183704]
  [159.38695508  96.1236278 ]
  [259.8962214  290.86024423]
  [327.7356708  449.03087049]
  [661.2333034  738.57812604]
  [  0.           0.        ]]

 [[  0.           0.        ]
  [ 25.0012495   79.24858477]
  [160.38078645 180.55482978]
  [126.79186082 289.21703598]
  [419.91404211 488.20436897]
  [693.01661339 185.85318596]
  [  0.           0.        ]]

 [[  0.           0.        ]
  [ 70.11574497  90.81383924]
  [166.69325515  80.54890306]
  [267.51790617 324.01794494]
  [453.07035437 334.01422848]
  [666.42284077 744.21339679]
  [  0.           0.        ]]

 [[  0.           0.        ]
  [ 28.53313916  87.08499832]
  [159.61645192 198.77656161]
  [294.60979985 253.44001567]
  [452.42480711 503.36084765]
  [607.57376678 734.57505642]
  [  0.           0.        ]]

 [[  0.           0.        ]
  [ 71.00606696  98.16080997]
  [163.36927366 180.2935887 ]
  [284.88851475 330.54069909]
  [444.32012815 479.86742035]
  [699.69388314 755.5692799 ]
  [  0.           0.        ]]

 [[  0.           0.        ]
  [ 71.78622382  84.85089586]
  [174.77685136 197.69405755]
  [290.62894969 309.24938161]
  [445.2941401  505.82475653]
  [659.40626846 738.00527167]
  [  0.           0.        ]]

 [[  0.           0.        ]
  [ 73.38419083 100.08642433]
  [164.59071586 185.48647861]
  [292.00449747 324.82559358]
  [445.96051166 490.3952147 ]
  [679.37491458 745.60277822]
  [  0.           0.        ]]]

Answer:

Policy obtained in this run:

[[0 0 0 0 0 0 0]
 [0 1 1 1 0 0 0]
 [0 0 1 1 1 1 0]
 [0 1 1 1 1 0 0]
 [0 1 0 1 1 1 0]
 [0 1 1 1 1 0 0]
 [0 1 0 1 0 1 0]
 [0 1 1 0 1 1 0]
 [0 1 1 1 1 1 0]
 [0 1 1 1 1 1 0]
 [0 1 1 1 1 1 0]] 

Policy obtained in another run, for comparison:

 [[0 0 0 0 0 0 0]
 [0 1 1 0 0 1 0]
 [0 0 1 0 1 0 0]
 [0 1 0 1 0 1 0]
 [0 0 1 0 1 1 0]
 [0 1 0 1 0 1 0]
 [0 1 1 0 1 1 0]
 [0 1 0 1 1 1 0]
 [0 1 1 1 1 1 0]
 [0 1 1 1 1 1 0]
 [0 1 1 1 1 1 0]]

  • SARSA($\lambda$) uses the backward view with eligibility traces: actions are chosen from the current state-action values, and after each step the TD error is applied to all recently visited state-action pairs in proportion to their traces. The updates are made online, after every step, rather than offline at the end of an episode. (The update equations are written out after this list.)
  • The policy learned in this notebook is the matrix shown in the previous cell: rows correspond to the energy level from 0 to 10, and columns to the game state from -3 to 3, with -3 and 3 being the terminal (lose and win) states.
  • Since the result is not entirely stable across runs, the policy from another run is shown for comparison, to give a more robust answer.
  • The estimated action values are also shown above: each block corresponds to an energy level from 0 to 10, each row within a block to a game state from -3 to 3, and the two columns to the low (first) and high (second) actions. The values are consistent with the states and actions that are actually reachable (e.g., high effort has value 0 when the energy is 0).
  • In the final policy, when the player has more energy she tends to invest "high" to fight for a better chance of winning. Moreover, since the energy level has an upper limit, hoarding energy does not pay off: if it is not spent, no more can be accumulated. The first row consists entirely of "low" because with zero energy the player cannot invest "high".
  • SARSA tends to produce a relatively safe policy. Here the player is somewhat more inclined to use "high" when it has lost several rounds, and to save energy and use "low" when it is already well ahead.
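
For reference, the backward-view updates implemented in the training loop above correspond to the standard accumulating-trace SARSA($\lambda$) equations, with step size $\eta$, discount $\gamma$, and trace decay $\lambda$ as set in this notebook:

$$\delta_t = r_{t+1} + \gamma\,Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t),$$
$$e(s_t, a_t) \leftarrow e(s_t, a_t) + 1,$$
$$Q(s, a) \leftarrow Q(s, a) + \eta\,\delta_t\,e(s, a) \quad \text{for all } (s, a),$$
$$e(s, a) \leftarrow \gamma\lambda\,e(s, a) \quad \text{for all } (s, a).$$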