Reinforcement Learning Review -- I want to play a game.

Tianzhu Qin

This essay compares Dynamic Programming, Monte Carlo, Q-Learning, and SARSA methods (plus an on-line TD$(\lambda)$ evaluation) for solving the game described below.

PS: SARSA$(\lambda)$ algorithm can be found here.

Environment Introduction

Consider a two-player game that proceeds in successive rounds, where in each round one player wins and the other loses. The winner of the game is the player who first wins $d$ rounds more than the opponent, where $d \geq 1$ is a parameter.

Suppose we model this game from the perspective of one of the players. In each round, this player has two available actions: invest either (1) high or (2) low effort.

If the player invests high effort in a round, she wins the round with probability $p_H$; otherwise, she wins it with probability $p_L$. These are parameters satisfying $0<p_L<p_H<1$.

By investing high effort in a round, the player incurs a cost of $c_H$ in that round; otherwise, a cost of $c_L$, with $0<c_L<c_H$.

The player receives a prize of value $R$ if winning the game.

The essay assumes the following parameter values:

$d=3$

$p_H=0.55$

$p_L=0.45$

$c_H=50$

$c_L=10$

$R=1000$

Termination criteria:

$\|V_{t+1}-V_t\|_{\infty} \leq 10^{-6}$ or $\text{Episodes} = 500{,}000$
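
To make the game dynamics concrete before the tabular environment below, here is a minimal, purely illustrative sketch of simulating one game for a fixed strategy under the parameter values above (the function simulate_game and the example strategy are assumptions introduced here, not part of the assignment):

import numpy as np

def simulate_game(policy, d=3, p_h=0.55, p_l=0.45, c_h=50, c_l=10, R=1000, rng=None):
    """Play one game; `policy` maps the score difference s to 'high' or 'low'."""
    rng = rng or np.random.default_rng()
    s, payoff = 0, 0.0                            # s = rounds won minus rounds lost
    while abs(s) < d:                             # the game ends once someone leads by d
        use_high = policy(s) == 'high'
        payoff -= c_h if use_high else c_l        # pay the effort cost for this round
        win = rng.random() < (p_h if use_high else p_l)
        s += 1 if win else -1
    return payoff + (R if s == d else 0)          # prize only if we win the game

# Example: invest high effort only when ahead
print(simulate_game(lambda s: 'high' if s >= 1 else 'low'))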

Target

Find the best strategy for the game (i.e., the optimal action in each state and the corresponding value).

0. Environment Implementation

I build a grid_world() function that encodes the environment: for each state and action it returns the possible next states (with their probabilities), the reward, and a done flag. All environment attributes are set here, together with the $\epsilon$-greedy policy function.

In [1]:
import numpy as np
from itertools import permutations, combinations_with_replacement # get permutations of policies

n_states = 7
n_actions = 2

low = 0
high = 1

d = 3
p_h = 0.55
p_l = 0.45

c_h = 50
c_l = 10

RETURN = 1000

def grid_world():
    p = {}
    grid = np.arange(n_states) # one-dimensional: indices 0..6 correspond to score differences -3..3
    it = np.nditer(grid)
    is_done = lambda x: x == 0 or x == (n_states - 1) # defined here to avoid repetition
    with it:
        while not it.finished:
            s = it.iterindex
            p[s] = {a: [] for a in range(n_actions)}
            if s == n_states - 1: # terminal state
                p[s][low] = [(1.0, s, 0, True)]
                p[s][high] = [(1.0, s, 0, True)]
            elif s == 0: # terminal state
                p[s][low] = [(1.0, s, 0, True)]
                p[s][high] = [(1.0, s, 0, True)]
            else: # reward is -cost each round, plus RETURN when the transition wins the game
                p[s][low] = [(p_l, s+1, - c_l if s+1 != n_states-1 else  RETURN - c_l, is_done(s+1)), 
                             (1-p_l, s-1, - c_l, is_done(s-1))]
                p[s][high] = [(p_h, s+1, - c_h if s+1 != n_states-1 else  RETURN - c_h, is_done(s+1)),
                              (1-p_h, s-1, - c_h, is_done(s-1))]
            it.iternext()
    return p

def epsilon_greedy_policy(q_values, epsilon=0.1):
    if np.random.binomial(1, epsilon) == 1:
        return np.random.choice([low, high])
    else:  
        return np.random.choice([action_ for action_, value_ in enumerate(q_values) if value_ == np.max(q_values)])
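
As a quick sanity check (this snippet is an illustrative addition, not one of the original cells), the transition entries for the initial state, index 3 (score difference 0), under the high action can be inspected directly:

env = grid_world()
for prob, next_s, reward, done in env[3][high]:
    print(prob, next_s, reward, done)
# expected: 0.55 4 -50 False  and  0.45 2 -50 False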

1. Dynamic Programming

Compute the value function numerically, assuming perfect knowledge of the environment, for each deterministic policy, and rank these policies by their value at the initial state.

In [2]:
def main():
    lam = 1.0  # discount factor gamma (no discounting)
    theta = 1e-6  # convergence threshold
    pi_a = 1.0 # deterministic policy: probability 1 on the chosen action
    
    transition_probs = grid_world()
    
    # Enumerate all deterministic policies (actions at the two terminal states are irrelevant and fixed to 0)
    policies = [tuple([0]) + j + tuple([0]) for i in combinations_with_replacement(range(n_actions), n_states-2) 
                                            for j in set(permutations(i))]
    
    all_state_values = []
    for policy in policies:
        state_values = np.zeros(n_states)  # Initial state values
        iteration_counter = 1
        while True:
            v_old = np.copy(state_values)
            delta = 0.0
            for s in range(n_states):
                v_s = 0.0
                current_entries = transition_probs[s][policy[s]] # different results possible
                for current_entry in current_entries:
                    p_sa = current_entry[0]
                    next_s = current_entry[1]
                    reward = current_entry[2]
                    v_s += pi_a * p_sa * (reward + lam * v_old[next_s])
                state_values[s] = v_s
                delta = np.maximum(delta, np.abs(state_values[s] - v_old[s]))
            iteration_counter += 1
            if delta < theta: # converge
                break
        all_state_values.append(state_values)
    
    policy_value = zip(policies, all_state_values)
    policy_value = sorted(policy_value, key=lambda x: x[1][3], reverse=True)
    print("\n-------All %d policies are checked.------"%len(policy_value),
          "\nThe best policy is", policy_value[0][0],
          "\nAll values are", policy_value[0][1],
          "\nWith initial value as", policy_value[0][1][3],
          "\n\nAll ranked policy are below:")
    for i, j in policy_value:
        print(i, ":\n", " ".join([str(num) for num in j]), sep="")
if __name__ == '__main__':
    main()
-------All 32 policies are checked.------ 
The best policy is (0, 0, 0, 0, 1, 1, 0) 
All values are [  0.          56.5245315  147.83229307 281.65289022 467.43362197
 710.3451295    0.        ] 
With initial value as 281.6528902232666 

All ranked policy are below:
(0, 0, 0, 0, 1, 1, 0):
0.0 56.52453149747993 147.83229306697373 281.6528902232666 467.4336219706282 710.3451295041244 0.0
(0, 0, 0, 0, 0, 1, 0):
0.0 55.316838440154214 145.1485297908077 277.16504293423054 460.7407810424483 707.3333515073754 0.0
(0, 0, 0, 0, 1, 0, 0):
0.0 53.200111600376616 140.44469319181326 269.299181481399 449.01022448061457 686.9556230001597 0.0
(0, 0, 0, 0, 0, 0, 0):
0.0 52.36976954096041 138.59948857828462 266.2135894796013 444.4086034266152 684.4247314077077 0.0
(0, 0, 0, 1, 1, 1, 0):
0.0 51.14247161779807 135.87216007454418 261.6528900476724 455.4734889781987 704.9630696244426 0.0
(0, 0, 0, 1, 0, 1, 0):
0.0 48.85630382238699 130.79178704005218 253.157377503142 444.1837711867955 699.8826966478905 0.0
(0, 1, 0, 0, 1, 1, 0):
0.0 10.052245278954135 109.185901327834 252.57148056822712 450.0427458142541 702.519235289273 0.0
(0, 0, 1, 0, 1, 1, 0):
0.0 36.0117292339314 102.24828802572257 247.3509265846518 446.9208199335029 701.1143686320689 0.0
(0, 0, 0, 1, 1, 0, 0):
0.0 46.657166123929024 125.90481449145888 244.98527319614922 433.32383203685833 678.3281070722202 0.0
(0, 1, 0, 0, 0, 1, 0):
0.0 7.065336061502997 103.75515733509478 244.1538275718055 437.97442643808483 697.0884915103024 0.0
(0, 1, 0, 0, 1, 0, 0):
0.0 4.731181371644908 99.511239696325 237.57575498656846 428.5434977608412 675.6989233071288 0.0
(0, 0, 1, 0, 0, 1, 0):
0.0 32.978055244285166 95.5067903081159 237.57575494898617 433.4378247908745 695.0470207615265 0.0
(0, 0, 0, 1, 0, 0, 0):
0.0 44.66324261188635 121.47387315047871 237.5757547798661 423.477295681519 672.9125121679797 0.0
(0, 1, 0, 1, 1, 1, 0):
0.0 4.540762363945181 99.16502331006944 237.03911963834756 440.7542907750458 698.339430518321 0.0
(0, 0, 1, 1, 1, 1, 0):
0.0 31.252055992472584 91.6712363411648 232.01420167903692 437.7493565125957 696.987210107189 0.0
(0, 0, 1, 0, 1, 0, 0):
0.0 30.909089930516437 90.90908953871177 230.909088734481 424.2424228720451 673.3333321372979 0.0
(0, 1, 0, 0, 0, 0, 0):
0.0 2.1732729332151344 94.86049702960081 230.36710404390755 418.2085144406763 670.0146824686241 0.0
(0, 1, 0, 1, 0, 1, 0):
0.0 -1.8374930945697088e-06 90.90908848902191 224.24242090152774 424.24242182235525 690.9090894056875 0.0
(0, 0, 1, 0, 0, 0, 0):
0.0 28.210728094869793 84.91272987502983 222.21436729622195 412.24970371136345 666.7373365783546 0.0
(0, 1, 1, 0, 1, 1, 0):
0.0 -15.52602532703884 62.679953565829656 217.5757559667605 429.11506922676915 693.1017812937994 0.0
(0, 0, 1, 1, 0, 1, 0):
0.0 26.771158624912985 81.71368671843683 217.5757547986572 419.64472120119547 688.8401241421543 0.0
(0, 1, 0, 1, 1, 0, 0):
0.0 -2.365592945924913 86.6080137484702 217.57575476107496 415.6402718129864 668.602148989559 0.0
(0, 0, 1, 1, 1, 0, 0):
0.0 24.72968787415953 77.17708500368053 210.9976821714435 411.3963541066705 666.267994297284 0.0
(0, 1, 1, 1, 1, 1, 0):
0.0 -19.469084980509795 55.51075550451278 207.7669880929018 423.2493617220553 690.4622124348642 0.0
(0, 1, 0, 1, 0, 0, 0):
0.0 -6.666668188323062 78.7878759090751 205.45454254329016 399.99999684761553 659.9999983337364 0.0
(0, 1, 1, 0, 0, 1, 0):
0.0 -21.481482995818926 51.85185000413744 202.76094010445755 409.4276077243349 684.2424231002765 0.0
(0, 1, 1, 0, 1, 0, 0):
0.0 -23.448277264563238 48.27586045991942 197.86833610758995 402.92580835295706 661.6091941215205 0.0
(0, 0, 1, 1, 0, 0, 0):
0.0 20.371455155665778 67.49212345646562 196.95448802015855 393.78733344454594 656.5830328644928 0.0
(0, 1, 1, 1, 0, 1, 0):
0.0 -27.326267449354006 41.22496798943201 188.2214348037815 399.4003617025346 679.7301628840664 0.0
(0, 1, 1, 0, 0, 0, 0):
0.0 -29.14236666380266 37.92297068186499 183.70370059918693 384.1023726752205 651.2563044325428 0.0
(0, 1, 1, 1, 1, 0, 0):
0.0 -29.21269757570476 37.79509617741022 183.5287450400879 393.6744592365827 656.5209521474827 0.0
(0, 1, 1, 1, 0, 0, 0):
0.0 -37.18223389447976 23.305030263139848 163.70370037910965 369.4844322564953 643.2164372018658 0.0

Result:

  • The best policy is (Low, Low, Low, High, High) for states $s = -2, -1, 0, 1, 2$, where $s$ is the number of rounds won minus the number lost.
  • The value at the initial state is about 281.65289.
  • All policies, ranked by their value at the initial state, are shown above.
  • Dynamic programming, assuming perfect knowledge of the environment, yields the value of every state under each policy by sweeping over the states until the values converge (the backup used is shown below).
  • This policy is plausibly optimal: if we are already well ahead, it is worth paying the high cost to secure the final reward, whereas if we are well behind, paying the extra cost for an unlikely win is not worthwhile.
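
For reference, the backup applied inside the evaluation loop above is the standard iterative policy evaluation update for a deterministic policy $\pi$ (written with discount factor $\gamma$, which the code sets to 1):

$$V_{t+1}(s) = \sum_{s'} p\big(s' \mid s, \pi(s)\big)\,\big[r(s,\pi(s),s') + \gamma\, V_t(s')\big], \qquad \text{stop when } \max_s \big|V_{t+1}(s)-V_t(s)\big| < \theta .$$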

2. Monte Carlo

First, estimate the action-value function using the off-policy Monte Carlo estimation method for a threshold policy $\pi$ such that $\pi(s)=\text{high}$ whenever $s \leq s^*$ and $\pi(s)=\text{low}$ otherwise, with threshold value $s^*=1$; the behaviour policy selects either action equiprobably in each state.

Then, compute the optimal policy using the off-policy Monte Carlo control algorithm with the same behaviour policy that selects either action equiprobably in each state. Show the action values and the resulting policy.

2.1 Action-Value Function Estimate

In [25]:
star = 1 + 3 # threshold s* = 1 shifted by +3 into state-index space

def main():
    qsa = np.zeros([n_states, n_actions]) # value Q(s, a)
    c = np.zeros([n_states, n_actions]) # cumulative sum of importance-sampling weights
    lam = 1
    policy = [1 if i <= star and i >= 1 else 0 for i in range(n_states)]
    
    transition_probs = grid_world()
    
    episodes = 500000

    for index in range(episodes):
        done = False
        s = 0 + 3
        state_action_history = []
        while not done:
            a = np.random.choice([low, high])
            current_entries = transition_probs[s][a]
            probs = [i[0] for i in current_entries]
            current_entry = current_entries[np.random.choice([0, 1], p = probs if len(probs) == 2 else [1, 0])] # generate the episode
            r = current_entry[2]
            state_action_history.append([s, a, r])
            s = current_entry[1] # next state
            done = current_entry[-1] # stop loop

        g = 0
        w = 1
        for s, a, r in reversed(state_action_history):
            g = lam * g + r
            c[s, a] = c[s, a] + w
            qsa[s, a] = qsa[s, a] + w / c[s, a] * (g - qsa[s, a])
            w = w * (a == policy[s]) / 0.5               
            if w == 0: # confirm no nan
                break

        if index % 100000 == 0:
            print('%s Iteration(s):\n' % index, qsa)
    print("\n At last, Q(s, a) for Policy", policy, ":\n", qsa)

if __name__ == '__main__':
    main()
0 Iteration(s):
 [[  0.   0.]
 [-10.   0.]
 [  0.   0.]
 [  0.   0.]
 [  0.   0.]
 [  0.   0.]
 [  0.   0.]]
100000 Iteration(s):
 [[  0.           0.        ]
 [ 63.62658471  50.80687019]
 [155.16155089 161.88094322]
 [358.19650247 319.65976751]
 [498.54752871 571.83728141]
 [799.54608905 754.98546422]
 [  0.           0.        ]]
200000 Iteration(s):
 [[  0.           0.        ]
 [ 61.97054154  44.44790102]
 [155.31714719 171.79756836]
 [357.50220773 351.93067851]
 [506.52089407 574.61865095]
 [777.07148651 740.00713632]
 [  0.           0.        ]]
300000 Iteration(s):
 [[  0.           0.        ]
 [ 77.13613369  15.52097597]
 [126.45581828 124.66859929]
 [334.06344562 287.49042225]
 [497.52418546 567.68052847]
 [780.94825057 765.64858115]
 [  0.           0.        ]]
400000 Iteration(s):
 [[  0.           0.        ]
 [ 70.4790872   13.44726862]
 [123.07303245 126.1116671 ]
 [287.15114279 295.95893296]
 [518.90466827 527.84615294]
 [752.59756172 766.52532962]
 [  0.           0.        ]]

 At last, Q(s, a) for Policy [0, 1, 1, 1, 1, 0, 0] :
 [[  0.           0.        ]
 [ 68.95697429  26.46042638]
 [138.19364128 140.11640233]
 [298.86021729 314.43961316]
 [532.07505217 541.20599982]
 [764.908025   769.28903695]
 [  0.           0.        ]]

Answer:

  • The estimated action values $Q(s,a)$ for this policy are shown above. The first column is action Low and the second is action High; the rows are states -3, -2, ..., 2, 3 from top to bottom (with -3 and 3 as terminal states).
  • This threshold policy runs counter to the intuition discussed earlier: if we have already lost a lot and still pay the higher cost for a slim chance of the final reward, we are likely to lose even more; and if we are already well ahead but decline to play "high", we risk throwing away a good chance at the final reward.
  • This is not the optimal policy, and the estimate itself is suspect: $Q(s_0,\text{high}) \approx 314.44$ exceeds the optimal initial value of about 281.65 obtained by Dynamic Programming, which no policy can beat; the DP evaluation of this same threshold policy (row (0, 1, 1, 1, 1, 0, 0) above) gives an initial value of only about 183.5. Off-policy Monte Carlo with importance sampling has high variance, so even after 500,000 episodes the estimates remain noisy.
  • Algorithmically, off-policy Monte Carlo estimation generates episodes with the behaviour policy and reweights their returns by the importance-sampling ratio of the target policy to the behaviour policy, accumulating these weights into a weighted average (see the update sketched below).
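
For reference, the backward pass over each episode implements the incremental weighted importance sampling update (a sketch matching the code, with behaviour policy $b(a\mid s)=0.5$ and deterministic target policy $\pi$):

$$G \leftarrow \gamma G + r_{t+1}, \qquad C(s_t,a_t) \leftarrow C(s_t,a_t) + W, \qquad Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \frac{W}{C(s_t,a_t)}\big[G - Q(s_t,a_t)\big], \qquad W \leftarrow W\,\frac{\pi(a_t \mid s_t)}{b(a_t \mid s_t)},$$

and the pass stops as soon as $W=0$, i.e. as soon as the behaviour action differs from the target policy's action.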

2.2 Optimal Policy

In [17]:
def main():
    qsa = np.zeros([n_states, n_actions]) # value Q(s, a)
    c = np.zeros([n_states, n_actions]) # cumulative sum of importance-sampling weights
    pi_s = np.argmax(qsa, axis=1)
    
    lam = 1
    
    transition_probs = grid_world()
    
    episodes = 500000

    for index in range(episodes):
        done = False
        s = 0 + 3
        state_action_history = []
        while not done:
            a = np.random.choice([low, high])
            current_entries = transition_probs[s][a]
            probs = [i[0] for i in current_entries]
            current_entry = current_entries[np.random.choice([0, 1], p = probs if len(probs) == 2 else [1, 0])] # generate the episode
            r = current_entry[2]
            state_action_history.append([s, a, r])
            s = current_entry[1] # next state
            done = current_entry[-1] # stop loop

        g = 0
        w = 1
        for s, a, r in reversed(state_action_history):
            g = lam * g + r
            c[s, a] = c[s, a] + w
            qsa[s, a] = qsa[s, a] + w / c[s, a] * (g - qsa[s, a])
            
            pi_s[s] = np.argmax(qsa[s])
            
            if a != pi_s[s]:
                break
            
            w = w * 1 / 0.5

        if index % 100000 == 0 and index != 0:
            print('%s Iteration(s):' % index, pi_s)
            print(qsa)
    print("\n At last, Optimal Policy is", pi_s, 
          "\n The related action value:\n", qsa)

if __name__ == '__main__':
    main()
100000 Iteration(s): [0 0 0 0 1 1 0]
[[  0.           0.        ]
 [ 97.88002755  18.17213296]
 [238.92477734 226.53238197]
 [386.5406472  383.684988  ]
 [479.71255861 571.45430346]
 [787.65427643 820.68056462]
 [  0.           0.        ]]
200000 Iteration(s): [0 0 1 0 1 0 0]
[[  0.           0.        ]
 [ 89.06237     18.01783909]
 [188.89032964 216.15748575]
 [364.48849302 359.90091475]
 [496.53671793 567.38372227]
 [789.67148474 784.26345759]
 [  0.           0.        ]]
300000 Iteration(s): [0 0 0 0 1 0 0]
[[  0.           0.        ]
 [ 73.13562002  28.07193289]
 [194.33345486 193.49223483]
 [351.7676211  343.97302155]
 [509.77270547 560.8844256 ]
 [799.489643   783.7818045 ]
 [  0.           0.        ]]
400000 Iteration(s): [0 0 0 0 1 0 0]
[[  0.           0.        ]
 [ 65.95973057  29.72315063]
 [180.97071021 178.21555867]
 [347.65700364 338.68385433]
 [513.46850315 565.74272691]
 [796.91303722 792.35920236]
 [  0.           0.        ]]

 At last, Optimal Policy is [0 0 1 0 1 1 0] 
 The related action value:
 [[  0.           0.        ]
 [ 59.73298266  31.16959897]
 [168.46045967 170.44949161]
 [339.2422794  337.13903092]
 [510.5304322  559.99723421]
 [782.75116068 783.93259116]
 [  0.           0.        ]]

Answer:

  • The optimal policy found here is (Low, High, Low, High, High) for states -2, -1, 0, 1, 2.
  • The related action values are shown above. The value at the initial state is a bit larger than in the first answer, and the policy picks a questionable "High" at state -1: off-policy Monte Carlo control does not give a stable answer on every run and may converge to a different policy when the number of episodes is insufficient.
  • Unlike the prediction problem, Monte Carlo control keeps updating the target (greedy) policy as the action values change, so it gradually approaches the optimal policy (see the sketch below).
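
The only changes relative to the estimation cell are in the backward pass, sketched here: the target policy is kept greedy with respect to the current action values, the pass stops when the behaviour action disagrees with it, and otherwise the weight grows by the importance-sampling ratio $1/0.5$:

$$\pi(s_t) \leftarrow \arg\max_a Q(s_t,a); \qquad \text{stop if } a_t \neq \pi(s_t); \qquad W \leftarrow W \cdot \frac{1}{b(a_t \mid s_t)} = 2W.$$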

3. Q-Learning

Solve the problem using the Q-learning algorithm with an $\epsilon$-greedy behaviour policy, $\epsilon=0.1$.

In [14]:
def q_learning(qsa, next_qs, r, alpha=0.1, gamma=1.0):  
    return qsa + alpha * (r + gamma * np.max(next_qs) - qsa)

def main():
    qsa = np.zeros([n_states, n_actions]) # value Q(s, a)
    transition_probs = grid_world()
    episodes = 500000
    for index in range(episodes):  
        done = False
        s = 0 + 3   
        while not done:
            a = epsilon_greedy_policy(qsa[s])
            current_entries = transition_probs[s][a] # get sequences
            probs = [i[0] for i in current_entries]
            _, s_p, r, done = current_entries[np.random.choice([0, 1], p = probs if len(probs) == 2 else [1, 0])]
            qsa[s, a] = q_learning(qsa[s, a], qsa[s_p], r)  
            s = s_p  
        if index % 100000 == 0 and index != 0:
            print('%s Iteration(s):' %index, np.argmax(qsa, axis=1))
            print(qsa)

    print("\n At last, Optimal Policy is", np.argmax(qsa, axis=1), 
          "\n The related action value:\n", qsa)        
            
if __name__ == '__main__':
    main()
100000 Iteration(s): [0 0 0 1 1 1 0]
[[  0.           0.        ]
 [ 25.90562049  -1.29539127]
 [124.20143261  69.52900334]
 [173.7655199  304.63390696]
 [340.99582468 461.07598384]
 [659.45370257 686.77184456]
 [  0.           0.        ]]
200000 Iteration(s): [0 0 0 0 0 0 0]
[[  0.           0.        ]
 [ 66.44645005  40.92864432]
 [192.34403794  91.1630978 ]
 [346.9543757  220.75441129]
 [513.10959046 436.64330952]
 [771.36488485 662.66077008]
 [  0.           0.        ]]
300000 Iteration(s): [0 0 0 0 0 0 0]
[[  0.           0.        ]
 [ 42.06571227   5.20739704]
 [107.73107641  50.69751489]
 [280.31182047 207.4776719 ]
 [434.08392132 338.85076334]
 [749.96073769 642.86487091]
 [  0.           0.        ]]
400000 Iteration(s): [0 0 0 1 0 1 0]
[[  0.           0.        ]
 [ 29.53613302   1.86317893]
 [138.8880006   70.29531273]
 [225.3829422  335.28276131]
 [551.80702289 431.07542756]
 [681.93703699 691.32828848]
 [  0.           0.        ]]

 At last, Optimal Policy is [0 0 0 0 1 1 0] 
 The related action value:
 [[  0.           0.        ]
 [ 75.70311613   4.00956524]
 [167.36660907  31.93214925]
 [312.08710101 257.37021551]
 [458.66694735 545.57153842]
 [678.98829142 822.41851561]
 [  0.           0.        ]]

Answer:

  • The optimal policy found here is (Low, Low, Low, High, High) for states -2, -1, 0, 1, 2.
  • The related action values are shown above.
  • This off-policy TD algorithm recovers the optimal policy found by Dynamic Programming, and the corresponding values are reasonably close as well.
  • Temporal-Difference learning sits between MC and DP: like MC it learns from sampled experience without an explicit model, and like DP it bootstraps by backing up the estimated value of the next state to the current one, updating after every step rather than every episode. Q-learning is the off-policy variant: it behaves $\epsilon$-greedily while its update targets the greedy policy, so two policies are involved, as before (see the update rule below).
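
For reference, the per-step update used above is the standard Q-learning backup (the code uses $\alpha=0.1$ and $\gamma=1$):

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big[\,r + \gamma \max_{a'} Q(s',a') - Q(s,a)\,\big].$$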

4. SARSA

Solve the problem using the SARSA algorithm, with an $\epsilon$-greedy policy and $\epsilon=0.1$.

In [15]:
def sarsa(qsa, next_qsa, r, alpha=0.1, gamma=1.0):  
    return qsa + alpha * (r + gamma * next_qsa - qsa)

def main():
    qsa = np.zeros([n_states, n_actions]) # value Q(s, a)
    transition_probs = grid_world()
    episodes = 500000
    for index in range(episodes):  
        done = False
        s = 0 + 3   
        a = epsilon_greedy_policy(qsa[s])  
        while not done:
            current_entries = transition_probs[s][a]
            probs = [i[0] for i in current_entries] # get sequences
            _, s_p, r, done = current_entries[np.random.choice([0, 1], p = probs if len(probs) == 2 else [1, 0])]
            a_p = epsilon_greedy_policy(qsa[s_p])
            qsa[s, a] = sarsa(qsa[s, a], qsa[s_p, a_p], r)  
            s = s_p  
            a = a_p
        if index % 100000 == 0 and index != 0:
            print('%s Iteration(s):' %index, np.argmax(qsa, axis=1))
            print(qsa)

    print("\n At last, Optimal Policy is", np.argmax(qsa, axis=1), 
          "\n The related action value:\n", qsa)

if __name__ == '__main__':
    main()
100000 Iteration(s): [0 0 0 0 1 1 0]
[[  0.           0.        ]
 [ 51.2564968  -14.26960721]
 [158.19185647  50.71615102]
 [274.27878906 146.80690731]
 [263.50313211 507.36615683]
 [547.04159128 708.10249539]
 [  0.           0.        ]]
200000 Iteration(s): [0 0 1 0 0 1 0]
[[  0.           0.        ]
 [ 28.39570517   5.96502907]
 [ 87.87379927  95.45459455]
 [280.3626639  181.86344561]
 [440.32996018 308.52452791]
 [600.24648701 741.58579809]
 [  0.           0.        ]]
300000 Iteration(s): [0 0 1 0 1 1 0]
[[  0.           0.        ]
 [ 27.38565396   1.60327485]
 [ 92.05111067  99.17950015]
 [236.32624859 170.33505922]
 [338.59388135 480.990968  ]
 [621.79886605 745.06690434]
 [  0.           0.        ]]
400000 Iteration(s): [0 0 0 0 1 1 0]
[[  0.           0.        ]
 [ 51.23808912  21.66059325]
 [154.70947691  35.82909832]
 [361.94522177 190.32703634]
 [405.76720288 495.17013786]
 [649.42719813 731.30706564]
 [  0.           0.        ]]

 At last, Optimal Policy is [0 0 0 0 1 0 0] 
 The related action value:
 [[  0.           0.        ]
 [  8.36333541   7.91129473]
 [ 89.54161273  63.67564151]
 [225.45058595 164.33166465]
 [350.01016351 358.3633355 ]
 [620.78537248 580.41708251]
 [  0.           0.        ]]

Answer:

  • The optimal policy found here is (Low, Low, Low, High, Low) for states -2, -1, 0, 1, 2.
  • The related action values are shown above.
  • This on-policy TD control algorithm also comes close to the optimal policy and values, although the final run chooses Low at state 2, which differs from the DP optimum at that state.
  • SARSA, like Q-learning, gives a reasonably good result here, but it is on-policy: the same $\epsilon$-greedy policy both generates the experience and supplies the next action used in the update target (see the update rule below).
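
For reference, the SARSA backup used above differs from the Q-learning one only in that the target uses the next action $a'$ actually selected by the $\epsilon$-greedy policy instead of the greedy maximum:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big[\,r + \gamma\, Q(s',a') - Q(s,a)\,\big], \qquad a' \sim \epsilon\text{-greedy}\big(Q(s',\cdot)\big).$$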

5. On-Line TD$(\lambda)$ Algorithm

Evaluate the equiprobable random policy with the on-line TD$(\lambda)$ algorithm, using $\lambda=0.9$ and step size $\eta=0.0001$.

In [7]:
def take_random_action(): # behaviour: the equiprobable random policy being evaluated
    return np.random.binomial(1, 0.5)

def main():
    gamma = 1.0
    lam = 0.9
    step = 0.0001
    vs = np.zeros(n_states)
    es = np.zeros(n_states)
    transition_probs = grid_world()
    episodes = 500000
    for index in range(episodes):  
        done = False
        s = 0 + 3   # start at score difference 0 (index 3)
        es[:] = 0.0  # reset eligibility traces at the start of each episode
        while not done:
            a = take_random_action()  
            current_entries = transition_probs[s][a]
            probs = [i[0] for i in current_entries] # get next step
            _, s_p, r, done = current_entries[np.random.choice([0, 1], p = probs if len(probs) == 2 else [1, 0])]
            delta = r + gamma * vs[s_p] - vs[s]  # TD error
            es[s] += 1  # accumulating eligibility trace for the current state
            for s_pp in range(n_states):
                vs[s_pp] += step * es[s_pp] * delta
                es[s_pp] *= gamma * lam  # decay all traces by gamma * lambda
            s = s_p
        if index % 100000 == 0 and index != 0:
            print('%s Iteration(s):\n' %index, vs)

    print("\n At last, value of the Policy is:\n", vs)
    
if __name__ == '__main__':
    main()
100000 Iteration(s):
 [  0.          11.57541921  90.74812735 226.69291158 422.45219728
 683.2624719    0.        ]
200000 Iteration(s):
 [  0.          15.85249635  93.39676859 233.08715894 429.90337717
 685.68042149   0.        ]
300000 Iteration(s):
 [  0.          22.11305082  97.63226993 235.87906608 433.31591513
 683.74047124   0.        ]
400000 Iteration(s):
 [  0.          12.102267    87.93678202 224.97729339 423.71275706
 677.0777833    0.        ]

 At last, value of the Policy is:
 [  0.          18.38965233  96.05008208 237.08719965 433.61402664
 689.33273377   0.        ]

Answer:

  • The estimated state values are shown above.
  • These values for the equiprobable random policy look reasonable: they lie in the same range as the deterministic-policy values computed by Dynamic Programming at the top of this notebook.
  • Here $\lambda$ controls how far each TD error is propagated backwards: eligibility traces decay by $\gamma\lambda$ per step, so every recently visited state receives a share of each update, which makes the estimates more robust than one-step TD (see the update below).
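
For reference, the on-line TD$(\lambda)$ update with accumulating eligibility traces, as implemented above with $\gamma=1$, $\lambda=0.9$ and $\eta=0.0001$:

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t), \qquad e(s_t) \leftarrow e(s_t) + 1, \qquad V(s) \leftarrow V(s) + \eta\,\delta_t\, e(s), \quad e(s) \leftarrow \gamma\lambda\, e(s) \ \ \text{for all } s.$$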