An introduction to Reinforcement Learning


Besides traditional data-driven methods in Machine Learning, e.g. Clustering or Maximum Likelihood Estimation, Reinforcement Learning is a family of methods in which little (or even no) pre-collected data is required for the training and testing phases. In this post, I will give a brief introduction to Reinforcement Learning, its basic concepts and methods.




1. What is Reinforcement Learning?

Let's say you want to help Pac-man achieve the best score across different rounds and difficulties of the game of Pac-man. However, you do not know exactly how to play the game optimally, and all you can do is give Pac-man the rules of the game. To learn the best way to play with a maximum score, Pac-man has to play the game over and over and over again. That is the most basic intuition of what Reinforcement Learning looks like.


Mathematically, we can denote Pac-man as an agent and the game as the environment, and define:

  • A strategy to play the Pac-man game as a policy (π ∈ Π).
  • The position of Pac-man at a specific time as a state (s ∈ S).
  • An event in which Pac-man moves forward (or backward, left, or right) as an action (a ∈ A).
  • The score that Pac-man receives after taking an action as the reward (r ∈ R).

Fig 1. An illustration of Reinforcement Learning

The ultimate goal of Reinforcement Learning is to find an optimal policy π* that gives the agent the maximum reward r* across diverse environments.
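In code, this agent-environment interaction is usually written as a simple loop. The following is a minimal sketch assuming the classic gym library and its FrozenLake-v1 environment (the environment id and the 4-tuple step API are assumptions; the random action stands in for a real policy):

import gym

env = gym.make('FrozenLake-v1')   # the environment (assumed id)
state = env.reset()               # initial state s_0
done = False

while not done:
    action = env.action_space.sample()             # placeholder policy: pick a random action a
    state, reward, done, info = env.step(action)   # environment returns next state s' and reward r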

2. Principal (mathematical) concepts of Reinforcement Learning

a. Transition and Reward

Basic reinforcement learning algorithms often require a model for the agent to learn from and make inferences with. A model is a description of how our agent interacts with the environment and receives feedback from it. The two factors that contribute most to accurate feedback on how the agent performs are the transition probability function (P) and the reward function (R).

The transition probability function P, as its name suggests, records the probability of transitioning from state s to s' after taking action a and receiving reward r at time t.

$$P(s', r \mid s, a) = \mathbb{P}[S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a]$$

The equation can be read as "the probability of transitioning from state s to s' and getting a reward r after taking action a equals the conditional probability of observing the next state s' and reward r, given the current state s and the action a".

The state-transition function can then be defined by summing over all possible rewards:

$$P_{ss'}^{a} = P(s' \mid s, a) = \sum_{r \in R} P(s', r \mid s, a)$$

The reward obtained is described by a reward function R, defined as the expected reward after taking an action a in state s:

$$R(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{r \in R} r \sum_{s' \in S} P(s', r \mid s, a)$$
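To make these two functions concrete, here is a small sketch of how the tabular model can be inspected, assuming the gym library and its FrozenLake-v1 environment (the environment id, and the env.P attribute holding (prob, next_state, reward, done) tuples, are assumptions based on the classic gym API used later in this post):

import gym

env = gym.make('FrozenLake-v1')
state, action = 0, 2   # an arbitrary state-action pair

# env.P[state][action] lists (probability, next_state, reward, done) tuples,
# i.e. the tabular model behind P(s', r | s, a)
for prob, next_state, reward, done in env.P[state][action]:
    print(f"P(s'={next_state}, r={reward} | s={state}, a={action}) = {prob}")

# Expected reward R(s, a): sum over all outcomes of probability * reward
expected_reward = sum(prob * reward
                      for prob, _, reward, _ in env.P[state][action])
print('R(s, a) =', expected_reward)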

b. Policy

A policy π(·) is a mapping from a state s to a corresponding action a according to a specific strategy, telling our agent which move is suitable in its current state. A policy can be deterministic or stochastic, as described below (a small code sketch follows the list):

  • Deterministic policy: given a state s, the policy always returns one specific action a.

    $$\pi(s) = a$$

  • Stochastic policy: given a state s, the action is drawn from a probability distribution over all appropriate actions.

    $$\pi(a \mid s) = \mathbb{P}_\pi[A = a \mid S = s]$$
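A minimal numpy sketch of the two kinds of policies (all numbers are made up for illustration):

import numpy as np

n_states, n_actions = 4, 3

# Deterministic policy: each state maps to exactly one action, pi(s) = a
deterministic_policy = np.array([2, 0, 1, 2])
action = deterministic_policy[1]      # always action 0 for state 1

# Stochastic policy: each state maps to a distribution over actions, pi(a | s)
stochastic_policy = np.array([[0.1, 0.6, 0.3],
                              [0.8, 0.1, 0.1],
                              [1/3, 1/3, 1/3],
                              [0.0, 0.5, 0.5]])
# Sample an action for state 1 according to pi(. | s = 1)
sampled_action = np.random.choice(n_actions, p=stochastic_policy[1])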

c. State-Value and Action-Value Function

A value function, generally, is used as a metric to measure how good a state is, i.e., how much reward we can expect to achieve by being in a state or by taking an action in it.

Let G_t denote the total reward (the return) that our agent collects, starting from time t:

$$G_t = R_{t+1} + R_{t+2} + \dots = \sum_{k=0}^{\infty} R_{t+k+1}$$

However, in reality, a reward is more appealing in the present than in the future (imagine which decision you would make if you were allowed to choose between receiving 100 bucks now or 100 bucks in 10 years). It is thus necessary to introduce a discount factor γ ∈ [0, 1] that penalizes rewards received further in the future. The above equation can then be rewritten as follows:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
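As a quick numerical illustration (the reward sequence and γ are made up), the discounted return can be computed directly from a list of rewards:

# Discounted return G_t = sum_k gamma^k * R_{t+k+1}
rewards = [1.0, 0.0, 2.0, 3.0]   # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}
gamma = 0.9

G_t = sum(gamma ** k * r for k, r in enumerate(rewards))
print(G_t)   # 1.0 + 0.9 * 0.0 + 0.81 * 2.0 + 0.729 * 3.0 = 4.807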

Action-Value function

The Action-Value (Q-value) of a state-action pair is the expected return when starting from state s, taking action a at time t, and following policy π thereafter:

$$Q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

State-Value function

The State-Value of a state s is the expected return if the agent is in state s at time t and follows policy π thereafter:

$$V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

For a stochastic policy, the relationship between the Action-Value and State-Value functions is:

$$V_\pi(s) = \sum_{a \in A} \pi(a \mid s)\, Q_\pi(s, a)$$

$$Q_\pi(s, a) = \sum_{s' \in S} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma V_\pi(s')\big]$$
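A small numpy sketch of the first relationship, with made-up probabilities and Q-values for a single state:

import numpy as np

pi_s = np.array([0.2, 0.5, 0.3])   # pi(a | s): distribution over 3 actions
q_s = np.array([1.0, 4.0, -2.0])   # Q_pi(s, a) for the same 3 actions

# V_pi(s) = sum_a pi(a | s) * Q_pi(s, a)
v_s = np.dot(pi_s, q_s)
print(v_s)   # 0.2 * 1.0 + 0.5 * 4.0 + 0.3 * (-2.0) = 1.6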

Optimal value and policy

By maximizing over all policies, we obtain the optimal State-Value and Action-Value functions:

$$V^*(s) = \max_\pi V_\pi(s), \qquad Q^*(s, a) = \max_\pi Q_\pi(s, a)$$

The optimal policy π* is then the policy that attains this maximum State-Value:

$$\pi^* = \arg\max_\pi V_\pi(s)$$
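In the tabular case, once the optimal Q-values are known, the optimal (deterministic) policy can be read off with an argmax over actions; a small sketch with a made-up Q-table:

import numpy as np

# A made-up optimal Q-table for 4 states and 3 actions
q_star = np.array([[1.0, 2.0, 0.5],
                   [0.0, 0.1, 0.3],
                   [5.0, 4.0, 4.5],
                   [2.2, 2.2, 1.0]])

# pi*(s) = argmax_a Q*(s, a) for every state
pi_star = np.argmax(q_star, axis=1)
print(pi_star)   # [1 2 0 0]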

3. A review of preliminary approaches in Reinforcement Learning

In this section, I will walk you through the earliest solutions to Reinforcement Learning problems.

a. Dynamic Programming

Dynamic Programming (DP) was the first idea that came to computer scientists when Reinforcement Learning emerged as a successor of earlier mathematical foundations such as the Bellman Equations [Bellman (1959)] and Markov Decision Processes [Puterman (1994)]. There are a great number of DP-based and DP-inspired algorithms for Reinforcement Learning; the two most prominent are Policy Iteration and Value Iteration.

Both are DP algorithms based on the Bellman Equations that aim to find an optimal policy π* in a Reinforcement Learning environment using a one-step look-ahead strategy.

Policy Iteration                | Value Iteration
--------------------------------|-------------------------------------
Faster                          | Slower
Fewer iterations                | More iterations
Guaranteed to converge          | Guaranteed to converge
Algorithm is more sophisticated | Algorithm is more trouble-free
Starts with a random policy     | Starts with a random value function
Cheaper computational costs     | More expensive computational costs

Table 1. A brief comparison between Policy Iteration and Value Iteration.

Policy Iteration

In Policy Iteration, we start from one fixed (arbitrary) policy. The algorithm then alternates between evaluating the current policy and improving it, iteration after iteration, until convergence.

import numpy as np

def policy_iteration(env, max_iters, gamma):
    # Initialize a random policy (one action per state)
    policy = np.random.randint(0, env.action_space.n,
                               size=env.observation_space.n)

    for i in range(max_iters):
        # Evaluate the present policy
        v_values = policy_evaluation(env,
                                     max_iters=max_iters,
                                     gamma=gamma,
                                     policy=policy)

        # Improve the policy greedily w.r.t. the evaluated values
        new_policy = policy_improvement(env,
                                        gamma=gamma,
                                        v_val=v_values,
                                        policy=policy)

        # Check convergence: the policy no longer changes
        if np.array_equal(policy, new_policy):
            print(f'Converged at {i}-th iteration')
            break
        policy = new_policy.copy()
    return policy
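The helper functions policy_evaluation and policy_improvement are not shown in the original snippet. Below is a minimal sketch of how they could look, assuming a Gym-style environment that exposes env.P[state][action] as a list of (prob, next_state, reward, done) tuples (the same interface used in the value iteration code below):

import numpy as np

def policy_evaluation(env, max_iters, gamma, policy):
    # Iteratively evaluate V_pi for a fixed (deterministic) policy
    v_values = np.zeros(env.observation_space.n)
    for _ in range(max_iters):
        prev_v_values = np.copy(v_values)
        for state in range(env.observation_space.n):
            action = policy[state]
            # One-step look-ahead under the fixed policy
            v_values[state] = sum(prob * (reward + gamma * prev_v_values[next_state])
                                  for prob, next_state, reward, done
                                  in env.P[state][action])
        if np.all(np.isclose(v_values, prev_v_values)):
            break
    return v_values

def policy_improvement(env, gamma, v_val, policy):
    # Act greedily with respect to the evaluated value function
    new_policy = np.copy(policy)
    for state in range(env.observation_space.n):
        q_values = [sum(prob * (reward + gamma * v_val[next_state])
                        for prob, next_state, reward, done in env.P[state][action])
                    for action in range(env.action_space.n)]
        new_policy[state] = np.argmax(q_values)
    return new_policy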

Value Iteration

Value Iteration, on the other hand, begins with an arbitrary value function and iterates over that value function. Value Iteration undergoes a heavier computing process per iteration by taking a maximum over the utility of all possible actions, and it tends to take more iterations before fully converging compared to Policy Iteration.

def value_iteration(env, max_iters, gamma):
    v_values = np.zeros(env.observation_space.n)

    for i in range(max_iters):
        prev_v_values = np.copy(v_values)

        # Compute the value for state
        for state in range(env.observation_space.n):
            q_values = []
            # Compute the q-value for each action
            for action in range(env.action_space.n):
                q_value = 0
                # Loop through each possible outcome
                for prob, next_state, reward, done in env.P[state][action]:
                    q_value += prob * (reward + gamma * prev_v_values[next_state])
                
                q_values.append(q_value)
            
            # Select the best action
            best_action = np.argmax(q_values)
            v_values[state] = q_values[best_action]
        
        # Check convergence
        if np.all(np.isclose(v_values, prev_v_values)):
            print(f'Converged at {i}-th iteration.')
            break
    
    return v_values
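A hypothetical usage sketch (the environment id and hyper-parameters are assumptions) that runs value iteration and then extracts a greedy policy from the converged values via the same one-step look-ahead:

import gym
import numpy as np

env = gym.make('FrozenLake-v1')   # assumed environment id
gamma = 0.9
v_values = value_iteration(env, max_iters=1000, gamma=gamma)

# Extract a greedy policy from the converged state values
policy = np.zeros(env.observation_space.n, dtype=int)
for state in range(env.observation_space.n):
    q_values = [sum(prob * (reward + gamma * v_values[next_state])
                    for prob, next_state, reward, done in env.P[state][action])
                for action in range(env.action_space.n)]
    policy[state] = np.argmax(q_values)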

b. Q-Learning

Q-Learning is a model-free reinforcement learning algorithm that learns the quality of actions, telling the agent which action to execute in a certain state s. It does not require a model of the environment, and it can handle problems with stochastic transitions and rewards without any adaptations. The development of Q-Learning was actually a breakthrough in the early days of reinforcement learning.

The interesting point of Q-Learning is its independence from the current policy when choosing the next action a_{t+1}. It essentially estimates Q* from the best available Q-value, but it does not matter which action actually leads to this maximum Q-value, and in the next step the algorithm does not have to follow that greedy choice.

Pseudo code for Updating Q-Value [Tan Ngoc Pham et al. (2020)]
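In code, the tabular Q-learning update in that pseudo code looks roughly like the sketch below (assuming the classic gym API with a discrete environment; the environment id, the ε-greedy exploration, and the hyper-parameters are illustrative assumptions):

import numpy as np
import gym

env = gym.make('FrozenLake-v1')   # assumed environment id
alpha, gamma, epsilon = 0.1, 0.99, 0.1
q_table = np.zeros((env.observation_space.n, env.action_space.n))

for episode in range(10_000):
    state = env.reset()
    done = False
    while not done:
        # Behaviour policy: epsilon-greedy over the current Q-table
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state])

        next_state, reward, done, info = env.step(action)

        # Off-policy update: move Q(s, a) toward r + gamma * max_a' Q(s', a'),
        # regardless of which action will actually be taken next
        td_target = reward + gamma * np.max(q_table[next_state])
        q_table[state, action] += alpha * (td_target - q_table[state, action])
        state = next_state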

4. Conclusions

  • Throughout this blog post, I have introduced the most preliminary concepts of what Reinforcement Learning looks like, its mathematical fundamentals, and some of the algorithms used to deal with Reinforcement Learning problems.
  • Further constructions in Reinforcement Learning and more advanced algorithms will be covered in later posts on this blog.