# Part 5: Reinforcement Learning and Control

Reinforcement learning (RL) is a type of machine learning that operates in settings where direct supervision isn't available. The learning agent interacts with its environment and receives rewards or penalties, and its goal is to learn a strategy (a policy) that maximizes the cumulative reward it collects over time. The RL problem is mathematically framed using Markov Decision Processes (MDPs), which capture the dynamics of the environment: the possible states, the available actions, and the associated rewards.

An MDP consists of a set of states, a set of actions, a state transition probability function, a reward function, and a discount factor. The process begins in an initial state; taking an action causes a transition to a new state, governed by the state transition probabilities. The reward function gives the immediate reward received for that state-action pair, and future rewards are discounted by the discount factor, which lies between 0 and 1 and balances the trade-off between immediate and future rewards.
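
As a rough illustration (not from the original notes), a small tabular MDP can be stored as a transition array `P[s, a, s']`, a reward array `R[s, a]`, and a scalar discount factor; the sizes and random values below are placeholders only:

```python
import numpy as np

# Placeholder tabular MDP: shapes and values are illustrative only.
n_states, n_actions = 5, 2
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = P(s' | s, a)
R = np.random.randn(n_states, n_actions)                                # R[s, a] = immediate reward
gamma = 0.95                                                            # discount factor in (0, 1)

def discounted_return(rewards, gamma):
    """Sum of rewards discounted by gamma: r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], gamma))  # 1 + 0.95 + 0.9025 = 2.8525
```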

Actions are generated by a policy function, which maps each state to the action that yields the highest expected sum of discounted rewards. To find the optimal policy, a value function is defined (via the Bellman equations) giving the expected sum of discounted rewards obtained by starting in a state and following a given policy. The optimal policy is the one that maximizes this expected payoff for every state, regardless of the initial state.
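
As a minimal sketch (reusing the placeholder `P`, `R`, and `gamma` defined above), iterative policy evaluation applies the Bellman equation for a fixed policy until the values settle:

```python
import numpy as np

def evaluate_policy(P, R, gamma, policy, n_iters=1000):
    """Apply the Bellman equation for a fixed policy:
    V(s) = R(s, pi(s)) + gamma * sum_s' P(s' | s, pi(s)) * V(s')."""
    V = np.zeros(P.shape[0])
    for _ in range(n_iters):
        V = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ V
                      for s in range(P.shape[0])])
    return V
```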

Two common approaches are used to find the optimal policy: value iteration and policy iteration. Value iteration finds the optimal policy indirectly: it repeatedly updates the value function until it converges, and the policy is then read off from the converged value function. Policy iteration, on the other hand, iterates on the policy itself, alternating between computing the value function for the current policy and updating the policy to be greedy with respect to that value function, until both converge to an optimal solution.
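
A minimal value-iteration sketch for the tabular case might look as follows; policy iteration would instead alternate a call to the policy-evaluation routine above with a greedy policy update:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-6):
    """Repeatedly apply the Bellman optimality update until the value function converges,
    then read off the greedy policy. P[s, a, s'] and R[s, a] are as in the sketch above."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V               # Q[s, a] = R[s, a] + gamma * E[V(s') | s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new  # greedy policy and converged values
        V = V_new
```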

If the state transition probabilities or the reward function are unknown, we can estimate the transition probabilities from trials: the probability of moving from state s1 to s2 under a given action is estimated as the number of times the agent made that transition divided by the total number of times it took that action in s1. The same approach can be used to estimate the reward function, by averaging the observed rewards.
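
A sketch of this counting estimate, assuming the experience is logged as `(s, a, r, s')` tuples (an assumed format), with a uniform fallback for state-action pairs that were never visited:

```python
import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """Estimate P(s' | s, a) by counting transitions, and R(s, a) by averaging observed rewards."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = counts.sum(axis=2, keepdims=True)        # number of times (s, a) was tried
    P_hat = np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / n_states)
    R_hat = reward_sum / np.maximum(visits.squeeze(-1), 1)
    return P_hat, R_hat
```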

## Continuous state MDP

In real-world scenarios, Markov Decision Processes (MDPs) often involve an infinite number of states across multiple dimensions. One way to adapt the MDP machinery to a continuous-state problem is to discretize the states into a finite number of intervals. However, this approach assumes that the value function or policy is constant within each discretized interval, which may not hold. Furthermore, the method scales poorly with dimension, since the number of discrete states grows exponentially; it usually becomes impractical beyond roughly four dimensions.
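
A sketch of one simple discretization scheme: bucket each state dimension into equal-width bins and map the bucket indices to a single discrete state. The bounds and bin counts below are assumptions for illustration; note that with 10 bins per dimension, 6 dimensions already yields 10^6 discrete states.

```python
import numpy as np

def discretize(state, lows, highs, bins_per_dim):
    """Map a continuous state vector to a single discrete index by bucketing each dimension."""
    ratios = (np.asarray(state) - lows) / (highs - lows)
    idx = np.clip((ratios * bins_per_dim).astype(int), 0, np.asarray(bins_per_dim) - 1)
    return np.ravel_multi_index(idx, bins_per_dim)

# e.g. a 2-D state in [-1, 1] x [0, 10] with 10 bins per dimension -> one of 100 discrete states
print(discretize([0.3, 4.2], np.array([-1.0, 0.0]), np.array([1.0, 10.0]), (10, 10)))
```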

Alternatively, a continuous-state MDP can be solved using Value Function Approximation (VFA) rather than discretizing the states. This approach requires a model or simulator that generates the next state given the current state and action. The model could be physics-based (for instance, built with finite-element analysis) or learned with machine learning if sufficient data is available. Fitted Value Iteration (FVI) is a type of VFA that approximates the value function by fitting a supervised learning model to synthetic targets generated with the simulator.
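
A rough sketch of fitted value iteration, assuming states are feature vectors and that `simulate(s, a)` (a possibly stochastic simulator) and `reward(s)` are user-provided; both names are placeholders, and the linear regression model stands in for any supervised learner:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fitted_value_iteration(sample_states, actions, simulate, reward, gamma,
                           n_iters=50, n_next_samples=10):
    """For each sampled state, estimate max_a E[R(s) + gamma * V(s')] with the simulator,
    then regress the value function onto the sampled states."""
    X = np.asarray(sample_states)
    model = LinearRegression().fit(X, np.zeros(len(X)))    # start from V ~ 0
    for _ in range(n_iters):
        targets = []
        for s in X:
            q_values = []
            for a in actions:
                next_states = np.array([simulate(s, a) for _ in range(n_next_samples)])
                q_values.append(reward(s) + gamma * model.predict(next_states).mean())
            targets.append(max(q_values))
        model.fit(X, np.array(targets))                    # supervised fit of the value function
    return model
```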

Finite-horizon MDPs are a different class of problem in which decisions are made within a fixed time frame, though the state space can still be infinite. In this setting the optimal policy can change over time (it is non-stationary), and rewards may depend on both the state and the action taken at each step. A standard strategy for solving finite-horizon problems is backward induction: compute the optimal value function at the final time step and use it as a boundary condition to compute the optimal value functions for all earlier time steps.
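
A sketch of this backward induction for the tabular case, simplifying by assuming a time-invariant reward `R[s, a]` and an optional discount (in general the reward and dynamics may depend on the time step):

```python
import numpy as np

def finite_horizon_values(P, R, T, gamma=1.0):
    """Backward induction for a T-step MDP: start from the final time step and work backwards.
    Returns one (possibly different) greedy action map per time step, i.e. a non-stationary policy."""
    V = R.max(axis=1)                       # boundary condition: value at the final step
    policies = [R.argmax(axis=1)]
    for _ in range(T - 1):
        Q = R + gamma * P @ V
        V = Q.max(axis=1)
        policies.append(Q.argmax(axis=1))
    return list(reversed(policies)), V      # policies[t] is the action map for time step t
```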

In the context of finite-horizon MDPs, linear quadratic regulation (LQR) is a technique that yields an optimal policy that is a linear function of the state at each time step. The process begins by estimating the parameters of the (linear) state transition model from trial runs, then solves for the policy within the LQR framework by working backwards from the final time step to determine the optimal actions. The quadratic cost function penalizes deviations from the target state, steering the system's trajectory towards the desired outcome. Note that the LQR framework can also be applied to systems with nonlinear dynamics, by linearizing the dynamics around an operating point with a Taylor expansion.
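
A sketch of the backward (Riccati) recursion for discrete-time, finite-horizon LQR, assuming known dynamics x_{t+1} = A x_t + B u_t and a quadratic cost sum_t (x_t^T Q x_t + u_t^T R u_t), with the terminal cost taken to be Q for simplicity:

```python
import numpy as np

def lqr_backward(A, B, Q, R, T):
    """Backward Riccati recursion. Returns time-indexed gains K_t so that the optimal
    action at step t is u_t = K_t @ x_t (a linear function of the state)."""
    P = Q.copy()                              # boundary condition at the final step
    gains = []
    for _ in range(T):
        S = R + B.T @ P @ B
        K = -np.linalg.solve(S, B.T @ P @ A)  # K_t = -(R + B'PB)^{-1} B'PA
        P = Q + A.T @ P @ A + A.T @ P @ B @ K
        gains.append(K)
    return list(reversed(gains))              # gains[t] is the gain for time step t
```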

In real-world scenarios we often lack full knowledge of the state variables, and the Partially Observable Markov Decision Process (POMDP) is the framework for handling such situations. In a POMDP, a belief state is formed from the observed data. This belief state, which is a distribution over possible states, is used as the input to the policy function that generates actions. A variant of the Linear Quadratic Regulation (LQR) called the Linear Quadratic Gaussian (LQG) controller is used for this purpose. LQG follows a three-step process:

1. Compute the belief state from the observations using the Kalman filter algorithm.
2. Use the mean of the belief state as the best estimate of the state.
3. Choose the action from this mean state using the regular LQR algorithm.

This strategy lets us continuously update the belief state as new observations arrive, so the policy can be optimized even when full state knowledge is unattainable.
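
A sketch of a single Kalman-filter step (the standard predict/update form, with assumed Gaussian noise covariances `W` and `V`); in LQG, the action would then be the regular LQR gain applied to the belief mean:

```python
import numpy as np

def kalman_step(mu, Sigma, u, y, A, B, C, W, V):
    """One Kalman-filter step for x_{t+1} = A x_t + B u_t + w, y_t = C x_t + v,
    with w ~ N(0, W) and v ~ N(0, V). Returns the updated belief mean and covariance."""
    # Predict with the dynamics model
    mu_pred = A @ mu + B @ u
    Sigma_pred = A @ Sigma @ A.T + W
    # Correct with the new observation y
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + V)
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_pred
    return mu_new, Sigma_new

# LQG: the action is then u_t = K_lqr @ mu_t, with K_lqr from the LQR sketch above.
```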

## Policy Gradient (REINFORCE)

So far we've discussed model-based reinforcement learning (RL), where the agent already knows the dynamics of its environment, including the state transition probabilities and the rewards tied to each state (i.e., the reward function). Model-free RL (such as REINFORCE) operates differently: the agent starts without any prior knowledge of the environment and must learn from its own actions, interactions, and the feedback it receives.

By sampling transitions and receiving rewards at each step, the agent gradually improves its understanding of the environment. Unlike certain other RL methods, the agent does not need a value function to find a good policy; it can improve the policy directly through its ongoing interactions with the environment.

The REINFORCE algorithm employs gradient ascent to find the optimal policy by maximizing the expected total reward. Since the agent does not know the state transition probabilities or the reward function, the gradient is taken with respect to the policy parameters, the variables that directly affect the expected total reward. The trick for estimating this gradient of an expectation is reminiscent of the one used in Variational Auto-Encoders (VAEs): multiple trajectories (sequences of states and actions) are sampled from the current policy, and the policy is then adjusted so that trajectories yielding higher total rewards become more probable.
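
A minimal REINFORCE sketch, assuming a parameter vector `theta` and a user-provided `grad_log_policy(theta, s, a)` that returns the gradient of log pi_theta(a | s); both names are placeholders, and no baseline or reward-to-go variance reduction is used:

```python
import numpy as np

def reinforce_update(theta, trajectories, grad_log_policy, learning_rate=0.01, gamma=0.99):
    """One gradient-ascent step. Each trajectory is a list of (state, action, reward) tuples
    sampled from the current policy; trajectories with higher returns get a larger push."""
    grad = np.zeros_like(theta)
    for traj in trajectories:
        rewards = [r for (_, _, r) in traj]
        total_return = sum(gamma**t * r for t, r in enumerate(rewards))
        for s, a, _ in traj:
            grad += grad_log_policy(theta, s, a) * total_return
    return theta + learning_rate * grad / len(trajectories)
```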