程式扎記: [ RL & Keras ] Ch1. Reinforcement Learning Basics

Preface
This chapter is a brief introduction to Reinforcement Learning (RL) and includes some key concepts associated with it.
In this chapter, we talk about Reinforcement Learning as a core concept and then define it further. We show a complete flow of how Reinforcement Learning works. We discuss exactly where Reinforcement Learning fits into artificial intelligence (AI). After that we define key terms related to Reinforcement Learning. We start with agents and then touch on environments and then finally talk about the connection between agents and environments.

What Is Reinforcement Learning?
We use Machine Learning to constantly improve the performance of machines or programs over time. One straightforward way to implement a process that improves performance over time is Reinforcement Learning (RL). Reinforcement Learning is an approach through which intelligent programs, known as agents, work in a known or unknown environment and constantly adapt and learn based on feedback given as points. The feedback might be positive, also known as rewards, or negative, also called punishments. Based on the interaction between the agent and the environment, we then determine which action to take.

In a nutshell, Reinforcement Learning is based on rewards and punishments. Some important points about Reinforcement Learning:

* It differs from normal Machine Learning, as we do not look at training datasets.
* Interaction happens not with data but with environments, through which we depict real-world scenarios.
* As Reinforcement Learning is based on environments, many parameters come into play. It takes lots of information to learn and act accordingly.
* Environments in Reinforcement Learning represent real-world scenarios and might be 2D or 3D simulated worlds or game-based scenarios.
* Reinforcement Learning is broader in scope because the environments can be large in scale and there might be a lot of factors associated with them.
* The objective of Reinforcement Learning is to reach a goal.
* Rewards in Reinforcement Learning are obtained from the environment.

The Reinforcement Learning cycle is depicted in Figure 1-1 with the help of a robot.

A maze is a good example that can be studied using Reinforcement Learning in order to determine the exact moves needed to complete it. In the figure, we apply Reinforcement Learning inside what we call the Reinforcement Learning box, because the RL process works within its boundary. RL starts with intelligent programs, known as agents; when they interact with environments, rewards and punishments are associated with the interaction. An environment can be either known or unknown to the agents. The agents take actions to move to the next state in order to maximize rewards.

In the maze, the central idea is to keep moving. The goal is to clear the maze and reach the end as quickly as possible. The following concepts of Reinforcement Learning and the working scenario are discussed later in this chapter:

* The agent is the intelligent program
* The environment is the maze
* The state is the place in the maze where the agent is
* The action is the move we take to move to the next state
* The reward is the points associated with reaching a particular state. It can be positive, negative, or zero
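
The five terms above can be made concrete with a small sketch. The grid layout, the wall positions, the reward values (-1 per move, +10 at the goal), and the class name `Maze` are all assumptions chosen for illustration; none of them come from the chapter.

```python
# Illustrative maze: layout, rewards, and names are assumptions for this sketch.
GRID = [
    "S..#",
    ".#..",
    "...#",
    "#..G",
]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


class Maze:
    """Environment: holds the grid and answers 'what happens if I move?'."""

    def __init__(self, grid):
        self.grid = grid
        self.start = self._find("S")
        self.goal = self._find("G")

    def _find(self, symbol):
        for r, row in enumerate(self.grid):
            if symbol in row:
                return (r, row.index(symbol))
        raise ValueError(f"{symbol!r} not in grid")

    def step(self, state, action):
        """Return (next_state, reward) for taking `action` in `state`."""
        dr, dc = ACTIONS[action]
        r, c = state[0] + dr, state[1] + dc
        inside = 0 <= r < len(self.grid) and 0 <= c < len(self.grid[0])
        if not inside or self.grid[r][c] == "#":
            return state, -1          # blocked by a wall or edge: stay put, still pay
        if (r, c) == self.goal:
            return (r, c), 10         # positive reward for reaching the goal
        return (r, c), -1             # every ordinary move costs one point


maze = Maze(GRID)
state = maze.start                                # state: where the agent is
next_state, reward = maze.step(state, "right")    # action: one move
print(state, "->", next_state, "reward", reward)
```

Here the agent is whatever code decides which action string to pass to `step`; the environment only reports the resulting state and reward.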

We use the maze example to apply concepts of Reinforcement Learning. We will be describing the following steps:

1. The concept of the maze is given to the agent.
2. There is a task associated with the agent and Reinforcement Learning is applied to it.
3. The agent receives (a-1) reinforcement for every move it makes from one state to another.
4. There is a reward system in place for the agent when it moves from one state to another.

Reward predictions are made iteratively: we update the value of each state in the maze based on the value of the best subsequent state and the immediate reward obtained. This is called the update rule. The constant movement of the Reinforcement Learning process is based on decision-making.
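
As a rough illustration of such an update rule, the sketch below repeatedly sets each state's value to the best "immediate reward plus discounted value of the next state" it can reach. The 5-cell corridor standing in for the maze, the reward numbers, and the discount factor of 0.9 are assumptions made for this example only.

```python
# Sketch of the update rule on a 5-cell corridor (an assumed toy "maze").
N_STATES = 5          # cells 0..4; cell 4 is the goal
GOAL = N_STATES - 1
GAMMA = 0.9           # assumed discount factor
STEP_REWARD = -1.0    # assumed cost per move
GOAL_REWARD = 10.0    # assumed reward for entering the goal


def step(state, action):
    """Deterministic move: action is -1 (left) or +1 (right)."""
    next_state = min(max(state + action, 0), GOAL)
    reward = GOAL_REWARD if next_state == GOAL else STEP_REWARD
    return next_state, reward


values = [0.0] * N_STATES
for sweep in range(100):
    new_values = values[:]
    for s in range(GOAL):                 # the goal is terminal; its value stays 0
        # value = best over actions of (immediate reward + discounted next value)
        new_values[s] = max(
            r + GAMMA * values[s2]
            for s2, r in (step(s, a) for a in (-1, +1))
        )
    if new_values == values:              # nothing changed: predictions have settled
        break
    values = new_values

print([round(v, 2) for v in values])
```

The resulting values grow as the state gets closer to the goal, which is what lets an agent pick its next move by looking at neighbouring values.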

Reinforcement Learning works on a trial-and-error basis because it is very difficult to predict which action to take in a given state. From the maze problem itself, you can see that in order to get the optimal path for the next move, you have to weigh a lot of factors. The decision is always made on the basis of state, action, and reward. For the maze, we have to compute and account for the probability of taking each step. The maze also does not consider the reward of the previous step; it considers only the move to the next state. The concept is the same for all Reinforcement Learning processes.

Here are the steps of this process:

1. We have a problem.
2. We have to apply Reinforcement Learning.
3. We think of the application of Reinforcement Learning as a Reinforcement Learning box.
4. The Reinforcement Learning box contains all essential components needed for applying the Reinforcement Learning process.
5. The Reinforcement Learning box contains agents, environments, rewards, punishments, and actions.

Reinforcement Learning works well with intelligent program agents that receive rewards and punishments when interacting with an environment. This interaction is very important because through these exchanges, the agent adapts to the environment. When a Machine Learning program, robot, or Reinforcement Learning program starts working, the agents are exposed to known or unknown environments and the Reinforcement Learning technique allows the agents to interact and adapt according to the environment’s features.

Accordingly, the agents work and the Reinforcement Learning robot learns. In order to get to a desired position, we assign rewards and punishments.
Now, the program has to work out the optimal path to get the maximum reward; if it fails, it takes punishments (that is, it receives negative points). In order to reach a new position, which is also known as a state, it must perform what we call an action. To decide which action to perform, we implement a function, also known as a policy. A policy is therefore a function that maps the current state to the action to take.
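
A policy can be sketched as an ordinary function from state to action. The table `Q` below, with its state names and numbers, is a made-up example of estimated action values; the greedy rule shown is only one of many possible policies.

```python
# Assumed action-value table: Q[state][action] = estimated reward of that move.
Q = {
    "start":  {"left": -1.0, "right": 2.0},
    "middle": {"left": 0.5, "right": 5.0},
    "end":    {"left": 0.0, "right": 0.0},
}


def greedy_policy(state):
    """Policy: map a state to the action with the highest estimated value."""
    actions = Q[state]
    return max(actions, key=actions.get)   # ties go to the first action listed


for s in Q:
    print(s, "->", greedy_policy(s))
```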

Faces of Reinforcement Learning
As you see from the Venn diagram in Figure 1-5, Reinforcement Learning sits at the intersection of many different fields of science.

Figure 1-5. All the faces of Reinforcement Learning

The intersection points reveal a very strong feature of Reinforcement Learning: it is the science of decision-making. If we have two paths and have to decide which path to take so that some goal is met, a scientific decision-making process can be designed. Reinforcement Learning is the fundamental science of optimal decision-making.

If we focus on the computer science part of the Venn diagram in Figure 1-5, we see that learning falls under the category of Machine Learning, of which Reinforcement Learning is a specific part. Reinforcement Learning can be applied to many different fields of science. In engineering, the focus is mostly on the optimal control of devices. In neuroscience, we are concerned with how the brain makes decisions and study the reward system that operates in the brain (the dopamine system).

Psychologists can apply Reinforcement Learning to determine how animals make decisions. In mathematics, Reinforcement Learning is widely applied in operations research.

The Flow of Reinforcement Learning
Figure 1-6 connects agents and environments.

Figure 1-6. RL structure

The interaction happens from one state to another. It starts with the connection between an agent and the environment. Rewards are given on a regular basis. We take appropriate actions to move from one state to another. The key points to consider are the following:

* The Reinforcement Learning cycle works in an interconnected manner.
* There is distinct communication between the agent and the environment.
* The distinct communication happens with rewards in mind.
* The object or robot moves from one state to another.
* An action is taken to move from one state to another.

Figure 1-7 simplifies the interaction process.

Figure 1-7. The entire interaction process

An agent is always learning and finally makes a decision. An agent is a learner, which means there might be different paths. When the agent starts training, it starts to adapt and intelligently learns from its surroundings. The agent is also a decision maker because it tries to take an action that will get it the maximum reward. When the agent starts interacting with the environment, it can choose an action and respond accordingly. From then on, new scenes are created. When the agent changes from one place to another in an environment, every change results in some kind of modification. These changes are depicted as scenes. The transition that happens in each step helps the agent solve the Reinforcement Learning problem more effectively.
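
The interaction process can be sketched as a loop in which the agent picks an action and the environment answers with a new state and a reward. The `Corridor` environment, its `reset`/`step` methods, the reward values, and the purely random agent are assumptions for this example; a learning agent would use the feedback to improve its choices.

```python
import random


class Corridor:
    """Toy environment (assumed for this sketch): cells 0..3, goal at cell 3."""

    GOAL = 3

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """Apply action (-1 or +1); return (new_state, reward, done)."""
        self.state = min(max(self.state + action, 0), self.GOAL)
        done = self.state == self.GOAL
        reward = 10 if done else -1
        return self.state, reward, done


rng = random.Random(0)          # fixed seed so the run is repeatable
env = Corridor()
state = env.reset()
total_reward, steps, done = 0, 0, False

while not done and steps < 1000:
    action = rng.choice([-1, +1])            # agent decides (here: at random)
    state, reward, done = env.step(action)   # environment responds
    total_reward += reward                   # feedback the agent could learn from
    steps += 1

print(f"finished={done} steps={steps} total_reward={total_reward}")
```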

Let’s look at another scenario of state transitioning:

Learn to choose actions that maximize the following:

$r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$, where $0 < \gamma < 1$
At each state transition, the reward is a different value, hence we describe the reward with varying values in each step, such as r0, r1, r2, etc. Gamma (γ) is called the discount factor, and it determines how much weight future rewards get relative to the immediate one:

* A gamma value of 0 means only the reward associated with the current state counts.
* A gamma value of 1 means future rewards count as much as the current one, so the reward is long-term.
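
To see the effect of gamma numerically, the sketch below evaluates the discounted sum for a few values. The reward sequence and the helper name `discounted_return` are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of r_t * gamma**t over the reward sequence r_0, r_1, ..."""
    return sum(r * gamma**t for t, r in enumerate(rewards))


rewards = [1.0, 1.0, 1.0, 1.0]   # assumed example rewards r0..r3
for gamma in (0.0, 0.5, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With gamma at 0 only the first reward survives; with gamma at 1 all four rewards are simply added up.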

Different Terms in Reinforcement Learning
Now we cover some common terms associated with Reinforcement Learning. There are two constants that are important in this case: gamma (γ) and lambda (λ). Gamma is common in Reinforcement Learning problems, but lambda is generally used in temporal difference problems.
* Gamma

Gamma is used in each state transition and is a constant value at each state change. Gamma tells you how future rewards are weighted at every state. Generally, its value determines whether we are looking at the reward of each state only (in which case it’s 0) or at long-term reward values (in which case it’s 1).

* Lambda

Lambda is generally used when we are dealing with temporal difference problems. It is more involved with predictions in successive states.
A larger value of lambda spreads each prediction update over more of the states visited earlier, which can let the algorithm learn faster; how much this helps depends on the problem.

As you’ll learn later, temporal differences can be generalized to what we call TD(Lambda). We discuss it in greater depth later.
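
As a preview, here is a minimal sketch of one common formulation, tabular TD(lambda) with accumulating eligibility traces. It is not necessarily the variant covered later. The 4-state chain, the reward of 1 on the final step, and the constants `GAMMA`, `LAMBDA`, and `ALPHA` are assumptions for this example.

```python
# Tabular TD(lambda) with accumulating eligibility traces on an assumed
# 4-state chain: 0 -> 1 -> 2 -> 3 -> terminal, reward 1 on the last step.
N = 4
GAMMA, LAMBDA, ALPHA = 1.0, 0.8, 0.1    # assumed constants

values = [0.0] * (N + 1)                # index N is the terminal state (value 0)

for episode in range(500):
    traces = [0.0] * (N + 1)            # eligibility of each state, reset per episode
    for s in range(N):
        s_next = s + 1
        reward = 1.0 if s_next == N else 0.0
        # prediction error between successive states
        td_error = reward + GAMMA * values[s_next] - values[s]
        traces[s] += 1.0                # mark the state just visited
        for i in range(N):
            values[i] += ALPHA * td_error * traces[i]   # credit earlier states too
            traces[i] *= GAMMA * LAMBDA                 # older visits fade

print([round(v, 3) for v in values])
```

Setting `LAMBDA` to 0 reduces this to updating only the state just visited; larger values pass each error further back along the chain.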

Interactions with Reinforcement Learning
Let’s now talk about Reinforcement Learning and its interactions. As shown in Figure 1-11, the interactions between the agent and the environment occur with a reward. We need to take an action to move from one state to another.

Figure 1-11. Reinforcement Learning interactions

Reinforcement Learning is a way of learning how to map situations to actions so as to maximize the reward. The machine or robot is not told which actions to take, as it is in other forms of Machine Learning; instead, it must discover which actions yield the maximum reward by trying them. In the most interesting and challenging cases, actions affect not only the immediate reward but also the next situation and all subsequent rewards.
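
Discovering good actions by trying them can be illustrated with a small sketch. It uses an epsilon-greedy rule, a common technique that this chapter does not name, on a single-state problem with three actions. The payoffs in `TRUE_REWARDS` and the exploration rate `EPSILON` are assumptions for this example.

```python
import random

TRUE_REWARDS = [0.2, 0.5, 0.8]   # assumed payoff of each action, hidden from the agent
EPSILON = 0.1                    # assumed fraction of moves spent exploring

rng = random.Random(0)
estimates = [0.0] * len(TRUE_REWARDS)   # the agent's current guess per action
counts = [0] * len(TRUE_REWARDS)

for step in range(1000):
    if rng.random() < EPSILON:
        action = rng.randrange(len(TRUE_REWARDS))      # explore: try any action
    else:
        # exploit: pick the action that currently looks best
        action = max(range(len(estimates)), key=estimates.__getitem__)
    reward = TRUE_REWARDS[action]                       # environment's feedback
    counts[action] += 1
    # running average of the rewards seen for this action
    estimates[action] += (reward - estimates[action]) / counts[action]

best = max(range(len(estimates)), key=estimates.__getitem__)
print("estimated best action:", best)
```

The agent is never told which action pays most; it only finds out because a small share of its moves are spent trying alternatives.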

RL Characteristics
We talk about characteristics next. The characteristics are generally what the agent does to move to the next state. The agent considers which approach works best to make the next move. The two characteristics are:

* Trial and error search.
* Delayed reward.

As you probably have gathered, Reinforcement Learning works on three things combined:

(S,A,R)

Where S represents state, A represents action, and R represents reward.

If you are in a state S, you perform an action A so that you get a reward R at time frame t+1. Now, the most important part is when you move to the next state. In this case, we do not use the reward we just earned to decide where to move next. Each transition has a unique reward and no reward from any previous state is used to determine the next move. See Figure 1-12.

Figure 1-12. State change with time

The change in t (the time frame) is important in terms of Reinforcement Learning. Every occurrence of what we do is always a combination of what we perform in terms of states, actions, and rewards.
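
One way to picture the (S, A, R) combination over time is to record each step as a small tuple. The `Transition` type, the state names, and the reward values below are assumptions for this example.

```python
from typing import NamedTuple


class Transition(NamedTuple):
    t: int            # time step
    state: str        # S at time t
    action: str       # A at time t
    reward: float     # R at time t+1, received after acting
    next_state: str   # S at time t+1


# Assumed three-step episode, for illustration only.
episode = [
    Transition(0, "s0", "right", -1.0, "s1"),
    Transition(1, "s1", "right", -1.0, "s2"),
    Transition(2, "s2", "up", 10.0, "goal"),
]

for tr in episode:
    print(f"t={tr.t}: S={tr.state}, A={tr.action} -> "
          f"R(t+1)={tr.reward}, S(t+1)={tr.next_state}")

# Each step starts where the previous one ended.
assert all(a.next_state == b.state for a, b in zip(episode, episode[1:]))
```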

How Reward Works
A reward is some motivator we receive when we transition from one state to another. It can be points, as in a video game. The more we train, the more accurate we become, and the greater our reward.

Agents
In terms of Reinforcement Learning, agents are the software programs that make intelligent decisions. Agents should be able to perceive what is happening in the environment. Here are the basic steps of the agents:

1. When the agent can perceive the environment, it can make better decisions.
2. The decision the agents take results in an action.
3. The action that the agents perform must be the best (optimal) one.

Software agents might be autonomous or they might work together with other agents or with people. Figure 1-14 shows how the agent works.

Figure 1-14. The flow of the environment

RL Environments
The environments in the Reinforcement Learning space consist of certain factors that determine their impact on the Reinforcement Learning agent. The agent must adapt accordingly to the environment. These environments can be 2D worlds or grids or even a 3D world. Here are some important features of environments:

* Deterministic
* Observable
* Discrete or continuous
* Single or multiagent.

Deterministic
If we can infer and predict what will happen in a certain scenario in the future, we say the scenario is deterministic. RL problems are easier when they are deterministic because the outcome of an action is not left to chance: taking a given action in a given state always leads to the same next state, so the effect of each state transition is immediate and predictable.
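
The contrast with a non-deterministic scenario can be sketched with two transition functions. The function names and the 20% "slip" probability are assumptions for this example.

```python
import random


def deterministic_step(state, action):
    """Same (state, action) always gives the same next state."""
    return state + action


def stochastic_step(state, action, rng, slip_prob=0.2):
    """With probability slip_prob the move 'slips' and the agent stays put."""
    if rng.random() < slip_prob:
        return state
    return state + action


rng = random.Random(42)
# Collect the distinct outcomes of repeating the same move 100 times.
print(sorted({deterministic_step(0, 1) for _ in range(100)}))
print(sorted({stochastic_step(0, 1, rng) for _ in range(100)}))
```

The first set can only ever contain one outcome, while the second may contain both "moved" and "stayed".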

When we are dealing with RL, the state model we get will be either deterministic or non-deterministic. That means we need to understand the mechanisms behind how DFA and NDFA work.

- DFA (Deterministic Finite Automaton)
A DFA goes through a finite number of steps. For each state and input, it can perform only one transition. See Figure 1-15.

Figure 1-15. Showing DFA

We are showing a state transition from a start state to a final state with the help of a diagram. It is a simple depiction where we can say that, with input values assumed to be 1 or 0, the state transition occurs. A self-loop is created when the machine gets a value and stays in the same state.
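
A DFA of this kind can be sketched as a lookup table. The two-state machine below, which ends in `final` exactly when the input ends in "1", is an assumed example and is not a transcription of Figure 1-15.

```python
# Assumed two-state DFA over inputs "0" and "1".
TRANSITIONS = {
    ("start", "0"): "start",   # self-loop: stay in the same state
    ("start", "1"): "final",
    ("final", "1"): "final",   # self-loop
    ("final", "0"): "start",
}


def run_dfa(inputs, state="start"):
    """Follow exactly one transition per input symbol; return the end state."""
    for symbol in inputs:
        state = TRANSITIONS[(state, symbol)]
    return state


for word in ("0", "01", "0110", "111"):
    print(word, "->", run_dfa(word))
```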

- NDFA (Nondeterministic Finite Automaton)
If we are working in a scenario where we don’t know exactly which state a machine will move into, this is a case of NDFA.
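
In code, that uncertainty can be sketched by tracking every state the machine might be in. The transition table below is an assumed example in which input "1" from `start` has two possible next states.

```python
# Assumed NDFA: from "start" on "1" the machine may stay or move to "final".
NFA = {
    ("start", "0"): {"start"},
    ("start", "1"): {"start", "final"},   # two possible next states
    ("final", "0"): set(),
    ("final", "1"): set(),
}


def run_nfa(inputs, states=frozenset({"start"})):
    """Track every state the machine could be in after each input symbol."""
    current = set(states)
    for symbol in inputs:
        current = set().union(*(NFA[(s, symbol)] for s in current))
    return current


for word in ("0", "1", "10", "01"):
    print(word, "->", sorted(run_nfa(word)))
```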

Observable
If we can say that the environment around us is fully observable, we have a perfect scenario for implementing Reinforcement Learning.
An example of perfect observability is a chess game. An example of partial observability is a poker game, where some of the cards are unknown to any one player.

Discrete or Continuous
If the choices for transitioning to the next state range over a continuum, so that there are infinitely many possible values, that is a continuous scenario. When there are a limited number of choices, that’s called a discrete scenario.
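
One way to picture the distinction, under the usual convention that a discrete scenario offers a finite set of choices and a continuous one a range of real values, is shown below. The move names and the steering range are assumptions for this example.

```python
import random

rng = random.Random(0)

# Discrete: a finite list of choices (assumed maze moves).
DISCRETE_ACTIONS = ["up", "down", "left", "right"]
discrete_action = rng.choice(DISCRETE_ACTIONS)

# Continuous: any real value in a range (assumed steering angle in degrees).
STEERING_RANGE = (-30.0, 30.0)
continuous_action = rng.uniform(*STEERING_RANGE)

print(discrete_action, continuous_action)
```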

Single Agent and Multiagent Environments
Solutions in Reinforcement Learning can be of single-agent or multiagent types. Let’s take a look at multiagent Reinforcement Learning first. When we are dealing with complex problems, we use multiagent Reinforcement Learning. Complex problems might have different environments in which agents do different jobs and also need to interact with each other. This introduces additional complications in determining state transitions.

Multiagent solutions are based on the non-deterministic approach. They are non-deterministic because when multiple agents interact, there might be more than one option for moving to the next state, and we have to make decisions based on that ambiguity. In multiagent solutions, the number of agent interactions across different environments is enormous, because the environments might be of different types and the agents might have different tasks to do in each state transition.

The differences between single-agent and multiagent solutions are as follows:

* Single-agent scenarios involve intelligent software in which the interaction happens in one environment only. If there is another environment simultaneously, it cannot interact with the first environment.
* Multiagent solutions are used when some convergence is needed in Reinforcement Learning. Here, convergence means that the agent needs to interact far more often, across different environments, to make a decision. Single agents cannot tackle this because it involves connecting to other environments in which decisions may have to be made simultaneously.
* Multiagent solutions work in more dynamic environments than single agents do. In a dynamic environment, the places the agents interact with can change.

Conclusion
This chapter touched on the basics of Reinforcement Learning and covered some key concepts. We covered states and environments and how the structure of Reinforcement Learning looks. We also touched on the different kinds of interactions and learned about single-agent and multiagent solutions. The next chapter covers algorithms and discusses the building blocks of Reinforcement Learning.

Supplement
* Reinforcement Learning w/ Keras+OpenAI: The Basics