Notes on Sutton-Barto
http://ufal.mff.cuni.cz/~straka/courses/npfl114/2016/sutton-bookdraft2016sep.pdf
https://drive.google.com/drive/folders/0ByidsAQcVcUUOHpFZzFsaThTazg
Sutton, Barto: Reinforcement Learning: An Introduction, Draft, Second Edition, MIT Press, 2016
Richard S. Sutton
B.A. in psychology from Stanford; Ph.D. in computer science from the University of Massachusetts Amherst
professor at the University of Alberta
Andrew G. Barto
professor emeritus at the University of Massachusetts Amherst
1. The Reinforcement Learning Problem
1.1. Reinforcement Learning
1.2. Examples
- chess player
- management of an oil refinery
- gazelle calf struggles to its feet after birth, 30 minutes later running 20 mph
- mobile robot: search for more trash or return to recharge
- Phil getting breakfast
1.3 Elements of RL
- policy - mapping from state to action. It corresponds to what in psychology would be called a set of stimulus–response rules or associations (provided that stimuli include those that can come from within the animal).
- reward signal - the goal. On each time step, the environment sends to the reinforcement learning agent a single number, a reward. The agent’s sole objective is to maximize the total reward it receives over the long run. human psychology: pleasure/pain
- value function - Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow, and the rewards available in those states. human psychology: refined and farsighted judgment of the current state. Rewards are basically given directly by the environment, but values must be estimated and re-estimated from the sequences of observations an agent makes over its entire lifetime. In fact, the most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values. The central role of value estimation is arguably the most important thing we have learned about reinforcement learning over the last few decades.
- model - optional. Model-free methods use trial and error; model-based methods use planning. Start with trial and error, build the model, then use planning. (A sketch of all four elements follows this list.)
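A minimal sketch of how the four elements above might be represented in a tiny tabular problem. This is my own illustration, not code from the book; the states, actions, rewards, and names are all made up.

```python
# Illustrative only: the four elements of RL in a tiny tabular setting.

STATES = ["s0", "s1", "s2"]
ACTIONS = ["left", "right"]

# policy: mapping from state to action (deterministic, stored as a dict)
policy = {"s0": "right", "s1": "right", "s2": "left"}

# reward signal: a single number the environment emits on each time step
def reward(state, action):
    return 1.0 if (state, action) == ("s1", "right") else 0.0

# value function: long-term desirability of each state; unlike rewards,
# these must be estimated and re-estimated from experience
value = {s: 0.0 for s in STATES}

# model (optional): predicts the next state, enabling planning
model = {("s0", "right"): "s1", ("s1", "right"): "s2", ("s2", "left"): "s1"}
```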
1.4 Limitations and Scope
There are multiple reinforcement learning methods. This book focuses on those structured around estimating value functions.
Other methods use genetic algorithms, genetic programming, simulated annealing. No value functions. These methods evaluate the “lifetime” behavior of many nonlearning agents, each using a different policy for interacting with its environment, and select those that are able to obtain the most reward.
policy gradient methods: adjust a parameterized policy in directions estimated to improve performance (see Chapter 13).
Learning vs Evolution.
1.5. An Extended Example: Tic-Tac-Toe
Exercises:
- Self Play
- Symmetries
- Greedy Play
- Learning from Exploration
- Other Improvements
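A minimal sketch of the value-table learning rule behind the tic-tac-toe example and the exercises above. This is my own illustration, not the book's code; it assumes states are hashable board encodings, and the names (values, choose_next_state, td_update) are made up.

```python
import random

values = {}      # state -> estimated probability of winning from that state
ALPHA = 0.1      # step-size parameter
EPSILON = 0.1    # fraction of exploratory (random) moves

def value(state):
    # Non-terminal states default to 0.5; terminal states would be set
    # to 1 (win) or 0 (loss/draw) when first encountered.
    return values.get(state, 0.5)

def choose_next_state(candidate_next_states):
    """Mostly greedy: pick the candidate with the highest estimated value,
    but with probability EPSILON pick one at random (exploratory move)."""
    if random.random() < EPSILON:
        return random.choice(candidate_next_states), True
    return max(candidate_next_states, key=value), False

def td_update(state, next_state):
    """Back up the earlier state's value toward the later state's value:
    V(s) <- V(s) + alpha * [V(s') - V(s)]  (applied after greedy moves)."""
    values[state] = value(state) + ALPHA * (value(next_state) - value(state))
```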
1.6 Summary
1.7 History of RL
The history of reinforcement learning has two main threads, both long and rich, that were pursued independently before intertwining in modern reinforcement learning. One thread concerns learning by trial and error that started in the psychology of animal learning. This thread runs through some of the earliest work in artificial intelligence and led to the revival of reinforcement learning in the early 1980s. The other thread concerns the problem of optimal control and its solution using value functions and dynamic programming. For the most part, this thread did not involve learning. Although the two threads have been largely independent, the exceptions revolve around a third, less distinct thread concerning temporal-difference methods such as the one used in the tic-tac-toe example in this chapter. All three threads came together in the late 1980s to produce the modern field of reinforcement learning as we present it in this book.
1938: B. F. Skinner: The Behavior of Organisms
1927 Pavlov: Conditioned Reflexes
1948: Alan Turing described a design for a “pleasure-pain system” that worked along the lines of the Law of Effect.
Electro-mechanical devices:
1933: Thomas Ross: a machine that could find and remember its way through a maze via switch settings
1954: Marvin Minsky: SNARCs (Stochastic Neural-Analog Reinforcement Calculators)
1954–1962: confusion between reinforcement learning and supervised learning; the terms “trial and error” and “rewards and punishments” were applied to learning from training examples.
1960–1980: general neglect of trial-and-error learning.
1960s: first use of the term “reinforcement learning” for trial-and-error learning in the engineering literature.
1959: Arthur Samuel's checkers-playing program, an early use of temporal-difference (TD) ideas.
1961: Marvin Minsky, “Steps Toward Artificial Intelligence.” Discusses prediction, expectation, and the basic credit-assignment problem: how do you distribute credit for success among the many decisions that may have been involved in producing it? All of the methods we discuss in this book are, in a sense, directed toward solving this problem. Minsky's paper is well worth reading today.
late 1980s: the three threads came together in the revival of reinforcement learning: 1. trial and error, rooted in the psychology of animal learning; 2. optimal control and its solution using value functions and dynamic programming; 3. temporal-difference methods (see the tic-tac-toe example in this chapter).
1981: the actor–critic architecture, a method for using temporal-difference learning in trial-and-error learning, applied to Michie and Chambers's pole-balancing problem (Barto, Sutton, and Anderson, 1983).
1.8 Bibliographical Remarks
The theory of reinforcement learning as we treat it in this book is based on maximizing the expected value of the amount of reward an agent can accumulate over its future. This is consistent with the classic principle of von Neumann and Morgenstern (1944) that rational decisions are those that maximize expected utility.
risk - important in investment, optimal control, outside the scope
utility theory - measuring people's desires, outside the scope
Part 1. Tabular Solution Methods
Chapter 2. Multi-Arm Bandits
2.1. A k-Armed Bandit Problem
Repeatedly choose one of k arms (actions); the reward distribution of each arm is unknown.
Explore to improve the estimate of each arm's value.
Exploit the current estimates to maximize reward.
Example: a doctor choosing among several experimental treatments for a series of patients.
three fundamental classes of methods for solving finite Markov decision problems:
- dynamic programming,
- Monte Carlo methods, and
- temporal-difference learning.
2.2. Action-Value Methods
Greedy vs. ε-greedy
Greedy action selection always exploits current knowledge to maximize immediate reward; it spends no time at all sampling apparently inferior actions to see if they might really be better. A simple alternative is to behave greedily most of the time, but every once in a while, say with small probability ε, instead to select randomly from amongst all the actions with equal probability independently of the action-value estimates. We call methods using this near-greedy action selection rule ε-greedy methods.
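A minimal sketch of ε-greedy action selection; my own illustration, not code from the book. It assumes q_estimates is a list of current action-value estimates Q(a).

```python
import random

def epsilon_greedy(q_estimates, epsilon):
    """Pick an action index: with probability epsilon choose at random
    (explore); otherwise choose a greedy action, breaking ties randomly."""
    if random.random() < epsilon:
        return random.randrange(len(q_estimates))
    best = max(q_estimates)
    return random.choice([a for a, q in enumerate(q_estimates) if q == best])
```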
2.3. Incremental Implementation
How to compute each action's average reward incrementally at each time step, without storing all past rewards.
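A sketch of the incremental sample-average update, which has the general form NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]; illustrative code, not from the book.

```python
def incremental_average(old_estimate, reward, n):
    """After the n-th reward for an action:
    Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
    Same result as averaging all n rewards, but O(1) memory per action."""
    return old_estimate + (reward - old_estimate) / n

# e.g. rewards 1.0, 0.0, 2.0 give running averages 1.0, 0.5, 1.0
q, n = 0.0, 0
for r in [1.0, 0.0, 2.0]:
    n += 1
    q = incremental_average(q, r, n)
```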
2.4. Tracking a Non-Stationary Problem
Bandits whose reward distributions change over time. Give more weight to recent rewards than to old ones, e.g. by using a constant step-size parameter instead of sample averages.
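For the nonstationary case, the same update with a constant step size α becomes an exponential recency-weighted average. Again an illustrative sketch, not the book's code.

```python
def constant_step_update(old_estimate, reward, alpha=0.1):
    """Q_{n+1} = Q_n + alpha * (R_n - Q_n), with constant alpha.
    Recent rewards get more weight than old ones, so the estimate can
    track a reward distribution that changes over time."""
    return old_estimate + alpha * (reward - old_estimate)
```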
2.5. Optimistic Initial Values
2.6. Upper-Confidence-Bound Action Selection
2.7. Gradient Bandit Algorithms
2.8. Associative Search (Contextual Bandits)
2.9. Summary
3. Finite Markov Decision Processes
3.1. The Agent-Environment Interface
3.2. Goals and Rewards
3.3. Returns
3.4. Unified Notation for Episodic and Continuing Tasks
3.5. The Markov Property
3.6. Markov Decision Processes
3.7. Value Functions
3.8. Optimal Value Functions
3.9. Optimality and Approximation
3.10. Summary
4. Dynamic Programming
4.1. Policy Evaluation
4.2. Policy Improvement
4.3. Policy Iteration
4.4. Value Iteration
4.5. Asynchronous Dynamic Programming
4.6. Generalized Policy Iteration
4.7. Efficiency of Dynamic Programming
4.8. Summary
5. Monte Carlo Methods
5.1. Monte Carlo Prediction
5.2. Monte Carlo Estimation of Action Values
5.3. Monte Carlo Control
5.4. Monte Carlo Control Without Exploring Starts
5.5. Off-policy Prediction via Importance Sampling
5.6. Incremental Implementation
5.7. Off-policy Monte Carlo Control
5.8. Return-Specific Importance Sampling
5.9. Summary
6. Temporal-Difference Learning
6.1. TD Prediction
6.2. Advantages of TD Prediction Methods
6.3. Optimality of TD(0)
6.4. Sarsa: On-Policy TD Control
6.5. Q-Learning: Off-Policy TD Control
6.6. Expected Sarsa
6.7. Maximization Bias and Double Learning
6.8. Games, Afterstates, and Other Special Cases
6.9. Summary
7. Multi-step Bootstrapping
7.1. n-step TD Prediction
7.2. n-step Sarsa
7.3. n-step Off-policy Learning by Importance Sampling
7.4. Off-policy Learning without Importance Sampling: The n-step Tree Backup Algorithm
7.5. A Unifying Algorithm: n-step Q(σ)
7.6. Summary
8. Planning and Learning with Tabular Methods
8.1. Models and Planning
8.2. Dyna: Planning, Acting, and Learning
8.3. When the Model is Wrong
8.4. Prioritized Sweeping
8.5. Planning as Part of Action Selection
8.6. Heuristic Search
8.7. Monte Carlo Tree Search
8.8. Summary
Part II. Approximate Solution Methods
the number of possible camera images, for example, is much larger than the number of atoms in the universe.
9. On-policy Prediction with Approximation
9.1. Value Function Approximation
9.2. The Prediction Objective (MSVE)
9.3. Stochastic-Gradient and Semi-gradient Methods
9.4. Linear Methods
9.5. Feature Construction for Linear Methods
9.5.1. Polynomials
9.5.2. Fourier Basis
9.5.3. Coarse Coding
9.5.4. Tile Coding
9.5.5. Radial Basis Functions
9.6. Non-Linear Function Approximation: Artificial Neural Networks
9.7. Least Squares TD
9.8. Summary
10. On-Policy Control with Approximation
10.1 Episodic Semi-Gradient Control
10.2 n-step Semi-Gradient Sarsa
10.3 Average Reward: A New Problem Setting for Continuing Tasks
10.4 Deprecating the Discounted Setting
10.5 n-step Differential Semi-Gradient Sarsa
10.6 Summary
11. Off-Policy Methods with Approximation
11.1. Semi-Gradient Methods
11.2. Baird's Counterexample
11.3. The Deadly Triad
12. Eligibility Traces
12.1. The λ-return
12.2. TD(λ)
12.3. An On-line Forward View
12.4. True Online TD(λ)
13. Policy Gradient Methods
13.1. Policy Approximation and its Advantages
13.2. The Policy Gradient Theorem
13.3. REINFORCE: Monte Carlo Policy Gradient
13.4. REINFORCE with Baseline
13.5. Actor-Critic Methods
13.6. Policy Gradient for Continuing Problems (Average Reward Rate)
13.7. Policy Parameterization for Continuous Actions
Part III. Looking Deeper
14. Psychology
14.1. Terminology
14.2. Prediction and Control
14.3. Classical Conditioning
14.3.1. The Rescorla-Wagner Model
14.3.2. The TD Model
14.4. Instrumental Conditioning
14.5. Delayed Reinforcement
14.6. Cognitive Maps
14.7. Habitual and Goal-Directed Behavior
14.8. Summary
14.9. Conclusions
15. Neuroscience
15.1. Neuroscience Basics
15.2. Reward Signals, Reinforcement Signals, Values, and Prediction Errors
15.3. The Reward Prediction Error Hypothesis
15.4. Dopamine
15.5. Experimental Support for the Reward Prediction Error Hypothesis
15.6. TD Error/Dopamine Correspondence
15.7. Neural Actor-Critic
15.8. Actor and Critic Learning Rules
15.9. Hedonistic Neurons
15.10. Collective Reinforcement Learning
15.11. Model-Based Methods in the Brain
15.12. Addiction
15.13. Summary
15.14. Conclusion
16. Applications and Case Studies
16.1. TD-Gammon
16.2. Samuel's Checkers Player
16.3. The Acrobot
16.4. Watson's Daily-Double Wagering
16.5. Optimizing Memory Control
16.6. Human-Level Video Game Play
16.7. Mastering the Game of Go
16.8. Personalized Web Services
16.9. Thermal Soaring
17. Frontiers
17.1. The Unified View
chapter incomplete