Reinforcement Learning - Reading List

Reading list for Reinforcement Learning

The papers listed below are a loose superset of the ones I try to cover, or have students present, throughout the term in my RL courses.

And Lo, the legends tell us, that before there was even the DQN, there were the incredible VFAs, and before those the Great Age of the Value Tables themselves.

Foundational papers fill the first part of the list. Once we enter the era of Deep Reinforcement Learning, the papers are grouped into early, middle, or later according to how they would fall in a graduate course on Advanced RL such as my RL Courses (ECE 457C/657C). Other categories cover general references, papers mostly about environments for testing RL algorithms, and potential papers for future reading. A further ordering with [n] is sometimes listed, but this is just a rough guide. In general, if one paper is based on work in an older paper, the older paper should be discussed first, or in the same week.

(You can jump to any stage with these links to find a paper)
foundational ~ early ~ middle ~ later ~ reference

foundational



  1. RL Book
    [0] Reinforcement Learning: An Introduction
    R.S. Sutton, and A.G. Barto.
    MIT Press, Cambridge, MA. 2018.
    DISCUSSED ON: 2024-09-06 by Prof. Mark Crowley for the first few weeks.
  2. [1] Dynamic Programming
    R Bellman.
    Princeton University Press, New Jersey. 1957.
  3. [2] Modified Policy Iteration Algorithms for Discounted Markov Decision Problems
    Martin L Puterman, and Moon Chirl Shin.
    Management Science. 24, (11). 1978.
  4. [6] Natural Actor-Critic
    Jan Peters, Sethu Vijayakumar, and Stefan Schaal.
    In European Conference on Machine Learning. Springer Verlag, Berlin, 2005.
  5. [6] Actor-Critic Algorithms
    Vijay Konda, and John Tsitsiklis.
    In Advances in Neural Information Processing Systems. MIT Press, 1999.
  6. [8] Learning from Delayed Rewards
    C J Watkins.
    PhD thesis, King's College, University of Cambridge, UK. 1989.
  7. [9] Policy Gradient Methods for Reinforcement Learning with Function Approximation
    Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour.
    In Advances in Neural Information Processing Systems. 12, MIT Press, 1999.
  8. [9] Natural gradient works efficiently in learning.
    S Amari.
    Neural Computation. 10, 1998.
  9. [9] Neuro-Dynamic Programming
    Dimitri P Bertsekas, and John N Tsitsiklis.
    Athena Scientific, Nashua, NH. 1996.
  10. [9] Simple statistical gradient-following algorithms for connectionist reinforcement learning
    Ronald J Williams.
    Machine Learning. 7, (2). 1992.
  11. [10] Neurocontrol and Supervised Learning: An Overview and Evaluation
    Paul Werbos.
    1992.

early



  1. Shallow
    [1] State of the Art Control of Atari Games Using Shallow Reinforcement Learning
    Yitao Liang, Marlos C. Machado, Erik Talvitie, and Michael Bowling.
    In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems, Richland, South Carolina, USA. 2016.
    DISCUSSED ON: 2024-10-04 by Mark Crowley
  2. DQN
    [2] Playing Atari with Deep Reinforcement Learning
    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A Riedmiller.
    Arxiv Preprint. abs/1312.5, 2013.
    DISCUSSED ON: 2024-10-04 by Mark Crowley
  3. HER
    [3] Hindsight Experience Replay
    Marcin Andrychowicz, Dwight Crow, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba.
    In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. 2017.
  4. PER
    [3] Prioritized Experience Replay
    Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver.
    In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings. 2016.
    DISCUSSED ON: 2024-10-11 by Mark Crowley?
  5. Rainbow
    [4] Rainbow: Combining improvements in deep reinforcement learning
    Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver.
    In Proceedings of the AAAI conference on artificial intelligence. 2018.
  6. Revisit-ER
    [5] Revisiting fundamentals of experience replay
    William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, and Will Dabney.
    In International Conference on Machine Learning. 2020.
  7. PPO
    [5] Proximal policy optimization algorithms
    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov.
    arXiv preprint arXiv:1707.06347. 2017.
    DISCUSSED ON: 2024-10-11 by Majid Ghasemi
  8. A3C
    [6] Asynchronous Methods for Deep Reinforcement Learning
    Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu.
    In Proceedings of The 33rd International Conference on Machine Learning (ICML). 2016.
    DISCUSSED ON: 2024-10-11 by Mark Crowley
  9. [6] Human-level control through deep reinforcement learning
    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis.
    Nature. 518, (7540). 2015.
  10. [7] Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor
    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine.
    In International conference on machine learning. 2018.

middle



  1. Distributional reinforcement learning
    Marc G Bellemare, Will Dabney, and Mark Rowland.
    MIT Press, 2023.
  2. Attention option-critic
    Raviteja Chunduru, and Doina Precup.
    arXiv preprint arXiv:2201.02628. 2022.
  3. The StarCraft Multi-Agent Challenge
    Mikayel Samvelyan, Tabish Rashid, Christian Schroeder De Witt, Gregory Farquhar, Nantas Nardelli, Tim GJ Rudner, Chia-Man Hung, Philip HS Torr, Jakob Foerster, and Shimon Whiteson.
    arXiv preprint arXiv:1902.04043. 2019.
  4. [5] Offline Reinforcement Learning as One Big Sequence Modeling Problem
    Michael Janner, Qiyang Li, and Sergey Levine.
    In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2021.
  5. DecTransfrmr
    [8] Decision Transformer: Reinforcement Learning via Sequence Modeling
    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch.
    In Advances in Neural Information Processing Systems. Jun, 2021.
  6. [10] Reinforcement Learning as a Framework for Ethical Decision Making
    David Abel, James MacGlashan, and Michael L. Littman.
    In AAAI Workshop: AI, Ethics, and Society. 2016.
  7. [13] Robust Reinforcement Learning for Linear Temporal Logic Specifications with Finite Trajectory Duration
    Soroush Mortazavi Moghaddam, Yash Vardhan Pant, and Sebastian Fischmeister.
    Proceedings of the Canadian Conference on Artificial Intelligence. Canadian Artificial Intelligence Association (CAIAC), May, 2024.
  8. [19] Deep Hedging with Market Impact
    Andrei Neagu, Frédéric Godin, Clarence Simard, and Leila Kosseim.
    In Proceedings of the Canadian Conference on Artificial Intelligence. Canadian Artificial Intelligence Association (CAIAC), May, 2024.

later



  1. RLHF
    [1] Illustrating Reinforcement Learning from Human Feedback (RLHF)
    Nathan Lambert, Louis Castricato, Leandro von Werra, and Alex Havrilla.
    Hugging Face Blog. 2022.
  2. ConstitEthic
    [2] Constitutional AI: Harmlessness from AI Feedback
    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan.
    2022.
  3. MORAL
    [3] MORAL: Aligning AI with Human Norms through Multi-Objective Reinforced Active Learning
    Markus Peschl, Arkady Zgonnikov, Frans A. Oliehoek, and Luciano C. Siebert.
    In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems (AAMAS). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC. 2022.
  4. Moral Grid
    [5] Moral Gridworlds: A Theoretical Proposal for Modeling Artificial Moral Cognition
    Julia Haas.
    Minds and Machines. 30, (2). 2020.
  5. MoralityInterpret
    [6] Morality, Machines, and the Interpretation Problem: A Value-based, Wittgensteinian Approach to Building Moral Agents
    Cosmin Badea, and Gregory Artus.
    In Artificial Intelligence XXXIX. Springer International Publishing, Cham. 2022.
  6. Multi-Obj-Ethic
    [10] Multi-Objective Reinforcement Learning for Designing Ethical Environments
    Manel Rodriguez-Soto, Maite Lopez-Sanchez, and Juan A. Rodriguez Aguilar.
    In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21. International Joint Conferences on Artificial Intelligence Organization, Aug, 2021.
  7. ChemGymRL
    [19] ChemGymRL: An Interactive Framework for Reinforcement Learning for Digital Chemistry
    Chris Beeler, Sriram Ganapathi Subramanian, Kyle Sprague, Nouha Chatti, Colin Bellinger, Mitchell Shahen, Nicholas Paquin, Mark Baula, Amanuel Dawit, Zihan Yang, and others.
    arXiv preprint arXiv:2305.14177. 2023.
  8. SPRING
    [20] SPRING: GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning
    Yue Wu, Shrimai Prabhumoye, So Yeon Min, Yonatan Bisk, Ruslan Salakhutdinov, Amos Azaria, Tom Mitchell, and Yuanzhi Li.
    2023.



Other Reading

The following sections list additional relevant publications that are optional for my course. They mostly include reference papers and books, simulation environment descriptions, or other resources that may also prove useful in understanding the course topics. Potential publications are even more optional, and there probably won't be time to discuss them in the course unless someone volunteers to present them.

reference ~ environment ~ potential

reference



  1. RL Book
    [0] Reinforcement Learning: An Introduction
    R.S. Sutton, and A.G. Barto.
    MIT Press, Cambridge, MA. 2018.
    DISCUSSED ON: 2024-09-06 by Prof. Mark Crowley for the first few weeks.
  2. [1] Artificial Intelligence: A Modern Approach
    Stuart J Russell, and Peter Norvig.
    Pearson Education, Inc., 2010.
  3. [3] Optuna: A Next-generation Hyperparameter Optimization Framework
    Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama.
    In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2019.
  4. [5] The Art of Reinforcement Learning: Fundamentals, Mathematics, and Implementations with Python
    Michael Hu.
    Apress Berkeley, CA, 2023.

environment



  1. MineRL: A Large-Scale Dataset of Minecraft Demonstrations
    William H Guss, Brandon Houghton, Nicholay Topin, Phillip Wang, Cayden Codel, Manuela Veloso, and Ruslan Salakhutdinov.
    arXiv preprint arXiv:1907.13440. 2019.
  2. The StarCraft Multi-Agent Challenge
    Mikayel Samvelyan, Tabish Rashid, Christian Schroeder De Witt, Gregory Farquhar, Nantas Nardelli, Tim GJ Rudner, Chia-Man Hung, Philip HS Torr, Jakob Foerster, and Shimon Whiteson.
    arXiv preprint arXiv:1902.04043. 2019.
  3. OpenAI Gym
    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba.
    arXiv preprint arXiv:1606.01540. 2016.
  4. [19] Deep Hedging with Market Impact
    Andrei Neagu, Frédéric Godin, Clarence Simard, and Leila Kosseim.
    In Proceedings of the Canadian Conference on Artificial Intelligence. Canadian Artificial Intelligence Association (CAIAC), May, 2024.
  5. ChemGymRL
    [19] ChemGymRL: An Interactive Framework for Reinforcement Learning for Digital Chemistry
    Chris Beeler, Sriram Ganapathi Subramanian, Kyle Sprague, Nouha Chatti, Colin Bellinger, Mitchell Shahen, Nicholas Paquin, Mark Baula, Amanuel Dawit, Zihan Yang, and others.
    arXiv preprint arXiv:2305.14177. 2023.

potential



  1. Symphony: Learning Realistic and Diverse Agents for Autonomous Driving Simulation
    Maximilian Igl, Daewoo Kim, Alex Kuefler, Paul Mougin, Punit Shah, Kyriacos Shiarlis, Dragomir Anguelov, Mark Palatucci, Brandyn White, and Shimon Whiteson.
    In International Conference on Robotics and Automation (ICRA). 2022.
  2. Graph Convolutional Networks for Chemical Relation Extraction
    Darshini Mahendran, Christina Tang, and Bridget T. McInnes.
    In Companion Proceedings of the Web Conference 2022. Association for Computing Machinery, New York, NY, USA. Apr, 2022.
  3. UAV Coverage Path Planning under Varying Power Constraints Using Deep Reinforcement Learning
    Mirco Theile, Harald Bayerlein, Richard Nai, David Gesbert, and Marco Caccamo.
    In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020.
  4. Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching
    Liu Yang, Mingyang Zhang, Cheng Li, Michael Bendersky, and Marc Najork.
    In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. ACM, Virtual Event Ireland. Oct, 2020.
  5. Model-ensemble trust-region policy optimization
    Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel.
    arXiv preprint arXiv:1802.10592. 2018.
  6. A distributional perspective on reinforcement learning
    Marc G Bellemare, Will Dabney, and Rémi Munos.
    In International conference on machine learning. 2017.
  7. Constrained policy optimization
    Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel.
    In International conference on machine learning. 2017.
  8. Continuous control with deep reinforcement learning
    Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra.
    arXiv preprint arXiv:1509.02971. 2015.
  9. Knows What It Knows: A Framework For Self-Aware Learning
    Lihong Li, Michael Littman, and Thomas J Walsh.
    Proceedings of the 25th International Conference on Machine Learning. 2008.
  10. PAC Model-Free Reinforcement Learning
    Alexander L Strehl, Eric Wiewiora, John Langford, and Michael L Littman.
    Proceedings of the 23rd International Conference on Machine Learning. 2006.
  11. [9] Talking About Large Language Models
    Murray Shanahan.
    Arxiv Preprint. Dec, 2022.
  12. [99] Galactica: A Large Language Model for Science
    Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic.
    Arxiv Preprint. 2022.