Deep Reinforcement Learning

tomzahavy (at) gmail (dot) com


I am a research scientist at DeepMind working on Reinforcement Learning. I come from a small town in 🇮🇱 on the Mediterranean Sea. I currently live in London 🇬🇧, and I spent some time in the 🇺🇸. My family comes from 🇩🇪🇮🇩🇱🇺, and by DNA I am 🇮🇩🇭🇺🇮🇷 (50/30/20). I am married to Gili, a singer-songwriter from 🇮🇩🇲🇦🇮🇱. I love spending my free time outdoors: camping, hiking, 4X4 driving, mountaineering, skiing, and scuba diving. When I am at home, my hobbies are running, basketball, and reading science fiction. I completed my Ph.D. at the Technion, where I was advised by Shie Mannor, and interned at Microsoft, Walmart, Facebook, and Google.

My high-level research goal is to build artificial intelligence via Reinforcement Learning. During my Ph.D. I studied aspects of scalability, structure discovery, hierarchy, abstraction, and exploration in Deep RL. Since joining the Discovery team @DeepMind, I have focused on two topics:


Meta RL


Building reinforcement learning algorithms that discover an internal knowledge base (hyperparameters, loss functions, options, rewards) in order to solve the original problem better.

Read more about meta-gradients in my papers below, in this excellent blog post by Robert Lange, in the MLST Podcast (with Robert, Tim, Yannic and myself), or in this talk by David Silver.
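A toy sketch of the meta-gradient idea, not the algorithm from any of the papers below: a single learning rate is tuned by differentiating the loss after an update with respect to that learning rate. The quadratic loss, the function names, and all constants are assumptions chosen only for illustration.

```python
def inner_loss_grad(theta, target):
    # Gradient of the inner loss 0.5 * (theta - target) ** 2 w.r.t. theta.
    return theta - target


def meta_gradient_step(theta, eta, target, meta_lr):
    """One inner update of theta, followed by one meta update of eta."""
    g = inner_loss_grad(theta, target)
    theta_new = theta - eta * g  # inner update; theta_new depends on eta

    # Outer loss: 0.5 * (theta_new - target) ** 2, evaluated after the update.
    # Chain rule: dL/d(eta) = (theta_new - target) * d(theta_new)/d(eta),
    # and d(theta_new)/d(eta) = -g.
    meta_grad = (theta_new - target) * (-g)

    eta_new = eta - meta_lr * meta_grad  # gradient descent on the hyperparameter
    return theta_new, eta_new


def run(steps=50, theta=5.0, eta=0.01, target=1.0, meta_lr=0.01):
    # Hypothetical toy settings; returns the (theta, eta) trajectory.
    history = []
    for _ in range(steps):
        theta, eta = meta_gradient_step(theta, eta, target, meta_lr)
        history.append((theta, eta))
    return history


if __name__ == "__main__":
    final_theta, final_eta = run()[-1]
    print(final_theta, final_eta)
```

In this toy problem the learning rate starts small and is pushed upwards by the meta-gradient for as long as a larger step would have reduced the post-update loss. In an RL agent the same chain rule is applied through the agent's own update, usually with automatic differentiation rather than a hand-derived gradient.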


A Self-Tuning Actor-Critic Algorithm, NeurIPS 2020

Tl;dr: We propose a self-tuning actor-critic algorithm (STACX) that adapts all the differentiable hyperparameters of IMPALA, including those of auxiliary tasks, and achieves impressive gains in performance in Atari and DM Control.

Tom Zahavy, Zhongwen Xu, Vivek Veeriah, Matteo Hessel, Junhyuk Oh, Hado van Hasselt, David Silver, Satinder Singh


Bootstrapped Meta Learning, ICLR 2022 (outstanding paper award)

Tl;dr: We propose a novel meta-learning algorithm that first bootstraps a target from the meta-learner, then optimises the meta-learner by minimising the distance to that target. When applied to STACX, it achieves state-of-the-art results in Atari.

Sebastian Flennerhag, Yannick Schroecker, Tom Zahavy, Hado van Hasselt, David Silver, Satinder Singh


Discovery of Options via Meta-Learned Subgoals, NeurIPS 2021

Tl;dr: We use meta-gradients to discover subgoals in the form of intrinsic rewards, use these subgoals to learn options, and control these options with an HRL policy.

Vivek Veeriah, Tom Zahavy, Matteo Hessel, Zhongwen Xu, Junhyuk Oh, Iurii Kemaev, Hado van Hasselt, David Silver, Satinder Singh


Balancing Constraints and Rewards with Meta-Gradient D4PG, ICLR 2021

Tl;dr: We use meta-gradients to adapt the learning rates of the RL agent and of the Lagrange multiplier in a constrained MDP.


Meta-Gradients in Non-Stationary Environments, CoLLAs 2022

Tl;dr: We study meta-gradients in non-stationary RL environments.

Jelena Luketina, Sebastian Flennerhag, Yannick Schroecker, David Abel, Tom Zahavy, Satinder Singh

Nonlinear RL problems


RL objectives expressed as nonlinear functions of the state occupancy, and algorithms for solving them.


Reward is enough for convex MDPs, NeurIPS 2021 (spotlight) 

Tl;dr: We study nonlinear and unsupervised objectives that are defined over the state occupancy of an RL agent in an MDP. These include Apprenticeship Learning, diverse skill discovery, constrained MDPs, and pure exploration. We show that treating the gradient of such an objective as an intrinsic reward, and maximizing that reward, solves the problem efficiently. We also propose a meta-algorithm and show that many existing algorithms in the literature can be explained as instances of it.

Tom Zahavy, Brendan O'Donoghue, Guillaume Desjardins, Satinder Singh
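As a rough sketch of the setting described in the Tl;dr above, the contrast with standard RL can be written as follows. The notation is an assumption made for illustration and may differ from the paper: \mathcal{K} is the set of admissible state-action occupancies, f is a convex function, and d_k is the occupancy estimate at iteration k.

```latex
% Standard RL: a linear objective over the occupancy d
\max_{d \in \mathcal{K}} \; \sum_{s,a} r(s,a)\, d(s,a)

% Convex MDP: a convex function f replaces the linear objective
\min_{d \in \mathcal{K}} \; f(d)

% Intrinsic reward at iteration k, given the current occupancy estimate d_k
r_k(s,a) = -\left.\frac{\partial f}{\partial d(s,a)}\right|_{d = d_k}
```

Under this minimization convention the reward is the negative gradient; for a concave objective that is maximized, it is the gradient itself. For example, taking f to be the negative entropy of d gives a reward that is larger for rarely visited state-action pairs, which corresponds to the pure exploration case.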


Discovering a set of policies for the worst case reward, ICLR 2021 (spotlight)

Tl;dr: We propose a method for discovering a set of policies that perform well w.r.t. the worst-case reward when composed together.


Discovering Diverse Nearly Optimal Policies with Successor Features

Tl;dr: We propose a method for discovering policies that are diverse in the space of Successor Features, while ensuring that they are near-optimal using a constrained MDP.

Tom Zahavy, Brendan O'Donoghue, Andre Barreto, Volodymyr Mnih, Sebastian Flennerhag, Satinder Singh


Apprenticeship Learning via Frank-Wolfe, AAAI 2020

Tl;dr: We show that the well-known Apprenticeship Learning algorithm of Abbeel and Ng (2004) can be understood as a Frank-Wolfe method and propose methods to accelerate it.

Tom Zahavy, Alon Cohen, Haim Kaplan, and Yishay Mansour
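A minimal Frank-Wolfe sketch for readers unfamiliar with the method, under loud assumptions: it approximately projects a target vector (standing in for the expert's feature expectations) onto the convex hull of an explicit list of vertices. In Apprenticeship Learning the linear minimisation step would instead be carried out by solving an MDP, and the step rule here is the generic 2/(t+2) schedule, not necessarily the one analysed in the paper. All names are hypothetical.

```python
def dot(u, v):
    # Inner product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))


def frank_wolfe_projection(vertices, expert, iterations=200):
    """Approximately minimise 0.5 * ||mu - expert||^2 over conv(vertices)."""
    mu = list(vertices[0])  # start from an arbitrary vertex
    for t in range(iterations):
        # Gradient of the objective at the current iterate.
        grad = [m - e for m, e in zip(mu, expert)]
        # Linear minimisation oracle: the vertex best aligned with -grad.
        # (Assumption: explicit vertices replace the MDP solver here.)
        s = min(vertices, key=lambda v: dot(grad, v))
        # Generic Frank-Wolfe step size; keeps mu a convex combination.
        step = 2.0 / (t + 2.0)
        mu = [(1.0 - step) * m + step * si for m, si in zip(mu, s)]
    return mu


if __name__ == "__main__":
    vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    print(frank_wolfe_projection(vertices, expert=(0.25, 0.25)))
```

Each iterate is a convex combination of vertices, which in the Apprenticeship Learning reading corresponds to a mixture of policies.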


Online Apprenticeship Learning, AAAI 2021

Tl;dr: We propose the first Apprenticeship Learning algorithm that does not require solving an MDP in each iteration, and analyse its regret.

Lior Shani, Tom Zahavy, Shie Mannor

Older Publications from my PhD

A Deep Hierarchical Approach to Lifelong Learning in Minecraft

AAAI 2017

Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J Mankowitz, Shie Mannor


Learn What Not to Learn: Action Elimination with Deep Reinforcement Learning

NeurIPS 2018

Tom Zahavy, Matan Haroush, Nadav Merlis, Daniel J. Mankowitz, Shie Mannor


Graying the black box: Understanding DQNs

ICML 2016

Tom Zahavy, Nir Ben Zrihem, Shie Mannor


Online Limited Memory Neural-Linear Bandits with Likelihood Matching

ICML 2021


Inverse Reinforcement Learning in Contextual MDPs

SPRINGER, Machine Learning Journal 2021, Special Issue On RL for Real Life

Stav Belogolovsky, Philip Korsunsky, Shie Mannor, Chen Tessler, Tom Zahavy


Shallow Updates for Deep Reinforcement Learning

NeurIPS 2017


Planning in Hierarchical Reinforcement Learning: Guarantees for Using Local Policies

ALT 2020

Tom Zahavy, Avinatan Hasidim, Haim Kaplan and Yishay Mansour