Deep Reinforcement Learning

tomzahavy (at) gmail (dot) com


Google Scholar

I am a research scientist at DeepMind in the field of Reinforcement Learning. Previously, I was a Ph.D. candidate at the Technion and interned at Microsoft, Walmart, Facebook, and Google.


My high-level research goal is to build an artificial intelligence via Reinforcement Learning. My prior work focused on aspects of scalability, structure discovery, hierarchy, abstraction, and exploration.

Currently, I focus on Discovery:

  • Metagradients: building reinforcement learning algorithms that discover an internal knowledge base (hyperparameters, loss function, options, reward) in order to better solve the original problem. You can read more about it in my papers [1,2,3], in this blog post by Robert Lange, in this Podcast (with Robert, Tim, Yanick and myself), or in this talk by David Silver.

  • Diversity: finding diverse policies that are also nearly optimal in terms of maximising reward. I think about this problem through the lens of convex optimisation [1,2,3]: proposing unsupervised convex objectives that lead to diversity, as well as RL algorithms that find solutions to these objectives via the maximisation of reward.


About me: I come from a small town in 🇮🇱 on the Mediterranean Sea. I currently live in London 🇬🇧 and I spent some time in the 🇺🇸. My family comes from 🇩🇪🇮🇩🇱🇺 and by DNA I am 🇮🇩🇭🇺🇮🇷 (50/30/20). I am married to Gili, a singer-songwriter from 🇮🇩🇲🇦🇮🇱. I love spending my free time outdoors: camping, hiking, 4X4 driving, mountaineering, skiing, and scuba diving. When I am at home, my hobbies are running, basketball, and reading science fiction.

I was recently featured in the Machine Learning Street Talk podcast, where we talked about my deep RL journey, automatic discovery of structure, and our recent meta-gradient papers. You can listen to it here.



Selected Publications

Discovering Diverse Nearly Optimal Policies with Successor Features


Tom Zahavy, Brendan O'Donoghue, Andre Barreto, Volodymyr Mnih, Sebastian Flennerhag, Satinder Singh


Reward is enough for convex MDPs 



Discovering a set of policies for the worst case reward

ICLR 2021 (spotlight)

A Self-Tuning Actor-Critic Algorithm


Tom Zahavy, Zhongwen Xu, Vivek Veeriah, Matteo Hessel, Junhyuk Oh, Hado van Hasselt, David Silver, Satinder Singh

A Deep Hierarchical Approach to Lifelong Learning in Minecraft

AAAI 2017

Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J Mankowitz, Shie Mannor


Graying the black box: Understanding DQNs

ICML 2016

Tom Zahavy, Nir Ben Zrihem, Shie Mannor


Online Limited Memory Neural-Linear Bandits with Likelihood Matching

ICML 2021

Inverse Reinforcement Learning in Contextual MDPs

SPRINGER, Machine Learning Journal 2021, Special Issue On RL for Real Life

Stav Belogolovsky, Philip Korsunsky, Shie Mannor, Chen Tessler, Tom Zahavy

Apprenticeship Learning via Frank-Wolfe

AAAI 2020

Tom Zahavy, Alon Cohen, Haim Kaplan, and Yishay Mansour

Learn What Not to Learn: Action Elimination with Deep Reinforcement Learning

NeurIPS 2018

Tom Zahavy, Matan Haroush, Nadav Merlis, Daniel J. Mankowitz, Shie Mannor

Balancing Constraints and Rewards with Meta-Gradient D4PG

ICLR 2021

Shallow Updates for Deep Reinforcement Learning

NeurIPS 2017

Planning in Hierarchical Reinforcement Learning: Guarantees for Using Local Policies

ALT 2020

Tom Zahavy, Avinatan Hasidim, Haim Kaplan and Yishay Mansour