Deep Reinforcement Learning

tomzahavy (at) gmail (dot) com

Google Scholar     Curriculum Vitae     LinkedIn

I am a research scientist @ DeepMind.

Prior to that, I was a Ph.D. candidate at the Technion, where Shie Mannor advised me.

My research focuses on Deep Reinforcement Learning (DRL). In particular, I have worked on Explainable AI (XAI) in DRL, hierarchical DRL, exploration during representation learning, and the convex optimization foundations of inverse reinforcement learning.

I've worked on the following problems (by this I mean that I developed an API, designed a machine learning algorithm, or collected and processed real-world data):

  • The Arcade Learning Environment

  • Minecraft

  • Text-based games (Zork)

  • Online treatment regimes (Sepsis, Mimic3)

  • E-commerce

  • Physics (ultra-short laser pulses)

  • Communication (sub-Nyquist OFDM modem, hardware and software).


Selected Publications

Tom Zahavy, Nir Ben Zrihem, Shie Mannor. In Proc. International Conference on Machine Learning (ICML), New York, 2016.

In recent years there has been growing interest in using deep representations for reinforcement learning. In this paper, we present a methodology and tools to analyze Deep Q-networks (DQNs) in a non-blind manner. Moreover, we propose a new model, the Semi Aggregated Markov Decision Process (SAMDP), and an algorithm that learns it automatically. The SAMDP model allows us to identify spatio-temporal abstractions directly from features and may be used as a sub-goal detector in future work. Using our tools we reveal that the features learned by DQNs aggregate the state space in a hierarchical fashion, explaining their success. Moreover, we are able to understand and describe the policies learned by DQNs for three different Atari2600 games and suggest ways to interpret, debug and optimize deep neural networks in reinforcement learning.



Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J Mankowitz, Shie Mannor. Conference on Artificial Intelligence (AAAI), 2017.

We propose a lifelong learning system that has the ability to reuse and transfer knowledge from one task to another while efficiently retaining the previously learned knowledge-base. Knowledge is transferred by learning reusable skills to solve tasks in Minecraft, a popular video game which is an unsolved and high-dimensional lifelong learning problem. These reusable skills, which we refer to as Deep Skill Networks, are then incorporated into our novel Hierarchical Deep Reinforcement Learning Network (H-DRLN) architecture using two techniques: (1) a deep skill array and (2) skill distillation, our novel variation of policy distillation (Rusu et al. 2015) for learning skills. Skill distillation enables the H-DRLN to efficiently retain knowledge and therefore scale in lifelong learning, by accumulating knowledge and encapsulating multiple reusable skills into a single distilled network. The H-DRLN exhibits superior performance and lower learning sample complexity compared to the regular Deep Q Network (Mnih et al. 2015) in sub-domains of Minecraft. [paper][page][code]

Tom Zahavy, Matan Haroush, Nadav Merlis, Daniel J. Mankowitz and Shie Mannor. Neural Information Processing Systems (NIPS) 2018.

Learning how to act when there are many available actions in each state is a challenging task for Reinforcement Learning (RL) agents, especially when many of the actions are redundant or irrelevant. In such cases, it is sometimes easier to learn which actions not to take. In this work, we propose the Action-Elimination Deep Q-Network (AE-DQN) architecture that combines a Deep RL algorithm with an Action Elimination Network (AEN) that eliminates sub-optimal actions. The AEN is trained to predict invalid actions, supervised by an external elimination signal provided by the environment. Simulations demonstrate a considerable speedup and added robustness over vanilla DQN in text-based games with over a thousand discrete actions.
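The action-elimination idea can be illustrated with a minimal sketch. It assumes the elimination network outputs one score per action and that actions above a fixed threshold are masked out before the greedy choice. The function name, the threshold, and the fallback rule are hypothetical and are not taken from the paper.

```python
import numpy as np

def greedy_action_with_elimination(q_values, elimination_scores, threshold=0.5):
    """Pick the highest-valued action among those not flagged for elimination.

    q_values: estimated value of each action (from the RL agent).
    elimination_scores: predicted probability that each action is invalid
        (hypothetical output of an elimination network).
    threshold: assumed cutoff above which an action is eliminated.
    """
    q_values = np.asarray(q_values, dtype=float)
    elimination_scores = np.asarray(elimination_scores, dtype=float)

    allowed = elimination_scores < threshold
    if not allowed.any():
        # Assumed fallback: if every action is flagged, consider all of them.
        allowed = np.ones_like(allowed, dtype=bool)

    # Eliminated actions get -inf so they can never win the argmax.
    masked_q = np.where(allowed, q_values, -np.inf)
    return int(np.argmax(masked_q))

q = [1.0, 5.0, 3.0, 2.0]
scores = [0.1, 0.9, 0.2, 0.7]
action = greedy_action_with_elimination(q, scores)
```

Here the action with the largest Q-value (index 1) is flagged as invalid, so the agent picks the best remaining action instead.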

Tom Zahavy, Alon Cohen, Haim Kaplan and Yishay Mansour. Conference on Artificial Intelligence (AAAI), 2020.

We consider the applications of the Frank-Wolfe (FW) algorithm for Apprenticeship Learning (AL). In this setting, there is a Markov Decision Process (MDP), but the reward function is not given explicitly. Instead, there is an expert that acts according to some policy, and the goal is to find a policy whose feature expectations are closest to those of the expert policy. We formulate this problem as finding the projection of the feature expectations of the expert on the feature expectations polytope--the convex hull of the feature expectations of all the deterministic policies in the MDP. We show that this formulation is equivalent to the AL objective and that solving this problem using the FW algorithm is equivalent to the best-known AL algorithm, the projection method of Abbeel and Ng (2004). This insight allows us to analyze AL with tools from the convex optimization literature and to derive tighter bounds on AL. Specifically, we show that a variation of the FW method that is based on taking "away steps" achieves a linear rate of convergence when applied to AL. We also show experimentally that this version outperforms the FW baseline. To the best of our knowledge, this is the first work that shows linear convergence rates for AL.
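The projection view can be sketched with the plain Frank-Wolfe method (not the away-step variant studied in the paper). The sketch assumes the polytope is given as an explicit list of vertices; in AL the linear minimization step would instead be carried out by solving an MDP. The function name and the toy square are illustrative only.

```python
import numpy as np

def frank_wolfe_projection(vertices, target, num_iters=200, tol=1e-10):
    """Approximate the Euclidean projection of `target` onto conv(vertices).

    Minimizes f(x) = 0.5 * ||x - target||^2 with vanilla Frank-Wolfe
    and exact line search.
    """
    vertices = np.asarray(vertices, dtype=float)
    target = np.asarray(target, dtype=float)
    x = vertices[0].copy()  # start from an arbitrary vertex

    for _ in range(num_iters):
        grad = x - target
        # Linear minimization oracle: the vertex most aligned with -grad.
        s = vertices[np.argmin(vertices @ grad)]
        d = s - x
        gap = -(grad @ d)  # Frank-Wolfe duality gap
        if gap <= tol:
            break
        # Exact line search for a quadratic objective, clipped to [0, 1].
        gamma = min(1.0, gap / (d @ d))
        x = x + gamma * d
    return x

# Toy example: project a point outside the unit square onto the square.
square = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
projection = frank_wolfe_projection(square, [2.0, 0.5])
```

In the AL reading, the vertices would play the role of the feature expectations of deterministic policies and the target would be the expert's feature expectations.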

Deep reinforcement learning (DRL) methods such as the Deep Q-Network (DQN) have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This success is mainly attributed to the power of deep neural networks to learn rich domain representations for approximating the value function or policy. Batch reinforcement learning methods with linear representations, on the other hand, are more stable and require less hyperparameter tuning. Yet, substantial feature engineering is necessary to achieve good results. In this work we propose a hybrid approach -- the Least Squares Deep Q-Network (LS-DQN), which combines rich feature representations learned by a DRL algorithm with the stability of a linear least squares method. We do this by periodically re-training the last hidden layer of a DRL network with a batch least squares update. Key to our approach is a Bayesian regularization term for the least squares update, which prevents over-fitting to the more recent data. We tested LS-DQN on five Atari games and demonstrated significant improvement over vanilla DQN and Double-DQN. We also investigated the reasons for the superior performance of our method. Interestingly, we found that the performance improvement can be attributed to the large batch size used by the LS method when optimizing the last layer. [paper][code]
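A regularized least squares update of a last layer can be sketched as follows. The sketch assumes the regularizer acts like a Gaussian prior centered on the current last-layer weights, which the abstract does not spell out; the function name, the `reg` parameter, and the toy data are hypothetical.

```python
import numpy as np

def regularized_last_layer_update(features, targets, prior_weights, reg=1.0):
    """Solve min_w ||features @ w - targets||^2 + reg * ||w - prior_weights||^2.

    features: (n_samples, dim) activations feeding the last layer.
    targets: (n_samples,) regression targets (e.g. bootstrapped values).
    prior_weights: (dim,) current last-layer weights, used as the prior mean.
    reg: assumed regularization strength; larger values stay closer to the prior.
    """
    features = np.asarray(features, dtype=float)
    targets = np.asarray(targets, dtype=float)
    prior_weights = np.asarray(prior_weights, dtype=float)

    dim = features.shape[1]
    gram = features.T @ features + reg * np.eye(dim)
    rhs = features.T @ targets + reg * prior_weights
    # Closed-form solution of the regularized normal equations.
    return np.linalg.solve(gram, rhs)

phi = np.eye(2)
y = np.array([2.0, 4.0])
new_weights = regularized_last_layer_update(phi, y, prior_weights=np.zeros(2), reg=1.0)
```

With `reg` close to zero this reduces to ordinary least squares, and with a very large `reg` the solution stays near `prior_weights`.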

Tom Zahavy and Shie Mannor

We study the neural-linear bandit model for solving sequential decision-making problems with high dimensional side information. Neural-linear bandits leverage the representation power of deep neural networks and combine it with efficient exploration mechanisms, designed for linear contextual bandits, on top of the last hidden layer. Since the representation is being optimized during learning, information regarding exploration with "old" features is lost. Here, we propose the first limited memory neural-linear bandit that is resilient to this phenomenon, which we term catastrophic forgetting. We evaluate our method on a variety of real-world data sets, including regression, classification, and sentiment analysis, and observe that our algorithm is resilient to catastrophic forgetting and achieves superior performance.

Tom Zahavy, Avinatan Hasidim, Haim Kaplan and Yishay Mansour, ALT, 2020

We consider a setting of hierarchical reinforcement learning, in which the reward is a sum of components. For each component we are given a policy that maximizes it and our goal is to assemble a policy from the individual policies that maximizes the sum of the components. We provide theoretical guarantees for assembling such policies in deterministic MDPs with collectible rewards. Our approach builds on formulating this problem as a traveling salesman problem with discounted reward. We focus on local solutions, i.e., policies that only use information from the current state; thus, they are easy to implement and do not require substantial computational resources. We propose three local stochastic policies and prove that they guarantee better performance than any deterministic local policy in the worst case; experimental results suggest that they also perform better on average.


Tom Zahavy, Alex Dikopoltsev, Oren Cohen, Shie Mannor and Mordechai Segev, Optica 5, 666-673 (2018)

Ultra-short laser pulses with femtosecond to attosecond pulse duration are the shortest systematic events humans can create. Characterization (amplitude and phase) of these pulses is a key ingredient in ultrafast science, e.g., exploring chemical reactions and electronic phase transitions. Here, we propose and demonstrate, numerically and experimentally, the first deep neural network technique to reconstruct ultra-short optical pulses. We anticipate that this approach will extend the range of ultrashort laser pulses that can be characterized, e.g., enabling the diagnosis of very weak attosecond pulses. [Optica] [Cleo18] [NIPS17 Workshop on Deep Learning for Physical Sciences]

Tom Zahavy, Alessandro Magnani, Abhinandan Krishnan and Shie Mannor. The Thirtieth Annual Conference on Innovative Applications of Artificial Intelligence (IAAI-18)

Classifying products into categories precisely and efficiently is a major challenge in modern e-commerce. The high traffic of new products uploaded daily and the dynamic nature of the categories raise the need for machine learning models that can reduce the cost and time of human editors. In this paper, we propose a decision level fusion approach for multi-modal product classification using text and image inputs. We train input specific state-of-the-art deep neural networks for each input source, show the potential of forging them together into a multi-modal architecture and train a novel policy network that learns to choose between them. Finally, we demonstrate that our multi-modal network improves the top-1 accuracy % over both networks on a real-world large-scale product classification dataset that we collected. While we focus on image-text fusion that characterizes e-commerce domains, our algorithms can be easily applied to other modalities such as audio, video, physical sensors, etc. [paper][code]



Bachelor's Degree.

Double major in Electrical Engineering and Physics.


M.Sc in

Machine Learning. 


Ph.D. candidate (direct track) in Machine Learning. 


R&D Intern.


Data analysis and computer vision. 


Data Science Research Intern.

Product classification using deep learning. 


Research Intern. 

Transfer learning for Deep Reinforcement Learning. 


Research Intern. 

Inverse Reinforcement Learning.


Research Scientist. 

Lifelong Reinforcement Learning.

© 2023 by Tom Zahavy. 
