Research Proposal

Summary of past research projects

My primary research passion is learning to make good decisions in complex environments. To this end, I am dedicated to making contributions to reinforcement learning, focusing on two core problems: boosting planning performance with learned or exact environment models, and increasing sample efficiency through better exploration strategies.

To improve planning performance, I worked on two separate but closely related aspects: the effectiveness of planning algorithms and the influence of inaccurate (learned) environment models on them. When a perfect environment model is given, planning can be viewed as a tradeoff between computation cost and performance. Therefore, to achieve the best performance within a fixed amount of time, parallelization has been widely applied, especially in Monte Carlo Tree Search (MCTS), a popular and powerful planning algorithm. However, parallelizing MCTS harms its performance significantly, since it inevitably breaks the sequential dependency between steps (i.e., each MCTS iteration requires information from all previous iterations to provide an effective exploration-exploitation tradeoff). Despite these difficulties, I proposed a novel parallel MCTS algorithm that achieves linear speed-up with negligible performance loss. The key idea behind the algorithm is to maintain a set of statistics that tracks the number of ongoing yet incomplete simulation queries, which is used to retain an effective exploration-exploitation tradeoff. The algorithm has been deployed in a production system for months. This work was submitted to ICLR 2020 and received review scores of 8, 6, and 8 (tied for the top score: 34/2594).
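
As a minimal sketch of the kind of statistic involved (the attribute names and the exact selection rule here are illustrative, not the algorithm as submitted): besides the usual completed-visit count, each child also tracks how many simulations have been dispatched but not yet finished, and the selection rule counts them as pending visits so that concurrent workers spread across the tree instead of piling onto the same branch.

    import math

    def select_child(node, c_explore=1.41):
        """Pick the child to expand for the next (possibly concurrent) simulation.

        Each child is assumed to carry three statistics:
          n_completed : finished simulations through this child
          n_ongoing   : dispatched but not-yet-finished simulations
          value_sum   : sum of returns from completed simulations
        Counting the ongoing simulations as pending visits keeps the
        exploration term meaningful while many workers run in parallel.
        """
        total = sum(c.n_completed + c.n_ongoing for c in node.children) + 1
        best, best_score = None, -float("inf")
        for c in node.children:
            n_eff = c.n_completed + c.n_ongoing      # completed + in-flight
            q = c.value_sum / max(c.n_completed, 1)  # mean return so far
            u = c_explore * math.sqrt(math.log(total) / (n_eff + 1))
            if q + u > best_score:
                best, best_score = c, q + u
        return best

When a simulation is dispatched, n_ongoing is incremented along the selected path; when it returns, n_ongoing is decremented and n_completed and value_sum are updated as in sequential MCTS.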

Aiming to extend the success of the parallel MCTS algorithm to cases where exact environment models are unavailable, I am working on making planning robust to errors in learned environment models. Essentially, this problem boils down to bounding the multi-step prediction error of the environment model, which is extremely hard and in general impossible. Therefore, focusing exclusively on lowering model error is unlikely to be the right solution. Alternatively, quantifying the multi-step prediction error tells us when the model's outputs become useless, which allows for remedies such as using model-free quantities (e.g., the value function) to replace the erroneous part of the rollout trajectory. While Bayesian models such as Bayesian neural networks provide one-step prediction uncertainty, they cannot propagate it in the sequential setting. This motivates the use of the Probabilistic Sentential Decision Diagram (PSDD), a probabilistic graphical model that supports various kinds of efficient inference. PSDDs are able to propagate uncertainty across time steps, which not only allows us to cut off useless model predictions but also improves the quality of multi-step prediction. Although the results are preliminary, I was fascinated by the fact that even though PSDDs are in general less expressive than neural networks, they achieved better performance as environment models.
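
A hedged sketch of the remedy described above, assuming a model that returns a per-step uncertainty estimate (the model interface, the additive accumulation, and the threshold are placeholders; the actual line of work relies on a PSDD to propagate uncertainty properly): roll the learned model forward, and once the accumulated uncertainty crosses a threshold, stop trusting the model and bootstrap with a model-free value estimate.

    def truncated_rollout(model, value_fn, policy, state, horizon,
                          max_uncertainty, gamma=0.99):
        """Model-based rollout that falls back to a model-free value estimate
        once the propagated prediction uncertainty becomes too large.
        `model.step(state, action)` is assumed to return
        (next_state, reward, step_uncertainty)."""
        ret, discount, uncertainty = 0.0, 1.0, 0.0
        for _ in range(horizon):
            action = policy(state)
            next_state, reward, step_unc = model.step(state, action)
            uncertainty += step_unc            # crude stand-in for proper propagation
            if uncertainty > max_uncertainty:  # model output no longer trusted:
                # bootstrap at the last state we still believe in
                return ret + discount * value_fn(state)
            ret += discount * reward
            state = next_state
            discount *= gamma
        return ret + discount * value_fn(state)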

Orthogonal to planning, learning is another crucial yet challenging line of reinforcement learning approaches. One central problem in learning is to construct effective exploration strategies that visit potentially rewarding states sufficiently often, which allows the algorithm to learn better policies from those experiences (i.e., exploitation). In contrast to most exploration approaches, disentangled learning algorithms avoid the tradeoff between exploration and exploitation by separating the two objectives, each maximizing its own benefit. However, naively disentangling the two can lead to inefficient learning or stability issues. Therefore, on top of the disentangled framework, I proposed to enforce key relations between the decoupled components, which results in significantly better performance on both dense- and sparse-reward benchmarks. In collaboration with Professor Van den Broeck and his Ph.D. student Yitao Liang, this work has been submitted to AAMAS 2020.
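
For concreteness, here is a generic sketch of a decoupled setup (not the exact method submitted to AAMAS, and the key relations enforced between the components are deliberately left out): one value head is trained purely on extrinsic reward for exploitation, the other purely on an intrinsic novelty bonus for exploration, and both learn off-policy from the same experience so that neither objective has to be compromised.

    import torch
    import torch.nn as nn

    class DecoupledAgent(nn.Module):
        """Two value heads: one for exploitation, one for exploration."""

        def __init__(self, obs_dim, n_actions, hidden=128):
            super().__init__()
            self.exploit_q = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                           nn.Linear(hidden, n_actions))
            self.explore_q = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                           nn.Linear(hidden, n_actions))

        def act(self, obs):
            # Behavior comes from the exploration head; the exploitation head
            # is learned off-policy and never trades reward against novelty.
            return self.explore_q(obs).argmax(dim=-1)

        def td_targets(self, reward, intrinsic_bonus, next_obs, gamma=0.99):
            # Target networks / stop-gradients are omitted for brevity.
            exploit_t = reward + gamma * self.exploit_q(next_obs).max(dim=-1).values
            explore_t = intrinsic_bonus + gamma * self.explore_q(next_obs).max(dim=-1).values
            return exploit_t, explore_t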

Current research project

One of the promising but under-explored areas in RL is model-based reinforcement learning (MBRL). In addition to learning a policy or value function, MBRL uses a learned transition model to improve learning performance. One well-recognized benefit of MBRL is reduced sample complexity, which greatly lowers the need to interact with the real environment. Moreover, MBRL can even provide a principled exploration strategy.
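
To make the sample-efficiency argument concrete, below is a minimal tabular Dyna-Q sketch (a classical MBRL scheme, not the specific approach of this proposal; the environment interface reset/actions/step is assumed): every real transition also trains a model, and the model then generates extra imagined updates for the value function, so fewer real interactions are needed.

    import random
    from collections import defaultdict

    def dyna_q(env, episodes=100, planning_steps=20, alpha=0.1, gamma=0.99, eps=0.1):
        """Tabular Dyna-Q: direct RL from real transitions plus extra
        updates from transitions replayed out of a learned (memorized) model."""
        Q = defaultdict(float)   # Q[(state, action)]
        model = {}               # model[(state, action)] = (reward, next_state, done)

        def update(s, a, r, s2, done):
            target = r if done else r + gamma * max(Q[(s2, x)] for x in env.actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])

        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                acts = env.actions(s)
                a = (random.choice(acts) if random.random() < eps
                     else max(acts, key=lambda x: Q[(s, x)]))
                s2, r, done = env.step(s, a)
                update(s, a, r, s2, done)        # direct RL from the real transition
                model[(s, a)] = (r, s2, done)    # model learning
                for _ in range(planning_steps):  # planning with imagined transitions
                    (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                    update(ps, pa, pr, ps2, pdone)
                s = s2
        return Q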

However, learning a world model is hard, and it is even harder to get things right in MBRL. In most cases, even if a low-error world model is available, directly planning with it produces a bad policy, for three main reasons: (i) small model errors accumulate during rollouts and eventually yield “garbage” states; (ii) most world models (e.g., deep neural networks) learn the expected next state, which is not necessarily a valid state (the average of two valid next states need not itself be valid); (iii) world models are not aware of their own uncertainty.

Among the three problems, I think (i) is generally unavoidable. Even if we can sometimes cut off large errors, we are generally unable to learn a model accurate enough to have a small multi-step error, let alone account for the effect of insufficient exploration (lack of training data) in some regions. Therefore, (ii) and (iii) should be the main focus of MBRL. In the literature, many researchers tackle these two problems mainly with deep neural networks. However, since neural networks sacrifice much on the inference side (e.g., well-defined and calibrated uncertainty) in exchange for expressiveness, they do not solve (ii) and (iii) well.

I was then attracted to tractable probabilistic models (TPMs), probabilistic models that enable polynomial-time (in most cases, linear-time) computation of marginals, MAP states, expectations, etc. Indeed, they are not as expressive as neural networks, but solving MBRL purely by increasing model expressiveness seems impractical. One apparent reason is that such expressive models need a tremendous amount of data, which opposes the principle of MBRL: increasing sample efficiency. I am working on taking one step forward toward solving MBRL using Probabilistic Sentential Decision Diagrams (PSDDs), a TPM that enables tractable computation of many advanced queries (e.g., moments, KL divergence). It is exciting that we may be able to benefit from inaccurate world models in general tasks.
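
As a toy illustration of why tractable queries matter here (a deliberately simplified stand-in, not a PSDD: the next-state bits are treated as independent and the reward as linear in them), exact expectation queries let us compute quantities such as the expected reward of the next state in closed form, instead of estimating them from noisy samples of an already inaccurate model.

    import random

    def exact_expected_reward(bit_probs, weights, bias=0.0):
        """E[reward] under the model, computed exactly in linear time."""
        return bias + sum(w * p for w, p in zip(weights, bit_probs))

    def sampled_expected_reward(bit_probs, weights, bias=0.0, n_samples=1000):
        """The same quantity estimated from Monte Carlo samples of the model."""
        total = 0.0
        for _ in range(n_samples):
            bits = [1 if random.random() < p else 0 for p in bit_probs]
            total += bias + sum(w * b for w, b in zip(weights, bits))
        return total / n_samples

    bit_probs = [0.9, 0.2, 0.5]   # model's marginals for each next-state bit
    weights = [1.0, -2.0, 0.5]    # illustrative reward weights per bit
    print(exact_expected_reward(bit_probs, weights))    # exact, zero variance
    print(sampled_expected_reward(bit_probs, weights))  # noisy estimate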