# Value iteration networks

Value Iteration Networks, Tamar et al., NIPS 2016

‘Value Iteration Networks’ won a best paper award at NIPS 2016. It tackles two of the hot issues in reinforcement learning at the moment: incorporating longer range planning into the learned strategies, and improving transfer learning from one problem to another. It’s two for the price of one, as both of these challenges are addressed by an architecture that learns to plan.

In the grid-world domain shown below, a standard reinforcement learning network, trained on several instances of the world, may still have trouble generalizing to a new unseen domain (right-hand image).

(This setup is very similar to the maze replanning challenge in ‘Strategic attentive writer for learning macro actions‘ from the Google DeepMind team that we looked at earlier this year. Both papers were published at the same time).

… as we show in our experiments, while standard CNN-based networks can be easily trained to solve a set of such maps, they do not generalize well to new tasks outside this set, because they do not understand the goal-directed nature of the behavior. This observation suggests that the computation learned by reactive policies is different from planning, which is required to solve a new task.

Planning is not a new problem – the value iteration algorithm based on Markov decision processes (MDP) has been known since 1957! What Tamar et al. do in this work though, is embed a value iteration (VI) planning component inside the overall neural network architecture. And the breakthrough insight is that the VI algorithm itself can be encoded by a specific type of CNN, which means it is differentiable.

By embedding such a VI network module inside a standard feed-forward classification network, we obtain an NN model that can learn the parameters of a planning computation that yields useful predictions. The VI block is differentiable, and the whole network can be trained using standard backpropagation.

It really is pretty cool – you give the network the machinery that can be used for planning, and it figures out all by itself the best way to use it.

Using this approach, Tamar et al. show that value iteration networks (VINs) generalize better to new grid-world scenarios than either CNNs following the DQN architecture, or fully convolutional networks (FCNs):

(Note there is no comparison to the contemporary STRAW architecture from the DeepMind team that also extends DQNs with planning).

Importantly, note that the prediction loss for the reactive policies is comparable to the VINs, although their success rate is significantly worse. This shows that this is not a standard case of overfitting/underfitting of the reactive policies. Rather, VIN policies, by their VI structure, focus prediction errors on less important parts of the trajectory, while reactive policies do not make this distinction, and learn the easily predictable parts of the trajectory yet fail on the complete task.

They also demonstrated planning success using Mars landscape images for Mars Rover navigation, planning in a physical simulation setting, and planning in the WebNav setting which requires navigating links of a web site towards a goal page.

What I’d love to see is how well the VIN architecture performs on the Frostbite Challenge.

Let’s take a closer look at how it all works, starting with the value iteration algorithm itself, then how to encode that in a NN, before finally putting it all together in a complete architecture.

### Standard value iteration

“A standard model for sequential decision making and planning is the Markov Decision Process (MDP).”

You have a set of states $s \in S$, a set of actions $a \in A$, a reward function $R(s,a)$ that gives the anticipated reward for taking action $a$ in state $s$, and a transition kernel $P(s'|s,a)$ that encodes the probability of the next state given the current state and action. A policy $\pi(a|s)$ prescribes the action distribution for each state.

(Note the similarity between this structure and the action matrix of STRAW).

The goal in an MDP is to find a policy that obtains high rewards in the long term.

You can consider the value of a state under some policy as the expected discounted sum of rewards when starting from that state and following the policy. An optimal policy achieves the maximal long-term return possible from a given state. Value iteration computes these values by repeatedly applying the following update ($\gamma \in (0,1)$ is a discount factor):
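The update referred to here, written in its standard form (equivalent to equation (1) in the paper):

$$Q_n(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_n(s'), \qquad V_{n+1}(s) = \max_a Q_n(s,a)$$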

### Encoding value iteration in a neural network

Our starting point is the VI algorithm (1). Our main observation is that each iteration of VI may be seen as passing the previous value function Vn and reward function R through a convolution layer and max-pooling layer. In this analogy, each channel in the convolution layer corresponds to the Q-function for a specific action, and convolution kernel weights correspond to the discounted transition probabilities. Thus by recurrently applying a convolution layer K times, K iterations of VI are effectively performed.

This idea leads to the following network structure:

A reward ‘image’ $\bar{R}$ (to follow the more usual CNN formulation of working with images) is fed into a convolutional layer $\bar{Q}$ with $\bar{A}$ channels. Each channel corresponds to $\bar{Q}(\bar{s},\bar{a})$ for an action $\bar{a}$. The layer is max-pooled along the actions channel to produce the next-iteration value function layer. This is stacked with the reward $\bar{R}$ and fed back in K times, to perform K iterations of value iteration.
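As a rough illustration, here is a minimal NumPy sketch of that recurrence (my own code, not the authors’ Theano implementation; for simplicity each action has one kernel for the reward map and one for the value map, standing in for the discounted transition probabilities):

```python
import numpy as np

def conv2d_same(image, kernel):
    """'Same'-padded 2D convolution; one channel of the Q layer."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def vi_module(r_bar, w_r, w_v, k_iterations):
    """One VI block.

    r_bar        : reward 'image', shape (H, W)
    w_r, w_v     : per-action kernels applied to the reward and value maps,
                   shape (A, 3, 3); they play the role of discounted
                   transition probabilities
    k_iterations : number of VI iterations (K) to unroll
    Returns the approximate value map after K iterations.
    """
    v_bar = np.zeros_like(r_bar, dtype=float)
    for _ in range(k_iterations):
        # Each action channel of Q-bar is a convolution over R-bar and V-bar.
        q_bar = np.stack([conv2d_same(r_bar, w_r[a]) + conv2d_same(v_bar, w_v[a])
                          for a in range(w_r.shape[0])])
        # Max-pool across the action channels to obtain the next value map.
        v_bar = q_bar.max(axis=0)
    return v_bar
```

In the real VIN these kernels are not hand-specified; they are learned end-to-end by backpropagating through the K unrolled iterations.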

### The full Value Iteration Network model

The value-iteration module we just described can now be embedded into a full value iteration network as follows:

In many systems, if you’re in a given state and take a given action, the set of possible next states is much smaller than the overall universe of states. More precisely, the set of states for which $\bar{P}(\bar{s'}|\bar{s},\bar{a}) > 0$ is a small subset of $\bar{S}$.

In NN terminology, this is a form of attention, in the sense that for a given label prediction (action), only a subset of the input features (value function) is relevant. Attention is known to improve learning performance by reducing the effective number of network parameters during learning.

This is the purpose of the attention module added into the feedback loop in the diagram above. With the inclusion of the CNN-based value iteration module, everything in the value iteration network is differentiable:

This allows us to treat the planning module as just another NN, and by back-propagating through it, we can train the whole policy end-to-end.

To implement a VIN, you need to specify the state and action spaces for the planning module ($\bar{S}$ and $\bar{A}$), the reward and transition functions $f_R$ and $f_P$, and the attention function. The authors call this process VIN design.

Once a VIN design is chosen, implementing the VIN is straightforward, as it is simply a form of CNN. The networks in our experiments all required only several lines of Theano code.

# [Repost] Reinforcement Learning Series, Part 9: Deep Q Network (DQN)

### 2. The Deep Q Network (DQN) algorithm

The motivation for Experience Replay is: (1) a deep neural network, used here as a supervised learning model, requires its training data to be independent and identically distributed; (2) but the samples produced by Q-learning are correlated with one another over time. To break this correlation, Experience Replay stores the transitions and then samples from that store.
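A minimal sketch of such a store-and-sample buffer (my own illustration, not DeepMind’s code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions and
    samples them uniformly at random, which breaks the temporal correlation
    between consecutive samples."""

    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))  # columns: states, actions, rewards, next_states, dones
```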

In early 2015 DeepMind published a paper in Nature that introduced the concept of a target Q network, which breaks the correlations in the data even further. The idea of Target Q is to compute the target values with an older copy of the network weights, $w^-$; below is the optimization objective for Q-learning with a target network.

$$J = \min \left( r + \gamma \max_{a'} Q(s', a', w^-) - Q(s, a, w) \right)^2$$

### 3. Subsequent developments

DQN was the first model to successfully combine deep learning with reinforcement learning, and it inspired a long line of follow-up work. The best known of these are Double DQN, Prioritized Replay, and the Dueling Network.

#### 3.1 Double DQN

As far back as 1993, Thrun and Schwartz observed the over-optimism (overestimation) phenomenon in Q-learning [1] and pointed out that it is caused by the max operation in the Q-learning update. Let $Q_{\text{target}}(s,a)$ be the target Q-value; because we use value function approximation, $Q_{\text{approx}}$ is the approximate Q-value; and let $Y$ be the error between the approximate and target values, i.e.

$$Q_{\text{approx}}(s,a) = Q_{\text{target}}(s,a) + Y_{s,a}$$

A Q-learning update step updates all of the Q-values, and the difference between the approximate and target update values is then

$$
\begin{aligned}
Z_{s,a} &= \left( r_{s,a} + \gamma \max_{a_1} Q_{\text{approx}}(s, a_1) \right) - \left( r_{s,a} + \gamma \max_{a_2} Q_{\text{target}}(s, a_2) \right) \\
&= \gamma \max_{a_1} Q_{\text{approx}}(s, a_1) - \gamma \max_{a_2} Q_{\text{target}}(s, a_2) \\
&\geq \gamma \left( Q_{\text{approx}}(s, a^*) - Q_{\text{target}}(s, a^*) \right) = \gamma\, Y_{s, a^*} \qquad (1)
\end{aligned}
$$

where $a^* = \arg\max_{a_2} Q_{\text{target}}(s, a_2)$.
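Because of the max, $Z$ is biased upwards even when the noise $Y$ has zero mean; Double DQN [2] reduces this over-optimism by decoupling action selection from action evaluation. A hedged sketch of the two target computations (the function and argument names are mine):

```python
import numpy as np

def dqn_target(reward, q_next_target, gamma=0.99):
    """Standard (target-network) DQN target: the same network both selects
    and evaluates the next action, which is the source of the upward bias."""
    return reward + gamma * np.max(q_next_target)

def double_dqn_target(reward, q_next_online, q_next_target, gamma=0.99):
    """Double DQN target: the online network selects the action,
    the target network evaluates it."""
    best_action = int(np.argmax(q_next_online))
    return reward + gamma * q_next_target[best_action]
```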

#### 3.2 Prioritized Replay

DQN uses Experience Replay: the samples obtained as the system explores the environment are stored, and samples are then drawn from this store to update the model parameters. A common refinement is to change the sampling probabilities. Prioritized Replay [3] adopts exactly this strategy, using the TD error as the criterion for sampling.

$$\text{TD-err} = \left| r_{s,a} + \gamma \max_{a'} Q(s', a') - Q(s, a) \right| \qquad (2)$$
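As a rough sketch of proportional prioritization (the exponent `alpha`, the epsilon offset, and the names are my illustrative assumptions; the paper [3] also adds importance-sampling corrections not shown here):

```python
import numpy as np

def sample_indices(td_errors, batch_size=32, alpha=0.6, eps=1e-6):
    """Sample transition indices with probability proportional to |TD error|^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```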

#### 3.3 Dueling Network

Baird proposed in 1993 decomposing the Q-value into a value term (Value) and an advantage term (Advantage) [4].

$$Q(s,a) = V(s) + A(s,a)$$

Wang Z. applied this idea to deep reinforcement learning and proposed the network structure below [5].
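The figure itself is not reproduced in this repost; as a stand-in, here is a sketch of the aggregation its final layer performs, using the mean-subtracted form from [5] (the two-stream linear head and the names are simplifications of mine):

```python
import numpy as np

def dueling_q(features, w_value, w_adv):
    """Two-stream dueling head: one linear stream estimates the state value V(s),
    the other the advantages A(s, a). Subtracting the mean advantage keeps the
    two streams identifiable."""
    value = features @ w_value        # scalar V(s)
    advantages = features @ w_adv     # one advantage per action, A(s, a)
    return value + (advantages - advantages.mean())
```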

The Dueling Network is a deep network architecture. It can be combined with the methods introduced earlier, such as Experience Replay, Double DQN, and Prioritized Replay. In the paper the authors report that the combination of the Dueling Network with Prioritized Replay works best.

### 4. Summary

[1] Thrun, S. and Schwartz, A. Issues in using function approximation for reinforcement learning. In Mozer, M., Smolensky, P., Touretzky, D., Elman, J., and Weigend, A. (eds.), Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ, 1993. Lawrence Erlbaum.
[2] Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. CoRR, abs/1509.06461, 2015.
[3] Schaul, T., Quan, J., Antonoglou, I., et al. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
[4] Baird, L.C. Advantage updating. Technical Report WL-TR-93-1146, Wright-Patterson Air Force Base, 1993.
[5] Wang, Z., de Freitas, N., and Lanctot, M. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.

# AI in 2016: an unprecedented feat of technology marketing

December 29, 2016 will probably go down as another date for the history books. A Go AI called SkyNet (no, sorry, “Master”) began its first rout of human players.

“Artificial intelligence” is a buzzword that is often given different meanings to suit marketing or news coverage; its scope is sometimes equated with “machine learning” and sometimes not, so the outermost equals sign in the diagram is not entirely accurate. Still, the “AI” that was widely discussed in 2016 can basically be taken to mean machine learning. The four inner circles are concepts with well-defined academic scope; they represent the four most important problem areas today and are the key concepts to keep straight.

The cat-face image generated by unsupervised learning by Andrew Ng’s team in 2012. Image source: New York Times

Images of small animals generated by a GAN. Image source: Ian Goodfellow, NIPS 2016 Tutorial: Generative Adversarial Networks

An AI beating humans at Go is, in itself, a great achievement. Yet the academic Great Leap Forward and the startup fever propelled by this great wave have produced many strange phenomena.

“At the US Go Congress in August, I was lucky enough to meet AlphaGo’s main contributors Aja Huang and Fan Hui. I told them that we had used roughly 80 to 90 GPUs to train our model, and asked whether I could say in my talk that we had used one percent of AlphaGo’s GPUs. Aja smiled mysteriously and said: I can’t give the exact number. But perhaps it’s less than one percent.”

# [Repost] [David Silver’s Reinforcement Learning Course, Lecture 1] An Introduction to Reinforcement Learning

[Please credit the source when reposting] chenrudan.github.io

#### 3. Some variables

The history is the sequence of all actions, observations, and rewards: $H_t = A_1, O_1, R_1, \ldots, A_t, O_t, R_t$.

The environment state, $S^e_t$, is the environment’s current state; it reflects what has changed in the environment. An important point is that the environment’s own state and the state it feeds back to the agent are not necessarily the same. For example, while a robot is walking, the current environment state is a definite position, but its camera can only capture the surrounding scenery and cannot tell the agent the exact position; the captured image can be regarded as an observation of the environment. In other words, the agent does not always know how the environment has changed; it can only see a rendering of the result after the change.

The agent state, $S^a_t$, is the agent’s representation of its current situation; it can be any function of the history.

The information (Markov) state contains all the useful information in the history. A state $S_t$ is Markov if the next state depends only on the current state and not on past states. From this definition one can see that the environment state is Markov (if this concept is unclear you can set it aside for now).

#### 4. The components of an agent

Policy: it decides the action to take based on the current observation, i.e. it is a mapping from states to actions. It comes in two forms: a deterministic policy, $a = \pi(s)$, which always takes a particular action $a$ in state $s$; and a stochastic policy, $\pi(a|s) = p[A_t = a \mid S_t = s]$, the probability of taking action $a$ in state $s$.

Value function: it predicts the expected reward obtainable in the future from the current state, $V_\pi(s) = E_\pi[R_{t+1} + \gamma R_{t+2} + \ldots \mid S_t = s]$, and is used to measure how good the current state is.

Model: it predicts how the environment will change next, and hence what state or reward the agent will receive. There are therefore two kinds of model: a transition model that predicts the next state, $\mathcal{P}^a_{ss'} = p[S_{t+1} = s' \mid S_t = s, A_t = a]$, and a reward model that predicts the next reward, $\mathcal{R}^a_s = E[R_{t+1} \mid S_t = s, A_t = a]$.

# Dissecting DeepMind’s Deep Reinforcement Learning technology

Two years ago, a small company in London called DeepMind uploaded their pioneering paper “Playing Atari with Deep Reinforcement Learning” to arXiv. In this paper they demonstrated how a computer learned to play Atari 2600 video games by observing just the screen pixels and receiving a reward when the game score increased. The result was remarkable, because the games and the goals in every game were very different and designed to be challenging for humans. The same model architecture, without any change, was used to learn seven different games, and in three of them the algorithm performed even better than a human!

It has been hailed since then as the first step towards general artificial intelligence – an AI that can survive in a variety of environments, instead of being confined to strict realms such as playing chess. No wonder DeepMind was immediately bought by Google and has been on the forefront of deep learning research ever since. In February 2015 their paper “Human-level control through deep reinforcement learning” was featured on the cover of Nature, one of the most prestigious journals in science. In this paper they applied the same model to 49 different games and achieved superhuman performance in half of them.

Still, while deep models for supervised and unsupervised learning have seen widespread adoption in the community, deep reinforcement learning has remained a bit of a mystery. In this blog post I will be trying to demystify this technique and understand the rationale behind it. The intended audience is someone who already has background in machine learning and possibly in neural networks, but hasn’t had time to delve into reinforcement learning yet.

1. What are the main challenges in reinforcement learning? Here we discuss the credit assignment problem and the exploration-exploitation dilemma.
2. How can reinforcement learning be formalized mathematically? We define the Markov Decision Process and use it to reason about reinforcement learning.
3. How do we form long-term strategies? We define the discounted future reward, which is the basis for the algorithms in the following sections.
4. How can we estimate or approximate the future reward? The simple table-based Q-learning algorithm is defined and analyzed.
5. What if the state space is enormous? The Q-table can then be replaced with a (deep) neural network.
6. How do we make the model actually work? Experience replay is used to stabilize the learning of the neural network.
7. Is that all? Finally, we look at some simple solutions to the exploration-exploitation problem.

# Reinforcement learning

Consider the game Breakout. In this game you control a paddle at the bottom of the screen and have to bounce the ball back to clear all the bricks in the upper half of the screen. Each time you hit a brick, it disappears and your score increases – you get a reward.

Suppose you want to teach a neural network to play this game. Input to your network would be screen images, and output would be three actions: left, right or fire (to launch the ball). It would make sense to treat it as a classification problem – for each game screen you have to decide whether you should move left, right or press fire. Sounds straightforward? Sure, but then you need training examples, and lots of them. Of course you could go and record game sessions using expert players, but that’s not really how we learn. We don’t need somebody to tell us a million times which move to choose at each screen. We just need occasional feedback that we did the right thing and can then figure out everything else ourselves.

This is the task reinforcement learning tries to solve. Reinforcement learning lies somewhere in between supervised and unsupervised learning. Whereas in supervised learning one has a target label for each training example and in unsupervised learning one has no labels at all, in reinforcement learning one has sparse and time-delayed labels – the rewards. Based on those rewards, the agent must learn how to behave in the environment.

# Discounted Future Reward

To perform well in the long-term, we need to take into account not only the immediate rewards, but also the future rewards we are going to get. How should we go about that?

Given one run of the Markov decision process, we can easily calculate the total reward for one episode:

Given that, the total future reward from time point t onward can be expressed as:

But because our environment is stochastic, we can never be sure that we will get the same rewards the next time we perform the same actions. The further into the future we go, the more it may diverge. For that reason it is common to use the discounted future reward instead:

Here γ is the discount factor between 0 and 1 – the further into the future the reward is, the less we take it into consideration. It is easy to see that the discounted future reward at time step t can be expressed in terms of the same quantity at time step t+1:
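The equations referenced in the last few paragraphs were images in the original post; in their standard form they are:

$$
\begin{aligned}
R &= r_1 + r_2 + \cdots + r_n \\
R_t &= r_t + r_{t+1} + \cdots + r_n \\
R_t &= r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{\,n-t} r_n \\
R_t &= r_t + \gamma\, R_{t+1}
\end{aligned}
$$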

If we set the discount factor γ=0, then our strategy will be short-sighted and rely only on immediate rewards. If we want to balance between immediate and future rewards, we should set the discount factor to something like γ=0.9. If our environment is deterministic and the same actions always result in the same rewards, then we can set the discount factor to γ=1.

A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward.

# Q-learning

In Q-learning we define a function Q(s, a) representing the maximum discounted future reward when we perform action a in state s, and continue optimally from that point on.

The way to think about Q(s, a) is that it is “the best possible score at the end of the game after performing action a in state s”. It is called the Q-function, because it represents the “quality” of a certain action in a given state.

This may sound like quite a puzzling definition. How can we estimate the score at the end of the game, if we know just the current state and action, and not the actions and rewards coming after that? We really can’t. But as a theoretical construct we can assume the existence of such a function. Just close your eyes and repeat to yourself five times: “Q(s, a) exists, Q(s, a) exists, …”. Feel it?

If you’re still not convinced, then consider what the implications of having such a function would be. Suppose you are in state s, pondering whether you should take action a or b. You want to select the action that results in the highest score at the end of the game. Once you have the magical Q-function, the answer becomes really simple – pick the action with the highest Q-value!
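In symbols, that greedy rule (the equation the next sentence refers to) is:

$$\pi(s) = \arg\max_{a} Q(s, a)$$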

Here π represents the policy, the rule for how we choose an action in each state.

OK, how do we get that Q-function then? Let’s focus on just one transition <s, a, r, s’>. Just like with discounted future rewards in the previous section, we can express the Q-value of state s and action a in terms of the Q-value of the next state s’.
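Written out (the image of the equation is not reproduced here):

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$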

This is called the Bellman equation. If you think about it, it is quite logical – maximum future reward for this state and action is the immediate reward plus maximum future reward for the next state.

The main idea in Q-learning is that we can iteratively approximate the Q-function using the Bellman equation. In the simplest case the Q-function is implemented as a table, with states as rows and actions as columns. The gist of the Q-learning algorithm is as simple as the following [1]:
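The pseudocode image is not reproduced in this repost; a minimal tabular sketch of the same loop might look like this (the `env` interface with `reset()` and `step()` is an assumption of mine):

```python
import random
import numpy as np

def q_learning(env, num_states, num_actions, episodes=1000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: nudge Q[s, a] toward r + gamma * max_a' Q[s', a']."""
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection (exploration is discussed later).
            if random.random() < epsilon:
                a = random.randrange(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```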

α in the algorithm is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account. In particular, when α=1, the two Q[s,a] terms cancel and the update is exactly the same as the Bellman equation.

The maxa’ Q[s’,a’] that we use to update Q[s,a] is only an approximation, and in the early stages of learning it may be completely wrong. However, the approximation gets more and more accurate with every iteration, and it has been shown that if we perform this update enough times, the Q-function will converge and represent the true Q-value.

# Deep Q Network

The deep neural network architecture used by DeepMind is as follows:

1. Do a forward pass for the current state s to obtain predicted Q-values for all actions.
2. Do a forward pass for the next state s’ and compute the maximum over the network outputs, max_{a’} Q(s’, a’).
3. Set the Q-value target for the chosen action to r + γ max_{a’} Q(s’, a’), using the max computed in step 2. For all other actions, set the Q-value target to the values returned in step 1, so that the error for those outputs is zero.
4. Update the weights using backpropagation (a sketch of this whole update step follows below).
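A hedged sketch of those four steps for a single transition, assuming a generic `model` with Keras-style `predict` and `fit` methods (the interface and names are assumptions of mine, not DeepMind’s code):

```python
import numpy as np

def dqn_train_step(model, transition, gamma=0.99):
    """One update for a single (state, action, reward, next_state, done) transition."""
    state, action, reward, next_state, done = transition

    # Step 1: forward pass on the current state -> predicted Q-values for all actions.
    q_values = model.predict(state[np.newaxis])[0]

    # Step 2: forward pass on the next state -> max_a' Q(s', a').
    max_q_next = np.max(model.predict(next_state[np.newaxis])[0])

    # Step 3: the target equals the network's own outputs for every action except
    # the one taken, so the error is zero everywhere else.
    target = q_values.copy()
    target[action] = reward if done else reward + gamma * max_q_next

    # Step 4: backpropagate the error between prediction and target.
    model.fit(state[np.newaxis], target[np.newaxis], verbose=0)
```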

# Exploration-Exploitation

Q-learning attempts to solve the credit assignment problem: it propagates rewards back in time until they reach the crucial decision point that actually caused the reward. But we have not yet addressed the exploration-exploitation dilemma…

# The Deep Q-learning algorithm

DeepMind actually used several more tricks to make the system work better, such as a target network, error clipping, and reward clipping, which we will not cover here.

# Final notes

Many improvements to deep Q-learning have been proposed since its first introduction – Double Q-learning, Prioritized Experience Replay, Dueling Network Architecture and extension to continuous action space to name a few. For the latest advancements check out the NIPS 2015 deep reinforcement learning workshop and ICLR 2016 (search for “reinforcement” in title). But beware that deep Q-learning has been patented by Google.

It is often said that artificial intelligence is something we haven’t figured out yet. Once we know how it works, it doesn’t seem intelligent any more. But deep Q-networks still continue to amaze me. Watching them figure out a new game is like observing an animal in the wild – a rewarding experience by itself.

# Credits

Thanks to Ardi Tampuu, Tanel Pärnamaa, Jaan Aru, Ilya Kuzovkin, Arjun Bansal and Urs Köster for comments and suggestions on the drafts of this post.