# DeepMind Could Bring The Best News Recommendation Engine

Reinforcement Learning, a key Google DeepMind algorithm, could overhaul news recommendation engines and greatly improve user stickiness. After beating a Go grandmaster, the technique could become the engine of choice for true personalization.

Demis Hassabis, DeepMind's founder and CEO, is a great communicator who gives captivating lectures; this Oxford University one, delivered on February 24th, is among the best. The 40-year-old, a PhD in Cognitive Neuroscience and Computer Science graduate who passed through MIT and Harvard, offers this explanation of his work:

“The core of what we do focuses around what we call Reinforcement Learning. And that’s how we think about intelligence at DeepMind.

[Hassabis then shows the following diagram]

We start with the agent system, the A.I. That agent finds itself in some kind of environment, trying to achieve a goal. In a real-world environment, the agent could be a robot or, in a virtual environment, an avatar.

The agent interacts with the environment in two ways. First, through observations via its sensory apparatus. We currently use vision, but we are starting to think about other modalities.

One of the jobs of the agent system is to build the best possible model of the environment out there, based only on these incomplete and noisy observations that [the agent] is receiving in real time. And it keeps updating its model in the face of new evidence.

Once it has built this model, the second job of the agent is to make predictions about what is going to happen next. If you can make predictions, you can start planning what to do. So if the agent is trying to achieve a goal, it will have a set of actions available. The decision-making problem is to pick which action would be best to take toward the goal.

Once the agent has decided that based on its model, and its planned trajectories, it executes actions that may or may not make some changes in the environment, and that drives the observations…”

Reinforcement Learning is a highly complex process. First, the observed environment is noisy, incomplete, and largely made of unstructured data. When DeepMind decided to tackle basic Atari games like Breakout and Pong, the input was nothing but raw pixels, and the output was predictions — the likely target position — and then actions — paddle placement. All of this aimed at maximizing the subsequent reward: survival and score. After a few hundred games, the machine was able to devise, on its own, creative strategies that surprised even its creators (read here, or view this video at time code 10:27).

Over time, the tests will migrate to more complex environments, such as 3D games in which it becomes harder to distinguish the pattern of a wall from a useful piece of information.

A rather challenging signal-to-noise environment

DeepMind’s future goals involve dealing with very large and complex sets of data such as genomics, climate, energy or macroeconomics.

Regardless of the nature of the input stream, the principle is roughly the same. The A.I. system relies on a deep neural network to filter raw sensory data and form meaningful patterns to be analyzed. It then builds an optimized statistical model, updates it in real time, and derives the best possible actions from the set of observations available at a given moment. Then the whole system loops back.
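The observe, model, predict, act loop described above can be sketched with a toy bandit-style agent. This is an illustrative sketch only (the environment, payoff numbers and class names are hypothetical, not DeepMind's actual system):

```python
import random

class ToyEnvironment:
    """A 2-armed bandit: action 1 pays off more often than action 0."""
    PAYOFF = [0.2, 0.8]  # probability of reward for each action

    def step(self, action):
        return 1.0 if random.random() < self.PAYOFF[action] else 0.0

class Agent:
    """Keeps a running value estimate per action (its 'model' of the
    environment) and picks the action predicted to maximize reward."""
    def __init__(self, n_actions, epsilon=0.1):
        self.values = [0.0] * n_actions   # model: estimated reward per action
        self.counts = [0] * n_actions
        self.epsilon = epsilon

    def act(self):
        if random.random() < self.epsilon:                     # explore
            return random.randrange(len(self.values))
        return max(range(len(self.values)),                    # exploit prediction
                   key=self.values.__getitem__)

    def observe(self, action, reward):
        # incremental mean: update the model with the new evidence
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]

random.seed(0)
env, agent = ToyEnvironment(), Agent(n_actions=2)
for _ in range(2000):
    a = agent.act()            # action
    r = env.step(a)            # change in the environment
    agent.observe(a, r)        # observation feeds back into the model
```

After a few thousand loops the agent's internal model ranks the rewarding action above the other one, which is the reward-maximization behavior the article describes.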

How does this connect to improving news production?

Before we get into this, let’s roll back a little bit.

For a news production system, recommending related stories or videos is the most efficient way to increase reader engagement. For media that rely on advertising, sold on CPM or on a per-click basis, raising the number of page views per reader has a direct impact on ad revenue. Paid-for media are less sensitive to page views, but reminding readers of the breadth and depth of an editorial production is a key contributor to a news brand’s status — and a way to underline its economic value.

But there is a problem: today’s news recommendation systems are often terrible.

To display related stories or videos, publishers unwilling to invest in smart, fine-tuned systems have settled for engines based on crude semantic analysis of content. Hence embarrassing situations: a piece about a major pedophile cover-up by the French clergy will trigger suggestions about child care; a story on the state of an intellectual debate will suggest a piece on spelling, or another about waste management. The worst are stories automatically proposed by Outbrain or Taboola and picked up all over the web: not only are they the same everywhere, but they tap into the same endless pool of click-bait items. The only virtue of these two systems is the direct cash-back for the publisher.

Something needs to be done to improve recommendation systems. A.I. and Reinforcement Learning offer a promising path.

In Demis Hassabis’ demonstration, the important words are: Environment, Observations, Models, Predictions and Actions. Let’s consider these keywords in the context of news production and consumption.

The Environment is dual. The external side is the general news cycle. At any given moment, automatically assessing a topic’s weight is reasonably easy, but detecting low-noise signals, even though they’re critical to predicting how the news cycle will evolve, is much trickier. As for the internal news environment, it is simply the output of the various content produced (or curated) by the newsroom.

Observations are multiple: they include the vast range of available analytics, again at two levels: how a piece of content is faring in general (against the background noise or the competition), and how each individual reader behaves. Here, things become interesting.

Models are then fed with a mix of statistical and behavioral data, such as: “Stories [x] containing [semantic footprint] perform well against context [of this type of information].” Or: “Reader #453.09809 is currently interested in [topics], but she has on her radar this [low-noise topic] that is small but growing.”

Predictions detect both the content and the topics that have the best lift in the news cycle and pique the reader’s interest, dynamically and in real time.

Actions will then range from putting stories or videos in front of the whole audience to, more specifically, targeting the individual reader. Personalization will shift from passive (the system serves stories based on a presumed and generally static reader profile) to dynamic, based on current and predicted interest.

Demis Hassabis makes clear that enhanced personalization is on DeepMind’s roadmap:

“Personalization doesn’t work very well. It currently amounts to averaging the crowd as opposed to adapting to the individual human.”

It would be unrealistic for a news outlet to develop such an A.I.-based recommendation engine on its own, but we can easily foresee companies already working on A.I. selling it as SaaS (Software as a Service).

A new generation of powerful recommendation engines could greatly benefit the news industry. It would ensure much higher reader loyalty, reinforce brand trust (by recommending stories that are both good and relevant), and help build deeper, more consistent news packages while giving new life to archives.

Who will jump on the opportunity? Probably those most willing to invest in tech. I would guess Buzzfeed and the Jeff Bezos-owned Washington Post.

— frederic.filloux@mondaynote.com

### (1) How FTRL works:

The FTRL algorithm combines the respective strengths, and compensates for the weaknesses, of FOBOS and RDA in handling the gradient term and the regularization terms. Its feature-weight update rule is:

$W^{(t+1)}=argmin_{W}\{G^{(1:t)}\cdot W+\lambda_{1}||W||_{1}+\frac{\lambda_{2}}{2}||W||_{2}^{2}+\frac{1}{2}\sum_{s=1}^{t}\sigma^{(s)}||W-W^{(s)}||_{2}^{2}\}$

Expanding the quadratic terms and dropping the constants that do not depend on $W$, the objective is equivalent to:

$(G^{(1:t)}-\sum_{s=1}^{t}\sigma^{(s)}W^{(s)})\cdot W + \lambda_{1}||W||_{1}+\frac{1}{2}(\lambda_{2}+\sum_{s=1}^{t}\sigma^{(s)})||W||_{2}^{2}$

Writing $Z^{(t)}=G^{(1:t)}-\sum_{s=1}^{t}\sigma^{(s)}W^{(s)}$, the update becomes:

$W^{(t+1)}=argmin_{W}\{Z^{(t)}\cdot W + \lambda_{1}||W||_{1}+\frac{1}{2}(\lambda_{2}+\sum_{s=1}^{t}\sigma^{(s)})||W||_{2}^{2}\}$

This objective decomposes coordinate-wise, so each weight can be solved for independently:

$w_{i}^{(t+1)}=argmin_{w_{i}}\{z_{i}^{(t)}w_{i}+\lambda_{1}|w_{i}|+\frac{1}{2}(\lambda_{2}+\sum_{s=1}^{t}\sigma^{(s)})w_{i}^{2}\}$

with the closed-form solution:

$w_{i}^{(t+1)}=0, \text{ if } |z_{i}^{(t)}|<\lambda_{1}$

$w_{i}^{(t+1)}=-(\lambda_{2}+\sum_{s=1}^{t}\sigma^{(s)})^{-1}\cdot(z_{i}^{(t)}-\lambda_{1}\cdot sgn(z_{i}^{(t)})) \text{ otherwise.}$

### (2) The learning rate

FTRL uses a per-coordinate learning rate:

$\eta_{i}^{(t)}=\alpha/(\beta+\sqrt{\sum_{s=1}^{t}(g_{i}^{(s)})^{2}})$

and $\sigma^{(s)}$ is chosen so that the coefficients telescope to the inverse learning rate:

$\sum_{s=1}^{t}\sigma_{i}^{(s)}=1/\eta_{i}^{(t)}=(\beta+\sqrt{\sum_{s=1}^{t}(g_{i}^{(s)})^{2}})/\alpha$

### (3) The FTRL algorithm

FTRL Algorithm

(1) Input $\alpha, \beta, \lambda_{1},\lambda_{2}$; initialize $W\in\mathbb{R}^{N}, Z=0\in\mathbb{R}^{N}, Q=0\in\mathbb{R}^{N}$

(2) for $t=1,2,3,\ldots$
$G=\nabla_{W}\ell(W,X^{(t)},Y^{(t)})$
for $i=1,2,\ldots,N$
$\sigma_{i}=\alpha^{-1}(\sqrt{q_{i}+g_{i}^{2}}-\sqrt{q_{i}})$ // equals $(\eta_{i}^{(t)})^{-1}-(\eta_{i}^{(t-1)})^{-1}$
$q_{i}=q_{i}+g_{i}^{2}$
$z_{i}=z_{i}+g_{i}-\sigma_{i}w_{i}$
if $|z_{i}|<\lambda_{1}$, $w_{i}=0$
otherwise, $w_{i}=-(\lambda_{2}+(\beta+\sqrt{q_{i}})/\alpha)^{-1}\cdot(z_{i}-\lambda_{1}\cdot sgn(z_{i}))$
end
end



### (4) FTRL for Logistic Regression

For logistic regression the per-example log-loss is

$\ell_{t}(w_{t})=-y_{t}\log(p_{t})-(1-y_{t})\log(1-p_{t})$

where $p_{t}=1/(1+\exp(-W\cdot X^{(t)}))$ is the predicted click probability.

FTRL Algorithm (Logistic Regression)

(1) Input $\alpha, \beta, \lambda_{1},\lambda_{2}$; initialize $W\in\mathbb{R}^{N}, Z=0\in\mathbb{R}^{N}, Q=0\in\mathbb{R}^{N}$

(2) for $t=1,2,3,\ldots$
for $i=1,2,\ldots,N$
$g_{i}=(p_{t}-y_{t})x_{i}$ // gradient of the log-loss
$\sigma_{i}=\alpha^{-1}(\sqrt{q_{i}+g_{i}^{2}}-\sqrt{q_{i}})$ // equals $(\eta_{i}^{(t)})^{-1}-(\eta_{i}^{(t-1)})^{-1}$
$q_{i}=q_{i}+g_{i}^{2}$
$z_{i}=z_{i}+g_{i}-\sigma_{i}w_{i}$
if $|z_{i}|<\lambda_{1}$, $w_{i}=0$
otherwise, $w_{i}=-(\lambda_{2}+(\beta+\sqrt{q_{i}})/\alpha)^{-1}\cdot(z_{i}-\lambda_{1}\cdot sgn(z_{i}))$
end
end
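The pseudocode above translates into a compact per-coordinate implementation. A minimal sketch (the class name and toy data are mine, and the denominator is written explicitly as $\lambda_{2}+(\beta+\sqrt{q_{i}})/\alpha$; not production code):

```python
import math

class FTRLLogistic:
    """Per-coordinate FTRL-Proximal for logistic regression,
    following the pseudocode's z/q/sigma state variables."""
    def __init__(self, n, alpha=0.5, beta=1.0, l1=0.01, l2=0.1):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.w = [0.0] * n
        self.z = [0.0] * n   # accumulated adjusted gradients
        self.q = [0.0] * n   # accumulated squared gradients

    def predict(self, x):
        dot = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-dot))

    def update(self, x, y):
        p = self.predict(x)
        for i, xi in enumerate(x):
            g = (p - y) * xi                              # gradient of the log-loss
            sigma = (math.sqrt(self.q[i] + g * g)
                     - math.sqrt(self.q[i])) / self.alpha
            self.q[i] += g * g
            self.z[i] += g - sigma * self.w[i]
            if abs(self.z[i]) <= self.l1:
                self.w[i] = 0.0                           # L1 keeps the weight sparse
            else:
                denom = self.l2 + (self.beta + math.sqrt(self.q[i])) / self.alpha
                self.w[i] = -(self.z[i]
                              - self.l1 * math.copysign(1.0, self.z[i])) / denom

# toy training run on hypothetical data: y = 1 iff the first feature dominates
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([0.9, 0.1], 1), ([0.2, 0.8], 0)]
model = FTRLLogistic(n=2)
for _ in range(200):
    for x, y in data:
        model.update(x, y)
```

Note the design choice that makes FTRL attractive for CTR models: the weight is only materialized when $|z_i|$ exceeds $\lambda_1$, so most coordinates stay exactly zero.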



# Positive and negative samples for classification models

### How positive and negative samples are generated for click-through-rate prediction:

(1) Negative samples are programs that were exposed to the user but that the user never played at any point; in other words, the program appears in neither the "History" nor the "Likes" category.

(2) Alternatively, negative samples are drawn from the whole sampling pool: programs the user never played at any point, i.e. programs that again appear in neither the "History" nor the "Likes" category.
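Definition (1) above can be sketched as a simple filter over an exposure log (all names and items are hypothetical):

```python
def build_samples(exposures, history, likes):
    """Label exposed items: positive if the user ever played the program
    (it is in History or Likes), negative otherwise -- definition (1)."""
    played = set(history) | set(likes)
    return [(item, 1 if item in played else 0) for item in exposures]

samples = build_samples(
    exposures=["show_a", "show_b", "show_c"],
    history=["show_a"],
    likes=["show_c"],
)
# show_b was exposed but never played, so it becomes a negative sample
```

Definition (2) would differ only in drawing candidate items from the whole pool rather than from the exposure log.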

# A brief introduction to feature engineering

### (I) What problems can feature engineering solve?

Actually, the success of all machine learning algorithms depends on how you present the data.

Better features mean flexibility. Better features mean simpler models. Better features mean better results.

### (II) What is feature engineering?

Feature Engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.
—— Jason Brownlee

Feature Engineering is manually designing what the input x’s should be.
—— Tomasz Malisiewicz

### (IV) The feature construction process

#### [4] Constructing features:

(4.1) Standardization

(4.2) Normalization, e.g. via the maps

$f(x)=\log(1+x):[0,+\infty)\rightarrow [0,+\infty),$

$f(x)=x/(x+1):[0,+\infty)\rightarrow [0,1).$

(4.3) Discretization

(4.4) Binarization

(4.5) Feature crosses
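The transforms in (4.1), (4.2) and (4.4) as small pure-Python helpers (a sketch; a real pipeline would use a library such as scikit-learn):

```python
import math

def standardize(xs):
    """(4.1) Shift to zero mean and scale to unit (population) variance."""
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [(x - m) / sd for x in xs]

def log_normalize(x):
    """(4.2) f(x) = log(1 + x), maps [0, inf) to [0, inf)."""
    return math.log1p(x)

def squash(x):
    """(4.2) f(x) = x / (x + 1), maps [0, inf) to [0, 1)."""
    return x / (x + 1.0)

def binarize(x, threshold):
    """(4.4) 1 if the feature exceeds the threshold, else 0."""
    return 1 if x > threshold else 0
```

Discretization (4.3) and feature crosses (4.5) follow the same spirit: bucketize a continuous value, or concatenate two categorical values into one combined feature.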

#### [5] Common feature-selection methods:

(5.1) Filter, e.g. scoring each feature by its Pearson correlation with the target:

$\rho_{XY}=cov(X,Y)/(\sigma_{X}\sigma_{Y})\in [-1,1]$

(5.2) Wrapper

(5.3) Embedded
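The filter method (5.1) can be sketched in a few lines: compute the correlation of each feature column with the label and keep only the features above a cutoff (the cutoff value is an arbitrary illustration):

```python
import math

def pearson(xs, ys):
    """Pearson correlation rho = cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

def filter_features(feature_columns, labels, cutoff=0.5):
    """Keep indices of features whose |correlation| with the label
    exceeds the cutoff -- the 'filter' selection strategy."""
    return [j for j, col in enumerate(feature_columns)
            if abs(pearson(col, labels)) > cutoff]
```

Wrapper methods instead re-train the model on candidate feature subsets, and embedded methods (e.g. L1 regularization, as in FTRL above) select features as a side effect of training.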

# Switching careers into data mining and machine learning

Hive is used to extract data and do basic data analysis: its built-in functions (aggregate functions, math functions, string functions, table-join functions, and so on) and its various statements (if/else, case, etc.).

The basics of Excel are also worth mastering; it handles all kinds of data processing, statistical analysis and decision support, and is quite convenient once you are used to it.

### 3. Operating systems

Linux, plus shell scripting.

### 4. Fundamentals and algorithms of data mining and machine learning

Reference: *Ad Click Prediction: a View from the Trenches*

## Problem statement (minimizing the objective function)

### Equivalent formulations:

(1) Constrained optimization (convex constraint formulation):

$\hat{w}=argmin_{w}\sum_{i=1}^{n}L(w,z_{i})$ subject to $||w||_{1}\leq s$

(2) Unconstrained optimization (soft regularization formulation):

$\hat{w}=argmin_{w}\sum_{i=1}^{n}L(w,z_{i})+\lambda ||w||_{1}$

### Loss functions:

Linear Regression：

$h(W,X_{j})=W^{T}X_{j}, \ell(W,Z)=\sum_{j=1}^{M}(y_{j}-W^{T}X_{j})^{2}$

Logistic Regression：

$h(W,X_{j})=1/(1+\exp(-W^{T}X_{j})), \ell(W,Z)=\sum_{j=1}^{M}\log(1+\exp(-y_{j}W^{T}X_{j}))$

Batch Gradient Descent: repeat until convergence

{ $W^{(t+1)}=W^{(t)}-\eta^{(t)}\nabla\ell_{W}(W^{(t)},Z)$ }

Stochastic Gradient Descent: loop {

for $j=1$ to $M$ { $W^{(t+1)}=W^{(t)}-\eta^{(t)}\nabla\ell_{W}(W^{(t)},Z_{j})$ }

}
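The two loops differ only in how much data each step sees: the batch version uses the full gradient, the stochastic version one example at a time. A minimal stochastic sketch for logistic regression (labels in {0, 1}, toy data, step size not tuned):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sgd_logistic(samples, eta=0.5, epochs=100):
    """Stochastic gradient descent for logistic regression:
    one (x, y) pair per weight update."""
    n = len(samples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, y in samples:                          # one sample per step
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i in range(n):
                w[i] -= eta * (p - y) * x[i]          # gradient of the log-loss
    return w

w = sgd_logistic([([1.0, 0.0], 1), ([0.0, 1.0], 0)])
```

Plain SGD produces dense weights; the truncation and regularization schemes that follow exist precisely to make such weights sparse.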

### (1) The L1 regularization (subgradient) method

$W^{(t+1)}=W^{(t)}-\eta^{(t)}G^{(t)}-\eta^{(t)}\lambda\cdot sgn(W^{(t)})$

### (2) Simple truncation

$W^{(t+1)}=T_{0}(W^{(t)}-\eta^{(t)}G^{(t)},\theta)$

$T_{0}(v_{i},\theta)=0 \text{ if } |v_{i}|\leq \theta,$

$T_{0}(v_{i},\theta)=v_{i} \text{ otherwise}.$

The Truncated Gradient method softens this with a gentler truncation:

$W^{(t+1)}=T_{1}(W^{(t)}-\eta^{(t)}G^{(t)},\eta^{(t)}\lambda^{(t)},\theta)$

where the function $T_{1}$ is defined as:

$T_{1}(v_{i},\alpha,\theta)=\max(0,v_{i}-\alpha) \text{ if } v_{i}\in [0,\theta]$

$T_{1}(v_{i},\alpha,\theta)=\min(0,v_{i}+\alpha) \text{ else if } v_{i}\in [-\theta,0)$

$T_{1}(v_{i},\alpha,\theta)=v_{i} \text{ otherwise }$

The Truncated Gradient algorithm:

(1) Input $\theta$ and $k$; initialize $W\in\mathbb{R}^{N}$
(2) for $t = 1,2,3,\ldots$ compute $G=\nabla_{W}\ell(W,X^{(t)},Y^{(t)})$

(i) When $t/k$ is not an integer, take a standard SGD (Stochastic Gradient Descent) step: $\lambda^{(t)}=0$, and $w_{i}^{(t)}=w_{i}^{(t)}-\eta^{(t)}g_{i}$ for all $1\leq i\leq N$.

(ii) When $t/k$ is an integer, apply the truncation:

$w_{i} = \max(0,w_{i}-\eta^{(t)}g_{i}-\eta^{(t)}\lambda^{(t)})$, if $(w_{i}-\eta^{(t)}g_{i})\in[0,\theta]$

$w_{i} = \min(0,w_{i}-\eta^{(t)}g_{i}+\eta^{(t)}\lambda^{(t)})$, else if $(w_{i}-\eta^{(t)}g_{i})\in[-\theta,0]$

$w_{i}=w_{i}-\eta^{(t)}g_{i}$, otherwise

(3) Return $W$.
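Step (ii) can be written as a single coordinate-wise function (a sketch with hypothetical parameter values in the comments):

```python
def truncated_gradient_step(w, g, eta, lam, theta):
    """One truncated-gradient update (the case where t/k is an integer).
    Applies T1 to the plain SGD step w - eta*g, coordinate by coordinate."""
    out = []
    for wi, gi in zip(w, g):
        v = wi - eta * gi                 # plain SGD step first
        if 0 <= v <= theta:
            out.append(max(0.0, v - eta * lam))   # shrink toward 0 from above
        elif -theta <= v < 0:
            out.append(min(0.0, v + eta * lam))   # shrink toward 0 from below
        else:
            out.append(v)                 # |v| > theta: large weights untouched
    return out

# small weights inside [-theta, theta] are shrunk (possibly to exactly 0),
# large weights pass through unchanged:
stepped = truncated_gradient_step(
    [0.05, -0.05, 5.0], [0.0, 0.0, 0.0], eta=1.0, lam=0.1, theta=1.0)
```

This is the key difference from simple truncation: weights are pulled toward zero gradually instead of being zeroed outright every k steps.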

# An introduction to the Forward-Backward Splitting (FOBOS) algorithm

### (1) How FOBOS works

FOBOS splits each update into two half-steps: a standard gradient (forward) step, followed by a proximal (backward) step that handles the regularizer $\Psi(W)$:

$W^{(t+0.5)}=W^{(t)}-\eta^{(t)}G^{(t)}$

$W^{(t+1)}=argmin_{W}\{\frac{1}{2}||W-W^{(t+0.5)}||_{2}^{2}+\eta^{(t+0.5)}\Psi(W)\}$

Substituting the first step into the second:

$W^{(t+1)}=argmin_{W}\{\frac{1}{2}||W-W^{(t)}+\eta^{(t)}G^{(t)}||_{2}^{2}+\eta^{(t+0.5)}\Psi(W)\}$

At the optimum, zero must belong to the subgradient of the objective:

$0 \in \partial F(W)=W-W^{(t)}+\eta^{(t)}G^{(t)}+\eta^{(t+0.5)}\partial\Psi(W)$

which yields the implicit update:

$W^{(t+1)}=W^{(t)}-\eta^{(t)}G^{(t)}-\eta^{(t+0.5)}\partial\Psi(W^{(t+1)})$

### (2) L1 regularization in FOBOS

With $\Psi(W)=\lambda||W||_{1}$, write $v_{i}$ for the components of $W^{(t+0.5)}$ and $\tilde{\lambda}=\eta^{(t+0.5)}\lambda$. Then

$W^{(t+1)}=argmin_{W}\{\frac{1}{2}||W-W^{(t+0.5)}||_{2}^{2}+\eta^{(t+0.5)}\Psi(W)\}$

$=argmin_{W}\sum_{i=1}^{N}(\frac{1}{2}(w_{i}-v_{i})^{2}+\tilde{\lambda}|w_{i}|).$

$w^{(t+1)}_{i}=argmin_{w_{i}}(\frac{1}{2}(w_{i}-v_{i})^{2}+\tilde{\lambda}|w_{i}|)$ for all $1\leq i \leq N$.

Claim. If $w_{i}^{*}$ is the minimizer of $\min_{w_{i}}(\frac{1}{2}(w_{i}-v_{i})^{2}+\tilde{\lambda}|w_{i}|)$, then $w_{i}^{*}v_{i}\geq 0$.

Solving the coordinate-wise problem gives the soft-threshold update:

$w_{i}^{(t+1)}=sgn(v_{i})\max(0,|v_{i}|-\tilde{\lambda})$

$= sgn(w_{i}^{(t)}-\eta^{(t)}g_{i}^{(t)})\max\{0,|w_{i}^{(t)}-\eta^{(t)}g_{i}^{(t)}|-\eta^{(t+0.5)}\lambda\}$ for all $1\leq i \leq N,$

which, when the weight is not truncated to zero, equals

$w_{i}^{(t+1)}=(w_{i}^{(t)}-\eta^{(t)}g_{i}^{(t)})-\eta^{(t+0.5)}\lambda \cdot sgn(w_{i}^{(t)}-\eta^{(t)}g_{i}^{(t)})$

#### The L1-FOBOS algorithm:

(1) Input $\lambda$; initialize $W\in\mathbb{R}^{N}$
(2) for $t=1,2,3,\ldots$ compute $G=\nabla_{W}\ell(W,X^{(t)},Y^{(t)})$, then update $W$ via
$w_{i} = sgn(w_{i}^{(t)}-\eta^{(t)}g_{i}^{(t)})\max\{0,|w_{i}^{(t)}-\eta^{(t)}g_{i}^{(t)}|-\eta^{(t+0.5)}\lambda\}$
for all $1\leq i \leq N$
(3) Return $W$
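The L1-FOBOS update is exactly a soft-threshold applied after a gradient step. A sketch (function name and the numeric values in the example are mine):

```python
import math

def fobos_l1_step(w, g, eta, eta_half, lam):
    """One L1-FOBOS update: forward gradient step, then backward
    soft-thresholding with threshold eta_half * lam."""
    thr = eta_half * lam
    out = []
    for wi, gi in zip(w, g):
        v = wi - eta * gi                                     # forward step
        out.append(math.copysign(max(0.0, abs(v) - thr), v))  # prox/shrink step
    return out

# weights whose post-gradient magnitude falls below the threshold become 0:
stepped = fobos_l1_step([1.0, -0.2, 0.05], [0.0, 0.0, 0.0],
                        eta=0.1, eta_half=0.1, lam=1.0)
```

With zero gradients and threshold 0.1, the large weights are merely shrunk by 0.1 while the small one is driven exactly to zero, which is the sparsity-inducing behavior the derivation above establishes.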

### (3) The relation between L1-FOBOS and Truncated Gradient

Setting $\theta=+\infty$, $k=1$, and $\lambda_{TG}^{(t)}=\eta^{(t+0.5)}\lambda$ makes L1-FOBOS and Truncated Gradient exactly coincide; in other words, L1-FOBOS is a special case of Truncated Gradient.