Category Archives: 数据挖掘与机器学习

强化学习的入门方法

September 4, 2022 zr9558 2 Comments

强化学习是人工智能领域一个非常重要的分支，在游戏 AI 领域的应用是无处不在的，包括大家耳熟能详的 AlphaGo 和王者荣耀，其背后都使用了大量的强化学习方法。随着人工智能技术在互联网公司的日渐普及，人们对人工智能的期待也越来越高，也期待着人工智能未来能够做更多的事情。除了各种门禁系统，人脸识别系统在各个公司的应用之外，在游戏领域，强化学习也扮演着举足轻重的作用。

当前的强化学习技术除了使用在游戏 AI 领域之外，也在机器人控制，推荐系统等领域有着很多应用，也取得了不少商业价值。但强化学习的学习还是存在一定的门槛，除了一些在线的 tutorial 和零零散散的资料之外，其实还是需要一本相对完整和优秀的教材来辅助高校的老师和学生，只有这样，强化学习才能够成为一门课进入高校课堂。

上海交大的强化学习团队教师张伟楠，沈键，俞勇合力撰写了一本关于强化学习的，理论与实践相结合的书籍《动手学强化学习》。本书的每个章节除了必要的理论介绍，公式推导之外，还包含了相关的 Python Notebook。Notebook 中包含了算法过程和可运行代码，方便读者阅读相关内容并立刻进行实践。

《动手学强化学习》是一本相对完整的书籍，读者并不需要知道太多的预备知识就可以直接上手这门课。本书包括三个部分：

第一部分：强化学习基础；
第二部分：强化学习进阶；
第三部分：强化学习前沿。

在第一部分，作者们介绍了强化学习的基本原理，多臂老虎机，马尔可夫决策过程，动态规划算法等内容；在第二部分，作者介绍了 DQN 算法及其改进算法，策略梯度算法，Actor-Critic 算法等诸多算法，让作者能够在第一部分的基础上更进一步的理解强化学习；在第三部分，作者们撰写了前沿的强化学习算法，包括离线强化学习，多智能体强化学习等内容，让读者在之前的基础上迅速进入到强化学习的前沿领域。

本书的另外一个亮点就是 Python Notebook 的代码丰富完善，读者可以一边阅读强化学习的基础知识，一边进行代码的开发工作。基于理论和实战相结合的办法，读者可以尽快掌握强化学习的实战方法。学习这本书需要读者有一定的 Python 开发基础，并且这本书有相应的强化学习视频，方便读者尽快掌握这本书。

数据挖掘与机器学习

主动学习（active Learning）

September 13, 2020 zr9558 Leave a comment

主动学习背景介绍

机器学习的研究领域包括有监督学习（Supervised Learning），无监督学习（Unsupervised Learning），半监督学习（Semi-supervised Learning）和强化学习（Reinforcement Learning）等诸多内容。针对有监督学习和半监督学习，都需要一定数量的标注数据，也就是说在训练模型的时候，全部或者部分数据需要带上相应的标签才能进行模型的训练。但是在实际的业务场景或者生产环境中，工作人员获得样本的成本其实是不低的，甚至在某些时候是相对较高的，那么如何通过较少成本来获得较大价值的标注数据，进一步地提升算法的效果就是值得思考的问题了。

在工业界的图像标注领域，虽然有 ImageNet 这个学术界和工业界都在使用的图像数据库，但是在很多特殊的业务场景上，从业人员依旧需要想尽办法去获取业务标注数据。在安全风控领域，黑产用户相对于正常用户是偏少的，因此，如何通过极少的黑产用户来建立模型则是值得思考的问题之一。在业务运维领域，服务器，app 的故障时间相对于正常运行的时间也是偏少的，必然会出现样本不均衡的情况。因此，在这些业务领域，要想获得样本和构建模型，就必须要通过人力的参与。那么如何通过一些机器学习算法来降低人工标注的成本就是从业者需要关注的问题了。毕竟需要标注 100 个样本和需要标注成千上万的样本所需要的人力物力是截然不同的。

在学术界，同样有学者在关注这方面的问题，学者们通过一些技术手段或者数学方法来降低人们标注的成本，学者们把这个方向称之为主动学习（Active Learning）。在整个机器学习建模的过程中有人工参与的部分和环节，并且通过机器学习方法筛选出合适的候选集给人工标注的过程。主动学习（Active Learning）的大致思路就是：通过机器学习的方法获取到那些比较“难”分类的样本数据，让人工再次确认和审核，然后将人工标注得到的数据再次使用有监督学习模型或者半监督学习模型进行训练，逐步提升模型的效果，将人工经验融入机器学习的模型中。

在没有使用主动学习（Active Learning）的时候，通常来说系统会从样本中随机选择或者使用一些人工规则的方法来提供待标记的样本供人工进行标记。这样虽然也能够带来一定的效果提升，但是其标注成本总是相对大的。

用一个例子来比喻，一个高中生通过做高考的模拟试题以希望提升自己的考试成绩，那么在做题的过程中就有几种选择。一种是随机地从历年高考和模拟试卷中随机选择一批题目来做，以此来提升考试成绩。但是这样做的话所需要的时间也比较长，针对性也不够强；另一种方法是每个学生建立自己的错题本，用来记录自己容易做错的习题，反复地巩固自己做错的题目，通过多次复习自己做错的题目来巩固自己的易错知识点，逐步提升自己的考试成绩。其主动学习的思路就是选择一批容易被错分的样本数据，让人工进行标注，再让机器学习模型训练的过程。

那么主动学习（Active Learning）的整体思路究竟是怎样的呢？在机器学习的建模过程中，通常包括样本选择，模型训练，模型预测，模型更新这几个步骤。在主动学习这个领域则需要把标注候选集提取和人工标注这两个步骤加入整体流程，也就是：

机器学习模型：包括机器学习模型的训练和预测两部分；
待标注的数据候选集提取：依赖主动学习中的查询函数（Query Function）；
人工标注：专家经验或者业务经验的提炼；
获得候选集的标注数据：获得更有价值的样本数据；
机器学习模型的更新：通过增量学习或者重新学习的方式更新模型，从而将人工标注的数据融入机器学习模型中，提升模型效果。

通过这种循环往复的方法，就可以达到人工调优模型的结果。其应用的领域包括：

个性化的垃圾邮件，短信，内容分类：包括营销短信，订阅邮件，垃圾短信和邮件等等；
异常检测：包括但不限于安全数据异常检测，黑产账户识别，时间序列异常检测等等。

主动学习的模型分类包括两种，第一种是流式的主动学习（Sequential Active Learning），第二种是离线批量的主动学习（Pool-based Active Learning）。在不同的场景下，业务人员可以选择不同的方案来执行。

而查询策略（Query Strategy Frameworks）就是主动学习的核心之处，通常可以选择以下几种查询策略：

不确定性采样的查询（Uncertainty Sampling）；
基于委员会的查询（Query-By-Committee）；
基于模型变化期望的查询（Expected Model Change）；
基于误差减少的查询（Expected Error Reduction）；
基于方差减少的查询（Variance Reduction）；
基于密度权重的查询（Density-Weighted Methods）。

不确定性采样（Uncertainty Sampling）

顾名思义，不确定性采样的查询方法就是将模型中难以区分的样本数据提取出来，提供给业务专家或者标注人员进行标注，从而达到以较快速度提升算法效果的能力。而不确定性采样方法的关键就是如何描述样本或者数据的不确定性，通常有以下几种思路：

置信度最低（Least Confident）；
边缘采样（Margin Sampling）；
熵方法（Entropy）；

Least Confident

对于二分类或者多分类的模型，通常它们都能够对每一个数据进行打分，判断它究竟更像哪一类。例如，在二分类的场景下，有两个数据分别被某一个分类器预测，其对两个类别的预测概率分别是：(0.9,0.1) 和 (0.51, 0.49)。在此情况下，第一个数据被判定为第一类的概率是 0.9，第二个数据被判定为第一类的概率是 0.51，于是第二个数据明显更“难”被区分，因此更有被继续标注的价值。所谓 Least Confident 方法就是选择那些最大概率最小的样本进行标注，用数学公式描述就是：

$x_{LC}^{*}=argmax_{x}(1-P_{\theta}(\hat{y}|x))=argmin_{x}P_{\theta}(\hat{y}|x)$ ,

其中 $\hat{y}=argmax_{y}P_{\theta}(y|x)$ ，这里的 $\theta$ 表示一个已经训练好的机器学习模型参数集合。 $\hat{y}$ 对于 $x$ 而言是模型预测概率最大的类别。Least Confident 方法考虑那些模型预测概率最大但是可信度较低的样本数据。

Margin Sampling

边缘采样（margin sampling）指的是选择那些极容易被判定成两类的样本数据，或者说这些数据被判定成两类的概率相差不大。边缘采样就是选择模型预测最大和第二大的概率差值最小的样本，用数学公式来描述就是：

$x_{M}^{*}=argmin_{x}(P_{\theta}(\hat{y}_{1}|x)-P_{\theta}(\hat{y}_{2}|x))$ ,

其中 $\hat{y}_{1}$ 和 $\hat{y}_{2}$ 分别表示对于 $x$ 而言，模型预测为最大可能类和第二大可能类。

特别地，如果针对二分类问题，least confident 和 margin sampling 其实是等价的。

Entropy

在数学中，可以使用熵（Entropy）来衡量一个系统的不确定性，熵越大表示系统的不确定性越大，熵越小表示系统的不确定性越小。因此，在二分类或者多分类的场景下，可以选择那些熵比较大的样本数据作为待定标注数据。用数学公式表示就是：

$x_{H}^{*}=argmax_{x}-\sum_{i}P_{\theta}(y_{i}|x)\cdot \ln P_{\theta}(y_{i}|x)$ ,

相较于 least confident 和 margin sample 而言，entropy 的方法考虑了该模型对某个 $x$ 的所有类别判定结果。而 least confident 只考虑了最大的概率，margin sample 考虑了最大的和次大的两个概率。

基于委员会的查询（Query-By-Committee）

除了考虑单个模型的不确定性采样方法之外，还可以考虑多个模型的场景，这就是类似集成学习的方法。通过多个模型投票的模式，来选择出那些较“难”区分的样本数据。在 QBC（Query-By-Committee）的技术方案中，可以假设有 $C$ 个模型，其参数分别是 $\{\theta^{(1)},\cdots,\theta^{(C)}\}$ ，并且这些模型都是通过数据集 $\mathcal{L}$ 的训练得到的。

如果不需要考虑每个模型的检测效果，其实可以考虑类似不确定性采样中的 least confident 和 margin sampling 方法。可以选择某一个分类器难以区分的样本数据，也可以选择其中两三个分类器难以区分的数据。但是如果要考虑所有模型的分类效果的时候，则还是需要熵（Entropy）或者 KL 散度等指标。因此，QBC 通常也包括两种方法：

投票熵（Vote Entropy）：选择这些模型都无法区分的样本数据；
平均 KL 散度（Average Kullback-Leibler Divergence）：选择 KL 散度较大的样本数据。

投票熵（Vote Entropy）

对于这种多模型 $\{\theta^{(1)},\cdots,\theta^{(C)}\}$ 的场景而言，可以用熵来衡量样本数据被这些分类器区分的难易程度，如果这些分类器都把样本数据划分到某一类，则容易区分；如果分类器把样本数据划分到多类，则表示难以区分，需要重点关注。用数学公式表达就是：

$x_{VE}^{*}=argmax_{x}-\sum_{i}\frac{V(y_{i})}{C}\cdot\ln\frac{V(y_{i})}{C}$ ,

其中 $y_{i}$ 表示第 $i$ 类，求和符号表示将所有的类别 $i$ 相加， $V(y_{i})$ 表示投票给 $y_{i}$ 的分类器个数， $C$ 表示分类器的总数，并且 $\sum_{i}V(y_{i})=C$ 。

平均 KL 散度（Average KL Divergence）

KL 散度可以衡量两个概率之间的“距离”，因此可以用 KL 散度计算出那些偏差较大的数据样本。用数学公式来描述就是：

$x_{KL}^{*}=argmax_{x}\frac{1}{C}\sum_{c=1}^{C}D(P_{\theta^{(c)}}||P_{\mathcal{C}}),$

其中 $P_{\mathcal{C}}(y_{i}|x)=\frac{1}{C}\sum_{c=1}^{C}P_{\theta^{(c)}}(y_{i}|x)$ 也是概率分布， $D(P_{\theta^{(c)}}||P_{\mathcal{C}})$ 表示两个概率的 KL 散度。

期望模型变化（Expected Model Change）

模型变化最大其实可以选择那些使得梯度变化最大的样本数据。

期望误差减少（Expected Error Reduction）

可以选择那些通过增加一个样本就使得 loss 函数减少最多的样本数据。

方差减少（Variance Reduction）

选择那些方差减少最多的样本数据。

基于密度权重的选择方法（Density-Weighted Methods）

有的时候，某个数据点可能是异常点或者与大多数数据偏差较大，不太适合做样本选择或者区分，某些时候考虑那些稠密的，难以区分的数据反而价值更大。于是，可以在使用不确定性采样或者 QBC 方法的时候，将样本数据的稠密性考虑进去。用数学公式表示就是：

$x_{ID}^{*}=argmax_{x}\phi_{A}(x)\cdot\bigg(\frac{1}{U}\sum_{u=1}^{U}sim(x,x^{(u)})\bigg)^{\beta}$ ,

在这里， $\phi_{A}$ 表示某个不确定性采样方法或者 QBC 方法， $\beta$ 表示指数参数， $x^{(u)}$ 表示第 $u$ 类的代表元， $U$ 表示类别的个数。加上权重表示会选择那些与代表元相似度较高的元素作为标注候选集。

总结

在主动学习（Active Learning）领域，其关键在于如何选择出合适的标注候选集给人工进行标注，而选择的方法就是所谓的查询策略（Query Stategy）。查询策略基本上可以基于单个机器学习模型，也可以基于多个机器学习模型，在实际使用的时候可以根据情况来决定。整体来看，主动学习都是为了降低标注成本，迅速提升模型效果而存在的。主动学习的应用场景广泛，包括图像识别，自然语言处理，安全风控，时间序列异常检测等诸多领域。后续笔者将会持续关注这一领域的发展并撰写相关文档。

参考资料

Settles, Burr. Active learning literature survey. University of Wisconsin-Madison Department of Computer Sciences, 2009.
Aggarwal, Charu C., et al. “Active learning: A survey.” Data Classification: Algorithms and Applications. CRC Press, 2014. 571-605.

数据挖掘与机器学习

符号计算中的深度学习方法

January 11, 2020 zr9558 Leave a comment

符号计算

符号计算一直是计算数学的重要领域之一。在开源领域，Python 的 SymPy 就可以支持符号计算。在商业化领域，Maple，Matlab，Mathematica 都能够进行符号计算。它们不仅能够做简单的实数和复数加减乘除，还能够支持数学分析，线性代数，甚至各种各样的大学数学课程。

随着人工智能的进一步发展，深度学习不仅在图像识别，自然语言处理方向上发挥着自身的价值，还在各种各样的领域展示着自己的实用性。在 2019 年底，facebook 两位研究员在 arxiv 上挂出了一篇文章《Deep Learning for Symbolic Mathematics》，在符号计算方向上引入了深度学习的工具。

要想了解符号运算，就要先知道在计算机中，是怎么对数学公式进行表示的。较为常见的表达式形如：

$2 + 3 * (5 + 2)$
$3x^{2}+\cos(2x)+1$
$\frac{\partial^{2}\psi}{\partial x^{2}} - \frac{1}{v^{2}}\frac{\partial^{2}\psi}{\partial t^{2}}$

在这里，数学表达式通常都会被表示成树的结构，其中树的内部节点是由算子（operator），函数（function）组成的，叶子节点由数字，变量，函数等组成。例如：

图 1 的三幅图分别对应着上面的三个数学表达式。

在 Python 的 SymPy 工具中，同样可以对数学公式进行展示。其表示方法就是用 sympy.srepr

>>> import sympy
>>> x, y = sympy.symbols("x y")
>>> expr = sympy.sin(x+y) + x**2 + 1/y - 10
>>> sympy.srepr(expr)
"Add(Pow(Symbol('x'), Integer(2)), sin(Add(Symbol('x'), Symbol('y'))), Integer(-10), Pow(Symbol('y'), Integer(-1)))"
>>> expr = sympy.sin(x*y)/2 - x**2 + 1/y
>>> sympy.srepr(expr)
"Add(Mul(Integer(-1), Pow(Symbol('x'), Integer(2))), Mul(Rational(1, 2), sin(Mul(Symbol('x'), Symbol('y')))), Pow(Symbol('y'), Integer(-1)))"

SymPy 的 srepr 函数的输出用树状结构来表示就是形如图 2 这种格式。叶子节点要么是 x，y 这种变量，要么是 -1 和 2 这种整数。对于一元函数而言，例如 sin 函数，就是对应唯一的一个叶子。对于二元函数而言，例如 pow，mul，add，则是对应两个叶子节点。

论文方案

在 Deep Learning for Symbolic Mathematics 这篇论文中，作者们的大致思路是分成以下几步的：

生成数据；
训练模型；
预测结果；

第一步生成数据是为了让深度学习模型是大量的已知样本来做训练；第二步训练模型是用第一步的数据把端到端的深度学习模型进行训练；第三步预测结果是给一个函数或者一个微分方程，使用已经训练好的模型来预测结果，对预测出来的多个结果进行排序，选择最有可能的那个结果作为符号计算的值。

众所周知，深度学习的训练是依赖大量的样本数据的，那么要想用深度学习来解决符号计算的问题，就要解决样本少的问题。在这篇论文中，作者们把精力投入了三个领域，分别是：

函数的积分；
一阶常微分方程；
二阶常微分方程。

在生成数据之前，作者们对数据的范围进行了必要的限制：

数学表达式最多拥有 15 个内部节点；
$L = 11$ 表示叶子节点的值只有 11 个，分别是变量 $x$ 和 $\{-5,-4,-3,-2,-1,1,2,3,4,5\}$ ；
$p_{1} = 15$ 表示一元计算只有 15 个，分别是 $\exp, \log, \sqrt, \sin, \cos, \tan, \arcsin, \arccos, \arctan$ , $sinh, cosh, tanh, arcsinh, arccosh, arctanh$ 。
$p_{2} = 4$ 表示二元计算只有四个，分别是 +, -, *, /；

意思就是在这个有限的范围内去生成深度学习所需要的数据集。

积分数据的生成

在微积分里面，积分指的是求导的逆运算，那么形如 $(f',f)$ 的表达式就可以作为深度学习的积分训练数据。生成积分的话其实有多种方法：

第一种方法：前向生成（Forward Generation，简写为 FWD）。主要思路就是在以上的数据范围内随机生成各种各样的方程 $f$ ，然后使用 SymPy 或者 Mathematica 等工具来计算函数 $f$ 的积分 $F$ ，那么 $(f,F)$ 就可以作为一个训练集。当然，有的时候函数 $f$ 的积分是无法计算出来的，那么这种计算表达式就需要进行放弃，就不能放入训练集。

第二种方法：反向生成（Backward Generation，简写为 BWD）。由于积分是求导的逆运算，可以在以上的数据范围内随机生成各种各样的方程 $f$ ，然后计算它们的导数 $f'$ ，于是 $(f',f)$ 就可以放入积分数据的训练集。

第三种方法：分部积分（Backward generation with integration by parts，简写为 IBP）。根据分部积分的公式，如果 $F'=f, G'=g$ ，那么 $\int F(x)g(x)dx=F(x)G(x) - \int f(x)G(x)dx$ 。对于两个随机生成的函数 $F, G$ ，可以计算出它们的导数 $f, g$ 。如果 $fG$ 在训练集合里面，那么就把 $Fg$ 的积分计算出来放入训练集合；反之，如果 $Fg$ 在训练集合里面，那么就把 $fG$ 的积分计算出来放入训练集合。如果 $fG$ 和 $Fg$ 都没有在训练集合，并且都无法积分出来，那么就放弃该条数据。

三种方法的比较可以参见表 1，从表 1 可以看出通过分部积分（Integration by parts）所获得的样本表达式相对于前向生成和反向生成都比较长。

一阶常微分方程的生成

一阶常微分方程只有该函数的一阶导数，因此在构造函数的时候，作者们用了一个技巧。在随机生成表达式的时候，叶子节点的元素只从 $\{x, -5,-4,-3,-2,-1,1,2,3,4,5\}$ 选择一个，于是随机把其中的一个整数换成变量 $c$ 。例如：在 $x\log(2/x)$ 中就把 2 换成 c，于是得到了一个二元函数 $f(x,c) = x\log(c/x)$ 。那么就执行以下步骤：

生成二元函数 $f(x,c) = x\log(c/x)$ ；
求解 $c$ ，得到 $c = xe^{f(x)/x}$ ；
对 $x$ 进行求导得到 $0 = e^{f(x)/x}(1+f'(x)-f(x)/x) = 0$ ；
简化后得到 $x+xf'(x) - f(x) =0$ ，也就是 $x+xy'-y=0$ 。

此时，一阶常微分方程的训练数据就是 $(x+xy'-y=0, x\log(c/x))$ 。

二阶常微分方程不仅可能由该函数的一阶导数，还必须有二阶导数，那么此时就要求导两次：

生成三元函数 $f(x,c_{1},c_{2}) = c_{1}e^{x}+c_{2}e^{-x}$ ；
求解 $c_{2}$ 得到 $c_{2} = f(x,c_{1},c_{2}) e^{x}- c_{1}e^{2x}$ ；
对 $x$ 进行求导得到 $0 = e^{x}(\partial f(x,c_{1},c_{2})/\partial x + f(x,c_{1},c_{2})) - 2c_{1}e^{2x} = 0$ ；
求解 $c_{1}$ 得到 $c_{1} = e^{-x}(\partial f(x,c_{1},c_{2})/\partial x+f(x))/2$ ；
对 $x$ 进行求导得到 $0 = e^{-x}(\partial^{2} f(x,c_{1},c_{2})/\partial x^{2} - f(x,c_{1},c_{2}))=0$ ；
简化后得到 $\partial^{2} f(x,c_{1},c_{2})/\partial x^{2} - f(x,c_{1},c_{2})=0$ ，也就是 $y''-y=0$ 。

那么此时的二阶微分方程的训练数据就是 $(y''-y=0, c_{1}e^{x}+c_{2}e^{-x})$ 。

需要注意的事情就是，在生成数据的时候，一旦无法求出 $c_{1}, c_{2}$ 的表达式，该数据就需要放弃，重新生成新的数据。

数据处理

数学表达式的简化（expression simplification）：例如 $x+1+1$ 可以简化成 $x+2$ ， $\sin^{2}(x)+\cos^{2}(x)$ 可以简化成 1。
参数的简化（coefficient simplification）：例如 $\log(x^{2}) + c\log(x)$ 可以简化成 $c\log(x)$ 。
无效表达式的过滤（invalid expression filter）：例如 $\sqrt{2}, \log(0)$ 等。

树状结构的表达式，是使用前缀表达式来写成一个句子的。例如 2+3 就写成 + 2 3，2 + x 就写成 + 2 x。

模型训练

在这里，作者们用了 Transformer 模型，8 attention heads，6 layers，512 个维度，并且发现用更复杂的模型并没有提升其效果。在预测的时候，使用不同的 Beam Size，其准确率是不一样的，在 Beam Size = 50 的时候，效果比较好。参见表 2。

在与其他数学软件的比较中，作者限制了 Mathematica 的运行时间为 30s。

并且举出了几个现有模型能够计算出来，Mathematica 无法计算的例子。

在求解的等价性方面，可以根据 Score 逆序排列，然后这个例子的 Top10 都是等价的。

在使用 SymPy 的过程中，可以获得各种各样的积分表达式如下：

结论

符号计算在 1960 年代末就已经在研究了，有诸多的符号计算软件，例如 Matlab，Mathematica，Maple，PARI，SAGE 等。在这篇论文中，作者们使用标准的 seq2seq 模型来对生成的数据进行训练，然后获得积分，一阶常微分方程，二阶常微分方程的解。在传统符号计算的基础上，开拓了一种新的思路。

数据挖掘与机器学习

计算机视觉中的注意力机制

January 23, 2019 zr9558 2 Comments

引言

在机器翻译（Machine Translation）或者自然语言处理（Natural Language Processing）领域，以前都是使用数理统计的方法来进行分析和处理。近些年来，随着 AlphaGo 的兴起，除了在游戏AI领域，深度学习在计算机视觉领域，机器翻译和自然语言处理领域也有着巨大的用武之地。在 2016 年，随着深度学习的进一步发展，seq2seq 的训练模式和翻译模式已经开始进入人们的视野。除此之外，在端到端的训练方法中，除了需要海量的业务数据之外，在网络结构中加入一些重要的模块也是非常必要的。在此情形下，基于循环神经网咯（Recurrent Neural Network）的注意力机制（Attention Mechanism）进入了人们的视野。除了之前提到的机器翻译和自然语言处理领域之外，计算机视觉中的注意力机制也是十分有趣的，本文将会简要介绍一下计算机视觉领域中的注意力方法。在此事先声明一下，笔者并不是从事这几个领域的，可能在撰写文章的过程中会有些理解不到位的地方，请各位读者指出其中的不足。

LSTM_1

注意力机制

顾名思义，注意力机制是本质上是为了模仿人类观察物品的方式。通常来说，人们在看一张图片的时候，除了从整体把握一幅图片之外，也会更加关注图片的某个局部信息，例如局部桌子的位置，商品的种类等等。在翻译领域，每次人们翻译一段话的时候，通常都是从句子入手，但是在阅读整个句子的时候，肯定就需要关注词语本身的信息，以及词语前后关系的信息和上下文的信息。在自然语言处理方向，如果要进行情感分类的话，在某个句子里面，肯定会涉及到表达情感的词语，包括但不限于“高兴”，“沮丧”，“开心”等关键词。而这些句子里面的其他词语，则是上下文的关系，并不是它们没有用，而是它们所起的作用没有那些表达情感的关键词大。

在以上描述下，注意力机制其实包含两个部分：

注意力机制需要决定整段输入的哪个部分需要更加关注；
从关键的部分进行特征提取，得到重要的信息。

通常来说，在机器翻译或者自然语言处理领域，人们阅读和理解一句话或者一段话其实是有着一定的先后顺序的，并且按照语言学的语法规则来进行阅读理解。在图片分类领域，人们看一幅图也是按照先整体再局部，或者先局部再整体来看的。再看局部的时候，尤其是手写的手机号，门牌号等信息，都是有先后顺序的。为了模拟人脑的思维方式和理解模式，循环神经网络（RNN）在处理这种具有明显先后顺序的问题上有着独特的优势，因此，Attention 机制通常都会应用在循环神经网络上面。

虽然，按照上面的描述，机器翻译，自然语言处理，计算机视觉领域的注意力机制差不多，但是其实仔细推敲起来，这三者的注意力机制是有明显区别的。

在机器翻译领域，翻译人员需要把已有的一句话翻译成另外一种语言的一句话。例如把一句话从英文翻译到中文，把中文翻译到法语。在这种情况下，输入语言和输出语言的词语之间的先后顺序其实是相对固定的，是具有一定的语法规则的；
在视频分类或者情感识别领域，视频的先后顺序是由时间戳和相应的片段组成的，输入的就是一段视频里面的关键片段，也就是一系列具有先后顺序的图片的组合。NLP 中的情感识别问题也是一样的，语言本身就具有先后顺序的特点；
图像识别，物体检测领域与前面两个有本质的不同。因为物体检测其实是在一幅图里面挖掘出必要的物体结构或者位置信息，在这种情况下，它的输入就是一幅图片，并没有非常明显的先后顺序，而且从人脑的角度来看，由于个体的差异性，很难找到一个通用的观察图片的方法。由于每个人都有着自己观察的先后顺序，因此很难统一成一个整体。

在这种情况下，机器翻译和自然语言处理领域使用基于 RNN 的 Attention 机制就变得相对自然，而计算机视觉领域领域则需要必要的改造才能够使用 Attention 机制。

LSTM_3

基于 RNN 的注意力机制

通常来说，RNN 等深度神经网络可以进行端到端的训练和预测，在机器翻译领域和或者文本识别领域有着独特的优势。对于端到端的 RNN 来说，有一个更简洁的名字叫做 sequence to sequence，简写就是 seq2seq。顾名思义，输入层是一句话，输出层是另外一句话，中间层包括编码和解码两个步骤。

而基于 RNN 的注意力机制指的是，对于 seq2seq 的诸多问题，在输入层和输出层之间，也就是词语（Items）与词语之间，存在着某种隐含的联系。例如：“中国” -> “China”，“Excellent” -> “优秀的”。在这种情况下，每次进行机器翻译的时候，模型需要了解当前更加关注某个词语或者某几个词语，只有这样才能够在整句话中进行必要的提炼。在这些初步的思考下，基于 RNN 的 Attention 机制就是：

建立一个编码（Encoder）和解码（Decoder）的非线性模型，神经网络的参数足够多，能够存储足够的信息；
除了关注句子的整体信息之外，每次翻译下一个词语的时候，需要对不同的词语赋予不同的权重，在这种情况下，再解码的时候，就可以同时考虑到整体的信息和局部的信息。

LSTM_4

注意力机制的种类

从初步的调研情况来看，注意力机制有两种方法，一种是基于强化学习（Reinforcement Learning）来做的，另外一种是基于梯度下降（Gradient Decent）来做的。强化学习的机制是通过收益函数（Reward）来激励，让模型更加关注到某个局部的细节。梯度下降法是通过目标函数以及相应的优化函数来做的。无论是 NLP 还是 CV 领域，都可以考虑这些方法来添加注意力机制。

LSTM_5

计算机视觉领域的 Attention 部分论文整理

下面将会简单的介绍几篇近期阅读的计算机视觉领域的关于注意力机制的文章。

Look Closer to See Better：Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition

在图像识别领域，通常都会遇到给图片中的鸟类进行分类，包括种类的识别，属性的识别等内容。为了区分不同的鸟，除了从整体来对图片把握之外，更加关注的是一个局部的信息，也就是鸟的样子，包括头部，身体，脚，颜色等内容。至于周边信息，例如花花草草之类的，则显得没有那么重要，它们只能作为一些参照物。因为不同的鸟类会停留在树木上，草地上，关注树木和草地的信息对鸟类的识别并不能够起到至关重要的作用。所以，在图像识别领域引入注意力机制就是一个非常关键的技术，让深度学习模型更加关注某个局部的信息。

RA_CNN_1

在这篇文章里面，作者们提出了一个基于 CNN 的注意力机制，叫做 recurrent attention convolutional neural network（RA-CNN），该模型递归地分析局部信息，从局部的信息中提取必要的特征。同时，在 RA-CNN 中的子网络（sub-network）中存在分类结构，也就是说从不同区域的图片里面，都能够得到一个对鸟类种类划分的概率。除此之外，还引入了 attention 机制，让整个网络结构不仅关注整体信息，还关注局部信息，也就是所谓的 Attention Proposal Sub-Network（APN）。这个 APN 结构是从整个图片（full-image）出发，迭代式地生成子区域，并且对这些子区域进行必要的预测，并将子区域所得到的预测结果进行必要的整合，从而得到整张图片的分类预测概率。

RA_CNN_2

RA-CNN 的特点是进行一个端到端的优化，并不需要提前标注 box，区域等信息就能够进行鸟类的识别和图像种类的划分。在数据集上面，该论文不仅在鸟类数据集（CUB Birds）上面进行了实验，也在狗类识别（Stanford Dogs）和车辆识别（Stanford Cars）上进行了实验，并且都取得了不错的效果。

RA_CNN_4

从深度学习的网络结构来看，RA-CNN 的输入时是整幅图片（Full Image），输出的时候就是分类的概率。而提取图片特征的方法通常来说都是使用卷积神经网络（CNN）的结构，然后把 Attention 机制加入到整个网络结构中。从下图来看，一开始，整幅图片从上方输入，然后判断出一个分类概率；然后中间层输出一个坐标值和尺寸大小，其中坐标值表示的是子图的中心点，尺寸大小表示子图的尺寸。在这种基础上，下一幅子图就是从坐标值和尺寸大小得到的图片，第二个网络就是在这种基础上构建的；再迭代持续放大图片，从而不停地聚焦在图片中的某些关键位置。不同尺寸的图片都能够输出不同的分类概率，再将其分类概率进行必要的融合，最终的到对整幅图片的鸟类识别概率。

因此，在整篇论文中，有几个关键点需要注意：

分类概率的计算，也就是最终的 loss 函数的设计；
从上一幅图片到下一幅图片的坐标值和尺寸大小。

只要获得了这些指标，就可以把整个 RA-CNN 网络搭建起来。

大体来说，第一步就是给定了一幅输入图片 $X$ ，需要提取它的特征，可以记录为 $W_{c}*X$ ，这里的 $*$ 指的是卷积等各种各样的操作。所以得到的概率分布情况其实就是 $p(X) = f(W_{c}*X)$ ， $f$ 指的是从 CNN 的特征层到全连接层的函数，外层使用了 Softmax 激活函数来计算鸟类的概率。

第二步就是计算下一个 box 的坐标 $(t_{x}, t_{y})$ 和尺寸大小 $t_{\ell}$ ，其中 $t_{x}, t_{y}$ 分别指的是横纵坐标，正方形的边长其实是 $2*t_{\ell}$ 。用数学公式来记录这个流程就是 $[t_{x}, t_{y}, t_{\ell}] = g(W_{c}*X)$ 。在坐标值的基础上，我们可以得到以下四个值，分别表示 $x, y$ 两个坐标轴的上下界：

$t_{x(t\ell)} = t_{x} - t_{\ell}, t_{x(br)} = t_{x} + t_{\ell},$

$t_{y(t\ell)} = t_{y} - t_{\ell}, t_{y(br)} = t_{y} + t_{\ell}.$

局部注意力和放大策略（Attention Localization and Amplification）指的是：从上面的方法中拿到坐标值和尺寸，然后把图像进行必要的放大。为了提炼局部的信息，其实就需要在整张图片 $X$ 的基础上加上一个面具（Mask）。所谓面具，指的是在原始图片的基础上进行点乘 0 或者 1 的操作，把一些数据丢失掉，把一些数据留下。在图片领域，就是把周边的信息丢掉，把鸟的信息留下。但是，有的时候，如果直接进行 0 或者 1 的硬编码，会显得网络结构不够连续或者光滑，因此就有其他的替代函数。

在激活函数里面，逻辑回归函数（Logistic Regression）是很常见的。其实通过逻辑回归函数，我们可以构造出近似的阶梯函数或者面具函数。

sigmoid_1

对于逻辑回归函数 $\sigma(x) = 1/(1+e^{-kx})$ 而言，当 $k$ 足够大的时候， $\sigma(x) \approx 1$ 当 $x \geq 0$ ； $\sigma(x) \approx 0$ 当 $x<0$ 。此时的逻辑回归函数近似于一个阶梯函数。如果假设 $x_{0}<x_{1}$ ，那么 $\sigma(x-x_{0}) - \sigma(x-x_{1})$ 就是光滑一点的阶梯函数， $\sigma(x-x_{0}) - \sigma(x-x_{1}) \approx 0$ 当 $x < x_{0} \text{ or } x > x_{1}$ ； $\sigma(x-x_{0}) - \sigma(x-x_{1}) \approx 1$ 当 $x_{0}\leq x\leq x_{1}$ 。

因此，基于以上的分析和假设，我们可以构造如下的函数： $X^{attr} = X \odot M(t_{x}, t_{y}, t_{\ell}),$ 其中， $X^{attr}$ 表示图片需要关注的区域， $M(\cdot)$ 函数就是 $M(t_{x}, t_{y}, t_{\ell}) = [\sigma(x-t_{x(t\ell)}) - \sigma(x-t_{x(br)})]\cdot[\sigma(y-t_{y(t\ell)}) - \sigma(y-t_{y(br)})],$ 这里的 $\sigma$ 函数对应了一个足够大的 $k$ 值。

当然，从一张完整的图片到小图片，在实际操作的时候，需要把小图片继续放大，在放大的过程中，可以考虑使用双线性插值算法来扩大。也就是说：

$X_{(i,j)}^{amp} = \sum_{\alpha,\beta=0}^{1}|1-\alpha-\{i/\lambda\}|\cdot|1-\beta-\{j/\lambda\}|\cdot X_{(m,n)}^{att},$

其中 $m = [i/\lambda] + \alpha, n = [j/\lambda] + \beta$ ， $\lambda$ 表示上采样因子， $[\cdot], \{\cdot\}$ 分别表示一个实数的正数部分和小数部分。

在分类（Classification）和排序（Ranking）部分，RA-CNN 也有着自己的方法论。在损失函数（Loss Function）里面有两个重要的部分，第一个部分就是三幅图片的 LOSS 函数相加，也就是所谓的 classification loss， $Y^{(s)}$ 表示预测类别的概率， $Y$ 表示真实的类别。除此之外，另外一个部分就是排序的部分， $L_{rank}(p_{t}^{(s)}, p_{t}^{(s+1)}) = \max\{0,p_{t}^{(s)}-p_{t+1}^{(s+1)}+margin\},$ 其中 $p^{(s)}$ 表示在第 $s$ 个尺寸下所得到的类别 $t$ 的预测概率，并且最大值函数强制了该深度学习模型在训练中可以保证 $p_{t}^{(s+1)} > p_{t}^{(s)} + margin$ ，也就是说，局部预测的概率值应该高于整体的概率值。

$L(X) = \sum_{s=1}^{3}\{L_{cls}(Y^{(s)},Y^{*})\} + \sum_{s=1}^{2}\{L_{rank}(p_{t}^{(s)},p_{t}^{(s+1)})\}$ .

RA_CNN_3

在这种 Attention 机制下，可以使用训练好的 conv5_4 或者 VGG-19 来进行特征的提取。在图像领域，location 的位置是需要通过训练而得到的，因为每张图片的鸟的位置都有所不同。进一步通过数学计算可以得到， $t_{\ell}$ 会随着网络而变得越来越小，也就是一个层次递进的关系，越来越关注到局部信息的提取。简单来看，

$\frac{\partial L_{rank}}{\partial t_{x}} \propto D_{top} \odot \frac{\partial M(t_{x},t_{y},t_{\ell})}{\partial t_{x}},$

这里的 $\odot$ 表示元素的点乘， $D_{top}$ 表示之前的网络所得到的导数。

当 $x\rightarrow t_{x(t\ell)}$ ， $\frac{\partial M}{\partial t_{x}}<0;$

当 $x \rightarrow t_{x(br)}$ ， $\frac{\partial M}{\partial t_{x}}>0;$

其余情况， $\frac{\partial M}{\partial t_{x}}=0.$

当 $y\rightarrow t_{y(t\ell)}$ ， $\frac{\partial M}{\partial t_{y}}<0;$

当 $y \rightarrow t_{y(br)}$ ， $\frac{\partial M}{\partial t_{y}}>0;$

其余情况， $\frac{\partial M}{\partial t_{y}}=0.$

当 $x \rightarrow t_{x(t\ell)}\text{ or } x \rightarrow t_{x(br)}\text{ or } y \rightarrow t_{y(t\ell)}\text{ or } y \rightarrow t_{y(br)},$ $\frac{\partial M}{\partial t_{\ell}}>0;$

其余情况， $\frac{\partial M}{\partial t_{\ell}}<0.$

因此， $t_{\ell}$ 在迭代的过程中会越来越小，也就是说关注的区域会越来越集中。

RA-CNN 的实验效果如下：

Multiple Granularity Descriptors for Fine-grained Categorization

这篇文中同样做了鸟类的分类工作，与 RA-CNN 不同之处在于它使用了层次的结构，因为鸟类的区分是按照一定的层次关系来进行的，粗糙来看，有科 -> 属 -> 种三个层次结构。

MC_CNN_1

因此，在设计网络结构的过程中，需要有并行的网络结构，分别对应科，属，种三个层次。从前往后的顺序是检测网络（Detection Network），区域发现（Region Discovery），描述网络（Description Network）。并行的结构是 Family-grained CNN + Family-grained Descriptor，Genus-grained CNN + Genus-grained Descriptor，Species-grained CNN + Species-grained Descriptor。而在区域发现的地方，作者使用了 energy 的思想，让神经网络分别聚焦在图片中的不同部分，最终的到鸟类的预测结果。

MC_CNN_2 MC_CNN_3

Recurrent Models of Visual Attention

在计算机视觉中引入注意力机制，DeepMind 的这篇文章 recurrent models of visual attention 发表于 2014 年。在这篇文章中，作者使用了基于强化学习方法的注意力机制，并且使用收益函数来进行模型的训练。从网络结构来看，不仅从整体来观察图片，也从局部来提取必要的信息。

DeepMind_1

DeepMind_2 DeepMind_3

整体来看，其网络结构是 RNN，上一个阶段得到的信息和坐标会被传递到下一个阶段。这个网络只在最后一步进行分类的概率判断，这是与 RA-CNN 不同之处。这是为了模拟人类看物品的方式，人类并非会一直把注意力放在整张图片上，而是按照某种潜在的顺序对图像进行扫描。Recurrent Models of Visual Attention 本质上是把图片按照某种时间序列的形式进行输入，一次处理原始图片的一部分信息，并且在处理信息的过程中，需要根据过去的信息和任务选择下一个合适的位置进行处理。这样就可以不需要进行事先的位置标记和物品定位了。

DeepMind_4

正如上图所示，enc 指的是对图片进行编码， $r_{i}^{(1)}$ 表示解码的过程， $x_{i}$ 表示图片的一个子区域。而 $y_{s}$ 表示对图片的预测概率或者预测标签。

Multiple Object Recognition with Visual Attention

这篇文章同样是 DeepMind 的论文，与 Recurrent Models of Visual Attention 不同之处在于，它是一个两层的 RNN 结构，并且在最上层把原始图片进行输入。其中 enc 是编码网络， $r^{(1)}_{i}$ 是解码网络， $r_{i}^{(2)}$ 是注意力网络，输出概率在解码网络的最后一个单元输出。

deep_recurrent_attention_model_1

在门牌识别里面，该网络是按照从左到右的顺序来进行图片扫描的，这与人类识别物品的方式极其相似。除了门牌识别之外，该论文也对手写字体进行了识别，同样取得了不错的效果。

deep_recurrent_attention_model_3

实验效果如下：

总结

本篇 Blog 初步介绍了计算机视觉中的 Attention 机制，除了这些方法之外，应该还有一些更巧妙的方法，希望各位读者多多指教。

参考文献

Look Closer to See Better：Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition，CVPR，2017.
Recurrent Models of Visual Attention，NIPS，2014
GitHub 代码：Recurrent-Attention-CNN，https://github.com/Jianlong-Fu/Recurrent-Attention-CNN
Multiple Granularity Descriptors for Fine-grained Categorization，ICCV，2015
Multiple Object Recognition with Visual Attention，ICRL，2015
Understanding LSTM Networks，Colah’s Blog，2015，http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Survey on the attention based RNN model and its applications in computer vision，2016

数据挖掘与机器学习

启发式优化算法

April 9, 2018 zr9558 Leave a comment

启发式优化算法的一些调研

启发式优化算法的背景

1. 首先，我们需要有一个目标函数 $f(\bold{x}),$ 其中的 $\bold{x} \in \mathbb{R}^{n}$ ， $n$ 是一个正整数。然后，目标函数 $f(\bold{x})$ 不一定需要是显式的（所谓显式指的是能够精确写出 $f(\bold{x})$ 的表达式，例如逻辑回归的目标函数，神经网络的目标函数，XGBoost 的目标函数等），隐式的目标函数使用启发式优化算法也是可以研究的（即无法写出 $f(\bold{x})$ 表达式）。

2. 其次，需要计算目标函数 $f(\bold{x})$ 的最大值 $\max f(\bold{x})$ 或者最小值 $\min f(\bold{x})$ 。

3. 再次，如果存在多个函数 $f_{i}(\bold{x}), 1\leq i\leq m,$ 并且每一个 $f_{i}(\bold{x})$ 的值域都在 $[0,1]$ 内。如果目标函数是

$\min_{x}(\max_{i}f_{i}(x)-\min_{i}f_{i}(x)),$

希望求该目标函数的最小值，那么意思就是使得每一个 $1\leq i\leq m,$ $f_{i}(\bold{x})$ 在 $\bold{x}$ 的定义域内都相差得不多，也就是这 $m$ 条曲线几乎重合。

4. 拥有目标函数的目的是寻找最优的 $\bold{x}$ 使得它满足 $argmin_{\bold{x}}f(x)$ 或者 $argmax_{\bold{x}}f(x)$ 。

5. 在这里的 $\bold{x}\in \mathbb{R}^{n}$ 一般都是连续的特征，而不是那种类别特征。如果是类别的特征，并不是所有的启发式优化算法都适用，但是遗传算法之类的算法在这种情况下有着一定的用武之地。

启发式优化算法的实验方法

1. 论文中的方法：找到一些形状怪异的函数作为目标函数，需要寻找这些函数的全局最大值或者全局最小值。除了全局的最大最小值点之外，这些函数也有很多局部极值点。例如 Rastrigin 函数（Chapter 6，Example 6.1，Search and Optimization by Metaheuristics）

$\min_{\bold{x}}f(\bold{x}) = 10\cdot n +\sum_{i=1}^{n}(x_{i}^{2}-10\cos(2\pi x_{i})),$

其中， $\bold{x}\in[-5.12,5.12]^{n}.$ 或者 Easom 函数（Chapter 2，Example 2.1，Search and Optimization by Metaheuristics）

$\min_{\bold{x}}f(\bold{x}) = - \cos(x_{1})\cos(x_{2})\exp(-(x_{1}-\pi)^{2}-(x_{2}-\pi)^{2}),$

其中， $\bold{x} \in [-100,100]^{2}.$

2. 最朴素的算法：随机搜索法（Random Search）。也就是把 $\bold{x}$ 的定义域切分成 $m^{n}$ 份，其中 $\bold{x}\in\mathbb{R}^{n}.$ 计算每一个交点的函数取值，然后统计出其最大值或者最小值就可以了。实际中不适用，因为当 $n\geq 100$ 的时候，这个是指数级别的计算复杂度。

3. 启发式优化算法：针对某一个目标函数，使用某种启发式优化算法，然后与随机搜索法做对比，或者与其余的启发式优化算法做对比。

4. 在线调优：启发式优化算法大部分都可以做成实时调优的算法，调优的速度除了算法本身的收敛速度之外，还有给定一个 $\bold{x}$ 之后，获取到 $f(\vec{x})$ 的时间。

注：可以尝试使用启发式优化算法在线调优推荐系统中的 AUC 或者 CTR。不过 CTR 一般具有波动性，需要几个小时甚至一天的数据才能够统计出一个相对靠谱的取值。同时，如果用户量不够多的话，如果切流量的话，可能会造成一些流量的 CTR 偏高，另外一些流量的 CTR 偏低。并且流量的条数会有上限，一般来说不会超过 100 左右，甚至只有 10 个不同的流量。因此，在其余场景下使用这类方法的时候，一定要注意获取目标函数 value 的时间长度。时间长度越短，能够进行的尝试（探索次数）就会越多，收敛的希望就会越大；反之，时间长度越长，能够进行的尝试（探索次数）就会受到限制，收敛的时间长度就会变长。

启发式优化算法的一些常见算法

以下的内容暂时只涉及最基本的算法，有很多算法的变种暂时不做过多的描述。

粒子群算法（Particle Swarm Optimization）

PSO 算法是有一个粒子群，根据群体粒子和单个粒子的最优效果，来调整每一个粒子的下一步行动方向。假设粒子的个数是 $N_{p}$ ，每一个粒子 $\bold{x}_{i}\in \mathbb{R}^{n}$ 都是 $n$ 维欧几里德空间里面的点，同时需要假设粒子的速度 $\bold{v}_{i}\in\mathbb{R}^{n}$ 。在每一轮迭代中，需要更新两个最值，分别是每一个粒子在历史上的最优值和所有粒子在历史上的最优值，分别记为 $\bold{x}_{i}^{*}$ （ $1\leq i \leq N_{p}$ ）和 $\bold{x}^{g}$ 。在第 $t+1$ 次迭代的时候，

$\bold{v}_{i}(t+1) = \bold{v}_{i}(t) + c r_{1}[\bold{x}_{i}^{*}(t) - \bold{x}_{i}(t)] + c r_{2}[\bold{x}^{g}(t) - \bold{x}_{i}(t)],$

$\bold{x}_{i}(t+1) = \bold{x}_{i}(t)+\bold{v}_{i}(t+1), \text{ } 1\leq i\leq N_{p}.$

在这里， $c>0$ ，并且 $r_{1}, r_{2}$ 是 $[0,1]$ 中间的随机数。

注：需要解决的问题是如何设置这 $N_{p}$ 个粒子，也就是构造粒子群的问题。在每次只能够设置调整一次 $\bold{x}$ 的时候，可以把时间窗口按照连续的 $N_{p}$ 段来进行切分，在一个时间段内的这 $N_{p}$ 个时间点可以当成 $N_{p}$ 个粒子，然后进行下一轮的迭代即可。

模拟退火（Simulated Annealing）

其核心思想是构造了温度 $T$ 这个维度，当温度 $T$ 高的时候，分子运动速度快，能够探索更多的区域；温度 $T$ 低的时候，分子运动速度慢，能够探索的区域有限。随着时间的迁移，温度 $T$ 会从一个较大的值逐渐降低到 0。

模拟退火的大体思路是这样的：先设置一个较大的温度值 $T$ ，随机初始化 $\bold{x}(0)$ 。假设目标函数是 $E(\bold{x})$ ，需要寻找 $argmin_{\bold{x}}E(\bold{x})$ ，然后执行以下的程序：

Repeat：

a. Repeat:

i. 进行一个随机扰动 $\bold{x} = \bold{x} + \Delta \bold{x}$ ；

ii. 计算 $\Delta E(\bold{x}) = E(\bold{x}+\Delta\bold{x}) - E(\bold{x})$ ，

如果 $\Delta E(\bold{x}) <0$ ，也就是 $E(\bold{x} + \Delta\bold{x})<E(\bold{x})$ ，选择 $\bold{x}+\Delta\bold{x}$ ；

否则，按照 $P = e^{-\Delta E/T}$ 的概率来接受 $\bold{x} +\Delta\bold{x}$ 。

b. 令 $T = T-\Delta T$ 。

直到 $T$ 足够小。

遗传算法的个人理解：

如何选择合适的 $\Delta \bold{x}$ ，可以选择 Gaussian 正态分布，并且可以同时修改 $\bold{x}$ 的 $n$ 个维度，也可以每次只修改一个维度或者部分维度的值；
一开始在温度 $T$ 很大的时候，即使 $E(\bold{x}+\Delta\bold{x})\geq E(\bold{x})$ 的时候，都会以极大概率接受 $E(\bold{x}+\Delta\bold{x})$ ，因为此时 $\lim_{T\rightarrow\infty}e^{-\Delta E/T}=1$ 。然后一开始都在四处游荡；
当温度 $T$ 逐渐降低的时候，当时 $\bold{x}$ 不一定在一个合适的位置，因此，需要记录历史上探索过的满足 $E(\bold{x})$ 最小的值，从历史上的值中选择出一个合适的值继续迭代。因此，在温度下降的时候，如何选择一个合适的值继续迭代可能是一个关键问题。
从遗传算法和 PSO 的对比效果来看，因为 PSO 算法会随时记录当前粒子和所有粒子的最优状态，然后在整个种群中，总会有少数粒子处于当前的最优状态，所以 PSO 的效果从实验效果上来看貌似比不记录状态的遗传算法会好一些。

进化策略（Evolutionary Strategies）

ES 和 GA 有一些方法论上面的不同，这个在书上有论述，等撰写了 GA 在一起看这一块的内容。

进化策略（Evolutionary Strategies）的大体流程与遗传算法（Genetic Algorithm）相似，不同的是 ES 的步骤是交叉/变异（Crossover/Mutation）在前，选择（Selection）在后。

对于经典的进化策略而言，交叉算子（Crossover Operator）可以按照如下定义：两个父代（parents） $\bold{x}_{1}$ 和 $\bold{x}_{2}$ ，那么子代（children）的交叉可以定义为：

$\bold{x}'=(\bold{x}_{1}+\bold{x}_{2})/2,$

这里的相加就是向量之间的相加。

对于经典的进化策略而言，变异算子（Mutation Operator）可以按照如下定义：对于一个向量 $\bold{x} = (x_{1},\cdots,x_{n})$ ，使用 Gaussian 正态分布可以构造出一个后代如下：

$x_{i}'=x_{i}+N(0,\sigma_{i}),\text{ } i=1,\cdots,n,$

这里的 $\sigma_{i}, \text{ } i=1,\cdots,n$ 是基于具体问题的，很难给出一个通用的值。

就经验而谈，ES 算法一般来说比 GA 更加 Robust。在交叉（Crossover）或者变异（Mutation）之后，ES 算法就需要进行选择，通常来说，ES 算法有两种方式，分别是 $(\lambda+\mu)$ 和 $(\lambda,\mu)$ 两种策略。这里的 $\mu$ 表示人口数量（population size）， $\lambda$ 表示从人口中产生的后代的数量。

（1）在 $(\lambda+\mu)$ 策略中，从 $\mu$ 人口中通过变异或者交叉得到 $\lambda$ 个人，然后从这 $(\lambda+\mu)$ 个人中选择出 $\mu$ 个最优的人。 $(\lambda+\mu)$ 是精英策略，每一次迭代都比上一次要好。

（2）在 $(\lambda, \mu)$ 策略中， $\mu$ 个最优的人是从 $\lambda(\lambda\geq\mu)$ 中选择出来的。

传统的梯度搜索法，例如 Newton-Raphson 方法，需要能够计算目标函数 $f(x)$ 的导数，也就是说目标函数必须可导。但是在 Evolutionary Gradient Search 中，不需要 $f(x)$ 是可导的，而是使用导数的近似值来计算。

如果是 $(1,\lambda)-ES$ 的方案，它整体来说其实只有一个人口。从当前的点 $\bold{x}$ 开始，通过 Gaussian Mutation 的方法可以生成 $\lambda$ 个后代，记为 $\bold{t}_{1},\cdots,\bold{t}_{\lambda}$ 。计算它们的函数取值 $f(\bold{t}_{1}),\cdots,f(\bold{t}_{\lambda})$ 。那么估算的梯度就可以计算出来：

$\bold{g}=\sum_{i=1}^{\lambda}(f(\bold{t}_{i})-f(\bold{x}))(\bold{t}_{i}-\bold{x}),$

归一化之后得到 $\bold{e}=\bold{g}/||\bold{g}||.$

Evolutionary gradient search 生成了两个点，分别是：

$\bold{x}_{1}=\bold{x}+(\sigma \psi)\bold{e}, \bold{x}_{2}=\bold{x}+(\sigma/\psi)\bold{e},$

这里的 $\psi>1.$ 新的个体定义为: $\bold{x}'=\bold{x}+\sigma'\bold{e}.$ 如果 $f(\bold{x}_{1})>f(\bold{x}_{2}),$ 那么 $\sigma' = \sigma\psi;$ 如果 $f(\bold{x}_{1})\leq f(\bold{x}_{2}),$ 那么 $\sigma'=\sigma/\psi.$

进化策略（ES）包括 gradient search 和 gradient evolution，并且 covariance matrix adaptation（CMA）ES 在一定的限制条件下加速了搜索的效率。

参考资料：

Search and Optimization by Metaheuristics，Kelin DU, M.N.S. SWAMY，2016年

数据挖掘与机器学习

聚类算法（二）

May 16, 2017 zr9558 Leave a comment

聚类是一种无监督学习（无监督学习是指事先并不知道要寻找的内容，没有固定的目标变量）的算法，它将相似的一批对象划归到一个簇，簇里面的对象越相似，聚类的效果越好。聚类的目标是在保持簇数目不变的情况下提高簇的质量。给定一个 m 个对象的集合。把这 m 个对象划分到 K 个区域，每个对象只属于一个区域。所遵守的一般准则是：同一个簇的对象尽可能的相互接近或者相关，而不同簇的对象尽可能的区分开。之前已经介绍过 KMeans 算法与 bisecting KMeans 算法，在这里介绍一个流式的聚类算法。

流式聚类算法（One Pass Cluster Algorithm）

既然是流式的聚类算法，那么每个数据只能够进行一次聚类的操作。有一个简单的思路就是对于每一个未知的新样本，如果它与某个类足够相似，那么就把这个未知的新样本加入这个类；否则就自成为一个新的类。如下图所示：

SinglePass 聚类原理

基于以上思路，我们可以设计一个流式的聚类算法。假设 dataMat 是一个 m 行 n 列的一个矩阵，每一行代表一个向量，n 代表向量的维度。i.e. 在 n 维欧几里德空间里面有 m 个点需要进行聚类。

流式聚类的算法细节：

参数的设定：

$K$ 表示在聚类的过程中允许形成的簇的最大个数；

$D$ 表示距离的阀值，在这里两个点之间的距离可以使用 $L^{1}, L^{2}, L^{\infty}$ 范数；

聚簇的质心定义为该类中所有点的平均值。i.e. 如果该类中的点是 $A[0],A[1],\cdot\cdot\cdot,A[n-1]$ ，那么质心就是 $\sum_{i=0}^{n-1}A[i]/n$ ；

第 j 个聚簇中元素个数用 $num[j]$ 表示。

方法：

Step 1：对于 $dataMat[0]$ 而言，自成一类。i.e. 质心就是它本身 $C[0]=dataMat[0]$ ，该聚簇的元素个数就是 $num[0]=1$ ，当前所有簇的个数是 $K^{'}=1$ 。

Step 2: 对于每一个 $1\leq i\leq m-1$ ，进行如下的循环操作：

假设当前有 $K^{'}$ 个簇，第 j 个簇的质心是 $C[j]$ ，第 j 个簇的元素个数是 $num[j]$ ，其中 $0\leq j\leq K^{'}-1$ 。

计算 $d= \min_{0\leq j\leq K^{'}-1}Distance(dataMat[i],C[j])$ ，其中的 Distance 可以是欧几里德空间的 $L^{1},L^{2},L^{\infty}$ 范数，对应的下标是 $j^{'}$ 。i.e. $j'=argmin_{0\leq j\leq K'-1}Distance(dataMat[i],C[j])$ 。

（i）如果当前 $K'=K$ 或者 $d\leq D$ ，则把 $dataMat[i]$ 加入到第 $j^{'}$ 个聚簇。也就是说：

质心更新为 $C[j'] \leftarrow (C[j']*num[j]$ + $dataMat[i])/(num[j]$ + $1)$ ，

元素的个数更新为 $num[j'] \leftarrow num[j']$ + $1$ 。

（ii）否则， $dataMat[i]$ 需要自成一类。i.e. $K^{'} \leftarrow K^{'}$ + $1$ ， $num[K'-1]=1$ ， $C[K'-1]=dataMat[i]$ 。

注：

（1）数据的先后顺序对聚类的效果有一定的影响。

（2）该算法的时间复杂度与簇的个数是相关的，通常来说，如果形成的簇越多，对于一个新的样本与质心的比较次数就会越多。

（3）如果设置簇的最大个数是一个比较小的数字的时候，那么对于一个新的样本，它需要比较的次数就会大量减少。如果最小距离小于 D 或者目前已经达到最大簇的个数的话，那么把新的样本放入某个类。此时，可能会出现某个类由于这个新的样本导致质心偏移的情况。如果设置 $K$ 很大，那么计算的时间复杂度就会增加。因此，在保证计算实时性的时候，需要考虑到 $K$ 的设置问题。

效果展示：

使用一批二维数据集合，进行流式聚类算法可以得到下图：’+’ 表示的是质心的位置，其余就是数据点的位置。通过选择不同的距离阈值 $D$ ，就可以得到不同的聚类情况。其中 ‘+’ 表示该簇质心的位置，蓝色的点表示数据点。如下图所示：

流式聚类算法1 流式聚类算法2

流式聚类算法3

数据挖掘与机器学习

三十分钟理解博弈论“纳什均衡” — Nash Equilibrium

January 19, 2017 zr9558 Leave a comment

欢迎转载，转载请注明：本文出自Bin的专栏blog.csdn.net/xbinworld。
技术交流QQ群：433250724，欢迎对算法、技术感兴趣的同学加入。

纳什均衡（或者纳什平衡），Nash equilibrium ,又称为非合作博弈均衡，是博弈论的一个重要策略组合，以约翰·纳什命名。

约翰·纳什，生于1928年6月13日。著名经济学家、博弈论创始人、《美丽心灵》男主角原型。前麻省理工学院助教，后任普林斯顿大学数学系教授，主要研究博弈论、微分几何学和偏微分方程。由于他与另外两位数学家（经济学家，约翰·C·海萨尼和莱因哈德·泽尔腾）在非合作博弈的均衡分析理论方面做出了开创性的贡献，对博弈论和经济学产生了重大影响，而获得1994年诺贝尔经济学奖。

纳什的人生非常曲折，一度学术成果不被认可，甚至换上严重的精神分裂症，在爱的力量下在很多年后奇迹般地恢复，并最终获得诺内尔经济学奖。影片《美丽心灵》（A Beautiful Mind）是一部改编自同名传记而获得奥斯卡金像奖的电影，影片以约翰·纳什与他的妻子艾莉西亚（曾离婚，但2001年复婚）以及普林斯顿的朋友、同事的真实感人故事为题材，艺术地重现了这个爱心呵护天才的传奇故事。

这里写图片描述
年轻时的Nash，很帅噢

纳什均衡定义

经济学定义[3]
所谓纳什均衡，指的是参与人的这样一种策略组合，在该策略组合上，任何参与人单独改变策略都不会得到好处。换句话说，如果在一个策略组合上，当所有其他人都不改变策略时，没有人会改变自己的策略，则该策略组合就是一个纳什均衡。

数学定义
纳什均衡的定义：在博弈G=﹛S1,…,Sn：u1,…，un﹜中，如果由各个博弈方的各一个策略组成的某个策略组合（s1*,…，sn*）中，任一博弈方i的策略si*，都是对其余博弈方策略的组合（s1*,…s*i-1,s*i+1,…，sn*）的最佳对策，也即ui（s1*,…s*i-1,si*,s*i+1,…，sn*）≥ui（s1*,…s*i-1,sij*,s*i+1,…，sn*）对任意sij∈Si都成立，则称（s1*,…，sn*）为G的一个纳什均衡。

注：经济学定义从字面上还是相对比较好理解的；这里稍微解释一下数学定义，博弈论也称Game Theory，一场博弈用G表示，Si表示博弈方i的策略，ui表示收益。因此，纳什均衡的意思是：任何一方采取的策略都是对其余所有方采取策略组合下的最佳对策；当所有其他人都不改变策略时，为了让自己的收益最大，任何一方都不会（或者无法）改变自己的策略，这个时候的策略组合就是一个纳什均衡。

纳什证明了在每个参与者都只有有限种策略选择、并允许混合策略的前提下，纳什均衡一定存在。以两家公司的价格大战为例，纳什均衡意味着两败俱伤的可能：在对方不改变价格的条件下，既不能提价，否则会进一步丧失市场；也不能降价，因为会出现赔本甩卖。于是两家公司可以改变原先的利益格局，通过谈判寻求新的利益评估分摊方案，也就是Nash均衡。类似的推理当然也可以用到选举，群体之间的利益冲突，潜在战争爆发前的僵局，议会中的法案争执等。

纳什均衡案例

以下介绍几个经典的纳什均衡案例[2][4]，因为本文主要是以科普为主，所以案例不会涉及到复杂深奥的经济学问题（事实上，我也不懂，哈~）。

（1）囚徒困境

假设有两个小偷A和B联合犯事、私入民宅被警察抓住。警方将两人分别置于不同的两个房间内进行审讯，对每一个犯罪嫌疑人，警方给出的政策是：如果一个犯罪嫌疑人坦白了罪行，交出了赃物，于是证据确凿，两人都被判有罪。如果另一个犯罪嫌疑人也作了坦白，则两人各被判刑8年；如果另一个犯罪嫌人没有坦白而是抵赖，则以妨碍公务罪（因已有证据表明其有罪）再加刑2年，而坦白者有功被减刑8年，立即释放。如果两人都抵赖，则警方因证据不足不能判两人的偷窃罪，但可以私入民宅的罪名将两人各判入狱1年。

此时产生了两个嫌疑人之间的一场博弈：

这里写图片描述

表中的数字表示A，B各自的判刑结果。博弈论分析中一般都用这样的表来表示。

该案例，显然最好的策略是双方都抵赖，结果是大家都只被判1年。但是由于两人处于隔离的情况，首先应该是从心理学的角度来看，当事双方都会怀疑对方会出卖自己以求自保、其次才是亚当·斯密的理论，假设每个人都是“理性的经济人”，都会从利己的目的出发进行选择。这两个人都会有这样一个盘算过程：假如他坦白，如果我抵赖，得坐10年监狱，如果我坦白最多才8年；假如他要是抵赖，如果我也抵赖，我就会被判一年，如果我坦白就可以被释放，而他会坐10年牢。综合以上几种情况考虑，不管他坦白与否，对我而言都是坦白了划算。两个人都会动这样的脑筋，最终，两个人都选择了坦白，结果都被判8年刑期。

注：亚当·斯密的理论（“看不见的手”原理），在市场经济中，每一个人都从利己的目的出发，而最终全社会达到利他的效果。但是我们可以从“纳什均衡”中引出“看不见的手”原理的一个悖论：从利己目的出发，结果损人不利己，既不利己也不利他。

（2）智猪博弈

猪圈里有两头猪，一头大猪，一头小猪。猪圈的一边有个踏板，每踩一下踏板，在远离踏板的猪圈的另一边的投食口就会落下少量的食物。如果有一只猪去踩踏板，另一只猪就有机会抢先吃到另一边落下的食物。当小猪踩动踏板时，大猪会在小猪跑到食槽之前刚好吃光所有的食物；若是大猪踩动了踏板，则还有机会在小猪吃完落下的食物之前跑到食槽，争吃到另一半残羹。

那么，两只猪各会采取什么策略？答案是：小猪将选择“搭便车”策略，也就是舒舒服服地等在食槽边；而大猪则为一点残羹不知疲倦地奔忙于踏板和食槽之间。

原因何在？因为，小猪踩踏板将一无所获，不踩踏板反而能吃上食物。对小猪而言，无论大猪是否踩动踏板，不踩踏板总是好的选择。反观大猪，已明知小猪是不会去踩动踏板的，自己亲自去踩踏板总比不踩强吧，所以只好亲力亲为了。

（3）普通范式博弈

GOO公司和SAM公司是某手机产品生态的两大重量级参与者，双方在产业链的不同位置上各司其职且关系暧昧，有时也往往因商业利益和产品影响力的争夺而各怀异心。二者的收益也随着博弈的变化而不断更替。

这里写图片描述

上图表格模拟了两家公司的博弈现状，双方各有两个可选策略“合作”与“背叛”，格中的四组数据表示四个博弈结局的分数（收益），每组数据的第一个数字表示GOO公司的收益，后一个数字表示SAM公司的收益。

博弈是同时进行的，一方参与者必须站在对方的角度上来思考我方的策略选择，以追求收益最大化。这在博弈论里称作Putting yourselves into other people’s shoes。

现在我们以GOO公司为第一人称视角来思考应对SAM公司的博弈策略。假如SAM公司选择合作，那么我方也选择合作带来的收益是3，而我方选择背叛带来的收益是5，基于理性的收益最大化考虑，我方应该选择背叛，这叫严格优势策略；假如SAM公司选择背叛，那么我方选择合作带来的收益是-3，而选择背叛带来的收益为-1，为使损失降到最低，我方应该选择背叛。最后，GOO公司的分析结果是，无论SAM公司选择合作还是背叛策略，我方都必须选择背叛策略才能获得最大化的收益。

同理，当SAM公司也以严格优势策略来应对GOO公司的策略选择时，我们重复上述分析过程，就能得出结论：无论GOO公司选择合作还是背叛策略，SAM公司都必须选择背叛策略才能获得最大化收益。

最后我们发现，本次博弈的双方都采取了背叛策略，各自的收益都为-1，这是一个比较糟糕的结局，尽管对任何一方来说都不是最糟糕的那种。这种局面就是著名的“囚徒困境”。

但是，博弈的次数往往不止一次，就像COO与SAM公司双方的商业往来也许会有很多机会。当二者经历了多次背叛策略的博弈之后，发现公式上还有一个（3，3）收益的双赢局面，这比（-1，-1）的收益结果显然要好很多，因此二者在之后的博弈过程中必然会尝试互建信任，从而驱使双方都选择合作策略。

这里有一个理想化假设，那就是假设双方都知道博弈次数是无限的话，也就是说双方的商业往来是无止尽的，那么二者的策略都将持续选择合作，最终的博弈收益将定格在（3，3），这就是一个纳什均衡。既然博弈次数是无限的，那么任何一方都没有理由选择背叛策略去冒险追求5点短暂收益，而招致对方在下一轮博弈中的报复（这种报复在博弈论里称作“以牙还牙”策略）。

还有另一种假设情况是，假使双方都知道博弈次数是有限的，也许下一次博弈就是最后一次，那么为了避免对方在最后一轮博弈中选择背叛策略而使我方遭受-3的收益损失，于是双方都重新采取了背叛的策略选择，最后的博弈结果又回到了（-1，-1），这就形成了第二个纳什均衡。

由此可见，随着次数（博弈性质）的变化，纳什均衡点也并非唯一。

（4）饿狮博弈

假设有A、B、C、D、E、F六只狮子（强弱从左到右依次排序）和一只绵羊。假设狮子A吃掉绵羊后就会打盹午睡，这时比A稍弱的狮子B就会趁机吃掉狮子A，接着B也会午睡，然后狮子C就会吃掉狮子B，以此类推。那么问题来了，狮子A敢不敢吃绵羊？

为简化说明，我们先给出此题的解法。该题须采用逆向分析法，也就是从最弱的狮子F开始分析，依次前推。假设狮子E睡着了，狮子F敢不敢吃掉狮子E？答案是肯定的，因为在狮子F的后面已没有其它狮子，所以狮子F可以放心地吃掉午睡中的狮子E。

继续前推，既然狮子E睡着会被狮子F吃掉，那么狮子E必然不敢吃在他前面睡着的狮子D。

再往前推，既然狮子E不敢吃掉狮子D，那么D则可以放心去吃午睡中的狮子C。依次前推，得出C不吃，B吃，A不吃。所以答案是狮子A不敢吃掉绵羊。

推理结果如下图：
这里写图片描述

但是，如果我们在狮子F的后面增加了一只狮子G，总数变成7只，用逆向分析法按照上题步骤再推一次，很容易得出结论：狮子G吃，狮子F不吃，E吃，D不吃，C吃，B不吃，A吃。这次的答案变成了狮子A敢吃掉绵羊。

这里写图片描述

对比两次博弈我们发现，狮子A敢不敢吃绵羊取决于狮子总数的奇偶性，总数为奇数时，A敢吃掉绵羊；总数为偶数时，A则不敢吃。因此，总数为奇数和总数为偶数的狮群博弈结果形成了两个稳定的纳什均衡点。

（5）硬币正反

你正在图书馆枯坐，一位陌生美女主动过来和你搭讪，并要求和你一起玩个数学游戏。美女提议：“让我们各自亮出硬币的一面，或正或反。如果我们都是正面，那么我给你3元，如果我们都是反面，我给你1元，剩下的情况你给我2元就可以了。”那么该不该和这位姑娘玩这个游戏呢？

每一种游戏依具其规则的不同会存在两种纳什均衡，一种是纯策略纳什均衡，也就是说玩家都能够采取固定的策略(比如一直出正面或者一直出反面)，使得每人都赚得最多或亏得最少；或者是混合策略纳什均衡，而在这个游戏中，便应该采用混合策略纳什均衡。

这里写图片描述

假设我们出正面的概率是x，反面的概率是1-x，美女出正面的概率是y，反面的概率是1-y。为了使利益最大化，应该在对手出正面或反面的时候我们的收益都相等，由此列出方程就是

3x + (-2)(1-x)=(-2) * x + 1*( 1-x )——解方程得x=3/8；同样，美女的收益，列方程-3y + 2( 1-y)= 2y+ (-1) * ( 1-y)——解得y也等于3/8。

于是，我们就可以算美女每次的期望收益是： （1-y）(2x-(1-x)) + y(-3x+2(1-x)) = 1/8元，也就是说，双方都采取最优策略的情况下，平均每次美女赢1/8元。

其实只要美女采取了(3/8,5/8)这个方案，不论你再采用什么方案，都是不能改变局面的。如果全部出正面，每次的期望收益是 (3+3+3-2-2-2-2-2)/8=-1/8元；如果全部出反面，每次的期望收益也是(-2-2-2+1+1+1+1+1)/8=-1/8元。而任何策略无非只是上面两种策略的线性组合，所以期望还是-1/8元。但是当你也采用最佳策略时，至少可以保证自己输得最少。否则，你肯定就会被美女采用的策略针对，从而赔掉更多。

纳什均衡分类

最后讲一讲纳什均衡的分类。纳什均衡可以分成两类：“纯战略纳什均衡”和“混合战略纳什均衡”。

要说明纯战略纳什均衡和混合战略纳什均衡，要先说明纯战略和混合战略。所谓纯战略是提供给玩家要如何进行赛局的一个完整的定义。特别地是，纯战略决定在任何一种情况下要做的移动。战略集合是由玩家能够施行的纯战略所组成的集合。而混合战略是对每个纯战略分配一个机率而形成的战略。混合战略允许玩家随机选择一个纯战略。混合战略博弈均衡中要用概率计算，因为每一种策略都是随机的，达到某一概率时，可以实现支付最优。因为机率是连续的，所以即使战略集合是有限的，也会有无限多个混合战略。

当然，严格来说，每个纯战略都是一个“退化”的混合战略，某一特定纯战略的机率为 1，其他的则为 0。
故“纯战略纳什均衡”，即参与之中的所有玩家都玩纯战略；而相应的“混合战略纳什均衡”，之中至少有一位玩家玩混合战略。并不是每个赛局都会有纯战略纳什均衡，例如“钱币问题”就只有混合战略纳什均衡，而没有纯战略纳什均衡。不过，还是有许多赛局有纯战略纳什均衡（如协调赛局，囚徒困境和猎鹿赛局）。甚至，有些赛局能同时有纯战略和混合战略均衡。

数据挖掘与机器学习

［转载］【重磅】无监督学习生成式对抗网络突破，OpenAI 5大项目落地

January 3, 2017 zr9558 Leave a comment

http://www.cnblogs.com/wangxiaocvpr/p/5966574.html

【重磅】无监督学习生成式对抗网络突破，OpenAI 5大项目落地

【新智元导读】“生成对抗网络是切片面包发明以来最令人激动的事情！”LeCun前不久在Quroa答问时毫不加掩饰对生成对抗网络的喜爱，他认为这是深度学习近期最值得期待、也最有可能取得突破的领域。生成对抗学习是无监督学习的一种，该理论由 Ian Goodfellow 提出，此人现在 OpenAI 工作。作为业内公认进行前沿基础理论研究的机构，OpenAI 不久前在博客中总结了他们的5大项目成果，结合丰富实例介绍了生成对抗网络，并对OpenAI 五大落地项目进行梳理，包括完善对抗生成网络（GAN）、完善变分推断（VAE）、提出GAN的扩展 InfoGAN，以及提出生成对抗模仿学习及代码。

OpenAI 旨在开发能够让计算机理解世界的算法和技术。

我们常会忽略自己对周遭世界的理解：你知道世界由三维环境构成，物体可以移动、碰撞、相互作用；人能行走、说话、思考；动物会吃草、飞翔、奔跑或者鸣叫；屏幕会显示经过编码的信息，内容涉及天气、篮球赛的结果或者 1970 年的事情。

这些海量信息就在那里，大都触手可及——其存在形式要么是现实世界中的原子，要么是数字世界里的比特。唯一的问题是如何设计模型和算法，分析和理解这些宝贵的数据。

生成模型是实现这一目标最值得期待的方法。训练生成模型，首先要大量收集某种数据（比如成千上万的图像、句子或声音），然后训练一个模型，让这个模型可以生成这样的数据。

其原理是费曼的名言：“做不出来就没有真正明白。”（What I cannot create, I do not understand.）

用于生成模型的神经网络，很多参数都远远小于用于训练的数据的量，因此模型能够发现并有效内化数据的本质，从而可以生成这些数据。

生成式模型有很多短期应用。但从长远角度看，生成模型有望自动学会数据集的类型、维度等特征。

生成图像

举个例子，假设有某个海量图像数据集，比如含有 120 万幅图像的 ImageNet 数据集。如果将每幅图的宽高设为 256，这个数据集就是 1200000*256*256*3（约 200 GB）的像素块。其中的一些样例如下：

这些图像是人类肉眼所见的样子，我们将它们称为“真实数据分布中的样本”。现在我们搭建生成模型，训练该模型生成类似上图的图像，在这里，这个生成模型就是一个输出为图像的大型神经网络，这些输出的图像称为“模型样本”。

DCGAN

Radford 等人提出的 DCGAN 网络（见下图）就是这样一个例子。DCGAN 网络以 100 个从一个高斯分布中采样的随机数作为输入（即代码，或者叫“隐含变量”，靠左红色部分），输出图像（在这里就是 64*64*3 的图像，靠右绿色部分）。当代码增量式变化时，生成的图像也随之变化——说明模型已经学会了如何描述世界，而不是仅仅是记住了一些样本。

网络（黄色部分）由标准的卷积神经网络（CNN）构成：

DCGAN 使用随机权重初始化，所以随机代码输入会生成一个完全随机的图像。但是，这个网络有好几百万的参数可以调整，而我们的目的是设置参数，让根据随机代码输入产生的样本与训练数据看上去相似。换句话说，我们想要模型分布与图像空间中真实数据的分布相匹配。

训练生成模型

假设我们使用最新初始化的网络生成 200 幅图，每次都从不同的随机代码开始。问题是：我们该如何调整网络的参数，让每次输出的新图像都更接近理想？需要注意的是，这里并非监督学习场景，我们也没有对 200 幅输出图像设定明确的预期；我们只是希望这些图像看起来跟真实的一样。

一个巧妙的处理方式是依照生成对抗网络（Generative Adversarial Network，GAN）方法。这里，我们引入另一个判别器网络（discriminator network，通常是一个标准的卷积神经网络），判断输入的图像是真实的还是生成的。我们可以将 200 幅生成的图像和 200 幅真实图像用作训练数据，将这个判别器训练成一个标准的分类器，其功能就是区分这两种不同的图像。

此外，我们还可以经由判别器和生成器反向传播（backpropagate），找出应该如何改变生成器的参数，使其生成的 200 幅样本对判别器而言混淆度更大。这两个网络就形成了一种对抗：判别器试着从伪造图像中区分出真实图像，而生成器则努力产生可以骗过判别器的图像。最后，生成器网络输出的结果就是在判别器看来无法区分的图像。

下图展示了两种从生成模型采样的过程。两种情况下，输入都是有噪声和混乱的，经过一段时间收敛，可以得到较为可信的图像统计：

VAE 学会产生图像（log time）

GAN 学会产生图像（linear time）

这令人兴奋——这些神经网络正在逐渐学会世界看起来是什么样子的！这些模型通常只有 10 亿参数，所以一个在 ImageNet 上训练的网络（粗略地）将 200GB 的像素数据压缩到 100MB 的权重。这让模型得以发现数据最主要的特征：例如，模型很可能学会位置邻近的像素更有可能拥有同样的颜色，或者世界是由水平或竖直的边构成。

最终，模型可能会发现很多更复杂的规律：例如，图像中有特定类型的背景、物体、纹理，它们会以某种可能的排列方式出现，或者在视频中随时间按某种方式变化等等。

更泛化的表现形式

数学上看，我们考虑数据集 x1，…，xn 是从真实数据分布 p(x) 中的一段。下图中，蓝色区域展示了一部分图像空间，这部分空间以高概率（超过某阈值）包含真实图像，而黑色点表示数据点（每个都是数据集中一副图像）。现在，我们的模型同样刻画了一个分布 p^θ(x) (绿色)，将从一个单位 Gaussian 分布 (红色) 获得的点，通过一个（判别器）神经网络映射，得到了生成模型 (黄色)。

我们的网络是参数为 θ 的函数，调整这些参数就能改变生成出的图像分布。目标是找到参数 θ 可以产生一个较好匹配真实数据分布的分布。因此，你可以想象绿色分布从随机开始，然后训练过程迭代式改变参数 θ 拉长和压缩自己使得更匹配蓝色分布。

生成模型三种搭建方法

大多数生成模型有一个基础的设置，只是在细节上有所不同。下面是生成模型的三个常用方法：

Generative Adversarial Network（GAN）这个我们在上面讨论过了，将训练过程作为两个不同网络的对抗：一个生成器网络和一个判别器网络，判别器网络试图区分样来自于真实分布 p(x) 和模型分布 p^(x) 的样本。每当判别器发现两个分布之间有差异时，生成器网络便微整参数，使判别器不能从中找到差异。
Variational Autoencoders（VAE）让我们可以在概率图模型框架下形式化这个问题，我们会最大化数据的对数似然（log likelihood）的下界
PixelRNN 等自回归模型。训练一个建模了给定前面像素下每个独立像素条件分布的网络（从左到右，从上到下）. 这类似于将图像的像素输入到一个 char-rnn，但是 RNN 水平和垂直遍历图像，而不是 1D 的字符序列

所有这些方法有各自的优缺点。例如，变分自编码器可以执行学习和在复杂的包含隐含变量的概率图模型上进行高效贝叶斯推断（如 DRAW 或者 Attend Infer Repeat 近期相对复杂的模型）。但是，生成的样本会有些模糊不清。GAN 目前生成了清楚的图像，但是因为不稳定的训练动态性很难优化。PixelRNN 有一个非常简单和稳定的训练过程（softmax loss），并且当前给出了最优的对数似然（产生出数据的可信程度）。然而，PixelRNN 在采样时相对低效，而且没有给图像以简单的低维代码。

OpenAI 5 大落地

我们对 OpenAI 做出的生成式模型非常兴奋，刚刚发布了四个对近期工作项目改进工作. 对每个贡献，我们同样发布了技术报告和源代码.

1. 完善对抗生成网络（GAN）

GAN 是非常值得期待的生成模型，不像其他方法，GAN 产生的图像干净、清晰，并且能够学会与纹理有关的代码。然而，GAN 被构建成两个网络之间的对抗，保持平衡十分重要（而且相当考验技巧）：两个网络可能在解析度之间震荡，生成器容易崩溃。

Tim Salimans, Ian Goodfellow, Wojciech Zaremba 等人引入了一些新技巧，让 GAN 训练更加稳定。这些技巧让我们能够放大 GAN ，获得清晰的 128*128 ImageNet 样本：

［左］真实图像（ImageNet）；［右］生成的图像

我们 CIFAR-10 的样本看起来也是非常清晰的——Amazon 为图像打标签的工人（Amazon Mechanical Turk workers）在区分这些图像和真实图像时，错误率为 21.3% （50% 的错误率代表随机猜测）。

［左］真实图像（CIFAR-10）；［右］生成的图像

除了生成更好的图像，我们还引入了一种结合 GAN 和半监督学习的方法。这使我们在不需要大量带标签样本的前提下，在 MNIST、SVHN 和 CIFAR-10 获得当前最佳的结果。在 MNIST，我们对每个类仅有 10 个带标签的样本，使用了一个全连接的神经网络，达到了 99.14% 的准确率——这个结果接近已知最好的监督学习，而后者使用了 6 万个带标签的样本。由于为样本打标签非常麻烦，所以上述方法是很值得期待的。

生成对抗网络是两年多前才提出来的，我们期望在未来出现更多提升其训练稳定性的研究。

2. 完善变分推断（VAE）

在这项工作中，Durk Kingma 和 Tim Salimans 引入了一个灵活、可扩展的计算方法，提升变分推断的准确率。目前，大多数 VAE 训练采用暴力近似后验分布（approximate posterior），每个隐含变量都是独立的。最近的扩展工作虽然解决了这个问题，但由于引入的序列依赖，在计算上仍然称不上高效。

这项工作的主要贡献是“逆自递归流”（Inverse Autoregressive Flow，IAF），这种方法使 rich approximate posterior 能够并行计算，从而高度灵活，可以达到近乎随机的任意性。

我们在下面的图中展示了一些 32*32 的图像样本。前一幅是来自 DRAW 模型的早期样本（初级 VAE 样本看起来更差和模糊）。DRAW 模型一年前才发表的，由此也可以感受到训练生成模型的发展迅速。

［左］用 IAF 训练 VAE 生成的图像；［右］DRAW 模型生成的图像

3. InfoGAN

Peter Chen 等人提出了 InfoGAN ——GAN 的扩展。普通的 GAN 通过在模型里重新生成数据分布，让模型学会图像的disentangled 和 interpretable 表征。但是，其 layout 和 organization 是 underspecified。

InfoGAN 引入了额外的结构，得到了相当出色的结果。在三维人脸图像中，改变代码的一个连续维度，保持其他维度不变，很明显从每张图片给出的 5 行例子中，代码的 resulting dimension 是可解释的，模型在事先不知道摄像头角度、面部变化等特征存在的前提下，可能已经理解这些特征是存在的，并且十分重要：

（a）pose （b）Elevation

（c）Lighting （d）Wide or Narrow

同时，值得一提的是，上述方法是非监督学习的方法。因此，相比通过监督学习的方法实现了同样结果的思路，这种方法体现出了更高的水平。

3. 生成模型的深度强化学习（两项）

下面是两个强化学习场景下（另一个 OpenAI 聚焦的领域），生成式模型的完善：Curiosity-driven Exploration in Deep Reinforcement Learning via Bayesian Neural Networks。

在高维度连续空间中进行高效的探索是当前强化学习尚未解决的一个难题。没有有效的探索方法，智能体只能到处乱闯直到碰巧遇到奖励。若要对高维行动空间进行探索（比如机器人），这些算法是完全不够的。这篇论文中，Rein Houthooft 等人提出了 VIME，一个使用生成模型对不确定性进行探索的实用方法。

VIME 让智能体本身驱动；它主动地寻求意外的状态-行动。作者展示了 VIME 可以提高一系列策略搜索方法，并在更多稀疏奖励的实际任务（比如智能体需要在无指导的情形下学习原始行动的场景）取得了显著改进。

用 VIME 训练的策略

没有受训的策略

4. 生成对抗模仿学习

Jonathan Ho 等人提出了一个新的模仿学习（imitation learning）方法。Jonathan Ho 在斯坦福大学完成了这项工作的主要内容，他作为暑期实习生加入 OpenAI 后，结合 GAN 以及 RL 等相关工作的启发，最终完成了生成对抗模仿学习及代码。

标准的强化学习场景通常需要设计一个规定智能体预期行为的奖励函数。但实际情况是，有样做有时候会为了实现细节上的正确，而引入代价过高的试错过程。相较而言，模仿学习中，智能体从样本展示中学习，就免去了对奖励函数的依赖。

常用模仿方法包含两个阶段的流程：首先学习奖励函数，然后依照奖励函数执行深度强化学习。这样的过程非常缓慢，也由于这种间接性方法，很难保证结果策略的质量。这项工作展示了如何通过 GAN 直接从数据中抽取策略。这个方法在 OpenAI Gym 环境中可以根据专家表现进行策略学习。

展望未来

生成模型是快速发展的研究领域。在完善这些模型，扩展训练和数据集的同时，我们完全可以认为最终将会产生能够以假乱真的图像或视频样本。这可以用很多应用，图像降噪（image denoising）、inpainting、高清分辨率（super-resolution）、结构化预测（structured prediction）、强化学习探索（exploration in reinforcement learning），以及神经网络预处理这些为数据打标签的很复杂、造价很高的领域，有很多潜力。

这项工作更深的启示是，在训练生成模型的过程中，我们最终会增进计算机对世界及其构成的理解。

数据挖掘与机器学习

一文学会用 Tensorflow 搭建神经网络

December 12, 2016 zr9558 Leave a comment

http://www.jianshu.com/p/e112012a4b2d

cs224d-Day 6: 快速入门 Tensorflow

本文是学习这个视频课程系列的笔记，课程链接是 youtube 上的，
讲的很好，浅显易懂，入门首选，而且在github有代码，
想看视频的也可以去他的优酷里的频道找。

Tensorflow 官网

神经网络是一种数学模型，是存在于计算机的神经系统，由大量的神经元相连接并进行计算，在外界信息的基础上，改变内部的结构，常用来对输入和输出间复杂的关系进行建模。

神经网络由大量的节点和之间的联系构成，负责传递信息和加工信息，神经元也可以通过训练而被强化。

这个图就是一个神经网络系统，它由很多层构成。输入层就是负责接收信息，比如说一只猫的图片。输出层就是计算机对这个输入信息的认知，它是不是猫。隐藏层就是对输入信息的加工处理。

神经网络是如何被训练的，首先它需要很多数据。比如他要判断一张图片是不是猫。就要输入上千万张的带有标签的猫猫狗狗的图片，然后再训练上千万次。

神经网络训练的结果有对的也有错的，如果是错误的结果，将被当做非常宝贵的经验，那么是如何从经验中学习的呢？就是对比正确答案和错误答案之间的区别，然后把这个区别反向的传递回去，对每个相应的神经元进行一点点的改变。那么下一次在训练的时候就可以用已经改进一点点的神经元去得到稍微准确一点的结果。

神经网络是如何训练的呢？每个神经元都有属于它的激活函数，用这些函数给计算机一个刺激行为。

在第一次给计算机看猫的图片的时候，只有部分的神经元被激活，被激活的神经元所传递的信息是对输出结果最有价值的信息。如果输出的结果被判定为是狗，也就是说是错误的了，那么就会修改神经元，一些容易被激活的神经元会变得迟钝，另外一些神经元会变得敏感。这样一次次的训练下去，所有神经元的参数都在被改变，它们变得对真正重要的信息更为敏感。

Tensorflow 是谷歌开发的深度学习系统，用它可以很快速地入门神经网络。

它可以做分类，也可以做拟合问题，就是要把这个模式给模拟出来。

这是一个基本的神经网络的结构，有输入层，隐藏层，和输出层。
每一层点开都有它相应的内容，函数和功能。

那我们要做的就是要建立一个这样的结构，然后把数据喂进去。
把数据放进去后它就可以自己运行，TensorFlow 翻译过来就是向量在里面飞。

这个动图的解释就是，在输入层输入数据，然后数据飞到隐藏层飞到输出层，用梯度下降处理，梯度下降会对几个参数进行更新和完善，更新后的参数再次跑到隐藏层去学习，这样一直循环直到结果收敛。

tensors_flowing.gif

今天一口气把整个系列都学完了，先来一段完整的代码，然后解释重要的知识点！

1. 搭建神经网络基本流程

定义添加神经层的函数

1.训练的数据
2.定义节点准备接收数据
3.定义神经层：隐藏层和预测层
4.定义 loss 表达式
5.选择 optimizer 使 loss 达到最小

然后对所有变量进行初始化，通过 sess.run optimizer，迭代 1000 次进行学习：

import tensorflow as tf
import numpy as np

# 添加层
def add_layer(inputs, in_size, out_size, activation_function=None):
    # add one more layer and return the output of this layer
    Weights = tf.Variable(tf.random_normal([in_size, out_size]))
    biases = tf.Variable(tf.zeros([1, out_size]) + 0.1)
    Wx_plus_b = tf.matmul(inputs, Weights) + biases
    if activation_function is None:
        outputs = Wx_plus_b
    else:
        outputs = activation_function(Wx_plus_b)
    return outputs

# 1.训练的数据
# Make up some real data 
x_data = np.linspace(-1,1,300)[:, np.newaxis]
noise = np.random.normal(0, 0.05, x_data.shape)
y_data = np.square(x_data) - 0.5 + noise

# 2.定义节点准备接收数据
# define placeholder for inputs to network  
xs = tf.placeholder(tf.float32, [None, 1])
ys = tf.placeholder(tf.float32, [None, 1])

# 3.定义神经层：隐藏层和预测层
# add hidden layer 输入值是 xs，在隐藏层有 10 个神经元   
l1 = add_layer(xs, 1, 10, activation_function=tf.nn.relu)
# add output layer 输入值是隐藏层 l1，在预测层输出 1 个结果
prediction = add_layer(l1, 10, 1, activation_function=None)

# 4.定义 loss 表达式
# the error between prediciton and real data    
loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys - prediction),
                     reduction_indices=[1]))

# 5.选择 optimizer 使 loss 达到最小                   
# 这一行定义了用什么方式去减少 loss，学习率是 0.1       
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)


# important step 对所有变量进行初始化
init = tf.initialize_all_variables()
sess = tf.Session()
# 上面定义的都没有运算，直到 sess.run 才会开始运算
sess.run(init)

# 迭代 1000 次学习，sess.run optimizer
for i in range(1000):
    # training train_step 和 loss 都是由 placeholder 定义的运算，所以这里要用 feed 传入参数
    sess.run(train_step, feed_dict={xs: x_data, ys: y_data})
    if i % 50 == 0:
        # to see the step improvement
        print(sess.run(loss, feed_dict={xs: x_data, ys: y_data}))

2. 主要步骤的解释：

之前写过一篇文章 TensorFlow 入门讲了 tensorflow 的安装，这里使用时直接导入：

import tensorflow as tf
import numpy as np

导入或者随机定义训练的数据 x 和 y：

x_data = np.random.rand(100).astype(np.float32)
y_data = x_data*0.1 + 0.3

先定义出参数 Weights，biases，拟合公式 y，误差公式 loss：

Weights = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
biases = tf.Variable(tf.zeros([1]))
y = Weights*x_data + biases
loss = tf.reduce_mean(tf.square(y-y_data))

选择 Gradient Descent 这个最基本的 Optimizer：

optimizer = tf.train.GradientDescentOptimizer(0.5)

神经网络的 key idea，就是让 loss 达到最小：

train = optimizer.minimize(loss)

前面是定义，在运行模型前先要初始化所有变量：

init = tf.initialize_all_variables()

接下来把结构激活，sesseion像一个指针指向要处理的地方：

sess = tf.Session()

init 就被激活了，不要忘记激活：

sess.run(init)

训练201步：

for step in range(201):

要训练 train，也就是 optimizer：

sess.run(train)

每 20 步打印一下结果，sess.run 指向 Weights，biases 并被输出：

if step % 20 == 0:
print(step, sess.run(Weights), sess.run(biases))

所以关键的就是 y，loss，optimizer 是如何定义的。

3. TensorFlow 基本概念及代码：

在 TensorFlow 入门也提到了几个基本概念，这里是几个常见的用法。

Session

矩阵乘法：tf.matmul

product = tf.matmul(matrix1, matrix2) # matrix multiply np.dot(m1, m2)

定义 Session，它是个对象，注意大写：

sess = tf.Session()

result 要去 sess.run 那里取结果：

result = sess.run(product)

Variable

用 tf.Variable 定义变量，与python不同的是，必须先定义它是一个变量，它才是一个变量，初始值为0，还可以给它一个名字 counter：

state = tf.Variable(0, name='counter')

将 new_value 加载到 state 上，counter就被更新：

update = tf.assign(state, new_value)

如果有变量就一定要做初始化：

init = tf.initialize_all_variables() # must have if define variable

placeholder：

要给节点输入数据时用 placeholder，在 TensorFlow 中用placeholder 来描述等待输入的节点，只需要指定类型即可，然后在执行节点的时候用一个字典来“喂”这些节点。相当于先把变量 hold 住，然后每次从外部传入data，注意 placeholder 和 feed_dict 是绑定用的。

这里简单提一下 feed 机制，给 feed 提供数据，作为 run()
调用的参数， feed 只在调用它的方法内有效, 方法结束, feed 就会消失。

import tensorflow as tf

input1 = tf.placeholder(tf.float32)
input2 = tf.placeholder(tf.float32)
ouput = tf.mul(input1, input2)

with tf.Session() as sess:
  print(sess.run(ouput, feed_dict={input1: [7.], input2: [2.]}))

4. 神经网络基本概念

激励函数：

例如一个神经元对猫的眼睛敏感，那当它看到猫的眼睛的时候，就被激励了，相应的参数就会被调优，它的贡献就会越大。

下面是几种常见的激活函数：
x轴表示传递过来的值，y轴表示它传递出去的值：

激励函数在预测层，判断哪些值要被送到预测结果那里：

TensorFlow 常用的 activation function

添加神经层：

输入参数有 inputs, in_size, out_size, 和 activation_function

import tensorflow as tf

def add_layer(inputs, in_size, out_size,  activation_function=None):

  Weights = tf.Variable(tf.random_normal([in_size, out_size]))
  biases = tf.Variable(tf.zeros([1, out_size]) + 0.1)
  Wx_plus_b = tf.matmul(inputs, Weights) + biases

  if activation_function is None:
    outputs = Wx_plus_b
  else:
    outputs = activation_function(Wx_plus_b)

return outputs

分类问题的 loss 函数 cross_entropy ：

# the error between prediction and real data
# loss 函数用 cross entropy
cross_entropy = tf.reduce_mean(-tf.reduce_sum(ys * tf.log(prediction),
                                              reduction_indices=[1]))       # loss
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

overfitting：

下面第三个图就是 overfitting，就是过度准确地拟合了历史数据，而对新数据预测时就会有很大误差：

Tensorflow 有一个很好的工具, 叫做dropout, 只需要给予它一个不被 drop 掉的百分比，就能很好地降低 overfitting。

dropout 是指在深度学习网络的训练过程中，按照一定的概率将一部分神经网络单元暂时从网络中丢弃，相当于从原始的网络中找到一个更瘦的网络，这篇博客中讲的非常详细

代码实现就是在 add layer 函数里加上 dropout, keep_prob 就是保持多少不被 drop，在迭代时在 sess.run 中被 feed:

def add_layer(inputs, in_size, out_size, layer_name, activation_function=None, ):
    # add one more layer and return the output of this layer
    Weights = tf.Variable(tf.random_normal([in_size, out_size]))
    biases = tf.Variable(tf.zeros([1, out_size]) + 0.1, )
    Wx_plus_b = tf.matmul(inputs, Weights) + biases

    # here to dropout
    # 在 Wx_plus_b 上drop掉一定比例
    # keep_prob 保持多少不被drop，在迭代时在 sess.run 中 feed
    Wx_plus_b = tf.nn.dropout(Wx_plus_b, keep_prob)

    if activation_function is None:
        outputs = Wx_plus_b
    else:
        outputs = activation_function(Wx_plus_b, )
    tf.histogram_summary(layer_name + '/outputs', outputs)  
    return outputs

5. 可视化 Tensorboard

Tensorflow 自带 tensorboard ，可以自动显示我们所建造的神经网络流程图：

就是用 with tf.name_scope 定义各个框架，注意看代码注释中的区别：

import tensorflow as tf


def add_layer(inputs, in_size, out_size, activation_function=None):
    # add one more layer and return the output of this layer
    # 区别：大框架，定义层 layer，里面有 小部件
    with tf.name_scope('layer'):
        # 区别：小部件
        with tf.name_scope('weights'):
            Weights = tf.Variable(tf.random_normal([in_size, out_size]), name='W')
        with tf.name_scope('biases'):
            biases = tf.Variable(tf.zeros([1, out_size]) + 0.1, name='b')
        with tf.name_scope('Wx_plus_b'):
            Wx_plus_b = tf.add(tf.matmul(inputs, Weights), biases)
        if activation_function is None:
            outputs = Wx_plus_b
        else:
            outputs = activation_function(Wx_plus_b, )
        return outputs


# define placeholder for inputs to network
# 区别：大框架，里面有 inputs x，y
with tf.name_scope('inputs'):
    xs = tf.placeholder(tf.float32, [None, 1], name='x_input')
    ys = tf.placeholder(tf.float32, [None, 1], name='y_input')

# add hidden layer
l1 = add_layer(xs, 1, 10, activation_function=tf.nn.relu)
# add output layer
prediction = add_layer(l1, 10, 1, activation_function=None)

# the error between prediciton and real data
# 区别：定义框架 loss
with tf.name_scope('loss'):
    loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys - prediction),
                                        reduction_indices=[1]))

# 区别：定义框架 train
with tf.name_scope('train'):
    train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

sess = tf.Session()

# 区别：sess.graph 把所有框架加载到一个文件中放到文件夹"logs/"里 
# 接着打开terminal，进入你存放的文件夹地址上一层，运行命令 tensorboard --logdir='logs/'
# 会返回一个地址，然后用浏览器打开这个地址，在 graph 标签栏下打开
writer = tf.train.SummaryWriter("logs/", sess.graph)
# important step
sess.run(tf.initialize_all_variables())

运行完上面代码后，打开 terminal，进入你存放的文件夹地址上一层，运行命令 tensorboard –logdir=’logs/’ 后会返回一个地址，然后用浏览器打开这个地址，点击 graph 标签栏下就可以看到流程图了：

6. 保存和加载

训练好了一个神经网络后，可以保存起来下次使用时再次加载：

import tensorflow as tf
import numpy as np

## Save to file
# remember to define the same dtype and shape when restore
W = tf.Variable([[1,2,3],[3,4,5]], dtype=tf.float32, name='weights')
b = tf.Variable([[1,2,3]], dtype=tf.float32, name='biases')

init= tf.initialize_all_variables()

saver = tf.train.Saver()

# 用 saver 将所有的 variable 保存到定义的路径
with tf.Session() as sess:
   sess.run(init)
   save_path = saver.save(sess, "my_net/save_net.ckpt")
   print("Save to path: ", save_path)


################################################

# restore variables
# redefine the same shape and same type for your variables
W = tf.Variable(np.arange(6).reshape((2, 3)), dtype=tf.float32, name="weights")
b = tf.Variable(np.arange(3).reshape((1, 3)), dtype=tf.float32, name="biases")

# not need init step

saver = tf.train.Saver()
# 用 saver 从路径中将 save_net.ckpt 保存的 W 和 b restore 进来
with tf.Session() as sess:
    saver.restore(sess, "my_net/save_net.ckpt")
    print("weights:", sess.run(W))
    print("biases:", sess.run(b))

tensorflow 现在只能保存 variables，还不能保存整个神经网络的框架，所以再使用的时候，需要重新定义框架，然后把 variables 放进去学习。

文／不会停的蜗牛（简书作者）
原文链接：http://www.jianshu.com/p/e112012a4b2d
著作权归作者所有，转载请联系作者获得授权，并标注“简书作者”。

数据挖掘与机器学习

TensorFlow 入门

December 12, 2016 zr9558 Leave a comment

http://www.jianshu.com/p/6766fbcd43b9

CS224d－Day 2:

在 Day 1 里，先了解了一下 NLP 和 DP 的主要概念，对它们有了一个大体的印象，用向量去表示研究对象，用神经网络去学习，用 TensorFlow 去训练模型，基本的模型和算法包括 word2vec，softmax，RNN，LSTM，GRU，CNN，大型数据的 seq2seq，还有未来比较火热的研究方向 DMN，还有模型的调优。

今天先不直接进入理论学习，而是先学习一下 TensorFlow，在原课程里，这部分在第7讲，但是我觉得最高效地学习算法的方式，就是一边学理论，一边写代码，实践中才能理解更深刻。

Day 2 先认识 TensorFlow，了解一下基本用法，下一次就写代码来训练模型算法，以问题为导向，以项目为驱动。

本文结构：

1. TensorFlow 是什么
2. 为什么需要 TensorFlow
3. TensorFlow 的优点
4. TensorFlow 的工作原理
5. 安装
6. TensorFlow 基本用法
- 要点
- 例子
- 概念
  - 图
  - 张量
  - 会话

1. TensorFlow 是什么

是一个深度学习库，由 Google 开源，可以对定义在 Tensor(张量)上的函数自动求导。

Tensor(张量)意味着 N 维数组，Flow(流)意味着基于数据流图的计算，TensorFlow即为张量从图的一端流动到另一端。

它的一大亮点是支持异构设备分布式计算，它能够在各个平台上自动运行模型，从电话、单个CPU / GPU到成百上千GPU卡组成的分布式系统。

支持CNN、RNN和LSTM算法，是目前在 Image，NLP 最流行的深度神经网络模型。

2. 为什么需要 TensorFlow 等库

深度学习通常意味着建立具有很多层的大规模的神经网络。

除了输入X，函数还使用一系列参数，其中包括标量值、向量以及最昂贵的矩阵和高阶张量。

在训练网络之前，需要定义一个代价函数，常见的代价函数包括回归问题的方差以及分类时候的交叉熵。

训练时，需要连续的将多批新输入投入网络，对所有的参数求导后，代入代价函数，从而更新整个网络模型。

这个过程中有两个主要的问题：1. 较大的数字或者张量在一起相乘百万次的处理，使得整个模型代价非常大。2. 手动求导耗时非常久。

所以 TensorFlow 的对函数自动求导以及分布式计算，可以帮我们节省很多时间来训练模型。

3. TensorFlow 的优点

第一，基于Python，写的很快并且具有可读性。

第二，在多GPU系统上的运行更为顺畅。

第三，代码编译效率较高。

第四，社区发展的非常迅速并且活跃。

第五，能够生成显示网络拓扑结构和性能的可视化图。

4. TensorFlow 的工作原理

TensorFlow是用数据流图(data flow graphs)技术来进行数值计算的。

数据流图是描述有向图中的数值计算过程。

有向图中，节点通常代表数学运算，边表示节点之间的某种联系，它负责传输多维数据(Tensors)。

节点可以被分配到多个计算设备上，可以异步和并行地执行操作。因为是有向图，所以只有等到之前的入度节点们的计算状态完成后，当前节点才能执行操作。

5. 安装

极客学院有官方文档翻译版，讲的很清楚，有各种安装方式的讲解。

我选择基于 Anaconda 的安装，因为这个很方便。

Anaconda 是一个集成许多第三方科学计算库的 Python 科学计算环境，用 conda 作为自己的包管理工具，同时具有自己的计算环境，类似 Virtualenv。

安装 Anaconda
我之前已经安装过 Anaconda 了，直接从下面进行：
建立一个 conda 计算环境

# 计算环境名字叫 tensorflow:
# Python 2.7
$ conda create -n tensorflow python=2.7

激活环境，使用 conda 安装 TensorFlow

$ source activate tensorflow
(tensorflow)$  # Your prompt should change

# Mac OS X, CPU only:
(tensorflow)$ pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.8.0rc0-py2-none-any.whl

安装成功后，每次使用 TensorFlow 的时候需要激活 conda 环境
conda 环境激活后，你可以测试是否成功，在终端进入 python，输入下面代码，没有提示错误，说明安装 TensorFlow 成功：

$ python
...
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
>>> print(sess.run(hello))
Hello, TensorFlow!
>>> a = tf.constant(10)
>>> b = tf.constant(32)
>>> print(sess.run(a + b))
42
>>>

当你不用 TensorFlow 的时候，关闭环境:

(tensorflow)$ source deactivate

$  # Your prompt should change back

再次使用的时候再激活:

$ source activate tensorflow
(tensorflow)$  # Run Python programs that use TensorFlow.
...

(tensorflow)$ source deactivate

在 Jupyter notebook 里用 TensorFlow
我在 (tensorflow)$ 直接输入 jupyter notebook 后，输入 import tensorflow as tf 是有错误的，可以参考这里。

6. TensorFlow 基本用法

接下来按照官方文档中的具体代码，来看一下基本用法。

你需要理解在TensorFlow中，是如何：

将计算流程表示成图；
通过Sessions来执行图计算；
将数据表示为tensors；
使用Variables来保持状态信息；
分别使用feeds和fetches来填充数据和抓取任意的操作结果；

先看个栗子：
例1，生成三维数据，然后用一个平面拟合它：

# (tensorflow)$ python   用 Python API 写 TensorFlow 示例代码

import tensorflow as tf
import numpy as np

# 用 NumPy 随机生成 100 个数据
x_data = np.float32(np.random.rand(2, 100)) 
y_data = np.dot([0.100, 0.200], x_data) + 0.300

# 构造一个线性模型
b = tf.Variable(tf.zeros([1]))
W = tf.Variable(tf.random_uniform([1, 2], -1.0, 1.0))
y = tf.matmul(W, x_data) + b

# 最小化方差
loss = tf.reduce_mean(tf.square(y - y_data))
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)

# 初始化变量
init = tf.initialize_all_variables()

# 启动图 (graph)
sess = tf.Session()
sess.run(init)

# 拟合平面
for step in xrange(0, 201):
    sess.run(train)
    if step % 20 == 0:
        print step, sess.run(W), sess.run(b)

# 输出结果为：
0 [[-0.14751725  0.75113136]] [ 0.2857058]
20 [[ 0.06342752  0.32736415]] [ 0.24482927]
40 [[ 0.10146417  0.23744738]] [ 0.27712563]
60 [[ 0.10354312  0.21220125]] [ 0.290878]
80 [[ 0.10193551  0.20427427]] [ 0.2964265]
100 [[ 0.10085492  0.201565  ]] [ 0.298612]
120 [[ 0.10035028  0.20058727]] [ 0.29946309]
140 [[ 0.10013894  0.20022322]] [ 0.29979277]
160 [[ 0.1000543   0.20008542]] [ 0.29992008]
180 [[ 0.10002106  0.20003279]] [ 0.29996923]
200 [[ 0.10000814  0.20001261]] [ 0.29998815]

注意这几条代码：

W = tf.Variable(tf.random_uniform([1, 2], -1.0, 1.0))

y = tf.matmul(W, x_data) + b

init = tf.initialize_all_variables()

sess = tf.Session()
sess.run(init)

sess.run(train) 
print step, sess.run(W), sess.run(b)

接下来看具体概念：

TensorFlow 用图来表示计算任务，图中的节点被称之为operation，缩写成op。
一个节点获得 0 个或者多个张量 tensor，执行计算，产生0个或多个张量。
图必须在会话(Session)里被启动，会话(Session)将图的op分发到CPU或GPU之类的设备上，同时提供执行op的方法，这些方法执行后，将产生的张量(tensor)返回。

1. 构建图
例2，计算矩阵相乘：

import tensorflow as tf

# 创建一个 常量 op, 返回值 'matrix1' 代表这个 1x2 矩阵.
matrix1 = tf.constant([[3., 3.]])

# 创建另外一个 常量 op, 返回值 'matrix2' 代表这个 2x1 矩阵.
matrix2 = tf.constant([[2.],[2.]])

# 创建一个矩阵乘法 matmul op , 把 'matrix1' 和 'matrix2' 作为输入.
# 返回值 'product' 代表矩阵乘法的结果.
product = tf.matmul(matrix1, matrix2)

默认图有三个节点, 两个 constant() op, 和一个 matmul() op. 为了真正进行矩阵相乘运算, 并得到矩阵乘法的结果, 你必须在会话里启动这个图.

2. 张量 Tensor
从向量空间到实数域的多重线性映射(multilinear maps)（v是向量空间，v*是对偶空间）
例如代码中的 [[3., 3.]]，Tensor 可以看作是一个 n 维的数组或列表。在 TensorFlow 中用 tensor 数据结构来代表所有的数据, 计算图中, 操作间传递的数据都是 tensor。

3. 在一个会话中启动图
创建一个 Session 对象, 如果无任何创建参数, 会话构造器将启动默认图。
会话负责传递 op 所需的全部输入，op 通常是并发执行的。

# 启动默认图.
sess = tf.Session()

# 调用 sess 的 'run()' 方法, 传入 'product' 作为该方法的参数，
# 触发了图中三个 op (两个常量 op 和一个矩阵乘法 op)，
# 向方法表明, 我们希望取回矩阵乘法 op 的输出.
result = sess.run(product)

# 返回值 'result' 是一个 numpy `ndarray` 对象.
print result
# ==> [[ 12.]]

# 任务完成, 需要关闭会话以释放资源。
sess.close()

交互式使用
在 Python API 中，使用一个会话 Session 来启动图, 并调用 Session.run() 方法执行操作.

为了便于在 IPython 等交互环境使用 TensorFlow，需要用 InteractiveSession 代替 Session 类, 使用 Tensor.eval() 和 Operation.run() 方法代替 Session.run()。

例3，计算 ‘x’ 减去 ‘a’：

# 进入一个交互式 TensorFlow 会话.
import tensorflow as tf
sess = tf.InteractiveSession()

x = tf.Variable([1.0, 2.0])
a = tf.constant([3.0, 3.0])

# 使用初始化器 initializer op 的 run() 方法初始化 'x' 
x.initializer.run()

# 增加一个减法 sub op, 从 'x' 减去 'a'. 运行减法 op, 输出结果 
sub = tf.sub(x, a)
print sub.eval()
# ==> [-2. -1.]

变量 Variable

上面用到的张量是常值张量(constant)。

变量 Variable，是维护图执行过程中的状态信息的. 需要它来保持和更新参数值，是需要动态调整的。

下面代码中有 tf.initialize_all_variables，是预先对变量初始化，
Tensorflow 的变量必须先初始化，然后才有值！而常值张量是不需要的。

下面的 assign() 操作和 add() 操作，在调用 run() 之前, 它并不会真正执行赋值和加和操作。

例4，使用变量实现一个简单的计数器：

# －创建一个变量, 初始化为标量 0.  初始化定义初值
state = tf.Variable(0, name="counter")

# 创建一个 op, 其作用是使 state 增加 1
one = tf.constant(1)
new_value = tf.add(state, one)
update = tf.assign(state, new_value)

# 启动图后, 变量必须先经过`初始化` (init) op 初始化,
# 才真正通过Tensorflow的initialize_all_variables对这些变量赋初值
init_op = tf.initialize_all_variables()

# 启动默认图, 运行 op
with tf.Session() as sess:

  # 运行 'init' op
  sess.run(init_op)

  # 打印 'state' 的初始值
  # 取回操作的输出内容, 可以在使用 Session 对象的 run() 调用 执行图时, 
  # 传入一些 tensor, 这些 tensor 会帮助你取回结果. 
  # 此处只取回了单个节点 state，
  # 也可以在运行一次 op 时一起取回多个 tensor: 
  # result = sess.run([mul, intermed])
  print sess.run(state)

  # 运行 op, 更新 'state', 并打印 'state'
  for _ in range(3):
    sess.run(update)
    print sess.run(state)

# 输出:

# 0
# 1
# 2
# 3

上面的代码定义了一个如下的计算图：

Ok，总结一下，来一个清晰的代码：
过程就是：建图->启动图->运行取值

计算矩阵相乘：

import tensorflow as tf

# 建图
matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.],[2.]])

product = tf.matmul(matrix1, matrix2)

# 启动图
sess = tf.Session()

# 取值
result = sess.run(product)
print result

sess.close()

上面的几个代码介绍了基本用法，通过观察，有没有觉得 tf 和 numpy 有点像呢。

TensorFlow和普通的Numpy的对比
在cs224d的课件中有下面这个代码，来看一下二者之间的区别：

eval()

在 Python 中定义完 a 后，直接打印就可以看到 a。

In [37]: a = np.zeros((2,2))

In [39]: print(a)
[[ 0.  0.]
 [ 0.  0.]]

但是在 Tensorflow 中需要显式地输出(evaluation，也就是说借助eval()函数)！

In [38]: ta = tf.zeros((2,2))

In [40]: print(ta)
Tensor("zeros_1:0", shape=(2, 2), dtype=float32)

In [41]: print(ta.eval())
[[ 0.  0.]
[ 0. 0.]]

通过几个例子了解了基本的用法，feed 在上面的例子中还没有写到，下一次就能用到了，其他的可以查询这里。

Day 1 宏观了解了 NLP，Day 2 搞定了工具，下次要直接先进入实战，训练模型，先从 Logistic 和 NN 开始，一边看模型一边写代码一边思考模型原理，这样理解才会更深刻！

文／不会停的蜗牛（简书作者）
原文链接：http://www.jianshu.com/p/6766fbcd43b9
著作权归作者所有，转载请联系作者获得授权，并标注“简书作者”。

数据挖掘与机器学习

循环神经网络－－－Recurrent Neural Networks

November 21, 2016 zr9558 Leave a comment

循环神经网络（Recurrent Neural Networks）是目前非常流行的神经网络模型，在自然语言处理的很多任务中已经展示出卓越的效果。但是在介绍 RNN 的诸多文章中，通常都是介绍 RNN 的使用方法和实战效果，很少有文章会介绍关于该神经网络的训练过程。

循环神经网络是一个在时间上传递的神经网络，网络的深度就是时间的长度。该神经网络是专门用来处理时间序列问题的，能够提取时间序列的信息。如果是前向神经网络，每一层的神经元信号只能够向下一层传播，样本的处理在时刻上是独立的。对于循环神经网络而言，神经元在这个时刻的输出可以直接影响下一个时间点的输入，因此该神经网络能够处理时间序列方面的问题。

本文将会从数学的角度展开关于循环神经网络的使用方法和训练过程，在本文中，会假定读者已经掌握数学分析中的导数，偏导数，链式法则，梯度下降法等基础内容。本文将会使用传统的后向传播算法（Back Propagation）来训练 RNN 模型。

微信公众号文章：循环神经网络

This slideshow requires JavaScript.

数据挖掘与机器学习

线性回归（Linear Regression）

October 28, 2016 zr9558 Leave a comment

回归方法是为了对连续型的数据做出预测，其中最简单的回归方法当然就是线性回归。顾名思义，线性回归就是使用线性方程来对已知的数据集合进行拟合，达到预测未来的目的。线性回归的优点就是结果十分容易理解，计算公式简单；缺点则是对非线性的数据拟合程度不够好。例如，用一个线性函数 $y = kx + b$ 去拟合二次函数 $f(x) = x^{2}$ ，结果总是不尽人意。为了解决这类问题，有人提出了局部加权线性回归（locally weighted linear regression），岭回归（ridge regression），LASSO 和 前向逐步线性回归（forward stagewise linear regression）。本文中将会一一介绍这些回归算法。

（一）线性回归（Linear Regression）

假设矩阵 X 的每一行表示一个样本，每一列表示相应的特征，列向量 Y 表示矩阵 X 所对应的取值，那么我们需要找到一个列向量 $\Theta$ 使得 $Y=X\Theta$ 。当然，这样的 $\Theta$ 在现实的数据集中几乎不可能存在。不过，我们可以寻找一个 $\Theta$ 使得列向量 $Y-X\Theta$ 的 Eulidean 范数足够小。换言之，我们需要找到一个向量 $\Theta$ 使得

$\sum_{i=1}^{m}(y_{i}-x_{i}\Theta)^{2} = (Y-X\Theta)^{T}(Y-X\Theta)$

的取值足够小，其中 m 是矩阵 X 的行数， $x_{i}$ 表示矩阵 X 的第 i 个行向量。通过数学计算可以得到：

$(Y-X\Theta)^{T}(Y-X\Theta)=Y^{T}Y-2Y^{T}X\Theta + \Theta^{T}X^{T}X\Theta$

对 $\Theta$ 求导之后得到： $-2X^{T}Y + 2X^{T}X\Theta=0$ ，求解 $\Theta$ 之后得到 $\Theta = (X^{T}X)^{-1}X^{T}Y$ 。因此，对于矩阵 $X$ 和列向量 $Y$ 而言，最佳的线性回归系数是

$\Theta = (X^{T}X)^{-1}X^{T}Y.$

举例说明：蓝色的是数据集，使用线性回归计算的话会得到一条直线。

qq%e5%9b%be%e7%89%8720161028185606

（二）局部加权线性回归（Locally Weighted Linear Regression）

线性回归的一个问题就是会出现欠拟合的情况，因为线性方程确实很难精确地描述现实生活的大量数据集。因此有人提出了局部加权线性回归（Locally Weighted Linear Regression），在该算法中，给每一个点都赋予一定的权重，也就是

$\sum_{i=1}^{m}w_{i}(y_{i}-x_{i}\Theta)^{2} = (Y-X\Theta)^{T}W(Y-X\Theta),$

其中 $W$ 表示以 $\{w_{1},...,w_{m}\}$ 为对角线的对角矩阵，其中 m 是矩阵 X 的行数， $x_{i}$ 表示矩阵 X 的第 i 个行向量。通过计算可以得到：

$(Y-X\Theta)^{T}W(Y-X\Theta)=Y^{T}WY-2Y^{T}WX\Theta+\Theta^{T}X^{T}WX\Theta,$

对 $\Theta$ 求导之后得到：

$-2(Y^{T}WX)^{T} + 2 X^{T}WX\Theta = -2X^{T}WY+2X^{T}WX\Theta.$

令导数等于零之后得到： $\Theta = (X^{T}WX)^{-1}X^{T}WY$ 。因此，如果使用局部加权线性回归的话，最佳的系数就是

$\Theta = (X^{T}WX)^{-1}X^{T}WY.$

局部加权线性回归需要确定权重矩阵 W 的值，那么就需要定义对角线的取值，通常情况下我们会使用高斯核。

$w_{i} = \exp\{-\frac{(x_{i}-x)^{2}}{2k^{2}}\}.$

其中 k 是参数。从高斯核的定义可以看出，如果 $x$ 与 $x_{i}$ 隔得很近，那么 $w_{i}$ 就会较大；如果隔得较远，那么 $w_{i}$ 就会趋向于零。意思就是说：在局部形成了线性回归的算法，在整体并不一定是线性回归。在局部线性回归中， $k$ 就是唯一的参数值。

如果选择了合适的 $k$ ，可以得到一条看上去还不错的曲线；如果选择了不合适的 $k$ ，就有可能出现过拟合的情况。

qq%e5%9b%be%e7%89%8720161028185611

（三）岭回归（Ridge Regression）和 LASSO

如果在某种特殊的情况下，特征的个数 n 大于样本的个数 m，i.e. 矩阵 X 的列数多于行数，那么 X 不是一个满秩矩阵，因此在计算 $(X^{T}X)^{-1}$ 的时候会出现问题。为了解决这个问题，有人引入了岭回归（ridge regression）的概念。也就是说在计算矩阵的逆的时候，增加了一个对角矩阵，目的是使得可以对矩阵进行求逆。用数学语言来描述就是矩阵 $X^{T}X$ 加上 $\lambda I$ ，这里的 I 是一个 $n\times n$ 的对角矩阵，使得矩阵 $X^{T}X+\lambda I$ 是一个可逆矩阵。在这种情况下，回归系数的计算公式变成了

$\Theta = (X^{T}X+\lambda I)^{-1}X^{T}Y.$

岭回归最初只是为了解决特征数目大于样本数目的情况，现在也可以用于在估计中加入偏差，从而得到更好的估计。

从另一个角度来讲，当样本的特征很多，而样本的数量相对少的时候， $\sum_{i=1}^{m}(y_{i}-x_{i}\Theta)^{2}$ 很容易过拟合。为了缓解过拟合的问题，可以引入正则化项。如果使用 $L^{2}$ 正则化，那么目标函数则是

$\sum_{i=1}^{m}(y_{i}-x_{i}\Theta)^{2}+\lambda||\Theta||_{2}^{2}=(Y-X\Theta)^{T}(Y-X\Theta)+\lambda \Theta^{T}\Theta,$

其中 $\lambda>0$ 。通过数学推导可以得到：

$(Y-X\Theta)^{T}(Y-X\Theta)+\lambda \Theta^{T}\Theta = Y^{T}Y - 2\Theta^{T}X^{T}Y+\Theta^{T}X^{T}X\Theta+\lambda\Theta^{T}I\Theta.$

对 $\Theta$ 求导之后得到：

$-2X^{T}Y+2(X^{T}X+\lambda I)\Theta,$

令导数等于零可以得到： $\Theta = (X^{T}X + \lambda I)^{-1}X^{T}Y.$ 因此，从另一个角度来说，岭回归（Ridge Regression）是在线性规划的基础上添加了一个 $L^{2}$ 范数的正则化，可以用来降低过拟合的风险。

需要注意的是：在进行岭回归的时候，需要在一开始就对特征进行标准化处理，使得每一维度的特征具有相同的重要性。具体来说就是 (特征-特征的均值)/特征的方差，让每一维度的特征都满足零均值和单位方差。

另外，如果把岭回归中的 $L^{2}$ 范数正则化替换成 $L^{1}$ 范数，那么目标函数就变成了

$\sum_{i=1}^{m}(y_{i}-x_{i}\Theta)^{2}+\lambda ||\Theta||_{1}$

其中的参数 $\lambda>0$ 。 $L^{1}$ 和 $L^{2}$ 范数都有助于降低过拟合的风险，使用 $L^{1}$ 范数的方法被称为 LASSO（Least Absolute Shrinkage and Selection Operation）。使用 $L^{1}$ 范数比使用 $L^{2}$ 范数更加容易获得稀疏解（sparse solution），即它求得的参数 $\Theta$ 会有更少的非零分量。 $\Theta$ 获得稀疏解意味着初始的 n 个特征中仅有对应着 $\Theta$ 的非零分量的特征才会出现在最终的模型中。于是，求解 $L^{1}$ 范数正则化的结果是得到了仅采用一部分原始特征的模型；从另一个角度来说，基于 $L^{1}$ 正则化的学习方法就是一种嵌入式的特征选择方法，其特征选择的过程和训练的过程融为一体，同时完成。

（四）前向逐步线性回归（Forward Stagewise Linear Regression）

前向逐步线性回归算法是一种贪心算法，目的是在每一步都尽可能的减少误差。初始化的时候，所有的权重都设置为1，然后每一步所做的据测就是对某个权重增加或者减少一个很小的值 $\epsilon$ 。

该算法的伪代码如下所示：

数据标准化，使其分布满足零均值和单位方差
在每一轮的迭代中：
  设置当前最小误差为正无穷
  对每个特征：
    增大或者缩小：
      改变一个系数得到一个新的权重W
      计算新W下的误差
      如果误差Error小于当前误差：设置Wbest等于当前的W
    将W设置为新的Wbest

（五）总结

与分类一样，回归也是预测目标值的过程。但是分类预测的是离散型变量，回归预测的是连续型变量。但是在大多数情况下，数据之间会很复杂，这种情况下使用线性模型确实不是特别合适，需要采用其余的方法，例如非线性模型等。

数据挖掘与机器学习

如何在 Kaggle 首战中进入前 10%

October 25, 2016 zr9558 Leave a comment

Introduction

Kaggle 是目前最大的 Data Scientist 聚集地。很多公司会拿出自家的数据并提供奖金，在 Kaggle 上组织数据竞赛。我最近完成了第一次比赛，在 2125 个参赛队伍中排名第 98 位（~ 5%）。因为是第一次参赛，所以对这个成绩我已经很满意了。在 Kaggle 上一次比赛的结果除了排名以外，还会显示的就是 Prize Winner，10% 或是 25% 这三档。所以刚刚接触 Kaggle 的人很多都会以 25% 或是 10% 为目标。在本文中，我试图根据自己第一次比赛的经验和从其他 Kaggler 那里学到的知识，为刚刚听说 Kaggle 想要参赛的新手提供一些切实可行的冲刺 10% 的指导。

本文的英文版见这里。

Kaggler 绝大多数都是用 Python 和 R 这两门语言的。因为我主要使用 Python，所以本文提到的例子都会根据 Python 来。不过 R 的用户应该也能不费力地了解到工具背后的思想。

首先简单介绍一些关于 Kaggle 比赛的知识：

不同比赛有不同的任务，分类、回归、推荐、排序等。比赛开始后训练集和测试集就会开放下载。
比赛通常持续 2 ~ 3 个月，每个队伍每天可以提交的次数有限，通常为 5 次。
比赛结束前一周是一个 Deadline，在这之后不能再组队，也不能再新加入比赛。所以想要参加比赛请务必在这一 Deadline 之前有过至少一次有效的提交。
一般情况下在提交后会立刻得到得分的反馈。不同比赛会采取不同的评分基准，可以在分数栏最上方看到使用的评分方法。
反馈的分数是基于测试集的一部分计算的，剩下的另一部分会被用于计算最终的结果。所以最后排名会变动。
LB 指的就是在 Leaderboard 得到的分数，由上，有 Public LB 和 Private LB 之分。
自己做的 Cross Validation 得到的分数一般称为 CV 或是 Local CV。一般来说 CV 的结果比 LB 要可靠。
新手可以从比赛的 Forum 和 Scripts 中找到许多有用的经验和洞见。不要吝啬提问，Kaggler 都很热情。

那么就开始吧！

P.S. 本文假设读者对 Machine Learning 的基本概念和常见模型已经有一定了解。 Enjoy Reading!

General Approach

在这一节中我会讲述一次 Kaggle 比赛的大致流程。

Data Exploration

在这一步要做的基本就是 EDA (Exploratory Data Analysis)，也就是对数据进行探索性的分析，从而为之后的处理和建模提供必要的结论。

通常我们会用 pandas 来载入数据，并做一些简单的可视化来理解数据。

Visualization

通常来说 matplotlib 和 seaborn 提供的绘图功能就可以满足需求了。

比较常用的图表有：

查看目标变量的分布。当分布不平衡时，根据评分标准和具体模型的使用不同，可能会严重影响性能。
对 Numerical Variable，可以用 Box Plot 来直观地查看它的分布。
对于坐标类数据，可以用 Scatter Plot 来查看它们的分布趋势和是否有离群点的存在。
对于分类问题，将数据根据 Label 的不同着不同的颜色绘制出来，这对 Feature 的构造很有帮助。
绘制变量之间两两的分布和相关度图表。

这里有一个在著名的 Iris 数据集上做了一系列可视化的例子，非常有启发性。

Statistical Tests

我们可以对数据进行一些统计上的测试来验证一些假设的显著性。虽然大部分情况下靠可视化就能得到比较明确的结论，但有一些定量结果总是更理想的。不过，在实际数据中经常会遇到非 i.i.d. 的分布。所以要注意测试类型的的选择和对显著性的解释。

在某些比赛中，由于数据分布比较奇葩或是噪声过强，Public LB 的分数可能会跟Local CV 的结果相去甚远。可以根据一些统计测试的结果来粗略地建立一个阈值，用来衡量一次分数的提高究竟是实质的提高还是由于数据的随机性导致的。

Data Preprocessing

大部分情况下，在构造 Feature 之前，我们需要对比赛提供的数据集进行一些处理。通常的步骤有：

有时数据会分散在几个不同的文件中，需要 Join 起来。
处理 Missing Data。
处理 Outlier。
必要时转换某些 Categorical Variable 的表示方式。
有些 Float 变量可能是从未知的 Int 变量转换得到的，这个过程中发生精度损失会在数据中产生不必要的 Noise，即两个数值原本是相同的却在小数点后某一位开始有不同。这对 Model 可能会产生很负面的影响，需要设法去除或者减弱 Noise。

这一部分的处理策略多半依赖于在前一步中探索数据集所得到的结论以及创建的可视化图表。在实践中，我建议使用 iPython Notebook 进行对数据的操作，并熟练掌握常用的 pandas 函数。这样做的好处是可以随时得到结果的反馈和进行修改，也方便跟其他人进行交流（在 Data Science 中 Reproducible Results 是很重要的)。

下面给两个例子。

Outlier

这是经过 Scaling 的坐标数据。可以发现右上角存在一些离群点，去除以后分布比较正常。

Dummy Variables

对于 Categorical Variable，常用的做法就是 One-hot encoding。即对这一变量创建一组新的伪变量，对应其所有可能的取值。这些变量中只有这条数据对应的取值为 1，其他都为 0。

如下，将原本有 7 种可能取值的 Weekdays 变量转换成 7 个 Dummy Variables。

要注意，当变量可能取值的范围很大（比如一共有成百上千类）时，这种简单的方法就不太适用了。这时没有有一个普适的方法，但我会在下一小节描述其中一种。

Feature Engineering

有人总结 Kaggle 比赛是 “Feature 为主，调参和 Ensemble 为辅”，我觉得很有道理。Feature Engineering 能做到什么程度，取决于对数据领域的了解程度。比如在数据包含大量文本的比赛中，常用的 NLP 特征就是必须的。怎么构造有用的 Feature，是一个不断学习和提高的过程。

一般来说，当一个变量从直觉上来说对所要完成的目标有帮助，就可以将其作为 Feature。至于它是否有效，最简单的方式就是通过图表来直观感受。比如：

Feature Selection

总的来说，我们应该生成尽量多的 Feature，相信 Model 能够挑出最有用的 Feature。但有时先做一遍 Feature Selection 也能带来一些好处：

Feature 越少，训练越快。
有些 Feature 之间可能存在线性关系，影响 Model 的性能。
通过挑选出最重要的 Feature，可以将它们之间进行各种运算和操作的结果作为新的 Feature，可能带来意外的提高。

Feature Selection 最实用的方法也就是看 Random Forest 训练完以后得到的Feature Importance 了。其他有一些更复杂的算法在理论上更加 Robust，但是缺乏实用高效的实现，比如这个。从原理上来讲，增加 Random Forest 中树的数量可以在一定程度上加强其对于 Noisy Data 的 Robustness。

看 Feature Importance 对于某些数据经过脱敏处理的比赛尤其重要。这可以免得你浪费大把时间在琢磨一个不重要的变量的意义上。

Feature Encoding

这里用一个例子来说明在一些情况下 Raw Feature 可能需要经过一些转换才能起到比较好的效果。

假设有一个 Categorical Variable 一共有几万个取值可能，那么创建 Dummy Variables 的方法就不可行了。这时一个比较好的方法是根据 Feature Importance 或是这些取值本身在数据中的出现频率，为最重要（比如说前 95% 的 Importance）那些取值（有很大可能只有几个或是十几个）创建 Dummy Variables，而所有其他取值都归到一个“其他”类里面。

Model Selection

准备好 Feature 以后，就可以开始选用一些常见的模型进行训练了。Kaggle 上最常用的模型基本都是基于树的模型：

Gradient Boosting
Random Forest
Extra Randomized Trees

以下模型往往在性能上稍逊一筹，但是很适合作为 Ensemble 的 Base Model。这一点之后再详细解释。（当然，在跟图像有关的比赛中神经网络的重要性还是不能小觑的。）

SVM
Linear Regression
Logistic Regression
Neural Networks

以上这些模型基本都可以通过 sklearn 来使用。

当然，这里不能不提一下 Xgboost。Gradient Boosting 本身优秀的性能加上Xgboost 高效的实现，使得它在 Kaggle 上广为使用。几乎每场比赛的获奖者都会用 Xgboost 作为最终 Model 的重要组成部分。在实战中，我们往往会以 Xgboost 为主来建立我们的模型并且验证 Feature 的有效性。顺带一提，在 Windows 上安装 Xgboost 很容易遇到问题，目前已知最简单、成功率最高的方案可以参考我在这篇帖子中的描述。

Model Training

在训练时，我们主要希望通过调整参数来得到一个性能不错的模型。一个模型往往有很多参数，但其中比较重要的一般不会太多。比如对 sklearn 的 RandomForestClassifier 来说，比较重要的就是随机森林中树的数量 n_estimators 以及在训练每棵树时最多选择的特征数量 max_features。所以我们需要对自己使用的模型有足够的了解，知道每个参数对性能的影响是怎样的。

通常我们会通过一个叫做 Grid Search 的过程来确定一组最佳的参数。其实这个过程说白了就是根据给定的参数候选对所有的组合进行暴力搜索。

1
2
3

param_grid = {‘n_estimators’: [300, 500], ‘max_features’: [10, 12, 14]}
model = grid_search.GridSearchCV(estimator=rfr, param_grid=param_grid, n_jobs=1, cv=10, verbose=20, scoring=RMSE)
model.fit(X_train, y_train)

顺带一提，Random Forest 一般在 max_features 设为 Feature 数量的平方根附近得到最佳结果。

这里要重点讲一下 Xgboost 的调参。通常认为对它性能影响较大的参数有：

eta：每次迭代完成后更新权重时的步长。越小训练越慢。
num_round：总共迭代的次数。
subsample：训练每棵树时用来训练的数据占全部的比例。用于防止 Overfitting。
colsample_bytree：训练每棵树时用来训练的特征的比例，类似 RandomForestClassifier 的 max_features。
max_depth：每棵树的最大深度限制。与 Random Forest 不同，Gradient Boosting 如果不对深度加以限制，最终是会 Overfit 的。
early_stopping_rounds：用于控制在 Out Of Sample 的验证集上连续多少个迭代的分数都没有提高后就提前终止训练。用于防止 Overfitting。

一般的调参步骤是：

将训练数据的一部分划出来作为验证集。
先将 eta 设得比较高（比如 0.1），num_round 设为 300 ~ 500。
用 Grid Search 对其他参数进行搜索
逐步将 eta 降低，找到最佳值。
以验证集为 watchlist，用找到的最佳参数组合重新在训练集上训练。注意观察算法的输出，看每次迭代后在验证集上分数的变化情况，从而得到最佳的 early_stopping_rounds。

X_dtrain, X_deval, y_dtrain, y_deval = cross_validation.train_test_split(X_train, y_train, random_state=1026, test_size=0.3)
dtrain = xgb.DMatrix(X_dtrain, y_dtrain)
deval = xgb.DMatrix(X_deval, y_deval)
watchlist = [(deval, ‘eval’)]
params = {
‘booster’: ‘gbtree’,
‘objective’: ‘reg:linear’,
‘subsample’: 0.8,
‘colsample_bytree’: 0.85,
‘eta’: 0.05,
‘max_depth’: 7,
‘seed’: 2016,
‘silent’: 0,
‘eval_metric’: ‘rmse’
}
clf = xgb.train(params, dtrain, 500, watchlist, early_stopping_rounds=50)
pred = clf.predict(xgb.DMatrix(df_test))

最后要提一点，所有具有随机性的 Model 一般都会有一个 seed 或是 random_state 参数用于控制随机种子。得到一个好的 Model 后，在记录参数时务必也记录下这个值，从而能够在之后重现 Model。

Cross Validation

Cross Validation 是非常重要的一个环节。它让你知道你的 Model 有没有 Overfit，是不是真的能够 Generalize 到测试集上。在很多比赛中 Public LB 都会因为这样那样的原因而不可靠。当你改进了 Feature 或是 Model 得到了一个更高的 CV 结果，提交之后得到的 LB 结果却变差了，一般认为这时应该相信 CV 的结果。当然，最理想的情况是多种不同的 CV 方法得到的结果和 LB 同时提高，但这样的比赛并不是太多。

在数据的分布比较随机均衡的情况下，5-Fold CV 一般就足够了。如果不放心，可以提到 10-Fold。但是 Fold 越多训练也就会越慢，需要根据实际情况进行取舍。

很多时候简单的 CV 得到的分数会不大靠谱，Kaggle 上也有很多关于如何做 CV 的讨论。比如这个。但总的来说，靠谱的 CV 方法是 Case By Case 的，需要在实际比赛中进行尝试和学习，这里就不再（也不能）叙述了。

Ensemble Generation

Ensemble Learning 是指将多个不同的 Base Model 组合成一个 Ensemble Model 的方法。它可以同时降低最终模型的 Bias 和 Variance（证明可以参考这篇论文，我最近在研究类似的理论，可能之后会写新文章详述)，从而在提高分数的同时又降低 Overfitting 的风险。在现在的 Kaggle 比赛中要不用 Ensemble 就拿到奖金几乎是不可能的。

常见的 Ensemble 方法有这么几种：

Bagging：使用训练数据的不同随机子集来训练每个 Base Model，最后进行每个 Base Model 权重相同的 Vote。也即 Random Forest 的原理。
Boosting：迭代地训练 Base Model，每次根据上一个迭代中预测错误的情况修改训练样本的权重。也即 Gradient Boosting 的原理。比 Bagging 效果好，但更容易 Overfit。
Blending：用不相交的数据训练不同的 Base Model，将它们的输出取（加权）平均。实现简单，但对训练数据利用少了。
Stacking：接下来会详细介绍。

从理论上讲，Ensemble 要成功，有两个要素：

Base Model 之间的相关性要尽可能的小。这就是为什么非 Tree-based Model 往往表现不是最好但还是要将它们包括在 Ensemble 里面的原因。Ensemble 的 Diversity 越大，最终 Model 的 Bias 就越低。
Base Model 之间的性能表现不能差距太大。这其实是一个 Trade-off，在实际中很有可能表现相近的 Model 只有寥寥几个而且它们之间相关性还不低。但是实践告诉我们即使在这种情况下 Ensemble 还是能大幅提高成绩。

Stacking

相比 Blending，Stacking 能更好地利用训练数据。以 5-Fold Stacking 为例，它的基本原理如图所示：

整个过程很像 Cross Validation。首先将训练数据分为 5 份，接下来一共 5 个迭代，每次迭代时，将 4 份数据作为 Training Set 对每个 Base Model 进行训练，然后在剩下一份 Hold-out Set 上进行预测。同时也要将其在测试数据上的预测保存下来。这样，每个 Base Model 在每次迭代时会对训练数据的其中 1 份做出预测，对测试数据的全部做出预测。5 个迭代都完成以后我们就获得了一个 #训练数据行数 x #Base Model 数量 的矩阵，这个矩阵接下来就作为第二层的 Model 的训练数据。当第二层的 Model 训练完以后，将之前保存的 Base Model 对测试数据的预测（因为每个 Base Model 被训练了 5 次，对测试数据的全体做了 5 次预测，所以对这 5 次求一个平均值，从而得到一个形状与第二层训练数据相同的矩阵）拿出来让它进行预测，就得到最后的输出。

这里给出我的实现代码：

class Ensemble(object):
def __init__(self, n_folds, stacker, base_models):
self.n_folds = n_folds
self.stacker = stacker
self.base_models = base_models
def fit_predict(self, X, y, T):
X = np.array(X)
y = np.array(y)
T = np.array(T)
folds = list(KFold(len(y), n_folds=self.n_folds, shuffle=True, random_state=2016))
S_train = np.zeros((X.shape[0], len(self.base_models)))
S_test = np.zeros((T.shape[0], len(self.base_models)))
for i, clf in enumerate(self.base_models):
S_test_i = np.zeros((T.shape[0], len(folds)))
for j, (train_idx, test_idx) in enumerate(folds):
X_train = X[train_idx]
y_train = y[train_idx]
X_holdout = X[test_idx]
# y_holdout = y[test_idx]
clf.fit(X_train, y_train)
y_pred = clf.predict(X_holdout)[:]
S_train[test_idx, i] = y_pred
S_test_i[:, j] = clf.predict(T)[:]
S_test[:, i] = S_test_i.mean(1)
self.stacker.fit(S_train, y)
y_pred = self.stacker.predict(S_test)[:]
return y_pred

获奖选手往往会使用比这复杂得多的 Ensemble，会出现三层、四层甚至五层，不同的层数之间有各种交互，还有将经过不同的 Preprocessing 和不同的 Feature Engineering 的数据用 Ensemble 组合起来的做法。但对于新手来说，稳稳当当地实现一个正确的 5-Fold Stacking 已经足够了。

*Pipeline

可以看出 Kaggle 比赛的 Workflow 还是比较复杂的。尤其是 Model Selection 和 Ensemble。理想情况下，我们需要搭建一个高自动化的 Pipeline，它可以做到：

模块化 Feature Transform，只需写很少的代码就能将新的 Feature 更新到训练集中。
自动化 Grid Search，只要预先设定好使用的 Model 和参数的候选，就能自动搜索并记录最佳的 Model。
自动化 Ensemble Generation，每个一段时间将现有最好的 K 个 Model 拿来做 Ensemble。

对新手来说，第一点可能意义还不是太大，因为 Feature 的数量总是人脑管理的过来的；第三点问题也不大，因为往往就是在最后做几次 Ensemble。但是第二点还是很有意义的，手工记录每个 Model 的表现不仅浪费时间而且容易产生混乱。

Crowdflower Search Results Relevance 的第一名获得者 Chenglong Chen 将他在比赛中使用的 Pipeline 公开了，非常具有参考和借鉴意义。只不过看懂他的代码并将其中的逻辑抽离出来搭建这样一个框架，还是比较困难的一件事。可能在参加过几次比赛以后专门抽时间出来做会比较好。

Home Depot Search Relevance

在这一节中我会具体分享我在 Home Depot Search Relevance 比赛中是怎么做的，以及比赛结束后从排名靠前的队伍那边学到的做法。

首先简单介绍这个比赛。Task 是判断用户搜索的关键词和网站返回的结果之间的相关度有多高。相关度是由 3 个人类打分取平均得到的，每个人可能打 1 ~ 3 分，所以这是一个回归问题。数据中包含用户的搜索词，返回的产品的标题和介绍，以及产品相关的一些属性比如品牌、尺寸、颜色等。使用的评分基准是 RMSE。

这个比赛非常像 Crowdflower Search Results Relevance 那场比赛。不过那边用的评分基准是 Quadratic Weighted Kappa，把 1 误判成 4 的惩罚会比把 1 判成 2 的惩罚大得多，所以在最后 Decode Prediction 的时候会更麻烦一点。除此以外那次比赛没有提供产品的属性。

EDA

由于加入比赛比较晚，当时已经有相当不错的 EDA 了。尤其是这个。从中我得到的启发有：

同一个搜索词/产品都出现了多次，数据分布显然不 i.i.d.。
文本之间的相似度很有用。
产品中有相当大一部分缺失属性，要考虑这会不会使得从属性中得到的 Feature 反而难以利用。
产品的 ID 对预测相关度很有帮助，但是考虑到训练集和测试集之间的重叠度并不太高，利用它会不会导致 Overfitting？

Preprocessing

这次比赛中我的 Preprocessing 和 Feature Engineering 的具体做法都可以在这里看到。我只简单总结一下和指出重要的点。

利用 Forum 上的 Typo Dictionary 修正搜索词中的错误。
统计属性的出现次数，将其中出现次数多又容易利用的记录下来。
将训练集和测试集合并，并与产品描述和属性 Join 起来。这是考虑到后面有一系列操作，如果不合并的话就要重复写两次了。
对所有文本能做 Stemming 和 Tokenizing，同时手工做了一部分格式统一化（比如涉及到数字和单位的）和同义词替换。

Feature

*Attribute Features
- 是否包含某个特定的属性（品牌、尺寸、颜色、重量、内用/外用、是否有能源之星认证等）
- 这个特定的属性是否匹配
Meta Features
- 各个文本域的长度
- 是否包含属性域
- 品牌（将所有的品牌做数值离散化）
- 产品 ID
简单匹配
- 搜索词是否在产品标题、产品介绍或是产品属性中出现
- 搜索词在产品标题、产品介绍或是产品属性中出现的数量和比例
- *搜索词中的第 i 个词是否在产品标题、产品介绍或是产品属性中出现
搜索词和产品标题、产品介绍以及产品属性之间的文本相似度
- BOW Cosine Similairty
- TF-IDF Cosine Similarity
- Jaccard Similarity
- *Edit Distance
- Word2Vec Distance（由于效果不好，最后没有使用，但似乎是因为用的不对）
Latent Semantic Indexing：通过将 BOW/TF-IDF Vectorization 得到的矩阵进行 SVD 分解，我们可以得到不同搜索词/产品组合的 Latent 标识。这个 Feature 使得 Model 能够在一定程度上对不同的组合做出区别，从而解决某些产品缺失某些 Feature 的问题。

值得一提的是，上面打了 * 的 Feature 都是我在最后一批加上去的。问题是，使用这批 Feature 训练得到的 Model 反而比之前的要差，而且还差不少。我一开始是以为因为 Feature 的数量变多了所以一些参数需要重新调优，但在浪费了很多时间做 Grid Search 以后却发现还是没法超过之前的分数。这可能就是之前提到的 Feature 之间的相互作用导致的问题。当时我设想过一个看到过好几次的解决方案，就是将使用不同版本 Feature 的 Model 通过 Ensemble 组合起来。但最终因为时间关系没有实现。事实上排名靠前的队伍分享的解法里面基本都提到了将不同的 Preprocessing 和 Feature Engineering 做 Ensemble 是获胜的关键。

Model

我一开始用的是 RandomForestRegressor，后来在 Windows 上折腾 Xgboost 成功了就开始用 XGBRegressor。XGB 的优势非常明显，同样的数据它只需要不到一半的时间就能跑完，节约了很多时间。

比赛中后期我基本上就是一边台式机上跑 Grid Search，一边在笔记本上继续研究 Feature。

这次比赛数据分布很不独立，所以期间多次遇到改进的 Feature 或是 Grid Search新得到的参数训练出来的模型反而 LB 分数下降了。由于被很多前辈教导过要相信自己的 CV，我的决定是将 5-Fold 提到 10-Fold，然后以 CV 为标准继续前进。

Ensemble

最终我的 Ensemble 的 Base Model 有以下四个：

RandomForestRegressor
ExtraTreesRegressor
GradientBoostingRegressor
XGBRegressor

第二层的 Model 还是用的 XGB。

因为 Base Model 之间的相关都都太高了（最低的一对也有 0.9），我原本还想引入使用 gblinear 的 XGBRegressor 以及 SVR，但前者的 RMSE 比其他几个 Model 高了 0.02（这在 LB 上有几百名的差距），而后者的训练实在太慢了。最后还是只用了这四个。

值得一提的是，在开始做 Stacking 以后，我的 CV 和 LB 成绩的提高就是完全同步的了。

在比赛最后两天，因为身心疲惫加上想不到还能有什么显著的改进，我做了一件事情：用 20 个不同的随机种子来生成 Ensemble，最后取 Weighted Average。这个其实算是一种变相的 Bagging。其意义在于按我实现 Stacking 的方式，我在训练 Base Model 时只用了 80% 的训练数据，而训练第二层的 Model 时用了 100% 的数据，这在一定程度上增大了 Overfitting 的风险。而每次更改随机种子可以确保每次用的是不同的 80%，这样在多次训练取平均以后就相当于逼近了使用 100% 数据的效果。这给我带来了大约 0.0004 的提高，也很难受说是真的有效还是随机性了。

比赛结束后我发现我最好的单个 Model 在 Private LB 上的得分是 0.46378，而最终 Stacking 的得分是 0.45849。这是 174 名和 98 名的差距。也就是说，我单靠 Feature 和调参进到了前 10%，而 Stacking 使我进入了前 5%。

Lessons Learned

比赛结束后一些队伍分享了他们的解法，从中我学到了一些我没有做或是做的不够好的地方：

产品标题的组织方式是有 Pattern 的，比如一个产品是否带有某附件一定会用With/Without XXX 的格式放在标题最后。
使用外部数据，比如 WordNet，Reddit 评论数据集等来训练同义词和上位词（在一定程度上替代 Word2Vec）词典。
基于字母而不是单词的 NLP Feature。这一点我让我十分费解，但请教以后发现非常有道理。举例说，排名第三的队伍在计算匹配度时，将搜索词和内容中相匹配的单词的长度也考虑进去了。这是因为他们发现越长的单词约具体，所以越容易被用户认为相关度高。此外他们还使用了逐字符的序列比较（difflib.SequenceMatcher），因为这个相似度能够衡量视觉上的相似度。像这样的 Feature 的确不是每个人都能想到的。
标注单词的词性，找出中心词，计算基于中心词的各种匹配度和距离。这一点我想到了，但没有时间尝试。
将产品标题/介绍中 TF-IDF 最高的一些 Trigram 拿出来，计算搜索词中出现在这些 Trigram 中的比例；反过来以搜索词为基底也做一遍。这相当于是从另一个角度抽取了一些 Latent 标识。
一些新颖的距离尺度，比如 Word Movers Distance
除了 SVD 以外还可以用上 NMF。
最重要的 Feature 之间的 Pairwise Polynomial Interaction。
针对数据不 i.i.d. 的问题，在 CV 时手动构造测试集与验证集之间产品 ID 不重叠和重叠的两种不同分割，并以与实际训练集/测试集的分割相同的比例来做 CV 以逼近 LB 的得分分布。

至于 Ensemble 的方法，我暂时还没有办法学到什么，因为自己只有最简单的 Stacking 经验。

Summary

Takeaways

比较早的时候就开始做 Ensemble 是对的，这次比赛到倒数第三天我还在纠结 Feature。
很有必要搭建一个 Pipeline，至少要能够自动训练并记录最佳参数。
Feature 为王。我花在 Feature 上的时间还是太少。
可能的话，多花点时间去手动查看原始数据中的 Pattern。

Issues Raised

我认为在这次比赛中遇到的一些问题是很有研究价值的：

在数据分布并不 i.i.d. 甚至有 Dependency 时如何做靠谱的 CV。
如何量化 Ensemble 中 Diversity vs. Accuracy 的 Trade-off。
如何处理 Feature 之间互相影响导致性能反而下降。

Beginner Tips

给新手的一些建议：

选择一个感兴趣的比赛。如果你对相关领域原本就有一些洞见那就更理想了。
根据我描述的方法开始探索、理解数据并进行建模。
通过 Forum 和 Scripts 学习其他人对数据的理解和构建 Feature 的方式。
如果之前有过类似的比赛，可以去找当时获奖者的 Interview 和 Blog Post 作为参考，往往很有用。
在得到一个比较不错的 LB 分数（比如已经接近前 10%）以后可以开始尝试做 Ensemble。
如果觉得自己有希望拿到奖金，开始找人组队吧！
到比赛结束为止要绷紧一口气不能断，尽量每天做一些新尝试。
比赛结束后学习排名靠前的队伍的方法，思考自己这次比赛中的不足和发现的问题，可能的话再花点时间将学到的新东西用实验进行确认，为下一次比赛做准备。
好好休息！

Reference

Introduction

Kaggle is the best place for learning from other data scientists. Many companies provide data and prize money to set up data science competitions on Kaggle. Recently I had my first shot on Kaggle and ranked 98th (~ 5%) among 2125 teams. Since this is my Kaggle debut, I feel quite satisfied. Because many Kaggle beginners set 10% as their first goal, here I want to share my experience in achieving that goal.

This post is also available in Chinese.

Most Kagglers use Python and R. I prefer Python, but R users should have no difficulty in understanding the ideas behind tools and languages.

First let’s go through some facts about Kaggle competitions in case you are not very familiar with them.

Different competitions have different tasks: classification, regression, recommendation, ordering… Training set and testing set will be open for download after the competition launches.
A competition typically lasts for 2 ~ 3 months. Each team can submit for a limited amount of times a day. Usually it’s 5 times a day.
There will be a deadline one week before the end of the competition, after which you cannot merge teams or enter the competition. Therefore be sure to have at least one valid submission before that.
You will get you score immediately after the submission. Different competitions use different scoring metrics, which are explained by the question mark on the leaderboard.
The score you get is calculated on a subset of testing set, which is commonly referred to as a Public LB score. Whereas the final result will use the remaining data in the testing set, which is referred to as Private LB score.
The score you get by local cross validation is commonly referred to as a CVscore. Generally speaking, CV scores are more reliable than LB scores.
Beginners can learn a lot from Forum and Scripts. Do not hesitate to ask, Kagglers are very kind and helpful.

I assume that readers are familiar with basic concepts and models of machine learning. Enjoy reading!

General Approach

In this section, I will walk you through the whole process of a Kaggle competition.

Data Exploration

What we do at this stage is called EDA (Exploratory Data Analysis), which means analytically exploring data in order to provide some insights for subsequent processing and modeling.

Usually we would load the data using Pandas and make some visualizations to understand the data.

Visualization

For plotting, Matplotlib and Seaborn should suffice.

Some common practices:

Inspect the distribution of target variable. Depending on what scoring metric is used, an imbalanced distribution of target variable might harm the model’s performance.
For numerical variables, use box plot to inspect their distributions.
For coordinates-like data, use scatter plot to inspect the distribution and check for outliers.
For classification tasks, plot the data with points colored according to their labels. This can help with feature engineering.
Make pairwise distribution plots and examine correlations between pairs of variables.

Be sure to read this very inspiring tutorial of exploratory visualization before you go on.

Statistical Tests

We can perform some statistical tests to confirm our hypotheses. Sometimes we can get enough intuition from visualization, but quantitative results are always good to have. Note that we will always encounter non-i.i.d. data in real world. So we have to be careful about the choice of tests and how we interpret the findings.

In many competitions public LB scores are not very consistent with local CV scores due to noise or non-i.id. distribution. You can use test results to roughly set a threshold for determining whether an increase of score is an genuine one or due to randomness.

Data Preprocessing

In most cases, we need to preprocess the dataset before constructing features. Some common steps are:

Sometimes several files are provided and we need to join them.
Deal with missing data.
Deal with outliers.
Encode categorical variables if necessary.
Deal with noise. For example you may have some floats derived from unknown integers. The loss of precision during floating-point operations can bring much noise into the data: two seemingly different values might be the same before conversion. Sometimes noise harms model and we would want to avoid that.

How we choose to perform preprocessing largely depends on what we learn about the data in the previous stage. In practice, I recommend using iPython Notebook for data manipulating and mastering usages of frequently used Pandas operations. The advantage is that you get to see the results immediately and are able to modify or rerun operations. Also this makes it very convenient to share your approaches with others. After all reproducible results are very important in data science.

Let’s have some examples.

Outlier

The plot shows some scaled coordinates data. We can see that there are some outliers in the top-right corner. Exclude them and the distribution looks good.

Dummy Variables

For categorical variables, a common practice is One-hot Encoding. For a categorical variable with n possible values, we create a group of n dummy variables. Suppose a record in the data takes one value for this variable, then the corresponding dummy variable is set to 1 while other dummies in the same group are all set to 0.

Like this, we transform DayOfWeek into 7 dummy variables.

Note that when the categorical variable can takes many values (hundreds or more), this might not work well. It’s difficult to find a general solution to that, but I’ll discuss one scenario in the next section.

Feature Engineering

Some describe the essence of Kaggle competitions as feature engineering supplemented by model tuning and ensemble learning. Yes, that makes a lot of sense. Feature engineering gets your very far. Yet it is how well you know about the domain of given data that decides how far you can go. For example, in a competition where data is mainly consisted of texts, common NLP features are a must. The approach of constructing useful features is something we all have to continuously learn in order to do better.

Basically, when you feel that a variable is intuitively useful for the task, you can include it as a feature. But how do you know it actually works? The simplest way is to check by plotting it against the target variable like this:

Feature Selection

Generally speaking, we should try to craft as many features as we can and have faith in the model’s ability to pick up the most significant features. Yet there’s still something to gain from feature selection beforehand:

Less features mean faster training
Some features are linearly related to others. This might put a strain on the model.
By picking up the most important features, we can use interactions between them as new features. Sometimes this gives surprising improvement.

The simplest way to inspect feature importance is by fitting a random forest model. There exist more robust feature selection algorithms (e.g. this) which are theoretically superior but not practicable due to the absence of efficient implementation. You can combat noisy data (to an extent) simply by increasing number of trees used in random forest.

This is important for competitions in which data is anonymized because you won’t waste time trying to figure out the meaning of a variable that’s of no significance.

Feature Encoding

Sometimes raw features have to be converted to some other formats for them to be work properly.

For example, suppose we have a categorical variable which can take more than 10K different values. Then naively creating dummy variables is not a feasible option. An acceptable solution is to create dummy variables for only a subset of the values (e.g. values that constitute 95% of the feature importance) and assign everything else to an ‘others’ class.

Model Selection

With the features set, we can start training models. Kaggle competitions usually favor tree-based models:

Gradient Boosted Trees
Random Forest
Extra Randomized Trees

These models are slightly worse in terms of performance, but are suitable as base models in ensemble learning (will be discussed later):

SVM
Linear Regression
Logistic Regression
Neural Networks

Of course, neural networks are very important in image-related competitions.

All these models can be accessed using Sklearn.

Here I want to emphasize the greatness of Xgboost. The outstanding performance of gradient boosted trees and Xgboost’s efficient implementation makes it very popular in Kaggle competitions. Nowadays almost every winner uses Xgboost in one way or another.

BTW, installing Xgboost on Windows could be a painstaking process. You can refer to this post by me if you run into problems.

Model Training

We can obtain a good model by tuning its parameters. A model usually have many parameters, but only a few of them are important to its performance. For example, the most important parameters for random forset is the number of trees in the forest and the maximum number of features used in developing each tree. We need to understand how models work and what impact does each of the parameters have to the model’s performance, be it accuracy, robustness or speed.

Normally we would find the best set of parameters by a process called grid search. Actually what it does is simply iterating through all the possible combinations and find the best one.

param_grid = {‘n_estimators’: [300, 500], ‘max_features’: [10, 12, 14]}
model = grid_search.GridSearchCV(
estimator=rfr, param_grid=param_grid, n_jobs=1, cv=10, verbose=20, scoring=RMSE
)
model.fit(X_train, y_train)

By the way, random forest usually reach optimum when max_features is set to the square root of the total number of features.

Here I’d like to stress some points about tuning XGB. These parameters are generally considered to have real impacts on its performance:

eta: Step size used in updating weights. Lower eta means slower training.
num_round: Total round of iterations.
subsample: The ratio of training data used in each iteration. This is to combat overfitting.
colsample_bytree: The ratio of features used in each iteration. This is like max_features of RandomForestClassifier.
max_depth: The maximum depth of each tree. Unlike random forest,gradient boosting would eventually overfit if we do not limit its depth.
early_stopping_rounds: Controls how many iterations that do not show a increase of score on validation set are needed for the algorithm to stop early. This is to combat overfitting, too.

Usual tuning steps:

Reserve a portion of training set as the validation set.
Set eta to a relatively high value (e.g. 0.1), num_round to 300 ~ 500.
Use grid search to find best combination of other parameters.
Gradually lower eta to find the optimum.
Use the validation set as watch_list to re-train the model with the best parameters. Observe how score changes on validation set in each iteration. Find the optimal value for early_stopping_rounds.

X_dtrain, X_deval, y_dtrain, y_deval = \
cross_validation.train_test_split(X_train, y_train, random_state=1026, test_size=0.3)
dtrain = xgb.DMatrix(X_dtrain, y_dtrain)
deval = xgb.DMatrix(X_deval, y_deval)
watchlist = [(deval, ‘eval’)]
params = {
‘booster’: ‘gbtree’,
‘objective’: ‘reg:linear’,
‘subsample’: 0.8,
‘colsample_bytree’: 0.85,
‘eta’: 0.05,
‘max_depth’: 7,
‘seed’: 2016,
‘silent’: 0,
‘eval_metric’: ‘rmse’
}
clf = xgb.train(params, dtrain, 500, watchlist, early_stopping_rounds=50)
pred = clf.predict(xgb.DMatrix(df_test))

Finally, note that models with randomness all have a parameter like seed or random_state to control the random seed. You must record this with all other parameters when you get a good model. Otherwise you wouldn’t be able to reproduce it.

Cross Validation

Cross validation is an essential step. It tells us whether our model is at high risk of overfitting. In many competitions, public LB scores are not very reliable. Often when we improve the model and get a better local CV score, the LB score becomes worse. It is widely believed that we should trust our CV scores under such situation. Ideally we would want CV scores obtained by different approaches to improve in sync with each other and with the LB score, but this is not always possible.

Usually 5-fold CV is good enough. If we use more folds, the CV score would become more reliable, but the training takes longer to finish as well.

How to do CV properly is not a trivial problem. It requires constant experiment and case-by-case discussion. Many Kagglers share their CV approaches (like this one) after competitions where it’s not easy to do reliable CV.

Ensemble Generation

Ensemble Learning refers to ways of combining different models. It reduces both bias and variance of the final model (you can find a proof here), thusincreasing the score and reducing the risk of overfitting. Recently it became virtually impossible to win prize without using ensemble in Kaggle competitions.

Common approaches of ensemble learning are:

Bagging: Use different random subsets of training data to train each base model. Then base models vote to generate the final predictions. This is how random forest works.
Boosting: Train base models iteratively, modify the weights of training samples according to the last iteration. This is how gradient boosted trees work. It performs better than bagging but is more prone to overfitting.
Blending: Use non-overlapping data to train different base models and take a weighted average of them to obtain the final predictions. This is easy to implement but uses less data.
Stacking: To be discussed next.

In theory, for the ensemble to perform well, two elements matter:

Base models should be as unrelated as possibly. This is why we tend to include non-tree-base models in the ensemble even though they don’t perform as well. The math says that the greater the diversity, and less bias in the final ensemble.
Performance of base models shouldn’t differ to much.

Actually we have a trade-off here. In practice we may end up with highly related models of comparable performances. Yet we ensemble them anyway because it usually increase performance even under this circumstance.

Stacking

Compared with blending, stacking makes better use of training data. Here’s a diagram of how it works:

(Taken from Faron. Many thanks!)

It’s much like cross validation. Take 5-fold stacking as an example. First we split the training data into 5 folds. Next we will do 5 iterations. In each iteration, train every base model on 4 folds and predict on the hold-out fold. You have to keep the predictions on the testing data as well. This way, in each iteration every base model will make predictions on 1 fold of the training data and all of the testing data. After 5 iterations we will obtain a matrix of shape #(rows in training data) X #(base models). This matrix is then fed to the stacker in the second level. After the stacker is fitted, use the predictions on testing data by base models (each base model is trained 5 times, therefore we have to take an average to obtain a matrix of the same shape) as the input for the stacker and obtain our final predictions.

Maybe it’s better to just show the codes:

class Ensemble(object):
def __init__(self, n_folds, stacker, base_models):
self.n_folds = n_folds
self.stacker = stacker
self.base_models = base_models
def fit_predict(self, X, y, T):
X = np.array(X)
y = np.array(y)
T = np.array(T)
folds = list(KFold(len(y), n_folds=self.n_folds, shuffle=True, random_state=2016))
S_train = np.zeros((X.shape[0], len(self.base_models)))
S_test = np.zeros((T.shape[0], len(self.base_models)))
for i, clf in enumerate(self.base_models):
S_test_i = np.zeros((T.shape[0], len(folds)))
for j, (train_idx, test_idx) in enumerate(folds):
X_train = X[train_idx]
y_train = y[train_idx]
X_holdout = X[test_idx]
# y_holdout = y[test_idx]
clf.fit(X_train, y_train)
y_pred = clf.predict(X_holdout)[:]
S_train[test_idx, i] = y_pred
S_test_i[:, j] = clf.predict(T)[:]
S_test[:, i] = S_test_i.mean(1)
self.stacker.fit(S_train, y)
y_pred = self.stacker.predict(S_test)[:]
return y_pred

Prize winners usually have larger and much more complicated ensembles. For beginner, implementing a correct 5-fold stacking is good enough.

*Pipeline

We can see that the workflow for a Kaggle competition is quite complex, especially for model selection and ensemble. Ideally, we need a highly automated pipeline capable of:

Modularized feature transform. We only need to write a few lines of codes and the new feature is added to the training set.
Automated grid search. We only need to set up models and parameter grid, the search will be run and best parameters are recorded.
Automated ensemble generation. Use best K models for ensemble as soon as last generation is done.

For beginners, the first one is not very important because the number of features is quite manageable; the third one is not important either because typically we only do several ensembles at the end of the competition. But the second one is good to have because manually recording the performance and parameters of each model is time-consuming and error-prone.

Chenglong Chen, the winner of Crowdflower Search Results Relevance, once released his pipeline on GitHub. It’s very complete and efficient. Yet it’s still very hard to understand and extract all his logic to build a general framework. This is something you might want to do when you have plenty of time.

Home Depot Search Relevance

In this section I will share my solution in Home Depot Search Relevance and what I learned from top teams after the competition.

The task in this competitions is to predict how relevant a result is for a search term on Home Depot website. The relevance score is an average of three human evaluators and ranges between 1 ~ 3. Therefore it’s a regression task. The datasets contains search terms, product titles / descriptions and some attributes like brand, size and color. The metric is RMSE.

This is much like Crowdflower Search Results Relevance. The difference is thatQuadratic Weighted Kappa is used in that competition and therefore complicated the final cutoff of regression scores. Also there were no attributes provided in that competition.

EDA

There were several quite good EDAs by the time I joined the competition, especially this one. I learned that:

Many search terms / products appeared several times.
Text similarities are great features.
Many products don’t have attributes features. Would this be a problem?
Product ID seems to have strong predictive power. However the overlap of product ID between the training set and the testing set is not very high. Would this contribute to overfitting?

Preprocessing

You can find how I did preprocessing and feature engineering on GitHub. I’ll only give a brief summary here:

Use typo dictionary posted in forum to correct typos in search terms.
Count attributes. Find those frequent and easily exploited ones.
Join the training set with the testing set. This is important because otherwise you’ll have to do feature transform twice.
Do stemming and tokenizing for all the text fields. Some normalization(with digits and units) and synonym substitutions are performed manually.

Feature

*Attribute Features
- Whether the product contains a certain attribute (brand, size, color, weight, indoor/outdoor, energy star certified …)
- Whether a certain attribute matches with search term

Meta Features
- Length of each text field
- Whether the product contains attribute fields
- Brand (encoded as integers)
- Product ID
Matching
- Whether search term appears in product title / description / attributes
- Count and ratio of search term’s appearance in product title / description / attributes
- *Whether the i-th word of search term appears in product title / description / attributes
Text similarities between search term and product title/description/attributes
- BOW Cosine Similairty
- TF-IDF Cosine Similarity
- Jaccard Similarity
- *Edit Distance
- Word2Vec Distance (I didn’t include this because of poor performance. Yet it seems that I was using it wrong.)
Latent Semantic Indexing: By performing SVD decomposition to the matrix obtained from BOW/TF-IDF Vectorization, we get the latent descriptions of different search term / product groups. This enables our model to distinguish between groups and assign different weights to features, therefore solving the issue of dependent data and products lacking some features (to an extent).

Note that features listed above with * are the last batch of features I added. The problem is that the model trained on data that included these features performed worse than the previous ones. At first I thought that the increase in number of features would require re-tuning of model parameters. However, after wasting much CPU time on grid search, I still could not beat the old model. I think it might be the issue of feature correlation mentioned above. I actually knew a solution that might work, which is to combine models trained on different version of features by stacking. Unfortunately I didn’t have enough time to try it. As a matter of fact, most of top teams regard the ensemble of models trained with different preprocessing and feature engineering pipelines as a key to success.

Model

At first I was using RandomForestRegressor to build my model. Then I triedXgboost and it turned out to be more than twice as fast as Sklearn. From that on what I do everyday is basically running grid search on my PC while working on features on my laptop.

Dataset in this competition is not trivial to validate. It’s not i.i.d. and many records are dependent. Many times I used better features / parameters only to end with worse LB scores. As repeatedly stated by many accomplished Kagglers, you have to trust your own CV score under such situation. Therefore I decided to use 10-fold instead of 5-fold in cross validation and ignore the LB score in the following attempts.

Ensemble

My final model is an ensemble consisting of 4 base models:

RandomForestRegressor
ExtraTreesRegressor
GradientBoostingRegressor
XGBRegressor

The stacker (L2 model) is also a XGBRegressor.

The problem is that all my base models are highly correlated (with a lowest correlation of 0.9). I thought of including linear regression, SVM regression and XGBRegressor with linear booster into the ensemble, but these models had RMSE scores that are 0.02 higher (this accounts for a gap of hundreds of places on the leaderboard) than the 4 models I finally used. Therefore I decided not to use more models although they would have brought much more diversity.

The good news is that, despite base models being highly correlated, stacking really bumps up my score. What’s more, my CV score and LB score are in complete sync after I started stacking.

During the last two days of the competition, I did one more thing: use 20 or so different random seeds to generate the ensemble and take a weighted average of them as the final submission. This is actually a kind of bagging. It makes sense in theory because in stacking I used 80% of the data to train base models in each iteration, whereas 100% of the data is used to train the stacker. Therefore it’s less clean. Making multiple runs with different seeds makes sure that different 80% of the data are used each time, thus reducing the risk of information leak. Yet by doing this I only achieved an increase of 0.0004, which might be just due to randomness.

After the competition, I found out that my best single model scores 0.46378 on the private leaderboard, whereas my best stacking ensemble scores 0.45849. That was the difference between the 174th place and the 98th place. In other words, feature engineering and model tuning got me into 10%, whereas stacking got me into 5%.

Lessons Learned

There’s much to learn from the solutions shared by top teams:

There’s a pattern in the product title. For example, whether a product is accompanied by a certain accessory will be indicated by With/Without XXXat the end of the title.
Use external data. For example use WordNet or Reddit Comments Dataset to train synonyms and hypernyms.
Some features based on letters instead of words. At first I was rather confused by this. But it makes perfect sense if you consider it. For example, the team that won the 3rd place took the number of letters matched into consideration when computing text similarity. They argued that longer words are more specific and thus more likely to be assigned high relevance scores by human. They also used char-by-char sequence comparison (difflib.SequenceMatcher) to measure visual similarity, which they claimed to be important for human.
POS-tag words and find anchor words. Use anchor words for computing various distances.
Extract top-ranking trigrams from the TF-IDF of product title / description field and compute the ratio of word from search terms that appear in these trigrams. Vice versa. This is like computing latent indexes from another point of view.
Some novel distance metrics like Word Movers Distance
Apart from SVD, some used NMF.
Generate pairwise polynomial interactions between top-ranking features.
For CV, construct splits in which product IDs do not overlap between training set and testing set, and splits in which IDs do. Then we can use these with corresponding ratio to approximate the impact of public/private LB split in our local CV.

Summary

Takeaways

It was a good call to start doing ensembles early in the competition. As it turned out, I was still playing with features during the very last days.
It’s of high priority that I build a pipeline capable of automatic model training and recording best parameters.
Features matter the most! I didn’t spend enough time on features in this competition.
If possible, spend some time to manually inspect raw data for patterns.

Issues Raised

Several issues I encountered in this competitions are of high research values.

How to do reliable CV with dependent data.
How to quantify the trade-off between diversity and accuracy in ensemble learning.
How to deal with feature interaction which harms the model’s performance. And how to determine whether new features are effective in such situation.

Beginner Tips

Choose a competition you’re interested in. It would be better if you’ve already have some insights about the problem domain.
Following my approach or somebody else’s, start exploring, understanding and modeling data.
Learn from forum and scripts. See how other interpret data and construct features.
Find winner interviews / blog post of previous competitions. They’re very helpful.
Start doing ensemble after you have reached a pretty good score (e.g. ~ 10%) or you feel that there isn’t much room for new features (which, sadly, always turns out to be false).
If you think you may have a chance to win the prize, try teaming up!
Don’t give up until the end of the competition. At least try something new every day.
Learn from the sharings of top teams after the competition. Reflect on your approaches. If possible, spend some time verifying what you learn.
Get some rest!

Reference

数据挖掘与机器学习

最大似然估计（Maximal Likelihood Estimation）

October 18, 2016 zr9558 Leave a comment

（一）基本思想

给定一个概率分布 $D$ ，假设其概率密度函数是 $f_{D}$ ，它与一个未知参数 $\theta$ 相关。我们可以从这个分布中抽取 $n$ 样本 $x_{1},x_{2},...,x_{n}$ ，我们就可以得到这个概率是

$P(x_{1},...,x_{n}) = f_{D}(x_{1},...,x_{n}|\theta)$ .

但是，在这里我们并不知道参数 $\theta$ 的值。如何估计参数 $\theta$ 的取值就成为了关键之处。一个简单的想法就是从这个分布中随机抽取样本 $x_{1},...,x_{n}$ ，然后利用这些数据来估算 $\theta$ 的值。

最大似然估计 (maximal likelihood estimator) 算法会计算参数 $\theta$ 的最可能的值，也就是说参数的选择会使得这个采样的概率最大化。

用数学的语言来说，首先我们需要定义似然函数：

$L(\theta) = f_{D}(x_{1},...,x_{n}|\theta)$ ,

并且在 $\theta$ 的所有取值上，使得这个函数的取值最大化。换言之，也就是函数 $L(\theta)$ 的一阶导数等于零。这个使得 $L(\theta)$ 最大化的参数 $\hat{\theta}$ 称为 $\theta$ 的最大似然估计。

Remark. 最大似然函数不一定是唯一的，甚至不一定是存在的。

（二）基本算法

求解最大似然函数估计值的一般步骤：

（1）定义似然函数；

（2）对似然函数求导数，或者说对似然函数的对数求导数，目的都是为了更加方便地计算一阶导数；

（3）令一阶导数等于零，得到关于参数 $\theta$ 的似然方程；

（4）求解似然方程，得到的参数就是最大似然估计。在求解的过程中，如果不能够直接求解方程的话，可以采取牛顿法来近似求解。

（三）例子

（i）Bernoulli 分布（Bernoulli Distribution）

假设我们有 $n$ 个随机样本 $x_{1},...,x_{n}$ . 如果第 $i$ 个学生没有自行车，那么 $x_{i}=0$ ; 否则 $x_{i}=1$ . 并且假设 $x_{i}$ 是满足未知参数 $p$ 的 Bernoulli 分布的。我们此时的目标是计算最大似然估计 $p$ ，也就是全体学生中拥有自行车的比例。

如果 $\{x_{i}:1\leq i\leq n\}$ 是相互独立的 Bernoulli 随机变量，那么对每一个 $x_{i}$ 而言，它的概率函数则是：

$f(x_{i};p)=p^{x_{i}}(1-p)^{1-x_{i}}, \text{ for } x_{i}=0 \text{ or } 1 \text{ and } 0<p<1$ .

因此，似然函数 $L(p)$ 可以定义为：

$L(p)=\prod_{i=1}^{n}f(x_{i};p)=p^{\sum_{i=1}^{n}x_{i}}(1-p)^{n-\sum_{i=1}^{n}x_{i}}$ .

为了计算参数 $p$ 的值，可以对 $ln(L(p))$ 求导：

$\ln(L(p))=(\sum_{i=1}^{n}x_{i})\ln(p) + (n-\sum_{i=1}^{n}x_{i})\ln(1-p)$

$\frac{\partial\ln(L(p))}{\partial p} = \frac{\sum_{i=1}^{n}x_{i}}{p}-\frac{n-\sum_{i=1}^{n}x_{i}}{1-p}$

令 $\frac{\partial\ln(L(p))}{\partial p} = 0$ ，可以得到

$p=\sum_{i=1}^{n}x_{i}/n$ .

也就是说最大似然估计是

$\hat{p}=\sum_{i=1}^{n}x_{i}/n$ .

（ii）Gaussian Distribution

假设 $x_{1},...,x_{n}$ 满足正态分布，并且该正态分布的参数 $\mu$ 和 $\sigma^{2}$ 都是未知的。目标是寻找均值 $\mu$ 和方差 $\sigma^{2}$ 的最大似然估计。

如果 $\{x_{1},...,x_{n}\}$ 是满足正态分布的，那么对于每一个变量 $x_{i}$ 的概率密度函数就是：

$f(x_{i};\mu,\sigma^{2}) = \frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{(x_{i}-\theta)^{2}}{2\sigma^{2}})$ .

似然函数就是：

$L(\mu,\sigma) = \prod_{i=1}^{n} f(x_{i};\mu,\sigma^{2}) = (2\pi)^{-n/2}\sigma^{-n}\exp(-\frac{\sum_{i=1}^{n}(x_{i}-\mu)^{2}}{2\sigma^{2}})$

$\ln(L(\mu,\sigma)) = -\frac{n}{2}\ln(2\pi) - n\ln(\sigma) - \frac{\sum_{i=1}^{n}(x_{i}-\mu)^{2}}{2\sigma^{2}}$

$\frac{\partial \ln(L(\mu,\sigma))}{\partial \mu} = \frac{\sum_{i=1}^{n}(x_{i}-\mu)}{\sigma^{2}}$

$\frac{\partial \ln(L(\mu,\sigma))}{\partial \sigma} = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n}(x_{i}-\mu)^{2}}{\sigma^{3}}$

令 $\frac{\partial \ln(L(\mu,\sigma))}{\partial \mu} =0$ 和 $\frac{\partial \ln(L(\mu,\sigma))}{\partial \sigma}=0$ ，可以求解方程组得到：

$\hat{\mu}= \sum_{i=1}^{n}x_{i}/n$ ,

$\hat{\sigma}^{2} = \sum_{i=1}^{n}(x_{i}-\mu)^{2}/n$ .

(iii) Weibull 分布（Weibull Distribution）

首先，我们回顾一下 Weibull 分布的定义。Weibull 分布（Weibull Distribution）是连续型的概率分布，其概率密度函数是：

$f(x;\lambda,k) = \frac{k}{\lambda}(\frac{x}{\lambda})^{k-1}\exp^{-(x/\lambda)^{k}} \text{ for } x\geq 0, f(x;\lambda,k)=0 \text{ for } x<0.$

其中， $x$ 是随机变量， $\lambda>0$ 是 scale parameter， $k>0$ 是 shape parameter。特别地，当 $k=1$ 时，Weibull 分布就是指数分布；当 $k=2$ 时，Weibull 分布就是 Rayleigh 分布。

Weibull 分布的累积分布函数是

$F(x;k,\lambda) = 1- \exp^{-(x/\lambda)^{k}} \text{ for } x\geq 0$ ,

$F(x;k,\lambda) = 0 \text{ for } x<0$ .

Weibull 分布的分位函数（quantile function, inverse cumulative distribution）是

$Q(p;k,\lambda) = \lambda(-\ln(1-p))^{1/k} \text{ for } 0\leq p <1$ .

其次，我们来计算最大似然估计。

假设 $\{x_{1},...,x_{n}\}$ 满足 Weibull 分布，其未知参数是 $k,\lambda.$ 那么对于每一个 $x_{i}$ 而言，概率密度函数是：

$p(x_{i};k,\lambda) = \frac{k}{\lambda}(\frac{x_{i}}{\lambda})^{k-1}\exp(-(\frac{x_{i}}{\lambda})^{k})$ .

定义似然函数为：

$L(k,\lambda) = \prod_{i=1}^{n}p(x_{i};k,\lambda)$

取对数之后得到：

$\ln(L(k,\lambda)) = n\ln(k) - nk\ln(\lambda) + (k-1)\sum_{i=1}^{n}\ln(x_{i}) - \sum_{i=1}^{n}x_{i}^{k}/\lambda^{k}.$

计算一阶偏导数得到：

$\frac{\partial \ln(L(k,\lambda))}{\partial \lambda} = - \frac{nk}{\lambda} + \frac{k\sum_{i=1}^{n}x_{i}^{k}}{\lambda^{k+1}},$

$\frac{\partial \ln(L(k,\lambda))}{\partial k} = \frac{n}{k} - n\ln(\lambda) + \sum_{i=1}^{n}\ln(x_{i}) -\sum_{i=1}^{n}(\frac{x_{i}}{\lambda})^{k}\ln(\frac{x_{i}}{\lambda}).$

可以计算得出：

$\lambda^{k}=\frac{\sum_{i=1}^{n}x_{i}^{k}}{n},$

$\frac{1}{k} = \frac{\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i})}{\sum_{i=1}^{n}x_{i}^{k}} -\frac{\sum_{i=1}^{n}\ln(x_{i})}{n}.$

其中第一个式子可以计算出 $\lambda$ 的最大似然估计。第二个式子是关于 $k$ 的隐函数，不能够直接求解，需要使用 Newton’s method 来计算。

令

$f(k) = \frac{\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i})}{\sum_{i=1}^{n}x_{i}^{k}} - \frac{\sum_{i=1}^{n}\ln(x_{i})}{n} - \frac{1}{k}.$

求导得到：

$f^{'}(k)= \frac{\sum_{i=1}^{n}x_{i}^{k}(\ln(x_{i}))^{2}}{\sum_{i=1}^{n}x_{i}^{k}}-(\frac{\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i})}{\sum_{i=1}^{n}x_{i}^{k}})^{2} + \frac{1}{k^{2}}.$

根据 Cauchy’s Inequality, 可以得到：

$(\sum_{i=1}^{n}x_{i}^{k}(\ln(x_{i}))^{2})\cdot(\sum_{i=1}^{n}x_{i}^{k})\geq (\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i}))^{2}.$

所以， $f^{'}(k)>0 \text{ for all } k>0$ ，换言之， $f(k)$ 是关于 $k$ 的递增函数，并且

$\lim_{k\rightarrow 0^{+}}f(k) = -\infty,$

$\lim_{k\rightarrow +\infty}f(k) > 0 \text{ if } \forall x_{i}>1.$

那么对于递增函数 $f(k)$ 而言，就必定有一个零点。因此使用 Newton’s Iteration 的时候，初始点可以从靠近零的整数开始，比如 $k_{0}=0.0001$ 。如果从一个比较大的数开始的时候，可能使用 Newton 法的时候，会与负轴相交。但是如果从一个较小的数开始，就必定只与正数轴相交。其中 Newton 法的公式是：

$k_{0}= 0.0001,$

$k_{n+1} = k_{n}- \frac{f(k_{n})}{f^{'}(k_{n})} \text{ for all } n\geq 0.$

当 $n$ 的次数足够大的时候， $k_{n}$ 就可以被当作最大似然估计。

数据挖掘与机器学习

How machine learning can help the security industry

October 11, 2016 zr9558 Leave a comment

Machine learning (ML) is such a hot area in security right now.

At the 2016 RSA Conference, you would be hard pressed to find a company that is not claiming to use ML for security. And why not? To the layperson, ML seems like the magic solution to all security problems. Take a bunch of unlabeled data, pump it through a system with some ML magic inside, and it can somehow identify patterns even human experts can’t find — all while learning and adapting to new behaviors and threats. Rather than having to code the rules, these systems can discover the rules all by themselves.

Oh, if only that were the case! ML is this year’s “big data”: Everyone is claiming to do it, but few actually do it right or even understand what it’s good for. Especially in security, I’ve seen more misapplications than appropriate ones.

Most applications of ML in security use a form of anomaly detection, which is used to spot events that do not match an expected pattern. Anomaly detection is a useful technique in certain circumstances, but too often, vendors misapply it. For example, they will claim to analyze network traffic in an enterprise and use ML to find hackers in your network. This does not work, and you should be immediately skeptical of the vendors who make this claim.

Effective machine learning requires a low dimensionality problem with high-quality labeled data. Unfortunately, deployments in real enterprises have neither. Detecting novel attacks requires either clear, labeled examples of attacks, which you do not have by definition, or a complete, exhaustive understanding of “normal” network behavior, which is impossible for any real network. And any sophisticated attacker will make an attack appear as seamless and “typical” as possible, to avoid setting off alarms.

Where does ML work?

One example where ML and anomaly detection can actually work well for security is in classifying human behavior. Humans, it turns out, are fairly predictable, and it is possible to build fairly accurate models of individual user behavior and detect when it doesn’t match their normal behavior.

We’ve had success in using ML for implicit authentication via analyzing a user’s biometrics, behavior, and environment. Implicit authentication is a technique that allows users to authenticate without performing any explicit actions like entering a password or swiping a fingerprint. This has clear benefits to both the user experience as well as for security. Users don’t need to be bothered with extra steps, we can use many authentication factors (rather than just one, a password), and it can happen continuously in the background.

Implicit authentication is well-suited to ML because most of the factors are low dimensional, meaning they involve a small number of parameters, and you can passively gather high-quality labeled data about user identities. Much like ML is effective in matching images for computer vision even in the presence of variance and noise, it is also effective in matching unique human behavioral aspects.

One example of this technology is how we can authenticate users based on unique aspects to the way they move. Attributes of the way you walk, sit, and stand are influenced by a large number of factors (including physiology, age, gender, and muscle memory), but are largely consistent for an individual. It is actually possible to accurately detect some of these attributes from the motion sensors in your phone in your pocket. In fact, after four seconds of motion data from a phone in your pocket, we can detect enough of these attributes to identify you. Another example is in using a user’s location history to authenticate them. Humans are creatures of habit, and by looking at where they came from and when, we can make an estimate of whether it’s them.

There are enough sensors in phones and computers (and more recently, wearables and IoT devices) that it is possible to passively pick up a large number of unique attributes about a user’s behavior and environment. We can then use ML to build a unique model for an individual user and find correlations between factors.

Threat models and anomaly detection

In any security system, it is important to understand the threat models you are trying to protect against. When using ML for security, you need to explicitly gather data, model the threats your system is protecting against, and use the model to train your system. Fortunately, for attacks against authentication, it is often possible to detect behavioral changes. For example, when a device is stolen, there are often clear changes in terms of its movement, location, and usage. And because false negatives are acceptable in that they just require the user to re-authenticate with a different method, we can tune the system to minimize false positives. In fact, once we combine four factors across multiple devices, we can get below a 0.001 percent false positive rate on implicit authentication.

There is no magic machine learning genie that can solve all your security problems. Building an effective security product that uses ML requires a deep understanding of the underlying system, and many security problems are just not appropriate for ML. For those that are, it’s a very powerful technique. And don’t worry, the companies on the hype train will soon move on to newer fads, like mobile self-driving AR blockchain drone marketplaces.

数据挖掘与机器学习

异常点检测算法综述

September 28, 2016 zr9558 Leave a comment

异常点检测（又称为离群点检测）是找出其行为很不同于预期对象的一个检测过程。这些对象被称为异常点或者离群点。异常点检测在很多实际的生产生活中都有着具体的应用，比如信用卡欺诈，工业损毁检测，图像检测等。

本文主要介绍一些常见的异常点检测算法，包括基于统计的模型，基于距离的模型，线性变换的模型，非线性变换的模型等。

异常点检测和聚类分析是两项高度相关的人物。聚类分析发现数据集中的各种模式，而异常点检测则是试图捕捉那些显著偏离多数模式的异常情况。异常点检测和聚类模型服务于不同的目的。

This slideshow requires JavaScript.

数据挖掘与机器学习

Complete Guide to Parameter Tuning in XGBoost (with codes in Python)

September 27, 2016 zr9558 Leave a comment

Introduction

If things don’t go your way in predictive modeling, use XGboost. XGBoost algorithm has become the ultimate weapon of many data scientist. It’s a highly sophisticated algorithm, powerful enough to deal with all sorts of irregularities of data.

Building a model using XGBoost is easy. But, improving the model using XGBoost is difficult (at least I struggled a lot). This algorithm uses multiple parameters. To improve the model, parameter tuning is must. It is very difficult to get answers to practical questions like – Which set of parameters you should tune ? What is the ideal value of these parameters to obtain optimal output ?

This article is best suited to people who are new to XGBoost. In this article, we’ll learn the art of parameter tuning along with some useful information about XGBoost. Also, we’ll practice this algorithm using a data set in Python.

Complete Guide to Parameter Tuning in XGBoost (with codes in Python)

What should you know ?

XGBoost (eXtreme Gradient Boosting) is an advanced implementation of gradient boosting algorithm. Since I covered Gradient Boosting Machine in detail in my previous article – Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python, I highly recommend going through that before reading further. It will help you bolster your understanding of boosting in general and parameter tuning for GBM.

Special Thanks: Personally, I would like to acknowledge the timeless support provided by Mr. Sudalai Rajkumar (aka SRK), currently AV Rank 2. This article wouldn’t be possible without his help. He is helping us guide thousands of data scientists. A big thanks to SRK!

The XGBoost Advantage
Understanding XGBoost Parameters
Tuning Parameters (with Example)

1. The XGBoost Advantage

I’ve always admired the boosting capabilities that this algorithm infuses in a predictive model. When I explored more about its performance and science behind its high accuracy, I discovered many advantages:

Regularization:
- Standard GBM implementation has no regularization like XGBoost, therefore it also helps to reduce overfitting.
- In fact, XGBoost is also known as ‘regularized boosting‘ technique.
Parallel Processing:
- XGBoost implements parallel processing and is blazingly faster as compared to GBM.
- But hang on, we know that boosting is sequential process so how can it be parallelized? We know that each tree can be built only after the previous one, so what stops us from making a tree using all cores? I hope you get where I’m coming from. Check this link out to explore further.
- XGBoost also supports implementation on Hadoop.
High Flexibility
- XGBoost allow users to define custom optimization objectives and evaluation criteria.
- This adds a whole new dimension to the model and there is no limit to what we can do.
Handling Missing Values
- XGBoost has an in-built routine to handle missing values.
- User is required to supply a different value than other observations and pass that as a parameter. XGBoost tries different things as it encounters a missing value on each node and learns which path to take for missing values in future.
Tree Pruning:
- A GBM would stop splitting a node when it encounters a negative loss in the split. Thus it is more of a greedy algorithm.
- XGBoost on the other hand make splits upto the max_depth specified and then start pruningthe tree backwards and remove splits beyond which there is no positive gain.
- Another advantage is that sometimes a split of negative loss say -2 may be followed by a split of positive loss +10. GBM would stop as it encounters -2. But XGBoost will go deeper and it will see a combined effect of +8 of the split and keep both.
Built-in Cross-Validation
- XGBoost allows user to run a cross-validation at each iteration of the boosting process and thus it is easy to get the exact optimum number of boosting iterations in a single run.
- This is unlike GBM where we have to run a grid-search and only a limited values can be tested.
Continue on Existing Model
- User can start training an XGBoost model from its last iteration of previous run. This can be of significant advantage in certain specific applications.
- GBM implementation of sklearn also has this feature so they are even on this point.

I hope now you understand the sheer power XGBoost algorithm. Note that these are the points which I could muster. You know a few more? Feel free to drop a comment below and I will update the list.

Did I whet your appetite ? Good. You can refer to following web-pages for a deeper understanding:

2. XGBoost Parameters

The overall parameters have been divided into 3 categories by XGBoost authors:

General Parameters: Guide the overall functioning
Booster Parameters: Guide the individual booster (tree/regression) at each step
Learning Task Parameters: Guide the optimization performed

I will give analogies to GBM here and highly recommend to read this article to learn from the very basics.

General Parameters

These define the overall functionality of XGBoost.

booster [default=gbtree]
- Select the type of model to run at each iteration. It has 2 options:
  - gbtree: tree-based models
  - gblinear: linear models
silent [default=0]:
- Silent mode is activated is set to 1, i.e. no running messages will be printed.
- It’s generally good to keep it 0 as the messages might help in understanding the model.
nthread [default to maximum number of threads available if not set]
- This is used for parallel processing and number of cores in the system should be entered
- If you wish to run on all cores, value should not be entered and algorithm will detect automatically

There are 2 more parameters which are set automatically by XGBoost and you need not worry about them. Lets move on to Booster parameters.

Booster Parameters

Though there are 2 types of boosters, I’ll consider only tree booster here because it always outperforms the linear booster and thus the later is rarely used.

eta [default=0.3]
- Analogous to learning rate in GBM
- Makes the model more robust by shrinking the weights on each step
- Typical final values to be used: 0.01-0.2
min_child_weight [default=1]
- Defines the minimum sum of weights of all observations required in a child.
- This is similar to min_child_leaf in GBM but not exactly. This refers to min “sum of weights” of observations while GBM has min “number of observations”.
- Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
- Too high values can lead to under-fitting hence, it should be tuned using CV.
max_depth [default=6]
- The maximum depth of a tree, same as GBM.
- Used to control over-fitting as higher depth will allow model to learn relations very specific to a particular sample.
- Should be tuned using CV.
- Typical values: 3-10
max_leaf_nodes
- The maximum number of terminal nodes or leaves in a tree.
- Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
- If this is defined, GBM will ignore max_depth.
gamma [default=0]
- A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
- Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.
max_delta_step [default=0]
- In maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help making the update step more conservative.
- Usually this parameter is not needed, but it might help in logistic regression when class is extremely imbalanced.
- This is generally not used but you can explore further if you wish.
subsample [default=1]
- Same as the subsample of GBM. Denotes the fraction of observations to be randomly samples for each tree.
- Lower values make the algorithm more conservative and prevents overfitting but too small values might lead to under-fitting.
- Typical values: 0.5-1
colsample_bytree [default=1]
- Similar to max_features in GBM. Denotes the fraction of columns to be randomly samples for each tree.
- Typical values: 0.5-1
colsample_bylevel [default=1]
- Denotes the subsample ratio of columns for each split, in each level.
- I don’t use this often because subsample and colsample_bytree will do the job for you. but you can explore further if you feel so.
lambda [default=1]
- L2 regularization term on weights (analogous to Ridge regression)
- This used to handle the regularization part of XGBoost. Though many data scientists don’t use it often, it should be explored to reduce overfitting.
alpha [default=0]
- L1 regularization term on weight (analogous to Lasso regression)
- Can be used in case of very high dimensionality so that the algorithm runs faster when implemented
scale_pos_weight [default=1]
- A value greater than 0 should be used in case of high class imbalance as it helps in faster convergence.

Learning Task Parameters

These parameters are used to define the optimization objective the metric to be calculated at each step.

objective [default=reg:linear]
- This defines the loss function to be minimized. Mostly used values are:
  - binary:logistic –logistic regression for binary classification, returns predicted probability (not class)
  - multi:softmax –multiclass classification using the softmax objective, returns predicted class (not probabilities)
    - you also need to set an additional num_class (number of classes) parameter defining the number of unique classes
  - multi:softprob –same as softmax, but returns predicted probability of each data point belonging to each class.
eval_metric [ default according to objective ]
- The metric to be used for validation data.
- The default values are rmse for regression and error for classification.
- Typical values are:
  - rmse – root mean square error
  - mae – mean absolute error
  - logloss – negative log-likelihood
  - error – Binary classification error rate (0.5 threshold)
  - merror – Multiclass classification error rate
  - mlogloss – Multiclass logloss
  - auc: Area under the curve
seed [default=0]
- The random number seed.
- Can be used for generating reproducible results and also for parameter tuning.

If you’ve been using Scikit-Learn till now, these parameter names might not look familiar. A good news is that xgboost module in python has an sklearn wrapper called XGBClassifier. It uses sklearn style naming convention. The parameters names which will change are:

eta –> learning_rate
lambda –> reg_lambda
alpha –> reg_alpha

You must be wondering that we have defined everything except something similar to the “n_estimators” parameter in GBM. Well this exists as a parameter in XGBClassifier. However, it has to be passed as “num_boosting_rounds” while calling the fit function in the standard xgboost implementation.

I recommend you to go through the following parts of xgboost guide to better understand the parameters and codes:

3. Parameter Tuning with Example

We will take the data set from Data Hackathon 3.x AV hackathon, same as that taken in the GBM article. The details of the problem can be found on the competition page. You can download the data set from here. I have performed the following steps:

City variable dropped because of too many categories
DOB converted to Age | DOB dropped
EMI_Loan_Submitted_Missing created which is 1 if EMI_Loan_Submitted was missing else 0 | Original variable EMI_Loan_Submitted dropped
EmployerName dropped because of too many categories
Existing_EMI imputed with 0 (median) since only 111 values were missing
Interest_Rate_Missing created which is 1 if Interest_Rate was missing else 0 | Original variable Interest_Rate dropped
Lead_Creation_Date dropped because made little intuitive impact on outcome
Loan_Amount_Applied, Loan_Tenure_Applied imputed with median values
Loan_Amount_Submitted_Missing created which is 1 if Loan_Amount_Submitted was missing else 0 | Original variable Loan_Amount_Submitted dropped
Loan_Tenure_Submitted_Missing created which is 1 if Loan_Tenure_Submitted was missing else 0 | Original variable Loan_Tenure_Submitted dropped
LoggedIn, Salary_Account dropped
Processing_Fee_Missing created which is 1 if Processing_Fee was missing else 0 | Original variable Processing_Fee dropped
Source – top 2 kept as is and all others combined into different category
Numerical and One-Hot-Coding performed

For those who have the original data from competition, you can check out these steps from the data_preparation iPython notebook in the repository.

Lets start by importing the required libraries and loading the data:

#Import libraries:
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import cross_validation, metrics   #Additional scklearn functions
from sklearn.grid_search import GridSearchCV   #Perforing grid search

import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4

train = pd.read_csv('train_modified.csv')
target = 'Disbursed'
IDcol = 'ID'

Note that I have imported 2 forms of XGBoost:

xgb – this is the direct xgboost library. I will use a specific function “cv” from this library
XGBClassifier – this is an sklearn wrapper for XGBoost. This allows us to use sklearn’s Grid Search with parallel processing in the same way we did for GBM

Before proceeding further, lets define a function which will help us create XGBoost models and perform cross-validation. The best part is that you can take this function as it is and use it later for your own models.

def modelfit(alg, dtrain, predictors,useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='auc', early_stopping_rounds=early_stopping_rounds, show_progress=False)
        alg.set_params(n_estimators=cvresult.shape[0])
    
    #Fit the algorithm on the data
    alg.fit(dtrain[predictors], dtrain['Disbursed'],eval_metric='auc')
        
    #Predict training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]
        
    #Print model report:
    print "\nModel Report"
    print "Accuracy : %.4g" % metrics.accuracy_score(dtrain['Disbursed'].values, dtrain_predictions)
    print "AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['Disbursed'], dtrain_predprob)
                    
    feat_imp = pd.Series(alg.booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')

This code is slightly different from what I used for GBM. The focus of this article is to cover the concepts and not coding. Please feel free to drop a note in the comments if you find any challenges in understanding any part of it. Note that xgboost’s sklearn wrapper doesn’t have a “feature_importances” metric but a get_fscore() function which does the same job.

General Approach for Parameter Tuning

We will use an approach similar to that of GBM here. The various steps to be performed are:

Choose a relatively high learning rate. Generally a learning rate of 0.1 works but somewhere between 0.05 to 0.3 should work for different problems. Determine the optimum number of trees for this learning rate. XGBoost has a very useful function called as “cv” which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required.
Tune tree-specific parameters ( max_depth, min_child_weight, gamma, subsample, colsample_bytree) for decided learning rate and number of trees. Note that we can choose different parameters to define a tree and I’ll take up an example here.
Tune regularization parameters (lambda, alpha) for xgboost which can help reduce model complexity and enhance performance.
Lower the learning rate and decide the optimal parameters .

Let us look at a more detailed step by step approach.

Step 1: Fix learning rate and number of estimators for tuning tree-based parameters

In order to decide on boosting parameters, we need to set some initial values of other parameters. Lets take the following values:

max_depth = 5 : This should be between 3-10. I’ve started with 5 but you can choose a different number as well. 4-6 can be good starting points.
min_child_weight = 1 : A smaller value is chosen because it is a highly imbalanced class problem and leaf nodes can have smaller size groups.
gamma = 0 : A smaller value like 0.1-0.2 can also be chosen for starting. This will anyways be tuned later.
subsample, colsample_bytree = 0.8 : This is a commonly used used start value. Typical values range between 0.5-0.9.
scale_pos_weight = 1: Because of high class imbalance.

Please note that all the above are just initial estimates and will be tuned later. Lets take the default learning rate of 0.1 here and check the optimum number of trees using cv function of xgboost. The function defined above will do it for us.

#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
xgb1 = XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
modelfit(xgb1, train, predictors)

As you can see that here we got 140 as the optimal estimators for 0.1 learning rate. Note that this value might be too high for you depending on the power of your system. In that case you can increase the learning rate and re-run the command to get the reduced number of estimators.

Note: You will see the test AUC as “AUC Score (Test)” in the outputs here. But this would not appear if you try to run the command on your system as the data is not made public. It’s provided here just for reference. The part of the code which generates this output has been removed here.

Step 2: Tune max_depth and min_child_weight

We tune these first as they will have the highest impact on model outcome. To start with, let’s set wider ranges and then we will perform another iteration for smaller ranges.

Important Note: I’ll be doing some heavy-duty grid searched in this section which can take 15-30 mins or even more time to run depending on your system. You can vary the number of values you are testing based on what your system can handle.

param_test1 = {
 'max_depth':range(3,10,2),
 'min_child_weight':range(1,6,2)
}
gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch1.fit(train[predictors],train[target])
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

Here, we have run 12 combinations with wider intervals between values. The ideal values are 5 for max_depth and 5 for min_child_weight. Lets go one step deeper and look for optimum values. We’ll search for values 1 above and below the optimum values because we took an interval of two.

param_test2 = {
 'max_depth':[4,5,6],
 'min_child_weight':[4,5,6]
}
gsearch2 = GridSearchCV(estimator = XGBClassifier( learning_rate=0.1, n_estimators=140, max_depth=5,
 min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test2, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch2.fit(train[predictors],train[target])
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_

Here, we get the optimum values as 4 for max_depth and 6 for min_child_weight. Also, we can see the CV score increasing slightly. Note that as the model performance increases, it becomes exponentially difficult to achieve even marginal gains in performance. You would have noticed that here we got 6 as optimum value for min_child_weight but we haven’t tried values more than 6. We can do that as follow:.

param_test2b = {
 'min_child_weight':[6,8,10,12]
}
gsearch2b = GridSearchCV(estimator = XGBClassifier( learning_rate=0.1, n_estimators=140, max_depth=4,
 min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test2b, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch2b.fit(train[predictors],train[target])

modelfit(gsearch3.best_estimator_, train, predictors)
gsearch2b.grid_scores_, gsearch2b.best_params_, gsearch2b.best_score_

We see 6 as the optimal value.

Step 3: Tune gamma

Now lets tune gamma value using the parameters already tuned above. Gamma can take various values but I’ll check for 5 values here. You can go into more precise values as.

param_test3 = {
 'gamma':[i/10.0 for i in range(0,5)]
}
gsearch3 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=4,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test3, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch3.fit(train[predictors],train[target])
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_

This shows that our original value of gamma, i.e. 0 is the optimum one. Before proceeding, a good idea would be to re-calibrate the number of boosting rounds for the updated parameters.

xgb2 = XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=4,
 min_child_weight=6,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
modelfit(xgb2, train, predictors)

Here, we can see the improvement in score. So the final parameters are:

max_depth: 4
min_child_weight: 6
gamma: 0

Step 4: Tune subsample and colsample_bytree

The next step would be try different subsample and colsample_bytree values. Lets do this in 2 stages as well and take values 0.6,0.7,0.8,0.9 for both to start with.

param_test4 = {
 'subsample':[i/10.0 for i in range(6,10)],
 'colsample_bytree':[i/10.0 for i in range(6,10)]
}
gsearch4 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test4, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch4.fit(train[predictors],train[target])
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_

Here, we found 0.8 as the optimum value for both subsample and colsample_bytree. Now we should try values in 0.05 interval around these.

param_test5 = {
 'subsample':[i/100.0 for i in range(75,90,5)],
 'colsample_bytree':[i/100.0 for i in range(75,90,5)]
}
gsearch5 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test5, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch5.fit(train[predictors],train[target])

Again we got the same values as before. Thus the optimum values are:

subsample: 0.8
colsample_bytree: 0.8

Step 5: Tuning Regularization Parameters

Next step is to apply regularization to reduce overfitting. Though many people don’t use this parameters much as gamma provides a substantial way of controlling complexity. But we should always try it. I’ll tune ‘reg_alpha’ value here and leave it upto you to try different values of ‘reg_lambda’.

param_test6 = {
 'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
 min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test6, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch6.fit(train[predictors],train[target])
gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_

We can see that the CV score is less than the previous case. But the values tried are very widespread, we should try values closer to the optimum here (0.01) to see if we get something better.

param_test7 = {
 'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05]
}
gsearch7 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
 min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test7, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch7.fit(train[predictors],train[target])
gsearch7.grid_scores_, gsearch7.best_params_, gsearch7.best_score_

You can see that we got a better CV. Now we can apply this regularization in the model and look at the impact:

xgb3 = XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=4,
 min_child_weight=6,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 reg_alpha=0.005,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
modelfit(xgb3, train, predictors)

Again we can see slight improvement in the score.

Step 6: Reducing Learning Rate

Lastly, we should lower the learning rate and add more trees. Lets use the cv function of XGBoost to do the job again.

xgb4 = XGBClassifier(
 learning_rate =0.01,
 n_estimators=5000,
 max_depth=4,
 min_child_weight=6,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 reg_alpha=0.005,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
modelfit(xgb4, train, predictors)

Now we can see a significant boost in performance and the effect of parameter tuning is clearer.

As we come to the end, I would like to share 2 key thoughts:

It is difficult to get a very big leap in performance by just using parameter tuning or slightly better models. The max score for GBM was 0.8487 while XGBoost gave 0.8494. This is a decent improvement but not something very substantial.
A significant jump can be obtained by other methods like feature engineering, creating ensemble of models, stacking, etc

You can also download the iPython notebook with all these model codes from my GitHub account. For codes in R, you can refer to this article.

End Notes

This article was based on developing a XGBoost model end-to-end. We started with discussing why XGBoost has superior performance over GBM which was followed by detailed discussion on thevarious parameters involved. We also defined a generic function which you can re-use for making models.

Finally, we discussed the general approach towards tackling a problem with XGBoost and also worked out the AV Data Hackathon 3.x problem through that approach.

I hope you found this useful and now you feel more confident to apply XGBoost in solving a data science problem. You can try this out in out upcoming hackathons.

Did you like this article? Would you like to share some other hacks which you implement while making XGBoost models? Please feel free to drop a note in the comments below and I’ll be glad to discuss.

You want to apply your analytical skills and test your potential? Then participate in our Hackathons and compete with Top Data Scientists from all over the world.

数据挖掘与机器学习

AI^2: Training a big data machine to defend

September 22, 2016 zr9558 Leave a comment

AI²: Training a big data machine to defend Veeramachaneni et al.IEEE International conference on Big Data Security, 2016

Will machines take over? The lesson of today’s paper is that we’re better off together. Combining AI with HI (human intelligence, I felt like we deserved an acronym of our own ) yields much better results than a system that uses only unsupervised learning. The context is information security, scanning millions of log entries per day to detect suspicious activity and prevent attacks. Examples of attacks include account takeovers, new account fraud (opening a new account using stolen credit card information), and terms of service abuse (e.g. abusing promotional codes, or manipulating cookies for advantage).

A typical attack has a behavioral signature, which comprises the series of steps involved in commiting it. The information necessary to quantify these signatures is buried deep in the raw data, and is often delivered as logs.

The usual problem with such outlier/anomaly detection systems is that they trigger lots of false positive alarms, that take substantial time and effort to investigate. After the system has ‘cried wolf’ enough times they can become distrusted and of limited use. AI² combines the experience and intuition of analysts with machine learning techniques. An ensemble of unsupervised learning models generates a set of k events to be analysed per day (where the daily budget k of events that can be analysed is a configurable parameter). The human judgements on these k events are used to train a supervised model, the results of which are combined with the unsupervised ensemble results to refine the k events to be presented to the analyst on the next day. And so it goes on.

The end result looks a bit like this:

With a daily investigation budget (k) of 200 events, AI² detects 86.8% of attacks with a false positive rate of 4.4%. Using only unsupervised learning, on 7.9% of attacks are detected. If the investigation budget is upped to 1000 events/day, unsupervised learning can detect 73.7% of attacks with a false positive rate of 22%. At this level, the unsupervised system is generating 5x the false positives of AI², and still not detecting as many attacks.

Detecting attacks is a true ‘needle-in-a-haystack’ problem as the following table shows:

Entities in the above refers to the number of unique IP addresses, users, sessions etc. anaysed on a daily basis. The very small relative number of true attacks results in extreme class imbalance when trying to learn a supervised model.

AI² tracks activity based on ingested log records and aggregates activities over intervals of time (for example,counters, indicators – did this happen in the window at all? – elapsed time between events, number of unique values and so on). These features are passed into an ensemble of three unsupervised outlier detection models:

A Principle Component Analysis (PCA) based model. The basic idea is to use PCA to determine the most significant features (those that explain most of the variance in the data). Given an input take its PCA projection, and then from the projection, reconstruct the original variables. The reconstruction error will be small for the majority of examples, but will remain high for outliers.
A Replicator Neural Network (not to be confused with a RecurrentNeural Network – both get abbreviated to RNN). This works on a very similar principal. The input and output layers have the same number of nodes, and intermediate layers have fewer nodes. The goal is to train the network to recreate the input at the output layer – which means it must learn an efficient compressed representation in the lower-dimensional hidden layers. Once the RNN has been trained, the reconstruction error can be used as the outlier score.

The third unsupervised model uses copula functions to build a joint probability function that can be used to detect rare events.

A copula framework provides a means of inference after modeling a multivariate joint probability distribution from training data. Because copula frameworks are less well known than other forms of estimation, we will now briefly review copula theory…

(If you’re interested in that review, and how copula functions are used to form a multivariate density function, see section 6.3 in the paper).

The scores from each of the models are translated into probabilities using a Weibull distribution, “which is flexible and can model a wide variety of shapes.” This translation means that we can compare like-with-like when combining the results from the three models. Here’s an example of the combination process using one-day’s worth of data:

The whole AI² system cycles through training, deployment, and feedback collection/model updating phases on a daily basis. The system trains unsupervised and supervised models based on all the available data, applies those models to the incoming data, identifies k entities as extreme events or attacks, and brings these to the analyst’s attention. The analysts deductions are used to build a new predictive model for the next day.

This combined approach makes effective use of the limited available analyst bandwidth, can overcome some of the weaknesses of pure unsupervised learning, and actively adapts and synthesizes new models.

This setup captures the cascading effect of the human-machine interaction: the more attacks the predictive system detects, the more feeback it will receive from the analysts; this feedback, in turn, will improve the accuracy of future predictions.

数据挖掘与机器学习

Glossary of AI Terms for Cyber Security

September 22, 2016 zr9558 Leave a comment

We often encounter confusion and hype surrounding the terminology of Artificial Intelligence. In this post, it is hoped that the security practitioner can have a quick reference guide for some of the more important and common terms.

Note that this is a limited set. We discovered that, once you start defining these terms, the terms themselves introduce new terms that require definition. We had to draw the line somewhere…

Artificial Intelligence – “The goal of work in Artificial Intelligence is to build machines that perform tasks that normally require human intelligence.” This quote from Nils Nilsson is an excellent definition, but it is not the only one. There are many definitions of Artificial Intelligence here and here.
Algorithms – are a self-contained step-by-step sets of operations to be performed. Algorithms perform calculation, data processing, and/or automated reasoning tasks. Among other things, Algorithms can be used to train Machine Learning models.
Machine Learning – A discipline or subfield of Artificial Intelligence. Paraphrasing the definition by Tom Mitchell, ML is the study of computer algorithms that learn from experience to perform a set of predefined tasks.
Machine Learning Models – The output of a machine learning algorithm. There are two types of machine learning models, those generated by Supervised algorithms and Unsupervised algorithms. See below.
The difference between “algorithms” and “models”: this is a common question and still quite difficult to answer. In the context of Artificial Intelligence, we can say that learning algorithms generate models. The learning algorithms are either Supervised or Unsupervised.N.B. People often use “models” and “algorithms” interchangeably which is a common source of confusion. To the layman, think that algorithms are like programs and the models are the output of the program.
Unsupervised Learning (algorithm) – a family of machine learning algorithms that learn without labels (labels defined below). The output of Unsupervised Learning algorithms are models that capture the structure of the data, can identify groups, or find statistical outliers. For example, Unsupervised Learning models can show you behaviors that are unlike other behaviors in a corpus of data.
Supervised Learning (algorithm) – a family of machine learning algorithms that learn from labeled data. The output of Supervised Learning algorithms are predictive models that can classify or assign a score to a data pattern. For example, trained Supervised Learning models can classify behaviors patterns into different attack tactics, or can assign a risk score to a behavior. In cyber-security, Supervised Learning models predict what a human would label a given behavior pattern.
Labeling – is the act of classification or describing something. For PatternEx, the act of labeling is something that a human analyst does every day. He or she marks something as a malicious or benign behavior. The more labels are provided, the more accurate the system becomes.
Active Learning – Active learning is a machine learning process in which a learning algorithm interactively requests inputs from an external source to improve a model. It is most commonly applied when only unlabeled data is available, the goal is to train Supervised Learning models, and the external source is a human expert that provides labels, and the labeling process is expensive and/or slow. Active learning strategies are also useful when, as in the case of InfoSec, the data changes fast.
Behavior Vectors – a quantified description of the activity of the modeled entities.
Entities – a thing with distinct, independent existence against which the behaviors relate. In cyber-security, examples would be users, IP’s, domains, and so on.
Human-Assisted Artificial Intelligence – the synergy between human intuition and artificial intelligence. Note that the humans assist the AI by providing feedback (e.g. labels) and the trained AI assists the humans by automating and scaling the tasks requiring human intelligence.
Predictions – are the activity of the system anticipating how an event would be classified by the security analyst.
Rare Events – very similar to “anomalies” and “outliers,” Rare Events are events that activities seen in log data that are unusual or out of the ordinary but not yet determined to be either malicious or benign.
Transfer Learning – means you can port knowledge acquired at one environment to another to improve model accuracy. For example, a model trained at company X can be transferred to companies A, B and C, increasing the detection capabilities of the entire group.
Virtual Analyst – the term that describes the effect of a fully trained AI system. Because a trained AI system greatly scales the analytic capability of the human analyst team, we say it is like expanding your team with “virtual analysts.”

PatternEx Unique Approach

PatternEx comes with many algorithms out-of-the-box that allow it to create predictive models that select what the analyst should review. Humans will always be needed to identify in context what is malicious in a constantly changing sea of events. In this way, Human-Assisted Artificial intelligence systems learn from analysts to identify what events are malicious. This results in greater detection accuracy at scale and reduced mean time to identify an attack.

If you’d like to learn more about how PatternEx is bringing Artificial Intelligence into the domain of Cyber Security click here: LEARN MORE

数据挖掘与机器学习

异常点检测算法（二）

June 23, 2016 zr9558 3 Comments

前面一篇文章《异常点检测算法（一）》简要的介绍了如何使用概率统计的方法来计算异常点，本文将会介绍一种基于矩阵分解的异常点检测方法。在介绍这种方法之前，先回顾一下主成分分析（Principle Component Analysis）这一基本的降维方法。

（一）主成分分析（Principle Component Analysis）

对高维数据集合的简化有各种各样的原因，例如：

（1）使得数据集合更容易使用；

（2）降低很多算法的计算开销；

（3）去除噪声；

（4）更加容易的描述结果。

在主成分分析（PCA）这种降维方法中，数据从原来的坐标系转换到新的坐标系，新坐标系的选择是由数据集本身所决定的。第一个新坐标轴的方向选择的是原始数据集中方差最大的方向，第二个新坐标轴的选择是和第一个坐标轴正交并且具有最大方差的方向。该过程一直重复，重复的次数就是原始数据中特征的数目。如此操作下去，将会发现，大部分方差都包含在最前面的几个新坐标轴之中。因此，我们可以忽略余下的坐标轴，也就是对数据进行了降维的处理。

为了提取到第一个主成分（数据差异性最大）的方向，进而提取到第二个主成分（数据差异性次大）的方向，并且该方向需要和第一个主成分方向正交，那么我们就需要对数据集的协方差矩阵进行特征值的分析，从而获得这些主成分的方向。一旦我们计算出了协方差矩阵的特征向量，我们就可以保留最大的 N 个值。正是这 N 个值反映了 N 个最重要特征的真实信息，可以把原始数据集合映射到 N 维的低维空间。

提取 N 个主成分的伪代码如下：

去除平均值

计算协方差矩阵

计算协方差矩阵的特征值和特征向量

将特征值从大到小排序

保留最大的N个特征值以及它们的特征向量

将数据映射到上述N个特征向量构造的新空间中

通过 Python 的 numpy 库和 matplotlib 库可以计算出某个二维数据集合的第一主成分如下：原始数据集使用蓝色的三角形表示，第一主成分使用黄色的圆点表示。

PCA

Principle Component Analysis 的基本性质：

Principle component analysis provides a set of eigenvectors satisfying the following properties:

（1）If the top-k eigenvectors are picked (by largest eigenvalue), then the k-dimensional hyperplane defined by these eigenvectors, and passing through the mean of the data, is a plane for which the mean square distance of all data points to it is as small as possible among all hyperplanes of dimensionality k.

（2）If the data is transformed to the axis-system corresponding to the orthogonal eigenvectors, the variance of the transformed data along each eigenvector dimension is equal to the corresponding eigenvalue. The covariances of the transformed data in this new representation are 0.

（3）Since the variances of the transformed data along the eigenvectors with small eigenvalues are low, significant deviations of the transformed data from the mean values along these directions may represent outliers.

（二）基于矩阵分解的异常点检测方法

基于矩阵分解的异常点检测方法的关键思想是利用主成分分析去寻找那些违背了数据之间相关性的异常点。为了发现这些异常点，基于主成分分析（PCA）的算法会把原始数据从原始的空间投影到主成分空间，然后再把投影拉回到原始的空间。如果只使用第一主成分来进行投影和重构，对于大多数的数据而言，重构之后的误差是小的；但是对于异常点而言，重构之后的误差依然相对大。这是因为第一主成分反映了正常值的方差，最后一个主成分反映了异常点的方差。

假设 dataMat 是一个 p 维的数据集合，有 N 个样本，它的协方差矩阵是 X。那么协方差矩阵就通过奇异值分解写成：

$X=PDP^{T},$

其中 P 是一个 (p,p) 维的正交矩阵，它的每一列都是 X 的特征向量。D 是一个 (p,p) 维的对角矩阵，包含了特征值 $\lambda_{1},...,\lambda_{p}$ 。从图像上看，一个特征向量可以看成 2 维平面上面的一条线，或者高维空间里面的一个超平面。特征向量所对应的特征值反映了这批数据在这个方向上的拉伸程度。通常情况下，可以把对角矩阵 D 中的特征值进行从大到小的排序，矩阵 P 的每一列也进行相应的调整，保证 P 的第 i 列对应的是 D 的第 i 个对角值。

这个数据集 dataMat 在主成分空间的投影可以写成

$Y=dataMat\times P.$

需要注意的是做投影可以只在部分的维度上进行，如果使用 top-j 的主成分的话，那么投影之后的数据集是

$Y^{j}=dataMat \times P^{j},$

其中 $P^{j}$ 是矩阵 P 的前 j 列，也就是说 $P^{j}$ 是一个 (p,j) 维的矩阵， $Y^{j}$ 是一个 (N,j) 维的矩阵。如果考虑拉回映射的话（也就是从主成分空间映射到原始空间），重构之后的数据集合是

$R^{j}=(P^{j}\times (Y^{j})^{T})^{T}=Y^{j}\times (P^{j})^{T},$

其中 $R^{j}$ 是使用 top-j 的主成分进行重构之后形成的数据集，是一个 (N,p) 维的矩阵。

下面可以定义数据 $dataMat_{i}=(dataMat_{i,1},...,dataMat_{i,p})$ 的异常值分数（outlier score）如下：

$score(dataMat_{i})=\sum_{j=1}^{p}(|dataMat_{i}-R_{i}^{j}|)\times ev(j)$

$ev(j)=\sum_{k=1}^{j}\lambda_{k}/\sum_{k=1}^{p}\lambda_{k}$

注意到 $|dataMat_{i}-R_{i}^{j}|$ 指的是 Euclidean 范数， ev(j) 表示的是 top-j 的主成分在所有主成分中所占的比例，并且特征值是按照从大到小的顺序排列的。因此，ev(j) 是递增的序列，这就表示 j 越高，越多的方差就会被考虑在 ev(j) 中，因为是从 1 到 j 的求和。在这个定义下，偏差最大的第一个主成分获得最小的权重，偏差最小的最后一个主成分获得了最大的权重 1。根据 PCA 的性质，异常点在最后一个主成分上有着较大的偏差，因此可以获得更高的分数。

整个算法的结构如图所示：

PCC

（三）效果展示

下面两幅图使用了同一批数据集，分别采用了基于矩阵分解的异常点检测算法和基于高斯分布的概率模型的异常点算法。

PCC2

基于矩阵分解的异常点检测

Gauss

基于高斯分布的概率模型的异常点检测

根据图像可以看出，如果使用基于矩阵分解的异常点检测算法的话，偏离第一主成分较多的点都被标记为异常点，其中包括部分左下角的点。需要注意的是如果使用基于高斯分布的概率模型的话，是不太可能标记出左下角的点的，两者形成鲜明对比。