主动学习（Active Learning）

主动学习背景介绍

1. 机器学习模型：包括机器学习模型的训练和预测两部分；
2. 待标注的数据候选集提取：依赖主动学习中的查询函数（Query Function）；
3. 人工标注：专家经验或者业务经验的提炼；
4. 获得候选集的标注数据：获得更有价值的样本数据；
5. 机器学习模型的更新：通过增量学习或者重新学习的方式更新模型，从而将人工标注的数据融入机器学习模型中，提升模型效果。

主动学习的应用场景包括：

1. 个性化的垃圾邮件、短信、内容分类：包括营销短信、订阅邮件、垃圾短信和邮件等等；
2. 异常检测：包括但不限于安全数据异常检测，黑产账户识别，时间序列异常检测等等。

主动学习中常见的查询策略包括：

1. 不确定性采样的查询（Uncertainty Sampling）；
2. 基于委员会的查询（Query-By-Committee）；
3. 基于模型变化期望的查询（Expected Model Change）；
4. 基于误差减少的查询（Expected Error Reduction）；
5. 基于方差减少的查询（Variance Reduction）；
6. 基于密度权重的查询（Density-Weighted Methods）。

不确定性采样（Uncertainty Sampling）

1. 置信度最低（Least Confident）；
2. 边缘采样（Margin Sampling）；
3. 熵方法（Entropy）；

Least Confident

$x_{LC}^{*}=argmax_{x}(1-P_{\theta}(\hat{y}|x))=argmin_{x}P_{\theta}(\hat{y}|x)$,

Margin Sampling

$x_{M}^{*}=argmin_{x}(P_{\theta}(\hat{y}_{1}|x)-P_{\theta}(\hat{y}_{2}|x))$,

Entropy

$x_{H}^{*}=argmax_{x}-\sum_{i}P_{\theta}(y_{i}|x)\cdot \ln P_{\theta}(y_{i}|x)$,
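下面用一段示意代码演示这三种不确定性度量的计算（假设 probs 是模型对未标注样本输出的预测概率矩阵，形状为「样本数 × 类别数」，数值是随意构造的例子）：

import numpy as np

probs = np.array([[0.10, 0.80, 0.10],
                  [0.40, 0.35, 0.25],
                  [0.60, 0.30, 0.10]])

# 置信度最低：选择最大预测概率最小的样本
least_confident = np.argmin(probs.max(axis=1))

# 边缘采样：选择最大与次大预测概率之差最小的样本
sorted_probs = np.sort(probs, axis=1)
margin = np.argmin(sorted_probs[:, -1] - sorted_probs[:, -2])

# 熵方法：选择预测分布的熵最大的样本
entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
max_entropy = np.argmax(entropy)

print(least_confident, margin, max_entropy)  # 三种准则在这个例子中都选中第 1 个样本（下标从 0 开始）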

基于委员会的查询（Query-By-Committee）

1. 投票熵（Vote Entropy）：选择这些模型都无法区分的样本数据；
2. 平均 KL 散度（Average Kullback-Leibler Divergence）：选择 KL 散度较大的样本数据。

投票熵（Vote Entropy）

$x_{VE}^{*}=argmax_{x}-\sum_{i}\frac{V(y_{i})}{C}\cdot\ln\frac{V(y_{i})}{C}$,

平均 KL 散度（Average KL Divergence）

KL 散度可以衡量两个概率分布之间的“距离”，因此可以用 KL 散度找出那些委员会成员之间偏差较大的数据样本。用数学公式来描述就是：

$x_{KL}^{*}=argmax_{x}\frac{1}{C}\sum_{c=1}^{C}D(P_{\theta^{(c)}}||P_{\mathcal{C}}),$
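投票熵的计算可以写成如下示意代码（假设委员会由 $C=5$ 个模型组成，votes 的第 $i$ 行是各个模型对第 $i$ 个样本投出的类别，数据为随意构造的例子）：

import numpy as np
from collections import Counter

C = 5
votes = np.array([[0, 0, 0, 0, 0],
                  [0, 1, 1, 2, 0],
                  [1, 1, 0, 1, 1]])

def vote_entropy(row):
    # V(y_i)/C 是类别 y_i 获得的票数占比
    p = np.array(list(Counter(row).values()), dtype=float) / C
    return -np.sum(p * np.log(p))

scores = np.array([vote_entropy(row) for row in votes])
query_index = np.argmax(scores)  # 选择投票熵最大（委员会分歧最大）的样本
print(query_index)               # 这个例子中是第 1 个样本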

基于密度权重的选择方法（Density-Weighted Methods）

$x_{ID}^{*}=argmax_{x}\phi_{A}(x)\cdot\bigg(\frac{1}{U}\sum_{u=1}^{U}sim(x,x^{(u)})\bigg)^{\beta}$,

参考资料

1. Settles, Burr. Active learning literature survey. University of Wisconsin-Madison Department of Computer Sciences, 2009.
2. Aggarwal, Charu C., et al. “Active learning: A survey.” Data Classification: Algorithms and Applications. CRC Press, 2014. 571-605.

符号计算中的深度学习方法

符号计算

• $2 + 3 * (5 + 2)$
• $3x^{2}+\cos(2x)+1$
• $\frac{\partial^{2}\psi}{\partial x^{2}} - \frac{1}{v^{2}}\frac{\partial^{2}\psi}{\partial t^{2}}$

>>> import sympy
>>> x, y = sympy.symbols("x y")
>>> expr = sympy.sin(x+y) + x**2 + 1/y - 10
>>> sympy.srepr(expr)
>>> expr = sympy.sin(x*y)/2 - x**2 + 1/y
>>> sympy.srepr(expr)
"Add(Mul(Integer(-1), Pow(Symbol('x'), Integer(2))), Mul(Rational(1, 2), sin(Mul(Symbol('x'), Symbol('y')))), Pow(Symbol('y'), Integer(-1)))"

SymPy 的 srepr 函数的输出用树状结构来表示就是形如图 2 这种格式。叶子节点要么是 x、y 这种变量，要么是 -1 和 2 这种整数。对于一元运算（例如 sin 函数）而言，对应唯一的一个子节点；对于二元运算（例如 pow、mul、add）而言，则对应两个子节点。
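论文在训练模型之前，会把表达式树序列化成前缀（波兰）表达式。下面用 SymPy 给出一个简化的示意（to_prefix 是为演示而写的函数，并非论文的官方实现）：

import sympy

x, y = sympy.symbols("x y")
expr = sympy.sin(x * y) / 2 - x**2 + 1 / y

def to_prefix(e):
    # 前序遍历表达式树：先输出当前节点（运算符或叶子），再依次遍历子节点
    if not e.args:                  # 叶子节点：变量或者数字
        return [str(e)]
    tokens = [type(e).__name__]     # 内部节点：Add、Mul、Pow、sin 等
    for arg in e.args:
        tokens += to_prefix(arg)
    return tokens

print(to_prefix(expr))
# 输出形如 ['Add', 'Mul', '-1', 'Pow', 'x', '2', ...] 的前缀序列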

论文方案

1. 生成数据；
2. 训练模型；
3. 预测结果；

论文解决的问题包括以下三类：

1. 函数的积分；
2. 一阶常微分方程；
3. 二阶常微分方程。

生成数据时的设定如下：

1. 数学表达式最多拥有 15 个内部节点；
2. $L = 11$ 表示叶子节点的取值只有 11 种，分别是变量 $x$ 与整数 $\{-5,-4,-3,-2,-1,1,2,3,4,5\}$；
3. $p_{1} = 15$ 表示一元运算只有 15 个，分别是 $\exp, \log, \sqrt{\cdot}, \sin, \cos, \tan, \arcsin, \arccos, \arctan, \sinh, \cosh, \tanh, \mathrm{arcsinh}, \mathrm{arccosh}, \mathrm{arctanh}$；
4. $p_{2} = 4$ 表示二元运算只有 4 个，分别是 $+, -, \times, /$。

一阶常微分方程的生成

1. 生成二元函数 $f(x,c) = x\log(c/x)$；
2. 求解 $c$，得到 $c = xe^{f(x)/x}$；
3. 对 $x$ 求导，得到 $e^{f(x)/x}(1+f'(x)-f(x)/x) = 0$；
4. 简化后得到 $x+xf'(x) - f(x) =0$，也就是 $x+xy'-y=0$。

1. 生成三元函数 $f(x,c_{1},c_{2}) = c_{1}e^{x}+c_{2}e^{-x}$；
2. 求解 $c_{2}$，得到 $c_{2} = f(x,c_{1},c_{2}) e^{x}- c_{1}e^{2x}$；
3. 对 $x$ 求导，得到 $e^{x}\big(\partial f(x,c_{1},c_{2})/\partial x + f(x,c_{1},c_{2})\big) - 2c_{1}e^{2x} = 0$；
4. 求解 $c_{1}$，得到 $c_{1} = e^{-x}\big(\partial f(x,c_{1},c_{2})/\partial x+f(x,c_{1},c_{2})\big)/2$；
5. 对 $x$ 求导，得到 $e^{-x}\big(\partial^{2} f(x,c_{1},c_{2})/\partial x^{2} - f(x,c_{1},c_{2})\big)=0$；
6. 简化后得到 $\partial^{2} f(x,c_{1},c_{2})/\partial x^{2} - f(x,c_{1},c_{2})=0$，也就是 $y''-y=0$。
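可以用 SymPy 验证上面两个例子生成的微分方程（示意代码）：

import sympy

x, c, c1, c2 = sympy.symbols("x c c1 c2", positive=True)

# 一阶方程：f(x, c) = x*log(c/x) 应满足 x + x*y' - y = 0
f1 = x * sympy.log(c / x)
print(sympy.simplify(x + x * sympy.diff(f1, x) - f1))  # 输出 0

# 二阶方程：f(x, c1, c2) = c1*e^x + c2*e^{-x} 应满足 y'' - y = 0
f2 = c1 * sympy.exp(x) + c2 * sympy.exp(-x)
print(sympy.simplify(sympy.diff(f2, x, 2) - f2))       # 输出 0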

数据处理

1. 数学表达式的简化（expression simplification）：例如 $x+1+1$ 可以简化成 $x+2$，$\sin^{2}(x)+\cos^{2}(x)$ 可以简化成 $1$；
2. 参数的简化（coefficient simplification）：例如 $\log(x^{2}) + c\log(x)$ 可以简化成 $c\log(x)$；
3. 无效表达式的过滤（invalid expression filter）：例如 $\sqrt{-2}$、$\log(0)$ 等。

计算机视觉中的注意力机制

注意力机制

1. 注意力机制需要决定整段输入的哪个部分需要更加关注；
2. 从关键的部分进行特征提取，得到重要的信息。

1. 在机器翻译领域，翻译人员需要把已有的一句话翻译成另外一种语言的一句话。例如把一句话从英文翻译到中文，把中文翻译到法语。在这种情况下，输入语言和输出语言的词语之间的先后顺序其实是相对固定的，是具有一定的语法规则的；
2. 在视频分类或者情感识别领域，视频的先后顺序是由时间戳和相应的片段组成的，输入的就是一段视频里面的关键片段，也就是一系列具有先后顺序的图片的组合。NLP 中的情感识别问题也是一样的，语言本身就具有先后顺序的特点；
3. 图像识别，物体检测领域与前面两个有本质的不同。因为物体检测其实是在一幅图里面挖掘出必要的物体结构或者位置信息，在这种情况下，它的输入就是一幅图片，并没有非常明显的先后顺序，而且从人脑的角度来看，由于个体的差异性，很难找到一个通用的观察图片的方法。由于每个人都有着自己观察的先后顺序，因此很难统一成一个整体。

基于 RNN 的注意力机制

1. 建立一个编码（Encoder）和解码（Decoder）的非线性模型，神经网络的参数足够多，能够存储足够的信息；
2. 除了关注句子的整体信息之外，每次翻译下一个词语的时候，需要对不同的词语赋予不同的权重，在这种情况下，再解码的时候，就可以同时考虑到整体的信息和局部的信息。

计算机视觉领域的 Attention 部分论文整理

Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition

RA-CNN 的特点是进行一个端到端的优化，并不需要提前标注 box，区域等信息就能够进行鸟类的识别和图像种类的划分。在数据集上面，该论文不仅在鸟类数据集（CUB Birds）上面进行了实验，也在狗类识别（Stanford Dogs）和车辆识别（Stanford Cars）上进行了实验，并且都取得了不错的效果。

1. 分类概率的计算，也就是最终的 loss 函数的设计；
2. 从上一幅图片到下一幅图片的坐标值和尺寸大小。

$t_{x(t\ell)} = t_{x} - t_{\ell}, t_{x(br)} = t_{x} + t_{\ell},$

$t_{y(t\ell)} = t_{y} - t_{\ell}, t_{y(br)} = t_{y} + t_{\ell}.$

$X_{(i,j)}^{amp} = \sum_{\alpha,\beta=0}^{1}|1-\alpha-\{i/\lambda\}|\cdot|1-\beta-\{j/\lambda\}|\cdot X_{(m,n)}^{att},$

$L(X) = \sum_{s=1}^{3}\{L_{cls}(Y^{(s)},Y^{*})\} + \sum_{s=1}^{2}\{L_{rank}(p_{t}^{(s)},p_{t}^{(s+1)})\}$.

$\frac{\partial L_{rank}}{\partial t_{x}} \propto D_{top} \odot \frac{\partial M(t_{x},t_{y},t_{\ell})}{\partial t_{x}},$

当 $x\rightarrow t_{x(t\ell)}$ 时，$\frac{\partial M}{\partial t_{x}}<0$；

当 $x \rightarrow t_{x(br)}$ 时，$\frac{\partial M}{\partial t_{x}}>0$；

当 $y\rightarrow t_{y(t\ell)}$ 时，$\frac{\partial M}{\partial t_{y}}<0$；

当 $y \rightarrow t_{y(br)}$ 时，$\frac{\partial M}{\partial t_{y}}>0$；

当 $x \rightarrow t_{x(t\ell)}$ 或 $x \rightarrow t_{x(br)}$ 或 $y \rightarrow t_{y(t\ell)}$ 或 $y \rightarrow t_{y(br)}$ 时，$\frac{\partial M}{\partial t_{\ell}}>0$。

RA-CNN 的实验效果如下（原文中的结果图，此处从略）。

参考文献

1. Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition, CVPR, 2017.
2. Recurrent Models of Visual Attention, NIPS, 2014.
3. GitHub 代码：Recurrent-Attention-CNN，https://github.com/Jianlong-Fu/Recurrent-Attention-CNN
4. Multiple Granularity Descriptors for Fine-grained Categorization, ICCV, 2015.
5. Multiple Object Recognition with Visual Attention, ICLR, 2015.
6. Understanding LSTM Networks, Colah's Blog, 2015, http://colah.github.io/posts/2015-08-Understanding-LSTMs/
7. Survey on the attention based RNN model and its applications in computer vision, 2016.

启发式优化算法

启发式优化算法的一些调研

启发式优化算法的背景

1. 首先，我们需要有一个目标函数 $f(\bold{x})$，其中 $\bold{x} \in \mathbb{R}^{n}$，$n$ 是一个正整数。目标函数 $f(\bold{x})$ 不一定需要是显式的（显式指的是能够精确写出 $f(\bold{x})$ 的表达式，例如逻辑回归、神经网络、XGBoost 的目标函数等）；即使目标函数是隐式的（即无法写出 $f(\bold{x})$ 的表达式），也可以使用启发式优化算法来研究。

2. 其次，需要计算目标函数 $f(\bold{x})$ 的最大值 $\max f(\bold{x})$ 或者最小值 $\min f(\bold{x})$

3. 再次，如果存在多个函数 $f_{i}(\bold{x}), 1\leq i\leq m,$ 并且每一个 $f_{i}(\bold{x})$ 的值域都在 $[0,1]$ 内，那么也可以考虑形如

$\min_{\bold{x}}(\max_{i}f_{i}(\bold{x})-\min_{i}f_{i}(\bold{x}))$

的目标函数，这类复合目标函数同样可以用启发式优化算法来求解。

4. 拥有目标函数的目的是寻找最优的 $\bold{x}$，使得它满足 $argmin_{\bold{x}}f(\bold{x})$ 或者 $argmax_{\bold{x}}f(\bold{x})$；

5. 在这里的 $\bold{x}\in \mathbb{R}^{n}$ 一般都是连续的特征，而不是那种类别特征。如果是类别的特征，并不是所有的启发式优化算法都适用，但是遗传算法之类的算法在这种情况下有着一定的用武之地。

启发式优化算法的实验方法

1. 论文中的方法：找到一些形状怪异的函数作为目标函数，需要寻找这些函数的全局最大值或者全局最小值。除了全局的最大最小值点之外，这些函数也有很多局部极值点。例如 Rastrigin 函数（Chapter 6，Example 6.1，Search and Optimization by Metaheuristics）

$\min_{\bold{x}}f(\bold{x}) = 10\cdot n +\sum_{i=1}^{n}(x_{i}^{2}-10\cos(2\pi x_{i})),$

以及 Easom 函数

$\min_{\bold{x}}f(\bold{x}) = - \cos(x_{1})\cos(x_{2})\exp(-(x_{1}-\pi)^{2}-(x_{2}-\pi)^{2}),$

2. 最朴素的算法：随机搜索法（Random Search）。也就是把 $\bold{x}\in\mathbb{R}^{n}$ 的定义域在每个维度上切分成 $m$ 份，得到 $m^{n}$ 个网格点，计算每一个网格点上的函数取值，然后统计出其最大值或者最小值即可。这种方法在实际中并不适用，因为当 $n\geq 100$ 的时候，这是指数级别的计算复杂度。

3. 启发式优化算法：针对某一个目标函数，使用某种启发式优化算法，然后与随机搜索法做对比，或者与其余的启发式优化算法做对比。

4. 在线调优：启发式优化算法大部分都可以做成实时调优的算法。调优的速度除了取决于算法本身的收敛速度之外，还取决于给定一个 $\bold{x}$ 之后获取 $f(\bold{x})$ 取值所需的时间。

启发式优化算法的一些常见算法

粒子群算法（Particle Swarm Optimization）

PSO 算法维护一个粒子群，根据群体和单个粒子的历史最优效果，来调整每一个粒子下一步的运动方向。假设粒子的个数是 $N_{p}$，每一个粒子 $\bold{x}_{i}\in \mathbb{R}^{n}$ 都是 $n$ 维欧几里德空间里面的点，同时每个粒子还带有速度 $\bold{v}_{i}\in\mathbb{R}^{n}$。在每一轮迭代中，需要更新两类最优值：每一个粒子自身的历史最优位置 $\bold{x}_{i}^{*}$（$1\leq i \leq N_{p}$）和所有粒子的历史最优位置 $\bold{x}^{g}$。在第 $t+1$ 次迭代的时候，

$\bold{v}_{i}(t+1) = \bold{v}_{i}(t) + c r_{1}[\bold{x}_{i}^{*}(t) - \bold{x}_{i}(t)] + c r_{2}[\bold{x}^{g}(t) - \bold{x}_{i}(t)],$

$\bold{x}_{i}(t+1) = \bold{x}_{i}(t)+\bold{v}_{i}(t+1), \text{ } 1\leq i\leq N_{p}.$
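下面给出一段用上述更新公式最小化 Rastrigin 函数的 PSO 示意代码（粒子个数、加速系数 $c$、迭代次数等都是假设的取值）：

import numpy as np

def rastrigin(x):
    return 10 * x.shape[-1] + np.sum(x**2 - 10 * np.cos(2 * np.pi * x), axis=-1)

rng = np.random.default_rng(0)
n, n_p, c = 2, 30, 2.0                          # 维度、粒子个数、加速系数
x = rng.uniform(-5.12, 5.12, size=(n_p, n))     # 粒子位置 x_i
v = np.zeros((n_p, n))                          # 粒子速度 v_i
p_best = x.copy()                               # 每个粒子的历史最优 x_i^*
g_best = x[np.argmin(rastrigin(x))]             # 所有粒子的历史最优 x^g

for t in range(200):
    r1, r2 = rng.random((n_p, 1)), rng.random((n_p, 1))
    v = v + c * r1 * (p_best - x) + c * r2 * (g_best - x)  # 速度更新公式
    x = x + v                                              # 位置更新公式
    improved = rastrigin(x) < rastrigin(p_best)
    p_best[improved] = x[improved]
    g_best = p_best[np.argmin(rastrigin(p_best))]

print(g_best, rastrigin(g_best))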

模拟退火（Simulated Annealing）

设定初始温度 $T$、初始解 $\bold{x}$ 与能量函数 $E(\bold{x})$，Repeat（直到温度降到下限或者达到最大迭代次数）：

a. 在当前温度下 Repeat（若干次）:

i. 进行一个随机扰动，得到新解 $\bold{x} + \Delta \bold{x}$；

ii. 计算 $\Delta E(\bold{x}) = E(\bold{x}+\Delta\bold{x}) - E(\bold{x})$；若 $\Delta E\leq 0$ 则接受新解 $\bold{x}\leftarrow\bold{x}+\Delta\bold{x}$，否则以概率 $e^{-\Delta E/T}$ 接受新解；

b. 令 $T = T-\Delta T$（降温）。
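下面是一段用模拟退火最小化 Rastrigin 函数的示意代码（扰动分布、内层迭代次数和降温方式都是假设的选择，这里用几何降温代替线性降温）：

import numpy as np

def rastrigin(x):
    return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

rng = np.random.default_rng(0)
x = rng.uniform(-5.12, 5.12, size=2)    # 初始解
T, T_min, alpha = 10.0, 1e-3, 0.95      # 初始温度、温度下限、降温系数
best_x, best_e = x.copy(), rastrigin(x)

while T > T_min:
    for _ in range(100):                               # 同一温度下的内层迭代
        x_new = x + rng.normal(0, 0.5, size=x.shape)   # 高斯随机扰动
        dE = rastrigin(x_new) - rastrigin(x)
        # dE <= 0 直接接受；否则以 e^{-dE/T} 的概率接受
        if dE <= 0 or rng.random() < np.exp(-dE / T):
            x = x_new
        if rastrigin(x) < best_e:                      # 记录历史最优解
            best_x, best_e = x.copy(), rastrigin(x)
    T *= alpha                                         # 降温

print(best_x, best_e)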

1. 如何选择合适的 $\Delta \bold{x}$：可以选择 Gaussian 正态分布，并且可以同时修改 $\bold{x}$ 的 $n$ 个维度，也可以每次只修改一个维度或者部分维度的值；
2. 一开始温度 $T$ 很大的时候，即使 $E(\bold{x}+\Delta\bold{x})\geq E(\bold{x})$，也会以极大概率接受新解，因为此时 $\lim_{T\rightarrow\infty}e^{-\Delta E/T}=1$，所以算法一开始基本上是在解空间中四处游荡；
3. 当温度 $T$ 逐渐降低的时候，当前的 $\bold{x}$ 不一定处于一个合适的位置，因此需要记录历史上探索过的使 $E(\bold{x})$ 最小的解，并从中选择合适的值继续迭代。在温度下降时如何选择一个合适的解继续迭代，可能是一个关键问题。
4. 从实验对比来看，PSO 算法会随时记录当前粒子和所有粒子的最优状态，整个种群中总有少数粒子处于当前最优状态附近，因此 PSO 的效果看起来比不记录状态的遗传算法要好一些。

进化策略（Evolutionary Strategies）

ES 和 GA 在方法论上有一些不同，这一点书上有论述，等整理完 GA 的内容之后再一起对比这一块。

重组（recombination）：$\bold{x}'=(\bold{x}_{1}+\bold{x}_{2})/2,$

变异（mutation）：$x_{i}'=x_{i}+N(0,\sigma_{i}),\text{ } i=1,\cdots,n,$

（1）在 $(\lambda+\mu)$ 策略中，从 $\mu$ 个父代个体中通过变异或者交叉得到 $\lambda$ 个子代，然后从这 $(\lambda+\mu)$ 个个体中选择出 $\mu$ 个最优的个体。$(\lambda+\mu)$ 是精英策略，每一次迭代的最优解都不会比上一次差。

（2）在 $(\lambda, \mu)$ 策略中，$\mu$ 个最优的个体是从 $\lambda$（$\lambda\geq\mu$）个子代中选择出来的。

$\bold{g}=\sum_{i=1}^{\lambda}(f(\bold{t}_{i})-f(\bold{x}))(\bold{t}_{i}-\bold{x}),$

$\bold{x}_{1}=\bold{x}+(\sigma \psi)\bold{e}, \bold{x}_{2}=\bold{x}+(\sigma/\psi)\bold{e},$

参考资料：

1. Kelin Du, M. N. S. Swamy. Search and Optimization by Metaheuristics. 2016.

聚类算法（二）

流式聚类的算法细节：

$K$ 表示在聚类的过程中允许形成的簇的最大个数；

$D$ 表示距离的阈值，这里两个点之间的距离可以使用 $L^{1}, L^{2}, L^{\infty}$ 范数；

Step 1：对于 $dataMat[0]$ 而言，自成一类，i.e. 质心就是它本身 $C[0]=dataMat[0]$，该聚簇的元素个数是 $num[0]=1$，当前所有簇的个数是 $K^{'}=1$。

Step 2：对于每一个 $1\leq i\leq m-1$，进行如下的循环操作：计算 $dataMat[i]$ 与当前 $K'$ 个质心的距离，记最近的质心为第 $j'$ 个，对应的距离为 $d$。

（i）如果当前 $K'=K$ 或者 $d\leq D$，则把 $dataMat[i]$ 加入到第 $j^{'}$ 个聚簇。也就是说：$num[j']\leftarrow num[j']+1$，并增量更新质心 $C[j']\leftarrow C[j'] + (dataMat[i]-C[j'])/num[j']$；

（ii）否则，$dataMat[i]$ 需要自成一类，i.e. $K^{'} \leftarrow K^{'}+1$，$num[K'-1]=1$，$C[K'-1]=dataMat[i]$。
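上述流程的一个 Python 示意实现如下（假设 dataMat 是形状为「样本数 × 维度」的 numpy 数组，距离使用 $L^{2}$ 范数）：

import numpy as np

def streaming_cluster(dataMat, K, D):
    C = [dataMat[0].astype(float)]   # 质心列表，初始只包含第一个样本
    num = [1]                        # 每个簇的元素个数
    for i in range(1, len(dataMat)):
        dists = [np.linalg.norm(dataMat[i] - c) for c in C]
        j = int(np.argmin(dists))    # 最近的簇 j'
        d = dists[j]
        if len(C) == K or d <= D:
            # 并入第 j' 个簇，并增量更新该簇的质心
            num[j] += 1
            C[j] += (dataMat[i] - C[j]) / num[j]
        else:
            # 自成一类
            C.append(dataMat[i].astype(float))
            num.append(1)
    return C, num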

注：

（1）数据的先后顺序对聚类的效果有一定的影响。

（2）该算法的时间复杂度与簇的个数是相关的，通常来说，如果形成的簇越多，对于一个新的样本与质心的比较次数就会越多。

（3）如果把簇的最大个数 $K$ 设置成一个比较小的数字，那么对于一个新的样本，需要比较的次数就会大量减少；当最近距离小于 $D$ 或者已经达到最大簇个数时，新的样本会被放入某个已有的簇，此时可能出现该簇的质心因为这个新样本而发生偏移的情况。如果把 $K$ 设置得很大，计算的时间复杂度就会增加。因此，在保证计算实时性的前提下，需要考虑 $K$ 的设置问题。

三十分钟理解博弈论“纳什均衡” — Nash Equilibrium

纳什均衡定义

经济学定义从字面上还是相对比较好理解的；这里稍微解释一下数学定义。博弈论也称 Game Theory，一场博弈用 $G$ 表示，$S_{i}$ 表示博弈方 $i$ 的策略，$u_{i}$ 表示收益。因此，纳什均衡的意思是：任何一方采取的策略都是对其余所有博弈方策略组合的最佳对策；当所有其他人都不改变策略时，为了让自己的收益最大，任何一方都不会（也无法）改变自己的策略，这个时候的策略组合就是一个纳什均衡。

纳什均衡案例

（1）囚徒困境

（2）智猪博弈

（3）普通范式博弈

GOO公司和SAM公司是某手机产品生态的两大重量级参与者，双方在产业链的不同位置上各司其职且关系暧昧，有时也往往因商业利益和产品影响力的争夺而各怀异心。二者的收益也随着博弈的变化而不断更替。

（4）饿狮博弈

（5）硬币正反

$3x + (-2)(1-x) = (-2)x + 1\cdot(1-x)$，解方程得 $x=3/8$；同样，对美女的收益列方程 $-3y + 2(1-y) = 2y + (-1)(1-y)$，解得 $y$ 也等于 $3/8$。

［转载］【重磅】无监督学习生成式对抗网络突破，OpenAI 5大项目落地


【新智元导读】“生成对抗网络是切片面包发明以来最令人激动的事情！”LeCun 前不久在 Quora 答问时毫不掩饰他对生成对抗网络的喜爱，他认为这是深度学习近期最值得期待、也最有可能取得突破的领域。生成对抗学习是无监督学习的一种，该理论由 Ian Goodfellow 提出，此人现在 OpenAI 工作。作为业内公认进行前沿基础理论研究的机构，OpenAI 不久前在博客中总结了他们的 5 大项目成果，结合丰富实例介绍了生成对抗网络，并对 OpenAI 五大落地项目进行梳理，包括完善生成对抗网络（GAN）、完善变分推断（VAE）、提出 GAN 的扩展 InfoGAN、生成模型的深度强化学习，以及提出生成对抗模仿学习及代码。

OpenAI 旨在开发能够让计算机理解世界的算法和技术。

DCGAN

Radford 等人提出的 DCGAN 网络（见下图）就是这样一个例子。DCGAN 网络以 100 个从一个高斯分布中采样的随机数作为输入（即代码，或者叫“隐含变量”，靠左红色部分），输出图像（在这里就是 64*64*3 的图像，靠右绿色部分）。当代码增量式变化时，生成的图像也随之变化——说明模型已经学会了如何描述世界，而不是仅仅是记住了一些样本。

DCGAN 使用随机权重初始化，所以随机代码输入会生成一个完全随机的图像。但是，这个网络有好几百万的参数可以调整，而我们的目的是设置参数，让根据随机代码输入产生的样本与训练数据看上去相似。换句话说，我们想要模型分布与图像空间中真实数据的分布相匹配。

VAE 学会产生图像（log time）

GAN 学会产生图像（linear time）

• Generative Adversarial Network（GAN）：这个我们在上面讨论过了，将训练过程视作两个不同网络之间的对抗：一个生成器网络和一个判别器网络，判别器网络试图区分来自真实分布 $p(x)$ 的样本和来自模型分布 $\hat{p}(x)$ 的样本。每当判别器发现两个分布之间有差异时，生成器网络便微调参数，使判别器无法从中找到差异。

• Variational Autoencoders（VAE）：让我们可以在概率图模型的框架下形式化这个问题，即最大化数据的对数似然（log likelihood）的下界。

• PixelRNN 等自回归模型：训练一个网络，对给定前面像素（从左到右、从上到下）条件下每个像素的条件分布进行建模。这类似于把图像的像素输入一个 char-rnn，但 RNN 是在图像上沿水平和垂直两个方向遍历，而不是处理一维的字符序列。

OpenAI 5 大落地

1. 完善对抗生成网络（GAN）

GAN 是非常值得期待的生成模型，不像其他方法，GAN 产生的图像干净、清晰，并且能够学会与纹理有关的代码。然而，GAN 被构建成两个网络之间的对抗，保持平衡十分重要（而且相当考验技巧）：两个网络可能在不同的解之间震荡，生成器也容易崩溃。

Tim Salimans, Ian Goodfellow, Wojciech Zaremba 等人引入了一些新技巧，让 GAN 训练更加稳定。这些技巧让我们能够放大 GAN ，获得清晰的 128*128 ImageNet 样本：

［左］真实图像（ImageNet）；［右］生成的图像

［左］真实图像（CIFAR-10）；［右］生成的图像

2. 完善变分推断（VAE）

［左］用 IAF 训练 VAE 生成的图像；［右］DRAW 模型生成的图像

3. InfoGAN

Peter Chen 等人提出了 InfoGAN，它是 GAN 的扩展。普通的 GAN 实现了在模型中重现数据分布的目标，但是代码空间的 layout 和 organization 是 underspecified 的；InfoGAN 则能让模型学会图像的 disentangled、interpretable 表征。

InfoGAN 引入了额外的结构，得到了相当出色的结果。在三维人脸图像中，改变代码的一个连续维度并保持其他维度不变，从每组给出的 5 行例子中可以明显看出，代码中对应的维度是可解释的。模型在事先不知道摄像头角度、面部变化等特征存在的前提下，可能已经理解这些特征是存在的，并且十分重要：

图：（a）Pose；（b）Elevation；（c）Lighting；（d）Wide or Narrow（原文图，此处从略）

4. 生成模型的深度强化学习（两项）

VIME 让智能体由内在的好奇心驱动：它会主动寻求出乎意料的状态-动作。作者展示了 VIME 可以改进一系列策略搜索方法，并在奖励更稀疏的实际任务（比如智能体需要在无指导的情形下学习原始行动的场景）上取得了显著改进。

5. 生成对抗模仿学习

Jonathan Ho 等人提出了一个新的模仿学习（imitation learning）方法。Jonathan Ho 在斯坦福大学完成了这项工作的主要内容，他作为暑期实习生加入 OpenAI 后，结合 GAN 以及 RL 等相关工作的启发，最终完成了生成对抗模仿学习及代码。

一文学会用 Tensorflow 搭建神经网络

http://www.jianshu.com/p/e112012a4b2d

cs224d-Day 6: 快速入门 Tensorflow

Tensorflow 官网

Tensorflow 是谷歌开发的深度学习系统，用它可以很快速地入门神经网络。


1. 搭建神经网络基本流程

1.训练的数据
2.定义节点准备接收数据
3.定义神经层：隐藏层和预测层
4.定义 loss 表达式
5.选择 optimizer 使 loss 达到最小

import tensorflow as tf
import numpy as np

# 添加层
# add one more layer and return the output of this layer
def add_layer(inputs, in_size, out_size, activation_function=None):
    Weights = tf.Variable(tf.random_normal([in_size, out_size]))
    biases = tf.Variable(tf.zeros([1, out_size]) + 0.1)
    Wx_plus_b = tf.matmul(inputs, Weights) + biases
    if activation_function is None:
        outputs = Wx_plus_b
    else:
        outputs = activation_function(Wx_plus_b)
    return outputs

# 1.训练的数据
# Make up some real data
x_data = np.linspace(-1,1,300)[:, np.newaxis]
noise = np.random.normal(0, 0.05, x_data.shape)
y_data = np.square(x_data) - 0.5 + noise

# 2.定义节点准备接收数据
# define placeholder for inputs to network
xs = tf.placeholder(tf.float32, [None, 1])
ys = tf.placeholder(tf.float32, [None, 1])

# 3.定义神经层：隐藏层和预测层
# add hidden layer 输入值是 xs，在隐藏层有 10 个神经元
l1 = add_layer(xs, 1, 10, activation_function=tf.nn.relu)
# add output layer 输入值是隐藏层 l1，在预测层输出 1 个结果
prediction = add_layer(l1, 10, 1, activation_function=None)

# 4.定义 loss 表达式
# the error between prediciton and real data
loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys - prediction),
reduction_indices=[1]))

# 5.选择 optimizer 使 loss 达到最小
# 这一行定义了用什么方式去减少 loss，学习率是 0.1
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# important step 对所有变量进行初始化
init = tf.initialize_all_variables()
sess = tf.Session()
# 上面定义的都没有运算，直到 sess.run 才会开始运算
sess.run(init)

# 迭代 1000 次学习，sess.run optimizer
for i in range(1000):
    # training：train_step 和 loss 都是由 placeholder 定义的运算，所以这里要用 feed 传入参数
    sess.run(train_step, feed_dict={xs: x_data, ys: y_data})
    if i % 50 == 0:
        # to see the step improvement
        print(sess.run(loss, feed_dict={xs: x_data, ys: y_data}))

2. 主要步骤的解释：

• 之前写过一篇文章 TensorFlow 入门 讲了 tensorflow 的安装，这里使用时直接导入：
import tensorflow as tf
import numpy as np
• 导入或者随机定义训练的数据 x 和 y：
x_data = np.random.rand(100).astype(np.float32)
y_data = x_data*0.1 + 0.3
• 先定义出参数 Weights，biases，拟合公式 y，误差公式 loss：
Weights = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
biases = tf.Variable(tf.zeros([1]))
y = Weights*x_data + biases
loss = tf.reduce_mean(tf.square(y-y_data))
• 选择 Gradient Descent 这个最基本的 Optimizer：
optimizer = tf.train.GradientDescentOptimizer(0.5)
• 神经网络的 key idea，就是让 loss 达到最小：
train = optimizer.minimize(loss)
• 前面是定义，在运行模型前先要初始化所有变量：
init = tf.initialize_all_variables()
• 接下来把结构激活，Session 像一个指针，指向要处理的地方：
sess = tf.Session()
• init 就被激活了，不要忘记激活：
sess.run(init)
• 训练201步：
for step in range(201):
• 要训练 train，也就是 optimizer：
sess.run(train)
• 每 20 步打印一下结果，sess.run 指向 Weights，biases 并被输出：
if step % 20 == 0:
    print(step, sess.run(Weights), sess.run(biases))

3. TensorFlow 基本概念及代码：

TensorFlow 入门 也提到了几个基本概念，这里是几个常见的用法。

• Session

matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.],[2.]])
product = tf.matmul(matrix1, matrix2) # matrix multiply np.dot(m1, m2)

sess = tf.Session()

result 要去 sess.run 那里取结果：

result = sess.run(product)
• Variable

state = tf.Variable(0, name='counter')

one = tf.constant(1)
new_value = tf.add(state, one)
update = tf.assign(state, new_value)

init = tf.initialize_all_variables() # must have if define variable
• placeholder：

import tensorflow as tf

input1 = tf.placeholder(tf.float32)
input2 = tf.placeholder(tf.float32)
output = tf.mul(input1, input2)

with tf.Session() as sess:
    print(sess.run(output, feed_dict={input1: [7.], input2: [2.]}))

4. 神经网络基本概念

• 激励函数：

x轴表示传递过来的值，y轴表示它传递出去的值：

TensorFlow 常用的 activation function

• 添加神经层：

import tensorflow as tf

Weights = tf.Variable(tf.random_normal([in_size, out_size]))
biases = tf.Variable(tf.zeros([1, out_size]) + 0.1)
Wx_plus_b = tf.matmul(inputs, Weights) + biases

if activation_function is None:
outputs = Wx_plus_b
else:
outputs = activation_function(Wx_plus_b)

return outputs
• 分类问题的 loss 函数 cross_entropy ：
# the error between prediction and real data
# loss 函数用 cross entropy
cross_entropy = tf.reduce_mean(-tf.reduce_sum(ys * tf.log(prediction),
                                              reduction_indices=[1]))  # loss
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
• overfitting：

Tensorflow 有一个很好的工具, 叫做dropout, 只需要给予它一个不被 drop 掉的百分比，就能很好地降低 overfitting。

dropout 是指在深度学习网络的训练过程中，按照一定的概率将一部分神经网络单元暂时从网络中丢弃，相当于从原始的网络中找到一个更瘦的网络，这篇博客中讲得非常详细。

def add_layer(inputs, in_size, out_size, layer_name, activation_function=None):
    # add one more layer and return the output of this layer
    Weights = tf.Variable(tf.random_normal([in_size, out_size]))
    biases = tf.Variable(tf.zeros([1, out_size]) + 0.1)
    Wx_plus_b = tf.matmul(inputs, Weights) + biases

    # here to dropout
    # 在 Wx_plus_b 上 drop 掉一定比例
    # keep_prob 保持多少不被 drop，在迭代时在 sess.run 中 feed
    Wx_plus_b = tf.nn.dropout(Wx_plus_b, keep_prob)

    if activation_function is None:
        outputs = Wx_plus_b
    else:
        outputs = activation_function(Wx_plus_b)
    tf.histogram_summary(layer_name + '/outputs', outputs)
    return outputs

5. 可视化 Tensorboard

Tensorflow 自带 tensorboard ，可以自动显示我们所建造的神经网络流程图：

import tensorflow as tf

# add one more layer and return the output of this layer
# 区别：大框架，定义层 layer，里面有小部件
def add_layer(inputs, in_size, out_size, activation_function=None):
    with tf.name_scope('layer'):
        # 区别：小部件
        with tf.name_scope('weights'):
            Weights = tf.Variable(tf.random_normal([in_size, out_size]), name='W')
        with tf.name_scope('biases'):
            biases = tf.Variable(tf.zeros([1, out_size]) + 0.1, name='b')
        with tf.name_scope('Wx_plus_b'):
            Wx_plus_b = tf.matmul(inputs, Weights) + biases
        if activation_function is None:
            outputs = Wx_plus_b
        else:
            outputs = activation_function(Wx_plus_b)
        return outputs

# define placeholder for inputs to network
# 区别：大框架，里面有 inputs x，y
with tf.name_scope('inputs'):
xs = tf.placeholder(tf.float32, [None, 1], name='x_input')
ys = tf.placeholder(tf.float32, [None, 1], name='y_input')

l1 = add_layer(xs, 1, 10, activation_function=tf.nn.relu)
prediction = add_layer(l1, 10, 1, activation_function=None)

# the error between prediciton and real data
# 区别：定义框架 loss
with tf.name_scope('loss'):
loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys - prediction),
reduction_indices=[1]))

# 区别：定义框架 train
with tf.name_scope('train'):
    train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

sess = tf.Session()

# 区别：sess.graph 把所有框架加载到一个文件中放到文件夹"logs/"里
# 接着打开terminal，进入你存放的文件夹地址上一层，运行命令 tensorboard --logdir='logs/'
# 会返回一个地址，然后用浏览器打开这个地址，在 graph 标签栏下打开
writer = tf.train.SummaryWriter("logs/", sess.graph)
# important step
sess.run(tf.initialize_all_variables())

6. 保存和加载

import tensorflow as tf
import numpy as np

## Save to file
# remember to define the same dtype and shape when restore
W = tf.Variable([[1,2,3],[3,4,5]], dtype=tf.float32, name='weights')
b = tf.Variable([[1,2,3]], dtype=tf.float32, name='biases')

init= tf.initialize_all_variables()

saver = tf.train.Saver()

# 用 saver 将所有的 variable 保存到定义的路径
with tf.Session() as sess:
    sess.run(init)
    save_path = saver.save(sess, "my_net/save_net.ckpt")
    print("Save to path: ", save_path)

################################################

# restore variables
# redefine the same shape and same type for your variables
W = tf.Variable(np.arange(6).reshape((2, 3)), dtype=tf.float32, name="weights")
b = tf.Variable(np.arange(3).reshape((1, 3)), dtype=tf.float32, name="biases")

# not need init step

saver = tf.train.Saver()
# 用 saver 从路径中将 save_net.ckpt 保存的 W 和 b restore 进来
with tf.Session() as sess:
    saver.restore(sess, "my_net/save_net.ckpt")
    print("weights:", sess.run(W))
    print("biases:", sess.run(b))

tensorflow 现在只能保存 variables，还不能保存整个神经网络的框架，所以再使用的时候，需要重新定义框架，然后把 variables 放进去学习。

TensorFlow 入门

CS224d－Day 2:

Day 2 先认识 TensorFlow，了解一下基本用法，下一次就写代码来训练模型算法，以问题为导向，以项目为驱动。

• 1. TensorFlow 是什么
• 2. 为什么需要 TensorFlow
• 3. TensorFlow 的优点
• 4. TensorFlow 的工作原理
• 5. 安装
• 6. TensorFlow 基本用法
• 要点
• 例子
• 概念
• 张量
• 会话

1. TensorFlow 是什么

Tensor(张量)意味着 N 维数组，Flow(流)意味着基于数据流图的计算，TensorFlow即为张量从图的一端流动到另一端。

4. TensorFlow 的工作原理

TensorFlow是用数据流图(data flow graphs)技术来进行数值计算的。

5. 安装

Anaconda 是一个集成许多第三方科学计算库的 Python 科学计算环境，用 conda 作为自己的包管理工具，同时具有自己的计算环境，类似 Virtualenv。

• 安装 Anaconda
我之前已经安装过 Anaconda 了，直接从下面进行：
• 建立一个 conda 计算环境
# 计算环境名字叫 tensorflow:
# Python 2.7
$ conda create -n tensorflow python=2.7
• 激活环境，使用 conda 安装 TensorFlow
$ source activate tensorflow
(tensorflow)$  # Your prompt should change
# Mac OS X, CPU only:
(tensorflow)$ pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.8.0rc0-py2-none-any.whl
• 安装成功后，每次使用 TensorFlow 的时候需要激活 conda 环境
• conda 环境激活后，你可以测试是否成功，在终端进入 python，输入下面代码，没有提示错误，说明安装 TensorFlow 成功：
$ python
...
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
>>> print(sess.run(hello))
Hello, TensorFlow!
>>> a = tf.constant(10)
>>> b = tf.constant(32)
>>> print(sess.run(a + b))
42
• 当你不用 TensorFlow 的时候，关闭环境：
(tensorflow)$ source deactivate
$  # Your prompt should change back
• 再次使用的时候再激活：
$ source activate tensorflow
(tensorflow)$  # Run Python programs that use TensorFlow.
...
(tensorflow)$ source deactivate

import tensorflow as tf
import numpy as np

# 用 NumPy 随机生成 100 个数据
x_data = np.float32(np.random.rand(2, 100))
y_data = np.dot([0.100, 0.200], x_data) + 0.300

# 构造一个线性模型
b = tf.Variable(tf.zeros([1]))
W = tf.Variable(tf.random_uniform([1, 2], -1.0, 1.0))
y = tf.matmul(W, x_data) + b

# 最小化方差
loss = tf.reduce_mean(tf.square(y - y_data))
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)

# 初始化变量
init = tf.initialize_all_variables()

# 启动图 (graph)
sess = tf.Session()
sess.run(init)

# 拟合平面
for step in xrange(0, 201):
    sess.run(train)
    if step % 20 == 0:
        print step, sess.run(W), sess.run(b)

# 输出结果为：
0 [[-0.14751725  0.75113136]] [ 0.2857058]
20 [[ 0.06342752  0.32736415]] [ 0.24482927]
40 [[ 0.10146417  0.23744738]] [ 0.27712563]
60 [[ 0.10354312  0.21220125]] [ 0.290878]
80 [[ 0.10193551  0.20427427]] [ 0.2964265]
100 [[ 0.10085492  0.201565  ]] [ 0.298612]
120 [[ 0.10035028  0.20058727]] [ 0.29946309]
140 [[ 0.10013894  0.20022322]] [ 0.29979277]
160 [[ 0.1000543   0.20008542]] [ 0.29992008]
180 [[ 0.10002106  0.20003279]] [ 0.29996923]
200 [[ 0.10000814  0.20001261]] [ 0.29998815]


• TensorFlow 用图来表示计算任务，图中的节点被称之为 operation，缩写成 op。
• 一个节点获得 0 个或者多个张量 tensor，执行计算，产生 0 个或多个张量。
• 图必须在会话（Session）里被启动，会话（Session）将图的 op 分发到 CPU 或 GPU 之类的设备上，同时提供执行 op 的方法，这些方法执行后，将产生的张量（tensor）返回。

1. 构建图

import tensorflow as tf

# 创建一个 常量 op, 返回值 'matrix1' 代表这个 1x2 矩阵.
matrix1 = tf.constant([[3., 3.]])

# 创建另外一个 常量 op, 返回值 'matrix2' 代表这个 2x1 矩阵.
matrix2 = tf.constant([[2.],[2.]])

# 创建一个矩阵乘法 matmul op , 把 'matrix1' 和 'matrix2' 作为输入.
# 返回值 'product' 代表矩阵乘法的结果.
product = tf.matmul(matrix1, matrix2)

2. 张量 Tensor

3. 在一个会话中启动图

# 启动默认图.
sess = tf.Session()

# 调用 sess 的 'run()' 方法, 传入 'product' 作为该方法的参数，
# 触发了图中三个 op (两个常量 op 和一个矩阵乘法 op)，
# 向方法表明, 我们希望取回矩阵乘法 op 的输出.
result = sess.run(product)

# 返回值 'result' 是一个 numpy ndarray 对象.
print result
# ==> [[ 12.]]

# 任务完成, 需要关闭会话以释放资源。
sess.close()

# 进入一个交互式 TensorFlow 会话.
import tensorflow as tf
sess = tf.InteractiveSession()

x = tf.Variable([1.0, 2.0])
a = tf.constant([3.0, 3.0])

# 使用初始化器 initializer op 的 run() 方法初始化 'x'
x.initializer.run()

# 增加一个减法 sub op, 从 'x' 减去 'a'. 运行减法 op, 输出结果
sub = tf.sub(x, a)
print sub.eval()
# ==> [-2. -1.]

Tensorflow 的变量必须先初始化，然后才有值！而常值张量是不需要的。

# 创建一个变量, 初始化为标量 0
state = tf.Variable(0, name="counter")

# 创建一个 op, 其作用是使 state 增加 1
one = tf.constant(1)
new_value = tf.add(state, one)
update = tf.assign(state, new_value)

# 启动图后, 变量必须先经过初始化 (init) op 初始化,
# 才真正通过Tensorflow的initialize_all_variables对这些变量赋初值
init_op = tf.initialize_all_variables()

# 启动默认图, 运行 op
with tf.Session() as sess:

    # 运行 'init' op
    sess.run(init_op)

    # 打印 'state' 的初始值
    # 取回操作的输出内容, 可以在使用 Session 对象的 run() 调用执行图时,
    # 传入一些 tensor, 这些 tensor 会帮助你取回结果.
    # 此处只取回了单个节点 state,
    # 也可以在运行一次 op 时一起取回多个 tensor:
    # result = sess.run([mul, intermed])
    print sess.run(state)

    # 运行 op, 更新 'state', 并打印 'state'
    for _ in range(3):
        sess.run(update)
        print sess.run(state)

# 输出:

# 0
# 1
# 2
# 3

Ok，总结一下，来一个清晰的代码：

import tensorflow as tf

# 建图
matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.],[2.]])

product = tf.matmul(matrix1, matrix2)

# 启动图
sess = tf.Session()

# 取值
result = sess.run(product)
print result

sess.close()

TensorFlow和普通的Numpy的对比
cs224d的课件中有下面这个代码，来看一下二者之间的区别：

eval()

In [37]: a = np.zeros((2,2))

In [39]: print(a)
[[ 0.  0.]
[ 0.  0.]]

In [38]: ta = tf.zeros((2,2))

In [40]: print(ta)
Tensor("zeros_1:0", shape=(2, 2), dtype=float32)

In [41]: print(ta.eval())
[[ 0.  0.]
[ 0. 0.]]

Day 1 宏观了解了 NLP，Day 2 搞定了工具，下次要直接先进入实战，训练模型，先从 Logistic 和 NN 开始，一边看模型一边写代码一边思考模型原理，这样理解才会更深刻！

循环神经网络（Recurrent Neural Networks）


线性回归（Linear Regression）

（一）线性回归（Linear Regression）

$\sum_{i=1}^{m}(y_{i}-x_{i}\Theta)^{2} = (Y-X\Theta)^{T}(Y-X\Theta)$

$(Y-X\Theta)^{T}(Y-X\Theta)=Y^{T}Y-2Y^{T}X\Theta + \Theta^{T}X^{T}X\Theta$

对 $\Theta$ 求导之后得到 $-2X^{T}Y + 2X^{T}X\Theta=0$，求解之后得到 $\Theta = (X^{T}X)^{-1}X^{T}Y$。因此，对于矩阵 $X$ 和列向量 $Y$ 而言，最佳的线性回归系数是

$\Theta = (X^{T}X)^{-1}X^{T}Y.$
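用 NumPy 可以直接验证这个闭式解（示意代码，数据是随机构造的）：

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 1))])  # 第一列是截距项
Y = X @ np.array([[0.3], [0.1]]) + rng.normal(0, 0.01, (100, 1))

# Theta = (X^T X)^{-1} X^T Y
Theta = np.linalg.inv(X.T @ X) @ X.T @ Y
print(Theta)  # 应接近 [0.3, 0.1]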

（二）局部加权线性回归（Locally Weighted Linear Regression）

$\sum_{i=1}^{m}w_{i}(y_{i}-x_{i}\Theta)^{2} = (Y-X\Theta)^{T}W(Y-X\Theta),$

$(Y-X\Theta)^{T}W(Y-X\Theta)=Y^{T}WY-2Y^{T}WX\Theta+\Theta^{T}X^{T}WX\Theta,$

对 $\Theta$ 求导之后得到：

$-2(Y^{T}WX)^{T} + 2 X^{T}WX\Theta = -2X^{T}WY+2X^{T}WX\Theta.$

$\Theta = (X^{T}WX)^{-1}X^{T}WY.$

$w_{i} = \exp\{-\frac{(x_{i}-x)^{2}}{2k^{2}}\}.$
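局部加权线性回归对每一个预测点 $x$ 都要重新计算权重矩阵 $W$，下面是一个示意实现（假设 X 的形状为「样本数 × 特征数」，k 是高斯核带宽）：

import numpy as np

def lwlr(x0, X, Y, k=1.0):
    # 以预测点 x0 为中心构造对角权重矩阵 W：离 x0 越近的样本权重越大
    w = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2 * k ** 2))
    W = np.diag(w)
    # Theta = (X^T W X)^{-1} X^T W Y
    Theta = np.linalg.inv(X.T @ W @ X) @ X.T @ W @ Y
    return x0 @ Theta  # 在 x0 处的预测值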

（三）岭回归（Ridge Regression）和 LASSO

$\Theta = (X^{T}X+\lambda I)^{-1}X^{T}Y.$

$\sum_{i=1}^{m}(y_{i}-x_{i}\Theta)^{2}+\lambda||\Theta||_{2}^{2}=(Y-X\Theta)^{T}(Y-X\Theta)+\lambda \Theta^{T}\Theta,$

$(Y-X\Theta)^{T}(Y-X\Theta)+\lambda \Theta^{T}\Theta = Y^{T}Y - 2\Theta^{T}X^{T}Y+\Theta^{T}X^{T}X\Theta+\lambda\Theta^{T}I\Theta.$

对 $\Theta$ 求导之后得到：

$-2X^{T}Y+2(X^{T}X+\lambda I)\Theta,$

令其等于零，即可得到上面的岭回归解。

LASSO 则把惩罚项换成 $L^{1}$ 范数，对应的目标函数是 $\sum_{i=1}^{m}(y_{i}-x_{i}\Theta)^{2}+\lambda ||\Theta||_{1}$。

（四）前向逐步线性回归（Forward Stagewise Linear Regression）

数据标准化，使其分布满足零均值和单位方差
在每一轮迭代中：
    设置当前最小误差为正无穷
    对每个特征：
        增大或者缩小：
            改变一个系数得到一个新的权重 W
            计算新 W 下的误差
            如果误差 Error 小于当前最小误差：设置 Wbest 等于当前的 W
    将 W 设置为新的 Wbest
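上述伪代码对应的一个 Python 示意实现如下（eps 是每一步增大或缩小的步长，取值只是示例）：

import numpy as np

def stage_wise(X, Y, eps=0.01, n_iter=1000):
    # 假设 X, Y 已经做过标准化
    m, n = X.shape
    W = np.zeros(n)
    W_best = W.copy()
    for _ in range(n_iter):
        lowest_error = np.inf
        for j in range(n):
            for sign in (1, -1):
                W_test = W.copy()
                W_test[j] += eps * sign                 # 增大或者缩小一个系数
                error = np.sum((Y - X @ W_test) ** 2)   # 新 W 下的误差
                if error < lowest_error:
                    lowest_error = error
                    W_best = W_test
        W = W_best.copy()                               # 将 W 设置为新的 Wbest
    return W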

如何在 Kaggle 首战中进入前 10%

Introduction

Kaggle 是目前最大的 Data Scientist 聚集地。很多公司会拿出自家的数据并提供奖金，在 Kaggle 上组织数据竞赛。我最近完成了第一次比赛，在 2125 个参赛队伍中排名第 98 位（~ 5%）。因为是第一次参赛，所以对这个成绩我已经很满意了。在 Kaggle 上，一次比赛的结果除了排名以外，还会显示 Prize Winner、10% 或是 25% 这三档，所以刚刚接触 Kaggle 的人很多都会以 25% 或是 10% 为目标。在本文中，我试图根据自己第一次比赛的经验和从其他 Kaggler 那里学到的知识，为刚刚听说 Kaggle、想要参赛的新手提供一些切实可行的冲刺 10% 的指导。

Kaggler 绝大多数都是用 Python 和 R 这两门语言的。因为我主要使用 Python，所以本文提到的例子都会根据 Python 来。不过 R 的用户应该也能不费力地了解到工具背后的思想。

• 不同比赛有不同的任务，分类、回归、推荐、排序等。比赛开始后训练集和测试集就会开放下载。
• 比赛通常持续 2 ~ 3 个月，每个队伍每天可以提交的次数有限，通常为 5 次。
• 一般情况下在提交后会立刻得到得分的反馈。不同比赛会采取不同的评分基准，可以在分数栏最上方看到使用的评分方法。
• 反馈的分数是基于测试集的一部分计算的，剩下的另一部分会被用于计算最终的结果。所以最后排名会变动。
• LB 指的就是在 Leaderboard 得到的分数。由上可知，LB 有 Public LB 和 Private LB 之分。
• 自己做的 Cross Validation 得到的分数一般称为 CV 或是 Local CV。一般来说 CV 的结果比 LB 要可靠。
• 新手可以从比赛的 Forum 和 Scripts 中找到许多有用的经验和洞见。不要吝啬提问，Kaggler 都很热情。

P.S. 本文假设读者对 Machine Learning 的基本概念和常见模型已经有一定了解。 Enjoy Reading!

General Approach

Data Exploration

Visualization

• 查看目标变量的分布。当分布不平衡时，根据评分标准和具体模型的使用不同，可能会严重影响性能。
• 对于 Numerical Variable，可以用 Box Plot 来直观地查看它的分布。
• 对于坐标类数据，可以用 Scatter Plot 来查看它们的分布趋势和是否有离群点的存在。
• 对于分类问题，将数据根据 Label 的不同着不同的颜色绘制出来，这对 Feature 的构造很有帮助。
• 绘制变量之间两两的分布和相关度图表。

Data Preprocessing

• 有时数据会分散在几个不同的文件中，需要 Join 起来。
• 处理 Missing Data
• 处理 Outlier
• 必要时转换某些 Categorical Variable 的表示方式。
• 有些 Float 变量可能是从未知的 Int 变量转换得到的，这个过程中发生精度损失会在数据中产生不必要的 Noise，即两个数值原本是相同的却在小数点后某一位开始有不同。这对 Model 可能会产生很负面的影响，需要设法去除或者减弱 Noise。

Feature Engineering

Feature Selection

• Feature 越少，训练越快。
• 有些 Feature 之间可能存在线性关系，影响 Model 的性能。
• 通过挑选出最重要的 Feature，可以将它们之间进行各种运算和操作的结果作为新的 Feature，可能带来意外的提高。

Feature Selection 最实用的方法也就是看 Random Forest 训练完以后得到的 Feature Importance 了。其他有一些更复杂的算法在理论上更加 Robust，但是缺乏实用高效的实现，比如这个。从原理上来讲，增加 Random Forest 中树的数量可以在一定程度上加强其对于 Noisy Data 的 Robustness。

Model Selection

• Random Forest
• Extra Randomized Trees

• SVM
• Linear Regression
• Logistic Regression
• Neural Networks

Model Training

• eta：每次迭代完成后更新权重时的步长。越小训练越慢。
• num_round：总共迭代的次数。
• subsample：训练每棵树时用来训练的数据占全部的比例。用于防止 Overfitting。
• colsample_bytree：训练每棵树时用来训练的特征的比例，类似 RandomForestClassifier 的 max_features。
• max_depth：每棵树的最大深度限制。与 Random Forest 不同，Gradient Boosting 如果不对深度加以限制，最终是会 Overfit 的。
• early_stopping_rounds：用于控制在 Out Of Sample 的验证集上连续多少个迭代的分数都没有提高后就提前终止训练。用于防止 Overfitting。

1. 将训练数据的一部分划出来作为验证集。
2. 先将 eta 设得比较高（比如 0.1），num_round 设为 300 ~ 500。
3. 用 Grid Search 对其他参数进行搜索
4. 逐步将 eta 降低，找到最佳值。
5. 以验证集为 watchlist，用找到的最佳参数组合重新在训练集上训练。注意观察算法的输出，看每次迭代后在验证集上分数的变化情况，从而得到最佳的 early_stopping_rounds
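上述调参流程用 xgboost 的原生接口大致可以写成下面的示意代码（数据是随意构造的，各参数取值也只是示例）：

import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(1000, 10)
y = 2 * X[:, 0] + rng.normal(0, 0.1, 1000)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

params = {"eta": 0.1, "max_depth": 6, "subsample": 0.8, "colsample_bytree": 0.8}
# 以验证集为 watchlist，观察每轮迭代在验证集上分数的变化；
# early_stopping_rounds 控制验证集分数连续多少轮没有提高就提前终止
model = xgb.train(params, dtrain, num_boost_round=500,
                  evals=[(dval, "val")], early_stopping_rounds=50)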

Cross Validation

Cross Validation 是非常重要的一个环节。它让你知道你的 Model 有没有 Overfit，是不是真的能够 Generalize 到测试集上。在很多比赛中 Public LB 都会因为这样那样的原因而不可靠。当你改进了 Feature 或是 Model 得到了一个更高的 CV 结果，提交之后得到的 LB 结果却变差了，一般认为这时应该相信 CV 的结果。当然，最理想的情况是多种不同的 CV 方法得到的结果和 LB 同时提高，但这样的比赛并不是太多。

Ensemble Generation

Ensemble Learning 是指将多个不同的 Base Model 组合成一个 Ensemble Model 的方法。它可以同时降低最终模型的 Bias 和 Variance（证明可以参考这篇论文，我最近在研究类似的理论，可能之后会写新文章详述），从而在提高分数的同时又降低 Overfitting 的风险。在现在的 Kaggle 比赛中，不用 Ensemble 就拿到奖金几乎是不可能的。

• Bagging：使用训练数据的不同随机子集来训练每个 Base Model，最后进行每个 Base Model 权重相同的 Vote。也即 Random Forest 的原理。
• Boosting：迭代地训练 Base Model，每次根据上一个迭代中预测错误的情况修改训练样本的权重。也即 Gradient Boosting 的原理。比 Bagging 效果好，但更容易 Overfit。
• Blending：用不相交的数据训练不同的 Base Model，将它们的输出取（加权）平均。实现简单，但对训练数据利用少了。
• Stacking：接下来会详细介绍。

• Base Model 之间的相关性要尽可能的小。这就是为什么非 Tree-based Model 往往表现不是最好但还是要将它们包括在 Ensemble 里面的原因。Ensemble 的 Diversity 越大，最终 Model 的 Bias 就越低。
• Base Model 之间的性能表现不能差距太大。这其实是一个 Trade-off，在实际中很有可能表现相近的 Model 只有寥寥几个而且它们之间相关性还不低。但是实践告诉我们即使在这种情况下 Ensemble 还是能大幅提高成绩。

*Pipeline

• 模块化 Feature Transform，只需写很少的代码就能将新的 Feature 更新到训练集中。
• 自动化 Grid Search，只要预先设定好使用的 Model 和参数的候选，就能自动搜索并记录最佳的 Model。
• 自动化 Ensemble Generation，每隔一段时间将现有最好的 K 个 Model 拿来做 Ensemble。

Crowdflower Search Results Relevance 的第一名获得者 Chenglong Chen 将他在比赛中使用的 Pipeline 公开了，非常具有参考和借鉴意义。只不过看懂他的代码并将其中的逻辑抽离出来搭建这样一个框架，还是比较困难的一件事。可能在参加过几次比赛以后专门抽时间出来做会比较好。

Home Depot Search Relevance

EDA

• 同一个搜索词/产品都出现了多次，数据分布显然不 i.i.d.
• 文本之间的相似度很有用。
• 产品中有相当大一部分缺失属性，要考虑这会不会使得从属性中得到的 Feature 反而难以利用。
• 产品的 ID 对预测相关度很有帮助，但是考虑到训练集和测试集之间的重叠度并不太高，利用它会不会导致 Overfitting？

Preprocessing

1. 利用 Forum 上的 Typo Dictionary 修正搜索词中的错误。
2. 统计属性的出现次数，将其中出现次数多又容易利用的记录下来。
3. 将训练集和测试集合并，并与产品描述和属性 Join 起来。这是考虑到后面有一系列操作，如果不合并的话就要重复写两次了。
4. 对所有文本做 Stemming 和 Tokenizing，同时手工做了一部分格式统一化（比如涉及到数字和单位的）和同义词替换。

Feature

• *Attribute Features
• 是否包含某个特定的属性（品牌、尺寸、颜色、重量、内用/外用、是否有能源之星认证等）
• 这个特定的属性是否匹配
• Meta Features
• 各个文本域的长度
• 是否包含属性域
• 品牌（将所有的品牌做数值离散化）
• 产品 ID
• 简单匹配
• 搜索词是否在产品标题、产品介绍或是产品属性中出现
• 搜索词在产品标题、产品介绍或是产品属性中出现的数量和比例
• *搜索词中的第 i 个词是否在产品标题、产品介绍或是产品属性中出现
• 搜索词和产品标题、产品介绍以及产品属性之间的文本相似度
• Latent Semantic Indexing：通过将 BOW/TF-IDF Vectorization 得到的矩阵进行 SVD 分解，我们可以得到不同搜索词/产品组合的 Latent 标识。这个 Feature 使得 Model 能够在一定程度上对不同的组合做出区别，从而解决某些产品缺失某些 Feature 的问题。

Ensemble

• RandomForestRegressor
• ExtraTreesRegressor
• GradientBoostingRegressor
• XGBRegressor

Lessons Learned

• 产品标题的组织方式是有 Pattern 的，比如一个产品是否带有某附件，一定会用 With/Without XXX 的格式放在标题最后。
• 使用外部数据，比如 WordNet、Reddit 评论数据集等来训练同义词和上位词（在一定程度上替代 Word2Vec）词典。
• 基于字母而不是单词的 NLP Feature。这一点起初让我十分费解，但请教以后发现非常有道理。举例来说，排名第三的队伍在计算匹配度时，将搜索词和内容中相匹配的单词的长度也考虑进去了。这是因为他们发现越长的单词越具体，所以越容易被用户认为相关度高。此外他们还使用了逐字符的序列比较（difflib.SequenceMatcher），因为这个相似度能够衡量视觉上的相似度。像这样的 Feature 的确不是每个人都能想到的。
• 标注单词的词性，找出中心词，计算基于中心词的各种匹配度和距离。这一点我想到了，但没有时间尝试。
• 将产品标题/介绍中 TF-IDF 最高的一些 Trigram 拿出来，计算搜索词中出现在这些 Trigram 中的比例；反过来以搜索词为基底也做一遍。这相当于是从另一个角度抽取了一些 Latent 标识。
• 一些新颖的距离尺度，比如 Word Movers Distance
• 除了 SVD 以外还可以用上 NMF
• 最重要的 Feature 之间的 Pairwise Polynomial Interaction
• 针对数据不 i.i.d. 的问题，在 CV 时手动构造测试集与验证集之间产品 ID 不重叠和重叠的两种不同分割，并以与实际训练集/测试集的分割相同的比例来做 CV 以逼近 LB 的得分分布

Summary

Takeaways

1. 比较早的时候就开始做 Ensemble 是对的，这次比赛到倒数第三天我还在纠结 Feature。
2. 很有必要搭建一个 Pipeline，至少要能够自动训练并记录最佳参数。
3. Feature 为王。我花在 Feature 上的时间还是太少。
4. 可能的话，多花点时间去手动查看原始数据中的 Pattern。

Issues Raised

1. 在数据分布并不 i.i.d. 甚至有 Dependency 时如何做靠谱的 CV
2. 如何量化 Ensemble 中 Diversity vs. Accuracy 的 Trade-off。
3. 如何处理 Feature 之间互相影响导致性能反而下降。

Beginner Tips

1. 选择一个感兴趣的比赛。如果你对相关领域原本就有一些洞见那就更理想了。
2. 根据我描述的方法开始探索、理解数据并进行建模。
3. 通过 Forum 和 Scripts 学习其他人对数据的理解和构建 Feature 的方式。
4. 如果之前有过类似的比赛，可以去找当时获奖者的 Interview 和 Blog Post 作为参考，往往很有用。
5. 在得到一个比较不错的 LB 分数（比如已经接近前 10%）以后可以开始尝试做 Ensemble。
6. 如果觉得自己有希望拿到奖金，开始找人组队吧！
7. 到比赛结束为止要绷紧一口气不能断，尽量每天做一些新尝试。
8. 比赛结束后学习排名靠前的队伍的方法，思考自己这次比赛中的不足和发现的问题，可能的话再花点时间将学到的新东西用实验进行确认，为下一次比赛做准备
9. 好好休息！

Introduction

Kaggle is the best place for learning from other data scientists. Many companies provide data and prize money to set up data science competitions on Kaggle. Recently I had my first shot on Kaggle and ranked 98th (~ 5%) among 2125 teams. Since this is my Kaggle debut, I feel quite satisfied. Because many Kaggle beginners set 10% as their first goal, here I want to share my experience in achieving that goal.

This post is also available in Chinese.

Most Kagglers use Python and R. I prefer Python, but R users should have no difficulty in understanding the ideas behind tools and languages.

First let’s go through some facts about Kaggle competitions in case you are not very familiar with them.

• Different competitions have different tasks: classification, regression, recommendation, ordering… Training set and testing set will be open for download after the competition launches.
• A competition typically lasts for 2 ~ 3 months. Each team can submit a limited number of times per day. Usually it's 5 times a day.
• There will be a deadline one week before the end of the competition, after which you cannot merge teams or enter the competition. Therefore be sure to have at least one valid submission before that.
• You will get your score immediately after the submission. Different competitions use different scoring metrics, which are explained by the question mark on the leaderboard.
• The score you get is calculated on a subset of testing set, which is commonly referred to as a Public LB score. Whereas the final result will use the remaining data in the testing set, which is referred to as Private LB score.
• The score you get by local cross validation is commonly referred to as a CV score. Generally speaking, CV scores are more reliable than LB scores.
• Beginners can learn a lot from Forum and Scripts. Do not hesitate to ask, Kagglers are very kind and helpful.

I assume that readers are familiar with basic concepts and models of machine learning. Enjoy reading!

General Approach

In this section, I will walk you through the whole process of a Kaggle competition.

Data Exploration

What we do at this stage is called EDA (Exploratory Data Analysis), which means analytically exploring data in order to provide some insights for subsequent processing and modeling.

Usually we would load the data using Pandas and make some visualizations to understand the data.

Visualization

For plotting, Matplotlib and Seaborn should suffice.

Some common practices:

• Inspect the distribution of target variable. Depending on what scoring metric is used, an imbalanced distribution of target variable might harm the model’s performance.
• For numerical variables, use box plot to inspect their distributions.
• For coordinates-like data, use scatter plot to inspect the distribution and check for outliers.
• For classification tasks, plot the data with points colored according to their labels. This can help with feature engineering.
• Make pairwise distribution plots and examine correlations between pairs of variables.

Be sure to read this very inspiring tutorial of exploratory visualization before you go on.

Statistical Tests

We can perform some statistical tests to confirm our hypotheses. Sometimes we can get enough intuition from visualization, but quantitative results are always good to have. Note that we will always encounter non-i.i.d. data in real world. So we have to be careful about the choice of tests and how we interpret the findings.

In many competitions public LB scores are not very consistent with local CV scores due to noise or non-i.i.d. distribution. You can use test results to roughly set a threshold for determining whether an increase of score is a genuine one or due to randomness.

Data Preprocessing

In most cases, we need to preprocess the dataset before constructing features. Some common steps are:

• Sometimes several files are provided and we need to join them.
• Deal with missing data.
• Deal with outliers.
• Encode categorical variables if necessary.
• Deal with noise. For example you may have some floats derived from unknown integers. The loss of precision during floating-point operations can bring much noise into the data: two seemingly different values might be the same before conversion. Sometimes noise harms model and we would want to avoid that.

How we choose to perform preprocessing largely depends on what we learn about the data in the previous stage. In practice, I recommend using iPython Notebook for data manipulating and mastering usages of frequently used Pandas operations. The advantage is that you get to see the results immediately and are able to modify or rerun operations. Also this makes it very convenient to share your approaches with others. After all reproducible results are very important in data science.

Let’s have some examples.

Outlier

The plot shows some scaled coordinates data. We can see that there are some outliers in the top-right corner. Exclude them and the distribution looks good.

Dummy Variables

For categorical variables, a common practice is One-hot Encoding. For a categorical variable with n possible values, we create a group of n dummy variables. Suppose a record in the data takes one value for this variable, then the corresponding dummy variable is set to 1 while other dummies in the same group are all set to 0.

Like this, we transform DayOfWeek into 7 dummy variables.
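For example, with Pandas this transform is a one-liner (a minimal sketch on toy data; the column name DayOfWeek follows the example above, and one dummy column is created per distinct value):

import pandas as pd

df = pd.DataFrame({"DayOfWeek": [1, 3, 7, 3]})           # toy data
dummies = pd.get_dummies(df["DayOfWeek"], prefix="Day")
df = pd.concat([df.drop("DayOfWeek", axis=1), dummies], axis=1)
print(df)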

Note that when the categorical variable can take many values (hundreds or more), this might not work well. It's difficult to find a general solution to that, but I'll discuss one scenario in the next section.

Feature Engineering

Some describe the essence of Kaggle competitions as feature engineering supplemented by model tuning and ensemble learning. Yes, that makes a lot of sense. Feature engineering gets you very far. Yet it is how well you know about the domain of the given data that decides how far you can go. For example, in a competition where data mainly consists of text, common NLP features are a must. The approach of constructing useful features is something we all have to continuously learn in order to do better.

Basically, when you feel that a variable is intuitively useful for the task, you can include it as a feature. But how do you know it actually works? The simplest way is to check by plotting it against the target variable like this:

Feature Selection

Generally speaking, we should try to craft as many features as we can and have faith in the model’s ability to pick up the most significant features. Yet there’s still something to gain from feature selection beforehand:

• Less features mean faster training
• Some features are linearly related to others. This might put a strain on the model.
• By picking up the most important features, we can use interactions between them as new features. Sometimes this gives surprising improvement.

The simplest way to inspect feature importance is by fitting a random forest model. There exist more robust feature selection algorithms (e.g. this) which are theoretically superior but not practicable due to the absence of efficient implementation. You can combat noisy data (to an extent) simply by increasing number of trees used in random forest.

This is important for competitions in which data is anonymized because you won’t waste time trying to figure out the meaning of a variable that’s of no significance.

Feature Encoding

Sometimes raw features have to be converted to some other formats for them to work properly.

For example, suppose we have a categorical variable which can take more than 10K different values. Then naively creating dummy variables is not a feasible option. An acceptable solution is to create dummy variables for only a subset of the values (e.g. values that constitute 95% of the feature importance) and assign everything else to an ‘others’ class.

Model Selection

With the features set, we can start training models. Kaggle competitions usually favor tree-based models:

• Random Forest
• Extra Randomized Trees

These models are slightly worse in terms of performance, but are suitable as base models in ensemble learning (will be discussed later):

• SVM
• Linear Regression
• Logistic Regression
• Neural Networks

Of course, neural networks are very important in image-related competitions.

All these models can be accessed using Sklearn.

Here I want to emphasize the greatness of Xgboost. The outstanding performance of gradient boosted trees and Xgboost’s efficient implementation makes it very popular in Kaggle competitions. Nowadays almost every winner uses Xgboost in one way or another.

BTW, installing Xgboost on Windows could be a painstaking process. You can refer to this post by me if you run into problems.

Model Training

We can obtain a good model by tuning its parameters. A model usually has many parameters, but only a few of them are important to its performance. For example, the most important parameters for random forest are the number of trees in the forest and the maximum number of features used in developing each tree. We need to understand how models work and what impact each of the parameters has on the model's performance, be it accuracy, robustness or speed.

Normally we would find the best set of parameters by a process called grid search. Actually what it does is simply iterating through all the possible combinations and find the best one.

By the way, random forest usually reaches its optimum when max_features is set to the square root of the total number of features.

Here I’d like to stress some points about tuning XGB. These parameters are generally considered to have real impacts on its performance:

• eta: Step size used in updating weights. Lower eta means slower training.
• num_round: Total round of iterations.
• subsample: The ratio of training data used in each iteration. This is to combat overfitting.
• colsample_bytree: The ratio of features used in each iteration. This is like max_features of RandomForestClassifier.
• max_depth: The maximum depth of each tree. Unlike random forest, gradient boosting would eventually overfit if we do not limit its depth.
• early_stopping_rounds: Controls how many iterations without an increase of score on the validation set are needed for the algorithm to stop early. This is to combat overfitting, too.

Usual tuning steps:

1. Reserve a portion of training set as the validation set.
2. Set eta to a relatively high value (e.g. 0.1), num_round to 300 ~ 500.
3. Use grid search to find best combination of other parameters.
4. Gradually lower eta to find the optimum.
5. Use the validation set as watch_list to re-train the model with the best parameters. Observe how score changes on validation set in each iteration. Find the optimal value for early_stopping_rounds.

Finally, note that models with randomness all have a parameter like seed or random_state to control the random seed. You must record this with all other parameters when you get a good model. Otherwise you wouldn’t be able to reproduce it.

Cross Validation

Cross validation is an essential step. It tells us whether our model is at high risk of overfitting. In many competitions, public LB scores are not very reliable. Often when we improve the model and get a better local CV score, the LB score becomes worse. It is widely believed that we should trust our CV scores under such situation. Ideally we would want CV scores obtained by different approaches to improve in sync with each other and with the LB score, but this is not always possible.

Usually 5-fold CV is good enough. If we use more folds, the CV score would become more reliable, but the training takes longer to finish as well.

How to do CV properly is not a trivial problem. It requires constant experiment and case-by-case discussion. Many Kagglers share their CV approaches (like this one) after competitions where it’s not easy to do reliable CV.

Ensemble Generation

Ensemble Learning refers to ways of combining different models. It reduces both bias and variance of the final model (you can find a proof here), thus increasing the score and reducing the risk of overfitting. Recently it became virtually impossible to win a prize without using ensemble in Kaggle competitions.

Common approaches of ensemble learning are:

• Bagging: Use different random subsets of training data to train each base model. Then base models vote to generate the final predictions. This is how random forest works.
• Boosting: Train base models iteratively, modify the weights of training samples according to the last iteration. This is how gradient boosted trees work. It performs better than bagging but is more prone to overfitting.
• Blending: Use non-overlapping data to train different base models and take a weighted average of them to obtain the final predictions. This is easy to implement but uses less data.
• Stacking: To be discussed next.

In theory, for the ensemble to perform well, two elements matter:

• Base models should be as unrelated as possible. This is why we tend to include non-tree-based models in the ensemble even though they don't perform as well. The math says that the greater the diversity, the lower the bias in the final ensemble.
• Performance of base models shouldn't differ too much.

Actually we have a trade-off here. In practice we may end up with highly related models of comparable performances. Yet we ensemble them anyway because it usually increases performance even under this circumstance.

Stacking

Compared with blending, stacking makes better use of training data. Here’s a diagram of how it works:

(Taken from Faron. Many thanks!)

It’s much like cross validation. Take 5-fold stacking as an example. First we split the training data into 5 folds. Next we will do 5 iterations. In each iteration, train every base model on 4 folds and predict on the hold-out fold. You have to keep the predictions on the testing data as well. This way, in each iteration every base model will make predictions on 1 fold of the training data and all of the testing data. After 5 iterations we will obtain a matrix of shape #(rows in training data) X #(base models). This matrix is then fed to the stacker in the second level. After the stacker is fitted, use the predictions on testing data by base models (each base model is trained 5 times, therefore we have to take an average to obtain a matrix of the same shape) as the input for the stacker and obtain our final predictions.

Maybe it’s better to just show the codes:
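Below is a minimal sketch of 5-fold stacking along the lines described above (assuming numpy arrays and sklearn-style regressors; the names base_models and stacker are illustrative):

import numpy as np
from sklearn.model_selection import KFold

def stack(base_models, stacker, X_train, y_train, X_test, n_folds=5):
    # S_train: out-of-fold predictions of each base model on the training data
    # S_test: per-model predictions on the testing data, averaged over the folds
    S_train = np.zeros((X_train.shape[0], len(base_models)))
    S_test = np.zeros((X_test.shape[0], len(base_models)))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for i, model in enumerate(base_models):
        S_test_i = np.zeros((X_test.shape[0], n_folds))
        for j, (train_idx, hold_idx) in enumerate(kf.split(X_train)):
            model.fit(X_train[train_idx], y_train[train_idx])
            # predict on the hold-out fold and on the full testing data
            S_train[hold_idx, i] = model.predict(X_train[hold_idx])
            S_test_i[:, j] = model.predict(X_test)
        S_test[:, i] = S_test_i.mean(axis=1)
    # the stacker (second-level model) is trained on the out-of-fold predictions
    stacker.fit(S_train, y_train)
    return stacker.predict(S_test)

A list of sklearn-style regressors (for instance the four base models listed in the Ensemble section below) plus a second-level regressor can be plugged in directly.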

Prize winners usually have larger and much more complicated ensembles. For beginner, implementing a correct 5-fold stacking is good enough.

*Pipeline

We can see that the workflow for a Kaggle competition is quite complex, especially for model selection and ensemble. Ideally, we need a highly automated pipeline capable of:

• Modularized feature transform. We only need to write a few lines of codes and the new feature is added to the training set.
• Automated grid search. We only need to set up models and parameter grid, the search will be run and best parameters are recorded.
• Automated ensemble generation. Use best K models for ensemble as soon as last generation is done.

For beginners, the first one is not very important because the number of features is quite manageable; the third one is not important either because typically we only do several ensembles at the end of the competition. But the second one is good to have because manually recording the performance and parameters of each model is time-consuming and error-prone.

Chenglong Chen, the winner of Crowdflower Search Results Relevance, once released his pipeline on GitHub. It’s very complete and efficient. Yet it’s still very hard to understand and extract all his logic to build a general framework. This is something you might want to do when you have plenty of time.

Home Depot Search Relevance

In this section I will share my solution in Home Depot Search Relevance and what I learned from top teams after the competition.

The task in this competition is to predict how relevant a result is for a search term on Home Depot website. The relevance score is an average of three human evaluators and ranges between 1 ~ 3. Therefore it's a regression task. The dataset contains search terms, product titles / descriptions and some attributes like brand, size and color. The metric is RMSE.

This is much like Crowdflower Search Results Relevance. The difference is that Quadratic Weighted Kappa was used in that competition, which complicated the final cutoff of regression scores. Also there were no attributes provided in that competition.

EDA

There were several quite good EDAs by the time I joined the competition, especially this one. I learned that:

• Many search terms / products appeared several times.
• Text similarities are great features.
• Many products don’t have attributes features. Would this be a problem?
• Product ID seems to have strong predictive power. However the overlap of product ID between the training set and the testing set is not very high. Would this contribute to overfitting?

Preprocessing

You can find how I did preprocessing and feature engineering on GitHub. I’ll only give a brief summary here:

1. Use typo dictionary posted in forum to correct typos in search terms.
2. Count attributes. Find those frequent and easily exploited ones.
3. Join the training set with the testing set. This is important because otherwise you’ll have to do feature transform twice.
4. Do stemming and tokenizing for all the text fields. Some normalization(with digits and units) and synonym substitutions are performed manually.

Feature

• *Attribute Features
• Whether the product contains a certain attribute (brand, size, color, weight, indoor/outdoor, energy star certified …)
• Whether a certain attribute matches with search term
• Meta Features
• Length of each text field
• Whether the product contains attribute fields
• Brand (encoded as integers)
• Product ID
• Matching
• Whether search term appears in product title / description / attributes
• Count and ratio of search term’s appearance in product title / description / attributes
• *Whether the i-th word of search term appears in product title / description / attributes
• Text similarities between search term and product title/description/attributes
• Latent Semantic Indexing: By performing SVD decomposition to the matrix obtained from BOW/TF-IDF Vectorization, we get the latent descriptions of different search term / product groups. This enables our model to distinguish between groups and assign different weights to features, therefore solving the issue of dependent data and products lacking some features (to an extent).

Note that features listed above with * are the last batch of features I added. The problem is that the model trained on data that included these features performed worse than the previous ones. At first I thought that the increase in number of features would require re-tuning of model parameters. However, after wasting much CPU time on grid search, I still could not beat the old model. I think it might be the issue of feature correlation mentioned above. I actually knew a solution that might work, which is to combine models trained on different versions of features by stacking. Unfortunately I didn't have enough time to try it. As a matter of fact, most of the top teams regard the ensemble of models trained with different preprocessing and feature engineering pipelines as a key to success.

Model

At first I was using RandomForestRegressor to build my model. Then I tried Xgboost and it turned out to be more than twice as fast as Sklearn. From then on, what I did every day was basically running grid search on my PC while working on features on my laptop.

Dataset in this competition is not trivial to validate. It’s not i.i.d. and many records are dependent. Many times I used better features / parameters only to end with worse LB scores. As repeatedly stated by many accomplished Kagglers, you have to trust your own CV score under such situation. Therefore I decided to use 10-fold instead of 5-fold in cross validation and ignore the LB score in the following attempts.

Ensemble

My final model is an ensemble consisting of 4 base models:

• RandomForestRegressor
• ExtraTreesRegressor
• GradientBoostingRegressor
• XGBRegressor

The stacker (L2 model) is also a XGBRegressor.

The problem is that all my base models are highly correlated (with a lowest correlation of 0.9). I thought of including linear regression, SVM regression and XGBRegressor with linear booster into the ensemble, but these models had RMSE scores that are 0.02 higher (this accounts for a gap of hundreds of places on the leaderboard) than the 4 models I finally used. Therefore I decided not to use more models although they would have brought much more diversity.

The good news is that, despite base models being highly correlated, stacking really bumps up my score. What’s more, my CV score and LB score are in complete sync after I started stacking.

During the last two days of the competition, I did one more thing: use 20 or so different random seeds to generate the ensemble and take a weighted average of them as the final submission. This is actually a kind of bagging. It makes sense in theory because in stacking I used 80% of the data to train base models in each iteration, whereas 100% of the data is used to train the stacker. Therefore it’s less clean. Making multiple runs with different seeds makes sure that different 80% of the data are used each time, thus reducing the risk of information leak. Yet by doing this I only achieved an increase of 0.0004, which might be just due to randomness.

After the competition, I found out that my best single model scores 0.46378 on the private leaderboard, whereas my best stacking ensemble scores 0.45849. That was the difference between the 174th place and the 98th place. In other words, feature engineering and model tuning got me into 10%, whereas stacking got me into 5%.

Lessons Learned

There’s much to learn from the solutions shared by top teams:

• There's a pattern in the product title. For example, whether a product is accompanied by a certain accessory will be indicated by With/Without XXX at the end of the title.
• Use external data. For example use WordNet or Reddit Comments Dataset to train synonyms and hypernyms.
• Some features based on letters instead of words. At first I was rather confused by this. But it makes perfect sense if you consider it. For example, the team that won the 3rd place took the number of letters matched into consideration when computing text similarity. They argued that longer words are more specific and thus more likely to be assigned high relevance scores by human. They also used char-by-char sequence comparison (difflib.SequenceMatcher) to measure visual similarity, which they claimed to be important for human.
• POS-tag words and find anchor words. Use anchor words for computing various distances.
• Extract top-ranking trigrams from the TF-IDF of product title / description field and compute the ratio of word from search terms that appear in these trigrams. Vice versa. This is like computing latent indexes from another point of view.
• Some novel distance metrics like Word Movers Distance
• Apart from SVD, some used NMF.
• Generate pairwise polynomial interactions between top-ranking features.
• For CV, construct splits in which product IDs do not overlap between training set and testing set, and splits in which IDs do. Then we can use these with corresponding ratio to approximate the impact of public/private LB split in our local CV.

Summary

Takeaways

1. It was a good call to start doing ensembles early in the competition. As it turned out, I was still playing with features during the very last days.
2. It’s of high priority that I build a pipeline capable of automatic model training and recording best parameters.
3. Features matter the most! I didn’t spend enough time on features in this competition.
4. If possible, spend some time to manually inspect raw data for patterns.

Issues Raised

Several issues I encountered in this competition are of high research value.

1. How to do reliable CV with dependent data.
2. How to quantify the trade-off between diversity and accuracy in ensemble learning.
3. How to deal with feature interactions which harm the model’s performance, and how to determine whether new features are effective in such situations.

Beginner Tips

1. Choose a competition you’re interested in. It’s even better if you already have some insight into the problem domain.
2. Following my approach or somebody else’s, start exploring, understanding and modeling data.
3. Learn from the forum and scripts. See how others interpret data and construct features.
4. Find winner interviews / blog posts from previous competitions. They’re very helpful.
5. Start doing ensemble after you have reached a pretty good score (e.g. ~ 10%) or you feel that there isn’t much room for new features (which, sadly, always turns out to be false).
6. If you think you may have a chance to win the prize, try teaming up!
7. Don’t give up until the end of the competition. At least try something new every day.
8. Learn from the sharings of top teams after the competition. Reflect on your approaches. If possible, spend some time verifying what you learn.
9. Get some rest!

Maximum Likelihood Estimation

(I) Basic Idea

Suppose $x_{1},\dots,x_{n}$ are samples drawn from a distribution $D$ with unknown parameter $\theta$, so that their joint density is

$P(x_{1},...,x_{n}) = f_{D}(x_{1},...,x_{n}|\theta)$.

Viewed as a function of $\theta$ with the samples held fixed, this density is called the likelihood function,

$L(\theta) = f_{D}(x_{1},...,x_{n}|\theta)$,

and the maximum likelihood estimate is the value of $\theta$ that maximizes it.

Remark. The maximizer of the likelihood function is not necessarily unique; it does not necessarily exist either.

(II) Basic Algorithm

(1) Define the likelihood function;

(2) Differentiate the likelihood function, or more conveniently its logarithm, since taking the logarithm makes the first derivative easier to compute;

(3) Set the first derivative to zero to obtain the likelihood equation for the parameter $\theta$;

(4) Solve the likelihood equation; the solution is the maximum likelihood estimate. If the equation cannot be solved in closed form, Newton’s method can be used to approximate the solution.
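When step (4) has no closed-form solution, the maximization can also be done numerically. The following is a minimal sketch (an illustration, not part of the original notes) that minimizes the negative log-likelihood of a Gaussian sample with scipy; the closed-form answer derived in example (ii) below serves as a sanity check.

import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)

# Steps (1)-(2): define the negative log-likelihood (additive constants dropped).
def neg_log_likelihood(theta):
    mu, log_sigma = theta              # optimize log(sigma) so that sigma stays positive
    sigma = np.exp(log_sigma)
    return np.sum(np.log(sigma) + (x - mu) ** 2 / (2 * sigma ** 2))

# Steps (3)-(4): instead of solving the likelihood equation by hand,
# minimize the negative log-likelihood numerically.
res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
# mu_hat ~ x.mean() and sigma_hat ~ x.std(), matching the closed-form MLE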

(III) Examples

(i) Bernoulli Distribution

Suppose $x_{1},\dots,x_{n}$ are i.i.d. Bernoulli samples with parameter $p$, i.e.

$f(x_{i};p)=p^{x_{i}}(1-p)^{1-x_{i}}, \text{ for } x_{i}=0 \text{ or } 1 \text{ and } 0<p<1.$

The likelihood function is

$L(p)=\prod_{i=1}^{n}f(x_{i};p)=p^{\sum_{i=1}^{n}x_{i}}(1-p)^{n-\sum_{i=1}^{n}x_{i}}$

with log-likelihood

$\ln(L(p))=(\sum_{i=1}^{n}x_{i})\ln(p) + (n-\sum_{i=1}^{n}x_{i})\ln(1-p).$

Differentiating with respect to $p$ gives

$\frac{\partial\ln(L(p))}{\partial p} = \frac{\sum_{i=1}^{n}x_{i}}{p}-\frac{n-\sum_{i=1}^{n}x_{i}}{1-p}.$

Setting $\frac{\partial\ln(L(p))}{\partial p} = 0$ yields $p=\sum_{i=1}^{n}x_{i}/n$, so the maximum likelihood estimate is

$\hat{p}=\sum_{i=1}^{n}x_{i}/n$.

(ii) Gaussian Distribution

The density of a Gaussian sample with mean $\mu$ and variance $\sigma^{2}$ is

$f(x_{i};\mu,\sigma^{2}) = \frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{(x_{i}-\mu)^{2}}{2\sigma^{2}}\Big)$.

The likelihood function is

$L(\mu,\sigma) = \prod_{i=1}^{n} f(x_{i};\mu,\sigma^{2}) = (2\pi)^{-n/2}\sigma^{-n}\exp\Big(-\frac{\sum_{i=1}^{n}(x_{i}-\mu)^{2}}{2\sigma^{2}}\Big)$

with log-likelihood

$\ln(L(\mu,\sigma)) = -\frac{n}{2}\ln(2\pi) - n\ln(\sigma) - \frac{\sum_{i=1}^{n}(x_{i}-\mu)^{2}}{2\sigma^{2}}.$

The partial derivatives are

$\frac{\partial \ln(L(\mu,\sigma))}{\partial \mu} = \frac{\sum_{i=1}^{n}(x_{i}-\mu)}{\sigma^{2}},$

$\frac{\partial \ln(L(\mu,\sigma))}{\partial \sigma} = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n}(x_{i}-\mu)^{2}}{\sigma^{3}}.$

Setting $\frac{\partial \ln(L(\mu,\sigma))}{\partial \mu} =0$ and $\frac{\partial \ln(L(\mu,\sigma))}{\partial \sigma}=0$ and solving the resulting system gives

$\hat{\mu}= \sum_{i=1}^{n}x_{i}/n$,

$\hat{\sigma}^{2} = \sum_{i=1}^{n}(x_{i}-\hat{\mu})^{2}/n$.

(iii) Weibull Distribution

The density of the Weibull distribution is

$f(x;\lambda,k) = \frac{k}{\lambda}\Big(\frac{x}{\lambda}\Big)^{k-1}e^{-(x/\lambda)^{k}} \text{ for } x\geq 0, \quad f(x;\lambda,k)=0 \text{ for } x<0.$

Its cumulative distribution function is

$F(x;k,\lambda) = 1- e^{-(x/\lambda)^{k}} \text{ for } x\geq 0$,

$F(x;k,\lambda) = 0 \text{ for } x<0$.

Its quantile function (inverse cumulative distribution) is

$Q(p;k,\lambda) = \lambda(-\ln(1-p))^{1/k} \text{ for } 0\leq p <1$.

For i.i.d. samples $x_{1},\dots,x_{n}$, each observation has density

$p(x_{i};k,\lambda) = \frac{k}{\lambda}\Big(\frac{x_{i}}{\lambda}\Big)^{k-1}\exp\Big(-\Big(\frac{x_{i}}{\lambda}\Big)^{k}\Big),$

so the likelihood function is

$L(k,\lambda) = \prod_{i=1}^{n}p(x_{i};k,\lambda)$

with log-likelihood

$\ln(L(k,\lambda)) = n\ln(k) - nk\ln(\lambda) + (k-1)\sum_{i=1}^{n}\ln(x_{i}) - \sum_{i=1}^{n}x_{i}^{k}/\lambda^{k}.$

The partial derivatives are

$\frac{\partial \ln(L(k,\lambda))}{\partial \lambda} = - \frac{nk}{\lambda} + \frac{k\sum_{i=1}^{n}x_{i}^{k}}{\lambda^{k+1}},$

$\frac{\partial \ln(L(k,\lambda))}{\partial k} = \frac{n}{k} - n\ln(\lambda) + \sum_{i=1}^{n}\ln(x_{i}) -\sum_{i=1}^{n}\Big(\frac{x_{i}}{\lambda}\Big)^{k}\ln\Big(\frac{x_{i}}{\lambda}\Big).$

Setting both derivatives to zero, the first equation gives

$\lambda^{k}=\frac{\sum_{i=1}^{n}x_{i}^{k}}{n},$

and substituting this back into the second yields an equation in $k$ alone:

$\frac{1}{k} = \frac{\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i})}{\sum_{i=1}^{n}x_{i}^{k}} -\frac{\sum_{i=1}^{n}\ln(x_{i})}{n}.$

Define

$f(k) = \frac{\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i})}{\sum_{i=1}^{n}x_{i}^{k}} - \frac{\sum_{i=1}^{n}\ln(x_{i})}{n} - \frac{1}{k},$

whose derivative is

$f^{'}(k)= \frac{\sum_{i=1}^{n}x_{i}^{k}(\ln(x_{i}))^{2}}{\sum_{i=1}^{n}x_{i}^{k}}-\Big(\frac{\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i})}{\sum_{i=1}^{n}x_{i}^{k}}\Big)^{2} + \frac{1}{k^{2}}.$

By the Cauchy–Schwarz inequality,

$\Big(\sum_{i=1}^{n}x_{i}^{k}(\ln(x_{i}))^{2}\Big)\cdot\Big(\sum_{i=1}^{n}x_{i}^{k}\Big)\geq \Big(\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i})\Big)^{2},$

so $f^{'}(k)>0$ and $f$ is strictly increasing. Moreover,

$\lim_{k\rightarrow 0^{+}}f(k) = -\infty,$

$\lim_{k\rightarrow +\infty}f(k) > 0 \text{ if } \forall x_{i}>1,$

so $f$ has a unique root, which can be found by Newton’s method: starting from

$k_{0}= 0.0001,$

iterate

$k_{n+1} = k_{n}- \frac{f(k_{n})}{f^{'}(k_{n})} \text{ for all } n\geq 0.$

When $n$ is large enough, $k_{n}$ can be taken as the maximum likelihood estimate of $k$; the corresponding $\lambda$ then follows from $\lambda^{k}=\sum_{i=1}^{n}x_{i}^{k}/n$.
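A minimal numpy sketch of this Newton iteration (an illustration, not part of the original notes); it uses $f$ and $f'$ exactly as defined above and starts from $k_{0}=0.0001$ as in the text:

import numpy as np

def weibull_mle(x, k0=0.0001, tol=1e-10, max_iter=200):
    """Newton iteration for the Weibull shape k, then lambda from
    lambda^k = mean(x_i^k)."""
    x = np.asarray(x, dtype=float)
    logx = np.log(x)
    k = k0
    for _ in range(max_iter):
        xk = x ** k
        A = (xk * logx).sum() / xk.sum()          # first term of f(k)
        f = A - logx.mean() - 1.0 / k
        fprime = (xk * logx ** 2).sum() / xk.sum() - A ** 2 + 1.0 / k ** 2
        step = f / fprime
        k -= step
        if abs(step) < tol:
            break
    lam = ((x ** k).mean()) ** (1.0 / k)
    return k, lam

# e.g. k_hat, lam_hat = weibull_mle(np.random.RandomState(0).weibull(2.0, 1000) * 3.0)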

How machine learning can help the security industry

Machine learning (ML) is such a hot area in security right now.

At the 2016 RSA Conference, you would be hard pressed to find a company that is not claiming to use ML for security. And why not? To the layperson, ML seems like the magic solution to all security problems. Take a bunch of unlabeled data, pump it through a system with some ML magic inside, and it can somehow identify patterns even human experts can’t find — all while learning and adapting to new behaviors and threats. Rather than having to code the rules, these systems can discover the rules all by themselves.

Oh, if only that were the case! ML is this year’s “big data”: Everyone is claiming to do it, but few actually do it right or even understand what it’s good for. Especially in security, I’ve seen more misapplications than appropriate ones.

Most applications of ML in security use a form of anomaly detection, which is used to spot events that do not match an expected pattern. Anomaly detection is a useful technique in certain circumstances, but too often, vendors misapply it. For example, they will claim to analyze network traffic in an enterprise and use ML to find hackers in your network. This does not work, and you should be immediately skeptical of the vendors who make this claim.

Effective machine learning requires a low-dimensionality problem with high-quality labeled data. Unfortunately, deployments in real enterprises have neither. Detecting novel attacks requires either clear, labeled examples of attacks, which by definition you do not have, or a complete, exhaustive understanding of “normal” network behavior, which is impossible for any real network. And any sophisticated attacker will make an attack appear as seamless and “typical” as possible to avoid setting off alarms.

Where does ML work?

One example where ML and anomaly detection can actually work well for security is in classifying human behavior. Humans, it turns out, are fairly predictable, and it is possible to build fairly accurate models of individual user behavior and detect when current activity doesn’t match it.

We’ve had success in using ML for implicit authentication by analyzing a user’s biometrics, behavior, and environment. Implicit authentication is a technique that allows users to authenticate without performing explicit actions like entering a password or swiping a fingerprint. This has clear benefits for both user experience and security. Users don’t need to be bothered with extra steps, we can use many authentication factors (rather than just one, a password), and authentication can happen continuously in the background.

Implicit authentication is well-suited to ML because most of the factors are low dimensional, meaning they involve a small number of parameters, and you can passively gather high-quality labeled data about user identities. Much like ML is effective in matching images for computer vision even in the presence of variance and noise, it is also effective in matching unique human behavioral aspects.

One example of this technology is how we can authenticate users based on unique aspects of the way they move. Attributes of the way you walk, sit, and stand are influenced by a large number of factors (including physiology, age, gender, and muscle memory), but are largely consistent for an individual. It is actually possible to accurately detect some of these attributes from the motion sensors of the phone in your pocket. In fact, after four seconds of motion data from a phone in your pocket, we can detect enough of these attributes to identify you. Another example is using a user’s location history to authenticate them. Humans are creatures of habit, and by looking at where they came from and when, we can estimate whether it’s really them.

There are enough sensors in phones and computers (and more recently, wearables and IoT devices) that it is possible to passively pick up a large number of unique attributes about a user’s behavior and environment. We can then use ML to build a unique model for an individual user and find correlations between factors.

Threat models and anomaly detection

In any security system, it is important to understand the threat models you are trying to protect against. When using ML for security, you need to explicitly gather data, model the threats your system is protecting against, and use the model to train your system. Fortunately, for attacks against authentication, it is often possible to detect behavioral changes. For example, when a device is stolen, there are often clear changes in terms of its movement, location, and usage. And because false negatives are acceptable in that they just require the user to re-authenticate with a different method, we can tune the system to minimize false positives. In fact, once we combine four factors across multiple devices, we can get below a 0.001 percent false positive rate on implicit authentication.

There is no magic machine learning genie that can solve all your security problems. Building an effective security product that uses ML requires a deep understanding of the underlying system, and many security problems are just not appropriate for ML. For those that are, it’s a very powerful technique. And don’t worry, the companies on the hype train will soon move on to newer fads, like mobile self-driving AR blockchain drone marketplaces.

A Survey of Outlier Detection Algorithms


Complete Guide to Parameter Tuning in XGBoost (with codes in Python)

Introduction

If things don’t go your way in predictive modeling, use XGBoost. The XGBoost algorithm has become the ultimate weapon of many data scientists. It’s a highly sophisticated algorithm, powerful enough to deal with all sorts of irregularities in data.

Building a model using XGBoost is easy. But improving the model using XGBoost is difficult (at least I struggled a lot). The algorithm uses multiple parameters, and to improve the model, parameter tuning is a must. It is very difficult to get answers to practical questions like: which set of parameters should you tune? What are the ideal values of these parameters for optimal output?

What should you know?

XGBoost (eXtreme Gradient Boosting) is an advanced implementation of gradient boosting algorithm. Since I covered Gradient Boosting Machine in detail in my previous article – Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python, I highly recommend going through that before reading further. It will help you bolster your understanding of boosting in general and parameter tuning for GBM.

Special Thanks: Personally, I would like to acknowledge the timeless support provided by Mr. Sudalai Rajkumar (aka SRK), currently AV Rank 2. This article wouldn’t be possible without his help. He is helping us guide thousands of data scientists. A big thanks to SRK!

1. The XGBoost Advantage
2. Understanding XGBoost Parameters
3. Tuning Parameters (with Example)

I’ve always admired the boosting capabilities that this algorithm infuses in a predictive model. When I explored more about its performance and the science behind its high accuracy, I discovered many advantages:

1. Regularization:
• Standard GBM has no regularization, while XGBoost does, which helps reduce overfitting.
• In fact, XGBoost is also known as ‘regularized boosting‘ technique.
2. Parallel Processing:
• XGBoost implements parallel processing and is blazingly fast compared to GBM.
• But hang on, we know that boosting is a sequential process, so how can it be parallelized? Each tree can be built only after the previous one; what is parallelized is the construction of a single tree itself (for example, the search for the best split is spread across cores). Check this link out to explore further.
• XGBoost also supports implementation on Hadoop.
3. High Flexibility
• XGBoost allows users to define custom optimization objectives and evaluation criteria.
• This adds a whole new dimension to the model and there is no limit to what we can do.
4. Handling Missing Values
• XGBoost has an in-built routine to handle missing values.
• The user is required to supply a value different from other observations and pass that as a parameter. XGBoost tries different directions when it encounters a missing value at each node and learns which path to take for missing values in the future.
5. Tree Pruning:
• A GBM would stop splitting a node when it encounters a negative loss in the split. Thus it is more of a greedy algorithm.
• XGBoost, on the other hand, makes splits up to the max_depth specified and then starts pruning the tree backwards, removing splits beyond which there is no positive gain.
• Another advantage is that sometimes a split with a negative loss, say -2, may be followed by a split with a positive loss of +10. GBM would stop as soon as it encounters the -2. But XGBoost will go deeper, see the combined effect of +8, and keep both splits.
6. Built-in Cross-Validation
• XGBoost allows the user to run cross-validation at each iteration of the boosting process, making it easy to get the exact optimum number of boosting iterations in a single run.
• This is unlike GBM, where we have to run a grid search and only a limited number of values can be tested.
7. Continue on Existing Model
• The user can start training an XGBoost model from the last iteration of a previous run. This can be a significant advantage in certain applications.
• The sklearn implementation of GBM also has this feature, so they are even on this point.

I hope you now understand the sheer power of the XGBoost algorithm. Note that these are the points I could muster. Know a few more? Feel free to drop a comment below and I will update the list.

Did I whet your appetite? Good. You can refer to the following web pages for a deeper understanding:

2. XGBoost Parameters

The overall parameters have been divided into 3 categories by XGBoost authors:

1. General Parameters: Guide the overall functioning
2. Booster Parameters: Guide the individual booster (tree/regression) at each step
3. Learning Task Parameters: Guide the optimization performed

I will draw analogies to GBM here and highly recommend reading that article to learn boosting from the very basics.

General Parameters

These define the overall functionality of XGBoost.

1. booster [default=gbtree]
• Select the type of model to run at each iteration. It has 2 options:
• gbtree: tree-based models
• gblinear: linear models
2. silent [default=0]:
• Silent mode is activated if set to 1, i.e. no running messages will be printed.
• It’s generally good to keep it at 0, as the messages might help in understanding the model.
3. nthread [defaults to the maximum number of threads available if not set]
• This is used for parallel processing; the number of cores in the system should be entered.
• If you wish to run on all cores, do not set a value and the algorithm will detect it automatically.

There are 2 more parameters which are set automatically by XGBoost, and you need not worry about them. Let’s move on to Booster parameters.

Booster Parameters

Though there are 2 types of boosters, I’ll consider only the tree booster here because it always outperforms the linear booster, and thus the latter is rarely used.

1. eta [default=0.3]
• Analogous to learning rate in GBM
• Makes the model more robust by shrinking the weights on each step
• Typical final values to be used: 0.01-0.2
2. min_child_weight [default=1]
• Defines the minimum sum of weights of all observations required in a child.
• This is similar to min_samples_leaf in GBM, but not exactly: it refers to the minimum “sum of weights” of observations, while GBM has the minimum “number of observations”.
• Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
• Too high values can lead to under-fitting hence, it should be tuned using CV.
3. max_depth [default=6]
• The maximum depth of a tree, same as GBM.
• Used to control over-fitting as higher depth will allow model to learn relations very specific to a particular sample.
• Should be tuned using CV.
• Typical values: 3-10
4. max_leaf_nodes
• The maximum number of terminal nodes or leaves in a tree.
• Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
• If this is defined, XGBoost will ignore max_depth.
5. gamma [default=0]
• A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
• Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.
6. max_delta_step [default=0]
• This is the maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative.
• Usually this parameter is not needed, but it might help in logistic regression when class is extremely imbalanced.
• This is generally not used but you can explore further if you wish.
7. subsample [default=1]
• Same as the subsample parameter of GBM. Denotes the fraction of observations to be randomly sampled for each tree.
• Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to under-fitting.
• Typical values: 0.5-1
8. colsample_bytree [default=1]
• Similar to max_features in GBM. Denotes the fraction of columns to be randomly sampled for each tree.
• Typical values: 0.5-1
9. colsample_bylevel [default=1]
• Denotes the subsample ratio of columns for each split, in each level.
• I don’t use this often because subsample and colsample_bytree will do the job for you, but you can explore it further if you like.
10. lambda [default=1]
• L2 regularization term on weights (analogous to Ridge regression)
• This handles the regularization part of XGBoost. Though many data scientists don’t use it often, it should be explored to reduce overfitting.
11. alpha [default=0]
• L1 regularization term on weight (analogous to Lasso regression)
• Can be used in case of very high dimensionality so that the algorithm runs faster when implemented
12. scale_pos_weight [default=1]
• A value greater than 0 should be used in case of high class imbalance as it helps in faster convergence.

Learning Task Parameters

These parameters are used to define the optimization objective and the metric to be calculated at each step.

1. objective [default=reg:linear]
• This defines the loss function to be minimized. Mostly used values are:
• binary:logistic – logistic regression for binary classification; returns the predicted probability (not class)
• multi:softmax – multiclass classification using the softmax objective; returns the predicted class (not probabilities)
• you also need to set an additional num_class parameter defining the number of unique classes
• multi:softprob – same as softmax, but returns the predicted probability of each data point belonging to each class
2. eval_metric [ default according to objective ]
• The metric to be used for validation data.
• The default values are rmse for regression and error for classification.
• Typical values are:
• rmse – root mean square error
• mae – mean absolute error
• logloss – negative log-likelihood
• error – Binary classification error rate (0.5 threshold)
• merror – Multiclass classification error rate
• mlogloss – Multiclass logloss
• auc – Area under the curve
3. seed [default=0]
• The random number seed.
• Can be used for generating reproducible results and also for parameter tuning.

If you’ve been using Scikit-Learn till now, these parameter names might not look familiar. The good news is that the xgboost module in Python has an sklearn wrapper called XGBClassifier, which uses sklearn-style naming conventions. The parameter names that change are:

1. eta –> learning_rate
2. lambda –> reg_lambda
3. alpha –> reg_alpha

You must be wondering why we have defined everything except something similar to the “n_estimators” parameter in GBM. It does exist as a parameter in XGBClassifier; in the native xgboost API, however, the same quantity is passed as num_boost_round to the train (or cv) function.
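A small sketch of the two ways of specifying the number of trees; the synthetic `X` and `y` below are just placeholders:

import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier

X = np.random.rand(100, 5)
y = (X[:, 0] > 0.5).astype(int)

# sklearn wrapper: the number of trees is the n_estimators constructor argument
clf = XGBClassifier(n_estimators=140, learning_rate=0.1)
clf.fit(X, y)

# native API: the same quantity is passed to train()/cv() as num_boost_round
dtrain = xgb.DMatrix(X, label=y)
params = {'objective': 'binary:logistic', 'eta': 0.1}
booster = xgb.train(params, dtrain, num_boost_round=140)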

I recommend going through the following parts of the xgboost guide to better understand the parameters and code:

3. Parameter Tuning with Example

We will take the data set from the Data Hackathon 3.x AV hackathon, the same one used in the GBM article. The details of the problem can be found on the competition page. You can download the data set from here. I have performed the following steps:

1. City variable dropped because of too many categories
2. DOB converted to Age | DOB dropped
3. EMI_Loan_Submitted_Missing created which is 1 if EMI_Loan_Submitted was missing else 0 | Original variable EMI_Loan_Submitted dropped
4. EmployerName dropped because of too many categories
5. Existing_EMI imputed with 0 (median) since only 111 values were missing
6. Interest_Rate_Missing created which is 1 if Interest_Rate was missing else 0 | Original variable Interest_Rate dropped
7. Loan_Amount_Applied, Loan_Tenure_Applied imputed with median values
8. Loan_Amount_Submitted_Missing created which is 1 if Loan_Amount_Submitted was missing else 0 | Original variable Loan_Amount_Submitted dropped
9. Loan_Tenure_Submitted_Missing created which is 1 if Loan_Tenure_Submitted was missing else 0 | Original variable Loan_Tenure_Submitted dropped
10. LoggedIn, Salary_Account dropped
11. Processing_Fee_Missing created which is 1 if Processing_Fee was missing else 0 | Original variable Processing_Fee dropped
12. Source – top 2 kept as is and all others combined into a different category
13. Numerical and One-Hot-Coding performed

For those who have the original data from competition, you can check out these steps from the data_preparation iPython notebook in the repository.

# Import libraries:
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import cross_validation, metrics   # additional sklearn functions
from sklearn.grid_search import GridSearchCV    # performing grid search
# (in newer scikit-learn these live in sklearn.model_selection)

import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4

train = pd.read_csv('train_modified.csv')   # hypothetical filename: load the prepared training data
target = 'Disbursed'
IDcol = 'ID'

Note that I have imported 2 forms of XGBoost:

1. xgb – this is the direct xgboost library. I will use a specific function “cv” from this library
2. XGBClassifier – this is an sklearn wrapper for XGBoost. This allows us to use sklearn’s Grid Search with parallel processing in the same way we did for GBM

Before proceeding further, let’s define a function which will help us create XGBoost models and perform cross-validation. The best part is that you can take this function as is and use it later for your own models.

def modelfit(alg, dtrain, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):

    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'],
                          nfold=cv_folds, metrics='auc', early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])

    # Fit the algorithm on the data
    alg.fit(dtrain[predictors], dtrain['Disbursed'], eval_metric='auc')

    # Predict the training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:, 1]

    # Print the model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(dtrain['Disbursed'].values, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['Disbursed'], dtrain_predprob))

    # Plot feature importances (newer xgboost exposes the underlying booster via get_booster()):
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')
This code is slightly different from what I used for GBM. The focus of this article is to cover the concepts, not the coding. Please feel free to drop a note in the comments if you find any part of it hard to follow. Note that xgboost’s sklearn wrapper doesn’t expose a “feature_importances” metric, but the underlying booster’s get_fscore() function does the same job.

General Approach for Parameter Tuning

We will use an approach similar to that of GBM here. The various steps to be performed are:

1. Choose a relatively high learning rate. Generally a learning rate of 0.1 works, but somewhere between 0.05 and 0.3 should work for different problems. Determine the optimum number of trees for this learning rate. XGBoost has a very useful function called “cv” which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required.
2. Tune tree-specific parameters ( max_depth, min_child_weight, gamma, subsample, colsample_bytree) for decided learning rate and number of trees. Note that we can choose different parameters to define a tree and I’ll take up an example here.
3. Tune regularization parameters (lambda, alpha) for xgboost which can help reduce model complexity and enhance performance.
4. Lower the learning rate and decide the optimal parameters.

Let us look at a more detailed step by step approach.

Step 1: Fix learning rate and number of estimators for tuning tree-based parameters

In order to decide on boosting parameters, we need to set some initial values for the other parameters. Let’s take the following values:

1. max_depth = 5 : This should be between 3-10. I’ve started with 5 but you can choose a different number as well. 4-6 can be good starting points.
2. min_child_weight = 1 : A smaller value is chosen because it is a highly imbalanced class problem and leaf nodes can have smaller size groups.
3. gamma = 0 : A smaller value like 0.1-0.2 can also be chosen for starting. This will anyways be tuned later.
4. subsample, colsample_bytree = 0.8 : This is a commonly used start value. Typical values range between 0.5-0.9.
5. scale_pos_weight = 1: Because of high class imbalance.

Please note that all of the above are just initial estimates and will be tuned later. Let’s take the default learning rate of 0.1 here and check the optimum number of trees using the cv function of xgboost. The function defined above will do this for us.

# Choose all predictors except target & ID columns
predictors = [x for x in train.columns if x not in [target, IDcol]]
xgb1 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    scale_pos_weight=1,
    seed=27)
modelfit(xgb1, train, predictors)

As you can see, here we get 140 as the optimal number of estimators for a 0.1 learning rate. Note that this value might be too high for your system depending on its power. In that case you can increase the learning rate and re-run the command to get a reduced number of estimators.

Note: You will see the test AUC as “AUC Score (Test)” in the outputs here. But this would not appear if you try to run the command on your system as the data is not made public. It’s provided here just for reference. The part of the code which generates this output has been removed here.

Step 2: Tune max_depth and min_child_weight

We tune these first as they will have the highest impact on model outcome. To start with, let’s set wider ranges and then we will perform another iteration for smaller ranges.

Important Note: I’ll be doing some heavy-duty grid searches in this section, which can take 15-30 minutes or even more to run depending on your system. You can vary the number of values you test based on what your system can handle.

param_test1 = {
    'max_depth': range(3, 10, 2),
    'min_child_weight': range(1, 6, 2)
}
gsearch1 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5,
                            min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', scale_pos_weight=1, seed=27),
    param_grid=param_test1, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch1.fit(train[predictors], train[target])
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

Here, we have run 12 combinations with wider intervals between values. The ideal values are 5 for max_depth and 5 for min_child_weight. Let’s go one step deeper and look for optimum values. We’ll search for values 1 above and 1 below the optima, because we took an interval of two.

param_test2 = {
    'max_depth': [4, 5, 6],
    'min_child_weight': [4, 5, 6]
}
gsearch2 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5,
                            min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', scale_pos_weight=1, seed=27),
    param_grid=param_test2, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch2.fit(train[predictors], train[target])
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_

Here, we get the optimum values of 4 for max_depth and 6 for min_child_weight. Also, we can see the CV score increasing slightly. Note that as model performance increases, it becomes exponentially harder to achieve even marginal gains. You may have noticed that we got 6 as the optimum value for min_child_weight without trying values above 6. We can do that as follows:

param_test2b = {
    'min_child_weight': [6, 8, 10, 12]
}
gsearch2b = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=4,
                            min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', scale_pos_weight=1, seed=27),
    param_grid=param_test2b, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch2b.fit(train[predictors], train[target])
modelfit(gsearch2b.best_estimator_, train, predictors)
gsearch2b.grid_scores_, gsearch2b.best_params_, gsearch2b.best_score_

We see 6 as the optimal value.

Step 3: Tune gamma

Now let’s tune the gamma value using the parameters already tuned above. Gamma can take various values, but I’ll check 5 values here. You can go into more precise values if you like.

param_test3 = {
    'gamma': [i / 10.0 for i in range(0, 5)]
}
gsearch3 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=4,
                            min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', scale_pos_weight=1, seed=27),
    param_grid=param_test3, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch3.fit(train[predictors], train[target])
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_

This shows that our original value of gamma, i.e. 0, is the optimum one. Before proceeding, a good idea would be to re-calibrate the number of boosting rounds for the updated parameters.

xgb2 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=4,
    min_child_weight=6,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    scale_pos_weight=1,
    seed=27)
modelfit(xgb2, train, predictors)

Here, we can see the improvement in score. So the final parameters are:

• max_depth: 4
• min_child_weight: 6
• gamma: 0

Step 4: Tune subsample and colsample_bytree

The next step would be to try different subsample and colsample_bytree values. Let’s do this in 2 stages as well and take the values 0.6, 0.7, 0.8, 0.9 for both to start with.

param_test4 = {
    'subsample': [i / 10.0 for i in range(6, 10)],
    'colsample_bytree': [i / 10.0 for i in range(6, 10)]
}
gsearch4 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=4,
                            min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', scale_pos_weight=1, seed=27),
    param_grid=param_test4, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch4.fit(train[predictors], train[target])
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_

Here, we found 0.8 as the optimum value for both subsample and colsample_bytree. Now we should try values in 0.05 intervals around these.

param_test5 = {
    'subsample': [i / 100.0 for i in range(75, 90, 5)],
    'colsample_bytree': [i / 100.0 for i in range(75, 90, 5)]
}
gsearch5 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=4,
                            min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', scale_pos_weight=1, seed=27),
    param_grid=param_test5, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch5.fit(train[predictors], train[target])

Again we got the same values as before. Thus the optimum values are:

• subsample: 0.8
• colsample_bytree: 0.8

Step 5: Tuning Regularization Parameters

The next step is to apply regularization to reduce overfitting. Many people don’t use these parameters much, since gamma provides a substantial way of controlling complexity, but we should always try them. I’ll tune the ‘reg_alpha’ value here and leave it up to you to try different values of ‘reg_lambda’.

param_test6 = {
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=4,
                            min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', scale_pos_weight=1, seed=27),
    param_grid=param_test6, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch6.fit(train[predictors], train[target])
gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_

We can see that the CV score is lower than in the previous case. But the values tried are very widespread; we should try values closer to the optimum here (0.01) to see if we get something better.

param_test7 = {
    'reg_alpha': [0, 0.001, 0.005, 0.01, 0.05]
}
gsearch7 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=4,
                            min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', scale_pos_weight=1, seed=27),
    param_grid=param_test7, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch7.fit(train[predictors], train[target])
gsearch7.grid_scores_, gsearch7.best_params_, gsearch7.best_score_

You can see that we got a better CV score. Now we can apply this regularization in the model and look at the impact:

xgb3 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=4,
    min_child_weight=6,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.005,
    objective='binary:logistic',
    scale_pos_weight=1,
    seed=27)
modelfit(xgb3, train, predictors)

Again we can see slight improvement in the score.

Step 6: Reducing Learning Rate

Lastly, we should lower the learning rate and add more trees. Let’s use the cv function of XGBoost to do the job again.

xgb4 = XGBClassifier(
    learning_rate=0.01,
    n_estimators=5000,
    max_depth=4,
    min_child_weight=6,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.005,
    objective='binary:logistic',
    scale_pos_weight=1,
    seed=27)
modelfit(xgb4, train, predictors)

Now we can see a significant boost in performance and the effect of parameter tuning is clearer.

As we come to the end, I would like to share 2 key thoughts:

1. It is difficult to get a very big leap in performance just by using parameter tuning or slightly better models. The max score for GBM was 0.8487 while XGBoost gave 0.8494. This is a decent improvement but nothing very substantial.
2. A significant jump can be obtained by other methods like feature engineering, creating ensembles of models, stacking, etc.

You can also download the iPython notebook with all these model codes from my GitHub account. For codes in R, you can refer to this article.

End Notes

This article covered developing an XGBoost model end-to-end. We started by discussing why XGBoost has superior performance over GBM, followed by a detailed discussion of the various parameters involved. We also defined a generic function which you can re-use for making models.

Finally, we discussed the general approach towards tackling a problem with XGBoost and also worked out the AV Data Hackathon 3.x problem through that approach.

I hope you found this useful and now feel more confident applying XGBoost to data science problems. You can try it out in our upcoming hackathons.

Did you like this article? Would you like to share some other hacks which you implement while making XGBoost models? Please feel free to drop a note in the comments below and I’ll be glad to discuss.

AI^2: Training a big data machine to defend

AI2: Training a big data machine to defend – Veeramachaneni et al., IEEE International Conference on Big Data Security, 2016

Will machines take over? The lesson of today’s paper is that we’re better off together. Combining AI with HI (human intelligence; I felt like we deserved an acronym of our own) yields much better results than a system that uses only unsupervised learning. The context is information security: scanning millions of log entries per day to detect suspicious activity and prevent attacks. Examples of attacks include account takeovers, new account fraud (opening a new account using stolen credit card information), and terms of service abuse (e.g. abusing promotional codes, or manipulating cookies for advantage).

A typical attack has a behavioral signature, which comprises the series of steps involved in committing it. The information necessary to quantify these signatures is buried deep in the raw data and is often delivered as logs.

The usual problem with such outlier/anomaly detection systems is that they trigger lots of false positive alarms that take substantial time and effort to investigate. After the system has ‘cried wolf’ enough times, it becomes distrusted and of limited use. AI2 combines the experience and intuition of analysts with machine learning techniques. An ensemble of unsupervised learning models generates a set of k events to be analysed per day (the daily budget k of events that can be analysed is a configurable parameter). The human judgements on these k events are used to train a supervised model, whose results are combined with the unsupervised ensemble’s results to refine the k events presented to the analyst the next day. And so it goes on.

The end result looks a bit like this:

With a daily investigation budget (k) of 200 events, AI2 detects 86.8% of attacks with a false positive rate of 4.4%. Using only unsupervised learning, only 7.9% of attacks are detected. If the investigation budget is upped to 1000 events/day, unsupervised learning can detect 73.7% of attacks with a false positive rate of 22%. At that level, the unsupervised system generates 5x the false positives of AI2 and still does not detect as many attacks.

Detecting attacks is a true ‘needle-in-a-haystack’ problem, as the paper’s table of daily data volumes shows.

Entities there refers to the number of unique IP addresses, users, sessions, etc. analysed on a daily basis. The very small relative number of true attacks results in extreme class imbalance when trying to learn a supervised model.

AI2 tracks activity based on ingested log records and aggregates activities over intervals of time (for example, counters; indicators – did this happen in the window at all?; elapsed time between events; number of unique values; and so on). These features are passed into an ensemble of three unsupervised outlier detection models:

• A Principal Component Analysis (PCA) based model. The basic idea is to use PCA to determine the most significant features (those that explain most of the variance in the data). Given an input, take its PCA projection and then, from the projection, reconstruct the original variables. The reconstruction error will be small for the majority of examples but will remain high for outliers.
• A Replicator Neural Network (not to be confused with a Recurrent Neural Network – both get abbreviated to RNN). This works on a very similar principle: the input and output layers have the same number of nodes, and the intermediate layers have fewer nodes. The goal is to train the network to recreate the input at the output layer, which means it must learn an efficient compressed representation in the lower-dimensional hidden layers. Once the RNN has been trained, the reconstruction error can be used as the outlier score (a minimal sketch appears after the copula aside below).
• The third unsupervised model uses copula functions to build a joint probability function that can be used to detect rare events.

A copula framework provides a means of inference after modeling a multivariate joint probability distribution from training data. Because copula frameworks are less well known than other forms of estimation, we will now briefly review copula theory…

(If you’re interested in that review, and how copula functions are used to form a multivariate density function, see section 6.3 in the paper).
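As a rough illustration of the replicator-network bullet above, here is a minimal sketch that uses scikit-learn’s MLPRegressor as a stand-in autoencoder on synthetic data; this is the reconstruction-error principle only, not the paper’s implementation:

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 8))
X[-5:] += 6  # inject a few obvious outliers

# Train the network to reproduce its input through a narrow hidden layer,
# then use the per-row reconstruction error as the outlier score.
ae = MLPRegressor(hidden_layer_sizes=(4, 2, 4), max_iter=2000, random_state=0)
ae.fit(X, X)
recon_error = ((ae.predict(X) - X) ** 2).mean(axis=1)
print(np.argsort(recon_error)[-5:])  # the injected outliers should rank highest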

The scores from each of the models are translated into probabilities using a Weibull distribution, “which is flexible and can model a wide variety of shapes.” This translation means that we can compare like-with-like when combining the results from the three models. (The paper walks through the combination process using one day’s worth of data.)
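One plausible reading of this translation step, sketched with scipy; the gamma-distributed `scores` below are purely synthetic stand-ins for a detector’s raw outlier scores:

import numpy as np
from scipy.stats import weibull_min

rng = np.random.RandomState(0)
scores = rng.gamma(shape=2.0, scale=1.0, size=1000)  # stand-in raw outlier scores

# Fit a Weibull distribution to the scores (location fixed at 0), then map
# each raw score through the fitted CDF so that the outputs of different
# detectors live on a common probability scale.
k, loc, lam = weibull_min.fit(scores, floc=0)
probabilities = weibull_min.cdf(scores, k, loc=loc, scale=lam)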

The whole AI2 system cycles through training, deployment, and feedback collection/model updating phases on a daily basis. The system trains unsupervised and supervised models on all the available data, applies those models to the incoming data, identifies k entities as extreme events or attacks, and brings these to the analyst’s attention. The analysts’ deductions are used to build a new predictive model for the next day.

This combined approach makes effective use of the limited available analyst bandwidth, can overcome some of the weaknesses of pure unsupervised learning, and actively adapts and synthesizes new models.

This setup captures the cascading effect of the human-machine interaction: the more attacks the predictive system detects, the more feedback it will receive from the analysts; this feedback, in turn, will improve the accuracy of future predictions.

Glossary of AI Terms for Cyber Security

We often encounter confusion and hype surrounding the terminology of Artificial Intelligence. In this post, it is hoped that the security practitioner can have a quick reference guide for some of the more important and common terms.

Note that this is a limited set. We discovered that, once you start defining these terms, the terms themselves introduce new terms that require definition. We had to draw the line somewhere…

• Artificial Intelligence – “The goal of work in Artificial Intelligence is to build machines that perform tasks that normally require human intelligence.” This quote from Nils Nilsson is an excellent definition, but it is not the only one. There are many definitions of Artificial Intelligence here and here.
• Algorithms – self-contained, step-by-step sets of operations to be performed. Algorithms perform calculation, data processing, and/or automated reasoning tasks. Among other things, algorithms can be used to train Machine Learning models.
• Machine Learning – A discipline or subfield of Artificial Intelligence. Paraphrasing the definition by Tom Mitchell, ML is the study of computer algorithms that learn from experience to perform a set of predefined tasks.
• Machine Learning Models – The output of a machine learning algorithm. There are two types of machine learning models, those generated by Supervised algorithms and Unsupervised algorithms. See below.
• The difference between “algorithms” and “models”: this is a common question and still quite difficult to answer. In the context of Artificial Intelligence, we can say that learning algorithms generate models; the learning algorithms are either Supervised or Unsupervised. N.B. People often use “models” and “algorithms” interchangeably, which is a common source of confusion. To the layman: algorithms are like programs, and models are the output of those programs.
• Unsupervised Learning (algorithm) – a family of machine learning algorithms that learn without labels (labels are defined below). The outputs of Unsupervised Learning algorithms are models that capture the structure of the data, identify groups, or find statistical outliers. For example, Unsupervised Learning models can show you behaviors that are unlike other behaviors in a corpus of data.
• Supervised Learning (algorithm) – a family of machine learning algorithms that learn from labeled data. The outputs of Supervised Learning algorithms are predictive models that can classify or assign a score to a data pattern. For example, trained Supervised Learning models can classify behavior patterns into different attack tactics, or can assign a risk score to a behavior. In cyber-security, Supervised Learning models predict what a human would label a given behavior pattern.
• Labeling – the act of classifying or describing something. For PatternEx, labeling is something a human analyst does every day: he or she marks something as malicious or benign behavior. The more labels are provided, the more accurate the system becomes.
• Active Learning – a machine learning process in which a learning algorithm interactively requests inputs from an external source to improve a model. It is most commonly applied when only unlabeled data is available, the goal is to train Supervised Learning models, the external source is a human expert who provides labels, and the labeling process is expensive and/or slow. Active learning strategies are also useful when, as in the case of InfoSec, the data changes fast.
• Behavior Vectors – a quantified description of the activity of the modeled entities.
• Entities – things with a distinct, independent existence to which behaviors relate. In cyber-security, examples would be users, IPs, domains, and so on.
• Human-Assisted Artificial Intelligence – the synergy between human intuition and artificial intelligence. Note that the humans assist the AI by providing feedback (e.g. labels) and the trained AI assists the humans by automating and scaling the tasks requiring human intelligence.
• Predictions – the system’s anticipation of how an event would be classified by the security analyst.
• Rare Events – very similar to “anomalies” and “outliers,” Rare Events are activities seen in log data that are unusual or out of the ordinary but not yet determined to be either malicious or benign.
• Transfer Learning – the ability to port knowledge acquired in one environment to another to improve model accuracy. For example, a model trained at company X can be transferred to companies A, B and C, increasing the detection capabilities of the entire group.
• Virtual Analyst – the term that describes the effect of a fully trained AI system. Because a trained AI system greatly scales the analytic capability of the human analyst team, we say it is like expanding your team with “virtual analysts.”

PatternEx Unique Approach

PatternEx comes with many algorithms out-of-the-box that allow it to create predictive models that select what the analyst should review. Humans will always be needed to identify in context what is malicious in a constantly changing sea of events. In this way, Human-Assisted Artificial intelligence systems learn from analysts to identify what events are malicious. This results in greater detection accuracy at scale and reduced mean time to identify an attack.

Outlier Detection Algorithms (Part 2)

(I) Principal Component Analysis

Dimensionality reduction with PCA has several benefits:

(1) it makes the data set easier to use;

(2) it reduces the computational cost of many algorithms;

(3) it removes noise;

(4) it makes the results easier to describe.

The first step of PCA is to remove the mean from the data.

Principal component analysis provides a set of eigenvectors satisfying the following basic properties:

（1）If the top-k eigenvectors are picked (by largest eigenvalue), then the k-dimensional hyperplane defined by these eigenvectors, and passing through the mean of the data, is a plane for which the mean square distance of all data points to it is as small as possible among all hyperplanes of dimensionality k.

（2）If the data is transformed to the axis-system corresponding to the orthogonal eigenvectors, the variance of the transformed data along each eigenvector dimension is equal to the corresponding eigenvalue. The covariances of the transformed data in this new representation are 0.

（3）Since the variances of the transformed data along the eigenvectors with small eigenvalues are low, significant deviations of the transformed data from the mean values along these directions may represent outliers.

(II) Outlier Detection Based on Matrix Factorization

Let $X$ be the covariance matrix of the (centered) data matrix $dataMat$, with eigendecomposition

$X=PDP^{T},$

where the columns of $P$ are eigenvectors sorted by decreasing eigenvalue. The full projection of the data is

$Y=dataMat\times P.$

Projecting onto the top-$j$ eigenvectors $P^{j}$ gives

$Y^{j}=dataMat \times P^{j},$

and reconstructing the data from this projection gives

$R^{j}=(P^{j}\times (Y^{j})^{T})^{T}=Y^{j}\times (P^{j})^{T}.$

The outlier score of the $i$-th data point is the reconstruction error accumulated over all $j$, weighted by the explained-variance ratio:

$score(dataMat_{i})=\sum_{j=1}^{p}(|dataMat_{i}-R_{i}^{j}|)\times ev(j),$

where

$ev(j)=\sum_{k=1}^{j}\lambda_{k}\Big/\sum_{k=1}^{p}\lambda_{k}$

and $\lambda_{1}\geq\cdots\geq\lambda_{p}$ are the eigenvalues. Larger scores indicate points that are reconstructed poorly and are therefore more likely to be outliers.
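A minimal numpy sketch of this scoring scheme (an illustration of the formulas above, with the eigenvectors taken from the covariance matrix of the centered data):

import numpy as np

def pca_outlier_scores(dataMat):
    """Weighted PCA reconstruction error, following the formulas above."""
    X = dataMat - dataMat.mean(axis=0)           # center the data
    eigvals, P = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
    eigvals, P = eigvals[order], P[:, order]
    ev = np.cumsum(eigvals) / eigvals.sum()      # ev(j)
    scores = np.zeros(X.shape[0])
    for j in range(1, X.shape[1] + 1):
        Pj = P[:, :j]                            # top-j eigenvectors
        Rj = (X @ Pj) @ Pj.T                     # projection Y^j, then reconstruction R^j
        scores += np.abs(X - Rj).sum(axis=1) * ev[j - 1]
    return scores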

Outlier Detection Algorithms (Part 1)

(I) Univariate Outlier Detection Based on the Normal Distribution

Fit a normal distribution to the one-dimensional data by estimating

$\mu=\sum_{i=1}^{n}x_{i}/n,$

$\sigma^{2}=\sum_{i=1}^{n}(x_{i}-\mu)^{2}/n.$

Points that fall far from the mean relative to $\sigma$ (for example, outside $\mu\pm 3\sigma$) can then be flagged as outliers.

(II) Multivariate Outlier Detection Methods

(1) Outlier detection based on independent univariate normal distributions

For each feature $j$, estimate the per-feature mean and variance:

$\mu_{j}=\sum_{i=1}^{m}x_{i,j}/m,$

$\sigma_{j}^{2}=\sum_{i=1}^{m}(x_{i,j}-\mu_{j})^{2}/m.$

A point $\vec{x}$ is then scored by the product of the per-feature densities:

$p(\vec{x})=\prod_{j=1}^{n} p(x_{j};\mu_{j},\sigma_{j}^{2})=\prod_{j=1}^{n}\frac{1}{\sqrt{2\pi}\sigma_{j}}\exp\Big(-\frac{(x_{j}-\mu_{j})^{2}}{2\sigma_{j}^{2}}\Big).$

A small value of $p(\vec{x})$ marks $\vec{x}$ as a potential outlier.
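A minimal sketch of this density-based score (illustrative only):

import numpy as np

def independent_gaussian_density(X):
    """p(x) under independent per-feature Gaussians, as in the formula above;
    a small density marks a potential outlier."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)
    densities = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod(axis=1)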

(2) Outlier detection with the multivariate Gaussian distribution

Estimate the mean vector

$\vec{\mu}=(E(x_{1}),...,E(x_{n}))$

and the $n\times n$ covariance matrix

$\Sigma=[Cov(x_{i},x_{j})], \quad i,j \in \{1,...,n\},$

then score each point with the multivariate Gaussian density

$p(\vec{x})=\frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}} \exp\Big(-\frac{1}{2}(\vec{x}-\vec{\mu})^{T}\Sigma^{-1}(\vec{x}-\vec{\mu})\Big).$

Unlike the per-feature method above, this density captures correlations between features.

(3) Detecting multivariate outliers with the Mahalanobis distance

Let $\overline{a}$ be the mean vector of the data and $S$ its sample covariance matrix. The Mahalanobis distance of a point $a$ is

$MDist(a,\overline{a})=\sqrt{(a-\overline{a})^{T}S^{-1}(a-\overline{a})},$

and points with unusually large distances are candidate outliers.
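A minimal numpy sketch of this distance (illustrative only):

import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of each row of X from the sample mean."""
    diff = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, S_inv, diff))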

(4) Detecting multivariate outliers with the $\chi^{2}$ statistic

With $E_{i}$ denoting the mean of the $i$-th dimension over all data points, compute

$\chi^{2}=\sum_{i=1}^{n}(a_{i}-E_{i})^{2}/E_{i}.$

A large $\chi^{2}$ value indicates that the point $a$ deviates strongly from the means and is likely an outlier.