Linear Regression

October 28, 2016 zr9558

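Regression methods predict continuous values, and the simplest of them is linear regression. As the name suggests, linear regression fits a linear equation to a known data set in order to predict unseen values. Its strengths are that the result is easy to interpret and the formulas are simple; its weakness is that it fits nonlinear data poorly. For example, fitting the quadratic $f(x) = x^{2}$ with a linear function $y = kx + b$ never gives a satisfactory result. To address such problems, people have proposed locally weighted linear regression, ridge regression, LASSO, and forward stagewise linear regression. This article introduces each of these regression algorithms in turn.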

(1) Linear Regression

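Suppose each row of the matrix X is a sample and each column is a feature, and the column vector Y holds the target values corresponding to the rows of X. We would like to find a column vector $\Theta$ such that $Y=X\Theta$. In real data sets such a $\Theta$ almost never exists. We can, however, look for a $\Theta$ that makes the Euclidean norm of the column vector $Y-X\Theta$ as small as possible. In other words, we need a vector $\Theta$ that minimizes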

$\sum_{i=1}^{m}(y_{i}-x_{i}\Theta)^{2} = (Y-X\Theta)^{T}(Y-X\Theta)$

where m is the number of rows of X and $x_{i}$ is the i-th row vector of X. Expanding the product gives:

$(Y-X\Theta)^{T}(Y-X\Theta)=Y^{T}Y-2Y^{T}X\Theta + \Theta^{T}X^{T}X\Theta$

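Differentiating with respect to $\Theta$ and setting the result to zero gives $-2X^{T}Y + 2X^{T}X\Theta=0$. Solving for $\Theta$, assuming $X^{T}X$ is invertible, gives $\Theta = (X^{T}X)^{-1}X^{T}Y$. Therefore, for a matrix $X$ and a column vector $Y$, the optimal linear regression coefficients are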

$\Theta = (X^{T}X)^{-1}X^{T}Y.$
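The closed-form solution above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the original post: the function name `fit_linear_regression` and the toy data are assumptions made for the example, and the code solves the linear system $(X^{T}X)\Theta = X^{T}Y$ rather than forming the inverse explicitly, which is usually more stable numerically.

```python
import numpy as np

def fit_linear_regression(X, Y):
    """Return Theta solving the normal equation (X^T X) Theta = X^T Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Solve the linear system instead of computing (X^T X)^{-1} directly.
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Toy data (made up for illustration): y = 2x + 1.
# A column of ones is added so that Theta[0] plays the role of the intercept b.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
Y = 2.0 * x + 1.0
theta = fit_linear_regression(X, Y)
```

Because the toy data lie exactly on a line, `theta` recovers the intercept and slope up to floating-point error.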

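For example, in the figure the blue points are the data set, and linear regression produces a straight line through them.

[Figure: data points in blue with the fitted regression line]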

(2) Locally Weighted Linear Regression

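One problem with linear regression is underfitting, because a linear equation rarely describes real-world data sets accurately. Locally weighted linear regression addresses this by assigning a weight to every sample, that is, by minimizing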

$\sum_{i=1}^{m}w_{i}(y_{i}-x_{i}\Theta)^{2} = (Y-X\Theta)^{T}W(Y-X\Theta),$

where $W$ is the diagonal matrix with diagonal entries $\{w_{1},...,w_{m}\}$, m is the number of rows of X, and $x_{i}$ is the i-th row vector of X. Expanding gives:

$(Y-X\Theta)^{T}W(Y-X\Theta)=Y^{T}WY-2Y^{T}WX\Theta+\Theta^{T}X^{T}WX\Theta,$

Differentiating with respect to $\Theta$ gives:

$-2(Y^{T}WX)^{T} + 2 X^{T}WX\Theta = -2X^{T}WY+2X^{T}WX\Theta.$

Setting the derivative to zero gives $\Theta = (X^{T}WX)^{-1}X^{T}WY$. So with locally weighted linear regression the optimal coefficients are

$\Theta = (X^{T}WX)^{-1}X^{T}WY.$

Locally weighted linear regression requires choosing the weight matrix W, that is, its diagonal entries. A Gaussian kernel is the usual choice:

$w_{i} = \exp\{-\frac{(x_{i}-x)^{2}}{2k^{2}}\}.$

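where k is a parameter, $x$ is the query point at which a prediction is wanted, and $x_{i}$ is the i-th training sample. From the definition of the Gaussian kernel, if $x$ is close to $x_{i}$ then $w_{i}$ is large; if they are far apart, $w_{i}$ tends to zero. In other words, the fit is linear locally, around each query point, but need not be linear globally. $k$ is the only parameter of locally weighted linear regression.

With a suitable $k$ the fitted curve looks reasonable. With an unsuitable $k$ the result degrades: a $k$ that is too small overfits the noise, while a very large $k$ gives nearly equal weights and falls back to ordinary linear regression.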
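As a rough sketch of the weighted solution $\Theta = (X^{T}WX)^{-1}X^{T}WY$ with Gaussian kernel weights, the following NumPy function predicts at a single query point. The name `lwlr_predict`, the toy data, and the choice `k=0.2` are illustrative assumptions; the squared distance is taken over the whole feature row, which reduces to $(x_{i}-x)^{2}$ for one-dimensional inputs.

```python
import numpy as np

def lwlr_predict(x_query, X, Y, k=1.0):
    """Predict at x_query with locally weighted linear regression."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Gaussian kernel: samples close to the query get weights near 1,
    # distant samples get weights near 0.
    sq_dist = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-sq_dist / (2.0 * k ** 2))
    # X.T * w equals X.T @ diag(w) without building the m x m matrix W.
    XtW = X.T * w
    theta = np.linalg.solve(XtW @ X, XtW @ Y)
    return x_query @ theta

# Toy data (made up): the quadratic from the introduction, which a single
# global line fits poorly.
x = np.linspace(-2.0, 2.0, 41)
X = np.column_stack([np.ones_like(x), x])
Y = x ** 2
pred = lwlr_predict(np.array([1.0, 1.0]), X, Y, k=0.2)
```

A new $\Theta$ is solved for every query point, so prediction cost grows with the size of the training set; that is the price of the local fit.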


(3) Ridge Regression and LASSO

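In the special case where the number of features n exceeds the number of samples m, i.e. X has more columns than rows, the $n\times n$ matrix $X^{T}X$ has rank at most m and is therefore singular, so $(X^{T}X)^{-1}$ cannot be computed. Ridge regression was introduced to solve this problem: before inverting, a diagonal matrix is added so that the sum becomes invertible. In mathematical terms, $\lambda I$ is added to $X^{T}X$, where I is the $n\times n$ identity matrix and $\lambda>0$, so that $X^{T}X+\lambda I$ is invertible. The regression coefficients then become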

$\Theta = (X^{T}X+\lambda I)^{-1}X^{T}Y.$

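Ridge regression was originally introduced for the case where there are more features than samples; it is now also used to deliberately add bias to the estimate in exchange for a better overall estimate.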

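From another point of view, when there are many features and relatively few samples, minimizing $\sum_{i=1}^{m}(y_{i}-x_{i}\Theta)^{2}$ overfits easily. A regularization term can be added to mitigate this. With $L^{2}$ regularization the objective function is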

$\sum_{i=1}^{m}(y_{i}-x_{i}\Theta)^{2}+\lambda||\Theta||_{2}^{2}=(Y-X\Theta)^{T}(Y-X\Theta)+\lambda \Theta^{T}\Theta,$

where $\lambda>0$. Expanding gives:

$(Y-X\Theta)^{T}(Y-X\Theta)+\lambda \Theta^{T}\Theta = Y^{T}Y - 2\Theta^{T}X^{T}Y+\Theta^{T}X^{T}X\Theta+\lambda\Theta^{T}I\Theta.$

Differentiating with respect to $\Theta$ gives:

$-2X^{T}Y+2(X^{T}X+\lambda I)\Theta,$

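Setting the derivative to zero gives $\Theta = (X^{T}X + \lambda I)^{-1}X^{T}Y.$ So, from another point of view, ridge regression is linear regression with an added $L^{2}$-norm regularization term, which reduces the risk of overfitting.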

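Note that features must be standardized before running ridge regression, so that every feature dimension carries the same importance. Concretely, compute (feature - mean of the feature) / standard deviation of the feature, so that every dimension has zero mean and unit variance.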
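A minimal sketch of ridge regression with the standardization step described above, assuming NumPy. The function name `ridge_fit`, the random toy data, and the values of `lam` are illustrative assumptions. The toy matrix deliberately has more features than samples, the case in which $X^{T}X$ alone is singular.

```python
import numpy as np

def ridge_fit(X, Y, lam=1.0):
    """Return Theta = (X^T X + lam * I)^{-1} X^T Y on standardized features."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Standardize: (feature - mean) / standard deviation, per column.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # Center the target so no intercept column is needed.
    Yc = Y - Y.mean()
    n = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(n), Xs.T @ Yc)

# Toy data (made up): 5 samples, 8 features, so X^T X is singular on its own.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Y = rng.normal(size=5)
theta_small = ridge_fit(X, Y, lam=0.5)
theta_large = ridge_fit(X, Y, lam=10.0)
```

Increasing `lam` shrinks the coefficient vector toward zero, which is the bias the text refers to.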

If the $L^{2}$-norm regularization in ridge regression is replaced with the $L^{1}$ norm, the objective function becomes

$\sum_{i=1}^{m}(y_{i}-x_{i}\Theta)^{2}+\lambda ||\Theta||_{1}$

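where $\lambda>0$. Both the $L^{1}$ and $L^{2}$ norms help reduce the risk of overfitting. The method using the $L^{1}$ norm is called LASSO (Least Absolute Shrinkage and Selection Operator). Compared with the $L^{2}$ norm, the $L^{1}$ norm more readily yields a sparse solution, i.e. the resulting $\Theta$ has fewer nonzero components. A sparse $\Theta$ means that, of the initial n features, only those corresponding to nonzero components of $\Theta$ appear in the final model. Solving the $L^{1}$-regularized problem therefore produces a model that uses only a subset of the original features. Put differently, $L^{1}$-regularized learning is an embedded feature selection method: feature selection and training are merged into one process and completed together.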

(4) Forward Stagewise Linear Regression

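Forward stagewise linear regression is a greedy algorithm that tries to reduce the error as much as possible at every step. All weights are initialized to 0, and the decision made at each step is to increase or decrease one weight by a small value $\epsilon$.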

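The pseudocode is:

Standardize the data to zero mean and unit variance
For each iteration:
  Set the lowest error so far to +infinity
  For each feature:
    For increasing and decreasing:
      Change one coefficient to get a new weight vector W
      Compute the error under the new W
      If the error is lower than the lowest error so far: set Wbest to the current W and update the lowest error
  Set W to the new Wbest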
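The pseudocode can be turned into a short NumPy sketch. The function name `stagewise`, the zero initialization, the step size `eps`, and the synthetic data are assumptions made for illustration; the error is the residual sum of squares.

```python
import numpy as np

def stagewise(X, Y, eps=0.01, n_iter=500):
    """Greedy forward stagewise regression on standardized data."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Standardize features to zero mean / unit variance and center the target.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    Y = Y - Y.mean()
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        lowest_error = np.inf
        w_best = w.copy()
        for j in range(X.shape[1]):          # each feature
            for sign in (-1.0, 1.0):         # decrease or increase
                w_try = w.copy()
                w_try[j] += sign * eps
                error = np.sum((Y - X @ w_try) ** 2)
                if error < lowest_error:
                    lowest_error = error
                    w_best = w_try
        w = w_best                           # keep the best single-step change
    return w

# Synthetic data (made up): the second feature has no effect on Y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = 3.0 * X[:, 0] - 2.0 * X[:, 2]
w = stagewise(X, Y, eps=0.01, n_iter=1000)
```

Once the weights get close to the least-squares solution they oscillate within a few multiples of `eps`, so a smaller `eps` gives a finer result at the cost of more iterations.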

（五）总结

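Like classification, regression predicts a target value; classification predicts discrete variables while regression predicts continuous ones. In most real cases, however, the relationships in the data are complex and a linear model is not a good fit, so other methods, such as nonlinear models, are needed.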

安全业务领域

该如何做大中型 UGC 平台（如新浪微博）的反垃圾（anti-spam）工作？

October 27, 2016 zr9558

来自知乎

帅帅产品经理


Anti-spam

@周源的邀请，我现在才回应，见谅。做反 Spam 工作的人，要禁得住诱惑耐得住寂寞扛得住压力受得了委屈，本想路过算了。但看看互联网上这块内容都比较少，看到有人说自己会说些干货，结果找到很少，做 Anti-spam 的人不多，也时常不受重视，其实交流又非常重要，基于此，我就从产品的角度谈谈这块自己的一点积累，抛砖引玉。

Anti-spam 是数据分析工作的一个方向，非常考验一个产品人员对数据整体和局部的把握，如果对产品无爱，对数据，特别是数据的细节刨根问底不着迷，这事儿做不好。

做 Anti-spam工作，只掌握了数据分析的方法是不够的，还要加入足够的产品市场人员的思维——对用户需求的分析，对用户需求的理解，对人性的理解，多换位思考。有了这些才能真正的把基于XX产品的反 Spam 工作做好。这也是一般做了几年反 Spam 工作后，能力提升瓶颈的关键点。

开始正文，先分4部分：具体工作怎么做，如何进阶，反spam产品经理还需要具备哪些能力，我个人的经验。

1. 具体工作怎么做
1.1 做数据分析
第一次接触这个工作的人，一般压力很大，都是人肉通过后台工具解决spam问题解决不了或者这个问题已经严重的威胁产品安全了，希望你能解决，如果你幸运的解决了一两个问题，更希望你能成为黯淡无光黑夜里的救星。
在很多人指手画脚，投诉各种问题的时候，自己不要乱，一定先只做一件事件——数据分析，抽XX产品10万个数据分析分析。
目的：了解目前整体的情况，对问题严重性，多样性，有足够的认识。
产出：分析报告，列出当前所有问题的分类情况，比例情况，严重性情况，每类呈现出什么特点，给出问题解决的优先级排序。
做完这个事情，整体情况你应该最了解，老板再问你，你就能从全局介绍情况，然后再分类给出优先级。一般老板都关注最关键，最重要，影响最大等关键问题。

1.2 给出XX产品spam的定义
数据分析报告中列出所有问题，而非仅仅是spam问题，因为几乎没有人能在不看大量数据的情况下，就能给出这个产品spam准确的定义，如果有给出的，基本也是拍各种器官拍出来的。
给出XX产品spam的定义很重要，重要的意义有：
1.2.1 明确自己的工作范围
做反spam工作一般开始压力大，万事开头难，千万不要一上来眉毛胡子一把抓，贪多，定位太高，当前具体问题解决不好，赢得不了信任，以后工作很难开展。
跟反spam工作，相关的有很多，黄反监控、账号安全、防攻击防抓站，这每一个都是难度大不好做的工作，反spam没有做好前，不要牵扯精力。
1.2.2 明确自己的工作目标
有了工作范围和工作任务定义，自己的工作目标就容易定出来了，也就是你的KPI，这个很重要，spam问题只要不是瞎子都看得到，不管懂不懂都可以上来说一通自己的策略，如果没有KPI，你就无法证明自己的工作是否有效，无法证明虽然现在问题比较多，但整体情况是在前进，变好的。
1.2.3 指导今后判定问题的标准
今后的工作中，会遇到很多灰色地带和问题，这个定义就是你划分是否属于你工作范围的明灯，也是你在数据分析中，判断具体问题是否是spam的标准。

1.3 发现问题

1.3.1 以spam问题为导向
没啥好说的，初期就是哪里有问题，哪里就有你的分析，研究。
1.3.2 全面掌握spam情况，找出主要问题
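Faced with a product's spam problem, start from the raw data and get a full picture of the types of spam and their proportions. The most effective way is to label a large amount of raw data by hand. This has many benefits: besides revealing the main problems and the overall situation, it gives you first-hand experience of spam posts, shows you what spammers are thinking and which tricks they habitually use, and surfaces many typical examples.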
1.3.3 保持发现问题的敏感性，把握spam流行趋势
大型的数据调研有一定的周期性，获得的结论一般关注的是主要问题，由于spam问题有着很强的时效性，反spam系统一旦出现漏洞，某一类之前可能比例较小的spam问题也很容易泛滥起来，因此保持敏感性，把握流行趋势很重要。方法是：
① 关注spam收益高的spam案例；
This varies from product to product, but every product has such cases.
② 注意用户反馈；
任何监控和机制，总免不了有疏漏，我们也要非常注意用户关于spam问题的投诉、反馈，用户深恶痛绝的spam问题，往往也是危害大，容易流行起来的问题。

1.4 分析问题
一个产品中出现的spam行为，也可以看成是一种用户需求，当然这些用户需求从产品官方角度看是不正常的，都是以伤害绝大多数用户体验为代价，满足小部分人赚钱的需求。
反spam中，分析问题最主要的目的，就是把这些一小撮害群之马的行为从绝大多数正常行为中，抽象化、规律化、用机器能执行的语言分离出来，最终变成反spam策略解决掉。主要方法是：
 找碴，找不容易变的碴
反spam就是找出spam行为与正常用户行为之间的不同规律，把这些不同区分出来，区分的办法价值的高低，主要是两点来衡量：spamer的规律是否易变和我们区分的成本是否很低。机器最容易区别的，spamer变化成本高的不同点，就是我们要的点。
常见的4个方向
① 内容；spam行为都是以获利为目的的，在产品里spam，最终spamer都是要把用户、流量导入到目标网站，一般都会在内容中留下spam特征即利益的出口。
② 行为；凡是spam能获利的地方，spamer都希望更快更多的获利，这就注定了spam行为一定会走发的多、发的快的路线，一定会跟正常用户有区别。
③ 社区属性数据，包括：发贴作者注册时间、作者等级（新用户、平民、会员、认证人员），spam贴子发布的连续性，spam用户发贴在贴子页面停留时间等等
④ 用户之间的交互数据，这个不一一列举。

总之，一种类型的数据，就像素描中的笔触，数据越多，意味着你描述犯罪嫌疑人的线条越多，就越能清晰的把spam辨别出来，如果数据很少，那就很难解决复杂问题。另外，数据多了，也应该注意使用最简单有效的数据，RD会感谢你的。

1.5 解决问题

1.5.1 优先解决主要问题

一段时期只能解决一个问题，优先解决影响面最广危害最大的问题，这样获得的收益最大，同时对其他次要问题的解决也非常有帮助，甚至次要问题在解决主要问题的过程中，也会迎刃而解。

1.5.2 小数据量验证策略效果

当spam问题发现和分析完毕后，一般一个解决策略基本成型，这时，一定要先用小规模的数据验证一下策略的效果后，再进行策略的开发和上线。一个反spam策略无论多么的简单或巧妙，都要用数据去验证效果，验证的方法是抽小量的数据去检验，按照这个策略看是否能获得好的准确率和召回率。

1.5.3 坚持低成本、低误伤、高收益，数据说话的原则

很多反spam问题都不止一个解决办法，哪个低成本、低误伤、高收益我们就走哪条路，无论谁提出的想法或策略，用数据检验没有问题后，才进行下一步工作。
不要一上来就想搞个智能分析打分系统，什么贝叶斯，什么离散系统，先一个问题一个问题的解决，一个策略一个策略的上，等你有基础有积淀，如果还需要做这样的系统，那就再做吧。
智能系统很难做，要很高阶的RD和PM搞基一样的配合，才能孕育的出来的生命。Spam变化很快，做智能系统解决很耗时。

1.5.4 解决问题时，以PM还是RD为主导？

一般RD珍贵，事情又多，PM RD 7 3开吧

具体工作怎么做，讲完了，其实，在这个过程中有非常多的难点，定义如何制定，数据怎么分析，excel怎么用等等，欢迎讨论，有空我再续。

2. 如何在反spam业务上进阶

当各类问题和策略的制定，做到两位数的时候，比较少的会碰到无法解决的具体问题时，就可以开始考虑工作的进阶和深入。

2.1 综合问题把握方向

反spam工作是持久战，spam问题也会一直有不断有，头痛医头脚痛医脚只能解决一时局部的问题，要全面彻底做好反spam工作，把spam问题控制在一个相对低的水平，就必须每隔一段时间分析回顾这段时间所作的工作，总结经验把握下一步方向。
一般方法：
2.1.1 首先在解决具体问题中，不断明确解决反spam问题有哪些办法和角度，把这些角度归纳出几个方向。
2.1.2 回顾这一段时间里，我们都是从哪个方向出发的，这个方向我们做的如何？是否已经做的比较彻底了？是否到了瓶颈的地方？是否存在这个方向解决不了的问题。如果有，是否需要换个角度和思路，是数据少了还是方法不对等等。
2.1.3 分析当前面临的主要问题和spam流行趋势
2.1.4 综合过去的经验和当前遇到的问题，系统的完善上一个方向，同时在适当的时候提出和推进下一个方向的开展。

2.2 反 Spam 人才业务上的培养

PM的人才培养，每个产品经理都有自己的特点，我只说一下反 Spam 业务中，如果培养的话，特别需要注意的问题。

（注释：本文的pm不是product manager，而是product marketing的缩写，意思是基于市场需求的产品，（而非创造需求）翻出来说，是提醒新入的pm，别上来就搞什么管理，先把精力投入到产品研究上，product master比别的都有价值。via UBee）

2.2.1 解决问题的办法真心不止一条，教给新同学方法，不要总觉得自己的想法最靠谱，都要按照你的意思来。
2.2.2 放权，在背后做支持，发挥新同学的主观能动性吧，做的好是他的功劳，做的不好是他的责任，让新同学尽快的负起责任来，有利于新同学更快的独当一面。
2.2.3 没有做数据分析，就不要乱发表具体策略的建议。经验是个好东西，但会犯错，作为资深人员，仍要注意，没有亲自看数据，不要随便定策略，说出来很容易不靠谱。
2.2.4 把试错的机会留给新人。每个资深产品人员想想自己是怎么成长的，犯了多少错，只要不是方向性的错误，尽量把试错的机会留给别人，在新同学每次犯错后引导他们去思考避免，从错误中学到成长。via 百度产品市场部

3. Anti-spam 产品经理需要具备哪些能力
正如之前所述，解决反 Spam 问题的办法有很多种，所以，成功的反 Spam 产品经理各有千秋，从介绍这个行业或圈子的角度，我列一列众多能力中的几种，大家参考，方便大家了解或招聘时参考。不同的环境导致不同的成长路径，不一定非要照此修炼。
3.1 反 Spam 的数据分析能力
这是实际动手的能力，方法论都可以学可以听，数据分析能力我觉得是一个无法传授，只能自己实践的能力，但在实践过程中，也有一些总结提高的方法。
3.1.1 培养数据亲切感
在热爱这个产品的前提下，数据抽出来时，别人看到的是数据，你看到的是数据背后的用户，用户的需求，他们的种种行为总是给你带来惊喜，他们需求得到满足后，总是能给你带来喜悦。
Spam 各种行为背后都是有着各种各样的联系，产品对他们来说是黑匣子，大量数据放在一起的时候，稍微的排一下顺序，规律就会显现出来。
3.1.2 在数据分析时，不要想当然的给用户打上标签，也就是不要过快的判定非黑即白完事儿，而是不停问自己，他为什么要这样做，是一个还是很多个这样，很多个这样一定有原因，这样原因可以先假设，但一定要用数据验证假设，验证的次数越多下次做建设的时候越容易正确。道理很简单：熟能生巧，简单的东西做到极致，你就像在开外挂一样，别人看不出的规律你总能看出来。

（写到这里说说题外话：写到这的时候，我想起的搜索引擎9238，搜索研究院―，一个超级到不能再超级的超级用户、每天至少搜索上千个词、半夜还在用产品、深夜实在累的不行了摊开睡袋睡下，大家早上上班的时候他去洗手间洗脸刷牙。成功的路上没有捷径，听到、看到或者别人教你关于某个问题如何做跟自己完全掌握，之间还有数以百计个小时。）

3.2 关键问题的把握
做产品做久了，一起讨论问题的时候，你会发现总有那么几个人，他们每次指出的问题都是整个问题的关键点，策略型 PM 这点非常重要。

3.3 全局的产品意识
3.3.1 平台型产品不用在产品设计之初特别在意反 Spam 问题，有这个意识觉悟，不要故意做漏洞，犯低级错误即可。

非小型UGC产品，一般都是先有了这个产品，这个产品发展到一定阶段后，才出现 Spam 问题，所以在产品一开始设计之初，很难有人能考虑到反 Spam，即便有人考虑到这个问题，在产品都不知道以后能否火的前提下，反spam的需求也会因为优先级、资源等问题搁置。再则，平台型产品初期就是要以低门槛来抢用户，成功的运气因素也很重要，在早期做相关的功能或限制没有必要。
另外，反spam是问题导向，问题没有发生，你怎么预设问题然后去控制。

产品人员在分析用户需求，设计产品之初，要心无旁骛的只关注如何更好的满足用户需求，一定要抱有N个假设，这样才能把产品做好。这个 N 个假设里，其中两条是：RD 是万能的，只有成本和收益的权衡；Spam 问题不存在无法解决的问题，只有重视程度和阶段的不同。

3.3.2 能深入细节，更能跳出细节看大局部，看整体。

这句话，看起来比较虚。举个项目例子（我不可以细说），比如你解决某类 Spam 问题，时刻想着做这事儿的目的是什么，有时候解决到80%了，是否可以换个方向审视一下，做一做，可能效果更好。

我一直打一个比方——反 Spam 需要几十个策略，交织在一起想一张网，Spam 来了都要过这张网，当你的网策略少比较稀疏的时候，漏洞就大，Spamer 一试就知道你的大概策略，大概阈值，很容易就钻过去，但是当策略较多，网比较密的时候，钻过去的成本就大大提高，这就要求产品经理能细节能整体。

3.3.3 要共赢，维持生态平衡，不要伤及产品和自身。

Spam与营销有时候只有一线之差

反Spam的目标就是把Spam控制在可以接受的范围内，保持生态平衡，利益链条平衡。做的太狠，也会自损忍受阉割之痛，另外，也会有意想不到的麻烦，你懂的。

4. 我个人的一些经验

4.1 以spam问题为导向
4.2 一段时间只解决一个问题
4.3 优先解决范围最广危害最大的 Spam问题
4.4 策略提出后一定要小数据量验证效果
4.5 发挥每个人的积极性、主观能动性
4.6 坚持低成本、低误伤、高收益，数据说话的原则
4.7 Spam问题具有时效性，反spam更要快速有效
4.8 先下猛药再解决误伤
4.9 不要指望一个策略或一组策略解决所有问题
4.10 勿以善小而不为
当成本也很小的时候，一些收益看起来小的策略，在多个策略综合起效的时候，也能带来很大的收益。例如：在策略很多的前提下（这个前提很重要）解决某些问题的时候，关键词匹配也能很有效。
4.11 人工靠不住，尽量多用机器
4.12 对数据要有亲切感，乐意探究数据背后的故事
4.13 机器不够用，人工过来补。注意是应对图片、视频 Spam，机器识别难度很大的问题。
4.14 注意遗漏，连连看、挖掘召回。
4.15 解决问题的路不止一条。
4.16 PM抽数据困难不畏惧。

========================================================
2016 年 8 月 2 日更新

5. 最近一些时间，在反作弊业务上，自我感觉成长不多，有什么新的感悟，我会逐步更新在下面

5.1 如何解决误伤问题

不同大小体量的平台解决思路不一样

流量、用户群比较大的平台，一般的做法是，周期性的评估误伤，误伤比较高的策略下线掉，再去优化策略，优化到一定程度后再上线；

优化策略一般都会面临挖掘新的数据项的问题，在当前仅有的数据项基础上去优化策略难度比较大，需要很认真细致的看数据，思考策略；而挖掘新的数据项会更容易更有效。新数据项的挖掘，产品最好找到多一些的数据项，预防着有些很好的数据项工程师挖的难度比较大，就需要换。

策略评估误伤，下线，优化，再上线，再评估……，这样的循环做多了，需要思考如何把策略制定变得产品或运营人员可配置化，策略上下线自动化的工作。

策略可配置化，主要是要抽象化策略共同的项，由技术做成模块，新的策略就是由这些通用的模块搭配一些条件生成出来。产品或运营可以去组合出新策略，自由调整关键阈值。

体量小一些的平台，用上述的方法，可能效果不好且成本高，有点像用牛刀杀鸡，而且小体量的平台往往更注重误伤（原因：小平台正常用户本来就少，误伤几个就是大事儿；小平台里当个正常用户影响力更大，十几个核心用户出来反馈误伤，感觉就是大新闻），那怎么办呢？

解决方案是：把处理手段做的有层次些。以前反作弊抓住了，都砍头，砍错了，当然压力大；现在反作弊抓住了都把小拇指的指甲剪了，剪错了，压力不大。但是，剪指甲的手段也要达到反作弊的效果。追求什么效果？第一追求：把人和机器区分开，把机器人干掉；做不到的话，退而求其次，打断 spam 的连续性，提高 spam 的成本吧。

这种手段怎么做？
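Don't bother with CAPTCHAs, unless it is a behavior-based one like Google's, which most companies cannot build and are unwilling to pay for, and whose open API is blocked in China.
Three requirements:
Strong product fit: the check should tie into your own product's features. When Facebook verifies that you are the account owner, it asks you to identify whose faces appear in the photos you have uploaded.
Fun: the verification should be engaging, otherwise legitimate users who are wrongly flagged will find it demoralizing. At Weibo I built one that picks 4 of the people you recently followed, shuffles their names, and asks you to match each avatar to the right name.
Strong security: it must genuinely separate humans from machines.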

5.2 如何挖掘数据项

挖掘数据项是反作弊至关重要的一环，数据项多，解决 Spam 的思路就广。

挖掘数据最关键是两点：好的分类方法和注意细节的能力

好的分类方法：我的经验是，基础数据、社区属性数据、用户之间的交互数据；两个维度：显性数据和隐性数据。

细节数据的分析归纳能力，我的两个经验：多思考如果我是正常用户使用产品一般流程是什么；多思考如果我是作弊的我会怎么作弊，以及多研究各种发帖机、注册机。

宋一松 Facebook，Uber


我觉得如何应对spam可以很明显的展示出一个公司的实力。原因有两个：

如何通过技术手段来做主动的自动化运营，而不是通过人工手段去被动地应对每一个突发事件，很考验一个公司的技术能力。
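Unless it is a response to a "major spam outbreak", fighting spam does nothing positive for the company's short-term KPIs (and sometimes hurts them). So why to fight spam, how to fight it, and how far to go all reflect the company's product values.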

Next, let me describe how Facebook, a very large UGC platform, handles these two questions.

1. 技术化运营
Facebook有一套专门Anti-spam的基于机器学习的系统，叫Sigma。

对于每一个用户在Facebook网站上的每一个动作，比如发帖/点赞/评论/私信/好友申请，Sigma都会实时预测其行为的「可疑」程度。
这个「可疑」程度具体又分为多个子维度，包括假帐号，被盗号，刷榜刷赞，发钓鱼帖等。
针对每个维度，Sigma都会基于机器学习生成一个可疑值，数值高的就会自动触发对应的规则系统：删号，删帖，发邮件或短信来要求用户确认帐号等。

相比起用于精准广告，智能排序，个性化推荐一类的机器学习系统，Sigma最大的不同就是响应速度要快，在各个层面都要快：

模型的训练必须是online的，用实时的数据。否则新出现的Spam没有第一时间体现在数据里，再好的系统也没用。
「学习率」必须要快。相对的，「准确率」就没有那么重要。一个2%失误率的算法在当天就控制住了spam，让它只影响了1000个用户，远好于一个失误率只有1%，但到了第二天才学会正确识别spam，以至于让它影响了10万个用户的算法。
模型和规则的部署要快。新的模型出来了，或者万不得已手动加一个新规则，你如何把新的模型和规则部署到服务器上去？在这十万火急争分夺秒的时刻，你总不能让机器们轮流着重启一遍吧。

在上述的这些独特的技术问题之外，还有更重要的一点值得再次强调一下：Sigma不是一个独立的模块。它在每个用户的每个行为都会被触发，因此它与整个Facebook技术系统的结合要极为紧密，涉及各个环节。这对规模不大的产品来说不是什么难事，但如果接触过类似FB这种一个网站包含各种复杂功能的系统，应该能理解工程上的挑战吧。对应的，如果能把这件事做好，体现的也就不仅仅是anti-spam什么的，而是公司整体的技术工程能力了。

2. 产品的价值观
为什么要anti-spam？那些引诱用户去钓鱼网站的自然要解决，但那些买僵尸粉来给自己刷赞的呢？把他们做掉了，短期内产品的数据反而会降，那要不要做呢？如果做的话，目的又是什么呢？

是为了维护社区的质量，无论这会怎样影响短期数据。

想明白这些，对「spam」的定义就会宽泛很多。对应的，也就不能仅依靠anti-spam一个团队来做工作，而是要求公司内的每一个产品团队都要保持对质量的关注。

举个例子，我在Facebook时做的是Newsfeed排序，离开公司前的最后一个项目，就是和广义上的spam有关：抓出标题党。

很多公众号/营销号/蓝V号爱做标题党，这事在Facebook上也不例外。然而，在FB这侧，通过对比一个分享的点击率和平均阅读时长，很容易找出那些典型的标题党。在新鲜事排序上对这些标题党做降权处理，减少他们在新鲜事上的曝光量，从而控制了低质量内容在社区内的传播。

同理，我们还会做掉骗赞的和骗转发的。

可以看出来，做这些工作对社区绝对是好的，但对宏观数据完全没帮助，反而可能不利于公司与公众号运营者们的关系。某种程度上，anti-spam天然地与KPI文化相违背。因此，anti-spam最终做得好不好，取决于公司自上向下的产品价值观：

到底是冲数据，还是做正确的事？

————
附：
[1]: 关于Sigma的paper: http://research.microsoft.com/en-us/projects/ldg/a10-stein.pdf

aviat 淫欲、暴食、贪婪、怠惰、暴怒、嫉妒、傲慢


搜索的spam、微博的spam、论坛的spam、软件客户端的spam不太一样。
本经验部分来自于客户端spam的个人经验。
——————
补充几点具体的：
1.ip聚集,地理位置异常，细分视图毛刺
2.恶意id属性信息分析
3.恶意行为轨迹分析
4.流水log小样本抽样，点定人肉观察评估
5.价值链分析
6.智能预警（合理划分低于标准、正常、异常三个维度即可，不说专业词汇了）
7.不要什么都依赖验证码，另没有不会被破解的验证码
##########
提高对方成本，降低自己成本
抓大放小
事前控制
实时限制
事后打击
三十六计若干都可用
《失控》第二章吧，机器人那段。小而独立，但有用，可复用，可被组合。求全，求系统，你就死了。人家是钻空。

裴立（Pz）入门级PM http://www.lockon.cc


以论坛中的反垃圾信息为例，从具体策略上说说自己的看法。

1.对每一个帐号都设定打分项，主要从帐号发布的内容、帐号的行为、与帐号的关联因素三方面考虑。
内容因素：
首先，垃圾帐号发布的内容多半会提供一个外站的链接或者手机、QQ号。因此一个帐号连续多次发布的信息中如果有重复的链接/数字出现，他有极高的可能性是一个垃圾帐号。
其次，每个论坛都会有自己的敏感词库，如果不是那种最ugly的敏感词库，至少应该会有三层级别：
a.直接删除内容并禁言帐号；
b.需要对内容做先审后发的处理同时监控帐号其他发布的内容；
c.内容可以先发后审，帐号不作处理。
对于前两种情况，垃圾信息能造成的危害被降到了最低。第三种情况，就需要结合其他因素一起来判断。

行为因素：
这里举一个例子来说，垃圾帐号因为是趋利，所以在行为上一定会异于普通的正常用户。比如在论坛上它会一直不停地发帖，而正常用户都是看帖多发帖少。这就给我们提供一个参考。通过post数量和浏览的url数量比值我们就能找到垃圾帐号和正常帐号的差异。

其他的关联因素：
看到之前的回答中有提到不少，这里补充一个：帐号所使用的主机id。垃圾帐号通常是批量注册的，因此一个垃圾账号背后来自同一个ip、同一个主机的其他帐号往往也都是垃圾帐号。但是这里提出一点：不要轻易封掉ip或主机，一方面是会有误伤，另一方面这种简单的封杀做法会让你的反垃圾体系变成马其诺防线，一旦被突破，只会抬高你的反垃圾成本。

2.基于上述三方面的考虑后，我们已经拥有评估垃圾帐号可能性的几个因素了，基于三个因素对帐号做评估。可以使用一些比较智能的算法，比如贝叶斯公式，但这需要你能准确地统计出垃圾帐号中各个因素的占比系数，这个模型一旦建立起来，整个反垃圾系统需要通过不断地机器学习来对系数做调整，才可能应对垃圾帐号即时的变化。
当然，你可以有比较简单的做法，只要某个帐号具备了其中的若干因素，就可以怀疑它是垃圾帐号了。接下来就看是否需要借助人为的监控行为做进一步识别了。

3.验证码和反垃圾策略的关系
必须明确的一点是：验证码本身只能用来防住机器人，防不住人，更何况破解技术层出不穷，实际上抵挡机器人的效果也不完全能让人满意。即使你对自己的验证码有把握，那么你也许能挡得住一部分机器人，但并不能把所有垃圾帐号都防住。
所以验证码实际上只能算抵挡垃圾信息的第一道防线，在验证码之后，一定要有合理的反垃圾策略。

4.反垃圾工作的确是一项长期的工作
理论上来说，当垃圾信息的发布成本高于所能得到的收获时，垃圾信息会减少，这些发布垃圾信息的人也会选择离开，转而寻找其他的社区。但事实上，垃圾信息行为与反垃圾行为永远都是一场你来我往的战斗，随时注意网站的数据变化，及时找到典型的垃圾模型。才能巩固已有的战果。

iammutex 彩石手机CTO – 做最好的中老年智能手机


贴一个两年多以前的文章吧，相信并不完全过时。

——————————————————-
《谈谈反垃圾》
由于常年从事用户产品的开发工作，工作中难免遇到过各种各样反垃圾的事，一回生二回熟，在摸爬滚打的对抗中，也摸出了一些门道，此文算是对个人经验的总结，非专业视角的分享。
这里说的垃圾主要针对诸如垃圾评论，机器注册，机器刷接口等等。

反垃圾很重要的两步是：垃圾识别，垃圾处理（包括预防）。
【垃圾识别】
对于判别垃圾，通常有下面一些方法。
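1. Content-based detection
For content-based detection, the most direct approach is keyword filtering: content containing words such as “开发票” (selling invoices) or “激情视频” (adult videos) is very likely spam, and string matching tells us whether such keywords are present. There is a catch. Checking whether a piece of content contains one particular word is easy; many algorithms do it, such as the classic KMP algorithm, and the built-in string search of many languages is also efficient. Checking whether the content contains any of a large set of keywords is harder: looping over every keyword and matching each one in turn scales poorly, so that approach is out.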

这里推荐两个方法，一个是基于trie树的关键词树，具体有没有开源实现的不清楚，我们使用中是自己基于Memcached改了一个，保留Memcached的简单协议，修改内部逻辑为trie树的查找，简单来说就是将关键词做字节切分，建立一棵trie树，判断一段话中是否包含这些关键词，只需要从根节点向下检索即可。
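A minimal sketch of trie-based keyword matching, for illustration only. The class name `KeywordTrie` and its methods are hypothetical, and the trie here is built per character in process memory, whereas the author describes a byte-level trie served through a modified Memcached.

```python
class KeywordTrie:
    """Character-level trie that reports every keyword occurring in a text."""

    def __init__(self, keywords):
        self.root = {}
        for word in keywords:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node[None] = word  # the None key marks the end of a keyword

    def find_all(self, text):
        """Return (start_index, keyword) for each match, scanning every offset."""
        hits = []
        for start in range(len(text)):
            node = self.root
            for i in range(start, len(text)):
                node = node.get(text[i])
                if node is None:
                    break  # no keyword continues with this character
                if None in node:
                    hits.append((start, node[None]))
        return hits

trie = KeywordTrie(["开发票", "激情视频", "发票"])
matches = trie.find_all("代开发票")
```

Each starting offset walks the trie once, so the cost per offset depends on the longest keyword rather than on how many keywords are loaded. An Aho-Corasick automaton would remove the restart at every offset, at the cost of a more involved build step.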

另外一个方法，是利用贝叶斯算法来进行垃圾概率计算。贝叶斯算法这里就不多展开说了，其原理简单来说就是，收集一组正常内容和一组垃圾内容，用此内容对系统进行训练，让系统能够知道每个词在正常内容中和是在垃圾内容中的概率。做完训练后，再有一段新内容过来，可以直接对其中的词进行综合加权计算，得出整段内容是正常或垃圾的概率。
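2. Detection based on special content
The above treats content as arbitrary text, but there are labor-saving shortcuts. Spam typically has some of these traits: it carries links (to drive users to the spammer's site), images (to stand out), and digit strings (QQ numbers, phone numbers, and so on). String matching on these traits is a good method and, in my experience, fairly effective. One caveat: attackers usually disguise links and digit strings with variants rather than writing them plainly, for example with Chinese numerals and letters, or by exploiting how vast the Unicode character set is, as in 1҉2҉3҉4҉5҉6҉7҉8҉9҉0҉. Matching rules therefore need to tolerate such variants, for example by converting everything to plain digits before matching.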
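One way to fold such variants back to plain digits is sketched below using Python's standard `unicodedata` module. The function name `normalize_digits` and the small `CJK_DIGITS` table are assumptions for illustration; a production rule set would need a much larger mapping.

```python
import unicodedata

# Hypothetical mapping from Chinese numerals to ASCII digits.
CJK_DIGITS = {"零": "0", "〇": "0", "一": "1", "二": "2", "三": "3", "四": "4",
              "五": "5", "六": "6", "七": "7", "八": "8", "九": "9"}

def normalize_digits(text):
    """Fold common digit disguises into ASCII digits before pattern matching."""
    # NFKC maps full-width and circled digits to their ASCII equivalents.
    text = unicodedata.normalize("NFKC", text)
    out = []
    for ch in text:
        # Drop combining and enclosing marks, such as U+0489 in the example above.
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue
        out.append(CJK_DIGITS.get(ch, ch))
    return "".join(out)

decorated = normalize_digits("1\u04892\u04893\u0489")
fullwidth = normalize_digits("\uff11\uff12\uff13")
chinese = normalize_digits("一三八")
```

Running the normalizer before the digit-string rules means the rules themselves only have to recognize plain ASCII digits.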
3.基于请求方式的识别另外，垃圾毕竟是通过我们暴露给用户的各种接口进来的，而攻击者请求我们接口的方法难免与真实用户有差距。比如说，正常用户会先进入注册页面，再填表单，再提交注册按钮。但是恶意注册程序，很可能是不会先访问你的注册页面的，而是直接请求注册接口（利用这一点我们就可以作文章，比如对用户访问路径进行记录，如果未访问页面就直接请求接口的，判为恶意请求）。另外就是攻击者的http头信息，比如最常见的，UA字段是否是cUrl或者其它非正常浏览器。或者像很多前端团队都有在请求url上添加随机数的习惯，这样本来是为了避免后端缓存，但有些低水平的垃圾请求会原样的每次都用同一个随机数，这就很容易识别他们了。总之，从http请求的层面可识别的东西很多，只要攻击者伪装有一点纰漏，咱们就可以抓到他的尾巴。
4.基于请求主体的识别如果我们遇到UGC内容的垃圾攻击，那么发起请求的肯定得是一个正常用户（如果是匿名社区请忽略此条）。这时候，内容发送主体的信用级别，就可以转移为对信息质量的判别上来。就像我们都懂的，某些大的平台也会对不同用户执行不同的审核策略（比如都知道的先审后放，还是先放后审），这也需要我们对内容发布主体有充分的信用分级。比如，一个注册24小时内的用户相对一个注册三年发帖无数的用户来说，信用等级就低得多。
5.基于内容载体的识别垃圾内容之所以能形成黑色产业链，通常绝不会是恶作剧玩玩而已，所以跟互联网最传统的广告模式一样，垃圾也希望能够多曝光，多赚点击。那怎么做呢，通常就是选择在用户扎堆的地方去发。比如时下热门的电视剧，热点的新闻事件下面就是垃圾流量的公共厕所了。另外，在一些政治军事内容版块发反动言论，在一些娱乐美女内容版块发成人网站，这些也都是常用的路数。总的来说就是，同样一条内容，在热门版块发布，更有可能会是垃圾内容，需要我们更多的关注。
【垃圾处理】
好吧，上面说了一大堆的方法去给内容和用户评级，以便我们能够对一个用户或者一段发布的内容进行预估，那么，在我们了解了一个用户或者一段内容是否可能是垃圾后，我们脑子里首先蹦出来的可能就是：封杀！但实际处理方法可能不仅封杀一种，下面我们就来探讨一下对垃圾攻击的几种处理方法。
1.制定封杀方法如果我们已经确切掌握了垃圾流量的规律，比如某一个IP或一组IP，比如同一组参数，比如内容总是包含某网址的变体，那么我们就可以直接大开杀戒，用这些特征直接进行封杀操作。
2.制定审核级别顺着上面的思路，我们可以对不同的用户和内容施加不同的审核策略，比如是直接放行、先审后放、先放后审还是直接毙掉。我们还可以对用户施加不同的限制策略，比如新注册用户每天只能发3条内容（在审核通过一条后又可以再发）。
3.工作量证明工作量证明是一个在反垃圾邮件中的方法，最近火得不得了的比特币，工作量证明也是其核心理论支柱之一。通过引入工作量证明方法，我们甚至可以不用对垃圾流量进行判别。只要加一道隐形的门槛，就足以让很多攻击者却步。

举个例子，如果攻击者原来只需要请求一次接口就能够发布一条信息，现在我们需要他在接口请求之前先填一个验证码，他就没那么容易自动狂发内容了。上面这个逻辑大家都能理解，也确实能奏效，但是很抱歉，这样做很伤用户体验，产品经理说不行。

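Let's try something else: make the user perform roughly 100,000 MD5 computations before each request. For an ordinary user's machine, doing this once in a while is nothing. An attacker, however, needs a single machine to publish a large volume of content, and if every post costs 100,000 MD5 computations, that is a serious strain on the attacker's hardware, which may be enough to make them give up attacking your site.

Of course, if we used the 100,000-MD5 scheme literally, the server would have to do the same amount of work to verify each request, which would strain our own servers just as much. The example above is only meant to convey the idea. The usual approach is this: the server issues a random string s1, and the client must find a number d such that appending d to s1 gives a new string s2 whose MD5 digest begins with five 0s. To meet this condition the client has to loop, trying candidates until it finds a suitable d, while the server verifies the answer with a single MD5. This is what is called proof of work.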
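The challenge described above can be sketched with the standard `hashlib` module. The function names `solve` and `verify`, the decimal encoding of d, and the default of four leading zeros (chosen so the example runs quickly) are assumptions for illustration; the text uses five.

```python
import hashlib
from itertools import count

def _digest(s1, d):
    # s2 is the challenge string with the candidate number appended.
    return hashlib.md5(f"{s1}{d}".encode("utf-8")).hexdigest()

def solve(s1, zeros=4):
    """Client side: search for d so that md5(s1 + d) starts with `zeros` zeros."""
    prefix = "0" * zeros
    for d in count():
        if _digest(s1, d).startswith(prefix):
            return d

def verify(s1, d, zeros=4):
    """Server side: a single MD5 checks the client's answer."""
    return _digest(s1, d).startswith("0" * zeros)

challenge = "random-string-from-server"  # hypothetical value of s1
answer = solve(challenge, zeros=3)
ok = verify(challenge, answer, zeros=3)
```

Each extra leading hex zero multiplies the expected client work by 16 while server verification stays at one hash, which is the asymmetry the scheme relies on. The server also needs to issue a fresh s1 per request, otherwise an attacker can reuse one solved answer.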
4.请求签名请求签名也是一个省时省力的好方法，前后端约定一种hash算法（最好是自创的），前端对请求内容进行签名，后端验证签名。通过对前端代码进行混淆，让攻击者很难实现你的hash算法。增加他的攻击成本。
5.查出源头发垃圾内容的攻击者通常都不会用自己机器或服务器IP（要不你就赚到了，直接封IP就行了），而是用手里控制的肉鸡或者扫描来的http代理来做，其实识别肉鸡和代理也比较简单，最直接的方法就是看看开没开着80、8080、3128等端口。这是一般代理的常用接口，另外一般情况下被拿下的肉鸡也都是web接口防范不严造成的。如果是普通http代理，很可能会很有良心的通过x-forward-for，或者x-real-ip等http头信息把源ip传给你，而对于肉鸡找到肉鸡，如果你的黑客水平够，你可以直接也黑上去，看看是哪个IP在控制它，从而查到真实IP。查到攻击者的真实IP后如何处理就看你的了，是联系攻击方和平解决，直接报案还是把攻击者给黑了。那就看个人想法和水平了。
【策略与战略】
上面说了一堆战术层面的东西，下面聊一点战略上的原则。
1.反垃圾是一场成本的较量反垃圾，其实不是一项技术竞赛，更不像是个人恩怨，更多的是成本较量。如果你的网站流量大，但防护措施做得不够，那垃圾流量过来是必然的。我们所有的反垃圾策略只有一个目的，就是增加攻击者的成本，当成本上升到某一阈值时，攻击者会发现在你的网站玩太费劲，投入产出比太低，于是会去找同类型的其它网站。所以就像狮子和羊群一样，只要不是跑得最慢的那一只，就能逃过狮子的爪牙。
2.多数攻击者痛点在IP无论是用代理，还是肉鸡，攻击者的IP资源总是比较有限的，所以收集到足够多的IP进行封杀，通常能够解决大问题。
3.实而示之虚上面说反垃圾是一场成本较量，但在我们实际操作中，却要尽量避免真正的较上劲。比如当你发现了恶意请求的规律，如果你选择直接对此规则的请求返回404，那么攻击者也会马上知道它的攻击特征被你发现了，从而迅速进行升级对抗。但是如果你只是让他的操作无实际效果，但还照样返回“注册成功”、“发布成功”，那么攻击者可能会麻痹大意很长时间才会发现。正如《孙子兵法》中说的：“实而示之虚”。实际上在垃圾与反垃圾的较量中，最忌讳的就是无止境的军备竞赛。
4.发现特征之钓鱼策略有的攻击者很高明，能够将自己的请求伪装得得正常用户一模一样，所有的http头信息，请求参数，都完全仿真。对于这样的攻击者，我们有什么办法抓到他的尾巴呢。这里给大家介绍一种钓鱼策略。首先你修改一下你的网站的前后端逻辑，比如前端增加某一个参数，后端判断没有这个参数请求就会失败，这时候攻击者马上就会发现自己请求失败了，通过对正常请求的抓包，他很快发现你增加了一个参数，那他会跟着进行修改。这时我们让他爽几天。然后偷偷地把这个无关紧要的参数撤掉。这时候，所有正常用户请求中都不会有这个参数了，但是，攻击者不会时时关注我们的请求参数，所以还会在一段时间内，继续加上这个参数请求。这时钓鱼成功，正是我们的好机会，在这段时间内，我们可以尽量收集垃圾的IP，发布账号等信息。等收集到一定程度一起封掉（当然，这里的封掉也不要暴力封掉，而是让看起来没有被封掉）。
总的来说，反垃圾工作其实不是一个技术活，要求更多的是细致、谨慎与耐心，希望上面东西对你有用。

数据挖掘与机器学习

如何在 Kaggle 首战中进入前 10%

October 25, 2016 zr9558

Introduction

Kaggle 是目前最大的 Data Scientist 聚集地。很多公司会拿出自家的数据并提供奖金，在 Kaggle 上组织数据竞赛。我最近完成了第一次比赛，在 2125 个参赛队伍中排名第 98 位（~ 5%）。因为是第一次参赛，所以对这个成绩我已经很满意了。在 Kaggle 上一次比赛的结果除了排名以外，还会显示的就是 Prize Winner，10% 或是 25% 这三档。所以刚刚接触 Kaggle 的人很多都会以 25% 或是 10% 为目标。在本文中，我试图根据自己第一次比赛的经验和从其他 Kaggler 那里学到的知识，为刚刚听说 Kaggle 想要参赛的新手提供一些切实可行的冲刺 10% 的指导。

本文的英文版见这里。

Kaggler 绝大多数都是用 Python 和 R 这两门语言的。因为我主要使用 Python，所以本文提到的例子都会根据 Python 来。不过 R 的用户应该也能不费力地了解到工具背后的思想。

首先简单介绍一些关于 Kaggle 比赛的知识：

不同比赛有不同的任务，分类、回归、推荐、排序等。比赛开始后训练集和测试集就会开放下载。
比赛通常持续 2 ~ 3 个月，每个队伍每天可以提交的次数有限，通常为 5 次。
比赛结束前一周是一个 Deadline，在这之后不能再组队，也不能再新加入比赛。所以想要参加比赛请务必在这一 Deadline 之前有过至少一次有效的提交。
一般情况下在提交后会立刻得到得分的反馈。不同比赛会采取不同的评分基准，可以在分数栏最上方看到使用的评分方法。
反馈的分数是基于测试集的一部分计算的，剩下的另一部分会被用于计算最终的结果。所以最后排名会变动。
LB 指的就是在 Leaderboard 得到的分数，由上，有 Public LB 和 Private LB 之分。
自己做的 Cross Validation 得到的分数一般称为 CV 或是 Local CV。一般来说 CV 的结果比 LB 要可靠。
新手可以从比赛的 Forum 和 Scripts 中找到许多有用的经验和洞见。不要吝啬提问，Kaggler 都很热情。

那么就开始吧！

P.S. 本文假设读者对 Machine Learning 的基本概念和常见模型已经有一定了解。 Enjoy Reading!

General Approach

在这一节中我会讲述一次 Kaggle 比赛的大致流程。

Data Exploration

在这一步要做的基本就是 EDA (Exploratory Data Analysis)，也就是对数据进行探索性的分析，从而为之后的处理和建模提供必要的结论。

通常我们会用 pandas 来载入数据，并做一些简单的可视化来理解数据。

Visualization

通常来说 matplotlib 和 seaborn 提供的绘图功能就可以满足需求了。

比较常用的图表有：

查看目标变量的分布。当分布不平衡时，根据评分标准和具体模型的使用不同，可能会严重影响性能。
对 Numerical Variable，可以用 Box Plot 来直观地查看它的分布。
对于坐标类数据，可以用 Scatter Plot 来查看它们的分布趋势和是否有离群点的存在。
对于分类问题，将数据根据 Label 的不同着不同的颜色绘制出来，这对 Feature 的构造很有帮助。
绘制变量之间两两的分布和相关度图表。

这里有一个在著名的 Iris 数据集上做了一系列可视化的例子，非常有启发性。

Statistical Tests

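We can run statistical tests on the data to check whether certain hypotheses are significant. Visualization alone usually yields fairly clear conclusions, but quantitative results are always preferable. In real data, however, the samples are often not i.i.d., so take care in choosing the type of test and in interpreting significance.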

在某些比赛中，由于数据分布比较奇葩或是噪声过强，Public LB 的分数可能会跟Local CV 的结果相去甚远。可以根据一些统计测试的结果来粗略地建立一个阈值，用来衡量一次分数的提高究竟是实质的提高还是由于数据的随机性导致的。

Data Preprocessing

大部分情况下，在构造 Feature 之前，我们需要对比赛提供的数据集进行一些处理。通常的步骤有：

有时数据会分散在几个不同的文件中，需要 Join 起来。
处理 Missing Data。
处理 Outlier。
必要时转换某些 Categorical Variable 的表示方式。
有些 Float 变量可能是从未知的 Int 变量转换得到的，这个过程中发生精度损失会在数据中产生不必要的 Noise，即两个数值原本是相同的却在小数点后某一位开始有不同。这对 Model 可能会产生很负面的影响，需要设法去除或者减弱 Noise。

这一部分的处理策略多半依赖于在前一步中探索数据集所得到的结论以及创建的可视化图表。在实践中，我建议使用 iPython Notebook 进行对数据的操作，并熟练掌握常用的 pandas 函数。这样做的好处是可以随时得到结果的反馈和进行修改，也方便跟其他人进行交流（在 Data Science 中 Reproducible Results 是很重要的)。

下面给两个例子。

Outlier

这是经过 Scaling 的坐标数据。可以发现右上角存在一些离群点，去除以后分布比较正常。

Dummy Variables

对于 Categorical Variable，常用的做法就是 One-hot encoding。即对这一变量创建一组新的伪变量，对应其所有可能的取值。这些变量中只有这条数据对应的取值为 1，其他都为 0。

如下，将原本有 7 种可能取值的 Weekdays 变量转换成 7 个 Dummy Variables。

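Note that when a variable has a very large number of possible values (hundreds or thousands of categories), this simple approach no longer works well. There is no universal method for that case, but I describe one option in the next subsection.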

Feature Engineering

有人总结 Kaggle 比赛是 “Feature 为主，调参和 Ensemble 为辅”，我觉得很有道理。Feature Engineering 能做到什么程度，取决于对数据领域的了解程度。比如在数据包含大量文本的比赛中，常用的 NLP 特征就是必须的。怎么构造有用的 Feature，是一个不断学习和提高的过程。

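In general, when a variable intuitively seems helpful for the target, it can be used as a Feature. The simplest way to check whether it is actually effective is to get an intuitive feel from plots.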

Feature Selection

总的来说，我们应该生成尽量多的 Feature，相信 Model 能够挑出最有用的 Feature。但有时先做一遍 Feature Selection 也能带来一些好处：

Feature 越少，训练越快。
有些 Feature 之间可能存在线性关系，影响 Model 的性能。
通过挑选出最重要的 Feature，可以将它们之间进行各种运算和操作的结果作为新的 Feature，可能带来意外的提高。

Feature Selection 最实用的方法也就是看 Random Forest 训练完以后得到的Feature Importance 了。其他有一些更复杂的算法在理论上更加 Robust，但是缺乏实用高效的实现，比如这个。从原理上来讲，增加 Random Forest 中树的数量可以在一定程度上加强其对于 Noisy Data 的 Robustness。

看 Feature Importance 对于某些数据经过脱敏处理的比赛尤其重要。这可以免得你浪费大把时间在琢磨一个不重要的变量的意义上。

Feature Encoding

这里用一个例子来说明在一些情况下 Raw Feature 可能需要经过一些转换才能起到比较好的效果。

假设有一个 Categorical Variable 一共有几万个取值可能，那么创建 Dummy Variables 的方法就不可行了。这时一个比较好的方法是根据 Feature Importance 或是这些取值本身在数据中的出现频率，为最重要（比如说前 95% 的 Importance）那些取值（有很大可能只有几个或是十几个）创建 Dummy Variables，而所有其他取值都归到一个“其他”类里面。
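A small sketch of the frequency-based variant of this idea, using only the standard library. The function name `group_rare_categories`, the `coverage` parameter, and the `"OTHER"` label are hypothetical; the importance-based variant would rank values by model Feature Importance instead of by count.

```python
from collections import Counter

def group_rare_categories(values, coverage=0.95, other="OTHER"):
    """Keep the most frequent values until they cover `coverage` of the rows;
    map every remaining value to a single `other` bucket."""
    counts = Counter(values)
    total = len(values)
    keep, covered = set(), 0
    for value, n in counts.most_common():
        if covered / total >= coverage:
            break
        keep.add(value)
        covered += n
    return [v if v in keep else other for v in values]

# Toy column (made up): two frequent values and a long tail.
column = ["a"] * 90 + ["b"] * 6 + ["c"] * 2 + ["d"] * 2
grouped = group_rare_categories(column, coverage=0.95)
```

After grouping, the column has only a handful of distinct values, so creating Dummy Variables for it becomes practical again.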

Model Selection

准备好 Feature 以后，就可以开始选用一些常见的模型进行训练了。Kaggle 上最常用的模型基本都是基于树的模型：

Gradient Boosting
Random Forest
Extra Randomized Trees

以下模型往往在性能上稍逊一筹，但是很适合作为 Ensemble 的 Base Model。这一点之后再详细解释。（当然，在跟图像有关的比赛中神经网络的重要性还是不能小觑的。）

SVM
Linear Regression
Logistic Regression
Neural Networks

以上这些模型基本都可以通过 sklearn 来使用。

当然，这里不能不提一下 Xgboost。Gradient Boosting 本身优秀的性能加上Xgboost 高效的实现，使得它在 Kaggle 上广为使用。几乎每场比赛的获奖者都会用 Xgboost 作为最终 Model 的重要组成部分。在实战中，我们往往会以 Xgboost 为主来建立我们的模型并且验证 Feature 的有效性。顺带一提，在 Windows 上安装 Xgboost 很容易遇到问题，目前已知最简单、成功率最高的方案可以参考我在这篇帖子中的描述。

Model Training

在训练时，我们主要希望通过调整参数来得到一个性能不错的模型。一个模型往往有很多参数，但其中比较重要的一般不会太多。比如对 sklearn 的 RandomForestClassifier 来说，比较重要的就是随机森林中树的数量 n_estimators 以及在训练每棵树时最多选择的特征数量 max_features。所以我们需要对自己使用的模型有足够的了解，知道每个参数对性能的影响是怎样的。

通常我们会通过一个叫做 Grid Search 的过程来确定一组最佳的参数。其实这个过程说白了就是根据给定的参数候选对所有的组合进行暴力搜索。

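from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
rfr = RandomForestRegressor(random_state=2016)  # X_train and y_train are assumed to be defined already
param_grid = {'n_estimators': [300, 500], 'max_features': [10, 12, 14]}
model = GridSearchCV(estimator=rfr, param_grid=param_grid, n_jobs=1, cv=10, verbose=20, scoring='neg_root_mean_squared_error')  # built-in negated RMSE stands in for the original custom RMSE scorer
model.fit(X_train, y_train)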

顺带一提，Random Forest 一般在 max_features 设为 Feature 数量的平方根附近得到最佳结果。

这里要重点讲一下 Xgboost 的调参。通常认为对它性能影响较大的参数有：

eta：每次迭代完成后更新权重时的步长。越小训练越慢。
num_round：总共迭代的次数。
subsample：训练每棵树时用来训练的数据占全部的比例。用于防止 Overfitting。
colsample_bytree：训练每棵树时用来训练的特征的比例，类似 RandomForestClassifier 的 max_features。
max_depth：每棵树的最大深度限制。与 Random Forest 不同，Gradient Boosting 如果不对深度加以限制，最终是会 Overfit 的。
early_stopping_rounds：用于控制在 Out Of Sample 的验证集上连续多少个迭代的分数都没有提高后就提前终止训练。用于防止 Overfitting。

一般的调参步骤是：

将训练数据的一部分划出来作为验证集。
先将 eta 设得比较高（比如 0.1），num_round 设为 300 ~ 500。
用 Grid Search 对其他参数进行搜索
逐步将 eta 降低，找到最佳值。
以验证集为 watchlist，用找到的最佳参数组合重新在训练集上训练。注意观察算法的输出，看每次迭代后在验证集上分数的变化情况，从而得到最佳的 early_stopping_rounds。

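import xgboost as xgb
from sklearn.model_selection import train_test_split

# Hold out 30% of the training data as a validation set.
X_dtrain, X_deval, y_dtrain, y_deval = train_test_split(X_train, y_train, random_state=1026, test_size=0.3)
dtrain = xgb.DMatrix(X_dtrain, y_dtrain)
deval = xgb.DMatrix(X_deval, y_deval)
watchlist = [(deval, 'eval')]
params = {
    'booster': 'gbtree',
    'objective': 'reg:squarederror',  # named 'reg:linear' in older xgboost releases
    'subsample': 0.8,
    'colsample_bytree': 0.85,
    'eta': 0.05,
    'max_depth': 7,
    'seed': 2016,
    'verbosity': 1,  # replaces the older 'silent' flag
    'eval_metric': 'rmse'
}
# Train for up to 500 rounds; stop once validation RMSE has not improved for 50 rounds.
clf = xgb.train(params, dtrain, num_boost_round=500, evals=watchlist, early_stopping_rounds=50)
pred = clf.predict(xgb.DMatrix(df_test))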

最后要提一点，所有具有随机性的 Model 一般都会有一个 seed 或是 random_state 参数用于控制随机种子。得到一个好的 Model 后，在记录参数时务必也记录下这个值，从而能够在之后重现 Model。

Cross Validation

Cross Validation 是非常重要的一个环节。它让你知道你的 Model 有没有 Overfit，是不是真的能够 Generalize 到测试集上。在很多比赛中 Public LB 都会因为这样那样的原因而不可靠。当你改进了 Feature 或是 Model 得到了一个更高的 CV 结果，提交之后得到的 LB 结果却变差了，一般认为这时应该相信 CV 的结果。当然，最理想的情况是多种不同的 CV 方法得到的结果和 LB 同时提高，但这样的比赛并不是太多。

在数据的分布比较随机均衡的情况下，5-Fold CV 一般就足够了。如果不放心，可以提到 10-Fold。但是 Fold 越多训练也就会越慢，需要根据实际情况进行取舍。
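To make the K-Fold procedure concrete, here is a minimal NumPy sketch that returns one RMSE per fold. The helper name `kfold_rmse`, the `fit`/`predict` callables, and the least-squares stand-in model are assumptions for illustration; in practice sklearn's `KFold` together with a real model would fill these roles.

```python
import numpy as np

def kfold_rmse(X, y, fit, predict, n_folds=5, seed=2016):
    """Shuffle the rows once, split them into n_folds parts, and score each
    part with a model trained on the remaining parts."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = np.random.default_rng(seed).permutation(len(y))
    scores = []
    for holdout in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, holdout)
        model = fit(X[train], y[train])
        err = predict(model, X[holdout]) - y[holdout]
        scores.append(np.sqrt(np.mean(err ** 2)))
    return np.array(scores)

def fit(X, y):
    # Stand-in model: ordinary least squares.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def predict(theta, X):
    return X @ theta

# Synthetic data (made up): a linear signal plus noise with standard deviation 0.1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
scores = kfold_rmse(X, y, fit, predict, n_folds=5)
```

The spread of `scores` across folds gives a rough sense of how much of a score change is noise, which helps when deciding whether an improvement is real.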

很多时候简单的 CV 得到的分数会不大靠谱，Kaggle 上也有很多关于如何做 CV 的讨论。比如这个。但总的来说，靠谱的 CV 方法是 Case By Case 的，需要在实际比赛中进行尝试和学习，这里就不再（也不能）叙述了。

Ensemble Generation

Ensemble Learning 是指将多个不同的 Base Model 组合成一个 Ensemble Model 的方法。它可以同时降低最终模型的 Bias 和 Variance（证明可以参考这篇论文，我最近在研究类似的理论，可能之后会写新文章详述)，从而在提高分数的同时又降低 Overfitting 的风险。在现在的 Kaggle 比赛中要不用 Ensemble 就拿到奖金几乎是不可能的。

常见的 Ensemble 方法有这么几种：

Bagging：使用训练数据的不同随机子集来训练每个 Base Model，最后进行每个 Base Model 权重相同的 Vote。也即 Random Forest 的原理。
Boosting：迭代地训练 Base Model，每次根据上一个迭代中预测错误的情况修改训练样本的权重。也即 Gradient Boosting 的原理。比 Bagging 效果好，但更容易 Overfit。
Blending：用不相交的数据训练不同的 Base Model，将它们的输出取（加权）平均。实现简单，但对训练数据利用少了。
Stacking：接下来会详细介绍。

从理论上讲，Ensemble 要成功，有两个要素：

Base Model 之间的相关性要尽可能的小。这就是为什么非 Tree-based Model 往往表现不是最好但还是要将它们包括在 Ensemble 里面的原因。Ensemble 的 Diversity 越大，最终 Model 的 Bias 就越低。
Base Model 之间的性能表现不能差距太大。这其实是一个 Trade-off，在实际中很有可能表现相近的 Model 只有寥寥几个而且它们之间相关性还不低。但是实践告诉我们即使在这种情况下 Ensemble 还是能大幅提高成绩。

Stacking

相比 Blending，Stacking 能更好地利用训练数据。以 5-Fold Stacking 为例，它的基本原理如图所示：

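The whole process closely resembles Cross Validation. First split the training data into 5 parts, then run 5 iterations. In each iteration, train every Base Model on 4 parts as the Training Set and predict on the remaining Hold-out Set; also save its predictions on the test data. Each Base Model thus predicts one part of the training data and the whole test set in every iteration. After all 5 iterations we have a matrix of shape (#training rows) x (#Base Models), which becomes the training data for the second-level Model. Once the second-level Model is trained, we feed it the saved Base Model predictions on the test data to get the final output. Because each Base Model was trained 5 times and predicted the whole test set 5 times, those 5 predictions are averaged, giving a matrix of shape (#test rows) x (#Base Models) whose columns match those of the second-level training data.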

这里给出我的实现代码：

import numpy as np
from sklearn.model_selection import KFold

class Ensemble(object):
    def __init__(self, n_folds, stacker, base_models):
        self.n_folds = n_folds
        self.stacker = stacker
        self.base_models = base_models

    def fit_predict(self, X, y, T):
        X = np.array(X)
        y = np.array(y)
        T = np.array(T)
        # Fix the folds once so that every base model sees the same splits.
        kf = KFold(n_splits=self.n_folds, shuffle=True, random_state=2016)
        folds = list(kf.split(X))
        # One column per base model: out-of-fold predictions for the training
        # rows, fold-averaged predictions for the test rows.
        S_train = np.zeros((X.shape[0], len(self.base_models)))
        S_test = np.zeros((T.shape[0], len(self.base_models)))
        for i, clf in enumerate(self.base_models):
            S_test_i = np.zeros((T.shape[0], len(folds)))
            for j, (train_idx, test_idx) in enumerate(folds):
                X_train = X[train_idx]
                y_train = y[train_idx]
                X_holdout = X[test_idx]
                clf.fit(X_train, y_train)
                # Hold-out predictions become second-level training features.
                S_train[test_idx, i] = clf.predict(X_holdout)
                S_test_i[:, j] = clf.predict(T)
            # Average this model's test predictions over the folds.
            S_test[:, i] = S_test_i.mean(1)
        # Train the second-level model on the out-of-fold predictions.
        self.stacker.fit(S_train, y)
        return self.stacker.predict(S_test)

获奖选手往往会使用比这复杂得多的 Ensemble，会出现三层、四层甚至五层，不同的层数之间有各种交互，还有将经过不同的 Preprocessing 和不同的 Feature Engineering 的数据用 Ensemble 组合起来的做法。但对于新手来说，稳稳当当地实现一个正确的 5-Fold Stacking 已经足够了。

*Pipeline

可以看出 Kaggle 比赛的 Workflow 还是比较复杂的。尤其是 Model Selection 和 Ensemble。理想情况下，我们需要搭建一个高自动化的 Pipeline，它可以做到：

模块化 Feature Transform，只需写很少的代码就能将新的 Feature 更新到训练集中。
自动化 Grid Search，只要预先设定好使用的 Model 和参数的候选，就能自动搜索并记录最佳的 Model。
Automated Ensemble Generation: every so often, take the current best K Models and build an Ensemble from them.

对新手来说，第一点可能意义还不是太大，因为 Feature 的数量总是人脑管理的过来的；第三点问题也不大，因为往往就是在最后做几次 Ensemble。但是第二点还是很有意义的，手工记录每个 Model 的表现不仅浪费时间而且容易产生混乱。

Crowdflower Search Results Relevance 的第一名获得者 Chenglong Chen 将他在比赛中使用的 Pipeline 公开了，非常具有参考和借鉴意义。只不过看懂他的代码并将其中的逻辑抽离出来搭建这样一个框架，还是比较困难的一件事。可能在参加过几次比赛以后专门抽时间出来做会比较好。

Home Depot Search Relevance

在这一节中我会具体分享我在 Home Depot Search Relevance 比赛中是怎么做的，以及比赛结束后从排名靠前的队伍那边学到的做法。

首先简单介绍这个比赛。Task 是判断用户搜索的关键词和网站返回的结果之间的相关度有多高。相关度是由 3 个人类打分取平均得到的，每个人可能打 1 ~ 3 分，所以这是一个回归问题。数据中包含用户的搜索词，返回的产品的标题和介绍，以及产品相关的一些属性比如品牌、尺寸、颜色等。使用的评分基准是 RMSE。

这个比赛非常像 Crowdflower Search Results Relevance 那场比赛。不过那边用的评分基准是 Quadratic Weighted Kappa，把 1 误判成 4 的惩罚会比把 1 判成 2 的惩罚大得多，所以在最后 Decode Prediction 的时候会更麻烦一点。除此以外那次比赛没有提供产品的属性。

EDA

由于加入比赛比较晚，当时已经有相当不错的 EDA 了。尤其是这个。从中我得到的启发有：

同一个搜索词/产品都出现了多次，数据分布显然不 i.i.d.。
文本之间的相似度很有用。
产品中有相当大一部分缺失属性，要考虑这会不会使得从属性中得到的 Feature 反而难以利用。
产品的 ID 对预测相关度很有帮助，但是考虑到训练集和测试集之间的重叠度并不太高，利用它会不会导致 Overfitting？

Preprocessing

这次比赛中我的 Preprocessing 和 Feature Engineering 的具体做法都可以在这里看到。我只简单总结一下和指出重要的点。

利用 Forum 上的 Typo Dictionary 修正搜索词中的错误。
统计属性的出现次数，将其中出现次数多又容易利用的记录下来。
将训练集和测试集合并，并与产品描述和属性 Join 起来。这是考虑到后面有一系列操作，如果不合并的话就要重复写两次了。
Apply Stemming and Tokenizing to all text, and manually normalize some formats (for example those involving numbers and units) and replace synonyms.

Feature

*Attribute Features
- 是否包含某个特定的属性（品牌、尺寸、颜色、重量、内用/外用、是否有能源之星认证等）
- 这个特定的属性是否匹配
Meta Features
- 各个文本域的长度
- 是否包含属性域
- 品牌（将所有的品牌做数值离散化）
- 产品 ID
简单匹配
- 搜索词是否在产品标题、产品介绍或是产品属性中出现
- 搜索词在产品标题、产品介绍或是产品属性中出现的数量和比例
- *搜索词中的第 i 个词是否在产品标题、产品介绍或是产品属性中出现
搜索词和产品标题、产品介绍以及产品属性之间的文本相似度
- BOW Cosine Similarity
- TF-IDF Cosine Similarity
- Jaccard Similarity
- *Edit Distance
- Word2Vec Distance（由于效果不好，最后没有使用，但似乎是因为用的不对）
Latent Semantic Indexing：通过将 BOW/TF-IDF Vectorization 得到的矩阵进行 SVD 分解，我们可以得到不同搜索词/产品组合的 Latent 标识。这个 Feature 使得 Model 能够在一定程度上对不同的组合做出区别，从而解决某些产品缺失某些 Feature 的问题。
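A few of the simpler matching and similarity Features from the list above can be sketched with the standard library alone. The function names and the example query/title pair are made up for illustration, and tokenization is a plain whitespace split rather than the Stemming and Tokenizing used in the competition.

```python
from difflib import SequenceMatcher

def jaccard(a, b):
    """Jaccard Similarity between the word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def query_word_hit_ratio(query, text):
    """Fraction of query words that also appear in the text."""
    words = query.lower().split()
    if not words:
        return 0.0
    text_words = set(text.lower().split())
    return sum(w in text_words for w in words) / len(words)

def char_similarity(a, b):
    """Character-level similarity in [0, 1]; tolerant of small typos."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

query = "angle bracket"              # hypothetical search term
title = "steel angle bracket"        # hypothetical product title
features = {
    "jaccard": jaccard(query, title),
    "hit_ratio": query_word_hit_ratio(query, title),
    "char_sim_typo": char_similarity("angle braket", "angle bracket"),
}
```

Each function returns a single number per (search term, product text) pair, so the results can be appended as columns to the Feature matrix.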

值得一提的是，上面打了 * 的 Feature 都是我在最后一批加上去的。问题是，使用这批 Feature 训练得到的 Model 反而比之前的要差，而且还差不少。我一开始是以为因为 Feature 的数量变多了所以一些参数需要重新调优，但在浪费了很多时间做 Grid Search 以后却发现还是没法超过之前的分数。这可能就是之前提到的 Feature 之间的相互作用导致的问题。当时我设想过一个看到过好几次的解决方案，就是将使用不同版本 Feature 的 Model 通过 Ensemble 组合起来。但最终因为时间关系没有实现。事实上排名靠前的队伍分享的解法里面基本都提到了将不同的 Preprocessing 和 Feature Engineering 做 Ensemble 是获胜的关键。

Model

我一开始用的是 RandomForestRegressor，后来在 Windows 上折腾 Xgboost 成功了就开始用 XGBRegressor。XGB 的优势非常明显，同样的数据它只需要不到一半的时间就能跑完，节约了很多时间。

比赛中后期我基本上就是一边台式机上跑 Grid Search，一边在笔记本上继续研究 Feature。

这次比赛数据分布很不独立，所以期间多次遇到改进的 Feature 或是 Grid Search新得到的参数训练出来的模型反而 LB 分数下降了。由于被很多前辈教导过要相信自己的 CV，我的决定是将 5-Fold 提到 10-Fold，然后以 CV 为标准继续前进。

Ensemble

最终我的 Ensemble 的 Base Model 有以下四个：

RandomForestRegressor
ExtraTreesRegressor
GradientBoostingRegressor
XGBRegressor

第二层的 Model 还是用的 XGB。

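Because the correlations between the Base Models were all too high (even the lowest pair was 0.9), I had wanted to add an XGBRegressor using gblinear as well as SVR. But the former's RMSE was 0.02 higher than the other Models (a gap of several hundred places on the LB), and the latter was far too slow to train. In the end I used only these four.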

值得一提的是，在开始做 Stacking 以后，我的 CV 和 LB 成绩的提高就是完全同步的了。

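In the last two days of the competition, exhausted and out of ideas for significant improvements, I did one more thing: I generated Ensembles with 20 different random seeds and took a Weighted Average at the end. This is effectively a form of Bagging. The rationale is that, the way I implemented Stacking, each Base Model is trained on only 80% of the training data while the second-level Model is trained on 100%, which increases the risk of Overfitting to some degree. Changing the random seed ensures a different 80% is used each time, so averaging over many runs approximates the effect of using 100% of the data. This gave me an improvement of about 0.0004, and it is hard to say whether that was a real effect or just randomness.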

比赛结束后我发现我最好的单个 Model 在 Private LB 上的得分是 0.46378，而最终 Stacking 的得分是 0.45849。这是 174 名和 98 名的差距。也就是说，我单靠 Feature 和调参进到了前 10%，而 Stacking 使我进入了前 5%。

Lessons Learned

比赛结束后一些队伍分享了他们的解法，从中我学到了一些我没有做或是做的不够好的地方：

产品标题的组织方式是有 Pattern 的，比如一个产品是否带有某附件一定会用With/Without XXX 的格式放在标题最后。
使用外部数据，比如 WordNet，Reddit 评论数据集等来训练同义词和上位词（在一定程度上替代 Word2Vec）词典。
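NLP Features based on characters rather than words. This puzzled me at first, but after asking around I found it makes a lot of sense. For example, the third-place team factored in the length of the matching words between the search term and the content when computing match scores, because they found that longer words are more specific and therefore more likely to be judged relevant by users. They also used character-by-character sequence comparison (difflib.SequenceMatcher), because that similarity captures visual similarity. Features like these are not something everyone would think of.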
标注单词的词性，找出中心词，计算基于中心词的各种匹配度和距离。这一点我想到了，但没有时间尝试。
将产品标题/介绍中 TF-IDF 最高的一些 Trigram 拿出来，计算搜索词中出现在这些 Trigram 中的比例；反过来以搜索词为基底也做一遍。这相当于是从另一个角度抽取了一些 Latent 标识。
一些新颖的距离尺度，比如 Word Movers Distance
除了 SVD 以外还可以用上 NMF。
最重要的 Feature 之间的 Pairwise Polynomial Interaction。
针对数据不 i.i.d. 的问题，在 CV 时手动构造测试集与验证集之间产品 ID 不重叠和重叠的两种不同分割，并以与实际训练集/测试集的分割相同的比例来做 CV 以逼近 LB 的得分分布。
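The letter-based features in the list above can be sketched with the standard library alone. This is only an illustration of the idea, not the third-place team's code: the function names, the lower-casing, the whitespace tokenization, and the sample strings are all assumptions made here.

```python
from difflib import SequenceMatcher


def char_similarity(query: str, title: str) -> float:
    """Character-level similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, query.lower(), title.lower()).ratio()


def matched_letter_count(query: str, title: str) -> int:
    """Total length of the query words that also occur in the title."""
    title_words = set(title.lower().split())
    return sum(len(w) for w in query.lower().split() if w in title_words)


# Hypothetical search term and product title, for illustration only.
query = "cordless drill"
title = "Acme 20V Cordless Drill Kit"
print(char_similarity(query, title))
print(matched_letter_count(query, title))
```

Counting matched letters instead of matched words gives a long, specific word more weight than a short generic one, which is the intuition described above.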

As for ensembling methods, there is not much I can learn for now, since my only experience is with the simplest form of stacking.

Summary

Takeaways

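Starting the ensemble relatively early is the right thing to do; in this competition I was still agonizing over features on the third-to-last day.
It is essential to build a pipeline that can at least train automatically and record the best parameters.
Features are king. I still spent too little time on features.
If possible, spend more time manually inspecting the raw data for patterns.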

Issues Raised

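I think some of the problems I ran into in this competition are well worth researching:

How to do reliable CV when the data is not i.i.d. or even has dependencies.
How to quantify the diversity vs. accuracy trade-off in an ensemble.
How to handle interactions between features that end up degrading performance.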

Beginner Tips

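Some advice for beginners:

Choose a competition that interests you. Even better if you already have some insight into the domain.
Follow the approach I described to start exploring and understanding the data and building models.
Learn from the Forum and Scripts how others understand the data and construct features.
If similar competitions have been held before, look up the winners' interviews and blog posts for reference; they are often very useful.
Once you have a fairly good LB score (say, close to the top 10%), you can start trying ensembles.
If you think you have a shot at the prize money, start looking for teammates!
Keep the tension up until the very end of the competition, and try something new every day if you can.
After the competition, study the methods of the top-ranked teams, reflect on your shortcomings and the problems you discovered, and if possible spend some time verifying what you learned by experiment, in preparation for the next competition.
Get plenty of rest!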


Introduction

Kaggle is the best place for learning from other data scientists. Many companies provide data and prize money to set up data science competitions on Kaggle. Recently I had my first shot on Kaggle and ranked 98th (~ 5%) among 2125 teams. Since this is my Kaggle debut, I feel quite satisfied. Because many Kaggle beginners set 10% as their first goal, here I want to share my experience in achieving that goal.

This post is also available in Chinese.

Most Kagglers use Python and R. I prefer Python, but R users should have no difficulty in understanding the ideas behind tools and languages.

First let’s go through some facts about Kaggle competitions in case you are not very familiar with them.

Different competitions have different tasks: classification, regression, recommendation, ordering… The training set and testing set are available for download once the competition launches.
A competition typically lasts for 2 ~ 3 months. Each team can submit a limited number of times per day, usually 5.
There will be a deadline one week before the end of the competition, after which you cannot merge teams or enter the competition. Therefore be sure to have at least one valid submission before that.
You will get your score immediately after the submission. Different competitions use different scoring metrics, which are explained by the question mark on the leaderboard.
The score you get is calculated on a subset of the testing set and is commonly referred to as the Public LB score, whereas the final result uses the remaining data in the testing set and is referred to as the Private LB score.
The score you get by local cross validation is commonly referred to as a CV score. Generally speaking, CV scores are more reliable than LB scores.
Beginners can learn a lot from Forum and Scripts. Do not hesitate to ask; Kagglers are very kind and helpful.

I assume that readers are familiar with basic concepts and models of machine learning. Enjoy reading!

General Approach

In this section, I will walk you through the whole process of a Kaggle competition.

Data Exploration

What we do at this stage is called EDA (Exploratory Data Analysis), which means analytically exploring data in order to provide some insights for subsequent processing and modeling.

Usually we would load the data using Pandas and make some visualizations to understand the data.

Visualization

For plotting, Matplotlib and Seaborn should suffice.

Some common practices:

Inspect the distribution of target variable. Depending on what scoring metric is used, an imbalanced distribution of target variable might harm the model’s performance.
For numerical variables, use box plot to inspect their distributions.
For coordinates-like data, use scatter plot to inspect the distribution and check for outliers.
For classification tasks, plot the data with points colored according to their labels. This can help with feature engineering.
Make pairwise distribution plots and examine correlations between pairs of variables.

Be sure to read this very inspiring tutorial of exploratory visualization before you go on.

Statistical Tests

We can perform some statistical tests to confirm our hypotheses. Sometimes we can get enough intuition from visualization, but quantitative results are always good to have. Note that we will always encounter non-i.i.d. data in real world. So we have to be careful about the choice of tests and how we interpret the findings.

In many competitions public LB scores are not very consistent with local CV scores due to noise or a non-i.i.d. distribution. You can use test results to roughly set a threshold for determining whether an increase in score is genuine or due to randomness.
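One simple way to set such a threshold, given here as a rough sketch rather than a rigorous test: compare the per-fold scores of two models on the same folds and ask whether the mean improvement exceeds a couple of standard errors. The function name, the cut-off `z=2.0`, and the fold scores below are all invented for illustration, and with dependent data the standard error will tend to be optimistic.

```python
import numpy as np


def improvement_is_credible(scores_old, scores_new, z=2.0):
    """Return True if the mean per-fold RMSE drop exceeds z standard errors."""
    diff = np.asarray(scores_old, dtype=float) - np.asarray(scores_new, dtype=float)
    std_err = diff.std(ddof=1) / np.sqrt(len(diff))
    return bool(diff.mean() > z * std_err)


# Made-up per-fold RMSE values from the same 5 folds (lower is better).
old = [0.470, 0.468, 0.473, 0.469, 0.471]
new = [0.466, 0.465, 0.470, 0.464, 0.468]
print(improvement_is_credible(old, new))
```

Pairing the folds matters: the fold-to-fold spread is often larger than the improvement itself, and differencing on identical folds removes most of it.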

Data Preprocessing

In most cases, we need to preprocess the dataset before constructing features. Some common steps are:

Sometimes several files are provided and we need to join them.
Deal with missing data.
Deal with outliers.
Encode categorical variables if necessary.
Deal with noise. For example you may have some floats derived from unknown integers. The loss of precision during floating-point operations can bring much noise into the data: two seemingly different values might be the same before conversion. Sometimes noise harms model and we would want to avoid that.
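The float example in the last item can be made concrete with a small sketch. The column below is hypothetical: it assumes integer counts that were divided by 10 somewhere upstream, so that equal values arrive with tiny floating-point differences; rounding back to the underlying integers is one possible repair when that origin is known.

```python
import numpy as np

# Hypothetical column: integers scaled by 1/10 upstream, then passed through
# different arithmetic, so equal values differ in the last bit.
values = np.array([0.1 + 0.2, 0.3, 0.30000000000000004, 1.7, 1.7000000000000002])

# The raw floats contain more distinct values than the underlying integers.
print(len(np.unique(values)))

# Undo the scaling and round to recover the integers.
restored = np.round(values * 10).astype(int)
print(np.unique(restored))
```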

How we choose to perform preprocessing largely depends on what we learn about the data in the previous stage. In practice, I recommend using IPython Notebook for data manipulation and mastering the frequently used Pandas operations. The advantage is that you get to see the results immediately and are able to modify or rerun operations. Also this makes it very convenient to share your approaches with others. After all, reproducible results are very important in data science.

Let’s have some examples.

Outlier

The plot shows some scaled coordinates data. We can see that there are some outliers in the top-right corner. Exclude them and the distribution looks good.

Dummy Variables

For categorical variables, a common practice is One-hot Encoding. For a categorical variable with n possible values, we create a group of n dummy variables. Suppose a record in the data takes one value for this variable, then the corresponding dummy variable is set to 1 while other dummies in the same group are all set to 0.

Like this, we transform DayOfWeek into 7 dummy variables.

Note that when the categorical variable can take many values (hundreds or more), this might not work well. It’s difficult to find a general solution to that, but I’ll discuss one scenario in the next section.
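As a minimal sketch of the DayOfWeek transformation, assuming the column holds integers 1 to 7 (the sample rows are made up), `pandas.get_dummies` produces the 7 dummy variables. Declaring the categories up front is a choice made here so that all 7 columns appear even when some days are missing from the data.

```python
import pandas as pd

# Made-up records; only the DayOfWeek column matters here.
df = pd.DataFrame({"DayOfWeek": [1, 2, 7, 3, 1]})

# Fix the full set of categories so unseen days still get a column.
df["DayOfWeek"] = pd.Categorical(df["DayOfWeek"], categories=list(range(1, 8)))

# One column per day; exactly one of them is 1 in each row.
dummies = pd.get_dummies(df["DayOfWeek"], prefix="DayOfWeek", dtype=int)
print(dummies)
```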

Feature Engineering

Some describe the essence of Kaggle competitions as feature engineering supplemented by model tuning and ensemble learning. Yes, that makes a lot of sense. Feature engineering gets you very far. Yet it is how well you know the domain of the given data that decides how far you can go. For example, in a competition where the data consists mainly of text, common NLP features are a must. The approach of constructing useful features is something we all have to continuously learn in order to do better.

Basically, when you feel that a variable is intuitively useful for the task, you can include it as a feature. But how do you know it actually works? The simplest way is to check by plotting it against the target variable like this:

Feature Selection

Generally speaking, we should try to craft as many features as we can and have faith in the model’s ability to pick up the most significant features. Yet there’s still something to gain from feature selection beforehand:

Fewer features mean faster training.
Some features are linearly related to others. This might put a strain on the model.
By picking up the most important features, we can use interactions between them as new features. Sometimes this gives surprising improvement.

The simplest way to inspect feature importance is by fitting a random forest model. There exist more robust feature selection algorithms (e.g. this) which are theoretically superior but not practicable due to the absence of an efficient implementation. You can combat noisy data (to an extent) simply by increasing the number of trees used in the random forest.

This is important for competitions in which data is anonymized because you won’t waste time trying to figure out the meaning of a variable that’s of no significance.

Feature Encoding

Sometimes raw features have to be converted to some other format for them to work properly.

For example, suppose we have a categorical variable which can take more than 10K different values. Then naively creating dummy variables is not a feasible option. An acceptable solution is to create dummy variables for only a subset of the values (e.g. values that constitute 95% of the feature importance) and assign everything else to an ‘others’ class.
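A rough sketch of the "others" trick follows. Note the substitution: the text selects values by their share of feature importance, whereas this sketch uses row frequency as a simpler stand-in. The function name, the `coverage` parameter, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def collapse_rare(values, coverage=0.95, other="others"):
    """Keep the most frequent categories that together cover `coverage`
    of the rows; map every remaining category to `other`."""
    counts = Counter(values)
    total = len(values)
    kept, covered = set(), 0
    for category, n in counts.most_common():
        if covered / total >= coverage:
            break
        kept.add(category)
        covered += n
    return [v if v in kept else other for v in values]


# Made-up categorical column with a long tail.
brands = ["a"] * 60 + ["b"] * 30 + ["c"] * 6 + ["d"] * 3 + ["e"]
collapsed = collapse_rare(brands)
print(Counter(collapsed))
```

Dummy variables are then created only for the kept values plus the single "others" class, which keeps the number of columns manageable.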

Model Selection

With the features set, we can start training models. Kaggle competitions usually favor tree-based models:

Gradient Boosted Trees
Random Forest
Extra Randomized Trees

These models are slightly worse in terms of performance, but are suitable as base models in ensemble learning (will be discussed later):

SVM
Linear Regression
Logistic Regression
Neural Networks

Of course, neural networks are very important in image-related competitions.

All these models can be accessed using Sklearn.

Here I want to emphasize the greatness of Xgboost. The outstanding performance of gradient boosted trees and Xgboost’s efficient implementation make it very popular in Kaggle competitions. Nowadays almost every winner uses Xgboost in one way or another.

BTW, installing Xgboost on Windows could be a painstaking process. You can refer to this post by me if you run into problems.

Model Training

We can obtain a good model by tuning its parameters. A model usually has many parameters, but only a few of them are important to its performance. For example, the most important parameters for a random forest are the number of trees in the forest and the maximum number of features used in developing each tree. We need to understand how models work and what impact each parameter has on the model’s performance, be it accuracy, robustness or speed.

Normally we would find the best set of parameters by a process called grid search. What it does is simply iterate through all the possible combinations and find the best one.

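from sklearn.model_selection import GridSearchCV  # formerly sklearn.grid_search

# rfr is a RandomForestRegressor and RMSE a scorer object, both defined elsewhere
param_grid = {'n_estimators': [300, 500], 'max_features': [10, 12, 14]}
model = GridSearchCV(
    estimator=rfr, param_grid=param_grid, n_jobs=1, cv=10, verbose=20, scoring=RMSE
)
model.fit(X_train, y_train)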

By the way, random forest usually reaches its optimum when max_features is set to the square root of the total number of features.
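To make "iterating through all the possible combinations" concrete, here is a dependency-free sketch of the loop that a grid search performs. It is not how Sklearn implements it; `grid_search` and `fake_cv_rmse` are hypothetical names, and the scoring function is a stand-in for training a model and measuring its cross-validated RMSE.

```python
from itertools import product


def grid_search(score_fn, param_grid):
    """Evaluate every combination in param_grid; return (best_params, best_score).

    score_fn takes keyword parameters and returns a score where lower is
    better (for example, a cross-validated RMSE).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Stand-in for "train the model and return its CV RMSE"; purely illustrative.
def fake_cv_rmse(n_estimators, max_features):
    return abs(max_features - 12) * 0.01 + 1.0 / n_estimators


param_grid = {"n_estimators": [300, 500], "max_features": [10, 12, 14]}
best_params, best_score = grid_search(fake_cv_rmse, param_grid)
print(best_params, best_score)
```

The cost is the product of the grid sizes times the cost of one cross-validated fit, which is why it pays to tune only the few parameters that matter.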

Here I’d like to stress some points about tuning XGB. These parameters are generally considered to have real impacts on its performance:

eta: Step size used in updating weights. Lower eta means slower training.
num_round: Total number of boosting iterations.
subsample: The ratio of training data used in each iteration. This is to combat overfitting.
colsample_bytree: The ratio of features used in each iteration. This is like max_features of RandomForestClassifier.
max_depth: The maximum depth of each tree. Unlike random forest, gradient boosting would eventually overfit if we do not limit its depth.
early_stopping_rounds: Controls how many iterations without an improvement of the score on the validation set are needed for the algorithm to stop early. This is to combat overfitting, too.

Usual tuning steps:

Reserve a portion of training set as the validation set.
Set eta to a relatively high value (e.g. 0.1), num_round to 300 ~ 500.
Use grid search to find best combination of other parameters.
Gradually lower eta to find the optimum.
Use the validation set as watch_list to re-train the model with the best parameters. Observe how score changes on validation set in each iteration. Find the optimal value for early_stopping_rounds.

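import xgboost as xgb
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation

X_dtrain, X_deval, y_dtrain, y_deval = train_test_split(
    X_train, y_train, random_state=1026, test_size=0.3)
dtrain = xgb.DMatrix(X_dtrain, label=y_dtrain)
deval = xgb.DMatrix(X_deval, label=y_deval)
watchlist = [(deval, 'eval')]
params = {
    'booster': 'gbtree',
    'objective': 'reg:squarederror',  # called 'reg:linear' in older Xgboost releases
    'subsample': 0.8,
    'colsample_bytree': 0.85,
    'eta': 0.05,
    'max_depth': 7,
    'seed': 2016,
    'verbosity': 1,  # replaces the deprecated 'silent' flag
    'eval_metric': 'rmse'
}
# Train for at most 500 rounds; stop if the eval RMSE has not improved for 50 rounds
clf = xgb.train(params, dtrain, num_boost_round=500, evals=watchlist, early_stopping_rounds=50)
pred = clf.predict(xgb.DMatrix(df_test))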

Finally, note that models with randomness all have a parameter like seed or random_state to control the random seed. You must record this with all other parameters when you get a good model. Otherwise you wouldn’t be able to reproduce it.

Cross Validation

Cross validation is an essential step. It tells us whether our model is at high risk of overfitting. In many competitions, public LB scores are not very reliable. Often when we improve the model and get a better local CV score, the LB score becomes worse. It is widely believed that we should trust our CV scores in such situations. Ideally we would want CV scores obtained by different approaches to improve in sync with each other and with the LB score, but this is not always possible.

Usually 5-fold CV is good enough. If we use more folds, the CV score would become more reliable, but the training takes longer to finish as well.
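For readers who have not written the loop by hand, a bare-bones 5-fold CV looks roughly like the sketch below. To stay self-contained it uses a least-squares linear model and synthetic data in place of a real competition model; `kfold_rmse` and all the data are assumptions for illustration.

```python
import numpy as np


def kfold_rmse(X, y, n_folds=5, seed=0):
    """Cross-validated RMSE of an ordinary least-squares fit."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        # Fit on the other folds, score on the held-out fold.
        coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        residual = y[val_idx] - X[val_idx] @ coef
        scores.append(np.sqrt(np.mean(residual ** 2)))
    return np.array(scores)


# Synthetic regression data: intercept column plus 3 features, noise sd 0.1.
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
scores = kfold_rmse(X, y)
print(scores.mean(), scores.std())
```

The spread of the per-fold scores gives a feel for how much of a score change is noise; more folds shrink each validation set but give more estimates to average.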

How to do CV properly is not a trivial problem. It requires constant experiment and case-by-case discussion. Many Kagglers share their CV approaches (like this one) after competitions where it’s not easy to do reliable CV.

Ensemble Generation

Ensemble Learning refers to ways of combining different models. It reduces both bias and variance of the final model (you can find a proof here), thus increasing the score and reducing the risk of overfitting. Recently it became virtually impossible to win a prize in Kaggle competitions without using ensembles.

Common approaches of ensemble learning are:

Bagging: Use different random subsets of training data to train each base model. Then base models vote to generate the final predictions. This is how random forest works.
Boosting: Train base models iteratively, modify the weights of training samples according to the last iteration. This is how gradient boosted trees work. It performs better than bagging but is more prone to overfitting.
Blending: Use non-overlapping data to train different base models and take a weighted average of them to obtain the final predictions. This is easy to implement but uses less data.
Stacking: To be discussed next.

In theory, for the ensemble to perform well, two elements matter:

Base models should be as unrelated as possible. This is why we tend to include non-tree-based models in the ensemble even though they don’t perform as well. The math says that the greater the diversity, the less bias in the final ensemble.
Performance of base models shouldn’t differ too much.

Actually we have a trade-off here. In practice we may end up with highly related models of comparable performance. Yet we ensemble them anyway because it usually increases performance even under this circumstance.
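A rough way to see why correlation matters, offered as an idealized sketch rather than a result from the post: assume $n$ base models whose errors $e_{i}$ each have variance $\sigma^{2}$ and the same pairwise correlation $\rho$ (symbols introduced here only for illustration). The error of their simple average then has variance

$\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n}e_{i}\right) = \frac{1}{n^{2}}\left(n\sigma^{2} + n(n-1)\rho\sigma^{2}\right) = \rho\sigma^{2} + \frac{1-\rho}{n}\sigma^{2}.$

For example, with $\rho = 0.9$ and $n = 4$ this gives $0.925\sigma^{2}$: averaging highly correlated models still helps, but only a little, and adding more models cannot push the variance below $\rho\sigma^{2}$. This covers only the variance part of the argument and only plain averaging, not a trained stacker.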

Stacking

Compared with blending, stacking makes better use of training data. Here’s a diagram of how it works:

(Taken from Faron. Many thanks!)

It’s much like cross validation. Take 5-fold stacking as an example. First we split the training data into 5 folds. Next we will do 5 iterations. In each iteration, train every base model on 4 folds and predict on the hold-out fold. You have to keep the predictions on the testing data as well. This way, in each iteration every base model will make predictions on 1 fold of the training data and all of the testing data. After 5 iterations we will obtain a matrix of shape #(rows in training data) X #(base models). This matrix is then fed to the stacker in the second level. After the stacker is fitted, use the predictions on testing data by base models (each base model is trained 5 times, therefore we have to take an average to obtain a matrix of the same shape) as the input for the stacker and obtain our final predictions.

Maybe it’s better to just show the code:

import numpy as np
from sklearn.model_selection import KFold


class Ensemble(object):
    def __init__(self, n_folds, stacker, base_models):
        self.n_folds = n_folds
        self.stacker = stacker
        self.base_models = base_models

    def fit_predict(self, X, y, T):
        X = np.array(X)
        y = np.array(y)
        T = np.array(T)
        # KFold now lives in sklearn.model_selection and takes n_splits
        kf = KFold(n_splits=self.n_folds, shuffle=True, random_state=2016)
        folds = list(kf.split(X))
        # Out-of-fold predictions (training rows) and averaged test predictions
        S_train = np.zeros((X.shape[0], len(self.base_models)))
        S_test = np.zeros((T.shape[0], len(self.base_models)))
        for i, clf in enumerate(self.base_models):
            S_test_i = np.zeros((T.shape[0], len(folds)))
            for j, (train_idx, test_idx) in enumerate(folds):
                X_train = X[train_idx]
                y_train = y[train_idx]
                X_holdout = X[test_idx]
                # y_holdout = y[test_idx]
                clf.fit(X_train, y_train)
                y_pred = clf.predict(X_holdout)[:]
                S_train[test_idx, i] = y_pred
                S_test_i[:, j] = clf.predict(T)[:]
            S_test[:, i] = S_test_i.mean(1)
        self.stacker.fit(S_train, y)
        y_pred = self.stacker.predict(S_test)[:]
        return y_pred

Prize winners usually have larger and much more complicated ensembles. For beginners, implementing a correct 5-fold stacking is good enough.

*Pipeline

We can see that the workflow for a Kaggle competition is quite complex, especially for model selection and ensemble. Ideally, we need a highly automated pipeline capable of:

Modularized feature transform. We only need to write a few lines of code and the new feature is added to the training set.
Automated grid search. We only need to set up models and parameter grid, the search will be run and best parameters are recorded.
Automated ensemble generation. Use best K models for ensemble as soon as last generation is done.

For beginners, the first one is not very important because the number of features is quite manageable; the third one is not important either because typically we only do several ensembles at the end of the competition. But the second one is good to have because manually recording the performance and parameters of each model is time-consuming and error-prone.

Chenglong Chen, the winner of Crowdflower Search Results Relevance, once released his pipeline on GitHub. It’s very complete and efficient. Yet it’s still very hard to understand and extract all his logic to build a general framework. This is something you might want to do when you have plenty of time.

Home Depot Search Relevance

In this section I will share my solution in Home Depot Search Relevance and what I learned from top teams after the competition.

The task in this competition is to predict how relevant a result is for a search term on the Home Depot website. The relevance score is an average of three human evaluators and ranges between 1 ~ 3. Therefore it’s a regression task. The dataset contains search terms, product titles / descriptions and some attributes like brand, size and color. The metric is RMSE.

This is much like Crowdflower Search Results Relevance. The difference is that Quadratic Weighted Kappa was used in that competition, which complicated the final cutoff of regression scores. Also there were no attributes provided in that competition.

EDA

There were several quite good EDAs by the time I joined the competition, especially this one. I learned that:

Many search terms / products appeared several times.
Text similarities are great features.
Many products don’t have attributes features. Would this be a problem?
Product ID seems to have strong predictive power. However the overlap of product ID between the training set and the testing set is not very high. Would this contribute to overfitting?

Preprocessing

You can find how I did preprocessing and feature engineering on GitHub. I’ll only give a brief summary here:

Use typo dictionary posted in forum to correct typos in search terms.
Count attributes. Find those frequent and easily exploited ones.
Join the training set with the testing set. This is important because otherwise you’ll have to do feature transform twice.
Do stemming and tokenizing for all the text fields. Some normalization (of digits and units) and synonym substitutions are performed manually.

Feature

*Attribute Features
- Whether the product contains a certain attribute (brand, size, color, weight, indoor/outdoor, energy star certified …)
- Whether a certain attribute matches with search term

Meta Features
- Length of each text field
- Whether the product contains attribute fields
- Brand (encoded as integers)
- Product ID
Matching
- Whether search term appears in product title / description / attributes
- Count and ratio of search term’s appearance in product title / description / attributes
- *Whether the i-th word of search term appears in product title / description / attributes
Text similarities between search term and product title/description/attributes
- BOW Cosine Similarity
- TF-IDF Cosine Similarity
- Jaccard Similarity
- *Edit Distance
- Word2Vec Distance (I didn’t include this because of poor performance. Yet it seems that I was using it wrong.)
Latent Semantic Indexing: By performing SVD decomposition to the matrix obtained from BOW/TF-IDF Vectorization, we get the latent descriptions of different search term / product groups. This enables our model to distinguish between groups and assign different weights to features, therefore solving the issue of dependent data and products lacking some features (to an extent).
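Two of the text similarities listed above, Jaccard and bag-of-words cosine, are simple enough to sketch with the standard library. This is not the code from the GitHub repository mentioned earlier; it assumes lower-casing and whitespace tokenization (no stemming), and the search term and title are made up.

```python
import math
from collections import Counter


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical search term and product title.
search_term = "white ceiling fan"
product_title = "52 in. white indoor ceiling fan with light kit"
print(jaccard(search_term, product_title))
print(bow_cosine(search_term, product_title))
```

The TF-IDF variant replaces the raw counts with TF-IDF weights before taking the cosine, so that words common to most product titles contribute less.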

Note that features listed above with * are the last batch of features I added. The problem is that the model trained on data that included these features performed worse than the previous ones. At first I thought that the increase in the number of features would require re-tuning of model parameters. However, after wasting much CPU time on grid search, I still could not beat the old model. I think it might be the issue of feature correlation mentioned above. I actually knew a solution that might work, which is to combine models trained on different versions of the features by stacking. Unfortunately I didn’t have enough time to try it. As a matter of fact, most top teams regard the ensemble of models trained with different preprocessing and feature engineering pipelines as a key to success.

Model

At first I was using RandomForestRegressor to build my model. Then I tried Xgboost and it turned out to be more than twice as fast as Sklearn. From then on, what I did every day was basically run grid search on my PC while working on features on my laptop.

The dataset in this competition is not trivial to validate. It’s not i.i.d. and many records are dependent. Many times I used better features / parameters only to end up with worse LB scores. As repeatedly stated by many accomplished Kagglers, you have to trust your own CV score in such situations. Therefore I decided to use 10-fold instead of 5-fold cross validation and ignore the LB score in the following attempts.

Ensemble

My final model is an ensemble consisting of 4 base models:

RandomForestRegressor
ExtraTreesRegressor
GradientBoostingRegressor
XGBRegressor

The stacker (L2 model) is also an XGBRegressor.

The problem is that all my base models are highly correlated (with a lowest correlation of 0.9). I thought of including linear regression, SVM regression and XGBRegressor with linear booster into the ensemble, but these models had RMSE scores that are 0.02 higher (this accounts for a gap of hundreds of places on the leaderboard) than the 4 models I finally used. Therefore I decided not to use more models although they would have brought much more diversity.

The good news is that, despite base models being highly correlated, stacking really bumps up my score. What’s more, my CV score and LB score are in complete sync after I started stacking.

During the last two days of the competition, I did one more thing: use 20 or so different random seeds to generate the ensemble and take a weighted average of them as the final submission. This is actually a kind of bagging. It makes sense in theory because in stacking I used 80% of the data to train base models in each iteration, whereas 100% of the data is used to train the stacker. Therefore it’s less clean. Making multiple runs with different seeds makes sure that different 80% of the data are used each time, thus reducing the risk of information leak. Yet by doing this I only achieved an increase of 0.0004, which might be just due to randomness.
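The seed-averaging step can be sketched as follows. A real run would call the whole stacking procedure once per seed; here `noisy_run` is a hypothetical stand-in that returns a known signal plus seed-dependent noise, so the effect of averaging is visible. The function names and the noise level are assumptions.

```python
import numpy as np


def average_over_seeds(fit_predict, seeds, weights=None):
    """Call fit_predict(seed) once per seed and return the weighted mean prediction."""
    preds = np.stack([fit_predict(seed) for seed in seeds])
    return np.average(preds, axis=0, weights=weights)


# Stand-in for one full stacking run: the true signal plus seed-dependent noise.
signal = np.linspace(1.0, 3.0, 1000)


def noisy_run(seed):
    return signal + np.random.default_rng(seed).normal(scale=0.05, size=signal.size)


single = noisy_run(0)
averaged = average_over_seeds(noisy_run, seeds=range(20))
rmse_single = np.sqrt(np.mean((single - signal) ** 2))
rmse_averaged = np.sqrt(np.mean((averaged - signal) ** 2))
print(rmse_single, rmse_averaged)
```

In this toy setting the runs are independent, so the gain is large. Real runs that share the same data and features are highly correlated, which is consistent with the small improvement reported above.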

After the competition, I found out that my best single model scores 0.46378 on the private leaderboard, whereas my best stacking ensemble scores 0.45849. That was the difference between the 174th place and the 98th place. In other words, feature engineering and model tuning got me into the top 10%, whereas stacking got me into the top 5%.

Lessons Learned

There’s much to learn from the solutions shared by top teams:

There’s a pattern in the product title. For example, whether a product is accompanied by a certain accessory will be indicated by With/Without XXX at the end of the title.
Use external data. For example, use WordNet or the Reddit Comments Dataset to train synonyms and hypernyms.
Some features based on letters instead of words. At first I was rather confused by this, but it makes perfect sense once you think about it. For example, the team that won 3rd place took the number of letters matched into consideration when computing text similarity. They argued that longer words are more specific and thus more likely to be assigned high relevance scores by humans. They also used char-by-char sequence comparison (difflib.SequenceMatcher) to measure visual similarity, which they claimed to be important to humans.
POS-tag words and find anchor words. Use anchor words for computing various distances.
Extract top-ranking trigrams from the TF-IDF of the product title / description field and compute the ratio of words from search terms that appear in these trigrams, and vice versa. This is like computing latent indexes from another point of view.
Some novel distance metrics like Word Movers Distance
Apart from SVD, some used NMF.
Generate pairwise polynomial interactions between top-ranking features.
For CV, construct splits in which product IDs do not overlap between training set and testing set, and splits in which IDs do. Then we can use these with corresponding ratio to approximate the impact of public/private LB split in our local CV.
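The non-overlapping kind of split from the last item can be sketched with NumPy alone. This is an illustration, not the code any team shared: `group_kfold` and the product IDs are made up here. Sklearn's `GroupKFold` provides the same guarantee; the overlapping kind of split is just an ordinary shuffled K-fold.

```python
import numpy as np


def group_kfold(groups, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs in which no group ID appears on both sides."""
    groups = np.asarray(groups)
    unique_ids = np.random.default_rng(seed).permutation(np.unique(groups))
    for fold_ids in np.array_split(unique_ids, n_folds):
        val_mask = np.isin(groups, fold_ids)
        yield np.flatnonzero(~val_mask), np.flatnonzero(val_mask)


# Hypothetical product ID for each training row.
product_ids = np.array([101, 101, 102, 103, 103, 103, 104, 105, 106, 106])
for train_idx, val_idx in group_kfold(product_ids, n_folds=3):
    # No product appears in both the training and the validation part.
    assert not set(product_ids[train_idx]) & set(product_ids[val_idx])
    print(val_idx)
```

Scores from the two kinds of split can then be combined in whatever proportion matches the share of unseen product IDs in the testing set.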

Summary

Takeaways

It is a good idea to start doing ensembles early in the competition. As it turned out, I was still playing with features during the very last days.
It’s a high priority for me to build a pipeline capable of automatic model training and recording of the best parameters.
Features matter the most! I didn’t spend enough time on features in this competition.
If possible, spend some time to manually inspect raw data for patterns.

Issues Raised

Several issues I encountered in this competition are of high research value.

How to do reliable CV with dependent data.
How to quantify the trade-off between diversity and accuracy in ensemble learning.
How to deal with feature interaction which harms the model’s performance. And how to determine whether new features are effective in such situation.

Beginner Tips

Choose a competition you’re interested in. It would be better if you already have some insights about the problem domain.
Following my approach or somebody else’s, start exploring, understanding and modeling data.
Learn from forum and scripts. See how others interpret data and construct features.
Find winner interviews / blog posts from previous competitions. They’re very helpful.
Start doing ensemble after you have reached a pretty good score (e.g. ~ 10%) or you feel that there isn’t much room for new features (which, sadly, always turns out to be false).
If you think you may have a chance to win the prize, try teaming up!
Don’t give up until the end of the competition. At least try something new every day.
Learn from the sharings of top teams after the competition. Reflect on your approaches. If possible, spend some time verifying what you learn.
Get some rest!


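Security

Machine learning is stirring up a small whirlwind in security, but there is a bug in it

October 24, 2016 zr9558

Today, security is one of the fields that machine learning is pushing hard into.

| Bosses are eager to apply machine learning to security

If you attended the 2016 RSA Conference in person, you would have found that almost no company talked about its security products without mentioning machine learning. Why is that?

To an outsider, machine learning may seem like a kind of magic that solves every security problem: you stuff a pile of unlabeled data into a machine learning system, and it picks out patterns in the data that even human experts cannot, and it can also learn new behaviors and adapt to threats in the environment. Not only that, you do not even have to bother encoding the rules, because the system has already taken care of all of it for you automatically.

If that were really the case, machine learning would truly be this year's main event! The irony is that while everyone is making a big show of getting something done in this field, very few people really understand what machine learning is or what it can actually be used for. Unsurprisingly, in such a climate machine learning is mostly misused, especially in security.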

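| What is the right way to solve security problems effectively with machine learning?

Applying machine learning to security mostly involves one technique, anomaly detection, which identifies whatever does not match an expected pattern or dataset. But vendors should note that this technique works only under certain conditions, and evidently they do not realize they are already making a mistake: they will tell you that, after analyzing your company's network traffic, machine learning can root out the hackers hidden in your network. In fact, machine learning cannot do that at all. At that point you should immediately treat the vendor with some suspicion.

So under what conditions does it work? The answer is that machine learning is effective only when a low-dimensional problem is paired with high-quality labeled data. Unfortunately, enterprises do not achieve this in practice. To detect new kinds of attack, you need clear, labeled examples of attacks. That is to say, without a thorough understanding of normal network behavior, machine learning cannot find hackers. Besides, hackers are cunning and will disguise themselves seamlessly.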

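| Where are machine learning and anomaly detection most valuable?

Where machine learning and anomaly detection are really useful is in classifying human behavior.

It turns out that human behavior is highly predictable, so it is possible to build very accurate models of individual users' behavior and use them to detect anomalies.

In fact, there has already been some success here, for example implicit authentication. Implicit authentication uses biometric techniques, authenticating users on the basis of keystroke pressure, rhythm, typing patterns and the like. Its advantages are clear for both user experience and security. At the very least, it frees users from the burden of remembering passwords and the hassle of typing them. Because most of the factors implicit authentication needs are low-dimensional, machine learning has only a few parameters to handle, which also makes it convenient to collect high-quality labeled data from users. So even with behavioral variation or signal noise, machine learning can still match patterns correctly, as it does in computer vision. In the same way, it can authenticate a person by recognizing that individual's unique behavior.

But how does it do that?

The way you walk, stand and make every other movement is in fact determined jointly by many factors, such as physical condition, age, gender and muscle memory. For an individual, these movements do not change much. So, without your noticing, the phone in your pocket captures this information accurately through its built-in sensors and records it. Four seconds of motion data is enough to identify a person by their movement. In addition, a user can be identified by comparing their historical and current location records. People live within all sorts of habits, and by observing when and from where they set out, it is possible to predict whether the person being tested is really the user.

Our phones and computers already contain a large number of sensors, and the number will surge as wearables become widespread and the Internet of Things develops. Large amounts of users' behavioral and environmental data are collected in this way and fed to machine learning, which builds an individual model for each user and finds the relationships among the factors.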

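| What homework do you need to do before letting machine learning handle security?

To provide protection, you must let your system know in advance what threat models exist.

First and most important: collect data. The data must be very accurate before it can be used to train the system and help withstand threats. If an authentication system does come under attack, though, you need not worry too much, because behavioral changes are relatively easy to detect and the system will quickly recognize the anomaly. For example, if a device is stolen, the motion, location and usage recorded after the theft will differ noticeably from the earlier records. That said, the system allows for anomalies that may be legitimate; in such cases the user confirms their identity on the system by another means, and the system is adjusted to minimize false positives. And once we connect 4 factors across different devices, the false positive rate of implicit authentication falls below 0.001%.

No form of machine learning in this world is magical enough to solve every security problem. Designers who want to use machine learning to build a useful security product need a deep understanding of the underlying system and must accept that many problems are not suited to machine learning. But there is no need to worry: the technology companies at the crest of the wave will eliminate these problems one by one.

Machine learning is brewing an unstoppable market frenzy in the security field.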

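Security

The future of cybersecurity depends on machine learning

October 24, 2016 zr9558

Information security has always been a game of cat and mouse. The good guys build a new wall, and the bad guys find a way through or around it. Lately, though, the bad guys seem to be getting through the wall more and more easily. To stop them, our capabilities need a huge boost, which may mean making much wider use of machine learning.

This may surprise observers outside the industry, but machine learning has not yet had a broad impact on IT security. Security experts say that, although credit card fraud detection systems and network equipment makers are using advanced analytics, the automated security operations common to practically every large company, such as detecting malware on PCs or identifying malicious activity on the network, mostly depend on humans writing code and configuring them in a timely manner.

Although the application of machine learning to cybersecurity has been the subject of extensive academic research, we are only now beginning to understand its impact on security tools. Some startups (such as Invincea, Cylance, Exabeam and Argyle Data) are using machine learning to power security tools that are faster and more accurate than those offered by today's major security software vendors.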

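Destroying malware with data

Invincea is a company in Virginia, USA, that specializes in detecting malware and protecting network security. Josh Saxe, its chief research engineer, believes it is time to abandon the 1990s analysis techniques based on signatures and file hashes.

Saxe said: “From what I understand, some antivirus companies have dabbled in machine learning, but what they live on is still signature detection. They detect malware based on file hashes or pattern matching, analysis techniques that human researchers came up with to detect a given sample.”

Invincea's advanced malware detection system is based in part on DARPA's Cyber Genome program.

He said: “They have been very successful at detecting the malware that was common in the past, but they are not good at detecting new malware, which is one reason cybercrime is so rampant today. Even if you have antivirus installed, others can still break into your computer, because signature-based detection simply does not work.”

At Invincea, Saxe is leading a team that uses machine learning to build a better malware detection system. The project is part of DARPA's Cyber Genome program and mainly uses machine learning to destroy detected malware, including reverse engineering how the malware works, performing social network analysis on code, and using machine learning systems to quickly destroy new malware samples that appear in the wild.

“We have shown that the machine-learning-based approach we developed is more effective than traditional antivirus systems. Machine learning systems can automate the work human analysts do, and even do it better. Combine a machine learning system with a large amount of training data and it can beat traditional signature-based detection systems.”

Invincea uses deep learning to speed up the training of its algorithms. At present, Saxe has about 1.5 million benign or malicious software samples for training, all of which is done on GPUs using Python tools. He hopes that as the sample data grows to 30 million, the performance advantage of the machine learning system will grow linearly.

“The more training data we have, and the more malware we use to train the machine learning system, the more pronounced its performance advantage in detecting malware will be,” he said.

Saxe said Invincea's current plan is to load more deep-learning-based capabilities into its endpoint security products in 2016. Specifically, the capability will be added to Cynomix, an endpoint security product that already uses machine learning.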

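Detecting malicious users

Machine learning also helps with other aspects of IT security: detecting malicious insiders and identifying compromised accounts.

Just as mainstream antivirus products rely on signatures to identify malware, tools that monitor user activity also rely on signatures. Signature-based detection is starting to fail in malware detection, and likewise its results in detecting user activity are unsatisfactory.

“In the past, enterprise security staff relied heavily on signature-based methods, such as IP address blacklists,” said Derek Lin, chief data scientist at Exabeam, a provider of user behavior analytics tools.

He said: “This approach looks for things that have already happened. The problem with signature-based methods is that the signature can only be seen after the event has occurred. Now, security staff are very focused on detecting malicious events that have no signature.”

Exabeam builds a map of user activity by tracking users' remote connection information, devices, IP addresses and credentials.

Today, shrewd criminals know that altering their path slightly is enough to defeat signature detection. So if the compromised detection system keeps an IP blacklist, cybercriminals can defeat it by hopping back and forth across the large range of network domains under their control.

Rather than clinging to yesterday's defensive strategies, Exabeam has gone on the offensive, building on Gartner's concept of UBA (User Behavior Analytics). The idea behind UBA is that you cannot know in advance whether a machine or user is good or bad, so you first assume that they are malicious and that your network is vulnerable, and you therefore monitor and model everyone's behavior at all times in order to find the malicious actors.

This is where machine learning algorithms come in. Lin and his team draw on a wide variety of sources (such as server logs and virtual private network (VPN) logs) and use a variety of supervised and unsupervised machine learning algorithms to detect anomalous patterns in user behavior.

Lin said: “All of this is about profiling user behavior; the question is how it is done. For every user or entity on the network, we try to build a profile of what is normal, which involves statistical analysis. Then, at a conceptual level, we look for deviations from the norm... We use behavior-based methods to find anomalies in the system and surface them so that security analysts can review them.”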

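The future of machine learning in security

"Think about the major waves of network security we have been through. Cybercriminals keep looking for effective ways to break security systems, and we have to fight back. Will machine learning become a mainstay of that fight? The answer is yes," said Patrick Townsend, founder and CEO of security software vendor Townsend Security.

He said: "We are now starting to get systems that can effectively process large amounts of unstructured data and detect patterns. I hope the products of the next wave of network security will be based on cognitive computing. Look at Watson: if it can win at Jeopardy, why can't it be used broadly to analyze and understand network security events? I think we are at the very beginning of using cognitive computing to help with security problems."

Invincea's Saxe hopes to ride that wave. He said: "I am not surprised that companies in this field have not yet caught it and produced algorithms based on the new deep learning. Training machine learning at this scale has only recently become practical. Ten years ago it could not have been done effectively."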

Security Business

Machine learning and big data know it wasn’t you who just swiped your credit card

October 24, 2016 zr9558

You’re sitting at home minding your own business when you get a call from your credit card’s fraud detection unit asking if you’ve just made a purchase at a department store in your city. It wasn’t you who bought expensive electronics using your credit card – in fact, it’s been in your pocket all afternoon. So how did the bank know to flag this single purchase as most likely fraudulent?

Credit card companies have a vested interest in identifying financial transactions that are illegitimate and criminal in nature. The stakes are high. According to the Federal Reserve Payments Study, Americans used credit cards to pay for 26.2 billion purchases in 2012. The estimated loss due to unauthorized transactions that year was US$6.1 billion. The federal Fair Credit Billing Act limits the maximum liability of a credit card owner to $50 for unauthorized transactions, leaving credit card companies on the hook for the balance. Obviously fraudulent payments can have a big effect on the companies’ bottom lines. The industry requires any vendors that process credit cards to go through security audits every year. But that doesn’t stop all fraud.

In the banking industry, measuring risk is critical. The overall goal is to figure out what’s fraudulent and what’s not as quickly as possible, before too much financial damage has been done. So how does it all work? And who’s winning in the arms race between the thieves and the financial institutions?

Gathering the troops

From the consumer perspective, fraud detection can seem magical. The process appears instantaneous, with no human beings in sight. This apparently seamless and instant action involves a number of sophisticated technologies in areas ranging from finance and economics to law to information sciences.

Of course, there are some relatively straightforward and simple detection mechanisms that don’t require advanced reasoning. For example, one good indicator of fraud can be an inability to provide the correct zip code affiliated with a credit card when it’s used at an unusual location. But fraudsters are adept at bypassing this kind of routine check – after all, finding out a victim’s zip code could be as simple as doing a Google search.

Traditionally, detecting fraud relied on data analysis techniques that required significant human involvement. An algorithm would flag suspicious cases to be closely reviewed ultimately by human investigators who may even have called the affected cardholders to ask if they’d actually made the charges. Nowadays the companies are dealing with a constant deluge of so many transactions that they need to rely on big data analytics for help. Emerging technologies such as machine learning and cloud computing are stepping up the detection game.

Learning what’s legit, what’s shady

Simply put, machine learning refers to self-improving algorithms, which are predefined processes conforming to specific rules, performed by a computer. A computer starts with a model and then trains it through trial and error. It can then make predictions such as the risks associated with a financial transaction.

A machine learning algorithm for fraud detection needs to be trained first by being fed the normal transaction data of lots and lots of cardholders. Transaction sequences are an example of this kind of training data. A person may typically pump gas one time a week, go grocery shopping every two weeks and so on. The algorithm learns that this is a normal transaction sequence.

After this fine-tuning process, credit card transactions are run through the algorithm, ideally in real time. It then produces a probability number indicating the possibility of a transaction being fraudulent (for instance, 97%). If the fraud detection system is configured to block any transactions whose score is above, say, 95%, this assessment could immediately trigger a card rejection at the point of sale.
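The thresholding step described above can be sketched in a few lines. This is a minimal illustration, not any card issuer's actual system: the `Transaction` type, the `decide` function, and the 0.95 cut-off (taken from the 95% example in the text) are all assumptions, and the fraud probability is assumed to come from some upstream model.

```python
from dataclasses import dataclass

# Assumed cut-off, mirroring the 95% example in the text.
BLOCK_THRESHOLD = 0.95


@dataclass
class Transaction:
    transaction_id: str
    fraud_probability: float  # score from a hypothetical upstream model, in [0, 1]


def decide(tx: Transaction, threshold: float = BLOCK_THRESHOLD) -> str:
    """Return "reject" when the score is strictly above the threshold, else "approve"."""
    if not 0.0 <= tx.fraud_probability <= 1.0:
        raise ValueError("fraud_probability must lie in [0, 1]")
    return "reject" if tx.fraud_probability > threshold else "approve"
```

With these assumptions, a transaction scored at 0.97 is rejected at the point of sale, while one scored at 0.40 is approved. Where exactly to place the threshold is a trade-off between blocked fraud and false positives.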

The algorithm considers many factors to qualify a transaction as fraudulent: trustworthiness of the vendor, a cardholder’s purchasing behavior including time and location, IP addresses, etc. The more data points there are, the more accurate the decision becomes.

This process makes just-in-time or real-time fraud detection possible. No person can evaluate thousands of data points simultaneously and make a decision in a split second.

Here’s a typical scenario. When you go to a cashier to check out at the grocery store, you swipe your card. Transaction details such as time stamp, amount, merchant identifier and membership tenure go to the card issuer. These data are fed to the algorithm that’s learned your purchasing patterns. Does this particular transaction fit your behavioral profile, consisting of many historic purchasing scenarios and data points?

The algorithm knows right away if your card is being used at the restaurant you go to every Saturday morning – or at a gas station two time zones away at an odd time such as 3:00 a.m. It also checks if your transaction sequence is out of the ordinary. If the card is suddenly used for cash-advance services twice on the same day when the historic data show no such use, this behavior is going to up the fraud probability score. If the transaction’s fraud score is above a certain threshold, often after a quick human review, the algorithm will communicate with the point-of-sale system and ask it to reject the transaction. Online purchases go through the same process.

In this type of system, heavy human interventions are becoming a thing of the past. In fact, they could actually be in the way since the reaction time will be much longer if a human being is too heavily involved in the fraud-detection cycle. However, people can still play a role – either when validating a fraud or following up with a rejected transaction. When a card is being denied for multiple transactions, a person can call the cardholder before canceling the card permanently.

Computer detectives, in the cloud

The sheer number of financial transactions to process is overwhelming, truly, in the realm of big data. But machine learning thrives on mountains of data – more information actually increases the accuracy of the algorithm, helping to eliminate false positives. These can be triggered by suspicious transactions that are really legitimate (for instance, a card used at an unexpected location). Too many alerts are as bad as none at all.

It takes a lot of computing power to churn through this volume of data. For instance, PayPal processes more than 1.1 petabytes of data for 169 million customer accounts at any given moment. This abundance of data – one petabyte, for instance, is more than 200,000 DVDs’ worth – has a positive influence on the algorithms’ machine learning, but can also be a burden on an organization’s computing infrastructure.

Enter cloud computing. Off-site computing resources can play an important role here. Cloud computing is scalable and not limited by the company’s own computing power.

Fraud detection is an arms race between good guys and bad guys. At the moment, the good guys seem to be gaining ground, with emerging innovations in IT technologies such as chip and pin technologies, combined with encryption capabilities, machine learning, big data and, of course, cloud computing.

Fraudsters will surely continue trying to outwit the good guys and challenge the limits of the fraud detection system. Drastic changes in the payment paradigms themselves are another hurdle. Your phone is now capable of storing credit card information and can be used to make payments wirelessly – introducing new vulnerabilities. Luckily, the current generation of fraud detection technology is largely neutral to the payment system technologies.

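Security Business

When there are too many Moments updates to read, see how Facebook optimizes its feed

October 24, 2016 zr9558

[Editor's note] This article is an Ask Me Anything session between Qin Chao (覃超) of the FREES internet team and Xu Wanhong (徐万鸿). Xu is a former senior engineer on Facebook's news feed ranking team who returned to China this September to become CTO of Shenzhou Zhuanche (神州专车). They discuss Facebook's Growth Hacking strategy, its anti-spam system, news feed ranking, and why he chose to return to China and join a startup. Leiphone (雷锋网) has edited the text without changing its meaning.

News feed ranking is one of Facebook's signature capabilities: a user receives two to three thousand new stories a day but reads only the first 50 to 100. Machine learning is used to put the content the user most wants to see at the top, which improves stickiness and daily active users.

This is a technology-focused article, and Facebook is one of the largest internet companies in the world, but founders can still learn from it. Using A/B testing as the iteration method, letting data, the core of Growth Hacking, drive development, the onboarding talk for new employees: these practices reflect the culture of this social giant at different levels. At the level of mission, the emphasis is on realizing a dream and sharing a single goal; at the micro level, that goal translates into respect for data.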

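What does Facebook do with the Sigma system?

When I first started at Facebook, the onboarding talk was given by the VP who focused on user growth. He said that eventually everyone in the world would use Facebook and that the company would become a trillion-dollar company, which left a deep impression on me. Everyone in the company was excited and had great confidence in the goals that had been set. They had a very strong sense of mission and were very focused.

That is one thing about Facebook that made a deep impression on me.

I worked on Facebook's Site Integrity team for two years. At the time Facebook had a lot of spam messages and spam content, much like the ads and junk links on Renren and Weibo. Some users' accounts were hijacked and used to send spam, ads, and malware from their profiles, and there were also unwanted friend requests. I dealt with everything of this kind that affected the user experience.

Facebook uses a system called Sigma to fight this spam. The system runs on more than 2,000 machines. Everything a Facebook user does, such as commenting, posting links, or sending friend requests, passes through Sigma, which decides whether it is normal behavior, abuse, or otherwise problematic.

With Sigma, Facebook filters and cleans up spam.

Take friend requests as an example. Facebook's system checks automatically: if a person's friend requests are all rejected, further requests will not be let through. If nine out of ten of a person's friend requests are rejected, the next one will be blocked by the system.

The system uses other signals as well.

It is a machine learning system: it uses how often your previous friend requests were rejected to estimate the probability that your next one will be rejected.

If that rate is high, Facebook asks you to verify yourself by SMS or another method, to check whether the request comes from software or a real person and whether you really intend to send it. For example, a request to someone with whom you share no mutual friends may be an unreasonable request.

Essentially everything you do on Facebook passes through this system, which analyzes it, makes a prediction, and decides whether to let the action through, with the aim of reducing harassment in the ecosystem. At the time, more than ten billion actions a day were judged by this system.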
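The friend-request example above can be sketched as a simple rule on the historical rejection rate. This is only an illustration of the idea, not Sigma's logic: the function names are hypothetical, the 0.9 blocking threshold mirrors the "nine out of ten" example, and the 0.5 verification threshold is an arbitrary assumption. A real system would combine many signals in a learned model.

```python
def rejection_rate(outcomes):
    """Fraction of past friend requests that were rejected.

    `outcomes` is a list of booleans: True = accepted, False = rejected.
    """
    if not outcomes:
        return 0.0
    return outcomes.count(False) / len(outcomes)


def next_request_action(outcomes, block_at=0.9, challenge_at=0.5):
    """Decide what to do with the sender's next friend request.

    block_at follows the "nine out of ten" example in the text;
    challenge_at is a hypothetical threshold for asking for SMS verification.
    """
    rate = rejection_rate(outcomes)
    if rate >= block_at:
        return "block"
    if rate >= challenge_at:
        return "challenge"
    return "allow"
```

For instance, a history of nine rejections and one acceptance yields `"block"`, six rejections out of ten yields `"challenge"`, and a sender with no history is allowed.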

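Machine learning is the core of Sigma

Sigma contains both hand-written rules and machine learning algorithms. The outcomes of requests, accepted or rejected, serve as a quick data set (Scrum). An accepted action is a positive sample for machine learning and a rejected one is a negative sample, much like 1 and 0.

For example, if a friend request is accepted the y value is 1, and if it is rejected it is 0. For comments and likes the system can look for a y value in the same way, and inappropriate content that users send is deleted.

Machine learning is the core of the whole Sigma system.

Another approach is to analyze anomalous user behavior with anomaly analysis and data mining methods.

For example, if one person posts a very large number of comments of the same kind, all containing a similar link, that is highly suspicious. A normal user does not leave the same comment on many different people's pages, so this is clearly anomalous behavior and we do not allow it.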
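The repeated-link heuristic above can be sketched as follows. This is a toy illustration under stated assumptions, not Facebook's detection logic: the function name, the `(target_profile, text)` input format, the simple URL regular expression, and the `min_profiles` threshold are all hypothetical.

```python
import re
from collections import defaultdict

# Deliberately simple URL pattern; real link extraction is more involved.
URL_RE = re.compile(r"https?://\S+")


def suspicious_links(comments, min_profiles=3):
    """Return URLs that one author posted on at least `min_profiles` different pages.

    `comments` is an iterable of (target_profile, text) pairs from a single author.
    """
    profiles_by_url = defaultdict(set)
    for target_profile, text in comments:
        for url in URL_RE.findall(text):
            profiles_by_url[url].add(target_profile)
    return {url for url, profiles in profiles_by_url.items()
            if len(profiles) >= min_profiles}
```

A link that shows up in one author's comments on three or more different profiles is flagged; a link posted once is not.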

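News feed is Facebook's most important product

After two years I chose to move to the news feed ranking team.

"Ranking" refers to the order of the feed. It determines what your feed looks like when you open Facebook and where each item appears. The content and news produced by your connections can add up to two or three thousand items, of which a user only sees 50 to 100, so you need to surface the best of them. Some items are not shown at all: for example, if you like games but your friend does not, your game stories are not shown to that friend.

When I joined in 2012, the news feed ranking team had only five or six people, even though this was probably the company's largest machine learning system and its most central product. More than a billion people come online every day, each spending about 40 minutes on Facebook, half of it in the news feed. Most of Facebook's revenue comes from news feed ads. For example, mobile ads account for 70% of all ad revenue, and all mobile ads appear in the news feed. Whether measured by time spent or by revenue, the news feed is the most important product.

The news feed is Facebook's most important product and directly determines what users see.

Ranking the news feed well is a hard problem, because users can do many things in the feed, not just click or not click as with traditional ads. They can like, comment on, share, or hide a story, or play a video. I need to understand what users like, what they comment on and share, and what videos they want to watch, and then, based on those signals, put the best content at the top of the feed.

For comparison with social media in China, WeChat Moments shows everything and needs no ranking because the volume is not large and people can read it all. As the number of friends grows and there is no longer time to read every post, ranking becomes inevitable. Otherwise you easily miss photos from people who matter to you, because they are quickly buried under content you do not care about.

Facebook also used to show everything, until users could no longer get through it all. Without ranking to pick out the best content, users would not want to visit the news feed, because they would see many things they are not interested in and have no time to dig out the things they are. Moving from an unranked to a ranked feed is an inevitable process: as friends and public pages multiply, ranking has to happen.

Sina Weibo, for example, does not rank its feed, and parts of it are chaotic. They tested ranking, but it did not work well, so they gave it up. WeChat Moments will also reach the stage where ranking is needed. Facebook does not only rank; it also hides content users are not interested in. If your friends play Candy Crush but you do not play any games, that information is meaningless to you, and Facebook will not show you "games your friends are playing".

The fragmentation of social media is already a fact. Only better ranking, which delivers more relevant content to users, can increase time spent on the platform and strengthen stickiness.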

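How does news feed ranking work?

Essentially, the news feed picks forty or fifty items out of two or three thousand. Every item is given a score, and the items with the highest scores go first. For each item, whether a post, photo, share, or status update, we predict a set of probabilities, such as the probability that you will like it, comment on it, or share it. Each user action (like, share, comment) is assigned a weight by the system, and the probabilities of these actions are computed with machine learning. If a user likes, comments on, or shares an item, it shows the user wanted to see it and gave feedback on it.

For example, suppose you are my friend and you uploaded 100 photos, of which I liked 20; the like probability is then 20%. We know which items each user liked and commented on in the past, and these are our training samples. By learning from a user's past behavior, we predict the same user's future behavior on the same type of content. Behavior does not change much in the short term, so someone who commented on certain things in the past is likely to comment on similar content in the future.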
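The scoring scheme described above (a weight per action, multiplied by a predicted probability, summed per story, then sorted) can be sketched like this. It is a simplified illustration only: the weights, the field names, and the two functions are hypothetical, and in the real system the probabilities come from trained models rather than being supplied by hand.

```python
# Hypothetical action weights; the interview does not give actual values.
ACTION_WEIGHTS = {"like": 1.0, "comment": 2.0, "share": 3.0}


def story_score(probabilities, weights=ACTION_WEIGHTS):
    """Weighted sum of predicted action probabilities for one story."""
    return sum(weight * probabilities.get(action, 0.0)
               for action, weight in weights.items())


def rank_feed(candidates, top_n, weights=ACTION_WEIGHTS):
    """Return the ids of the top_n candidate stories, highest score first."""
    ordered = sorted(candidates,
                     key=lambda story: story_score(story["probabilities"], weights),
                     reverse=True)
    return [story["id"] for story in ordered[:top_n]]


# Example: the like probability 0.2 matches the "20 likes out of 100 photos" case.
candidates = [
    {"id": "a", "probabilities": {"like": 0.20, "comment": 0.05, "share": 0.01}},
    {"id": "b", "probabilities": {"like": 0.10, "comment": 0.10, "share": 0.05}},
    {"id": "c", "probabilities": {"like": 0.05}},
]
top_two = rank_feed(candidates, top_n=2)
```

Under these assumed weights, story "b" scores 0.45, "a" scores 0.33, and "c" scores 0.05, so `top_two` lists "b" before "a".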

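Predicting from user content

Many people ask whether predictions can be made from the content itself, for example by analyzing the text or images a user posts. They can. For images, we can extract image features, run pattern recognition, determine the subject of the image, and attach the corresponding tags, so that the images are recognized by machine. This work is under way now. Facebook has an AI lab that can recognize the content of images.

So how does Facebook check whether the algorithm is effective, and how does it iterate?

This is done with A/B testing. We run the new algorithm on 1% of users and the old algorithm on another 1%. If users under the new algorithm like, comment, or share more each day, the new algorithm is better and we release it to all users. Our core goals are more daily active users, longer sessions, and more frequent visits to Facebook.

A/B testing is a good way to iterate. Define the core metrics, run an A/B test, and see whether the change improves them; if it does, ship it, and if not, do not. This is very much like Growth Hacking, and the ultimate goal is still to raise DAU. If users like your news feed, they visit more often; in the end it comes down to time spent and daily active users.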
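The A/B comparison described above can be sketched as a comparison of mean daily actions per user between the two groups. This is a minimal sketch, not Facebook's experimentation framework: the function names and the 1.96 threshold are assumptions, and the test used here (a z statistic on the difference of means) is one common choice among several.

```python
import math
from statistics import mean, variance


def z_statistic(control, treatment):
    """z statistic for the difference in mean daily actions per user.

    Assumes both groups have at least two users and non-zero variance.
    """
    standard_error = math.sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    return (mean(treatment) - mean(control)) / standard_error


def should_ship(control, treatment, z_threshold=1.96):
    """Ship only if the new algorithm's mean is higher by a clear margin."""
    return z_statistic(control, treatment) > z_threshold
```

Here `control` and `treatment` are lists of per-user counts (likes, comments, and shares per day) for the two 1% groups. Requiring a margin rather than any increase guards against shipping a change whose apparent gain is only noise.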

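A/B testing is how Facebook checks whether an iteration works. Yaohe Technology (吆喝科技), a company backed by FREES Fund (峰瑞资本), wants to make this technique available to startups as well.

"I can no longer read everything in my Moments feed"

I can no longer read everything in my Moments feed. One improvement is ranking: put the best content first and learn what you care about from what you liked in the past, for example everything your girlfriend posts. Another improvement is called story bumping. Sometimes I scroll WeChat when I get up in the morning, cannot get through it all, and only read a small part. When I check again a little later, there is nothing new.

Facebook's story bumping moves items you have not finished reading back to the top and shows them to you again.

WeChat knows which items you have not seen. I have many friends in the US, so my Moments feed fills up and I only get through part of it before work. When I refresh later nothing new appears, and I never get back to the items I missed, such as photos my friends posted. Facebook's story bumping puts important, unread, slightly older items at the front of the feed so that you get another look and do not miss important content.

In September I joined Shenzhou Zhuanche as CTO. In career terms, I hope to bring the company culture and technology I learned at Facebook back to China. China has great potential in the computing industry. The quality of Chinese products is now on a par with American ones: WeChat is an example, and Facebook's product managers have studied WeChat's features too. Looking a few years ahead, China has a chance to catch up with the US.

Computer science as a discipline has matured here, and creativity is gradually improving. Many startups are trying different ideas, there are many times more founders in China than in the US, and successful companies will emerge. Technically, China is closing in on the US and may even overtake it. In the long run, China's computing and internet industries have the potential to be the best in the world.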

Security Business

Fighting spam with Haskell

October 24, 2016 zr9558

One of our weapons in the fight against spam, malware, and other abuse on Facebook is a system called Sigma. Its job is to proactively identify malicious actions on Facebook, such as spam, phishing attacks, posting links to malware, etc. Bad content detected by Sigma is removed automatically so that it doesn’t show up in your News Feed.

We recently completed a two-year-long major redesign of Sigma, which involved replacing the in-house FXL language previously used to program Sigma with Haskell. The Haskell-powered Sigma now runs in production, serving more than one million requests per second.

Haskell isn’t a common choice for large production systems like Sigma, and in this post, we’ll explain some of the thinking that led to that decision. We also wanted to share the experiences and lessons we learned along the way. We made several improvements to GHC (the Haskell compiler) and fed them back upstream, and we were able to achieve better performance from Haskell compared with the previous implementation.

How does Sigma work?

Sigma is a rule engine, which means it runs a set of rules, called policies. Every interaction on Facebook — from posting a status update to clicking “like” — results in Sigma evaluating a set of policies specific to that type of interaction. These policies make it possible for us to identify and block malicious interactions before they affect people on Facebook.

Policies are continuously deployed. At all times, the source code in the repository is the code running in Sigma, allowing us to move quickly to deploy policies in response to new abuses. This also means that safety in the language we write policies in is important. We don’t allow code to be checked into the repository unless it is type-correct.

Louis Brandy of Facebook’s Site Integrity team discusses scalable spam fighting and the anti-abuse structure at Facebook and Instagram in a 2014 @Scale talk.

Why Haskell?

The original language we designed for writing policies, FXL, was not ideal for expressing the growing scale and complexity of Facebook policies. It lacked certain abstraction facilities, such as user-defined data types and modules, and its implementation, based on an interpreter, was slower than we wanted. We wanted the performance and expressivity of a fully fledged programming language. Thus, we decided to migrate to an existing language rather than try to improve FXL.

The following features were at the top of our list when we were choosing a replacement:

1. Purely functional and strongly typed. This ensures that policies can’t inadvertently interact with each other, they can’t crash Sigma, and they are easy to test in isolation. Strong types help eliminate many bugs before putting policies into production.

2. Automatically batch and overlap data fetches. Policies typically fetch data from other systems at Facebook, so we want to employ concurrency wherever possible for efficiency. We want concurrency to be implicit, so that engineers writing policies can concentrate on fighting spam and not worry about concurrency. Implicit concurrency also prevents the code from being cluttered with efficiency-related details that would obscure the functionality, and make the code harder to understand and modify.

3. Push code changes to production in minutes. This enables us to deploy new or updated policies quickly.

4. Performance. FXL’s slower performance meant that we were writing anything performance-critical in C++ and putting it in Sigma itself. This had a number of drawbacks, particularly the time required to roll out changes.

5. Support for interactive development. Developers working on policies want to be able to experiment and test their code interactively, and to see the results immediately.

Haskell measures up quite well: It is a purely functional and strongly typed language, and it has a mature optimizing compiler and an interactive environment (GHCi). It also has all the abstraction facilities we would need, it has a rich set of libraries available, and it’s backed by an active developer community.

That left us with two features from our list to address: (1) automatic batching and concurrency, and (2) hot-swapping of compiled code.

Automatic batching and concurrency: The Haxl framework

All the existing concurrency abstractions in Haskell are explicit, meaning that the user needs to say which things should happen concurrently. For data-fetching, which can be considered a purely functional operation, we wanted a programming model in which the system just exploits whatever concurrency is available, without the programmer having to use explicit concurrency constructs. We developed the Haxl framework to address this issue: Haxl enables multiple data-fetching operations to be automatically batched and executed concurrently.

We discussed Haxl in an earlier blog post, and we published a paper on Haxl at the ICFP 2014 conference. Haxl is open source and available on GitHub.

In addition to the Haxl framework, we needed help from the Haskell compiler in the form of the Applicative do-notation. This allows programmers to write sequences of statements that the compiler automatically rearranges to exploit concurrency. We also designed and implemented Applicative do-notation in GHC.

Hot-swapping of compiled code

Every time someone checks new code into the repository of policies, we want to have that code running on every machine in the Sigma fleet as quickly as possible. Haskell is a compiled language, so that involves compiling the code and distributing the new compiled code to all the machines running Sigma.

We want to update the compiled rules in a running Sigma process on the fly, while it is serving requests. Changing the code of a running program is a tricky problem in general, and it has been the subject of a great deal of research in the academic community. In our case, fortunately, the problem is simpler: Requests to Sigma are short-lived, so we don’t need to switch a running request to new code. We can serve new requests on the new code and let the existing requests finish before we discard the old code. We’re careful to ensure that we don’t change any code associated with persistent state in Sigma.

Loading and unloading code currently uses GHC’s built-in runtime linker, although in principle, we could use the system dynamic linker. To unload the old version of the code, the garbage collector gets involved. The garbage collector detects when old code is no longer being used by a running request, so we know when it is safe to unload it from the running process.

How Haskell fits in

Haskell is sandwiched between two layers of C++ in Sigma. At the top, we use the C++ thrift server. In principle, Haskell can act as a thrift server, but the C++ thrift server is more mature and performant. It also supports more features. Furthermore, it can work seamlessly with the Haskell layers below because we can call into Haskell from C++. For these reasons, it made sense to use C++ for the server layer.

At the lowest layer, we have existing C++ client code for talking to other internal services. Rather than rewrite this code in Haskell, which would duplicate the functionality and create an additional maintenance burden, we wrapped each C++ client in a Haxl data source using Haskell’s Foreign Function Interface (FFI) so we could use it from Haskell.

Haskell’s FFI is designed to call C rather than C++, so calling C++ requires an intermediate C layer. In most cases, we were able to avoid the intermediate C layer by using a compile-time tool that demangles C++ function names so they can be called directly from Haskell.

Performance

Perhaps the biggest question here is “Does it run fast enough?” Requests to Sigma result from users performing actions on Facebook, such as sending a message on Messenger, and Sigma must respond before the action can take place. So we wanted to serve requests fast enough to avoid interruptions to the user experience.

The graph below shows the relative throughput performance between FXL and Haskell for the 25 most common types of requests served by Sigma (these requests account for approximately 95 percent of Sigma’s typical workload).

Haskell performs as much as three times faster than FXL for certain requests. On a typical workload mix, we measured a 20 percent to 30 percent improvement in overall throughput, meaning we can serve 20 percent to 30 percent more traffic with the same hardware. We believe additional improvements are possible through performance analysis, tuning, and optimizing the GHC runtime for our workload.

Achieving this level of performance required a lot of hard work, profiling the Haskell code, and identifying and resolving performance bottlenecks.

Here are a few specific things we did:

We implemented automatic memoization of top-level computations using a source-to-source translator. This is particularly beneficial in our use-case where multiple policies can refer to the same shared value, and we want to compute it only once. Note, this is per-request memoization rather than global memoization, which lazy evaluation already provides.
We made a change to the way GHC manages the heap, to reduce the frequency of garbage collections on multicore machines. GHC’s default heap settings are frugal, so we also use a larger allocation area size of at least 64 MB per core.
Fetching remote data usually involves marshaling the data structure across the C++/Haskell boundary. If the whole data structure isn’t required, it is better to marshal only the pieces needed. Or better still, don’t fetch the whole thing — although that’s only possible if the remote service implements an appropriate API.
We uncovered a nasty performance bug in aeson, the Haskell JSON parsing library. Bryan O’Sullivan, the author of aeson, wrote a nice blog post about how he fixed it. It turns out that when you do things at Facebook scale, those one-in-a-million corner cases tend to crop up all the time.

Resource limits

In a latency-sensitive service, you don’t want a single request using a lot of resources and slowing down other requests on the same machine. In this case, the “resources” include everything on the machine that is shared by the running requests — CPU, memory, network bandwidth, and so on.

A request that uses a lot of resources is normally a bug that we want to fix. It does happen from time to time, often as a result of a condition that occurs in production that wasn’t encountered during testing — perhaps an innocuous operation provided with some unexpectedly large input data, or pathological performance of an algorithm on certain rare inputs, for example. When this happens, we want Sigma to terminate the affected request with an error (that will subsequently result in the bug being fixed) and continue without any impact on the performance of other requests being served.

To make this possible, we implemented allocation limits in GHC, which place a bound on the amount of memory a thread can allocate before it is terminated. Terminating a computation safely is a hard problem in general, but Haskell provides a safe way to abort a computation in the form of asynchronous exceptions. Asynchronous exceptions allow us to write most of our code ignoring the potential for summary termination and still have all the nice guarantees that we need in the event that the limit is hit, including safe releasing of resources, closing network connections, and so forth.

The following graph illustrates how well allocation limits work in practice. It tracks the maximum live memory across various groups of machines in the Sigma fleet. When we enabled one request that had some resource-intensive outliers, we saw large spikes in the maximum live memory, which disappeared when we enabled allocation limits.

Enabling interactive development

Facebook engineers develop policies interactively, testing code against real data as they go. To enable this workflow in Haskell, we needed the GHCi environment to work with our full stack, including making requests to other back-end services from the command line.

To make this work, we had to make our build system link all the C++ dependencies of our code into a shared library that GHCi could load. We also customized the GHCi front end to implement some of our own commands and streamline the desired workflows. The result is an interactive environment in which developers can load their code from source in a few seconds and work on it with a fast turnaround time. They have the full set of APIs available and can test against real production data sources.

While GHCi isn’t as easy to customize as it could be, we’ve already made several improvements and contributed them upstream. We hope to make more improvements in the future.

Packages and build systems

In addition to GHC itself, we make use of a lot of open-source Haskell library code. Haskell has its own packaging and build system, Cabal, and the open-source packages are all hosted on Hackage. The problem with this setup is that the pace of change on Hackage is fast, there are often breakages, and not all combinations of packages work well together. The system of version dependencies in Cabal relies too much on package authors getting it right, which is hard to ensure, and the tool support isn’t what it could be. We found that using packages directly from Hackage together with Facebook’s internal build tools meant adding or updating an existing package sometimes led to a yak-shaving exercise involving a cascade of updates to other packages, often with an element of trial and error to find the right version combinations.

As a result of this experience, we switched to Stackage as our source of packages. Stackage provides a set of package versions that are known to work together, freeing us from the problem of having to find the set by trial and error.

Did we find bugs in GHC?

Yes, most notably:

We fixed a bug in GHC’s garbage collector that was causing our Sigma processes to crash every few hours. The bug had gone undetected in GHC for several years.
We fixed a bug in GHC’s handling of finalizers that occasionally caused crashes during process shutdown.

Following these fixes, we haven’t seen any crashes in either the Haskell runtime or the Haskell code itself across our whole fleet.

What else?

At Facebook, we’re using Haskell at scale to fight spam and other types of abuse. We’ve found it to be reliable and performant in practice. Using the Haxl framework, our engineers working on spam fighting can focus on functionality rather than on performance, while the system can exploit the available concurrency automatically.

For more information on spam fighting at Facebook, check out our Protect the Graph page, or watch videos from our recent Spam Fighting @Scale event.

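Data Mining and Machine Learning

Maximum Likelihood Estimation

October 18, 2016 zr9558

（一）Basic idea

Consider a probability distribution $D$ whose probability density function $f_{D}$ depends on an unknown parameter $\theta$ . If we draw $n$ samples $x_{1},x_{2},...,x_{n}$ from this distribution, the probability (density) of observing them is

$P(x_{1},...,x_{n}) = f_{D}(x_{1},...,x_{n}|\theta)$ .

However, we do not know the value of $\theta$ , and estimating it is the crux of the problem. A natural idea is to draw random samples $x_{1},...,x_{n}$ from the distribution and use these data to estimate $\theta$ .

The maximum likelihood estimator computes the most plausible value of $\theta$ , that is, the parameter value that maximizes the probability of the observed sample.

In mathematical terms, we first define the likelihood function:

$L(\theta) = f_{D}(x_{1},...,x_{n}|\theta)$ ,

and then maximize it over all admissible values of $\theta$ . When $L(\theta)$ is differentiable and the maximum is attained in the interior of the parameter space, the first derivative of $L(\theta)$ is zero at the maximizer; this is a necessary condition, not a sufficient one. The parameter value $\hat{\theta}$ that maximizes $L(\theta)$ is called the maximum likelihood estimate of $\theta$ .

Remark. The maximum likelihood estimate is not necessarily unique, and it may not even exist.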

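（二）Basic algorithm

The general steps for computing a maximum likelihood estimate are:

（1）Define the likelihood function;

（2）Differentiate the likelihood function, or more conveniently its logarithm; the logarithm is monotonic, so it has the same maximizer and usually a simpler derivative;

（3）Set the first derivative to zero to obtain the likelihood equation for the parameter $\theta$ ;

（4）Solve the likelihood equation; the solution is the maximum likelihood estimate. If the equation cannot be solved in closed form, Newton's method can be used to solve it numerically.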

（三）Examples

（i）Bernoulli Distribution

Suppose we have $n$ random samples $x_{1},...,x_{n}$ , where $x_{i}=0$ if the $i$ -th student does not own a bicycle and $x_{i}=1$ otherwise. Assume that the $x_{i}$ follow a Bernoulli distribution with unknown parameter $p$ . Our goal is to compute the maximum likelihood estimate of $p$ , that is, the proportion of all students who own a bicycle.

If $\{x_{i}:1\leq i\leq n\}$ are independent Bernoulli random variables, then the probability function of each $x_{i}$ is:

$f(x_{i};p)=p^{x_{i}}(1-p)^{1-x_{i}}, \text{ for } x_{i}=0 \text{ or } 1 \text{ and } 0<p<1$ .

The likelihood function $L(p)$ can therefore be defined as:

$L(p)=\prod_{i=1}^{n}f(x_{i};p)=p^{\sum_{i=1}^{n}x_{i}}(1-p)^{n-\sum_{i=1}^{n}x_{i}}$ .

To find $p$ , we differentiate $\ln(L(p))$ :

$\ln(L(p))=(\sum_{i=1}^{n}x_{i})\ln(p) + (n-\sum_{i=1}^{n}x_{i})\ln(1-p)$

$\frac{\partial\ln(L(p))}{\partial p} = \frac{\sum_{i=1}^{n}x_{i}}{p}-\frac{n-\sum_{i=1}^{n}x_{i}}{1-p}$

Setting $\frac{\partial\ln(L(p))}{\partial p} = 0$ gives

$p=\sum_{i=1}^{n}x_{i}/n$ .

In other words, the maximum likelihood estimate is

$\hat{p}=\sum_{i=1}^{n}x_{i}/n$ .
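As a quick numerical check of the Bernoulli result, the sketch below computes $\hat{p}$ as the sample mean and evaluates the log-likelihood from the derivation. The function names and the sample data are illustrative assumptions.

```python
import math


def bernoulli_mle(xs):
    """Maximum likelihood estimate of p: the fraction of ones in the sample."""
    if not xs:
        raise ValueError("need at least one observation")
    return sum(xs) / len(xs)


def bernoulli_log_likelihood(xs, p):
    """ln L(p) = (sum x_i) ln p + (n - sum x_i) ln(1 - p), for 0 < p < 1."""
    ones = sum(xs)
    return ones * math.log(p) + (len(xs) - ones) * math.log(1.0 - p)


# Hypothetical survey: 1 = owns a bicycle, 0 = does not.
sample = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
p_hat = bernoulli_mle(sample)
```

For this sample, six of ten students own a bicycle, so `p_hat` is 0.6, and the log-likelihood at 0.6 is higher than at nearby values such as 0.5 or 0.7.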

（ii）Gaussian Distribution

Suppose $x_{1},...,x_{n}$ are independent samples from a normal distribution whose parameters $\mu$ and $\sigma^{2}$ are both unknown. The goal is to find the maximum likelihood estimates of the mean $\mu$ and the variance $\sigma^{2}$ .

If $\{x_{1},...,x_{n}\}$ follow a normal distribution, then the probability density function of each $x_{i}$ is:

$f(x_{i};\mu,\sigma^{2}) = \frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{(x_{i}-\mu)^{2}}{2\sigma^{2}})$ .

The likelihood function is:

$L(\mu,\sigma) = \prod_{i=1}^{n} f(x_{i};\mu,\sigma^{2}) = (2\pi)^{-n/2}\sigma^{-n}\exp(-\frac{\sum_{i=1}^{n}(x_{i}-\mu)^{2}}{2\sigma^{2}})$

$\ln(L(\mu,\sigma)) = -\frac{n}{2}\ln(2\pi) - n\ln(\sigma) - \frac{\sum_{i=1}^{n}(x_{i}-\mu)^{2}}{2\sigma^{2}}$

$\frac{\partial \ln(L(\mu,\sigma))}{\partial \mu} = \frac{\sum_{i=1}^{n}(x_{i}-\mu)}{\sigma^{2}}$

$\frac{\partial \ln(L(\mu,\sigma))}{\partial \sigma} = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n}(x_{i}-\mu)^{2}}{\sigma^{3}}$

Setting $\frac{\partial \ln(L(\mu,\sigma))}{\partial \mu} =0$ and $\frac{\partial \ln(L(\mu,\sigma))}{\partial \sigma}=0$ and solving the resulting system gives:

$\hat{\mu}= \sum_{i=1}^{n}x_{i}/n$ ,

$\hat{\sigma}^{2} = \sum_{i=1}^{n}(x_{i}-\hat{\mu})^{2}/n$ .
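The two closed-form estimates translate directly into code. The sketch below is illustrative (the function name and data are assumptions); note that the variance estimate divides by $n$ , as in the formula above, rather than by $n-1$ as the unbiased sample variance does.

```python
def gaussian_mle(xs):
    """Return (mu_hat, sigma2_hat), the maximum likelihood estimates.

    sigma2_hat divides by n, matching the derivation; it is the biased
    estimator, unlike the n - 1 version used for the sample variance.
    """
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one observation")
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat


mu_hat, sigma2_hat = gaussian_mle([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

For this data set the mean is 5 and the squared deviations sum to 32, so the estimates are 5.0 and 4.0.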

(iii) Weibull Distribution

First, recall the definition of the Weibull distribution. It is a continuous probability distribution with probability density function:

$f(x;k,\lambda) = \frac{k}{\lambda}(\frac{x}{\lambda})^{k-1}\exp(-(x/\lambda)^{k}) \text{ for } x\geq 0, f(x;k,\lambda)=0 \text{ for } x<0.$

Here $x$ is the random variable, $\lambda>0$ is the scale parameter, and $k>0$ is the shape parameter. In particular, when $k=1$ the Weibull distribution is the exponential distribution, and when $k=2$ it is the Rayleigh distribution.

The cumulative distribution function of the Weibull distribution is

$F(x;k,\lambda) = 1- \exp(-(x/\lambda)^{k}) \text{ for } x\geq 0$ ,

$F(x;k,\lambda) = 0 \text{ for } x<0$ .

The quantile function (inverse cumulative distribution function) of the Weibull distribution is

$Q(p;k,\lambda) = \lambda(-\ln(1-p))^{1/k} \text{ for } 0\leq p <1$ .

Next, we compute the maximum likelihood estimates.

Suppose $\{x_{1},...,x_{n}\}$ are independent samples from a Weibull distribution with unknown parameters $k,\lambda.$ Then for each $x_{i}$ the probability density function is:

$p(x_{i};k,\lambda) = \frac{k}{\lambda}(\frac{x_{i}}{\lambda})^{k-1}\exp(-(\frac{x_{i}}{\lambda})^{k})$ .

Define the likelihood function as:

$L(k,\lambda) = \prod_{i=1}^{n}p(x_{i};k,\lambda)$

Taking the logarithm gives:

$\ln(L(k,\lambda)) = n\ln(k) - nk\ln(\lambda) + (k-1)\sum_{i=1}^{n}\ln(x_{i}) - \sum_{i=1}^{n}x_{i}^{k}/\lambda^{k}.$

The first-order partial derivatives are:

$\frac{\partial \ln(L(k,\lambda))}{\partial \lambda} = - \frac{nk}{\lambda} + \frac{k\sum_{i=1}^{n}x_{i}^{k}}{\lambda^{k+1}},$

$\frac{\partial \ln(L(k,\lambda))}{\partial k} = \frac{n}{k} - n\ln(\lambda) + \sum_{i=1}^{n}\ln(x_{i}) -\sum_{i=1}^{n}(\frac{x_{i}}{\lambda})^{k}\ln(\frac{x_{i}}{\lambda}).$

Setting both partial derivatives to zero yields:

$\lambda^{k}=\frac{\sum_{i=1}^{n}x_{i}^{k}}{n},$

$\frac{1}{k} = \frac{\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i})}{\sum_{i=1}^{n}x_{i}^{k}} -\frac{\sum_{i=1}^{n}\ln(x_{i})}{n}.$
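The step from the partial derivatives to the equation for $k$ is worth spelling out. Using $\lambda^{k}=\sum_{i=1}^{n}x_{i}^{k}/n$ , so that $\sum_{i=1}^{n}(x_{i}/\lambda)^{k}=n$ , the last sum in $\partial \ln(L)/\partial k$ becomes

$\sum_{i=1}^{n}(\frac{x_{i}}{\lambda})^{k}\ln(\frac{x_{i}}{\lambda}) = \frac{n\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i})}{\sum_{i=1}^{n}x_{i}^{k}} - n\ln(\lambda).$

Substituting this into $\partial \ln(L)/\partial k = 0$ , the two $n\ln(\lambda)$ terms cancel:

$\frac{n}{k} + \sum_{i=1}^{n}\ln(x_{i}) - \frac{n\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i})}{\sum_{i=1}^{n}x_{i}^{k}} = 0,$

and dividing by $n$ and rearranging gives the equation for $1/k$ .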

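The first equation gives the maximum likelihood estimate of $\lambda$ once $k$ is known. The second equation defines $k$ only implicitly and cannot be solved in closed form, so we use Newton’s method.

Let

$f(k) = \frac{\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i})}{\sum_{i=1}^{n}x_{i}^{k}} - \frac{\sum_{i=1}^{n}\ln(x_{i})}{n} - \frac{1}{k}.$

Differentiating gives:

$f^{'}(k)= \frac{\sum_{i=1}^{n}x_{i}^{k}(\ln(x_{i}))^{2}}{\sum_{i=1}^{n}x_{i}^{k}}-(\frac{\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i})}{\sum_{i=1}^{n}x_{i}^{k}})^{2} + \frac{1}{k^{2}}.$

By the Cauchy-Schwarz inequality,

$(\sum_{i=1}^{n}x_{i}^{k}(\ln(x_{i}))^{2})\cdot(\sum_{i=1}^{n}x_{i}^{k})\geq (\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i}))^{2}.$

Hence the first two terms of $f^{'}(k)$ together are nonnegative, so $f^{'}(k)>0 \text{ for all } k>0$ ; in other words, $f(k)$ is an increasing function of $k$ . Moreover,

$\lim_{k\rightarrow 0^{+}}f(k) = -\infty,$

$\lim_{k\rightarrow +\infty}f(k) = \ln(\max_{i}x_{i}) - \frac{\sum_{i=1}^{n}\ln(x_{i})}{n} > 0 \text{ unless all } x_{i} \text{ are equal.}$

The second limit holds because, as $k$ grows, the weights $x_{i}^{k}$ concentrate on the largest sample. The increasing function $f(k)$ therefore has exactly one zero on $(0,+\infty)$ . When using Newton’s iteration, the initial point can be a positive number close to zero, for example $k_{0}=0.0001$ . If the iteration starts from a relatively large number, a Newton step may land on the negative axis, whereas starting from a sufficiently small number keeps the iterates on the positive axis. The Newton iteration is:

$k_{0}= 0.0001,$

$k_{n+1} = k_{n}- \frac{f(k_{n})}{f^{'}(k_{n})} \text{ for all } n\geq 0.$

When $n$ is large enough, $k_{n}$ can be taken as the maximum likelihood estimate of $k$ .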
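The Newton iteration above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not a reference one: the function name `weibull_mle`, the stopping tolerance, the iteration cap, and the step that halves $k$ if an update would become non-positive are all additions for this example. The sums are computed with weights $x_{i}^{k}/x_{\max}^{k}$; the common factor cancels in every ratio and avoids overflow for large $k$ . The test data come from the standard library's `random.weibullvariate(alpha, beta)`, whose `alpha` is the scale and `beta` the shape.

```python
import math
import random


def weibull_mle(xs, k0=1e-4, tol=1e-10, max_iter=200):
    """Newton iteration for the shape k, then the closed-form scale lambda."""
    n = len(xs)
    logs = [math.log(x) for x in xs]  # requires every x > 0
    mean_log = sum(logs) / n
    log_max = max(logs)

    def moments(k):
        # Weights x_i^k / x_max^k: the common factor cancels in all ratios.
        w = [math.exp(k * (lg - log_max)) for lg in logs]
        s0 = sum(w)
        s1 = sum(wi * lg for wi, lg in zip(w, logs))
        s2 = sum(wi * lg * lg for wi, lg in zip(w, logs))
        return s0, s1, s2

    k = k0
    for _ in range(max_iter):
        s0, s1, s2 = moments(k)
        f = s1 / s0 - mean_log - 1.0 / k
        fprime = s2 / s0 - (s1 / s0) ** 2 + 1.0 / k ** 2
        k_next = k - f / fprime
        if k_next <= 0.0:  # safeguard added for this sketch
            k_next = k / 2.0
        converged = abs(k_next - k) < tol * max(1.0, k)
        k = k_next
        if converged:
            break

    s0, _, _ = moments(k)
    # lambda^k = (1/n) * sum x_i^k = x_max^k * s0 / n
    lam = math.exp(log_max) * (s0 / n) ** (1.0 / k)
    return k, lam


rng = random.Random(42)  # fixed seed, for illustration only
data = [rng.weibullvariate(3.0, 2.0) for _ in range(5000)]  # scale 3, shape 2
k_hat, lam_hat = weibull_mle(data)
print(k_hat, lam_hat)  # should land near 2 and 3
```

Starting from $k_{0}=0.0001$ , the term $-1/k$ dominates $f$ , so the first Newton steps roughly double $k$ until the iterate reaches the scale of the root.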

Data Mining and Machine Learning

How machine learning can help the security industry

October 11, 2016 zr9558

Machine learning (ML) is such a hot area in security right now.

At the 2016 RSA Conference, you would be hard pressed to find a company that is not claiming to use ML for security. And why not? To the layperson, ML seems like the magic solution to all security problems. Take a bunch of unlabeled data, pump it through a system with some ML magic inside, and it can somehow identify patterns even human experts can’t find — all while learning and adapting to new behaviors and threats. Rather than having to code the rules, these systems can discover the rules all by themselves.

Oh, if only that were the case! ML is this year’s “big data”: Everyone is claiming to do it, but few actually do it right or even understand what it’s good for. Especially in security, I’ve seen more misapplications than appropriate ones.

Most applications of ML in security use a form of anomaly detection, which is used to spot events that do not match an expected pattern. Anomaly detection is a useful technique in certain circumstances, but too often, vendors misapply it. For example, they will claim to analyze network traffic in an enterprise and use ML to find hackers in your network. This does not work, and you should be immediately skeptical of the vendors who make this claim.

Effective machine learning requires a low dimensionality problem with high-quality labeled data. Unfortunately, deployments in real enterprises have neither. Detecting novel attacks requires either clear, labeled examples of attacks, which you do not have by definition, or a complete, exhaustive understanding of “normal” network behavior, which is impossible for any real network. And any sophisticated attacker will make an attack appear as seamless and “typical” as possible, to avoid setting off alarms.

Where does ML work?

One example where ML and anomaly detection can actually work well for security is in classifying human behavior. Humans, it turns out, are fairly predictable, and it is possible to build fairly accurate models of individual user behavior and detect when it doesn’t match their normal behavior.

We’ve had success in using ML for implicit authentication via analyzing a user’s biometrics, behavior, and environment. Implicit authentication is a technique that allows users to authenticate without performing any explicit actions like entering a password or swiping a fingerprint. This has clear benefits to both the user experience as well as for security. Users don’t need to be bothered with extra steps, we can use many authentication factors (rather than just one, a password), and it can happen continuously in the background.

Implicit authentication is well-suited to ML because most of the factors are low dimensional, meaning they involve a small number of parameters, and you can passively gather high-quality labeled data about user identities. Much like ML is effective in matching images for computer vision even in the presence of variance and noise, it is also effective in matching unique human behavioral aspects.

One example of this technology is how we can authenticate users based on unique aspects to the way they move. Attributes of the way you walk, sit, and stand are influenced by a large number of factors (including physiology, age, gender, and muscle memory), but are largely consistent for an individual. It is actually possible to accurately detect some of these attributes from the motion sensors in your phone in your pocket. In fact, after four seconds of motion data from a phone in your pocket, we can detect enough of these attributes to identify you. Another example is in using a user’s location history to authenticate them. Humans are creatures of habit, and by looking at where they came from and when, we can make an estimate of whether it’s them.

There are enough sensors in phones and computers (and more recently, wearables and IoT devices) that it is possible to passively pick up a large number of unique attributes about a user’s behavior and environment. We can then use ML to build a unique model for an individual user and find correlations between factors.

Threat models and anomaly detection

In any security system, it is important to understand the threat models you are trying to protect against. When using ML for security, you need to explicitly gather data, model the threats your system is protecting against, and use the model to train your system. Fortunately, for attacks against authentication, it is often possible to detect behavioral changes. For example, when a device is stolen, there are often clear changes in terms of its movement, location, and usage. And because false negatives are acceptable in that they just require the user to re-authenticate with a different method, we can tune the system to minimize false positives. In fact, once we combine four factors across multiple devices, we can get below a 0.001 percent false positive rate on implicit authentication.

There is no magic machine learning genie that can solve all your security problems. Building an effective security product that uses ML requires a deep understanding of the underlying system, and many security problems are just not appropriate for ML. For those that are, it’s a very powerful technique. And don’t worry, the companies on the hype train will soon move on to newer fads, like mobile self-driving AR blockchain drone marketplaces.

One Dimensional Dynamical System

Hausdorff dimension of the graphs of the classical Weierstrass functions

October 10, 2016 zr9558

In this paper, we obtain the explicit value of the Hausdorff dimension of the graphs of the classical Weierstrass functions, by proving absolute continuity of the SRB measures of the associated solenoidal attractors.

1. Introduction

In Real Analysis, the classical Weierstrass function is

$\displaystyle W_{\lambda,b}(x) = \sum\limits_{n=0}^{\infty} \lambda^n \cos(2\pi b^n x)$

with ${1/b < \lambda < 1}$ .

Note that the Weierstrass functions have the form

$\displaystyle f^{\phi}_{\lambda,b}(x) = \sum\limits_{n=0}^{\infty} \lambda^n \phi(b^n x)$

where ${\phi}$ is a ${\mathbb{Z}}$ -periodic ${C^2}$ -function.

Weierstrass (1872) and Hardy (1916) were interested in ${W_{\lambda,b}}$ because they are concrete examples of continuous but nowhere differentiable functions.

Remark 1 The graph of ${f^{\phi}_{\lambda,b}}$ tends to be a “fractal object” because ${f^{\phi}_{\lambda,b}}$ is self-similar in the sense that

$\displaystyle f^{\phi}_{\lambda, b}(x) = \phi(x) + \lambda f^{\phi}_{\lambda,b}(bx)$

We will come back to this point later.
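The self-similarity relation in Remark 1 also holds for partial sums: the sum of the first $N$ terms at $x$ equals $\phi(x)$ plus $\lambda$ times the sum of the first $N-1$ terms at $bx$ . A minimal numerical check, assuming $\phi(x)=\cos(2\pi x)$ and the illustrative parameters $\lambda=0.5$ , $b=3$ (the function name `weierstrass` is chosen for this sketch):

```python
import math


def weierstrass(x, lam, b, terms=40):
    """Partial sum of W_{lam,b}(x) = sum_n lam^n cos(2 pi b^n x)."""
    return sum(lam ** n * math.cos(2.0 * math.pi * b ** n * x) for n in range(terms))


lam, b, x = 0.5, 3, 0.3
lhs = weierstrass(x, lam, b, terms=40)
rhs = math.cos(2.0 * math.pi * x) + lam * weierstrass(b * x, lam, b, terms=39)
print(lhs - rhs)  # small, limited only by floating-point rounding

# For integer b the function is also 1-periodic.
print(weierstrass(x + 1.0, lam, b) - weierstrass(x, lam, b))
```

The differences are not exactly zero because the arguments $b^{n}x$ become large and lose precision in floating point; the geometric weights $\lambda^{n}$ keep the resulting error small.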

Remark 2 ${f^{\phi}_{\lambda,b}}$ is a ${C^{\alpha}}$ -function for all ${0\leq \alpha < \frac{-\log\lambda}{\log b}}$ . In fact, for all ${x,y\in[0,1]}$ , we have

$\displaystyle \frac{f^{\phi}_{\lambda, b}(x) - f^{\phi}_{\lambda,b}(y)}{|x-y|^{\alpha}} = \sum\limits_{n=0}^{\infty} \lambda^n b^{n\alpha} \left(\frac{\phi(b^n x) - \phi(b^n y)}{|b^n x - b^n y|^{\alpha}}\right),$

so that

$\displaystyle \frac{f^{\phi}_{\lambda, b}(x) - f^{\phi}_{\lambda,b}(y)}{|x-y|^{\alpha}} \leq \|\phi\|_{C^{\alpha}} \sum\limits_{n=0}^{\infty}(\lambda b^{\alpha})^n:=C(\phi,\alpha,\lambda,b) < \infty$

whenever ${\lambda b^{\alpha} < 1}$ , i.e., ${\alpha < -\log\lambda/\log b}$ .

The study of the graphs of ${W_{\lambda,b}}$ as fractal sets started with the work of Besicovitch-Ursell in 1937.

Remark 3 The Hausdorff dimension of the graph of a ${C^{\alpha}}$ -function ${f:[0,1]\rightarrow\mathbb{R}}$ satisfies

$\displaystyle \textrm{dim}(\textrm{graph}(f))\leq 2 - \alpha$

Indeed, for each ${n\in\mathbb{N}}$ , the Hölder continuity condition

$\displaystyle |f(x)-f(y)|\leq C|x-y|^{\alpha}$

leads us to the “natural cover” of ${G=\textrm{graph}(f)}$ by the family ${(R_{j,n})_{j=1}^n}$ of rectangles given by

$\displaystyle R_{j,n}:=\left[\frac{j-1}{n}, \frac{j}{n}\right] \times \left[f(j/n)-\frac{C}{n^{\alpha}}, f(j/n)+\frac{C}{n^{\alpha}}\right]$

Nevertheless, a direct calculation with the family ${(R_{j,n})_{j=1}^n}$ does not give us an appropriate bound on ${\textrm{dim}(G)}$ . In fact, since ${\textrm{diam}(R_{j,n})\leq 4C/n^{\alpha}}$ for each ${j=1,\dots, n}$ , we have

$\displaystyle \sum\limits_{j=1}^n\textrm{diam}(R_{j,n})^d\leq n\left(\frac{4C}{n^{\alpha}}\right)^d = (4C)^{1/\alpha} < \infty$

for ${d=1/\alpha}$ . Because ${n\in\mathbb{N}}$ is arbitrary, we deduce that ${\textrm{dim}(G)\leq 1/\alpha}$ . Of course, this bound is certainly suboptimal for ${\alpha<1/2}$ (because we know that ${\textrm{dim}(G)\leq 2 < 1/\alpha}$ anyway).

Fortunately, we can refine the covering ${(R_{j,n})}$ by taking into account that each rectangle ${R_{j,n}}$ tends to be more vertical than horizontal (i.e., its height ${2C/n^{\alpha}}$ is usually larger than its width ${1/n}$ ). More precisely, we can divide each rectangle ${R_{j,n}}$ into ${\lfloor n^{1-\alpha}\rfloor}$ squares, say

$\displaystyle R_{j,n} = \bigcup\limits_{k=1}^{\lfloor n^{1-\alpha}\rfloor}Q_{j,n,k},$

such that every square ${Q_{j,n,k}}$ has diameter ${\leq 2C/n}$ . In this way, we obtain a covering ${(Q_{j,n,k})}$ of ${G}$ such that

$\displaystyle \sum\limits_{j=1}^n\sum\limits_{k=1}^{\lfloor n^{1-\alpha}\rfloor} \textrm{diam}(Q_{j,n,k})^d \leq n\cdot n^{1-\alpha}\cdot\left(\frac{2C}{n}\right)^d = (2C)^{2-\alpha}<\infty$

for ${d=2-\alpha}$ . Since ${n\in\mathbb{N}}$ is arbitrary, we conclude the desired bound

$\displaystyle \textrm{dim}(G)\leq 2-\alpha$

A long-standing conjecture about the fractal geometry of ${W_{\lambda,b}}$ is:

Conjecture (Mandelbrot 1977): The Hausdorff dimension of the graph of ${W_{\lambda,b}}$ is

$\displaystyle 1<\textrm{dim}(\textrm{graph}(W_{\lambda,b})) = 2 + \frac{\log\lambda}{\log b} < 2$

Remark 4 In view of remarks 2 and 3, the whole point of Mandelbrot’s conjecture is to establish the lower bound

$\displaystyle \textrm{dim}(\textrm{graph}(W_{\lambda,b})) \geq 2 + \frac{\log\lambda}{\log b}$

Remark 5 The analog of Mandelbrot’s conjecture for the box and packing dimensions is known to be true (see, e.g., these papers here and here).

In a recent paper (see here), Shen proved the following result:

Theorem 1 (Shen) For any ${b\geq 2}$ integer and for all ${1/b < \lambda < 1}$ , the Mandelbrot conjecture is true, i.e.,

$\displaystyle \textrm{dim}(\textrm{graph}(W_{\lambda,b})) = 2 + \frac{\log\lambda}{\log b}$

Remark 6 The techniques employed by Shen also allow him to show that given ${\phi:\mathbb{R}\rightarrow\mathbb{R}}$ a ${\mathbb{Z}}$ -periodic, non-constant, ${C^2}$ function, and given ${b\geq 2}$ integer, there exists ${K=K(\phi,b)>1}$ such that

$\displaystyle \textrm{dim}(\textrm{graph}(f^{\phi}_{\lambda,b})) = 2 + \frac{\log\lambda}{\log b}$

for all ${1/K < \lambda < 1}$ .

Remark 7 A previous important result towards Mandelbrot’s conjecture was obtained by Barański-Bárány-Romanowska (in 2014): they proved that for all ${b\geq 2}$ integer, there exists ${1/b < \lambda_b < 1}$ such that

$\displaystyle \textrm{dim}(\textrm{graph}(W_{\lambda,b})) = 2 + \frac{\log\lambda}{\log b}$

for all ${\lambda_b < \lambda < 1}$ .

The remainder of this post is dedicated to giving some ideas of Shen’s proof of Theorem 1 by discussing the particular case when ${1/b<\lambda<2/b}$ and ${b\in\mathbb{N}}$ is large.

2. Ledrappier’s dynamical approach

If ${b\geq 2}$ is an integer, then the self-similar function ${f^{\phi}_{\lambda,b}}$ (cf. Remark 1) is also ${\mathbb{Z}}$ -periodic, i.e., ${f^{\phi}_{\lambda,b}(x+1) = f^{\phi}_{\lambda,b}(x)}$ for all ${x\in\mathbb{R}}$ . In particular, if ${b\geq 2}$ is an integer, then ${\textrm{graph}(f^{\phi}_{\lambda,b})}$ is an invariant repeller for the endomorphism ${\Phi:\mathbb{R}/\mathbb{Z}\times\mathbb{R}\rightarrow \mathbb{R}/\mathbb{Z}\times\mathbb{R}}$ given by

$\displaystyle \Phi(x,y) = \left(bx\textrm{ mod }1, \frac{y-\phi(x)}{\lambda}\right)$
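To see why the graph behaves as a repeller for ${\Phi}$ , one can start slightly above the graph and apply ${\Phi}$ once: the vertical distance to the graph is multiplied by ${1/\lambda>1}$ . The sketch below assumes ${\phi(x)=\cos(2\pi x)}$ , the illustrative parameters ${\lambda=0.5}$ and ${b=3}$ , and a truncated series for ${f^{\phi}_{\lambda,b}}$ ; the names `phi`, `weierstrass` and `Phi` are chosen for this example.

```python
import math


def phi(x):
    return math.cos(2.0 * math.pi * x)


def weierstrass(x, lam, b, terms=40):
    """Truncated series for f(x) = sum_n lam^n phi(b^n x)."""
    return sum(lam ** n * phi(b ** n * x) for n in range(terms))


def Phi(x, y, lam, b):
    """One step of the endomorphism (x, y) -> (b x mod 1, (y - phi(x)) / lam)."""
    return (b * x) % 1.0, (y - phi(x)) / lam


lam, b = 0.5, 3
x0 = 0.3
offset = 1e-3  # initial vertical distance from the graph

x1, y1 = Phi(x0, weierstrass(x0, lam, b) + offset, lam, b)
gap = y1 - weierstrass(x1, lam, b)
print(gap, offset / lam)  # the two values should nearly agree
```

Points exactly on the graph are mapped to the graph, since ${f(bx) = (f(x)-\phi(x))/\lambda}$ by the self-similarity relation of Remark 1.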

This dynamical characterization of ${G = \textrm{graph}(f^{\phi}_{\lambda,b})}$ led Ledrappier to the following criterion for the validity of Mandelbrot’s conjecture when ${b\geq 2}$ is an integer.

Denote by ${\mathcal{A}}$ the alphabet ${\mathcal{A}=\{0,\dots,b-1\}}$ . The unstable manifolds of ${\Phi}$ through ${G}$ have slopes of the form

$\displaystyle (1,-\gamma \cdot s(x,u))$

where ${\frac{1}{b} < \gamma = \frac{1}{\lambda b} <1}$ , ${x\in\mathbb{R}}$ , ${u\in\mathcal{A}^{\mathbb{N}}}$ , and

$\displaystyle s(x,u):=\sum\limits_{n=0}^{\infty} \gamma^n \phi'\left(\frac{x + u_1 + u_2 b + \dots + u_n b^{n-1}}{b^n}\right)$

In this context, the push-forwards ${m_x := (u\mapsto s(x,u))_*\mathbb{P}}$ of the Bernoulli measure ${\mathbb{P}}$ on ${\mathcal{A}^{\mathbb{N}}}$ (induced by the discrete measure assigning weight ${1/b}$ to each letter of the alphabet ${\mathcal{A}}$ ) play the role of conditional measures along vertical fibers of the unique Sinai-Ruelle-Bowen (SRB) measure ${\theta}$ of the expanding endomorphism ${T:\mathbb{R}/\mathbb{Z}\times\mathbb{R} \rightarrow \mathbb{R}/\mathbb{Z}\times\mathbb{R}}$ ,

$\displaystyle T(x,y) = (bx\textrm{ mod }1, \gamma y + \psi(x)),$

where ${\gamma=1/\lambda b}$ and ${\psi(x)=\phi'(x)}$ . In plain terms, this means that

$\displaystyle \theta = \int_{\mathbb{R}/\mathbb{Z}} m_x \, d\textrm{Leb}(x) \ \ \ \ \ (1)$

where ${\theta}$ is the unique ${T}$ -invariant probability measure which is absolutely continuous along unstable manifolds (see Tsujii’s paper).

As was shown by Ledrappier in 1992, the fractal geometry of the conditional measures ${m_x}$ has important consequences for the fractal geometry of the graph ${G}$ :

Theorem 2 (Ledrappier) Suppose that for Lebesgue almost every ${x\in\mathbb{R}}$ the conditional measures ${m_x}$ have dimension ${\textrm{dim}(m_x)=1}$ , i.e.,

$\displaystyle \lim\limits_{r\rightarrow 0}\frac{\log m_x(B(z,r))}{\log r} = 1 \textrm{ for } m_x\textrm{-a.e. } z$

Then, the graph ${G=\textrm{graph}(f^{\phi}_{\lambda,b})}$ has Hausdorff dimension

$\displaystyle \textrm{dim}(G) = 2 + \frac{\log\lambda}{\log b}$

Remark 8 Very roughly speaking, the proof of Ledrappier’s theorem goes as follows. By Remark 4, it suffices to prove that ${\textrm{dim}(G)\geq 2 + \frac{\log\lambda}{\log b}}$ . By Frostman’s lemma, we need to construct a Borel measure ${\nu}$ supported on ${G}$ such that

$\displaystyle \underline{\textrm{dim}}(\nu) := \textrm{ ess }\inf \underline{d}(\nu,x) \geq 2 + \frac{\log\lambda}{\log b}$

where ${\underline{d}(\nu,x):=\liminf\limits_{r\rightarrow 0}\log \nu(B(x,r))/\log r}$ . Finally, the main point is that the assumptions in Ledrappier’s theorem allow one to prove that the measure ${\mu^{\phi}_{\lambda, b}}$ given by the lift to ${G}$ of the Lebesgue measure on ${[0,1]}$ via the map ${x\mapsto (x,f^{\phi}_{\lambda,b}(x))}$ satisfies

$\displaystyle \underline{\textrm{dim}}(\mu^{\phi}_{\lambda,b}) \geq 2 + \frac{\log\lambda}{\log b}$

An interesting consequence of Ledrappier’s theorem and equation (1) is the following criterion for Mandelbrot’s conjecture:

Corollary 3 If ${\theta}$ is absolutely continuous with respect to the Lebesgue measure ${\textrm{Leb}_{\mathbb{R}^2}}$ , then

$\displaystyle \textrm{dim}(G) = 2 + \frac{\log\lambda}{\log b}$

Proof: By (1), the absolute continuity of ${\theta}$ implies that ${m_x}$ is absolutely continuous with respect to ${\textrm{Leb}_{\mathbb{R}}}$ for Lebesgue almost every ${x\in\mathbb{R}}$ .

Since ${m_x\ll \textrm{Leb}_{\mathbb{R}}}$ for almost every ${x}$ implies that ${\textrm{dim}(m_x)=1}$ for almost every ${x}$ , the desired corollary now follows from Ledrappier’s theorem. $\Box$

3. Tsujii’s theorem

The relevance of Corollary 3 is explained by the fact that Tsujii found an explicit transversality condition implying the absolute continuity of ${\theta}$ .

More precisely, Tsujii firstly introduced the following definition:

Definition 4

Given ${\varepsilon>0}$ , ${\delta>0}$ and ${x_0\in\mathbb{R}/\mathbb{Z}}$ , we say that two infinite words ${u, v\in\mathcal{A}^{\mathbb{N}}}$ are ${(\varepsilon,\delta)}$ -transverse at ${x_0}$ if either

$\displaystyle |s(x_0,u)-s(x_0,v)|>\varepsilon$

or

$\displaystyle |s'(x_0,u)-s'(x_0,v)|>\delta$

Given ${q\in\mathbb{N}}$ , ${\varepsilon>0}$ , ${\delta>0}$ and ${x_0\in\mathbb{R}/\mathbb{Z}}$ , we say that two finite words ${k,l\in\mathcal{A}^q}$ are ${(\varepsilon,\delta)}$ -transverse at ${x_0}$ if ${ku}$ , ${lv}$ are ${(\varepsilon,\delta)}$ -transverse at ${x_0}$ for all pairs of infinite words ${u,v\in\mathcal{A}^{\mathbb{N}}}$ ; otherwise, we say that ${k}$ and ${l}$ are ${(\varepsilon,\delta)}$ -tangent at ${x_0}$ ;

${E(q,x_0;\varepsilon,\delta):= \{(k,l)\in\mathcal{A}^q\times\mathcal{A}^q: (k,l) \textrm{ is } (\varepsilon,\delta)\textrm{-tangent at } x_0\}}$

${E(q,x_0):=\bigcap\limits_{\varepsilon>0}\bigcap\limits_{\delta>0} E(q,x_0;\varepsilon,\delta)}$ ;

${e(q,x_0):=\max\limits_{k\in\mathcal{A}^q}\#\{l\in\mathcal{A}^q: (k,l)\in E(q,x_0)\}}$

${e(q):=\max\limits_{x_0\in\mathbb{R}/\mathbb{Z}} e(q,x_0)}$ .

Next, Tsujii proves the following result:

Theorem 5 (Tsujii) If there exists ${q\geq 1}$ integer such that ${e(q)<(\gamma b)^q}$ , then

$\displaystyle \theta\ll\textrm{Leb}_{\mathbb{R}^2}$

Remark 9 Intuitively, Tsujii’s theorem says the following. The transversality condition ${e(q)<(\gamma b)^q}$ implies that the majority of strong unstable manifolds ${\ell^{uu}}$ are mutually transverse, so that they almost fill a small neighborhood ${U}$ of some point ${x_0}$ (see the figure below extracted from this paper of Tsujii). Since the SRB measure ${\theta}$ is absolutely continuous along strong unstable manifolds, the fact that the ${\ell^{uu}}$ ‘s almost fill ${U}$ implies that ${\theta}$ becomes “comparable” to the restriction of the Lebesgue measure ${\textrm{Leb}_{\mathbb{R}^2}}$ to ${U}$ .

Remark 10 In this setting, Barański-Bárány-Romanowska obtained their main result by showing that, for adequate choices of the parameters ${\lambda}$ and ${b}$ , one has ${e(1)=1}$ . Indeed, once we know that ${e(1)=1}$ , since ${1<\gamma b}$ , they can apply Tsujii’s theorem and Ledrappier’s theorem (or rather Corollary 3) to derive the validity of Mandelbrot’s conjecture for certain parameters ${\lambda}$ and ${b}$ .

For the sake of exposition, we will give just a flavor of the proof of Theorem 1 by sketching the derivation of the following result:

Proposition 6 Let ${\phi(x) = \cos(2\pi x)}$ . If ${1/2<\gamma=1/\lambda b <1}$ and ${b\in\mathbb{N}}$ is sufficiently large, then

$\displaystyle e(1)<\gamma b$

In particular, by Corollary 3 and Tsujii’s theorem, if ${1/2<\gamma=1/\lambda b <1}$ and ${b\in\mathbb{N}}$ is sufficiently large, then Mandelbrot’s conjecture is valid, i.e.,

$\displaystyle \textrm{dim}(\textrm{graph}(W_{\lambda,b})) = 2+\frac{\log\lambda}{\log b}$

Remark 11 The proof of Theorem 1 in full generality (i.e., for ${b\geq 2}$ integer and ${1/b<\lambda<1}$ ) requires the introduction of a modified version of Tsujii’s transversality condition: roughly speaking, Shen defines a function ${\sigma(q)\leq e(q)}$ (inspired from Peter-Paul inequality) and he proves

(a) a variant of Proposition 6: if ${b\geq 2}$ integer and ${1/b<\lambda<1}$ , then ${\sigma(q)<(\gamma b)^q}$ for some integer ${q}$ ;

(b) a variant of Tsujii’s theorem: if ${\sigma(q)<(\gamma b)^q}$ for some integer ${q}$ , then ${\theta\ll\textrm{Leb}_{\mathbb{R}^2}}$ .

See Sections 2, 3, 4 and 5 of Shen’s paper for more details.

We start the (sketch of) proof of Proposition 6 by recalling that the slopes of unstable manifolds are given by

$\displaystyle s(x,u):=-2\pi\sum\limits_{n=0}^{\infty} \gamma^n \sin\left(2\pi\frac{x + u_1 + u_2 b + \dots + u_n b^{n-1}}{b^n}\right)$

for ${x\in\mathbb{R}}$ , ${u\in\mathcal{A}^{\mathbb{N}}}$ , so that

$\displaystyle s'(x,u)=-4\pi^2\sum\limits_{n=0}^{\infty} \left(\frac{\gamma}{b}\right)^n \cos\left(2\pi\frac{x + u_1 + u_2 b + \dots + u_n b^{n-1}}{b^n}\right)$

Remark 12 Since ${\gamma/b < \gamma}$ , the series defining ${s'(x,u)}$ converges faster than the series defining ${s(x,u)}$ .

By studying the first term of the expansion of ${s(x,u)}$ and ${s'(x,u)}$ (while treating the remaining terms as a “small error term”), it is possible to show that if ${(k,l)\in E(1,x_0)}$ , then

$\displaystyle \left|\sin\left(2\pi\frac{x_0+k}{b}\right) - \sin\left(2\pi\frac{x_0+l}{b}\right)\right| \leq\frac{2\gamma}{1-\gamma} \ \ \ \ \ (2)$

and

$\displaystyle \left|\cos\left(2\pi\frac{x_0+k}{b}\right) - \cos\left(2\pi\frac{x_0+l}{b}\right)\right| \leq \frac{2\gamma}{b-\gamma} \ \ \ \ \ (3)$

(cf. Lemma 3.2 in Shen’s paper).

Using these estimates, we can find an upper bound for ${e(1)}$ as follows. Take ${x_0\in\mathbb{R}/\mathbb{Z}}$ with ${e(1)=e(1,x_0)}$ , and let ${k\in\mathcal{A}}$ be such that ${(k,l_1),\dots,(k,l_{e(1)})\in E(1,x_0)}$ are distinct elements listed in such a way that

$\displaystyle \sin(2\pi x_i)\leq \sin(2\pi x_{i+1})$

for all ${i=1,\dots,e(1)-1}$ , where ${x_i:=(x_0+l_i)/b}$ .

From (3) applied to the pairs ${(k,l_i)}$ and ${(k,l_{i+1})}$ , together with the triangle inequality, we see that

$\displaystyle \left|\cos\left(2\pi x_i\right) - \cos\left(2\pi x_{i+1}\right)\right| \leq \frac{4\gamma}{b-\gamma}$

for all ${i=1,\dots,e(1)-1}$ .

Since

$\displaystyle (\cos(2\pi x_i)-\cos(2\pi x_{i+1}))^2 + (\sin(2\pi x_i)-\sin(2\pi x_{i+1}))^2 = 4\sin^2(\pi(x_i-x_{i+1}))\geq 4\sin^2(\pi/b),$

it follows that

$\displaystyle |\sin(2\pi x_i)-\sin(2\pi x_{i+1})|\geq \sqrt{4\sin^2\left(\frac{\pi}{b}\right) - \left(\frac{4\gamma}{b-\gamma}\right)^2} \ \ \ \ \ (4)$

Now, we observe that

$\displaystyle \sqrt{4\sin^2\left(\frac{\pi}{b}\right) - \left(\frac{4\gamma}{b-\gamma}\right)^2} > \frac{4}{b} \ \ \ \ \ (5)$

for ${b}$ large enough. Indeed, this happens because

${2\sin(\frac{\pi}{b})\sim\frac{2\pi}{b}}$ and ${\frac{4\gamma}{b-\gamma}\sim\frac{4\gamma}{b}}$ as ${b\rightarrow\infty}$ , so that

${b\sqrt{4\sin^2(\frac{\pi}{b}) - (\frac{4\gamma}{b-\gamma})^2}\rightarrow \sqrt{4\pi^2-16\gamma^2} = 2\sqrt{\pi^2-4\gamma^2} > 2\sqrt{\pi^2-4} > 4}$ as ${b\rightarrow\infty}$ (here we used ${\gamma<1}$ and ${\pi^2>8}$ ).

By combining (4) and (5), we deduce that

$\displaystyle |\sin(2\pi x_i)-\sin(2\pi x_{i+1})| > 4/b$

for all ${i=1,\dots, e(1)-1}$ .

Since ${-1\leq\sin(2\pi x_1)\leq\sin(2\pi x_2)\leq\dots\leq\sin(2\pi x_{e(1)})\leq 1}$ , the previous estimate implies that

$\displaystyle \frac{4}{b}(e(1)-1)<\sum\limits_{i=1}^{e(1)-1}(\sin(2\pi x_{i+1}) - \sin(2\pi x_i)) = \sin(2\pi x_{e(1)}) - \sin(2\pi x_1)\leq 2,$

i.e.,

$\displaystyle e(1)<1+\frac{b}{2}$

Thus, it follows from our assumptions ( ${\gamma>1/2}$ , ${b}$ large) that

$\displaystyle e(1)<1+\frac{b}{2}<\gamma b$
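Both inequality (5) and the last step ${1+\frac{b}{2}<\gamma b}$ are statements about large ${b}$ , so it can help to evaluate them for concrete values. The following sketch does this with hypothetical helper names; the sample values of ${b}$ and ${\gamma}$ are chosen only for illustration.

```python
import math


def sine_gap_lower_bound(b, gamma):
    """Right-hand side of (4); returns 0.0 when the radicand is not positive."""
    radicand = 4.0 * math.sin(math.pi / b) ** 2 - (4.0 * gamma / (b - gamma)) ** 2
    return math.sqrt(radicand) if radicand > 0.0 else 0.0


def inequality_5_holds(b, gamma):
    """Checks whether the bound in (4) exceeds 4 / b, as claimed in (5)."""
    return sine_gap_lower_bound(b, gamma) > 4.0 / b


def final_step_holds(b, gamma):
    """Checks 1 + b/2 < gamma * b."""
    return 1.0 + b / 2.0 < gamma * b


for b in (3, 10, 100, 1000):
    print(b, inequality_5_holds(b, 0.75), final_step_holds(b, 0.75))
```

For small ${b}$ the radicand in (4) can be negative, which is one way to see that "${b}$ sufficiently large" is really needed in Proposition 6.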

This completes the (sketch of) proof of Proposition 6 (and our discussion of Shen’s talk).

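Psychology

Why Are the Vast Majority of People “Low-Quality Hard Workers”?

October 2, 2016 zr9558

A note up front: faced with the rapid changes of our time, have you ever felt anxious and helpless? Perhaps you have seen what your city looks like at four in the morning, or taken the last subway train home. That is not the scary part. The scary part is being dog-tired, feeling completely drained, and having nothing to show for it. Perhaps, like me, you have been busy and exhausted every day and yet remained nothing more than a “low-quality hard worker.” This article is my own reflection on what “real diligence” means, and I hope we get the chance to explore it together in depth.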

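(1) A blurry portrait of an ordinary “hard worker”

Office workers in a hurry

If you are diligent enough, you most likely live the way this era encourages: you love learning, embrace change, and ride the wave of rapid growth. Or at least that is what you believe:

First, you soak up a fair number of inspirational stories that lack any path to realization, you believe that heaven always rewards hard work, and even on the subway you make sure to carry a copy of “The Hard Thing About Hard Things” or “Rich Dad Poor Dad” to match your self-image;

Then, you are quite sensitive to trends: you never miss a day of the 罗辑思维 (Luoji Siwei) audio, you shuttle between internet startup bootcamps, you scan WeChat QR codes at the drop of a hat, and you believe you have thereby connected with all sorts of big names;

Of course, as a member of the rising middle class, you are supportive of travel too, and you have taken off on a whim more than once. Experiencing a different life makes for an artsy justification, but what follows is usually the real point: using a beauty camera to carefully collect evidence that you “live elsewhere,” displaying it selectively on WeChat Moments, and waiting expectantly for 32 likes.

But here is the question: after doing all of the above, will you get the results you want? Have you even thought seriously about results?

Yes, this is the crux of the matter: what we are discussing is not “the posture of diligence” but “the results that diligence brings.”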

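(2) Performing “diligence,” or actually getting things done?

Few people would reject the claim that “success comes from diligence.” Like most people with dreams, you may set out at four in the morning to face a whole day of bustle and running around; having finally finished the day's work, too busy to notice how drained you are, you rush for the last subway home. I believe you work this hard day after day simply because you want better results and to be a little closer to success. Regrettably, time is indifferent, treats all things as straw dogs, and slips away day and night. Before you know it half a year has passed, then another year is gone, and when you take stock of your gains you awkwardly discover the following:

1. The IELTS you had planned for was never properly prepared, so you had to skip the exam or take it cold, and you could not go abroad;

2. The public speaking and writing skills you always wanted to improve have made little progress, so that rare chance to speak in public slipped away;

3. Even the lovely scene you had been looking forward to, “after losing weight, confidently confessing to the girl of your dreams,” never happened, for reasons everyone presumably understands.

All of this is far from the ambition you had when you set your goals. It presses on your nerves until you become indignant: I put in so much time and did not get the expected return, which is so unfair!

Two presidential terms went by this fast? A bit awkward

In fact, I think “saying that time is unfair” is the greatest unfairness of all. To elaborate, time may be the one remarkable resource that crosses national boundaries, breaks class barriers, and ignores the difference between ancient and modern: it is allocated to every person without distinction. As for why the same amount of time yields very different results, there is a reason: people differ in how they perceive time and how efficiently they use it. Li Xiaolai analyzed this in detail in his book 《把时间当做朋友》 (“Befriending Time”): different levels of mental maturity make the same time resource produce entirely different results for different people.

Learn to perceive time and make friends with it

As I see it, this kind of strong mental capability is more of a strategic advantage; it helps your practice as a “thinking tool.” But take a Ferrari as an example: if a veteran driver wants to unleash its power, then besides “the car being new enough and the driver being seasoned enough,” one more thing is indispensable: there must be enough fuel in the tank.

So what I question has never been “whether diligence is useful.” Rather, I hold that “performing diligence” has no value, and that this seemingly diligent behavior is in essence protective coloring for a person's “lazy thinking.” To sum it up with a widely circulated saying: this is using tactical diligence to cover up strategic laziness. On the surface you are diligent, but in fact you deliberately avoid the parts that are truly difficult yet more valuable, and this “lazy thinking” will eventually turn you into the “low-quality hard worker” mentioned at the start of this article.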

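Actually, I am an actor

Let us go back to the scenes from the start of the article:

After listening to the 罗辑思维 audio, you impulsively ordered a lot of books but never read them. That is understandable: buying books is easy and reading them is much harder. Harder still, you never thought about which books you should read systematically to better solve your actual problems, or which books would help you most.

You spent a small fortune, several thousand yuan, to hear a talk by some big name of the moment, even though his ￥38 book lays out exactly the same ideas. This is also understandable: attending a talk is classy and effortless, and you might even get to chat up the big name, whereas burying your head in a book is less pleasant. As for whether you can actually connect with a big name, I think the only reliable criterion is whether you are a big enough name yourself; but lazy thinkers always find reasons to deceive themselves.

As for “traveling to experience life,” I fully agree with its value, but I think its finer qualities are still out of reach for lazy thinkers. I have asked many friends: what is the purpose of your travel? To my surprise, although the answers varied widely, hardly anyone could give an answer that satisfied even themselves. One woman had thought it through, though. She sees travel as “a chance to break away from routine daily life and experience another way of living.” Perhaps it is the sense of purpose that came from this careful thinking that makes her cherish every trip: booking round-trip flights and accommodation, choosing gear and luggage, keeping a travel journal to record her feelings, all of it carefully planned without exception. I can almost picture what a high-quality travel experience this proactive preparation gives her.

Lazy thinking: do you have it?

The behaviors above vary in value, but without exception you most likely chose the lower-value one. Let me state that although I use the pronoun “you,” I really also mean “I”; this is a tendency in how every one of us thinks. In fact, once we choose “lazy thinking,” we have chosen to be “low-quality hard workers,” and with it low-value behavior and the low-value results that follow.

At this point I have a preliminary conclusion: the “low level of cognition” brought about by “lazy thinking” is what produces “low-quality hard workers.” What still puzzles me is this: with so much effort spent and so much hardship endured but no extra reward enjoyed, the diligence of “low-quality hard workers” has, from an economic point of view, an extremely poor cost-benefit ratio and no investment value at all. So why do so many people, myself included, still throw themselves into it tirelessly? Perhaps a remark by Wang Xing of Meituan gives the secret away.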

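(3) Most people will do anything to avoid real thinking

The first time I saw this sentence I was stunned for quite a while, and I think if you see it for the first time and take it to heart, you will probably be shaken too. Its power lies in abandoning self-deception and mercilessly refusing any possibility of finding excuses. So what cannot be explained by economics finds its breakthrough in psychology. Humans are creatures that seek benefit and avoid harm. For the overwhelming majority of evolutionary history, humans were not given much deep thinking to do, for a simple reason: stress responses alone were enough to handle more than 95% of the problems of the past. What the genes never expected was that humanity would evolve this fast!

In fact, the psychologist Daniel Kahneman describes this brilliantly in his book “Thinking, Fast and Slow.” He argues that our brain has two ways of making decisions, fast and slow. The commonly used, unconscious “System 1” relies on emotion, memory, and experience to judge quickly; it is well-informed and fast, but easily fooled. The conscious “System 2” analyzes and solves problems by mobilizing attention and then decides; it is slower and less error-prone, but it is lazy and often takes shortcuts by adopting System 1's intuitive judgments.

For a very long time, this way of handling things (in a slowly changing environment, the genes adopt System 1's intuitive judgment) posed no problem. On the one hand, it produced decisions that were reliable with high probability in response to slow environmental change; on the other hand, lazily taking shortcuts let the genes save energy, which mattered enormously for early humans, for whom food was extremely costly to obtain. So when we ask why people “think lazily” or are not used to “deep thinking,” we are really pressuring the genes to reduce the energy allocated to “conditioned reflexes,” a life-saving tool, and to shift it toward “deep thinking,” a luxury. For genes that have traveled here from ancient times, this amounts to lowering the carrier's chance of survival. In short, deep thinking goes against human nature at the genetic level.

Genes do not encourage primitive humans to think deeply

Let us travel from ancient times to the present. As far as the eye can see, what is today's society like? Change, rapid change, extremely rapid change! In fact, change has long been a consensus we all tacitly share. One direct result of this ever-faster change is the exponential growth of information, which is evident even in how information is stored: from oracle bones, bamboo slips, parchment, and printed paper all the way to virtual storage that is in theory unlimited.

Against the flood of knowledge set off by this wild spread of information, each person's meager cognitive capacity seems insignificant. Cognitive capacity has replaced the stock of knowledge and information as the scarcer resource, building new barriers between people. If at this point we still obey our stubborn genes, keep thinking lazily, and avoid deep thinking because it goes against our nature, what will happen? I think the result is easy to predict: we will be unable to enjoy the benefits of growing knowledge and a changing environment; at best we will maintain our current quality of life, and we may even drown in the flood of information, anxious, helpless, and at a loss. As the saying goes, if you want results different from the past, you must do things differently from the past, and those differences must first show up at the level of cognition.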

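(4) Only “deep thinking” brings “cognitive upgrading,” which makes one a “high-quality hard worker”

On the subject of “deep thinking,” a passage attributed to Einstein left a particularly deep impression on me:

If I had one hour to solve a problem on which my life depended, I would spend 55 minutes working out exactly what the problem is asking. Once that is clear, the remaining 5 minutes are enough to answer it.

Imagination is more important than knowledge

Life and death are weighty matters indeed. By using the extreme framing of life and death, this passage stresses the importance of “deep thinking” and is quite persuasive. In fact, in real war, where life hangs by a thread, the standing of “deep thinking” is not weakened by the urgency and chaos; on the contrary, it is greatly raised by the brutal, bloody nature of the battlefield.

We all know Sun Tzu's “The Art of War.” In the chapter on military disposition (军形篇) of this book, hailed as the “sacred canon of military science,” there is this line: the victorious army first secures victory and then seeks battle; the defeated army first fights and then seeks victory. This means that before the two armies clash, one must prepare thoroughly: gather information through every channel, fully assess the current situation, rack one's brains questioning every hidden danger and possible problem on one's own side, then project and simulate in one's head how the war might unfold, and use existing resources to carefully plan solutions. Only when all this work is in place and both sides actually step onto the battlefield can one have everything clearly in mind and be confident, and only then does the battle have a chance of being won. So for the side that prepares carefully, most of the work of war is completed before the battle, in a deeply thinking mind; going to the battlefield is little more than a formality, and the scales of victory have already tipped.

Speaking of war, I have to mention another person, a military genius I personally admire. Lin Biao, whom Mao Zedong called “the incomparable ever-victorious marshal” and Chiang Kai-shek praised as “the most outstanding general from Whampoa,” was able to sweep all before him and win every battle not simply through “when foes meet on a narrow path, the brave one wins,” but through a superior understanding of the battlefield arrived at by “deep thinking.” There are many stories about Lin's campaigns; perhaps the most legendary is that “he used big data to capture Liao Yaoxiang alive.”

Lin Biao (center) in wartime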

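From the Liaoshen Campaign of 1948 on, Lin Biao spent every late night at the forward command post of the Northeast Field Army listening to situation reports: the duty staff officer read out the day's battle results and captures reported by radio from each subordinate column, division, and regiment, while Lin Biao carefully recorded the data. During one report on “the captures from the fight at Hujia Wopeng,” Lin Biao keenly sensed something unusual in a tiny shift in the numbers. Facing his baffled subordinates, he pinned down the key issue with three questions:

“Why is the ratio of pistols to rifles captured there slightly higher than in other engagements?”

“Why is the ratio of small vehicles to large vehicles captured and destroyed there slightly higher than in other engagements?”

“Why is the ratio of officers to soldiers captured and killed there slightly higher than in other engagements?”

Before anyone could react, Lin Biao strode to the wall covered with military maps, pointed at that spot on the map, and said: “I conclude that the enemy's command post is right here!” In fact, Lin Biao could be so certain precisely because of his high-quality diligence: refusing lazy thinking and insisting on deep thinking. Long-term recording and analysis of data had turned those dry numbers into a systematic database in his mind, so that once a deviation appeared he could spot the difference in time, infer accurate information, and find where the key value lay.

With the help of the intelligence Lin Biao had deduced, the command post of the New Sixth Army was soon wiped out in one stroke. Liao Yaoxiang, commander of the New Sixth Army, a famous general who had graduated from Whampoa, studied at France's renowned Saint-Cyr military academy, and fought in the Burma campaign, never expected his carefully concealed, crack field headquarters to be destroyed so quickly. He was unwilling to accept the defeat and considered it a fluke. But when he learned how Lin Biao had reached his judgment, he said: “I am convinced. There is no shame in losing to him.”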

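Besides valuing data, Lin Biao's diligence also showed in the particular importance he placed on investigation. As overall commander of the victory at Pingxingguan, “the first major victory won by the Eighth Route Army since it took the field,” he made three on-site surveys of the Qiaogou area at Pingxingguan before the battle:

The first time, he went with staff officers and a radio. He first went to the Pingxingguan pass and climbed the ridge on its north side, checking the mountains, ravines, villages, and roads east of Pingxingguan against the map. He then came down and followed the road through Xipaochi and Dongpaochi to Qiaogou and on to Donghenan, examining the terrain on both sides of the road through the gorge;

The second time, he went in disguise to scout. He focused on the terrain in front of Laoyemiao and the hilly ground south of Qiaogou, and a complete ambush plan essentially took shape in his mind;

The third time was after the mobilization meeting at Shangzhai, when Lin Biao and Nie Rongzhen took the brigade and regiment commanders to reconnoiter. On the spot they assigned ambush positions to each regiment and fixed the locations of the division, brigade, and regiment command posts.

Preparing for war has always been complicated. In the three days before the battle, combat simulations based on various scenarios never stopped, and that does not include the deployment of troops for the battle or the pre-battle mobilization of all cadres at company level and above across the division. Admittedly, many factors contribute to victory, but at the very least Lin Biao's diligent preparation grounded in “deep thinking” deserves much of the credit for the outcome at Pingxingguan.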

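I have a habit: if I find someone exceptional, I analyze how they think, and the best way to understand how someone thinks is to hear it in their own words. In his essay “How to Be a Good Division Commander,” Lin Biao set out nine points giving a detailed summary of his own working methods. The essay can be found online; it looks plain but packs a punch, and every point goes straight to the heart of the matter. In my view, few people could actually follow all nine, because it takes too much mental effort: at least four of them demand a great deal of energy for “deep thinking.” The fifth point reads as follows, and because it is such a classic I quote it in full:

5. Think every aspect of the problem through fully and thoroughly:

In organizing every campaign and battle, let everyone raise every problem that might arise and let everyone look for answers, starting from the worst and most serious case. Once every problem raised has been answered and no question is left unanswered, you will not make big mistakes when the fighting starts, and if you do make a mistake it will be easier to correct. A problem that has no answer yet must not be set aside just because you have thought about it for a long time without result, leaving a knot behind. That is very dangerous: at a critical moment the knot may well surface and leave you unsure and caught off guard. Of course, in wartime there are many problems to consider; they cannot all be raised at once, nor all answered at once. The whole course of a campaign or battle is a process of continually raising and answering questions. Sometimes the mind is tired and some problems cannot be answered right away. Then, besides talking it over properly with others, get a good sleep; once you have slept well, woken up, and your head is clear, lie in bed and think it over again, and things may click: you may figure it out, answer it, solve it. In short, no problem may be glossed over. Only when the questions have all been answered is the organization of the campaign or battle complete.

I must be honest here: when I first finished reading this essay by Lin Biao, I was actually quite nervous and broke out in a cold sweat. Never mind going into battle for real; just reading these nine points I found so many ways in which I fall short, which shows that “upgrading one's cognition” through “deep thinking” is truly not easy. At the same time I was relieved, even slightly pleased: at least I now know the method and am on the right track.

Having written this far, I have more or less sorted out my thinking: diligence is important, and it cannot be overemphasized; it is a necessary but not sufficient condition for excellent results. So how can it be made both necessary and sufficient? My answer: refuse lazy thinking, get used to deep thinking, and raise your level of cognition. As for how to think deeply, I feel everyone should try to come up with their own answer, and I will share my own views on this in future articles.

The article is finished, but for me this is only a beginning, because I am well aware that “refusing lazy thinking and getting used to deep thinking” is really a fight against human nature, and one must be prepared for a protracted war. I also hope that every truly diligent person can tear off the “low-quality” label and live a high-quality life worthy of the effort you put in.

By 布洛迪的后花园 (Jianshu author)
Original link: http://www.jianshu.com/p/e26f435b7b0a
Copyright belongs to the author. For reprints, please contact the author for authorization and credit the “Jianshu author.”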