ZHANG RONG

机器学习正在安全领域挂起一阵小旋风，但这里面有BUG

如今，安全领域是机器学习（Machine learning）正在大力进军的一个方向。

| 把机器学习应用到安全领域，老板们跃跃欲试

如果你亲自参加了 2016 RSA 大会，就会发现几乎没有哪家公司在说自家安全领域的产品时，不提及机器学习。这是为什么呢？

可能对外行人来说，机器学习就像一种魔法，能解决所有的安全问题：你把一堆未标识的数据统统塞进会机器学习的系统中，它就能分辨出连人类专家都分辨不出的数据规律，并且还可以学习新的行为指令和适应环境威胁。不仅如此，就连为规则加密也劳烦不到你，因为系统已经自动为你搞定这一切。

要真是像这样的话，那机器学习可真就是今年的重头戏了！但讽刺的是，每个人都兴师动众说要在这个领域搞出点名堂来，但真正理解什么是机器学习，或明白机器学习到底能用来做什么的人，却是凤毛麟角。可想而知，在这种大环境下机器学习大多是被滥用的，尤其在安全领域。

| 用机器学习有效解决安全问题，正确的方法是？

把机器学习应用到安全领域，大多会涉及到一种技术——异常检测（anomaly detection），它可以识别哪些部分和预期模式或数据集不匹配。但技术销售方要注意，这种技术只在某些条件下有效——不过显然，他们还不知道自己已经犯下错误：他们会告诉你，分析过你公司的网络流量后，就可以用机器学习揪出暗藏在网络中的黑客。但事实上，机器学习根本就做不到。这时候，你要立刻对这个销售商保持一丝怀疑。

那到底什么情况下才有效？答案是，只有为低维度的问题也配备上高质量的标识数据，这样的机器学习才是有效的。但很不幸，企业在实施过程并没有做到这一点。如果要检测新型的攻击方式，你得有很清晰并且经过标识的攻击案例。这就是说，如果没有透彻理解正常的网络行为，机器学习是不可能发现黑客的。再说，所有的黑客都很狡猾，他们一定会把自己伪装的天衣无缝。

| 机器学习和异常检测，用在哪里价值最大？

机器学习和异常检测真正有用的地方，在于它们能将人类行为分类。

事实证明，人类的预测能力非常强，他们也有能力建立非常精确的个体用户行为模型，让模型探测到异常情况。

其实，人们在这方面已小有成就，比如隐式认证（ Implicit Authentication）。隐式认证采用生物特征识别技术，基于击键力度、节奏和打字模式等技术对用户身份进行认证。不管是改善用户体验还是增强安全性，这个技术的优势都相当明显。最起码，它免除了用户记忆密码的负担和输入密码的麻烦。由于隐式认证所需元素大多是低维的，机器学习就只需处理少量几个参数，这也使得收集用户的高品质标识数据变得很方便。所以，即使有行为差异或信号干扰，机器学习还是能正确为计算机视觉进行图形搭配。同理，机器学习也能通过识别出个体的独特行为而进行身份验证，这当然也不在话下。

不过，它是怎么做到的呢？

其实，你走路、站立等所有动作，是由众多因素共同决定的，比如生理状况，年龄，性别，肌肉记忆等等。并且对个体来说，这些动作不会有太大改变。因此，不经意间，你口袋中的手机就通过内置传感器精确捕捉到了这些信息，并记录下来。而想要通过运动行为来识别一个人， 4 秒的运动信息就已足够。另外，通过对比用户的历史和当下的定位记录也可以进行身份识别。人们总是生活在各种各样的习惯当中，通过观察他们什么时候从哪出发，就能预测被测者到底是不是用户本人。

我们的手机和电脑上已有大量的传感器，以后随着可穿戴设备的普及和物联网的发展，传感器的数量更会暴增。用户大量的行为数据和环境数据就这样被收集起来，提供给机器学习，让它为用户建立个体模型，并找到各个因素之间的相互关系。

| 让机器学习进行安全防护，你需要做哪些功课？

想进行安全防护，就必须让你的系统提前知道都存在哪些威胁模型。

首先，也是最重要的事——收集数据。这些数据必须非常精确，才能用来训练系统，起到抵抗威胁的作用。不过身份认证系统要真是遭到攻击，你也不用过于担心。因为行为变化还是比较好检测的，系统很快就能识别出异常情况。比如，如果一个设备不小心被偷，那么这个设备被偷之后所记录的运动状态，地理位置和用法就会和之前的记录有明显不同。不过，系统是接受这种可能存在的异常情况的，这时候用户就需要在系统上以另外的方式确认身份，调整系统，以使假阳性最小化。而一旦我们在不同设备上连接起 4 个因素，那么隐式认证的假阳性就会低于 0.001% 。

这个世界上并没有哪一种机器学习真的神奇到能解决所有的安全问题。设计者想用机器学习创建一个有用的安全防卫产品，就需要对底层系统有深刻理解，并且承认很多问题并不适合用机器学习来解决。不过不用担心，那些处在浪潮之巅的科技公司会将这些问题一步步消灭掉。

机器学习正在安全领域酝酿着一股势不可挡的市场狂潮。

未来的网络安全，离不开机器学习

信息安全一直就是猫与老鼠的游戏。好家伙新建一堵墙，坏家伙便想方设法通过或绕过它。但最近，坏家伙们似乎越来越轻易地就可以通过这堵墙。要想阻止他们，我们的能力需要有一个巨大的提升，这可能意味着我们需要更广泛地使用机器学习技术。

这可能会惊到行业外的旁观者，但机器学习目前并没有广泛地影响到IT安全领域。安全专家认为，尽管信用卡欺诈侦查系统和网络设备制造商正在使用先进的分析方法，但实际上每个大型公司常见的自动化安全行动——比如检测个人电脑上的恶意软件或者识别网络中的恶意活动——大部分都要依靠人类适时地对这些行动进行代码编写和配置。

尽管机器学习技术在网络安全领域的应用已经有了广泛的学术研究，但我们现在才刚开始了解这项技术对安全工具的影响。一些创业公司（如Invincea, Cylance, Exabeam和Argyle Data）正在利用机器学习驱动安全工具，使得它们比目前主要的安全软件供应商提供的工具更快捷和精准。

用数据摧毁恶意软件

Invincea是美国弗吉尼亚州一家专门检测恶意软件和维护网络安全的公司。这家公司的首席研究工程师Josh Saxe认为，是时候摒弃上世纪90年代的基于特征码和文件哈希值的分析技术了。

Saxe说：「我了解到，一些反病毒公司已经涉足机器学习领域，但是他们赖以生存的仍然是特征码检测。他们基于文件哈希值或者模式匹配来检测恶意软件，这是人类研究员想出来的检测给定样品的分析技术。」

Invincea先进的恶意软件检测系统有一部分是基于 DARPA 的网络基因组项目。

他说：「他们在检测过去常见的恶意软件上很成功，但是他们并不擅长检测新的恶意软件，这也是当下网络犯罪大行其道的原因之一。即使你安装了杀毒系统，其他人还是能成功侵入你的电脑，因为特征码检测的方法根本不起作用。」

在Invincea，Saxe正带领团队用机器学习建立更完善的恶意软件检测系统。这个项目是DARPA网络基因组项目的一部分，主要是使用机器学习来摧毁检测到的恶意软件，包括反向还原恶意软件的运行方式、在代码中进行社交网络分析、使用机器学习系统快速摧毁自然网络环境中出现的恶意软件新样本。

「我们已经证明，我们开发的基于机器学习的方法比传统反病毒系统更有效。机器学习系统能够自动完成人类分析员所做的工作，甚至能做得更好。把机器学习系统与大量的训练数据结合，就能击败基于特征码的传统检测系统。」

Invincea采用深度学习方法来加快算法的训练。目前，Saxe有大约150万个良性或恶意软件样品用来训练算法，这些都在使用 Python 工具的GPU中进行。他希望，随着样本数据增加到3000万，机器学习系统的性能优势会有一个线性增长。

「我们拥有的训练数据越多，用来训练机器学习系统的恶意软件的数量越多，那机器学习系统在检测恶意软件上的性能优势就会越明显，」他说。

Saxe说Invincea目前的计划是在2016年的终端安全产品上加载更多基于深度学习的功能。具体来说，就是把这种能力添加到已经使用机器学习技术的终端安全产品Cynomix上。

恶意用户检测

机器学习还有助于IT安全的其他方面：检测恶意的内部用户和识别损坏的账户。

正如主要的反病毒产品依赖特征码来识别恶意软件一样，监测用户活动的工具也是倚赖特征码。基于特征码的检测方法在恶意软件检测上开始失效，同样的，它在检测用户活动领域的效果也不尽如人意。

「过去，企业的安全人员严重倚赖特征码方法——比如IP地址黑名单。」用户行为分析工具提供商Exabeam的首席数据科学家Derek Lin说到。

他说：「这种方法寻找的是已经发生的事情。基于特征码的方法存在的问题是，只有事件发生过后，他们才能看到留下的特征码。而现在，安全人员非常聚焦于检测没有特征码的恶意事件。」

Exabeam通过追踪用户的远程连接信息、设备、IP地址和凭证建立了一张用户活动图。

如今，精明的犯罪分子知道稍微改变一下他们的路径就能战胜特征码检测。所以，如果被侵入的检测系统中存有一个IP黑名单，网络犯罪分子可以通过在他处理下的大面积网域中不断来回跳动来打破这个IP黑名单。

Exabeam并没有固守昔日的防御策略，而是基于Gartner的UBA( User Behavior Analytics,用户行为分析)概念采取了主动出击的方法。UBA背后的思路是你没法事先知道机器或用户的好坏，所以先假设他们是恶意的，你的网络是缺乏抵抗力的，所以你时刻对每个人的行为进行监测和制作模型，从而找到恶意行为者。

这就是用到机器学习算法的地方。Lin和他的团队获取了多种多样的资源（如服务器日志、虚拟私人网络日志和VPN日志等），使用各种监督和非监督式机器学习算法来检测用户行为的异常模式。

Lin说：「以上都是描绘用户行为的画像，问题是这是如何做到的。对于网络上每个用户或实体，我们尝试建立一个正常的简略图——这里涉及到统计学分析。然后，我们在概念水平上寻找与正常值的偏差……我们使用基于行为的方法来寻找系统中的异常，让他们浮现出来，方便安全分析员查看。」

机器学习在安全领域的未来

「想一想我们经历过的几次主要的网络安全浪潮，网络犯罪分子正寻找有效地方法来打破安全系统，我们也要回以反击。机器学习会成为反击武器中的中流砥柱吗？答案是肯定的。」安全软件供应商Townsend Security创始人兼CEO Patrick Townsend说到。

他说：「现在我们正开始获得能够有效处理大量未结构化数据和检测模式的系统，我希望下一波网络安全浪潮中的产品是基于认知计算的。看看Watson，既然它可以赢得危险边缘（Jeopardy）游戏，那为什么它不可以用来广泛地分析和理解网络安全事件呢？我认为我们正处于用基于认知的计算来帮助处理安全问题的萌芽阶段。」

Invincea的Saxe希望可以成为弄潮儿。他说：「我并不惊讶该领域的公司没有抓住这次浪潮，生产出基于新的深度学习的算法。对机器学习的训练才刚实现不久。这在10年前是没法有效完成的。」

Machine learning and big data know it wasn’t you who just swiped your credit card

You’re sitting at home minding your own business when you get a call from your credit card’s fraud detection unit asking if you’ve just made a purchase at a department store in your city. It wasn’t you who bought expensive electronics using your credit card – in fact, it’s been in your pocket all afternoon. So how did the bank know to flag this single purchase as most likely fraudulent?

Credit card companies have a vested interest in identifying financial transactions that are illegitimate and criminal in nature. The stakes are high. According to the Federal Reserve Payments Study, Americans used credit cards to pay for 26.2 billion purchases in 2012. The estimated loss due to unauthorized transactions that year was US$6.1 billion. The federal Fair Credit Billing Act limits the maximum liability of a credit card owner to $50 for unauthorized transactions, leaving credit card companies on the hook for the balance. Obviously fraudulent payments can have a big effect on the companies’ bottom lines. The industry requires any vendors that process credit cards to go through security audits every year. But that doesn’t stop all fraud.

In the banking industry, measuring risk is critical. The overall goal is to figure out what’s fraudulent and what’s not as quickly as possible, before too much financial damage has been done. So how does it all work? And who’s winning in the arms race between the thieves and the financial institutions?

Gathering the troops

From the consumer perspective, fraud detection can seem magical. The process appears instantaneous, with no human beings in sight. This apparently seamless and instant action involves a number of sophisticated technologies in areas ranging from finance and economics to law to information sciences.

Of course, there are some relatively straightforward and simple detection mechanisms that don’t require advanced reasoning. For example, one good indicator of fraud can be an inability to provide the correct zip code affiliated with a credit card when it’s used at an unusual location. But fraudsters are adept at bypassing this kind of routine check – after all, finding out a victim’s zip code could be as simple as doing a Google search.

Traditionally, detecting fraud relied on data analysis techniques that required significant human involvement. An algorithm would flag suspicious cases to be closely reviewed ultimately by human investigators who may even have called the affected cardholders to ask if they’d actually made the charges. Nowadays the companies are dealing with a constant deluge of so many transactions that they need to rely on big data analytics for help. Emerging technologies such as machine learning and cloud computing are stepping up the detection game.

Learning what’s legit, what’s shady

Simply put, machine learning refers to self-improving algorithms, which are predefined processes conforming to specific rules, performed by a computer. A computer starts with a model and then trains it through trial and error. It can then make predictions such as the risks associated with a financial transaction.

A machine learning algorithm for fraud detection needs to be trained first by being fed the normal transaction data of lots and lots of cardholders. Transaction sequences are an example of this kind of training data. A person may typically pump gas one time a week, go grocery shopping every two weeks and so on. The algorithm learns that this is a normal transaction sequence.

After this fine-tuning process, credit card transactions are run through the algorithm, ideally in real time. It then produces a probability number indicating the possibility of a transaction being fraudulent (for instance, 97%). If the fraud detection system is configured to block any transactions whose score is above, say, 95%, this assessment could immediately trigger a card rejection at the point of sale.

The algorithm considers many factors to qualify a transaction as fraudulent: trustworthiness of the vendor, a cardholder’s purchasing behavior including time and location, IP addresses, etc. The more data points there are, the more accurate the decision becomes.

This process makes just-in-time or real-time fraud detection possible. No person can evaluate thousands of data points simultaneously and make a decision in a split second.

Here’s a typical scenario. When you go to a cashier to check out at the grocery store, you swipe your card. Transaction details such as time stamp, amount, merchant identifier and membership tenure go to the card issuer. These data are fed to the algorithm that’s learned your purchasing patterns. Does this particular transaction fit your behavioral profile, consisting of many historic purchasing scenarios and data points?

The algorithm knows right away if your card is being used at the restaurant you go to every Saturday morning – or at a gas station two time zones away at an odd time such as 3:00 a.m. It also checks if your transaction sequence is out of the ordinary. If the card is suddenly used for cash-advance services twice on the same day when the historic data show no such use, this behavior is going to up the fraud probability score. If the transaction’s fraud score is above a certain threshold, often after a quick human review, the algorithm will communicate with the point-of-sale system and ask it to reject the transaction. Online purchases go through the same process.

In this type of system, heavy human interventions are becoming a thing of the past. In fact, they could actually be in the way since the reaction time will be much longer if a human being is too heavily involved in the fraud-detection cycle. However, people can still play a role – either when validating a fraud or following up with a rejected transaction. When a card is being denied for multiple transactions, a person can call the cardholder before canceling the card permanently.

Computer detectives, in the cloud

The sheer number of financial transactions to process is overwhelming, truly, in the realm of big data. But machine learning thrives on mountains of data – more information actually increases the accuracy of the algorithm, helping to eliminate false positives. These can be triggered by suspicious transactions that are really legitimate (for instance, a card used at an unexpected location). Too many alerts are as bad as none at all.

It takes a lot of computing power to churn through this volume of data. For instance, PayPal processes more than 1.1 petabytes of data for 169 million customer accounts at any given moment. This abundance of data – one petabyte, for instance, is more than 200,000 DVDs’ worth – has a positive influence on the algorithms’ machine learning, but can also be a burden on an organization’s computing infrastructure.

Enter cloud computing. Off-site computing resources can play an important role here. Cloud computing is scalable and not limited by the company’s own computing power.

Fraud detection is an arms race between good guys and bad guys. At the moment, the good guys seem to be gaining ground, with emerging innovations in IT technologies such as chip and pin technologies, combined with encryption capabilities, machine learning, big data and, of course, cloud computing.

Fraudsters will surely continue trying to outwit the good guys and challenge the limits of the fraud detection system. Drastic changes in the payment paradigms themselves are another hurdle. Your phone is now capable of storing credit card information and can be used to make payments wirelessly – introducing new vulnerabilities. Luckily, the current generation of fraud detection technology is largely neutral to the payment system technologies.

当朋友圈更新多到看不完时，来看看Facebook是怎么优化信息流的

【编者按】本文是FREES互联网团队成员覃超与徐万鸿进行的一场 Ask Me Anything。徐是前 Facebook 新闻流排序组的资深工程师，在今年9月回国出任神州专车 CTO。本文中他们聊的是关于 Facebook 的 Growth Hacking 策略、反垃圾信息系统、信息流排序，以及为什么选择回国参与创业。雷锋网(公众号：雷锋网)做了不修改原意的编辑。

所谓新闻流排序（news feed ranking），指的是 Facebook 的一项看家本领：用户每天会收到两三千条新鲜事，却只会阅读前 50 至 100 条。利用机器学习将用户最想看的内容排到最前面，从而提高粘性和日活。

这固然是一篇着重技术的文章，所在公司 Facebook 更是世界上最大的互联网公司之一。但这并不妨碍创业者从中得到经验。利用 A/B 测试作为迭代方法，借助 Growth Hacking 的核心——数据来驱动开发，新员工的入职宣讲……这些做法都体现了这位社交之王不同维度的文化所在：精神层面注重实现梦想，统一目标；而这一目标下放到微观层面，就是对于数据的尊重。

Facebook利用Sigma 系统做了什么？

我第一次去Facebook工作的时候，当时专注于用户增长的 VP 负责宣讲。他说将来全球所有人都会使用 Facebook，这家公司将来会成为万亿美元的公司，这让我印象很深刻。公司的所有人都很兴奋，对设定的目标有非常大的信心。他们的工作使命感非常强，非常专注。

这是Facebook给我印象深刻的一件事。

在 Facebook 的 site-integrity （站点完整性）组工作了两年。当时 Facebook 有很多的垃圾私信、垃圾信息，就像人人、微博上有各种广告、垃圾链接。有些用户的账号被盗用了，会使用个人页面发送垃圾短信、广告、病毒，还有一些不受欢迎的朋友请求。我会处理所有类似这些涉及到影响用户体验的东西。

Facebook 使用了一个叫做 sigma 的系统来抵制这些垃圾信息。这个系统安装在 2000 多台机器上面，Facebook 用户做的任何事情，都会经过 sigma 系统分析处理，比如评论、链接、朋友请求，都会被这个系统进行判断，是正常行为、滥用行为还是有问题的行为。

利用 Sigma 系统，Facebook 会对垃圾信息进行过滤和清理。

举个例子说，比如发送朋友请求，Facebook 的系统会自动判断一下：如果这个人的朋友请求都被别人拒绝了，他再发送朋友请求是不会被批准的。如果一个人发送的朋友请求十个有九个都被拒绝了，那么他下一次的朋友请求就会被系统拒绝。

当然这个系统还有其他的判断信号。

它是一个机器学习系统，通过你之前发的朋友请求拒绝概率高低来判断你被拒绝的概率有多高。

如果这个比率很高，Facebook 会让你进行手机短信或其他方式认证，来验证是软件还是真人发送的，以此判断你是不是真的要发送朋友请求，比如你发出的朋友请求对象与你没有任何共同好友，那就可能是一个不合理的请求。

基本上，你在 Facebook 上做的任何事情，都会经过这个系统来分析、预测、决定是否允许你发出信息，借此希望会减少生态圈中的骚扰行为。当时 Facebook 每天有上百亿次的信息发生要通过这个系统进行判断。

机器学习是Sigma 系统的核心

Sigma 系统中有些是人为规则也有机器算法，请求通过和拒绝就是一个迅捷数据组（Scrum）。任务通过，则说明这个任务是一个对机器学习来说的正样本，被拒绝则是一个负样本，很像 0 和 1。

比如发送朋友请求如果被接受，y 值是 1，如果被拒绝就是 0。如果是评论和点赞，系统就能寻找 y 值，用户发送的不当信息就会被删除。

而机器学习是整个 Sigma 系统的核心。

另外一个方法是通过一些异常行为的分析、数据挖掘的方法来分析用户的异常行为。

比如一个人发的同样类型评论非常多，所有评论里都有一个相似链接，这就非常有问题。正常操作不会在不同人的主页上留同样的评论，这显然属于异常行为，我们不会允许。

新闻流是Facebook最重要的产品

我工作两年之后选择去了这个组。

“排序” 指的是信息流的顺序。它决定了打开你的 Facebook 朋友圈，你的信息流是个什么样子，信息的位置。每个人产生的内容、新闻会有两三千个，用户只能看到 50-100 个。你需要把两三千个最好地展示出来。有些我们不给用户显示，比如你喜欢游戏，你的朋友不喜欢。

我 2012 年刚去的时候，新闻流排序组只有五六个人，尽管这可能是公司最大的机器学习系统，最核心的产品。每天有十亿多人上线，每个用户花 40 分钟在 Facebook 上，其中一半时间都花在新闻流上。Facebook 大部分收入来自新闻流广告。比如说，移动广告收入占所有广告的 70%，而其中所有的移动的广告都来自新闻流。不管是从用户的停留时间，还是收入来说，新闻流都是最重要的产品。

新闻流是 Facebook 最重要的产品，直接决定了用户所看到的内容。

做好新闻流排序是很难的问题，因为用户在新闻流上的行为有很多种，不只是传统广告点击或者不点击这一种操作，用户可以在新闻流里赞、评论、分享或者隐藏这个新闻流，也可以播放视频。我需要理解用户喜欢什么东西，评论、分享什么东西，想看什么样的视频。理解用户的兴趣所在，根据我们的讯息把最好的东西放在新闻流的最前面。

以国内的社交媒体作对比来说，微信的朋友圈是所有内容全部显示，它不需要排序，是因为朋友圈容量不是特别多，大家可以看完所有的内容。朋友越来越多的话，没有时间把分享看完，排序是必然的事情。你会很容易漏掉很重要人的图片，它们迅速埋没在大部分你不感兴趣的内容了。

Facebook 之前也是全部显示，慢慢到后来用户是看不完所有的信息的。如果不做排序，把最好的服务挑出来的话，用户不会愿意访问新闻流，因为他看到很多不感兴趣的东西，感兴趣的部分他已经没有时间找出来了。从不排序到排序是必然的过程，你的朋友越来越多，公众页面越来越多，排序是必然的。

比如说新浪微博没有做排序，有些细节杂乱无章，他们测试过，但是做得不太好。所以放弃了。微信的朋友圈也会到要做排序的阶段。Facebook 不只是排序，还会隐藏用户不感兴趣的内容，比如你的朋友玩过 Candy Crush 游戏，但可能你本身不玩任何游戏，关于这方面的信息就没有意义。Facebook 就不会给你显示这些内容——“朋友们在玩什么游戏”。

社交媒体的碎片化已成事实。只有采取更好的排序手段，推送给用户更精准的内容，才能提高平台停留时间，加强粘性。

新闻流排序的工作原理是什么？

基本上，新闻流是从两三千条内容里面，挑出了四五十个。按照每个内容打分，分高的内容排在最前面。每个内容、照片、分享或者状态，我们会预测一些概率值，比如你点赞的概率，评论、分享的概率。每个用户的行为，比如点赞、分享、评论，系统都会给权值。评这些用户行为概率是通过机器学习来系统计算的。如果用户对某个内容点赞、评论或者分享，说明用户愿意看到这个内容，对内容产生了反馈。

举个例子来说，比如你是我的好友，你上传了 100 张照片，我点赞了 20 次，那么点赞概率就是 20%。我们知道每个用户以前对哪些内容点赞、评论，这些都是我们的训练样本。我们通过学习用户的历史行为，进行相同类型、相同个人的未来行为预测，因为用户短期行为不会大幅变化，过去对哪些东西进行评论，将来也很有可能对相似内容进行评论。

对用户内容的预测

很多人关心，是否可以针对用户内容来进行预测？比如分析用户发了什么样的文字或者图片？这是可以的。如果是图片我们可以抽取图片特点，对图片进行模式识别，分析图片的主题，打上相应的标签，用机器来识别这些图片。现在在做相应的工作。Facebook 有 AI 实验室，可以对图片进行内容识别。

那么，Facebook 该如何检测这套算法的有效性呢？该如何进行更新迭代？

其实，这可以通过 A/B 测试来实现。我们会抽取 1% 用户进行新的算法，1%进行旧的算法。如果新的算法下用户每天点赞、评论或者分享次数增长了，那说明新的算法更好。我们就把新的算法发布给所有的用户。我们主要的核心目标是：让日活跃用户更多，停留时间更长，访问 Facebook 更频繁。

A/B 测试是很好的迭代方法。建立起核心指标，进行 A/B 测试，看新的改动能否提高核心指标，提高就发布，没有提高就不用发布。这很像 Growth hacking，当然最终目的还是提高 DAU。如果用户喜欢你的新闻流，就会更频繁访问，最终目的还是在线时长和日活跃用户。

A/B 测试是 Facebook 用来测试迭代可行性的手段，目前峰瑞资本所投资的吆喝科技，想让初创企业也能使用到这一技术。

“我已经没法看完所有朋友圈的内容”

我已经没法看完所有朋友圈内容了。一种改进方法是排序，把最好的内容放最前面，通过你以前点赞的内容，来学习你关心的内容，比如你女朋友发的东西你都会点赞。另外一种改进方法叫做 “内容置顶”（Story bumped）。有时候我早上起来刷微信，会看不完，只看了一小部分。过一会儿再刷的时候，已经没有什么新的内容了。

Facebook 的内容置顶功能会把你没有看完的东西再放到上面去再次推送给你。

微信是知道哪些内容你没看过的，我有很多在美国的朋友，朋友圈会有很多内容，上班前看不完只看了一部分。再刷新的时候就已经没有新的东西出来了，我也没有关心没看完的东西，朋友发的照片。Facebook 的 “内容置顶” 把很重要的、还没看的、有点旧的内容放在朋友圈前面，让你再看一眼，怕你漏掉重要的内容。

在九月份的时候我加入神州专车担任 CTO，从事业角度来说，我希望把从 Facebook 学到的公司文化、技术带回中国。中国在计算机行业上有很大的潜力。现在国内的产品质量上和美国产品已经相当了，比如微信，Facebook 的产品经理也学习了微信里面的功能。再往后面看几年的话，中国有机会赶上美国。

计算机学科已经成熟，创造力在慢慢变好。很多初创企业尝试不同的想法，中国的创业者是美国的好多倍，都在尝试不同的想法，会诞生出成功的公司。技术上，中国正在逼近美国，甚至会超越美国。长远来看，中国的计算机行业、互联网行业，应该是有潜力成为世界上互联网行业最好的国家。

Fighting spam with Haskell

One of our weapons in the fight against spam, malware, and other abuse on Facebook is a system called Sigma. Its job is to proactively identify malicious actions on Facebook, such as spam, phishing attacks, posting links to malware, etc. Bad content detected by Sigma is removed automatically so that it doesn’t show up in your News Feed.

We recently completed a two-year-long major redesign of Sigma, which involved replacing the in-house FXL language previously used to program Sigma with Haskell. The Haskell-powered Sigma now runs in production, serving more than one million requests per second.

Haskell isn’t a common choice for large production systems like Sigma, and in this post, we’ll explain some of the thinking that led to that decision. We also wanted to share the experiences and lessons we learned along the way. We made several improvements to GHC (the Haskell compiler) and fed them back upstream, and we were able to achieve better performance from Haskell compared with the previous implementation.

How does Sigma work?

Sigma is a rule engine, which means it runs a set of rules, called policies. Every interaction on Facebook — from posting a status update to clicking “like” — results in Sigma evaluating a set of policies specific to that type of interaction. These policies make it possible for us to identify and block malicious interactions before they affect people on Facebook.

Policies are continuously deployed. At all times, the source code in the repository is the code running in Sigma, allowing us to move quickly to deploy policies in response to new abuses. This also means that safety in the language we write policies in is important. We don’t allow code to be checked into the repository unless it is type-correct.

Louis Brandy of Facebook’s Site Integrity team discusses scalable spam fighting and the anti-abuse structure at Facebook and Instagram in a 2014 @Scale talk.

Why Haskell?

The original language we designed for writing policies, FXL, was not ideal for expressing the growing scale and complexity of Facebook policies. It lacked certain abstraction facilities, such as user-defined data types and modules, and its implementation, based on an interpreter, was slower than we wanted. We wanted the performance and expressivity of a fully fledged programming language. Thus, we decided to migrate to an existing language rather than try to improve FXL.

The following features were at the top of our list when we were choosing a replacement:

1. Purely functional and strongly typed. This ensures that policies can’t inadvertently interact with each other, they can’t crash Sigma, and they are easy to test in isolation. Strong types help eliminate many bugs before putting policies into production.

2. Automatically batch and overlap data fetches. Policies typically fetch data from other systems at Facebook, so we want to employ concurrency wherever possible for efficiency. We want concurrency to be implicit, so that engineers writing policies can concentrate on fighting spam and not worry about concurrency. Implicit concurrency also prevents the code from being cluttered with efficiency-related details that would obscure the functionality, and make the code harder to understand and modify.

3. Push code changes to production in minutes. This enables us to deploy new or updated policies quickly.

4. Performance. FXL’s slower performance meant that we were writing anything performance-critical in C++ and putting it in Sigma itself. This had a number of drawbacks, particularly the time required to roll out changes.

5. Support for interactive development. Developers working on policies want to be able to experiment and test their code interactively, and to see the results immediately.

Haskell measures up quite well: It is a purely functional and strongly typed language, and it has a mature optimizing compiler and an interactive environment (GHCi). It also has all the abstraction facilities we would need, it has a rich set of libraries available, and it’s backed by an active developer community.

That left us with two features from our list to address: (1) automatic batching and concurrency, and (2) hot-swapping of compiled code.

Automatic batching and concurrency: The Haxl framework

All the existing concurrency abstractions in Haskell are explicit, meaning that the user needs to say which things should happen concurrently. For data-fetching, which can be considered a purely functional operation, we wanted a programming model in which the system just exploits whatever concurrency is available, without the programmer having to use explicit concurrency constructs. We developed the Haxl framework to address this issue: Haxl enables multiple data-fetching operations to be automatically batched and executed concurrently.

We discussed Haxl in an earlier blog post, and we published a paper on Haxl at the ICFP 2014 conference. Haxl is open source and available on GitHub.

In addition to the Haxl framework, we needed help from the Haskell compiler in the form of theApplicative do-notation. This allows programmers to write sequences of statements that the compiler automatically rearranges to exploit concurrency. We also designed and implemented Applicative do-notation in GHC.

Hot-swapping of compiled code

Every time someone checks new code into the repository of policies, we want to have that code running on every machine in the Sigma fleet as quickly as possible. Haskell is a compiled language, so that involves compiling the code and distributing the new compiled code to all the machines running Sigma.

We want to update the compiled rules in a running Sigma process on the fly, while it is serving requests. Changing the code of a running program is a tricky problem in general, and it has been the subject of a great deal of research in the academic community. In our case, fortunately, the problem is simpler: Requests to Sigma are short-lived, so we don’t need to switch a running request to new code. We can serve new requests on the new code and let the existing requests finish before we discard the old code. We’re careful to ensure that we don’t change any code associated with persistent state in Sigma.

Loading and unloading code currently uses GHC’s built-in runtime linker, although in principle, we could use the system dynamic linker. To unload the old version of the code, the garbage collector gets involved. The garbage collector detects when old code is no longer being used by a running request, so we know when it is safe to unload it from the running process.

How Haskell fits in

Haskell is sandwiched between two layers of C++ in Sigma. At the top, we use the C++ thrift server. In principle, Haskell can act as a thrift server, but the C++ thrift server is more mature and performant. It also supports more features. Furthermore, it can work seamlessly with the Haskell layers below because we can call into Haskell from C++. For these reasons, it made sense to use C++ for the server layer.

At the lowest layer, we have existing C++ client code for talking to other internal services. Rather than rewrite this code in Haskell, which would duplicate the functionality and create an additional maintenance burden, we wrapped each C++ client in a Haxl data source using Haskell’s Foreign Function Interface (FFI) so we could use it from Haskell.

Haskell’s FFI is designed to call C rather than C++, so calling C++ requires an intermediate C layer. In most cases, we were able to avoid the intermediate C layer by using a compile-time tool that demangles C++ function names so they can be called directly from Haskell.

Performance

Perhaps the biggest question here is “Does it run fast enough?” Requests to Sigma result from users performing actions on Facebook, such as sending a message on Messenger, and Sigma must respond before the action can take place. So we wanted to serve requests fast enough to avoid interruptions to the user experience.

The graph below shows the relative throughput performance between FXL and Haskell for the 25 most common types of requests served by Sigma (these requests account for approximately 95 percent of Sigma’s typical workload).

Haskell performs as much as three times faster than FXL for certain requests. On a typical workload mix, we measured a 20 percent to 30 percent improvement in overall throughput, meaning we can serve 20 percent to 30 percent more traffic with the same hardware. We believe additional improvements are possible through performance analysis, tuning, and optimizing the GHC runtime for our workload.

Achieving this level of performance required a lot of hard work, profiling the Haskell code, and identifying and resolving performance bottlenecks.

Here are a few specific things we did:

We implemented automatic memoization of top-level computations using a source-to-source translator. This is particularly beneficial in our use-case where multiple policies can refer to the same shared value, and we want to compute it only once. Note, this is per-request memoization rather than global memoization, which lazy evaluation already provides.
We made a change to the way GHC manages the heap, to reduce the frequency of garbage collections on multicore machines. GHC’s default heap settings are frugal, so we also use a larger allocation area size of at least 64 MB per core.
Fetching remote data usually involves marshaling the data structure across the C++/Haskell boundary. If the whole data structure isn’t required, it is better to marshal only the pieces needed. Or better still, don’t fetch the whole thing — although that’s only possible if the remote service implements an appropriate API.
We uncovered a nasty performance bug in aeson, the Haskell JSON parsing library. Bryan O’Sullivan, the author of aeson, wrote a nice blog post about how he fixed it. It turns out that when you do things at Facebook scale, those one-in-a-million corner cases tend to crop up all the time.

Resource limits

In a latency-sensitive service, you don’t want a single request using a lot of resources and slowing down other requests on the same machine. In this case, the “resources” include everything on the machine that is shared by the running requests — CPU, memory, network bandwidth, and so on.

A request that uses a lot of resources is normally a bug that we want to fix. It does happen from time to time, often as a result of a condition that occurs in production that wasn’t encountered during testing — perhaps an innocuous operation provided with some unexpectedly large input data, or pathological performance of an algorithm on certain rare inputs, for example. When this happens, we want Sigma to terminate the affected request with an error (that will subsequently result in the bug being fixed) and continue without any impact on the performance of other requests being served.

To make this possible, we implemented allocation limits in GHC, which places a bound on the amount of memory a thread can allocate before it is terminated. Terminating a computation safely is a hard problem in general, but Haskell provides a safe way to abort a computation in the form of asynchronous exceptions. Asynchronous exceptions allow us to write most of most of our code ignoring the potential for summary termination and still have all the nice guarantees that we need in the event that the limit is hit, including safe releasing of resources, closing network connections, and so forth.

The following graph illustrates of how well allocation limits work in practice. It tracks the maximum live memory across various groups of machines in the Sigma fleet. When we enabled one request that had some resource-intensive outliers, we saw large spikes in the maximum live memory, which disappeared when we enabled allocation limits.

Enabling interactive development

Facebook engineers develop policies interactively, testing code against real data as they go. To enable this workflow in Haskell, we needed the GHCi environment to work with our full stack, including making requests to other back-end services from the command line.

To make this work, we had to make our build system link all the C++ dependencies of our code into a shared library that GHCi could load. We also customized the GHCi front end to implement some of our own commands and streamline the desired workflows. The result is an interactive environment in which developers can load their code from source in a few seconds and work on it with a fast turnaround time. They have the full set of APIs available and can test against real production data sources.

While GHCi isn’t as easy to customize as it could be, we’ve already made several improvements and contributed them upstream. We hope to make more improvements in the future.

Packages and build systems

In addition to GHC itself, we make use of a lot of open-source Haskell library code. Haskell has its own packaging and build system, Cabal, and the open-source packages are all hosted onHackage. The problem with this setup is that the pace of change on Hackage is fast, there are often breakages, and not all combinations of packages work well together. The system of version dependencies in Cabal relies too much on package authors getting it right, which is hard to ensure, and the tool support isn’t what it could be. We found that using packages directly from Hackage together with Facebook’s internal build tools meant adding or updating an existing package sometimes led to a yak-shaving exercise involving a cascade of updates to other packages, often with an element of trial and error to find the right version combinations.

As a result of this experience, we switched to Stackage as our source of packages. Stackage provides a set of package versions that are known to work together, freeing us from the problem of having to find the set by trial and error.

Did we find bugs in GHC?

Yes, most notably:

We fixed a bug in GHC’s garbage collector that was causing our Sigma processes to crash every few hours. The bug had gone undetected in GHC for several years.
We fixed a bug in GHC’s handling of finalizers that occasionally caused crashes during process shutdown.

Following these fixes, we haven’t seen any crashes in either the Haskell runtime or the Haskell code itself across our whole fleet.

What else?

At Facebook, we’re using Haskell at scale to fight spam and other types of abuse. We’ve found it to be reliable and performant in practice. Using the Haxl framework, our engineers working on spam fighting can focus on functionality rather than on performance, while the system can exploit the available concurrency automatically.

For more information on spam fighting at Facebook, check out our Protect the Graph page, or watch videos from our recent Spam Fighting @Scale event.

最大似然估计（Maximal Likelihood Estimation）

October 18, 2016 zr9558 Leave a comment

（一）基本思想

给定一个概率分布 $D$ ，假设其概率密度函数是 $f_{D}$ ，它与一个未知参数 $\theta$ 相关。我们可以从这个分布中抽取 $n$ 样本 $x_{1},x_{2},...,x_{n}$ ，我们就可以得到这个概率是

$P(x_{1},...,x_{n}) = f_{D}(x_{1},...,x_{n}|\theta)$ .

但是，在这里我们并不知道参数 $\theta$ 的值。如何估计参数 $\theta$ 的取值就成为了关键之处。一个简单的想法就是从这个分布中随机抽取样本 $x_{1},...,x_{n}$ ，然后利用这些数据来估算 $\theta$ 的值。

最大似然估计 (maximal likelihood estimator) 算法会计算参数 $\theta$ 的最可能的值，也就是说参数的选择会使得这个采样的概率最大化。

用数学的语言来说，首先我们需要定义似然函数：

$L(\theta) = f_{D}(x_{1},...,x_{n}|\theta)$ ,

并且在 $\theta$ 的所有取值上，使得这个函数的取值最大化。换言之，也就是函数 $L(\theta)$ 的一阶导数等于零。这个使得 $L(\theta)$ 最大化的参数 $\hat{\theta}$ 称为 $\theta$ 的最大似然估计。

Remark. 最大似然函数不一定是唯一的，甚至不一定是存在的。

（二）基本算法

求解最大似然函数估计值的一般步骤：

（1）定义似然函数；

（2）对似然函数求导数，或者说对似然函数的对数求导数，目的都是为了更加方便地计算一阶导数；

（3）令一阶导数等于零，得到关于参数 $\theta$ 的似然方程；

（4）求解似然方程，得到的参数就是最大似然估计。在求解的过程中，如果不能够直接求解方程的话，可以采取牛顿法来近似求解。

（三）例子

（i）Bernoulli 分布（Bernoulli Distribution）

假设我们有 $n$ 个随机样本 $x_{1},...,x_{n}$ . 如果第 $i$ 个学生没有自行车，那么 $x_{i}=0$ ; 否则 $x_{i}=1$ . 并且假设 $x_{i}$ 是满足未知参数 $p$ 的 Bernoulli 分布的。我们此时的目标是计算最大似然估计 $p$ ，也就是全体学生中拥有自行车的比例。

如果 $\{x_{i}:1\leq i\leq n\}$ 是相互独立的 Bernoulli 随机变量，那么对每一个 $x_{i}$ 而言，它的概率函数则是：

$f(x_{i};p)=p^{x_{i}}(1-p)^{1-x_{i}}, \text{ for } x_{i}=0 \text{ or } 1 \text{ and } 0<p<1$ .

因此，似然函数 $L(p)$ 可以定义为：

$L(p)=\prod_{i=1}^{n}f(x_{i};p)=p^{\sum_{i=1}^{n}x_{i}}(1-p)^{n-\sum_{i=1}^{n}x_{i}}$ .

为了计算参数 $p$ 的值，可以对 $ln(L(p))$ 求导：

$\ln(L(p))=(\sum_{i=1}^{n}x_{i})\ln(p) + (n-\sum_{i=1}^{n}x_{i})\ln(1-p)$

$\frac{\partial\ln(L(p))}{\partial p} = \frac{\sum_{i=1}^{n}x_{i}}{p}-\frac{n-\sum_{i=1}^{n}x_{i}}{1-p}$

令 $\frac{\partial\ln(L(p))}{\partial p} = 0$ ，可以得到

$p=\sum_{i=1}^{n}x_{i}/n$ .

也就是说最大似然估计是

$\hat{p}=\sum_{i=1}^{n}x_{i}/n$ .

（ii）Gaussian Distribution

假设 $x_{1},...,x_{n}$ 满足正态分布，并且该正态分布的参数 $\mu$ 和 $\sigma^{2}$ 都是未知的。目标是寻找均值 $\mu$ 和方差 $\sigma^{2}$ 的最大似然估计。

如果 $\{x_{1},...,x_{n}\}$ 是满足正态分布的，那么对于每一个变量 $x_{i}$ 的概率密度函数就是：

$f(x_{i};\mu,\sigma^{2}) = \frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{(x_{i}-\theta)^{2}}{2\sigma^{2}})$ .

似然函数就是：

$L(\mu,\sigma) = \prod_{i=1}^{n} f(x_{i};\mu,\sigma^{2}) = (2\pi)^{-n/2}\sigma^{-n}\exp(-\frac{\sum_{i=1}^{n}(x_{i}-\mu)^{2}}{2\sigma^{2}})$

$\ln(L(\mu,\sigma)) = -\frac{n}{2}\ln(2\pi) - n\ln(\sigma) - \frac{\sum_{i=1}^{n}(x_{i}-\mu)^{2}}{2\sigma^{2}}$

$\frac{\partial \ln(L(\mu,\sigma))}{\partial \mu} = \frac{\sum_{i=1}^{n}(x_{i}-\mu)}{\sigma^{2}}$

$\frac{\partial \ln(L(\mu,\sigma))}{\partial \sigma} = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n}(x_{i}-\mu)^{2}}{\sigma^{3}}$

令 $\frac{\partial \ln(L(\mu,\sigma))}{\partial \mu} =0$ 和 $\frac{\partial \ln(L(\mu,\sigma))}{\partial \sigma}=0$ ，可以求解方程组得到：

$\hat{\mu}= \sum_{i=1}^{n}x_{i}/n$ ,

$\hat{\sigma}^{2} = \sum_{i=1}^{n}(x_{i}-\mu)^{2}/n$ .

(iii) Weibull 分布（Weibull Distribution）

首先，我们回顾一下 Weibull 分布的定义。Weibull 分布（Weibull Distribution）是连续型的概率分布，其概率密度函数是：

$f(x;\lambda,k) = \frac{k}{\lambda}(\frac{x}{\lambda})^{k-1}\exp^{-(x/\lambda)^{k}} \text{ for } x\geq 0, f(x;\lambda,k)=0 \text{ for } x<0.$

其中， $x$ 是随机变量， $\lambda>0$ 是 scale parameter， $k>0$ 是 shape parameter。特别地，当 $k=1$ 时，Weibull 分布就是指数分布；当 $k=2$ 时，Weibull 分布就是 Rayleigh 分布。

Weibull 分布的累积分布函数是

$F(x;k,\lambda) = 1- \exp^{-(x/\lambda)^{k}} \text{ for } x\geq 0$ ,

$F(x;k,\lambda) = 0 \text{ for } x<0$ .

Weibull 分布的分位函数（quantile function, inverse cumulative distribution）是

$Q(p;k,\lambda) = \lambda(-\ln(1-p))^{1/k} \text{ for } 0\leq p <1$ .

其次，我们来计算最大似然估计。

假设 $\{x_{1},...,x_{n}\}$ 满足 Weibull 分布，其未知参数是 $k,\lambda.$ 那么对于每一个 $x_{i}$ 而言，概率密度函数是：

$p(x_{i};k,\lambda) = \frac{k}{\lambda}(\frac{x_{i}}{\lambda})^{k-1}\exp(-(\frac{x_{i}}{\lambda})^{k})$ .

定义似然函数为：

$L(k,\lambda) = \prod_{i=1}^{n}p(x_{i};k,\lambda)$

取对数之后得到：

$\ln(L(k,\lambda)) = n\ln(k) - nk\ln(\lambda) + (k-1)\sum_{i=1}^{n}\ln(x_{i}) - \sum_{i=1}^{n}x_{i}^{k}/\lambda^{k}.$

计算一阶偏导数得到：

$\frac{\partial \ln(L(k,\lambda))}{\partial \lambda} = - \frac{nk}{\lambda} + \frac{k\sum_{i=1}^{n}x_{i}^{k}}{\lambda^{k+1}},$

$\frac{\partial \ln(L(k,\lambda))}{\partial k} = \frac{n}{k} - n\ln(\lambda) + \sum_{i=1}^{n}\ln(x_{i}) -\sum_{i=1}^{n}(\frac{x_{i}}{\lambda})^{k}\ln(\frac{x_{i}}{\lambda}).$

可以计算得出：

$\lambda^{k}=\frac{\sum_{i=1}^{n}x_{i}^{k}}{n},$

$\frac{1}{k} = \frac{\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i})}{\sum_{i=1}^{n}x_{i}^{k}} -\frac{\sum_{i=1}^{n}\ln(x_{i})}{n}.$

其中第一个式子可以计算出 $\lambda$ 的最大似然估计。第二个式子是关于 $k$ 的隐函数，不能够直接求解，需要使用 Newton’s method 来计算。

令

$f(k) = \frac{\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i})}{\sum_{i=1}^{n}x_{i}^{k}} - \frac{\sum_{i=1}^{n}\ln(x_{i})}{n} - \frac{1}{k}.$

求导得到：

$f^{'}(k)= \frac{\sum_{i=1}^{n}x_{i}^{k}(\ln(x_{i}))^{2}}{\sum_{i=1}^{n}x_{i}^{k}}-(\frac{\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i})}{\sum_{i=1}^{n}x_{i}^{k}})^{2} + \frac{1}{k^{2}}.$

根据 Cauchy’s Inequality, 可以得到：

$(\sum_{i=1}^{n}x_{i}^{k}(\ln(x_{i}))^{2})\cdot(\sum_{i=1}^{n}x_{i}^{k})\geq (\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i}))^{2}.$

所以， $f^{'}(k)>0 \text{ for all } k>0$ ，换言之， $f(k)$ 是关于 $k$ 的递增函数，并且

$\lim_{k\rightarrow 0^{+}}f(k) = -\infty,$

$\lim_{k\rightarrow +\infty}f(k) > 0 \text{ if } \forall x_{i}>1.$

那么对于递增函数 $f(k)$ 而言，就必定有一个零点。因此使用 Newton’s Iteration 的时候，初始点可以从靠近零的整数开始，比如 $k_{0}=0.0001$ 。如果从一个比较大的数开始的时候，可能使用 Newton 法的时候，会与负轴相交。但是如果从一个较小的数开始，就必定只与正数轴相交。其中 Newton 法的公式是：

$k_{0}= 0.0001,$

$k_{n+1} = k_{n}- \frac{f(k_{n})}{f^{'}(k_{n})} \text{ for all } n\geq 0.$

当 $n$ 的次数足够大的时候， $k_{n}$ 就可以被当作最大似然估计。

How machine learning can help the security industry

October 11, 2016 zr9558 Leave a comment

Machine learning (ML) is such a hot area in security right now.

At the 2016 RSA Conference, you would be hard pressed to find a company that is not claiming to use ML for security. And why not? To the layperson, ML seems like the magic solution to all security problems. Take a bunch of unlabeled data, pump it through a system with some ML magic inside, and it can somehow identify patterns even human experts can’t find — all while learning and adapting to new behaviors and threats. Rather than having to code the rules, these systems can discover the rules all by themselves.

Oh, if only that were the case! ML is this year’s “big data”: Everyone is claiming to do it, but few actually do it right or even understand what it’s good for. Especially in security, I’ve seen more misapplications than appropriate ones.

Most applications of ML in security use a form of anomaly detection, which is used to spot events that do not match an expected pattern. Anomaly detection is a useful technique in certain circumstances, but too often, vendors misapply it. For example, they will claim to analyze network traffic in an enterprise and use ML to find hackers in your network. This does not work, and you should be immediately skeptical of the vendors who make this claim.

Effective machine learning requires a low dimensionality problem with high-quality labeled data. Unfortunately, deployments in real enterprises have neither. Detecting novel attacks requires either clear, labeled examples of attacks, which you do not have by definition, or a complete, exhaustive understanding of “normal” network behavior, which is impossible for any real network. And any sophisticated attacker will make an attack appear as seamless and “typical” as possible, to avoid setting off alarms.

Where does ML work?

One example where ML and anomaly detection can actually work well for security is in classifying human behavior. Humans, it turns out, are fairly predictable, and it is possible to build fairly accurate models of individual user behavior and detect when it doesn’t match their normal behavior.

We’ve had success in using ML for implicit authentication via analyzing a user’s biometrics, behavior, and environment. Implicit authentication is a technique that allows users to authenticate without performing any explicit actions like entering a password or swiping a fingerprint. This has clear benefits to both the user experience as well as for security. Users don’t need to be bothered with extra steps, we can use many authentication factors (rather than just one, a password), and it can happen continuously in the background.

Implicit authentication is well-suited to ML because most of the factors are low dimensional, meaning they involve a small number of parameters, and you can passively gather high-quality labeled data about user identities. Much like ML is effective in matching images for computer vision even in the presence of variance and noise, it is also effective in matching unique human behavioral aspects.

One example of this technology is how we can authenticate users based on unique aspects to the way they move. Attributes of the way you walk, sit, and stand are influenced by a large number of factors (including physiology, age, gender, and muscle memory), but are largely consistent for an individual. It is actually possible to accurately detect some of these attributes from the motion sensors in your phone in your pocket. In fact, after four seconds of motion data from a phone in your pocket, we can detect enough of these attributes to identify you. Another example is in using a user’s location history to authenticate them. Humans are creatures of habit, and by looking at where they came from and when, we can make an estimate of whether it’s them.

There are enough sensors in phones and computers (and more recently, wearables and IoT devices) that it is possible to passively pick up a large number of unique attributes about a user’s behavior and environment. We can then use ML to build a unique model for an individual user and find correlations between factors.

Threat models and anomaly detection

In any security system, it is important to understand the threat models you are trying to protect against. When using ML for security, you need to explicitly gather data, model the threats your system is protecting against, and use the model to train your system. Fortunately, for attacks against authentication, it is often possible to detect behavioral changes. For example, when a device is stolen, there are often clear changes in terms of its movement, location, and usage. And because false negatives are acceptable in that they just require the user to re-authenticate with a different method, we can tune the system to minimize false positives. In fact, once we combine four factors across multiple devices, we can get below a 0.001 percent false positive rate on implicit authentication.

There is no magic machine learning genie that can solve all your security problems. Building an effective security product that uses ML requires a deep understanding of the underlying system, and many security problems are just not appropriate for ML. For those that are, it’s a very powerful technique. And don’t worry, the companies on the hype train will soon move on to newer fads, like mobile self-driving AR blockchain drone marketplaces.

异常点检测算法综述

September 28, 2016 zr9558 Leave a comment

异常点检测（又称为离群点检测）是找出其行为很不同于预期对象的一个检测过程。这些对象被称为异常点或者离群点。异常点检测在很多实际的生产生活中都有着具体的应用，比如信用卡欺诈，工业损毁检测，图像检测等。

本文主要介绍一些常见的异常点检测算法，包括基于统计的模型，基于距离的模型，线性变换的模型，非线性变换的模型等。

异常点检测和聚类分析是两项高度相关的人物。聚类分析发现数据集中的各种模式，而异常点检测则是试图捕捉那些显著偏离多数模式的异常情况。异常点检测和聚类模型服务于不同的目的。

This slideshow requires JavaScript.

Complete Guide to Parameter Tuning in XGBoost (with codes in Python)

September 27, 2016 zr9558 Leave a comment

Introduction

If things don’t go your way in predictive modeling, use XGboost. XGBoost algorithm has become the ultimate weapon of many data scientist. It’s a highly sophisticated algorithm, powerful enough to deal with all sorts of irregularities of data.

Building a model using XGBoost is easy. But, improving the model using XGBoost is difficult (at least I struggled a lot). This algorithm uses multiple parameters. To improve the model, parameter tuning is must. It is very difficult to get answers to practical questions like – Which set of parameters you should tune ? What is the ideal value of these parameters to obtain optimal output ?

This article is best suited to people who are new to XGBoost. In this article, we’ll learn the art of parameter tuning along with some useful information about XGBoost. Also, we’ll practice this algorithm using a data set in Python.

Complete Guide to Parameter Tuning in XGBoost (with codes in Python)

What should you know ?

XGBoost (eXtreme Gradient Boosting) is an advanced implementation of gradient boosting algorithm. Since I covered Gradient Boosting Machine in detail in my previous article – Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python, I highly recommend going through that before reading further. It will help you bolster your understanding of boosting in general and parameter tuning for GBM.

Special Thanks: Personally, I would like to acknowledge the timeless support provided by Mr. Sudalai Rajkumar (aka SRK), currently AV Rank 2. This article wouldn’t be possible without his help. He is helping us guide thousands of data scientists. A big thanks to SRK!

The XGBoost Advantage
Understanding XGBoost Parameters
Tuning Parameters (with Example)

1. The XGBoost Advantage

I’ve always admired the boosting capabilities that this algorithm infuses in a predictive model. When I explored more about its performance and science behind its high accuracy, I discovered many advantages:

Regularization:
- Standard GBM implementation has no regularization like XGBoost, therefore it also helps to reduce overfitting.
- In fact, XGBoost is also known as ‘regularized boosting‘ technique.
Parallel Processing:
- XGBoost implements parallel processing and is blazingly faster as compared to GBM.
- But hang on, we know that boosting is sequential process so how can it be parallelized? We know that each tree can be built only after the previous one, so what stops us from making a tree using all cores? I hope you get where I’m coming from. Check this link out to explore further.
- XGBoost also supports implementation on Hadoop.
High Flexibility
- XGBoost allow users to define custom optimization objectives and evaluation criteria.
- This adds a whole new dimension to the model and there is no limit to what we can do.
Handling Missing Values
- XGBoost has an in-built routine to handle missing values.
- User is required to supply a different value than other observations and pass that as a parameter. XGBoost tries different things as it encounters a missing value on each node and learns which path to take for missing values in future.
Tree Pruning:
- A GBM would stop splitting a node when it encounters a negative loss in the split. Thus it is more of a greedy algorithm.
- XGBoost on the other hand make splits upto the max_depth specified and then start pruningthe tree backwards and remove splits beyond which there is no positive gain.
- Another advantage is that sometimes a split of negative loss say -2 may be followed by a split of positive loss +10. GBM would stop as it encounters -2. But XGBoost will go deeper and it will see a combined effect of +8 of the split and keep both.
Built-in Cross-Validation
- XGBoost allows user to run a cross-validation at each iteration of the boosting process and thus it is easy to get the exact optimum number of boosting iterations in a single run.
- This is unlike GBM where we have to run a grid-search and only a limited values can be tested.
Continue on Existing Model
- User can start training an XGBoost model from its last iteration of previous run. This can be of significant advantage in certain specific applications.
- GBM implementation of sklearn also has this feature so they are even on this point.

I hope now you understand the sheer power XGBoost algorithm. Note that these are the points which I could muster. You know a few more? Feel free to drop a comment below and I will update the list.

Did I whet your appetite ? Good. You can refer to following web-pages for a deeper understanding:

2. XGBoost Parameters

The overall parameters have been divided into 3 categories by XGBoost authors:

General Parameters: Guide the overall functioning
Booster Parameters: Guide the individual booster (tree/regression) at each step
Learning Task Parameters: Guide the optimization performed

I will give analogies to GBM here and highly recommend to read this article to learn from the very basics.

General Parameters

These define the overall functionality of XGBoost.

booster [default=gbtree]
- Select the type of model to run at each iteration. It has 2 options:
  - gbtree: tree-based models
  - gblinear: linear models
silent [default=0]:
- Silent mode is activated is set to 1, i.e. no running messages will be printed.
- It’s generally good to keep it 0 as the messages might help in understanding the model.
nthread [default to maximum number of threads available if not set]
- This is used for parallel processing and number of cores in the system should be entered
- If you wish to run on all cores, value should not be entered and algorithm will detect automatically

There are 2 more parameters which are set automatically by XGBoost and you need not worry about them. Lets move on to Booster parameters.

Booster Parameters

Though there are 2 types of boosters, I’ll consider only tree booster here because it always outperforms the linear booster and thus the later is rarely used.

eta [default=0.3]
- Analogous to learning rate in GBM
- Makes the model more robust by shrinking the weights on each step
- Typical final values to be used: 0.01-0.2
min_child_weight [default=1]
- Defines the minimum sum of weights of all observations required in a child.
- This is similar to min_child_leaf in GBM but not exactly. This refers to min “sum of weights” of observations while GBM has min “number of observations”.
- Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
- Too high values can lead to under-fitting hence, it should be tuned using CV.
max_depth [default=6]
- The maximum depth of a tree, same as GBM.
- Used to control over-fitting as higher depth will allow model to learn relations very specific to a particular sample.
- Should be tuned using CV.
- Typical values: 3-10
max_leaf_nodes
- The maximum number of terminal nodes or leaves in a tree.
- Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
- If this is defined, GBM will ignore max_depth.
gamma [default=0]
- A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
- Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.
max_delta_step [default=0]
- In maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help making the update step more conservative.
- Usually this parameter is not needed, but it might help in logistic regression when class is extremely imbalanced.
- This is generally not used but you can explore further if you wish.
subsample [default=1]
- Same as the subsample of GBM. Denotes the fraction of observations to be randomly samples for each tree.
- Lower values make the algorithm more conservative and prevents overfitting but too small values might lead to under-fitting.
- Typical values: 0.5-1
colsample_bytree [default=1]
- Similar to max_features in GBM. Denotes the fraction of columns to be randomly samples for each tree.
- Typical values: 0.5-1
colsample_bylevel [default=1]
- Denotes the subsample ratio of columns for each split, in each level.
- I don’t use this often because subsample and colsample_bytree will do the job for you. but you can explore further if you feel so.
lambda [default=1]
- L2 regularization term on weights (analogous to Ridge regression)
- This used to handle the regularization part of XGBoost. Though many data scientists don’t use it often, it should be explored to reduce overfitting.
alpha [default=0]
- L1 regularization term on weight (analogous to Lasso regression)
- Can be used in case of very high dimensionality so that the algorithm runs faster when implemented
scale_pos_weight [default=1]
- A value greater than 0 should be used in case of high class imbalance as it helps in faster convergence.

Learning Task Parameters

These parameters are used to define the optimization objective the metric to be calculated at each step.

objective [default=reg:linear]
- This defines the loss function to be minimized. Mostly used values are:
  - binary:logistic –logistic regression for binary classification, returns predicted probability (not class)
  - multi:softmax –multiclass classification using the softmax objective, returns predicted class (not probabilities)
    - you also need to set an additional num_class (number of classes) parameter defining the number of unique classes
  - multi:softprob –same as softmax, but returns predicted probability of each data point belonging to each class.
eval_metric [ default according to objective ]
- The metric to be used for validation data.
- The default values are rmse for regression and error for classification.
- Typical values are:
  - rmse – root mean square error
  - mae – mean absolute error
  - logloss – negative log-likelihood
  - error – Binary classification error rate (0.5 threshold)
  - merror – Multiclass classification error rate
  - mlogloss – Multiclass logloss
  - auc: Area under the curve
seed [default=0]
- The random number seed.
- Can be used for generating reproducible results and also for parameter tuning.

If you’ve been using Scikit-Learn till now, these parameter names might not look familiar. A good news is that xgboost module in python has an sklearn wrapper called XGBClassifier. It uses sklearn style naming convention. The parameters names which will change are:

eta –> learning_rate
lambda –> reg_lambda
alpha –> reg_alpha

You must be wondering that we have defined everything except something similar to the “n_estimators” parameter in GBM. Well this exists as a parameter in XGBClassifier. However, it has to be passed as “num_boosting_rounds” while calling the fit function in the standard xgboost implementation.

I recommend you to go through the following parts of xgboost guide to better understand the parameters and codes:

3. Parameter Tuning with Example

We will take the data set from Data Hackathon 3.x AV hackathon, same as that taken in the GBM article. The details of the problem can be found on the competition page. You can download the data set from here. I have performed the following steps:

City variable dropped because of too many categories
DOB converted to Age | DOB dropped
EMI_Loan_Submitted_Missing created which is 1 if EMI_Loan_Submitted was missing else 0 | Original variable EMI_Loan_Submitted dropped
EmployerName dropped because of too many categories
Existing_EMI imputed with 0 (median) since only 111 values were missing
Interest_Rate_Missing created which is 1 if Interest_Rate was missing else 0 | Original variable Interest_Rate dropped
Lead_Creation_Date dropped because made little intuitive impact on outcome
Loan_Amount_Applied, Loan_Tenure_Applied imputed with median values
Loan_Amount_Submitted_Missing created which is 1 if Loan_Amount_Submitted was missing else 0 | Original variable Loan_Amount_Submitted dropped
Loan_Tenure_Submitted_Missing created which is 1 if Loan_Tenure_Submitted was missing else 0 | Original variable Loan_Tenure_Submitted dropped
LoggedIn, Salary_Account dropped
Processing_Fee_Missing created which is 1 if Processing_Fee was missing else 0 | Original variable Processing_Fee dropped
Source – top 2 kept as is and all others combined into different category
Numerical and One-Hot-Coding performed

For those who have the original data from competition, you can check out these steps from the data_preparation iPython notebook in the repository.

Lets start by importing the required libraries and loading the data:

#Import libraries:
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import cross_validation, metrics   #Additional scklearn functions
from sklearn.grid_search import GridSearchCV   #Perforing grid search

import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4

train = pd.read_csv('train_modified.csv')
target = 'Disbursed'
IDcol = 'ID'

Note that I have imported 2 forms of XGBoost:

xgb – this is the direct xgboost library. I will use a specific function “cv” from this library
XGBClassifier – this is an sklearn wrapper for XGBoost. This allows us to use sklearn’s Grid Search with parallel processing in the same way we did for GBM

Before proceeding further, lets define a function which will help us create XGBoost models and perform cross-validation. The best part is that you can take this function as it is and use it later for your own models.

def modelfit(alg, dtrain, predictors,useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='auc', early_stopping_rounds=early_stopping_rounds, show_progress=False)
        alg.set_params(n_estimators=cvresult.shape[0])
    
    #Fit the algorithm on the data
    alg.fit(dtrain[predictors], dtrain['Disbursed'],eval_metric='auc')
        
    #Predict training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]
        
    #Print model report:
    print "\nModel Report"
    print "Accuracy : %.4g" % metrics.accuracy_score(dtrain['Disbursed'].values, dtrain_predictions)
    print "AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['Disbursed'], dtrain_predprob)
                    
    feat_imp = pd.Series(alg.booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')

This code is slightly different from what I used for GBM. The focus of this article is to cover the concepts and not coding. Please feel free to drop a note in the comments if you find any challenges in understanding any part of it. Note that xgboost’s sklearn wrapper doesn’t have a “feature_importances” metric but a get_fscore() function which does the same job.

General Approach for Parameter Tuning

We will use an approach similar to that of GBM here. The various steps to be performed are:

Choose a relatively high learning rate. Generally a learning rate of 0.1 works but somewhere between 0.05 to 0.3 should work for different problems. Determine the optimum number of trees for this learning rate. XGBoost has a very useful function called as “cv” which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required.
Tune tree-specific parameters ( max_depth, min_child_weight, gamma, subsample, colsample_bytree) for decided learning rate and number of trees. Note that we can choose different parameters to define a tree and I’ll take up an example here.
Tune regularization parameters (lambda, alpha) for xgboost which can help reduce model complexity and enhance performance.
Lower the learning rate and decide the optimal parameters .

Let us look at a more detailed step by step approach.

Step 1: Fix learning rate and number of estimators for tuning tree-based parameters

In order to decide on boosting parameters, we need to set some initial values of other parameters. Lets take the following values:

max_depth = 5 : This should be between 3-10. I’ve started with 5 but you can choose a different number as well. 4-6 can be good starting points.
min_child_weight = 1 : A smaller value is chosen because it is a highly imbalanced class problem and leaf nodes can have smaller size groups.
gamma = 0 : A smaller value like 0.1-0.2 can also be chosen for starting. This will anyways be tuned later.
subsample, colsample_bytree = 0.8 : This is a commonly used used start value. Typical values range between 0.5-0.9.
scale_pos_weight = 1: Because of high class imbalance.

Please note that all the above are just initial estimates and will be tuned later. Lets take the default learning rate of 0.1 here and check the optimum number of trees using cv function of xgboost. The function defined above will do it for us.

#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
xgb1 = XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
modelfit(xgb1, train, predictors)

As you can see that here we got 140 as the optimal estimators for 0.1 learning rate. Note that this value might be too high for you depending on the power of your system. In that case you can increase the learning rate and re-run the command to get the reduced number of estimators.

Note: You will see the test AUC as “AUC Score (Test)” in the outputs here. But this would not appear if you try to run the command on your system as the data is not made public. It’s provided here just for reference. The part of the code which generates this output has been removed here.

Step 2: Tune max_depth and min_child_weight

We tune these first as they will have the highest impact on model outcome. To start with, let’s set wider ranges and then we will perform another iteration for smaller ranges.

Important Note: I’ll be doing some heavy-duty grid searched in this section which can take 15-30 mins or even more time to run depending on your system. You can vary the number of values you are testing based on what your system can handle.

param_test1 = {
 'max_depth':range(3,10,2),
 'min_child_weight':range(1,6,2)
}
gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch1.fit(train[predictors],train[target])
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

Here, we have run 12 combinations with wider intervals between values. The ideal values are 5 for max_depth and 5 for min_child_weight. Lets go one step deeper and look for optimum values. We’ll search for values 1 above and below the optimum values because we took an interval of two.

param_test2 = {
 'max_depth':[4,5,6],
 'min_child_weight':[4,5,6]
}
gsearch2 = GridSearchCV(estimator = XGBClassifier( learning_rate=0.1, n_estimators=140, max_depth=5,
 min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test2, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch2.fit(train[predictors],train[target])
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_

Here, we get the optimum values as 4 for max_depth and 6 for min_child_weight. Also, we can see the CV score increasing slightly. Note that as the model performance increases, it becomes exponentially difficult to achieve even marginal gains in performance. You would have noticed that here we got 6 as optimum value for min_child_weight but we haven’t tried values more than 6. We can do that as follow:.

param_test2b = {
 'min_child_weight':[6,8,10,12]
}
gsearch2b = GridSearchCV(estimator = XGBClassifier( learning_rate=0.1, n_estimators=140, max_depth=4,
 min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test2b, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch2b.fit(train[predictors],train[target])

modelfit(gsearch3.best_estimator_, train, predictors)
gsearch2b.grid_scores_, gsearch2b.best_params_, gsearch2b.best_score_

We see 6 as the optimal value.

Step 3: Tune gamma

Now lets tune gamma value using the parameters already tuned above. Gamma can take various values but I’ll check for 5 values here. You can go into more precise values as.

param_test3 = {
 'gamma':[i/10.0 for i in range(0,5)]
}
gsearch3 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=4,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test3, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch3.fit(train[predictors],train[target])
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_

This shows that our original value of gamma, i.e. 0 is the optimum one. Before proceeding, a good idea would be to re-calibrate the number of boosting rounds for the updated parameters.

xgb2 = XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=4,
 min_child_weight=6,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
modelfit(xgb2, train, predictors)

Here, we can see the improvement in score. So the final parameters are:

max_depth: 4
min_child_weight: 6
gamma: 0

Step 4: Tune subsample and colsample_bytree

The next step would be try different subsample and colsample_bytree values. Lets do this in 2 stages as well and take values 0.6,0.7,0.8,0.9 for both to start with.

param_test4 = {
 'subsample':[i/10.0 for i in range(6,10)],
 'colsample_bytree':[i/10.0 for i in range(6,10)]
}
gsearch4 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test4, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch4.fit(train[predictors],train[target])
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_

Here, we found 0.8 as the optimum value for both subsample and colsample_bytree. Now we should try values in 0.05 interval around these.

param_test5 = {
 'subsample':[i/100.0 for i in range(75,90,5)],
 'colsample_bytree':[i/100.0 for i in range(75,90,5)]
}
gsearch5 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test5, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch5.fit(train[predictors],train[target])

Again we got the same values as before. Thus the optimum values are:

subsample: 0.8
colsample_bytree: 0.8

Step 5: Tuning Regularization Parameters

Next step is to apply regularization to reduce overfitting. Though many people don’t use this parameters much as gamma provides a substantial way of controlling complexity. But we should always try it. I’ll tune ‘reg_alpha’ value here and leave it upto you to try different values of ‘reg_lambda’.

param_test6 = {
 'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
 min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test6, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch6.fit(train[predictors],train[target])
gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_

We can see that the CV score is less than the previous case. But the values tried are very widespread, we should try values closer to the optimum here (0.01) to see if we get something better.

param_test7 = {
 'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05]
}
gsearch7 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
 min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test7, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch7.fit(train[predictors],train[target])
gsearch7.grid_scores_, gsearch7.best_params_, gsearch7.best_score_

You can see that we got a better CV. Now we can apply this regularization in the model and look at the impact:

xgb3 = XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=4,
 min_child_weight=6,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 reg_alpha=0.005,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
modelfit(xgb3, train, predictors)

Again we can see slight improvement in the score.

Step 6: Reducing Learning Rate

Lastly, we should lower the learning rate and add more trees. Lets use the cv function of XGBoost to do the job again.

xgb4 = XGBClassifier(
 learning_rate =0.01,
 n_estimators=5000,
 max_depth=4,
 min_child_weight=6,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 reg_alpha=0.005,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
modelfit(xgb4, train, predictors)

Now we can see a significant boost in performance and the effect of parameter tuning is clearer.

As we come to the end, I would like to share 2 key thoughts:

It is difficult to get a very big leap in performance by just using parameter tuning or slightly better models. The max score for GBM was 0.8487 while XGBoost gave 0.8494. This is a decent improvement but not something very substantial.
A significant jump can be obtained by other methods like feature engineering, creating ensemble of models, stacking, etc

You can also download the iPython notebook with all these model codes from my GitHub account. For codes in R, you can refer to this article.

End Notes

This article was based on developing a XGBoost model end-to-end. We started with discussing why XGBoost has superior performance over GBM which was followed by detailed discussion on thevarious parameters involved. We also defined a generic function which you can re-use for making models.

Finally, we discussed the general approach towards tackling a problem with XGBoost and also worked out the AV Data Hackathon 3.x problem through that approach.

I hope you found this useful and now you feel more confident to apply XGBoost in solving a data science problem. You can try this out in out upcoming hackathons.

Did you like this article? Would you like to share some other hacks which you implement while making XGBoost models? Please feel free to drop a note in the comments below and I’ll be glad to discuss.

You want to apply your analytical skills and test your potential? Then participate in our Hackathons and compete with Top Data Scientists from all over the world.

AI^2: Training a big data machine to defend

September 22, 2016 zr9558 Leave a comment

AI²: Training a big data machine to defend Veeramachaneni et al.IEEE International conference on Big Data Security, 2016

Will machines take over? The lesson of today’s paper is that we’re better off together. Combining AI with HI (human intelligence, I felt like we deserved an acronym of our own ) yields much better results than a system that uses only unsupervised learning. The context is information security, scanning millions of log entries per day to detect suspicious activity and prevent attacks. Examples of attacks include account takeovers, new account fraud (opening a new account using stolen credit card information), and terms of service abuse (e.g. abusing promotional codes, or manipulating cookies for advantage).

A typical attack has a behavioral signature, which comprises the series of steps involved in commiting it. The information necessary to quantify these signatures is buried deep in the raw data, and is often delivered as logs.

The usual problem with such outlier/anomaly detection systems is that they trigger lots of false positive alarms, that take substantial time and effort to investigate. After the system has ‘cried wolf’ enough times they can become distrusted and of limited use. AI² combines the experience and intuition of analysts with machine learning techniques. An ensemble of unsupervised learning models generates a set of k events to be analysed per day (where the daily budget k of events that can be analysed is a configurable parameter). The human judgements on these k events are used to train a supervised model, the results of which are combined with the unsupervised ensemble results to refine the k events to be presented to the analyst on the next day. And so it goes on.

The end result looks a bit like this:

With a daily investigation budget (k) of 200 events, AI² detects 86.8% of attacks with a false positive rate of 4.4%. Using only unsupervised learning, on 7.9% of attacks are detected. If the investigation budget is upped to 1000 events/day, unsupervised learning can detect 73.7% of attacks with a false positive rate of 22%. At this level, the unsupervised system is generating 5x the false positives of AI², and still not detecting as many attacks.

Detecting attacks is a true ‘needle-in-a-haystack’ problem as the following table shows:

Entities in the above refers to the number of unique IP addresses, users, sessions etc. anaysed on a daily basis. The very small relative number of true attacks results in extreme class imbalance when trying to learn a supervised model.

AI² tracks activity based on ingested log records and aggregates activities over intervals of time (for example,counters, indicators – did this happen in the window at all? – elapsed time between events, number of unique values and so on). These features are passed into an ensemble of three unsupervised outlier detection models:

A Principle Component Analysis (PCA) based model. The basic idea is to use PCA to determine the most significant features (those that explain most of the variance in the data). Given an input take its PCA projection, and then from the projection, reconstruct the original variables. The reconstruction error will be small for the majority of examples, but will remain high for outliers.
A Replicator Neural Network (not to be confused with a RecurrentNeural Network – both get abbreviated to RNN). This works on a very similar principal. The input and output layers have the same number of nodes, and intermediate layers have fewer nodes. The goal is to train the network to recreate the input at the output layer – which means it must learn an efficient compressed representation in the lower-dimensional hidden layers. Once the RNN has been trained, the reconstruction error can be used as the outlier score.

The third unsupervised model uses copula functions to build a joint probability function that can be used to detect rare events.

A copula framework provides a means of inference after modeling a multivariate joint probability distribution from training data. Because copula frameworks are less well known than other forms of estimation, we will now briefly review copula theory…

(If you’re interested in that review, and how copula functions are used to form a multivariate density function, see section 6.3 in the paper).

The scores from each of the models are translated into probabilities using a Weibull distribution, “which is flexible and can model a wide variety of shapes.” This translation means that we can compare like-with-like when combining the results from the three models. Here’s an example of the combination process using one-day’s worth of data:

The whole AI² system cycles through training, deployment, and feedback collection/model updating phases on a daily basis. The system trains unsupervised and supervised models based on all the available data, applies those models to the incoming data, identifies k entities as extreme events or attacks, and brings these to the analyst’s attention. The analysts deductions are used to build a new predictive model for the next day.

This combined approach makes effective use of the limited available analyst bandwidth, can overcome some of the weaknesses of pure unsupervised learning, and actively adapts and synthesizes new models.

This setup captures the cascading effect of the human-machine interaction: the more attacks the predictive system detects, the more feeback it will receive from the analysts; this feedback, in turn, will improve the accuracy of future predictions.

Glossary of AI Terms for Cyber Security

September 22, 2016 zr9558 Leave a comment

We often encounter confusion and hype surrounding the terminology of Artificial Intelligence. In this post, it is hoped that the security practitioner can have a quick reference guide for some of the more important and common terms.

Note that this is a limited set. We discovered that, once you start defining these terms, the terms themselves introduce new terms that require definition. We had to draw the line somewhere…

Artificial Intelligence – “The goal of work in Artificial Intelligence is to build machines that perform tasks that normally require human intelligence.” This quote from Nils Nilsson is an excellent definition, but it is not the only one. There are many definitions of Artificial Intelligence here and here.
Algorithms – are a self-contained step-by-step sets of operations to be performed. Algorithms perform calculation, data processing, and/or automated reasoning tasks. Among other things, Algorithms can be used to train Machine Learning models.
Machine Learning – A discipline or subfield of Artificial Intelligence. Paraphrasing the definition by Tom Mitchell, ML is the study of computer algorithms that learn from experience to perform a set of predefined tasks.
Machine Learning Models – The output of a machine learning algorithm. There are two types of machine learning models, those generated by Supervised algorithms and Unsupervised algorithms. See below.
The difference between “algorithms” and “models”: this is a common question and still quite difficult to answer. In the context of Artificial Intelligence, we can say that learning algorithms generate models. The learning algorithms are either Supervised or Unsupervised.N.B. People often use “models” and “algorithms” interchangeably which is a common source of confusion. To the layman, think that algorithms are like programs and the models are the output of the program.
Unsupervised Learning (algorithm) – a family of machine learning algorithms that learn without labels (labels defined below). The output of Unsupervised Learning algorithms are models that capture the structure of the data, can identify groups, or find statistical outliers. For example, Unsupervised Learning models can show you behaviors that are unlike other behaviors in a corpus of data.
Supervised Learning (algorithm) – a family of machine learning algorithms that learn from labeled data. The output of Supervised Learning algorithms are predictive models that can classify or assign a score to a data pattern. For example, trained Supervised Learning models can classify behaviors patterns into different attack tactics, or can assign a risk score to a behavior. In cyber-security, Supervised Learning models predict what a human would label a given behavior pattern.
Labeling – is the act of classification or describing something. For PatternEx, the act of labeling is something that a human analyst does every day. He or she marks something as a malicious or benign behavior. The more labels are provided, the more accurate the system becomes.
Active Learning – Active learning is a machine learning process in which a learning algorithm interactively requests inputs from an external source to improve a model. It is most commonly applied when only unlabeled data is available, the goal is to train Supervised Learning models, and the external source is a human expert that provides labels, and the labeling process is expensive and/or slow. Active learning strategies are also useful when, as in the case of InfoSec, the data changes fast.
Behavior Vectors – a quantified description of the activity of the modeled entities.
Entities – a thing with distinct, independent existence against which the behaviors relate. In cyber-security, examples would be users, IP’s, domains, and so on.
Human-Assisted Artificial Intelligence – the synergy between human intuition and artificial intelligence. Note that the humans assist the AI by providing feedback (e.g. labels) and the trained AI assists the humans by automating and scaling the tasks requiring human intelligence.
Predictions – are the activity of the system anticipating how an event would be classified by the security analyst.
Rare Events – very similar to “anomalies” and “outliers,” Rare Events are events that activities seen in log data that are unusual or out of the ordinary but not yet determined to be either malicious or benign.
Transfer Learning – means you can port knowledge acquired at one environment to another to improve model accuracy. For example, a model trained at company X can be transferred to companies A, B and C, increasing the detection capabilities of the entire group.
Virtual Analyst – the term that describes the effect of a fully trained AI system. Because a trained AI system greatly scales the analytic capability of the human analyst team, we say it is like expanding your team with “virtual analysts.”

PatternEx Unique Approach

PatternEx comes with many algorithms out-of-the-box that allow it to create predictive models that select what the analyst should review. Humans will always be needed to identify in context what is malicious in a constantly changing sea of events. In this way, Human-Assisted Artificial intelligence systems learn from analysts to identify what events are malicious. This results in greater detection accuracy at scale and reduced mean time to identify an attack.

If you’d like to learn more about how PatternEx is bringing Artificial Intelligence into the domain of Cyber Security click here: LEARN MORE

异常点检测算法（二）

June 23, 2016 zr9558 3 Comments

前面一篇文章《异常点检测算法（一）》简要的介绍了如何使用概率统计的方法来计算异常点，本文将会介绍一种基于矩阵分解的异常点检测方法。在介绍这种方法之前，先回顾一下主成分分析（Principle Component Analysis）这一基本的降维方法。

（一）主成分分析（Principle Component Analysis）

对高维数据集合的简化有各种各样的原因，例如：

（1）使得数据集合更容易使用；

（2）降低很多算法的计算开销；

（3）去除噪声；

（4）更加容易的描述结果。

在主成分分析（PCA）这种降维方法中，数据从原来的坐标系转换到新的坐标系，新坐标系的选择是由数据集本身所决定的。第一个新坐标轴的方向选择的是原始数据集中方差最大的方向，第二个新坐标轴的选择是和第一个坐标轴正交并且具有最大方差的方向。该过程一直重复，重复的次数就是原始数据中特征的数目。如此操作下去，将会发现，大部分方差都包含在最前面的几个新坐标轴之中。因此，我们可以忽略余下的坐标轴，也就是对数据进行了降维的处理。

为了提取到第一个主成分（数据差异性最大）的方向，进而提取到第二个主成分（数据差异性次大）的方向，并且该方向需要和第一个主成分方向正交，那么我们就需要对数据集的协方差矩阵进行特征值的分析，从而获得这些主成分的方向。一旦我们计算出了协方差矩阵的特征向量，我们就可以保留最大的 N 个值。正是这 N 个值反映了 N 个最重要特征的真实信息，可以把原始数据集合映射到 N 维的低维空间。

提取 N 个主成分的伪代码如下：

去除平均值

计算协方差矩阵

计算协方差矩阵的特征值和特征向量

将特征值从大到小排序

保留最大的N个特征值以及它们的特征向量

将数据映射到上述N个特征向量构造的新空间中

通过 Python 的 numpy 库和 matplotlib 库可以计算出某个二维数据集合的第一主成分如下：原始数据集使用蓝色的三角形表示，第一主成分使用黄色的圆点表示。

PCA

Principle Component Analysis 的基本性质：

Principle component analysis provides a set of eigenvectors satisfying the following properties:

（1）If the top-k eigenvectors are picked (by largest eigenvalue), then the k-dimensional hyperplane defined by these eigenvectors, and passing through the mean of the data, is a plane for which the mean square distance of all data points to it is as small as possible among all hyperplanes of dimensionality k.

（2）If the data is transformed to the axis-system corresponding to the orthogonal eigenvectors, the variance of the transformed data along each eigenvector dimension is equal to the corresponding eigenvalue. The covariances of the transformed data in this new representation are 0.

（3）Since the variances of the transformed data along the eigenvectors with small eigenvalues are low, significant deviations of the transformed data from the mean values along these directions may represent outliers.

（二）基于矩阵分解的异常点检测方法

基于矩阵分解的异常点检测方法的关键思想是利用主成分分析去寻找那些违背了数据之间相关性的异常点。为了发现这些异常点，基于主成分分析（PCA）的算法会把原始数据从原始的空间投影到主成分空间，然后再把投影拉回到原始的空间。如果只使用第一主成分来进行投影和重构，对于大多数的数据而言，重构之后的误差是小的；但是对于异常点而言，重构之后的误差依然相对大。这是因为第一主成分反映了正常值的方差，最后一个主成分反映了异常点的方差。

假设 dataMat 是一个 p 维的数据集合，有 N 个样本，它的协方差矩阵是 X。那么协方差矩阵就通过奇异值分解写成：

$X=PDP^{T},$

其中 P 是一个 (p,p) 维的正交矩阵，它的每一列都是 X 的特征向量。D 是一个 (p,p) 维的对角矩阵，包含了特征值 $\lambda_{1},...,\lambda_{p}$ 。从图像上看，一个特征向量可以看成 2 维平面上面的一条线，或者高维空间里面的一个超平面。特征向量所对应的特征值反映了这批数据在这个方向上的拉伸程度。通常情况下，可以把对角矩阵 D 中的特征值进行从大到小的排序，矩阵 P 的每一列也进行相应的调整，保证 P 的第 i 列对应的是 D 的第 i 个对角值。

这个数据集 dataMat 在主成分空间的投影可以写成

$Y=dataMat\times P.$

需要注意的是做投影可以只在部分的维度上进行，如果使用 top-j 的主成分的话，那么投影之后的数据集是

$Y^{j}=dataMat \times P^{j},$

其中 $P^{j}$ 是矩阵 P 的前 j 列，也就是说 $P^{j}$ 是一个 (p,j) 维的矩阵， $Y^{j}$ 是一个 (N,j) 维的矩阵。如果考虑拉回映射的话（也就是从主成分空间映射到原始空间），重构之后的数据集合是

$R^{j}=(P^{j}\times (Y^{j})^{T})^{T}=Y^{j}\times (P^{j})^{T},$

其中 $R^{j}$ 是使用 top-j 的主成分进行重构之后形成的数据集，是一个 (N,p) 维的矩阵。

下面可以定义数据 $dataMat_{i}=(dataMat_{i,1},...,dataMat_{i,p})$ 的异常值分数（outlier score）如下：

$score(dataMat_{i})=\sum_{j=1}^{p}(|dataMat_{i}-R_{i}^{j}|)\times ev(j)$

$ev(j)=\sum_{k=1}^{j}\lambda_{k}/\sum_{k=1}^{p}\lambda_{k}$

注意到 $|dataMat_{i}-R_{i}^{j}|$ 指的是 Euclidean 范数， ev(j) 表示的是 top-j 的主成分在所有主成分中所占的比例，并且特征值是按照从大到小的顺序排列的。因此，ev(j) 是递增的序列，这就表示 j 越高，越多的方差就会被考虑在 ev(j) 中，因为是从 1 到 j 的求和。在这个定义下，偏差最大的第一个主成分获得最小的权重，偏差最小的最后一个主成分获得了最大的权重 1。根据 PCA 的性质，异常点在最后一个主成分上有着较大的偏差，因此可以获得更高的分数。

整个算法的结构如图所示：

PCC

（三）效果展示

下面两幅图使用了同一批数据集，分别采用了基于矩阵分解的异常点检测算法和基于高斯分布的概率模型的异常点算法。

PCC2

基于矩阵分解的异常点检测

Gauss

基于高斯分布的概率模型的异常点检测

根据图像可以看出，如果使用基于矩阵分解的异常点检测算法的话，偏离第一主成分较多的点都被标记为异常点，其中包括部分左下角的点。需要注意的是如果使用基于高斯分布的概率模型的话，是不太可能标记出左下角的点的，两者形成鲜明对比。

异常点检测算法（一）

June 13, 2016 zr9558 Leave a comment

异常点（outlier）是一个数据对象，它明显不同于其他的数据对象，就好像它是被不同的机制产生的一样。例如下图红色的点，就明显区别于蓝色的点。相对于蓝色的点而言，红色的点就是异常点。

outlier

一般来说，进行异常点检测的方法有很多，最常见的就是基于统计学的方法。

（一）基于正态分布的一元离群点检测方法

假设有 n 个点 $(x_{1},...,x_{n})$ ，那么可以计算出这 n 个点的均值 $\mu$ 和方差 $\sigma$ 。均值和方差分别被定义为：

$\mu=\sum_{i=1}^{n}x_{i}/n,$

$\sigma^{2}=\sum_{i=1}^{n}(x_{i}-\mu)^{2}/n.$

在正态分布的假设下，区域 $\mu\pm 3\sigma$ 包含了99.7% 的数据，如果某个值距离分布的均值 $\mu$ 超过了 $3\sigma$ ，那么这个值就可以被简单的标记为一个异常点（outlier）。

（二）多元离群点的检测方法

涉及两个或者两个以上变量的数据称为多元数据，很多一元离群点的检测方法都可以扩展到高维空间中，从而处理多元数据。

（1）基于一元正态分布的离群点检测方法

假设 n 维的数据集合形如 $\vec{x}_{i}=(x_{i,1},...,x_{i,n}), i\in \{1,...,m\}$ ，那么可以计算每个维度的均值和方差 $\mu_{j},\sigma_{j}, j\in\{1,...,n\}.$ 具体来说，对于 $j\in \{1,...,n\}$ ，可以计算

$\mu_{j}=\sum_{i=1}^{m}x_{i,j}/m$

$\sigma_{j}^{2}=\sum_{i=1}^{m}(x_{i,j}-\mu_{j})^{2}/m$

在正态分布的假设下，如果有一个新的数据 $\vec{x}$ ，可以计算概率 $p(\vec{x})$ 如下：

$p(\vec{x})=\prod_{j=1}^{n} p(x_{j};\mu_{j},\sigma_{j}^{2})=\prod_{j=1}^{n}\frac{1}{\sqrt{2\pi}\sigma_{j}}\exp(-\frac{(x_{j}-\mu_{j})^{2}}{2\sigma_{j}^{2}})$

根据概率值的大小就可以判断 x 是否属于异常值。运用该方法检测到的异常点如图，红色标记为异常点，蓝色表示原始的数据点。

Gauss

（2）多元高斯分布的异常点检测

假设 n 维的数据集合 $\vec{x}=(x_{1},...,x_{n}),$ ，可以计算 n 维的均值向量

$\vec{\mu}=(E(x_{1}),...,E(x_{n}))$

和 $n\times n$ 的协方差矩阵：

$\Sigma=[Cov(x_{i},x_{j})], i,j \in \{1,...,n\}$

如果有一个新的数据 $\vec{x}$ ，可以计算

$p(\vec{x})=\frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}} \exp(-\frac{1}{2}(\vec{x}-\vec{\mu})^{T}\Sigma^{-1}(\vec{x}-\vec{\mu}))$

根据概率值的大小就可以判断 $\vec{x}$ 是否属于异常值。

（3）使用 Mahalanobis 距离检测多元离群点

对于一个多维的数据集合 D，假设 $\overline{a}$ 是均值向量，那么对于数据集 D 中的其他对象 $a$ ，从 $a$ 到 $\overline{a}$ 的 Mahalanobis 距离是

$MDist(a,\overline{a})=\sqrt{(a-\overline{a})^{T}S^{-1}(a-\overline{a})},$

其中 $S$ 是协方差矩阵。

在这里， $MDist(a,\overline{a})$ 是数值，可以对这个数值进行排序，如果数值过大，那么就可以认为点 $a$ 是离群点。或者对一元实数集合 $\{MDist(a,\overline{a})|a\in D\}$ 进行离群点检测，如果 $MDist(a,\overline{a})$ 被检测为异常点，那么就认为 $a$ 在多维的数据集合 D 中就是离群点。

运用 Mahalanobis 距离方法检测到的异常点如图，红色标记为异常点，蓝色表示原始的数据点。

Mahalanobis

（4）使用 $\chi^{2}$ 统计量检测多元离群点

在正态分布的假设下， $\chi^{2}$ 统计量可以用来检测多元离群点。对于某个对象 $\bold{a}$ ， $\chi^{2}$ 统计量是

$\chi^{2}=\sum_{i=1}^{n}(a_{i}-E_{i})^{2}/E_{i}.$

其中， $a_{i}$ 是 $\bold{a}$ 在第 i 维上的取值， $E_{i}$ 是所有对象在第 i 维的均值，n 是维度。如果对象 $\bold{a}$ 的 $\chi^{2}$ 统计量很大，那么该对象就可以认为是离群点。

运用 $\chi^{2}$ 统计量检测到的异常点如图，红色标记为异常点，蓝色表示原始的数据点。

ChiSquare

异常点检测算法（三）Replicator Neural Networks

June 12, 2016 zr9558 2 Comments

异常值检测算法在数据挖掘的诸多领域有着应用场景，例如金融领域，信息传输领域，图像领域等。在研究过程中，有学者给出了异常点的一个定义：

An outlier is an observation that deviates so much from other observations as as to arouse suspicion that it was generated by a different mechanism.

RNN 算法的主要思想

在这篇文章中，我们将会介绍一个多层的前馈神经网络，该神经网络可以用来进行异常值的检测。这个神经网络模拟的是一个恒等映射，输入层的神经元个数和输出层的神经元个数是一样的。这类的神经网络被称为 Replicator Neural Networks (RNNs)，请注意这里的 RNN 算法指的并不是 Recurrent Neural Networks（RNNs），而是 Replicator Neural Networks，尽管它们拥有着同样的缩写名字 RNNs。具体来说， Replicator Neural Networks (RNNs)，或者说自编码器，是一个多层前馈的神经网络 (multi-layer feed-forward neural networks)。在 Replicator Neural Networks 中，输入的变量也是输出的变量，模型中间层节点的个数少于输入层和输出层节点的个数。这样的话，模型就起到了压缩数据和恢复数据的作用。

rnn1

如图所示，这里的 RNNs 有三个隐藏层，输入层和输出层的节点个数都是6，第一个隐藏层和第三个隐藏层的节点个数（图中是4个节点）少于输入层，第二个隐藏层的节点个数是最少的（图中是2个节点）。在神经网络传输的时候，中间使用了 tanh 函数和 sigmoid 函数。这个神经网络是训练一个从输入层到输出层的恒等函数（identity mapping），传输的时候从输入层开始压缩数据，然后到了第二个隐藏层的时候开始解压数据。训练的目标就是使得整体的输出误差足够小，整体的误差是由所有的样本误差之和除以样本的个数得到的。由于图中只画出了6个特征，因此第 i 个样本的误差是

$e_{i}=\sum_{j=1}^{6}(x_{i j}-r_{i j})^{2}/6$

如果使用已经训练好的 RNN 模型，异常值的分数就可以定义为重构误差（reconstruction error）。

下面简要介绍一下 RNN 模型是如何构建的：

rnn2

根据上图所示，左边的是输入层，右边的输出层。假设第 k 层中第 i 个神经元的输出是 $S_{k}(I_{ki})$ ，其中 $I_{ki}$ 表示第 k 层中第 i 个神经元的输入， $S_{k}$ 表示第 k 层使用的激活函数。那么

$\theta=I_{ki}=\sum_{j=0}^{L_{k-1}}w_{kij}Z_{(k-1)j}$

其中 $Z_{kj}$ 是第 k 层中第 j 个神经元的输出， $L_{k}$ 是第 k 层神经元的个数。对于第二层和第四层而言 (k=2,4)，激活函数选择为

$S_{k}(\theta)=tanh(a_{k}\theta) \text{ for } k=2 \text{ or } 4,$

这里的 $a_{k}$ 是一个参数，通常假设为1。对于中间层 (k=3) 而言，激活函数是一个类阶梯 (step-like) 函数。有两个参数 N 和 $a_{3}$ ，N 表示阶梯的个数， $a_{3}$ 表示从这一层到下一层的提升率 (transition rate)：

$S_{3}(\theta)=\frac{1}{2}$ + $\frac{1}{4}\sum_{j=1}^{N-1}tanh(a_{3}(\theta-\frac{j}{N}))$ .

在这里可以假设 $a_{3}=100$ ， $N=4$ . 那么 $S_{3}(\theta)$ 就如下图所示。

第三层的激活函数的输出就变成了 N 个离散的变量：0, 1/(N-1), 2/(N-1),…,1。这个阶梯型的激活函数是把第三层的连续输入值变成了一批离散的值。也就意味着把样本映射到了 N 个簇，那么 RNN 就可以计算出单个的异常点和一小簇的异常点。

备注：

根据上面的分析，可以看出如果按照以上算法，则不能使用反向传播算法来训练模型，原因是 $S_{3}(\theta)$ 的导数不能够通过它的取值来表示。这一点与 tanh 函数， $\sigma$ 函数是不一致的，因为 $tanh^{'}(x) = 1-tanh^{2}(x)$ 和 $\sigma^{'}(x)=\sigma(x)(1-\sigma(x))$ 。因此有学者指出 [1]，使用三个隐藏层是没有必要的，使用1个或者2个隐藏层的神经网络也能够得到类似的结果；同样，没有必要使用 $S_{3}(\theta)$ 这样类型的阶梯函数，使用传统的 $\sigma$ 激活函数也能够得到类似的结果。并且 $S_{3}(\theta)$ 是一个 step-like 函数，很多地方的导数取值都是接近于零的。

后向传播算法：

一般来说，为了训练神经网络模型，需要使用后向传播算法（back propagation），也简称为 BP 算法，或者误差逆传播算法（error back propagation）。在本文中，仅针对最简单的 RNN 模型介绍如何使用 BP 算法进行模型训练，至于多层的神经网络模型或者其他的神经网络模型，方法则是完全类似的。

rnn3

给定训练集合 $D=\{(\bold{x}_{1},\bold{y}_{1}),...,(\bold{x}_{m},\bold{y}_{m})\}$ ，其中有 m 个样本，并且输入和输出是一样的值。换句话说，也就是 n 维向量

$\bold{x}_{i}=\bold{y}_{i}\in\mathbb{R}^{n} \text{ for all } 1\leq i\leq m$ .

换句话说，输入样例是由 n 个属性描述，输出的结果也是 n 个属性。隐藏层只有一个，隐藏层的神经元个数是 $q=[(n+1)/2]$ ，这里的 [] 表示 Gauss 取整函数。输出层第 j 个神经元的阈值使用 $\theta_{j}$ 表示，隐藏层第 h 个神经元的阈值使用 $\gamma_{h}$ 表示。输入层第 i 个神经元与隐藏层第 h 个神经元之间的连接权重是 $v_{i h}$ , 隐藏层第 h 个神经元与输出层第 j 个神经元之间的连接权重是 $w_{h j},$ 其中 $1\leq i \leq n, 1\leq h \leq q, 1\leq j \leq n.$

记隐藏层第 h 个神经元接收到的输入为

$\alpha_{h} = \sum_{i=1}^{n}v_{i h}x_{i} \text{ for all } 1\leq h \leq q.$

写成矩阵形式就是：

$(\alpha_{1},\cdot\cdot\cdot,\alpha_{q})=(x_{1},\cdot\cdot\cdot,x_{n})\begin{bmatrix} v_{11} & ... & v_{1q} \\ ... & ... & ... \\ v_{n1} & ... & v_{nq} \end{bmatrix}.$

记输出层第 j 个神经元接收到的输入为

$\beta_{j}=\sum_{h=1}^{q}w_{h j}b_{h} \text{ for all } 1\leq j\leq n,$

其中 $b_{h}$ 是隐藏层第 h 个神经元的输出， $b_{h} = f(\alpha_{h}-\gamma_{h}) \text{ for all } 1\leq h \leq q,$ $f$ 是激活函数。写成矩阵形式就是：

$(\beta_{1},\cdot\cdot\cdot,\beta_{n})=(b_{1},\cdot\cdot\cdot,b_{q})\begin{bmatrix} w_{11} & ... & w_{1n} \\ ... & ... & ... \\ w_{q1} & ... & w_{qn} \end{bmatrix}.$

输出层第 j 个神经元的输出是 $f(\beta_{j}-\theta_{j}),$ 其中 $1\leq j \leq n.$

下面可以假定激活函数都使用 $f(x)=1/(1+\exp(-x)),$ 那么直接通过导数计算可以得到 $f^{'}(x)=f(x)(1-f(x)).$

对于训练集 $(\bold{x}_{k},\bold{y}_{k}),$ 通过神经网络得到的输出是 $\hat{\bold{y}}_{k}=(\hat{y}_{k1},...,\hat{y}_{kn}),$ 并且 $\hat{y}_{kj} = f(\beta_{j}-\theta_{j})$ 对于 $1\leq j \leq n$ 都成立。那么神经网络在训练集 $(\bold{x}_{k},\bold{y}_{k})$ 的均方误差是

$E_{k} =\frac{1}{2}\sum_{j=1}^{n}(\hat{y}_{kj}-y_{kj})^{2},$

其中 $\bold{y}_{k}=(y_{k1},...,y_{kn}).$ 整体的误差是

$E = \frac{1}{m}\sum_{k=1}^{m}E_{k} = \frac{1}{2m}\sum_{k=1}^{m}\sum_{j=1}^{n}(\hat{y}_{kj}-y_{kj})^{2}$

标准 BP 算法：

网络中有个参数需要确定：输入层到隐藏层的 n*q 个权重值，隐藏层到输出层的 n*q 个权重值，q个隐层神经元的阈值，n 个输出层神经元的阈值。BP 算法是一个迭代学习算法，在迭代的每一轮采用了梯度下降法来进行参数的更新。任意参数的更新规则是

$v \leftarrow v$ + $\Delta v.$

标准 BP 算法是根据每一个 $E_{k}$ 来获得更新规则，下面来推导每一个参数的更新规则。对于 $1\leq h \leq q, 1\leq j \leq n,$ 计算梯度

$\Delta w_{hj} = -\eta \frac{\partial E_{k}}{\partial w_{hj}},$

注意到 $w_{hj}$ 先影响到第 j 个输出层神经元的输入值 $\beta_{j},$ 再影响到第 j 个输出层神经元的输出值 $\hat{y}_{kj}$ ，最后影响到 $E_{k}$ ，根据高等数学的链式法则可以得到

$\frac{\partial E_{k}}{\partial w_{hj}} = \frac{\partial E_{k}}{\partial \hat{y}_{kj}} \cdot \frac{\partial \hat{y}_{kj}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial w_{hj}}$

根据定义 $\beta_{j}=\sum_{h=1}^{q}w_{hj}b_{h}$ 可以得到 $\frac{\partial \beta_{j}}{\partial w_{hj}}=b_{h}$ 对于 $1\leq j \leq n$ 都成立。

根据定义 $E_{k}=\frac{1}{2}\sum_{j=1}^{n}(\hat{y}_{kj}-y_{kj})^{2}$ 可以得到 $\frac{\partial E_{k}}{\partial \hat{y}_{kj}}=(\hat{y}_{kj}-y_{kj})$ .

根据定义 $\hat{y}_{kj}=f(\beta_{j}-\theta_{j})$ 和 $f^{'}(x)=f(x)\cdot(1-f(x))$ 可以得到 $\frac{\partial \hat{y}_{kj}}{\partial \beta_{j}}=f^{'}(\beta_{j}-\theta_{j})=f(\beta_{j}-\theta_{j})\cdot(1-f(\beta_{j}-\theta_{j}))=\hat{y}_{kj}\cdot (1-\hat{y}_{kj}).$

所以可以计算出对于 $1\leq h \leq q, 1\leq j \leq n,$ 有

$\frac{\partial E_{k}}{\partial w_{hj}} = (\hat{y}_{kj}-y_{kj})\cdot\hat{y}_{kj}\cdot(1-\hat{y}_{kj})\cdot b_{h}$

如果假设

$g_{j}=-\frac{\partial E_{k}}{\partial \beta_{j}}=-\frac{\partial E_{k}}{\partial \hat{y}_{kj}}\cdot \frac{\hat{y}_{kj}}{\partial \beta_{j}}$

那么可以得到

$g_{j}=\hat{y}_{kj}\cdot(1-\hat{y}_{kj})\cdot(y_{kj}-\hat{y}_{kj})$

因此对于 $1\leq h \leq q, 1\leq j \leq n,$ 可以得到 $\Delta w_{hj}=\eta g_{j}b_{h}.$

根据类似的想法，有

$\Delta \theta_{j}=-\eta\cdot\frac{\partial E_{k}}{\partial \theta_{j}}, \Delta v_{ih}=-\eta\cdot\frac{\partial E_{k}}{\partial v_{ih}}, \Delta \gamma_{h}=-\eta\cdot\frac{\partial E_{k}}{\partial \gamma_{h}}$ .

逐个计算：

$\frac{\partial E_{k}}{\partial \theta_{j}}=\frac{\partial E_{k}}{\partial \hat{y}_{kj}}\cdot\frac{\partial\hat{y}_{kj}}{\partial\theta_{j}}=(\hat{y}_{kj}-y_{kj})\cdot(-1)\cdot f^{'}(\beta_{j}-\theta_{j})=(y_{kj}-\hat{y}_{kj})\cdot\hat{y}_{kj}\cdot(1-\hat{y}_{kj})=g_{j}$

$\frac{\partial E_{k}}{\partial v_{ih}}=\frac{\partial E_{k}}{\partial\alpha_{h}}\cdot\frac{\partial\alpha_{h}}{\partial v_{ih}}=\frac{\partial E_{k}}{\partial b_{h}}\cdot\frac{\partial b_{h}}{\partial \alpha_{h}}\cdot\frac{\partial\alpha_{h}}{\partial v_{ih}}$

由于

$\frac{\partial \alpha_{h}}{\partial v_{ih}}=x_{ki}$

$\frac{\partial b_{h}}{\partial\alpha_{h}}=f^{'}(\alpha_{h}-\gamma_{h})=f(\alpha_{h}-\gamma_{h})\cdot(1-f(\alpha_{h}-\gamma_{h}))=b_{h}\cdot(1-b_{h})$

$\frac{\partial E_{k}}{\partial b_{h}}=\sum_{j=1}^{n}\frac{\partial E_{k}}{\partial \beta_{j}}\cdot\frac{\partial \beta_{j}}{\partial b_{h}}=\sum_{j=1}^{n}(-g_{j})\cdot w_{hj}$

所以，

$\Delta v_{ih}=\eta(\sum_{j=1}^{n}g_{j}w_{hj})\cdot b_{h}\cdot (1-b_{h})x_{ki} = \eta e_{h}x_{ki},$ 其中 $e_{h}=-\partial E_{k}/\partial\alpha_{h}=(\sum_{j=1}^{n}g_{j}w_{hj})\cdot b_{h}\cdot(1-b_{h}).$

$\Delta \gamma_{h}=(-\eta)\cdot\frac{\partial E_{k}}{\partial\gamma_{h}}=(-\eta)\cdot\frac{\partial E_{k}}{\partial b_{h}}\cdot\frac{\partial b_{h}}{\partial\gamma_{h}}=\eta\cdot(\sum_{j=1}^{n}g_{j}w_{hj})\cdot(-1)\cdot f^{'}(\alpha_{h}-\gamma_{h})=(-\eta)\cdot(\sum_{j=1}^{n}g_{j}w_{hj})\cdot b_{h}\cdot(1-b_{h})=(-\eta)\cdot e_{h} .$

整理之后，任意参数 v 的更新式子是 $v\leftarrow v$ + $\Delta v,$ 并且更新的规则如下：

$\Delta w_{hj}=\eta g_{j}b_{h} \text{ for all } 1\leq j\leq n, 1\leq h \leq q,$

$\Delta \theta_{j}=-\eta g_{j} \text{ for all } 1\leq j\leq n,$

$\Delta v_{ih}=\eta e_{h}x_{ki} \text{ for all } 1\leq i\leq n, 1\leq h\leq q,$

$\Delta \gamma_{h}=-\eta e_{h} \text{ for all } 1\leq h\leq q,$

其中学习率 $\eta\in(0,1)$ 控制着算法每一轮迭代中的更新步长，若步长太大则容易振荡，太小则收敛速度过慢，需要人工调整学习率。对每个训练样例，BP 算法执行下面的步骤：先把输入样例提供给输入层神经元，然后逐层将信号往前传，直到计算出输出层的结果；然后根据输出层的误差，再将误差逆向传播至隐藏层的神经元，根据隐藏层的神经元误差来对连接权和阈值进行迭代（梯度下降法）。该迭代过程循环进行，直到达到某个停止条件为止。

标准 BP 算法的训练流程：
输入：训练集合 $D={(\bold{x}_{k},\bold{y}_{k})}_{k=1}^{m}$ 和学习率 $\eta.$
过程：
1. 在 (0,1) 范围内随机神经网络中的所有连接权重和阈值
2. repeat
for all $(\bold{x}_{k},\bold{y}_{k})$ do
根据当前参数，计算出当前的样本输出 $\bold{y}_{k}$
计算输出层神经元的梯度项 $g_{j}$
计算隐藏层神经元的梯度项 $e_{h}$
更新连接权重 $w_{hj}, v_{ih}$ 与阈值 $\theta_{j},\gamma_{h}$
end for
3. 达到停止条件
输出：链接权与阈值都确定的神经网络模型

累积 BP 算法：

BP 算法的目的是最小化训练集上的累计误差 $E=\sum_{k=1}^{m}E_{k}/m,$ 其中 m 是训练集合中样本的个数。不过，标准的 BP 算法每次仅针对一个训练样例更新连接权重和阈值，也就是说，标准 BP 算法的更新规则是基于单个的 $E_{k}$ 推导而得到的。通过类似的计算方法可以推导出累计误差的最小化更新规则，那就得到了累计误差逆传播（accumulate error backpropagation）算法。标准 BP 算法需要进行更多次的迭代，并且参数的更新速度快，累积 BP 算法必须扫描一次训练集合才会进行一次参数的更新，而且累计误差下降到一定的程度以后，进一步下降就会明显变慢，此时标准 BP 算法往往会更快的得到较好的解，尤其是训练集合大的时候。

训练方法：

（1）把数据集合的每一列都进行归一化；

（2）选择 70% 的数据集合作为训练集合，30% 的数据集合作为验证集合。或者训练集合 : 验证集合 = 8 : 2，这个需要根据情况而定。

（3）随机生成一个三层的神经网络结构，里面的权重都是随机生成，范围在 [0,1] 内。输入层的数据和输出层的数据保持一致，并且神经网络中间层的节点个数是输入层的一半。

（4）使用后向传播算法（back-propagation）来训练模型。为了防止神经网络的过拟合，通常有两种策略来防止这个问题。（i）第一种策略是“早停”（early stopping）：当训练集合的误差降低，但是验证集合的误差增加时，则停止训练，同时返回具有最小验证集合误差的神经网络；（ii）第二种策略是“正则化”（regularization）：基本思想是在误差目标函数中增加一个用于描述网络复杂度的部分，例如链接权和阀值的平方和。

测试效果：

其中蓝色的点表示正常点，红色的点表示被 RNN 算法标记的异常点。

rnn_result1

rnn_result2

参考文献：

[1] Anomaly Detection Using Replicator Neural Networks Trained on Examples of One Class, Hoang Anh Dau, Vic Ciesielski, Andy Song

[2] Replicator Neural Networks for Outlier Modeling in Segmental Speech Recognition, Laszlo Toth and Gabor Gosztolya

[3] Outlier Detection Using Replicator Neural Networks, Simon Hawkins, Honxing He, Graham Williams and Rohan Baxter