Category Archives: Security Business Domain

How AI is helping detect fraud and fight criminals

http://venturebeat.com/2017/02/18/how-ai-is-helping-detect-fraud-and-fight-criminals/

AI is about to go mainstream. It will show up in the connected home, in your car, and everywhere else. While it’s not as glamorous as the sentient beings that turn on us in futuristic theme parks, the use of AI in fraud detection holds major promise. Keeping fraud at bay is an ever-evolving battle in which both sides, good and bad, are adapting as quickly as possible to determine how to best use AI to their advantage.

There are currently three major ways that AI is used to fight fraud, and they correspond to how AI has developed as a field. These are:

  1. Rules and reputation lists
  2. Supervised machine learning
  3. Unsupervised machine learning

Rules and reputation lists

Rules and reputation lists exist in many modern organizations today to help fight fraud and are akin to “expert systems,” which were first introduced to the AI field in the 1970s. Expert systems are computer programs combined with rules from domain experts. They’re easy to get up and running and are human-understandable, but they’re also limited by their rigidity and high manual effort.

A “rule” is a human-encoded logical statement that is used to detect fraudulent accounts and behavior. For example, an institution may put in place a rule that states, “If the account is purchasing an item costing more than $1000, is located in Nigeria, and signed up less than 24 hours ago, block the transaction.”

Reputation lists, similarly, are based on what you already know is bad. A reputation list is a list of specific IPs, device types, and other single characteristics and their corresponding reputation score. Then, if an account is coming from an IP on the bad reputation list, you block them.
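A minimal sketch of how a rule plus a reputation-list lookup might be wired together. Every field name, threshold, and IP address here is hypothetical, invented purely to illustrate the two mechanisms described above:

```python
# Toy rule-plus-reputation-list screening. All names, thresholds, and IPs
# are illustrative placeholders, not any vendor's actual logic.

BAD_IP_REPUTATION = {"198.51.100.7": 0.9, "203.0.113.42": 0.8}  # IP -> risk score

def is_blocked(txn: dict) -> bool:
    # Rule: expensive purchase from a brand-new account in a flagged region.
    if (txn["amount_usd"] > 1000
            and txn["country"] == "NG"
            and txn["account_age_hours"] < 24):
        return True
    # Reputation list: block anything coming from a high-risk IP.
    if BAD_IP_REPUTATION.get(txn["ip"], 0.0) >= 0.8:
        return True
    return False

print(is_blocked({"amount_usd": 1500, "country": "NG",
                  "account_age_hours": 2, "ip": "192.0.2.1"}))       # True
print(is_blocked({"amount_usd": 20, "country": "US",
                  "account_age_hours": 500, "ip": "203.0.113.42"}))  # True
print(is_blocked({"amount_usd": 20, "country": "US",
                  "account_age_hours": 500, "ip": "192.0.2.1"}))     # False
```

Note how brittle this is: a fraudster who keeps purchases under $1,000 and rotates to a fresh IP sails straight through, which is exactly the limitation discussed next.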

While rules and reputation lists are a good first attempt at fraud detection and prevention, they can be easily gamed by cybercriminals. These days, digital services abound, and these companies make the sign-up process frictionless. Therefore, it takes very little time for fraudsters to make dozens, or even thousands, of accounts. They then use these accounts to learn the boundaries of the rules and reputation lists put in place. Easy access to cloud hosting services, VPNs, anonymous email services, device emulators, and mobile device flashing makes it easy to come up with unsuspicious attributes that slip past reputation lists.

Since the 1990s, expert systems have fallen out of favor in many domains, losing out to more sophisticated techniques. Clearly, there are better tools at our disposal for fighting fraud. However, a significant number of fraud-fighting teams in modern companies still rely on this rudimentary approach for the majority of their fraud detection, leading to massive human review overhead, false positives, and sub-optimal detection results.

Supervised machine learning (SML)

Machine learning is a subfield of AI that attempts to address the issue of previous approaches being too rigid. Researchers wanted the machines to learn from data, rather than encoding what these computer programs should look for (a different approach from expert systems). Machine learning began to make big strides in the 1990s, and by the 2000s it was effectively being used in fighting fraud as well.

Applied to fraud, supervised machine learning (SML) represents a big step forward. It’s vastly different from rules and reputation lists because instead of looking at just a few features with simple rules and gates in place, all features are considered together.

There’s one downside to this approach. An SML model for fraud detection must be fed historical data to determine what the fraudulent accounts and activity look like versus what the good accounts and activity look like. The model would then be able to look through all of the features associated with the account to make a decision. Therefore, the model can only find fraud that is similar to previous attacks. Many sophisticated modern-day fraudsters are still able to get around these SML models.

That said, SML applied to fraud detection is an active area of development because there are many SML models and approaches. For instance, applying neural networks to fraud can be very helpful because it automates feature engineering, an otherwise costly step that requires human intervention. This approach can decrease the incidence of false positives and false negatives compared to other SML models, such as SVM and random forest models, since the hidden neurons can encode many more feature possibilities than can be done by a human.

Unsupervised machine learning (UML)

Compared to SML, unsupervised machine learning (UML) has cracked fewer domain problems. For fraud detection, UML hasn’t historically been able to help much. Common UML approaches (e.g., k-means and hierarchical clustering, unsupervised neural networks, and principal component analysis) have not been able to achieve good results for fraud detection.

Having an unsupervised approach to fraud can be difficult to build in-house since it requires processing billions of events all together and there are no out-of-the-box effective unsupervised models. However, there are companies that have made strides in this area.

The reason it can be applied to fraud is due to the anatomy of most fraud attacks. Normal user behavior is chaotic, but fraudsters will work in patterns, whether they realize it or not. They are working quickly and at scale. A fraudster isn’t going to try to steal $100,000 in one go from an online service. Rather, they make dozens to thousands of accounts, each of which may yield a profit of a few cents to several dollars. But those activities will inevitably create patterns, and UML can detect them.
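As a toy illustration of that idea, grouping accounts by a shared signup fingerprint can surface the whole ring at once. The fields, data, and cluster-size threshold below are all invented; production systems correlate far more signals across billions of events:

```python
from collections import defaultdict

# Toy correlation-style unsupervised detection: accounts that share a
# signup fingerprint form a cluster, and suspiciously large clusters are
# flagged together. All fields and the threshold are illustrative.

accounts = [
    {"id": 1, "email_pattern": "aaaaa123", "ip_subnet": "203.0.113"},
    {"id": 2, "email_pattern": "bbbbb456", "ip_subnet": "203.0.113"},
    {"id": 3, "email_pattern": "ccccc789", "ip_subnet": "203.0.113"},
    {"id": 4, "email_pattern": "ddddd012", "ip_subnet": "203.0.113"},
    {"id": 5, "email_pattern": "carol.w",  "ip_subnet": "198.51.100"},
]

def flag_clusters(accounts, min_size=3):
    clusters = defaultdict(list)
    for acct in accounts:
        # Key on attributes that fraud rings tend to share.
        key = (acct["ip_subnet"], len(acct["email_pattern"]))
        clusters[key].append(acct["id"])
    return [ids for ids in clusters.values() if len(ids) >= min_size]

print(flag_clusters(accounts))  # [[1, 2, 3, 4]] -- the whole ring, caught at once
```

The lone organic user (account 5) forms no cluster, while the four templated accounts are flagged as a group, which is why all of an attack's accounts can be stopped together.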

The main benefits of using UML are:

  • You can catch new attack patterns earlier
  • All of the accounts are caught, stopping the fraudster from making any money
  • Chance of false positives is much lower, since you collect much more information before making a detection decision

Putting it all together

Each approach has its own advantages and disadvantages, and you can benefit from each method. Rules and reputation lists can be implemented cheaply and quickly without AI expertise. However, they have to be constantly updated and will only block the most naive fraudsters. SML has become an out-of-the-box technology that can consider all the attributes for a single account or event, but it’s still limited in that it can’t find new attack patterns. UML is the next evolution, as it can find new attack patterns, identify all of the accounts associated with an attack, and provide a full global view. On the other hand, it’s not as effective at stopping individual fraudsters with low-volume attacks and is difficult to implement in-house. Still, it’s certainly promising for companies looking to block large-scale or constantly evolving attacks.

A healthy fraud detection system often employs all three major ways of using AI to fight fraud. When they’re used together properly, it’s possible to benefit from the advantages of each while mitigating the weaknesses of the others.

AI in fraud detection will continue to evolve, well beyond the technologies explored above, and it’s hard to even grasp what the next frontier will look like. One thing we know for sure, though, is that the bad guys will continue to evolve along with it, and the race is on to use AI to detect criminals faster than they can use it to hide.

Catherine Lu is a technical product manager at DataVisor, a full-stack online fraud analytics platform.




Unsupervised Analytics: Moving Beyond Rules Engines and Learning Models


Rules engines, machine learning models, ID verification, reputation lookups (e.g. email and IP blacklists and whitelists), or unsupervised analytics? I’m often asked which one to use, and whether you should pick just one over the others. Each has a place where it provides value, and you should expect to combine several of these fraud solutions with solid domain expertise to build a fraud management system that best accounts for your business, products and users. With that said, rules engines and learning models are two of the major foundational components of a company’s fraud detection architecture. I’ll explain how they work, discuss the benefits and limitations of each, and highlight the demand for unsupervised analytics that can go beyond rules engines and machine learning in order to catch new fraud that has yet to be seen.

Rules Engines


How they work

Rules engines partition the operational business logic from the application code, enabling non-engineering fraud domain experts (e.g. Trust & Safety or Risk Analysts) with SQL/database knowledge to manage the rules themselves. So what types of rules are effective? Rules can be as straightforward as a few lines of logic: If A and B, then do C. For example,

IF (user_email = type_free_email_service) AND (comment_character_count ≥ 150 per sec) {
    flag user_account as spammer
    mute comment
}

Rules engines can also employ weighted scoring mechanisms. For example, in the table below each rule has a score value, positive or negative, which can be assigned by an analyst. The points for all of the rules triggered will be added together to compute an aggregate score. Subsequently, rules engines aid in establishing business operation workflows based on the score thresholds. In a typical workflow, there could be three types of actions to take based on the score range:

  1. Above 1000 – Deny (e.g. reject a transaction, suspend the account)
  2. Below 300 – Accept (e.g. order is ok, approve the content post)
  3. Between 300 and 1000 – Flag for additional review and place into a manual review bin

[Table: example rules with analyst-assigned score values]
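The scoring workflow above can be sketched in a few lines. The rules, point values, and thresholds here are invented placeholders rather than any production configuration:

```python
# Toy weighted-scoring rules engine: each triggered rule contributes points,
# and the aggregate score routes the account into deny/accept/review.
# All rules, weights, and thresholds are illustrative placeholders.

RULES = [
    ("free_email_domain",   lambda u: u["email"].endswith("@freemail.example"), 300),
    ("anonymous_proxy",     lambda u: u["uses_proxy"], 500),
    ("high_velocity",       lambda u: u["txns_last_hour"] >= 10, 400),
    ("established_account", lambda u: u["account_age_days"] > 365, -400),
]

def decide(user):
    score = sum(points for _, test, points in RULES if test(user))
    if score > 1000:
        return "deny"
    if score < 300:
        return "accept"
    return "manual_review"

print(decide({"email": "a@freemail.example", "uses_proxy": True,
              "txns_last_hour": 12, "account_age_days": 1}))    # deny (1200)
print(decide({"email": "b@corp.example", "uses_proxy": False,
              "txns_last_hour": 0, "account_age_days": 900}))   # accept (-400)
print(decide({"email": "c@freemail.example", "uses_proxy": True,
              "txns_last_hour": 1, "account_age_days": 30}))    # manual_review (800)
```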

Advantages

Rules engines can take black lists (e.g. IP addresses) and other negative lists derived from consortium databases as input data. An analyst can add a new rule as soon as he or she encounters a new fraud/risk scenario, helping the company benefit from the real-world insights of the analyst on the ground seeing the fraud every day. As a result, rules engines give businesses the control and capability to handle one-off brute force attacks, seasonality and short-term emerging trends.

Limitations

Rules engines have limitations when it comes to scale. Fraudsters don’t sit idle after you catch them. They will change what they do after learning how you caught them to prevent being caught again. Thus, the shelf life of rules can be a couple of weeks or even as short as a few days before their effectiveness begins to diminish. Imagine having to add, remove, and update rules and weights every few days when you’re in a situation with hundreds or thousands of rules to run and test. This could require huge operational resources and costs to maintain.

If a fraud analyst wants to calculate the accept, reject, and review rates for 3 rules, and to see how those rates change when each rule is adjusted down or up by 100 points, that requires 8 configurations: 2^3 = 8 (values^rules). Testing 10 rules with 3 different values each would be over 59K configurations (3^10 = 59,049)! As the number of rules increases, the time required to make adjustments grows exponentially.
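The arithmetic above can be checked directly by enumerating the configurations:

```python
from itertools import product

# Each configuration assigns one of the candidate values to every rule,
# so there are values ** rules configurations in total.

def num_configurations(rules: int, values_per_rule: int) -> int:
    return len(list(product(range(values_per_rule), repeat=rules)))

print(num_configurations(3, 2))    # 8     (3 rules, each adjusted down or up)
print(num_configurations(10, 3))   # 59049 (10 rules x 3 candidate values)
```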


Rules engines don’t automatically learn from analyst observations or feedback. As fraudsters adapt their tactics, businesses can be temporarily exposed to new types of fraud attacks. And since rules engines treat information in a binary fashion and may not detect subtle nuances, this can lead to higher instances of false positives and negative customer experiences.

Learning Models

How they work

Supervised machine learning is the most widely used learning approach when it comes to fraud detection. A few of the learning techniques include decision trees, random forests, nearest neighbors, Support Vector Machines (SVM) and Naive Bayes. Machine learning models often solve complex computations with hundreds of variables (high-dimensional space) in order to accurately determine cases of fraud.

Having a good understanding of both what is and what is not fraud plays a central role in the process of creating models. The input data to the models influences their effectiveness. The models are trained on known cases of fraud and non-fraud (i.e. labeled training data), which facilitates their ability to classify new data and cases as either fraudulent or not. Because of their ability to predict the label for a new unlabeled data set, trained learning models fill in the gap and bolster the areas where rules engines may not provide great coverage.

Below is a simplified example of how a supervised machine learning program would classify new data into the categories of non-fraud or fraud. Training data informs the model of the characteristics of two types of fraudsters: 1) credit card fraudsters and 2) spammers. Three features: 1) the email address structure, 2) the IP address type, and 3) the density of linked accounts are indicative of the type of fraud attack (e.g. response variable). Note in reality, there could be hundreds of features for a model.

The trained model recognizes that a user with:

  • an email address that has 5 letters followed by 3 numbers
  • using an anonymous proxy
  • with a medium density (e.g. 10) of connected accounts

is a credit card fraudster.

It also recognizes that a user with:

  • an email address structure with a “dot” pattern
  • using an IP address from a datacenter
  • with a high density (e.g. 30+) of linked accounts

is a spammer.

Now suppose your model is evaluating new users from the batch of users below. It computes the email address structure, IP address type, and density of linked accounts for each user. If working properly, it will classify the users in Cases 2 and 3 as spammers and the users in Cases 1, 4 and 5 as credit card fraudsters.
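A hand-written stand-in for the trained model above might look like the following. In real supervised machine learning these decision boundaries are learned from labeled data rather than coded by hand, and the density bands below are invented for illustration:

```python
import re

# Toy classifier hard-coding the two fraudster profiles described above.
# The medium/high density cutoffs (5-19 and 30+) are assumptions made for
# this sketch; a trained model would learn such boundaries from data.

def classify(user):
    local_part = user["email"].split("@")[0]
    ip_type = user["ip_type"]
    density = user["linked_accounts"]

    if (re.fullmatch(r"[a-z]{5}\d{3}", local_part)   # 5 letters + 3 numbers
            and ip_type == "anonymous_proxy"
            and 5 <= density < 20):                  # "medium" density
        return "credit_card_fraudster"
    if ("." in local_part                            # "dot"-pattern address
            and ip_type == "datacenter"
            and density >= 30):                      # "high" density
        return "spammer"
    return "unknown"

print(classify({"email": "qwert123@mail.example",
                "ip_type": "anonymous_proxy", "linked_accounts": 10}))
# credit_card_fraudster
print(classify({"email": "jo.hn@mail.example",
                "ip_type": "datacenter", "linked_accounts": 42}))
# spammer
```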

[Table: example batch of users with email address, IP type, and linked-account density]

Advantages

Learning models have the ability to digest millions of rows of data scalably, pick up on past behaviors, and continually improve their predictions based on new and different data. They can handle unstructured data (e.g. images, email text) and recognize sophisticated fraud patterns automatically, even if there are thousands of features/variables in the input data set. With learning models, you can also measure effectiveness and improve it by changing only the algorithms or their parameters.

Limitations

Trained learning models, while powerful, have their limitations. What happens if there are no labeled examples for a given type of fraud? Given how quickly fraud is evolving, this is not that uncommon of an occurrence. After all, fraudsters change schemes and conduct new types of attacks around the clock. If we have not encountered the fraud attack pattern, and therefore do not have sufficient training data, the trained learning models may not have the appropriate support to return good and reliable results.

As seen in the diagram below, collecting and labeling data is a crucial part of building a learning model and the time required to generate accurate training labels can be weeks to months. Labeling can involve teams of fraud analysts reviewing cases thoroughly, categorizing it with the right fraud tags, and undergoing a verification process before being used as training data. In the event a new type of fraud emerges, a learning model may not be able to detect it until weeks later after sufficient data has been acquired to properly train it.
[Diagram: supervised learning workflow]

Unsupervised Analytics – Going Beyond Rules Engines and Learning Models

While both of these approaches are critical pieces of a fraud detection architecture, here at DataVisor we take it one step further. DataVisor employs unsupervised analytics, which does not rely on prior knowledge of fraud patterns. In other words, no training data is needed. The core component of the algorithm is the unsupervised attack campaign detection, which leverages correlation analysis and graph processing to discover the linkages between fraudulent user behaviors, create clusters, and assign new examples to one of the clusters.


The unsupervised campaign detection provides the attack campaign group info and also the self-generated training data, both of which can be fed into our machine learning models to bootstrap them. With this data, the supervised machine learning will pick up patterns and find the fraudulent users that don’t fit into these large attack campaign groups. This framework enables DataVisor to uncover fraud attacks perpetrated by individual accounts, as well as organized mass scale attacks coordinated among many users such as fraud and crime rings – adding a valuable piece to your fraud detection architecture with a “full-stack.”


Our correlation analysis groups fraudsters “acting” similarly into the same cluster. In contrast, anomaly detection, another useful technique, finds the set of fraud objects that are considerably dissimilar from the remainder of the good users. It does this by assuming anomalies do not belong to any group, or that they belong to small/sparse clusters. See the graph below for anomaly detection, illustrating fraudsters F1 and F3, fraud group F2, and good users G1 and G2. The benefits of unsupervised analytics are on display when comparing it to anomaly detection: while anomaly detection can find outlying fraudsters in a given data set, it struggles to identify large fraud groups.

[Graph: anomaly detection showing fraudsters F1 and F3, fraud group F2, and good user groups G1 and G2]
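A toy one-dimensional contrast between the two techniques, on invented "activity score" data: a z-score outlier test finds the lone fraudster but misses the ring, while grouping near-identical behavior catches the ring as a whole:

```python
from statistics import mean, pstdev
from collections import defaultdict

# Invented activity scores: F1 is a lone outlier, F2a-F2d are a coordinated
# ring hiding inside the normal range, G* are genuine users.
users = {"F1": 99.0, "G1a": 10.2, "G1b": 11.1, "G1c": 9.8,
         "F2a": 50.0, "F2b": 50.0, "F2c": 50.0, "F2d": 50.0}

# Anomaly detection: flag points more than 2 standard deviations from the mean.
mu, sigma = mean(users.values()), pstdev(users.values())
anomalies = [u for u, v in users.items() if abs(v - mu) / sigma > 2]

# Cluster detection: flag groups of 3+ users with identical behavior.
groups = defaultdict(list)
for u, v in users.items():
    groups[v].append(u)
rings = [ids for ids in groups.values() if len(ids) >= 3]

print(anomalies)  # ['F1'] -- the lone outlier; the F2 ring goes unnoticed
print(rings)      # [['F2a', 'F2b', 'F2c', 'F2d']] -- the ring, caught as a group
```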

DataVisor’s unsupervised analytics works alongside rules engines and machine learning models. For customers, the analytics provides a list of the fraudsters and also gives their fraud analysts insights to create new rules. When DataVisor finds fraud that a customer has not encountered previously, the data from the unsupervised campaign detection can serve as early warning signals and/or training data for their learning models, adding new and valuable dimensions to the models’ accuracy.

By focusing on early detection and discovering unknown fraud, DataVisor has helped customers to become better and more efficient in solving fraud in diverse range of areas such as:

  • Identifying fake user registration and account takeovers (ATO)
  • Detecting fraudulent financial transactions and activity
  • Discovering user acquisition and promotion abuse
  • Preventing social spam, fake posts, reviews and likes

Stay tuned for future blog posts where I will address topics such as new online fraud attacks, case review management tools, and a closer look into DataVisor’s fraud detection technology stack. If you want to learn more about how DataVisor can help you fight online fraud, please visit https://datavisor.com/ or schedule a trial.

How should a large or mid-sized UGC platform (such as Sina Weibo) approach anti-spam work?

From Zhihu


Machine learning is whipping up a whirlwind in security, but it has bugs

Today, security is a field that machine learning (ML) is moving into in force.

| Bosses are eager to apply machine learning to security

If you attended the 2016 RSA Conference in person, you would have found that almost no company could describe its security products without mentioning machine learning. Why is that?

To outsiders, machine learning can look like a kind of magic that solves every security problem: feed a pile of unlabeled data into a system that can learn, and it will pick out patterns that even human experts cannot see, learn new behaviors, and adapt to emerging threats. Better yet, you never have to encode the rules yourself, because the system handles all of that automatically.

If that were really true, machine learning would deserve to be the headline act of the year. The irony is that while everyone is making a great show of breaking into this field, the people who truly understand what machine learning is, or what it can actually be used for, are few and far between. Predictably, in such an environment machine learning is widely misused, especially in security.

| What is the right way to solve security problems with machine learning?

Applying machine learning to security usually involves a technique called anomaly detection, which identifies items that do not match an expected pattern or data set. Vendors should note that the technique works only under certain conditions, though many clearly have not realized their mistake: they will tell you that after analyzing your company's network traffic, machine learning can pick out the hackers hiding in your network. In reality, machine learning alone cannot do that, and a vendor making that claim deserves a healthy dose of skepticism.

So when does it work? Machine learning is effective only when a low-dimensional problem is paired with high-quality labeled data, and unfortunately that is rarely what enterprises have in practice. Detecting new kinds of attacks requires clear, labeled examples of attacks. In other words, without a thorough understanding of normal network behavior, machine learning cannot find the hackers, and skilled hackers will camouflage themselves seamlessly.

| Where do machine learning and anomaly detection deliver the most value?

Where machine learning and anomaly detection are genuinely useful is in classifying human behavior.

It turns out that people are highly predictable, and it is possible to build very accurate models of individual user behavior and let those models detect anomalous situations.

There have already been modest successes here, such as implicit authentication, which verifies a user's identity with biometric signals based on keystroke pressure, rhythm, and typing patterns. The advantages are clear for both user experience and security: at a minimum, it frees users from the burden of memorizing passwords and the hassle of typing them. Because the signals involved in implicit authentication are mostly low-dimensional, the model only has to handle a few parameters, which also makes it practical to collect high-quality labeled data from users. Just as machine learning can match images correctly in computer vision despite variation and noise, it can likewise authenticate a person by recognizing their distinctive behavior.

But how does it do that?

Everything about how you walk, stand, and move is determined by many factors together, such as physiology, age, sex, and muscle memory, and for a given person these movements change very little. Without you noticing, the phone in your pocket captures this information precisely through its built-in sensors and records it. Four seconds of motion data is enough to identify a person by how they move. Identity can also be verified by comparing a user's current location against their location history: people live by all sorts of habits, and observing when and from where they set out makes it possible to predict whether the person being measured is really the user.

Our phones and computers already carry a large number of sensors, and that number will surge as wearables spread and the Internet of Things develops. Vast amounts of user behavior and environmental data can thus be collected and fed to machine learning systems, which build a model of each individual user and find the relationships among the various factors.
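As a rough sketch of the keystroke-rhythm idea, a new sample of inter-key intervals can be compared against a user's stored profile. The profile values, tolerance, and scoring below are all invented; real implicit-authentication systems use far richer features:

```python
from statistics import mean

# Toy implicit authentication from keystroke timing. The profile and the
# tolerance are illustrative assumptions, not values from a real system.

PROFILE = [0.21, 0.19, 0.22, 0.20, 0.18]  # user's typical inter-key intervals (s)

def matches_profile(sample, profile=PROFILE, tolerance=0.05):
    # Mean absolute deviation of the sample from the profile's baseline.
    baseline = mean(profile)
    deviation = mean(abs(x - baseline) for x in sample)
    return deviation <= tolerance

print(matches_profile([0.20, 0.22, 0.19, 0.21]))  # True  -- consistent rhythm
print(matches_profile([0.55, 0.60, 0.48, 0.52]))  # False -- a different typist
```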

| What homework do you need to do before machine learning can protect you?

To provide security protection, your system must know in advance what threat models exist.

The first and most important task is collecting data. The data must be highly accurate in order to train a system that can resist threats. If the authentication system does come under attack, though, there is no need to worry too much: behavioral changes are relatively easy to detect, and the system will recognize the anomaly quickly. For example, if a device is stolen, the motion, location, and usage recorded after the theft will differ clearly from the earlier records. The system accepts that such anomalies may occur; the user then confirms their identity through another channel, and the system is tuned to minimize false positives. Once four such factors are linked across devices, the false-positive rate of implicit authentication can fall below 0.001%.

No machine learning in this world is magical enough to solve every security problem. Designers who want to build a useful security product with machine learning need a deep understanding of the underlying system, and must accept that many problems are simply not a good fit for machine learning. But there is no need to worry: the technology companies riding the crest of this wave will eliminate those problems step by step.

Machine learning is brewing an unstoppable market wave in the security field.

The future of cybersecurity cannot do without machine learning

Information security has always been a game of cat and mouse. The good guys build a wall; the bad guys find ways through or around it. Lately, though, the bad guys seem to be getting past the wall with growing ease. Stopping them will require a huge leap in our capabilities, and that likely means using machine learning techniques far more widely.

This may surprise onlookers outside the industry, but machine learning has not yet had a broad impact on the IT security field. Security experts note that although credit card fraud detection systems and network equipment manufacturers are using advanced analytics, the routine automated security operations common at every large company, such as detecting malware on PCs or identifying malicious activity on the network, still depend largely on humans writing the code and configuring those operations in a timely manner.

Although the application of machine learning to network security has been widely studied in academia, we are only now beginning to understand the technology's impact on security tools. Startups such as Invincea, Cylance, Exabeam, and Argyle Data are using machine learning to power security tools that are faster and more accurate than those offered by today's major security software vendors.

Destroying malware with data

Invincea is a Virginia-based company specializing in malware detection and network security. Josh Saxe, the company's chief research engineer, believes it is time to abandon the signature- and file-hash-based analysis techniques of the 1990s.

Saxe says: "I understand that some antivirus companies have dipped into machine learning, but what they still live on is signature detection. They detect malware based on file hashes or pattern matching, analysis techniques that human researchers devised to detect a given sample."

Part of Invincea's advanced malware detection system is based on DARPA's Cyber Genome program.

He says: "They were very successful at detecting the malware that was common in the past, but they are not good at detecting new malware, which is one of the reasons cybercrime is so rampant today. Even with an antivirus system installed, attackers can still break into your computer, because signature-based detection simply does not work."

At Invincea, Saxe is leading a team that uses machine learning to build a better malware detection system. The project, part of DARPA's Cyber Genome program, mainly uses machine learning to take detected malware apart: reverse-engineering how the malware operates, performing social-network analysis on its code, and using the machine learning system to rapidly dissect new malware samples as they appear in the wild.

"We have demonstrated that the machine-learning-based approach we developed is more effective than traditional antivirus systems. A machine learning system can automate the work of human analysts, and even do it better. Combine a machine learning system with large amounts of training data and it can beat traditional signature-based detection."

Invincea uses deep learning methods to speed up the training of its algorithms. Saxe currently has about 1.5 million benign and malicious software samples for training the algorithms, all run on GPUs using Python tools. He hopes that as the sample data grows to 30 million, the performance advantage of the machine learning system will grow linearly.

"The more training data we have, and the more malware we can train the machine learning system on, the more pronounced its performance advantage in detecting malware becomes," he says.

Saxe says Invincea's current plan is to load more deep-learning-based features into its 2016 endpoint security products. Specifically, that means adding the capability to Cynomix, an endpoint security product that already uses machine learning.

Detecting malicious users

Machine learning can also help with other aspects of IT security: detecting malicious insiders and identifying compromised accounts.

Just as the major antivirus products rely on signatures to identify malware, the tools that monitor user activity also lean on signatures. And just as signature-based detection is beginning to fail for malware detection, its results in monitoring user activity are equally unsatisfying.

"In the past, enterprise security staff relied heavily on signature methods, such as IP address blacklists," says Derek Lin, chief data scientist at Exabeam, a provider of user behavior analytics tools.

He says: "That approach looks for things that have already happened. The problem with signature-based methods is that you can only see the signature after the event has occurred. Today, security staff are intensely focused on detecting malicious events that have no signature."

Exabeam builds a map of user activity by tracking users' remote connection information, devices, IP addresses, and credentials.

Savvy criminals now know that slightly changing their path is enough to defeat signature detection. So if a compromised detection system holds an IP blacklist, cybercriminals can defeat it by hopping continually back and forth across the large swaths of network space under their control.

Rather than hold fast to yesterday's defensive strategies, Exabeam has adopted a proactive approach based on Gartner's concept of UBA (User Behavior Analytics). The idea behind UBA is that you cannot know in advance whether a machine or user is good or bad, so you assume they are malicious and that your network is vulnerable; you therefore monitor and model everyone's behavior at all times in order to find the malicious actors.

This is where machine learning algorithms come in. Lin and his team pull in a wide variety of sources, such as server logs and VPN logs, and use an assortment of supervised and unsupervised machine learning algorithms to detect anomalous patterns in user behavior.

Lin says: "All of this paints a portrait of user behavior; the question is how it is done. For every user or entity on the network, we try to build a profile of what is normal, which is where statistical analysis comes in. Then we look for deviations from normal at the conceptual level... We use behavior-based methods to find the anomalies in the system and surface them, making them easy for security analysts to review."
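The per-user baseline idea Lin describes can be sketched with simple statistics. The events, figures, and threshold below are invented for illustration:

```python
from statistics import mean, pstdev

# Toy UBA sketch: build a per-user baseline from history, then surface
# behavior that deviates strongly from it. Data and threshold are invented.

history = {  # logins per day over the past week, per user
    "alice": [3, 4, 3, 5, 4, 3, 4],
    "bob":   [1, 1, 2, 1, 1, 2, 1],
}

def is_anomalous(user, todays_logins, z_threshold=3.0):
    baseline = history[user]
    mu, sigma = mean(baseline), pstdev(baseline)
    # Guard against a zero standard deviation for perfectly regular users.
    return abs(todays_logins - mu) > z_threshold * max(sigma, 1e-9)

print(is_anomalous("alice", 4))   # False -- within Alice's normal range
print(is_anomalous("bob", 40))    # True  -- Bob's account is behaving oddly
```

The key design point is that "normal" is defined per user, not globally: 40 logins might be routine for a service account but is wildly abnormal for Bob.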

The future of machine learning in security

"Think about the major waves of cybersecurity we have been through. Cybercriminals keep looking for effective ways to break security systems, and we have to strike back. Will machine learning become a mainstay of that counterattack? The answer is yes," says Patrick Townsend, founder and CEO of security software vendor Townsend Security.

He says: "We are now starting to get systems that can effectively process huge volumes of unstructured data and detect patterns. I hope the products in the next wave of cybersecurity will be based on cognitive computing. Look at Watson: if it can win at Jeopardy, why couldn't it be used broadly to analyze and understand cybersecurity events? I think we are at the very beginning of using cognitive computing to help handle security problems."

Invincea's Saxe hopes to ride that wave. He says: "I am not surprised that companies in this field have not yet seized this wave and produced products based on the new deep learning algorithms. Training machine learning this way only recently became feasible; it could not have been done effectively ten years ago."

Machine learning and big data know it wasn’t you who just swiped your credit card

You’re sitting at home minding your own business when you get a call from your credit card’s fraud detection unit asking if you’ve just made a purchase at a department store in your city. It wasn’t you who bought expensive electronics using your credit card – in fact, it’s been in your pocket all afternoon. So how did the bank know to flag this single purchase as most likely fraudulent?

Credit card companies have a vested interest in identifying financial transactions that are illegitimate and criminal in nature. The stakes are high. According to the Federal Reserve Payments Study, Americans used credit cards to pay for 26.2 billion purchases in 2012. The estimated loss due to unauthorized transactions that year was US$6.1 billion. The federal Fair Credit Billing Act limits the maximum liability of a credit card owner to $50 for unauthorized transactions, leaving credit card companies on the hook for the balance. Obviously fraudulent payments can have a big effect on the companies’ bottom lines. The industry requires any vendors that process credit cards to go through security audits every year. But that doesn’t stop all fraud.

In the banking industry, measuring risk is critical. The overall goal is to figure out what’s fraudulent and what’s not as quickly as possible, before too much financial damage has been done. So how does it all work? And who’s winning in the arms race between the thieves and the financial institutions?

Gathering the troops

From the consumer perspective, fraud detection can seem magical. The process appears instantaneous, with no human beings in sight. This apparently seamless and instant action involves a number of sophisticated technologies in areas ranging from finance and economics to law to information sciences.

Of course, there are some relatively straightforward and simple detection mechanisms that don’t require advanced reasoning. For example, one good indicator of fraud can be an inability to provide the correct zip code affiliated with a credit card when it’s used at an unusual location. But fraudsters are adept at bypassing this kind of routine check – after all, finding out a victim’s zip code could be as simple as doing a Google search.

Traditionally, detecting fraud relied on data analysis techniques that required significant human involvement. An algorithm would flag suspicious cases to be closely reviewed ultimately by human investigators who may even have called the affected cardholders to ask if they’d actually made the charges. Nowadays the companies are dealing with a constant deluge of so many transactions that they need to rely on big data analytics for help. Emerging technologies such as machine learning and cloud computing are stepping up the detection game.

Learning what’s legit, what’s shady

Simply put, machine learning refers to self-improving algorithms (an algorithm being a predefined process, conforming to specific rules, that a computer performs). A computer starts with a model and then trains it through trial and error. It can then make predictions, such as the risk associated with a financial transaction.

A machine learning algorithm for fraud detection needs to be trained first by being fed the normal transaction data of lots and lots of cardholders. Transaction sequences are an example of this kind of training data. A person may typically pump gas one time a week, go grocery shopping every two weeks and so on. The algorithm learns that this is a normal transaction sequence.

After this fine-tuning process, credit card transactions are run through the algorithm, ideally in real time. It then produces a probability number indicating the possibility of a transaction being fraudulent (for instance, 97%). If the fraud detection system is configured to block any transactions whose score is above, say, 95%, this assessment could immediately trigger a card rejection at the point of sale.
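The thresholding step might look like this in code. The 95% cutoff mirrors the example above, and the function name is a hypothetical stand-in:

```python
# Toy decision step: compare the model's fraud probability against a
# configured cutoff to decide the point-of-sale response. The 0.95 cutoff
# follows the example in the text; everything else is illustrative.

BLOCK_THRESHOLD = 0.95

def point_of_sale_action(fraud_probability: float) -> str:
    return "reject" if fraud_probability >= BLOCK_THRESHOLD else "approve"

print(point_of_sale_action(0.97))  # reject  -- the 97% case described above
print(point_of_sale_action(0.12))  # approve
```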

The algorithm considers many factors to qualify a transaction as fraudulent: trustworthiness of the vendor, a cardholder’s purchasing behavior including time and location, IP addresses, etc. The more data points there are, the more accurate the decision becomes.

This process makes just-in-time or real-time fraud detection possible. No person can evaluate thousands of data points simultaneously and make a decision in a split second.

Here’s a typical scenario. When you go to a cashier to check out at the grocery store, you swipe your card. Transaction details such as time stamp, amount, merchant identifier and membership tenure go to the card issuer. These data are fed to the algorithm that’s learned your purchasing patterns. Does this particular transaction fit your behavioral profile, consisting of many historic purchasing scenarios and data points?

 

The algorithm knows right away if your card is being used at the restaurant you go to every Saturday morning – or at a gas station two time zones away at an odd time such as 3:00 a.m. It also checks if your transaction sequence is out of the ordinary. If the card is suddenly used for cash-advance services twice on the same day when the historic data show no such use, this behavior is going to up the fraud probability score. If the transaction’s fraud score is above a certain threshold, often after a quick human review, the algorithm will communicate with the point-of-sale system and ask it to reject the transaction. Online purchases go through the same process.
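The profile check described above can be sketched as a toy scoring function. The signals, weights, and data here are invented for illustration; a production model would learn them from historical transactions:

```python
# Sketch: score a transaction against a cardholder's historical profile.
# Signals and weights are illustrative assumptions, not an issuer's model.

from math import dist

def anomaly_score(txn, history):
    """Combine simple signals: unusual hour, unusual location,
    and a transaction type never seen before."""
    score = 0.0
    usual_hours = {t["hour"] for t in history}
    if txn["hour"] not in usual_hours:          # e.g. 3:00 a.m. purchase
        score += 0.4
    home = history[0]["location"]
    if dist(txn["location"], home) > 5.0:       # roughly "two time zones away"
        score += 0.4
    seen_types = {t["type"] for t in history}
    if txn["type"] not in seen_types:           # e.g. first-ever cash advance
        score += 0.2
    return score

history = [
    {"hour": 9,  "location": (40.7, -74.0), "type": "grocery"},
    {"hour": 10, "location": (40.7, -74.0), "type": "restaurant"},
]
odd_txn = {"hour": 3, "location": (34.0, -118.2), "type": "cash_advance"}
print(anomaly_score(odd_txn, history))  # all three signals fire -> 1.0
```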

In this type of system, heavy human intervention is becoming a thing of the past. In fact, humans can actually get in the way: reaction times are much longer when a person is too deeply involved in the fraud-detection cycle. Still, people have a role to play, whether in validating suspected fraud or in following up on a rejected transaction. When a card is denied for multiple transactions, a person can call the cardholder before canceling the card permanently.

Computer detectives, in the cloud

The sheer number of financial transactions to process is overwhelming, firmly in the realm of big data. But machine learning thrives on mountains of data: more information actually increases the accuracy of the algorithm and helps eliminate false positives, alerts triggered by transactions that look suspicious but are really legitimate (for instance, a card used at an unexpected location). Too many alerts are as bad as none at all.

It takes a lot of computing power to churn through this volume of data. For instance, PayPal processes more than 1.1 petabytes of data for 169 million customer accounts at any given moment. This abundance of data – one petabyte, for instance, is more than 200,000 DVDs’ worth – has a positive influence on the algorithms’ machine learning, but can also be a burden on an organization’s computing infrastructure.

Enter cloud computing. Off-site computing resources can play an important role here. Cloud computing is scalable and not limited by the company’s own computing power.

Fraud detection is an arms race between good guys and bad guys. At the moment, the good guys seem to be gaining ground, thanks to emerging innovations such as chip-and-PIN cards combined with encryption, machine learning, big data and, of course, cloud computing.

Fraudsters will surely continue trying to outwit the good guys and challenge the limits of the fraud detection system. Drastic changes in the payment paradigms themselves are another hurdle. Your phone is now capable of storing credit card information and can be used to make payments wirelessly – introducing new vulnerabilities. Luckily, the current generation of fraud detection technology is largely neutral to the payment system technologies.

When there are more feed updates than you can read: how Facebook optimizes its News Feed

[Editor's note] This article is an Ask Me Anything between 覃超 of the FREES internet team and 徐万鸿, a former senior engineer on Facebook's News Feed ranking team who returned to China this September to become CTO of 神州专车. They discuss Facebook's growth-hacking strategy, its anti-spam system, News Feed ranking, and why he chose to return to China to join a startup. 雷锋网 edited the text without changing its original meaning.

 

News feed ranking is one of Facebook's signature capabilities: a user may receive two or three thousand new stories a day but will only read the first 50 to 100. Machine learning is used to put the content users most want to see at the top, which improves stickiness and daily active users.

This is certainly a technical article, and Facebook is one of the largest internet companies in the world, but that doesn't stop entrepreneurs from drawing lessons from it. Using A/B testing as an iteration method, letting data (the core of growth hacking) drive development, onboarding talks for new employees: these practices reflect different dimensions of the social giant's culture. At the level of spirit, the focus is on realizing a shared dream and unifying goals; at the micro level, that goal translates into respect for data.

What does Facebook do with the Sigma system?

When I first joined Facebook, the VP in charge of user growth gave the onboarding talk. He said that one day everyone in the world would use Facebook and that the company would be worth a trillion dollars. That left a deep impression on me. Everyone at the company was excited and had tremendous confidence in the goals they had set. They had a very strong sense of mission and were extremely focused.

That is one of the things about Facebook that impressed me most.

I worked on Facebook's site-integrity team for two years. At the time, Facebook had a lot of spam messages and junk content, much like the ads and spam links on Renren or Weibo. Some users' accounts were compromised and used to send spam, ads, and viruses from their profile pages, along with unwelcome friend requests. I handled everything of that kind that affected the user experience.

Facebook uses a system called Sigma to fight this spam. It runs on more than 2,000 machines, and everything a Facebook user does, such as posting comments and links or sending friend requests, passes through Sigma, which judges whether the behavior is normal, abusive, or otherwise problematic.

Using the Sigma system, Facebook filters out and cleans up spam.

Take friend requests as an example. The system judges automatically: if a person's friend requests keep getting rejected, the next one won't go through. If nine out of ten of someone's friend requests have been rejected, the system will reject their next request.

Of course, the system uses other signals as well.

It is a machine learning system: based on how often your previous friend requests were rejected, it estimates how likely your next one is to be rejected.

If that rate is high, Facebook will ask you to verify via SMS or another method to confirm whether the requests come from software or a real person, and whether you really mean to send the request. For example, if you send a friend request to someone with whom you share no mutual friends, it may well be an unreasonable request.
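The rejection-rate signal described above can be sketched as a simple gate. The cutoffs and the mutual-friend check below are illustrative assumptions, not Facebook's actual thresholds:

```python
# Sketch: decide what to do with an account's next friend request based on
# its past rejection rate. All thresholds are illustrative assumptions.

def next_request_action(past_requests, mutual_friends):
    """past_requests: list of True (accepted) / False (rejected)."""
    if not past_requests:
        return "allow"
    rejection_rate = past_requests.count(False) / len(past_requests)
    if rejection_rate >= 0.9:
        return "block"                 # e.g. 9 of the last 10 were rejected
    if rejection_rate >= 0.5 and mutual_friends == 0:
        return "require_verification"  # ask for SMS/phone verification
    return "allow"

print(next_request_action([False] * 9 + [True], mutual_friends=0))  # block
print(next_request_action([False, True], mutual_friends=0))         # require_verification
print(next_request_action([True, True, False], mutual_friends=3))   # allow
```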

Basically, everything you do on Facebook goes through this system, which analyzes it, makes a prediction, and decides whether to let your message through, with the aim of reducing harassment in the ecosystem. At the time, tens of billions of events per day passed through the system to be judged.

Machine learning is the core of the Sigma system

Sigma combines human-written rules with machine-learned models. Approvals and rejections provide the labels: an action that goes through is a positive training sample for the machine learner, and a rejected one is a negative sample, much like 1 and 0.

For example, if a friend request is accepted, the y value is 1; if it is rejected, it is 0. For comments and likes, the system looks for the y value in the same way, and inappropriate content a user sends gets deleted.
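These 0/1 labels are exactly what a supervised learner needs. A minimal sketch, using a tiny one-feature logistic regression trained by gradient descent; the data and the feature (mutual-friend count) are made up for illustration and are not Facebook's actual model:

```python
# Sketch: friend-request outcomes as 0/1 labels for a supervised learner.
# A one-feature logistic regression; data and feature are illustrative.

import math

# (mutual_friends, y): y = 1 if the request was accepted, 0 if rejected
samples = [(0, 0), (0, 0), (1, 0), (2, 1), (5, 1), (8, 1)]

w, b = 0.0, 0.0
for _ in range(2000):
    for x, y in samples:
        p = 1 / (1 + math.exp(-(w * x + b)))
        w += 0.1 * (y - p) * x   # gradient step on the log-loss
        b += 0.1 * (y - p)

def p_accept(mutual_friends):
    """Predicted probability that a request with this many mutual friends
    will be accepted."""
    return 1 / (1 + math.exp(-(w * mutual_friends + b)))

print(p_accept(0))  # low: requests from strangers are mostly rejected
print(p_accept(6))  # high: many mutual friends
```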

Machine learning is the core of the entire Sigma system.

Another approach is to analyze anomalous user behavior through anomaly analysis and data mining.

For example, if someone posts a large number of comments of the same type, all containing a similar link, that is highly suspicious. A normal user doesn't leave identical comments on different people's pages; this is clearly anomalous behavior, and we don't allow it.
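This kind of pattern is easy to illustrate. A minimal sketch that flags accounts repeating the same link-bearing comment across profiles; the threshold and data format are illustrative assumptions:

```python
# Sketch: flag accounts that post the same link-bearing comment on many
# different profiles. The threshold is an illustrative assumption.

import re
from collections import Counter

def suspicious_accounts(comments, min_repeats=3):
    """comments: list of (account, target_profile, text) tuples."""
    per_account = Counter()
    for account, target, text in comments:
        urls = tuple(sorted(re.findall(r"https?://\S+", text)))
        if urls:  # count identical (account, url-set) pairs across profiles
            per_account[(account, urls)] += 1
    return {acct for (acct, _), n in per_account.items() if n >= min_repeats}

comments = [
    ("spammer", "alice", "check this out http://spam.example"),
    ("spammer", "bob",   "check this out http://spam.example"),
    ("spammer", "carol", "check this out http://spam.example"),
    ("friend",  "alice", "great photo!"),
]
print(suspicious_accounts(comments))  # {'spammer'}
```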

News Feed is Facebook's most important product

After two years, I chose to move to this team.

"Ranking" refers to the order of the feed. It determines what your feed looks like when you open Facebook and where each story appears. Each person's network generates two or three thousand stories, but a user will only see 50 to 100, so you have to surface the best of those thousands. Some things we don't show at all; for example, stories about a game you like that your friend doesn't.

When I joined in 2012, the News Feed ranking team had only five or six people, even though this was probably the company's largest machine learning system and its most central product. More than a billion people came online every day, each spending about 40 minutes on Facebook, half of that in News Feed. Most of Facebook's revenue comes from News Feed ads: mobile ads account for 70 percent of all ad revenue, and all mobile ads run in News Feed. Whether measured by user time spent or by revenue, News Feed is the most important product.

News Feed is Facebook's most important product; it directly determines what users see.

Ranking News Feed well is a hard problem, because users can act on the feed in many ways, not just the click/no-click of traditional advertising. Users can like, comment on, share, or hide a story, or play a video. I need to understand what users like, what they comment on and share, and what kinds of videos they want to watch: to understand their interests and, based on that information, put the best content at the top of the feed.

Compare this with domestic social media: WeChat Moments shows all content without ranking, because the volume is small enough that people can read everything. As you gain more and more friends, you no longer have time to read every post, and ranking becomes inevitable. Otherwise you can easily miss photos from the people who matter most, buried under content you aren't interested in.

Facebook also used to show everything, until users gradually could no longer keep up. Without ranking, without picking out the best content, users won't want to visit News Feed: they see too much they aren't interested in, and no longer have time to dig out the parts they are. Going from unranked to ranked is an inevitable process; as your friends and the pages you follow keep growing, ranking is a necessity.

Sina Weibo, for instance, doesn't rank its feed, and parts of it are chaotic; they tested ranking but didn't do it well, so they gave up. WeChat Moments will also reach the stage where it needs ranking. Facebook doesn't just rank; it also hides content users aren't interested in. For example, if your friends play Candy Crush but you don't play any games, stories about "what games your friends are playing" are meaningless to you, and Facebook won't show them.

The fragmentation of social media is a fact. Only with better ranking, pushing more precisely targeted content to users, can a platform increase time spent and strengthen stickiness.

How does News Feed ranking work?

Basically, News Feed picks 40 to 50 stories out of two or three thousand. Each story is scored, and the highest-scoring content goes to the top. For every story, whether a photo, a share, or a status update, we predict a set of probabilities: the probability that you'll like it, comment on it, or share it. Each user action (like, share, comment) is assigned a weight, and the action probabilities are computed by the machine learning system. If a user likes, comments on, or shares a story, it means the user wanted to see it and responded to it.

For example, suppose you're my friend and have uploaded 100 photos, and I've liked 20 of them; then the like probability is 20%. We know which content each user has liked or commented on in the past, and that is our training data. We learn from a user's historical behavior to predict their future behavior on similar content from the same people, because short-term behavior doesn't change drastically: whatever you commented on in the past, you're likely to comment on similar content in the future.
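Putting the two preceding paragraphs together, the scoring step can be sketched as a weighted sum of predicted action probabilities. The weights and probabilities below are invented for illustration, not Facebook's actual values:

```python
# Sketch of the scoring step: predicted action probabilities combined with
# per-action weights, then sorted. All numbers are illustrative assumptions.

ACTION_WEIGHTS = {"like": 1.0, "comment": 4.0, "share": 8.0}

def story_score(predictions):
    """Weighted sum of the predicted probabilities for each action."""
    return sum(ACTION_WEIGHTS[a] * p for a, p in predictions.items())

stories = [
    ("game invite",  {"like": 0.01, "comment": 0.001, "share": 0.001}),
    ("friend photo", {"like": 0.20, "comment": 0.05,  "share": 0.01}),
    ("viral video",  {"like": 0.10, "comment": 0.02,  "share": 0.06}),
]
ranked = sorted(stories, key=lambda s: story_score(s[1]), reverse=True)
print([name for name, _ in ranked])  # ['viral video', 'friend photo', 'game invite']
```

Giving comments and shares higher weights than likes reflects the idea that rarer, higher-effort actions are stronger signals of interest.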

Predicting from user content

Many people ask whether predictions can be based on the content itself, analyzing the text or images a user posts. They can. For images, we can extract features, run pattern recognition, analyze the subject of the image, and tag it accordingly, using machines to recognize the images. This work is ongoing; Facebook has an AI lab that can recognize image content.

So how does Facebook check that this algorithm is effective, and how does it iterate on it?

Through A/B testing. We give 1% of users the new algorithm and 1% the old one. If under the new algorithm users like, comment, or share more each day, the new algorithm is better, and we release it to all users. Our core goals are more daily active users, longer time spent, and more frequent visits to Facebook.

A/B testing is an excellent iteration method: establish core metrics, run the test, and see whether the change improves them; ship it if it does, and don't if it doesn't. This is much like growth hacking, and the ultimate goal is still DAU. If users like your News Feed, they will visit more often; in the end it comes down to time online and daily active users.
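The ship/don't-ship decision above usually comes down to a significance test on the two buckets. A minimal sketch using a two-proportion z-test; the counts are made up, and a real system would evaluate many metrics, not one:

```python
# Sketch: compare engagement between a 1% treatment bucket (new ranking)
# and a 1% control bucket. Counts are illustrative assumptions.

from math import sqrt

def two_proportion_z(engaged_a, total_a, engaged_b, total_b):
    """z-statistic for the difference between two engagement rates."""
    p_a, p_b = engaged_a / total_a, engaged_b / total_b
    p = (engaged_a + engaged_b) / (total_a + total_b)  # pooled rate
    se = sqrt(p * (1 - p) * (1 / total_a + 1 / total_b))
    return (p_b - p_a) / se

# control: 10,000 users, 5,200 engaged; treatment: 10,000 users, 5,450 engaged
z = two_proportion_z(5200, 10000, 5450, 10000)
print(z, "ship" if z > 1.96 else "keep testing")  # z well above 1.96: ship
```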

A/B testing is Facebook's tool for verifying whether an iteration is viable. 吆喝科技, a company backed by 峰瑞资本, wants to make this technique available to startups as well.

"I can no longer read everything in my Moments feed"

I can no longer read all the content in my WeChat Moments. One improvement is ranking: put the best content first, learning what you care about from what you've liked before; for example, you like everything your girlfriend posts. Another improvement is "story bumping." Sometimes I browse WeChat in the morning and only get through a small part; when I check again later, there's nothing new.

Facebook's story bumping feature takes content you haven't seen yet and pushes it back to the top of your feed.

WeChat knows which content you haven't seen. I have many friends in the U.S., so my Moments feed fills up, and before work I only get through part of it. When I refresh, nothing new appears, and I never get back to what I didn't finish, such as the photos my friends posted. Facebook's story bumping puts important, still-unseen, slightly older content at the top of the feed so you get another look and don't miss anything important.
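Story bumping can be sketched as a re-ranking pass in which unseen stories keep competing for the top on the next refresh. The scores and the boost value below are illustrative assumptions:

```python
# Sketch of "story bumping": unseen stories get a boost so older but
# important content resurfaces. Scores and the boost are illustrative.

def rank_feed(stories, seen_ids, unseen_boost=0.3):
    """stories: list of (story_id, base_score). Returns ids, best first."""
    def key(story):
        sid, score = story
        return score + (unseen_boost if sid not in seen_ids else 0.0)
    return [sid for sid, _ in sorted(stories, key=key, reverse=True)]

stories = [("old_important", 0.6), ("new_minor", 0.5), ("seen_top", 0.7)]
# the user already saw "seen_top" this morning:
print(rank_feed(stories, seen_ids={"seen_top"}))
# the older important story is bumped above the already-seen one
```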

In September I joined 神州专车 as CTO. Career-wise, I hope to bring the company culture and technology I learned at Facebook back to China. China has enormous potential in the computer industry. The quality of domestic products now rivals American ones, WeChat for example; Facebook's product managers have studied WeChat's features. Looking a few years ahead, China has a chance to catch up with the U.S.

Computer science as a discipline has matured, and creativity is steadily improving. Many startups are trying out different ideas; China has many times more entrepreneurs than the U.S., all experimenting, and successful companies will emerge from that. Technically, China is closing in on the U.S. and may even surpass it. In the long run, China's computer and internet industry has the potential to become the best in the world.

Fighting spam with Haskell

One of our weapons in the fight against spam, malware, and other abuse on Facebook is a system called Sigma. Its job is to proactively identify malicious actions on Facebook, such as spam, phishing attacks, posting links to malware, etc. Bad content detected by Sigma is removed automatically so that it doesn’t show up in your News Feed.

We recently completed a two-year-long major redesign of Sigma, which involved replacing the in-house FXL language previously used to program Sigma with Haskell. The Haskell-powered Sigma now runs in production, serving more than one million requests per second.

Haskell isn’t a common choice for large production systems like Sigma, and in this post, we’ll explain some of the thinking that led to that decision. We also wanted to share the experiences and lessons we learned along the way. We made several improvements to GHC (the Haskell compiler) and fed them back upstream, and we were able to achieve better performance from Haskell compared with the previous implementation.

How does Sigma work?

Sigma is a rule engine, which means it runs a set of rules, called policies. Every interaction on Facebook — from posting a status update to clicking “like” — results in Sigma evaluating a set of policies specific to that type of interaction. These policies make it possible for us to identify and block malicious interactions before they affect people on Facebook.

Policies are continuously deployed. At all times, the source code in the repository is the code running in Sigma, allowing us to move quickly to deploy policies in response to new abuses. This also means that safety in the language we write policies in is important. We don’t allow code to be checked into the repository unless it is type-correct.

Louis Brandy of Facebook’s Site Integrity team discusses scalable spam fighting and the anti-abuse structure at Facebook and Instagram in a 2014 @Scale talk.

Why Haskell?

The original language we designed for writing policies, FXL, was not ideal for expressing the growing scale and complexity of Facebook policies. It lacked certain abstraction facilities, such as user-defined data types and modules, and its implementation, based on an interpreter, was slower than we wanted. We wanted the performance and expressivity of a fully fledged programming language. Thus, we decided to migrate to an existing language rather than try to improve FXL.

The following features were at the top of our list when we were choosing a replacement:

1. Purely functional and strongly typed. This ensures that policies can’t inadvertently interact with each other, they can’t crash Sigma, and they are easy to test in isolation. Strong types help eliminate many bugs before putting policies into production.

2. Automatically batch and overlap data fetches. Policies typically fetch data from other systems at Facebook, so we want to employ concurrency wherever possible for efficiency. We want concurrency to be implicit, so that engineers writing policies can concentrate on fighting spam and not worry about concurrency. Implicit concurrency also prevents the code from being cluttered with efficiency-related details that would obscure the functionality, and make the code harder to understand and modify.

3. Push code changes to production in minutes. This enables us to deploy new or updated policies quickly.

4. Performance. FXL’s slower performance meant that we were writing anything performance-critical in C++ and putting it in Sigma itself. This had a number of drawbacks, particularly the time required to roll out changes.

5. Support for interactive development. Developers working on policies want to be able to experiment and test their code interactively, and to see the results immediately.

Haskell measures up quite well: It is a purely functional and strongly typed language, and it has a mature optimizing compiler and an interactive environment (GHCi). It also has all the abstraction facilities we would need, it has a rich set of libraries available, and it’s backed by an active developer community.

That left us with two features from our list to address: (1) automatic batching and concurrency, and (2) hot-swapping of compiled code.

Automatic batching and concurrency: The Haxl framework

All the existing concurrency abstractions in Haskell are explicit, meaning that the user needs to say which things should happen concurrently. For data-fetching, which can be considered a purely functional operation, we wanted a programming model in which the system just exploits whatever concurrency is available, without the programmer having to use explicit concurrency constructs. We developed the Haxl framework to address this issue: Haxl enables multiple data-fetching operations to be automatically batched and executed concurrently.
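Haxl itself is a Haskell library built on Applicative combinators. Purely to illustrate the round-based idea in a compact, language-neutral way, here is a Python sketch of how a computation's pending fetches can be collected, deduplicated, and executed as a single batch per round (the protocol and names are invented for illustration and are not Haxl's API):

```python
# Sketch of round-based batching: the computation reports which keys it is
# blocked on, and each round issues one deduplicated batch fetch for all of
# them. This illustrates the idea behind Haxl, not Haxl's actual API.

def run(computation, batch_fetch):
    cache = {}
    while True:
        status, payload = computation(cache)
        if status == "done":
            return payload
        # one round: every key the computation is blocked on, as one batch
        cache.update(batch_fetch(sorted(set(payload))))

def batch_fetch(keys):
    print("batched fetch:", keys)        # e.g. one multiget to a backend
    return {k: f"value_of_{k}" for k in keys}

def friends_of_friends(cache):
    # needs both A and B; both dependencies surface in the same round
    missing = [k for k in ("A", "B") if k not in cache]
    if missing:
        return ("blocked", missing)
    return ("done", (cache["A"], cache["B"]))

print(run(friends_of_friends, batch_fetch))
```

The key property mirrored here is that the programmer writes straight-line data-fetching code while the runtime discovers that independent fetches can be combined into one round trip.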

We discussed Haxl in an earlier blog post, and we published a paper on Haxl at the ICFP 2014 conference. Haxl is open source and available on GitHub.

In addition to the Haxl framework, we needed help from the Haskell compiler in the form of the Applicative do-notation. This allows programmers to write sequences of statements that the compiler automatically rearranges to exploit concurrency. We also designed and implemented Applicative do-notation in GHC.

Hot-swapping of compiled code

Every time someone checks new code into the repository of policies, we want to have that code running on every machine in the Sigma fleet as quickly as possible. Haskell is a compiled language, so that involves compiling the code and distributing the new compiled code to all the machines running Sigma.

We want to update the compiled rules in a running Sigma process on the fly, while it is serving requests. Changing the code of a running program is a tricky problem in general, and it has been the subject of a great deal of research in the academic community. In our case, fortunately, the problem is simpler: Requests to Sigma are short-lived, so we don’t need to switch a running request to new code. We can serve new requests on the new code and let the existing requests finish before we discard the old code. We’re careful to ensure that we don’t change any code associated with persistent state in Sigma.
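The swap discipline described here, where new requests bind to the new code while in-flight requests finish on the old version, can be sketched in miniature. This is a Python analogy for the idea only; Sigma actually does this with GHC's runtime linker and garbage collector:

```python
# Sketch of the hot-swap discipline: new requests pick up the latest code,
# while in-flight requests keep the version they started with. A Python
# analogy for illustration, not how the GHC runtime linker works.

current_policy = [None]  # one mutable slot holding the active code version

def deploy(fn):
    current_policy[0] = fn       # atomic reference swap to the new code

def handle_request(payload):
    policy = current_policy[0]   # bind the version at request start
    # the request runs to completion on `policy`,
    # even if deploy() installs new code in the meantime
    return policy(payload)

deploy(lambda msg: ("v1", msg))
in_flight = current_policy[0]    # a long-running request bound to v1
deploy(lambda msg: ("v2", msg))  # new deployment while v1 is still running
print(in_flight("hello"))        # ('v1', 'hello'): old code finishes its work
print(handle_request("hello"))   # ('v2', 'hello'): new requests use new code
```

Once no request holds a reference to the old version, the garbage collector can reclaim it, which parallels how Sigma decides when it is safe to unload old compiled code.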

Loading and unloading code currently uses GHC’s built-in runtime linker, although in principle, we could use the system dynamic linker. To unload the old version of the code, the garbage collector gets involved. The garbage collector detects when old code is no longer being used by a running request, so we know when it is safe to unload it from the running process.

How Haskell fits in

Haskell is sandwiched between two layers of C++ in Sigma. At the top, we use the C++ thrift server. In principle, Haskell can act as a thrift server, but the C++ thrift server is more mature and performant. It also supports more features. Furthermore, it can work seamlessly with the Haskell layers below because we can call into Haskell from C++. For these reasons, it made sense to use C++ for the server layer.

At the lowest layer, we have existing C++ client code for talking to other internal services. Rather than rewrite this code in Haskell, which would duplicate the functionality and create an additional maintenance burden, we wrapped each C++ client in a Haxl data source using Haskell’s Foreign Function Interface (FFI) so we could use it from Haskell.

Haskell’s FFI is designed to call C rather than C++, so calling C++ requires an intermediate C layer. In most cases, we were able to avoid the intermediate C layer by using a compile-time tool that demangles C++ function names so they can be called directly from Haskell.

Performance

Perhaps the biggest question here is “Does it run fast enough?” Requests to Sigma result from users performing actions on Facebook, such as sending a message on Messenger, and Sigma must respond before the action can take place. So we wanted to serve requests fast enough to avoid interruptions to the user experience.

The graph below shows the relative throughput performance between FXL and Haskell for the 25 most common types of requests served by Sigma (these requests account for approximately 95 percent of Sigma’s typical workload).

Haskell performs as much as three times faster than FXL for certain requests. On a typical workload mix, we measured a 20 percent to 30 percent improvement in overall throughput, meaning we can serve 20 percent to 30 percent more traffic with the same hardware. We believe additional improvements are possible through performance analysis, tuning, and optimizing the GHC runtime for our workload.

Achieving this level of performance required a lot of hard work, profiling the Haskell code, and identifying and resolving performance bottlenecks.

Here are a few specific things we did:

 

  • We implemented automatic memoization of top-level computations using a source-to-source translator. This is particularly beneficial in our use-case where multiple policies can refer to the same shared value, and we want to compute it only once. Note, this is per-request memoization rather than global memoization, which lazy evaluation already provides.
  • We made a change to the way GHC manages the heap, to reduce the frequency of garbage collections on multicore machines. GHC’s default heap settings are frugal, so we also use a larger allocation area size of at least 64 MB per core.
  • Fetching remote data usually involves marshaling the data structure across the C++/Haskell boundary. If the whole data structure isn’t required, it is better to marshal only the pieces needed. Or better still, don’t fetch the whole thing — although that’s only possible if the remote service implements an appropriate API.
  • We uncovered a nasty performance bug in aeson, the Haskell JSON parsing library. Bryan O’Sullivan, the author of aeson, wrote a nice blog post about how he fixed it. It turns out that when you do things at Facebook scale, those one-in-a-million corner cases tend to crop up all the time.

 

Resource limits

In a latency-sensitive service, you don’t want a single request using a lot of resources and slowing down other requests on the same machine. In this case, the “resources” include everything on the machine that is shared by the running requests — CPU, memory, network bandwidth, and so on.

A request that uses a lot of resources is normally a bug that we want to fix. It does happen from time to time, often as a result of a condition that occurs in production that wasn’t encountered during testing — perhaps an innocuous operation provided with some unexpectedly large input data, or pathological performance of an algorithm on certain rare inputs, for example. When this happens, we want Sigma to terminate the affected request with an error (that will subsequently result in the bug being fixed) and continue without any impact on the performance of other requests being served.

To make this possible, we implemented allocation limits in GHC, which place a bound on the amount of memory a thread can allocate before it is terminated. Terminating a computation safely is a hard problem in general, but Haskell provides a safe way to abort a computation in the form of asynchronous exceptions. Asynchronous exceptions allow us to write most of our code ignoring the potential for summary termination and still have all the nice guarantees that we need in the event that the limit is hit, including safe releasing of resources, closing network connections, and so forth.

The following graph illustrates how well allocation limits work in practice. It tracks the maximum live memory across various groups of machines in the Sigma fleet. When we enabled one request that had some resource-intensive outliers, we saw large spikes in the maximum live memory, which disappeared when we enabled allocation limits.

Enabling interactive development

Facebook engineers develop policies interactively, testing code against real data as they go. To enable this workflow in Haskell, we needed the GHCi environment to work with our full stack, including making requests to other back-end services from the command line.

To make this work, we had to make our build system link all the C++ dependencies of our code into a shared library that GHCi could load. We also customized the GHCi front end to implement some of our own commands and streamline the desired workflows. The result is an interactive environment in which developers can load their code from source in a few seconds and work on it with a fast turnaround time. They have the full set of APIs available and can test against real production data sources.

While GHCi isn’t as easy to customize as it could be, we’ve already made several improvements and contributed them upstream. We hope to make more improvements in the future.

Packages and build systems

In addition to GHC itself, we make use of a lot of open-source Haskell library code. Haskell has its own packaging and build system, Cabal, and the open-source packages are all hosted on Hackage. The problem with this setup is that the pace of change on Hackage is fast, there are often breakages, and not all combinations of packages work well together. The system of version dependencies in Cabal relies too much on package authors getting it right, which is hard to ensure, and the tool support isn't what it could be. We found that using packages directly from Hackage together with Facebook's internal build tools meant adding or updating an existing package sometimes led to a yak-shaving exercise involving a cascade of updates to other packages, often with an element of trial and error to find the right version combinations.

As a result of this experience, we switched to Stackage as our source of packages. Stackage provides a set of package versions that are known to work together, freeing us from the problem of having to find the set by trial and error.

Did we find bugs in GHC?

Yes, most notably:

 

  • We fixed a bug in GHC’s garbage collector that was causing our Sigma processes to crash every few hours. The bug had gone undetected in GHC for several years.
  • We fixed a bug in GHC’s handling of finalizers that occasionally caused crashes during process shutdown.

 

Following these fixes, we haven’t seen any crashes in either the Haskell runtime or the Haskell code itself across our whole fleet.

What else?

At Facebook, we’re using Haskell at scale to fight spam and other types of abuse. We’ve found it to be reliable and performant in practice. Using the Haxl framework, our engineers working on spam fighting can focus on functionality rather than on performance, while the system can exploit the available concurrency automatically.

For more information on spam fighting at Facebook, check out our Protect the Graph page, or watch videos from our recent Spam Fighting @Scale event.