Category Archives: Security Business Domain

How AI is helping detect fraud and fight criminals

http://venturebeat.com/2017/02/18/how-ai-is-helping-detect-fraud-and-fight-criminals/

AI is about to go mainstream. It will show up in the connected home, in your car, and everywhere else. While it’s not as glamorous as the sentient beings that turn on us in futuristic theme parks, the use of AI in fraud detection holds major promise. Keeping fraud at bay is an ever-evolving battle in which both sides, good and bad, are adapting as quickly as possible to determine how to best use AI to their advantage.

There are currently three major ways that AI is used to fight fraud, and they correspond to how AI has developed as a field. These are:

  1. Rules and reputation lists
  2. Supervised machine learning
  3. Unsupervised machine learning

Rules and reputation lists

Rules and reputation lists exist in many modern organizations today to help fight fraud and are akin to “expert systems,” which were first introduced to the AI field in the 1970s. Expert systems are computer programs combined with rules from domain experts. They’re easy to get up and running and are human-understandable, but they’re also limited by their rigidity and high manual effort.

A “rule” is a human-encoded logical statement that is used to detect fraudulent accounts and behavior. For example, an institution may put in place a rule that states, “If the account is purchasing an item costing more than $1000, is located in Nigeria, and signed up less than 24 hours ago, block the transaction.”

Reputation lists, similarly, are based on what you already know is bad. A reputation list is a list of specific IPs, device types, and other single characteristics and their corresponding reputation score. Then, if an account is coming from an IP on the bad reputation list, you block them.
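A minimal sketch of how a rule plus a reputation-list lookup might be wired together. Every field name, threshold, and IP address here is hypothetical, invented purely to illustrate the two mechanisms described above:

```python
# Toy rule-plus-reputation-list screening. All names, thresholds, and IPs
# are illustrative placeholders, not any vendor's actual logic.

BAD_IP_REPUTATION = {"198.51.100.7": 0.9, "203.0.113.42": 0.8}  # IP -> risk score

def is_blocked(txn: dict) -> bool:
    # Rule: expensive purchase from a brand-new account in a flagged region.
    if (txn["amount_usd"] > 1000
            and txn["country"] == "NG"
            and txn["account_age_hours"] < 24):
        return True
    # Reputation list: block anything coming from a high-risk IP.
    if BAD_IP_REPUTATION.get(txn["ip"], 0.0) >= 0.8:
        return True
    return False

print(is_blocked({"amount_usd": 1500, "country": "NG",
                  "account_age_hours": 2, "ip": "192.0.2.1"}))       # True
print(is_blocked({"amount_usd": 20, "country": "US",
                  "account_age_hours": 500, "ip": "203.0.113.42"}))  # True
print(is_blocked({"amount_usd": 20, "country": "US",
                  "account_age_hours": 500, "ip": "192.0.2.1"}))     # False
```

Note how brittle this is: a fraudster who keeps purchases under $1,000 and rotates to a fresh IP sails straight through, which is exactly the limitation discussed next.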

While rules and reputation lists are a good first attempt at fraud detection and prevention, they can be easily gamed by cybercriminals. These days, digital services abound, and these companies make the sign-up process frictionless. Therefore, it takes very little time for fraudsters to make dozens, or even thousands, of accounts. They then use these accounts to learn the boundaries of the rules and reputation lists put in place. Easy access to cloud hosting services, VPNs, anonymous email services, device emulators, and mobile device flashing makes it easy to come up with unsuspicious attributes that slip past reputation lists.

Since the 1990s, expert systems have fallen out of favor in many domains, losing out to more sophisticated techniques. Clearly, there are better tools at our disposal for fighting fraud. However, a significant number of fraud-fighting teams in modern companies still rely on this rudimentary approach for the majority of their fraud detection, leading to massive human review overhead, false positives, and sub-optimal detection results.

Supervised machine learning (SML)

Machine learning is a subfield of AI that attempts to address the issue of previous approaches being too rigid. Researchers wanted the machines to learn from data, rather than encoding what these computer programs should look for (a different approach from expert systems). Machine learning began to make big strides in the 1990s, and by the 2000s it was effectively being used in fighting fraud as well.

Applied to fraud, supervised machine learning (SML) represents a big step forward. It’s vastly different from rules and reputation lists because instead of looking at just a few features with simple rules and gates in place, all features are considered together.

There’s one downside to this approach. An SML model for fraud detection must be fed historical data to determine what the fraudulent accounts and activity look like versus what the good accounts and activity look like. The model would then be able to look through all of the features associated with the account to make a decision. Therefore, the model can only find fraud that is similar to previous attacks. Many sophisticated modern-day fraudsters are still able to get around these SML models.

That said, SML applied to fraud detection is an active area of development because there are many SML models and approaches. For instance, applying neural networks to fraud can be very helpful because it automates feature engineering, an otherwise costly step that requires human intervention. This approach can decrease the incidence of false positives and false negatives compared to other SML models, such as SVM and random forest models, since the hidden neurons can encode many more feature possibilities than can be done by a human.

Unsupervised machine learning (UML)

Compared to SML, unsupervised machine learning (UML) has cracked fewer domain problems. For fraud detection, UML hasn’t historically been able to help much. Common UML approaches (e.g., k-means and hierarchical clustering, unsupervised neural networks, and principal component analysis) have not been able to achieve good results for fraud detection.

Having an unsupervised approach to fraud can be difficult to build in-house since it requires processing billions of events all together and there are no out-of-the-box effective unsupervised models. However, there are companies that have made strides in this area.

The reason it can be applied to fraud is due to the anatomy of most fraud attacks. Normal user behavior is chaotic, but fraudsters will work in patterns, whether they realize it or not. They are working quickly and at scale. A fraudster isn’t going to try to steal $100,000 in one go from an online service. Rather, they make dozens to thousands of accounts, each of which may yield a profit of a few cents to several dollars. But those activities will inevitably create patterns, and UML can detect them.
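As a toy illustration of that idea, grouping accounts by a shared signup fingerprint can surface the whole ring at once. The fields, data, and cluster-size threshold below are all invented; production systems correlate far more signals across billions of events:

```python
from collections import defaultdict

# Toy correlation-style unsupervised detection: accounts that share a
# signup fingerprint form a cluster, and suspiciously large clusters are
# flagged together. All fields and the threshold are illustrative.

accounts = [
    {"id": 1, "email_pattern": "aaaaa123", "ip_subnet": "203.0.113"},
    {"id": 2, "email_pattern": "bbbbb456", "ip_subnet": "203.0.113"},
    {"id": 3, "email_pattern": "ccccc789", "ip_subnet": "203.0.113"},
    {"id": 4, "email_pattern": "ddddd012", "ip_subnet": "203.0.113"},
    {"id": 5, "email_pattern": "carol.w",  "ip_subnet": "198.51.100"},
]

def flag_clusters(accounts, min_size=3):
    clusters = defaultdict(list)
    for acct in accounts:
        # Key on attributes that fraud rings tend to share.
        key = (acct["ip_subnet"], len(acct["email_pattern"]))
        clusters[key].append(acct["id"])
    return [ids for ids in clusters.values() if len(ids) >= min_size]

print(flag_clusters(accounts))  # [[1, 2, 3, 4]] -- the whole ring, caught at once
```

The lone organic user (account 5) forms no cluster, while the four templated accounts are flagged as a group, which is why all of an attack's accounts can be stopped together.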

The main benefits of using UML are:

  • You can catch new attack patterns earlier
  • All of the accounts are caught, stopping the fraudster from making any money
  • Chance of false positives is much lower, since you collect much more information before making a detection decision

Putting it all together

Each approach has its own advantages and disadvantages, and you can benefit from each method. Rules and reputation lists can be implemented cheaply and quickly without AI expertise. However, they have to be constantly updated and will only block the most naive fraudsters. SML has become an out-of-the-box technology that can consider all the attributes for a single account or event, but it’s still limited in that it can’t find new attack patterns. UML is the next evolution, as it can find new attack patterns, identify all of the accounts associated with an attack, and provide a full global view. On the other hand, it’s not as effective at stopping individual fraudsters with low-volume attacks and is difficult to implement in-house. Still, it’s certainly promising for companies looking to block large-scale or constantly evolving attacks.

A healthy fraud detection system often employs all three major ways of using AI to fight fraud. When they’re used together properly, it’s possible to benefit from the advantages of each while mitigating the weaknesses of the others.

AI in fraud detection will continue to evolve, well beyond the technologies explored above, and it’s hard to even grasp what the next frontier will look like. One thing we know for sure, though, is that the bad guys will continue to evolve along with it, and the race is on to use AI to detect criminals faster than they can use it to hide.

Catherine Lu is a technical product manager at DataVisor, a full-stack online fraud analytics platform.




Unsupervised Analytics: Moving Beyond Rules Engines and Learning Models


Rules engines, machine learning models, ID verification, reputation lookups (e.g. email and IP blacklists and whitelists), or unsupervised analytics? I’m often asked which one to use, and whether you should pick just one over the others. Each has a place where it provides value, and you should expect to combine several of these fraud solutions with solid domain expertise to build a fraud management system that best accounts for your business, products and users. With that said, rules engines and learning models are two of the major foundational components of a company’s fraud detection architecture. I’ll explain how they work, discuss the benefits and limitations of each, and highlight the demand for unsupervised analytics that can go beyond rules engines and machine learning in order to catch new fraud that has yet to be seen.

Rules Engines


How they work

Rules engines partition the operational business logic from the application code, enabling non-engineering fraud domain experts (e.g. Trust & Safety or Risk Analysts) with SQL/database knowledge to manage the rules themselves. So what types of rules are effective? Rules can be as straightforward as a few lines of logic: If A and B, then do C. For example,

IF (user_email = type_free_email_service) AND (comment_character_count ≥ 150 per sec) {
    flag user_account as spammer
    mute comment
}

Rules engines can also employ weighted scoring mechanisms. For example, in the table below each rule has a score value, positive or negative, which can be assigned by an analyst. The points for all of the rules triggered will be added together to compute an aggregate score. Subsequently, rules engines aid in establishing business operation workflows based on the score thresholds. In a typical workflow, there could be three types of actions to take based on the score range:

  1. Above 1000 – Deny (e.g. reject a transaction, suspend the account)
  2. Below 300 – Accept (e.g. order is ok, approve the content post)
  3. Between 300 and 1000 – Flag for additional review and place into a manual review bin

[Table: example rules with analyst-assigned score values]
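The scoring workflow above can be sketched in a few lines. The rules, point values, and thresholds here are invented placeholders rather than any production configuration:

```python
# Toy weighted-scoring rules engine: each triggered rule contributes points,
# and the aggregate score routes the account into deny/accept/review.
# All rules, weights, and thresholds are illustrative placeholders.

RULES = [
    ("free_email_domain",   lambda u: u["email"].endswith("@freemail.example"), 300),
    ("anonymous_proxy",     lambda u: u["uses_proxy"], 500),
    ("high_velocity",       lambda u: u["txns_last_hour"] >= 10, 400),
    ("established_account", lambda u: u["account_age_days"] > 365, -400),
]

def decide(user):
    score = sum(points for _, test, points in RULES if test(user))
    if score > 1000:
        return "deny"
    if score < 300:
        return "accept"
    return "manual_review"

print(decide({"email": "a@freemail.example", "uses_proxy": True,
              "txns_last_hour": 12, "account_age_days": 1}))    # deny (1200)
print(decide({"email": "b@corp.example", "uses_proxy": False,
              "txns_last_hour": 0, "account_age_days": 900}))   # accept (-400)
print(decide({"email": "c@freemail.example", "uses_proxy": True,
              "txns_last_hour": 1, "account_age_days": 30}))    # manual_review (800)
```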

Advantages

Rules engines can take black lists (e.g. IP addresses) and other negative lists derived from consortium databases as input data. An analyst can add a new rule as soon as he or she encounters a new fraud/risk scenario, helping the company benefit from the real-world insights of the analyst on the ground seeing the fraud every day. As a result, rules engines give businesses the control and capability to handle one-off brute force attacks, seasonality and short-term emerging trends.

Limitations

Rules engines have limitations when it comes to scale. Fraudsters don’t sit idle after you catch them. They will change what they do after learning how you caught them to prevent being caught again. Thus, the shelf life of rules can be a couple of weeks or even as short as a few days before their effectiveness begins to diminish. Imagine having to add, remove, and update rules and weights every few days when you’re in a situation with hundreds or thousands of rules to run and test. This could require huge operational resources and costs to maintain.

If a fraud analyst wants to calculate the accept, reject, and review rates for 3 rules, and to see how those rates change when each rule is adjusted down or up by 100 points, that requires 8 configurations: 2^3 = 8 (values^rules). Testing 10 rules with 3 different values each would be over 59K configurations (3^10 = 59,049)! As the number of rules increases, the time required to make adjustments grows exponentially.
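The arithmetic above can be checked directly by enumerating the configurations:

```python
from itertools import product

# Each configuration assigns one of the candidate values to every rule,
# so there are values ** rules configurations in total.

def num_configurations(rules: int, values_per_rule: int) -> int:
    return len(list(product(range(values_per_rule), repeat=rules)))

print(num_configurations(3, 2))    # 8     (3 rules, each adjusted down or up)
print(num_configurations(10, 3))   # 59049 (10 rules x 3 candidate values)
```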


Rules engines don’t automatically learn from analyst observations or feedback. As fraudsters adapt their tactics, businesses can be temporarily exposed to new types of fraud attacks. And since rules engines treat information in a binary fashion and may not detect subtle nuances, this can lead to higher instances of false positives and negative customer experiences.

Learning Models

How they work

Supervised machine learning is the most widely used learning approach when it comes to fraud detection. A few of the learning techniques include decision trees, random forests, nearest neighbors, Support Vector Machines (SVM) and Naive Bayes. Machine learning models often solve complex computations with hundreds of variables (high-dimensional space) in order to accurately determine cases of fraud.

Having a good understanding of both what is and what is not fraud plays a central role in the process of creating models. The input data to the models influences their effectiveness. The models are trained on known cases of fraud and non-fraud (i.e. labeled training data), which facilitates their ability to classify new data and cases as either fraudulent or not. Because of their ability to predict the label for a new unlabeled data set, trained learning models fill in the gap and bolster the areas where rules engines may not provide great coverage.

Below is a simplified example of how a supervised machine learning program would classify new data into the categories of non-fraud or fraud. Training data informs the model of the characteristics of two types of fraudsters: 1) credit card fraudsters and 2) spammers. Three features: 1) the email address structure, 2) the IP address type, and 3) the density of linked accounts are indicative of the type of fraud attack (e.g. response variable). Note in reality, there could be hundreds of features for a model.

The trained model recognizes that a user with:

  • an email address that has 5 letters followed by 3 numbers
  • using an anonymous proxy
  • with a medium density (e.g. 10) of connected accounts

is a credit card fraudster.

It also recognizes that a user with:

  • an email address structure with a “dot” pattern
  • using an IP address from a datacenter
  • with a high density (e.g. 30+) of linked accounts

is a spammer.

Now suppose your model is evaluating new users from the batch of users below. It computes the email address structure, IP address type, and density of linked accounts for each user. If working properly, it will classify the users in Cases 2 and 3 as spammers and the users in Cases 1, 4 and 5 as credit card fraudsters.
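A hand-written stand-in for the trained model above might look like the following. In real supervised machine learning these decision boundaries are learned from labeled data rather than coded by hand, and the density bands below are invented for illustration:

```python
import re

# Toy classifier hard-coding the two fraudster profiles described above.
# The medium/high density cutoffs (5-19 and 30+) are assumptions made for
# this sketch; a trained model would learn such boundaries from data.

def classify(user):
    local_part = user["email"].split("@")[0]
    ip_type = user["ip_type"]
    density = user["linked_accounts"]

    if (re.fullmatch(r"[a-z]{5}\d{3}", local_part)   # 5 letters + 3 numbers
            and ip_type == "anonymous_proxy"
            and 5 <= density < 20):                  # "medium" density
        return "credit_card_fraudster"
    if ("." in local_part                            # "dot"-pattern address
            and ip_type == "datacenter"
            and density >= 30):                      # "high" density
        return "spammer"
    return "unknown"

print(classify({"email": "qwert123@mail.example",
                "ip_type": "anonymous_proxy", "linked_accounts": 10}))
# credit_card_fraudster
print(classify({"email": "jo.hn@mail.example",
                "ip_type": "datacenter", "linked_accounts": 42}))
# spammer
```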

[Table: example batch of users with email address, IP type, and linked-account density]

Advantages

Learning models have the ability to digest millions of rows of data scalably, pick up on past behaviors, and continually improve their predictions based on new and different data. They can handle unstructured data (e.g. images, email text) and recognize sophisticated fraud patterns automatically, even if there are thousands of features/variables in the input data set. With learning models, you can also measure effectiveness and improve it by changing only the algorithms or their parameters.

Limitations

Trained learning models, while powerful, have their limitations. What happens if there are no labeled examples for a given type of fraud? Given how quickly fraud is evolving, this is not that uncommon of an occurrence. After all, fraudsters change schemes and conduct new types of attacks around the clock. If we have not encountered the fraud attack pattern, and therefore do not have sufficient training data, the trained learning models may not have the appropriate support to return good and reliable results.

As seen in the diagram below, collecting and labeling data is a crucial part of building a learning model and the time required to generate accurate training labels can be weeks to months. Labeling can involve teams of fraud analysts reviewing cases thoroughly, categorizing it with the right fraud tags, and undergoing a verification process before being used as training data. In the event a new type of fraud emerges, a learning model may not be able to detect it until weeks later after sufficient data has been acquired to properly train it.
[Diagram: supervised learning workflow]

Unsupervised Analytics – Going Beyond Rules Engines and Learning Models

While both of these approaches are critical pieces of a fraud detection architecture, here at DataVisor we take it one step further. DataVisor employs unsupervised analytics, which does not rely on prior knowledge of fraud patterns. In other words, no training data is needed. The core component of the algorithm is the unsupervised attack campaign detection, which leverages correlation analysis and graph processing to discover the linkages between fraudulent user behaviors, create clusters, and assign new examples to one of the clusters.


The unsupervised campaign detection provides the attack campaign group info and also the self-generated training data, both of which can be fed into our machine learning models to bootstrap them. With this data, the supervised machine learning will pick up patterns and find the fraudulent users that don’t fit into these large attack campaign groups. This framework enables DataVisor to uncover fraud attacks perpetrated by individual accounts, as well as organized mass scale attacks coordinated among many users such as fraud and crime rings – adding a valuable piece to your fraud detection architecture with a “full-stack.”


Our correlation analysis groups fraudsters “acting” similarly into the same cluster. In contrast, anomaly detection, another useful technique, finds the set of fraud objects that are considerably dissimilar from the remainder of the good users. It does this by assuming anomalies do not belong to any group, or that they belong to small/sparse clusters. See the graph below for anomaly detection, illustrating fraudsters F1 and F3, fraud group F2, and good users G1 and G2. The benefits of unsupervised analytics are on display when comparing it to anomaly detection: while anomaly detection can find outlying fraudsters in a given data set, it struggles to identify large fraud groups.

[Graph: anomaly detection showing fraudsters F1 and F3, fraud group F2, and good user groups G1 and G2]
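A toy one-dimensional contrast between the two techniques, on invented "activity score" data: a z-score outlier test finds the lone fraudster but misses the ring, while grouping near-identical behavior catches the ring as a whole:

```python
from statistics import mean, pstdev
from collections import defaultdict

# Invented activity scores: F1 is a lone outlier, F2a-F2d are a coordinated
# ring hiding inside the normal range, G* are genuine users.
users = {"F1": 99.0, "G1a": 10.2, "G1b": 11.1, "G1c": 9.8,
         "F2a": 50.0, "F2b": 50.0, "F2c": 50.0, "F2d": 50.0}

# Anomaly detection: flag points more than 2 standard deviations from the mean.
mu, sigma = mean(users.values()), pstdev(users.values())
anomalies = [u for u, v in users.items() if abs(v - mu) / sigma > 2]

# Cluster detection: flag groups of 3+ users with identical behavior.
groups = defaultdict(list)
for u, v in users.items():
    groups[v].append(u)
rings = [ids for ids in groups.values() if len(ids) >= 3]

print(anomalies)  # ['F1'] -- the lone outlier; the F2 ring goes unnoticed
print(rings)      # [['F2a', 'F2b', 'F2c', 'F2d']] -- the ring, caught as a group
```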

DataVisor’s unsupervised analytics works alongside rules engines and machine learning models. For customers, the analytics provides a list of the fraudsters and also gives their fraud analysts insights to create new rules. When DataVisor finds fraud that a customer has not encountered previously, the data from the unsupervised campaign detection can serve as early warning signals and/or training data for their learning models, adding new and valuable dimensions to the models’ accuracy.

By focusing on early detection and discovering unknown fraud, DataVisor has helped customers to become better and more efficient in solving fraud in diverse range of areas such as:

  • Identifying fake user registration and account takeovers (ATO)
  • Detecting fraudulent financial transactions and activity
  • Discovering user acquisition and promotion abuse
  • Preventing social spam, fake posts, reviews and likes

Stay tuned for future blog posts where I will address topics such as new online fraud attacks, case review management tools, and a closer look into DataVisor’s fraud detection technology stack. If you want to learn more about how DataVisor can help you fight online fraud, please visit https://datavisor.com/ or schedule a trial.

How should a large or mid-sized UGC platform (such as Sina Weibo) approach anti-spam work?

From Zhihu


Machine learning is whipping up a whirlwind in security, but it has bugs

Today, security is a field that machine learning (ML) is moving into in force.

| Bosses are eager to apply machine learning to security

If you attended the 2016 RSA Conference in person, you would have found that almost no company could describe its security products without mentioning machine learning. Why is that?

To outsiders, machine learning can look like a kind of magic that solves every security problem: feed a pile of unlabeled data into a system that can learn, and it will pick out patterns that even human experts cannot see, learn new behaviors, and adapt to emerging threats. Better yet, you never have to encode the rules yourself, because the system handles all of that automatically.

If that were really true, machine learning would deserve to be the headline act of the year. The irony is that while everyone is making a great show of breaking into this field, the people who truly understand what machine learning is, or what it can actually be used for, are few and far between. Predictably, in such an environment machine learning is widely misused, especially in security.

| What is the right way to solve security problems with machine learning?

Applying machine learning to security usually involves a technique called anomaly detection, which identifies items that do not match an expected pattern or data set. Vendors should note that the technique works only under certain conditions, though many clearly have not realized their mistake: they will tell you that after analyzing your company's network traffic, machine learning can pick out the hackers hiding in your network. In reality, machine learning alone cannot do that, and a vendor making that claim deserves a healthy dose of skepticism.

So when does it work? Machine learning is effective only when a low-dimensional problem is paired with high-quality labeled data, and unfortunately that is rarely what enterprises have in practice. Detecting new kinds of attacks requires clear, labeled examples of attacks. In other words, without a thorough understanding of normal network behavior, machine learning cannot find the hackers, and skilled hackers will camouflage themselves seamlessly.

| Where do machine learning and anomaly detection deliver the most value?

Where machine learning and anomaly detection are genuinely useful is in classifying human behavior.

It turns out that people are highly predictable, and it is possible to build very accurate models of individual user behavior and let those models detect anomalous situations.

There have already been modest successes here, such as implicit authentication, which verifies a user's identity with biometric signals based on keystroke pressure, rhythm, and typing patterns. The advantages are clear for both user experience and security: at a minimum, it frees users from the burden of memorizing passwords and the hassle of typing them. Because the signals involved in implicit authentication are mostly low-dimensional, the model only has to handle a few parameters, which also makes it practical to collect high-quality labeled data from users. Just as machine learning can match images correctly in computer vision despite variation and noise, it can likewise authenticate a person by recognizing their distinctive behavior.

But how does it do that?

Everything about how you walk, stand, and move is determined by many factors together, such as physiology, age, sex, and muscle memory, and for a given person these movements change very little. Without you noticing, the phone in your pocket captures this information precisely through its built-in sensors and records it. Four seconds of motion data is enough to identify a person by how they move. Identity can also be verified by comparing a user's current location against their location history: people live by all sorts of habits, and observing when and from where they set out makes it possible to predict whether the person being measured is really the user.

Our phones and computers already carry a large number of sensors, and that number will surge as wearables spread and the Internet of Things develops. Vast amounts of user behavior and environmental data can thus be collected and fed to machine learning systems, which build a model of each individual user and find the relationships among the various factors.
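As a rough sketch of the keystroke-rhythm idea, a new sample of inter-key intervals can be compared against a user's stored profile. The profile values, tolerance, and scoring below are all invented; real implicit-authentication systems use far richer features:

```python
from statistics import mean

# Toy implicit authentication from keystroke timing. The profile and the
# tolerance are illustrative assumptions, not values from a real system.

PROFILE = [0.21, 0.19, 0.22, 0.20, 0.18]  # user's typical inter-key intervals (s)

def matches_profile(sample, profile=PROFILE, tolerance=0.05):
    # Mean absolute deviation of the sample from the profile's baseline.
    baseline = mean(profile)
    deviation = mean(abs(x - baseline) for x in sample)
    return deviation <= tolerance

print(matches_profile([0.20, 0.22, 0.19, 0.21]))  # True  -- consistent rhythm
print(matches_profile([0.55, 0.60, 0.48, 0.52]))  # False -- a different typist
```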

| What homework do you need to do before machine learning can protect you?

To provide security protection, your system must know in advance what threat models exist.

The first and most important task is collecting data. The data must be highly accurate in order to train a system that can resist threats. If the authentication system does come under attack, though, there is no need to worry too much: behavioral changes are relatively easy to detect, and the system will recognize the anomaly quickly. For example, if a device is stolen, the motion, location, and usage recorded after the theft will differ clearly from the earlier records. The system accepts that such anomalies may occur; the user then confirms their identity through another channel, and the system is tuned to minimize false positives. Once four such factors are linked across devices, the false-positive rate of implicit authentication can fall below 0.001%.

No machine learning in this world is magical enough to solve every security problem. Designers who want to build a useful security product with machine learning need a deep understanding of the underlying system, and must accept that many problems are simply not a good fit for machine learning. But there is no need to worry: the technology companies riding the crest of this wave will eliminate those problems step by step.

Machine learning is brewing an unstoppable market wave in the security field.

The future of cybersecurity cannot do without machine learning

Information security has always been a game of cat and mouse. The good guys build a wall; the bad guys find ways through or around it. Lately, though, the bad guys seem to be getting past the wall with growing ease. Stopping them will require a huge leap in our capabilities, and that likely means using machine learning techniques far more widely.

This may surprise onlookers outside the industry, but machine learning has not yet had a broad impact on the IT security field. Security experts note that although credit card fraud detection systems and network equipment manufacturers are using advanced analytics, the routine automated security operations common at every large company, such as detecting malware on PCs or identifying malicious activity on the network, still depend largely on humans writing the code and configuring those operations in a timely manner.

Although the application of machine learning to network security has been widely studied in academia, we are only now beginning to understand the technology's impact on security tools. Startups such as Invincea, Cylance, Exabeam, and Argyle Data are using machine learning to power security tools that are faster and more accurate than those offered by today's major security software vendors.

Destroying malware with data

Invincea is a Virginia-based company specializing in malware detection and network security. Josh Saxe, the company's chief research engineer, believes it is time to abandon the signature- and file-hash-based analysis techniques of the 1990s.

Saxe says: "I understand that some antivirus companies have dipped into machine learning, but what they still live on is signature detection. They detect malware based on file hashes or pattern matching, analysis techniques that human researchers devised to detect a given sample."

Part of Invincea's advanced malware detection system is based on DARPA's Cyber Genome program.

He says: "They were very successful at detecting the malware that was common in the past, but they are not good at detecting new malware, which is one of the reasons cybercrime is so rampant today. Even with an antivirus system installed, attackers can still break into your computer, because signature-based detection simply does not work."

At Invincea, Saxe is leading a team that uses machine learning to build a better malware detection system. The project, part of DARPA's Cyber Genome program, mainly uses machine learning to take detected malware apart: reverse-engineering how the malware operates, performing social-network analysis on its code, and using the machine learning system to rapidly dissect new malware samples as they appear in the wild.

"We have demonstrated that the machine-learning-based approach we developed is more effective than traditional antivirus systems. A machine learning system can automate the work of human analysts, and even do it better. Combine a machine learning system with large amounts of training data and it can beat traditional signature-based detection."

Invincea uses deep learning methods to speed up the training of its algorithms. Saxe currently has about 1.5 million benign and malicious software samples for training the algorithms, all run on GPUs using Python tools. He hopes that as the sample data grows to 30 million, the performance advantage of the machine learning system will grow linearly.

"The more training data we have, and the more malware we can train the machine learning system on, the more pronounced its performance advantage in detecting malware becomes," he says.

Saxe says Invincea's current plan is to load more deep-learning-based features into its 2016 endpoint security products. Specifically, that means adding the capability to Cynomix, an endpoint security product that already uses machine learning.

Detecting malicious users

Machine learning can also help with other aspects of IT security: detecting malicious insiders and identifying compromised accounts.

Just as the major antivirus products rely on signatures to identify malware, the tools that monitor user activity also lean on signatures. And just as signature-based detection is beginning to fail for malware detection, its results in monitoring user activity are equally unsatisfying.

"In the past, enterprise security staff relied heavily on signature methods, such as IP address blacklists," says Derek Lin, chief data scientist at Exabeam, a provider of user behavior analytics tools.

He says: "That approach looks for things that have already happened. The problem with signature-based methods is that you can only see the signature after the event has occurred. Today, security staff are intensely focused on detecting malicious events that have no signature."

Exabeam builds a map of user activity by tracking users' remote connection information, devices, IP addresses, and credentials.

Savvy criminals now know that slightly changing their path is enough to defeat signature detection. So if a compromised detection system holds an IP blacklist, cybercriminals can defeat it by hopping continually back and forth across the large swaths of network space under their control.

Rather than hold fast to yesterday's defensive strategies, Exabeam has adopted a proactive approach based on Gartner's concept of UBA (User Behavior Analytics). The idea behind UBA is that you cannot know in advance whether a machine or user is good or bad, so you assume they are malicious and that your network is vulnerable; you therefore monitor and model everyone's behavior at all times in order to find the malicious actors.

This is where machine learning algorithms come in. Lin and his team pull in a wide variety of sources, such as server logs and VPN logs, and use an assortment of supervised and unsupervised machine learning algorithms to detect anomalous patterns in user behavior.

Lin says: "All of this paints a portrait of user behavior; the question is how it is done. For every user or entity on the network, we try to build a profile of what is normal, which is where statistical analysis comes in. Then we look for deviations from normal at the conceptual level... We use behavior-based methods to find the anomalies in the system and surface them, making them easy for security analysts to review."
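The per-user baseline idea Lin describes can be sketched with simple statistics. The events, figures, and threshold below are invented for illustration:

```python
from statistics import mean, pstdev

# Toy UBA sketch: build a per-user baseline from history, then surface
# behavior that deviates strongly from it. Data and threshold are invented.

history = {  # logins per day over the past week, per user
    "alice": [3, 4, 3, 5, 4, 3, 4],
    "bob":   [1, 1, 2, 1, 1, 2, 1],
}

def is_anomalous(user, todays_logins, z_threshold=3.0):
    baseline = history[user]
    mu, sigma = mean(baseline), pstdev(baseline)
    # Guard against a zero standard deviation for perfectly regular users.
    return abs(todays_logins - mu) > z_threshold * max(sigma, 1e-9)

print(is_anomalous("alice", 4))   # False -- within Alice's normal range
print(is_anomalous("bob", 40))    # True  -- Bob's account is behaving oddly
```

The key design point is that "normal" is defined per user, not globally: 40 logins might be routine for a service account but is wildly abnormal for Bob.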

The future of machine learning in security

"Think about the major waves of cybersecurity we have been through. Cybercriminals keep looking for effective ways to break security systems, and we have to strike back. Will machine learning become a mainstay of that counterattack? The answer is yes," says Patrick Townsend, founder and CEO of security software vendor Townsend Security.

He says: "We are now starting to get systems that can effectively process huge volumes of unstructured data and detect patterns. I hope the products in the next wave of cybersecurity will be based on cognitive computing. Look at Watson: if it can win at Jeopardy, why couldn't it be used broadly to analyze and understand cybersecurity events? I think we are at the very beginning of using cognitive computing to help handle security problems."

Invincea's Saxe hopes to ride that wave. He says: "I am not surprised that companies in this field have not yet seized this wave and produced products based on the new deep learning algorithms. Training machine learning this way only recently became feasible; it could not have been done effectively ten years ago."

Machine learning and big data know it wasn’t you who just swiped your credit card

You’re sitting at home minding your own business when you get a call from your credit card’s fraud detection unit asking if you’ve just made a purchase at a department store in your city. It wasn’t you who bought expensive electronics using your credit card – in fact, it’s been in your pocket all afternoon. So how did the bank know to flag this single purchase as most likely fraudulent?

Credit card companies have a vested interest in identifying financial transactions that are illegitimate and criminal in nature. The stakes are high. According to the Federal Reserve Payments Study, Americans used credit cards to pay for 26.2 billion purchases in 2012. The estimated loss due to unauthorized transactions that year was US$6.1 billion. The federal Fair Credit Billing Act limits the maximum liability of a credit card owner to $50 for unauthorized transactions, leaving credit card companies on the hook for the balance. Obviously fraudulent payments can have a big effect on the companies’ bottom lines. The industry requires any vendors that process credit cards to go through security audits every year. But that doesn’t stop all fraud.

In the banking industry, measuring risk is critical. The overall goal is to figure out what’s fraudulent and what’s not as quickly as possible, before too much financial damage has been done. So how does it all work? And who’s winning in the arms race between the thieves and the financial institutions?

Gathering the troops

From the consumer perspective, fraud detection can seem magical. The process appears instantaneous, with no human beings in sight. This apparently seamless and instant action involves a number of sophisticated technologies in areas ranging from finance and economics to law to information sciences.

Of course, there are some relatively straightforward and simple detection mechanisms that don’t require advanced reasoning. For example, one good indicator of fraud can be an inability to provide the correct zip code affiliated with a credit card when it’s used at an unusual location. But fraudsters are adept at bypassing this kind of routine check – after all, finding out a victim’s zip code could be as simple as doing a Google search.

Traditionally, detecting fraud relied on data analysis techniques that required significant human involvement. An algorithm would flag suspicious cases to be closely reviewed ultimately by human investigators who may even have called the affected cardholders to ask if they’d actually made the charges. Nowadays the companies are dealing with a constant deluge of so many transactions that they need to rely on big data analytics for help. Emerging technologies such as machine learning and cloud computing are stepping up the detection game.

Learning what’s legit, what’s shady

Simply put, machine learning refers to self-improving algorithms (an algorithm being a predefined process, conforming to specific rules, that a computer performs). A computer starts with a model and then trains it through trial and error. It can then make predictions, such as the risk associated with a financial transaction.

A machine learning algorithm for fraud detection needs to be trained first by being fed the normal transaction data of lots and lots of cardholders. Transaction sequences are an example of this kind of training data. A person may typically pump gas one time a week, go grocery shopping every two weeks and so on. The algorithm learns that this is a normal transaction sequence.

After this fine-tuning process, credit card transactions are run through the algorithm, ideally in real time. It then produces a probability number indicating the possibility of a transaction being fraudulent (for instance, 97%). If the fraud detection system is configured to block any transactions whose score is above, say, 95%, this assessment could immediately trigger a card rejection at the point of sale.
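The thresholding step might look like this in code. The 95% cutoff mirrors the example above, and the function name is a hypothetical stand-in:

```python
# Toy decision step: compare the model's fraud probability against a
# configured cutoff to decide the point-of-sale response. The 0.95 cutoff
# follows the example in the text; everything else is illustrative.

BLOCK_THRESHOLD = 0.95

def point_of_sale_action(fraud_probability: float) -> str:
    return "reject" if fraud_probability >= BLOCK_THRESHOLD else "approve"

print(point_of_sale_action(0.97))  # reject  -- the 97% case described above
print(point_of_sale_action(0.12))  # approve
```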

The algorithm considers many factors to qualify a transaction as fraudulent: trustworthiness of the vendor, a cardholder’s purchasing behavior including time and location, IP addresses, etc. The more data points there are, the more accurate the decision becomes.

This process makes just-in-time or real-time fraud detection possible. No person can evaluate thousands of data points simultaneously and make a decision in a split second.

Here’s a typical scenario. When you go to a cashier to check out at the grocery store, you swipe your card. Transaction details such as time stamp, amount, merchant identifier and membership tenure go to the card issuer. These data are fed to the algorithm that’s learned your purchasing patterns. Does this particular transaction fit your behavioral profile, consisting of many historic purchasing scenarios and data points?

 

The algorithm knows right away if your card is being used at the restaurant you go to every Saturday morning – or at a gas station two time zones away at an odd time such as 3:00 a.m. It also checks if your transaction sequence is out of the ordinary. If the card is suddenly used for cash-advance services twice on the same day when the historic data show no such use, this behavior is going to up the fraud probability score. If the transaction’s fraud score is above a certain threshold, often after a quick human review, the algorithm will communicate with the point-of-sale system and ask it to reject the transaction. Online purchases go through the same process.
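The profile check described above can be sketched as a toy scoring function. The signals, weights, and data here are invented for illustration; a production model would learn them from historical transactions:

```python
# Sketch: score a transaction against a cardholder's historical profile.
# Signals and weights are illustrative assumptions, not an issuer's model.

from math import dist

def anomaly_score(txn, history):
    """Combine simple signals: unusual hour, unusual location,
    and a transaction type never seen before."""
    score = 0.0
    usual_hours = {t["hour"] for t in history}
    if txn["hour"] not in usual_hours:          # e.g. 3:00 a.m. purchase
        score += 0.4
    home = history[0]["location"]
    if dist(txn["location"], home) > 5.0:       # roughly "two time zones away"
        score += 0.4
    seen_types = {t["type"] for t in history}
    if txn["type"] not in seen_types:           # e.g. first-ever cash advance
        score += 0.2
    return score

history = [
    {"hour": 9,  "location": (40.7, -74.0), "type": "grocery"},
    {"hour": 10, "location": (40.7, -74.0), "type": "restaurant"},
]
odd_txn = {"hour": 3, "location": (34.0, -118.2), "type": "cash_advance"}
print(anomaly_score(odd_txn, history))  # all three signals fire -> 1.0
```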

In this type of system, heavy human intervention is becoming a thing of the past. In fact, humans can actually get in the way: reaction times are much longer when a person is too deeply involved in the fraud-detection cycle. Still, people have a role to play, whether in validating suspected fraud or in following up on a rejected transaction. When a card is denied for multiple transactions, a person can call the cardholder before canceling the card permanently.

Computer detectives, in the cloud

The sheer number of financial transactions to process is overwhelming, firmly in the realm of big data. But machine learning thrives on mountains of data: more information actually increases the accuracy of the algorithm and helps eliminate false positives, alerts triggered by transactions that look suspicious but are really legitimate (for instance, a card used at an unexpected location). Too many alerts are as bad as none at all.

It takes a lot of computing power to churn through this volume of data. For instance, PayPal processes more than 1.1 petabytes of data for 169 million customer accounts at any given moment. This abundance of data – one petabyte, for instance, is more than 200,000 DVDs’ worth – has a positive influence on the algorithms’ machine learning, but can also be a burden on an organization’s computing infrastructure.

Enter cloud computing. Off-site computing resources can play an important role here. Cloud computing is scalable and not limited by the company’s own computing power.

Fraud detection is an arms race between good guys and bad guys. At the moment, the good guys seem to be gaining ground, thanks to emerging innovations such as chip-and-PIN cards combined with encryption, machine learning, big data and, of course, cloud computing.

Fraudsters will surely continue trying to outwit the good guys and challenge the limits of the fraud detection system. Drastic changes in the payment paradigms themselves are another hurdle. Your phone is now capable of storing credit card information and can be used to make payments wirelessly – introducing new vulnerabilities. Luckily, the current generation of fraud detection technology is largely neutral to the payment system technologies.

When there are more feed updates than you can read: how Facebook optimizes its News Feed

[Editor's note] This article is an Ask Me Anything between 覃超 of the FREES internet team and 徐万鸿, a former senior engineer on Facebook's News Feed ranking team who returned to China this September to become CTO of 神州专车. They discuss Facebook's growth-hacking strategy, its anti-spam system, News Feed ranking, and why he chose to return to China to join a startup. 雷锋网 edited the text without changing its original meaning.

 

News feed ranking is one of Facebook's signature capabilities: a user may receive two or three thousand new stories a day but will only read the first 50 to 100. Machine learning is used to put the content users most want to see at the top, which improves stickiness and daily active users.

This is certainly a technical article, and Facebook is one of the largest internet companies in the world, but that doesn't stop entrepreneurs from drawing lessons from it. Using A/B testing as an iteration method, letting data (the core of growth hacking) drive development, onboarding talks for new employees: these practices reflect different dimensions of the social giant's culture. At the level of spirit, the focus is on realizing a shared dream and unifying goals; at the micro level, that goal translates into respect for data.

What does Facebook do with the Sigma system?

When I first joined Facebook, the VP in charge of user growth gave the onboarding talk. He said that one day everyone in the world would use Facebook and that the company would be worth a trillion dollars. That left a deep impression on me. Everyone at the company was excited and had tremendous confidence in the goals they had set. They had a very strong sense of mission and were extremely focused.

That is one of the things about Facebook that impressed me most.

I worked on Facebook's site-integrity team for two years. At the time, Facebook had a lot of spam messages and junk content, much like the ads and spam links on Renren or Weibo. Some users' accounts were compromised and used to send spam, ads, and viruses from their profile pages, along with unwelcome friend requests. I handled everything of that kind that affected the user experience.

Facebook uses a system called Sigma to fight this spam. It runs on more than 2,000 machines, and everything a Facebook user does, such as posting comments and links or sending friend requests, passes through Sigma, which judges whether the behavior is normal, abusive, or otherwise problematic.

Using the Sigma system, Facebook filters out and cleans up spam.

Take friend requests as an example. The system judges automatically: if a person's friend requests keep getting rejected, the next one won't go through. If nine out of ten of someone's friend requests have been rejected, the system will reject their next request.

Of course, the system uses other signals as well.

It is a machine learning system: based on how often your previous friend requests were rejected, it estimates how likely your next one is to be rejected.

If that rate is high, Facebook will ask you to verify via SMS or another method to confirm whether the requests come from software or a real person, and whether you really mean to send the request. For example, if you send a friend request to someone with whom you share no mutual friends, it may well be an unreasonable request.
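The rejection-rate signal described above can be sketched as a simple gate. The cutoffs and the mutual-friend check below are illustrative assumptions, not Facebook's actual thresholds:

```python
# Sketch: decide what to do with an account's next friend request based on
# its past rejection rate. All thresholds are illustrative assumptions.

def next_request_action(past_requests, mutual_friends):
    """past_requests: list of True (accepted) / False (rejected)."""
    if not past_requests:
        return "allow"
    rejection_rate = past_requests.count(False) / len(past_requests)
    if rejection_rate >= 0.9:
        return "block"                 # e.g. 9 of the last 10 were rejected
    if rejection_rate >= 0.5 and mutual_friends == 0:
        return "require_verification"  # ask for SMS/phone verification
    return "allow"

print(next_request_action([False] * 9 + [True], mutual_friends=0))  # block
print(next_request_action([False, True], mutual_friends=0))         # require_verification
print(next_request_action([True, True, False], mutual_friends=3))   # allow
```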

Basically, everything you do on Facebook goes through this system, which analyzes it, makes a prediction, and decides whether to let your message through, with the aim of reducing harassment in the ecosystem. At the time, tens of billions of events per day passed through the system to be judged.

Machine learning is the core of the Sigma system

Sigma combines human-written rules with machine-learned models. Approvals and rejections provide the labels: an action that goes through is a positive training sample for the machine learner, and a rejected one is a negative sample, much like 1 and 0.

For example, if a friend request is accepted, the y value is 1; if it is rejected, it is 0. For comments and likes, the system looks for the y value in the same way, and inappropriate content a user sends gets deleted.
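These 0/1 labels are exactly what a supervised learner needs. A minimal sketch, using a tiny one-feature logistic regression trained by gradient descent; the data and the feature (mutual-friend count) are made up for illustration and are not Facebook's actual model:

```python
# Sketch: friend-request outcomes as 0/1 labels for a supervised learner.
# A one-feature logistic regression; data and feature are illustrative.

import math

# (mutual_friends, y): y = 1 if the request was accepted, 0 if rejected
samples = [(0, 0), (0, 0), (1, 0), (2, 1), (5, 1), (8, 1)]

w, b = 0.0, 0.0
for _ in range(2000):
    for x, y in samples:
        p = 1 / (1 + math.exp(-(w * x + b)))
        w += 0.1 * (y - p) * x   # gradient step on the log-loss
        b += 0.1 * (y - p)

def p_accept(mutual_friends):
    """Predicted probability that a request with this many mutual friends
    will be accepted."""
    return 1 / (1 + math.exp(-(w * mutual_friends + b)))

print(p_accept(0))  # low: requests from strangers are mostly rejected
print(p_accept(6))  # high: many mutual friends
```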

Machine learning is the core of the entire Sigma system.

Another approach is to analyze anomalous user behavior through anomaly analysis and data mining.

For example, if someone posts a large number of comments of the same type, all containing a similar link, that is highly suspicious. A normal user doesn't leave identical comments on different people's pages; this is clearly anomalous behavior, and we don't allow it.
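This kind of pattern is easy to illustrate. A minimal sketch that flags accounts repeating the same link-bearing comment across profiles; the threshold and data format are illustrative assumptions:

```python
# Sketch: flag accounts that post the same link-bearing comment on many
# different profiles. The threshold is an illustrative assumption.

import re
from collections import Counter

def suspicious_accounts(comments, min_repeats=3):
    """comments: list of (account, target_profile, text) tuples."""
    per_account = Counter()
    for account, target, text in comments:
        urls = tuple(sorted(re.findall(r"https?://\S+", text)))
        if urls:  # count identical (account, url-set) pairs across profiles
            per_account[(account, urls)] += 1
    return {acct for (acct, _), n in per_account.items() if n >= min_repeats}

comments = [
    ("spammer", "alice", "check this out http://spam.example"),
    ("spammer", "bob",   "check this out http://spam.example"),
    ("spammer", "carol", "check this out http://spam.example"),
    ("friend",  "alice", "great photo!"),
]
print(suspicious_accounts(comments))  # {'spammer'}
```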

News Feed is Facebook's most important product

After two years, I chose to move to this team.

"Ranking" refers to the order of the feed. It determines what your feed looks like when you open Facebook and where each story appears. Each person's network generates two or three thousand stories, but a user will only see 50 to 100, so you have to surface the best of those thousands. Some things we don't show at all; for example, stories about a game you like that your friend doesn't.

When I joined in 2012, the News Feed ranking team had only five or six people, even though this was probably the company's largest machine learning system and its most central product. More than a billion people came online every day, each spending about 40 minutes on Facebook, half of that in News Feed. Most of Facebook's revenue comes from News Feed ads: mobile ads account for 70 percent of all ad revenue, and all mobile ads run in News Feed. Whether measured by user time spent or by revenue, News Feed is the most important product.

News Feed is Facebook's most important product; it directly determines what users see.

Ranking News Feed well is a hard problem, because users can act on the feed in many ways, not just the click/no-click of traditional advertising. Users can like, comment on, share, or hide a story, or play a video. I need to understand what users like, what they comment on and share, and what kinds of videos they want to watch: to understand their interests and, based on that information, put the best content at the top of the feed.

Compare this with domestic social media: WeChat Moments shows all content without ranking, because the volume is small enough that people can read everything. As you gain more and more friends, you no longer have time to read every post, and ranking becomes inevitable. Otherwise you can easily miss photos from the people who matter most, buried under content you aren't interested in.

Facebook also used to show everything, until users gradually could no longer keep up. Without ranking, without picking out the best content, users won't want to visit News Feed: they see too much they aren't interested in, and no longer have time to dig out the parts they are. Going from unranked to ranked is an inevitable process; as your friends and the pages you follow keep growing, ranking is a necessity.

Sina Weibo, for instance, doesn't rank its feed, and parts of it are chaotic; they tested ranking but didn't do it well, so they gave up. WeChat Moments will also reach the stage where it needs ranking. Facebook doesn't just rank; it also hides content users aren't interested in. For example, if your friends play Candy Crush but you don't play any games, stories about "what games your friends are playing" are meaningless to you, and Facebook won't show them.

The fragmentation of social media is a fact. Only with better ranking, pushing more precisely targeted content to users, can a platform increase time spent and strengthen stickiness.

How does News Feed ranking work?

Basically, News Feed picks 40 to 50 stories out of two or three thousand. Each story is scored, and the highest-scoring content goes to the top. For every story, whether a photo, a share, or a status update, we predict a set of probabilities: the probability that you'll like it, comment on it, or share it. Each user action (like, share, comment) is assigned a weight, and the action probabilities are computed by the machine learning system. If a user likes, comments on, or shares a story, it means the user wanted to see it and responded to it.

For example, suppose you're my friend and have uploaded 100 photos, and I've liked 20 of them; then the like probability is 20%. We know which content each user has liked or commented on in the past, and that is our training data. We learn from a user's historical behavior to predict their future behavior on similar content from the same people, because short-term behavior doesn't change drastically: whatever you commented on in the past, you're likely to comment on similar content in the future.
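Putting the two preceding paragraphs together, the scoring step can be sketched as a weighted sum of predicted action probabilities. The weights and probabilities below are invented for illustration, not Facebook's actual values:

```python
# Sketch of the scoring step: predicted action probabilities combined with
# per-action weights, then sorted. All numbers are illustrative assumptions.

ACTION_WEIGHTS = {"like": 1.0, "comment": 4.0, "share": 8.0}

def story_score(predictions):
    """Weighted sum of the predicted probabilities for each action."""
    return sum(ACTION_WEIGHTS[a] * p for a, p in predictions.items())

stories = [
    ("game invite",  {"like": 0.01, "comment": 0.001, "share": 0.001}),
    ("friend photo", {"like": 0.20, "comment": 0.05,  "share": 0.01}),
    ("viral video",  {"like": 0.10, "comment": 0.02,  "share": 0.06}),
]
ranked = sorted(stories, key=lambda s: story_score(s[1]), reverse=True)
print([name for name, _ in ranked])  # ['viral video', 'friend photo', 'game invite']
```

Giving comments and shares higher weights than likes reflects the idea that rarer, higher-effort actions are stronger signals of interest.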

Predicting from user content

Many people ask whether predictions can be based on the content itself, analyzing the text or images a user posts. They can. For images, we can extract features, run pattern recognition, analyze the subject of the image, and tag it accordingly, using machines to recognize the images. This work is ongoing; Facebook has an AI lab that can recognize image content.

So how does Facebook check that this algorithm is effective, and how does it iterate on it?

Through A/B testing. We give 1% of users the new algorithm and 1% the old one. If under the new algorithm users like, comment, or share more each day, the new algorithm is better, and we release it to all users. Our core goals are more daily active users, longer time spent, and more frequent visits to Facebook.

A/B testing is an excellent iteration method: establish core metrics, run the test, and see whether the change improves them; ship it if it does, and don't if it doesn't. This is much like growth hacking, and the ultimate goal is still DAU. If users like your News Feed, they will visit more often; in the end it comes down to time online and daily active users.
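The ship/don't-ship decision above usually comes down to a significance test on the two buckets. A minimal sketch using a two-proportion z-test; the counts are made up, and a real system would evaluate many metrics, not one:

```python
# Sketch: compare engagement between a 1% treatment bucket (new ranking)
# and a 1% control bucket. Counts are illustrative assumptions.

from math import sqrt

def two_proportion_z(engaged_a, total_a, engaged_b, total_b):
    """z-statistic for the difference between two engagement rates."""
    p_a, p_b = engaged_a / total_a, engaged_b / total_b
    p = (engaged_a + engaged_b) / (total_a + total_b)  # pooled rate
    se = sqrt(p * (1 - p) * (1 / total_a + 1 / total_b))
    return (p_b - p_a) / se

# control: 10,000 users, 5,200 engaged; treatment: 10,000 users, 5,450 engaged
z = two_proportion_z(5200, 10000, 5450, 10000)
print(z, "ship" if z > 1.96 else "keep testing")  # z well above 1.96: ship
```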

A/B testing is Facebook's tool for verifying whether an iteration is viable. 吆喝科技, a company backed by 峰瑞资本, wants to make this technique available to startups as well.

"I can no longer read everything in my Moments feed"

I can no longer read all the content in my WeChat Moments. One improvement is ranking: put the best content first, learning what you care about from what you've liked before; for example, you like everything your girlfriend posts. Another improvement is "story bumping." Sometimes I browse WeChat in the morning and only get through a small part; when I check again later, there's nothing new.

Facebook's story bumping feature takes content you haven't seen yet and pushes it back to the top of your feed.

WeChat knows which content you haven't seen. I have many friends in the U.S., so my Moments feed fills up, and before work I only get through part of it. When I refresh, nothing new appears, and I never get back to what I didn't finish, such as the photos my friends posted. Facebook's story bumping puts important, still-unseen, slightly older content at the top of the feed so you get another look and don't miss anything important.
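Story bumping can be sketched as a re-ranking pass in which unseen stories keep competing for the top on the next refresh. The scores and the boost value below are illustrative assumptions:

```python
# Sketch of "story bumping": unseen stories get a boost so older but
# important content resurfaces. Scores and the boost are illustrative.

def rank_feed(stories, seen_ids, unseen_boost=0.3):
    """stories: list of (story_id, base_score). Returns ids, best first."""
    def key(story):
        sid, score = story
        return score + (unseen_boost if sid not in seen_ids else 0.0)
    return [sid for sid, _ in sorted(stories, key=key, reverse=True)]

stories = [("old_important", 0.6), ("new_minor", 0.5), ("seen_top", 0.7)]
# the user already saw "seen_top" this morning:
print(rank_feed(stories, seen_ids={"seen_top"}))
# the older important story is bumped above the already-seen one
```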

In September I joined 神州专车 as CTO. Career-wise, I hope to bring the company culture and technology I learned at Facebook back to China. China has enormous potential in the computer industry. The quality of domestic products now rivals American ones, WeChat for example; Facebook's product managers have studied WeChat's features. Looking a few years ahead, China has a chance to catch up with the U.S.

Computer science as a discipline has matured, and creativity is steadily improving. Many startups are trying out different ideas; China has many times more entrepreneurs than the U.S., all experimenting, and successful companies will emerge from that. Technically, China is closing in on the U.S. and may even surpass it. In the long run, China's computer and internet industry has the potential to become the best in the world.

Fighting spam with Haskell

One of our weapons in the fight against spam, malware, and other abuse on Facebook is a system called Sigma. Its job is to proactively identify malicious actions on Facebook, such as spam, phishing attacks, posting links to malware, etc. Bad content detected by Sigma is removed automatically so that it doesn’t show up in your News Feed.

We recently completed a two-year-long major redesign of Sigma, which involved replacing the in-house FXL language previously used to program Sigma with Haskell. The Haskell-powered Sigma now runs in production, serving more than one million requests per second.

Haskell isn’t a common choice for large production systems like Sigma, and in this post, we’ll explain some of the thinking that led to that decision. We also wanted to share the experiences and lessons we learned along the way. We made several improvements to GHC (the Haskell compiler) and fed them back upstream, and we were able to achieve better performance from Haskell compared with the previous implementation.

How does Sigma work?

Sigma is a rule engine, which means it runs a set of rules, called policies. Every interaction on Facebook — from posting a status update to clicking “like” — results in Sigma evaluating a set of policies specific to that type of interaction. These policies make it possible for us to identify and block malicious interactions before they affect people on Facebook.

Policies are continuously deployed. At all times, the source code in the repository is the code running in Sigma, allowing us to move quickly to deploy policies in response to new abuses. This also means that safety in the language we write policies in is important. We don’t allow code to be checked into the repository unless it is type-correct.

Louis Brandy of Facebook’s Site Integrity team discusses scalable spam fighting and the anti-abuse structure at Facebook and Instagram in a 2014 @Scale talk.

Why Haskell?

The original language we designed for writing policies, FXL, was not ideal for expressing the growing scale and complexity of Facebook policies. It lacked certain abstraction facilities, such as user-defined data types and modules, and its implementation, based on an interpreter, was slower than we wanted. We wanted the performance and expressivity of a fully fledged programming language. Thus, we decided to migrate to an existing language rather than try to improve FXL.

The following features were at the top of our list when we were choosing a replacement:

1. Purely functional and strongly typed. This ensures that policies can’t inadvertently interact with each other, they can’t crash Sigma, and they are easy to test in isolation. Strong types help eliminate many bugs before putting policies into production.

2. Automatically batch and overlap data fetches. Policies typically fetch data from other systems at Facebook, so we want to employ concurrency wherever possible for efficiency. We want concurrency to be implicit, so that engineers writing policies can concentrate on fighting spam and not worry about concurrency. Implicit concurrency also prevents the code from being cluttered with efficiency-related details that would obscure the functionality, and make the code harder to understand and modify.

3. Push code changes to production in minutes. This enables us to deploy new or updated policies quickly.

4. Performance. FXL’s slower performance meant that we were writing anything performance-critical in C++ and putting it in Sigma itself. This had a number of drawbacks, particularly the time required to roll out changes.

5. Support for interactive development. Developers working on policies want to be able to experiment and test their code interactively, and to see the results immediately.

Haskell measures up quite well: It is a purely functional and strongly typed language, and it has a mature optimizing compiler and an interactive environment (GHCi). It also has all the abstraction facilities we would need, it has a rich set of libraries available, and it’s backed by an active developer community.

That left us with two features from our list to address: (1) automatic batching and concurrency, and (2) hot-swapping of compiled code.

Automatic batching and concurrency: The Haxl framework

All the existing concurrency abstractions in Haskell are explicit, meaning that the user needs to say which things should happen concurrently. For data-fetching, which can be considered a purely functional operation, we wanted a programming model in which the system just exploits whatever concurrency is available, without the programmer having to use explicit concurrency constructs. We developed the Haxl framework to address this issue: Haxl enables multiple data-fetching operations to be automatically batched and executed concurrently.
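Haxl itself is a Haskell library built on Applicative combinators. Purely to illustrate the round-based idea in a compact, language-neutral way, here is a Python sketch of how a computation's pending fetches can be collected, deduplicated, and executed as a single batch per round (the protocol and names are invented for illustration and are not Haxl's API):

```python
# Sketch of round-based batching: the computation reports which keys it is
# blocked on, and each round issues one deduplicated batch fetch for all of
# them. This illustrates the idea behind Haxl, not Haxl's actual API.

def run(computation, batch_fetch):
    cache = {}
    while True:
        status, payload = computation(cache)
        if status == "done":
            return payload
        # one round: every key the computation is blocked on, as one batch
        cache.update(batch_fetch(sorted(set(payload))))

def batch_fetch(keys):
    print("batched fetch:", keys)        # e.g. one multiget to a backend
    return {k: f"value_of_{k}" for k in keys}

def friends_of_friends(cache):
    # needs both A and B; both dependencies surface in the same round
    missing = [k for k in ("A", "B") if k not in cache]
    if missing:
        return ("blocked", missing)
    return ("done", (cache["A"], cache["B"]))

print(run(friends_of_friends, batch_fetch))
```

The key property mirrored here is that the programmer writes straight-line data-fetching code while the runtime discovers that independent fetches can be combined into one round trip.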

We discussed Haxl in an earlier blog post, and we published a paper on Haxl at the ICFP 2014 conference. Haxl is open source and available on GitHub.

In addition to the Haxl framework, we needed help from the Haskell compiler in the form of the Applicative do-notation. This allows programmers to write sequences of statements that the compiler automatically rearranges to exploit concurrency. We also designed and implemented Applicative do-notation in GHC.

Hot-swapping of compiled code

Every time someone checks new code into the repository of policies, we want to have that code running on every machine in the Sigma fleet as quickly as possible. Haskell is a compiled language, so that involves compiling the code and distributing the new compiled code to all the machines running Sigma.

We want to update the compiled rules in a running Sigma process on the fly, while it is serving requests. Changing the code of a running program is a tricky problem in general, and it has been the subject of a great deal of research in the academic community. In our case, fortunately, the problem is simpler: Requests to Sigma are short-lived, so we don’t need to switch a running request to new code. We can serve new requests on the new code and let the existing requests finish before we discard the old code. We’re careful to ensure that we don’t change any code associated with persistent state in Sigma.
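The swap discipline described here, where new requests bind to the new code while in-flight requests finish on the old version, can be sketched in miniature. This is a Python analogy for the idea only; Sigma actually does this with GHC's runtime linker and garbage collector:

```python
# Sketch of the hot-swap discipline: new requests pick up the latest code,
# while in-flight requests keep the version they started with. A Python
# analogy for illustration, not how the GHC runtime linker works.

current_policy = [None]  # one mutable slot holding the active code version

def deploy(fn):
    current_policy[0] = fn       # atomic reference swap to the new code

def handle_request(payload):
    policy = current_policy[0]   # bind the version at request start
    # the request runs to completion on `policy`,
    # even if deploy() installs new code in the meantime
    return policy(payload)

deploy(lambda msg: ("v1", msg))
in_flight = current_policy[0]    # a long-running request bound to v1
deploy(lambda msg: ("v2", msg))  # new deployment while v1 is still running
print(in_flight("hello"))        # ('v1', 'hello'): old code finishes its work
print(handle_request("hello"))   # ('v2', 'hello'): new requests use new code
```

Once no request holds a reference to the old version, the garbage collector can reclaim it, which parallels how Sigma decides when it is safe to unload old compiled code.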

Loading and unloading code currently uses GHC’s built-in runtime linker, although in principle, we could use the system dynamic linker. To unload the old version of the code, the garbage collector gets involved. The garbage collector detects when old code is no longer being used by a running request, so we know when it is safe to unload it from the running process.

How Haskell fits in

Haskell is sandwiched between two layers of C++ in Sigma. At the top, we use the C++ thrift server. In principle, Haskell can act as a thrift server, but the C++ thrift server is more mature and performant. It also supports more features. Furthermore, it can work seamlessly with the Haskell layers below because we can call into Haskell from C++. For these reasons, it made sense to use C++ for the server layer.

At the lowest layer, we have existing C++ client code for talking to other internal services. Rather than rewrite this code in Haskell, which would duplicate the functionality and create an additional maintenance burden, we wrapped each C++ client in a Haxl data source using Haskell’s Foreign Function Interface (FFI) so we could use it from Haskell.

Haskell’s FFI is designed to call C rather than C++, so calling C++ requires an intermediate C layer. In most cases, we were able to avoid the intermediate C layer by using a compile-time tool that demangles C++ function names so they can be called directly from Haskell.

Performance

Perhaps the biggest question here is “Does it run fast enough?” Requests to Sigma result from users performing actions on Facebook, such as sending a message on Messenger, and Sigma must respond before the action can take place. So we wanted to serve requests fast enough to avoid interruptions to the user experience.

The graph below shows the relative throughput performance between FXL and Haskell for the 25 most common types of requests served by Sigma (these requests account for approximately 95 percent of Sigma’s typical workload).

Haskell performs as much as three times faster than FXL for certain requests. On a typical workload mix, we measured a 20 percent to 30 percent improvement in overall throughput, meaning we can serve 20 percent to 30 percent more traffic with the same hardware. We believe additional improvements are possible through performance analysis, tuning, and optimizing the GHC runtime for our workload.

Achieving this level of performance required a lot of hard work, profiling the Haskell code, and identifying and resolving performance bottlenecks.

Here are a few specific things we did:

 

  • We implemented automatic memoization of top-level computations using a source-to-source translator. This is particularly beneficial in our use-case where multiple policies can refer to the same shared value, and we want to compute it only once. Note, this is per-request memoization rather than global memoization, which lazy evaluation already provides.
  • We made a change to the way GHC manages the heap, to reduce the frequency of garbage collections on multicore machines. GHC’s default heap settings are frugal, so we also use a larger allocation area size of at least 64 MB per core.
  • Fetching remote data usually involves marshaling the data structure across the C++/Haskell boundary. If the whole data structure isn’t required, it is better to marshal only the pieces needed. Or better still, don’t fetch the whole thing — although that’s only possible if the remote service implements an appropriate API.
  • We uncovered a nasty performance bug in aeson, the Haskell JSON parsing library. Bryan O’Sullivan, the author of aeson, wrote a nice blog post about how he fixed it. It turns out that when you do things at Facebook scale, those one-in-a-million corner cases tend to crop up all the time.

 

Resource limits

In a latency-sensitive service, you don’t want a single request using a lot of resources and slowing down other requests on the same machine. In this case, the “resources” include everything on the machine that is shared by the running requests — CPU, memory, network bandwidth, and so on.

A request that uses a lot of resources is normally a bug that we want to fix. It does happen from time to time, often as a result of a condition that occurs in production that wasn’t encountered during testing — perhaps an innocuous operation provided with some unexpectedly large input data, or pathological performance of an algorithm on certain rare inputs, for example. When this happens, we want Sigma to terminate the affected request with an error (that will subsequently result in the bug being fixed) and continue without any impact on the performance of other requests being served.

To make this possible, we implemented allocation limits in GHC, which place a bound on the amount of memory a thread can allocate before it is terminated. Terminating a computation safely is a hard problem in general, but Haskell provides a safe way to abort a computation in the form of asynchronous exceptions. Asynchronous exceptions allow us to write most of our code ignoring the potential for summary termination and still have all the nice guarantees that we need in the event that the limit is hit, including safe releasing of resources, closing network connections, and so forth.

The following graph illustrates how well allocation limits work in practice. It tracks the maximum live memory across various groups of machines in the Sigma fleet. When we enabled one request that had some resource-intensive outliers, we saw large spikes in the maximum live memory, which disappeared when we enabled allocation limits.

Enabling interactive development

Facebook engineers develop policies interactively, testing code against real data as they go. To enable this workflow in Haskell, we needed the GHCi environment to work with our full stack, including making requests to other back-end services from the command line.

To make this work, we had to make our build system link all the C++ dependencies of our code into a shared library that GHCi could load. We also customized the GHCi front end to implement some of our own commands and streamline the desired workflows. The result is an interactive environment in which developers can load their code from source in a few seconds and work on it with a fast turnaround time. They have the full set of APIs available and can test against real production data sources.

While GHCi isn’t as easy to customize as it could be, we’ve already made several improvements and contributed them upstream. We hope to make more improvements in the future.

Packages and build systems

In addition to GHC itself, we make use of a lot of open-source Haskell library code. Haskell has its own packaging and build system, Cabal, and the open-source packages are all hosted on Hackage. The problem with this setup is that the pace of change on Hackage is fast, there are often breakages, and not all combinations of packages work well together. The system of version dependencies in Cabal relies too much on package authors getting it right, which is hard to ensure, and the tool support isn't what it could be. We found that using packages directly from Hackage together with Facebook's internal build tools meant adding or updating an existing package sometimes led to a yak-shaving exercise involving a cascade of updates to other packages, often with an element of trial and error to find the right version combinations.

As a result of this experience, we switched to Stackage as our source of packages. Stackage provides a set of package versions that are known to work together, freeing us from the problem of having to find the set by trial and error.

Did we find bugs in GHC?

Yes, most notably:

 

  • We fixed a bug in GHC’s garbage collector that was causing our Sigma processes to crash every few hours. The bug had gone undetected in GHC for several years.
  • We fixed a bug in GHC’s handling of finalizers that occasionally caused crashes during process shutdown.

 

Following these fixes, we haven’t seen any crashes in either the Haskell runtime or the Haskell code itself across our whole fleet.

What else?

At Facebook, we’re using Haskell at scale to fight spam and other types of abuse. We’ve found it to be reliable and performant in practice. Using the Haxl framework, our engineers working on spam fighting can focus on functionality rather than on performance, while the system can exploit the available concurrency automatically.

For more information on spam fighting at Facebook, check out our Protect the Graph page, or watch videos from our recent Spam Fighting @Scale event.