Category Archives: Security Business Domain

How AI is helping detect fraud and fight criminals

http://venturebeat.com/2017/02/18/how-ai-is-helping-detect-fraud-and-fight-criminals/

AI is about to go mainstream. It will show up in the connected home, in your car, and everywhere else. While it’s not as glamorous as the sentient beings that turn on us in futuristic theme parks, the use of AI in fraud detection holds major promise. Keeping fraud at bay is an ever-evolving battle in which both sides, good and bad, are adapting as quickly as possible to determine how to best use AI to their advantage.

There are currently three major ways that AI is used to fight fraud, and they correspond to how AI has developed as a field. These are:

  1. Rules and reputation lists
  2. Supervised machine learning
  3. Unsupervised machine learning

Rules and reputation lists

Rules and reputation lists exist in many modern organizations today to help fight fraud and are akin to “expert systems,” which were first introduced to the AI field in the 1970s. Expert systems are computer programs combined with rules from domain experts. They’re easy to get up and running and are human-understandable, but they’re also limited by their rigidity and high manual effort.

A “rule” is a human-encoded logical statement that is used to detect fraudulent accounts and behavior. For example, an institution may put in place a rule that states, “If the account is purchasing an item costing more than $1000, is located in Nigeria, and signed up less than 24 hours ago, block the transaction.”
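The rule quoted above can be written as a few lines of code. This is only a minimal sketch: the `Transaction` fields and the `should_block` name are hypothetical, and the country is assumed to be stored as an ISO code.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    # Hypothetical fields chosen for illustration only.
    amount_usd: float
    country: str              # assumed ISO 3166-1 alpha-2 code, e.g. "NG"
    account_age_hours: float


def should_block(tx: Transaction) -> bool:
    """Encode the example rule: all three conditions must hold."""
    return (
        tx.amount_usd > 1000
        and tx.country == "NG"
        and tx.account_age_hours < 24
    )
```

Because every condition is a hard gate, an account that waits 25 hours before purchasing passes untouched, which is the rigidity the article describes.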

Reputation lists, similarly, are based on what you already know is bad. A reputation list is a list of specific IPs, device types, and other single characteristics and their corresponding reputation score. Then, if an account is coming from an IP on the bad reputation list, you block them.
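A reputation list amounts to a lookup table from a single attribute to a score. The sketch below assumes a score between 0 and 1 and a made-up blocking threshold; the IP addresses come from documentation-only ranges and the names are illustrative.

```python
# Hypothetical reputation scores: higher means worse.
IP_REPUTATION = {
    "203.0.113.7": 0.95,
    "198.51.100.23": 0.10,
}

BLOCK_THRESHOLD = 0.8  # assumed cut-off, not taken from the article


def is_blocked(ip: str, reputation=IP_REPUTATION, threshold=BLOCK_THRESHOLD) -> bool:
    """Block only when the IP is on the list with a bad enough score."""
    # An IP the list has never seen defaults to 0.0 and is allowed through.
    return reputation.get(ip, 0.0) >= threshold
```

Note the default for unknown IPs: anything not already on the list passes, so a list of this kind can only stop what has been seen before.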

While rules and reputation lists are a good first attempt at fraud detection and prevention, they can be easily gamed by cybercriminals. These days, digital services abound, and these companies make the sign-up process frictionless. Therefore, it takes very little time for fraudsters to make dozens, or even thousands, of accounts. They then use these accounts to learn the boundaries of the rules and reputation lists put in place. Easy access to cloud hosting services, VPNs, anonymous email services, device emulators, and mobile device flashing makes it easy to come up with unsuspicious attributes that would miss reputation lists.

Since the 1990s, expert systems have fallen out of favor in many domains, losing out to more sophisticated techniques. Clearly, there are better tools at our disposal for fighting fraud. However, a significant number of fraud-fighting teams in modern companies still rely on this rudimentary approach for the majority of their fraud detection, leading to massive human review overhead, false positives, and sub-optimal detection results.

Supervised machine learning (SML)

Machine learning is a subfield of AI that attempts to address the issue of previous approaches being too rigid. Researchers wanted the machines to learn from data, rather than encoding what these computer programs should look for (a different approach from expert systems). Machine learning began to make big strides in the 1990s, and by the 2000s it was effectively being used in fighting fraud as well.

Applied to fraud, supervised machine learning (SML) represents a big step forward. It’s vastly different from rules and reputation lists because instead of looking at just a few features with simple rules and gates in place, all features are considered together.

There’s one downside to this approach. An SML model for fraud detection must be fed historical data to determine what the fraudulent accounts and activity look like versus what the good accounts and activity look like. The model would then be able to look through all of the features associated with the account to make a decision. Therefore, the model can only find fraud that is similar to previous attacks. Many sophisticated modern-day fraudsters are still able to get around these SML models.

That said, SML applied to fraud detection is an active area of development because there are many SML models and approaches. For instance, applying neural networks to fraud can be very helpful because it automates feature engineering, an otherwise costly step that requires human intervention. This approach can decrease the incidence of false positives and false negatives compared to other SML models, such as SVM and random forest models, since the hidden neurons can encode many more feature possibilities than can be done by a human.

Unsupervised machine learning (UML)

Compared to SML, unsupervised machine learning (UML) has cracked fewer domain problems. For fraud detection, UML hasn’t historically been able to help much. Common UML approaches (e.g., k-means and hierarchical clustering, unsupervised neural networks, and principal component analysis) have not been able to achieve good results for fraud detection.

An unsupervised approach to fraud can be difficult to build in-house, since it requires processing billions of events together and there are no effective out-of-the-box unsupervised models. However, there are companies that have made strides in this area.

UML can be applied to fraud because of the anatomy of most fraud attacks. Normal user behavior is chaotic, but fraudsters work in patterns, whether they realize it or not. They are working quickly and at scale. A fraudster isn’t going to try to steal $100,000 in one go from an online service. Rather, they make dozens to thousands of accounts, each of which may yield a profit of a few cents to several dollars. But those activities will inevitably create patterns, and UML can detect them.
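One very simple way to surface such patterns without labels is to group accounts that share the same behavioral signature and look at the large groups. The sketch below is a toy stand-in, not a production UML system: the signature fields (`ip_prefix`, `signup_hour`, `email_shape`) and the minimum group size are assumptions.

```python
from collections import defaultdict


def find_suspicious_groups(accounts, min_group_size=3):
    """Group accounts by a shared signature; return the IDs of large groups.

    No fraud labels are used: a group is flagged only because many
    accounts look alike, on the premise that normal users are varied.
    """
    groups = defaultdict(list)
    for acct in accounts:
        signature = (acct["ip_prefix"], acct["signup_hour"], acct["email_shape"])
        groups[signature].append(acct["id"])
    return [ids for ids in groups.values() if len(ids) >= min_group_size]
```

A real system would compare many more attributes and tolerate near matches, but the idea is the same: the decision is made on the group, not on any one account.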

The main benefits of using UML are:

  • You can catch new attack patterns earlier
  • All of the accounts are caught, stopping the fraudster from making any money
  • Chance of false positives is much lower, since you collect much more information before making a detection decision

Putting it all together

Each approach has its own advantages and disadvantages, and you can benefit from each method. Rules and reputation lists can be implemented cheaply and quickly without AI expertise. However, they have to be constantly updated and will only block the most naive fraudsters. SML has become an out-of-the-box technology that can consider all the attributes for a single account or event, but it’s still limited in that it can’t find new attack patterns. UML is the next evolution, as it can find new attack patterns, identify all of the accounts associated with an attack, and provide a full global view. On the other hand, it’s not as effective at stopping individual fraudsters with low-volume attacks and is difficult to implement in-house. Still, it’s certainly promising for companies looking to block large-scale or constantly evolving attacks.

A healthy fraud detection system often employs all three major ways of using AI to fight fraud. When they’re used together properly, it’s possible to benefit from the advantages of each while mitigating the weaknesses of the others.

AI in fraud detection will continue to evolve, well beyond the technologies explored above, and it’s hard to even grasp what the next frontier will look like. One thing we know for sure, though, is that the bad guys will continue to evolve along with it, and the race is on to use AI to detect criminals faster than they can use it to hide.

Catherine Lu is a technical product manager at DataVisor, a full-stack online fraud analytics platform.

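Unsupervised Machine Learning: An Anti-Fraud Analysis Method That Goes Beyond Rules Engines and Supervised Machine Learning

Original link: http://www.twoeggz.com/news/3147867.html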

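Rules engines, machine learning models, device fingerprinting, blacklists and whitelists (for example, of email or IP addresses), and unsupervised detection analytics? People often ask which fraud detection method they should choose. Each method has its own strengths, and a company should combine fraud solutions with the experience of fraud domain experts to build the fraud management system that best fits its business, products, and user types.

Rules engines and learning models are two fundamental building blocks of a traditional anti-fraud system. The rest of this article explains how the two work, what their strengths and limitations are, why unsupervised analytics outperforms rules engines and machine learning models, and why unsupervised analytics is needed to catch new types of fraud.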

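>>>> Rules engines

>>>> How they work

Rules engines separate business logic from application code, so security and risk analysts with SQL or database knowledge can manage and run the rules on their own. An effective rule can be expressed clearly in a few lines of logic: If A and B, then do C. For example: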

IF (user_email = type_free_email_service) AND (comment_character_count ≥ 150 per sec) {
    flag user_account as spammer
    mute comment
}

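Rules engines can also use weighted scoring. For example, each rule in the table below has a score, positive or negative, which an analyst can assign. The scores of all the rules are added up to give a total. The rules engine then builds business operation workflows around score thresholds. In a typical workflow, there are three types of actions depending on the score range:

1. Above 1000 - Deny (e.g., reject the transaction, suspend the account)

2. Below 300 - Accept (e.g., confirm the order, approve the content)

3. Between 300 and 1000 - Flag for additional review and place in a manual review pool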

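➜ Advantages

Rules engines can import data from databases and pick out blacklists (such as IP addresses) and other negative lists. Whenever a new fraud scenario occurs, an analyst adds a new rule to keep the company protected from foreseeable fraud risk. In this way, a rules engine lets a company avoid some recurring fraud.

➜ Limitations

Once fraud grows in scale, rules engines show their limitations. Fraudsters do not sit idle after being caught; they study how you caught them and then change their approach to avoid being caught again. As a result, a rule stays effective for only a limited time, perhaps a few weeks or even just a few days. Imagine running and testing hundreds or thousands of rules while also having to add new rules, delete or update old ones, and re-weight them every few days. Maintaining this consumes substantial operational resources, time, and money.

If a fraud analyst wants to calculate the accept, reject, and review rates under 3 rules, and adjust each rule's score according to how those rates change, that requires 8 changes: 2^3 = 8 (values^rules). Testing 10 rules with 3 different values requires more than 59,000 changes. As the number of rules grows, the number of changes grows rapidly.

Rules engines do not learn automatically from analyst observations or feedback. Because fraudsters frequently change their methods, the business is intermittently exposed to new attacks. In addition, rules engines process information in a binary fashion and may miss subtle nuances in the data, which leads to higher false positive rates and negative user experiences.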

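Supervised machine learning models

➜ How they work

Supervised machine learning is the most widely used machine learning approach in fraud detection. Its techniques include decision trees, random forests, nearest neighbors, support vector machines, and naive Bayes classification. Machine learning typically builds a model automatically from labeled data in order to detect fraudulent behavior.

When building a model, a clear understanding of what is and is not fraud plays a crucial role. The data fed into the model affects how well it detects fraud. Using known fraud data and normal data as the training set, you can train a learning model that covers and strengthens detection of the complex fraud that rules engines cannot handle.

Below is an example of how supervised machine learning classifies new data as fraud or non-fraud. The training data teaches the model the characteristics of two types of fraudsters: 1. credit card fraudsters and 2. spammers. Three features are very helpful for identifying the type of fraud attack: 1. email address structure, 2. IP address type, and 3. density of linked accounts (the type of fraud attack is the response variable). In practice, a typical model has hundreds or thousands of features.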

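In this example, the trained model identifies a user with the following characteristics as a credit card fraudster:

  • an email address made of 5 letters followed by 3 digits
  • use of an anonymous proxy
  • a medium density of linked accounts (e.g., 10)

A user with the following characteristics is identified as a spammer:

  • an email address randomly generated according to some pattern
  • use of a datacenter IP address
  • a high density of linked accounts (e.g., 30+)

Now suppose your model is assessing risk for the batch of users below. It computes each user's email address structure, IP address type, and linked-account density. Normally, the model will classify the second and third users as spammers and the first, fourth, and fifth users as credit card fraudsters.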

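➜ Advantages

Trained learning models cover and strengthen the areas that rules engines cannot reach, and they can keep improving their detection as more training data is added. Learning models can handle unstructured data (such as images and email content) and automatically recognize complex fraud patterns even when the input has thousands of features.

➜ Limitations

Although supervised machine learning models are powerful, they also have limitations. What happens when a new type of fraud appears for which there are no labeled examples? Because fraud methods change so often, this is common. Fraudsters keep changing their tactics and launch new attacks day and night. If the attack pattern has not been seen before and there is not enough training data, the trained model cannot return good, reliable results.

As the figure below shows, collecting and labeling data is the most important part of building a supervised learning model. Producing accurate training labels can take weeks to months. Labeling requires the fraud analysis team to review cases thoroughly, tag the data correctly, and validate it before it is used. Unless the model already has enough relevant training data, it will not recognize a new attack when one appears.

Unsupervised machine learning: going beyond rules engines and supervised machine learning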

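Both of the fraud detection frameworks above have clear limitations, and DataVisor's innovative unsupervised machine learning algorithm makes up for the shortcomings of both. Unsupervised detection does not rely on any labeled data to train a model. The core of the algorithm is unsupervised fraud behavior detection, which uses correlation analysis and similarity analysis to discover links between fraudulent user behaviors, form clusters, and uncover new fraud behaviors and cases in one or more of the clusters.

Unsupervised detection provides attack group information and automatically generates training data, which is then fed into the supervised machine learning module. With this data, the supervised model can go on to find fraudulent users outside the large attack groups. This framework allows DataVisor to find not only attacks launched from individual accounts but, more importantly, organized large-scale attacks carried out by fraud or crime rings made up of many accounts, adding critical early, full-coverage detection to a customer's fraud detection architecture.

DataVisor's correlation analysis groups users with similar fraudulent behavior into the same cluster. Another detection technique, anomaly detection, instead treats every user who does not match the behavior of good users as fraudulent. It rests on the assumption that bad users are single users or small groups isolated from normal users. The chart below shows fraudsters F1 and F3, fraud group F2, and good user groups G1 and G2. An anomaly detection model can only find this kind of isolated fraud and faces a major challenge when identifying large-scale group fraud. Here the advantage of unsupervised analytics over anomaly detection is clear.

DataVisor uses unsupervised analytics together with rules engines and machine learning models. For customers, this full-coverage detection provides a list of fraudsters, supplies new fraud detection models, and helps them create new detection rules. When DataVisor finds that a customer is facing a new, unknown type of fraud, unsupervised detection serves as an effective early warning.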

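By focusing on early detection and the discovery of unknown fraud, DataVisor helps customers improve and become more efficient across their fraud solutions:

  • Identifying fake user registrations and account takeovers
  • Detecting fraudulent financial transactions and activity
  • Discovering fake user acquisition and promotion abuse
  • Blocking social spam, fake posts, fake view counts, and fake likes

Translator: Lily.Wang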

Unsupervised Analytics: Moving Beyond Rules Engines and Learning Models

Unsupervised Machine Learning: A Next-Generation Anti-Fraud Analysis Method Beyond Rules Engines and Supervised Machine Learning

Rules engines, machine learning models, ID verification, reputation lookups (e.g. email and IP blacklists and whitelists), or unsupervised analytics? I’ve often been asked which one to use and whether you should go with only one over the others. There is a place for each to provide value, and you should anticipate incorporating some combination of these fraud solutions along with solid domain expertise to build a fraud management system that best accounts for your business, products and users. With that said, rules engines and learning models are two of the major foundational components of a company’s fraud detection architecture. I’ll explain how they work, discuss the benefits and limitations of each, and highlight the demand for unsupervised analytics that can go beyond rules engines and machine learning in order to catch new fraud that has yet to be seen.

Rules Engines

How they work

Rules engines partition the operational business logic from the application code, enabling non-engineering fraud domain experts (e.g. Trust & Safety or Risk Analysts) with SQL/database knowledge to manage the rules themselves. So what types of rules are effective? Rules can be as straightforward as a few lines of logic: If A and B, then do C. For example,

IF (user_email = type_free_email_service) AND (comment_character_count ≥ 150 per sec) {
    flag user_account as spammer
    mute comment
}

Rules engines can also employ weighted scoring mechanisms. For example, in the table below each rule has a score value, positive or negative, which can be assigned by an analyst. The points for all of the rules triggered will be added together to compute an aggregate score. Subsequently, rules engines aid in establishing business operation workflows based on the score thresholds. In a typical workflow, there could be three types of actions to take based on the score range:

  1. Above 1000 – Deny (e.g. reject a transaction, suspend the account)
  2. Below 300 – Accept (e.g. order is ok, approve the content post)
  3. Between 300 and 1000 – Flag for additional review and place into a manual review bin
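The weighted-scoring workflow above can be sketched as follows. The rule names and point values are invented for illustration; only the three score ranges (above 1000, below 300, and in between) come from the text.

```python
# Hypothetical rules and analyst-assigned points (positive = riskier).
RULE_SCORES = {
    "free_email": 200,
    "new_account": 350,
    "ip_on_blacklist": 600,
    "verified_phone": -250,
}


def aggregate_score(triggered_rules):
    """Add up the points of every rule that fired."""
    return sum(RULE_SCORES[rule] for rule in triggered_rules)


def decide(score):
    """Map an aggregate score to one of the three workflow actions."""
    if score > 1000:
        return "deny"
    if score < 300:
        return "accept"
    return "review"  # 300 through 1000 inclusive goes to manual review
```

For instance, `decide(aggregate_score(["free_email", "new_account", "ip_on_blacklist"]))` sums to 1150 and is denied, while a single `free_email` hit (200) is accepted.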

[Figure: table of rules and their score values]

Advantages

Rules engines can take black lists (e.g. IP addresses) and other negative lists derived from consortium databases as input data. An analyst can add a new rule as soon as he or she encounters a new fraud/risk scenario, helping the company benefit from the real-world insights of the analyst on the ground seeing the fraud every day. As a result, rules engines give businesses the control and capability to handle one-off brute force attacks, seasonality and short-term emerging trends.

Limitations

Rules engines have limitations when it comes to scale. Fraudsters don’t sit idle after you catch them. They will change what they do after learning how you caught them to prevent being caught again. Thus, the shelf life of rules can be a couple of weeks or even as short as a few days before their effectiveness begins to diminish. Imagine having to add, remove, and update rules and weights every few days when you’re in a situation with hundreds or thousands of rules to run and test. This could require huge operational resources and costs to maintain.

If a fraud analyst wants to calculate the accept, reject, and review rates for 3 rules and see how those rates change when each rule is adjusted down or up by 100 points, that would require 8 changes: 2^3 = 8 (values^rules). Testing 10 rules with 3 different values would be over 59K changes (3^10 = 59,049)! As the number of rules increases, the time to make adjustments increases quickly.
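The arithmetic here is values^rules: two candidate adjustments for each of 3 rules gives 2^3 = 8 combinations, and three candidate values for each of 10 rules gives 3^10 = 59,049. A short sketch (function names are illustrative) makes the growth concrete by enumerating the combinations:

```python
from itertools import product


def count_configurations(num_rules, values_per_rule):
    """Number of distinct rule-weight combinations to test."""
    return values_per_rule ** num_rules


def enumerate_adjustments(num_rules, deltas=(-100, 100)):
    """List every combination of per-rule adjustments."""
    return list(product(deltas, repeat=num_rules))
```

`enumerate_adjustments(3)` yields the 8 combinations for three rules; materializing the list quickly becomes impractical as the rule count grows, which is the maintenance cost the paragraph describes.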

Rules engines don’t automatically learn from analyst observations or feedback. As fraudsters adapt their tactics, businesses can be temporarily exposed to new types of fraud attacks. And since rules engines treat information in a binary fashion and may not detect subtle nuances, this can lead to higher instances of false positives and negative customer experiences.

Learning Models

How they work

Supervised machine learning is the most widely used learning approach when it comes to fraud detection. A few of the learning techniques include decision trees, random forests, nearest neighbors, Support Vector Machines (SVM) and Naive Bayes. Machine learning models often solve complex computations with hundreds of variables (high-dimensional space) in order to accurately determine cases of fraud.

Having a good understanding of both what is and what is not fraud plays a central role in the process of creating models. The input data to the models influences their effectiveness. The models are trained on known cases of fraud and non-fraud (i.e., labeled training data), which enables them to classify new data and cases as either fraudulent or not. Because of their ability to predict the label for a new unlabeled data set, trained learning models fill in the gap and bolster the areas where rules engines may not provide great coverage.

Below is a simplified example of how a supervised machine learning program would classify new data into the categories of non-fraud or fraud. Training data informs the model of the characteristics of two types of fraudsters: 1) credit card fraudsters and 2) spammers. Three features, 1) the email address structure, 2) the IP address type, and 3) the density of linked accounts, are indicative of the type of fraud attack (i.e., the response variable). Note that in reality, there could be hundreds of features for a model.

The trained model recognizes that a user with:

  • an email address that has 5 letters followed by 3 numbers
  • using an anonymous proxy
  • with a medium density (e.g. 10) of connected accounts

is a credit card fraudster.

It also recognizes that a user with:

  • an email address structure with a “dot” pattern
  • using an IP address from a datacenter
  • with a high density (e.g. 30+) of linked accounts

is a spammer.

Now suppose your model is evaluating new users from the batch of users below. It computes the email address structure, IP address type, and density of linked accounts for each user. If working properly, it will classify the users in Cases 2 and 3 as spammers and the users in Cases 1, 4 and 5 as credit card fraudsters.
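To make the example concrete, here is a toy nearest-neighbor classifier over the three features. It is only a sketch under stated assumptions: the training rows, the density buckets (10+ is medium, 30+ is high), the `ip_type` labels, and the regular expression for "5 letters followed by 3 numbers" are all made up for illustration, and a real model would be trained on far more data and features.

```python
import re


def email_shape(email):
    """Reduce an email address to a coarse structural category."""
    local = email.split("@", 1)[0]
    if re.fullmatch(r"[A-Za-z]{5}\d{3}", local):
        return "5letters_3digits"
    if "." in local:
        return "dot_pattern"
    return "other"


def density_bucket(linked_accounts):
    """Bucket the linked-account count (thresholds are assumptions)."""
    if linked_accounts >= 30:
        return "high"
    if linked_accounts >= 10:
        return "medium"
    return "low"


def features(user):
    return (email_shape(user["email"]), user["ip_type"],
            density_bucket(user["linked_accounts"]))


# Hypothetical labeled training data.
TRAINING = [
    ({"email": "smith123@example.com", "ip_type": "anonymous_proxy",
      "linked_accounts": 10}, "credit_card_fraudster"),
    ({"email": "j.o.h.n@example.com", "ip_type": "datacenter",
      "linked_accounts": 35}, "spammer"),
    ({"email": "alice@example.com", "ip_type": "residential",
      "linked_accounts": 1}, "not_fraud"),
]


def classify(user):
    """Return the label of the training example with the fewest mismatched features."""
    target = features(user)

    def mismatches(example):
        return sum(a != b for a, b in zip(features(example[0]), target))

    return min(TRAINING, key=mismatches)[1]
```

A new user such as `{"email": "brown456@example.com", "ip_type": "anonymous_proxy", "linked_accounts": 12}` matches the first training row on all three features and is labeled a credit card fraudster. A user whose pattern resembles nothing in `TRAINING` still gets the nearest label, which hints at why such models struggle with unseen attacks.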

[Figure: batch of example users, Cases 1 to 5]

Advantages

Because of their ability to predict the label for a new unlabeled data set, trained learning models fill in the gap and bolster the areas where rules engines may not provide great coverage. Learning models have the ability to digest millions of rows of data scalably, pick up from past behaviors and continually improve their predictions based on new and different data. They can handle unstructured data (e.g. images, email text) and recognize sophisticated fraud patterns automatically even if there are thousands of features/variables in the input data set. With learning models, you can also measure effectiveness and improve it by only changing algorithms or algorithm parameters.

Limitations

Trained learning models, while powerful, have their limitations. What happens if there are no labeled examples for a given type of fraud? Given how quickly fraud is evolving, this is not an uncommon occurrence. After all, fraudsters change schemes and conduct new types of attacks around the clock. If we have not encountered the fraud attack pattern, and therefore do not have sufficient training data, the trained learning models may not have the appropriate support to return good and reliable results.

As seen in the diagram below, collecting and labeling data is a crucial part of building a learning model, and the time required to generate accurate training labels can be weeks to months. Labeling can involve teams of fraud analysts reviewing cases thoroughly, categorizing them with the right fraud tags, and putting them through a verification process before they are used as training data. In the event a new type of fraud emerges, a learning model may not be able to detect it until weeks later, after sufficient data has been acquired to properly train it.

[Figure: supervised learning flow]

Unsupervised Analytics – Going Beyond Rules Engines and Learning Models

While both of these approaches are critical pieces of a fraud detection architecture, here at DataVisor we take it one step further. DataVisor employs unsupervised analytics, which do not rely on having prior knowledge of the fraud patterns. In other words, no training data is needed. The core component of the algorithm is the unsupervised attack campaign detection, which leverages correlation analysis and graph processing to discover the linkages between fraudulent user behaviors, create clusters and assign new examples into one or the other of the clusters.
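As a rough illustration of what "graph processing to discover linkages" can mean, the sketch below links accounts that share any attribute value and returns the connected components above a size threshold. This is a generic union-find technique offered as an assumption-laden stand-in; it is not DataVisor's actual algorithm, and the attribute strings and `min_size` are hypothetical.

```python
from collections import defaultdict


def find_campaigns(account_attrs, min_size=3):
    """Cluster accounts that are linked, directly or transitively,
    by at least one shared attribute value (e.g. IP or device ID)."""
    parent = {acct: acct for acct in account_attrs}

    def find(x):
        # Follow parent pointers to the cluster root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_seen = {}  # attribute value -> first account observed with it
    for acct, attrs in account_attrs.items():
        for attr in attrs:
            if attr in first_seen:
                union(acct, first_seen[attr])
            else:
                first_seen[attr] = acct

    clusters = defaultdict(set)
    for acct in account_attrs:
        clusters[find(acct)].add(acct)
    return [members for members in clusters.values() if len(members) >= min_size]
```

No labels are involved: accounts end up in the same cluster purely because of the links between them, so a campaign using attribute values never seen before can still surface as a group.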

The unsupervised campaign detection provides the attack campaign group info and also the self-generated training data, both of which can be fed into our machine learning models to bootstrap them. With this data, the supervised machine learning will pick up patterns and find the fraudulent users that don’t fit into these large attack campaign groups. This framework enables DataVisor to uncover fraud attacks perpetrated by individual accounts, as well as organized mass scale attacks coordinated among many users such as fraud and crime rings – adding a valuable piece to your fraud detection architecture with a “full-stack.”

Our correlation analysis groups fraudsters “acting” similarly into the same cluster. In contrast, anomaly detection, another useful technique, finds the set of fraud objects that are considerably dissimilar from the remainder of the good users. It does this by assuming anomalies do not belong to any group or that they belong to small/sparse clusters. See the graph below for anomaly detection illustrating fraudsters F1 and F3, fraud group F2, and good users G1 and G2. The benefits of unsupervised analytics are on display when comparing it to anomaly detection. While anomaly detection can find outlying fraudsters from a given data set, it would encounter a challenge identifying large fraud groups.
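The limitation can be seen in a toy density-based anomaly detector. The sketch assumes users are plotted as 2-D points and flags a point when it has too few neighbors within a radius; the radius, neighbor count, and coordinates are made up for illustration.

```python
import math


def sparse_points(points, radius=1.0, min_neighbors=2):
    """Flag points with fewer than min_neighbors other points within radius."""
    flagged = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbors < min_neighbors:
            flagged.append(p)
    return flagged


good_users = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]
fraud_ring = [(10.0, 10.0), (10.5, 10.0), (10.0, 10.5), (10.5, 10.5)]
lone_fraudster = [(5.0, 20.0)]
```

Running `sparse_points(good_users + fraud_ring + lone_fraudster)` flags only the lone fraudster. The fraud ring is as dense as the good users, so this kind of detector treats it as normal, which is the gap group-based analysis is meant to close.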

[Figure: anomaly detection with fraudsters F1 and F3, fraud group F2, and good users G1 and G2]

With unsupervised analytics, DataVisor collaborates with rules engines and machine learning models. For customers, the analytics provides them a list of the fraudsters and also gives their fraud analysts insights to create new rules. When DataVisor finds fraud that has not been encountered by a customer previously, the data from the unsupervised campaign detection can serve as early warning signals and/or training data to their learning models, creating new and valuable dimensions to their model’s accuracy.

By focusing on early detection and discovering unknown fraud, DataVisor has helped customers to become better and more efficient in solving fraud in a diverse range of areas such as:

  • Identifying fake user registration and account takeovers (ATO)
  • Detecting fraudulent financial transactions and activity
  • Discovering user acquisition and promotion abuse
  • Preventing social spam, fake posts, reviews and likes

Stay tuned for future blog posts where I will address topics such as new online fraud attacks, case review management tools, and a closer look into DataVisor’s fraud detection technology stack. If you want to learn more about how DataVisor can help you fight online fraud, please visit https://datavisor.com/ or schedule a trial.

How should a medium-to-large UGC platform (such as Sina Weibo) approach anti-spam work?

From Zhihu

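Machine learning is stirring up a small whirlwind in security, but there is a bug in it

Today, security is one of the fields that machine learning is pushing hard into.

| Bosses are eager to apply machine learning to security

If you attended RSA Conference 2016 in person, you would have found that almost no company talked about its security products without mentioning machine learning. Why is that?

To an outsider, machine learning may look like magic that solves every security problem: you feed a pile of unlabeled data into a machine learning system, and it picks out patterns in the data that even human experts cannot see, learns new behaviors, and adapts to threats in its environment. On top of that, you do not even have to bother encoding the rules, because the system takes care of all of that for you automatically.

If that were really the case, machine learning would indeed be this year's main event! Ironically, although everyone is making a great show of wanting to achieve something in this area, very few people truly understand what machine learning is or what it can actually be used for. Unsurprisingly, in this climate machine learning is mostly misused, especially in security.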

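| What is the right way to use machine learning to solve security problems effectively?

Applying machine learning to security mostly involves one technique: anomaly detection, which identifies whatever does not match an expected pattern or data set. Technology vendors should note that this technique works only under certain conditions. Evidently, though, they do not yet realize they have already made a mistake: they will tell you that, after analyzing your company's network traffic, they can use machine learning to root out hackers hiding in your network. In fact, machine learning cannot do that. At that point you should become suspicious of the vendor right away.

So under what conditions does it work? The answer is that machine learning is effective only when a low-dimensional problem comes with high-quality labeled data. Unfortunately, enterprises do not achieve this in practice. To detect new kinds of attacks, you need clear, labeled examples of attacks. In other words, without a thorough understanding of normal network behavior, machine learning cannot find hackers. Besides, hackers are cunning and will disguise themselves flawlessly.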

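| Where are machine learning and anomaly detection most valuable?

Where machine learning and anomaly detection are truly useful is in classifying human behavior.

It turns out that people are highly predictable, so it is possible to build very accurate models of individual user behavior and have those models detect anomalies.

In fact, there has already been some success here, for example implicit authentication. Implicit authentication uses biometric techniques to verify a user's identity based on factors such as keystroke pressure, rhythm, and typing patterns. Its advantages are clear for both user experience and security. At the very least, it spares users the burden of remembering passwords and the hassle of typing them. Because most of the factors that implicit authentication relies on are low-dimensional, machine learning only has to handle a small number of parameters, which also makes it easy to collect high-quality labeled data for each user. Just as machine learning can still match images correctly in computer vision despite variation or noise, it can also authenticate a person by recognizing that individual's distinctive behavior.

But how does it do that?

The way you walk, stand, and move is determined by many factors together, such as physiology, age, gender, and muscle memory. For any one individual, these movements do not change much. So, without your noticing, the phone in your pocket captures this information precisely through its built-in sensors and records it. Four seconds of motion data is enough to identify a person by the way they move. Identity can also be verified by comparing a user's historical and current location records. People live by habits, and by observing when they leave and from where, it is possible to predict whether the person being observed is really the user.

Our phones and computers already carry many sensors, and their number will surge as wearables spread and the Internet of Things develops. Large amounts of behavioral and environmental data are collected this way and supplied to machine learning, which builds an individual model for each user and finds the relationships among the factors.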

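| What homework do you need to do before using machine learning for security?

To protect a system, you must let it know in advance which threat models exist.

First and most important is collecting data. The data must be very accurate before it can be used to train a system to resist threats. Still, if an authentication system does come under attack, you need not worry too much, because changes in behavior are relatively easy to detect and the system quickly recognizes the anomaly. For example, if a device is stolen, the motion, location, and usage recorded after the theft will differ noticeably from what was recorded before. The system does tolerate such possible anomalies, though; in that case the user has to confirm their identity another way, and the system is adjusted to minimize false positives. Once four factors are combined across different devices, the false positive rate of implicit authentication falls below 0.001%.

No kind of machine learning is magical enough to solve every security problem. Designers who want to build a useful security product with machine learning need a deep understanding of the underlying system and must accept that many problems are not suited to machine learning. But don't worry: the technology companies at the crest of the wave will eliminate these problems one by one.

Machine learning is brewing an unstoppable market frenzy in the security field.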

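The future of cybersecurity depends on machine learning

Information security has always been a game of cat and mouse. The good guys build a new wall, and the bad guys find a way through or around it. Lately, though, the bad guys seem to be getting past the wall more and more easily. Stopping them will take a huge leap in our capabilities, which may mean using machine learning much more widely.

This may surprise observers outside the industry, but machine learning has not yet had a broad impact on IT security. Security experts say that although credit card fraud detection systems and network equipment makers use advanced analytics, the automated security tasks common to virtually every large company, such as detecting malware on PCs or identifying malicious activity on the network, still depend largely on people writing code and configuring them in a timely way.

Although machine learning for cybersecurity has been the subject of extensive academic research, we are only now beginning to see its effect on security tools. Several startups (such as Invincea, Cylance, Exabeam, and Argyle Data) are using machine learning to power security tools that are faster and more accurate than those offered by today's major security software vendors.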

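Defeating malware with data

Invincea is a Virginia-based company that specializes in malware detection and network security. Josh Saxe, its chief research engineer, believes it is time to abandon the 1990s-era analysis techniques based on signatures and file hashes.

Saxe said: “I understand that some antivirus companies have moved into machine learning, but what they live on is still signature detection. They detect malware based on file hashes or pattern matching, which are analysis techniques human researchers came up with to detect a given sample.”

Invincea's advanced malware detection system is based in part on DARPA's Cyber Genome program.

He said: “They have been successful at detecting malware that was common in the past, but they are not good at detecting new malware, which is one reason cybercrime is thriving today. Even if you have antivirus installed, others can still break into your computer, because signature-based detection simply doesn't work.”

At Invincea, Saxe leads a team using machine learning to build a better malware detection system. The project is part of DARPA's Cyber Genome program and mainly uses machine learning to defeat detected malware, including reverse-engineering how the malware works, performing social network analysis on code, and using machine learning systems to quickly deal with new malware samples that appear in the wild.

“We have shown that the machine-learning-based approach we developed is more effective than traditional antivirus systems. A machine learning system can automate the work human analysts do, and even do it better. Combine a machine learning system with a large amount of training data and it can beat traditional signature-based detection systems.”

Invincea uses deep learning to speed up the training of its algorithms. Saxe currently has about 1.5 million benign or malicious software samples for training, all of which runs on GPUs using Python tools. He hopes that as the sample data grows to 30 million, the performance advantage of the machine learning system will grow linearly.

“The more training data we have, and the more malware we use to train the machine learning system, the more pronounced its performance advantage in detecting malware will be,” he said.

Saxe said Invincea currently plans to add more deep-learning-based capabilities to its endpoint security products in 2016. Specifically, it will add this capability to Cynomix, an endpoint security product that already uses machine learning.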

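Detecting malicious users

Machine learning also helps with another aspect of IT security: detecting malicious insiders and identifying compromised accounts.

Just as mainstream antivirus products rely on signatures to identify malware, tools that monitor user activity also rely on signatures. Signature-based detection is starting to fail for malware, and it is likewise performing poorly for user activity.

“In the past, enterprise security staff relied heavily on signature-based methods, such as IP address blacklists,” said Derek Lin, chief data scientist at Exabeam, a provider of user behavior analytics tools.

He said: “This approach looks for things that have already happened. The problem with signature-based methods is that you only see the signature after the event has occurred. Today, security staff are very focused on detecting malicious events that have no signature.”

Exabeam builds a map of user activity by tracking users' remote connection information, devices, IP addresses, and credentials.

Today, savvy criminals know that slightly changing their path is enough to beat signature detection. So if the detection system of the network under attack keeps an IP blacklist, a cybercriminal can defeat it by hopping back and forth among the large range of network domains under their control.

Rather than sticking with yesterday's defensive strategy, Exabeam has taken a proactive approach based on Gartner's concept of UBA (User Behavior Analytics). The idea behind UBA is that you cannot know in advance whether a machine or user is good or bad, so you assume they are malicious and that your network is vulnerable, and you constantly monitor and model everyone's behavior in order to find the malicious actors.

This is where machine learning algorithms come in. Lin and his team draw on a variety of sources (such as server logs and virtual private network (VPN) logs) and use a range of supervised and unsupervised machine learning algorithms to detect abnormal patterns in user behavior.

Lin said: “All of these are used to paint a picture of a user's behavior; the question is how. For every user or entity on the network, we try to build a normal profile, and this involves statistical analysis. Then, at a conceptual level, we look for deviations from the norm... We use a behavior-based approach to find anomalies in the system and surface them so security analysts can review them.”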

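The future of machine learning in security

“Think about the major waves of cybersecurity we have been through. Cybercriminals are looking for effective ways to break security systems, and we have to fight back. Will machine learning become a mainstay of that arsenal? The answer is yes,” said Patrick Townsend, founder and CEO of security software vendor Townsend Security.

He said: “We are now beginning to get systems that can effectively process large volumes of unstructured data and detect patterns, and I expect the products of the next wave of cybersecurity to be based on cognitive computing. Look at Watson: if it can win at Jeopardy, why can't it be used broadly to analyze and understand cybersecurity events? I think we are at the very beginning of using cognitive computing to help with security problems.”

Invincea's Saxe hopes to ride that wave. He said: “I'm not surprised that companies in this field have not caught this wave and produced algorithms based on new deep learning. Training machine learning this way has only recently become feasible. It could not have been done effectively 10 years ago.”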

Machine learning and big data know it wasn’t you who just swiped your credit card

You’re sitting at home minding your own business when you get a call from your credit card’s fraud detection unit asking if you’ve just made a purchase at a department store in your city. It wasn’t you who bought expensive electronics using your credit card – in fact, it’s been in your pocket all afternoon. So how did the bank know to flag this single purchase as most likely fraudulent?

Credit card companies have a vested interest in identifying financial transactions that are illegitimate and criminal in nature. The stakes are high. According to the Federal Reserve Payments Study, Americans used credit cards to pay for 26.2 billion purchases in 2012. The estimated loss due to unauthorized transactions that year was US$6.1 billion. The federal Fair Credit Billing Act limits the maximum liability of a credit card owner to $50 for unauthorized transactions, leaving credit card companies on the hook for the balance. Obviously, fraudulent payments can have a big effect on the companies’ bottom lines. The industry requires any vendors that process credit cards to go through security audits every year. But that doesn’t stop all fraud.

In the banking industry, measuring risk is critical. The overall goal is to figure out what’s fraudulent and what’s not as quickly as possible, before too much financial damage has been done. So how does it all work? And who’s winning in the arms race between the thieves and the financial institutions?

Gathering the troops

From the consumer perspective, fraud detection can seem magical. The process appears instantaneous, with no human beings in sight. This apparently seamless and instant action involves a number of sophisticated technologies in areas ranging from finance and economics to law to information sciences.

Of course, there are some relatively straightforward and simple detection mechanisms that don’t require advanced reasoning. For example, one good indicator of fraud can be an inability to provide the correct zip code affiliated with a credit card when it’s used at an unusual location. But fraudsters are adept at bypassing this kind of routine check – after all, finding out a victim’s zip code could be as simple as doing a Google search.

Traditionally, detecting fraud relied on data analysis techniques that required significant human involvement. An algorithm would flag suspicious cases to be closely reviewed ultimately by human investigators who may even have called the affected cardholders to ask if they’d actually made the charges. Nowadays the companies are dealing with a constant deluge of so many transactions that they need to rely on big data analytics for help. Emerging technologies such as machine learning and cloud computing are stepping up the detection game.

Learning what’s legit, what’s shady

Simply put, machine learning refers to self-improving algorithms, which are predefined processes conforming to specific rules, performed by a computer. A computer starts with a model and then trains it through trial and error. It can then make predictions such as the risks associated with a financial transaction.

A machine learning algorithm for fraud detection needs to be trained first by being fed the normal transaction data of lots and lots of cardholders. Transaction sequences are an example of this kind of training data. A person may typically pump gas one time a week, go grocery shopping every two weeks and so on. The algorithm learns that this is a normal transaction sequence.

After this fine-tuning process, credit card transactions are run through the algorithm, ideally in real time. It then produces a probability number indicating the possibility of a transaction being fraudulent (for instance, 97%). If the fraud detection system is configured to block any transactions whose score is above, say, 95%, this assessment could immediately trigger a card rejection at the point of sale.
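The blocking step described above is a simple threshold on the model's output. A minimal sketch, using the 95% figure from the example (the function name and return values are illustrative):

```python
def decide(fraud_probability, block_threshold=0.95):
    """Reject when the model's fraud probability exceeds the configured threshold."""
    if not 0.0 <= fraud_probability <= 1.0:
        raise ValueError("fraud_probability must be between 0 and 1")
    return "reject" if fraud_probability > block_threshold else "approve"
```

With these settings, `decide(0.97)` rejects the transaction and `decide(0.50)` approves it. Where the threshold sits is a business choice: lowering it blocks more fraud but also declines more legitimate purchases.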

The algorithm considers many factors to qualify a transaction as fraudulent: trustworthiness of the vendor, a cardholder’s purchasing behavior including time and location, IP addresses, etc. The more data points there are, the more accurate the decision becomes.

This process makes just-in-time or real-time fraud detection possible. No person can evaluate thousands of data points simultaneously and make a decision in a split second.

Here’s a typical scenario. When you go to a cashier to check out at the grocery store, you swipe your card. Transaction details such as time stamp, amount, merchant identifier and membership tenure go to the card issuer. These data are fed to the algorithm that’s learned your purchasing patterns. Does this particular transaction fit your behavioral profile, consisting of many historic purchasing scenarios and data points?

The algorithm knows right away if your card is being used at the restaurant you go to every Saturday morning – or at a gas station two time zones away at an odd time such as 3:00 a.m. It also checks if your transaction sequence is out of the ordinary. If the card is suddenly used for cash-advance services twice on the same day when the historic data show no such use, this behavior is going to up the fraud probability score. If the transaction’s fraud score is above a certain threshold, often after a quick human review, the algorithm will communicate with the point-of-sale system and ask it to reject the transaction. Online purchases go through the same process.
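A behavioral profile check of this kind can be approximated with basic statistics. The sketch below is a deliberate simplification and not how any particular issuer works: it assumes the profile is just a list of past purchase amounts plus the set of hours in which the cardholder usually shops, and the 3-standard-deviation limit is an arbitrary choice.

```python
from statistics import mean, stdev


def amount_zscore(history_amounts, amount):
    """How many standard deviations the amount is from the cardholder's mean.

    Needs at least two past amounts, since stdev() is undefined for fewer.
    """
    mu = mean(history_amounts)
    sigma = stdev(history_amounts)
    return 0.0 if sigma == 0 else (amount - mu) / sigma


def is_out_of_profile(history_amounts, usual_hours, amount, hour, z_limit=3.0):
    """Flag a purchase with an unusual amount or made at an unusual hour."""
    unusual_amount = abs(amount_zscore(history_amounts, amount)) > z_limit
    unusual_hour = hour not in usual_hours
    return unusual_amount or unusual_hour
```

Given past amounts of `[40, 55, 48, 52, 45]` and usual shopping hours of 8 through 21, a $50 purchase at 18:00 fits the profile, while the same purchase at 3:00 a.m. or a $900 purchase does not. A production system would feed many such signals into the model rather than apply them as hard rules.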

In this type of system, heavy human interventions are becoming a thing of the past. In fact, they could actually be in the way since the reaction time will be much longer if a human being is too heavily involved in the fraud-detection cycle. However, people can still play a role – either when validating a fraud or following up with a rejected transaction. When a card is being denied for multiple transactions, a person can call the cardholder before canceling the card permanently.

Computer detectives, in the cloud

The sheer number of financial transactions to process is overwhelming, truly in the realm of big data. But machine learning thrives on mountains of data – more information actually increases the accuracy of the algorithm, helping to reduce false positives. These can be triggered by suspicious transactions that are really legitimate (for instance, a card used at an unexpected location). Too many alerts are as bad as none at all.

It takes a lot of computing power to churn through this volume of data. For instance, PayPal processes more than 1.1 petabytes of data for 169 million customer accounts at any given moment. This abundance of data – one petabyte, for instance, is more than 200,000 DVDs’ worth – has a positive influence on the algorithms’ machine learning, but can also be a burden on an organization’s computing infrastructure.

Enter cloud computing. Off-site computing resources can play an important role here. Cloud computing is scalable and not limited by the company’s own computing power.

Fraud detection is an arms race between good guys and bad guys. At the moment, the good guys seem to be gaining ground, with emerging innovations such as chip-and-PIN technology, combined with encryption capabilities, machine learning, big data and, of course, cloud computing.

Fraudsters will surely continue trying to outwit the good guys and challenge the limits of the fraud detection system. Drastic changes in the payment paradigms themselves are another hurdle. Your phone is now capable of storing credit card information and can be used to make payments wirelessly – introducing new vulnerabilities. Luckily, the current generation of fraud detection technology is largely neutral to the payment system technologies.