Machine learning (ML) is such a hot area in security right now.
At the 2016 RSA Conference, you would be hard-pressed to find a company that is not claiming to use ML for security. And why not? To the layperson, ML seems like the magic solution to all security problems. Take a bunch of unlabeled data, pump it through a system with some ML magic inside, and it can somehow identify patterns even human experts can’t find, all while learning and adapting to new behaviors and threats. Rather than having to code the rules, these systems can discover the rules all by themselves.
Oh, if only that were the case! ML is this year’s “big data”: Everyone is claiming to do it, but few actually do it right or even understand what it’s good for. Especially in security, I’ve seen more misapplications than appropriate ones.
Most applications of ML in security use a form of anomaly detection, which is used to spot events that do not match an expected pattern. Anomaly detection is a useful technique in certain circumstances, but too often, vendors misapply it. For example, they will claim to analyze network traffic in an enterprise and use ML to find hackers in your network. This does not work, and you should be immediately skeptical of the vendors who make this claim.
Effective machine learning requires a low-dimensional problem with high-quality labeled data. Unfortunately, deployments in real enterprises have neither. Detecting novel attacks requires either clear, labeled examples of attacks, which you do not have by definition, or a complete, exhaustive understanding of “normal” network behavior, which is impossible for any real network. And any sophisticated attacker will make an attack appear as seamless and “typical” as possible, to avoid setting off alarms.
Where does ML work?
One example where ML and anomaly detection can actually work well for security is in classifying human behavior. Humans, it turns out, are fairly predictable, and it is possible to build reasonably accurate models of an individual user’s behavior and detect when new activity doesn’t match that user’s normal pattern.
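To make the idea of a per-user model concrete, here is a minimal sketch. It assumes a single numeric behavioral feature (the keystroke-interval values are hypothetical) and a simple z-score test; the function names and the threshold of 3.0 are illustrative choices, not a description of any particular product.

```python
from statistics import mean, stdev

def fit_profile(samples):
    """Summarize one user's history of a single behavioral feature."""
    return mean(samples), stdev(samples)

def is_anomalous(profile, value, z_threshold=3.0):
    """Flag an observation that falls far outside the user's usual range."""
    mu, sigma = profile
    if sigma == 0:
        # No variation in the history: anything different is unusual.
        return value != mu
    return abs(value - mu) / sigma > z_threshold

# Hypothetical history: seconds between keystrokes for one user.
history = [0.21, 0.19, 0.22, 0.20, 0.18, 0.21, 0.20]
profile = fit_profile(history)

print(is_anomalous(profile, 0.20))  # False: close to the user's norm
print(is_anomalous(profile, 0.45))  # True: far outside the user's norm
```

A real system would use many features and a richer model, but the shape is the same: learn what is normal for one person, then score new observations against that baseline.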
We’ve had success in using ML for implicit authentication by analyzing a user’s biometrics, behavior, and environment. Implicit authentication is a technique that allows users to authenticate without performing any explicit actions like entering a password or swiping a fingerprint. This has clear benefits for both the user experience and security. Users don’t need to be bothered with extra steps, we can use many authentication factors (rather than just one, a password), and it can happen continuously in the background.
Implicit authentication is well-suited to ML because most of the factors are low dimensional, meaning they involve a small number of parameters, and you can passively gather high-quality labeled data about user identities. Much like ML is effective in matching images for computer vision even in the presence of variance and noise, it is also effective in matching unique human behavioral aspects.
One example of this technology is how we can authenticate users based on unique aspects of the way they move. Attributes of the way you walk, sit, and stand are influenced by a large number of factors (including physiology, age, gender, and muscle memory), but are largely consistent for an individual. It is actually possible to accurately detect some of these attributes from the motion sensors of a phone in your pocket. In fact, after four seconds of motion data from that phone, we can detect enough of these attributes to identify you. Another example is in using a user’s location history to authenticate them. Humans are creatures of habit, and by looking at where they came from and when, we can make an estimate of whether it’s them.
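The location-history idea can be sketched in a few lines. This is a toy illustration under stated assumptions: the visit log, the place labels, and the `build_habits` and `familiarity` helpers are all hypothetical, and the score is just the share of past transitions that match the current one.

```python
from collections import Counter

def build_habits(visits):
    """Count (previous place, current place, hour) transitions in a visit log."""
    habits = Counter()
    for (prev_place, _), (place, hour) in zip(visits, visits[1:]):
        habits[(prev_place, place, hour)] += 1
    return habits

def familiarity(habits, prev_place, place, hour):
    """Share of past transitions that match this one (0.0 if never seen)."""
    total = sum(habits.values())
    return habits[(prev_place, place, hour)] / total if total else 0.0

# Hypothetical log of (place, hour of arrival) for one user.
visits = [
    ("home", 8), ("office", 9), ("home", 18), ("office", 9),
    ("home", 18), ("office", 9), ("gym", 19),
]
habits = build_habits(visits)

usual = familiarity(habits, "home", "office", 9)       # a routine commute
unusual = familiarity(habits, "office", "airport", 3)  # never seen before
print(usual, unusual)
```

A routine transition scores high and an unseen one scores zero. On its own that is weak evidence, which is why it makes sense to treat location as one factor among several rather than as a complete check.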
There are enough sensors in phones and computers (and more recently, wearables and IoT devices) that it is possible to passively pick up a large number of unique attributes about a user’s behavior and environment. We can then use ML to build a unique model for an individual user and find correlations between factors.
Threat models and anomaly detection
In any security system, it is important to understand the threat models you are trying to protect against. When using ML for security, you need to explicitly gather data, model the threats your system is protecting against, and use the model to train your system. Fortunately, for attacks against authentication, it is often possible to detect behavioral changes. For example, when a device is stolen, there are often clear changes in terms of its movement, location, and usage. And because false negatives are acceptable in that they just require the user to re-authenticate with a different method, we can tune the system to minimize false positives. In fact, once we combine four factors across multiple devices, we can get below a 0.001 percent false positive rate on implicit authentication.
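Some back-of-the-envelope arithmetic shows why combining factors can push the false positive rate so low. The sketch below rests on strong assumptions that are not taken from any real system: the factors are statistically independent, implicit authentication succeeds only if every factor accepts, and the per-factor error rates are made-up numbers. Here a false positive means accepting someone who is not the user, and a false negative means failing to recognize the real user.

```python
from math import prod

def combined_false_positive_rate(per_factor_fpr):
    """An impostor is accepted only if every independent factor is fooled."""
    return prod(per_factor_fpr)

def combined_false_negative_rate(per_factor_fnr):
    """The real user is rejected if at least one factor fails to match."""
    return 1 - prod(1 - rate for rate in per_factor_fnr)

# Hypothetical per-factor error rates, chosen only for illustration.
fpr = combined_false_positive_rate([0.05, 0.05, 0.05, 0.05])
fnr = combined_false_negative_rate([0.02, 0.02, 0.02, 0.02])

print(f"combined false positive rate: {fpr * 100:.6f} percent")
print(f"combined false negative rate: {fnr * 100:.2f} percent")
```

With four factors that are each fooled 5 percent of the time, the combined false positive rate is about 0.000625 percent, while the false negative rate rises to roughly 7.8 percent. That is the trade described above: more frequent fallbacks to explicit re-authentication in exchange for far fewer wrongful accepts. Real factors are rarely fully independent, so treat these numbers as an optimistic bound.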
There is no magic machine learning genie that can solve all your security problems. Building an effective security product that uses ML requires a deep understanding of the underlying system, and many security problems are just not appropriate for ML. For those that are, it’s a very powerful technique. And don’t worry, the companies on the hype train will soon move on to newer fads, like mobile self-driving AR blockchain drone marketplaces.