# Linear Regression

### (1) Linear Regression

Writing the residual sum of squares in matrix form:

$\sum_{i=1}^{m}(y_{i}-x_{i}\Theta)^{2} = (Y-X\Theta)^{T}(Y-X\Theta)$

$(Y-X\Theta)^{T}(Y-X\Theta)=Y^{T}Y-2Y^{T}X\Theta + \Theta^{T}X^{T}X\Theta$

Differentiating with respect to $\Theta$ and setting the derivative to zero gives $-2X^{T}Y + 2X^{T}X\Theta=0$; solving for $\Theta$ yields $\Theta = (X^{T}X)^{-1}X^{T}Y$. Thus, for a data matrix $X$ and a column vector $Y$, the best linear regression coefficients are

$\Theta = (X^{T}X)^{-1}X^{T}Y.$
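A minimal NumPy sketch of this closed-form solution (assuming the bias column is already part of $X$ and that $X^{T}X$ is invertible):

```python
import numpy as np

def linear_regression(X, y):
    """Ordinary least squares via the normal equation: theta = (X^T X)^-1 X^T y."""
    XtX = X.T @ X
    if np.linalg.det(XtX) == 0.0:
        raise ValueError("X^T X is singular; the normal equation has no unique solution")
    return np.linalg.inv(XtX) @ X.T @ y

# Toy usage: fit y = 1 + 2*x on a few points (first column of X is the bias term).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = linear_regression(X, y)   # approximately [1.0, 2.0]
```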

### (2) Locally Weighted Linear Regression

With a per-sample weight $w_{i}$ for each training example, collected in a diagonal matrix $W$, the weighted residual sum of squares is

$\sum_{i=1}^{m}w_{i}(y_{i}-x_{i}\Theta)^{2} = (Y-X\Theta)^{T}W(Y-X\Theta),$

$(Y-X\Theta)^{T}W(Y-X\Theta)=Y^{T}WY-2Y^{T}WX\Theta+\Theta^{T}X^{T}WX\Theta,$

Differentiating with respect to $\Theta$ gives

$-2(Y^{T}WX)^{T} + 2 X^{T}WX\Theta = -2X^{T}WY+2X^{T}WX\Theta.$

Setting this to zero and solving for $\Theta$:

$\Theta = (X^{T}WX)^{-1}X^{T}WY.$

The weights are typically given by a Gaussian kernel centered at the query point $x$, with bandwidth $k$:

$w_{i} = \exp\{-\frac{(x_{i}-x)^{2}}{2k^{2}}\}.$
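A small Python sketch of locally weighted linear regression along these lines; note that the fit is recomputed for every query point, so this only scales to small datasets (the bandwidth `k` is a free parameter):

```python
import numpy as np

def lwlr_predict(x_query, X, y, k=1.0):
    """Predict at one query point using locally weighted linear regression."""
    # Gaussian kernel weights: w_i = exp(-||x_i - x||^2 / (2 k^2))
    diff = X - x_query
    w = np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * k ** 2))
    W = np.diag(w)
    XtWX = X.T @ W @ X
    if np.linalg.det(XtWX) == 0.0:
        raise ValueError("X^T W X is singular for this query point")
    theta = np.linalg.inv(XtWX) @ X.T @ W @ y
    return x_query @ theta
```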

### (3) Ridge Regression and LASSO

Ridge regression adds an $\ell_2$ penalty to the least-squares objective, which yields the closed-form solution

$\Theta = (X^{T}X+\lambda I)^{-1}X^{T}Y.$

The penalized objective is

$\sum_{i=1}^{m}(y_{i}-x_{i}\Theta)^{2}+\lambda||\Theta||_{2}^{2}=(Y-X\Theta)^{T}(Y-X\Theta)+\lambda \Theta^{T}\Theta,$

$(Y-X\Theta)^{T}(Y-X\Theta)+\lambda \Theta^{T}\Theta = Y^{T}Y - 2\Theta^{T}X^{T}Y+\Theta^{T}X^{T}X\Theta+\lambda\Theta^{T}I\Theta.$

Differentiating with respect to $\Theta$ gives

$-2X^{T}Y+2(X^{T}X+\lambda I)\Theta,$

and setting this to zero recovers the solution above. LASSO instead penalizes the $\ell_1$ norm of the coefficients,

$\sum_{i=1}^{m}(y_{i}-x_{i}\Theta)^{2}+\lambda ||\Theta||_{1},$

which has no closed-form solution but tends to drive some coefficients exactly to zero.
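A corresponding NumPy sketch of the ridge closed form (for simplicity the penalty here is applied to every coefficient, including any bias column):

```python
import numpy as np

def ridge_regression(X, y, lam=0.1):
    """Ridge regression via its closed form: theta = (X^T X + lam*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.inv(X.T @ X + lam * np.eye(n_features)) @ X.T @ y
```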

### (4) Forward Stagewise Linear Regression

Standardize the data so that each feature has zero mean and unit variance.

• Set the current lowest error to +∞
• For each feature:
• For both increasing and decreasing the coefficient:
• Change that one coefficient by a small step to obtain a new weight vector W
• Compute the error under the new W
• If the error is lower than the current lowest error, set Wbest to this W
• Set W to the new Wbest
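A minimal Python sketch of the procedure above, with the usual outer loop over a fixed number of iterations and a step size `eps` added as hyperparameters (both are assumptions; error is measured as the residual sum of squares):

```python
import numpy as np

def stagewise_regression(X, y, eps=0.01, num_iters=200):
    """Forward stagewise linear regression on standardized data."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    w_best = w.copy()
    for _ in range(num_iters):
        lowest_error = np.inf
        for j in range(n_features):
            for sign in (1, -1):
                w_test = w.copy()
                w_test[j] += eps * sign          # increase or decrease one coefficient
                error = np.sum((y - X @ w_test) ** 2)
                if error < lowest_error:
                    lowest_error = error
                    w_best = w_test
        w = w_best.copy()
    return w
```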

# How should a large or medium-sized UGC platform (such as Sina Weibo) approach anti-spam work?

## From Zhihu

aviat: Lust, gluttony, greed, sloth, wrath, envy, pride.

# How to Reach the Top 10% in Your First Kaggle Competition

## Introduction

Kaggle is currently the largest gathering place for data scientists. Many companies put up their own data and prize money to organize data competitions on Kaggle. I recently finished my first competition and ranked 98th (~ 5%) out of 2125 teams. Since it was my first time competing, I am quite satisfied with this result. Besides your rank, a finished competition also shows one of three tiers: Prize Winner, top 10%, or top 25%, so many newcomers to Kaggle aim for the top 25% or top 10%. In this post, I draw on my first competition and on what I have learned from other Kagglers to offer some practical guidance for newcomers who have just heard of Kaggle and want to push for the top 10%.

The vast majority of Kagglers use Python or R. Since I mainly use Python, the examples in this post are Python-based, but R users should have little trouble understanding the ideas behind the tools.

• Different competitions have different tasks: classification, regression, recommendation, ranking and so on. The training set and testing set become available for download once the competition starts.
• A competition usually lasts 2 ~ 3 months, and each team can only submit a limited number of times per day, usually 5.
• Normally you get score feedback immediately after submitting. Different competitions use different scoring metrics; the metric in use is shown at the top of the score page.
• The feedback score is computed on one part of the testing set; the remaining part is used to compute the final result, so the final ranking can shift.
• LB refers to the score on the Leaderboard; as noted above, there is a Public LB and a Private LB.
• The score from your own Cross Validation is usually called CV or Local CV. In general, CV results are more reliable than LB scores.
• Beginners can find many useful lessons and insights in the competition's Forum and Scripts. Don't be shy about asking questions; Kagglers are very friendly.

P.S. This post assumes the reader already has a basic grasp of the fundamental concepts and common models of Machine Learning. Enjoy Reading!

## General Approach

### Data Exploration

#### Visualization

• Inspect the distribution of the target variable. An imbalanced distribution can seriously hurt performance, depending on the scoring metric and the specific model used.
• For a Numerical Variable, a Box Plot gives an intuitive view of its distribution.
• For coordinate-like data, a Scatter Plot shows the distribution and whether outliers are present.
• For classification problems, plot the data colored by Label; this is very helpful for constructing Features.
• Plot pairwise distributions and correlations between variables.

### Data Preprocessing

• Sometimes the data is scattered across several files and needs to be Joined together.
• Handle Missing Data.
• Handle Outliers.
• Convert the representation of certain Categorical Variables when necessary.
• Some Float variables may have been converted from unknown Int variables, and the precision lost in that conversion leaves unnecessary Noise in the data: two values that were originally identical start to differ at some decimal place. This can have a very negative effect on the Model, so the Noise should be removed or weakened.

### Feature Engineering

#### Feature Selection

• The fewer the Features, the faster the training.
• Some Features may be linearly related to one another, which hurts the Model's performance.
• By picking out the most important Features, you can use the results of various operations and combinations of them as new Features, which can bring unexpected improvements.

The most practical approach to Feature Selection is simply to look at the Feature Importance obtained after training a Random Forest. Some other algorithms are theoretically more Robust but lack practical, efficient implementations (for example this one). In principle, increasing the number of trees in the Random Forest strengthens its Robustness to Noisy Data to some extent.

### Model Selection

• Random Forest
• Extra Randomized Trees

• SVM
• Linear Regression
• Logistic Regression
• Neural Networks

#### Model Training

• eta: the step size used when updating weights after each iteration. The smaller it is, the slower the training.
• num_round: the total number of iterations.
• subsample: the fraction of the training data used to train each tree; used to prevent Overfitting.
• colsample_bytree: the fraction of features used to train each tree, similar to max_features in RandomForestClassifier.
• max_depth: the maximum depth of each tree. Unlike Random Forest, Gradient Boosting will eventually Overfit if its depth is not limited.
• early_stopping_rounds: stop training early once the score on an Out Of Sample validation set has not improved for this many consecutive iterations; also used to prevent Overfitting.

1. Set aside part of the training data as a validation set.
2. Start with a relatively high eta (e.g. 0.1) and a num_round of 300 ~ 500.
3. Use Grid Search to tune the other parameters.
4. Gradually lower eta to find the best value.
5. Using the validation set as the watchlist, retrain on the training set with the best parameter combination. Watch the algorithm's output to see how the validation score changes after each iteration, and derive the best early_stopping_rounds from that.

#### Cross Validation

Cross Validation is an extremely important step. It tells you whether your Model has Overfit and whether it can really Generalize to the testing set. In many competitions the Public LB is unreliable for one reason or another. When you improve a Feature or Model and get a better CV score but a worse LB score after submitting, the common wisdom is to trust the CV result. Ideally, scores from several different CV schemes would improve together with the LB, but such competitions are not that common.

### Ensemble Generation

Ensemble Learning means combining several different Base Models into one Ensemble Model. It can lower both the Bias and the Variance of the final model (see this paper for a proof; I have been studying related theory and may write a follow-up post about it), so it raises the score while also reducing the risk of Overfitting. In today's Kaggle competitions it is almost impossible to win prize money without an Ensemble.

• Bagging: train each Base Model on a different random subset of the training data, then take an equally weighted Vote over the Base Models. This is how Random Forest works.
• Boosting: train Base Models iteratively, reweighting the training samples after each iteration according to which ones were mispredicted. This is how Gradient Boosting works. It performs better than Bagging but is more prone to Overfitting.
• Blending: train different Base Models on disjoint subsets of the data and take a (weighted) average of their outputs. Simple to implement, but it makes less use of the training data.
• Stacking: discussed in detail below.

• The correlation between Base Models should be as low as possible. This is why non-Tree-based Models are usually worth including in the Ensemble even though they don't perform as well: the greater the Ensemble's Diversity, the lower the final Model's Bias.
• The performance of the Base Models should not differ too much. This is really a Trade-off: in practice there may be only a handful of Models with comparable performance, and they may be fairly correlated with each other. Even so, experience shows that an Ensemble can still improve the score substantially.

### *Pipeline

• Modular Feature Transforms: a new Feature can be added to the training set with only a few lines of code.
• Automated Grid Search: once the candidate Models and parameters are specified, the search runs automatically and the best Model is recorded.
• Automated Ensemble Generation: every once in a while, take the best K Models so far and build an Ensemble from them.

Chenglong Chen, the winner of Crowdflower Search Results Relevance, released the Pipeline he used in that competition, and it is very much worth studying. That said, reading through his code and extracting its logic to build such a framework is no small task; it is probably something to set aside time for after a few more competitions.

## Home Depot Search Relevance

### EDA

• The same search term / product appears multiple times, so the data is clearly not i.i.d.
• Text similarity is very useful.
• A fairly large share of the products are missing attributes; will this make Features derived from attributes hard to exploit?
• The product ID is quite predictive of relevance, but the overlap between the training set and the testing set is not very high; would using it lead to Overfitting?

### Preprocessing

1. Use the Typo Dictionary posted in the Forum to correct errors in the search terms.
2. Count attribute occurrences and record the ones that are frequent and easy to exploit.
3. Concatenate the training set and the testing set, and Join them with the product descriptions and attributes. Since a whole series of operations follows, not merging them would mean writing everything twice.
4. Do Stemming and Tokenizing on all text fields; some format normalization (e.g. of numbers and units) and synonym substitution was done by hand.

### Feature

• *Attribute Features
• Whether the product has a certain attribute (brand, size, color, weight, indoor/outdoor use, Energy Star certification, etc.)
• Whether that attribute matches the search term
• Meta Features
• Length of each text field
• Whether the product has attribute fields
• Brand (all brands encoded as discrete integers)
• Product ID
• Simple matching
• Whether the search term appears in the product title, description or attributes
• Count and ratio of the search term's occurrences in the product title, description and attributes
• *Whether the i-th word of the search term appears in the product title, description or attributes
• Text similarity between the search term and the product title, description and attributes
• Latent Semantic Indexing: by applying SVD to the matrix obtained from BOW/TF-IDF Vectorization, we obtain Latent representations of the different search term / product combinations. This Feature lets the Model distinguish between combinations to some extent, mitigating the problem of some products missing certain Features.

### Ensemble

• RandomForestRegressor
• ExtraTreesRegressor
• GradientBoostingRegressor
• XGBRegressor

### Lessons Learned

• Product titles follow Patterns; for example, whether a product comes with a certain accessory is always indicated in the form With/Without XXX at the end of the title.
• Use external data, such as WordNet or the Reddit comments dataset, to train dictionaries of synonyms and hypernyms (to some extent a substitute for Word2Vec).
• NLP Features based on letters rather than words. This puzzled me at first, but it makes a lot of sense once explained. For instance, when computing match scores the third-place team also took into account the length of the matched words, because they found that longer words are more specific and therefore more likely to be judged highly relevant by users. They also used character-by-character sequence comparison (difflib.SequenceMatcher), since this similarity captures visual similarity. Features like these are certainly not something everyone would think of.
• POS-tag the words, find the head word, and compute various match scores and distances based on the head word. I thought of this but did not have time to try it.
• Take the Trigrams with the highest TF-IDF from the product title/description, and compute the fraction of search-term words that appear in these Trigrams; then do the same the other way around, with the search term as the basis. This amounts to extracting Latent representations from another angle.
• Some novel distance measures, such as Word Mover's Distance.
• Besides SVD, NMF can also be used.
• Pairwise Polynomial Interactions between the most important Features.
• To deal with the non-i.i.d. data, construct two kinds of CV splits by hand, one where product IDs do not overlap between the training and validation folds and one where they do, mixed in the same ratio as the actual training/testing split, so that the CV approximates the LB score distribution.

## Summary

### Takeaways

1. Starting on the Ensemble early in the competition was the right call; as it was, I was still fiddling with Features three days before the end.
2. It is well worth building a Pipeline that can at least train models automatically and record the best parameters.
3. Features are king. I still spent too little time on Features.
4. If possible, spend more time manually inspecting the raw data for Patterns.

### Issues Raised

1. How to do reliable CV when the data is not i.i.d. or even has Dependencies.
2. How to quantify the Diversity vs. Accuracy Trade-off in an Ensemble.
3. How to handle interactions between Features that end up hurting performance.

### Beginner Tips

1. Pick a competition you are interested in. It is even better if you already have some insight into the relevant domain.
2. Start exploring, understanding and modeling the data along the lines described above.
3. Learn from the Forum and Scripts how others interpret the data and construct Features.
4. If there have been similar competitions before, look up the winners' Interviews and Blog Posts; they are often very useful.
5. Once you have a reasonably good LB score (say, already close to the top 10%), you can start trying Ensembles.
6. If you think you have a shot at the prize money, start looking for teammates!
7. Keep the pressure on until the competition ends, and try something new every day if you can.
8. After the competition, study the methods of the top-ranked teams, reflect on your own shortcomings and the problems you discovered, and if possible spend some time confirming what you learned through experiments, in preparation for the next competition.
9. Get some rest!

## Introduction

Kaggle is the best place for learning from other data scientists. Many companies provide data and prize money to set up data science competitions on Kaggle. Recently I had my first shot at a Kaggle competition and ranked 98th (~ 5%) among 2125 teams. Since this was my Kaggle debut, I feel quite satisfied. Because many Kaggle beginners set 10% as their first goal, here I want to share my experience in achieving that goal.

This post is also available in Chinese.

Most Kagglers use Python and R. I prefer Python, but R users should have no difficulty in understanding the ideas behind tools and languages.

First let’s go through some facts about Kaggle competitions in case you are not very familiar with them.

• Different competitions have different tasks: classification, regression, recommendation, ordering… Training set and testing set will be open for download after the competition launches.
• A competition typically lasts for 2 ~ 3 months. Each team can submit a limited number of times per day, usually 5.
• There will be a deadline one week before the end of the competition, after which you cannot merge teams or enter the competition. Therefore be sure to have at least one valid submission before that.
• You will get your score immediately after the submission. Different competitions use different scoring metrics, which are explained by the question mark on the leaderboard.
• The score you get is calculated on a subset of the testing set, which is commonly referred to as the Public LB score, whereas the final result will use the remaining data in the testing set, referred to as the Private LB score.
• The score you get by local cross validation is commonly referred to as a CV score. Generally speaking, CV scores are more reliable than LB scores.
• Beginners can learn a lot from Forum and Scripts. Do not hesitate to ask, Kagglers are very kind and helpful.

I assume that readers are familiar with basic concepts and models of machine learning. Enjoy reading!

## General Approach

In this section, I will walk you through the whole process of a Kaggle competition.

### Data Exploration

What we do at this stage is called EDA (Exploratory Data Analysis), which means analytically exploring data in order to provide some insights for subsequent processing and modeling.

Usually we would load the data using Pandas and make some visualizations to understand the data.

#### Visualization

For plotting, Matplotlib and Seaborn should suffice.

Some common practices:

• Inspect the distribution of target variable. Depending on what scoring metric is used, an imbalanced distribution of target variable might harm the model’s performance.
• For numerical variables, use box plot to inspect their distributions.
• For coordinates-like data, use scatter plot to inspect the distribution and check for outliers.
• For classification tasks, plot the data with points colored according to their labels. This can help with feature engineering.
• Make pairwise distribution plots and examine correlations between pairs of variables.

Be sure to read this very inspiring tutorial of exploratory visualization before you go on.
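A few hedged examples of these plots with Matplotlib and Seaborn; the data and column names below are synthetic stand-ins, not from any particular competition:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for a training set with made-up column names.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "price": rng.lognormal(3, 1, 500),
    "longitude": rng.uniform(-74.1, -73.9, 500),
    "latitude": rng.uniform(40.6, 40.9, 500),
    "target": rng.randint(0, 2, 500),
})

df["target"].value_counts().plot(kind="bar")    # distribution of the target variable
plt.figure()
sns.boxplot(x=df["price"])                      # box plot of a numerical variable
plt.figure()
sns.scatterplot(data=df, x="longitude", y="latitude", hue="target")  # coordinates colored by label
sns.pairplot(df, hue="target")                  # pairwise distributions (creates its own figure)
plt.figure()
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")  # pairwise correlations
plt.show()
```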

#### Statistical Tests

We can perform some statistical tests to confirm our hypotheses. Sometimes we can get enough intuition from visualization, but quantitative results are always good to have. Note that we will always encounter non-i.i.d. data in real world. So we have to be careful about the choice of tests and how we interpret the findings.

In many competitions public LB scores are not very consistent with local CV scores due to noise or a non-i.i.d. distribution. You can use test results to roughly set a threshold for determining whether an increase in score is a genuine one or due to randomness.

### Data Preprocessing

In most cases, we need to preprocess the dataset before constructing features. Some common steps are:

• Sometimes several files are provided and we need to join them.
• Deal with missing data.
• Deal with outliers.
• Encode categorical variables if necessary.
• Deal with noise. For example you may have some floats derived from unknown integers. The loss of precision during floating-point operations can bring much noise into the data: two seemingly different values might be the same before conversion. Sometimes noise harms the model and we want to avoid that.

How we choose to perform preprocessing largely depends on what we learned about the data in the previous stage. In practice, I recommend using an iPython Notebook for data manipulation and getting familiar with the most frequently used Pandas operations. The advantage is that you get to see the results immediately and are able to modify or rerun operations. This also makes it very convenient to share your approaches with others. After all, reproducible results are very important in data science.
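A small pandas sketch of the steps above; the tables and column names are made-up placeholders:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for data scattered across two files.
train = pd.DataFrame({"item_id": [1, 2, 3, 4], "price": [10.0, np.nan, 250.0, 12.0]})
items = pd.DataFrame({"item_id": [1, 2, 3, 4], "color": ["red", None, "blue", "red"],
                      "count": [3.0000001, 5.0, 2.9999999, 7.0]})

# Join data that lives in different files.
df = train.merge(items, on="item_id", how="left")

# Missing data: numeric gaps get the median, categorical gaps a sentinel value.
df["price"] = df["price"].fillna(df["price"].median())
df["color"] = df["color"].fillna("unknown")

# Outliers: clip an obviously heavy-tailed column.
df["price"] = df["price"].clip(upper=df["price"].quantile(0.99))

# Floats that were really ints: round away the conversion noise.
df["count"] = df["count"].round().astype(int)
```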

Let’s have some examples.

#### Outlier

The plot shows some scaled coordinates data. We can see that there are some outliers in the top-right corner. Exclude them and the distribution looks good.

#### Dummy Variables

For categorical variables, a common practice is One-hot Encoding. For a categorical variable with n possible values, we create a group of n dummy variables. Suppose a record in the data takes one value for this variable, then the corresponding dummy variable is set to 1 while other dummies in the same group are all set to 0.

Like this, we transform DayOfWeek into 7 dummy variables.
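With pandas, for instance, a sketch of this one-hot encoding might look like the following (the DayOfWeek column here is a toy example):

```python
import pandas as pd

df = pd.DataFrame({"DayOfWeek": ["Mon", "Tue", "Sun", "Mon"]})
dummies = pd.get_dummies(df["DayOfWeek"], prefix="DayOfWeek")   # one dummy column per value
df = pd.concat([df.drop(columns="DayOfWeek"), dummies], axis=1)
```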

Note that when a categorical variable can take many values (hundreds or more), this might not work well. It's difficult to find a general solution to that, but I'll discuss one scenario in the next section.

### Feature Engineering

Some describe the essence of Kaggle competitions as feature engineering supplemented by model tuning and ensemble learning. Yes, that makes a lot of sense. Feature engineering gets you very far. Yet it is how well you know the domain of the given data that decides how far you can go. For example, in a competition where the data consists mainly of text, common NLP features are a must. The craft of constructing useful features is something we all have to keep learning in order to do better.

Basically, when you feel that a variable is intuitively useful for the task, you can include it as a feature. But how do you know it actually works? The simplest way is to check by plotting it against the target variable like this:

#### Feature Selection

Generally speaking, we should try to craft as many features as we can and have faith in the model’s ability to pick up the most significant features. Yet there’s still something to gain from feature selection beforehand:

• Fewer features mean faster training
• Some features are linearly related to others. This might put a strain on the model.
• By picking out the most important features, we can use interactions between them as new features. Sometimes this gives surprising improvements.

The simplest way to inspect feature importance is by fitting a random forest model. There exist more robust feature selection algorithms (e.g. this) which are theoretically superior but lack practical, efficient implementations. You can combat noisy data (to an extent) simply by increasing the number of trees used in the random forest.

This is important for competitions in which data is anonymized because you won’t waste time trying to figure out the meaning of a variable that’s of no significance.
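A sketch of reading feature importances from a fitted random forest with Sklearn (synthetic data stands in for a real training set):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real training set.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(20)])

model = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```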

#### Feature Encoding

Sometimes raw features have to be converted to other formats for them to work properly.

For example, suppose we have a categorical variable which can take more than 10K different values. Then naively creating dummy variables is not a feasible option. An acceptable solution is to create dummy variables for only a subset of the values (e.g. values that constitute 95% of the feature importance) and assign everything else to an ‘others’ class.
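One possible sketch of such an 'others' bucket with pandas; here the cutoff is by cumulative frequency rather than feature importance, which is an assumption on my part:

```python
import pandas as pd

def encode_with_others(series, coverage=0.95):
    """One-hot encode only the most frequent values; lump the rest into 'others'."""
    freq = series.value_counts(normalize=True)
    keep = freq[freq.cumsum() <= coverage].index
    reduced = series.where(series.isin(keep), other="others")
    return pd.get_dummies(reduced, prefix=series.name)

# Toy usage on a hypothetical high-cardinality column.
s = pd.Series(["a", "a", "b", "c", "d", "a", "b"], name="merchant_id")
print(encode_with_others(s, coverage=0.8))
```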

### Model Selection

With the features set, we can start training models. Kaggle competitions usually favor tree-based models:

• Random Forest
• Extra Randomized Trees

These models are slightly worse in terms of performance, but are suitable as base models in ensemble learning (will be discussed later):

• SVM
• Linear Regression
• Logistic Regression
• Neural Networks

Of course, neural networks are very important in image-related competitions.

All these models can be accessed using Sklearn.

Here I want to emphasize the greatness of Xgboost. The outstanding performance of gradient boosted trees and Xgboost's efficient implementation make it very popular in Kaggle competitions. Nowadays almost every winner uses Xgboost in one way or another.

BTW, installing Xgboost on Windows could be a painstaking process. You can refer to this post by me if you run into problems.

#### Model Training

We can obtain a good model by tuning its parameters. A model usually has many parameters, but only a few of them are important to its performance. For example, the most important parameters of a random forest are the number of trees in the forest and the maximum number of features used in developing each tree. We need to understand how models work and what impact each parameter has on the model's performance, be it accuracy, robustness or speed.

Normally we would find the best set of parameters by a process called grid search. Actually what it does is simply iterating through all the possible combinations and finding the best one.

By the way, random forests usually reach their optimum when max_features is set to the square root of the total number of features.
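A sketch of grid search with Sklearn's GridSearchCV, here over a couple of random forest parameters including max_features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_features": ["sqrt", 0.5, None],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```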

Here I’d like to stress some points about tuning XGB. These parameters are generally considered to have real impacts on its performance:

• eta: Step size used in updating weights. Lower eta means slower training.
• num_round: Total round of iterations.
• subsample: The ratio of training data used in each iteration. This is to combat overfitting.
• colsample_bytree: The ratio of features used in each iteration. This is like max_features of RandomForestClassifier.
• max_depth: The maximum depth of each tree. Unlike random forest, gradient boosting would eventually overfit if we do not limit its depth.
• early_stopping_rounds: Controls how many iterations without an increase in score on the validation set are needed for the algorithm to stop early. This is to combat overfitting, too.

Usual tuning steps:

1. Reserve a portion of training set as the validation set.
2. Set eta to a relatively high value (e.g. 0.1), num_round to 300 ~ 500.
3. Use grid search to find best combination of other parameters.
4. Gradually lower eta to find the optimum.
5. Use the validation set as the watchlist to re-train the model with the best parameters. Observe how the score on the validation set changes in each iteration. Find the optimal value for early_stopping_rounds.

Finally, note that models with randomness all have a parameter like seed or random_state to control the random seed. You must record this with all other parameters when you get a good model. Otherwise you wouldn’t be able to reproduce it.
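A minimal sketch of these tuning steps with the xgboost Python API; the data is synthetic and the parameter values are only placeholders:

```python
import numpy as np
import xgboost as xgb

# Synthetic regression data standing in for real features.
rng = np.random.RandomState(0)
X = rng.rand(1000, 10)
y = X[:, 0] * 2 + rng.rand(1000) * 0.1

# Hold out part of the training data as the validation set (step 1).
dtrain = xgb.DMatrix(X[:800], label=y[:800])
dvalid = xgb.DMatrix(X[800:], label=y[800:])

params = {
    "eta": 0.1,                # start relatively high (step 2), lower it later (step 4)
    "max_depth": 6,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "objective": "reg:squarederror",
}
watchlist = [(dtrain, "train"), (dvalid, "valid")]
model = xgb.train(params, dtrain, num_boost_round=500,
                  evals=watchlist, early_stopping_rounds=50)
print(model.best_iteration)
```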

#### Cross Validation

Cross validation is an essential step. It tells us whether our model is at high risk of overfitting. In many competitions, public LB scores are not very reliable. Often when we improve the model and get a better local CV score, the LB score becomes worse. It is widely believed that we should trust our CV scores under such situation. Ideally we would want CV scores obtained by different approaches to improve in sync with each other and with the LB score, but this is not always possible.

Usually 5-fold CV is good enough. If we use more folds, the CV score would become more reliable, but the training takes longer to finish as well.
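A sketch of plain 5-fold CV with Sklearn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=300, random_state=0)

scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc", n_jobs=-1)
print(scores.mean(), scores.std())
```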

How to do CV properly is not a trivial problem. It requires constant experiment and case-by-case discussion. Many Kagglers share their CV approaches (like this one) after competitions where it’s not easy to do reliable CV.

### Ensemble Generation

Ensemble Learning refers to ways of combining different models. It reduces both the bias and the variance of the final model (you can find a proof here), thus increasing the score and reducing the risk of overfitting. Recently it has become virtually impossible to win a prize in a Kaggle competition without using an ensemble.

Common approaches of ensemble learning are:

• Bagging: Use different random subsets of training data to train each base model. Then base models vote to generate the final predictions. This is how random forest works.
• Boosting: Train base models iteratively, modifying the weights of training samples according to the errors of the last iteration. This is how gradient boosted trees work. It performs better than bagging but is more prone to overfitting.
• Blending: Use non-overlapping data to train different base models and take a weighted average of them to obtain the final predictions. This is easy to implement but uses less data.
• Stacking: To be discussed next.

In theory, for the ensemble to perform well, two elements matter:

• Base models should be as unrelated as possible. This is why we tend to include non-tree-based models in the ensemble even though they don't perform as well. The math says that the greater the diversity, the lower the bias of the final ensemble.
• Performance of base models shouldn't differ too much.

Actually we have a trade-off here. In practice we may end up with highly correlated models of comparable performance. Yet we ensemble them anyway because it usually increases performance even under this circumstance.

#### Stacking

Compared with blending, stacking makes better use of training data. Here’s a diagram of how it works:

(Taken from Faron. Many thanks!)

It’s much like cross validation. Take 5-fold stacking as an example. First we split the training data into 5 folds. Next we will do 5 iterations. In each iteration, train every base model on 4 folds and predict on the hold-out fold. You have to keep the predictions on the testing data as well. This way, in each iteration every base model will make predictions on 1 fold of the training data and all of the testing data. After 5 iterations we will obtain a matrix of shape #(rows in training data) X #(base models). This matrix is then fed to the stacker in the second level. After the stacker is fitted, use the predictions on testing data by base models (each base model is trained 5 times, therefore we have to take an average to obtain a matrix of the same shape) as the input for the stacker and obtain our final predictions.

Maybe it's better to just show the code:
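(The original post embedded its own implementation at this point; below is only a minimal sketch of the 5-fold stacking procedure just described, with the base models and the stacker passed in as placeholders.)

```python
import numpy as np
from sklearn.model_selection import KFold

def stack(base_models, stacker, X_train, y_train, X_test, n_folds=5):
    """Simple 5-fold stacking: out-of-fold predictions feed a second-level model."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    s_train = np.zeros((X_train.shape[0], len(base_models)))   # out-of-fold predictions
    s_test = np.zeros((X_test.shape[0], len(base_models)))

    for i, model in enumerate(base_models):
        test_preds = np.zeros((X_test.shape[0], n_folds))
        for j, (train_idx, hold_idx) in enumerate(kf.split(X_train)):
            model.fit(X_train[train_idx], y_train[train_idx])
            s_train[hold_idx, i] = model.predict(X_train[hold_idx])  # predict on the hold-out fold
            test_preds[:, j] = model.predict(X_test)                 # predict on all testing data
        s_test[:, i] = test_preds.mean(axis=1)   # average the n_folds test-set predictions

    stacker.fit(s_train, y_train)                # second-level model on out-of-fold predictions
    return stacker.predict(s_test)

# Hypothetical usage:
# preds = stack([RandomForestRegressor(), Ridge()], XGBRegressor(), X_tr, y_tr, X_te)
```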

Prize winners usually have larger and much more complicated ensembles. For beginners, implementing a correct 5-fold stacking is good enough.

### *Pipeline

We can see that the workflow for a Kaggle competition is quite complex, especially for model selection and ensemble. Ideally, we need a highly automated pipeline capable of:

• Modularized feature transforms. We only need to write a few lines of code and the new feature is added to the training set.
• Automated grid search. We only need to set up the models and a parameter grid; the search will be run and the best parameters recorded.
• Automated ensemble generation. Every once in a while, take the best K models so far and use them to build an ensemble.

For beginners, the first one is not very important because the number of features is quite manageable; the third one is not important either because typically we only do several ensembles at the end of the competition. But the second one is good to have because manually recording the performance and parameters of each model is time-consuming and error-prone.

Chenglong Chen, the winner of Crowdflower Search Results Relevance, once released his pipeline on GitHub. It’s very complete and efficient. Yet it’s still very hard to understand and extract all his logic to build a general framework. This is something you might want to do when you have plenty of time.

## Home Depot Search Relevance

In this section I will share my solution in Home Depot Search Relevance and what I learned from top teams after the competition.

The task in this competition is to predict how relevant a result is for a search term on the Home Depot website. The relevance score is an average from three human evaluators and ranges from 1 to 3. Therefore it's a regression task. The dataset contains search terms, product titles / descriptions and some attributes like brand, size and color. The metric is RMSE.

This is much like Crowdflower Search Results Relevance. The difference is that Quadratic Weighted Kappa was used in that competition, which complicated the final cutoff of the regression scores. Also, there were no attributes provided in that competition.

### EDA

There were several quite good EDAs by the time I joined the competition, especially this one. I learned that:

• Many search terms / products appeared several times.
• Text similarities are great features.
• Many products don’t have attribute features. Would this be a problem?
• Product ID seems to have strong predictive power. However the overlap of product ID between the training set and the testing set is not very high. Would this contribute to overfitting?

### Preprocessing

You can find how I did preprocessing and feature engineering on GitHub. I’ll only give a brief summary here:

1. Use typo dictionary posted in forum to correct typos in search terms.
2. Count attributes. Find those frequent and easily exploited ones.
3. Concatenate the training set and the testing set, and join them with the product descriptions and attributes. This is important because otherwise you'll have to do every feature transform twice.
4. Do stemming and tokenizing for all the text fields. Some normalization (of digits and units) and synonym substitutions are performed manually.

### Feature

• *Attribute Features
• Whether the product contains a certain attribute (brand, size, color, weight, indoor/outdoor, energy star certified …)
• Whether a certain attribute matches with search term
• Meta Features
• Length of each text field
• Whether the product contains attribute fields
• Brand (encoded as integers)
• Product ID
• Matching
• Whether search term appears in product title / description / attributes
• Count and ratio of search term’s appearance in product title / description / attributes
• *Whether the i-th word of search term appears in product title / description / attributes
• Text similarities between search term and product title/description/attributes
• Latent Semantic Indexing: By performing SVD decomposition to the matrix obtained from BOW/TF-IDF Vectorization, we get the latent descriptions of different search term / product groups. This enables our model to distinguish between groups and assign different weights to features, therefore solving the issue of dependent data and products lacking some features (to an extent).
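A sketch of such a Latent Semantic Indexing feature: TF-IDF vectorization followed by truncated SVD (the texts and the number of components below are placeholders):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical combined text per search term / product pair.
texts = [
    "angle bracket simpson strong tie 12 gauge angle",
    "l bracket stainless steel corner brace",
    "deck over behr premium textured deckover",
]

tfidf = TfidfVectorizer().fit_transform(texts)
svd = TruncatedSVD(n_components=2, random_state=0)   # ~100 components on real data
latent = svd.fit_transform(tfidf)                    # latent representation per pair
```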

Note that the features listed above with * are the last batch of features I added. The problem is that the model trained on data that included these features performed worse than the previous ones. At first I thought that the increase in the number of features would require re-tuning of model parameters. However, after wasting much CPU time on grid search, I still could not beat the old model. I think it might be the issue of feature correlation mentioned above. I actually knew a solution that might work, which is to combine models trained on different versions of features by stacking. Unfortunately I didn't have enough time to try it. As a matter of fact, most of the top teams regard the ensemble of models trained with different preprocessing and feature engineering pipelines as a key to success.

### Model

At first I was using RandomForestRegressor to build my model. Then I tried Xgboost and it turned out to be more than twice as fast as Sklearn. From then on, what I did every day was basically running grid search on my PC while working on features on my laptop.

The dataset in this competition is not trivial to validate. It's not i.i.d. and many records are dependent. Many times I used better features / parameters only to end up with worse LB scores. As repeatedly stated by many accomplished Kagglers, you have to trust your own CV score in such a situation. Therefore I decided to use 10-fold instead of 5-fold cross validation and to ignore the LB score in the following attempts.

### Ensemble

My final model is an ensemble consisting of 4 base models:

• RandomForestRegressor
• ExtraTreesRegressor
• GradientBoostingRegressor
• XGBRegressor

The stacker (L2 model) is also a XGBRegressor.

The problem is that all my base models are highly correlated (with a lowest correlation of 0.9). I thought of including linear regression, SVM regression and XGBRegressor with a linear booster in the ensemble, but these models had RMSE scores that were about 0.02 higher (this accounts for a gap of hundreds of places on the leaderboard) than the 4 models I finally used. Therefore I decided not to use more models although they would have brought much more diversity.

The good news is that, despite base models being highly correlated, stacking really bumps up my score. What’s more, my CV score and LB score are in complete sync after I started stacking.

During the last two days of the competition, I did one more thing: use 20 or so different random seeds to generate the ensemble and take a weighted average of them as the final submission. This is actually a kind of bagging. It makes sense in theory because in stacking I used 80% of the data to train base models in each iteration, whereas 100% of the data is used to train the stacker. Therefore it’s less clean. Making multiple runs with different seeds makes sure that different 80% of the data are used each time, thus reducing the risk of information leak. Yet by doing this I only achieved an increase of 0.0004, which might be just due to randomness.

After the competition, I found out that my best single model scores 0.46378 on the private leaderboard, whereas my best stacking ensemble scores 0.45849. That was the difference between the 174th place and the 98th place. In other words, feature engineering and model tuning got me into 10%, whereas stacking got me into 5%.

### Lessons Learned

There’s much to learn from the solutions shared by top teams:

• There’s a pattern in the product titles. For example, whether a product is accompanied by a certain accessory will be indicated by With/Without XXX at the end of the title.
• Use external data. For example use WordNet or Reddit Comments Dataset to train synonyms and hypernyms.
• Some features based on letters instead of words. At first I was rather confused by this. But it makes perfect sense if you consider it. For example, the team that won the 3rd place took the number of letters matched into consideration when computing text similarity. They argued that longer words are more specific and thus more likely to be assigned high relevance scores by human. They also used char-by-char sequence comparison (difflib.SequenceMatcher) to measure visual similarity, which they claimed to be important for human.
• POS-tag words and find anchor words. Use anchor words for computing various distances.
• Extract top-ranking trigrams from the TF-IDF of the product title / description field and compute the ratio of words from the search term that appear in these trigrams. Then do the same the other way around. This is like computing latent indexes from another point of view.
• Some novel distance metrics like Word Mover's Distance.
• Apart from SVD, some used NMF.
• Generate pairwise polynomial interactions between top-ranking features.
• For CV, construct splits in which product IDs do not overlap between training set and testing set, and splits in which IDs do. Then we can use these with corresponding ratio to approximate the impact of public/private LB split in our local CV.

## Summary

### Takeaways

1. It was a good call to start doing ensembles early in the competition. As it turned out, I was still playing with features during the very last days.
2. It’s of high priority that I build a pipeline capable of automatic model training and recording best parameters.
3. Features matter the most! I didn’t spend enough time on features in this competition.
4. If possible, spend some time to manually inspect raw data for patterns.

### Issues Raised

Several issues I encountered in this competition are of high research value.

1. How to do reliable CV with dependent data.
2. How to quantify the trade-off between diversity and accuracy in ensemble learning.
3. How to deal with feature interaction which harms the model’s performance. And how to determine whether new features are effective in such situation.

### Beginner Tips

1. Choose a competition you're interested in. It would be better if you already have some insights about the problem domain.
2. Following my approach or somebody else’s, start exploring, understanding and modeling data.
3. Learn from the forum and scripts. See how others interpret data and construct features.
4. Find winner interviews / blog posts of previous competitions. They're very helpful.
5. Start doing ensemble after you have reached a pretty good score (e.g. ~ 10%) or you feel that there isn’t much room for new features (which, sadly, always turns out to be false).
6. If you think you may have a chance to win the prize, try teaming up!
7. Don’t give up until the end of the competition. At least try something new every day.
8. Learn from what the top teams share after the competition. Reflect on your approaches. If possible, spend some time verifying what you learn.
9. Get some rest!

# The future of network security depends on machine learning

Invincea is a Virginia-based company specializing in malware detection and network security. Josh Saxe, the company's chief research engineer, believes it is time to abandon the signature- and file-hash-based analysis techniques of the 1990s.

Saxe says: "I understand that some antivirus companies have dipped into machine learning, but what they still live on is signature detection. They detect malware based on file hashes or pattern matching, analysis techniques that human researchers came up with for detecting a given sample."

Part of Invincea's advanced malware detection system is based on DARPA's Cyber Genome program.

"We have shown that the machine-learning-based approach we developed is more effective than traditional antivirus systems. A machine learning system can automate what human analysts do, and even do it better. Combine a machine learning system with large amounts of training data and you can beat traditional signature-based detection systems."

Invincea uses deep learning to speed up the training of its algorithms. Saxe currently has about 1.5 million benign and malicious software samples for training, all processed on GPUs with Python tooling. He hopes that as the sample set grows to 30 million, the performance advantage of the machine learning system will grow linearly.

"The more training data we have, and the more malware we use to train the machine learning system, the more obvious its advantage in detecting malware will become," he says.

Saxe says Invincea's current plan is to ship more deep-learning-based capabilities in its 2016 endpoint security products; specifically, to add them to Cynomix, an endpoint security product that already uses machine learning.

"In the past, enterprise security staff relied heavily on signature-based methods, such as IP address blacklists," says Derek Lin, chief data scientist at Exabeam, a vendor of user behavior analytics tools.

Exabeam builds a graph of user activity by tracking users' remote connections, devices, IP addresses and credentials.

Rather than sticking to yesterday's defensive posture, Exabeam takes a proactive approach based on Gartner's concept of UBA (User Behavior Analytics). The idea behind UBA is that you cannot know in advance whether a machine or user is good or bad, so you assume they are malicious and that your network is vulnerable, and you constantly monitor and model everyone's behavior in order to find the malicious actors.

Lin says: "All of this is about profiling user behavior; the question is how it is done. For every user or entity on the network we try to build a sketch of what is normal, which involves statistical analysis. Then we look for deviations from normal at the conceptual level... We use behavior-based methods to find anomalies in the system and surface them so security analysts can review them."

"Think about the major waves of network security we have been through. Cybercriminals keep looking for effective ways to break security systems, and we have to fight back. Will machine learning become a mainstay of that counterattack? The answer is yes," says Patrick Townsend, founder and CEO of security software vendor Townsend Security.

Invincea's Saxe hopes to ride that wave. He says: "I'm not surprised that companies in this space haven't caught this wave and produced new deep-learning-based algorithms. Training machine learning at this scale only became feasible recently; it couldn't have been done effectively ten years ago."

# Machine learning and big data know it wasn’t you who just swiped your credit card

You’re sitting at home minding your own business when you get a call from your credit card’s fraud detection unit asking if you’ve just made a purchase at a department store in your city. It wasn’t you who bought expensive electronics using your credit card – in fact, it’s been in your pocket all afternoon. So how did the bank know to flag this single purchase as most likely fraudulent?

Credit card companies have a vested interest in identifying financial transactions that are illegitimate and criminal in nature. The stakes are high. According to the Federal Reserve Payments Study, Americans used credit cards to pay for 26.2 billion purchases in 2012. The estimated loss due to unauthorized transactions that year was US$6.1 billion. The federal Fair Credit Billing Act limits the maximum liability of a credit card owner to $50 for unauthorized transactions, leaving credit card companies on the hook for the balance. Obviously fraudulent payments can have a big effect on the companies’ bottom lines. The industry requires any vendors that process credit cards to go through security audits every year. But that doesn’t stop all fraud.

In the banking industry, measuring risk is critical. The overall goal is to figure out what’s fraudulent and what’s not as quickly as possible, before too much financial damage has been done. So how does it all work? And who’s winning in the arms race between the thieves and the financial institutions?

## Gathering the troops

From the consumer perspective, fraud detection can seem magical. The process appears instantaneous, with no human beings in sight. This apparently seamless and instant action involves a number of sophisticated technologies in areas ranging from finance and economics to law to information sciences.

Of course, there are some relatively straightforward and simple detection mechanisms that don’t require advanced reasoning. For example, one good indicator of fraud can be an inability to provide the correct zip code affiliated with a credit card when it’s used at an unusual location. But fraudsters are adept at bypassing this kind of routine check – after all, finding out a victim’s zip code could be as simple as doing a Google search.

Traditionally, detecting fraud relied on data analysis techniques that required significant human involvement. An algorithm would flag suspicious cases to be closely reviewed ultimately by human investigators who may even have called the affected cardholders to ask if they’d actually made the charges. Nowadays the companies are dealing with a constant deluge of so many transactions that they need to rely on big data analytics for help. Emerging technologies such as machine learning and cloud computing are stepping up the detection game.

## Learning what’s legit, what’s shady

Simply put, machine learning refers to self-improving algorithms, which are predefined processes conforming to specific rules, performed by a computer. A computer starts with a model and then trains it through trial and error. It can then make predictions such as the risks associated with a financial transaction.

A machine learning algorithm for fraud detection needs to be trained first by being fed the normal transaction data of lots and lots of cardholders. Transaction sequences are an example of this kind of training data. A person may typically pump gas one time a week, go grocery shopping every two weeks and so on. The algorithm learns that this is a normal transaction sequence.

After this fine-tuning process, credit card transactions are run through the algorithm, ideally in real time. It then produces a probability number indicating the possibility of a transaction being fraudulent (for instance, 97%). If the fraud detection system is configured to block any transactions whose score is above, say, 95%, this assessment could immediately trigger a card rejection at the point of sale.
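As a toy illustration of this thresholding (not any issuer's actual system), a classifier can output a fraud probability and a simple rule can reject transactions above a cutoff:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for historical transactions labeled fraud (1) / legitimate (0).
X, y = make_classification(n_samples=5000, n_features=12, weights=[0.98], random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

def review_transaction(features, block_threshold=0.95):
    """Score one transaction and decide whether to reject it at the point of sale."""
    fraud_probability = model.predict_proba([features])[0, 1]
    return "reject" if fraud_probability >= block_threshold else "approve"

print(review_transaction(X[0]))
```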

The algorithm considers many factors to qualify a transaction as fraudulent: trustworthiness of the vendor, a cardholder’s purchasing behavior including time and location, IP addresses, etc. The more data points there are, the more accurate the decision becomes.

This process makes just-in-time or real-time fraud detection possible. No person can evaluate thousands of data points simultaneously and make a decision in a split second.

Here’s a typical scenario. When you go to a cashier to check out at the grocery store, you swipe your card. Transaction details such as time stamp, amount, merchant identifier and membership tenure go to the card issuer. These data are fed to the algorithm that’s learned your purchasing patterns. Does this particular transaction fit your behavioral profile, consisting of many historic purchasing scenarios and data points?

The algorithm knows right away if your card is being used at the restaurant you go to every Saturday morning – or at a gas station two time zones away at an odd time such as 3:00 a.m. It also checks if your transaction sequence is out of the ordinary. If the card is suddenly used for cash-advance services twice on the same day when the historic data show no such use, this behavior is going to up the fraud probability score. If the transaction’s fraud score is above a certain threshold, often after a quick human review, the algorithm will communicate with the point-of-sale system and ask it to reject the transaction. Online purchases go through the same process.

In this type of system, heavy human interventions are becoming a thing of the past. In fact, they could actually be in the way since the reaction time will be much longer if a human being is too heavily involved in the fraud-detection cycle. However, people can still play a role – either when validating a fraud or following up with a rejected transaction. When a card is being denied for multiple transactions, a person can call the cardholder before canceling the card permanently.

## Computer detectives, in the cloud

The sheer number of financial transactions to process is overwhelming, truly, in the realm of big data. But machine learning thrives on mountains of data – more information actually increases the accuracy of the algorithm, helping to eliminate false positives. These can be triggered by suspicious transactions that are really legitimate (for instance, a card used at an unexpected location). Too many alerts are as bad as none at all.

It takes a lot of computing power to churn through this volume of data. For instance, PayPal processes more than 1.1 petabytes of data for 169 million customer accounts at any given moment. This abundance of data – one petabyte, for instance, is more than 200,000 DVDs’ worth – has a positive influence on the algorithms’ machine learning, but can also be a burden on an organization’s computing infrastructure.

Enter cloud computing. Off-site computing resources can play an important role here. Cloud computing is scalable and not limited by the company’s own computing power.

Fraud detection is an arms race between good guys and bad guys. At the moment, the good guys seem to be gaining ground, with emerging innovations in IT technologies such as chip and pin technologies, combined with encryption capabilities, machine learning, big data and, of course, cloud computing.

Fraudsters will surely continue trying to outwit the good guys and challenge the limits of the fraud detection system. Drastic changes in the payment paradigms themselves are another hurdle. Your phone is now capable of storing credit card information and can be used to make payments wirelessly – introducing new vulnerabilities. Luckily, the current generation of fraud detection technology is largely neutral to the payment system technologies.

## Machine learning is the core of the Sigma system

The Sigma system contains both hand-written rules and machine-learned models, and the stream of allowed and rejected requests feeds straight back into it as a fast-moving data set. If a task is allowed through, it becomes a positive example for machine learning; if it is rejected, it becomes a negative example, much like 0 and 1.

"Ranking" refers to the order of the feed. It determines what your feed looks like when you open Facebook and where each story sits. For each person there are two or three thousand pieces of content and news available, but a user only sees 50-100 of them, so you need to present those two or three thousand in the best possible way. Some things we don't show to a user at all; for example, you like games but your friend doesn't.

## How does News Feed ranking work?

A/B testing is a good way to iterate. Establish core metrics, run an A/B test, and see whether the change improves them: ship it if it does, and don't if it doesn't. This is a lot like Growth Hacking, and the ultimate goal is still to increase DAU. If users like your News Feed they will visit more often; in the end it comes down to time spent and daily active users.