# Linear Regression

### (1) Linear Regression

For $m$ training examples $(x_{i}, y_{i})$ stacked into a design matrix $X$ and a target column vector $Y$, the least-squares objective in matrix form is

$\sum_{i=1}^{m}(y_{i}-x_{i}\Theta)^{2} = (Y-X\Theta)^{T}(Y-X\Theta)$

Expanding the quadratic form:

$(Y-X\Theta)^{T}(Y-X\Theta)=Y^{T}Y-2Y^{T}X\Theta + \Theta^{T}X^{T}X\Theta$

Differentiating with respect to $\Theta$ and setting the result to zero gives $-2X^{T}Y + 2X^{T}X\Theta=0$; solving for $\Theta$ yields $\Theta = (X^{T}X)^{-1}X^{T}Y$. Therefore, for a design matrix $X$ and a target column vector $Y$, the optimal linear regression coefficients are

$\Theta = (X^{T}X)^{-1}X^{T}Y.$
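As a sanity check, the normal-equation solution can be computed directly with NumPy. The data below is synthetic and noise-free, purely for illustration:

```python
import numpy as np

# Minimal sketch: solving the normal equations Theta = (X^T X)^{-1} X^T Y.
# In practice, prefer np.linalg.solve or lstsq over forming the inverse explicitly.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # intercept + one feature
true_theta = np.array([2.0, -3.0])
Y = X @ true_theta                                         # noise-free targets

theta = np.linalg.solve(X.T @ X, X.T @ Y)                  # recovers true_theta
```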

### (2) Locally Weighted Linear Regression

Assigning each example a weight $w_{i}$, collected in the diagonal matrix $W$, the weighted objective is

$\sum_{i=1}^{m}w_{i}(y_{i}-x_{i}\Theta)^{2} = (Y-X\Theta)^{T}W(Y-X\Theta),$

$(Y-X\Theta)^{T}W(Y-X\Theta)=Y^{T}WY-2Y^{T}WX\Theta+\Theta^{T}X^{T}WX\Theta,$

Differentiating with respect to $\Theta$ gives:

$-2(Y^{T}WX)^{T} + 2 X^{T}WX\Theta = -2X^{T}WY+2X^{T}WX\Theta.$

Setting the derivative to zero and solving for $\Theta$:

$\Theta = (X^{T}WX)^{-1}X^{T}WY.$

The weights are typically given by a Gaussian kernel centered on the query point $x$, with bandwidth parameter $k$:

$w_{i} = \exp\{-\frac{(x_{i}-x)^{2}}{2k^{2}}\}.$
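A minimal NumPy sketch, solving the weighted normal equations once per query point. The bandwidth `k` and the synthetic data are illustrative:

```python
import numpy as np

# Locally weighted linear regression: for each query point, solve
# Theta = (X^T W X)^{-1} X^T W Y with Gaussian kernel weights.
def lwlr_predict(x_query, X, Y, k=1.0):
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * k ** 2))
    W = np.diag(w)                                   # one weight per training example
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
    return x_query @ theta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), np.linspace(0.0, 1.0, 200)])  # intercept + feature
Y = np.sin(2 * np.pi * X[:, 1]) + 0.05 * rng.normal(size=200)
y_hat = lwlr_predict(np.array([1.0, 0.25]), X, Y, k=0.05)  # near sin(pi/2) = 1
```

Note that a small `k` fits only the immediate neighborhood of the query point, while a large `k` approaches ordinary least squares.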

### (3) Ridge Regression and the LASSO

Ridge regression adds an $L_2$ penalty to the least-squares objective, yielding the closed-form solution

$\Theta = (X^{T}X+\lambda I)^{-1}X^{T}Y.$

It minimizes

$\sum_{i=1}^{m}(y_{i}-x_{i}\Theta)^{2}+\lambda||\Theta||_{2}^{2}=(Y-X\Theta)^{T}(Y-X\Theta)+\lambda \Theta^{T}\Theta,$

$(Y-X\Theta)^{T}(Y-X\Theta)+\lambda \Theta^{T}\Theta = Y^{T}Y - 2\Theta^{T}X^{T}Y+\Theta^{T}X^{T}X\Theta+\lambda\Theta^{T}I\Theta.$

Differentiating with respect to $\Theta$ gives:

$-2X^{T}Y+2(X^{T}X+\lambda I)\Theta,$

which, set to zero, recovers the closed-form solution above.

The LASSO replaces the $L_2$ penalty with an $L_1$ penalty, which induces sparse coefficients but no longer admits a closed-form solution:

$\sum_{i=1}^{m}(y_{i}-x_{i}\Theta)^{2}+\lambda ||\Theta||_{1}$
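The ridge closed form can be checked directly in NumPy. On synthetic data, setting $\lambda = 0$ recovers ordinary least squares, while a larger $\lambda$ shrinks the coefficient norm:

```python
import numpy as np

# Sketch of the ridge closed form: Theta = (X^T X + lambda I)^{-1} X^T Y.
def ridge(X, Y, lam):
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ Y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
Y = X @ np.array([1.0, 0.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=50)

theta_ols = ridge(X, Y, 0.0)      # lambda = 0 is ordinary least squares
theta_ridge = ridge(X, Y, 10.0)   # larger lambda shrinks the coefficients
```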

### (4) Forward Stagewise Linear Regression

Standardize the data so that each feature has zero mean and unit variance, then:

• Set the current lowest error to positive infinity
• For each feature:
  • For both increasing and decreasing the coefficient:
    • Change the coefficient by a small step to obtain a new weight vector W
    • Compute the error under the new W
    • If the error is lower than the current lowest error: set Wbest to this W
• Set W to the new Wbest
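The procedure above can be sketched in Python with NumPy, repeating the greedy pass for a fixed number of iterations. The step size `eps` and iteration count are illustrative:

```python
import numpy as np

# Forward stagewise linear regression as described above.
# Assumes X is standardized (zero mean, unit variance) and Y is centered.
def stagewise(X, Y, eps=0.01, n_iter=500):
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iter):
        lowest_err = np.inf
        w_best = w
        for j in range(n):              # for each feature
            for sign in (1, -1):        # try increasing and decreasing
                w_test = w.copy()
                w_test[j] += eps * sign
                err = np.sum((Y - X @ w_test) ** 2)
                if err < lowest_err:    # keep the best single-coefficient change
                    lowest_err = err
                    w_best = w_test
        w = w_best                      # set W to the new Wbest
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = X @ np.array([1.5, 0.0, -0.8])
w = stagewise(X, Y)                     # approaches [1.5, 0.0, -0.8]
```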


# How to Reach the Top 10% in Your First Kaggle Competition

## Introduction

Kaggle is the best place for learning from other data scientists. Many companies provide data and prize money to set up data science competitions on Kaggle. Recently I had my first shot on Kaggle and ranked 98th (~ 5%) among 2125 teams. Since this is my Kaggle debut, I feel quite satisfied. Because many Kaggle beginners set 10% as their first goal, here I want to share my experience in achieving that goal.

This post is also available in Chinese.

Most Kagglers use Python and R. I prefer Python, but R users should have no difficulty in understanding the ideas behind tools and languages.

First let’s go through some facts about Kaggle competitions in case you are not very familiar with them.

• Different competitions have different tasks: classification, regression, recommendation, ranking… The training set and testing set become available for download once the competition launches.
• A competition typically lasts for 2 ~ 3 months. Each team can submit a limited number of times per day, usually 5.
• There will be a deadline one week before the end of the competition, after which you cannot merge teams or enter the competition. Therefore be sure to have at least one valid submission before that.
• You will get your score immediately after the submission. Different competitions use different scoring metrics, which are explained by the question mark on the leaderboard.
• The score you get is calculated on a subset of the testing set, commonly referred to as the Public LB score, whereas the final result is based on the remaining data in the testing set, referred to as the Private LB score.
• The score you get by local cross validation is commonly referred to as a CV score. Generally speaking, CV scores are more reliable than LB scores.
• Beginners can learn a lot from the Forum and Scripts. Do not hesitate to ask; Kagglers are very kind and helpful.

I assume that readers are familiar with basic concepts and models of machine learning. Enjoy reading!

## General Approach

In this section, I will walk you through the whole process of a Kaggle competition.

### Data Exploration

What we do at this stage is called EDA (Exploratory Data Analysis), which means analytically exploring data in order to provide some insights for subsequent processing and modeling.

Usually we would load the data using Pandas and make some visualizations to understand the data.

#### Visualization

For plotting, Matplotlib and Seaborn should suffice.

Some common practices:

• Inspect the distribution of target variable. Depending on what scoring metric is used, an imbalanced distribution of target variable might harm the model’s performance.
• For numerical variables, use box plot to inspect their distributions.
• For coordinates-like data, use scatter plot to inspect the distribution and check for outliers.
• For classification tasks, plot the data with points colored according to their labels. This can help with feature engineering.
• Make pairwise distribution plots and examine correlations between pairs of variables.

Be sure to read this very inspiring tutorial of exploratory visualization before you go on.

#### Statistical Tests

We can perform some statistical tests to confirm our hypotheses. Sometimes we can get enough intuition from visualization, but quantitative results are always good to have. Note that we will always encounter non-i.i.d. data in real world. So we have to be careful about the choice of tests and how we interpret the findings.

In many competitions public LB scores are not very consistent with local CV scores due to noise or non-i.i.d. distributions. You can use test results to roughly set a threshold for determining whether an increase in score is genuine or due to randomness.

### Data Preprocessing

In most cases, we need to preprocess the dataset before constructing features. Some common steps are:

• Sometimes several files are provided and we need to join them.
• Deal with missing data.
• Deal with outliers.
• Encode categorical variables if necessary.
• Deal with noise. For example you may have some floats derived from unknown integers. The loss of precision during floating-point operations can bring much noise into the data: two seemingly different values might be the same before conversion. Sometimes noise harms model and we would want to avoid that.
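As a tiny illustration of the first two steps (the column names and values here are hypothetical), joining two provided files and handling missing values with Pandas might look like:

```python
import pandas as pd

# Hypothetical example: join two provided files and handle missing values.
train = pd.DataFrame({"id": [1, 2, 3], "query": ["a", "b", None]})
products = pd.DataFrame({"id": [1, 2, 3], "price": [9.9, None, 19.9]})

df = train.merge(products, on="id", how="left")         # join the two files
df["query"] = df["query"].fillna("")                    # fill missing text
df["price"] = df["price"].fillna(df["price"].median())  # impute with the median
```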

How we choose to perform preprocessing largely depends on what we learned about the data in the previous stage. In practice, I recommend using IPython Notebook for data manipulation and mastering the most frequently used Pandas operations. The advantage is that you get to see the results immediately and are able to modify or rerun operations. This also makes it very convenient to share your approaches with others. After all, reproducible results are very important in data science.

Let’s have some examples.

#### Outlier

The plot shows some scaled coordinates data. We can see that there are some outliers in the top-right corner. Exclude them and the distribution looks good.

#### Dummy Variables

For categorical variables, a common practice is One-hot Encoding. For a categorical variable with n possible values, we create a group of n dummy variables. Suppose a record in the data takes one value for this variable, then the corresponding dummy variable is set to 1 while other dummies in the same group are all set to 0.

Like this, we transform DayOfWeek into 7 dummy variables.

Note that when the categorical variable can take many values (hundreds or more), this might not work well. It's difficult to find a general solution to that, but I'll discuss one scenario in the next section.
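A sketch of one-hot encoding with pandas. The toy column below has only 3 distinct values, whereas the DayOfWeek example mentioned above would produce 7 dummies:

```python
import pandas as pd

# One-hot encoding a categorical column into dummy variables.
df = pd.DataFrame({"DayOfWeek": ["Mon", "Tue", "Mon", "Sun"]})
dummies = pd.get_dummies(df["DayOfWeek"], prefix="DayOfWeek")
# Each row has exactly one dummy set to 1, the rest set to 0.
```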

### Feature Engineering

Some describe the essence of Kaggle competitions as feature engineering supplemented by model tuning and ensemble learning. Yes, that makes a lot of sense. Feature engineering gets you very far. Yet it is how well you know the domain of the given data that decides how far you can go. For example, in a competition whose data consists mainly of text, common NLP features are a must. The craft of constructing useful features is something we all have to keep learning in order to do better.

Basically, when you feel that a variable is intuitively useful for the task, you can include it as a feature. But how do you know it actually works? The simplest way is to check by plotting it against the target variable like this:

#### Feature Selection

Generally speaking, we should try to craft as many features as we can and have faith in the model’s ability to pick up the most significant features. Yet there’s still something to gain from feature selection beforehand:

• Fewer features mean faster training
• Some features are linearly related to others. This might put a strain on the model.
• By picking up the most important features, we can use interactions between them as new features. Sometimes this gives surprising improvement.

The simplest way to inspect feature importance is by fitting a random forest model. There exist more robust feature selection algorithms (e.g. this) that are theoretically superior but lack practical, efficient implementations. You can combat noisy data (to an extent) simply by increasing the number of trees used in the random forest.

This is important for competitions in which data is anonymized because you won’t waste time trying to figure out the meaning of a variable that’s of no significance.
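A sketch of the random-forest approach on synthetic data (the dataset and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest and rank features by their importance scores.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranked = sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1])
# ranked[0] is the (index, importance) pair of the most important feature
```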

#### Feature Encoding

Sometimes raw features have to be converted to other formats for them to work properly.

For example, suppose we have a categorical variable which can take more than 10K different values. Then naively creating dummy variables is not a feasible option. An acceptable solution is to create dummy variables for only a subset of the values (e.g. values that constitute 95% of the feature importance) and assign everything else to an ‘others’ class.
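A possible sketch of this idea. For simplicity, the cutoff below is "the top 2 most frequent values" rather than a feature-importance criterion:

```python
import pandas as pd

# Keep dummies only for the most frequent values; bucket the rest as "others".
s = pd.Series(["a", "a", "a", "b", "b", "c", "d", "e"])
top = s.value_counts().index[:2]                   # the values worth keeping
s_reduced = s.where(s.isin(top), other="others")   # everything else -> "others"
dummies = pd.get_dummies(s_reduced)
```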

### Model Selection

With the features set, we can start training models. Kaggle competitions usually favor tree-based models:

• Random Forest
• Extra Randomized Trees

These models are slightly worse in terms of performance, but are suitable as base models in ensemble learning (will be discussed later):

• SVM
• Linear Regression
• Logistic Regression
• Neural Networks

Of course, neural networks are very important in image-related competitions.

All these models can be accessed using Sklearn.

Here I want to emphasize the greatness of Xgboost. The outstanding performance of gradient boosted trees and Xgboost's efficient implementation make it very popular in Kaggle competitions. Nowadays almost every winner uses Xgboost in one way or another.

BTW, installing Xgboost on Windows can be a painful process. You can refer to this post by me if you run into problems.

#### Model Training

We can obtain a good model by tuning its parameters. A model usually has many parameters, but only a few of them are important to its performance. For example, the most important parameters for a random forest are the number of trees in the forest and the maximum number of features used in developing each tree. We need to understand how models work and what impact each parameter has on the model's performance, be it accuracy, robustness or speed.

Normally we would find the best set of parameters by a process called grid search. What it actually does is simply iterate through all the possible combinations and find the best one.

By the way, a random forest usually reaches its optimum when max_features is set to the square root of the total number of features.
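A sketch of grid search with Sklearn on synthetic data. The parameter grid below is illustrative, not a recommendation:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=0.1, random_state=0)
param_grid = {
    "n_estimators": [50, 100],
    "max_features": ["sqrt", None],   # None means use all features
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)                      # tries every combination with 3-fold CV
best = search.best_params_            # the winning combination
```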

Here I’d like to stress some points about tuning XGB. These parameters are generally considered to have real impacts on its performance:

• eta: Step size used in updating weights. Lower eta means slower training.
• num_round: Total round of iterations.
• subsample: The ratio of training data used in each iteration. This is to combat overfitting.
• colsample_bytree: The ratio of features used in each iteration. This is like max_features of RandomForestClassifier.
• max_depth: The maximum depth of each tree. Unlike random forest, gradient boosting would eventually overfit if we do not limit its depth.
• early_stopping_rounds: Controls how many consecutive iterations without an increase in the validation-set score are needed for the algorithm to stop early. This is to combat overfitting, too.

Usual tuning steps:

1. Reserve a portion of training set as the validation set.
2. Set eta to a relatively high value (e.g. 0.1), num_round to 300 ~ 500.
3. Use grid search to find best combination of other parameters.
4. Gradually lower eta to find the optimum.
5. Use the validation set as watch_list to re-train the model with the best parameters. Observe how score changes on validation set in each iteration. Find the optimal value for early_stopping_rounds.

Finally, note that models with randomness all have a parameter like seed or random_state to control the random seed. You must record this with all other parameters when you get a good model. Otherwise you wouldn’t be able to reproduce it.

#### Cross Validation

Cross validation is an essential step. It tells us whether our model is at high risk of overfitting. In many competitions, public LB scores are not very reliable. Often when we improve the model and get a better local CV score, the LB score becomes worse. It is widely believed that we should trust our CV scores under such situation. Ideally we would want CV scores obtained by different approaches to improve in sync with each other and with the LB score, but this is not always possible.

Usually 5-fold CV is good enough. If we use more folds, the CV score would become more reliable, but the training takes longer to finish as well.
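A minimal 5-fold CV sketch with Sklearn (synthetic data; the model choice is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
cv_score = scores.mean()   # this is the "local CV score" referred to above
```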

How to do CV properly is not a trivial problem. It requires constant experiment and case-by-case discussion. Many Kagglers share their CV approaches (like this one) after competitions where it’s not easy to do reliable CV.

### Ensemble Generation

Ensemble Learning refers to ways of combining different models. It reduces both the bias and the variance of the final model (you can find a proof here), thus increasing the score and reducing the risk of overfitting. Recently it has become virtually impossible to win a prize without using ensembles in Kaggle competitions.

Common approaches of ensemble learning are:

• Bagging: Use different random subsets of training data to train each base model. Then base models vote to generate the final predictions. This is how random forest works.
• Boosting: Train base models iteratively, modifying the weights of training samples according to the errors of the previous iteration. This is how gradient boosted trees work. It performs better than bagging but is more prone to overfitting.
• Blending: Use non-overlapping data to train different base models and take a weighted average of them to obtain the final predictions. This is easy to implement but uses less data.
• Stacking: To be discussed next.

In theory, for the ensemble to perform well, two elements matter:

• Base models should be as uncorrelated as possible. This is why we tend to include non-tree-based models in the ensemble even though they don't perform as well: the greater the diversity, the lower the bias of the final ensemble.
• The performance of the base models shouldn't differ too much.

Actually we have a trade-off here. In practice we may end up with highly related models of comparable performances. Yet we ensemble them anyway because it usually increase performance even under this circumstance.

#### Stacking

Compared with blending, stacking makes better use of training data. Here’s a diagram of how it works:

(Taken from Faron. Many thanks!)

It’s much like cross validation. Take 5-fold stacking as an example. First we split the training data into 5 folds. Next we will do 5 iterations. In each iteration, train every base model on 4 folds and predict on the hold-out fold. You have to keep the predictions on the testing data as well. This way, in each iteration every base model will make predictions on 1 fold of the training data and all of the testing data. After 5 iterations we will obtain a matrix of shape #(rows in training data) X #(base models). This matrix is then fed to the stacker in the second level. After the stacker is fitted, use the predictions on testing data by base models (each base model is trained 5 times, therefore we have to take an average to obtain a matrix of the same shape) as the input for the stacker and obtain our final predictions.

Maybe it's better to just show the code:
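Below is a minimal sketch of the 5-fold stacking procedure described above. The base models and stacker are illustrative choices, not the author's original code:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold

def stack(base_models, stacker, X_train, y_train, X_test, n_folds=5):
    """5-fold stacking as described above (a simplified sketch)."""
    S_train = np.zeros((X_train.shape[0], len(base_models)))
    S_test = np.zeros((X_test.shape[0], len(base_models)))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for i, model in enumerate(base_models):
        test_preds = np.zeros((X_test.shape[0], n_folds))
        for j, (tr, ho) in enumerate(kf.split(X_train)):
            model.fit(X_train[tr], y_train[tr])
            S_train[ho, i] = model.predict(X_train[ho])  # hold-out fold predictions
            test_preds[:, j] = model.predict(X_test)     # predictions on test data
        S_test[:, i] = test_preds.mean(axis=1)  # average the n_folds test predictions
    stacker.fit(S_train, y_train)               # fit the second-level model
    return stacker.predict(S_test)

X, y = make_regression(n_samples=500, n_features=10, noise=0.5, random_state=0)
X_train, X_test, y_train, y_test = X[:400], X[400:], y[:400], y[400:]
preds = stack([Ridge(), RandomForestRegressor(n_estimators=50, random_state=0)],
              LinearRegression(), X_train, y_train, X_test)
```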

Prize winners usually have larger and much more complicated ensembles. For beginners, implementing a correct 5-fold stacking is good enough.

### *Pipeline

We can see that the workflow for a Kaggle competition is quite complex, especially for model selection and ensemble. Ideally, we need a highly automated pipeline capable of:

• Modularized feature transforms, so that we only need to write a few lines of code for a new feature to be added to the training set.
• Automated grid search: we only need to set up the models and the parameter grid, and the search runs and records the best parameters.
• Automated ensemble generation: periodically take the best K models so far and ensemble them.

For beginners, the first one is not very important because the number of features is quite manageable; the third one is not important either because typically we only do several ensembles at the end of the competition. But the second one is good to have because manually recording the performance and parameters of each model is time-consuming and error-prone.

Chenglong Chen, the winner of Crowdflower Search Results Relevance, once released his pipeline on GitHub. It’s very complete and efficient. Yet it’s still very hard to understand and extract all his logic to build a general framework. This is something you might want to do when you have plenty of time.

## Home Depot Search Relevance

In this section I will share my solution in Home Depot Search Relevance and what I learned from top teams after the competition.

The task in this competition is to predict how relevant a result is for a search term on the Home Depot website. The relevance score is the average of three human evaluators' ratings and ranges from 1 to 3, so it's a regression task. The dataset contains search terms, product titles / descriptions and some attributes like brand, size and color. The metric is RMSE.

This is much like Crowdflower Search Results Relevance. The difference is that Quadratic Weighted Kappa was used in that competition, which complicated the final cutoff of the regression scores. Also, no attributes were provided in that competition.

### EDA

There were several quite good EDAs by the time I joined the competition, especially this one. I learned that:

• Many search terms / products appeared several times.
• Text similarities are great features.
• Many products don't have attributes. Would this be a problem?
• Product ID seems to have strong predictive power. However the overlap of product ID between the training set and the testing set is not very high. Would this contribute to overfitting?

### Preprocessing

You can find how I did preprocessing and feature engineering on GitHub. I’ll only give a brief summary here:

1. Use typo dictionary posted in forum to correct typos in search terms.
2. Count attributes. Find those frequent and easily exploited ones.
3. Join the training set with the testing set. This is important because otherwise you’ll have to do feature transform twice.
4. Do stemming and tokenizing for all the text fields. Some normalization (involving digits and units) and synonym substitutions are performed manually.

### Feature

• *Attribute Features
• Whether the product contains a certain attribute (brand, size, color, weight, indoor/outdoor, energy star certified …)
• Whether a certain attribute matches with search term
• Meta Features
• Length of each text field
• Whether the product contains attribute fields
• Brand (encoded as integers)
• Product ID
• Matching
• Whether search term appears in product title / description / attributes
• Count and ratio of search term’s appearance in product title / description / attributes
• *Whether the i-th word of search term appears in product title / description / attributes
• Text similarities between search term and product title/description/attributes
• Latent Semantic Indexing: By performing SVD decomposition to the matrix obtained from BOW/TF-IDF Vectorization, we get the latent descriptions of different search term / product groups. This enables our model to distinguish between groups and assign different weights to features, therefore solving the issue of dependent data and products lacking some features (to an extent).

Note that the features listed above with * are the last batch of features I added. The problem is that the model trained on data including these features performed worse than before. At first I thought the increase in the number of features would require re-tuning of the model parameters. However, after wasting much CPU time on grid search, I still could not beat the old model. I think it might be the issue of feature correlation mentioned above. I actually knew of a solution that might work: combining models trained on different versions of the features by stacking. Unfortunately I didn't have enough time to try it. As a matter of fact, most of the top teams regard ensembling models trained with different preprocessing and feature engineering pipelines as a key to success.

### Model

At first I was using RandomForestRegressor to build my model. Then I tried Xgboost, and it turned out to be more than twice as fast as Sklearn. From then on, what I did every day was basically run grid search on my PC while working on features on my laptop.

The dataset in this competition is not trivial to validate: it's not i.i.d. and many records are dependent. Many times I used better features / parameters only to end up with worse LB scores. As repeatedly stated by many accomplished Kagglers, you have to trust your own CV score in such a situation. Therefore I decided to use 10-fold instead of 5-fold cross validation and to ignore the LB score in subsequent attempts.

### Ensemble

My final model is an ensemble consisting of 4 base models:

• RandomForestRegressor
• ExtraTreesRegressor
• GradientBoostingRegressor
• XGBRegressor

The stacker (L2 model) is also an XGBRegressor.

The problem is that all my base models are highly correlated (with a lowest correlation of 0.9). I thought of including linear regression, SVM regression and XGBRegressor with linear booster into the ensemble, but these models had RMSE scores that are 0.02 higher (this accounts for a gap of hundreds of places on the leaderboard) than the 4 models I finally used. Therefore I decided not to use more models although they would have brought much more diversity.

The good news is that, despite base models being highly correlated, stacking really bumps up my score. What’s more, my CV score and LB score are in complete sync after I started stacking.

During the last two days of the competition, I did one more thing: I used 20 or so different random seeds to generate the ensemble and took a weighted average of the results as the final submission. This is actually a kind of bagging. It makes sense in theory because in stacking I used 80% of the data to train base models in each iteration, whereas 100% of the data is used to train the stacker, so the separation between the two levels is not perfectly clean. Making multiple runs with different seeds ensures that a different 80% of the data is used each time, thus reducing the risk of information leak. Yet by doing this I only achieved an increase of 0.0004, which might just be due to randomness.

After the competition, I found out that my best single model scores 0.46378 on the private leaderboard, whereas my best stacking ensemble scores 0.45849. That was the difference between the 174th place and the 98th place. In other words, feature engineering and model tuning got me into 10%, whereas stacking got me into 5%.

### Lessons Learned

There’s much to learn from the solutions shared by top teams:

• There's a pattern in the product titles. For example, whether a product is accompanied by a certain accessory will be indicated by With/Without XXX at the end of the title.
• Use external data. For example use WordNet or Reddit Comments Dataset to train synonyms and hypernyms.
• Some features are based on letters instead of words. At first I was rather confused by this, but it makes perfect sense if you think about it. For example, the team that won 3rd place took the number of matched letters into consideration when computing text similarity. They argued that longer words are more specific and thus more likely to be assigned high relevance scores by humans. They also used char-by-char sequence comparison (difflib.SequenceMatcher) to measure visual similarity, which they claimed to be important to humans.
• POS-tag words and find anchor words. Use anchor words for computing various distances.
• Extract the top-ranking trigrams from the TF-IDF of the product title / description fields and compute the ratio of words from the search term that appear in these trigrams. Then do the same with the roles reversed. This is like computing latent indexes from another point of view.
• Some novel distance metrics like Word Movers Distance
• Apart from SVD, some used NMF.
• Generate pairwise polynomial interactions between top-ranking features.
• For CV, construct splits in which product IDs do not overlap between training set and testing set, and splits in which IDs do. Then we can use these with corresponding ratio to approximate the impact of public/private LB split in our local CV.

## Summary

### Takeaways

1. It was a good call to start doing ensembles early in the competition. As it turned out, I was still playing with features during the very last days.
2. It’s of high priority that I build a pipeline capable of automatic model training and recording best parameters.
3. Features matter the most! I didn’t spend enough time on features in this competition.
4. If possible, spend some time to manually inspect raw data for patterns.

### Issues Raised

Several issues I encountered in this competition are of high research value.

1. How to do reliable CV with dependent data.
2. How to quantify the trade-off between diversity and accuracy in ensemble learning.
3. How to deal with feature interaction which harms the model’s performance. And how to determine whether new features are effective in such situation.

### Beginner Tips

1. Choose a competition you’re interested in. It would be better if you already have some insight into the problem domain.
2. Following my approach or somebody else’s, start exploring, understanding and modeling data.
3. Learn from the forum and scripts. See how others interpret data and construct features.
4. Find winner interviews / blog posts from previous competitions. They’re very helpful.
5. Start doing ensemble after you have reached a pretty good score (e.g. ~ 10%) or you feel that there isn’t much room for new features (which, sadly, always turns out to be false).
6. If you think you may have a chance to win the prize, try teaming up!
7. Don’t give up until the end of the competition. At least try something new every day.
8. Learn from what the top teams share after the competition. Reflect on your own approaches. If possible, spend some time verifying what you learn.
9. Get some rest!

# The Future of Network Security Depends on Machine Learning

Invincea is a Virginia-based company specializing in malware detection and network security. Josh Saxe, the company’s chief research engineer, believes it is time to abandon the signature- and file-hash-based analysis techniques of the 1990s.

Saxe says: “I understand that some anti-virus companies have ventured into machine learning, but what they still live on is signature detection. They detect malware based on file hashes or pattern matching — analysis techniques that human researchers devised to detect a given sample.”

Part of Invincea’s advanced malware detection system is based on DARPA’s Cyber Genome program.

“We have demonstrated that our machine-learning-based approach is more effective than traditional anti-virus systems. A machine learning system can automate what human analysts do, and even do it better. Combine a machine learning system with a large amount of training data, and it can beat traditional signature-based detection.”

Invincea uses deep learning to speed up training. Saxe currently has about 1.5 million benign and malicious software samples for training the algorithms, all processed on GPUs using Python tools. He hopes that as the sample set grows to 30 million, the performance advantage of the machine learning system will grow linearly.

“The more training data we have, and the more malware samples we use to train the machine learning system, the more pronounced its advantage in detecting malware will be,” he says.

Saxe says Invincea’s current plan is to ship more deep-learning-based capabilities in its 2016 endpoint security products — specifically, adding them to Cynomix, an endpoint security product that already uses machine learning technology.

“In the past, enterprise security staff relied heavily on signature-based methods, such as IP address blacklists,” says Derek Lin, chief data scientist at Exabeam, a provider of user behavior analytics tools.

Exabeam builds a map of user activity by tracking users’ remote connections, devices, IP addresses, and credentials.

Rather than clinging to yesterday’s defensive strategies, Exabeam takes a proactive approach based on Gartner’s concept of UBA (User Behavior Analytics). The idea behind UBA is that you cannot know in advance whether a machine or a user is good or bad, so you assume they are malicious and your network is vulnerable, and you continuously monitor and model everyone’s behavior in order to surface malicious actors.

Lin says: “All of this is about profiling user behavior; the question is how it’s done. For every user or entity on the network, we try to build a sketch of what is normal — this involves statistical analysis. Then we look for deviations from normal at the conceptual level… We use behavior-based methods to find anomalies in the system and surface them for security analysts to review.”

“Think about the major waves of network security we have been through. Cybercriminals keep looking for effective ways to break security systems, and we have to strike back. Will machine learning become a mainstay of that counterattack? Absolutely,” says Patrick Townsend, founder and CEO of security software vendor Townsend Security.

Invincea’s Saxe hopes to ride the crest of this wave. He says: “I’m not surprised that companies in this field haven’t caught the wave and shipped new deep-learning-based algorithms. Training machine learning models has only recently become practical; it couldn’t have been done effectively ten years ago.”

# Machine learning and big data know it wasn’t you who just swiped your credit card

You’re sitting at home minding your own business when you get a call from your credit card’s fraud detection unit asking if you’ve just made a purchase at a department store in your city. It wasn’t you who bought expensive electronics using your credit card – in fact, it’s been in your pocket all afternoon. So how did the bank know to flag this single purchase as most likely fraudulent?

Credit card companies have a vested interest in identifying financial transactions that are illegitimate and criminal in nature. The stakes are high. According to the Federal Reserve Payments Study, Americans used credit cards to pay for 26.2 billion purchases in 2012. The estimated loss due to unauthorized transactions that year was US$6.1 billion. The federal Fair Credit Billing Act limits the maximum liability of a credit card owner to $50 for unauthorized transactions, leaving credit card companies on the hook for the balance. Obviously, fraudulent payments can have a big effect on the companies’ bottom lines. The industry requires any vendors that process credit cards to go through security audits every year. But that doesn’t stop all fraud.

In the banking industry, measuring risk is critical. The overall goal is to figure out what’s fraudulent and what’s not as quickly as possible, before too much financial damage has been done. So how does it all work? And who’s winning in the arms race between the thieves and the financial institutions?

## Gathering the troops

From the consumer perspective, fraud detection can seem magical. The process appears instantaneous, with no human beings in sight. This apparently seamless and instant action involves a number of sophisticated technologies in areas ranging from finance and economics to law to information sciences.

Of course, there are some relatively straightforward and simple detection mechanisms that don’t require advanced reasoning. For example, one good indicator of fraud can be an inability to provide the correct zip code affiliated with a credit card when it’s used at an unusual location. But fraudsters are adept at bypassing this kind of routine check – after all, finding out a victim’s zip code could be as simple as doing a Google search.

Traditionally, detecting fraud relied on data analysis techniques that required significant human involvement. An algorithm would flag suspicious cases to be closely reviewed ultimately by human investigators who may even have called the affected cardholders to ask if they’d actually made the charges. Nowadays the companies are dealing with a constant deluge of so many transactions that they need to rely on big data analytics for help. Emerging technologies such as machine learning and cloud computing are stepping up the detection game.

## Learning what’s legit, what’s shady

Simply put, machine learning refers to self-improving algorithms, which are predefined processes conforming to specific rules, performed by a computer. A computer starts with a model and then trains it through trial and error. It can then make predictions such as the risks associated with a financial transaction.

A machine learning algorithm for fraud detection needs to be trained first by being fed the normal transaction data of lots and lots of cardholders. Transaction sequences are an example of this kind of training data. A person may typically pump gas one time a week, go grocery shopping every two weeks and so on. The algorithm learns that this is a normal transaction sequence.

After this fine-tuning process, credit card transactions are run through the algorithm, ideally in real time. It then produces a probability number indicating the possibility of a transaction being fraudulent (for instance, 97%). If the fraud detection system is configured to block any transactions whose score is above, say, 95%, this assessment could immediately trigger a card rejection at the point of sale.

The algorithm considers many factors to qualify a transaction as fraudulent: trustworthiness of the vendor, a cardholder’s purchasing behavior including time and location, IP addresses, etc. The more data points there are, the more accurate the decision becomes.

This process makes just-in-time or real-time fraud detection possible. No person can evaluate thousands of data points simultaneously and make a decision in a split second.

Here’s a typical scenario. When you go to a cashier to check out at the grocery store, you swipe your card. Transaction details such as time stamp, amount, merchant identifier and membership tenure go to the card issuer. These data are fed to the algorithm that’s learned your purchasing patterns. Does this particular transaction fit your behavioral profile, consisting of many historic purchasing scenarios and data points?

The algorithm knows right away if your card is being used at the restaurant you go to every Saturday morning – or at a gas station two time zones away at an odd time such as 3:00 a.m. It also checks if your transaction sequence is out of the ordinary. If the card is suddenly used for cash-advance services twice on the same day when the historic data show no such use, this behavior is going to up the fraud probability score. If the transaction’s fraud score is above a certain threshold, often after a quick human review, the algorithm will communicate with the point-of-sale system and ask it to reject the transaction. Online purchases go through the same process.
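As a toy illustration of the thresholding step described above — and nothing like a production system — one could score a transaction by how far its amount deviates from the cardholder’s own history; the data, score formula, and threshold here are all invented:

```python
from statistics import mean, stdev

# Toy transaction history for one cardholder: (hour_of_day, amount).
history = [(9, 42.0), (10, 55.5), (9, 38.2), (11, 61.0), (10, 47.9), (9, 52.3)]

def fraud_score(txn, history):
    """Crude anomaly score in [0, 1): distance of the transaction amount
    from this cardholder's historical mean, in standard deviations,
    squashed into a probability-like number. (Amount only; the hour is
    kept just to mirror the scenario in the text.)"""
    amounts = [a for _, a in history]
    z = abs(txn[1] - mean(amounts)) / stdev(amounts)
    return z / (1.0 + z)

THRESHOLD = 0.95  # block transactions scoring above this

def decide(txn, history):
    return "reject" if fraud_score(txn, history) > THRESHOLD else "approve"

print(decide((10, 50.0), history))    # a typical purchase
print(decide((3, 2400.0), history))   # a 3 a.m. big-ticket outlier
```

A real system would combine hundreds of such signals per transaction, but the shape of the decision — score, then compare against a configured threshold — is the same.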

In this type of system, heavy human interventions are becoming a thing of the past. In fact, they could actually be in the way since the reaction time will be much longer if a human being is too heavily involved in the fraud-detection cycle. However, people can still play a role – either when validating a fraud or following up with a rejected transaction. When a card is being denied for multiple transactions, a person can call the cardholder before canceling the card permanently.

## Computer detectives, in the cloud

The sheer number of financial transactions to process is overwhelming, truly, in the realm of big data. But machine learning thrives on mountains of data – more information actually increases the accuracy of the algorithm, helping to eliminate false positives. These can be triggered by suspicious transactions that are really legitimate (for instance, a card used at an unexpected location). Too many alerts are as bad as none at all.

It takes a lot of computing power to churn through this volume of data. For instance, PayPal processes more than 1.1 petabytes of data for 169 million customer accounts at any given moment. This abundance of data – one petabyte, for instance, is more than 200,000 DVDs’ worth – has a positive influence on the algorithms’ machine learning, but can also be a burden on an organization’s computing infrastructure.
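The DVD comparison above checks out as rough arithmetic, assuming a 4.7 GB single-layer DVD:

```python
petabyte = 10 ** 15       # one petabyte, in bytes
dvd = 4.7 * 10 ** 9       # single-layer DVD capacity, in bytes

print(petabyte / dvd)     # roughly 213,000 DVDs per petabyte
```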

Enter cloud computing. Off-site computing resources can play an important role here. Cloud computing is scalable and not limited by the company’s own computing power.

Fraud detection is an arms race between good guys and bad guys. At the moment, the good guys seem to be gaining ground, with emerging innovations in IT technologies such as chip and pin technologies, combined with encryption capabilities, machine learning, big data and, of course, cloud computing.

Fraudsters will surely continue trying to outwit the good guys and challenge the limits of the fraud detection system. Drastic changes in the payment paradigms themselves are another hurdle. Your phone is now capable of storing credit card information and can be used to make payments wirelessly – introducing new vulnerabilities. Luckily, the current generation of fraud detection technology is largely neutral to the payment system technologies.

## Machine learning is the core of the Sigma system

Sigma contains both hand-written rules and machine learning algorithms, and the accept/reject decision on each request yields training labels: a request that passes is a positive sample for machine learning, and a rejected one is a negative sample — much like 0 and 1.

“Ranking” refers to the ordering of the news feed. It determines what your feed looks like when you open Facebook and where each story appears. A user’s friends generate two to three thousand pieces of content and news, but the user will only see 50 to 100 of them, so you need to surface the best of those two to three thousand. Some items we don’t show a user at all — for example, you may like games while your friend does not.

## How does news feed ranking work?

A/B testing is a great way to iterate. Establish core metrics, run an A/B test, and see whether a change improves them: ship it if it does, and don’t if it doesn’t. This is much like growth hacking; the ultimate goal is still DAU. If users like their news feed, they will visit more often — so it ultimately comes down to time spent and daily active users.

## “I can no longer read everything my friends post”

One of our weapons in the fight against spam, malware, and other abuse on Facebook is a system called Sigma. Its job is to proactively identify malicious actions on Facebook, such as spam, phishing attacks, posting links to malware, etc. Bad content detected by Sigma is removed automatically so that it doesn’t show up in your News Feed.

We recently completed a two-year-long major redesign of Sigma, which involved replacing the in-house FXL language previously used to program Sigma with Haskell. The Haskell-powered Sigma now runs in production, serving more than one million requests per second.

Haskell isn’t a common choice for large production systems like Sigma, and in this post, we’ll explain some of the thinking that led to that decision. We also wanted to share the experiences and lessons we learned along the way. We made several improvements to GHC (the Haskell compiler) and fed them back upstream, and we were able to achieve better performance from Haskell compared with the previous implementation.

## How does Sigma work?

Sigma is a rule engine, which means it runs a set of rules, called policies. Every interaction on Facebook — from posting a status update to clicking “like” — results in Sigma evaluating a set of policies specific to that type of interaction. These policies make it possible for us to identify and block malicious interactions before they affect people on Facebook.

Policies are continuously deployed. At all times, the source code in the repository is the code running in Sigma, allowing us to move quickly to deploy policies in response to new abuses. This also means that safety in the language we write policies in is important. We don’t allow code to be checked into the repository unless it is type-correct.

Louis Brandy of Facebook’s Site Integrity team discusses scalable spam fighting and the anti-abuse structure at Facebook and Instagram in a 2014 @Scale talk.

The original language we designed for writing policies, FXL, was not ideal for expressing the growing scale and complexity of Facebook policies. It lacked certain abstraction facilities, such as user-defined data types and modules, and its implementation, based on an interpreter, was slower than we wanted. We wanted the performance and expressivity of a fully fledged programming language. Thus, we decided to migrate to an existing language rather than try to improve FXL.

The following features were at the top of our list when we were choosing a replacement:

1. Purely functional and strongly typed. This ensures that policies can’t inadvertently interact with each other, they can’t crash Sigma, and they are easy to test in isolation. Strong types help eliminate many bugs before putting policies into production.

2. Automatically batch and overlap data fetches. Policies typically fetch data from other systems at Facebook, so we want to employ concurrency wherever possible for efficiency. We want concurrency to be implicit, so that engineers writing policies can concentrate on fighting spam and not worry about concurrency. Implicit concurrency also prevents the code from being cluttered with efficiency-related details that would obscure the functionality, and make the code harder to understand and modify.

3. Push code changes to production in minutes. This enables us to deploy new or updated policies quickly.

4. Performance. FXL’s slower performance meant that we were writing anything performance-critical in C++ and putting it in Sigma itself. This had a number of drawbacks, particularly the time required to roll out changes.

5. Support for interactive development. Developers working on policies want to be able to experiment and test their code interactively, and to see the results immediately.

Haskell measures up quite well: It is a purely functional and strongly typed language, and it has a mature optimizing compiler and an interactive environment (GHCi). It also has all the abstraction facilities we would need, it has a rich set of libraries available, and it’s backed by an active developer community.

That left us with two features from our list to address: (1) automatic batching and concurrency, and (2) hot-swapping of compiled code.

## Automatic batching and concurrency: The Haxl framework

All the existing concurrency abstractions in Haskell are explicit, meaning that the user needs to say which things should happen concurrently. For data-fetching, which can be considered a purely functional operation, we wanted a programming model in which the system just exploits whatever concurrency is available, without the programmer having to use explicit concurrency constructs. We developed the Haxl framework to address this issue: Haxl enables multiple data-fetching operations to be automatically batched and executed concurrently.

We discussed Haxl in an earlier blog post, and we published a paper on Haxl at the ICFP 2014 conference. Haxl is open source and available on GitHub.

In addition to the Haxl framework, we needed help from the Haskell compiler in the form of the Applicative do-notation. This allows programmers to write sequences of statements that the compiler automatically rearranges to exploit concurrency. We also designed and implemented Applicative do-notation in GHC.

## Hot-swapping of compiled code

Every time someone checks new code into the repository of policies, we want to have that code running on every machine in the Sigma fleet as quickly as possible. Haskell is a compiled language, so that involves compiling the code and distributing the new compiled code to all the machines running Sigma.

We want to update the compiled rules in a running Sigma process on the fly, while it is serving requests. Changing the code of a running program is a tricky problem in general, and it has been the subject of a great deal of research in the academic community. In our case, fortunately, the problem is simpler: Requests to Sigma are short-lived, so we don’t need to switch a running request to new code. We can serve new requests on the new code and let the existing requests finish before we discard the old code. We’re careful to ensure that we don’t change any code associated with persistent state in Sigma.

Loading and unloading code currently uses GHC’s built-in runtime linker, although in principle, we could use the system dynamic linker. To unload the old version of the code, the garbage collector gets involved. The garbage collector detects when old code is no longer being used by a running request, so we know when it is safe to unload it from the running process.

Haskell is sandwiched between two layers of C++ in Sigma. At the top, we use the C++ thrift server. In principle, Haskell can act as a thrift server, but the C++ thrift server is more mature and performant. It also supports more features. Furthermore, it can work seamlessly with the Haskell layers below because we can call into Haskell from C++. For these reasons, it made sense to use C++ for the server layer.

At the lowest layer, we have existing C++ client code for talking to other internal services. Rather than rewrite this code in Haskell, which would duplicate the functionality and create an additional maintenance burden, we wrapped each C++ client in a Haxl data source using Haskell’s Foreign Function Interface (FFI) so we could use it from Haskell.

Haskell’s FFI is designed to call C rather than C++, so calling C++ requires an intermediate C layer. In most cases, we were able to avoid the intermediate C layer by using a compile-time tool that demangles C++ function names so they can be called directly from Haskell.

## Performance

Perhaps the biggest question here is “Does it run fast enough?” Requests to Sigma result from users performing actions on Facebook, such as sending a message on Messenger, and Sigma must respond before the action can take place. So we wanted to serve requests fast enough to avoid interruptions to the user experience.

The graph below shows the relative throughput performance between FXL and Haskell for the 25 most common types of requests served by Sigma (these requests account for approximately 95 percent of Sigma’s typical workload).

Haskell performs as much as three times faster than FXL for certain requests. On a typical workload mix, we measured a 20 percent to 30 percent improvement in overall throughput, meaning we can serve 20 percent to 30 percent more traffic with the same hardware. We believe additional improvements are possible through performance analysis, tuning, and optimizing the GHC runtime for our workload.

Achieving this level of performance required a lot of hard work, profiling the Haskell code, and identifying and resolving performance bottlenecks.

Here are a few specific things we did:

• We implemented automatic memoization of top-level computations using a source-to-source translator. This is particularly beneficial in our use-case where multiple policies can refer to the same shared value, and we want to compute it only once. Note, this is per-request memoization rather than global memoization, which lazy evaluation already provides.
• We made a change to the way GHC manages the heap, to reduce the frequency of garbage collections on multicore machines. GHC’s default heap settings are frugal, so we also use a larger allocation area size of at least 64 MB per core.
• Fetching remote data usually involves marshaling the data structure across the C++/Haskell boundary. If the whole data structure isn’t required, it is better to marshal only the pieces needed. Or better still, don’t fetch the whole thing — although that’s only possible if the remote service implements an appropriate API.
• We uncovered a nasty performance bug in aeson, the Haskell JSON parsing library. Bryan O’Sullivan, the author of aeson, wrote a nice blog post about how he fixed it. It turns out that when you do things at Facebook scale, those one-in-a-million corner cases tend to crop up all the time.
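The per-request memoization in the first bullet is implemented in Sigma by a Haskell source-to-source translator; purely as an illustration of the idea (not Facebook’s code), a Python analogue with a hypothetical per-request cache might look like this:

```python
from functools import wraps

def per_request_memo(fn):
    """Memoize fn's results in a cache scoped to one request, so that
    multiple policies referring to the same shared value compute it only
    once per request (not globally)."""
    @wraps(fn)
    def wrapper(request, *args):
        cache = request.setdefault("_memo", {})
        key = (fn.__name__, args)
        if key not in cache:
            cache[key] = fn(request, *args)
        return cache[key]
    return wrapper

calls = {"n": 0}

@per_request_memo
def reputation(request, user_id):
    calls["n"] += 1            # count real (non-cached) computations
    return hash(user_id) % 100  # stand-in for an expensive shared value

req = {}                       # one request = one cache
reputation(req, "alice")
reputation(req, "alice")       # served from the per-request cache
assert calls["n"] == 1
reputation({}, "alice")        # a new request recomputes
assert calls["n"] == 2
```

The point of scoping the cache to the request is exactly the one the post makes: lazy evaluation already gives you global sharing, so what the translator adds is sharing per request.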

## Resource limits

In a latency-sensitive service, you don’t want a single request using a lot of resources and slowing down other requests on the same machine. In this case, the “resources” include everything on the machine that is shared by the running requests — CPU, memory, network bandwidth, and so on.

A request that uses a lot of resources is normally a bug that we want to fix. It does happen from time to time, often as a result of a condition that occurs in production that wasn’t encountered during testing — perhaps an innocuous operation provided with some unexpectedly large input data, or pathological performance of an algorithm on certain rare inputs, for example. When this happens, we want Sigma to terminate the affected request with an error (that will subsequently result in the bug being fixed) and continue without any impact on the performance of other requests being served.

To make this possible, we implemented allocation limits in GHC, which place a bound on the amount of memory a thread can allocate before it is terminated. Terminating a computation safely is a hard problem in general, but Haskell provides a safe way to abort a computation in the form of asynchronous exceptions. Asynchronous exceptions allow us to write most of our code ignoring the potential for summary termination and still have all the nice guarantees that we need in the event that the limit is hit, including safe releasing of resources, closing network connections, and so forth.

The following graph illustrates how well allocation limits work in practice. It tracks the maximum live memory across various groups of machines in the Sigma fleet. When we enabled one request that had some resource-intensive outliers, we saw large spikes in the maximum live memory, which disappeared when we enabled allocation limits.

## Enabling interactive development

Facebook engineers develop policies interactively, testing code against real data as they go. To enable this workflow in Haskell, we needed the GHCi environment to work with our full stack, including making requests to other back-end services from the command line.

To make this work, we had to make our build system link all the C++ dependencies of our code into a shared library that GHCi could load. We also customized the GHCi front end to implement some of our own commands and streamline the desired workflows. The result is an interactive environment in which developers can load their code from source in a few seconds and work on it with a fast turnaround time. They have the full set of APIs available and can test against real production data sources.

While GHCi isn’t as easy to customize as it could be, we’ve already made several improvements and contributed them upstream. We hope to make more improvements in the future.

## Packages and build systems

In addition to GHC itself, we make use of a lot of open-source Haskell library code. Haskell has its own packaging and build system, Cabal, and the open-source packages are all hosted on Hackage. The problem with this setup is that the pace of change on Hackage is fast, there are often breakages, and not all combinations of packages work well together. The system of version dependencies in Cabal relies too much on package authors getting it right, which is hard to ensure, and the tool support isn’t what it could be. We found that using packages directly from Hackage together with Facebook’s internal build tools meant adding or updating an existing package sometimes led to a yak-shaving exercise involving a cascade of updates to other packages, often with an element of trial and error to find the right version combinations.

As a result of this experience, we switched to Stackage as our source of packages. Stackage provides a set of package versions that are known to work together, freeing us from the problem of having to find the set by trial and error.

## Did we find bugs in GHC?

Yes, most notably:

• We fixed a bug in GHC’s garbage collector that was causing our Sigma processes to crash every few hours. The bug had gone undetected in GHC for several years.
• We fixed a bug in GHC’s handling of finalizers that occasionally caused crashes during process shutdown.

Following these fixes, we haven’t seen any crashes in either the Haskell runtime or the Haskell code itself across our whole fleet.

## What else?

At Facebook, we’re using Haskell at scale to fight spam and other types of abuse. We’ve found it to be reliable and performant in practice. Using the Haxl framework, our engineers working on spam fighting can focus on functionality rather than on performance, while the system can exploit the available concurrency automatically.

For more information on spam fighting at Facebook, check out our Protect the Graph page, or watch videos from our recent Spam Fighting @Scale event.

# Maximum Likelihood Estimation

## (1) Basic Idea

Suppose the observations $x_{1},...,x_{n}$ are drawn from a distribution with density $f_{D}$ depending on an unknown parameter $\theta$, so that

$P(x_{1},...,x_{n}) = f_{D}(x_{1},...,x_{n}|\theta)$.

Viewed as a function of $\theta$ with the data held fixed, this is the likelihood function

$L(\theta) = f_{D}(x_{1},...,x_{n}|\theta)$,

and the maximum likelihood estimate is the value of $\theta$ that maximizes $L(\theta)$.

Remark. The maximum likelihood estimate is not necessarily unique, and it does not always exist.

## (2) Basic Algorithm

(1) Define the likelihood function;

(2) Differentiate the likelihood function — or, more conveniently, its logarithm, since taking the logarithm usually makes the first derivative easier to compute;

(3) Set the first derivative to zero to obtain the likelihood equation for the parameter $\theta$;

(4) Solve the likelihood equation; its solution is the maximum likelihood estimate. If the equation cannot be solved directly, Newton’s method can be used to solve it approximately.

## (3) Examples

### (i) Bernoulli Distribution

The probability mass function is

$f(x_{i};p)=p^{x_{i}}(1-p)^{1-x_{i}}, \text{ for } x_{i}=0 \text{ or } 1 \text{ and } 0<p<1.$

$L(p)=\prod_{i=1}^{n}f(x_{i};p)=p^{\sum_{i=1}^{n}x_{i}}(1-p)^{n-\sum_{i=1}^{n}x_{i}}$.

$\ln(L(p))=(\sum_{i=1}^{n}x_{i})\ln(p) + (n-\sum_{i=1}^{n}x_{i})\ln(1-p)$

$\frac{\partial\ln(L(p))}{\partial p} = \frac{\sum_{i=1}^{n}x_{i}}{p}-\frac{n-\sum_{i=1}^{n}x_{i}}{1-p}$

Setting $\frac{\partial\ln(L(p))}{\partial p} = 0$ yields

$p=\sum_{i=1}^{n}x_{i}/n$.

$\hat{p}=\sum_{i=1}^{n}x_{i}/n$.
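A quick numeric check of this result: the closed-form estimate $\hat{p}$ (the sample mean) should agree with a brute-force maximization of the log-likelihood over a grid. The data below are made up:

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # made-up Bernoulli observations

def log_likelihood(p, xs):
    s = sum(xs)
    return s * math.log(p) + (len(xs) - s) * math.log(1 - p)

# Closed-form MLE: the sample mean.
p_hat = sum(data) / len(data)

# Grid search over (0, 1) should land on (approximately) the same value.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, data))

print(p_hat, p_grid)  # both are 0.7 for this data
```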

### （ii）Gaussian Distribution

$f(x_{i};\mu,\sigma^{2}) = \frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{(x_{i}-\mu)^{2}}{2\sigma^{2}})$.

$L(\mu,\sigma) = \prod_{i=1}^{n} f(x_{i};\mu,\sigma^{2}) = (2\pi)^{-n/2}\sigma^{-n}\exp(-\frac{\sum_{i=1}^{n}(x_{i}-\mu)^{2}}{2\sigma^{2}})$

$\ln(L(\mu,\sigma)) = -\frac{n}{2}\ln(2\pi) - n\ln(\sigma) - \frac{\sum_{i=1}^{n}(x_{i}-\mu)^{2}}{2\sigma^{2}}$

$\frac{\partial \ln(L(\mu,\sigma))}{\partial \mu} = \frac{\sum_{i=1}^{n}(x_{i}-\mu)}{\sigma^{2}}$

$\frac{\partial \ln(L(\mu,\sigma))}{\partial \sigma} = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n}(x_{i}-\mu)^{2}}{\sigma^{3}}$

Setting $\frac{\partial \ln(L(\mu,\sigma))}{\partial \mu} =0$ and $\frac{\partial \ln(L(\mu,\sigma))}{\partial \sigma}=0$ and solving the resulting system of equations gives:

$\hat{\mu}= \sum_{i=1}^{n}x_{i}/n$,

$\hat{\sigma}^{2} = \sum_{i=1}^{n}(x_{i}-\hat{\mu})^{2}/n$.
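In code, these two estimates are just the sample mean and the biased (divide-by-$n$) sample variance; the data below are made up:

```python
data = [2.1, 1.9, 2.4, 2.0, 1.8, 2.3, 2.2, 1.7]  # made-up observations
n = len(data)

mu_hat = sum(data) / n
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n  # divides by n, not n - 1

print(mu_hat, sigma2_hat)
```

Note that the MLE of the variance divides by $n$, not by $n-1$ as the unbiased sample variance does.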

### (iii) Weibull Distribution

$f(x;\lambda,k) = \frac{k}{\lambda}(\frac{x}{\lambda})^{k-1}e^{-(x/\lambda)^{k}} \text{ for } x\geq 0, \quad f(x;\lambda,k)=0 \text{ for } x<0.$

The cumulative distribution function of the Weibull distribution is

$F(x;k,\lambda) = 1- e^{-(x/\lambda)^{k}} \text{ for } x\geq 0$,

$F(x;k,\lambda) = 0 \text{ for } x<0$.

The quantile function (inverse cumulative distribution function) of the Weibull distribution is

$Q(p;k,\lambda) = \lambda(-\ln(1-p))^{1/k} \text{ for } 0\leq p <1$.

$p(x_{i};k,\lambda) = \frac{k}{\lambda}(\frac{x_{i}}{\lambda})^{k-1}\exp(-(\frac{x_{i}}{\lambda})^{k})$.

$L(k,\lambda) = \prod_{i=1}^{n}p(x_{i};k,\lambda)$

$\ln(L(k,\lambda)) = n\ln(k) - nk\ln(\lambda) + (k-1)\sum_{i=1}^{n}\ln(x_{i}) - \sum_{i=1}^{n}x_{i}^{k}/\lambda^{k}.$

$\frac{\partial \ln(L(k,\lambda))}{\partial \lambda} = - \frac{nk}{\lambda} + \frac{k\sum_{i=1}^{n}x_{i}^{k}}{\lambda^{k+1}},$

$\frac{\partial \ln(L(k,\lambda))}{\partial k} = \frac{n}{k} - n\ln(\lambda) + \sum_{i=1}^{n}\ln(x_{i}) -\sum_{i=1}^{n}(\frac{x_{i}}{\lambda})^{k}\ln(\frac{x_{i}}{\lambda}).$

Setting $\frac{\partial \ln(L(k,\lambda))}{\partial \lambda} = 0$ gives

$\lambda^{k}=\frac{\sum_{i=1}^{n}x_{i}^{k}}{n},$

and substituting this into $\frac{\partial \ln(L(k,\lambda))}{\partial k} = 0$ gives

$\frac{1}{k} = \frac{\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i})}{\sum_{i=1}^{n}x_{i}^{k}} -\frac{\sum_{i=1}^{n}\ln(x_{i})}{n}.$

This equation has no closed-form solution, so define

$f(k) = \frac{\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i})}{\sum_{i=1}^{n}x_{i}^{k}} - \frac{\sum_{i=1}^{n}\ln(x_{i})}{n} - \frac{1}{k}.$

Its derivative is

$f^{'}(k)= \frac{\sum_{i=1}^{n}x_{i}^{k}(\ln(x_{i}))^{2}}{\sum_{i=1}^{n}x_{i}^{k}}-(\frac{\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i})}{\sum_{i=1}^{n}x_{i}^{k}})^{2} + \frac{1}{k^{2}}.$

By the Cauchy-Schwarz inequality,

$(\sum_{i=1}^{n}x_{i}^{k}(\ln(x_{i}))^{2})\cdot(\sum_{i=1}^{n}x_{i}^{k})\geq (\sum_{i=1}^{n}x_{i}^{k}\ln(x_{i}))^{2},$

so $f^{'}(k)\geq \frac{1}{k^{2}}>0$ and $f$ is strictly increasing.

Moreover,

$\lim_{k\rightarrow 0^{+}}f(k) = -\infty,$

$\lim_{k\rightarrow +\infty}f(k) > 0 \text{ if all } x_{i}>1,$

so the equation $f(k)=0$ has a unique root.

Newton's method can then be used to find it: starting from the initial value

$k_{0}= 0.0001,$

iterate

$k_{n+1} = k_{n}- \frac{f(k_{n})}{f^{'}(k_{n})} \text{ for all } n\geq 0.$

When the number of iterations is large enough, $k_{n}$ can be taken as the maximum likelihood estimate of $k$.
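A sketch of this Newton iteration in Python, checked on synthetic data drawn through the quantile function $Q(p;k,\lambda)$ given earlier. The guard against non-positive iterates is an added numerical safeguard, not part of the derivation above:

```python
import math
import random

def weibull_k_mle(xs, k0=0.0001, iters=200):
    """Solve f(k) = 0 by Newton's method, where f is the profile
    likelihood equation for the Weibull shape parameter k."""
    lnx = [math.log(x) for x in xs]
    n = len(xs)

    def f(k):
        xk = [x ** k for x in xs]
        return sum(a * b for a, b in zip(xk, lnx)) / sum(xk) - sum(lnx) / n - 1 / k

    def fp(k):
        xk = [x ** k for x in xs]
        s = sum(xk)
        m1 = sum(a * b for a, b in zip(xk, lnx)) / s
        m2 = sum(a * b * b for a, b in zip(xk, lnx)) / s
        return m2 - m1 * m1 + 1 / k ** 2   # > 0, so f is strictly increasing

    k = k0
    for _ in range(iters):
        k_next = k - f(k) / fp(k)
        if k_next <= 0:            # guard against overshooting below zero
            k_next = k / 2
        if abs(k_next - k) < 1e-10:
            return k_next
        k = k_next
    return k

# Sanity check on synthetic data drawn via the quantile function
# Q(p) = lam * (-ln(1 - p))**(1/k) with known k and lam.
random.seed(0)
true_k, true_lam = 1.5, 2.0
xs = [true_lam * (-math.log(1 - random.random())) ** (1 / true_k)
      for _ in range(5000)]

k_hat = weibull_k_mle(xs)
lam_hat = (sum(x ** k_hat for x in xs) / len(xs)) ** (1 / k_hat)
print(round(k_hat, 2), round(lam_hat, 2))  # close to 1.5 and 2.0
```

Once $\hat{k}$ is found, $\hat{\lambda}$ follows directly from $\lambda^{k}=\sum_{i}x_{i}^{k}/n$, as in the last line above.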

# How machine learning can help the security industry

Machine learning (ML) is such a hot area in security right now.

At the 2016 RSA Conference, you would be hard pressed to find a company that is not claiming to use ML for security. And why not? To the layperson, ML seems like the magic solution to all security problems. Take a bunch of unlabeled data, pump it through a system with some ML magic inside, and it can somehow identify patterns even human experts can’t find — all while learning and adapting to new behaviors and threats. Rather than having to code the rules, these systems can discover the rules all by themselves.

Oh, if only that were the case! ML is this year’s “big data”: Everyone is claiming to do it, but few actually do it right or even understand what it’s good for. Especially in security, I’ve seen more misapplications than appropriate ones.

Most applications of ML in security use a form of anomaly detection, which is used to spot events that do not match an expected pattern. Anomaly detection is a useful technique in certain circumstances, but too often, vendors misapply it. For example, they will claim to analyze network traffic in an enterprise and use ML to find hackers in your network. This does not work, and you should be immediately skeptical of the vendors who make this claim.

Effective machine learning requires a low dimensionality problem with high-quality labeled data. Unfortunately, deployments in real enterprises have neither. Detecting novel attacks requires either clear, labeled examples of attacks, which you do not have by definition, or a complete, exhaustive understanding of “normal” network behavior, which is impossible for any real network. And any sophisticated attacker will make an attack appear as seamless and “typical” as possible, to avoid setting off alarms.

### Where does ML work?

One example where ML and anomaly detection can actually work well for security is in classifying human behavior. Humans, it turns out, are fairly predictable, and it is possible to build fairly accurate models of individual user behavior and detect when it doesn’t match their normal behavior.

We’ve had success in using ML for implicit authentication via analyzing a user’s biometrics, behavior, and environment. Implicit authentication is a technique that allows users to authenticate without performing any explicit actions like entering a password or swiping a fingerprint. This has clear benefits to both the user experience as well as for security. Users don’t need to be bothered with extra steps, we can use many authentication factors (rather than just one, a password), and it can happen continuously in the background.

Implicit authentication is well-suited to ML because most of the factors are low dimensional, meaning they involve a small number of parameters, and you can passively gather high-quality labeled data about user identities. Much like ML is effective in matching images for computer vision even in the presence of variance and noise, it is also effective in matching unique human behavioral aspects.

One example of this technology is how we can authenticate users based on unique aspects of the way they move. Attributes of the way you walk, sit, and stand are influenced by a large number of factors (including physiology, age, gender, and muscle memory), but are largely consistent for an individual. It is actually possible to accurately detect some of these attributes from the motion sensors in your phone in your pocket. In fact, after four seconds of motion data from a phone in your pocket, we can detect enough of these attributes to identify you. Another example is in using a user’s location history to authenticate them. Humans are creatures of habit, and by looking at where they came from and when, we can make an estimate of whether it’s them.

There are enough sensors in phones and computers (and more recently, wearables and IoT devices) that it is possible to passively pick up a large number of unique attributes about a user’s behavior and environment. We can then use ML to build a unique model for an individual user and find correlations between factors.
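The per-user modeling idea above can be sketched in a few lines. This is a toy illustration, not the system described in the article: the feature names (step interval, swipe speed) and the independent-Gaussian-per-feature model are assumptions made for the example.

```python
from statistics import mean, stdev

class UserBehaviorModel:
    """Toy per-user model: an independent Gaussian per behavioral feature.

    Purely illustrative -- a real implicit-authentication system would fuse
    many sensors with far richer models.
    """

    def __init__(self):
        self.params = []  # (mean, stdev) per feature, set by fit()

    def fit(self, samples):
        """samples: equal-length feature vectors recorded from the genuine user."""
        self.params = [(mean(f), stdev(f)) for f in zip(*samples)]
        return self

    def anomaly_score(self, vector):
        """Sum of squared z-scores: larger means less like the enrolled user."""
        return sum(((x - m) / s) ** 2 for x, (m, s) in zip(vector, self.params))

# Hypothetical enrollment data: (step_interval_s, swipe_speed) readings.
genuine = [(0.52, 1.9), (0.50, 2.1), (0.55, 2.0), (0.51, 2.2), (0.53, 1.8)]
model = UserBehaviorModel().fit(genuine)

legit_score = model.anomaly_score((0.52, 2.0))
imposter_score = model.anomaly_score((0.80, 3.5))
print(legit_score < imposter_score)
```

A reading far outside the enrolled distribution produces a much larger score, which is the signal an implicit-authentication system would act on.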

### Threat models and anomaly detection

In any security system, it is important to understand the threat models you are trying to protect against. When using ML for security, you need to explicitly gather data, model the threats your system is protecting against, and use the model to train your system. Fortunately, for attacks against authentication, it is often possible to detect behavioral changes. For example, when a device is stolen, there are often clear changes in terms of its movement, location, and usage. And because false negatives are acceptable in that they just require the user to re-authenticate with a different method, we can tune the system to minimize false positives. In fact, once we combine four factors across multiple devices, we can get below a 0.001 percent false positive rate on implicit authentication.
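One way to read “tune the system to minimize false positives” is to place the accept threshold strictly below any score an imposter has been observed to produce, and pay for it with occasional re-authentication prompts. A minimal sketch with hypothetical scores (the score values, the threshold policy, and the margin factor are all invented for illustration):

```python
def pick_threshold(imposter_scores, margin=0.5):
    """Place the accept threshold (accept when anomaly score < threshold)
    strictly below every observed imposter score; margin adds headroom.
    Hypothetical policy, for illustration only."""
    return min(imposter_scores) * margin

# Hypothetical anomaly scores (higher = less like the enrolled user).
genuine_scores = [0.4, 0.9, 1.2, 0.7, 2.0, 21.0]   # one atypical session
imposter_scores = [35.0, 120.0, 60.0, 48.0]

t = pick_threshold(imposter_scores)
false_negatives = sum(s >= t for s in genuine_scores)  # genuine sessions rejected
print(t, false_negatives)
```

Every imposter score falls above the threshold and is rejected, while the one atypical genuine session triggers a re-authentication prompt: exactly the trade-off the paragraph above describes as acceptable.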

There is no magic machine learning genie that can solve all your security problems. Building an effective security product that uses ML requires a deep understanding of the underlying system, and many security problems are just not appropriate for ML. For those that are, it’s a very powerful technique. And don’t worry, the companies on the hype train will soon move on to newer fads, like mobile self-driving AR blockchain drone marketplaces.

# Hausdorff dimension of the graphs of the classical Weierstrass functions

In this paper, we obtain the explicit value of the Hausdorff dimension of the graphs of the classical Weierstrass functions, by proving absolute continuity of the SRB measures of the associated solenoidal attractors.

1. Introduction

In Real Analysis, the classical Weierstrass function is

$\displaystyle W_{\lambda,b}(x) = \sum\limits_{n=0}^{\infty} \lambda^n \cos(2\pi b^n x)$

with ${1/b < \lambda < 1}$.

Note that the Weierstrass functions have the form

$\displaystyle f^{\phi}_{\lambda,b}(x) = \sum\limits_{n=0}^{\infty} \lambda^n \phi(b^n x)$

where ${\phi}$ is a ${\mathbb{Z}}$-periodic ${C^2}$-function.

Weierstrass (1872) and Hardy (1916) were interested in ${W_{\lambda,b}}$ because they are concrete examples of continuous but nowhere differentiable functions.

Remark 1 The graph of ${f^{\phi}_{\lambda,b}}$ tends to be a “fractal object” because ${f^{\phi}_{\lambda,b}}$ is self-similar in the sense that

$\displaystyle f^{\phi}_{\lambda, b}(x) = \phi(x) + \lambda f^{\phi}_{\lambda,b}(bx)$

We will come back to this point later.

Remark 2 ${f^{\phi}_{\lambda,b}}$ is a ${C^{\alpha}}$-function for all ${0\leq \alpha < \frac{-\log\lambda}{\log b}}$. In fact, for all ${x,y\in[0,1]}$, we have

$\displaystyle \frac{f^{\phi}_{\lambda, b}(x) - f^{\phi}_{\lambda,b}(y)}{|x-y|^{\alpha}} = \sum\limits_{n=0}^{\infty} \lambda^n b^{n\alpha} \left(\frac{\phi(b^n x) - \phi(b^n y)}{|b^n x - b^n y|^{\alpha}}\right),$

so that

$\displaystyle \frac{f^{\phi}_{\lambda, b}(x) - f^{\phi}_{\lambda,b}(y)}{|x-y|^{\alpha}} \leq \|\phi\|_{C^{\alpha}} \sum\limits_{n=0}^{\infty}(\lambda b^{\alpha})^n:=C(\phi,\alpha,\lambda,b) < \infty$

whenever ${\lambda b^{\alpha} < 1}$, i.e., ${\alpha < -\log\lambda/\log b}$.

The study of the graphs of ${W_{\lambda,b}}$ as fractal sets started with the work of Besicovitch-Ursell in 1937.

Remark 3 The Hausdorff dimension of the graph of a ${C^{\alpha}}$-function ${f:[0,1]\rightarrow\mathbb{R}}$ is

$\displaystyle \textrm{dim}(\textrm{graph}(f))\leq 2 - \alpha$

Indeed, for each ${n\in\mathbb{N}}$, the Hölder continuity condition

$\displaystyle |f(x)-f(y)|\leq C|x-y|^{\alpha}$

leads us to the “natural cover” of ${G=\textrm{graph}(f)}$ by the family ${(R_{j,n})_{j=1}^n}$ of rectangles given by

$\displaystyle R_{j,n}:=\left[\frac{j-1}{n}, \frac{j}{n}\right] \times \left[f(j/n)-\frac{C}{n^{\alpha}}, f(j/n)+\frac{C}{n^{\alpha}}\right]$

Nevertheless, a direct calculation with the family ${(R_{j,n})_{j=1}^n}$ does not give us an appropriate bound on ${\textrm{dim}(G)}$. In fact, since ${\textrm{diam}(R_{j,n})\leq 4C/n^{\alpha}}$ for each ${j=1,\dots, n}$, we have

$\displaystyle \sum\limits_{j=1}^n\textrm{diam}(R_{j,n})^d\leq n\left(\frac{4C}{n^{\alpha}}\right)^d = (4C)^{1/\alpha} < \infty$

for ${d=1/\alpha}$. Because ${n\in\mathbb{N}}$ is arbitrary, we deduce that ${\textrm{dim}(G)\leq 1/\alpha}$. Of course, this bound is certainly suboptimal for ${\alpha<1/2}$ (because we know that ${\textrm{dim}(G)\leq 2 < 1/\alpha}$ anyway).

Fortunately, we can refine the covering ${(R_{j,n})}$ by taking into account that each rectangle ${R_{j,n}}$ tends to be more vertical than horizontal (i.e., its height ${2C/n^{\alpha}}$ is usually larger than its width ${1/n}$). More precisely, we can divide each rectangle ${R_{j,n}}$ into ${\lfloor n^{1-\alpha}\rfloor}$ squares, say

$\displaystyle R_{j,n} = \bigcup\limits_{k=1}^{\lfloor n^{1-\alpha}\rfloor}Q_{j,n,k},$

such that every square ${Q_{j,n,k}}$ has diameter ${\leq 2C/n}$. In this way, we obtain a covering ${(Q_{j,n,k})}$ of ${G}$ such that

$\displaystyle \sum\limits_{j=1}^n\sum\limits_{k=1}^{\lfloor n^{1-\alpha}\rfloor} \textrm{diam}(Q_{j,n,k})^d \leq n\cdot n^{1-\alpha}\cdot\left(\frac{2C}{n}\right)^d\leq (2C)^{2-\alpha}<\infty$

for ${d=2-\alpha}$. Since ${n\in\mathbb{N}}$ is arbitrary, we conclude the desired bound

$\displaystyle \textrm{dim}(G)\leq 2-\alpha$
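As a worked example of how Remark 2 and this bound combine (the specific parameters are our choice, not from the original post): take ${\lambda=1/2}$ and ${b=4}$.

```latex
% Worked example: \lambda = 1/2, b = 4.
\lambda b^{\alpha} < 1 \iff \alpha < \frac{-\log\lambda}{\log b}
    = \frac{\log 2}{\log 4} = \frac{1}{2},
% so W_{1/2,4} is C^{\alpha} for every \alpha < 1/2 (Remark 2), and the
% bound above then gives, letting \alpha \to 1/2,
\dim(\mathrm{graph}(W_{1/2,4})) \leq \frac{3}{2} = 2 + \frac{\log(1/2)}{\log 4}.
```

In other words, the covering argument already pins the dimension from above at the conjectured value; the substance of the problem is the matching lower bound.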

A long-standing conjecture about the fractal geometry of ${W_{\lambda,b}}$ is:

Conjecture (Mandelbrot 1977): The Hausdorff dimension of the graph of ${W_{\lambda,b}}$ is

$\displaystyle 1<\textrm{dim}(\textrm{graph}(W_{\lambda,b})) = 2 + \frac{\log\lambda}{\log b} < 2$
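For integer parameters the conjectured value can be sanity-checked numerically. Below is a crude single-scale box-counting sketch in Python (the truncation depth, sampling density, and scale are arbitrary choices; box counting converges slowly, so the single-scale estimate only loosely tracks ${2+\log\lambda/\log b}$):

```python
import math

def make_weierstrass(lam, b, terms=40):
    """Partial-sum approximation of W_{lam,b}(x) = sum lam^n cos(2 pi b^n x)."""
    coeffs = [(lam ** n, b ** n) for n in range(terms)]
    def W(x):
        return sum(a * math.cos(2 * math.pi * f * x) for a, f in coeffs)
    return W

def box_dimension_estimate(W, eps, samples_per_col=50):
    """Crude box-counting estimate at a single scale eps: cover each width-eps
    column of the graph over [0,1] with height-eps boxes and count them."""
    n_cols = round(1 / eps)
    boxes = 0
    for j in range(n_cols):
        ys = [W((j + k / samples_per_col) * eps) for k in range(samples_per_col)]
        boxes += int((max(ys) - min(ys)) / eps) + 1
    return math.log(boxes) / math.log(1 / eps)

lam, b = 0.5, 3
conjectured = 2 + math.log(lam) / math.log(b)
estimate = box_dimension_estimate(make_weierstrass(lam, b), eps=1e-3)
print(f"conjectured {conjectured:.3f}, crude single-scale estimate {estimate:.3f}")
```

A single scale gives only a rough reading; regressing ${\log N(\varepsilon)}$ against ${\log(1/\varepsilon)}$ over several scales is the standard refinement.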

Remark 4 In view of remarks 2 and 3, the whole point of Mandelbrot’s conjecture is to establish the lower bound

$\displaystyle \textrm{dim}(\textrm{graph}(W_{\lambda,b})) \geq 2 + \frac{\log\lambda}{\log b}$

Remark 5 The analog of Mandelbrot’s conjecture for the box and packing dimensions is known to be true (see, e.g., these papers here and here).

In a recent paper (see here), Shen proved the following result:

Theorem 1 (Shen) For any ${b\geq 2}$ integer and for all ${1/b < \lambda < 1}$, the Mandelbrot conjecture is true, i.e.,

$\displaystyle \textrm{dim}(\textrm{graph}(W_{\lambda,b})) = 2 + \frac{\log\lambda}{\log b}$

Remark 6 The techniques employed by Shen also allow him to show that given ${\phi:\mathbb{R}\rightarrow\mathbb{R}}$ a ${\mathbb{Z}}$-periodic, non-constant, ${C^2}$ function, and given ${b\geq 2}$ integer, there exists ${K=K(\phi,b)>1}$ such that

$\displaystyle \textrm{dim}(\textrm{graph}(f^{\phi}_{\lambda,b})) = 2 + \frac{\log\lambda}{\log b}$

for all ${1/K < \lambda < 1}$.

Remark 7 A previous important result towards Mandelbrot’s conjecture was obtained by Barańsky-Barány-Romanowska (in 2014): they proved that for all ${b\geq 2}$ integer, there exists ${1/b < \lambda_b < 1}$ such that

$\displaystyle \textrm{dim}(\textrm{graph}(W_{\lambda,b})) = 2 + \frac{\log\lambda}{\log b}$

for all ${\lambda_b < \lambda < 1}$.

The remainder of this post is dedicated to giving some ideas of Shen’s proof of Theorem 1 by discussing the particular case when ${1/b<\lambda<2/b}$ and ${b\in\mathbb{N}}$ is large.

2. Ledrappier’s dynamical approach

If ${b\geq 2}$ is an integer, then the self-similar function ${f^{\phi}_{\lambda,b}}$ (cf. Remark 1) is also ${\mathbb{Z}}$-periodic, i.e., ${f^{\phi}_{\lambda,b}(x+1) = f^{\phi}_{\lambda,b}(x)}$ for all ${x\in\mathbb{R}}$. In particular, if ${b\geq 2}$ is an integer, then ${\textrm{graph}(f^{\phi}_{\lambda,b})}$ is an invariant repeller for the endomorphism ${\Phi:\mathbb{R}/\mathbb{Z}\times\mathbb{R}\rightarrow \mathbb{R}/\mathbb{Z}\times\mathbb{R}}$ given by

$\displaystyle \Phi(x,y) = \left(bx\textrm{ mod }1, \frac{y-\phi(x)}{\lambda}\right)$

This dynamical characterization of ${G = \textrm{graph}(f^{\phi}_{\lambda,b})}$ led Ledrappier to the following criterion for the validity of Mandelbrot’s conjecture when ${b\geq 2}$ is an integer.

Denote by ${\mathcal{A}}$ the alphabet ${\mathcal{A}=\{0,\dots,b-1\}}$. The unstable manifolds of ${\Phi}$ through ${G}$ have slopes of the form

$\displaystyle (1,-\gamma \cdot s(x,u))$

where ${\frac{1}{b} < \gamma = \frac{1}{\lambda b} <1}$, ${x\in\mathbb{R}}$, ${u\in\mathcal{A}^{\mathbb{N}}}$, and

$\displaystyle s(x,u):=\sum\limits_{n=0}^{\infty} \gamma^n \phi'\left(\frac{x + u_1 + u_2 b + \dots + u_n b^{n-1}}{b^n}\right)$

In this context, the push-forwards ${m_x := (u\mapsto s(x,u))_*\mathbb{P}}$ of the Bernoulli measure ${\mathbb{P}}$ on ${\mathcal{A}^{\mathbb{N}}}$ (induced by the discrete measure assigning weight ${1/b}$ to each letter of the alphabet ${\mathcal{A}}$) play the role of conditional measures along vertical fibers of the unique Sinai-Ruelle-Bowen (SRB) measure ${\theta}$ of the expanding endomorphism ${T:\mathbb{R}/\mathbb{Z}\times\mathbb{R} \rightarrow \mathbb{R}/\mathbb{Z}\times\mathbb{R}}$,

$\displaystyle T(x,y) = (bx\textrm{ mod }1, \gamma y + \psi(x)),$

where ${\gamma=1/\lambda b}$ and ${\psi(x)=\phi'(x)}$. In plain terms, this means that

$\displaystyle \theta = \int_{\mathbb{R}/\mathbb{Z}} m_x \, d\textrm{Leb}(x) \ \ \ \ \ (1)$

where ${\theta}$ is the unique ${T}$-invariant probability measure which is absolutely continuous along unstable manifolds (see Tsujii’s paper).

As was shown by Ledrappier in 1992, the fractal geometry of the conditional measures ${m_x}$ has important consequences for the fractal geometry of the graph ${G}$:

Theorem 2 (Ledrappier) Suppose that for Lebesgue almost every ${x\in\mathbb{R}}$ the conditional measures ${m_x}$ have dimension ${\textrm{dim}(m_x)=1}$, i.e.,

$\displaystyle \lim\limits_{r\rightarrow 0}\frac{\log m_x(B(z,r))}{\log r} = 1 \textrm{ for } m_x\textrm{-a.e. } z$

Then, the graph ${G=\textrm{graph}(f^{\phi}_{\lambda,b})}$ has Hausdorff dimension

$\displaystyle \textrm{dim}(G) = 2 + \frac{\log\lambda}{\log b}$

Remark 8 Very roughly speaking, the proof of Ledrappier theorem goes as follows. By Remark 4, it suffices to prove that ${\textrm{dim}(G)\geq 2 + \frac{\log\lambda}{\log b}}$. By Frostman lemma, we need to construct a Borel measure ${\nu}$ supported on ${G}$ such that

$\displaystyle \underline{\textrm{dim}}(\nu) := \textrm{ ess }\inf \underline{d}(\nu,x) \geq 2 + \frac{\log\lambda}{\log b}$

where ${\underline{d}(\nu,x):=\liminf\limits_{r\rightarrow 0}\log \nu(B(x,r))/\log r}$. Finally, the main point is that the assumptions in Ledrappier’s theorem allow one to prove that the measure ${\mu^{\phi}_{\lambda, b}}$ given by the lift to ${G}$ of the Lebesgue measure on ${[0,1]}$ via the map ${x\mapsto (x,f^{\phi}_{\lambda,b}(x))}$ satisfies

$\displaystyle \underline{\textrm{dim}}(\mu^{\phi}_{\lambda,b}) \geq 2 + \frac{\log\lambda}{\log b}$

An interesting consequence of Ledrappier’s theorem and equation (1) is the following criterion for Mandelbrot’s conjecture:

Corollary 3 If ${\theta}$ is absolutely continuous with respect to the Lebesgue measure ${\textrm{Leb}_{\mathbb{R}^2}}$, then

$\displaystyle \textrm{dim}(G) = 2 + \frac{\log\lambda}{\log b}$

Proof: By (1), the absolute continuity of ${\theta}$ implies that ${m_x}$ is absolutely continuous with respect to ${\textrm{Leb}_{\mathbb{R}}}$ for Lebesgue almost every ${x\in\mathbb{R}}$.

Since ${m_x\ll \textrm{Leb}_{\mathbb{R}}}$ for almost every ${x}$ implies that ${\textrm{dim}(m_x)=1}$ for almost every ${x}$, the desired corollary now follows from Ledrappier’s theorem. $\Box$

3. Tsujii’s theorem

The relevance of Corollary 3 is explained by the fact that Tsujii found an explicit transversality condition implying the absolute continuity of ${\theta}$.

More precisely, Tsujii first introduced the following definition:

Definition 4

• Given ${\varepsilon>0}$, ${\delta>0}$ and ${x_0\in\mathbb{R}/\mathbb{Z}}$, we say that two infinite words ${u, v\in\mathcal{A}^{\mathbb{N}}}$ are ${(\varepsilon,\delta)}$-transverse at ${x_0}$ if either

$\displaystyle |s(x_0,u)-s(x_0,v)|>\varepsilon$

or

$\displaystyle |s'(x_0,u)-s'(x_0,v)|>\delta$

• Given ${q\in\mathbb{N}}$, ${\varepsilon>0}$, ${\delta>0}$ and ${x_0\in\mathbb{R}/\mathbb{Z}}$, we say that two finite words ${k,l\in\mathcal{A}^q}$ are ${(\varepsilon,\delta)}$-transverse at ${x_0}$ if ${ku}$, ${lv}$ are ${(\varepsilon,\delta)}$-transverse at ${x_0}$ for all pairs of infinite words ${u,v\in\mathcal{A}^{\mathbb{N}}}$; otherwise, we say that ${k}$ and ${l}$ are ${(\varepsilon,\delta)}$-tangent at ${x_0}$;
• ${E(q,x_0;\varepsilon,\delta):= \{(k,l)\in\mathcal{A}^q\times\mathcal{A}^q: (k,l) \textrm{ is } (\varepsilon,\delta)\textrm{-tangent at } x_0\}}$
• ${E(q,x_0):=\bigcap\limits_{\varepsilon>0}\bigcap\limits_{\delta>0} E(q,x_0;\varepsilon,\delta)}$;
• ${e(q,x_0):=\max\limits_{k\in\mathcal{A}^q}\#\{l\in\mathcal{A}^q: (k,l)\in E(q,x_0)\}}$
• ${e(q):=\max\limits_{x_0\in\mathbb{R}/\mathbb{Z}} e(q,x_0)}$.

Next, Tsujii proves the following result:

Theorem 5 (Tsujii) If there exists ${q\geq 1}$ integer such that ${e(q)<(\gamma b)^q}$, then

$\displaystyle \theta\ll\textrm{Leb}_{\mathbb{R}^2}$

Remark 9 Intuitively, Tsujii’s theorem says the following. The transversality condition ${e(q)<(\gamma b)^q}$ implies that the majority of strong unstable manifolds ${\ell^{uu}}$ are mutually transverse, so that they almost fill a small neighborhood ${U}$ of some point ${x_0}$ (see the figure in this paper of Tsujii). Since the SRB measure ${\theta}$ is absolutely continuous along strong unstable manifolds, the fact that the ${\ell^{uu}}$’s almost fill ${U}$ implies that ${\theta}$ becomes “comparable” to the restriction of the Lebesgue measure ${\textrm{Leb}_{\mathbb{R}^2}}$ to ${U}$.

Remark 10 In this setting, Barańsky-Barány-Romanowska obtained their main result by showing that, for adequate choices of the parameters ${\lambda}$ and ${b}$, one has ${e(1)=1}$. Indeed, once ${e(1)=1}$ is known, since ${1<\gamma b}$, Tsujii’s theorem and Ledrappier’s theorem (or rather Corollary 3) apply and yield the validity of Mandelbrot’s conjecture for certain parameters ${\lambda}$ and ${b}$.

For the sake of exposition, we will give just a flavor of the proof of Theorem 1 by sketching the derivation of the following result:

Proposition 6 Let ${\phi(x) = \cos(2\pi x)}$. If ${1/2<\gamma=1/\lambda b <1}$ and ${b\in\mathbb{N}}$ is sufficiently large, then

$\displaystyle e(1)<\gamma b$

In particular, by Corollary 3 and Tsujii’s theorem, if ${1/2<\gamma=1/\lambda b <1}$ and ${b\in\mathbb{N}}$ is sufficiently large, then Mandelbrot’s conjecture is valid, i.e.,

$\displaystyle \textrm{dim}(\textrm{graph}(W_{\lambda,b})) = 2+\frac{\log\lambda}{\log b}$

Remark 11 The proof of Theorem 1 in full generality (i.e., for ${b\geq 2}$ integer and ${1/b<\lambda<1}$) requires the introduction of a modified version of Tsujii’s transversality condition: roughly speaking, Shen defines a function ${\sigma(q)\leq e(q)}$ (inspired by the Peter-Paul inequality) and he proves

• (a) a variant of Proposition 6: if ${b\geq 2}$ integer and ${1/b<\lambda<1}$, then ${\sigma(q)<(\gamma b)^q}$ for some integer ${q}$;
• (b) a variant of Tsujii’s theorem: if ${\sigma(q)<(\gamma b)^q}$ for some integer ${q}$, then ${\theta\ll\textrm{Leb}_{\mathbb{R}^2}}$.

See Sections 2, 3, 4 and 5 of Shen’s paper for more details.

We start the (sketch of) proof of Proposition 6 by recalling that the slopes of unstable manifolds are given by

$\displaystyle s(x,u):=-2\pi\sum\limits_{n=0}^{\infty} \gamma^n \sin\left(2\pi\frac{x + u_1 + u_2 b + \dots + u_n b^{n-1}}{b^n}\right)$

for ${x\in\mathbb{R}}$, ${u\in\mathcal{A}^{\mathbb{N}}}$, so that

$\displaystyle s'(x,u)=-4\pi^2\sum\limits_{n=0}^{\infty} \left(\frac{\gamma}{b}\right)^n \cos\left(2\pi\frac{x + u_1 + u_2 b + \dots + u_n b^{n-1}}{b^n}\right)$

Remark 12 Since ${\gamma/b < \gamma}$, the series defining ${s'(x,u)}$ converges faster than the series defining ${s(x,u)}$.

By studying the first term of the expansion of ${s(x,u)}$ and ${s'(x,u)}$ (while treating the remaining terms as a “small error term”), it is possible to show that if ${(k,l)\in E(1,x_0)}$, then

$\displaystyle \left|\sin\left(2\pi\frac{x_0+k}{b}\right) - \sin\left(2\pi\frac{x_0+l}{b}\right)\right| \leq\frac{2\gamma}{1-\gamma} \ \ \ \ \ (2)$

and

$\displaystyle \left|\cos\left(2\pi\frac{x_0+k}{b}\right) - \cos\left(2\pi\frac{x_0+l}{b}\right)\right| \leq \frac{2\gamma}{b-\gamma} \ \ \ \ \ (3)$

(cf. Lemma 3.2 in Shen’s paper).

Using these estimates, we can find an upper bound for ${e(1)}$ as follows. Take ${x_0\in\mathbb{R}/\mathbb{Z}}$ with ${e(1)=e(1,x_0)}$, and let ${k\in\mathcal{A}}$ be such that ${(k,l_1),\dots,(k,l_{e(1)})\in E(1,x_0)}$ are distinct elements, listed in such a way that

$\displaystyle \sin(2\pi x_i)\leq \sin(2\pi x_{i+1})$

for all ${i=1,\dots,e(1)-1}$, where ${x_i:=(x_0+l_i)/b}$.

From (3), we see that

$\displaystyle \left|\cos\left(2\pi x_i\right) - \cos\left(2\pi x_{i+1}\right)\right| \leq \frac{4\gamma}{b-\gamma}$

for all ${i=1,\dots,e(1)-1}$.

Since ${x_i-x_{i+1}=(l_i-l_{i+1})/b}$ is a nonzero multiple of ${1/b}$ (the ${l_i}$ are distinct letters of ${\mathcal{A}}$), we have

$\displaystyle (\cos(2\pi x_i)-\cos(2\pi x_{i+1}))^2 + (\sin(2\pi x_i)-\sin(2\pi x_{i+1}))^2 = 4\sin^2(\pi(x_i-x_{i+1}))\geq 4\sin^2(\pi/b),$

it follows that

$\displaystyle |\sin(2\pi x_i)-\sin(2\pi x_{i+1})|\geq \sqrt{4\sin^2\left(\frac{\pi}{b}\right) - \left(\frac{4\gamma}{b-\gamma}\right)^2} \ \ \ \ \ (4)$

Now, we observe that

$\displaystyle \sqrt{4\sin^2\left(\frac{\pi}{b}\right) - \left(\frac{4\gamma}{b-\gamma}\right)^2} > \frac{4}{b} \ \ \ \ \ (5)$

for ${b}$ large enough. Indeed, this happens because

• ${\sqrt{z^2-w^2}>2(z-w)}$ if ${z+w>4(z-w)}$;
• ${z+w>4(z-w)}$ if ${z/w:=u < 5/3}$;
• ${\frac{2\sin(\frac{\pi}{b})}{\frac{4\gamma}{b-\gamma}}\rightarrow \frac{2\pi}{4\gamma} (< \frac{5}{3})}$ as ${b\rightarrow\infty}$, and ${2\sin(\frac{\pi}{b}) - \frac{4\gamma}{b-\gamma} \rightarrow (2\pi-4\gamma)\frac{1}{b} (>\frac{2}{b})}$ as ${b\rightarrow\infty}$ (here we used ${\gamma<1}$).

By combining (4) and (5), we deduce that

$\displaystyle |\sin(2\pi x_i)-\sin(2\pi x_{i+1})| > 4/b$

for all ${i=1,\dots, e(1)-1}$.

Since ${-1\leq\sin(2\pi x_1)\leq\sin(2\pi x_2)\leq\dots\leq\sin(2\pi x_{e(1)})\leq 1}$, the previous estimate implies that

$\displaystyle \frac{4}{b}(e(1)-1)<\sum\limits_{i=1}^{e(1)-1}(\sin(2\pi x_{i+1}) - \sin(2\pi x_i)) = \sin(2\pi x_{e(1)}) - \sin(2\pi x_1)\leq 2,$

i.e.,

$\displaystyle e(1)<1+\frac{b}{2}$

Thus, it follows from our assumptions (${\gamma>1/2}$, ${b}$ large) that

$\displaystyle e(1)<1+\frac{b}{2}<\gamma b$

This completes the (sketch of) proof of Proposition 6 (and our discussion of Shen’s talk).

# Why Are Most People “Low-Quality Hard Workers”?

(1) A blurry portrait of the ordinary “hard worker”

(2) Performing “diligence,” or actually getting things done?

1. The IELTS exam you had planned for went unprepared, so you had to skip it or walk in cold, and going abroad fell through;

2. The presentation and writing skills you always meant to improve made no real progress, so that rare opportunity for public speaking simply slipped away;

3. Even the long-awaited scene of “confidently declaring your feelings to your crush after finally losing the weight” never materialized, for reasons everyone can guess.

(3) To escape real thinking, most people are willing to do almost anything

(4) Only “deep thinking” brings the “cognitive upgrade” that makes a “high-quality hard worker”

“Why was the ratio of pistols to rifles among the captured weapons slightly higher in this battle than in others?”

“Why was the ratio of small vehicles to large vehicles among those captured or destroyed slightly higher than in other battles?”

“Why was the ratio of officers to enlisted men among those captured or killed slightly higher than in other battles?”