Don't Be a Low-Quality Hard Worker: Reading Notes


A while ago, an article titled 《为什么绝大多数人都是“低品质的勤奋者”》 ("Why Are Most People 'Low-Quality Hard Workers'?") circulated online. Rereading it a year later, and combining it with my own experience of pursuing a PhD and then working, I summarize a few thoughts here, in the hope of drawing some lessons for myself.

As a PhD student, one's time is usually more flexible than that of a full-time employee, much like a freelancer's. There is time to do what one wants: to read all kinds of books and to study whatever skills one fancies. But to truly walk the road of research, one absolutely cannot first learn every piece of knowledge and only then start doing research; rather, one keeps learning and growing along the way. Beyond preparing the necessary theoretical foundations up front, a researcher must deliberately give up some knowledge during the process, because research does not require mastering every topic: research is done, not learned. Along the way one only needs to grasp certain key points; the rest are problems to be tackled and solved independently as they arise. The attitude of "I will start research only after I finish reading these books" is simply unwise.

In industry, there are always miscellaneous tasks all day long, and someone has to pay attention to them and solve them. Still, in daily work, technical accumulation and investigation is an important job: it concerns not only an individual employee's growth but also the overall strength of the whole team. At such times, someone always needs to do survey-style work to explore whether a technology has real value for the team. In such surveys, different people work in completely different ways. Some like to start hacking right away, judging whether their choice is correct through endless trial and error; others prefer to investigate first, understanding the best techniques inside and outside the company, comparing them with their own environment, and only then judging whether the chosen solution is reasonable. Beyond that, some people love to seize trivial details and dwell on them repeatedly, as if the whole framework could be dug out of those details, trying to find the crux by starting from minutiae. Others prefer to start from the overall framework, using it to judge whether a solution is reasonable and to deduce from the whole what those details were for.

Readers of Demi-Gods and Semi-Devils (天龙八部) will recall that when Jiumozhi challenged the Shaolin temple and demonstrated the seventy-two supreme arts of Shaolin before its senior monks, all of them were shocked. Xuzhu, watching from the side, told the monks: the moves Jiumozhi performs are indeed Shaolin arts, but in essence they are driven by the Little Formless Skill (小无相功); the moves are the same, yet the internal energy behind them is Taoist. Why did the Shaolin monks fail to see the key to Jiumozhi's kung fu? Because they trained by clutching the martial-arts manuals; even after a lifetime of practice, a monk mastered at most thirteen of the arts. Jiumozhi's own training shows that the key to the martial arts lies not in the manuals but in the Buddhist scriptures. Without finding the right scripture, without finding the method for channeling one's inner energy, no matter how many years one practices with the manuals, a fundamental gap from the masters remains. The poet Lu You likewise taught his descendants: "汝果欲学诗,功夫在诗外" ("If you truly wish to learn poetry, the effort lies outside the poetry"). That is, to write truly good poems one must put effort into life itself, tasting its sweetness and bitterness, instead of reading a poetry anthology over and over.

During a PhD, choosing the thesis topic is generally a matter of utmost importance. If the topic is too hard, the student cannot finish it within those years; if it is too easy, finishing it carries little significance. PhD students are therefore extremely careful, often spending half a year, a year, or even longer to find a suitable topic; otherwise the following years will be painful. In research, one must think carefully before committing to any plan, because once the work starts, there may be no turning back in the short term. Choosing a badly mistaken path means walking further and further down it, with less and less chance of ever reaching a correct result.

To avoid becoming a low-quality hard worker, before doing anything in work or life one must think carefully and plan thoroughly in advance. Important-but-not-urgent matters cannot be handled by mere knee-jerk reaction. Generally, the things we handle by first reaction are the important and urgent ones, while the things we start only after deep thought are the important but not urgent ones. On the road of personal growth, what usually changes a person's fate is precisely the important-but-not-urgent work. Yet people are lazy by nature and reluctant to think deeply; the most common approach is to let stress responses decide the next plan. To keep growing, one must deliberately force oneself into deep thinking, and force oneself to spend some time every day on the important but not urgent things.

Whether in study or in work, one cannot cover everything, nor personally handle every single matter. One must instead focus on the key things and give up the less important ones. Before doing anything, ask whether it truly brings value and whether it benefits the team and oneself, rather than starting the moment it merely seems doable after a moment's thought. Blindly grinding away only looks diligent; truly escaping the trap of low-quality diligence still has a long way to go.

Funnel: Assessing Software Changes in Web-based Services

This post is a survey note on intelligent operations (AIOps) systems. The paper, from Professor Dan Pei of Tsinghua University, is titled "Funnel: Assessing Software Changes in Web-based Services"; roughly, it studies how software changes affect KPIs (key performance indicators). The main ideas and technical framework of the paper are outlined below.


Opprentice: Towards Practical and Automatic Anomaly Detection Through Machine Learning

This post covers an exploration in making operations systems intelligent. The paper, from Professor Dan Pei of Tsinghua University, is titled "Opprentice: Towards Practical and Automatic Anomaly Detection Through Machine Learning". Its goal is automated anomaly detection of KPIs (Key Performance Indicators) based on machine learning.

The name Opprentice comes from "Operator's Apprentice". The system has operators label anomalous data with their business experience, extracts features with a battery of time-series algorithms, and uses supervised learning models (e.g., Random Forest, GBDT, XGBoost) for offline training and online prediction. The paper reports that, using a little over a month of historical data for analysis and prediction, Opprentice already achieves precision >= 0.66 and recall >= 0.66.

1. Introduction to Opprentice

Challenges the system faces:

Definition challenge: it is difficult to precisely define anomalies in reality.

Detector challenge: to provide reasonable detection accuracy, selecting the most suitable detector requires both algorithm expertise and domain knowledge about the given service KPI. To address the definition challenge and the detector challenge, the authors advocate using supervised machine learning techniques.

Advantages of the system:

(i) Opprentice is the first detection framework to apply machine learning to acquiring realistic anomaly definitions and automatically combining and tuning diverse detectors to satisfy operators’ accuracy preference.

(ii) Opprentice addresses a few challenges in applying machine learning to such a problem: labeling overhead, infrequent anomalies, class imbalance, and irrelevant and redundant features.

(iii) Opprentice can automatically satisfy or approximate a reasonable accuracy preference (recall >= 0.66 and precision >= 0.66).

2. Background:

KPIs and KPI Anomalies:

KPIs: KPI data are time series with the format (timestamp, value). In this paper, Opprentice focuses on three kinds of KPIs: the search page view (PV), i.e., the number of successfully served queries; the number of slow responses of search data centers (#SR); and the 80th percentile of search response time (SRT).

Anomalies: KPI time series can also present various unexpected patterns (e.g., jitters, slow ramp-ups, sudden spikes and dips) at different severity levels, such as a sudden drop by 20% or 50%.

[Figure 1 of the Opprentice paper]

Problem and goals:

Recall: # of true anomalous points detected / # of true anomalous points.

Precision: # of true anomalous points detected / # of anomalous points detected.

FDR (false discovery rate): # of false anomalous points detected / # of anomalous points detected = 1 - precision.
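
To make these definitions concrete, here is a minimal sketch (my own illustration, not code from the paper) that computes the point-wise metrics from boolean anomaly flags:

import numpy as np

def point_metrics(truth, pred):
    """Point-wise precision and recall over boolean anomaly flags."""
    truth = np.asarray(truth, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    tp = np.sum(truth & pred)              # true anomalous points detected
    precision = tp / max(pred.sum(), 1)    # share of detections that are real anomalies
    recall = tp / max(truth.sum(), 1)      # share of real anomalies that are detected
    return precision, recall, 1.0 - precision   # the third value is the FDR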

The quantitative goal of Opprentice is precision >= 0.66 and recall >= 0.66.

The qualitative goal of Opprentice is to be automatic enough that operators need not be involved in selecting, combining, or tuning detectors.

3. Opprentice Overview:

[Figure 2 of the Opprentice paper]

(i) Opprentice approaches the above problem through supervised machine learning.

(ii) The features of the data are the outputs of the detectors (the basic detectors compute the features).

(iii) The labels of the data come from operators' experience (manual labeling).

(iv) Addressing challenges in machine learning:

(1) Labeling overhead: Opprentice provides a dedicated labeling tool with a simple and convenient interface.

(2) Incomplete anomaly cases: labeled history cannot cover every kind of anomaly.

(3) Class imbalance problem: normal points far outnumber anomalous ones.

(4) Irrelevant and redundant features.

4. Opprentice’s Design:

Architecture: operators label the data, and the numerous detector functions serve as feature extractors for the data.

[Figure 3 of the Opprentice paper]

Label Tool:

Operators label the anomalous segments manually, using a mouse in the tool's interface.

[Figure 4 of the Opprentice paper]

Detectors:

(i) Detectors as feature extractors:

Here, for each parameterized detector, we sample its parameters to obtain several fixed detectors; a detector with specific sampled parameters is called a (detector) configuration. Thus a configuration acts as a feature extractor:

data point + configuration (detector + sampled parameters) -> feature.

(ii) Choosing detectors (14 widely used detectors are currently implemented):

Opprentice can find suitable ones from a broad selection of detectors and achieve relatively high accuracy. Fourteen widely-used detectors are implemented in Opprentice, listed in Table 3 of the paper:

[Table 3 of the Opprentice paper: the 14 widely-used detectors]

"Diff": it simply measures anomaly severity using the differences between the current point and the point of the last slot, the point of the last day, and the point of the last week.

"MA of diff": it measures severity using the moving average of the difference between the current point and the point of the last slot.

The other 12 detectors come from previous literature. Among these, two are variants that use the MAD (median absolute deviation) around the median, instead of the standard deviation around the mean, to measure anomaly severity. A sketch of the two author-contributed detectors follows.
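
As an illustration, the two detectors described above might look like the sketch below. This is my paraphrase of the text, not the paper's code: the slot/day/week offsets assume a fixed 1-minute sampling interval, and the window size is my own choice.

import numpy as np

SLOTS_PER_DAY = 24 * 60          # assumed 1-minute sampling interval
SLOTS_PER_WEEK = 7 * SLOTS_PER_DAY

def diff_features(series, t):
    """'Diff': differences against the last slot, last day, and last week.
    Assumes t >= SLOTS_PER_WEEK so that all three offsets exist."""
    return (series[t] - series[t - 1],
            series[t] - series[t - SLOTS_PER_DAY],
            series[t] - series[t - SLOTS_PER_WEEK])

def ma_of_diff(series, t, window=10):
    """'MA of diff': moving average of the recent last-slot differences."""
    return np.diff(series[t - window:t + 1]).mean()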

(iii) Sampling parameters (two approaches: sweep the parameter space, or estimate the best parameters):

There are two methods to sample the parameters of the detectors.

(1) The first is to sweep the parameter space. For example, in EWMA we can choose \alpha \in \{0.1,0.3,0.5,0.7,0.9\} to obtain 5 typical features from EWMA; Holt-Winters has three parameters \alpha,\beta,\gamma valued in [0,1], and choosing \alpha,\beta,\gamma \in \{0.2,0.4,0.6,0.8\} yields 4^3 = 64 features. (2) The second is to estimate the "best" parameters from the data: for ARIMA, for instance, the parameters can be estimated from the data, generating a single configuration for the detector.
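
A hypothetical sketch of how sweeping the parameter space turns one parameterized detector into several fixed configurations, each yielding one feature per data point (the EWMA severity below is my simplification, not the paper's exact implementation):

from itertools import product

def ewma_severity(series, t, alpha):
    """Anomaly severity at time t: |actual - EWMA prediction from history|."""
    pred = series[0]
    for x in series[1:t]:
        pred = alpha * x + (1 - alpha) * pred
    return abs(series[t] - pred)

# Each sampled parameter set is one configuration, i.e. one feature extractor.
ewma_configs = [{"alpha": a} for a in (0.1, 0.3, 0.5, 0.7, 0.9)]   # 5 features
hw_configs = [{"alpha": a, "beta": b, "gamma": g}                  # 4^3 = 64 features
              for a, b, g in product((0.2, 0.4, 0.6, 0.8), repeat=3)]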

Supervised Machine Learning Models:

Decision trees, logistic regression, linear support vector machines (SVMs), and naive Bayes were considered. Below is a simple example of a decision tree.

[Figure 5 of the Opprentice paper]

Random Forest is an ensemble classifier using many decision trees. Its main principle is that a group of weak learners (e.g., individual decision trees) can together form a strong learner. To grow different trees, a random forest adds some elements of randomness. First, each tree is trained on a subset sampled from the original training set. Second, instead of evaluating all the features at each level, the trees only consider a random subset of the features each time. The random forest combines those trees by majority vote. These properties of randomness and ensembling make random forests more robust to noise, and better than decision trees when faced with irrelevant and redundant features.
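
Since the machine learning block is built on scikit-learn (see Section 5 below), the training step might look roughly like this sketch; the feature matrix and labels are synthetic placeholders of my own, not the paper's data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 14))             # one column per detector configuration
y = (rng.random(1000) < 0.05).astype(int)   # operators' labels; anomalies are rare

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
scores = clf.predict_proba(X)[:, 1]         # anomaly probability, later cut by cThld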

Configuring cThlds (computing and predicting the classification threshold):

(i) Methods to select proper cThlds (the offline part):

[Figure 6 of the Opprentice paper]

We need to configure cThlds rather than use the default one (e.g., 0.5) for two reasons.

(1) First, when faced with imbalanced data (anomalous data points are much less frequent than normal ones), machine learning algorithms typically fail to identify the anomalies (low recall) if the default cThld (e.g., 0.5) is used.

(2) Second, operators have their own preference regarding the precision and recall of anomaly detection.

The candidate metrics for trading off precision and recall are:

(1) F-Score: F-Score = 2*precision*recall/(precision+recall).

(2) SD(1,1): it selects the point with the shortest Euclidean distance to the upper right corner where the precision and the recall are both perfect.

(3) PC-Score (the metric this paper adopts to select a proper threshold):

If r >= R and p >= P, then PC-Score(r,p) = 2*r*p/(r+p) + 1; otherwise PC-Score(r,p) = 2*r*p/(r+p). Here R and P come from the operators' preference "recall >= R and precision >= P". Since the F-Score is at most 1, the +1 bonus guarantees that every threshold satisfying the preference outscores every threshold that does not, so we simply choose the cThld corresponding to the point with the largest PC-Score.
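
A direct transcription of this definition into Python (evaluate() below is a hypothetical helper, not part of the paper):

def pc_score(r, p, R=0.66, P=0.66):
    """PC-Score as defined above: the F-Score, plus a bonus of 1 whenever
    the operators' preference (recall >= R and precision >= P) is met."""
    f = 2 * r * p / (r + p) if (r + p) > 0 else 0.0
    return f + 1 if (r >= R and p >= P) else f

# Choosing a threshold (evaluate() is a hypothetical helper returning the
# (recall, precision) a candidate cThld achieves on the training data):
# best_cthld = max(candidates, key=lambda c: pc_score(*evaluate(c)))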

(ii) EWMA-based cThld prediction:

[Figure 7 of the Opprentice paper]

In online detection, we need to predict cThlds for detecting future data.

We use EWMA to predict the cThld of the i-th week (i.e., the i-th test set) based on the historical best cThlds. Specifically, EWMA works as follows:

If i = 1: cThld_{i}^{p} = cThld_{1}^{p}, initialized by 5-fold cross-validation (see below).

If i > 1: cThld_{i}^{p} = \alpha\cdot cThld_{i-1}^{b} + (1-\alpha)\cdot cThld_{i-1}^{p}, where cThld_{i-1}^{b} is the best cThld of the (i-1)-th week, cThld_{i}^{p} is the predicted cThld of the i-th week (the one used for detecting the i-th week's data), and \alpha \in [0,1] is the smoothing constant.

For the first week, 5-fold cross-validation initializes cThld_{1}^{p}. As \alpha increases, EWMA gives the recent best cThlds more influence in the prediction. The paper uses \alpha = 0.8.
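
A minimal sketch of this recurrence (the cross-validation step is replaced here by a fixed placeholder value):

def predict_cthlds(best_cthlds, alpha=0.8, init_cthld=0.5):
    """EWMA-based cThld prediction as described above.
    best_cthlds[i] is the best threshold of week i+1 found in hindsight;
    init_cthld stands in for the 5-fold cross-validation initialization."""
    preds = [init_cthld]                   # cThld_1^p
    for best in best_cthlds[:-1]:          # cThld_i^p for i > 1
        preds.append(alpha * best + (1 - alpha) * preds[-1])
    return preds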

5. Evaluation

In the evaluation figures, red marks the approaches taken by the Opprentice system and black marks the alternatives compared against.

[Figure 8 of the Opprentice paper]

Opprentice implements its 14 detectors in about 9,500 lines of Python, R, and C++ code. The machine learning block is based on the scikit-learn library.

In the evaluation, Random Forest outperforms decision trees, logistic regression, linear SVMs, and naive Bayes.


Focus: Shedding Light on the High Search Response Time in the Wild

This post continues the exploration of intelligent operations systems. The paper, titled "Focus: Shedding Light on the High Search Response Time in the Wild", is from Professor Dan Pei of Tsinghua University. Its goal: once high search response time (HSRT) is observed during operations, use machine learning to uncover the conditions and rules behind it. The Focus system was run on 2.5 months of data and analyzed billions of log entries. The main content of the paper is described below.

Problem statement:

To help search operators debug HSRT (high search response time), Focus is a search-log analysis framework that answers three questions:

(1) What is the HSRT condition?

(2) Which HSRT condition types are prevalent across days?

(3) How does each attribute affect SRT in those prevalent HSRT condition types?

Solution:

Focus has one component for each of the above questions:

(1) A decision tree based classifier to identify HSRT conditions in search logs of each day;

(2) A clustering-based condition type miner to combine similar HSRT conditions into one type and to find the prevalent condition types across days, following Occam's razor principle;

(3) An attribute effect estimator to analyze the effect of each individual attribute on SRT within a prevalent condition type.

Preliminaries:

(A) Search Logs:

For each measured query, its search log records two types of data: SRT with its components, and query attributes.

(1) SRT and SRT components (the feature level):

[Figure 1 of the Focus paper]

t_{1} is when a query is submitted; t_{2} is when the result HTML file has been downloaded; t_{3} is when the browser finishes parsing the HTML; t_{4} is when the page is completely rendered. SRT is measured as t_{4}-t_{1}, the user-received search response time.

T_{server} is the server response time for the HTML file, recorded by the servers; T_{net}=t_{2}-t_{1}-T_{server} is the network transmission time of the HTML file; T_{browser}=t_{3}-t_{2} is the browser parsing time of the HTML; T_{other}=t_{4}-t_{3} is the remaining time spent before the page is rendered, e.g., the download time of images from image servers.
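
These definitions transcribe directly into code; the sketch below just restates them:

def srt_components(t1, t2, t3, t4, t_server):
    """Literal transcription of the decomposition above (times in seconds)."""
    return {
        "SRT":       t4 - t1,             # user-received search response time
        "T_net":     t2 - t1 - t_server,  # network transmission of the HTML
        "T_browser": t3 - t2,             # browser parsing time
        "T_other":   t4 - t3,             # e.g. fetching images, final rendering
    }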

(2) Query attributes (the feature level):

The search logs record the following attributes for each measured query:

(i) Browser engine: Webkit (e.g., Chrome, Safari, and 360 Secure Browser), Gecko, Trident LEGC, Trident 4.0, Trident 5.0, and others.

(ii) ISP: China Telecom, China Unicom, China Mobile, China Netcom, China Tietong, and others.

(iii) Location: based on the client IP, converted to its geographic location. In total there are 32 provinces.

(iv) #Image: the number of embedded images in the result page.

(v) Ads: whether a result page contains paid advertisement links.

(vi) Loading Mode: The loading mode of a result page can be either synchronous or asynchronous.

(vii) Background page views: on the service side, the search engine S also post-analyzes the logs and generates the background page views. The background PVs (page views) for a query q are measured by the number of queries served within 30 seconds before and after q is served; this reflects the search request load around the time q is served. Due to confidentiality constraints, specific background PVs are normalized by the maximum value. (These after-the-fact statistics are fed into Focus's machine learning model as features; a sketch of the windowed count follows this list.)
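
A possible implementation of this windowed count, assuming a sorted array of serve timestamps (my sketch, not the paper's code):

import numpy as np

def background_pv(serve_times, q_time, window=30.0):
    """Number of queries served within `window` seconds before and after
    q_time. serve_times must be a sorted array of timestamps (seconds)."""
    lo = np.searchsorted(serve_times, q_time - window, side="left")
    hi = np.searchsorted(serve_times, q_time + window, side="right")
    return hi - lo - 1          # exclude the query q itself

# Normalized by the maximum, as the paper does for confidentiality:
# pvs = np.array([background_pv(ts, t) for t in ts]); pvs = pvs / pvs.max()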

(B) HSRT and HSRT conditions (the sample level):

[Figure 2 of the Focus paper]

Usually, the cumulative distribution function (CDF) of SRT in the search logs is used to set the threshold for high search response time. In this paper, HSRT is defined as an SRT longer than 1s.

Challenges of identifying HSRT conditions in multi-dimensional search logs (the hard parts the system must address):

(a) Naive single-dimension-based methods, such as pair-wise correlation analysis, are inefficient.

(b) Attributes can be interdependent, so methods that assume independence between attributes, such as naive Bayes, may not be applicable here.

(c) The output must avoid overlapping conditions such as {#image>30}, {ads=yes}, and {#image>20, ads=yes}. (As days go by, running the model daily may otherwise emit similar or duplicate rules.)

Key Ideas and System Overview

A condition is a combination of attributes and their specific values in the search logs.

An HSRT condition is a condition that covers at least 1% of the total queries and whose fraction of HSRT queries is larger than the global level:

(# of HSRT queries in the condition / # of queries in the condition) > (# of HSRT queries / # of queries). This definition merely serves to assign labels, i.e., to decide what counts as HSRT; in practice it can be adapted to the scenario at hand, for example using return codes or other indicators. A sketch of the check follows.
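
The check above, transcribed as a sketch (the boolean-mask representation is my own):

import numpy as np

def is_hsrt_condition(cond_mask, hsrt_mask, min_coverage=0.01):
    """cond_mask flags queries matching the condition; hsrt_mask flags
    queries with SRT > 1s. Both are boolean arrays over one day's queries."""
    n = len(cond_mask)
    covered = cond_mask.sum()
    if covered < min_coverage * n:        # must cover at least 1% of queries
        return False
    hsrt_in_cond = (cond_mask & hsrt_mask).sum() / covered
    return hsrt_in_cond > hsrt_mask.sum() / n   # above the global HSRT level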

‘Focus’ System Overview:

[Figure 6 of the Focus paper]

Input: search logs.

(i) Use a decision-tree-based classifier to identify HSRT conditions in each day's search logs.

(ii) Use a clustering-based condition type miner to identify condition types from similar HSRT conditions (merging similar conditions into one type) and to find the prevalent condition types across days.

(iii) Use an attribute effect estimator to analyze how an attribute affects SRT and the SRT components in each prevalent condition type (i.e., which attributes or features matter most for the label).

Output: the prevalent condition types (from step ii) and their attributes' effects on SRT (from step iii).

Part (i): the decision-tree-based classifier (this is the machine learning algorithm at the core of Focus; ID3, C4.5, and CART are the standard variants). It involves five parts: (1) expressing attribute splits; (2) evaluating splits; (3) stopping tree growth; (4) assigning labels: leaf nodes whose fraction of HSRT is larger than the global fraction receive the HSRT label; (5) identifying HSRT branching attribute conditions. A minimal sketch follows.
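
A minimal sketch of this step, using scikit-learn's CART implementation on synthetic placeholder columns (the real Focus classifier is more elaborate; the column names and data below are my own assumptions):

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
logs = pd.DataFrame({
    "n_image": rng.integers(0, 50, 1000),   # placeholder attribute columns
    "ads": rng.integers(0, 2, 1000),
    "srt": rng.exponential(0.6, 1000),
})
X, y = logs[["n_image", "ads"]], (logs["srt"] > 1.0).astype(int)  # label: SRT > 1s

tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10)   # stopping rules
tree.fit(X, y)
# Leaves whose HSRT fraction exceeds the global fraction receive the HSRT
# label; the splits on the path to such a leaf form one HSRT condition.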

[Figure 7 of the Focus paper]

Part (ii): the condition type miner groups HSRT conditions by (1) the same combination of attributes, (2) the same value for each categorical attribute, and (3) similar intervals for each numeric attribute, using the Jaccard index to measure the similarity between intervals (see the sketch below).
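
For the numeric-interval part, the Jaccard index of two intervals can be computed as below; the merge cutoff is my assumption, as the paper excerpt only says "similar interval":

def interval_jaccard(a, b):
    """Jaccard index of two numeric intervals a = (lo, hi), b = (lo, hi):
    |intersection| / |union|, used to decide whether intervals from
    different days' conditions are similar enough to merge."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# e.g. merge {#image > 28} with {#image > 30} (intervals clipped to the
# observed value range) when interval_jaccard(...) exceeds some cutoff.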

[Table 1 of the Focus paper]

Part (iii): the attribute effect estimator. For each condition type C=\{c_{1}\wedge c_{2}\wedge \cdots \wedge c_{i} \wedge \cdots \wedge c_{n}\}, a method is designed to understand how each attribute condition c_{i} affects SRT.

For example, what is the HSRT fraction caused by c_{i} in C? What SRT components (e.g. T_{net} and T_{server}) are affected by c_{i}?

Main idea: flip the condition c_{i} to its opposite \overline{c}_{i} to obtain the variant condition type C_{i}'=\{c_{1}\wedge c_{2}\wedge \cdots \wedge \overline{c}_{i} \wedge \cdots \wedge c_{n}\}. From the past days we know the total number of HSRT events, the number under condition type C, and the number under C_{i}'. The comparison between C and C_{i}' over these days is grounded in the specific HSRT conditions of those days, so this history-based comparison provides a reasonable estimate of each attribute's effect (i.e., which attributes contribute most to HSRT). A sketch follows.
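
A sketch of this comparison over one day's labeled queries (the boolean-mask representation and the sign convention are mine, not the paper's):

import numpy as np

def hsrt_variation_after_flip(hsrt_mask, cond_masks, i):
    """cond_masks is a list of boolean arrays, one per attribute condition
    c_i of the type C; hsrt_mask flags HSRT queries. Returns the change in
    HSRT% when c_i is negated."""
    C = np.logical_and.reduce(cond_masks)
    flipped = cond_masks[:i] + [~cond_masks[i]] + cond_masks[i + 1:]
    C_flip = np.logical_and.reduce(flipped)
    frac = lambda m: hsrt_mask[m].mean() if m.any() else float("nan")
    return frac(C_flip) - frac(C)   # < 0 means flipping c_i reduces HSRT%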

[Table 2 of the Focus paper]

[Tables 3 and 4 of the Focus paper]

In Table IV, the results are sorted by the variation of the fraction of HSRT in condition types (HSRT% column) caused by flipping an attribute condition.

(i) We highlight the variations greater than zero (getting worse after flipping an attribute condition).

(ii) Note that flipping the HSRT branching attribute conditions can yield improvements in HSRT%. For example, the conditions #image>x are all ranked at the top: reducing the impact of images on SRT promises the highest potential improvement in HSRT.

(iii) Table III and Table IV are the output of Focus to the operators for these months.

Observations from Further Investigation

Table IV raises some interesting questions (questions surfaced by Focus's output that human experience might not easily notice):

(1) Why does reducing #images increase T_{server}, the time that servers prepare the result HTML (row 1, 2, 3, 4 of Table IV)?

(2) How do ads inflate SRT? Why do pages with ads need more T_{net} and T_{browser} (row 7)?

(3) Why does Webkit engine perform better, especially greatly decreasing T_{browser} (row 5, 10, 11, 12)?

(4) It is natural that switching ISPs affects the network transmission time T_{net}, but why does switching to China Telecom reduce T_{server} by over 20% (rows 6, 8, 9)?