Searching Time Series

In the time-series similarity algorithms discussed previously, timestamps were matched one-to-one. In practice, however, two series can be shifted in time and still be very similar; for example, the sine function \sin(x) and the cosine function \cos(x) differ only by a translation of \pi/2. This post introduces some shape-based distance algorithms for time series and explains how, given a query series, to find similar series in a time-series database.

 

Similarity Computation Based on Dynamic Programming

The Classical DTW Algorithm

The typical representatives of dynamic-programming-based similarity computation are DTW (Dynamic Time Warping) and the Frechet distance; here we focus on DTW. DTW originated in speech signal processing: two people may utter the same phrase over different durations, and the Euclidean distance cannot fully handle such misalignment, which is why DTW was developed. DTW computes the best alignment between two time series. Suppose we have two series Q and C, both of length n, with Q:\{q_{1},q_{2},\cdots,q_{n}\} and C:\{c_{1},c_{2},\cdots,c_{n}\}. First, build an n\times n matrix whose (i,j) entry is dist(q_{i},c_{j}), where dist can be the L^{1}, L^{2}, L^{p}, or L^{\infty} norm. Next, we look for a path through this matrix with minimal cumulative distance; that path is the best alignment between the two series. Write the path as W: \{w_{1},\cdots,w_{K}\}, where each element of W records the cost of matching the i-th element of Q with the j-th element of C, i.e. w_{k}=(q_{i}-c_{j})^{2}.

[Figure: the DTW cost matrix and a warping path]

We now look for a path that minimizes the cumulative cost:

W^{*} = \arg\min_{W}\Big(\sqrt{\sum_{k=1}^{K}w_{k}}\Big).

This optimal path is found by dynamic programming; it satisfies the recurrence: for 1\leq i\leq n, 1\leq j\leq n,

DTW(i,j) = dist(q_{i},c_{j}) + \min(DTW(i-1,j-1), DTW(i-1,j), DTW(i,j-1)).

The initial conditions are DTW(0,0)=0 and DTW(i,0)=DTW(0,j)=\infty for all 1\leq i\leq n, 1\leq j\leq n. The final value DTW(n,n) is the solution we need, i.e. the DTW distance between the two time series. With this recursion, the DTW algorithm runs in O(n^{2}) time. In particular:

  1. if dist(q_{i},c_{j}) = (q_{i}-c_{j})^{2}, then \sqrt{DTW(n,n)} is the final distance;
  2. if dist(q_{i},c_{j}) = |q_{i}-c_{j}|, then DTW(n,n) is the final distance;
  3. if dist(q_{i},c_{j}) = |q_{i}-c_{j}|^{p}, then (DTW(n,n))^{1/p} is the final distance.

Remark. 

The DTW distance does not satisfy the triangle inequality. For example, with x=\{0\}, y=\{1,2\}, z=\{1,2,2\} and absolute-difference cost, we have

DTW(x,z)=5>DTW(x,y)+DTW(y,z) = 3 + 0 =3.

Speeding Up DTW

In some cases we can impose a window-length constraint: to compare q[i] with c[j], the indices must satisfy |i-j|\leq w, where w is the window length. The algorithm then reads as follows.

The initial conditions are the same as before.

DTW(i,j) = dist(q_{i},c_{j}) + \min(DTW(i-1,j-1), DTW(i-1,j), DTW(i,j-1)),

where, for every 1\leq i\leq n, the index j ranges over \max(1,i-w) \leq j\leq \min(n,i+w).
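To make the recurrence concrete, here is a minimal Python sketch of DTW with an optional window constraint (the function name dtw, the squared-difference cost, and the use of NumPy are illustrative choices, not part of the original post):

import numpy as np

def dtw(q, c, window=None):
    """DTW with squared-difference cost; window=None gives the classic O(n^2) version."""
    n, m = len(q), len(c)
    w = max(window, abs(n - m)) if window is not None else max(n, m)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return np.sqrt(D[n, m])   # square root, since the per-step cost is squared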

 

Searching for Similar Time Series

The search problem for similar time series is generally posed as follows:

Question 1. Given a time series Q and a database of time series D=\{C_{i}\}_{i=1}^{\infty}, use some similarity or distance measure to find the series in D most similar to Q.

Question 2. Given a time series Q, a database of time series D=\{C_{i}\}_{i=1}^{\infty}, and a positive integer K, find the K series in D most similar to Q.

A Lower Bound for DTW: LB_Kim

For two time series Q and C, extract four features from each: the maximum, the minimum, the first element, and the last element. From these we can compute LB_Kim:

LBKim(Q,C) = \max\{|\max(Q)-\max(C)|, |\min(Q)-\min(C)|, |First(Q)-First(C)|, |Last(Q)-Last(C)|\}.

One can show that LBKim(Q,C)\leq DTW(Q,C).

A Lower Bound for DTW: LB_Yi

Building on LB_Kim, another lower bound, LB_Yi, was later proposed; interested readers can consult: Efficient Retrieval of Similar Time Sequences Under Time Warping, 1998.

A Lower Bound for DTW: LB_Keogh

With a DTW-based search, however, computing the DTW distance between the query and every series costs O(n^{2}) each time. Researchers therefore looked for a lower bound of the DTW distance: for each candidate, first compare the lower bound against the current best distance; if the lower bound already exceeds the best distance, the full DTW computation can be skipped, and only otherwise is DTW actually computed. If the lower bound is fast to compute and reasonably tight, this prunes away a large share of the search time. Keogh proposed such a lower bound.

(1) First define upper and lower envelopes for the series Q:\{q_{1},\cdots,q_{n}\}. Given a window size r, set U_{i}=\max(q_{i-r},\cdots,q_{i+r}) and L_{i} = \min(q_{i-r},\cdots,q_{i+r}).

(2) Define:

LBKeogh(Q,C) = \sqrt{\sum_{i=1}^{n}(c_{i}-U_{i})^{2}I_{\{c_{i}>U_{i}\}} + \sum_{i=1}^{n}(c_{i}-L_{i})^{2}I_{\{c_{i}<L_{i}\}}}.

LBKeogh satisfies the following property:

for any two time series Q and C of length n and any window r\geq 1, the inequality LBKeogh(Q,C)\leq DTW(Q,C) holds.
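A minimal sketch of LB_Keogh under these definitions, with the envelope built over Q and the candidate C compared against it (the function name lb_keogh is an illustrative choice):

import numpy as np

def lb_keogh(q, c, r):
    """LB_Keogh: deviation of c outside the upper/lower envelope of q with window r."""
    total = 0.0
    n = len(q)
    for i, ci in enumerate(c):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        U = max(q[lo:hi])   # upper envelope U_i
        L = min(q[lo:hi])   # lower envelope L_i
        if ci > U:
            total += (ci - U) ** 2
        elif ci < L:
            total += (ci - L) ** 2
    return np.sqrt(total)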

The search algorithm can therefore be accelerated:

In Python, reusing the dtw and lb_keogh sketches above:

def lower_bounding_sequential_scan(Q, database, r):
    # Lower Bounding Sequential Scan (after Keogh)
    best_so_far = float('inf')
    index_of_best_match = None
    for i, C in enumerate(database):
        LB_dist = lb_keogh(Q, C, r)          # cheap lower bound first
        if LB_dist < best_so_far:
            true_dist = dtw(Q, C, window=r)  # full DTW only when the bound cannot prune
            if true_dist < best_so_far:
                best_so_far = true_dist
                index_of_best_match = i
    return index_of_best_match, best_so_far

In the paper Exact Indexing of Dynamic Time Warping, the authors also combine Piecewise Aggregate Approximation (PAA) with LB_Keogh to obtain the LB_PAA lower bound, which satisfies LBPAA(Q,C)\leq LBKeogh(Q,C)\leq DTW(Q,C).

 

Summary

This post gave an initial introduction to the DTW algorithm and its lower bounds, including LB_Kim and LB_Keogh, as well as search algorithms over a time-series database.

 

How to Understand Time Series? Starting from the Riemann and Lebesgue Integrals

The Riemann integral and the Lebesgue integral are two very important concepts in mathematics. Starting from their definitions, this post introduces their respective properties and the connection between them.

Integrals

The Riemann Integral

Although it is named after Riemann, this type of integral was studied in detail long before him. As early as Archimedes' time, to compute the area enclosed between the curve x^{2} and the x-axis over the interval [0,1], Archimedes used what is essentially the idea of the Riemann integral: he split [0,1] into n equal segments, approximated the area under x^{2} on each segment with a rectangle, and then let n grow large; as n tends to infinity, the area turns out to be 1/3.

Let us look at the precise definition of the Riemann integral.

[Figure: a Riemann sum approximating the area under a curve]

Consider a function f(x) defined on a closed interval [a,b].

Take a finite sequence of points a=x_{0}<x_{1}<\cdots<x_{n}=b, and let \lambda = \max(x_{i+1}-x_{i}), 0\leq i\leq n-1, denote the largest of the subinterval lengths. In each subinterval [x_{i},x_{i+1}], pick a point t_{i}\in[x_{i},x_{i+1}]. The Riemann sum of f(x) with respect to this tagged partition is:

\sum_{i=0}^{n-1}f(t_{i})(x_{i+1}-x_{i}).

To say that the integral of f(x) over the closed interval [a,b] equals s means the following:

for every \epsilon>0 there exists \delta>0 such that, for every tagged partition with \lambda<\delta,

|\sum_{i=0}^{n-1}f(t_{i})(x_{i+1}-x_{i}) - s|<\epsilon.

In symbols: \int_{a}^{b}f(x)dx=s.

Pictorially, this is exactly the rectangle construction shown in the figure above.

 

The Lebesgue Integral

The Riemann integral computes the area enclosed between a curve and the x-axis; the Lebesgue integral does the same thing, but computes that area in a slightly different way. For an intuitive picture of the two approaches, see the figure below:

[Figure: Riemann integration (top) vs. Lebesgue integration (bottom)]

The Riemann integral splits the base of the curve into intervals of equal length, measures the height of the curve over each interval, and takes the total area as the sum of the areas of the resulting rectangles.

The Lebesgue integral instead slices the curve into level sets, with the same gap between adjacent levels. Each level carries the length of the set it encloses, so the total area is the sum of the areas contributed by the levels.

In more vivid language: there is more than one way to eat a hamburger.

  1. Riemann integral: start from one corner and eat bite by bite, each bite containing every layer;
  2. Lebesgue integral: start from the top and eat layer by layer, in the order bread, vegetables, meat, egg, bread.

One more picture: the Riemann integral adds up the pieces in the order of the blue numbers, while the Lebesgue integral adds them up in the order of the red numbers.

[Figure: Riemann vs. Lebesgue summation order]

 

With these basic ideas in place, the Lebesgue integral can be defined as follows.

A simple function is a finite linear combination of indicator functions, i.e. \sum_{k}a_{k}1_{S_{k}}, where the a_{k} are coefficients, the S_{k} are measurable sets, and 1 denotes the indicator function. When the a_{k} are nonnegative, define

\int(\sum_{k}a_{k}1_{S_{k}})d\mu = \sum_{k}a_{k}\int 1_{S_{k}}d\mu = \sum_{k}a_{k}\mu(S_{k}).

If f is a nonnegative measurable function, the Lebesgue integral of f over a measurable set E is defined as:

\int_{E}f d\mu = \sup\{\int_{E}sd\mu: \mathbf{0}\leq s\leq f\},

where s ranges over nonnegative simple functions, \mathbf{0} denotes the zero function, and the inequalities are required to hold at every point of the domain.

For a general measurable function f, write f= f^{+}-f^{-}, where f^{+} and f^{-} are both nonnegative measurable functions. The Lebesgue integral of an arbitrary measurable function is then defined as:

\int fd\mu = \int f^{+}d\mu - \int f^{-}d\mu.

The Relationship Between the Riemann and Lebesgue Integrals

Having defined both integrals, one may ask whether they can contradict each other. They cannot, thanks to the following theorem:

if a bounded function f is Riemann integrable on the closed interval [a,b], then it is also Lebesgue integrable, and the two integrals agree:

(R)\int_{a}^{b}f(x)dx = (L)\int_{[a,b]}f(x)dx.

The left-hand side denotes the Riemann integral and the right-hand side the Lebesgue integral.

In the vivid language from before: whether you eat the hamburger bite by bite from a corner or layer by layer from the top, it is the same hamburger.

The Lebesgue integral, however, has a clear advantage over the Riemann integral. Consider the Dirichlet function:

  1. D(x) = 1 when x is rational;
  2. D(x) = 0 when x is irrational.

The Dirichlet function is defined on the real line with range \{0,1\}, and its graph cannot be drawn. It is not Riemann integrable, but it is Lebesgue integrable: since the rationals form a set of Lebesgue measure zero, its Lebesgue integral is 0.

 

Time Series

A time series is simply what we get by replacing the continuous functions above with discrete ones, changing the domain from a closed interval [a,b] to a set like \{1,2,3,\cdots\}. Many of the ideas discussed for continuous functions therefore carry over to time series.

Representing Time Series Based on the Riemann Integral

Following the Riemann integral's way of computing, we can represent the features of a time series by cutting it into segments along the time axis and representing each segment with some simple function (for example a linear function). This gives the following methods:

  1. Piecewise Linear Approximation
  2. Piecewise Aggregate Approximation
  3. Piecewise Constant Approximation

The essential idea behind all of these algorithms is dimensionality reduction: representing the original time series with only a small amount of data. In mathematical terms, data reduction for a time series means representing the original series \{x_{1},\cdots,x_{N}\} by \{x_{1}^{'},\cdots, x_{D}^{'}\} with D<N; the latter is then a representation of the original series.

Piecewise Aggregate Approximation (PAA): Analogous to the Riemann Integral

Among these algorithms, Piecewise Aggregate Approximation (PAA) is a classic. Suppose the original time series is C = \{x_{1},\cdots, x_{N}\}; the PAA series is defined as \overline{C} = \{\overline{x}_{1},\cdots,\overline{x}_{w}\},

where

\overline{x}_{i} = \frac{w}{N} \cdot \sum_{j=\frac{N}{w}(i-1)+1}^{\frac{N}{w}i} x_{j}.

for 1\leq i\leq w. Pictorially:

[Figure: PAA segment means overlaid on the original series]

For Piecewise Linear Approximation and Piecewise Constant Approximation, one only needs to modify the definition of \overline{x}_{i} slightly.
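As a concrete illustration, here is a minimal PAA sketch in Python (it assumes, for simplicity, that N is divisible by w; the function name paa is illustrative):

import numpy as np

def paa(x, w):
    """Piecewise Aggregate Approximation: the mean of each of w equal-length segments."""
    x = np.asarray(x, dtype=float)
    return x.reshape(w, len(x) // w).mean(axis=1)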

Symbolic Approximation: Analogous to Computing the Lebesgue Integral with Simple Functions

In feature engineering for recommender systems, features are routinely normalized, binarized, or discretized. A user's age, for instance, is usually not used as a raw date of birth but bucketed into ranges such as 0-6 (infancy), 7-12 (primary school), 13-17 (secondary school), 18-22 (university), and so on.

After computing piecewise features, those features are in some sense still continuous values. Can continuous values be mapped to a set of discrete ones? Researchers proposed using symbols to represent the key features of a time series, the so-called Symbolic Representation. Below we introduce the classic SAX representation.

If we want to represent the time series with \alpha symbols, we can use the standard normal distribution N(0,1): let \{z_{1/\alpha},\cdots,z_{(\alpha-1)/\alpha}\} be the breakpoints that divide the area under the Gaussian curve into \alpha equal parts, and let \{l_{1},\cdots,l_{\alpha}\} denote the \alpha letters.

The SAX procedure is as follows:

  1. Normalization: map the original time series to a new one with mean zero and variance one.
  2. Piecewise representation (PAA): \{x_{1},\cdots, x_{N}\} \Rightarrow \{\overline{x}_{1},\cdots,\overline{x}_{w}\}.
  3. Symbolic representation (SAX): if \overline{x}_{i}<z_{1/\alpha}, then \hat{X}_{i}=l_{1}; if z_{(j-1)/\alpha}\leq \overline{x}_{i}<z_{j/\alpha}, then \hat{X}_{i} = l_{j}, for 2\leq j\leq \alpha; if \overline{x}_{i}\geq z_{(\alpha-1)/\alpha}, then \hat{X}_{i} = l_{\alpha}.

In this way the original time series can be represented with the \alpha letters \{l_{1},\cdots,l_{\alpha}\}.

[Figure: SAX symbols assigned to PAA segments]
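A minimal SAX sketch in Python, using SciPy only to obtain the Gaussian breakpoints (the function name sax and the choice of letters a, b, c, ... are illustrative):

import numpy as np
from scipy.stats import norm

def sax(x, w, alpha):
    """Normalize, reduce to w PAA segments, then map each segment to one of alpha letters."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()                          # Step 1: normalization
    segments = z.reshape(w, len(z) // w).mean(axis=1)     # Step 2: PAA (assumes w divides N)
    breakpoints = norm.ppf(np.arange(1, alpha) / alpha)   # z_{1/alpha}, ..., z_{(alpha-1)/alpha}
    indices = np.searchsorted(breakpoints, segments, side='right')
    return [chr(ord('a') + k) for k in indices]           # Step 3: symbols l_1, ..., l_alpha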

Representing Time Series Based on the Lebesgue Integral

To capture the distribution of the values of a time series, analogous to how the Lebesgue integral is computed, we consider the value distribution and approximate the series accordingly. A natural tool for describing a value distribution is entropy.

Entropy

Generally speaking, entropy is a good indicator of determinism versus randomness. In a discrete space, the entropy of a system can be written as:

\text{entropy}(X) = -\sum_{i=1}^{\infty}P\{x=x_{i}\}\ln(P\{x=x_{i}\}).

The larger the entropy of a system, the more disordered the system; the smaller the entropy, the more deterministic it is.

Among the entropy features of time series, one classic indicator is the binned entropy.

Binned Entropy

Starting from the definition of entropy, we can bucket the values of the time series: for example, split the interval [min, max] into ten equal bins, so that the values of the series are distributed across these ten buckets. From this equal-width binning we compute the entropy of the resulting distribution. The binned entropy is defined as:

\text{binned entropy}(X) = -\sum_{k=0}^{\min(maxbin, len(X))} p_{k}\ln(p_{k})\cdot 1_{(p_{k}>0)},

where p_{k} is the proportion (probability) of values of the series X falling into the k-th bin, maxbin is the number of bins, and len(X) is the length of the series X.

A large binned entropy means the values of the series are spread fairly evenly over [min, max]; a small binned entropy means the values are concentrated in a narrow range.
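A minimal sketch of binned entropy in Python (the function name binned_entropy and the use of NumPy's histogram are illustrative choices):

import numpy as np

def binned_entropy(x, max_bins=10):
    """Entropy of the distribution of x over equal-width bins spanning [min(x), max(x)]."""
    counts, _ = np.histogram(x, bins=max_bins)
    p = counts / len(x)
    p = p[p > 0]                     # keep only bins with p_k > 0, as in the indicator
    return -np.sum(p * np.log(p))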

Summary

Starting from the Riemann and Lebesgue integrals, this post introduced their basic concepts, properties, and relationship, and then used the two integrals to discuss piecewise features and entropy features of time series. Future posts will cover more topics related to time series.

 

Similarity of Time Series

In text mining, one can judge whether two words are synonyms using the vectors produced by Word2Vec and their inner products, or using semantics and part of speech. In time-series mining we can likewise find ways to describe whether two time series are similar.

Before introducing distances between time series, it is worth recalling the definitions of metric spaces and inner product spaces from mathematics.

Metric Spaces

In mathematics, a distance function on a set M is a map d: M\times M\rightarrow \mathbb{R}, where \mathbb{R} denotes the real numbers, satisfying the following conditions:

  1. d(x,y)\geq 0, and d(x,y)=0 if and only if x=y;
  2. d(x,y)=d(y,x), i.e. symmetry;
  3. d(x,z)\leq d(x,y)+d(y,z), i.e. the triangle inequality.

A distance function satisfying these three conditions is called a metric, and a set equipped with a metric is called a metric space.

Remark.

The various time-series distances defined below do not necessarily satisfy the triangle inequality; for example, dynamic time warping (DTW) does not.

Inner Product Spaces

An inner product space is a vector space V over a field F (with F=\mathbb{R} or F=\mathbb{C}) together with an inner product: a positive-definite, non-degenerate conjugate-bilinear form <\cdot,\cdot>:V\times V\rightarrow F satisfying the following:

1. Conjugate symmetry: for all x,y\in V, <x,y> =\overline{<y,x>}.

2. Conjugate bilinearity:

\forall a\in F, \forall x,y\in V, <ax,y>=a<x,y>,

\forall x,y,z\in V, <x+y,z> = <x,z> + <y,z>,

\forall b\in F, \forall x,y\in V, <x,by> = \overline{b}<x,y>,

\forall x,y,z\in V, <x,y+z> = <x,y>+<x,z>.

3. Nonnegativity: \forall x\in V, <x,x>\geq 0.

4. Non-degeneracy: the map from V to the dual space V^{*} given by x\mapsto<x,\cdot> is an isomorphism.

 

Similarity Based on Euclidean Distance

Let the two time series be X_{T} = \{x_{1},\cdots,x_{T}\} and Y_{T} = \{y_{1},\cdots,y_{T}\}. The L^{1}, L^{2}, L^{p}, and L^{\infty} norms of Euclidean space then give distances between the two series:

d_{L^{1}}(X_{T},Y_{T}) = \sum_{t=1}^{T}|x_{t}-y_{t}|,

d_{L^{p}}(X_{T}, Y_{T}) = (\sum_{t=1}^{T}|x_{t}-y_{t}|^{p})^{1/p},

d_{L^{2}}(X_{T}, Y_{T}) = (\sum_{t=1}^{T}|x_{t}-y_{t}|^{2})^{1/2},

d_{L^{\infty}}(X_{T},Y_{T}) = \max_{1\leq t\leq T}|x_{t}-y_{t}|.

 

Similarity Based on Correlation

Pearson Coefficient

\text{COR}(X_{T},Y_{T}) = \frac{\sum_{t=1}^{T}(x_{t}-\overline{X}_{T})\cdot(y_{t}-\overline{Y}_{T})}{\sqrt{\sum_{t=1}^{T}(x_{t}-\overline{X}_{T})^{2}}\cdot\sqrt{\sum_{t=1}^{T}(y_{t}-\overline{Y}_{T})^{2}}},

where

\overline{X}_{T} = \sum_{t=1}^{T}x_{t}/T, \quad \overline{Y}_{T} = \sum_{t=1}^{T}y_{t}/T.

The Pearson coefficient has the following properties:

  1. If the two series satisfy X_{T} = Y_{T}, then \text{COR}(X_{T},Y_{T}) =1, indicating perfect agreement; if X_{T} = -Y_{T}, then \text{COR}(X_{T},Y_{T}) = -1, indicating negative correlation.
  2. -1\leq \text{COR}(X_{T},Y_{T})\leq 1.

Distances between two time series can be built from the Pearson coefficient:

d_{COR,1}(X_{T},Y_{T}) = \sqrt{2\cdot(1-COR(X_{T},Y_{T}))},

d_{COR,2}(X_{T},Y_{T}) = \sqrt{\big(\frac{1-COR(X_{T},Y_{T})}{1+COR(X_{T},Y_{T})}\big)^{\beta}},

where \beta\geq 0.
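A small sketch of the two Pearson-based distances in Python (the function name pearson_distances is illustrative):

import numpy as np

def pearson_distances(x, y, beta=1.0):
    """d_{COR,1} and d_{COR,2} as defined above."""
    cor = np.corrcoef(x, y)[0, 1]
    d1 = np.sqrt(2.0 * (1.0 - cor))
    d2 = np.sqrt(((1.0 - cor) / (1.0 + cor)) ** beta)
    return d1, d2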

The First Order Temporal Correlation Coefficient

This coefficient is similar to the Pearson coefficient but slightly different; it is defined as:

\text{CORT}(X_{T},Y_{T}) = \frac{\sum_{t=1}^{T-1}(x_{t+1}-x_{t})\cdot(y_{t+1}-y_{t})}{\sqrt{\sum_{t=1}^{T-1}(x_{t+1}-x_{t})^{2}}\cdot\sqrt{\sum_{t=1}^{T-1}(y_{t+1}-y_{t})^{2}}},

Properties of \text{CORT}(X_{T},Y_{T}):

  1. -1\leq \text{CORT}(X_{T},Y_{T}) \leq 1;
  2. \text{CORT}(X_{T},Y_{T}) =1 means the two series share a similar trend: they rise and fall together, with comparable magnitudes;
  3. \text{CORT}(X_{T},Y_{T})=-1 means their rising and falling trends are exactly opposite;
  4. \text{CORT}(X_{T},Y_{T})=0 means the two series are uncorrelated in terms of monotonicity.

A time-series distance can likewise be defined from CORT:

d_{CORT}(X_{T},Y_{T}) = \phi_{k}[CORT(X_{T},Y_{T})]\cdot d(X_{T},Y_{T}),

where d(X_{T},Y_{T}) can be computed with any of d_{L^{1}}, d_{L^{p}}, d_{L^{2}}, d_{L^{\infty}}, d_{DTW}, d_{Frechet}, and

\phi_{k}(u) = 2/(1+\exp(ku)), k\geq 0

is a decreasing function.
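A minimal sketch of the CORT-weighted distance in Python, using the L^2 distance as the base distance d (any of the distances above could be substituted; the function names are illustrative):

import numpy as np

def cort(x, y):
    """First-order temporal correlation coefficient CORT."""
    dx, dy = np.diff(x), np.diff(y)
    return np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2)))

def d_cort(x, y, k=2.0):
    """d_CORT = phi_k(CORT(x, y)) * d(x, y), with phi_k(u) = 2 / (1 + exp(k * u))."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    base = np.sqrt(np.sum((x - y) ** 2))       # L^2 base distance
    phi = 2.0 / (1.0 + np.exp(k * cort(x, y)))
    return phi * base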

Autocorrelation-based Distance

Let X_{T} = \{x_{1},\cdots,x_{T}\} be a time series. For any k<T, the autocorrelation coefficient is defined as:

\hat{\rho}_{k} = \frac{1}{(T-k)\sigma^{2}}\sum_{t=1}^{T-k}(x_{t}-\mu)\cdot(x_{t+k}-\mu),

where \mu and \sigma^{2} are the mean and variance of the series. This formula amounts to computing the similarity (Pearson coefficient) of two subsequences of X_{T}=\{x_{1},\cdots,x_{T}\}, namely \{x_{1},\cdots,x_{T-k}\} and \{x_{k+1},\cdots,x_{T}\}.

Thus, given a positive integer L<T, each time series yields a vector of autocorrelation coefficients:

\hat{\rho}_{X_{T}} = (\hat{\rho}_{1,X_{T}},\cdots,\hat{\rho}_{L,X_{T}})^{T}\in \mathbb{R}^{L},

\hat{\rho}_{Y_{T}} = (\hat{\rho}_{1,Y_{T}},\cdots,\hat{\rho}_{L,Y_{T}})^{T}\in\mathbb{R}^{L}.

For i>L, set \hat{\rho}_{i,X_{T}} = 0 and \hat{\rho}_{i,Y_{T}} = 0. A distance between the time series can then be defined as:

d_{ACF}(X_{T},Y_{T}) = \sqrt{(\hat{\rho}_{X_{T}}-\hat{\rho}_{Y_{T}})^{T}\Omega(\hat{\rho}_{X_{T}}-\hat{\rho}_{Y_{T}})}.

where \Omega is an L\times L matrix. There are many possible choices, for example:

(1) \Omega = I_{L}, the identity matrix, which gives

d_{ACFU}(X_{T},Y_{T}) =\sqrt{\sum_{i=1}^{L}(\hat{\rho}_{i,X_{T}}-\hat{\rho}_{i,Y_{T}})^{2}}.

(2) \Omega = diag\{p(1-p),p(1-p)^{2},\cdots,p(1-p)^{L}\}, an L\times L diagonal matrix with 0<p<1, which amounts to a weighted sum with geometrically decaying weights:

d_{ACFG}(X_{T},Y_{T}) =\sqrt{\sum_{i=1}^{L}p(1-p)^{i}(\hat{\rho}_{i,X_{T}}-\hat{\rho}_{i,Y_{T}})^{2}}.

Besides the autocorrelation coefficients (ACFs), one can also use partial autocorrelation coefficients (PACFs) in their place; the same construction then yields the distances d_{PACFU} and d_{PACFG}.
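A minimal sketch of the autocorrelation-based distance in Python (the function names are illustrative; omega defaults to the identity matrix, i.e. d_ACFU):

import numpy as np

def acf_vector(x, L):
    """The first L autocorrelation coefficients of x, as defined above."""
    x = np.asarray(x, dtype=float)
    mu, var, T = x.mean(), x.var(), len(x)
    return np.array([np.sum((x[:T - k] - mu) * (x[k:] - mu)) / ((T - k) * var)
                     for k in range(1, L + 1)])

def d_acf(x, y, L=10, omega=None):
    """Quadratic-form distance between the autocorrelation vectors of x and y."""
    diff = acf_vector(x, L) - acf_vector(y, L)
    if omega is None:
        omega = np.eye(L)
    return np.sqrt(diff @ omega @ diff)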

 

Similarity Based on Periodicity

Here we introduce periodogram-based distances. The general idea is to obtain a set of coefficients via the Fourier transform and use them to reflect the distance between the two original time series. In formulas:

I_{X_{T}}(\lambda_{k}) = T^{-1}|\sum_{t=1}^{T}x_{t}e^{-i\lambda_{k}t}|^{2},

I_{Y_{T}}(\lambda_{k}) = T^{-1}|\sum_{t=1}^{T}y_{t}e^{-i\lambda_{k}t}|^{2}.

where \lambda_{k} = 2\pi k/T, k = 1,\cdots,n, and n=[(T-1)/2], with [\cdot] denoting the floor (Gauss bracket) function.

(1) Distance from the raw periodogram:

d_{P}(X_{T},Y_{T}) = \frac{1}{n}\sqrt{\sum_{k=1}^{n}(I_{X_{T}}(\lambda_{k})-I_{Y_{T}}(\lambda_{k}))^{2}}.

(2) Distance from the normalized periodogram:

d_{P}(X_{T},Y_{T}) = \frac{1}{n}\sqrt{\sum_{k=1}^{n}(NI_{X_{T}}(\lambda_{k})-NI_{Y_{T}}(\lambda_{k}))^{2}},

where NI_{X_{T}}(\lambda_{k})=I_{X_{T}}(\lambda_{k})/\hat{\gamma}_{0,X_{T}} and NI_{Y_{T}}(\lambda_{k})=I_{Y_{T}}(\lambda_{k})/\hat{\gamma}_{0,Y_{T}}, with \hat{\gamma}_{0,X_{T}} and \hat{\gamma}_{0,Y_{T}} the sample variances of X_{T} and Y_{T}.

(3) Distance from the log-normalized periodogram:

d_{LNP}(X_{T},Y_{T}) = \frac{1}{n}\sqrt{\sum_{k=1}^{n}(\ln NI_{X_{T}}(\lambda_{k})-\ln NI_{Y_{T}}(\lambda_{k}))^{2}}.
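A sketch of the periodogram-based distances using NumPy's FFT (the function names are illustrative; the log version assumes no periodogram ordinate is zero):

import numpy as np

def periodogram(x):
    """I(lambda_k) at the Fourier frequencies lambda_k = 2*pi*k/T, k = 1, ..., [(T-1)/2]."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    n = (T - 1) // 2
    k = np.arange(1, n + 1)
    return np.abs(np.fft.fft(x)[k]) ** 2 / T

def d_lnp(x, y):
    """Distance between the log-normalized periodograms of x and y."""
    nix = periodogram(x) / np.var(x)
    niy = periodogram(y) / np.var(y)
    n = len(nix)
    return np.sqrt(np.sum((np.log(nix) - np.log(niy)) ** 2)) / n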

 

 

Model-based Similarity

Piccolo Distance

Model-based similarity essentially fits a model with a set of parameters to each time series, obtains the optimal parameters, and then computes the Euclidean distance between the optimal parameter vectors of the two series.

An ARMA(p,q) model has its own AR representation, yielding a corresponding sequence of parameters (\pi_{1},\pi_{2},\cdots), so each time series can be approximated by an optimal set of such parameters. If

\hat{\Pi}_{X_{T}}=(\hat{\pi}_{1,X_{T}},\cdots,\hat{\pi}_{k_{1},X_{T}})^{T},

\hat{\Pi}_{Y_{T}}=(\hat{\pi}_{1,Y_{T}},\cdots,\hat{\pi}_{k_{2},Y_{T}})^{T}

denote the parameter estimates of AR(k_{1}) and AR(k_{2}) fitted to the time series X_{T} and Y_{T}, then the Piccolo distance is:

d_{PIC}(X_{T},Y_{T}) =\sqrt{\sum_{j=1}^{k}(\hat{\pi}_{j,X_{T}}'-\hat{\pi}_{j,Y_{T}}')^{2}},

where k=\max(k_{1},k_{2}), \hat{\pi}_{j,X_{T}}'=\hat{\pi}_{j,X_{T}} for j\leq k_{1} and \hat{\pi}_{j,X_{T}}' = 0 for k_{1}<j\leq k, while \hat{\pi}_{j,Y_{T}}'=\hat{\pi}_{j,Y_{T}} for j\leq k_{2} and \hat{\pi}_{j,Y_{T}}' = 0 for k_{2}<j\leq k.

Maharaj Distance

Following the same setup, the Piccolo distance can be modified by inserting a matrix:

d_{MAH}(X_{T},Y_{T}) =\sqrt{T}(\hat{\Pi}'_{X_{T}}-\hat{\Pi}'_{Y_{T}})^{T}\hat{V}^{-1}(\hat{\Pi}'_{X_{T}}-\hat{\Pi}'_{Y_{T}}).

where \hat{\Pi}'_{X_{T}} and \hat{\Pi}'_{Y_{T}} are the AR(k) parameter estimates for X_{T} and Y_{T}, as in the Piccolo distance, \hat{V} = \sigma_{X_{T}}^{2}R_{X_{T}}^{-1}(k) + \sigma_{Y_{T}}^{2}R_{Y_{T}}^{-1}(k), \sigma_{X_{T}}^{2} and \sigma_{Y_{T}}^{2} are the variances of the series, and R_{X_{T}} and R_{Y_{T}} are their sample covariance matrices.

Cepstral-based Distance

Suppose the time series X_{T} follows an AR(p) structure, i.e. X_{t}=\sum_{r=1}^{p}\phi_{r}X_{t-r}+\epsilon_{t}, where the \phi_{r} are the AR parameters and \epsilon_{t} is white noise (Gaussian with mean 0 and variance 1). The LPC cepstral coefficients are then defined from these parameters as:

\psi_{1}=\phi_{1}

\psi_{h}=\phi_{h}+\sum_{m=1}^{h-1}(1-\frac{m}{h})\phi_{m}\psi_{h-m}, \quad 1<h\leq p,

\psi_{h}=\sum_{m=1}^{p}(1-\frac{m}{h})\phi_{m}\psi_{h-m}, \quad p<h.

The LPC cepstral distance is then defined as:

d_{LPC, Cep}(X_{T},Y_{T}) =\sqrt{\sum_{i=1}^{T}(\psi_{i,X_{T}}-\psi_{i,Y_{T}})^{2}}.

 

Summary

This post introduced methods for computing distances between time series: distances based on L^{p} norms, on correlation, on the periodogram, and on models.

 

 

Representation and Information Extraction of Time Series

When people think of a time series, what comes to mind is a string of numbers ordered by time; but behind those numbers lies particular meaning, so how to represent a time series (Representation) and how to extract information from it (Information Extraction) are key questions in time-series research.

In my own experience, many ideas in time series are very similar to those in text mining. Sentences are composed of all kinds of words, usually in a subject-verb-object structure, and people wanted to describe each word with a mathematical vector. The classic approach is one-hot encoding: each word in the dictionary is assigned a unique index such as 0, 1, 2, and the encoding is a row vector whose length equals the number of words in the dictionary, with a single entry equal to 1 and all other entries 0, i.e. of the form (0,…,0,1,0,…,0). Since vocabularies are usually very large, how to represent a word with a lower-dimensional vector became a key problem. A few years ago, Google open-sourced the Word2vec framework, which describes each word with a vector whose length can be tuned, roughly 100 to 1000, converting the original one-hot encoding into a vector in a low-dimensional space. With this, many classic machine learning algorithms, including classification, regression, and clustering, found wide use on text. Word2vec uses neural-network ideas to capture the relation between each word and its surrounding words, thereby representing each word with a low-dimensional vector. The feature extraction methods for time series presented here differ somewhat from word2vec; the techniques are shown one by one below.

 

Statistical Features of Time Series

The statistical features of a time series that come to mind first are the maximum, minimum, mean, median, variance, and standard deviation; statistics textbooks usually add two more indicators, skewness and kurtosis. Writing X_{T} = \{x_{1},\cdots,x_{T}\} for a time series of length T, these statistics are:

\mu=\frac{1}{T}\sum_{i=1}^{T}x_{i},

\sigma^{2} = \frac{1}{T}\sum_{i=1}^{T}(x_{i}-\mu)^{2},

\text{skewness}(X) = E[(\frac{X-\mu}{\sigma})^{3}]=\frac{1}{T}\sum_{i=1}^{T}\frac{(x_{i}-\mu)^{3}}{\sigma^{3}},

\text{kurtosis}(X) = E[(\frac{X-\mu}{\sigma})^{4}]=\frac{1}{T}\sum_{i=1}^{T}\frac{(x_{i}-\mu)^{4}}{\sigma^{4}} .

where \mu and \sigma denote the mean and standard deviation of the series X_{T}.
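These statistics are straightforward to compute with NumPy; a small sketch on a toy series (the series itself is made up for illustration):

import numpy as np

x = np.random.randn(1000)            # a toy series
mu, sigma = x.mean(), x.std()
features = {
    'max': x.max(),
    'min': x.min(),
    'mean': mu,
    'median': np.median(x),
    'variance': x.var(),
    'skewness': np.mean(((x - mu) / sigma) ** 3),
    'kurtosis': np.mean(((x - mu) / sigma) ** 4),
}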

 

Entropy Features of Time Series

Why study the entropy of a time series? Consider the following two series:

Time series (1): (1,2,1,2,1,2,1,2,1,2,…)

Time series (2): (1,1,2,1,2,2,2,2,1,1,…)

In series (1), the values 1 and 2 alternate, whereas in series (2) they appear at random. Series (1) is thus more deterministic and series (2) more random. Yet the statistical features of the two series, such as mean, variance, and median, are nearly identical, which shows that the statistical features above cannot accurately distinguish these two kinds of series.

Generally speaking, entropy is a good indicator of determinism versus randomness. In a discrete space, the entropy of a system can be written as:

\text{entropy}(X) = -\sum_{i=1}^{\infty}P\{x=x_{i}\}\ln(P\{x=x_{i}\}).

The larger the entropy of a system, the more disordered the system; the smaller the entropy, the more deterministic it is.

Among entropy features for time series, the classic examples are binned entropy, approximate entropy, and sample entropy, introduced one by one below.

Binned Entropy

Starting from the definition of entropy, we can bucket the values of the series X_{T}: for example, split [\min(X_{T}), \max(X_{T})] into ten equal bins, so that the values of the series are distributed across these ten buckets. From this equal-width binning we compute the entropy of the resulting distribution. The binned entropy is defined as:

\text{binned entropy}(X) = -\sum_{k=0}^{\min(maxbin, len(X))} p_{k}\ln(p_{k})\cdot 1_{(p_{k}>0)},

where p_{k} is the proportion (probability) of values of X_{T} falling into the k-th bin, maxbin is the number of bins, and len(X_{T}) = T is the length of X_{T}.

A large binned entropy means the values of the series are spread fairly evenly over [\min(X_{T}), \max(X_{T})]; a small binned entropy means the values are concentrated in a narrow range.

Approximate Entropy

Returning to the question of this section: how can we tell whether a time series exhibits some regularity or is just random? This is where Approximate Entropy comes in. Its idea is to lift the one-dimensional time series into a higher-dimensional space and, by judging the distances or similarities between vectors in that space, infer whether the one-dimensional series has some trend or determinism. Suppose the time series X_{N}: \{x_{1},\cdots, x_{N}\} has length N; Approximate Entropy takes two parameters, m and r. The algorithm proceeds as follows.

Step 1. Fix a positive integer m and a positive number r: m is the length of the extracted segments, and r is a tolerance parameter for distances between segments. Construct new m-dimensional vectors as follows:

X_{1}(m) = (x_{1},\cdots, x_{m})\in\mathbb{R}^{m},

X_{i}(m) = (x_{i},\cdots, x_{m+i-1})\in\mathbb{R}^{m},

X_{N-m+1}(m) = (x_{N-m+1},\cdots, x_{N})\in\mathbb{R}^{m}.

Step 2. Using the new vectors X_{1}(m),\cdots, X_{N-m+1}(m), count how many vectors are close to X_{i}(m), i.e.

C_{i}^{m}(r) = (\text{number of }X_{j}(m)\text{ such that } d(X_{i}(m), X_{j}(m))\leq r)/(N-m+1),

where the distance d can be the L^{1}, L^{2}, L^{p}, or L^{\infty} norm; in this setting d is usually taken to be the L^{\infty} norm.

Step 3. Consider the function

\Phi^{m}(r) = (N-m+1)^{-1}\cdot \sum_{i=1}^{N-m+1} \ln(C_{i}^{m}(r)).

Step 4. The Approximate Entropy is then defined as:

\text{ApEn}(m,r) = \Phi^{m}(r)-\Phi^{m+1}(r).

Remark.

  1. The positive integer m is typically 2 or 3; r>0 is tuned for the specific time series;
  2. if a series contains many repetitive patterns or self-similarity patterns, its Approximate Entropy is relatively small; conversely, if a series is essentially random, its Approximate Entropy is relatively large.
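A minimal sketch of Approximate Entropy following the four steps above, with the L-infinity metric (the function name approximate_entropy is illustrative; self-matches are counted, and r is used directly as the tolerance):

import numpy as np

def approximate_entropy(x, m=2, r=0.2):
    x = np.asarray(x, dtype=float)

    def phi(mm):
        n_vec = len(x) - mm + 1
        X = np.array([x[i:i + mm] for i in range(n_vec)])   # the vectors X_i(mm)
        # C_i^{mm}(r): fraction of vectors within L-infinity distance r of X_i(mm)
        C = [np.mean(np.max(np.abs(X - X[i]), axis=1) <= r) for i in range(n_vec)]
        return np.mean(np.log(C))                           # Phi^{mm}(r)

    return phi(m) - phi(m + 1)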

Sample Entropy

Besides Approximate Entropy, another entropy indicator for time series is Sample Entropy, which uses a natural logarithm to express whether a series exhibits self-similarity.

Following the definitions used for Approximate Entropy, two quantities A and B can be defined from m and r:

A = \#\{\text{vector pairs having } d(X_{i}(m+1),X_{j}(m+1))<r \text{ of length } m+1  \},

B = \#\{ \text{vector pairs having } d(X_{i}(m), X_{j}(m))<r \text{ of length } m\}.

where \# denotes the number of elements of a set. From the definition of the metric d (whether L^{1}, L^{2}, or L^{\infty}) it follows that A\leq B, so the Sample Entropy is always nonnegative, i.e.

\text{SampEn} = -\ln(A/B) \geq 0.

Remark.

  1. Sample Entropy is always nonnegative;
  2. the smaller the Sample Entropy, the stronger the self-similarity of the series;
  3. a common parameter choice for Sample Entropy is m = 2, r = 0.2 \cdot std.
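A minimal sketch of Sample Entropy under the definition above, counting vector pairs (i, j) with j > i under the L-infinity metric (the function name sample_entropy is illustrative; the rule of thumb r = 0.2 * std is used as the default):

import numpy as np

def sample_entropy(x, m=2, r=None):
    x = np.asarray(x, dtype=float)
    if r is None:
        r = 0.2 * x.std()

    def count_pairs(mm):
        X = np.array([x[i:i + mm] for i in range(len(x) - mm + 1)])
        total = 0
        for i in range(len(X) - 1):
            dists = np.max(np.abs(X[i + 1:] - X[i]), axis=1)   # pairs (i, j), j > i
            total += np.sum(dists < r)
        return total

    A, B = count_pairs(m + 1), count_pairs(m)   # A <= B, so the result is nonnegative
    return -np.log(A / B)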

 

Piecewise Features of Time Series

Even when two time series each have some self-similarity, does that mean the two series are entirely alike? The answer is no. For example, take two time series, both of length 1000:

Time series (1): [1,2] * 500

Time series (2): [1,2,3,4,5,6,7,8,9,10] * 100

Here series (1) cycles through 1 and 2 while series (2) cycles through 1 to 10. As curves they look completely different, yet the Approximate Entropy and Sample Entropy of both are very small. The question, then, is whether we can extract information that expresses their differences. The answer is yes.

First, recall the definitions of the Riemann and Lebesgue integrals and how they differ. As the figure below shows, the Riemann integral computes the area under a curve by splitting the horizontal axis into small intervals and summing rectangles, while the Lebesgue integral does the opposite: it splits the vertical axis into small intervals and likewise sums rectangles.

[Figure: Riemann integration vs. Lebesgue integration]

The earlier Binned Entropy scheme partitions the value range, much like the Lebesgue integral. We can instead follow the Riemann integral and cut the series into segments along the time axis, representing each segment with some simple function (such as a linear function). This gives the following methods:

  1. Piecewise Linear Approximation
  2. Piecewise Aggregate Approximation
  3. Piecewise Constant Approximation

The essential idea behind all of these algorithms is dimensionality reduction: representing the original time series with only a small amount of data. In mathematical terms, data reduction for a time series means representing the original series \{x_{1},\cdots,x_{N}\} by \{x_{1}^{'},\cdots, x_{D}^{'}\} with D<N; the latter is then a representation of the original series.

Piecewise Aggregate Approximation (PAA): Analogous to the Riemann Integral

Among these algorithms, Piecewise Aggregate Approximation (PAA) is a classic. Suppose the original time series is C = \{x_{1},\cdots, x_{N}\}; the PAA series is defined as \overline{C} = \{\overline{x}_{1},\cdots,\overline{x}_{w}\},

where

\overline{x}_{i} = \frac{w}{N} \cdot \sum_{j=\frac{N}{w}(i-1)+1}^{\frac{N}{w}i} x_{j}.

for 1\leq i\leq w. Pictorially:

[Figure: PAA segment means overlaid on the original series]

For Piecewise Linear Approximation and Piecewise Constant Approximation, one only needs to modify the definition of \overline{x}_{i} slightly.

Symbolic Approximation: Analogous to Computing the Lebesgue Integral with Simple Functions

In feature engineering for recommender systems, features are routinely normalized, binarized, or discretized. A user's age, for instance, is usually not used as a raw date of birth but bucketed into ranges such as 0-6 (infancy), 7-12 (primary school), 13-17 (secondary school), 18-22 (university), and so on.

After computing piecewise features, those features are in some sense still continuous values. Can continuous values be mapped to a set of discrete ones? Researchers proposed using symbols to represent the key features of a time series, the so-called Symbolic Representation. Below we introduce the classic SAX representation.

If we wish to represent the time series with \alpha symbols, say \{l_{1},\cdots,l_{\alpha}\}, consider the standard normal distribution N(0,1) and let \{z_{1/\alpha},\cdots,z_{(\alpha-1)/\alpha}\} be the breakpoints dividing the area under the Gaussian curve into \alpha equal parts.

The SAX procedure is as follows:

Step 1. Normalization: map the series to one with mean zero and variance one.

Step 2. Piecewise representation (PAA): \{x_{1},\cdots, x_{N}\} \Rightarrow \{\overline{x}_{1},\cdots,\overline{x}_{w}\}.

Step 3. Symbolic representation (SAX): if \overline{x}_{i}<z_{1/\alpha}, then \hat{X}_{i}=l_{1}; if z_{(j-1)/\alpha}\leq \overline{x}_{i}<z_{j/\alpha}, then \hat{X}_{i} = l_{j}; if \overline{x}_{i}\geq z_{(\alpha-1)/\alpha}, then \hat{X}_{i} = l_{\alpha}.

In this way the original series can be represented with the \alpha letters \{l_{1},\cdots,l_{\alpha}\}.

[Figure: SAX symbols assigned to PAA segments]

 

Summary

This post introduced several representation methods for time series, including statistical features, entropy features, and piecewise features. The next post will cover similarity computation for time series.

 

A Guide to Time Series Forecasting with ARIMA in Python 3

https://www.digitalocean.com/community/tutorials/a-guide-to-time-series-forecasting-with-arima-in-python-3#step-2-—-importing-packages-and-loading-data

 

Introduction

Time series provide the opportunity to forecast future values. Based on previous values, time series can be used to forecast trends in economics, weather, and capacity planning, to name a few. The specific properties of time-series data mean that specialized statistical methods are usually required.

In this tutorial, we will aim to produce reliable forecasts of time series. We will begin by introducing and discussing the concepts of autocorrelation, stationarity, and seasonality, and proceed to apply one of the most commonly used methods for time-series forecasting, known as ARIMA.

One of the methods available in Python to model and predict future points of a time series is known as SARIMAX, which stands for Seasonal AutoRegressive Integrated Moving Averages with eXogenous regressors. Here, we will primarily focus on the ARIMA component, which is used to fit time-series data to better understand and forecast future points in the time series.

Prerequisites

This guide will cover how to do time-series analysis on either a local desktop or a remote server. Working with large datasets can be memory intensive, so in either case, the computer will need at least 2GB of memory to perform some of the calculations in this guide.

To make the most of this tutorial, some familiarity with time series and statistics can be helpful.

For this tutorial, we’ll be using Jupyter Notebook to work with the data. If you do not have it already, you should follow our tutorial to install and set up Jupyter Notebook for Python 3.

Step 1 — Installing Packages

To set up our environment for time-series forecasting, let’s first move into our local programming environment or server-based programming environment:

  • cd environments
  • . my_env/bin/activate

From here, let’s create a new directory for our project. We will call it ARIMA and then move into the directory. If you call the project a different name, be sure to substitute your name for ARIMA throughout the guide

  • mkdir ARIMA
  • cd ARIMA

This tutorial will require the warnings, itertools, pandas, numpy, matplotlib and statsmodels libraries. The warnings and itertools libraries come included with the standard Python library set so you shouldn't need to install them.

Like with other Python packages, we can install these requirements with pip.
We can now install pandas, statsmodels, and the data plotting package matplotlib. Their dependencies will also be installed:

  • pip install pandas numpy statsmodels matplotlib

At this point, we’re now set up to start working with the installed packages.

Step 2 — Importing Packages and Loading Data

To begin working with our data, we will start up Jupyter Notebook:

  • jupyter notebook

To create a new notebook file, select New > Python 3 from the top right pull-down menu:

[Figure: Create a new Python 3 notebook]

This will open a notebook.

As is best practice, start by importing the libraries you will need at the top of your notebook:

import warnings
import itertools
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

We have also defined a matplotlib style of fivethirtyeight for our plots.

We’ll be working with a dataset called “Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory, Hawaii, U.S.A.,” which collected CO2 samples from March 1958 to December 2001. We can bring in this data as follows:

data = sm.datasets.co2.load_pandas()
y = data.data

Let’s preprocess our data a little bit before moving forward. Weekly data can be tricky to work with since it’s a briefer amount of time, so let’s use monthly averages instead. We’ll make the conversion with the resample function. For simplicity, we can also use the fillna() function to ensure that we have no missing values in our time series.

# The 'MS' string groups the data in buckets by start of the month
y = y['co2'].resample('MS').mean()

# The term 'bfill' means we back-fill: missing values are filled with the next valid observation
y = y.fillna(y.bfill())

print(y)
Output
co2
1958-03-01  316.100000
1958-04-01  317.200000
1958-05-01  317.433333
...
2001-11-01  369.375000
2001-12-01  371.020000

Let’s explore this time series e as a data visualization:

y.plot(figsize=(15, 6))
plt.show()

Figure 1: CO2 Levels Time Series

Some distinguishable patterns appear when we plot the data. The time series has an obvious seasonality pattern, as well as an overall increasing trend.

To learn more about time series pre-processing, please refer to “A Guide to Time Series Visualization with Python 3,” where the steps above are described in much more detail.

Now that we’ve converted and explored our data, let’s move on to time series forecasting with ARIMA.

Step 3 — The ARIMA Time Series Model

One of the most common methods used in time series forecasting is known as the ARIMA model, which stands for AutoRegressive Integrated Moving Average. ARIMA is a model that can be fitted to time series data in order to better understand or predict future points in the series.

There are three distinct integers (p, d, q) that are used to parametrize ARIMA models. Because of that, ARIMA models are denoted with the notation ARIMA(p, d, q). Together these three parameters account for seasonality, trend, and noise in datasets:

  • p is the auto-regressive part of the model. It allows us to incorporate the effect of past values into our model. Intuitively, this would be similar to stating that it is likely to be warm tomorrow if it has been warm the past 3 days.
  • d is the integrated part of the model. This includes terms in the model that incorporate the amount of differencing (i.e. the number of past time points to subtract from the current value) to apply to the time series. Intuitively, this would be similar to stating that it is likely to be same temperature tomorrow if the difference in temperature in the last three days has been very small.
  • q is the moving average part of the model. This allows us to set the error of our model as a linear combination of the error values observed at previous time points in the past.

When dealing with seasonal effects, we make use of the seasonal ARIMA, which is denoted as ARIMA(p,d,q)(P,D,Q)s. Here, (p, d, q) are the non-seasonal parameters described above, while (P, D, Q) follow the same definition but are applied to the seasonal component of the time series. The term s is the periodicity of the time series (4 for quarterly periods, 12 for yearly periods, etc.).

The seasonal ARIMA method can appear daunting because of the multiple tuning parameters involved. In the next section, we will describe how to automate the process of identifying the optimal set of parameters for the seasonal ARIMA time series model.

Step 4 — Parameter Selection for the ARIMA Time Series Model

When looking to fit time series data with a seasonal ARIMA model, our first goal is to find the values of ARIMA(p,d,q)(P,D,Q)s that optimize a metric of interest. There are many guidelines and best practices to achieve this goal, yet the correct parametrization of ARIMA models can be a painstaking manual process that requires domain expertise and time. Other statistical programming languages such as R provide automated ways to solve this issue, but those have yet to be ported over to Python. In this section, we will resolve this issue by writing Python code to programmatically select the optimal parameter values for our ARIMA(p,d,q)(P,D,Q)s time series model.

We will use a “grid search” to iteratively explore different combinations of parameters. For each combination of parameters, we fit a new seasonal ARIMA model with the SARIMAX() function from the statsmodels module and assess its overall quality. Once we have explored the entire landscape of parameters, our optimal set of parameters will be the one that yields the best performance for our criteria of interest. Let’s begin by generating the various combination of parameters that we wish to assess:

# Define the p, d and q parameters to take any value between 0 and 2
p = d = q = range(0, 2)

# Generate all different combinations of p, d and q triplets
pdq = list(itertools.product(p, d, q))

# Generate all different combinations of seasonal p, d and q triplets
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]

print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))
Output
Examples of parameter combinations for Seasonal ARIMA...
SARIMAX: (0, 0, 1) x (0, 0, 1, 12)
SARIMAX: (0, 0, 1) x (0, 1, 0, 12)
SARIMAX: (0, 1, 0) x (0, 1, 1, 12)
SARIMAX: (0, 1, 0) x (1, 0, 0, 12)

We can now use the triplets of parameters defined above to automate the process of training and evaluating ARIMA models on different combinations. In Statistics and Machine Learning, this process is known as grid search (or hyperparameter optimization) for model selection.

When evaluating and comparing statistical models fitted with different parameters, each can be ranked against one another based on how well it fits the data or its ability to accurately predict future data points. We will use the AIC (Akaike Information Criterion) value, which is conveniently returned with ARIMA models fitted using statsmodels. The AIC measures how well a model fits the data while taking into account the overall complexity of the model. A model that fits the data very well while using lots of features will be assigned a larger AIC score than a model that uses fewer features to achieve the same goodness-of-fit. Therefore, we are interested in finding the model that yields the lowest AIC value.

The code chunk below iterates through combinations of parameters and uses the SARIMAX function from statsmodels to fit the corresponding Seasonal ARIMA model. Here, the order argument specifies the (p, d, q) parameters, while the seasonal_order argument specifies the (P, D, Q, S) seasonal component of the Seasonal ARIMA model. After fitting each SARIMAX() model, the code prints out its respective AIC score.

warnings.filterwarnings("ignore") # specify to ignore warning messages

for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(y,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)

            results = mod.fit()

            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except:
            continue

Because some parameter combinations may lead to numerical misspecifications, we explicitly disabled warning messages in order to avoid an overload of warning messages. These misspecifications can also lead to errors and throw an exception, so we make sure to catch these exceptions and ignore the parameter combinations that cause these issues.

The code above should yield the following results; this may take some time:

Output
SARIMAX(0, 0, 0)x(0, 0, 1, 12) - AIC:6787.3436240402125
SARIMAX(0, 0, 0)x(0, 1, 1, 12) - AIC:1596.711172764114
SARIMAX(0, 0, 0)x(1, 0, 0, 12) - AIC:1058.9388921320026
SARIMAX(0, 0, 0)x(1, 0, 1, 12) - AIC:1056.2878315690562
SARIMAX(0, 0, 0)x(1, 1, 0, 12) - AIC:1361.6578978064144
SARIMAX(0, 0, 0)x(1, 1, 1, 12) - AIC:1044.7647912940095
...
...
...
SARIMAX(1, 1, 1)x(1, 0, 0, 12) - AIC:576.8647112294245
SARIMAX(1, 1, 1)x(1, 0, 1, 12) - AIC:327.9049123596742
SARIMAX(1, 1, 1)x(1, 1, 0, 12) - AIC:444.12436865161305
SARIMAX(1, 1, 1)x(1, 1, 1, 12) - AIC:277.7801413828764

The output of our code suggests that SARIMAX(1, 1, 1)x(1, 1, 1, 12) yields the lowest AIC value of 277.78. We should therefore consider this to be the optimal option out of all the models we have considered.

Step 5 — Fitting an ARIMA Time Series Model

Using grid search, we have identified the set of parameters that produces the best fitting model to our time series data. We can proceed to analyze this particular model in more depth.

We’ll start by plugging the optimal parameter values into a new SARIMAX model:

mod = sm.tsa.statespace.SARIMAX(y,
                                order=(1, 1, 1),
                                seasonal_order=(1, 1, 1, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)

results = mod.fit()

print(results.summary().tables[1])
Output
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.3182      0.092      3.443      0.001       0.137       0.499
ma.L1         -0.6255      0.077     -8.165      0.000      -0.776      -0.475
ar.S.L12       0.0010      0.001      1.732      0.083      -0.000       0.002
ma.S.L12      -0.8769      0.026    -33.811      0.000      -0.928      -0.826
sigma2         0.0972      0.004     22.634      0.000       0.089       0.106
==============================================================================

The summary attribute that results from the output of SARIMAX returns a significant amount of information, but we’ll focus our attention on the table of coefficients. The coef column shows the weight (i.e. importance) of each feature and how each one impacts the time series. The P>|z| column informs us of the significance of each feature weight. Here, each weight has a p-value lower or close to 0.05, so it is reasonable to retain all of them in our model.

When fitting seasonal ARIMA models (and any other models for that matter), it is important to run model diagnostics to ensure that none of the assumptions made by the model have been violated. The plot_diagnostics object allows us to quickly generate model diagnostics and investigate for any unusual behavior.

results.plot_diagnostics(figsize=(15, 12))
plt.show()

Figure 2: Model Diagnostics

Our primary concern is to ensure that the residuals of our model are uncorrelated and normally distributed with zero-mean. If the seasonal ARIMA model does not satisfy these properties, it is a good indication that it can be further improved.

In this case, our model diagnostics suggests that the model residuals are normally distributed based on the following:

  • In the top right plot, we see that the red KDE line follows closely with the N(0,1) line (where N(0,1) is the standard notation for a normal distribution with mean 0 and standard deviation of 1). This is a good indication that the residuals are normally distributed.
  • The qq-plot on the bottom left shows that the ordered distribution of residuals (blue dots) follows the linear trend of the samples taken from a standard normal distribution with N(0, 1). Again, this is a strong indication that the residuals are normally distributed.
  • The residuals over time (top left plot) don’t display any obvious seasonality and appear to be white noise. This is confirmed by the autocorrelation (i.e. correlogram) plot on the bottom right, which shows that the time series residuals have low correlation with lagged versions of itself.

Those observations lead us to conclude that our model produces a satisfactory fit that could help us understand our time series data and forecast future values.

Although we have a satisfactory fit, some parameters of our seasonal ARIMA model could be changed to improve our model fit. For example, our grid search only considered a restricted set of parameter combinations, so we may find better models if we widened the grid search.

Step 6 — Validating Forecasts

We have obtained a model for our time series that can now be used to produce forecasts. We start by comparing predicted values to real values of the time series, which will help us understand the accuracy of our forecasts. The get_prediction() and conf_int() attributes allow us to obtain the values and associated confidence intervals for forecasts of the time series.

pred = results.get_prediction(start=pd.to_datetime('1998-01-01'), dynamic=False)
pred_ci = pred.conf_int()

The code above requires the forecasts to start at January 1998.

The dynamic=False argument ensures that we produce one-step ahead forecasts, meaning that forecasts at each point are generated using the full history up to that point.

We can plot the real and forecasted values of the CO2 time series to assess how well we did. Notice how we zoomed in on the end of the time series by slicing the date index.

ax = y['1990':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7)

ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.2)

ax.set_xlabel('Date')
ax.set_ylabel('CO2 Levels')
plt.legend()

plt.show()

Figure 3: CO2 Levels Static Forecast

Overall, our forecasts align with the true values very well, showing an overall increasing trend.

It is also useful to quantify the accuracy of our forecasts. We will use the MSE (Mean Squared Error), which summarizes the average error of our forecasts. For each predicted value, we compute its distance to the true value and square the result. The results need to be squared so that positive/negative differences do not cancel each other out when we compute the overall mean.

y_forecasted = pred.predicted_mean
y_truth = y['1998-01-01':]

# Compute the mean square error
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
Output
The Mean Squared Error of our forecasts is 0.07

The MSE of our one-step ahead forecasts yields a value of 0.07, which is very low as it is close to 0. An MSE of 0 would mean that the estimator is predicting observations of the parameter with perfect accuracy, which would be an ideal scenario but is not typically possible.

However, a better representation of our true predictive power can be obtained using dynamic forecasts. In this case, we only use information from the time series up to a certain point, and after that, forecasts are generated using values from previous forecasted time points.

In the code chunk below, we specify to start computing the dynamic forecasts and confidence intervals from January 1998 onwards.

pred_dynamic = results.get_prediction(start=pd.to_datetime('1998-01-01'), dynamic=True, full_results=True)
pred_dynamic_ci = pred_dynamic.conf_int()

Plotting the observed and forecasted values of the time series, we see that the overall forecasts are accurate even when using dynamic forecasts. All forecasted values (red line) match pretty closely to the ground truth (blue line), and are well within the confidence intervals of our forecast.

ax = y['1990':].plot(label='observed', figsize=(20, 15))
pred_dynamic.predicted_mean.plot(label='Dynamic Forecast', ax=ax)

ax.fill_between(pred_dynamic_ci.index,
                pred_dynamic_ci.iloc[:, 0],
                pred_dynamic_ci.iloc[:, 1], color='k', alpha=.25)

ax.fill_betweenx(ax.get_ylim(), pd.to_datetime('1998-01-01'), y.index[-1],
                 alpha=.1, zorder=-1)

ax.set_xlabel('Date')
ax.set_ylabel('CO2 Levels')

plt.legend()
plt.show()

Figure 4: CO2 Levels Dynamic Forecast

Once again, we quantify the predictive performance of our forecasts by computing the MSE:

# Extract the predicted and true values of our time series
y_forecasted = pred_dynamic.predicted_mean
y_truth = y['1998-01-01':]

# Compute the mean square error
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
Output
The Mean Squared Error of our forecasts is 1.01

The predicted values obtained from the dynamic forecasts yield an MSE of 1.01. This is slightly higher than the one-step ahead, which is to be expected given that we are relying on less historical data from the time series.

Both the one-step ahead and dynamic forecasts confirm that this time series model is valid. However, much of the interest around time series forecasting is the ability to forecast future values way ahead in time.

Step 7 — Producing and Visualizing Forecasts

In the final step of this tutorial, we describe how to leverage our seasonal ARIMA time series model to forecast future values. The get_forecast() attribute of our time series object can compute forecasted values for a specified number of steps ahead.

# Get forecast 500 steps ahead in future
pred_uc = results.get_forecast(steps=500)

# Get confidence intervals of forecasts
pred_ci = pred_uc.conf_int()

We can use the output of this code to plot the time series and forecasts of its future values.

ax = y.plot(label='observed', figsize=(20, 15))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Date')
ax.set_ylabel('CO2 Levels')

plt.legend()
plt.show()

Figure 5: Time Series and Forecast of Future Values

Both the forecasts and associated confidence interval that we have generated can now be used to further understand the time series and foresee what to expect. Our forecasts show that the time series is expected to continue increasing at a steady pace.

As we forecast further out into the future, it is natural for us to become less confident in our values. This is reflected by the confidence intervals generated by our model, which grow larger as we move further out into the future.

Conclusion

In this tutorial, we described how to implement a seasonal ARIMA model in Python. We made extensive use of the pandas and statsmodels libraries and showed how to run model diagnostics, as well as how to produce forecasts of the CO2 time series.

Here are a few other things you could try:

  • Change the start date of your dynamic forecasts to see how this affects the overall quality of your forecasts.
  • Try more combinations of parameters to see if you can improve the goodness-of-fit of your model.
  • Select a different metric to select the best model. For example, we used the AIC measure to find the best model, but you could seek to optimize the out-of-sample mean square error instead.

For more practice, you could also try to load another time series dataset to produce your own forecasts.

Comments

    Hi! Thanks for sharing this.
    I was trying to forecast hourly values. The seasonality to capture should be similar to the 168th previous value; that is, Friday 9PM of this week should be similar to Friday 9PM of the past week.
    That is why I decided to use a seasonality of 168 (24*7), but it takes very long and consumes lots of memory. I've tried several times using seasonality 7 and 24, but it wasn't doing well when forecasting (the previous fitting with dynamic set to False was working perfectly). Do you have any advice for this situation? Thanks in advance.


    Thanks for the Guide.

    I tried this with my own data. And at the model result summary part, I got ma.L1 having p-value over 0.88. So, I definitely want to get rid of this feature from the model. But how do I do that? How to remove a feature from the model??


      Hi!
      Thanks for taking the time to read through this tutorial! Yes, a p-value of 0.88 would suggest that your ma.L1 feature is not very informative. The simplest way to start would be to try and remove the MA features from your model. You can achieve this by refitting your time-series models while explicitly setting the Q parameter to zero, this will ensure that no MA components are used when you fit your model.


    very nice tutorial. thanks! I am a new one to ARIMA model, I want to ask you some questions.
    1) I found you use all the historical data to fit an ARIMA time series model, and use part of the historical data to validate the model, with code: pred = results.get_prediction(start=pd.to_datetime('1998-01-01'), dynamic=False); pred_ci = pred.conf_int()
    But why is the data used to validate the model part of the data used to fit it? In machine learning the train data and test data are split; I don't see why it is different here.
    2) What about data stationarity? Could you tell me why you set enforce_stationarity to False?

    3) What about fitting the model on daily data rather than monthly averages, or on weekly data, for prediction? Could you tell me how to do it?

    thanks!


    I got an error on line:
    pred = results.get_prediction(start=pd.to_datetime('1998-02-01'), dynamic=False)

    File “pandas_libs\tslib.pyx”, line 1080, in pandas.libs.tslib.Timestamp.richcmp (pandas_libs\tslib.c:20281)
    TypeError: Cannot compare type ‘Timestamp’ with type ‘int’

    How can I solve it?


    Hi.
    Thank you so much for your wonderful sharing. Is there any way to catch the minimum value of AIC automatically?
    It would be wonderful if the best set for ARIMAX was stored in an external variable and passed to the next step.
    Is it possible? how?
    Thanks you


      Use this code

      warnings.filterwarnings("ignore") # specify to ignore warning messages
      AIC_list = pd.DataFrame({}, columns=['pram','param_seasonal','AIC'])
      for param in pdq:
          for param_seasonal in seasonal_pdq:
              try:
                  mod = sm.tsa.statespace.SARIMAX(y,
                                                  order=param,
                                                  seasonal_order=param_seasonal,
                                                  enforce_stationarity=False,
                                                  enforce_invertibility=False)
      
                  results = mod.fit()
      
                  print('ARIMA{}x{} - AIC:{}'.format(param, param_seasonal, results.aic))
                  temp = pd.DataFrame([[ param ,  param_seasonal , results.aic ]], columns=['pram','param_seasonal','AIC'])
                  AIC_list = AIC_list.append( temp, ignore_index=True)  # DataFrame.append returns a new frame, so it must be reassigned (unlike list.append)
                  del temp
      
              except:
                  continue
      
      m = np.amin(AIC_list['AIC'].values) # Find minimum value in AIC
      l = AIC_list['AIC'].tolist().index(m) # Find index number for lowest AIC
      Min_AIC_list = AIC_list.iloc[l,:]
      
      print("### Min_AIC_list ### \n{}".format(Min_AIC_list))
      
      mod = sm.tsa.statespace.SARIMAX(y,
                                      order=Min_AIC_list['pram'],
                                      seasonal_order=Min_AIC_list['pram_seasonal'],
                                      enforce_stationarity=False,
                                      enforce_invertibility=False)
      
      results = mod.fit()
      
      print(results.summary().tables[1])
      
      results.plot_diagnostics(figsize=(15, 12))
      plt.show()
      
      

      Revised code (sorry in the previous code, I was missing one thing.)

      
      warnings.filterwarnings("ignore") # specify to ignore warning messages
      AIC_list = pd.DataFrame({}, columns=['param','param_seasonal','AIC'])
      for param in pdq:
          for param_seasonal in seasonal_pdq:
              try:
                  mod = sm.tsa.statespace.SARIMAX(y,
                                                  order=param,
                                                  seasonal_order=param_seasonal,
                                                  enforce_stationarity=False,
                                                  enforce_invertibility=False)
      
                  results = mod.fit()
      
                  print('ARIMA{}x{} - AIC:{}'.format(param, param_seasonal, results.aic))
                  temp = pd.DataFrame([[ param ,  param_seasonal , results.aic ]], columns=['param','param_seasonal','AIC'])
                  AIC_list = AIC_list.append( temp, ignore_index=True)  # DataFrame.append returns a new frame, so it must be reassigned (unlike list.append)
                  del temp
      
              except:
                  continue
      
      
      m = np.amin(AIC_list['AIC'].values) # Find minimum value in AIC
      l = AIC_list['AIC'].tolist().index(m) # Find index number for lowest AIC
      Min_AIC_list = AIC_list.iloc[l,:]
      
      
      
      mod = sm.tsa.statespace.SARIMAX(y,
                                      order=Min_AIC_list['param'],
                                      seasonal_order=Min_AIC_list['param_seasonal'],
                                      enforce_stationarity=False,
                                      enforce_invertibility=False)
      results = mod.fit()
      
      print("### Min_AIC_list ### \n{}".format(Min_AIC_list))
      
      print(results.summary().tables[1])
      
      results.plot_diagnostics(figsize=(15, 12))
      plt.show()
      
      

Correlating Events with Time Series

The paper "Correlating Events with Time Series for Incident Diagnosis" is a 2014 piece of work from Microsoft, published at KDD.

The paper proposes an unsupervised, statistically grounded algorithm that can detect the dependency between events (E) and a time series (S), and can also detect the direction of the effect on the time series (increase or decrease). The events considered include CPU (Memory, Disk) Intensive Program and Query Alert; the time series include CPU (Memory) Usage and Disk Transfer Rate. A property of these time series is that their values all lie in [0,1].

[Table 1: the events and time series studied]

Example: the time series is CPU usage, and the events are a disk-intensive task and a CPU-intensive task.

[Figure 1]

Mining the dependency consists of three parts:

(1) Existence of Dependency: whether a dependency exists between the events (E) and the time series (S).

(2) Temporal Order of Dependency: whether the events (E) cause the changes in the time series (S), or the time series (S) causes the events (E).

(3) Monotonic Effect of Dependency: whether the time series (S) jumps up or drops down.

Basic concepts:

Given an event sequence (E), the event timestamps are T_{E}=(t_{1},\cdots,t_{n}), where n is the number of events. The time series (S) is written S=(s_{1},\cdots,s_{m}), where m is the length of the series. The timestamps of the time series can be taken as an arithmetic sequence with common difference \tau: T_{S}=(t(s_{1}),\cdots,t(s_{m})), with t(s_{i}) =t(s_{i-1})+\tau.

e_{i}来表示某个事件,\ell_{k}^{rear}(S,e_{i})表示序列S在事件e_{i}之后的长度为k的子序列,\ell_{k}^{front}(S,e_{i})表示序列S在事件e_{i}之前的长度为k的子序列。如果事件E与时间序列S之间存在关联关系,那么

\Gamma^{front}=\{\ell_{k}^{front}(S,e_{i}), i=1,\cdots,n\} and

\Gamma^{rear}=\{\ell_{k}^{rear}(S,e_{i}),i=1,\cdots,n\} should be distributed differently.

Definition 1. The event sequence E and the time series S are dependent with S->E if and only if \Gamma^{front}=\{\ell_{k}^{front}(S,e_{i}), i=1,\cdots,n\} is distributed differently from randomly sampled subsequences.

Definition 2. The event sequence E and the time series S are dependent with E->S if and only if \Gamma^{rear}=\{\ell_{k}^{rear}(S,e_{i}),i=1,\cdots,n\} is distributed differently from randomly sampled subsequences, while \Gamma^{front}=\{\ell_{k}^{front}(S,e_{i}), i=1,\cdots,n\} is distributed like randomly sampled subsequences.

Definition 3. If the event sequence E and the time series S are dependent, then S->E or E->S.

Definition 4. If E->S (or S->E) and the time series increases relative to before the events, write E\stackrel{+}{\longrightarrow} S (or S\stackrel{+}{\longrightarrow} E). If E->S (or S->E) and the time series decreases relative to before the events, write E\stackrel{-}{\longrightarrow} S (or S\stackrel{-}{\longrightarrow} E).

Methodology:

Step 1: Nearest Neighbor Method (similar to kNN)

When computing distances between the subsequences, the DTW or DTW-D algorithms outperform L1 or L2 distances.

\Gamma^{front}来做例子,\Gamma^{front}=\{\ell_{k}^{front}(S,e_{i}), i=1,\cdots,n\}\Theta =\{\theta_{1},\cdots,\theta_{\tilde{n}}\} 是随机选择的,Z=\Gamma \cup \Theta,可以标记为Z_{1},\cdots,Z_{p},其中p=n+\tilde{n}Z_{i}=\ell_{k}^{front}(S,e_{i}) when 1\leq i\leq nZ_{i}=\theta_{i-n} when n+1\leq i\leq p。可以使用记号A=A_{1}\cup A_{2},其中A_{1}=\Gamma^{front}A_{2}=\Theta=\{\theta_{1},\cdots,\theta_{\tilde{n}}\}是随机选择的。

For a set A and x\in A, let NN_{r}(x,A) denote the r-th closest element to x in A-\{x\}. For two disjoint sets A_{1} and A_{2}, define:

I_{r}(x,A_{1},A_{2})=1 if x\in A_{i} and NN_{r}(x,A)\in A_{i} for the same i,

I_{r}(x,A_{1},A_{2})=0 otherwise.

The function I_{r}(x,A_{1},A_{2}) indicates whether x and its r-th nearest neighbor lie in the same subset.

Define

T_{r,p}=\frac{1}{pr}\sum_{i=1}^{p}\sum_{j=1}^{r}I_{j}(x_{i},A_{1},A_{2}),

where p=n+\tilde{n} is the total number of samples and x_{i} is the i-th element of A. Intuitively, a small T_{r,p} means the two samples A_{1}, A_{2} are well mixed, indicating nothing anomalous; a large T_{r,p} means the two samples are separable: many elements and their neighbors are concentrated within the same subset, so A_{1} is distinguishable from A_{2}.

According to the literature, when p is large enough, (pr)^{\frac{1}{2}}(T_{r,p}-\mu_{r})/\sigma_{r} follows a standard Gaussian distribution, with parameters \mu_{r}=(\lambda_{1})^{2}+(\lambda_{2})^{2}, \sigma_{r}^{2}=\lambda_{1}\lambda_{2}+4\lambda_{1}^{2}\lambda_{2}^{2}, where

\lambda_{1}=n/p=n/(n+\tilde{n}), \quad \lambda_{2}=\tilde{n}/(n+\tilde{n}).

By a standard Gaussian test, \Gamma^{front} differs significantly from \Theta when (pr)^{\frac{1}{2}}(T_{r,p}-\mu_{r})/\sigma_{r}>\alpha, where the threshold can be set as:

\alpha = 1.96 for P=0.025

\alpha = 2.58 for P=0.001

If \Gamma^{front} deviates significantly from \Theta, then \Gamma^{front} should be flagged as anomalous; similarly, if \Gamma^{rear} deviates significantly from \Theta, then \Gamma^{rear} should be flagged as anomalous.
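A minimal sketch of this two-sample nearest-neighbor test in Python; gamma and theta are lists of equal-length subsequences, the Euclidean norm stands in for the DTW distance the paper prefers, and the function name is illustrative:

import numpy as np

def nn_test_statistic(gamma, theta, r):
    """Return T_{r,p} and its standardized value (pr)^{1/2} (T_{r,p} - mu_r) / sigma_r."""
    Z = [np.asarray(z, dtype=float) for z in list(gamma) + list(theta)]
    labels = [0] * len(gamma) + [1] * len(theta)
    p = len(Z)
    D = np.array([[np.linalg.norm(a - b) for b in Z] for a in Z])
    np.fill_diagonal(D, np.inf)                # exclude x itself when ranking neighbors
    hits = 0
    for i in range(p):
        for j in np.argsort(D[i])[:r]:         # the r nearest neighbors of Z_i
            hits += (labels[j] == labels[i])   # I_j(x_i, A_1, A_2)
    T = hits / (p * r)
    lam1, lam2 = len(gamma) / p, len(theta) / p
    mu = lam1 ** 2 + lam2 ** 2
    sigma2 = lam1 * lam2 + 4 * lam1 ** 2 * lam2 ** 2
    return T, np.sqrt(p * r) * (T - mu) / np.sqrt(sigma2)

The standardized value is then compared against the threshold \alpha above.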

 

Step 2: Mining Existence and Temporal Order

[Figure 3]

If the front subsequences \Gamma^{front} deviate significantly from the randomly sampled subsequences \Theta, then the change in the time series leads to the events: S\rightarrow E.

If the rear subsequences \Gamma^{rear} deviate significantly from the randomly sampled subsequences \Theta, then the events lead to the change in the time series: E\rightarrow S.

In Figure 3, the CPU Intensive Program drives CPU Usage, and CPU Usage in turn leads to the SQL Query Alert.

 

Step 3: Mining the Effect Type (Monotonicity)

We now need to decide whether the time series jumps up or drops down; for this we introduce the t_{score}.

For \Gamma^{front}=\{\ell_{k}^{front}(S,e_{i}), i=1,\cdots,n\} and \Gamma^{rear}=\{\ell_{k}^{rear}(S,e_{i}), i=1,\cdots,n\}, where n is the number of events in E, the t_{score} is defined as:

t_{score}=\frac{\mu_{\Gamma^{front}} - \mu_{\Gamma^{rear}}}{\sqrt{\frac{\sigma_{\Gamma^{front}}^{2}+\sigma_{\Gamma^{rear}}^{2}}{n}}}.

If t_{score}>\alpha, we conclude E\stackrel{-}{\longrightarrow}S or S\stackrel{-}{\longrightarrow} E; if t_{score}<-\alpha, we conclude E\stackrel{+}{\longrightarrow}S or S\stackrel{+}{\longrightarrow} E.

The threshold can again be set as:

\alpha = 1.96 for P=0.025

\alpha = 2.58 for P=0.001
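A small sketch of the t_{score} computation, assuming each subsequence is summarized by a single value (for instance its mean), so that front and rear are arrays with one entry per event:

import numpy as np

def t_score(front, rear):
    """Positive values suggest a drop after the events, negative values an increase."""
    front = np.asarray(front, dtype=float)
    rear = np.asarray(rear, dtype=float)
    n = len(front)                             # the number of events in E
    return (front.mean() - rear.mean()) / np.sqrt((front.var(ddof=1) + rear.var(ddof=1)) / n)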

 

Algorithm overview:

[Algorithm pseudocode from the paper]

Lines 5-6 compute the dependency: D_{r} is True when \Gamma^{rear} is anomalous and False otherwise; D_{f} is True when \Gamma^{front} is anomalous and False otherwise.

Lines 7-13 handle the case E\rightarrow S: \Gamma^{rear} is anomalous while \Gamma^{front} is normal, meaning the events cause the change in the time series. These lines compute the range of t_{score} to decide whether the change is a significant increase or decrease.

Lines 14-20 handle the case S\rightarrow E: \Gamma^{front} is anomalous, so the change in the series leads to the events. These lines likewise compute the range of t_{score} to decide whether the change is a significant increase or decrease.

Parameters:

the subsequence length k can be set to the length at which the time series first reaches a peak;

the number of nearest neighbors can be set to r=\ln(p), where p is the total number of samples.

[Figure 5]

[Figure 6]

Other algorithms compared:

(1)Pearson Correlation

(2)J-Measure Correlation

 

 

How to Convert a Time Series to a Supervised Learning Problem in Python

Machine learning methods like deep learning can be used for time series forecasting.

Before machine learning can be used, time series forecasting problems must be re-framed as supervised learning problems. From a sequence to pairs of input and output sequences.

In this tutorial, you will discover how to transform univariate and multivariate time series forecasting problems into supervised learning problems for use with machine learning algorithms.

After completing this tutorial, you will know:

  • How to develop a function to transform a time series dataset into a supervised learning dataset.
  • How to transform univariate time series data for machine learning.
  • How to transform multivariate time series data for machine learning.

Let’s get started.


Time Series vs Supervised Learning

Before we get started, let’s take a moment to better understand the form of time series and supervised learning data.

A time series is a sequence of numbers that are ordered by a time index. This can be thought of as a list or column of ordered values.

For example:
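An illustrative series might look like this (the numbers are made up for demonstration):

time, measure
1,    100
2,    110
3,    108
4,    115
5,    120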

A supervised learning problem is comprised of input patterns (X) and output patterns (y), such that an algorithm can learn how to predict the output patterns from the input patterns.

For example:
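An illustrative framing might look like this (using lagged values of the series above as input):

X,   y
100, 110
110, 108
108, 115
115, 120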


Pandas shift() Function

A key function to help transform time series data into a supervised learning problem is the Pandas shift() function.

Given a DataFrame, the shift() function can be used to create copies of columns that are pushed forward (rows of NaN values added to the front) or pulled back (rows of NaN values added to the end).

This is the behavior required to create columns of lag observations as well as columns of forecast observations for a time series dataset in a supervised learning format.

Let’s look at some examples of the shift function in action.

We can define a mock time series dataset as a sequence of 10 numbers, in this case a single column in a DataFrame as follows:
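A minimal sketch of that definition (the column name 't' is an arbitrary choice):

from pandas import DataFrame

# a mock time series of 10 numbers as a single column
df = DataFrame()
df['t'] = [x for x in range(10)]
print(df)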

Running the example prints the time series data with the row indices for each observation.

We can shift all the observations down by one time step by inserting one new row at the top. Because the new row has no data, we can use NaN to represent “no data”.

The shift function can do this for us and we can insert this shifted column next to our original series.
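A sketch of that shift, continuing the mock dataset above:

from pandas import DataFrame

df = DataFrame()
df['t'] = [x for x in range(10)]
df['t-1'] = df['t'].shift(1)   # push observations down; NaN appears at the top
print(df)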

Running the example gives us two columns in the dataset. The first with the original observations and a new shifted column.

We can see that shifting the series forward one time step gives us a primitive supervised learning problem, although with X and y in the wrong order. Ignore the column of row labels. The first row would have to be discarded because of the NaN value. The second row shows the input value of 0.0 in the second column (input or X) and the value of 1 in the first column (output or y).

We can see that if we can repeat this process with shifts of 2, 3, and more, how we could create long input sequences (X) that can be used to forecast an output value (y).

The shift operator can also accept a negative integer value. This has the effect of pulling the observations up by inserting new rows at the end. Below is an example:
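A sketch of a negative shift on the same mock dataset:

from pandas import DataFrame

df = DataFrame()
df['t'] = [x for x in range(10)]
df['t+1'] = df['t'].shift(-1)  # pull observations up; NaN appears at the end
print(df)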

Running the example shows a new column with a NaN value as the last value.

We can see that the forecast column can be taken as an input (X) and the second as an output value (y). That is the input value of 0 can be used to forecast the output value of 1.

Technically, in time series forecasting terminology the current time (t) and future times (t+1, t+n) are forecast times and past observations (t-1, t-n) are used to make forecasts.

We can see how positive and negative shifts can be used to create a new DataFrame from a time series with sequences of input and output patterns for a supervised learning problem.

This permits not only classical X -> y prediction, but also X -> Y where both input and output can be sequences.

Further, the shift function also works on so-called multivariate time series problems. That is where instead of having one set of observations for a time series, we have multiple (e.g. temperature and pressure). All variates in the time series can be shifted forward or backward to create multivariate input and output sequences. We will explore this more later in the tutorial.

The series_to_supervised() Function

We can use the shift() function in Pandas to automatically create new framings of time series problems given the desired length of input and output sequences.

This would be a useful tool as it would allow us to explore different framings of a time series problem with machine learning algorithms to see which might result in better performing models.

In this section, we will define a new Python function named series_to_supervised() that takes a univariate or multivariate time series and frames it as a supervised learning dataset.

The function takes four arguments:

  • data: Sequence of observations as a list or 2D NumPy array. Required.
  • n_in: Number of lag observations as input (X). Values may be between [1..len(data)]. Optional. Defaults to 1.
  • n_out: Number of observations as output (y). Values may be between [0..len(data)-1]. Optional. Defaults to 1.
  • dropnan: Boolean whether or not to drop rows with NaN values. Optional. Defaults to True.

The function returns a single value:

  • return: Pandas DataFrame of series framed for supervised learning.

The new dataset is constructed as a DataFrame, with each column suitably named both by variable number and time step. This allows you to design a variety of different time step sequence type forecasting problems from a given univariate or multivariate time series.

Once the DataFrame is returned, you can decide how to split the rows of the returned DataFrame into X and y components for supervised learning any way you wish.

The function is defined with default parameters so that if you call it with just your data, it will construct a DataFrame with t-1 as X and t as y.

The function is confirmed to be compatible with Python 2 and Python 3.

The complete function is listed below, including function comments.
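Below is a sketch of the function consistent with the arguments and behavior described above; treat the details (column naming, NaN handling) as one reasonable implementation rather than the only one:

from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or 2D NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ..., t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ..., t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg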


Now that we have the whole function, we can explore how it may be used.

One-Step Univariate Forecasting

It is standard practice in time series forecasting to use lagged observations (e.g. t-1) as input variables to forecast the current time step (t).

This is called one-step forecasting.

The example below demonstrates a one lag time step (t-1) to predict the current time step (t).
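A minimal example of that framing, reusing the series_to_supervised() function defined above:

values = [x for x in range(10)]
data = series_to_supervised(values)
print(data)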

Running the example prints the output of the reframed time series.

We can see that the observations are named “var1” and that the input observation is suitably named (t-1) and the output time step is named (t).

We can also see that rows with NaN values have been automatically removed from the DataFrame.

We can repeat this example with an arbitrary-length input sequence, such as 3. This can be done by specifying the length of the input sequence as an argument, for example series_to_supervised(values, 3).

The complete example is listed below.
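A minimal sketch of that complete example:

values = [x for x in range(10)]
data = series_to_supervised(values, 3)
print(data)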

Again, running the example prints the reframed series. We can see that the input sequence is in the correct left-to-right order with the output variable to be predicted on the far right.

Multi-Step or Sequence Forecasting

A different type of forecasting problem is using past observations to forecast a sequence of future observations.

This may be called sequence forecasting or multi-step forecasting.

We can frame a time series for sequence forecasting by specifying another argument. For example, we could frame a forecast problem with an input sequence of 2 past observations to forecast 2 future observations as follows:

The complete example is listed below:
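A minimal sketch, again reusing series_to_supervised() from above:

values = [x for x in range(10)]
data = series_to_supervised(values, 2, 2)
print(data)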

Running the example shows the differentiation of input (t-n) and output (t+n) variables with the current observation (t) considered an output.

Multivariate Forecasting

Another important type of time series is called multivariate time series.

This is where we may have observations of multiple different measures and an interest in forecasting one or more of them.

For example, we may have two sets of time series observations obs1 and obs2 and we wish to forecast one or both of these.

We can call series_to_supervised() in exactly the same way.

For example:
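A minimal sketch of the multivariate case (the column names obs1/obs2 and their values are illustrative):

from pandas import DataFrame

raw = DataFrame()
raw['obs1'] = [x for x in range(10)]
raw['obs2'] = [x for x in range(50, 60)]
values = raw.values
data = series_to_supervised(values)
print(data)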

Running the example prints the new framing of the data, showing an input pattern with one time step for both variables and an output pattern of one time step for both variables.

Again, depending on the specifics of the problem, the division of columns into X and Y components can be chosen arbitrarily, such as if the current observation of var1 was also provided as input and only var2 was to be predicted.

You can see how this may be easily used for sequence forecasting with multivariate time series by specifying the length of the input and output sequences as above.

For example, below is an example of a reframing with 1 time step as input and 2 time steps as forecast sequence.
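A minimal sketch of that reframing, with the same mock data:

from pandas import DataFrame

raw = DataFrame()
raw['obs1'] = [x for x in range(10)]
raw['obs2'] = [x for x in range(50, 60)]
values = raw.values
data = series_to_supervised(values, 1, 2)
print(data)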

Running the example shows the large reframed DataFrame.

Experiment with your own dataset and try multiple different framings to see what works best.

Summary

In this tutorial, you discovered how to reframe time series datasets as supervised learning problems with Python.

Specifically, you learned:

  • About the Pandas shift() function and how it can be used to automatically define supervised learning datasets from time series data.
  • How to reframe a univariate time series into one-step and multi-step supervised learning problems.
  • How to reframe multivariate time series into one-step and multi-step supervised learning problems.


Mueen Keogh算法

论文:Exact Discovery of Time Series Motifs

 

Speeded up Brute Force Motif Discovery:

Github:https://github.com/saifuddin778/mkalgo

但是感觉有一行比较奇怪:应该是 Dist_{I(j+offset)}-Dist_{I(j)} < best-so-far,而不是 Dist_{I(j)}-Dist_{I(j+offset)} < best-so-far,因为 Dist_{I(j)} 是递增排列的,并且 best-so-far > 0。

Speeded up brute force motif discovery

Generalization to multiple reference points:

https://github.com/nicholasg3/motif-mining/tree/95bbb05ac5d0f9e90134a67a789ea7e607f22cea

注意:

for j = 1 to m-offset 而不是 for j = 1 to R
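结合上面两条注记,MK 加速扫描的核心循环大致如下(极简示意,并非论文的原始伪代码;dists 为各序列到参考序列的距离,I 为按 dists 升序排列的下标,dist_fn 为真实距离函数,名称均为示意性命名):

def mk_scan(series, dists, I, dist_fn, best_so_far=float('inf')):
    """MK 加速扫描示意:利用三角不等式下界剪枝,返回最优距离与对应的序列对。"""
    best_pair = None
    m = len(I)
    for offset in range(1, m):
        pruned_all = True
        for j in range(m - offset):   # 注意上界是 m - offset,而不是固定的 R
            # 由三角不等式,dists[I[j+offset]] - dists[I[j]] 是真实距离的下界
            if dists[I[j + offset]] - dists[I[j]] < best_so_far:
                pruned_all = False
                d = dist_fn(series[I[j]], series[I[j + offset]])
                if d < best_so_far:
                    best_so_far, best_pair = d, (I[j], I[j + offset])
        if pruned_all:
            break  # 该 offset 下所有下界都超过 best-so-far,更大的 offset 只会更差
    return best_so_far, best_pair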

nicholasg3 MK_Motif_discovery

MK Motif Discovery

Time Series Clustering with Dynamic Time Warping (DTW)

https://github.com/goodmattg/wikipedia_kaggle

 

 

 

 

 

不要做低品质的勤奋者---读后感

不要做低品质的勤奋者---读后感

之前在网络上曾经有人写过一篇文章《为什么绝大多数人都是“低品质的勤奋者”》,现在一年之后翻阅来看,结合自己攻读博士学位和工作的经历,在这里稍微总结一下,希望能够给自己一些启示。

作为一个博士生,时间通常来说会比工作人士相对自由,非常类似一个自由职业者。因此也会有时间去做自己想做的事情,阅读自己喜欢的各种书籍,学习自己想学的各种知识技能。但是,如果要真正走上一条科研的道路,绝对不是把所有的知识点统统学一遍之后才开始进行科研,而是在科研的道路中不断地学习和充实自己,在科研的过程中不断地成长。除了在科研的前期准备一些必要的理论基础之外,在科研的过程中就要有选择地放弃一些知识。因为在做科研的时候并不需要事先了解每一个知识点,科研是做出来的而不是学出来的。科研的途中只需要掌握某些知识点就可以了,剩下的都是遇到了问题之后靠自己独立攻克解决的。在科研之前抱有“等我读完了这些书再开始科研”的想法,其实是不明智的。

在工业界,一天到晚总会有各种各样杂七杂八的事情,也需要有人去关注这些问题并且解决这些问题。不过在日常的工作中,技术的积累和调研也是一份重要的工作,它不仅关系着员工个人的成长,也关系着整个团队整体实力的提升。这种时候,总需要有人去做一些调研性质的工作,去探索这个技术对整个团队是否存在实际的价值。但是在调研的过程中,不同的人就有着完全不同的做事方法,有的人喜欢直接开始干活,以无穷次试错的方式来判断自己的选择是否是正确的;而有的人喜欢先进行调研的工作,先把公司内外的优秀技术调研清楚,然后再与自己所处的环境做比较,最后再判断自己所选择的方案是否合理。除此之外,有的人总喜欢抓住一些细枝末节的东西反复纠缠,仿佛在这些细节上就能够挖出全部的框架结构,试图从这些细枝末节入手找出关键之处。而有的人则倾向于从一个整体的框架入手,尝试着从这个框架来解释方案是否合理,从整体来反推出之前的那些细节存在的意义。

如果看过天龙八部的人就知道,鸠摩智当时上少林寺去挑战,在少林高僧面前展示出自己所学的少林七十二绝技,诸多少林高僧无不大惊失色。而当时的虚竹在旁边观战,就对少林高僧们说:“鸠摩智所耍的招数虽然是少林绝技,但是本质上却是使用小无相功催动出来的。虽然招数相同,但是却用的道家的内力。”为什么少林的高僧们没有看出来鸠摩智武功的关键之处呢,那是因为少林高僧们在练功的时候,一直抱着武学秘籍在修炼,一辈子练到头了也就13门绝技。其实从鸠摩智的个人修炼来看,修练武学的关键并不在武学秘籍里,而在佛经里。没有找到关键的佛经,没有找到运功的法门,无论抱着武学秘籍修炼多少年,终究与别人有着本质上的差距。诗人陆游也曾经教育过他的后辈:“汝果欲学诗,功夫在诗外”。意思是说,如果你想真正地写出好的诗词,就要在生活上下功夫,去体验生活的酸甜苦辣,而不是抱着一本诗词歌赋来反复阅读。

在读博士的过程中,一般来说,选择博士课题是一件极为重要的事情。如果课题选得太难,那么学生在这几年内也做不出来;如果课题选得过于简单,那么学生即使做出来了也没有太大的意义。在选择课题的时候,博士们都会十分慎重,都会通过半年或者一年甚至更长的时间来选择合适自己的课题,否则在未来的几年里面将会十分的痛苦。在科研的过程中,在做每一个方案之前,一定要进行仔细的思考,因为科研工作一旦开工,很可能在短期内就不会回头。如果选择了一条极其错误的道路,那将会在这条错误的道路上走得越来越远,越来越不可能得到正确的结果。

为了不成为低品质的勤奋者,在平时工作和生活的过程中,在做某件事情之前,一定要提前进行仔细的思考和周密的安排。在这种时候,就不能够通过自己的应激反应来做这些重要而又不紧急的事情。一般来说,通过自己的第一反应来做的事情都是那种重要而又紧急的事情,而通过自己深度思考才开始做的事情都是重要但是又不紧急的事情。在个人成长道路中,要想改变一个人的命运的,通常都是做那些重要而又不紧急的事情。不过,人一般来说都是有惰性的,都是懒得进行深度思考,最常见的做法都是通过应激反应来决定下一步的计划和方案。如果要想保证一个人的成长,就一定要刻意地逼迫自己进行深度思考,逼迫自己每天花一定的时间来处理那些重要而又不紧急的事情。

无论是学习还是工作,都不可能面面俱到,不太可能做到事必躬亲的处理每一件事情。这种时候就要做到有重点的处理关键事情,放弃一些不那么重要的事情。在做每一件事情之前,都要思考做这件事情是否能够真正的带来价值,是否对整个团队和个人有利,而不是在稍微思考了一下觉得可以这么做就开始动手做这件事情。一味的努力做某件事情,只是看起来十分刻苦,但是要想真正的摆脱低品质勤奋者的困境,还需要有很长的路要走。

Opprentice: Towards Practical and Automatic Anomaly Detection Through Machine Learning

本文是运维系统智能化的一次探索工作,论文的作者是清华大学的裴丹教授,论文的题目是《Opprentice: Towards Practical and Automatic Anomaly Detection Through Machine Learning》。目的是基于机器学习的 KPI(Key Performance Indicator)的自动化异常检测。

标题 Opprentice 来源于 Operator’s Apprentice,意思就是运维人员的学徒。本文通过运维人员的业务经验来进行异常数据的标注工作,使用时间序列的各种算法来提取特征,并且使用有监督学习模型(例如 Random Forest、GBDT、XGBoost 等)来实现离线训练和上线预测的功能。本文提到系统 Opprentice 使用了一个多月的历史数据进行分析和预测,已经可以做到准确率>=0.66,覆盖率>=0.66 的效果。

1. Opprentice的介绍

系统遇到的挑战:

Definition Challenges: it is difficult to precisely define anomalies in reality.(在现实环境下很难精确的给出异常的定义)

Detector Challenges: In order to provide a reasonable detection accuracy, selecting the most suitable detector requires both the algorithm expertise and the domain knowledge about the given service KPI (Key Performance Indicators). To address the definition challenge and the detector challenge, we advocate for using supervised machine learning techniques. (使用有监督学习的方法来解决这个问题)

该系统的优势:

(i) Opprentice is the first detection framework to apply machine learning to acquiring realistic anomaly definitions and automatically combining and tuning diverse detectors to satisfy operators’ accuracy preference.

(ii) Opprentice addresses a few challenges in applying machine learning to such a problem: labeling overhead, infrequent anomalies, class imbalance, and irrelevant and redundant features.

(iii) Opprentice can automatically satisfy or approximate a reasonable accuracy preference (recall>=0.66 & precision>=0.66). (准确率和覆盖率的效果)

2. 背景描述:

KPIs and KPI Anomalies:

KPIs: The KPI data are the time series data with the format of (time stamp, value). In this paper, Opprentice pays attention to three kinds of KPIs: the search page view (PV), which is the number of successfully served queries; The number of slow responses of search data centers (#SR); The 80th percentile of search response time (SRT).

Anomalies: KPI time series data can also present several unexpected patterns (e.g. jitters, slow ramp ups, sudden spikes and dips) in different severity levels, such as a sudden drop by 20% or 50%.

OpprenticeFigure1

问题和目标:

覆盖率(recall):# of true anomalous points detected / # of the anomalous points

准确率(precision):# of true anomalous points detected / # of anomalous points detected

FDR(false discovery rate):# of false anomalous points detected / # of anomalous points detected = 1 – precision

The quantitative goal of opprentice is precision>=0.66 and recall>=0.66.

The qualitative goal of opprentice is automatic enough so that the operators would not be involved in selecting and combining suitable detectors, or tuning them.

3. Opprentice Overview: (Opprentice系统的概况)

OpprenticeFigure2

(i) Opprentice approaches the above problem through supervised machine learning.

(ii) Features of the data are the results of the detectors.(Basic Detectors 来计算出特征)

(iii) The labels of the data are from operators’ experience.(人工打标签)

(iv) Addressing Challenges in Machine Learning: (机器学习遇到的挑战)

(1) Label Overhead: Opprentice has a dedicated labeling tool with a simple and convenient interaction interface. (标签的获取)

(2) Incomplete Anomaly Cases:(异常情况的不完全信息)

(3) Class Imbalance Problem: (正负样本比例不均衡)

(4) Irrelevant and Redundant Features:(无关和多余的特征)

4. Opprentice’s Design:

Architecture: Operators label the data, and numerous detector functions serve as feature extractors for the data.

OpprenticeFigure3

Label Tool:

人工使用鼠标和软件进行标注工作

OpprenticeFigure4

Detectors:

(i) Detectors As Feature Extractors: (Detector用来提取特征)

Here, for each parameterized detector, we sample its parameters so that we can obtain several fixed detectors; a detector with specific sampled parameters is called a (detector) configuration. Thus a configuration acts as a feature extractor:

data point + configuration (detector + sample parameters) -> feature,

(ii) Choosing Detectors: (Detector的选择,目前有14种较为常见的)

Opprentice can find suitable ones from broadly selected detectors, and achieve a relatively high accuracy. Here, we implement 14 widely-used detectors in Opprentice.

Opprentice has 14 widely-used detectors:

OpprenticeTable3

“Diff”: it simply measures anomaly severity using the differences between the current point and the point of last slot, the point of last day, and the point of last week.

“MA of diff”: it measures severity using the moving average of the difference between current point and the point of last slot.

The other 12 detectors come from previous literature. Among these detectors, there are two variants of detectors using MAD (Median Absolute Deviation) around the median, instead of the standard deviation around the mean, to measure anomaly severity.

(iii) Sampling Parameters: (Detector的参数选择方法,一种是扫描参数空间,另外一种是选择最佳的参数)

Two methods to sample the parameters of detectors.

(1) The first one is to sweep the parameter space. For example, in EWMA, we can choose \alpha \in \{0.1,0.3,0.5,0.7,0.9\} to obtain 5 typical features from EWMA; Holt-Winters has three [0,1] valued parameters \alpha,\beta,\gamma, and choosing \alpha,\beta,\gamma \in \{0.2,0.4,0.6,0.8\} gives 4^3 features. (2) The second one is to estimate the “best” parameters from the data: for example, in ARIMA, we can estimate the best parameters from the data and generate only one set of parameters, or one configuration, for each detector.

Supervised Machine Learning Models:

Decision Trees, logistic regression, linear support vector machines (SVMs), and naive Bayes. 下面是决策树(Decision Tree)的一个简单例子。

OpprenticeFigure5

Random Forest is an ensemble classifier using many decision trees. Its main principle is that a group of weak learners (e.g. individual decision trees) can together form a strong learner. To grow different trees, a random forest adds some elements of randomness. First, each tree is trained on subsets sampled from the original training set. Second, instead of evaluating all the features at each level, the trees only consider a random subset of the features each time. The random forest combines those trees by majority vote. The above properties of randomness and ensemble make random forest more robust to noise and perform better when faced with irrelevant and redundant features than decision trees.

Configuring cThlds: (阈值的计算和预估)

(i) methods to select proper cThlds: offline part

OpprenticeFigure6

We need to configure cThlds rather than using the default one (e.g. 0.5) for two reasons.

(1) First, when faced with imbalanced data (anomalous data points are much less frequent than normal ones in data sets), machine learning algorithms typically fail to identify the anomalies (low recall) if using the default cThlds (e.g. 0.5).

(2) Second, operators have their own preference regarding the precision and recall of anomaly detection.

The metric to evaluate the precision and recall are:

(1) F-Score: F-Score = 2*precision*recall/(precision+recall).

(2) SD(1,1): it selects the point with the shortest Euclidean distance to the upper right corner where the precision and the recall are both perfect.

(3) PC-Score: (本文中采用这种评估指标来选择合适的阈值)

If r>=R and p>=P, then PC-Score(r,p)=2*r*p/(r+p) + 1; else PC-Score(r,p)=2*r*p/(r+p). Here, R and P are from the operators’ preference “recall>=R and precision>=P”. Since the F-Score is no more than 1, then we can choose the cThld corresponding to the point with the largest PC-Score.
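PC-Score 的计算可以用几行代码示意(R、P 为运维人员给定的偏好,函数名为示意性命名):

def pc_score(r, p, R=0.66, P=0.66):
    """以 F-Score 为基础的 PC-Score:满足偏好 (r >= R 且 p >= P) 时加 1。"""
    f = 2 * r * p / (r + p) if (r + p) > 0 else 0.0
    return f + 1 if (r >= R and p >= P) else f

# 用法示意:candidates 为各候选阈值对应的 (recall, precision) 列表(假设已经算好)
# best = max(candidates, key=lambda rp: pc_score(*rp))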

(ii) EWMA Based cThld Prediction: (基于EWMA方法的阈值预估算法)

OpprenticeFigure7

In online detection, we need to predict cThlds for detecting future data.

Use EWMA to predict the cThld of the i-th week ( or the i-th test set) based on the historical best cThlds. Specially, EWMA works as follows:

If i=1, then cThld_{i}^{p}=cThld_{1}^{p}, which is initialized by 5-fold cross-validation (see below);

If i>1, then cThld_{i}^{p}=\alpha\cdot cThld_{i-1}^{b}+(1-\alpha)\cdot cThld_{i-1}^{p}, where cThld_{i-1}^{b} is the best cThld of the (i-1)-th week, cThld_{i}^{p} is the predicted cThld of the i-th week (and also the one used for detecting the i-th week data), and \alpha\in [0,1] is the smoothing constant.

For the first week, we use 5-fold cross-validation to initialize cThld_{1}^{p}. As \alpha increases, EWMA gives the recent best cThlds more influences in the prediction. We use \alpha=0.8 in this paper.
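下面是这一 EWMA 阈值预测的极简示意(init_cthld 对应论文中用 5 折交叉验证得到的 cThld_{1}^{p},此处作为假设的输入参数,函数名为示意性命名):

def predict_cthld(best_cthlds, alpha=0.8, init_cthld=0.5):
    """按 cThld_i^p = alpha * cThld_{i-1}^b + (1 - alpha) * cThld_{i-1}^p 逐周递推。"""
    pred = init_cthld              # 即 cThld_1^p
    for best in best_cthlds:       # best_cthlds 为历史各周的最优阈值 cThld^b
        pred = alpha * best + (1 - alpha) * pred
    return pred                    # 下一周要使用的预测阈值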

5. Evaluation(系统评估)

在 Opprentice 系统中,红色表示 Opprentice 系统的方法,黑色表示其他额外的方法。

OpprenticeFigure8

Opprentice has 14 detectors with about 9500 lines of Python, R and C++ code. The machine learning block is based on the scikit-learn library.

Random Forest is better than decision trees, logistic regression, linear support vector machines (SVMs), and naive Bayes.

 

Focus: Shedding Light on the High Search Response Time in the Wild

本文作为智能运维系统的探索,这篇论文的标题是《Focus: Shedding Light on the High Search Response Time in the Wild》,来自于清华大学裴丹教授。目标是解决在运维过程中,发现高搜索响应时间之后,使用机器学习算法发现异常的原因和规则。该系统(Focus)使用过2.5个月的数据,并且分析过数十亿的日志。下面将会详细介绍这篇文章的主要内容。

问题描述:

To help search operators debug HSRT (high search response time), Focus is a search log analysis framework to answer three questions:

(1) What is the HSRT condition?

(2) Which HSRT condition types are prevalent across days?

(3) How does each attribute affect SRT in those prevalent HSRT condition types?

解决方案:

Focus has one component for each of the above questions:

(1) A decision tree based classifier to identify HSRT conditions in search logs of each day;

(2) A clustering based condition type miner to combine similar HSRT conditions into one type, and find the prevalent condition types across days; following Occam’s razor principle.

(3) An attribute effect estimator to analyze the effect of each individual attribute of SRT within a prevalent condition type.

基础知识准备:

(A) Search Logs:

For each measured query, its search log records two types of data: SRT and SRT components, Query Attributes.

(1) SRT and SRT components:(特征层)

FocusFigure1

t_{1} is when a query is submitted; t_{2} is when the result HTML file has been downloaded; t_{3} is when a browser finishes parsing the HTML; t_{4} is when the page is completely rendered. SRT is measured by t_{4}-t_{1}, the user-received search response time.

T_{server} is the server response time of the HTML file, which is recorded by servers; T_{net}=t_{2}-t_{1}-T_{server} is the network transmission time of the HTML file; T_{browser}=t_{3}-t_{2} is the browser parsing time of the HTML; T_{other}=t_{4}-t_{3} is the remaining time spent before the page is rendered, e.g. download time of images from image servers.

(2) Query Attributes:(特征层)

The search logs record the following attributes for each measured query:

(i) Browser Engine: Webkit(e.g. Chrome, Safari and 360 Secure Browser), Gecko, Trident LEGC, Trident 4.0, Trident 5.0, and others.

(ii) ISP: China Telecom, China Unicom, China Mobile, China Netcom, China Tietong, and others.

(iii) Location: Based on the client IP, convert IP to its geographic location. In total, there are 32 provinces.

(iv) #Image: the number of embedded images in the result page.

(v) Ads: A result page contains paid advertise links or not.

(vi) Loading Mode: The loading mode of a result page can be either synchronous or asynchronous.

(vii) Background page views: On the service side, the search engine S also post-analyzes the logs and generates the background page views. The background PVs (page views) for a query q is measured by the number of queries served within 30 seconds before and after q is served. It reflects the average search request load when q is served. Due to confidentiality constraints, we normalize specific background PVs by the maximum value.(事后分析,统计出一些必要的特征,输入 Focus 系统的机器学习模型中)

(B) HSRT and HSRT Conditions:(样本层)

FocusFigure2

Usually, we can use the cumulative distribution function (CDF) of SRT in the search logs to determine the high search response time (HSRT) condition. In this paper, we define HSRT as an SRT longer than 1s.

Challenges of Identifying HSRT Conditions: In order to identify HSRT conditions in multi-dimensional search logs.(以下是这个系统的一些难点和挑战点)

(a) Naive Single Dimensional Based Methods: including pair-wise correlation analysis and so on, but is inefficient.

(b) Attributes can be potentially interdependent on each other: that means the Naive Bayes method may not be applicable in this situation.

(c) Need to avoid outputting overlapping conditions: like {#image>30}, {ads=yes}, and {#image>20, ads=yes}.(随着时间的推移,每天使用模型可能会输出类似或者重复的规则)。

关键思想和系统概况

Condition is a combination of attributes and specific values in search logs.

HSRT Condition is a condition that covers at least 1% of total queries, and has a fraction of HSRT larger than the global level:

(# of HSRT queries in a HSRT condition / # of queries in a HSRT condition) > (# of HSRT queries / # of queries). This is the definition used to assign labels, and we can change it in practice.(这只是用来打标签的定义,用于判断什么是HSRT,在实际的应用中,我们可以根据具体的场景采用不同的定义,例如返回码等指标)。

‘Focus’ System Overview:

FocusFigure6

Input: search logs(日志)

(i) Use a decision tree based classifier to identify HSRT conditions  in search logs every day; (每天可以使用决策树模型从日志中提取HSRT条件)。

(ii) Use a clustering based condition type miner to identify condition types of similar HSRT conditions, and find prevalent condition types across days; (用于把类似的条件融合在一起)。

(iii) Use an attribute effect estimator to analyze how an attribute affects SRT and SRT components in each prevalent condition type. (用于判断哪些属性或者特征对这个标签影响更加深远)。

Output: prevalent condition types and their attributes effects on SRT.(第二步输出的条件以及第三步属性的重要性)。

Part (i): Decision Tree Based Classifier, including ID3, C4.5, and CART. It contains five important parts: (1) expressing attribute splits; (2) evaluating splits; (3) stopping tree growing; (4) assigning labels: assign HSRT labels to the leaf nodes whose fraction of HSRT is larger than the global fraction of HSRT; (5) identifying HSRT branching attribute conditions. (这里是 Focus 系统所采用的机器学习算法)。

FocusFigure7

Part (ii): Condition Type Miner: group HSRT conditions according to (1) the same combination of attributes, (2) the same value from each category attribute, and (3) similar interval for each numeric attribute, using Jaccard Index to measure the similarity between intervals. (条件的融合)。

FocusTable1

Part (iii): Attribute Effect Estimator: With each condition type

C=\{c_{1}\wedge c_{2}\cdots \wedge c_{i} \wedge \cdots c_{n}\},

we design a method to understand how each attribute condition c_{i} affects SRT.

For example, what is the HSRT fraction caused by c_{i} in C? What SRT components (e.g. T_{net} and T_{server}) are affected by c_{i}?

Main Idea: flip condition c_{i} to the opposite \overline{c}_{i} to get a variant condition type C_{i}'=\{c_{1}\wedge c_{2}\cdots \wedge \overline{c}_{i} \wedge \cdots c_{n}\}. In the past days, we have the number of HSRT events in total, the number of HSRT events in condition C and the number of HSRT events in condition C_{i}'. As a result, we believe the historical data based comparison can provide a reasonable estimate of the attribute effects. The comparison between C and C_{i}' in these days is based on the specific HSRT conditions of these days. (用于判断哪些属性更能够引起 HSRT)。

FocusTable2

FocusTable3and4

In Table IV, the results are sorted by the variation of the fraction of HSRT in condition types (HSRT% column) caused by flipping an attribute condition.

(i) We highlight the variations greater than zero (getting worse after flipping an attribute condition).

(ii) We focus on the fact that flipping the HSRT branching attribute conditions can yield improvements on HSRT%. For example, the conditions #image>x are all ranked at the top. It means that if we reduce the impact of images on SRT, we can get the highest potential improvement of HSRT%.

(iii) Table III and Table IV are the output of Focus to the operators for these months.

Observations by Further Investigation

Table IV raises some interesting questions:(通过 Focus 输出的 Table IV 可以提出很多其他的问题,也许是人工经验不容易发现的问题)

(1) Why does reducing #images increase T_{server}, the time that servers prepare the result HTML (row 1, 2, 3, 4 of Table IV)?

(2) How do ads inflate SRT? Why do the pages with ads need more T_{net} and T_{brower} (row 7)?

(3) Why does Webkit engine perform better, especially greatly decreasing T_{browser} (row 5, 10, 11, 12)?

(4) It is nature that switching ISPs can affect network transmission time T_{net}, but why does switching to China Telecom reduce T_{server} by over 20% (row 6, 8, 9)?

 

TF-IDF简介

tf-idf,英语的全称叫做 term frequency-inverse document frequency,它是文本挖掘领域的基本技术之一。tf-idf 是一种统计的方法,用来评估一个词语在一份语料库中对于其中一份文件的重要程度。词语的重要性会随着它在该文件中出现的次数而增加,但是也会同时随着它在语料库中其他文件出现的次数而减少。

假设一份语料集合是 D,那么语料集中文件的个数就是 N=|D|,第 j 份文件用 d_{j} 来表示,其中 1\leq j\leq |D|。同时假设在语料库中出现的所有词语的集合是 \{t_{i},1\leq i\leq T\}

(一)词频的定义以及基本性质

从直觉上来说,如果某一个词语在一份文件中重复出现了多次,那么这个词语在这份文件中的重要性就会显著增加。在给定的一份文件 d_{j} 里面,词频(term frequency)指的是语料库中的一个词语 t_{i} 在该文件中出现的频率。词频通常定义为:

tf(t_{i},d_{j}) = \frac{n_{i,j}}{\sum_{k}n_{k,j}},

其中,n_{i,j} 指的是词语 t_{i} 在文件 d_{j} 中的出现次数,而分母 \sum_{k}n_{k,j} 则是文件 d_{j} 中所有的词语的出现次数之和。

备注:如果某个词语在该文件中没有出现过,那么该词语在这个文件中的词频就是零。

词频除了以上的基本定义之外,还有其他的各种形式:

二值表示:tf(t_{i},d_{j}) = \mathcal{I}(t_{i}\in d_{j}),其中 \mathcal{I} 是指示函数,

计数表示:tf(t_{i},d_{j}) = n_{i,j},

概率表示:tf(t_{i},d_{j}) = n_{i,j}/\sum_{k}n_{k,j},

对数表示:tf(t_{i},d_{j}) = 1 + \log(n_{i,j}),

double normalization K表示:tf(t_{i},d_{j}) = K + (1-K)n_{i,j}/\max_{k}n_{k,j},其中 K \in (0,1),或者 K 直接取值为 0.5 即可。

(二)逆向文件频率的定义以及基本性质

除了词频之外,还有逆向文件频率(inverse document frequency)这个概念,它是用来描述一个词语普遍性的指标。通常来说,如果某个词语在绝大多数甚至所有的文件中都出现过,例如一些常见的停用词,那么该词语的重要性就要降低,因为它在语料集中十分普遍。因此,逆向文件频率的定义通常就是:

idf(t_{i},D) = \log(\frac{|D|}{|\{d\in D: t_{i}\in d\}|}),

其中,N=|D| 是语料库中文件的总数,|\{d\in D: t_{i}\in d\}| 表示的是在语料库中包含词语 t_{i} 的文件个数,也就是说 n_{i,j}\neq 0 的文件个数。

备注:如果该词语不在语料库中,会导致分母是零,因此在一般情况下会使用 1 + |\{d\in D: t_{i}\in d\}|

逆向文件频率除了以上的基本定义之外,还有以下几种常见的计算方法:假设 n_{i} = |\{d\in D: t_{i}\in d\}|

唯一性表示:1

逆向文件频率:\log(\frac{|D|}{n_{i}})=-\log(\frac{n_{i}}{|D|})=-\log(p(t_{i}|d)),

光滑的逆向文件频率:\log(1 + \frac{|D|}{n_{i}}),

概率化逆向文件频率:\log(\frac{|D|-n_{i}}{n_{i}}).

(三)TF-IDF 的定义和基本性质

那么,为了描述词语 t_{i} 在文件 d_{j} 中的重要性,tf-idf 的定义就可以写成:

tfidf(t_{i},d_{j})=tf(t_{i},d_{j})\times idf(t_{i},D).

通常来说,tf-idf 倾向于过滤掉常见的词语,而保留重要的词语。

下面:我们来通过一个案例来看 tf-idf 是如何进行计算的。

假设语料集中有两份文档,分别是 Document 1 和 Document 2,出现的词语个数如下表示:

Example

通过这幅图可以直接计算出 “this” 这个词语在各个文件中的重要性程度:

tf("this",d_{1}) = 1/5tf("this, d_{2}) = 1/7idf("this") = 0

因此可以得到 tdidf("this",d_{1}) = tfidf("this,d_{2}) = 0。原因是 “this” 这个词语在两个文件中都出现了,是一个常见的词语。

tf("example",d_{1}) = 0tf("example", d_{2}) = 3/7, idf("example") = \log(2)

因此可以得到 tfidf("example",d_{1}) = 0tfidf("example",d_{2}) = 3\log(2)/7。原因是 “example” 这个词语在第一份文件中没有出现,第二份文件中出现了。

(四)向量空间模型

空间向量模型是把一个文件表示成向量的代数模型,文件与文件之间的相似度使用向量之间的角度来进行比较。

假设语料库中所有词语的个数是 T,第 j 个文件是 d_{j},查询是 q,它们用向量表示就是:

d_{j} = (w_{1,j},w_{2,j},...,w_{T,j})\in \mathbb{R}^{T}

q = (w_{1,q},w_{2,q},...,w_{T,q})\in \mathbb{R}^{T}

每个维度对应了一个相应的词语。如果该词语没有出现在该文件中,那么向量中所对应的位置就是零。在这里,比较经典的一种做法就是选择 tf-idf 权重,也就是说第 j 个文件的向量是按照如下规则选择的,w_{i,j}=tfidf(t_{i},d_{j}), w_{i,q}=tfidf(t_{i},q), 1\leq i\leq T

那么文件 d_{j} 和查询 q 之间的相似度就可以定义为:

sim(d_{j},q)=\frac{d_{j}\cdot q}{||d_{j}||\cdot||q||}.

其中分子指的是两个向量的内积,分母指的是两个向量的欧几里得范数的乘积。

备注:在词组计数模型(Term Count Model)中,也可以简单的考虑词语出现的次数即可:w_{i,j}=tf(t_{i},d_{j})
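文档向量与查询向量之间的余弦相似度可以直接按上式计算(极简示意,假设两个向量均非零):

import numpy as np

def cosine_sim(d, q):
    """计算 sim(d,q) = d·q / (||d||·||q||)。"""
    return float(np.dot(d, q) / (np.linalg.norm(d) * np.linalg.norm(q)))

print(cosine_sim(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # 约 0.707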

 

聚类算法(二)

聚类是一种无监督学习(无监督学习是指事先并不知道要寻找的内容,没有固定的目标变量)的算法,它将相似的一批对象划归到一个簇,簇里面的对象越相似,聚类的效果越好。聚类的目标是在保持簇数目不变的情况下提高簇的质量。给定一个 m 个对象的集合。把这 m 个对象划分到 K 个区域,每个对象只属于一个区域。所遵守的一般准则是:同一个簇的对象尽可能的相互接近或者相关,而不同簇的对象尽可能的区分开。之前已经介绍过 KMeans 算法与 bisecting KMeans 算法,在这里介绍一个流式的聚类算法。

流式聚类算法(One Pass Cluster Algorithm)

既然是流式的聚类算法,那么每个数据只能够进行一次聚类的操作。有一个简单的思路就是对于每一个未知的新样本,如果它与某个类足够相似,那么就把这个未知的新样本加入这个类;否则就自成为一个新的类。如下图所示:

SinglePass 聚类原理

基于以上思路,我们可以设计一个流式的聚类算法。假设 dataMat 是一个 m 行 n 列的一个矩阵,每一行代表一个向量,n 代表向量的维度。i.e. 在 n 维欧几里德空间里面有 m 个点需要进行聚类。

流式聚类的算法细节:

参数的设定:

K 表示在聚类的过程中允许形成的簇的最大个数;

D 表示距离的阈值,在这里两个点之间的距离可以使用 L^{1}, L^{2}, L^{\infty} 范数;

聚簇的质心定义为该类中所有点的平均值。i.e. 如果该类中的点是 A[0],A[1],\cdot\cdot\cdot,A[n-1],那么质心就是 \sum_{i=0}^{n-1}A[i]/n

第 j 个聚簇中元素个数用 num[j] 表示。

方法:

Step 1:对于 dataMat[0] 而言,自成一类。i.e. 质心就是它本身 C[0]=dataMat[0],该聚簇的元素个数就是 num[0]=1,当前所有簇的个数是 K^{'}=1

Step 2:  对于每一个 1\leq i\leq m-1,进行如下的循环操作:

假设当前有 K^{'} 个簇,第 j 个簇的质心是 C[j],第 j 个簇的元素个数是 num[j],其中 0\leq j\leq K^{'}-1

计算 d= \min_{0\leq j\leq K^{'}-1}Distance(dataMat[i],C[j]),其中的 Distance 可以是欧几里德空间的 L^{1},L^{2},L^{\infty} 范数,对应的下标是 j^{'}。i.e. j'=argmin_{0\leq j\leq K'-1}Distance(dataMat[i],C[j])

(i)如果当前 K'=K 或者 d\leq D,则把 dataMat[i] 加入到第 j^{'} 个聚簇。也就是说:

质心更新为 C[j'] \leftarrow (C[j']*num[j'] + dataMat[i])/(num[j'] + 1),

元素的个数更新为 num[j'] \leftarrow num[j'] + 1

(ii)否则,dataMat[i] 需要自成一类。i.e. K^{'} \leftarrow K^{'} + 1num[K'-1]=1C[K'-1]=dataMat[i]
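按照以上步骤,流式聚类可以写成如下的极简实现(这里使用 L^{2} 范数,函数名为示意性命名):

import numpy as np

def one_pass_cluster(dataMat, K, D):
    """单遍流式聚类:K 为允许的最大簇数,D 为距离阈值,返回各簇质心与每个点的簇标号。"""
    centers = [np.array(dataMat[0], dtype=float)]   # Step 1:第一个点自成一类
    nums = [1]                                      # 各簇的元素个数
    labels = [0]
    for x in dataMat[1:]:                           # Step 2:逐点处理
        x = np.array(x, dtype=float)
        dists = [np.linalg.norm(x - c) for c in centers]
        j = int(np.argmin(dists))
        if len(centers) == K or dists[j] <= D:
            # (i) 加入最近的簇,并增量更新质心
            centers[j] = (centers[j] * nums[j] + x) / (nums[j] + 1)
            nums[j] += 1
            labels.append(j)
        else:
            # (ii) 自成一个新簇
            centers.append(x)
            nums.append(1)
            labels.append(len(centers) - 1)
    return centers, labels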

注:

(1)数据的先后顺序对聚类的效果有一定的影响。

(2)该算法的时间复杂度与簇的个数是相关的,通常来说,如果形成的簇越多,对于一个新的样本与质心的比较次数就会越多。

(3)如果簇的最大个数 K 设置得比较小,那么对于一个新的样本,它需要比较的次数就会大量减少;只要最小距离小于 D,或者已经达到最大簇个数,就把新的样本放入某个类,此时可能出现新的样本导致该簇质心偏移的情况。如果 K 设置得很大,那么计算的时间复杂度就会增加。因此,在保证计算实时性的同时,需要考虑 K 的设置问题。

效果展示:

使用一批二维数据集合,进行流式聚类算法可以得到下图:’+’ 表示的是质心的位置,其余就是数据点的位置。通过选择不同的距离阈值 D,就可以得到不同的聚类情况。其中 ‘+’ 表示该簇质心的位置,蓝色的点表示数据点。如下图所示:

流式聚类算法1流式聚类算法2

流式聚类算法3

UCL course – 2016

GOOD

Hado van Hasselt

Together with Joseph Modayil, this year I am teaching the part on reinforcement learning of the Advanced Topics in Machine Learning course at UCL.

Lectures

Note that there will be two lectures about AlphaGo on March 24.  We will talk about AlphaGo in the context of the whole course at the normal place and time (9:15am in Roberts 412), and in addition David Silver will give a seminar that afternoon.  Neither of these will be required for the exam.

  1. Introduction to reinforcement learning updated January 14 (Lecture: January 14)
  2. Exploration and Exploitation updated January 21 (Lecture: January 21)
  3. Markov decision processes updated January 27 (Lecture: January 28)
  4. Dynamic programming updated February 3 (Lecture: February 4)
  5. Learning to predict updated February 10 (Lecture: February 11)
  6. Learning to control updated March 16 (Lecture: February 25)
  7. Value function approximation updated March 2 (Lecture: March 3)
  8. Policy-gradient algorithms updated March 9 (Lecture:…


用强化学习玩文本游戏

随着 DeepMind 成功地使用卷积神经网络(CNN)和强化学习来玩 Atari 游戏,AlphaGo 击败围棋职业选手李世石,强化学习已经成为了机器学习的一个重要研究方向。除此之外,随着人工智能的兴起,自然语言处理在聊天机器人和智能问答客服上也有着广泛的应用。之前在一篇博客里面曾经介绍了强化学习的基本概念,今天要介绍的是强化学习在文本领域的应用,也就是如何使用强化学习来玩文本游戏。要分享的 Paper 是 Deep Reinforcement Learning with a Natural Language Action Space,作者是 Microsoft 的 Ji He 与他的合作者们。

对于强化学习而言,那就不得不提到 Markov 决策过程(Markov Decision Process)。它由状态(State)、动作(Action)、状态转移概率(State Transition Probability)、折扣因子(Discount Factor)、奖励函数(Reward Function)五个部分构成。强化学习做的事情是:agent 在某一个时刻处于某个状态 s,执行了某个动作 a,从整个环境中获得了奖励 r,再根据新的状态和奖励继续选择下一个动作,目标是让获得的累积奖励值最大。整个过程其实是一个不断地从环境中执行动作和获得奖励的过程;通过引入动作值函数 Q(s,a) 的概念,可以得到 Q-learning 这个基本算法,用 Q-learning 来让 agent 获得最大的奖励。

在实际的生产环境中,状态空间 S 很可能是十分巨大的,如果对于 Atari 游戏的话,动作空间 A 是有限的(例如上下左右移动,攻击,躲避等)。因此 DeepMind 在处理这个问题的时候,创新性地使用了卷积神经网络(CNN)和强化学习(RL)两者结合的解决方案。通过 CNN 来读取游戏图像,然后神经网络输出的是动作值函数 Q(s,a),其中 a 就是游戏手柄上的几个动作按钮。然后使用周围环境的反馈和强化学习方法来获得相应的样本,从而训练整个 CNN 神经网络。

下面来介绍本文的正式内容。首先文本游戏和视觉游戏有一定的差别,视觉游戏的状态就是当前的屏幕图像,文本游戏的状态是一段文本描述,然后玩家来给出一个合适的动作进入下一个状态。例如:白色的文字描述就是当前的状态,蓝色的文字就是玩家要选择的动作。

Description of Text Games

当玩家选择了其中一个状态(例如选择了第一个 A Lister sandwich)之后,就会进入下一个状态,如图所示。

Description of Text Games2

注:关于文本游戏 Machine of Death 的代码和基本信息,可以参见 https://github.com/jvking/text-games.

为了做一个 agent 使其能够自主地玩文本游戏,基于当前的文本游戏背景,就要考虑到两种文本情况,分别是状态文本(state-text)和动作文本(action-text)。agent 需要同时理解这两部分的文本,然后争取在玩游戏的过程中获得最大的收益,最终才能够学会玩游戏。对于每一个时刻 t,状态是 s_{t},动作是 a_{t}。该 agent 的策略定义为在状态 s_{t} 执行动作 a_{t} 的概率 \pi(a_{t}|s_{t})。如果折扣因子是 \gamma,那么基于策略 \pi 的动作值函数就定义为:

Q^{\pi}(s,a)=E[\sum_{i=0}^{\infty}\gamma^{i}r_{t+i}|s_{t}=s, a_{t}=a].

在该paper中,作者使用 softmax 策略来进行必要的探索活动,也就是说在状态 s_{t} 选择 a_{t} 的概率是

\pi(a_{t}=a_{t}^{i}|s_{t}) = \exp(\alpha\cdot Q(s_{t},a_{t}^{i}))/\sum_{j=1}^{|A_{t}|}\exp(\alpha\cdot Q(s_{t},a_{t}^{j})),

在这里 A_{t} 是在状态 s_{t} 可以执行的所有动作的集合,a_{t}^{i}A_{t} 的第 i 个动作,\alpha>0 是某个常数。

因为状态文本(state-text)通常来说是长文本,动作文本(action-text)是短文本,两者有着本质的区别,因此作者在构建神经网络的时候分别对状态文本和动作文本搭建了相应的神经网络。在本文中,作者对比了三种神经网络结构,分别是 Max-action DQN,Pre-action DQN 和 DRRN。如下图所示:

DRRN

对于 Max-action DQN,该模型适用于对每一个状态 s_{t} 中动作空间的数量都是已知的情况,因为它是把状态和动作拼成一个向量作为神经网络的输入的,然后计算出每一个动作的 Q 值。

对于 Pre-action DQN,该模型把每一对状态文本和动作文本当作输入,也就是(状态文本,动作文本)这种形式,通过神经网络计算出这一对状态文本和动作文本的 Q 值。

对于 DRRN,该模型把每一对状态文本和动作文本当作输入,依旧是(状态文本,动作文本)这种形式。然后对状态文本和动作文本分别构建一个神经网络,然后对输出的两个向量进行内积的操作,从而获得 Q 值。用数学语言来描述就是,给定一个状态文本和动作文本的对,也就是 (s_{t},a_{t}^{i}),使用相应的神经网络同时把 s_{t}a_{t}^{i} 映射成向量 h_{\ell,s}h_{\ell,a_{t}^{i}},然后计算这两个向量的内积 h_{\ell,s}\cdot h_{\ell,a_{t}^{i}},最后选择 a_{t} = argmax_{a_{t}^{i}}Q(s_{t},a_{t}^{i}) 作为当前状态的动作。其中,h_{\ell,s}h_{\ell, a} 分别指的是状态网络和动作网络的第 \ell 个隐藏层神经网络的输出。W_{\ell,s}, W_{\ell,a}b_{\ell,s}, b_{\ell,a} 是在第 (\ell-1) 层和第 \ell 层的权重矩阵和偏移量。也就是说,DRRN 拥有 L 个隐藏层,并且

h_{1,s} = f(W_{1,s}s_{t}+b_{1,s}),

h_{1,a}^{i} = f(W_{1,a}a_{t}^{i}+b_{1,a}),

h_{\ell,s} = f(W_{\ell,s}h_{\ell-1,s}+b_{\ell,s}),

h_{\ell,a}^{i} = f(W_{\ell,a}h_{\ell-1,a}^{i}+b_{\ell,a}),

其中 f 是激活函数,1\leq \ell \leq L。Q 值函数定义为 Q(s,a^{i};\Theta) = g(h_{L,s},h_{L,a}^{i}),其中 g 可以定义为两个向量的内积。
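DRRN 前向计算可以用如下的极简示意来理解(假设激活函数 f 取 ReLU、g 取内积;Ws/bs 与 Wa/ba 分别为状态网络和动作网络各层的权重与偏置,均为示意性命名):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def drrn_q_values(s, actions, Ws, bs, Wa, ba):
    """分别用状态网络和动作网络做前向计算,Q 值取两个输出向量的内积。"""
    h_s = s
    for W, b in zip(Ws, bs):          # 状态网络的各隐藏层
        h_s = relu(W @ h_s + b)
    q_values = []
    for a in actions:                 # 对每个候选动作文本分别计算
        h_a = a
        for W, b in zip(Wa, ba):      # 动作网络的各隐藏层
            h_a = relu(W @ h_a + b)
        q_values.append(float(h_s @ h_a))   # g(h_{L,s}, h_{L,a}) 取内积
    return q_values                   # 可在其上做 softmax 探索或取 argmax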

综上所述,DRRN 的伪代码如下:

DRRN Code

DRRN 相比另外两个模型的创新点在于分别使用了两个网络来映射状态文本和动作文本:如果将长文本和短文本直接拼接后输入单个神经网络,可能会降低 Q 值的质量;而把 state-text 和 action-text 分别放入不同的网络结构进行学习,最后用内积合并的方式获得 Q 值,效果会更好。

参考文献:

  1. Deep Reinforcement Learning with a Natural Language Action Space

《大数据智能》第2章:知识图谱

http://blog.sina.com.cn/s/blog_574a437f0102w2bk.html

第2章:知识图谱——机器大脑中的知识库

2.1 什么是知识图谱

在互联网时代,搜索引擎是人们在线获取信息和知识的重要工具。当用户输入一个查询词,搜索引擎会返回它认为与这个关键词最相关的网页。从诞生之日起,搜索引擎就是这样的模式。

直到2012年5月,搜索引擎巨头谷歌在它的搜索页面中首次引入“知识图谱”:用户除了得到搜索网页链接外,还将看到与查询词有关的更加智能化的答案。如图2.1所示,当用户输入“Marie Curie”(玛丽·居里)这个查询词,谷歌会在右侧提供了居里夫人的详细信息,如个人简介、出生地点、生卒年月等,甚至还包括一些与居里夫人有关的历史人物,例如爱因斯坦、皮埃尔·居里(居里夫人的丈夫)等。

从杂乱的网页到结构化的实体知识,搜索引擎利用知识图谱能够为用户提供更具条理的信息,甚至顺着知识图谱可以探索更深入、广泛和完整的知识体系,让用户发现他们意想不到的知识。谷歌高级副总裁艾米特·辛格博士一语道破知识图谱的重要意义所在:“构成这个世界的是实体,而非字符串(things, not strings)”。

图2.1  谷歌搜索引擎的知识图谱

谷歌知识图谱一出激起千层浪,美国的微软必应,中国的百度、搜狗等搜索引擎公司在短短的一年内纷纷宣布了各自的“知识图谱”产品,如百度“知心”、搜狗“知立方”等。为什么这些搜索引擎巨头纷纷跟进知识图谱,在这上面一掷千金,甚至把它视为搜索引擎的未来呢?这就需要从传统搜索引擎的原理讲起。以百度为例,在过去当我们想知道“泰山”的相关信息的时候,我们会在百度上搜索“泰山”,它会尝试将这个字符串与百度抓取的大规模网页做比对,根据网页与这个查询词的相关程度,以及网页本身的重要性,对网页进行排序,作为搜索结果返回给用户。而用户所需的与“泰山”相关的信息,就还要他们自己动手,去访问这些网页来找了。

当然,与搜索引擎出现之前相比,随着网络信息的爆炸式增长,搜索引擎由于大大缩小了用户查找信息的范围,日益成为人们遨游信息海洋的不可或缺的工具。但是,传统搜索引擎的工作方式表明,它只是机械地比对查询词和网页之间的匹配关系,并没有真正理解用户要查询的到底是什么,远远不够“聪明”,当然经常会被用户嫌弃了。

而知识图谱则会将“泰山”理解为一个“实体”(entity),也就是一个现实世界中的事物。这样,搜索引擎会在搜索结果的右侧显示它的基本资料,例如地理位置、海拔高度、别名,以及百科链接等,此外甚至还会告诉你一些相关的“实体”,如嵩山、华山、衡山和恒山等其他三山五岳等。当然,用户输入的查询词并不见得只对应一个实体,例如当在谷歌中查询“apple”(苹果)时,谷歌不止展示IT巨头“Apple-Corporation”(苹果公司)的相关信息,还会在其下方列出“apple-plant”(苹果-植物)的另外一种实体的信息。

很明显,以谷歌为代表的搜索引擎公司希望利用知识图谱为查询词赋予丰富的语义信息,建立与现实世界实体的关系,从而帮助用户更快找到所需的信息。谷歌知识图谱不仅从Freebase和维基百科等知识库中获取专业信息,同时还通过分析大规模网页内容抽取知识。现在谷歌的这幅知识图谱已经将5亿个实体编织其中,建立了35亿个属性和相互关系,并还在不断高速扩充。

谷歌知识图谱正在不断融入其各大产品中服务广大用户。最近,谷歌在Google Play Store的Google Play Movies & TV应用中添加了一个新的功能,当用户使用安卓系统观看视频时,暂停播放,视频旁边就会自动弹出该屏幕上人物或者配乐的信息,如图2.2所示。这些信息就是来自谷歌知识图谱。谷歌会圈出播放器窗口所有人物的脸部,用户可以点击每一个人物的脸来查看相关信息。此前,Google Books 已经应用此功能。

图2.2  Google利用知识图谱标示视频中的人物或配乐信息

2.2  知识图谱的构建

最初,知识图谱是由谷歌推出的产品名称,寓意与Facebook提出的社交图谱(Social Graph)异曲同工。由于其表意形象,现在知识图谱已经被用来泛指各种大规模知识库了。

我们应当如何构建知识图谱呢?我们先了解一下,知识图谱的数据来源都有哪些。知识图谱的最重要的数据来源之一是以维基百科、百度百科为代表的大规模知识库,在这些由网民协同编辑构建的知识库中,包含了大量结构化的知识,可以高效地转化到知识图谱中。此外,互联网的海量网页中也蕴藏了海量知识,虽然相对知识库而言这些知识更显杂乱,但通过自动化技术,也可以将其抽取出来构建知识图谱。接下来,我们分别详细介绍这些知识图谱的数据来源。

2.2.1  大规模知识库

大规模知识库以词条作为基本组织单位,每个词条对应现实世界的某个概念,由世界各地的编辑者义务协同编纂内容。随着互联网的普及和Web 2.0理念深入人心,这类协同构建的知识库,无论是数量、质量还是更新速度,都早已超越传统由专家编辑的百科全书,成为人们获取知识的主要来源之一。目前,维基百科已经收录了超过2200万词条,而仅英文版就收录了超过400万条,远超过英文百科全书中最权威的大英百科全书的50万条,是全球浏览人数排名第6的网站。值得一提的是,2012年大英百科全书宣布停止印刷版发行,全面转向电子化。这也从一个侧面说明在线大规模知识库的影响力。人们在知识库中贡献了大量结构化的知识。如图2.3所示,是维基百科关于“清华大学”的词条内容。可以看到,在右侧有一个列表,标注了与清华有关的各类重要信息,如校训、创建时间、校庆日、学校类型、校长,等等。在维基百科中,这个列表被称为信息框(infobox),是由编辑者们共同编辑而成的。信息框中的结构化信息是知识图谱的直接数据来源。

除了维基百科等大规模在线百科外,各大搜索引擎公司和机构还维护和发布了其他各类大规模知识库,例如谷歌收购的Freebase,包含3900万个实体和18亿条实体关系;DBpedia是德国莱比锡大学等机构发起的项目,从维基百科中抽取实体关系,包括1千万个实体和14亿条实体关系;YAGO则是德国马克斯·普朗克研究所发起的项目,也是从维基百科和WordNet等知识库中抽取实体,到2010年该项目已包含1千万个实体和1.2亿条实体关系。此外,在众多专门领域还有领域专家整理的领域知识库。

图2.3  维基百科词条“清华大学”部分内容

2.2.2  互联网链接数据

国际万维网组织W3C在2007年发起了开放互联数据项目(Linked Open Data,LOD),其发布数据集示意图如图2.4所示。该项目旨在将由互联文档组成的万维网(Web of documents)扩展成由互联数据组成的知识空间(Web of data)。LOD以RDF(Resource Description Framework)形式在Web上发布各种开放数据集,RDF是一种描述结构化知识的框架,它将实体间的关系表示为(实体1,关系,实体2)的三元组。LOD还允许在不同来源的数据项之间设置RDF链接,实现语义Web知识库。目前世界各机构已经基于LOD标准发布了数千个数据集,包含数千亿RDF三元组。随着LOD项目的推广和发展,互联网会有越来越多的信息以链接数据形式发布,然而各机构发布的链接数据之间存在严重的异构和冗余等问题,如何实现多数据源的知识融合,是LOD项目面临的重要问题。

图2.4  开放互联数据项目发布数据集示意图

2.2.3  互联网网页文本数据

与整个互联网相比,维基百科等知识库仍只能算沧海一粟。因此,人们还需要从海量互联网网页中直接抽取知识。与上述知识库的构建方式不同,很多研究者致力于直接从无结构的互联网网页中抽取结构化信息,如华盛顿大学Oren Etzioni教授主导的“开放信息抽取”(open information extraction,OpenIE)项目,以及卡耐基梅隆大学Tom Mitchell教授主导的“永不停止的语言学习”(never-ending language learning,NELL)项目。OpenIE项目所开发的演示系统TextRunner已经从1亿个网页中抽取出了5亿条事实,而NELL项目也从Web中学习抽取了超过5千万条事实样例,如图2.5所示。

图2.5  NELL从Web中学习抽取事实样例

显而易见,与从维基百科中抽取的知识库相比,开放信息抽取从无结构网页中抽取的信息准确率还很低,其主要原因在于网页形式多样,噪声信息较多,信息可信度较低。因此,也有一些研究者尝试限制抽取的范围,例如只从网页表格等内容中抽取结构信息,并利用互联网的多个来源互相印证,从而大大提高抽取信息的可信度和准确率。当然这种做法也会大大降低抽取信息的覆盖面。天下没有免费的午餐,在大数据时代,我们需要在规模和质量之间寻找一个最佳的平衡点。

2.2.4  多数据源的知识融合

从以上数据来源进行知识图谱构建并非孤立地进行。在商用知识图谱构建过程中,需要实现多数据源的知识融合。以谷歌最新发布的Knowledge Vault(Dong, et al. 2014)技术为例,其知识图谱的数据来源包括了文本、DOM Trees、HTML表格、RDF语义数据等多个来源。多来源数据的融合,能够更有效地判定抽取知识的可信性。

知识融合主要包括实体融合、关系融合和实例融合三类。对于实体,人名、地名、机构名往往有多个名称。例如“中国移动通信集团公司”有“中国移动”、“中移动”、“移动通信”等名称。我们需要将这些不同名称规约到同一个实体下。同一个实体在不同语言、不同国家和地区往往会有不同命名,例如著名足球明星Beckham在大陆汉语中称作“贝克汉姆”,在香港译作“碧咸”,而在台湾则被称为“贝克汉”。与此对应的,同一个名字在不同语境下可能会对应不同实体,这是典型的一词多义问题,例如“苹果”有时是指一种水果,有时则指的是一家著名IT公司。在这样复杂的多对多对应关系中,如何实现实体融合是非常复杂而重要的课题。如前面开放信息抽取所述,同一种关系可能会有不同的命名,这种现象在不同数据源中抽取出的关系中尤其显著。与实体融合类似,关系融合对于知识融合至关重要。在实现了实体和关系融合之后,我们就可以实现三元组实例的融合。不同数据源会抽取出相同的三元组,并给出不同的评分。根据这些评分,以及不同数据源的可信度,我们就可以实现三元组实例的融合与抽取。

知识融合既有重要的研究挑战,又需要丰富的工程经验。知识融合是实现大规模知识图谱的必由之路。知识融合的好坏,往往决定了知识图谱项目的成功与否,值得任何有志于大规模知识图谱构建与应用的人士高度重视。

2.3  知识图谱的典型应用

知识图谱将搜索引擎从字符串匹配推进到实体层面,可以极大地改进搜索效率和效果,为下一代搜索引擎的形态提供了巨大的想象空间。知识图谱的应用前景远不止于此,目前知识图谱已经被广泛应用于以下几个任务中。

2.3.1  查询理解(Query Understanding)

谷歌等搜索引擎巨头之所以致力于构建大规模知识图谱,其重要目标之一就是能够更好地理解用户输入的查询词。用户查询词是典型的短文本(short text),一个查询词往往仅由几个关键词构成。传统的关键词匹配技术没有理解查询词背后的语义信息,查询效果可能会很差。

例如,对于查询词“李娜大满贯”,如果仅用关键词匹配的方式,搜索引擎根本不懂用户到底希望寻找哪个“李娜”,而只会机械地返回所有含有“李娜”这个关键词的网页。但通过利用知识图谱识别查询词中的实体及其属性,搜索引擎将能够更好地理解用户搜索意图。现在,我们到谷歌中查询“李娜大满贯”,会发现,首先谷歌会利用知识图谱在页面右侧呈现中国网球运动员李娜的基本信息,我们可以知道这个李娜是指中国网球女运动员。同时,谷歌不仅像传统搜索引擎那样返回匹配的网页,更会直接在页面最顶端返回李娜赢得大满贯的次数“2”,如图2.6所示。

图2.6  谷歌中对“李娜大满贯”的查询结果

主流商用搜索引擎基本都支持这种直接返回查询结果而非网页的功能,这背后都离不开大规模知识图谱的支持。以百度为例,图2.7是百度中对“珠穆朗玛峰高度”的查询结果,百度直接告诉用户珠穆朗玛峰的高度是8844.43米。

图2.7  百度中对“珠穆朗玛峰高度”的查询结果

基于知识图谱,搜索引擎还能获得简单的推理能力。例如,图2.8是百度中对“梁启超的儿子的妻子”的查询结果,百度能够利用知识图谱知道梁启超的儿子是梁思成,梁思成的妻子是林徽因等人。

采用知识图谱理解查询意图,不仅可以返回更符合用户需求的查询结果,还能更好地匹配商业广告信息,提高广告点击率,增加搜索引擎受益。因此,知识图谱对搜索引擎公司而言,是一举多得的重要资源和技术。

2.3.2  自动问答(Question Answering)

人们一直在探索比关键词查询更高效的互联网搜索方式。很多学者预测,下一代搜索引擎将能够直接回答人们提出的问题,这种形式被称为自动问答。例如著名计算机学者、美国华盛顿大学计算机科学与工程系教授、图灵中心主任Oren Etzioni于2011年就在Nature杂志上发表文章“搜索需要一场变革“(Search Needs a Shake-Up)。该文指出,一个可以理解用户问题,从网络信息中抽取事实,并最终选出一个合适答案的搜索引擎,才能将我们带到信息获取的制高点。如上节所述,目前搜索引擎已经支持对很多查询直接返回精确答案而非海量网页而已。

关于自动问答,我们将有专门的章节介绍。这里,我们需要着重指出的是,知识图谱的重要应用之一就是作为自动问答的知识库。在搜狗推出中文知识图谱服务“知立方”的时候,曾经以回答“梁启超的儿子的太太的情人的父亲是谁?”这种近似脑筋急转弯似的问题作为案例,来展示其知识图谱的强大推理能力(搜狗知立方服务的实例如图2.9所示)。虽然大部分用户不会这样拐弯抹角地提问,但人们会经常需要寻找诸如“刘德华的妻子是谁?”、“侏罗纪公园的主演是谁?”、“姚明的身高?”以及“北京有几个区?”等问题的答案。而这些问题都需要利用知识图谱中实体的复杂关系推理得到。无论是理解用户查询意图,还是探索新的搜索形式,都毫无例外地需要进行语义理解和知识推理,而这都需要大规模、结构化的知识图谱的有力支持,因此知识图谱成为各大互联网公司的必争之地。

图2.9  搜狗知立方服务

最近,微软联合创始人Paul Allen投资创建了艾伦人工智能研究院(Allen Institute for Artificial Intelligence),致力于建立具有学习、推理和阅读能力的智能系统。2013年底,Paul Allen任命Oren Etzioni教授担任艾伦人工智能研究院的执行主任,该任命所释放的信号颇值得我们思考。

2.3.3  文档表示(Document Representation)

经典的文档表示方案是空间向量模型(Vector Space Model),该模型将文档表示为词汇的向量,而且采用了词袋(Bag-of-Words,BOW)假设,不考虑文档中词汇的顺序信息。这种文档表示方案与上述的基于关键词匹配的搜索方案相匹配,由于其表示简单,效率较高,是目前主流搜索引擎所采用的技术。文档表示是自然语言处理很多任务的基础,如文档分类、文档摘要、关键词抽取,等等。

经典文档表示方案已经在实际应用中暴露出很多固有的严重缺陷,例如无法考虑词汇之间的复杂语义关系,无法处理对短文本(如查询词)的稀疏问题。人们一直在尝试解决这些问题,而知识图谱的出现和发展,为文档表示带来新的希望,那就是基于知识的文档表示方案。一篇文章不再只是由一组代表词汇的字符串来表示,而是由文章中的实体及其复杂语义关系来表示(Schuhmacher, et al. 2014)。该文档表示方案实现了对文档的深度语义表示,为文档深度理解打下基础。一种最简单的基于知识图谱的文档表示方案,可以将文档表示为知识图谱的一个子图(sub-graph),即用该文档中出现或涉及的实体及其关系所构成的图表示该文档。这种知识图谱的子图比词汇向量拥有更丰富的表示空间,也为文档分类、文档摘要和关键词抽取等应用提供了更丰富的可供计算和比较的信息。

知识图谱为计算机智能信息处理提供了巨大的知识储备和支持,将让现在的技术从基于字符串匹配的层次提升至知识理解层次。以上介绍的几个应用可以说只能窥豹一斑。知识图谱的构建与应用是一个庞大的系统工程,其所蕴藏的潜力和可能的应用,将伴随着相关技术的日渐成熟而不断涌现。

2.4  知识图谱的主要技术

大规模知识图谱的构建与应用需要多种智能信息处理技术的支持,以下简单介绍其中若干主要技术。

2.4.1  实体链指(Entity Linking)

互联网网页,如新闻、博客等内容里涉及大量实体。大部分网页本身并没有关于这些实体的相关说明和背景介绍。为了帮助人们更好地了解网页内容,很多网站或作者会把网页中出现的实体链接到相应的知识库词条上,为读者提供更详尽的背景材料。这种做法实际上将互联网网页与实体之间建立了链接关系,因此被称为实体链指。

手工建立实体链接关系非常费力,因此如何让计算机自动实现实体链指,成为知识图谱得到大规模应用的重要技术前提。例如,谷歌等在搜索引擎结果页面呈现知识图谱时,需要该技术自动识别用户输入查询词中的实体并链接到知识图谱的相应节点上。

实体链指的主要任务有两个,实体识别(Entity Recognition)与实体消歧(Entity Disambiguation),都是自然语言处理领域的经典问题。

实体识别旨在从文本中发现命名实体,最典型的包括人名、地名、机构名等三类实体。近年来,人们开始尝试识别更丰富的实体类型,如电影名、产品名,等等。此外,由于知识图谱不仅涉及实体,还有大量概念(concept),因此也有研究者提出对这些概念进行识别。

不同环境下的同一个实体名称可能会对应不同实体,例如“苹果”可能指某种水果,某个著名IT公司,也可能是一部电影。这种一词多义或者歧义问题普遍存在于自然语言中。将文档中出现的名字链接到特定实体上,就是一个消歧的过程。消歧的基本思想是充分利用名字出现的上下文,分析不同实体可能出现在该处的概率。例如某个文档如果出现了iphone,那么“苹果”就有更高的概率指向知识图谱中的叫“苹果”的IT公司。

实体链指并不局限于文本与实体之间,如图2.10所示,还可以包括图像、社交媒体等数据与实体之间的关联。可以看到,实体链指是知识图谱构建与应用的基础核心技术。

图2.10  实体链指实现实体与文本、图像、社交媒体等数据的关联

2.4.2  关系抽取(Relation Extraction)

构建知识图谱的重要来源之一是从互联网网页文本中抽取实体关系。关系抽取是一种典型的信息抽取任务。

典型的开放信息抽取方法采用自举(bootstrapping)的思想,按照“模板生成=>实例抽取”的流程不断迭代直至收敛。例如,最初可以通过“X是Y的首都”模板抽取出(中国,首都,北京)、(美国,首都,华盛顿)等三元组实例;然后根据这些三元组中的实体对“中国-北京”和“美国-华盛顿”可以发现更多的匹配模板,如“Y的首都是X”、“X是Y的政治中心”等等;进而用新发现的模板抽取更多新的三元组实例,通过反复迭代不断抽取新的实例与模板。这种方法直观有效,但也面临很多挑战性问题,如在扩展过程中很容易引入噪声实例与模板,出现语义漂移现象,降低抽取准确率。研究者针对这一问题提出了很多解决方案:提出同时扩展多个互斥类别的知识,例如同时扩展人物、地点和机构,要求一个实体只能属于一个类别;也有研究提出引入负实例来限制语义漂移。

我们还可以通过识别表达语义关系的短语来抽取实体间关系。例如,我们通过句法分析,可以从文本中发现“华为”与“深圳”的如下关系:(华为,总部位于,深圳)、(华为,总部设置于,深圳)、以及(华为,将其总部建于,深圳)。通过这种方法抽取出的实体间关系非常丰富而自由,一般是一个以动词为核心的短语。该方法的优点是,我们无需预先人工定义关系的种类,但这种自由度带来的代价是,关系语义没有归一化,同一种关系可能会有多种不同的表示。例如,上述发现的“总部位于”、“总部设置于”以及“将其总部建于”等三个关系实际上是同一种关系。如何对这些自动发现的关系进行聚类归约是一个挑战性问题。

我们还可以将所有关系看做分类标签,把关系抽取转换为对实体对的关系分类问题。这种关系抽取方案的主要挑战在于缺乏标注语料。2009年斯坦福大学的研究者提出远程监督(Distant Supervision)思想,使用知识图谱中已有的三元组实例启发式地标注训练语料。远程监督思想的假设是,每个同时包含两个实体的句子,都表述了这两个实体在知识库中的对应关系。例如,根据知识图谱中的三元组实例(苹果,创始人,乔布斯)和(苹果,CEO,库克),我们可以将以下四个包含对应实体对的句子分别标注为包含“创始人”和“CEO”关系:

我们将知识图谱三元组中每个实体对看做待分类样例,将知识图谱中实体对关系看做分类标签。通过从出现该实体对的所有句子中抽取特征,我们可以利用机器学习分类模型(如最大熵分类器、SVM等)构建信息抽取系统。对于任何新的实体对,根据所出现该实体对的句子中抽取的特征,我们就可以利用该信息抽取系统自动判断其关系。远程监督能够根据知识图谱自动构建大规模标注语料库,因此取得了瞩目的信息抽取效果。

与自举思想面临的挑战类似,远程监督方法会引入大量噪声训练样例,严重损害模型准确率。例如,对于(苹果,创始人,乔布斯)我们可以从文本中匹配以下四个句子:

在这四个句子中,前两个句子的确表明苹果与乔布斯之间的创始人关系;但是,后两个句子则并没有表达这样的关系。很明显,由于远程监督只能机械地匹配出现实体对的句子,因此会大量引入错误训练样例。为了解决这个问题,人们提出了很多去除噪声实例的办法,来提升远程监督性能。例如,研究发现,一个正确训练实例往往位于语义一致的区域,也就是其周边的实例应当拥有相同的关系;也有研究提出利用因子图、矩阵分解等方法,建立数据内部的关联关系,有效实现降低噪声的目标。

关系抽取是知识图谱构建的核心技术,它决定了知识图谱中知识的规模和质量。关系抽取是知识图谱研究的热点问题,还有很多挑战性问题需要解决,包括提升从高噪声的互联网数据中抽取关系的鲁棒性,扩大抽取关系的类型与抽取知识的覆盖面,等等。

2.4.3  知识推理(Knowledge Reasoning)

推理能力是人类智能的重要特征,能够从已有知识中发现隐含知识。推理往往需要相关规则的支持,例如从“配偶”+“男性”推理出“丈夫”,从“妻子的父亲”推理出“岳父”,从出生日期和当前时间推理出年龄,等等。

这些规则可以通过人们手动总结构建,但往往费时费力,人们也很难穷举复杂关系图谱中的所有推理规则。因此,很多人研究如何自动挖掘相关推理规则或模式。目前主要依赖关系之间的同现情况,利用关联挖掘技术来自动发现推理规则。

实体关系之间存在丰富的同现信息。如图2.11所示,在康熙、雍正和乾隆三个人物之间,我们有(康熙,父亲,雍正)、(雍正,父亲,乾隆)以及(康熙,祖父,乾隆)三个实例。根据大量类似的实体X、Y、Z间出现的(X,父亲,Y)、(Y,父亲,Z)以及(X,祖父,Z)实例,我们可以统计出“父亲+父亲=>祖父”的推理规则。类似地,我们还可以根据大量(X,首都,Y)和(X,位于,Y)实例统计出“首都=>位于”的推理规则,根据大量(X,总统,美国)和(X,是,美国人)统计出“美国总统=>是美国人”的推理规则。

图2.11  知识推理举例

知识推理可以用于发现实体间新的关系。例如,根据“父亲+父亲=>祖父”的推理规则,如果两实体间存在“父亲+父亲”的关系路径,我们就可以推理它们之间存在“祖父”的关系。利用推理规则实现关系抽取的经典方法是Path Ranking Algorithm(Lao &Cohen2010),该方法将每种不同的关系路径作为一维特征,通过在知识图谱中统计大量的关系路径构建关系分类的特征向量,建立关系分类器进行关系抽取,取得不错的抽取效果,成为近年来的关系抽取的代表方法之一。但这种基于关系的同现统计的方法,面临严重的数据稀疏问题。

在知识推理方面还有很多的探索工作,例如采用谓词逻辑(Predicate Logic)等形式化方法和马尔科夫逻辑网络(Markov Logic Network)等建模工具进行知识推理研究。目前来看,这方面研究仍处于百家争鸣阶段,大家在推理表示等诸多方面仍未达成共识,未来路径有待进一步探索。

2.4.4  知识表示(Knowledge Representation)

在计算机中如何对知识图谱进行表示与存储,是知识图谱构建与应用的重要课题。

如“知识图谱”字面所表示的含义,人们往往将知识图谱作为复杂网络进行存储,这个网络的每个节点带有实体标签,而每条边带有关系标签。基于这种网络的表示方案,知识图谱的相关应用任务往往需要借助于图算法来完成。例如,当我们尝试计算两实体之间的语义相关度时,我们可以通过它们在网络中的最短路径长度来衡量,两个实体距离越近,则越相关。而面向“梁启超的儿子的妻子”这样的推理查询问题时,则可以从“梁启超”节点出发,通过寻找特定的关系路径“梁启超->儿子->妻子->?”,来找到答案。

然而,这种基于网络的表示方法面临很多困难。首先,该表示方法面临严重的数据稀疏问题,对于那些对外连接较少的实体,一些图方法可能束手无策或效果不佳。此外,图算法往往计算复杂度较高,无法适应大规模知识图谱的应用需求。

最近,伴随着深度学习和表示学习的革命性发展,研究者也开始探索面向知识图谱的表示学习方案。其基本思想是,将知识图谱中的实体和关系的语义信息用低维向量表示,这种分布式表示(Distributed Representation)方案能够极大地帮助基于网络的表示方案。其中,最简单有效的模型是最近提出的TransE(Bordes, et al. 2013)。TransE基于实体和关系的分布式向量表示,将每个三元组实例(head,relation,tail)中的关系relation看做从实体head到实体tail的翻译,通过不断地调整h、r和t(head、relation和tail的向量),使(h + r)尽可能与t相等,即h + r = t。该优化目标如图2.12所示。

图2.12  基于分布式表示的知识表示方案
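TransE 的打分函数可以用几行代码示意(向量维度与随机数据均为演示用的假设):

import numpy as np

def transe_score(h, r, t):
    """三元组 (h, r, t) 的打分:|| h + r - t ||,越小表示越可信。"""
    return float(np.linalg.norm(h + r - t))

# 演示:构造近似满足 h + r = t 的向量,得分应接近 0
rng = np.random.default_rng(0)
h, r = rng.normal(size=50), rng.normal(size=50)
t = h + r + rng.normal(scale=0.01, size=50)
print(transe_score(h, r, t))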

通过TransE等模型学习得到的实体和关系向量,能够在很大程度上缓解基于网络表示方案的稀疏性问题,应用于很多重要任务中。

首先,利用分布式向量,我们可以通过欧氏距离或余弦距离等方式,很容易地计算实体间、关系间的语义相关度。这将极大地改进开放信息抽取中实体融合和关系融合的性能。通过寻找给定实体的相似实体,还可用于查询扩展和查询理解等应用。

其次,知识表示向量可以用于关系抽取。以TransE为例,由于我们的优化目标是让h+r=t,因此,当给定两个实体h和t的时候,我们可以通过寻找与t-h最相似的r,来寻找两实体间的关系。(Bordes, et al. 2013)中的实验证明,该方法的抽取性能较高。而且我们可以发现,该方法仅需要知识图谱作为训练数据,不需要外部的文本数据,因此这又称为知识图谱补全(Knowledge Graph Completion),与复杂网络中的链接预测(Link Prediction)类似,但是要复杂得多,因为在知识图谱中每个节点和连边上都有标签(标记实体名和关系名)。

最后,知识表示向量还可以用于发现关系间的推理规则。例如,对于大量X、Y、Z间出现的(X,父亲,Y)、(Y,父亲,Z)以及(X,祖父,Z)实例,我们在TransE中会学习X+父亲=Y,Y+父亲=Z,以及X+祖父=Z等目标。根据前两个等式,我们很容易得到X+父亲+父亲=Z,与第三个公式相比,就能够得到“父亲+父亲=>祖父”的推理规则。前面我们介绍过,基于关系的同现统计学习推理规则的思想,存在严重的数据稀疏问题。如果利用关系向量表示提供辅助,可以显著缓解稀疏问题。

2.5  前景与挑战

如果未来的智能机器拥有一个大脑,知识图谱就是这个大脑中的知识库,对于大数据智能具有重要意义,将对自然语言处理、信息检索和人工智能等领域产生深远影响。

现在以商业搜索引擎公司为首的互联网巨头已经意识到知识图谱的战略意义,纷纷投入重兵布局知识图谱,并对搜索引擎形态日益产生重要的影响。同时,我们也强烈地感受到,知识图谱还处于发展初期,大多数商业知识图谱的应用场景非常有限,例如搜狗知立方更多聚焦在娱乐和健康等领域。根据各搜索引擎公司提供的报告来看,为了保证知识图谱的准确率,仍然需要在知识图谱构建过程中采用较多的人工干预。

可以看到,在未来的一段时间内,知识图谱将是大数据智能的前沿研究问题,有很多重要的开放性问题亟待学术界和产业界协力解决。我们认为,未来知识图谱研究有以下几个重要挑战。

1. 知识类型与表示。知识图谱主要采用(实体1,关系,实体2)三元组的形式来表示知识,这种方法可以较好地表示很多事实性知识。然而,人类知识类型丰富多样,面对很多复杂知识,三元组就束手无策了。例如,人们的购物记录信息,新闻事件等,包含大量实体及其之间的复杂关系,更不用说人类大量的涉及主观感受、主观情感和模糊的知识了。有很多学者针对不同场景设计了不同的知识表示方法。知识表示是知识图谱构建与应用的基础,如何合理设计表示方案,更好地涵盖人类不同类型的知识,是知识图谱的重要研究问题。最近认知领域关于人类知识类型的探索(Tenenbaum, et al. 2011)也许会对知识表示研究有一定启发作用。

2. 知识获取。如何从互联网大数据萃取知识,是构建知识图谱的重要问题。目前已经提出各种知识获取方案,并已经成功抽取出大量有用的知识。但在抽取知识的准确率、覆盖率和效率等方面,都仍不尽如人意,有极大的提升空间。

3. 知识融合。从不同来源数据中抽取的知识可能存在大量噪声和冗余,或者使用了不同的语言。如何将这些知识有机融合起来,建立更大规模的知识图谱,是实现大数据智能的必由之路。

4. 知识应用。目前大规模知识图谱的应用场景和方式还比较有限,如何有效实现知识图谱的应用,利用知识图谱实现深度知识推理,提高大规模知识图谱计算效率,需要人们不断锐意发掘用户需求,探索更重要的应用场景,提出新的应用算法。这既需要丰富的知识图谱技术积累,也需要对人类需求的敏锐感知,找到合适的应用之道。

2.6  内容回顾与推荐阅读

本章系统地介绍了知识图谱的产生背景、数据来源、应用场景和主要技术。通过本章我们主要有以下结论:

— 知识图谱是下一代搜索引擎、自动问答等智能应用的基础设施。

— 互联网大数据是知识图谱的重要数据来源。

— 知识表示是知识图谱构建与应用的基础技术。

— 实体链指、关系抽取和知识推理是知识图谱构建与应用的核心技术。

知识图谱与本体(Ontology)和语义网(Semantic Web)等密切相关,有兴趣的读者可以搜索与之相关的文献阅读。知识表示(Knowledge Representation)是人工智能的重要课题,读者可以通过人工智能专著(Russell & Norvig 2009)了解其发展历程。在关系抽取方面,读者可以阅读(Nastase, et al. 2013)、(Nickel, et al. 2015)详细了解相关技术。

2.7  参考文献

[1] (Bordes, et al. 2013) Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., & Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. In Proceedings of NIPS.

[2] (Dong, et al. 2014) Dong, X., Gabrilovich, E., Heitz, G., Horn, W., et al. (2014). Knowledge Vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of KDD.

[3] (Lao & Cohen 2010) Lao, N., & Cohen, W. W. (2010). Relational retrieval using a combination of path-constrained random walks. Machine Learning, 81(1), 53-67.

[4] (Nastase, et al. 2013) Nastase, V., Nakov, P., Seaghdha, D. O., & Szpakowicz, S. (2013). Semantic relations between nominals. Synthesis Lectures on Human Language Technologies, 6(1), 1-119.

[5] (Nickel, et al. 2015) Nickel, M., Murphy, K., Tresp, V., & Gabrilovich, E. (2015). A review of relational machine learning for knowledge graphs.

[6] (Russell & Norvig 2009) Russell, S., & Norvig, P. (2009). Artificial Intelligence: A Modern Approach, 3rd Edition. Pearson Press.(中文译名:人工智能——一种现代方法)

[7] (Schuhmacher, et al. 2014) Schuhmacher, M., & Ponzetto, S. P. (2014). Knowledge-based graph document modeling. In Proceedings of WSDM.

 

 

How AI is helping detect fraud and fight criminals

How AI is helping detect fraud and fight criminals

AI is about to go mainstream. It will show up in the connected home, in your car, and everywhere else. While it’s not as glamorous as the sentient beings that turn on us in futuristic theme parks, the use of AI in fraud detection holds major promise. Keeping fraud at bay is an ever-evolving battle in which both sides, good and bad, are adapting as quickly as possible to determine how to best use AI to their advantage.

There are currently three major ways that AI is used to fight fraud, and they correspond to how AI has developed as a field. These are:

  1. Rules and reputation lists
  2. Supervised machine learning
  3. Unsupervised machine learning

Rules and reputation lists

Rules and reputation lists exist in many modern organizations today to help fight fraud and are akin to “expert systems,” which were first introduced to the AI field in the 1970s. Expert systems are computer programs combined with rules from domain experts. They’re easy to get up and running and are human-understandable, but they’re also limited by their rigidity and high manual effort.

A “rule” is a human-encoded logical statement that is used to detect fraudulent accounts and behavior. For example, an institution may put in place a rule that states, “If the account is purchasing an item costing more than $1000, is located in Nigeria, and signed up less than 24 hours ago, block the transaction.”

Reputation lists, similarly, are based on what you already know is bad. A reputation list is a list of specific IPs, device types, and other single characteristics and their corresponding reputation score. Then, if an account is coming from an IP on the bad reputation list, you block them.

While rules and reputation lists are a good first attempt at fraud detection and prevention, they can be easily gamed by cybercriminals. These days, digital services abound, and these companies make the sign-up process frictionless. Therefore, it takes very little time for fraudsters to make dozens, or even thousands, of accounts. They then use these accounts to learn the boundaries of the rules and reputation lists put in place. Easy access to cloud hosting services, VPNs, anonymous email services, device emulators, and mobile device flashing makes it easy to come up with unsuspicious attributes that would miss reputation lists.

Since the 1990s, expert systems have fallen out of favor in many domains, losing out to more sophisticated techniques. Clearly, there are better tools at our disposal for fighting fraud. However, a significant number of fraud-fighting teams in modern companies still rely on this rudimentary approach for the majority of their fraud detection, leading to massive human review overhead, false positives, and sub-optimal detection results.

Supervised machine learning (SML)

Machine learning is a subfield of AI that attempts to address the issue of previous approaches being too rigid. Researchers wanted the machines to learn from data, rather than encoding what these computer programs should look for (a different approach from expert systems). Machine learning began to make big strides in the 1990s, and by the 2000s it was effectively being used in fighting fraud as well.

Applied to fraud, supervised machine learning (SML) represents a big step forward. It’s vastly different from rules and reputation lists because instead of looking at just a few features with simple rules and gates in place, all features are considered together.

There’s one downside to this approach. An SML model for fraud detection must be fed historical data to determine what the fraudulent accounts and activity look like versus what the good accounts and activity look like. The model would then be able to look through all of the features associated with the account to make a decision. Therefore, the model can only find fraud that is similar to previous attacks. Many sophisticated modern-day fraudsters are still able to get around these SML models.

That said, SML applied to fraud detection is an active area of development because there are many SML models and approaches. For instance, applying neural networks to fraud can be very helpful because it automates feature engineering, an otherwise costly step that requires human intervention. This approach can decrease the incidence of false positives and false negatives compared to other SML models, such as SVM and random forest models, since the hidden neurons can encode many more feature possibilities than can be done by a human.

Unsupervised machine learning (UML)

Compared to SML, unsupervised machine learning (UML) has cracked fewer domain problems. For fraud detection, UML hasn’t historically been able to help much. Common UML approaches (e.g., k-means and hierarchical clustering, unsupervised neural networks, and principal component analysis) have not been able to achieve good results for fraud detection.

Building an unsupervised approach to fraud detection in-house can be difficult, since it requires processing billions of events all together and there are no effective out-of-the-box unsupervised models. However, there are companies that have made strides in this area.

The reason it can be applied to fraud is due to the anatomy of most fraud attacks. Normal user behavior is chaotic, but fraudsters will work in patterns, whether they realize it or not. They are working quickly and at scale. A fraudster isn’t going to try to steal $100,000 in one go from an online service. Rather, they make dozens to thousands of accounts, each of which may yield a profit of a few cents to several dollars. But those activities will inevitably create patterns, and UML can detect them.

The main benefits of using UML are:

  • You can catch new attack patterns earlier
  • All of the accounts are caught, stopping the fraudster from making any money
  • Chance of false positives is much lower, since you collect much more information before making a detection decision

Putting it all together

Each approach has its own advantages and disadvantages, and you can benefit from each method. Rules and reputation lists can be implemented cheaply and quickly without AI expertise. However, they have to be constantly updated and will only block the most naive fraudsters. SML has become an out-of-the box technology that can consider all the attributes for a single account or event, but it’s still limited in that it can’t find new attack patterns. UML is the next evolution, as it can find new attack patterns, identify all of the accounts associated with an attack, and provide a full global view. On the other hand, it’s not as effective at stopping individual fraudsters with low-volume attacks and is difficult to implement in-house. Still, it’s certainly promising for companies looking to block large-scale or constantly evolving attacks.

A healthy fraud detection system often employs all three major ways of using AI to fight fraud. When they’re used together properly, it’s possible to benefit from the advantages of each while mitigating the weaknesses of the others.

AI in fraud detection will continue to evolve, well beyond the technologies explored above, and it’s hard to even grasp what the next frontier will look like. One thing we know for sure, though, is that the bad guys will continue to evolve along with it, and the race is on to use AI to detect criminals faster than they can use it to hide.

Catherine Lu is a technical product manager at DataVisor, a full-stack online fraud analytics platform.

