Category Archives: 时间序列

时间序列的自回归模型—从线性代数的角度来看

Fibonacci 序列

在开始讲解时间序列的自回归模型(AutoRegression Model)之前,我们需要先介绍一下线性代数的基础知识。为了介绍线性代数的基础知识,我们可以先看一个简单的例子。考虑整数序列 Fibonacci 序列,它的通项公式是一个递归函数,每项的取值是由前两项所生成的,其具体的公式就是

F_{n}=F_{n-1}+F_{n-2}

其初始值是 F_{0}=0, F_{1} = 1。按照其递归公式来计算,我们可以详细写出前面的几项,那就是:

0,1,1,2,3,5,8,13,21,34,55,89,144,…

但是,计算 Fibonacci 的通项公式则比计算等差数列和等比数列的通项公式复杂的多,因为这里需要使用线性代数的技巧才能解决这个问题。

求解 Fibonacci 序列的通项公式 --- 矩阵对角化

根据 Fibonacci 数列的递归公式,基于矩阵乘法的定义,Fibonacci 序列可以写成如下形式:

\left( \begin{array}{c} F_{n+2}  \\ F_{n+1}  \\ \end{array}\right)= \left( \begin{array}{cc} 1 & 1 \\ 1 & 0 \\ \end{array}\right) \left( \begin{array}{c} F_{n+1}  \\ F_{n}  \\ \end{array}\right) = A \left( \begin{array}{c} F_{n+1}  \\ F_{n}  \\ \end{array}\right),

\left( \begin{array}{c} F_{n}  \\ F_{n-1}  \\ \end{array}\right)= A \left( \begin{array}{c} F_{n-1}  \\ F_{n-2}  \\ \end{array}\right) = \cdots = A^{n-1} \left( \begin{array}{c} F_{1}  \\ F_{0}  \\ \end{array}\right).

这就意味着我们需要计算出矩阵 A 的幂。在线性代数里面,为了计算矩阵的 n 次方,除了通过矩阵乘法的公式直接计算之外,还有一个经典的技巧,那就是将矩阵对角化。详细来说,如果 m\times m 的矩阵 A 能够对角化,那就是存在可逆矩阵 P 使得

P^{-1}AP = diag(\lambda_{1},\cdots,\lambda_{m})

\implies AP = P diag(\lambda_{1},\cdots,\lambda_{m})

\implies A = P diag(\lambda_{1},\cdots,\lambda_{m})P^{-1}.

其中 D = diag(\lambda_{1},\cdots,\lambda_{m}) 表示一个 m\times m 的对角矩阵,其对角的元素从左上角到右下角依次是 \lambda_{1},\cdots,\lambda_{m}。如果把矩阵 P 写成列向量的形式,i.e. P =(\vec{\alpha}_{1},\cdots,\vec{\alpha}_{m}),那么以上的矩阵方程就可以转换为 A\vec{\alpha}_{i} = \lambda_{i}\vec{\alpha}_{i}, 1\leq i\leq m。进一步来说,如果要计算矩阵 A 的幂,就可以得到:

A^{k} = (PDP^{-1})\cdots(PDP^{-1}) = P D^{k} P^{-1}= P diag(\lambda_{1}^{k},\cdots,\lambda_{m}^{k})P^{-1}.

另外,如果想知道一个矩阵 A 的特征值和特征向量,则需要计算以下多项式的根,i.e. 计算关于 \lambda 的多项式的解,

det(\lambda I - A) = 0,

其中 I 是一个单位矩阵(identity matrix)。

按照以上的思路,如果令

A = \left( \begin{array}{cc} 1 & 1 \\ 1 & 0 \\ \end{array}\right),

可以计算出 A 的两个特征值分别是 \phi=(1+\sqrt{5})/2- \phi^{-1} = (1-\sqrt{5})/2,它们所对应的特征向量分别是:

\vec{\alpha}_{1} = (\phi,1)^{T}, \vec{\alpha}_{2} = (-\phi^{-1},1).

因此直接计算可以得到

F_{k} = \frac{1}{\sqrt{5}}\bigg(\frac{1+\sqrt{5}}{2}\bigg)^{k} - \frac{1}{\sqrt{5}}\bigg(\frac{1-\sqrt{5}}{2}\bigg)^{k}=\frac{\phi^{k}-(-\phi)^{-k}}{\sqrt{5}}.

通过上面的计算方法,为了计算 Fibonacci 数列的通项公式,我们可以先把它转换成一个矩阵求幂的问题,于是我们就能矩阵对角化的方法把 Fibonacci 数列的通项公式求出来。

时间序列的弱平稳性

要讲解自回归模型,就必须提到时间序列的弱平稳性。一个时间序列 \{x_{t}\}_{t\geq 0} 具有弱平稳性(Weak Stationary)指的是:

  1. E(x_{t}) 对于所有的 t\geq 0 都是恒定的;
  2. Var(x_{t}) 对于所有的 t\geq 0 都是恒定的;
  3. x_{t}x_{t-h} 的协方差对于所有的 t\geq 0 都是恒定的。

另外,时间序列的自相关方程(AutoCorrelation Function)指的是对于 h = 1,2,3,\cdots,可以定义 ACF 为

ACF(x_{t},x_{t-h}) = \frac{Covariance(x_{t},x_{t-h})}{\sqrt{Var(x_{t})\cdot Var(x_{t-h})}}.

如果时间序列 \{x_{t}\}_{t\geq 0} 在弱平稳性的假定下,ACF 将会简化为

ACF(x_{t},x_{t-h}) = \frac{Covariance(x_{t},x_{t-h})}{Var(x_{t})}.

时间序列的自回归模型(AutoRegression Model)

AR(1) 模型

AR(1) 模型指的是时间序列 \{x_{t}\}_{t\geq 0} 在时间戳 t 时刻的取值 x_{t} 与时间戳 t - 1 时刻的取值 x_{t-1} 相关,其公式就是:

x_{t}=\delta+\phi_{1}x_{t-1}+w_{t}

这个时间序列 \{x_{t}\}_{t\geq 0} 满足如下条件:

  1. w_{t}\sim N(0,\sigma_{w}^{2}),并且 w_{t} 满足 iid 条件。其中 N(0,\sigma_{w}^{2}) 表示 Gauss 正态分布,它的均值是0,方差是 \sigma_{w}^{2}
  2. w_{t}x_{t} 是相互独立的(independent)。
  3. x_{0},x_{1},\cdots弱平稳的,i.e. 必须满足 |\phi_{1}|<1

如果选择初始条件 x_{0}=1,则可以得到一些 AR(1) 模型的例子如下图所示:

AR Models

从 AR(1) 以上的定义出发,我们可以得到:

  1. E(x_{t}) = \delta/(1-\phi_{1}).
  2. Var(x_{t}) = \sigma_{w}^{2}/(1-\phi_{1}^{2}).
  3. Covariance(x_{t},x_{t-h}) = \phi_{1}^{h}.

Proof of 1. 从 AR(1) 的模型出发,可以得到

E(x_{t}) = E(\delta + \phi_{1}x_{t-1}+w_{t})  = \delta + \phi_{1}E(x_{t-1}) = \delta + \phi_{1}E(x_{t}),

从而,E(x_{t}) = \delta/(1-\phi_{1}).

Proof of 2. 从 AR(1) 的模型出发,可以得到

Var(x_{t}) = Var(\delta + \phi_{1}x_{t-1}+w_{t})

= \phi_{1}^{2}Var(x_{t-1}) +Var(w_{t}) = \phi_{1}^{2}Var(x_{t}) + \sigma_{w}^{2},

从而,Var(x_{t}) =\sigma_{w}^{2}/(1-\phi_{1}^{2}).

Proof of 3.\mu = E(x_{t}), \forall t\geq 0. 从 x_{t} 的定义出发,可以得到:

x_{t}-\mu = \phi_{1}(x_{t-1}-\mu)+w_{t}

= \phi_{1}^{h}(x_{t-h}-\mu) + \phi_{1}^{h-1}w_{t-h+1}+\cdots+\phi_{1}w_{t-1}+w_{t},

从而,

\rho_{h} = Covariance(x_{t},x_{t-h}) = \frac{E((x_{t}-\mu)\cdot(x_{t-h}-\mu))}{Var(x_{t})}=\phi_{1}^{h}.

AR(1) 模型与一维动力系统

特别的,如果假设 w_{t} 恒等于零,就可以得到 x_{t} =\delta + \phi_{1}x_{t-1} 对于所有的 t\geq 1 都成立。也就是可以写成一个一维函数的迭代公式:

f(x) = \phi_{1}x + \delta,

下面我们要计算 f^{n}(x) 的收敛性,这里的 f^{n}(x) = f\circ\cdots\circ f(x) 表示函数 fn 次迭代。

Method 1. 

通过 f(x) 的定义直接计算可以得到:

f^{n}(x) = \phi_{1}^{n}x+ \frac{1-\phi_{1}^{n}}{1-\phi_{1}}\delta,

n\rightarrow \infty,可以得到 f^{n}(x)\rightarrow \delta/(1-\phi_{1})。这与 E(x_{t}) = \delta/(1-\phi_{1}) 其实是保持一致的。

另外,如果 |\phi_{1}|>1,可以从公式上得到 f^{n}(x) \rightarrow \inftyn\rightarrow \infty

Method 2.

将原函数转换成 Lipschitz 函数的形式,i.e. 如果 |\phi_{1}|<1,那么

f(x)-\frac{\delta}{1-\phi_{1}} = \phi_{1}(x-\frac{\delta}{1-\phi_{1}})

\implies |f(x)-\frac{\delta}{1-\phi_{1}}| <\frac{1+|\phi_{1}|}{2}\cdot|x-\frac{\delta}{1-\phi_{1}}|

\implies |f^{n}(x)-\frac{\delta}{1-\phi_{1}}|<\bigg(\frac{1+|\phi_{1}|}{2}\bigg)^{n}\cdot|x-\frac{\delta}{1-\phi_{1}}|.

因为 (1+|\phi_{1}|)/2<1,我们可以得到 \lim_{n\rightarrow\infty}f^{n}(x)=\delta/(1-\phi_{1}). i.e. f^{n}(x) 趋近于 \delta/(1-\phi_{1}).

反之,如果 |\phi_{1}|>1,很容易得到

f(x)-\frac{\delta}{1-\phi_{1}} = \phi_{1}(x-\frac{\delta}{1-\phi_{1}})

\implies |f(x)-\frac{\delta}{1-\phi_{1}}| >\frac{1+|\phi_{1}|}{2}\cdot|x-\frac{\delta}{1-\phi_{1}}|

\implies |f^{n}(x)-\frac{\delta}{1-\phi_{1}}|>\bigg(\frac{1+|\phi_{1}|}{2}\bigg)^{n}\cdot|x-\frac{\delta}{1-\phi_{1}}|.

因此,在 |\phi_{1}|>1 这种条件下,f^{n}(x)\rightarrow \infty as n\rightarrow \infty. 特别地,对于一阶差分方程 x_{t} =\delta + \phi_{1}x_{t-1} 而言,如果 |\phi_{1}|>1,那么 x_{t} 的取值会越来越大,这与现实的状况不相符,所以在时间序列的研究中,一般都会假设 |\phi_{1}|<1

AR(p) 模型

按照之前类似的定义,可以把 AR(1) 模型扩充到 AR(p) 模型,也就是说:

1. AR(1) 模型形如:

x_{t}=\delta+\phi_{1}x_{t-1}+w_{t}.

2. AR(2) 模型形如:

x_{t} = \delta + \phi_{1}x_{t-1}+\phi_{2}x_{t-2}+w_{t}.

3. AR(p) 模型形如:

x_{t} = \delta + \phi_{1}x_{t-1}+\phi_{2}x_{t-2}+\cdots+\phi_{p}x_{t-p}+w_{t}.

AR(p) 模型的稳定性 --- 基于线性代数

对于 AR(2) 模型,可以假定 \delta = 0 并且忽略误差项,因此可以得到简化版的模型形如:

x_{t}= \phi_{1}x_{t-1} + \phi_{2}x_{t-2}.

写成矩阵的形式形如:

\left( \begin{array}{c} x_{t+2}  \\ x_{t+1}  \\ \end{array}\right)= \left( \begin{array}{cc} \phi_{1} & \phi_{2} \\ 1 & 0 \\ \end{array}\right) \left( \begin{array}{c} x_{t+1}  \\ x_{t}  \\ \end{array}\right) = A \left( \begin{array}{c} x_{t+1}  \\ x_{t}  \\ \end{array}\right).

求解其特征多项式则是基于 det(\lambda I - A) = 0,求解可以得到 \lambda^{2}-\phi_{1}\lambda - \phi_{2} =0,i.e. A^{k} = P diag(\lambda_{1}^{k}, \lambda_{2}^{k})P^{-1}。当 \lambda_{1}, \lambda_{2} 都在单位圆内部的时候,也就是该模型 x_{t+2} = \phi_{1}x_{t+1}+\phi_{2}x_{t} 满足稳定性的条件。

对于更加一般的 AR(p) 模型,也就是考虑 p 阶差分方程

x_{t} = \phi_{1}x_{t-1}+\phi_{2}x_{t-2}+\cdots+\phi_{p}x_{t-p}.

可以用同样的方法将其转换成矩阵的形式,那就是:

\left(\begin{array}{c} x_{t+p} \\ x_{t+p-1} \\ \vdots \\ x_{t+1}\\ \end{array}\right) = \left(\begin{array}{ccccc} \phi_{1} & \phi_{2} &\cdots & \phi_{p-1} & \phi_{p} \\ 1 & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \\ \end{array}\right) \left(\begin{array}{c} x_{t+p-1} \\ x_{t+p-2} \\ \vdots \\ x_{t} \\ \end{array}\right) = A \left(\begin{array}{c} x_{t+p-1} \\ x_{t+p-2} \\ \vdots \\ x_{t} \\ \end{array}\right)

计算 det(\lambda I - A) = 0,可以得到其特征多项式为:

\lambda^{p}-\phi_{1}\lambda^{p-1}-\phi_{2}\lambda^{p-2}-\cdots-\phi_{p}=0.

当每个特征值都在单位圆盘内部的时候,i.e. |\lambda_{i}|<1, \forall 1\leq i\leq p,该 p 阶差分方程

x_{t} = \phi_{1}x_{t-1}+\phi_{2}x_{t-2}+\cdots+\phi_{p}x_{t-p}

存在稳定性的解。

 

Advertisements

如何理解时间序列?— 从Riemann积分和Lebesgue积分谈起

Riemann 积分和 Lebesgue 积分是数学中两个非常重要的概念。本文将会从 Riemann 积分和 Lebesgue 积分的定义出发,介绍它们各自的性质和联系。

积分

Riemann 积分

Riemann 积分虽然被称为 Riemann 积分,但是在 Riemann 之前就有学者对这类积分进行了详细的研究。早在阿基米德时代,阿基米德为了计算曲线 x^{2} 在 [0,1] 区间上与 X 坐标轴所夹的图形面积,就使用了 Riemann 积分的思想。 他把 [0,1] 区间等长地切割成 n 段,每一段使用一个长方形去逼近 x^{2} 这条曲线的分段面积,再把 n 取得很大,所以得到当 n 趋近于无穷的时候,就知道该面积其实是 1/3。

下面来看一下 Riemann 积分的详细定义。

Riemann Integral1

考虑定义在闭区间 [a,b] 上的函数 f(x)

取一个有限的点列 a=x_{0}<x_{1}<\cdots<x_{n}=b\lambda = \max(x_{i+1}-x_{i}) 表示这些区间长度的最大值,在这里 0\leq i\leq n-1。在每一个子区间上[x_{i},x_{i+1}] 上取出一个点 t_{i}\in[x_{i},x_{i+1}]。而函数 f(x) 关于以上取样分割的 Riemann 和就是以下公式:

\sum_{i=0}^{n-1}f(t_{i})(x_{i+1}-x_{i}).

当我们说该函数 f(x) 在闭区间 [a,b] 上的取值是 s 的意思是:

对于任意 \epsilon>0,存在 \delta>0 使得对于任意取样分割,当 \lambda<\delta 时,就有

|\sum_{i=0}^{n-1}f(t_{i})(x_{i+1}-x_{i}) - s|<\epsilon.

通常来说,用符号来表示就是:\int_{a}^{b}f(x)=s.

用几幅图来描述 Riemann 积分的思想就是:

 

Lebesgue 积分

Riemann 积分是为了计算曲线与 X 轴所围成的面积,而 Lebesgue 积分也是做同样的事情,但是计算面积的方法略有不同。要想直观的解释两种积分的原理,可以参见下图:

RiemannANDLebesgue
Riemann 积分(上)与 Lebesgue 积分(下)

Riemann 积分是把一条曲线的底部分成等长的区间,测量每一个区间上的曲线高度,所以总面积就是这些区间与高度所围成的面积和。

Lebesgue 积分是把曲线化成等高线图,每两根相邻等高线的差值是一样的。每根等高线之内含有它所圈着的长度,因此总面积就是这些等高线内的面积之和。

用再形象一点的语言来描述就是:吃一块汉堡有多种方式

  1. Riemann 积分:从一个角落开始一口一口吃,每口都包含所有的配料;
  2. Lebesgue 积分:从最上层开始吃,按照“面包-配菜-肉-蛋-面包”的节奏,一层一层来吃。

再看一幅图的表示就是:Riemann 积分是按照蓝色的数字顺序相加的,Lebesgue 积分是按照红色的数字来顺序相加的。

Riemann and Lebesgue1

 

基于这些基本的思想,就可以给出 Lebesgue 积分的定义:

简单函数指的是对指示函数的有限线性组合,i.e. \sum_{k}a_{k}I_{S_{k}},这里的 a_{k} 是系数,S_{k} 是可测集合,I 表示指示函数。当 a_{k} 非负时,令

\int(\sum_{k}a_{k}1_{S_{k}})d\mu = \sum_{k}a_{k}\int 1_{S_{k}}d\mu = \sum_{k}a_{k}\mu(S_{k}).

如果 f 是一个非负可测函数时,可以定义函数 f 在可测集合 E 上的 Lebesgue 积分是:

\int_{E}f d\mu = \sup\{\int_{E}sd\mu: \bold{0}\leq s\leq f\},

这里的 s 指的是非负简单函数,\bold{0} 表示零函数,这里的大小关系表示对定义域内的每个点都要成立。

而对于可测函数时,可以把可测函数 f 转换成 f= f^{+}-f^{-},而这里的 f^{+}f^{-} 都是非负可测函数。所以可以定义任意可测函数的 Lebesgue 积分如下:

\int fd\mu = \int f^{+}d\mu - \int f^{-}d\mu..

Riemann 积分与Lebesgue 积分的关系

定义了两种积分之后,也许有人会问它们之间是否存在矛盾?其实,它们之间是不矛盾的,因为有学者证明了这样的定理:

如果有界函数 f 在闭区间 [a,b] 是 Riemann 可积的,则它也是 Lebesgue 可积的,并且它们的积分值相等:

(R)\int_{a}^{b}f(x)dx = (L)\int_{[a,b]}f(x)dx.

左侧是表示 Riemann 积分,右侧表示 Lebesgue 积分。

形象化一点的语言描述就是:无论从角落一口一口地吃汉堡,还是从顶至下一层一层吃,所吃的汉堡都是同一个。

但是 Lebesgue 积分比 Riemann 积分有着更大的优势,例如 Dirichlet 函数,

  1. x 是有理数时,D(x) = 1
  2. x 是无理数时,D(x) = 0.

Dirichlet 函数是定义在实数轴的函数,并且值域是 \{0,1\},无法画出函数图像,它不是 Riemann 可积的,但是它 Lebesgue 可积。

 

时间序列

提到时间序列,也就是把以上所讨论的连续函数换成离散函数而已,把定义域从一个闭区间 [a,b] 换成 \{1,2,3,\cdots\} 这样的定义域而已。所以,之前所讨论的很多连续函数的想法都可以应用在时间序列上。

时间序列的表示 — 基于 Riemann 积分

现在我们可以按照 Riemann 积分的计算方法来表示一个时间序列的特征,于是就有学者把时间序列按照横轴切分成很多段,每一段使用某个简单函数(线性函数等)来表示,于是就有了以下的方法:

  1. 分段线性逼近(Piecewise Linear Approximation)
  2. 分段聚合逼近(Piecewise Aggregate Approximation)
  3. 分段常数逼近(Piecewise Constant Approximation)

说到这几种算法,其实最本质的思想就是进行数据降维的工作,用少数的数据来进行原始时间序列的表示(Representation)。用数学化的语言来描述时间序列的数据降维(Data Reduction)就是:把原始的时间序列 \{x_{1},\cdots,x_{N}\}\{x_{1}^{'},\cdots, x_{D}^{'}\} 来表示,其中 D<N。那么后者就是原始序列的一种表示(representation)。

分段聚合逼近(Piecewise Aggregate Approximation)— 类似 Riemann 积分

在这种算法中,分段聚合逼近(Piecewise Aggregate Approximation)是一种非常经典的算法。假设原始的时间序列是 C = \{x_{1},\cdots, x_{N}\},定义 PAA 的序列是:\overline{C} = \{\overline{x}_{1},\cdots,\overline{x}_{w}\}

其中

\overline{x}_{i} = \frac{w}{N} \cdot \sum_{j=\frac{N}{w}(i-1)+1}^{\frac{N}{w}i} x_{j}.

在这里 1\leq i\leq w。用图像来表示那就是:

PAA

至于分段线性逼近(Piecewise Linear Approximation)和分段常数逼近(Piecewise Constant Approximation),只需要在 \overline{x}_{i} 的定义上稍作修改即可。

符号特征(Symbolic Approximation)— 类似用简单函数来计算 Lebesgue 积分

在推荐系统的特征工程里面,特征通常来说可以做归一化,二值化,离散化等操作。例如,用户的年龄特征,一般不会直接使用具体的年月日,而是划分为某个区间段,例如 0~6(婴幼儿时期),7~12(小学),13~17(中学),18~22(大学)等阶段。

其实在得到分段特征之后,分段特征在某种程度上来说依旧是某些连续值,能否把连续值划分为一些离散的值呢?于是就有学者使用一些符号来表示时间序列的关键特征,也就是所谓的符号表示法(Symbolic Representation)。下面来介绍经典的 SAX Representation。

如果我们希望使用 \alpha 个符号来表示时间序列,那么我们其实可以考虑正态分布 N(0,1),用\{z_{1/\alpha},\cdots,z_{(\alpha-1)/\alpha}\} 来表示 Gauss 曲线下方的一些点,而这些点把 Gauss 曲线下方的面积等分成了 \alpha 段。用 \{l_{1},\cdots,l_{\alpha}\} 表示 \alpha 个字母。

SAX 方法的流程如下:

  1. 正规化(normalization):把原始的时间序列映射到一个新的时间序列,新的时间序列满足均值为零,方差为一的条件。
  2. 分段表示(PAA):\{x_{1},\cdots, x_{N}\} \Rightarrow  \{\overline{x}_{1},\cdots,\overline{x}_{w}\}
  3. 符号表示(SAX):如果 \overline{x}_{i}<z_{1/\alpha},那么 \hat{X}_{i}=l_{1};如果 z_{(j-1)/\alpha}\leq \overline{x}_{i}<z_{j/\alpha},那么 \hat{X}_{i} = l_{j},在这里 2\leq j\leq \alpha;如果 \overline{x}_{i}\geq z_{(\alpha-1)/\alpha},那么 \hat{X}_{i} = l_{\alpha}

于是,我们就可以用 \{l_{1},\cdots,l_{\alpha}\}\alpha 个字母来表示原始的时间序列了。

SAX

时间序列的表示 — 基于 Lebesgue 积分

要想考虑一个时间序列的值分布情况,其实就类似于 Lebesgue 积分的计算方法,考虑它们的分布情况,然后使用某些函数去逼近时间序列。要考虑时间序列的值分布情况,可以考虑熵的概念。

熵(Entropy)

通常来说,要想描述一种确定性与不确定性,熵(entropy)是一种不错的指标。对于离散空间而言,一个系统的熵(entropy)可以这样来表示:

\text{entropy}(X) = -\sum_{i=1}^{\infty}P\{x=x_{i}\}\ln(P\{x=x_{i}\}).

如果一个系统的熵(entropy)越大,说明这个系统就越混乱;如果一个系统的熵越小,那么说明这个系统就更加确定。

提到时间序列的熵特征,一般来说有几个经典的熵指标,其中有一个就是 binned entropy。

分桶熵(Binned Entropy)

从熵的定义出发,可以考虑把时间序列的值进行分桶的操作,例如,可以把 [min, max] 这个区间等分为十个小区间,那么时间序列的取值就会分散在这十个桶中。根据这个等距分桶的情况,就可以计算出这个概率分布的熵(entropy)。i.e. Binned Entropy 就可以定义为:

\text{binned entropy}(X) = -\sum_{k=0}^{\min(maxbin, len(X))} p_{k}\ln(p_{k})\cdot 1_{(p_{k}>0)},

其中 p_{k} 表示时间序列 X 的取值落在第 k 个桶的比例(概率),maxbin 表示桶的个数,len(X) 表示时间序列 X 的长度。

如果一个时间序列的 Binned Entropy 较大,说明这一段时间序列的取值是较为均匀的分布在 [min, max] 之间的;如果一个时间序列的 Binned Entropy 较小,说明这一段时间序列的取值是集中在某一段上的。

总结

在本篇文章中,笔者从 Riemann 积分和 Lebesgue 积分出发,介绍了它们的基本概念,性质和联系。然后从两种积分出发,探讨了时间序列的分段特征,时间序列的熵特征。在未来的 Blog 中,笔者将会介绍时间序列的更多相关内容。

 

How to Convert a Time Series to a Supervised Learning Problem in Python

https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/

Machine learning methods like deep learning can be used for time series forecasting.

Before machine learning can be used, time series forecasting problems must be re-framed as supervised learning problems. From a sequence to pairs of input and output sequences.

In this tutorial, you will discover how to transform univariate and multivariate time series forecasting problems into supervised learning problems for use with machine learning algorithms.

After completing this tutorial, you will know:

  • How to develop a function to transform a time series dataset into a supervised learning dataset.
  • How to transform univariate time series data for machine learning.
  • How to transform multivariate time series data for machine learning.

Let’s get started.

How to Convert a Time Series to a Supervised Learning Problem in Python

Time Series vs Supervised Learning

Before we get started, let’s take a moment to better understand the form of time series and supervised learning data.

A time series is a sequence of numbers that are ordered by a time index. This can be thought of as a list or column of ordered values.

For example:

A supervised learning problem is comprised of input patterns (X) and output patterns (y), such that an algorithm can learn how to predict the output patterns from the input patterns.

For example:

For more on this topic, see the post:

Pandas shift() Function

A key function to help transform time series data into a supervised learning problem is the Pandas shift() function.

Given a DataFrame, the shift() function can be used to create copies of columns that are pushed forward (rows of NaN values added to the front) or pulled back (rows of NaN values added to the end).

This is the behavior required to create columns of lag observations as well as columns of forecast observations for a time series dataset in a supervised learning format.

Let’s look at some examples of the shift function in action.

We can define a mock time series dataset as a sequence of 10 numbers, in this case a single column in a DataFrame as follows:

Running the example prints the time series data with the row indices for each observation.

We can shift all the observations down by one time step by inserting one new row at the top. Because the new row has no data, we can use NaN to represent “no data”.

The shift function can do this for us and we can insert this shifted column next to our original series.

Running the example gives us two columns in the dataset. The first with the original observations and a new shifted column.

We can see that shifting the series forward one time step gives us a primitive supervised learning problem, although with X and y in the wrong order. Ignore the column of row labels. The first row would have to be discarded because of the NaN value. The second row shows the input value of 0.0 in the second column (input or X) and the value of 1 in the first column (output or y).

We can see that if we can repeat this process with shifts of 2, 3, and more, how we could create long input sequences (X) that can be used to forecast an output value (y).

The shift operator can also accept a negative integer value. This has the effect of pulling the observations up by inserting new rows at the end. Below is an example:

Running the example shows a new column with a NaN value as the last value.

We can see that the forecast column can be taken as an input (X) and the second as an output value (y). That is the input value of 0 can be used to forecast the output value of 1.

Technically, in time series forecasting terminology the current time (t) and future times (t+1, t+n) are forecast times and past observations (t-1, t-n) are used to make forecasts.

We can see how positive and negative shifts can be used to create a new DataFrame from a time series with sequences of input and output patterns for a supervised learning problem.

This permits not only classical X -> y prediction, but also X -> Y where both input and output can be sequences.

Further, the shift function also works on so-called multivariate time series problems. That is where instead of having one set of observations for a time series, we have multiple (e.g. temperature and pressure). All variates in the time series can be shifted forward or backward to create multivariate input and output sequences. We will explore this more later in the tutorial.

The series_to_supervised() Function

We can use the shift() function in Pandas to automatically create new framings of time series problems given the desired length of input and output sequences.

This would be a useful tool as it would allow us to explore different framings of a time series problem with machine learning algorithms to see which might result in better performing models.

In this section, we will define a new Python function named series_to_supervised() that takes a univariate or multivariate time series and frames it as a supervised learning dataset.

The function takes four arguments:

  • data: Sequence of observations as a list or 2D NumPy array. Required.
  • n_in: Number of lag observations as input (X). Values may be between [1..len(data)] Optional. Defaults to 1.
  • n_out: Number of observations as output (y). Values may be between [0..len(data)-1]. Optional. Defaults to 1.
  • dropnan: Boolean whether or not to drop rows with NaN values. Optional. Defaults to True.

The function returns a single value:

  • return: Pandas DataFrame of series framed for supervised learning.

The new dataset is constructed as a DataFrame, with each column suitably named both by variable number and time step. This allows you to design a variety of different time step sequence type forecasting problems from a given univariate or multivariate time series.

Once the DataFrame is returned, you can decide how to split the rows of the returned DataFrame into X and y components for supervised learning any way you wish.

The function is defined with default parameters so that if you call it with just your data, it will construct a DataFrame with t-1 as X and t as y.

The function is confirmed to be compatible with Python 2 and Python 3.

The complete function is listed below, including function comments.

Can you see obvious ways to make the function more robust or more readable?
Please let me know in the comments below.

Now that we have the whole function, we can explore how it may be used.

One-Step Univariate Forecasting

It is standard practice in time series forecasting to use lagged observations (e.g. t-1) as input variables to forecast the current time step (t).

This is called one-step forecasting.

The example below demonstrates a one lag time step (t-1) to predict the current time step (t).

Running the example prints the output of the reframed time series.

We can see that the observations are named “var1” and that the input observation is suitably named (t-1) and the output time step is named (t).

We can also see that rows with NaN values have been automatically removed from the DataFrame.

We can repeat this example with an arbitrary number length input sequence, such as 3. This can be done by specifying the length of the input sequence as an argument; for example:

The complete example is listed below.

Again, running the example prints the reframed series. We can see that the input sequence is in the correct left-to-right order with the output variable to be predicted on the far right.

Multi-Step or Sequence Forecasting

A different type of forecasting problem is using past observations to forecast a sequence of future observations.

This may be called sequence forecasting or multi-step forecasting.

We can frame a time series for sequence forecasting by specifying another argument. For example, we could frame a forecast problem with an input sequence of 2 past observations to forecast 2 future observations as follows:

The complete example is listed below:

Running the example shows the differentiation of input (t-n) and output (t+n) variables with the current observation (t) considered an output.

Multivariate Forecasting

Another important type of time series is called multivariate time series.

This is where we may have observations of multiple different measures and an interest in forecasting one or more of them.

For example, we may have two sets of time series observations obs1 and obs2 and we wish to forecast one or both of these.

We can call series_to_supervised() in exactly the same way.

For example:

Running the example prints the new framing of the data, showing an input pattern with one time step for both variables and an output pattern of one time step for both variables.

Again, depending on the specifics of the problem, the division of columns into X and Y components can be chosen arbitrarily, such as if the current observation of var1 was also provided as input and only var2 was to be predicted.

You can see how this may be easily used for sequence forecasting with multivariate time series by specifying the length of the input and output sequences as above.

For example, below is an example of a reframing with 1 time step as input and 2 time steps as forecast sequence.

Running the example shows the large reframed DataFrame.

Experiment with your own dataset and try multiple different framings to see what works best.

Summary

In this tutorial, you discovered how to reframe time series datasets as supervised learning problems with Python.

Specifically, you learned:

  • About the Pandas shift() function and how it can be used to automatically define supervised learning datasets from time series data.
  • How to reframe a univariate time series into one-step and multi-step supervised learning problems.
  • How to reframe multivariate time series into one-step and multi-step supervised learning problems.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Mueen Keogh算法

论文:Exact Discovery of Time Series Motifs

 

Speeded up Brute Force Motif Discovery:

Github:https://github.com/saifuddin778/mkalgo

但是感觉有一行比较奇怪,应该是 Dist_{I(j+offset)}-Dist_{I(j)} < best-so-far,而不是Dist_{I(j)}-Dist_{I(j+offset)} < best-so-far,因为 D_{I(j)} 是递增排列的,并且 best-so-far > 0.

Speeded up brute force motif discovery

Generalization to multiple reference points:

https://github.com/nicholasg3/motif-mining/tree/95bbb05ac5d0f9e90134a67a789ea7e607f22cea

注意:

for j = 1 to m-offset 而不是 for j = 1 to R

nicholasg3 MK_Motif_discovery

MK Motif Discovery

Time Series Clustering with Dynamic Time Warping (DTW)

https://github.com/goodmattg/wikipedia_kaggle