# 基于自编码器的时间序列异常检测算法

### 自编码器

$\phi: X\rightarrow F,$

$\psi: F\rightarrow X,$

$\phi,\psi = argmin_{\phi,\psi}||X-(\psi\circ\phi)X||^{2},$

$z = f(Ax+c),$

$x'=g(Bx+d),$

Claim. 对于自编码器而言，其中隐藏层的维度 $m$ 一定是要小于输入层的维度 $n$ 的。

Proof. 如果 $n=m$，那么令 $A=B=I_{n},$ $c=d=0,$ $f=g=id$ 就可以得到一个自编码器，而这个自编码器对于提取特征没有任何的意义。同理，当 $m>n$ 时，$A$ 是一个 $m\times n$ 矩阵，$B$ 是一个 $n\times m$ 矩阵。从线性代数的角度来看，有无数个矩阵 $A, B$ 满足 $BA=I_{n}$。这种情况下对于提取特征也是没有意义的。而当 $m 时，其实无法找到矩阵 $A,B$ 使得 $BA=I_{n}.$ 如果存在 $BA=I_{n},$ 那么

$n = rank(I_{n})=rank(BA) \leq \min\{rank(A),rank(B)\} \leq \min\{m,m\}=m.$

### 时间序列异常检测：

-> Auto Encoder（Encoder 和 Decoder）

-> 重构后的时间序列

-> 通过重构后的时间序列与原始时间序列的整体误差和局部误差来判断异常点

### 总结

1. 从理论上说，它只能对一个时间序列单独训练一个模型，不同类型的时间序列需要使用不同的模型。这样的话，其实维护模型的成本比较高，不太适用于大规模的时间序列异常检测场景；
2. 对周期型的曲线效果比较好，如果是毛刺型的数据，有可能就不太适用；因为长期的毛刺型数据就可以看成正常的数据了。
3. 每次调参需要人为设置一定的阈值，不同的时间序列所需要的阈值是不一样的。

### 参考文献

1. Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications, Haowen XU, etc., 2018
2. Deep Learning, Ian Goodfellow, etc., 2016
3. https://zr9558.com/2016/06/12/replicator-neural-networks/

# 时间序列的自回归模型—从线性代数的角度来看

### Fibonacci 序列

$F_{n}=F_{n-1}+F_{n-2}$

0，1，1，2，3，5，8，13，21，34，55，89，144，…

### 求解 Fibonacci 序列的通项公式 －－－ 矩阵对角化

$\left( \begin{array}{c} F_{n+2} \\ F_{n+1} \\ \end{array}\right)= \left( \begin{array}{cc} 1 & 1 \\ 1 & 0 \\ \end{array}\right) \left( \begin{array}{c} F_{n+1} \\ F_{n} \\ \end{array}\right) = A \left( \begin{array}{c} F_{n+1} \\ F_{n} \\ \end{array}\right),$

$\left( \begin{array}{c} F_{n} \\ F_{n-1} \\ \end{array}\right)= A \left( \begin{array}{c} F_{n-1} \\ F_{n-2} \\ \end{array}\right) = \cdots = A^{n-1} \left( \begin{array}{c} F_{1} \\ F_{0} \\ \end{array}\right).$

$P^{-1}AP = diag(\lambda_{1},\cdots,\lambda_{m})$

$\implies AP = P diag(\lambda_{1},\cdots,\lambda_{m})$

$\implies A = P diag(\lambda_{1},\cdots,\lambda_{m})P^{-1}.$

$A^{k} = (PDP^{-1})\cdots(PDP^{-1}) = P D^{k} P^{-1}= P diag(\lambda_{1}^{k},\cdots,\lambda_{m}^{k})P^{-1}.$

$det(\lambda I - A) = 0,$

$A = \left( \begin{array}{cc} 1 & 1 \\ 1 & 0 \\ \end{array}\right)$,

$\vec{\alpha}_{1} = (\phi,1)^{T}, \vec{\alpha}_{2} = (-\phi^{-1},1)$.

$F_{k} = \frac{1}{\sqrt{5}}\bigg(\frac{1+\sqrt{5}}{2}\bigg)^{k} - \frac{1}{\sqrt{5}}\bigg(\frac{1-\sqrt{5}}{2}\bigg)^{k}=\frac{\phi^{k}-(-\phi)^{-k}}{\sqrt{5}}$.

### 时间序列的弱平稳性

1. $E(x_{t})$ 对于所有的 $t\geq 0$ 都是恒定的；
2. $Var(x_{t})$ 对于所有的 $t\geq 0$ 都是恒定的；
3. $x_{t}$$x_{t-h}$ 的协方差对于所有的 $t\geq 0$ 都是恒定的。

$ACF(x_{t},x_{t-h}) = \frac{Covariance(x_{t},x_{t-h})}{\sqrt{Var(x_{t})\cdot Var(x_{t-h})}}$.

$ACF(x_{t},x_{t-h}) = \frac{Covariance(x_{t},x_{t-h})}{Var(x_{t})}$.

### 时间序列的自回归模型（AutoRegression Model）

#### AR(1) 模型

AR(1) 模型指的是时间序列 $\{x_{t}\}_{t\geq 0}$ 在时间戳 $t$ 时刻的取值 $x_{t}$ 与时间戳 $t - 1$ 时刻的取值 $x_{t-1}$ 相关，其公式就是：

$x_{t}=\delta+\phi_{1}x_{t-1}+w_{t}$

1. $w_{t}\sim N(0,\sigma_{w}^{2})$，并且 $w_{t}$ 满足 iid 条件。其中 $N(0,\sigma_{w}^{2})$ 表示 Gauss 正态分布，它的均值是0，方差是 $\sigma_{w}^{2}$
2. $w_{t}$$x_{t}$ 是相互独立的（independent）。
3. $x_{0},x_{1},\cdots$弱平稳的，i.e. 必须满足 $|\phi_{1}|<1$

1. $E(x_{t}) = \delta/(1-\phi_{1})$.
2. $Var(x_{t}) = \sigma_{w}^{2}/(1-\phi_{1}^{2})$.
3. $Covariance(x_{t},x_{t-h}) = \phi_{1}^{h}$.

Proof of 1. 从 AR(1) 的模型出发，可以得到

$E(x_{t}) = E(\delta + \phi_{1}x_{t-1}+w_{t}) = \delta + \phi_{1}E(x_{t-1}) = \delta + \phi_{1}E(x_{t})$,

Proof of 2. 从 AR(1) 的模型出发，可以得到

$Var(x_{t}) = Var(\delta + \phi_{1}x_{t-1}+w_{t})$

$= \phi_{1}^{2}Var(x_{t-1}) +Var(w_{t}) = \phi_{1}^{2}Var(x_{t}) + \sigma_{w}^{2}$,

Proof of 3.$\mu = E(x_{t}), \forall t\geq 0$. 从 $x_{t}$ 的定义出发，可以得到：

$x_{t}-\mu = \phi_{1}(x_{t-1}-\mu)+w_{t}$

$= \phi_{1}^{h}(x_{t-h}-\mu) + \phi_{1}^{h-1}w_{t-h+1}+\cdots+\phi_{1}w_{t-1}+w_{t},$

$\rho_{h} = Covariance(x_{t},x_{t-h}) = \frac{E((x_{t}-\mu)\cdot(x_{t-h}-\mu))}{Var(x_{t})}=\phi_{1}^{h}$.

#### AR(1) 模型与一维动力系统

$f(x) = \phi_{1}x + \delta,$

Method 1.

$f^{n}(x) = \phi_{1}^{n}x+ \frac{1-\phi_{1}^{n}}{1-\phi_{1}}\delta$,

$n\rightarrow \infty$，可以得到 $f^{n}(x)\rightarrow \delta/(1-\phi_{1})$。这与 $E(x_{t}) = \delta/(1-\phi_{1})$ 其实是保持一致的。

Method 2.

$f(x)-\frac{\delta}{1-\phi_{1}} = \phi_{1}(x-\frac{\delta}{1-\phi_{1}})$

$\implies |f(x)-\frac{\delta}{1-\phi_{1}}| <\frac{1+|\phi_{1}|}{2}\cdot|x-\frac{\delta}{1-\phi_{1}}|$

$\implies |f^{n}(x)-\frac{\delta}{1-\phi_{1}}|<\bigg(\frac{1+|\phi_{1}|}{2}\bigg)^{n}\cdot|x-\frac{\delta}{1-\phi_{1}}|$.

$f(x)-\frac{\delta}{1-\phi_{1}} = \phi_{1}(x-\frac{\delta}{1-\phi_{1}})$

$\implies |f(x)-\frac{\delta}{1-\phi_{1}}| >\frac{1+|\phi_{1}|}{2}\cdot|x-\frac{\delta}{1-\phi_{1}}|$

$\implies |f^{n}(x)-\frac{\delta}{1-\phi_{1}}|>\bigg(\frac{1+|\phi_{1}|}{2}\bigg)^{n}\cdot|x-\frac{\delta}{1-\phi_{1}}|$.

#### AR(p) 模型

1. AR(1) 模型形如：

$x_{t}=\delta+\phi_{1}x_{t-1}+w_{t}.$

2. AR(2) 模型形如：

$x_{t} = \delta + \phi_{1}x_{t-1}+\phi_{2}x_{t-2}+w_{t}.$

3. AR(p) 模型形如：

$x_{t} = \delta + \phi_{1}x_{t-1}+\phi_{2}x_{t-2}+\cdots+\phi_{p}x_{t-p}+w_{t}.$

#### AR(p) 模型的稳定性 －－－ 基于线性代数

$x_{t}= \phi_{1}x_{t-1} + \phi_{2}x_{t-2}$.

$\left( \begin{array}{c} x_{t+2} \\ x_{t+1} \\ \end{array}\right)= \left( \begin{array}{cc} \phi_{1} & \phi_{2} \\ 1 & 0 \\ \end{array}\right) \left( \begin{array}{c} x_{t+1} \\ x_{t} \\ \end{array}\right) = A \left( \begin{array}{c} x_{t+1} \\ x_{t} \\ \end{array}\right).$

$x_{t} = \phi_{1}x_{t-1}+\phi_{2}x_{t-2}+\cdots+\phi_{p}x_{t-p}.$

$\left(\begin{array}{c} x_{t+p} \\ x_{t+p-1} \\ \vdots \\ x_{t+1}\\ \end{array}\right) = \left(\begin{array}{ccccc} \phi_{1} & \phi_{2} &\cdots & \phi_{p-1} & \phi_{p} \\ 1 & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \\ \end{array}\right) \left(\begin{array}{c} x_{t+p-1} \\ x_{t+p-2} \\ \vdots \\ x_{t} \\ \end{array}\right) = A \left(\begin{array}{c} x_{t+p-1} \\ x_{t+p-2} \\ \vdots \\ x_{t} \\ \end{array}\right)$

$\lambda^{p}-\phi_{1}\lambda^{p-1}-\phi_{2}\lambda^{p-2}-\cdots-\phi_{p}=0.$

$x_{t} = \phi_{1}x_{t-1}+\phi_{2}x_{t-2}+\cdots+\phi_{p}x_{t-p}$

# 时间序列的搜索

### 基于动态规划的相似度计算方法

#### DTW 的经典算法

$W^{*} = argmin_{W}(\sqrt{\sum_{k=1}^{K}w_{k}})$.

$DTW(i,j)$

$= dist(q_{i},c_{j}) + \min(DTW(i-1,j-1), DTW(i-1,j), DTW(i,j-1)).$

1. 如果 $dist(q_{i},c_{j}) = (q_{i}-c_{j})^{2}$ 时，则 $\sqrt{DTW(n,n)}$ 表示最后的距离；
2. 如果 $dist(q_{i},c_{j}) = |q_{i}-c_{j}|$ 时，则 $DTW(n,n)$ 表示最后的距离。
3. 如果 $dist(q_{i},c_{j}) = |q_{i}-c_{j}|^{p}$ 时，则 $(DTW(n,n))^{1/p}$ 表示最后的距离。

Remark.

DTW 算法不满足三角不等式。例如：$x={0}, y={1,2}, z={1,2,2}$，则

$DTW(x,z)=5>DTW(x,y)+DTW(y,z) = 3 + 0 =3.$

#### DTW 的加速算法

$DTW(i,j) = dist(q_{i},c_{j}) + \min(DTW(i-1,j-1), DTW(i-1,j), DTW(i,j-1)),$

### 相似时间序列的搜索

Question 1. 给定一条时间序列 $Q$ 和一个时间序列的数据库 $D=\{C_{i}\}_{i=1}^{\infty}$。通过某种相似度或者距离计算方法，计算出给定的时间序列 $Q$ 与时间序列数据库中 $D$ 中最相似的时间序列。

Question 2. 给定一条时间序列 $Q$ 和一个时间序列的数据库 $D=\{C_{i}\}_{i=1}^{\infty}$，以及正整数 $K$。从数据库 $D$ 中寻找与给定的时间序列 $Q$ 最相似的 $K$ 条时间序列。

#### DTW 算法的下界 LB_Kim

$LBKim(Q,C)$

$= \max\{|\max(Q)-\max(C)|,|\min(Q)-\min(C)|,|First(Q)-First(C)|,|Last(Q)-Last(C)| \}.$

#### DTW 算法的下界 LB_Keogh

（1）首先定义时间序列 Q 的上下界。i.e. $Q:\{q_{1},\cdots,q_{n}\}$，给定一个窗口的取值 r，得到 $U_{i}=\max(q_{i-r},q_{i+r})$$L_{i} = \min(q_{i-r},q_{i+r})$

（2）定义公式：

$LBKeogh(Q,C)$

$= \sqrt{\sum_{i=1}^{n}(c_{i}-U_{i})^{2}I_{\{c_{i}>U_{i}\}} + \sum_{i=1}^{n}(c_{i}-L_{i})^{2}I_{\{c_{i}

Lower Bounding Sequential Scan(Q)
best_so_far = infinity
for all sequences in database
LB_dist = lower_bound_distance(C_{i},Q)
if LB_dist < best_so_far
true_dist = DTW(C_{i},Q)
if true_dist < best_so_far
best_so_far = true_dist
index_of_best_match = i
endif
endif
endfor

# 如何理解时间序列？— 从Riemann积分和Lebesgue积分谈起

Riemann 积分和 Lebesgue 积分是数学中两个非常重要的概念。本文将会从 Riemann 积分和 Lebesgue 积分的定义出发，介绍它们各自的性质和联系。

## 积分

### Riemann 积分

Riemann 积分虽然被称为 Riemann 积分，但是在 Riemann 之前就有学者对这类积分进行了详细的研究。早在阿基米德时代，阿基米德为了计算曲线 $x^{2}$ 在 [0,1] 区间上与 X 坐标轴所夹的图形面积，就使用了 Riemann 积分的思想。 他把 [0,1] 区间等长地切割成 n 段，每一段使用一个长方形去逼近 $x^{2}$ 这条曲线的分段面积，再把 n 取得很大，所以得到当 n 趋近于无穷的时候，就知道该面积其实是 1/3。

$\sum_{i=0}^{n-1}f(t_{i})(x_{i+1}-x_{i}).$

$|\sum_{i=0}^{n-1}f(t_{i})(x_{i+1}-x_{i}) - s|<\epsilon.$

### Lebesgue 积分

Riemann 积分是为了计算曲线与 X 轴所围成的面积，而 Lebesgue 积分也是做同样的事情，但是计算面积的方法略有不同。要想直观的解释两种积分的原理，可以参见下图：

Riemann 积分是把一条曲线的底部分成等长的区间，测量每一个区间上的曲线高度，所以总面积就是这些区间与高度所围成的面积和。

Lebesgue 积分是把曲线化成等高线图，每两根相邻等高线的差值是一样的。每根等高线之内含有它所圈着的长度，因此总面积就是这些等高线内的面积之和。

1. Riemann 积分：从一个角落开始一口一口吃，每口都包含所有的配料；
2. Lebesgue 积分：从最上层开始吃，按照“面包-配菜-肉-蛋-面包”的节奏，一层一层来吃。

$\int(\sum_{k}a_{k}1_{S_{k}})d\mu = \sum_{k}a_{k}\int 1_{S_{k}}d\mu = \sum_{k}a_{k}\mu(S_{k}).$

$\int_{E}f d\mu = \sup\{\int_{E}sd\mu: \bold{0}\leq s\leq f\}$,

$\int fd\mu = \int f^{+}d\mu - \int f^{-}d\mu.$.

### Riemann 积分与Lebesgue 积分的关系

$(R)\int_{a}^{b}f(x)dx = (L)\int_{[a,b]}f(x)dx$.

1. $x$ 是有理数时，$D(x) = 1$
2. $x$ 是无理数时，$D(x) = 0$.

Dirichlet 函数是定义在实数轴的函数，并且值域是 $\{0,1\}$，无法画出函数图像，它不是 Riemann 可积的，但是它 Lebesgue 可积。

## 时间序列

### 时间序列的表示 — 基于 Riemann 积分

1. 分段线性逼近（Piecewise Linear Approximation）
2. 分段聚合逼近（Piecewise Aggregate Approximation）
3. 分段常数逼近（Piecewise Constant Approximation）

#### 分段聚合逼近（Piecewise Aggregate Approximation）— 类似 Riemann 积分

$\overline{x}_{i} = \frac{w}{N} \cdot \sum_{j=\frac{N}{w}(i-1)+1}^{\frac{N}{w}i} x_{j}$.

#### 符号特征（Symbolic Approximation）— 类似用简单函数来计算 Lebesgue 积分

SAX 方法的流程如下：

1. 正规化（normalization）：把原始的时间序列映射到一个新的时间序列，新的时间序列满足均值为零，方差为一的条件。
2. 分段表示（PAA）：$\{x_{1},\cdots, x_{N}\} \Rightarrow \{\overline{x}_{1},\cdots,\overline{x}_{w}\}$
3. 符号表示（SAX）：如果 $\overline{x}_{i}，那么 $\hat{X}_{i}=l_{1}$；如果 $z_{(j-1)/\alpha}\leq \overline{x}_{i}，那么 $\hat{X}_{i} = l_{j}$，在这里 $2\leq j\leq \alpha$；如果 $\overline{x}_{i}\geq z_{(\alpha-1)/\alpha}$，那么 $\hat{X}_{i} = l_{\alpha}$

### 时间序列的表示 — 基于 Lebesgue 积分

#### 熵（Entropy）

$\text{entropy}(X) = -\sum_{i=1}^{\infty}P\{x=x_{i}\}\ln(P\{x=x_{i}\})$.

#### 分桶熵（Binned Entropy）

$\text{binned entropy}(X) = -\sum_{k=0}^{\min(maxbin, len(X))} p_{k}\ln(p_{k})\cdot 1_{(p_{k}>0)},$

# 时间序列的相似性

#### 度量空间

1. $d(x,y)\geq 0$，并且 $d(x,y)=0$ 当且仅当 $x=y$;
2. $d(x,y)=d(y,x)$，也就是满足对称性；
3. $d(x,z)\leq d(x,y)+d(y,z)$，也就是三角不等式。

Remark.

#### 内积空间

1. 对于任意的 $x,y\in V$，有 $ =\overline{}.$

2. 共轭双线性形式指的是：

$\forall a\in F, \forall x,y\in V, =a,$

$\forall x,y,z\in F, = + .$

$\forall b\in F, \forall x,y\in V, = \overline{b},$

$\forall x,y,z\in F, = +.$

3. 非负性：$\forall x\in V, \geq 0.$

4. 非退化：从 $V$ 到对偶空间 $V^{*}$ 的映射：$x\mapsto$ 是同构映射。

### 基于欧几里德距离的相似度计算

$d_{L^{1}}(X_{T},Y_{T}) = \sum_{t=1}^{T}|x_{t}-y_{t}|,$

$d_{L^{p}}(X_{T}, Y_{T}) = (\sum_{t=1}^{T}|x_{t}-y_{t}|^{p})^{1/p},$

$d_{L^{2}}(X_{T}, Y_{T}) = (\sum_{t=1}^{T}|x_{t}-y_{t}|^{2})^{1/2},$

$d_{L^{\infty}}(X_{T},Y_{T}) = \max_{1\leq t\leq T}|x_{t}-y_{t}|.$

### 基于相关性的相似度计算方法

#### Pearson 系数（Pearson Coefficient）

$\text{COR}(X_{T},Y_{T}) = \frac{\sum_{t=1}^{T}(x_{t}-\overline{X}_{T})\cdot(y_{t}-\overline{Y}_{T})}{\sqrt{\sum_{t=1}^{T}(x_{t}-\overline{X}_{T})^{2}}\cdot\sqrt{\sum_{t=1}^{T}(y_{t}-\overline{Y}_{T})^{2}}},$

$\overline{X}_{T} = \sum_{t=1}^{T}x_{t}/T$, $\overline{Y}_{T} = \sum_{t=1}^{T}y_{t}/T$

Pearson 系数的性质如下：

1. 如果两条时间序列 $X_{T} = Y_{T}$，则 $\text{COR}(X_{T},Y_{T}) =1$ 表是它们是完全一致的，如果两条时间序列 $X_{T} = -Y_{T}$，则 $\text{COR}(X_{T},Y_{T}) = -1$ 表示它们之间是负相关的。
2. $-1\leq \text{COR}(X_{T},Y_{T})\leq 1$.

$d_{COR,1}(X_{T},Y_{T}) = \sqrt{2\cdot(1-COR(X_{T},Y_{T}))},$

$d_{COR,2}(X_{T},Y_{T}) = \sqrt{\big(\frac{1-COR(X_{T},Y_{T})}{1+COR(X_{T},Y_{T})}\big)^{\beta}},$

#### The First Order Temporal Correlation Coefficient

$\text{CORT}(X_{T},Y_{T}) = \frac{\sum_{t=1}^{T-1}(x_{t+1}-x_{t})\cdot(y_{t+1}-y_{t})}{\sqrt{\sum_{t=1}^{T-1}(x_{t+1}-x_{t})^{2}}\cdot\sqrt{\sum_{t=1}^{T-1}(y_{t+1}-y_{t})^{2}}},$

$\text{CORT}(X_{T},Y_{T})$ 的性质：

1. $-1\leq \text{CORT}(X_{T},Y_{T}) \leq 1$
2. $\text{CORT}(X_{T},Y_{T}) =1$ 表示两条时间序列持有类似的趋势， 它们会同时上涨或者下跌，并且涨幅或者跌幅也是类似的。
3. $\text{CORT}(X_{T},Y_{T})=-1$ 表示两条时间序列的上涨和下跌趋势恰好相反。
4. $\text{CORT}(X_{T},Y_{T})=0$ 表示两条时间序列在单调性方面没有相关性。

$d_{CORT}(X_{T},Y_{T}) = \phi_{k}[CORT(X_{T},Y_{T})]\cdot d(X_{T},Y_{T}),$

$\phi_{k}(u) = 2/(1+\exp(ku)), k\geq 0$

#### 基于自相关系数的距离（Autocorrelation-based distance)

$\hat{\rho}_{k} = \frac{1}{(T-k)\sigma^{2}}\sum_{t=1}^{T-k}(x_{t}-\mu)\cdot(x_{t+k}-\mu)$,

$\hat{\rho}_{X_{T}} = (\hat{\rho}_{1,X_{T}},\cdots,\hat{\rho}_{L,X_{T}})^{T}\in \mathbb{R}^{L},$

$\hat{\rho}_{Y_{T}} = (\hat{\rho}_{1,Y_{T}},\cdots,\hat{\rho}_{L,Y_{T}})^{T}\in\mathbb{R}^{L}.$

$d_{ACF}(X_{T},Y_{T}) = \sqrt{(\hat{\rho}_{X_{T}}-\hat{\rho}_{Y_{T}})^{T}\Omega(\hat{\rho}_{X_{T}}-\hat{\rho}_{Y_{T}})}$.

（1）$\Omega = I_{L}$ 表示单位矩阵。用公式表示就是

$d_{ACFU}(X_{T},Y_{T}) =\sqrt{\sum_{i=1}^{L}(\hat{\rho}_{i,X_{T}}-\hat{\rho}_{i,Y_{T}})^{2}}$.

（2）$\Omega = diag\{p(1-p),p(1-p)^{2},\cdots,p(1-p)^{L}\}$ 表示一个 $L\times L$ 的对角矩阵，其中 $0。此时相当于一个带权重的求和公式。

$d_{ACFU}(X_{T},Y_{T}) =\sqrt{\sum_{i=1}^{L}p(1-p)^{i}(\hat{\rho}_{i,X_{T}}-\hat{\rho}_{i,Y_{T}})^{2}}$.

### 基于周期性的相似度计算方法

$I_{X_{T}}(\lambda_{k}) = T^{-1}|\sum_{t=1}^{T}x_{t}e^{-i\lambda_{k}t}|^{2}$,

$I_{Y_{T}}(\lambda_{k}) = T^{-1}|\sum_{t=1}^{T}y_{t}e^{-i\lambda_{k}t}|^{2}$.

（1）用原始的特征来表示距离：

$d_{P}(X_{T},Y_{T}) = \frac{1}{n}\sqrt{\sum_{k=1}^{n}(I_{X_{T}}(\lambda_{k})-I_{Y_{T}}(\lambda_{k}))^{2}}$.

（2）用正则化之后的特征来描述就是：

$d_{P}(X_{T},Y_{T}) = \frac{1}{n}\sqrt{\sum_{k=1}^{n}(NI_{X_{T}}(\lambda_{k})-NI_{Y_{T}}(\lambda_{k}))^{2}}$,

（3）用取对数之后的特征表示：

$d_{LNP}(X_{T},Y_{T}) = \frac{1}{n}\sqrt{\sum_{k=1}^{n}(\ln NI_{X_{T}}(\lambda_{k})-\ln NI_{Y_{T}}(\lambda_{k}))^{2}}$.

### 基于模型的相似度计算

#### Piccolo 距离

$ARMA(p,q)$ 模型有自己的 AR 表示，因此可以得到相应的一组参数 $(\pi_{1},\pi_{2},\cdots)$，所以，对于每一条时间序列，都可以用一组最优的参数去逼近。如果

$\hat{\prod}_{X_{T}}=(\hat{\pi}_{1,X_{T}},\cdots,\hat{\pi}_{k_{1},X_{T}})^{T},$

$\hat{\prod}_{X_{T}}=(\hat{\pi}_{1,X_{T}},\cdots,\hat{\pi}_{k_{1},X_{T}})^{T}$

$d_{PIC}(X_{T},Y_{T}) =\sqrt{\sum_{j=1}^{k}(\hat{\pi}_{j,X_{T}}'-\hat{\pi}_{j,Y_{T}}')^{2}}$,

#### Maharaj 距离

$d_{MAH}(X_{T},Y_{T}) =\sqrt{T}(\hat{\prod}'_{X_{T}}-\hat{\prod}'_{Y_{T}})^{T}\hat{V}^{-1}(\hat{\prod}'_{X_{T}}-\hat{\prod}'_{Y_{T}}).$

#### 基于 Cepstral 的距离

$\psi_{1}=\phi_{1}$

$\psi_{h}=\phi_{h}+\sum_{m=1}^{h-1}(\phi_{m}-\psi_{h-m})$$1

$\psi_{h}=\sum_{m=1}^{p}(1-\frac{m}{h})\phi_{m}\psi_{h-m}$$p

$d_{LPC, Cep}(X_{T},Y_{T}) =\sqrt{\sum_{i=1}^{T}(\psi_{i,X_{T}}-\psi_{i,Y_{T}})^{2}}$.

# 时间序列的表示与信息提取

### 时间序列的统计特征

$\mu=\frac{1}{T}\sum_{i=1}^{T}x_{i},$

$\sigma^{2} = \sum_{i=1}^{T}\frac{1}{T}(x_{i}-\mu)^{2},$

$\text{skewness}(X) = E[(\frac{X-\mu}{\sigma})^{3}]=\frac{1}{T}\sum_{i=1}^{T}\frac{(x_{i}-\mu)^{3}}{\sigma^{3}},$

$\text{kurtosis}(X) = E[(\frac{X-\mu}{\sigma})^{4}]=\frac{1}{T}\sum_{i=1}^{T}\frac{(x_{i}-\mu)^{4}}{\sigma^{4}} .$

### 时间序列的熵特征

$\text{entropy}(X) = -\sum_{i=1}^{\infty}P\{x=x_{i}\}\ln(P\{x=x_{i}\}).$

#### Binned Entropy

$\text{binned entropy}(X) = -\sum_{k=0}^{\min(maxbin, len(X))} p_{k}\ln(p_{k})\cdot 1_{(p_{k}>0)},$

#### Approximate Entropy

Step 1. 固定两个参数，正整数 $m$ 和正数 $r$，正整数 $m$ 是为了把时间序列进行一个片段的提取，正数 $r$ 是表示时间序列距离的某个参数。i.e. 需要构造新的 $m$ 维向量如下：

$X_{1}(m) = (x_{1},\cdots, x_{m})\in\mathbb{R}^{m},$

$X_{i}(m) = (x_{i},\cdots, x_{m+i-1})\in\mathbb{R}^{m},$

$X_{N-m+1}(m) = (x_{N-m+1},\cdots, x_{N})\in\mathbb{R}^{m}.$

Step 2. 通过新的向量 $X_{1}(m),\cdots, X_{N-m+1}(m)$，可以计算出哪些向量与 $X_{i}$ 较为相似。i.e.

$C_{i}^{m}(r) = (\text{number of }X_{j}(m)\text{ such that } d(X_{i}(m), X_{j}(m))\leq r)/(N-m+1),$

Step 3. 考虑函数

$\Phi^{m}(r) = (N-m+1)^{-1}\cdot \sum_{i=1}^{N-m+1} \ln(C_{i}^{m}(r)),$

Step 4. Approximate Entropy 可以定义为：

$\text{ApEn}(m,r) = \Phi^{m}(r)-\Phi^{m+1}(r).$

Remark.

1. 正整数 $m$ 一般可以取值为 2 或者 3，$r>0$ 会基于具体的时间序列具体调整；
2. 如果某条时间序列具有很多重复的片段（repetitive pattern）或者自相似性（self-similarity pattern），那么它的 Approximate Entropy 就会相对小；反之，如果某条时间序列几乎是随机出现的，那么它的 Approximate Entropy 就会相对较大。

#### Sample Entropy

$A = \#\{\text{vector pairs having } d(X_{i}(m+1),X_{j}(m+1))

$B = \#\{ \text{vector pairs having } d(X_{i}(m), X_{j}(m))

$\text{SampEn} = -\ln(A/B) \geq 0.$

Remark.

1. Sample Entropy 总是非负数；
2. Sample Entropy 越小表示该时间序列具有越强的自相似性（self similarity）。
3. 通常来说，在 Sample Entropy 的参数选择中，可以选择 $m = 2, r = 0.2 \cdot std$.

### 时间序列的分段特征

1. 分段线性逼近（Piecewise Linear Approximation）
2. 分段聚合逼近（Piecewise Aggregate Approximation）
3. 分段常数逼近（Piecewise Constant Approximation）

#### 分段聚合逼近（Piecewise Aggregate Approximation）— 类似 Riemann 积分

$\overline{x}_{i} = \frac{w}{N} \cdot \sum_{j=\frac{N}{w}(i-1)+1}^{\frac{N}{w}i} x_{j}$.

#### 符号逼近（Symbolic Approximation）— 类似 Riemann 积分

SAX 方法的流程如下：

Step 1. 正规化（normalization）：也就是该时间序列被映射到均值为零，方差为一的区间内。

Step 2. 分段表示（PAA）：$\{x_{1},\cdots, x_{N}\} \Rightarrow \{\overline{x}_{1},\cdots,\overline{x}_{w}\}$

Step 3. 符号表示（SAX）：如果 $\overline{x}_{i}，那么 $\hat{X}_{i}=l_{1}$；如果 $z_{(j-1)/\alpha}\leq \overline{x}_{i}，那么 $\hat{X}_{i} = l_{j}$；如果 $\overline{x}_{i}\geq z_{(\alpha-1)/\alpha}$，那么 $\hat{X}_{i} = l_{\alpha}$

# How to Convert a Time Series to a Supervised Learning Problem in Python

https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/

Machine learning methods like deep learning can be used for time series forecasting.

Before machine learning can be used, time series forecasting problems must be re-framed as supervised learning problems. From a sequence to pairs of input and output sequences.

In this tutorial, you will discover how to transform univariate and multivariate time series forecasting problems into supervised learning problems for use with machine learning algorithms.

After completing this tutorial, you will know:

• How to develop a function to transform a time series dataset into a supervised learning dataset.
• How to transform univariate time series data for machine learning.
• How to transform multivariate time series data for machine learning.

Let’s get started.

## Time Series vs Supervised Learning

Before we get started, let’s take a moment to better understand the form of time series and supervised learning data.

A time series is a sequence of numbers that are ordered by a time index. This can be thought of as a list or column of ordered values.

For example:

A supervised learning problem is comprised of input patterns (X) and output patterns (y), such that an algorithm can learn how to predict the output patterns from the input patterns.

For example:

For more on this topic, see the post:

## Pandas shift() Function

A key function to help transform time series data into a supervised learning problem is the Pandas shift() function.

Given a DataFrame, the shift() function can be used to create copies of columns that are pushed forward (rows of NaN values added to the front) or pulled back (rows of NaN values added to the end).

This is the behavior required to create columns of lag observations as well as columns of forecast observations for a time series dataset in a supervised learning format.

Let’s look at some examples of the shift function in action.

We can define a mock time series dataset as a sequence of 10 numbers, in this case a single column in a DataFrame as follows:

Running the example prints the time series data with the row indices for each observation.

We can shift all the observations down by one time step by inserting one new row at the top. Because the new row has no data, we can use NaN to represent “no data”.

The shift function can do this for us and we can insert this shifted column next to our original series.

Running the example gives us two columns in the dataset. The first with the original observations and a new shifted column.

We can see that shifting the series forward one time step gives us a primitive supervised learning problem, although with X and y in the wrong order. Ignore the column of row labels. The first row would have to be discarded because of the NaN value. The second row shows the input value of 0.0 in the second column (input or X) and the value of 1 in the first column (output or y).

We can see that if we can repeat this process with shifts of 2, 3, and more, how we could create long input sequences (X) that can be used to forecast an output value (y).

The shift operator can also accept a negative integer value. This has the effect of pulling the observations up by inserting new rows at the end. Below is an example:

Running the example shows a new column with a NaN value as the last value.

We can see that the forecast column can be taken as an input (X) and the second as an output value (y). That is the input value of 0 can be used to forecast the output value of 1.

Technically, in time series forecasting terminology the current time (t) and future times (t+1, t+n) are forecast times and past observations (t-1, t-n) are used to make forecasts.

We can see how positive and negative shifts can be used to create a new DataFrame from a time series with sequences of input and output patterns for a supervised learning problem.

This permits not only classical X -> y prediction, but also X -> Y where both input and output can be sequences.

Further, the shift function also works on so-called multivariate time series problems. That is where instead of having one set of observations for a time series, we have multiple (e.g. temperature and pressure). All variates in the time series can be shifted forward or backward to create multivariate input and output sequences. We will explore this more later in the tutorial.

## The series_to_supervised() Function

We can use the shift() function in Pandas to automatically create new framings of time series problems given the desired length of input and output sequences.

This would be a useful tool as it would allow us to explore different framings of a time series problem with machine learning algorithms to see which might result in better performing models.

In this section, we will define a new Python function named series_to_supervised() that takes a univariate or multivariate time series and frames it as a supervised learning dataset.

The function takes four arguments:

• data: Sequence of observations as a list or 2D NumPy array. Required.
• n_in: Number of lag observations as input (X). Values may be between [1..len(data)] Optional. Defaults to 1.
• n_out: Number of observations as output (y). Values may be between [0..len(data)-1]. Optional. Defaults to 1.
• dropnan: Boolean whether or not to drop rows with NaN values. Optional. Defaults to True.

The function returns a single value:

• return: Pandas DataFrame of series framed for supervised learning.

The new dataset is constructed as a DataFrame, with each column suitably named both by variable number and time step. This allows you to design a variety of different time step sequence type forecasting problems from a given univariate or multivariate time series.

Once the DataFrame is returned, you can decide how to split the rows of the returned DataFrame into X and y components for supervised learning any way you wish.

The function is defined with default parameters so that if you call it with just your data, it will construct a DataFrame with t-1 as X and t as y.

The function is confirmed to be compatible with Python 2 and Python 3.

The complete function is listed below, including function comments.

Can you see obvious ways to make the function more robust or more readable?

Now that we have the whole function, we can explore how it may be used.

## One-Step Univariate Forecasting

It is standard practice in time series forecasting to use lagged observations (e.g. t-1) as input variables to forecast the current time step (t).

This is called one-step forecasting.

The example below demonstrates a one lag time step (t-1) to predict the current time step (t).

Running the example prints the output of the reframed time series.

We can see that the observations are named “var1” and that the input observation is suitably named (t-1) and the output time step is named (t).

We can also see that rows with NaN values have been automatically removed from the DataFrame.

We can repeat this example with an arbitrary number length input sequence, such as 3. This can be done by specifying the length of the input sequence as an argument; for example:

The complete example is listed below.

Again, running the example prints the reframed series. We can see that the input sequence is in the correct left-to-right order with the output variable to be predicted on the far right.

## Multi-Step or Sequence Forecasting

A different type of forecasting problem is using past observations to forecast a sequence of future observations.

This may be called sequence forecasting or multi-step forecasting.

We can frame a time series for sequence forecasting by specifying another argument. For example, we could frame a forecast problem with an input sequence of 2 past observations to forecast 2 future observations as follows:

The complete example is listed below:

Running the example shows the differentiation of input (t-n) and output (t+n) variables with the current observation (t) considered an output.

## Multivariate Forecasting

Another important type of time series is called multivariate time series.

This is where we may have observations of multiple different measures and an interest in forecasting one or more of them.

For example, we may have two sets of time series observations obs1 and obs2 and we wish to forecast one or both of these.

We can call series_to_supervised() in exactly the same way.

For example:

Running the example prints the new framing of the data, showing an input pattern with one time step for both variables and an output pattern of one time step for both variables.

Again, depending on the specifics of the problem, the division of columns into X and Y components can be chosen arbitrarily, such as if the current observation of var1 was also provided as input and only var2 was to be predicted.

You can see how this may be easily used for sequence forecasting with multivariate time series by specifying the length of the input and output sequences as above.

For example, below is an example of a reframing with 1 time step as input and 2 time steps as forecast sequence.

Running the example shows the large reframed DataFrame.

Experiment with your own dataset and try multiple different framings to see what works best.

## Summary

In this tutorial, you discovered how to reframe time series datasets as supervised learning problems with Python.

Specifically, you learned:

• About the Pandas shift() function and how it can be used to automatically define supervised learning datasets from time series data.
• How to reframe a univariate time series into one-step and multi-step supervised learning problems.
• How to reframe multivariate time series into one-step and multi-step supervised learning problems.

Do you have any questions?