Category Archives: Time Series

Time Series Representation and Information Extraction

When people think of a time series, what comes to mind is a sequence of data points ordered by time. Behind that string of numbers, however, lies its own special meaning, so how to represent a time series (Representation) and how to extract information from it (Information Extraction) have become key problems in time series research.

In my own experience, many ideas in time series analysis are quite similar to those in text mining. A sentence is usually composed of various words, generally in a subject-verb-object structure, so people have sought to describe each word with a mathematical vector. The most classical approach is one-hot encoding: every word in the dictionary is assigned a unique index, e.g. 0, 1, 2, and is represented by a row vector whose length equals the size of the dictionary, with a single entry equal to 1 and all others 0, i.e. of the form (0,…,0,1,0,…,0). Since vocabularies are usually very large, how to represent a word with a lower-dimensional vector becomes a key problem. A few years ago Google open-sourced the Word2vec framework, which describes each word with a dense vector whose length can be chosen freely, typically between 100 and 1000, thereby converting the original one-hot encoding into a vector in a low-dimensional space. With such vectors, many classical machine learning algorithms, including classification, regression, and clustering, can be applied to text at scale. Word2vec uses a neural network to capture the relationship between each word and its surrounding words, and thus represents each word with a low-dimensional vector. The feature extraction methods for time series introduced here differ somewhat from word2vec; the following sections present these techniques one by one.

 

Statistical Features of a Time Series

When statistical features of a time series come up, most people think of the maximum (max), minimum (min), mean, median, variance, and standard deviation. Statistics textbooks usually introduce two further indicators: skewness and kurtosis. Writing X_{T} = \{x_{1},\cdots,x_{T}\} for a time series of length T, these statistical features can be expressed as:

\mu=\frac{1}{T}\sum_{i=1}^{T}x_{i},

\sigma^{2} = \frac{1}{T}\sum_{i=1}^{T}(x_{i}-\mu)^{2},

\text{skewness}(X) = E[(\frac{X-\mu}{\sigma})^{3}]=\frac{1}{T}\sum_{i=1}^{T}\frac{(x_{i}-\mu)^{3}}{\sigma^{3}},

\text{kurtosis}(X) = E[(\frac{X-\mu}{\sigma})^{4}]=\frac{1}{T}\sum_{i=1}^{T}\frac{(x_{i}-\mu)^{4}}{\sigma^{4}} .

where \mu and \sigma denote the mean and standard deviation of the time series X_{T}, respectively.
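These statistics are straightforward to compute in Python; a minimal sketch using NumPy and SciPy (the example series is arbitrary):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 1.5, 3.0, 2.5, 4.0])  # an arbitrary example series

mu = np.mean(x)                          # mean
sigma2 = np.var(x)                       # variance (1/T convention, as above)
skew = stats.skew(x)                     # third standardized moment
kurt = stats.kurtosis(x, fisher=False)   # fourth standardized moment (not excess kurtosis)
print(mu, sigma2, skew, kurt)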

 

Entropy Features of a Time Series

Why study the entropy of a time series? Consider the following two time series:

Time series (1): (1, 2, 1, 2, 1, 2, 1, 2, 1, 2, …)

Time series (2): (1, 1, 2, 1, 2, 2, 2, 2, 1, 1, …)

In time series (1), the values 1 and 2 alternate, while in time series (2) they appear at random. Series (1) is therefore more deterministic and series (2) more random. Yet the statistical features of the two series, such as the mean, variance, and median, are almost identical, which shows that the statistical features above cannot accurately distinguish these two kinds of series.

Generally speaking, entropy is a good indicator for describing determinism versus randomness. For a discrete space, the entropy of a system can be written as:

\text{entropy}(X) = -\sum_{i=1}^{\infty}P\{x=x_{i}\}\ln(P\{x=x_{i}\}).

The larger the entropy of a system, the more disordered the system; the smaller the entropy, the more deterministic it is.

Several classical entropy features are used for time series: binned entropy, approximate entropy, and sample entropy. Each of them is introduced below.

Binned Entropy

Starting from the definition of entropy, we can bucket the values of the time series X_{T}. For example, the interval [\min(X_{T}), \max(X_{T})] can be divided into ten equal sub-intervals, so that the values of the series are scattered across these ten buckets. From this equal-width bucketing we can compute the entropy of the resulting probability distribution, i.e. the binned entropy is defined as:

\text{binned entropy}(X) = -\sum_{k=0}^{\min(\text{maxbin},\, \text{len}(X))} p_{k}\ln(p_{k})\cdot 1_{(p_{k}>0)},

where p_{k} is the proportion (probability) of the values of X_{T} falling into the k-th bucket, maxbin is the number of buckets, and \text{len}(X_{T}) = T is the length of the series.

If the binned entropy of a time series is large, its values are spread rather uniformly over [\min(X_{T}), \max(X_{T})]; if the binned entropy is small, its values are concentrated on some sub-interval.
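A minimal sketch of binned entropy (the function and parameter names are mine; max_bins plays the role of maxbin above):

import numpy as np

def binned_entropy(x, max_bins=10):
    # split [min(x), max(x)] into max_bins equal-width buckets
    counts, _ = np.histogram(x, bins=max_bins)
    p = counts / len(x)   # probability of falling into each bucket
    p = p[p > 0]          # only non-empty buckets contribute, as 1_{(p_k > 0)} indicates
    return -np.sum(p * np.log(p))

print(binned_entropy([1, 2, 1, 2, 1, 2, 1, 2]))  # low: values sit in only two buckets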

Approximate Entropy

Returning to the question of this section: how can we tell whether a time series follows some pattern or is simply random? This is where approximate entropy comes in. The idea of approximate entropy is to lift the one-dimensional time series into a higher-dimensional space and, by judging the distances or similarities between vectors in that space, infer whether the original series exhibits some trend or determinism. Assume the time series X_{N}: \{x_{1},\cdots, x_{N}\} has length N, and that the approximate entropy function takes two parameters, m and r. The algorithm proceeds as follows.

Step 1. Fix two parameters: a positive integer m and a positive number r. The integer m determines the length of the extracted segments, and r is a threshold on the distance between segments. That is, we construct new m-dimensional vectors as follows:

X_{1}(m) = (x_{1},\cdots, x_{m})\in\mathbb{R}^{m},

X_{i}(m) = (x_{i},\cdots, x_{m+i-1})\in\mathbb{R}^{m},

X_{N-m+1}(m) = (x_{N-m+1},\cdots, x_{N})\in\mathbb{R}^{m}.

Step 2. Using the new vectors X_{1}(m),\cdots, X_{N-m+1}(m), we can compute, for each i, the fraction of vectors that are close to X_{i}(m), i.e.

C_{i}^{m}(r) = (\text{number of }X_{j}(m)\text{ such that } d(X_{i}(m), X_{j}(m))\leq r)/(N-m+1),

Here the distance d can be any of the L^{1}, L^{2}, L^{p}, or L^{\infty} norms; in this setting d is usually taken to be the L^{\infty} norm.

Step 3. Consider the function

\Phi^{m}(r) = (N-m+1)^{-1}\cdot \sum_{i=1}^{N-m+1} \ln(C_{i}^{m}(r)),

Step 4. The approximate entropy is then defined as:

\text{ApEn}(m,r) = \Phi^{m}(r)-\Phi^{m+1}(r).

Remark.

  1. The integer m is usually taken to be 2 or 3, while r > 0 is tuned to the specific time series;
  2. If a time series contains many repetitive patterns or self-similar patterns, its approximate entropy will be relatively small; conversely, if a time series is essentially random, its approximate entropy will be relatively large.
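Collecting the four steps, a direct O(N^2) sketch of approximate entropy with the L^{\infty} distance (function and variable names are my own):

import numpy as np

def approximate_entropy(x, m=2, r=0.2):
    x = np.asarray(x, dtype=float)
    N = len(x)

    def phi(m):
        # Step 1: all windows X_i(m) = (x_i, ..., x_{m+i-1})
        windows = np.array([x[i:i + m] for i in range(N - m + 1)])
        # Step 2: C_i^m(r), the fraction of windows within L-infinity distance r of X_i(m)
        C = [np.mean(np.max(np.abs(windows - w), axis=1) <= r) for w in windows]
        # Step 3: Phi^m(r), the average of ln C_i^m(r)
        return np.mean(np.log(C))

    # Step 4: ApEn(m, r) = Phi^m(r) - Phi^{m+1}(r)
    return phi(m) - phi(m + 1)

print(approximate_entropy([1, 2, 1, 2, 1, 2, 1, 2, 1, 2]))  # near 0 for a regular series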

Sample Entropy

Besides approximate entropy, another entropy indicator for time series is sample entropy, which uses a natural logarithm to quantify whether a time series exhibits self-similarity.

Following the definitions used for approximate entropy above, two quantities A and B can be defined in terms of m and r:

A = \#\{\text{vector pairs of length } m+1 \text{ having } d(X_{i}(m+1),X_{j}(m+1))<r\},

B = \#\{\text{vector pairs of length } m \text{ having } d(X_{i}(m),X_{j}(m))<r\}.

Here \# denotes the number of elements in a set. From the definition of the metric d (whether L^{1}, L^{2}, or L^{\infty}) we have A \leq B, because whenever two vectors of length m+1 are within distance r of each other, their length-m prefixes are as well. Hence the sample entropy is always non-negative, i.e.

\text{SampEn} = -\ln(A/B) \geq 0.

Remark.

  1. Sample entropy is always non-negative;
  2. The smaller the sample entropy, the stronger the self-similarity of the time series;
  3. A common parameter choice for sample entropy is m = 2 and r = 0.2 \cdot std.
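And a matching sketch of sample entropy; unlike approximate entropy, self-matches are excluded (names are again my own, following one common convention):

import numpy as np

def sample_entropy(x, m=2, r=None):
    x = np.asarray(x, dtype=float)
    if r is None:
        r = 0.2 * np.std(x)   # the default suggested above
    N = len(x)

    def count_pairs(m):
        # number of pairs i < j with d(X_i(m), X_j(m)) < r under the L-infinity norm
        windows = np.array([x[i:i + m] for i in range(N - m + 1)])
        total = 0
        for i in range(len(windows) - 1):
            d = np.max(np.abs(windows[i + 1:] - windows[i]), axis=1)
            total += int(np.sum(d < r))
        return total

    A = count_pairs(m + 1)   # pairs of length m + 1
    B = count_pairs(m)       # pairs of length m
    return -np.log(A / B)    # always >= 0 since A <= B

print(sample_entropy([1, 2, 1, 2, 1, 2, 1, 2, 1, 2]))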

 

Piecewise Features of a Time Series

Even if two time series each exhibit some self-similarity, does that mean the two series are truly alike? The answer is no. For example, consider two time series, both of length 1000:

Time series (1): [1, 2] * 500

Time series (2): [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] * 100

Time series (1) cycles through 1 and 2, while time series (2) cycles through 1 to 10. Plotted as curves they look completely different, yet the approximate entropy and sample entropy of both are very small. The question, then, is whether we can extract information that captures the difference between them. The answer is yes.

First, recall the definitions of the Riemann integral and the Lebesgue integral and the difference between them. As the figure below illustrates, the Riemann integral computes the area under a curve by partitioning the horizontal axis into many small intervals and summing up the areas of the resulting rectangles. The Lebesgue integral does the opposite: it partitions the vertical axis into small intervals and again sums the areas of rectangles.

[Figure: Riemann integration vs. Lebesgue integration]

The earlier binned entropy scheme partitions the range of values, analogous to the Lebesgue integral. We can instead represent a time series in the spirit of the Riemann integral: some researchers split the time series into segments along the horizontal axis and represent each segment by some simple function (such as a linear function), which leads to the following methods:

  1. Piecewise Linear Approximation
  2. Piecewise Aggregate Approximation
  3. Piecewise Constant Approximation

The essential idea behind all of these algorithms is dimensionality reduction: representing the original time series with a small amount of data. In mathematical terms, data reduction for a time series means representing the original series \{x_{1},\cdots,x_{N}\} by \{x_{1}^{'},\cdots, x_{D}^{'}\}, where D<N; the latter is then a representation of the original series.

Piecewise Aggregate Approximation (PAA) — analogous to the Riemann integral

Among these methods, Piecewise Aggregate Approximation (PAA) is a classic. Suppose the original time series is C = \{x_{1},\cdots, x_{N}\}; the PAA series is defined as \overline{C} = \{\overline{x}_{1},\cdots,\overline{x}_{w}\},

where

\overline{x}_{i} = \frac{w}{N} \cdot \sum_{j=\frac{N}{w}(i-1)+1}^{\frac{N}{w}i} x_{j}.

Here 1\leq i\leq w. Pictorially:

[Figure: Piecewise Aggregate Approximation of a time series]

As for Piecewise Linear Approximation and Piecewise Constant Approximation, one only needs to modify the definition of \overline{x}_{i} slightly.
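When w divides N, each \overline{x}_{i} is just the mean of one block of N/w consecutive points, so PAA fits in a few lines (a sketch, assuming the length is divisible by w):

import numpy as np

def paa(x, w):
    # Piecewise Aggregate Approximation: average each of the w equal-length frames.
    # Assumes len(x) is divisible by w.
    x = np.asarray(x, dtype=float)
    return x.reshape(w, -1).mean(axis=1)

print(paa([1, 2, 3, 4, 5, 6], 3))  # -> [1.5, 3.5, 5.5]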

Symbolic Approximation (SAX) — analogous to the Riemann integral

In feature engineering for recommender systems, features are commonly normalized, binarized, or discretized. For example, a user's age is usually not fed in as an exact value but mapped into a bracket such as 0-6 (infancy), 7-12 (primary school), 13-17 (secondary school), or 18-22 (university).

The piecewise features obtained above are, in a sense, still continuous values. Can they be discretized further? Some researchers therefore use symbols to represent the key features of a time series, the so-called Symbolic Representation. Below we introduce the classical SAX representation.

Suppose we wish to use \alpha symbols, say \{l_{1},\cdots,l_{\alpha}\}, to represent the time series. Consider the standard normal distribution N(0,1), and let \{z_{1/\alpha},\cdots,z_{(\alpha-1)/\alpha}\} denote the points that divide the area under the Gaussian curve into \alpha equal parts (i.e. the 1/\alpha, \ldots, (\alpha-1)/\alpha quantiles).

The SAX procedure has the following steps:

Step 1. Normalization: map the time series to one with zero mean and unit variance.

Step 2. Piecewise representation (PAA): \{x_{1},\cdots, x_{N}\} \Rightarrow \{\overline{x}_{1},\cdots,\overline{x}_{w}\}.

Step 3. Symbolization (SAX): if \overline{x}_{i}<z_{1/\alpha}, then \hat{X}_{i}=l_{1}; if z_{(j-1)/\alpha}\leq \overline{x}_{i}<z_{j/\alpha}, then \hat{X}_{i} = l_{j}; and if \overline{x}_{i}\geq z_{(\alpha-1)/\alpha}, then \hat{X}_{i} = l_{\alpha}.

In this way, the original time series can be represented with the \alpha letters \{l_{1},\cdots,l_{\alpha}\}.
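The three steps fit in a few lines of Python; here is a sketch using scipy.stats.norm.ppf for the Gaussian quantiles z_{j/\alpha} (the returned integers 0..alpha-1 stand in for the letters l_1, ..., l_alpha):

import numpy as np
from scipy.stats import norm

def sax(x, w, alpha):
    x = np.asarray(x, dtype=float)
    x = (x - x.mean()) / x.std()            # Step 1: zero mean, unit variance
    x_bar = x.reshape(w, -1).mean(axis=1)   # Step 2: PAA (assumes w divides len(x))
    # Step 3: Gaussian breakpoints z_{1/alpha}, ..., z_{(alpha-1)/alpha}
    breakpoints = norm.ppf(np.arange(1, alpha) / alpha)
    # each PAA value is mapped to the index of the symbol whose band contains it
    return np.searchsorted(breakpoints, x_bar, side='right')

print(sax(np.sin(np.linspace(0, 6.28, 100)), w=10, alpha=4))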

[Figure: SAX representation of a time series]

 

Summary

This article introduced several ways to represent a time series (Representation), including statistical features, entropy features, and piecewise features. The next article will cover similarity measures for time series.

 

How to Convert a Time Series to a Supervised Learning Problem in Python

https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/

Machine learning methods like deep learning can be used for time series forecasting.

Before machine learning can be used, time series forecasting problems must be re-framed as supervised learning problems: from a single sequence to pairs of input and output sequences.

In this tutorial, you will discover how to transform univariate and multivariate time series forecasting problems into supervised learning problems for use with machine learning algorithms.

After completing this tutorial, you will know:

  • How to develop a function to transform a time series dataset into a supervised learning dataset.
  • How to transform univariate time series data for machine learning.
  • How to transform multivariate time series data for machine learning.

Let’s get started.


Time Series vs Supervised Learning

Before we get started, let’s take a moment to better understand the form of time series and supervised learning data.

A time series is a sequence of numbers that are ordered by a time index. This can be thought of as a list or column of ordered values.

For example, the ordered sequence 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 is a simple univariate time series.

A supervised learning problem is comprised of input patterns (X) and output patterns (y), such that an algorithm can learn how to predict the output patterns from the input patterns.

For example, input patterns such as X = [0, 1, 2] with the corresponding output y = 3 form a supervised learning dataset.


Pandas shift() Function

A key function to help transform time series data into a supervised learning problem is the Pandas shift() function.

Given a DataFrame, the shift() function can be used to create copies of columns that are pushed forward (rows of NaN values added to the front) or pulled back (rows of NaN values added to the end).

This is the behavior required to create columns of lag observations as well as columns of forecast observations for a time series dataset in a supervised learning format.

Let’s look at some examples of the shift function in action.

We can define a mock time series dataset as a sequence of 10 numbers, in this case a single column in a DataFrame as follows:
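The code listing here did not survive extraction, but the description pins it down; a sketch (the column name is my own):

from pandas import DataFrame

# a mock time series of 10 numbers as a single DataFrame column
df = DataFrame()
df['t'] = [x for x in range(10)]
print(df)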

Running the example prints the time series data with the row indices for each observation.

We can shift all the observations down by one time step by inserting one new row at the top. Because the new row has no data, we can use NaN to represent “no data”.

The shift function can do this for us and we can insert this shifted column next to our original series.
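Continuing the sketch, shift(1) pushes the series down one step:

from pandas import DataFrame

df = DataFrame()
df['t'] = [x for x in range(10)]
df['t-1'] = df['t'].shift(1)   # shifted copy: NaN appears in the first row
print(df)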

Running the example gives us two columns in the dataset. The first with the original observations and a new shifted column.

We can see that shifting the series forward one time step gives us a primitive supervised learning problem, although with X and y in the wrong order. Ignore the column of row labels. The first row would have to be discarded because of the NaN value. The second row shows the input value of 0.0 in the second column (input or X) and the value of 1 in the first column (output or y).

We can see how, if we repeat this process with shifts of 2, 3, and more, we could create long input sequences (X) that can be used to forecast an output value (y).

The shift operator can also accept a negative integer value. This has the effect of pulling the observations up by inserting new rows at the end. Below is an example:
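A sketch of the negative shift:

from pandas import DataFrame

df = DataFrame()
df['t'] = [x for x in range(10)]
df['t+1'] = df['t'].shift(-1)  # pulled up: NaN appears in the last row
print(df)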

Running the example shows a new column with a NaN value as the last value.

We can see that the original column can be taken as an input (X) and the shifted forecast column as an output value (y): that is, the input value of 0 can be used to forecast the output value of 1.

Technically, in time series forecasting terminology the current time (t) and future times (t+1, t+n) are forecast times and past observations (t-1, t-n) are used to make forecasts.

We can see how positive and negative shifts can be used to create a new DataFrame from a time series with sequences of input and output patterns for a supervised learning problem.

This permits not only classical X -> y prediction, but also X -> Y where both input and output can be sequences.

Further, the shift function also works on so-called multivariate time series problems. That is where instead of having one set of observations for a time series, we have multiple (e.g. temperature and pressure). All variates in the time series can be shifted forward or backward to create multivariate input and output sequences. We will explore this more later in the tutorial.

The series_to_supervised() Function

We can use the shift() function in Pandas to automatically create new framings of time series problems given the desired length of input and output sequences.

This would be a useful tool as it would allow us to explore different framings of a time series problem with machine learning algorithms to see which might result in better performing models.

In this section, we will define a new Python function named series_to_supervised() that takes a univariate or multivariate time series and frames it as a supervised learning dataset.

The function takes four arguments:

  • data: Sequence of observations as a list or 2D NumPy array. Required.
  • n_in: Number of lag observations as input (X). Values may be between [1..len(data)]. Optional. Defaults to 1.
  • n_out: Number of observations as output (y). Values may be between [0..len(data)-1]. Optional. Defaults to 1.
  • dropnan: Boolean whether or not to drop rows with NaN values. Optional. Defaults to True.

The function returns a single value:

  • return: Pandas DataFrame of series framed for supervised learning.

The new dataset is constructed as a DataFrame, with each column suitably named both by variable number and time step. This allows you to design a variety of different time step sequence type forecasting problems from a given univariate or multivariate time series.

Once the DataFrame is returned, you can decide how to split the rows of the returned DataFrame into X and y components for supervised learning any way you wish.

The function is defined with default parameters so that if you call it with just your data, it will construct a DataFrame with t-1 as X and t as y.

The function is confirmed to be compatible with Python 2 and Python 3.

The complete function is listed below, including function comments.
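The listing itself was lost in extraction; the reconstruction below matches the documented arguments and return value, and should be treated as a sketch rather than the author's verbatim code:

from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or 2D NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ..., t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ..., t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg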

Can you see obvious ways to make the function more robust or more readable?
Please let me know in the comments below.

Now that we have the whole function, we can explore how it may be used.

One-Step Univariate Forecasting

It is standard practice in time series forecasting to use lagged observations (e.g. t-1) as input variables to forecast the current time step (t).

This is called one-step forecasting.

The example below demonstrates a one lag time step (t-1) to predict the current time step (t).
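A sketch of that example, assuming the series_to_supervised() definition above is in scope:

values = [x for x in range(10)]
data = series_to_supervised(values)  # defaults: t-1 as input, t as output
print(data)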

Running the example prints the output of the reframed time series.

We can see that the observations are named “var1” and that the input observation is suitably named (t-1) and the output time step is named (t).

We can also see that rows with NaN values have been automatically removed from the DataFrame.

We can repeat this example with an arbitrary-length input sequence, such as 3. This can be done by specifying the length of the input sequence as an argument, for example series_to_supervised(values, 3).

The complete example is listed below.
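As a sketch, reusing series_to_supervised() from above:

values = [x for x in range(10)]
data = series_to_supervised(values, 3)  # three lag observations as input
print(data)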

Again, running the example prints the reframed series. We can see that the input sequence is in the correct left-to-right order with the output variable to be predicted on the far right.

Multi-Step or Sequence Forecasting

A different type of forecasting problem is using past observations to forecast a sequence of future observations.

This may be called sequence forecasting or multi-step forecasting.

We can frame a time series for sequence forecasting by specifying another argument. For example, we could frame a forecast problem with an input sequence of 2 past observations to forecast 2 future observations by calling series_to_supervised(values, 2, 2).

The complete example is listed below:
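Again as a sketch, reusing series_to_supervised() from above:

values = [x for x in range(10)]
data = series_to_supervised(values, 2, 2)  # 2 lag inputs, 2 forecast outputs
print(data)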

Running the example shows the differentiation of input (t-n) and output (t+n) variables with the current observation (t) considered an output.

Multivariate Forecasting

Another important type of time series is called multivariate time series.

This is where we may have observations of multiple different measures and an interest in forecasting one or more of them.

For example, we may have two sets of time series observations obs1 and obs2 and we wish to forecast one or both of these.

We can call series_to_supervised() in exactly the same way.

For example:
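A sketch with two mock variables (the column names are mine):

from pandas import DataFrame

raw = DataFrame()
raw['obs1'] = [x for x in range(10)]
raw['obs2'] = [x for x in range(50, 60)]
data = series_to_supervised(raw.values)  # 2D array, so both variables are framed
print(data)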

Running the example prints the new framing of the data, showing an input pattern with one time step for both variables and an output pattern of one time step for both variables.

Again, depending on the specifics of the problem, the division of columns into X and Y components can be chosen arbitrarily, such as if the current observation of var1 was also provided as input and only var2 was to be predicted.

You can see how this may be easily used for sequence forecasting with multivariate time series by specifying the length of the input and output sequences as above.

For example, below is an example of a reframing with 1 time step as input and 2 time steps as forecast sequence.
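Again a sketch, reusing the mock two-variable data:

from pandas import DataFrame

raw = DataFrame()
raw['obs1'] = [x for x in range(10)]
raw['obs2'] = [x for x in range(50, 60)]
data = series_to_supervised(raw.values, 1, 2)  # 1 step of input, 2 steps of forecast
print(data)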

Running the example shows the large reframed DataFrame.

Experiment with your own dataset and try multiple different framings to see what works best.

Summary

In this tutorial, you discovered how to reframe time series datasets as supervised learning problems with Python.

Specifically, you learned:

  • About the Pandas shift() function and how it can be used to automatically define supervised learning datasets from time series data.
  • How to reframe a univariate time series into one-step and multi-step supervised learning problems.
  • How to reframe multivariate time series into one-step and multi-step supervised learning problems.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The Mueen-Keogh (MK) Algorithm

Paper: Exact Discovery of Time Series Motifs

 

Speeded up Brute Force Motif Discovery:

Github:https://github.com/saifuddin778/mkalgo

One line looks odd, though: the condition should be Dist_{I(j+offset)}-Dist_{I(j)} < best-so-far rather than Dist_{I(j)}-Dist_{I(j+offset)} < best-so-far, because the Dist_{I(j)} are sorted in increasing order and best-so-far > 0.

[Figure: pseudocode for speeded-up brute force motif discovery]
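To make the role of that inequality concrete, here is my own rough Python paraphrase of the single-reference-point pruning (not code from either repository): by the triangle inequality, Dist_{I(j+offset)} - Dist_{I(j)} is a lower bound on the true distance between the two subsequences, so once it reaches best-so-far the pair can be skipped.

import numpy as np

def mk_motif_sketch(subsequences, ref_index=0):
    # subsequences: 2D array, one candidate subsequence per row
    # (trivial-match exclusion is omitted for brevity)
    S = np.asarray(subsequences, dtype=float)
    m = len(S)
    dist = np.linalg.norm(S - S[ref_index], axis=1)  # distance of each row to the reference
    I = np.argsort(dist)                             # ordering I(1), ..., I(m): dist[I] increasing
    best_so_far, best_pair = np.inf, None
    offset, abandon = 0, False
    while not abandon:
        offset += 1
        abandon = True
        for j in range(m - offset):                  # j runs to m - offset, per the note below
            a, b = I[j], I[j + offset]
            # triangle inequality: dist[b] - dist[a] <= d(S[a], S[b])
            if dist[b] - dist[a] < best_so_far:
                abandon = False                      # this offset may still contain the motif
                d = np.linalg.norm(S[a] - S[b])
                if d < best_so_far:
                    best_so_far, best_pair = d, (a, b)
    return best_pair, best_so_far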

Generalization to multiple reference points:

https://github.com/nicholasg3/motif-mining/tree/95bbb05ac5d0f9e90134a67a789ea7e607f22cea

Note: the inner loop should read for j = 1 to m-offset, not for j = 1 to R.

[Figure: MK motif discovery pseudocode]

Time Series Clustering with Dynamic Time Warping (DTW)

https://github.com/goodmattg/wikipedia_kaggle