Category Archives: Statistics

A Guide to Time Series Forecasting with ARIMA in Python 3

https://www.digitalocean.com/community/tutorials/a-guide-to-time-series-forecasting-with-arima-in-python-3#step-2-—-importing-packages-and-loading-data

 

Introduction

Time series provide the opportunity to forecast future values. Based on previous values, time series can be used to forecast trends in economics, weather, and capacity planning, to name a few. The specific properties of time-series data mean that specialized statistical methods are usually required.

In this tutorial, we will aim to produce reliable forecasts of time series. We will begin by introducing and discussing the concepts of autocorrelation, stationarity, and seasonality, and proceed to apply one of the most commonly used method for time-series forecasting, known as ARIMA.

One of the methods available in Python to model and predict future points of a time series is known as SARIMAX, which stands for Seasonal AutoRegressive Integrated Moving Averages with eXogenous regressors. Here, we will primarily focus on the ARIMA component, which is used to fit time-series data to better understand and forecast future points in the time series.

Prerequisites

This guide will cover how to do time-series analysis on either a local desktop or a remote server. Working with large datasets can be memory intensive, so in either case, the computer will need at least 2GB of memory to perform some of the calculations in this guide.

To make the most of this tutorial, some familiarity with time series and statistics can be helpful.

For this tutorial, we’ll be using Jupyter Notebook to work with the data. If you do not have it already, you should follow our tutorial to install and set up Jupyter Notebook for Python 3.

Step 1 — Installing Packages

To set up our environment for time-series forecasting, let’s first move into our local programming environment or server-based programming environment:

  • cd environments
  • . my_env/bin/activate

From here, let’s create a new directory for our project. We will call it ARIMA and then move into the directory. If you call the project a different name, be sure to substitute your name for ARIMA throughout the guide

  • mkdir ARIMA
  • cd ARIMA

This tutorial will require the warningsitertoolspandasnumpymatplotlib and statsmodels libraries. The warnings and itertools libraries come included with the standard Python library set so you shouldn’t need to install them.

Like with other Python packages, we can install these requirements with pip.
We can now install pandasstatsmodels, and the data plotting package matplotlib. Their dependencies will also be installed:

  • pip install pandas numpy statsmodels matplotlib

At this point, we’re now set up to start working with the installed packages.

Step 2 — Importing Packages and Loading Data

To begin working with our data, we will start up Jupyter Notebook:

  • jupyter notebook

To create a new notebook file, select New > Python 3 from the top right pull-down menu:

Create a new Python 3 notebook

This will open a notebook.

As is best practice, start by importing the libraries you will need at the top of your notebook:

import warnings
import itertools
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

We have also defined a matplotlib style of fivethirtyeight for our plots.

We’ll be working with a dataset called “Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory, Hawaii, U.S.A.,” which collected CO2 samples from March 1958 to December 2001. We can bring in this data as follows:

data = sm.datasets.co2.load_pandas()
y = data.data

Let’s preprocess our data a little bit before moving forward. Weekly data can be tricky to work with since it’s a briefer amount of time, so let’s use monthly averages instead. We’ll make the conversion with the resample function. For simplicity, we can also use the fillna() function to ensure that we have no missing values in our time series.

# The 'MS' string groups the data in buckets by start of the month
y = y['co2'].resample('MS').mean()

# The term bfill means that we use the value before filling in missing values
y = y.fillna(y.bfill())

print(y)
Output
co2
1958-03-01  316.100000
1958-04-01  317.200000
1958-05-01  317.433333
...
2001-11-01  369.375000
2001-12-01  371.020000

Let’s explore this time series e as a data visualization:

y.plot(figsize=(15, 6))
plt.show()

Figure 1: CO2 Levels Time Series

Some distinguishable patterns appear when we plot the data. The time series has an obvious seasonality pattern, as well as an overall increasing trend.

To learn more about time series pre-processing, please refer to “A Guide to Time Series Visualization with Python 3,” where the steps above are described in much more detail.

Now that we’ve converted and explored our data, let’s move on to time series forecasting with ARIMA.

Step 3 — The ARIMA Time Series Model

One of the most common methods used in time series forecasting is known as the ARIMA model, which stands for AutoregRessive Integrated Moving Average. ARIMA is a model that can be fitted to time series data in order to better understand or predict future points in the series.

There are three distinct integers (pdq) that are used to parametrize ARIMA models. Because of that, ARIMA models are denoted with the notation ARIMA(p, d, q). Together these three parameters account for seasonality, trend, and noise in datasets:

  • p is the auto-regressive part of the model. It allows us to incorporate the effect of past values into our model. Intuitively, this would be similar to stating that it is likely to be warm tomorrow if it has been warm the past 3 days.
  • d is the integrated part of the model. This includes terms in the model that incorporate the amount of differencing (i.e. the number of past time points to subtract from the current value) to apply to the time series. Intuitively, this would be similar to stating that it is likely to be same temperature tomorrow if the difference in temperature in the last three days has been very small.
  • q is the moving average part of the model. This allows us to set the error of our model as a linear combination of the error values observed at previous time points in the past.

When dealing with seasonal effects, we make use of the seasonal ARIMA, which is denoted as ARIMA(p,d,q)(P,D,Q)s. Here, (p, d, q) are the non-seasonal parameters described above, while (P, D, Q) follow the same definition but are applied to the seasonal component of the time series. The term s is the periodicity of the time series (4 for quarterly periods, 12 for yearly periods, etc.).

The seasonal ARIMA method can appear daunting because of the multiple tuning parameters involved. In the next section, we will describe how to automate the process of identifying the optimal set of parameters for the seasonal ARIMA time series model.

Step 4 — Parameter Selection for the ARIMA Time Series Model

When looking to fit time series data with a seasonal ARIMA model, our first goal is to find the values of ARIMA(p,d,q)(P,D,Q)s that optimize a metric of interest. There are many guidelines and best practices to achieve this goal, yet the correct parametrization of ARIMA models can be a painstaking manual process that requires domain expertise and time. Other statistical programming languages such as R provide automated ways to solve this issue, but those have yet to be ported over to Python. In this section, we will resolve this issue by writing Python code to programmatically select the optimal parameter values for our ARIMA(p,d,q)(P,D,Q)s time series model.

We will use a “grid search” to iteratively explore different combinations of parameters. For each combination of parameters, we fit a new seasonal ARIMA model with the SARIMAX() function from the statsmodels module and assess its overall quality. Once we have explored the entire landscape of parameters, our optimal set of parameters will be the one that yields the best performance for our criteria of interest. Let’s begin by generating the various combination of parameters that we wish to assess:

# Define the p, d and q parameters to take any value between 0 and 2
p = d = q = range(0, 2)

# Generate all different combinations of p, q and q triplets
pdq = list(itertools.product(p, d, q))

# Generate all different combinations of seasonal p, q and q triplets
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]

print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))
Output
Examples of parameter combinations for Seasonal ARIMA...
SARIMAX: (0, 0, 1) x (0, 0, 1, 12)
SARIMAX: (0, 0, 1) x (0, 1, 0, 12)
SARIMAX: (0, 1, 0) x (0, 1, 1, 12)
SARIMAX: (0, 1, 0) x (1, 0, 0, 12)

We can now use the triplets of parameters defined above to automate the process of training and evaluating ARIMA models on different combinations. In Statistics and Machine Learning, this process is known as grid search (or hyperparameter optimization) for model selection.

When evaluating and comparing statistical models fitted with different parameters, each can be ranked against one another based on how well it fits the data or its ability to accurately predict future data points. We will use the AIC (Akaike Information Criterion) value, which is conveniently returned with ARIMA models fitted using statsmodels. The AIC measures how well a model fits the data while taking into account the overall complexity of the model. A model that fits the data very well while using lots of features will be assigned a larger AIC score than a model that uses fewer features to achieve the same goodness-of-fit. Therefore, we are interested in finding the model that yields the lowest AIC value.

The code chunk below iterates through combinations of parameters and uses the SARIMAX function from statsmodels to fit the corresponding Seasonal ARIMA model. Here, the order argument specifies the (p, d, q) parameters, while the seasonal_order argument specifies the (P, D, Q, S) seasonal component of the Seasonal ARIMA model. After fitting each SARIMAX()model, the code prints out its respective AICscore.

warnings.filterwarnings("ignore") # specify to ignore warning messages

for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(y,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)

            results = mod.fit()

            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except:
            continue

Because some parameter combinations may lead to numerical misspecifications, we explicitly disabled warning messages in order to avoid an overload of warning messages. These misspecifications can also lead to errors and throw an exception, so we make sure to catch these exceptions and ignore the parameter combinations that cause these issues.

The code above should yield the following results, this may take some time:

Output
SARIMAX(0, 0, 0)x(0, 0, 1, 12) - AIC:6787.3436240402125
SARIMAX(0, 0, 0)x(0, 1, 1, 12) - AIC:1596.711172764114
SARIMAX(0, 0, 0)x(1, 0, 0, 12) - AIC:1058.9388921320026
SARIMAX(0, 0, 0)x(1, 0, 1, 12) - AIC:1056.2878315690562
SARIMAX(0, 0, 0)x(1, 1, 0, 12) - AIC:1361.6578978064144
SARIMAX(0, 0, 0)x(1, 1, 1, 12) - AIC:1044.7647912940095
...
...
...
SARIMAX(1, 1, 1)x(1, 0, 0, 12) - AIC:576.8647112294245
SARIMAX(1, 1, 1)x(1, 0, 1, 12) - AIC:327.9049123596742
SARIMAX(1, 1, 1)x(1, 1, 0, 12) - AIC:444.12436865161305
SARIMAX(1, 1, 1)x(1, 1, 1, 12) - AIC:277.7801413828764

The output of our code suggests that SARIMAX(1, 1, 1)x(1, 1, 1, 12) yields the lowest AIC value of 277.78. We should therefore consider this to be optimal option out of all the models we have considered.

Step 5 — Fitting an ARIMA Time Series Model

Using grid search, we have identified the set of parameters that produces the best fitting model to our time series data. We can proceed to analyze this particular model in more depth.

We’ll start by plugging the optimal parameter values into a new SARIMAX model:

mod = sm.tsa.statespace.SARIMAX(y,
                                order=(1, 1, 1),
                                seasonal_order=(1, 1, 1, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)

results = mod.fit()

print(results.summary().tables[1])
Output
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.3182      0.092      3.443      0.001       0.137       0.499
ma.L1         -0.6255      0.077     -8.165      0.000      -0.776      -0.475
ar.S.L12       0.0010      0.001      1.732      0.083      -0.000       0.002
ma.S.L12      -0.8769      0.026    -33.811      0.000      -0.928      -0.826
sigma2         0.0972      0.004     22.634      0.000       0.089       0.106
==============================================================================

The summary attribute that results from the output of SARIMAX returns a significant amount of information, but we’ll focus our attention on the table of coefficients. The coef column shows the weight (i.e. importance) of each feature and how each one impacts the time series. The P>|z| column informs us of the significance of each feature weight. Here, each weight has a p-value lower or close to 0.05, so it is reasonable to retain all of them in our model.

When fitting seasonal ARIMA models (and any other models for that matter), it is important to run model diagnostics to ensure that none of the assumptions made by the model have been violated. The plot_diagnostics object allows us to quickly generate model diagnostics and investigate for any unusual behavior.

results.plot_diagnostics(figsize=(15, 12))
plt.show()

Figure 2: Model Diagnostics

Our primary concern is to ensure that the residuals of our model are uncorrelated and normally distributed with zero-mean. If the seasonal ARIMA model does not satisfy these properties, it is a good indication that it can be further improved.

In this case, our model diagnostics suggests that the model residuals are normally distributed based on the following:

  • In the top right plot, we see that the red KDE line follows closely with the N(0,1) line (where N(0,1)) is the standard notation for a normal distribution with mean 0 and standard deviation of 1). This is a good indication that the residuals are normally distributed.
  • The qq-plot on the bottom left shows that the ordered distribution of residuals (blue dots) follows the linear trend of the samples taken from a standard normal distribution with N(0, 1). Again, this is a strong indication that the residuals are normally distributed.
  • The residuals over time (top left plot) don’t display any obvious seasonality and appear to be white noise. This is confirmed by the autocorrelation (i.e. correlogram) plot on the bottom right, which shows that the time series residuals have low correlation with lagged versions of itself.

Those observations lead us to conclude that our model produces a satisfactory fit that could help us understand our time series data and forecast future values.

Although we have a satisfactory fit, some parameters of our seasonal ARIMA model could be changed to improve our model fit. For example, our grid search only considered a restricted set of parameter combinations, so we may find better models if we widened the grid search.

Step 6 — Validating Forecasts

We have obtained a model for our time series that can now be used to produce forecasts. We start by comparing predicted values to real values of the time series, which will help us understand the accuracy of our forecasts. The get_prediction() and conf_int() attributes allow us to obtain the values and associated confidence intervals for forecasts of the time series.

pred = results.get_prediction(start=pd.to_datetime('1998-01-01'), dynamic=False)
pred_ci = pred.conf_int()

The code above requires the forecasts to start at January 1998.

The dynamic=False argument ensures that we produce one-step ahead forecasts, meaning that forecasts at each point are generated using the full history up to that point.

We can plot the real and forecasted values of the CO2 time series to assess how well we did. Notice how we zoomed in on the end of the time series by slicing the date index.

ax = y['1990':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7)

ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.2)

ax.set_xlabel('Date')
ax.set_ylabel('CO2 Levels')
plt.legend()

plt.show()

Figure 3: CO2 Levels Static Forecast

Overall, our forecasts align with the true values very well, showing an overall increase trend.

It is also useful to quantify the accuracy of our forecasts. We will use the MSE (Mean Squared Error), which summarizes the average error of our forecasts. For each predicted value, we compute its distance to the true value and square the result. The results need to be squared so that positive/negative differences do not cancel each other out when we compute the overall mean.

y_forecasted = pred.predicted_mean
y_truth = y['1998-01-01':]

# Compute the mean square error
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
Output
The Mean Squared Error of our forecasts is 0.07

The MSE of our one-step ahead forecasts yields a value of 0.07, which is very low as it is close to 0. An MSE of 0 would that the estimator is predicting observations of the parameter with perfect accuracy, which would be an ideal scenario but it not typically possible.

However, a better representation of our true predictive power can be obtained using dynamic forecasts. In this case, we only use information from the time series up to a certain point, and after that, forecasts are generated using values from previous forecasted time points.

In the code chunk below, we specify to start computing the dynamic forecasts and confidence intervals from January 1998 onwards.

pred_dynamic = results.get_prediction(start=pd.to_datetime('1998-01-01'), dynamic=True, full_results=True)
pred_dynamic_ci = pred_dynamic.conf_int()

Plotting the observed and forecasted values of the time series, we see that the overall forecasts are accurate even when using dynamic forecasts. All forecasted values (red line) match pretty closely to the ground truth (blue line), and are well within the confidence intervals of our forecast.

ax = y['1990':].plot(label='observed', figsize=(20, 15))
pred_dynamic.predicted_mean.plot(label='Dynamic Forecast', ax=ax)

ax.fill_between(pred_dynamic_ci.index,
                pred_dynamic_ci.iloc[:, 0],
                pred_dynamic_ci.iloc[:, 1], color='k', alpha=.25)

ax.fill_betweenx(ax.get_ylim(), pd.to_datetime('1998-01-01'), y.index[-1],
                 alpha=.1, zorder=-1)

ax.set_xlabel('Date')
ax.set_ylabel('CO2 Levels')

plt.legend()
plt.show()

Figure 4: CO2 Levels Dynamic Forecast

Once again, we quantify the predictive performance of our forecasts by computing the MSE:

# Extract the predicted and true values of our time series
y_forecasted = pred_dynamic.predicted_mean
y_truth = y['1998-01-01':]

# Compute the mean square error
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
Output
The Mean Squared Error of our forecasts is 1.01

The predicted values obtained from the dynamic forecasts yield an MSE of 1.01. This is slightly higher than the one-step ahead, which is to be expected given that we are relying on less historical data from the time series.

Both the one-step ahead and dynamic forecasts confirm that this time series model is valid. However, much of the interest around time series forecasting is the ability to forecast future values way ahead in time.

Step 7 — Producing and Visualizing Forecasts

In the final step of this tutorial, we describe how to leverage our seasonal ARIMA time series model to forecast future values. The get_forecast() attribute of our time series object can compute forecasted values for a specified number of steps ahead.

# Get forecast 500 steps ahead in future
pred_uc = results.get_forecast(steps=500)

# Get confidence intervals of forecasts
pred_ci = pred_uc.conf_int()

We can use the output of this code to plot the time series and forecasts of its future values.

ax = y.plot(label='observed', figsize=(20, 15))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Date')
ax.set_ylabel('CO2 Levels')

plt.legend()
plt.show()

Figure 5: Time Series and Forecast of Future Values

Both the forecasts and associated confidence interval that we have generated can now be used to further understand the time series and foresee what to expect. Our forecasts show that the time series is expected to continue increasing at a steady pace.

As we forecast further out into the future, it is natural for us to become less confident in our values. This is reflected by the confidence intervals generated by our model, which grow larger as we move further out into the future.

Conclusion

In this tutorial, we described how to implement a seasonal ARIMA model in Python. We made extensive use of the pandas and statsmodels libraries and showed how to run model diagnostics, as well as how to produce forecasts of the CO2 time series.

Here are a few other things you could try:

  • Change the start date of your dynamic forecasts to see how this affects the overall quality of your forecasts.
  • Try more combinations of parameters to see if you can improve the goodness-of-fit of your model.
  • Select a different metric to select the best model. For example, we used the AIC measure to find the best model, but you could seek to optimize the out-of-sample mean square error instead.

For more practice, you could also try to load another time series dataset to produce your own forecasts.

9 Comments

  • B
  • I
  • UL
  • OL
  • Code
  • Highlight
  • Table
  • 0

    Hi! Thanks for sharing this.
    I was trying to forecast hourly values. The seasonality to capture should be similar as the 168th previous value. This means, Friday 9PM of this week should be similar than Friday 9PM of the past week.
    That is why I decided to use 168 seasionality (24*7) but it takes very long and consumes lots of memory. I’ve tried several times using 7 and 24 seasionality but it wasn’t doing it well when forecasting (previous fitting with dynamic set to False was working perfectly). Do you have any advice for this situation? Thanks in advance.

  • 0

    Thanks for the Guide.

    I tried this with my own data. And at the model result summary part, I got ma.L1 having p-value over 0.88. So, I definitely want to get rid of this feature from the model. But how do I do that? How to remove a feature from the model??

    • 0

      Hi!
      Thanks for taking the time to read through this tutorial! Yes, a p-value of 0.88 would suggest that your ma.L1 feature is not very informative. The simplest way to start would be to try and remove the MA features from your model. You can achieve this by refitting your time-series models while explicitly setting the Q parameter to zero, this will ensure that no MA components are used when you fit your model.

  • 0

    very nice tutorial. thanks! I am a new one to ARIMA model, I want to ask you some questions.
    1) I found you use all the historical data to fit an ARIMA Time Series Model, and use part of all the historical data to validate mode, with code: pred = results.getprediction(start=pd.todatetime(‘1998-01-01’), dynamic=False) predci = pred.confint()
    But why the data to validate model is one part of data for fitting model before. You know in machine learning, the train data and test data is split, I don’t why here is different.
    2) how about data stationary, could you tell me why you set the enforce_stationary is false.

    3) how about days data not month data(average) for fit mode to predict, how about week data for prediction, could you tell me how to do it

    thanks!

  • 0

    I got an error on line:
    pred = results.getprediction(start=pd.todatetime(‘1998-02-01’), dynamic=False)

    File “pandas_libs\tslib.pyx”, line 1080, in pandas.libs.tslib.Timestamp.richcmp (pandas_libs\tslib.c:20281)
    TypeError: Cannot compare type ‘Timestamp’ with type ‘int’

    How can I solve it?

  • 1

    Hi.
    Thank you so much for your wonderful sharing. Is there are any way to catch the minimum value of AIC automatically?
    It would be wonderful, if the best set for ARIMAX was stored on a external variable and pass them to next step.
    Is it possible? how?
    Thanks you

    • 0

      Use this code

      warnings.filterwarnings("ignore") # specify to ignore warning messages
      AIC_list = pd.DataFrame({}, columns=['pram','param_seasonal','AIC'])
      for param in pdq:
          for param_seasonal in seasonal_pdq:
              try:
                  mod = sm.tsa.statespace.SARIMAX(y,
                                                  order=param,
                                                  seasonal_order=param_seasonal,
                                                  enforce_stationarity=False,
                                                  enforce_invertibility=False)
      
                  results = mod.fit()
      
                  print('ARIMA{}x{} - AIC:{}'.format(param, param_seasonal, results.aic))
                  temp = pd.DataFrame([[ param ,  param_seasonal , results.aic ]], columns=['pram','param_seasonal','AIC'])
                  AIC_list = AIC_list.append( temp, ignore_index=True)  # DataFrame append 는 일반 list append 와 다르게 이렇게 지정해주어야한다.
                  del temp
      
              except:
                  continue
      
      m = np.amin(AIC_list['AIC'].values) # Find minimum value in AIC
      l = AIC_list['AIC'].tolist().index(m) # Find index number for lowest AIC
      Min_AIC_list = AIC_list.iloc[l,:]
      
      print("### Min_AIC_list ### \n{}".format(Min_AIC_list))
      
      mod = sm.tsa.statespace.SARIMAX(y,
                                      order=Min_AIC_list['pram'],
                                      seasonal_order=Min_AIC_list['pram_seasonal'],
                                      enforce_stationarity=False,
                                      enforce_invertibility=False)
      
      results = mod.fit()
      
      print(results.summary().tables[1])
      
      results.plot_diagnostics(figsize=(15, 12))
      plt.show()
      
      
    • 0

      Revised code (sorry in the previous code, I was missing one thing.)

      
      warnings.filterwarnings("ignore") # specify to ignore warning messages
      AIC_list = pd.DataFrame({}, columns=['param','param_seasonal','AIC'])
      for param in pdq:
          for param_seasonal in seasonal_pdq:
              try:
                  mod = sm.tsa.statespace.SARIMAX(y,
                                                  order=param,
                                                  seasonal_order=param_seasonal,
                                                  enforce_stationarity=False,
                                                  enforce_invertibility=False)
      
                  results = mod.fit()
      
                  print('ARIMA{}x{} - AIC:{}'.format(param, param_seasonal, results.aic))
                  temp = pd.DataFrame([[ param ,  param_seasonal , results.aic ]], columns=['param','param_seasonal','AIC'])
                  AIC_list = AIC_list.append( temp, ignore_index=True)  # DataFrame append 는 일반 list append 와 다르게 이렇게 지정해주어야한다.
                  del temp
      
              except:
                  continue
      
      
      m = np.amin(AIC_list['AIC'].values) # Find minimum value in AIC
      l = AIC_list['AIC'].tolist().index(m) # Find index number for lowest AIC
      Min_AIC_list = AIC_list.iloc[l,:]
      
      
      
      mod = sm.tsa.statespace.SARIMAX(y,
                                      order=Min_AIC_list['param'],
                                      seasonal_order=Min_AIC_list['param_seasonal'],
                                      enforce_stationarity=False,
                                      enforce_invertibility=False)
      results = mod.fit()
      
      print("### Min_AIC_list ### \n{}".format(Min_AIC_list))
      
      print(results.summary().tables[1])
      
      results.plot_diagnostics(figsize=(15, 12))
      plt.show()
      
      
Advertisements

交叉验证(Cross Validation)

交叉验证的定义

交叉验证(Cross Validation),有的时候也称作循环估计(Rotation Estimation),是一种统计学上将数据样本切割成较小子集的实用方法,该理论是由Seymour Geisser提出的。

在模式识别(Pattern Recognition)和机器学习(Machine Learning)的相关研究中,经常会将整个数据集合分成两个部分,分别是训练集合和测试集合。假设X是集合全体,A\subseteq X是全集X的非空真子集,那么非空集合X\setminus A\neq \emptyset则是集合A在全集X中的补集。于是可以先在A上面做训练和分析,而集合X\setminus A则用来做测试和验证。一开始的集合A被称作训练集,而它的补集X\setminus A被称作验证集或者测试集。这里有一个重要的观点就是:只有训练集才可以使用在模型的训练之中,而测试集必须在模型训练完成之后才被用来评估模型的误差。

HoldOut检验(Hold-Out Method)

这个方法是将原始的数据集合X随机分成两个集合AX\setminus A,其中A作为训练集,X\setminus A作为测试集。先使用训练集训练模型,然后利用测试集验证模型的效果,记录最后的分类准确率作为Hold-Out下该模型的性能指标。比方说,处理时间序列模型是否准确的时候,把整个数据集合分成前后两部分,前部分占比70%,后部分占比30%。前部分来进行时间序列模型的训练,后部分用来测试改时间序列的准确性。其准确性可以用MAE,MAPE之类的统计指标来衡量。综上所述,该方法的好处就是处理起来简单,只需要把原始数据分成两个部分即可。但是从严格意义上来说,Hold-Out检验并不算是交叉检验(Cross Validation),因为该方法没有达到交叉检验的思想,而且最后验证准确性的高低和原始数组的分类有很大的关系,所以该方法得到的结果在某些场景中并不具备特别大的说服力。 在Hold-Out检验不够有说服力的情形下,有人提出了交叉验证这一个重要思想。

交叉检验的常见形式

假设有一个未知模型有一个或者多个未知的参数,并且有一个训练集。操作的过程就是对该模型的参数进行调整,使得该模型能够最大的反映训练集的特征。如果模型因为训练集过小或者参数不合适而产生过度拟合的情况,测试集的测试效果就可以得到验证。交叉验证是一种能够预测模型拟合性能的有效方法。

彻底的交叉验证(Exhaustive Cross Validation)

彻底的交叉验证方法指的是遍历全集X的所有非空真子集A。换句话说也就是把A当作训练集,X\setminus A是测试集。如果X中有n个元素,那么非空真子集A的选择方法则是2^{n}-2,这个方法的时间复杂度是指数级别的。

留P验证(Leave-p-out Cross Validation)

p验证(LpO CV)指的是使用全集X中的p个元素作为测试集,然后剩下的n-p个元素作为训练集。根据数学上的定理可以得到,p个元素的选择方法有C_{n}^{p}=n!/(p!\cdot(n-p)!)个,其中n!表示n的阶乘。在这个意义下,留p验证的时间复杂度也是非常高的。当p=1的时候,留1验证(Leave-one-out Cross Validation)的复杂度恰好是C_{n}^{1}=n

不彻底的交叉验证(Non-exhaustive Cross Validation)

不彻底的交叉验证不需要考虑全集X的所有划分情况,这种方法是留p验证的一个近似验证算法。

k-fold交叉验证(K-fold Cross Validation)

在k-fold交叉验证中,全集X被随机的划分成k个同等大小的集合A_{1},\cdot\cdot\cdot,A_{k},换句话说也就是X=A_{1}\cup\cdot\cdot\cdot\cup A_{k},并且|A_{1}|=\cdot\cdot\cdot=|A_{k}|。这里的|A_{i}|指的是集合A_{i}的元素个数,也就是集合的势。这个时候需要遍历i1k,把X\setminus A当作训练集合,A_{i}当作测试集合。根据模型的测试统计,可以得到A_{i}集合中测试错误的结果数量n_{i}。如果全集X的势是n的话,可以得到该模型的错误率是E=\sum_{i=1}^{k}n_{i}/n 为了提高模型的精确度,可以将k-fold交叉验证的上述步骤重复t次,每一次都是随机划分全集X。在t次测试中,会得到t个模型的错误率E_{1},\cdot\cdot\cdot, E_{t}。定义e=\sum_{j=1}^{t}E_{j}/t, V=\sum_{j=1}^{t}(E_{j}-e)^{2}/(t-1)\sigma=\sqrt{V}。这样该模型的错误率就是e,其方差是\sigma

注释:

1. 一般来说,10-fold交叉验证的情况使用得最多。

2. k=2的时候,也就是最简单的k-fold交叉验证,2-fold交叉验证。这个时候X=A_{1}\cup A_{2},首先A_{1}当训练集并且A_{2}当测试集,然后A_{2}当训练集并且A_{1}当测试集。2-fold交叉验证的好处就是训练集和测试集的势都非常大,每个数据要么在训练集中,要么在测试集中。

3. k=n的时候,也就是n-fold交叉验证。这个时候就是上面所说的留一验证(Leave-one-out Cross Validation)。 综上所述,交叉验证(Cross Validation)的好处是可以从有限的数据中获得尽可能多的有效信息,从而可以从多个角度去学习样本,避免陷入局部的极值。在这个过程中,无论是训练样本还是测试样本都得到了尽可能多的学习。

一般模型的选择过程

在了解了交叉验证的方法之后,可以来介绍一般模型的选择过程。通过采用不同的输入训练样本,来决定机器学习算法中包含的各个参数值,称作模型选择。下面伪代码表示了模型选择的一般流程。在这个算法中,最重要的就是第三个步骤中的误差评价。

(1)准备候选的\ell个模型:M_{1},\cdot\cdot\cdot,M_{\ell}

(2)对每个模型M_{1},\cdot\cdot\cdot, M_{\ell}求解它的学习结果。

(3)对每个学习结果的误差e_{1},\cdot\cdot\cdot,e_{\ell}进行计算。这里可以使用上面所说的k-fold交叉验证方法。

(4)选择误差e_{1},\cdot\cdot\cdot,e_{\ell}最小的模型作为最终的模型。

 

时间序列模型之相空间重构模型

一般的时间序列主要是在时间域中进行模型的研究,而对于混沌时间序列,无论是混沌不变量的计算,混沌模型的建立和预测都是在所谓的相空间中进行,因此相空间重构就是混沌时间序列处理中非常重要的一个步骤。所谓混沌序列,可以看作是考察混沌系统所得到的一组随着时间而变化的观察值。假设时间序列是\{ x(i): i=1,\cdot \cdot \cdot, n\}, 那么吸引子的结构特性就包含在这个时间序列之中。为了从时间序列中提取出更多有用的信息,1980年Packard等人提出了时间序列重构相空间的两种方法:导数重构法和坐标延迟重构法。而后者的本质则是通过一维的时间序列\{x(i)\}的不同延迟时间\tau来构建d维的相空间矢量

y(i)=(x(i),...,x(i+(d-1)\tau)), 1\leq i\leq n-(d-1)\tau.

1981年Takens提出嵌入定理:对于无限长,无噪声的d^{'}维混沌吸引子的一维标量时间序列\{x(i): 1\leq i \leq n\}都可以在拓扑不变的意义下找到一个d维的嵌入相空间,只要维数d\geq 2d^{'}+1. 根据Takens嵌入定理,我们可以从一维混沌时间序列中重构一个与原动力系统在拓扑意义下一样的相空间,混沌时间序列的判定,分析和预测都是在这个重构的相空间中进行的,因此相空间的重构就是混沌时间序列研究的关键。

1. 相空间重构

相空间重构技术有两个关键的参数:嵌入的维数d和延迟时间\tau. 在Takens嵌入定理中,嵌入维数和延迟时间都只是理论上证明了其存在性,并没有给出具体的表达式,而且实际应用中时间序列都是有噪声的有限序列,嵌入维数和时间延迟必须要根据实际的情况来选取合适的值。

关于嵌入维数d和延迟时间\tau的取值,通常有两种观点:第一种观点认为d\tau是互不相关的,先求出延迟时间之后再根据它求出合适的嵌入维数。求出延迟时间\tau比较常用的方法主要有自相关法,平均位移法,复自相关法和互信息法等,关键的地方就是使得原时间序列经过时间延迟之后可以作为独立的坐标来使用。同时寻找嵌入维数的方法主要是几何不变量方法,虚假最临近法(False Nearest Neighbors)和它改进之后的Cao方法等。第二种观点则是认为延迟时间和嵌入维数是相关的。1996年Kugiumtzis提出的时间窗长度则是综合考虑两者的重要参数。1999年,Kim等人提出了C-C方法,该方法使用关联积分同时估计出延迟时间和时间窗。

(1) 延迟时间\tau的确定:

如果延迟时间\tau太小,则相空间向量

y(i)=(x(i),\cdot\cdot\cdot, x(i+(d-1)\tau), 1\leq i\leq n-(d-1)\tau

中的两个坐标分量x(i+j\tau)x(i+(j+1)\tau)在数值上非常接近,以至于无法相互区分,从而无法提供两个独立的坐标分量;但是如果延迟时间\tau太大的话,则两个坐标分量又会出现一种完全独立的情况,混沌吸引子的轨迹在两个方向上的投影就毫无相关性可言。因此需要合适的方法来确定一个合适的延迟时间\tau, 从而在独立和相关两者之间达到一种平衡。

(1.1) 自相关系数法:

自相关函数是求延迟时间\tau比较简单的一种方法,它的主要理念就是提取序列之间的线性相关性。对于混沌序列x(1), x(2),\cdot \cdot \cdot, x(n), 可以写出其自相关函数如下:

R(\tau)=\frac{1}{n}\sum_{i=1}^{n-\tau}x(i)x(i+\tau).

此时就可以使用已知的数据x(1),\cdot \cdot\cdot x(n)来做出自相关函数R(\tau)随着延迟时间\tau变化的图像,当自相关函数下降到初始值R(0)1-e^{-1}时,i.e. R(\tau)=(1-e^{-1})R(0), 所得到的时间\tau也就是重构相空间的延迟时间。虽然自相关函数法是一种简便有效的计算延迟时间的方法,但是它仅仅能够提取时间序列的线性相关性。根据自相关函数法可以让x(i), x(i+\tau)以及x(i+\tau), x(i+2\tau)之间不相关,但是x(i), x(i+2\tau)之间的相关性可能会很强。这一点意味着这种方法并不能够有效的推广到高维的研究。而且选择下降系数1-e^{-1}时,这一点可能有点主观性,需要根据具体的情况做适当的调整。因此再总结了自相关法的不足之后,下面介绍一种判断系统非线性关系性的一种方法:交互信息法。

(1.2) 交互信息法:

在考虑了以上方法的局限性之后,Fraser和Swinney提出了交互信息法(Mutual Information Method)。假设两个离散信息系统\{ s_{1},\cdot \cdot \cdot, s_{m}\}, \{q_{1},\cdot\cdot\cdot, q_{n}\}所构成的系统SQ。通过信息论和遍历论的知识,从两个系统中获得的信息熵分别是:

H(S)=-\sum_{i=1}^{m}P_{S}(s_{i})log_{2}P_{S}(s_{i}),

H(Q)=-\sum_{j=1}^{n}P_{Q}(q_{j})log_{2}P_{Q}(q_{j}).

其中P_{S}(s_{i}), P_{Q}(q_{i})分别是SQ中事件s_{i}q_{i}的概率。交互信息的计算公式是:

I(S,Q)=H(S)+H(Q)-H(S,Q),

其中H(S,Q)=-\sum_{i=1}^{m}\sum_{j=1}^{n}P_{S,Q}(s_{i},q_{j})log_{2}P_{S,Q}(s_{i},q_{j}).

这里的P_{S,Q}(s_{i},q_{j})称为事件s_{i}和事件q_{j}的联合分布概率。交互信息标准化就是

I(S,Q)=I(S,Q)/\sqrt{H(S)\times H(Q)}.

现在,我们通过信息论的方法来计算延迟时间\tau. 定义(S,Q)=(x(i), x(i+\tau)), 1\leq i\leq n-\tau, 也就是S代表时间x(i), Q代表时间x(i+\tau). 那么I(S,Q)则是关于延迟时间\tau的函数,可以写成I(\tau). I(\tau)的大小表示在已知系统S,也就是x(i)的情况下,系统Q也就是x(i+\tau)的确定性的大小。I(\tau)=0表示x(i)x(i+\tau)是完全不可预测的,也就是说两者完全不相关。而I(\tau)的第一个极小值,则表示了x(i)x(i+\tau)是最大可能的不相关,重构时使用I(\tau)的第一个极小值最为最优的延迟时间\tau.

交互信息法的关键在于怎么计算联合概率分布P_{S,Q}(s_{i},q_{j})以及SQ系统的概率分布P_{S}(s_{i})P_{Q}(q_{j}). 这里采取的方法是等间距格子法,其方法简要概述如下。

(S,Q)=(x(i), x(i+\tau)), 1\leq i \leq n-\tau,(S,Q)平面用一个矩形包含上面所有的点。将矩形S方向上等分成M_{1}份,Q方向等分成M_{2}份(注:M_{1}, M_{2}取值100~200之间即可)。那么在S方向上格子的长度就是\epsilon_{1}, Q方向上格子的长度就是\epsilon_{2}. 假设(a,b)(S,Q)平面矩形左下角的顶点坐标。

如果(i-1)\epsilon_{1}\leq s-a <i\epsilon_{1}, 那么s在第i个格子中,对Row[i]做一次记录;

如果(j-1)\epsilon_{2}\leq s-b <j\epsilon_{2}, 那么q在第j个格子中,对Col[j]做一次记录。x(i)x(i+\tau)序列的总数都是n-\tau, 那么

P_{S}(i)=Row[i]/(n-\tau), 1\leq i\leq M_{1},

P_{Q}(j)=Col[j]/(n-\tau), 1\leq j\leq M_{2}.

如果(i-1)\epsilon_{1}\leq s-a<i\epsilon_{1}(j-1)\epsilon_{2}\leq q-b < j\epsilon_{2},(s,q)在标号为(i,j)的格子中,对Together[i][j]做一次记录。那么

P_{S,Q}(i,j)=Together[i][j]/(n-\tau)^{2}, 1\leq i\leq M_{1}, 1\leq j\leq M_{2}.

根据以上信息论的公式,可以得到:

H(S)=-\sum_{i=1}^{M_{1}}P_{S}(i)log_{2}P_{S}(i),

H(Q)=-\sum_{j=1}^{M_{2}}P_{Q}(j)log_{2}P_{Q}(j),

H(S,Q)=-\sum_{i=1}^{M_{1}}\sum_{j=1}^{M_{2}}P_{S,Q}(i,j)log_{2}P_{S,Q}(i,j).

I(S,Q)=H(S)+H(Q)-H(S,Q),

I(S,Q)=I(S,Q)/\sqrt{H(S)\times H(Q)}.

交互信息曲线I(\tau)=I(S,Q)第一次下降到极小值所对应的延迟时间\tau则是最佳延时时间。

(2) 嵌入维数d的确定:

(2.1) 几何不变量法:

为了确定嵌入维数d, 实际应用中通常的方法是计算吸引子的某些几何不变量(如关联维数,Lyapunov指数等)。选择好延迟时间\tau之后逐渐增加维数d, 直到他们停止变化为止。从Takens嵌入定理分析可知,这些几何不变量具有吸引子的几何性质,当维数d大于最小嵌入维数的时候,几何结构已经被完全打开,此时这些几何不变量与嵌入的维数无关。基于此理论,可以选择吸引子的几何不变量停止变化时的嵌入维数d作为重构的相空间维数。

(2.2) 虚假最临近点法:

从几何的观点来看,混沌时间序列是高维相空间混沌运动的轨迹在一维空间的投影,在这个投影的过程中,混沌运动的轨迹就会被扭曲。高维相空间并不相邻的两个点投影到一维空间上有的时候就会成为相邻的两点,也就是虚假邻点。重构相空间,实际上就是从混沌时间序列中恢复混沌运动的轨迹,随着嵌入维数的增大,混沌运动的轨道就会被打开,虚假邻点就会被逐渐剔除,从而整个混沌运动的轨迹得到恢复,这个思想就是虚假最临近点法的关键。

d维相空间中,每一个矢量

y_{i}(d)=(x(i),\cdot \cdot \cdot, x(i+(d-1)\tau), 1\leq i\leq n-(d-1)\tau

都有一个欧几里德距离的最邻近点y_{n(i,d)}(d), (n(i,d)\neq i, 1\leq n(i,d)\leq n-(d-1)\tau) 其距离是

R_{i}(d)=||y_{i}(d)-y_{n(i,d)}(d)||_{2}.

当相空间的维数从d变成d+1时,这两个点的距离就会发生变化,新的距离是R_{i}(d+1), 并且

(R_{i}(d+1))^{2}=(R_{i}(d))^{2}+||x(i+d\tau)-x(n(i,d)+d\tau)||_{2}^{2}.

如果R_{i}(d+1)R_{i}(d)大很多,那么就可以认为这是由于高维混沌吸引子中两个不相邻得点投影到低维坐标上变成相邻的两点造成的,这样的临近点是虚假的,令

a_{1}(i,d)=||x(i+d\tau)-x(n(i,d)+d\tau)||_{2}/R_{d}(i).

如果a_{1}(i,d)>R_{\tau} \in [10,50], 那么y_{n(i,d)}(d)就是y_{i}(d)的虚假最临近点。这里的R_{\tau}是阀值。

对于实际的混沌时间序列,从嵌入维数的最小值2开始,计算虚假最临近点的比例,然后逐渐增加维数d, 直到虚假最临近点的比例小于5%或者虚假最临近点的不再随着d的增加而减少时,可以认为混沌吸引子已经完全打开,此时的d就是嵌入维数。在相空间重构方面,虚假最临近点法(FNN)被认为是计算嵌入维数很有效的方法之一。

(2.3) 虚假最临近点法的改进-Cao方法:

虚假最临近点法对数据的噪声比较敏感,而且实际操作中需要选择阀值R_{\tau}, 具有很强的主观性。此时Cao Liangyue教授提出了改进的FNN方法,此方法计算时只需要延迟时间\tau一个参数,用较小的数据量就可以求的嵌入维数d.

假设我们有时间序列x(1),\cdot\cdot\cdot, x(n). 那么基于延迟时间\tau所构造的向量空间就是:y_{i}(d)=(x(i),\cdot\cdot\cdot x(i+(d-1)\tau)), i=1,2, \cdot\cdot\cdot, n-(d-1)\tau, 这里的d是作为嵌入维数,\tau是作为嵌入时间。y_{i}(d)则表示第id维重构向量。与虚假最临近点法类似,定义变量

a(i,d)=||y_{i}(d+1)-y_{n(i,d)}(d+1)||_{\infty}/||y_{i}(d)-y_{n(i,d)}(d)||_{\infty}, i=1,2,\cdot\cdot\cdot n-d\tau,

这里的||\cdot ||_{\infty}是最大模范数。其中y_{i}(d+1)是第id+1维的重构向量,i.e. y_{i}(d+1)=(x(i),\cdot\cdot\cdot, x(i+d\tau)). n(i,d) \in \{1,\cdot\cdot\cdot, n-d\tau\}是使得y_{n(i,d)}(d)d维相空间里面,在最大模范数下与y_{i}(d)最近的向量,并且n(i,d)\neq i,显然n(i,d)id所决定。

定义

E(d)=(n-d\tau)^{-1}\cdot\sum_{i=1}^{n-d\tau} a(i,d).

E(d)与延迟时间\tau和嵌入维数d有关。为了发现从dd+1,该函数变化了多少,可以考虑E1(d)=E(d+1)/E(d).d大于某个值d_{0}时,E1(d)不再变化,那么d_{0}+1则是我们所寻找的最小嵌入维数。除了E1(d), 还可以定义一个变量E2(d)如下。令

E^{*}(d)=(n-d\tau)^{-1}\cdot \sum_{i=1}^{n-d\tau}|x(i+d\tau)-x(n(i,d)+d\tau)|,

这里的n(i,d)和上面的定义一样,也就是使得y_{n(i,d)}(d)y_{i}(d)最近的整数,并且n(i,d)\neq i. 定义E2(d)=E^{*}(d+1)/E^{*}(d). 对于随机序列,数据之间没有相关性,E2(d)\equiv 1. 对于确定性的序列,数据点之间的关系依赖于嵌入维数d的变化,总存在一些值使得E2(d)\neq 1.

2. 相空间的预测

通过前面的相空间重构过程,一个混沌时间序列x(1),\cdot \cdot \cdot, x(n)可以确定其延迟时间\tau和嵌入维数d, 形成n-(d-1)\taud维向量:

\vec{y}_{1}=y_{1}(d)=(x(1),\cdot\cdot\cdot, x(1+(d-1)\tau)),

\vec{y}_{2}=y_{2}(d)=(x(2),\cdot\cdot\cdot, x(2+(d-1)\tau)),

\cdot \cdot \cdot

\vec{y}_{i}=y_{i}(d)=(x(i),\cdot\cdot\cdot, x(i+(d-1)\tau)),

\cdot \cdot \cdot

\vec{y}_{n-(d-1)\tau}=y_{n-(d-1)\tau}(d)=(x(n-(d-1)\tau),\cdot\cdot\cdot, x(n)).

这样我们就可以在d维欧式空间\mathbb{R}^{d}建立一个动力系统的模型如下:

\vec{y}_{i+1}=F(\vec{y}_{i}), 其中F是一个连续函数。

(1) 局部预测法:

假设N=n-(d-1)\tau, 根据连续函数的性质可以知道:如果\vec{y}_{N}\vec{y}_{j}非常接近,那么\vec{y}_{N+1}\vec{y}_{j+1}也是非常接近的,可以用x(j+1+(d-1)\tau)作为x(N+1+(d-1)\tau)=x(n+1)的近似值。

(1.1) 局部平均预测法:

考虑向量\vec{y}_{N}k个最临近向量\vec{y}_{t_{1}},\cdot\cdot\cdot, \vec{y}_{t_{k}}. i.e. 也就是从其他的N-1个向量中选取前k个与\vec{y}_{N}最临近的向量,此处用欧几里德范数||\cdot ||_{2}或者最大模范数||\cdot ||_{\infty}. 根据局部预测法的观点,可以得到x(n+1)的近似值:

x(n+1)\approx k^{-1}\cdot \sum_{j=1}^{k}x(t_{j}+1+(d-1)\tau) .

也可以引入权重的概念来计算近似值:

x(n+1)\approx \sum_{j=1}^{k}x(t_{j}+1+(d-1)\tau)\cdot \omega(\vec{y}_{t_{j}},\vec{y}_{N}).

其中\omega(\vec{y}_{t_{j}}, \vec{y}_{N})=K_{h}(||\vec{y}_{t_{j}}-\vec{y}_{N}||) / \sum_{j=1}^{k} K_{h}(||\vec{y}_{t_{j}}-\vec{y}_{N}||), 1\leq j \leq k,

K(x)=exp(-x^{2}/2),

K_{h}(x)=h^{-1}K(h^{-1}x)=h^{-1}exp(-x^{2}/(2*h^{2})).

(1.2) 局部线性预测法:

假设N=n-(d-1)\tau, 则有y_{N}(d)=(x(N),\cdot\cdot\cdot, x(n)).

局部线性预测模型为

\hat{x}(n+1)=c_{0}x(N)+c_{1}x(N+\tau)+\cdot\cdot\cdot+c_{d-1}x(N+(d-1)\tau)+c_{d},

其中c_{i}, 0\leq i\leq d是待定系数。假设向量\vec{y}_{N}k个最临近向量是\vec{y}_{t_{1}},\cdot\cdot\cdot,\vec{y}_{t_{k}}. 此处可以用欧几里德范数||\cdot ||_{2}或者最大模范数||\cdot ||_{\infty}. 这里使用最小二乘法来求出c_{i}, 0\leq i\leq d的值。也就是求c_{i}使得

||Ab-Y||_{2}^{2}=\sum_{j=1}^{k}(\hat{x}(t_{j}+1+(d-1)\tau)-x(t_{j}+1+(d-1)\tau))^{2}

=\sum_{j=1}^{k}(c_{0}x(t_{j})+c_{1}x(t_{j}+\tau)+\cdot\cdot\cdot+c_{d-1}x(t_{j}+(d-1)\tau)+c_{d}-x(t_{j}+1+(d-1)\tau))^{2}

取得最小值,其中Y=(x(t_{1}+1+(d-1)\tau), x(t_{2}+1+(d-1)\tau),\cdot\cdot\cdot x(t_{k}+1+(d-1)\tau))^{T}, b=(c_{0},c_{1},\cdot\cdot\cdot, c_{d})^{T}, 矩阵A的第j行是d+1维向量(x(t_{j}),\cdot\cdot\cdot, x(t_{j}+(d-1)\tau), 1), 1\leq j\leq k. 根据最小二乘法可以得到待定系数向量b=(A^{T}A)^{-1}A^{T}Y是最小二乘解。

(1.3) 局部多项式预测法:

(2) 全局预测法:

(2.1) 神经网络
(2.2) 小波网络
(2.3) 遗传算法

3. 实验数据

这里使用Lorenz模型来作为测试数据,

dx/dt=\sigma(y-x),

dy/dt=x(\rho-z)-y,

dz/dt=xy-\beta z,

其中x, y, z是系统的三个坐标,\sigma, \rho, \beta是三个系数。在这里,我们取\rho=28, \sigma=10, \beta=8/3.在解这个常微分方程的时候,使用了经典的Runge-Kutta数值方法。

测试数据1:

使用1300个点,预测未来100个点,绿色曲线表示Lorenz模型在z坐标上的实际数据,红线右侧表示开始预测,蓝色曲线表示使用相空间重构模型所预测的数据。根据统计学分析可以得到:在红色线条的右侧,蓝色曲线的点和绿色曲线的点的Normalized Root Mean Square Error<8%, Mean Absolute Percentage Error<5.3%, 相关系数>97%.

n=1300 1

把红线附近的图像放大可以看的更加清楚:

n=1300 2

测试数据2:

使用1800个点,预测未来100个点,绿色曲线表示Lorenz模型在x坐标上的实际数据,红线右侧表示开始预测,蓝色曲线表示使用相空间重构模型所预测的数据。根据统计学分析可以得到:在红色线条的右侧,蓝色曲线的点和绿色曲线的点的Normalized Root Mean Square Error<1%, Mean Absolute Percentage Error<2%, 相关系数>99%.

n=1800

把红线附近的图像放大:

n=1800 2

使用同样的数据量,也就是n=1800,预测未来100个点,绿色曲线表示Lorenz模型在z坐标上的实际数据,红线右侧表示开始预测,蓝色曲线表示使用相空间重构模型所预测的数据。根据统计学分析可以得到:在红色线条的右侧,蓝色曲线的点和绿色曲线的点的Normalized Root Mean Square Error<1%, Mean Absolute Percentage Error<1%, 相关系数>99%.

n=1800 y

把红线附近的图像放大:

n=1800 y2

参考文献:

  1. https://en.wikipedia.org/wiki/Lorenz_system
  2. https://en.wikipedia.org/wiki/Root-mean-square_deviation
  3. https://en.wikipedia.org/wiki/Mean_absolute_percentage_error
  4. Practical method for determining the minimum embedding dimension of a scalar time series
  5. 基于混沌理论的往复式压缩机故障诊断
  6. determining embedding dimension for phase-space reconstruction using a geometric construction
  7. Time Series Prediction by Chaotic Modeling of Nonlinear Dynamical Systems
  8. nonlinear dynamics delay times and embedding windows

时间序列模型之灰度模型

灰度预测法是一种对含有不确定因素的系统的预测方法,灰色系统是位于白色系统和黑色系统之间的一种系统。

白色系统指的是一个系统内部的特征是完全已知的,使用者不仅知道系统的输入-输出关系,还知道实现输入-输出的具体方式,譬如函数表达式,微分方程的变化公式,或者物理学的基本定律。比方说牛顿第二定律F=ma, 使用者只需要知道物体的质量和加速度,就可以通过牛顿第二定律求出所使用的力F的具体值。或者说当物体的质量固定之后,已知不同的加速度,就可以求出不同的力的值,力和加速度之间有明确的关系表达式。这种系统就称为白色系统。

黑色系统和白色系统就有鲜明的对比,使用者完全不清楚黑色系统的内部特征,只是知道一些输入和相应的输出。使用者需要通过这一系列的输入和输出的对应关系来判断这个黑色系统的内部特征,对于没有办法打开的黑箱,使用者需要通过输入和输出来建立模型来了解这个黑箱。比方说:高校在进行招生的时候,其实对考生是几乎不了解的,这个时候就需要通过高考或者自主招生考试来判断学生的知识掌握程度和运用知识解决问题的能力。通过测试的结果来判断学生的综合能力的大小,进而判断是不是符合高校的专业需求。

灰度系统恰好位于白色系统和黑色系统之间,灰度系统内部一部分是已知的,但是另一部分却是未知的。比方说:经济学中的模型,通过数学或者统计学的方法,可以判断某些因素之间确实有关系,但是却不能够完全的判断整个系统的情况,对于很多未知的情况,这些模型就有一定的局限性。于是,灰度系统就需要通过判断系统各个因素之间的发展规律,进行相应的关联分析。通过原始的数列来生成规律性更强的数列,通过生成的数列来建立相应的微分方程模型,从而预测数列的未来发展趋势。灰度模型使用等时间距观测到的数据值来构造灰色预测模型,从而达到能够预测未来某一时刻的数值的目的。

尽管灰度系统的现象不够清楚,内部的结构并不为人所知,数据是杂乱的,但是却是互相关联的,有整体功效的,因而可以对它的变化过程进行预测。其中灰度模型是灰度系统理论的重要组成部分,将杂乱的原始数据整理成规律性较强的数据,然后再利用离散的灰度方程建立连续的微分模型,从而对系统的发展做出全面的观察分析,并做出长期预测。

下面来介绍下灰度模型的几种分类:

已知序列 x^{(0)}(1),...,x^{(0)}(n),其中n是数组的长度。通过分析这组序列的内在结构,从而建立相应的微分方程模型,达到预测未来序列x^{(0)}(n+1), x^{(0)}(n+2),...的目的。

(1) 灰度模型GM(1,1)

第一步需要通过累加形成新序列 x^{(1)}(1),..., x^{(1)}(n),其中

x^{(1)}(1)=x^{(0)}(1),

x^{(1)}(2)=x^{(0)}(1)+x^{(0)}(2), \cdot \cdot \cdot

x^{(1)}(n)=x^{(0)}(1)+\cdot\cdot\cdot+x^{(0)}(n).

对上面的累加序列取均值,就可以得到序列

z^{(1)}(1)=x^{(1)}(1),

z^{(1)}(2)=(x^{(1)}(1)+x^{(1)}(2))/2, \cdot \cdot \cdot

z^{(1)}(n)=(x^{(1)}(n-1)+x^{(1)}(n))/2.

那么GM(1,1)灰度模型相应的微分方程就是\frac{dx^{(1)}}{dt}+a x^{(1)}(t)=b, 它的离散方程是x^{(0)}(k)+az^{(1)}(k)=b, 这里2\leq k\leq n. 其中a成为发展系数,b称为控制系数。通过最小二乘法的推导,可以得到 (a,b)^{T}=(B^{T}B)^{-1}B^{T}Y. 其中B和Y都是矩阵,定义如下:

B=\begin{pmatrix} -z^{(1)}(2) & 1 \\ -z^{(1)}(3) & 1 \\ . & . \\ -z^{(1)}(n) & 1 \end{pmatrix},

Y= (x^{(0)}(2),\cdot \cdot \cdot, x^{(0)}(n))^{T}.

通过常微分方程的求解,可以得到预测模型的解析表达式。这里分两种情况:

第一种情况:a\neq 0

x^{(1)}_{p}(k)=(x^{(0)}(1)-\frac{b}{a})e^{-a(k-1)}+\frac{b}{a}, k \geq 1, 从而得到序列的预测值:

x^{(0)}_{p}(k)=x^{(1)}(k)-x^{(1)}(k-1)=(x^{(0)}(1)-\frac{b}{a})\cdot e^{-a(k-1)}\cdot (1-e^{a}), k\geq 2, x^{(0)}_{p}(1)=x^{(0)}(1).

第二种情况:a=0

x^{(1)}_{p}(k)=bk+x^{(0)}(1)-b, k\geq 1, 从而得到的序列的预测值:

x^{(0)}_{p}(k)=x^{(0)}(1).

(2) 灰度Verhulst模型

灰度Verhulst模型和前面的灰度模型GM(1,1)相似,也是计算累加生成数列,然后计算中值数列。不过Verhulst模型是建立非线性的常微分方程,所以相应的矩阵B和Y就需要做更改。微分方程是\frac{dx^{1}(t)}{dt}+ax^{(1)}(t)=b(x^{(1)}(t))^{2}. 离散方程则是x^{(0)}(k)+az^{(1)}(k)=b(z^{(1)}(k))^{2} \text{ for } 2\leq k\leq n. 通过最小二乘法的推导,可以得到 (a,b)^{T}=(B^{T}B)^{-1}B^{T}Y. 其中B和Y都是矩阵,定义如下:

B=\begin{pmatrix} -z^{(1)}(2) & (z^{(1)}(2))^{2} \\ -z^{(1)}(3) & (z^{(1)}(3))^{2} \\ . & . \\ -z^{(1)}(n) & (z^{(1)}(n))^{2} \end{pmatrix},

Y= (x^{(0)}(2),\cdot \cdot \cdot, x^{(0)}(n))^{T}.

通过微分方程的求解,可以得到

x_{p}^{(1)}(k)=\frac{ax^{(0)}(1)}{bx^{(0)}(1)+(a-bx^{(0)}(1))e^{a(k-1)}} \text{ for } k\geq 1.从而可以得到预测的序列x_{p}^{(0)}(k)=x_{p}^{(1)}(k)-x_{p}^{(1)}(k-1) \text{ for }k\geq 2, x_{p}^{(0)}(1)=x^{(0)}(1).

注:如果a<0, 那么\lim_{k\rightarrow \infty}x_{p}^{(1)}(k)=a/b, \lim_{k\rightarrow \infty}x_{p}^{(0)}(k)=0.

(3) 灰度模型DGM(2,1)

这个DGM(2,1)灰度模型并不需要累加生成序列,直接使用原序列就可以进行操作。微分方程是\frac{d^{2}x^{(1)}}{dt^{2}}+a\frac{dx^{(1)}}{dt}=b, 通过最小二乘法的推导,可以得到 (a,b)^{T}=(B^{T}B)^{-1}B^{T}Y. 其中B和Y都是矩阵,定义如下:

B=\begin{pmatrix} -x^{(0)}(2) & 1 \\ -x^{(0)}(3) & 1 \\ . & . \\ -x^{(0)}(n) & 1 \end{pmatrix},

Y= (x^{(0)}(2)-x^{(0)}(1),\cdot \cdot \cdot, x^{(0)}(n)-x^{(0)}(n-1))^{T}.

通过微分方程的求解,可以得到:x^{(1)}_{p}(k)=(\frac{b}{a^{2}}-\frac{x^{(0)}(1)}{a})e^{-a(k-1)}+\frac{b}{a}k+(x^{(0)}(1)-\frac{b}{a})\frac{1+a}{a} \text{ for } k\geq 1.

求解原序列的预测值就是:x^{(0)}_{p}(k)=(\frac{b}{a^{2}}-\frac{x^{(0)}(1)}{a})(1-e^{a})e^{-a(k-1)}+\frac{b}{a} \text{ for } k\geq 2, x^{(0)}_{p}(1)=x^{(0)}(1).

案例1:灰度模型在电力行业的应用。

某市1984~1990年用电量数据如下:

年份                  1984             1985           1986           1987            1988             1989            1990

用电量/GWh    2783.20  3028.26  3290.55  3477.77 3685.02      3935.09    4210.29

这里用n=4,也就是前四个点建立模型,然后后面三个点用来测试模型的正确性。注:这里使用了一些根据上面模型进行改进的灰度模型,进行了模型的融合。x^{(0)}(1)=2783.20, x^{(0)}(2)=3028.26, x^{(0)}(3)=3290.55, x^{(0)}(4)=3477.77. 通过计算累积生成序列,可以进一步求的系数a和b的值,从而根据上面的模型进行模拟。通过模型可以算出

x^{(0)}_{p}(1)=2783.20, x^{(0)}_{p}(2)=3042.19, x^{(0)}_{p}(3)=3258.00, x^{(0)}_{p}(4)=3489.12, x^{(0)}_{p}(5)=3736.64, x^{(0)}_{p}(6)=4001.72, x^{(0)}_{p}(7)=4285.60.

通过平均绝对百分比误差(Mean Absolute Percentage Error)的定义得到:前四项的MAPE<0.005, 所有七项的MAPE<0.01. 通过标准均方根误差(Normalized root mean square error)的定义可以求出前四项的NRMSE<0.027, 所有七项的NRMSE<0.031.

作图如下:

eletric energy

其中绿色线条表示实际的用电量数据,蓝色线条表示用灰度模型模拟的用电量数据。红线左侧表示用来测试的前四项,红线右侧表示预测的未来三年的走势。

案例2:某油藏原始数据

年份              1972             1973           1974           1975             1976             1977            1978              1979

综合含水     31.8              39.1            43.2             48.6              49.8                   53.3               58.6               61.7

这里用n=5,也就是前五个点建立模型,然后后面三个点用来测试模型的正确性。

x^{(0)}(1)=31.8, x^{(0)}(2)=39.1, x^{(0)}(3)=43.2, x^{(0)}(4)=48.6, x^{(0)}(5)=49.8. 通过计算累积生成序列,可以进一步求的系数a和b的值,从而根据上面的模型进行模拟。通过模型可以算出

x^{(0)}_{p}(1)=31.8, x^{(0)}_{p}(2)=39.72, x^{(0)}_{p}(3)=43.10, x^{(0)}_{p}(4)=46.78, x^{(0)}_{p}(5)=50.77, x^{(0)}_{p}(6)=55.10, x^{(0)}_{p}(7)=59.80, x^{(0)}_{p}(8)=64.89.

通过平均绝对百分比误差(Mean Absolute Percentage Error)的定义得到:前五项的MAPE<0.015, 所有八项的MAPE<0.023. 通过标准均方根误差(Normalized root mean square error)的定义可以求出前五项的NRMSE<0.054, 所有八项的NRMSE<0.052.

作图如下:

raw data of reservior

其中绿色线条表示实际的综合含水数据,蓝色线条表示用灰度模型模拟的综合含水数据。红线左侧表示用来测试的前五项,红线右侧表示预测的未来三年的走势。

参考文献:

  1. https://en.wikipedia.org/wiki/Root-mean-square_deviation
  2. https://en.wikipedia.org/wiki/Mean_absolute_percentage_error