A Guide to Time Series Forecasting with ARIMA in Python 3

https://www.digitalocean.com/community/tutorials/a-guide-to-time-series-forecasting-with-arima-in-python-3#step-2-—-importing-packages-and-loading-data

 

Introduction

Time series provide the opportunity to forecast future values. Based on previous values, time series can be used to forecast trends in economics, weather, and capacity planning, to name a few. The specific properties of time-series data mean that specialized statistical methods are usually required.

In this tutorial, we will aim to produce reliable forecasts of time series. We will begin by introducing and discussing the concepts of autocorrelation, stationarity, and seasonality, and proceed to apply one of the most commonly used methods for time-series forecasting, known as ARIMA.

One of the methods available in Python to model and predict future points of a time series is known as SARIMAX, which stands for Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors. Here, we will primarily focus on the ARIMA component, which is used to fit time-series data to better understand and forecast future points in the time series.

Prerequisites

This guide will cover how to do time-series analysis on either a local desktop or a remote server. Working with large datasets can be memory intensive, so in either case, the computer will need at least 2GB of memory to perform some of the calculations in this guide.

To make the most of this tutorial, some familiarity with time series and statistics can be helpful.

For this tutorial, we’ll be using Jupyter Notebook to work with the data. If you do not have it already, you should follow our tutorial to install and set up Jupyter Notebook for Python 3.

Step 1 — Installing Packages

To set up our environment for time-series forecasting, let’s first move into our local programming environment or server-based programming environment:

  • cd environments
  • . my_env/bin/activate

From here, let’s create a new directory for our project. We will call it ARIMA and then move into the directory. If you give the project a different name, be sure to substitute that name for ARIMA throughout the guide.

  • mkdir ARIMA
  • cd ARIMA

This tutorial will require the warnings, itertools, pandas, numpy, matplotlib, and statsmodels libraries. The warnings and itertools libraries are part of the Python standard library, so you shouldn’t need to install them.

Like with other Python packages, we can install these requirements with pip.
We can now install pandas, numpy, statsmodels, and the data plotting package matplotlib. Their dependencies will also be installed:

  • pip install pandas numpy statsmodels matplotlib

At this point, we’re now set up to start working with the installed packages.

Step 2 — Importing Packages and Loading Data

To begin working with our data, we will start up Jupyter Notebook:

  • jupyter notebook

To create a new notebook file, select New > Python 3 from the top right pull-down menu:

Create a new Python 3 notebook

This will open a notebook.

As is best practice, start by importing the libraries you will need at the top of your notebook:

import warnings
import itertools
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

We have also defined a matplotlib style of fivethirtyeight for our plots.

We’ll be working with a dataset called “Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory, Hawaii, U.S.A.,” which collected CO2 samples from March 1958 to December 2001. We can bring in this data as follows:

data = sm.datasets.co2.load_pandas()
y = data.data

Let’s preprocess our data a little bit before moving forward. Weekly data can be tricky to work with because each observation covers such a short span of time, so let’s use monthly averages instead. We’ll make the conversion with the resample function. For simplicity, we can also fill in missing values so that our time series has no gaps.

# The 'MS' string groups the data in buckets by start of the month
y = y['co2'].resample('MS').mean()

# bfill (backward fill) replaces each missing value with the next valid observation
y = y.bfill()

print(y)
Output
co2
1958-03-01  316.100000
1958-04-01  317.200000
1958-05-01  317.433333
...
2001-11-01  369.375000
2001-12-01  371.020000
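To make the two preprocessing calls concrete, here is a minimal sketch on a toy series. The dates and values are made up for illustration and are not taken from the CO2 dataset; only the resample and bfill calls mirror the code above.

```python
import pandas as pd

# Hypothetical weekly readings; no samples at all fall in February
idx = pd.to_datetime(["2000-01-01", "2000-01-08", "2000-03-04", "2000-03-11"])
weekly = pd.Series([10.0, 12.0, 20.0, 22.0], index=idx)

# 'MS' buckets the readings by month start and averages each bucket;
# the empty February bucket becomes NaN
monthly = weekly.resample("MS").mean()

# Backward fill: each NaN takes the next valid value (here, the March mean)
filled = monthly.bfill()

print(monthly)
print(filled)
```

In this toy case the February gap is filled with the March average, which shows that bfill looks forward in time rather than backward.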

Let’s explore this time series as a data visualization:

y.plot(figsize=(15, 6))
plt.show()

Figure 1: CO2 Levels Time Series

Some distinguishable patterns appear when we plot the data. The time series has an obvious seasonality pattern, as well as an overall increasing trend.

To learn more about time series pre-processing, please refer to “A Guide to Time Series Visualization with Python 3,” where the steps above are described in much more detail.

Now that we’ve converted and explored our data, let’s move on to time series forecasting with ARIMA.

Step 3 — The ARIMA Time Series Model

One of the most common methods used in time series forecasting is known as the ARIMA model, which stands for AutoRegressive Integrated Moving Average. ARIMA is a model that can be fitted to time series data in order to better understand or predict future points in the series.

There are three distinct integers (p, d, q) that are used to parametrize ARIMA models. Because of that, ARIMA models are denoted with the notation ARIMA(p, d, q). Together these three parameters account for seasonality, trend, and noise in datasets:

  • p is the auto-regressive part of the model. It allows us to incorporate the effect of past values into our model. Intuitively, this would be similar to stating that it is likely to be warm tomorrow if it has been warm the past 3 days.
  • d is the integrated part of the model. It sets the amount of differencing to apply to the time series, i.e. the number of times each value is replaced by its difference from the previous value. Intuitively, this would be similar to stating that it is likely to be the same temperature tomorrow if the difference in temperature in the last three days has been very small.
  • q is the moving average part of the model. This allows us to set the error of our model as a linear combination of the error values observed at previous time points.
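As a rough illustration of what the d parameter does, the sketch below applies differencing by hand to a made-up series with a curved trend. The helper name difference is hypothetical and is not part of statsmodels, which handles differencing internally.

```python
def difference(series, d=1):
    """Apply first-order differencing d times (illustrative helper)."""
    out = list(series)
    for _ in range(d):
        # Replace each value with its change from the previous value
        out = [b - a for a, b in zip(out, out[1:])]
    return out

# Made-up series with a quadratic trend: 0, 1, 4, 9, 16, 25
trend = [t * t for t in range(6)]

first = difference(trend, d=1)   # still trending upward
second = difference(trend, d=2)  # constant, so the trend is gone

print(first)
print(second)
```

Each round of differencing shortens the series by one point, and here two rounds are enough to remove the trend.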

When dealing with seasonal effects, we make use of the seasonal ARIMA, which is denoted as ARIMA(p,d,q)(P,D,Q)s. Here, (p, d, q) are the non-seasonal parameters described above, while (P, D, Q) follow the same definition but are applied to the seasonal component of the time series. The term s is the periodicity of the time series, i.e. the number of observations per seasonal cycle (4 for quarterly data, 12 for monthly data with a yearly cycle, etc.).
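The seasonal D parameter works like d, except that each value is compared with the value one full season earlier. The following sketch assumes a made-up series with a period of 4 and a hypothetical helper seasonal_difference; it only illustrates the idea and is not how statsmodels is called.

```python
def seasonal_difference(series, s):
    """Subtract from each value the value observed s steps earlier."""
    return [series[t] - series[t - s] for t in range(s, len(series))]

# Made-up data: a linear trend plus a pattern that repeats every 4 steps
pattern = [0, 5, 2, -3]
series = [t + pattern[t % 4] for t in range(12)]

# With s=4 the repeating pattern cancels out and only the per-season
# growth (4 units per cycle) remains
deseasonalized = seasonal_difference(series, s=4)
print(deseasonalized)
```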

The seasonal ARIMA method can appear daunting because of the multiple tuning parameters involved. In the next section, we will describe how to automate the process of identifying the optimal set of parameters for the seasonal ARIMA time series model.

Step 4 — Parameter Selection for the ARIMA Time Series Model

When looking to fit time series data with a seasonal ARIMA model, our first goal is to find the values of ARIMA(p,d,q)(P,D,Q)s that optimize a metric of interest. There are many guidelines and best practices to achieve this goal, yet the correct parametrization of ARIMA models can be a painstaking manual process that requires domain expertise and time. Other statistical programming languages such as R provide automated ways to solve this issue, but those have yet to be ported over to Python. In this section, we will resolve this issue by writing Python code to programmatically select the optimal parameter values for our ARIMA(p,d,q)(P,D,Q)s time series model.

We will use a “grid search” to iteratively explore different combinations of parameters. For each combination of parameters, we fit a new seasonal ARIMA model with the SARIMAX() function from the statsmodels module and assess its overall quality. Once we have explored the entire landscape of parameters, our optimal set of parameters will be the one that yields the best performance for our criteria of interest. Let’s begin by generating the various combinations of parameters that we wish to assess:

# Define the p, d and q parameters to take the value 0 or 1
p = d = q = range(0, 2)

# Generate all different combinations of p, d and q triplets
pdq = list(itertools.product(p, d, q))

# Generate all different combinations of seasonal P, D and Q triplets (with a period of 12)
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in itertools.product(p, d, q)]

print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))
Output
Examples of parameter combinations for Seasonal ARIMA...
SARIMAX: (0, 0, 1) x (0, 0, 1, 12)
SARIMAX: (0, 0, 1) x (0, 1, 0, 12)
SARIMAX: (0, 1, 0) x (0, 1, 1, 12)
SARIMAX: (0, 1, 0) x (1, 0, 0, 12)

We can now use the triplets of parameters defined above to automate the process of training and evaluating ARIMA models on different combinations. In Statistics and Machine Learning, this process is known as grid search (or hyperparameter optimization) for model selection.

When evaluating and comparing statistical models fitted with different parameters, each can be ranked against one another based on how well it fits the data or its ability to accurately predict future data points. We will use the AIC (Akaike Information Criterion) value, which is conveniently returned with ARIMA models fitted using statsmodels. The AIC measures how well a model fits the data while taking into account the overall complexity of the model. A model that fits the data very well while using lots of features will be assigned a larger AIC score than a model that uses fewer features to achieve the same goodness-of-fit. Therefore, we are interested in finding the model that yields the lowest AIC value.
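The trade-off described above follows from the standard definition AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The sketch below uses made-up log-likelihood values purely to show the penalty at work; the function name aic is illustrative, and in practice statsmodels exposes the value as results.aic.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

# Two hypothetical models with the same goodness-of-fit (log-likelihood)
simple_model = aic(log_likelihood=-100.0, n_params=3)
complex_model = aic(log_likelihood=-100.0, n_params=5)

# The model with more parameters is penalized with a larger AIC
print(simple_model, complex_model)
```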

The code chunk below iterates through combinations of parameters and uses the SARIMAX function from statsmodels to fit the corresponding Seasonal ARIMA model. Here, the order argument specifies the (p, d, q) parameters, while the seasonal_order argument specifies the (P, D, Q, S) seasonal component of the Seasonal ARIMA model. After fitting each SARIMAX() model, the code prints out its respective AIC score.

warnings.filterwarnings("ignore") # specify to ignore warning messages

for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(y,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)

            results = mod.fit()

            # The seasonal tuple already contains the period (12)
            print('SARIMAX{}x{} - AIC:{}'.format(param, param_seasonal, results.aic))
        except Exception:
            # Skip parameter combinations that fail to fit
            continue

Because some parameter combinations may lead to numerical misspecifications, we explicitly disabled warning messages in order to avoid an overload of warning messages. These misspecifications can also lead to errors and throw an exception, so we make sure to catch these exceptions and ignore the parameter combinations that cause these issues.

The code above should yield the following results. This may take some time:

Output
SARIMAX(0, 0, 0)x(0, 0, 1, 12) - AIC:6787.3436240402125
SARIMAX(0, 0, 0)x(0, 1, 1, 12) - AIC:1596.711172764114
SARIMAX(0, 0, 0)x(1, 0, 0, 12) - AIC:1058.9388921320026
SARIMAX(0, 0, 0)x(1, 0, 1, 12) - AIC:1056.2878315690562
SARIMAX(0, 0, 0)x(1, 1, 0, 12) - AIC:1361.6578978064144
SARIMAX(0, 0, 0)x(1, 1, 1, 12) - AIC:1044.7647912940095
...
...
...
SARIMAX(1, 1, 1)x(1, 0, 0, 12) - AIC:576.8647112294245
SARIMAX(1, 1, 1)x(1, 0, 1, 12) - AIC:327.9049123596742
SARIMAX(1, 1, 1)x(1, 1, 0, 12) - AIC:444.12436865161305
SARIMAX(1, 1, 1)x(1, 1, 1, 12) - AIC:277.7801413828764

The output of our code suggests that SARIMAX(1, 1, 1)x(1, 1, 1, 12) yields the lowest AIC value of 277.78. We should therefore consider this to be the optimal option out of all the models we have considered.
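Rather than reading the minimum off the printed output, the AIC values can be collected and the best combination picked programmatically. The sketch below is one possible approach: the dictionary aic_by_params is a hypothetical container filled by hand with the AIC values from the output lines shown above (rounded to two decimals), standing in for values that would be stored inside the grid search loop.

```python
# Hypothetical container: (order, seasonal_order) -> AIC, copied from the
# output lines shown above; in the grid search loop it would be filled with
# aic_by_params[(param, param_seasonal)] = results.aic
aic_by_params = {
    ((0, 0, 0), (0, 0, 1, 12)): 6787.34,
    ((0, 0, 0), (0, 1, 1, 12)): 1596.71,
    ((0, 0, 0), (1, 0, 0, 12)): 1058.94,
    ((0, 0, 0), (1, 0, 1, 12)): 1056.29,
    ((0, 0, 0), (1, 1, 0, 12)): 1361.66,
    ((0, 0, 0), (1, 1, 1, 12)): 1044.76,
    ((1, 1, 1), (1, 0, 0, 12)): 576.86,
    ((1, 1, 1), (1, 0, 1, 12)): 327.90,
    ((1, 1, 1), (1, 1, 0, 12)): 444.12,
    ((1, 1, 1), (1, 1, 1, 12)): 277.78,
}

# Pick the key whose AIC is smallest
best_order, best_seasonal_order = min(aic_by_params, key=aic_by_params.get)
print(best_order, best_seasonal_order, aic_by_params[(best_order, best_seasonal_order)])
```

The two tuples can then be passed directly as the order and seasonal_order arguments when refitting the model.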

Step 5 — Fitting an ARIMA Time Series Model

Using grid search, we have identified the set of parameters that produces the best fitting model to our time series data. We can proceed to analyze this particular model in more depth.

We’ll start by plugging the optimal parameter values into a new SARIMAX model:

mod = sm.tsa.statespace.SARIMAX(y,
                                order=(1, 1, 1),
                                seasonal_order=(1, 1, 1, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)

results = mod.fit()

print(results.summary().tables[1])
Output
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.3182      0.092      3.443      0.001       0.137       0.499
ma.L1         -0.6255      0.077     -8.165      0.000      -0.776      -0.475
ar.S.L12       0.0010      0.001      1.732      0.083      -0.000       0.002
ma.S.L12      -0.8769      0.026    -33.811      0.000      -0.928      -0.826
sigma2         0.0972      0.004     22.634      0.000       0.089       0.106
==============================================================================

The summary() method of the fitted results object returns a significant amount of information, but we’ll focus our attention on the table of coefficients. The coef column shows the weight (i.e. importance) of each feature and how each one impacts the time series. The P>|z| column informs us of the significance of each feature weight. Here, each weight has a p-value lower than or close to 0.05, so it is reasonable to retain all of them in our model.

When fitting seasonal ARIMA models (and any other models for that matter), it is important to run model diagnostics to ensure that none of the assumptions made by the model have been violated. The plot_diagnostics() method allows us to quickly generate model diagnostics and investigate for any unusual behavior.

results.plot_diagnostics(figsize=(15, 12))
plt.show()

Figure 2: Model Diagnostics

Our primary concern is to ensure that the residuals of our model are uncorrelated and normally distributed with zero-mean. If the seasonal ARIMA model does not satisfy these properties, it is a good indication that it can be further improved.

In this case, our model diagnostics suggest that the model residuals are uncorrelated and approximately normally distributed, based on the following:

  • In the top right plot, we see that the red KDE line follows closely with the N(0,1) line (where N(0,1) is the standard notation for a normal distribution with mean 0 and standard deviation of 1). This is a good indication that the residuals are normally distributed.
  • The qq-plot on the bottom left shows that the ordered distribution of residuals (blue dots) follows the linear trend of the samples taken from a standard normal distribution, N(0, 1). Again, this is a strong indication that the residuals are normally distributed.
  • The residuals over time (top left plot) don’t display any obvious seasonality and appear to be white noise. This is confirmed by the autocorrelation (i.e. correlogram) plot on the bottom right, which shows that the time series residuals have low correlation with lagged versions of themselves.

Those observations lead us to conclude that our model produces a satisfactory fit that could help us understand our time series data and forecast future values.

Although we have a satisfactory fit, some parameters of our seasonal ARIMA model could be changed to improve our model fit. For example, our grid search only considered a restricted set of parameter combinations, so we may find better models if we widened the grid search.

Step 6 — Validating Forecasts

We have obtained a model for our time series that can now be used to produce forecasts. We start by comparing predicted values to real values of the time series, which will help us understand the accuracy of our forecasts. The get_prediction() and conf_int() methods allow us to obtain the values and associated confidence intervals for forecasts of the time series.

pred = results.get_prediction(start=pd.to_datetime('1998-01-01'), dynamic=False)
pred_ci = pred.conf_int()

The code above requires the forecasts to start at January 1998.

The dynamic=False argument ensures that we produce one-step ahead forecasts, meaning that forecasts at each point are generated using the full history up to that point.

We can plot the real and forecasted values of the CO2 time series to assess how well we did. Notice how we zoomed in on the end of the time series by slicing the date index.

ax = y['1990':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7)

ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.2)

ax.set_xlabel('Date')
ax.set_ylabel('CO2 Levels')
plt.legend()

plt.show()

Figure 3: CO2 Levels Static Forecast

Overall, our forecasts align with the true values very well, showing an overall increasing trend.

It is also useful to quantify the accuracy of our forecasts. We will use the MSE (Mean Squared Error), which summarizes the average error of our forecasts. For each predicted value, we compute its distance to the true value and square the result. The results need to be squared so that positive/negative differences do not cancel each other out when we compute the overall mean.

y_forecasted = pred.predicted_mean
y_truth = y['1998-01-01':]

# Compute the mean square error
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
Output
The Mean Squared Error of our forecasts is 0.07

The MSE of our one-step ahead forecasts yields a value of 0.07, which is very low as it is close to 0. An MSE of 0 would mean that the estimator is predicting observations with perfect accuracy, which would be an ideal scenario but is not typically possible.

However, a better representation of our true predictive power can be obtained using dynamic forecasts. In this case, we only use information from the time series up to a certain point, and after that, forecasts are generated using values from previous forecasted time points.
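The difference between the two forecasting modes can be sketched with a toy autoregressive model. Everything here is an assumption made for illustration: the coefficient phi, the observed values, and the simple rule "next value = phi times previous value" stand in for the fitted seasonal ARIMA model.

```python
phi = 0.9                                # hypothetical AR(1) coefficient
observed = [10.0, 9.5, 8.0, 7.9, 7.0]    # made-up observations
start = 2                                # index where forecasting begins

# One-step ahead: every forecast is based on the actual previous observation
one_step = [phi * observed[t - 1] for t in range(start, len(observed))]

# Dynamic: after the start point, each forecast is based on the previous forecast
dynamic = []
previous = observed[start - 1]
for _ in range(start, len(observed)):
    previous = phi * previous
    dynamic.append(previous)

print(one_step)
print(dynamic)
```

The first forecast is identical in both modes; after that, the dynamic forecasts no longer see the observed data, so their errors can accumulate.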

In the code chunk below, we specify to start computing the dynamic forecasts and confidence intervals from January 1998 onwards.

pred_dynamic = results.get_prediction(start=pd.to_datetime('1998-01-01'), dynamic=True)
pred_dynamic_ci = pred_dynamic.conf_int()

Plotting the observed and forecasted values of the time series, we see that the overall forecasts are accurate even when using dynamic forecasts. All forecasted values (red line) match pretty closely to the ground truth (blue line), and are well within the confidence intervals of our forecast.

ax = y['1990':].plot(label='observed', figsize=(20, 15))
pred_dynamic.predicted_mean.plot(label='Dynamic Forecast', ax=ax)

ax.fill_between(pred_dynamic_ci.index,
                pred_dynamic_ci.iloc[:, 0],
                pred_dynamic_ci.iloc[:, 1], color='k', alpha=.25)

ax.fill_betweenx(ax.get_ylim(), pd.to_datetime('1998-01-01'), y.index[-1],
                 alpha=.1, zorder=-1)

ax.set_xlabel('Date')
ax.set_ylabel('CO2 Levels')

plt.legend()
plt.show()

Figure 4: CO2 Levels Dynamic Forecast

Once again, we quantify the predictive performance of our forecasts by computing the MSE:

# Extract the predicted and true values of our time series
y_forecasted = pred_dynamic.predicted_mean
y_truth = y['1998-01-01':]

# Compute the mean square error
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
Output
The Mean Squared Error of our forecasts is 1.01

The predicted values obtained from the dynamic forecasts yield an MSE of 1.01. This is higher than the MSE of the one-step ahead forecasts (0.07), which is to be expected given that we are relying on less historical data from the time series.
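Because the MSE is expressed in squared units, it can be easier to compare the two results after taking the square root (the root mean squared error, or RMSE), which is in the same units as the series itself. This is an optional extra step, not something the code above computes; the sketch simply applies it to the two MSE values reported above.

```python
import math

mse_one_step = 0.07   # MSE reported for the one-step ahead forecasts
mse_dynamic = 1.01    # MSE reported for the dynamic forecasts

# RMSE is the square root of the MSE, in the units of the original series
rmse_one_step = math.sqrt(mse_one_step)
rmse_dynamic = math.sqrt(mse_dynamic)

print(round(rmse_one_step, 2), round(rmse_dynamic, 2))
```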

Both the one-step ahead and dynamic forecasts confirm that this time series model is valid. However, much of the interest around time series forecasting is the ability to forecast future values further ahead in time.

Step 7 — Producing and Visualizing Forecasts

In the final step of this tutorial, we describe how to leverage our seasonal ARIMA time series model to forecast future values. The get_forecast() method of our fitted results object can compute forecasted values for a specified number of steps ahead.

# Get forecast 500 steps ahead in future
pred_uc = results.get_forecast(steps=500)

# Get confidence intervals of forecasts
pred_ci = pred_uc.conf_int()

We can use the output of this code to plot the time series and forecasts of its future values.

ax = y.plot(label='observed', figsize=(20, 15))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Date')
ax.set_ylabel('CO2 Levels')

plt.legend()
plt.show()

Figure 5: Time Series and Forecast of Future Values

Both the forecasts and associated confidence interval that we have generated can now be used to further understand the time series and foresee what to expect. Our forecasts show that the time series is expected to continue increasing at a steady pace.

As we forecast further out into the future, it is natural for us to become less confident in our values. This is reflected by the confidence intervals generated by our model, which grow larger as we move further out into the future.
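The widening of the intervals can be illustrated with a much simpler model than the one fitted here. The sketch below assumes a plain random walk, for which the variance of an h-step ahead forecast is h times the noise variance; the noise variance sigma2 is a made-up value, and the fitted seasonal ARIMA model will follow a different, more involved formula.

```python
import math

sigma2 = 0.1  # hypothetical one-step noise variance

def half_width(h, z=1.96):
    """Half-width of an approximate 95% interval for an h-step random walk forecast."""
    return z * math.sqrt(h * sigma2)

# The interval widens with the square root of the forecast horizon
half_widths = [half_width(h) for h in (1, 4, 9)]
print([round(w, 3) for w in half_widths])
```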

Conclusion

In this tutorial, we described how to implement a seasonal ARIMA model in Python. We made extensive use of the pandas and statsmodels libraries and showed how to run model diagnostics, as well as how to produce forecasts of the CO2 time series.

Here are a few other things you could try:

  • Change the start date of your dynamic forecasts to see how this affects the overall quality of your forecasts.
  • Try more combinations of parameters to see if you can improve the goodness-of-fit of your model.
  • Select a different metric to select the best model. For example, we used the AIC measure to find the best model, but you could seek to optimize the out-of-sample mean square error instead.

For more practice, you could also try to load another time series dataset to produce your own forecasts.

9 Comments

  • 0

    Hi! Thanks for sharing this.
    I was trying to forecast hourly values. The seasonality to capture is that each value should be similar to the 168th previous value. This means Friday 9PM of this week should be similar to Friday 9PM of the past week.
    That is why I decided to use a seasonality of 168 (24*7), but it takes very long and consumes lots of memory. I’ve tried several times using seasonalities of 7 and 24, but it wasn’t doing well when forecasting (the previous fitting with dynamic set to False was working perfectly). Do you have any advice for this situation? Thanks in advance.

  • 0

    Thanks for the Guide.

    I tried this with my own data. And at the model result summary part, I got ma.L1 having p-value over 0.88. So, I definitely want to get rid of this feature from the model. But how do I do that? How to remove a feature from the model??

    • 0

      Hi!
      Thanks for taking the time to read through this tutorial! Yes, a p-value of 0.88 would suggest that your ma.L1 feature is not very informative. The simplest way to start would be to try and remove the MA features from your model. You can achieve this by refitting your time series model with the q parameter explicitly set to zero, which ensures that no non-seasonal MA components are used when you fit your model.

  • 0

    very nice tutorial. thanks! I am a new one to ARIMA model, I want to ask you some questions.
    1) I found that you use all the historical data to fit the ARIMA time series model, and then use part of that same historical data to validate the model, with this code: pred = results.get_prediction(start=pd.to_datetime('1998-01-01'), dynamic=False) pred_ci = pred.conf_int()
    But why is the data used to validate the model part of the data used to fit it? In machine learning, the training data and test data are split, so I don’t understand why it is different here.
    2) What about data stationarity? Could you tell me why you set enforce_stationarity to False?

    3) What about fitting the model on daily data rather than monthly (averaged) data, or using weekly data for prediction? Could you tell me how to do it?

    thanks!

  • 0

    I got an error on line:
    pred = results.get_prediction(start=pd.to_datetime('1998-02-01'), dynamic=False)

    File “pandas_libs\tslib.pyx”, line 1080, in pandas.libs.tslib.Timestamp.richcmp (pandas_libs\tslib.c:20281)
    TypeError: Cannot compare type ‘Timestamp’ with type ‘int’

    How can I solve it?

  • 1

    Hi.
    Thank you so much for your wonderful sharing. Is there any way to catch the minimum value of AIC automatically?
    It would be wonderful if the best set of parameters for SARIMAX were stored in an external variable and passed to the next step.
    Is it possible? How?
    Thank you

    • 0

      Use this code

      warnings.filterwarnings("ignore") # specify to ignore warning messages
      AIC_list = pd.DataFrame({}, columns=['pram','param_seasonal','AIC'])
      for param in pdq:
          for param_seasonal in seasonal_pdq:
              try:
                  mod = sm.tsa.statespace.SARIMAX(y,
                                                  order=param,
                                                  seasonal_order=param_seasonal,
                                                  enforce_stationarity=False,
                                                  enforce_invertibility=False)
      
                  results = mod.fit()
      
                  print('ARIMA{}x{} - AIC:{}'.format(param, param_seasonal, results.aic))
                  temp = pd.DataFrame([[ param ,  param_seasonal , results.aic ]], columns=['pram','param_seasonal','AIC'])
                  AIC_list = AIC_list.append( temp, ignore_index=True)  # Unlike list.append, DataFrame.append returns a new object, so the result must be assigned back like this.
                  del temp
      
              except:
                  continue
      
      m = np.amin(AIC_list['AIC'].values) # Find minimum value in AIC
      l = AIC_list['AIC'].tolist().index(m) # Find index number for lowest AIC
      Min_AIC_list = AIC_list.iloc[l,:]
      
      print("### Min_AIC_list ### \n{}".format(Min_AIC_list))
      
      mod = sm.tsa.statespace.SARIMAX(y,
                                      order=Min_AIC_list['pram'],
                                      seasonal_order=Min_AIC_list['pram_seasonal'],
                                      enforce_stationarity=False,
                                      enforce_invertibility=False)
      
      results = mod.fit()
      
      print(results.summary().tables[1])
      
      results.plot_diagnostics(figsize=(15, 12))
      plt.show()
      
      
    • 0

      Revised code (sorry in the previous code, I was missing one thing.)

      
      warnings.filterwarnings("ignore") # specify to ignore warning messages
      AIC_list = pd.DataFrame({}, columns=['param','param_seasonal','AIC'])
      for param in pdq:
          for param_seasonal in seasonal_pdq:
              try:
                  mod = sm.tsa.statespace.SARIMAX(y,
                                                  order=param,
                                                  seasonal_order=param_seasonal,
                                                  enforce_stationarity=False,
                                                  enforce_invertibility=False)
      
                  results = mod.fit()
      
                  print('ARIMA{}x{} - AIC:{}'.format(param, param_seasonal, results.aic))
                  temp = pd.DataFrame([[ param ,  param_seasonal , results.aic ]], columns=['param','param_seasonal','AIC'])
                  # DataFrame.append was removed in pandas 2.0; pd.concat also returns a new object, so the result must be assigned back.
                  AIC_list = pd.concat([AIC_list, temp], ignore_index=True)
                  del temp
      
              except:
                  continue
      
      
      m = np.amin(AIC_list['AIC'].values) # Find minimum value in AIC
      l = AIC_list['AIC'].tolist().index(m) # Find index number for lowest AIC
      Min_AIC_list = AIC_list.iloc[l,:]
      
      
      
      mod = sm.tsa.statespace.SARIMAX(y,
                                      order=Min_AIC_list['param'],
                                      seasonal_order=Min_AIC_list['param_seasonal'],
                                      enforce_stationarity=False,
                                      enforce_invertibility=False)
      results = mod.fit()
      
      print("### Min_AIC_list ### \n{}".format(Min_AIC_list))
      
      print(results.summary().tables[1])
      
      results.plot_diagnostics(figsize=(15, 12))
      plt.show()
      
      

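Correlation Analysis Between Time Series Data and Events

The paper is “Correlating Events with Time Series for Incident Diagnosis”, a 2014 work from Microsoft that was published at KDD.

The paper proposes an unsupervised, statistical discrimination-based algorithm that can detect whether an event (E) is correlated with a time series (S), and can also detect the monotonic effect on the time series (S), i.e. whether it increases or decreases. In the paper, the chosen events are CPU (Memory, Disk) Intensive Program and Query Alert; the chosen time series are CPU (Memory) Usage and Disk Transfer Rate. A characteristic of these time series is that their values all lie in the range [0,1].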

Table1

The example: the time series is the CPU Usage, and the events are a disk-intensive task and a CPU-intensive task.

Figure1

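Mining the correlation is divided into three parts:

(1) Existence of Dependency: whether a dependency exists between the event (E) and the time series (S).

(2) Temporal Order of Dependency: whether the event (E) caused the change in the time series (S), or the time series (S) led to the occurrence of the event (E).

(3) Monotonic Effect of Dependency: used to determine whether the time series (S) increased or decreased abruptly.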

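Basic concepts:

Given an event sequence (E), the timestamps at which the events occur are T_{E}=(t_{1},\cdots,t_{n}), where n is the number of events. The time series (S) is written as S=(s_{1},\cdots,s_{m}), where m is the length of the time series. The timestamps of the time series form an arithmetic sequence with common difference \tau, so that T_{S}=(t(s_{1}),\cdots,t(s_{m})) and t(s_{i}) =t(s_{i-1})+\tau.

Let e_{i} denote an event, let \ell_{k}^{rear}(S,e_{i}) denote the subseries of S of length k right after the event e_{i}, and let \ell_{k}^{front}(S,e_{i}) denote the subseries of S of length k right before the event e_{i}. If there is a correlation between the event E and the time series S, then

\Gamma^{front}=\{\ell_{k}^{front}(S,e_{i}), i=1,\cdots,n\}

and \Gamma^{rear}=\{\ell_{k}^{rear}(S,e_{i}),i=1,\cdots,n\} should be different.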
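The front and rear subseries defined above can be sketched with plain list slicing. The sketch makes several assumptions that the post does not spell out: events are already mapped to integer positions in the series, the sample at the event position itself is excluded from both windows, and every event lies at least k samples away from both ends. The function names and data are made up.

```python
def front_subseries(series, event_index, k):
    """The k values immediately before the event position."""
    return series[event_index - k:event_index]

def rear_subseries(series, event_index, k):
    """The k values immediately after the event position."""
    return series[event_index + 1:event_index + 1 + k]

# Made-up usage series in [0, 1] and two hypothetical event positions
series = [0.1, 0.1, 0.2, 0.1, 0.9, 0.8, 0.9, 0.1, 0.2, 0.1, 0.8, 0.9, 0.9]
event_positions = [3, 9]
k = 3

# Gamma^front and Gamma^rear: one subseries per event
gamma_front = [front_subseries(series, e, k) for e in event_positions]
gamma_rear = [rear_subseries(series, e, k) for e in event_positions]

print(gamma_front)
print(gamma_rear)
```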

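Definition 1: The event sequence E and the time series S are correlated with S->E if and only if \Gamma^{front}=\{\ell_{k}^{front}(S,e_{i}), i=1,\cdots,n\} is distributed differently from randomly sampled subseries.

Definition 2: The event sequence E and the time series S are correlated with E->S if and only if \Gamma^{rear}=\{\ell_{k}^{rear}(S,e_{i}),i=1,\cdots,n\} is distributed differently from randomly sampled subseries and \Gamma^{front}=\{\ell_{k}^{front}(S,e_{i}), i=1,\cdots,n\} is distributed the same as randomly sampled subseries.

Definition 3: If the event sequence E and the time series S are correlated, then S->E or E->S.

Definition 4: If E->S (or S->E) and the time series increases relative to its level before E, we write E\stackrel{+}{\longrightarrow} S (or S\stackrel{+}{\longrightarrow} E). If E->S (or S->E) and the time series decreases relative to its level before E, we write E\stackrel{-}{\longrightarrow} S (or S\stackrel{-}{\longrightarrow} E).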

方法论:

第一步:最邻近算法(类似kNN)(Nearest Neighbor Method)

在计算时间序列之间距离的时候,使用DTW算法或者DTW-D算法会优于L1或者L2算法。

\Gamma^{front}来做例子,\Gamma^{front}=\{\ell_{k}^{front}(S,e_{i}), i=1,\cdots,n\}\Theta =\{\theta_{1},\cdots,\theta_{\tilde{n}}\} 是随机选择的,Z=\Gamma \cup \Theta,可以标记为Z_{1},\cdots,Z_{p},其中p=n+\tilde{n}Z_{i}=\ell_{k}^{front}(S,e_{i}) when 1\leq i\leq nZ_{i}=\theta_{i-n} when n+1\leq i\leq p。可以使用记号A=A_{1}\cup A_{2},其中A_{1}=\Gamma^{front}A_{2}=\Theta=\{\theta_{1},\cdots,\theta_{\tilde{n}}\}是随机选择的。

对于集合 Ax\in A 而言,NN_{r}(x,A) 表示A-\{x\}中距离x最近的第r个元素,对于两个不相交的集合A_{1}A_{2},可以定义方程:

I_{r}(x,A_{1},A_{2})=1 when x\in A_{i} \&\& NN_{r}(x,A)\in A_{i},

I_{r}(x,A_{1},A_{2})=0 when otherwise.

该方程I_{r}(x,A_{1},A_{2})表示x与x的第r个最近的邻居是否在同一个子集内。

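Define

T_{r,p}=\frac{1}{pr}\sum_{i=1}^{p}\sum_{j=1}^{r}I_{j}(x_{i},A_{1},A_{2}),

where p=n+\tilde{n} is the total number of samples and x_{i} is the i-th element of A. Intuitively, if T_{r,p} is small, the two samples A_{1} and A_{2} are well mixed, which indicates nothing unusual. If T_{r,p} is large, many elements have their neighbors concentrated in their own subset, which means that A_{1} and A_{2} are distinguishable.

According to the literature, when p is sufficiently large, (pr)^{\frac{1}{2}}(T_{r,p}-\mu_{r})/\sigma_{r} follows the standard Gaussian distribution, with parameters \mu_{r}=(\lambda_{1})^{2}+(\lambda_{2})^{2}, \sigma_{r}^{2}=\lambda_{1}\lambda_{2}+4\lambda_{1}^{2}\lambda_{2}^{2},

\lambda_{1}=n/p=n/(n+\tilde{n}), \lambda_{2}=\tilde{n}/(n+\tilde{n})

Following the usual Gaussian test, \Gamma^{front} and \Theta differ significantly when (pr)^{\frac{1}{2}}(T_{r,p}-\mu_{r})/\sigma_{r}>\alpha, where the threshold can be set by the one-sided significance level:

\alpha = 1.96 for P=0.025

\alpha = 2.58 for P=0.005

If \Gamma^{front} deviates significantly from \Theta, then \Gamma^{front} should be flagged as anomalous. Similarly, if \Gamma^{rear} deviates significantly from \Theta, then \Gamma^{rear} should be flagged as anomalous.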
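The nearest-neighbor statistic T_{r,p} and its normalized value can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the Euclidean distance for brevity (the text above prefers DTW), breaks distance ties by index, and the names `nn_two_sample_statistic`, `gamma_front` and `theta` are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn_two_sample_statistic(a1, a2, r, dist=euclidean):
    """Return (T_{r,p}, z) for two lists of equal-length subsequences."""
    pooled = [(x, 0) for x in a1] + [(x, 1) for x in a2]
    p = len(pooled)
    same = 0
    for i, (x, label) in enumerate(pooled):
        # Rank the other p-1 elements by distance to x and keep the r nearest.
        others = sorted((dist(x, y), j) for j, (y, _) in enumerate(pooled) if j != i)
        same += sum(1 for _, j in others[:r] if pooled[j][1] == label)
    t = same / (p * r)
    lam1, lam2 = len(a1) / p, len(a2) / p
    mu = lam1 ** 2 + lam2 ** 2
    sigma = math.sqrt(lam1 * lam2 + 4 * lam1 ** 2 * lam2 ** 2)
    z = math.sqrt(p * r) * (t - mu) / sigma
    return t, z

# Toy data: 20 "front" subsequences near 0 and 20 random-position ones near 10.
gamma_front = [[i * 0.01] * 4 for i in range(20)]
theta = [[10 + i * 0.01] * 4 for i in range(20)]

t, z = nn_two_sample_statistic(gamma_front, theta, r=3)
print(t, z, z > 1.96)
```

With two clearly separated groups every neighbor falls in the element's own subset, so T_{r,p} reaches its maximum and z exceeds the threshold. For well-mixed groups T_{r,p} stays near \mu_{r} and z stays small.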

 

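Step 2: Mining Existence and Temporal Order

Figure 3

If the preceding subsequences \Gamma^{front} deviate significantly from the randomly chosen subsequences \Theta, then the change in the time series led to the event: S\rightarrow E.

If the following subsequences \Gamma^{rear} deviate significantly from the randomly chosen subsequences \Theta, then the event led to the change in the time series: E\rightarrow S.

In Figure 3, the CPU Intensive Program causes the CPU Usage change, and the CPU Usage change causes the SQL Query Alert.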

 

Step 3: Mining Effect Type

We now need to decide whether the time series rose or dropped sharply, which requires introducing t_{score}.

For \Gamma^{front}=\{\ell_{k}^{front}(S,e_{i}), i=1,\cdots,n\} and \Gamma^{rear}=\{\ell_{k}^{rear}(S,e_{i}), i=1,\cdots,n\}, where n is the number of events in E, t_{score} is defined as:

t_{score}=\frac{\mu_{\Gamma^{front}} - \mu_{\Gamma^{rear}}}{\sqrt{\frac{\sigma_{\Gamma^{front}}^{2}+\sigma_{\Gamma^{rear}}^{2}}{n}}},

where \mu and \sigma^{2} denote the mean and variance of each set.

If t_{score}>\alpha, we obtain E\stackrel{-}{\longrightarrow}S or S\stackrel{-}{\longrightarrow} E. If t_{score}<-\alpha, we obtain E\stackrel{+}{\longrightarrow}S or S\stackrel{+}{\longrightarrow} E.

The threshold can be set by the one-sided significance level:

\alpha = 1.96 for P=0.025

\alpha = 2.58 for P=0.005
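A small sketch of the effect-type decision follows. The text does not spell out how the mean and variance of a set of subsequences are computed, so this sketch assumes they are taken over all values pooled from the set, using the population variance. The names `t_score` and `effect_type` are illustrative.

```python
import math
from statistics import mean, pvariance

def t_score(front, rear):
    """front, rear: lists of n subsequences, one pair per event."""
    n = len(front)
    f = [v for seq in front for v in seq]   # pooled values before the events (assumption)
    b = [v for seq in rear for v in seq]    # pooled values after the events (assumption)
    return (mean(f) - mean(b)) / math.sqrt((pvariance(f) + pvariance(b)) / n)

def effect_type(front, rear, alpha=1.96):
    """Return '+' for an increase, '-' for a decrease, None if not significant."""
    t = t_score(front, rear)
    if t > alpha:
        return "-"   # values before the events are higher: the series dropped
    if t < -alpha:
        return "+"   # values after the events are higher: the series rose
    return None

before = [[1.0, 2.0, 1.0, 2.0] for _ in range(5)]
after = [[5.0, 6.0, 5.0, 6.0] for _ in range(5)]

print(effect_type(before, after))   # the series rose after the events
print(effect_type(after, before))   # the series dropped after the events
```

Note the sign convention: the numerator is the front mean minus the rear mean, so a negative score corresponds to an increase.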

 

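Algorithm summary:

algorithm

Lines 5 and 6 compute the correlation: D_{r} is True when \Gamma^{rear} is anomalous and False otherwise; D_{f} is True when \Gamma^{front} is anomalous and False otherwise.

Lines 7-13 handle the case E\rightarrow S: \Gamma^{rear} is anomalous while \Gamma^{front} is normal, so the event caused the change in the time series. These lines evaluate the range of t_{score} to decide whether the change is a significant increase or a significant decrease.

Lines 14-20 handle the case S\rightarrow E: \Gamma^{front} is anomalous, so the change in the time series led to the event. These lines likewise evaluate the range of t_{score} to decide whether the change is a significant increase or a significant decrease.

Parameters:

The subsequence length k can be set to the length at which the first peak is reached.

The number of nearest neighbors is r=\ln(p), where p is the total number of samples.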

Figure5

Figure6

Other algorithms:

(1)Pearson Correlation

(2)J-Measure Correlation

 

 

How to Convert a Time Series to a Supervised Learning Problem in Python

https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/

Machine learning methods like deep learning can be used for time series forecasting.

Before machine learning can be used, time series forecasting problems must be re-framed as supervised learning problems: from a single sequence to pairs of input and output sequences.

In this tutorial, you will discover how to transform univariate and multivariate time series forecasting problems into supervised learning problems for use with machine learning algorithms.

After completing this tutorial, you will know:

  • How to develop a function to transform a time series dataset into a supervised learning dataset.
  • How to transform univariate time series data for machine learning.
  • How to transform multivariate time series data for machine learning.

Let’s get started.


Time Series vs Supervised Learning

Before we get started, let’s take a moment to better understand the form of time series and supervised learning data.

A time series is a sequence of numbers that are ordered by a time index. This can be thought of as a list or column of ordered values.

For example:

A supervised learning problem is comprised of input patterns (X) and output patterns (y), such that an algorithm can learn how to predict the output patterns from the input patterns.

For example:
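The listings that illustrated these two forms are not present in this copy. As a stand-in, here is a minimal sketch with made-up values: an ordered list of observations, and the same values re-paired so that each observation is the input (X) for the next one (y).

```python
# A time series: values ordered by a time index (values are invented).
series = [10, 20, 30, 40, 50]

# The same data as supervised learning pairs: X is the value at t, y the value at t+1.
pairs = [(series[i], series[i + 1]) for i in range(len(series) - 1)]

print("time series:", series)
for x, y in pairs:
    print("X =", x, "-> y =", y)
```

The last observation has no successor, so a series of five values yields four input/output pairs.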

For more on this topic, see the post:

Pandas shift() Function

A key function to help transform time series data into a supervised learning problem is the Pandas shift() function.

Given a DataFrame, the shift() function can be used to create copies of columns that are pushed forward (rows of NaN values added to the front) or pulled back (rows of NaN values added to the end).

This is the behavior required to create columns of lag observations as well as columns of forecast observations for a time series dataset in a supervised learning format.

Let’s look at some examples of the shift function in action.

We can define a mock time series dataset as a sequence of 10 numbers, in this case a single column in a DataFrame as follows:
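The original listing is missing from this copy. A minimal sketch of such a mock dataset, assuming the single column is named `t`, could look like this:

```python
import pandas as pd

# Ten observations 0..9 in a single column named "t" (the column name is an assumption).
df = pd.DataFrame({"t": list(range(10))})
print(df)
```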

Running the example prints the time series data with the row indices for each observation.

We can shift all the observations down by one time step by inserting one new row at the top. Because the new row has no data, we can use NaN to represent “no data”.

The shift function can do this for us and we can insert this shifted column next to our original series.
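A sketch of that step, rebuilt here because the listing is missing from this copy (the column names `t` and `t-1` are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"t": list(range(10))})

# shift(1) pushes every observation down one row; the first row becomes NaN.
df["t-1"] = df["t"].shift(1)
print(df)
```

The shifted column is converted to floating point because NaN cannot be stored in an integer column.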

Running the example gives us two columns in the dataset. The first with the original observations and a new shifted column.

We can see that shifting the series forward one time step gives us a primitive supervised learning problem, although with X and y in the wrong order. Ignore the column of row labels. The first row would have to be discarded because of the NaN value. The second row shows the input value of 0.0 in the second column (input or X) and the value of 1 in the first column (output or y).

If we repeat this process with shifts of 2, 3, and more, we can create long input sequences (X) that can be used to forecast an output value (y).

The shift operator can also accept a negative integer value. This has the effect of pulling the observations up by inserting new rows at the end. Below is an example:
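A sketch of the negative shift, rebuilt here because the listing is missing from this copy (the column names `t` and `t+1` are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"t": list(range(10))})

# shift(-1) pulls every observation up one row; the last row becomes NaN.
df["t+1"] = df["t"].shift(-1)
print(df)
```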

Running the example shows a new column with a NaN value as the last value.

We can see that the original column can be taken as the input (X) and the shifted column as the output value (y). That is, the input value of 0 can be used to forecast the output value of 1.

Technically, in time series forecasting terminology the current time (t) and future times (t+1, t+n) are forecast times and past observations (t-1, t-n) are used to make forecasts.

We can see how positive and negative shifts can be used to create a new DataFrame from a time series with sequences of input and output patterns for a supervised learning problem.

This permits not only classical X -> y prediction, but also X -> Y where both input and output can be sequences.

Further, the shift function also works on so-called multivariate time series problems. That is where instead of having one set of observations for a time series, we have multiple (e.g. temperature and pressure). All variates in the time series can be shifted forward or backward to create multivariate input and output sequences. We will explore this more later in the tutorial.

The series_to_supervised() Function

We can use the shift() function in Pandas to automatically create new framings of time series problems given the desired length of input and output sequences.

This would be a useful tool as it would allow us to explore different framings of a time series problem with machine learning algorithms to see which might result in better performing models.

In this section, we will define a new Python function named series_to_supervised() that takes a univariate or multivariate time series and frames it as a supervised learning dataset.

The function takes four arguments:

  • data: Sequence of observations as a list or 2D NumPy array. Required.
  • n_in: Number of lag observations as input (X). Values may be between [1..len(data)]. Optional. Defaults to 1.
  • n_out: Number of observations as output (y). Values may be between [0..len(data)-1]. Optional. Defaults to 1.
  • dropnan: Boolean whether or not to drop rows with NaN values. Optional. Defaults to True.

The function returns a single value:

  • return: Pandas DataFrame of series framed for supervised learning.

The new dataset is constructed as a DataFrame, with each column suitably named both by variable number and time step. This allows you to design a variety of different time step sequence type forecasting problems from a given univariate or multivariate time series.

Once the DataFrame is returned, you can decide how to split the rows of the returned DataFrame into X and y components for supervised learning any way you wish.

The function is defined with default parameters so that if you call it with just your data, it will construct a DataFrame with t-1 as X and t as y.

The function is confirmed to be compatible with Python 2 and Python 3.

The complete function is listed below, including function comments.
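The listing itself is missing from this copy. The sketch below is a reconstruction from the argument description above, not the author's original code: it follows the stated arguments, defaults and column naming (`var1(t-1)`, `var1(t)`, `var1(t+1)`), but the body is an assumption, and it uses f-strings, so it runs on Python 3 only.

```python
import pandas as pd

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Frame a univariate or multivariate series as a supervised learning dataset."""
    df = pd.DataFrame(data)
    n_vars = df.shape[1]
    cols, names = [], []
    # Input sequence: (t-n_in), ..., (t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [f"var{j + 1}(t-{i})" for j in range(n_vars)]
    # Forecast sequence: (t), (t+1), ..., (t+n_out-1)
    for i in range(n_out):
        cols.append(df.shift(-i))
        suffix = "(t)" if i == 0 else f"(t+{i})"
        names += [f"var{j + 1}{suffix}" for j in range(n_vars)]
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # Rows at the edges contain NaN because the shifted values do not exist.
    return agg.dropna() if dropnan else agg

framed = series_to_supervised(list(range(10)))
print(framed)
```

Called with only the data, it produces two columns, `var1(t-1)` as X and `var1(t)` as y, and drops the first row, whose lagged value is NaN.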

Can you see obvious ways to make the function more robust or more readable?
Please let me know in the comments below.

Now that we have the whole function, we can explore how it may be used.

One-Step Univariate Forecasting

It is standard practice in time series forecasting to use lagged observations (e.g. t-1) as input variables to forecast the current time step (t).

This is called one-step forecasting.

The example below demonstrates a one lag time step (t-1) to predict the current time step (t).

Running the example prints the output of the reframed time series.

We can see that the observations are named “var1” and that the input observation is suitably named (t-1) and the output time step is named (t).

We can also see that rows with NaN values have been automatically removed from the DataFrame.

We can repeat this example with an input sequence of arbitrary length, such as 3. This can be done by specifying the length of the input sequence as an argument; for example:

The complete example is listed below.

Again, running the example prints the reframed series. We can see that the input sequence is in the correct left-to-right order with the output variable to be predicted on the far right.

Multi-Step or Sequence Forecasting

A different type of forecasting problem is using past observations to forecast a sequence of future observations.

This may be called sequence forecasting or multi-step forecasting.

We can frame a time series for sequence forecasting by specifying another argument. For example, we could frame a forecast problem with an input sequence of 2 past observations to forecast 2 future observations as follows:

The complete example is listed below:
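The listing is missing from this copy. To keep the snippet standalone, the sketch below builds the same framing (2 past observations in, 2 observations out) directly with shift() instead of calling series_to_supervised(). The column names follow the naming scheme described earlier.

```python
import pandas as pd

s = pd.Series(range(10))

# Two lagged inputs, then the current and the next observation as outputs.
frame = pd.concat([s.shift(2), s.shift(1), s, s.shift(-1)], axis=1)
frame.columns = ["var1(t-2)", "var1(t-1)", "var1(t)", "var1(t+1)"]
frame = frame.dropna()
print(frame)
```

Two rows at the top and one at the bottom are dropped, because their lagged or future values do not exist.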

Running the example shows the differentiation of input (t-n) and output (t+n) variables with the current observation (t) considered an output.

Multivariate Forecasting

Another important type of time series is called multivariate time series.

This is where we may have observations of multiple different measures and an interest in forecasting one or more of them.

For example, we may have two sets of time series observations obs1 and obs2 and we wish to forecast one or both of these.

We can call series_to_supervised() in exactly the same way.

For example:
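The listing is missing from this copy. As a standalone illustration, the sketch below builds the equivalent one-step multivariate framing directly with shift(). The two observation columns and their values are invented for the example.

```python
import pandas as pd

# Two observed variables (values are made up for illustration).
raw = pd.DataFrame({"var1": list(range(10)), "var2": list(range(50, 60))})

past = raw.shift(1).add_suffix("(t-1)")   # both variables one step back
now = raw.add_suffix("(t)")               # both variables at the current step
frame = pd.concat([past, now], axis=1).dropna()
print(frame)
```

Shifting the whole DataFrame moves all variables together, so every row keeps the two variables aligned in time.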

Running the example prints the new framing of the data, showing an input pattern with one time step for both variables and an output pattern of one time step for both variables.

Again, depending on the specifics of the problem, the division of columns into X and Y components can be chosen arbitrarily, such as if the current observation of var1 was also provided as input and only var2 was to be predicted.

You can see how this may be easily used for sequence forecasting with multivariate time series by specifying the length of the input and output sequences as above.

For example, below is an example of a reframing with 1 time step as input and 2 time steps as forecast sequence.

Running the example shows the large reframed DataFrame.

Experiment with your own dataset and try multiple different framings to see what works best.

Summary

In this tutorial, you discovered how to reframe time series datasets as supervised learning problems with Python.

Specifically, you learned:

  • About the Pandas shift() function and how it can be used to automatically define supervised learning datasets from time series data.
  • How to reframe a univariate time series into one-step and multi-step supervised learning problems.
  • How to reframe multivariate time series into one-step and multi-step supervised learning problems.


The Mueen-Keogh (MK) Algorithm

Paper: Exact Discovery of Time Series Motifs

 

Speeded up Brute Force Motif Discovery:

GitHub: https://github.com/saifuddin778/mkalgo

However, one line looks odd: it should be Dist_{I(j+offset)}-Dist_{I(j)} < best-so-far rather than Dist_{I(j)}-Dist_{I(j+offset)} < best-so-far, because Dist_{I(j)} is sorted in increasing order and best-so-far > 0.

Speeded up brute force motif discovery
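The pruning test discussed above can be sketched as follows. This is a simplified single-reference version written for illustration, not the code from the paper or from either repository: it uses the Euclidean distance, picks the first series as the reference, and the function name is hypothetical. The pruning line uses Dist_{I(j+offset)}-Dist_{I(j)}, which is non-negative because the distances are sorted in increasing order, and the inner loop runs over j = 1 to m-offset.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mk_motif_single_reference(series, ref_index=0):
    """Return (best_distance, (i, j)) for the closest pair in `series`."""
    m = len(series)
    ref = series[ref_index]
    dist = [euclidean(ref, s) for s in series]        # Dist: distance to the reference
    order = sorted(range(m), key=lambda i: dist[i])   # I: indices sorted by Dist
    best, pair = math.inf, None
    offset, abandon = 0, False
    while not abandon and offset < m - 1:
        offset += 1
        abandon = True
        for j in range(m - offset):
            a, b = order[j], order[j + offset]
            # Triangle inequality: dist[b] - dist[a] is a lower bound on d(a, b).
            if dist[b] - dist[a] < best:
                abandon = False
                d = euclidean(series[a], series[b])
                if d < best:
                    best, pair = d, (min(a, b), max(a, b))
    return best, pair

rng = random.Random(1)
data = [[rng.random() for _ in range(8)] for _ in range(30)]
best, pair = mk_motif_single_reference(data)
print(best, pair)
```

Once every pair at some offset is pruned, all larger offsets have even larger lower bounds, so the search can stop early and still return the exact closest pair.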

Generalization to multiple reference points:

https://github.com/nicholasg3/motif-mining/tree/95bbb05ac5d0f9e90134a67a789ea7e607f22cea

Note:

The loop should be "for j = 1 to m-offset" rather than "for j = 1 to R".

nicholasg3 MK_Motif_discovery

MK Motif Discovery

Time Series Clustering with Dynamic Time Warping (DTW)

https://github.com/goodmattg/wikipedia_kaggle

 

 

 

 

 

Don't Be a Low-Quality Hard Worker: Reflections

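Someone once published an article online titled 《为什么绝大多数人都是“低品质的勤奋者”》 ("Why Most People Are 'Low-Quality Hard Workers'"). Rereading it a year later, in light of my own experience of doing a PhD and then working, I want to summarize a few thoughts here in the hope that they will serve me as a reminder.

A PhD student usually has more freedom with time than someone with a job, much like a freelancer. That leaves time to do what you want, read the books you like, and study whatever knowledge and skills you wish to learn. But truly setting out on a research path never means learning every piece of knowledge first and only then starting research. You keep learning and strengthening yourself along the way, and you grow through the research itself. Apart from preparing some necessary theoretical foundations early on, you have to deliberately give up some knowledge during research. Doing research does not require knowing every topic; research is done, not studied. Along the way you only need to master certain topics, and the rest are problems that you run into and then work through on your own. Thinking "I will start research once I have finished reading these books" is in fact unwise.

In industry, all sorts of miscellaneous matters come up all day long, and someone has to attend to them and resolve them. Yet in day-to-day work, building up technical knowledge and surveying the field is also important work: it bears on each employee's personal growth and on the overall strength of the team. Someone always has to do this exploratory work and find out whether a technology has real value for the team. In doing so, different people work in completely different ways. Some like to start working right away and judge whether their choice is right through endless trial and error. Others prefer to survey first: they get a clear picture of the best technology inside and outside the company, compare it with their own environment, and only then judge whether their chosen approach is reasonable. Beyond that, some people keep wrestling with minor details, as if the whole framework could be dug out of them, and try to find the key point starting from those details. Others tend to start from the overall framework, use it to explain whether an approach is reasonable, and work backward from the whole to the meaning of those earlier details.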

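Anyone who has read 天龙八部 (Demi-Gods and Semi-Devils) knows that when Jiumozhi (鸠摩智) went to the Shaolin Temple to issue a challenge, he displayed the seventy-two Shaolin supreme skills he had learned in front of the senior monks, and they were all astonished. Xuzhu (虚竹), watching from the side, told the senior monks: "The moves Jiumozhi is performing are Shaolin supreme skills, but in essence they are driven by the Xiao Wuxiang Gong (小无相功). The moves are the same, but the inner power is Taoist." Why did the senior Shaolin monks fail to see the key to Jiumozhi's martial arts? Because they had spent their training clutching martial arts manuals, and after a lifetime of practice had reached at most 13 of the supreme skills. Judging from Jiumozhi's own training, the key to martial arts lies not in the manuals but in the Buddhist scriptures. Without finding the key scripture and the method of channeling power, no matter how many years one trains with the manuals, an essential gap remains. The poet Lu You (陆游) likewise taught his descendants: "汝果欲学诗,功夫在诗外" ("If you truly want to learn poetry, the effort lies outside poetry"). That is, if you want to write genuinely good poetry, you must put your effort into life and experience its joys and hardships, rather than reading a volume of poems over and over.

During a PhD, choosing the thesis topic is generally an extremely important matter. If the topic is too hard, the student will not solve it within those few years; if it is too easy, solving it will not mean much. PhD students are therefore very careful when choosing a topic, taking half a year, a year, or even longer to find one that suits them; otherwise the following years will be very painful. In research, think carefully before starting on any approach, because once the work is under way you probably will not turn back in the short term. If you choose a badly wrong path, you will go further and further along it and be less and less likely to reach a correct result.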

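To avoid becoming a low-quality hard worker, think carefully and plan thoroughly before doing something, in both work and daily life. Important but not urgent things should not be handled by reflex. Generally, what we do on first reaction is both important and urgent, while what we start only after deep thought is important but not urgent. In personal growth, what changes a person's fate is usually the important but not urgent work. People, however, tend to be lazy and reluctant to think deeply; most commonly they decide their next plan by reflex. To keep growing, you must deliberately force yourself to think deeply and to spend some time each day on those important but not urgent things.

In study and in work alike, you cannot cover everything, nor can you personally attend to every single matter. You have to focus on the key things and let go of the less important ones. Before doing anything, consider whether it will really create value and whether it benefits the team and yourself, rather than starting just because, after a moment's thought, it seems doable. Blindly working hard at something only looks diligent; truly escaping the predicament of the low-quality hard worker still takes a long way to go.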

Opprentice: Towards Practical and Automatic Anomaly Detection Through Machine Learning

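This post covers an exploration of making operations systems intelligent. The paper is by Professor Dan Pei (裴丹) of Tsinghua University and is titled "Opprentice: Towards Practical and Automatic Anomaly Detection Through Machine Learning". Its goal is automated, machine-learning-based anomaly detection for KPIs (Key Performance Indicators).

The name Opprentice comes from "Operator's Apprentice". The approach relies on operators' domain experience to label anomalous data, uses a range of time-series algorithms to extract features, and uses supervised learning models (for example Random Forest, GBDT, XGBoost) for offline training and online prediction. The paper reports that Opprentice, analyzing and predicting on more than a month of historical data, already achieves precision >= 0.66 and recall >= 0.66.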

1. Introduction to Opprentice

Challenges faced by the system:

Definition Challenges: it is difficult to precisely define anomalies in reality.

Detector Challenges: In order to provide a reasonable detection accuracy, selecting the most suitable detector requires both the algorithm expertise and the domain knowledge about the given service KPI (Key Performance Indicators). To address the definition challenge and the detector challenge, we advocate for using supervised machine learning techniques.

Advantages of the system:

(i) Opprentice is the first detection framework to apply machine learning to acquiring realistic anomaly definitions and automatically combining and tuning diverse detectors to satisfy operators’ accuracy preference.

(ii) Opprentice addresses a few challenges in applying machine learning to such a problem: labeling overhead, infrequent anomalies, class imbalance, and irrelevant and redundant features.

(iii) Opprentice can automatically satisfy or approximate a reasonable accuracy preference (recall>=0.66 & precision>=0.66).

2. Background:

KPIs and KPI Anomalies:

KPIs: KPI data are time series in the format (timestamp, value). The paper focuses on three KPIs: search page views (PV), which is the number of successfully served queries; the number of slow responses of search data centers (#SR); and the 80th percentile of search response time (SRT).

Anomalies: KPI time series data can also present several unexpected patterns (e.g. jitters, slow ramp ups, sudden spikes and dips) in different severity levels, such as a sudden drop by 20% or 50%.

OpprenticeFigure1

Problem and goals:

Recall: # of true anomalous points detected / # of the anomalous points

Precision: # of true anomalous points detected / # of anomalous points detected

FDR (false discovery rate): # of false anomalous points detected / # of anomalous points detected = 1 - precision
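These three definitions translate directly into code. The sketch below assumes point-wise 0/1 labels, one per time slot. The function name and the toy labels are illustrative.

```python
def detection_metrics(truth, predicted):
    """truth, predicted: equal-length sequences of 0/1 flags, one per data point."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)   # true anomalies detected
    n_true = sum(truth)        # number of anomalous points
    n_pred = sum(predicted)    # number of points flagged as anomalous
    recall = tp / n_true if n_true else 0.0
    precision = tp / n_pred if n_pred else 0.0
    fdr = (n_pred - tp) / n_pred if n_pred else 0.0   # equals 1 - precision
    return {"recall": recall, "precision": precision, "fdr": fdr}

truth     = [0, 1, 1, 0, 1, 0, 0, 0]
predicted = [0, 1, 0, 0, 1, 1, 0, 0]
metrics = detection_metrics(truth, predicted)
print(metrics)
```

In this toy case two of the three real anomalies are found and two of the three alarms are correct, so recall and precision are both 2/3 and the FDR is 1/3.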

The quantitative goal of Opprentice is precision>=0.66 and recall>=0.66.

The qualitative goal of Opprentice is to be automatic enough that operators need not be involved in selecting and combining suitable detectors, or in tuning them.

3. Opprentice Overview:

OpprenticeFigure2

(i) Opprentice approaches the above problem through supervised machine learning.

(ii) The features of the data are the outputs of the basic detectors.

(iii) The labels of the data come from operators' experience (manual labeling).

(iv) Addressing Challenges in Machine Learning:

(1) Label Overhead: Opprentice has a dedicated labeling tool with a simple and convenient interaction interface.

(2) Incomplete Anomaly Cases: the labeled data gives incomplete information about the possible anomalies.

(3) Class Imbalance Problem: anomalous and normal samples are heavily imbalanced.

(4) Irrelevant and Redundant Features.

4. Opprentice’s Design:

Architecture: Operators label the data, and numerous detector functions act as feature extractors for the data.

OpprenticeFigure3

Label Tool:

Operators label the data manually, using the mouse in the labeling software.

OpprenticeFigure4

Detectors:

(i) Detectors As Feature Extractors:

For each parameterized detector, we sample its parameters to obtain several fixed detectors. A detector with specific sampled parameters is called a (detector) configuration. A configuration thus acts as a feature extractor:

data point + configuration (detector + sampled parameters) -> feature

(ii) Choosing Detectors:

Opprentice can find suitable detectors among a broad selection and achieve a relatively high accuracy. It implements 14 widely used detectors:

OpprenticeTable3

"Diff": it simply measures anomaly severity using the differences between the current point and the point of the last slot, the point of the last day, and the point of the last week.

"MA of diff": it measures severity using the moving average of the difference between the current point and the point of the last slot.

The other 12 detectors come from previous literature. Among these detectors, there are two variants that use the MAD (Median Absolute Deviation) around the median, instead of the standard deviation around the mean, to measure anomaly severity.

(iii) Sampling Parameters:

There are two methods to sample the parameters of detectors.

(1) The first is to sweep the parameter space. For example, for EWMA we can choose \alpha \in \{0.1,0.3,0.5,0.7,0.9\} to obtain 5 typical features. Holt-Winters has three [0,1]-valued parameters \alpha,\beta,\gamma; choosing \alpha,\beta,\gamma \in \{0.2,0.4,0.6,0.8\} gives 4^3 = 64 features.

(2) The second is to estimate the "best" parameters from the data, for detectors such as ARIMA, and generate only one set of parameters, that is, one configuration, for each such detector.
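A sketch of the parameter sweep follows, showing how each (detector, parameters) configuration yields one feature column. The severity definition used here, the absolute gap between a point and the EWMA prediction built from earlier points, is an assumption for illustration, as are the function name and the toy KPI values. Only the parameter grids come from the text.

```python
from itertools import product

def ewma_severity(values, alpha):
    """Severity per point: |value - EWMA prediction from the earlier points| (assumed form)."""
    pred = values[0]
    out = [0.0]
    for v in values[1:]:
        out.append(abs(v - pred))
        pred = alpha * v + (1 - alpha) * pred   # update the smoothed prediction
    return out

# EWMA: 5 sampled values of alpha give 5 configurations.
ewma_alphas = [0.1, 0.3, 0.5, 0.7, 0.9]
# Holt-Winters: 4 values for each of alpha, beta, gamma give 4^3 configurations.
hw_grid = list(product([0.2, 0.4, 0.6, 0.8], repeat=3))

kpi = [10, 10, 11, 10, 30, 10, 10]   # toy KPI with a spike at index 4
features = {f"ewma(alpha={a})": ewma_severity(kpi, a) for a in ewma_alphas}

print(len(features), "EWMA features,", len(hw_grid), "Holt-Winters configurations")
for name, column in features.items():
    print(name, [round(s, 2) for s in column])
```

Every configuration assigns its highest severity to the spike, but the values differ, which is why the classifier can benefit from seeing several configurations of the same detector.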

Supervised Machine Learning Models:

Candidate models include decision trees, logistic regression, linear support vector machines (SVMs), and naive Bayes. Below is a simple example of a decision tree.

OpprenticeFigure5

Random Forest is an ensemble classifier using many decision trees. Its main principle is that a group of weak learners (e.g. individual decision trees) can together form a strong learner. To grow different trees, a random forest adds some elements of randomness. First, each tree is trained on subsets sampled from the original training set. Second, instead of evaluating all the features at each level, the trees only consider a random subset of the features each time. The random forest combines those trees by majority vote. These properties of randomness and ensembling make a random forest more robust to noise and better at handling irrelevant and redundant features than a single decision tree.

Configuring cThlds (computing and predicting the classification thresholds):

(i) methods to select proper cThlds: offline part

OpprenticeFigure6

We need to configure cThlds rather than use the default one (e.g. 0.5) for two reasons.

(1) First, when faced with imbalanced data (anomalous data points are much less frequent than normal ones in data sets), machine learning algorithms typically fail to identify the anomalies (low recall) if using the default cThld (e.g. 0.5).

(2) Second, operators have their own preference regarding the precision and recall of anomaly detection.

The metrics for choosing a cThld from the precision-recall trade-off are:

(1) F-Score: F-Score = 2*precision*recall/(precision+recall).

(2) SD(1,1): it selects the point with the shortest Euclidean distance to the upper right corner where the precision and the recall are both perfect.

(3) PC-Score: (the metric the paper uses to select a suitable threshold)

If r>=R and p>=P, then PC-Score(r,p)=2*r*p/(r+p) + 1; else PC-Score(r,p)=2*r*p/(r+p). Here, R and P are from the operators’ preference “recall>=R and precision>=P”. Since the F-Score is at most 1, any point that satisfies the preference scores above 1 and therefore outranks every point that does not. We choose the cThld corresponding to the point with the largest PC-Score.
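The difference between the F-Score and the PC-Score is easiest to see on a small example. In the sketch below the candidate (cThld, recall, precision) triples are invented for illustration and the function names are hypothetical. The scoring rule is the one stated above.

```python
def f_score(r, p):
    return 2 * r * p / (r + p) if (r + p) > 0 else 0.0

def pc_score(r, p, R=0.66, P=0.66):
    # Points that satisfy the operators' preference get a bonus of 1.
    bonus = 1.0 if (r >= R and p >= P) else 0.0
    return f_score(r, p) + bonus

def best_cthld(candidates, R=0.66, P=0.66):
    """candidates: list of (cThld, recall, precision) measured on labeled data."""
    return max(candidates, key=lambda c: pc_score(c[1], c[2], R, P))[0]

# Hypothetical points on a precision-recall curve.
candidates = [
    (0.2, 0.95, 0.50),   # high recall, precision below the preference
    (0.4, 0.70, 0.68),   # satisfies recall >= 0.66 and precision >= 0.66
    (0.6, 0.60, 0.85),   # best plain F-Score, but recall below the preference
]

by_f = max(candidates, key=lambda c: f_score(c[1], c[2]))[0]
by_pc = best_cthld(candidates)
print("F-Score picks cThld", by_f, "; PC-Score picks cThld", by_pc)
```

The plain F-Score prefers the third point even though its recall misses the preference, while the PC-Score picks the only point that meets both requirements.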

(ii) EWMA Based cThld Prediction:

OpprenticeFigure7

In online detection, we need to predict cThlds for detecting future data.

We use EWMA to predict the cThld of the i-th week (or the i-th test set) based on the historical best cThlds. Specifically, EWMA works as follows:

If i=1, then cThld_{1}^{p} is initialized by 5-fold cross-validation.

If i>1, then cThld_{i}^{p}=\alpha\cdot cThld_{i-1}^{b}+(1-\alpha)\cdot cThld_{i-1}^{p}, where cThld_{i-1}^{b} is the best cThld of the (i-1)-th week. cThld_{i}^{p} is the predicted cThld of the i-th week, and also the one used for detecting the i-th week data. \alpha\in [0,1] is the smoothing constant.

As \alpha increases, EWMA gives the most recent best cThlds more influence on the prediction. The paper uses \alpha=0.8.
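The recurrence above can be sketched as a short function. The initial value stands in for the 5-fold cross-validation result, which is not computed here, and the weekly best cThlds are invented numbers. The function name is illustrative.

```python
def predict_cthlds(best_cthlds, initial, alpha=0.8):
    """Return predicted cThlds for weeks 1..len(best_cthlds)+1.

    best_cthlds[i-1] is the best cThld found afterwards for week i;
    initial is the week-1 prediction (from cross-validation in the paper).
    """
    preds = [initial]
    for best in best_cthlds:
        # cThld_i^p = alpha * cThld_{i-1}^b + (1 - alpha) * cThld_{i-1}^p
        preds.append(alpha * best + (1 - alpha) * preds[-1])
    return preds

# Hypothetical values: week-1 prediction 0.5, best cThlds 0.4 and 0.3 for weeks 1 and 2.
preds = predict_cthlds([0.4, 0.3], initial=0.5)
print([round(p, 3) for p in preds])
```

With \alpha=0.8 the prediction for a week is dominated by the best cThld of the week before, with a small carry-over from the older predictions.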

5. Evaluation

In the figure below, red denotes the Opprentice method and black denotes the other methods.

OpprenticeFigure8

Opprentice has 14 detectors with about 9500 lines of Python, R and C++ code. The machine learning block is based on the scikit-learn library.

In the evaluation, random forest performs better than decision trees, logistic regression, linear support vector machines (SVMs), and naive Bayes.

 
