Unsupervised Machine Learning Approaches for Outlier Detection in Time Series

Example Python Code Included!

In this post, I cover some of my favorite methods for detecting outliers in time series data. There are many different approaches for detecting anomalous data points; for the sake of brevity, I only focus on unsupervised machine learning approaches in this post.

The anomaly/outlier detection algorithms covered in this article include:

  1. Low-pass filters: taking the centered rolling average of a time series, and removing anomalies based on Z-score
  2. Isolation forests
  3. Seasonal-extreme studentized deviate (S-ESD) algorithm
  4. One class support vector machines (SVM’s)

So what is an ‘anomaly’ in a time series, and why do we care about about detecting anomalies in time series sequences?

I deal with time series data a lot, and it’s not uncommon for data sets to experience unexpected drops or spikes, flat lines, or phase shifts. Each of these situations qualifies as an ‘anomaly’–something out of the ordinary when compared to the behavior of the sequence as a whole.

Detecting anomalies in a time series is important for a variety of reasons. Sometimes anomalies are indicative of certain patterns, such as spikes or drops in the trend during a specific day or time. Other times, anomalies are ‘false’ readings, which happen because of an instrumentation or meter error. Often, we want to isolate anomalies from time series data because they skew ‘average’ behaviors–average rate of change, average value, average distribution, etc.

All of the algorithms presented in this article have a binary output–a data point in a time series is either anomalous or it isn’t.

Time Series Example

In this article, we compare the results of several different anomaly detection methods on a single time series. The time series that we will be using is the daily time series for gasoline prices on the U.S. Gulf Coast, which is retrieved using the Energy Information Administration (EIA) API.

For more background on using the EIA’s free API to retrieve energy-related data in Python, check out this tutorial.

We pull and visualize the time series using the following Python code:

import pandas as pd
import matplotlib.pyplot as plt
import eia
def retrieve_time_series(api, series_ID):
    """
    Return the time series dataframe, based on API and unique Series ID
    Arguments:
        api: API that we're connected to
        series_ID: string. Name of the series that we want to pull from the EIA API
    Outputs:
        df: Pandas dataframe of time series
    """
    #Retrieve Data By Series ID 
    series_search = api.data_by_series(series=series_ID)
    ##Create a pandas dataframe from the retrieved time series
    df = pd.DataFrame(series_search)
    return df
def scatterplot(x_data, y_data, x_label, y_label, title):
    """
    Arguments:
        x_data: Series. Desired x-axis for scatterplot.
        y_data: Series. Desired y-axis for scatterplot.
        x_label: String. Label for x-axis.
        y_label: String. Label for y-axis.
        title: String. Title of plot
    Outputs:
        Scatterplot in console.
    """
    fig, ax = plt.subplots()
    ax.scatter(x_data, y_data, s = 30, color = '#539caf', alpha = 0.75)
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    fig.autofmt_xdate()
    
#####EXECUTE IN MAIN BLOCK
#Create EIA API using your specific API key
api_key = 'YOUR API KEY HERE'
api = eia.API(api_key)
    
#Pull the oil WTI price data
series_ID='PET.EER_EPMRU_PF4_RGC_DPG.D'
gasoline_price_df=retrieve_time_series(api, series_ID)
gasoline_price_df.reset_index(level=0, inplace=True)
#Rename the columns for easer analysis
gasoline_price_df.rename(columns={'index':'Date',
            gasoline_price_df.columns[1]:'Gasoline_Price'}, 
            inplace=True)
     
#Visualize anomalies using matplotlib function
scatterplot(gasoline_price_df['Date'],
                gasoline_price_df['Gasoline_Price'], 
                'Date', 
                'Gasoline Price (Dollars Per Gallon)', 
                'US Gulf Coast Gasoline Price Time Series: 2014-Present')
    

Just to spice things up, we pepper the time series with a few outliers. Hopefully, these points will be picked up by the anomaly detection algorithms.

#Add in a couple anomalous data points for detection by the algorithm
anomaly_dictionary={80: 3.1, 
                    200: 3,
                    333: 1,
                    600: 2.6,
                    710: 2.1,
                    890: 2.3,
                    1100: 1,
                    1211: 2.6,
                    1309: 2.3}
#Set default for Artificially_Generated_Anomaly column to 0
gasoline_price_df.loc[:,'Artificially_Generated_Anomaly']=0
#Create fake anomaly values based on anomaly_dictionary
for index, anomaly_value in anomaly_dictionary.items():
    gasoline_price_df.loc[index,'Gasoline_Price']=anomaly_value
    #Create a column to indicate Anomalies
    gasoline_price_df.loc[index,'Artificially_Generated_Anomaly']=1
    
#Re-visualize data with artificially generated anomalies
scatterplot_with_color_coding(gasoline_price_df['Date'],
                     gasoline_price_df['Gasoline_Price'], 
                     gasoline_price_df['Artificially_Generated_Anomaly'],
                     'Date', 
                     'Gasoline Price (Dollars Per Gallon)', 
                     'Gasoline Prices, Color-Coded on Real/Artificially Generated Data Points')   
Gasoline Price Time Series, with artificially generated anomalies in yellow

Now that we’ve selected our time series and added a few outliers, let’s test our anomaly detection algorithms on it.

Simple Low-Pass Filter: Taking a Rolling Average, and Removing Anomalies based on Z-Score

One of the simplest (but still highly effective) methods that I use for detecting anomalies goes as follows:

  1. Implementing a centered rolling average of the time series data
  2. Calculating the individual Z-score of each data point in the time series, compared to the rolling average
  3. Removing data points that are more than a certain number of standard deviations away from the rolling average (normally 2 to 3 standard deviations away, but depends on the data behavior)

Let’s try this method with an example.

Code Example

As mentioned above, we use the time series for gasoline prices on the US Gulf Coast as our example time series set.

It is important to note that the rolling average that we implement for our low-pass filter is centered. This means that a rolling average with x-length is the mean of x/2 data points before, and x/2 data points after. For example, if we implement a 60-point rolling average at value t, then we find the mean of the data points ranging between (t-30) and (t+30). Using a centered rolling average helps to account for large shifts in the time series from both ends.

We generate the low-pass filter and visualize anomalies using the following code:

def low_pass_filter_anomaly_detection(df, 
                                      column_name, 
                                      number_of_stdevs_away_from_mean):
    """
    Implement a low-pass filter to detect anomalies in a time series, and save the filter outputs 
    (True/False) to a new column in the dataframe.
    Arguments:
        df: Pandas dataframe
        column_name: string. Name of the column that we want to detect anomalies in
        number_of_stdevs_away_from_mean: float. Number of standard deviations away from
        the mean that we want to flag anomalies at. For example, if 
        number_of_stdevs_away_from_mean=2,
        then all data points more than 2 standard deviations away from the mean are flagged as 
        anomalies.
    Outputs:
        df: Pandas dataframe. Dataframe containing column for low pass filter anomalies 
        (True/False)
    """
    #60-day rolling average
    df[column_name+'_Rolling_Average']=df[column_name].rolling(window=60, center=True).mean()
    #60-day standard deviation 
    df[column_name+'_Rolling_StDev']=df[column_name].rolling(window=60, center=True).std()
    #Detect anomalies by determining how far away from the mean (in terms of standard deviation)
    #each data point is
    df[column_name+'_Low_Pass_Filter_Anomaly']=(abs(df[column_name]-df[
                                column_name+'_Rolling_Average'])>(
                                number_of_stdevs_away_from_mean*df[
                                column_name+'_Rolling_StDev']))
    return df
##EXECUTE IN MAIN BLOCK
#APPLY LOW PASS FILTER (ROLLING AVERAGE+ 2 Z-SCORE FILTER) TO DETECT ANOMALIES
gasoline_price_df=low_pass_filter_anomaly_detection(df=gasoline_price_df, 
                                      column_name='Gasoline_Price', 
                                      number_of_stdevs_away_from_mean=3)
#Re-plot time series with color coding for anomaly column
scatterplot_with_color_coding(gasoline_price_df['Date'],
                                  gasoline_price_df['Gasoline_Price'], 
                                  gasoline_price_df['Gasoline_Price_Low_Pass_Filter_Anomaly'],
                                  'Date', 
                                  'Gasoline Price (Dollars Per Gallon)', 
                                  'Gasoline Prices, Color-Coded on Low-Pass Filter Anomalies')   
Gasoline time series, color-coded by detected low-pass filter anomalies. Anomalous data points are in yellow, and normal data points are in purple.

In examining the above results, the low pass filter performs fairly well. Of the nine outliers added to the time series, the low pass filter successfully detects six, plus a few other data points that look anomalous.

Isolation Forests

Isolation forests are an unsupervised extension of the popular random forest algorithm. The building blocks of isolation forests are isolation trees with a binary outcome (is/is not an outlier).

When an isolation forest is built, the algorithm splits each individual data point off from all other data points. The easier it is to isolate a single point in space from all other points, the more likely it is an outlier (because it’s far away from all other data points). If a data point is an in-lier, it will be closely surrounded by other data points, and will take more splits to isolate (1). See the graphic below as an illustration.

Related image
Each line represents a split performed by the algorithm to isolate individual data points. In the first figure, X0 is an outlier, and is isolated in ~3 splits. In the second figure, Xi is an in-lier, and requires many more splits to isolate from all other data points.
Image courtesy of https://www.semanticscholar.org/paper/Anomaly-Detection-through-on-line-Isolation-Forest%3A-Susto-Beghi/c321685dd03fa1a54e99b4c046950b96c3b0f040/figure/1.

For further information on isolation forests, check out this tutorial by T. Fuertes.

Code Example

Again we use the gasoline time series, but this time we apply the isolation forest algorithm to detect anomalies. We use the IsolationForest() model available in the scikit-learn package to build a model and test it on our data set:

def isolation_forest_anomaly_detection(df, 
                                       column_name, 
                                       outliers_fraction):
    """
    In this definition, time series anomalies are detected using an Isolation Forest algorithm.
    Arguments:
        df: Pandas dataframe
        column_name: string. Name of the column that we want to detect anomalies in
        outliers_fraction: float. Percentage of outliers allowed in the sequence.
    Outputs:
        df: Pandas dataframe with column for detected Isolation Forest anomalies (True/False)
    """
    #Scale the column that we want to flag for anomalies
    min_max_scaler = preprocessing.StandardScaler()
    np_scaled = min_max_scaler.fit_transform(df[[column_name]])
    scaled_time_series = pd.DataFrame(np_scaled)
    # train isolation forest 
    model =  IsolationForest(contamination = outliers_fraction, behaviour='new')
    model.fit(scaled_time_series)
    #Generate column for Isolation Forest-detected anomalies
    isolation_forest_anomaly_column = column_name+'_Isolation_Forest_Anomaly'
    df[isolation_forest_anomaly_column] = model.predict(scaled_time_series)
    df[isolation_forest_anomaly_column] = df[isolation_forest_anomaly_column].map( {1: False, -1: True} )
    return df
##EXECUTE IN MAIN BLOCK
#APPLY ISOLATION FOREST TO DETECT ANOMALIES
gasoline_price_df=isolation_forest_anomaly_detection(df=gasoline_price_df, 
                                                         column_name='Gasoline_Price', 
                                                         outliers_fraction=.04)
#Re-plot time series with color coding for anomaly column
scatterplot_with_color_coding(gasoline_price_df['Date'],
                                  gasoline_price_df['Gasoline_Price'], 
                                  gasoline_price_df['Gasoline_Price_Isolation_Forest_Anomaly'],
                                  'Date', 
                                  'Gasoline Price (Dollars Per Gallon)', 
                                  'Gasoline Prices, Color-Coded on Isolation Forest Anomalies') 

Gasoline time series, color-coded by detected Isolation Forest anomalies. Anomalous data points are in yellow, and normal data points are in purple.

Like the low-pass filter, the isolation forest detects most of the artificially generated anomalies; however, it detects more false positives (data points labelled as anomalies when they are actually not). The algorithm has a particularly difficult time recognizing large time series shifts as normal (see late 2014 time period as an example).

Seasonal-Extreme Studentized Deviate (S-ESD) Algorithm

A group of data scientists as Twitter initially developed the S-ESD algorithm. One of the original (and most popular) implementations of this algorithm is in the R ‘AnomalyDetection’ package. Luckily, the R package has been adapted several times over for use in Python.

The algorithm is simple to understand, and, depending on the time series used, quite robust. It works as follows:

  1. A time series is decomposed using STL (season-trend-loess) decomposition.
  2. The ESD algorithm is run on the resulting loess time series to detect anomalies. The loess time series represents noise in the system, after trend and seasonal behavior have been filtered out.

Let’s try an example to demonstrate.

Code Example

Before we implement the S-ESD algorithm, let’s decompose the gasoline price time series to get an idea of the time series’ trend, seasonality, and noise:

def decompose_time_series(series, desired_frequency):
    """
    Perform STL decomposition on the time series.
    Arguments:
        series: Pandas series. Time series sequence that we wish to decompose.
        desired_frequency: Integer. Time frequency of the series. If we want to detect 
        a yearly trend, we'd set the value equal to 365.
    Outputs:
        Plot of time series STL decomposition.
    """
    result = seasonal_decompose(series, model='additive', freq=desired_frequency)
    result.plot()
    plt.show()
##EXECUTE IN MAIN BLOCK
#APPLY S-ESD ALGORITHM TO DETECT ANOMALIES
#Decompose time series on a yearly interval
#Set Date as index for the time series
gasoline_price_df.index=gasoline_price_df['Date']
decompose_time_series(series=gasoline_price_df['Gasoline_Price'], 
                          desired_frequency=365)
STL Decomposition of Gulf Coast Gasoline Price Time Series, 2014-Present

In the above code snippet, we decompose the time series on a yearly frequency. When detecting anomalies, the decomposed time series that we really care about is the Residual Time Series (or loess time series). This time series represents noise in the sequence, after seasonality and trend have been accounted for.

Following decomposition, we apply the Extreme Studentized Deviate (ESD) test on the residual time series to detect outliers. One of the main advantages of the ESD Test is that it only requires an upper bound for the total number of outliers to detect; this is in contrast to the Grubbs Test, where the number of outliers must be declared exactly (2).

For more information on the Extreme Studentized Deviate (ESD) algorithm, check out this link.

In the code snippet below, we perform S-ESD anomaly detection on the time series, which includes STL decomposition and outlier detection using ESD:

def sesd_anomaly_detection(dataframe, 
                           column_name, 
                           desired_frequency, 
                           max_anomalies,
                           alpha_level):
    """
    In this definition, time series anomalies are detected using the S-ESD algorithm.
    Arguments:
        dataframe: Pandas dataframe
        column_name: string. Name of the column that we want to detect anomalies in
        desired_frequency: Integer. Time frequency of the series. If we want to detect 
        a yearly trend, we'd set the value equal to 365.
        max_anomalies: Integer. Max number of anomalies to look for in the time series
        sequence.
        alpha_level: The significance level.
    Outputs:
        df: Pandas dataframe with column for detected S-ESD anomalies (True/False)
    """
    series=np.array(dataframe[column_name])
    #Implement SESD algorithm on the time series
    outliers_indices = sesd.seasonal_esd(series, 
                                         hybrid=False, 
                                         seasonality=desired_frequency,
                                         max_anomalies=max_anomalies, 
                                         alpha=alpha_level)
    #Create a column for SESD anomalies
    sesd_anomaly_column=column_name+'_SESD_Anomaly'
    #Create a detected anomaly column, and mark as False if normal, and True if anomalous
    dataframe[sesd_anomaly_column]=False
    dataframe.loc[dataframe.index.isin(outliers_indices), sesd_anomaly_column]=True
    return dataframe
##EXECUTE IN MAIN BLOCK
#Implement the SESD algorithm on the time series
gasoline_price_df=sesd_anomaly_detection(dataframe=gasoline_price_df, 
                           column_name='Gasoline_Price', 
                           desired_frequency=365, 
                           max_anomalies=20,
                           alpha_level=10)
#Re-plot time series with color coding for anomaly column
scatterplot_with_color_coding(gasoline_price_df['Date'],
                                  gasoline_price_df['Gasoline_Price'], 
                                  gasoline_price_df['Gasoline_Price_SESD_Anomaly'],
                                  'Date', 
                                  'Gasoline Price (Dollars Per Gallon)', 
                                  'Gasoline Prices, Color-Coded on SESD Anomalies')    
Gasoline Time Series, color coded on S-ESD anomalies, with outliers in yellow and all other data points in purple

The results for S-ESD algorithm are not too promising, based on the results above. The algorithm only successfully manages to identify one of the generated anomalies out of the nine. I’ve previously had success with this algorithm using other time series sets, but it appears to struggle with this particular example.

One-Class Support Vector Machines (SVM’s)

One-class SVM’s are the unsupervised version of support vector machines, as they are trained on only one class (the “normal” class). Because the data is unlabeled, one-class support vector machines “infer the properties of the normal cases, and from these properties predict which examples are unlike the normal examples” (3). One-class SVM’s use a binary function that maps the probability density of data across a plane, with high probability regions set to +1 and low probability regions set to -1 (4).

Check out this article by Hsu et al for more in-depth information on one-class support vector machines.

Code Example

Once again we turn to scikit-learn to perform anomaly detection; this time, for one-class SVM’s. We generate a couple variables to plug into the algorithm to improve performance–mainly, a centered 6-point rolling average of the gasoline price time series. We then run the original time series and its rolling average through the one_class_SVM_anomaly_detection() function. In the function, we scale each variable, and then train the unsupervised model. Finally, we classify each data point, and visualize using a scatter plot:

def one_class_SVM_anomaly_detection(dataframe, columns_to_filter_by, outliers_fraction):
    """
    In this definition, time series anomalies are detected
    using a One Class SVM algorithm.
    Arguments:
        df: Pandas dataframe
        columns_to_filter_by: string, or list of strings. Name of the column(s) that 
        we want to use in the One Class SVM to detect time series anomalies
        outliers_fraction: float. Percentage of outliers allowed in the sequence.
    Outputs:
        df: Pandas dataframe with column for detected One Class SVM anomalies (True/False)
    """
    #Subset the dataframe by desired columns
    dataframe_filtered_columns=dataframe[columns_to_filter_by]
    #Transform the data using the fit_transform() function
    min_max_scaler = preprocessing.StandardScaler()
    np_scaled = min_max_scaler.fit_transform(dataframe_filtered_columns)
    scaled_dataframe = pd.DataFrame(np_scaled)
    #Remove any NaN's from the dataframe
    scaled_dataframe =scaled_dataframe.dropna()
    #Train the One Class SVM
    model = svm.OneClassSVM(nu=outliers_fraction, kernel="rbf", gamma=0.01)
    model.fit(scaled_dataframe)
    #Create a column for the anomaly
    one_class_svm_anomaly_column='One_Class_SVM_Anomaly'
    scaled_dataframe[one_class_svm_anomaly_column] = pd.Series(model.predict(
                                                        scaled_dataframe)).map({1: False, -1: True})
    dataframe[one_class_svm_anomaly_column] = scaled_dataframe[one_class_svm_anomaly_column]
    return dataframe
##EXECUTE IN MAIN BLOCK
#APPLY ONE CLASS SVM ALGORITHM TO DETECT ANOMALIES
#Create some columns to cluster against
#Create 6-value rolling average for gasoline prices
gasoline_price_df['Gasoline_Price_Rolling_Average_6_pt']=gasoline_price_df['Gasoline_Price'].rolling(window=6, center=True).mean()
#Calculate difference between gas price and rolling average value
gasoline_price_df['Gasoline_Price_Diff_From_Rolling_Avg']=gasoline_price_df['Gasoline_Price']-gasoline_price_df['Gasoline_Price_Rolling_Average_6_pt']
#Implement One Class SVM to time series
gasoline_price_df=one_class_SVM_anomaly_detection(dataframe=gasoline_price_df, 
                                    columns_to_filter_by=['Gasoline_Price_Diff_From_Rolling_Avg',
                                                          'Gasoline_Price'], 
                                    outliers_fraction=.1)
#Re-plot time series with color coding for anomaly column
scatterplot_with_color_coding(gasoline_price_df['Date'],
                                  gasoline_price_df['Gasoline_Price'], 
                                  gasoline_price_df['One_Class_SVM_Anomaly'],
                                  'Date', 
                                  'Gasoline Price (Dollars Per Gallon)', 
                                  'Gasoline Prices, Color-Coded on One Class SVM Anomalies')    
Gasoline time series, color-coded by detected One Class SVM anomalies. Anomalous data points are in yellow, and normal data points are in purple.

The one class SVM performs relatively well, detecting six out of the nine artificially generated outliers. However, like the isolation forest, it detects several false positives in the sequence.

Conclusions

In this article, we compared the results of a variety of anomaly detection techniques–isolation forests, low-pass filters, one class SVM’s, and the S-ESD algorithm. Surprisingly, when comparing the results side-by-side, the algorithm with the best performance (at least for this example) is a simple low-pass filter. The algorithm with the worst performance is the S-ESD algorithm.

It is important to note that these algorithms perform differently for each individual time series. Just because the low-pass filter works best with this time series, doesn’t mean it works best for all time series. To achieve optimal results, try all the above options before making a final decision on which algorithm to use.

This concludes my tutorial on unsupervised machine learning methods for time series anomaly detection. Thanks for reading!

Sources

  1. T. Fuertes. April 4, 2018. Isolation forest: the art of cutting off from the world. https://quantdare.com/isolation-forest-algorithm/
  2. Engineering Statistics Handbook. Generalized ESD Test for Outliers. https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm
  3. Azure Machine Learning Studio. May 5, 2019. One-Class Support Vector Machine. https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/one-class-support-vector-machine
  4. Roemer Vlasveld. July 12, 2013. Introduction to One-class Support Vector Machines. http://rvlasveld.github.io/blog/2013/07/12/introduction-to-one-class-support-vector-machines/

Check out some of my other time series tutorials and articles:

4 comments

  1. Hello,

    Thanks for the great article!! There is a minor issue with the code that you have given, the function scatterplot_with_color_coding is missing.

  2. Hi, I really liked your article, it’s very complete and illustrative. However, I found a flaw: in all cases but one, all algorithms have information about the neighbours of the point to be classified (with the rolling mean, for instance), however, this is not the case for the isolation forest. Here it is only provided the raw value, that’s why, after carefully analysing the plot, can be seen that it’s behaving as a density detector. Adding an extra feature like rolling mean, rolling std or just shifted values could help to increase its performance

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.