Analyzing Electricity Price Time Series Data using Python: Time Series Decomposition and Price Forecasting using a Vector Autoregression (VAR) Model

Welcome to the Tech Rando blog! This tutorial covers time series decomposition and vector autoregression (VAR) modelling to forecast electricity prices for the state of Texas, using time series data collected via the Energy Information Administration’s (EIA) API.

First, a little background on the Energy Information Administration, or the EIA. The EIA is a branch in the US Department of Energy responsible for collecting and analyzing energy-related data, including oil and gas, coal, nuclear, electric, and renewables. Via its Open Data application programming interface (API), users can directly pull the EIA time series data into Python for analysis.

For more background on setting up EIA API access in Python, check out this tutorial:

In this analysis, we’re going to pull the time series for electricity prices for the state of Texas into Python for analysis, as shown below:

def retrieve_time_series(api, series_ID):
    Return the time series dataframe, based on API and unique Series ID
    api: API that we're connected to
    series_ID: string. Name of the series that we want to pull from the EIA API
    #Retrieve Data By Series ID 
    series_search = api.data_by_series(series=series_ID)
    ##Create a pandas dataframe from the retrieved time series
    df = pd.DataFrame(series_search)
    return df
###Execute in the main block
#Create EIA API using your specific API key
api_key = "YOR API KEY HERE"
api = eia.API(api_key)
#Pull the electricity price data
electricity_df=retrieve_time_series(api, series_ID)
electricity_df.reset_index(level=0, inplace=True)
#Rename the columns for easer analysis
Snapshot of the time series data for electricity prices, pulled via the EIA API

First, let’s look at whether or not the monthly electricity data displays seasonality and a trend. To do this, we use the seasonal_decompose() function in the statsmodels.tsa.seasonal package. This function breaks down a time series into its core components: trend, seasonality, and random noise. The code and its outputs are displayed below:

def decompose_time_series(series):
    Decompose a time series and plot it in the console
        series: series. Time series that we want to decompose
        Decomposition plot in the console
    result = seasonal_decompose(series, model='additive')
#Execute in the main block
#Convert the Date column into a date object
#Set Date as a Pandas DatetimeIndex
#Decompose the time series into parts
Time Series Decomposition: Monthly Electricity Prices in TX

So what does the plot above mean? If we take a closer look at the seasonal component of the decomposition, we can validate a trend that logically makes sense: during the summer months in Texas, electricity prices skyrocket. For anyone who has been to Texas in the middle of summer, it’s sweltering hot and practically unlivable without air conditioning (I know, I live in Houston). So, more air conditioning leads to more electricity usage, which leads to higher electricity prices.

Additionally, there is a second, smaller spike in electricity prices around the beginning of each year. This is likely due to the weather being cool enough that Texans resort to heating their homes, offices, etc. As most of Texas’s weather is fairly mild in the winter, this spike is significantly less pronounced than the spike in electricity prices during the summer.

Now that we’ve done some initial analysis using time series decomposition, we want to determine if we can predict electricity prices by using another predictor time series as a proxy. In order to get an idea of what factors may affect electricity prices, let’s first check out how electricity is generated in Texas. On its website, the EIA provides a breakdown of the sources of electricity in Texas, shown in the bar graph below:

Image courtesy of

Based on the above graph, natural gas appears to be the main source of electricity generation in Texas as of March 2019, with coal-fired and renewables tied for second.

Although we’re analyzing electricity data going back to 2001 and the breakdown above is from March 2019, for the sake of simplicity we’ll assume that natural gas has been one of the main sources of electricity generation in Texas for the past 15-20 years. Consequently, we’re going to pull the time series of natural gas prices via the EIA API and compare it side-by-side to the TX electricity price time series:

#Pull in natural gas time series data
nat_gas_df=retrieve_time_series(api, series_ID)
nat_gas_df.reset_index(level=0, inplace=True)
#Rename the columns
#Convert the Date column into a date object
#Set Date as a Pandas DatetimeIndex
#Decompose the time series into parts
#Merge the two time series together based on Date Index
master_df=pd.merge(electricity_df['Electricity_Price'], nat_gas_df['Nat_Gas_Price_MCF'], 
                       left_index=True, right_index=True)
master_df.reset_index(level=0, inplace=True)
#Plot the two variables in the same plot
             master_df['Electricity_Price'], label="Electricity_Price")
             master_df['Nat_Gas_Price_MCF'], label="Nat_Gas_Price")
# Place a legend to the right of this smaller subplot.
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title('Natural Gas Price vs. TX Electricity Price over Time')
Plotted Natural Gas Prices and Electricity Prices over Time

There appears to be more variation in the natural gas price time series (the highs are especially high and the lows are especially low), but for the most part, the trends in both time series appear very similar. In fact, it’s highly likely we could estimate TX electricity price using natural gas price as a proxy.

This presents an interesting problem–using lagged values of a time series, along with lagged values other time series sets, to predict future values. One approach to solving a problem such as this is called vector autoregression, or VAR. VAR is an extension of the autoregressive (or AR) model, where multiple variables are used when generating predictions. VAR models are very popular because of their flexibility when analyzing economic and financial time series, and are great for forecasting.

The image below gives a breakdown of a VAR equation with two variables and first order dynamics. It is a very simple linear equation, where the x- and y-values at time t-1 are used to predict the x- and y-values at time t. As variable x depends on variable y and vice versa, they are jointly considered using a system of equations. Basically, VAR models work by using lagged data for multiple dependent time series to predict future values.

VAR equation for two variables and first-order dynamics: Image courtesy of

The following handout available via the University of Washington gives an in-depth description of the mathematics behind the VAR model, if you would like further background:

Additionally, the University of Chicago Booth School of Business gives a great high-level overview of VAR modeling in the following presentation:

As with most econometric analysis, there are a couple of requirements for the VAR model to work. First, the passed time series need to be stationary, i.e. the statistical properties of the times series, including mean, variance, and autocorrelation, need to be constant over time. Consequently, time series that display seasonality or a trend are NOT stationary.

There are a couple approaches that we can take to make a model stationary. They are as follows:

  1. Difference the time series. Data differencing is, more or less, calculating the instantaneous velocity at each data point, or the amount the time series changes from one value to the next.
  2. Transform the time series. This can be done by applying a log or power transformation to the data.

Next, in order for a VAR model to work, the sampling frequency (i.e. daily, monthly, yearly data) needs to be the same. If it isn’t, the data needs to be converted to the same frequency using imputation/linear interpolation, etc.

In the code below, each time series is transformed using the numpy natural log function, and then differenced by one interval:

#Transform the columns using natural log
#Difference the data by 1 month
master_df['Electricity_Price_Transformed_Differenced'] = master_df['Electricity_Price_Transformed'] - master_df['Electricity_Price_Transformed'].shift(n)
master_df['Nat_Gas_Price_MCF_Transformed_Differenced'] = master_df['Nat_Gas_Price_MCF_Transformed'] - master_df['Nat_Gas_Price_MCF_Transformed'].shift(n)

In order to test if the time series are now stationary, we use the Augmented Dickey Fuller Test. The Augmented Dickey-Fuller Test tests the null hypothesis that a unit root is present in a time series (i.e. if a unit root is in the time series, then it isn’t stationary). To successfully determine that a time series is stationary, the test must return a p-value of less than .05. In the Python code below, we run the Augmented Dickey-Fuller test on the transformed, differenced time series, determining that they are both stationary (the p-values returned are .000299 and 0 for the electricity and natural gas time series, respectively):

def augmented_dickey_fuller_statistics(time_series):
    Run the augmented Dickey-Fuller test on a time series
    to determine if it's stationary.
        time_series: series. Time series that we want to test 
        Test statistics for the Augmented Dickey Fuller test in 
        the console 
    result = adfuller(time_series.values)
    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
    print('Critical Values:')
    for key, value in result[4].items():
        print('\t%s: %.3f' % (key, value)) 
#Execute in the main block
#Run each transformed, differenced time series thru the Augmented Dickey Fuller test
print('Augmented Dickey-Fuller Test: Electricity Price Time Series')
print('Augmented Dickey-Fuller Test: Natural Gas Price Time Series')
Outputs for the Augmented Dickey-Fuller Test for the electricity price time series and the natural gas price time series, respectively

Now that we’ve made our two time series stationary, it’s time to fit the data to a VAR model. The Python code below successfully builds the model and returns a summary of the results, where we use a 95/5 percent split for the training/validation sets:

#Conver the dataframe to a numpy array
#Generate a training and test set for building the model: 95/5 split
training_set = master_array[:int(0.95*(len(master_array)))]
test_set = master_array[int(0.95*(len(master_array))):]
#Fit to a VAR model
model = VAR(endog=training_set)
model_fit =
#Print a summary of the model results
VAR Model Summary, with y1=Electricity Price Time Series, and y2=Natural Gas Price Time Series

What we really want to focus on in the model summary above is the equation for y1, where y1 estimates the electricity prices in the state of Texas based on lagged values of itself and lagged natural gas prices.

There is both a t-statistic and p-value associated with the lagged electricity price data (L1.y1), and lagged natural gas price data (L1.y2). The t-statistic compares the data to expected results under the null hypothesis. If sample results exactly equal the null hypothesis, then the t-statistic is 0. Alternatively, the higher the t-statistic value, the more likely that we can reject the null hypothesis. Basically, the higher the t-statistic, the more likely that there is a correlation between the two variables.

Each t-statistic has a corresponding p-value. P-values are also used to reject the null hypothesis. The lower the p-value, the stronger the evidence against the null hypothesis. Normally a p-value under .05 is used to determine statistical significance.

For more information on t-statistics and p-values, check out the following link:

With all of this information in mind, let’s interpret the results of the model summary for equation y1.

For the lagged electricity price variable, there is a t-statistic of 3.002 and a p-value 0.003–clearly, lagged values of the electricity price time series affect future values of the time series, and we can reject the null hypothesis.

Lagged values of the natural gas price time series are also promising with a t-statistic of 2.225 and a p-value of 0.026, well below the cutoff for statistical significance and a clear indicator that lagged values in the natural gas time series affect future values in the electricity price time series.

Now that we’ve built a model, it’s time to generate and compare the predictions to actual data in the test/validation set. It is important to note that the data must be un-differenced and back-transformed in order to understand the electricity price predictions. The following code performs the un-differencing and back-transformation operations, as well as calculating the forecasting metrics associated with evaluating model performance, including forecast bias, mean absolute error, mean squared error and root mean squared error:

def calculate_model_accuracy_metrics(actual, predicted):
    Output model accuracy metrics, comparing predicted values
    to actual values.
        actual: list. Time series of actual values.
        predicted: list. Time series of predicted values
        Forecast bias metrics, mean absolute error, mean squared error,
        and root mean squared error in the console
    #Calculate forecast bias
    forecast_errors = [actual[i]-predicted[i] for i in range(len(actual))]
    bias = sum(forecast_errors) * 1.0/len(actual)
    print('Bias: %f' % bias)
    #Calculate mean absolute error
    mae = mean_absolute_error(actual, predicted)
    print('MAE: %f' % mae)
    #Calculate mean squared error and root mean squared error
    mse = mean_squared_error(actual, predicted)
    print('MSE: %f' % mse)
    rmse = sqrt(mse)
    print('RMSE: %f' % rmse)
#Execute in the main block
#Un-difference the data
for i in range(1,len(master_df.index)-1):[i,'Electricity_Price_Transformed']=[i-1,'Electricity_Price_Transformed'][i,'Electricity_Price_Transformed_Differenced_PostProcess']
#Back-transform the data
#Compare the forecasted data to the real data
print(master_df[master_df['Predicted']==1][['Date','Electricity_Price', 'Predicted_Electricity_Price']])
#Evaluate the accuracy of the results
                                    list(master_df[master_df['Predicted']==1 ['Predicted_Electricity_Price']))
Actual electricity price vs. VAR-model predicted electricity price, predicted out 10 months (after un-differencing and back-transformation)
Accuracy metrics for the forecast: forecast bias, mean absolute error, mean squared error, and root mean square error

The results here look very promising! The VAR appears to estimate electricity prices fairly accurately. The forecast bias metric is -0.041638, indicating that the model has a tendency to over-forecast, or predict too high. Mean absolute error, mean squared error, and root mean squared error are also fairly low, which is good. Based on MAE, there is an average $.22 variation between the forecasted and the actual values.

For more background on time series forecasting model metrics, check out the following link:

This concludes this tutorial. Thank you for reading!

The full script for this analysis is available via Github:

Check out some of my other energy-related Python tutorials:

Automating oil and gas decline curve analysis:

Analyzing open source FracFocus data to gain completions insights:


Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.