Using ML to Predict the Best Time to Sell Electricity Back to the Grid

Fire up that Tesla PowerWall or PowerPack and save big! Full Python ML analysis included

With the rise of new technology in the renewables and energy storage industries, there are many new opportunities to save on electricity, or even make money by selling electricity back to the grid. Currently 40 states in the US allow homeowners to sell excess electricity back to the grid that is generated via residential solar projects (check out this link for more info on state net metering policies). Furthermore, new technologies such as the Tesla PowerWall and Tesla PowerPack–rechargeable lithium-ion battery devices–allow for energy storage for later usage.

It’s important to know when daily peak electricity demand occurs. A couple reasons why include:

  1. You want to avoid using electricity during peak hours, so you can save the most on your bills.
  2. When you own a PowerWall or PowerPack and you’re selling electricity back to the grid, you want to sell during peak demand times so that you can command the highest prices for the electricity that you’re selling.

In this article, I use machine learning to specifically predict the best time to sell electricity on the grid on a daily basis (or the best time to turn off all the lights/appliances in your house to save money), based on when peak electricity demand occurs. The below analysis is performed in Python and relies on packages such as pandas and sklearn. For my training data set, I use several years’ worth of hourly electricity demand data pulled via the Energy Information Administration’s Open Data API for the state of TX. The hourly electricity demand data acts as a proxy for hourly electricity price data, which was not publicly available. For more information on pulling data via the EIA API, check out my other tutorial here.

Data Visualization and Exploration

First, let’s pull some electricity demand data using the EIA API and visualize it on yearly and weekly scales:
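To give a concrete starting point, here is a minimal sketch of what the data pull might look like. The series ID 'EBA.TEX-ALL.D.H' (hourly demand for the Texas/ERCOT region), the 'YOUR_API_KEY_HERE' placeholder, the column names, and the use of the EIA v1 series endpoint are assumptions for illustration; see my EIA API tutorial linked above for the full workflow.

import requests
import pandas as pd

def pull_eia_hourly_demand(api_key, series_id):
    """
    Pull an hourly electricity demand series from the EIA Open Data API (v1 series endpoint)
    and return it as a pandas dataframe with Date_Time and Electricity_Demand_MWh columns
    api_key: String. Your EIA API key
    series_id: String. EIA series ID for the hourly demand series
    """
    response = requests.get('http://api.eia.gov/series/',
                            params={'api_key': api_key, 'series_id': series_id})
    #The 'data' field is a list of [timestamp, value] pairs
    raw_data = response.json()['series'][0]['data']
    df = pd.DataFrame(raw_data, columns=['Date_Time', 'Electricity_Demand_MWh'])
    #Hourly EIA timestamps look like '20190701T05Z'; parse them into datetimes
    df['Date_Time'] = pd.to_datetime(df['Date_Time'], format='%Y%m%dT%HZ')
    return df.sort_values('Date_Time').reset_index(drop=True)

#Pull hourly demand for Texas (ERCOT); replace with your own API key
electricity_demand_df = pull_eia_hourly_demand(api_key='YOUR_API_KEY_HERE',
                                               series_id='EBA.TEX-ALL.D.H')

With the hourly demand data loaded into electricity_demand_df, the plotting helper below visualizes it: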

import matplotlib.pyplot as plt

def plot_data(df, x_variable, y_variable, title):
    """
    Plot the x- and y- variables against each other, where the variables are columns in
    a pandas dataframe
    df: Pandas dataframe. 
    x_variable: String. Name of x-variable column
    y_variable: String. Name of y-variable column
    title: String. Desired title name
    """
    fig, ax = plt.subplots()
    ax.plot_date(df[x_variable], 
                 df[y_variable], marker='', linestyle='-', label=y_variable)
    fig.autofmt_xdate()
    plt.title(title)
    plt.show()
  

###Execute in main block

#Plot the data on a yearly basis, using 2019 as an example year
plot_data(df=electricity_demand_df[(electricity_demand_df['Date_Time']>=pd.to_datetime('2019-01-01')) &
                                   (electricity_demand_df['Date_Time']<pd.to_datetime('2020-01-01'))],
          x_variable='Date_Time',
          y_variable='Electricity_Demand_MWh',
          title='TX Electricity Demand: 2019')
#Plot the data on a weekly basis, using July 1-7, 2019 as an example
plot_data(df=electricity_demand_df[(electricity_demand_df['Date_Time']>=pd.to_datetime('2019-07-01')) &
                                   (electricity_demand_df['Date_Time']<pd.to_datetime('2019-07-08'))],
          x_variable='Date_Time',
          y_variable='Electricity_Demand_MWh',
          title='TX Electricity Demand: July 1-7, 2019')

Yearly Demand, 2018 and 2019

Weekly Demand: Monday through Sunday, July 1st-7th 2019

The seasonality of the electricity demand time series is evident from the graphs above. In the yearly view, demand appears to increase significantly during the summer months (likely due to increased use of air conditioning). In the weekly view, demand appears to fluctuate significantly over the course of each day, with the lowest demand occurring in the early morning, and the highest demand occurring in the afternoon.
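As a quick numerical check on those patterns (a small exploratory sketch, assuming the electricity_demand_df dataframe pulled above), we can average demand by month of the year and by hour of the day:

#Average demand by month of the year (confirms the rise during the warmer months)
print(electricity_demand_df.groupby(
    electricity_demand_df['Date_Time'].dt.month)['Electricity_Demand_MWh'].mean())

#Average demand by hour of the day (confirms the early-morning trough and afternoon peak)
print(electricity_demand_df.groupby(
    electricity_demand_df['Date_Time'].dt.hour)['Electricity_Demand_MWh'].mean())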

Now that we’ve visualized the data, let’s further break down when the peak demand hour is most likely to occur for each day in the data set. We use the following code to identify and subset the peak demand hour for each day:

import calendar

#Pull the hour of day for each reading (after subsetting to the daily peaks below,
#this column holds the peak demand hour for each day)
electricity_demand_df['Peak_Demand_Hour']=electricity_demand_df['Date_Time'].dt.hour
#Pull the day of month for each reading
electricity_demand_df['Day_Of_Month']=electricity_demand_df['Date_Time'].dt.day
#Pull the month of the year
electricity_demand_df['Month']=electricity_demand_df['Date_Time'].dt.month.apply(lambda x: calendar.month_abbr[x])
#Pull the year
electricity_demand_df['Year']=electricity_demand_df['Date_Time'].dt.year

#Calculate the hour with max demand for each date in the data set
electricity_demand_df['Peak_Demand_Hour_MWh_For_Day']=electricity_demand_df.groupby(['Day_Of_Month', 'Month', 'Year'], sort=False)['Electricity_Demand_MWh'].transform('max')

#Subset time series to only include peak hourly data
peak_demand_hour_df=electricity_demand_df[electricity_demand_df['Electricity_Demand_MWh']==electricity_demand_df['Peak_Demand_Hour_MWh_For_Day']].copy()

When we plot a histogram of the peak hour for each day (0-23 hour, where 0 is midnight, 1 is 1 am, etc.) by the number of times it occurs in the data set, we receive the following result:

#Create a histogram of counts by hour
ax=peak_demand_hour_df['Peak_Demand_Hour'].value_counts().plot(kind='bar', title='Peak Demand Hour by Number of Occurrences')
ax.set_xlabel("Demand Hour (0-23 hour)")
ax.set_ylabel("Number of Occurrences")
Peak demand hour by number of occurrences

Peak hourly demand most frequently occurs at hour 17, or 5 pm, with over 500 occurrences across the 1412-day data set. Following behind are 6 pm (hour 18), 7 pm (hour 19), and 8 am (hour 8). So, looking at the entire data set with no correction for month, day of the week, or season, peak demand most often occurs at 5 pm, with high demand around 6 to 7 pm in the evening (when people get back from work and are eating dinner), or 8 am (when people are waking up/about to start work).

But what if peak demand varies by month or day of the week? To investigate further, let’s break down when peak demand occurs, based on both of these factors:

def generate_histogram_of_aggregated_counts(df, 
                                            peak_demand_hour_column, 
                                            group_by_column):
    """
    Generate a histogram of peak demand hour counts, grouped by a column
    Arguments:
        df: Pandas dataframe
        peak_demand_hour_column: String. Name of the column for peak demand hour
        group_by_column: String. Name of column to group by 
    """
    #Create a histogram of counts by peak demand hour, grouped by the group-by column
    fig = plt.figure(figsize = (20,15))
    ax = fig.gca()
    axarr = df[peak_demand_hour_column].hist(by=df[group_by_column], bins=24, ax=ax)
    for ax in axarr.flatten():
        ax.set_xlabel("Peak Demand Hour (0-23)")
        ax.set_ylabel("Number of Occurrences")
    #Count the number of peak hour occurrences, grouped by the group-by column
    peak_hour_counts=pd.DataFrame(df.groupby([peak_demand_hour_column, 
                                              group_by_column]).size()).reset_index().rename(columns={0:'Counts'})
    #Pull the maximum occurrence count for each group and write it back as a column
    peak_hour_counts['Number_Occurrences']=peak_hour_counts.groupby([group_by_column])['Counts'].transform('max')
    #Subset the dataframe to only include the most frequent peak demand hour(s) for each group
    peak_hour_counts=peak_hour_counts[peak_hour_counts['Counts']==peak_hour_counts['Number_Occurrences']]
    #Order the dataframe by group_by_column
    peak_hour_counts=peak_hour_counts.sort_values(by=[group_by_column])
    #Print the subsetted dataframe
    print(peak_hour_counts[[group_by_column, peak_demand_hour_column, 'Number_Occurrences']])


##Execute in main block

#Add numeric day-of-week (1=Monday through 7=Sunday) and numeric month columns to group by
peak_demand_hour_df['Day_Of_Week_Numeric']=peak_demand_hour_df['Date_Time'].dt.dayofweek+1
peak_demand_hour_df['Month_Numeric']=peak_demand_hour_df['Date_Time'].dt.month

#Create a histogram of counts by peak demand hour, grouped by day of the week
generate_histogram_of_aggregated_counts(peak_demand_hour_df, 
                                        peak_demand_hour_column='Peak_Demand_Hour', 
                                        group_by_column='Day_Of_Week_Numeric')
#Create a histogram of counts by peak demand hour, grouped by month
generate_histogram_of_aggregated_counts(peak_demand_hour_df, 
                                        peak_demand_hour_column='Peak_Demand_Hour', 
                                        group_by_column='Month_Numeric')
Number of Occurrences by Peak Demand Hour, for each day of the week. Day of the week is labelled as 1 through 7, where 1 is Monday, 2 is Tuesday, 3 is Wednesday, etc.
Most frequently occurring peak demand hour for each day of the week, along with the number of times it occurs. Day of the week is labelled as 1 through 7, where 1 is Monday, 2 is Tuesday, 3 is Wednesday, etc.
Number of Occurrences by Peak Demand Hour, for each month of the year. Month of the year is labelled as 1 through 12, where 1 is January, 2 is February, 3 is March, etc.
Most frequently occurring peak demand hour for each month, along with the number of times it occurs. Month of the year is labelled as 1 through 12, where 1 is January, 2 is February, 3 is March, etc.

Examining the histograms above, we see that Monday through Saturday are most likely to experience peak demand around 5 pm (hour 17), while Sunday's peak demand occurs later, around 6 pm (hour 18).

Peak hourly demand most frequently occurs around 5-6 pm during the warmer months of April through October (months 4 through 10). November and December have a slightly later peak demand hour at 7 pm (hour 19), whereas February and March have the latest peak hour of all, around 8 to 9 pm. January's peak demand hour, at 8 am, is the most anomalous in the data set.

Building the Machine Learning Model

Let’s now test the hypothesis that the factors explored above influence daily peak demand hour, using a machine learning model. In the code snippet below, we perform some feature pre-processing:

import numpy as np

#Add the day-of-week name and week-of-year columns used as model features
peak_demand_hour_df['Day_Of_Week']=peak_demand_hour_df['Date_Time'].dt.day_name()
peak_demand_hour_df['Week']=peak_demand_hour_df['Date_Time'].dt.isocalendar().week

#Subset the dataframe to only include the features and labels that we're going to use 
#in the model
peak_demand_hour_model=peak_demand_hour_df[['Peak_Demand_Hour',
                                            'Day_Of_Week', 
                                            'Week',
                                            'Month']].copy()
#Convert the Week and Peak_Demand_Hour variables into categorical string variables (from numeric)
peak_demand_hour_model.loc[:,'Week']=peak_demand_hour_model['Week'].apply(str)
peak_demand_hour_model.loc[:,'Peak_Demand_Hour']='Hour '+peak_demand_hour_model['Peak_Demand_Hour'].apply(str)

#Remove the labels from the features
features= peak_demand_hour_model.drop('Peak_Demand_Hour', axis = 1)
#One hot encode the categorical features
features = pd.get_dummies(features)
#Create labels 
labels = np.array(peak_demand_hour_model['Peak_Demand_Hour'])

Three different categorical variables are pre-processed to predict when peak daily demand will occur–day of the week (Monday-Sunday), week of the year (ranges between week 1 and week 52), and month (January-December). As all of these are categorical variables, we use the pandas get_dummies function to perform one-hot encoding. One-hot encoding is used to transform categorical variables so that they can be easily interpreted by machine learning models (check out this tutorial for more info).
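As a quick illustration on toy data (not from the demand data set), get_dummies turns each category into its own indicator column:

import pandas as pd

#Toy example of one-hot encoding: each category becomes its own 0/1 indicator column,
#e.g. Day_Of_Week_Monday and Day_Of_Week_Tuesday
toy_df = pd.DataFrame({'Day_Of_Week': ['Monday', 'Tuesday', 'Monday']})
print(pd.get_dummies(toy_df))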

Now that we have a data set with both features and labels, let's build the machine learning model. Our label, or the value that we want to predict, is the hour of each day when peak electricity demand occurs. Our features, or the data that we use to predict the peak demand hour, are the day of the week, the month, and the week of the year. After some initial testing of regression vs. classification approaches, we settle on a classification model: each data point is classified into one of at most 24 categories, ranging from hour 0 (peak demand occurs at midnight) to hour 23 (peak demand occurs at 11 pm). (Note: in reality, there are fewer than 24 categories, as some hours have never historically been a peak demand hour.)

There are a multitude of different algorithms that can be used for multi-class classification, including support vector machines (SVMs), artificial neural networks (ANNs), decision trees, and random forests (RFs). For this exercise, we will be using a random forest model.

The random forest model, one of my favorite ML algorithms, is incredibly versatile and performs well when stacked up against other machine learning and deep learning algorithms. In an RF, a series of decision trees is built from logic derived from the training data set. When a prediction is made, the data point's features are run through every decision tree in the RF, and each tree outputs a predicted class. The majority class across the trees' outputs is deemed the final predicted class. The graphic below helps to illustrate this process.

A simplified diagram of a random forest, courtesy of https://shirinsplayground.netlify.com/2018/10/ml_basics_rf/

For more background on random forests and how they work, check out this tutorial: https://towardsdatascience.com/understanding-random-forest-58381e0602d2
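To make the majority-vote idea concrete, here is a small optional sketch (not part of the original analysis) that fits a tiny forest on the features and labels built above, and tallies the individual trees' votes for a single data point:

from collections import Counter
from sklearn.ensemble import RandomForestClassifier

#Fit a small toy forest so that the individual trees are easy to inspect
toy_rf = RandomForestClassifier(n_estimators=5, random_state=0)
toy_rf.fit(features, labels)

#Each decision tree casts a "vote"; the trees return class indices, which map back to labels
single_point = features.iloc[[0]]
tree_votes = [toy_rf.classes_[int(tree.predict(single_point.values)[0])]
              for tree in toy_rf.estimators_]
print(Counter(tree_votes))
#The forest's prediction is the majority class across the tree votes
print(toy_rf.predict(single_point))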

First, we split up our data set into training and validation sets, using a 75/25 split:

from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(features, 
                                                                            labels, 
                                                                            test_size = 0.25, 
                                                                            random_state = 5)  

Next, we tune the RF model via grid search to determine the optimal hyperparameters, including the number of trees and the maximum tree depth:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def grid_search_rf(parameter_grid, train_features, train_labels):
    """
    Perform Grid Search on the random forest classifier model, in order to optimize model 
    parameters
    parameter_grid: grid parameters to test against to determine optimal parameters
    train_features: Numpy array, containing training set features
    train_labels: Numpy array, containing training set labels
    """
    # Create a random forest classifier model
    rf = RandomForestClassifier()
    # Instantiate the grid search model
    grid_search = GridSearchCV(estimator = rf, 
                               param_grid = parameter_grid, 
                               cv = 3, 
                               n_jobs = -1, 
                               verbose = 2)
    grid_search.fit(train_features, train_labels)
    print(grid_search.best_params_)

##Execute in main script

#Create the parameter grid, which is plugged into GridSearchCV, where all hyperparameter
#combos are tested to find the optimal parameter combination
parameter_grid={'max_depth': [80, 90, 100, 110],
                'n_estimators': [700, 800, 900, 1000, 1100, 1200]}   
grid_search_rf(parameter_grid, train_features, train_labels)

"""
    Grid Search Outputs:
        Fitting 3 folds for each of 20 candidates, totalling 60 fits
        [Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
        [Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  2.1min
        [Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:  3.7min finished
        {'max_depth': 100, 'n_estimators': 1100}
"""

If you’re interested in further information on RF hyperparameter tuning, check out this tutorial.

Model Results

Now that we’ve determined optimal model parameters, we retrain the model and test its performance using the validation set:

from sklearn.metrics import accuracy_score, confusion_matrix

#Plug the optimized model parameters into the final RF model 
rf = RandomForestClassifier(n_estimators=1100, 
                            max_depth=100,
                            random_state = 1000)
#Fit the model 
rf.fit(train_features, train_labels)

#Get accuracy score of model
print(accuracy_score(test_labels, rf.predict(test_features), normalize=True, sample_weight=None))

# Use the forest's predict method on the test data and print the confusion matrix
print(confusion_matrix(test_labels, 
                           rf.predict(test_features),
                           labels=['Hour 8', 'Hour 9', 'Hour 10', 
                                   'Hour 14', 'Hour 15', 'Hour 16',
                                   'Hour 17', 'Hour 18', 'Hour 19',
                                   'Hour 20', 'Hour 21']))
Model Accuracy Output

Overall model accuracy is around 54%. In other words, we can predict the exact hour of daily peak demand more than half the time, knowing only a few basic calendar factors: the day of the week, the month, and the week of the year.

Other factors not included in this model, such as daily weather and temperature, likely also affect when the daily peak demand hour occurs.

Let’s visualize the confusion matrix results, which will help us get a better idea of where false positives and false negatives are occurring in the model:
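The original plotting code isn't shown here, but a heatmap along the following lines (a sketch using matplotlib's imshow; seaborn or sklearn's ConfusionMatrixDisplay would work just as well) produces a figure like the one below:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

hour_labels = ['Hour 8', 'Hour 9', 'Hour 10',
               'Hour 14', 'Hour 15', 'Hour 16',
               'Hour 17', 'Hour 18', 'Hour 19',
               'Hour 20', 'Hour 21']
cm = confusion_matrix(test_labels, rf.predict(test_features), labels=hour_labels)

#Plot the confusion matrix as a heatmap, with the count annotated in each cell
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(cm, cmap='Blues')
ax.set_xticks(range(len(hour_labels)))
ax.set_xticklabels(hour_labels, rotation=45)
ax.set_yticks(range(len(hour_labels)))
ax.set_yticklabels(hour_labels)
ax.set_xlabel('Predicted Peak Demand Hour')
ax.set_ylabel('Actual Peak Demand Hour')
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, cm[i, j], ha='center', va='center')
plt.title('Peak Demand Hour Confusion Matrix')
plt.colorbar(im)
plt.tight_layout()
plt.show()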

Confusion Matrix Results on Validation Data

In a confusion matrix, we want as many cases as possible to fall on the main diagonal, with values as close to 0 as possible everywhere else. Interpreting the above results, 104 cases where the peak demand hour was 5 pm (hour 17) were correctly identified by the model; 13 cases were incorrectly predicted as 6 pm (hour 18) when they were actually 5 pm, and so on. The model appears to be most confused when classifying data points between 4 and 6 pm (hours 16-18), as this is where most of the false positives and false negatives occur. This makes sense, as this range contains the most commonly occurring peak demand hours in the data set.

We also can analyze which factors affect when peak demand hour will occur the most, using the .feature_importances_ attribute:

#Obtain feature importances in the model
#feature_list holds the names of the one-hot encoded feature columns
feature_list = list(features.columns)
feature_importances = pd.DataFrame(rf.feature_importances_,
                                   index = feature_list,
                                   columns=['importance']).sort_values('importance', 
                                                                       ascending=False)
print(feature_importances)
Snapshot of some of the top contributing categorical (one-hot encoded) variables that affect daily peak demand hour the most

Based on the snapshot of feature importances above, one of the largest contributing factors in predicting the hour of daily peak demand is the day of the week. Whether or not a data point falls on a Sunday appears to have the largest impact on model results among the day-of-the-week features.

The week and month of the year affect peak demand hour less, but still have an impact. Interestingly, two of the weeks contributing most to model outcomes are the week of Thanksgiving (week 48) and the week of Christmas and New Year's (week 52).
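If you prefer a visual summary, the top importances can be plotted directly from the feature_importances dataframe built above (a quick optional sketch):

import matplotlib.pyplot as plt

#Plot the ten most important one-hot encoded features as a horizontal bar chart
ax = feature_importances.head(10).plot(kind='barh', legend=False,
                                       title='Top 10 Feature Importances')
ax.invert_yaxis()
ax.set_xlabel('Importance')
plt.show()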

Moral of the Story

Now that we’ve performed all of this analysis on peak daily electricity demand, what should we do with it? Let’s say you want to sell back to the grid, and you want to get the optimal price for the electricity that you’re selling. In general, you will get the most bang for your buck if you sell between the hours of 5 and 7 pm every day. Likewise, if you want to save money on electricity bills by avoiding usage during peak hours, turn off your lights and appliances between 5 and 7 pm to save the most money.

If you want to more accurately pinpoint the exact hour to sell electricity in advance, you can use the ML model above, which accounts for the day of the week, the month, and the week of the year when estimating the peak demand hour.
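As a sketch of what that might look like in practice (the helper below is illustrative, not part of the original code, and assumes the rf model and features frame from above), you can build a one-row feature frame for an upcoming date, one-hot encode it against the training columns, and ask the model for the expected peak demand hour:

import calendar
import pandas as pd

def predict_peak_hour_for_date(date_string, fitted_model, training_feature_columns):
    """
    Illustrative helper: predict the peak demand hour for a single future date
    date_string: String, e.g. '2020-08-15'
    fitted_model: Trained RandomForestClassifier
    training_feature_columns: Columns of the one-hot encoded training feature set
    """
    date = pd.to_datetime(date_string)
    #Build the same categorical features used in training: day of week, week of year, month
    single_day = pd.DataFrame({'Day_Of_Week': [date.day_name()],
                               'Week': [str(date.isocalendar()[1])],
                               'Month': [calendar.month_abbr[date.month]]})
    #One-hot encode and align to the training columns (categories not present become 0)
    single_day_encoded = pd.get_dummies(single_day).reindex(columns=training_feature_columns,
                                                            fill_value=0)
    return fitted_model.predict(single_day_encoded)[0]

#Example usage with the trained model and the one-hot encoded feature columns
print(predict_peak_hour_for_date('2020-08-15', rf, features.columns))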

Performing This Analysis for Other US Electricity Markets

Performing this analysis for another desired state or region in the US is simple: just replace the TX electricity demand series ID with the series ID for the state or region that you want to analyze. The listing of electricity demand time series IDs is available via the following link on the EIA website:

https://www.eia.gov/opendata/qb.php?category=2122628
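For example, assuming the pull_eia_hourly_demand helper sketched near the top of this article, swapping in a different region might look like the following (the California series ID below is a hypothetical example; confirm the exact ID against the EIA listing above):

#Hypothetical example: pull hourly demand for the California region instead of Texas
california_demand_df = pull_eia_hourly_demand(api_key='YOUR_API_KEY_HERE',
                                              series_id='EBA.CAL-ALL.D.H')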

I have posted the code that I used to build the model above in my personal Github repo, as reference: https://github.com/kperry2215/daily_peak_electricity_demand_analysis

As always, thanks for reading!

