How to Web Scrape Monthly Oil & Gas Data for the Bakken Formation, Collected from the North Dakota Oil & Gas Division Website

In this post, I’ll walk you through web scraping the North Dakota Oil & Gas Division website, pulling tons of great oil and gas data related to North Dakota’s Bakken shale play.

North Dakota’s Oil and Gas Division’s website is located here. Its primary mission is to regulate the drilling and production of oil and gas wells in the state of North Dakota. On its site, it offers a true treasure trove of data sets, available via its general statistics page. Its monthly production report index, in particular, offers some amazing information on a very granular level–including monthly oil, gas, and water production data on a well-by-well basis, as well as days online for each well, and operator of each well. I’ve included a snapshot of one of the monthly Excel files that I pulled from the page, shown below:

A true wealth of oil & gas information!

There are individual monthly .xlsx and .pdf file extensions available for this data from 2003 to current, but managing the data pull from the internet and assembling all of the data into a single database can be quite unruly. So, to significantly speed up the process, I wrote a quick Python script that allows you to select the months you’d like to pull, and then web scrape the site to create a master Pandas dataframe that you can use to perform data analysis. The Python script is below (also available via my GitHub account):

import requests
import xlrd
import csv
import pandas as pd
from datetime import datetime
import os

def pull_desired_files_and_create_master_df(dates):
    This function web scrapes the target website, pulls all of the xlsx files based on the 
    selected dates using the xlrd function, converts to csv files, and reads in all csv's into a 
    master pandas dataframe that will be used for all subsequent calculations.
        dates: list. List of all of the dates (month/year, formatted based on website 
        xlsx formatting), which is looped through
        master_df: Pandas dataframe. 
        csv's of all of the files that were scraped, saved in the Documents folder.
    #Define the base url we will be pulling off of
    base_url = "" 
    #Create a master pandas dataframe  that will hold all of the data that we want
    #loop through all of the dates in the list
    for date in dates:        
        r = requests.get(desired_url)  # make an HTTP request
        #Open the contents of the xlsx file as an xlrd workbook    
        workbook = xlrd.open_workbook(file_contents=r.content)  
        #Take first worksheet in the workbook, as it's the one we'll be using
        worksheet = workbook.sheet_by_index(0) 
        #Obtain the year/month date data from the worksheet, and convert the ReportDate column 
        from float to datetime, using xlrd datemode functionality
        for i in range(1, worksheet.nrows):
            wrongValue = worksheet.cell_value(i,0)
            workbook_datemode = workbook.datemode
            year, month, day, hour, minute, second = xlrd.xldate_as_tuple(wrongValue, workbook_datemode)
            worksheet._cell_values[i][0]=datetime(year, month, 1).strftime("%m/%Y")
        #Generate a csv name to save under
        #Save as a csv
        csv_file = open(file_name, 'w',newline='')
        #Create writer to csv file
        wr = csv.writer(csv_file)
        #Loop through all the rows and write to csv file
        for rownum in range(worksheet.nrows):
        #Close the csv file
        #Read in csv as pandas dataframe
        #Append to the master dataframe
    #Return the final master dataframes
    return master_df

def main():
    #Create a new folder called 'Bakken' to drop all of the files in
    newpath = r'C:\Bakken' 
    if not os.path.exists(newpath):
    #Pull the months that we want to process--December 2016 to January 2019
    dates=['2016_12', '2017_01', '2017_02', '2017_03', '2017_04', '2017_05', '2017_06', 
           '2017_07', '2017_08', '2017_09', '2017_10', '2017_11', '2017_12',
           '2018_01', '2018_02', '2018_03', '2018_04', '2018_05', '2018_06', 
           '2018_07', '2018_08', '2018_09', '2018_10', '2018_11', '2018_12', '2019_01']
    #Run through the web scraper and save the desired csv files. Create a master dataframe with 
    all of the months' data
    #Declare the ReportDate column as a pandas datetime object
    #Write the master dataframe to a master csv 

if __name__== "__main__":

Let’s break down what the above Python code means.

From the main() block, I first create a new folder in the C:/ drive, called ‘Bakken’, to store all of the data.

From there, I select the months that I wish to pull–between December 2016 and January 2019–and add them to a list.

After that, I run the pull_desired_files_and_create_master_df() function. In this function, I ping the URL, retrieve the desired .xlsx file for the designated month, write it as an .xlsx file in the C:/Bakken folder, and then read the file’s first tab into an ongoing master pandas dataframe, called master_df.

After all of the dates have been read in a loop, the master_df dataframe is returned. This dataframe contains all of our concatenated data from the spreadsheets, ready to be used for data analysis in Python.

This concludes this post. For more on web scraping/automating oil & gas data pulls into Python, check out my other posts:

As always, thanks for reading!


Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.