In this post, I’ll walk you through web scraping the North Dakota Oil & Gas Division website, pulling tons of great oil and gas data related to North Dakota’s Bakken shale play.
The North Dakota Oil and Gas Division’s website is located here. Its primary mission is to regulate the drilling and production of oil and gas wells in the state of North Dakota. The site offers a true treasure trove of data sets, available via its general statistics page. Its monthly production report index, in particular, offers some amazing information at a very granular level, including monthly oil, gas, and water production on a well-by-well basis, as well as the days online and the operator for each well. I’ve included a snapshot of one of the monthly Excel files that I pulled from the page, shown below:

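If you’d like to poke at one of these files yourself before committing to the full pull, a minimal sketch along the following lines will download a single month’s workbook and print its column headers and first few rows. It assumes the site’s URL naming pattern (the base URL plus a YYYY_MM file name, which the full script later in this post also uses) and that pandas has an Excel engine such as openpyxl installed.

import io
import pandas as pd
import requests

# Example month, using the site's YYYY_MM file naming pattern (adjust as needed)
url = "https://www.dmr.nd.gov/oilgas/mpr/2019_01.xlsx"
response = requests.get(url)
response.raise_for_status()

# Read the first worksheet of the downloaded workbook into a dataframe
preview_df = pd.read_excel(io.BytesIO(response.content), sheet_name=0)

# Inspect the available columns and the first few well-level records
print(preview_df.columns.tolist())
print(preview_df.head())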
Individual monthly .xlsx and .pdf files are available for this data from 2003 to the present, but pulling the data from the internet and assembling it all into a single database by hand can be quite unwieldy. So, to significantly speed up the process, I wrote a quick Python script that lets you select the months you’d like to pull, and then web scrapes the site to create a master Pandas dataframe you can use to perform data analysis. The Python script is below (also available via my GitHub account):
import requests
import xlrd
import csv
import pandas as pd
from datetime import datetime
import os

def pull_desired_files_and_create_master_df(dates):
    """
    This function web scrapes the target website, pulls all of the xlsx files based on the
    selected dates using the xlrd library, converts them to csv files, and reads all of the csv's
    into a master pandas dataframe that will be used for all subsequent calculations.
    Arguments:
        dates: list. List of all of the dates (month/year, formatted based on the website's
               xlsx naming convention), which is looped through
    Outputs:
        master_df: Pandas dataframe.
        csv's of all of the files that were scraped, saved in the C:/Bakken folder.
    """
    #Define the base url we will be pulling off of
    base_url = "https://www.dmr.nd.gov/oilgas/mpr/"
    #Create a master pandas dataframe that will hold all of the data that we want
    master_df = pd.DataFrame()
    #Loop through all of the dates in the list
    for date in dates:
        desired_url = base_url + date + '.xlsx'
        #Make an HTTP request for the xlsx file
        r = requests.get(desired_url)
        #Open the contents of the xlsx file as an xlrd workbook
        workbook = xlrd.open_workbook(file_contents=r.content)
        #Take the first worksheet in the workbook, as it's the one we'll be using
        worksheet = workbook.sheet_by_index(0)
        #Obtain the year/month date data from the worksheet, and convert the ReportDate column
        #from float to datetime, using xlrd datemode functionality
        for i in range(1, worksheet.nrows):
            wrongValue = worksheet.cell_value(i, 0)
            workbook_datemode = workbook.datemode
            year, month, day, hour, minute, second = xlrd.xldate_as_tuple(wrongValue, workbook_datemode)
            worksheet._cell_values[i][0] = datetime(year, month, 1).strftime("%m/%Y")
        #Generate a csv name to save under
        file_name = 'C:/Bakken/' + date + '.csv'
        #Save as a csv
        csv_file = open(file_name, 'w', newline='')
        #Create a writer to the csv file
        wr = csv.writer(csv_file)
        #Loop through all the rows and write them to the csv file
        for rownum in range(worksheet.nrows):
            wr.writerow(worksheet.row_values(rownum))
        #Close the csv file
        csv_file.close()
        #Read in the csv as a pandas dataframe
        dataframe = pd.read_csv(file_name)
        #Append to the master dataframe
        master_df = master_df.append(dataframe)
    #Return the final master dataframe
    return master_df

def main():
    #Create a new folder called 'Bakken' to drop all of the files in
    newpath = r'C:\Bakken'
    if not os.path.exists(newpath):
        os.makedirs(newpath)
    #Pull the months that we want to process--December 2016 to January 2019
    dates = ['2016_12', '2017_01', '2017_02', '2017_03', '2017_04', '2017_05', '2017_06',
             '2017_07', '2017_08', '2017_09', '2017_10', '2017_11', '2017_12',
             '2018_01', '2018_02', '2018_03', '2018_04', '2018_05', '2018_06',
             '2018_07', '2018_08', '2018_09', '2018_10', '2018_11', '2018_12', '2019_01']
    #Run through the web scraper and save the desired csv files. Create a master dataframe with
    #all of the months' data
    master_dataframe_production = pull_desired_files_and_create_master_df(dates)
    #Declare the ReportDate column as a pandas datetime object
    master_dataframe_production['ReportDate'] = pd.to_datetime(master_dataframe_production['ReportDate'])
    #Write the master dataframe to a master csv
    master_dataframe_production.to_csv(newpath + '\\master_dataframe_production.csv')

if __name__ == "__main__":
    main()
Let’s break down what the above Python code means.
From the main() block, I first create a new folder on the C:\ drive, called ‘Bakken’, to store all of the data.
From there, I select the months that I wish to pull, December 2016 through January 2019, and add them to a list.
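As a side note, if you’d rather not type the month strings out by hand, the same list can be generated programmatically; here’s a small sketch using pandas’ date_range, where the '%Y_%m' format matches the site’s file naming used in the script above:

import pandas as pd

# Build the same list of YYYY_MM strings, December 2016 through January 2019
dates = [d.strftime('%Y_%m') for d in pd.date_range('2016-12-01', '2019-01-01', freq='MS')]
print(dates[:3], '...', dates[-1])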
After that, I run the pull_desired_files_and_create_master_df() function. In this function, I request the URL for each designated month, retrieve the desired .xlsx file, convert its first worksheet to a .csv file saved in the C:\Bakken folder, and then read that .csv into an ongoing master pandas dataframe, called master_df.
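One caveat if you’re running this today: xlrd versions 2.0 and later no longer read .xlsx files, so the xlrd-based conversion above may fail on a current install. A rough, equivalent sketch of the same per-month step, using pandas’ own Excel reader (assuming the openpyxl engine is installed and the dates list from main()) and pd.concat in place of the older .append, would look something like this:

import io
import pandas as pd
import requests

base_url = "https://www.dmr.nd.gov/oilgas/mpr/"
master_df = pd.DataFrame()
for date in dates:
    # Download the month's workbook and read its first sheet directly into pandas
    r = requests.get(base_url + date + '.xlsx')
    monthly_df = pd.read_excel(io.BytesIO(r.content), sheet_name=0, engine='openpyxl')
    # Save the per-month csv, mirroring the original script's behavior
    monthly_df.to_csv('C:/Bakken/' + date + '.csv', index=False)
    # Accumulate the month into the master dataframe
    master_df = pd.concat([master_df, monthly_df], ignore_index=True)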
After the loop has read in all of the dates, the master_df dataframe is returned. This dataframe contains all of the concatenated spreadsheet data, ready to be used for data analysis in Python. Back in main(), the ReportDate column is converted to a pandas datetime object and the full dataframe is written out as master_dataframe_production.csv.
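From there, a quick sanity check is to read the saved master csv back in and confirm the date range it covers; the path below matches where main() writes the file:

import pandas as pd

# Load the combined production data and parse the report dates
master_df = pd.read_csv(r'C:\Bakken\master_dataframe_production.csv', parse_dates=['ReportDate'])

# Confirm how many rows were pulled and the span of months covered
print(master_df.shape)
print(master_df['ReportDate'].min(), 'through', master_df['ReportDate'].max())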