I recently wrote a script that automates data pulls from the publicly available FracFocus database, which provides a comprehensive listing of the hydraulic fracturing chemicals pumped in unconventional oil and gas completions across the United States. This database is a great resource, not only for the public, but also for oil and gas companies looking to perform competitive intelligence analysis on other operators’ completions strategies. With this treasure trove of data readily available, I wanted to do some analysis with it. In today’s post I do just that: using basic Python data visualization packages like Matplotlib, I analyze operator completions information for the Permian Basin in West Texas.
First, a little background on the data itself.
The FracFocus database contains a number of required and optional data entry fields. These include the well name, the well identifier number (API number), latitude and longitude coordinates, the operator, the total vertical depth (TVD) of the well (how far down the well was drilled), the amount of water pumped during the frac job, and the amount of non-water additives pumped. The database also includes the different types of chemicals and frac sand pumped, as well as the vendors the products were purchased from. A snapshot of the data is provided below:

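If you want to inspect these fields yourself before doing any filtering, here is a minimal sketch that loads the scraped registry and previews the relevant columns (the file name and column names below are the ones used later in this post):

import pandas as pd

#Load the scraped FracFocus registry and preview the fields described above
registry = pd.read_csv('fracfocus_data_example.csv', low_memory=False)
columns_of_interest = ['APINumber', 'OperatorName', 'StateName', 'CountyName',
                       'Latitude', 'Longitude', 'TVD',
                       'TotalBaseWaterVolume', 'TotalBaseNonWaterVolume',
                       'Supplier', 'TradeName']
print(registry[columns_of_interest].head())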
The FracFocus database that I web scraped (see here for a tutorial) is a massive file with over 4 million rows, so I created a Python class to filter the database (loaded into Python as a pandas dataframe) by characteristics such as operator name, state, and county:
class fracfocus_data_search:
    """
    This class generates an object that is used to filter the master
    fracfocus dataframe so it only contains a certain state/state abbreviation,
    county (list), and/or operator
    """
    def __init__(self, state=None, state_abbreviation=None, county_list=None,
                 operator=None):
        #All arguments are optional, depending on what kind of filtering
        #we want to do
        self.state = state
        self.state_abbreviation = state_abbreviation
        self.county_list = county_list
        self.operator = operator

    def filter_dataframe(self, df, column_state,
                         column_county, column_operator):
        """
        Filter the data by parameters
        Arguments:
            df: Pandas dataframe that we want to subset
            column_state: String. Name of the column for state
            column_county: String. Name of the column for county
            column_operator: String. Name of the column for operator
        Outputs:
            df_subset: Pandas dataframe. Dataframe after it's been subsetted.
        """
        #Filter by all the parameters and return the subset
        df_subset = df[
            #By state
            (((df[column_state].str.lower()) == self.state.lower()) |
             ((df[column_state].str.lower()) == self.state_abbreviation.lower())) &
            #By county
            ((df[column_county].str.lower()).isin(map(str.lower, self.county_list))) &
            #By operator
            ((df[column_operator].str.lower()).str.contains(self.operator.lower()))
            ]
        return df_subset
For the purposes of this tutorial, I’m going to filter the data down to eight Permian Basin counties (Andrews, Borden, Crane, Dawson, Ector, Eddy, Gaines, and Glasscock) and to the operator XTO (ExxonMobil’s onshore unconventionals subsidiary).
#Pull all of the fracfocus data from a csv
fracfocus_registry = pd.read_csv('fracfocus_data_example.csv', low_memory=False)
#Create a filter object for the desired state, counties, and operator
fracfocus_filter = fracfocus_data_search(state='Texas', state_abbreviation='TX',
                                         county_list=['Andrews', 'Borden', 'Crane', 'Dawson',
                                                      'Ector', 'Eddy', 'Gaines', 'Glasscock'],
                                         operator='XTO')
#Filter dataframe by its parameters
subsetted_df = fracfocus_filter.filter_dataframe(fracfocus_registry, column_state='StateName',
                                                 column_county='CountyName',
                                                 column_operator='OperatorName')
When I apply the filter, 42,850 rows of data are returned. Clearly, XTO is a fairly active operator in the Permian Basin.
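For reference, the row count is simply the length of the filtered dataframe; a quick check:

#Row count for the filtered dataframe
print(len(subsetted_df))
#Number of distinct wells, if you prefer a well-level count
print(subsetted_df['APINumber'].nunique())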
Next, I want to visualize how much water XTO pumps during each of its completions over time.
I use the following generate_plot() function to create a visualization of base water volume pumped over time:
def generate_plot(df, x_variable, y_variables, plot_title):
    """
    This function is used to plot x- and y-variables against each other
    Arguments:
        df: Pandas dataframe.
        x_variable: String. Name of the column that we want to set as the
            x-variable in the plot
        y_variables: String (single), or list of strings (multiple). Name(s)
            of the column(s) that we want to set as the y-variable in the plot
        plot_title: String. Title displayed on the plot
    Outputs:
        Plot displayed in the console.
    """
    #Plot results
    df.plot(x=x_variable, y=y_variables, title=plot_title)
    plt.show()

...
#Execution in main script
generate_plot(dataframe, x_variable='JobStartDate',
              y_variables=['TotalBaseWaterVolume'],
              plot_title='Total Base Water Volume for Fracs over Time')

So what does the above plot tell us? Mainly, that over time XTO has become more aggressive in the amount of water it pumps in each of its fracs, increasing its largest jobs from less than 5 million gallons per frac in 2014 to over 25 million gallons per frac in 2018-2019.
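To put rough numbers on that trend, one quick check is the largest water volume pumped in any single frac each year; a minimal sketch, assuming the basic_frac_characteristics dataframe built in the full script at the end of this post:

#Largest base water volume (gallons) pumped in a single frac, by year
yearly_max_water = basic_frac_characteristics.groupby(
    basic_frac_characteristics['JobStartDate'].dt.year)['TotalBaseWaterVolume'].max()
print(yearly_max_water)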
Next, I generate a plot for non-water additives pumped in fracs over time, also using the generate_plot() function.
#Plot the 'TotalBaseNonWaterVolume' variable over time
generate_plot(dataframe, x_variable='JobStartDate',
              y_variables=['TotalBaseNonWaterVolume'],
              plot_title='Total Base Non-Water Volume for Fracs over Time')

There is less of a clear-cut trend here. With the exception of one massive outlier in mid-2019, the data looks fairly constant. To visualize the distribution of the non-water additive data with outliers removed, I generate a box-and-whisker plot using the following function:
def generate_boxplot(df, x_variable):
    """
    This function generates a box-and-whisker plot of a column's data, with
    outliers removed
    Arguments:
        df: Pandas dataframe
        x_variable: String. Name of the column that we want to generate the
            boxplot from
    Outputs:
        Box plot in console.
    """
    plt.boxplot(list(df[x_variable].dropna()), showfliers=False)
    plt.show()

...
#Execution in main script
generate_boxplot(dataframe, x_variable='TotalBaseNonWaterVolume')

Looking at the box-and-whisker plot above, it looks like the median frac uses roughly 12,000 gallons of non-water additives, with the distribution skewed toward higher values.
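The same figures can be pulled straight from the column rather than read off the plot; a quick sketch, again using the basic_frac_characteristics dataframe from the full script:

#Summary statistics for non-water additive volumes (gallons); the boxplot's
#center line corresponds to the median
print(basic_frac_characteristics['TotalBaseNonWaterVolume'].describe())
print(basic_frac_characteristics['TotalBaseNonWaterVolume'].median())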
One of the more interesting facets of this data set is the vendor information, from which we can glean all sorts of interesting insights.
For example, from whom does XTO buy its chemicals and frac sand for its completions? Did the company switch preferred vendors over time? I used the following code to generate a bar graph to gain insight into some of these questions:
def generate_bar_graph(df, title):
    """
    This function creates a bar graph from pandas dataframe columns.
    Arguments:
        df: Pandas dataframe. Index will be the x-axis. Categories and
            associated amounts are from the columns
        title: String. Name of the bar graph
    Outputs:
        Bar graph in console.
    """
    df.plot.bar(rot=0)
    plt.title(title, color='black')
    plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
    plt.show()

...
#Execution in main script
generate_bar_graph(vendor_data_stacked_bar_graph_pivoted,
                   title='Number of Times Vendor Was Purchased From Over Quarter')
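The vendor_data_stacked_bar_graph_pivoted dataframe passed in above is assembled in the full script at the end of this post; a condensed sketch of the same idea counts supplier rows per quarter and then pivots the result from long to wide format:

#Tag each row of the cleaned vendor data with the quarter the job started in
#(the full script also restricts vendor_data to 2018 before this step)
vendor_data['JobStartDateQuarter'] = vendor_data['JobStartDate'].dt.to_period('Q')
#Count how many times each supplier appears in each quarter
vendor_counts = vendor_data.groupby(['JobStartDateQuarter', 'Supplier']).size().reset_index(
    name='NumberTimesSupplierUsed')
#Pivot from long to wide format so each supplier becomes its own column
vendor_data_stacked_bar_graph_pivoted = vendor_counts.pivot(
    index='JobStartDateQuarter',
    columns='Supplier',
    values='NumberTimesSupplierUsed').fillna(0)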

We can glean a few insights from the above graph:
- It looks like XTO began relying heavily on P3 as a preferred vendor around Q2 of 2018. In Q1, P3 wasn’t used at all.
- Throughout 2018, XTO consistently relied on the following vendors when purchasing products: Ace, Chemplex, Finoric, Nalco, and Sandtrol.
- It looks like Universal Pressure Pumping was trialed as a completions vendor in Q2 of 2018. However, it was not used before or after Q2, so it was likely not picked up as a preferred vendor within the geographic area covered by this filter.
Although the bar chart above is limited in time and geographic scope, it still offers a snapshot of XTO’s completions and vendor strategy. With this information, O&G companies can gather important vendor insights that may help them plan future completions.
The full code for this article, shown below, is also available via my GitHub account. It is easily customizable, so you can filter by asset and operator with minimal editing.
import pandas as pd
import matplotlib.pyplot as plt


class fracfocus_data_search:
    """
    This class generates an object that is used to filter the master
    fracfocus dataframe so it only contains a certain state/state abbreviation,
    county (list), and/or operator
    """
    def __init__(self, state=None, state_abbreviation=None, county_list=None,
                 operator=None):
        #All arguments are optional, depending on what kind of filtering
        #we want to do
        self.state = state
        self.state_abbreviation = state_abbreviation
        self.county_list = county_list
        self.operator = operator

    def filter_dataframe(self, df, column_state,
                         column_county, column_operator):
        """
        Filter the data by parameters
        Arguments:
            df: Pandas dataframe that we want to subset
            column_state: String. Name of the column for state
            column_county: String. Name of the column for county
            column_operator: String. Name of the column for operator
        Outputs:
            df_subset: Pandas dataframe. Dataframe after it's been subsetted.
        """
        #Filter by all the parameters and return the subset
        df_subset = df[
            #By state
            (((df[column_state].str.lower()) == self.state.lower()) |
             ((df[column_state].str.lower()) == self.state_abbreviation.lower())) &
            #By county
            ((df[column_county].str.lower()).isin(map(str.lower, self.county_list))) &
            #By operator
            ((df[column_operator].str.lower()).str.contains(self.operator.lower()))
            ]
        return df_subset
def clean_vendor_data(df, column, column_search_string, column_string_rename):
    """
    This function is used to search the vendor data for a specific keyword, and
    if it's found, change the row value to a standardized vendor name. Used to clean up
    the data when a vendor is entered in multiple different ways.
    Arguments:
        df: Pandas dataframe.
        column: String. Name of the column that we're cleaning up
        column_search_string: String. String that we're looking to match in the column
        column_string_rename: String. What we want to rename the string to if we find
            any string matches in the column
    Outputs:
        df: Pandas dataframe. Dataframe with the cleaned-up column
    """
    df.loc[df[column].str.contains(column_search_string), column] = column_string_rename
    return df
def generate_plot(df, x_variable, y_variables, plot_title):
    """
    This function is used to plot x- and y-variables against each other
    Arguments:
        df: Pandas dataframe.
        x_variable: String. Name of the column that we want to set as the
            x-variable in the plot
        y_variables: String (single), or list of strings (multiple). Name(s)
            of the column(s) that we want to set as the y-variable in the plot
        plot_title: String. Title displayed on the plot
    Outputs:
        Plot displayed in the console.
    """
    #Plot results
    df.plot(x=x_variable, y=y_variables, title=plot_title)
    plt.show()
def generate_boxplot(df, x_variable):
    """
    This function generates a box-and-whisker plot of a column's data, with
    outliers removed
    Arguments:
        df: Pandas dataframe
        x_variable: String. Name of the column that we want to generate the
            boxplot from
    Outputs:
        Box plot in console.
    """
    plt.boxplot(list(df[x_variable].dropna()), showfliers=False)
    plt.show()
def generate_bar_graph(df, title):
    """
    This function creates a bar graph from pandas dataframe columns.
    Arguments:
        df: Pandas dataframe. Index will be the x-axis. Categories and
            associated amounts are from the columns
        title: String. Name of the bar graph
    Outputs:
        Bar graph in console.
    """
    df.plot.bar(rot=0)
    plt.title(title, color='black')
    plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
    plt.show()
def main():
    """
    Main definition, where we subset the data, analyze it, and generate plots
    """
    #Pull all of the fracfocus data from a csv
    fracfocus_registry = pd.read_csv('fracfocus_data_example.csv', low_memory=False)
    #Create a filter object for the desired state, counties, and operator
    fracfocus_filter = fracfocus_data_search(state='Texas', state_abbreviation='TX',
                                             county_list=['Andrews', 'Borden', 'Crane', 'Dawson',
                                                          'Ector', 'Eddy', 'Gaines', 'Glasscock'],
                                             operator='XTO')
    #Filter dataframe by its parameters
    subsetted_df = fracfocus_filter.filter_dataframe(fracfocus_registry, column_state='StateName',
                                                     column_county='CountyName',
                                                     column_operator='OperatorName')
    #Convert the date column to a pandas datetime
    subsetted_df['JobStartDate'] = pd.to_datetime(subsetted_df['JobStartDate'])
    #Now that we have our desired dataframe, it's time to analyze the data
    #Let's look at the amount of liquids that the operator pumps in fracs over time
    basic_frac_characteristics = subsetted_df[['JobStartDate', 'JobEndDate', 'APINumber',
                                               'TotalBaseNonWaterVolume', 'TVD',
                                               'TotalBaseWaterVolume',
                                               'Latitude', 'Longitude']].drop_duplicates()
    #Plot the 'TotalBaseWaterVolume' variable over time
    generate_plot(basic_frac_characteristics, x_variable='JobStartDate',
                  y_variables=['TotalBaseWaterVolume'],
                  plot_title='Total Base Water Volume for Fracs over Time')
    #Plot the 'TotalBaseNonWaterVolume' variable over time
    generate_plot(basic_frac_characteristics, x_variable='JobStartDate',
                  y_variables=['TotalBaseNonWaterVolume'],
                  plot_title='Total Base Non-Water Volume for Fracs over Time')
    #Plot the distribution of TotalBaseNonWaterVolume using a box-and-whisker plot, with outliers removed
    generate_boxplot(basic_frac_characteristics, x_variable='TotalBaseNonWaterVolume')
    #PERFORM VENDOR ANALYSIS BELOW
    #Subset the data set to get unique rows for vendor data
    vendor_data = subsetted_df[['JobStartDate', 'JobEndDate', 'APINumber',
                                'Latitude', 'Longitude', 'Supplier', 'TradeName']].drop_duplicates()
    #PERFORM SOME DATA CLEANING ON THE VENDOR DATA
    #Remove NaN supplier values
    vendor_data = vendor_data.dropna(subset=['Supplier'])
    #Make all Supplier data uppercase
    vendor_data['Supplier'] = vendor_data['Supplier'].str.upper()
    #Use the clean_vendor_data() function to clean up the vendor categories so they're more standardized
    vendor_lookup_dict = {'RISING STAR': 'RISING STAR',
                          'CHEMPLEX': 'CHEMPLEX',
                          'SAN.*TROL': 'SANDTROL',
                          'MULTI.*CHEM': 'MULTI-CHEM',
                          'XTO': 'OPERATOR',
                          'PFP': 'PFP',
                          'CESI': 'CESI',
                          'NALCO': 'NALCO',
                          'FRITZ': 'FRITZ INDUSTRIES',
                          'ASK': 'ASK',
                          'ACE': 'ACE',
                          'BRENNTAG': 'BRENNTAG',
                          'COIL.*CHEM': 'COILCHEM',
                          'COOPER': 'COOPER NATURAL RESOURCES',
                          'ECONOMY': 'ECONOMY POLYMERS',
                          'FINORIC': 'FINORIC',
                          'EES': 'ENVIRONMENTAL ENERGY SERVICE',
                          'PREFERRED': 'PREFERRED SANDS',
                          'ROCKWATER': 'ROCKWATER',
                          'SNF': 'SNF',
                          'MULTIPLE': 'MULTIPLE SUPPLIERS',
                          'REAGENT': 'REAGENT',
                          'PRO.*FRAC': 'PROFRAC'}
    #Loop through the dictionary and rename suppliers based on character lookups
    for vendor_lookup, vendor_name in vendor_lookup_dict.items():
        vendor_data = clean_vendor_data(vendor_data, 'Supplier', vendor_lookup, vendor_name)
    #Make a column converting the job start date to quarters (this will be the x-axis for the bar graph)
    vendor_data['JobStartDateQuarter'] = vendor_data['JobStartDate'].dt.to_period('Q')
    #Subset to only include 2018 data
    vendor_data = vendor_data[(vendor_data.JobStartDate >= pd.to_datetime('2018-01-01')) &
                              (vendor_data.JobStartDate < pd.to_datetime('2019-01-01'))]
    #Count the number of purchases that the operator makes over time from each vendor
    vendor_data['NumberTimesSupplierUsed'] = vendor_data[['JobStartDateQuarter', 'Supplier']].groupby([
        'JobStartDateQuarter', 'Supplier'])['Supplier'].transform('count')
    #Subset the data to only include vendor counts by quarter
    vendor_data_stacked_bar_graph = vendor_data[['Supplier', 'JobStartDateQuarter',
                                                 'NumberTimesSupplierUsed']].drop_duplicates()
    #Pivot the data from long to wide format
    vendor_data_stacked_bar_graph_pivoted = vendor_data_stacked_bar_graph.pivot(
        index='JobStartDateQuarter',
        columns='Supplier',
        values='NumberTimesSupplierUsed').fillna(0)
    #Filter to only include suppliers purchased from at least 20 times
    total_col_sums = pd.DataFrame(vendor_data_stacked_bar_graph_pivoted.sum() >= 20)
    vendor_data_stacked_bar_graph_pivoted = vendor_data_stacked_bar_graph_pivoted.drop(list(
        total_col_sums[total_col_sums[0] == False].index), axis=1)
    #Plot the vendor data in a bar chart
    generate_bar_graph(vendor_data_stacked_bar_graph_pivoted,
                       title='Number of Times Vendor Was Purchased From Over Quarter')

#Run main
if __name__ == "__main__":
    main()
This concludes this tutorial. As always, thanks for reading!
For other oil & gas tutorials, check out the links below.
https://techrando.com/2019/06/23/how-to-automate-data-pulls-from-the-online-fracfocus-database/