Using FracFocus and Python’s Matplotlib to Visualize Oil & Gas Companies’ Completions Strategies in the Permian Basin

I recently wrote a script that automates data pulls from the publicly available FracFocus database, a national hydraulic fracturing chemical disclosure registry that provides a comprehensive listing of the chemicals pumped in unconventional oil and gas completions jobs in the United States. This database is a great resource, not only for the public, but also for oil and gas companies looking to perform competitive intelligence analysis on other operators’ completions strategies. With this treasure trove of data readily available, I wanted to do some analysis with it. In today’s post I do just that: using basic Python data visualization packages like Matplotlib, I analyze operator completions information for the Permian Basin in West Texas.

First, a little background on the data itself.

The FracFocus database contains several required and optional fields for data entry. Fields include well name, well identifier (API number), latitude and longitude coordinates, operator, total vertical depth (TVD) of the well (how far down the well was drilled), the amount of water pumped during the frac job, and the amount of non-water additives pumped. Additionally, the database includes the different types of chemicals and frac sand pumped, as well as the vendors the products were purchased from. A snapshot of the data is provided below:

A snapshot of some of the FracFocus data
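
If you’ve already pulled the registry down to a local CSV (the scraping tutorial is linked at the end of this post), a quick way to get oriented is to load it with pandas and preview the fields described above. This is just a sketch; the file name matches the example CSV used later in this post, and the columns listed are the ones referenced throughout this analysis.

import pandas as pd

#Load the scraped FracFocus registry into a pandas dataframe
fracfocus_registry=pd.read_csv('fracfocus_data_example.csv', low_memory=False)
#Preview the columns used throughout this analysis
columns_of_interest=['APINumber', 'OperatorName', 'StateName', 'CountyName',
                     'Latitude', 'Longitude', 'TVD', 'JobStartDate', 'JobEndDate',
                     'TotalBaseWaterVolume', 'TotalBaseNonWaterVolume',
                     'Supplier', 'TradeName']
print(fracfocus_registry[columns_of_interest].head())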

The FracFocus database that I web scraped (see here for tutorial) is a massive file with over 4 million rows, so I created a Python object to filter the database (loaded into Python as a pandas dataframe), using characteristics such as operator name, state, and county:

class fracfocus_data_search:
    """
    This class generates an object that is used to filter the master
    fracfocus dataframe so it only contains a certain state/state abbreviation,
    county (list), and/or operator
    """

    def __init__(self, state=None, state_abbreviation=None, county_list=None,
                 operator=None):
        #All initialization parameters are optional, depending on what kind of
        #filtering we want to do
        self.state = state
        self.state_abbreviation = state_abbreviation
        self.county_list=county_list
        self.operator=operator

    def filter_dataframe(self, df, column_state, 
                        column_county, column_operator):
        """
        Filter the data by parameters
        Arguments:
            df: Pandas dataframe that we want to subset
            column_state: String. Name of the column for state
            column_county: String. Name of column for county
            column_operator: String. Name of column for operator
        Outputs:
            df_subset: Pandas dataframe. Dataframe after it's been subsetted.
        """
        #Filter by all the parameters and return the subset
        df_subset=df[
                #By state
                (((df[column_state].str.lower())==self.state.lower()) | 
                ((df[column_state].str.lower())==self.state_abbreviation.lower())) &
                #By county
                ((df[column_county].str.lower()).isin(map(str.lower,self.county_list))) &
                #By operator
                ((df[column_operator].str.lower()).str.contains(self.operator.lower()))  
                ]
        return df_subset

For the purpose of this tutorial, I’m going to filter the data down to eight Permian Basin counties (Andrews, Borden, Crane, Dawson, Ector, Eddy, Gaines, and Glasscock) and the operator XTO (ExxonMobil’s onshore unconventionals subsidiary).

#Pull all of the fracfocus data from a csv
fracfocus_registry=pd.read_csv('fracfocus_data_example.csv', low_memory=False)
#Create a filter object with the desired state, counties, and operator
fracfocus_filter=fracfocus_data_search(state='Texas', state_abbreviation='TX',
                          county_list=['Andrews', 'Borden', 'Crane', 'Dawson',
                                       'Ector', 'Eddy', 'Gaines', 'Glasscock'], operator='XTO')
#Filter dataframe by its parameters
subsetted_df=fracfocus_filter.filter_dataframe(fracfocus_registry, column_state='StateName', 
                        column_county='CountyName', column_operator='OperatorName')

42,850 rows of data are returned when I apply the filter. Clearly, XTO is a fairly active operator in the Permian Basin.
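
If you want to verify the size of the filtered set yourself, a couple of quick sanity checks on the subsetted_df dataframe created above will do it:

#Sanity-check the size of the filtered dataframe
print(len(subsetted_df))
#A per-county breakdown is also a useful sanity check
print(subsetted_df['CountyName'].value_counts())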

Next, I want to visualize how much water XTO pumps during each of its completions over time.

I use the following function generate_plot() to create a visualization of base water pumped over time:

def generate_plot(df, x_variable, y_variables, plot_title):
    """
    This function is used to plot x- and y-variables against each other
    Arguments:
        df: Pandas dataframe.
        x_variable: String. Name of the column that we want to set as the 
        x-variable in the plot
        y_variables: string (single), or list of strings (multiple). Name(s) 
        of the column(s) that we want to set as the y-variable in the plot
    Outputs:
        Plot displayed in console.
    """
    #Plot results
    df.plot(x=x_variable, y=y_variables, title=plot_title)
    plt.show()

...

#Execution in main script
generate_plot(dataframe, x_variable='JobStartDate', 
              y_variables=['TotalBaseWaterVolume'], 
              plot_title='Total Base Water Volume for Fracs over Time')
Output of the generate_plot() function: Total Base Water Volume Pumped per Frac over Time

So what does the above plot tell us? Mainly that, over time, XTO has become more aggressive in the amount of water it pumps per frac, with its largest jobs growing from less than 5 million gallons in 2014 to over 25 million gallons in 2018-2019.
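
To put numbers on that trend, one option is to group the water volumes by year and look at the yearly maximum and median. The sketch below uses the basic_frac_characteristics dataframe built in the full script at the end of this post, and assumes JobStartDate has already been converted with pd.to_datetime (as it is there).

#Yearly maximum and median base water volume pumped per frac
water_by_year=(basic_frac_characteristics
               .assign(JobStartYear=lambda d: d['JobStartDate'].dt.year)
               .groupby('JobStartYear')['TotalBaseWaterVolume']
               .agg(['max', 'median']))
print(water_by_year)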

Next, I generate a plot for non-water additives pumped in fracs over time, also using the generate_plot() function.

#Plot the 'TotalBaseNonWaterVolume' variable over time
generate_plot(dataframe, x_variable='JobStartDate', 
              y_variables=['TotalBaseNonWaterVolume'], 
              plot_title='Total Base Non-Water Volume for Fracs over Time')
Total Non-Water Volume Pumped per Frac over Time

There is less of a clear-cut trend here. With the exception of one massive outlier in mid-2019, the data looks fairly constant. To visualize the distribution of the non-water additive data with outliers removed, I generate a box-and-whisker plot using the following function:

def generate_boxplot(df, x_variable):
    """
    This function generates a box-and-whisker plot of a column's data, with
    outliers removed
    Arguments:
        df: Pandas dataframe
        x_variable: String. Name of the column that we want to generate 
        boxplot from
    Outputs:
        Box plot in console.
    """  
    plt.boxplot(list(df[x_variable].dropna()), showfliers=False)
    plt.show()

...
#Execution in main script
generate_boxplot(dataframe, x_variable='TotalBaseNonWaterVolume')
Box-and-whisker plot of the amount of non-water additives pumped per frac for XTO in the Permian

Looking at the box-and-whisker plot above, the median frac uses roughly 12,000 gallons of non-water additives, with a distribution skewed toward higher values.
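
To put numbers on what the box plot shows, the quartiles can be pulled directly from the same column. This is a sketch using the basic_frac_characteristics dataframe from the full script below.

#Quartiles of non-water additive volumes (the box edges and the median line of the box plot)
non_water_volumes=basic_frac_characteristics['TotalBaseNonWaterVolume'].dropna()
print(non_water_volumes.quantile([0.25, 0.5, 0.75]))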

One of the more interesting facets of this data set is the vendor information, from which we can glean all sorts of interesting insights.

For example, from whom does XTO buy its chemicals and frac sand for its completions? Did the company switch preferred vendors over time? I used the following code to generate a bar graph to gain insight into some of these questions:

def generate_bar_graph(df, title):
    """
    This function creates a bar graph from pandas dataframe columns.
    Arguments:
        df: Pandas dataframe. Index will be x-axis. Categories and 
        associated amounts are from columns
        title: String. Name of the bar graph
    Outputs:
        Bar graph in console.
    """
    df.plot.bar(rot=0)
    plt.title(title, color='black')
    plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
    plt.show()

...
#Execution in main script
generate_bar_graph(vendor_data_stacked_bar_graph_pivoted, 
                       title='Number of Times Vendor Was Purchased From Over Quarter')
Bar graph illustrating the number of times XTO purchased from each vendor over 2018
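
The vendor_data_stacked_bar_graph_pivoted dataframe passed to generate_bar_graph() is built in the full script at the end of this post; in condensed form, the reshaping is roughly equivalent to the sketch below, which counts how many times each supplier appears per quarter and pivots the result from long to wide format.

#Count supplier appearances per quarter and pivot to wide format for the grouped bar chart
#(vendor_data is the cleaned Supplier dataframe built in the full script)
vendor_counts=(vendor_data
               .groupby(['JobStartDateQuarter', 'Supplier'])
               .size()
               .reset_index(name='NumberTimesSupplierUsed'))
vendor_data_stacked_bar_graph_pivoted=vendor_counts.pivot(
        index='JobStartDateQuarter',
        columns='Supplier',
        values='NumberTimesSupplierUsed').fillna(0)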

We can glean a few insights from the above graph:

  1. It looks like XTO switched to heavily using P3 as a preferred vendor around Q2 of 2018. In Q1, P3 wasn’t used at all.
  2. Throughout 2018, XTO consistently relied on the following vendors when purchasing products: Ace, Chemplex, Finoric, Nalco, and Sandtrol.
  3. It looks like Universal Pressure Pumping was trialed as a completions vendor in Q2 of 2018. However, it was not used before or after Q2, so it likely was not retained as a preferred vendor in the area covered by this filter.

Although the bar chart above is limited in time and geographic scope, it still offers a snapshot of XTO’s completions and vendor strategy. With this information, O&G companies can gather important vendor insights that may help them plan future completions.
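
Observations like the first one can also be pulled out of the data programmatically rather than by eyeballing the chart. For example, the sketch below finds the first quarter in which each supplier shows up in the pivoted table used for the bar graph above.

#For each supplier, find the first quarter with at least one recorded purchase
first_quarter_used=(vendor_data_stacked_bar_graph_pivoted>0).idxmax()
print(first_quarter_used.sort_values())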

The code for this article, shown below, is also available on my GitHub account. The code is easily customizable, so you can filter by asset and operator with minimal editing.

    
import pandas as pd 
import matplotlib.pyplot as plt 

class fracfocus_data_search:
    """
    This class generates an object that is used to filter the master
    fracfocus dataframe so it only contains a certain state/state abbreviation,
    county (list), and/or operator
    """

    def __init__(self, state=None, state_abbreviation=None, county_list=None,
                 operator=None):
        #All initialization parameters are optional, depending on what kind of
        #filtering we want to do
        self.state = state
        self.state_abbreviation = state_abbreviation
        self.county_list=county_list
        self.operator=operator

    def filter_dataframe(self, df, column_state, 
                        column_county, column_operator):
        """
        Filter the data by parameters
        Arguments:
            df: Pandas dataframe that we want to subset
            column_state: String. Name of the column for state
            column_county: String. Name of column for county
            column_operator: String. Name of column for operator
        Outputs:
            df_subset: Pandas dataframe. Dataframe after it's been subsetted.
        """
        #Filter by all the parameters and return the subset
        df_subset=df[
                #By state
                (((df[column_state].str.lower())==self.state.lower()) | 
                ((df[column_state].str.lower())==self.state_abbreviation.lower())) &
                #By county
                ((df[column_county].str.lower()).isin(map(str.lower,self.county_list))) &
                #By operator
                ((df[column_operator].str.lower()).str.contains(self.operator.lower()))  
                ]
        return df_subset

def clean_vendor_data(df, column, column_search_string, column_string_rename):
    """
    This function is used to search the vendor data for a specific keyword, and 
    if it's found, change the row value to that specific keyword. Used to clean up the data
    if a vendor is added in multiple different ways.
    Arguments:
        df: Pandas dataframe.
        column: String. Name of the column that we're cleaning up
        column_search_string: String. String that we're looking to match in the column
        column_string_rename: String. What we want to rename the string to if we find 
        any string matches in the column
    Outputs:
        df: Pandas dataframe. Dataframe with returned cleaned up column
    """
    df.loc[df[column].str.contains(column_search_string), column] = column_string_rename
    return df

def generate_plot(df, x_variable, y_variables, plot_title):
    """
    This function is used to plot x- and y-variables against each other
    Arguments:
        df: Pandas dataframe.
        x_variable: String. Name of the column that we want to set as the 
        x-variable in the plot
        y_variables: string (single), or list of strings (multiple). Name(s) 
        of the column(s) that we want to set as the y-variable in the plot
    Outputs:
        Plot displayed in console.
    """
    #Plot results
    df.plot(x=x_variable, y=y_variables, title=plot_title)
    plt.show()

def generate_boxplot(df, x_variable):
    """
    This function generates a box-and-whisker plot of a column's data, with
    outliers removed
    Arguments:
        df: Pandas dataframe
        x_variable: String. Name of the column that we want to generate 
        boxplot from
    Outputs:
        Box plot in console.
    """  
    plt.boxplot(list(df[x_variable].dropna()), showfliers=False)
    plt.show()
    
def generate_bar_graph(df, title):
    """
    This function creates a bar graph from pandas dataframe columns.
    Arguments:
        df: Pandas dataframe. Index will be x-axis. Categories and 
        associated amounts are from columns
        title: String. Name of the bar graph
    Outputs:
        Bar graph in console.
    """
    df.plot.bar(rot=0)
    plt.title(title, color='black')
    plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
    plt.show()
    
def main():
    """ 
    Main function, where we subset the data, analyze it, and generate plots
    """
    #Pull all of the fracfocus data from a csv
    fracfocus_registry=pd.read_csv('fracfocus_data_example.csv', low_memory=False)
    #Create a filter object with the desired state, counties, and operator
    fracfocus_filter=fracfocus_data_search(state='Texas', state_abbreviation='TX',
                          county_list=['Andrews', 'Borden', 'Crane', 'Dawson',
                                       'Ector', 'Eddy', 'Gaines', 'Glasscock'],
                          operator='XTO')
    #Filter dataframe by its parameters
    subsetted_df=fracfocus_filter.filter_dataframe(fracfocus_registry, column_state='StateName', 
                        column_county='CountyName', column_operator='OperatorName')
    #Convert the job start date column to a pandas datetime
    subsetted_df['JobStartDate']=pd.to_datetime(subsetted_df['JobStartDate'])
    #Now that we have our desired dataframe, it's time to analyze the data
    #Subset the columns needed to analyze how much liquid the operator
    #pumps in its fracs over time
    basic_frac_characteristics=subsetted_df[['JobStartDate', 'JobEndDate', 'APINumber', 
                                             'TotalBaseNonWaterVolume', 'TVD',
                                             'TotalBaseWaterVolume', 
                                             'Latitude', 'Longitude']].drop_duplicates()
    #Plot the 'TotalBaseWaterVolume' variable over time
    generate_plot(basic_frac_characteristics, x_variable='JobStartDate', 
              y_variables=['TotalBaseWaterVolume'], 
              plot_title='Total Base Water Volume for Fracs over Time')
    #Plot the 'TotalBaseNonWaterVolume' variable over time
    generate_plot(basic_frac_characteristics, x_variable='JobStartDate', 
              y_variables=['TotalBaseNonWaterVolume'], 
              plot_title='Total Base Non-Water Volume for Fracs over Time')
    #Plot the distribution of TotalBaseNonWaterVolume using a box-and-whisker plot, with outliers removed
    generate_boxplot(basic_frac_characteristics, x_variable='TotalBaseNonWaterVolume')
    
    #PERFORM VENDOR ANALYSIS BELOW
    #Subset the data set to get unique rows for vendor data
    vendor_data=subsetted_df[['JobStartDate', 'JobEndDate', 'APINumber', 
                              'Latitude', 'Longitude', 'Supplier', 'TradeName']].drop_duplicates()
    #PERFORM SOME DATA CLEANING ON THE VENDOR DATA
    #Remove NaN supplier values
    vendor_data=vendor_data.dropna(subset=['Supplier'])
    #Make all Supplier data uppercase
    vendor_data['Supplier']=vendor_data['Supplier'].str.upper()
    #Use the clean_vendor_data() function to clean up the vendor categories so they're more standardized
    vendor_lookup_dict={'RISING STAR': 'RISING STAR', 
                             'CHEMPLEX': 'CHEMPLEX',
                             'SAN.*TROL': 'SANDTROL',
                             'MULTI.*CHEM': 'MULTI-CHEM',
                             'XTO': 'OPERATOR',
                             'PFP': 'PFP',
                             'CESI': 'CESI',
                             'NALCO': 'NALCO', 
                             'FRITZ': 'FRITZ INDUSTRIES',
                             'ASK': 'ASK',
                             'ACE': 'ACE',
                             'BRENNTAG': 'BRENNTAG', 
                             'COIL.*CHEM': 'COILCHEM',
                             'COOPER': 'COOPER NATURAL RESOURCES',
                             'ECONOMY': 'ECONOMY POLYMERS',
                             'FINORIC': 'FINORIC',
                             'EES': 'ENVIRONMENTAL ENERGY SERVICE',
                             'PREFERRED': 'PREFERRED SANDS', 
                             'ROCKWATER': 'ROCKWATER',
                             'SNF': 'SNF',
                             'MULTIPLE': 'MULTIPLE SUPPLIERS',
                             'REAGENT': 'REAGENT',
                             'PRO.*FRAC': 'PROFRAC'}
    #Loop through the dictionary and change the name accordingly based on character lookups
    for vendor_lookup, vendor_name in vendor_lookup_dict.items():
        vendor_data=clean_vendor_data(vendor_data, 'Supplier', vendor_lookup, vendor_name)
    #Make a column converting the job start date to quarters (this will be the x-axis for the bar graph)
    vendor_data['JobStartDateQuarter'] = vendor_data['JobStartDate'].dt.to_period('Q')
    #Subset to only include 2018 data
    vendor_data=vendor_data[(vendor_data.JobStartDate>=pd.to_datetime('2018-01-01')) & 
                            (vendor_data.JobStartDate<pd.to_datetime('2019-01-01'))]
    #Count the number of purchases that the operator makes over time from each vendor
    vendor_data['NumberTimesSupplierUsed']=vendor_data[['JobStartDateQuarter', 'Supplier']].groupby([
            'JobStartDateQuarter', 'Supplier'])['Supplier'].transform('count')
    #Subset the data to only include vendor counts by quarter
    vendor_data_stacked_bar_graph=vendor_data[['Supplier', 'JobStartDateQuarter', 
                                    'NumberTimesSupplierUsed']].drop_duplicates()
    #Pivot the data from long to wide format
    vendor_data_stacked_bar_graph_pivoted=vendor_data_stacked_bar_graph.pivot(
                                                            index='JobStartDateQuarter', 
                                                            columns='Supplier', 
                                                            values='NumberTimesSupplierUsed').fillna(0)
    #Filter to only include suppliers purchased from more than 20 times
    total_col_sums=pd.DataFrame(vendor_data_stacked_bar_graph_pivoted.sum()>=20)
    vendor_data_stacked_bar_graph_pivoted=vendor_data_stacked_bar_graph_pivoted.drop(list(
            total_col_sums[total_col_sums[0]==False].index), axis=1)
    #Plot the vendor data in a bar chart
    generate_bar_graph(vendor_data_stacked_bar_graph_pivoted, 
                       title='Number of Times Vendor Was Purchased From Over Quarter')
    
#Run main
if __name__ == "__main__":
    main()

This concludes the tutorial. As always, thanks for reading!

For other oil & gas tutorials, check out the links below.

https://techrando.com/2019/06/23/how-to-automate-data-pulls-from-the-online-fracfocus-database/

https://techrando.com/2019/07/03/how-to-automate-decline-curve-analysis-dca-in-python-using-scipys-optimize-curve_fit-function/
