How To Automate Data Pulls from the Online FracFocus Database

Automatically Web Scrape the FracFocus Site, and Pull the Data into Python Pandas for Analysis

In this post, I go over automating data pulls from the online FracFocus database, creating a master Pandas dataframe that can be used for data analysis.

FracFocus is the national hydraulic fracturing chemical registry. According to its site, it was created “to provide the public access to reported chemicals used for hydraulic fracturing within their area.” The database is a wealth of machine readable information, with over 127,000 wells registered. It is a great resource not only for the public, who is concerned about groundwater protection, but also for private companies interested in the chemicals and the vendors used during the hydraulic fracturing process (as well as data scientists and hobbyists who just want to play around with a publicly-available data set).

On its website, FracFocus releases .zip files with updated data on the 1st and 15th of every month. The .zip file contains several csv’s that need to be appended together, making the initial data pull somewhat unruly. The code that I provide in this post not only allows you to pull the FracFocus data directly from the site (no messy downloading .zip files!), but it also appends all of the csv’s together into a master Pandas dataframe for easy manipulation.

Here is the code, also available via my GitHub account:

#Import desired packages 
import requests
import zipfile
import io
import pandas as pd

def pull_zip_file_from_url(url):
    """
    This function pulls a zip file from a URL and generates a ZipFile object
    Arguments:
        url: String. Name of the URL that we want to pull from
    Outputs:
        zip_file: ZipFile object, generated from the URL
    """
    request = requests.get(url)
    zip_file = zipfile.ZipFile(io.BytesIO(request.content))
    return zip_file

def append_filename_to_desired_list(list_to_append_to, key_word, file_name):
    """
    This function detects if the filename in the .zip folder is a csv, and if it 
contains a specific key_word string. If it 
    does, it appends to the desired master list of filenames
    Arguments:
        list_to_append_to: List. Master list containing filenames that end with .csv extension, and contain the key_word value
        key_word: String. Identifier that the filename must contain in order to be append to the list_to_append_to list.
        file_name: String. Name of the file that we're checking, contained in the .zip folder
    Outputs:
        list_to_append_to: List. Master list, with new filename appended if it is a .csv extension and contains the key_word string.
    """
    if ((file_name.endswith('.csv')) & (key_word in file_name)):
        list_to_append_to.append(file_name)
    return list_to_append_to

def append_dataframes_into_master(list_of_filenames, zip_file):
    """
    This function generates a master dataframe, pulled from all of the dataframes with filenames in the list_of_filenames list
    Arguments: 
        list_of_filenames: List. List of filenames that we want to pull the data from. 
        zip_file: ZipFile object. Zipfile that we're pulling files from
    Outputs:
        df_master: Pandas dataframe. Master dataframe containing all the data associated with the filenames from list_of_filenames 
        list.
    """
    list_of_dfs=[pd.read_csv(zip_file.open(x), low_memory=False) for x in list_of_filenames]
    df_master = pd.concat([r for r in list_of_dfs], ignore_index=True)
    return df_master

def main():
    #Name the FracFocus URL
    fracfocus_url='http://fracfocusdata.org/digitaldownload/fracfocuscsv.zip'
    #Create a ZipFile object from the pulled URL .zip file
    zip_file = pull_zip_file_from_url(fracfocus_url)
    #Get the list of file names in the zip file
    list_of_file_names = zip_file.namelist()
    #Create lists for storing the registryupload files, and the FracFocus registry files
    list_fracfocus_registry_files=[]
    list_registry_upload_files=[]
    # Iterate over the file names
    for file_name in list_of_file_names:
        # Check filename endswith csv, and is one of the FracFocusRegistry files
       list_fracfocus_registry_files=append_filename_to_desired_list(list_fracfocus_registry_files, 'FracFocusRegistry', file_name)
        #Detect if the filename ends with csv, and is one of the registryupload files
        list_registry_upload_files=append_filename_to_desired_list(list_registry_upload_files, 'registryupload', file_name)
    #Append the registry upload files into a master df
    registry_upload_master=append_dataframes_into_master(list_registry_upload_files, zip_file)
    #Append the FracFocus registry files into a master df
    fracfocus_registry_master=append_dataframes_into_master(list_fracfocus_registry_files, zip_file)
    #Write both files to their respective master csv's
    registry_upload_master.to_csv('registry_upload_master.csv')
    fracfocus_registry_master.to_csv('fracfocus_registry_master.csv')
    
if __name__== "__main__":
    main()

In the main() block of this script, the .zip file is first pulled via the URL, and a ZipFile object is created. All of the file names in the ZipFile object are then iterated through, and two lists are generated: one list containing the FracFocusRegistry file names, and another list containing the RegistryUpload file names. Then, using the append_dataframes_into_master() function, the files from each list are extracted into a master pandas dataframe. Two Pandas dataframes are created: registry_upload_master and fracfocus_registry_master. Both dataframes are exported as csv’s of the same name for use in Excel or Spotfire/Tableau/PowerBI, or can be used as Pandas dataframes for further manipulation in Python.

As always, thanks for reading! Full documentation of the code presented here is available via this GitHub repository.

Check out my other tutorial on using the FracFocus data to perform a completions competitive intelligence analysis in the Permian Basin:

Using FracFocus and Python’s Matplotlib to Visualize Oil & Gas Companies’ Completions Strategies in the Permian Basin

Tech Rando

Welcome to the Tech Rando Blog!

How To Automate Data Pulls from the Online FracFocus Database

Automatically Web Scrape the FracFocus Site, and Pull the Data into Python Pandas for Analysis

As always, thanks for reading! Full documentation of the code presented here is available via this GitHub repository.

Check out my other tutorial on using the FracFocus data to perform a completions competitive intelligence analysis in the Permian Basin:

Like this:

Related

5 comments

Leave a ReplyCancel reply

How To Automate Data Pulls from the Online FracFocus Database

Automatically Web Scrape the FracFocus Site, and Pull the Data into Python Pandas for Analysis

As always, thanks for reading! Full documentation of the code presented here is available via this GitHub repository.

Check out my other tutorial on using the FracFocus data to perform a completions competitive intelligence analysis in the Permian Basin:

Share this:

Like this:

Related

5 comments

Leave a ReplyCancel reply

Discover more from Tech Rando