Automatically Web Scrape the FracFocus Site, and Pull the Data into Python Pandas for Analysis
In this post, I go over automating data pulls from the online FracFocus database, creating a master Pandas dataframe that can be used for data analysis.
FracFocus is the national hydraulic fracturing chemical registry. According to its site, it was created “to provide the public access to reported chemicals used for hydraulic fracturing within their area.” The database is a wealth of machine readable information, with over 127,000 wells registered. It is a great resource not only for the public, who is concerned about groundwater protection, but also for private companies interested in the chemicals and the vendors used during the hydraulic fracturing process (as well as data scientists and hobbyists who just want to play around with a publicly-available data set).
On its website, FracFocus releases .zip files with updated data on the 1st and 15th of every month. The .zip file contains several csv’s that need to be appended together, making the initial data pull somewhat unruly. The code that I provide in this post not only allows you to pull the FracFocus data directly from the site (no messy downloading .zip files!), but it also appends all of the csv’s together into a master Pandas dataframe for easy manipulation.
Here is the code, also available via my GitHub account:
#Import desired packages
import requests
import zipfile
import io
import pandas as pd
def pull_zip_file_from_url(url):
"""
This function pulls a zip file from a URL and generates a ZipFile object
Arguments:
url: String. Name of the URL that we want to pull from
Outputs:
zip_file: ZipFile object, generated from the URL
"""
request = requests.get(url)
zip_file = zipfile.ZipFile(io.BytesIO(request.content))
return zip_file
def append_filename_to_desired_list(list_to_append_to, key_word, file_name):
"""
This function detects if the filename in the .zip folder is a csv, and if it
contains a specific key_word string. If it
does, it appends to the desired master list of filenames
Arguments:
list_to_append_to: List. Master list containing filenames that end with .csv extension, and contain the key_word value
key_word: String. Identifier that the filename must contain in order to be append to the list_to_append_to list.
file_name: String. Name of the file that we're checking, contained in the .zip folder
Outputs:
list_to_append_to: List. Master list, with new filename appended if it is a .csv extension and contains the key_word string.
"""
if ((file_name.endswith('.csv')) & (key_word in file_name)):
list_to_append_to.append(file_name)
return list_to_append_to
def append_dataframes_into_master(list_of_filenames, zip_file):
"""
This function generates a master dataframe, pulled from all of the dataframes with filenames in the list_of_filenames list
Arguments:
list_of_filenames: List. List of filenames that we want to pull the data from.
zip_file: ZipFile object. Zipfile that we're pulling files from
Outputs:
df_master: Pandas dataframe. Master dataframe containing all the data associated with the filenames from list_of_filenames
list.
"""
list_of_dfs=[pd.read_csv(zip_file.open(x), low_memory=False) for x in list_of_filenames]
df_master = pd.concat([r for r in list_of_dfs], ignore_index=True)
return df_master
def main():
#Name the FracFocus URL
fracfocus_url='http://fracfocusdata.org/digitaldownload/fracfocuscsv.zip'
#Create a ZipFile object from the pulled URL .zip file
zip_file = pull_zip_file_from_url(fracfocus_url)
#Get the list of file names in the zip file
list_of_file_names = zip_file.namelist()
#Create lists for storing the registryupload files, and the FracFocus registry files
list_fracfocus_registry_files=[]
list_registry_upload_files=[]
# Iterate over the file names
for file_name in list_of_file_names:
# Check filename endswith csv, and is one of the FracFocusRegistry files
list_fracfocus_registry_files=append_filename_to_desired_list(list_fracfocus_registry_files, 'FracFocusRegistry', file_name)
#Detect if the filename ends with csv, and is one of the registryupload files
list_registry_upload_files=append_filename_to_desired_list(list_registry_upload_files, 'registryupload', file_name)
#Append the registry upload files into a master df
registry_upload_master=append_dataframes_into_master(list_registry_upload_files, zip_file)
#Append the FracFocus registry files into a master df
fracfocus_registry_master=append_dataframes_into_master(list_fracfocus_registry_files, zip_file)
#Write both files to their respective master csv's
registry_upload_master.to_csv('registry_upload_master.csv')
fracfocus_registry_master.to_csv('fracfocus_registry_master.csv')
if __name__== "__main__":
main()
In the main() block of this script, the .zip file is first pulled via the URL, and a ZipFile object is created. All of the file names in the ZipFile object are then iterated through, and two lists are generated: one list containing the FracFocusRegistry file names, and another list containing the RegistryUpload file names. Then, using the append_dataframes_into_master() function, the files from each list are extracted into a master pandas dataframe. Two Pandas dataframes are created: registry_upload_master and fracfocus_registry_master. Both dataframes are exported as csv’s of the same name for use in Excel or Spotfire/Tableau/PowerBI, or can be used as Pandas dataframes for further manipulation in Python.
[…] https://techrando.com/2019/06/23/how-to-automate-data-pulls-from-the-online-fracfocus-database/ […]
[…] https://techrando.com/2019/06/23/how-to-automate-data-pulls-from-the-online-fracfocus-database/ […]
[…] recently wrote some script that automated data pulls from the publicly available FracFocus database, a government-operated […]
Hello I am extremely new to Python and I am trying to run this code in Spyder. But when I click run nothing happens? I am probably being really dumb, but any advice much appreciated please.
Thanks,
J
Hi Jon, I would check to make sure that the main() block of the code is executing at the bottom of the script. The main function controls the main execution of the script. If you’re still having issues, feel free to shoot me a personal note.