How to Use the Environmental Protection Agency’s (EPA’s) API to Pull Data Using Python

Welcome to Tech Rando! In this post, I will walk you through using the Environmental Protection Agency’s (EPA’s) API to pull publicly available EPA data into Python for analysis.

First, a little background on the EPA. The Environmental Protection Agency’s primary goal is to develop and enforce regulations that ensure that Americans have clean air, water, and land. The EPA is also involved in a host of other activities: it provides grants to non-profits and state governments to aid in environmental cleanups, it teaches people about the environment, and it publishes a ton of environmental information that is publicly available. Through the EPA’s Envirofacts Data Service API (documentation linked below), you can access all of the publicly available EPA data sets from a single entry point via Python. Some of the available data sources include Greenhouse Gas (GHG) emissions, the Radiation Information Database, the Toxics Release Inventory, the Superfund Enterprise Management System, and the Safe Drinking Water Information System.

As an example, we’re going to look at pulling data from the Greenhouse Gas emissions database.

First, it’s worth noting that the EPA’s Envirofacts API is not exactly user-friendly. I initially attempted to use the epa package in Python, which acts as a wrapper for the Envirofacts API. However, since the package doesn’t appear to have been updated since its creation in 2011, it throws several errors in a Python 3 environment. So I turned to the EPA’s documentation on constructing queries to pull data directly via URLs, available here: https://www.epa.gov/enviro/envirofacts-data-service-api#metadata.

The website offers instructions on how to effectively construct a URL to pull data directly, shown below:

Image courtesy of https://www.epa.gov/enviro/envirofacts-data-service-api#metadata

An example of a URL pulled via the API is as follows:

https://enviro.epa.gov/enviro/efservice/t_design_for_environment/JSON/rows/1:19

Let’s break down what the above URL means. First, t_design_for_environment is the name of the table that we want to pull. We want the results in JSON format, so that is referenced next. Finally, we want to pull rows 1 through 19 from the data source, referenced last. (Note that the Envirofacts API indexes rows starting at zero, so 1:19 actually returns the second through twentieth records; 0:18 would return the first nineteen.)
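As a quick sanity check, the example URL can be pulled directly into Python with the requests library. Below is a minimal sketch (assuming the endpoint is still live and serving this table); the JSON response should parse as a list of dictionaries, one per row:

import requests

#Pull rows 1 through 19 of the example table as JSON
url = 'https://enviro.epa.gov/enviro/efservice/t_design_for_environment/JSON/rows/1:19'
response = requests.get(url)
response.raise_for_status()
records = response.json()
#Each record is a dictionary keyed by column name
print(len(records))
print(records[0])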

The Envirofacts website has information on specifically how to construct queries, so you’ll want to check that out. The site offers a lot of tables to sort through, so it’s important to have some idea of what you’re looking for when building a query.

Here, I want to go to the GHG database to construct a search by clicking the link shown below, available on the following webpage: https://www.epa.gov/enviro/envirofacts-data-service-api#metadata

Click on the GHG option to search the metadata!

The following should appear on the greenhouse gas model webpage (https://www.epa.gov/enviro/greenhouse-gas-model):

All of the blue ‘SubPart’ tabs are links to tables available in the greenhouse gas model database. If you click on several of the ‘SubPart’ tabs, you’ll notice there are about 4-5 tables associated with each ‘SubPart’, resulting in a lot of tables!

To keep things simple for this tutorial, we’re going to select the master ‘GREENHOUSE GAS SUMMARY’ tab:

We’re taken to a web page containing the names of several tables, as well as the table linking relationships:

For the purpose of this tutorial, I want to pull all of the associated GHG emissions by sector and subsector, on the right branch of the tree:

This means I need to pull the following tables via the API:

PUB_DIM_FACILITY

PUB_FACTS_SECTOR_GHG_EMISSION

PUB_DIM_SECTOR

PUB_DIM_SUBSECTOR

PUB_DIM_GHG

To construct each of these queries, I click on each of the tables to see how the URL is formatted. This is what it looks like for the PUB_DIM_FACILITY table (see https://enviro.epa.gov/enviro/ef_metadata_html.ef_metadata_table?p_table_name=PUB_DIM_FACILITY&p_topic=GHG):

I want to pull the whole table as a CSV file, so I reformat the URL query as follows:

https://data.epa.gov/efservice/PUB_DIM_FACILITY/CSV
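Since pandas can read a CSV directly from a URL, a quick one-liner is enough to preview the table before writing any scaffolding. This is a minimal sketch (note that the full facility table is large, so the download may take a while):

import pandas as pd

#Read the full facility table straight from the Envirofacts CSV endpoint
facility_df = pd.read_csv('https://data.epa.gov/efservice/PUB_DIM_FACILITY/CSV')
print(facility_df.shape)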

Not bad! In fact, pulling in this data via Python into pandas dataframes is a fairly automatable process, as shown in the script below (also available via my GitHub account):

import pandas as pd
import io
import requests

class EPAQuery:
    """
    This class is used to pull EPA data directly into Python
    """

    def __init__(self, table_name):
        self.base_url = 'https://data.epa.gov/efservice/'
        self.table_name = table_name
        self.desired_output_format = 'CSV'
        
    def construct_query_URL(self,
                            desired_state=None, desired_county=None,
                            desired_area_code=None, desired_year=None,
                            rows_to_include=None):
        """
        This function constructs the URL that we want to pull the data from,
        based on the function inputs. The table name and the output format
        ('CSV') are taken from the object itself; every filter below defaults
        to None, meaning it is left out of the query unless specified.
        Arguments:
            desired_state: String. Abbreviation of the state that you want to
            pull from (for example, 'TX'). DEFAULT SET TO NONE
            desired_county: String. Name of the county that you want to pull
            from. DEFAULT SET TO NONE
            desired_area_code: String. Zip code that you want to pull from.
            DEFAULT SET TO NONE
            desired_year: String. Reporting year that you want to pull from.
            DEFAULT SET TO NONE
            rows_to_include: String. Rows that you want to include in the query.
            EX: '1:19'--rows 1 thru 19. DEFAULT SET TO NONE
        Outputs:
            query: String. URL that we want to pull
        """
        #Base of the query that we're going to build off of
        query_base = self.base_url
        #Add in the table name
        query = query_base + self.table_name + '/'
        #Add in the state qualifier, if the desired_state variable is named
        if desired_state is not None:
            query = query + 'state_abbr/' + desired_state + '/'
        #Add in the county qualifier, if the desired_county variable is named
        if desired_county is not None:
            query = query + 'county_name/' + desired_county + '/'
        #Add in the area code qualifier, if the desired_area_code variable is named
        if desired_area_code is not None:
            query = query + 'zip_code/' + desired_area_code + '/'
        #Add in the year qualifier, if the desired_year variable is named
        if desired_year is not None:
            query = query + 'reporting_year/' + desired_year + '/'
        #Add in the desired output format to the query
        query = query + self.desired_output_format
        #If there is a row qualifier, add it here
        if rows_to_include is not None:
            query = query + '/rows/' + rows_to_include
        #Return the completed query
        return query

    def read_query_into_pandas(self, query):
        """
        This function takes the query URL, pings it, and writes to a pandas dataframe
        that is returned
        Arguments:
            query: string. Name of the URL that we want to pull
        Outputs:
            dataframe: pandas dataframe. Dataframe generated from the file URL
        """
        #Pull the raw CSV content, raising an error if the request fails
        response = requests.get(query)
        response.raise_for_status()
        #on_bad_lines='skip' replaces the deprecated error_bad_lines=False
        dataframe = pd.read_csv(io.StringIO(response.content.decode('utf-8')),
                                engine='python', on_bad_lines='skip')
        return dataframe
            
def main():
    #Declare the names of the tables that we want to pull
    table_names=['PUB_DIM_FACILITY', 'PUB_FACTS_SECTOR_GHG_EMISSION',
                 'PUB_DIM_SECTOR', 'PUB_DIM_SUBSECTOR', 'PUB_DIM_GHG']
    #Dataframe dictionary
    epa_dfs={}
    #Object dictionary
    table_objects={}
    #Loop through all of the table names in the list, and generate
    #a query to pull via the API, and save to a pandas dataframe
    for table_name in table_names:
        #Generate a new object
        table_objects[table_name]=EPAQuery(table_name)
        #Construct the desired query name
        query=table_objects[table_name].construct_query_URL()
        #Pull in via the URL, and generate a pandas df,
        #which is then saved into a dictionary of dataframes called
        #epa_dfs for future reference
        epa_dfs[table_name]=table_objects[table_name].read_query_into_pandas(query)
    #Generate a master dataframe by joining all of the
    #dataframes together
    #Merge PUB_DIM_FACILITY and PUB_FACTS_SECTOR_GHG_EMISSION
    master_df=pd.merge(epa_dfs['PUB_DIM_FACILITY'], epa_dfs['PUB_FACTS_SECTOR_GHG_EMISSION'],
                       left_on=['PUB_DIM_FACILITY.FACILITY_ID', 'PUB_DIM_FACILITY.YEAR'],
                       right_on=['PUB_FACTS_SECTOR_GHG_EMISSION.FACILITY_ID',
                                 'PUB_FACTS_SECTOR_GHG_EMISSION.YEAR'], how='inner')
    #Merge master_df with PUB_DIM_SECTOR
    master_df=pd.merge(master_df, epa_dfs['PUB_DIM_SECTOR'], 
                       left_on='PUB_FACTS_SECTOR_GHG_EMISSION.SECTOR_ID',
                       right_on='PUB_DIM_SECTOR.SECTOR_ID')
    #Merge master_df with PUB_DIM_SUBSECTOR
    master_df=pd.merge(master_df, epa_dfs['PUB_DIM_SUBSECTOR'], 
                       left_on='PUB_FACTS_SECTOR_GHG_EMISSION.SUBSECTOR_ID',
                       right_on='PUB_DIM_SUBSECTOR.SUBSECTOR_ID')
    #Merge master_df with PUB_DIM_GHG
    master_df=pd.merge(master_df, epa_dfs['PUB_DIM_GHG'], 
                       left_on='PUB_FACTS_SECTOR_GHG_EMISSION.GAS_ID',
                       right_on='PUB_DIM_GHG.GAS_ID')
    #Subset to include only the important columns
    master_df_subsetted=master_df[['PUB_DIM_FACILITY.LATITUDE', 'PUB_DIM_FACILITY.LONGITUDE',
                     'PUB_DIM_FACILITY.CITY', 'PUB_DIM_FACILITY.STATE',
                     'PUB_DIM_FACILITY.ZIP', 'PUB_DIM_FACILITY.COUNTY', 
                     'PUB_DIM_FACILITY.ADDRESS1', 'PUB_DIM_FACILITY.YEAR',
                     'PUB_DIM_FACILITY.PARENT_COMPANY', 'PUB_DIM_SECTOR.SECTOR_NAME',
                     'PUB_DIM_SUBSECTOR.SUBSECTOR_DESC', 'PUB_DIM_GHG.GAS_CODE', 
                     'PUB_FACTS_SECTOR_GHG_EMISSION.CO2E_EMISSION']]
    print(master_df_subsetted.head(10))
    

if __name__== "__main__":
    main()

Let’s break down what the above script does. First, in the EPAQuery class, an object is initialized with a table_name string (for example, ‘PUB_DIM_SECTOR’), a default ‘CSV’ output format, and the EPA’s base URL, ‘https://data.epa.gov/efservice/’. From there, a query URL is built using the construct_query_URL() function. Once the URL has been built, it is requested via the read_query_into_pandas() function, and the table output is written to a pandas dataframe. All of these steps are performed in a loop over the specified tables in the main() block. Finally, the tables are merged together to create a master_df dataframe, which can be used for analysis.
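The filter parameters on construct_query_URL() also let you narrow a pull before downloading anything. As a hedged sketch, here is how you might pull only the Texas facilities for the 2018 reporting year (this assumes the state_abbr and reporting_year columns exist on the table being queried, as they do for PUB_DIM_FACILITY):

#Sketch: pull only Texas facilities for the 2018 reporting year
facility_query = EPAQuery('PUB_DIM_FACILITY')
url = facility_query.construct_query_URL(desired_state='TX', desired_year='2018')
#Resulting URL: https://data.epa.gov/efservice/PUB_DIM_FACILITY/state_abbr/TX/reporting_year/2018/CSV
tx_2018_df = facility_query.read_query_into_pandas(url)
print(tx_2018_df.head())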

This concludes this tutorial. Thank you for reading!

This post is very similar to my post outlining how to use the EIA’s API gateway to pull data into Python. If you’re also interested in that, check out the following link:

https://techrando.com/2019/06/26/how-to-use-the-energy-information-administration-eia-application-programming-interface-api-and-pull-live-data-directly-into-python-for-analysis/
