How to Web Scrape an ASP.NET Web Form using Selenium in Python

Recently, I worked on a little side project that required web scraping ASP.NET forms from a website. I quickly discovered that this task is more complicated than the usual requests/BeautifulSoup combo that I normally use for web scraping. In this tutorial, I provide some code for web scraping an ASP.NET form using a Selenium driver in Python. The code I show here can easily be adapted for other sites that use ASP.NET forms.

For the purpose of this tutorial, I will be web scraping the public trustee website for Jefferson County, Colorado (I live here), with the intent of getting public foreclosure information. You can see a snapshot of the site’s foreclosure lookup below:

Snapshot of http://gts.co.jefferson.co.us/index.aspx

Why Use a Selenium Driver to Web Scrape an ASP.NET Form?

Normally I stick with the requests and/or BeautifulSoup packages for web scraping. However, these packages struggle when an actual browsing session needs to be simulated, for instance when an “Accept Terms” button needs to be clicked before the content becomes available. That’s where Selenium comes in. It simulates a real browsing session, allowing a user to click through a series of options to access a specific web page, all through Python.

Download the Correct Packages

For this tutorial, we will need to download both the Selenium and the Pandas packages (we’ll be saving the results in Pandas dataframe format). To download the packages, perform the following pip-installs via the command line:

pip install pandas
pip install selenium

In addition to the necessary Python packages, you will need to install a web driver for Selenium. I used GeckoDriver for Selenium with the Firefox browser. You can download GeckoDriver from the GeckoDriver releases page on GitHub. Select the correct version based on your operating system and bit size (I used the 64-bit Windows build).

For additional information on GeckoDriver and what it does, check out the GeckoDriver documentation.

Once the above packages and GeckoDriver are installed, open up your Python console and you’re ready to go!
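If you choose to add geckodriver to your system PATH rather than pointing Selenium at the executable directly, a quick sanity check from Python (just a standard-library sketch, not part of the original tutorial) confirms it is discoverable:

import shutil

# Prints the full path to geckodriver if it is on your PATH, otherwise None
print(shutil.which("geckodriver"))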

Run the Code

Initiate a Selenium Driver Browsing Session and Accept Website Terms

The Python code below initiates a Selenium driver, navigates to the webpage, and accepts the website terms:

from selenium import webdriver
import pandas as pd

url = "http://gts.co.jefferson.co.us/index.aspx"
# create a new Firefox session
driver = webdriver.Firefox(executable_path='C:/Users/kirst/Downloads/geckodriver-v0.26.0-win64/geckodriver.exe')
driver.implicitly_wait(30)
driver.get(url)

Let’s break down what the above code means. In the webdriver.Firefox() call, we set the executable_path argument to the path of our GeckoDriver executable. In my case, the GeckoDriver executable is located under file path C:/Users/kirst/Downloads/geckodriver-v0.26.0-win64/geckodriver.exe.
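One quick note: in Selenium 4 and later, the executable_path argument is deprecated in favor of a Service object. If you are on a newer Selenium release, a roughly equivalent setup (a sketch, using the same example path as above) looks like this:

from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Point a Service object at the geckodriver executable instead of passing executable_path
service = Service('C:/Users/kirst/Downloads/geckodriver-v0.26.0-win64/geckodriver.exe')
driver = webdriver.Firefox(service=service)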

When webdriver.Firefox() runs, the Selenium driver should launch a new Firefox browser on your computer; the driver.get() call then navigates it to the designated URL, and driver.implicitly_wait(30) tells Selenium to wait up to 30 seconds for page elements to appear before giving up. The new browser should show the following:

Navigating to the Jefferson County Public Trustees Page

We need our virtual Selenium session to accept the website terms, so we can continue navigating the website. We achieve this using the following code:

#Accept terms
accept_button = driver.find_element_by_id("ctl00_ContentPlaceHolder1_btnAcceptTerms")
accept_button.click()

#Wait up to 10 seconds when locating page elements
driver.implicitly_wait(10)
“Accept Terms” button highlighted

The driver.find_element_by_id() call is used to locate the “Accept Terms” button on the website. We find the ID of the “Accept Terms” button (in this case, “ctl00_ContentPlaceHolder1_btnAcceptTerms”) by right-clicking the button in the Firefox Selenium session and selecting the “Inspect Element” option. The following “Inspector” tab should appear, with the input ID for the button highlighted:

Once the above code is run, we should now be directed to the “Search Criteria” page in our Selenium browser.
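As an aside, the implicit wait works fine here, but a slightly more robust pattern (an optional sketch, not part of the original script) is to use Selenium's explicit waits so the click only fires once the button is actually clickable:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the "Accept Terms" button to become clickable, then click it
wait = WebDriverWait(driver, 10)
accept_button = wait.until(
    EC.element_to_be_clickable((By.ID, "ctl00_ContentPlaceHolder1_btnAcceptTerms"))
)
accept_button.click()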

Through our Selenium browser, we select the “Show All” option on the “Search Criteria” page. Selecting this option makes the website show all of the foreclosure information available for Jefferson County. We use the following code to do this:

#Select the 'Show All' option
show_all_button = driver.find_element_by_id("ctl00_ContentPlaceHolder1_btnShowAll")
show_all_button.click()
We select the “Show All” option on the “Search Criteria” page

Scraping the Form’s Data

Now that we’ve successfully navigated to the ASP.NET form, let’s automatically web scrape all of its table information. The code below creates a Pandas dataframe and then loops through the first several pages of the form (controlled by the range in the outer FOR loop), collecting the table data. We can easily adapt the range to pull fewer pages, or all of them:

#Create a dataframe to store all of the scraped table data
df = pd.DataFrame(columns = ["FC #", "Owner Name", "Street", "Zip", "Subdivision", "Balance Due", "Status"])

#Flip through all of the records and save them
for n in range(2, 15):
    for i in range(3):
        try:
            mytable = driver.find_element_by_css_selector("table[id='ctl00_ContentPlaceHolder1_gvSearchResults']")
            #Read in all of the data into the dataframe
            for row in mytable.find_elements_by_css_selector('tr'):
                row_list = []
                #Add to dataframe accordingly
                for cell in row.find_elements_by_css_selector('td'):
                    cell_reading = cell.text
                    row_list.append(cell_reading)
                #Add the list as a row, if possible 
                try:
                    a_series = pd.Series(row_list, index = df.columns)
                    df = df.append(a_series, ignore_index=True)
                except:
                    print("Could not append: " + str(row_list))
            break
        except:
            driver.implicitly_wait(5)
    if n%10 == 1:
        #The pager only shows numbered links for ten pages at a time; every tenth
        #page we have to click a '...' link to jump to the next block of pages
        if n < 20:
            driver.find_elements_by_xpath("//td/a[text()='...']")[0].click()  
        else:
            driver.find_elements_by_xpath("//td/a[text()='...']")[1].click()  
    else:
        driver.find_element_by_xpath("//td/a[text()='" + str(n)+ "']").click()    
    #Give elements on the next page up to three seconds to load
    driver.implicitly_wait(3)

#Write to a csv
df.to_csv("jefferson_county_foreclosures.csv", index= False)

The code above creates Pandas dataframe “df” to store all web-scraped table information. It then uses a FOR loop to page through the ASP.NET web form, collecting the table information on each page and appending it to dataframe “df”.

When pulling the data from each table page, we use a secondary FOR loop with a try-except statement, attempting to pull the data up to three times. This retry logic keeps a transient page-load error from halting the whole scrape. Once we pull the data successfully, the secondary FOR loop breaks.
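If you prefer, the same retry idea can be factored into a small helper function. The sketch below is purely illustrative (the scrape_results_table name is mine, not from the original script), and it reuses the same old-style Selenium calls as the rest of the tutorial:

def scrape_results_table(driver, attempts=3):
    #Try up to `attempts` times to locate the search results table
    for _ in range(attempts):
        try:
            return driver.find_element_by_css_selector(
                "table[id='ctl00_ContentPlaceHolder1_gvSearchResults']")
        except Exception:
            #Give the page a few more seconds before retrying
            driver.implicitly_wait(5)
    return None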

The code uses Selenium’s find_element_by_css_selector() functionality to locate the table in the form. We identify each row in the table by its HTML “tr” tag. Similarly, we identify individual table cell entries using the “td” tag, which HTML uses to denote standard data cells.

Get the Table ID using the “Inspect Element” option in your virtual Selenium session.
“tr” and “td” classes in the web form’s table.
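As an alternative to walking the “tr” and “td” elements by hand, Pandas can parse the rendered table straight out of the page source. This is just an optional sketch (it assumes a parser such as lxml or html5lib is installed for pandas.read_html to use):

#Parse the results grid directly from the rendered HTML
tables = pd.read_html(driver.page_source,
                      attrs={"id": "ctl00_ContentPlaceHolder1_gvSearchResults"})
page_df = tables[0]
print(page_df.head())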

The last bit of logic in our code handles cycling through pages when we need to move to a page beyond the ones currently displayed (this form only shows links for 10 pages at a time). We use Selenium’s find_elements_by_xpath() call to locate the ellipsis (“...”) link in the pager listing, and click it.

Locate the ellipsis to flip through all of the pages!
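For readability, the page-flipping logic can also be wrapped in a small helper. The sketch below is hypothetical (the go_to_page name is mine) and simply restates the same XPath logic used in the main script:

def go_to_page(driver, n):
    #Click through to page n of the search results
    if n % 10 == 1:
        #The pager only exposes '...' links for jumping to the next block of ten pages;
        #blocks after the first show two '...' links, and we want the forward one
        ellipses = driver.find_elements_by_xpath("//td/a[text()='...']")
        ellipses[0 if n < 20 else 1].click()
    else:
        driver.find_element_by_xpath("//td/a[text()='" + str(n) + "']").click()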

Conclusions

We can easily adapt this code to web scrape other web forms with a similar format. The full code for this tutorial is available via the following Github repo:

https://github.com/kperry2215/foreclosure_webscraper

Thanks for reading! Check out some of my other posts and tutorials.
