Recently, I worked on a little side project that required web scraping ASP.NET forms from a website. I quickly discovered that this task is more complicated than the usual requests/BeautifulSoup combination I normally use for web scraping. In this tutorial, I provide code for web scraping an ASP.NET form using a Selenium driver in Python. The code shown here can be easily adapted for other sites that use ASP.NET forms.
For the purpose of this tutorial, I will be web scraping the public trustee website for Jefferson County, Colorado (I live here), with the intent of getting public foreclosure information. You can see a snapshot of the site’s foreclosure lookup below:
Why Use a Selenium Driver to Webscrape an ASP.NET Form?
Normally I stick with the requests and/or BeautifulSoup packages for web scraping. However, these packages struggle when an actual browsing session needs to be simulated–for instance, when an “Accept Terms” button needs to be pressed during the session. That’s where Selenium comes in. It simulates a real browsing session, allowing a user to click through a series of options to access a specific web page, all through Python.
Download the Correct Packages
For this tutorial, we will need to download both the Selenium and the Pandas packages (we’ll be saving the results in Pandas dataframe format). To download the packages, perform the following pip-installs via the command line:
pip install pandas
pip install selenium
In addition to the necessary Python packages, you will need to install a web driver for Selenium. I used GeckoDriver for Selenium for the Firefox browser. You can download GeckoDriver here. Select the correct version based on your operating system and bit-size (I used Windows 64).
For additional information on GeckoDriver and what it does, check out the following link:
Once the above packages and GeckoDriver are installed, open up your Python console and you’re ready to go!
Run the Code
Initiate a Selenium Driver Browsing Session and Accept Website Terms
The Python code below initiates a Selenium driver, navigates to the webpage, and accepts the website terms:
from selenium import webdriver
import pandas as pd

url = "http://gts.co.jefferson.co.us/index.aspx"

# create a new Firefox session
driver = webdriver.Firefox(executable_path='C:/Users/kirst/Downloads/geckodriver-v0.26.0-win64/geckodriver.exe')
driver.implicitly_wait(30)
driver.get(url)
Let’s break down what the above code means. In the webdriver.Firefox() call, we set the executable_path argument to the path of our GeckoDriver executable. In my case, the GeckoDriver executable is located at C:/Users/kirst/Downloads/geckodriver-v0.26.0-win64/geckodriver.exe.
When driver.get() is called, the Selenium driver initiates a new Firefox browser on your computer and navigates to the designated URL. The new browser should show the following:
We need our virtual Selenium session to accept the website terms, so we can continue navigating the website. We achieve this using the following code:
#Accept terms
accept_button = driver.find_element_by_id("ctl00_ContentPlaceHolder1_btnAcceptTerms")
accept_button.click()

#Wait up to 10 seconds for elements to load
driver.implicitly_wait(10)
The driver.find_element_by_id() call is used to locate the “Accept Terms” button on the website. We find the ID of the “Accept Terms” button (in this case, “ctl00_ContentPlaceHolder1_btnAcceptTerms”) by right-clicking the button in the Firefox Selenium session and selecting the “Inspect Element” option. The following “Inspector” tab should appear, with the button’s input ID highlighted:
Once the above code is run, we should now be directed to the “Search Criteria” page in our Selenium browser.
Through our Selenium browser, we select the “Show All” option on the “Search Criteria” page. Selecting this option makes the website show all foreclosure information available for Jefferson County. We use the following code to do this:
#Select the 'Show All' option
show_all_button = driver.find_element_by_id("ctl00_ContentPlaceHolder1_btnShowAll")
show_all_button.click()
Scraping the Form’s Data
Now that we’ve successfully navigated to the ASP.NET form, let’s automatically web scrape all of its table information. The code below creates a Pandas dataframe and then loops through the form’s pages (pages 2 through 14, in this example), collecting the table data. The code can easily be adapted to pull fewer pages, or all of them:
#Create a dataframe to store all of the scraped table data
df = pd.DataFrame(columns = ["FC #", "Owner Name", "Street", "Zip",
                             "Subdivision", "Balance Due", "Status"])

#Flip through all of the records and save them
for n in range(2, 15):
    #Attempt to read the current page's table up to three times
    for i in range(3):
        try:
            mytable = driver.find_element_by_css_selector("table[id='ctl00_ContentPlaceHolder1_gvSearchResults']")
            #Read in all of the data into the dataframe
            for row in mytable.find_elements_by_css_selector('tr'):
                row_list = []
                #Add to dataframe accordingly
                for cell in row.find_elements_by_css_selector('td'):
                    cell_reading = cell.text
                    row_list.append(cell_reading)
                #Add the list as a row, if possible
                try:
                    a_series = pd.Series(row_list, index = df.columns)
                    df = df.append(a_series, ignore_index=True)
                except:
                    print("Could not append: " + str(row_list))
            break
        except:
            driver.implicitly_wait(5)
    if n%10 == 1:
        #Click the last "..." link to advance to the next block of ten pages
        driver.find_elements_by_xpath("//td/a[text()='...']")[-1].click()
    else:
        driver.find_element_by_xpath("//td/a[text()='" + str(n) + "']").click()
    #Wait a few seconds so the website doesn't crash
    driver.implicitly_wait(3)

#Write to a csv
df.to_csv("jefferson_county_foreclosures.csv", index= False)
The code above creates the Pandas dataframe “df” to store all web-scraped table information. It then uses a for loop to cycle through the pages of the ASP.NET web form, collecting the table information and appending it to dataframe “df”.
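Appending one pd.Series at a time works, but a common alternative (a sketch, not the tutorial’s code) is to accumulate plain lists and build the dataframe once at the end, which is faster and also works on newer pandas versions where DataFrame.append has been removed. The sample rows below are made up for illustration:

```python
import pandas as pd

columns = ["FC #", "Owner Name", "Street", "Zip",
           "Subdivision", "Balance Due", "Status"]

# Made-up rows standing in for scraped <td> text
scraped_rows = [
    ["2020-001", "Doe, Jane", "123 Main St", "80401", "Sub A", "$1,000", "Active"],
    ["2020-002", "Roe, Rick", "456 Oak Ave", "80403", "Sub B", "$2,500", "Active"],
]

# Keep only rows that match the column count (mirroring the scraper's
# try/except guard around pd.Series), then build the frame in one call
clean = [r for r in scraped_rows if len(r) == len(columns)]
df = pd.DataFrame(clean, columns=columns)
print(df.shape)  # (2, 7)
```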
When pulling the data from each table page, we use an inner for loop with a try-except statement, attempting to pull the data up to three times. This logic prevents transient page-load errors from halting the scrape. Once we pull the data successfully, the inner for loop breaks.
The code uses Selenium’s find_element_by_css_selector() functionality to locate the table in the form. We identify each grid row in the table by its HTML “tr” tag. Similarly, we identify individual table cell entries by the “td” tag, which denotes standard data cells in HTML.
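To see how the “tr”/“td” structure maps to rows and cells without a browser, here is a small sketch using Python’s built-in html.parser module instead of Selenium (the sample HTML is made up to resemble one grid row):

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by its parent <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []        # list of rows, each a list of cell strings
        self.in_cell = False  # True while we are inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])   # a <tr> starts a new row
        elif tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1].append(data.strip())  # text inside <td> is a cell

# Made-up sample resembling one row of the search-results grid
sample = "<table><tr><td>2020-001</td><td>Doe, Jane</td></tr></table>"
parser = TableParser()
parser.feed(sample)
print(parser.rows)  # [['2020-001', 'Doe, Jane']]
```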
The last bit of logic in our code handles cycling through pages when we need to move to a page beyond the ones currently displayed (this form only shows links for 10 pages at a time). We use Selenium’s find_elements_by_xpath() functionality to locate the ellipsis (“...”) link in the “page turner” listing and click it.
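The pager rule itself can be expressed as a tiny pure function: with ten page links per block, the link to click is the page number when it is visible, and the trailing “...” link when we need the next block. The helper below is my own illustration of the n % 10 == 1 check, not code from the scraper:

```python
def pager_link_text(n):
    """Return the pager link text to click to reach page n, assuming the
    pager shows ten numbered links per block: pages 11, 21, 31, ... are
    reached via the trailing '...' link, everything else via its number."""
    return "..." if n % 10 == 1 else str(n)

print([pager_link_text(n) for n in (2, 10, 11, 12)])  # ['2', '10', '...', '12']
```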
We can easily adapt this code to web scrape other web forms with a similar format. The full code for this tutorial is available via the following Github repo: