Automate Your ML Model Tuning and Selection Using AutoML in Python

A Review of Different Python AutoML Packages

Testing different ML approaches on the same data set to evaluate model performance can be a tedious task. Furthermore, properly tuning deep learning models can take hours, if not days. Luckily, within the past decade, there has been a serious push to develop methods to automate ML model selection and tuning. Although the open source solutions currently available are not silver bullets (and should not be treated as such!), using AutoML when building your ML or DL models can save a significant amount of time, and at least point you in the direction of an optimal model. In this post, I go over some of the AutoML implementations currently available in Python, and provide specific examples (code included!).

A few of the options currently available for automating model selection and tuning in Python are as follows (1):

  1. The H2O package
  2. The auto-sklearn package
  3. The TPOT package

In this post, we’ll review AutoML functionality in the H2O package and the TPOT package. Unfortunately, auto-sklearn is only available on Linux operating systems (which I don’t have), so it won’t be covered.

The Data

For this article, we will be using a well-known data set available via the UCI Machine Learning Repository, the breast cancer data set, to classify whether or not the cancer is recurring (binary outcome of 0 or 1), based on a variety of factors. Potential predictor variables include the patient’s age (categorical variable — binned), if the patient has gone through menopause (categorical), and tumor size (categorical — binned), among others. A snapshot of the data set is provided below:

Snapshot of the breast cancer data, available via the UCI repository

Before we run the data through any AutoML functions, let’s clean it up a bit. Mainly, we convert all of our categorical variables to numeric using sklearn’s LabelEncoder (label encoding isn’t necessary for H2O, but it is for TPOT):

import pandas as pd
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder

#Read in the cancer data set
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data', header=None)
#Declare the column names of the cancer data set
df.columns=["Class", "Age", "Menopause",
            "Tumor_Size", "Inv_Nodes", 
            "Node_Caps", "Deg_Malig",
            "Breast", "Breast_quad",
            "Irradiat"]
#Convert all of the categorical variables to numeric using LabelEncoder,
#storing one fitted encoder per column so values can be decoded later
d = defaultdict(LabelEncoder)
df_label_encoded = df.apply(lambda x: d[x.name].fit_transform(x))
Data after being label-encoded
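
A handy side effect of storing one LabelEncoder per column in a defaultdict is that the fitted encoders stick around, so encoded columns (or model predictions) can be mapped back to their original category labels. A quick sketch, using the d dictionary populated above:

#Map the encoded values back to their original categorical labels,
#using the per-column encoders stored in d
df_decoded = df_label_encoded.apply(lambda x: d[x.name].inverse_transform(x))
print(df_decoded.head())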

Now that we have a clean data set that is ready to use, let’s run AutoML on it!

H2O’s AutoML

H2O is an open source ML library which allows users to quickly build, test, and productionize ML models. H2O’s AutoML function automates the process of selecting the optimal ML or DL model for a training data set. The package is versatile and robust: AutoML trains and cross-validates a collection of model types, including generalized linear models, gradient boosting machines, random forests, deep neural networks, and stacked ensembles.

Downloading H2O

The H2O documentation contains the directions for downloading H2O for Python. The package requires a few dependencies that must be installed beforehand; directions for installing via Anaconda, as well as regular Python, are available.
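
As a rough sketch (treat the official directions as authoritative, since the recommended commands change between releases), a pip-based install looks something like this:

pip install requests tabulate
pip install h2o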

Running H2O’s AutoML on Our Data Set

We perform AutoML on the data set using the following code (in this example, we predict the Deg_Malig column, the tumor’s degree of malignancy):

import h2o
from h2o.automl import H2OAutoML

def run_h2o_automl(dataframe, variable_to_predict,
                   max_number_models):
    """
    This function initiates an h2o cluster, converts
    the dataframe to an h2o dataframe, and then runs
    the autoML function to generate a list of optimal 
    predictor models. The best models are displayed via a 
    scoreboard.
    Arguments:
        dataframe: Pandas dataframe. 
        variable_to_predict: String. Name of the column that we're predicting.
        max_number_models: Int. Total number of models to run.
    Outputs:
        Leader board of best performing models in the console, plus performance of
        best fit model on the test data, including confusion matrix
    """
    h2o.init()
    #Convert the dataframe to an h2o dataframe
    dataframe = h2o.H2OFrame(dataframe)
    #Convert the variable we're predicting to a factor; otherwise this
    #will run as a regression problem
    dataframe[variable_to_predict] = dataframe[variable_to_predict].asfactor()
    #Declare the x- and y- variables for the dataframe. 
    #x-variables are predictor variables, and y-variable is what
    #we wish to predict
    x = dataframe.columns
    y = variable_to_predict
    x.remove(y)
    #Split the data into training, test, and validation frames (75/12.5/12.5).
    train, test, validate = dataframe.split_frame(ratios=[.75, .125])
    # Run AutoML (limited to 1 hour max runtime by default)
    aml = H2OAutoML(max_models=max_number_models, seed=1)
    aml.train(x=x, y=y, training_frame = train, validation_frame = validate)
    # View the AutoML Leaderboard
    lb = aml.leaderboard
    print(lb.head(rows=lb.nrows))
    #Get performance on test data
    performance = aml.leader.model_performance(test)
    print(performance)
#################################################################################################
###RUN run_h2o_automl() FUNCTION IN MAIN
run_h2o_automl(dataframe=df, 
               variable_to_predict='Deg_Malig',
               max_number_models=10)
A snapshot of the leaderboard for H2O’s AutoML results

A leaderboard of the best-fitting models appears in the console, as shown in the figure above. Based on the leaderboard outputs, our best performing model is a generalized linear model (GLM), followed by a gradient boosting machine (GBM). We can gauge GLM performance by running our test data through H2O’s model_performance() function, where aml.leader is the best performing model:

performance = aml.leader.model_performance(test)
print(performance)
Confusion matrix for test predictions, generated via H2O’s model_performance() function

Based on the above confusion matrix, the model has an overall accuracy of approximately 83%, with a recall of ~73% and a precision of 87.5%. This is an impressive result, considering we did practically no preprocessing on the data set before running it through AutoML.
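
If you want to keep the winning model around beyond the current H2O session, the leader can generate predictions directly and be saved to disk. A minimal sketch, using H2O’s standard predict() and save_model() calls (the output path here is an arbitrary choice):

#Generate predictions on the test frame with the best model
predictions = aml.leader.predict(test)
#Save the leader model to disk so it can be reloaded in a later session
model_path = h2o.save_model(model=aml.leader, path="h2o_best_model", force=True)
print(model_path)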

TPOT AutoML

TPOT, or the Tree-Based Pipeline Optimization Tool, is one of the first AutoML methods developed. Dr. Randal Olson developed TPOT while working in the Computational Genetics Lab at the University of Pennsylvania (2). The main goal of TPOT is to automate the ML pipeline via genetic programming. A diagram of the TPOT automation process is shown below (3):

The ML pipeline as automated by the TPOT package (3)

As you can see in the above schematic, TPOT’s AutoML functionality automates the ML pipeline post-data cleaning. Specific automated steps in the pipeline include feature selection and preprocessing, model selection, and parameter optimization.
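
In code, the genetic search is controlled by a handful of TPOTClassifier arguments. Per TPOT’s documentation, roughly population_size + generations × offspring_size pipelines are evaluated in total, with offspring_size defaulting to population_size. A minimal sketch:

from tpot import TPOTClassifier
#With generations=10 and population_size=20, TPOT evaluates roughly
#20 + 10*20 = 220 candidate pipelines over the course of the search
tpot = TPOTClassifier(generations=10, population_size=20, verbosity=2)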

Downloading TPOT for Python

TPOT’s functionality depends on scikit-learn, so you will need to install both scikit-learn (if you don’t already have it) and TPOT to use TPOT’s AutoML functionality:

pip install scikit-learn
pip install tpot

Using TPOT’s AutoML Function

We’re going to use the same cancer data set used in the H2O AutoML example, this time predicting whether or not the cancer is recurring (the ‘Class’ column). We run the label-encoded data set through the run_tpot_automl() function:

from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

def run_tpot_automl(dataframe, 
                    variable_to_predict, 
                    number_generations,
                    file_to_export_pipeline_to='tpot_classifier_pipeline.py'):
    """
    This function runs a TPOT classifier on the dataset, after splitting it
    into a training and a test set.
    Args:
        dataframe: pandas dataframe. Master dataframe containing the feature and target
        data
        variable_to_predict: String. Name of the target variable that we want to predict.
        number_generations: Int. Number of generations to iterate through.
    Outputs:
        File containing the machine learning pipeline for the best performing model.
    """
    #Remove the target column to get the features dataframe
    features_dataframe = dataframe.loc[:, dataframe.columns != variable_to_predict]
    X_train, X_test, y_train, y_test = train_test_split(features_dataframe, dataframe[variable_to_predict],
                                                    train_size=0.75, test_size=0.25)
    #Run the TPOT pipeline
    tpot = TPOTClassifier(generations=number_generations, population_size=20, verbosity=2)
    tpot.fit(X_train, y_train)
    print(tpot.score(X_test, y_test))
    tpot.export(file_to_export_pipeline_to)
#################################################################################################
#Run in main block
run_tpot_automl(dataframe=df_label_encoded, 
                variable_to_predict='Class', 
                number_generations=10)
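
If you’d rather inspect the winning pipeline without opening the exported file, TPOT also exposes it on the fitted object via the fitted_pipeline_ attribute (this assumes the tpot object is returned or kept in scope, which our function above does not do by default):

#The best pipeline found, as a fitted sklearn Pipeline object
print(tpot.fitted_pipeline_)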

The run also exports a Python file, tpot_classifier_pipeline.py, containing the optimized ML pipeline for the best-performing model:

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import RobustScaler
from tpot.builtins import StackingEstimator
# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=None)
# Average CV score on the training set was: 0.7615725359911407
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=BernoulliNB(alpha=0.001, fit_prior=False)),
    RobustScaler(),
    RobustScaler(),
    StackingEstimator(estimator=ExtraTreesClassifier(bootstrap=False, criterion="gini", max_features=0.35000000000000003, min_samples_leaf=17, min_samples_split=6, n_estimators=100)),
    StackingEstimator(estimator=BernoulliNB(alpha=1.0, fit_prior=False)),
    RobustScaler(),
    ExtraTreesClassifier(bootstrap=False, criterion="entropy", max_features=0.55, min_samples_leaf=6, min_samples_split=19, n_estimators=100))
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

We make a couple of edits to the automated output file so that it runs against our data. The main edit is swapping TPOT’s placeholder data-loading lines for our label-encoded dataframe; a sketch, assuming df_label_encoded is still in scope:
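
#Swap TPOT's placeholder read_csv/'target' lines for our data set,
#where 'Class' is the column we're predicting
tpot_data = df_label_encoded
features = tpot_data.drop('Class', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['Class'].values, random_state=None)

With those edits in place, we return the predicted test results with an associated confusion matrix: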

from sklearn.metrics import confusion_matrix
#Round the predictions and compare them to the true test labels
results_rounded = np.round(results)
print(confusion_matrix(testing_target, results_rounded))
Confusion matrix of test outputs.

In reviewing the results of the selected AutoML model, an Extra Trees Classifier (a variation of a random forest), the total model accuracy is 83%, with a recall of 75% and a precision of ~84%. These results are almost identical to those of the H2O AutoML model, but with a lower precision score.
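
The accuracy, recall, and precision figures above can also be pulled straight from scikit-learn, rather than computed by hand from the confusion matrix. A quick sketch:

from sklearn.metrics import classification_report
#Summarize per-class precision, recall, and overall accuracy
print(classification_report(testing_target, results_rounded))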

How the Different Packages Stack Up

If you’re trying out AutoML on a problem, I recommend testing both packages, as your results may differ from the ones in this article. In terms of each package, here’s what I observed as pros and cons:

H2O’s AutoML

Pros:

  1. Easy to download
  2. Little to no data preprocessing required (Label encoding is not necessary, as it is with TPOT)
  3. Results are easy to interpret
  4. Samples a wide variety of ML and DL model types when selecting an optimal model

Cons:

  1. Spyder, which is my preferred IDE (available via the Anaconda distro), will not display H2O outputs. H2O must be run in a Jupyter Notebook or equivalent to display leaderboard results.

TPOT’s AutoML

Pros:

  1. Results are highly interpretable, and an automated file is generated containing the optimal ML pipeline.
  2. Easy to download.

Cons:

  1. More data pre-processing required to get the data set into an acceptable format to run AutoML.
  2. At least for this example, not (quite) as accurate as H2O’s AutoML. However, this may be a one-off and results could differ when sampling with other data sets.

This concludes my tutorial on Python AutoML. Thanks for reading! Full code for this tutorial is available via my personal GitHub repo:

https://github.com/kperry2215/automl_examples/blob/master/automl_example_code.py

Sources

  1. AutoML. Retrieved from https://www.ml4aad.org/automl/
  2. AutoML: Information about Automated Machine Learning. Retrieved from http://automl.info/tpot/
  3. TPOT: A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming. Retrieved from https://github.com/EpistasisLab/tpot
