H2O AutoML in Python: A Comprehensive Tutorial

Sean Zhang
4 min read · Oct 30, 2020


What is AutoML and Why AutoML?

  1. AutoML automates model selection, hyperparameter tuning, and model ensembling. It does not help with feature engineering.
  2. AutoML works best for common cases such as tabular data (66% of the data used at work is tabular), time series, and text data. It does not work as well for deep learning, because deep learning requires massive computation and a carefully designed layer architecture, which the hyperparameter-tuning part of AutoML handles poorly.
  3. AutoML can simplify machine learning code and thus reduce labor costs. If you are applying common models such as GBM, Random Forest, or GLM to a simple dataset, AutoML is a great choice.

There are several popular platforms for AutoML, including Auto-SKLearn, MLBox, TPOT, H2O, and Auto-Keras. I will focus on H2O today. H2O AutoML is built in Java and can be used from Python, R, Java, Hadoop, Spark, and even AWS. If you want to know more about other tools, check out this article.

How to use H2O in Python

I am going to use the classic dataset Titanic as an example here.

h2o installation: click here.

Note: to run H2O you must have a JDK installed, because H2O runs on the JVM. Currently, only Java 8–13 are supported.
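If you are not sure whether Java is available on your machine, you can check from Python with the standard library before starting H2O (this snippet is my own addition, not part of H2O):

```python
import shutil

# H2O runs on the JVM, so the `java` executable must be on the PATH.
# shutil.which returns the path to the executable, or None if missing.
java_path = shutil.which("java")
if java_path is None:
    print("No JDK found - install Java 8-13 before running H2O")
else:
    print("java found at:", java_path)
```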

Result: 13 useful lines lead to an AUC of 84.5%

------------------------Tutorial Starts Here------------------------

Initialize H2O & Import Data

import h2o

h2o.init(max_mem_size='8G')

This initializes H2O; you can set the maximum/minimum memory as well as the IP and port. If you use H2O in a company setting, there are many more parameters to check out here. You will get a result similar to:

It is really useful to open the H2O connection URL to visualize the entire automation process. In that interface, you can select models, check the training logs, and make predictions without writing any code.
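For reference, here is a sketch of an init call with a few of the other common options (the parameter names come from the h2o.init() signature; running it requires the h2o package and a JDK):

```python
import h2o

# Sketch: pin the cluster location and resources at startup.
h2o.init(
    ip="localhost",      # where to start or find the cluster
    port=54321,          # H2O's default port
    max_mem_size="8G",   # cap on the JVM heap
    nthreads=-1,         # -1 = use all available cores
)
```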

Log Provided by H2O
from h2o.automl import H2OAutoML
train = h2o.import_file("train.csv")
test = h2o.import_file("test.csv")

After setting up H2O, we read the data in. The train and test objects here are “H2OFrame”s, which are very similar to DataFrames. Because H2O is Java-based, you will see the “enum” type, which corresponds to categorical data in Python. Functions like “describe” are provided, along with other parameters and functions, including the “asfactor” we are going to use. Check them all here.

x = train.columns
y = "Survived"
train[y] = train[y].asfactor()
x.remove(y)

These four lines specify the features and the target. We call “asfactor()” because H2O reads the “Survived” column as “int”, whereas it should be categorical (“enum”) so that AutoML treats the problem as classification rather than regression.
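As an aside, the pandas analogue of this conversion looks like the following (toy data here, not the real Titanic column):

```python
import pandas as pd

# Toy stand-in for the "Survived" column: stored as int, but it is
# really a categorical label, so cast it before modeling.
df = pd.DataFrame({"Survived": [0, 1, 1, 0]})
df["Survived"] = df["Survived"].astype("category")
print(df["Survived"].dtype)  # category
```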

H2O does not do feature engineering for you. If you want a better result, I suggest you use Python classic methods to do feature engineering instead of the basic manipulations provided by H2O.

Model Training

%%time
aml = H2OAutoML(max_models=20, max_runtime_secs=12000)
aml.train(x=x, y=y, training_frame=train)
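Note that %%time is a Jupyter cell magic; in a plain script you can time the call yourself with the standard library (the commented line stands in for the aml.train call above):

```python
import time

start = time.perf_counter()
# aml.train(x=x, y=y, training_frame=train)  # the long-running H2O call
elapsed = time.perf_counter() - start
print(f"wall time: {elapsed:.2f} s")
```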

Training Customization

nfolds=5, balance_classes=False, class_sampling_factors=None,
max_after_balance_size=5.0, max_runtime_secs=None,
max_runtime_secs_per_model=None, max_models=None,
stopping_metric='AUTO', stopping_tolerance=None, stopping_rounds=3,
seed=None, project_name=None, exclude_algos=None, include_algos=None,
exploitation_ratio=0, modeling_plan=None, preprocessing=None,
monotone_constraints=None, keep_cross_validation_predictions=False,
keep_cross_validation_models=False,
keep_cross_validation_fold_assignment=False, sort_metric='AUTO'

Many parameters are provided for you to customize training. The most common ones are nfolds for cross-validation; balance_classes for imbalanced data (set it to True to resample the minority classes); max_runtime_secs; exclude_algos; and sort_metric.
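For example, a run customized with those common parameters might look like this (a sketch with illustrative values, not the settings used in the run above):

```python
from h2o.automl import H2OAutoML

# Illustrative customization: 5-fold CV, resample imbalanced classes,
# cap the run at one hour, skip deep learning, and rank models by AUC.
aml = H2OAutoML(
    nfolds=5,
    balance_classes=True,
    max_runtime_secs=3600,
    exclude_algos=["DeepLearning"],
    sort_metric="AUC",
    seed=42,
)
```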

Check the leaderboard

lb = aml.leaderboard
lb.head(rows=15)
Leaderboard Result

In the leaderboard, you can compare model performance by AUC, logloss, mean_per_class_error, RMSE, and MSE. You can choose the metric used for ranking during training by specifying sort_metric.
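The ranking itself is just a sort on the chosen metric; here is a toy pandas stand-in for the leaderboard (made-up model names and scores):

```python
import pandas as pd

# Toy leaderboard: higher AUC is better, so the ensemble comes out on top.
lb = pd.DataFrame({
    "model_id": ["GBM_1", "StackedEnsemble_AllModels", "XRT_1"],
    "auc": [0.83, 0.85, 0.81],
})
best = lb.sort_values("auc", ascending=False).iloc[0]["model_id"]
print(best)  # StackedEnsemble_AllModels
```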

aml.leader #Best model

This is one of the most important features provided by H2O AutoML. You can get the best model's parameters, Confusion Matrix, Gain/Lift Table, Scoring History, and Variable Importance with this single line of code.

If your leader is an ensemble model:

metalearner = h2o.get_model(aml.leader.metalearner()['name'])

You can check the variable importance by:

aml.leader.varimp()
model = h2o.get_model("XRT_1_AutoML_20201030_001219")
model.varimp_plot(num_of_features=8)

To predict and get the result:

pred = aml.predict(test)
pred = pred[0].as_data_frame().values.flatten()
Variable Importance from Leader Model
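What those two prediction lines do, shown on a toy pandas stand-in for H2O's prediction frame (whose first column, predict, holds the predicted class, followed by the class probabilities):

```python
import pandas as pd

# Toy stand-in for aml.predict(test).as_data_frame().
pred_frame = pd.DataFrame({
    "predict": [0, 1, 1],
    "p0": [0.8, 0.2, 0.3],
    "p1": [0.2, 0.8, 0.7],
})
# Take the first column and flatten it into a 1-D array for submission.
pred = pred_frame.iloc[:, 0].values.flatten()
print(pred.tolist())  # [0, 1, 1]
```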

It is just that simple and convenient. For the best performance, you will have to tune more parameters. Thank you for reading, and if you like my article, please leave a clap.
