Amazon SageMaker now comes pre-configured with the Scikit-Learn machine learning library in a Docker container. Scikit-Learn is popular choice for data scientists and developers because it provides efficient tools for data analysis and high quality implementations of popular machine learning algorithms through a consistent Python interface and well documented APIs. Scikit-Learn executes quickly and can scale to most data sets and problems, making it an ideal choice when you need to iterate quickly on your machine learning problems. Unlike Deep Learning frameworks such as TensorFlow or MxNet, Scikit-Learn is used for machine learning and data analysis. You can select from a range of supervised and unsupervised learning algorithms for clustering, regression, classification, dimensionality reduction, feature preprocessing, and model selection.
The newly added Scikit-Learn library is available in the Amazon SageMaker Python SDK. You can write your Scikit-Learn script and use the Amazon SageMaker training capabilities, including automatic model tuning. Once your model is trained, you can deploy your Scikit-Learn models to highly available endpoints that auto-scale to make real-time predictions with low latency. You can also use the same models in large-scale batch transform jobs.
In this blog post, I show you how to use the pre-built Scikit-Learn library in Amazon SageMaker to build, train, and deploy a multi-class classification model.
Training and deploying a Scikit-Learn model
In this example, we’re going to train a decision tree classifier on the IRIS dataset. This example is based on the Scikit-Learn Decision Tree Classification example. The full Amazon SageMaker notebook is available to try out. We’ll highlight the most important pieces here. This dataset has 50 samples from three different species of Iris flower, and is commonly used to demonstrate machine learning techniques. The goal is to predict which of the three species a flower belongs to, based on a number of different properties (petal length, petal width, and so on). While we’re using decision trees to solve this problem, Scikit-Learn offers a number of other algorithms that you can use.
Entry point script
The first step is to write the Scikit-Learn script. Starting with the main guard, we use a parser to read the hyperparameters that we pass to our Amazon SageMaker Estimator when creating the training job. These hyperparameters are made available as arguments to our input script in the training container. In this example, we look for the maximum number of leaf nodes. We also parse a number of Amazon SageMaker-specific environment variables to get information about the training environment, such as the location of input data and location where we want to save the model.
After we’ve defined our hyperparameters, we then load the dataset. For this example, we load all of the CSV files using the Pandas library.
In this example, we assume that the label (which species the flower belongs to) is stored in the first column. We separate our features and the label into two separate data frames.
Now we’re ready to train our model. This is as simple as creating the right classifier and calling fit. A key benefit of Scikit-Learn is the simplicity and consistency of the interface that each algorithm exposes. If your model needs pre-processing of features or calculation of validation scoresc, you can also do those things in this step.
Finally, we save our model.
Amazon SageMaker notebook – Setup
Now that we’ve written our Scikit-Learn script we can run it on Amazon SageMaker using the Amazon SageMaker pre-built Scikit-Learn container. We’re going to use a hosted Jupyter notebook to orchestrate the training process. Feel free to follow along interactively by running the notebook.
First, we set up the Amazon S3 bucket for storing the data and the model, as well as the AWS Identity and Access Management (IAM) role for the data and Amazon SageMaker permissions.
Now we’ll import the Python libraries we’ll need and create an Amazon SageMaker session.
Next, we’ll download our dataset and upload it to Amazon S3. In this example, we use Scikit-Learn locally in our notebook since it provides convenience functions to download the IRIS dataset.
Amazon SageMaker notebook – Training
Now that we’ve prepared the training data and our Scikit-Learn script (which we’ll name scikit_learn_iris.py), the SKLearn class in the SageMaker Python SDK allows us to run that script as a training job on the Amazon SageMaker managed training infrastructure. We’ll also pass the estimator our IAM role, the type of instance we want to use, and a dictionary of the hyperparameters that we want to pass to our script. Since Scikit-Learn runs on a single CPU-only machine, the Estimator only supports an instance count for training of 1 and GPU instances are not supported.
After we’ve constructed our SKLearn estimator, we can fit it by passing in the data we uploaded to Amazon S3. Amazon SageMaker makes sure our data is available in the local filesystem of the training cluster, so our Scikit-Learn script can simply read the data from disk.
Amazon SageMaker Notebook – Deployment
After training, we can use the SKLearn estimator to create an Amazon SageMaker endpoint – a hosted and managed prediction service that we can use to perform inference.
To do this our scikit_learn_iris.py script needs a model_fn() function that loads our saved model to make predictions.
You can also optionally specify other functions to customize the behavior of deserialization of the input request (input_fn()), serialization of the predictions (output_fn()), and how predictions are made (predict_fn()). The defaults work for our current use-case so we don’t need to define them.
For more details on the default implementations to customize the behavior, see the SageMaker Scikit-Learn Container GitHub Repository.
Now we can use our SKLearn estimator in the notebook to deploy the model. The deploy() function allows us to set the number and type of instances that will be used for the prediction endpoint. These do not need to be the same values we used for the training job. Here we will deploy the model to a single ml.m4.xlarge instance but you can also deploy the model to more than one instance and set up Auto Scaling for the endpoint.
Amazon SageMaker notebook – Prediction and evaluation
Now we can use this predictor to classify flowers from the IRIS dataset. Ideally, the evaluation dataset should be different from the training dataset but for this example, we’ll just reuse some of the training dataset to invoke the endpoint. .
First, we create a test dataset.
Now we can use the endpoint to make predictions by calling the predict function with our features.
Amazon SageMaker notebook – Cleanup
After you have finished with this example, remember to delete the prediction endpoint to release the instances associated with it.
Amazon SageMaker notebook – Batch Transform
Amazon SageMaker also provides Batch Transform, a managed service for doing large-scale that can be used to perform inference against your trained model on a large dataset. It’s ideal for scenarios where you’re dealing with large batches of data, you don’t need sub-second latency, or you need to preprocess and transform the training data.
First, we create a transformer and specify the number and type of instances we want to use for the job. In this case we specify 2 instances but you can scale this with the size of your dataset to reduce the amount of time it takes.
Now we start the transform job by providing it the location of our data on Amazon S3. The notebook example includes the steps followed to upload the dataset.
After the transform job has completed, we can download the output data from Amazon S3. For every input file we had, we will have a corresponding output file.
In this blog post we show you how to use the Amazon SageMaker built-in Scikit-Learn container to train a model on the IRIS dataset. However, that’s just the beginning. Scikit-Learn has a large selection of algorithms and transformers that you can use for your machine learning use-cases. Amazon SageMaker enables you to use Scikit-Learn scripts and works seamlessly with SageMaker training, automatic model tuning, and deployment capabilities.
Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
About the Author
Laurence Rouesnel is the Algorithms & Platforms Group Manager in Amazon AI Labs. He leads a team of engineers and scientists working on deep learning and machine learning research and products. In his spare time he is an avid traveler, and loves the outdoors whether it’s hiking, skiing, or windsurfing.