Förderjahr 2021 / Stipendien Call #16 / ProjektID: 5884 / Projekt: Efficient and Transparent Model Selection for Serverless Machine Learning Platforms
In the previous blog post, I detailed how the general idea and the design goals for the platform were shaped, resulting in the current architecture consisting of multiple components running in a Kubernetes cluster with model selection being the main contribution of the project. In this blog post, I will detail how those components interact with each other and describe in more detail how the model selection in MuLambda works.
Platform Architecture
In general, the architecture supports the following workflow, which involves two main actors, the administrator and the developer, and an additional actor, the consumer:
- Administrators upload saved model files to the platform.
- Administrators launch model executors, which load the models from the storage and make them available for inference. When a model executor is launched, metadata about its model is inserted into the metadata store. While the model executor is running, it periodically measures the latency between the model and its clients and updates the metadata in the metadata store.
- Developers launch clients on the platform, which register themselves in the metadata store and serve consumers. The developer defines the criteria by which the platform selects a model for the client.
- Consumers send requests to clients, which then query the platform for a suitable model and then send the request to the model.
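To make this workflow more concrete, the following sketch outlines the request path from a client's point of view. The selector endpoint, request schema, and field names are assumptions made for illustration and are not taken from the actual MuLambda API.

```python
import requests

SELECTOR_URL = "http://mulambda-selector:8080"  # hypothetical selection service


def handle_consumer_request(payload: dict) -> dict:
    """Client-side request path: ask the platform for a suitable model,
    then forward the consumer's request to the selected model executor."""
    criteria = {
        # hard criteria: must be satisfied exactly
        "hard": {"type": "classification", "input": "image", "output": "text"},
        # soft criteria: relative importance of latency vs. accuracy
        "soft": {"latency": 0.7, "accuracy": 0.3},
    }
    model = requests.post(f"{SELECTOR_URL}/select", json=criteria, timeout=5).json()

    # Forward the consumer's payload to the endpoint of the selected model.
    response = requests.post(f"{model['endpoint']}/predict", json=payload, timeout=10)
    return response.json()
```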
Infrastructure
Some of the components in the architecture fulfill specific infrastructural needs of the platform, e.g., storage. They work as follows:
- Model Data Storage: The first step in making models available on the platform is to upload them to the storage. For this we use min.io, an open-source object storage server that implements the Amazon S3 API. Amazon S3 is often used as the storage backend for serverless applications, which makes an S3-compatible store a good fit for the platform. The API is widely used in industry, so a large set of existing tools can be used to interact with the storage. It also makes it easy to launch the models with Tensorflow Serving, a widely used framework for serving machine learning models that integrates easily with the platform. (A minimal upload sketch follows after this list.)
- Model Executor: After administrators have uploaded models to the storage, they can launch model executors which use them. A model executor is a Kubernetes pod consisting of two containers: the model serving container and the companion container. The model serving container is a Tensorflow Serving container that loads the model from the storage and makes it available for inference. The loaded model is kept in memory by the serving container, so it does not have to be reloaded from the storage for every request. The serving container can be accessed via a REST API and a gRPC endpoint, both exposed by the Kubernetes pod. The companion container measures latency by periodically sending requests to all registered clients and recording the time it takes to receive a response from each (see the latency-probe sketch after this list).
- Metadata Storage: The metadata storage is a Redis-API-compatible key-value store that holds metadata about the models and clients. The prototype uses DragonflyDB, which implements all the required Redis API functions and promises better performance than Redis itself. However, the prototype can easily be adapted to use Redis (or any other compatible key-value store) instead of DragonflyDB. (A sketch of the metadata layout is shown below as well.)
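As a first sketch, uploading a model through the S3 API could look as follows using boto3. The endpoint, credentials, bucket name, and directory layout are assumptions for illustration; they are not the platform's actual configuration.

```python
import os

import boto3

# Connect to min.io through its S3-compatible API; endpoint and credentials are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.mulambda.svc:9000",
    aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
    aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
)


def upload_saved_model(local_dir: str, model_name: str, version: str = "1") -> None:
    """Upload a TensorFlow SavedModel directory in the layout Tensorflow Serving
    expects: <model_name>/<version>/saved_model.pb plus variables/ and assets/."""
    for root, _, files in os.walk(local_dir):
        for file_name in files:
            local_path = os.path.join(root, file_name)
            rel_path = os.path.relpath(local_path, local_dir)
            s3.upload_file(local_path, "models", f"{model_name}/{version}/{rel_path}")


upload_saved_model("./exported/image-classifier", "image-classifier")
```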
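The latency measurement performed by the companion container could, in simplified form, look like the sketch below. The key names, the clients' health endpoint, and the probe interval are assumptions for illustration.

```python
import time

import redis
import requests

r = redis.Redis(host="metadata-store", port=6379, decode_responses=True)

MODEL_ID = "image-classifier"  # assumed identifier of the served model
PROBE_INTERVAL_SECONDS = 30    # assumed measurement interval


def probe_clients() -> None:
    """Measure the round-trip time to every registered client and store it as metadata."""
    for client in r.smembers("clients"):  # assumed set of registered clients
        start = time.monotonic()
        try:
            requests.get(f"http://{client}/ping", timeout=2)  # assumed health endpoint
            latency_ms = (time.monotonic() - start) * 1000
            r.set(f"model:{MODEL_ID}:latency:{client}", latency_ms)
        except requests.RequestException:
            r.delete(f"model:{MODEL_ID}:latency:{client}")  # client unreachable


while True:
    probe_clients()
    time.sleep(PROBE_INTERVAL_SECONDS)
```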
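Finally, the metadata itself can be kept as plain Redis data structures. The key and field names below mirror the classification scheme described in the next section but are, again, only illustrative.

```python
import redis

# DragonflyDB speaks the Redis protocol, so the standard redis-py client works unchanged
# against DragonflyDB, Redis, or any other compatible key-value store.
r = redis.Redis(host="metadata-store", port=6379, decode_responses=True)

# Register a model when its executor is launched (field names are illustrative).
r.hset("model:image-classifier", mapping={
    "type": "classification",
    "input": "image",
    "output": "text",
    "accuracy": 0.91,
    "endpoint": "http://image-classifier.models.svc:8501",
})

# Register a client so that model executors can probe it for latency.
r.sadd("clients", "client-7f3a.clients.svc:8080")
```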
Model Selection
The selection of models is a component which contributes to the automation aspect of the platform. Because the platform is able to automatically select the best model for a given set of criteria, developers do not need to explicitly choose a model for their application. This is especially useful for applications which do not have many hard requirements on the model, but instead want a general kind of functionality with certain performance characteristics. Using machine learning models based on their traits instead of their identity therefore lets developers focus on their product instead of the underlying models, of which there is a vast number. The usefulness of this is evident for applications which just want to receive the best possible output for a given input, but do not care which model generated it. Furthermore, distributed systems benefit by automatically receiving co-located models, which can lead to an overall decrease in latency.
To be able to perform model selection at all, we first need to devise a classification scheme. Based on this classification scheme, developers can specify the criteria by which the platform should select the best model. We separate the criteria into hard and soft:
Hard Criteria are criteria which are essential for the client application to function as intended. The platform will never select a model which does not meet these criteria. The available hard criteria are:
- Model Type: The main function of the model. Here, we distinguish between models for classification, regression, etc. This is a hard criterion because the outputs of different types of models are not compatible with each other.
- Input Format: The kind of data with which client applications invoke the model (e.g., text, image, or audio). These input kinds are in general incompatible with each other, which makes this a hard criterion.
- Output Format: The kind of data the client application expects as output (e.g., text, image, or audio). Client applications have a specific purpose and work on a specific kind of data that is non-negotiable; thus, this is a hard criterion.
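To illustrate how the hard criteria act as a strict filter, the sketch below assumes that each model is described by exactly these three fields plus its soft-criteria measurements; this is an illustrative data model, not necessarily the one used in the prototype.

```python
from dataclasses import dataclass


@dataclass
class ModelMetadata:
    name: str
    model_type: str     # e.g. "classification", "regression"
    input_format: str   # e.g. "image", "text", "audio"
    output_format: str  # e.g. "text", "image", "audio"
    accuracy: float     # soft criterion: quality of the output
    latency_ms: float   # soft criterion: measured response time


def matches_hard_criteria(model: ModelMetadata, model_type: str,
                          input_format: str, output_format: str) -> bool:
    """A model is only eligible if it satisfies every hard criterion exactly."""
    return (model.model_type == model_type
            and model.input_format == input_format
            and model.output_format == output_format)
```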
Soft Criteria are criteria which let the client application signal what kind of performance it expects, e.g., whether it prefers faster or more accurate results. Based on the requested performance profile given by the soft criteria, the platform selects the most suitable model among those that meet the hard criteria. The available soft criteria are:
- Latency: The time it takes for the model to deliver an output to the client for a given input. Even though we always want to be as fast as possible, latency can be traded off for better accuracy. This trade-off makes latency a soft criterion.
- Accuracy: The quality of the output of the model. Analogous to latency, accuracy can be traded off for lower latency, making accuracy a soft criterion.
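Continuing the sketch above, one plausible way to combine the soft criteria is a weighted score over the candidates that passed the hard-criteria filter. The normalization and the default weights are assumptions chosen for illustration, not necessarily the exact scoring used in the prototype.

```python
def select_model(candidates: list[ModelMetadata],
                 latency_weight: float = 0.5,
                 accuracy_weight: float = 0.5) -> ModelMetadata:
    """Rank the models that passed the hard-criteria filter by a weighted trade-off
    between low latency and high accuracy, and return the best one."""
    max_latency = max(m.latency_ms for m in candidates) or 1.0  # avoid division by zero

    def score(m: ModelMetadata) -> float:
        # Normalize latency to [0, 1] so that lower latency yields a higher score.
        latency_score = 1.0 - (m.latency_ms / max_latency)
        return latency_weight * latency_score + accuracy_weight * m.accuracy

    return max(candidates, key=score)
```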
In the next blog post, we will explore the evaluation of this model selector in more detail.