What is Model Serving Exactly? - An Example With Amazon Web Services (AWS)

Model Serving is an important step in the machine learning lifecycle when creating an AI application. It involves taking the model that's been trained with a dataset and making it accessible for prediction or inference requests. Prediction is typically used in the context of supervised learning, where a model is trained to predict a certain output given a set of inputs. An inference request is when you submit new data to a model and ask it to perform a task like completing text.

Models must be served reliably, accurately, and efficiently, because the success of your AI application depends on it.

What is Model Training?

To train a model accurately, you need to understand the task and the data used. That means determining the type of predictive task (for example, classification or regression) and selecting the most suitable algorithm. Information technology is ultimately a tool in the world of humans. Your assumptions about your data and your algorithm of choice must fit reality to yield useful results.

To get the best accuracy and performance, it is important to tune the model at each step of the training process. This may involve hyperparameter tuning, regularization techniques, and dealing with data imbalance.
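One common way to deal with data imbalance is to weight each class inversely to how often it occurs, so that rare classes count more during training. The following is a minimal sketch of that idea. The function name `balanced_class_weights` and the example labels are illustrative assumptions, not something taken from a specific project.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical, heavily imbalanced label set: 90 "ok" versus 10 "fraud".
    labels = ["ok"] * 90 + ["fraud"] * 10
    print(balanced_class_weights(labels))
```

With these labels, the rare "fraud" class gets a weight of 5.0 while the frequent "ok" class gets a weight below 1, so mistakes on rare examples are penalized more strongly.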

Once a model is trained, it can be saved in various formats. Its parameters and weights are stored in a file so that the model can be loaded and used later.
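As a minimal sketch of this save-and-reload step, assume a toy linear model whose only parameters are a list of weights and a bias. The names `save_model`, `load_model`, and `predict` are illustrative and not part of any library. Real frameworks usually ship their own file formats, but the principle is the same.

```python
import json
import os
import tempfile


def save_model(path, weights, bias):
    """Write the model parameters to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"weights": weights, "bias": bias}, f)


def load_model(path):
    """Read the parameters back and return them as (weights, bias)."""
    with open(path, "r", encoding="utf-8") as f:
        params = json.load(f)
    return params["weights"], params["bias"]


def predict(weights, bias, features):
    """Linear model: weighted sum of the features plus the bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "model.json")
    save_model(path, weights=[0.5, -1.0], bias=2.0)
    weights, bias = load_model(path)
    print(predict(weights, bias, [4.0, 1.0]))  # 0.5*4 - 1.0*1 + 2.0 = 3.0
```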

Where to Deploy Machine Learning Models?

Deployment of the model is the next step in the machine learning lifecycle. It involves making the model accessible to the outside world on the internet or within an intranet. The model can be deployed using a physical server, virtual machines or a solution with containers.
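To illustrate what "making the model accessible" can look like, here is a minimal sketch of an HTTP prediction endpoint built only with the Python standard library. The hard-coded `WEIGHTS` and `BIAS`, the `/predict` path, and the JSON request shape are assumptions made for this example. A production setup would typically sit behind a proper web server or a managed service.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical parameters of an already trained linear model.
WEIGHTS = [0.5, -1.0]
BIAS = 2.0


def predict(features):
    """Weighted sum of the features plus the bias."""
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS


def handle_predict(raw_body):
    """Turn a raw JSON request body into (status_code, response_dict)."""
    try:
        features = json.loads(raw_body)["features"]
        return 200, {"prediction": predict(features)}
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON object with a 'features' list"}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_predict(self.rfile.read(length))
        body = json.dumps(response).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8000):
    """Block and answer POST /predict requests until interrupted."""
    ThreadingHTTPServer((host, port), PredictHandler).serve_forever()
```

Calling `serve()` and then sending `{"features": [4.0, 1.0]}` as a POST request to `http://127.0.0.1:8000/predict` returns a prediction of 3.0. Whether this process runs on a physical server, a VM, or inside a container does not change the code.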

Physical servers require you to buy and maintain hardware, which makes them expensive and resource-intensive. They may still be a viable option depending on the difficulty of the tasks and the requirements of your project.

Virtual machines (VMs) are the traditional choice for deploying machine learning models. VMs run on physical servers and can be added or removed on demand, which helps with scalability and cost efficiency. Amazon EC2 instances are virtual machines.

Containers, such as Docker containers, are self-contained, isolated environments that run on a single VM. The benefit of containers is that you can run several of them on one VM without their dependencies conflicting with each other.

Container orchestration software like Kubernetes is used to manage containers across multiple VMs. A setup where Kubernetes is used to manage multiple containers across multiple VMs is called a Kubernetes cluster.

Cloud providers like AWS offer managed services for containers. For example, Amazon Elastic Kubernetes Service (EKS) and Amazon Elastic Container Service (ECS) run and orchestrate containers for you, and in combination with AWS Fargate they can do so without you having to manage VMs at all!

And last but not least, managed machine learning platforms like Amazon SageMaker are designed to automate model deployment processes while providing a graphical user interface, including a web-based integrated development environment (IDE), SageMaker Studio. They integrate with services such as EKS and ECS.

How To Monitor Machine Learning Models?

To keep a machine learning model accurate, it is important to monitor it. This can be done by logging predictions and outcomes, that is, by collecting data from the deployed model to measure its accuracy and performance. This data can then be used to identify areas where the model is underperforming and to make the necessary adjustments.
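Logging predictions and outcomes can be sketched as follows. The class `PredictionLog`, its method names, and the request IDs are hypothetical and chosen for illustration. The key idea is that the real outcome usually arrives later than the prediction, so the two are joined by an ID.

```python
import json
import time


class PredictionLog:
    """Log of predictions that is later joined with the real outcomes."""

    def __init__(self):
        self.records = {}

    def log_prediction(self, request_id, features, prediction):
        self.records[request_id] = {
            "timestamp": time.time(),
            "features": features,
            "prediction": prediction,
            "outcome": None,  # unknown until the ground truth arrives
        }

    def log_outcome(self, request_id, outcome):
        self.records[request_id]["outcome"] = outcome

    def accuracy(self):
        """Share of correct predictions among records with a known outcome."""
        labelled = [r for r in self.records.values() if r["outcome"] is not None]
        if not labelled:
            return None
        correct = sum(r["prediction"] == r["outcome"] for r in labelled)
        return correct / len(labelled)

    def to_json_lines(self):
        """Serialize the log, one JSON object per line, for later analysis."""
        return "\n".join(json.dumps(r) for r in self.records.values())


if __name__ == "__main__":
    log = PredictionLog()
    log.log_prediction("req-1", [1.0, 2.0], "spam")
    log.log_prediction("req-2", [0.1, 0.3], "ham")
    log.log_prediction("req-3", [0.7, 0.9], "spam")
    log.log_outcome("req-1", "spam")  # correct
    log.log_outcome("req-2", "spam")  # wrong
    print(log.accuracy())  # 0.5, req-3 has no outcome yet
```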

In addition to logging predictions and outcomes, it is also important to use monitoring tools. Tools such as MLflow and TensorBoard can be used to track models and see how different changes affect their accuracy.

Sparring Time with Opsie!

Opsie is our (imaginary) external audit & consulting sparring partner who answers all the naïve and uncomfortable questions. Let’s spar!

Q: While this seems straightforward in theory, how can we ensure that our model will respond adequately under high loads? And given the variability in model performance, how can we be confident in its accuracy and reliability without a significant history of performance data?

Opsie: Ensuring that a model will perform adequately under high loads involves robust stress testing, load testing, and performance benchmarking. One can set up an infrastructure that scales based on demand using cloud-based auto-scaling solutions. Despite the lack of significant performance data history, one can create synthetic data or simulate high load scenarios to test the model's performance. Additionally, implementation of robust error handling, failover strategies, and performance monitoring can ensure the model's accuracy and reliability.
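To make the idea of simulating high load with synthetic data more concrete, here is a minimal load test sketch. The function `fake_predict` is a stand-in for a call to your served model, and the request count and concurrency level are arbitrary assumptions. Real load tests usually target the deployed endpoint with a dedicated tool.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def fake_predict(features):
    """Stand-in for a call to the served model (assumption for this sketch)."""
    time.sleep(0.001)
    return sum(features)


def timed_call(predict, features):
    """Return how long a single prediction call took, in seconds."""
    start = time.perf_counter()
    predict(features)
    return time.perf_counter() - start


def load_test(predict, requests, concurrency):
    """Send all requests using `concurrency` workers; return latency stats."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda f: timed_call(predict, f), requests))
    n = len(latencies)
    p95_index = (95 * n + 99) // 100 - 1  # nearest-rank 95th percentile
    return {
        "requests": n,
        "mean_s": sum(latencies) / n,
        "p95_s": latencies[p95_index],
        "max_s": latencies[-1],
    }


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic requests: 200 feature vectors with 4 random values each.
    synthetic = [[rng.random() for _ in range(4)] for _ in range(200)]
    print(load_test(fake_predict, synthetic, concurrency=20))
```

Comparing the 95th percentile latency at different concurrency levels gives a first impression of how the service behaves as load increases.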

Q: What if our assumptions about the data are incorrect, or the selected algorithm is not the best fit? Isn't it highly possible, given the rapidly changing nature of technology and the inherent unpredictability of data, that our model might fail to capture key aspects of reality, thus rendering its predictions inaccurate or even misleading?

Opsie: You're right, assumptions about the data or the selected algorithm might be incorrect or inadequate. This is why the process of model development involves cycles of training, validation, and testing. Furthermore, data science teams often train several models using different algorithms and configurations (hyperparameter tuning) to select the best one. It's also crucial to continuously reevaluate the model as new data comes in and retrain it as necessary. Techniques like cross-validation, bootstrapping, and ensemble learning can also help mitigate the risk of incorrect assumptions.
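As a small illustration of how cross-validation helps compare candidate models, the following sketch scores a constant baseline against a simple least-squares line. The data is made up, and the helper names (`k_fold_splits`, `fit_mean`, `fit_line`, `cross_val_mse`) are assumptions for this example rather than a library API.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: slope * x + (my - slope * mx)


def cross_val_mse(fit, xs, ys, k=5):
    """Average held-out mean squared error over k folds."""
    fold_errors = []
    for train, test in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(model(xs[i]) - ys[i]) ** 2 for i in test]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


if __name__ == "__main__":
    # Made-up data: y is roughly 2x + 1 with a small alternating disturbance.
    xs = [float(i) for i in range(20)]
    ys = [2 * x + 1 + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(xs)]
    print("baseline:", cross_val_mse(fit_mean, xs, ys))
    print("line:    ", cross_val_mse(fit_line, xs, ys))
```

Because every model is scored only on data it has not seen during fitting, a wrong assumption about the data shows up as a high held-out error instead of going unnoticed.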

Q: Deploying on physical servers, VMs or using container solutions seems to involve a trade-off between cost and performance. How do we strike the right balance? What guarantees do we have that the benefits of using a VM or a container solution will outweigh the associated costs and complexities?

Opsie: Striking the right balance between cost and performance in deployment involves careful planning and consideration. Factors such as expected load, budget, and latency requirements should guide the choice. Container solutions built on Docker and Kubernetes offer scalability and isolation but add operational overhead. VMs are more heavyweight but can provide stronger isolation. Physical servers can offer the best raw performance but lack the flexibility of VMs and containers. The best choice depends on your specific needs and constraints.

Q: Monitoring for accuracy seems to assume that the model's performance will remain consistent over time. But isn't it true that machine learning models can 'drift' due to changing data trends? If so, how often should we retrain the model to ensure its performance remains optimal?

Opsie: You're right, data drift is a real issue. That's why it's necessary to continually monitor the model's performance and retrain it regularly. The frequency of retraining depends on the nature of your data and the model's application. It could be on a daily, weekly, or monthly basis. Implementing a feedback loop where predictions are compared with the actual outcomes helps in catching when the model's performance is degrading.
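Such a feedback loop can be sketched as a rolling accuracy monitor. The class name `AccuracyMonitor`, the window size, and the threshold are assumptions for this example. Suitable values depend on your data and on how quickly real outcomes become available.

```python
from collections import deque


class AccuracyMonitor:
    """Rolling accuracy over the most recent labelled predictions."""

    def __init__(self, window_size=100, threshold=0.8):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, prediction, actual):
        """Store whether a prediction matched the actual outcome."""
        self.window.append(prediction == actual)

    def accuracy(self):
        return sum(self.window) / len(self.window) if self.window else None

    def needs_retraining(self):
        """True once the window is full and accuracy is below the threshold."""
        full = len(self.window) == self.window.maxlen
        return full and self.accuracy() < self.threshold


if __name__ == "__main__":
    monitor = AccuracyMonitor(window_size=10, threshold=0.8)
    for _ in range(10):
        monitor.record("spam", "spam")  # model is still doing well
    print(monitor.needs_retraining())  # False
    for _ in range(5):
        monitor.record("spam", "ham")  # data has shifted, predictions are wrong
    print(monitor.accuracy(), monitor.needs_retraining())  # 0.5 True
```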

Q: While AWS's offerings sound comprehensive and user-friendly, isn't there a risk of vendor lock-in, potentially leading to increased costs and reduced flexibility in the long run? How can we ensure that AWS's services will continue to meet our needs as they evolve over time?

Opsie: Vendor lock-in is indeed a risk when heavily relying on one provider's services. To mitigate this, design your architecture in a way that's as agnostic to the underlying cloud services as possible. Also, make use of multi-cloud strategies or open-source tools when feasible. Regularly reviewing your needs and the services you are using can also ensure that AWS's or any other provider's offerings continue to meet your needs.
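One way to keep an architecture agnostic to the underlying cloud services is to let application code depend on a small interface instead of a provider SDK. The sketch below uses a hypothetical `ModelStore` interface with a local and an in-memory implementation. A class backed by a cloud object store would implement the same two methods, which is not shown here.

```python
import os
import tempfile
from typing import Protocol


class ModelStore(Protocol):
    """Minimal storage interface the rest of the application depends on."""

    def save(self, name: str, data: bytes) -> None: ...

    def load(self, name: str) -> bytes: ...


class LocalModelStore:
    """Stores model files in a directory on the local file system."""

    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def save(self, name: str, data: bytes) -> None:
        with open(os.path.join(self.root, name), "wb") as f:
            f.write(data)

    def load(self, name: str) -> bytes:
        with open(os.path.join(self.root, name), "rb") as f:
            return f.read()


class InMemoryModelStore:
    """Keeps model files in a dictionary, handy for tests."""

    def __init__(self):
        self.files = {}

    def save(self, name: str, data: bytes) -> None:
        self.files[name] = data

    def load(self, name: str) -> bytes:
        return self.files[name]


def publish_model(store: ModelStore, name: str, data: bytes) -> bytes:
    """Application code: works with any store that has save() and load()."""
    store.save(name, data)
    return store.load(name)


if __name__ == "__main__":
    payload = b'{"weights": [0.5, -1.0], "bias": 2.0}'
    for store in (LocalModelStore(tempfile.mkdtemp()), InMemoryModelStore()):
        print(type(store).__name__, publish_model(store, "model.json", payload) == payload)
```

Switching providers then means writing one new store class, while `publish_model` and everything built on it stay unchanged.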

Opsie: See you next time, guys. My secretary will take care of follow-up questions in about one hour.

So How Important Is Model Serving?

The entire journey of model serving — ranging from comprehension of the task, algorithm selection, saving the model, deploying it in a suitable environment, to continuous performance monitoring — is a pivotal phase in the lifecycle of machine learning. It is a multifaceted process that requires thorough understanding and careful execution to ensure reliable and effective utilization of machine learning models. Through intelligent choice of deployment methods, robust stress testing, iterative improvement strategies, and regular performance monitoring, the precision and reliability of models can be enhanced. As technology advances, so do the complexities of deployment and monitoring methods.
