March 14, 2024

Machine Learning Ops: What It's Like in Practice

What is the article about?

In every corner of OTTO you find machine learning (ML) systems. You see a hat you like in our personalized recommendations, the AI support chat answers your questions, and search results are sorted to your liking, all done by ML systems. But who builds this tech?

In this blog post I want to give you insights into what an MLOps-focused engineer does in our department and how you can get started in this discipline.


Machine Learning Operations (MLOps) is an interdisciplinary set of practices from DevOps, data science and data engineering. It revolves around the life cycle of ML systems: data, training, inference, monitoring and everything in between.

Fig. 1: Illustration of MLOps as the Intersection of Roles


So what exactly does an MLOps specialist do day to day?


It depends... ML products evolve through phases over several months, starting with data-science-focused analysis and experimentation and culminating in a robust, operational system that is constantly improving. Here we look at a real-time example; there are also systems which consume batches instead and therefore need some adjustments in architecture.

Let's take a deeper look at the phases and what you, as an MLOps engineer in our team, would be doing in each of them.


Phase 1: Exploration


In Phase 1 our team does not have a product yet. We have identified a use case where ML could solve a problem or replace non-ML algorithms. An example could be that a colleague identifies manually arranged pages and wants to automate this arrangement or even personalize it.

During this exploration phase you will support the data scientists by extracting, transforming and loading (ETL) data. You might analyze it to identify "dirty" data which needs to be cleaned up. You will work closely with the data scientists and discuss which data is reliable and useful for our case. You should therefore have the foresight to spot problems that could arise as the system matures, for example when you move from historical data to real-time acquisition.
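As a minimal sketch of what such a cleanup step could look like (the bucket paths and column names here are invented for illustration), a small pandas job might do:

import pandas as pd

# Hypothetical raw interaction data; reading s3:// paths requires s3fs.
df = pd.read_parquet("s3://example-bucket/raw/interactions.parquet")

# Typical "dirty data" fixes: duplicates, missing keys, impossible values.
df = df.drop_duplicates()
df = df.dropna(subset=["user_id", "item_id"])
df = df[df["price"] >= 0]

# Store the cleaned data where the data scientists can pick it up.
df.to_parquet("s3://example-bucket/clean/interactions.parquet")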

We do not generally use a shared data science platform across all teams. Therefore, you might want to provide an environment where our data scientists can work and collaborate. This can be anything from a Git repository to self-hosted runners on which the data scientists can execute training runs.

To summarize phase 1, expect a fair share of data engineering, ops and discussions about data.

Fig. 2: Tasks by Role in Phase 1: Exploration


On the data engineering side (bright) we take care of acquiring, transforming and storing data, often orchestrated by a workflow management tool such as Airflow. On the data science side (red) it's all about trying models and iterating until we find a promising candidate model.
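To give a feel for such an orchestrated workflow, here is a minimal Airflow DAG sketch; the task names, the daily schedule and the empty task bodies are assumptions for illustration, not our actual setup:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # placeholder: pull raw data from the source systems

def transform():
    ...  # placeholder: clean and reshape the data

def load():
    ...  # placeholder: store the result, e.g. in S3

# "schedule" is the Airflow 2.4+ parameter name.
with DAG(
    dag_id="exploration_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task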

There is a constant feedback loop between the roles so they can support each other.


Phase 2: Going Live


After the exploration phase we go straight to testing and proof of concept (POC).

Now we should have a trained model which we can export in a suitable format, e.g. ONNX, to host it. Training might already be automated if that helped your team try and test faster in phase 1.
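For instance, exporting a PyTorch model to ONNX can be as short as this sketch, where the tiny linear model and the input shape stand in for your real ones:

import torch
from torch import nn

model = nn.Linear(16, 1)  # stand-in for the trained model from phase 1
model.eval()

dummy_input = torch.randn(1, 16)  # example input with the expected shape
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)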

Next, you need to enable live inference. We want to iterate fast, so instead of creating the perfect model validation cycle you will probably focus on the inference service and on connecting real-time data to it. Depending on latency requirements you choose a suitable technology for inference (e.g. a prebuilt server like NVIDIA Triton or a custom solution which handles multiprocessing well).
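A custom solution can start out as a small HTTP service around the exported model. The following is only a sketch, assuming a FastAPI service, ONNX Runtime and the 16-feature model from the export example above:

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
session = ort.InferenceSession("model.onnx")  # the exported model

class PredictRequest(BaseModel):
    features: list[float]  # assumed: 16 numeric features

@app.post("/predict")
def predict(request: PredictRequest) -> list[float]:
    # Shape the input as one batch row and run it through the model.
    x = np.asarray(request.features, dtype=np.float32).reshape(1, -1)
    (scores,) = session.run(None, {"input": x})
    return scores[0].tolist()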

Monitoring is important, since without it we are unable to verify our POC in an A/B test. Grafana in combination with Prometheus does the trick for most of our teams. We gather information about how well our model is performing, e.g.: How often do users interact with our model's results? How big is the result? How fast is our inference? And, of course, we track our key performance indicator (KPI).
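With the official Python client, exposing such metrics to Prometheus takes only a few lines. A sketch with made-up metric names and a dummy model call; in practice this would live inside the inference service:

import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total", "Predictions served")
LATENCY = Histogram("model_inference_seconds", "Inference latency in seconds")

def predict(features):
    PREDICTIONS.inc()
    with LATENCY.time():
        time.sleep(0.01)  # placeholder for the actual model call
        return 0.0

start_http_server(9100)  # Prometheus scrapes http://host:9100/metrics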

CI/CD is added wherever it helps us; manual steps might still exist as long as they are not a bottleneck in creating our POC.

Wrapping up phase 2: we focus on gathering real feedback for our model as fast as possible. At this point there is no need to create the perfect infrastructure, since a model which does not meet our expectations might make us start from scratch again.

Fig. 3: Tasks by Role in Phase 2


During phase 2 a lot of ops tasks are added to the mix. CI/CD helps us automate our system and make it reliable. The actual inference server needs to be added and monitored. The previously on-demand data extraction must now run on real-time or batched data. The data should be monitored as well to see whether our features change.


Phase 3: Improving and keeping the product alive


Congratulations, we have a product! Our A/B test was successful, and we can permanently implement our model. But your job is not finished. At OTTO, you build it, you run it.

This means that in the third phase we want to balance keeping the old model alive (monitoring, validation, retraining) while simultaneously improving our product. Improving could mean anything from adjusting some hyperparameters to creating a whole new model which tackles a similar problem in our team's domain.

We start by focusing on a fleshed-out ML lifecycle. Some parts are already automated, and we need to take care of the rest. Your goal is a workflow or pipeline which gets and prepares data, trains a model, evaluates it against the current live model and finally deploys it (Figure 4).

Fig. 4: Simplified ML-Pipeline
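Stripped of all tooling, such a pipeline boils down to a plain Python skeleton like the following; every function here is a made-up placeholder for whatever your stack actually provides:

import random

# Placeholder steps; in reality these call your data jobs, training code
# and deployment tooling.
def extract_data():
    return [random.random() for _ in range(100)]

def prepare(raw):
    return [x * 2 for x in raw]

def train(features):
    return {"name": "candidate", "quality": random.random()}

def load_live_model():
    return {"name": "live", "quality": 0.5}

def evaluate(model, features):
    return model["quality"]

def deploy(model):
    print(f"deploying {model['name']}")

def run_pipeline():
    features = prepare(extract_data())
    candidate = train(features)
    # Deploy only if the candidate beats the current live model.
    if evaluate(candidate, features) > evaluate(load_live_model(), features):
        deploy(candidate)

run_pipeline()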


Great! After the pipeline is finished our system is automated, but it is not a lifecycle yet. The last thing is to trigger the pipeline. Ideally, we can detect when our model needs a new training (e.g. data is shifting or predictions become imprecise) and trigger our pipeline on that condition. This is superior to a schedule because we are not guessing when a new training is due. While monitoring is still in the making, though, a well-chosen schedule is a good start.
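Such a trigger can start out very simple. As a sketch, assuming we stored a feature's mean and standard deviation at training time, a basic drift check could look like this (all numbers are invented):

import numpy as np

def is_drifting(live_values, train_mean, train_std, threshold=3.0):
    """Flag drift when the live feature mean strays too far from training."""
    z = abs(np.mean(live_values) - train_mean) / (train_std + 1e-9)
    return z > threshold

# Example: statistics from training vs. freshly collected live values.
live = np.random.normal(loc=0.9, scale=0.2, size=1000)
if is_drifting(live, train_mean=0.4, train_std=0.2):
    print("trigger retraining")  # e.g. kick off the run_pipeline() sketch above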

That's it, right? Well, if you are lucky the data scientists already have three new ideas up their sleeves, but this time you are not starting from scratch. You extend the system, make it more flexible to support different model types and increase performance to provide an even better experience for our customers.


How to learn it?


Interested? You have learned that MLOps combines skills from different roles, and anyone in those roles can learn to carry out MLOps. I, for example, started in DevOps and continued to learn all about data engineering before I finally learned a little bit about ML models in order to create living products from the ideas and models built by our data scientists.

My goal is to create the ideal, care-free environment so our data scientists can focus on improving models. So, wherever you are starting, look into the tools of the other roles one by one until you can take care of a whole MLOps tech stack.

The technology can vary a lot while the ideas stay similar, so I will share what we are using. Please keep in mind that our teams use different tech stacks, so this is just one example.

"My goal is to create the ideal, care-free environment so our data scientists can focus on improving models."


First: Data


Since we are working with AWS, we store data in AWS S3 buckets. We transform it with Kotlin or Python jobs and manage these jobs with a workflow management system such as SageMaker Pipelines or Airflow. Depending on the size of the data, Spark is also a common tool in our teams. Kafka is a must-have for streamed data.
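As a rough sketch of such a transformation job with Spark (bucket paths and column names are invented, and the s3:// paths assume an AWS-configured cluster):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-clicks").getOrCreate()

# Read raw events from S3, aggregate clicks per item and day, write back.
events = spark.read.parquet("s3://example-bucket/raw/events/")
daily = events.groupBy("item_id", F.to_date("timestamp").alias("day")).count()
daily.write.mode("overwrite").parquet("s3://example-bucket/agg/daily_clicks/")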

Data Science


Exploration and analysis often take place in Python notebooks which are hosted in SageMaker Studio. We focus on PyTorch to build and train the models.
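A model in this stage is usually just a notebook cell away. Here is a toy PyTorch training loop, with random tensors standing in for real features and labels:

import torch
from torch import nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X, y = torch.randn(256, 16), torch.randn(256, 1)  # toy data

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()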

Inference


We decided to use NVIDIA Triton as our inference server. Other teams prefer custom PyTorch implementations.
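Talking to a Triton server from Python works via its client library. A sketch assuming a locally running server with a model named "recommender" that takes one FP32 input of shape [1, 16]:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the input tensor and fill it with example data.
inp = httpclient.InferInput("input", [1, 16], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))

result = client.infer(model_name="recommender", inputs=[inp])
print(result.as_numpy("output"))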

Ops


In our team we focus on CloudFormation for IaC (Infrastructure as Code), GitHub Actions for CI/CD and Kotlin + Spring for front-end-facing services. Finally, our monitoring stack consists of Prometheus, Grafana and AWS CloudWatch.

AWS SageMaker provides an all-in-one platform for the whole ML lifecycle. They have even created a set of in-depth tutorials and notebooks which I would recommend digging into. This might also be the topic of my next blog entry, so stay tuned ;D

I hope I was able to give you an idea of what MLOps means and the challenges it brings.

Want to become part of the team?


Written by

Jim Duden
Software Developer

