March 14, 2024

Machine Learning Ops: What It's Like in Practice

What is the article about?

In every corner of OTTO you find machine learning (ML) systems. You see a hat you like in our personalized recommendations, the AI support chat answers your questions, and search results are sorted to your liking, all done by ML systems. But who builds this tech?

In this blog post I want to give you insights into what an MLOps-focused engineer does in our department and how you can get started in this discipline.


Machine Learning Operations (MLOps) is an interdisciplinary set of practices from DevOps, data science and data engineering. It revolves around the life cycle of ML systems: data, training, inference, monitoring and everything in between.

Fig. 1: Illustration of MLOps as the Intersection of Roles


So what exactly does an MLOps specialist do day to day?


It depends... ML products evolve through phases over several months, starting with data-science-focused analysis and experimentation and culminating in a robust, operational system that is constantly improving. Here we look at a real-time example; there are also systems which consume batches instead and therefore need some adjustments in architecture.

Let's take a deeper look at the phases and what you, as an MLOps engineer in our team, would be doing in each of them.


Phase 1: Exploration


In Phase 1 our team does not have a product yet. We have identified a use case where ML could solve a problem or replace non-ML algorithms. An example could be that a colleague identifies manually arranged pages and wants to automate this arrangement or even personalize it.

During this exploration phase you will support the data scientists by extracting, transforming and loading (ETL) data. You might analyze it to identify "dirty" data which needs to be cleaned up. You will work closely with the data scientists and discuss which data is reliable and useful for our case. You should therefore have the foresight to spot problems that could arise as the system matures, for example when you move from historical data to real-time acquisition.
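As a minimal sketch of what such a cleanup step could look like (the bucket paths and column names here are invented for illustration), a small pandas job might do:

import pandas as pd

# Hypothetical raw interaction data; reading s3:// paths requires s3fs.
df = pd.read_parquet("s3://example-bucket/raw/interactions.parquet")

# Typical "dirty data" fixes: duplicates, missing keys, impossible values.
df = df.drop_duplicates()
df = df.dropna(subset=["user_id", "item_id"])
df = df[df["price"] >= 0]

# Store the cleaned data where the data scientists can pick it up.
df.to_parquet("s3://example-bucket/clean/interactions.parquet")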

We do not generally use a shared data science platform across all teams. Therefore, you might want to provide an environment where our data scientists can work and collaborate. This can be anything from a Git repository to self-hosted runners on which the data scientists can execute training runs.

To summarize phase 1, expect a fair share of data engineering, ops and discussions about data.

Fig. 2: Tasks by Role in Phase 1: Exploration


On the data engineering side (bright) we take care of acquiring, transforming and storing data, often orchestrated by a workflow management tool such as Airflow. On the data science side (red) it's all about trying models and iterating until we find a promising candidate model.
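To give a feel for such an orchestrated workflow, here is a minimal Airflow DAG sketch; the task names, the daily schedule and the empty task bodies are assumptions for illustration, not our actual setup:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # placeholder: pull raw data from the source systems

def transform():
    ...  # placeholder: clean and reshape the data

def load():
    ...  # placeholder: store the result, e.g. in S3

# "schedule" is the Airflow 2.4+ parameter name.
with DAG(
    dag_id="exploration_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task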

There is a constant feedback loop between the roles so they can support each other.


Phase 2: Going Live


After the exploration phase we go straight to testing and proof of concept (POC).

Now we should have a trained model which we can export in a suitable format, e.g. ONNX, to host it. Training might already be automated if that helped your team try and test faster in phase 1.
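For instance, exporting a PyTorch model to ONNX can be as short as this sketch, where the tiny linear model and the input shape stand in for your real ones:

import torch
from torch import nn

model = nn.Linear(16, 1)  # stand-in for the trained model from phase 1
model.eval()

dummy_input = torch.randn(1, 16)  # example input with the expected shape
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)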

Next, you need to enable live inference. We want to iterate fast, so instead of creating the perfect model validation cycle you will probably focus on the inference service and on connecting real-time data to it. Depending on latency requirements you choose a suitable technology for inference (e.g. a prebuilt server like NVIDIA Triton or a custom solution which handles multiprocessing well).
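A custom solution can start out as a small HTTP service around the exported model. The following is only a sketch, assuming a FastAPI service, ONNX Runtime and the 16-feature model from the export example above:

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
session = ort.InferenceSession("model.onnx")  # the exported model

class PredictRequest(BaseModel):
    features: list[float]  # assumed: 16 numeric features

@app.post("/predict")
def predict(request: PredictRequest) -> list[float]:
    # Shape the input as one batch row and run it through the model.
    x = np.asarray(request.features, dtype=np.float32).reshape(1, -1)
    (scores,) = session.run(None, {"input": x})
    return scores[0].tolist()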

Monitoring is important, since without it we are unable to verify our POC in an A/B test. Grafana in combination with Prometheus does the trick for most of our teams. We gather information about how well our model is performing, e.g.: How often do users interact with our model's results? How big is the result? How fast is our inference? And, of course, we track our key performance indicator (KPI).
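With the official Python client, exposing such metrics to Prometheus takes only a few lines. A sketch with made-up metric names and a dummy model call; in practice this would live inside the inference service:

import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total", "Predictions served")
LATENCY = Histogram("model_inference_seconds", "Inference latency in seconds")

def predict(features):
    PREDICTIONS.inc()
    with LATENCY.time():
        time.sleep(0.01)  # placeholder for the actual model call
        return 0.0

start_http_server(9100)  # Prometheus scrapes http://host:9100/metrics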

CI/CD is added wherever it helps us; manual steps might still exist as long as they are not a bottleneck in creating our POC.

Wrapping up phase 2: we focus on gathering real feedback for our model as fast as possible. At this point there is no need to create the perfect infrastructure, since a model which does not meet our expectations might make us start from scratch again.

Fig. 3: Tasks by Role in Phase 2


During phase 2 a lot of ops tasks are added to the mix. CI/CD helps us automate our system and make it reliable. The actual inference server needs to be added and monitored. The previously on-demand data extraction must now run on real-time or batched data. The data should be monitored as well to see whether our features change.


Phase 3: Improving and keeping the product alive


Congratulations, we have a product! Our A/B test was successful, and we can permanently implement our model. But your job is not finished. At OTTO, you build it, you run it.

This means that in the third phase we want to balance keeping the old model alive (monitoring, validation, retraining) while simultaneously improving our product. Improving could mean anything from adjusting some hyperparameters to creating a whole new model which tackles a similar problem in our team's domain.

We start by focusing on a fleshed-out ML lifecycle. Some parts are already automated, and we need to take care of the rest. Your goal is a workflow or pipeline which gets and prepares data, trains a model, evaluates it against the current live model and finally deploys it (Figure 4).

Fig. 4: Simplified ML-Pipeline
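Stripped of all tooling, such a pipeline boils down to a plain Python skeleton like the following; every function here is a made-up placeholder for whatever your stack actually provides:

import random

# Placeholder steps; in reality these call your data jobs, training code
# and deployment tooling.
def extract_data():
    return [random.random() for _ in range(100)]

def prepare(raw):
    return [x * 2 for x in raw]

def train(features):
    return {"name": "candidate", "quality": random.random()}

def load_live_model():
    return {"name": "live", "quality": 0.5}

def evaluate(model, features):
    return model["quality"]

def deploy(model):
    print(f"deploying {model['name']}")

def run_pipeline():
    features = prepare(extract_data())
    candidate = train(features)
    # Deploy only if the candidate beats the current live model.
    if evaluate(candidate, features) > evaluate(load_live_model(), features):
        deploy(candidate)

run_pipeline()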


Great! After the pipeline is finished our system is automated, but it is not a lifecycle yet. The last thing is to trigger the pipeline. Ideally, we can detect when our model needs a new training (e.g. data is shifting or predictions become imprecise) and trigger our pipeline on that condition. This is superior to a schedule because we are not guessing when a new training is due. While monitoring is still in the making, though, a well-chosen schedule is a good start.
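Such a trigger can start out very simple. As a sketch, assuming we stored a feature's mean and standard deviation at training time, a basic drift check could look like this (all numbers are invented):

import numpy as np

def is_drifting(live_values, train_mean, train_std, threshold=3.0):
    """Flag drift when the live feature mean strays too far from training."""
    z = abs(np.mean(live_values) - train_mean) / (train_std + 1e-9)
    return z > threshold

# Example: statistics from training vs. freshly collected live values.
live = np.random.normal(loc=0.9, scale=0.2, size=1000)
if is_drifting(live, train_mean=0.4, train_std=0.2):
    print("trigger retraining")  # e.g. kick off the run_pipeline() sketch above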

That's it, right? Well, if you are lucky the data scientists already have three new ideas up their sleeves, but this time you are not starting from scratch. You extend the system, make it more flexible to support different model types and increase performance to provide an even better experience for our customers.


How to learn it?


Interested? You have learned that MLOps combines skills from different roles, and anyone in those roles can learn to carry out MLOps. I, for example, started in DevOps and continued to learn all about data engineering before I finally learned a little bit about ML models in order to create living products from the ideas and models built by our data scientists.

My goal is to create the ideal, care-free environment so our data scientists can focus on improving models. So, wherever you are starting, look into the tools of the other roles one by one until you can take care of a whole MLOps tech stack.

The technology can vary a lot while the ideas stay similar, so I will share what we are using. Please keep in mind that our teams use different tech stacks, so this is just one example.

"My goal is to create the ideal, care-free environment so our data scientists can focus on improving models."


First: Data


Since we are working with AWS, we store data in AWS S3 buckets. We transform it with Kotlin or Python jobs and manage these jobs with a workflow management system such as SageMaker Pipelines or Airflow. Depending on the size of the data, Spark is also a common tool in our teams. Kafka is a must-have for streamed data.
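As a rough sketch of such a transformation job with Spark (bucket paths and column names are invented, and the s3:// paths assume an AWS-configured cluster):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-clicks").getOrCreate()

# Read raw events from S3, aggregate clicks per item and day, write back.
events = spark.read.parquet("s3://example-bucket/raw/events/")
daily = events.groupBy("item_id", F.to_date("timestamp").alias("day")).count()
daily.write.mode("overwrite").parquet("s3://example-bucket/agg/daily_clicks/")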

Data Science


Exploration and analysis often take place in Python notebooks which are hosted in SageMaker Studio. We focus on PyTorch to build and train the models.
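A model in this stage is usually just a notebook cell away. Here is a toy PyTorch training loop, with random tensors standing in for real features and labels:

import torch
from torch import nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X, y = torch.randn(256, 16), torch.randn(256, 1)  # toy data

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()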

Inference


We decided to use NVIDIA Triton as our inference server. Other teams prefer custom PyTorch implementations.
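Talking to a Triton server from Python works via its client library. A sketch assuming a locally running server with a model named "recommender" that takes one FP32 input of shape [1, 16]:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the input tensor and fill it with example data.
inp = httpclient.InferInput("input", [1, 16], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))

result = client.infer(model_name="recommender", inputs=[inp])
print(result.as_numpy("output"))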

Ops


In our team we focus on CloudFormation for IaC (Infrastructure as Code), GitHub Actions for CI/CD and Kotlin + Spring for front-end-facing services. Finally, our monitoring stack consists of Prometheus, Grafana and AWS CloudWatch.

AWS SageMaker provides an all-in-one platform for the whole ML lifecycle. They have even created a set of in-depth tutorials and notebooks which I would recommend digging into. This might also be the topic of my next blog entry, so stay tuned ;D

I hope I was able to give you an idea of what MLOps means and the challenges it brings.

Want to become part of the team?


Written by

Jim Duden
Software Developer

