ML Pipelines in a Nutshell

Vittorio Scacchetti
3 min read · Dec 21, 2020

In this post we introduce the concept of pipelines and see what machine learning pipelines consist of!

We have data. Lots of data. Too much data. Data that must be processed, analyzed and appropriately managed. We certainly can’t start manually performing every single operation.

Key word: automation. That is where machine learning pipelines come into play.

A machine learning pipeline is nothing more than an ordered, well-defined sequence of components that process data. Pipelines are very common in machine learning systems because of the large amount of data to be managed and the many transformations to be applied; as noted above, they arise from the need to automate the workflow.

But what is the real advantage? Consider that each component typically runs asynchronously, i.e. its execution is independent of the others. Data entering the pipeline is transformed by a component, and the output is saved in an element called a data store. The data store is the connection point between the components.

Machine Learning Pipelines, from Hands-On Machine Learning with Scikit-Learn & TensorFlow

After some time, the downstream component picks up the data produced by the previous (upstream) component and performs new transformations. The output, as you may have guessed, is saved in a new data store. From there, the cycle continues until the last data store of the pipeline.
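To make the idea concrete, here is a minimal sketch of that flow. All names and data are illustrative assumptions, not from the article: two components pass records through simple in-memory data stores, but the stores could just as well be files, databases, or object storage shared between teams.

```python
# Minimal sketch (illustrative names): each component reads from the
# previous data store, transforms the data, and writes its output to
# the next store.

def clean_component(raw_records):
    # upstream component: drop incomplete records
    return [r for r in raw_records if r.get("value") is not None]

def scale_component(clean_records):
    # downstream component: normalize the "value" field to [0, 1]
    values = [r["value"] for r in clean_records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1
    return [{**r, "value": (r["value"] - lo) / span} for r in clean_records]

# data stores are plain dictionary entries here, purely for illustration
store = {"raw": [{"value": 3}, {"value": None}, {"value": 9}]}
store["clean"] = clean_component(store["raw"])     # component 1 -> store
store["scaled"] = scale_component(store["clean"])  # component 2 -> store
print(store["scaled"])
```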

Advantages

Such a system has the advantage of being easy to understand, helped by the abstraction provided by graphs and illustrations. Multiple teams can work on the same pipeline, each taking care of a different component. Furthermore, if a component malfunctions, the downstream components can continue to operate normally, at least as long as there is already-processed data in the shared store.

Disadvantages

Precisely because of the robust structure with which the pipeline is designed, a malfunctioning component can go unnoticed and be difficult to identify. This is why it is important to develop appropriate monitoring systems to ensure that everything is running smoothly; otherwise the data becomes stale and performance drops drastically.
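One simple form such monitoring can take is a freshness check on each data store, so that a silently failing upstream component does not go unnoticed. The sketch below is only an assumption about how this might look; the path and threshold are made up for the example.

```python
# Hedged sketch of a freshness check: flag a data store whose output
# has not been refreshed recently. Path and threshold are illustrative.
import os
import time

def is_stale(store_path: str, max_age_seconds: float = 3600.0) -> bool:
    """Return True if the data store file is older than the allowed age."""
    if not os.path.exists(store_path):
        return True  # the upstream component never produced output
    age = time.time() - os.path.getmtime(store_path)
    return age > max_age_seconds

if is_stale("data/clean_records.parquet"):
    print("WARNING: upstream output is stale; check the cleaning component")
```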

What a Machine Learning Pipeline Looks Like

Ok, we have seen what they are in theory. We know why they are useful and what advantages and disadvantages they have, but without a tangible sense of what they look like, this post would make little sense.

At a macro level, a component (ultimately the main building block of the pipeline) is an entire data science project: for example, a supervised regression problem tackled with batch learning. The data store? The output of the machine learning model: one or more continuous numerical variables predicted by it.

We can also find pipelines at a micro level, that is, within each machine learning project itself. At this level, a pipeline is made up of components that follow the phases of the Data Science Methodology, as pictured below:

In this way data scientists, data engineers, and IT professionals can collaborate across the phases of data preparation, normalization and transformation, model training, model evaluation, and deployment.
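As an illustration of this micro level, here is a minimal pipeline written with scikit-learn's Pipeline API. The synthetic dataset and the particular steps are assumptions made only for the example; the point is that preparation, normalization, training, and evaluation chain together as ordered components.

```python
# Micro-level pipeline sketched with scikit-learn: imputation and scaling
# (data preparation / normalization), a regressor (model training), and a
# held-out score (evaluation). The synthetic data is an assumption made
# only for this example.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.05] = np.nan  # simulate missing values

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # data preparation
    ("scale", StandardScaler()),                   # normalization
    ("model", LinearRegression()),                 # model training
])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipeline.fit(X_train, y_train)
print("R^2 on held-out data:", pipeline.score(X_test, y_test))  # evaluation
```

Keeping every step inside one Pipeline object means the same transformations are applied to training and evaluation data, which is exactly the kind of automation the post argues for.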
