DATA DO – データ道

DataScientists: a blog about everything data-related.

The Data Engineer Role in a ML Pipeline

Data engineers provide the critical foundation for every successful Machine Learning (ML) deployment, supporting the powerful models and insights that often grab headlines. While data scientists focus on model development and evaluation, data engineers ensure that the right data is collected, processed, and made available in a reliable and scalable way. 1. The Overlooked Hero…

January 22, 2026
AI Agent Workflows: Pydantic AI

Building Intelligent Multi-Agent Systems with Pydantic AI In the rapidly evolving landscape of artificial intelligence, multi-agent systems have emerged as a powerful paradigm for tackling complex, domain-specific challenges. Today, I’ll walk you through a sophisticated AI agent workflow that demonstrates how multiple specialized agents can collaborate to process, analyze, and generate insights from research literature…

January 16, 2026
The Ultimate Vector Database Showdown: A Performance and Cost Deep Dive on AWS

In the age of AI, Retrieval-Augmented Generation (RAG) is king. The engine powering this revolution? The vector database. Choosing the right one is critical for building responsive, accurate, and cost-effective AI applications. But with a growing number of options, which one truly delivers? To answer this, we put five popular AWS-hosted vector database solutions to…

January 8, 2026
Data Engineer – The Top 10 Books to read in 2023

Whether you are just starting out as a data engineer or you are an old pro, it is always important to stay up to date on trends and technologies. In this post I will talk about the top 10 books every data engineer should read in 2023 to keep their skills fresh. Data Science from…

May 8, 2023
Apache Nifi on Google Cloud Kubernetes Engine (GKE)

Apache Nifi on GKE can be a good solution if you want a low-code way to process streaming data. If you set it up on GKE, a managed version of Kubernetes, you have a managed, scalable environment and do not need to worry about handling the actual servers. Setup of the Apache…

December 6, 2022
Data Infrastructure in the Cloud

Having your data infrastructure in the cloud has become a real option for a lot of companies, especially since the big cloud providers have a lot of managed services available for a modern data architecture aside from just a database management system.

January 30, 2021
Bringing machine learning models into production

Developing machine learning models and bringing them into production is a task with a lot of challenges. These include model and attribute selection, dealing with missing values, normalization, and others. Finding a workflow that puts all the gears, from data preprocessing and analysis through building models and selecting the best-performing one to serving the model…

May 29, 2020
Google Cloud Data Engineer Exam Preparation

This is a short post with all the resources that helped me prepare for the Google Cloud Data Engineer Exam. There are a lot of courses and resources that help you prepare for it. The following links helped me in my preparation. On Coursera there are several courses…

August 19, 2019
AVRO schema generation with reusable fields

Why use AVRO and AVRO Schema? There are several serialized file formats out there, so choosing the one most suited to your needs is crucial. This blog entry will not compare them; it will just point out some advantages of AVRO and AVRO Schema for an Apache Hadoop™-based system. Avro schema can…

October 7, 2018
Plumber: Getting R ready for production environments?

R Project and Production Running R Project in production is a controversial topic, as is everything concerning R vs Python. Lately there have been some additions to the R Project that made me look into this again. Researching R and its usage in production environments, I came across several packages / projects that can…

June 13, 2018

Got any book recommendations?
