Building a Productive Data Lake: How to keep three systems in sync

Three Systems for save Development

When you are building a productive Data Lake it is important to have at least three environments:

Development: for development, where “everything” is allowed.
Staging: for testing changes in a production like environment.
Production: Running your tested and productive data applications

With these different environments comes the need to keep data structures in sync. You need to find a way to change each system in a controlled way, when it is time to deploy some changes. There are differences in how to manage data structures, depending on where in data processing they occur. Theses differences we split into the parts “Data Entry to Staging” and “Core and Presentation”. I wi

Managing different data structures in the three systems of your Productive Data Lake

As all three systems can be have different data structures, using tools to make this easier is important. Different Apache Avro™ schema versions are managed by a publishing them to a tool called Apache Avro™ Schema Service of which there also also three instances. By publishing changes first on the development, then staging, then production system, we make sure, only tested data structures are deployed to the Production Data Lake. Each change to a schema belongs to a version number, so rolling back to an earlier state, in case of an error is possible.
Evolving our data structures not contained in Apache Avro™ schemas, we use tables managing all meta information.

Data Structures from Entry to Staging

Apache Avro™ is a good way to manage evolving data structures, as I explained here. Once the data is available in Apache Hive™ mapping the AVRO data types to those of your Productive Data Lake system is necessary. We use Apache HAWQ™ as base of our Data Lake. So we map the AVRO data types to HAWQ data types. To keep this as effortless as possible, we reuse attribute names as often as possible. If an attribute, like “user_id” is occuring in different schemas and has the same meaning, then we just reuse it. So the mapping table does not contain an entry for each column of each schema, but only for distinct ones. Considering more than 100 schema with on average 10 attributes, this helps in keeping maintanence down. In the table below is an example, of this mapping.

Schema Fieldname	AVRO datatype	HAWQ datatype
user_id	string	varchar(100)
transaction_id	string	varchar(100)
parameters	map	text
request_time	long	bigint
session_id	string	text
list_of_something	array	text

Data Strucutes for Core and Presentation

A totally different challenge is it, to manage the data structures after data entry. There need to occur transformations, for technical and business user benefit. This can mean other attributes / column names or a splitting or joining of data from one schema into different tables. These tables here again can evolve. Evolving database tables is a pain, since you cannot, in most cases just drop the table and recreate the new one. Here the processing power of Apache HAWQ™ comes into play, as we can exactly do that now. In most cases, at least for transactional data, we just reload the whole table, during the daily batch process. So we manage all table definitions in tables. They contain all table and column attributes, as well as a version number. This makes rolling back possible.

Example of table definitions:

Schema Name	Table Name	Partitioned	distributed by	dropable	Version
core	dim_user		user_id	true	3.0
core	rel_user_purchase		user_id, purchase_id	true	4.0
core	fact_user_registration		user_id	true	4.2
core	fact_usage_page	time_value_request		false	5.0

Exmaple of column definitions:

Schema Name	Table Name	Column Name	Data Type	Data Type Length	Version
core	dim_user	user_id	varchar	100	3.0
core	dim_user	user_type	varchar	50	3.0
core	rel_user_purchase	user_id	varchar	100	4.0
core	rel_user_purchase	purchase_id	varchar	100	4.0

Automation of data structure deployment.

After defining all the meta data concerning the data structures, we need a way to automate this. For this I created a Java programm using all the information from schema definitions and tables for data type mappings and table definitions. This programm runs on any of the three environments and also take in consideration the version it should create. This only works, if no one can create or change table information manually. So the programm first deploys changes to the data structure on the development system. If it works there and all changes are ready to be deployed on the staging system, run the programm in this environment. With the version numbers defined besides each change, it is possible to run version 5.0 on production, version 5.1 on staging and version 5.2 on development if neccessary.

Conclussion

This way is a possibility to make data structure evolution easier and keep three different systems in sync. It is most important to make sure, that there are no manual changes to the data structure. This approach works nicely, if you can drop and recreate all tables without hesitation. If that is not the case, you can still define your tables this way, just make sure they are not dropped automatically when the programm runs. Having all definitions in one place and being able to automatically deploy the tables and the changes, with version control, can be a great help. This way no one has to remember to create new tables manually in either system. So errors due to missing tables or columns are minimized which is good if you create a Productive Data Lake, where errors affect data products needed by other products of your company.

Author

Marc Matt

Senior Data Architect with 15+ years of experience helping Hamburg’s leading enterprises modernize their data infrastructure. I bridge the gap between legacy systems (SAP, Hadoop) and modern AI capabilities.

I help clients:

Migrate & Modernize: Transitioning on-premise data warehouses to Google Cloud/AWS to reduce costs and increase agility.

Implement GenAI: Building secure RAG (Retrieval-Augmented Generation) pipelines to unlock value from internal knowledge bases using LangChain and Vector DBs.
Scale MLOps: Operationalizing machine learning models from PoC to production with Kubernetes and Airflow.

Proven track record leading engineering teams.