Category: Tools
-
Data Engineer – The Top 10 Books to read in 2023
Whether you are just starting out as a data engineer or you are an old pro, it is always important to stay up to date on trends and technologies. In this post I will talk about the top 10 books every data engineer should read in 2023 to keep their skills fresh. Data Science from…
-
Apache Nifi on Google Cloud Kubernetes Engine (GKE)
Apache Nifi on GKE can be a good solution if you want a low-code option for processing streaming data. If you set it up on GKE, Google's managed Kubernetes service, you get a managed, scalable environment and do not need to worry about handling the actual servers. Setup of the Apache…
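To give a flavour of what such a setup involves, here is a minimal sketch of creating a GKE cluster and deploying Nifi with Helm. The cluster name, zone, node count, and the cetic/nifi community chart are assumptions for illustration, not the exact steps from the post.

```
# Create a small GKE cluster (name, zone and size are illustrative)
gcloud container clusters create nifi-cluster --zone europe-west1-b --num-nodes 3
gcloud container clusters get-credentials nifi-cluster --zone europe-west1-b

# Deploy Nifi via a community Helm chart (assumed here: cetic/nifi)
helm repo add cetic https://cetic.github.io/helm-charts
helm repo update
helm install nifi cetic/nifi
```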
-
Data Infrastructure in the Cloud
Having your data infrastructure in the cloud has become a real option for many companies, especially since the big cloud providers now offer many managed services for a modern data architecture beyond just a database management system.
-
Google Cloud Data Engineer Exam Preparation
This is a little text with all the material that helped me prepare for the Google Cloud Data Engineer Exam. There are a lot of courses and resources that can help you prepare for it. The following links helped me in my own preparation. On Coursera there are several courses…
-
Plumber: Getting R ready for production environments?
R Project and Production Running R in production is a controversial topic, as is everything concerning R vs Python. Lately there have been some additions to the R Project that made me look into this again. Researching R and its usage in production environments, I came across several packages / projects that can…
-
Building a Productive Data Lake: How to keep three systems in sync
Three Systems for Safe Development When you are building a productive Data Lake, it is important to have at least three environments: Development: for development, where “everything” is allowed. Staging: for testing changes in a production-like environment. Production: for running your tested and productive data applications. With these different environments comes the need to keep…
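As a rough illustration of keeping the environments apart in code, a data job can pick up its settings from an environment variable; the variable name, project IDs, and bucket paths below are hypothetical.

```python
import os

# Hypothetical per-environment settings; real values would live in config files or a secret store.
CONFIG = {
    "development": {"project": "datalake-dev", "bucket": "gs://datalake-dev-raw"},
    "staging": {"project": "datalake-stg", "bucket": "gs://datalake-stg-raw"},
    "production": {"project": "datalake-prod", "bucket": "gs://datalake-prod-raw"},
}

def load_config() -> dict:
    # DATALAKE_ENV is an assumed variable name; default to development so nothing
    # accidentally runs against production.
    env = os.environ.get("DATALAKE_ENV", "development")
    return CONFIG[env]

if __name__ == "__main__":
    cfg = load_config()
    print(f"Running against {cfg['project']} using {cfg['bucket']}")
```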
-
Apache HAWQ: Building an easily accessible Data Lake
Data Lake vs Data Warehouse The Data Lake architecture is an up-and-coming approach to making all data accessible through several methods, be that real-time or batch analysis. This includes unstructured as well as structured data. In this approach the data is stored on HDFS and made accessible by several tools, including: Apache…
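Since HAWQ speaks the PostgreSQL wire protocol, it can be queried from standard clients; the sketch below uses psycopg2 against a hypothetical table and connection details, just to illustrate how “easily accessible” looks in practice.

```python
import psycopg2

# Connection details and table name are placeholders for illustration.
conn = psycopg2.connect(host="hawq-master.example.com", port=5432,
                        dbname="datalake", user="analyst")
with conn, conn.cursor() as cur:
    # Plain SQL over data that ultimately lives on HDFS.
    cur.execute("SELECT region, count(*) FROM events GROUP BY region")
    for region, cnt in cur.fetchall():
        print(region, cnt)
conn.close()
```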
-
Apache Zeppelin: Use with remote Spark cluster and Yarn
Apache Zeppelin is pretty useful for interactive programming in the web browser. It even comes with its own installation of Apache Spark. For further information you can check my earlier post. But the real power of using Spark with Zeppelin lies in how easily it connects to your existing Spark cluster using YARN.…
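The connection boils down to a bit of configuration: pointing Zeppelin at your Hadoop configuration and telling the Spark interpreter to use YARN. The paths below are typical but assumed; check your own installation.

```
# conf/zeppelin-env.sh (paths are assumptions for illustration)
export SPARK_HOME=/opt/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf

# In the Spark interpreter settings in the Zeppelin UI:
# master = yarn-client
```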
-
Apache Zeppelin: Visualization and Spark data processing
Apache Zeppelin is a web-based notebook for interactive data analytics. It comes with features for all the steps of data analysis: Data Ingestion Data Discovery Data Analytics Data Visualization & Collaboration Besides that feature set, it also supports multiple languages in the backend. Currently it supports languages like: Apache Spark (SQL, PySpark, Java, Scala) R…
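A typical notebook mixes interpreters: process data with PySpark in one paragraph, then use a %sql paragraph to get Zeppelin's built-in charts. The file path and column names below are made up for illustration, and the sketch assumes a Spark 2.x interpreter where the `spark` session is already available.

```
%pyspark
# Read a CSV file and register it as a temporary view (path and columns are illustrative)
df = spark.read.csv("/data/sales.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("sales")

%sql
-- Zeppelin renders this result as a table or chart
SELECT region, sum(amount) AS total FROM sales GROUP BY region
```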
-
Apache Spark 2.0
Apache Spark has released version 2.0, which is a major step forward in usability for Spark users and especially for people who refrained from using it due to the cost of learning a new programming language or tool. That is in the past now, as Spark 2.0 brings improved SQL functionality with SQL2003 support. It…
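The new entry point in 2.0 is the SparkSession, and most work can be done in plain SQL or through the DataFrame API. Here is a minimal PySpark sketch; the file name and columns are made up for illustration.

```python
from pyspark.sql import SparkSession

# SparkSession replaces the separate SQLContext/HiveContext from Spark 1.x
spark = SparkSession.builder.appName("spark2-sql-example").getOrCreate()

# Illustrative input file and schema
people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

# Query the data through the unified SQL engine
spark.sql("SELECT name, age FROM people WHERE age > 30 ORDER BY age DESC").show()

spark.stop()
```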