{"id":509,"date":"2016-10-20T11:19:29","date_gmt":"2016-10-20T11:19:29","guid":{"rendered":"http:\/\/ds.eindeutigunsinnig.de\/?p=509"},"modified":"2018-01-21T11:31:47","modified_gmt":"2018-01-21T11:31:47","slug":"apache-hawq-building-easily-accessable-data-lake","status":"publish","type":"post","link":"https:\/\/datascientists.info\/index.php\/2016\/10\/20\/apache-hawq-building-easily-accessable-data-lake\/","title":{"rendered":"Apache HAWQ: Building an easily accessible Data Lake"},"content":{"rendered":"<h2><a href=\"http:\/\/datascientists.info\/wp-content\/uploads\/2016\/10\/hawq_logo.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-293 alignright\" src=\"http:\/\/datascientists.info\/wp-content\/uploads\/2016\/10\/hawq_logo.png\" alt=\"Apache HAWQ for Data Lake Architecture\" width=\"205\" height=\"209\" \/><\/a>Data Lake vs. Data Warehouse<\/h2>\n<p>The Data Lake Architecture is an up-and-coming approach to making all data accessible through several methods, be that in real-time or batch analysis. This includes unstructured data as well as structured data. In this approach the data is stored on HDFS and made accessible by several tools, including:<\/p>\n<ul>\n<li><a href=\"http:\/\/spark.apache.org\/\" target=\"_blank\">Apache Spark<\/a><\/li>\n<li>MapReduce<\/li>\n<li><a href=\"https:\/\/hive.apache.org\/\" target=\"_blank\">Apache Hive<\/a><\/li>\n<li><a href=\"http:\/\/hawq.incubator.apache.org\/\" target=\"_blank\">Apache HAWQ<\/a><\/li>\n<\/ul>\n<p>All of these tools have advantages and disadvantages when used to process data, but all of them combined make your data accessible. This is the first step in building a Data Lake. Your data, even schemaless data, has to be accessible to your customers.<br \/>\nA classical Data Warehouse, by contrast, contains only structured data that is at least preprocessed and has a fixed schema. Data in a classical Data Warehouse is not the raw data entered into the system. 
You need a separate staging area for transformations. Usually this area is accessible only to the Data Warehouse developers, not to all consumers of your data.<\/p>\n<h2>Data Lake Architecture using Apache HAWQ<\/h2>\n<p>Building a Data Lake with Apache HAWQ is a challenge, but one that can be overcome at the design stage. One way to build such a system is shown in the picture below.<br \/>\n<a href=\"http:\/\/datascientists.info\/wp-content\/uploads\/2016\/10\/data_lake_with_hawq.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/datascientists.info\/wp-content\/uploads\/2016\/10\/data_lake_with_hawq.png\" alt=\"data_lake_with_hawq\" width=\"600\" height=\"210\" class=\"alignnone size-medium wp-image-300\" \/><\/a><\/p>\n<h3>Data Entry<\/h3>\n<p>To make it possible to use <a href=\"http:\/\/hawq.incubator.apache.org\/\" target=\"_blank\">Apache HAWQ<\/a>, the starting point is a controlled Data Entry. This is a compromise between schemaless and schematized data. <a href=\"https:\/\/avro.apache.org\/\" target=\"_blank\">Apache AVRO<\/a> is one way to achieve this. Schema evolution is an integral part of AVRO, and it provides structures such as maps and arrays for storing unstructured data. A separate article about AVRO will be one of the next topics here, explaining schema evolution and how to make the most of it.<br \/>\nData structured by a schema can then be pushed message by message into a message queue. Choose the queue that best fits your needs. If you need secure transactions, <a href=\"https:\/\/www.rabbitmq.com\/\" target=\"_blank\">RabbitMQ<\/a> may be the right choice. Another option is <a href=\"http:\/\/kafka.apache.org\/\" target=\"_blank\">Apache Kafka<\/a>.<\/p>\n<h3>Pre-aggregating Data<\/h3>\n<p>Processing and storing single messages on HDFS is not an option, so another system is needed to aggregate messages before storing them on HDFS. 
<a href=\"https:\/\/nifi.apache.org\/\" target=\"_blank\">Apache Nifi<\/a> is a good choice for this. It comes with processors that make such tasks fairly easy. Its MergeContent processor merges single AVRO messages and removes all headers but one before they are written to HDFS.<br \/>\n<a href=\"http:\/\/datascientists.info\/wp-content\/uploads\/2016\/10\/nifi_merge_messages.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/datascientists.info\/wp-content\/uploads\/2016\/10\/nifi_merge_messages.png\" alt=\"nifi_merge_messages\" width=\"600\" height=\"66\" class=\"alignnone size-medium wp-image-302\" \/><\/a><br \/>\nIf the merged files are still smaller than the HDFS block size, it is possible to read them back from HDFS and merge them into even larger files.<br \/>\n<a href=\"http:\/\/datascientists.info\/wp-content\/uploads\/2016\/10\/nifi_merge_files.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/datascientists.info\/wp-content\/uploads\/2016\/10\/nifi_merge_files.png\" alt=\"nifi_merge_files\" width=\"600\" height=\"62\" class=\"alignnone size-medium wp-image-301\" \/><\/a> <\/p>\n<h3>Making data available in the Data Lake<\/h3>\n<p>Use <a href=\"https:\/\/hive.apache.org\/\" target=\"_blank\">Apache Hive<\/a> to make the data stored in AVRO format accessible. HAWQ could read the AVRO files directly, but Hive handles schema evolution more effectively. For example, if you need to add a new optional field to an existing schema, give that field a default value and Hive will fill entries from earlier messages with this value. When HAWQ then accesses this Hive table, it automatically reads the default value for fields that were added later. It could not do this by itself. 
Hive also currently has a more robust way of handling map fields and extracting their keys and values.<\/p>\n<h3>Data Lake with SQL Access<\/h3>\n<p>All data is now available in <a href=\"http:\/\/hawq.incubator.apache.org\/\" target=\"_blank\">Apache HAWQ<\/a>. This enables transformations using SQL and makes all of your data accessible to a broad audience in your company. SQL skills are more common than, say, Spark programming in Java, Scala or PySpark. From here it is possible to give analysts access to all of the data or to build data marts for individual subject areas using SQL transformations. Connectivity to reporting tools like <a href=\"http:\/\/www.tableau.com\/\" target=\"_blank\">Tableau<\/a> is possible with a PostgreSQL driver. Even advanced analytics are possible if you install <a href=\"https:\/\/madlib.incubator.apache.org\/\" target=\"_blank\">Apache MADlib<\/a> on your HAWQ cluster.<\/p>\n<h2>Using Data outside of HAWQ<\/h2>\n<p>It is also possible to use all data outside of HAWQ if needed. Since all data is available in AVRO format, it can be accessed with Apache Spark from Apache Zeppelin. Hive queries are possible too, since all data is registered there as external tables, which we used for the integration into HAWQ.<br \/>\nThe results of such processing can be accessed in HAWQ as well. Save the results in AVRO format for integration in the way described above, or use <a href=\"http:\/\/hdb.docs.pivotal.io\/201\/hawq\/reference\/cli\/admin_utilities\/hawqregister.html\" target=\"_blank\">&#8220;hawq register&#8221;<\/a> to access Parquet files directly from HDFS. <\/p>\n<h2>Conclusion<\/h2>\n<p>Using Apache HAWQ as the base of a Data Lake is possible; you just have to take some constraints into consideration. Entering data in a semi-structured form using the AVRO format also saves work later when you process the data. The main advantage is that you can use SQL as an interface to all of your data. 
This enables many people in your company to access your data and will help you on your way to data-driven decisions.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data Lake vs Datawarehouse The Data Lake Architecture is an up and coming approach to making all data accessible through several methods, be that in real-time or batch analysis. This includes unstructured data as well as structured data. In this approach the data is stored on HDFS and made accessible by several tools, including: Apache [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3,4,6,9],"tags":[15,17,21,27,55,56],"ppma_author":[144],"class_list":["post-509","post","type-post","status-publish","format-standard","hentry","category-big-data","category-data-lake","category-data-warehouse","category-tools","tag-apache-hawq","tag-apache-nifi","tag-avro-schema","tag-data-lake","tag-massively-parallel-processing-mpp-databases","tag-mpp","author-marc"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.7 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Apache HAWQ: Building an easily accessable Data Lake - DATA DO - \u30c7\u30fc\u30bf \u9053<\/title>\n<meta name=\"description\" content=\"The Data Lake Architecture is an up and coming approach to making all data accessible through several methods, be that in real-time or batch analysis.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/datascientists.info\/index.php\/2016\/10\/20\/apache-hawq-building-easily-accessable-data-lake\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Apache HAWQ: Building an easily accessable Data Lake - DATA DO - 
\u30c7\u30fc\u30bf \u9053\" \/>\n<meta property=\"og:description\" content=\"The Data Lake Architecture is an up and coming approach to making all data accessible through several methods, be that in realt-time or batch analysis.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/datascientists.info\/index.php\/2016\/10\/20\/apache-hawq-building-easily-accessable-data-lake\/\" \/>\n<meta property=\"og:site_name\" content=\"DATA DO - \u30c7\u30fc\u30bf \u9053\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DataScientists\/\" \/>\n<meta property=\"article:published_time\" content=\"2016-10-20T11:19:29+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-01-21T11:31:47+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/datascientists.info\/wp-content\/uploads\/2016\/10\/hawq_logo.png\" \/>\n<meta name=\"author\" content=\"Marc Matt\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Marc Matt\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/10\\\/20\\\/apache-hawq-building-easily-accessable-data-lake\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/10\\\/20\\\/apache-hawq-building-easily-accessable-data-lake\\\/\"},\"author\":{\"name\":\"Marc Matt\",\"@id\":\"https:\\\/\\\/datascientists.info\\\/#\\\/schema\\\/person\\\/723078870bf3135121086d46ebb12f19\"},\"headline\":\"Apache HAWQ: Building an easily accessable Data Lake\",\"datePublished\":\"2016-10-20T11:19:29+00:00\",\"dateModified\":\"2018-01-21T11:31:47+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/10\\\/20\\\/apache-hawq-building-easily-accessable-data-lake\\\/\"},\"wordCount\":820,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/datascientists.info\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/10\\\/20\\\/apache-hawq-building-easily-accessable-data-lake\\\/#primaryimage\"},\"thumbnailUrl\":\"http:\\\/\\\/datascientists.info\\\/wp-content\\\/uploads\\\/2016\\\/10\\\/hawq_logo.png\",\"keywords\":[\"Apache HAWQ\",\"Apache Nifi\",\"AVRO Schema\",\"Data Lake\",\"Massively Parallel Processing (MPP) databases\",\"MPP\"],\"articleSection\":[\"Big Data\",\"Data Lake\",\"Data 
Warehouse\",\"Tools\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/10\\\/20\\\/apache-hawq-building-easily-accessable-data-lake\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/10\\\/20\\\/apache-hawq-building-easily-accessable-data-lake\\\/\",\"url\":\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/10\\\/20\\\/apache-hawq-building-easily-accessable-data-lake\\\/\",\"name\":\"Apache HAWQ: Building an easily accessable Data Lake - DATA DO - \u30c7\u30fc\u30bf \u9053\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/datascientists.info\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/10\\\/20\\\/apache-hawq-building-easily-accessable-data-lake\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/10\\\/20\\\/apache-hawq-building-easily-accessable-data-lake\\\/#primaryimage\"},\"thumbnailUrl\":\"http:\\\/\\\/datascientists.info\\\/wp-content\\\/uploads\\\/2016\\\/10\\\/hawq_logo.png\",\"datePublished\":\"2016-10-20T11:19:29+00:00\",\"dateModified\":\"2018-01-21T11:31:47+00:00\",\"description\":\"The Data Lake Architecture is an up and coming approach to making all data accessible through several methods, be that in realt-time or batch 
analysis.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/10\\\/20\\\/apache-hawq-building-easily-accessable-data-lake\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/10\\\/20\\\/apache-hawq-building-easily-accessable-data-lake\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/10\\\/20\\\/apache-hawq-building-easily-accessable-data-lake\\\/#primaryimage\",\"url\":\"http:\\\/\\\/datascientists.info\\\/wp-content\\\/uploads\\\/2016\\\/10\\\/hawq_logo.png\",\"contentUrl\":\"http:\\\/\\\/datascientists.info\\\/wp-content\\\/uploads\\\/2016\\\/10\\\/hawq_logo.png\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/10\\\/20\\\/apache-hawq-building-easily-accessable-data-lake\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/datascientists.info\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Apache HAWQ: Building an easily accessable Data Lake\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/datascientists.info\\\/#website\",\"url\":\"https:\\\/\\\/datascientists.info\\\/\",\"name\":\"Data Scientists\",\"description\":\"Digging data, Big Data, Analysis, Data Mining\",\"publisher\":{\"@id\":\"https:\\\/\\\/datascientists.info\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/datascientists.info\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/datascientists.info\\\/#organization\",\"name\":\"DATA DO - \u30c7\u30fc\u30bf 
\u9053\",\"url\":\"https:\\\/\\\/datascientists.info\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/datascientists.info\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/datascientists.info\\\/wp-content\\\/uploads\\\/2026\\\/02\\\/Bildschirmfoto-vom-2026-02-02-08-13-21.png\",\"contentUrl\":\"https:\\\/\\\/datascientists.info\\\/wp-content\\\/uploads\\\/2026\\\/02\\\/Bildschirmfoto-vom-2026-02-02-08-13-21.png\",\"width\":250,\"height\":174,\"caption\":\"DATA DO - \u30c7\u30fc\u30bf \u9053\"},\"image\":{\"@id\":\"https:\\\/\\\/datascientists.info\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/DataScientists\\\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/datascientists.info\\\/#\\\/schema\\\/person\\\/723078870bf3135121086d46ebb12f19\",\"name\":\"Marc Matt\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/74f48ef754cf04f628f42ed117a3f2b42931feeb41a3cca2313b9714a7d4fdd2?s=96&d=mm&r=g53b84b5f47a2156ba8b047d71d6d05fc\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/74f48ef754cf04f628f42ed117a3f2b42931feeb41a3cca2313b9714a7d4fdd2?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/74f48ef754cf04f628f42ed117a3f2b42931feeb41a3cca2313b9714a7d4fdd2?s=96&d=mm&r=g\",\"caption\":\"Marc Matt\"},\"description\":\"Senior Data Architect with 15+ years of experience helping Hamburg's leading enterprises modernize their data infrastructure. I bridge the gap between legacy systems (SAP, Hadoop) and modern AI capabilities. I help clients: Migrate &amp; Modernize: Transitioning on-premise data warehouses to Google Cloud\\\/AWS to reduce costs and increase agility. Implement GenAI: Building secure RAG (Retrieval-Augmented Generation) pipelines to unlock value from internal knowledge bases using LangChain and Vector DBs. 
Scale MLOps: Operationalizing machine learning models from PoC to production with Kubernetes and Airflow. Proven track record leading engineering teams.\",\"sameAs\":[\"https:\\\/\\\/data-do.de\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Apache HAWQ: Building an easily accessable Data Lake - DATA DO - \u30c7\u30fc\u30bf \u9053","description":"The Data Lake Architecture is an up and coming approach to making all data accessible through several methods, be that in realt-time or batch analysis.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/datascientists.info\/index.php\/2016\/10\/20\/apache-hawq-building-easily-accessable-data-lake\/","og_locale":"en_US","og_type":"article","og_title":"Apache HAWQ: Building an easily accessable Data Lake - DATA DO - \u30c7\u30fc\u30bf \u9053","og_description":"The Data Lake Architecture is an up and coming approach to making all data accessible through several methods, be that in realt-time or batch analysis.","og_url":"https:\/\/datascientists.info\/index.php\/2016\/10\/20\/apache-hawq-building-easily-accessable-data-lake\/","og_site_name":"DATA DO - \u30c7\u30fc\u30bf \u9053","article_publisher":"https:\/\/www.facebook.com\/DataScientists\/","article_published_time":"2016-10-20T11:19:29+00:00","article_modified_time":"2018-01-21T11:31:47+00:00","og_image":[{"url":"http:\/\/datascientists.info\/wp-content\/uploads\/2016\/10\/hawq_logo.png","type":"","width":"","height":""}],"author":"Marc Matt","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Marc Matt","Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/datascientists.info\/index.php\/2016\/10\/20\/apache-hawq-building-easily-accessable-data-lake\/#article","isPartOf":{"@id":"https:\/\/datascientists.info\/index.php\/2016\/10\/20\/apache-hawq-building-easily-accessable-data-lake\/"},"author":{"name":"Marc Matt","@id":"https:\/\/datascientists.info\/#\/schema\/person\/723078870bf3135121086d46ebb12f19"},"headline":"Apache HAWQ: Building an easily accessable Data Lake","datePublished":"2016-10-20T11:19:29+00:00","dateModified":"2018-01-21T11:31:47+00:00","mainEntityOfPage":{"@id":"https:\/\/datascientists.info\/index.php\/2016\/10\/20\/apache-hawq-building-easily-accessable-data-lake\/"},"wordCount":820,"commentCount":0,"publisher":{"@id":"https:\/\/datascientists.info\/#organization"},"image":{"@id":"https:\/\/datascientists.info\/index.php\/2016\/10\/20\/apache-hawq-building-easily-accessable-data-lake\/#primaryimage"},"thumbnailUrl":"http:\/\/datascientists.info\/wp-content\/uploads\/2016\/10\/hawq_logo.png","keywords":["Apache HAWQ","Apache Nifi","AVRO Schema","Data Lake","Massively Parallel Processing (MPP) databases","MPP"],"articleSection":["Big Data","Data Lake","Data Warehouse","Tools"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/datascientists.info\/index.php\/2016\/10\/20\/apache-hawq-building-easily-accessable-data-lake\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/datascientists.info\/index.php\/2016\/10\/20\/apache-hawq-building-easily-accessable-data-lake\/","url":"https:\/\/datascientists.info\/index.php\/2016\/10\/20\/apache-hawq-building-easily-accessable-data-lake\/","name":"Apache HAWQ: Building an easily accessable Data Lake - DATA DO - \u30c7\u30fc\u30bf 
\u9053","isPartOf":{"@id":"https:\/\/datascientists.info\/#website"},"primaryImageOfPage":{"@id":"https:\/\/datascientists.info\/index.php\/2016\/10\/20\/apache-hawq-building-easily-accessable-data-lake\/#primaryimage"},"image":{"@id":"https:\/\/datascientists.info\/index.php\/2016\/10\/20\/apache-hawq-building-easily-accessable-data-lake\/#primaryimage"},"thumbnailUrl":"http:\/\/datascientists.info\/wp-content\/uploads\/2016\/10\/hawq_logo.png","datePublished":"2016-10-20T11:19:29+00:00","dateModified":"2018-01-21T11:31:47+00:00","description":"The Data Lake Architecture is an up and coming approach to making all data accessible through several methods, be that in realt-time or batch analysis.","breadcrumb":{"@id":"https:\/\/datascientists.info\/index.php\/2016\/10\/20\/apache-hawq-building-easily-accessable-data-lake\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/datascientists.info\/index.php\/2016\/10\/20\/apache-hawq-building-easily-accessable-data-lake\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/datascientists.info\/index.php\/2016\/10\/20\/apache-hawq-building-easily-accessable-data-lake\/#primaryimage","url":"http:\/\/datascientists.info\/wp-content\/uploads\/2016\/10\/hawq_logo.png","contentUrl":"http:\/\/datascientists.info\/wp-content\/uploads\/2016\/10\/hawq_logo.png"},{"@type":"BreadcrumbList","@id":"https:\/\/datascientists.info\/index.php\/2016\/10\/20\/apache-hawq-building-easily-accessable-data-lake\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/datascientists.info\/"},{"@type":"ListItem","position":2,"name":"Apache HAWQ: Building an easily accessable Data Lake"}]},{"@type":"WebSite","@id":"https:\/\/datascientists.info\/#website","url":"https:\/\/datascientists.info\/","name":"Data Scientists","description":"Digging data, Big Data, Analysis, Data 
Mining","publisher":{"@id":"https:\/\/datascientists.info\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/datascientists.info\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/datascientists.info\/#organization","name":"DATA DO - \u30c7\u30fc\u30bf \u9053","url":"https:\/\/datascientists.info\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/datascientists.info\/#\/schema\/logo\/image\/","url":"https:\/\/datascientists.info\/wp-content\/uploads\/2026\/02\/Bildschirmfoto-vom-2026-02-02-08-13-21.png","contentUrl":"https:\/\/datascientists.info\/wp-content\/uploads\/2026\/02\/Bildschirmfoto-vom-2026-02-02-08-13-21.png","width":250,"height":174,"caption":"DATA DO - \u30c7\u30fc\u30bf \u9053"},"image":{"@id":"https:\/\/datascientists.info\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DataScientists\/"]},{"@type":"Person","@id":"https:\/\/datascientists.info\/#\/schema\/person\/723078870bf3135121086d46ebb12f19","name":"Marc Matt","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/74f48ef754cf04f628f42ed117a3f2b42931feeb41a3cca2313b9714a7d4fdd2?s=96&d=mm&r=g53b84b5f47a2156ba8b047d71d6d05fc","url":"https:\/\/secure.gravatar.com\/avatar\/74f48ef754cf04f628f42ed117a3f2b42931feeb41a3cca2313b9714a7d4fdd2?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/74f48ef754cf04f628f42ed117a3f2b42931feeb41a3cca2313b9714a7d4fdd2?s=96&d=mm&r=g","caption":"Marc Matt"},"description":"Senior Data Architect with 15+ years of experience helping Hamburg's leading enterprises modernize their data infrastructure. I bridge the gap between legacy systems (SAP, Hadoop) and modern AI capabilities. 
I help clients: Migrate &amp; Modernize: Transitioning on-premise data warehouses to Google Cloud\/AWS to reduce costs and increase agility. Implement GenAI: Building secure RAG (Retrieval-Augmented Generation) pipelines to unlock value from internal knowledge bases using LangChain and Vector DBs. Scale MLOps: Operationalizing machine learning models from PoC to production with Kubernetes and Airflow. Proven track record leading engineering teams.","sameAs":["https:\/\/data-do.de"]}]}},"authors":[{"term_id":144,"user_id":1,"is_guest":0,"slug":"marc","display_name":"Marc Matt","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/74f48ef754cf04f628f42ed117a3f2b42931feeb41a3cca2313b9714a7d4fdd2?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/posts\/509","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/comments?post=509"}],"version-history":[{"count":2,"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/posts\/509\/revisions"}],"predecessor-version":[{"id":523,"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/posts\/509\/revisions\/523"}],"wp:attachment":[{"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/media?parent=509"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/categories?post=509"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/tags?post=509"},{"taxonomy":"author","embeddable":true,"href":"https:\
/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/ppma_author?post=509"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}