Apache AVRO: Data format for evolution of data

Flexible Data Format: Apache AVRO

Apache AVRO is a data serialization format. It comes with an data definition format that is easy to understand. With the possibility to add optional fields there is a solution for evolution of the schemas for the data.

Defining a Schema

Defining a schema in Apache AVRO is quite easy, since it is a JSON object in the form of:

{ 
"type": "typename",
"name": "name of field",
"doc": "documentation of field"
}

A schema does consists of:

a namespace
name
type = record
documentation
fields

A typical schema would look like this:

{
	"namespace": "info.datascientists.avro",
	"type": "record",
	"name": "RandomData",
	"doc": "data sets contain random data.",
	"fields": [
		{
			"name": "long_field",
			"type": "long",
			"doc": "a field containg a number"
		},
		{
			"name": "string_field",
			"type": "string",
			"doc": "a field containing a string"
		}
      ]
}

Changing a schema downwards compatible

If you now want to add a new field and stay compatible to your existing schema, you can just do the following:

{
	"namespace": "info.datascientists.avro",
	"type": "record",
	"name": "RandomData",
	"doc": "data sets contain random data.",
	"fields": [
		{
			"name": "long_field",
			"type": "long",
			"doc": "a field containg a number"
		},
		{
			"name": "string_field",
			"type": "string",
			"doc": "a field containing a string"
		},
                {
                       "name": "optional_new_field",
                       "type": ["null", "string"],
                       "default": "New Field",
                       "doc": "This is a new options field"
                }
      ]
}

This change is still compatible to the version above. It is marked as options by [“null”, “string”] in the type field. The attribute default will fill this field with the value documented here, if the field is not existing in the data.

Serializing data using the schema

Once the schema is defined, it can be used to serialize the data. This serialization also serves as a compression format of about 30% in comparison to normal text. Serializazion is possible in a wide range of programming languages, but it is best implemented in Java. A tutorial on how to serialize data using a schema can be found here.

Reading the data with Apache Hive

Data stored in Apache AVRO is easily accessible if read by Hive external tables. The data is autmatically deserialized and made human readable. Apache Hive even supports default values on schema changes. If new fields are added with a default value, the Hive table can read all versions of the schema in the same table.

Flexible Data Format: Apache AVRO

Defining a Schema

Changing a schema downwards compatible

Serializing data using the schema

Reading the data with Apache Hive

Comments

Leave a Reply Cancel reply