Apache AVRO Schema evolution and Apache Hive

Apache AVRO™ for schema evolution: a perfect fit

Apache Avro™ and Apache Hive™ go hand in glove when it comes to schema evolution. Accessing data stored in Apache Avro™ with Apache Hive™ is fairly straightforward, since Hive comes with a built-in (de-)serializer for this format. You just need to specify the format of the table as shown below.

CREATE TABLE schema_evo
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
    'avro.schema.url'='http://url_to_schema_evo/schema_evo.avsc');

As you can see, there is no need to specify the column names; they are derived from the schema. The example above uses a URL for the schema definition, but you could just as easily define the schema inline in the CREATE TABLE statement. This is done with the following property:

TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "com.example",
  "name": "schema_evo",
  "type": "record",
  "fields": [
    {
      "name":"id",
      "type":"int",
      "doc":"unique identifier of row"
    },
    {
      "name":"name",
      "type":"string",
      "doc":"name of field entry"
    },
    {
      "name":"description",
      "type":"string",
      "doc":"description of field entry"
    }
  ]
}');

For schema evolution with Apache AVRO™, however, I would discourage the inline variant, since you would then have to adapt your tables each time the schema changes. With avro.schema.url, you only have to update the schema file at the referenced location.

Making use of Schema Evolution to handle Data

Suppose that after you have set up your tables in Apache Hive™ and stored some data, you realize that some information is missing. Since Apache AVRO™ is designed to let you change the schema and still process files written before the change, this is no major issue, and Hive supports such changes. To stay compatible with the old data accessed by Hive, the new fields need to be optional. An optional field is declared by making its type a union of "null" and the field's data type:

{
"name": "optional",
"type": ["null", "string"],
"doc": "optional field"
}

To make that field usable in Apache Hive™, you also need to supply a default value, which changes the field definition to:

{
"name": "optional",
"type": ["null", "string"],
"default": null,
"doc": "optional field"
}

I intentionally set the default value to null, so the new field simply contains NULL values when queried in Apache Hive™. Other defaults for the optional field proved unsatisfying for me when handling the data and processing it further, and a non-null default value would also need further explanation for other people using the data. The complete schema then looks like this:

{
  "namespace": "com.example",
  "name": "schema_evo",
  "type": "record",
  "fields": [
    {
      "name":"id",
      "type":"int",
      "doc":"unique identifier of row"
    },
    {
      "name":"name",
      "type":"string",
      "doc":"name of field entry"
    },
    {
      "name":"description",
      "type":"string",
      "doc":"description of field entry"
    },
    {
      "name": "optional",
      "type": ["null", "string"],
      "default": null,
      "doc": "optional field"
    }
  ]
}
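
To see this schema resolution at work outside of Hive, here is a minimal sketch in Java using the Avro generic API. It assumes a container file schema_evo.avro that was written with the old schema (before the optional field existed) and the evolved schema saved as schema_evo.avsc; both file names are made up for this example.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadWithEvolvedSchema {
    public static void main(String[] args) throws Exception {
        // the evolved schema, including the nullable "optional" field
        Schema newSchema = new Schema.Parser().parse(new File("schema_evo.avsc"));

        // the container file embeds the schema it was written with, so Avro
        // can resolve that old writer schema against the new reader schema
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>(newSchema);
        try (DataFileReader<GenericRecord> reader =
                new DataFileReader<>(new File("schema_evo.avro"), datumReader)) {
            for (GenericRecord record : reader) {
                // the field is missing in the old data, so its default applies
                System.out.println(record.get("optional")); // prints: null
            }
        }
    }
}

Hive performs essentially the same resolution when a table whose avro.schema.url points at the new schema version reads files that were written with the old one.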

Apache AVRO: Data format for evolution of data

Flexible data format: Apache AVRO

Apache AVRO is a data serialization format. It comes with a data definition format that is easy to understand. With the possibility to add optional fields, it offers a solution for evolving the schemas of your data.

Defining a Schema

Defining a schema in Apache AVRO is quite easy, since schemas are plain JSON. Each field is a JSON object of the form:

{ 
"type": "typename",
"name": "name of field",
"doc": "documentation of field"
}

A schema consists of:

  • a namespace
  • name
  • type = record
  • documentation
  • fields

A typical schema would look like this:

{
	"namespace": "info.datascientists.avro",
	"type": "record",
	"name": "RandomData",
	"doc": "data sets contain random data.",
	"fields": [
		{
			"name": "long_field",
			"type": "long",
			"doc": "a field containg a number"
		},
		{
			"name": "string_field",
			"type": "string",
			"doc": "a field containing a string"
		}
      ]
}

Changing a schema in a backward compatible way

If you now want to add a new field and stay compatible with your existing data, you can do the following:

{
	"namespace": "info.datascientists.avro",
	"type": "record",
	"name": "RandomData",
	"doc": "data sets contain random data.",
	"fields": [
		{
			"name": "long_field",
			"type": "long",
			"doc": "a field containg a number"
		},
		{
			"name": "string_field",
			"type": "string",
			"doc": "a field containing a string"
		},
		{
			"name": "optional_new_field",
			"type": ["string", "null"],
			"default": "New Field",
			"doc": "this is a new optional field"
		}
      ]
}

This change is still compatible with the version above. The field is marked as optional by the union type ["string", "null"]. The default attribute fills the field with the value given here whenever the field does not exist in the data. Note that Avro requires the default value to match the first type in the union, which is why "string" is listed before "null" in this example.
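
If you want to verify such a change programmatically before rolling it out, the Avro Java library ships a schema compatibility checker. Here is a minimal sketch; the .avsc file names for the two schema versions are made up for this example.

import java.io.File;
import java.util.Collections;
import org.apache.avro.Schema;
import org.apache.avro.SchemaValidationException;
import org.apache.avro.SchemaValidator;
import org.apache.avro.SchemaValidatorBuilder;

public class CompatibilityCheck {
    public static void main(String[] args) throws Exception {
        Schema oldSchema = new Schema.Parser().parse(new File("random_data_v1.avsc"));
        Schema newSchema = new Schema.Parser().parse(new File("random_data_v2.avsc"));

        // "canRead" checks that the new schema can read data written
        // with the existing schemas, i.e. backward compatibility
        SchemaValidator validator = new SchemaValidatorBuilder()
                .canReadStrategy()
                .validateLatest();
        try {
            validator.validate(newSchema, Collections.singletonList(oldSchema));
            System.out.println("new schema can read old data");
        } catch (SchemaValidationException e) {
            System.out.println("incompatible change: " + e.getMessage());
        }
    }
}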

Serializing data using the schema

Once the schema is defined, it can be used to serialize the data. The serialized binary form is also compact: compared to plain text, I have seen files shrink by about 30%. Serialization is possible in a wide range of programming languages, but it is best supported in Java. A tutorial on how to serialize data using a schema can be found in the official Apache Avro documentation.
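
As a small illustration, here is a minimal sketch in Java using the Avro generic API. It assumes the RandomData schema above has been saved to a file named random_data.avsc, which is a made-up name for this example.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class RandomDataWriter {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(new File("random_data.avsc"));

        // build a record that matches the schema
        GenericRecord record = new GenericData.Record(schema);
        record.put("long_field", 42L);
        record.put("string_field", "hello avro");

        // write an Avro container file; it embeds the writer schema,
        // which is what later allows resolving it against newer versions
        try (DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("random_data.avro"));
            writer.append(record);
        }
    }
}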

Reading the data with Apache Hive

Data stored in Apache AVRO is easily accessible through Hive external tables. The data is automatically deserialized and presented in human readable form. Apache Hive also honors default values on schema changes: if new fields are added with a default value, a Hive table can read data written with all versions of the schema.
