Apache Avro Schema Evolution and Apache Hive

Apache Avro™ for schema evolution: a perfect fit

Apache Avro™ and Apache Hive™ go hand in glove when it comes to schema evolution. Accessing data stored in Apache Avro™ from Apache Hive™ is pretty straightforward, since Hive comes with a built-in (de-)serializer for this format. You just need to specify the format of the table as shown below.

CREATE TABLE schema_evo
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
    'avro.schema.url'='http://url_to_schema_evo/schema_evo.avsc');

As you can see, there is no need to even specify the column names, as they will be derived from the schema. The above example uses a URL for the schema definition, but you could just as easily define your schema explicitly in the CREATE TABLE statement. This is done by using the following options in the statement:

TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "com.example",
  "name": "schema_evo",
  "type": "record",
  "fields": [
    {
      "name":"id",
      "type":"int",
      "doc":"unique identifier of row"
    },
    {
      "name":"name",
      "type":"string",
      "doc":"name of field entry"
    },
    {
      "name":"description",
      "type":"string",
      "doc":"description of field entry"
    }
  ]
}');
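
Either way, whether you use the URL or the literal, you can verify that Hive derived the columns correctly. A quick sanity check, assuming the schema_evo table from the example above:

-- Columns are derived from the Avro schema; depending on your Hive
-- version, the column comments may show the Avro "doc" strings or
-- just "from deserializer".
DESCRIBE schema_evo;
-- id            int      unique identifier of row
-- name          string   name of field entry
-- description   string   description of field entry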

For the purpose of using schema evolution in Apache Avro™ I would discourage the literal approach, since you would then need to adapt your tables each time you change the schema.
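
To illustrate the difference, here is a hedged sketch of what maintenance looks like in both cases; the statements assume the schema_evo table from above, and the schema body is elided as a placeholder:

-- With avro.schema.literal, every schema change means pasting the
-- complete updated schema back into the table properties:
ALTER TABLE schema_evo SET TBLPROPERTIES (
  'avro.schema.literal'='{ ...full updated schema... }');

-- With avro.schema.url, the table definition stays untouched; you
-- only publish the new schema_evo.avsc file at the configured URL.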

Making Use of Schema Evolution to Handle Data

After you have set up the tables in Apache Hive™ and already stored some data, you may realize that some information is missing. Since being able to change the schema and still process the files written before the change is a core feature of Apache Avro™, this is no major issue, and Hive supports those changes as well. To make them compatible with your old data as accessed by Hive, the new fields need to be optional. An optional field is defined by setting its type to ["null", "datatype of field"].

{
  "name": "optional",
  "type": ["null", "string"],
  "doc": "optional field"
}

To make that field usable in Apache Hive™, you also need to supply a default value, which changes the field definition as follows:

{
  "name": "optional",
  "type": ["null", "string"],
  "default": null,
  "doc": "optional field"
}

I intentionally made the default value null, so that the new field contains NULL values when queried in Apache Hive™. Other choices for the optional field's default proved unsatisfying for me when handling the data and processing it further. A non-null default value would also need further explanation for other people using the data. The complete schema then looks like this:

{
  "namespace": "com.example",
  "name": "schema_evo",
  "type": "record",
  "fields": [
    {
      "name":"id",
      "type":"int",
      "doc":"unique identifier of row"
    },
    {
      "name":"name",
      "type":"string",
      "doc":"name of field entry"
    },
    {
      "name":"description",
      "type":"string",
      "doc":"description of field entry"
    },
    {
      "name": "optional",
      "type": ["null", "string"],
      "default": null,
      "doc": "optional field"
    }
  ]
}
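
Once the updated schema_evo.avsc is published at the configured URL, files written before the change remain readable, and the new field simply shows up as NULL for old rows. A short query sketch, using the column names from the example schema:

-- Rows written before the schema change return NULL for the new
-- field; rows written with the new schema carry real values.
SELECT id, name, `optional`
FROM schema_evo
WHERE `optional` IS NULL;

The backticks around the optional column are just a precaution, in case the field name collides with a keyword in your Hive version.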