Using Avro's code generation from Maven
Avro can generate Java code from Avro schema, IDL and protocol files. Avro also provides a Maven plugin that generates these Java sources as part of your build, which is a good idea because it avoids the problems that can arise when your schema/protocol files drift from their checked-in, code-generated equivalents.
Today I created a simple GitHub project called avro-maven because I had to fiddle a bit to get Avro and Maven to play nice. The GitHub project is self-contained and also has a README which goes over the basics. In this post I’ll go over how to use Maven to generate code for schema, IDL and protocol files.
pom.xml updates to support the Avro plugin
Avro schema files only define types, whereas IDL and protocol files model types as well as RPC semantics such as messages. The only difference between IDL and protocol files is the syntax: IDL files are written in Avro's DSL for specifying RPC, whereas protocol files express the same information in JSON form.
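To make the distinction concrete, here is a minimal, hypothetical IDL (.avdl) file; the protocol name, namespace and fields below are made up purely for illustration. It declares a record type along with an RPC message, whereas a schema (.avsc) file could only express the record:

// weather.avdl -- hypothetical example
@namespace("com.example.avro")
protocol WeatherService {
  // A record type; a schema (.avsc) file could define this on its own.
  record Weather {
    string station;
    long time;
    double temperature;
  }

  // An RPC message; only IDL (.avdl) and protocol (.avpr) files can declare these.
  Weather latest(string station);
}

The protocol (.avpr) form of the same thing carries identical information, just expressed as JSON.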
Each type of file has an entry that can be used in the goals element, as can be seen below. All three can be used together, or if you only have schema files you can safely remove the protocol and idl-protocol entries (and vice versa).
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>${avro.version}</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
        <goal>protocol</goal>
        <goal>idl-protocol</goal>
      </goals>
    </execution>
  </executions>
</plugin>
...
<dependencies>
  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>${avro.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>${avro.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-compiler</artifactId>
    <version>${avro.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-ipc</artifactId>
    <version>${avro.version}</version>
  </dependency>
</dependencies>
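Both snippets reference an ${avro.version} property. If your pom.xml doesn't already define it, a properties entry along these lines keeps the plugin and dependency versions in sync (1.7.7 is only an example; substitute whichever Avro release you are building against):

<properties>
  <avro.version>1.7.7</avro.version>
</properties>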
By default the plugin assumes that your Avro sources are located in ${basedir}/src/main/avro, and that you want your generated sources to be written to ${project.build.directory}/generated-sources/avro, where ${project.build.directory} is typically the target directory. Keep reading if you want to change any of these settings.
Avro configurables
Luckily Avro’s Maven plugin offers the ability to customize various code generation settings. The following table shows the configurables that can be used for any of the schema, IDL and protocol code generators.
Configurable | Default value | Description
--- | --- | ---
sourceDirectory | ${basedir}/src/main/avro | The Avro source directory for schema, protocol and IDL files.
outputDirectory | ${project.build.directory}/generated-sources/avro | The directory where Avro writes code-generated sources.
testSourceDirectory | ${basedir}/src/test/avro | The input directory containing any Avro files used in testing.
testOutputDirectory | ${project.build.directory}/generated-test-sources/avro | The output directory where Avro writes code-generated files for your testing purposes.
fieldVisibility | PUBLIC_DEPRECATED | Determines the accessibility of fields (e.g. whether they are public or private). Must be one of PUBLIC, PUBLIC_DEPRECATED or PRIVATE. PUBLIC_DEPRECATED merely adds a deprecated annotation to each field, e.g. "@Deprecated public long time".
In addition, the includes and testIncludes configurables can also be used to specify alternative file extensions to the defaults, which are **/*.avsc, **/*.avpr and **/*.avdl for schema, protocol and IDL files respectively.
Let’s look at an example of how we can specify all of these options for schema compilation.
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>${avro.version}</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/myavro/</sourceDirectory>
        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
        <testSourceDirectory>${project.basedir}/src/main/myavro/</testSourceDirectory>
        <testOutputDirectory>${project.basedir}/src/test/java/</testOutputDirectory>
        <fieldVisibility>PRIVATE</fieldVisibility>
        <includes>
          <include>**/*.avro</include>
        </includes>
        <testIncludes>
          <testInclude>**/*.test</testInclude>
        </testIncludes>
      </configuration>
    </execution>
  </executions>
</plugin>
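Once generation has run (the generate-sources phase executes as part of a normal mvn compile or mvn package), the generated classes can be used like any other Java source. As a quick sketch, assuming a hypothetical schema defining a Weather record in a com.example.avro namespace, the generated class exposes a builder:

import com.example.avro.Weather;

public class WeatherExample {
    public static void main(String[] args) {
        // Construct an instance of the generated record via its builder
        Weather weather = Weather.newBuilder()
                .setStation("SFO")
                .setTime(System.currentTimeMillis())
                .setTemperature(18.5)
                .build();

        // toString() on generated records prints a JSON rendering of the record
        System.out.println(weather);
    }
}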
As a reminder, everything covered in this blog article can be seen in action in the GitHub repo at https://github.com/alexholmes/avro-maven.