How to use Apache Phoenix JDBC driver to run reports on HBase

Introduction

In this article we will show how to run reports on HBase using the open source Apache Phoenix JDBC driver. Phoenix is now a stable and performant solution, which "became a top-level Apache project in 2014. Apache Phoenix is included in the Hortonworks distribution for HDP 2.1 and above, is available as part of Cloudera labs, and is part of the Hadoop ecosystem." (Wikipedia article)

The driver will only work in TIBCO JasperReports® Server v6.2 or later as the minimum ANTLR library version required by Phoenix is available since that release.

Earlier versions of TIBCO Jaspersoft® Studio and TIBCO JasperReports® Library might work independently from the server as they have different ANTLR dependencies.

Please note that only standard / production reports will work in JasperReports Server , Domains will not work because of the limited JDBC metadata implementation in Phoenix (see our JDBC metadata requirements in the Platform Support Datasheet PDF vs. Phoenix implementation).

Installing Apache Phoenix

HBase master and region servers

Follow the download and installation steps on the Phoenix website, depending on your HBase version. For this article, we have tested Phoenix 3.3.1 with Cloudera CDH 4.x quickstart VM – which contained version 4.8.3 at the time of testing.

We had to switch the whole Hadoop cluster from JDK 6 to JDK 7, or the Phoenix libraries, compiled with Java 7, would not run at all.

Client side – in this case JasperReports Server, Jaspersoft Studio or any application using JasperReports Library

There are two flavours of Phoenix you can install:

  1. phoenix-[version]-client-hadoopX.jar – replace [version] with 3.3.1 and X with 1 or 2, based on your Hadoop version. This jar is standalone and includes its own version of HBase (+ dependencies)
  2. phoenix-[version]-client-without-hbase.jar – replace [version] with 3.3.1. This jar does not contain HBase nor Hadoop dependencies, which means you will have to provide your own jar files for those

Although option a is simpler, in our test we have used option b (client without HBase) because the HBase libraries included with Phoenix were not compatible with HBase installed on CDH 4.8.3.

For example, for JasperReports Server we have copied the following files to <js-webapp>/WEB-INF/lib:

From the the Phoenix download:

phoenix-3.3.1-client-without-hbase.jar

From one of the Hadoop cluster nodes (HBase master node on CDH in our test, but you should find similar files in your Hadoop distribution):

/usr/lib/hadoop/hadoop-common.jar

/usr/lib/hadoop/hadoop-auth.jar

/usr/lib/hbase/hbase.jar

The paths above are normally symbolic links to the specific versions of the libraries. You can copy those with a version in the file name or those without, it will not make a difference.

The same files above will be required (plus probably /usr/lib/hadoop/lib/common-collections-3.2.1.jar) in the classpath, in order to use Phoenix in Jaspersoft Studio or any applications using the the JasperReports Library. See "To add a JAR file to a data adapter's classpath" in the documentation for more details.

Setting up the data source

In JasperReports Server, create a new Data Source as superuser. Assuming your ZooKeeper quorum server host name is cluster1, you could use:

Type: JDBC Data Source

JDBC Driver: select Other...

Then the following fields will be available.

JDBC Driver (required): org.apache.phoenix.jdbc.PhoenixDriver

URL (required): jdbc:phoenix:cluster1

User Name and Password are not required in a default configuration.

Building a sample report

Any reports with an SQL query compatible with Phoenix can be used with a Data Source configured as above. Here's some sample DDL and a query to get you started. You can load employee data from the Foodmart sample database (gzipped and attached to this article in tab-separated values as employee.tsv.gz) to a standard HBase table, then create a table in Phoenix which exposes that data via SQL.

While extracting the data to the tsv file, we generated a UUID value to be used as the primary key of the HBase table, as it wouldn't be recommended to use the employee ID column which is a sequential integer.

Run this command on the HBase master to create the table:

$ echo "create 'employee', 'cf1'" | hbase shell

In the same directory of the file employee.tsv, you can then load the data with these commands:

$ gunzip employee.tsv.gz
$ hadoop jar /usr/lib/hbase/hbase.jar importtsv \
         -Dimporttsv.columns=HBASE_ROW_KEY,cf1:employee_id,cf1:full_name \
                                          ,cf1:first_name,cf1:last_name \
                                          ,cf1:position_id,cf1:position_title \
                                          ,cf1:store_id,cf1:department_id \
                                          ,cf1:birth_date,cf1:hire_date \
                                          ,cf1:end_date,cf1:salary \
                                          ,cf1:supervisor_id,cf1:education_level \
                                          ,cf1:marital_status,cf1:gender \
                                          ,cf1:management_role  employee employee.tsv

Connect to Phoenix via sqlline.py (part of the Phoenix package) directly on the HBase master or via any JDBC client configured to use Phoenix (e.g. SQuirreL), and run this DDL:

CREATE TABLE "public"."employee" (  pk                     VARCHAR PRIMARY KEY, 
                                   "cf1"."employee_id"     VARCHAR, 
                                   "cf1"."full_name"       VARCHAR,
                                   "cf1"."first_name"      VARCHAR,
                                   "cf1"."last_name"       VARCHAR,
                                   "cf1"."position_id"     VARCHAR,
                                   "cf1"."position_title"  VARCHAR,
                                   "cf1"."store_id"        VARCHAR,
                                   "cf1"."department_id"   VARCHAR,
                                   "cf1"."birth_date"      VARCHAR,
                                   "cf1"."hire_date"       VARCHAR,
                                   "cf1"."end_date"        VARCHAR,
                                   "cf1"."salary"          VARCHAR,
                                   "cf1"."supervisor_id"   VARCHAR,
                                   "cf1"."education_level" VARCHAR,
                                   "cf1"."marital_status"  VARCHAR,
                                   "cf1"."gender"          VARCHAR,
                                   "cf1"."management_role" VARCHAR
                                 );

You should then be able to query the data in your report, for example:

  SELECT *
    FROM "public"."employee"
ORDER BY "employee_id" 
   LIMIT 100;

For your convenience, a simple report is attached to this article as HBaseTest.jrxml, created with JSS 6.1.1.

Alternatively you can create the table in Phoenix directly without going through HBase first and then load the data via traditional SQL (using UPSERT, as INSERT and UPDATE are not available), or using bulk loading.

Notes

sqlline.py

If you have trouble starting sqlline.py locally on the master, set this environmental variable and try again:

$ export HBASE_CONF_PATH=/usr/lib/hadoop/lib/commons-collections-3.2.1.jar:/usr/lib/hadoop/hadoop-common.jar:/usr/lib/hadoop/hadoop-auth.jar:/usr/lib/hbase/hbase.jar

You will need to change the paths above depending on your Hadoop distribution.

ZooKeeper

As in our case CDH was running in a test VM, we encountered this issue: http://stackoverflow.com/questions/18428722/hbase-java-client-unknown-ho...

That was worked around by adding localhost.localdomain to the existing /etc/hosts entry for cluster1, which was already pointing to the right IP address.

This should not be an issue in a Hadoop cluster where ZooKeeper is properly configured.

Feedback