Hadoop-Hive Data Sources

Unlike traditional databases, Hadoop supports huge amounts of data, often called big data. As of version 5.6.1, JasperReports Server supports two data source types that process requests to a Hadoop cluster:

CDH 5 Hive-Impala Data Source - Use only if you access Hadoop-Hive2 or Hadoop-Impala through a Cloudera 5 server.
JDBC Data Source with Hive JDBC driver - Use for most other Hive 1, Hive 2, and Impala servers.
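
The choice between the two types can be sketched as a small helper. This is only an illustration of the two rules above; the function name and the engine labels ("hive1", "hive2", "impala") are hypothetical and are not part of JasperReports Server.

```python
def choose_data_source_type(engine: str, cloudera5: bool) -> str:
    """Return the data source type suggested by the two rules above.

    engine    -- hypothetical label: "hive1", "hive2", or "impala"
    cloudera5 -- True if the cluster is reached through a Cloudera 5 server
    """
    engine = engine.strip().lower()
    if engine not in {"hive1", "hive2", "impala"}:
        raise ValueError(f"unknown engine: {engine!r}")
    # The CDH 5 type applies only to Hive 2 and Impala on Cloudera 5.
    if cloudera5 and engine in {"hive2", "impala"}:
        return "CDH 5 Hive-Impala Data Source"
    # The JDBC type covers "most other" servers, so treat this as a
    # starting point rather than a guarantee for every configuration.
    return "JDBC Data Source with Hive JDBC driver"


print(choose_data_source_type("impala", cloudera5=True))
print(choose_data_source_type("hive1", cloudera5=False))
```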

Depending on whether you use Hive 1, Hive 2, or Impala, there are certain restrictions on accessing data in Hadoop.

CDH 5 Hive-Impala data sources have very low latency and allow interactive use of Ad Hoc views, filters, and dashboards. However, Hadoop-Impala data sources still have the following limitations:

CDH 5 Hive-Impala data sources are not supported for OLAP connections.
CDH 5 Hive-Impala data sources cannot be used directly in Domains. To use Hadoop-Impala in a Domain, see Big Data Connectors for Virtual Data Sources.
CDH 5 Hive-Impala data sources can be used in Ad Hoc Topics, but they do not support query optimization.
You must configure your query limits to handle big data (see Ad Hoc Data Policies for Big Data).
You must configure your JVM memory to handle the expected amount of data (see the JasperReports Server Installation Guide).

The JDBC driver for Hive works with most other Hive 1, Hive 2, and Impala servers, and it can be used with Domains. However, the original Hive 1 server has high latency, with access times ranging from about 30 seconds up to 2 minutes. Hive 2 is much faster, but still not as fast as relational databases. As a result, Hadoop-Hive data sources have certain limitations and guidelines for use in JasperReports Server:

Hadoop-Hive data sources are not suitable for creating reports interactively in the Ad Hoc Editor.
Reports based on Hadoop-Hive are not suitable for dashboards.
Filters and query-based input controls that rely on Hadoop-Hive data sources will be slow to populate the list of choices.
You must configure your query limits and timeout to handle latency (see Ad Hoc Data Policies for Big Data).
You must configure your JVM memory to handle the expected amount of data (see the JasperReports Server Installation Guide).

In general, reports based on JDBC-Hive data sources are best suited to be run in the background from the repository. For very large reports, consider scheduling them to run at night so the output is available when you need it during the day.
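
The limitations listed above for the two data source types can be summarized as a lookup table. The sketch below only restates what this section says; the dictionary keys and the function name are hypothetical, and None marks a combination that this section does not address.

```python
from typing import Optional

# True = supported or suitable, False = not supported or not suitable,
# None = not stated in this section.
CAPABILITIES = {
    "CDH 5 Hive-Impala": {
        "olap": False,               # not supported for OLAP connections
        "domains_direct": False,     # only through virtual data sources
        "ad_hoc_interactive": True,  # low latency allows interactivity
        "dashboards": True,
    },
    "JDBC-Hive": {
        "olap": None,                # not mentioned for this type
        "domains_direct": True,      # can be used with Domains
        "ad_hoc_interactive": False, # latency too high for the Ad Hoc Editor
        "dashboards": False,
    },
}


def is_suitable(source_type: str, feature: str) -> Optional[bool]:
    """Look up whether a feature is suitable for a data source type."""
    return CAPABILITIES[source_type][feature]


print(is_suitable("JDBC-Hive", "dashboards"))
print(is_suitable("CDH 5 Hive-Impala", "domains_direct"))
```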

To create a Hive JDBC data source, follow the same procedure as in JDBC Data Sources.

To create a CDH 5 Hive-Impala data source:

1. Log on as an administrator.
2. Click View > Repository, expand the folder tree, and right-click a folder to select Add Resource > Data Source from the context menu. Alternatively, you can select Create > Data Source from the main menu on any page and specify a folder location later. If you have installed the sample data, the suggested folder is Data Sources. The New Data Source page appears.
3. In the Type field, select Impala Data Source. The information on the page changes to reflect what’s needed to define a Hadoop-Hive data source.

You have the option to use profile attributes to derive the values for data source parameters. See Attributes in Data Source Definitions.

Hadoop-Hive Data Source Page

4. Fill in the required fields, along with any optional information you choose.

The JDBC URL depends on which type of server you are using:

Hive 2:




5. Click Test Connection to validate the data source.
6. When the test is successful, click Save. The Save dialog appears.

Saving the Hadoop-Hive Connection

7. Enter the data source name and, optionally, a description. The Resource ID appears based on the name you enter.
8. Expand the folder tree and select the location for your data source, then click Save. The data source appears in the repository.
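
For the JDBC URL entered in step 4, the following sketch assembles a URL in the general HiveServer2 form used by the Apache Hive JDBC driver, jdbc:hive2://host:port/database. It is an assumption that your server accepts this plain form; the function name and the host hive.example.com are hypothetical, and port 10000 and the database name default are the usual HiveServer2 defaults, not values taken from this procedure. Secured clusters or Impala servers may need a different port or additional URL parameters.

```python
def hive2_jdbc_url(host: str, port: int = 10000, database: str = "default") -> str:
    """Build a HiveServer2-style JDBC URL (assumed general form)."""
    if not host:
        raise ValueError("host is required")
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    # General scheme of the Apache Hive JDBC driver: jdbc:hive2://host:port/db
    return f"jdbc:hive2://{host}:{port}/{database}"


# hive.example.com is a placeholder host name.
print(hive2_jdbc_url("hive.example.com"))
```

Paste the resulting string into the JDBC URL field, then use Test Connection to confirm that the server accepts it.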