Connecting to Amazon EMR

Amazon Elastic MapReduce (EMR) is an Amazon Web Service that provisions and manages a Hadoop cluster for customers to process large data sets using a MapReduce pattern. Hadoop is used in a variety of batch-oriented applications. The Hive module provides Amazon EMR with SQL-like query semantics (called HiveQL). JasperReports Server can issue Hive queries to provide an interactive way of analyzing the data in a Hadoop (EMR) Cluster. For more information about EMR, see http://aws.amazon.com/emr.

Amazon EMR supports two modes of operation: a transient cluster and a persistent (or “long running”) cluster. A transient cluster suits workloads where you periodically process a large data set, for example, processing daily logs. In this case the cluster is created, a series of operations (called steps) is run, and the cluster is terminated. A persistent cluster, on the other hand, is always running. A common use case for a persistent cluster is a Hive data warehouse. To operate a persistent cluster as a Hive data warehouse, you launch the Amazon EMR job in interactive mode with the option to install Hive. The cluster then services ad hoc Hive queries for reporting and analysis from the JasperReports Server interface.

JasperReports Server comes with a Hive connector as one of the standard data source connectors. The connector uses the Thrift protocol to issue queries to the Hive Thrift service, which is the standard approach for database connections to Hive.

Amazon EMR automatically starts the Thrift service when the Hive option is installed on the cluster. This service runs on the master node of the Amazon EMR cluster, and the TCP port used to connect to the service is determined by the version of Hive being run on EMR (10000-10004).

Hive Version    TCP Port
0.11.0          10004
0.8.1           10003
0.7.1           10002
0.7             10001
0.5             10000

For more information on Hive versions and ports, see "Using the Hive JDBC Driver" in the Amazon EMR documentation.
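The version-to-port table above can be expressed as a small lookup, which is handy when scripting connections. This is only a sketch: hive_port_for_version is a hypothetical helper name, not part of EMR or JasperReports Server, and it covers only the versions listed in the table.

```shell
#!/usr/bin/env bash
# Map an EMR Hive version to the Thrift TCP port from the table above.
# hive_port_for_version is a hypothetical helper, not an EMR or Jaspersoft tool.
hive_port_for_version() {
  case "$1" in
    0.11.0) echo 10004 ;;
    0.8.1)  echo 10003 ;;
    0.7.1)  echo 10002 ;;
    0.7)    echo 10001 ;;
    0.5)    echo 10000 ;;
    *)      echo "unknown Hive version: $1" >&2; return 1 ;;
  esac
}
```

For example, hive_port_for_version 0.8.1 prints 10003, and a version that is not in the table returns a non-zero status.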

There are two distinct methods of connecting your JasperReports Server AWS instance to an Amazon EMR cluster, which are discussed below.



Option 1: Using VPC (Virtual Private Cloud) and Security Groups

Amazon Virtual Private Cloud (VPC) constructs an isolated virtual network with security and networking control in the cloud. You are able to define subnets with private IP address ranges, routes, virtual private network connections, and internet accessibility within the VPC.

This approach provides network security and gives the JasperReports Server direct access to the Amazon EMR cluster once configured within the isolation of the Amazon VPC.

One of the benefits of using this approach is that if new nodes are added, no further security configuration will be required as long as those instances are deployed to the same subnet within the Amazon VPC and associated with the same security group as the other nodes.

For this type of connectivity, use the Security Groups feature of VPC, which lets you protect your server instances from unauthorized access over the Hive port.

To set this up, create your Amazon EMR cluster in the same VPC as your JasperReports Server instance. You can launch the cluster into the same subnet as JasperReports Server or into a different subnet. The subnet that contains the EMR cluster must be able to communicate with the internet, because the control logic that manages the cluster communicates with it over public IP addresses.

You will need to enable communication between the EMR cluster and the JasperReports Server instances. Security groups act as firewalls in AWS and are assigned to instances. Members of the same security group can belong to different subnets within the VPC, and an instance may belong to more than one security group. EMR creates two security groups for the cluster: one for the master node (which hosts the Thrift service) and one for the slave nodes. You should also define a security group for your JasperReports Server instances.

The JasperReports Server instances need to communicate with the master node on ports 10000-10004. To enable this, define a rule in the EMR master security group that permits the communication. A rule can permit access either for another security group or for a range of IP addresses defined by a CIDR block (a private or a public IP address range). The best approach in this case is to define a rule for the master node security group that allows the JasperReports Server security group to connect on the port range 10000-10004.
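The recommended rule can also be created with the AWS command line interface instead of the console. The sketch below only prints the command so that it can be reviewed before it is run. It assumes the AWS CLI is installed and configured; print_hive_ingress_rule is a hypothetical helper name, and both security group IDs are placeholders you must supply.

```shell
#!/usr/bin/env bash
# Print (rather than run) the AWS CLI call that adds the ingress rule.
# print_hive_ingress_rule is a hypothetical helper; both IDs are placeholders.
print_hive_ingress_rule() {
  local master_sg="$1"   # security group EMR created for the master node
  local jrs_sg="$2"      # security group assigned to the JasperReports Server instances
  printf 'aws ec2 authorize-security-group-ingress --group-id %s --protocol tcp --port 10000-10004 --source-group %s\n' \
    "$master_sg" "$jrs_sg"
}
```

Calling print_hive_ingress_rule with the master node group ID and the JasperReports Server group ID prints a single command line. Using the source group rather than a CIDR range means that new JasperReports Server instances in that group are covered without further rule changes.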

Operationally, a VPC configuration is less complex to manage than the next option described, SSH port tunneling, since no special configuration or software needs to run on the JasperReports Server. Further, you can connect several EMR clusters in the same VPC to your JasperReports Server instance without setting up a port forward for each one. Using more sophisticated features, you can even extend to other VPCs over a VPN connection.

This also simplifies the creation of the data source: in JasperReports Server, the Hive JDBC data source uses just the EMR master node DNS name as the Host Name, together with the proper Hive port.
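As an illustration of how the host name and port combine, the sketch below assembles a JDBC URL. It assumes the jdbc:hive://host:port/database form used by the original Hive Thrift server and a database named default; neither is stated in this article, so check the URL format your connector version expects. hive_jdbc_url is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Build a Hive JDBC URL from a host, a port and an optional database name.
# Assumption: the connector accepts the jdbc:hive://host:port/db form.
hive_jdbc_url() {
  local host="$1" port="$2" db="${3:-default}"
  echo "jdbc:hive://${host}:${port}/${db}"
}
```

With the VPC option, the host is the master node DNS name, for example hive_jdbc_url MasterNodeDNS 10004.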

You can use the Test Connection button to verify that JasperReports Server can connect to your EMR cluster. If the connection is working you will see a ‘success’ message at the top of the screen.
If the connection fails, verify that the Hive JDBC URL field of the data source has the correct host and port number. The JasperReports Server log file (located at <apache-tomcat-home>/webapps/jasperserver-pro/WEB-INF/logs/jasperserver.log) provides more information on the connection error if you need it for debugging.
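When troubleshooting a failed connection, it can help to check from the JasperReports Server machine whether the Hive port is reachable at all, and then to look at recent errors in the log. The following is a rough sketch, not a Jaspersoft tool: both function names are hypothetical, the port check relies on bash's /dev/tcp redirection and the coreutils timeout command, and the log filter assumes that error lines contain the word "error".

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port opens within 5 seconds.
# Inside the child shell, $0 is the host and $1 is the port.
hive_port_open() {
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Print the most recent lines mentioning "error" in a log file (default: 20).
recent_log_errors() {
  grep -i 'error' "$1" | tail -n "${2:-20}"
}
```

For example, hive_port_open MasterNodeDNS 10004 succeeds only if the security group rule and the Thrift service are both in place, and recent_log_errors can be pointed at the jasperserver.log path given above.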
It is important to note that while the traffic between the JasperReports Server and the EMR master node is contained within the VPC network, it is not encrypted in transit within the subnet.



Option 2: Using an SSH Tunnel for Port Forwarding

SSH (secure shell) lets you create an encrypted and authenticated connection between the JasperReports Server and the Elastic MapReduce master node.
The standard Amazon EMR configuration uses SSH to connect with the master node using a public/private key pair for authentication and encryption (defined when you create your EMR cluster).

We will use SSH to forward the TCP port used by Hive (10000-10004, depending on the Hive version) from the JasperReports Server machine to the Amazon EMR master node, so that the Hive connector can reach the Hadoop cluster through a local port.
This option is easier to configure and implement for a single Jaspersoft server, which makes it a good fit for quick connections during testing or a proof of concept.

The two scenarios can be combined to use SSH within the VPC in addition to security groups. In this case, traffic does not traverse the Internet, the Jaspersoft server is authenticated to the Amazon EMR master node, and data is encrypted within the VPC. The following diagram illustrates this configuration.

Note: this option can also be used to connect from outside the AWS Cloud to your EMR Cluster. One common example of this is to use SSH tunneling to connect Jaspersoft Studio to your EMR cluster to test and preview reports and create Ad Hoc Topics.

How to connect:

  1. Log in to the console of your JasperReports Server machine.
  2. Copy the private key file (.pem file) of the key pair specified when creating the Amazon EMR job flow, and upload it to the home folder of your JasperReports Server machine.
  3. From the command line, create an SSH tunnel to the master node of your Hive job flow as follows:

If your EMR Cluster is using...   Use the following command...
Hive 0.11.0                       ssh -o ServerAliveInterval=10 -L 10004:localhost:10004 hadoop@MasterNodeDNS -i $HOME/mysecretkey.pem
Hive 0.8.1                        ssh -o ServerAliveInterval=10 -L 10003:localhost:10003 hadoop@MasterNodeDNS -i $HOME/mysecretkey.pem
Hive 0.7.1                        ssh -o ServerAliveInterval=10 -L 10002:localhost:10002 hadoop@MasterNodeDNS -i $HOME/mysecretkey.pem
Hive 0.7                          ssh -o ServerAliveInterval=10 -L 10001:localhost:10001 hadoop@MasterNodeDNS -i $HOME/mysecretkey.pem
Hive 0.5                          ssh -o ServerAliveInterval=10 -L 10000:localhost:10000 hadoop@MasterNodeDNS -i $HOME/mysecretkey.pem

MasterNodeDNS is the public DNS name of the master node of the Hadoop cluster, and mysecretkey.pem is the name of the private key file you uploaded in step 2 above.
You will find the DNS name for the master node of the Amazon EMR cluster in the AWS management console for Amazon EMR, in the description tab under Master Public DNS Name.
This process maps a port on localhost to the Hive port on the Amazon EMR cluster. Now that JasperReports Server has access to the cluster, you can log in and create the data source connection.
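The tunnel commands in the table differ only in the port number, so they can be generated from one template. The sketch below prints the command instead of running it; print_tunnel_cmd is a hypothetical helper name, and the port, DNS name and key path are whatever you pass in.

```shell
#!/usr/bin/env bash
# Print the ssh port-forwarding command for one Hive port.
# Note: ssh rejects a private key file that other users can read,
# so restrict it first, for example with: chmod 400 "$HOME/mysecretkey.pem"
print_tunnel_cmd() {
  local port="$1" master_dns="$2" key_file="$3"
  printf 'ssh -o ServerAliveInterval=10 -L %s:localhost:%s hadoop@%s -i %s\n' \
    "$port" "$port" "$master_dns" "$key_file"
}
```

For Hive 0.11.0, calling print_tunnel_cmd 10004 MasterNodeDNS '$HOME/mysecretkey.pem' reproduces the first command in the table. The same local and remote port is used on both sides of the tunnel, which is why only the port changes in the data source settings.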
To create the data source, follow the standard steps for creating a new Hive connection:

  1. Menu -> Create Data Source
  2. Select Hadoop-Hive Data Source
  3. Fill in the Data Source name and Save location. For the Hive JDBC URL, the only change you need is to set the port number to the port you forwarded in your SSH tunnel, which depends on the Hive version of your EMR cluster.



Summary Table of Options

                    VPC                              SSH
Security            Within network/security group    Fully encrypted
Best For            Scalability                      Ease of Configuration
Target Use Cases    Production                       Test/Proof of Concept