Hadoop Data Platform (HDP)

Hadoop is an open-source solution that provides storage and computing facilities for processing large-scale data in clusters. The EPAM Cloud Hadoop service can be used by developers to test and debug their Hadoop jobs before running them in production.

EPAM Cloud Hadoop Data Platform Service is based on Hortonworks Hadoop Distribution v.2.

Have a Question?

This page gives general information on the service and its main workflows. However, while working with the service, users often encounter questions they need assistance with. The most frequently asked questions about EPAM Cloud Services are gathered on the Cloud Services FAQ page.
Visit the page to check whether we have a ready answer for your question.

Related CLI Commands

The table below provides the list of service-related commands and their descriptions:

Command                                                  Short Command   Description
or2-manage-service ... -s hadoop -a -l slaves -h shape   or2ms           Starts the service in the specified project and region
or2-describe-hadoop                                      or2dh           Gives the list, the states, and the DNS names of existing Hadoop resources
or2-manage-hadoop                                        or2mh           Creates and removes Hadoop resources

Further on this page, you can find examples of using these commands for Hadoop Service manipulation.

Service Activation

To start the Hadoop service, run the or2-manage-service (or2ms) command with --activate (-a), --service-name (-s) hadoop and other necessary flags:

or2ms -p project -r region -a -s hadoop -l slaves -h shape

where:

  • -l (--hadoop-slave-count) specifies the number of Hadoop slave machines that will be run (1 by default, if the property is not specified)
  • -h (--shape) specifies the instance shape for Hadoop slave machines (MEDIUM by default, if the property is not specified)

You can also use the -k (--key-name) option to specify the SSH key for all the created resources.
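
For example, the following call (the project, region, and key names here are hypothetical) starts the service with three LARGE slaves:

or2ms -p demo-project -r epam-by1 -a -s hadoop -l 3 -h LARGE -k my-key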

When activated, the service by default starts the following virtual machines:

VM Role                  OS                Shape                        Number
Hadoop Client            Ubuntu 12.04 x64  SMALL                        1
Hadoop Resource Manager  Ubuntu 12.04 x64  MEDIUM                       1
Hadoop Name Node         Ubuntu 12.04 x64  MEDIUM                       1
Hadoop Slave             Ubuntu 12.04 x64  MEDIUM (alterable with -h)   1 (alterable with -l)

To check whether Hadoop has started properly, log in to the client and run a test job. The cluster is ready to work if the test job completes without issues.
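
For instance, a minimal smoke test (using standard Hadoop 2 commands and the pre-installed examples jar described in the Running Jobs section below) could look like this; run it on the client as the hdfs user:

hdfs dfsadmin -report   # the number of live DataNodes should match the number of slaves
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.4.0.2.1.5.0-695.jar pi 2 5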

Each project can have only one Hadoop service activated per region, but the service can include several Hadoop clients, each responsible for its own jobs. For more details on creating new Hadoop resources, see the Manipulating Slaves and Clients section.

Retrieving Hadoop Data Platform Info

EPAM Orchestrator supports the following commands for retrieving Hadoop service information.

To see the list, the states and the DNS names of Hadoop resources, run the or2-describe-hadoop (or2dh) command:

or2dh -p project -r region

You can also use the or2-describe-services (or2dser) command with the -s hadoop flag to find the Hadoop client DNS.
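
For example, with hypothetical project and region names:

or2dser -p demo-project -r epam-by1 -s hadoop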

Running Jobs

To run a Hadoop job, you have to connect to the Hadoop Client you want to use. To connect, use the DNS name retrieved by the or2dh command and the following credentials:

  • User: hdfs
  • Password: user
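
For example, assuming or2dh returned the hypothetical client DNS name hadoop-client.example.com, a standard SSH session would be started as follows:

ssh hdfs@hadoop-client.example.com   # enter the password listed above when prompted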

When connected to the Client, run the following command on it:

hadoop jar job_path [job_parameters]

where:

  • job_path stands for the path to the .jar file describing the Hadoop job
  • job_parameters stands for the list of parameters accepted by the job

Each Hadoop client has a set of pre-installed demo jobs you can run to check how the service works. For example, the command below calls a job that estimates the value of Pi:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.4.0.2.1.5.0-695.jar pi 10 20

Here,

  • pi specifies the name of a program available within the specified .jar file
  • 10 is an integer specifying the number of maps
  • 20 is an integer specifying the number of samples per map.

The total number of samples is the number of maps multiplied by the samples per map (10 × 20 = 200 in this case); the more samples, the more precise the Pi estimate.

You can also find other examples in the /usr/lib/hadoop-mapreduce/ folder.
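
Another demo job in the same jar is wordcount. The sketch below (the HDFS paths are hypothetical) uploads a local file to HDFS, runs the job, and prints the result:

hdfs dfs -mkdir -p /user/hdfs/input
hdfs dfs -put /etc/hosts /user/hdfs/input/
# the output directory must not exist before the run:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.4.0.2.1.5.0-695.jar wordcount /user/hdfs/input /user/hdfs/output
hdfs dfs -cat /user/hdfs/output/part-r-00000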

Manipulating Slaves and Clients

It is possible to change the number of existing Hadoop Clients and Slaves after the service activation. This can be done with the or2-manage-hadoop (or2mh) command:

or2mh -p project -r region -a -c number -s resource_type

where:

  • -a (--add) is a flag indicating that a resource should be added
  • resource_type is the type of the Hadoop resource (client, slave)
  • number is the number of new resources to be created (1 by default)

You can also specify the resource shape with the -h / --shape parameter.
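
For example, the following call (project and region names are hypothetical) adds two LARGE slaves:

or2mh -p demo-project -r epam-by1 -a -c 2 -s slave -h LARGE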

To remove a slave, use the or2mh command with the -m / --remove parameter:

or2mh -p project -r region -c number -s resource_type --remove

Note that when you remove a slave, it may still be shown as active in the Resource Manager and Name Node UIs for some time.
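
For example (again with hypothetical names), you can remove one slave and then list the remaining resources to confirm the change:

or2mh -p demo-project -r epam-by1 --remove -c 1 -s slave
or2dh -p demo-project -r epam-by1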

Web UI

To monitor the Hadoop Service performance, you can log in to the Hadoop Resource Manager UI using its DNS name and port 8088:

http://<hadoop_rm_dns>:8088

When you connect, you can see the list of performed jobs (called Applications here), their statuses, and other details.

To open the Hadoop Name Node UI, use the Name Node DNS name and port 50070:

http://<hadoop_nn_dns>:50070

You don't have to specify any credentials to connect to Hadoop Resource Manager and Name Node web interfaces.
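
If you prefer scripted checks to a browser, both UIs also expose Hadoop's standard HTTP endpoints (the DNS names below are hypothetical):

curl http://hadoop-rm.example.com:8088/ws/v1/cluster/info   # YARN cluster info as JSON
curl "http://hadoop-nn.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"   # HDFS status via JMX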

Pricing

The service usage price is defined by the price of the created Hadoop resources. The default parameters of a minimum set of Hadoop VMs are:

  • Image: CentOS6-template
  • Shape: MEDIUM (three VMs); SMALL (one VM)

Therefore, the approximate monthly cost of the minimum set of the Hadoop Data Platform Service at 100% load, 24/7, is about $234.34 in the EPAM-BY1 region (as of 11/09/2015). The price can vary depending on the region, the number of clients and slaves, and their shapes.
To get more detailed estimations, please use our Cost Estimator tool.

References

More information on the Hadoop Data Platform Service can be found in the EPAM Cloud Services Guide. For detailed description of the Maestro CLI commands used to manage the Hadoop Data Platform Service, refer to the Maestro CLI User Guide.