Creating a Hadoop TDP cluster on AWS

Warning

The TDP project has evolved since this article was written; using the sources for this deployment will require some adaptation.

After several years of using the CDH image provided by Cloudera to teach introductory Big Data courses, it became necessary to update my demo Hadoop cluster. Unfortunately, Cloudera chose to stop providing the QuickStart image. It was therefore impossible to get an up-to-date cluster, and I had to settle for CDH 5.13.0 (released October 12, 2017).

Then I heard about the Trunk Data Platform (TDP) project. It offers the main Big Data tools of the Apache ecosystem, with the advantage of being open source and deployable across several servers like a standard Hadoop cluster.

At present, TDP is mainly maintained by the TOSIT association but is open to contributions through its various public projects. To make it easier to get started, they also set up the TDP - Getting started project, which lets you create a Hadoop cluster on local virtual machines from any Linux machine with a minimum of CPU and memory (32 GB of memory is still recommended).

TDP would therefore give me an up-to-date Hadoop cluster! Nevertheless, the requirement of a computer with 32 GB of memory quickly became limiting, so I decided to move this Hadoop cluster to EC2 instances hosted on AWS.

For this, I proceeded in three steps:

  1. Infrastructure deployment with Terraform on AWS;

  2. Preconfiguring nodes with Ansible based on the dynamic inventory generated in the previous step;

  3. Deployment of TDP.

Note

All the sources concerning this deployment are available here: https://github.com/siwon/tosit_tdp_aws.

Infrastructure deployment

The first step is to deploy the infrastructure, and for that I naturally chose Terraform.

For the architecture, I drew on the recommendations of the TDP - Getting started project. I nevertheless took the liberty of adding an extra node dedicated to deployments, in order to optimize network flows during installation.

To anticipate service resilience issues, I chose to distribute the nodes across the Availability Zones available in my region (eu-west-3, i.e. Paris). Note, however, that rack awareness has not been configured.

Finally, I wanted to secure the master and worker nodes, so they are not publicly addressable and can only be reached through the edge node or the deployer node.

We therefore obtain the following architecture:

Figure: TDP cluster architecture (tdp.drawio.svg)

We therefore have:

  • 1x deployer node

  • 1x edge node

  • 3x master nodes

  • 3x worker nodes
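
To make this topology concrete, here is a minimal sketch of a static Ansible inventory describing these nodes. The group names, IP addresses and SSH users are purely illustrative; in practice the inventory is generated by Terraform, as described further down.

```yaml
# Illustrative sketch only: addresses, users and group names are assumptions,
# the real inventory is generated by Terraform.
all:
  vars:
    ansible_user: centos                  # CentOS nodes; the Debian deployer uses "admin"
  children:
    deployer:
      hosts:
        srv-tdp-deployer-default:
          ansible_host: 203.0.113.10      # public IP, reachable directly
          ansible_user: admin
    edge:
      hosts:
        srv-tdp-edge-000-default:
          ansible_host: 203.0.113.11      # public IP, reachable directly
    master:
      hosts:
        srv-tdp-master-000-default: { ansible_host: 10.0.1.10 }   # private IPs,
        srv-tdp-master-001-default: { ansible_host: 10.0.2.10 }   # reachable only via
        srv-tdp-master-002-default: { ansible_host: 10.0.3.10 }   # the deployer or edge
    worker:
      hosts:
        srv-tdp-worker-000-default: { ansible_host: 10.0.1.20 }
        srv-tdp-worker-001-default: { ansible_host: 10.0.2.20 }
        srv-tdp-worker-002-default: { ansible_host: 10.0.3.20 }
```

Reaching the private nodes from outside the VPC would then typically go through an SSH jump via the deployer or edge node (for example with ProxyJump), which is one way to implement the restriction described above.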

The sizing of deployed instances is as follows:

Table 1. Server sizing

| Zone | Name | Instance type (vCPU, Memory) | OS | Storage (GiB) |
| --- | --- | --- | --- | --- |
| eu-west-3a | srv-tdp-deployer-default | t3a.large (2, 8 GiB) | debian-11 | 8 |
| eu-west-3a | srv-tdp-edge-000-default | t3a.large (2, 8 GiB) | centos-7 | 8 |
| eu-west-3a | srv-tdp-master-000-default | t3a.large (2, 8 GiB) | centos-7 | 8 |
| eu-west-3b | srv-tdp-master-001-default | t3a.large (2, 8 GiB) | centos-7 | 8 |
| eu-west-3c | srv-tdp-master-002-default | t3a.large (2, 8 GiB) | centos-7 | 8 |
| eu-west-3a | srv-tdp-worker-000-default | t3a.large (2, 8 GiB) | centos-7 | 8 + 1 |
| eu-west-3b | srv-tdp-worker-001-default | t3a.large (2, 8 GiB) | centos-7 | 8 + 1 |
| eu-west-3c | srv-tdp-worker-002-default | t3a.large (2, 8 GiB) | centos-7 | 8 + 1 |

Preparing instances

With the infrastructure now deployed, we can proceed to:

  1. Configuring the deployer node

  2. Copying the inventory files generated by Terraform to the deployer node

  3. Applying the TDP prerequisites on the edge, master and worker nodes

  4. Preparing the data disks on the worker nodes

All of these prerequisites are implemented using Ansible.
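
As an illustration, these four steps could be chained from a single entry-point playbook. The file names below are hypothetical and do not necessarily match the layout of the repository linked above.

```yaml
# prepare-all.yml -- hypothetical entry point chaining the preparatory steps
- import_playbook: deployer.yml        # 1. configure the deployer node
- import_playbook: copy-inventory.yml  # 2. copy the Terraform-generated files
- import_playbook: prerequisites.yml   # 3. apply the TDP prerequisites on the nodes
- import_playbook: data-disks.yml      # 4. prepare the data disks on the workers
```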

Deployer configuration

The preparation of the deployer node is relatively basic and consists of installing Ansible and retrieving all the TDP roles.
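
A minimal sketch of such a play might look as follows; the repository URL, destination path and package names are assumptions to adapt to the actual sources.

```yaml
# Hypothetical sketch: install Ansible and fetch the TDP getting-started project.
- name: Configure the deployer node
  hosts: deployer
  become: true
  tasks:
    - name: Install Ansible, Git and pip (Debian package names assumed)
      ansible.builtin.apt:
        name: [ansible, git, python3-pip]
        state: present
        update_cache: true

    - name: Clone the TDP getting-started project (URL and destination assumed)
      ansible.builtin.git:
        repo: https://github.com/TOSIT-IO/tdp-getting-started.git
        dest: /home/admin/tdp-getting-started
      become: false
```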

Copying configuration files generated by Terraform

Unlike in the TDP - Getting started project, the node inventory is not generated by Vagrant. The server names and addresses are therefore variable, so we must extract this information from the Terraform deployment.

For this, several files are generated as Terraform outputs:

  • deployer-default.yml: Inventory file for all preparatory work for the installation of TDP.

  • hosts-default.yml: File listing the servers and their sizing, used to feed the hosts variable in the inventory/group_vars/all.yml file of the TDP - Getting started project.

  • inventory-default.yml: Inventory file dedicated to the deployment of TDP services.

These files are then copied to the deployer node in order to update its configuration files.
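
For example, pushing these generated files to the deployer could look like the following task; the destination path is an assumption.

```yaml
# Hypothetical sketch: copy the Terraform-generated files to the deployer node.
- name: Copy Terraform-generated configuration files
  hosts: deployer
  tasks:
    - name: Copy the generated inventory and hosts files (destination path assumed)
      ansible.builtin.copy:
        src: "{{ item }}"
        dest: /home/admin/tdp-getting-started/inventory/
      loop:
        - hosts-default.yml
        - inventory-default.yml
```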

Configuring instance name and domain

Another difference between Vagrant and AWS: we cannot predict, or even force, the hostname or domain when creating EC2 instances.

Since TDP is sensitive to instance names and domains when generating certificates, we must configure them explicitly.
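
A minimal sketch of how this could be done with Ansible; the domain name is an assumption.

```yaml
# Hypothetical sketch: enforce a stable FQDN on each node for TDP certificate generation.
- name: Configure hostname and domain
  hosts: edge:master:worker
  become: true
  vars:
    cluster_domain: tdp.local          # assumed domain
  tasks:
    - name: Set the hostname from the inventory name
      ansible.builtin.hostname:
        name: "{{ inventory_hostname }}"

    - name: Map the FQDN to the node address in /etc/hosts
      ansible.builtin.lineinfile:
        path: /etc/hosts
        line: "{{ ansible_default_ipv4.address }} {{ inventory_hostname }}.{{ cluster_domain }} {{ inventory_hostname }}"
```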

Configuring data disks

The last step is to initialize and mount the additional disks deployed on the worker nodes.
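
A minimal sketch, assuming the extra volume shows up as /dev/nvme1n1 and is mounted on /data (both are assumptions; the actual device name depends on the instance type and volume attachment):

```yaml
# Hypothetical sketch: format and mount the additional data disk on the workers.
- name: Prepare data disks
  hosts: worker
  become: true
  vars:
    data_device: /dev/nvme1n1          # assumed device name
    data_mount: /data                  # assumed mount point
  tasks:
    - name: Create a filesystem on the data disk
      community.general.filesystem:
        fstype: ext4
        dev: "{{ data_device }}"

    - name: Mount the data disk
      ansible.posix.mount:
        path: "{{ data_mount }}"
        src: "{{ data_device }}"
        fstype: ext4
        state: mounted
```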

TDP deployment

With all the prerequisites for TDP now met, we can proceed to deploy the services of our Hadoop cluster.

To do this, all you need to do is:

  1. Connect to the deployer node: ssh -i /tmp/ssh-private-key-tdp-default.pem admin@<Public IP Deployer Node>

  2. Move to the tdp-getting-started directory: cd tdp-getting-started

  3. Launch deployment with Ansible Playbook: ansible-playbook deploy-all.yml

Running the entire playbook takes a few minutes.

Once the deployment is complete, you can connect to an edge node and access the various tools offered by TDP (HDFS, Spark, Hive, etc.).

Conclusion

This modest TDP deployment project on AWS is not intended to provide a production service. Nevertheless, it allows experiments and demonstrations to be carried out with an up-to-date Hadoop stack without requiring a Cloudera license.

The next steps will be:

  • Expose UIs;

  • Configure Rack Awareness.

As for my introductory Big Data courses, this Hadoop stack will be perfect for practical work. All I have to do is adapt the exercises, and that’s it!