Quick Start

We have a complete Hadoop cluster solution for you - no matter where you are on your Big Data journey.

Last updated: December 17th, 2021

Requirements

Docker is a mandatory prerequisite for Hadjo. Behind the scenes Hadjo creates Docker images for you, configures a local Docker network and manages Docker instances in their own network. To check whether Docker is installed and configured properly, issue the command "docker ps -a" in your command shell. You should see something similar to:

$docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES

If any containers have already been created, they will be listed. An example output with one container:

$docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
1df0795984e5 mysql:5.7.17 "docker-entrypoint.s…" 8 days ago Up 2 hours 0.0.0.0:53306->3306/tcp mysql
Note: if you get permission errors on a Unix-type OS, Docker gives you the option to use "sudo" or to add your user to the "docker" group. Hadjo requires that the user running the application can execute Docker commands without "sudo". Please check the Docker setup for your OS, for example as sketched below.
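
A minimal sketch of the usual fix on Linux, taken from Docker's post-installation steps (log out and back in afterwards so the group change takes effect):

sudo usermod -aG docker $USER   # add the current user to the "docker" group
docker ps -a                    # should now work without sudo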

Although Hadjo supports Docker Toolbox, it is better (if possible) to run Docker natively on the corresponding OS. Docker Toolbox is intended for older Mac and Windows systems (e.g. Windows 8) that do not meet the requirements of Docker Desktop for Mac and Docker Desktop for Windows.

Download

Hadjo ships as a portable application for Windows, Linux and macOS:

All platforms (zip)

Installation

No installation is needed.

Double-click the downloaded jar file or run "java -jar hadjo-version.jar".
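
For example (the jar file name below is a placeholder; use the exact name of the file you downloaded):

java -version              # confirm a Java runtime is available on your PATH
java -jar hadjo-1.0.0.jar  # placeholder file name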

Getting started with Hadjo

Hadjo is a desktop platform for developers, DevOps engineers, BI specialists, students or anyone interested in developing, deploying, running or simply studying clustered Hadoop applications. A Hadoop cluster managed by Hadjo is a group of Docker containers orchestrated to run together in a private network on your notebook, PC or Mac.
Prior knowledge of Docker is not required to run Hadjo; the nodes are all managed for you behind the scenes. Some understanding of Docker concepts and basic commands may help, but it is not required.

Welcome! We are thrilled that you want to learn Hadjo. The Get Started section will guide you through step by step:

  • With 3 clicks you will run a cluster with 1 master and 2 slave nodes
  • Play with that cluster from Hadjo - open Hadoop UI screens, SSH to the master node, etc.
  • Add a new node
  • Further steps

Build a cluster (just a few clicks)

Start the Hadjo application. When you open it for the first time, a "demo" workspace named "HouseElfs-2.8" is loaded. It represents a cluster of one master named "Dobby" and two slaves, "Winky" and "Kreacher" (the names are taken from the Harry Potter series). This configuration describes a Hadoop 2.8 cluster that will run with Java 1.8 on each node.



Click on the "Build" button at the bottom right corner (mouse over image)



Click on the "Start" button of the "Build" screen (mouse over image). When prompted click "Yes".
Hadjo will initiate a Docker image build process that can take some time. The "Build Log" area provides information of the process. When the image is built you should see at the end of "Build Log":

Docker image "hadjo/HouseElfs-2.8_cluster" has been created
The new image has been built successfully.
You can now play with your cluster.


Run your cluster (just a few clicks)

Once the image is built, we shall go back to the monitoring screen (the first screen you saw).
Press "Esc" or click "Cancel" to close the current dialog.
At the monitoring screen, click on the "Start all" button in the bottom center area. See the screenshot below (mouse over image):



When prompted, click the "Yes" button. Then please wait until all three gears under "Status" turn green and start rotating...



If you get green gears under "Status" and "Software on instance" (see the screenshot above, mouse-over image) then...
CONGRATULATIONS! You have your Hadoop cluster up and running.
Now, let's have some fun with it!

Click on the green gear "Apache Hadoop (2.8.5)" on the row of "Dobby (master)", see the screenshot below:



This will open a new popup which displays the running processes on "Dobby" - the master node running on Ubuntu.
Click on the active link "HDFS Namenode Web UI". See the screenshot below:



A new browser window pops up with Hadoop's web application for Namenode information at http://localhost:50070.
You are now able to browse HDFS and perform many operations related to Hadoop's file system. The slave nodes "Kreacher" and "Winky" are displayed. See the screenshot below:



Next, we shall open another handy web management screen of Hadoop - the YARN ResourceManager UI. From the open popup click on the active link "Yarn ResourceManager Web UI". See the image below.



A new browser window pops up with Hadoop's web application for ResourceManager information at http://localhost:8088.
You can now use the YARN web GUI to observe your Hadoop applications (e.g. MapReduce or Pig jobs). Via the ResourceManager UI you can see the application submission ID, the user who submitted the application, its name, the queue to which it was submitted, the start and finish times (for finished applications) and the final status. The screenshot below shows the information you can get from the YARN web UI:
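
If the UI is still empty, you can submit a small test job and watch it appear there. A minimal sketch, run on the master node (via SSH or "Cluster Interactions / Execute command (master OS)", both described later); the jar path assumes a standard Hadoop layout and that HADOOP_HOME is set on the node:

# run the bundled Pi estimation example with 2 map tasks and 10 samples each
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 10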

Add slave to cluster (while running)

Great, you now have a running cluster. Let's add a new node without restarting the cluster!
Click the "Add New" button at the bottom left corner of the monitoring screen.
In the edit instance dialog a new IP is automatically assigned. The IP can be any valid unused value; for now we shall not change it. In the "instance name" input, type "Hokey" (just another house elf from "Harry Potter") as the name of the new slave node.
See the screenshot below and mouse-over for more details.




Press "OK" in order to save the new node.
The new node is added but not yet started (mind the gray gears under "Status"). Click on the green triangular button to run the node (mouse-over image below to find it). It will be joining the cluster!
See the screenshot below and mouse-over for more details.



The new node will transition until its gears turn green, and it is then part of the cluster. Great!
Now you can check your HDFS GUI and YARN ResourceManager GUI to see your new node.

Cluster Interactions (execute Linux command on master node)

At the final stage of this tutorial, let's play a little with the cluster!
We shall execute a Linux command on the running master server. From the menu "Cluster Interactions" select "Execute command (master OS)" and in the input field type "jps". This command lists the running Java processes.
See the screenshot below and mouse-over for more details.


See the Linux command results below:
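
For orientation, typical output on the master looks roughly like this (the process IDs will differ and the exact list depends on the Hadoop version and configuration):

jps
310 NameNode
545 SecondaryNameNode
702 ResourceManager
4021 Jps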

Cluster Interactions (upload local file to master node OS)

We shall upload a file from your PC (laptop or Mac) to the running master's OS. From the menu "Cluster Interactions" select "File Upload (master OS)" and browse for a local file. You can choose any file you want; for the sake of this tutorial we have selected a file named "readme.txt".
When prompted for the full path of a directory on the master OS, leave the default value "/home/hadjo", which is the home directory of the user that runs Hadoop on the master. Press "Ok".
See the screenshot below and mouse-over for more details.



On successful upload you should see:



Let's check if the file is really there :-)
Using the knowledge from the previous section, execute the Linux command "ls -l /home/hadjo" on the master.
See the screenshot below and mouse-over for more details.
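
As a sanity check, the listing should include a line along these lines (size and timestamp will of course differ):

ls -l /home/hadjo
-rw-r--r-- 1 hadjo hadjo 1024 Dec 17 10:00 readme.txt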

Cluster Interactions (upload local file to HDFS)

We shall upload a file from your PC (laptop or Mac) to Hadoop HDFS. From the menu "Cluster Interactions" select "File Upload (HDFS)" and browse for a local file. You can choose any file you want; for the sake of this tutorial we have selected a file named "ElasticSearchTutorial.pdf".
When prompted for the destination path on HDFS, leave the default value and press "Ok".
See the screenshot below and mouse-over for more details.



On successful HDFS file upload you should see:



Let's check if the file is really there in Hadoop HDFS :-)
Using the knowledge from the previous section, execute the Linux command "hdfs dfs -ls -R input/" on the master. It will recursively list the directories and files under the given path. We would like to check that the previously non-existent HDFS directories have been created and that the file is there.
See the screenshot below and mouse-over for more details.
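
The listing should contain an entry roughly like the one below (replication factor, group, size and date are illustrative and depend on your setup):

hdfs dfs -ls -R input/
-rw-r--r--   1 hadjo supergroup    1234567 2021-12-17 10:05 input/ElasticSearchTutorial.pdf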

Cluster Interactions (SSH access to master and slaves)

You can login to your "master" and slaves using an SSH client of your choice. What you need to know:
  • To log in to the master use hadjo@localhost:22 with password hadjo (see the example command at the end of this list). Note: if you are using Docker with Toolbox, please check the configuration of the VM that runs Docker, find the IP to connect to and map the SSH port if needed
  • After you log in to the master you should see:
    Your Big Data environment is ready...
    hadjo@Dobby:~$

    Play around...
    hadjo@Dobby:~$whoami
    hadjo
    hadjo@Dobby:~$

     
  • Jump via SSH to slave "Winky" and see the running Java processes:
    Starting from master SSH terminal:
    hadjo@Dobby:~$ssh hadjo@192.168.0.3
    Your Big Data environment is ready...
    hadjo@Winky:~$ jps
    9106 Jps
    202 NodeManager
    126 DataNode
    exit

     
  • Jump via SSH to slave "Kreacher" and see the running Java processes:
    Starting from master SSH terminal:
    hadjo@Dobby:~$ssh hadjo@192.168.0.4
    Your Big Data environment is ready...
    hadjo@Kreacher:~$ jps
    9191 Jps
    203 NodeManager
    127 DataNode
    exit
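
For example, to log in to the master from a terminal on your own OS (enter the password "hadjo" when prompted):

ssh hadjo@localhost   # the master's SSH port 22 is exposed on localhost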

     

Summary

And there it is: you have completed the tutorial after a (hopefully) fun journey, and now you are ready to take on the Big Data world! You have learned how to create your own cluster (without paying AWS a dime), run your own nodes, and gained hands-on experience with basic HDFS interactions and executing Linux commands on your nodes. When you are showing or building your Hadoop apps, you can be sure that you will be able to get them in front of people with minimal effort.
Your journey into the Hadoop world has just begun! Hopefully this step-by-step tutorial has served its purpose and got you excited about Hadoop in a very affordable way.
Thank you for your time!
“One can never have enough socks.” - Harry Potter and the Philosopher’s Stone

Stop Cluster

At the monitoring screen, click on the "Stop all" button in the bottom center area. See the screenshot below (mouse over image):


Another option is to stop nodes one by one.


Hadjo Features

Only the major features are listed below:

  • How it works
  • Workspaces
  • Cluster management
  • Settings

How it works (the magic behind the scenes)

When you add, change or delete cluster nodes, Hadjo behind the scenes asks Docker to add, modify or remove the corresponding instances. When you set up the local environment, Hadjo asks Docker to apply the needed changes: create or modify the private Docker network (on your laptop, PC or Mac), generate the Dockerfile for image creation and build the image. All Docker interactions are done for you by Hadjo; no prior knowledge of Docker is required to use it.
Let's play with the example "demo" cluster named "HouseElfs-2.8" for better clarity. First, please start the cluster under the workspace "HouseElfs-2.8" (see how to run the cluster if you are not sure). The image below shows the running cluster of one master and its slaves:




Open a command shell and execute docker ps -a --format "{{.ID}} {{.Names}} {{.Status}} {{.Image}}"

You should get something like this:
$docker ps -a --format "{{.ID}} {{.Names}} {{.Status}} {{.Image}}"
94d7c2def3e0 Kreacher Up 26 minutes hadjo/houseelfs-2.8_cluster
91f8af5403a4 Winky Up 26 minutes hadjo/houseelfs-2.8_cluster
9b860ee94da0 Dobby Up 26 minutes hadjo/houseelfs-2.8_cluster
1df0795984e5 dtp_mysql Exited (255) 20 hours ago mysql:5.7.17


For demonstration purposes we have left a stopped container named "dtp_mysql", which has nothing to do with Hadjo or your cluster. Hadjo containers and other unrelated containers co-exist just fine; Hadjo does not affect other Docker containers. What matters in our case is that we see the nodes "Kreacher", "Winky" and "Dobby" with their statuses and the Docker image behind them, "hadjo/houseelfs-2.8_cluster". One Docker image fits all nodes, master and slaves.
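
If you are curious about the private network behind the cluster, Docker itself can show it. A small sketch (the network name is whatever is configured under "Cluster Environment" for the workspace; "hadjo_net" below is only an illustrative placeholder):

docker network ls                  # lists all Docker networks, including the one Hadjo manages
docker network inspect hadjo_net   # shows the connected nodes and their IP addresses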

(optional) There is a very handy GUI tool for Docker called Kitematic. If you have it installed, open it to get another view of your cluster. On the left you can see all your Docker containers (Hadjo-related or not); here "dtp_mysql" is not related to Hadjo but is left for demonstration. If you click on the master container, on the right side you can see the ports open for connections and the directory mounted on the container (see the screenshot below and mouse over for more info):



Data Storage
The cluster files from the master and slaves are stored on the local "host" OS (your Windows, Linux or macOS) under the directory <your_user_home_directory>/hadjo-storage/docker_run_mount/<workspace_name>. For example, if your Linux username is "potter" and the active workspace name is "HouseElfs-2.8", then the cluster files can be found under /home/potter/hadjo-storage/docker_run_mount/HouseElfs-2.8. In the given demo example the data related to each node is stored as follows:

  • /home/potter/hadjo-storage/docker_run_mount/HouseElfs-2.8/Dobby - the master node, with sub-folders:
    • /home/potter/hadjo-storage/docker_run_mount/HouseElfs-2.8/Dobby/app-logs (software logs, ex. Hadoop)
    • /home/potter/hadjo-storage/docker_run_mount/HouseElfs-2.8/Dobby/namenode (Hadoop namenode data)
  • /home/potter/hadjo-storage/docker_run_mount/HouseElfs-2.8/Kreacher - a slave node, with sub-folders:
    • /home/potter/hadjo-storage/docker_run_mount/HouseElfs-2.8/Kreacher/app-logs (software logs, ex. Hadoop)
    • /home/potter/hadjo-storage/docker_run_mount/HouseElfs-2.8/Kreacher/datanode (Hadoop datanode data)
  • /home/potter/hadjo-storage/docker_run_mount/HouseElfs-2.8/Winky - a slave node, with sub-folders:
    • /home/potter/hadjo-storage/docker_run_mount/HouseElfs-2.8/Winky/app-logs (software logs, ex. Hadoop)
    • /home/potter/hadjo-storage/docker_run_mount/HouseElfs-2.8/Winky/datanode (Hadoop datanode data)
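
To peek at a node's data from the host, an ordinary directory listing is enough (the username "potter" is just the example from above; substitute your own home directory):

ls /home/potter/hadjo-storage/docker_run_mount/HouseElfs-2.8/Dobby
# expected to show: app-logs  namenode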

This sub-section is purely informational, for the curious minds. You do not need to remember any of it in order to work with Hadjo!

Workspaces

A workspace is related to a single cluster: its nodes, the software selected on them (e.g. Hadoop version, Java version) and the cluster environment (e.g. network settings). The screens that are strongly related to a workspace are "Cluster Environment" and "Nodes". There is always an active workspace when working with Hadjo.

The title of the main window shows the currently active workspace.



  • New workspace - creates a new, empty workspace with no nodes. The currently active workspace remains unchanged. You can switch to your new workspace at any time. The "Cluster Environment" section will be populated with default values, all of which can be modified.
  • Open workspace - switches to an existing workspace. The cluster must be stopped in order to open another workspace. Switching between workspaces does not affect cluster data and logs; for example, whatever you put on HDFS from one workspace will still be there when you open that workspace again. A cluster is bound to its workspace: a different workspace means a different cluster. There is a clean separation of data between clusters and they do not know of each other.
  • Save as new workspace - creates a new workspace with the nodes and settings of the current one. You can switch to your new workspace at any time. Data related to the running cluster of the current workspace is not copied.
  • Delete workspace - deletes all data and nodes of the chosen workspace. The Docker image is also removed behind the scenes. You cannot delete the active workspace.
  • Exit - exits the application. If instances are running, they will not be stopped. On the next start, running instances will be detected and put back under Hadjo's management.
Cluster management
This is the area you will use most frequently. Actions are activated from the "Cluster management" menu or from the buttons at the bottom of the Nodes screen. Some actions are active and others are not, depending on the running status of a node or of the cluster itself:



All actions are applicable to the context of the currently active workspace. Non-active workspaces are not affected.
  • Cluster Nodes - opens the main screen with the existing nodes
  • Start (all) - starts the whole cluster by bringing up the "master" node first. If some nodes are already running, the command starts all non-running instances; the already running instances are not affected.
  • Stop (all) - stops the whole cluster by taking down the slave nodes first. The "master" node is halted last. Already stopped instances are not affected; the command ignores them.
  • CPU/Mem Statistics - displays the current CPU and memory usage of the instances relative to the resources allocated to the Docker containers.
  • Build - opens the screen for building the main Docker image of the currently active workspace.
  • Clean - cleans the content of the active workspace's instances: application logs and mounted content. On the next instance start the content will be populated as if new.
Cluster Interactions
This area provides some handy functionality for interacting with your cluster. For a richer experience, working with the cluster over SSH is recommended:



All actions are applicable to the "master" of the currently active workspace. Non-active workspaces are not affected.
  • Execute command (master OS) - executes a Linux command on the "master" node and displays the output.
  • File upload (master OS) - browse and choose a file from your own PC, laptop or Mac. You will be prompted to specify a full path on the "master" where the file will be copied.
  • File upload (HDFS) - browse and choose a file from your own PC, laptop or Mac. You will be prompted to specify a relative or full path on HDFS where the file will be placed. It will be available on your Hadoop cluster. Note: non-existent HDFS directories in the supplied path will be created!
Read Hadoop Cluster Logs (a few mouse clicks)
When you run a Hadoop cluster it is very handy to access the logs of the master and slaves in no time. Hadjo gives you the ability to access them with just a few mouse clicks.
To access the logs of any node (master or slave), click on its green gear active link. The example below shows a click on the "Apache Hadoop (2.8.5)" active link on the row of "Dobby (master)". See the screenshot below and mouse-over for more details:



Next, click on the "View Logs" button at the bottom left corner of the popup. See the screenshot below and mouse over for more details. It opens the local OS directory. That directory is mounted on the Hadoop node (Linux OS), and Hadoop is configured to store its logs there via a full Linux path (done by Hadjo, you do not need to worry about it). That is how the logs are available in both the local OS and the node's OS! You can even track the logs in real time on your OS while the cluster is still running, as sketched below.
See the screenshot below for how to access the logs and mouse-over the image for more clarity:



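A minimal sketch of following a master log file live from the host OS (the directory comes from the "Data Storage" layout described earlier; the exact log file name depends on the Hadoop version, so list the directory first):

ls ~/hadjo-storage/docker_run_mount/HouseElfs-2.8/Dobby/app-logs
tail -f ~/hadjo-storage/docker_run_mount/HouseElfs-2.8/Dobby/app-logs/hadoop-hadjo-namenode-Dobby.log   # illustrative file name
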
Settings (Workspace)
Manage the settings of the workspace environment and any configurations of the cluster software:



Most actions are applicable to the currently active workspace.
  • Language - changes the application language. English is loaded by default.
  • Cluster Environment - the settings in this area are strongly related to the Docker image creation (done behind the scenes by Hadjo) and the Docker network setup (also done by Hadjo through Docker). The image users, groups and passwords relate to the Linux cluster instances (built from the Docker image). The local mount directory must be readable and writable by the Docker containers, which is why it is placed under the user's home directory by default! "Image OS user" and "Image OS user password" are important if you plan to use SSH access to the "master" node. The container network name and IP range have a cross-workspace context; in most cases it is just fine to use the same name and IP range when switching between workspaces.
  • Cluster Software - by default the nodes' installable software (e.g. Hadoop) comes with basic default settings (described in another section). If you need to modify Hadoop configurations, e.g. increase the replication factor from 1 to 2, just edit and save hdfs-site.xml (see the sketch after this walkthrough). See the screenshot below for an example of changing hdfs-site.xml for Hadoop version 2.9.2!



After your file browser opens, find and edit hdfs-site.xml. See the screenshot below:



For the configurations to take effect, a new image build must be performed (from the menu "Cluster Management / Build").
To reset a single software configuration, click its "Reset" button. Another "Reset" button at the bottom of the list resets all changes made to any selected software. Again, the changes take effect on the cluster after a new image build.
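
As a concrete reference for the replication example above, the change in hdfs-site.xml is a single property (standard Hadoop configuration syntax; the rest of the generated file stays as it is):

<property>
  <!-- number of HDFS block replicas; a value of 2 needs at least two datanodes -->
  <name>dfs.replication</name>
  <value>2</value>
</property>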

Development

This section is devoted to Hadoop application development: MapReduce, custom YARN applications and more...

Software on your cluster

The table below lists the selectable software combinations that have been tested on several Hadjo clusters.

Supported Hadoop & Java Combinations

Software      Java 8   Java 9   Java 10
Hadoop 2.8    ok       ok       ok
Hadoop 2.9    ok       ok       ok
Hadoop 2.10   ok       ok       ok
Hadoop 3.0    ok       ok       ok
Hadoop 3.1    ok       ok       ok
Hadoop 3.2    ok       ok       ok
Hadoop 3.3    ok       ok       ok

Video