Setting Up Multi-Node Hadoop Cluster, just got easy!

Knoldus

In this blog, we are going to embark on the journey of setting up a Hadoop multi-node cluster in a distributed environment.

So let's not waste any time and get started.
Here are the steps you need to perform.

Prerequisites:

1. Download & install Hadoop on your local machine (Single-Node Setup)
http://hadoop.apache.org/releases.html – 2.7.3
Use Java: jdk1.8.0_111
2. Download Apache Spark from http://spark.apache.org/downloads.html
Choose Spark release: 1.6.2

1. Mapping the nodes

First of all, we have to edit the hosts file in the /etc/ folder on all nodes, specifying the IP address of each system followed by its host name.

# vi /etc/hosts

Enter the following lines in the /etc/hosts file:

192.168.1.xxx hadoop-master
192.168.1.xxx hadoop-slave-1
192.168.56.xxx hadoop-slave-2

View original post 687 more words


Creating a DSL (Domain Specific Language) using ANTLR (Part-II): Writing the Grammar file.

Knoldus

Earlier, we discussed on our blog how to configure the ANTLR plugin for IntelliJ to get started with our language.

In this post, we will discuss the basics of ANTLR and exactly how we can get started with our main goal: what the lexer and parser are, what their roles are, and many other things. So let's get started.

ANTLR stands for ANother Tool for Language Recognition. The tool can generate a compiler or interpreter for any computer language. If you need to parse languages like Java, Scala, or PHP, then this is the tool you are looking for.
Here is a list of some projects that use ANTLR.
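To make the lexer/parser pipeline concrete before the grammar itself, here is a hedged Scala sketch. Only the ANTLR runtime calls (CharStreams, CommonTokenStream, toStringTree) are real API, and CharStreams assumes an ANTLR 4.7+ runtime; HelloLexer, HelloParser, the greeting start rule, and the Hello.g4 grammar they would be generated from are hypothetical placeholders, not the grammar built in this series.

import org.antlr.v4.runtime.{CharStreams, CommonTokenStream}

object ParseSketch {
  def main(args: Array[String]): Unit = {
    // Lexer: turns raw characters into a stream of tokens (HelloLexer would be ANTLR-generated).
    val lexer  = new HelloLexer(CharStreams.fromString("hello world"))
    val tokens = new CommonTokenStream(lexer)
    // Parser: turns the token stream into a parse tree by invoking the start rule.
    val parser = new HelloParser(tokens)
    val tree   = parser.greeting()
    // Print the tree in LISP-style form to see what the parser recognised.
    println(tree.toStringTree(parser))
  }
}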

View original post 611 more words


Boost Factorial Calculation with Spark

Knoldus

We all know that Apache Spark is a fast and general engine for large-scale data processing. It can process data up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

But is that the only task (i.e., MapReduce) for which Spark can be used? The answer is: no. Spark is not only a Big Data processing engine. It is a framework which provides a distributed environment to process data. This means we can perform any type of task using Spark.

For example, let's take factorial. We all know that calculating the factorial of large numbers is cumbersome in any programming language and, on top of that, the CPU takes a lot of time to complete the calculation. So, what can be the solution?

Well, Spark can be the solution to this problem. Let's see that in the form of code.

First, we will try to implement Factorial using only Scala in a Tail…
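The excerpt breaks off here, but as a hedged sketch (not the post's exact code), a tail-recursive BigInt factorial in plain Scala and a Spark-parallelised version could look like the following; the local[*] master and the sample values are only for illustration.

import scala.annotation.tailrec
import org.apache.spark.{SparkConf, SparkContext}

object Factorial {

  // Plain Scala: tail-recursive factorial on BigInt, so large n neither
  // overflows nor blows the stack.
  @tailrec
  def factorial(n: Int, acc: BigInt = 1): BigInt =
    if (n <= 1) acc else factorial(n - 1, acc * n)

  // Spark: spread 1..n across the executors and multiply the partial products.
  def sparkFactorial(sc: SparkContext, n: Int): BigInt =
    sc.parallelize(1 to n)
      .map(BigInt(_))
      .fold(BigInt(1))(_ * _)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("factorial").setMaster("local[*]"))
    println(factorial(20))            // sanity check: 2432902008176640000
    println(sparkFactorial(sc, 1000)) // a large factorial computed in parallel
    sc.stop()
  }
}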

View original post 97 more words


Logging Spark Application on standalone cluster

Knoldus

Logging is very important for debugging an application, and logging a Spark application on a standalone cluster is a little bit different. A Spark application has two components: the driver and the executors. Spark uses log4j by default to log the application. So whenever we use Spark on a local machine or in spark-shell, it picks up the default log4j.properties from /spark/conf/log4j.properties, where logging defaults to rootCategory=INFO, console. But when we deploy our application on a Spark standalone cluster it is different: we need to write the executor and driver logs into specific files.

So, to log a Spark application on a standalone cluster, we don't need to add log4j.properties to the application jar; instead, we should create log4j.properties files for the driver and the executors.

We need to create a separate log4j.properties file for both the executor and the driver, like the one below:

# Log everything to a rolling file instead of the console
log4j.rootCategory=INFO,FILE
# MaxFileSize/MaxBackupIndex need a rolling appender rather than a plain FileAppender
log4j.appender.FILE=org.apache.log4j.RollingFileAppender
log4j.appender.FILE.File={Enter path of the file}
log4j.appender.FILE.MaxFileSize=10MB
log4j.appender.FILE.MaxBackupIndex=10
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p…
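As a hedged illustration (not taken from the original post): on a standalone cluster these files are usually handed to the JVMs at submit time via spark.driver.extraJavaOptions and spark.executor.extraJavaOptions (for example -Dlog4j.configuration=file:/path/to/log4j.properties), and once they are picked up, plain log4j loggers in the application write to the FILE appender defined above. The class and messages below are made up.

import org.apache.log4j.Logger

object LoggingSketch {
  // Ordinary log4j logger; it uses whichever log4j.properties the JVM was started with.
  @transient lazy val log: Logger = Logger.getLogger(getClass.getName)

  def main(args: Array[String]): Unit = {
    log.info("driver started")      // ends up in the driver's log file
    log.warn("something to notice") // same FILE appender, WARN level
    // Code running inside executors can obtain a Logger the same way and will
    // log through the executor's own log4j.properties.
  }
}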

View original post 127 more words


Spark-shell on yarn resource manager: Basic steps to create hadoop cluster and run spark on it

Knoldus

In this blog, we will install and configure HDFS and YARN with minimal configuration to create a local-machine cluster. After that, we will try to submit a job to the YARN cluster with the help of spark-shell. So let's start.

Before installing Hadoop on your standalone machine, the prerequisites are:

  • Java 7
  • ssh

Now, to install Hadoop on a standalone machine, we create a dedicated user for it as follows. It's not mandatory, but it is recommended.

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

The above steps create an hduser user and a hadoop group on your machine.

The second step is to configure ssh on your local machine, since Hadoop requires ssh access to manage its nodes. To configure ssh so that hduser can log in to localhost without a password, we need to run the following commands.

$ su - hduser
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Now you can check your…
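The excerpt stops here, but once HDFS and YARN are up and spark-shell has been started against YARN (--master yarn), a quick smoke test could look like the hedged sketch below; sc is predefined in spark-shell, and the HDFS URI and file name are assumptions for illustration.

// Word count over a file stored in HDFS, typed into spark-shell.
val lines  = sc.textFile("hdfs://localhost:9000/user/hduser/words.txt")
val counts = lines.flatMap(_.split("\\s+")) // split each line into words
  .map(word => (word, 1))                   // pair each word with a count of 1
  .reduceByKey(_ + _)                       // sum the counts per word
counts.take(10).foreach(println)            // show a few (word, count) pairs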

View original post 369 more words


Congregating Spark files on S3

Knoldus

We all know that Apache Spark is a fast and general engine for large-scale data processing, and it is because of its speed that Spark has become one of the most popular frameworks in the world of big data.

Working with Spark is a pleasant experience, as it has a simple API for Scala, Java, Python and R. But some tasks in Spark are still tough rows to hoe. For example, there was a situation where we needed to upload the files written by a Spark cluster to one location on Amazon S3. In local mode, this task is easy to handle, as all files, or partitions as we say in Spark, are written on one node, i.e., the local node.

But when Spark is running on a cluster, the files are written or saved on the worker nodes. The master node contains only a reference or an empty folder. This makes uploading all files to one location…
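The post's own solution sits behind the cut, so the following is only a hedged sketch of the kind of call involved in writing a distributed result straight to S3, not the author's approach; the bucket and prefix are made up, and a configured s3a connector (hadoop-aws plus credentials) is assumed.

import org.apache.spark.{SparkConf, SparkContext}

object S3WriteSketch {
  def main(args: Array[String]): Unit = {
    val sc   = new SparkContext(new SparkConf().setAppName("s3-write"))
    val data = sc.parallelize(1 to 1000).map(n => s"record-$n")
    data
      .coalesce(1)                                    // gather partitions onto one writer so S3 gets a single part file (costs parallelism)
      .saveAsTextFile("s3a://my-bucket/spark-output") // each partition becomes one object under this prefix
    sc.stop()
  }
}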

View original post 212 more words


Ganglia Cluster Monitoring: monitoring spark cluster

Knoldus

Ganglia is a cluster monitoring tool used to monitor the health of distributed Spark and Hadoop clusters. I know you all have the question: we already have an application UI (http://masternode:4040) and a cluster UI (http://masternode:8080), so why do we need Ganglia? The answer is that the Spark cluster UI and application UI don't give us all the information related to our cluster, such as network I/O and the health of every node. With Spark's default monitoring, we can't monitor the hardware health of the whole cluster or all the metrics for each parameter, like CPU usage, IP addresses, memory, etc. So now we have the answer: Ganglia is used for advanced monitoring of any cluster.

Now let's see how Ganglia works and look at its internal architecture.

Ganglia has 3 main components, as follows:

  • gmond: gmond is a monitoring daemon which collects data from each node in the cluster and sends it to a specific host.
  • gmetad: gmetad is a…

View original post 572 more words
