Introduction to Hadoop

Posted on October 18, 2015 by ProTechSkills

This document starts with an introduction to Hadoop and covers the Hadoop 1.x / 2.x services (HDFS, MapReduce, YARN). It also explains the architecture of Hadoop and how the Hadoop Distributed File System and the MapReduce programming model work.

Install Hive with local metastore

Posted on January 25, 2015 by ProTechSkills

Because Hive is a data-warehousing framework, restricting it to a single session is undesirable. To overcome this limitation of the embedded metastore, support for a local metastore was developed. A separate database service runs as a process on the same or a remote machine. The metastore service still runs in the same JVM within Hive […]

Install Hive with embedded metastore

Posted on January 25, 2015 (updated January 26, 2015) by ProTechSkills

The Hive package ships with Derby as the default embedded metastore. Follow the steps below to install Hive with the embedded metastore:

1. Download the latest version of Hive from here.
2. Uncompress the package on Linux:
   tar -xzvf apache-hive-0.13.1-bin.tar.gz
3. Add the following to ~/.bash_profile:
   sudo nano ~/.bash_profile
   export HIVE_HOME=/home/hduser/hive-0.13.1
   export PATH=$PATH:$HIVE_HOME/bin
   Where […]

Apache Hadoop YARN: Best Practices

Posted on January 25, 2015 (updated June 7, 2016) by ProTechSkills

Found a nice presentation on YARN: Best Practices by Hortonworks.

Start HDFS High Availability Cluster

Posted on September 22, 2014 (updated April 10, 2016) by ProTechSkills

In the previous blog, we discussed the HDFS high availability configuration. This blog describes the steps to start an HDFS high availability cluster.

Pre-requisites

Before starting the HDFS high availability cluster, make sure that the cluster meets the following pre-requisites:

a) If you have enabled Automatic Failover for Hot-BackUp during NameNode failover, then before starting the HDFS high availability cluster, […]

HDFS High Availability Configuration

Posted on August 29, 2014 (updated November 29, 2015) by ProTechSkills

In the previous blog, we discussed the HDFS high availability architecture. This blog describes the configuration for HDFS high availability in a Hadoop cluster.

Pre-requisites

Before configuring HDFS high availability, make sure that your Hadoop cluster meets the following pre-requisites:

a) You must have at least two nodes to enable HDFS high availability.
b) If you want to configure […]

Hadoop HDFS Concepts

Posted on August 26, 2014 (updated November 12, 2015) by ProTechSkills

This presentation gives an overview of Hadoop HDFS concepts such as blocks, rack awareness, and safe mode. (Hadoop HDFS Concepts, from tutorialvillage)

Hadoop Cluster Setup

Posted on August 22, 2014 (updated November 17, 2014) by ProTechSkills

If you wish to deploy a Hadoop single-node setup, please follow the blog here.

Pre-requisites

Before starting the Hadoop cluster setup, make sure that the nodes meet the following pre-requisites:

a) Any Linux operating system
b) Sun Java 1.6 or above should already be installed, and the version should be the same across all […]

MongoDB installation from tar distribution

Posted on August 20, 2014 by ProTechSkills

Follow the steps below to install MongoDB from the tar distribution:

1. Download the stable release of MongoDB from here.
2. Extract the distribution:
   tar -xzvf mongodb-linux
3. Create a directory for MongoDB in /opt:
   mkdir /opt/mongodb
4. Move the distribution files to the mongodb directory:
   mv mongolinux/* /opt/mongodb
5. Add […]

MapReduce Introduction

Posted on August 7, 2014 (updated September 8, 2014) by ProTechSkills

Hadoop MapReduce is a software framework for developing applications that process large datasets in parallel in a reliable and fault-tolerant manner. A MapReduce application splits the input dataset into chunks, which are processed in parallel on multiple nodes. The diagram below shows the different phases of a MapReduce application: There are two […]
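To make the programming model concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for illustration, and the split into map, shuffle, and reduce steps follows the commonly described model rather than the truncated excerpt above.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse the list of values for each key into one total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run the three phases in sequence over a list of input chunks."""
    mapped = []
    # On a real cluster each chunk would be mapped on a different node in
    # parallel; this sketch simply loops over the chunks one by one.
    for chunk in chunks:
        mapped.extend(map_phase(chunk))
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    # Two chunks stand in for two input splits of a larger dataset.
    print(word_count(["big data big cluster", "data node data"]))
```

Because each chunk is mapped independently and each key is reduced independently, both phases can be spread over many nodes, which is where the parallelism described above comes from.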
