Hadoop is a framework for handling Big Data. It uses HDFS as the distributed storage mechanism and MapReduce as the parallel processing paradigm for data residing in HDFS. The key components of MapReduce are the Mapper and the Reducer. When a MapReduce job runs on a large dataset, Mappers generate large […]
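To get a concrete feel for the two phases, the stock WordCount job bundled with Hadoop exercises both; below is a minimal run sketch, assuming the examples JAR shipped with your distribution and hypothetical /input and /output HDFS paths:

    # Stage input in HDFS (local-data.txt is a placeholder file).
    hadoop fs -mkdir -p /input
    hadoop fs -put local-data.txt /input
    # Run WordCount: Mappers tokenize lines, Reducers sum the counts.
    # The examples JAR location varies by distribution; adjust to yours.
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        wordcount /input /output
    # Inspect the Reducer output.
    hadoop fs -cat /output/part-r-00000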
Launch a Linux Virtual Machine on AWS
Amazon Elastic Compute Cloud (AWS EC2) is the Amazon Web Service used to create and run virtual machines (instances) in the cloud. This step-by-step guide helps administrators launch a Linux virtual machine on Amazon EC2 free of cost. By following the previous post, you should already have created […]
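If you prefer the command line to the console walkthrough, the AWS CLI can launch an equivalent Free Tier instance; a minimal sketch, assuming you have configured credentials, where the AMI ID, key pair name, and security group ID are placeholders for your own values:

    # Launch a single t2.micro (Free Tier eligible) Linux instance.
    aws ec2 run-instances \
        --image-id ami-xxxxxxxx \
        --count 1 \
        --instance-type t2.micro \
        --key-name MyKeyPair \
        --security-group-ids sg-xxxxxxxx
    # Look up the public DNS name to SSH into the instance.
    aws ec2 describe-instances --query "Reservations[].Instances[].PublicDnsName"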
Create Free Account on AWS
The Amazon Web Services (AWS) Free Tier is designed to give you hands-on experience with AWS Cloud services. As described in the previous post, you can use various AWS services for free with limited usage. Follow the steps below to get started with Amazon Web Services (AWS): Go to AWS […]
Importing Data into Hive using Sqoop
The main function of Sqoop’s import tool is to upload your data into files in HDFS. If you have a Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data’s layout in Hive. Related Posts: […]
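In practice, a single --hive-import flag routes the import into Hive instead of plain HDFS files; a minimal sketch, assuming a MySQL source where the connection string, credentials, and table name are illustrative:

    # Import the employees table from MySQL straight into a Hive table.
    # Sqoop generates and runs the CREATE TABLE statement for you.
    sqoop import \
        --connect jdbc:mysql://dbserver/company \
        --username dbuser -P \
        --table employees \
        --hive-import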
Importing Data using Sqoop
Sqoop is an Apache top-level project designed to move data between Hadoop and relational databases. Sqoop is a collection of related tools: to use Sqoop, you specify the tool you want to use and the arguments that control the tool:

    sqoop tool-name [tool-arguments]

In this post, we will cover […]
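For instance, sqoop help lists the available tools, and every tool follows the same syntax; the connection details below are illustrative:

    # List the tools Sqoop ships with (import, export, list-tables, ...).
    sqoop help
    # Example of the tool-name [tool-arguments] pattern: list the tables
    # in a MySQL database (-P prompts for the password).
    sqoop list-tables \
        --connect jdbc:mysql://dbserver/company \
        --username dbuser -P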
Introduction to Sqoop and Installation
Processing and analyzing data in Hadoop first requires loading into the Hadoop file system the data that resides on application servers and in databases. Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. You can use Sqoop to import data from a relational database management […]
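The transfer works in both directions; a minimal sketch of an import and the matching export, with hypothetical connection details, table names, and HDFS paths:

    # Import a table from an RDBMS into HDFS...
    sqoop import --connect jdbc:mysql://dbserver/company \
        --username dbuser -P --table employees --target-dir /data/employees
    # ...and export HDFS data back into an RDBMS table.
    sqoop export --connect jdbc:mysql://dbserver/company \
        --username dbuser -P --table employees_copy --export-dir /data/employees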
Introduction to Amazon Web Services (AWS)
Amazon Web Services (AWS) is a subsidiary of Amazon.com, launched in 2006, that offers a suite of cloud computing services making up an on-demand computing platform. The most central and best-known of these services arguably include Amazon Elastic Compute Cloud, also known as “EC2”, and Amazon Simple Storage Service, […]
Working with HDFS Snapshots
This blog gives an overview of HDFS snapshots and the different operations that users and cluster administrators can perform on them. It also explains snapshot management through Cloudera Manager.
Overview of HDFS Snapshots
Snapshots are read-only, point-in-time copies of data, used for protection against user errors and for disaster recovery. Snapshots can be taken on a sub-tree of […]
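The basic workflow has two steps: an administrator marks a directory snapshottable, then users take named snapshots of it; a minimal sketch using a hypothetical /data path:

    # Administrator: allow snapshots on a directory (makes it snapshottable).
    hdfs dfsadmin -allowSnapshot /data
    # User: take a named snapshot; it appears under /data/.snapshot/snap1.
    hdfs dfs -createSnapshot /data snap1
    # List all snapshottable directories visible to the current user.
    hdfs lsSnapshottableDir
    # Recover a file by copying it back out of the read-only snapshot.
    hdfs dfs -cp /data/.snapshot/snap1/file.txt /data/file.txt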
Cloudera Manager-Configuring Static Service Pools
This blog describes the configuration of static service pools through Cloudera Manager. It assumes that you have a Cloudera cluster already running and have read the previous blog on the concepts behind Cloudera Manager – cgroups and static service pools.
Configuring Static Service Pools
To configure, open the Configuration […]
Cloudera Manager – cgroups and static service pools
This blog describes the concept of cgroups and static service pools in Cloudera Manager. It assumes that you already have a Cloudera cluster running; if not, you can download a Cloudera Quick-Start VM from Cloudera.
Defining cgroups
Control groups (cgroups) are a Linux kernel feature that […]
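To make the idea concrete, here is a minimal cgroups (v1) sketch at the shell, assuming the cpu controller is mounted under /sys/fs/cgroup/cpu as on most older distributions; Cloudera Manager drives this same mechanism on your behalf when you enable static service pools:

    # Create a cgroup and give it a reduced CPU weight (the default is 1024).
    sudo mkdir /sys/fs/cgroup/cpu/demo
    echo 512 | sudo tee /sys/fs/cgroup/cpu/demo/cpu.shares
    # Move the current shell into the cgroup; its children inherit the limit.
    echo $$ | sudo tee /sys/fs/cgroup/cpu/demo/tasks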