To crack an interview, you must be clear on the basic concepts of the frameworks you claim to know. At the request of many of our students, we have put together a comprehensive list of questions to help you get through your Big Data and Hadoop interview. We have made sure this list covers the questions most likely to come up. Below are the most frequently asked Big Data and Hadoop interview questions.
What is Big Data?
Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications. The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations.
What do the four V’s of Big Data denote?
- Volume – Scale of data
- Velocity – Analysis of streaming data
- Variety – Different forms of data
- Veracity – Uncertainty of data
What is Hadoop?
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
Explain the architecture of Hadoop. What is the difference between Hadoop 1 and Hadoop 2?
What is the difference between Hadoop 2 and Hadoop 3?
Hadoop has undergone many changes across its three major versions. Hadoop 3 combines the efforts of hundreds of contributors over the six years since Hadoop 2 launched.
| Features | Hadoop 2.x | Hadoop 3.x |
|---|---|---|
| Minimum Java version required | Java 7 | Java 8 |
| Fault tolerance | Via replication (which wastes space) | Via erasure coding |
| Storage scheme | 3x replication factor for data reliability, 200% storage overhead | Erasure coding for data reliability, 50% storage overhead |
| YARN Timeline Service | Scalability issues | Highly scalable and reliable |
| Standby NameNode | Supports only one standby NameNode | Supports multiple standby NameNodes |
| Heap management | HADOOP_HEAPSIZE must be configured manually | Provides auto-tuning of the heap |
| Data balancing | Uses the HDFS Balancer | Uses the intra-DataNode balancer, invoked via the HDFS Disk Balancer CLI |
What mode(s) can Hadoop be deployed in?
Hadoop can be deployed in standalone mode, pseudo-distributed mode, or fully-distributed mode. Hadoop was specifically designed to be deployed on a multi-node cluster; however, it can also be deployed on a single machine, as a single process, for testing purposes.
What is replication factor?
The replication factor controls how many copies of each individual block are stored: data is replicated across the Hadoop cluster based on the replication factor. A high replication factor guarantees data availability in the event of node failure.
Explain what the Distributed Cache is in the MapReduce framework.
The Distributed Cache is an important feature provided by the MapReduce framework. It is used when you want to share read-only files across all nodes in a Hadoop cluster. The files could be executable JAR files or simple properties files.
What are the most common input formats defined in Hadoop?
The most common input formats defined in Hadoop are TextInputFormat, KeyValueTextInputFormat, and SequenceFileInputFormat. The default input format in Hadoop is TextInputFormat.
What is InputSplit in Hadoop?
When a Hadoop job runs, it splits the input files into chunks and assigns each chunk to a mapper for processing. Each such chunk is called an InputSplit.
What is the difference between TextInputFormat and KeyValueInputFormat class?
TextInputFormat: reads lines of text files and provides the byte offset of each line as the key, and the line contents as the value, to the mapper.
KeyValueTextInputFormat: reads text files and parses each line into a (key, value) pair: everything up to the first tab character is sent as the key to the mapper, and the remainder of the line is sent as the value.
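The tab-splitting rule above can be sketched in plain Java. This is an illustration of the parsing logic only, not the real Hadoop API; the class and method names are hypothetical.

```java
// Plain-Java sketch of how KeyValueTextInputFormat splits a line: everything
// before the FIRST tab is the key, the rest is the value. A line with no tab
// becomes the key with an empty value. Illustration only, not the Hadoop API.
public class KeyValueSplitSketch {
    static String[] toKeyValue(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[]{line, ""}; // no tab: whole line is the key
        }
        return new String[]{line.substring(0, tab), line.substring(tab + 1)};
    }

    public static void main(String[] args) {
        String[] kv = toKeyValue("user42\tclicked\thome");
        System.out.println(kv[0]); // user42
        System.out.println(kv[1]); // only the first tab splits, so the value keeps its tab
    }
}
```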
How is the splitting of file invoked in Hadoop framework?
It is invoked by the Hadoop framework, which calls the getSplits() method of the InputFormat class (such as FileInputFormat) configured for the job.
Consider this scenario in a MapReduce system:
– HDFS block size is 64 MB
– Input format is FileInputFormat
– We have 3 files of size 64 KB, 65 MB and 127 MB
How many input splits will be made by the Hadoop framework?
Hadoop will make 5 splits, as follows:
– 1 split for the 64 KB file
– 2 splits for the 65 MB file
– 2 splits for the 127 MB file
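The arithmetic behind this answer is ceiling division of the file size by the block size. A minimal sketch (the helper name is hypothetical, and the real FileInputFormat additionally applies a small "split slop" tolerance that this ignores):

```java
// Hypothetical helper illustrating the split arithmetic above: the number of
// splits for a file is ceil(fileSize / blockSize).
public class SplitCount {
    static final long MB = 1024L * 1024L;

    static long splitsFor(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize; // ceiling division
    }

    public static void main(String[] args) {
        long blockSize = 64 * MB;
        System.out.println(splitsFor(64 * 1024, blockSize)); // 64 KB file -> 1
        System.out.println(splitsFor(65 * MB, blockSize));   // 65 MB file -> 2
        System.out.println(splitsFor(127 * MB, blockSize));  // 127 MB file -> 2
    }
}
```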
What is the purpose of RecordReader in Hadoop?
The InputSplit defines a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the mapper. The RecordReader instance is defined by the InputFormat.
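A simplified, plain-Java sketch of what a line-oriented record reader produces: it scans the split's contents and emits (byte offset, line) pairs for the mapper, as TextInputFormat's reader would. The class and method names are hypothetical, not the real Hadoop LineRecordReader API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified sketch of a line-oriented RecordReader: emit one
// (starting byte offset -> line) record per line of the split.
public class LineRecordSketch {
    static Map<Long, String> toRecords(String splitContents) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : splitContents.split("\n", -1)) {
            if (!line.isEmpty() || offset < splitContents.length()) {
                records.put(offset, line);
            }
            offset += line.length() + 1; // +1 for the newline delimiter
        }
        return records;
    }

    public static void main(String[] args) {
        // "spark" starts at byte 7, after "hadoop\n".
        System.out.println(toRecords("hadoop\nspark\nhive"));
    }
}
```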
After the map phase finishes, the Hadoop framework performs partitioning, shuffling and sorting. Explain what happens in this phase.
Partitioning: It is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine for all of its output (key, value) pairs which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same.
Shuffle: After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.
Sort: Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.
If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to the reducer?
The default partitioner (HashPartitioner) computes a hash value for the key and assigns the partition based on the result, so all values for the same key go to the same reducer.
What is a Combiner?
The Combiner is a ‘mini-reduce’ process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
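The effect of a word-count combiner can be simulated in plain Java: a mapper's (word, 1) pairs are summed locally before anything is sent over the network. This is an illustration of the idea only; in Hadoop a combiner is a Reducer subclass registered with job.setCombinerClass(...), and the class name here is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java simulation of what a word-count combiner achieves: local,
// per-node aggregation of map output before the shuffle.
public class CombinerSketch {
    static Map<String, Integer> combine(String[] mapperOutputKeys) {
        Map<String, Integer> combined = new HashMap<>();
        for (String word : mapperOutputKeys) {
            combined.merge(word, 1, Integer::sum); // sum counts per key locally
        }
        return combined;
    }

    public static void main(String[] args) {
        // Five (word, 1) records shrink to three records before the shuffle.
        System.out.println(combine(new String[]{"a", "b", "a", "a", "b"}));
    }
}
```

The payoff is network traffic: the reducers receive one partial sum per key per node instead of every raw (word, 1) pair.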
What will a Hadoop job do if you try to run it with an output directory that is already present? Will it
– Overwrite it
– Warn you and continue
– Throw an exception and exit
The Hadoop job will throw an exception and exit.
Explain what Sqoop is in Hadoop.
Sqoop is a tool used to transfer data between relational database management systems (RDBMS) and Hadoop HDFS. Using Sqoop, data can be imported from an RDBMS such as MySQL or Oracle into HDFS, and exported from HDFS back into an RDBMS.
What is the process for modifying files at arbitrary locations in HDFS?
There is none: HDFS does not support modifications at arbitrary offsets in a file, nor multiple concurrent writers. Files are written by a single writer in append-only fashion, i.e. writes to a file in HDFS are always made at the end of the file.
What are the core methods of a Reducer?
The 3 core methods of a Reducer are:
1) setup() – called once at the start of the task; used for configuring parameters such as the input data size, distributed cache and heap size.
Function definition: public void setup(Context context)
2) reduce() – the heart of the reducer, called once per key with the associated list of values.
Function definition: public void reduce(Key key, Iterable<Value> values, Context context)
3) cleanup() – called only once, at the end of the reduce task, for clearing temporary files or releasing resources.
Function definition: public void cleanup(Context context)
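The lifecycle above can be sketched in plain Java: the framework calls setup() once, then reduce() once per key, then cleanup() once. The method names mirror the Hadoop Reducer contract, but this is a simplified standalone illustration, not the real API.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the reducer lifecycle: setup once, reduce per key,
// cleanup once. The trace records the order in which the methods run.
public class ReducerLifecycleSketch {
    StringBuilder trace = new StringBuilder();

    void setup() { trace.append("setup;"); }

    void reduce(String key, List<Integer> values) {
        int sum = values.stream().mapToInt(Integer::intValue).sum();
        trace.append(key).append("=").append(sum).append(";");
    }

    void cleanup() { trace.append("cleanup"); }

    // Drives the lifecycle the way the framework would.
    String run(Map<String, List<Integer>> groupedInput) {
        setup();
        groupedInput.forEach(this::reduce);
        cleanup();
        return trace.toString();
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> input =
            new TreeMap<>(Map.of("a", List.of(1, 2), "b", List.of(3)));
        System.out.println(new ReducerLifecycleSketch().run(input));
        // setup;a=3;b=3;cleanup
    }
}
```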
We hope this post helps you a lot. Please leave a comment if you want to update any answer or add a new question.