Big Data Interview Questions

Recently, with the advent of various sensors, the amount of data being collected has become enormous. Data is expanding in volume, and with that rise it is becoming difficult to handle using traditional tools. There has been sudden growth in demand for Data Engineers in the industry over the last five years, and it will keep growing.


Amid this demand, the industry needs good Big Data Engineers, and many companies are hiring for Data Engineer and Data Scientist posts. Check out How to prepare for Data Engineer Interviews if you are appearing for Data Engineer interviews.

In this article, I will give important basic Big Data questions for freshers asked in interviews nowadays. Also, check out Top 25 Data Engineer Questions frequently asked in Data Engineering interviews.

1. What is Big Data? Can you give a real-life example to explain Big Data?

Big Data is a collection of large and complex datasets that are difficult to store, process and analyze using traditional relational databases. Therefore, we need specialized Big Data tools to handle and process it. The data of patients in a hospital, including all the personal details, diseases, test results, medical history, etc., is an example of Big Data.


2. What are the 5 V's of Big Data?

Volume: The amount of data, which is growing exponentially.

Velocity: The rate at which data is generated and processed.

Variety: The heterogeneity of the data, which can be structured, semi-structured or unstructured.

Veracity: Uncertainty in data due to inconsistencies and incompleteness.

Value: How the Big Data adds value to a business or organization.


3. Name some Big Data tools.

Apache Hadoop, Apache Pig, Apache Hive, Apache HBase, Apache Spark, Apache Kafka, Apache Flink


4. What is Hadoop? What are its components? How is it related to Big Data?

Apache Hadoop is an open-source framework used to handle, store and process Big Data. It is used for Big Data analysis and generates outcomes for businesses from the data.

Components: 

MapReduce: A programming model that processes data in parallel.

HDFS: A Java-based distributed file system used for data storage.

YARN: A framework that manages cluster resources and schedules jobs.


5. What is MapReduce?

A programming framework that processes data in parallel across a cluster. It consists of two distinct tasks called Map and Reduce. First, the Map task processes a block of data and generates key-value pairs as intermediate output; these are then passed to the Reduce task, which takes the key-value pairs from different Map tasks as input and aggregates them into a smaller set of key-value pairs.
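The map, shuffle and reduce phases described above can be sketched in pure Python with a classic word count. This is a toy simulation of the data flow, not real Hadoop (actual Hadoop jobs are typically written in Java and run distributed); the function names are illustrative.

```python
from collections import defaultdict

def mapper(block):
    """Map: emit an intermediate (key, value) pair for each word in a data block."""
    for word in block.split():
        yield (word, 1)

def shuffle(mapped_pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce: aggregate all values for one key into a single result."""
    return (key, sum(values))

# Two "blocks" of input data, as if stored on different nodes.
blocks = ["big data big tools", "data tools"]
mapped = [pair for block in blocks for pair in mapper(block)]
reduced = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(reduced)  # {'big': 2, 'data': 2, 'tools': 2}
```

In real Hadoop the mapper outputs are partitioned and shipped over the network to reducers; here the shuffle is just an in-memory grouping.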


6. What is YARN?

YARN stands for Yet Another Resource Negotiator; it is used for managing resources and scheduling jobs. YARN can dynamically allocate resources to applications as needed, and was designed to improve resource utilization and application performance compared to the static slot allocation of the original MapReduce (MRv1).


7. What is HDFS?

HDFS is the Hadoop Distributed File System, the storage unit of Hadoop, used to store different kinds of data as blocks in a distributed environment. It uses a master-slave topology.

NameNode: The master node, which maintains the metadata of the data blocks stored in HDFS, such as block locations, the replication factor, etc.

DataNode: A slave node that stores the actual data blocks in HDFS; it is managed by the NameNode.
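The NameNode's metadata can be pictured as a mapping from each block of a file to the DataNodes holding its replicas. The sketch below is a toy model of that idea; the node names, round-robin placement, and replication factor of 3 are illustrative assumptions, not HDFS internals.

```python
REPLICATION_FACTOR = 3  # HDFS defaults to 3 replicas per block
datanodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical slave nodes

def place_blocks(filename, num_blocks):
    """Assign each block to REPLICATION_FACTOR distinct DataNodes (round-robin).

    Returns NameNode-style metadata: block id -> list of replica locations.
    """
    metadata = {}
    for i in range(num_blocks):
        replicas = [datanodes[(i + r) % len(datanodes)]
                    for r in range(REPLICATION_FACTOR)]
        metadata[f"{filename}:block{i}"] = replicas
    return metadata

meta = place_blocks("logs.txt", 2)
print(meta)
# {'logs.txt:block0': ['dn1', 'dn2', 'dn3'], 'logs.txt:block1': ['dn2', 'dn3', 'dn4']}
```

Real HDFS placement is rack-aware (one replica local, two on a remote rack), but the key point is the same: only the NameNode knows where every block lives.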


8. What is Apache HBase? What are its components?


HBase is a multidimensional, scalable and distributed NoSQL database written in Java. It runs on top of HDFS and provides BigTable-like capabilities to Hadoop. It is designed to store large collections of sparse data sets, with high throughput and low latency. Its main components are the HMaster (which assigns regions and handles administrative operations), Region Servers (which serve reads and writes for ranges of rows), and ZooKeeper (which coordinates the cluster).


9. What is the difference between HBase and Relational databases?

HBase:
- It is schema-less
- It is a column-oriented datastore
- It stores sparse data sets efficiently
- Partitioning is done automatically in HBase

Relational Databases:
- It is a schema-based database
- It is a row-oriented database
- It stores normalized data rather than sparse data sets
- There is no built-in support for automatic partitioning
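The "sparse" point in the comparison above can be made concrete with a toy model: an HBase-style store keeps only the cells that actually exist for each row, while a row-oriented relational table reserves a slot (possibly NULL) for every column in every row. The row keys and column-family names below are made up for the example.

```python
# Sparse, HBase-style layout: row key -> {column qualifier: value}.
# A missing cell simply isn't stored, so it costs nothing.
sparse_rows = {
    "user#1": {"info:name": "Asha", "info:email": "asha@example.com"},
    "user#2": {"info:name": "Ravi"},  # no email cell stored at all
}

# Row-oriented, RDBMS-style layout: every row carries every column.
relational_rows = [
    {"name": "Asha", "email": "asha@example.com"},
    {"name": "Ravi", "email": None},  # NULL still occupies a column slot
]

# The sparse layout stores 3 cells; the relational layout has 4 slots.
cells = sum(len(row) for row in sparse_rows.values())
slots = sum(len(row) for row in relational_rows)
print(cells, slots)  # 3 4
```

With thousands of mostly empty columns (a common HBase use case), this gap is what makes the column-oriented, schema-less model pay off.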


10. What is Apache Spark? Why is it used? How is it different from MapReduce?

Apache Spark is a framework for performing data analytics in a distributed environment. It performs computations in memory, which increases processing speed and makes it much faster than MapReduce for large-scale data processing.

Spark also uses a master-slave architecture. It has one central coordinator called the Driver, which acts as the master and manages the Executors, which act as slaves. The Driver and each Executor run in their own Java processes.

Driver: The main method runs in the Driver. It converts the user program into tasks and schedules those tasks onto the executors. It acts as the central master managing multiple executors.

Executors: Executors are worker processes that run individual tasks in a given Spark job. Once they finish their tasks, they send the results back to the Driver. Executors are launched at the beginning of a Spark application and run for its entire lifetime.


11. What is RDD in Spark?

RDD stands for Resilient Distributed Dataset. It is an immutable, distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of a cluster. There are two ways to create an RDD: 1. parallelizing an existing collection in your driver program, or 2. referencing a dataset in an external storage system such as HDFS.
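The two key ideas, an immutable collection split into logical partitions, each transformed independently, can be sketched in plain Python. In real PySpark you would instead call `sc.parallelize(data)` or `sc.textFile("hdfs://...")`; the helpers below are a local simulation, not Spark's API.

```python
def parallelize(data, num_partitions):
    """Split a local collection into logical partitions, like sc.parallelize."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_partitions(partitions, fn):
    """Apply fn to every element; each partition could run on a different node.
    The input partitions are left untouched, mirroring RDD immutability."""
    return [[fn(x) for x in part] for part in partitions]

rdd = parallelize([1, 2, 3, 4, 5, 6], num_partitions=3)
squared = map_partitions(rdd, lambda x: x * x)
print(rdd)      # [[1, 2], [3, 4], [5, 6]]
print(squared)  # [[1, 4], [9, 16], [25, 36]]
```

Note that `map_partitions` returns a new collection rather than mutating `rdd`, which is exactly how Spark transformations produce a new RDD from an existing one.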

