Top 25 Data Engineer Interview Questions

In my last post, How to prepare for Data Engineer Interviews, I wrote about how one can prepare for Data Engineer interviews. In this blog post I am going to cover the Top 25 basic data engineer interview questions that are asked frequently, along with brief answers. This is typically the first round of the interview, where the interviewer just wants to assess whether you are aware of the basic concepts or not, so you don't need to explain anything in detail. A single statement will usually suffice. Let's get started.


A. Programming (Python interview questions for data engineers)

1. What is the Static method in Python?

Static methods are methods that are bound to the class rather than to an object of the class. Thus, they can be called without creating an instance; you simply call them through the class reference. Also, all objects of the class share a single copy of the static method.
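A minimal sketch (the class and method names are made up for illustration):

    class MathUtils:
        @staticmethod
        def add(a, b):
            # No self/cls parameter: the method uses neither instance nor class state.
            return a + b

    print(MathUtils.add(2, 3))  # called on the class itself, no object needed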

2. What is a Decorator in Python?

Decorators add extra functionality to a function without directly changing its definition. You apply a decorator to a function using the @ symbol. Learn more about Decorators in Python.
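For example, a small timing decorator could look like this (timed and slow_sum are hypothetical names used for illustration):

    import time
    from functools import wraps

    def timed(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            print(f"{func.__name__} took {time.time() - start:.4f}s")
            return result
        return wrapper

    @timed
    def slow_sum(n):
        return sum(range(n))

    slow_sum(1_000_000)  # prints the elapsed time, then returns the sum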

3. What is a Dictionary in Python?

A dictionary is a key-value data structure in Python. It is built on the concept of hash tables, which makes search, insertion and deletion quite efficient, with an average time complexity of O(1). A key must be an immutable object such as a string, tuple or number, while a value can be any Python object such as a list, tuple or integer.
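A quick illustration of the basic operations:

    person = {"name": "Ada", "age": 36}   # create with key-value pairs
    person["city"] = "London"             # insert / update by key
    print(person["name"])                 # O(1) average lookup -> 'Ada'
    del person["age"]                     # delete by key
    print("age" in person)                # membership test -> False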

4. Difference between Tuples and List?

The first important difference is that lists are mutable and tuples are immutable, i.e. you cannot change a tuple once it is created. Secondly, lists are dynamic while tuples are static in nature, which is why tuples are generally faster and more memory-efficient than lists.
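A quick demonstration of the mutability difference:

    nums_list = [1, 2, 3]
    nums_list[0] = 10          # fine: lists are mutable

    nums_tuple = (1, 2, 3)
    try:
        nums_tuple[0] = 10     # tuples are immutable
    except TypeError as e:
        print(e)               # 'tuple' object does not support item assignment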

5. Difference between Array and Lists?

Lists can store heterogeneous elements, i.e. elements with different data types, whereas arrays can only store homogeneous elements; if you try to put an element of an incompatible data type into an array, a type error is raised (Python's built-in array module, for example, raises a TypeError).
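For example, with the standard-library array module:

    from array import array

    mixed = [1, "two", 3.0]          # a list happily mixes types
    ints = array("i", [1, 2, 3])     # 'i' = signed int, homogeneous storage
    try:
        ints.append("four")          # wrong type
    except TypeError as e:
        print(e)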

6. What is NamedTuple and DefaultDict in Python?

NamedTuple is a tuple-like container in the collections module whose elements can be accessed both by attribute name and by index, which a plain tuple lacks. DefaultDict is a dictionary-like container, also in the collections module, except that it does not throw a KeyError for a missing key the way a dictionary does; instead it supplies a default value for keys that don't exist.
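Both live in the collections module:

    from collections import namedtuple, defaultdict

    Point = namedtuple("Point", ["x", "y"])
    p = Point(1, 2)
    print(p.x, p[1])              # access by attribute name and by index

    counts = defaultdict(int)     # missing keys default to int() == 0
    for word in ["a", "b", "a"]:
        counts[word] += 1         # no KeyError on first access
    print(dict(counts))           # {'a': 2, 'b': 1}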

7. Explain Generators functions in Python.

Generator functions look like normal functions, but instead of a return statement they use the yield keyword to produce values. Calling a generator function returns a generator object, which is an iterator and can be consumed lazily, one value at a time.
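A small example:

    def countdown(n):
        while n > 0:
            yield n        # produce one value at a time, lazily
            n -= 1

    gen = countdown(3)
    print(next(gen))       # 3
    print(list(gen))       # [2, 1] -- the remaining values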


B. Data Structures

1. What is Binary Search? What is its time complexity?

Binary search is a searching algorithm for sorted arrays: compare the search key with the middle element; if the key is smaller, search the left half, otherwise search the right half, and repeat until the key is found or the interval is empty. The time complexity of binary search is O(log n).
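An iterative implementation:

    def binary_search(arr, key):
        lo, hi = 0, len(arr) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if arr[mid] == key:
                return mid
            elif key < arr[mid]:
                hi = mid - 1       # continue in the left half
            else:
                lo = mid + 1       # continue in the right half
        return -1                  # not found

    print(binary_search([1, 3, 5, 7, 9], 7))   # 3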

2. What is the time complexity of Merge sort and Quicksort?

Merge sort runs in O(n log n) in all cases. Quicksort is O(n log n) on average, but degrades to O(n^2) in the worst case (e.g. with consistently bad pivot choices).

3. What are BFS and DFS?

BFS (Breadth First Search) traverses a graph level by level and can be used to find shortest paths in unweighted graphs. It follows a FIFO discipline using a queue: when a node is visited and marked, it is dequeued and its unvisited adjacent nodes are enqueued, and the process repeats until the queue is empty. DFS (Depth First Search), on the other hand, follows a LIFO discipline using a stack (or recursion): it keeps going deeper along unvisited adjacent nodes until the current node has no unvisited neighbours, at which point it backtracks to traverse the remaining unvisited nodes.
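Iterative sketches of both traversals on a small adjacency-list graph (the graph itself is made up):

    from collections import deque

    graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

    def bfs(start):
        visited, queue, order = {start}, deque([start]), []
        while queue:
            node = queue.popleft()            # FIFO
            order.append(node)
            for nbr in graph[node]:
                if nbr not in visited:
                    visited.add(nbr)
                    queue.append(nbr)
        return order

    def dfs(start):
        visited, stack, order = set(), [start], []
        while stack:
            node = stack.pop()                # LIFO
            if node not in visited:
                visited.add(node)
                order.append(node)
                stack.extend(graph[node])
        return order

    print(bfs("A"))   # ['A', 'B', 'C', 'D']
    print(dfs("A"))   # e.g. ['A', 'C', 'D', 'B']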

4. What is memoization?

Memoization is a simple optimization technique that stores the results of expensive function calls in a cache and reuses them from memory instead of computing them again. In dynamic programming, memoization is the top-down approach: sub-problem results are cached as they are computed and looked up later, which improves the performance of your program.
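In Python, functools.lru_cache gives you memoization for free:

    from functools import lru_cache

    @lru_cache(maxsize=None)       # cache results keyed by the arguments
    def fib(n):
        if n < 2:
            return n
        return fib(n - 1) + fib(n - 2)

    print(fib(50))   # fast: each sub-problem is computed only once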

5. Explain Dynamic Programming in simple words.

Dynamic Programming is essentially breaking a problem into overlapping sub-problems, solving each sub-problem once, and reusing the stored results later without having to re-calculate them.
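The bottom-up (tabulation) flavour of the same idea, shown on Fibonacci numbers:

    def fib_bottom_up(n):
        # Solve the smallest sub-problems first and build up to the answer.
        if n < 2:
            return n
        prev, curr = 0, 1
        for _ in range(2, n + 1):
            prev, curr = curr, prev + curr
        return curr

    print(fib_bottom_up(50))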


C. Distributed Systems/Databases

1. What is CAP theorem?

The CAP theorem states that, during a network partition, a distributed system can provide either Availability (A) or Consistency (C), but not both, while remaining Partition Tolerant (P). So, in the Big Data world, you always have to trade off between the availability and the consistency of the system. Read more about CAP Theorem here.

2. What is Sharding in Distributed systems?

Sharding is the process of distributing data across multiple databases on multiple machines in a cluster. It becomes necessary when a dataset is too large to be stored in a single database.
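A minimal sketch of hash-based shard routing (the shard names and hashing scheme are illustrative; real systems may use consistent hashing or range-based sharding instead):

    import hashlib

    SHARDS = ["shard_0", "shard_1", "shard_2"]

    def shard_for(key: str) -> str:
        # Deterministically map a record key to one of the shards.
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return SHARDS[digest % len(SHARDS)]

    print(shard_for("user_42"))   # the same key always routes to the same shard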

3. Explain Master and Slave architecture in distributed systems.

In distributed systems, a cluster is a set of machines connected to each other that share resources and computation among themselves. This setup often follows a master-slave architecture, in which there is one master node and several slave nodes. The master node is responsible for distributing resources and tasks among the slave nodes, and the slave nodes carry out the work assigned to them. In a master-slave database architecture, all writes go to the master node and reads are served from the slave nodes.

4. What is NoSQL database and how it is different from Relational Databases?

In short, a NoSQL database is a non-relational database that stores data in flexible formats such as key-value pairs, documents, wide columns or graphs instead of fixed relational tables, and is typically designed to scale horizontally. For a detailed comparison, check out What is NoSQL database.

5. What are the Columnar Databases? Give Examples.

A columnar (column-oriented) database is a database management system that stores data by column rather than by row. Conceptually, each column is kept in its own file-like structure, and the i-th entries across the column files together make up the i-th row of the table: if C1, C2, ..., CN are column files, then one row consists of the entries R1, R2, ..., RN taken from C1, C2, ..., CN respectively. Columnar databases are beneficial for OLAP (Online Analytical Processing) workloads, where they provide faster performance than row-oriented databases because a query only needs to scan the columns it touches. Examples include Amazon Redshift, Google BigQuery, ClickHouse and Vertica.
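A toy, purely conceptual illustration of the two layouts:

    # Row-oriented: each entry holds a full row.
    rows = [
        {"id": 1, "country": "DE", "amount": 10},
        {"id": 2, "country": "FR", "amount": 20},
    ]

    # Column-oriented: one sequence per column; row i is the i-th entry of every column.
    columns = {
        "id": [1, 2],
        "country": ["DE", "FR"],
        "amount": [10, 20],
    }

    # An analytical query like SUM(amount) only has to scan a single column.
    print(sum(columns["amount"]))   # 30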

6. Explain the scenario when you want to use De-Normalized tables.

De-normalized tables are beneficial for storing and processing Big Data when you need faster, more efficient analytics performance. By de-normalizing the tables we avoid a lot of complex joins between different tables, joins that can hurt performance, especially when the tables hold huge amounts of data. De-normalized tables do consume much more disk space because they contain a lot of redundant data, but that is the trade-off between performance and disk space, and you should decide based on your use-case.


D. Data Modelling

1. What is Star Schema in Data Modelling?

Star Schema is widely used to develop a Data Warehouse or Data Mart. It is a data modelling concept where you have one big fact table and several smaller dimension tables. The model looks like a star because the big fact table sits in the center surrounded by the smaller dimension tables. Basically, the fact table stores the quantitative business data and the dimension tables store the characteristics of that data.

2. What is Snowflake Schema in Data Modelling?

Snowflake Schema is a variant of the Star Schema. It is also represented as fact and dimension tables, except that in a Snowflake Schema the dimension tables are normalized into multiple related tables to reduce redundancy. A Snowflake Schema is used when you want to save disk space and achieve better data integrity in your model, although the normalized data means lower query performance because of the additional joins across multiple tables.

3. When to use Star schema and when to use Snowflake schema?

The choice of schema depends heavily on your business use-case. If your use-case demands faster query performance and you don't care much about disk space, the Star Schema is the better choice. On the other hand, if you want data integrity, more structured data and lower disk usage, the Snowflake Schema is the better fit.

4. What are fact and dimensions tables?

The fact table stores the quantitative business data, which can literally be seen as the "facts" of the business, and the dimension tables store the characteristics of those facts. The primary keys of the dimension tables appear as foreign keys in the fact table.


E. Data Engineering/SQL

1. Explain how would you design end-to-end ETL pipelines.

The first step in an ETL pipeline is to extract the raw data from its source and store it in a data lake. The next step is transformation, where the important and useful data is cleaned and reshaped according to the requirements of the business use-case. Finally, the transformed data is loaded into a data warehouse or consumed directly by downstream applications. You should design reliable, fault-tolerant pipelines so that you don't lose important data if a pipeline breaks.
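A minimal sketch of such a pipeline (the file paths, CSV format and transformation rule are all assumptions made for illustration):

    import csv

    def extract(path):
        # Extract: read raw records from a source file (a stand-in for an API, DB dump, etc.).
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(records):
        # Transform: keep only valid records and derive the fields the business needs.
        return [
            {"user_id": r["user_id"], "amount": float(r["amount"])}
            for r in records
            if r.get("amount")
        ]

    def load(records, path):
        # Load: write the cleaned data to the warehouse/serving layer (a local file here).
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["user_id", "amount"])
            writer.writeheader()
            writer.writerows(records)

    if __name__ == "__main__":
        load(transform(extract("raw_orders.csv")), "clean_orders.csv")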

2. How would you process a huge volume of data?

To process a huge volume of data you should use an efficient distributed data processing framework such as Hadoop, Spark or Beam (Dataflow). These frameworks distribute your data and code across different machines in a cluster and process them in parallel, and they can handle petabytes of data efficiently. Moreover, some of these offerings are fully managed ("no-ops"), so you don't need to administer the cluster yourself; the workers are auto-scaled based on the volume of data.
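A minimal PySpark sketch, assuming pyspark is installed and an events.csv file with an event_type column exists (both assumptions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("big-data-demo").getOrCreate()

    # Spark splits the file into partitions and processes them in parallel across the cluster.
    df = spark.read.csv("events.csv", header=True, inferSchema=True)
    df.groupBy("event_type").count().show()

    spark.stop()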

3. What is the difference between GROUP BY and PARTITION BY in SQL?

GROUP BY reduces the number of rows by aggregating over groups, whereas PARTITION BY (used with window functions) returns the aggregated value on every row of the table. So if a table has 10 rows and you perform a window aggregation with PARTITION BY, the result still has 10 rows, each carrying the aggregated value for its partition; GROUP BY, on the other hand, returns one row per group.
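A small demonstration using Python's built-in sqlite3 module (window functions require SQLite 3.25+; the table and column names are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (dept TEXT, amount INT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [("a", 10), ("a", 20), ("b", 5)])

    # GROUP BY: one row per group (2 rows here).
    print(conn.execute(
        "SELECT dept, SUM(amount) FROM sales GROUP BY dept").fetchall())

    # PARTITION BY: every original row kept, each with its group's aggregate (3 rows).
    print(conn.execute(
        "SELECT dept, amount, SUM(amount) OVER (PARTITION BY dept) FROM sales").fetchall())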

4. Explain different types of JOINS in SQL.

The most common JOINs in SQL are INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN. An INNER JOIN returns only the rows with matching values on both sides of the join. A LEFT JOIN returns all rows from the left side plus the matching rows from the right side; where no match exists on the right, the result contains NULL values. A RIGHT JOIN is the mirror image of a LEFT JOIN, returning all rows from the right side. A FULL JOIN combines the two, returning all rows from both the left and the right side of the join.
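A compact sqlite3 illustration of INNER vs LEFT JOIN (only these two are shown because RIGHT/FULL JOIN need a recent SQLite version; the tables are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INT, name TEXT)")
    conn.execute("CREATE TABLE orders (user_id INT, total INT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Bob")])
    conn.execute("INSERT INTO orders VALUES (1, 99)")

    # INNER JOIN: only users with a matching order.
    print(conn.execute(
        "SELECT name, total FROM users JOIN orders ON users.id = orders.user_id"
    ).fetchall())   # [('Ada', 99)]

    # LEFT JOIN: all users; NULL (None) where no order matches.
    print(conn.execute(
        "SELECT name, total FROM users LEFT JOIN orders ON users.id = orders.user_id"
    ).fetchall())   # [('Ada', 99), ('Bob', None)]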

Also, check out Top 10 SQL Interview Questions to learn about the most frequent SQL questions asked in Data Engineer interviews.

