Top 51 Data Science Interview Questions!

Common Questions

Every Data Science interview starts with some basic questions that set the tone for the rest of the process. These questions are short and direct, and usually not tricky.

These questions can come from any related subject, and having the right answers gives your interviewer an idea of your fundamental knowledge.

This section will highlight the top Common Questions that are asked by the interviewers during a Data Science interview!

Ques 1. Differentiate between Data Science, Machine Learning, and AI.

Ans: Data Science, Machine Learning, and Artificial Intelligence are interrelated fields, but the terms are often mistakenly used interchangeably. The following table clarifies the differences:

Data Science	Machine Learning (ML)	Artificial Intelligence (AI)
Implementation of technology, computations and business skills to make business decisions.	Practical implementation of AI.	Equipping machines with knowledge and decision-making ability.
A subset of AI.	A subset of AI.	A bigger set.
Includes slicing and dicing the complex and large datasets to make inferences.	Includes building programs that build cognitive intelligence in machines.	Includes intelligent algorithms.
Applied in Target advertising, Internet search, Augmented Reality, etc.	Applied in Self-driving cars, financial services, etc.	Applied in Chatbots, voice recognition systems, data refining, etc.

[Figure: Artificial Intelligence vs Machine Learning vs Data Science diagram]

Ques 2. What do you mean by Data Integrity?

Ans: Data Integrity is a term used to denote the standards defined and applied in Database Management Systems to ensure data consistency and data correctness.

For example, if someone enters ‘Name’ in the place of the ‘Email Address’, then the Data Integrity constraints will be enforced and the form will not accept the wrong data for any entry.

Data Integrity ensures that insertions, updates, and any other operations are carried out correctly and do not affect the quality and consistency of the data. It also ensures that the data is safeguarded from outside factors.
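As a minimal sketch of such a constraint, the snippet below uses Python's standard-library sqlite3 module. The users table, its columns, and the CHECK rule are hypothetical and chosen only for illustration; the rule is a rough pattern test, not a full email validator.

```python
import sqlite3

# In-memory database; the table name, columns, and CHECK rule are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        name  TEXT NOT NULL,
        email TEXT NOT NULL CHECK (email LIKE '%_@_%._%')
    )
    """
)

def try_insert(name, email):
    """Return True if the row satisfies the integrity constraints."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO users VALUES (?, ?)", (name, email))
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert("Asha", "asha@example.com")
rejected = try_insert("Ravi", "Ravi")  # a name typed into the email field
row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Here the second insert violates the CHECK constraint, so the database refuses it and only the valid row is stored.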

Ques 3. What do you understand by Big Data?

Ans: Big Data can be defined as the huge amount of data that is generated at unprecedented speed and in multiple formats. With digital media being the primary source, the characteristics of data have changed dramatically. This shift can be described by the following characteristics:

Volume:

Variety:

Velocity:

Value:

Variability:

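These five characteristics are known as the 5 V’s of Big Data. The first three (Volume, Variety, and Velocity) were introduced by Doug Laney, an analyst at META Group (later acquired by Gartner); Value and Variability were added later.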

Ques 4. Please explain the Role of Data Cleaning in data analysis.

Ans: Data Cleaning can be defined as the process that filters irrelevant data out of the database. Data that is duplicated, missing, incorrect, inconsistent, or outdated is considered irrelevant. Any business or entity accumulates plenty of data, and not all of it remains relevant over time. As business needs and data requirements change, not all the data remains worth keeping.

Therefore, it becomes necessary to remove the piles of unwanted data and keep the database as up to date as possible.

Role of Data Cleaning in Data Analysis

Some of the benefits of the Data Cleaning stage are:

By removing the irrelevant data, Data Cleaning also estimates the potential of any system to handle the Big Data requirements.
It also saves the business from making wrong decisions and poor analysis.
Data Cleaning saves the companies from incurring costs for troubleshooting damage caused by unwanted data.
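The kinds of irrelevant data described above (missing, inconsistent, and duplicate records) can be illustrated with a short sketch. The records and the clean function below are hypothetical, and the rules are deliberately simple; real cleaning pipelines depend on the data at hand.

```python
# Hypothetical customer records with a missing value, inconsistent formatting,
# and a duplicate that only shows up after normalization.
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "A@Example.com "},
    {"id": 4, "email": "b@example.com"},
]

def clean(rows):
    """Drop rows with a missing email, normalize the rest, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        email = row["email"]
        if email is None:
            continue  # missing value
        email = email.strip().lower()  # fix inconsistent formatting
        if email in seen:
            continue  # duplicate after normalization
        seen.add(email)
        cleaned.append({"id": row["id"], "email": email})
    return cleaned

cleaned_records = clean(records)
```

Only the records with ids 1 and 4 survive: record 2 has no email, and record 3 duplicates record 1 once the formatting is normalized.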

Ques 5. What are some of the steps for Data Wrangling before applying machine learning algorithms?

Ans: Data Wrangling is the process of cleaning, restructuring, and converting the data into a standard format. This process helps make quick decisions based on reliable analysis. The steps in Data Wrangling are:

Discovering the Data:

Structuring the Data:

Cleaning the Data:

Enriching the Data:

Validating the Data:

Publishing the Data:

Ques 6. Differentiate between Data Modeling and Database Design.

Ans: The differences between Data Modeling and Database Design are as follows:

Data Modeling	Database Design
Process of creating a Data Model.	Process of Designing a Database.
Creates a conceptual model on the basis of the relationships between different data models.	Creates an output which is a detailed data model of a database.
Applies formal Data Modeling Techniques.	Applies logical model, physical model, and storage parameters.

Ques 7. Please enumerate the various steps involved in an analytics project.

Ans: The various steps involved in an analytics project are as follows:

Problem Definition:

Data Mining:

Data Preparation:

Modeling:

Validation:

Ques 8. What do you understand by Deep Learning?

Ans: Deep Learning is a subset of Machine Learning, which is in turn a subset of Artificial Intelligence. It is inspired by the functioning of the human brain and is built on Artificial Neural Networks, which have multiple hidden layers, connections between neurons, and defined directions of data propagation.

Machine Learning Questions

Machine Learning is an integral part of the entire Data Science process. Therefore, all the Data Science Interviews contain Machine Learning questions. These questions are usually based on Machine Learning Algorithms and models.

These questions will be easier for you to understand if you have a sound knowledge of Machine Learning. Get your fundamentals of Machine Learning with the blog Data Science Vs Machine Learning Vs Data Analytics, and come back to these interview questions to brush up your skills.

Some of the top Machine Learning questions asked in a Data Science Interview are as follows:

Ques 9. Can you name the type of biases that occur in machine learning?

Ans: There are basically 4 types of biases that occur in Machine Learning:

Sample Bias:

Prejudice Bias:

Confirmation Bias:

Group Attribution Bias:

Ques 10. What do you understand by the Selection Bias? What are its various types?

Ans: Selection Bias occurs when the sample obtained for analysis does not comply with randomization properly. Therefore the sample causes distortion of statistical analysis, and hence requires proper attention to be given while collecting the sample.

Types of Selection Bias

Self-Selection Bias:

Selection from a Specific Area:

Exclusion:

Survivorship Bias:

Pre-Screening of Participants:

Ques 11. What are the assumptions of linear regression?

Ans: Assumptions of Linear Regression

There are 5 assumptions of Linear Regression:

Linear Relationship:

Multivariate Normality:

No or Little Multicollinearity:

No Auto-Correlation:

Homoscedasticity:

Ques 12. How is Logistic Regression done?

Ans: Logistic Regression is a statistical technique that classifies an observation into one of two classes. It is used when the dependent variable is categorical.

Logistic Regression Equation: Also called the Sigmoid Function, it is commonly expressed as follows:

P(y = 1 | x) = 1 / (1 + e^(-(θ₀ + θ₁x₁ + θ₂x₂)))

Logistic Regression Assumptions:

The dependent variable is binary in nature.
Only the meaningful variables are included.
The independent variables are independent of each other.
The sample sizes are large.
The factor level 1 of the dependent variable will produce the final outcome.
The independent variables are linearly related to log odds.

If the probability is > 0.5, then the data will be classified as the default class ‘Male’, otherwise ‘Female’.

Since the sigmoid exceeds 0.5 exactly when its argument is positive, this is equivalent to checking:

θ₀ + θ₁x₁ + θ₂x₂ > 0

If we have the dataset, where the θ values are:

θ = {0.69254, 0.49269, 0.19834}

Now, suppose we wish to predict the Gender of someone with Height = 70 inches and Weight = 180 pounds.

Then, as per the equation and the above three θ values, the output would be 0.69254 + 0.49269 × 70 + 0.19834 × 180 ≈ 70.88.

Since 70.88 > 0, the prediction would be that it is a ‘Male’.
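The decision rule in this example can be sketched in a few lines of Python. The θ values and the inputs are the ones quoted above; the function names sigmoid and predict are illustrative choices, not part of any library.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, features, threshold=0.5):
    """Return (score, probability, label) for one observation."""
    # Linear score: theta0 + theta1 * x1 + theta2 * x2
    z = theta[0] + sum(t * x for t, x in zip(theta[1:], features))
    p = sigmoid(z)
    label = "Male" if p > threshold else "Female"
    return z, p, label

theta = [0.69254, 0.49269, 0.19834]      # theta values quoted above
z, p, label = predict(theta, [70, 180])  # height in inches, weight in pounds
```

Because the sigmoid is greater than 0.5 exactly when the linear score is positive, comparing the probability with 0.5 and comparing the score with 0 lead to the same class.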

Ques 13. How is standard deviation affected by the outliers?

Ans: Outliers may significantly affect the standard deviation. A single outlier can raise the standard deviation and may distort the spread.
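A quick way to see this effect is to compute the standard deviation of a small sample with and without one extreme value. The numbers below are made up for illustration.

```python
import statistics

values = [10, 12, 11, 13, 12]   # hypothetical sample
with_outlier = values + [100]   # same sample plus one extreme value

sd_before = statistics.pstdev(values)
sd_after = statistics.pstdev(with_outlier)

print(f"without outlier: {sd_before:.2f}, with outlier: {sd_after:.2f}")
```

Adding the single value 100 makes the standard deviation many times larger, even though the other five points are unchanged.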

Ques 14. What is z-score?

Ans: A z-score tells you how many standard deviations a data point lies from the mean. It is a way of comparing results against a normal distribution. For normally distributed data, almost all z-scores (about 99.7%) fall between -3 and +3. The formula for the z-score is as follows:

z = (x – μ) / σ

Where:

x: The data point

μ: Mean of the data

σ: Population Standard Deviation
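The formula can be applied directly with Python's standard-library statistics module. This is a minimal sketch; the z_scores helper and the sample data are hypothetical.

```python
import statistics

def z_scores(data):
    """Return the z-score of every point using the population standard deviation."""
    mu = statistics.fmean(data)      # mean of the data
    sigma = statistics.pstdev(data)  # population standard deviation
    return [(x - mu) / sigma for x in data]

scores = z_scores([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical data: mean 5, sigma 2
```

For this data the mean is 5 and the population standard deviation is 2, so the value 9 lies two standard deviations above the mean.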

Ques 15. Explain Decision Tree Algorithm.

Ans: The Decision Tree Algorithm is a Supervised Learning technique used to solve both Regression and Classification problems. A Decision Tree model is built for its ability to predict the class or value of a target variable. It solves the problem using a tree structure, called a Decision Tree, in which every internal node tests an attribute and every leaf represents a class label.

Ques 16. What is the difference between supervised and unsupervised machine learning?

Ans: Difference between Supervised and Unsupervised Machine Learning

Supervised Machine Learning	Unsupervised Machine Learning
A known set of input data and known responses to the data are used to train a prediction model.	Only the input data (say, X) is present and no corresponding output variable is there.
Useful in cases of Classification and Regression Problems.	Useful in cases of Clustering, Anomaly Detection, Association, and Autoencoders.
Complex computations involved.	Less complex computations involved.

Ques 17. What are some pros and cons about your favorite statistical software?

Ans: Pros of R Programming Language

R is an open-source programming language developed for statistical analysis and computations.
R has a huge collection of statistical packages and libraries for dashboard.
Most R packages are distributed through CRAN, which makes them easy to install.

Cons of R Programming Language

R is less robust and versatile, which is why it is limited to statistical computing and mathematical modeling.
R is a little complicated.
The focus of R is primarily into statistical analysis and hence it is better suited to academia and research.

Ques 18. Explain what a false positive and a false negative are. Why is it important to distinguish these from each other?

Ans: False Positive and False Negative are formally called Type I and Type II errors respectively.

When we reject a null hypothesis that is actually true, it is called a False Positive or Type I Error. Conversely, when we fail to reject a null hypothesis that is actually false, it is called a False Negative or Type II Error.
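In a binary classification setting, the two error types can be counted by comparing predictions with the true labels. The sketch below uses made-up labels, and the confusion_counts helper is hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["TP" if a == 1 else "FP"] += 1  # predicted positive
        else:
            counts["TN" if a == 0 else "FN"] += 1  # predicted negative
    return counts

actual    = [1, 0, 1, 0, 0, 1]  # hypothetical ground truth
predicted = [1, 1, 0, 0, 0, 1]  # hypothetical model output
counts = confusion_counts(actual, predicted)
```

Here the second prediction is a false positive (predicted 1, actually 0) and the third is a false negative (predicted 0, actually 1).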

Ques 19. When would you use random forests Vs SVM and why?

Ans: Use of Random Forest vs SVM

Random Forest is better suited to multi-class problems, whereas SVM is better suited to two-class problems.

Ques 20. Explain SVM machine learning algorithm in detail.

Ans: SVM, or Support Vector Machine, is a Supervised Learning algorithm that is used for both Regression and Classification. SVM plots each data item as a point in n-dimensional space, where ‘n’ is the number of features. The algorithm then uses hyperplanes to separate the classes, based on the Kernel function provided. There are four types of Kernels in SVM:

Ques 21. What is Random Forest? How does it work?

Ans: Random Forest is a machine learning algorithm that combines a collection of decision trees into an ensemble, in which different weak models work together to create a powerful one. Each decision tree in the Random Forest produces one prediction, and the class with the most votes becomes the prediction of the model.
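The voting step alone can be sketched as follows. This is not a full Random Forest: training the trees is omitted, and the votes are hypothetical predictions from five already-trained trees for a single sample.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from five already-trained decision trees for one sample.
votes = ["spam", "ham", "spam", "spam", "ham"]
forest_prediction = majority_vote(votes)
```

Three of the five trees vote for "spam", so that class becomes the prediction of the ensemble.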

Ques 22. What is the goal of A/B Testing?

Ans: A/B Testing, also called Split Testing, is a testing technique used to determine whether a new design performs better than the existing one on a chosen metric.

Ques 23. What is ‘Naive’ in a Naive Bayes?

Ans: The Naive Bayes algorithm is a classification technique that assumes that the presence of a particular feature in an object is unrelated to the presence of any other feature.

For example, a fruit can be considered to be an orange if it is orange in color, round in shape, and about 3 inches in diameter. Even if these features depend on each other, the algorithm treats each of them as contributing independently to the probability that the fruit is an orange. That is why it is known as ‘Naive’: the independence assumption made in this algorithm may or may not be true.
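The independence assumption means the per-feature probabilities are simply multiplied together. The sketch below shows that calculation; all the probabilities, class names, and feature names are hypothetical and chosen only for illustration.

```python
# Hypothetical probabilities, chosen only for illustration.
priors = {"orange": 0.5, "apple": 0.5}
likelihoods = {
    "orange": {"orange_color": 0.9, "round": 0.8, "about_3_inches": 0.7},
    "apple":  {"orange_color": 0.1, "round": 0.7, "about_3_inches": 0.6},
}

def naive_bayes_scores(observed):
    """Score each class as prior times the product of per-feature likelihoods."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in observed:
            score *= likelihoods[label][feature]  # features treated as independent
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}  # normalize to sum to 1

posterior = naive_bayes_scores(["orange_color", "round", "about_3_inches"])
```

With these numbers the fruit is far more likely to be an orange than an apple, because each observed feature independently favors that class.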

Programming Questions

No Data Science project can be turned into reality without a programming language that serves as the medium for instructing machines to carry out a task. In the domain of Data Science, Python, R, and SQL are the leading programming languages, and interviews almost always include some questions from this section.

Choosing the right language to start off your journey in Data Science is very important. Read the blog SQL For Data Science | Python, R, Hadoop, & Tableau | What Should You Learn? and make the right decision right away!

Some of the top programming language questions asked in the Data Science Interviews are as follows!

Ques 24. Describe the different parts of an SQL query.

Ans: Different Parts of an SQL Query are as follows:

The SQL Operation: Four basic operations performed by SQL are SELECT, INSERT, UPDATE, and DELETE.

The Target: All the SQL DML statements work on one or more database tables. The objective of the Target Component is to identify those tables.
The Condition: This component identifies the particular rows on which the operation is to take place.
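These three parts can be seen in a single query. The sketch below runs one against an in-memory SQLite database using Python's standard-library sqlite3 module; the employees table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Asha", "Data", 90), ("Ravi", "Data", 70), ("Mei", "Ops", 80)],
)

rows = conn.execute(
    "SELECT name "          # the operation (and the columns it returns)
    "FROM employees "       # the target table
    "WHERE dept = 'Data' "  # the condition that picks the rows
    "ORDER BY name"
).fetchall()
```

The query returns only the names of the employees in the Data department, because the condition filters out every other row of the target table.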

Ques 25. What’s the difference between DELETE and TRUNCATE?

Ans: Difference between DELETE and TRUNCATE

DELETE	TRUNCATE
It’s a DML command.	It’s a DDL command.
Used to delete specific rows from the table.	Used to delete all the rows from the table.
May contain a WHERE Clause.	Does not contain the WHERE Clause.

Ques 26. What is the difference between SQL and MySQL or SQL Server?

Ans: Difference between SQL and MySQL or SQL Server

SQL stands for Structured Query Language, which is used for accessing and manipulating databases. MySQL and SQL Server are both Relational Database Management Systems that use SQL. MySQL is owned by Oracle and offers a free, open-source edition, while SQL Server is developed by Microsoft and is mostly commercially licensed.

Ques 27. What is a lambda expression in Python?

Ans: A Lambda Expression is a feature of the Python programming language that allows creating functions with no name. These are small functions whose body consists of only one expression.
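A short illustration, using made-up data, of a lambda bound to a name and a lambda passed inline as a sort key:

```python
square = lambda x: x * x  # anonymous function bound to a name
squares = list(map(square, [1, 2, 3]))

# Lambdas are most often passed inline, for example as a sort key.
pairs = [("b", 2), ("a", 3), ("c", 1)]
by_count = sorted(pairs, key=lambda pair: pair[1])
```

In everyday code a named def is usually preferred for anything reused; lambdas are handiest for short, throwaway functions like the sort key above.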

Ques 28. How will you multiply a 4×3 matrix by a 3×2 matrix?

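Ans: The following NumPy snippet multiplies a 4×3 matrix by a 3×2 matrix; the result is a 4×2 matrix:

import numpy as np

# Multiply a 4×3 matrix of ones by a 3×2 matrix of ones
Z = np.dot(np.ones((4, 3)), np.ones((3, 2)))
print(Z)
# [[3. 3.]
#  [3. 3.]
#  [3. 3.]
#  [3. 3.]]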

Ques 29. Python or R-Which one would you prefer for text analysis?

Ans: For text analysis, Python is a better option than R, as its Pandas library allows faster text analytics. R, on the other hand, is better suited to machine learning than to text analytics.

Ques 30. Name a few libraries in Python used for Data Analysis and Scientific computations.

Ans: Some of the libraries in Python that are used for Data Analytics and Scientific Computations are NumPy, SciPy, Pandas, scikit-learn, Matplotlib, Seaborn, etc.

Ques 31. Which library would you prefer for plotting in Python language: Seaborn or Matplotlib?

Ans: In Python, Seaborn is built on top of Matplotlib, so both libraries serve basic plotting requirements. However, Matplotlib is a low-level tool and Seaborn is a high-level tool. Hence, the choice of library should depend on the level of control required.

Ques 32. Why should you use NumPy arrays instead of nested Python lists?

Ans: NumPy arrays make array operations easier than nested Python lists: operations are vectorized, so element-wise arithmetic needs no explicit loops, and the code gets shorter and faster.
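A small comparison on made-up data shows the difference. The sketch assumes NumPy is installed.

```python
import numpy as np

nested = [[1, 2, 3], [4, 5, 6]]
arr = np.array(nested)

# Nested lists need an explicit loop for element-wise arithmetic.
doubled_list = [[2 * value for value in row] for row in nested]

# NumPy applies the operation to every element at once.
doubled_arr = arr * 2
column_sums = arr.sum(axis=0)  # sum down each column
```

Both approaches double every element, but the NumPy version is a single expression, and operations such as column sums need no hand-written loop.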

Ques 33. Which is the standard data missing marker used in Pandas?

Ans: NaN

Ques 34. How many data structures does R language have?

Ans: Some of the popular data structures in R are:

Vector
Matrix
Array
Lists
DataFrames

Ques 35. How can you produce correlations and covariances in the R language?

Ans: In the R programming language, the cor( ) function is used to produce correlations and the cov( ) function to produce covariances.

Ques 36. What according to you are disadvantages of R Programming over Python?

Ans: Disadvantages of R Programming over Python

Ques 37. What is multi-threading and how can you implement it in R programming language?

Ans: Multi-threading can be defined as the ability of a program or an Operating System to manage its usage by multiple users and handle multiple requests without having to run multiple copies of the program.

R is designed for single-threaded operation by default. However, to harness the power of parallel processing, R can use multi-threaded BLAS and LAPACK libraries. On any operating system, whether Windows, macOS, or Linux, the MKL (Math Kernel Library) provides BLAS and LAPACK, and the RevoUtilsMath package is installed into the default search path.

Ques 38. What is pruning in Decision Tree?

Ans: When the sub-nodes of a decision node are removed, that process is called Pruning.

Ques 39. What are the different types of sorting algorithms available in R language?

Ans: Some of the sorting algorithms available in R Programming are as follows:

Bubble Sort:

Insertion Sort:

Selection Sort:

Quick Sort:

Merge Sort:

Heap Sort:

Data Visualization Questions

Data Visualization is an important aspect of Data Science and therefore any aspirant should be prepared to answer the questions from this section. The most popular tool for Data Visualization in the field of Data Science is Tableau.

Ques 40. What are the different filters in Tableau?

Ans: There are 6 main Filters in Tableau:

Extract Filters:

Data Source Filters:

Context Filters:

Dimension Filters:

Measure Filters:

Table Calc Filters:

Ques 41. What are the sets and groups? Differentiate.

Ans: Difference between Sets and Groups

Sets	Groups
Sets are dynamic in nature.	Groups are not dynamic in nature.
Sets are complicated.	Groups are self-explanatory.
You can use it for multiple-dimensions.	You can use it for single-dimension.

Ques 42. How can you visualize more than three dimensions in a single chart?

Ans: A chart can directly represent up to three dimensions, i.e., height, width, and depth. To include more than three dimensions, visual cues such as color, size, shape, and animation are used to encode the additional ones.

Ques 43. What are the different datatypes in Tableau?

Ans: Basically, there are 4 Datatypes in Tableau:

String:

For example:

Number:

For example:

Boolean:

For example:

Date & Datetime:

Ques 44. What is disaggregation and aggregation of data?

Ans: Disaggregation is the function that breaks down a key figure from the aggregated level to the detailed level. Aggregation, on the other hand, is the function that sums up the detailed-level values and shows the result at the aggregated level.
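Outside Tableau, the same idea can be sketched in plain Python. The sales rows below are hypothetical: the list holds the detailed (disaggregated) level, and the totals hold the aggregated level.

```python
from collections import defaultdict

# Hypothetical detailed (disaggregated) sales rows: (region, amount).
sales = [("East", 100), ("West", 50), ("East", 25), ("West", 75)]

totals = defaultdict(int)
for region, amount in sales:
    totals[region] += amount  # aggregate detail rows up to one value per region
```

Each region ends up with a single summed figure, while the original list still holds the row-level detail.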

Ques 45. What are the different file types in Tableau?

Ans: Various file types in Tableau are:

Workbooks:

Bookmarks:

Packaged Workbooks:

Data Extraction Files:

Data Connection Files:

Ques 46. What is a story in Tableau?

Ans: A story is a sheet that contains a sequence of worksheets or dashboards that work together to convey information. Stories can be created to show how facts are connected, provide context, demonstrate how decisions relate to outcomes, etc. Every sheet in a story is called a story point.

Ques 47. What is the maximum no. of rows Tableau can utilize at one time?

Ans: The maximum number of rows Tableau can utilize at one time is 16.

Ques 48. What is the difference between discrete and continuous in Tableau?

Ans: Difference between Discrete and Continuous in Tableau

Discrete	Continuous
Individually separate and distinct.	Forming an unbroken whole, without interruption.
Discrete fields draw headers	Continuous fields draw axes
Discrete fields can be sorted	Continuous fields cannot

Ques 49. What do you understand by blended axis?

Ans: In Tableau, the Measures can share a single axis and using that all the marks are shown in one single pane. Now, in order to compare multiple measures in a single view, there can be three ways:

Creating individual axes for each measure
Blend two measures to share the axis.
Add dual axes where there are two independent axes layered in the same pane.

Now, instead of adding a row or column to the view, if the measures are blended, there is a single row or column and all the values for each measure are reflected along one continuous axis.

Ques 50. What is a TDE file?

Ans: TDE stands for Tableau Data Extract, which is a file format used for compressed data sources. These files provide better performance and can package data from databases and other online sources.

Ques 51. What is Row-Level Security?

Ans: Row-Level Security restricts the rows of data a user can see, based on filters customized to who that user is. It can be configured according to the tools the users work with.

Did you find these questions helpful? Comment below and let us know if you have more questions and we will provide the best answers to those in the most comprehensible manner possible. Stay tuned to blog.Simpliv.com for more updates and latest Data Science related blogs, interview questions and attractive infographics.
