Data Science is one of the most dynamic fields in technology and attracts a huge number of candidates. However, not everyone lands a good Data Scientist role. With cut-throat competition among candidates, you need an edge to gain the upper hand. It is therefore important for aspirants to know the common and tricky questions that interviewers ask.
Before going through the interview questions, it is suggested that you acquire the fundamental knowledge of Data Science. Read the blog What is Data Science | Start Your Career in Data Science Today! and get a strong foothold on the concepts of Data Science!
In this blog, you will learn about the top Data Science questions that are asked in interviews. The questions are divided into the following segments:
- Common Questions Asked in Data Science
- Machine Learning Questions
- Programming Questions
- Data Visualization Questions
Common Questions
Any Data Science interview starts with some basic questions that set the tone for the rest of the process. These questions are short and direct, and usually not tricky.
These questions can come from any related subject, and answering them correctly gives your interviewer an idea of your fundamental knowledge.
This section highlights the top Common Questions that interviewers ask during a Data Science interview!
Ques 1. Differentiate between Data Science, Machine Learning, and AI.
Ans: Data Science, Machine Learning, and Artificial Intelligence are inter-related fields that are often mistakenly used interchangeably. The following table summarizes the differences:
Data Science | Machine Learning (ML) | Artificial Intelligence (AI) |
Implementation of technology, computations and business skills to make business decisions. | Practical implementation of AI. | Equipping machines with knowledge and decision-making ability. |
Overlaps with AI and ML, and also draws on statistics and business knowledge. | A subset of AI. | The broadest of the three. |
Includes slicing and dicing the complex and large datasets to make inferences. | Includes building programs that build cognitive intelligence in machines. | Includes intelligent algorithms. |
Applied in Target advertising, Internet search, Augmented Reality, etc. | Applied in Self-driving cars, financial services, etc. | Applied in Chatbots, voice recognition systems, data refining, etc. |
Ques 2. What do you mean by Data Integrity?
Ans: Data Integrity is a term used to denote the standards defined and applied in Database Management Systems to ensure data consistency and data correctness.
For example, if someone enters ‘Name’ in the place of the ‘Email Address’, then the Data Integrity constraints will be enforced and the form will not accept the wrong data for any entry.
Data Integrity practically ensures the insertion of data, updating the data, or any other operations are carried out in the right manner and do not affect the quality and consistency of the data. Data Integrity also ensures that the data is safeguarded from any outside factor.
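As a rough illustration of the form example above, the sketch below uses Python's built-in sqlite3 module. The `users` table, its columns, and the very loose email pattern are hypothetical and chosen only for demonstration; a real system would use stricter validation.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        email TEXT NOT NULL UNIQUE CHECK (email LIKE '%_@_%')
    )
    """
)

def try_insert(name, email):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO users (name, email) VALUES (?, ?)", (name, email)
            )
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert("Asha", "asha@example.com")     # valid email
rejected = try_insert("Ravi", "Ravi")                 # a name typed into the email field
duplicate = try_insert("Asha B", "asha@example.com")  # violates UNIQUE
row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(accepted, rejected, duplicate, row_count)
```

Only the first row is stored; the other two are refused by the CHECK and UNIQUE constraints, so the table stays consistent.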
Ques 3. What do you understand by Big Data?
Ans: Big Data can be defined as the enormous amount of data that is generated at unprecedented speed and in multiple formats. With digital mediums being the primary source, the characteristics of data have changed dramatically. This shift can be described with the following points:
- 1. Volume: Data that is voluminous.
2. Variety: Data that is both structured and unstructured.
3. Velocity: Data that is generated at an unprecedented speed.
4. Value: Data that is critically valuable.
5. Variability: The dataset whose value ranges broadly.
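These five characteristics are known as the 5 V’s of Big Data. The first three (Volume, Variety, and Velocity) were introduced by the analyst Doug Laney, later of Gartner; Value and Variability were added to the list afterwards.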
Ques 4. Please explain the Role of Data Cleaning in data analysis.
Ans: Data Cleaning can be defined as the process that filters irrelevant data out of the database. Data that is duplicated, missing, incorrect, inconsistent, or outdated is considered Irrelevant Data. Any business or entity accumulates plenty of data, and not all of it remains relevant over time. As business needs and data requirements change, not all the data remains worth keeping.
It therefore becomes necessary to remove the piles of unwanted data and keep the database as up to date as possible.
Role of Data Cleaning in Data Analysis
Some of the benefits of the Data Cleaning stage are:
- By removing the irrelevant data, Data Cleaning also estimates the potential of any system to handle the Big Data requirements.
- It also saves the business from making wrong decisions and poor analysis.
- Data Cleaning saves the companies from incurring costs for troubleshooting damage caused by unwanted data.
Ques 5. What are some of the steps for Data Wrangling before applying machine learning algorithms?
Ans: Data Wrangling is the process of cleaning, restructuring, and converting the data into a standard format. This process helps make quick decisions based on reliable analysis. The steps in Data Wrangling are:
- 1. Discovering the Data: At this stage, the data is understood in terms of its meaning, purpose, and use. The criteria by which the data will be classified and organized are identified here.
2. Structuring the Data: Raw data is usually unorganized. At this stage the data is classified, organized, and structured according to the criteria identified at the previous stage. The resulting structure should comply with the analytical standard being used.
3. Cleaning the Data: Cleaning the data means finding and fixing records that have missing, outdated, or incorrect values, in order to improve data quality.
4. Enriching the Data: Enriching the data means using additional data to improve the quality of the existing data. It involves understanding the existing data, deciding whether additional data should be added to improve its quality, and analyzing whether a new dataset can be derived from the existing data.
5. Validating the Data: At this stage the data is verified programmatically to ensure its security, consistency, and quality.
6. Publishing the Data: The prepared and verified dataset is published so that it can be used in future.
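The structuring, cleaning, and validating steps can be sketched in a few lines of plain Python. The records, field names, and validation rules below are hypothetical and only illustrate the idea; real projects would typically use a library such as Pandas.

```python
# Hypothetical raw records, as they might arrive from a form or an export.
raw = [
    {"name": " Asha ", "age": "34", "city": "Pune"},
    {"name": "Ravi", "age": "", "city": "Delhi"},      # missing age
    {"name": " Asha ", "age": "34", "city": "Pune"},   # duplicate
    {"name": "Meera", "age": "29", "city": "chennai"},
]

def structure(record):
    """Structuring: trim text, normalise case, convert age to int or None."""
    age = record["age"].strip()
    return {
        "name": record["name"].strip(),
        "age": int(age) if age.isdigit() else None,
        "city": record["city"].strip().title(),
    }

def clean(records):
    """Cleaning: drop rows with missing values and exact duplicates."""
    seen, result = set(), []
    for r in records:
        key = (r["name"], r["age"], r["city"])
        if None in key or key in seen:
            continue
        seen.add(key)
        result.append(r)
    return result

def validate(records):
    """Validating: simple rule checks before the data is published."""
    return all(bool(r["name"]) and 0 < r["age"] < 120 for r in records)

wrangled = clean([structure(r) for r in raw])
is_valid = validate(wrangled)
print(wrangled, is_valid)
```

Two of the four raw records survive: the row with the missing age and the duplicate row are dropped during cleaning.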
Ques 6. Differentiate between Data modeling and Database design?
Ans: The differences between Data Modeling and Database Design are as follows:
Data Modeling | Database Design |
Process of creating a Data Model. | Process of Designing a Database. |
Creates a conceptual model on the basis of the relationships between different data objects. | Creates an output which is a detailed data model of a database. |
Applies formal Data Modeling Techniques. | Applies logical model, physical model, and storage parameters. |
Ques 7. Please enumerate the various steps involved in an analytics project.
Ans: The various steps involved in an analytics project are as follows:
- 1. Problem Definition: The first step of any data analytics project is to understand the business problem that is required to be solved, how that can be solved, and what should be the approach to solve the issue.
2. Data Mining: The next stage is to explore and mine the data. It includes three steps: inspecting the data, performing exploratory analysis, and visualizing the data.
3. Data Preparation: At this stage the data is prepared for modeling by cleaning and scrubbing the datasets.
4. Modeling: In this stage, the relevant data sets are selected. All the data fields that are either irrelevant or not required at the moment, are removed, and modeling of data is done with the rest of important ones.
5. Validation: Finally, the model is validated by testing it on new datasets.
Ques 8. What do you understand by Deep Learning?
Ans: Deep Learning is a subset of Machine Learning, which in turn is a subset of Artificial Intelligence. It is loosely inspired by the functioning of the human brain and is built on Artificial Neural Networks. Deep networks have multiple hidden layers between the input and output layers, with weighted connections along which data is propagated.
Machine Learning Questions
Machine Learning is an integral part of the entire Data Science process. Therefore, all the Data Science Interviews contain Machine Learning questions. These questions are usually based on Machine Learning Algorithms and models.
These questions will be easier for you to understand if you have a sound knowledge of Machine Learning. Get your fundamentals of Machine Learning with the blog Data Science Vs Machine Learning Vs Data Analytics, and come back to these interview questions to brush up your skills.
Some of the top Machine Learning questions asked in a Data Science Interview are as follows:
Ques 9. Can you name the type of biases that occur in machine learning?
Ans: There are basically 4 types of biases that occur in Machine Learning:
- 1. Sample Bias: Sample Bias occurs when the sample is not a true representation of the universe or, in other words, the real scenario. It is mostly caused by human intervention, since any process that involves human interaction exposes the entire system to errors.
2. Prejudice Bias: Prejudice Bias also arises from the humans involved in the process. It reflects their social class, cultural background, nationality, and so on. These factors unknowingly influence the Machine Learning Algorithms and cause skewness.
3. Confirmation Bias: Confirmation Bias occurs due to the psychological tendency of human beings to interpret information according to their personal beliefs. Data Scientists are often under pressure to deliver a particular result before even starting the process, which shapes their outlook and ultimately causes Confirmation Bias.
4. Group Attribution Bias: Group Attribution Bias occurs when the model is inclined towards a particular direction in terms of attributes, for example by generalizing the traits of individuals to the whole group they belong to. It skews the results.
Ques 10. What do you understand by the Selection Bias? What are its various types?
Ans: Selection Bias occurs when the sample obtained for analysis does not comply with randomization properly. Therefore the sample causes distortion of statistical analysis, and hence requires proper attention to be given while collecting the sample.
Types of Selection Bias
- 1. Self-Selection Bias: Occurs when the participating members influence the decision to participate in the process.
2. Selection from a Specific Area: Occurs when the sample units are selected from a specific area only.
3. Exclusion: Occurs when some of the groups are excluded from the study.
4. Survivorship Bias: Occurs when a sample includes only those units that pass the selection process, and excludes the units that do not pass it.
5. Pre-Screening of Participants: Occurs when the participants of the study are included from particular groups only.
Ques 11. What are the assumptions of linear regression?
Ans: Assumptions of Linear Regression
There are 5 assumptions of Linear Regression:
- 1. Linear Relationship: The relationship between the independent and the dependent variables should be linear.
2. Multivariate Normality: The variables should be multivariate normal; in practice this is usually checked by confirming that the residuals are approximately normally distributed.
3. No or Little Multicollinearity: There should be little or no Multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other.
4. No Auto-Correlation: There should be little or no Autocorrelation in the data. Autocorrelation occurs when the residuals are not independent of each other.
5. Homoscedasticity: The variance of the residuals should be constant along the regression line.
Ques 12. How is Logistic Regression done?
Ans: Logistic Regression is a statistical technique that classifies an observation into one of two classes. It is used when the dependent variable is categorical.
Logistic Regression Equation: The model passes a linear combination of the inputs through the Sigmoid Function to obtain a probability:
p = 1 / (1 + e^-(θ0 + θ1x1 + θ2x2))
Logistic Regression Assumptions:
- The dependent variable is binary in nature.
- Only the meaningful variables are included.
- The independent variables are independent of each other.
- The sample sizes are large.
- The factor level 1 of the dependent variable represents the desired outcome.
- The independent variables are linearly related to log odds.
For example, consider a model that predicts Gender from Height (x1) and Weight (x2). If the probability is > 0.5, then the data will be classified as the default class ‘Male’, otherwise ‘Female’.
So, here,
θ0+ θ1x1+ θ2x2 >=0
If we have the dataset, where the θ values are:
θ = {0.69254, 0.49269, 0.19834}
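Now, suppose we wish to predict the Gender of someone with Height = 70 inches and Weight = 180 pounds.
Substituting directly into the equation gives 0.69254 + 0.49269 × 70 + 0.19834 × 180 ≈ 70.88. (The figure of 1.91 sometimes quoted for this example does not follow from the θ values as printed, and presumably assumes scaled inputs or different coefficients.)
Since the result is greater than 0, the prediction would be ‘Male’.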
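The decision rule above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it plugs in the θ values exactly as listed, with unscaled inputs. Under those assumptions the linear term comes out at roughly 70.88, so any other figure would imply scaled features or different coefficients.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, features, positive="Male", negative="Female"):
    """Classify with the rule: linear term >= 0 is the same as probability >= 0.5."""
    z = theta[0] + sum(t * x for t, x in zip(theta[1:], features))
    p = sigmoid(z)
    return z, p, positive if z >= 0 else negative

theta = [0.69254, 0.49269, 0.19834]      # θ0, θ1, θ2 as listed above
z, p, label = predict(theta, [70, 180])  # Height = 70 in, Weight = 180 lb
print(round(z, 2), label)
```

Because the sigmoid equals 0.5 exactly when its input is 0, comparing the linear term with 0 and comparing the probability with 0.5 always give the same class.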
Ques 13. How is standard deviation affected by the outliers?
Ans: Outliers may significantly affect the standard deviation. A single outlier can raise the standard deviation and may distort the spread.
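A quick way to see this effect is to compute the standard deviation of a small sample with and without one extreme value. The numbers below are made up for illustration; the median is included only as a contrast, since it barely reacts to the outlier.

```python
import statistics

values = [10, 12, 11, 13, 12, 11, 12]   # hypothetical measurements
with_outlier = values + [60]            # same data plus one extreme value

sd_before = statistics.stdev(values)
sd_after = statistics.stdev(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)

print(f"standard deviation: {sd_before:.2f} -> {sd_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```

A single added point inflates the standard deviation many times over, while the median stays where it was.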
Ques 14. What is z-score?
Ans: z-Score tells you how many standard deviations a data point lies from the mean. It is a way of comparing results to a normal distribution. For normally distributed data, about 99.7% of z-Scores fall between -3 and +3, although values outside this range are possible. The formula for z-Score is as follows:
z = (x – μ) / σ
Where:
x: The data point
μ: Mean of the data
σ: Population Standard Deviation
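As a small worked example, the formula can be applied with Python's statistics module. The scores below are hypothetical, and the helper name `z_scores` is made up; the sketch assumes the data is not constant, since σ would otherwise be zero.

```python
import statistics

def z_scores(data):
    """Return (x - μ) / σ for every x, using the population standard deviation."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    return [(x - mu) / sigma for x in data]

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5, population standard deviation is 2
z = z_scores(scores)
print(z)
```

Here the score 9 lies two standard deviations above the mean (z = 2.0) and the score 2 lies one and a half below it (z = -1.5).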
Ques 15. Explain Decision Tree Algorithm.
Ans: Decision Tree Algorithm is a Supervised Learning Technique used to solve Regression and Classification problems. A Decision Tree Model is built for its ability to predict the class or value of a target variable. It solves the problem using a tree structure, called a Decision Tree. Every internal node of the tree represents a test on an attribute, and every leaf represents a class label.
Ques 16. What is the difference between supervised and unsupervised machine learning?
Ans: Difference between Supervised and Unsupervised Machine Learning
Supervised Machine Learning | Unsupervised Machine Learning |
A known set of input data and known responses to the data are used to train a prediction model. | Only the input data (say, X) is present and no corresponding output variable is there. |
Useful in cases of Classification and Regression Problems. | Useful in cases of Clustering, Anomaly Detection, Association, and Autoencoders. |
Complex computations involved. | Less complex computations involved. |
Ques 17. What are some pros and cons about your favorite statistical software?
Ans: Pros of R Programming Language
- R is an open-source programming language developed for statistical analysis and computations.
- R has a huge collection of statistical packages and libraries for dashboard.
- R packages are distributed mainly through CRAN, which makes them easy to install.
Cons of R Programming Language
- R is less robust and versatile, which is why it is limited to statistical computing and mathematical modeling.
- R is a little complicated.
- The focus of R is primarily into statistical analysis and hence it is better suited to academia and research.
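Ques 18. Explain what a false positive and a false negative are. Why is it important to distinguish these from each other?
Ans: False Positive and False Negative are formally called Type I and Type II errors respectively.
When we reject a null hypothesis that is actually true, it is called a False Positive or Type I Error. Conversely, when we fail to reject a null hypothesis that is actually false, it is called a False Negative or Type II Error. The distinction matters because the two errors usually carry different costs: in medical screening, for example, a False Negative (a missed disease) is typically far more harmful than a False Positive (an unnecessary follow-up test).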
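For a binary classifier, the two error types can be counted directly from actual and predicted labels. The labels below are made up, and the function name is hypothetical; 1 stands for the positive class and 0 for the negative class.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["TP"] += 1
        elif p == 1 and a == 0:
            counts["FP"] += 1   # Type I error
        elif p == 0 and a == 0:
            counts["TN"] += 1
        else:
            counts["FN"] += 1   # Type II error
    return counts

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 1, 0]
counts = confusion_counts(actual, predicted)

false_positive_rate = counts["FP"] / (counts["FP"] + counts["TN"])
false_negative_rate = counts["FN"] / (counts["FN"] + counts["TP"])
print(counts, false_positive_rate, false_negative_rate)
```

Reporting the two rates separately, rather than a single accuracy figure, makes it possible to weigh each kind of error by its real-world cost.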
Ques 19. When would you use random forests Vs SVM and why?
Ans: Use of Random Forest vs SVM
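Random Forest handles multi-class problems natively, whereas SVM is intrinsically a two-class classifier and needs a scheme such as one-vs-rest to handle more classes. Random Forest also copes with features on different scales with little preparation, while SVM generally needs scaled numeric features.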
Ques 20. Explain SVM machine learning algorithm in detail.
Ans: SVM or Support Vector Machine is a Supervised Learning Algorithm that is used for both Regression and Classification. SVM plots each data point in n-dimensional space, where ‘n’ is the number of features. The algorithm then uses hyperplanes to separate the different classes, based on the provided Kernel function. There are four common types of Kernels in SVM:
- 1. Linear Kernel
2. Polynomial kernel
3. Radial basis kernel
4. Sigmoid kernel
Ques 21. What is Random Forest? How does it work?
Ans: Random Forest is a machine learning algorithm that is a collection of decision trees. It acts as an ensemble, in which different weak models work together to create a powerful one. Each decision tree in the Random Forest produces one prediction, and the class with the most votes becomes the prediction of the model.
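The voting step can be sketched as follows. The three "trees" here are hand-written stand-ins with hypothetical rules, not trained models; a real Random Forest would learn each tree from a random sample of the rows and features.

```python
from collections import Counter

# Three hand-written stand-ins for trained decision trees (hypothetical rules).
def tree_a(fruit):
    return "orange" if fruit["color"] == "orange" else "apple"

def tree_b(fruit):
    return "orange" if fruit["diameter_in"] >= 2.5 else "apple"

def tree_c(fruit):
    return "orange" if fruit["peel"] == "thick" else "apple"

def forest_predict(trees, sample):
    """Collect one vote per tree and return the majority class with the tally."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0], votes

sample = {"color": "orange", "diameter_in": 2.0, "peel": "thick"}
prediction, votes = forest_predict([tree_a, tree_b, tree_c], sample)
print(prediction, dict(votes))
```

Even though one tree votes "apple", the ensemble still predicts "orange", which is how the forest smooths over the mistakes of individual trees.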
Ques 22. What is the goal of A/B Testing?
Ans: A/B Testing, also called Split Testing, is a testing technique that compares two versions of something, such as an existing design (A) and a new design (B), to determine whether the new version performs better on a chosen metric.
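One common way to evaluate an A/B test on conversion rates is a two-proportion z-test. The sketch below is a minimal version using only the standard library; the visitor and conversion counts are invented, and the function name is hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for the rate difference B - A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)

# Hypothetical experiment: 5,000 visitors per variant.
z_stat, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(round(z_stat, 2), round(p_value, 4))
```

A small p-value suggests that the difference in conversion rates is unlikely to be due to chance alone; the threshold (often 0.05) should be fixed before the test is run.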
Ques 23. What is ‘Naive’ in a Naive Bayes?
Ans: Naive Bayes algorithm is a classification technique that assumes that the presence of a particular feature in an object is unrelated to the presence of any other feature.
For example, a fruit can be considered to be an orange if it is orange in color, round in shape, and about 3 inches in diameter. Even if these features depend on each other, the algorithm treats each of them as contributing independently to the probability that the fruit is an orange. That is why it is known as ‘Naive’: the independence assumption made in this algorithm may or may not be true.
Programming Questions
No Data Science project can become reality without a programming language to instruct the machine to carry out each task. In Data Science, Python, R, and SQL are the most widely used languages, and almost every interview includes questions on them.
Choosing the right language to start off your journey in Data Science is very important. Read the blog SQL For Data Science | Python, R, Hadoop, & Tableau | What Should You Learn? and make the right decision right away!
Some of the top programming language questions asked in the Data Science Interviews are as follows!
Ques 24. Describe the different parts of an SQL query.
Ans: Different Parts of an SQL Query are as follows:
- The SQL Operation: Four basic operations performed by SQL are SELECT, INSERT, UPDATE, and DELETE.
- The Target: All the SQL DML statements work on one or more database tables. The objective of the Target Component is to identify those tables.
- The Condition: This component identifies the particular rows on which the operation is to be performed.
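The three parts can be seen in the statements below, run against an in-memory SQLite database through Python's built-in sqlite3 module. The `employees` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Asha", "Data", 72000), ("Ravi", "Sales", 48000), ("Meera", "Data", 65000)],
)

# Operation: SELECT | Target: employees | Condition: salary > 60000
high_earners = conn.execute(
    "SELECT name FROM employees WHERE salary > 60000 ORDER BY name"
).fetchall()

# Operation: DELETE | Target: employees | Condition: dept = 'Sales'
conn.execute("DELETE FROM employees WHERE dept = 'Sales'")
remaining = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]

print(high_earners, remaining)
```

In both statements the keyword names the operation, the table name after FROM is the target, and the WHERE clause is the condition that picks the rows.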
Ques 25. What’s the difference between DELETE and TRUNCATE?
Ans: Difference between DELETE and TRUNCATE
DELETE | TRUNCATE |
It’s a DML command. | It’s a DDL command. |
Used to delete specific rows from the table. | Used to delete all the rows from the table. |
May contain a WHERE Clause. | Does not contain the WHERE Clause. |
Ques 26. What is the difference between SQL and MySQL or SQL Server?
Ans: Difference between SQL and MySQL or SQL Server
SQL stands for Structured Query Language, which is used for accessing and manipulating databases. MySQL and SQL Server are both Relational Database Management Systems that use SQL. MySQL is owned by Oracle and SQL Server is developed by Microsoft. MySQL offers a free, open-source Community Edition, whereas SQL Server is a commercial product whose full editions require a paid license.
Ques 27. What is a lambda expression in Python?
Ans: Lambda Expression is a feature of the Python Programming Language that allows creating functions with no names (anonymous functions). These are small functions whose body consists of only one expression.
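A few typical uses are sketched below; the variable names and sample data are only illustrative.

```python
# A lambda assigned to a name behaves like a one-line function.
square = lambda x: x * x

# Lambdas are most often passed inline, for example as a sort key.
people = [("Asha", 34), ("Ravi", 28), ("Meera", 41)]
by_age = sorted(people, key=lambda person: person[1])

# Or as the predicate for filter().
evens = list(filter(lambda n: n % 2 == 0, range(10)))

print(square(7), by_age, evens)
```

For anything longer than a single expression, a regular `def` function is usually the clearer choice.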
Ques 28. How will you multiply a 4×3 matrix by a 3×2 matrix?
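Ans: The following NumPy snippet multiplies a 4×3 matrix by a 3×2 matrix; the result is a 4×2 matrix:
import numpy as np

# Multiply a 4×3 matrix of ones by a 3×2 matrix of ones
Z = np.dot(np.ones((4, 3)), np.ones((3, 2)))
print(Z)
# [[3. 3.]
#  [3. 3.]
#  [3. 3.]
#  [3. 3.]]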
Ques 29. Python or R-Which one would you prefer for text analysis?
Ans: For text analysis, Python is generally the better option because it has the Pandas Library, which provides easy-to-use data structures and fast data analysis tools. R also supports text analytics, but it is more often chosen for statistical analysis.
Ques 30. Name a few libraries in Python used for Data Analysis and Scientific computations.
Ans: Some of the libraries in Python that are used for Data Analytics and Scientific Computations are NumPy, SciPy, Pandas, scikit-learn, Matplotlib, Seaborn, etc.
Ques 31. Which library would you prefer for plotting in Python language: Seaborn or Matplotlib?
Ans: In Python, Seaborn is built on top of Matplotlib, so both libraries serve basic plotting requirements. However, Matplotlib is a low-level tool that offers fine-grained control, while Seaborn is a high-level tool that produces statistical plots with less code. The choice of library should therefore depend on the level of control required.
Ques 32. Why you should use NumPy arrays instead of nested Python lists?
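Ans: NumPy Arrays store elements of a single type in a contiguous block of memory, so they use less memory than Nested Python Lists and support fast, vectorized, element-wise operations. Code that needs explicit loops with nested lists usually becomes a single, shorter expression with NumPy.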
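The difference in effort is easy to see on a small example. The `prices` data below is made up; the same doubling operation is written once with nested lists and once with a NumPy array.

```python
import numpy as np

prices = [[10.0, 20.0], [30.0, 40.0]]   # hypothetical 2×2 data

# Nested lists: element-wise work needs explicit loops or comprehensions.
doubled_list = [[value * 2 for value in row] for row in prices]

# NumPy array: the same operation is one vectorized expression.
arr = np.array(prices)
doubled_arr = arr * 2

# Aggregations along an axis are also a single call.
column_means = arr.mean(axis=0)

print(doubled_list)
print(doubled_arr)
print(column_means)
```

Both approaches give the same numbers, but the NumPy version stays a one-liner however many dimensions the data has.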
Ques 33. Which is the standard data missing marker used in Pandas?
Ans: NaN
Ques 34. How many data structures does R language have?
Ans: Some of the popular data structures in R are:
- Vector
- Matrix
- Array
- Lists
- DataFrames
Ques 35. How can you produce correlations and covariances in R language?
Ans: In R programming Language, cor( ) function is used for producing Correlations and cov( ) function to produce co-variances.
Ques 36. What according to you are disadvantages of R Programming over Python?
Ans: Disadvantages of R Programming over Python
- 1. R is a little complicated and difficult to learn while Python has a relatively simpler and readable syntax and hence it is easy to learn.
2. Better suited for Academia and Research, less suited for Enterprise Level Machine Learning Modeling.
3. R has comparatively smaller user-base than Python.
Ques 37. What is multi-threading and how can you implement it in R programming language?
Ans: Multi-threading can be defined as the ability of a program or an Operating System to run several threads of execution concurrently, so that multiple users or requests can be handled without having to run multiple copies of the program.
R itself is designed for single-threaded operation. However, in order to harness the power of parallel processing, R can be linked against multi-threaded BLAS and LAPACK libraries. In Microsoft R Open, for example, the MKL (Math Kernel Library) provides BLAS and LAPACK on Windows, Mac OS, and Linux, and the RevoUtilsMath package is installed into the default search path.
Ques 38. What is pruning in Decision Tree?
Ans: When the sub-nodes of a decision node are removed, that process is called Pruning.
Ques 39. What are the different types of sorting algorithms available in R language?
Ans: R's built-in sort( ) function offers the "radix", "quick", and "shell" methods. The classic sorting algorithms, which can also be implemented in R, are as follows:
- 1. Bubble Sort: In this algorithm, two consecutive elements of the list are compared and their positions are swapped as per ascending or descending order.
2. Insertion Sort: In this algorithm, each unsorted element is compared with the already sorted elements and inserted at its suitable place among them.
3. Selection Sort: In this algorithm, the smallest element of the unsorted part of the list is moved to the start of that part at each iteration.
4. Quick Sort: In this algorithm, an element (the pivot) is selected, often at random, and the rest of the array is divided into two parts. The elements that are less than the pivot are placed in one array and the elements that are greater than the pivot are placed in another. Each part is then sorted in the same way, and in this manner the whole array is sorted.
5. Merge Sort: In this algorithm, an array of length ‘n’ is broken down into n lists containing only one element each, and then those are merged again in order to get the sorted list.
6. Heap Sort: Similar to selection sort, this algorithm repeatedly picks the next extreme element (smallest or largest). However, it uses a tree-based structure, the binary heap, to find that element efficiently.
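To make one of these concrete, here is a minimal sketch of Quick Sort as described above. It is written in Python to match the other examples in this post; the same logic can be ported to R. The function name is made up, and this simple version builds new lists rather than sorting in place.

```python
import random

def quick_sort(items):
    """Return a new sorted list using a randomly chosen pivot."""
    if len(items) <= 1:
        return list(items)
    pivot = random.choice(items)
    smaller = [x for x in items if x < pivot]   # elements less than the pivot
    equal = [x for x in items if x == pivot]    # the pivot and its duplicates
    larger = [x for x in items if x > pivot]    # elements greater than the pivot
    return quick_sort(smaller) + equal + quick_sort(larger)

data = [5, 2, 9, 1, 5, 6]
sorted_data = quick_sort(data)
print(sorted_data)
```

Keeping the elements equal to the pivot in their own list guarantees that each recursive call works on a strictly smaller input, even when the data contains duplicates.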
Data Visualization Questions
Data Visualization is an important aspect of Data Science and therefore any aspirant should be prepared to answer the questions from this section. One of the most popular tools for Data Visualization in the field of Data Science is Tableau.
Ques 40. What are the different filters in Tableau?
Ans: There are 6 main Filters in Tableau:
- 1. Extract Filters: This filter creates an extract of a small subset of data from the original source of data. Tableau creates a local copy of this subset and then filters on the values present in it. This also reduces the number of times Tableau has to query the data source.
2. Data Source Filters: This filter restricts the data that is fed into the tool and also restricts the data visible to other viewers. It filters the data and loads only the remaining part into the tool.
3. Context Filters: This filter applies a general context to the overall analysis.
4. Dimension Filters: These are non-aggregated filters that are applied on individual dimensions.
5. Measure Filters: These filters are aggregated filters that are applied after non-aggregated filters. The Measure filters are applied using the measure field values.
6. Table Calc Filters: These filters are applied at the end. The Table Calc filters allow you to filter the view without having to affect the underlying data.
Ques 41. What are the sets and groups? Differentiate.
Ans: Difference between Sets and Groups
Sets | Groups |
Sets are dynamic in nature. | Groups are not dynamic in nature. |
Sets are complicated. | Groups are self-explanatory. |
You can use it for multiple-dimensions. | You can use it for single-dimension. |
Ques 42. How can you visualize more than three dimensions in a single chart?
Ans: Most data is represented with the help of up to three dimensions, i.e., height, width, and depth. In order to include more than three dimensions, visual cues such as color, size, shape, and animation are used to denote the additional values.
Ques 43. What are the different datatypes in Tableau?
Ans: Basically, there are 4 Datatypes in Tableau:
- 1. String: Strings in Tableau are written in single quotes and contain any sequence of zero or more characters. For example: ‘Me’.
2. Number: These are integers or floating points. For example: 2, 5.8.
3. Boolean: These are logical values. For example: True, False.
4. Date & Datetime: This is a datatype that represents date and time.
Ques 44. What is disaggregation and aggregation of data?
Ans: Disaggregation can be defined as the function that breaks a key figure down from the aggregated level to the detailed level. Aggregation, on the other hand, can be defined as the function that sums up the detailed level and shows the result at the aggregated level.
Ques 45. Mention what are different Tableau files?
Ans: Various file types in Tableau are:
- 1. Workbooks: Workbooks contain one or more worksheets and dashboards.
2. Bookmarks: Bookmarks consist of a single worksheet that is used for sharing own work.
3. Packaged Workbooks: These files contain a workbook along with the supporting local file data and background images.
4. Data Extraction Files: These files are a local copy of a subset or entire data source.
5. Data Connection Files: It’s a small XML file that has various connection information.
Ques 46. What is a story in Tableau?
Ans: A story is a sheet that contains a sequence of worksheets or dashboards that work together to convey information. Stories can be created to show how facts are connected, provide context, and demonstrate how decisions relate to outcomes. Every sheet in a story is called a story point.
Ques 47. What is the maximum no. of rows Tableau can utilize at one time?
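Ans: Tableau does not impose a fixed limit on the number of rows of data it can work with, because it retrieves only the rows and columns needed for the view. The figure of 16 that is often quoted refers to the maximum number of row or column header levels that can be displayed in a table.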
Ques 48. What is the difference between discrete and continuous in Tableau?
Ans: Difference between Discrete and Continuous in Tableau
Discrete | Continuous |
Individually separate and distinct. | Forming an unbroken whole, without interruption. |
Discrete fields draw headers | Continuous fields draw axes |
Discrete fields can be sorted. | Continuous fields cannot be sorted. |
Ques 49. What do you understand by blended axis?
Ans: In Tableau, the Measures can share a single axis and using that all the marks are shown in one single pane. Now, in order to compare multiple measures in a single view, there can be three ways:
- Creating individual axes for each measure
- Blend two measures to share the axis.
- Add dual axes where there are two independent axes layered in the same pane.
Now, if the measures are blended instead of adding a row or column to the view, there is a single row or column and the values for each measure are shown along one continuous axis.
Ques 50. What is TDE file?
Ans: TDE stands for Tableau Data Extract, which is a file format used for compressed data sources. This file format provides better performance and can package data drawn from files, databases, and online sources.
Ques 51. What is Row-Level Security?
Ans: Row-Level Security restricts the rows of data that a user can see, on the basis of filters customized to who the user is. It can be configured according to the tools being used by the users.
Did you find these questions helpful? Comment below and let us know if you have more questions and we will provide the best answers to those in the most comprehensible manner possible. Stay tuned to blog.Simpliv.com for more updates and latest Data Science related blogs, interview questions and attractive infographics.