# List of Data Scientist Interview Questions

**Data Scientists** are analytical experts who specialise in solving complex problems and have the curiosity to investigate what those problems are and how to solve them.

Here, we are sharing some **Data Scientist interview questions** with you. Please check them out and let us know what else we can add.

**Data Scientist Interview Questions**

**Q1) How would you define logistic regression? Also, give an example of where you have used logistic regression recently.**

Logistic regression, also known as the logit model, is a technique for predicting a binary outcome from a linear combination of predictor variables.

For example, you might want to predict whether a particular political leader will win an upcoming election. Here the predicted outcome is binary (0 or 1): win or lose.

The predictor variables could include the total amount of money spent on the candidate's election campaign, the time spent campaigning, and other such factors.
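
The idea can be sketched in a few lines of Python. The campaign figures below are entirely hypothetical, and the model is fitted with plain gradient descent rather than a library routine:

```python
import numpy as np

# Hypothetical campaign data: columns = money spent, hours campaigning
X = np.array([[1.0, 2.0], [2.0, 1.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 1, 1])  # 0 = lose, 1 = win

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit the weights of the linear combination with plain gradient descent
w, b = np.zeros(2), 0.0
for _ in range(5000):
    p = sigmoid(X @ w + b)               # predicted win probability
    w -= 0.1 * X.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)

preds = (sigmoid(X @ w + b) >= 0.5).astype(int)  # binary win/lose outcome
```

The sigmoid squashes the linear combination into a probability, which is then thresholded at 0.5 to produce the binary prediction.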

**Q2) Explain Recommender Systems.**

A recommender system, in plain language, is a subclass of information-filtering system that predicts a user's preferences and the rating the user would give to a product.

Recommender systems are most commonly used to suggest products, movies, music, news, research articles, social tags and countless other items.

**Q3) Why does data cleaning play an essential role in analysis?**

Data is accumulated from innumerable sources, and as the amount of data grows, so does the time needed to clean it. Data scientists find it extremely hard to work with such large volumes of data when they are not properly formatted and transformed.

Although time-consuming, data cleaning is a critical and essential part of analysis: cleaning data that comes from multiple sources can take up more than 80% of the total analysis time.

**Q4) What are the differences among Univariate, Bivariate and Multivariate analysis?**

These terms refer to descriptive statistical techniques, which are distinguished by the number of variables involved at a given time.

A sales pie chart broken down by territory involves just a single variable, so it is an example of univariate analysis.

An analysis that examines the relationship between two variables at a time, such as a scatterplot, is called a bivariate analysis.

For example, analyzing sales volume against spending is a bivariate analysis.

Multivariate analysis deals with three or more variables in order to understand their combined effect on the responses.
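
A minimal numpy sketch of the univariate/bivariate distinction, using made-up spend and sales figures:

```python
import numpy as np

# Hypothetical data: marketing spend and resulting sales
spend = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
sales = np.array([15.0, 25.0, 33.0, 47.0, 55.0])

# Univariate: summarize a single variable on its own
mean_spend = spend.mean()
std_spend = spend.std()

# Bivariate: examine the relationship between two variables (correlation)
r = np.corrcoef(spend, sales)[0, 1]
```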

**Q5) Define Linear Regression?**

Linear regression is a basic statistical technique in which the score of one variable, Y, is predicted from the score of a second variable, X. X is called the predictor variable and Y the criterion variable.
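
As an illustration, a least-squares line can be fitted with numpy's `polyfit`. The X and Y values here are made up so that the relationship Y = 2X + 1 holds exactly:

```python
import numpy as np

# Hypothetical predictor X and criterion Y following Y = 2X + 1 exactly
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X + 1.0

slope, intercept = np.polyfit(X, Y, 1)   # least-squares fit of degree 1
Y_pred = slope * 3.5 + intercept          # predict Y for a new X score
```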

**Q6) Give the definition of Interpolation and Extrapolation?**

Estimating a value that lies between two known values in a given sequence is called interpolation. Estimating a value by extending a set of known values beyond their range is called extrapolation.
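
A small numpy sketch of both ideas. Note that `np.interp` only interpolates (it clamps outside the known range), so the extrapolation here is done by fitting a line and evaluating it beyond the data:

```python
import numpy as np

x_known = np.array([1.0, 2.0, 3.0])
y_known = np.array([10.0, 20.0, 30.0])

# Interpolation: estimate a value inside the known range
y_interp = np.interp(2.5, x_known, y_known)   # between 20 and 30

# Extrapolation: fit a line and evaluate it beyond the known x values
coeffs = np.polyfit(x_known, y_known, 1)
y_extrap = np.polyval(coeffs, 4.0)            # beyond the known range
```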

**Q7) Define power analysis?**

Power analysis is an experimental-design technique used to determine the effect a given sample size can detect, and most commonly the sample size required to detect an effect of a given size with a given level of confidence.
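
One way to see what power analysis estimates is to simulate it. The sketch below (numpy only, with hypothetical effect size 0.5) approximates the power of a two-sample z-test by counting how often it detects a true difference; at 64 observations per group the classic result is roughly 80% power:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimated_power(n, effect_size=0.5, crit_z=1.96, sims=2000):
    """Fraction of simulated two-sample tests that detect the true effect."""
    hits = 0
    for _ in range(sims):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(effect_size, 1.0, n)
        z = (b.mean() - a.mean()) / np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
        hits += abs(z) > crit_z
    return hits / sims

power_at_64 = estimated_power(64)   # n = 64 per group gives roughly 0.80 power
```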

**Q8) Define Collaborative filtering?**

Collaborative filtering is the filtering process that recommender systems use to find patterns and information by collaborating viewpoints, multiple data sources and several agents.
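
A tiny user-based sketch of the idea, with a hypothetical ratings matrix: a user's missing rating is predicted from the ratings of similar users, weighted by cosine similarity. (Real systems handle unrated entries more carefully than this simplification does.)

```python
import numpy as np

# Hypothetical ratings matrix: rows = users, columns = items (0 = unrated)
R = np.array([[5.0, 4.0, 0.0],
              [4.0, 5.0, 3.0],
              [1.0, 2.0, 5.0]])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Predict user 0's rating for item 2 from the users who did rate it
target_user, target_item = 0, 2
sims = np.array([cosine(R[target_user], R[u]) for u in range(len(R))])
raters = [u for u in range(len(R)) if u != target_user and R[u, target_item] > 0]
pred = (sum(sims[u] * R[u, target_item] for u in raters)
        / sum(sims[u] for u in raters))
```

User 1 rates like user 0 and so gets more weight, pulling the prediction toward user 1's rating of 3.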

**Q9) Differentiate between Cluster and Systematic Sampling?**

Cluster sampling is applied when the target population is spread over a huge area and it is difficult to apply the simple random sampling method.

It is a probability-sampling method in which each sampling unit is a collection, or cluster, of elements. Systematic sampling, by contrast, is a statistical technique in which elements are selected at regular intervals from an ordered sampling frame.

In systematic sampling, the list is progressed in a circular manner: once you reach the end of the list, you continue again from the top. The equal-probability method is a common example.
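
The contrast can be shown in a few lines of numpy, using a made-up frame of 100 units:

```python
import numpy as np

rng = np.random.default_rng(1)
frame = np.arange(100)                     # ordered sampling frame of 100 units

# Systematic sampling: every k-th element from a random start
k = 10
start = rng.integers(0, k)
systematic = frame[start::k]               # exactly 100 / k = 10 units

# Cluster sampling: partition into clusters, then sample whole clusters
clusters = frame.reshape(10, 10)           # 10 clusters of 10 units each
chosen = rng.choice(10, size=3, replace=False)
cluster_sample = clusters[chosen].ravel()  # every unit in the chosen clusters
```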

**Q10) Can you tell if the expected value and mean value are different?**

No, the two values are not different, although they are used in different contexts. Mean value refers to a probability distribution, for example that of a sample population, while expected value refers to a random variable.

**Sampling Data**

For sampled data, the mean is the only value available: the average of that one sample.

The expected value is the mean of means, i.e. the value obtained across multiple samples; it equals the population mean.

**For Distributions**

For a distribution, there is no difference between the mean value and the expected value when both refer to the same population.

**Q11) Can you tell if gradient descent methods are always converging to the same point?**

No. Depending on the data and the starting conditions, gradient descent may reach only a local optimum (a local minimum) and never get to the global optimal point.
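
This is easy to demonstrate on a made-up function with two local minima: starting on different sides of the hump, plain gradient descent settles into different basins.

```python
def grad(x):
    # Derivative of f(x) = (x**2 - 1)**2 + 0.3*x, which has two local minima
    return 4.0 * x * (x * x - 1.0) + 0.3

def gradient_descent(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = gradient_descent(-1.0)    # settles in the left basin (deeper minimum)
right = gradient_descent(1.0)    # settles in the right, shallower basin
```

Both runs converge (the gradient at each endpoint is essentially zero), yet they stop at different points.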

**Q12) Differentiate between Supervised Learning and Unsupervised Learning?**

Supervised learning is when an algorithm learns from labelled training data and applies that knowledge to new test data. Classification is an example.

Unsupervised learning is when no labelled training data or response variable is available, so the algorithm must find structure in the data on its own. Clustering is an example.
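
A compact from-scratch sketch of both, on hypothetical 1-D data: nearest-neighbour classification uses the labels (supervised), while a bare-bones 2-means loop recovers the groups without ever seeing them (unsupervised).

```python
import numpy as np

# Two hypothetical groups of 1-D points
data = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
labels = np.array([0, 0, 0, 1, 1, 1])      # known only in the supervised case

# Supervised (classification): use the labelled data to label a new point
def nearest_neighbor(x):
    return labels[np.argmin(np.abs(data - x))]

# Unsupervised (clustering): recover the groups without any labels
centers = np.array([0.0, 6.0])             # rough initial guesses
for _ in range(10):                        # plain 2-means iterations
    assign = np.argmin(np.abs(data[:, None] - centers[None, :]), axis=1)
    centers = np.array([data[assign == c].mean() for c in (0, 1)])
```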

**Q13) Define Eigenvalue and Eigenvector?**

Eigenvectors are used to understand linear transformations. In data analysis, they are usually calculated for a correlation matrix or a covariance matrix.

Eigenvectors are the directions along which a particular linear transformation acts, whether by stretching, flipping or compressing.

The eigenvalue is the strength of the transformation in the direction of its eigenvector, i.e. the factor by which the stretching or compression occurs.
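
For a concrete picture, take a simple diagonal transformation: applying it to an eigenvector only rescales the vector by the corresponding eigenvalue.

```python
import numpy as np

# A transformation that stretches by 3 along x and by 2 along y
A = np.array([[3.0, 0.0],
              [0.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# An eigenvector's direction is preserved; it is only scaled by its eigenvalue
v = eigenvectors[:, 0]
scaled = A @ v                  # equals eigenvalues[0] * v
```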

**Q14) In how many ways can you treat outlier values?**

Outlier values can be identified using univariate or other graphical analysis methods. A small number of outliers can be assessed individually.

When there are many outliers, the values are substituted with the 99th- or 1st-percentile values. Note that not all extreme values are outliers. Outlier values can be treated in the following ways:

1) The value can be changed to bring it within the expected range.

2) The value can also be removed.
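
Both treatments can be sketched in numpy on a synthetic sample with two planted outliers: capping (winsorizing) at the 1st/99th percentiles, or removing values outside that range.

```python
import numpy as np

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(50, 5, 1000), [500.0, -300.0]])  # 2 outliers

low, high = np.percentile(data, [1, 99])

# Treatment 1: pull extreme values back inside the 1st-99th percentile range
capped = np.clip(data, low, high)

# Treatment 2: remove the extreme values entirely
kept = data[(data >= low) & (data <= high)]
```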

**Q15) Explain the steps that are involved in an analytics project?**

- The business problem needs to be understood.
- It is important to be familiar with the data by exploring it.
- Detecting outliers, treating missing values and transforming variables are the basics of preparing the data for modelling.
- Once data preparation is done, start running the model, analyze the results and tweak the approach. This step is iterated until the best possible result is achieved.
- The new set of data can be used for validating the model.
- After implementing the model, analyze its performance over a period of time by tracking the results.

**Q16) Explain the ways to iterate over a list and retrieve element indices at the same time.**

Use the enumerate function: it iterates over every element in a sequence, such as a list, and yields each element together with its index.
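
For example:

```python
# enumerate yields (index, element) pairs while iterating over a list
colors = ["red", "green", "blue"]
pairs = [(i, c) for i, c in enumerate(colors)]

# An optional start argument shifts the first index
numbered = list(enumerate(colors, start=1))
```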

**Q17) How can you treat missing values during analysis?**

First, identify which variables have missing values and how many values are missing, then check for patterns in the missingness. An identified pattern can lead the analyst to interesting and meaningful business insights.

If no pattern is identified, the missing values are either ignored or substituted with the mean or median value (imputation). Consider the following factors:

- Understand the problem statement and the data before deciding on an approach; a default value such as the mean, minimum or maximum can then be assigned.
- For a categorical variable, missing values can be assigned a default category of their own.
- For normally distributed data, the mean value can be imputed.
- Decide whether the missing values should be treated at all: if more than 80% of a variable's values are missing, it is usually better to simply drop the variable.
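
These treatments can be sketched with pandas on a small made-up frame (mean imputation for a numeric column, a default category for a categorical one, and dropping mostly-missing variables):

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries
df = pd.DataFrame({
    "income": [40.0, 50.0, np.nan, 60.0],
    "city": ["NY", None, "LA", "NY"],
})

# Numeric column: impute with the mean (or the median for skewed data)
df["income"] = df["income"].fillna(df["income"].mean())

# Categorical column: assign a default "Missing" category
df["city"] = df["city"].fillna("Missing")

# Drop any variable that is mostly missing (e.g. over 80% NaN)
df = df.loc[:, df.isna().mean() <= 0.8]
```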

**Q18) What is the Box-Cox transformation in regression models?**

If the response variable in a regression analysis does not satisfy one or more assumptions of ordinary least squares regression, the residuals may curve as the predictions grow, or follow a skewed distribution.

In such scenarios it becomes necessary to transform the response variable so that the data meet the required assumptions.

The Box-Cox transformation is a statistical technique for transforming a non-normal dependent variable into an approximately normal shape. Since most statistical techniques assume normality, applying this transformation lets you run a wider range of tests on otherwise unusual data.
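
With scipy this is one call. The sketch below generates a deliberately right-skewed (exponential) sample and shows the skewness shrinking after the transformation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
skewed = rng.exponential(scale=2.0, size=1000)   # right-skewed, all positive

# boxcox fits the transformation parameter lambda and applies it
transformed, fitted_lambda = stats.boxcox(skewed)

skew_before = stats.skew(skewed)
skew_after = stats.skew(transformed)             # much closer to 0 (normal)
```

Note that `boxcox` requires strictly positive input data.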

**Q19)** **Differentiate between Bayesian Estimate and Maximum Likelihood Estimation (MLE)?**

In a Bayesian estimate, we bring in prior knowledge about the data or the problem (the prior). Because several parameter values may explain the data, we can look for multiple parameters, for example 5 gammas and 5 lambdas.

A Bayesian estimate gives us multiple models for making multiple predictions, one for each pair of parameters but with the same prior. To predict a new example, it is meaningful to compute the weighted sum of these predictions.

Maximum likelihood ignores the prior; it is equivalent to a Bayesian estimate that uses a flat prior.
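
A coin-flip example (hypothetical counts) makes the contrast concrete: the MLE is just the observed frequency, while a Bayesian estimate with a Beta prior shrinks it toward the prior belief.

```python
# Hypothetical data: 8 heads in 10 tosses
heads, n = 8, 10

# Maximum likelihood: the observed frequency, no prior involved
mle = heads / n

# Bayesian estimate with a Beta(2, 2) prior (mild belief the coin is fair):
# the posterior is Beta(heads + 2, n - heads + 2), whose mean pulls the
# estimate toward 0.5
posterior_mean = (heads + 2) / (n + 4)
```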

**Q20)** **Differentiate between skewed and uniform distribution?**

In a uniform distribution, the observations are spread equally across the range, with no clustering. A distribution that has more observations on one side of the graph than the other is a skewed distribution.

A distribution with fewer observations towards the left (the lower values) is skewed left, and a distribution with fewer observations towards the right (the higher values) is skewed right.

**Q21)** **Explain the importance of selection bias?**

Selection bias is mainly caused by a lack of appropriate randomization when selecting the individuals, groups or data to be analyzed.

Selection bias means that the sample obtained may not be representative of the sample intended for analysis. Its forms include sampling bias, time-interval bias, data bias and attribute bias.

**Q22)** **Give the basic assumptions made for linear regression?**

The basic assumptions are linearity and additivity, statistical independence of the errors, and normality of the error distribution.

**Q23) What are the steps to find out the correlation between a categorical variable and a continuous variable?**

Using the analysis of covariance (ANCOVA) technique is the best way to measure the association between a categorical variable and a continuous variable.
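
As a minimal illustration of the same idea, the closely related one-way ANOVA tests whether a continuous variable differs across the levels of a categorical one. The groups below are hypothetical and deliberately well separated:

```python
import numpy as np
from scipy import stats

# Hypothetical continuous outcome grouped by a 3-level categorical variable
group_a = np.array([10.0, 12.0, 11.0, 13.0])
group_b = np.array([20.0, 22.0, 21.0, 23.0])
group_c = np.array([30.0, 29.0, 31.0, 32.0])

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
# A small p-value indicates the categorical variable is associated
# with the continuous one
```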

**Q24) How different is a mean value from an expected value?**

They are similar, although they are used in different contexts. Expected value usually refers to a random variable, while mean value refers to a sample population's probability distribution.

**Q25) Why is it essential to clean a data set?**

Cleaning puts the data into a format that data scientists can work with easily. Unclean data sets lead to biased information that can affect business decisions. Data scientists spend almost 80% of their time cleaning data.

**Q26) Name the steps that are involved in analytics projects?**

Every analytics problem goes through the steps mentioned below:

- First, the business problem is understood.
- Then data exploration is considered and conducted.
- After that, the data is prepared for modeling.
- The model is run before the results are analyzed.
- The model is then validated using completely new data sets.
- The model is implemented and the results are tracked for a given time period.

**Q27) Explain the way to deal with various forms of seasonality in time series modeling?**

- A time series that shows a repeated pattern over time has seasonality. For example, stationery sales go down during the holiday season, while air-conditioner sales go up during the summer.
- Seasonality makes a time series non-stationary because the average value of the variable differs across time periods. You can remove seasonality by seasonal differencing: taking the numerical difference between a value and the value one seasonal period earlier (e.g. a lag of 12 for monthly data with yearly seasonality).
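
Seasonal differencing can be shown on a synthetic monthly series (a made-up linear trend plus a period-12 sine wave): subtracting the value from 12 months earlier removes the seasonal pattern entirely, leaving only the constant trend increment.

```python
import numpy as np

# Hypothetical monthly series: linear trend + yearly (period-12) seasonality
t = np.arange(48)
series = 0.5 * t + 10.0 * np.sin(2 * np.pi * t / 12)

# Seasonal differencing with lag 12 removes the repeating pattern
deseasonalized = series[12:] - series[:-12]   # constant 0.5 * 12 = 6
```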

**Q28) Define the ways to assess a good logistic model?**

Here are a few ways to assess the results of a logistic regression analysis:

- The classification matrix lets you look at the true negatives and false positives.
- Concordance measures how well the logistic model can differentiate between the event happening and the event not happening.
- Lift assesses the logistic model by comparing its results with those of random selection.
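
The classification-matrix idea can be sketched from scratch in numpy, using hypothetical actual outcomes and predicted probabilities:

```python
import numpy as np

# Hypothetical actual outcomes and model-predicted probabilities
actual = np.array([1, 0, 1, 1, 0, 0, 1, 0])
probs = np.array([0.9, 0.2, 0.8, 0.4, 0.6, 0.1, 0.7, 0.3])
pred = (probs >= 0.5).astype(int)

# Classification (confusion) matrix cells
tp = int(np.sum((pred == 1) & (actual == 1)))  # true positives
fp = int(np.sum((pred == 1) & (actual == 0)))  # false positives
tn = int(np.sum((pred == 0) & (actual == 0)))  # true negatives
fn = int(np.sum((pred == 0) & (actual == 1)))  # false negatives

accuracy = (tp + tn) / len(actual)
```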