Question 1

How can you define logistic regression? Also, give an example where logistic regression has been used by you recently.

Accepted Answer

Logistic regression, also called the logit model, is a technique for predicting a binary outcome from a linear combination of predictor variables.
For example, you might want to predict whether a particular political candidate will win an upcoming election. The outcome of the prediction is binary: either 0 or 1, that is, either lose or win.
The predictor variables here could be the amount of money spent on the candidate's election campaign, the amount of time spent campaigning, and other factors.
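
The election example can be sketched in a few lines of Python. This is only an illustration: the intercept, the weights, and the two predictors (spending in millions and hours of campaigning) are made-up values, not coefficients fitted to any real data.

```python
import math

def predict_win_probability(spend_millions, campaign_hours,
                            intercept=-4.0, w_spend=0.5, w_hours=0.002):
    """Return P(win) from a linear combination of predictors (hypothetical weights)."""
    # Linear combination of the predictor variables; this is the log-odds (logit).
    z = intercept + w_spend * spend_millions + w_hours * campaign_hours
    # The logistic function maps the log-odds to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def predict_win(spend_millions, campaign_hours, threshold=0.5):
    """Turn the probability into the binary outcome: 1 = win, 0 = lose."""
    probability = predict_win_probability(spend_millions, campaign_hours)
    return 1 if probability >= threshold else 0

print(predict_win(spend_millions=10, campaign_hours=1000))
```

In practice the weights would be estimated from historical data rather than chosen by hand.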

Question 2

Explain Recommender Systems.

Accepted Answer

A recommender system is a subclass of information filtering systems that predicts the preference or rating a user would give to a product.
Recommender systems are widely used for products, movies, social tags, research articles, music, news, and many other things.

Question 3

What makes data cleaning have an essential role to play in the analysis?

Accepted Answer

Data is accumulated from many sources, and as the amount of data grows, so does the time needed to clean it. Data scientists find it extremely hard to work with a large amount of data that has not been formatted and transformed properly.
This time-consuming process is still a critical part of the analysis: cleaning data that comes from multiple sources can take up more than 80% of the total time.

Question 4

What are the differences among Univariate, Bivariate and Multivariate analysis?

Accepted Answer

These terms refer to descriptive statistical analysis techniques that are distinguished by the number of variables involved at a given point in time.
A pie chart of sales by territory involves just one variable, so it is an example of Univariate analysis.
An analysis that looks at the relationship between two variables at a time, for example with a scatterplot, is a Bivariate analysis.
Analyzing the volume of sales against spending is another example of Bivariate analysis.
Multivariate analysis deals with three or more variables in order to understand the effect of the variables on the responses.

Question 5

Define Linear Regression?

Accepted Answer

Linear regression is a basic statistical technique in which the score of a variable Y is predicted from the score of a second variable X. X is called the predictor variable and Y is called the criterion variable.
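
As a rough sketch of how Y can be predicted from X, the snippet below fits a straight line by ordinary least squares. The function name `fit_line` and the sample numbers are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (single predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4]   # predictor variable X (made-up values)
ys = [3, 5, 7, 9]   # criterion variable Y (made-up values)
slope, intercept = fit_line(xs, ys)
predicted = slope * 5 + intercept   # predict Y for a new X of 5
print(slope, intercept, predicted)
```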

Question 6

Give the definition of Interpolation and Extrapolation?

Accepted Answer

Estimating a value that lies between two known values in a list of values is called Interpolation. Approximating a value by extending a set of known values beyond their range is called Extrapolation.
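
A minimal sketch of the difference, assuming the simplest case of a straight line through two known points (real data may call for other methods). The helper `linear_estimate` is a name chosen for this example.

```python
def linear_estimate(x, x0, y0, x1, y1):
    """Estimate y at x on the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)

# Known points: (2, 20) and (3, 30).
inside = linear_estimate(2.5, 2, 20, 3, 30)   # interpolation: x lies between the known points
outside = linear_estimate(5, 2, 20, 3, 30)    # extrapolation: x lies beyond the known points
print(inside, outside)
```

The same formula is used in both cases; the only difference is whether the point being estimated falls inside or outside the range of known values.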

Question 7

Define power analysis?

Accepted Answer

Power analysis is an experimental design technique used to determine the effect of a given sample size, for example how large a sample needs to be in order to detect an effect of a given size.

Question 8

Define Collaborative filtering?

Accepted Answer

Collaborative filtering is the filtering process used by recommender systems to find information and patterns by combining viewpoints, multiple data sources, and multiple agents.

Question 9

Differentiate between Cluster and Systematic Sampling?

Accepted Answer

Cluster sampling is used when the target population is spread over a wide area and simple random sampling is difficult to apply.
It is a probability sampling method in which each sampling unit is a collection, or cluster, of elements. Systematic sampling is a statistical technique in which elements are selected from an ordered, known sampling frame.
In systematic sampling, the list is traversed in a circular manner, so that when you reach the end of the list you continue again from the top. The equal-probability method is a common example.
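
The two techniques can be sketched as follows. This is a simplified illustration; the function names, the fixed step size, and the example clusters are assumptions, not a standard library API.

```python
import random

def systematic_sample(frame, sample_size, start=None):
    """Pick every k-th element from an ordered frame, wrapping around at the end."""
    step = len(frame) // sample_size          # fixed sampling interval k
    if start is None:
        start = random.randrange(len(frame))  # random starting position
    # The modulo makes the traversal circular: past the end, continue from the top.
    return [frame[(start + i * step) % len(frame)] for i in range(sample_size)]

def cluster_sample(clusters, num_clusters):
    """Pick whole clusters at random and keep every element inside them."""
    chosen = random.sample(sorted(clusters), num_clusters)
    return {name: clusters[name] for name in chosen}

frame = list(range(10))
print(systematic_sample(frame, 5, start=7))

regions = {"north": [1, 2], "south": [3], "east": [4, 5]}  # hypothetical clusters
print(cluster_sample(regions, 2))
```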

Question 10

Can you tell if the expected value and mean value are different?

Accepted Answer

No, the two values are not different, but the terms are used in different contexts. Mean is generally used when talking about a probability distribution or a sample population, whereas expected value is generally used in the context of a random variable.
For Sampling Data
The mean value is the only value that comes from the sampling data.
The expected value is the mean of all the means, that is, the value built from multiple samples. The expected value is the population mean.
For Distributions
The mean value and the expected value are the same for any distribution, provided the distribution is in the same population.

Question 11

Can you tell if gradient descent methods are always converging to the same point?

Accepted Answer

No, gradient descent methods do not always converge to the same point. In some cases they reach a local optimum (a local minimum) and never get to the global optimum. Where they end up depends on the data and the starting conditions.
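
The dependence on starting conditions can be seen in a small sketch. The function f(x) = x^4 - 3x^2 + x is an arbitrary example picked because it has two separate minima; the learning rate and step count are also assumed values.

```python
def gradient(x):
    """Derivative of f(x) = x**4 - 3*x**2 + x, a curve with two separate minima."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(start, learning_rate=0.01, steps=1000):
    """Plain gradient descent on f, starting from the given point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)   # move against the slope
    return x

left = gradient_descent(-2.0)    # settles in the deeper (global) minimum
right = gradient_descent(2.0)    # settles in the shallower (local) minimum
print(left, right)
```

Both runs use the same function and the same settings; only the starting point differs, and they stop at different minima.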

Question 12

Differentiate between Supervised Learning and Unsupervised Learning?

Accepted Answer

Supervised Learning is when an algorithm learns from training data and the knowledge it gains is then applied to test data. For example, Classification.

Unsupervised Learning is when the algorithm has no response variable (labeled training data) to learn from and has to find structure in the data itself. For example, Clustering.

Question 13

Define Eigenvalue and Eigenvector?

Accepted Answer

Eigenvectors are used to understand linear transformations. In data analysis, the eigenvectors are usually calculated for a correlation matrix or covariance matrix.
Eigenvectors are the directions along which a particular linear transformation acts by stretching, flipping, or compressing.
The Eigenvalue is the strength of the transformation in the direction of the eigenvector, that is, the factor by which the stretching or compression occurs.
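
A small check of this definition: multiplying a matrix by one of its eigenvectors only scales the vector by the eigenvalue. The 2x2 matrix below is an arbitrary example, and `mat_vec` is a helper written for this sketch.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

A = [[2, 1],
     [1, 2]]       # example symmetric matrix
v1 = [1, 1]        # eigenvector of A with eigenvalue 3 (stretched by 3)
v2 = [1, -1]       # eigenvector of A with eigenvalue 1 (left unchanged)

print(mat_vec(A, v1))
print(mat_vec(A, v2))
```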

Question 14

In how many ways can you treat outlier values?

Accepted Answer

Outlier values can be identified using univariate or other graphical analysis methods. If there are only a few outliers, they can be assessed individually.
If there are a large number of outliers, the values can be substituted with either the 99th or the 1st percentile values. Not all extreme values are outlier values. Outlier values can be treated in the following ways:
1) The value can be changed to bring it within range.
2) The value can be removed.
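
Substituting with the 1st and 99th percentile values can be sketched as below. This assumes the simple nearest-rank definition of a percentile with whole-number percentages; statistical packages often use interpolated definitions that give slightly different cut-offs.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list (pct is a whole number 0..100)."""
    # Integer ceiling of pct * n / 100, kept at 1 or above.
    rank = max(1, (pct * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]

def cap_outliers(values, low_pct=1, high_pct=99):
    """Bring values outside the [low_pct, high_pct] percentile range back to the boundary."""
    ordered = sorted(values)
    low = percentile(ordered, low_pct)
    high = percentile(ordered, high_pct)
    return [min(max(v, low), high) for v in values]

values = list(range(1, 201))
values[0] = -500     # made-up extreme low value
values[-1] = 1000    # made-up extreme high value
capped = cap_outliers(values)
print(min(capped), max(capped))
```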

Question 15

Explain the steps that are involved in an analytics project?

Accepted Answer

•	Understand the business problem.
•	Explore the data and become familiar with it.
•	Prepare the data for modeling by detecting outliers, treating missing values, transforming variables, and so on.
•	After data preparation, run the model, analyze the result, and tweak the approach. This step is iterated until the best possible result is achieved.
•	Validate the model using a new data set.
•	Implement the model and track the results to analyze its performance over a period of time.

Question 16

Explain the ways to iterate over a list and retrieve element indices at the same time.

Accepted Answer

Use the enumerate function. It takes every element in a sequence, such as a list, and pairs it with its index, placing the index before the element.
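
For example (the list contents are just sample data):

```python
fruits = ["apple", "banana", "cherry"]

# enumerate yields (index, element) pairs, with the index first.
for index, fruit in enumerate(fruits):
    print(index, fruit)

pairs = list(enumerate(fruits))
# The optional start argument changes the first index.
numbered = list(enumerate(fruits, start=1))
```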

Question 17

How can you treat missing values during analysis?

Accepted Answer

First, identify the variables that have missing values and work out the extent of the missing values. If any patterns are identified, the analyst should concentrate on them, as they can lead to interesting and meaningful business insights.
If no patterns are identified, the missing values can be substituted with the mean or median value (imputation), or they can simply be ignored. Consider the following factors:
•	Understand the problem statement and the data before giving an answer. A default value such as the mean, minimum, or maximum can be assigned, but getting into the data is important.
•	If the variable is categorical, the missing value is assigned a default value.
•	If the data follows a normal distribution, assign the mean value.
•	Decide whether missing values should be treated at all. If 80% of the values for a variable are missing, drop the variable instead of treating the missing values.
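
A minimal sketch of median imputation combined with the 80% rule mentioned above. The function name `impute_or_drop`, the use of `None` to mark missing values, and the choice of the median are assumptions made for this example.

```python
from statistics import median

def impute_or_drop(column, max_missing_ratio=0.8):
    """Fill missing values (None) with the median, or return None to signal 'drop the variable'."""
    present = [v for v in column if v is not None]
    missing = len(column) - len(present)
    # Too much of the variable is missing: dropping it is safer than imputing.
    if missing >= max_missing_ratio * len(column):
        return None
    fill = median(present)
    return [fill if v is None else v for v in column]

print(impute_or_drop([1, None, 3, None, 5]))
print(impute_or_drop([None] * 9 + [7]))
```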

Question 18

What is box-cox transformation in regression models?

Accepted Answer

The response variable in a regression analysis might not satisfy one or more assumptions of ordinary least squares regression. In that case the residuals can either curve as the prediction increases or follow a skewed distribution.
In such scenarios, it is necessary to transform the response variable so that the data meets the required assumptions.
A Box-Cox transformation is a statistical technique for transforming non-normal dependent variables into a normal shape. Most statistical techniques assume normality, so when the given data is not normal, applying a Box-Cox transformation lets you run a broader range of tests.
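
The one-parameter Box-Cox formula can be written as a short function. In this sketch the parameter `lam` (lambda) is passed in by hand; in practice it is normally estimated from the data, which this example does not attempt.

```python
import math

def box_cox(y, lam):
    """One-parameter Box-Cox transform of a single positive value."""
    if y <= 0:
        raise ValueError("Box-Cox is only defined for positive values")
    if lam == 0:
        return math.log(y)            # the limiting case as lambda approaches 0
    return (y ** lam - 1) / lam

skewed = [1, 2, 4, 8, 16, 32]         # made-up right-skewed data
transformed = [box_cox(y, 0) for y in skewed]
print(transformed)
```

With lambda set to 0 the transform is simply the natural logarithm, and with lambda set to 1 the data is only shifted, not reshaped.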

Question 19

Differentiate between Bayesian Estimate and Maximum Likelihood Estimation (MLE)?

Accepted Answer

In Bayesian estimation, we already have some knowledge about the data or the problem (the prior). Several values of the parameters may explain the data, so we can look for multiple parameters, like 5 gammas and 5 lambdas, that do this.
As a result, Bayesian estimation gives us multiple models for making multiple predictions, one for each pair of parameters but with the same prior. To predict a new example, you compute the weighted sum of these predictions.
Maximum likelihood does not take the prior into consideration, so it is like being Bayesian while using a flat prior.
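
The last point can be illustrated with a coin-flip example. This sketch assumes a Beta prior on the probability of heads and compares the maximum likelihood estimate with the most probable value under the posterior; it does not show the weighted sum over multiple models described above. The function names and numbers are chosen for illustration.

```python
def mle(successes, trials):
    """Maximum likelihood estimate of a success probability."""
    return successes / trials

def posterior_mode(successes, trials, alpha=1.0, beta=1.0):
    """Most probable value under a Beta(alpha, beta) prior (assumes alpha, beta >= 1)."""
    return (successes + alpha - 1) / (trials + alpha + beta - 2)

heads, flips = 7, 10
print(mle(heads, flips))                     # ignores any prior
print(posterior_mode(heads, flips))          # flat prior Beta(1, 1): same as the MLE
print(posterior_mode(heads, flips, 5, 5))    # prior belief in a fair coin pulls the estimate toward 0.5
```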

Question 20

Differentiate between skewed and uniform distribution?

Accepted Answer

Uniform distribution means the observations are spread equally across the range, without any clear peaks. Distributions that have more observations on one side of the graph are known as skewed distributions.
Distributions with fewer observations towards the left (lower values) are said to be skewed left. Distributions with fewer observations towards the right (higher values) are said to be skewed right.

Question 21

Explain the importance of selection bias?

Accepted Answer

Selection bias occurs when proper randomization is not achieved while selecting the individuals, groups, or data to be analyzed.
Selection bias means that the sample obtained may differ from the sample that was intended for analysis. Its types are Sampling Bias, Data, Attribute, and Time Interval.

Question 22

Give the basic assumptions made for linear regression?

Accepted Answer

The basic assumptions are normality of the error distribution, linearity and additivity, and statistical independence of the errors.

Question 23

What are the steps to find out the correlation between a categorical variable and a continuous variable?

Accepted Answer

The analysis of covariance (ANCOVA) technique is the best way to find the association between a categorical variable and a continuous variable.

Question 24

How different is the mean value from the expected value?

Accepted Answer

The two are essentially the same, although the terms are used in different contexts. Expected value usually refers to a random variable, while mean value is usually used in the context of a probability distribution or a sample population.

Question 25

Why is it essential to clean a data set?

Accepted Answer

Cleaning puts the data into a format that data scientists can work with easily. Unclean data sets can lead to biased information that affects business decisions. Data scientists spend almost 80% of their time cleaning data.

Question 26

Name the steps that are involved in analytics projects?

Accepted Answer

Every analytics project goes through the following steps:
•	First, the business problem is understood.
•	Then data exploration is considered and conducted.
•	After that, the data is prepared for modeling.
•	The model is run before the results are analyzed.
•	The model is then validated using completely new data sets.
•	The model is implemented and the results are tracked for a given time period.

Question 27

Explain the way to deal with various forms of seasonality in time series modeling?

Accepted Answer

•	Time series that show a repeated pattern over time have seasonality. For example, stationery sales go down during the holiday season while air conditioner sales go up during the summer.
•	Seasonality makes a time series non-stationary because the average value of the variable differs across time periods. You can remove seasonality by differencing the series. Seasonal differencing is the numerical difference between a particular value and a value with a periodic lag (e.g., a lag of 12 if monthly seasonality is present).
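
Seasonal differencing as described above can be sketched in one line of logic. The quarterly sales figures below are made up, and a lag of 4 is used because the sample data repeats every four periods.

```python
def seasonal_difference(series, lag=12):
    """Subtract from each value the value observed one season (lag periods) earlier."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# Two years of made-up quarterly sales: the same seasonal pattern, shifted up by 2.
quarterly_sales = [10, 20, 30, 40, 12, 22, 32, 42]
differenced = seasonal_difference(quarterly_sales, lag=4)
print(differenced)
```

The repeating pattern disappears from the differenced series and only the year-over-year change remains. The result is shorter than the input by `lag` values.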

Question 28

Define the ways to assess a good logistic model?

Accepted Answer

Here are a few ways to assess the results of a logistic regression analysis:
•	The Classification Matrix lets you look at the true negatives and false positives.
•	Concordance helps identify the ability of the logistic model to differentiate between an event taking place and not taking place.
•	Lift helps assess the logistic model by comparing it with random selection.
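
The first two checks can be sketched as follows (lift is not covered here). The labels and predicted probabilities are made up, the 0.5 cut-off is an assumption, and concordance is computed in its simplest form: the share of event/non-event pairs in which the event received the higher predicted probability.

```python
def classification_matrix(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1
        elif a == 0 and p == 1:
            counts["FP"] += 1
        elif a == 0 and p == 0:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts

def concordance(actual, probabilities):
    """Share of (event, non-event) pairs where the event has the higher probability."""
    events = [p for a, p in zip(actual, probabilities) if a == 1]
    non_events = [p for a, p in zip(actual, probabilities) if a == 0]
    concordant = sum(1 for e in events for n in non_events if e > n)
    return concordant / (len(events) * len(non_events))

actual = [1, 0, 1, 0, 1]                        # made-up outcomes
probabilities = [0.9, 0.2, 0.6, 0.7, 0.4]       # made-up model outputs
predicted = [1 if p >= 0.5 else 0 for p in probabilities]

matrix = classification_matrix(actual, predicted)
score = concordance(actual, probabilities)
print(matrix, score)
```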
