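What is Machine Learning? Machine learning is a field of computer science in which algorithms learn patterns from data using statistical techniques and use those patterns to predict the output for new input data, without being explicitly programmed for the task.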
What are the different types of Machine Learning Algorithms? 1. Supervised Learning 2. Unsupervised Learning 3. Semi-Supervised Learning 4. Reinforcement Learning
What are bias and variance? Bias: the error introduced by the simplifying assumptions of the model, measured as the difference between the model's average prediction and the actual value. High bias leads to underfitting. Variance: the amount by which the model's predictions change when it is trained on different training sets. High variance leads to overfitting.
How can you achieve optimum bias and variance? 1. By minimizing the total error, which means balancing bias against variance 2. Using bagging and resampling techniques 3. Tuning the hyperparameters of the algorithm
What are the assumptions of Linear Regression? 1. The relationship between x and y is linear 2. The observations are independent of each other 3. The residuals are normally distributed 4. The residuals have equal variance
What is regularization? It is a machine learning technique that reduces overfitting by adding a penalty on model complexity to the loss function. There are three common types of regularization: Lasso Regression (L1 penalty), Ridge Regression (L2 penalty), and Elastic Net Regression (a combination of both).
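To make the effect of the penalty concrete, here is a minimal sketch of ridge regression for a single feature with no intercept. The function name ridge_slope, the data, and the penalty values are illustrative assumptions, not part of any library; the closed form follows from setting the derivative of sum((y - w*x)^2) + penalty * w^2 to zero.

```python
def ridge_slope(xs, ys, penalty):
    """Slope w that minimizes sum((y - w*x)^2) + penalty * w^2 (no intercept)."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # A larger penalty grows the denominator and shrinks the slope toward zero.
    return sum_xy / (sum_xx + penalty)


# Illustrative data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unregularized = ridge_slope(xs, ys, penalty=0.0)
regularized = ridge_slope(xs, ys, penalty=14.0)
print(unregularized, regularized)
```

With no penalty the slope matches the data exactly; with the penalty the coefficient is pulled toward zero, which is the trade-off regularization makes to reduce overfitting.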
What is gradient descent? This is a first-order iterative optimization algorithm that is used to find a local minimum of a differentiable function. In each iteration, the next step is taken in the direction opposite to the gradient, i.e. the direction of steepest descent.
What is a learning rate? Learning rate is a hyperparameter used in gradient-based optimization (for example, when training neural networks) which determines how much the weights change at each step. This is also called step size.
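The update rule and the role of the learning rate can be sketched in a few lines. This is a minimal illustration on a one-dimensional function; the objective f(x) = (x - 3)^2, the starting point, and the default values are assumptions chosen only for the example.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


# Illustrative objective: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
def grad_f(x):
    return 2 * (x - 3)


minimum = gradient_descent(grad_f, start=0.0)
diverged = gradient_descent(grad_f, start=0.0, learning_rate=1.1, steps=50)
print(minimum, diverged)
```

With a learning rate of 0.1 the iterate settles near the minimum at x = 3. With a learning rate of 1.1 each step overshoots by more than the previous one, so the iterate moves further and further away, which is why the step size has to be tuned.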
What is the use of MinMaxScaler from sklearn.preprocessing? MinMaxScaler is a normalization transformer that rescales the values of each feature to a given range, 0 to 1 by default. The formula for this can be written as, scaled_value = (value - min_value) / (max_value - min_value)
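The formula can be checked by hand with a small pure-Python sketch. This is not the sklearn implementation; the function name min_max_scale is hypothetical, and mapping a constant column to 0.0 is an assumption made here to avoid dividing by zero.

```python
def min_max_scale(values):
    """Rescale values to the range 0 to 1 using (value - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no spread, so map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([10, 20, 25, 50])
print(scaled)  # [0.0, 0.25, 0.375, 1.0]
```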
How does the z-score method work? The z-score is available as the zscore function in the scipy.stats library. This is a standardization technique in which each value is converted to its standard score, i.e. the number of standard deviations it lies from the mean. The standard score of each data point can be calculated as, standard_score = (value - mean) / standard_deviation
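As a rough illustration of the formula, the sketch below computes standard scores with the standard library only. The function name z_scores is hypothetical. It uses the population standard deviation, which, to the best of my knowledge, matches the default behaviour of scipy.stats.zscore, and it assumes the values are not all identical.

```python
import statistics


def z_scores(values):
    """Convert each value to (value - mean) / standard_deviation."""
    mean = statistics.fmean(values)
    # Population standard deviation (divides by n, not n - 1).
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Illustrative data with mean 5 and population standard deviation 2.
scores = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
print(scores)
```

After standardization the scores have mean 0 and standard deviation 1, so a score of 2.0 means the original value was two standard deviations above the mean.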
What is a Convolutional Neural Network(CNN)? CNN is a type of Artificial Neural Network that is mainly used for image processing. It is specifically designed to process pixel data, using convolutional layers to detect local patterns, and it can be used to perform both descriptive and predictive tasks.
What is a Recurrent Neural Network(RNN)? RNN is a type of Artificial Neural Network where the connections between the nodes form a directed graph along a temporal sequence, so the output of one step is fed back as input to the next. This type of neural network is mainly used for sequential data such as text.
What are the assumptions of Linear regression? There are 4 assumptions that are associated with the linear regression model, 1. Linearity- The relationship between X and Y is linear 2. Independence- The observations in the dataset are independent of each other 3. Normality- The residuals are normally distributed 4. Homoscedasticity- The variance of the residuals is the same for any value of X
What is multicollinearity? Why is it considered a problem? Multicollinearity exists when an independent variable is highly correlated with one or more other independent variables in multiple linear regression. It inflates the variance of the coefficient estimates, which undermines the statistical significance of the affected variables, and hence it is considered a problem.
How can you remove multicollinearity from the model? There are 2 ways to remove multicollinearity, 1. Removing highly correlated predictors from the model 2. Using Principal Component Analysis
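One simple way to find candidates for removal is to compute the pairwise Pearson correlation between predictors and flag pairs above a cutoff. The sketch below is illustrative only: the function names, the feature names, the data, and the 0.9 threshold are all assumptions, and pairwise correlation will not catch a predictor that depends on a combination of several others.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlated_pairs(features, threshold=0.9):
    """Return pairs of feature names whose absolute correlation exceeds threshold."""
    names = list(features)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(features[a], features[b])) > threshold
    ]


# Hypothetical predictors: the two height columns measure the same thing.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [30, 22, 41, 35, 27],
}
flagged = correlated_pairs(features)
print(flagged)
```

Here only the two height columns are flagged, so dropping one of them would remove the redundancy without losing information.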
What is logistic regression? This is a supervised machine learning algorithm for classification. It estimates the probability that an input belongs to a class and uses it to predict a binary output such as True/False, Yes/No, etc.
Although logistic regression is a classification algorithm, why is there “regression” in it? The raw output of logistic regression is a real number, the log-odds (logit), ranging from minus infinity to plus infinity and modeled as a linear function of the inputs, just as in linear regression. These values are then converted into probabilities by the sigmoid function for classification.
What is Maximum Likelihood estimation? It is a statistical method that estimates the parameters of a model by choosing the values under which the observed data is most probable, i.e. the values that maximize the likelihood function.
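The two stages, a linear score followed by a conversion to a probability, can be sketched as follows. The weights, bias, and the 0.5 threshold are illustrative assumptions; in practice the weights are learned from data.

```python
import math


def sigmoid(log_odds):
    """Map a real-valued log-odds score to a probability between 0 and 1."""
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    # Equivalent form that avoids overflow for large negative inputs.
    e = math.exp(log_odds)
    return e / (1.0 + e)


def predict(features, weights, bias, threshold=0.5):
    """Return (probability, predicted_class) for one example."""
    # The "regression" part: a linear function of the inputs.
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    # The "classification" part: convert to a probability and threshold it.
    probability = sigmoid(log_odds)
    return probability, probability >= threshold


probability, label = predict([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(probability, label)
```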
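A small example helps: for a series of coin flips, the maximum likelihood estimate of the probability of heads is the candidate value that makes the observed flips most probable. The sketch below finds it by a simple grid search; the data and the grid of candidates are assumptions for illustration, and real implementations usually use calculus or a numerical optimizer instead.

```python
import math


def log_likelihood(p, flips):
    """Log-probability of the observed flips if heads has probability p."""
    return sum(math.log(p) if flip else math.log(1 - p) for flip in flips)


# Illustrative data: 1 = heads, 0 = tails (7 heads out of 10 flips).
flips = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

# Try p = 0.01, 0.02, ..., 0.99 and keep the most likely value.
candidates = [i / 100 for i in range(1, 100)]
best_p = max(candidates, key=lambda p: log_likelihood(p, flips))
print(best_p)
```

The search lands on the observed fraction of heads, which is the known closed-form maximum likelihood estimate for this model.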
What are the evaluation metrics for Classification Algorithms? Confusion matrix, F1 score, Accuracy, Precision, Recall, and ROC curve.
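Most of these metrics are derived from the four counts in a binary confusion matrix. The following is a minimal sketch, assuming labels are encoded as 1 (positive) and 0 (negative); the function name and the example labels are illustrative, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```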
What is a Decision Tree? It is a supervised machine learning algorithm that creates a tree-like model of decisions and their possible consequences.
What is pruning? Pruning is a process of limiting the size of a decision tree to avoid overfitting the data and also to reduce the complexity of the tree.
What is Bagging? It is also called Bootstrap Aggregating. It is a machine learning ensemble algorithm that trains multiple models on bootstrap samples of the training data and combines their predictions. It is designed to increase accuracy and reduce variance, which helps avoid overfitting of classification and regression models.
What is a Random Forest? It is an ensemble learning method for classification, regression, and other tasks that operates by constructing multiple decision trees. It gets outputs from all the decision trees and then combines them, taking the majority vote for classification or the average for regression.
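The two building blocks shared by bagging and random forests, resampling the training data and aggregating the individual predictions, can be sketched without training any trees. Ensembles of this kind typically aggregate by majority vote for classification and by averaging for regression. All function names and data below are illustrative assumptions, and the tree models themselves are omitted.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as done for each model in bagging."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Aggregate class predictions from several models (classification)."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Aggregate numeric predictions from several models (regression)."""
    return sum(predictions) / len(predictions)


rng = random.Random(0)  # fixed seed so the resampling is repeatable
rows = ["row1", "row2", "row3", "row4", "row5"]
sample = bootstrap_sample(rows, rng)

# Hypothetical outputs from three trees for one input.
voted = majority_vote(["cat", "dog", "cat"])
averaged = average([2.0, 4.0, 6.0])
print(sample, voted, averaged)
```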
What is cross-validation? It is a method to evaluate a machine learning model with limited data by resampling procedures. In its most common form, k-fold cross-validation, a single parameter k determines the number of groups the data is split into; each group is used once for testing while the remaining groups are used for training. It is also called out-of-sample testing.
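The splitting step of k-fold cross-validation can be sketched as below. The function name k_fold_indices is hypothetical, and the sketch assumes the data is already in random order, so it does not shuffle before splitting.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering all samples."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    indices = list(range(n_samples))
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, k=3)
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold, so every data point is used for both training and evaluation across the k rounds.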
What is KNN? The k-Nearest Neighbors algorithm is a supervised learning method where a new data point is assigned the most frequent class among its k nearest data points. It can be used for both classification and regression; for regression, the prediction is the average of the values of the k nearest neighbors.
What are the advantages of the KNN algorithm? 1. Used for both regression and classification 2. Simple and easy to implement the algorithm 3. Quick to train, since there is no explicit training phase 4. Often achieves good accuracy when the features are relevant and scaled
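A minimal sketch of KNN classification follows, assuming Euclidean distance and labeled training points. The function name, the data, and k = 3 are illustrative choices; note that every training point is compared with the query, which is why prediction becomes slow on large datasets.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Return the most frequent label among the k training points nearest to query."""
    # Distance from the query to every labeled training point.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Illustrative labeled data forming two well-separated groups.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_classify(train_points, train_labels, query=(2, 2)))
print(knn_classify(train_points, train_labels, query=(6, 5)))
```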
What are the disadvantages of the KNN algorithm? 1. Computationally expensive 2. With large data, predictions become slow 3. Irrelevant features affect the predictions
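What is the k-means algorithm? The k-means algorithm is an unsupervised learning method that groups the observed data into k clusters by assigning each data point to the cluster whose center (centroid) is nearest, then recomputing the centroids and repeating until the assignments stop changing.
What are the advantages of the k-means algorithm? 1. Simple and easy to implement 2. Easily scalable to large datasets 3. Guaranteed to converge, although possibly to a local optimum
What are the disadvantages of the k-means algorithm? 1. The number of clusters, k, must be chosen in advance and is difficult to determine manually 2. It is sensitive to outliers, which pull the centroids away from the true cluster centers 3. It works poorly when the clusters are not roughly spherical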
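The assign-then-update loop commonly used for k-means can be sketched in a few lines. This is a simplified illustration: the initial centroids are passed in by hand rather than chosen randomly, the number of iterations is fixed instead of checking for convergence, and the data is made up for the example.

```python
import math


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and updating centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    # clusters holds the assignment made in the final iteration.
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```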