Steps in Machine Learning (ML)
1. Problem Definition: What is the problem, why does it need to be solved, and how can it be solved?
2. Data Collection: What data is needed for the project, where is it available, and how can it be obtained?
3. Data Preparation: This stage usually takes the bulk of the project time. It involves exploring, conditioning and transforming the data before modeling and analysis are done.
Activities involved in data preparation include:
• Feature Engineering
• Outlier treatment
• Data formatting
• Data cleaning
• Data Normalization
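As a rough illustration of two of the activities above (outlier treatment and data normalization), here is a minimal sketch in plain Python. The function names, the 1.5 × IQR fence rule and the sample values are assumptions chosen for the example, not something prescribed by this outline.

```python
from statistics import quantiles


def clip_outliers_iqr(values, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature column with one obvious outlier (120.0).
raw = [12.0, 15.0, 14.0, 13.0, 16.0, 120.0]
clipped = clip_outliers_iqr(raw)
normalized = min_max_normalize(clipped)
```

Clipping before normalizing keeps a single extreme value from squeezing all other values into a narrow band near 0.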
4. Algorithm Selection: Several machine learning algorithms exist for handling various kinds of data, such as images, audio, text or numbers. They fall broadly into supervised, unsupervised and reinforcement learning algorithms.
5. Data Modeling: An ML algorithm is trained on input data so that it discovers patterns in the data. This knowledge is then used to make predictions when new data is given. Frameworks like PyTorch and TensorFlow also offer pre-trained models that can be used in solving several problems.
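To make the train-then-predict idea concrete, here is a minimal sketch that fits a straight line by ordinary least squares in plain Python. Linear regression is only one possible algorithm, and the names `fit_line` and `predict` and the toy data are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Train: estimate slope and intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict: apply the learned pattern to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Toy training data generated from y = 2x + 1 (an assumption for the demo).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

model = fit_line(train_x, train_y)
prediction = predict(model, 5.0)  # input not seen during training
```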
6. Model Validation: The outputs of a model are compared with real-world observations to check how closely they agree. After splitting the data into training, test and cross-validation datasets, the cross-validation dataset is used to validate the model. Model validation also involves tuning the hyper-parameters of the model.
Types of Model validation are:
• Split Sample Validation
• Cross Validation
• Bootstrapping Validation
Common forms of Cross Validation:
• Jack-Knife/Leave-one-out
• K-fold cross-validation, e.g. 10-fold cross-validation
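The sketch below shows one way K-fold splitting could be done in plain Python; leave-one-out falls out as the special case where K equals the number of samples. The function name `k_fold_indices` is hypothetical, and the sketch does not shuffle the data, which is usually done before splitting in practice.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# 5-fold split of 10 samples: every sample is used for validation exactly once.
folds = list(k_fold_indices(10, 5))

# Leave-one-out: k equals the number of samples.
loo_folds = list(k_fold_indices(4, 4))
```

A model would be trained on each `train` part and scored on the matching `validation` part, and the scores averaged across folds.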
7. Model Evaluation: The model is evaluated to measure its final performance. In this stage we find out which model best represents our data and estimate how well the chosen model will work in the future. After model training and validation are done, the test dataset is used to evaluate the model.
Different evaluation metrics exist for classification and regression algorithms.
Classification
• Confusion Matrix – Accuracy, False Positives, False Negatives, Recall, Precision, Specificity, Negative Predictive Value.
• Receiver Operating Characteristic Curve (ROC Curve)
• Area Under Curve (AUC)
• Gini Coefficient
• Gain and Lift Charts
• KS Chart (Kolmogorov-Smirnov)
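As a small illustration of the confusion-matrix metrics listed above, here is a sketch for binary labels (1 = positive, 0 = negative) in plain Python. The function names and the sample labels are assumptions for the example. Ratios with a zero denominator are returned as NaN.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def ratio(numerator, denominator):
    """Divide safely; NaN signals that the metric is undefined."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),       # also called sensitivity
        "specificity": ratio(tn, tn + fp),
        "npv": ratio(tn, tn + fn),          # negative predictive value
    }


# Hypothetical test-set labels and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(y_true, y_pred)
```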
Regression
• Sum of Squared Errors (SSE), Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)
• Relative Squared Error (RSE)
• Mean Absolute Error (MAE)
• Mean Absolute Deviation (MAD)
• Relative Absolute Error (RAE)
• Coefficient of Determination (R²) – SSR, SST and SSE
• Adjusted R²
• Analysis of Residuals
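Several of the regression metrics listed above can be computed in a few lines. The sketch below uses plain Python, with R² computed as 1 − SSE/SST and adjusted R² using a caller-supplied number of predictors. The function name, the dictionary keys and the sample values are assumptions for the example.

```python
from math import sqrt


def regression_metrics(y_true, y_pred, n_predictors=1):
    """Compute common error metrics for a regression model."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # sum of squared errors
    sst = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    sae = sum(abs(t - p) for t, p in zip(y_true, y_pred))    # sum of absolute errors
    r2 = 1 - sse / sst
    return {
        "sse": sse,
        "mse": sse / n,
        "rmse": sqrt(sse / n),
        "mae": sae / n,
        "rse": sse / sst,                                    # relative squared error
        "rae": sae / sum(abs(t - mean_y) for t in y_true),   # relative absolute error
        "r2": r2,
        "adjusted_r2": 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1),
    }


# Hypothetical observed values and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
scores = regression_metrics(y_true, y_pred)
```

SSE, MSE and RMSE penalize large errors more heavily than MAE does, which is worth keeping in mind when the data contains outliers.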
8. Model Deployment: The ML model is put to use to automate data-driven decision making in production. Sometimes this involves integrating the model into a broader system, which might provide an interactive interface.
Footage: https://m.facebook.com/groups/BigDataPakistan/permalink/3504276386339699/