A complete machine learning scenario starts with a problem that you want to solve and ends with a mechanism by which you can make predictions about new instances of similar data.
Prerequisites for a problem to be solvable by machine learning:
1. The problem can be solved by finding patterns in data.
2. You have access to a large set of existing data that includes the attribute you want to infer (the target) along with the other attributes (called features in machine learning) that you will infer it from.
If these requirements are met, the steps for solving a machine learning problem are as follows.
Define your problem, find relevant data, define your performance metrics.
Defining the problem means deciding what your target variable is and which features are necessary for the problem. In the Zillow competition, for example, the target variable is the price, and all the other variables, such as the size of the rooms and the number of bedrooms, are features.
No matter how powerful the machine learning algorithm is, you cannot build a model without data. As Andrew Ng said in his talk at Stanford business school, the major barrier for small companies is data. The data can come from an open source, a paid service, or even legal website scraping (more on the legal issues around web scraping in later blogs). Of course, many tech giants like Google and Facebook collect data from the usage of their services. I would say finding data is one of the most important skills, and one of the most often overlooked.
Once you have the data and have defined the target variable, you need to define a loss function or performance metric to minimize or maximize for the problem.
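For a price prediction problem like Zillow, a common choice is root mean squared error. Here is a minimal sketch with scikit-learn; the values are made up purely for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy ground-truth prices and model predictions (illustrative values only).
y_true = np.array([250_000, 310_000, 180_000])
y_pred = np.array([245_000, 320_000, 200_000])

# Root mean squared error: the metric the model will try to minimize.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE: {rmse:.0f}")
```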
Data exploration and cleaning.
This step is crucial but less obvious to many beginners. The quality of the data sets the upper limit on the accuracy of your model. No matter how fancy your algorithm is, your model cannot perform well if the data is corrupted.
But real-world datasets have strange values, missing values, and values that are simply wrong, so we need to explore the dataset and correct them.
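Here is a minimal sketch of what exploration and basic cleaning can look like with pandas; the file name and column names are placeholders, not from a real dataset:

```python
import pandas as pd

# Load the raw data (file name is hypothetical).
df = pd.read_csv("houses.csv")

# Quick exploration: data types, summary statistics, and missing-value counts.
print(df.dtypes)
print(df.describe())
print(df.isna().sum())

# Example fixes: drop clearly wrong values and fill missing ones.
df = df[df["price"] > 0]                                        # remove impossible prices
df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].median()) # fill missing bedroom counts
```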
Baseline model and feature selection
Next, we need a baseline model to set a floor for the performance metric. I usually use random forest and xgboost, since they are fast, effective, and handle missing values well.
The criteria for choosing a baseline model:
- Stable, good-enough performance and fast results
- Ideally, it can handle missing values (NA) natively
The second point is fairly important, since we will definitely use the baseline model to test whether the data cleaning process makes predictions better or worse.
A common practice, aside from using cross-validation to examine model performance, is to split the dataset into a training set and a validation set: train the model on the training set and evaluate it on the validation set.
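Putting the two ideas together, here is a rough sketch of a train/validation split plus an xgboost baseline. It assumes the cleaned DataFrame df from the previous step with a numeric price column as the target and only numeric feature columns:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Assumes df is the cleaned DataFrame from the previous step, all features numeric.
X = df.drop(columns=["price"])
y = df["price"]

# Hold out a validation set to evaluate the baseline.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# XGBoost tolerates missing values, so it works even before imputation is finished.
baseline = xgb.XGBRegressor(n_estimators=200, random_state=42)
baseline.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_val, baseline.predict(X_val)))
print(f"Baseline validation RMSE: {rmse:.0f}")
```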
Once we have the baseline model, we need to look at the importance of each variable in predicting the target variable and select some of them for the next step.
There is also the practice of creating or dropping features based on domain knowledge, or just intuition, and checking the importance of the new set of features with the baseline model.
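Tree-based baselines like random forest and xgboost expose per-feature importance scores, so a quick look can be as simple as this sketch, which reuses the baseline fitted above:

```python
import pandas as pd

# Rank features by the importance the baseline model assigned to them.
importances = pd.Series(baseline.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```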
PCA (principal component analysis) is an often-chosen method for reducing the number of features.
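Here is a minimal PCA sketch with scikit-learn. PCA needs complete numeric input, so I impute and scale first (assuming X_train contains only numeric columns):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# PCA needs complete numeric data, so impute and scale before projecting.
pca_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    PCA(n_components=0.95),  # keep enough components to explain 95% of the variance
)
X_train_pca = pca_pipeline.fit_transform(X_train)
print(X_train_pca.shape)
```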
Machine learning algorithm spot check
For classification problems, we will usually need some sampling method to deal with class imbalance in the data.
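One simple option is to randomly oversample the minority class with scikit-learn's resample utility. This is only a sketch; train_df with a binary label column is a hypothetical classification dataset, not the Zillow data:

```python
import pandas as pd
from sklearn.utils import resample

# Assumes a classification DataFrame train_df with a binary "label" column (hypothetical).
majority = train_df[train_df["label"] == 0]
minority = train_df[train_df["label"] == 1]

# Randomly oversample the minority class up to the size of the majority class.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["label"].value_counts())
```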
The next step is to spot check a range of learning algorithms on the features selected in the previous step and find the few that perform best.
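Here is a rough spot-check sketch with cross-validation, reusing X_train and y_train from the baseline step. The particular candidates are just examples, and they assume the features have already been imputed:

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge

# Candidate algorithms to spot check; all scored with the same CV splits and metric.
candidates = {
    "ridge": Ridge(),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}
for name, model in candidates.items():
    scores = cross_val_score(
        model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
    )
    print(f"{name}: RMSE {-scores.mean():.0f} (+/- {scores.std():.0f})")
```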
Then, we will need to fine-tune those models with a hyperparameter search.
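Here is a sketch of a randomized hyperparameter search for the xgboost model with scikit-learn; the parameter ranges are illustrative guesses, not tuned recommendations:

```python
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space; the ranges are assumptions, not tuned recommendations.
param_distributions = {
    "n_estimators": [200, 400, 800],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 0.85, 1.0],
}
search = RandomizedSearchCV(
    xgb.XGBRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```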
I have often found that using an adaptive resampling method increases model performance a little, at some cost in computation time. See how to do that in R: https://topepo.github.io/caret/adaptive-resampling.html
Build an ensemble model and train it on the entire dataset.
The last step is to find the right stacking, bagging, or boosting method to build your own ensemble model for the problem.
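Here is a minimal stacking sketch with scikit-learn's StackingRegressor. The particular base learners are illustrative, and it assumes search.best_params_ and the full X, y from the earlier steps:

```python
import xgboost as xgb
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge

# Stack the tuned base learners; a simple linear model combines their predictions.
ensemble = StackingRegressor(
    estimators=[
        ("xgb", xgb.XGBRegressor(**search.best_params_, random_state=42)),
        ("rf", RandomForestRegressor(n_estimators=400, random_state=42)),
    ],
    final_estimator=Ridge(),
    cv=5,
)

# Train the final ensemble on the entire dataset before predicting on new data.
ensemble.fit(X, y)
```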
I will add more details in the future.