While working on several real data sets, I have noticed some recurring patterns in the design flow for doing analysis. I also noticed this while tutoring a social scientist on statistical analysis and seeing how much direction some people need. Having a design flow also differentiates me from people who tend to find solutions before the problem.
In normal computer engineering problems there is a typical design flow one follows, such as:
- Formally stating the problem
- Dividing the problem into software and hardware parts
- Doing a block diagram
- Schematic/simulations
- PCB design/simulations
- Assembly and prototyping
- Test
A more detailed version of this is the topic of another post. Anyway, it would be nice to have something like that for machine learning. I think it would go something like this:
- Get an idea of the problem you're working on and what the data is. Talk to people with domain knowledge.
- Acquire the data (not always as trivial as it sounds)
- Create a mathematical representation for the data. This does not mean specifying a complete generative model yet, but something in a more mathematical form than a .csv file or SQL DB. For instance, for financial quotes, this would be a multivariate time series, or a point process with real values associated with each point. However, you'll have to answer some questions here. Is time measured in real time or trading time? How do you deal with the market being closed sometimes? Holidays? How about exchanges in different time zones? What about volumes? Limit order imbalance between bid and ask? Is it a bid/ask representation or mid-price/spread? These questions are answered at this step.
- Now feature matrices have to be created. Most machine learning methods require data in the form of a matrix, so it is best to translate the more general abstract representation into a matrix. For the market example, some form of windowing will probably have to be used. The prices will probably also be transformed into return space.
- Write the code to generate these feature matrices
- Exploratory data analysis. This is kind of an art. Look for the ways in which you can best visualize what is going on in the data set. Look at descriptive statistics: mean, variance, kurtosis. The correlation/mutual information between variables is also a must. Dimensionality reduction is a good bet when trying to do visualization. In a sense, exploratory data analysis consists of applying simple models to the data that don't get bogged down with complex learning and latent variable aspects. They can be more crude since they are just supposed to give you an idea of what is going on. Models of the more complex type are best left for the heavy lifting when one is getting serious about prediction and inference.
- Set up an evaluation framework: decide what performance measures should be used. For regression problems, RMSE is often used. ROC curves are common in classification. In some cases it might be better to look at the RMSE in log scale. A review of error metrics is a post of its own. For generative models, the marginal likelihood is definitely worth looking at if you can compute or approximate it. This will allow for model comparison/averaging. For problems involving decision making, the performance of the whole system should be considered. For instance, in algorithmic trading the amount of money earned by the system should be considered in addition to the predictive accuracy. When the test set is small, one also needs a confidence estimate on the performance. I am not sure if there is a good reference on finding confidence bounds on ROC/mutual information/RMSE etc. It would be great to find one.
- Write code for the test framework. Usually it goes something like this: load the data, plot some of it, iterate over all feature extraction and model combos, train and test each one, then plot and report the results. One could make some pseudo-code for machine learning test frameworks in different types of learning. I'll leave that post for later.
- Find existing code for standard models and feature extraction and plug them into the test framework. See how they do. Always try the appropriate variant of linear regression to get a baseline result. Try PCA or k-means as a basic feature extraction method.
- More feature extraction. The line between modeling and feature extraction is a bit artificial in my opinion. On one hand there is pre-processing, but after that usually comes some more serious feature extraction. Many feature extraction methods such as clustering or PCA are just unsupervised models where the latent variables are used as features. In the context of generative models, these could be appended to the actual model. The only difference is that with separate feature extraction, the uncertainty over the values of the latent variables isn't propagated. This can lead to more tractability, however. Anyway, think of other feature extraction methods that might be more appropriate for your problem.
- Create more sophisticated models. If you need to code up custom models, then there is a whole other design flow to get them working.
- Evaluate all your models. Go back and change them if you find there were bad decisions along the way. For that matter, decisions in any of these steps can be changed if they seem bad in hindsight.
- You can also add a mixture of experts model to get the best of both worlds with your models. If you have the marginal likelihood for all your models, then model averaging or a generative mixture model can be used. If you're doing classification, look at your ROC curves. One can always attain the convex hull of the ROC curves by randomly switching between the classifiers.
- The code for most of this is usually prototype quality and implemented in MATLAB and the like. One will usually need to take the time to implement some real application code to get the job done.
- Real world testing. You can never test too much.
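To make the feature matrix and test framework steps concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the function names (`log_returns`, `window_features`, `run_experiments`), the synthetic random-walk prices standing in for real quotes, the chronological 70/30 split, and the two crude baselines (predict the training mean, and a one-variable least squares fit on the most recent return). It skips the plotting step and is only meant to show the shape of the loop over feature extraction and model combos.

```python
import math
import random

def log_returns(prices):
    """Convert a price series into log returns (the 'return space' step)."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]

def window_features(returns, width):
    """Slide a window over the returns: each row holds `width` past returns,
    and the target is the return that immediately follows the window."""
    n_rows = len(returns) - width
    X = [returns[i:i + width] for i in range(n_rows)]
    y = [returns[i + width] for i in range(n_rows)]
    return X, y

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Two deliberately crude baselines; each "fit" returns a predict function.
def fit_mean(X, y):
    mu = sum(y) / len(y)
    return lambda rows: [mu for _ in rows]

def fit_last_return_ols(X, y):
    """Ordinary least squares of the target on the most recent return only."""
    x = [row[-1] for row in X]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    return lambda rows: [intercept + slope * row[-1] for row in rows]

def run_experiments(prices, widths, models, train_fraction=0.7):
    """Iterate over every (window width, model) combo and record test RMSE."""
    returns = log_returns(prices)
    results = {}
    for width in widths:
        X, y = window_features(returns, width)
        split = int(train_fraction * len(y))  # chronological split, no shuffling
        for name, fit in models.items():
            predict = fit(X[:split], y[:split])
            results[(width, name)] = rmse(y[split:], predict(X[split:]))
    return results

# Synthetic random-walk prices stand in for real quotes (an assumption).
rng = random.Random(0)
prices = [100.0]
for _ in range(500):
    prices.append(prices[-1] * math.exp(rng.gauss(0.0, 0.01)))

models = {"mean": fit_mean, "ols_last_return": fit_last_return_ols}
results = run_experiments(prices, widths=[1, 5], models=models)
for (width, name), score in sorted(results.items()):
    print(f"width={width} model={name} test RMSE={score:.5f}")
```

Swapping in a real model only requires another entry in `models` that follows the same "fit returns a predict function" convention, which is the point of writing the framework before the models.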
These steps can be divided among people whose specialties best match each step. Extracting data can be a big task on its own. Likewise with implementing application code. Coding up a new model is a task that can also be specialized to a certain person.
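On the open question from the evaluation step about confidence estimates for small test sets: one generic option is the percentile bootstrap, which resamples the test cases with replacement and recomputes the metric each time. The sketch below is an assumption on my part rather than a recommended reference. The function name `bootstrap_ci`, the resample count, and the synthetic predictions are all made up for illustration, and the plain bootstrap treats test cases as independent, which is questionable for time series such as the market example.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def bootstrap_ci(y_true, y_pred, metric, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric` over paired test cases."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample test-case indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper

# Synthetic small test set: 30 targets and noisy predictions (an assumption).
rng = random.Random(1)
y_true = [rng.gauss(0.0, 1.0) for _ in range(30)]
y_pred = [t + rng.gauss(0.0, 0.5) for t in y_true]

point = rmse(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred, rmse)
print(f"RMSE = {point:.3f}, approximate 95% interval = [{lower:.3f}, {upper:.3f}]")
```

Because `metric` is passed in as a function, the same routine could in principle be reused for other performance measures computed from paired predictions, though how well the resulting interval behaves depends on the metric and the test set size.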