Sunday, November 16, 2008

Properties of probabilistic models

With probabilistic models there are a number of properties that are useful for fully understanding your model. I've compiled a list of the ones I think are most useful:
  • Conditional independence properties. These are most apparent in the graphical models world, and they are one of the reasons I find it so helpful to draw a graphical model.
  • Invariances: scale, rotation, permutation, etc. This is one of the best ways to understand the difference between PCA and factor analysis (FA): PCA is invariant to rotations of the data, while FA is invariant to rescaling the individual variables.
  • Exchangeability. It is important to understand what exactly is exchangeable when someone says a model has exchangeability. For instance, the Indian Buffet Process (IBP) and Chinese Restaurant Process (CRP) are commonly said to be exchangeable. On its own this does not say much. The CRP is exchangeable when viewed as a distribution over partitions: the probability that customer #1 sits at the same table as customer #3 is the same as the probability that customer #1 sits at the same table as customer #10. It is not exchangeable when viewed as assigning table numbers to customers. After all, customer #1 will always be at table #1. (A small sketch after this list checks the partition view numerically.)
  • Identifiability of parameters. In some models, different settings of the parameters give the exact same likelihood for the observed data. The simplest case is a mixture of Gaussians: one can swap the means and covariances of the first and second components without any effect on the likelihood of the data. Points that were previously likely to have come from component 1 are now more likely to be from component 2, and vice versa. A less trivial example exists in linear dynamical systems, where the dynamics noise covariance can be set to the identity without any effect on the likelihood, provided the other parameter matrices are adjusted accordingly. The point is that any preference in the posterior for one of these parameter settings over an equivalent one must come from preferences in the prior. (See the label-switching sketch after this list.)
  • Expected posterior predictive variance under synthetic data. This might be hard to compute, but it would be interesting, and it could provide a sanity check on one's algorithm on either synthetic or real data.
  • Expected log likelihood. Similar to the last idea: look at E[log-likelihood(D)] under data D drawn from the model. This and the previous quantity can be estimated by Monte Carlo simulation: sample synthetic data, run inference on it, and check the likelihood. Note that this will not work as a sanity check for bugs in your code, because your code was used to produce the Monte Carlo estimate in the first place. It does work as a sanity check for model fitness: you can compare the likelihood of the real data to the expected likelihood under the model. This is only a sanity check; if you want a more principled way of evaluating your model on a particular data set, I would recommend Bayesian model comparison. (A Monte Carlo sketch follows the list.)
  • Standard statistical properties: variance, bias, consistency. Well, if you want to keep the frequentists happy.
  • Model technicalities. I've made this term up because I don't have a better name for it. In Salakhutdinov's paper on Probabilistic Matrix Factorization for the Netflix problem, the score a particular user gave a particular movie is modeled as a sample from logistic(normal()); in other words, the score is assumed to be a continuous value between 0 and 1. To make this work at all, the actual scores (1 to 5) are rescaled onto a 0-to-1 scale. Synthetic data from such a model will be qualitatively different from the real data, yet the model is still fine for real data. In most cases, I suppose, model technicalities don't get in the way of effective inference, but one should be mindful of their existence. Another example is clustering points in space. One could compute a distance matrix and then use a generative model for random matrices to model the data. However, for many distributions on matrices the synthetic data will not be embeddable in Euclidean space: if you draw a matrix from a distribution on matrices and treat it as a distance matrix, there is no guarantee you can find a set of points in Euclidean space with exactly those pairwise distances. I would consider that a model technicality as well. (A small embeddability check appears after this list.)
  • Units: people seem to forget about the units of the parameters in a model. If your observations are lengths, for instance, the mean of the distribution is measured in m while the variance is measured in m^2. One implication is that the statement "the variance is larger than the mean" is meaningless, because the units are different. It's like saying the area of a table is larger than its width.
  • Other stylized facts of synthetic data. This is the everything-else-I-forgot category. Sample data from the model: does it look reasonable? Check whether it has the properties you find desirable or undesirable.
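
To make the exchangeability point concrete, here is a minimal Python sketch (mine, not from the list above; the partition and the concentration parameter are made up) that computes the CRP probability of a fixed partition under every possible arrival order and checks that it comes out the same each time.

    # Sketch: CRP exchangeability over partitions. The probability of a given
    # partition of customers is the same for every arrival order.
    from itertools import permutations

    def crp_partition_prob(partition, order, alpha):
        """Sequential CRP probability of `partition` (a list of sets of
        customer labels) when customers arrive in `order`."""
        seated = set()
        prob = 1.0
        for i, customer in enumerate(order):      # i = number already seated
            block = next(b for b in partition if customer in b)
            occupied = len(block & seated)        # people already at this table
            if occupied == 0:
                prob *= alpha / (i + alpha)       # start a new table
            else:
                prob *= occupied / (i + alpha)    # join an existing table
            seated.add(customer)
        return prob

    partition = [{1, 2, 4}, {3}, {5}]             # made-up partition of 5 customers
    probs = {round(crp_partition_prob(partition, order, alpha=1.0), 12)
             for order in permutations([1, 2, 3, 4, 5])}
    print(probs)    # a single value: the probability is independent of arrival order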
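
Here is a similarly minimal sketch of the label-switching point for a two-component Gaussian mixture; the data and parameter values are arbitrary, chosen only to show that swapping the components leaves the likelihood unchanged.

    # Sketch: label switching in a two-component Gaussian mixture leaves the
    # likelihood unchanged. Data and parameter values are arbitrary.
    import numpy as np
    from scipy.stats import norm

    def mixture_loglik(x, weights, means, stds):
        # log of sum_k w_k N(x | mu_k, sigma_k^2), summed over the data points
        dens = sum(w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, stds))
        return np.sum(np.log(dens))

    x = np.array([-1.2, 0.3, 2.5, 2.9, 3.1])
    ll_original = mixture_loglik(x, [0.3, 0.7], [0.0, 3.0], [1.0, 0.5])
    ll_swapped  = mixture_loglik(x, [0.7, 0.3], [3.0, 0.0], [0.5, 1.0])
    print(ll_original, ll_swapped)                # identical up to float rounding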
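
The Monte Carlo idea for the expected log likelihood can be sketched with a trivial stand-in model (a fixed Gaussian; all numbers here are assumptions for illustration): sample synthetic data sets, compute their log likelihoods, and compare the real data's log likelihood to that distribution.

    # Sketch: Monte Carlo estimate of the expected log likelihood of synthetic
    # data, used as a rough model-fitness check. Model and parameters are stand-ins.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    mu, sigma, n = 0.0, 1.0, 50                   # the "fitted" model

    def loglik(data):
        return norm.logpdf(data, mu, sigma).sum()

    # E[log-likelihood(D)] under the model, estimated from synthetic data sets
    synthetic_lls = [loglik(rng.normal(mu, sigma, n)) for _ in range(2000)]

    real_data = rng.normal(0.4, 1.3, n)           # pretend this is the real data
    print(np.mean(synthetic_lls), np.std(synthetic_lls), loglik(real_data))
    # If the real data's log likelihood sits far outside the synthetic range,
    # the model is probably a poor fit.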
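
And a small check of the Euclidean-embeddability technicality (again my example, with made-up numbers): classical multidimensional scaling says a symmetric matrix D with zero diagonal is a Euclidean distance matrix only if the doubly centred Gram matrix -1/2 J D^2 J is positive semidefinite, and a randomly filled matrix usually fails this.

    # Sketch: test whether a random symmetric "distance" matrix is embeddable
    # in Euclidean space (the classical MDS criterion).
    import numpy as np

    rng = np.random.default_rng(1)
    n = 6
    A = rng.uniform(0.5, 2.0, (n, n))
    D = (A + A.T) / 2                             # symmetric, positive entries
    np.fill_diagonal(D, 0.0)                      # zero self-distances

    J = np.eye(n) - np.ones((n, n)) / n           # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                   # Gram matrix of any embedding
    print(np.linalg.eigvalsh(B))                  # any negative eigenvalue means no
    # set of points in Euclidean space has exactly these pairwise distances
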
There are also computational properties to consider:
  • Is the model analytically tractable? Can you get a closed-form posterior predictive? A marginal likelihood? An expectation for the posterior over the parameters? (The Beta-Bernoulli case below is the textbook example.)
  • Is there some inherent computational complexity, i.e. a lower bound on the time needed to compute the exact solution? In most cases it is not practical to prove a lower bound on the big-O cost of inference, but if you can, the result could be very interesting. While on this topic, it is interesting to ask whether there is a deeper reason why the Gaussian is so much more analytically tractable than other distributions. Is there an intuitive explanation coming out of the central limit theorem?
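
As a concrete instance of "analytically tractable" (my example, not part of the original list): for a Bernoulli likelihood with a Beta prior, the posterior, the posterior predictive, and the marginal likelihood are all closed form.

    % Beta-Bernoulli conjugacy: prior Beta(a, b), observations x_1,...,x_n in {0,1}, k = sum_i x_i
    \begin{align*}
      p(\theta \mid x_{1:n}) &= \mathrm{Beta}(\theta \mid a + k,\ b + n - k), \\
      p(x_{n+1} = 1 \mid x_{1:n}) &= \frac{a + k}{a + b + n}, \\
      p(x_{1:n}) &= \frac{B(a + k,\ b + n - k)}{B(a, b)}.
    \end{align*}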

Tuesday, November 11, 2008

Design Flow for Statistical/Machine Learning problems

While working on several real data sets I have noticed some recurring patterns in the design flow for doing analysis. I've also noticed this from seeing how much direction people need when tutoring a social scientist on statistical analysis. Having an explicit flow also differentiates me from people who tend to find solutions before they have a problem.

In normal computer engineering problems there is a typical design flow one follows, such as:
  1. Formally stating the problem
  2. Dividing the problem into software and hardware parts
  3. Doing a block diagram
  4. Schematic/simulations
  5. PCB design/simulations
  6. Assembly and prototyping
  7. Testing
A more detailed version of this is the topic of another post. Anyway, it would be nice to have something like that for machine learning. I think it would go something like this:

  1. Get an idea of the problem you're working on and what the data is. Talk to people with domain knowledge.
  2. Acquire the data (not always as trivial as it sounds)
  3. Create a mathematical representation for the data. This does not mean specifying a complete generative model yet, but something in a more mathematical form than a .csv file or SQL DB. For financial quotes, for instance, this would be a multivariate time series: a point process with real values attached to each point. You'll have to answer questions here such as: is it in real time or trading time? How do you deal with the market being closed sometimes? Holidays? Exchanges in different time zones? What about volumes, or the limit order imbalance between bid and ask? Is it a bid/ask representation or mid-price/spread? These questions get answered at this step.
  4. Now feature matrices have to be created. Most machine learning methods require data in the form of a matrix, so it is best to translate the more general abstract representation into one. For the market example, some form of windowing will probably have to be used, and the prices will probably be transformed into return space as well. (A small windowing sketch appears after this list.)
  5. Write the code to generate these feature matrices
  6. Exploratory data analysis. This is a bit of an art. Look for ways to best visualize what is going on in the data set. Look at descriptive statistics: mean, variance, kurtosis. The correlation/mutual information between variables is also a must. Dimensionality reduction is a good bet when trying to do visualization. In a sense, exploratory data analysis consists of applying simple models to the data that don't get bogged down in complex learning and latent variable machinery. They can be cruder, since they are just supposed to give you an idea of what is going on. Models of the more complex type are best left for the heavy lifting, when one is getting serious about prediction and inference. (A quick sketch of these checks appears after this list.)
  7. Decide on an evaluation framework: which performance measures should be used? For regression problems RMSE is often used; ROC curves are common in classification. In some cases it might be better to look at the RMSE on a log scale. A review of error metrics is a post of its own. For generative models, the marginal likelihood is definitely worth looking at if you can compute or approximate it, since it allows for model comparison/averaging. For problems involving decision making, the performance of the whole system should be considered. For instance, in algorithmic trading the amount of money earned by the system should be considered in addition to predictive accuracy. When the test set is small one also needs a confidence estimate on the performance. I am not sure there is a good reference on finding confidence bounds on ROC/mutual information/RMSE etc.; it would be great to find one. (A bootstrap sketch after this list is one simple option.)
  8. Write code for the test framework. Usually it goes something like: load the data, plot some of it, iterate over all feature extraction and model combos, train and test each one, then plot and report the results. One could write pseudo-code for machine learning test frameworks for the different types of learning; I'll leave that for a later post. (A skeleton of such a loop is sketched after this list.)
  9. Find existing code for standard models and feature extraction and plug them into the test framework. See how they do. Always try the appropriate variant of linear regression to get a baseline result, and try PCA or k-means as a basic feature extraction method.
  10. More feature extraction. The line between modeling and feature extraction is a bit artificial in my opinion. On one hand there is pre-processing, but after that usually comes more serious feature extraction. Many feature extraction methods, such as clustering or PCA, are just unsupervised models whose latent variables are used as features. In the context of generative models, these could be appended to the actual model; the only difference is that with separate feature extraction, the uncertainty over the values of the latent variables isn't propagated. That can buy tractability, however. Anyway, think of other feature extraction methods that might be more appropriate for your problem.
  11. Create more sophisticated models. If you need to code up custom models, there is a whole other design flow to get them working.
  12. Evaluate all your models. Go back and change them if you find there were bad decisions along the way. For that matter, decisions in any of these steps can be revisited if they seem bad in hindsight.
  13. You can also add a mixture of experts model to get the best of all your models. If you have the marginal likelihood for all your models, then model averaging or a generative mixture model can be used. If you're doing classification, look at your ROC curves: one can always attain the convex hull of the individual ROC curves by randomizing between classifiers.
  14. The code for most of this is prototyping code, usually implemented in MATLAB and the like. One will usually need to take the time to implement some real application code to get the job done.
  15. Real world testing. You can never test too much.
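
For step 4, here is a minimal Python sketch of turning a price series into a matrix of windowed returns; the series, the window length, and the choice of log returns are all just illustrative assumptions.

    # Sketch for step 4: turn a 1-D price series into a feature matrix of
    # windowed log returns. The series and window length are made up.
    import numpy as np

    rng = np.random.default_rng(0)
    prices = 100 * np.exp(np.cumsum(0.01 * rng.standard_normal(500)))  # fake prices

    returns = np.diff(np.log(prices))             # move into return space
    window = 20                                   # assumed window length

    # Row t holds the `window` most recent returns; the target is the next return.
    X = np.stack([returns[t - window:t] for t in range(window, len(returns))])
    y = returns[window:]
    print(X.shape, y.shape)                       # (n_examples, window), (n_examples,)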
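
For step 6, a few of the exploratory checks can be sketched directly (the data here is fake, and the particular statistics are just the ones mentioned above).

    # Sketch for step 6: quick exploratory checks on a feature matrix X.
    import numpy as np
    from scipy.stats import kurtosis

    rng = np.random.default_rng(0)
    X = rng.standard_normal((300, 5)) @ rng.standard_normal((5, 5))    # fake data

    print("mean:     ", X.mean(axis=0))
    print("variance: ", X.var(axis=0))
    print("kurtosis: ", kurtosis(X, axis=0))
    print("correlation matrix:\n", np.corrcoef(X, rowvar=False))

    # A 2-D view via PCA (top two principal components) for visualization
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    print("2-D PCA projection shape:", (Xc @ Vt[:2].T).shape)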
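
For the confidence-estimate question in step 7, one simple option (my suggestion, not a reference) is to bootstrap the test set and report the spread of the metric; the predictions below are synthetic stand-ins.

    # Sketch for step 7: bootstrap confidence interval on test-set RMSE.
    import numpy as np

    rng = np.random.default_rng(0)
    y_true = rng.standard_normal(200)
    y_pred = y_true + 0.5 * rng.standard_normal(200)     # pretend model output

    def rmse(t, p):
        return np.sqrt(np.mean((t - p) ** 2))

    boot = []
    for _ in range(2000):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        boot.append(rmse(y_true[idx], y_pred[idx]))

    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"RMSE = {rmse(y_true, y_pred):.3f}, 95% bootstrap interval [{lo:.3f}, {hi:.3f}]")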
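
And for step 8, a skeleton of the kind of loop I have in mind; every component here (the feature extractors, the least-squares baseline, the fake data) is a placeholder, not a prescribed API.

    # Sketch for step 8: a bare-bones test framework that iterates over
    # feature-extraction / model combinations. All components are placeholders.
    import numpy as np

    def identity_features(X):
        return X

    def squared_features(X):
        return np.column_stack([X, X ** 2])       # a toy "feature extraction"

    def linear_regression(X_train, y_train, X_test):
        # least-squares baseline with a bias column
        A = np.column_stack([X_train, np.ones(len(X_train))])
        w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
        return np.column_stack([X_test, np.ones(len(X_test))]) @ w

    def rmse(t, p):
        return np.sqrt(np.mean((t - p) ** 2))

    # Load (here: fake) data and split into train and test
    rng = np.random.default_rng(0)
    X = rng.standard_normal((300, 5))
    y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(300)
    X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

    for feat_name, feat in [("raw", identity_features), ("squared", squared_features)]:
        for model_name, model in [("linear regression", linear_regression)]:
            pred = model(feat(X_tr), y_tr, feat(X_te))
            print(f"{feat_name:>8} + {model_name}: RMSE = {rmse(y_te, pred):.3f}")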

These steps can be divided among people whose specialties match them best. Extracting data can be a big task on its own, and likewise with implementing application code. Coding up a new model is also a task that can be given to a specialist.