The Grad Factor: Properties for probabalistic models

It seems with probabalistic models there are some properties which are useful to completely understand your model. I've compiled a list of properties that I think are useful

Conditional independence properties. This should be the most appearent in the graphical models world. This is one of the reason I find it so helpful to draw a graphical model.
Invariances: scale, rotation, permutation, etc. This is one the best ways to understand the difference between PCA and factor analysis, FA. PCA is rotationally invariant under the data. While FA is scale invariant.
Exchangability. It importatant to understand what is exchangable when someone says a model has exchangability. For instance, the Indian Buffet Process, IBP, and Chinese Restaurant Process, CRP, are commonly said to have exchangability. This alone does not say much. The CRP is exchangable when viewed as a distribution over partitions. The probability customer #1 is at the same table as customer #3 is the same as the probability customer #1 is at the same table as customer #10. It is not exchangable when viewed as assigning table numbers to customers. After all, customer #1 will always be at table #1.
Identifiability for parameters. In some models different settings of the parameters can give the exact some likelihood for the observed data. The simplest case is the mixture of gaussians. One could switch the mean and covariances for the first and second components and it would have no effect on the likelihood of the data. Points that were previously likely to have some from component 1 are now more likely to be from component 2 and vice versa. A less trivial example exists in linear dynamical systems. The noise in the dynamics matrix can be set to identity wihout any effect on the likelihood if all the other parameter matrices are adjusted accordingly. The point of this is that any preference in the posterior for one of these settings of parameters over an equivalent one will be a result of the preferences in the prior.
Expected posterior predictive variance under synthetic data. This might be hard to compute but it would be interesting. In addition it could provide a sanity check on ones algorithm on either synthetic or real data.
Expected log likelihood. Similiar to the last idea. In other words look for the E[log-likelihood(D)]. This and the last example could be estimated by Monte Carlo simulation. Sample synthetic data and then do inference on it and check the likelihood. However, this will not work as a sanity check if you are looking for bugs in your code because your code was used to make the Monte Carlo estimate. It would work as a sanity check for model fitness. You could compare the likelihood of the real data to the expected likelihood under the model. This is just a sanity check. If one is looking for a more principled way of evaluating your model for a particular data set then I would reccomend Bayesian model comparison.
Standard statistical properties: variance, bias, consistency. Well, if you want to keep frequentists happy.
Model tecnicalities. I've kind of made this term up here because I don't have a better name for it. In Salakhutdinov's paper on Probabilistic Matrix Factorization for the Netflix problem he modeled the score a particular user gave a particular movie as sampled from logistic(normal()). In other words, he assumed the score was a continuous value between 0 and 1. To make this work at all we scaled the actual scores (1 to 5) on to a 0 to 1 scale. The synthetic data will be qualitatively different from the real data. However, the model is still okay for real data. In most cases, I suppose, model technicalities don't get in the way of effective inference, but I still think one should be mindful of their existence. Another example is with clustering data points in space. One could calculate a distance matrix and then use a generative model for random matrices to model the data. However, for many distributions on matrices the synthetic data will not be embeddable in Euclidean space. In other words, if you draw a matrix from a distribution on matrices to get a distance matrix it is not guaranteed you can find a set of points in Euclidean space that have that distance matrix. I would consider that a model technicality as well.
Units: people seem to forget about the units of the paramters in the model. If your observations are lengths, for instance, the mean of the distribution might be measured in m while the variance is m^2. An implication of this is that the statement that the variance is larger than the mean is meaningless because the units are different. Its like saying the area of a table is larger than its width.
Other stylistic facts of synthetic data. This is the everything else I forgot category. Sample data from the model. Does it seem reasonable? See if it has certain properties you find desirable or undesirable.

There are also the computational properties

Is the model analytically tractable? Can you get a closed form posterioir predictive? marginal likelihood? Expectation for the posterior on the parameters?
Is there some inherent computational complexity, a lower bound on the time to compute the exact solution. In most cases it is not practical to prove some sort of lower bound on the Big-O order to do inference. If you can, however, the results could be very interesting. While on this topic it is interesting to ask if there is a deeper reason why the Gaussian is so much more analytically tractable to work with than other distributions. Is there an intuitive explanation as a result of the central limit theorem?

The Grad Factor

Sunday, November 16, 2008

Properties for probabalistic models

No comments:

Blog Archive

About Me

Blogroll