Saturday, December 27, 2008

NIPS Report

After attending NIPS 2008 I figure I should write up my impressions. For me, some of the highlights were

  1. Shai Ben-David's student, Margareta Ackerman, gave a keynote presentation on the quality of clustering. After hearing her presentation and talking to her at her poster, I was unimpressed. I also think that a lot of the clustering quality stuff is BS. They create a set of axioms to evaluate the quality of clustering algorithms, and I find all of them to be somewhat questionable. Comparison of unsupervised algorithms, such as clustering, can be done via comparison of marginal likelihoods. It seemed that many of the ideas involved in Bayesian model comparison were a foreign language to Ackerman. A full review of the topic deserves its own post.
  2. Han Liu, John Lafferty, and Larry Wasserman presented a paper on joint sparsity regression. It builds on a previous paper where they modify L1 regularization to induce joint sparsity in multiple regressions, causing certain factors to have zero influence in all of the different regressions. They extend this to the non-parametric case: each regression is an additive model of nonlinear functions of each input, and the joint sparsity model causes certain functions to be zero everywhere across the regressions. The regularization is quite strange. L1 regularization is equivalent to a MAP solution with a Laplacian prior, but I am not sure what equivalent priors these regularization methods have. In the non-parametric single-regression case, I think the regularizer is equivalent to a GP prior on the function where the covariance function is a Kronecker delta, though I have not proven that. A degeneracy of this model is that it forces the functions to zero for every input that has not been observed. Searching through the paper, I found that they used Gaussian kernel smoothing on the functions afterwards to smooth them out, which seems like a bit of a hack to me. A full review of the topic deserves its own post.
  3. Byron Yu and John P. Cunningham presented a paper on Gaussian Process Factor Analysis. They used independent GPs over time as the latent factors and then a factor-analysis-like linear transformation to explain the observed time series. They applied it to some neural spike data and got quite interesting results: they were able to visualize the activity of a monkey's motor cortex in 2D while it threw a ball.
  4. Ben Calderhead and Mark Girolami presented a paper on Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes. They were trying to infer the parameters of a nonlinear ODE. It was a little kludgy, as their model setup violated the likelihood principle. They modeled the time series of the system state with GPs, and they also modeled the derivatives of the system state via the ODE. So, they had two models for the data. They inferred the parameters by trying to minimize the difference between the two; I think they modeled the difference as a Gaussian centered at zero. By inspecting the graphical model, however, you notice that the observed variables are independent of the ODE parameters. So, if one wanted to be completely principled, one could not use the model to infer anything about the parameters.
  5. Rebecca Saxe gave a presentation that I found quite interesting. She showed a certain brain region that was involved in thinking about how other people were thinking. It is therefore involved in moral judgement, because moral judgements often hinge on intent. She first correlated activity in the region with people's moral judgements about hypothetical scenarios. She later showed how she was able to use TMS on subjects and change their moral judgements about those scenarios.
  6. There was a lot of hype about infer.NET. It will make exploring different models much easier. It seems much more powerful than VIBES or BUGS.
  7. I attended the causality workshop, which explored methods for inferring causation from observational data or from experiments where you don't have direct control over the variables you'd like to control. Judea Pearl gave a very enthusiastic presentation; I am not sure I would consider it a good presentation, however. There was some tension in the air between Philip Dawid and Judea Pearl over their views on causation, which has created two camps in the field. I don't think they are as far apart as they think; the divide is not as big as the one between Bayesians and frequentists, for example. Pearl presented his do-calculus for inferring causation in causal graphs, which is derived using a set of axioms. Dawid gave a presentation highlighting what I hope most people already know: conditional independence in graphical models is not necessarily the same thing as causation, and nothing is as good as a randomized experiment. However, Kevin Murphy, in Dawid's camp, showed one can prove all of the do-calculus rules using the IDAG: if one sets up a graphical model using input variables for causation, one can derive the do-calculus rules using the standard conditional independence properties of graphical models. Wrapping one's mind around the correct approach to causation is much more difficult and subtle than for prediction. I believe this is related to the fact that it is much more difficult to get ground truth when testing causal inference methods. Guyon highlighted this fact in relation to her causality challenge.
  8. Shakir Mohamed presented a paper extending PCA to other data types with distributions in the exponential family. Normal PCA works under an assumption of Gaussianity in the data; EPCA can assume a Bernoulli distribution, for example.
  9. Jurgen Van Gael presented a paper where he extended the iHMM to a factorial iHMM; he basically went from an HMM to an FHMM, but with iHMMs. An iHMM is an HMM with an infinite number of latent states, whose transition matrix is drawn from a hierarchical DP.
  10. Sebastian Seung gave a presentation on decoding the connectome. The connectome is basically the connection matrix between all the neurons in the brain. It amounts to summarizing a brain as a graph, with each neuron as a node and each synapse as an edge. The difficulty of the task is converting images of 20-30 nm thick brain slices into a connection matrix. So far this has only been done for C. elegans, which has a mere 300 neurons; with that, scientists have reverse engineered a lot of C. elegans behaviour. They are currently working on decoding a cubic mm of mouse brain, using computer vision algorithms to automate and speed up the process. He alluded to the massive amounts of data involved. By my calculations, merely storing the connectome of the human brain would require 432 TB; the imagery would be vastly more. If one had the connectome matrix it would open up tons of possibilities for analysis. I would like to run spectral clustering on the graph and see how closely the graph clusters correspond to anatomical structures. Of course, I don't know how one would run spectral clustering (i.e., do an eigendecomposition) on a matrix that large. Sebastian showed a video with 3D graphics illustrating the imaging process, which seemed like it was made for the Discovery Channel. The Star Wars music in the background was a bit much ;)
  11. There was a paper on bootstrapping the ROC curve. Basically, they are trying to get confidence bounds on an ROC curve. It is important to get a sense of confidence in your performance, to be sure it was not due to random chance. It is interesting to me because I have looked into model-based approaches to estimating the ROC curve.
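Since item 11 is about confidence bounds on ROC performance, here is a minimal sketch of one bootstrap approach, using a percentile interval on a rank-based AUC as the summary statistic. These are my choices for illustration, not necessarily what the paper did:

```python
import numpy as np

def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) statistic."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval on the AUC."""
    rng = np.random.default_rng(seed)
    stats = []
    n = len(scores)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # resample with replacement
        if labels[idx].min() == labels[idx].max():
            continue                          # skip one-class resamples
        stats.append(auc(scores[idx], labels[idx]))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# toy data: positives score higher on average
rng = np.random.default_rng(1)
labels = np.r_[np.zeros(100), np.ones(100)].astype(int)
scores = np.r_[rng.normal(0, 1, 100), rng.normal(1, 1, 100)]
lo, hi = bootstrap_auc_ci(scores, labels)
```

The same resampling loop would give pointwise bands on the full ROC curve rather than just the AUC.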
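On the connectome storage question from item 10, a back-of-envelope calculation along these lines lands in the same ballpark as my 432 TB figure. The neuron and synapse counts here are rough assumptions, so treat the result as order-of-magnitude only:

```python
import math

# Assumed counts (rough figures, not established measurements):
# ~1e11 neurons, ~1e3 synapses per neuron, adjacency-list storage
# where each synapse records the id of its target neuron.
n_neurons = 1e11
synapses_per_neuron = 1e3
bits_per_id = math.ceil(math.log2(n_neurons))  # 37 bits to name a neuron
total_bits = n_neurons * synapses_per_neuron * bits_per_id
total_tb = total_bits / 8 / 1e12
print(f"~{total_tb:.0f} TB")  # hundreds of TB, before any imagery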
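As a sanity check on the claim in item 2 that L1 regularization is equivalent to MAP estimation under a Laplacian prior, here is the simplest scalar case, where the lasso solution is soft thresholding. This is a toy check of the equivalence, not the paper's method:

```python
import numpy as np

def soft_threshold(y, lam):
    """Lasso solution for the scalar problem
    argmin_w 0.5*(y - w)**2 + lam*|w|."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def laplace_map(y, lam, grid):
    """Brute-force MAP under a Laplacian prior p(w) ~ exp(-lam*|w|)
    with unit-variance Gaussian likelihood y ~ N(w, 1)."""
    log_post = -0.5 * (y - grid) ** 2 - lam * np.abs(grid)
    return grid[np.argmax(log_post)]

grid = np.linspace(-5, 5, 100001)
for y in (-2.0, 0.3, 1.7):
    assert abs(soft_threshold(y, 0.5) - laplace_map(y, 0.5, grid)) < 1e-3
```

The open question in item 2 is what the analogous prior is for the group-style regularizers used in the joint sparsity work.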
Obviously, this is only a small subset of NIPS. However, it will give me a lot of material when it is my turn to present at my group's weekly meetings. The list of proceedings is here.

Teamwork and Machine Learning

In many engineering programs there is a focus on how to work in teams and how to divide projects into parts. In embedded systems it may start with the division between hardware and software, then be further divided by subsystem and software library. I haven't seen much emphasis on the different roles in machine learning. I see the categories as being:

  1. Acquiring the data and getting it into a database.
  2. Extracting the data from the database into .csv and .mat files and into the form that can be sent directly into an algorithm.
  3. Designing new models, coding up the inference methods, and testing the algorithms on synthetic data.
  4. Creating a test bed to divide the data into training and test, evaluate different methods, and report results.
  5. Determining what feature matrices and models to use and putting everything together.
  6. Implementing libraries that can be used in actual applications.
  7. Testing the real-world libraries.

From what I've seen, not enough emphasis is placed on the division of these tasks, academically or industrially. I think it is most efficient to divide these tasks among different people who can specialize. It is somewhat wasteful to take a person who is an expert in designing inference algorithms and have them spend most of their time setting up a database.

Sunday, November 16, 2008

Properties for probabilistic models

It seems that with probabilistic models there are some properties which are useful for completely understanding your model. I've compiled a list of properties that I think are useful:
  • Conditional independence properties. These should be the most apparent in the graphical models world. This is one of the reasons I find it so helpful to draw a graphical model.
  • Invariances: scale, rotation, permutation, etc. This is one of the best ways to understand the difference between PCA and factor analysis (FA): PCA is invariant to rotations of the data, while FA is invariant to rescalings of it.
  • Exchangeability. It is important to understand what is exchangeable when someone says a model has exchangeability. For instance, the Indian Buffet Process (IBP) and Chinese Restaurant Process (CRP) are commonly said to be exchangeable. This alone does not say much. The CRP is exchangeable when viewed as a distribution over partitions: the probability customer #1 is at the same table as customer #3 is the same as the probability customer #1 is at the same table as customer #10. It is not exchangeable when viewed as assigning table numbers to customers; after all, customer #1 will always be at table #1.
  • Identifiability of parameters. In some models, different settings of the parameters can give the exact same likelihood for the observed data. The simplest case is the mixture of Gaussians: one could switch the means and covariances of the first and second components and it would have no effect on the likelihood of the data. Points that were previously likely to have come from component 1 are now more likely to be from component 2, and vice versa. A less trivial example exists in linear dynamical systems: the dynamics noise covariance can be set to the identity without any effect on the likelihood if all the other parameter matrices are adjusted accordingly. The point of this is that any preference in the posterior for one of these settings over an equivalent one will be a result of preferences in the prior.
  • Expected posterior predictive variance under synthetic data. This might be hard to compute, but it would be interesting. In addition, it could provide a sanity check on one's algorithm on either synthetic or real data.
  • Expected log likelihood. Similar to the last idea; in other words, look at E[log-likelihood(D)]. This and the last quantity could be estimated by Monte Carlo simulation: sample synthetic data, do inference on it, and check the likelihood. However, this will not work as a sanity check if you are looking for bugs in your code, because your code was used to make the Monte Carlo estimate. It would work as a sanity check for model fitness: you could compare the likelihood of the real data to the expected likelihood under the model. This is just a sanity check; if one is looking for a more principled way of evaluating a model on a particular data set, then I would recommend Bayesian model comparison.
  • Standard statistical properties: variance, bias, consistency. Well, if you want to keep frequentists happy.
  • Model technicalities. I've kind of made this term up because I don't have a better name for it. In Salakhutdinov's paper on Probabilistic Matrix Factorization for the Netflix problem, he modeled the score a particular user gave a particular movie as sampled from logistic(normal()); in other words, he assumed the score was a continuous value between 0 and 1. To make this work at all, he scaled the actual scores (1 to 5) onto a 0-to-1 scale. Synthetic data from this model will be qualitatively different from the real data; however, the model is still okay for real data. In most cases, I suppose, model technicalities don't get in the way of effective inference, but I still think one should be mindful of their existence. Another example is clustering data points in space. One could calculate a distance matrix and then use a generative model for random matrices to model the data. However, for many distributions on matrices, the synthetic data will not be embeddable in Euclidean space. In other words, if you draw a matrix from a distribution on matrices to get a distance matrix, it is not guaranteed that you can find a set of points in Euclidean space with that distance matrix. I would consider that a model technicality as well.
  • Units: people seem to forget about the units of the parameters in the model. If your observations are lengths, for instance, the mean of the distribution might be measured in m while the variance is in m^2. An implication of this is that the statement "the variance is larger than the mean" is meaningless, because the units are different. It's like saying the area of a table is larger than its width.
  • Other stylized facts of synthetic data. This is the everything-else-I-forgot category. Sample data from the model: does it seem reasonable? See if it has properties you find desirable or undesirable.
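The CRP exchangeability point above can be checked numerically. Under the usual sequential construction of the CRP (my sketch below, with a made-up helper name), the probability that any two customers share a table is 1/(1 + alpha), regardless of which pair you pick:

```python
import numpy as np

def crp_tables(n, alpha, rng):
    """Sample table assignments for n customers from a CRP(alpha)."""
    tables = [0]      # customer 1 sits at table 0
    counts = [1]
    for _ in range(1, n):
        p = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(p), p=p / p.sum())  # join a table or start a new one
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        tables.append(k)
    return tables

rng = np.random.default_rng(0)
n_sim = 5000
same_1_3 = same_1_10 = 0
for _ in range(n_sim):
    t = crp_tables(10, 1.0, rng)
    same_1_3 += t[0] == t[2]   # customers #1 and #3
    same_1_10 += t[0] == t[9]  # customers #1 and #10
# both frequencies should be near 1/(1 + alpha) = 0.5
```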
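The mixture-of-Gaussians label-switching example of non-identifiability is easy to verify numerically; a toy 1-D check:

```python
import numpy as np

def mixture_loglik(x, weights, means, stds):
    """Log-likelihood of 1-D data x under a mixture of Gaussians."""
    dens = sum(w * np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
               for w, m, s in zip(weights, means, stds))
    return np.log(dens).sum()

rng = np.random.default_rng(0)
x = rng.normal(size=50)
ll = mixture_loglik(x, [0.3, 0.7], [-1.0, 2.0], [1.0, 0.5])
ll_swapped = mixture_loglik(x, [0.7, 0.3], [2.0, -1.0], [0.5, 1.0])
assert np.isclose(ll, ll_swapped)  # swapping component labels changes nothing
```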
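For the expected log likelihood, the standard normal is a case where the Monte Carlo estimate can be checked against a closed form, E[log N(x|0,1)] = -(1 + log 2*pi)/2:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200000)                      # synthetic data from the model
mc = np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * x ** 2)  # Monte Carlo estimate
exact = -0.5 * (1 + np.log(2 * np.pi))           # closed form, about -1.419
assert abs(mc - exact) < 0.01
```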
There are also the computational properties:
  • Is the model analytically tractable? Can you get a closed-form posterior predictive? Marginal likelihood? Expectation of the posterior on the parameters?
  • Is there some inherent computational complexity, i.e., a lower bound on the time to compute the exact solution? In most cases it is not practical to prove a lower bound on the big-O cost of inference, but if you can, the results could be very interesting. While on this topic, it is interesting to ask whether there is a deeper reason why the Gaussian is so much more analytically tractable than other distributions. Is there an intuitive explanation as a result of the central limit theorem?

Tuesday, November 11, 2008

Design Flow for Statistical/Machine Learning problems

While working on several real data sets I have noticed some real patterns in the design flow of the analysis. I've also noticed this from seeing how much direction some people need when tutoring a social scientist on statistical analysis. It also differentiates me from people who tend to find solutions before the problem.

In normal computer engineering problems there is a typical design flow one follows. Such as
  1. Formally stating the problem
  2. Dividing the problem into software and hardware parts
  3. Doing a block diagram
  4. Schematic/simulations
  5. PCB design/simulations
  6. Assembly and prototyping
  7. Test
A more detailed version of this is the topic for another post. Anyway, it would be nice to have something like that for machine learning. I think it would go something like this:

  1. Get an idea of the problem you're working on and what the data is. Talk to people with domain knowledge.
  2. Acquire the data (not always as trivial as it sounds)
  3. Create a mathematical representation of the data. This does not mean specifying a complete generative model yet, but something in a more mathematical form than a .csv file or SQL DB. For instance, for financial quotes, this would be a multivariate time series: a point process with real values associated with each point. However, you'll have to answer questions here such as: is it in real time or trading time? How do you deal with the market being closed sometimes? Holidays? How about exchanges in different time zones? What about volumes? Limit order imbalance between bid and ask? Is it a bid/ask representation or mid-price/spread? These questions are answered here.
  4. Now feature matrices have to be created. Most machine learning methods require data in the form of a matrix, so it is best to translate the more general abstract representation into one. For the market example, some form of windowing will probably have to be used, and the prices will probably be transformed into return space as well.
  5. Write the code to generate these feature matrices.
  6. Exploratory data analysis. This is kind of an art. One should look for ways to best visualize what is going on in the data set. Look at descriptive statistics: mean, variance, kurtosis. The correlation/mutual information between variables is also a must. Dimensionality reduction is a good bet when trying to do visualization. In a sense, exploratory data analysis consists of applying simple models to the data that don't get bogged down in complex learning and latent variable aspects. They can be cruder, since they are just supposed to give you an idea of what is going on; models of the more complex type are best left for the heavy lifting when one is getting serious about prediction and inference.
  7. Set up an evaluation framework: what performance measures should be used? For regression problems RMSE is often used; ROC curves are common in classification. In some cases it might be better to look at the RMSE on a log scale. A review of error metrics is a post of its own. For generative models, the marginal likelihood is definitely worth looking at if you can compute or approximate it; this allows for model comparison/averaging. For problems involving decision making, the performance of the whole system should be considered. For instance, in algorithmic trading, the amount of money earned by the system should be considered in addition to predictive accuracy. When the test set is small, one also needs a confidence estimate on the performance. I am not sure if there is a good reference on finding confidence bounds on ROC/mutual information/RMSE etc.; it would be great to find one.
  8. Write code for the test framework. Usually it goes something like: load the data, plot some of it, iterate over all feature extraction and model combos, do training and testing on each, then plot and report the results. One could make some pseudo-code for machine learning test frameworks for different types of learning; I'll leave that post for later.
  9. Find existing code for standard models and feature extraction and plug it into the test framework. See how they do. Always try the appropriate variant of linear regression to get baseline results. Try PCA or k-means as a basic feature extraction method.
  10. More feature extraction. The line between modeling and feature extraction is a bit artificial in my opinion. On one hand there is pre-processing, but after that usually comes some more serious feature extraction. Many feature extraction methods, such as clustering or PCA, are just unsupervised models where the latent variables are used as features. In the context of generative models, these could be appended to the actual model; the only difference is that with separate feature extraction, the uncertainty in the values of the latent variables isn't propagated. This can lead to more tractability, however. Anyway, think of other feature extraction methods that might be more appropriate for your problem.
  11. Create more sophisticated models. If you need to code up custom models, then there is a whole other design flow to get them working.
  12. Evaluate all your models. Go back and change them if you find there were bad decisions along the way. For that matter, decisions in any of these steps can be changed if they seem bad in hindsight.
  13. You can also add a mixture-of-experts model to get the best of both worlds from your models. If you have the marginal likelihood for all your models, then model averaging or a generative mixture model can be used. If you're doing classification, look at your ROC curves; one can always attain the convex hull of the ROC curves.
  14. The code for most of this is usually suited to prototyping and implemented in MATLAB and the like. One will usually need to take the time to implement some real application code to get the job done.
  15. Real world testing. You can never test too much.
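To make steps 3-5 concrete for the market example, here is one minimal way to turn a price series into a feature matrix of windowed log returns. The function name and conventions are my own, just for illustration:

```python
import numpy as np

def window_log_returns(prices, window):
    """Turn a 1-D price series into a feature matrix where each row
    is a window of consecutive log returns."""
    rets = np.diff(np.log(prices))          # move to return space
    n = len(rets) - window + 1
    return np.stack([rets[i:i + window] for i in range(n)])

prices = np.array([100.0, 101.0, 100.5, 102.0, 103.0, 102.5])
X = window_log_returns(prices, window=3)
# 6 prices -> 5 returns -> 3 overlapping windows of length 3
```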
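Step 8's loop might be sketched like this, with stand-in feature extractors and a least-squares baseline as in step 9. All the names here are hypothetical, not from any particular library:

```python
import numpy as np

def identity_features(X):
    return X

def squared_features(X):
    return np.hstack([X, X ** 2])

def linear_regression_predict(X_tr, y_tr, X_te):
    """Fit least squares with an intercept and predict on the test set."""
    w, *_ = np.linalg.lstsq(np.c_[np.ones(len(X_tr)), X_tr], y_tr, rcond=None)
    return np.c_[np.ones(len(X_te)), X_te] @ w

# toy data with a known linear relationship plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
tr, te = slice(0, 150), slice(150, 200)

# iterate over feature extraction / model combos and report test RMSE
results = {}
for name, feat in [("raw", identity_features), ("raw+sq", squared_features)]:
    Phi = feat(X)
    pred = linear_regression_predict(Phi[tr], y[tr], Phi[te])
    results[name] = np.sqrt(np.mean((pred - y[te]) ** 2))
```

A real framework would add plotting, cross-validation, and a results report, but the shape of the loop is the same.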

These steps can be divided among people whose specialties match them best. Extracting data can be a big task on its own, likewise with implementing application code. Coding up a new model is a task that can also be specialized to a certain person.

Saturday, September 20, 2008

Coding Style

While at Google I got experience with code reviews. Code readability and style are taken very seriously there and can drag out code reviews. So, I looked back at the comments I received in our web-integrated revision control system to see what my common issues were. I found a lot of stuff about style one would usually not notice.

  • Comment grammar. They do check for spelling and grammar in your comments. It gets kind of annoying to edit your sentences when you have to manually break at 80 columns and enter a new comment symbol. It would be cool if there were an IDE that would pop up a text box and let you type with a spell checker and a grammar checker, then insert the text into the code for you.
  • 80-character line width. Seems like those 82s always sneak in. Check your code with a linter to find this kind of stuff!
  • Put a good description of the input and output of a program at the top, in a docstring if you're using Python.
  • Document why you're using < vs. <=. This helps avoid off-by-one errors.
  • Variable names: they have to be descriptive, but not too long. It is a definite trade-off; I guess there is a feng shui to this one.
  • Capitalization style: there is usually a convention for capitalization of variables, global constants, class names, and function names. Some lint tools check this.
  • White space rules: for indentation and line continuation. (lintable)
  • Whitespace after commas and around binary operators. There is also not supposed to be whitespace at the end of a line. (lintable)
  • Don't hard-code any numbers, filenames, or most strings. They should be global constants.
  • Put the units in a variable name if it is a quantity with a unit, such as seconds.
  • Use libraries whenever possible. This is why it is important to know what all the libraries do.
  • Avoid deprecated code.
  • Avoid redundant function calls. For example, calling flush() before close() if close() flushes automatically.
  • Comment almost every line. There is probably some good ratio of comment lines to code lines if you were to compile some stats. Don't explain the language in the comments; explain the program. To do otherwise would violate the style rules as well.
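Several of the rules marked "lintable" above are easy to check mechanically. A toy sketch of the idea (real tools like pylint or pycodestyle do this and far more):

```python
def lint_lines(lines, max_width=80):
    """Flag lines that are too long or have trailing whitespace."""
    problems = []
    for i, line in enumerate(lines, 1):
        if len(line) > max_width:
            problems.append((i, "line too long"))
        if line != line.rstrip():
            problems.append((i, "trailing whitespace"))
    return problems

code = ["x = 1", "y = 2   ", "z = " + "a" * 90]
problems = lint_lines(code)
# -> [(2, 'trailing whitespace'), (3, 'line too long')]
```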


I had a great time at MLSS, but some people dropped the ball when it came to IT and logistics. So there are some things people need to get right at the next MLSS.

We need to make sure we have:
- IT:
- internet in the rooms where people are staying
- a lot of bandwidth available (at least 50 Mbps total)
- enough wireless routers to handle the load of ~200 connections, with some buffer
- a power strip at every table for people to plug their laptops into
- some US power plug tables and some European ones, given there will be many European attendees
- some tables with Cat 5 for people who have wifi trouble
- A Google Map with placemarks of all the locations we need to know
- message boards
- A transport board set up a few months in advance so participants can arrange transportation plans with each other.
- Video recordings of the lectures

- Talks:
- Have the slides, references, labs, and code online in advance on the website (not on some funky FTP server on someone's laptop)
- All the lectures and activities scheduled in Google calendars
- Facebook and LinkedIn groups set up beforehand
- A whiteboard for the lecturers to write on (with markers that work; whiteboards always seem to have dead markers)
- Enough lab space for everyone to fit
- A notebook and a pen in the backpack schwag that is given out. People don't usually remember these when packing.
- More practical sessions, and lectures oriented towards the practical sessions. I like the philosophy of Nando de Freitas: if you can't code it, you don't understand it.
- There should also be feedback forms for the participants to fill out on the lecturers

- Maybe a mini-library with a few books around, like Bishop etc.
- A vending machine around where people are staying in case they get the munchies
- A whereivebeen map showing where everyone is from
- An MLSS reader in one PDF with all the relevant readings that would be useful in preparation for the lectures and practical sessions
- Energy drinks in addition to coffee at the breaks (in Cambridge we'll probably have tea, this being the UK)

Another possibility is for MLSS to experiment with the I-Clicker. My brother had to get one for his freshman physics class. The lecturer can ask multiple choice questions and get feedback from the whole class via the remotes. It tells the lecturers whether they are explaining something in enough detail. It would also make the lectures more interactive, so people would be less inclined to fall asleep.

Saturday, September 13, 2008


At MLSS I've seen the spectrum from very good to very bad lectures. There seem to be a few rules one could follow to avoid the pitfalls of the bad presentations, such as:

Explain things through examples rather than cryptic mathematical definitions.
Don't explain anything too complicated with your hands alone; draw everything on a board.
Provide motivation and background for what you want to do.

Being a graphical models person, I can't understand anything without a graphical model anymore. It helps to specify full Bayesian inference as the objective, even if it is intractable; it gives people a target for what you are trying to achieve. Many people explain non-probabilistic methods without going through this step. However, many of these methods have an implicit graphical model which could be used to help orient people.

It is also good to use examples; people are very good at learning from examples. Once people understand something 90%, it is then appropriate to supply a cryptic mathematical definition to completely disambiguate. For example, the DP is much more understandable with the stick-breaking construction than with the abstract definition. In physics, thermodynamics is much easier to explain if you have one of those particle Java applets that illustrates an ideal gas.

At one of the lectures at MLSS I could look back at the audience and see that 95% of people were either surfing the internet, reading a book, or staring off into space. The lecturer seemed pretty oblivious to it; maybe he didn't even care. Many of us suspected he simply gave us a presentation he had given to his research group. He didn't take the time to make a new presentation even though his current one was very inappropriate.

As for student presentations at the CBL, I've found there are a few reasons they are confusing:
1) The presenter doesn't really understand the topic themselves
2) Going over too much material in too little time
3) A language barrier

When naively listening to a confusing presentation, it can seem like the person understands the topic so well that no one else can follow at their level. However, in my experience, when you begin asking questions it becomes clear the presenter doesn't really understand the topic.

On the CBL wiki I've made a presentation guide with the following tips:

Technical Things

* Always label graph axes and be clear about what the graph represents.
      o Use units if applicable (on graph axes and in general)
* Use page numbers on slides, showing the current page number and the total number of pages
* Don't forget your video adapter if you have a Mac. Mac folks always seem to forget this. Hugo has one in his desk that people can borrow.
* If you're not presenting on your own computer, then you should have the presentation in multiple formats: ppt, pdf, odf. Don't expect everyone else to have OpenOffice, for instance.
* Also, if not using your own computer, make sure the presentation computer has the necessary software for any demos. This includes Flash Player, Java applet support, any necessary video codecs, etc.
* Don't put too much text on a slide, and keep the fonts big enough

Before Starting the Talk

* Think about what kind of talk you want to give (a rough idea of an algorithm, a detailed description of something, ...)
      o Depending on this, you might not want to use too many equations (even if the slides are then not fully self-contained)
      o Keep it simple!
* Give at least one test talk

Starting the Talk

* It might be best to start the presentation with material you expect most people already know. This ensures everyone is on the same page; then start to introduce new things.
* It is good to establish the practical importance of whatever you're presenting. Giving an example problem helps give people context; if everything is abstract, things become much more confusing.

During the Talk

* Always define what variables represent; maybe keep them on a whiteboard on the side.
o If necessary, define the dimensions of matrices and, if not obvious, the range of values variables can take (zero to one, positive, ...)
* If presenting a probabilistic model, put its graphical model, in Bishop notation, in your presentation.
* Give an intuitive feel for all equations and variables to the extent possible
o Do this with examples and analogies
* Don't try to convey any important information with your hands alone.
o Never write out equations in the air with your hands (I've had a teacher who does this)
* Don't be afraid to write out and derive equations on the board
* In engineering problems it is always good to explain the input, output, and interface of any given system up front. If this is not clear people will get confused.
* If longer than an hour or so, give breaks for caffeine, snacks, etc.
* Don't rush through the slides. People should be able to read them! Explain what's going on. Depending on your presentation style (more or less tutorial-like), 2-3 minutes per slide on average seems good


* Speak loudly, clearly, and not too fast
o Mumbling technical comments on the side only confuses people

Some Dos and Don'ts

* Don't point your laser pointer at people. It always seems kind of awkward.
* Don't point with your hands; people can't see exactly what you're pointing at. Use the laser pointer.
* Don't point to everything with the laser pointer.
* Do look at the audience
* Do modulate your voice and be interested in your own stuff. It's not trivial to most others!
* Do use examples and demos
o In intro physics they always like to use those Java applets of a spring oscillator and so on. Try to do the same if possible/applicable.

After the Talk

* Post your slides on the Presentation Archive
* Use our template if you can get LaTeX to work


* Some hints if you give a short talk
* All the stuff in this guide to Terrible Presentations (…and how to not give one)

Friday, September 12, 2008


I have been at MLSS for the last 12 days. We are located here.

There are a lot of good people here from ETH, ParisTech, Max Planck, and so on.

The lectures by Nando de Freitas, Richard Sutton, and Yann LeCun were the most interesting. There was an interesting lecture by Shai Ben-David on the theory of clustering; I am not sure what to think of it yet.

Nando talked about his new search engine, Worio. It is supposed to cluster web pages when you enter terms with multiple meanings, which sounds like an interesting idea. He also showed a demo of an image search where you can refine your query by selecting images you like and don't like.

Last Sunday many of us went on a 50 km bike ride here.

We've been to the beach several days as well. The water here is about 20 °C, much warmer than the Pacific. I've taken several pictures which I will need to post too.

The internet hasn't been very good here though. We don't have internet in the rooms and the wifi can't handle the load of all the attendees.

Hello World

Hello world test for my new blog of cool machine learning anecdotes.