Monday, December 27, 2010


Undirected Grad has a blog post about symbolic regression using a new peice of software called Eureka. I have tried it out and it is pretty effective at uncovering the latent function from the synthetic experiments a tried, such as y = logistic(x^2 + sin(x)).

NIPS 2010 highlights

It was the last year in Vancouver/Whistler. So, luckily the snow conditions were good ;)

On the technical side:

Switched Latent Force Models for Movement Segmentation
Mauricio Alvarez, Jan Peters, Bernhard Schoelkopf, Neil Lawrence
They modeled an input/output system governed by a linear differential equation where the input was distributed according a switching GP. They took advantage of the fact that the derivative of a function from a GP is also GP distributed, as well as linearity properties. Therefore, the output of the system was also distributed according to a switching GP model. They used the model to segment human motion. I liked it since it was closely related to my ICML paper on GP change point models. They claimed the advantage of their method is that it enforced continuity in the time series across segment switches. Although, this can easily be done in my setup I am glad my paper got a citation ;)

Global seismic monitoring as probabilistic inference
Nimar Arora, Stuart Russell, Paul Kidwell, Erik Sudderth
They used graphical models to infer if earthquakes and other seismic events (e.g. nuclear tests) are noise (from local events near a seismic sensor) or from a genuine event, which should be noticed by multiple seismic sensors.

A Bayesian Approach to Concept Drift
Stephen Bach, Mark Maloof
This paper is also similar to the Adams & MacKay change point framework. They replaced the base model (UPM) with a discriminative classifier (such as Bayesian logistic regression). They admitted to fitting some of the hyper-parameters to the test, which is cheating. However, they tried to justify it by saying that it is inappropriate to try to learn the frequency of concept drifts (change points) from training data. I don't think the argument is coherent.

Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
Ling Huang, Jinzhu Jia, Bin Yu, Byung-Gon Chun, Petros Maniatis, Mayur Naik
They did an analysis of programs to predict their execution time. The novelty of the paper is that they created features by "splicing" the program; they found small snippets of the program that could be executed quickly. They used the output of these snippets as features for a LASSO regression with polynomial regression. Polynomial basis functions are sensible since the run-time of a program is usually approximately linear, quadratic, or cubic in some aspect of its input. I pointed them to Zoubin's polybayes.m demo as a way of selecting the order of a polynomial from data. Symbolic regression using Eureka might also be illuminating.

Slice sampling covariance hyperparameters of latent Gaussian models
Iain Murray, Ryan Adams
Iain presented a some tricks for transforming the sample space in GP classification to drastically improve the convergence of sampling GP hyper-parameters. Iain is a fan of re-parameterizing models to spaces that makes sampling easier. He claims the naive sampling method gets stuck in an "entropic barrier." He says this a third and often ignored, but common, failure mode of MC methods. The are other two are: the sampling method getting stuck in one mode of the posterior and dimensions that are highly correlated.

Heavy-Tailed Process Priors for Selective Shrinkage
Fabian Wauthier, Michael Jordan
Fabian did GP classification while applying heavy tail noise to the latent GP before squashing the function through a sigmoid/probit. They claim GPC often gives over confident predictions in sparsely sampled areas of the input space. This method claims to alleviate the problem. Since the problem does not occur in synthetic data I asked him what he thought was the underlying model assumption violated. He believes the root cause is the stationarity assumption in most GP kernels is inappropriate in many cases.

Copula Processes
Andrew Wilson, Zoubin Ghahramani
It was nice to see that Andrew attracted quite a crowd at his poster.

At the workshops I liked:
Natively probabilistic computation: principles and applications
Vikash Mansinghka, Navia Systems
Vikash argued that his accelerated hardware could do millions of samples per second in Gibbs sampling an MRF (1000x improvement). The hardware restricted the flexibility of what kind of sampling you could do. The loss in performance from lossing that flexibility was compensated for many times over by using the hardware acceleration. He argues that maybe the best approach is to use simple samplers and his accelerated hardware over sophisticated samplers in software.

There was talk about the prospect of moving to analog computation for sampling. A lot of energy is used in CPUs to make them completely deterministic with digital computation, but then in MC methods we artificially introduce randomness. Maybe it is better to do MC computations with analog. However, Vikash said that we must limit the analog computation to very small accelerated units within a digital processor in order for it to be manageable. The analog element would require custom ICs, which requires more funding than he currently has. However, he has selectively reduced the bit precision of many of his computations, which he says can be done when the quantities are random. This saves chip real-estate and power.

Tuesday, October 19, 2010

LaTeX A0 Summary

Following the well known latex-one-pager there is now the LaTeX A0 that combines 16 pages of LaTeX summary into poster.

Monday, September 20, 2010

MLSP 2010 highlights

Here are my highlights from MLSP.

To my knowledge this was the first machine learning conference to occur within the arctic circle ~ 68 N. The conference took place on top of the gondola. The key highlight of the conference was the summer bobsled track from the conference center to the village. The food was mostly raindeer (in various forms) and berries ;)

On the technical side:

Kalman Filtering and Smoothing Solutions to Temporal Gaussian Process
Regression Models:
Simo Sarka had a poster where he converted (almost arbitrary) stationary GP time series models into a state space model. He then used to Kalman filter to do O(T) predictions. As opposed to O(T^3) for general GPs and O(T^2) or O(TlogT) with Toeplitz tricks if the time series is in discrete time. Simo's method works in continuous time as well.

Recent directions in nonparametric Bayesian machine learning Zoubin gave a lecture were he made an unapologetic advertisement for NP-Bayes.

Tom M. Mitchell: Machine Learning for Studying Neural Representations of Word Meanings An interesting talk showing the cutting edge of machine learning applied to fMRI data.

PASS-GP: Predictive Active Set Selection for Gaussian Processes A new approach to sparse GPs involving selecting a subset of data points.

Archetypal Analysis for Machine Learning Mikkel's old NIPS pal an enthusiastic talk on "Archetypal Analysis", which most of the MLSP crowd was unfamiliar with.

CBMS highlights

Here are my highlights from CBMS: the non-parametric Bayes conference at UC Santa Cruz. It was organized more like a summer school, however.

The conference was dominated by Peter Muller, who gave 10 1.5 hour lectures on non-parametric Bayes. He talked mainly of Dirichlet processes and the generalizations to them: Pitman-Yor, Polya trees, ect. He presented a "graphical model of graphical models" demonstrating the connection between the related models. He went through each model and compared them by their predictive probability function (PPF), which is the one-step-ahead predictive distribution for the models. Notably absent from his unifying view was Gaussian processes.

Michael Jordan gave one lecture where he went through various models various NP Bayes models he has worked with: LDA, IBPs, sticky HMMs, ... He didn't get too technical, but tried to give a high level view of many models motivated by applications such as speaker diarization.

Wes Johnson gave one lecture giving examples of NP Bayes in biology.

Finally, Peter Hoff gave one lecture "Alternative approaches to Bayesian nonparametrics". He gave some examples of how doing Bayesian inference with an unknown Gaussian has a better predictive probability than using a DP-mixture for N <> 100 were referred to "large" and N > 5000 as "huge".

The slides are available here:

Sunday, July 4, 2010

Why LaTeX is superior to MS Word

I recently submitted a paper to Interdisciplinary Graduate Conference (IGC) 2010. I prepared well formatted 8 page LaTeX document. However, the conference was organized by humanities students who had never heard of LaTeX. They wanted a .doc file. I then had to go through the painful process of converting my LaTeX document into a Word one. After that painful experience, I could not resist writing a rant on the process. I have experienced both forms of writing: I used word for every (lab) report in my under-grad. Once in my PhD program I was converted to LaTeX and have not looked back since.

  1. Speed: Anything that requires a mouse and clicking through menus will be slower than one where you can write it out in a few key strokes. This means that writing characters with accents, special symbols, and especially equations will be much faster to write in LaTeX.

    1. MS Word (and even worse open office) can get sluggish when editing large documents with lots of equations and figures.

  2. Security: stored in plain text

    1. MS Word stores its files in a bloated binary form. If a file gets corrupted for whatever reason you could be locked out of your file and many hours of work. Likewise if some bug in MS Word is causing it to crash when opening your file. With plain text source files, if all else fails you can always open and edit the file in a simple text editor.

  3. Separation of context and formatting

    1. The plain text style of LaTeX simply using section and subsection etc allows the writer to simply think about the logical flow of the document without worrying about superficial details such as font sizes and styles

    2. You know all your sections and subsections headings are in the correct font size. This is much harder to check using MS Word.

  4. Integratable with SVN

    1. Being plain text it is easier for SVN (and other revision control systems) to merge files being edited

    2. It also takes up less space on the server storing revisions

    3. It is also possible to use any diff tool to compare revisions

      1. You also know the diff tool will show you ALL changes in the state of the file. There is no such guarantee when using features such as track changes in MS Word.

  5. Control

    1. MS Word often tries to outsmart you. It will automatically capitalize, automatically try to select whole sentences, automatically insert bullet points, and try to infer when you’re done with a sub/superscript when writing equations. Software that tries to out-smart you will often out-dumb you. It tries too hard to infer what you want and often gets it wrong.

      1. A good example is the use of - vs. -- vs. ---. Microsoft assumes most users are not smart enough to infer which type of dash to use in which situation. So MS Word tries to figure it out automatically. It can be really annoying when it gets it wrong. With LaTeX you just write what you want in 1 to 3 key strokes!

      2. There is also the quote directions `` ''. In LaTeX it is specified manually while in word it is automatic, and can be annoying if word infers it wrong.

    2. Notion of state: every change in the file is visible. Nothing is hidden from you in a plain text source file.

      1. There is no hidden meta information

  6. Flexibility

    1. It is easy to search for $x^2$ or \footnote in LaTeX the is no easy way to do the analogous searches using CTRL-F in MS Word

    2. You can also search the file using regular expressions

    3. Macros for a more semantic representation can be written in just 1 line with a few key strokes

      1. For example, I often define \field{} using \mathcal{} and then \R using \field{R}

    4. In MS Word macros are often full blown VB scripts

      1. They should be disabled any way since they are a security risk

  7. More easily scriptable

    1. For instance, I have MATLAB code that exports a matrix of results in MATLAB to a LaTeX table

    2. It would require a full blown C++/VB program in visual studio using all sorts of crazy APIs (and therefore reading tons of documentation) to do the same thing in MS Word

  8. Speed-quality trade-off in formatting

    1. Because MS Word has to reformat the document every keystroke it has to use inferior typesetting methods to prevent the GUI from becoming glacially slow

    2. Since you only recompile after significant changes in LaTeX, it can afford to use more expensive type-setting algorithms (especially for equations) that might take 10 seconds to run.

  9. Interoperable

    1. Since things like bibtex are also plain text it is much easier for third parties to create applications such as Jabref. You don't get stuck using one particular reference manager. If you don't like one there are others to use instead. And if all else fails you can always edit it in notepad.

    2. Therefore, there is no vendor lock-in and you can't get stuck using one piece of software for backward compatibility reasons that may in the future become inferior to the alternatives.

  10. Misc

    1. Equations always come out looking crappy

    2. Figure captions aren't proper; MS Word will let them cross page boundaries for instance

    3. Footnotes are a pain, especially if you have multiple footnotes on the same page

    4. Equation numbering is a pain in Word

    5. It is possible to embed pdf figures in LaTeX which allows for vector graphics and avoids file bloat

      1. However, I beleive new versions of word allow for the insertion of eps figures

  11. Cost

    1. LaTeX is free while MS Office can cost a few hundred dollars

      1. There is open office but that is even worse

    2. I am not a "free-tard" so this is not my top concern

Advantages of MS Word:

  1. MS Word has a grammar checker. To my knowledge, none of the LaTeX editors have a grammar checker.

    1. Of course, the grammar checker should usually be taken with a grain of salt. However, it is good at catching typos such as interchanging it/is/if and they/then, which a spell checker will not find and are easy to glance over when proof reading.

  2. The equation editor is better than it used to be. Writing an equation heavy document in word used to be almost impossible. However, it is now doable, but still much slower than LaTeX.

ICML 2010 Highlights

636: Sparse Gaussian Process Regression via L1 Penalization
Feng Yan (Purdue University); Yuan Qi (Purdue University)
They introduced a way to do sparse GPs for large amounts of data by adding a L1 penalization to the influence of data points. It effectively removes irrelevant data points using a convex optimization. It avoids the local optima problems of normal sparse GPs.

I liked all the papers in the application track:
901: Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine
Thore Graepel (Microsoft Research); Joaquin Quiñonero Candela (Microsoft Research); Thomas Borchert (Microsoft Research); Ralf Herbrich (Microsoft Research)

902: Detecting Large-Scale System Problems by Mining Console Logs
Wei Xu (UC Berkeley); Ling Huang (Intel Labs Berkeley); Armando Fox (UC Berkeley); David Patterson (UC Berkeley); Michael I. Jordan (UC Berkeley)
I liked this since it is somewhat related to my project.

903: The Role of Machine Learning in Business Optimization
Chid Apte (IBM T. J. Watson Research Center)
IBM is increasing the efficiency of collecting back taxes in NY state using machine learning (which some people found scary).

374: Local Minima Embedding
Minyoung Kim (CMU); Fernando De la Torre (CMU)
The idea is to visualize the a high dimensional objective function in a lower dimension that can be visualized while preserving the local optima. Its a really good idea, but it is not good enough to help with the hard problems we would want to solve with it (ie visualizing local optima in high dimensional neural network optimization).

495: Hilbert Space Embeddings of Hidden Markov Models
Le Song (Cmu); Byron Boots (Carnegie Mellon University); Sajid Siddiqi (Google); Geoffrey Gordon (Carnegie Mellon University); Alex Smola (Yahoo! Research)

551: Distance Dependent Chinese Restaurant Processes
David Blei (Princeton University); Peter Frazier (Cornell)

AISTATS 2010 Highlights

The conference kicked off with
Forensic Statistics: Where are We and Where are We Going?
Richard Gill
To append to Sebastien's comment the statistical flaws introduced by the doctors and lawyers in the case included: confusion that P(data|guilty) = P(guilty|data), multiplying p-vals over multiple tests, the post-hoc problem in frequentist hypothesis testing, arbitrary starting and stopping rules, violated iid assumptions, and fitting the protocol to define events to maximize significance.

I liked
Reduced-rank hidden Markov models
S. Siddiqi, B. Boots and G. Gordon
some cool stuff about alternatives to EM for training HMMs that are closed form with bounds on the loss of statistical efficiency compared to EM and without EM's local optima problem.

Gaussian processes with monotonicity information Jaakko Riihimäki, Aki Vehtari ; 9:645-652, 2010.

I learned what Phil Hennig is up to in
Coherent inference on optimal play in game trees
P. Hennig, D. Stern and T. Graepel

Zoubin pulled in another best paper award in
Learning the structure of deep sparse graphical models
R. Adams, H. Wallach and Z. Ghahramani
Adams and Wallach used some of the MacKay magic.

On the Sunday after the conference there was the active learning workshop. Don Rubin (co-inventor of EM and co-author on Gelman) was the invited speaker who talked about experimental design and causality. He is definetly on the stats side of the ai-stats. I asked him about measuring test set performance and he asked why I would want to divide my data set in half (definitely out of sync with the ML culture!). It was also interesting to see the philosophical cracks between him and Phil Dawid and the division between the Dawid/Rubin/Pearl views on causal inference.

NIPS 2009 highlights

Reading Tea Leaves: How Humans Interpret Topic Models, Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang and David Blei, Princeton University Really interesting idea for evaluating an unsupervised method. They use human subjects on Amazon Turk for controlled experiments to determine the interpretability of topic model outputs. They find that higher perplexity doesn't always correspond to higher interpretability by humans. The model with better perplexity may have found more structure in the data, but that may or may not correspond to our intuitive notions of a "topic". After talking to them there are some big open questions: 1) Do correlated topic models have better perplexity than LDA, but lower interpretability, because people don't think about words being assigned to multiple topics OR because there are unknown models, which better explain the data, and are better on both measures? 2) Is it possible to use the feedback from human subjects to do semi-supervised or active learning to improve topic models? I really liked this paper, because many machine learning researchers would find a contrived automatic measure for interpretability, which may not reflect interpretability from human subjects, which is what you really care about. This is especially true in the empirical risk minimization (ERM) community where a contrived, but automatic, measure is preferred because it is easier to optimize. Semi-supervised learning using feedback from humans in complex experiments makes it much harder to operate in terms if risk.

Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization (Tutorial) Gunnar Martinsson Encouraging results on doing large scale matrix computations.

Sequential Monte-Carlo Methods (Tutorial) Arnaud Doucet and Nando de Freitas Really clear description of bootstrap particle filters.

Gaussian process regression with Student-t likelihood, Jarno Vanhatalo, Pasi Jylanki and Aki Vehtar they claim their Laplace approximation is better than EP and MCMC.

On Stochastic and Worst-case Models for Investing, Elad Hazan, IBM, and Satyen Kale, Yahoo! Research Tried to create bounds on worst-case scenario's for finding a portfolio. Interesting idea, but after talking to the authors, it seems they made the implicit assumption that the market can't drop more than 50% in a day, which they didn't mention.

Fast subtree kernels on graphs, Nino Shervashidze and Karsten Borgwardt, MPIs Tuebingen They use a kernel that can be used for regression/classification when the inputs are graphs (such as a molecular structure). They used SVMs in the paper, but it could be easily used in the GP context (which of course excites me more).

Invited Talk: Bayesian Analysis of Markov Chains, Persi Diaconis, Stanford I was excited to finally see the much talked of Diaconis in the flesh.

Machine Learning for Sustainability, J. Zico Kolter, Stanford University, Thomas Dietterich, Oregon State University, and Andrew Ng, Stanford University (mini-symposium) They really put their money where their mouth is. One of the speakers did his presentation over Skype to avoid the CO2 necessary to fly from San Francisco to Vancouver.

Improving Existing Fault Recovery Policies, Guy Shani and
Christopher Meek, Microsoft Research
Interesting since its similar to what I do ;)