tag:blogger.com,1999:blog-14794960380606164722016-11-25T04:15:36.735-08:00The Grad FactorBigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.comBlogger19125tag:blogger.com,1999:blog-1479496038060616472.post-66258492329288849772012-07-27T20:02:00.001-07:002012-07-27T20:02:15.769-07:00NIPS 2011<br /><div style="margin-bottom: 0in;"><span style="color: black;"><span style="font-family: 'Times New Roman';">Here are my highlights from NIPS 2011 in </span></span><span style="background-color: white;">Granada, Spain:</span></div><div style="margin-bottom: 0in;"><span style="color: black;"><span style="font-family: 'Times New Roman';"><br /></span></span></div><div style="margin-bottom: 0in;"><span style="color: black;"><span style="font-family: 'Times New Roman';">How biased are maximum entropy models? </span></span><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">Jakob H. Macke, Iain Murray, Peter E. Latham </span></span><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">They show that some of the common approaches to maximum entropy learning (subject to constraints in the data like moments) can severely under-estimate the entropy of the data. One might naively assume max-ent over-estimates the entropy of the data. Iain calls his paper a "health warning" for methodology he says he sees many neuroscientists use. 
</span></span><br /><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction </span></span><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">Siwei Lyu </span></span><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">Looked like an interesting paper but the author was MIA at the poster </span></span><br /><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">Statistical Tests for Optimization Efficiency </span></span><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling /0.5 </span></span><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">The idea is that in a conjugate gradient (CG) optimization routine for learning parameters you can approximate the derivatives as long as they have the same sign as the true derivatives, i.e. you usually take steps in the right direction. If the objective is of the form </span></span><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">J(theta) + sum_i=1^N f(x_i,y_i,theta) </span></span><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">then you can randomly sub-sample the data when computing the objective and use a statistical test to limit the false positive rate: taking an optimization step in the wrong direction. It would be interesting to extend this to Gaussian process (GP) hyper-parameter optimization where the objective contains a sum over all pairs of data points (if you convert the matrix operations to sums). 
</span></span><br /><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">Probabilistic amplitude and frequency demodulation </span></span><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">Richard Turner, Maneesh Sahani </span></span><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">Rich extended some of the work on angular distributions with GPs that he gave a research talk on a while back. He provides a fully probabilistic interpretation of signal processing frequency analysis methods. </span></span><br /><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">A Collaborative Mechanism for Crowdsourcing Prediction Problems </span></span><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">Jacob D. Abernethy, Rafael M. Frongillo </span></span><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">They describe a prediction market mechanism that would more efficiently combine information from participants in an ML competition. Instead of a winner-take-all approach like in the Netflix competition, which ended up being a competition between a few giant ensembles, participants would make bets in a prediction market about how much their contribution would improve the performance if integrated into a prediction system. This alleviates the need for participants to organize themselves into conglomerates, i.e. ensembles. Amos Storkey gave a similar talk at the workshops on using prediction market mechanisms for model combination. I really like this idea and it seems to be gaining some traction. </span></span><br /><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">Variational Gaussian Process Dynamical Systems </span></span><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">Andreas C. Damianou, Michalis Titsias, Neil D. 
Lawrence </span></span><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">They do nonlinear state space modeling with a Gaussian process time series (GPTS) on the latent states and a GP-LVM like model on the observations. This is similar to Turner et al. (2009) except there an autoregressive Gaussian process (ARGP) is used on the latent states. However, using a GPTS on the latent states makes it easier to apply variational methods to integrate out the pseudo inputs. That, combined with some automatic relevance determination (ARD) on the GPTS hyper-parameters, allows them to claim that you need not bother worrying about the right latent dimension or number of pseudo-inputs: just select as large a number as you can handle computationally and the method will automatically ignore the excess dimensions/pseudo-inputs without over-fitting. This means they should be able to make a plot of pseudo-inputs/latent dimensions against performance and see the performance level out for a sufficiently large number of pseudo-inputs/latent dimensions and not go down much thereafter. </span></span><span style="color: black;"><span style="font-family: 'Times New Roman';">It would be really cool if they could make the plots to illustrate that. </span></span><br /><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">Bernhard Scholkopf gave a keynote talk on some of the work on causal inference he has been doing. </span></span><br /><span style="color: black;"><span style="font-family: 'Times New Roman';">The talk seemed to conflate the generative/discriminative model distinction with "causal and anti-causal learning". He claimed his work on MNIST was anti-causal while his later work on image restoration had been causal. 
It seems discriminative vs generative would have been better terms to apply to the approaches where the data and task contained no interventions and really didn't warrant worrying about causality. Even in the MNIST case it is not clear it was "anti-causal": did the human draw a particular image because of the digit label, or did a human labeler apply a certain label because of the image he found in the data set? If we drop the causal and anti-causal learning terminology, this issue becomes irrelevant. </span></span><br /></div><div style="margin-bottom: 0in;"><br /></div><div style="margin-bottom: 0in;"> <span style="color: black;"><span style="font-family: 'Times New Roman';">References:</span></span></div><div style="margin-bottom: 0in;"> <br /></div><span style="color: black;"><span style="font-family: 'Times New Roman';">Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen. </span></span><a href="http://mlg.eng.cam.ac.uk/pub/pdf/TurDeiRas10.pdf"><span style="color: black;"><span style="font-family: 'Times New Roman';"><b>State-space inference and learning with Gaussian processes</b></span></span></a><span style="color: black;"><span style="font-family: 'Times New Roman';">. In Yee Whye Teh and Mike Titterington, editors, </span></span><em><span style="color: black;"><span style="font-family: 'Times New Roman';">13th International Conference on Artificial Intelligence and Statistics</span></span></em><span style="color: black;"><span style="font-family: 'Times New Roman';">, volume 9 of </span></span><em><span style="color: black;"><span style="font-family: 'Times New Roman';">W&CP</span></span></em><span style="color: black;"><span style="font-family: 'Times New Roman';">, pages 868-875, Chia Laguna, Sardinia, Italy, May 2010. 
Journal of Machine Learning Research.</span></span><br />BigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com0tag:blogger.com,1999:blog-1479496038060616472.post-59542784271005033072010-12-27T19:25:00.000-08:002010-12-27T19:28:23.224-08:00EurekaUndirected Grad has a <a href="http://www.informaniac.net/2010/10/ecmlpkdd-2010-highlights.html">blog post</a> about symbolic regression using a new piece of software called Eureka. I have tried it out and it is pretty effective at uncovering the latent function from the synthetic experiments I tried, such as y = logistic(x^2 + sin(x)).BigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com0tag:blogger.com,1999:blog-1479496038060616472.post-91388063673420241682010-12-27T19:23:00.000-08:002010-12-27T19:24:41.943-08:00NIPS 2010 highlightsIt was the last year in Vancouver/Whistler. So, luckily the snow conditions were good <span class="moz-smiley-s3" title=";)"><span>;)</span></span><br /><br />On the technical side:<br /><br />Switched Latent Force Models for Movement Segmentation<br />Mauricio Alvarez, Jan Peters, Bernhard Schoelkopf, Neil Lawrence<br />They modeled an input/output system governed by a linear differential equation where the input was distributed according to a switching GP. They took advantage of the fact that the derivative of a function from a GP is also GP distributed, as well as linearity properties. Therefore, the output of the system was also distributed according to a switching GP model. They used the model to segment human motion. I liked it since it was closely related to my ICML paper on GP change point models. They claimed the advantage of their method is that it enforced continuity in the time series across segment switches. 
Although this can easily be done in my setup, I am glad my paper got a citation <span class="moz-smiley-s3" title=";)"><span>;)</span></span><br /><br />Global seismic monitoring as probabilistic inference<br />Nimar Arora, Stuart Russell, Paul Kidwell, Erik Sudderth<br />They used graphical models to infer whether detections are noise (from local events near a seismic sensor) or come from a genuine seismic event (e.g. an earthquake or a nuclear test), which should be noticed by multiple seismic sensors.<br /><br />A Bayesian Approach to Concept Drift<br />Stephen Bach, Mark Maloof<br />This paper is also similar to the Adams & MacKay change point framework. They replaced the base model (UPM) with a discriminative classifier (such as Bayesian logistic regression). They admitted to fitting some of the hyper-parameters to the test set, which is cheating. However, they tried to justify it by saying that it is inappropriate to try to learn the frequency of concept drifts (change points) from training data. I don't think the argument is coherent.<br /><br />Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression<br />Ling Huang, Jinzhu Jia, Bin Yu, Byung-Gon Chun, Petros Maniatis, Mayur Naik<br />They did an analysis of programs to predict their execution time. The novelty of the paper is that they created features by "splicing" the program; they found small snippets of the program that could be executed quickly. They used the output of these snippets as features for a LASSO regression with polynomial basis functions. Polynomial basis functions are sensible since the run-time of a program is usually approximately linear, quadratic, or cubic in some aspect of its input. I pointed them to Zoubin's polybayes.m demo as a way of selecting the order of a polynomial from data. 
Symbolic regression using Eureka might also be illuminating.<br /><br />Slice sampling covariance hyperparameters of latent Gaussian models<br />Iain Murray, Ryan Adams<br />Iain presented some tricks for transforming the sample space in GP classification to drastically improve the convergence of sampling GP hyper-parameters. Iain is a fan of re-parameterizing models to spaces that make sampling easier. He claims the naive sampling method gets stuck in an "entropic barrier." He says this is a third and often ignored, but common, failure mode of MC methods. The other two are: the sampling method getting stuck in one mode of the posterior and dimensions that are highly correlated.<br /><br />Heavy-Tailed Process Priors for Selective Shrinkage<br />Fabian Wauthier, Michael Jordan<br />Fabian did GP classification while applying heavy-tailed noise to the latent GP before squashing the function through a sigmoid/probit. They claim GPC often gives overconfident predictions in sparsely sampled areas of the input space. This method claims to alleviate the problem. Since the problem does not occur in synthetic data, I asked him which underlying model assumption he thought was being violated. He believes the root cause is that the stationarity assumption in most GP kernels is inappropriate in many cases.<br /><br />Copula Processes<br />Andrew Wilson, Zoubin Ghahramani<br />It was nice to see that Andrew attracted quite a crowd at his poster.<br /><br />At the workshops I liked:<br />Natively probabilistic computation: principles and applications<br />Vikash Mansinghka, Navia Systems<br />Vikash argued that his accelerated hardware could do millions of samples per second in Gibbs sampling an MRF (1000x improvement). The hardware restricted the flexibility of what kind of sampling you could do. The loss in performance from losing that flexibility was compensated for many times over by using the hardware acceleration. 
He argues that maybe the best approach is to use simple samplers and his accelerated hardware over sophisticated samplers in software.<br /><br />There was talk about the prospect of moving to analog computation for sampling. A lot of energy is used in CPUs to make them completely deterministic with digital computation, but then in MC methods we artificially introduce randomness. Maybe it is better to do MC computations with analog hardware. However, Vikash said that we must limit the analog computation to very small accelerated units within a digital processor in order for it to be manageable. The analog element would require custom ICs, which require more funding than he currently has. However, he has selectively reduced the bit precision of many of his computations, which he says can be done when the quantities are random. This saves chip real-estate and power.BigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com0tag:blogger.com,1999:blog-1479496038060616472.post-71560691794394133752010-10-19T05:57:00.000-07:002010-10-19T06:00:12.523-07:00LaTeX A0 SummaryFollowing the well-known <a href="http://www.google.co.uk/url?sa=t&source=web&cd=1&ved=0CBkQFjAA&url=http%3A%2F%2Fwww2.imm.dtu.dk%2Fpubdb%2Fviews%2Fedoc_download.php%2F131%2Fpdf%2Fimm131.pdf&rct=j&q=latex%20one%20pager&ei=AZa9TOSNCJWSjAezx8ipAg&usg=AFQjCNEWZQSMa5C-WCNqhFOD_CaKxgdwng&cad=rja">latex-one-pager</a> there is now the <a href="http://mlg.eng.cam.ac.uk/rdturner/latexA0.pdf">LaTeX A0</a> that combines 16 pages of LaTeX summary into a poster.BigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com0tag:blogger.com,1999:blog-1479496038060616472.post-35765377070297897432010-09-20T06:32:00.000-07:002010-09-20T06:35:03.697-07:00MLSP 2010 highlightsHere are my highlights from MLSP.<br /><br />To my knowledge this was the first machine learning conference to occur within the Arctic Circle (~68° N). The conference took place on top of the gondola. 
The key highlight of the conference was the summer bobsled track from the conference center to the village. The food was mostly reindeer (in various forms) and berries ;)<br /><br />On the technical side:<br /><br />Kalman Filtering and Smoothing Solutions to Temporal Gaussian Process Regression Models<br />Simo Särkkä had a poster where he converted (almost arbitrary) stationary GP time series models into a state space model. He then used the Kalman filter to do O(T) predictions, as opposed to O(T^3) for general GPs and O(T^2) or O(T log T) with Toeplitz tricks if the time series is in discrete time. Simo's method works in continuous time as well.<br /><br />Recent directions in nonparametric Bayesian machine learning<br />Zoubin gave a lecture where he made an unapologetic advertisement for NP-Bayes.<br /><br />Tom M. Mitchell: Machine Learning for Studying Neural Representations of Word Meanings<br />An interesting talk showing the cutting edge of machine learning applied to fMRI data.<br /><br />PASS-GP: Predictive Active Set Selection for Gaussian Processes<br />A new approach to sparse GPs involving selecting a subset of data points.<br /><br />Archetypal Analysis for Machine Learning<br />Mikkel's old NIPS pal gave an enthusiastic talk on "Archetypal Analysis", which most of the MLSP crowd was unfamiliar with.BigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com2tag:blogger.com,1999:blog-1479496038060616472.post-55082112780917443452010-09-20T06:28:00.000-07:002010-09-20T06:32:17.976-07:00CBMS highlightsHere are my highlights from CBMS: the non-parametric Bayes conference at UC Santa Cruz. It was organized more like a summer school, however.<br /><br />The conference was dominated by Peter Muller, who gave ten 1.5-hour lectures on non-parametric Bayes. He talked mainly about Dirichlet processes and their generalizations: Pitman-Yor, Polya trees, etc. 
He presented a "graphical model of graphical models" demonstrating the connection between the related models. He went through each model and compared them by their predictive probability function (PPF), which is the one-step-ahead predictive distribution for the models. Notably absent from his unifying view were Gaussian processes.<br /><br />Michael Jordan gave one lecture where he went through various NP Bayes models he has worked with: LDA, IBPs, sticky HMMs, ... He didn't get too technical, but tried to give a high-level view of many models motivated by applications such as speaker diarization.<br /><br />Wes Johnson gave one lecture giving examples of NP Bayes in biology.<br /><br />Finally, Peter Hoff gave one lecture "Alternative approaches to Bayesian nonparametrics". He gave some examples of how doing Bayesian inference with an unknown Gaussian has a better predictive probability than using a DP-mixture for N <> 100 were referred to "large" and N > 5000 as "huge".<br /><br />The slides are available here:<br /><br /> http://www.ams.ucsc.edu/notesBigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com1tag:blogger.com,1999:blog-1479496038060616472.post-80536479642182681622010-07-04T10:42:00.000-07:002010-09-20T13:45:09.446-07:00Why LaTeX is superior to MS WordI recently submitted a paper to the Interdisciplinary Graduate Conference (IGC) 2010. I prepared a well-formatted 8-page LaTeX document. However, the conference was organized by humanities students who had never heard of LaTeX. They wanted a .doc file. I then had to go through the painful process of converting my LaTeX document into a Word one. After that painful experience, I could not resist writing a rant on the process. I have experienced both forms of writing: I used Word for every (lab) report in my under-grad. 
Once in my PhD program I was converted to LaTeX and have not looked back since.<br /><br /><ol><br /><li>Speed: Anything that requires a mouse and clicking through menus will be slower than one where you can write it out in a few key strokes. This means that writing characters with accents, special symbols, and especially equations will be much faster to write in LaTeX.</li><br /><ol><li>MS Word (and even worse open office) can get sluggish when editing large documents with lots of equations and figures.</li></ol><br /><li>Security: stored in plain text</li><br /><ol><li>MS Word stores its files in a bloated binary form. If a file gets corrupted for whatever reason you could be locked out of your file and many hours of work. Likewise if some bug in MS Word is causing it to crash when opening your file. With plain text source files, if all else fails you can always open and edit the file in a simple text editor.</li></ol><br /><li>Separation of context and formatting</li><br /><ol><li>The plain text style of LaTeX simply using section and subsection etc allows the writer to simply think about the logical flow of the document without worrying about superficial details such as font sizes and styles</li><br /><li>You know all your sections and subsections headings are in the correct font size. This is much harder to check using MS Word.</li></ol><br /><li>Integratable with SVN</li><br /><ol><li>Being plain text it is easier for SVN (and other revision control systems) to merge files being edited</li><br /><li>It also takes up less space on the server storing revisions</li><br /><li>It is also possible to use any diff tool to compare revisions</li><br /><ol><li>You also know the diff tool will show you ALL changes in the state of the file. There is no such guarantee when using features such as track changes in MS Word.</li></ol></ol><br /><li>Control</li><br /><ol><li>MS Word often tries to outsmart you. 
It will automatically capitalize, automatically try to select whole sentences, automatically insert bullet points, and try to infer when you’re done with a sub/superscript when writing equations. Software that tries to out-smart you will often out-dumb you. It tries too hard to infer what you want and often gets it wrong.</li><br /><ol><li>A good example is the use of - vs. -- vs. ---. Microsoft assumes most users are not smart enough to infer which type of dash to use in which situation. So MS Word tries to figure it out automatically. It can be really annoying when it gets it wrong. With LaTeX you just write what you want in 1 to 3 key strokes!</li><br /><li>There are also the quote directions `` ''. In LaTeX they are specified manually, while in Word they are automatic, which can be annoying if Word infers them wrong.</li></ol><br /><li>Notion of state: every change in the file is visible. Nothing is hidden from you in a plain text source file.</li><br /><ol><li>There is no hidden meta information</li></ol></ol><br /><li>Flexibility</li><br /><ol><li>It is easy to search for $x^2$ or \footnote in LaTeX; there is no easy way to do the analogous searches using CTRL-F in MS Word</li><br /><li>You can also search the file using regular expressions</li><br /><li>Macros for a more semantic representation can be written in just 1 line with a few key strokes</li><br /><ol><li>For example, I often define \field{} using \mathcal{} and then \R using \field{R}</li></ol><br /><li>In MS Word macros are often full-blown VB scripts</li><br /><ol><li>They should be disabled anyway since they are a security risk</li></ol></ol><br /><li>More easily scriptable</li><br /><ol><li>For instance, I have MATLAB code that exports a matrix of results in MATLAB to a LaTeX table</li><br /><li>It would require a full-blown C++/VB program in Visual Studio using all sorts of crazy APIs (and therefore reading tons of documentation) to do the same thing in MS Word</li></ol><br /><li>Speed-quality trade-off in 
formatting</li><br /><ol><li>Because MS Word has to reformat the document on every keystroke, it has to use inferior typesetting methods to prevent the GUI from becoming glacially slow</li><br /><li>Since you only recompile after significant changes in LaTeX, it can afford to use more expensive type-setting algorithms (especially for equations) that might take 10 seconds to run.</li></ol><br /><li>Interoperable</li><br /><ol><li>Since things like BibTeX are also plain text it is much easier for third parties to create applications such as JabRef. You don't get stuck using one particular reference manager. If you don't like one there are others to use instead. And if all else fails you can always edit it in Notepad.</li><br /><li>Therefore, there is no vendor lock-in and you can't get stuck using one piece of software for backward compatibility reasons that may in the future become inferior to the alternatives.</li></ol><br /><li>Misc</li><br /><ol><li>Equations in Word always come out looking crappy</li><br /><li>Figure captions aren't proper; MS Word will let them cross page boundaries for instance</li><br /><li>Footnotes are a pain, especially if you have multiple footnotes on the same page</li><br /><li>Equation numbering is a pain in Word</li><br /><li>It is possible to embed PDF figures in LaTeX, which allows for vector graphics and avoids file bloat</li><br /><ol><li>However, I believe new versions of Word allow for the insertion of EPS figures</li></ol></ol><br /><li>Cost</li><br /><ol><li>LaTeX is free while MS Office can cost a few hundred dollars</li><br /><ol><li>There is OpenOffice but that is even worse</li></ol><br /><li>I am not a "free-tard" so this is not my top concern</li></ol><br /></ol><br /><br />Advantages of MS Word:<br /><ol><br /><li>MS Word has a grammar checker. To my knowledge, none of the LaTeX editors have a grammar checker.</li><br /> <ol><li>Of course, the grammar checker should usually be taken with a grain of salt. 
However, it is good at catching typos such as interchanging it/is/if and they/then, which a spell checker will not find and are easy to glance over when proof reading.</li></ol><br /><li>The equation editor is better than it used to be. Writing an equation heavy document in word used to be almost impossible. However, it is now doable, but still much slower than LaTeX.</li><br /></ol>BigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com8tag:blogger.com,1999:blog-1479496038060616472.post-81422243260793887722010-07-04T07:08:00.000-07:002010-07-04T07:09:15.116-07:00ICML 2010 Highlights636: Sparse Gaussian Process Regression via L1 Penalization<br />Feng Yan (Purdue University); Yuan Qi (Purdue University)<br />They introduced a way to do sparse GPs for large amounts of data by adding a L1 penalization to the influence of data points. It effectively removes irrelevant data points using a convex optimization. It avoids the local optima problems of normal sparse GPs.<br /><br />I liked all the papers in the application track:<br />901: Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine<br />Thore Graepel (Microsoft Research); Joaquin Quiñonero Candela (Microsoft Research); Thomas Borchert (Microsoft Research); Ralf Herbrich (Microsoft Research)<br /><br />902: Detecting Large-Scale System Problems by Mining Console Logs<br />Wei Xu (UC Berkeley); Ling Huang (Intel Labs Berkeley); Armando Fox (UC Berkeley); David Patterson (UC Berkeley); Michael I. Jordan (UC Berkeley)<br />I liked this since it is somewhat related to my project.<br /><br />903: The Role of Machine Learning in Business Optimization<br />Chid Apte (IBM T. J. 
Watson Research Center)<br />IBM is increasing the efficiency of collecting back taxes in NY state using machine learning (which some people found scary).<br /><br />374: Local Minima Embedding<br />Minyoung Kim (CMU); Fernando De la Torre (CMU)<br />The idea is to visualize a high-dimensional objective function in a lower dimension while preserving the local optima. It's a really good idea, but the method is not good enough to help with the hard problems we would want to solve with it (i.e. visualizing local optima in high-dimensional neural network optimization).<br /><br />495: Hilbert Space Embeddings of Hidden Markov Models<br />Le Song (CMU); Byron Boots (Carnegie Mellon University); Sajid Siddiqi (Google); Geoffrey Gordon (Carnegie Mellon University); Alex Smola (Yahoo! Research)<br /><br />551: Distance Dependent Chinese Restaurant Processes<br />David Blei (Princeton University); Peter Frazier (Cornell)BigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com19tag:blogger.com,1999:blog-1479496038060616472.post-81480156884354623242010-07-04T07:05:00.000-07:002010-07-04T07:07:39.328-07:00AISTATS 2010 HighlightsThe conference kicked off with<br />Forensic Statistics: Where are We and Where are We Going?<br /><i>Richard Gill</i><br />To append to Sebastien's comment, the statistical flaws introduced by the doctors and lawyers in the case included: confusing P(data|guilty) with P(guilty|data), multiplying p-values over multiple tests, the post-hoc problem in frequentist hypothesis testing, arbitrary starting and stopping rules, violated iid assumptions, and fitting the protocol to define events to maximize significance.<br /><br />I liked<br />Reduced-rank hidden Markov models<br /><i>S. Siddiqi, B. Boots and G. 
Gordon<br /></i>Some cool stuff about closed-form alternatives to EM for training HMMs, with bounds on the loss of statistical efficiency compared to EM and without EM's local optima problem.<br /><br />Gaussian processes with monotonicity information <b><i> Jaakko Riihimäki, Aki Vehtari </i></b> ; 9:645-652, 2010.<br /><br />I learned what Phil Hennig is up to in<br />Coherent inference on optimal play in game trees<br /><i>P. Hennig, D. Stern and T. Graepel</i><br /><br />Zoubin pulled in another best paper award in<br />Learning the structure of deep sparse graphical models<br /><i>R. Adams, H. Wallach and Z. Ghahramani</i><br />Adams and Wallach used some of the MacKay magic.<br /><br />On the Sunday after the conference there was the active learning workshop. Don Rubin (co-inventor of EM and co-author on Gelman) was the invited speaker and talked about experimental design and causality. He is definitely on the stats side of the ai-stats. I asked him about measuring test set performance and he asked why I would want to divide my data set in half (definitely out of sync with the ML culture!). It was also interesting to see the philosophical cracks between him and Phil Dawid and the division between the Dawid/Rubin/Pearl views on causal inference.BigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com0tag:blogger.com,1999:blog-1479496038060616472.post-67011326903699079592010-07-04T07:02:00.000-07:002010-07-04T07:05:16.825-07:00NIPS 2009 highlightsReading Tea Leaves: How Humans Interpret Topic Models, Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang and David Blei, Princeton University<br />Really interesting idea for evaluating an unsupervised method. They use human subjects on Amazon Mechanical Turk for controlled experiments to determine the interpretability of topic model outputs. They find that better perplexity doesn't always correspond to higher interpretability by humans. 
The model with better perplexity may have found more structure in the data, but that may or may not correspond to our intuitive notions of a "topic". After talking to them there are some big open questions: 1) Do correlated topic models have better perplexity than LDA, but lower interpretability, because people don't think of words as being assigned to multiple topics OR because there are unknown models that better explain the data and are better on both measures? 2) Is it possible to use the feedback from human subjects to do semi-supervised or active learning to improve topic models? I really liked this paper, because many machine learning researchers would have devised a contrived automatic measure for interpretability, which may not reflect interpretability as judged by human subjects, which is what you really care about. This is especially true in the empirical risk minimization (ERM) community where a contrived, but automatic, measure is preferred because it is easier to optimize. Semi-supervised learning using feedback from humans in complex experiments makes it much harder to operate in terms of risk.<br /><br />Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization (Tutorial), Gunnar Martinsson<br />Encouraging results on doing large scale matrix computations.<br /><br />Sequential Monte-Carlo Methods (Tutorial), Arnaud Doucet and Nando de Freitas<br />Really clear description of bootstrap particle filters.<br /><br />Gaussian process regression with Student-t likelihood, Jarno Vanhatalo, Pasi Jylanki and Aki Vehtari<br />They claim their Laplace approximation is better than EP and MCMC.<br /><br />On Stochastic and Worst-case Models for Investing, Elad Hazan, IBM, and Satyen Kale, Yahoo! Research<br />Tried to create bounds on worst-case scenarios for finding a portfolio. 
Interesting idea, but after talking to the authors, it seems they made the implicit assumption that the market can't drop more than 50% in a day, which they didn't mention.<br /><br />Fast subtree kernels on graphs, Nino Shervashidze and Karsten Borgwardt, MPIs Tuebingen They use a kernel that can be used for regression/classification when the inputs are graphs (such as a molecular structure). They used SVMs in the paper, but it could easily be used in the GP context (which of course excites me more).<br /><br />Invited Talk: Bayesian Analysis of Markov Chains, Persi Diaconis, Stanford I was excited to finally see the much-talked-about Diaconis in the flesh.<br /><br />Machine Learning for Sustainability, J. Zico Kolter, Stanford University, Thomas Dietterich, Oregon State University, and Andrew Ng, Stanford University (mini-symposium) They really put their money where their mouth is. One of the speakers did his presentation over Skype to avoid the CO2 emitted by flying from San Francisco to Vancouver.<br /><br />Improving Existing Fault Recovery Policies, Guy Shani and<br />Christopher Meek, Microsoft Research<br />Interesting since it's similar to what I do <span class="moz-smiley-s3" title=";)"><span>;)</span></span>BigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com0tag:blogger.com,1999:blog-1479496038060616472.post-15664897876963573372008-12-27T22:13:00.000-08:002008-12-29T21:02:07.122-08:00NIPS ReportAfter attending NIPS 2008 I figure I should write up my impressions. For me, some of the highlights were<br /><br /><ol><li>Shai Ben-David's student, <a href="http://www.cs.uwaterloo.ca/%7Emackerma/">Margareta Ackerman</a>, gave a keynote presentation on the quality of clustering. After hearing her presentation and talking to her at her poster, I was unimpressed. I also think that a lot of the clustering quality stuff is BS. They create a set of axioms to evaluate the quality of a clustering algorithm. 
I find all of them to be somewhat questionable. Comparison of unsupervised algorithms, such as clustering, can be done via comparisons of the marginal likelihood. It seemed that many of the ideas involved in Bayesian model comparison were a foreign language to Ackerman. A full review of the topic deserves its own post.<br /></li><li>Han Liu, John Lafferty and Larry Wasserman presented a <a href="http://books.nips.cc/papers/files/nips21/NIPS2008_0329.pdf">paper</a> on joint sparsity regression. It builds on a previous paper where they modify L1 regularization for joint sparsity in multiple regression, causing certain factors to have zero influence for all the different regressions. They extend this to the non-parametric case. Each regression is an additive model of nonlinear functions of each input. The joint sparsity model causes certain functions to be zero everywhere for each regression. The regularization is quite strange. L1 regularization is equivalent to a MAP solution with a Laplace prior. I am not sure what equivalent priors these regularization methods have. In the non-parametric single regression case, I think the regularizer is equivalent to a GP prior on the function where the covariance function is a Kronecker delta. I have not proven that, however. A degeneracy of this model is that it causes the functions to go to zero for every input that has not been observed. Searching through the paper, I found that they used Gaussian kernel smoothing on the function afterwards to smooth it out, which seems like a bit of a hack to me. A full review of the topic deserves its own post.</li><li>Byron Yu and <a href="http://www-npl.stanford.edu/%7Ejcunnin/">John P Cunningham</a> presented a paper on Gaussian Process Factor Analysis. They used independent GPs over time as the latent factors and then used a factor-analysis-like linear transformation to explain the observed time series. They applied it to some neural spike data and got quite interesting results. 
They were able to visualize the activity of a monkey's motor cortex in 2D when throwing a ball.<br /></li><li>Ben Calderhead and Mark Girolami presented a <a href="http://nips.cc/Conferences/2008/Program/event.php?ID=1357">paper</a> on Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes. They were trying to infer the parameters of a nonlinear ODE. It was a little kludgy as their model setup violated the likelihood principle. They modeled the time series of the system state with GPs. They also modeled the derivatives of the system state via the ODE. So, they had two models for the data. They inferred the parameters by trying to minimize the difference between the two. I think they modeled the difference as a Gaussian centered at zero. By inspecting the graphical model, however, you notice that the observed variables are independent of the parameters of the ODE. So, if one wanted to be completely principled, one could not use the model to infer anything about the parameters.<br /></li><li><a href="http://saxelab.mit.edu/">Rebecca Saxe</a> gave a presentation that I found quite interesting. She showed a certain brain region that was involved in thinking about how other people were thinking. It was therefore involved in moral judgement, because moral judgements often hinge on intent. She first correlated the activity in the region with people's moral judgements about hypothetical scenarios. She later showed how she was able to use TMS on subjects and change their moral judgements about the hypothetical scenarios.</li><li>There was a lot of hype about <a href="http://research.microsoft.com/en-us/um/cambridge/projects/infernet/">infer.NET</a>. It will make exploring different models much easier. 
It seems much more powerful than <a href="http://vibes.sourceforge.net/">VIBES</a> or <a href="http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml">BUGS</a>.</li><li>I attended the causality workshop, which explored methods for inferring causation from observational data or from experiments where you don't have direct control over the variables you'd like to control. Judea Pearl gave a very enthusiastic presentation; I am not sure I would consider it a good presentation, however. There was some tension in the air between Philip Dawid and Judea Pearl over their views on causation, which have created two camps in the field. I don't think they are as far apart as they think. The divide is not as big as it is between Bayesian and frequentist, for example. Judea Pearl presented his do-calculus for inferring causation in causal graphs, which is derived using a set of axioms. Dawid gave a presentation highlighting what I hope most people already know: conditional independence in graphical models is not necessarily the same thing as causation, and nothing is as good as a randomized experiment. However, <a href="http://www.cs.ubc.ca/%7Emurphyk/">Kevin Murphy</a>, in Dawid's camp, showed one can prove all of the do-calculus rules using the IDAG. If one sets up a graphical model using input variables for causation, one can derive the do-calculus rules using the standard conditional independence properties of graphical models. Wrapping one's mind around what is the correct approach for causation is much more difficult and subtle than that for prediction. I believe this is related to the fact that it is much more difficult to get a ground truth when testing causal inference methods. 
Guyon highlighted this fact in relation to her <a href="http://www.causality.inf.ethz.ch/challenge.php">causality challenge</a>.</li><li>Shakir Mohamed presented a <a href="http://books.nips.cc/papers/files/nips21/NIPS2008_0745.pdf">paper</a> extending PCA to other data types with distributions in the exponential family. Normal PCA works under an assumption of Gaussianity in the data. EPCA can assume a Bernoulli distribution, for example.</li><li>Jurgen Van Gael presented a <a href="http://books.nips.cc/papers/files/nips21/NIPS2008_0109.pdf">paper</a> where he extended the iHMM to a factorial iHMM. He basically went from an HMM to FHMM but with iHMMs. An iHMM is an HMM with an infinite number of latent states. The transition matrix is from a hierarchical DP.<br /></li><li><a href="http://hebb.mit.edu/people/seung/">Sebastian Seung</a> gave a presentation on decoding the connectome. The connectome is basically the connection matrix between all the neurons in the brain. It is like summarizing a brain as a graph with each neuron as a node and each synapse as an edge. The difficulty of the task is converting images of 20-30 nm thick brain slices to a connection matrix. So far they have only done C. elegans, which has a mere 300 neurons. With that, scientists have reverse engineered a lot of C. elegans behaviour. They are currently working on decoding a cubic mm of mouse brain. They are using computer vision algorithms to automate and speed up the process. He alluded to the massive amounts of data involved. By my calculations, merely storing the connectome of the human brain would require 432 TB. The imagery would be vastly more. If one had the connectome matrix it would open up tons of possibilities for analysis. I would like to run spectral clustering on the graph and see how closely graph clusters correspond to anatomical structures. Of course, I don't know how one would run spectral clustering (i.e., do an eigendecomposition) on a matrix that large. 
Sebastian showed a video with 3D graphics illustrating the imaging process, which seemed like it was for the Discovery Channel. The Star Wars music in the background was a bit much ;)<br /></li><li>There was a paper on bootstrapping the ROC curve. Basically, they are trying to get confidence bounds on a ROC curve. It is important to get a sense of confidence in your performance to be sure that it was not due to random chance. It is interesting to me because I have looked into model-based approaches to estimating the ROC curve.<br /></li></ol>Obviously, this is only a small subset of NIPS. However, it will give me a lot of material when it is my turn to present in my group's weekly meetings. The list of proceedings is <a href="http://books.nips.cc/nips21.html">here</a>BigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com19tag:blogger.com,1999:blog-1479496038060616472.post-11594482806456125882008-12-27T22:00:00.000-08:002008-12-27T22:12:09.975-08:00Teamwork and Machine LearningIn many engineering programs there is a focus on how to work in teams and how to divide projects into parts. In embedded systems it may start with the division between hardware and software. Then it may be further divided by different subsystems and software libraries. I haven't seen much emphasis on the different roles in machine learning. 
I see the different categories as being<br /><br /><ol><li>Acquiring the data and getting it into a database.</li><li>Extracting the data from the database into .csv and .mat files and into a form that can be sent directly into an algorithm.</li><li>Designing new models, coding up the inference methods, and testing the algorithms on synthetic data.<br /></li><li>Creating a test bed to divide the data into training and test, evaluate different methods, and report results.<br /></li><li>Determining what feature matrices and models to use and putting everything together.</li><li>Implementing libraries that can be used in actual applications</li><li>Testing the real world libraries</li></ol><br />From what I've seen, not enough emphasis is placed on the division of these tasks, academically or industrially. I think it is most efficient to divide these tasks among different people who can be specialized. It is somewhat wasteful to take a person who is an expert in designing inference algorithms and have them spend most of their time setting up a database.BigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com0tag:blogger.com,1999:blog-1479496038060616472.post-52709855141791032472008-11-16T04:39:00.000-08:002008-11-17T03:44:56.999-08:00Properties for probabilistic modelsIt seems that with probabilistic models there are some properties that are useful for completely understanding your model. I've compiled a list of properties that I think are useful<br /><ul><li>Conditional independence properties. This should be the most apparent in the graphical models world. This is one of the reasons I find it so helpful to draw a graphical model.<br /></li><li>Invariances: scale, rotation, permutation, etc. This is one of the best ways to understand the difference between PCA and factor analysis, FA. PCA is invariant to rotations of the data, while FA is scale invariant.<br /></li><li>Exchangeability. 
It is important to understand what is exchangeable when someone says a model has exchangeability. For instance, the Indian Buffet Process, <a href="http://learning.eng.cam.ac.uk/zoubin/papers/ibp-nips05.pdf">IBP</a>, and Chinese Restaurant Process, <a href="http://videolectures.net/icml05_jordan_dpcrp/">CRP</a>, are commonly said to have exchangeability. This alone does not say much. The CRP is exchangeable when viewed as a distribution over partitions. The probability customer #1 is at the same table as customer #3 is the same as the probability customer #1 is at the same table as customer #10. It is not exchangeable when viewed as assigning table numbers to customers. After all, customer #1 will always be at table #1.</li><li>Identifiability for parameters. In some models different settings of the parameters can give the exact same likelihood for the observed data. The simplest case is the mixture of Gaussians. One could swap the means and covariances of the first and second components and it would have no effect on the likelihood of the data. Points that were previously likely to have come from component 1 are now more likely to be from component 2 and vice versa. A less trivial example exists in <a href="http://www.cs.toronto.edu/%7Eroweis/papers/NC110201.pdf">linear dynamical systems</a>. The noise in the dynamics matrix can be set to identity without any effect on the likelihood if all the other parameter matrices are adjusted accordingly. The point of this is that any preference in the posterior for one of these settings of parameters over an equivalent one will be a result of the preferences in the prior.<br /></li><li>Expected posterior predictive variance under synthetic data. This might be hard to compute but it would be interesting. In addition it could provide a sanity check on one's algorithm on either synthetic or real data.<br /></li><li>Expected log likelihood. Similar to the last idea. In other words look for the E[log-likelihood(D)]. 
This and the last example could be estimated by Monte Carlo simulation. Sample synthetic data and then do inference on it and check the likelihood. However, this will not work as a sanity check if you are looking for bugs in your code because your code was used to make the Monte Carlo estimate. It would work as a sanity check for model fitness. You could compare the likelihood of the real data to the expected likelihood under the model. This is just a sanity check. If you are looking for a more principled way of evaluating your model for a particular data set then I would recommend Bayesian model comparison.<br /></li><li>Standard statistical properties: variance, bias, consistency. Well, if you want to keep frequentists happy.<br /></li><li>Model technicalities. I've kind of made this term up here because I don't have a better name for it. In Salakhutdinov's paper on <a href="http://www.cs.toronto.edu/%7Ersalakhu/papers/nips07_pmf.pdf">Probabilistic Matrix Factorization</a> for the Netflix problem he modeled the score a particular user gave a particular movie as sampled from logistic(normal()). In other words, he assumed the score was a continuous value between 0 and 1. To make this work at all he scaled the actual scores (1 to 5) onto a 0 to 1 scale. The synthetic data will be qualitatively different from the real data. However, the model is still okay for real data. In most cases, I suppose, model technicalities don't get in the way of effective inference, but I still think one should be mindful of their existence. Another example is with clustering data points in space. One could calculate a distance matrix and then use a generative model for random matrices to model the data. However, for many distributions on matrices the synthetic data will not be embeddable in Euclidean space. 
In other words, if you draw a matrix from a distribution on matrices to get a distance matrix, it is not guaranteed you can find a set of points in Euclidean space that have that distance matrix. I would consider that a model technicality as well.</li><li>Units: people seem to forget about the units of the parameters in the model. If your observations are lengths, for instance, the mean of the distribution might be measured in m while the variance is in m^2. An implication of this is that the statement that the variance is larger than the mean is meaningless because the units are different. It's like saying the area of a table is larger than its width.<br /></li><li>Other stylistic facts of synthetic data. This is the everything else I forgot category. Sample data from the model. Does it seem reasonable? See if it has certain properties you find desirable or undesirable.<br /></li></ul>There are also the computational properties<br /><ul><li>Is the model analytically tractable? Can you get a closed form posterior predictive? Marginal likelihood? Expectation for the posterior on the parameters?<br /></li><li>Is there some inherent computational complexity, a lower bound on the time to compute the exact solution? In most cases it is not practical to prove some sort of lower bound on the Big-O order to do inference. If you can, however, the results could be very interesting. While on this topic it is interesting to ask if there is a deeper reason why the Gaussian is so much more analytically tractable to work with than other distributions. 
Is there an intuitive explanation as a result of the central limit theorem?<br /></li></ul>BigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com0tag:blogger.com,1999:blog-1479496038060616472.post-40914059124010316582008-11-11T13:27:00.000-08:002008-11-11T14:26:33.306-08:00Design Flow for Statistical/Machine Learning problemsWhile working on several real data sets I have noticed some real patterns in the design flow for doing analysis. I've also noticed this from seeing how much direction some people need when tutoring a social scientist on statistical analysis. It also differentiates me from people who tend to find solutions before the problem.<br /><br />In normal computer engineering problems there is a typical design flow one follows, such as<br /><ol><li>Formally stating the problem</li><li>Dividing the problem into software and hardware parts</li><li>Doing a block diagram</li><li>Schematic/simulations</li><li>PCB design/simulations</li><li>assembly and prototyping</li><li>test</li></ol>A more detailed version of this is the topic of another post. Anyway, it would be nice to have something like that for machine learning. I think it would go something like this<br /><br /><ol><li>Get an idea of the problem you're working on and what the data is. Talk to people with domain knowledge.</li><li>Acquire the data (not always as trivial as it sounds)</li><li>Create a mathematical representation for the data. This does not mean specifying a complete generative model yet, but something in a more mathematical form than a .csv file or SQL DB. For instance, for financial quotes, this would be a multivariate time series: a point process with real values associated with each point. However, you'll have to answer questions here about whether it is real time or trading time. How do you deal with the market being closed sometimes? Holidays? How about exchanges in different time zones? What about volumes? Limit order imbalance between bid and ask? 
Is it a bid/ask representation or mid-price/spread? These questions are answered here.</li><li>Now feature matrices have to be created. Most machine learning methods will require data in the form of a matrix. So, it is best to translate the more general abstract representation into a matrix. For the market example, some form of windowing will probably have to be used. The prices will probably also be transformed into return space as well.</li><li>Write the code to generate these feature matrices</li><li>Exploratory data analysis. This is kind of an art. One should look at ways in which you can best visualize what is going on in the data set. Look at descriptive statistics: mean, variance, kurtosis. The correlation/mutual information between variables is also a must. Dimensionality reduction is a good bet when trying to do visualization. In a sense, exploratory data analysis consists of applying simple models to the data that don't get bogged down with complex learning and latent variable aspects. They can be more crude since they are just supposed to give you an idea of what is going on. Models of the more complex type are best left for the heavy lifting when one is getting serious about prediction and inference.<br /></li><li>Need an evaluation framework: what performance measures should be used? For regression problems RMSE is often used. ROC curves are common in classification. In some cases it might be better to look at the RMSE in log scale. A review of error metrics is a post of its own. For generative models, the marginal likelihood is definitely worth looking at if you can compute it or approximate it. This will allow for model comparison/averaging. For problems involving decision making the performance of the whole system should be considered. For instance, in algorithmic trading the amount of money earned by the system should be considered in addition to the predictive accuracy. 
When the test set is small one also needs to get a confidence estimate on the performance. I am not sure if there is a good reference on finding confidence bounds on ROC/mutual information/RMSE etc. It would be great to find one.<br /></li><li>Write code for the test framework. Usually it goes something like: load the data, plot some of it, iterate over all feature extraction and model combos, do training and test on each, then plot and report each. One could make some pseudo-code for machine learning test frameworks in different types of learning. I'll leave that post for later.</li><li>Find existing code for standard models and feature extraction and plug them into the test framework. See how they do. Always try the appropriate variant of linear regression to get a baseline result. Try PCA or k-means as a basic feature extraction method.</li><li>More feature extraction. The line between modeling and feature extraction is a bit artificial in my opinion. On one hand there is pre-processing, but after that usually comes some more serious feature extraction. Many feature extraction methods such as clustering or PCA are just unsupervised models where the latent variables are used as features. In the context of generative models, these could be appended to the actual model. The only difference is that with separate feature extraction, the uncertainty over the values of the latent variables isn't propagated. This can lead to more tractability, however. Anyway, think of other feature extraction methods that might be more appropriate for your problem.</li><li>Create more sophisticated models. If you need to code up custom models then there is a whole other design flow to get them working</li><li>Evaluate all your models. Go back and change them if you find there were bad decisions along the way. 
For that matter, decisions in any of these steps can be changed if they seem bad in hindsight.</li><li>You can also add a mixture of experts model to get the best of both worlds with your models. If you have the marginal likelihood for all your models then model averaging or a generative mixture model can be used. If you're doing classification, look at your ROC curves. One can always attain the convex hull of the ROC curves.<br /></li><li>The code for most of this is usually suited for prototyping and implemented in MATLAB and the like. One will usually need to take the time to implement some real application code to get the job done.</li><li>Real world testing. You can never test too much.</li></ol><br />These steps can be divided up among people whose specialty matches them best to a certain step. Extracting data can be a big task on its own. Likewise with implementing application code. Coding up a new model is a task that can also be specialized to a certain person.BigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com1tag:blogger.com,1999:blog-1479496038060616472.post-6144123811173150582008-09-20T11:54:00.001-07:002008-09-20T12:14:03.205-07:00Coding StyleWhile at Google I got experience with code reviews. Code readability and style are taken very seriously there and can drag out code reviews. So, I looked back at comments I received in our web integrated revision control system to see what my common issues were. I found a lot of stuff about style one would usually not notice.<br /><br /><ul><li>Comment grammar. They do check for spelling and grammar in your comments. It gets kind of annoying to edit your sentences when you have to manually break at 80 cols and enter a new comment symbol. It would be cool if there was an IDE that would pop up a text box and let you type with a spell checker and a grammar checker. Then it would insert it into the code for you.</li><li>80 characters line width. Seems like those 82's always sneak in. 
Check your code with a linter to find this kind of stuff!</li><li>Put a good description of the input and output of a program at the top. In a doc comment if you're using Python</li><li>Document why you're using < vs <=. This helps avoid off-by-one errors.<br /></li><li>Variable names: They have to be descriptive, but not too long. It is a definite trade-off. I guess there is a feng shui to this one.</li><li>Capitalization style: there is usually a convention for capitalization of variables, global constants, class names and function names. Some lint tools check this.<br /></li><li>White space rules: for indentation and line continuation. (lintable)<br /></li><li>Whitespace after , or binary operators. There is also not supposed to be whitespace at the end of a line. (lintable)</li><li>Don't hard code in any numbers, filenames, or many strings. They should be global constants.</li><li>Put the units in a variable name if it is a quantity with a unit such as seconds</li><li>Use libraries whenever possible. This is why it is important to know what all the libraries do.</li><li>Avoid deprecated code.</li><li>Avoid redundant function calls. For example, using flush() before close() if close does flush automatically.</li><li>Comment almost every line. There is probably some good ratio of comment lines to code lines if you were to compile some stats. Don't explain the language in the comments; explain the program. To do otherwise would violate style rules as well.</li></ul>BigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com0tag:blogger.com,1999:blog-1479496038060616472.post-1717793131513866322008-09-20T11:21:00.000-07:002008-10-02T08:39:46.104-07:00MLSSI had a great time at MLSS, but some people dropped the ball when it came to IT and logistics. 
So there are some things people need to get right at the next MLSS.<br /><br />We need to make sure we have:<br />- IT:<br />- internet in the rooms where people are staying<br />- a lot of bandwidth available (at least 50 Mbps total)<br />- enough wireless routers to handle the load of ~200 connections with some buffer<br />- a power strip at every table for people to plug their laptops into<br />- some US power plug tables and some Euro ones, given there will be a lot of European people<br />- Some tables with Cat 5 for people who have wifi troubles<br />- A Google Map with placemarks of all the locations we need to know<br />- message boards<br />- A transport board set up a few months in advance so participants can arrange transportation plans with each other.<br />- Video recordings of the lectures for <a href="http://www.videolectures.net/">videolectures.net</a><br /><br />- Talks:<br />- Have the slides, references, labs, and code online in advance on the website (not some funky ftp server on someone's laptop)<br />- All the lectures and activities scheduled in Google Calendar<br />- Facebook and LinkedIn groups set up beforehand<br />- A whiteboard for the lecturers to write on (with markers that work. Seems whiteboards always have dead markers.)<br />- Enough lab space for everyone to fit<br />- A notebook and a pen in the backpack schwag that is given out. People don't usually remember this when packing.<br />- More practical sessions, and lectures oriented towards the practical sessions. 
I like the philosophy of <a href="http://www.cs.ubc.ca/%7Enando/">Nando De Freitas</a>: if you can't code it you don't understand it.<br />- There should also be feedback forms for the participants to fill out on the lecturers<br /><br />Other:<br />- Maybe a mini-library with a few books around like Bishop etc.<br />- A vending machine around where people are staying in case they get the munchies<br />- A <a href="http://www.whereivebeen.com/">whereivebeen</a> map showing where everyone is from<br />- An MLSS reader in one pdf with all the relevant readings that would be useful in preparation for the lectures and practical sessions<br />- Energy drinks in addition to coffee at the breaks (in Cambridge we'll probably have tea, it being the UK)<br /><br />Another possibility is for MLSS to experiment with the <a href="http://www.iclicker.com/">I-Clicker</a>. My brother had to get one for his freshman physics class. The lecturer can ask multiple choice questions and get feedback from the whole class with the remote. It gives the lecturers feedback on whether they are not explaining something in enough detail. It also would make the lectures more interactive so people would be less inclined to fall asleep.BigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com0tag:blogger.com,1999:blog-1479496038060616472.post-18205775880533307322008-09-13T12:23:00.000-07:002008-09-13T13:03:45.269-07:00PresentationsAt MLSS I've seen the spectrum of very good to very bad lectures. There seem to be a few rules that one could follow to avoid the pitfalls of the bad presentations. Such as,<br /><br />Explain things through examples rather than cryptic mathematical definitions<br />Don't explain anything too complicated with your hands. Draw everything on a board.<br />Provide a motivation and background on what you want to do.<br /><br />Being a graphical models person I can't understand anything without a graphical model anymore. 
It helps to specify the full Bayesian inference scheme as the objective even if it is intractable. It helps orient people as a target for what you are trying to achieve. Many people explain non-probabilistic methods without going through this step. However, many of these methods have an implicit graphical model which can be used to help orient people.<br /><br />It is also good to use examples. People are very good at learning from examples. Once people understand something 90% it is then appropriate to supply a cryptic mathematical definition to completely disambiguate. For example, the DP is much more understandable with the stick breaking construction than the abstract definition. In physics, thermodynamics is much easier to explain if you have one of those particle Java applets that explains an ideal gas.<br /><br />At one of the lectures at MLSS I could look back in the audience and see that 95% of people were either surfing the internet, reading a book, or staring off into space. The lecturer seemed pretty oblivious to it. Maybe he didn't even care. Many of us suspected he simply gave us a presentation he gave to his research group. He didn't take the time to make a new presentation even though his current one was very inappropriate.<br /><br />As for student presentations at the CBL, I've found there are a few reasons they are confusing:<br />1) the presenter doesn't really understand the topic themselves<br />2) Going over too much in too little time<br />3) language barrier<br /><br />When naively listening to confusing presentations it seems like the person understands the topic so much no one else can understand at their level. 
However, in my experience, when you begin asking questions it becomes clear the presenter doesn't really understand the topic.<br /><br />On the CBL wiki I've made a presentation guide with the following tips:<br /><br />Technical Things<br /><br /> * Always label graph axes and be clear about what the graph represents.<br /> o Use units if applicable (on graph axes and in general)<br /> * Use page numbers on slides that show the current page number and total number of pages<br /> * Don't forget your video adapter if you have a Mac. Mac folks always seem to forget this. Hugo has one in his desk that people can borrow.<br /> * If you're not presenting on your own computer then you should put the presentation in multiple formats: ppt, pdf, odf. Don't expect everyone else to have OpenOffice, for instance.<br /> * Also, if not using your own computer, make sure that the computer used for the presentation has the necessary software for any demos. This includes Flash Player, Java applet support, any necessary video codecs, etc.<br /> * Don't put too much text on a slide and keep fonts big enough<br /><br />Before Starting the Talk<br /><br /> * Think about what kind of talk you want to give (rough idea of an algorithm, detailed description of something, ...)<br /> o Depending on this you might not want to use too many equations (even if that leaves the slides incomplete)<br /> o Keep it simple!<br /> * Give at least one test talk<br /><br />Starting the Talk<br /><br /> * It might be best to start the presentation with material you expect most people already know. This lets you make sure everyone is on the same page. Then start to introduce new things.<br /> * It is good to establish the practical importance of whatever you're presenting. Giving an example problem helps give people context. If everything is in the abstract then things become much more confusing.<br /><br />During the Talk<br /><br /> * Always define what variables represent. 
Maybe keep them on a whiteboard on the side.<br /> o If necessary, define the dimensions of matrices and, if not obvious, the range of values variables can take (zero to one, positive, ...)<br /> * If presenting a probabilistic model then put its graphical model, in Bishop notation, in your presentation.<br /> * Give an intuitive feel for all equations and variables to the extent possible<br /> o Do this with examples and analogies<br /> * Don't try to convey any important information with your hands alone.<br /> o Never write out equations in the air with your hands (I've had a teacher who does this)<br /> * Don't be afraid to write out and derive equations on the board<br /> * In engineering problems it is always good to explain the input, output, and interface of any given system up front. If this is not clear people will get confused.<br /> * If the talk is longer than an hour or so, give breaks for caffeine, snacks, etc.<br /> * Don't rush through the slides. People should be able to read them! Explain what's going on. Depending on your presentation style (more or less tutorial-like), 2-3 minutes per slide (on average) seems good<br /><br />Voice<br /><br /> * Speak loudly, clearly, and not too fast<br /> o Mumbling technical comments on the side only confuses people<br /><br />Some Dos and Don'ts<br /><br /> * Don't point your laser pointer at people. It always seems kind of awkward.<br /> * Don't point with hands. People can't see what you're pointing at exactly. Use the laser pointer.<br /> * Don't point to everything with the laser pointer.<br /> * Do look at the audience<br /> * Do modulate your voice and be interested in your own stuff. It's not trivial to most others!<br /> * Do use examples and demos<br /> o In intro physics they always like to use those Java applets of a spring oscillator and so on. 
Try to do the same if possible/applicable.<br /><br />After the Talk<br /><br /> * Post your slides on the Presentation Archive<br /> * Use our template if you can get LaTeX to work<br /><br />References<br /><br /> * Some hints if you give a short talk<br /> * All the stuff in this guide to <a href="http://www.ece.wisc.edu/%7Ekati/PresentationGuide.ppt">Terrible Presentations</a> (…and how to not give one)BigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com0tag:blogger.com,1999:blog-1479496038060616472.post-68605541200649721572008-09-12T13:47:00.000-07:002008-09-13T12:23:15.938-07:00MLSSI have been at <a href="http://mlss08.futurs.inria.fr/">MLSS</a> for the last 12 days. We are located <a href="http://maps.google.com/maps?hl=en&ie=UTF8&ll=46.242967,-1.551175&spn=0.006426,0.013819&t=h&z=16">here.</a><br /><br />There are a lot of good people here from ETH, ParisTech, Max Planck, and so on.<br /><br />The lectures by Nando de Freitas, Richard Sutton, and Yann LeCun were the most interesting. There was an interesting lecture by Shai Ben-David on the theory of clustering. I am not sure what to think of it yet.<br /><br />Nando talked about his new search engine <a href="http://worio.com/search/">Worio</a>. It is supposed to cluster web pages when you enter terms with multiple meanings. It sounds like an interesting idea. He also showed a demo of an image search where you can refine your query by selecting images you like and don't like.<br /><br />Last Sunday many of us went on a 50 km bike ride <a href="http://maps.google.com/maps?f=d&saddr=Le+Gillieux,+France&daddr=Unknown+road+to:46.183753,-1.383934+to:La+Flotte,+France&hl=en&geocode=%3BFWrcwAIdyjfq_w%3B%3B&mra=dpe&mrcr=0&mrsp=2&sz=13&via=1,2&sll=46.183158,-1.381874&sspn=0.051462,0.11055&ie=UTF8&ll=46.195517,-1.398697&spn=0.205802,0.4422&t=h&z=11">here</a><br /><br />We've been to the beach on several days as well. The water here is about 20 ºC, much warmer than the Pacific. 
I've taken several pictures which I will need to post too.<br /><br />The internet hasn't been very good here though. We don't have internet in the rooms and the wifi can't handle the load of all the attendees.BigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com0tag:blogger.com,1999:blog-1479496038060616472.post-52155479974983419092008-09-12T13:18:00.000-07:002008-09-12T13:20:04.488-07:00Hello WorldHello world test for my new blog of cool machine learning anecdotes.BigBayeshttp://www.blogger.com/profile/04690943437385612942noreply@blogger.com0