Posts

Showing posts with the label scikit-learn

Scikit-learn sprint and 0.14 release candidate (Update: binaries available :)

Yesterday a week-long scikit-learn coding sprint in Paris ended. And let me just say: a week is pretty long for a sprint. I think most of us were pretty exhausted in the end. But we put together a release candidate for 0.14 that Gael Varoquaux tagged last night. You can install it via:
    pip install -U https://github.com/scikit-learn/scikit-learn/archive/0.14a1.zip
There are also tarballs on github and binaries on sourceforge. If you want the most current version, you can check out the release branch on github: https://github.com/scikit-learn/scikit-learn/tree/0.14.X The full list of changes can be found in what's new. The purpose of the release candidate is to give users a chance to give us feedback before the release. So please try it out and report back if you have any issues.

pystruct: more structured prediction with python

Some time ago I wrote about pystruct, a structured learning project I have been working on. After a period of inactivity, I think it has come quite a long way over the last couple of weeks as I picked up work on structured SVMs again. So here is a quick update on what you can do with it. To the best of my knowledge this is the only tool with ready-to-use functionality to learn structural SVMs (or max-margin CRFs) on loopy graphs - even though this is pretty standard in the (computer vision) literature.
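As a rough sketch of what training such a model could look like (assuming pystruct exposes a graph CRF model and a structured-SVM learner along the lines of GraphCRF and OneSlackSSVM; the exact names, signatures and toy data here are my assumptions, not taken from the post):

    import numpy as np
    from pystruct.models import GraphCRF
    from pystruct.learners import OneSlackSSVM

    # Each sample is (node_features, edges): per-node features plus the (possibly loopy) graph.
    rng = np.random.RandomState(0)
    X = [(rng.rand(4, 3), np.array([[0, 1], [1, 2], [2, 3], [3, 0]])) for _ in range(20)]
    Y = [rng.randint(0, 2, size=4) for _ in range(20)]  # one label per node

    model = GraphCRF(n_states=2, n_features=3)      # pairwise CRF over arbitrary graphs
    ssvm = OneSlackSSVM(model=model, C=0.1, max_iter=200)
    ssvm.fit(X, Y)
    print(ssvm.predict(X[:2]))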

Machine Learning Cheat Sheet (for scikit-learn)

As you hopefully have heard, we at scikit-learn are doing a user survey (which is still open by the way). One of the requests there was to provide some sort of flow chart on how to do machine learning. As this is clearly impossible, I went to work straight away. This is the result: [edit2] clarification: With ensemble classifiers and ensemble regressors I mean random forests, extremely randomized trees, gradient boosted trees, and the soon-to-come weight-boosted trees (AdaBoost). [/edit2] Needless to say, this sheet is completely authoritative.
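For reference, a minimal sketch of what the "ensemble classifiers" box corresponds to in scikit-learn code (the class names below are the current ones and an assumption on my part, since AdaBoost had not yet been released at the time):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                                  GradientBoostingClassifier, AdaBoostClassifier)

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # Random forests, extremely randomized trees, gradient boosting, and AdaBoost.
    for clf in [RandomForestClassifier(n_estimators=100),
                ExtraTreesClassifier(n_estimators=100),
                GradientBoostingClassifier(),
                AdaBoostClassifier()]:
        clf.fit(X, y)
        print(type(clf).__name__, clf.score(X, y))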

Scikit-Learn 0.13 released! We want your feedback.

After a little delay, the team finished work on the 0.13 release of scikit-learn. There is also a user survey that we launched in parallel with the release, to get some feedback from our users. There is a list of changes and new features on the website. You can upgrade using pip or easy_install:
    pip install -U scikit-learn
or
    easy_install -U scikit-learn
There were more than 60 people contributing to this release, with 24 people having 10 commits or more. Again, many improvements happened behind the scenes or are barely visible. We improved test coverage a lot and we have much more consistent parameter names now. There is now also a user guide entry for the classification metrics, and their naming was improved. This was one of the many improvements by Arnaud Joly, who joined the project very recently but nevertheless wound up with the second-most commits in this release! Now let me get to some of the more visible highlights of this release from my pers...
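Since the classification metrics now have a user guide entry and cleaner names, here is a minimal sketch of how they are typically used (the function names below are the ones I expect in sklearn.metrics and should be treated as assumptions):

    from sklearn.metrics import accuracy_score, f1_score, classification_report

    y_true = [0, 1, 1, 0, 1, 1]
    y_pred = [0, 1, 0, 0, 1, 1]

    # Scalar scores plus a per-class summary table.
    print(accuracy_score(y_true, y_pred))
    print(f1_score(y_true, y_pred))
    print(classification_report(y_true, y_pred))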

Kernel Approximations for Efficient SVMs (and other feature extraction methods) [update]

Recently we added another method for kernel approximation, the Nyström method, to scikit-learn, which will be featured in the upcoming 0.13 release. Kernel approximations were my first somewhat bigger contribution to scikit-learn and I have been thinking about them for a while. To dive into kernel approximations, first recall the kernel trick.
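The idea in a minimal sketch (assuming the transformer ended up as sklearn.kernel_approximation.Nystroem; data set and parameters are my own choices): map the data into an approximate kernel feature space, then train a plain linear classifier on top.

    from sklearn.datasets import load_digits
    from sklearn.kernel_approximation import Nystroem
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    digits = load_digits()
    X, y = digits.data, digits.target

    # Low-rank approximation of an RBF kernel feature map, followed by a linear SVM.
    clf = make_pipeline(Nystroem(kernel="rbf", gamma=0.03, n_components=300),
                        LinearSVC())
    clf.fit(X, y)
    print(clf.score(X, y))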

Another look at MNIST

I'm a bit obsessed with MNIST. Mainly because I think it should not be used in any papers any more - it is weird for a lot of reasons. When preparing the workshop we held yesterday I noticed one reason that I wasn't aware of yet: most of the 1-vs-1 subproblems are really easy! Basically all pairs of numbers can be separated perfectly using a linear classifier! And even if you just do a PCA to two dimensions, they can pretty much still be linearly separated! It doesn't get much easier than that. This makes me even more sceptical about "feature learning" results on this dataset. To illustrate my point, here are all pairwise PCA projections. The image is pretty huge. Otherwise you wouldn't be able to make out individual data points. You can generate it using this very simple gist. There are some classes that are not obviously separated: 3 vs 5, 4 vs 9, 5 vs 8 and 7 vs 9. But keep in mind, this is just a PCA to two dimensions. It doesn't mean that ...
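Not the actual gist, but a minimal sketch of the experiment behind one panel (I'm assuming MNIST is fetched via OpenML here; the original used a different downloader):

    import matplotlib.pyplot as plt
    from sklearn.datasets import fetch_openml
    from sklearn.decomposition import PCA

    # Pick one 1-vs-1 subproblem, e.g. 3 vs 8, and project it to two dimensions.
    X, y = fetch_openml("mnist_784", return_X_y=True, as_frame=False)
    mask = (y == "3") | (y == "8")
    X_2d = PCA(n_components=2).fit_transform(X[mask])

    for label in ["3", "8"]:
        pts = X_2d[y[mask] == label]
        plt.scatter(pts[:, 0], pts[:, 1], s=2, label=label)
    plt.legend()
    plt.show()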

Workshop on Python, Machine Learning and Scikit-Learn

Today there was a workshop at my uni, organized by my professor, Sven Behnke, together with my colleagues Hannes Schulz, Nenard Birešev and me. The target group was a local graduate school with a general scientific background, but not much CS or machine learning. The workshop consisted of us explaining the methods and the students then playing around with them and answering some questions using IPython notebooks that we provided (if you still don't know about IPython Notebooks, watch this talk now). Using the notebooks worked out great! There is only so much you can teach in a 5-hour workshop, but I think we got across some basic concepts of machine learning and working with data in Python. We got some positive feedback and the students really went exploring. We covered PCA, k-means, linear regression, logistic regression and nearest neighbors, including some real-world examples. You can find all resources, including tex and notebooks for generating figures etc. on gith...

A Wordcloud in Python

Last week I was at PyCon DE, the German Python conference. After hacking on scikit-learn a lot last week, I decided to do something different on my way back, something I had planned for quite a while: a Wordle-like word cloud. I know, word clouds are a bit out of style but I kind of like them anyway. My motivation to think about word clouds was that I thought they could be combined with topic models to give somewhat more interesting visualizations. So I looked around to find a nice open-source implementation of word clouds ... only to find none. (This has been a while, maybe it has changed since.) While I was bored on the train last week, I came up with this code. A little today-themed taste:
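For a sense of what using it looks like, a minimal sketch using the wordcloud package; whether this matches the original script's interface is my assumption, and the input file is hypothetical.

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    text = open("speech.txt").read()  # hypothetical input text file

    # Build the cloud from raw text and display it with matplotlib.
    cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()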

Animating Random Projections of High Dimensional Data

Recently Jake showed some pretty cool videos on his blog. This inspired me to go back to an idea I had some time ago, about visualizing high-dimensional data via random projections. I love to do exploratory data analysis with scikit-learn, using the manifold, decomposition and clustering modules. But in the end, I can only look at two (or three) dimensions. And I really like to see what I am doing.
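A minimal sketch of one frame of such an animation (the data set and the orthonormalization step are my own choices): project the data through a random orthonormal 2D basis and scatter-plot it; stringing many such frames together gives the video.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits

    digits = load_digits()
    X, y = digits.data, digits.target
    rng = np.random.RandomState(0)

    # One random 2D projection; repeat with new random bases for more frames.
    basis, _ = np.linalg.qr(rng.normal(size=(X.shape[1], 2)))
    X_2d = np.dot(X, basis)
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5)
    plt.show()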

Recap of my first Kaggle Competition: Detecting Insults in Social Commentary [update 3]

Recently I entered my first Kaggle competition - for those who don't know it, Kaggle is a site running machine learning competitions. A data set and time frame are provided and the best submission gets a money prize, often somewhere between $5000 and $50000. I found the approach quite interesting and could definitely use a new laptop, so I entered Detecting Insults in Social Commentary. My weapon of choice was Python with scikit-learn - for those who haven't read my blog before: I am one of the core devs of the project and never shut up about it.
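Not the actual submission, but a typical scikit-learn baseline for this kind of text classification task looks roughly like this (the toy comments and the vectorizer settings are made up for illustration):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy data standing in for the competition's comment/label pairs (1 = insult).
    comments = ["you are awesome", "you are an idiot",
                "have a nice day", "shut up, moron"]
    labels = [0, 1, 0, 1]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    clf.fit(comments, labels)
    print(clf.predict(["what a lovely idiot"]))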

Scikit-learn 0.12 released

Last night I uploaded the new version 0.12 of scikit-learn to pypi. Also the updated website is up and running and development now starts towards 0.13. The new release has some nifty new features (see whatsnew):
* Multidimensional scaling
* Multi-output random forests (like these)
* Multi-task Lasso
* More loss functions for ensemble methods and SGD
* Better text feature extraction
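Two of those in a minimal sketch (assuming the estimators live at sklearn.manifold.MDS and sklearn.linear_model.MultiTaskLasso; the data is random, just to show the shapes):

    import numpy as np
    from sklearn.manifold import MDS
    from sklearn.linear_model import MultiTaskLasso

    rng = np.random.RandomState(0)
    X = rng.rand(50, 10)

    # Multidimensional scaling: embed the data in 2D while preserving pairwise distances.
    X_embedded = MDS(n_components=2).fit_transform(X)

    # Multi-task Lasso: one jointly sparse linear model for several related outputs.
    Y = np.dot(X, rng.rand(10, 3))
    coef = MultiTaskLasso(alpha=0.1).fit(X, Y).coef_
    print(X_embedded.shape, coef.shape)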

Learning Gabor filters with ICA and scikit-learn

My colleague Hannes works in deep learning and started on a new feature extraction method this week. As with all feature extraction algorithms, it was obviously of utmost importance to be able to learn Gabor filters. Inspired by his work and Natural Image Statistics, a great book on the topic of feature extraction from images, I wanted to see how hard it is to learn Gabor filters with my beloved scikit-learn. I chose independent component analysis, since this is discussed in some depth in the book. Luckily mldata had some image patches that I could use for the task. Here goes:
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import fetch_mldata
    from sklearn.decomposition import FastICA

    # fetch natural image patches
    image_patches = fetch_mldata("natural scenes data")
    X = image_patches.data  # 1000 patches of 32x32

    # not that much data, so reshape into 16000 patches of 8x8
    X = X.reshape(1000, 4, 8, 4, 8)
    X = np.rollaxis(X, 3, 2).reshape(-1, 8 * 8)
...
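The excerpt cuts off there; continuing from the snippet above, fitting the ICA and looking at the learned filters would, in a minimal sketch, look something like this (the number of components and the plotting grid are my own choices, not from the post):

    ica = FastICA(n_components=49)
    ica.fit(X)

    # Each row of components_ is one learned filter; reshape back to 8x8 to display it.
    fig, axes = plt.subplots(7, 7, figsize=(7, 7))
    for component, ax in zip(ica.components_, axes.ravel()):
        ax.imshow(component.reshape(8, 8), cmap="gray")
        ax.set_xticks(())
        ax.set_yticks(())
    plt.show()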

Generating Data for benchmarking clustering algorithms

As I have been working on some clustering algorithms recently, I invested some time last weekend to refactor some code inside sklearn to generate some toy data sets to visualize the results of clustering algorithms. That looks something like this: While the first two are nice to show off that your algorithm can handle non-convex clusters, these data sets obviously look nothing like the data you'll see in practice. So I wanted a somewhat more general data set generator. What I ended up doing is a nonparametric mixture of Gaussians. While Gaussians are a bit boring, combining them with a non-parametric prior makes them somewhat more general. As I didn't find an easy-to-use package to do that (though David pointed out pymc) I went ahead and wrote the generative model down myself. It's a mixture of Gaussians with a Chinese restaurant process as prior for the mixture components and Wishart-Gaussian priors for mean and variance. You can find the ...
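Not the actual generator, but a minimal sketch of the sampling scheme (simplified to fixed isotropic covariances instead of the Normal-Wishart prior; all parameter values are made up):

    import numpy as np

    def crp_gaussian_mixture(n_samples=500, alpha=2.0, dim=2, seed=0):
        rng = np.random.RandomState(seed)
        means, counts, X, labels = [], [], [], []
        for _ in range(n_samples):
            # CRP step: join an existing cluster with probability proportional to
            # its size, or open a new cluster with probability proportional to alpha.
            probs = np.array(counts + [alpha], dtype=float)
            probs /= probs.sum()
            k = rng.choice(len(probs), p=probs)
            if k == len(means):
                means.append(rng.normal(scale=5.0, size=dim))  # new cluster center
                counts.append(0)
            counts[k] += 1
            X.append(rng.normal(loc=means[k], scale=1.0, size=dim))
            labels.append(k)
        return np.array(X), np.array(labels)

    X, y = crp_gaussian_mixture()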

Matplotlib Errorbars Weirdness

Some time ago, I was having trouble with using error bars with a legend in matplotlib. I did some hack to fix it back then. Today, I came across the same issue again at the scikit-learn sprint. Obviously this time we wanted to do it right^TM. The problem is that when doing a plot with error bars, the colors in the legend don't correspond to the colors of the lines. The problem seems to be solved in newer versions, though. After fiddling a bit with it I found the problem: the errorbar command actually returns a collection of artists, corresponding to the line AND the individual error bars. That seems to really confuse the legend. You can solve the problem by doing
    errorbar = pl.errorbar(x, y, err)
    line = errorbar[0]
    pl.legend([line], ['the line'])
Taking the zeroth element of the tuple extracts just the line from the collection of artists and everything works out :)

scikits.learn aka sklearn 0.9 released

My favourite machine learning library has got a major upgrade (unless you're using the dev version like me ;) Important new features include manifold learning, Dirichlet process mixture models and dataset downloader / import functions. You can find the changelog here.
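A minimal sketch of the manifold-learning side, plus a Dirichlet process mixture (Isomap has kept a stable interface; BayesianGaussianMixture is a much later class name that I am using as a stand-in for illustration, not what 0.9 called it):

    from sklearn.datasets import load_digits
    from sklearn.manifold import Isomap
    from sklearn.mixture import BayesianGaussianMixture

    digits = load_digits()
    X, y = digits.data, digits.target

    # Manifold learning: non-linear 2D embedding of the digits.
    X_2d = Isomap(n_components=2).fit_transform(X)

    # Dirichlet process mixture: lets the data decide how many of the 20 components to use.
    dpgmm = BayesianGaussianMixture(n_components=20,
                                    weight_concentration_prior_type="dirichlet_process")
    clusters = dpgmm.fit_predict(X_2d)
    print(len(set(clusters)))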