Shogun - A Large Scale Machine Learning Toolbox

This is the official homepage of the SHOGUN machine learning toolbox.

The machine learning toolbox's focus is on large scale kernel methods and especially on Support Vector Machines (SVM) [1]. It provides a generic SVM object interfacing to several different SVM implementations, among them the state of the art OCAS [21], Liblinear [20], LibSVM [2], SVMLight, [3] SVMLin [4] and GPDT [5]. Each of the SVMs can be combined with a variety of kernels. The toolbox not only provides efficient implementations of the most common kernels, like the Linear, Polynomial, Gaussian and Sigmoid Kernel but also comes with a number of recent string kernels as e.g. the Locality Improved [6], Fischer [7], TOP [8], Spectrum [9], Weighted Degree Kernel (with shifts) [10] [11] [12]. For the latter the efficient LINADD [12] optimizations are implemented. For linear SVMs the COFFIN framework [22][23] allows for on-demand computing feature spaces on-the-fly, even allowing to mix sparse, dense and other data types. Furthermore, SHOGUN offers the freedom of working with custom pre-computed kernels. One of its key features is the combined kernel which can be constructed by a weighted linear combination of a number of sub-kernels, each of which not necessarily working on the same domain. An optimal sub-kernel weighting can be learned using Multiple Kernel Learning [13] [14] [18] [19]. Currently SVM one-class, 2-class and multiclass classification and regression problems can be dealt with. However SHOGUN also implements a number of linear methods like Linear Discriminant Analysis (LDA), Linear Programming Machine (LPM), (Kernel) Perceptrons and features algorithms to train hidden markov models. The input feature-objects can be dense, sparse or strings and of type int/short/double/char and can be converted into different feature types. Chains of preprocessors (e.g. substracting the mean) can be attached to each feature object allowing for on-the-fly pre-processing.

SHOGUN is implemented in C++ and interfaces to Matlab(tm), R, Octave and Python and is proudly released as Machine Learning Open Source Software.

We took part in Google Summer of Code 2011

Thanks to the work of 5 hard working and talented students, we now have various new features implemented in shogun: Interfaces to new languages like java, c#, ruby, lua written by Baozeng; A model selection framework written by Heiko Strathman, many dimension reduction techniques written by Sergey Lisitsyn, Gaussian Mixture Model estimation written by Alesis Novik and a full-fledged online learning framework developed by Shashwat Lal Das. All of this work has been integrated in shogun 1.0.0. In case you want to know more about shogun check out the documentation and read our overview paper:

Soeren Sonnenburg, Gunnar Raetsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien,
Fabio de Bona, Alexander Binder, Christian Gehl, and Vojtech Franc.
The SHOGUN Machine Learning Toolbox. Journal of Machine Learning Research, 11:1799-1802, June 2010.

Download Releases

SHOGUN Version 2.1.0 (lib 13.0, data 0.5, param 1)
(updated 17.03.2013)

Source Code (ftp | http)
Source Code md5sum (ftp | http)
Source Code PGP Signature (ftp | http)
Data (ftp | http)
Data md5sum (ftp | http)

Older Versions

This release also contains several enhancements, cleanups and bugfixes:

Features:
- Linear Time MMD two-sample test now works on streaming-features, which allows to perform tests on infinite amounts of data. A block size may be specified for fast processing. The below features were also added. By Heiko Strathmann.
- It is now possible to ask streaming features to produce an instance of streamed features that are stored in memory and returned as a CFeatures* object of corresponding type. See CStreamingFeatures::get_streamed_features().
- New concept of artificial data generator classes: Based on streaming features. First implemented instances are CMeanShiftDataGenerator and CGaussianBlobsDataGenerator. Use above new concepts to get non-streaming data if desired.
- Accelerated projected gradient multiclass logistic regression classifier by Sergey Lisitsyn.
- New CCSOSVM based structured output solver by Viktor Gal
- A collection of kernel selection methods for MMD-based kernel two- sample tests, including optimal kernel choice for single and combined kernels for the linear time MMD. This finishes the kernel MMD framework and also comes with new, more illustrative examples and tests. By Heiko Strathmann.
- Alpha version of Perl modular interface developed by Christian Montanari.
- New framework for unit-tests based on googletest and googlemock by Viktor Gal. A (growing) number of unit-tests from now on ensures basic funcionality of our framework. Since the examples do not have to take this role anymore, they should become more ilustrative in the future.
- Changed the core of dimension reduction algorithms to the Tapkee library.
Bugfixes:
- Fix for shallow copy of gaussian kernel by Matt Aasted.
- Fixed a bug when using StringFeatures along with kernel machines in cross-validation which cause an assertion error. Thanks to Eric (yoo)!
- Fix for 3-class case training of MulticlassLibSVM reported by Arya Iranmehr that was suggested by Oksana Bayda.
- Fix for wrong Spectrum mismatch RBF construction in static interfaces reported by Nona Kermani.
- Fix for wrong include in SGMatrix causing build fail on Mac OS X (thanks to @bianjiang).
- Fixed a bug that caused kernel machines to return non-sense when using custom kernel matrices with subsets attached to them.
- Fix for parameter dictionary creationg causing dereferencing null pointers with gaussian processes parameter selection.
- Fixed a bug in exact GP regression that caused wrong results.
- Fixed a bug in exact GP regression that produced memory errors/crashes.
- Fix for a bug with static interfaces causing all outputs to be
- 1/+1 instead of real scores (reported by Kamikawa Masahisa).
Cleanup and API Changes:
- SGStringList is now based on SGReferencedData.
- "confidences" in context of CLabel and subclasses are now "values".
- CLinearTimeMMD constructor changes, only streaming features allowed.
- CDataGenerator will soon be removed and replaced by new streaming- based classes.
- SGVector, SGMatrix, SGSparseVector, SGSparseVector, SGSparseMatrix refactoring: Now contains load/save routines, relevant functions from CMath, and implementations went to .cpp file.

Documentation and Examples

We use Doxygen for both user and developer documentation which may be read online here. More than 600 documented examples for the interfaces python_modular, octave_modular, r_modular, static python, static matlab and octave, static r, static command line and C++ libshogun developer interface can be found in the online documentation. In addition, examples are shipped in the examples/(un)documented/[interface] directory in the source code (where interface is one of r, octave, matlab, python, python_modular, r_modular, octave_modular, cmdline, libshogun).

English

Chinese

Note that documentation for python-modular is most complete and also that python's help function will show the documentation when working interactively:

$ python
Python 2.4.4 (#2, Jan  3 2008, 13:36:28) 
[GCC 4.2.3 20071123 (prerelease) (Debian 4.2.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from shogun.Classifier import SVM
>>> help(SVM)

class SVM(CSVM)
 |  Method resolution order:
 |      SVM
 |      CSVM
 |      CKernelMachine
 |      Classifier
 |      SGObject
 |      __builtin__.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, kernel, alphas, support_vectors, b)
[...]

Below we provide some of the (in the meantime outdated) examples that were used to carry out experiments for a number of publications. Note that more than 600 examples and updated versions of all of these can also be found in the source code and in the online documentation.

Click on the corresponding link to see classification and regression examples for Matlab(tm), R, Octave or Python:	Below one finds some Bioinformatics examples (for octave and matlab) as presented at BOSC 2006:	Multiple Kernel Learning examples (JMLR 2006 paper "Large Scale Multiple Kernel Learning"):
R Interface Examples Octave Interface Examples Python Interface Examples Matlab Interface Examples	Spectrum Kernel Weighted Degree Kernel Weighted Degree Kernel with Shifts	MKL for classifying x-mas stars MKL for regression MKL for mixture of sine waves

Developer Information

Want to contribute ? We maintain SHOGUNs source code via git and are looking forward to your patches!

Class Design and Source Code

If you are interested in developing C++ applications with libshogun or want to extend shogun read the developer tutorial
```
http://www.shogun-toolbox.org/doc/en/current/developer_tutorial.html
```

Check out some basic examples on how to develop with libshogun

http://www.shogun-toolbox.org/doc/en/current/libshogun_examples.html

To browse the most up-to-date source code use
```
https://github.com/shogun-toolbox/shogun
```

To access the up-to-date source code clone

git clone git://github.com/shogun-toolbox/shogun.git

To access themost up-to-date data sets clone

git clone git://github.com/shogun-toolbox/shogun-data.git

We keep mirrors for source code and data also at shogun-toolbox.org:

git clone git://shogun-toolbox.org/shogun.git
git clone git://shogun-toolbox.org/shogun-data.git

Feature matrix

The pdf document with the machine learning toolbox feature comparison that we originally submitted to JMLR can be found here. An up-to-date version of this matrix is located at Google Spreadsheet. Please notify us about possible corrections and changes.

A comparison of shogun with the popular machine learning toolboxes weka, kernlab, dlib, nieme, orange, java-ml, pyML, mlpy, pybrain, torch3, scikit-learn. A '?' denotes unkown, '-' feature is missing. This table is availabe as a google spreadsheet.

	feature	shogun	weka	kernlab	dlib	nieme	orange	java-ml	pyML	mlpy	pybrain	torch3	scikit-learn
General Features	Graphical User Interface
	One Class Classification
	Classification
	Multiclass classification
	Regression
	Structured Output Learning
	Pre-Processing
	Built-in Model Selection Strategies
	Visualization
	Test Framework
	Large Scale Learning
	Semi-supervised Learning
	Multitask Learning
	Domain Adaptation
	Serialization
	Parallelized Code
	Performance Measures (auROC etc)
	Image Processing

Supported Operating Systems	Linux
	Windows
	Mac OSX
	Other Unix

Language Bindings	Python
	R
	Matlab
	Octave
	C/C++
	Command Line
	Java
	C#
	Lua
	Ruby

SVM Solvers	SVMLight
	LibSVM
	SVM Ocas
	LibLinear
	BMRM
	LaRank
	SVMPegasos
	SVM SGD
	other

Regression	Kernel Ridge Regression
	Support Vector Regression
	Gaussian Processes
	Relevance Vector Machine

Multiple Kernel Learning	MKL
	q-norm MKL

Classifiers	Naive Bayes
	Bayesian Networks
	Multi Layer Perceptron
	RBF Networks
	Logistic Regression
	LASSO
	Decision Trees
	k-NN

Linear Classifiers	Linear Programming Machine
	LDA

Distributions	Markov Chains
	Hidden Markov Models

Kernels	Linear
	Gaussian
	Polynomial
	String Kernels
	Sigmoid Kernel
	Kernel Normalizer

Feature Selection	Forward
	Wrapper methods
	Recursive Feature Selection

Missing Features	Mean value imputation
	EM-based/model based imputation

Clustering	Hierarchical Clustering
	k-means

Optimization	BFGS
	conjugate gradient
	gradient descent
	bindings to CPLEX
	bindings to Mosek
	bindings to other solver

Supported File Formats	Binary
	Arff
	HDF5
	CSV
	libSVM/ SVMLight format
	Excel

Supported Data Types	Sparse Data Representation
	Dense Matrices
	Strings
	Support for native (e.g. C) types (char, signed and unsigned int8, int16, int32, int64, float, double, long double)

References

[1]	C.Cortes and V.N. Vapnik. Support-vector networks. Machine Learning, 20(3):273--297, 1995.

[2]	C.-C. Chang and C.-J. Lin, LIBSVM : a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

[3]	T.Joachims. Making large-scale SVM learning practical. In B.Schoelkopf, C.J.C. Burges, and A.J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 169--184, Cambridge, MA, 1999. MIT Press.

[4]	V. Sindhwani, S. S. Keerthi. Large Scale Semi-supervised Linear SVMs. SIGIR, 2006.

[5]	L. Zanni, T. Serafini, G. Zanghirati. Parallel Software for Training Large Scale Support Vector Machines on Multiprocessor Systems. JMLR 7(Jul), 1467-1492, 2006.

[6]	A.Zien, G.Raetsch, S.Mika, B.Schoelkopf, T.Lengauer, and K.-R. Mueller. Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites. Bioinformatics, 16(9):799-807, September 2000.

[7]	T.S. Jaakkola and D.Haussler.Exploiting generative models in discriminative classifiers. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 487-493, 1999.

[8]	K.Tsuda, M.Kawanabe, G.Raetsch, S.Sonnenburg, and K.R. Mueller. A new discriminative kernel from probabilistic models. Neural Computation, 14:2397--2414, 2002.

[9]	C.Leslie, E.Eskin, and W.S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In R.B. Altman, A.K. Dunker, L.Hunter, K.Lauderdale, and T.E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing, pages 564-575, Kaua'i, Hawaii, 2002.

[10]	(1, 2, 3) G.Raetsch and S.Sonnenburg. Accurate Splice Site Prediction for Caenorhabditis Elegans, pages 277-298. MIT Press series on Computational Molecular Biology. MIT Press, 2004.

[11]	(1, 2) G.Raetsch, S.Sonnenburg, and B.Schoelkopf. RASE: recognition of alternatively spliced exons in c. elegans. Bioinformatics, 21:i369--i377, June 2005.

[12]	(1, 2) S.Sonnenburg, G.Raetsch, and B.Schoelkopf. Large scale genomic sequence SVM classifiers. In Proceedings of the 22nd International Machine Learning Conference. ACM Press, 2005.

[13]	(1, 2) S.Sonnenburg, G.Raetsch, and C.Schaefer. Learning interpretable SVMs for biological sequence classification. In RECOMB 2005, LNBI 3500, pages 389-407. Springer-Verlag Berlin Heidelberg, 2005.

[14]	(1, 2) G.Raetsch, S.Sonnenburg, and C.Schaefer. Learning Interpretable SVMs for Biological Sequence Classification. BMC Bioinformatics, Special Issue from NIPS workshop on New Problems and Methods in Computational Biology Whistler, Canada, 18 December 2004, 7:(Suppl. 1):S9, March 2006.

[15]	S.Sonnenburg.New methods for splice site recognition. Master's thesis, Humboldt University, 2002. supervised by K.-R. Mueller H.-D. Burkhard and G.Raetsch.

[16]	S.Sonnenburg, G.Raetsch, A.Jagota, and K.-R. Mueller. New methods for splice-site recognition. In Proceedings of the International Conference on Artifical Neural Networks, 2002. Copyright by Springer.

[17]	S.Sonnenburg, A.Zien, and G.Raetsch. ARTS: Accurate Recognition of Transcription Starts in Human. 2006. (accepted).

[18]	S.Sonnenburg, G.Raetsch, C.Schaefer, and B.Schoelkopf,Large Scale Multiple Kernel Learning, Journal of Machine Learning Research, 2006, K.Bennett and E.P.-Hernandez Editors

[19]	M.Kloft, U.Brefeldt, S.Sonnenburg, A.Zien, P.Laskov, K.-R. Mueller, Efficient and Accurate Lp-Norm Multiple Kernel Learning, Advances in Neural Information Processing Systems 21, MIT Press, Cambridge, MA,2009

[20]	R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification, Journal of Machine Learning Research 9(2008), 1871-1874. Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear/

[21]	V. Franc, S. Sonnenburg. Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization, Journal of Machine Learning Research 10(2009), 2157--2192, Software available at http://jmlr.csail.mit.edu/papers/v10/franc09a.html

[22]	S. Sonnenburg, V. Franc. COFFIN: A Computational Framework for Linear SVMs, Research Report, Center for Machine Perception, K13133 FEE Czech Technical University, 2009

[23]	S. Sonnenburg, V. Franc. COFFIN: A Computational Framework for Linear SVMs. Proceedings of the 27nd International Machine Learning Conference, 2010.

Shogun - A Large Scale Machine Learning Toolbox

We took part in Google Summer of Code 2011

Screenshots

Getting Started

Applications

Licensing Information

Cite us

Download Releases

This release also contains several enhancements, cleanups and bugfixes:

Features:

Bugfixes:

Cleanup and API Changes:

Documentation and Examples

English

Chinese

Publications and Presentations

Bug-Reports, Mailinglist, Planet

IRC and Contact

Developer Information

Class Design and Source Code

Related Projects

Feature matrix

Acknowlegements

References