User Manual
A comprehensive user manual is here. This page contains a few examples for basic learning pipelines and is a great place to start.
GURLS (GURLS++) basically consists of a set of tasks, each one belonging to a predefined category, and of a method (a class in the C++ implementation) called GURLS Core that is responsible for processing an ordered sequence of tasks called a pipeline. An additional "options structure", often referred to as OPT, is used to store all configuration parameters needed to customize the tasks' behaviour. Tasks receive configuration parameters from the options structure in read-only mode and, after terminating, their results are appended to the structure by the GURLS Core in order to make them available to the subsequent tasks. This allows the user to easily skip the execution of some tasks in a pipeline, by simply inserting the desired results directly into the options structure (a sketch of this mechanism is given after the first example below). All tasks belonging to the same category can be interchanged with each other, so that the user can easily choose how each task shall be carried out.
GURLS Design
The gurls command accepts exactly four arguments:
- The NxD input data matrix (N is the number of samples, D is the number of variables).
- The NxT output labels matrix (T is the number of outputs. For (multi-class) classification, labels +1 and -1 must be in the One-Vs-All format).
- An options' structure.
- A job-id number.
The three main fields in the options' structure are:
- opt.name: defines a name for a given experiment.
- opt.seq: specifies the sequence of tasks to be executed.
- opt.process: specifies what to do with each task, using the following codes:
- 0 = Ignore
- 1 = Compute
- 2 = Compute and save
- 3 = Load from file
- 4 = Explicitly delete
The gurls command executes an ordered sequence of tasks, i.e. the pipeline, specified in the field seq of the options' structure as
{'<CATEGORY1>:<TASK1>';'<CATEGORY2>:<TASK2>';...}
These tasks can be combined in order to build different train-test pipelines. The most popular learning pipelines are outlined in the following.
We want to run training on a dataset {Xtr,ytr} and testing on a different dataset {Xte,yte}. We are interested in the precision-recall performance measure as well as the average classification accuracy. In order to train a linear classifier using a leave-one-out cross-validation approach, we just need the following lines of code:
```matlab
name = 'ExampleExperiment';
opt = defopt(name);
opt.seq = {'paramsel:loocvprimal','rls:primal','pred:primal','perf:precrec','perf:macroavg'};
opt.process{1} = [2,2,0,0,0];
opt.process{2} = [3,3,2,2,2];
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
```
The meaning of the above code fragment is the following:
- For the training data: calculate the regularization parameter lambda, optimizing the classification accuracy via leave-one-out cross-validation, and save the result; solve RLS for a linear classifier in the primal space and save the solution; ignore the rest.
- For the test data set: load the selected lambda (this is important if you want to save this value for further reference) and load the classifier; predict the output on the test set and save it; evaluate the two aforementioned performance measures and save them.
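The task-skipping mechanism described in the design overview can be exercised here. Below is a minimal sketch, assuming the parameter-selection tasks store their result in the field opt.paramsel.lambdas (the field read by the rls tasks) and using a hypothetical pre-computed value; the paramsel task is then dropped from the pipeline entirely:

```matlab
% Minimal sketch: skip parameter selection by injecting a pre-computed
% regularization parameter (hypothetical value) into the options structure.
name = 'SkipParamselExample';
opt = defopt(name);
opt.paramsel.lambdas = 0.01;   % assumed field name; normally filled in by a paramsel task
opt.seq = {'rls:primal','pred:primal','perf:macroavg'};
opt.process{1} = [2,0,0];      % training: compute and save the classifier only
opt.process{2} = [3,2,2];      % test: load the classifier, then predict and evaluate
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
```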
The field opt.name is implicitly specified by the defopt function, which assigns to it its only input argument. Fields opt.seq and opt.process have to be explicitly assigned.
To select the regularization parameter via hold-out validation on z-score normalized data:

```matlab
name = 'ExampleExperiment';
opt = defopt(name);
[Xtr] = norm_zscore(Xtr, ytr, opt);
[Xte] = norm_testzscore(Xte, yte, opt);
opt.seq = {'split:ho','paramsel:hoprimal','rls:primal','pred:primal','perf:macroavg'};
opt.process{1} = [2,2,2,0,0];
opt.process{2} = [3,3,3,2,2];
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
```
Here the training set is first normalized and the column-wise means and standard deviations are saved to file. Then the test data are normalized according to the statistics computed on the training set.
The same linear classifier can be trained in the dual space with a linear kernel:

```matlab
name = 'ExampleExperiment';
opt = defopt(name);
opt.seq = {'kernel:linear','paramsel:loocvdual','rls:dual','pred:dual','perf:macroavg'};
opt.process{1} = [2,2,2,0,0];
opt.process{2} = [3,3,3,2,2];
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
```
The following pipeline performs regression:

```matlab
name = 'ExampleExperiment';
opt = defopt(name);
opt.seq = {'paramsel:hoprimal','rls:primal','pred:primal','perf:rmse'};
opt.process{1} = [2,2,0,0];
opt.process{2} = [3,3,2,2];
opt.hoperf = @perf_rmse;
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
```
Here GURLS is used for regression. Note that the objective function is explicitly set to @perf_rmse, i.e. root mean square error, whereas in the first example opt.hoperf is set to its default @perf_macroavg, which evaluates the average classification accuracy per class. The same code can be used for multiple output regression, as sketched below.
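For instance, only the shape of the label matrices changes in the multiple output case. A minimal sketch with hypothetical toy outputs (T = 3) follows:

```matlab
% Hypothetical toy outputs: T = 3 real-valued outputs per sample.
% The regression pipeline above is reused unchanged.
T = 3;
ytr = randn(size(Xtr,1), T);   % Ntr x T training outputs
yte = randn(size(Xte,1), T);   % Nte x T test outputs
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
```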
A nonlinear classifier can be trained with a Gaussian kernel:

```matlab
name = 'ExampleExperiment';
opt = defopt(name);
opt.seq = {'paramsel:siglam','kernel:rbf','rls:dual','predkernel:traintest','pred:dual','perf:macroavg'};
opt.process{1} = [2,2,2,0,0,0];
opt.process{2} = [3,3,3,2,2,2];
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
```
Here parameter selection for the Gaussian kernel requires selection of both the regularization parameter lambda and the kernel parameter sigma, and is performed by selecting the task siglam for the category paramsel. Once the value for the kernel parameter sigma has been chosen, the Gaussian kernel is built through the kernel task with option rbf.
The kernel and regularization parameters can also be selected via hold-out validation:

```matlab
name = 'ExampleExperiment';
opt = defopt(name);
opt.seq = {'split:ho','paramsel:siglamho','kernel:rbf','rls:dual','predkernel:traintest','pred:dual','perf:macroavg'};
opt.process{1} = [2,2,2,2,0,0,0];
opt.process{2} = [3,3,3,3,2,2,2];
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
```
To train the linear classifier with stochastic gradient descent instead:

```matlab
name = 'ExampleExperiment';
opt = defopt(name);
opt.seq = {'paramsel:calibratesgd','rls:pegasos','pred:primal','perf:macroavg'};
opt.process{1} = [2,2,0,0];
opt.process{2} = [3,3,2,2];
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
```
Here the optimization is carried out using a stochastic gradient descent algorithm, namely Pegasos (Shalev-Shwartz, Singer and Srebro, 2007). Note that the above pipeline uses the default value for option 'subsize' (50). If such a value is used with data sets of fewer than 50 samples, the following error will be displayed:

```
GURLS usage error: the option subsize of the option list must be smaller than the number of training samples!!
```
Set

```matlab
opt.subsize = subsize;
```

with subsize smaller than the number of training samples to avoid this error; a one-line guard is sketched below.
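This is a minimal sketch (not part of the original examples), keeping the default whenever the training set is large enough:

```matlab
% Keep the default subsize of 50 when possible, otherwise stay strictly
% below the number of training samples, as Pegasos calibration requires.
opt.subsize = min(50, size(Xtr,1) - 1);
```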
To use the Random Features approach:

```matlab
name = 'ExampleExperiment';
opt = defopt(name);
opt.seq = {'split:ho','paramsel:horandfeats','rls:randfeats','pred:randfeats','perf:macroavg'};
opt.process{1} = [2,2,2,0,0];
opt.process{2} = [3,3,3,2,2];
gurls(Xtr, ytr, opt, 1)
gurls(Xte, yte, opt, 2)
```
This computes a classifier for the primal formulation of RLS using the Random Features approach proposed by Rahimi and Recht (2007). In this approach the primal formulation is used in a new space, built through random projections of the input data.
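The underlying idea can be sketched outside of GURLS: map the inputs through random directions and a cosine nonlinearity, which approximates a Gaussian kernel, and then train a linear (primal) model on the mapped data. The following stand-alone sketch assumes a kernel width sigma and a feature count D chosen purely for illustration:

```matlab
% Random Fourier features sketch (Rahimi & Recht, 2007); stand-alone
% illustration, not GURLS library code. Maps x -> sqrt(2/D)*cos(W*x + b)
% so that inner products approximate a Gaussian kernel of width sigma.
D = 500;                               % number of random features (assumed)
sigma = 1;                             % kernel width (assumed)
W = randn(D, size(Xtr,2)) / sigma;     % random projection directions
b = 2*pi*rand(D,1);                    % random phases
Ztr = sqrt(2/D) * cos(W*Xtr' + b)';    % Ntr x D mapped training data
Zte = sqrt(2/D) * cos(W*Xte' + b)';    % Nte x D mapped test data
% Ztr/Zte can now be fed to the linear (primal) pipelines shown earlier.
```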
The GURLS class implements the GURLS Core. Its only method, run, executes the learning pipeline and is the main method the user would directly call. It accepts exactly four arguments:
- The NxD input data matrix (N is the number of samples, D is the number of variables).
- The NxT labels matrix (T is the number of outputs. For (multi-class) classification, labels +1 and -1 must be in the One-Vs-All format).
- An options' structure.
- A job-id number.
For each process to be run (e.g. one for training and one for testing), GURLS.run needs to be invoked again.
The options' structure is built through the GurlsOptionsList class with default fields and values. The three main fields in the options' structure are:
- name: identifies the file where results shall be saved.
- seq: specifies the (ordered) sequence of tasks, i.e. the pipeline, to be executed. Each task is defined by providing a task category and a choice amongst those available for that category; e.g. with "optimizer:rlsprimal" one sets the optimizer to be Regularized Least Squares in the primal space (see Section GURLS++ Available Methods for the available categories and the choices for each category).
- process: specifies what to do with each task. Possible instructions are: ignore, compute, computeNsave, load, delete.
In the 'demo' directory you will find GURLSloocvprimal.cpp. The meaning of the demo is the following:
- For the training data: calculate the regularization parameter, optimizing the classification accuracy via leave-one-out cross-validation, and save the result; solve RLS for a linear classifier in the primal space and save the solution; ignore the rest.
- For the test data set: load the selected regularization parameter (this is important if you want to save this value for further reference) and load the classifier; predict the output on the test set and save it; evaluate the average classification accuracy as well as the precision-recall and save them.

In the following we report and comment the salient parts of the demo. First we load the data from the CSV files:
```cpp
gMat2D<T> Xtr, Xte, ytr, yte;
Xtr.readCSV(xtr_file);
Xte.readCSV(xte_file);
ytr.readCSV(ytr_file);
yte.readCSV(yte_file);
```
then initialize an object of class GURLS and build an options' list, assigning it a name, in this case "Gurlslooprimal"
```cpp
GURLS G;
GurlsOptionsList* opt = new GurlsOptionsList("Gurlslooprimal", true);
```
specify the task sequence
```cpp
OptTaskSequence *seq = new OptTaskSequence();
*seq << "paramsel:loocvprimal" << "optimizer:rlsprimal"
     << "pred:primal" << "perf:macroavg" << "perf:precrec";
opt->addOpt("seq", seq);
```
initialize the process option
```cpp
GurlsOptionsList *process = new GurlsOptionsList("processes", false);
```
and define instructions for the training process
```cpp
OptProcess* process1 = new OptProcess();
*process1 << GURLS::computeNsave << GURLS::computeNsave
          << GURLS::ignore << GURLS::ignore << GURLS::ignore;
process->addOpt("one", process1);
opt->addOpt("processes", process);
```
and testing process
```cpp
OptProcess* process2 = new OptProcess();
*process2 << GURLS::load << GURLS::load << GURLS::computeNsave
          << GURLS::computeNsave << GURLS::computeNsave;
process->addOpt("two", process2);
```
run gurls for training
```cpp
string jobId0("one");
G.run(Xtr, ytr, *opt, jobId0);
```
run gurls for testing
```cpp
string jobId1("two");
G.run(Xte, yte, *opt, jobId1);
```
The run method of class GURLS executes an ordered sequence of tasks, i.e. the pipeline, specified in the field seq of the options' structure as
{"<CATEGORY1>:<TASK1>";"<CATEGORY2>:<TASK2>";...}
These tasks can be combined in order to build different train-test pipelines. A list of the currently implemented GURLS tasks, organized by category, is summarized in Table 1. In order to run the other examples you just have to substitute the code fragment for the task pipeline
*seq << ...
and for the sequence of instructions for the training process
*process1 << ...
and testing process
*process2 << ...
with the desired task pipeline and instruction sequences. In the following we report the fragments of code defining the task sequence and the training and testing instructions for some popular learning pipelines.
task pipeline

```cpp
*seq << "split:ho" << "paramsel:hoprimal" << "optimizer:rlsprimal";
*seq << "pred:primal" << "perf:macroavg";
```

instruction sequences

```cpp
*process1 << GURLS::computeNsave << GURLS::computeNsave << GURLS::computeNsave;
*process1 << GURLS::ignore << GURLS::ignore;
*process2 << GURLS::load << GURLS::load << GURLS::load;
*process2 << GURLS::computeNsave << GURLS::computeNsave;
```
task pipeline

```cpp
*seq << "split:ho" << "paramsel:siglamho" << "kernel:rbf" << "optimizer:rlsdual";
*seq << "pred:dual" << "predkernel:traintest" << "perf:macroavg";
```

instruction sequences

```cpp
*process1 << GURLS::computeNsave << GURLS::computeNsave << GURLS::computeNsave;
*process1 << GURLS::computeNsave << GURLS::ignore << GURLS::ignore << GURLS::ignore;
*process2 << GURLS::load << GURLS::load << GURLS::load << GURLS::load;
*process2 << GURLS::computeNsave << GURLS::computeNsave << GURLS::computeNsave;
```
Here parameter selection for the Gaussian kernel requires selection of both the RLS regularization parameter and the kernel parameter, and is performed by selecting the task siglamho for the category paramsel. Once the value for the kernel parameter is chosen, the Gaussian kernel is built through the kernel task category with choice rbf.
The options structure passed as third input to GURLS.run has a set of default fields and values. Some of these fields can be manually changed as in the following line of code

```cpp
opt.addOpt("<FIELD>", <VALUE>);
```

where <VALUE> belongs to the appropriate class of options, e.g.:
- OptNumber
- OptFunction
Below we list the most important fields that can be customized:
- nlambda (OptNumber, 20): number of values for the regularization parameter.
- nsigma (OptNumber, 25): number of values for the kernel parameter.
- nholdouts (OptNumber, 1): number of data splits to be used for hold-out CV.
- hoproportion (OptNumber, 0.2): proportion between training and validation set in parameter selection.
- hoperf (OptFunction, "macroavg"): objective function to be used for parameter selection.
- epochs (OptNumber, 4): number of passes over the training set for stochastic gradient descent.
- subsize (OptNumber, 50): training set size used for parameter selection when using stochastic gradient descent.
- singlelambda (OptFunction, "mean"): function for obtaining one value for the regularization parameter, given the parameter choice for each class in multiclass classification (for each output in multiple output regression).
The bGURLS package includes all the design patterns described for GURLS, complemented with additional big data capabilities through a data structure called bigarray, which allows one to handle data matrices as large as a machine's available hard-disk space instead of its RAM.
The bGURLS Core is identified with the bgurls command, which behaves like gurls. As gurls, it accepts exactly four arguments:
- The bigarray of the input data.
- The bigarray of the labels vector.
- An options' structure.
- A job-id number.
The options' structure is built through the bigdefopt function with default fields and values.
Most of the main fields in the options' structure are the same as in GURLS; however, bgurls requires the options' structure to have the additional field files, which must be a structure with fields:
- Xva_filename: the prefix of the files that constitute the bigarray of the input data used for validation.
- yva_filename: the prefix of the files that constitute the bigarray of the labels vector used for validation.
- pred_filename: the prefix of the files that constitute the bigarray of the predicted labels for the test set.
- XtX_filename: the name of the file where the pre-computed matrix X'X is stored.
- Xty_filename: the name of the file where the pre-computed matrix X'y is stored.
- XvatXva_filename: the name of the file where the pre-computed matrix Xva'Xva is stored.
- Xvatyva_filename: the name of the file where the pre-computed matrix Xva'yva is stored.
Let us consider the demo bigdemoA.m in the demo directory to better understand the usage of bGURLS. The demo computes a linear classifier with the regularization parameter chosen via hold-out validation, and then evaluates the prediction accuracy on a test set.
The data set used in the demo is the bio data set used in Lauer and Guermeur (2011), which is saved in the demo directory as a .zip file, 'bio_unique.zip', containing two files:
- 'X.csv': the input nxd data matrix, where n is the number of samples (24,942) and d is the number of variables (68);
- 'Y.csv': the nx1 labels vector.
In the following we examine the salient parts of the demo in detail. First unzip the data file

```matlab
unzip('bio_unique.zip','bio_unique')
```
and set the name of the data files
```matlab
filenameX = 'bio_unique/X.csv'; % nxd input data matrix
filenameY = 'bio_unique/y.csv'; % nx1 or 1xn labels vector
```
Now set the size of the blocks for the bigarrays (matrices of size blocksize-by-d must fit into memory):

```matlab
blocksize = 1000;
```

the fraction of total samples to be used for testing:

```matlab
test_hoproportion = .2;
```

the fraction of training samples to be used for validation:

```matlab
va_hoproportion = .2;
```

and the directory where all processed data is going to be stored:

```matlab
dpath = 'bio_data_processed';
```
Now set the prefix of the files that will constitute the bigarrays
```matlab
mkdir(dpath)
files.Xtrain_filename = fullfile(dpath, 'bigarrays/Xtrain');
files.ytrain_filename = fullfile(dpath, 'bigarrays/ytrain');
files.Xtest_filename  = fullfile(dpath, 'bigarrays/Xtest');
files.ytest_filename  = fullfile(dpath, 'bigarrays/ytest');
files.Xva_filename    = fullfile(dpath, 'bigarrays/Xva');
files.yva_filename    = fullfile(dpath, 'bigarrays/yva');
```
and the name of the files where pre-computed matrices will be stored
```matlab
files.XtX_filename     = fullfile(dpath, 'XtX.mat');
files.Xty_filename     = fullfile(dpath, 'Xty.mat');
files.XvatXva_filename = fullfile(dpath, 'XvatXva.mat');
files.Xvatyva_filename = fullfile(dpath, 'Xvatyva.mat');
```
We are now ready to prepare the data for bGURLS.
The following command reads files filenameX and filenameY blockwise -- thus avoiding loading the whole files at once -- and stores them in the bigarray format, after having split the data into train, validation and test sets:
```matlab
bigTrainTestPrepare(filenameX, filenameY, files, blocksize, va_hoproportion, test_hoproportion)
```
Bigarrays are now stored in the file names specified in the structure files. We can now precompute the matrices that will be recursively used in the training phase, and store them in the file names specified in the structure files:

```matlab
bigMatricesBuild(files)
```
The data set is now prepared for running the learning pipeline with the bgurls command. This phase behaves almost exactly as in GURLS. The only differences are that:
- we do not need to load the data into memory, but simply 'load' the bigarray, that is, load the information necessary to access the data blockwise;
- we have to specify in the options' structure the paths where the already computed matrix multiplications and the bigarrays for validation data are stored.
```matlab
name = fullfile(wpath,'gurls');
opt = bigdefopt(name);
opt.seq = {'paramsel:dhoprimal','rls:dprimal','pred:primal','perf:macroavg'};
opt.process{1} = [2,2,0,0];
opt.process{2} = [3,3,2,2];
```
Note that no task is defined for the split category, as the data has already been split in the preprocessing phase and the bigarrays for validation were built.
In the following fragment of code we add to the options' structure the information relative to the already computed matrix multiplications and to the validation bigarrays
```matlab
opt.files = files;
opt.files = rmfield(opt.files, {'Xtrain_filename';'ytrain_filename';'Xtest_filename';'ytest_filename'}); % not used by bgurls
opt.files.pred_filename = fullfile(dpath, 'bigarrays/pred');
```
Note that we have also defined where the predicted labels shall be stored as bigarray.
Now we have to 'load' bigarrays for training
```matlab
X = bigarray.Obj(files.Xtrain_filename);
y = bigarray.Obj(files.ytrain_filename);
X.Transpose(true);
y.Transpose(true);
```
and run bgurls on the training set

```matlab
bgurls(X, y, opt, 1)
```
In order to run the testing process, we first have to 'load' the bigarray variables for the test data
```matlab
X = bigarray.Obj(files.Xtest_filename);
y = bigarray.Obj(files.ytest_filename);
X.Transpose(true);
y.Transpose(true);
```
and then we can finally run bgurls on the test set

```matlab
bgurls(X, y, opt, 2);
```
Now you should have a mat file named 'gurls.mat' in your path. This file contains all the information about your experiment. If you want to see the mean accuracy, for example, load the file in your workspace and type
```matlab
>> mean(opt.perf.acc)
```
If you are interested in visualizing or printing stats and facts about your experiment, check the documentation about the summarizing functions in the gurls package.
Two other demos can be found in the 'demo' directory. The three demos differ in the format of the input data, as we tried to provide examples for the most common data formats.
The data set used in bigdemoB is again the bio data set, though in a slightly different format, as it is already split into train and test data. The functions bigTrainPrepare and bigTestPrepare take care of preparing the train and test sets separately.
The data set used in bigdemoC is the ImageNet data set, which is automatically downloaded from http://bratwurst.mit.edu/sbow.tar when running the demo. This data set is stored in 1000 .mat files, where the i-th file contains the variable x, a dxn_i input data matrix for the n_i samples of class i. The function bigTrainTestPrepare_manyfiles takes care of preparing the bigarrays for the ImageNet data format.
Note that, while the bio data is not properly a big data set, the ImageNet data set occupies about 1 GB of RAM and can thus be called a big data set.
In order to run bGURLS on other data formats, one can simply use bigdemoA after having substituted the line

```matlab
bigTrainTestPrepare(filenameX, filenameY, files, blocksize, va_hoproportion, test_hoproportion)
```

with a suitable fragment of code. The remainder of the data preparation, that is the computation and storage of the relevant matrices, can be left unchanged.
The usage of bGURLS++ is very similar to that of GURLS++, with the following exceptions:
- the GURLS Core is implemented via the BGURLS class instead of the GURLS one;
- the first two inputs of BGURLS must be bigarrays (a data structure which allows one to handle data matrices as large as a machine's available hard-disk space instead of its RAM) rather than matrices;
- the options structure must be of class BGurlsOptionsList rather than GurlsOptionsList;
- the only allowed "big" task categories for bGURLS++ are bigsplit, bigparamsel, bigoptimizer, bigpred and bigperf.
Let us consider the demo bigdemo.cpp in the demo subdirectory to better understand the usage of the bGURLS++ module. The data set used in the demo is the bio data set used in Lauer and Guermeur (2011), which is saved in the demo directory as a .zip file, 'bio_traintest_csv.zip', containing four files:
- 'Xtr.csv': the NtrxD training data matrix, where Ntr is the number of training samples and D is the number of variables;
- 'Ytr.csv': the Ntrx1 training labels vector;
- 'Xte.csv': the NtexD test data matrix, where Nte is the number of test samples;
- 'Yte.csv': the Ntex1 test labels vector.
The data is loaded as bigarrays (actually only the information relative to the data, not the data itself) with the following fragment of code:
```cpp
BigArray<T> Xtr(path(shared_directory / "Xtr.h5").native(), 0, 0);
Xtr.readCSV(path(input_directory / "Xtr.csv").native());
BigArray<T> Xte(path(shared_directory / "Xte.h5").native(), 0, 0);
Xte.readCSV(path(input_directory / "Xte.csv").native());
BigArray<T> ytr(path(shared_directory / "ytr.h5").native(), 0, 0);
ytr.readCSV(path(input_directory / "ytr.csv").native());
BigArray<T> yte(path(shared_directory / "yte.h5").native(), 0, 0);
yte.readCSV(path(input_directory / "yte.csv").native());
```
The options’ structure is built with default values via the following line of code:
```cpp
BGurlsOptionsList opt("bio_demoB", shared_directory.native(), true);
```
The pipeline is built as in GURLS++, though with the bGURLS++ task categories:
```cpp
OptTaskSequence *seq = new OptTaskSequence();
*seq << "bigsplit:ho" << "bigparamsel:hoprimal" << "bigoptimizer:rlsprimal"
     << "bigpred:primal" << "bigperf:macroavg";
opt.addOpt("seq", seq);
```
The two sequences of actions identifying the training and test processes are defined exactly as in GURLS++, whereas the processes are run through the run method of the BGURLS class, as in the following:
```cpp
BGURLS G;
G.run(Xtr, ytr, opt, jobid1);
G.run(Xte, yte, opt, jobid2);
```
You can visualize the results of one or more experiments (i.e. GURLS pipelines) using the summary_* functions.
Below we show the usage of this set of functions for two sets of experiments, each one run 5 times. First we have to run the experiments: nRuns contains the number of runs for each experiment, and filestr contains the names of the experiments.
```matlab
nRuns = {5,5};
filestr = {'hoprimal'; 'hodual'};
for i = 1:nRuns{1};
    opt = defopt([filestr{1} '_' num2str(i)]);
    opt.seq = {'paramsel:loocvprimal','rls:primal','pred:primal','perf:macroavg','perf:precrec'};
    opt.process{1} = [2,2,0,0,0];
    opt.process{2} = [3,3,2,2,2];
    gurls(Xtr, ytr, opt, 1)
    gurls(Xte, yte, opt, 2)
end
```
```matlab
for i = 1:nRuns{2};
    opt = defopt([filestr{2} '_' num2str(i)]);
    opt.seq = {'kernel:linear','paramsel:loocvdual','rls:dual','pred:dual','perf:macroavg','perf:precrec'};
    opt.process{1} = [2,2,2,0,0,0];
    opt.process{2} = [3,3,3,2,2,2];
    gurls(Xtr, ytr, opt, 1)
    gurls(Xte, yte, opt, 2)
end
```
In order to visualize the results we have to specify, in the cell array fields, which fields of opt are to be displayed (as many plots as the number of elements of fields will be generated):
```matlab
>> fields = {'perf.ap','perf.acc'};
```
we can generate "per-class" plots with the following command:
```matlab
>> summary_plot(filestr, fields, nRuns)
```
and "global" plots with:
```matlab
>> summary_overall_plot(filestr, fields, nRuns)
```
The following generates a "global" table:
```matlab
>> summary_table(filestr, fields, nRuns)
```
Finally, the following plots the time taken by each step of the pipeline, for performance reference:
```matlab
>> plot_times(filestr, nRuns)
```