The Programs

The following section describes each of the programs in METAL-MLEE in more detail. Note that all programs accept the -h option, which shows an explanation of all valid options as well as the version and version date of the program. The following documentation only explains those options that are relevant for using METAL-MLEE with the Data Mining Advisor. For a more complete documentation of the programs see [Petrak 2002a].

Main experimentation program: run_exp

The run_exp program performs the following tasks for a given base database:

For the error estimation, the input base database is randomly shuffled and split into one or more pairs of training and evaluation data. The program run_exp needs a random seed to control the random shuffling: the seed defines exactly how the data is shuffled and partitioned. This makes it possible to run the program on different machines, at different times, and with different learning algorithms, and still obtain comparable error estimates and comparable files with predictions.
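
For example, the following two calls (with a hypothetical filestem mydata; the -l option for selecting algorithms is described below) use the same seed and therefore operate on identical data partitions, so their error estimates are directly comparable even when run on different machines or at different times:

  run_exp -f mydata -s 42 -l alg1
  run_exp -f mydata -s 42 -l alg2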

The program creates a standardized set of output files in the output directory specified (see Section 8).

Synopsis

  run_exp -h
  run_exp -f stem -s seed [-v] ...

Important Options

The following describes just the subset of options that are important for use in the METAL setting:

Specifying a CPU time limit

METAL-MLEE allows you to specify a CPU time limit for each call of an external learning algorithm program. This is necessary because otherwise the only way to end an experiment in which one of the algorithms loops or takes too much time would be to terminate the whole experiment, losing all the data for all the algorithms. You can specify the CPU time limit, in seconds, using option -t:
run_exp -f stem -s 1 -t 3600
This example sets the CPU time limit to one hour. (The default is 43200 seconds, or 12 hours; use the value 0 for unlimited CPU time.)

Note however that not all operating systems support this; currently it is not possible under Windows. On some systems that do not support CPU time limits but do support killing processes, a workaround is coded that tries to kill the process after a specified number of elapsed (not CPU!) seconds; this might work, but it is not guaranteed either.

Specifying learning algorithms

Learning algorithms are always invoked through interface scripts. If you want to use a learning algorithm with METAL-MLEE for which there is not already an interface script included in the scripts subdirectory, you need to create a new one (see Section 6). Interface scripts for learning algorithms are named run_cla_laname for classification algorithms and run_rla_laname for regression algorithms, where laname is the name under which the learning algorithm should be known to METAL-MLEE.

To invoke one or more algorithms for an experiment, pass this name of the algorithm as an argument to the run_exp option -l. In order to use more than one algorithm, specify the option -l multiple times, e.g.:

run_exp -f somestem -s 1 -l alg1 -l alg2
This example runs the algorithms alg1 and alg2.

Instead of specifying the list of learning algorithms every time, you can set a default list in the configuration file config.pm.

Passing parameters to the learning algorithms

Interface scripts execute both the training and the prediction phase of a learning algorithm. In order to specify to which phase a parameter should be passed, you need to specify a ``sub option'':

run_exp -f stem -s 1 -l "alg1 -at -A" 
  -l "alg2 -ae '-r 1.1 -s 2.1' "
This example shows how to add the option -A to the call of algorithm alg1 for the training phase and the options -r 1.1 -s 2.1 to the testing phase of algorithm alg2. You can use suboption -a to specify parameters to pass to both the training and the testing phase calls.
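
For example, the following call (where -v stands for a hypothetical option of alg1) passes -v to both the training and the testing call of alg1:

  run_exp -f stem -s 1 -l "alg1 -a -v"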

You can also use the same algorithm twice with different parameter settings. However, for this to work, you also have to specify a different algorithm suffix for each of the calls:

run_exp -f somestem -s 1 -l 'alg1 -at "-c 0.1" -asuf c0.1' 
  -l 'alg1 -at "-c 0.2" -asuf c0.2'
This suffix will be appended everywhere the algorithm name is mentioned, i.e., the statistics, log, and results files will now contain entries for an algorithm alg1c0.1 and an algorithm alg1c0.2.


Algorithm interface programs

Interface programs are used to provide the main experimentation program run_exp with one standard interface to many different learning algorithms. In order for this to work, learning algorithms must fulfill some requirements that are listed in Section 6, which also explains how to adapt and add interface programs for new learning algorithms.

Interface programs must reside in the same directory as run_exp. They follow a simple naming scheme: run_cla_xxx for an interface program to a classification learning algorithm named xxx, and run_rla_yyyy for an interface program to a regression learning algorithm named yyyy. To specify a learning algorithm as an argument to run_exp or in the configuration file config.pm, only the name of the learning algorithm must be given (i.e. xxx or yyyy).
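
For example, assuming an interface script run_cla_c50tree exists in this directory, the corresponding classification algorithm is specified as c50tree:

  run_exp -f stem -s 1 -l c50tree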

All interface programs take the same set of options. For testing purposes, or when debugging problems encountered during the execution of run_exp, it can be useful to run an interface program directly. For this, a pair of training and testing datasets and a names file must exist (i.e. three files with the same filestem and the following extensions: .data, .test, and .names).
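
For example, assuming a filestem mydata, the following three files must exist before an interface script can be run on them manually:

  % ls
  mydata.data   mydata.names   mydata.test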

Here are the most important options for manually running an interface script:

Extracting information from the results: parse_results

This program makes it easier to extract the interesting information from the files generated for an experiment. The standard files that are normally created are the files ending in the following extensions: .results, .dct, .log, and .stats (see Section 8). The .log file contains a log of all actions performed, and the other three files contain result data (and are often collectively referred to as result files). These three files contain lines of the format:

Some qualified variablename: value

Each line contains a value for a variable. The value is everything after the colon (a value can be multidimensional, i.e. consist of more than one word, but usually it is a single word or number). The variable name is everything before the colon and consists of several words. The following line from a .stats file gives the value of the error estimate for algorithm c50boost in cross-validation fold 2 of repetition 0:

Error c50boost 0 2: 0.34123110000

The parse_results program can be used to extract the values of certain variables and create a file that contains just these values, one line per record with the values separated by commas. The program can generate one line for each filestem, one line for each filestem/algorithm combination, one line for each filestem/algorithm/cross-validation-fold combination, or one line for each filestem/pair-of-algorithms combination.

The following example demonstrates how the program can be used to extract different types of data:

% ls
allrep_2.dct      allrep_2.stats  led24_2.results  segment_2.dct      segment_2.stats
allrep_2.results  led24_2.dct     led24_2.stats    segment_2.results

% parse_results *.* -f %DS -f %LA -f stats.Error 
allrep_2,basedef,0.032873806998939555
allrep_2,basedef200,0.032873806998939555
allrep_2,baserand,0.9899257688229056
allrep_2,c50boost,0.009544008483563097
allrep_2,c50rules,0.009278897136797455
...
allrep_2,clemRBFN,0.032873806998939555
allrep_2,lindiscr,0.08510074231177095
allrep_2,ltree,0.008748674443266172
allrep_2,mlcib1,0.024920466595970307
allrep_2,mlcnb,0.05726405090137858
allrep_2,ripper,0.010604453870625663

% parse_results *.* -breakup ds -f %DS -f results.DBSize -f results.N_discrete_attr
allrep_2,3772,21
led24_2,3200,24
segment_2,2310,0

Synopsis and options

parse_results -h
parse_results filelist -f fieldspec [-f fieldspec ...] 
  [-breakup ds | la | lapair | foldla]
  [-o outfile] [-n outnamesfile] [-fn] [-hostnorm] [-algnorm alg]
  [-s sep] [-m mv] [-mnp x] [-strip]
  [-ignoredct] [-ignoreresults] [-ignorestats]
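
For example, the following call (with placeholder filenames) extracts, as in the first example above, one line per filestem/algorithm combination, but writes the result to the file summary.csv instead of standard output:

  parse_results *.* -f %DS -f %LA -f stats.Error -o summary.csv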

Normalize time measurements: parse_times

The METAL-MLEE package is intended to simplify the process of obtaining machine learning experimentation results that are possibly produced on different hosts. The run_exp script collects the timing information returned by the interface scripts and puts it into the .results file. However, CPU time measurements obtained on different hosts are not comparable. The task of parse_times is to analyze experimentation results that were obtained on different machines for the same dataset, using the same seed and algorithms. From the times measured on the different machines, the program creates a table of factors which roughly represent the relative performance increase or decrease with respect to one reference host. The generated table can then be used by the parse_results script to normalize all time measurements to the reference machine.

WARNING: this feature should be used with extreme caution! Be aware that the factors can only be used as a very rough approximation of the speed differences between two machines. Several factors make this approach rather inaccurate:

Synopsis and options

parse_times -host hostname -from YYYYMMDD -to YYYYMMDD
  [-calc avg | last | median] [-xlispstat filename]  filelist
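
For example, the following call (hostname, date range, and file list are placeholders) computes normalization factors relative to the reference host refhost, using the median of the time measurements found in the given result files between January and June 2002:

  parse_times -host refhost -from 20020101 -to 20020630 -calc median *.results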

Checking the database format: check_database.pl

The script check_database.pl checks the format of a database for compliance with the standard database format needed by METAL (see Section 4). Note that unless you specify the option -nocheckformat, this script is automatically called from run_exp in order to make invalid results caused by a wrong format, which might otherwise go undetected, less likely.

Synopsis and options

check_database.pl -f filestem [-regr] [-limit maxerrs] [-max maxlines] 
  [-dbg] [-o]
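
For example, the following call (with a placeholder filestem; here -limit is assumed to bound the number of reported format errors) checks the database with filestem mydata:

  check_database.pl -f mydata -limit 10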

Checking experiment output: check_results.pl

A single run of run_exp can create many files and a very large .log file, so it is often hard to quickly determine whether some algorithm failed and in which fold of the experiment. The check_results.pl script makes this easier.

Synopsis and options

  check_results.pl -h
  check_results.pl -f stem [-N n] [-l alg1 [-l alg2] ...] [-v] [-d] [-dd]
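
For example, the following call (with placeholder filestem and algorithm names) checks the output of an experiment for the algorithms alg1 and alg2 with verbose output:

  check_results.pl -f mydata -l alg1 -l alg2 -v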

Other programs and helper files included in the distribution


Calculate quick measures from names files: parse_names

The parse_names program calculates a few measures from a names file, such as the number of attributes and the number of values of the discrete attributes. These measures are included in the .results file.

  parse_names -f namesfile

The output shows:

Select a subset of features: project

This script selects a list of attributes from the input files specified by the infilestem and writes a set of output files specified by the outfilestem:

  project infilestem outfilestem attrlist

attrlist should be a comma-separated list of attribute numbers, where numbering starts with one. To pass this as a single argument it might be necessary to enclose the list in single or double quotes (depending on the shell you are using).

The script expects a .data, a .names, and a .test file to exist and will create the corresponding output files.
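
For example, the following call (the filestems are placeholders) reads mydata.data, mydata.names, and mydata.test, selects attributes 1, 3, and 7, and writes mydatasub.data, mydatasub.names, and mydatasub.test:

  project mydata mydatasub "1,3,7"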

NOTE: The script internally uses the cut command to select the attributes. Many preinstalled cut commands can only process a small number of fields and short records. Therefore, for most databases, the GNU cut command or an equivalent version without these limitations should be used. You can specify the path to the cut command in the config.pm configuration file if it differs from the one in the binary path.

Sample interface scripts

The scripts subdirectory in the METAL-MLEE distribution contains several interface scripts for classification learning algorithms, regression learning algorithms, preprocessing algorithms and landmark measurement algorithms. These files can be used to adapt METAL-MLEE to other learning algorithms by using them as templates.

The following interface scripts for classification learning algorithms are included:

The following interface scripts for regression learning algorithms are included:

The following interface scripts for measuring/landmarking algorithms are included:

The following interface scripts for preprocessing algorithms are included:

The Clementine command line interface: run_clem

The run_clem program simplifies the use of Clementine learning algorithms from the command line. The program analyzes the input files and creates the necessary information to modify a template Clementine stream file, which is then used in a batch-mode run of Clementine. The scripts directory contains stream templates for the learning algorithms MLP, RBFN, and C5, called mlp.str, rbfn.str, and c5.str.

WARNING: the method described has only been tested with version 5.0.1. There was an unresolved problem with version 5.1 when it first came out, but this has not been rechecked since.

  run_clem -h 
  run_clem -f filestem -m method {-train|-test} [-p n|c] [-d path] 
    [-r stem] [-cmd cmd] [-nc] [-i] [-s seed] [-c4] [-v] [-vl]
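
For example, the following call (filestem and seed are placeholders, and the method name is assumed to match the name of the stream template) runs the training phase of the C5 method on the files with filestem mydata:

  run_clem -f mydata -m c5 -train -s 1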

Calculate landmarks: landmark.pl

This Perl script calculates landmark measurements for a database and is used internally by the landmark interface scripts.

