The following section describes each of the programs in METAL-MLEE in more detail. Note that all programs accept the -h option, which shows an explanation of all valid options as well as the version and version date of the program. The following documentation only explains those options that are relevant for using METAL-MLEE with the Data Mining Advisor. For a more complete documentation of the programs see [Petrak 2002a].
The run_exp program performs the following tasks for a given base database:
For the error estimation, the input base database is randomly shuffled and split into one or more pairs of training and evaluation data. The run_exp program needs a random seed to control the random shuffling. The random seed defines exactly how the data is shuffled and partitioned. This makes it possible to run the program on different machines, at different times, and with different learning algorithms, and still obtain comparable error estimates and comparable files with predictions.
The program creates a standardized set of output files in the output directory specified (see Section 8).
run_exp -h
run_exp -f stem -s seed [-v] ...
The following describes just the subset of options that are important for use in the METAL-setting:
-s seed
: The seed to be used for the random number
generator that determines how the data file will be shuffled
before the estimation procedure is carried out. If
no seed is given, the value 1 will be used. The special value
"norand" will suppress random shuffling and keep the ordering
of the database file. This parameter is ignored for the estimation
strategy "leave one out". An example of using the seed is shown
after this option list.
-regr
: Indicate that the database describes a regression
problem (i.e. the target variable is numeric). If omitted,
a classification problem (i.e. the target variable is discrete)
is assumed.
-dt path
: The path to the directory that should be used
to store temporary files. Default is /tmp. This
directory must be on a device that has enough free space to
hold all the intermediate files. Note that unless one of the options
-k, -d, or -lad is specified,
temporary files should get removed at the end of an experiment.
However, for several reasons the directory can fill up with
leftover files, so be sure to remove unneeded ones regularly.
-d
: Switch on debug mode: this will show much more information
in the log file and on the console (-d implies -v, which will
show everything that goes into the log file on the console too).
-lad
: Switch on debugging for interface scripts.
This will pass the option -d
to all the interface scripts, causing a lot
more output from the interface scripts to be logged in the log file.
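For example (a sketch; the filestem somestem is a placeholder):
run_exp -f somestem -s 42
run_exp -f somestem -s norand
The first invocation shuffles the data with seed 42; repeating it on another machine yields the same partitioning and thus comparable error estimates. The second invocation keeps the original ordering of the database file.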
run_exp -f stem -s 1 -t 3600
This example sets the CPU time limit to one hour. (The default is 43200 seconds, or 12 hours; use the value 0 to remove the CPU time limit.)
Note, however, that not all operating systems support this. Currently this is not possible under Windows. On some systems that do not support this but do support killing processes, a coded workaround tries to kill the process after a specified number of elapsed (not CPU!) seconds; this might work, but is not guaranteed either.
To invoke one or more algorithms for an experiment, give the name of the algorithm as an argument to the run_exp option -l. In order to use more than one algorithm, specify the option -l multiple times, e.g.:
run_exp -f somestem -s 1 -l alg1 -l alg2
This example shows how to run the algorithms alg1 and alg2.
Instead of specifying the list of learning algorithms every time, you can specify the list to be used as a default in the configuration file config.pm.
Interface scripts execute both the training and the prediction phase of a learning algorithm. In order to specify to which phase a parameter should be passed, you need to specify a ``sub option'':
run_exp -f stem -s 1 -l "alg1 -at -A" -l "alg2 -ae '-r 1.1 -s 2.1' "This example shows how to add the option -A to the call of algorithm alg1 for the training phase and options -r 1.1 -s 2.1 to the testing phase of algorithm alg2. You can use suboption -a to specify what to pass to both the training and the testing phase calls.
You can also use the same algorithm twice with different parameter settings. However, for this to work, you also have to specify different algorithm suffixes for each of the different calls:
run_exp -f somestem -s 1 -l 'alg1 -at "-c 0.1" -asuf c0.1' -l 'alg1 -at "-c 0.2" -asuf c0.2'
The suffix will be appended everywhere the algorithm name is mentioned, i.e. the statistics, the log, and the results file will now contain entries for an algorithm alg1c0.1 and an algorithm alg1c0.2.
Interface programs are used to provide the main experimentation
program run_exp
with one standard interface to many different
learning algorithms. In order for this to work, learning algorithms
must fulfill some requirements that are listed in Section 6
which also explains how to adapt and add interface programs for
new learning algorithms.
Interface programs must reside in the same directory as the run_exp
program. They follow a simple naming scheme: run_cla_xxx
for an interface
program to a classification learning algorithm named xxx
and
run_rla_yyyy
for an interface program to a regression learning
algorithm named yyyy
. To specify a learning algorithm as an
argument to run_exp
or in the configuration file config.pm
only the name of the learning algorithm must be given (i.e. xxx
or
yyyy
only).
All interface programs take the same set of options. For testing
purposes, or when debugging problems encountered during the execution
of run_exp
it can be useful to directly run an interface program.
For this, a pair of training and testing datasets and a names file must
exist (i.e. three files with the same filestem and the extensions
.data
, .test
, and .names
); an example invocation is shown after the option list below.
Here are the most important options for manually running an interface script:
-h
: Show all possible options and a short explanation.
-istem STEM
: The file stem (including the path) that
identifies the three files (data, test, and names file) needed.
When invoked from within the run_exp
program, the filestem
will usually also include the seed and the process ID to avoid
duplicate file names for the temporarily created files.
-tmppath PATH
: Where to store intermediate or temporary data. This
is currently not used by run_exp
since the training/test/names
files are stored in the temporary directory anyway and it is easier
to derive other filenames for temporary files directly from this filestem.
-a args
: Pass additional arguments to all calls of the algorithm
(training and testing)
-at args
: Additional arguments for the training call
-ae args
: Additional arguments for the testing (evaluation) call
-cpulimit n
: Try to limit the CPU time to that many seconds
(might not work on all systems).
-kmodel file
: Copy the model to this file
-nopgm
: Don't actually call external programs; useful for debugging.
-portable
/-noportable
: Usually the program tries to figure out
how to limit CPU time and how to determine the system/user CPU time
needed for the algorithm on a specific system. The -portable
switch can be used to run (experimental) code that will try to
do everything with Perl-code that is as portable as possible.
Note that portable mode still has its flaws - especially the
termination of processes does not work correctly on most systems.
If the -portable
option is used, the -cpulimit
value will be interpreted as a limit for elapsed running time
instead of CPU time.
-k
: Do not delete intermediate datasets
-d
: Switch on debug mode
-v
: Switch on verbose mode
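For example, to run the interface script for the c50tree algorithm directly (a sketch; it assumes the files /tmp/mydata.data, /tmp/mydata.test, and /tmp/mydata.names exist):
run_cla_c50tree -istem /tmp/mydata -v -k
This executes the training and testing phases of the algorithm on the given files, keeps the intermediate datasets (-k), and prints verbose output.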
This program will make it easier to extract the interesting information from the files generated for an experiment. The standard files that are normally created are the files ending in the following extensions: .results, .dct, .log, and .stats (see Section 8). The .log file contains a log of all actions performed and the other three files contain result data (and are often collectively referred to as result files). These three files contain lines of the format:
Some qualified variablename: value
Each line contains a value for a variable. The value is everything
after the colon (a value can be multidimensional, i.e. consist of more
than one word, but usually is just a single word or number).
The variable name is everything before the colon and consists of
several words. The following line gives the value of the error estimate
for algorithm c50boost in cross-validation fold 2 of repetition 0
in a .stats
file:
Error c50boost 0 2: 0.34123110000
The parse_results program can be used to extract the values for certain variables and create a file that contains just the values of these variables, separated by commas and organized in lines. The program can be used to generate one line for each filestem, one line for each filestem/algorithm combination, one line for each filestem/algorithm/crossvalidation-fold combination, or one line for each filestem/pair-of-algorithms combination.
The following example demonstrates how the program can be used to extract different types of data:
% ls
allrep_2.dct      allrep_2.stats   led24_2.results  segment_2.dct      segment_2.stats
allrep_2.results  led24_2.dct      led24_2.stats    segment_2.results
% parse_results *.* -f %DS -f %LA -f stats.Error
allrep_2,basedef,0.032873806998939555
allrep_2,basedef200,0.032873806998939555
allrep_2,baserand,0.9899257688229056
allrep_2,c50boost,0.009544008483563097
allrep_2,c50rules,0.009278897136797455
...
allrep_2,clemRBFN,0.032873806998939555
allrep_2,lindiscr,0.08510074231177095
allrep_2,ltree,0.008748674443266172
allrep_2,mlcib1,0.024920466595970307
allrep_2,mlcnb,0.05726405090137858
allrep_2,ripper,0.010604453870625663
% parse_results *.* -breakup ds -f %DS -f results.DBSize -f results.N_discrete_attr
allrep_2,3772,21
led24_2,3200,24
segment_2,2310,0
parse_results -h
parse_results filelist -f fieldspec [-f fieldspec ...] [-breakup ds | la | lapair | foldla] [-o outfile] [-n outnamesfile] [-fn] [-hostnorm file] [-algnorm alg] [-s sep] [-m mv] [-mnp x] [-strip] [-ignoredct] [-ignoreresults] [-ignorestats]
filelist
: The list of files to process. The easiest way to
do this is to use a glob-pattern. For example, if there is a
subdirectory below the current directory for each filestem and
you want to process all results files for all filestems, the simplest
way to specify this is ``*/*.{dct,results,stats}
''.
-f fieldspec
: This option can occur more than once and specifies
(in order) the list of fields to include in the output. A fieldspec
is either a qualified fieldname, a special fieldname or a function.
A qualified fieldname is of the form filespec.fieldname
where
filespec
is one of stats
, dct
, or results
and the fieldname
is the name portion of one of the fields that
occur in that file, e.g. dct.Nr_attributes
or stats.Error
.
The following special field names can be used: %LA
, the name
of the learning algorithm (not for breakup=ds); %DS
, the filestem as
extracted from the file processed (i.e. this will usually include the
seed and any suffixes - the 'true' filestem
can be extracted using results.Filestem
or results.InFilestem
);
%FLD
, the fold number (only for breakup=foldla); %REP
, the repeat number; %LA1
and %LA2
, the names of the two learning
algorithms for breakup=lapair.
Functions must be specified in the form NAME(arg)
. The following
functions are currently defined: AVG
, SUM
, COUNT
,
MIN
, MAX
will all calculate the corresponding function
over all fields that match a regular field name pattern. For example to
find the maximum value for all fields with a name that starts with
Attr_Count_All_Value
in the dct file, use 'MAX(dct.Attr_Count_All_Value.*)'
. Note that the pattern must be a Perl-type regular expression,
not a glob pattern. This feature cannot be used to calculate
functions over qualified variable names, e.g. 'MAX(results.Traintime.*)'
with breakup=ds will not work. The function ACC(field)
will
calculate 1-field
.
-breakup x
: Specify for which level of detail the program
will create individual lines in the output.
The default is la
, which produces one line for each combination
of filestem and learning algorithm. The option lapair
will
generate one line of output for each combination of filestem and
pairs of learning algorithms,
ds
generates one line of output for each filestem and
foldla
generates one line for each combination of filestem,
learning algorithm, and crossvalidation fold.
-o filename
: Specify a file where to write the output
to (if not given: standard output).
-n filename
: Specify a file where to write a C4.5 names
file for the output - the program will try to guess the type and
possible values of attributes and will also try to convert
field names to something that is usable with most learning algorithms
that use C4.5 format. Note that the generated file will just
contain a line for each field in the output and is thus not
directly usable for C4.5 (for this you need to remove the line
for the last field and add a class label definition line at the
beginning instead).
-fn
: include a line with fieldnames as the first line of
output - this is useful for many programs that can process CSV files
(e.g. R).
-hostnorm file -algnorm alg
: Specify the name of a file that contains
host normalization data. All fields containing the string ``time''
will then automatically get normalized based on the timing factors
for each host. If -algnorm alg
is given, the times will
be expressed as a multiple of the time the algorithm alg
needed.
For more information on time normalization see the next section; examples are given after this option list.
-s sep
: Use sep
to separate fields instead of commas
-m mv
: Use mv
instead of a question mark to indicate
missing values.
-mnp x
: Use x
instead of mv
to indicate a value
for which no field has been found in the input files.
-strip
: Strip strange characters from all non-numeric output.
This can help to make the output more easily digestible by other programs.
-ignoredct
, -ignoreresults
, -ignorestats
: Do not process the corresponding files. This can speed up processing significantly.
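Some further examples (sketches; the filestems, the directory layout, the factor file timefactors, and the field name results.Traintime are placeholders):
parse_results */*.{dct,stats} -f %DS -f %LA -f stats.Error -f 'MAX(dct.Attr_Count_All_Value.*)'
The first call combines special field names, a qualified fieldname, and a function field.
parse_results */*.{dct,results,stats} -breakup foldla -f %DS -f %LA -f %FLD -f stats.Error -fn -o folds.csv
The second call writes one line per combination of filestem, learning algorithm, and crossvalidation fold, including a header line, to the CSV file folds.csv.
parse_results */*.results -f %DS -f %LA -f results.Traintime -hostnorm timefactors -algnorm c50tree
The third call outputs training times normalized to the reference host and expressed as multiples of the time needed by c50tree (timefactors must be a factor table generated by parse_times; see the next section).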
The METAL-MLEE package is intended to simplify the process of
obtaining machine learning experimentation results that possibly
get carried out on different hosts. The run_exp
script
collects the timing information returned from the interface scripts and
puts them into the .results
file.
However, CPU time measurements obtained on different hosts are
not comparable. The task of parse_times
is to analyze the
experimentation results that were obtained on different machines
for the same dataset, using the same seed and algorithms.
From the times measured on different machines, the program will
create a table of factors which roughly represent the relative
performance increase or decrease relative to one reference host.
The table generated can then be used by the parse_results
script to normalize all time measurements to the reference machine.
WARNING: this feature should be used with extreme caution! You should be aware that the factor can only be used as a very rough approximation of the speed differences between two machines. Several factors make this approach rather inaccurate:
The parse_times program will ignore all time measures .
parse_times -host hostname -from YYYYMMDD -to YYYYMMDD [-calc avg | last | median] [-xlispstat filename] filelist
filelist
A list of .results files to process, each
files to process, each
containing timing information for the same set of learning algorithms
on the same dataset.
-calc x
What to do if several measurements for the
same algorithm and host are found (this will be the case if the
experiment gets repeated on the same machine and the run_exp
option -o
is not given, causing all results to get
appended to the same file instead of overwriting old results).
Possible values are: avg
- calculate the average; median
-
calculate the median; and last
- use the last (most recent)
value found.
-xlispstat filename
Write data for subsequent processing
in XLISPSTAT or LISP to this file.
-from YYYYMMDD -to YYYYMMDD
: The generated table will
contain these dates
as the dates identifying the start and end
of the validity period for the factors.
Since machines can get upgraded or other things can change significantly
over time that will influence the speedup factor, you can restrict the
validity of the factor to a certain time period. The parse_results
program will automatically use the factor from the correct time period
based on the experimentation date found in the results files.
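For example (a sketch; the reference host name and the date range are placeholders):
parse_times -host refhost -from 20020101 -to 20021231 -calc median */*.results
This computes, for each host found in the results files, a speed factor relative to the reference host refhost based on the median of repeated measurements, valid for the given period.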
The script check_database.pl
will check the format of a
database for compliance with the standard database
format needed by METAL (see
Section 4).
Note that unless you specify the option -nocheckformat
, this
script will automatically get called from run_exp
in order
to make invalid results caused by a wrong format - which might
otherwise go undetected - less likely.
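For example, to skip the automatic format check (a sketch; the filestem is a placeholder):
run_exp -f somestem -s 1 -nocheckformat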
check_database.pl -f filestem [-regr] [-limit maxerrs] [-max maxlines] [-dbg] [-o]
-f filestem
: Filestem (and path) of the database to process.
The files <filstem>.data
and <filestem>.names
must exist.
-regr
: Indicate that the database is for a regression, not
classification problem.
-limit n
: Limit the number of errors reported to n
.
-max n
: Limit the number of input records to be processed. This
will increase speed but decrease the likelihood of finding rare errors.
-dbg
: Switch on debug mode
-o
: Save the output in a file with the name
<filestem>.check_metal
A single run of run_exp
can create many files and
a very large .log file, so it is often hard to
quickly determine if some algorithm failed and in which fold
of the experiment. The check_results.pl script makes this easier.
check_results.pl -h
check_results.pl -f stem [-N n] [-l alg1 [-l alg2] ...] [-v] [-d] [-dd]
-f stem
The file stem of the files to check, including the seed, i.e. the part of the filename up to and including the seed.
-N n
The number of folds. If this is not specified the
program will guess from the files it finds.
-l alg
Can be specified more than once to provide the
list of learning algorithms. If none is specified the program will
try to guess the list of learning algorithms from what is there.
-v
More verbose output.
-d
Debug - implies -v.
-dd
Even more debug messages.
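For example (a sketch; the filestem and the algorithm names are placeholders):
check_results.pl -f somestem_1 -N 10 -l c50tree -l ltree
This checks whether the files for filestem somestem with seed 1 contain complete results for 10 folds of the algorithms c50tree and ltree.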
The parse_names
program calculates a few measures about the
number of attributes and the number of values for discrete attributes
from a
names file. These measures are included in the .results
file.
parse_names -f namesfile
The output shows:
Type_data
: The type of data file: class
or regr
N_continuous_attr
: The number of numeric attributes
N_discrete_attr
: The number of non-numeric attributes
N_total_discrete_vals
: The total number of values added
up over all discrete attributes.
Avg_discrete_vals
: The average number of values over
all discrete attributes.
Log_discrete_combinations
: The natural logarithm of
the product of the numbers of values of all discrete attributes,
i.e. ln(v_1 * v_2 * ... * v_k), where v_i is the number of values for
discrete attribute number i.
Avg_discrete_combinations
: The value of Log_discrete_combinations
divided by the number of discrete attributes.
N_classes
: The number of classes.
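For example (a sketch; the values shown are purely illustrative):
% parse_names -f somestem.names
Type_data: class
N_continuous_attr: 6
N_discrete_attr: 3
N_classes: 2
...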
This script selects a list of attributes from the input files specified by the infilestem and writes a set of output files specified by the outfilestem:
project infilestem outfilestem attrlist
attrlist
should be a comma-separated list of attribute numbers,
where numbering starts with one. To pass this as a single argument
it might be necessary to enclose the list in single or double quotes
(depending on the shell you are using).
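For example (a sketch; the filestems and attribute numbers are placeholders):
project somestem somestem_sel '1,3,7'
This reads the input files with stem somestem and writes output files with stem somestem_sel that contain only attributes 1, 3, and 7.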
The script expects a .data, a .names, and a .test file to exist and will create the corresponding output files.
NOTE: The script uses the cut command internally to select the attributes. Many preinstalled cut commands only allow for a small number of fields and short records to be processed. Therefore, for most databases, the GNU-cut command or an equivalent version without these limitations should be used. You can specify the path to the cut command in the config.pm configuration file if it should differ from the one in the binary path.
The scripts subdirectory in the METAL-MLEE distribution contains several interface scripts for classification learning algorithms, regression learning algorithms, preprocessing algorithms and landmark measurement algorithms. These files can be used to adapt METAL-MLEE to other learning algorithms by using them as templates.
The following interface scripts for classification learning algorithms are included:
run_cla_TEMPLATE
A template file that is explained
in greater detail in Section 6.
run_cla_basedef
, run_cla_basedef200
:
An interface to the baseclearn learning
algorithm, which essentially ``learns'' the most frequent class
from the input database. The basedef200 interface
runs the baseclearn learning algorithm for only the
first 200 records in the database. The learning algorithm
is available for download from http://ofai.at/~johann.petrak/baseclearn.html.
run_cla_baserand
: Uses the baseclearn learning algorithm
internally, but uses
a random class label determined from the names file instead
of the most frequent one.
run_cla_c45rules
, run_cla_c45tree
:
An interface to a modified
version of the c4.5 and c4.5rules programs.
The modified version of c4.5 is available from http://ofai.at/~johann.petrak/c45ofai.html (The modified version adds several new features, but
the necessary features are: portability to Win32 and a program to
assign class labels to a test dataset)
run_cla_c50boost
, run_cla_c50rules
, run_cla_c50tree
:
These are interface scripts to the commercially available C50 learning
algorithm, a successor of c4.5. The programs are available at
http://www.rulequest.com. You also need a modified
version of the program that assigns the class labels, which is
available at http://ofai.at/~johann.petrak/c45ofai.html.
run_cla_clemMLP
, run_cla_clemRBFN
These are interface scripts to the Clementine learning algorithms
MLP and RBFN, respectively. The interface scripts use the
program run_clem
internally to call the Clementine learning
algorithm in batch mode. See the description of that program for
details.
run_cla_lindiscr
The interface to the linear discriminant
algorithm LinDiscr. (availability details?)
run_cla_ltree
The interface to the linear tree learning
algorithm Ltree (availability details?)
run_cla_mlcib1
The interface to a 1NN learning algorithm
that is based on the MLC++ machine learning library (availability?)
run_cla_nb
The interface to a naive-bayes learning algorithm
that is based on the MLC++ machine learning library
run_cla_ripper
The interface to the ripper
learning algorithm. The program is available from ???
The following interface scripts for regression learning algorithms are included:
run_rla_baggedrt
The interface script to the regression
tree algorithm rt4.1. Available from ????
run_rla_cart
The interface to the cart learning algorithm
that is implemented in rt4.1.
run_rla_clemMLP
, run_rla_clemRBFN
The interface
to the MLP and RBFN learning algorithms of Clementine.
run_rla_cubist
The interface to the cubist
regression rule algorithm, available from http://www.rulequest.com
run_rla_cubistdemo
The interface to the demo version of the
cubist program (will only process a limited number of records)
run_rla_kernel
The interface to the kernel regression
learning algorithm that is implemented in rt4.1
run_rla_lr
The interface to a linear regression model
learner that is implemented in rt4.1.
run_rla_mars
The interface to the mars
learning algorithm (availability?)
run_rla_rtplt
The interface to the ??? learning algorithm
that is implemented in rt4.1
run_rla_svmtorch
The interface to the support
vector machine algorithm svmtorch, available from ???
The following interface scripts for measuring/landmarking algorithms are included:
run_cma_lindiscr
The interface to the linear discriminant
algorithm LinDiscr (see classification learning algorithm interfaces).
run_cma_lm1
(experimental)
run_cma_mlcnb
Use the mlcnb learning algorithm
as a landmark.
run_cma_nodes
The interface to the landmarks.pl
script that calculates several landmarks. See below for a more detailed
description of landmarks.pl.
The following interface scripts for preprocessing algorithms are included:
run_cpa_disc
The interface to the discretization program
discretiser. The interface script also needs the
wrapper script disc_wrapper.perl
. Both programs are available from
????
run_cpa_fselC50T
The interface to a simple feature selection
algorithm that uses the c5.0 decision tree learning algorithm
for a quick guess to find relevant attributes. The script also needs the
atrib_list program and the project program internally.
run_cpa_fselQ1
The interface to a simple feature selection
algorithm that compares the class-posterior means of attributes
to guess their relevance, fselQ1.
The interface uses the project program internally.
The run_clem program simplifies the use of Clementine learning algorithms from the command line. The program analyzes the input files and creates the necessary information to modify a template Clementine stream file, which is then used in a batch-mode run of Clementine. The script directory contains stream templates for the learning algorithms MLP, RBFN, and C5; these are called c5.str, mlp.str, and rbfn.str.
WARNING: the method described has only been tested with version 5.0.1. There was an unresolved problem with version 5.1 when it first came out, but this has not been rechecked since (which????)
run_clem -h
run_clem -f filestem -m method {-train|-test} [-p n|c] [-d path] [-r stem] [-cmd cmd] [-nc] [-i] [-s seed] [-c4] [-v] [-vl]
-f filestem
: The input filestem - there must be a .names
and a .data file for training mode, or a .names and a .test file for test mode.
-m method
: The complete filename (including the path if necessary) of the stream file template to be used.
-train | -test
: Indicate training or test mode. In training mode,
a model file is created, in test mode, the model file is used to create
a file containing the predicted values for the target variable.
-p n|c
: This is needed internally for modifying the
stream file. Usually it can be guessed from the input names file.
Use ``n'' for numeric and ``c'' for discrete target variables.
-d path
: The directory where generated files should be stored.
These are the modified stream files, the model file, and the generated
``analysis'' and ``matrix'' files.
-r stem
: The filestem to use for the generated files. The default
is originalstem>.<method>
.
-cmd cmd
: The command to use to call the Clementine program
(default: clementine).
-nc
: Do not remove temporary files after termination - useful for debugging.
-i
: Interactive - run the generated stream in an interactive
Clementine session, invoking the Clementine GUI. This can help with
finding problems and checking if everything is done correctly.
-s seed
: A random seed - this can be used for streams that
need a randomization seed internally.
-c4
: Accept data and test files where the records are
terminated by a dot (if not specified: don't expect/accept the terminating dot)
-v
: Verbose output
-vl
: Show logfile and version info but not all the info
that is shown with the -v option.
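For example, to train the Clementine MLP on a database (a sketch; the paths and the filestem are placeholders):
run_clem -f somestem -m /path/to/mlp.str -train -d /tmp/clemout
This modifies the mlp.str stream template for the somestem data, runs Clementine in batch mode, and stores the generated stream, model, and analysis files under /tmp/clemout.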
This Perl-script calculates landmark measurements for a database and is used internally by the landmark interface scripts.