Adapting METAL-MLEE

If you want to use METAL-MLEE with learning algorithms other than those for which interface scripts (see Section 5.2) already exist, you need to create interface scripts for them.

In a similar way, you can also add preprocessing algorithms.

Adapting METAL-MLEE to additional algorithms essentially consists of adding the necessary interface programs. The best way to do this is to copy and adapt an existing interface program for a similar algorithm. The interface programs are written in Perl, so some knowledge of Perl is necessary to create a new interface program.

For each type of algorithm, there is a heavily commented template file that can be used as a basis for a new interface program.

To run a certain list of interface scripts automatically (instead of specifying them with the -l option of the run_exp script), edit the config.pm script and change the lists of script names given there for the default classification, default regression, default classification data-measurement, and default regression data-measurement algorithms.

Adding Learning Algorithm Interface Scripts

For a learning algorithm to be usable with METAL-MLEE, the following conditions must be fulfilled:

The learning algorithm should be able to read a training file in a format similar to that of the data file described in Section 4. Similarly, the meta-data required by the algorithm should be in a form that can be derived from the names file. Unless the learning algorithm can use the data and names files directly, the interface program must include a conversion filter.
The learning algorithm should generate a persistent model file that can later be used to assign values of the target variable to new data records.
It should be possible to call the training and prediction (testing) phases of the algorithm separately.
The prediction/testing function of the algorithm should take the model file generated in the training phase and a data file, and generate a file that contains only the predictions for the cases in the data file, one prediction per line.
NOTE: The prediction function of the algorithm should output the prediction error, which must be captured in the interface script and returned to the main driver program run_exp. run_exp regards a learning algorithm for which no prediction error is returned as failed and aborts testing for that algorithm. If the prediction function does not output the error, a dummy value must be returned unless something went wrong (see below). The correct prediction error is calculated by the run_stats program and is available in the .stats file in any case.
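
The conditions above can be illustrated with a toy learner. The following is only a minimal sketch in Python, not part of METAL-MLEE: the function names `train` and `predict`, the JSON model file, and the comma-separated record format with the class label in the last field are all assumptions made for this example. It shows separately callable training and prediction phases, a persistent model file, a predictions file with one prediction per line, and a prediction error that an interface script could capture.

```python
import json
from collections import Counter


def train(data_path, model_path):
    """Training phase: read records and store a persistent model file.

    Assumes comma-separated records with the class label as the last field.
    """
    with open(data_path) as f:
        labels = [line.strip().split(",")[-1] for line in f if line.strip()]
    # The "model" of this toy learner is just the most frequent class.
    majority = Counter(labels).most_common(1)[0][0]
    with open(model_path, "w") as f:
        json.dump({"majority": majority}, f)


def predict(model_path, data_path, pred_path):
    """Prediction phase: load the model, write one prediction per line,
    and return the prediction error (fraction of misclassified records)."""
    with open(model_path) as f:
        model = json.load(f)
    with open(data_path) as f:
        actual = [line.strip().split(",")[-1] for line in f if line.strip()]
    with open(pred_path, "w") as f:
        for _ in actual:
            f.write(model["majority"] + "\n")
    wrong = sum(1 for label in actual if label != model["majority"])
    return wrong / len(actual)
```

A real learning algorithm would typically be an external program; what matters here is only the division into two phases that communicate through a model file.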

For classification learning algorithms refer to the template run_cla_TEMPLATE, for regression learning algorithms refer to run_rla_TEMPLATE.

**Figure 1:** The `run_cla_TEMPLATE` file

Figure 1 shows the run_cla_TEMPLATE file with all comments removed and line numbers added. To adapt the file to some learning algorithm, copy it to a file run_cla_xxx (for a classification algorithm) where xxx is the name of the learning algorithm. Follow the advice given in the comments in the template file to program the interface file for your learning algorithm.

Here are a few notes on adapting the template:

Lines 1 to 8 should be kept untouched.
The LANAME in line 9 should be the same as the algorithm name portion of the interface file name (i.e. the same as the xxx part in run_cla_xxx). VERSION should reflect the version of the interface file. If you want to differentiate between versions of the learning algorithm, include that version information in the LANAME, e.g. myla2.1.
Line 10 should contain the command needed to call the training phase of the learning algorithm. The variable $filestem will contain the filestem of the three files that are prepared before each invocation of the interface file (the data, test, and names files). The variables $args and $trainargs are the values passed via the options -a and -at respectively.
Lines 12 to 15 show a loop that processes each line that the learning algorithm writes to standard output or standard error. Within the loop you can search the output for (numeric) information that you want to pass back to run_exp, which will automatically record it in the .results file and calculate averages. NOTE: you must process all output lines in such a while loop, using the defined($_=getLine()) condition, or the interface will not work properly! If you do not need any information from the output, simply remove lines 13 to 15.
Line 17 shows how to pass the information back to the run_exp program: simply assign the value to a hash variable where the hash key is the desired name of the variable.
Line 18 should be kept unchanged: it shows how to finish the startCMD block that was opened in line 10 and record the time used by the training phase of the learning algorithm.
Line 19 starts the evaluation/prediction phase of the learning algorithm by calling a different program, or the same program with the appropriate options for prediction. The variable $testargs will contain the value of the option -ae.
Lines 20-24 show how to process the standard output of the algorithm. As in the training phase, all lines must be processed that way. In addition, run_exp will only work correctly if you pass back the prediction error in the variable Error as shown. If your algorithm does not write the prediction error to standard output, you can work around the problem by passing back an arbitrary value, ignoring the errors in the .results file later, and using the errors from the .stats file instead (this is recommended anyway, since the errors in the .stats file are more accurate).
Line 25 finishes the startCMD block for the prediction phase and should be kept unchanged.
Line 26: the rmFile function should be used to remove any temporary and working files the algorithm might have created.
Line 27 must be kept unchanged.
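
The logic behind the output-processing loops described in the notes above can be sketched independently of the template. The real interface scripts are written in Perl and read lines with getLine(); the Python below is only an illustration of the idea, and the function name `collect_results`, the dummy value, and the assumed output format ("error: <number>") are hypothetical. Every output line is consumed, a numeric error is extracted when present, and a dummy Error value is passed back only if the run succeeded without printing one.

```python
import re

# Hypothetical output format: a line containing e.g. "error: 0.125".
ERROR_PATTERN = re.compile(r"error\s*[:=]\s*([0-9]*\.?[0-9]+)", re.IGNORECASE)

# Hypothetical placeholder for algorithms that do not print their error.
DUMMY_ERROR = -1.0


def collect_results(output_lines, succeeded=True):
    """Scan all output lines and return the values to pass back by name."""
    results = {}
    for line in output_lines:  # every line is consumed, matching or not
        match = ERROR_PATTERN.search(line)
        if match:
            results["Error"] = float(match.group(1))
    # No error printed: use a dummy value, but only if nothing went wrong.
    if "Error" not in results and succeeded:
        results["Error"] = DUMMY_ERROR
    return results
```

Leaving Error unset for a failed run mirrors the rule that run_exp treats an algorithm without a returned prediction error as failed.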

Adding Preprocessing Algorithms

Preprocessing algorithms change the database before the learning algorithm is applied. For each run_exp experimentation run, you can optionally specify one preprocessing algorithm. That algorithm will be called for each fold of the cross-validation. For run_exp to be able to call the preprocessing algorithm, an interface script must be provided.

Preprocessing algorithms, like learning algorithms, have to process the training and testing sets for each fold separately. A typical interface script will contain two calls to the preprocessing program, one for the training file and one for the test file.

Note: The preprocessing algorithm should never use information from the class labels in the test set! The preprocessing algorithm should always carry out exactly the same transformation on the test set as on the training set. If the preprocessing algorithm adapts itself to the input dataset, you must take care that this does not happen when the test set is processed! For example, a class-aware discretization algorithm should discretize the numeric attributes in the test set in exactly the same way as it discretizes the attributes in the training set, instead of calculating new discretization intervals based on the specific information in the test set.

This is important for a practical reason: otherwise the content or format of the generated training and test files could be incompatible. More importantly, there is a theoretical reason: anything else would be cheating, since it would use information from the test set, which must be regarded as completely unavailable to the estimation procedure.
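
The rule can be made concrete with a small sketch. The example below uses plain equal-width discretization, which is simpler than the class-aware discretization mentioned above, and the function names `fit_cut_points` and `apply_cut_points` are assumptions for illustration only. The point is that the cut points are derived from the training values alone and then reused unchanged on the test values.

```python
from bisect import bisect_right


def fit_cut_points(train_values, n_bins):
    """Derive equal-width cut points from the training values only."""
    lo, hi = min(train_values), max(train_values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def apply_cut_points(values, cut_points):
    """Map each value to a bin index using previously fitted cut points."""
    return [bisect_right(cut_points, v) for v in values]


train_values = [0.0, 2.0, 4.0, 6.0, 8.0]
test_values = [-1.0, 3.0, 9.5]

# Fit once on the training set, then apply the same cut points to both sets.
cuts = fit_cut_points(train_values, 4)
train_bins = apply_cut_points(train_values, cuts)
test_bins = apply_cut_points(test_values, cuts)
```

Test values outside the training range simply fall into the first or last bin, so the test file keeps the same set of bin indices as the training file.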

As with the interface scripts for learning algorithms, use one of the scripts included in the package as a template.


2002-10-17