next up previous contents
Next: The Programs Up: METAL The METAL Machine Previous: What METAL-MLEE Does   Contents


Standard Database Format

In order to be usable with METAL-MLEE, databases must be in a standard format. This format is similar to the formats used by the C4.5 [Quinlan 1993] and C5.0 machine learning algorithms, but with additional constraints. If your database does not conform to the format explained below in more detail, it needs to be converted.

For each database, two files are required: the data file, which contains the actual data in ASCII-coded, comma-separated values (CSV) format, and the names file, which contains the names and types of the variables in the data file. METAL-MLEE requires that both files have the same name and are located in the same directory, differing only in their file name extension: the data file has the extension .data and the names file has the extension .names. The part of the file name without the extension, which uniquely identifies the pair of files for a database, is called the filestem.

Hence, a database for use with METAL-MLEE always consists of two files, the names and data files, and can be specified by the part of the name that is common to both files, the filestem.
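As an illustration of the filestem convention, the following Python sketch derives the two file names from a filestem. The function names database_files and database_exists are hypothetical and not part of METAL-MLEE; the only facts taken from the text above are the .names and .data extensions and the shared location.

```python
from pathlib import Path

def database_files(filestem):
    """Return the (names file, data file) paths that belong to a filestem."""
    stem = Path(filestem)
    # Append the extensions rather than calling with_suffix(), so that a
    # filestem which itself contains a dot (e.g. "my.db") is not truncated.
    names_path = stem.parent / (stem.name + ".names")
    data_path = stem.parent / (stem.name + ".data")
    return names_path, data_path

def database_exists(filestem):
    """Return True only if both files of the pair are present on disk."""
    return all(path.is_file() for path in database_files(filestem))
```

For a filestem such as db/iris (a made-up example), this yields db/iris.names and db/iris.data.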

METAL-MLEE can handle both regression and classification problems, i.e. both numeric and discrete target variables. In both cases, the target variable must be the last field in the comma-separated list of fields that makes up each record in the data file.
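The position of the target variable can be illustrated with a small Python sketch that splits one record into its attribute values and its target value. The helper split_record and the sample records are hypothetical and only demonstrate the "target is the last field" rule; no type checking of the values is attempted.

```python
def split_record(line):
    """Split one data-file record into (attribute values, target value)."""
    fields = line.rstrip("\n").split(",")
    if len(fields) < 2:
        # A record needs at least one attribute in addition to the target.
        raise ValueError("record must contain at least two fields")
    # The target variable is always the last field of the record.
    return fields[:-1], fields[-1]

# Classification: the target is a class label.
attributes, target = split_record("5.1,3.5,setosa\n")
# Regression: the target is a numeric value, still in last position.
attributes_r, target_r = split_record("red,12,3.75\n")
```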

The restrictions on the format of the database have been imposed to be able to use as many learning algorithms as possible without having to perform costly database format conversions. Note that depending on which learning algorithms you use and how the interface scripts that plug these learning algorithms into METAL-MLEE are written, it might be possible to use a format that does not obey all of the constraints given below. For example, the limitation that the labels used for classes may not be used for discrete attributes has been introduced to make it easier to support the ripper rule learning algorithm. If you do not use this algorithm or if you enhance the interface script for this algorithm, that constraint on the databases need not be enforced any longer.

Names File

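The names file describes the names and types of the fields, or attributes, in the data file. The format of the names file differs slightly depending on whether the target variable is continuous (i.e. the database is used for a regression problem) or discrete (the database is used for a classification problem): as the formal description below shows, the first line contains either the name of the target attribute or a comma-separated list of class labels, terminated by a dot in both cases. All names and labels must be atoms.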

An atom is a string of characters that does not contain blanks, other whitespace, special characters or accented characters and has a maximum length of 32 characters. The string can contain numeric digits, but must start with an alphabetical character.
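A check along these lines could be sketched in Python as follows. The function is_atom is hypothetical, and it assumes that "special characters" means everything other than unaccented ASCII letters and digits; the maximum length is a parameter because the formal description below is stricter than this prose definition (lower-case characters only and a shorter limit).

```python
import re

def is_atom(text, max_len=32):
    """Check a string against the prose definition of an atom.

    Assumption: only unaccented ASCII letters and digits are allowed,
    and the first character must be a letter.
    """
    if not 1 <= len(text) <= max_len:
        return False
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9]*", text) is not None
```

Under this reading, petalwidth and class1 are atoms, while 1class (starts with a digit) and petal width (contains a blank) are not.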

Data File

The data file contains one newline-terminated record for each case in the database. Each record is a comma-separated list of values, each of which is either a numeric value, an atom, or the missing value indicator.

Formal Format Description

Here is a definition of the file formats, in a meta-language similar to Backus-Naur form with Perl-like regular expressions. NUMVALUE is not defined here; it should be a string that can be parsed to a numeric value by the C scanf function with the format directive %g.

atom := [a-z][a-z0-9]{0,29}

namesfile := targetline NEWLINE (attrdefline NEWLINE){1,N}
targetline := (attrname | labellist) DOT
attrname := atom
labellist := atom (COMMA atom)*
attrdefline := attrname COLON SPC (labellist | 'continuous') DOT
DOT := '.'
COLON := ':'
COMMA := ','
SPC := ' '
datafile := (valuelist NEWLINE){1,N}
valuelist := value COMMA value (COMMA value)*
value := (atom | NUMVALUE)
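
To make the names-file grammar more concrete, here is a rough Python sketch of a parser for it. It is not part of METAL-MLEE. It assumes that every attribute definition sits on its own line and ends with a dot, and that the keyword continuous takes precedence over a one-element label list of the same spelling. The name parse_names and the sample attributes are hypothetical.

```python
import re

# Building blocks taken from the grammar above.
ATOM = r"[a-z][a-z0-9]{0,29}"
LABELLIST = rf"{ATOM}(?:,{ATOM})*"
# A single atom is also a one-element label list, so one pattern covers
# both forms of the target line (attribute name or class labels).
TARGETLINE = re.compile(rf"(?:{LABELLIST})\.")
ATTRDEFLINE = re.compile(rf"({ATOM}): (continuous|{LABELLIST})\.")

def parse_names(text):
    """Parse names-file text into (target entries, attribute definitions)."""
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("need a target line and at least one attribute line")
    if not TARGETLINE.fullmatch(lines[0]):
        raise ValueError(f"bad target line: {lines[0]!r}")
    # Drop the trailing dot, then split into target name or class labels.
    target = lines[0][:-1].split(",")
    attributes = []
    for line in lines[1:]:
        match = ATTRDEFLINE.fullmatch(line)
        if match is None:
            raise ValueError(f"bad attribute line: {line!r}")
        name, kind = match.group(1), match.group(2)
        if kind != "continuous":
            kind = kind.split(",")  # discrete attribute: list of labels
        attributes.append((name, kind))
    return target, attributes

# Hypothetical names file for a two-class problem.
EXAMPLE = "yes,no.\noutlook: sunny,rainy.\ntemp: continuous.\n"
target, attributes = parse_names(EXAMPLE)
```

A target line consisting of a single atom cannot be told apart syntactically from a one-class label list; a caller would most likely treat it as the name of a continuous target.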


2002-10-17