Standard Database Format

In order to be usable with METAL-MLEE, databases must be in a standard format. This format is similar to the formats used by the C4.5 [Quinlan 1993] and C5.0 machine learning algorithms, but with additional constraints. If your database does not conform to the format explained below in more detail, it needs to be converted.

Each database requires two files: the data file, which contains the actual data in ASCII-coded, comma-separated values (CSV) format, and the names file, which contains the names and types of the variables in the data file. For use with METAL-MLEE, both files must have the same name and be located in the same directory, differing only in their file name extension: the data file has the extension .data and the names file has the extension .names. The part of the file name without the extension, which uniquely identifies the pair of files for a database, is called the filestem.

Hence, a database for use with METAL-MLEE always consists of two files, the names and data files, and can be specified by the part of the name that is common to both files, the filestem.
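
The filestem convention can be illustrated with a minimal Python sketch. The function name database_files is hypothetical and not part of METAL-MLEE; the sketch only shows how the two file names follow from one filestem.

```python
from pathlib import Path


def database_files(filestem):
    """Return the (names_file, data_file) paths for a filestem.

    Hypothetical helper for illustration; not part of METAL-MLEE.
    """
    stem = Path(filestem)
    # Append the extensions instead of replacing a suffix, so that a
    # filestem that itself contains a dot (e.g. "iris.v2") stays intact.
    names_file = stem.with_name(stem.name + ".names")
    data_file = stem.with_name(stem.name + ".data")
    return names_file, data_file
```

For example, database_files("db/iris") yields the paths db/iris.names and db/iris.data.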

METAL-MLEE can handle both regression and classification problems, i.e. both numeric and discrete target variables. In both cases, the target variable must be the last field in the comma-separated list of fields that makes up each record in the data file.

The restrictions on the format of the database have been imposed to be able to use as many learning algorithms as possible without having to perform costly database format conversions. Note that depending on which learning algorithms you use and how the interface scripts that plug these learning algorithms into METAL-MLEE are written, it might be possible to use a format that does not obey all of the constraints given below. For example, the limitation that the labels used for classes may not be used for discrete attributes has been introduced to make it easier to support the ripper rule learning algorithm. If you do not use this algorithm or if you enhance the interface script for this algorithm, that constraint on the databases need not be enforced any longer.

Names File

The names file describes the names and types of the fields, or attributes, in the data file. Its format differs slightly depending on whether the target variable is continuous (i.e. the database is used for a regression problem) or discrete (the database is used for a classification problem):

The first line for a classification database contains a comma separated list of possible class labels and is terminated by a dot. Each class label must be a valid atom (see below for the definition) that does not occur as a value of any of the other discrete attributes.
The first line for a regression database contains the name of the last attribute defined in the names file, followed by a dot.
All other lines contain attribute descriptions, in order of appearance of the corresponding fields in the data file.
An attribute description consists of an attribute name that starts in column 1, followed by a colon and a blank, followed by either the word "continuous" for real-valued attributes or a comma separated list of values for a discrete-valued attribute. Discrete values must be atoms and cannot be integers.
All attribute descriptions must be terminated by a dot.
Names must be atoms.
For classification databases, the values for discrete attributes may not include values that are used as class labels.
The names file contains nothing else. More specifically, it must not contain any comments as allowed for C5.0 nor any blank lines.
The missing value indicator is not part of the value list or otherwise listed in the attribute description.

An atom is a string of characters that does not contain blanks, other whitespace, special characters or accented characters and has a maximum length of 32 characters. The string can contain numeric digits, but must start with an alphabetical character.
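
The names-file rules and the atom definition above can be sketched as a small Python checker. This is an illustration under stated assumptions, not the parser used by METAL-MLEE: the names is_atom and parse_names are hypothetical, the atom pattern follows the prose definition (a letter followed by letters or digits, at most 32 characters), and a database is treated as a regression database when the first line holds a single name equal to the last attribute and that attribute is continuous.

```python
import re

# Assumption: this follows the prose definition of an atom (a letter, then
# letters or digits, at most 32 characters in total). Tighten the pattern
# if your tools need a narrower definition.
ATOM_RE = re.compile(r"[A-Za-z][A-Za-z0-9]{0,31}")


def is_atom(text):
    """Return True if text is an atom under the assumption above."""
    return ATOM_RE.fullmatch(text) is not None


def parse_names(text):
    """Parse names-file text into (kind, target, attributes).

    kind is "regression" or "classification"; target is the target
    attribute name or the list of class labels; attributes is a list of
    (name, "continuous" or list_of_values) pairs in file order.
    """
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # the newline ending the last line is not a blank line
    if len(lines) < 2:
        raise ValueError("expected a target line and at least one attribute")
    for number, line in enumerate(lines, start=1):
        # Blank lines and comments fail here or in the atom checks below.
        if not line.endswith("."):
            raise ValueError(f"line {number}: missing terminating dot")

    attributes = []
    for number, line in enumerate(lines[1:], start=2):
        name, colon, spec = line[:-1].partition(": ")
        if not colon or not is_atom(name):
            raise ValueError(f"line {number}: bad attribute description")
        if spec == "continuous":
            attributes.append((name, "continuous"))
            continue
        values = spec.split(",")
        if not all(is_atom(value) for value in values):
            raise ValueError(f"line {number}: discrete values must be atoms")
        attributes.append((name, values))

    first = lines[0][:-1].split(",")
    if not all(is_atom(item) for item in first):
        raise ValueError("line 1: expected atoms")
    # Regression: the first line names the last attribute, which is continuous.
    if len(first) == 1 and attributes[-1] == (first[0], "continuous"):
        return "regression", first[0], attributes
    # Classification: class labels must not reappear as discrete values.
    for name, spec in attributes:
        if spec != "continuous" and set(spec) & set(first):
            raise ValueError(f"attribute {name} reuses a class label")
    return "classification", first, attributes
```

With a hypothetical three-line names file consisting of "yes,no.", "age: continuous." and "colour: red,green.", parse_names reports a classification database with two class labels and two attributes. The sketch is deliberately strict: it rejects blanks after commas, since the rules above describe a plain comma-separated list.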

Data File

The data file contains one newline-terminated record for each case in the database. Each record is a comma-separated list of fields, each of which is a numeric value, an atom, or the missing value indicator.

Every atom that occurs in a field must be mentioned in the corresponding attribute description in the names file.
Numeric values must be represented in a way that can be read by the C scanf function using the "%g" directive.
The missing value indicator for both numeric and discrete fields is an unquoted question mark.
The atoms used for discrete fields must not be quoted.
Data records must not be terminated by a dot.
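
The data-file rules above can be sketched as a per-record check in Python. The function check_record and its calling convention are hypothetical. The sketch assumes that a classification database stores the class label as an extra last field that is not listed among the attribute descriptions, that a regression database lists its target as the last attribute, and that the missing value indicator may appear in any field. Numbers are restricted to plain decimal notation, which is a conservative subset of what the scanf "%g" directive accepts.

```python
import re

MISSING = "?"
# Assumption: plain decimal notation only, a conservative subset of what
# scanf's %g directive accepts (it omits forms such as "inf" or hex floats).
NUMBER_RE = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def check_record(line, attributes, class_labels=None):
    """Split one data-file record and convert its fields.

    attributes is a list of (name, "continuous" or list_of_atoms) pairs in
    names-file order. class_labels is the list of class labels for a
    classification database, or None for a regression database.
    Returns a list of floats, atoms, and None for missing values.
    """
    fields = line.rstrip("\n").split(",")
    specs = [spec for _, spec in attributes]
    if class_labels is not None:
        specs.append(class_labels)  # the class label is the last field
    if len(fields) != len(specs):
        raise ValueError(f"expected {len(specs)} fields, got {len(fields)}")
    record = []
    for position, (field, spec) in enumerate(zip(fields, specs), start=1):
        if field == MISSING:
            record.append(None)
        elif spec == "continuous":
            if NUMBER_RE.fullmatch(field) is None:
                raise ValueError(f"field {position}: not a number: {field!r}")
            record.append(float(field))
        elif field in spec:
            # Quoted atoms or atoms with a trailing dot are not in the
            # value list, so they are rejected by the branch below.
            record.append(field)
        else:
            raise ValueError(f"field {position}: unknown value {field!r}")
    return record
```

Because discrete fields are compared verbatim against the value list, a quoted atom or a record terminated by a dot after an atom is reported as an unknown value rather than silently accepted.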

Formal Format Description

The file formats are defined below in a meta-language similar to Backus-Naur form with Perl-like regular expressions. NUMVALUE is not defined; it stands for any string that the C scanf function can parse as a numeric value with the "%g" format directive.

atom := [a-z][a-z0-9]{0,29}

namesfile := targetline NEWLINE (attrdefline NEWLINE){1,N}
targetline := attrname DOT | labellist DOT
attrname := atom
labellist := atom [COMMA atom]*
attrdefline := attrname COLON SPC (labellist | 'continuous') DOT
DOT := '.'
COLON := ':'
COMMA := ','
SPC := ' '
datafile := (valuelist NEWLINE ){1,N}
valuelist := value COMMA value [COMMA value]*
value := (atom | NUMVALUE | '?')

2002-10-17