at.ofai.gate
Class ListGazetteer

java.lang.Object
  extended by gate.util.AbstractFeatureBearer
      extended by gate.creole.AbstractResource
          extended by gate.creole.AbstractProcessingResource
              extended by gate.creole.AbstractLanguageAnalyser
                  extended by gate.creole.gazetteer.AbstractGazetteer
                      extended by gate.creole.gazetteer.DefaultGazetteer
                          extended by at.ofai.gate.ListGazetteer
All Implemented Interfaces:
gate.creole.ANNIEConstants, gate.creole.gazetteer.Gazetteer, gate.Executable, gate.LanguageAnalyser, gate.ProcessingResource, gate.Resource, gate.util.FeatureBearer, gate.util.NameBearer, java.io.Serializable

public class ListGazetteer
extends gate.creole.gazetteer.DefaultGazetteer

This is a modified version of the GATE DefaultGazetteer class. It does everything that class does and can additionally provide annotations for the part of a word that precedes a match (the prefix) and the part of a word that follows a match up to the end of the word (the suffix).

Naturally, this only makes sense if the gazetteer matches parts of words, so the additional annotations are deactivated if the wholeWordsOnly parameter is true.

Prefix and suffix annotations can be switched on and off separately by the corresponding parameters SuffixAnnotations and PrefixAnnotations.

Prefix and suffix annotations are created exactly as the corresponding Lookup annotations, but have the major annotation type "Lookup_prefix" and "Lookup_suffix" respectively.

Suffix annotations always include the "string" feature that contains the string of the suffix.

Lookup_prefix and Lookup annotations include the string feature if the parameter IncludeStrings is set to true.

In addition to the features provided by the GATE Default gazetteer, the Lookup_prefix and Lookup annotations also include the features

Nearly all parameters are defined to be runtime parameters so it is much easier to change them during debugging without the need to re-create the processing resource.

How words and word boundaries are defined is influenced by the parameters set through setWordChars, setWordBoundaryChars and setWordCharsClass.

Whitespace will ALWAYS be interpreted as a word boundary; combining spacing marks and non-spacing marks will always be interpreted as part of a word.

The word characters as defined here only influence how the characters *outside* the actual gazetteer match are processed, i.e. how suffixes and prefixes are found. That means that an entry in a gazetteer list can contain non-word characters and still match. For example, the entry "worda wordb" will match within "theworda wordbs" even though the space is a non-word character, and will generate Lookup_prefix.string = "the" and Lookup_suffix.string = "s".
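The prefix and suffix search described above can be illustrated with a small standalone sketch. It is not the ListGazetteer implementation: the class AffixSketch, its methods, and the use of Character::isLetter as the word-character test are assumptions made for illustration only. The sketch takes the text "theworda wordbs", assumes the matching entry is "worda wordb", and walks outwards from the match until it reaches a non-word character.

```java
import java.util.function.IntPredicate;

/** Hypothetical helper, not part of ListGazetteer: finds the word text around a match. */
final class AffixSketch {

    /** Returns the run of word characters that ends directly before matchStart. */
    static String prefixOf(String text, int matchStart, IntPredicate isWordChar) {
        int i = matchStart;
        // Walk left while the preceding character still belongs to the word.
        while (i > 0 && isWordChar.test(text.charAt(i - 1))) {
            i--;
        }
        return text.substring(i, matchStart);
    }

    /** Returns the run of word characters that starts at matchEnd (exclusive end of the match). */
    static String suffixOf(String text, int matchEnd, IntPredicate isWordChar) {
        int j = matchEnd;
        // Walk right while the next character still belongs to the word.
        while (j < text.length() && isWordChar.test(text.charAt(j))) {
            j++;
        }
        return text.substring(matchEnd, j);
    }

    public static void main(String[] args) {
        String text = "theworda wordbs";
        String entry = "worda wordb"; // assumed gazetteer entry; contains a non-word character
        int start = text.indexOf(entry);
        int end = start + entry.length();
        // Assumption: letters count as word characters; the real test is configurable.
        System.out.println("prefix = " + prefixOf(text, start, Character::isLetter));
        System.out.println("suffix = " + suffixOf(text, end, Character::isLetter));
    }
}
```

The word-character test is passed in as a parameter because the documentation states that the definition of word characters is configurable; the match itself is never inspected by that test, which is why the space inside the entry does not prevent the match.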

NOTE1: the way words and word boundaries are defined might change in the future!

NOTE2: the gazetteer program will always insert two special annotations into the document: @DOCBEGIN with zero length at position 0 and @DOCEND with zero length after the last character in the document. These annotations are planned for use in a modified JAPE transducer and should do no harm with the default JAPE transducers where all annotations of zero length are ignored.

Author:
Valentin Tablan, Borislav Popov, Johann Petrak
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class gate.creole.gazetteer.DefaultGazetteer
gate.creole.gazetteer.DefaultGazetteer.CharMap, gate.creole.gazetteer.DefaultGazetteer.Iter
 
Nested classes/interfaces inherited from class gate.creole.AbstractProcessingResource
gate.creole.AbstractProcessingResource.InternalStatusListener, gate.creole.AbstractProcessingResource.IntervalProgressListener
 
Field Summary
 
Fields inherited from class gate.creole.gazetteer.DefaultGazetteer
DEF_GAZ_ANNOT_SET_PARAMETER_NAME, DEF_GAZ_CASE_SENSITIVE_PARAMETER_NAME, DEF_GAZ_DOCUMENT_PARAMETER_NAME, DEF_GAZ_ENCODING_PARAMETER_NAME, DEF_GAZ_LISTS_URL_PARAMETER_NAME, initialState, listsByNode
 
Fields inherited from class gate.creole.gazetteer.AbstractGazetteer
annotationSetName, caseSensitive, definition, encoding, features, listeners, listsURL, mappingDefinition, wholeWordsOnly
 
Fields inherited from class gate.creole.AbstractLanguageAnalyser
corpus, document
 
Fields inherited from class gate.creole.AbstractProcessingResource
interrupted
 
Fields inherited from class gate.creole.AbstractResource
name
 
Fields inherited from interface gate.creole.ANNIEConstants
ANNOTATION_COREF_FEATURE_NAME, DATE_ANNOTATION_TYPE, DATE_POSTED_ANNOTATION_TYPE, DOCUMENT_COREF_FEATURE_NAME, JOB_ID_ANNOTATION_TYPE, LOCATION_ANNOTATION_TYPE, LOOKUP_ANNOTATION_TYPE, LOOKUP_CLASS_FEATURE_NAME, LOOKUP_MAJOR_TYPE_FEATURE_NAME, LOOKUP_MINOR_TYPE_FEATURE_NAME, LOOKUP_ONTOLOGY_FEATURE_NAME, MONEY_ANNOTATION_TYPE, ORGANIZATION_ANNOTATION_TYPE, PERSON_ANNOTATION_TYPE, PERSON_GENDER_FEATURE_NAME, PR_NAMES, SENTENCE_ANNOTATION_TYPE, SPACE_TOKEN_ANNOTATION_TYPE, TOKEN_ANNOTATION_TYPE, TOKEN_CATEGORY_FEATURE_NAME, TOKEN_KIND_FEATURE_NAME, TOKEN_LENGTH_FEATURE_NAME, TOKEN_ORTH_FEATURE_NAME, TOKEN_STRING_FEATURE_NAME
 
Constructor Summary
ListGazetteer()
          Build a gazetteer using the default lists from the GATE resources
 
Method Summary
 void execute()
          This method runs the gazetteer.
 java.lang.Boolean getIncludeStrings()
           
 java.lang.Boolean getPrefixAnnotations()
           
 java.lang.Boolean getSuffixAnnotations()
           
 java.lang.String getWordBoundaryChars()
           
 java.lang.String getWordChars()
           
 java.lang.Integer getWordCharsClass()
           
 gate.Resource init()
           
 boolean isWithinWord(char ch)
          Tests whether a character is internal to a word (i.e. if it's a letter or a combining mark (spacing or not)).
 void setIncludeStrings(java.lang.Boolean newIncludeStrings)
           
 void setPrefixAnnotations(java.lang.Boolean newPrefixAnnotations)
           
 void setSuffixAnnotations(java.lang.Boolean newSuffixAnnotations)
           
 void setWordBoundaryChars(java.lang.String newWordBoundaryChars)
           
 void setWordChars(java.lang.String newWordChars)
           
 void setWordCharsClass(java.lang.Integer newWordCharsClass)
           
 
Methods inherited from class gate.creole.gazetteer.DefaultGazetteer
add, addLookup, getFSMgml, isWordInternal, lookup, readList, remove, removeLookup
 
Methods inherited from class gate.creole.gazetteer.AbstractGazetteer
addGazetteerListener, fireGazetteerEvent, getAnnotationSetName, getCaseSensitive, getEncoding, getFeatures, getLinearDefinition, getListsURL, getMappingDefinition, getWholeWordsOnly, reInit, setAnnotationSetName, setCaseSensitive, setEncoding, setFeatures, setListsURL, setMappingDefinition, setWholeWordsOnly
 
Methods inherited from class gate.creole.AbstractLanguageAnalyser
getCorpus, getDocument, setCorpus, setDocument
 
Methods inherited from class gate.creole.AbstractProcessingResource
addProgressListener, addStatusListener, cleanup, fireProcessFinished, fireProgressChanged, fireStatusChanged, interrupt, isInterrupted, removeProgressListener, removeStatusListener
 
Methods inherited from class gate.creole.AbstractResource
checkParameterValues, getBeanInfo, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gate.LanguageAnalyser
getCorpus, getDocument, setCorpus, setDocument
 
Methods inherited from interface gate.Resource
cleanup, getParameterValue, setParameterValue, setParameterValues
 
Methods inherited from interface gate.util.NameBearer
getName, setName
 
Methods inherited from interface gate.Executable
interrupt, isInterrupted
 

Constructor Detail

ListGazetteer

public ListGazetteer()
Build a gazetteer using the default lists from the GATE resources

Method Detail

getPrefixAnnotations

public java.lang.Boolean getPrefixAnnotations()

setPrefixAnnotations

public void setPrefixAnnotations(java.lang.Boolean newPrefixAnnotations)

getSuffixAnnotations

public java.lang.Boolean getSuffixAnnotations()

setSuffixAnnotations

public void setSuffixAnnotations(java.lang.Boolean newSuffixAnnotations)

getIncludeStrings

public java.lang.Boolean getIncludeStrings()

setIncludeStrings

public void setIncludeStrings(java.lang.Boolean newIncludeStrings)

getWordCharsClass

public java.lang.Integer getWordCharsClass()

setWordCharsClass

public void setWordCharsClass(java.lang.Integer newWordCharsClass)

getWordBoundaryChars

public java.lang.String getWordBoundaryChars()

setWordBoundaryChars

public void setWordBoundaryChars(java.lang.String newWordBoundaryChars)

getWordChars

public java.lang.String getWordChars()

setWordChars

public void setWordChars(java.lang.String newWordChars)

init

public gate.Resource init()
                   throws gate.creole.ResourceInstantiationException
Specified by:
init in interface gate.Resource
Overrides:
init in class gate.creole.gazetteer.DefaultGazetteer
Throws:
gate.creole.ResourceInstantiationException

isWithinWord

public boolean isWithinWord(char ch)
Tests whether a character is internal to a word (i.e. if it's a letter or a combining mark (spacing or not)).

Parameters:
ch - the character to be tested
Returns:
a boolean value
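
The test described here (a letter, or a spacing or non-spacing combining mark) can be expressed with the standard java.lang.Character categories. The following is a minimal sketch of such a test, not the actual ListGazetteer code: the class name WordCharSketch is hypothetical, and the sketch does not take into account the word character parameters, which may change the result in the real class.

```java
/** Hypothetical stand-in for the documented character test; not the ListGazetteer source. */
final class WordCharSketch {

    /** Returns true for letters and for combining marks (spacing or non-spacing). */
    static boolean isWithinWord(char ch) {
        int type = Character.getType(ch);
        return Character.isLetter(ch)
                || type == Character.COMBINING_SPACING_MARK  // Unicode category Mc
                || type == Character.NON_SPACING_MARK;       // Unicode category Mn
    }

    public static void main(String[] args) {
        char[] samples = {'a', '\u0301', '\u0903', ' ', '-'};
        for (char ch : samples) {
            // Print the code point rather than the character, since marks do not render alone.
            System.out.printf("U+%04X within word: %b%n", (int) ch, isWithinWord(ch));
        }
    }
}
```

Under this assumed definition, U+0301 (a non-spacing mark) and U+0903 (a spacing combining mark) count as word-internal, while a space or a hyphen does not.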

execute

public void execute()
             throws gate.creole.ExecutionException
This method runs the gazetteer. It assumes that all the needed parameters are set. If they are not, an exception will be thrown.

Specified by:
execute in interface gate.Executable
Overrides:
execute in class gate.creole.gazetteer.DefaultGazetteer
Throws:
gate.creole.ExecutionException