at.ofai.gate
Class ListGazetteer
java.lang.Object
gate.util.AbstractFeatureBearer
gate.creole.AbstractResource
gate.creole.AbstractProcessingResource
gate.creole.AbstractLanguageAnalyser
gate.creole.gazetteer.AbstractGazetteer
gate.creole.gazetteer.DefaultGazetteer
at.ofai.gate.ListGazetteer
- All Implemented Interfaces:
- gate.creole.ANNIEConstants, gate.creole.gazetteer.Gazetteer, gate.Executable, gate.LanguageAnalyser, gate.ProcessingResource, gate.Resource, gate.util.FeatureBearer, gate.util.NameBearer, java.io.Serializable
public class ListGazetteer
- extends gate.creole.gazetteer.DefaultGazetteer
This is a modified version of the GATE DefaultGazetteer class. It does
everything that class does and, in addition, can provide annotations
for the part of a word that precedes a match (the prefix) and the part
of a word that follows a match up to the end of the word (the suffix).
Naturally, this only makes sense if the gazetteer matches parts of words,
so the additional annotations are deactivated if the wholeWordsOnly
parameter is true.
Prefix and suffix annotations can be switched on and off separately with
the corresponding parameters PrefixAnnotations and SuffixAnnotations.
Prefix and suffix annotations are created exactly like the corresponding Lookup
annotations, but have major annotation type "Lookup_prefix" and "Lookup_suffix"
respectively.
Suffix annotations always include the "string" feature that contains
the string of the suffix.
Lookup_prefix and Lookup annotations include the string feature if parameter
IncludeStrings is set to true.
In addition to the features provided by the GATE DefaultGazetteer, the
Lookup_prefix and Lookup annotations also include the following features:
- firstcharUpper which is true if the first letter of the corresponding
string is upper case (for Lookup_prefix and Lookup)
- atEnd which is true if a Lookup annotation matched at the end of a
word according to the current word definition (for Lookup).
- atBeginning which is true if a Lookup annotation matched at the
beginning of a word according to the current word definition.
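The three flags above can be illustrated with a minimal sketch. It is not the ListGazetteer implementation: the class MatchFlagsSketch and its methods are hypothetical, and it assumes that words consist of letters only (comparable to wordCharsClass 1) and that a match is given as start and end offsets into the text.

```java
/** Hypothetical illustration only; not part of ListGazetteer. */
final class MatchFlagsSketch {

    // Assumption: a word character is any Unicode letter.
    static boolean isWordChar(char c) {
        return Character.isLetter(c);
    }

    // True if the first character of the matched string is upper case.
    static boolean firstcharUpper(String matched) {
        return !matched.isEmpty() && Character.isUpperCase(matched.charAt(0));
    }

    // True if no word character directly precedes the match.
    static boolean atBeginning(String text, int start) {
        return start == 0 || !isWordChar(text.charAt(start - 1));
    }

    // True if no word character directly follows the match.
    static boolean atEnd(String text, int end) {
        return end == text.length() || !isWordChar(text.charAt(end));
    }
}
```

For the text "Unhappiness here", a match on "ness" (offsets 7 to 11) would be atEnd but not atBeginning under these assumptions, while a match on "Un" (offsets 0 to 2) would be atBeginning and firstcharUpper.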
Nearly all parameters are defined to be runtime parameters so it is
much easier to change them during debugging without the need to re-create
the processing resource.
How words and word boundaries are defined is influenced by the following
parameters:
- wordCharsClass:
- 0 - Everything that is not whitespace is a word. So punctuation etc.
must be included to match whole words, or will be included in
the suffix or prefix if these are used.
- 1 - Letter: words only consist of letters (unicode class). Everything
else (including digits or special characters) is interpreted
as word boundary.
- 2 - Digit: words only consist of digits (unicode class). Everything
else (including letters and special characters) is interpreted
as word boundary.
- 3 - LetterOrDigit: words consist of digits or letters.
- wordChars: a string made up of additional characters that should be
accepted for words. Whitespace will be removed.
You might want to add e.g. a hyphen here.
- wordBoundaryChars: a string made up of additional characters that
should be interpreted as word boundaries. Whitespace will be removed.
Whitespace will ALWAYS be interpreted as word boundary; combining spacing
marks and non-spacing marks will always be interpreted as part of a word.
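How the three parameters might combine can be sketched as follows. This is an illustration under stated assumptions, not the actual ListGazetteer code: the class WordCharSketch is hypothetical, and the precedence between wordBoundaryChars and wordChars (boundary characters win here) is an assumption that the description above does not specify.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration only; not part of ListGazetteer. */
final class WordCharSketch {
    private final int wordCharsClass;
    private final Set<Character> extraWordChars = new HashSet<>();
    private final Set<Character> extraBoundaryChars = new HashSet<>();

    WordCharSketch(int wordCharsClass, String wordChars, String wordBoundaryChars) {
        this.wordCharsClass = wordCharsClass;
        addNonWhitespace(wordChars, extraWordChars);
        addNonWhitespace(wordBoundaryChars, extraBoundaryChars);
    }

    // Whitespace in the parameter strings is dropped, as described above.
    private static void addNonWhitespace(String s, Set<Character> target) {
        if (s == null) {
            return;
        }
        for (char c : s.toCharArray()) {
            if (!Character.isWhitespace(c)) {
                target.add(c);
            }
        }
    }

    boolean isWithinWord(char ch) {
        // Whitespace is always a word boundary.
        if (Character.isWhitespace(ch)) {
            return false;
        }
        // Combining marks (spacing or non-spacing) are always part of a word.
        int type = Character.getType(ch);
        if (type == Character.NON_SPACING_MARK || type == Character.COMBINING_SPACING_MARK) {
            return true;
        }
        // Assumption: explicit boundary characters take precedence over wordChars.
        if (extraBoundaryChars.contains(ch)) {
            return false;
        }
        if (extraWordChars.contains(ch)) {
            return true;
        }
        switch (wordCharsClass) {
            case 0: return true;                            // anything non-whitespace
            case 1: return Character.isLetter(ch);          // Letter
            case 2: return Character.isDigit(ch);           // Digit
            case 3: return Character.isLetterOrDigit(ch);   // LetterOrDigit
            default:
                throw new IllegalArgumentException("Unknown wordCharsClass: " + wordCharsClass);
        }
    }
}
```

With `new WordCharSketch(1, "-", "")`, for example, letters and the hyphen count as word characters while digits act as word boundaries.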
The word characters as defined here only influence how the characters
*outside* the actual gazetteer match are processed, i.e. how
suffixes and prefixes are found. That means that an entry in a gazetteer
list can contain non-word characters and still match, e.g. the entry
"worda wordb" will match "theworda wordbs" even though the space is a
non-word character, and will generate Lookup_prefix.string = "the" and
Lookup_suffix.string = "s".
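The prefix and suffix in this example can be reproduced with a small sketch. It is not taken from the ListGazetteer source: the class AffixSketch is hypothetical, the match is located with a plain String.indexOf instead of the gazetteer's matching, the entry is assumed to be "worda wordb", and letters are assumed to be the only word characters.

```java
/** Hypothetical illustration only; not part of ListGazetteer. */
final class AffixSketch {

    // Assumption: a word character is any Unicode letter.
    static boolean isWordChar(char c) {
        return Character.isLetter(c);
    }

    // Walk left from the match start until a word boundary is reached.
    static String prefix(String text, int matchStart) {
        int i = matchStart;
        while (i > 0 && isWordChar(text.charAt(i - 1))) {
            i--;
        }
        return text.substring(i, matchStart);
    }

    // Walk right from the match end until a word boundary is reached.
    static String suffix(String text, int matchEnd) {
        int i = matchEnd;
        while (i < text.length() && isWordChar(text.charAt(i))) {
            i++;
        }
        return text.substring(matchEnd, i);
    }

    public static void main(String[] args) {
        String text = "see theworda wordbs now";
        String entry = "worda wordb";            // contains a non-word character (space)
        int start = text.indexOf(entry);
        int end = start + entry.length();
        System.out.println(prefix(text, start)); // prints "the"
        System.out.println(suffix(text, end));   // prints "s"
    }
}
```

The space inside the entry never reaches isWordChar, because only the characters outside the matched span are scanned.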
NOTE1: the way words and word boundaries are defined might change in
the future!
NOTE2: the gazetteer program will always insert two special annotations
into the document: @DOCBEGIN with zero length at position 0 and
@DOCEND with zero length after the last character in the document.
These annotations are planned for use in a modified JAPE transducer and
should do no harm with the default JAPE transducers where all annotations
of zero length are ignored.
- Author:
- Valentin Tablan, Borislav Popov, Johann Petrak
- See Also:
- Serialized Form
| Nested classes/interfaces inherited from class gate.creole.gazetteer.DefaultGazetteer |
gate.creole.gazetteer.DefaultGazetteer.CharMap, gate.creole.gazetteer.DefaultGazetteer.Iter |
| Nested classes/interfaces inherited from class gate.creole.AbstractProcessingResource |
gate.creole.AbstractProcessingResource.InternalStatusListener, gate.creole.AbstractProcessingResource.IntervalProgressListener |
| Fields inherited from class gate.creole.gazetteer.DefaultGazetteer |
DEF_GAZ_ANNOT_SET_PARAMETER_NAME, DEF_GAZ_CASE_SENSITIVE_PARAMETER_NAME, DEF_GAZ_DOCUMENT_PARAMETER_NAME, DEF_GAZ_ENCODING_PARAMETER_NAME, DEF_GAZ_LISTS_URL_PARAMETER_NAME, initialState, listsByNode |
| Fields inherited from class gate.creole.gazetteer.AbstractGazetteer |
annotationSetName, caseSensitive, definition, encoding, features, listeners, listsURL, mappingDefinition, wholeWordsOnly |
| Fields inherited from class gate.creole.AbstractLanguageAnalyser |
corpus, document |
| Fields inherited from class gate.creole.AbstractProcessingResource |
interrupted |
| Fields inherited from class gate.creole.AbstractResource |
name |
| Fields inherited from interface gate.creole.ANNIEConstants |
ANNOTATION_COREF_FEATURE_NAME, DATE_ANNOTATION_TYPE, DATE_POSTED_ANNOTATION_TYPE, DOCUMENT_COREF_FEATURE_NAME, JOB_ID_ANNOTATION_TYPE, LOCATION_ANNOTATION_TYPE, LOOKUP_ANNOTATION_TYPE, LOOKUP_CLASS_FEATURE_NAME, LOOKUP_MAJOR_TYPE_FEATURE_NAME, LOOKUP_MINOR_TYPE_FEATURE_NAME, LOOKUP_ONTOLOGY_FEATURE_NAME, MONEY_ANNOTATION_TYPE, ORGANIZATION_ANNOTATION_TYPE, PERSON_ANNOTATION_TYPE, PERSON_GENDER_FEATURE_NAME, PR_NAMES, SENTENCE_ANNOTATION_TYPE, SPACE_TOKEN_ANNOTATION_TYPE, TOKEN_ANNOTATION_TYPE, TOKEN_CATEGORY_FEATURE_NAME, TOKEN_KIND_FEATURE_NAME, TOKEN_LENGTH_FEATURE_NAME, TOKEN_ORTH_FEATURE_NAME, TOKEN_STRING_FEATURE_NAME |
|
Constructor Summary |
ListGazetteer()
Build a gazetteer using the default lists from the GATE resources |
| Methods inherited from class gate.creole.gazetteer.DefaultGazetteer |
add, addLookup, getFSMgml, isWordInternal, lookup, readList, remove, removeLookup |
| Methods inherited from class gate.creole.gazetteer.AbstractGazetteer |
addGazetteerListener, fireGazetteerEvent, getAnnotationSetName, getCaseSensitive, getEncoding, getFeatures, getLinearDefinition, getListsURL, getMappingDefinition, getWholeWordsOnly, reInit, setAnnotationSetName, setCaseSensitive, setEncoding, setFeatures, setListsURL, setMappingDefinition, setWholeWordsOnly |
| Methods inherited from class gate.creole.AbstractLanguageAnalyser |
getCorpus, getDocument, setCorpus, setDocument |
| Methods inherited from class gate.creole.AbstractProcessingResource |
addProgressListener, addStatusListener, cleanup, fireProcessFinished, fireProgressChanged, fireStatusChanged, interrupt, isInterrupted, removeProgressListener, removeStatusListener |
| Methods inherited from class gate.creole.AbstractResource |
checkParameterValues, getBeanInfo, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners |
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Methods inherited from interface gate.LanguageAnalyser |
getCorpus, getDocument, setCorpus, setDocument |
| Methods inherited from interface gate.Resource |
cleanup, getParameterValue, setParameterValue, setParameterValues |
| Methods inherited from interface gate.util.NameBearer |
getName, setName |
| Methods inherited from interface gate.Executable |
interrupt, isInterrupted |
ListGazetteer
public ListGazetteer()
- Build a gazetteer using the default lists from the GATE resources
getPrefixAnnotations
public java.lang.Boolean getPrefixAnnotations()
setPrefixAnnotations
public void setPrefixAnnotations(java.lang.Boolean newPrefixAnnotations)
getSuffixAnnotations
public java.lang.Boolean getSuffixAnnotations()
setSuffixAnnotations
public void setSuffixAnnotations(java.lang.Boolean newSuffixAnnotations)
getIncludeStrings
public java.lang.Boolean getIncludeStrings()
setIncludeStrings
public void setIncludeStrings(java.lang.Boolean newIncludeStrings)
getWordCharsClass
public java.lang.Integer getWordCharsClass()
setWordCharsClass
public void setWordCharsClass(java.lang.Integer newWordCharsClass)
getWordBoundaryChars
public java.lang.String getWordBoundaryChars()
setWordBoundaryChars
public void setWordBoundaryChars(java.lang.String newWordBoundaryChars)
getWordChars
public java.lang.String getWordChars()
setWordChars
public void setWordChars(java.lang.String newWordChars)
init
public gate.Resource init()
throws gate.creole.ResourceInstantiationException
- Specified by:
init in interface gate.Resource
- Overrides:
init in class gate.creole.gazetteer.DefaultGazetteer
- Throws:
gate.creole.ResourceInstantiationException
isWithinWord
public boolean isWithinWord(char ch)
- Tests whether a character is internal to a word (i.e. if it's a letter or
a combining mark (spacing or not)).
- Parameters:
ch - the character to be tested
- Returns:
- a boolean value
execute
public void execute()
throws gate.creole.ExecutionException
- This method runs the gazetteer. It assumes that all the needed parameters
are set. If they are not, an exception will be thrown.
- Specified by:
execute in interface gate.Executable
- Overrides:
execute in class gate.creole.gazetteer.DefaultGazetteer
- Throws:
gate.creole.ExecutionException