Class HyphenationCompoundWordTokenFilter
- java.lang.Object
- org.apache.lucene.util.AttributeSource
- org.apache.lucene.analysis.TokenStream
- org.apache.lucene.analysis.TokenFilter
- org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
- org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter
- All Implemented Interfaces:
Closeable, AutoCloseable, Unwrappable<TokenStream>
public class HyphenationCompoundWordTokenFilter extends CompoundWordTokenFilterBase
A TokenFilter that decomposes compound words found in many Germanic languages. "Donaudampfschiff" becomes Donau, dampf, schiff so that you can find "Donaudampfschiff" even when you only enter "schiff". It uses a hyphenation grammar and a word dictionary to achieve this.
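The idea described above can be illustrated with a minimal, standalone sketch. It does not use Lucene and is not the filter's actual implementation: the class name `CompoundSplitSketch`, the hard-coded break points (standing in for what a hyphenation grammar might report) and the lowercase matching are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Simplified illustration only; not Lucene's implementation. */
class CompoundSplitSketch {

    /**
     * Collects every dictionary word that spans two candidate break points.
     * breakPoints must be sorted ascending and include 0 and word.length().
     */
    static List<String> decompose(String word, int[] breakPoints, Set<String> dictionary) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < breakPoints.length; i++) {
            for (int j = i + 1; j < breakPoints.length; j++) {
                // Skip the whole word: the original token is passed through separately.
                if (breakPoints[i] == 0 && breakPoints[j] == word.length()) {
                    continue;
                }
                String candidate =
                        word.substring(breakPoints[i], breakPoints[j]).toLowerCase(Locale.ROOT);
                if (dictionary.contains(candidate)) {
                    parts.add(candidate);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical break points: Do|nau|dampf|schiff
        int[] breaks = {0, 2, 5, 10, 16};
        Set<String> dict = Set.of("donau", "dampf", "schiff");
        System.out.println(decompose("Donaudampfschiff", breaks, dict));
        // prints [donau, dampf, schiff]
    }
}
```

Candidates such as "do" or "nau" fall between break points but are dropped because the dictionary does not contain them, which is why the dictionary matters in addition to the hyphenation grammar.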
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
CompoundWordTokenFilterBase.CompoundToken
-
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
-
Field Summary
-
Fields inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, maxSubwordSize, minSubwordSize, minWordSize, offsetAtt, onlyLongestMatch, termAtt, tokens
-
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
-
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
-
Constructor Summary
Constructors
Constructor - Description
HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator) - Create a HyphenationCompoundWordTokenFilter with no dictionary.
HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, int minWordSize, int minSubwordSize, int maxSubwordSize) - Create a HyphenationCompoundWordTokenFilter with no dictionary.
HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary) - Creates a new HyphenationCompoundWordTokenFilter instance.
HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch) - Creates a new HyphenationCompoundWordTokenFilter instance.
HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch, boolean noSubMatches, boolean noOverlappingMatches) - Creates a new HyphenationCompoundWordTokenFilter instance.
-
Method Summary
Modifier and Type - Method - Description
protected void - decompose() - Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list.
static HyphenationTree - getHyphenationTree(String hyphenationFilename) - Create a hyphenator tree
static HyphenationTree - getHyphenationTree(InputSource hyphenationSource) - Create a hyphenator tree
-
Methods inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
incrementToken, reset
-
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end, unwrap
-
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
-
-
-
Constructor Detail
-
HyphenationCompoundWordTokenFilter
public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary)
Creates a new HyphenationCompoundWordTokenFilter instance.
- Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against.
-
HyphenationCompoundWordTokenFilter
public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Creates a new HyphenationCompoundWordTokenFilter instance.
- Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against.
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - Add only the longest matching subword to the stream
-
HyphenationCompoundWordTokenFilter
public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch, boolean noSubMatches, boolean noOverlappingMatches)
Creates a new HyphenationCompoundWordTokenFilter instance.
- Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against.
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - Add only the longest matching subword to the stream
noSubMatches - Excludes subwords that are enclosed by another token
noOverlappingMatches - Excludes subwords that overlap with another subword
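To make the size parameters and onlyLongestMatch more concrete, here is a rough standalone sketch of how such limits could narrow a list of candidate subwords. It is not taken from Lucene: the class `SubwordFilterSketch` and its `select` method are hypothetical, the bounds are treated as inclusive (an assumption; the exact comparisons should be checked against the Lucene source), and keeping a single longest candidate overall is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Lucene. */
class SubwordFilterSketch {

    static List<String> select(String word, List<String> candidates,
                               int minWordSize, int minSubwordSize, int maxSubwordSize,
                               boolean onlyLongestMatch) {
        List<String> kept = new ArrayList<>();
        // Words that are too short are not decomposed at all (inclusive bound assumed).
        if (word.length() < minWordSize) {
            return kept;
        }
        String longest = null;
        for (String candidate : candidates) {
            // Drop subwords outside the configured size range (inclusive bounds assumed).
            if (candidate.length() < minSubwordSize || candidate.length() > maxSubwordSize) {
                continue;
            }
            kept.add(candidate);
            if (longest == null || candidate.length() > longest.length()) {
                longest = candidate;
            }
        }
        // Simplification: keep one longest candidate overall when requested.
        if (onlyLongestMatch && longest != null) {
            return List.of(longest);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> candidates = List.of("do", "donau", "dampf", "schiff");
        System.out.println(select("Donaudampfschiff", candidates, 5, 3, 15, false));
        // prints [donau, dampf, schiff]
        System.out.println(select("Donaudampfschiff", candidates, 5, 3, 15, true));
        // prints [schiff]
    }
}
```

The sizes 5, 3 and 15 are arbitrary values chosen for the example, not a statement about the class defaults.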
-
HyphenationCompoundWordTokenFilter
public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, int minWordSize, int minSubwordSize, int maxSubwordSize)
Create a HyphenationCompoundWordTokenFilter with no dictionary.
-
HyphenationCompoundWordTokenFilter
public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator)
Create a HyphenationCompoundWordTokenFilter with no dictionary.
-
-
Method Detail
-
getHyphenationTree
public static HyphenationTree getHyphenationTree(String hyphenationFilename) throws IOException
Create a hyphenator tree
- Parameters:
hyphenationFilename - the filename of the XML grammar to load
- Returns:
An object representing the hyphenation patterns
- Throws:
IOException - If there is a low-level I/O error.
-
getHyphenationTree
public static HyphenationTree getHyphenationTree(InputSource hyphenationSource) throws IOException
Create a hyphenator tree
- Parameters:
hyphenationSource - the InputSource pointing to the XML grammar
- Returns:
An object representing the hyphenation patterns
- Throws:
IOException - If there is a low-level I/O error.
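The InputSource overload takes an org.xml.sax.InputSource, which ships with the JDK. The sketch below shows one way to build such a source from a file path. The helper class `HyphenationSourceSketch` and the file name `de_DR.xml` are hypothetical, and the Lucene call appears only in a comment so that the block runs without Lucene on the classpath.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import org.xml.sax.InputSource;

/** Hypothetical helper for illustration; not part of Lucene. */
class HyphenationSourceSketch {

    /** Wraps a grammar file in an InputSource identified by its file URI. */
    static InputSource grammarSource(Path grammarFile) {
        // The system ID lets the XML parser locate and open the file itself.
        return new InputSource(grammarFile.toUri().toASCIIString());
    }

    public static void main(String[] args) {
        InputSource source = grammarSource(Paths.get("de_DR.xml")); // hypothetical file name
        System.out.println(source.getSystemId());
        // With Lucene on the classpath, the source would then be passed to
        // HyphenationCompoundWordTokenFilter.getHyphenationTree(source).
    }
}
```

Since getHyphenationTree is declared to throw IOException, the real call site would need to handle or propagate that exception.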
-
decompose
protected void decompose()
Description copied from class: CompoundWordTokenFilterBase
Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list. The original token must not be placed in the list, as it is automatically passed through this filter.
- Specified by:
decompose in class CompoundWordTokenFilterBase
-
-