Package org.apache.lucene.analysis.ko
Class KoreanTokenizer
- java.lang.Object
  - org.apache.lucene.util.AttributeSource
    - org.apache.lucene.analysis.TokenStream
      - org.apache.lucene.analysis.Tokenizer
        - org.apache.lucene.analysis.ko.KoreanTokenizer
All Implemented Interfaces:
Closeable, AutoCloseable
public final class KoreanTokenizer extends Tokenizer
Tokenizer for Korean that uses morphological analysis.
This tokenizer sets a number of additional attributes:
- PartOfSpeechAttribute containing part-of-speech.
- ReadingAttribute containing reading.
This tokenizer uses a rolling Viterbi search to find the least cost segmentation (path) of the incoming characters.
- WARNING: This API is experimental and might change in incompatible ways in the next release.
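To make the idea of a least cost segmentation concrete, here is a minimal sketch of a Viterbi-style search over a toy alphabet. It is not Lucene's implementation: the class name `LeastCostSegmentation`, the word costs, the flat `CONNECTION_COST`, and the `UNKNOWN_COST` fallback are all assumptions made up for illustration. The sketch also scores the whole input at once instead of rolling over a character stream, and it uses a single constant where the real tokenizer consults dictionary and connection costs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LeastCostSegmentation {
    // Hypothetical word costs (lower = more likely). Not Lucene dictionary data.
    static final Map<String, Integer> WORD_COST = new HashMap<>();
    static {
        WORD_COST.put("a", 2);
        WORD_COST.put("b", 2);
        WORD_COST.put("d", 1);
        WORD_COST.put("ab", 3);
        WORD_COST.put("cd", 2);
        WORD_COST.put("abc", 6);
    }
    // Assumed cost of a single character that is not in the toy dictionary.
    static final int UNKNOWN_COST = 10;
    // Assumed flat cost added for every token on the path.
    static final int CONNECTION_COST = 1;

    /** Returns the segmentation of text with the lowest total cost. */
    static List<String> segment(String text) {
        int n = text.length();
        int[] best = new int[n + 1]; // best[i] = cheapest cost of a path ending at i
        int[] back = new int[n + 1]; // back[i] = start of the last token on that path
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (best[start] == Integer.MAX_VALUE) {
                    continue; // no path reaches this start position
                }
                Integer wordCost = WORD_COST.get(text.substring(start, end));
                if (wordCost == null) {
                    if (end - start > 1) {
                        continue; // unknown words are only allowed as unigrams here
                    }
                    wordCost = UNKNOWN_COST;
                }
                int cost = best[start] + wordCost + CONNECTION_COST;
                if (cost < best[end]) {
                    best[end] = cost;
                    back[end] = start;
                }
            }
        }
        // Walk the back pointers from the end to recover the cheapest path.
        List<String> tokens = new ArrayList<>();
        for (int pos = n; pos > 0; pos = back[pos]) {
            tokens.add(text.substring(back[pos], pos));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(segment("abcd")); // prints [ab, cd]
    }
}
```

With these made-up costs, the path `ab` + `cd` totals 7, while `abc` + `d` and `a` + `b` + `cd` both total 9, so the search returns the two-token path.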
Nested Class Summary
| Modifier and Type | Class | Description |
| --- | --- | --- |
| static class | KoreanTokenizer.DecompoundMode | Decompound mode: this determines how the tokenizer handles POS.Type.COMPOUND, POS.Type.INFLECT and POS.Type.PREANALYSIS tokens. |
| static class | KoreanTokenizer.Type | Token type reflecting the original source of this token |
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
Field Summary
| Modifier and Type | Field | Description |
| --- | --- | --- |
| static KoreanTokenizer.DecompoundMode | DEFAULT_DECOMPOUND | Default mode for the decompound of tokens (KoreanTokenizer.DecompoundMode.DISCARD). |
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor Summary
| Constructor | Description |
| --- | --- |
| KoreanTokenizer() | Creates a new KoreanTokenizer with default parameters. |
| KoreanTokenizer(AttributeFactory factory, TokenInfoDictionary systemDictionary, UnknownDictionary unkDictionary, ConnectionCosts connectionCosts, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation) | Create a new KoreanTokenizer supplying a custom system dictionary and unknown dictionary. |
| KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams) | Create a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene. |
| KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation) | Create a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene. |
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| void | close() | |
| void | end() | |
| boolean | incrementToken() | |
| void | reset() | |
| void | setGraphvizFormatter(GraphvizFormatter dotOut) | Expert: set this to produce graphviz (dot) output of the Viterbi lattice |
Methods inherited from class org.apache.lucene.analysis.Tokenizer
correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
Field Detail
DEFAULT_DECOMPOUND
public static final KoreanTokenizer.DecompoundMode DEFAULT_DECOMPOUND
Default mode for the decompound of tokens (KoreanTokenizer.DecompoundMode.DISCARD).
Constructor Detail
KoreanTokenizer
public KoreanTokenizer()
Creates a new KoreanTokenizer with default parameters. Uses the default AttributeFactory.
KoreanTokenizer
public KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams)
Create a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
Parameters:
factory - the AttributeFactory to use
userDictionary - Optional: if non-null, user dictionary.
mode - Decompound mode.
outputUnknownUnigrams - if true outputs unigrams for unknown words.
KoreanTokenizer
public KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation)
Create a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
Parameters:
factory - the AttributeFactory to use
userDictionary - Optional: if non-null, user dictionary.
mode - Decompound mode.
outputUnknownUnigrams - if true outputs unigrams for unknown words.
discardPunctuation - true if punctuation tokens should be dropped from the output.
KoreanTokenizer
public KoreanTokenizer(AttributeFactory factory, TokenInfoDictionary systemDictionary, UnknownDictionary unkDictionary, ConnectionCosts connectionCosts, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation)
Create a new KoreanTokenizer supplying a custom system dictionary and unknown dictionary. This constructor provides an entry point for users that want to construct custom language models that can be used as input to DictionaryBuilder.
Parameters:
factory - the AttributeFactory to use
systemDictionary - a custom known token dictionary
unkDictionary - a custom unknown token dictionary
connectionCosts - custom token transition costs
userDictionary - Optional: if non-null, user dictionary.
mode - Decompound mode.
outputUnknownUnigrams - if true outputs unigrams for unknown words.
discardPunctuation - true if punctuation tokens should be dropped from the output.
WARNING: This API is experimental and might change in incompatible ways in the next release.
Method Detail
setGraphvizFormatter
public void setGraphvizFormatter(GraphvizFormatter dotOut)
Expert: set this to produce graphviz (dot) output of the Viterbi lattice
close
public void close() throws IOException
Specified by:
close in interface AutoCloseable
Specified by:
close in interface Closeable
Overrides:
close in class Tokenizer
Throws:
IOException
reset
public void reset() throws IOException
Overrides:
reset in class Tokenizer
Throws:
IOException
end
public void end() throws IOException
Overrides:
end in class TokenStream
Throws:
IOException
incrementToken
public boolean incrementToken() throws IOException
Specified by:
incrementToken in class TokenStream
Throws:
IOException