Package org.apache.lucene.analysis.email
Class UAX29URLEmailTokenizerImpl
- java.lang.Object
-
- org.apache.lucene.analysis.email.UAX29URLEmailTokenizerImpl
-
public final class UAX29URLEmailTokenizerImpl extends Object
This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs. Tokens produced are of the following types:
- <ALPHANUM>: A sequence of alphabetic and numeric characters
- <NUM>: A number
- <URL>: A URL
- <EMAIL>: An email address
- <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
- <IDEOGRAPHIC>: A single CJKV ideographic character
- <HIRAGANA>: A single hiragana character
- <KATAKANA>: A sequence of katakana characters
- <HANGUL>: A sequence of Hangul characters
- <EMOJI>: A sequence of Emoji characters
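The type labels above can be observed through Lucene's public `UAX29URLEmailTokenizer`, which wraps this scanner. The following is a minimal sketch, assuming Lucene 9 or later with the analysis-common module on the classpath; the class name `TokenTypesSketch` and the helper `typedTokens` are hypothetical names introduced only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.email.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Hypothetical demo class; assumes lucene-analysis-common (9.x+) is available.
class TokenTypesSketch {

  /** Returns one "<TYPE> term" string per token found in the text. */
  static List<String> typedTokens(String text) throws IOException {
    List<String> out = new ArrayList<>();
    try (UAX29URLEmailTokenizer tokenizer = new UAX29URLEmailTokenizer()) {
      tokenizer.setReader(new StringReader(text));
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
      tokenizer.reset();                 // required before the first incrementToken()
      while (tokenizer.incrementToken()) {
        out.add(type.type() + " " + term);
      }
      tokenizer.end();                   // finalizes offsets after the last token
    }
    return out;
  }

  public static void main(String[] args) throws IOException {
    for (String line : typedTokens("jane@example.com https://lucene.apache.org")) {
      System.out.println(line);
    }
  }
}
```

Each printed line pairs the token's type label with its text, so an email address and a URL should each come back as a single token rather than being split at punctuation.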
-
-
Field Summary
Fields (Modifier and Type, Field: Description)
- static int AVOID_BAD_URL
- static int EMAIL_TYPE: Email token type
- static int EMOJI_TYPE: Emoji token type
- static int HANGUL_TYPE: Hangul token type
- static int HIRAGANA_TYPE: Hiragana token type
- static int IDEOGRAPHIC_TYPE: Ideographic token type
- static int KATAKANA_TYPE: Katakana token type
- static int NUMERIC_TYPE: Numbers
- static int SOUTH_EAST_ASIAN_TYPE: Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.).
- static int URL_TYPE: URL token type
- static int WORD_TYPE: Alphanumeric sequences
- static int YYEOF: This character denotes the end of file.
- static int YYINITIAL: Lexical states.
-
Constructor Summary
Constructors (Constructor: Description)
- UAX29URLEmailTokenizerImpl(Reader in): Creates a new scanner.
-
Method Summary
All Methods (Modifier and Type, Method: Description)
- int getNextToken(): Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
- void getText(CharTermAttribute t): Fills CharTermAttribute with the current token text.
- void setBufferSize(int numChars): Sets the scanner buffer size in chars.
- boolean yyatEOF(): Returns whether the scanner has reached the end of the reader it reads from.
- void yybegin(int newState): Enters a new lexical state.
- int yychar(): Character count processed so far.
- char yycharat(int position): Returns the character at the given position from the matched text.
- void yyclose(): Closes the input reader.
- int yylength(): How many characters were matched.
- void yypushback(int number): Pushes the specified number of characters back into the input stream.
- void yyreset(Reader reader): Resets the scanner to read from a new input stream.
- int yystate(): Returns the current lexical state.
- String yytext(): Returns the text matched by the current regular expression.
-
-
-
Field Detail
-
YYEOF
public static final int YYEOF
This character denotes the end of file.
- See Also:
- Constant Field Values
-
YYINITIAL
public static final int YYINITIAL
Lexical states.
- See Also:
- Constant Field Values
-
AVOID_BAD_URL
public static final int AVOID_BAD_URL
- See Also:
- Constant Field Values
-
WORD_TYPE
public static final int WORD_TYPE
Alphanumeric sequences
- See Also:
- Constant Field Values
-
NUMERIC_TYPE
public static final int NUMERIC_TYPE
Numbers
- See Also:
- Constant Field Values
-
SOUTH_EAST_ASIAN_TYPE
public static final int SOUTH_EAST_ASIAN_TYPE
Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29.
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
- See Also:
- Constant Field Values
-
IDEOGRAPHIC_TYPE
public static final int IDEOGRAPHIC_TYPE
Ideographic token type
- See Also:
- Constant Field Values
-
HIRAGANA_TYPE
public static final int HIRAGANA_TYPE
Hiragana token type
- See Also:
- Constant Field Values
-
KATAKANA_TYPE
public static final int KATAKANA_TYPE
Katakana token type
- See Also:
- Constant Field Values
-
HANGUL_TYPE
public static final int HANGUL_TYPE
Hangul token type
- See Also:
- Constant Field Values
-
EMAIL_TYPE
public static final int EMAIL_TYPE
Email token type
- See Also:
- Constant Field Values
-
URL_TYPE
public static final int URL_TYPE
URL token type
- See Also:
- Constant Field Values
-
EMOJI_TYPE
public static final int EMOJI_TYPE
Emoji token type
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
UAX29URLEmailTokenizerImpl
public UAX29URLEmailTokenizerImpl(Reader in)
Creates a new scanner.
- Parameters:
in - the java.io.Reader to read input from.
-
-
Method Detail
-
yychar
public final int yychar()
Character count processed so far
-
getText
public final void getText(CharTermAttribute t)
Fills CharTermAttribute with the current token text.
-
setBufferSize
public final void setBufferSize(int numChars)
Sets the scanner buffer size in chars
-
yyclose
public final void yyclose() throws IOException
Closes the input reader.
- Throws:
IOException - if the reader could not be closed.
-
yyreset
public final void yyreset(Reader reader)
Resets the scanner to read from a new input stream. Does not close the old reader.
All internal variables are reset, and the old input stream cannot be reused (the internal buffer is discarded and lost). The lexical state is set to ZZ_INITIAL. The internal scan buffer is resized down to its initial length if it has grown.
- Parameters:
reader - The new input stream.
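Because yyreset discards the scanner's internal state and does not close the old reader, one scanner instance can be pointed at several inputs in turn. The sketch below illustrates that reuse pattern, assuming Lucene 9 or later on the classpath; `ScannerReuseSketch`, `countTokens`, and `countAll` are hypothetical names for illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.email.UAX29URLEmailTokenizerImpl;

// Hypothetical demo class; assumes lucene-analysis-common (9.x+) is available.
class ScannerReuseSketch {

  /** Drains the scanner and returns how many tokens it produced. */
  static int countTokens(UAX29URLEmailTokenizerImpl scanner) throws IOException {
    int count = 0;
    while (scanner.getNextToken() != UAX29URLEmailTokenizerImpl.YYEOF) {
      count++;
    }
    return count;
  }

  /** Reuses a single scanner for every input by calling yyreset between them. */
  static int[] countAll(String... inputs) throws IOException {
    UAX29URLEmailTokenizerImpl scanner =
        new UAX29URLEmailTokenizerImpl(new StringReader(""));
    int[] counts = new int[inputs.length];
    for (int i = 0; i < inputs.length; i++) {
      // yyreset does not close the previous reader; StringReader holds no
      // system resources, so nothing leaks in this sketch.
      scanner.yyreset(new StringReader(inputs[i]));
      counts[i] = countTokens(scanner);
    }
    return counts;
  }

  public static void main(String[] args) throws IOException {
    int[] counts = countAll("alpha beta", "gamma delta epsilon");
    System.out.println(counts[0] + " " + counts[1]);
  }
}
```

With readers that do hold resources (files, sockets), the caller would need to close the old reader itself, since yyreset leaves it open.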
-
yyatEOF
public final boolean yyatEOF()
Returns whether the scanner has reached the end of the reader it reads from.
- Returns:
- whether the scanner has reached EOF.
-
yystate
public final int yystate()
Returns the current lexical state.
- Returns:
- the current lexical state.
-
yybegin
public final void yybegin(int newState)
Enters a new lexical state.
- Parameters:
newState - the new lexical state
-
yytext
public final String yytext()
Returns the text matched by the current regular expression.
- Returns:
- the matched text.
-
yycharat
public final char yycharat(int position)
Returns the character at the given position from the matched text. It is equivalent to yytext().charAt(position), but faster.
- Parameters:
position - the position of the character to fetch. A value from 0 to yylength()-1.
- Returns:
- the character at position.
-
yylength
public final int yylength()
How many characters were matched.
- Returns:
- the length of the matched text region.
-
yypushback
public void yypushback(int number)
Pushes the specified number of characters back into the input stream. They will be read again by the next call of the scanning method.
- Parameters:
number - the number of characters to be read again. This number must not be greater than yylength().
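The pushback contract can be illustrated with a toy model. The class below is not Lucene's scanner: `PushbackToy` is a hypothetical stand-in that matches runs of letters and tracks a match start and a match end, which is enough to show why pushed-back characters are read again and why the count may not exceed yylength().

```java
// Toy model of the pushback contract; a hypothetical stand-in, not Lucene code.
class PushbackToy {
  private final String input;
  private int start = 0;   // index where the current match begins
  private int marked = 0;  // index just past the current match; next scan starts here

  PushbackToy(String input) {
    this.input = input;
  }

  /** Matches the next run of letters, or returns null at end of input. */
  String next() {
    int i = marked;
    while (i < input.length() && !Character.isLetter(input.charAt(i))) {
      i++;                                   // skip non-letters
    }
    if (i >= input.length()) {
      start = marked = input.length();
      return null;
    }
    start = i;
    while (i < input.length() && Character.isLetter(input.charAt(i))) {
      i++;                                   // extend the match
    }
    marked = i;
    return input.substring(start, marked);
  }

  /** Length of the current match. */
  int yylength() {
    return marked - start;
  }

  /** Shrinks the current match from the right so those chars are scanned again. */
  void yypushback(int number) {
    if (number < 0 || number > yylength()) {
      throw new IllegalArgumentException("pushback larger than matched text");
    }
    marked -= number;
  }

  public static void main(String[] args) {
    PushbackToy toy = new PushbackToy("foobar baz");
    System.out.println(toy.next());  // first match
    toy.yypushback(3);               // give back the last three characters
    System.out.println(toy.next());  // rescans the pushed-back characters
  }
}
```

Pushing back more than the matched length would move the scan position before the start of the current match, which is why the limit is yylength().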
-
getNextToken
public int getNextToken() throws IOException
Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
- Returns:
- the next token.
- Throws:
IOException - if any I/O error occurs.
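A typical driver calls getNextToken() in a loop until it returns YYEOF, reading the token text and position after each call. The following is a minimal sketch of such a loop, assuming Lucene 9 or later on the classpath; `ScannerLoopSketch` and `scan` are hypothetical names, and `CharTermAttributeImpl` is used here only as a convenient standalone CharTermAttribute.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.email.UAX29URLEmailTokenizerImpl;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl;

// Hypothetical demo class; assumes lucene-analysis-common (9.x+) is available.
class ScannerLoopSketch {

  /** Returns one "type=T chars=C length=L text" line per token. */
  static List<String> scan(String text) throws IOException {
    UAX29URLEmailTokenizerImpl scanner =
        new UAX29URLEmailTokenizerImpl(new StringReader(text));
    CharTermAttribute term = new CharTermAttributeImpl();
    List<String> out = new ArrayList<>();
    int type;
    // getNextToken() returns one of the *_TYPE constants, or YYEOF at end of input.
    while ((type = scanner.getNextToken()) != UAX29URLEmailTokenizerImpl.YYEOF) {
      scanner.getText(term);  // copies the matched text into the attribute
      out.add("type=" + type
          + " chars=" + scanner.yychar()
          + " length=" + scanner.yylength()
          + " " + term);
    }
    return out;
  }

  public static void main(String[] args) throws IOException {
    for (String line : scan("mail bob@example.org now")) {
      System.out.println(line);
    }
  }
}
```

The integer types can be compared against constants such as EMAIL_TYPE or URL_TYPE rather than hard-coded numbers, since their values are not listed on this page.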
-
-