Class TruncateTokenFilterFactory
Factory for TruncateTokenFilter.
Fixed prefix truncation, used as a stemming method, produces good results for the Turkish language. F5, which keeps the first 5 characters of each token, is reported to have produced the best results in "Information Retrieval on Turkish Texts".
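As a rough illustration of F5 prefix stemming, the following sketch truncates a few inflected forms of the Turkish word "kitap" (book) to their first 5 characters using plain Java strings. The class name F5PrefixDemo, the helper truncate, and the sample words are assumptions made for this example; this is not the Lucene implementation.

```java
import java.util.List;

class F5PrefixDemo {
    // Hypothetical helper: keep at most the first n chars of a token.
    // Adequate here because the sample words contain no surrogate pairs.
    static String truncate(String token, int n) {
        return token.length() <= n ? token : token.substring(0, n);
    }

    public static void main(String[] args) {
        // Inflected forms of "kitap": book, books, in the books.
        List<String> words = List.of("kitap", "kitaplar", "kitaplarda");
        for (String word : words) {
            // Every sample word is reduced to the same 5-character prefix.
            System.out.println(word + " -> " + truncate(word, 5));
        }
    }
}
```

All three forms collapse to the same prefix, which is why a fixed-length prefix can act as a cheap stemmer for a suffixing language such as Turkish.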
Since Lucene 10.5, the filter handles code points correctly: it truncates after
truncateAfterCodePoints code points and no longer produces incomplete surrogate pairs. For
backwards compatibility, the old prefixLength parameter is still supported, and its behaviour depends
on the luceneMatchVersion parameter. If no parameter is given, a prefix length of
5 is used. If you switch to the code point behaviour, reindexing may be required if your
documents contain surrogate pairs (for example, emoji).
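The difference between truncating after UTF-16 chars and truncating after code points can be sketched with plain Java strings. The class CodePointTruncateDemo and its two helper methods are hypothetical names chosen for this example; they only mimic the two behaviours described above and are not taken from the Lucene source.

```java
class CodePointTruncateDemo {
    // Truncate after n UTF-16 chars: may cut a surrogate pair in half.
    static String truncateChars(String s, int n) {
        return s.length() <= n ? s : s.substring(0, n);
    }

    // Truncate after n code points: never splits a surrogate pair.
    static String truncateCodePoints(String s, int n) {
        if (s.codePointCount(0, s.length()) <= n) {
            return s;
        }
        // offsetByCodePoints returns the char index that follows n code points.
        return s.substring(0, s.offsetByCodePoints(0, n));
    }

    public static void main(String[] args) {
        // "ab", then U+1F600 (an emoji encoded as a surrogate pair), then "cd".
        String token = "ab\uD83D\uDE00cd";

        String byChars = truncateChars(token, 3);
        String byCodePoints = truncateCodePoints(token, 3);

        // The char-based result ends with a lone high surrogate.
        System.out.println(Character.isHighSurrogate(byChars.charAt(byChars.length() - 1)));
        // The code point based result keeps the emoji intact: 3 code points, 4 chars.
        System.out.println(byCodePoints.codePointCount(0, byCodePoints.length()));
        System.out.println(byCodePoints.length());
    }
}
```

Because the two strategies yield different terms for tokens that contain surrogate pairs, terms indexed under one behaviour would not match terms produced under the other, which is consistent with the reindexing caveat above.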
The following type is recommended for "diacritics-insensitive search" for Turkish:
<fieldType name="text_tr_ascii_f5" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.TruncateTokenFilterFactory" truncateAfterCodePoints="5"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
- Since:
- 4.8.0
- SPI Name (case-insensitive: if the name is 'htmlStrip', 'htmlstrip' can be used when looking up the service).
- "truncate"
Field Summary
Fields
Modifier and Type      Field                            Description
static final String    NAME                             SPI name
static final String    PREFIX_LENGTH_KEY                Deprecated.
static final String    TRUNCATE_AFTER_CHARS_KEY
static final String    TRUNCATE_AFTER_CODEPOINTS_KEY
Fields inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
Constructor Summary
Constructors
Constructor                                            Description
TruncateTokenFilterFactory()                           Default ctor for compatibility with SPI
TruncateTokenFilterFactory(Map<String,String> args)
Method Summary
Methods inherited from class org.apache.lucene.analysis.TokenFilterFactory
availableTokenFilters, findSPIName, forName, lookupClass, normalize, reloadTokenFilters
Methods inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
defaultCtorException, get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
Field Details
NAME
SPI name
PREFIX_LENGTH_KEY
Deprecated.
TRUNCATE_AFTER_CODEPOINTS_KEY
TRUNCATE_AFTER_CHARS_KEY
Constructor Details
TruncateTokenFilterFactory
TruncateTokenFilterFactory
public TruncateTokenFilterFactory()
Default ctor for compatibility with SPI
Method Details
create
Specified by:
create in class TokenFilterFactory