Standard Tokenizer
A tokenizer of type standard that provides a grammar-based tokenizer, well suited to most European-language documents. The tokenizer implements the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.

The following settings can be configured for a standard tokenizer type:
Setting | Description
---|---
`max_token_length` | The maximum token length. If a token is seen that exceeds this length then it is split at `max_token_length` intervals. Defaults to `255`.
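As a sketch of how this setting is applied in practice, the request below defines a custom analyzer built on a standard tokenizer with a lowered `max_token_length`. The index name (`my_index`) and the analyzer and tokenizer names (`my_analyzer`, `my_tokenizer`) are illustrative placeholders, not names used elsewhere in this documentation:

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}
```

With `max_token_length` set to 5, a token such as `jumping` would be split at 5-character intervals, producing `jumpi` and `ng`.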