Pattern Tokenizer | Elasticsearch Reference [1.7]



Pattern Tokenizer

A tokenizer of type `pattern` that splits text into terms using a regular expression. It accepts the following settings:

Setting	Description
`pattern`	The regular expression pattern, defaults to `\W+`.
`flags`	The Java regular expression flags, pipe-separated, for example `CASE_INSENSITIVE|COMMENTS`.
`group`	Which group to extract into tokens. Defaults to `-1` (split).
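
As a rough illustration of how these settings fit together, the sketch below builds an index settings body as a Python dictionary and prints it as JSON. The names `my_tokenizer` and `my_analyzer` and the comma pattern are assumptions chosen for the example, not values from this page; the body would be sent when creating an index.

```python
import json

# Hypothetical index settings that register a pattern tokenizer.
# "my_tokenizer", "my_analyzer" and the "," pattern are illustrative only.
index_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "my_tokenizer": {
                    "type": "pattern",  # selects the pattern tokenizer
                    "pattern": ",",     # regex matching the separators
                    "group": -1,        # -1 (the default) means split
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "my_tokenizer",
                }
            },
        }
    }
}

# Print the JSON body that would accompany an index creation request.
print(json.dumps(index_settings, indent=2))
```

Omitting `pattern` or `group` from the tokenizer definition leaves them at the defaults listed in the table.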

IMPORTANT: With the default `group` of `-1`, the regular expression should match the token separators, not the tokens themselves.
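
The split behavior can be approximated outside Elasticsearch. The following is a minimal sketch that uses Python's `re` module in place of the Java regex engine, so edge cases (for example non-ASCII handling of `\W`) may differ. `split_tokens` is a hypothetical helper name, and dropping empty tokens is an assumption about how the tokenizer treats adjacent separators.

```python
import re

def split_tokens(text, pattern=r"\W+"):
    """Emulate split mode: emit the text found between separator matches."""
    tokens = []
    index = 0
    for match in re.finditer(pattern, text):
        # Keep only non-empty text preceding this separator.
        if match.start() > index:
            tokens.append(text[index:match.start()])
        index = match.end()
    # Emit whatever follows the last separator.
    if index < len(text):
        tokens.append(text[index:])
    return tokens

print(split_tokens("aaa, bbb; ccc"))   # ['aaa', 'bbb', 'ccc']
print(split_tokens("a,b,,c", ","))     # ['a', 'b', 'c']
```

Note that the pattern describes what sits between the tokens; a pattern such as `\w+` would instead discard the words and keep the punctuation.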

Setting `group` to `-1` (the default) is equivalent to "split": the text between matches becomes the tokens. Setting `group` to `0` or higher emits the selected capture group of each match as a token. For example, given:

pattern = '([^']+)'
group   = 0
input   = aaa 'bbb' 'ccc'

the output will be two tokens: `'bbb'` and `'ccc'` (including the ' marks), because group 0 is the entire match. With the same input but `group = 1`, the output is `bbb` and `ccc` (no ' marks).
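
The group example can be reproduced with a short Python sketch. It relies on Python's `re` module rather than the Java regex engine, which behaves the same way for this pattern; `group_tokens` is a hypothetical helper, not part of Elasticsearch.

```python
import re

def group_tokens(text, pattern, group):
    """Emulate group >= 0: emit the chosen capture group of every match."""
    tokens = []
    for match in re.finditer(pattern, text):
        token = match.group(group)
        # Skip groups that did not participate or matched nothing.
        if token:
            tokens.append(token)
    return tokens

text = "aaa 'bbb' 'ccc'"
pattern = r"'([^']+)'"

print(group_tokens(text, pattern, 0))  # ["'bbb'", "'ccc'"]
print(group_tokens(text, pattern, 1))  # ['bbb', 'ccc']
```

In both cases `aaa` is dropped, because with `group >= 0` only text covered by a match can become a token.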
