NGram Tokenizer
A tokenizer of type nGram.
The following settings can be configured for an nGram tokenizer:
Setting | Description | Default value |
---|---|---|
min_gram | Minimum size in codepoints of a single n-gram | 1 |
max_gram | Maximum size in codepoints of a single n-gram | 2 |
token_chars | Character classes to keep in the tokens. Elasticsearch will split on characters that don't belong to any of these classes. | [] (keep all characters) |
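To make the interaction of min_gram and max_gram concrete, here is a minimal sketch in plain Python (not the actual Lucene implementation; the function name and loop order are illustrative) that emits every codepoint substring whose length falls between the two settings:

```python
def ngrams(text, min_gram=1, max_gram=2):
    """Emit every substring of min_gram..max_gram codepoints, left to right."""
    chars = list(text)  # one list entry per codepoint
    grams = []
    for start in range(len(chars)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(chars):
                grams.append("".join(chars[start:start + size]))
    return grams

print(ngrams("Sch"))        # defaults: ['S', 'Sc', 'c', 'ch', 'h']
print(ngrams("Sch", 2, 3))  # ['Sc', 'Sch', 'ch']
```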
token_chars accepts the following character classes:

Class | Description |
---|---|
letter | for example a, b, ï or 京 |
digit | for example 3 or 7 |
whitespace | for example " " or "\n" |
punctuation | for example ! or " |
symbol | for example $ or √ |
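The effect of token_chars can be sketched the same way: split the input wherever a character falls outside the kept classes, then n-gram each chunk on its own. The snippet below is again plain Python, approximating the letter and digit classes with an ASCII regex rather than the full Unicode classes Elasticsearch uses:

```python
import re

def ngram_tokenize(text, min_gram, max_gram):
    tokens = []
    # Split on any run of characters that is neither a letter nor a digit
    # (ASCII approximation of token_chars: ["letter", "digit"]).
    for chunk in re.split(r"[^A-Za-z0-9]+", text):
        for start in range(len(chunk)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(chunk):
                    tokens.append(chunk[start:start + size])
    return tokens

print(ngram_tokenize("FC Schalke 04", 2, 3))
# ['FC', 'Sc', 'Sch', 'ch', 'cha', 'ha', 'hal', 'al', 'alk', 'lk', 'lke', 'ke', '04']
```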
Example
```sh
curl -XPUT 'localhost:9200/test' -d '
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_ngram_analyzer" : {
                    "tokenizer" : "my_ngram_tokenizer"
                }
            },
            "tokenizer" : {
                "my_ngram_tokenizer" : {
                    "type" : "nGram",
                    "min_gram" : "2",
                    "max_gram" : "3",
                    "token_chars": [ "letter", "digit" ]
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
# FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
```
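Because token_chars keeps only letters and digits here, the spaces in "FC Schalke 04" act as token boundaries: FC, Schalke and 04 are n-grammed independently, so no n-gram ever spans a space.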