ICU Tokenizer
Tokenizes text into words on word boundaries, as defined in
UAX #29: Unicode Text Segmentation.
It behaves much like the standard
tokenizer,
but adds better support for some Asian languages by using a dictionary-based
approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and
using custom rules to break Myanmar and Khmer text into syllables.
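The tokenizer can be tried out on its own with the _analyze API before wiring it into an index. The request below is a minimal sketch; it assumes the analysis-icu plugin (which provides icu_tokenizer) is installed, and the Thai sample text is only illustrative.

POST _analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "สวัสดีครับ"
}

The response lists the tokens produced, which makes it easy to confirm that Thai (or CJK) text is split into dictionary words rather than left as a single unbroken token. The configuration below registers the tokenizer inside a custom analyzer on a new index: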
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
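Once the icu_sample index exists, the custom analyzer can be exercised against it with the index-level _analyze API. Again a sketch only; the Japanese sample text is just an illustration.

GET icu_sample/_analyze
{
  "analyzer": "my_icu_analyzer",
  "text": "東京都に住んでいます"
}

The returned tokens should correspond to dictionary words rather than whitespace-delimited chunks, which is the main difference from the standard tokenizer for these languages.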