Pattern Analyzeredit
An analyzer of type pattern that can flexibly separate text into terms
via a regular expression. Accepts the following settings:
The following are settings that can be set for a pattern analyzer
type:
|
|
Should terms be lowercased or not. Defaults to |
|
|
The regular expression pattern, defaults to |
|
|
The regular expression flags. |
|
|
A list of stopwords to initialize the stop filter with. Defaults to an empty stopword list Check Stop Analyzer for more details. |
IMPORTANT: The regular expression should match the token separators, not the tokens themselves.
Flags should be pipe-separated, eg "CASE_INSENSITIVE|COMMENTS". Check
Java
Pattern API for more details about flags options.
Pattern Analyzer Examplesedit
In order to try out these examples, you should delete the test index
before running each example.
Whitespace tokenizeredit
DELETE test
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"whitespace": {
"type": "pattern",
"pattern": "\\s+"
}
}
}
}
}
GET /test/_analyze?analyzer=whitespace&text=foo,bar baz
# "foo,bar", "baz"Non-word character tokenizeredit
DELETE test
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"nonword": {
"type": "pattern",
"pattern": "[^\\w]+"
}
}
}
}
}
GET /test/_analyze?analyzer=nonword&text=foo,bar baz
# "foo,bar baz" becomes "foo", "bar", "baz"
GET /test/_analyze?analyzer=nonword&text=type_1-type_4
# "type_1","type_4"CamelCase tokenizeredit
DELETE test
PUT /test?pretty=1
{
"settings": {
"analysis": {
"analyzer": {
"camel": {
"type": "pattern",
"pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
}
}
}
}
}
GET /test/_analyze?analyzer=camel&text=MooseX::FTPClass2_beta
# "moose","x","ftp","class","2","beta"The regex above is easier to understand as:
([^\p{L}\d]+) # swallow non letters and numbers,
| (?<=\D)(?=\d) # or non-number followed by number,
| (?<=\d)(?=\D) # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]]) # or lower case
(?=\p{Lu}) # followed by upper case,
| (?<=\p{Lu}) # or upper case
(?=\p{Lu} # followed by upper case
[\p{L}&&[^\p{Lu}]] # then lower case
)