Elasticsearch analyzers

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis.html
An analyzer is made up of the following components (a combined example follows the list):

Analyzer type (type): custom
Character filters (char_filter): zero or more
Tokenizer (tokenizer): exactly one
Token filters (filter): zero or more, applied in order
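For instance, the components above can be combined into a custom analyzer roughly as follows (a minimal sketch; the index and analyzer names are only illustrative):
PUT my_index { "settings": { "analysis": { "analyzer": { "my_custom_analyzer": { "type": "custom", "char_filter": ["html_strip"], "tokenizer": "standard", "filter": ["lowercase"] } } } } }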
Character filters
Character filters, also called preprocessing filters, preprocess the character stream before it is passed to the tokenizer.
There are three built-in character filters:
1. html_strip: the HTML strip character filter. Behavior:
a. Strips HTML tags from the raw text.
Optional configuration:
escaped_tags: an array of HTML tags that should not be stripped from the original text.
example:
GET _analyze { "tokenizer": "keyword", "char_filter": [ "html_strip" ], "text": "<p>I&apos;m so <b>happy</b>!</p>" }

{ "tokens": [ { "token": "\nI'm so happy!\n", "start_offset": 0, "end_offset": 32, "type": "word", "position": 0 } ] }

PUT my_index { "settings": { "analysis": { "analyzer": { "my_analyzer": { "tokenizer": "keyword", "char_filter": ["my_char_filter"] } }, "char_filter": { "my_char_filter": { "type": "html_strip", "escaped_tags": ["b"] // tags that will not be stripped from the text } } } } }

GET my_index/_analyze { "analyzer":"my_analyzer", "text": "<p>I&apos;m so <b>happy</b>!</p>" }

{ "tokens": [ { "token": "\nI'm so <b>happy</b>!\n", "start_offset": 0, "end_offset": 32, "type": "word", "position": 0 } ] }

2. mapping: the mapping character filter. Behavior:
a. The mapping character filter accepts an array of key-value pairs. Whenever it encounters a string equal to one of the keys, it replaces it with the value associated with that key.
b. Matching is greedy; the longest matching key wins.
c. Replacing with an empty string is allowed.
Optional configuration:
mappings: an array of key => value mappings
mappings_path: a path to a file containing an array of key => value mappings
example:
PUT my_index { "settings": { "analysis": { "analyzer": { "my_analyzer": { "tokenizer": "keyword", "char_filter": [ "my_char_filter" ] } }, "char_filter": { "my_char_filter": { "type": "mapping", "mappings": [ "& => and", "$ => ¥" ] } } } } }

POST my_index/_analyze { "analyzer": "my_analyzer", "text": "My license plate is $203 & $110" }

{ "tokens": [ { "token": "My license plate is ¥203 and ¥110", "start_offset": 0, "end_offset": 31, "type": "word", "position": 0 } ] }

3. pattern_replace: the pattern replace character filter. Behavior:
a. Uses a regular expression to match characters and replaces them with the specified string.
b. The replacement string can reference capture groups from the regular expression.
Optional configuration:
pattern: a Java regular expression. Required.
replacement: the replacement string, which may reference capture groups using the $1..$9 syntax.
flags: Java regular expression flags, pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS".
example:
PUT my_index { "settings": { "analysis": { "analyzer": { "my_analyzer": { "tokenizer": "keyword", "char_filter": [ "my_char_filter" ] } }, "char_filter": { "my_char_filter": { "type": "pattern_replace", "pattern": "(\\d+)-(?=\\d)", "replacement": "$1_" } } } } }

POST my_index/_analyze { "analyzer": "my_analyzer", "text": "My credit card is 123-456-789" }

{ "tokens": [ { "token": "My credit card is 123_456_789", "start_offset": 0, "end_offset": 29, "type": "word", "position": 0 } ] }

Tokenizers
  • Standard Tokenizer
Behavior:
The standard tokenizer works well for most European languages and supports Unicode.
Optional configuration:
max_token_length: the maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255.
ex:
POST _analyze { "tokenizer": "standard", "text": "The 2 QUICK Brown-Foxes of dog's bone." }

Result:
[The, 2, QUICK, Brown, Foxes, of, dog's, bone]
ex:
PUT my_index { "settings": { "analysis": { "analyzer": { "my_analyzer": { "tokenizer": "my_tokenizer" } }, "tokenizer": { "my_tokenizer": { "type": "standard", "max_token_length": 5 } } } } }

  • Letter Tokenizer
Behavior:
Splits the text whenever it encounters a character that is not a letter.
Optional configuration:
Not configurable.
ex:
POST _analyze { "tokenizer": "letter", "text": "The 2 QUICK Brown-Foxes of dog's bone." }

Result:
[The, QUICK, Brown, Foxes, of, dog, s, bone]
  • Lowercase Tokenizer
Behavior:
Can be thought of as the combination of the letter tokenizer and the lowercase token filter, as the example below illustrates.
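A minimal sketch using the same sample sentence (the result shown is what this combination should produce):
POST _analyze { "tokenizer": "lowercase", "text": "The 2 QUICK Brown-Foxes of dog's bone." }
Expected result:
[the, quick, brown, foxes, of, dog, s, bone]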
  • Whitespace Tokenizer
Behavior:
Splits the text whenever it encounters a whitespace character.
Optional configuration:
Not configurable.
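A minimal sketch using the same sample sentence (note that punctuation stays attached to the tokens):
POST _analyze { "tokenizer": "whitespace", "text": "The 2 QUICK Brown-Foxes of dog's bone." }
Expected result:
[The, 2, QUICK, Brown-Foxes, of, dog's, bone.]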

  • UAX URL Email Tokenizer
Behavior:
Like the standard tokenizer, but it additionally recognizes URLs and email addresses as single tokens.
Optional configuration:
max_token_length: defaults to 255.
ex:
POST _analyze { "tokenizer": "uax_url_email", "text": "Email me at john.smith@global-international.com http://www.baidu.com" }

Result:
[Email, me, at, john.smith@global-international.com, http://www.baidu.com]
  • Classic Tokenizer
Behavior:
A grammar-based tokenizer designed for English. It handles English acronyms, company names, email addresses, and most internet domain names well, but it does not work well for languages other than English.
Optional configuration:
max_token_length: defaults to 255.
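A sketch using the same sample sentence (on plain English text like this, the output should be close to that of the standard tokenizer):
POST _analyze { "tokenizer": "classic", "text": "The 2 QUICK Brown-Foxes of dog's bone." }
Expected result:
[The, 2, QUICK, Brown, Foxes, of, dog's, bone]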
  • Thai Tokenizer
Behavior:
A tokenizer dedicated to segmenting Thai text.
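A minimal sketch (the Thai sentence is only a sample input):
POST _analyze { "tokenizer": "thai", "text": "การที่ได้ต้องแสดงว่างานดี" }
This should split the sentence into individual Thai words, each emitted as a separate token.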
  • NGram Tokenizer
Behavior:
N-grams are like a sliding window that moves across the word: a continuous sequence of characters of the specified length. They are useful for querying languages that do not use spaces between words or that have long compound words (e.g. German or Chinese).
Optional configuration:
min_gram: the minimum gram length (defaults to 1)
max_gram: the maximum gram length (defaults to 2)
token_chars: the character classes that should be kept in a token; Elasticsearch splits the text on characters that do not belong to any of the listed classes. Defaults to [] (keep all characters).
Possible values for token_chars:
letter: for example a, b, ï or 京
digit: for example 3 or 7
whitespace: for example " " or "\n"
punctuation: for example ! or "
symbol: for example $ or √
ex:
PUT my_index { "settings": { "analysis": { "analyzer": { "my_analyzer": { "tokenizer": "my_tokenizer", "filter":["lowercase"] } }, "tokenizer": { "my_tokenizer": { "type": "ngram", "min_gram": 3, "max_gram": 10, "token_chars": [ "letter", "digit" ] } } } }, "mappings": { "doc": { "properties": { "title": { "type": "text", "analyzer": "my_analyzer" } } } } }

POST my_index/_analyze { "analyzer": "my_analyzer", "text": "2 2311 Quick Foxes." }

Tokenization result:
[231, 2311, 311, qui, quic, quick, uic, uick, ick, fox, foxe, foxes, oxe, oxes, xes]
  • Edge NGram Tokenizer
    Behavior
The edge_ngram tokenizer differs from the ngram tokenizer in that its n-grams are always anchored to the beginning of the token: ngram slides over the whole word (useful for "contains"-style suggestions), while edge_ngram grows from the start of the word (useful for autocomplete).
For example:
POST _analyze { "tokenizer": "ngram", "text": "a Quick Foxes." }

ngram tokenization test result:
["a", "a ", " ", " Q", "Q", "Qu", "u", "ui", "i", "ic", "c", "ck", "k", "k ", " ", " F", "F", "Fo", "o", "ox", "x", "xe", "e", "es", "s", "s.", ".",]
POST _analyze { "tokenizer": "edge_ngram", "text": "a Quick Foxes." }

edge_ngram tokenization test result:
["a", "a "]
From the test results above we can see that:
By default, both ngram and edge_ngram treat "a Quick Foxes." as one whole string.
By default, the minimum and maximum gram lengths of both ngram and edge_ngram are 1 and 2.
ngram slides a fixed-size window across the word (generally used for search-suggestion style matching).
edge_ngram keeps the start position fixed and stretches the window from the minimum to the maximum length (generally used for word autocomplete; see the configuration sketch below).
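For autocomplete, edge_ngram is therefore usually configured with larger gram bounds than the defaults, roughly along these lines (a sketch; the index, analyzer and tokenizer names are illustrative):
PUT my_index { "settings": { "analysis": { "analyzer": { "autocomplete_analyzer": { "tokenizer": "autocomplete_tokenizer", "filter": ["lowercase"] } }, "tokenizer": { "autocomplete_tokenizer": { "type": "edge_ngram", "min_gram": 2, "max_gram": 10, "token_chars": ["letter", "digit"] } } } } }

POST my_index/_analyze { "analyzer": "autocomplete_analyzer", "text": "Quick" }

Expected result:
[qu, qui, quic, quick]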
  • Keyword Tokenizer
Behavior:
The keyword tokenizer takes the entire input and emits it as a single token.
Optional configuration:
buffer_size: defaults to 256.
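A minimal sketch:
POST _analyze { "tokenizer": "keyword", "text": "New York City" }
Expected result: the whole input as a single token, [New York City].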
  • Pattern Tokenizer
    Behavior:
Splits text using a regular expression.
Optional configuration:
pattern: a Java regular expression; defaults to \W+.
flags: Java regular expression flags, pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS".
group: which capture group to extract as tokens; defaults to -1 (split on the match).
With the default group of -1, the text is split wherever the regular expression matches.
With group=0, the whole string matched by the regular expression is kept as the token.
With group=1,2,3..., the text matched by the corresponding () capture group of the regular expression is kept as the token.
ex:
PUT my_index { "settings": { "analysis": { "analyzer": { "my_analyzer": { "tokenizer": "my_tokenizer" } }, "tokenizer": { "my_tokenizer": { "type": "pattern", "pattern": "\"(.*)\"", "flags": "", "group": -1 } } } } }

Note: this matches strings wrapped in double quotes. Note the difference between "\"(.*)\"" and "\".*\"": both match a double-quoted string, but the former defines a capture group while the latter does not.
POST my_index/_analyze { "analyzer": "my_analyzer", "text": "comma,\"separated\",values" }

Tokenization test results:
With the default group of -1, the regex match is used as the delimiter, giving: ["comma", "values"]
With group=0, the whole regex match becomes the token, i.e. the matched string including the quotes: ["\"separated\""]
With group=1, the text captured by the first () in the regex becomes the token: ["separated"]
With group=2, the text captured by the second () would become the token; since this regex contains only one (), an exception is thrown.
  • Path Hierarchy Tokenizer
    Behavior
The path_hierarchy tokenizer takes a hierarchical value such as a filesystem path, splits it on the path separator, and emits a term for each component in the tree.
Optional configuration
delimiter: the path separator to split on; defaults to /.
replacement: an optional replacement character for the delimiter; defaults to the delimiter.
buffer_size: the buffer size used when splitting the path; defaults to 1024.
reverse: whether to emit the tokens in reverse order; defaults to false.
skip: the number of initial tokens to skip; defaults to 0.
ex:
PUT my_index { "settings": { "analysis": { "analyzer": { "my_analyzer": { "tokenizer": "my_tokenizer" } }, "tokenizer": { "my_tokenizer": { "type": "path_hierarchy", "delimiter": "-", "replacement":"/", "reverse": false, "skip": 0 } } } } }

POST my_index/_analyze { "analyzer": "my_analyzer", "text": "one-two-three-four" }

{ "tokens": [ { "token": "one", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 }, { "token": "one/two", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0 }, { "token": "one/two/three", "start_offset": 0, "end_offset": 13, "type": "word", "position": 0 }, { "token": "one/two/three/four", "start_offset": 0, "end_offset": 18, "type": "word", "position": 0 } ] }

I. The 8 built-in analyzers:
  • standard analyzer: the default analyzer. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages, but poorly for Chinese.
POST _analyze { "analyzer":"standard", "text":"Geneva K. Risk-Issues " }

{ "tokens": [ { "token": "geneva", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 }, { "token": "k", "start_offset": 7, "end_offset": 8, "type": "<ALPHANUM>", "position": 1 }, { "token": "risk", "start_offset": 10, "end_offset": 14, "type": "<ALPHANUM>", "position": 2 }, { "token": "issues", "start_offset": 15, "end_offset": 21, "type": "<ALPHANUM>", "position": 3 } ] }

  • simple analyzer: provides letter-based tokenization; it splits whenever it encounters a non-letter character and lowercases all letters.
POST _analyze { "analyzer":"simple", "text":"Geneva K. Risk-Issues " }

{ "tokens": [ { "token": "geneva", "start_offset": 0, "end_offset": 6, "type": "word", "position": 0 }, { "token": "k", "start_offset": 7, "end_offset": 8, "type": "word", "position": 1 }, { "token": "risk", "start_offset": 10, "end_offset": 14, "type": "word", "position": 2 }, { "token": "issues", "start_offset": 15, "end_offset": 21, "type": "word", "position": 3 } ] }

  • whitespace analyzer: provides whitespace-based tokenization; it splits whenever it encounters whitespace.
POST _analyze { "analyzer":"whitespace", "text":"Geneva K. Risk-Issues " }

{ "tokens": [ { "token": "Geneva", "start_offset": 0, "end_offset": 6, "type": "word", "position": 0 }, { "token": "K.", "start_offset": 7, "end_offset": 9, "type": "word", "position": 1 }, { "token": "Risk-Issues", "start_offset": 10, "end_offset": 21, "type": "word", "position": 2 } ] }

  • stop analyzer: the same as the simple analyzer, but it also removes stop words. It defaults to the _english_ stop word list.
POST _analyze { "analyzer":"stop", "text":"Geneva K.of Risk-Issues " }

{ "tokens": [ { "token": "geneva", "start_offset": 0, "end_offset": 6, "type": "word", "position": 0 }, { "token": "k", "start_offset": 7, "end_offset": 8, "type": "word", "position": 1 }, { "token": "risk", "start_offset": 16, "end_offset": 20, "type": "word", "position": 4 }, { "token": "issues", "start_offset": 21, "end_offset": 27, "type": "word", "position": 5 } ] }

  • keyword analyzer: a no-op analyzer; it returns the entire input string as a single token, i.e. it does not tokenize at all.
POST _analyze { "analyzer":"keyword", "text":"Geneva K.of Risk-Issues " }

{ "tokens": [ { "token": "Geneva K.of Risk-Issues ", "start_offset": 0, "end_offset": 24, "type": "word", "position": 0 } ] }

  • pattern analyzer: splits text using a regular expression. The regular expression should match the token separators, not the tokens themselves. It defaults to \W+ (all non-word characters).
POST _analyze { "analyzer":"pattern", "text":"Geneva K.of Risk-Issues " }

{ "tokens": [ { "token": "geneva", "start_offset": 0, "end_offset": 6, "type": "word", "position": 0 }, { "token": "k", "start_offset": 7, "end_offset": 8, "type": "word", "position": 1 }, { "token": "of", "start_offset": 9, "end_offset": 11, "type": "word", "position": 2 }, { "token": "risk", "start_offset": 12, "end_offset": 16, "type": "word", "position": 3 }, { "token": "issues", "start_offset": 17, "end_offset": 23, "type": "word", "position": 4 } ] }

  • language analyzers: a set of analyzers aimed at specific languages. The following languages are supported:
    arabic, armenian, basque, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai
POST _analyze { "analyzer":"english", "text":"Geneva K.of Risk-Issues " } // the analyzer could equally be "french", etc.

{ "tokens": [ { "token": "geneva", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 }, { "token": "k.of", "start_offset": 7, "end_offset": 11, "type": "<ALPHANUM>", "position": 1 }, { "token": "risk", "start_offset": 12, "end_offset": 16, "type": "<ALPHANUM>", "position": 2 }, { "token": "issu", "start_offset": 17, "end_offset": 23, "type": "<ALPHANUM>", "position": 3 } ] }

  • fingerprint analyzer: sorts the tokens, removes duplicates, and concatenates them into a single token (the output is also lowercased, as the example below shows).
POST _analyze { "analyzer":"fingerprint", "text":"Geneva K.of Risk-Issues " }

{ "tokens": [ { "token": "geneva issues k.of risk", "start_offset": 0, "end_offset": 24, "type": "fingerprint", "position": 0 } ] }

II. Testing custom analyzers
Using the _analyze API
The _analyze API can be used to verify what an analyzer produces and to explain the analysis process.
text: the text to analyze
explain: explain the analysis process
char_filter: character filters
tokenizer: tokenizer
filter: token filters
GET _analyze { "char_filter": ["html_strip"], "tokenizer": "standard", "filter": ["lowercase"], "text": "No dreams, why bother Beijing !", "explain": true }

{ "detail": { "custom_analyzer": true, "charfilters": [ { "name": "html_strip", "filtered_text": [ """No dreams, why bother Beijing !""" ] } ], "tokenizer": { "name": "standard", "tokens": [ { "token": "No", "start_offset": 7, "end_offset": 9, "type": "", "position": 0, "bytes": "[4e 6f]", "positionLength": 1 }, { "token": "dreams", "start_offset": 13, "end_offset": 23, "type": "", "position": 1, "bytes": "[64 72 65 61 6d 73]", "positionLength": 1 }, { "token": "why", "start_offset": 25, "end_offset": 28, "type": "", "position": 2, "bytes": "[77 68 79]", "positionLength": 1 }, { "token": "bother", "start_offset": 29, "end_offset": 35, "type": "", "position": 3, "bytes": "[62 6f 74 68 65 72]", "positionLength": 1 }, { "token": "Beijing", "start_offset": 39, "end_offset": 50, "type": "", "position": 4, "bytes": "[42 65 69 6a 69 6e 67]", "positionLength": 1 } ] }, "tokenfilters": [ { "name": "lowercase", "tokens": [ { "token": "no", "start_offset": 7, "end_offset": 9, "type": "", "position": 0, "bytes": "[6e 6f]", "positionLength": 1 }, { "token": "dreams", "start_offset": 13, "end_offset": 23, "type": "", "position": 1, "bytes": "[64 72 65 61 6d 73]", "positionLength": 1 }, { "token": "why", "start_offset": 25, "end_offset": 28, "type": "", "position": 2, "bytes": "[77 68 79]", "positionLength": 1 }, { "token": "bother", "start_offset": 29, "end_offset": 35, "type": "", "position": 3, "bytes": "[62 6f 74 68 65 72]", "positionLength": 1 }, { "token": "beijing", "start_offset": 39, "end_offset": 50, "type": "", "position": 4, "bytes": "[62 65 69 6a 69 6e 67]", "positionLength": 1 } ] } ] } }

The normalizer
Fields of type keyword can only be searched by exact match, and matching is case-sensitive. What if you want exact matching on a keyword field to also be case-insensitive? A normalizer solves exactly this problem.
A normalizer is structured like an analyzer minus the tokenizer:
Analyzer type (type): custom
Character filters (char_filter): zero or more, applied in order
Token filters (filter): zero or more, applied in order
Here is an example borrowed from the official documentation:
PUT index { "settings": { "analysis": { "char_filter": { "quote": { "type": "mapping", "mappings": [ "« => \"", "» => \"" ] } }, "normalizer": { "my_normalizer": { "type": "custom", "char_filter": ["quote"], "filter": ["lowercase", "asciifolding"] } } } }, "mappings": { "type": { "properties": { "foo": { "type": "keyword", // a normalizer can only be used on keyword fields "normalizer": "my_normalizer" } } } } }

PUT testlog/wd_doc/1 { "title": "Quick Frox" }
GET testlog/wd_doc/_search { "query": { "match": { "title": { "query": "quick Frox" // case-insensitive: the document is found regardless of case } } } }

The 12 built-in tokenizers
  • Standard Tokenizer
  • Letter Tokenizer
  • Lowercase Tokenizer
  • Whitespace Tokenizer
  • UAX URL Email Tokenizer
  • Classic Tokenizer
  • Thai Tokenizer
  • NGram Tokenizer
  • Edge NGram Tokenizer
  • Keyword Tokenizer
  • Pattern Tokenizer
  • Path Hierarchy Tokenizer
