Elasticsearch Regexp Search: Analysis & Practice

2023-11-14 17:22:44

Regexp Query

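The regexp query lets you run term-level queries using regular expressions. Be aware that a poorly written regexp can put heavy load on the cluster. For example, a pattern that begins with .* has to be checked against every term in the inverted index, which is roughly equivalent to a full table scan and is very slow. Where possible, give the pattern a literal prefix so that fewer terms need to be examined. Patterns that lean heavily on wildcard-style operators, such as .*?+, also degrade query performance.

Note: this is a term-level query, so the pattern is matched against one term at a time and cannot span multiple terms.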
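The advice about literal prefixes can be checked mechanically before a query is sent. The following is a minimal sketch, not part of Elasticsearch: `literal_prefix` is a hypothetical helper that returns the leading run of non-reserved characters in a pattern, and it deliberately gives up on any pattern that contains `|`, since alternation can remove the common prefix.

```python
# Reserved characters of the Lucene regexp syntax, including the optional operators.
RESERVED = set('.?+*|{}[]()"\\#@&<>~')
# Operators that apply to the character immediately before them.
QUANTIFIERS = set('?*+{')


def literal_prefix(pattern: str) -> str:
    """Return the literal prefix every matching term must start with (hypothetical helper)."""
    # Conservative shortcut: with alternation no single prefix is guaranteed.
    if "|" in pattern:
        return ""
    end = 0
    while end < len(pattern) and pattern[end] not in RESERVED:
        end += 1
    # A quantifier makes the preceding character optional or repeatable,
    # so that character is not part of the guaranteed prefix.
    if 0 < end < len(pattern) and pattern[end] in QUANTIFIERS:
        end -= 1
    return pattern[:end]


for p in ["s.*y", ".*foo", "ab*c", "abc"]:
    print(repr(p), "->", repr(literal_prefix(p)))
```

An empty result, as for `.*foo`, is a hint that the query may have to scan every term of the field.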

A simple example:

    GET /_search
    {
        "query": {
            "regexp": {
                "name.first": "s.*y"
            }
        }
    }

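Standard operators supported by the regexp syntax:

Matching part of a term (anchoring)

Most regular expression engines let a pattern match any part of a string. Lucene patterns are always anchored: the pattern must match the entire term. For the term abcde:

    ab.*     # match
    abcd     # no match

Because of this implicit anchoring, ^ and $ are not needed, and Lucene's regexp syntax does not support them as anchor operators.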
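The whole-term matching rule can be imitated locally with Python's `re.fullmatch`. This is only an approximation for simple patterns: Python's `re` module is a different engine from Lucene's, and `lucene_style_match` is a name made up for this sketch.

```python
import re


def lucene_style_match(pattern: str, term: str) -> bool:
    """True if the pattern matches the whole term, like an always-anchored pattern."""
    return re.fullmatch(pattern, term) is not None


term = "abcde"
print(lucene_style_match("ab.*", term))      # True: the pattern covers the whole term
print(lucene_style_match("abcd", term))      # False: the trailing "e" is left over
print(re.search("abcd", term) is not None)   # True: an unanchored search accepts a partial match
```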

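Special characters

The following reserved characters must be escaped with a backslash when they are meant literally:

    . ? + * | { } [ ] ( ) " \

To match a fixed string literally, you can also wrap it in double quotes; every character inside the quotes (other than a double quote) is then taken literally.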
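Escaping by hand is easy to get wrong, so here is a small sketch of a helper that backslash-escapes the reserved characters listed above. `escape_regexp_literal` is a hypothetical name, not an Elasticsearch API, and the sketch covers only the characters in that list; the optional operators (# @ & < > ~) would need the same treatment when they are enabled.

```python
import json

# Reserved characters from the list above.
RESERVED = '.?+*|{}[]()"\\'


def escape_regexp_literal(text: str) -> str:
    """Prefix every reserved character with a backslash so it is matched literally."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


escaped = escape_regexp_literal("a.b(c)")
print(escaped)  # a\.b\(c\)

# The field name is reused from the earlier example. json.dumps adds the extra
# level of backslash escaping that a JSON request body requires.
query = {"query": {"regexp": {"name.first": escaped + ".*"}}}
print(json.dumps(query))
```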

Match any character

A period (.) matches any single character. For the term abcde:

    ab...    # match
    a.c.e    # match

One or more

The plus sign (+) matches the preceding pattern one or more times. For the term aaabbb:

    a+b+        # match
    aa+bb+      # match
    a+.+        # match
    aa+bbb+     # match

Zero or more

The asterisk (*) matches the preceding pattern zero or more times. For the term aaabbb:

    a*b*        # match
    a*b*c*      # match
    .*bbb.*     # match
    aaa*bbb*    # match

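Zero or one

The question mark (?) makes the preceding pattern optional: it matches zero times or once. For the term aaabbb:

    aaa?bbb?    # match
    aaaa?bbbb?  # match
    .....?.?    # match
    aa?bb?      # no match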

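Min-to-max repetition

Curly braces ({}) specify a minimum and, optionally, a maximum number of times the preceding pattern may repeat:

    {5}      # repeat exactly 5 times
    {2,5}    # repeat at least twice and at most 5 times
    {2,}     # repeat at least twice

For example, for the term aaabbb:

    a{3}b{3}        # match
    a{2,4}b{2,4}    # match
    a{2,}b{2,}      # match
    .{3}.{3}        # match
    a{4}b{4}        # no match
    a{4,6}b{4,6}    # no match
    a{4,}b{4,}      # no match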

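Grouping

Parentheses group sub-patterns so that an operator applies to the whole group. For the term ababab:

    (ab)+       # match
    ab(ab)+     # match
    (..)+       # match
    (...)+      # match
    (ab)*       # match
    abab(ab)?   # match
    ab(ab)?     # no match
    (ab){3}     # match
    (ab){1,2}   # no match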
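The grouping patterns can be sanity-checked offline. The sketch below runs each pattern against ababab with Python's `re.fullmatch`, which requires the whole string to match. Python's engine is not Lucene's, so this is an approximation that holds for these basic operators, not a substitute for testing against a real index.

```python
import re

term = "ababab"
patterns = [
    "(ab)+", "ab(ab)+", "(..)+", "(...)+", "(ab)*",
    "abab(ab)?", "ab(ab)?", "(ab){3}", "(ab){1,2}",
]

# fullmatch succeeds only if the pattern consumes the entire term.
results = {p: re.fullmatch(p, term) is not None for p in patterns}

for pattern, matched in results.items():
    print(f"{pattern:<12}# {'match' if matched else 'no match'}")
```

Note that (...)+ does match here: ababab is six characters long, so it splits into the two three-character groups aba and bab.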

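Alternation

The pipe symbol (|) acts as an OR operator: the match succeeds if the pattern on either side matches. Note that alternation applies to the longest pattern on each side, not the shortest. For the term aabb:

    aabb|bbaa   # match
    aacc|bb     # no match
    aa(cc|bb)   # match
    a+|b+       # no match
    a+b+|b+a+   # match
    a+(b|c)+    # match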

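Character classes

Square brackets ([]) enclose a set of candidate characters; a leading ^ negates the set:

    [abc]    # 'a' or 'b' or 'c'
    [a-c]    # 'a' or 'b' or 'c'
    [-abc]   # '-' or 'a' or 'b' or 'c'
    [abc\-]  # '-' or 'a' or 'b' or 'c'
    [^abc]   # any character except 'a' or 'b' or 'c'
    [^a-c]   # any character except 'a' or 'b' or 'c'
    [^-abc]  # any character except '-' or 'a' or 'b' or 'c'
    [^abc\-] # any character except '-' or 'a' or 'b' or 'c'

A dash (-) inside the brackets denotes a range, unless it is the first character or is escaped with a backslash.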

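Optional operators

The flags parameter controls which of the following optional operators are enabled in the regular expression. By default all of them are enabled.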
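To make the flags parameter concrete, here is a sketch of a request body that enables only some of the optional operators. The field name and pattern are reused from the earlier example purely for illustration; the body uses the expanded form of the regexp query, in which the pattern goes under `value` and several flags are joined with `|`.

```python
import json

# Expanded form of the regexp query: pattern under "value", operators under "flags".
query = {
    "query": {
        "regexp": {
            "name.first": {
                "value": "s.*y",
                "flags": "INTERSECTION|COMPLEMENT|EMPTY",
            }
        }
    }
}

body = json.dumps(query, indent=2)
print(body)  # JSON body that could be sent with GET /_search
```

The accepted flag values are ALL, ANYSTRING, COMPLEMENT, EMPTY, INTERSECTION, INTERVAL and NONE.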

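Complement

The complement operator negates the shortest pattern that follows a tilde (~). For example, ab~cd means: starts with a, followed by b, followed by a string of any length that is anything other than c, and ends with d. For the term abcdef:

    ab~df     # match
    ab~cf     # match
    ab~cdef   # no match
    a~(cb)def # match
    a~(bc)def # no match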
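Python's `re` module has no complement operator, but the behaviour can be imitated for the simple shape "literal prefix, negated pattern, literal suffix". The sketch below is an emulation under that assumption only: `match_with_complement` is a hypothetical helper, and it does not parse the ~ syntax itself.

```python
import re


def match_with_complement(prefix: str, negated: str, suffix: str, term: str) -> bool:
    """Emulate prefix~(negated)suffix where prefix and suffix are plain literals."""
    if len(term) < len(prefix) + len(suffix):
        return False
    if not (term.startswith(prefix) and term.endswith(suffix)):
        return False
    # With literal prefix and suffix the middle part is uniquely determined.
    middle = term[len(prefix):len(term) - len(suffix)]
    # The complement accepts any string that the negated pattern does NOT fully match.
    return re.fullmatch(negated, middle) is None


term = "abcdef"
print(match_with_complement("ab", "d", "f", term))     # ab~df     -> True
print(match_with_complement("ab", "c", "f", term))     # ab~cf     -> True
print(match_with_complement("ab", "c", "def", term))   # ab~cdef   -> False
print(match_with_complement("a", "cb", "def", term))   # a~(cb)def -> True
print(match_with_complement("a", "bc", "def", term))   # a~(bc)def -> False
```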

Interval

The interval operator matches a numeric range enclosed in angle brackets (<>). For the term foo80:

    foo<1-100>     # match
    foo<01-100>    # match
    foo<001-100>   # no match

Intersection

The ampersand (&) joins two patterns so that both of them must match. For the term aaabbb:

    aaa.+&.+bbb     # match
    aaa&bbb         # no match
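Intersection means the term must satisfy both sides at once, each against the whole term. A rough local emulation, assuming the pattern is split by hand at the & and using Python's `re.fullmatch` as a stand-in for Lucene's anchored matching (`match_intersection` is a made-up name):

```python
import re


def match_intersection(left: str, right: str, term: str) -> bool:
    """True only if both patterns match the entire term."""
    return (re.fullmatch(left, term) is not None
            and re.fullmatch(right, term) is not None)


print(match_intersection("aaa.+", ".+bbb", "aaabbb"))  # True: both sides cover the whole term
print(match_intersection("aaa", "bbb", "aaabbb"))      # False: neither side covers the whole term
```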

Any string

The at sign (@) matches any string in its entirety.

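Practice

First create an index:

    PUT test

Then create the mapping (the string type and the not_analyzed setting are the mapping syntax of older Elasticsearch releases, prior to 5.0):

    PUT test/_mapping/test
    {
      "properties": {
        "a": {
          "type": "string",
          "index": "not_analyzed"
        },
        "b": {
          "type": "string"
        }
      }
    }

Add a document:

    PUT test/test/1
    {
      "a": "a,b,c",
      "b": "a,b,c"
    }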

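First, check how a,b,c is analyzed by default:

    POST test/_analyze
    {
      "analyzer": "standard",
      "text": "a,b,c"
    }

Response:

    {
      "tokens": [
        {
          "token": "a",
          "start_offset": 0,
          "end_offset": 1,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "b",
          "start_offset": 2,
          "end_offset": 3,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "c",
          "start_offset": 4,
          "end_offset": 5,
          "type": "<ALPHANUM>",
          "position": 2
        }
      ]
    }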

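Then run the query against field a:

    POST /test/test/_search?pretty
    {
      "query": {
        "regexp": {
          "a": "a.*b.*"
        }
      }
    }

Response:

    {
      "took": 2,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 1,
        "max_score": 1,
        "hits": [
          {
            "_index": "test",
            "_type": "test",
            "_id": "1",
            "_score": 1,
            "_source": {
              "a": "a,b,c",
              "b": "a,b,c"
            }
          }
        ]
      }
    }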

Now try field b:

    POST /test/test/_search?pretty
    {
      "query": {
        "regexp": {
          "b": "a.*b.*"
        }
      }
    }

Response:

    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 0,
        "max_score": null,
        "hits": []
      }
    }

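Why does this happen?

A regexp query is applied to one term at a time, and here each term is tested against a.*b.*. Field a is not analyzed, so its only term is the whole value a,b,c, which the pattern matches. Field b is analyzed, so its terms are the three separate tokens a, b and c, none of which matches the pattern. That is why the regexp search finds the document through field a but not through field b.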
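The difference between the two fields can be reproduced without a cluster. The sketch below is a simplified model, not how Elasticsearch is implemented: the not_analyzed field is represented by a single term, the analyzed field by a crude word split standing in for the standard analyzer, and `re.fullmatch` stands in for Lucene's whole-term matching. `regexp_hits` is a hypothetical helper.

```python
import re


def regexp_hits(pattern: str, terms: list[str]) -> list[str]:
    """Return the terms that the pattern matches in full, one term at a time."""
    return [t for t in terms if re.fullmatch(pattern, t)]


value = "a,b,c"
terms_a = [value]                             # not_analyzed: the whole value is one term
terms_b = re.findall(r"\w+", value.lower())   # rough stand-in for the standard analyzer

print(terms_a, regexp_hits("a.*b.*", terms_a))  # ['a,b,c'] ['a,b,c']
print(terms_b, regexp_hits("a.*b.*", terms_b))  # ['a', 'b', 'c'] []
```

A pattern such as a or [abc] would match through field b, because it fits within a single token.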
