Elasticsearch: how do you give different fields different weights in a cross-field query?

Posted: 2017-11-05 21:23


Here is a very good article that I have translated and tidied up. If your English is good, I recommend reading the original directly.
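Elasticsearch has a bool filter as well as AND, OR and NOT filters. They look very similar, so what is the difference? When should you use bool, and when the AND filter?
In fact the bool filter and the AND, OR and NOT filters are completely different, and the choice has a very large impact on query performance.
First we need to understand how filters work internally. The core data structure is the bitset, which you can think of as a large array of bits where each element has two states, 0 and 1 (if you know bloom filters, the idea will feel familiar). A filter only decides whether a document matches; no scoring is involved. If a document matches the filter, its bit is set to 1; otherwise it is set to 0.
When es executes a filter, it opens each Lucene segment and checks which documents in it match. The result is stored as a bitset, so the next time the same filter comes in, es answers directly from the in-memory bitset without opening the segment files again. Avoiding that IO greatly speeds up query processing, and this is why filters are so efficient.
Lucene segments are immutable: Lucene writes new segments, but old ones never change, so bitsets can be reused. A bitset is produced for each combination of filter condition and segment. A query may need to intersect several bitsets, and computers are very good at this kind of bitwise operation, so it is fast.
Also, if a filter matches nothing, every bit in its bitset is 0, and es can skip that bitset entirely when it later processes the filter, which improves performance.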
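The caching idea above can be modelled in a few lines. This is only a toy sketch, not how Lucene or es actually implement it: the `BitsetCache` class, the segment layout and the field names are all made up for illustration, and a Python int stands in for the bit array.

```python
from typing import Callable, Dict, List, Tuple

Doc = dict
Segment = Tuple[str, List[Doc]]  # (segment name, immutable list of docs)


class BitsetCache:
    """Caches one bitset (a Python int) per (filter name, segment name)."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[str, str], int] = {}
        self.segment_scans = 0  # how many times a segment had to be read

    def bitset(self, name: str, pred: Callable[[Doc], bool], seg: Segment) -> int:
        seg_name, docs = seg
        key = (name, seg_name)
        if key not in self._cache:
            self.segment_scans += 1
            bits = 0
            for i, doc in enumerate(docs):
                if pred(doc):
                    bits |= 1 << i  # bit i is 1 when doc i matches
            self._cache[key] = bits
        return self._cache[key]


def matching_ids(bits: int) -> List[int]:
    """Return the positions of the bits that are set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


segment = ("seg_0", [
    {"status": "ok", "cost": 5},
    {"status": "err", "cost": 50},
    {"status": "ok", "cost": 70},
])
cache = BitsetCache()
is_ok = cache.bitset("status=ok", lambda d: d["status"] == "ok", segment)
pricey = cache.bitset("cost>=50", lambda d: d["cost"] >= 50, segment)
both = is_ok & pricey  # intersecting two filters is a single bitwise AND
cache.bitset("status=ok", lambda d: d["status"] == "ok", segment)  # served from cache
```

The repeated `status=ok` call does not scan the segment again, which is the reuse the article attributes to immutable segments.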
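With the basics covered, let's look at how the bool filter differs from the AND filter and its relatives.
The bool filter uses the bitset structure described above (the bitset camp), while the AND, OR and NOT filters cannot (the non-bitset camp). Why?
AND, OR and NOT filters work doc by doc. es loads the field content of each document in turn and checks whether it satisfies the condition; documents that do not are excluded from the result set. This repeats until every document has been visited. The process never touches the bitsets described above, so it cannot reuse cached results.
If you have several filter conditions, that is, one AND, OR or NOT wrapping multiple filters (an array is accepted), each filter passes its result set on to the next. In theory the number of documents to process keeps shrinking, because filtering only removes documents and never adds them. For that reason the most restrictive conditions should generally run first, so that later filters see very few documents, which greatly speeds up the whole chain. Besides document counts, consider the cost of each filter. Some filters are slow, such as Geo filters (heavy computation) or script-based filters (dynamic scripts), and these expensive filters are best placed last.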
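The ordering advice can be checked with a small simulation. The following is a sketch under simple assumptions: `run_chain`, the sample documents and both predicates are hypothetical, and the "costly" predicate merely stands in for a script or geo filter.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Doc = dict
Filter = Tuple[str, Callable[[Doc], bool]]


def run_chain(docs: Sequence[Doc], filters: Sequence[Filter]):
    """Apply filters in order, doc by doc.

    Returns the surviving docs and how many docs each filter had to examine.
    """
    calls: Dict[str, int] = {}
    survivors: List[Doc] = list(docs)
    for name, pred in filters:
        calls[name] = len(survivors)  # every surviving doc is checked once
        survivors = [d for d in survivors if pred(d)]
    return survivors, calls


docs = [{"id": i, "city": "SZ" if i % 10 == 0 else "BJ", "x": i} for i in range(100)]
cheap = ("city", lambda d: d["city"] == "SZ")          # selective and cheap
costly = ("script", lambda d: (d["x"] ** 2) % 3 == 0)  # stand-in for an expensive filter

good_hits, good_calls = run_chain(docs, [cheap, costly])
bad_hits, bad_calls = run_chain(docs, [costly, cheap])
```

Both orders return the same documents, but with the selective filter first the expensive predicate runs on 10 documents instead of 100.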
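So the picture should now be clear: AND, OR and NOT go document by document, in order. If your result set is large, that is, a loose query with many hits, AND, OR and NOT filters are a poor fit. Some filters, however, can only be evaluated document by document, such as: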
Geo* filters
Numeric_range
So apart from those unavoidable cases, every other filter should use the bool filter for better query performance.
If a query needs both bitset and non-bitset filters, you can combine the bool filter with AND/OR/NOT filters.
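As noted above, AND passes the result set from one filter to the next, so put the better-performing filters first and the non-bitset filters at the end of the AND. Here is a compound filter that mixes several filter types:
{
    "and" : [
        { "bool" : {
            "must" : [
                { "term" : {} },
                { "range" : {} },
                { "term" : {} }
            ]
        } },
        { "custom_script" : {} },
        { "geo_distance" : {} }
    ]
}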
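Here and is the outermost wrapper. Its first filter is a bool filter with three must sub-filters; once that has run and produced a document set, the custom_script and geo_distance filters run on it in turn, and the documents that remain are the search result.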
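The same combination rule can be written as a small helper that builds the filter body as a Python dict. This is a sketch that assumes the old filter DSL the article describes; `combine_filters`, the `NON_BITSET_TYPES` set and the field names in the sample filters are hypothetical and not part of any es client.

```python
# Filter types the article treats as doc-by-doc (non-bitset).
# Illustrative only, not an exhaustive or authoritative list.
NON_BITSET_TYPES = {"geo_distance", "geo_bounding_box", "script", "numeric_range"}


def filter_type(f: dict) -> str:
    """A filter is a single-key dict; the key is its type."""
    return next(iter(f))


def combine_filters(filters: list) -> dict:
    """Group bitset-friendly filters in a bool, then append the rest in an and."""
    cached = [f for f in filters if filter_type(f) not in NON_BITSET_TYPES]
    uncached = [f for f in filters if filter_type(f) in NON_BITSET_TYPES]
    if not uncached:
        return {"bool": {"must": cached}}
    head = [{"bool": {"must": cached}}] if cached else []
    return {"and": head + uncached}  # non-bitset filters run last


body = combine_filters([
    {"geo_distance": {"distance": "5km", "location": {"lat": 22.5, "lon": 114.0}}},
    {"term": {"status": "ok"}},
    {"range": {"age": {"gte": 18}}},
])
```

Whatever order the caller passes the filters in, the bool group ends up first inside the and wrapper.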
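In short, when using filters, always prefer the bitset path first, and then think about filter order and combination:
Geo, Script or Numeric_range filter: use And/Or/Not Filters
Everything else: use Bool Filter
Once you have this down, writing high-performance queries is not hard.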
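An earlier post ran the same test on MongoDB.
Here we use exactly the same test data as before to examine which table structure to choose for storing time series in elasticsearch.
One doc per data point
Again we start with the simplest table structure. In elasticsearch you first create an index, and an index contains mappings. A mapping is the equivalent of a table structure. The table is configured as follows: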
settings = {
    'number_of_shards': 1,
    'number_of_replicas': 0,
    'index.query.default_field': 'timestamp',
    'index.mapping.ignore_malformed': False,
    'index.mapping.coerce': False,
    'index.query.parse.allow_unmapped_fields': False,
}
mappings = {
    'testdata': {
        '_source': {'enabled': False},
        '_all': {'enabled': False},
        'properties': {
            'timestamp': {
                'type': 'date',
                'index': 'no',
                'store': False,
                'dynamic': 'strict',
                'doc_values': True,
                'fielddata': {'format': 'doc_values'}
            },
            'vAppid': {
                'type': 'string',
                'index': 'no',
                'store': False,
                'dynamic': 'strict',
                'doc_values': True,
                'fielddata': {'format': 'doc_values'}
            },
            'iResult': {
                'type': 'string',
                'index': 'no',
                'store': False,
                'dynamic': 'strict',
                'doc_values': True,
                'fielddata': {'format': 'doc_values'}
            },
            'vCmdid': {
                'type': 'string',
                'index': 'no',
                'store': False,
                'dynamic': 'strict',
                'doc_values': True,
                'fielddata': {'format': 'doc_values'}
            },
            'dProcessTime': {
                'type': 'integer',
                'index': 'no',
                'store': False,
                'dynamic': 'strict',
                'doc_values': True,
                'fielddata': {'format': 'doc_values'}
            },
            'totalCount': {
                'type': 'integer',
                'index': 'no',
                'store': False,
                'dynamic': 'strict',
                'doc_values': True,
                'fielddata': {'format': 'doc_values'}
            }
        }
    }
}
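Although this table structure has none of the advanced optimization of packing by time range, several es settings deserve attention. First, _source is disabled, so the original JSON document is not stored a second time. Second, _all is disabled as well. Every field also has store set to False, so no field is stored separately. In the earlier MongoDB test no field was indexed, so to be fair indexing is turned off here too. With all of that turned off, where does the data live? In doc_values. doc_values are used during aggregation to quickly find a column's values for a batch of document ids. On disk, doc_values are a compressed, column-oriented file, which is very efficient.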
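To make the "look up a column by document id" idea concrete, here is a toy column store. It is only a mental model with integer columns and no compression; the `ColumnStore` class is hypothetical and says nothing about how Lucene actually encodes doc_values on disk.

```python
from array import array


class ColumnStore:
    """Keeps one packed integer array per field, indexed by document id."""

    def __init__(self, fields):
        self.columns = {name: array("q") for name in fields}
        self.doc_count = 0

    def add(self, doc):
        """Append one document and return its id."""
        for name, column in self.columns.items():
            column.append(doc[name])
        self.doc_count += 1
        return self.doc_count - 1

    def values(self, field, doc_ids):
        """Fetch one field for a batch of doc ids without touching other fields."""
        column = self.columns[field]
        return [column[i] for i in doc_ids]


store = ColumnStore(["timestamp", "totalCount"])
for ts, count in [(60, 5), (60, 7), (120, 3)]:
    store.add({"timestamp": ts, "totalCount": count})

matching = [0, 1]  # ids of the docs in one bucket
bucket_sum = sum(store.values("totalCount", matching))
```

An aggregation only needs the columns it reads, which is why dropping _source and stored fields does not get in its way.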
So after importing more than 8 million rows, how much disk is used?
size: 198Mi (198Mi)
docs: 8,385,335 (8,385,335)
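Astonishing: 8.38 million rows took 3G of disk in MongoDB, but only 198M after import into es. Even after adding indexes on all the dimension fields, the growth is very small.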
size: 233Mi (233Mi)
docs: 8,385,335 (8,385,335)
What about query performance?
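# es is an elasticsearch-py client (elasticsearch.Elasticsearch) created beforehand
q = {
    'aggs': {
        'timestamp': {
            'terms': {'field': 'timestamp'},
            'aggs': {'totalCount': {'sum': {'field': 'totalCount'}}}
        }
    }
}
res = es.search(index="wentao-test1", doc_type='testdata', body=q, search_type='count')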
As before, this aggregates by time and takes the sum of totalCount per period. The query result is:
{u'_shards': {u'failed': 0, u'successful': 1, u'total': 1},
u'aggregations': {u'timestamp': {u'buckets': [{u'doc_count': 38304,
u'key': 0,
u'key_as_string': u'T22:05:00.000Z',
u'totalCount': {u'value': }},
{u'doc_count': 38020,
u'key': 0,
u'key_as_string': u'T22:06:00.000Z',
u'totalCount': {u'value': }},
{u'doc_count': 37865,
u'key': 0,
u'key_as_string': u'T22:01:00.000Z',
u'totalCount': {u'value': }},
{u'doc_count': 37834,
u'key': 0,
u'key_as_string': u'T22:04:00.000Z',
u'totalCount': {u'value': }},
{u'doc_count': 37780,
u'key': 0,
u'key_as_string': u'T22:09:00.000Z',
u'totalCount': {u'value': }},
{u'doc_count': 37761,
u'key': 0,
u'key_as_string': u'T22:07:00.000Z',
u'totalCount': {u'value': }},
{u'doc_count': 37738,
u'key': 0,
u'key_as_string': u'T22:08:00.000Z',
u'totalCount': {u'value': }},
{u'doc_count': 37598,
u'key': 0,
u'key_as_string': u'T22:00:00.000Z',
u'totalCount': {u'value': }},
{u'doc_count': 37541,
u'key': 0,
u'key_as_string': u'T22:02:00.000Z',
u'totalCount': {u'value': }},
{u'doc_count': 37518,
u'key': 0,
u'key_as_string': u'T22:03:00.000Z',
u'totalCount': {u'value': }}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 8007376}},
u'hits': {u'hits': [], u'max_score': 0.0, u'total': 8385335},
u'timed_out': False,
u'took': 1033}
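It took only 1 second; the same query previously took 9 seconds in MongoDB. Is that because Elasticsearch is a parallel database? When creating the index we deliberately set the shard count to 1, so only one machine took part in this query. Out of curiosity I also tried 6 shards. With 6 shards the total size was 259M (including indexes), and the query above took only 200ms. Of course, the MongoDB and es machines used in these tests were not exactly identical, so the hardware may be part of the explanation.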
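The shape of the response above (ten buckets ordered by doc_count, plus a sum_other_doc_count of 8007376 for everything outside them) follows from how a terms aggregation with a sum sub-aggregation behaves by default. A rough pure-Python equivalent, with the hypothetical helper `terms_with_sum` and made-up rows, might look like this:

```python
from collections import defaultdict


def terms_with_sum(rows, key_field, value_field, size=10):
    """Group rows by key_field, sum value_field, keep the `size` largest buckets."""
    buckets = defaultdict(lambda: {"doc_count": 0, "sum": 0})
    for row in rows:
        bucket = buckets[row[key_field]]
        bucket["doc_count"] += 1
        bucket["sum"] += row[value_field]
    # Buckets are ordered by doc_count, largest first.
    ordered = sorted(buckets.items(), key=lambda kv: kv[1]["doc_count"], reverse=True)
    top = [{"key": key, **stats} for key, stats in ordered[:size]]
    other = sum(stats["doc_count"] for _, stats in ordered[size:])
    return {"buckets": top, "sum_other_doc_count": other}


rows = [
    {"timestamp": 60, "totalCount": 2},
    {"timestamp": 60, "totalCount": 3},
    {"timestamp": 120, "totalCount": 10},
    {"timestamp": 180, "totalCount": 1},
    {"timestamp": 60, "totalCount": 1},
    {"timestamp": 120, "totalCount": 4},
]
result = terms_with_sum(rows, "timestamp", "totalCount", size=2)
```

With `size=2`, the bucket for timestamp 180 is dropped and its single document is reported in `sum_other_doc_count`.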
The second query is more complex: filter by vAppid, then aggregate on two dimensions, timestamp and vCmdid. The query is:
q = {
    'query': {
        'constant_score': {
            'filter': {
                # the bool/term wrappers are reconstructed
                'bool': {'must_not': {'term': {'vAppid': ''}}}
            }
        }
    },
    'aggs': {
        'timestamp': {
            'terms': {'field': 'timestamp'},
            'aggs': {
                'vCmdid': {
                    'terms': {'field': 'vCmdid'},
                    'aggs': {'totalCount': {'sum': {'field': 'totalCount'}}}
                }
            }
        }
    }
}
res = es.search(index="wentao-test3", doc_type='testdata', body=q, search_type='count')
constant_score skips the scoring phase. The query result is as follows:
{u'_shards': {u'failed': 0, u'successful': 1, u'total': 1},
u'aggregations': {u'timestamp': {u'buckets': [{u'doc_count': 38304,
u'key': 0,
u'key_as_string': u'T22:05:00.000Z',
u'vCmdid': {u'buckets': [{u'doc_count': 7583,
u'key': u'10000',
u'totalCount': {u'value': }},
{u'doc_count': 4122, u'key': u'19', u'totalCount': {u'value': 41463.0}},
{u'doc_count': 2312, u'key': u'14', u'totalCount': {u'value': 41289.0}},
{u'doc_count': 2257, u'key': u'18', u'totalCount': {u'value': 57845.0}},
{u'doc_count': 1723,
u'key': u'1002',
u'totalCount': {u'value': 33844.0}},
{u'doc_count': 1714,
u'key': u'2006',
u'totalCount': {u'value': 33681.0}},
{u'doc_count': 1646,
u'key': u'2004',
u'totalCount': {u'value': 28374.0}},
{u'doc_count': 1448, u'key': u'13', u'totalCount': {u'value': 32187.0}},
{u'doc_count': 1375, u'key': u'3', u'totalCount': {u'value': 32976.0}},
{u'doc_count': 1346,
u'key': u'2008',
u'totalCount': {u'value': 45932.0}}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 12778}},
... // ignore
{u'doc_count': 37518,
u'key': 0,
u'key_as_string': u'T22:03:00.000Z',
u'vCmdid': {u'buckets': [{u'doc_count': 7456,
u'key': u'10000',
u'totalCount': {u'value': }},
{u'doc_count': 4049, u'key': u'19', u'totalCount': {u'value': 39884.0}},
{u'doc_count': 2308, u'key': u'14', u'totalCount': {u'value': 39939.0}},
{u'doc_count': 2263, u'key': u'18', u'totalCount': {u'value': 57121.0}},
{u'doc_count': 1731,
u'key': u'1002',
u'totalCount': {u'value': 32309.0}},
{u'doc_count': 1695,
u'key': u'2006',
u'totalCount': {u'value': 33299.0}},
{u'doc_count': 1649,
u'key': u'2004',
u'totalCount': {u'value': 28429.0}},
{u'doc_count': 1423, u'key': u'13', u'totalCount': {u'value': 30672.0}},
{u'doc_count': 1340,
u'key': u'2008',
u'totalCount': {u'value': 45051.0}},
{u'doc_count': 1308, u'key': u'3', u'totalCount': {u'value': 32076.0}}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 12296}}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 8007376}},
u'hits': {u'hits': [], u'max_score': 0.0, u'total': 8385335},
u'timed_out': False,
u'took': 2235}
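The query took only 2.2 seconds, versus 21.4 seconds in MongoDB earlier. Running the same query on the 6-shard index takes just 0.6 seconds.
One doc per packed time range
As with the earlier MongoDB _._._._.v structure, the data is nested by dimension inside child documents.
The table structure is as follows: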
mappings = {
    'testdata': {
        '_source': {'enabled': False},
        '_all': {'enabled': False},
        'properties': {
            'max_timestamp': {
                'type': 'date',
                'index': 'not_analyzed',
                'store': False,
                'dynamic': 'strict',
                'doc_values': False,
                'fielddata': {'format': 'disabled'}
            },
            'min_timestamp': {
                'type': 'date',
                'index': 'not_analyzed',
                'store': False,
                'dynamic': 'strict',
                'doc_values': False,
                'fielddata': {'format': 'disabled'}
            },
            'count': {
                'type': 'integer',
                'index': 'no',
                'store': False,
                'dynamic': 'strict',
                'doc_values': False,
                'fielddata': {'format': 'disabled'}
            },
            'sum_totalCount': {
                'type': 'integer',
                'index': 'no',
                'store': False,
                'dynamic': 'strict',
                'doc_values': False,
                'fielddata': {'format': 'disabled'}
            },
            'sum_dProcessTime': {
                'type': 'integer',
                'index': 'no',
                'store': False,
                'dynamic': 'strict',
                'doc_values': False,
                'fielddata': {'format': 'disabled'}
            },
            '_': {  # timestamp
                'type': 'nested',
                'properties': {
                    'd': {'type': 'date', 'index': 'not_analyzed', 'store': False, 'fielddata': {'format': 'fst'}},
                    'c': {'type': 'integer', 'index': 'not_analyzed', 'store': False, 'fielddata': {'format': 'fst'}},
                    '0': {'type': 'integer', 'index': 'not_analyzed', 'store': False, 'fielddata': {'format': 'fst'}},
                    '1': {'type': 'integer', 'index': 'not_analyzed', 'store': False, 'fielddata': {'format': 'fst'}},
                    '_': {  # vAppid
                        'type': 'nested',
                        'properties': {
                            'd': {'type': 'string', 'index': 'not_analyzed', 'store': False, 'fielddata': {'format': 'fst'}},
                            '_': {  # iResult
                                'type': 'nested',
                                'properties': {
                                    'd': {'type': 'string', 'index': 'not_analyzed', 'store': False, 'fielddata': {'format': 'fst'}},
                                    '_': {  # vCmdid
                                        'type': 'nested',
                                        'properties': {
                                            'd': {'type': 'string', 'index': 'not_analyzed', 'store': False, 'fielddata': {'format': 'fst'}},
                                            'v': {  # values
                                                'type': 'nested',
                                                'properties': {
                                                    '0': {'type': 'integer', 'index': 'not_analyzed', 'store': False, 'fielddata': {'format': 'fst'}},
                                                    '1': {'type': 'integer', 'index': 'not_analyzed', 'store': False, 'fielddata': {'format': 'fst'}}
                                                }
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
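The key point of this table structure is a stack of nested documents. Each member of a nested document must enable either doc_values or index, otherwise the data is not saved. Because doc_values take more space, we chose not to store doc values.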
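A sketch of how flat points could be packed into this nested layout may help. It assumes `d` holds the dimension value, `c` a per-period point count, and `0`/`1` the totalCount and dProcessTime values, as the mapping comments suggest; the helpers `pack` and `count_lucene_docs` are hypothetical. Because es indexes each nested object as its own hidden document, counting the objects also shows why a packed index reports far more docs than top-level documents.

```python
def _child(siblings, key, extra):
    """Return the sibling whose 'd' equals key, creating it from `extra` if absent."""
    for node in siblings:
        if node["d"] == key:
            return node
    node = {"d": key, **extra}
    siblings.append(node)
    return node


def pack(points):
    """Pack (timestamp, vAppid, iResult, vCmdid, totalCount, dProcessTime) tuples."""
    doc = {"_": []}
    for ts, app, result, cmd, total, ptime in points:
        ts_node = _child(doc["_"], ts, {"c": 0, "0": 0, "1": 0, "_": []})
        ts_node["c"] += 1       # assumed meaning: number of points in the period
        ts_node["0"] += total   # pre-aggregated totalCount
        ts_node["1"] += ptime   # pre-aggregated dProcessTime
        app_node = _child(ts_node["_"], app, {"_": []})
        res_node = _child(app_node["_"], result, {"_": []})
        cmd_node = _child(res_node["_"], cmd, {"v": []})
        cmd_node["v"].append({"0": total, "1": ptime})
    return doc


def count_lucene_docs(node):
    """Count the root plus every nested object, each of which is a hidden doc."""
    total = 1
    for value in node.values():
        if isinstance(value, list):
            total += sum(count_lucene_docs(child) for child in value)
    return total


packed = pack([
    (60, "app1", "0", "10000", 5, 100),
    (60, "app1", "0", "10000", 7, 120),
    (120, "app2", "1", "19", 3, 50),
])
docs_in_index = count_lucene_docs(packed)
```

Three input points become one top-level document but twelve documents in the index, the same effect that turns 39 packed documents into millions of docs in the article's numbers.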
The data in MongoDB:
{
    "sharded" : false,
    "primary" : "shard2_RS",
    "ns" : "wentao_test.sparse_precomputed_no_appid",
    "count" : 39,
    "size" : 2.68435e+08,
    "avgObjSize" : 6.88294e+06,
    "storageSize" : 2.75997e+08,
    "numExtents" : 3,
    "nindexes" : 1,
    "lastExtentSize" : 1.58548e+08,
    "paddingFactor" : 1.0000,
    "systemFlags" : 1,
    "userFlags" : 1,
    "totalIndexSize" : 8176,
    "indexSizes" : {
        "_id_" : 8176
    },
    "ok" : 1.0000,
    "$gleStats" : {
        "lastOpTime" : Timestamp(, 3),
        "electionId" : ObjectId("54c9f324adaa0bd054140fda")
    }
}
There are only 39 documents, at a size of 270M. After importing the data into es:
size: 74.6Mi (74.6Mi)
docs: 9,355,029 (9,355,029)
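The document count became 9.35 million, because es also counts child documents as documents, while the size is only 74M. The query is as follows: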
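q = {
    'aggs': {
        'expanded_timestamp': {
            'nested': {'path': '_'},
            'aggs': {
                'grouped_timestamp': {
                    'terms': {'field': '_.d'},  # _.d is the timestamp; field name reconstructed
                    'aggs': {'totalCount': {'sum': {'field': '_.0'}}}
                }
            }
        }
    }
}
res = es.search(index="wentao-test4", doc_type='testdata', body=q, search_type='count')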
Note that _.0 is the precomputed totalCount sum for each period. The nested dimension fields are ordered timestamp => vAppid => iResult => vCmdid => values (0 as totalCount, 1 as dProcessTime).
{u'_shards': {u'failed': 0, u'successful': 1, u'total': 1},
u'aggregations': {u'expanded_timestamp': {u'doc_count': 743,
u'grouped_timestamp': {u'buckets': [{u'doc_count': 8,
u'key': 0,
u'key_as_string': u'T22:09:00.000Z',
u'totalCount': {u'value': }},
... // ignore
{u'doc_count': 1,
u'key': 0,
u'key_as_string': u'T22:59:00.000Z',
u'totalCount': {u'value': 83009.0}}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}}},
u'hits': {u'hits': [], u'max_score': 0.0, u'total': 39},
u'timed_out': False,
u'took': 56}
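The query took only 0.056 seconds. Using precomputed values is not a fair comparison, though. The same result can also be computed from the raw values: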
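q = {
    'aggs': {
        'per_id': {
            'terms': {'field': '_uid'},
            'aggs': {
                'expanded_timestamp': {
                    'nested': {'path': '_'},
                    'aggs': {
                        'grouped_timestamp': {
                            'terms': {'field': '_.d'},  # field name reconstructed
                            'aggs': {
                                'expanded_vAppid': {
                                    'nested': {'path': '_._._._.v'},
                                    'aggs': {'totalCount': {'sum': {'field': '_._._._.v.0'}}}
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}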
This expands several levels and finally sums _._._._.v.0. The result equals the sum over _.0. It took 0.548 seconds.
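The equivalence between the two sums can be seen on a tiny hand-written document. This is an illustrative sketch only: the `packed` literal and the `leaf_values` helper are made up, following the `_` / `d` / `v` layout of the mapping.

```python
def leaf_values(node):
    """Yield every object stored under `v`, walking down through the `_` levels."""
    if "v" in node:
        yield from node["v"]
    for child in node.get("_", []):
        yield from leaf_values(child)


# One period with a precomputed totalCount ("0") and dProcessTime ("1").
packed = {"_": [{
    "d": "22:00", "c": 3, "0": 15, "1": 270,
    "_": [
        {"d": "app1", "_": [{"d": "0", "_": [
            {"d": "10000", "v": [{"0": 5, "1": 100}, {"0": 7, "1": 120}]},
        ]}]},
        {"d": "app2", "_": [{"d": "1", "_": [
            {"d": "19", "v": [{"0": 3, "1": 50}]},
        ]}]},
    ],
}]}

period = packed["_"][0]
raw_total = sum(value["0"] for value in leaf_values(period))  # like summing _._._._.v.0
precomputed_total = period["0"]                               # like summing _.0
```

Both numbers agree; the raw path simply has to visit every leaf object, which is where the extra query time goes.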
Next, let's test the query that filters by vAppid and aggregates on both time and vCmdid. This one is rather gnarly to write:
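q = {
    'aggs': {
        'expanded_timestamp': {
            'nested': {'path': '_'},
            'aggs': {
                'grouped_timestamp': {
                    'terms': {'field': '_.d'},  # field name reconstructed
                    'aggs': {
                        'expanded_to_vAppid': {
                            'nested': {'path': '_._'},
                            'aggs': {
                                'vAppid_not_empty': {
                                    # the bool/term wrappers are reconstructed
                                    'filter': {'bool': {'must_not': {'term': {'_._.d': ''}}}},
                                    'aggs': {
                                        'expanded_to_vCmdid': {
                                            'nested': {'path': '_._._._'},
                                            'aggs': {
                                                'ts_and_vCmdid': {
                                                    'terms': {'field': '_._._._.d', 'size': 0},  # _._._._.d is vCmdid
                                                    'aggs': {
                                                        'expanded_to_values': {
                                                            'nested': {'path': '_._._._.v'},
                                                            'aggs': {'totalCount': {'sum': {'field': '_._._._.v.0'}}}
                                                        }
                                                    }
                                                }
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}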
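This query takes 3.2 seconds, slower than with the raw storage format. In practice, though, precomputed values are far more likely to be used, and cases that require unpacking the raw values are rare.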
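Deeply nested aggregation bodies like this are easier to write as a flat list of steps that a helper folds into the nested dict. The following is one possible sketch: `chain_aggs` is a hypothetical helper, not part of elasticsearch-py, and the `_.d` field used for the timestamp grouping is an assumption based on the mapping.

```python
def chain_aggs(steps):
    """Fold [(name, body), ...] from outermost to innermost into nested 'aggs'."""
    result = None
    for name, body in reversed(steps):
        node = dict(body)  # copy so the caller's dicts are not modified
        if result is not None:
            node["aggs"] = result
        result = {name: node}
    return {"aggs": result}


steps = [
    ("expanded_timestamp", {"nested": {"path": "_"}}),
    ("grouped_timestamp", {"terms": {"field": "_.d"}}),  # assumed field name
    ("expanded_to_vCmdid", {"nested": {"path": "_._._._"}}),
    ("ts_and_vCmdid", {"terms": {"field": "_._._._.d", "size": 0}}),
    ("expanded_to_values", {"nested": {"path": "_._._._.v"}}),
    ("totalCount", {"sum": {"field": "_._._._.v.0"}}),
]
q = chain_aggs(steps)
```

Each step stays on one line, and adding or removing a level (such as the vAppid filter) no longer means re-indenting the whole body.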
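Elasticsearch is lightning fast.
Raw format: 198M on disk (MongoDB: 3G), query in 1 second (MongoDB: 9 seconds)
Packed format: 74M on disk (MongoDB: 270M), query in 0.54 seconds (MongoDB: 7.1 seconds)
When the raw values must be fully expanded, the packed format is slightly slower than the raw format, but packing makes it easy to store pre-aggregated values, so most reads are on the order of 0.05 seconds.
If 74M can hold 8.8 million points, how much data fits on a 2T disk? A great deal. And beyond storing and reading it back, es can also aggregate on demand on the server side and quickly present the data along different dimensions.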
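A back-of-the-envelope estimate for the 2T question, as a rough sketch only. It assumes storage scales linearly with the number of points, reads "74M" and "2T" as binary units (MiB and TiB), and ignores replicas, indexing overhead and filesystem overhead.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def points_that_fit(disk_bytes, sample_bytes, sample_points):
    """Extrapolate linearly from a measured sample to a full disk."""
    bytes_per_point = sample_bytes / sample_points
    return int(disk_bytes / bytes_per_point)


bytes_per_point = 74 * MIB / 8_800_000                  # a little under 9 bytes per point
capacity = points_that_fit(2 * TIB, 74 * MIB, 8_800_000)
```

Under those assumptions the answer comes out in the hundreds of billions of points, which is the "a great deal" of the text put into rough numbers.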