

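Inside the company, the usual Elasticsearch access path is: Java application ---> SLB (domain name) ---> ES ingest node (no data) ---> ES data node. The ingest nodes are the ones exposed to Java applications. They play a pure client role and provide neither data storage nor inverted-index retrieval. The SLB is there for convenience: it provides a domain name and load balancing, binds the n backend client nodes, and keeps all of this transparent to the business code. It still has a cost, though: one more network hop for every RPC (fast as it is), and one more component to pay for. So around 9/30 we removed the SLB and verified that everything worked without it. This is possible because the ES client itself accepts a list of IPs and implements its own load balancing. The access path after the change is: Java application ---> ES ingest node ---> ES data node.
The problem showed up during a scale-in. We changed the IP list configured in the ES client and started to see timeouts. We also noticed that every time the IP list was updated, one machine always had a noticeably higher connection count than the others. Why? The key events were: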

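  • step1 16:xx in the afternoon: updated the machine list in the ES client; the system behaved normally
  • step2 20:5x in the evening: started taking data nodes offline; a small number of timeouts and errors appeared, and one ES client node carried noticeably more traffic than the others, so the load was no longer balanced
  • step3 21:1x: assuming the high-traffic ES node was faulty, we took it offline
  • step4 21:2x: once that client node was offline, another client node started receiving too much traffic, and the timeouts came in waves, as if on a timer
  • step5 suspected that the ES client SDK was initialized incorrectly, i.e. that the code created a new ES client without properly destroying the old one; started restarting the Java applications in 6 batches of roughly 10 machines each, so that the ES client was re-initialized instead of hot-swapped
  • step6 the timeouts persisted and the load imbalance was still not resolved; the timeouts changed from waves to a continuous but small trickle, as if the former peaks had been spread evenly over time
  • step7 then discovered that the IP list in the ES client was wrong: it contained data nodes, and those were exactly the machines taken offline at 20:5x, so they were no longer reachable
  • step8 corrected the IP list used by the ES client; the errors disappeared, the load was balanced again, and the system returned to normal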


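    Given five machines ServerA, ServerB, ServerC, ServerD, and ServerE: when ServerD is unavailable, traffic on ServerE rises noticeably; when ServerE is unavailable, traffic on ServerA rises noticeably.

    1. Why were there timeouts? Why did they come in waves at first and then get spread out after the restart?
    2. How does ES handle a bad node in the ES client's list? If it blacklists the node, why did we still see load imbalance and timeouts?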




      /**
       * Sends a request to the Elasticsearch cluster that the client points to.
       * Blocks until the request is completed and returns its response or fails
       * by throwing an exception. Selects a host out of the provided ones in a
       * round-robin fashion. Failing hosts are marked dead and retried after a
       * certain amount of time (minimum 1 minute, maximum 30 minutes), depending
       * on how many times they previously failed (the more failures, the later
       * they will be retried). In case of failures all of the alive nodes (or
       * dead nodes that deserve a retry) are retried until one responds or none
       * of them does, in which case an {@link IOException} will be thrown.
       *
       * This method works by performing an asynchronous call and waiting
       * for the result. If the asynchronous call throws an exception we wrap
       * it and rethrow it so that the stack trace attached to the exception
       * contains the call site. While we attempt to preserve the original
       * exception this isn't always possible and likely haven't covered all of
       * the cases. You can get the original exception from
       * {@link Exception#getCause()}.
       *
       * @param request the request to perform
       * @return the response returned by Elasticsearch
       * @throws IOException in case of a problem or the connection was aborted
       * @throws ClientProtocolException in case of an http protocol error
       * @throws ResponseException in case Elasticsearch responded with a status code that indicated an error
       */
      public Response performRequest(Request request) throws IOException {
          InternalRequest internalRequest = new InternalRequest(request);
          // nextNodes() supplies the ordered candidate nodes for this request
          return performRequest(nextNodes(), internalRequest, null);
      }

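      Note the call to nextNodes(). Its return value is a NodeTuple<Iterator<Node>>, that is, a list of servers. Set aside for now how that list is ordered and look at how it is used. (Following the analysis requires understanding how the ES client uses Apache HttpClient to call the server. The short answer: the host is carried in the request, which is built into an object resembling HttpGet("https://host:port/search?q=0") and handed to the HTTP client for execution.) Further down, inside the try block, the node that gets used is nodeTuple.nodes.next(). Since this is the first read from the list, it is the head of the list.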

      RequestContext context = request.createContextForNextAttempt(nodeTuple.nodes.next(), nodeTuple.authCache);
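The comment states it clearly: hosts are selected with a round-robin (RR) algorithm, and failing nodes are silenced, i.e. blacklisted for at least 1 minute and at most 30 minutes:
 Selects a host out of the provided ones in a round-robin fashion. Failing hosts are marked dead and retried after a certain amount of time (minimum 1 minute, maximum 30 minutes)
That brings us to the key code. What does nextNodes do? Through selectNodes, it reorders the configured list and removes the dead nodes.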
 static Iterable<Node> selectNodes(NodeTuple<List<Node>> nodeTuple, Map<HttpHost, DeadHostState> blacklist,
                                      AtomicInteger lastNodeIndex, NodeSelector nodeSelector) throws IOException {
        /*
         * Sort the nodes into living and dead lists.
         */
        List<Node> livingNodes = new ArrayList<>(Math.max(0, nodeTuple.nodes.size() - blacklist.size()));
        List<DeadNode> deadNodes = new ArrayList<>(blacklist.size());
        for (Node node : nodeTuple.nodes) {
            DeadHostState deadness = blacklist.get(node.getHost());
            if (deadness == null || deadness.shallBeRetried()) {
                // never failed, or blacklist period expired: treat as living
                livingNodes.add(node);
            } else {
                deadNodes.add(new DeadNode(node, deadness));
            }
        }

        if (false == livingNodes.isEmpty()) {
            /*
             * Normal state: there is at least one living node. If the
             * selector is ok with any over the living nodes then use them
             * for the request.
             */
            List<Node> selectedLivingNodes = new ArrayList<>(livingNodes);
            nodeSelector.select(selectedLivingNodes);
            if (false == selectedLivingNodes.isEmpty()) {
                /*
                 * Rotate the list using a global counter as the distance so subsequent
                 * requests will try the nodes in a different order.
                 */
                Collections.rotate(selectedLivingNodes, lastNodeIndex.getAndIncrement());
                return selectedLivingNodes;
            }
        }
        // ... the handling for "no living node" is omitted from this excerpt
 }


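  1. The returned list is newly allocated, so it does not interfere with the list you configured
  2. Dead nodes are filtered out right here; the key check, deadness.shallBeRetried(), is covered later
  3. Round-robin is implemented with Collections.rotate(), also covered later
  4. After rotation, the reordered new list is returned to performRequest, which calls next() to take the first node
  5. The second argument to rotate, lastNodeIndex, is a counter shared by all threads and incremented by 1 on every call, which is what produces the round-robin behaviour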


    The process looks like this. Assume the configured list is A,B,C,D and there are no bad nodes:
    1st call, lastNodeIndex=0: Collections.rotate("A,B,C,D", 0) = "A,B,C,D"
    2nd call, lastNodeIndex=1: Collections.rotate("A,B,C,D", 1) = "D,A,B,C"
    3rd call, lastNodeIndex=2: Collections.rotate("A,B,C,D", 2) = "C,D,A,B"
    4th call, lastNodeIndex=3: Collections.rotate("A,B,C,D", 3) = "B,C,D,A"
    5th call, lastNodeIndex=4: Collections.rotate("A,B,C,D", 4) = "A,B,C,D"
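
The rotation sequence above can be reproduced with plain JDK classes. The following is a minimal sketch, not Elasticsearch code: the class RotationDemo and the method nextOrder are hypothetical names, and the AtomicInteger passed in only plays the role that lastNodeIndex plays in the client.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of the Elasticsearch client.
final class RotationDemo {

    // Returns a fresh, rotated copy of the configured list.
    // The configured list itself is never mutated.
    static List<String> nextOrder(List<String> configured, AtomicInteger counter) {
        List<String> copy = new ArrayList<>(configured);
        // The shared counter is the rotation distance and grows by 1 per call.
        Collections.rotate(copy, counter.getAndIncrement());
        return copy;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("A", "B", "C", "D");
        AtomicInteger counter = new AtomicInteger(0); // stands in for lastNodeIndex
        for (int i = 0; i < 5; i++) {
            List<String> order = nextOrder(nodes, counter);
            System.out.println("call " + (i + 1) + ": " + order + ", first try = " + order.get(0));
        }
    }
}
```

Because only the head of each rotated list is tried first, the first-choice node cycles A, D, C, B, A, which is the round-robin effect described above.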


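One thing worth mentioning is the NodeSelector object that gets involved here. Before rotate is called, the selector's select method is invoked. Clicking through shows an interface, so this is very likely an extension point. In real use there is a default, NodeSelector.ANY, whose method body is empty and does nothing. With the default in place, the selector has no effect and can be ignored.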

    /**
     * Select the {@link Node}s to which to send requests. This is called with
     * a mutable {@link Iterable} of {@linkplain Node}s in the order that the
     * rest client would prefer to use them and implementers should remove
     * nodes from the that should not receive the request. Implementers may
     * iterate the nodes as many times as they need.
     * <p>
     * This may be called twice per request: first for "living" nodes that
     * have not been blacklisted by previous errors. If the selector removes
     * all nodes from the list or if there aren't any living nodes then the
     * {@link RestClient} will call this method with a list of "dead" nodes.
     * <p>
     * Implementers should not rely on the ordering of the nodes.
     */
    void select(Iterable<Node> nodes);



        try {
            httpResponse = client.execute(context.requestProducer, context.asyncResponseConsumer, context.context, null).get();
        } catch(Exception e) {
            RequestLogger.logFailedRequest(logger, request.httpRequest, context.node, e);
            // blacklist the node that just failed
            onFailure(context.node);
            Exception cause = extractAndWrapCause(e);
            addSuppressedException(previousException, cause);
            if (nodeTuple.nodes.hasNext()) {
                // retry on the next node of the same rotated list
                return performRequest(nodeTuple, request, cause);
            }
            if (cause instanceof IOException) {
                throw (IOException) cause;
            }
            if (cause instanceof RuntimeException) {
                throw (RuntimeException) cause;
            }
            throw new IllegalStateException("unexpected exception type: must be either RuntimeException or IOException", cause);
        }


  1. onFailure(context.node) adds the current node to the blacklist
  2. return performRequest(nodeTuple, request, cause); recursively tries the next node until a healthy node responds

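So what happens once a node is on the blacklist? As the node selection logic shows, every node in the blacklist goes through the shallBeRetried check, which decides whether it is put back into the living list and used for the current request. The method looks like this: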

    /**
     * Indicates whether it's time to retry to failed host or not.
     *
     * @return true if the host should be retried, false otherwise
     */
    boolean shallBeRetried() {
        return timeSupplier.get() - deadUntilNanos > 0;
    }


    DeadHostState(Supplier<Long> timeSupplier) {
        this.failedAttempts = 1;
        this.deadUntilNanos = timeSupplier.get() + MIN_CONNECTION_TIMEOUT_NANOS;
        this.timeSupplier = timeSupplier;
    }


    /**
     * Build the dead state of a host given its previous dead state. Useful when a host has been failing before, hence
     * it already failed for one or more consecutive times. The more failed attempts we register the longer we wait
     * to retry that same host again. Minimum is 1 minute (for a node the only failed once created
     * through {@link #DeadHostState(Supplier)}), maximum is 30 minutes (for a node that failed more than 10 consecutive times)
     *
     * @param previousDeadHostState the previous state of the host which allows us to increase the wait till the next retry attempt
     */
    DeadHostState(DeadHostState previousDeadHostState) {
        // exponential backoff, capped at the maximum timeout
        long timeoutNanos = (long)Math.min(MIN_CONNECTION_TIMEOUT_NANOS * 2 * Math.pow(2, previousDeadHostState.failedAttempts * 0.5 - 1),
                MAX_CONNECTION_TIMEOUT_NANOS);
        this.deadUntilNanos = previousDeadHostState.timeSupplier.get() + timeoutNanos;
        this.failedAttempts = previousDeadHostState.failedAttempts + 1;
        this.timeSupplier = previousDeadHostState.timeSupplier;
    }
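
To get a feel for how quickly the blacklist period grows, the timeout arithmetic can be evaluated on its own. This is a rough sketch that copies only the formula quoted above. The class DeadHostBackoffDemo and the method timeoutNanos are hypothetical, and the two constants are assumed to be 1 minute and 30 minutes, as the Javadoc states.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical demo class that re-implements only the timeout arithmetic.
final class DeadHostBackoffDemo {
    // Assumed values, taken from the "minimum 1 minute, maximum 30 minutes" Javadoc.
    static final long MIN_CONNECTION_TIMEOUT_NANOS = TimeUnit.MINUTES.toNanos(1);
    static final long MAX_CONNECTION_TIMEOUT_NANOS = TimeUnit.MINUTES.toNanos(30);

    // Blacklist duration for a new failure, given how many consecutive
    // failures were already recorded for the host.
    static long timeoutNanos(int previousFailedAttempts) {
        if (previousFailedAttempts == 0) {
            // First failure: same value as the DeadHostState(Supplier) constructor.
            return MIN_CONNECTION_TIMEOUT_NANOS;
        }
        return (long) Math.min(
                MIN_CONNECTION_TIMEOUT_NANOS * 2 * Math.pow(2, previousFailedAttempts * 0.5 - 1),
                MAX_CONNECTION_TIMEOUT_NANOS);
    }

    public static void main(String[] args) {
        for (int previous = 0; previous <= 10; previous++) {
            long seconds = TimeUnit.NANOSECONDS.toSeconds(timeoutNanos(previous));
            System.out.println("consecutive failure #" + (previous + 1)
                    + " -> blacklisted for about " + seconds + " s");
        }
    }
}
```

Under these assumptions the period starts at 60 s, doubles with every two further failures, and is clamped to 30 minutes once ten failures have already been recorded. A node that stays down is therefore re-probed at these intervals by every client that has it in its list.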



  1. Why does the load become unbalanced?

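P.S. Suppose the broken node is the last one in the list, e.g. D in A,B,C,D. Going by the condition if (nodeTuple.nodes.hasNext()) {}, there would be no retry, so why does the first node end up with the extra load?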
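
One way to explore both questions is a small simulation of rotate-then-failover. This is a simplified sketch and not the client's code: FailoverSkewDemo and simulate are hypothetical names, and the blacklist is ignored on purpose, so the model covers only requests that still see the dead node in their candidate list (before it is blacklisted, or once shallBeRetried() returns true again).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation, not Elasticsearch code. Blacklisting is ignored.
final class FailoverSkewDemo {

    // Sends `requests` requests and counts which node finally serves each one.
    static Map<String, Integer> simulate(List<String> configured, String deadNode, int requests) {
        Map<String, Integer> served = new LinkedHashMap<>();
        for (String node : configured) {
            served.put(node, 0);
        }
        for (int i = 0; i < requests; i++) {
            List<String> order = new ArrayList<>(configured);
            Collections.rotate(order, i);   // same role as lastNodeIndex.getAndIncrement()
            for (String node : order) {     // first try the head, then fail over to the next
                if (!node.equals(deadNode)) {
                    served.merge(node, 1, Integer::sum);
                    break;
                }
            }
        }
        return served;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("A", "B", "C", "D", "E");
        System.out.println("D dead: " + simulate(nodes, "D", 1000));
        System.out.println("E dead: " + simulate(nodes, "E", 1000));
    }
}
```

In this model the node that follows the dead one in the configured order serves its own share plus the dead node's share. With D dead, E handles twice as many requests as A, B, or C. With E dead, the extra share lands on A, because the rotated list that starts with E is E,A,B,C,D, so hasNext() is still true after E fails. That matches the ServerD/ServerE/ServerA observation described earlier.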

  1. Why were there timeouts? Why did they come in waves at first and then get spread out after the restart?

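The IP list was changed on all Java machines at almost the same moment, and the list contained several bad machines, so the silencing logic was triggered everywhere at once. The first requests to those machines all failed, the nodes were silenced for 1 minute, and then the next attempts failed together again. Every client retried on the same rhythm, so the timeouts arrived in waves, like a scheduled job. Restarting the applications in batches presumably shifted each batch onto its own retry clock, which would explain why the peaks were then spread across time.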

  1. Why do errors still occur even though bad nodes are blacklisted?



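When the ES RestClient connects to the cluster directly by IP instead of through the SLB, the client's load balancing algorithm leads to the following symptoms once the list contains bad nodes: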

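  1. ES access timeouts arrive in waves, with no logs to show for it (debug logging is normally off in production); when the retries take too long, the whole call chain times out
  2. The node right after a bad node receives noticeably more traffic than the other nodes, so the load is no longer balanced
  3. The whole call chain reports errors. The timeout you configure is the socket timeout of each individual request to ES, and the client retries several times on its own, so the real search time exceeds the timeout you set and the upstream SOA caller cancels with an error. This is also why the ES response reports took = 500 ms while your search method actually spent 1 s.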
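
The arithmetic behind the last point can be made explicit. This is only a back-of-the-envelope sketch: RetryLatencyDemo and worstCaseMillis are hypothetical names, and the model assumes that every attempt against a dead node burns the full per-attempt timeout before the client moves on to the next node.

```java
// Hypothetical helper for estimating caller-visible latency, not Elasticsearch code.
final class RetryLatencyDemo {

    // deadNodesTried: attempts that hit a dead node before a healthy one answers
    // perAttemptTimeoutMillis: timeout configured for a single attempt (assumed fully consumed)
    // tookMillis: time the healthy node needs to serve the request
    static long worstCaseMillis(int deadNodesTried, long perAttemptTimeoutMillis, long tookMillis) {
        return deadNodesTried * perAttemptTimeoutMillis + tookMillis;
    }

    public static void main(String[] args) {
        // One dead node ahead of a healthy one, 500 ms per-attempt timeout, took = 500 ms.
        System.out.println(worstCaseMillis(1, 500, 500) + " ms seen by the caller");
        // Two dead nodes in a row make it worse.
        System.out.println(worstCaseMillis(2, 500, 500) + " ms seen by the caller");
    }
}
```

The upstream deadline therefore has to budget for the number of attempts, not just for a single socket timeout.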



