
Emitter Timeout Errors #11

Open
busecolak opened this issue May 20, 2020 · 40 comments

@busecolak
Contributor

Hello,

I am using Druid 0.18.0 with this exporter, with one exporter instance per Druid component.
The exporter seems to be having trouble keeping up with the emitter, and it causes some Druid components to log emitter-related errors.

Druid emitter configuration is as follows:

  • druid_emitter_http_minHttpTimeoutMillis=100
  • druid_emitter_http_flushMillis=300000

In the exporter logs I see consecutive POST requests, even with the emitter configuration above:
[20/May/2020 06:39:36] "POST / HTTP/1.1" 200 0
[20/May/2020 06:39:45] "POST / HTTP/1.1" 200 0
[20/May/2020 06:39:56] "POST / HTTP/1.1" 200 0
[20/May/2020 06:39:56] "POST / HTTP/1.1" 200 0

Some error logs from the Druid components:
ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - Timing out emitter batch send, last batch fill time [25] ms, timeout [50] ms
ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - Failed to send events to url
ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - failedBuffers queue size reached the limit [50], dropping the oldest failed buffer

Is there any suggestion on how to resolve this?

@elukey
Contributor

elukey commented May 20, 2020

Hi @busecolak, thanks a lot for the report. The other day I was trying to gather metrics about how many datapoints Druid emits to the exporter, but I think those are only available in more recent versions (I run 0.12.3 sadly, but hopefully will migrate soon).

I am reading https://druid.apache.org/docs/latest/configuration/index.html#emitting-metrics, and I'd be curious to know if you have an idea of how big your batches of datapoints are and how frequent they are. As we discussed previously, the exporter might need some performance enhancements for big workloads, so it would be good to have an idea of the volume of datapoints that your Druid daemons are emitting.

Moreover, I'd also test lowering druid.emitter.http.flushMillis: in 300s a ton of datapoints could accumulate, and a 100ms timeout could be too tight. More frequent flush intervals (say once every 30-60s) could help.

Let me know!

@elukey
Contributor

elukey commented May 20, 2020

There are two other interesting settings:

  • druid.emitter.http.flushCount - How many messages the internal message buffer can hold before flushing (sending).
  • druid.emitter.http.maxBatchSize - The maximum batch size, in bytes.

So in theory, if you didn't change the former, the flush count is 500 datapoints. I assume that when the buffer gets to that number of datapoints it forces a flush (even if the 300s are not reached). How big are those datapoints though? Does it happen with specific daemons or with all of them?

Another thing that I'd do is lower the flush count to something like 100 and see how it goes.
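
For reference, a possible starting point in runtime.properties could look like the following; the values are illustrative assumptions to experiment with, not tested recommendations:

druid.emitter=http
druid.emitter.http.recipientBaseUrl=http://<exporter-host>:8000/
# Flush more often so each batch stays small
druid.emitter.http.flushMillis=60000
# Flush after 100 datapoints instead of the default 500
druid.emitter.http.flushCount=100
# Give each POST more room than 100ms
druid.emitter.http.minHttpTimeoutMillis=1000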

@elukey
Contributor

elukey commented May 20, 2020

@busecolak one thing that could be useful is getting some info about your workload with a graph like:

https://grafana.wikimedia.org/d/000000538/druid?panelId=62&fullscreen&orgId=1&refresh=1m&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=druid_analytics&var-druid_datasource=All

The exporter publishes a metric called druid_exporter_datapoints_registered_total; this could be a good indicator of your workload.

@busecolak
Contributor Author

@elukey thanks for the replies.

The errors occur mostly with the MiddleManager and Historical daemons, but most daemons show the same issue.

For MiddleManagers, druid_exporter_datapoints_registered_total is around 4-5 million, but I am not sure how to map this to the emitter's batch size.

@elukey
Contributor

elukey commented May 20, 2020

@busecolak can you apply an irate(...[5m]) or similar and report the requests/s value? Just to get an idea of the volume and how it varies.
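
For example, something along these lines (the label selector is a sketch and depends on your Prometheus scrape config):

irate(druid_exporter_datapoints_registered_total{instance="<exporter-host>:8000"}[5m])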

@elukey
Contributor

elukey commented May 26, 2020

@busecolak any news?

@freemanlutsk

Hi guys,
I've faced the same issue.
irate(druid_exporter_datapoints_registered_total[5m]) [screenshot attached]

@elukey
Contributor

elukey commented May 27, 2020

@freemanlutsk thanks a lot for the feedback. My use case has far fewer datapoints/s, so the exporter was surely not designed for such big use cases. The idea I have in mind is to make the HTTP ingestion of datapoints asynchronous: push them to a Python queue and then let the exporter process the data at a different pace. I'll try to research something; if anybody has ideas or suggestions based on experience I am all ears :)

@elukey
Contributor

elukey commented May 27, 2020

Ok, I have to admit that I didn't really check at the time how many threads the WSGI make_server call would create. It seems that it creates only one, which serves requests synchronously, so the current issue is well explained.

@freemanlutsk

Would you be able to fix it? :)
Thank You!

@elukey
Contributor

elukey commented May 27, 2020

Yes, I am currently working on it. I think I found a solution that scales better; I'll try to file a pull request asap. After that, any tests and user reports would be really appreciated :)

@elukey
Contributor

elukey commented May 27, 2020

First attempt: https://gerrit.wikimedia.org/r/#/c/operations/software/druid_exporter/+/599011/

Currently testing it but so far preliminary results are very good :)

@elukey
Contributor

elukey commented May 27, 2020

@freemanlutsk any chance you could apply the above patch and test how things look?

@freemanlutsk

Sure, I will test it in 1 hour. Are all the changes in master?

@elukey
Contributor

elukey commented May 27, 2020

@freemanlutsk not yet. The code is ready and I tested it, but before committing I wanted some feedback if possible, which is why I was asking (you'd need to take the above patch and apply it manually on top of the current master). If that's not possible, no problem!

@freemanlutsk

I've deployed your fixes into my environment and it seems to work as expected. It would be good to keep it running for a couple of hours, so I'll be back with results tomorrow :)
Many thanks for the quick response!

@freemanlutsk

Unfortunately the problem still exists :(

2020-05-28T06:00:36,172 ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - Timing out emitter batch send, last batch fill time [33] ms, timeout [66] ms
2020-05-28T06:00:36,172 WARN [HttpPostEmitter-1] org.apache.druid.java.util.common.RetryUtils - Retrying (1 of 2) in 884ms.
java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Request timeout to localhost.localdomain/127.0.0.1:8000 after 66 ms

@elukey
Contributor

elukey commented May 28, 2020

@freemanlutsk thanks a lot for the report and the testing.

I guess the patch alleviates the problem but you still see some issues, right?

What I'd like to check now is if some tuning is needed on the Druid front. There are a lot of tuning parameters in: https://druid.apache.org/docs/latest/configuration/index.html#http-emitter-module

Do you have any special config? If I am reading the logs correctly they say timeout [66] ms, which is very tight, while the default is 60s.

@elukey
Contributor

elukey commented May 28, 2020

Also, how many Druid daemons are pushing to the same exporter? At some point one solution to consider would be to have one or more "dedicated" exporter instances for high-traffic daemons (say Historical, MiddleManagers), like @busecolak mentioned before.

wmfgerrit pushed a commit that referenced this issue May 28, 2020
Several users reported timeouts logged in Druid Historical/MiddleManager
daemons when their traffic is really high (like 1500 msg/s). The exporter
was built with a more conservative use case in mind (Wikimedia's), so the
default synchronous single-process server created by make_server() has been enough up to now.

This patch separates datapoint ingestion (namely the traffic that comes from
Druid daemons towards the exporter) from datapoint processing, using
a (thread-safe) queue. It also uses gevent's WSGIServer implementation,
which uses coroutines/greenlets and is able to sustain a lot more (concurrent)
traffic.

GH issue: #11

Change-Id: I4b335a1f663957277fe0c443492c4000bbbcac89
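
The pattern looks roughly like the following. This is a simplified sketch, not the actual exporter code: the handler and queue names, and the assumption that the emitter POSTs a JSON array of events, are mine.

import json
import queue
import threading

from gevent.pywsgi import WSGIServer

datapoints = queue.Queue()  # thread-safe buffer between ingestion and processing

def application(env, start_response):
    # Ingestion: enqueue the POSTed batch and return immediately.
    try:
        size = int(env.get('CONTENT_LENGTH') or 0)
        for event in json.loads(env['wsgi.input'].read(size)):
            datapoints.put(event)
    except (ValueError, KeyError):
        pass  # drop malformed batches
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'']

def process_datapoints():
    # Processing: consume events at its own pace and update Prometheus state.
    while True:
        event = datapoints.get()
        # ... map the event to the corresponding Prometheus metric here ...

threading.Thread(target=process_datapoints, daemon=True).start()
WSGIServer(('', 8000), application).serve_forever()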
@freemanlutsk

freemanlutsk commented May 28, 2020

https://druid.apache.org/docs/latest/configuration/index.html#http-emitter-module

Unfortunately I cannot easily play with those parameters.
Currently the Broker, Historical, Router and MiddleManager are all sending metrics to the same exporter (I'm in the process of changing this architecture).
Basically the problems appear after adding the Historical daemon.

@elukey
Contributor

elukey commented May 28, 2020

@freemanlutsk thanks, let me know how it goes. Today I was thinking that beyond a certain limit we should rely on Kafka. I discovered the Kafka emitter today, so I have in mind to add a flag to this exporter to make it pull from Kafka when requested (instead of ingesting datapoints directly from Druid). Would that be usable for you? This way the exporter would not be concerned with traffic from the Druid daemons, and we'd reuse something battle-tested for it (namely Kafka).

@freemanlutsk

Sounds good!
Let's try it; I'm willing to test.

@elukey
Contributor

elukey commented May 28, 2020

Ok, so I'll try to schedule some time over the next days to work on this; it shouldn't take much. In the meantime, if you could test using a dedicated exporter only for the Historical daemon it would be great :)

@elukey
Contributor

elukey commented May 29, 2020

Added https://github.com/wikimedia/operations-software-druid_exporter#performance-considerations

@elukey
Contributor

elukey commented May 29, 2020

@busecolak have you tried the new version by any chance? If so, how does it look in your environment?

@elukey
Contributor

elukey commented May 30, 2020

The idea that I have in mind is the following: https://gerrit.wikimedia.org/r/#/c/operations/software/druid_exporter/+/600295/

It's still not tested; I need to verify the format of the datapoints in Kafka, etc. Hopefully I'll be able to do it in the next few days.

@elukey
Contributor

elukey commented Jun 3, 2020

@freemanlutsk I had some time to test the above patch and it seems to work, but I wasn't able to set Druid to use the Kafka emitter extension since I don't have it packaged yet. I manually pushed some JSON content to a Kafka topic and it works, so if Druid doesn't do anything peculiar when emitting datapoints to Kafka I'd say it should work fine. If you have the time/patience to test the above and let me know whether it is an acceptable/performant solution for you, I'd be grateful.

@freemanlutsk

Great! Thank You!
In 2-3 weeks I'm going to change my Druid architecture. After that I will be able to restart my nodes with changed configs without any problems and test with production load.

wmfgerrit pushed a commit that referenced this issue Jun 22, 2020
This change introduces a separate thread that is able to pull
data from a Kafka topic and insert datapoints into the shared
queue that separates ingestion from processing.

The idea is to configure Druid daemons that need to push hundreds
of datapoints/s to emit to Kafka (via the KafkaEmitter) and to collect them
at a slower pace in the exporter.

GH issue: #11

Change-Id: Ibc82be5883f20c26b50342d2032381086bcd218a
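
A rough sketch of the idea (topic, brokers and group id below are placeholders; the real implementation lives in druid_exporter/collector.py):

import json
import queue
import threading

from kafka import KafkaConsumer  # provided by the kafka-python package

datapoints = queue.Queue()  # same shared queue used by the HTTP ingestion path

def pull_datapoints_from_kafka():
    consumer = KafkaConsumer(
        'druid-metrics',                    # placeholder topic name
        bootstrap_servers=['kafka1:9092'],  # placeholder broker list
        group_id='druid-exporter',          # placeholder consumer group
    )
    for message in consumer:
        try:
            datapoints.put(json.loads(message.value.decode()))
        except json.JSONDecodeError:
            continue  # skip messages that are not valid JSON

threading.Thread(target=pull_datapoints_from_kafka, daemon=True).start()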
@freemanlutsk

Hi @elukey,
Sorry for delay.
I've enabled emitting metrics to the Kafka cluster with the help of the clarity-kafka extension.
Format is:
{"feed":"metrics","timestamp":"2020-07-23T15:19:30.444Z","service":"druid/middlemanager","host":"druid-backup2-int.company.com:8102","version":"0.18.0-iap12","metric":"jvm/cpu/user","value":1410,"dataSource":["company-errors"],"id":["index_kafka_company-errors_8825d8e2b4415f1_cpidiogh"],"implyNodeType":"druid/peon","implyCluster":"clarity-collection-cluster","implyDruidVersion":"0.18.0-iap12","implyVersion":"3.3.7"}
{"feed":"metrics","timestamp":"2020-07-23T15:19:30.445Z","service":"druid/middlemanager","host":"druid-backup2-int.company.com:8102","version":"0.18.0-iap12","metric":"jvm/cpu/percent","value":0.024666666666666667,"dataSource":["company-errors"],"id":["index_kafka_company-errors_8825d8e2b4415f1_cpidiogh"],"implyNodeType":"druid/peon","implyCluster":"clarity-collection-cluster","implyDruidVersion":"0.18.0-iap12","implyVersion":"3.3.7"}
{"feed":"metrics","timestamp":"2020-07-23T15:19:30.445Z","service":"druid/middlemanager","host":"druid-backup2-int.company.com:8102","version":"0.18.0-iap12","metric":"jvm/cpu/sys","value":70,"dataSource":["company-errors"],"id":["index_kafka_company-errors_8825d8e2b4415f1_cpidiogh"],"implyNodeType":"druid/peon","implyCluster":"clarity-collection-cluster","implyDruidVersion":"0.18.0-iap12","implyVersion":"3.3.7"}
{"feed":"metrics","timestamp":"2020-07-23T15:19:30.446Z","service":"druid/middlemanager","host":"druid-backup2-int.company.com:8102","version":"0.18.0-iap12","metric":"jvm/threads/livePeak","value":123,"dataSource":["company-errors"],"id":["index_kafka_company-errors_8825d8e2b4415f1_cpidiogh"],"implyNodeType":"druid/peon","implyCluster":"clarity-collection-cluster","implyDruidVersion":"0.18.0-iap12","implyVersion":"3.3.7"}

Will it work with your exporter?

Thank You!

@elukey
Contributor

elukey commented Jul 23, 2020

@freemanlutsk in theory yes, but it's better to test it before enabling it for production use cases. If you encounter any errors please report them here and I'll try to fix them.

@freemanlutsk

freemanlutsk commented Jul 23, 2020

@elukey
Is this the right way to run the exporter?
It seems something went wrong, because the exposed port (8000) does not work.

docker run --rm -v /root/druid.json:/exporter/conf/druid.json registry.company.com/bidmachine/druid-exporter:kafka_v1 -t bidmachine-druid-metrics-stage -b kafka1-ext.example.com:9092 kafka2-ext.example.com:9092 kafka3-ext.example.com:9092 kafka4-ext.example.com:9092 kafka5-ext.example.com:9092 kafka6-ext.example.com:9092 kafka7-ext.example.com:9092 kafka8-ext.example.com:9092 kafka9-ext.example.com:9092 -g druid-metrics-stage  -d conf/druid.json
INFO:__main__:Using Kafka config: {'topic': 'bidmachine-druid-metrics-stage', 'bootstrap_servers': ['kafka1-ext.example.com:9092', 'kafka2-ext.example.com:9092', 'kafka3-ext.example.com:9092', 'kafka4-ext.example.com:9092', 'kafka5-ext.example.com:9092', 'kafka6-ext.example.com:9092', 'kafka7-ext.example.com:9092', 'kafka8-ext.example.com:9092', 'kafka9-ext.example.com:9092'], 'group_id': 'druid-metrics-stage'}
INFO:__main__:Starting druid_exporter on :8000
INFO:__main__:Reading metrics configuration from conf/druid.json
INFO:__main__:Checking consistency of metrics config file..
DEBUG:druid_exporter.collector:Process datapoints thread starting..

@elukey
Contributor

elukey commented Jul 24, 2020

@freemanlutsk what issue do you see? A timeout while contacting port 8000? Or something different? I'd check whether port 8000 is correctly exposed by your Docker container to localhost (where I guess you are trying to test it). Also, you need to push datapoints to the topic to make it process something, since by default the Kafka client starts from the tail of the topic. Thanks for testing!

@freemanlutsk

freemanlutsk commented Jul 24, 2020

netstat -antp|grep LIST|grep 8000 is empty.

curl 127.0.0.1:8000/metrics
curl: (7) Failed to connect to 127.0.0.1 port 8000: Connection refused

@elukey

@freemanlutsk

Oh, it was my mistake: I was missing --net host.
Never mind my previous messages :)
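
For reference, the earlier command with host networking added would look like this (image, topic and broker names as in the previous message, remaining brokers omitted; alternatively the port can be published with -p 8000:8000 instead of --net host):

docker run --rm --net host \
  -v /root/druid.json:/exporter/conf/druid.json \
  registry.company.com/bidmachine/druid-exporter:kafka_v1 \
  -t bidmachine-druid-metrics-stage \
  -b kafka1-ext.example.com:9092 kafka2-ext.example.com:9092 \
  -g druid-metrics-stage -d conf/druid.json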

@freemanlutsk

@elukey
It seems the consumer does not grab any data from the Kafka stream (kafkacat shows there are a lot of metrics in the topic).
I also noticed that after starting the consumer (exporter), no consumer group was created in the Kafka cluster.
Perhaps it's a problem with clarity-kafka.

Please advise.
Thank You!

@elukey
Contributor

elukey commented Jul 24, 2020

@freemanlutsk do you see, among the debug logs, something like: Kafka datapoints puller thread starting.. ?

If not, I think the kafka-python dependency is not pulled in, so you end up in https://github.com/wikimedia/operations-software-druid_exporter/blob/master/druid_exporter/collector.py#L27-L30. I haven't updated the Dockerfile yet; are you including kafka-python in the one that you are using, by any chance?

If this is the problem, maybe I can add a log line to inform the user of what is happening, something like: "Looks like you added a Kafka config but the Kafka client dependencies are not deployed, so I'll skip creating the Kafka puller thread."
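
The guarded import in collector.py is roughly of this shape (a sketch, not the exact code; the helper function is hypothetical):

import logging

log = logging.getLogger(__name__)

try:
    from kafka import KafkaConsumer  # needs the kafka-python package
except ImportError:
    KafkaConsumer = None  # Kafka support disabled

def maybe_start_kafka_puller(kafka_config):
    # Hypothetical helper: warn instead of silently skipping the puller thread.
    if kafka_config and KafkaConsumer is None:
        log.warning('Kafka config provided but kafka-python is not installed; '
                    'skipping the Kafka puller thread.')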

@freemanlutsk

@elukey Thanks! kafka-python was the cause of the previous error.
Would you be able to look into another error?

DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=8) with offset 410380 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=1) with offset 410377 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=2) with offset 410370 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=7) with offset 410371 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=5) with offset 410372 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=6) with offset 410368 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=4) with offset 410377 to buffered record list
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/druid_exporter-0.10-py3.8.egg/druid_exporter/collector.py", line 298, in pull_datapoints_from_kafka
    json_message = json.loads(message.value.decode())
NameError: name 'json' is not defined

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/site-packages/druid_exporter-0.10-py3.8.egg/druid_exporter/collector.py", line 305, in pull_datapoints_from_kafka
    except json.JSONDecodeError:
NameError: name 'json' is not defined
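
The NameError above means collector.py uses json without importing it; presumably the fix is simply the missing import at the top of the module:

import json  # needed by pull_datapoints_from_kafka for json.loads and json.JSONDecodeError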

wmfgerrit pushed a commit that referenced this issue Jul 24, 2020
These changes were brought up by @freemanlutsk on GH while
testing the new Kafka client code.

Change-Id: I812b3ddb108fa849f8a178984b0e20b3a6e536e7
GH: #11
@elukey
Contributor

elukey commented Jul 24, 2020

@freemanlutsk this is a bug, can you try the latest version of the code? (I just committed a fix.)

@freemanlutsk

Thank You!
Looks like everything works fine!

But the coordinator metrics are scraped 3 times (if you have 3 masters); for metrics from the masters we have to use a function like avg.

I should also see how the exporter behaves in the production environment.
[screenshot attached]

@elukey
Contributor

elukey commented Jul 24, 2020

@freemanlutsk thanks a lot for the tests! Yes, the coordinator metrics are still something that I haven't solved: only one coordinator is the master at any given time, and it is the one that publishes metrics. For any number of reasons, it can change at any time (say due to a restart, a problem, etc.).

Keep also in mind https://github.com/wikimedia/operations-software-druid_exporter#known-limitations, which I still haven't solved. Basically the main issue is that when the coordinator master moves from one host to another, then due to how Prometheus works both will keep returning metrics: the "former" master will keep returning its last metric state, and the new one will emit the new metrics. I have some workarounds in mind, but I still haven't found a good compromise in the code. If it turns out to be a problem for you I'll try to prioritize it 👍
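
On the query side, as noted above, one workaround is to aggregate across the coordinators; for example (the metric name below is purely hypothetical, substitute whichever coordinator metric you scrape):

avg without (instance) (druid_coordinator_segment_count)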
