
Emitter Timeout Errors #11

Open
busecolak opened this issue May 20, 2020 · 40 comments

@busecolak
Contributor

Hello,

I am using Druid 0.18.0 with this exporter, with one exporter instance per Druid component.
The exporter seems to be having trouble keeping up with the emitter, and it causes some Druid components to log emitter-related errors.

Druid emitter configuration is as follows:

  • druid_emitter_http_minHttpTimeoutMillis=100
  • druid_emitter_http_flushMillis=300000

In the exporter logs I see consecutive POST requests, even with the emitter configuration above:
[20/May/2020 06:39:36] "POST / HTTP/1.1" 200 0
[20/May/2020 06:39:45] "POST / HTTP/1.1" 200 0
[20/May/2020 06:39:56] "POST / HTTP/1.1" 200 0
[20/May/2020 06:39:56] "POST / HTTP/1.1" 200 0

Some error logs from the Druid components:
ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - Timing out emitter batch send, last batch fill time [25] ms, timeout [50] ms
ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - Failed to send events to url
ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - failedBuffers queue size reached the limit [50], dropping the oldest failed buffer

Is there any suggestion on how to resolve this?

@elukey
Contributor

elukey commented May 20, 2020

Hi @busecolak, thanks a lot for the report. The other day I was trying to gather metrics about how many datapoints Druid emits to the exporter, but I think those are only available in more recent versions (I run 0.12.3 sadly, but hopefully will migrate soon).

I am reading https://druid.apache.org/docs/latest/configuration/index.html#emitting-metrics, and I'd be curious to know if you have an idea of how big your batches of datapoints are and how frequent they are. As we discussed previously, the exporter might need some performance enhancements for big workloads, so it would be good to have an idea of the volume of datapoints that your Druid daemons are emitting.

Moreover, I'd also test lowering druid.emitter.http.flushMillis: in 300s a ton of datapoints could accumulate, and a 100ms timeout could be too tight. More frequent flush intervals (say once every 30-60s) could help.

Let me know!

@elukey
Contributor

elukey commented May 20, 2020

There are two other interesting settings:

  • druid.emitter.http.flushCount - How many messages the internal message buffer can hold before flushing (sending).
  • druid.emitter.http.maxBatchSize - The maximum batch size, in bytes.

So in theory, if you didn't change the former, the flush count is 500 datapoints. I assume that when the buffer gets to that number of datapoints it forces a flush (even if the 300s are not reached). How big are those datapoints though? Does it happen with specific daemons or with all of them?

Another thing that I'd do is lower the flush count to something like 100 and see how it goes.
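
For reference, a possible starting point in runtime.properties could look like the following; the values are illustrative assumptions to experiment with, not tested recommendations:

druid.emitter=http
druid.emitter.http.recipientBaseUrl=http://<exporter-host>:8000/
# Flush more often so each batch stays small
druid.emitter.http.flushMillis=60000
# Flush after 100 datapoints instead of the default 500
druid.emitter.http.flushCount=100
# Give each POST more room than 100ms
druid.emitter.http.minHttpTimeoutMillis=1000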

@elukey
Contributor

elukey commented May 20, 2020

@busecolak one thing that could be useful is getting some info about your workload with a graph like:

https://grafana.wikimedia.org/d/000000538/druid?panelId=62&fullscreen&orgId=1&refresh=1m&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=druid_analytics&var-druid_datasource=All

The exporter publishes a metric called druid_exporter_datapoints_registered_total; this could be a good indicator of your workload.

@busecolak
Contributor Author

@elukey thanks for the replies.

The errors occur mostly with the MiddleManager and Historical daemons, but most daemons show the same issue.

For MiddleManagers, druid_exporter_datapoints_registered_total is around 4-5 million, but I am not sure how to map this to the emitter's batch size.

@elukey
Contributor

elukey commented May 20, 2020

@busecolak can you apply an irate(...[5m]) or similar and report the requests/s value? Just to get an idea of the volume and how it varies.
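
For example, something along these lines (the label selector is a sketch and depends on your Prometheus scrape config):

irate(druid_exporter_datapoints_registered_total{instance="<exporter-host>:8000"}[5m])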

@elukey
Contributor

elukey commented May 26, 2020

@busecolak any news?

@freemanlutsk

Hi guys,
I've faced the same issue.
irate(druid_exporter_datapoints_registered_total[5m]) [screenshot attached]

@elukey
Contributor

elukey commented May 27, 2020

@freemanlutsk thanks a lot for the feedback. My use case has far fewer datapoints/s, so the exporter was surely not designed for such big use cases. The idea I have in mind is to make the HTTP ingestion of datapoints asynchronous: push them to a Python queue and then let the exporter process the data at a different pace. I'll try to research something; if anybody has ideas or suggestions based on experience I am all ears :)

@elukey
Contributor

elukey commented May 27, 2020

Ok, I have to admit that I didn't really check at the time how many threads the WSGI make_server call would create. It seems that it creates only one, which serves requests synchronously, so the current issue is well explained.

@freemanlutsk

Would you be able to fix it? :)
Thank You!

@elukey
Contributor

elukey commented May 27, 2020

Yes, I am currently working on it. I think I found a solution that scales better; I'll try to file a pull request asap. After that, any tests and user reports would be really appreciated :)

@elukey
Contributor

elukey commented May 27, 2020

First attempt: https://gerrit.wikimedia.org/r/#/c/operations/software/druid_exporter/+/599011/

Currently testing it but so far preliminary results are very good :)

@elukey
Contributor

elukey commented May 27, 2020

@freemanlutsk any chance you could apply the above patch and test how things look?

@freemanlutsk

Sure, I will test it in 1 hour. Are all the changes in master?

@elukey
Contributor

elukey commented May 27, 2020

@freemanlutsk not yet. The code is ready and I tested it, but before committing I wanted some feedback if possible, which is why I was asking (you'd need to take the above patch and apply it manually on top of the current master). If that's not possible, no problem!

@freemanlutsk

I've deployed your fixes into my environment and it seems to work as expected. It would be good to keep it running for a couple of hours, so I'll be back with results tomorrow :)
Many thanks for the quick response!

@freemanlutsk

Unfortunately the problem still exists :(

2020-05-28T06:00:36,172 ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - Timing out emitter batch send, last batch fill time [33] ms, timeout [66] ms
2020-05-28T06:00:36,172 WARN [HttpPostEmitter-1] org.apache.druid.java.util.common.RetryUtils - Retrying (1 of 2) in 884ms.
java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Request timeout to localhost.localdomain/127.0.0.1:8000 after 66 ms

@elukey
Contributor

elukey commented May 28, 2020

@freemanlutsk thanks a lot for the report and the testing.

I guess the patch alleviates the problem but you still see some issues, right?

What I'd like to check now is if some tuning is needed on the Druid front. There are a lot of tuning parameters in: https://druid.apache.org/docs/latest/configuration/index.html#http-emitter-module

Do you have any special config? If I am reading the logs correctly they say timeout [66] ms, which is very tight, while the default is 60s.

@elukey
Contributor

elukey commented May 28, 2020

Also, how many Druid daemons are pushing to the same exporter? At some point one solution to consider would be to have one or more "dedicated" exporter instances for high-traffic daemons (say Historical, MiddleManagers), like @busecolak mentioned before.

wmfgerrit pushed a commit that referenced this issue May 28, 2020
Several users reported timeouts logged in Druid Historical/MiddleManager
daemons when their traffic is really high (like 1500 msg/s). The exporter
was built with a more conservative use case in mind (Wikimedia's), so the
default synchronous single-process server created by make_server() has been enough up to now.

This patch separates datapoint ingestion (namely the traffic that comes from
Druid daemons towards the exporter) from datapoint processing, using
a (thread-safe) queue. It also uses gevent's WSGIServer implementation,
which uses coroutines/greenlets and is able to sustain a lot more (concurrent)
traffic.

GH issue: #11

Change-Id: I4b335a1f663957277fe0c443492c4000bbbcac89
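
The pattern looks roughly like the following. This is a simplified sketch, not the actual exporter code: the handler and queue names, and the assumption that the emitter POSTs a JSON array of events, are mine.

import json
import queue
import threading

from gevent.pywsgi import WSGIServer

datapoints = queue.Queue()  # thread-safe buffer between ingestion and processing

def application(env, start_response):
    # Ingestion: enqueue the POSTed batch and return immediately.
    try:
        size = int(env.get('CONTENT_LENGTH') or 0)
        for event in json.loads(env['wsgi.input'].read(size)):
            datapoints.put(event)
    except (ValueError, KeyError):
        pass  # drop malformed batches
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'']

def process_datapoints():
    # Processing: consume events at its own pace and update Prometheus state.
    while True:
        event = datapoints.get()
        # ... map the event to the corresponding Prometheus metric here ...

threading.Thread(target=process_datapoints, daemon=True).start()
WSGIServer(('', 8000), application).serve_forever()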
@freemanlutsk

freemanlutsk commented May 28, 2020

https://druid.apache.org/docs/latest/configuration/index.html#http-emitter-module

Unfortunately I cannot easily play with those parameters.
Currently the Broker, Historical, Router and MiddleManager are all sending metrics to the same exporter (I'm in the process of changing this architecture).
Basically the problems appear after adding the Historical daemon.

@elukey
Contributor

elukey commented May 28, 2020

@freemanlutsk thanks, let me know how it goes. Today I was thinking that beyond a certain limit we should rely on Kafka. I discovered the Kafka emitter today, so I have in mind to add a flag to this exporter to make it pull from Kafka when requested (instead of ingesting datapoints directly from Druid). Would that be usable for you? This way the exporter would not be concerned with traffic from the Druid daemons, and we'd reuse something battle-tested for it (namely Kafka).

@freemanlutsk

Sounds good!
Let's try it; I'm willing to test.

@elukey
Contributor

elukey commented May 28, 2020

Ok, so I'll try to schedule some time over the next days to work on this; it shouldn't take much. In the meantime, if you could test using a dedicated exporter only for the Historical daemon it would be great :)

@elukey
Contributor

elukey commented May 29, 2020

Added https://github.com/wikimedia/operations-software-druid_exporter#performance-considerations

@elukey
Contributor

elukey commented May 29, 2020

@busecolak have you tried the new version by any chance? If so, how does it look in your environment?

@elukey
Contributor

elukey commented May 30, 2020

The idea that I have in mind is the following: https://gerrit.wikimedia.org/r/#/c/operations/software/druid_exporter/+/600295/

It's still not tested; I need to verify the format of the datapoints in Kafka, etc. Hopefully I'll be able to do it in the next few days.

@elukey
Contributor

elukey commented Jun 3, 2020

@freemanlutsk I had some time to test the above patch and it seems to work, but I wasn't able to set Druid to use the Kafka emitter extension since I don't have it packaged yet. I manually pushed some JSON content to a Kafka topic and it works, so if Druid doesn't do anything peculiar when emitting datapoints to Kafka I'd say it should work fine. If you have the time/patience to test the above and let me know whether it is an acceptable/performant solution for you, I'd be grateful.

@freemanlutsk

Great! Thank You!
In 2-3 weeks I'm going to change my Druid architecture. After that I will be able to restart my nodes with changed configs without any problems and test with production load.

wmfgerrit pushed a commit that referenced this issue Jun 22, 2020
This change introduces a separate thread that is able to pull
data from a Kafka topic and insert datapoints into the shared
queue that separates ingestion from processing.

The idea is to configure Druid daemons that need to push hundreds
of datapoints/s to emit to Kafka (via the KafkaEmitter) and to collect them
at a slower pace in the exporter.

GH issue: #11

Change-Id: Ibc82be5883f20c26b50342d2032381086bcd218a
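
A rough sketch of the idea (topic, brokers and group id below are placeholders; the real implementation lives in druid_exporter/collector.py):

import json
import queue
import threading

from kafka import KafkaConsumer  # provided by the kafka-python package

datapoints = queue.Queue()  # same shared queue used by the HTTP ingestion path

def pull_datapoints_from_kafka():
    consumer = KafkaConsumer(
        'druid-metrics',                    # placeholder topic name
        bootstrap_servers=['kafka1:9092'],  # placeholder broker list
        group_id='druid-exporter',          # placeholder consumer group
    )
    for message in consumer:
        try:
            datapoints.put(json.loads(message.value.decode()))
        except json.JSONDecodeError:
            continue  # skip messages that are not valid JSON

threading.Thread(target=pull_datapoints_from_kafka, daemon=True).start()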
@freemanlutsk

Hi @elukey,
Sorry for delay.
I've enabled emitting metrics to the Kafka cluster with the help of the clarity-kafka extension.
Format is:
{"feed":"metrics","timestamp":"2020-07-23T15:19:30.444Z","service":"druid/middlemanager","host":"druid-backup2-int.company.com:8102","version":"0.18.0-iap12","metric":"jvm/cpu/user","value":1410,"dataSource":["company-errors"],"id":["index_kafka_company-errors_8825d8e2b4415f1_cpidiogh"],"implyNodeType":"druid/peon","implyCluster":"clarity-collection-cluster","implyDruidVersion":"0.18.0-iap12","implyVersion":"3.3.7"}
{"feed":"metrics","timestamp":"2020-07-23T15:19:30.445Z","service":"druid/middlemanager","host":"druid-backup2-int.company.com:8102","version":"0.18.0-iap12","metric":"jvm/cpu/percent","value":0.024666666666666667,"dataSource":["company-errors"],"id":["index_kafka_company-errors_8825d8e2b4415f1_cpidiogh"],"implyNodeType":"druid/peon","implyCluster":"clarity-collection-cluster","implyDruidVersion":"0.18.0-iap12","implyVersion":"3.3.7"}
{"feed":"metrics","timestamp":"2020-07-23T15:19:30.445Z","service":"druid/middlemanager","host":"druid-backup2-int.company.com:8102","version":"0.18.0-iap12","metric":"jvm/cpu/sys","value":70,"dataSource":["company-errors"],"id":["index_kafka_company-errors_8825d8e2b4415f1_cpidiogh"],"implyNodeType":"druid/peon","implyCluster":"clarity-collection-cluster","implyDruidVersion":"0.18.0-iap12","implyVersion":"3.3.7"}
{"feed":"metrics","timestamp":"2020-07-23T15:19:30.446Z","service":"druid/middlemanager","host":"druid-backup2-int.company.com:8102","version":"0.18.0-iap12","metric":"jvm/threads/livePeak","value":123,"dataSource":["company-errors"],"id":["index_kafka_company-errors_8825d8e2b4415f1_cpidiogh"],"implyNodeType":"druid/peon","implyCluster":"clarity-collection-cluster","implyDruidVersion":"0.18.0-iap12","implyVersion":"3.3.7"}

Will it work with your exporter?

Thank You!

@elukey
Contributor

elukey commented Jul 23, 2020

@freemanlutsk in theory yes, but it's better to test it before enabling it for production use cases. If you encounter any errors please report them here and I'll try to fix them.

@freemanlutsk

freemanlutsk commented Jul 23, 2020

@elukey
Is this the right way to run the exporter?
It seems something went wrong, because the exposed port (8000) does not work.

docker run --rm -v /root/druid.json:/exporter/conf/druid.json registry.company.com/bidmachine/druid-exporter:kafka_v1 -t bidmachine-druid-metrics-stage -b kafka1-ext.example.com:9092 kafka2-ext.example.com:9092 kafka3-ext.example.com:9092 kafka4-ext.example.com:9092 kafka5-ext.example.com:9092 kafka6-ext.example.com:9092 kafka7-ext.example.com:9092 kafka8-ext.example.com:9092 kafka9-ext.example.com:9092 -g druid-metrics-stage  -d conf/druid.json
INFO:__main__:Using Kafka config: {'topic': 'bidmachine-druid-metrics-stage', 'bootstrap_servers': ['kafka1-ext.example.com:9092', 'kafka2-ext.example.com:9092', 'kafka3-ext.example.com:9092', 'kafka4-ext.example.com:9092', 'kafka5-ext.example.com:9092', 'kafka6-ext.example.com:9092', 'kafka7-ext.example.com:9092', 'kafka8-ext.example.com:9092', 'kafka9-ext.example.com:9092'], 'group_id': 'druid-metrics-stage'}
INFO:__main__:Starting druid_exporter on :8000
INFO:__main__:Reading metrics configuration from conf/druid.json
INFO:__main__:Checking consistency of metrics config file..
DEBUG:druid_exporter.collector:Process datapoints thread starting..

@elukey
Contributor

elukey commented Jul 24, 2020

@freemanlutsk what issue do you see? A timeout while contacting port 8000? Or something different? I'd check whether port 8000 is correctly exposed by your Docker container to localhost (where I guess you are trying to test it). Also, you need to push datapoints to the topic to make it process something, since by default the Kafka client starts from the tail of the topic. Thanks for testing!

@freemanlutsk

freemanlutsk commented Jul 24, 2020

netstat -antp|grep LIST|grep 8000 is empty.

curl 127.0.0.1:8000/metrics
curl: (7) Failed to connect to 127.0.0.1 port 8000: Connection refused

@elukey

@freemanlutsk

Oh, it was my mistake: I was missing --net host.
Never mind my previous messages :)
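
For reference, the earlier command with host networking added would look like this (image, topic and broker names as in the previous message, remaining brokers omitted; alternatively the port can be published with -p 8000:8000 instead of --net host):

docker run --rm --net host \
  -v /root/druid.json:/exporter/conf/druid.json \
  registry.company.com/bidmachine/druid-exporter:kafka_v1 \
  -t bidmachine-druid-metrics-stage \
  -b kafka1-ext.example.com:9092 kafka2-ext.example.com:9092 \
  -g druid-metrics-stage -d conf/druid.json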

@freemanlutsk

@elukey
It seems the consumer does not grab any data from the Kafka stream (kafkacat shows there are a lot of metrics in the topic).
I also noticed that after starting the consumer (exporter), no consumer group was created in the Kafka cluster.
Perhaps it's a problem with clarity-kafka.

Please advise.
Thank You!

@elukey
Contributor

elukey commented Jul 24, 2020

@freemanlutsk do you see, among the debug logs, something like: Kafka datapoints puller thread starting.. ?

If not, I think the kafka-python dependency is not pulled in, so you end up in https://github.com/wikimedia/operations-software-druid_exporter/blob/master/druid_exporter/collector.py#L27-L30. I haven't updated the Dockerfile yet; are you including kafka-python in the one that you are using, by any chance?

If this is the problem, maybe I can add a log line to inform the user of what is happening, something like: "Looks like you added a Kafka config but the Kafka client dependencies are not deployed, so I'll skip creating the Kafka puller thread."
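
The guarded import in collector.py is roughly of this shape (a sketch, not the exact code; the helper function is hypothetical):

import logging

log = logging.getLogger(__name__)

try:
    from kafka import KafkaConsumer  # needs the kafka-python package
except ImportError:
    KafkaConsumer = None  # Kafka support disabled

def maybe_start_kafka_puller(kafka_config):
    # Hypothetical helper: warn instead of silently skipping the puller thread.
    if kafka_config and KafkaConsumer is None:
        log.warning('Kafka config provided but kafka-python is not installed; '
                    'skipping the Kafka puller thread.')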

@freemanlutsk

@elukey Thanks! kafka-python was the cause of the previous error.
Would you be able to look into another error?

DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=8) with offset 410380 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=1) with offset 410377 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=2) with offset 410370 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=7) with offset 410371 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=5) with offset 410372 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=6) with offset 410368 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=4) with offset 410377 to buffered record list
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/druid_exporter-0.10-py3.8.egg/druid_exporter/collector.py", line 298, in pull_datapoints_from_kafka
    json_message = json.loads(message.value.decode())
NameError: name 'json' is not defined

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/site-packages/druid_exporter-0.10-py3.8.egg/druid_exporter/collector.py", line 305, in pull_datapoints_from_kafka
    except json.JSONDecodeError:
NameError: name 'json' is not defined
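
The NameError above means collector.py uses json without importing it; presumably the fix is simply the missing import at the top of the module:

import json  # needed by pull_datapoints_from_kafka for json.loads and json.JSONDecodeError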

wmfgerrit pushed a commit that referenced this issue Jul 24, 2020
These changes were brought up by @freemanlutsk on GH while
testing the new Kafka client code.

Change-Id: I812b3ddb108fa849f8a178984b0e20b3a6e536e7
GH: #11
@elukey
Contributor

elukey commented Jul 24, 2020

@freemanlutsk this is a bug, can you try the latest version of the code? (I just committed a fix.)

@freemanlutsk

Thank You!
Looks like everything works fine!

But the coordinator metrics are scraped 3 times (if you have 3 masters); for metrics from the masters we have to use a function like avg.

I should also see how the exporter behaves in the production environment.
[screenshot attached]

@elukey
Contributor

elukey commented Jul 24, 2020

@freemanlutsk thanks a lot for the tests! Yes, the coordinator metrics are still something that I haven't solved: only one coordinator is the master at any given time, and it is the one that publishes metrics. For any number of reasons, it can change at any time (say due to a restart, a problem, etc.).

Keep also in mind https://github.com/wikimedia/operations-software-druid_exporter#known-limitations, which I still haven't solved. Basically the main issue is that when the coordinator master moves from one host to another, then due to how Prometheus works both will keep returning metrics: the "former" master will keep returning its last metric state, and the new one will emit the new metrics. I have some workarounds in mind, but I still haven't found a good compromise in the code. If it turns out to be a problem for you I'll try to prioritize it 👍
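
On the query side, as noted above, one workaround is to aggregate across the coordinators; for example (the metric name below is purely hypothetical, substitute whichever coordinator metric you scrape):

avg without (instance) (druid_coordinator_segment_count)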
