Emitter Timeout Errors #11
Comments
Hi @busecolak, thanks a lot for the report. The other day I was trying to gather metrics about how many datapoints Druid is emitting to the exporter, but I think those are only available in more recent versions (I run 0.12.3 sadly, but hopefully will migrate soon). I am reading https://druid.apache.org/docs/latest/configuration/index.html#emitting-metrics, and I'd be curious to know how big your batches of datapoints are and how frequent they are. As we discussed previously, the exporter might need some performance enhancements for big workloads, so it would be good to have an idea of the volume of datapoints that your Druid daemons are emitting. Moreover, I'd also test with lowering those settings. Let me know!
There are two other interesting settings:
So in theory, if you didn't change the latter, the max batch size is 500. I assume that when it gets to that number of datapoints, it forces a flush (even if the 300s are not reached). How big are those datapoints, though? Does it happen with specific daemons or with all of them? Another thing that I'd do would be to lower the max batch size to something like 100, and see how it goes.
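For reference, a minimal sketch of the two batching knobs being discussed, in HTTP emitter property form (the recipient URL and values are placeholders, and defaults can differ between Druid versions, so double-check the docs for your release):

```properties
druid.emitter=http
druid.emitter.http.recipientBaseUrl=http://druid-exporter.example.org:8000/
# Flush the current batch after this many milliseconds, even if it is not full.
druid.emitter.http.flushMillis=60000
# ...or as soon as this many events have been buffered (the "max batch size" above);
# lowering this to something like 100 is the experiment suggested above.
druid.emitter.http.flushCount=500
```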
@busecolak one thing that could be useful is getting some info about your workload with a graph of the datapoints-received counter that the emitter publishes.
@elukey thanks for the replies. The errors occur mostly with the MiddleManager and Historical daemons, but most of them have the same issue. For MiddleManagers, …
@busecolak can you put an irate(...[5m]) or similar on it and report the rps value? Just to have an idea of the volume and how it varies.
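Something along these lines should give the requested rps figure; the metric name here is a placeholder and should be checked against the counters the exporter actually exposes on /metrics:

```promql
# Per-second datapoint ingestion rate over the last 5 minutes, per exporter instance.
sum by (instance) (irate(druid_exporter_datapoints_registered_total[5m]))
```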
@busecolak any news?
@freemanlutsk thanks a lot for the feedback; my use case involves far fewer datapoints/s, so the exporter was surely not designed for such big use cases. The idea that I have in mind is to see if we can make the HTTP ingestion of the datapoints async, push them to a Python queue, and then let the exporter process data at a different pace. Will try to research something; if anybody has ideas or suggestions based on experience I am all ears :)
Ok, I have to admit that I didn't really check at the time how many threads the wsgiref make_server call would create. It seems that it creates only one, which serves requests synchronously, so the current issue is well explained.
Would you be able to fix it? :)
Yes yes, I am currently working on it. I think I found a solution that scales better; I'll try to file a pull request asap. After that, any test and user report would be really appreciated :)
First attempt: https://gerrit.wikimedia.org/r/#/c/operations/software/druid_exporter/+/599011/ Currently testing it, but so far preliminary results are very good :)
@freemanlutsk any chance that you could apply the above patch and test how things look?
Sure. I will test it in 1 hour. Are all changes in master?
@freemanlutsk not yet, the code is ready and I tested it, but before committing I wanted some feedback if possible, which is why I was asking (so you'd need to take the above patch and apply it manually on the current master). If that's not possible, no problem!
I've deployed your fixes into my environment. It seems to work as expected. It would be good to keep it running for a couple of hours; I'll be back with results tomorrow :)
Unfortunately the problem still exists :(
@freemanlutsk thanks a lot for the report and the testing. I guess that the patch alleviates the problem, but we still see some issues, right? What I'd like to check now is whether some tuning is needed on the Druid front. There are a lot of tuning parameters in: https://druid.apache.org/docs/latest/configuration/index.html#http-emitter-module Do you have any special config? If I am reading the logs correctly, they say that the emitter batch send is timing out.
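For context, the timeout in that kind of log line is derived from how long the batch took to fill, so these are the HTTP emitter knobs that usually matter for it. This is only an illustrative sketch with placeholder values; the exact defaults depend on the Druid version:

```properties
# The send timeout is roughly (batch fill time) x this factor, which would explain
# a log like "last batch fill time [25] ms, timeout [50] ms" with the default of 2.0.
druid.emitter.http.httpTimeoutAllowanceFactor=5.0
# Floor for the computed timeout, in milliseconds (keep it well below flushMillis).
druid.emitter.http.minHttpTimeoutMillis=1000
# Limit on buffered/failed batches before the oldest one gets dropped.
druid.emitter.http.batchQueueSizeLimit=50
```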
Also, how many Druid daemons are pushing to the same druid exporter? At some point one solution to consider would be to have one or more "dedicated" instances for high-traffic daemons (say Historicals, MiddleManagers), like @busecolak mentioned before.
Several users reported timeouts logged in Druid Historical/MiddleManager daemons when their traffic is really high (like 1500 msg/s). The exporter was built with a more conservative use case in mind (Wikimedia's), so the default sync single process of make_server() has been enough up to now. This patch separates datapoints ingestion (namely traffic that comes from Druid daemons towards the exporter) and datapoints processing, using a (thread-safe) queue. It also uses gevent's WSGIServer implementation, which uses coroutines/greenlets and is able to sustain a lot more (concurrent) traffic. GH issue: #11 Change-Id: I4b335a1f663957277fe0c443492c4000bbbcac89
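A minimal sketch of the idea described in the commit message, assuming gevent and a plain thread-safe queue (this is an illustration of the ingestion/processing split, not the actual Gerrit patch):

```python
import json
import logging
import queue
import threading

from gevent.pywsgi import WSGIServer

log = logging.getLogger(__name__)

# Thread-safe queue decoupling HTTP ingestion from datapoint processing.
datapoints_queue = queue.Queue()


def application(env, start_response):
    """WSGI app: read the POSTed Druid datapoints and enqueue them quickly."""
    try:
        body = env['wsgi.input'].read(int(env.get('CONTENT_LENGTH') or 0))
        for datapoint in json.loads(body):
            datapoints_queue.put(datapoint)
        start_response('200 OK', [])
    except Exception:
        log.exception('Failed to ingest datapoints')
        start_response('500 Internal Server Error', [])
    return [b'']


def process_datapoints():
    """Consumer thread: turn queued datapoints into Prometheus state at its own pace."""
    while True:
        datapoint = datapoints_queue.get()
        # ... hand the datapoint to the Prometheus collector here ...
        datapoints_queue.task_done()


threading.Thread(target=process_datapoints, daemon=True).start()
# gevent's WSGIServer serves many concurrent connections via greenlets.
WSGIServer(('0.0.0.0', 8000), application).serve_forever()
```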
Unfortunately I cannot easily play with those parameters.
@freemanlutsk thanks, let me know how it goes. Today I was thinking that beyond a certain limit we should rely on Kafka. I discovered the Kafka emitter today, so I have in mind to add a flag to this exporter to just pull from Kafka if requested (instead of ingesting datapoints directly from Druid). Would it be something usable for you? In this way the exporter would not be concerned about traffic from Druid daemons, and we'd re-use something battle tested for it (namely Kafka).
Sounds good!
Ok, so I'll try to schedule some time during the next days to work on this; it shouldn't take much. In the meantime, if you could test using a dedicated exporter only for the Historical daemon it would be great :)
@busecolak have you tried the new version by any chance? If so, how does it look in your environment?
The idea that I have in mind is the following: https://gerrit.wikimedia.org/r/#/c/operations/software/druid_exporter/+/600295/ Still not tested; I need to verify the format of the datapoints in Kafka etc., and hopefully I will be able to do it during the next days.
@freemanlutsk I had some time to test the above patch and it seems to work, but I wasn't able to set Druid to use the Kafka emitter extension since I don't have it packaged yet. I manually pushed some JSON content to a Kafka topic and it works, so if Druid doesn't do anything peculiar when emitting datapoints to Kafka I'd say that it should work fine. If you have the time/patience to test the above and let me know if it is an acceptable/performant solution for you, I'd be grateful.
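For anyone wanting to reproduce the setup, enabling Druid's Kafka emitter should look roughly like the following; the extension is a contrib one that has to be available on the Druid hosts, and the brokers/topics here are placeholders:

```properties
druid.extensions.loadList=["kafka-emitter"]
druid.emitter=kafka
druid.emitter.kafka.bootstrap.servers=kafka1001:9092,kafka1002:9092
druid.emitter.kafka.metric.topic=druid-metrics
druid.emitter.kafka.alert.topic=druid-alerts
```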
Great! Thank you!
This change introduces a separate thread that is able to pull data from a Kafka topic and insert datapoints into the shared queue that separates ingestion from processing. The idea is to have Druid daemons that need to push hundreds of datapoints/s emit to Kafka (via the KafkaEmitter), and collect them at a slower pace in the exporter. GH issue: #11 Change-Id: Ibc82be5883f20c26b50342d2032381086bcd218a
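A rough sketch of what that consumer thread could look like with kafka-python; names, topic and wiring are illustrative, not the exporter's actual code:

```python
import json
import threading

from kafka import KafkaConsumer  # provided by the optional kafka-python package


def pull_datapoints_from_kafka(brokers, topic, datapoints_queue):
    """Consume Druid datapoints from a Kafka topic and feed the shared queue."""
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=brokers,
        group_id='druid_exporter',
        value_deserializer=lambda msg: json.loads(msg.decode('utf-8')),
    )
    for message in consumer:
        # One JSON metric event per Kafka message is assumed here; push it to the
        # same queue used by the HTTP ingestion path.
        datapoints_queue.put(message.value)


# Example wiring (datapoints_queue shared with the processing thread):
# threading.Thread(
#     target=pull_datapoints_from_kafka,
#     args=(['kafka1001:9092'], 'druid-metrics', datapoints_queue),
#     daemon=True,
# ).start()
```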
Hi @elukey, will it work with your exporter? Thank you!
@freemanlutsk in theory yes, but better to test it before enabling it for production use cases. In case you encounter any error, please report it here and I'll try to fix it.
@elukey
@freemanlutsk what issue do you see? A timeout while contacting port 8000? Or something different? I'd check if port 8000 is correctly exposed by your Docker container to localhost (where I guess you are trying to test it). Also, you need to push datapoints to the topic to make it process something, since by default the Kafka client starts from the tail of the topic. Thanks for testing!
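To push a few test datapoints onto the topic, something like the snippet below can be used; the field layout mimics the usual Druid metric event shape, but it is worth double-checking against what the daemons actually emit:

```python
import json

from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda d: json.dumps(d).encode('utf-8'),
)
# A minimal, hand-crafted datapoint resembling what a Druid daemon would emit.
producer.send('druid-metrics', {
    'feed': 'metrics',
    'timestamp': '2020-06-01T00:00:00.000Z',
    'service': 'druid/historical',
    'host': 'druid1001:8083',
    'metric': 'query/time',
    'value': 42,
    'dataSource': 'test',
})
producer.flush()
```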
Oh, it was my mistake,
@elukey Please advise.
@freemanlutsk do you see, among the debug logs, anything about the Kafka consumer starting up? If not, I think that the kafka-python dependency is not pulled in, ending up in https://github.com/wikimedia/operations-software-druid_exporter/blob/master/druid_exporter/collector.py#L27-L30. I haven't updated the Dockerfile yet; are you including kafka-python in the one that you are using, by any chance? If this is the problem, maybe I can add a log line to inform the user of what is happening.
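The optional-dependency guard being referred to is roughly the following pattern; this is a generic sketch, not a verbatim copy of collector.py:

```python
import logging

log = logging.getLogger(__name__)

try:
    from kafka import KafkaConsumer
    KAFKA_PYTHON_AVAILABLE = True
except ImportError:
    # kafka-python is optional: without it the Kafka consumer simply cannot start,
    # so log the reason explicitly instead of failing silently.
    KAFKA_PYTHON_AVAILABLE = False
    log.warning('kafka-python not installed, the Kafka datapoints consumer is disabled')
```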
@elukey Thanks!
These changes were brought up by @freemanlutsk on GH while testing the new Kafka client code. Change-Id: I812b3ddb108fa849f8a178984b0e20b3a6e536e7 GH: #11
@freemanlutsk this is a bug; can you try the latest version of the code? (I just committed a fix)
@freemanlutsk thanks a lot for the tests! Yes, the coordinator metrics are still something that I haven't solved: only one coordinator is the master at any given time, and only the master publishes metrics. It can swap at any time for any number of reasons (say due to a restart, a problem, etc.). Keep also in mind https://github.com/wikimedia/operations-software-druid_exporter#known-limitations, which I still haven't solved. Basically the main issue is that when the coordinator master moves from one host to another, then due to how Prometheus works both will keep returning metrics: the "former" master will keep returning its last metrics state, and the new one will emit the new metrics. I have in mind some workarounds, but I still haven't found a good compromise in the code. If it turns out to be a problem for you, I'll try to prioritize it 👍
Hello,
I am using Druid 0.18.0 with this exporter. There is one druid exporter instance for each Druid component.
The exporter seems to be having trouble keeping up with the emitter, which causes some Druid components to log errors related to the emitter.
Druid emitter configuration is as follows:
In the exporter logs, I see consecutive POST requests even with the emitter configuration above.
[20/May/2020 06:39:36] "POST / HTTP/1.1" 200 0
[20/May/2020 06:39:45] "POST / HTTP/1.1" 200 0
[20/May/2020 06:39:56] "POST / HTTP/1.1" 200 0
[20/May/2020 06:39:56] "POST / HTTP/1.1" 200 0
Some error logs on the Druid components:
ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - Timing out emitter batch send, last batch fill time [25] ms, timeout [50] ms
ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - Failed to send events to url
ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - failedBuffers queue size reached the limit [50], dropping the oldest failed buffer
Is there any suggestion on how to resolve this?