ti_abusech: Update Fleet status message on API 402 #13718

Open: wants to merge 6 commits into base: main

Conversation

kcreddy
Contributor

@kcreddy kcreddy commented Apr 29, 2025

Proposed commit message

ti_abusech: Update Fleet error message on API 402.

When the API returns 402 Payment Required, the component currently goes into
the DEGRADED state. In an agentless environment this leads to the input being
restarted, and the cycle repeats indefinitely. Update the error message inside
the CEL program to advise using the `Auth Key` (API key) in requests to avoid
rate limiting issues[1]. This error message is shown in the Fleet UI.

[1] https://abuse.ch/blog/community-first/

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.
  • I have verified that any added dashboard complies with Kibana's Dashboard good practices

Screenshots

Before:

Input Health
Screenshot 2025-05-01 at 10 35 20 PM

Documents indexed
Screenshot 2025-05-01 at 10 32 34 PM

After (current PR):

Input Health
Screenshot 2025-05-01 at 10 51 41 PM

Documents indexed
Screenshot 2025-05-01 at 10 50 45 PM

@kcreddy kcreddy self-assigned this Apr 29, 2025
@kcreddy kcreddy added Integration:ti_abusech AbuseCH bugfix Pull request that fixes a bug issue Team:Security-Service Integrations Security Service Integrations team [elastic/security-service-integrations] labels Apr 29, 2025
@kcreddy kcreddy marked this pull request as ready for review April 29, 2025 17:55
@kcreddy kcreddy requested a review from a team as a code owner April 29, 2025 17:55
@elasticmachine

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

@kcreddy kcreddy requested review from andrewkroh and efd6 April 29, 2025 17:56
@andrewkroh
Member

andrewkroh commented Apr 29, 2025

This leads to input getting restarted in agentless environment.

Is this restart on DEGRADED status specific to the Agentless environment? IMO, non-fatal statuses should not cause a restart. The Agent should be able to inform users of unhealthy conditions without causing further issues (like a restart loop). A restart should be reserved for true unrecoverable conditions (like deadlock or other unresponsiveness).

We can debate whether the rate-limited state should be considered DEGRADED or HEALTHY. However, it would still be valuable to indicate the rate-limited state in Fleet. For example, if input collection will be paused for an hour before resuming, displaying this information on the agent status could be helpful.

@elastic-vault-github-plugin-prod

elastic-vault-github-plugin-prod bot commented Apr 29, 2025

🚀 Benchmarks report

Package tenable_io 👍(3) 💚(0) 💔(2)

| Data stream | Previous EPS | New EPS | Diff (%) | Result |
| --- | --- | --- | --- | --- |
| scan | 19230.77 | 13698.63 | -5532.14 (-28.77%) | 💔 |
| vulnerability | 1945.53 | 1605.14 | -340.39 (-17.5%) | 💔 |

Package ti_abusech 👍(2) 💚(0) 💔(2)

| Data stream | Previous EPS | New EPS | Diff (%) | Result |
| --- | --- | --- | --- | --- |
| malware | 4366.81 | 3571.43 | -795.38 (-18.21%) | 💔 |
| malwarebazaar | 5154.64 | 3610.11 | -1544.53 (-29.96%) | 💔 |

To see the full report comment with /test benchmark fullreport
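
For readers checking the numbers in the report, the Diff column appears to be the absolute change in events per second followed by the change relative to the previous value. A minimal sketch of that arithmetic, assuming this is how the bot computes it (the function name `eps_diff` is hypothetical and not part of the benchmark tooling):

```python
def eps_diff(previous_eps, new_eps):
    """Return (absolute change, percentage change) rounded to two decimals.

    Assumption: the percentage is relative to the previous EPS value.
    """
    diff = new_eps - previous_eps
    percent = diff / previous_eps * 100
    return round(diff, 2), round(percent, 2)


# Values taken from the tenable_io "scan" row of the report.
print(eps_diff(19230.77, 13698.63))
```

The same function reproduces the other rows, for example `eps_diff(5154.64, 3610.11)` for the malwarebazaar data stream.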

@efd6
Contributor

efd6 commented Apr 30, 2025

Is this restart on DEGRADED status specific to the Agentless environment?

I cannot see how that is happening. It's certainly not expected behaviour for the input. On return of a non-array "events" field we just log, set DEGRADED, raise the object to an array for processing and then drop out of the periodic closure after publication in the normal manner. The only time we exit the periodic loop is when we have a non-nil Go error. This happens when the context is cancelled, when the rate limit response handler errors, or when there are various unexpected type errors.

It would be good to understand which, if any, of these is the exit path. To understand this, it would be good to see what errors get logged immediately after the DEGRADATION logging.
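
As a rough illustration of the control flow described above, here is a minimal Python model. It is not the Go code of the CEL input: the function name `run_periodic`, its parameters, and the use of `RuntimeError` as a stand-in for a non-nil Go error are all assumptions made for the sketch.

```python
def run_periodic(evaluate, publish, iterations):
    """Toy model of the periodic loop: degrade on a single object, exit only on error.

    evaluate:   callable returning a dict with an "events" field (hypothetical)
    publish:    callable that receives each event (hypothetical)
    iterations: number of periodic runs to simulate
    """
    status = "HEALTHY"
    for _ in range(iterations):
        try:
            state = evaluate()
        except RuntimeError as err:
            # Stand-in for a non-nil Go error, e.g. a cancelled context.
            return status, f"loop exited: {err}"
        events = state["events"]
        if not isinstance(events, list):
            # Non-array "events": log, mark DEGRADED, wrap the object in an array.
            print("single event object returned by evaluation")
            status = "DEGRADED"
            events = [events]
        for event in events:
            publish(event)
        # No exit here: the next periodic run proceeds as normal.
    return status, "loop completed"


published = []
rate_limited = {"events": {"error": {"code": "402", "id": "402 Payment Required"}}}
print(run_periodic(lambda: rate_limited, published.append, 3))
```

In this model a rate-limited response degrades the status but never ends the loop, which is why the restarts seen in agentless need another explanation.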

@kcreddy
Contributor Author

kcreddy commented Apr 30, 2025

Something is off with agentless in the Serverless env. I was able to set up the Abusech integration earlier, but now I am getting the following error:

403 Forbidden: {"error":{"root_cause":[{"type":"security_exception","reason":"action [indices:data/read/search] is unauthorized for API key id [xxxxxx] of user [elastic/fleet-server] on indices [agentless-state-cel-ti_abusech.malware-501aa6be-4a34-4193-8b99-640449a92134], this action is granted by the index privileges [read,all]"}],"type":"security_exception","reason":"action [indices:data/read/search] is unauthorized for API key id [xxxxxx] of user [elastic/fleet-server] on indices [agentless-state-cel-ti_abusech.malware-501aa6be-4a34-4193-8b99-640449a92134], this action is granted by the index privileges [read,all]"},"status":403}
403 Forbidden: {"error":{"root_cause":[{"type":"security_exception","reason":"action [indices:data/read/search] is unauthorized for API key id [xxxxxx] of user [elastic/fleet-server] on indices [agentless-state-cel-ti_abusech.malwarebazaar-501aa6be-4a34-4193-8b99-640449a92134], this action is granted by the index privileges [read,all]"}],"type":"security_exception","reason":"action [indices:data/read/search] is unauthorized for API key id [xxxxxx] of user [elastic/fleet-server] on indices [agentless-state-cel-ti_abusech.malwarebazaar-501aa6be-4a34-4193-8b99-640449a92134], this action is granted by the index privileges [read,all]"},"status":403}
403 Forbidden: {"error":{"root_cause":[{"type":"security_exception","reason":"action [indices:data/read/search] is unauthorized for API key id [xxxxxx] of user [elastic/fleet-server] on indices [agentless-state-cel-ti_abusech.threatfox-501aa6be-4a34-4193-8b99-640449a92134], this action is granted by the index privileges [read,all]"}],"type":"security_exception","reason":"action [indices:data/read/search] is unauthorized for API key id [xxxxxx] of user [elastic/fleet-server] on indices [agentless-state-cel-ti_abusech.threatfox-501aa6be-4a34-4193-8b99-640449a92134], this action is granted by the index privileges [read,all]"},"status":403}
403 Forbidden: {"error":{"root_cause":[{"type":"security_exception","reason":"action [indices:data/read/search] is unauthorized for API key id [xxxxxx] of user [elastic/fleet-server] on indices [agentless-state-cel-ti_abusech.url-501aa6be-4a34-4193-8b99-640449a92134], this action is granted by the index privileges [read,all]"}],"type":"security_exception","reason":"action [indices:data/read/search] is unauthorized for API key id [xxxxxx] of user [elastic/fleet-server] on indices [agentless-state-cel-ti_abusech.url-501aa6be-4a34-4193-8b99-640449a92134], this action is granted by the index privileges [read,all]"},"status":403}

Sharing the agent logs from ECH agentless env.
elastic-agent-20250430.ndjson.zip

{"log.level":"info","@timestamp":"2025-04-30T16:49:03.363Z","message":"registering","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"cel-default","type":"cel"},"log":{"source":"cel-default"},"id":"cel-ti_abusech.url-1c57b81b-ee89-4fe9-ad66-bc5dd6af7b67::https://urlhaus.abuse.ch/downloads/json","key":"cel-ti_abusech_url-1c57b81b-ee89-4fe9-ad66-bc5dd6af7b67::https://urlhaus_abuse_ch/downloads/json","uuid":"16c02628-e208-4b43-a8ff-cc3eedad4d8f","ecs.version":"1.6.0","log.logger":"metric_registry","log.origin":{"file.line":63,"file.name":"inputmon/input.go","function":"github.com/elastic/beats/v7/libbeat/monitoring/inputmon.NewInputRegistry"},"service.name":"filebeat","input_type":"cel","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-04-30T16:49:03.366Z","message":"process repeated request","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"cel-default","type":"cel"},"log":{"source":"cel-default"},"id":"cel-ti_abusech.url-1c57b81b-ee89-4fe9-ad66-bc5dd6af7b67","input_url":"https://urlhaus.abuse.ch/downloads/json","ecs.version":"1.6.0","log.origin":{"file.line":225,"file.name":"cel/input.go","function":"github.com/elastic/beats/v7/x-pack/filebeat/input/cel.input.run.func1"},"service.name":"filebeat","input_source":"https://urlhaus.abuse.ch/downloads/json","log.logger":"input.cel","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2025-04-30T16:49:03.608Z","message":"single event object returned by evaluation","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"cel-default","type":"cel"},"log":{"source":"cel-default"},"service.name":"filebeat","id":"cel-ti_abusech.malwarebazaar-1c57b81b-ee89-4fe9-ad66-bc5dd6af7b67","input_source":"https://mb-api.abuse.ch/api/v1/","ecs.version":"1.6.0","log.logger":"input.cel","log.origin":{"file.line":407,"file.name":"cel/input.go","function":"github.com/elastic/beats/v7/x-pack/filebeat/input/cel.input.run.func1"},"input_url":"https://mb-api.abuse.ch/api/v1/","event":{"error":{"code":"402","id":"402 Payment Required","message":"POST:{\n    \"query_status\": \"ratelimited\",\n    \"msg\": \"Your request has been rate-limited. Please visit https:\\/\\/abuse.ch\\/rate-limit\\/ for more information.\"\n}"}},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2025-04-30T16:49:03.615Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":705},"message":"Unit state changed cel-default-cel-ti_abusech-1c57b81b-ee89-4fe9-ad66-bc5dd6af7b67 (HEALTHY->DEGRADED): single event error object returned by evaluation: {\"error\":{\"code\":\"402\",\"id\":\"402 Payment Required\",\"message\":\"POST:{\\n    \\\"query_status\\\": \\\"ratelimited\\\",\\n    \\\"msg\\\": \\\"Your request has been rate-limited. Please visit https:\\\\/\\\\/abuse.ch\\\\/rate-limit\\\\/ for more information.\\\"\\n}\"}}","log":{"source":"elastic-agent"},"component":{"id":"cel-default","state":"HEALTHY"},"unit":{"id":"cel-default-cel-ti_abusech-1c57b81b-ee89-4fe9-ad66-bc5dd6af7b67","type":"input","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-04-30T16:49:04.860Z","message":"add_cloud_metadata: hosting provider type not detected.","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"cel-default","type":"cel"},"log":{"source":"cel-default"},"log.logger":"add_cloud_metadata","log.origin":{"file.line":100,"file.name":"add_cloud_metadata/add_cloud_metadata.go","function":"github.com/elastic/beats/v7/libbeat/processors/add_cloud_metadata.(*addCloudMetadata).init.func1"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-04-30T16:49:04.899Z","message":"Connecting to backoff(elasticsearch(https://0c0a9b03c334490ebd8bd527d95014f8.us-central1.gcp.cloud.es.io:443))","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"cel-default","type":"cel"},"log":{"source":"cel-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"publisher_pipeline_output","log.origin":{"file.line":138,"file.name":"pipeline/client_worker.go","function":"github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).run"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-04-30T16:49:05.008Z","message":"Attempting to connect to Elasticsearch version 8.18.0 (default)","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"cel-default","type":"cel"},"log":{"source":"cel-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"esclientleg","log.origin":{"file.line":323,"file.name":"eslegclient/connection.go","function":"github.com/elastic/beats/v7/libbeat/esleg/eslegclient.(*Connection).Ping"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-04-30T16:49:05.324Z","message":"Connection to backoff(elasticsearch(https://0c0a9b03c334490ebd8bd527d95014f8.us-central1.gcp.cloud.es.io:443)) established","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"cel-default","type":"cel"},"log":{"source":"cel-default"},"log.logger":"publisher_pipeline_output","log.origin":{"file.line":146,"file.name":"pipeline/client_worker.go","function":"github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).run"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-04-30T16:49:30.711Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":687},"message":"Component state changed cel-default (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '28' exited with code '-1'","log":{"source":"elastic-agent"},"component":{"id":"cel-default","state":"STOPPED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-04-30T16:49:30.712Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":705},"message":"Unit state changed cel-default-cel-ti_abusech-1c57b81b-ee89-4fe9-ad66-bc5dd6af7b67 (DEGRADED->STOPPED): Suppressing FAILED state due to restart for '28' exited with code '-1'","log":{"source":"elastic-agent"},"component":{"id":"cel-default","state":"STOPPED"},"unit":{"id":"cel-default-cel-ti_abusech-1c57b81b-ee89-4fe9-ad66-bc5dd6af7b67","type":"input","state":"STOPPED","old_state":"DEGRADED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-04-30T16:49:30.712Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":705},"message":"Unit state changed cel-default (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '28' exited with code '-1'","log":{"source":"elastic-agent"},"component":{"id":"cel-default","state":"STOPPED"},"unit":{"id":"cel-default","type":"output","state":"STOPPED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-04-30T16:49:31.714Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":667},"message":"Spawned new component cel-default: Starting: spawned pid '68'","log":{"source":"elastic-agent"},"component":{"id":"cel-default","state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-04-30T16:49:31.714Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":674},"message":"Spawned new unit cel-default-cel-ti_abusech-1c57b81b-ee89-4fe9-ad66-bc5dd6af7b67: Starting: spawned pid '68'","log":{"source":"elastic-agent"},"component":{"id":"cel-default","state":"STARTING"},"unit":{"id":"cel-default-cel-ti_abusech-1c57b81b-ee89-4fe9-ad66-bc5dd6af7b67","type":"input","state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-04-30T16:49:31.714Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":674},"message":"Spawned new unit cel-default: Starting: spawned pid '68'","log":{"source":"elastic-agent"},"component":{"id":"cel-default","state":"STARTING"},"unit":{"id":"cel-default","type":"output","state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-04-30T16:49:31.809Z","message":"Home path: [/usr/share/elastic-agent/data/elastic-agent-1c9cf2/components] Config path: [/usr/share/elastic-agent/data/elastic-agent-1c9cf2/components] Data path: [/agentless/data/run/cel-default] Logs path: [/usr/share/elastic-agent/data/elastic-agent-1c9cf2/components/logs]","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"cel-default","type":"cel"},"log":{"source":"cel-default"},"log.origin":{"file.line":1082,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.(*Beat).configure"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-04-30T16:49:31.810Z","message":"Beat ID: f2f6a96b-e3c6-490b-9722-453a4c35a4a9","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"cel-default","type":"cel"},"log":{"source":"cel-default"},"log.origin":{"file.line":1090,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.(*Beat).configure"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}

Right after the unit goes from HEALTHY->DEGRADED, there is a reconnection attempt to Elasticsearch, which is followed by the component going into STOPPED.
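
To make that sequence easier to follow in a large ndjson file, the state transitions can be pulled out with a short script. This is only a sketch: it assumes the message format seen in the log lines above ("Unit state changed NAME (OLD->NEW): ..." and "Component state changed NAME (OLD->NEW): ..."), and the function name `state_transitions` is hypothetical.

```python
import json
import re

# Matches e.g. "Unit state changed cel-default (HEALTHY->STOPPED): ..."
TRANSITION = re.compile(r"^(Unit|Component) state changed (\S+) \((\w+)->(\w+)\)")


def state_transitions(ndjson_lines):
    """Return (timestamp, kind, name, old_state, new_state) for each transition line."""
    transitions = []
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        match = TRANSITION.match(str(record.get("message", "")))
        if match:
            kind, name, old, new = match.groups()
            transitions.append((record.get("@timestamp"), kind, name, old, new))
    return transitions


# Abridged records modelled on the log excerpt above.
sample = [
    '{"@timestamp":"2025-04-30T16:49:03.615Z","message":"Unit state changed unit-a (HEALTHY->DEGRADED): single event error object returned by evaluation"}',
    '{"@timestamp":"2025-04-30T16:49:30.711Z","message":"Component state changed cel-default (HEALTHY->STOPPED): Suppressing FAILED state due to restart"}',
]
for row in state_transitions(sample):
    print(row)
```

Against a real file, the lines could be supplied with `open(path, encoding="utf-8")`, since iterating a file object yields one line at a time.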

@efd6
Contributor

efd6 commented Apr 30, 2025

From the logs, the cel-default input is (repeatedly) started and then terminates with a -1 with no additional logging to explain why. It might be helpful to set logging to debug to get some breadcrumbs. My bet would be on an OOM kill, given the absence of any logging of errors that would be returned by the eval loop in order to exit.

@andrewkroh
Member

I'm also thinking OOM killer.

In our demo cluster the ti_abusech integration is in the DEGRADED state, but it is NOT restarting repeatedly. It stays degraded as I would expect (stack version 8.17.4).

I did wonder if the agentless Kubernetes liveness probe could be the cause, but that can be ruled out because it's not the whole Agent process group that is exiting, it's a specific sub-process of the Agent.

@andrewkroh
Member

It is an OOM kill.


I don't think we should hide the degraded state. Instead I think the integration should make it clear via the status message that they are rate-limited and that they need to authenticate to AbuseCH. Auth will become mandatory on June 30, 2025. And maybe we should point them at https://abuse.ch/blog/community-first/ .

@kcreddy
Contributor Author

kcreddy commented May 1, 2025

Adding diagnostics in ECH agentless environment (with debug logs):

curl -L -H 'Authorization: 946cff1075dbe33a' -o 'elastic-agent-diagnostics-2025-05-01T14-40-05Z-00.zip' https://upload.elastic.co/d/1f02a80b7b69442352edce83fe89ccfde285d24eeaef9b217052a275aef08c56
  • The deployment is deleted after collecting diagnostics.
  • malwarebazaar is the datastream that fails with 402 (in last 2 tests).

@kcreddy
Contributor Author

kcreddy commented May 1, 2025

I don't think we should hide the degraded state. Instead I think the integration should make it clear via the status message that they are rate-limited and that they need to authenticate to AbuseCH. Auth will become mandatory on June 30, 2025. And maybe we should point them at https://abuse.ch/blog/community-first/ .

Thanks @andrewkroh. I will update the error.message to indicate this.

But due to the restarts, users will still face the billing issue (indexing spike), just like in the SDH. Is there any way we can handle that?

@kcreddy
Contributor Author

kcreddy commented May 1, 2025

Updated the error.message shown in the Fleet UI to mention the use of the Auth Key and to include the link: 8b1ba7c.
Updated the PR description and commit message.
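
For readers unfamiliar with the shape of the error object, the sketch below mirrors the `error.code`, `error.id` and `error.message` fields visible in the agent logs earlier in this thread and shows where a hint about the Auth Key could be attached. It is written in Python purely for illustration: the real change lives in the integration's CEL program (commit 8b1ba7c, not reproduced here), and the hint wording, the constant `AUTH_KEY_HINT` and the function `build_error_event` are all hypothetical.

```python
# Hypothetical guidance text; the actual wording is defined in the integration.
AUTH_KEY_HINT = (
    "The request was rate limited. Configure an Auth Key to avoid rate limiting. "
    "See https://abuse.ch/blog/community-first/ for details."
)


def build_error_event(status_code, status, method, body):
    """Build an error object with the same field layout as the logged event."""
    message = f"{method}:{body}"
    if status_code == 402:
        # Prepend actionable guidance so it is visible in the status message.
        message = f"{AUTH_KEY_HINT} Response: {message}"
    return {"error": {"code": str(status_code), "id": status, "message": message}}


event = build_error_event(
    402, "402 Payment Required", "POST", '{"query_status": "ratelimited"}'
)
print(event["error"]["message"])
```

Other status codes pass through unchanged, so only the rate-limited case carries the extra guidance.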

@kcreddy kcreddy changed the title ti_abusech: Avoid agent DEGRADED state on API 402. ti_abusech: Update Fleet status message on API 402 May 1, 2025
@elasticmachine

💚 Build Succeeded

History

cc @kcreddy

@andrewkroh andrewkroh added the Integration:tenable_io Tenable Vulnerability Management label May 1, 2025
@efd6
Contributor

efd6 commented May 1, 2025

It is an OOM kill.

Broader topic; we used to see throws for this in the agent logs, but we don't seem to now. Why is this? This is a significant visibility hole.

@kcreddy
Contributor Author

kcreddy commented May 2, 2025

PR: #13760 to increase the memory on the pod, as per @andrewkroh's suggestion in the SDH.
