
Production - [Alerting] Helix API Average Response Time #4873

Closed
dotnet-eng-status bot opened this issue Jan 31, 2025 · 11 comments
Assignees
Labels
Critical (Grafana Alert Issues opened by Grafana) · Inactive Alert (Issues from Grafana alerts that are now "OK") · Ops - First Responder · Production (Tied to the Production environment, as opposed to Staging)

Comments

@dotnet-eng-status

💔 Metric state changed to alerting

Helix API Average Response Time is high!

  • Server response time 16022.291070121664

Go to rule

@dotnet/dnceng, @dotnet/prodconsvcs, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-24cae10d9eca44079e7cf3d47f148497

@dotnet-eng-status dotnet-eng-status bot added the Active Alert, Critical, Ops - First Responder, and Production labels and removed the Active Alert label Jan 31, 2025

💚 Metric state changed to ok

Helix API Average Response Time is high!

Go to rule

@dotnet-eng-status dotnet-eng-status bot added the Inactive Alert label Jan 31, 2025

💚 Metric state changed to ok

Helix API Average Response Time is high!

Go to rule

@meghnave meghnave self-assigned this Feb 1, 2025
@meghnave meghnave marked this as a duplicate of #4872 Feb 1, 2025
@dotnet-eng-status dotnet-eng-status bot added the Active Alert label and removed the Inactive Alert label Feb 1, 2025

💔 Metric state changed to alerting

Helix API Average Response Time is high!

  • Server response time 8242.936636888913

Go to rule


💔 Metric state changed to alerting

Helix API Average Response Time is high!

  • Server response time 7807.1296268722745

Go to rule

@dotnet-eng-status dotnet-eng-status bot added the Inactive Alert label and removed the Active Alert label Feb 1, 2025

💚 Metric state changed to ok

Helix API Average Response Time is high!

Go to rule


💚 Metric state changed to ok

Helix API Average Response Time is high!

Go to rule

@dotnet-eng-status dotnet-eng-status bot added the Active Alert label and removed the Inactive Alert label Feb 1, 2025

💔 Metric state changed to alerting

Helix API Average Response Time is high!

  • Server response time 10231.686302483524

Go to rule


💔 Metric state changed to alerting

Helix API Average Response Time is high!

  • Server response time 10221.317338307059

Go to rule

@dotnet-eng-status dotnet-eng-status bot added the Inactive Alert label and removed the Active Alert label Feb 1, 2025

💚 Metric state changed to ok

Helix API Average Response Time is high!

Go to rule


💚 Metric state changed to ok

Helix API Average Response Time is high!

Go to rule

@meghnave

meghnave commented Feb 3, 2025

The alert originally triggered due to a deployment. I looked at the subsequent instances and nothing seems very concerning: the API calls that took long are the ones that generally run on the higher side anyway.
The only thing that stands out, and is worth keeping an eye on in the future, is GET Job/2019-06-17/PassFail [job/version]: its 50th-percentile response time was 8 seconds during the last spike (all of the calls succeeded, fwiw), whereas over the last 10 days the same value is 8 ms. So it may be something to dig into further if the pattern changes.
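The percentile comparison described above (spike-window p50 vs. a longer baseline p50) can be sketched as a simple check. This is a hypothetical illustration, not the actual telemetry query; the endpoint name, sample values, and ratio threshold are placeholders mirroring the numbers in the comment:

```python
# Hypothetical sketch: flag endpoints whose 50th-percentile response time
# during a spike window is far above their baseline p50. The data and the
# ratio threshold below are illustrative only.
from statistics import median

def flag_regressions(baseline, spike, ratio_threshold=100):
    """Both inputs map endpoint -> list of response times in ms.
    Returns endpoints whose spike p50 exceeds baseline p50 by more
    than ratio_threshold."""
    flagged = []
    for endpoint, times in spike.items():
        base_p50 = median(baseline.get(endpoint, [0]))
        if base_p50 > 0 and median(times) / base_p50 > ratio_threshold:
            flagged.append(endpoint)
    return flagged

# Numbers mirroring the comment: baseline p50 ~8 ms, spike p50 ~8000 ms.
baseline = {"GET Job/2019-06-17/PassFail": [7, 8, 9]}
spike = {"GET Job/2019-06-17/PassFail": [7900, 8000, 8100]}
print(flag_regressions(baseline, spike))  # prints ['GET Job/2019-06-17/PassFail']
```

A ratio check like this tolerates endpoints that are always slow (their baseline is already high) while catching the 1000x jump the comment describes.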

@meghnave meghnave closed this as completed Feb 3, 2025