Production - [Alerting] Apple simulator failure rate alert #4869

Closed
dotnet-eng-status bot opened this issue Jan 31, 2025 · 4 comments
Labels: Critical, Grafana Alert, Inactive Alert, Ops - First Responder, Production

Comments

@dotnet-eng-status

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

  • FailureRate {Machine=dci-mac-build-132} 100

Go to rule

@dotnet/dnceng, @dotnet/prodconsvcs, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-36d07fceeaf0472b804d8358b2198eac

@dotnet-eng-status dotnet-eng-status bot added the Active Alert, Critical, Grafana Alert, Ops - First Responder, Production, and Inactive Alert labels and removed the Active Alert label Jan 31, 2025
@dotnet-eng-status

💚 Metric state changed to ok

@dotnet-eng-status dotnet-eng-status bot added the Active Alert label and removed the Inactive Alert label Jan 31, 2025
@dotnet-eng-status

💔 Metric state changed to alerting

  • FailureRate {Machine=dci-mac-build-143} 85

@dotnet-eng-status dotnet-eng-status bot added the Inactive Alert label and removed the Active Alert label Jan 31, 2025
@dotnet-eng-status

💚 Metric state changed to ok

@meghnave meghnave self-assigned this Feb 3, 2025

meghnave commented Feb 4, 2025

I was using this issue to understand the alert better and to see whether we could improve it. I was missing the 12h summary in the KQL and thought we could fix it so that we don't get the alert and then, almost immediately afterwards, an 'ok'. But maybe that's inevitable given that we are binning into 12h chunks (I will find out offline why). We are definitely not missing data, so I'm closing this.
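For anyone reading later, here is a rough Python sketch of what binning into 12h chunks does to a per-machine failure rate. It is not the alert's actual KQL: the row shape `(timestamp, machine, failed)` and the helper names `bin_start` and `failure_rates` are assumptions made up for illustration, and whether the rule only looks at the most recent bin is also an assumption.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Hypothetical constants: a 12h bin width, anchored at the Unix epoch.
BIN = timedelta(hours=12)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bin_start(ts: datetime) -> datetime:
    """Floor a timezone-aware timestamp to the start of its 12h bin."""
    return EPOCH + ((ts - EPOCH) // BIN) * BIN


def failure_rates(runs):
    """Map (machine, bin start) to failure rate in percent.

    `runs` is an iterable of (timestamp, machine, failed) rows; this row
    shape is an assumption, not the real telemetry schema.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for ts, machine, failed in runs:
        key = (machine, bin_start(ts))
        totals[key] += 1
        failures[key] += int(failed)
    return {key: 100.0 * failures[key] / totals[key] for key in totals}
```

With rows like these, two failures shortly before a bin boundary and one pass shortly after it land in different bins, so the same machine reads 100 in one bin and 0 in the next. If the rule evaluates the latest bin, that would look like an alert followed quickly by an 'ok' even though no data is missing.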

@meghnave meghnave closed this as completed Feb 4, 2025