Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix repeat flag doesn't terminate monika on fail probe #1165

Merged
merged 13 commits into from
Nov 10, 2023
Merged

Conversation

nicnocquee
Copy link
Member

@nicnocquee nicnocquee commented Nov 7, 2023

Monika Pull Request (PR)

What feature/issue does this PR add

  1. Repeat 1 doesn't exit when the probe is not accessible #1164

How did you implement / how did you fix it

There are several bugs that causes the issue #1164

  1. The setProbeRunning is called almost every second in this line https://github.com/hyperjumptech/monika/pull/1165/files#diff-c3cb2be3e6fbad9e8455e2fc282a2d52cc7875688162e517023bff5be26371b6L54. The fix: moved it back down as originally written.

  2. The retry function is called without maxAttempt here https://github.com/hyperjumptech/monika/pull/1165/files#diff-c3cb2be3e6fbad9e8455e2fc282a2d52cc7875688162e517023bff5be26371b6R81. This causes Monika to keep retrying to monitor the probe in a cycle which causes the bug in Repeat 1 doesn't exit when the probe is not accessible #1164. The fix: added maxAttempt with value whichever bigger between incidentThreshold and recoveryThreshold because we want to retry for both incident and recovery cases. See below for more information.

  3. Removal of recoveryThreshold causes the recovery notification to be sent immediately upon recovery. But we don't want that. Just like we don't want to immediately send incident notification, Monika needs to retry several times first before sending the recovery notification. The fix:

    1. restore the recoveryThreshold and remove the deprecation warning https://github.com/hyperjumptech/monika/pull/1165/files#diff-028bd48d9206d28afc908d305ef5ac57026ad51ba014a0741fb8abcaef2af741R113. I think there was misunderstanding, we just don't want user to be able to set the threshold from symon, but we still need it in Monika.
    2. don't immediately send the recovery notification, instead keep retrying until recoveryThreshold. https://github.com/hyperjumptech/monika/pull/1165/files#diff-537950d0505c20b7b3ec67590102fdafdbbbbddb0130d5dbf7a4ac232060598dR152

Some further changes to improve performance:

  1. Set the default retryInitialDelayMs value to 2 seconds. 128 milliseconds is not a sensible default. There's no need for Monika to retry that fast. https://github.com/hyperjumptech/monika/pull/1165/files#diff-06df4a808fa037b48c082497d352cb7825c883648d52a17ab72b0e7dd70a5c5dR84
  2. Remove unnecessary error throws, e.g., https://github.com/hyperjumptech/monika/pull/1165/files#diff-5d5a93a3e7827f41dca4b896d385bc9ab744774dd26149e2f605fbe064398021L134

Probing Mechanism

With this PR, I also fixed the probing mechanism. This is how it works.

Say we have the following probe:

probes:
  - id: '1'
    name: Localhost
    description: Check status
    interval: 5
    incidentThreshold: 3
    recoveryThreshold: 3
    requests:
      - method: GET
        timeout: 10000
        url: http://localhost:8000/api/delay
    alerts:
      - assertion: response.time > 800
        message: Response time > 800

Then say the server in the localhost:8000 is offline (unreachable). The process will go like the following:

  1. In the first cycle, Monika will try to probe it but fails. Monika will not send the incident notification immediately. Instead, it will retry 3 times (based on the incidentThreshold) with exponential backoff algorithm.
  2. If in the second retry, the server is up, then it will do nothing and just wait for 5 seconds (based on the interval) to start another cycle.
  3. On the other hand, if after 3 attempts the server is still unreachable, it will send the incident notification, and mark the probe as has incident. Then it will wait until 5 seconds (interval) has passed to probe it again.
  4. In the next cycle, if the server is online, Monika will not send the recovery notification immediately because the probe was marked as incident previously. Instead, it will retry to monitor the probe 3 times (according to recoveryThreshold) before sending the recovery notification.

The same process is also executed when the probe can reached but the assertion fails which in this case when response's time is greater than 800 ms.

When the -r flag is provided, Monika will perform as many cycles as defined by the repeat flag value. However, it will also still do retries in case of incident and recovery as defined by the probes' incidentThreshold and recoveryThreshold if provided, or use their respective default values.

How to test

  1. Use the following monika.yml:
probes:
  - id: '1'
    name: Localhost
    description: Multiple
    requests:
      - url: http://localhost:3000
  1. npm run start -- -c ./monika.yml -r 2

Screenshots

Running with -r 1 and the probe cannot be reached (offline):

Screenshot 2023-11-08 at 13 24 28

Running with -r 2 and the probe cannot be reached (offline):

Screenshot 2023-11-08 at 13 24 40

Running with -r 1 and the probe can be reached (online):

Screenshot 2023-11-08 at 13 27 32

Running without -r flag (continuously):

Screenshot 2023-11-08 at 13 27 11

Copy link

codecov bot commented Nov 7, 2023

Codecov Report

Merging #1165 (a847d7b) into main (a9316c6) will decrease coverage by 0.82%.
The diff coverage is 78.00%.

❗ Current head a847d7b differs from pull request most recent head 4342e36. Consider uploading reports for the commit 4342e36 to get more accurate results

@@            Coverage Diff             @@
##             main    #1165      +/-   ##
==========================================
- Coverage   61.88%   61.06%   -0.82%     
==========================================
  Files         105      106       +1     
  Lines        3130     3136       +6     
  Branches      536      535       -1     
==========================================
- Hits         1937     1915      -22     
- Misses       1015     1035      +20     
- Partials      178      186       +8     
Files Coverage Δ
...ents/config/validation/validator/default-values.ts 100.00% <100.00%> (ø)
...rc/components/config/validation/validator/probe.ts 86.90% <100.00%> (+0.15%) ⬆️
src/components/probe/index.ts 82.85% <100.00%> (+2.85%) ⬆️
src/flag.ts 100.00% <ø> (ø)
src/looper/index.ts 76.00% <100.00%> (+0.48%) ⬆️
src/components/probe/prober/http/index.ts 72.22% <77.77%> (-12.63%) ⬇️
src/commands/monika.ts 54.62% <57.14%> (+1.95%) ⬆️
src/components/probe/prober/index.ts 64.28% <68.42%> (-17.27%) ⬇️

... and 2 files with indirect coverage changes

📣 Codecov offers a browser extension for seamless coverage viewing on GitHub. Try it in Chrome or Firefox today!

@nico-hyperjump nico-hyperjump marked this pull request as ready for review November 8, 2023 14:23
@dennypradipta
Copy link
Contributor

dennypradipta commented Nov 9, 2023

Okay based on the tests you gave me:

First case
npm run start -- -c monika.yml -r 1
I can't seem to reproduce the first case where the config is monika.yml with -r 1 flag. I still have the incident threshold as 5 instead of 2:
image

But after the probe is marked as incident Monika stopped itself though. Was it intended or not?

Second case
npm run start -- -c monika.yml -r 2
image
Same as the first case, but got the same results as you: ended after ECONNREFUSED

Third case
npm run start -- -c ../monika-alert-simulator/examples/monika-sample-1-delay.yml -r 1
Works fine.

Fourth case
npm run start -- -c ../monika-alert-simulator/examples/monika-sample-1-delay.yml
Unexpectedly crashed because of throw new Error
image

Additional case
npm run start -- --symonKey= --symonUrl=http://localhost:3000 --retryInitialDelayMs=150000 --retryMaxDelayMs=1500000 --symonMonikaId= --symonReportInterval=10000
Doesn't seem to break Symon mode.
image

@nicnocquee
Copy link
Member Author

The first and second cases are by design. The fourth case should be fixed now.

image

Copy link
Contributor

@dennypradipta dennypradipta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@sapiderman sapiderman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worked for me. Ready to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants