Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent_state_summary plan - null report_timestamp for node(s) fails the plan #236

Closed
taikaa opened this issue Sep 27, 2024 · 3 comments
Closed
Labels
bug Something isn't working

Comments

@taikaa
Copy link
Contributor

taikaa commented Sep 27, 2024

Describe the Bug

If a node returned in the PuppetDB query has a null report_timestamp, the plan fails at line 23

Plan output

{
  "msg" : "The function 'new_timestamp' was called with arguments it does not accept. It expects one of:\n  ()\n    rejected: expects no arguments, got 1\n  (Variant[Integer, Float] seconds)\n    rejected: parameter 'seconds' expects a value of type Integer or Float, got Undef\n  (String[1] string, Formats = Variant[String[2], Array[String[2], 1]] format?, String[1] timezone?)\n    rejected: parameter 'string' expects a String value, got Undef\n  (Struct[{'string' => String[1], Optional['format'] => Formats, Optional['timezone'] => String[1]}] hash_arg)\n    rejected: parameter 'hash_arg' expects a Struct value, got Undef",
  "kind" : "bolt/pal-error",
  "details" : {
    "file" : "/opt/puppetlabs/server/data/orchestration-services/code/environments/production/modules/pe_status_check/plans/agent_state_summary.pp",
    "line" : 23,
    "column" : 30
  }
}

Expected Behavior

A node with a null report_timestamp shouldn't fail the plan. Since the plan operationalizes unhealthy nodes as "latest reports that are older than runinterval", the node's with a null report_timestamp can be added to the unhealthy list.

Steps to Reproduce

Provision a node and request a cert. Submit a node entry to PuppetDB without submitting a report

puppet ssl submit_request
puppet ssl download_cert
puppet facts upload

Query PuppetDB to confirm that report_timestamp is null

puppet query nodes[certname,report_timestamp]{}

Then, run the plan pe_status_check::agent_state_summary

Environment

  • Version 4.4.0

Additional Context

Another edge case that would result in a null report_timestamp is if a resource that contains a null byte preventing PuppetDB from ever storing a report from a node

@taikaa taikaa added the bug Something isn't working label Sep 27, 2024
@MartyEwings
Copy link
Collaborator

@bastelfreak --^

@bastelfreak
Copy link
Contributor

ah nice, I will try to take a look this week.

bastelfreak added a commit to bastelfreak/puppetlabs-pe_status_check that referenced this issue Sep 30, 2024
…nhealthy

It's possible that a Puppet Agent was stopped or disabled and all old
reports were garbage collected from PuppetDB. The node still exists in
PuppetDB, but when checking for a report the timestamp is null:

```
```

```json
[
  {
    "certname": "pe.tim.local",
    "report_timestamp": "2024-09-30T13:21:17.042Z"
  },
  {
    "certname": "pe2.tim.local",
    "report_timestamp": null
  }
]
```

Previously we always assumed that `report_timestamp` has a valid
timestamp. With this patch we explicitly validate the timestamp and
count nodes withhout a timestamp as unhealthy.
bastelfreak added a commit to bastelfreak/puppetlabs-pe_status_check that referenced this issue Sep 30, 2024
…is null

The attribute `cached_catalog_status` on the `/nodes` endpoint will be
`null` for nodes without a catalog in PuppetDB:

```
puppet query 'nodes[certname,latest_report_noop,latest_report_corrective_change,cached_catalog_status,latest_report_status,report_timestamp]{}'
```

```
[
  {
    "cached_catalog_status": null,
    "certname": "pe2.tim.betadots.training",
    "latest_report_corrective_change": null,
    "latest_report_noop": null,
    "latest_report_status": null,
    "report_timestamp": null
  }
]
```
bastelfreak added a commit to bastelfreak/puppetlabs-pe_status_check that referenced this issue Sep 30, 2024
…nhealthy

It's possible that a Puppet Agent was stopped or disabled and all old
reports were garbage collected from PuppetDB. The node still exists in
PuppetDB, but when checking for a report the timestamp is null:

```
puppet query nodes[certname,report_timestamp]{}
```

```json
[
  {
    "certname": "pe.tim.local",
    "report_timestamp": "2024-09-30T13:21:17.042Z"
  },
  {
    "certname": "pe2.tim.local",
    "report_timestamp": null
  }
]
```

Previously we always assumed that `report_timestamp` has a valid
timestamp. With this patch we explicitly validate the timestamp and
count nodes withhout a timestamp as unhealthy.

Now with the fix:

```
puppet plan run pe_status_check::agent_state_summary --environment peadm log_healthy_nodes=true log_unhealthy_nodes=true
```

```json
{
    "responsive": [
        "pe.tim.local",
        "pe2.tim.local"
    ],
    "healthy_counter": 0,
    "total_counter": 2,
    "unhealthy_counter": 2,
    "noop": [],
    "unhealthy": [
        "pe2.tim.local",
        "pe.tim.local"
    ],
    "healthy": [],
    "changed": [
        "pe.tim.local"
    ],
    "no_report": [
        "pe.tim.local"
    ],
    "corrective_changes": [],
    "used_cached_catalog": [
        "pe2.tim.local"
    ],
    "unresponsive": [],
    "failed": []
}
```
@bastelfreak
Copy link
Contributor

@taikaa thanks for using the plan. Great to see that other can benefit from it. I raised #238 , can you maybe test it?

bastelfreak added a commit to bastelfreak/puppetlabs-pe_status_check that referenced this issue Oct 7, 2024
…nhealthy

It's possible that a Puppet Agent was stopped or disabled and all old
reports were garbage collected from PuppetDB. The node still exists in
PuppetDB, but when checking for a report the timestamp is null:

```
puppet query nodes[certname,report_timestamp]{}
```

```json
[
  {
    "certname": "pe.tim.local",
    "report_timestamp": "2024-09-30T13:21:17.042Z"
  },
  {
    "certname": "pe2.tim.local",
    "report_timestamp": null
  }
]
```

Previously we always assumed that `report_timestamp` has a valid
timestamp. With this patch we explicitly validate the timestamp and
count nodes withhout a timestamp as unhealthy.

Now with the fix:

```
puppet plan run pe_status_check::agent_state_summary --environment peadm log_healthy_nodes=true log_unhealthy_nodes=true
```

```json
{
    "responsive": [
        "pe.tim.local",
        "pe2.tim.local"
    ],
    "healthy_counter": 0,
    "total_counter": 2,
    "unhealthy_counter": 2,
    "noop": [],
    "unhealthy": [
        "pe2.tim.local",
        "pe.tim.local"
    ],
    "healthy": [],
    "changed": [
        "pe.tim.local"
    ],
    "no_report": [
        "pe.tim.local"
    ],
    "corrective_changes": [],
    "used_cached_catalog": [
        "pe2.tim.local"
    ],
    "unresponsive": [],
    "failed": []
}
```
bastelfreak added a commit to bastelfreak/puppetlabs-pe_status_check that referenced this issue Oct 7, 2024
…nhealthy

It's possible that a Puppet Agent was stopped or disabled and all old
reports were garbage collected from PuppetDB. The node still exists in
PuppetDB, but when checking for a report the timestamp is null:

```
puppet query nodes[certname,report_timestamp]{}
```

```json
[
  {
    "certname": "pe.tim.local",
    "report_timestamp": "2024-09-30T13:21:17.042Z"
  },
  {
    "certname": "pe2.tim.local",
    "report_timestamp": null
  }
]
```

Previously we always assumed that `report_timestamp` has a valid
timestamp. With this patch we explicitly validate the timestamp and
count nodes withhout a timestamp as unhealthy.

Now with the fix:

```
puppet plan run pe_status_check::agent_state_summary --environment peadm log_healthy_nodes=true log_unhealthy_nodes=true
```

```json
{
    "responsive": [
        "pe.tim.local",
        "pe2.tim.local"
    ],
    "healthy_counter": 0,
    "total_counter": 2,
    "unhealthy_counter": 2,
    "noop": [],
    "unhealthy": [
        "pe2.tim.local",
        "pe.tim.local"
    ],
    "healthy": [],
    "changed": [
        "pe.tim.local"
    ],
    "no_report": [
        "pe.tim.local"
    ],
    "corrective_changes": [],
    "used_cached_catalog": [
        "pe2.tim.local"
    ],
    "unresponsive": [],
    "failed": []
}
```
MartyEwings added a commit that referenced this issue Nov 18, 2024
(#236) agent_state_summary: Count nodes without report as unhealthy
MartyEwings added a commit that referenced this issue Nov 18, 2024
(#236) agent_state_summary: Check if cached_catalog_status is null
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants