Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CAN] Control callback crashes when using thingset_import_data #57

Open
jalinei opened this issue Feb 10, 2025 · 13 comments
Open

[CAN] Control callback crashes when using thingset_import_data #57

jalinei opened this issue Feb 10, 2025 · 13 comments

Comments

@jalinei
Copy link
Contributor

jalinei commented Feb 10, 2025

Hi, it's me again 😄

I'm currently troubleshooting CAN control broadcasting.
My current setup is the following : Two nodes are connected, one is broadcasting one data object with object-id 0x8001, I'm listening the bus with the Peak dongle.
The nodes have a can_control_rx_handler that go through these control messages with ids > 8000

I'm facing what I do believe is a bug :
Using thingset_import_data in control report callback leads to a general crash, without any LOG_ERR
or LOG of any kind. The MCU gets fully stuck without outputting any fatal reporting.

I'm using the sample you've done when we were working on the Vhelio e-bike.

jalinei/Core@6dccaba

I've commented out the culprit function and I've added HEXDUMP to check the frame before using thingset_import_data.
I'm posting the HEXDUMP output below, it seems valid to me.

[00:00:33.412,000] <inf> ts_can: Thingset frame:
                                 a1 19 80 01 fa 00 00 00  00                      |........ .       
[00:00:34.412,000] <inf> ts_can: received control msg with id 0x8001 from addr 0xC6
[00:00:34.412,000] <inf> ts_can: Thingset frame:
                                 a1 19 80 01 fa 3a 83 12  6f 

Here is the corresponding candump snippet, everything looks sane as well.

  can0  0E8001C6   [5]  FA 00 00 00 00
  can0  0E8001C6   [5]  FA 3A 83 12 6F

I made sure that the data item is writtable in my data_object :

THINGSET_ADD_ITEM_FLOAT(ID_CTRL, ID_CTRL_REFERENCE, "zCtrlReference",
                        &reference_value, 1, THINGSET_ANY_RW, SUBSET_CTRL);

At first I was having a different setup in which both nodes where broadcasting the same 0x8001 data objct, so I was thinking that it was some kind of race conditions to access the data-object.
I simplified the setup so only one node is broadcasting and the second is listening, and I still have the same issue. I works perfectly without calling thingset_import_data but crashes when calling it.

Any clue in what direction to look ?

@martinjaeger
Copy link
Contributor

Yeah, the data does indeed look correct.

I've just double-checked if acquisition of the ThingSet context lock might have failed, leading to a dead lock. But that should have printed an error message.

Did you double-check if the stack size of the calling thread is large enough?

@jalinei
Copy link
Contributor Author

jalinei commented Feb 10, 2025

I've tried passing CONFIG_MAIN_STACK_SIZE from 2048 to 8192, no changes.
In this case the calling thread is the main thread right ? As the callback is set using SYS_INIT (just want to make sure).

@martinjaeger
Copy link
Contributor

In this case the calling thread is the main thread right ? As the callback is set using SYS_INIT (just want to make sure).

As far as I can see it's offloaded to the system work queue, so the stack would be CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE.

@jalinei
Copy link
Contributor Author

jalinei commented Feb 11, 2025

Ok, my bad, I've tried this morning with CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=8192.
Same behavior unfortunately..

@martinjaeger
Copy link
Contributor

Ok. Don't have another idea at the moment, unfortunately, except for trying to step through the function with a debugger and see where it crashes.

@jalinei
Copy link
Contributor Author

jalinei commented Feb 12, 2025

I gave the debugger a chance today. I got the following log:

[00:00:07.497,000] <inf> ts_can: Thingset frame:
                                 a1 19 80 01 fa 00 00 00  00                      |........ .       
ASSERTION FAIL [((arch_is_in_isr() == 0) || ((timeout).ticks == (((k_timeout_t) {0})).ticks))] @ ZEPHYR_BASE/kernel/sem.c:136

[00:00:07.497,000] <err> os: r0/a1:  0x00000004  r1/a2:  0x00000088  r2/a3:  0x0000000c
[00:00:07.497,000] <err> os: r3/a4:  0x00000004 r12/ip:  0x000124da r14/lr:  0x0802a63b
[00:00:07.497,000] <err> os:  xpsr:  0x21000067
[00:00:07.497,000] <err> os: s[ 0]:  0x00000005  s[ 1]:  0x200086d0  s[ 2]:  0x200086d8  s[ 3]:  0x20003a2c
[00:00:07.497,000] <err> os: s[ 4]:  0x200086e4  s[ 5]:  0x00000009  s[ 6]:  0x20003a34  s[ 7]:  0x08023609
[00:00:07.497,000] <err> os: s[ 8]:  0x000000f0  s[ 9]:  0x00000003  s[10]:  0x00000005  s[11]:  0x200086d0
[00:00:07.497,000] <err> os: s[12]:  0x200086d8  s[13]:  0x0803f16c  s[14]:  0x000000b0  s[15]:  0x080251f1
[00:00:07.497,000] <err> os: fpscr:  0x00000003
[00:00:07.497,000] <err> os: Faulting instruction address (r15/pc): 0x08032c4e
[00:00:07.497,000] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
[00:00:07.497,000] <err> os: Fault during interrupt handling

[00:00:07.497,000] <err> os: Current thread: 0x20002748 (idle)
[00:00:07.536,000] <err> os: Halting system

It seems to fail because of the k_sem_take

    if (k_sem_take(&ts->lock, K_MSEC(THINGSET_CONTEXT_LOCK_TIMEOUT_MS)) != 0) {
        LOG_ERR("ThingSet context lock timed out");
        return -THINGSET_ERR_INTERNAL_SERVER_ERR;
    }

I'm not sure why though.. I tried increasing the THINGSET_CONTEXT_LOCK_TIMEOUT_MS to 10000, same issue.

@jalinei
Copy link
Contributor Author

jalinei commented Feb 13, 2025

I'll pass the thingset_import_data function call to a kwork item.

My doubts below :

I'm not sure how to properly "fix" this. Should we add this kwork in the SDK directly ? This would be neat as the work item could be sent to the thingset workqueue instead of the system one.

But then how to define user based logic to set software level filtering on the incoming can control messages ?

Is it okay to introduce timing uncertainty (because of kwork) when receiving RX control messages ? Ideally we would like to make sure that they are dealt with ASAP.

@martinjaeger
Copy link
Contributor

If you look into your log output, the following line gives some hints about what's wrong:

ASSERTION FAIL [((arch_is_in_isr() == 0) || ((timeout).ticks == (((k_timeout_t) {0})).ticks))] @ ZEPHYR_BASE/kernel/sem.c:136

Looks like the function is called from ISR context and the below assertion fails:

https://github.com/zephyrproject-rtos/zephyr/blob/c387fcae17eb8cef274e782acd817bdaa5fe2613/kernel/sem.c#L136-L137

Are you sure it's called from a kwork item?

@jalinei
Copy link
Contributor Author

jalinei commented Feb 13, 2025

If you look into your log output, the following line gives some hints about what's wrong:

ASSERTION FAIL [((arch_is_in_isr() == 0) || ((timeout).ticks == (((k_timeout_t) {0})).ticks))] @ ZEPHYR_BASE/kernel/sem.c:136

Looks like the function is called from ISR context and the below assertion fails:

https://github.com/zephyrproject-rtos/zephyr/blob/c387fcae17eb8cef274e782acd817bdaa5fe2613/kernel/sem.c#L136-L137

Are you sure it's called from a kwork item?

The cause of the crash is indeed because thingset_import_data was called directly in the callback. (see commit below).
jalinei/Core@6dccaba

@martinjaeger
Copy link
Contributor

Ok, so it should work if you offload it to a work item.

@jalinei
Copy link
Contributor Author

jalinei commented Feb 13, 2025

Got it working with a work item : jalinei/Core@ae1983f

As an enhancement I suggest to add a warning in the doxygen comment of thingset_import_data so that user is warned not to use it directly in a callback.

Additional considerations :

  • Shouldn't this work item be be sent to the thingset workqueue instead of the system workqueue ?

  • Is it okay to introduce timing uncertainty (because of kwork) when receiving RX control messages ? Ideally we would like to make sure that they are dealt with ASAP.

@martinjaeger
Copy link
Contributor

  • Shouldn't this work item be be sent to the thingset workqueue instead of the system workqueue ?

I guess it doesn't make a big difference.

  • Is it okay to introduce timing uncertainty (because of kwork) when receiving RX control messages ? Ideally we would like to make sure that they are dealt with ASAP.

Yeah, it would introduce a small delay. We could instead make the timeout configurable and set it to K_NO_WAIT. In that case it's allowed to be called from an ISR. It means that some values may get ignored, though, as the function would return with an error immediately if the context is not available. We need to decide what is better: Sometimes write the values with a small delay or sometimes omit some values?

@jalinei
Copy link
Contributor Author

jalinei commented Feb 13, 2025

Hehe classical Real Time tradeoffs nightmare.

In practice, not being able to lock ts means that the data_object is being used to receive or send a message through another physical bus right ?

Obviously it depends on the use-case, but IMO it is a good practice to have a way of knowing when incoming control reference has been delayed, in order to be able to reject it for application in which it could not be okay being 800ms late..

A potential middleground, would be to have by default as K_NO_WAIT allowing use in regular callback, (and intrinsically make sure data is fresh) and add yet another config that make it compatible with IoT usecases for which 800ms is nothing you are really afraid of..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants