prov/verbs: Data race in ibv_req_notify_cq with rxm utility provider #10762

Open
piotrchmiel opened this issue Feb 4, 2025 · 1 comment

Describe the bug
A data race occurs when libfabric calls ibv_req_notify_cq while using the prov/verbs provider together with the rxm utility provider.

To Reproduce

  1. Create multiple endpoints in parallel across multiple threads (a minimal sketch follows this list).
  2. Bind each endpoint to rx/tx completion queues and an address vector.
  3. Use the verbs backend with mlx5.
  4. The race appears wherever ibv_req_notify_cq is called, specifically in vrb_cq_open and vrb_cq_close.
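
The original reproducer is not attached; the following is only a minimal sketch of the pattern described above, assuming the layered "verbs;ofi_rxm" provider, FI_EP_RDM endpoints, and CQs opened with a wait object. Thread count, CQ/AV attributes, and the assert-based error handling are illustrative, not taken from the report.

```c
/*
 * Minimal reproduction sketch (not the reporter's original code): each thread
 * opens its own CQs and AV, creates an endpoint on a shared verbs;ofi_rxm
 * domain, binds and enables it.  Error handling is reduced to asserts.
 */
#include <assert.h>
#include <pthread.h>
#include <string.h>

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

#define NUM_THREADS 8

static struct fi_info *info;
static struct fid_fabric *fabric;
static struct fid_domain *domain;

static void *open_ep(void *arg)
{
    struct fi_cq_attr cq_attr = {
        .format = FI_CQ_FORMAT_CONTEXT,
        .wait_obj = FI_WAIT_FD, /* assumed; a wait object is what makes vrb_cq_open arm the CQ */
    };
    struct fi_av_attr av_attr = { .type = FI_AV_TABLE };
    struct fid_cq *txcq, *rxcq;
    struct fid_av *av;
    struct fid_ep *ep;

    assert(fi_cq_open(domain, &cq_attr, &txcq, NULL) == 0);
    assert(fi_cq_open(domain, &cq_attr, &rxcq, NULL) == 0);
    assert(fi_av_open(domain, &av_attr, &av, NULL) == 0);
    assert(fi_endpoint(domain, info, &ep, NULL) == 0);

    assert(fi_ep_bind(ep, &txcq->fid, FI_TRANSMIT) == 0);
    assert(fi_ep_bind(ep, &rxcq->fid, FI_RECV) == 0);
    assert(fi_ep_bind(ep, &av->fid, 0) == 0);

    /* fi_enable() drives rxm_ep_ctrl -> rxm_ep_msg_cq_open -> vrb_cq_open,
     * which calls ibv_create_cq()/ibv_req_notify_cq() (see the Helgrind log). */
    assert(fi_enable(ep) == 0);

    fi_close(&ep->fid);
    fi_close(&av->fid);
    fi_close(&rxcq->fid); /* vrb_cq_close also calls ibv_req_notify_cq */
    fi_close(&txcq->fid);
    return NULL;
}

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    pthread_t threads[NUM_THREADS];

    hints->ep_attr->type = FI_EP_RDM;
    hints->caps = FI_MSG;
    hints->fabric_attr->prov_name = strdup("verbs;ofi_rxm");

    assert(fi_getinfo(FI_VERSION(1, 22), NULL, NULL, 0, hints, &info) == 0);
    assert(fi_fabric(info->fabric_attr, &fabric, NULL) == 0);
    assert(fi_domain(fabric, info, &domain, NULL) == 0);

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, open_ep, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    fi_close(&domain->fid);
    fi_close(&fabric->fid);
    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```

Each thread drives fi_enable() through rxm_ep_ctrl into vrb_cq_open, matching the call chains in the Helgrind output below.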

Expected behavior
No data races or segmentation faults should occur when calling ibv_req_notify_cq.

Output (Valgrind Helgrind Log)

==3139337== ----------------------------------------------------------------
==3139337== Possible data race during write of size 4 at 0x4FD6970 by thread #5
==3139337== Locks held: none
==3139337==    at 0x4FC6FBC: ibv_dontfork_range (memory.c:723)
==3139337==    by 0x7892387: mlx5_alloc_buf (buf.c:555)
==3139337==    by 0x7891B2B: mlx5_alloc_prefered_buf (buf.c:331)
==3139337==    by 0x78BE555: mlx5_alloc_cq_buf (cq.c:1963)
==3139337==    by 0x791F885: create_cq (verbs.c:1063)
==3139337==    by 0x791FF8A: mlx5_create_cq (verbs.c:1194)
==3139337==    by 0x4FCA0E1: ibv_create_cq@@IBVERBS_1.1 (verbs.c:552)
==3139337==    by 0x4923C66: vrb_cq_open (verbs_cq.c:567)
==3139337==    by 0x494C792: fi_cq_open (fi_domain.h:382)
==3139337==    by 0x4951765: rxm_ep_msg_cq_open (rxm_ep.c:1377)
==3139337==    by 0x495214C: rxm_ep_ctrl (rxm_ep.c:1571)
==3139337==    by 0x2FB98C: fi_enable (fi_endpoint.h:226)
==3139337== 
==3139337== This conflicts with a previous write of size 4 by thread #3
==3139337== Locks held: none
==3139337==    at 0x4FC6FBC: ibv_dontfork_range (memory.c:723)
==3139337==    by 0x7892387: mlx5_alloc_buf (buf.c:555)
==3139337==    by 0x7891B2B: mlx5_alloc_prefered_buf (buf.c:331)
==3139337==    by 0x78BE555: mlx5_alloc_cq_buf (cq.c:1963)
==3139337==    by 0x791F885: create_cq (verbs.c:1063)
==3139337==    by 0x791FF8A: mlx5_create_cq (verbs.c:1194)
==3139337==    by 0x4FCA0E1: ibv_create_cq@@IBVERBS_1.1 (verbs.c:552)
==3139337==    by 0x4923C66: vrb_cq_open (verbs_cq.c:567)
==3139337==  Address 0x4fd6970 is 0 bytes inside data symbol "too_late"
==3139337== Possible data race during write of size 8 at 0x487F020 by thread #5
==3139337== Locks held: none
==3139337==    at 0x789246A: mmio_write64_be (mmio.h:173)
==3139337==    by 0x78BDD50: mlx5_arm_cq (cq.c:1755)
==3139337==    by 0x492255A: ibv_req_notify_cq (verbs.h:2887)
==3139337==    by 0x4923D28: vrb_cq_open (verbs_cq.c:576)
==3139337==    by 0x494C792: fi_cq_open (fi_domain.h:382)
==3139337==    by 0x4951765: rxm_ep_msg_cq_open (rxm_ep.c:1377)
==3139337==    by 0x495214C: rxm_ep_ctrl (rxm_ep.c:1571)
==3139337==    by 0x2FB98C: fi_enable (fi_endpoint.h:226)
==3139337== 
==3139337== This conflicts with a previous write of size 8 by thread #3
==3139337== Locks held: none
==3139337==    at 0x789246A: mmio_write64_be (mmio.h:173)
==3139337==    by 0x78BDD50: mlx5_arm_cq (cq.c:1755)
==3139337==    by 0x492255A: ibv_req_notify_cq (verbs.h:2887)
==3139337==    by 0x4923D28: vrb_cq_open (verbs_cq.c:576)
==3139337==    by 0x494C792: fi_cq_open (fi_domain.h:382)
==3139337==    by 0x4951765: rxm_ep_msg_cq_open (rxm_ep.c:1377)
==3139337==    by 0x495214C: rxm_ep_ctrl (rxm_ep.c:1571)
==3139337==    by 0x2FB98C: fi_enable (fi_endpoint.h:226)
==3139337==  Address 0x487f020 is in a -w- mapped file /dev/infiniband/uverbs0 segment

Environment:

  • OS: Ubuntu 22.04
  • RDMA-Core Version: 43
  • Verbs Backend: mlx5
  • Valgrind Version: 3.24
  • Libfabric Version: 1.22

Additional context
The issue was detected using Valgrind with the Helgrind tool. The race leads to non-deterministic segmentation faults and illegal-instruction crashes that appear roughly once every few hundred runs.
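(For reference, the reports above come from a Helgrind run along the lines of `valgrind --tool=helgrind ./repro`; the reproducer binary name here is illustrative, not from the original report.)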

piotrchmiel added the bug label Feb 4, 2025
shefty (Member) commented Feb 11, 2025

This looks like a problem in the libibverbs provider implementation, not libfabric. I think this needs to be reported to rdma-core.

It's not obvious to me how the above possible data races would result in a seg fault.
