Describe the bug
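Helgrind reports a data race inside ibv_req_notify_cq when libfabric uses the verbs provider (prov/verbs) together with the rxm utility provider. The log also shows a second race in ibv_dontfork_range, reached through ibv_create_cq on the same vrb_cq_open path.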
To Reproduce
Create multiple endpoints in parallel across multiple threads.
Bind each endpoint to an rx/tx completion queue and an address vector.
Use the verbs backend with mlx5.
The race is reported at every call site of ibv_req_notify_cq, namely vrb_cq_open and vrb_cq_close. The log included here covers only the vrb_cq_open path, reached through fi_enable.
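The threading pattern in the steps above can be sketched roughly as follows. This is only a skeleton, not the actual reproducer: `setup_endpoint_stub`, `run_workers`, `setup_lock`, and `NUM_THREADS` are hypothetical names, and the stub stands in for the real libfabric calls (fi_endpoint, fi_ep_bind, fi_enable) so that the example compiles without libfabric. The `serialize` flag is an assumption added for triage: it puts a mutex around the setup call so the parallel and serialized cases can be compared under Helgrind.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4 /* hypothetical thread count */

/* Hypothetical guard: serializes the setup path when enabled. */
static pthread_mutex_t setup_lock = PTHREAD_MUTEX_INITIALIZER;

struct worker {
    int id;        /* worker index, used only for bookkeeping */
    int serialize; /* non-zero: hold setup_lock around the setup call */
    int done;      /* set to 1 once this worker's setup has returned */
};

/*
 * Hypothetical stand-in for the per-thread setup from the report. In a real
 * reproducer this is where fi_endpoint(), fi_ep_bind() for the CQ and AV, and
 * fi_enable() would be called; fi_enable() is the call that reaches
 * vrb_cq_open() in the Helgrind traces.
 */
static void setup_endpoint_stub(struct worker *w)
{
    w->done = 1; /* each worker writes only its own slot, so no race here */
}

static void *worker_main(void *arg)
{
    struct worker *w = arg;

    if (w->serialize)
        pthread_mutex_lock(&setup_lock);
    setup_endpoint_stub(w);
    if (w->serialize)
        pthread_mutex_unlock(&setup_lock);
    return NULL;
}

/* Starts NUM_THREADS workers and returns how many completed their setup. */
int run_workers(int serialize)
{
    pthread_t tids[NUM_THREADS];
    struct worker workers[NUM_THREADS];
    int completed = 0;

    for (int i = 0; i < NUM_THREADS; i++) {
        workers[i] = (struct worker){ .id = i, .serialize = serialize, .done = 0 };
        int rc = pthread_create(&tids[i], NULL, worker_main, &workers[i]);
        assert(rc == 0);
        (void)rc; /* keep rc "used" when assertions are compiled out */
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(tids[i], NULL);
        completed += workers[i].done;
    }
    return completed;
}
```

Calling `run_workers(0)` models the parallel case from the report, and `run_workers(1)` the serialized one; build with `-pthread`. If the Helgrind reports disappear in a real reproducer once setup is serialized, that would point at the concurrent CQ open path, but this has not been verified here and a lock in the application is not a fix for the underlying race.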
Expected behavior
No data races or segmentation faults should occur when calling ibv_req_notify_cq.
Output (Valgrind Helgrind Log)
==3139337== ----------------------------------------------------------------
==3139337== Possible data race during write of size 4 at 0x4FD6970 by thread #5
==3139337== Locks held: none
==3139337== at 0x4FC6FBC: ibv_dontfork_range (memory.c:723)
==3139337== by 0x7892387: mlx5_alloc_buf (buf.c:555)
==3139337== by 0x7891B2B: mlx5_alloc_prefered_buf (buf.c:331)
==3139337== by 0x78BE555: mlx5_alloc_cq_buf (cq.c:1963)
==3139337== by 0x791F885: create_cq (verbs.c:1063)
==3139337== by 0x791FF8A: mlx5_create_cq (verbs.c:1194)
==3139337== by 0x4FCA0E1: ibv_create_cq@@IBVERBS_1.1 (verbs.c:552)
==3139337== by 0x4923C66: vrb_cq_open (verbs_cq.c:567)
==3139337== by 0x494C792: fi_cq_open (fi_domain.h:382)
==3139337== by 0x4951765: rxm_ep_msg_cq_open (rxm_ep.c:1377)
==3139337== by 0x495214C: rxm_ep_ctrl (rxm_ep.c:1571)
==3139337== by 0x2FB98C: fi_enable (fi_endpoint.h:226)
==3139337==
==3139337== This conflicts with a previous write of size 4 by thread #3
==3139337== Locks held: none
==3139337== at 0x4FC6FBC: ibv_dontfork_range (memory.c:723)
==3139337== by 0x7892387: mlx5_alloc_buf (buf.c:555)
==3139337== by 0x7891B2B: mlx5_alloc_prefered_buf (buf.c:331)
==3139337== by 0x78BE555: mlx5_alloc_cq_buf (cq.c:1963)
==3139337== by 0x791F885: create_cq (verbs.c:1063)
==3139337== by 0x791FF8A: mlx5_create_cq (verbs.c:1194)
==3139337== by 0x4FCA0E1: ibv_create_cq@@IBVERBS_1.1 (verbs.c:552)
==3139337== by 0x4923C66: vrb_cq_open (verbs_cq.c:567)
==3139337== Address 0x4fd6970 is 0 bytes inside data symbol "too_late"
==3139337== Possible data race during write of size 8 at 0x487F020 by thread #5
==3139337== Locks held: none
==3139337== at 0x789246A: mmio_write64_be (mmio.h:173)
==3139337== by 0x78BDD50: mlx5_arm_cq (cq.c:1755)
==3139337== by 0x492255A: ibv_req_notify_cq (verbs.h:2887)
==3139337== by 0x4923D28: vrb_cq_open (verbs_cq.c:576)
==3139337== by 0x494C792: fi_cq_open (fi_domain.h:382)
==3139337== by 0x4951765: rxm_ep_msg_cq_open (rxm_ep.c:1377)
==3139337== by 0x495214C: rxm_ep_ctrl (rxm_ep.c:1571)
==3139337== by 0x2FB98C: fi_enable (fi_endpoint.h:226)
==3139337==
==3139337== This conflicts with a previous write of size 8 by thread #3
==3139337== Locks held: none
==3139337== at 0x789246A: mmio_write64_be (mmio.h:173)
==3139337== by 0x78BDD50: mlx5_arm_cq (cq.c:1755)
==3139337== by 0x492255A: ibv_req_notify_cq (verbs.h:2887)
==3139337== by 0x4923D28: vrb_cq_open (verbs_cq.c:576)
==3139337== by 0x494C792: fi_cq_open (fi_domain.h:382)
==3139337== by 0x4951765: rxm_ep_msg_cq_open (rxm_ep.c:1377)
==3139337== by 0x495214C: rxm_ep_ctrl (rxm_ep.c:1571)
==3139337== by 0x2FB98C: fi_enable (fi_endpoint.h:226)
==3139337== Address 0x487f020 is in a -w- mapped file /dev/infiniband/uverbs0 segment
Environment:
OS: Ubuntu 22.04
RDMA-Core Version: 43
Verbs Backend: mlx5
Valgrind Version: 3.24
Libfabric Version: 1.22
Additional context
The issue was detected with Valgrind's Helgrind tool. It causes non-deterministic segmentation faults and illegal-instruction crashes (SIGILL) that show up roughly once every few hundred runs.