issue: Implement doorbell batching for the new API #164

Open
wants to merge 169 commits into
base: vNext
1976934
issue: 3514044 Introducing cq_mgr_regrq and cq_mgr_strq
AlexanderGrissik Aug 27, 2023
1495ac4
issue: 3514044 Renaming cq_mgr_mlx5 to cq_mgr_regrq
AlexanderGrissik Aug 27, 2023
fb6022f
issue: 3514044 Renaming cq_mgr_mlx5_strq to cq_mgr_strq
AlexanderGrissik Aug 27, 2023
f21880d
issue: 3514044 Moving cq_mgr_regrq tx methods to cq_mgr
AlexanderGrissik Aug 27, 2023
ea4a4e6
issue: 3514044 Moving cq_mgr_regrq events to cq_mgr
AlexanderGrissik Aug 27, 2023
8f5999c
issue: 3514044 Moving cq_mgr_regrq add_qp_tx to cq_mgr
AlexanderGrissik Aug 27, 2023
16f67fc
issue: 3514044 Moving cq_mgr_regrq RX common to cq_mgr
AlexanderGrissik Aug 27, 2023
52d0cf5
issue: 3514044 Moving Tx from cq_mgr to cq_mgr_tx
AlexanderGrissik Aug 27, 2023
7eb4156
issue: 3514044 Rename cq_mgr to cq_mgr_rx
AlexanderGrissik Aug 28, 2023
553cc35
issue: 3514044 Remove qp_rec struct
AlexanderGrissik Aug 28, 2023
797c40c
issue: 3514044 Squash qp_mgr_eth to qp_mgr
AlexanderGrissik Aug 28, 2023
5fb7681
issue: 3514044 Remove DEFINED_DPCP from qp_mgr and styling fixes
AlexanderGrissik Oct 3, 2023
3e1a3bf
issue: 3514044 Squash qp_mgr_eth_mlx5 to qp_mgr
AlexanderGrissik Oct 3, 2023
6dfaffc
issue: 3514044 Squash qp_mgr_eth_mlx5_dpcp to qp_mgr
AlexanderGrissik Oct 4, 2023
c890d0d
issue: 3514044 Split qp_mgr to hw_queue_tx and hw_queue_rx
AlexanderGrissik Oct 8, 2023
4bd1df4
issue: 3514044 Squash rfs_rule_dpcp to rfs_rule
AlexanderGrissik Oct 8, 2023
01594f5
issue: 3514044 Removing m_attach_flow_data vector from rfs
AlexanderGrissik Oct 9, 2023
d5bb9e4
issue: 3514044 Removing hqrx from attach_flow_data_t
AlexanderGrissik Oct 9, 2023
70b1bb3
issue: 3514044 Removing ibv steering flows
AlexanderGrissik Oct 11, 2023
bc3e728
issue: 3514044 Adding flow tag check through dpcp::adapter
AlexanderGrissik Oct 15, 2023
ea7dfd0
issue: 3514044 Require dpcp for configure and CI
AlexanderGrissik Oct 15, 2023
e2acc51
issue: 3514044 Rebasing changes on top 3.20.5 with coverity fixes
AlexanderGrissik Oct 16, 2023
0e91471
issue 3514044 Fixing package test with mandatory dpcp
AlexanderGrissik Oct 16, 2023
d702e3c
issue: 3514044 Updating min dpcp version to 1.1.43
AlexanderGrissik Dec 31, 2023
67185ae
issue: 3514044 Replacing .inl file with .h
AlexanderGrissik Jan 11, 2024
3ce76f7
issue: 3514044 Removing option_strq
AlexanderGrissik Jan 11, 2024
6cda7bf
issue: 3514044 Removing unnecessary checks
AlexanderGrissik Jan 11, 2024
ecb0c8a
issue: 3745279 Fix artifact generation in CI
alexbriskin Jan 17, 2024
bf0b744
issue: 3664594 Return ETIMEDOUT err for timed out socket
AlexanderGrissik Dec 28, 2023
18f5d31
Issue: 3375239 - add email scan in packages
dpressle Jan 8, 2024
b7ac236
issue: 3724170 Support building as a static library
alexbriskin Jan 8, 2024
2e5c9c6
issue: 3724170 disable LTO in Jenkins compiler tests
alexbriskin Jan 16, 2024
5e8fca0
issue: 3704820 Fix strides in WQE for NGINX master
iftahl Jan 22, 2024
ff34156
issue: 3678579 Update last_unacked on ACK recv
iftahl Nov 22, 2023
4ff44ec
issue: 3678579 Fix last_unsent on retransmission
iftahl Nov 22, 2023
2bf7224
issue: 3678579 Fix last_unacked in tcp_rexmit_rto
iftahl Nov 22, 2023
43fe6e5
issue: 3678579 Update last_unacked in tcp_rexmit
iftahl Nov 22, 2023
95845c3
issue: 3678579 Remove iterating lists to find last
iftahl Nov 22, 2023
8a9d5b2
issue: 3678579 Coverity
iftahl Dec 14, 2023
07203a2
issue: 3690535 Remove SO_XLIO_RING_USER_MEMORY
pasis Dec 3, 2023
b12b93a
issue: 3690535 Reduce ring_allocation_logic size
pasis Dec 3, 2023
8ec6c52
issue: 3690535 Improve condition of ring migration support
pasis Dec 8, 2023
1e92b28
issue: 3690535 Print ring allocation logic type in logs
pasis Dec 8, 2023
871ff34
issue: 3690535 Remove unused fields in sockinfo_tcp
pasis Jan 30, 2024
46d4172
[CI] Coverity: add snapshot action
vialogi Jan 24, 2024
70467eb
issue: 3668182 Add tcp_write_zc/tcp_prealloc_zc
alexbriskin Nov 13, 2023
4713505
issue: 3668182 Connect tcp_write_zc to sockinfo_tcp::tcp_tx
alexbriskin Nov 13, 2023
670058e
issue: 3668182 Remove PBUF_DESC_MAP for send zerocopy
alexbriskin Nov 13, 2023
9ebbfef
issue: 3668182 Allow snd_buf drop below 0 in zero-copy path
alexbriskin Nov 14, 2023
249369d
issue: 3668182 Add sockinfo_tcp::tcp_tx_express
alexbriskin Nov 20, 2023
3c63b73
issue: 3668182 Use tcp_tx_express in TLS tx zerocopy
alexbriskin Dec 5, 2023
cf7ff57
issue: 3668182 Remove zerocopy flow from tcp_write
alexbriskin Dec 31, 2023
94f8392
issue: 3668182 Refactor sockinfo_tcp
alexbriskin Dec 31, 2023
ff55e60
issue: 3668182 Refactor LwIP + sockinfo_tcp + dst_entry_tcp
alexbriskin Nov 15, 2023
1e4e730
issue: 3668182 Fix PR comments
alexbriskin Jan 23, 2024
00db3d7
issue: 3668182 Revert tcp_seg::bufs to pbuf_clen()
alexbriskin Jan 30, 2024
1bc79a7
issue: 3724170 Add missing ifdef __cplusplus
alexbriskin Jan 28, 2024
181f582
issue: 3724170 Remove references to os_api
alexbriskin Jan 29, 2024
72f513d
issue: 3724170 Make xlio.h C standard compliant
benlwalker Jan 26, 2024
65bbb00
issue: 3724170 Disable the constructor/destructor in static build
alexbriskin Jan 29, 2024
24e427f
issue: 3724170 Make socketxtreme API regular function declarations
alexbriskin Jan 30, 2024
569a551
issue: 3724170 Fix compilation for static build
alexbriskin Feb 4, 2024
4534754
issue: 3724170 Disable the *_check functions for the static build
alexbriskin Feb 4, 2024
660afb5
issue: 3771283 Fix function pointer check
alexbriskin Feb 7, 2024
0fe1ee1
version: 3.30.0
galnoam Feb 12, 2024
ae058d1
issue: 3786434 Remove C23 feature from public xlio_extra.h
pasis Feb 19, 2024
2cdbc84
version: 3.30.1
galnoam Feb 22, 2024
9ef53fc
issue: 3792731 Fix -Walloc-size-larger-than warning
pasis Feb 22, 2024
077d5b2
issue: 3514044 Fix null pointer dereference
iftahl Feb 13, 2024
e82e642
issue: 3795922 Remove pbuf_split_64k()
pasis Feb 2, 2024
83fc3f5
issue: 3795922 Remove refused_data in lwip
pasis Feb 2, 2024
74c38c2
issue: 3781322 Fix for 100% CPU load
iftahl Feb 18, 2024
cbdbfec
issue: 3813802 Terminate process instead of 'throw' on panic
pasis Mar 6, 2024
8b076f9
issue: 3813802 Don't wrap xlio_raw_post_recv() with IF_VERBS_FAILURE
pasis Mar 6, 2024
b2c8589
issue: 3813802 Avoid partial initialization of an event_data_t object
pasis Mar 6, 2024
fb1a8ea
issue: 3813802 Remove dst_entry::m_p_send_wqe
pasis Mar 6, 2024
5a7881e
issue: 3813802 Fix type overflow warning in time_converter_rtc
pasis Mar 7, 2024
fa7aadd
issue: 3813802 Fix IP_FRAG_DEBUG=1 build
pasis Mar 7, 2024
4b205fb
issue: 3813802 Include system headers in the right way
pasis Mar 7, 2024
eecf0a5
issue: 3813802 Remove unneeded cppcheck suppressions
pasis Mar 7, 2024
6803858
issue: 3770816 Use override instead virtual
alexbriskin Feb 5, 2024
a095c8b
issue: 3770816 Use nullptr instead of NULL
alexbriskin Feb 6, 2024
b66af63
issue: 3770816 Remove redundant void argument lists
alexbriskin Feb 6, 2024
be5a307
issue: 3770816 Replace empty destructor with default
alexbriskin Feb 6, 2024
e5f3ead
issue: 3788369 Rename thread_local_event_handler
pasis Feb 21, 2024
8561e11
issue: 3788369 Fix subsequent xlio_get_socket_rings_fds() calls
pasis Feb 22, 2024
e583d42
issue: 3788369 Return TX ring by xlio_get_socket_rings_fds()
pasis Feb 23, 2024
757e0a1
issue: 3788369 Don't reset TCP connection twice
pasis Feb 23, 2024
6215359
issue: 3788369 Don't hardcode TCP send buffer for TCP_NODELAY
pasis Feb 27, 2024
a750ddc
issue: 3788369 Disable MSG_ZEROCOPY tests in gtests
pasis Feb 29, 2024
7a4f2da
issue: 3788369 Remove XLIO_ZC_TX_SIZE
pasis Mar 2, 2024
9c38f68
issue: 3788369 Remove redundant max_send_sge field
pasis Mar 2, 2024
cee8ca7
issue: 3788369 Pass iovec to tcp_write_express()
pasis Mar 2, 2024
98e3ada
issue: 3788369 Don't poll RX while checking is_rst()
pasis Mar 2, 2024
a42d50d
issue: 3788369 Fix LwIP type length related to segment/pbuf size
pasis Mar 2, 2024
75221e0
issue: 3788369 Fix Nagle's algorithm for negative snd_buf
pasis Mar 2, 2024
140401b
issue: 3788369 Remove redundant snd_buf check in LwIP
pasis Mar 2, 2024
236d43f
issue: 3788369 Remove pbuf_desc::map
pasis Mar 2, 2024
8779882
issue: 3788369 Introduce XLIO socket API
pasis Feb 4, 2024
4abcf4f
issue: 3788369 Remove lwip/init.[ch]
pasis Mar 4, 2024
cdcc99c
issue: 3788369 Remove pbuf_custom wrapper structure
pasis Mar 4, 2024
fea6e82
version: 3.30.2
galnoam Mar 11, 2024
1d23a9d
issue: HPCINFRA-1321 add Dockerfile for static tests
vialogi Mar 7, 2024
6e9a005
issue: HPCINFRA-1321 Switch cppcheck to a docker
vialogi Mar 5, 2024
bda508b
issue: HPCINFRA-1321 Switch csbuild to a docker
vialogi Mar 6, 2024
accbf7b
issue: HPCINFRA-1321 Switch Tidy to a docker
vialogi Mar 6, 2024
9521056
issue: 3777348 Remove unused pipeinfo class
AlexanderGrissik Feb 12, 2024
e449276
issue: 3777348 Removing cleanable_obj from socket_fd_api
AlexanderGrissik Feb 12, 2024
4305473
issue: 3777348 Removing unused pkt_sndr_source class
AlexanderGrissik Feb 12, 2024
29c6b63
issue: 3777348 Replacing pkt_rcvr_source class with sockinfo
AlexanderGrissik Feb 12, 2024
a1cb209
issue: 3777348 Simplifying timers for TCP sockets
AlexanderGrissik Feb 13, 2024
19441e4
issue: 3777348 Moving wakeup_pipe to be a member of sockinfo
AlexanderGrissik Feb 14, 2024
1aa581a
issue: 3777348 Replacing socket_fd_api access with sockinfo
AlexanderGrissik Feb 15, 2024
fcbba6e
issue: 3777348 Merging socket_fd_api with sockinfo
AlexanderGrissik Feb 15, 2024
0168f14
issue: 3777348 Moving sockinfo inline impl outside the class
AlexanderGrissik Feb 25, 2024
d135aac
issue: 3777348 sockinfo Reordering methods
AlexanderGrissik Feb 25, 2024
eca2eba
issue: 3777348 Moving sock stats outside the socket
AlexanderGrissik Feb 26, 2024
38f8cd0
issue: 3777348 Reordering sockinfo members
AlexanderGrissik Mar 11, 2024
06a7917
issue: 3777348 Removing m_flow_tag_enabled check
AlexanderGrissik Mar 13, 2024
1949822
issue: 3777348 Remove support for SO_XLIO_FLOW_TAG
AlexanderGrissik Mar 13, 2024
28ded23
issue: 3777348 Avoid process_timestamps checking on each packet
AlexanderGrissik Mar 13, 2024
0d975c0
issue: 3777348 Remove precached sysvars from sockinfo
AlexanderGrissik Mar 13, 2024
cb0b278
issue: 3777348 Remove access to m_sock_wakeup_pipe for socketxtreme
AlexanderGrissik Mar 13, 2024
08956f8
issue: 3777348 Avoid checking m_iomux_ready_fd_array for Socketxtrme
AlexanderGrissik Mar 13, 2024
6434e58
issue: 3777348 Avoid unnecessary access to ring_allocation_tx members
AlexanderGrissik Mar 13, 2024
87a76ea
issue: 3777348 Use thread_local dummy lock
AlexanderGrissik Mar 14, 2024
1ff8f40
issue: 3777348 Avoid copying src/dst addresses for TCP flow-tag DP
AlexanderGrissik Mar 14, 2024
7fb5963
issue: 3808935 Add nullptr checks before dereferencing
alexbriskin Mar 18, 2024
af806d1
issue: 3829626 Fix new TCP timers registration for reused sockets
AlexanderGrissik Mar 19, 2024
3a98c90
issue: 3829626 Fixing statistics init for reused sockets
AlexanderGrissik Mar 19, 2024
27931ce
issue: 3788369 Replace XLIO_HUGEPAGE_LOG2 with XLIO_HUGEPAGE_SIZE
pasis Mar 12, 2024
bf77b2c
issue: 3788369 Remove xlio_key prototypes
pasis Mar 18, 2024
0dcadd3
issue: 3788369 Move public types definitions to xlio_types.h
pasis Mar 18, 2024
4df427c
issue: 3788369 Add external allocator to XLIO Socket API
pasis Mar 18, 2024
f959cff
issue: 3788369 Add XLIO Socket API to the xlio_api_t pointers
pasis Mar 18, 2024
1baf574
issue: 3777348 Adding lock_spin_simple for smaller space utilization"
AlexanderGrissik Mar 17, 2024
6ab22fa
issue: 3777348 Adding template cached_obj_pool
AlexanderGrissik Mar 17, 2024
347c362
issue: 3777348 Socketxtreme completions ring pool
AlexanderGrissik Feb 29, 2024
4de688b
issue: Fix big endian build and clean unused macros
pasis Mar 11, 2024
221ea72
version: 3.30.3
galnoam Mar 20, 2024
dcdcd64
issue: 3788369 Keep global collection of the polling groups
pasis Mar 17, 2024
dde0276
issue: 3788369 Keep sockets list per polling group
pasis Mar 17, 2024
8452e00
issue: 3788369 poll_group takes reference to ring
pasis Mar 17, 2024
617759e
issue: 3788369 Throw exception if netdev not found for a ring
pasis Mar 17, 2024
25351d9
issue: 3788369 Release native rings in the poll_group destructor
pasis Mar 21, 2024
0b3eb59
issue: 3788369 Don't free buffer unconditionally in XLIO Socket API
pasis Mar 24, 2024
69ca61f
issue: 3788369 Use reclaim_recv_buffers() in XLIO Socket API
pasis Mar 24, 2024
8ad2b35
issue: 3788369 Pass proper hugepage_size to XLIO Socket API
pasis Mar 25, 2024
7b8853d
issue: 3788369 Poll local ring before XLIO socket destruction
pasis Mar 25, 2024
64bf555
issue: 3788369 Re-read env params in xlio_init_ex()
pasis Mar 25, 2024
4771bdd
issue: 3788369 Avoid POSIX connect() in xlio_socket_connect()
pasis Mar 25, 2024
dfdbd8b
issue: 3788369 Remove get_fd() from XLIO Socket API
pasis Mar 25, 2024
86fc67a
issue: 3829626 Fix seg fault in TCP timers
iftahl Mar 27, 2024
65130bd
issue: 3818038 Remove BlueFlame doorbell method
pasis Apr 1, 2024
6f485a1
issue: 3818038 Remove likely() from the inline WQE branch
pasis Apr 2, 2024
ea38dd7
issue: 3844385 Fix new TCP timers registration lock contention
AlexanderGrissik Apr 2, 2024
8e64060
version: 3.30.4
galnoam Apr 4, 2024
9b7eec0
issue: 3788164 Fix RX poll on TX option for UTLS
pasis Apr 5, 2024
0678a45
issue: 3855390 Fixing adding TCP timer twice warning
AlexanderGrissik Apr 8, 2024
db61660
issue: 3795997 Control TSO max payload size
iftahl Apr 4, 2024
1e18c6a
version: 3.30.5
galnoam Apr 9, 2024
fdd3157
issue: Fix incorrect value for pbuf_desc::attr
pasis Jun 8, 2024
a0d9ce6
issue: Don't initialize pbuf_desc field before copy
pasis Jun 8, 2024
d69a7c6
issue: Move inline WQE part to a separate method
pasis Jun 8, 2024
18500b4
issue: Keep last wqe pointer instead of prefetching hot
pasis Jun 8, 2024
dc33e0a
issue: Split mem_buf_desc_t::ZCOPY into URGENT and CALLBACK
pasis Jun 8, 2024
8c8aabd
issue: Implement delayed doorbell
pasis Jun 8, 2024
1e5696b
issue: Skip doorbell for PBUF_DESC_EXPRESS buffers
pasis Jun 8, 2024
e6ddba0
issue: Allow doorbell batching for tcp_tx_express_inline()
pasis Jun 8, 2024
issue: 3668182 Fix PR comments
Signed-off-by: Alex Briskin <abriskin@nvidia.com>
alexbriskin authored and AlexanderGrissik committed Feb 1, 2024
commit 1e4e73035a82d88b14e50570a1e5ac85446dacef
2 changes: 1 addition & 1 deletion src/core/lwip/tcp_impl.h
@@ -296,7 +296,7 @@ struct tcp_seg {
#define TF_SEG_OPTS_ZEROCOPY (u8_t) TCP_WRITE_ZEROCOPY /* Use zerocopy send mode */

u8_t tcp_flags; /* Cached TCP flags for outgoing segments */
u8_t bufs;
u8_t bufs; /* Number of buffers in the pbuf linked list */

/* L2+L3+TCP header for zerocopy segments, it must have enough room for options
This should have enough space for L2 (ETH+vLAN), L3 (IPv4/6), L4 (TCP)
20 changes: 11 additions & 9 deletions src/core/proto/dst_entry_tcp.cpp
@@ -222,16 +222,18 @@ ssize_t dst_entry_tcp::fast_send(const iovec *p_iov, const ssize_t sz_iov, xlio_
m_sge[i].addr = (uintptr_t)p_tcp_iov[i].iovec.iov_base;
m_sge[i].length = p_tcp_iov[i].iovec.iov_len;
if (is_zerocopy) {
if (PBUF_DESC_EXPRESS == p_tcp_iov[i].p_desc->lwip_pbuf.pbuf.desc.attr) {
m_sge[i].lkey = p_tcp_iov[i].p_desc->lwip_pbuf.pbuf.desc.mkey;
} else if (PBUF_DESC_MKEY == p_tcp_iov[i].p_desc->lwip_pbuf.pbuf.desc.attr) {
auto *p_desc = p_tcp_iov[i].p_desc;
auto &pbuf_descriptor = p_desc->lwip_pbuf.pbuf.desc;
if (PBUF_DESC_EXPRESS == pbuf_descriptor.attr) {
m_sge[i].lkey = pbuf_descriptor.mkey;
} else if (PBUF_DESC_MKEY == pbuf_descriptor.attr) {
/* PBUF_DESC_MKEY - value is provided by user */
m_sge[i].lkey = p_tcp_iov[i].p_desc->lwip_pbuf.pbuf.desc.mkey;
} else if (PBUF_DESC_MDESC == p_tcp_iov[i].p_desc->lwip_pbuf.pbuf.desc.attr ||
PBUF_DESC_NVME_TX == p_tcp_iov[i].p_desc->lwip_pbuf.pbuf.desc.attr) {
mem_desc *mdesc = (mem_desc *)p_tcp_iov[i].p_desc->lwip_pbuf.pbuf.desc.mdesc;
m_sge[i].lkey = mdesc->get_lkey(p_tcp_iov[i].p_desc, ib_ctx,
(void *)m_sge[i].addr, m_sge[i].length);
m_sge[i].lkey = pbuf_descriptor.mkey;
} else if (PBUF_DESC_MDESC == pbuf_descriptor.attr ||
PBUF_DESC_NVME_TX == pbuf_descriptor.attr) {
mem_desc *mdesc = (mem_desc *)pbuf_descriptor.mdesc;
m_sge[i].lkey =
mdesc->get_lkey(p_desc, ib_ctx, (void *)m_sge[i].addr, m_sge[i].length);
if (m_sge[i].lkey == LKEY_TX_DEFAULT) {
m_sge[i].lkey = m_p_ring->get_tx_lkey(m_id);
}
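The lkey selection in this hunk can be read as a small decision function. The sketch below is a self-contained illustration only: every `*_sketch` type, the `mdesc_lkey` field (standing in for `mdesc->get_lkey(...)`), the value of `LKEY_TX_DEFAULT`, and the final fallback branch are assumptions made for the example, not the real XLIO definitions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the real XLIO types; only the shape matters.
enum pbuf_attr_sketch {
    PBUF_DESC_NONE,
    PBUF_DESC_MKEY,
    PBUF_DESC_MDESC,
    PBUF_DESC_NVME_TX,
    PBUF_DESC_EXPRESS
};

constexpr uint32_t LKEY_TX_DEFAULT = 0; // assumed sentinel value

struct pbuf_desc_sketch {
    pbuf_attr_sketch attr;
    uint32_t mkey;       // key supplied by the user (MKEY and EXPRESS cases)
    uint32_t mdesc_lkey; // stands in for mdesc->get_lkey(...)
};
struct pbuf_sketch { pbuf_desc_sketch desc; };
struct lwip_pbuf_sketch { pbuf_sketch pbuf; };
struct mem_buf_sketch { lwip_pbuf_sketch lwip_pbuf; };

uint32_t select_lkey(const mem_buf_sketch *p_desc, uint32_t ring_default_lkey)
{
    assert(p_desc != nullptr);
    // Bind the nested descriptor once instead of repeating the member chain.
    const auto &pbuf_descriptor = p_desc->lwip_pbuf.pbuf.desc;

    if (pbuf_descriptor.attr == PBUF_DESC_EXPRESS || pbuf_descriptor.attr == PBUF_DESC_MKEY) {
        return pbuf_descriptor.mkey;
    }
    if (pbuf_descriptor.attr == PBUF_DESC_MDESC || pbuf_descriptor.attr == PBUF_DESC_NVME_TX) {
        const uint32_t lkey = pbuf_descriptor.mdesc_lkey;
        // Fall back to the ring's key when the descriptor reports the default.
        return lkey == LKEY_TX_DEFAULT ? ring_default_lkey : lkey;
    }
    return ring_default_lkey; // assumed fallback; this branch is not shown in the hunk
}
```

Binding `pbuf_descriptor` as a reference keeps each branch short, which is the readability gain the refactor appears to be after.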
148 changes: 71 additions & 77 deletions src/core/sock/sockinfo_tcp.cpp
@@ -769,12 +769,12 @@ bool sockinfo_tcp::prepare_dst_to_send(bool is_accepted_socket /* = false */)
return ret_val;
}

unsigned sockinfo_tcp::tx_wait(int &err, bool blocking)
unsigned sockinfo_tcp::tx_wait(bool blocking)
{
unsigned sz = sndbuf_available();
int poll_count = 0;
si_tcp_logfunc("sz = %u rx_count=%d", sz, m_n_rx_pkt_ready_list_count);
err = 0;
int err = 0;
while (is_rts() && (sz = sndbuf_available()) == 0) {
err = rx_wait(poll_count, blocking);
// AlexV:Avoid from going to sleep, for the blocked socket of course, since
@@ -795,7 +795,7 @@ unsigned sockinfo_tcp::tx_wait(int &err, bool blocking)
poll_count = 0;
}
}
si_tcp_logfunc("end sz=%d rx_count=%d", sz, m_n_rx_pkt_ready_list_count);
si_tcp_logfunc("end sz=%u rx_count=%d", sz, m_n_rx_pkt_ready_list_count);
return sz;
}

@@ -907,6 +907,16 @@ static inline bool is_invalid_iovec(const iovec *iov, size_t sz_iov)
return iov == nullptr || sz_iov == 0;
}

/**
* Handles transmission operations on a TCP socket, supporting various user actions such as
* write, send, sendv, sendmsg, and sendfile. This function operates on both blocking and
* non-blocking sockets, providing options for zero-copy send operations. When the socket is
* configured for zero-copy send, it executes a fast-path send for non-blocking operations;
* otherwise, it falls back to the tcp_tx_slow_path function.
*
* @param tx_arg The TCP transmission arguments and parameters.
* @return Returns the number of bytes transmitted, or -1 on error with the errno set.
*/
ssize_t sockinfo_tcp::tcp_tx(xlio_tx_call_attr_t &tx_arg)
{
iovec *p_iov = tx_arg.attr.iov;
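The dispatch rule stated in the doc comment above (fast path only for non-blocking zero-copy sends on a socket configured for zero-copy, slow path otherwise) could be summarized as a single predicate. This is a minimal sketch under that reading; `tx_request_sketch`, its fields, and `choose_tx_path` are hypothetical names and do not appear in the patch.

```cpp
#include <cassert>

// Hypothetical summary of one send request; field names are illustrative.
struct tx_request_sketch {
    bool socket_zerocopy_enabled; // socket was configured for zero-copy send
    bool zerocopy_requested;      // this call asks for zero-copy
    bool blocking;                // this call may block
};

enum class tx_path { fast, slow };

// Fast path only when all three conditions from the doc comment hold.
tx_path choose_tx_path(const tx_request_sketch &req)
{
    const bool fast =
        req.socket_zerocopy_enabled && req.zerocopy_requested && !req.blocking;
    return fast ? tx_path::fast : tx_path::slow;
}

// Usage: each missing condition alone is enough to fall back.
inline void check_choose_tx_path()
{
    assert(choose_tx_path({true, true, false}) == tx_path::fast);
    assert(choose_tx_path({true, true, true}) == tx_path::slow);
    assert(choose_tx_path({false, true, false}) == tx_path::slow);
}
```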
@@ -933,7 +943,6 @@ ssize_t sockinfo_tcp::tcp_tx(xlio_tx_call_attr_t &tx_arg)
if (unlikely(!is_connected_and_ready_to_send())) {
return -1;
}

si_tcp_logfunc("tx: iov=%p niovs=%d", p_iov, sz_iov);

if (m_sysvar_rx_poll_on_tx_tcp) {
@@ -947,7 +956,7 @@ ssize_t sockinfo_tcp::tcp_tx(xlio_tx_call_attr_t &tx_arg)
return tcp_tx_slow_path(tx_arg);
}

bool is_send_zerocopy = tx_arg.opcode != TX_FILE;
bool is_non_file_zerocopy = tx_arg.opcode != TX_FILE;
pd_key_array =
(tx_arg.priv.attr == PBUF_DESC_MKEY ? (struct xlio_pd_key *)tx_arg.priv.map : NULL);

@@ -979,12 +988,19 @@ ssize_t sockinfo_tcp::tcp_tx(xlio_tx_call_attr_t &tx_arg)
unsigned tx_size = sndbuf_available();

if (tx_size == 0) {
return tcp_tx_handle_sndbuf_unavailable(total_tx, is_dummy, is_send_zerocopy,
if (unlikely(!is_rts())) {
si_tcp_logdbg("TX on disconnected socket");
return tcp_tx_handle_errno_and_unlock(ECONNRESET);
}
// force out TCP data before going on wait()
tcp_output(&m_pcb);

return tcp_tx_handle_sndbuf_unavailable(total_tx, is_dummy, is_non_file_zerocopy,
errno_tmp);
}

tx_size = std::min<size_t>(p_iov[i].iov_len - pos, tx_size);
if (is_send_zerocopy) {
if (is_non_file_zerocopy) {
/*
* For send zerocopy we don't support pbufs which
* cross huge page boundaries. To avoid forming
@@ -1003,7 +1019,7 @@ ssize_t sockinfo_tcp::tcp_tx(xlio_tx_call_attr_t &tx_arg)
}
if (unlikely(g_b_exit)) {
return tcp_tx_handle_partial_send_and_unlock(total_tx, EINTR, is_dummy,
is_send_zerocopy, errno_tmp);
is_non_file_zerocopy, errno_tmp);
}

err = tcp_write_express(&m_pcb, tx_ptr, tx_size, &tx_arg.priv);
@@ -1012,7 +1028,7 @@ ssize_t sockinfo_tcp::tcp_tx(xlio_tx_call_attr_t &tx_arg)
si_tcp_logdbg("connection closed: tx'ed = %d", total_tx);
shutdown(SHUT_WR);
return tcp_tx_handle_partial_send_and_unlock(total_tx, EPIPE, is_dummy,
is_send_zerocopy, errno_tmp);
is_non_file_zerocopy, errno_tmp);
}
if (unlikely(err != ERR_MEM)) {
// we should not get here...
@@ -1021,35 +1037,37 @@ ssize_t sockinfo_tcp::tcp_tx(xlio_tx_call_attr_t &tx_arg)
BULLSEYE_EXCLUDE_BLOCK_END
}
return tcp_tx_handle_partial_send_and_unlock(total_tx, EAGAIN, is_dummy,
is_send_zerocopy, errno_tmp);
is_non_file_zerocopy, errno_tmp);
}
tx_ptr = (void *)((char *)tx_ptr + tx_size);
pos += tx_size;
total_tx += tx_size;
}
}

return tcp_tx_handle_done_and_unlock(total_tx, errno_tmp, is_dummy, is_send_zerocopy);
return tcp_tx_handle_done_and_unlock(total_tx, errno_tmp, is_dummy, is_non_file_zerocopy);
}

/**
* Handles transmission operations on a TCP socket similar to tcp_tx.
* This is a fallback function when the operation is either blocking, not zero-copy, or the socket
* wasn't configured for zero-copy operations.
*
* @param tx_arg The TCP transmission arguments and parameters.
* @return Returns the number of bytes transmitted, or -1 on error with the errno set.
*/
ssize_t sockinfo_tcp::tcp_tx_slow_path(xlio_tx_call_attr_t &tx_arg)
{
iovec *p_iov = tx_arg.attr.iov;
size_t sz_iov = tx_arg.attr.sz_iov;
int flags = tx_arg.attr.flags;
int errno_tmp = errno;
int ret = 0;
int poll_count = 0;
uint16_t apiflags = 0;
err_t err;
bool is_send_zerocopy = false;
void *tx_ptr = NULL;
struct xlio_pd_key *pd_key_array = NULL;

if (m_sysvar_rx_poll_on_tx_tcp) {
rx_wait_helper(poll_count, false);
}

if (tx_arg.opcode == TX_FILE) {
/*
* TX_FILE is a special operation which reads a single file.
@@ -1081,12 +1099,13 @@ ssize_t sockinfo_tcp::tcp_tx_slow_path(xlio_tx_call_attr_t &tx_arg)

lock_tcp_con();

if (cannot_do_requested_dummy_send(m_pcb, tx_arg) || TCP_WND_UNAVALABLE(m_pcb, total_iov_len)) {
if (cannot_do_requested_dummy_send(m_pcb, tx_arg)) {
return tcp_tx_handle_errno_and_unlock(EAGAIN);
}

int total_tx = 0;
off64_t file_offset = 0;
bool block_this_run = BLOCK_THIS_RUN(m_b_blocking, flags);
for (size_t i = 0; i < sz_iov; i++) {
si_tcp_logfunc("iov:%d base=%p len=%d", i, p_iov[i].iov_base, p_iov[i].iov_len);
if (unlikely(!p_iov[i].iov_base)) {
@@ -1122,9 +1141,13 @@ ssize_t sockinfo_tcp::tcp_tx_slow_path(xlio_tx_call_attr_t &tx_arg)
// force out TCP data before going on wait()
tcp_output(&m_pcb);

/* Set return values for nonblocking socket and finish processing */
// non blocking socket should return in order not to tx_wait()
if (!block_this_run) {
return tcp_tx_handle_sndbuf_unavailable(total_tx, is_dummy, is_send_zerocopy,
errno_tmp);
}

tx_size = tx_wait(ret, true);
tx_size = tx_wait(block_this_run);
}

tx_size = std::min<size_t>(p_iov[i].iov_len - pos, tx_size);
@@ -1150,11 +1173,9 @@ ssize_t sockinfo_tcp::tcp_tx_slow_path(xlio_tx_call_attr_t &tx_arg)
is_send_zerocopy, errno_tmp);
}

if (apiflags & XLIO_TX_PACKET_ZEROCOPY) {
err = tcp_write_express(&m_pcb, tx_ptr, tx_size, &tx_arg.priv);
} else {
err = tcp_write(&m_pcb, tx_ptr, tx_size, apiflags, &tx_arg.priv);
}
err_t err = (apiflags & XLIO_TX_PACKET_ZEROCOPY)
? tcp_write_express(&m_pcb, tx_ptr, tx_size, &tx_arg.priv)
: tcp_write(&m_pcb, tx_ptr, tx_size, apiflags, &tx_arg.priv);
if (unlikely(err != ERR_OK)) {
if (unlikely(err == ERR_CONN)) { // happens when remote drops during big write
si_tcp_logdbg("connection closed: tx'ed = %d", total_tx);
@@ -1168,6 +1189,15 @@ ssize_t sockinfo_tcp::tcp_tx_slow_path(xlio_tx_call_attr_t &tx_arg)
si_tcp_logpanic("tcp_write return: %d", err);
BULLSEYE_EXCLUDE_BLOCK_END
}
/* Set return values for nonblocking socket and finish processing */
if (!block_this_run) {
if (total_tx > 0) {
return tcp_tx_handle_done_and_unlock(total_tx, errno_tmp, is_dummy,
is_send_zerocopy);
} else {
return tcp_tx_handle_errno_and_unlock(EAGAIN);
}
}

rx_wait(poll_count, true);
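The non-blocking branch added in this hunk follows the usual partial-send convention: report what was already queued, and signal EAGAIN only when nothing was. A hedged sketch of that rule follows; `tx_result_sketch` and `finish_nonblocking_tx` are hypothetical, and it is an assumption that `tcp_tx_handle_done_and_unlock` restores the saved errno and returns the byte count.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/types.h> // ssize_t

// Hypothetical result pair: the return value and the errno the caller would see.
struct tx_result_sketch {
    ssize_t ret;
    int err;
};

tx_result_sketch finish_nonblocking_tx(ssize_t total_tx, int errno_to_restore)
{
    if (total_tx > 0) {
        // Partial progress counts as success; the saved errno is kept.
        return {total_tx, errno_to_restore};
    }
    // Nothing was queued: the caller should retry later.
    return {-1, EAGAIN};
}

// Usage: a partial send reports its bytes, an empty one reports EAGAIN.
inline void check_finish_nonblocking_tx()
{
    assert(finish_nonblocking_tx(128, 0).ret == 128);
    assert(finish_nonblocking_tx(0, 0).err == EAGAIN);
}
```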

@@ -1278,12 +1308,9 @@ err_t sockinfo_tcp::ip_output(struct pbuf *p, struct tcp_seg *seg, void *v_p_con
return ERR_OK;
}

ssize_t ret = 0;
if (likely((p_dst->is_valid()))) {
ret = p_dst->fast_send((struct iovec *)lwip_iovec, count, attr);
} else {
ret = p_dst->slow_send((struct iovec *)lwip_iovec, count, attr, p_si_tcp->m_so_ratelimit);
}
ssize_t ret = likely((p_dst->is_valid()))
? p_dst->fast_send((struct iovec *)lwip_iovec, count, attr)
: p_dst->slow_send((struct iovec *)lwip_iovec, count, attr, p_si_tcp->m_so_ratelimit);

rc = p_si_tcp->m_ops->handle_send_ret(ret, seg);

@@ -6030,43 +6057,12 @@ inline bool sockinfo_tcp::handle_bind_no_port(int &bind_ret, in_port_t in_port,
int sockinfo_tcp::tcp_tx_express(const struct iovec *iov, unsigned iov_len, uint32_t mkey,
xlio_express_flags flags, void *opaque_op)
{
if (unlikely(!is_rts())) {
if (m_conn_state == TCP_CONN_TIMEOUT) {
si_tcp_logdbg("TX timed out");
errno = ETIMEDOUT;
} else if (m_conn_state == TCP_CONN_RESETED) {
si_tcp_logdbg("TX on reseted socket");
errno = ECONNRESET;
} else if (m_conn_state == TCP_CONN_ERROR) {
si_tcp_logdbg("TX on connection failed socket");
errno = ECONNREFUSED;
} else {
si_tcp_logdbg("TX on disconnected socket");
errno = EPIPE;
}
if (unlikely(!is_connected_and_ready_to_send())) {
return -1;
}

err_t err;
pbuf_desc mdesc;

if (unlikely(!is_rts())) {
if (m_conn_state == TCP_CONN_TIMEOUT) {
si_tcp_logdbg("TX timed out");
errno = ETIMEDOUT;
} else if (m_conn_state == TCP_CONN_RESETED) {
si_tcp_logdbg("TX on reseted socket");
errno = ECONNRESET;
} else if (m_conn_state == TCP_CONN_ERROR) {
si_tcp_logdbg("TX on connection failed socket");
errno = ECONNREFUSED;
} else {
si_tcp_logdbg("TX on disconnected socket");
errno = EPIPE;
}
return -1;
}

switch (flags & XLIO_EXPRESS_OP_TYPE_MASK) {
case XLIO_EXPRESS_OP_TYPE_DESC:
mdesc.attr = PBUF_DESC_EXPRESS;
@@ -6084,21 +6080,26 @@ int sockinfo_tcp::tcp_tx_express(const struct iovec *iov, unsigned iov_len, uint

lock_tcp_con();

err_t err;
for (unsigned i = 0; i < iov_len; ++i) {
err = tcp_write_express(&m_pcb, iov[i].iov_base, iov[i].iov_len, &mdesc);
if (err != ERR_OK) {
/* The only error in tcp_write_express is a memory error */
/* The only error in tcp_write_express is a memory error.
* In this version we don't implement any error recovery or avoidance
* mechanism, and an error at this stage is irrecoverable.
* The considered alternatives are:
* - Setting the socket to an error state (this is the one we chose here)
* - Rolling back any written buffers, i.e. recovering
* - Reserving the pbuf(s)/tcp_seg(s) before calling tcp_write_express */
m_conn_state = TCP_CONN_ERROR;
m_error_status = ENOMEM;
return tcp_tx_handle_errno_and_unlock(ENOMEM);
}
bytes_written += iov[i].iov_len;
}

if (!(flags & XLIO_EXPRESS_MSG_MORE)) {
err = tcp_output(&m_pcb);
if (err != ERR_OK) {
/* The error very likely to be recoverable */
si_tcp_logdbg("tcp_tx_express - tcp_output failed");
}
tcp_output(&m_pcb);
}
unlock_tcp_con();
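The control flow of this hunk (queue every iovec entry, mark the connection as failed on the first write error, and flush only when the "more data follows" flag is absent) can be modeled with a mock connection. Everything in the sketch is hypothetical: `conn_sketch`, its byte budget, `MSG_MORE_SKETCH`, and the `-1` return value are stand-ins chosen for illustration, not XLIO or LwIP APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum conn_state_sketch { CONNECTED, ERROR_STATE };

// Mock connection: writes succeed until a fixed byte budget is exhausted.
struct conn_sketch {
    explicit conn_sketch(std::size_t b) : budget(b) {}

    bool write(std::size_t len)
    {
        if (len > budget) {
            return false; // models a memory error in the write call
        }
        budget -= len;
        return true;
    }
    void output() { ++flushes; } // models pushing queued data to the wire

    conn_state_sketch state = CONNECTED;
    std::size_t budget;
    unsigned flushes = 0;
};

constexpr unsigned MSG_MORE_SKETCH = 1u; // hypothetical "more data follows" flag

long tx_express_sketch(conn_sketch &c, const std::vector<std::size_t> &chunks, unsigned flags)
{
    long bytes_written = 0;
    for (std::size_t len : chunks) {
        if (!c.write(len)) {
            // No rollback of earlier chunks: the connection is marked failed.
            c.state = ERROR_STATE;
            return -1;
        }
        bytes_written += static_cast<long>(len);
    }
    if (!(flags & MSG_MORE_SKETCH)) {
        c.output(); // flush unless the caller announced more data
    }
    return bytes_written;
}
```

Deferring `output()` while the flag is set lets several calls share one flush, which is the batching behavior the flag exists for.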

@@ -6187,7 +6188,7 @@ bool sockinfo_tcp::is_connected_and_ready_to_send()
si_tcp_logdbg("TX on connection failed socket");
errno = ECONNREFUSED;
} else {
si_tcp_logdbg("TX on disconnected socket");
si_tcp_logdbg("TX on unconnected socket");
errno = EPIPE;
}
return false;
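The errno mapping that `is_connected_and_ready_to_send()` now centralizes can be written as a pure function. In the sketch below the enum is a hypothetical subset that reuses the state names visible in the diff; the real enum may have more members and different values.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical subset of connection states, named after those in the diff.
enum tcp_conn_state_sketch {
    TCP_CONN_INIT,
    TCP_CONN_CONNECTED,
    TCP_CONN_TIMEOUT,
    TCP_CONN_RESETED,
    TCP_CONN_ERROR
};

// Maps a not-ready connection state to the errno reported to the caller.
int errno_for_unready_state(tcp_conn_state_sketch state)
{
    switch (state) {
    case TCP_CONN_TIMEOUT:
        return ETIMEDOUT;
    case TCP_CONN_RESETED:
        return ECONNRESET;
    case TCP_CONN_ERROR:
        return ECONNREFUSED;
    default:
        return EPIPE; // any other state is treated as an unconnected socket
    }
}

// Usage: a timed-out socket reports ETIMEDOUT, an unconnected one EPIPE.
inline void check_errno_for_unready_state()
{
    assert(errno_for_unready_state(TCP_CONN_TIMEOUT) == ETIMEDOUT);
    assert(errno_for_unready_state(TCP_CONN_INIT) == EPIPE);
}
```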
@@ -6204,13 +6205,6 @@ bool sockinfo_tcp::is_connected_and_ready_to_send()
ssize_t sockinfo_tcp::tcp_tx_handle_sndbuf_unavailable(ssize_t total_tx, bool is_dummy,
bool is_send_zerocopy, int errno_to_restore)
{
if (unlikely(!is_rts())) {
si_tcp_logdbg("TX on disconnected socket");
return tcp_tx_handle_errno_and_unlock(ECONNRESET);
}
// force out TCP data before going on wait()
tcp_output(&m_pcb);

// non blocking socket should return in order not to tx_wait()
if (total_tx > 0) {
m_tx_consecutive_eagain_count = 0;
2 changes: 1 addition & 1 deletion src/core/sock/sockinfo_tcp.h
@@ -387,7 +387,7 @@ class sockinfo_tcp : public sockinfo, public timer_handler {
int wait_for_conn_ready_blocking();
static err_t connect_lwip_cb(void *arg, struct tcp_pcb *tpcb, err_t err);
// tx
unsigned tx_wait(int &err, bool blocking);
unsigned tx_wait(bool blocking);
int os_epoll_wait_with_tcp_timers(epoll_event *ep_events, int maxevents);
int handle_child_FIN(sockinfo_tcp *child_conn);