FreeBSD kernel driver for Elastic Network Adapter (ENA) family:
===============================================================
Version:
========
0.8.1
Supported FreeBSD Versions:
===========================
- FreeBSD release/11.0
- FreeBSD release/11.1
- FreeBSD 12, starting from rS309111,
  commit cbc182c1724fcd3ee240ed9933fcc53eb6db5b9b (Nov 24, 2016).
Overview:
=========
ENA is a networking interface designed to make good use of modern CPU
features and system architectures.
The ENA device exposes a lightweight management interface with a
minimal set of memory mapped registers and extendable command set
through an Admin Queue.
The driver supports a range of ENA devices, is link-speed independent
(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has
a negotiated and extendable feature set.
Some ENA devices support SR-IOV. This driver is used for both the
SR-IOV Physical Function (PF) and Virtual Function (VF) devices.
ENA devices enable high speed and low overhead network traffic
processing by providing multiple Tx/Rx queue pairs (the maximum number
is advertised by the device via the Admin Queue), a dedicated MSI-X
interrupt vector per Tx/Rx queue pair, and CPU cacheline optimized
data placement.
The ENA driver supports industry standard TCP/IP offload features such
as checksum offload and TCP transmit segmentation offload (TSO).
Receive-side scaling (RSS) is supported for multi-core scaling.
The ENA driver and its corresponding devices implement health
monitoring mechanisms such as watchdog, enabling the device and driver
to recover in a manner transparent to the application, as well as
debug logs.
Some ENA devices support a working mode called Low-latency Queue
(LLQ), which saves several more microseconds. This feature is not yet
implemented in the driver and is planned for a future release.
Driver compilation:
===================
Prerequisites:
--------------
To build and run the driver as a standalone module, the OS sources
matching the currently installed OS version are required.
How the sources are retrieved depends on the user's configuration. On a
fresh installation on an EC2 instance, the necessary steps may look as
shown below (some may require superuser privileges):
pkg install subversion
mkdir /usr/src
# Get sources for current installation. This step may require
# accepting certificate.
# Getting sources may vary between system versions. The resources need
# to be adjusted accordingly.
# - for stable:
svn checkout https://svn0.us-east.FreeBSD.org/base/stable/11/ /usr/src
# - for release (FreeBSD 11.1)
svn checkout https://svn0.us-east.FreeBSD.org/base/releng/11.1/ /usr/src
# - for -CURRENT (unstable)
svn checkout https://svn0.us-east.freebsd.org/base/head /usr/src
# - for kernel version currently running in the system
uname -a # provides revision of running kernel
# Sample output:
# FreeBSD host 12.0-CURRENT FreeBSD 12.0-CURRENT #0 r316750: current_date
# r316750 is indicating revision of current kernel
# In this example, we have to pull kernel tree with revision r316750 from head:
svn checkout -r316750 https://svn0.us-east.freebsd.org/base/head /usr/src
# r316750 must be changed to the revision number from the 'uname -a' output
Compilation:
------------
Run "make" in the amzn-drivers/kernel/fbsd/ena/ directory.
Compilation creates the if_ena.ko kernel module file in the same
directory.
Driver installation:
====================
Loading the driver:
-------------------
kldload ./if_ena.ko
Automatic driver start upon OS boot:
------------------------------------
vi /boot/loader.conf
# insert 'if_ena_load="YES"' in the above file
cp if_ena.ko /boot/modules/
sync; sleep 30;
Then restart the OS (reboot and reconnect).
Driver update (if the kernel was built with ENA):
-------------------------------------------------
vi /boot/loader.conf
# insert 'if_ena_load="YES"' in the above file
cp if_ena.ko /boot/modules/
# remove old module
rm /boot/kernel/if_ena.ko
sync; sleep 30;
Then restart the OS (reboot and reconnect).
Supported PCI vendor ID/device IDs:
===================================
1d0f:0ec2 - ENA PF
1d0f:1ec2 - ENA PF with LLQ support
1d0f:ec20 - ENA VF
1d0f:ec21 - ENA VF with LLQ support
ENA Source Code Directory Structure:
====================================
/*
  ena.[ch]            - Main FreeBSD kernel driver.
  ena_sysctl.[ch]     - ENA sysctl nodes for ENA configuration and statistics.
ena_com/*
  ena_com.[ch]        - Management communication layer. This layer is
                        responsible for handling all the management
                        (admin) communication between the device and the
                        driver.
  ena_eth_com.[ch]    - Tx/Rx data path.
  ena_admin_defs.h    - Definition of ENA management interface.
  ena_eth_io_defs.h   - Definition of ENA data path interface.
  ena_common_defs.h   - Common definitions for ena_com layer.
  ena_regs_defs.h     - Definition of ENA PCI memory-mapped (MMIO) registers.
  ena_plat.h          - Platform dependent code for FreeBSD.
Management Interface:
=====================
The ENA management interface is exposed by means of:
- PCIe Configuration Space
- Device Registers
- Admin Queue (AQ) and Admin Completion Queue (ACQ)
- Asynchronous Event Notification Queue (AENQ)
ENA device MMIO Registers are accessed only during driver
initialization and are not involved in further normal device
operation.
AQ is used for submitting management commands, and the
results/responses are reported asynchronously through ACQ.
ENA introduces a very small set of management commands with room for
vendor-specific extensions. Most of the management operations are
framed in a generic Get/Set feature command.
The following admin queue commands are supported:
- Create I/O submission queue
- Create I/O completion queue
- Destroy I/O submission queue
- Destroy I/O completion queue
- Get feature
- Set feature
- Configure AENQ
- Get statistics
Refer to ena_admin_defs.h for the list of supported Get/Set Feature
properties.
The Asynchronous Event Notification Queue (AENQ) is a uni-directional
queue used by the ENA device to send to the driver events that cannot
be reported using ACQ. AENQ events are subdivided into groups. Each
group may have multiple syndromes, as shown below.
The events are:
	Group			Syndrome
	Link state change	- X -
	Fatal error		- X -
	Notification		Suspend traffic
	Notification		Resume traffic
	Keep-Alive		- X -
ACQ and AENQ share the same MSI-X vector.
Keep-Alive is a special mechanism that allows monitoring of the
device's health. The driver maintains a watchdog (WD) handler which,
if fired, logs the current state and statistics then resets and
restarts the ENA device and driver. A Keep-Alive event is delivered by
the device every second. The driver re-arms the WD upon reception of a
Keep-Alive event. A missed Keep-Alive event causes the WD handler to
fire.
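The Keep-Alive logic above can be sketched as a small user-space model.
This is only an illustration: the structure, the function names and the
three-tick timeout are assumptions made for the example and are not taken
from the driver sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the Keep-Alive watchdog; not driver code. */
#define WD_TIMEOUT_TICKS 3	/* assumed: fire after 3 ticks of silence */

struct wd_model {
	uint64_t deadline;	/* tick at which the WD handler fires */
	unsigned resets;	/* number of modelled device resets */
};

/* Arm (or re-arm) the watchdog relative to the current tick. */
static void
wd_arm(struct wd_model *wd, uint64_t now)
{
	wd->deadline = now + WD_TIMEOUT_TICKS;
}

/* Called for every Keep-Alive event: re-arm the watchdog. */
static void
wd_keep_alive(struct wd_model *wd, uint64_t now)
{
	wd_arm(wd, now);
}

/* Called once per tick; returns true if the WD handler fired. */
static bool
wd_tick(struct wd_model *wd, uint64_t now)
{
	if (now < wd->deadline)
		return (false);
	/* A real handler would log state and statistics before resetting. */
	wd->resets++;
	wd_arm(wd, now);	/* the restarted device gets a fresh deadline */
	return (true);
}
```

In this model, as long as wd_keep_alive() is called more often than the
timeout, wd_tick() never reports a reset.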
Data Path Interface:
====================
I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx
SQ correspondingly). Each SQ has a completion queue (CQ) associated
with it.
The SQs and CQs are implemented as descriptor rings in contiguous
physical memory.
The ENA driver in 0.8.1 currently supports one Queue Operation mode for Tx SQs:
- Regular mode
* In this mode the Tx SQs reside in the host's memory. The ENA
device fetches the ENA Tx descriptors and packet data from host
memory.
The Rx SQs support only the regular mode.
The driver supports multi-queue for both Tx and Rx. This has various
benefits:
- Reduced CPU/thread/process contention on a given Ethernet interface.
- Cache miss rate on completion is reduced, particularly for data
cache lines that hold the mbuf structures.
- Increased process-level parallelism when handling received packets.
- Increased data cache hit rate, by steering kernel processing of
  packets to the CPU where the application thread consuming the
  packet is running.
- Hardware interrupt redirection.
Interrupt Modes:
================
The driver assigns a single MSI-X vector per queue pair (for both Tx
and Rx directions). The driver assigns an additional dedicated MSI-X vector
for management (for ACQ and AENQ).
Management interrupt registration is performed when the FreeBSD kernel
attaches the adapter, and it is de-registered when the adapter is
removed. I/O queue interrupt registration is performed when the FreeBSD
interface of the adapter is opened, and it is de-registered when the
interface is closed.
The management interrupt is named:
ena-mgmnt@pci:<PCI domain:bus:slot.function>
and for each queue pair, an interrupt is named:
<interface name>-TxRx-<queue index>
The ENA device operates in auto-mask and auto-clear interrupt
modes. That is, once MSI-X is delivered to the host, its Cause bit is
automatically cleared and the interrupt is masked. The driver unmasks
the interrupt after cleaning all Tx and Rx packets, or after the cleanup
routine has been called 8 times while handling a single interrupt.
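The bounded cleanup loop can be illustrated with the following sketch. It
is a user-space model under stated assumptions: the limit of 8 passes
comes from the text above, while the per-pass budget, the structure and
all names are hypothetical.

```c
#define CLEAN_LOOP_MAX	8	/* from the text: at most 8 cleanup passes */
#define PASS_BUDGET	4	/* hypothetical per-pass budget */

struct queue_model {
	int tx_pending;		/* Tx completions waiting to be cleaned */
	int rx_pending;		/* Rx packets waiting to be cleaned */
	int masked;		/* 1 while the interrupt is masked */
};

/* Clean up to 'budget' items and return how many were cleaned. */
static int
clean_some(int *pending, int budget)
{
	int done = *pending < budget ? *pending : budget;

	*pending -= done;
	return (done);
}

/* Model of one I/O interrupt; returns the number of cleanup passes. */
static int
handle_io_interrupt(struct queue_model *q)
{
	int passes = 0;

	q->masked = 1;	/* auto-mask: the vector is masked on delivery */
	do {
		clean_some(&q->tx_pending, PASS_BUDGET);
		clean_some(&q->rx_pending, PASS_BUDGET);
		passes++;
	} while ((q->tx_pending > 0 || q->rx_pending > 0) &&
	    passes < CLEAN_LOOP_MAX);
	q->masked = 0;	/* the driver unmasks the interrupt when done */
	return (passes);
}
```

Capping the number of passes keeps a single busy queue from holding the
handler indefinitely; in this model, any work left over is picked up on
the next interrupt.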
Statistics:
===========
The user can obtain ENA device and driver statistics using sysctl.
MTU:
====
The driver supports an arbitrarily large MTU with a maximum that is
negotiated with the device. The driver configures MTU using the
SetFeature command (ENA_ADMIN_MTU property). The user can change MTU
via ifconfig.
Stateless Offloads:
===================
The ENA driver supports:
- TSO over IPv4/IPv6
- IPv4 header checksum offload
- TCP/UDP over IPv4/IPv6 checksum offloads
RSS:
====
- The ENA device supports RSS that allows flexible Rx traffic
steering.
- Toeplitz and CRC32 hash functions are supported.
- Different combinations of L2/L3/L4 fields can be configured as
inputs for hash functions.
- The driver configures RSS settings using the AQ SetFeature command
(ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and
ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG properties).
- The driver uses the CRC32 hash function by default; in 0.8.1 the
  hash function cannot be changed by the user.
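As a general illustration of how an RSS redirection table steers traffic,
the sketch below maps a packet hash to an Rx queue. It does not reflect
the ENA admin structures: the table size, the round-robin fill and the
function names are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define RSS_TABLE_SIZE 128	/* hypothetical size; must be a power of two */

/* Spread the table entries over the queues in round-robin order. */
static void
rss_table_fill(uint16_t table[RSS_TABLE_SIZE], uint16_t num_queues)
{
	for (size_t i = 0; i < RSS_TABLE_SIZE; i++)
		table[i] = (uint16_t)(i % num_queues);
}

/* The low bits of the hash select a table entry, which names the queue. */
static uint16_t
rss_pick_queue(const uint16_t table[RSS_TABLE_SIZE], uint32_t hash)
{
	return (table[hash & (RSS_TABLE_SIZE - 1)]);
}
```

Because the hash is computed over the configured header fields, packets
of the same flow land on the same table entry and therefore on the same
queue.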
DATA PATH:
==========
Tx:
---
ena_mq_start() is called by the stack. This function does the following:
- Assigns the mbuf to the proper Tx queue according to hash type and flowid.
- Puts the packet in the drbr (multi-producer, {single, multi}-consumer
  lock-less ring buffer).
- If the drbr was empty before the packet was added, tries to acquire the
  Tx queue lock and, on success, runs ena_start_xmit() to send the packet
  that was just added.
- If the lock could not be acquired, enqueues the ena_deferred_mq_start()
  task, which runs ena_start_xmit() in a different thread and drains all
  of the packets in the drbr.
- ena_start_xmit() performs the following steps:
  * Checks whether there is still enough space in the hw queues; if not,
    calls ena_tx_cleanup() directly.
  * Calls ena_xmit_mbuf() for all mbufs in the drbr or until a
    transmission error occurs.
  * ena_xmit_mbuf() sends mbufs to the ENA device in the following steps:
    + Mbufs are mapped and, if necessary, defragmented for the DMA
      transactions.
    + A new request ID is allocated from the empty req_id ring. The
      request ID is the index of the packet in the Tx info. This is used
      for out-of-order Tx completions.
    + The packet is added to the proper place in the Tx ring.
    + ena_com_prepare_tx(), an ENA communication layer function, is
      called. It converts the ena_bufs to ENA descriptors (and adds meta
      ENA descriptors as needed).
      # This function also copies the ENA descriptors and the push buffer
        to the device memory space (if in push mode).
  * Writes doorbells to the ENA device.
  * After emptying the drbr, if the hw Tx queue is low on space, calls the
    ena_tx_cleanup() routine.
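The first steps of the transmit entry point can be sketched as follows.
This is a simplified user-space model, not the driver code. Several
details are assumptions: falling back to the current CPU when the mbuf
carries no hash, treating a non-empty drbr like a failed lock attempt,
and every name used below.

```c
#include <stdbool.h>
#include <stdint.h>

enum tx_action { TX_INLINE, TX_DEFERRED };

struct txq_model {
	unsigned queued;	/* packets waiting in the software ring */
	bool locked;		/* stand-in for the Tx queue lock */
};

/* Pick a Tx queue: by flowid if a hash is present, else by CPU (assumed). */
static unsigned
select_txq(bool has_hash, uint32_t flowid, unsigned cpu, unsigned num_queues)
{
	return ((has_hash ? flowid : cpu) % num_queues);
}

/* Enqueue one packet and decide whether to send inline or to defer. */
static enum tx_action
mq_start_model(struct txq_model *q)
{
	bool was_empty = (q->queued == 0);

	q->queued++;			/* put the packet in the ring */
	if (was_empty && !q->locked) {	/* modelled try-lock succeeded */
		q->locked = true;
		q->queued = 0;		/* send everything that is queued */
		q->locked = false;
		return (TX_INLINE);
	}
	/* Otherwise a deferred task drains the ring later. */
	return (TX_DEFERRED);
}
```

Sending inline only when the ring was empty and the lock is free keeps
the common uncontended case fast, while contended producers only enqueue
and leave the actual transmission to another thread.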
When the ENA device finishes sending the packet, a completion
interrupt is raised:
- The interrupt handler cleans Rx and Tx descriptors in a loop until all
  descriptors are cleaned up or the number of loop iterations exceeds the
  maximum value.
- The ena_tx_cleanup() function is called. This function handles the
  completion descriptors generated by the ENA, with a single completion
  descriptor per completed packet.
  * req_id is retrieved from the completion descriptor. The tx_info of
    the packet is retrieved via the req_id. The data buffers are
    unmapped and req_id is returned to the empty req_id ring.
  * The function stops when all completion descriptors are handled or
    the given budget is depleted.
- All interrupts are unmasked.
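The way request IDs move between the empty req_id ring and the Tx info
can be sketched with a small model. It is a minimal illustration assuming
a ring of four entries and integer packet handles; the structure and the
function names are hypothetical and do not come from the driver.

```c
#include <stdint.h>

#define TXQ_SIZE 4	/* hypothetical ring size; must be a power of two */

struct tx_ring_model {
	uint16_t free_req_ids[TXQ_SIZE];	/* the empty req_id ring */
	uint16_t next_to_use;	/* where the next free ID is taken from */
	uint16_t next_to_clean;	/* where a completed ID is returned to */
	uint16_t in_flight;	/* IDs currently owned by the device */
	int tx_info[TXQ_SIZE];	/* per-packet info, indexed by req_id */
};

static void
ring_init(struct tx_ring_model *r)
{
	for (uint16_t i = 0; i < TXQ_SIZE; i++) {
		r->free_req_ids[i] = i;
		r->tx_info[i] = -1;
	}
	r->next_to_use = r->next_to_clean = r->in_flight = 0;
}

/* Transmit side: take a free req_id, or return -1 if none is left. */
static int
ring_alloc(struct tx_ring_model *r, int pkt)
{
	if (r->in_flight == TXQ_SIZE)
		return (-1);
	uint16_t req_id = r->free_req_ids[r->next_to_use];
	r->next_to_use = (r->next_to_use + 1) & (TXQ_SIZE - 1);
	r->tx_info[req_id] = pkt;
	r->in_flight++;
	return (req_id);
}

/* Completion side: look up the packet and give the req_id back. */
static int
ring_complete(struct tx_ring_model *r, uint16_t req_id)
{
	int pkt = r->tx_info[req_id];

	r->tx_info[req_id] = -1;
	r->free_req_ids[r->next_to_clean] = req_id;
	r->next_to_clean = (r->next_to_clean + 1) & (TXQ_SIZE - 1);
	r->in_flight--;
	return (pkt);
}
```

Because completed IDs are appended to the ring in completion order,
packets may complete out of order and each ID is still reused only after
it has been returned.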
Rx:
---
When a packet is received from the ENA device:
- The interrupt handler cleans Rx and Tx descriptors in a loop until all
  descriptors are cleaned up or the global number of loop iterations
  exceeds the maximum value.
- The ena_rx_cleanup() function is called. This function calls
  ena_com_rx_pkt(), an ENA communication layer function, which returns the
  number of descriptors used for a new unhandled packet, and zero if
  no new packet is found.
- Then it calls the ena_rx_mbuf() function:
  The new mbuf is updated with the necessary information (protocol,
  checksum hw verify result, etc.).
- The mbuf is then passed to the network stack using the ifp->if_input
  function, or using tcp_lro_rx() if LRO is enabled and the packet is of
  type TCP/IP with the TCP checksum computed by the hardware.
- The function stops when all packets are handled or the given budget is
  depleted.
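The budgeted receive loop can be modelled as shown below. This is a sketch
under assumptions: rx_next_pkt() stands in for the communication layer
call that returns the descriptor count of the next packet (zero when there
is none), the budget is counted in packets, and all names are
hypothetical.

```c
#include <stddef.h>

struct rx_model {
	const int *desc_counts;	/* descriptors used by each pending packet */
	size_t npkts;		/* number of pending packets */
	size_t next;		/* index of the next unhandled packet */
};

/* Returns the descriptor count of the next packet, or 0 if none is left. */
static int
rx_next_pkt(struct rx_model *rx)
{
	if (rx->next == rx->npkts)
		return (0);
	return (rx->desc_counts[rx->next++]);
}

/* Handle packets until none are left or the budget is used up. */
static int
rx_cleanup_model(struct rx_model *rx, int budget, int *descs_used)
{
	int handled = 0;

	*descs_used = 0;
	while (handled < budget) {
		int n = rx_next_pkt(rx);

		if (n == 0)
			break;		/* no new packet found */
		*descs_used += n;
		/* A real driver would build the mbuf and pass it up here. */
		handled++;
	}
	return (handled);
}
```

In this model, a caller that receives a return value equal to the budget
cannot tell whether more packets are pending and has to call the function
again.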
Unsupported features:
=====================
- LLQ support
- RSS configuration by user
Known issues:
=============
- The FLOWTABLE kernel option (per-CPU routing cache) leads to a system
  crash on both FreeBSD 11 and FreeBSD 12-CURRENT.