.TH nvmex 8 "2021-03-28"
.SH "NAME"
nvmex \- a metrics exporter for Nvidia GPUs
.SH "SYNOPSIS"
.nh
.na
.HP
.B nvmex
[\fB\-CLPScdfh\fR]
[\fB\-l\ \fIfile\fR]
[\fB\-n\ \fIlist\fR]
[\fB\-p\ \fIport\fR]
[\fB\-s\ \fIip\fR]
[\fB\-v\ DEBUG\fR|\fBINFO\fR|\fBWARN\fR|\fBERROR\fR|\fBFATAL\fR]
.ad
.hy
.SH "DESCRIPTION"
.B nvmex
is a \fBm\fRetrics \fBex\fRporter for \fBnv\fRidia graphics processing
units (GPUs).
To monitor the health and resource consumption of GPUs, it uses the
Nvidia Management Library (NVML), which is usually distributed with the
Nvidia driver or a similar package as libnvidia\-ml.so.1 (e.g. Solaris:
driver/graphics/nvidia, Ubuntu: libnvidia\-compute\-*). Collected data
can be exposed via HTTP in Prometheus exposition format [1] using the
endpoint URL \fBhttp://\fIhostname\fB:\fI9400\fB/metrics\fR and thus
visualized e.g. using Grafana [2], Netdata [3], or Zabbix [4].
In contrast to Nvidia\'s dcgm\-exporter, \fBnvmex\fR is written in plain C.
Compared to dcgm\-exporter it is therefore extremely lightweight, does not
fill your disks with huge error logs, and does not hog the CPU. You may run
it on bare metal, or in any zone, container, or pod which has access to at
least one GPU.
.PP
\fBnvmex\fR operates in 3 modes:
.RS 2
.IP \fBdefault\fR 2
Collects all data as it would for a /metrics HTTP request, prints
it to the standard output, and exits.
.IP \fBforeground\fR
Starts the internal HTTP server to answer HTTP requests, but stays
attached to the console/terminal (i.e. standard input, output and error).
Use option \fB\-f\fR to request this mode.
.IP \fBdaemon\fR
Starts the internal HTTP server (daemon) to answer HTTP requests in the
background (fork\-exec model) in a new session, detaches from the
console/terminal, attaches standard input, output and error to /dev/null,
and finally exits with exit code \fB0\fR if the daemon is running as
desired. Remember: if you do not specify a logfile, all messages
emitted by the daemon are dropped.
Use option \fB\-d\fR to request this mode.
.RE
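.PP
As a rough sketch, the three modes might be invoked as shown here. This
assumes that \fBnvmex\fR is in the PATH; the logfile path is a hypothetical
example, not a default:
.PP
.nf
.RS 2
# default: print all metrics once to the standard output and exit
nvmex
# foreground: serve HTTP requests, stay attached to the terminal
nvmex \-f
# daemon: serve HTTP requests in the background, log to a file
nvmex \-d \-l /var/log/nvmex.log
.RE
.fi
.PP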
\fBnvmex\fR answers HTTP requests one after another to keep its
footprint on the system and the queried devices very small. So it is
recommended to adjust your firewalls and/or HTTP proxies accordingly.
If you need SSL or authentication, put an HTTP proxy like nginx in front
of it; for now \fBnvmex\fR should be kept small and simple.
.PP
When \fBnvmex\fR runs in \fBforeground\fR or \fBdaemon\fR mode, it also
reports by default how long it took to collect and format the data of
each of the following collectors:
.RS 2
.TP 2
.B default
HTTP related statistics
.TP
.B process
\fBnvmex\fR process related data
.TP
.B nvidia
data provided by NVML/nvidia driver
.TP
.B libprom
everything (i.e. overall incl. default, process and nvidia).
.RE
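.PP
For illustration, a running instance could be queried roughly like this.
The sketch assumes that \fBnvmex\fR runs in foreground or daemon mode on the
local host, listens on port 9400 as in the endpoint URL shown earlier, and
that \fBcurl\fR(1) is installed:
.PP
.nf
.RS 2
# fetch all metrics once and show the first lines of the response
curl \-s http://127.0.0.1:9400/metrics | head
.RE
.fi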
.SH "OPTIONS"
.TP 4
.B \-L
.PD 0
.TP
.B \-\-no\-scrapetime
Disable the overall scrapetime metrics (libprom collector), i.e. the time
elapsed when scraping all the required data. This option is honored only
if collecting the scrapetimes of all other collectors is disabled as well.
.TP
.B \-S
.PD 0
.TP
.B \-\-no\-scrapetime\-all
Disable recording the scrapetime of each collector separately. The
collectors are: \fBdefault\fR, which collects HTTP request/response
statistics; the optional \fBprocess\fR collector, which records metrics
about the nvmex process itself; the \fBnvidia\fR collector, which queries
all GPU devices for metrics; and finally the \fBlibprom\fR collector,
which just records the time it took to collect and prom\-format the data
of all other collectors.
.TP
.B \-c
.PD 0
.TP
.B \-\-compact
According to the Prometheus exposition format [1], the HELP and TYPE
comments describing a metric are optional. With this option they are
omitted from the HTTP response, which saves bandwidth and processing time.
.TP
.B \-d
.PD 0
.TP
.B \-\-daemon
Run \fBnvmex\fR in \fBdaemon\fR mode.
.TP
.B \-f
.PD 0
.TP
.B \-\-foreground
Run \fBnvmex\fR in \fBforeground\fR mode.
.TP
.B \-h
.PD 0
.TP
.B \-\-help
Print a short help summary to the standard output and exit.
.TP
.BI \-l " file"
.PD 0
.TP
.BI \-\-logfile= file
Log all messages to the given \fIfile\fR when the main process is running.
.TP
.BI \-n " list"
.PD 0
.TP
.BI \-\-no\-metric= list
Skip all GPU metrics given in the comma\-separated \fIlist\fR of metric names.
Currently supported are:
.RS 4
.TP 4
.B version
All \fBnvmex_version\fR metrics (nvidia collector).
.TP 4
.B gpuinfo
All \fBnvmex_gpu_info\fR metrics (nvidia collector).
.TP 4
.B clock
All \fBnvmex_clock_*\fR metrics (nvidia collector).
.TP 4
.B bar1mem
All \fBnvmex_bar1mem_bytes\fR metrics (nvidia collector).
.TP 4
.B temperature
All \fBnvmex_temperature_celsius\fR metrics (nvidia collector).
.TP 4
.B power
All \fBnvmex_power_*\fR and \fBnvmex_perf\fR metrics (nvidia collector).
.TP 4
.B fan
All \fBnvmex_fan_speed_pct\fR metrics (nvidia collector).
.TP 4
.B utilization
All \fBnvmex_util_pct\fR metrics (nvidia collector).
.TP 4
.B pcie
All \fBnvmex_pcie_*\fR metrics (nvidia collector). Usually slow.
.TP 4
.B violation
All \fBnvmex_violation_penalty_ms\fR metrics (nvidia collector).
.TP 4
.B memory
All \fBnvmex_memory_bytes\fR metrics (nvidia collector).
.TP 4
.B ecc
All \fBnvmex_ecc_*\fR metrics (nvidia collector).
.TP 4
.B nvlink
All \fBnvmex_nvlink_*\fR metrics (nvidia collector).
.TP 4
.B encstat
All \fBnvmex_enc_stat_*\fR metrics (nvidia collector).
.TP 4
.B encsession
All \fBnvmex_enc_sessions_*\fR metrics (nvidia collector).
.TP 4
.B fbcstat
All \fBnvmex_fbc_stat_*\fR metrics (nvidia collector).
.TP 4
.B fbcsession
All \fBnvmex_fbc_sessions_*\fR metrics (nvidia collector).
.TP 4
.B process
All \fBnvmex_process_*\fR metrics (process collector).
.RE
.TP
.BI \-p " num"
.PD 0
.TP
.BI \-\-port= num
Bind to port \fInum\fR and listen there for HTTP requests. Note that a port
below 1024 usually requires additional privileges.
.TP
.BI \-s " IP"
.PD 0
.TP
.BI \-\-source= IP
Bind the HTTP server to the given \fIIP\fR address only. By default
it binds to 0.0.0.0, i.e. all IPs configured on this host/zone/container.
To enable IPv6, specify an IPv6 address here (\fB::\fR is for IPv6 what
0.0.0.0 is for IPv4).
.TP
.BI \-v " level"
.PD 0
.TP
.BI \-\-verbosity= level
Set the message verbosity to the given \fIlevel\fR. Accepted tokens are
\fBDEBUG\fR, \fBINFO\fR, \fBWARN\fR, \fBERROR\fR, \fBFATAL\fR and for
convenience \fB1\fR..\fB5\fR respectively.
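.PP
As a hedged example, several of the options above might be combined as
sketched here. The address and the port number are arbitrary values chosen
for illustration only:
.PP
.nf
.RS 4
# foreground mode, bound to the loopback address on port 9500,
# without HELP/TYPE comments and without the pcie and nvlink metrics
nvmex \-f \-c \-s 127.0.0.1 \-p 9500 \-n pcie,nvlink \-v WARN
.RE
.fi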
.SH "EXIT STATUS"
.TP 4
.B 0
on success.
.TP
.B 1
if an unexpected error occurred during startup (other problem).
.TP
.B 96
if an invalid option or option value got passed (config problem).
.TP
.B 100
if the logfile is not writable or port access is not allowed (permission problem).
.TP
.B 101
if the NVML could not be initialized (e.g. because the related kernel module
is not loaded), or no Nvidia GPU was found (temporary problem).
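.PP
A start script might evaluate the exit status roughly as follows. This is
a sketch only: it assumes a POSIX shell, and the logfile path and the
messages are hypothetical:
.PP
.nf
.RS 4
nvmex \-d \-l /var/log/nvmex.log
rc=$?
case $rc in
    0)   echo "nvmex daemon started" ;;
    96)  echo "invalid option or option value" >&2 ;;
    100) echo "logfile not writable or port access not allowed" >&2 ;;
    101) echo "NVML not initialized or no GPU found" >&2 ;;
    *)   echo "unexpected error (exit code $rc)" >&2 ;;
esac
.RE
.fi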
.SH "ENVIRONMENT"
.TP 4
.B PROM_LOG_LEVEL
If no verbosity level is specified via option \fB\-v\ \fIlevel\fR, this
environment variable is checked for a verbosity value. If it holds a
valid one, the verbosity level is set accordingly; otherwise \fBINFO\fR
level is used.
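.PP
For example, the verbosity might be set via the environment instead of
\fB\-v\fR as sketched here. This assumes a POSIX shell and that the variable
accepts the same tokens as \fB\-v\fR:
.PP
.nf
.RS 4
PROM_LOG_LEVEL=DEBUG nvmex \-f
.RE
.fi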
.SH "FILES"
.TP 4
.B /dev/nvidiaN /dev/nvidiactl
The character special devices used by the NVML to access GPUs.
.SH "NOTES"
\fBnvmex\fR collects static data like min and max GPU temperature or power
limits only once, prom\-formats them, and from then on just copies the cached
strings on each request. So if the kernel module gets reloaded, a GPU gets
reset, or the GPU enumeration changes, one should restart \fBnvmex\fR as well.
.SH "BUGS"
https://github.com/jelmd/nvmex is the official source code repository
for \fBnvmex\fR. If you need some new features, or metrics, or bug fixes,
please feel free to create an issue there using
https://github.com/jelmd/nvmex/issues .
.SH "AUTHORS"
Jens Elkner
.SH "SEE ALSO"
[1]\ https://prometheus.io/docs/instrumenting/exposition_formats/
.br
[2]\ https://grafana.com/
.br
[3]\ https://www.netdata.cloud/
.br
[4]\ https://www.zabbix.com/
.\" # vim: ts=4 sw=4 filetype=nroff