Incorrect priors get loaded intermittently (Mac/OpenCL?) #811
Comments
Oh. KillerDucky suggested using the check backend from #146. And indeed, using the same weights and position:
So some more testing with --backend-opts="freq=1.0" and various networks and
There are known bad OpenCL drivers out there. IIRC that was the original reason the check backend existed in lczero: to force such contributors to crash out if their OpenCL started doing bad things. But you mentioned on Discord that it seems most problematic with 80-channel policy head nets (T40/T37), so that could potentially be a real bug in the OpenCL implementation, or maybe the very large fully connected layer after the 80-channel policy head just triggers bugs in your OpenCL drivers.
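The cross-checking idea mentioned above (compare what one backend returns against a reference and fail loudly on disagreement) can be illustrated with a small sketch. This is not lc0's check backend code: the function name, the tolerance, and the prior values in the example are all hypothetical and only show the shape of such a comparison.

```python
from typing import Dict, List, Tuple


def compare_priors(
    reference: Dict[str, float],
    candidate: Dict[str, float],
    abs_tol: float = 1e-3,  # hypothetical tolerance, not an lc0 default
) -> List[Tuple[str, float, float]]:
    """Return (move, reference_prior, candidate_prior) for each disagreement.

    A move missing from one side is treated as having prior 0.0 there.
    """
    mismatches = []
    for move in sorted(set(reference) | set(candidate)):
        ref = reference.get(move, 0.0)
        cand = candidate.get(move, 0.0)
        if abs(ref - cand) > abs_tol:
            mismatches.append((move, ref, cand))
    return mismatches


if __name__ == "__main__":
    # Made-up numbers for illustration only; not taken from any real net.
    reference = {"h4h5": 0.60, "c4c5": 0.10}
    suspicious = {"h4h5": 0.05, "c4c5": 0.70}
    for move, ref, cand in compare_priors(reference, suspicious):
        print(f"{move}: reference={ref:.3f} candidate={cand:.3f}")
```

Comparing with an absolute tolerance rather than exact equality matters here because two correct backends can still differ slightly in floating-point rounding.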
Cannot reproduce on my Ubuntu system, so our implementation is probably ok. Guessing Mac OpenCL driver bug. We never found what the bug is for LZGo.
Yeah, LZGo always had Mac OpenCL issues, and gcp says there's not much we can do, especially with Apple deprecating OpenCL in 10.14 Mojave in favor of their own Metal: leela-zero/leela-zero#1517 (comment). Not sure if this needs to be highlighted more for people using T40 on Macs, but the problem seems to be avoided with T50+. For reference if people are running into problems:
For reference, 37001 and 40001 both trigger errors, so this is not something that appears as the weights develop:
Here is an example of the OpenCL backend failing on macOS with a batch_size of 4, NN 41585. (Net 11258 works fine with this tuning and batch_size.)
Tuning was
Crashlog is
I was checking in on T40's progress with current master daf933e and #237's drawn position by trying to force the top prior move to be picked for each side with --fpu-value=100:
However, one time out of ~100 attempts (restarting lc0 each time), it doesn't lead to a draw:
Then with the same lc0 instance, I added some more nodes to see if it would change, then jumped ahead to where white played the wrong move c4c5 (instead of a perpetual check by moving the rook on the h-file):
For some reason the highest prior move was indeed c4c5… But then I quit and jumped straight to that "wrong move" position:
The correct priors for h4h5 and c4c5 do get loaded…
I then tried to reproduce this issue again while writing up this comment. Again, maybe 1 out of 100 attempts resulted in a not-drawn position, but this time because of a different move, d2e2:
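For anyone who wants to automate the restart-and-retry procedure described in this report, here is a rough sketch that starts a fresh engine process per attempt over the standard UCI protocol and tallies the first move returned. The helper names are hypothetical, the engine path and position are supplied by the caller, and `go nodes 1` is only an assumed way to make the engine pick by prior; this is a sketch, not the exact procedure used above.

```python
import subprocess
from collections import Counter
from typing import List


def first_bestmove(engine_cmd: List[str], position_cmd: str,
                   go_cmd: str = "go nodes 1") -> str:
    """Start a fresh engine process, run one search, return its bestmove."""
    proc = subprocess.Popen(engine_cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True, bufsize=1)
    try:
        # Standard UCI handshake, then the position and search commands.
        proc.stdin.write(f"uci\n{position_cmd}\n{go_cmd}\n")
        proc.stdin.flush()
        for line in proc.stdout:
            if line.startswith("bestmove"):
                return line.split()[1]
        raise RuntimeError("engine exited without reporting a bestmove")
    finally:
        if proc.poll() is None:
            try:
                proc.stdin.write("quit\n")
                proc.stdin.flush()
            except OSError:
                pass  # engine already closed its end of the pipe
        proc.wait()


def count_bestmoves(engine_cmd: List[str], position_cmd: str,
                    attempts: int = 100) -> Counter:
    """Restart the engine `attempts` times and tally the moves it picks."""
    return Counter(first_bestmove(engine_cmd, position_cmd)
                   for _ in range(attempts))
```

A call might look like `count_bestmoves(["./lc0", "--fpu-value=100"], "position fen <FEN>", attempts=100)`, where the binary path is an assumption and the FEN of the position under test is left for the reader to fill in. If the result contains more than one distinct move for the same weights and position, that points at the kind of intermittent prior corruption reported here.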