From @pfons on April 13, 2016 2:29
During recovery, when opening the snapshot file, the server assumes that any error it encounters means that no snapshot was ever created (function get_initial_state in file Shim.ml). However, opening a file can also fail for transient OS reasons, such as insufficient kernel memory (ENOMEM) or hitting the system-wide limit on open files (ENFILE). If such an error occurs during recovery, the server silently discards part of the persistent state (the disk snapshot) while still reading the rest of it (the disk log), which leads to safety violations.
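A minimal OCaml sketch of a stricter check, using a hypothetical helper open_snapshot_opt rather than the actual code in Shim.ml: only ENOENT is taken to mean that no snapshot exists, while any other error aborts recovery instead of silently dropping the snapshot.

```ocaml
(* Hedged sketch: open_snapshot_opt is a hypothetical helper, not the
   actual get_initial_state from Shim.ml. It distinguishes "no snapshot
   was ever written" (ENOENT) from transient OS failures. *)
let open_snapshot_opt (path : string) : in_channel option =
  match Unix.openfile path [Unix.O_RDONLY] 0 with
  | fd -> Some (Unix.in_channel_of_descr fd)
  | exception Unix.Unix_error (Unix.ENOENT, _, _) ->
    (* Only a genuinely missing file means "no snapshot was taken". *)
    None
  | exception Unix.Unix_error (err, _, _) ->
    (* ENOMEM, ENFILE, EACCES, ...: refuse to recover from partial
       persistent state rather than silently discarding the snapshot. *)
    failwith (Printf.sprintf "cannot open snapshot %s: %s"
                path (Unix.error_message err))
```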
The following sequence of steps should reproduce the bug:
a) issue client PUT requests so that snapshots and log entries are written to disk (~1000 requests)
b) stop all servers
c) remove all the permissions of the respective snapshot files (`chmod 000 verdi-snapshot-900*`)
d) restart all servers
e) issue one GET client request
In our tests, after this sequence of steps, the GET request issued after recovery (step e) returns a result as if the key-value store had never been populated in step a.
Beyond making replicas forget all of their state, it may also be possible to construct cases where replicas forget only part of their state, since after recovery they discard the snapshot but still replay the disk log.
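To see why replaying the log without the snapshot leaves a replica with partially forgotten state, here is a hedged sketch of a generic recovery loop; read_log, apply, and empty_state are hypothetical parameters, not Verdi's API. If the snapshot open silently falls back to empty_state while the log is still replayed, the replica restarts from the log suffix alone.

```ocaml
(* Hypothetical recovery flow; parameter names are assumptions.
   With the stricter open_snapshot_opt above, a snapshot that exists
   but cannot be opened aborts recovery instead of being dropped. *)
let recover ~snapshot_path ~read_log ~apply ~empty_state =
  let base =
    match open_snapshot_opt snapshot_path with
    | Some ic ->
      let state = Marshal.from_channel ic in  (* assumes marshaled state *)
      close_in ic;
      state
    | None -> empty_state  (* safe only when no snapshot was ever taken *)
  in
  (* Folding the disk log over an empty base state, when a snapshot did
     in fact exist, produces exactly the partial state described above. *)
  List.fold_left apply base (read_log ())
```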
Copied from original issue: uwplse/verdi#40