From @pfons on April 13, 2016 2:29
During recovery, when opening the snapshot file, the server assumes that any error it encounters means that no snapshot was ever created (function get_initial_state in file Shim.ml). However, opening a file can also fail for transient OS reasons, such as insufficient kernel memory (ENOMEM) or hitting the system-wide limit on open files (ENFILE). If such an error occurs during recovery, the server silently discards part of the persistent state (the disk snapshot) while still reading the rest of it (the disk log), which leads to safety violations.
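A minimal OCaml sketch of a stricter check, using a hypothetical helper open_snapshot_opt rather than the actual code in Shim.ml: only ENOENT is taken to mean that no snapshot exists, while any other error aborts recovery instead of silently dropping the snapshot.

```ocaml
(* Hedged sketch: open_snapshot_opt is a hypothetical helper, not the
   actual get_initial_state from Shim.ml. It distinguishes "no snapshot
   was ever written" (ENOENT) from transient OS failures. *)
let open_snapshot_opt (path : string) : in_channel option =
  match Unix.openfile path [Unix.O_RDONLY] 0 with
  | fd -> Some (Unix.in_channel_of_descr fd)
  | exception Unix.Unix_error (Unix.ENOENT, _, _) ->
    (* Only a genuinely missing file means "no snapshot was taken". *)
    None
  | exception Unix.Unix_error (err, _, _) ->
    (* ENOMEM, ENFILE, EACCES, ...: refuse to recover from partial
       persistent state rather than silently discarding the snapshot. *)
    failwith (Printf.sprintf "cannot open snapshot %s: %s"
                path (Unix.error_message err))
```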
The following sequence of steps should reproduce the bug:
a) issue client PUT requests so that snapshots and log entries are written to disk (~1000 requests)
b) stop all servers
c) remove all the permissions of the respective snapshot files (`chmod 000 verdi-snapshot-900*`)
d) restart all servers
e) issue one GET client request
In our tests, after this sequence of steps, the GET request issued after recovery (step e) returns a result as if the key-value store had never been populated in step a.
Beyond making replicas forget all of their state, it may also be possible to construct cases where replicas forget only part of their state, since after recovery they discard the snapshot but still replay the disk log.
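To see why replaying the log without the snapshot leaves a replica with partially forgotten state, here is a hedged sketch of a generic recovery loop; read_log, apply, and empty_state are hypothetical parameters, not Verdi's API. If the snapshot open silently falls back to empty_state while the log is still replayed, the replica restarts from the log suffix alone.

```ocaml
(* Hypothetical recovery flow; parameter names are assumptions.
   With the stricter open_snapshot_opt above, a snapshot that exists
   but cannot be opened aborts recovery instead of being dropped. *)
let recover ~snapshot_path ~read_log ~apply ~empty_state =
  let base =
    match open_snapshot_opt snapshot_path with
    | Some ic ->
      let state = Marshal.from_channel ic in  (* assumes marshaled state *)
      close_in ic;
      state
    | None -> empty_state  (* safe only when no snapshot was ever taken *)
  in
  (* Folding the disk log over an empty base state, when a snapshot did
     in fact exist, produces exactly the partial state described above. *)
  List.fold_left apply base (read_log ())
```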
Copied from original issue: uwplse/verdi#40