
SIGABRT when opening mail #284

Open
frasertweedale opened this issue Apr 5, 2019 · 17 comments

@frasertweedale
Member

frasertweedale commented Apr 5, 2019

Describe the bug
purebred terminated with SIGABRT when I opened a mail.

Apr 05 12:26:35 T470s systemd[1]: Started Process Core Dump (PID 19268/UID 0).
Apr 05 12:26:36 T470s systemd-coredump[19269]: Process 19235 (purebred-linux-) of user 1000 dumped core.
                                               
    Stack trace of thread 19243:
    #0  0x00007fbb790aaeab raise (libc.so.6)
    #1  0x00007fbb790955b9 abort (libc.so.6)
    #2  0x00007fbb7a315931 talloc_abort.cold.19 (libtalloc.so.2)
    #3  0x00007fbb7a31604d _talloc_steal_loc.cold.47 (libtalloc.so.2)
    #4  0x00000000006a00f3 n/a (/home/ftweedal/.cache/purebred/purebred-linux-x86_64)
Apr 05 12:26:36 T470s abrtd[8533]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ccpp-2019-01-22-10:24:24.235958-10186'
Apr 05 12:26:37 T470s abrt-notification[19333]: Process 19235 (purebred-linux-x86_64) crashed in _talloc_set_destructor.cold.20()

To Reproduce
Not reliably reproducible. May be related to something being GC'd that we need to hang on to.

@frasertweedale changed the title from "segv when opening mail" to "SIGABRT when opening mail" on Apr 6, 2019
@romanofski
Member

Does it happen when you open a search with a gazillion threads and, while purebred is still calculating the length of the list, you open the first or second mail?

@frasertweedale
Member Author

@romanofski good question; worth checking that. Did you encounter it?

@romanofski
Member

Yeah, pretty much reliably. Steps:

  1. Search for *
  2. Start opening the first one or two mails, b00m.

@frasertweedale
Member Author

@romanofski OK awesome, that's good info! I've got some time off this week, I'll try and sort it out :)

@frasertweedale
Member Author

Info from further investigation:

  • It only occurs while the background thread is traversing the spine of the thread list to determine the number of threads (see the sketch after this list)

  • If you wait for the "num threads" update to complete (background thread is done), then execute another large search, the problem can occur (i.e. it is not restricted to the initial search)

  • The problem occurs upon opening the second unread mail (i.e. with tag "unread", or whatever nmNewTag is). You can precede or interleave opening of non-unread mails without triggering the abort.

  • This "state" (so to speak) does not persist across searches. If you start a long search, open one unread mail, background spine traversal completes, then open a second unread mail, the problem does not occur. Similarly, executing a second long search "starts over"; two unread mails must be opened during the spine traversal of the second search.

@romanofski
Member

Yes. I can reproduce those observations too.

@frasertweedale
Member Author

I won't have time to investigate further for a few weeks... I fear I will need to instrument the heck out of hs-notmuch to find out what's going on.
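
If it comes to that, one throwaway way to instrument is wrapping the suspicious calls with eventlog markers (a generic sketch, not hs-notmuch's API; run the binary with +RTS -l and inspect the eventlog):

    import Debug.Trace (traceEventIO)

    -- Emit begin/end markers around an IO action so the last marker before
    -- the abort points at the offending call.
    traced :: String -> IO a -> IO a
    traced label action = do
      traceEventIO ("begin " ++ label)
      r <- action
      traceEventIO ("end " ++ label)
      pure r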

@frasertweedale
Member Author

Dumping some links about Xapian thread safety here. Probably not related to this issue but I just stumbled across them and don't want to forget about them:

@frasertweedale
Member Author

frasertweedale commented Jul 13, 2019

Without saying I'm giving up on finding and understanding the cause, I'm pondering whether to implement the following to see if it makes the error disappear:

  1. Spawn a thread to perform Database write actions. Writes are performed only by this thread.
  2. When we want to modify the database (e.g. edit tags, index file), send a message to that thread and it will do it on our behalf. Relevant message types/constructors would have to be defined (rough sketch below).
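
A rough sketch of that design, with made-up message constructors and names (not purebred's actual API), using a plain Chan as the mailbox:

    import Control.Concurrent (forkIO)
    import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
    import Control.Monad (forever)

    -- Hypothetical write requests; the real constructors would mirror the
    -- operations purebred performs (edit tags, index a file, ...).
    data DbWrite
      = SetTags FilePath [String]
      | IndexFile FilePath

    -- Fork the single writer thread. It is the only code that ever touches
    -- the read/write database handle; everyone else just sends messages.
    startWriter :: (DbWrite -> IO ()) -> IO (Chan DbWrite)
    startWriter perform = do
      chan <- newChan
      _ <- forkIO (forever (readChan chan >>= perform))
      pure chan

The Brick event handler would then call, say, writeChan chan (SetTags path tags) instead of touching the database itself.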

I'm suspicious that having multiple RW database handles is a factor in this bug. But I don't have any hard evidence. The writes are currently performed in the Brick event loop, so they are certainly serialised, but maybe something isn't being GC'd promptly enough (or at all). Then again, the fact this only happens during the background result traversal does weigh against this theory somewhat.

update 1: using a single writer thread holding a single r/w database handle does make the issue go away. But the changes only get written when the database gets closed (at termination of the whole program), so there's not much point to the change. Moving to per-event withDatabase in the single writer thread, such that each r/w database handle is closed when GC'd, results in the same problem (which is not all that surprising).
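
For the per-event variant, a hedged sketch building on the writer-thread code above (the withDatabase argument stands in for an assumed bracket-style open/close helper, not necessarily hs-notmuch's real signature):

    -- Re-open and close the r/w handle around every write request, instead
    -- of holding one handle open for the lifetime of the program.
    writerLoop :: ((db -> IO ()) -> IO ()) -> Chan DbWrite -> (db -> DbWrite -> IO ()) -> IO ()
    writerLoop withDatabase chan perform =
      forever (readChan chan >>= \msg -> withDatabase (\db -> perform db msg))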

@romanofski
Member

@frasertweedale sounds like a tough nut to crack. Out of curiosity, is the problem triggered by GC or by the absence of it? If I understood your comment right, it's because data is GC'd that shouldn't have been garbage collected (yet)?

@frasertweedale
Member Author

I don't think that's it, but it may be somehow related. I'm really still no closer to working this out :)

@frasertweedale
Member Author

I just thought, maybe it has to do with the thread safety of talloc? One very quickly finds that talloc is not thread safe: https://talloc.samba.org/talloc/doc/html/libtalloc__threads.html.

@romanofski
Member

Maybe this could help too (if you haven't used it already): https://github.com/bgamari/ghc-debug

@romanofski added this to the 1.0 milestone on Jun 23, 2020
@romanofski
Member

Added this to our 1.0 milestone, since, well, we can't release 1.0 with this still plaguing us. Once I've chewed through the remaining two issues, I'll see if I can help with this in any way.

@romanofski
Member

@frasertweedale what about putting a build flag around the lazy loading feature? I haven't checked how tricky/cumbersome that would be, but it would help us release a first version, give us some exposure and perhaps help from others? Then it could still be activated at compile time for peeps who want to use and hack on it?
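
If we go that way, a minimal sketch of the gating (assuming a cabal flag that adds -DLAZY_VECTOR to cpp-options when enabled; the flag name and module are made up for illustration):

    {-# LANGUAGE CPP #-}
    module SearchGate (useLazyVector) where

    -- Compile-time switch for the lazy-vector code path.
    useLazyVector :: Bool
    #ifdef LAZY_VECTOR
    useLazyVector = True   -- lazy, on-demand loading (the code path this issue is about)
    #else
    useLazyVector = False  -- eager loading; avoids the concurrent spine traversal
    #endif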

romanofski added a commit that referenced this issue Jun 29, 2021
This puts a build flag around the lazy vector feature. Motivation behind
is to make the first release, get some publicity and hopefully get more
people interested to hack on Purebred. Possibly help with fixing the
bug.

Relates to #284
romanofski added a commit that referenced this issue Sep 27, 2021
This puts a build flag around the lazy vector feature. Motivation behind
is to make the first release, get some publicity and hopefully get more
people interested to hack on Purebred. Possibly help with fixing the
bug.

Relates to #284
@romanofski modified the milestones: 1.0 → Future Feature on Sep 28, 2021
@frasertweedale
Member Author

Can confirm this still happens with GHC 9.0.

@romanofski
Copy link
Member

@frasertweedale do you reckon we might need to look at a different architecture here for how we communicate with notmuch? I remember the crazy ideas we had were something like using a messaging system. But it seems wrestling with the memory management is a tough nut to crack.
