
Addressing non-deterministic behavior of post/leave? #9

Open
smarr opened this issue Apr 27, 2020 · 20 comments

Comments

@smarr
Contributor

smarr commented Apr 27, 2020

My understanding is that the size of a chat depends on the client, who invites all its friends.
The number of clients in a chat then also corresponds to the number of messages sent as a consequence of a post action.

To allow comparisons of different benchmark executions, I assume the overall number of messages needs to be identical.

So, this means a post action needs to predictably pick chats with the same sizes, which is currently not the case, since chat membership is racy, and the ordering of the chat list in a client also reflects execution order and timing.

For a moment, I was thinking, perhaps, chats need to be ordered by size, and that would be sufficient, but I don't think this would work, because the random picking of a chat would still be subject to the invitation/leave/membership race.

So, now I am thinking, actions need to follow a pattern such as: Invite (Post | Compute)* Leave?, where Post and Leave are only done on chats the client created directly.

I am unsure about the impact of this, and the dynamic behavior that would get lost.
One thing that would certainly get lost is that the number of messages triggered by a client would no longer change over time.

A very simple change would be to have one chat created per client on client creation, which can't be left, as well as making sure we distinguish client-created from invited-to chats.
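The "one fixed chat per client" variant could be sketched roughly like this (a language-agnostic Python sketch of the idea, not the Pony implementation; all class and method names here are made up). The point is that Post always fans out to the same, fixed membership, so message counts become deterministic:

```python
class Client:
    """Sketch: each client owns exactly one chat, created at client
    creation, which can never be left. Post and Leave only target the
    client-created chat, so the number of messages triggered by a
    Post is fixed for the whole run."""

    def __init__(self, client_id, friends):
        self.id = client_id
        # Created once, membership fixed, never left:
        self.own_chat = {"owner": client_id, "members": set(friends)}
        # Chats we were invited to are tracked separately and are
        # never the target of our own Post/Leave actions:
        self.invited_chats = []

    def act(self, dice):
        """dice in [0, 1): 25% Post, 55% Compute, rest Leave."""
        if dice < 0.25:
            # A Post always reaches the same number of members:
            return ("Post", len(self.own_chat["members"]))
        elif dice < 0.80:
            return ("Compute", 0)
        else:
            return ("Leave", 0)

c = Client(0, friends=[1, 2, 3])
assert c.act(0.1) == ("Post", 3)   # always 3 messages, every run
assert c.act(0.5) == ("Compute", 0)
```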

@josephnoir
Contributor

Would it help if we delayed activity in the chat until all expected clients joined?

Not sure if it is feasible to buffer actions (post, leave) in the chat until then, but when a chat is created we already know how many clients are expected to participate:

match behavior(_dice)
      | Post => _chats(index)?.post(None, accumulator)
      | Leave => _chats(index)?.leave(this, false, accumulator)
      | Compute => Fibonacci(35); accumulator.stop(Compute)
      | Invite =>
        var invitations: USize = s.next().usize() % _friends.size()
        if invitations == 0 then
          accumulator.stop(Invite)
        else
          let created = Chat(this, invitations)

          _chats.push(created)

          // Again convert the set values to an array, in order
          // to be able to use shuffle from rand
          let f = _friends.clone()
          _rand.shuffle[Client](f)
          accumulator.bump(invitations)

          for k in Range[USize](0, invitations) do
            try f(k)?.invite(created, accumulator) end
          end
        end
      else
        accumulator.stop()
      end

Just a thought. It might be cleaner to look into the ordering of actions as you suggested.
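The buffering idea could be sketched like this (a Python sketch under the assumption stated above: the chat learns the expected member count at creation, as in `Chat(this, invitations)`; everything else here is made up for illustration):

```python
class Chat:
    """Sketch: the chat knows how many clients are expected to join
    (passed at creation) and buffers post/leave actions until all of
    them have actually joined, so early actions are not lost."""

    def __init__(self, expected):
        self.expected = expected
        self.members = []
        self.buffered = []   # actions received before everyone joined
        self.delivered = []  # actions actually applied

    def join(self, client):
        self.members.append(client)
        if len(self.members) == self.expected:
            # All expected members are present: flush the buffer.
            for action in self.buffered:
                self._apply(action)
            self.buffered.clear()

    def post(self, payload):
        if len(self.members) < self.expected:
            self.buffered.append(("post", payload))
        else:
            self._apply(("post", payload))

    def _apply(self, action):
        self.delivered.append(action)

chat = Chat(expected=2)
chat.post("early")                 # buffered: not everyone joined yet
assert chat.delivered == []
chat.join("a")
chat.join("b")                     # second join triggers the flush
assert chat.delivered == [("post", "early")]
```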

@smarr
Contributor Author

smarr commented Apr 27, 2020

I don't think delaying is sufficient.

The problem is the ordering of _chats and the picking at https://github.com/sblessing/chat-app/blob/master/chat.pony#L171

Post => _chats(index)?.post(None, accumulator)

Here we have the non-determinism of be invite()/be left, which first influences the order of entries in _chats, and then also influences the number of members in _chats(index).

@smarr
Contributor Author

smarr commented Apr 27, 2020

As noted here, the Newspeak implementation may have an issue with the poker/act message design.

In Newspeak, I do have issues with identity. I cannot compare promises; I need to unwrap the value in the promise first. This means I have to defer an operation until a turn after the promise is resolved.
However, since the poker, and then the directory, fill the mailbox with act messages, all of the act operations will be done before I can ever create a single chat to send messages to.

@smarr
Contributor Author

smarr commented Apr 28, 2020

@RG-707 @josephnoir in CAF, do you see the same issue described in my last post here?

That is, are there very few post/leave operations ever executed? I am thinking the Poker/Directory loops just aggressively send messages, and Pony's support for back-pressure helps so that the other operations can actually succeed.

I do not have support for back-pressure, and, worse, I need to resolve promises (an extra turn, possibly scheduled after all operations).

It seems to me that the benchmark needs to specify what exactly it wants without relying on back-pressure support (which could be different enough to not help as expected).

@josephnoir
Contributor

Yeah, I think we observe the same. CAF doesn't implement scheduling back-pressure either. Although I can't say whether these actions never happen or whether it is only the recording of actions that fails.

Do you have a configuration you want numbers for? With a single thread we observe neither posts nor leaves. With the default configuration on a quad-core processor I get: Post: 4196, Leave: 218, Invite: 635227, Compute: 579776, None: 361896. The number of Post and Leave actions is low compared to the others.

@smarr
Contributor Author

smarr commented Apr 28, 2020

The probabilities say:

  • 55% compute
  • 25% post
  • 10% leave
  • 10% invite

I think your numbers are also sufficiently far off to indicate that there might be an issue for implementations without back-pressure support.

At least my expectation would be that for large enough numbers, which you have, they should be roughly in proportion to these probabilities. And, yeah, the number of None actions is just too high.
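For reference, the expected counts under these probabilities can be worked out directly (assuming the standard configuration's total of 32*32*1024 actions, the figure used in the annotated tables later in this thread):

```python
# Expected action counts for the standard configuration, assuming
# 32 * 32 * 1024 total actions (as annotated in the result tables).
total = 32 * 32 * 1024          # 1,048,576
probs = {"Compute": 0.55, "Post": 0.25, "Leave": 0.10, "Invite": 0.10}

expected = {action: total * p for action, p in probs.items()}

# Post is an exact power-of-two fraction:
assert expected["Post"] == 262144.0
# Compute ~ 576,716 -- which matches the observed Compute counts
# (576829) far better than the observed Post counts match 262,144.
assert int(expected["Compute"]) == 576716
```

Against these expectations, the reported Post count of 1269132 is nearly 5x too high and Invite (expected ~104,857) is also off, which supports the suspicion that something other than the dice is driving those counts.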

@hgoldhammer
Contributor

hgoldhammer commented Apr 28, 2020

Yeah, I think so too. On my X240 I get numbers that are close to @josephnoir's numbers. But the Pony numbers for Post are also too high compared to the probabilities:

Compute     576829
None        108809
Post       1269132
Invite      283242
Leave        72554

@smarr
Contributor Author

smarr commented Apr 28, 2020

The numbers on my MacBookPro are almost the same for Pony:

Compute     576829
Invite      333494
None        105182
Leave        73589
Post       1273753

What's a bit concerning is that the results are printed and then it is still working for another 5 seconds or longer. So, it's apparently doing stuff outside of the timed interval, which perhaps should not be outside of it.

@sblessing
Owner

sblessing commented Apr 28, 2020

@smarr @josephnoir @RG-707

The counts for post messages currently also include the messages sent because of buffered chat forwards. That is, it is expected that this does not match the probabilities.

Also, @smarr, the stuff that Pony does after printing out the stats is purely cycle detection on a large graph of blocked actors, at a point in time where the program is quiescent. Running with --ponynoblock will cause it to terminate immediately after printing the stats. Cycle detection at this point is kind of unnecessary, because the runtime knows the program is going to terminate anyway.

@josephnoir
Contributor

What do you think about counting post and delivery-of-a-post separately?

@sblessing
Owner

sblessing commented Apr 29, 2020

Done and pushed.

Standard configuration:

Benchmark                                  i-mean          i-median           i-error          i-stddev
Chat App                               243.867 ms        243.184 ms       ±9.96341 %            70.126
                                           j-mean          j-median           j-error          j-stddev              quality of service
Turns                                  203.444 ms          192.5 ms       ±1.58941 %           52.7931                         51.7576

None                108867
Invite              277270          (32*32*1024)*0.10 ≈ 104,857
Post                 76538          (32*32*1024)*0.25 ≈ 262,144
Compute             576829          (32*32*1024)*0.55 ≈ 576,716
Leave                72539          (32*32*1024)*0.10 ≈ 104,857
PostDelivery       1193789

Execution with a single thread:

Benchmark                                  i-mean          i-median           i-error          i-stddev
Chat App                               375.044 ms        387.592 ms       ±7.58932 %           82.1493
                                           j-mean          j-median           j-error          j-stddev              quality of service
Turns                                  320.062 ms          350.5 ms       ±1.41943 %           74.1723                         74.1035

None                108867
Invite              269366
Post                 80379
Compute             576829
Leave                72539
PostDelivery       1193789

@hgoldhammer
Contributor

When I change it the same way @sblessing did for Pony, I don't get Post any more, only PostDelivery. See https://github.com/RG-707/chat-app/tree/caf_post_count

@smarr
Contributor Author

smarr commented Apr 29, 2020

To me it seems we need to somehow specify that there is room for the actor to actually perform turns between act messages. Just filling up the mailboxes isn't going to lead to the desired behavior.

We might need a request/reply for the act.

I am not sure how that fits with the goal to 'saturate' the system, though. But as long as the number of clients is a multiple of the number of cores in the system, it might still satisfy this goal?
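A request/reply act could look roughly like the following (an illustrative asyncio sketch, not any of the actual implementations; the poker/client names mirror the benchmark, everything else is made up). The poker only sends the next act after the client has acknowledged the previous one, so the client's mailbox never fills up with pending acts and it gets the intermediate turns it needs:

```python
import asyncio

async def client(inbox: asyncio.Queue, acks: asyncio.Queue, log: list):
    """Processes one act per 'turn'; extra turns (e.g. promise
    resolution) could happen here before the ack is sent."""
    while True:
        msg = await inbox.get()
        if msg == "stop":
            break
        log.append(f"did {msg}")
        await acks.put("done")   # reply: ready for the next act

async def poker(inbox: asyncio.Queue, acks: asyncio.Queue, turns: int):
    """Sends acts one at a time, waiting for each reply, instead of
    flooding the mailbox up front."""
    for i in range(turns):
        await inbox.put(f"act {i}")
        await acks.get()         # request/reply: wait before next act
    await inbox.put("stop")

async def main():
    inbox, acks, log = asyncio.Queue(), asyncio.Queue(), []
    await asyncio.gather(client(inbox, acks, log),
                         poker(inbox, acks, 3))
    return log

log = asyncio.run(main())
assert log == ["did act 0", "did act 1", "did act 2"]
```

The trade-off is exactly the saturation question raised above: with one outstanding act per client, the system is only saturated if there are at least as many clients as cores.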

@sblessing
Owner

@smarr I think you are referring to the fact that, currently, the system allows turns to overlap? I.e., turn j+1 could finish before turn j. For saturating the system (and simulating realistic external events) I was of the opinion that this is nice, but I might be wrong.

@smarr
Contributor Author

smarr commented Apr 29, 2020

No, sorry for the terminological confusion.
In communicating event loop systems, we talk about turns when referring to a message being processed. Pony seems to be an event-loop system: the actors pick a message and process it, i.e., perform a turn.

The problem I see is that I need extra turns, for instance for unwrapping promises, and I don't get a chance to perform these turns before the next act message, because the mailbox is already full of act messages, and everything else simply goes to the back of it.

So, in all of this issue, when I referred to turn, I was meaning it in the event-loop context.

@sblessing
Owner

I am not familiar in detail with Newspeak, but would your problem go away if you didn't use promises?

@smarr
Contributor Author

smarr commented Apr 29, 2020

The Newspeak version of ChatApp currently does not use promises, except where it absolutely has to, because there's simply no other way to do it.

For instance, to create a chat actor, we create the actor, and then have to ask it to instantiate a Chat object inside the chat actor. This is what E, AmbientTalk, JCoBox, and Newspeak require you to do. Actors are second-class citizens and can't even be referenced directly. These actors are typically also called vats because of these "containing objects" semantics.

In practice this means the reference to the Chat object is communicated wrapped in a promise. We get the promise immediately as the result of the asynchronous send, and it will eventually be resolved.

So, in these languages, this can simply not be avoided.

And, I think, this isn't just a problem for these languages. CAF clearly also shows very different numbers, caused by how the communication is structured.
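The vat-style creation described above, where the Chat reference only ever arrives wrapped in a promise that resolves in a later turn, can be sketched with futures (illustrative only; the Vat class and its methods are made up, not Newspeak API):

```python
from concurrent.futures import Future

class Vat:
    """Sketch of E/AmbientTalk/JCoBox/Newspeak-style vats: the vat is
    never referenced directly; asking it to instantiate an object
    returns a promise for the reference, which only resolves once
    the vat runs the corresponding turn."""

    def __init__(self):
        self.pending = []   # the vat's queue of turns to run

    def spawn(self, cls):
        promise = Future()
        # Instantiation happens in a later turn inside the vat; the
        # caller holds only the (still unresolved) promise.
        self.pending.append(lambda: promise.set_result(cls()))
        return promise

    def run_turn(self):
        self.pending.pop(0)()

class Chat:
    pass

vat = Vat()
ref = vat.spawn(Chat)
assert not ref.done()                  # no synchronous return value...
vat.run_turn()
assert isinstance(ref.result(), Chat)  # ...only after an extra turn
```

This is the extra turn the Newspeak implementation cannot avoid: if the mailbox is already full of act messages, that resolution turn is scheduled behind all of them.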

@smarr
Contributor Author

smarr commented Apr 29, 2020

And just to put a little bit more weight on this: if anyone wants to implement this benchmark in, for instance, JavaScript, perhaps based on web workers, they'll run into the very same issues, because operations simply don't have synchronous return values.

@sblessing
Owner

sblessing commented Apr 30, 2020

@smarr we have made some changes to counting and bumping in 773a5c0. Some of it was broken after not-well-thought-out changes and a merge failure. The most important change, I guess, is that we are now making sure that befriending has happened before the first poke.

Also, back pressure is, for small configurations, an issue for neither CAF nor Newspeak. The core problem in the Newspeak implementation is actor creation promises.

@sblessing
Owner

sblessing commented Apr 30, 2020

Pony results after 773a5c0

Benchmark                                  i-mean          i-median           i-error          i-stddev
Chat App                               146.847 ms        114.725 ms       ±38.9673 %           165.153
                                           j-mean          j-median           j-error          j-stddev              quality of service
Turns                                  132.029 ms            107 ms       ±7.59628 %           163.744                         163.671

None                110491
Post                196640          (32*32*1024)*0.25 ≈ 262,144
PostDelivery       2527529
Ignore             1214301
Leave                70443          (32*32*1024)*0.10 ≈ 104,857
Compute             572043          (32*32*1024)*0.55 ≈ 576,716
Invite               98959          (32*32*1024)*0.10 ≈ 104,857

The remaining distance to the totals expected from the probabilities now comes from the fact that, at the start, clients are not in any chats. That is, the Post and Leave actions picked in the first couple of turns end up as None.
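That start-up effect can be illustrated with a tiny simulation (a Python sketch of one client's dice loop with the benchmark's probabilities; seed and loop structure are made up for illustration): Post and Leave degrade to None exactly until the first Invite has created a chat.

```python
import random

# Sketch: a client starts with no chats, so Post/Leave drawn in the
# first turns degrade to None until an Invite has created one.
random.seed(7)
chats = []
counts = {"Compute": 0, "Post": 0, "Leave": 0, "Invite": 0, "None": 0}

for _ in range(10_000):
    d = random.random()
    if d < 0.55:                              # Compute (55%)
        counts["Compute"] += 1
    elif d < 0.80:                            # Post (25%)
        counts["Post" if chats else "None"] += 1
    elif d < 0.90:                            # Leave (10%)
        counts["Leave" if chats else "None"] += 1
    else:                                     # Invite (10%)
        chats.append(object())
        counts["Invite"] += 1

assert sum(counts.values()) == 10_000
# None only accumulates before the first Invite; once a chat exists,
# Post and Leave succeed and track their probabilities again.
```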
