ffwd will block when there are multiple client threads #2
Hi Zulai,
Can you tell us what machine you are running this on? Because of the busy-polling nature of FFWD, you can encounter severe slowdown if you oversubscribe the cores. Could that be what’s going on here?
- Jakob
> On Sep 11, 2022, at 4:11 AM, Zulai Wang wrote:
>
> I'm trying to run ffwd on several machines of mine, but found that more than one client thread causes blocking in FFWD_EXEC. After some debugging, I found that when there are concurrent FFWD_EXEC calls, all client threads block waiting for the server's response, while the server never receives any client's requests.
>
> $ ./ffwd_sample -t 1 -s 2 -d 100 # this will run to completion
> 1 0.100 0.013
> $ ./ffwd_sample -t 2 -s 2 -d 100 # this will block
---
Hi Jakob,
This is the environment of my machine:
* Intel(R) Xeon(R) Gold 6238R CPU
* 56 cores
* 377G DRAM
I don't think I am oversubscribing the cores, because I only assign 2 servers and 2 clients in the following test, which blocks forever. The blocking persists when I increase the server or client count.
./ffwd_sample -t 2 -s 2 -d 100 # Here, `-t` means the number of client threads, and `-s` means the number of polling servers.
BTW, I've tried this on several Xeon machines, and all of them block. It only works when I limit the number of client threads (-t) to one.
---
Have a look at htop while the program is running. The program should be using 4 cores at 100%, all green (user space). Is that what you see?
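Besides htop, per-thread CPU usage and the core each thread last ran on can be checked with `ps`. A small sketch, not specific to ffwd: it inspects the current shell (`$$`) so it runs as-is; substitute the ffwd_sample PID when actually debugging.

```shell
# Show every thread (LWP) of a process, the core it last ran on (PSR),
# and its CPU usage. $$ is the current shell; replace it with the
# ffwd_sample PID, e.g. "$(pgrep -f ffwd_sample | head -1)".
ps -L -o pid,lwp,psr,pcpu,stat,comm -p $$
```

With `-t 2 -s 2` one would expect the busy-polling threads to sit on distinct PSR values, each near 100% CPU.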
---
Yes. The following is the htop output:

CPU PID     USER   PRI NI VIRT RES  SHR  S CPU% MEM% TIME+   Command
105 3995483 wangzl 20  0  435M 2408 1956 R 100. 0.0  0:53.43 ./ffwd_sample -t 2 -s 2 -d 100
 78 3995484 wangzl 20  0  435M 2408 1956 R 100. 0.0  0:53.44 ./ffwd_sample -t 2 -s 2 -d 100
 57 3995482 wangzl 20  0  435M 2408 1956 R 100. 0.0  0:53.44 ./ffwd_sample -t 2 -s 2 -d 100
  1 3995481 wangzl 20  0  435M 2408 1956 R 50.0 0.0  0:26.75 ./ffwd_sample -t 2 -s 2 -d 100
  1 3995486 wangzl 20  0  435M 2408 1956 R 50.0 0.0  0:26.68 ./ffwd_sample -t 2 -s 2 -d 100
---
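Both 50% rows in the output above report CPU 1, which may indicate two threads time-sharing one core. Thread affinity can be read straight from /proc; a sketch against the current shell (`$$`), so it runs as-is — substitute the ffwd_sample PID from htop (e.g. 3995481) when debugging:

```shell
# Print the allowed-CPU list for every thread of a process.
# Two busy-polling threads restricted to the same single core
# would each show that core here and run at ~50% CPU.
for t in /proc/$$/task/*; do
  printf '%s: %s\n' "${t##*/}" "$(grep Cpus_allowed_list "$t/status")"
done
```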
I've added two printf calls to gather logs on concurrent FFWD_EXEC. Hope this can provide some hints.
The two printf calls:

103 #define FFWD_EXEC(server_no, function, ret, ...) \
  + printf("context=%p server_no=%d\n", context, server_no);\
105 context->request[server_no]->fptr = function; \
106 prepare_request(context->request[server_no], __VA_ARGS__); \
107 context->local_client_flag[server_no] ^= context->mask; \
108 context->request[server_no]->flag = context->local_client_flag[server_no]; \
109 while(((context->server_response[server_no]->flags ^ context->local_client_flag[server_no]) & context->mask)){ \
110 __asm__ __volatile__("rep;nop": : :"memory"); \
111 } \
  + printf("get_value\n");\
113 ret = context->server_response[server_no]->return_values[((context->id_in_chip)) % NCLIENTS]; \
114
115 #define GET_CONTEXT() \
116 struct ffwd_context *context = ffwd_get_context();

Runtime log:
context=0x565401c0f860 server_no=1
get_value
...
context=0x565401c0f860 server_no=1
get_value
context=0x565401c0f860 server_no=0
get_value
context=0x565401c0f860 server_no=1
get_value
context=0x565401c0f7d0 server_no=0
context=0x565401c0f860 server_no=0
get_value
# Blocking start

It seems that when the second context (i.e., the second client thread) starts to send messages, both client threads block.
---
This is new behavior to me. I don’t see how the number of clients plays a role.
As a debugging step, try adding an “mfence” instruction after rep;nop; on line 110.
- Jakob
---
Added "mfence":

103 #define FFWD_EXEC(server_no, function, ret, ...) \
  + printf("context=%p server_no=%d\n", context, server_no);\
105 context->request[server_no]->fptr = function; \
106 prepare_request(context->request[server_no], __VA_ARGS__); \
107 context->local_client_flag[server_no] ^= context->mask; \
108 context->request[server_no]->flag = context->local_client_flag[server_no]; \
109 while(((context->server_response[server_no]->flags ^ context->local_client_flag[server_no]) & context->mask)){ \
  ~ __asm__ __volatile__("rep;nop;mfence": : :"memory"); \
111 } \
  + printf("get_value\n");\
113 ret = context->server_response[server_no]->return_values[((context->id_in_chip)) % NCLIENTS]; \

After recompiling and re-running, the behavior seems to be the same:
context=0x5588004c7860 server_no=1
get_value
context=0x5588004c7860 server_no=0
get_value
context=0x5588004c7860 server_no=1
get_value
context=0x5588004c7860 server_no=0
get_value
context=0x5588004c7860 server_no=1
get_value
context=0x5588004c7860 server_no=0
context=0x5588004c77d0 server_no=0
get_value
---
Probably good news that this didn’t have any effect. There’s probably a silly problem at play that’s triggered by your particular configuration (56 cores, one socket?). My best guess is that you have more cores in one socket than the code is configured for.
Please print the values of context->id and context->id_in_chip as well.
- Jakob
---
context=0x56229556b860 context->id=2 context->id_in_chip=2 server_no=0
get_value
context=0x56229556b860 context->id=2 context->id_in_chip=2 server_no=1
get_value
context=0x56229556b860 context->id=2 context->id_in_chip=2 server_no=0
get_value
context=0x56229556b860 context->id=2 context->id_in_chip=2 server_no=1
get_value
context=0x56229556b860 context->id=2 context->id_in_chip=2 server_no=0
context=0x56229556b7d0 context->id=0 context->id_in_chip=-2 server_no=0
get_value

The second id_in_chip is a negative number. Maybe this is the cause? I'll look into the CPU core configuration part of the ffwd code.
---
There are two NUMA nodes and hyperthreading enabled on my machine, so there are 28 physical cores (56 logical cores) on one socket.
---
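Whether a build's compile-time core count matches the machine can be checked against the actual topology. A sketch using standard Linux tooling; nothing here is ffwd-specific:

```shell
# Sockets, cores per socket, threads per core, and NUMA nodes:
# the numbers a compile-time core count must accommodate.
lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket|NUMA node\(s\))'
# Total online logical CPUs:
getconf _NPROCESSORS_ONLN
```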
Aha, there it is. Just to get things running, try temporarily changing the Makefile to force the number of cores to 32 or fewer.
- Jakob