<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<!-- 2025-01-24 Fri 13:18 -->
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>OpenMP Notes, Pt. I</title>
<meta name="author" content="mitchell" />
<meta name="generator" content="Org Mode" />
<link rel="stylesheet" type="text/css" href="static/css/default.css" />
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
displayAlign: "center",
displayIndent: "0em",
"HTML-CSS": { scale: 100,
linebreaks: { automatic: "false" },
webFont: "TeX"
},
SVG: {scale: 100,
linebreaks: { automatic: "false" },
font: "TeX"},
NativeMML: {scale: 100},
TeX: { equationNumbers: {autoNumber: "AMS"},
MultLineWidth: "85%",
TagSide: "right",
TagIndent: ".8em"
}
});
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS_HTML"></script>
</head>
<body>
<div id="content" class="content">
<h1 class="title">OpenMP 笔记 Pt.I</h1>
<div id="table-of-contents" role="doc-toc">
<h2>Table of Contents</h2>
<div id="text-table-of-contents" role="doc-toc">
<ul>
<li><a href="#org2496024">1. Concurrency VS Parallelism</a>
<ul>
<li><a href="#orgb14be04">1.1. 定义</a></li>
<li><a href="#org49b6f18">1.2. 图例</a></li>
</ul>
</li>
<li><a href="#org79cb770">2. OpenMP: 一个开发多线程程序的 API</a>
<ul>
<li><a href="#org6852345">2.1. 基本信息</a></li>
<li><a href="#orgaaffb1d">2.2. 使用 OpenMP</a>
<ul>
<li><a href="#orgffe000e">2.2.1. 第一个并发程序: <code>Hello World</code>   <span class="tag"><span class="example">example</span></span></a></li>
<li><a href="#org3d3d02a">2.2.2. 线程的互动方式</a>
<ul>
<li><a href="#orgd3a905f">2.2.2.1. 共享内存计算机</a></li>
<li><a href="#org7da0fbe">2.2.2.2. 单线程与多线程</a></li>
</ul>
</li>
<li><a href="#orgf5316b1">2.2.3. OpenMP 的系统概述</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#org1c8e07c">3. OpenMP 的功能探索</a>
<ul>
<li><a href="#org373fbfe">3.1. 创建线程</a>
<ul>
<li><a href="#org46f154e">3.1.1. Fork-join 并行计算模型</a></li>
<li><a href="#orgdf45492">3.1.2. 在 OpenMP 中创建线程</a></li>
<li><a href="#org076a2db">3.1.3. 线程执行</a></li>
<li><a href="#org42fca5e">3.1.4. 编译器在干什么?</a></li>
<li><a href="#org60d8282">3.1.5. 用 OpenMP 来优化定积分计算   <span class="tag"><span class="example">example</span></span></a>
<ul>
<li><a href="#orgf232fa9">3.1.5.1. 什么是 <code>SPMD</code> ?</a></li>
</ul>
</li>
<li><a href="#orgc20baa6">3.1.6. 虚假共享: 对定积分程序再次优化</a>
<ul>
<li><a href="#orgd9666bd">3.1.6.1. 那怎么办? 填充数组!</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#org40ab607">3.2. 同步机制</a>
<ul>
<li><a href="#orge38ef89">3.2.1. 两种最常见的同步方式</a></li>
<li><a href="#orgcccf132">3.2.2. 其他的同步方式/一些例子</a>
<ul>
<li><a href="#orge20afc8">3.2.2.1. 高阶同步</a></li>
<li><a href="#org90587f3">3.2.2.2. 底层同步</a></li>
<li><a href="#org0eabbd4">3.2.2.3. Barrier</a></li>
<li><a href="#org600b4e7">3.2.2.4. Critical (Mutual Exclusion)</a></li>
<li><a href="#org0e01610">3.2.2.5. Atomic (Mutual Exclusion)</a></li>
</ul>
</li>
<li><a href="#orgef97f4a">3.2.3. 利用同步机制优化定积分程序</a>
<ul>
<li><a href="#orgfb3e65b">3.2.3.1. Critical</a></li>
<li><a href="#org0276752">3.2.3.2. Atomic</a></li>
</ul>
</li>
<li><a href="#org45a30c3">3.2.4. SMPD vs. Worksharing</a></li>
<li><a href="#orgfd6235e">3.2.5. Worksharing constructs</a>
<ul>
<li><a href="#orgca01907">3.2.5.1. 有什么用?</a></li>
<li><a href="#org5211857">3.2.5.2. 调度子句</a></li>
</ul>
</li>
<li><a href="#org3e53349">3.2.6. Working with Loops</a>
<ul>
<li><a href="#orgf603fbe">3.2.6.1. 方法论</a></li>
<li><a href="#org7e86fbb">3.2.6.2. 特殊情况之: 嵌套循环</a></li>
<li><a href="#orgca2c399">3.2.6.3. 特殊情况之: 规约</a></li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</div>
</div>
<p>
(This is the first of two parts.)
</p>
<div id="outline-container-org2496024" class="outline-2">
<h2 id="org2496024"><span class="section-number-2">1.</span> Concurrency VS Parallelism</h2>
<div class="outline-text-2" id="text-1">
</div>
<div id="outline-container-orgb14be04" class="outline-3">
<h3 id="orgb14be04"><span class="section-number-3">1.1.</span> 定义</h3>
<div class="outline-text-3" id="text-1-1">
<ul class="org-ul">
<li><p>
<b><b>Concurrency</b></b>: the ability to have multiple tasks in progress <b><b>(logically active)</b></b> at the same time
</p>
<p>
[A <b><b>property</b></b> of execution flows]
</p></li>
<li><p>
<b><b>Parallelism</b></b>: the ability of multiple CPUs to actually run <b><b>(actually active)</b></b> multiple tasks at the same moment
</p>
<p>
[A <b><b>state</b></b> of execution flows]
</p></li>
</ul>
</div>
</div>
<div id="outline-container-org49b6f18" class="outline-3">
<h3 id="org49b6f18"><span class="section-number-3">1.2.</span> 图例</h3>
<div class="outline-text-3" id="text-1-2">
<div id="orgedaf2a8" class="figure">
<p><img src="files/para.jpg" alt="para.jpg" />
</p>
</div>
</div>
</div>
</div>
<div id="outline-container-org79cb770" class="outline-2">
<h2 id="org79cb770"><span class="section-number-2">2.</span> OpenMP: 一个开发多线程程序的 API</h2>
<div class="outline-text-2" id="text-2">
</div>
<div id="outline-container-org6852345" class="outline-3">
<h3 id="org6852345"><span class="section-number-3">2.1.</span> 基本信息</h3>
<div class="outline-text-3" id="text-2-1">
<p>
Most OpenMP constructs are compiler directives, with the format:
</p>
<div class="org-src-container">
<pre class="src src-C"><span class="org-preprocessor">#pragma</span> omp construct [clause [clause]…]
</pre>
</div>
<p>
Function prototypes and types are defined in <code>omp.h</code>:
</p>
<div class="org-src-container">
<pre class="src src-C"><span class="org-preprocessor">#include</span> <span class="org-string"><omp.h></span>
</pre>
</div>
</div>
</div>
<div id="outline-container-orgaaffb1d" class="outline-3">
<h3 id="orgaaffb1d"><span class="section-number-3">2.2.</span> 使用 OpenMP</h3>
<div class="outline-text-3" id="text-2-2">
<p>
With <code>gcc</code>, just add <code>-fopenmp</code> to enable OpenMP:
</p>
<div class="org-src-container">
<pre class="src src-sh">gcc -fopenmp foo.c
<span class="org-builtin">export</span> <span class="org-variable-name">OMP_NUM_THREADS</span>=4
./a.out
</pre>
</div>
</div>
<div id="outline-container-orgffe000e" class="outline-4">
<h4 id="orgffe000e"><span class="section-number-4">2.2.1.</span> 第一个并发程序: <code>Hello World</code>   <span class="tag"><span class="example">example</span></span></h4>
<div class="outline-text-4" id="text-2-2-1">
<div class="org-src-container">
<pre class="src src-C"><span class="org-preprocessor">#include</span> <span class="org-string"><omp.h></span> <span class="org-comment-delimiter">// </span><span class="org-comment">OpenMP include file</span>
<span class="org-preprocessor">#include</span> <span class="org-string"><stdio.h></span>
<span class="org-type">int</span> <span class="org-function-name">main</span>()
{
<span class="org-preprocessor"> #pragma</span> omp parallel num_threads(4) <span class="org-comment-delimiter">// </span><span class="org-comment">Parallel region with 4 threads</span>
{
<span class="org-type">int</span> <span class="org-variable-name">id</span> = omp_get_thread_num(); <span class="org-comment-delimiter">// </span><span class="org-comment">Runtime library function to return a thread ID</span>
printf(<span class="org-string">"Hello World from thread = %d"</span>, id);
printf(<span class="org-string">" with %d threads\n"</span>,omp_get_num_threads());
} <span class="org-comment-delimiter">// </span><span class="org-comment">End of the Parallel region</span>
}
</pre>
</div>
<p>
The output is:
</p>
<div class="org-src-container">
<pre class="src src-bash">./a.out
Hello World from thread = 0 with 4 threads
Hello World from thread = 1 with 4 threads
Hello World from thread = 3 with 4 threads
Hello World from thread = 2 with 4 threads
</pre>
</div>
</div>
</div>
<div id="outline-container-org3d3d02a" class="outline-4">
<h4 id="org3d3d02a"><span class="section-number-4">2.2.2.</span> 线程的互动方式</h4>
<div class="outline-text-4" id="text-2-2-2">
</div>
<div id="outline-container-orgd3a905f" class="outline-5">
<h5 id="orgd3a905f"><span class="section-number-5">2.2.2.1.</span> 共享内存计算机</h5>
<div class="outline-text-5" id="text-2-2-2-1">
<ul class="org-ul">
<li><b><b>Symmetric multiprocessor (SMP)</b></b></li>
</ul>
<p>
In an SMP, all processors share the same address space, every processor has "equal-time" access to memory, and the operating system treats each processor the same way
</p>
<ul class="org-ul">
<li><b><b>Non-uniform memory access multiprocessor (NUMA)</b></b></li>
</ul>
<p>
In a NUMA system, memory is divided into regions of "near" and "far" memory (near and far refer to physical distance from a given processor); a processor accesses near memory quickly and far memory more slowly
</p>
</div>
<ol class="org-ol">
<li><a id="org4081f27"></a>What about CPU Cache?<br />
<div class="outline-text-6" id="text-2-2-2-1-1">
<div id="org8f63088" class="figure">
<p><img src="files/cache.png" alt="cache.png" />
</p>
</div>
<p>
An example: Intel® Core™ i7-970
</p>
<ul class="org-ul">
<li>It has 6 cores, each supporting 2-way hyper-threading</li>
<li>It is a 6-wide superscalar processor, able to issue 6 instructions per clock cycle, with 4-wide SIMD supported in 3 pipelines</li>
<li>It has 192 KB of L1 cache, 1.5 MB of L2 cache, and 12 MB of shared L3 cache</li>
</ul>
<p>
Although it looks like an SMP system, its cache hierarchy means that different cores see different costs when accessing different memory addresses, so in practice it behaves like NUMA
</p>
</div>
</li>
</ol>
</div>
<div id="outline-container-org7da0fbe" class="outline-5">
<h5 id="org7da0fbe"><span class="section-number-5">2.2.2.2.</span> 单线程与多线程</h5>
<div class="outline-text-5" id="text-2-2-2-2">
<div class="org-center">
<p>
<img src="files/normal.png" alt="normal.png" />
<img src="files/para.png" alt="para.png" />
</p>
</div>
<ul class="org-ul">
<li>Single-threaded: all execution happens on one stack, and the memory layout is simpler</li>
<li>Multithreaded: each thread has its own stack pointer and program counter, allowing concurrent execution</li>
</ul>
<p>
So, for a program:
</p>
<ul class="org-ul">
<li>There is one process and many threads</li>
<li>Threads interact by reading and writing the shared address space (see the sketch below)</li>
<li>The OS scheduler decides in what order the threads run</li>
<li>Multiple threads may access shared data at the same time; synchronization can ensure that only one thread touches a shared resource at any given moment</li>
</ul>
<div id="org5fa70c9" class="figure">
<p><img src="files/shared.png" alt="shared.png" />
</p>
</div>
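<p>
As a minimal sketch of this picture (my own example, not part of the original notes): a variable declared before the parallel region is shared by every thread, while a variable declared inside the region lives on each thread's own stack and is private.
</p>
<div class="org-src-container">
<pre class="src src-C">#include <omp.h>
#include <stdio.h>
int main()
{
  int shared_counter = 0;                   // one copy in shared memory, visible to every thread
  #pragma omp parallel num_threads(4)
  {
    int private_id = omp_get_thread_num();  // one copy per thread, on that thread's own stack
    #pragma omp critical
    shared_counter += 1;                    // all threads update the same shared variable
    printf("thread %d incremented the shared counter\n", private_id);
  }
  printf("final value: %d\n", shared_counter); // prints 4
  return 0;
}
</pre>
</div>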
</div>
</div>
</div>
<div id="outline-container-orgf5316b1" class="outline-4">
<h4 id="orgf5316b1"><span class="section-number-4">2.2.3.</span> OpenMP 的系统概述</h4>
<div class="outline-text-4" id="text-2-2-3">
<p>
<b><b>OpenMP is a multithreaded, shared-address-space model</b></b>
</p>
<ul class="org-ul">
<li>Threads communicate through shared variables</li>
</ul>
<p>
Unintended sharing of data leads to races:
</p>
<ul class="org-ul">
<li>Race condition: the program's result changes depending on how the threads happen to be scheduled</li>
</ul>
<p>
To control race conditions:
</p>
<ul class="org-ul">
<li>Use synchronization to protect against data conflicts</li>
</ul>
<p>
Since synchronization is expensive:
</p>
<ul class="org-ul">
<li>It is best to change how the data is accessed so as to minimize the need for synchronization</li>
</ul>
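<p>
As an illustration of such a race (my own sketch, not from the original notes): in the program below, <code>count++</code> is an unsynchronized read-modify-write on a shared variable, so the printed value depends on how the threads are scheduled and often differs from run to run. Placing <code>#pragma omp atomic</code> right before the increment makes the result deterministic, at the cost of synchronizing every iteration.
</p>
<div class="org-src-container">
<pre class="src src-C">#include <omp.h>
#include <stdio.h>
int main()
{
  long count = 0;                        // shared by all threads
  #pragma omp parallel num_threads(4)
  {
    for (int i = 0; i < 100000; i++)
      count++;                           // data race: unsynchronized read-modify-write
  }
  printf("count = %ld (expected 400000)\n", count);
  return 0;
}
</pre>
</div>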
</div>
</div>
</div>
</div>
<div id="outline-container-org1c8e07c" class="outline-2">
<h2 id="org1c8e07c"><span class="section-number-2">3.</span> OpenMP 的功能探索</h2>
<div class="outline-text-2" id="text-3">
</div>
<div id="outline-container-org373fbfe" class="outline-3">
<h3 id="org373fbfe"><span class="section-number-3">3.1.</span> 创建线程</h3>
<div class="outline-text-3" id="text-3-1">
</div>
<div id="outline-container-org46f154e" class="outline-4">
<h4 id="org46f154e"><span class="section-number-4">3.1.1.</span> Fork-join 并行计算模型</h4>
<div class="outline-text-4" id="text-3-1-1">
<ul class="org-ul">
<li>In the fork-join model, there is a master thread that manages the execution of parallel tasks</li>
<li><p>
When the master thread needs to do work in parallel, it forks a team of worker threads
</p>
</p>
<div class="org-src-container">
<pre class="src src-text"> master
----------
/__________\
/ sequential \
----- -----
\ sequential /
----------
</pre>
</div></li>
</ul>
</div>
</div>
<div id="outline-container-orgdf45492" class="outline-4">
<h4 id="orgdf45492"><span class="section-number-4">3.1.2.</span> 在 OpenMP 中创建线程</h4>
<div class="outline-text-4" id="text-3-1-2">
<ul class="org-ul">
<li>Method 1</li>
</ul>
<div class="org-src-container">
<pre class="src src-C"><span class="org-preprocessor">#include</span> <span class="org-string"><omp.h></span> <span class="org-comment-delimiter">// </span><span class="org-comment">OpenMP include file</span>
<span class="org-preprocessor">#include</span> <span class="org-string"><stdio.h></span>
<span class="org-type">int</span> <span class="org-function-name">main</span>()
{
<span class="org-type">int</span> <span class="org-variable-name">a</span> = 1;
<span class="org-comment-delimiter">/* </span><span class="org-comment">Each thread executes a copy of the</span>
<span class="org-comment"> code within the structured block</span><span class="org-comment-delimiter">*/</span>
omp_set_num_threads(4); <span class="org-comment-delimiter">//</span><span class="org-comment">Runtime function to request a certain number of threads</span>
<span class="org-preprocessor"> #pragma</span> omp parallel
{
<span class="org-type">int</span> <span class="org-variable-name">id</span> = omp_get_thread_num(); <span class="org-comment-delimiter">// </span><span class="org-comment">Runtime library function to return a thread ID</span>
foo(a, id); <span class="org-comment-delimiter">// </span><span class="org-comment">for ID = 0 to 3</span>
} <span class="org-comment-delimiter">// </span><span class="org-comment">End of the Parallel region</span>
<span class="org-keyword">return</span> 0;
}
</pre>
</div>
<ul class="org-ul">
<li>Method 2</li>
</ul>
<div class="org-src-container">
<pre class="src src-C"><span class="org-preprocessor">#include</span> <span class="org-string"><omp.h></span> <span class="org-comment-delimiter">// </span><span class="org-comment">OpenMP include file</span>
<span class="org-preprocessor">#include</span> <span class="org-string"><stdio.h></span>
<span class="org-type">int</span> <span class="org-function-name">main</span>()
{
<span class="org-type">int</span> <span class="org-variable-name">a</span> = 1;
<span class="org-comment-delimiter">/* </span><span class="org-comment">Each thread executes a copy of the</span>
<span class="org-comment"> code within the structured block</span><span class="org-comment-delimiter">*/</span>
<span class="org-preprocessor"> #pragma</span> omp parallel num_threads(4) <span class="org-comment-delimiter">// </span><span class="org-comment">clause to request a certain number of threads</span>
{
<span class="org-type">int</span> <span class="org-variable-name">id</span> = omp_get_thread_num(); <span class="org-comment-delimiter">// </span><span class="org-comment">Runtime library function to return a thread ID</span>
foo(a, id); <span class="org-comment-delimiter">// </span><span class="org-comment">for ID = 0 to 3</span>
} <span class="org-comment-delimiter">// </span><span class="org-comment">End of the Parallel region</span>
<span class="org-keyword">return</span> 0;
}
</pre>
</div>
</div>
</div>
<div id="outline-container-org076a2db" class="outline-4">
<h4 id="org076a2db"><span class="section-number-4">3.1.3.</span> 线程执行</h4>
<div class="outline-text-4" id="text-3-1-3">
<div class="org-src-container">
<pre class="src src-text"> |
a = 1
|
|
omp_set_num_threads(4)
|--------|--------|--------|
foo(a,0) foo(a,1) foo(a,2) foo(a,3)
|--------|--------|--------|
| <-- Threads wait here for all threads to finish before proceeding (barrier)
return 0
|
</pre>
</div>
</div>
</div>
<div id="outline-container-org42fca5e" class="outline-4">
<h4 id="org42fca5e"><span class="section-number-4">3.1.4.</span> 编译器在干什么?</h4>
<div class="outline-text-4" id="text-3-1-4">
<p>
Consider this construct:
</p>
<div class="org-src-container">
<pre class="src src-C"><span class="org-preprocessor">#pragma</span> omp parallel num_threads(4)
{
foo();
}
</pre>
</div>
<p>
The compiler turns it into roughly the following program:
</p>
<div class="org-src-container">
<pre class="src src-C"><span class="org-type">void</span> <span class="org-function-name">thunk</span> ()
{
foo();
}
<span class="org-type">pthread_t</span> <span class="org-variable-name">tid</span>[4];
<span class="org-keyword">for</span> (<span class="org-type">int</span> <span class="org-variable-name">i</span> = 1; i < 4; ++i)
pthread_create (
&tid[i],0,thunk, 0);
<span class="org-function-name">thunk</span>();
<span class="org-keyword">for</span> (<span class="org-type">int</span> <span class="org-variable-name">i</span> = 1; i < 4; ++i)
<span class="org-function-name">pthread_join</span> (<span class="org-type">tid</span>[i]);
</pre>
</div>
<p>
OpenMP uses a thread pool, so inside a parallel region the threads are taken from a pool that was created ahead of time
</p>
<p>
In the example above, only three new threads are created, because the last copy of the parallel work is run by the master thread itself
</p>
</div>
</div>
<div id="outline-container-org60d8282" class="outline-4">
<h4 id="org60d8282"><span class="section-number-4">3.1.5.</span> 用 OpenMP 来优化定积分计算   <span class="tag"><span class="example">example</span></span></h4>
<div class="outline-text-4" id="text-3-1-5">
<p>
Consider the following definite integral:
</p>
<div class="org-center">
<p>
\(\int_0^1 \frac{4.0}{(1 + x^2)} dx = \pi\)
</p>
</div>
<p>
Using the geometric meaning of the integral (a midpoint Riemann sum), we can write a naive implementation:
</p>
<div class="org-src-container">
<pre class="src src-C"><span class="org-keyword">static</span> <span class="org-type">long</span> <span class="org-variable-name">num_steps</span> = 100000;
<span class="org-type">double</span> <span class="org-variable-name">step</span>;
<span class="org-type">int</span> <span class="org-function-name">main</span> ()
{
<span class="org-type">int</span> <span class="org-variable-name">i</span>; <span class="org-type">double</span> <span class="org-variable-name">x</span>, <span class="org-variable-name">pi</span>, <span class="org-variable-name">sum</span> = 0.0;
step = 1.0/(<span class="org-type">double</span>) num_steps;
<span class="org-keyword">for</span> (i=0;i< num_steps; i++)
{
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
pi = step * sum;
}
</pre>
</div>
<p>
Next, let's try a simple OpenMP parallelization of this program:
</p>
<div class="org-src-container">
<pre class="src src-C"><span class="org-preprocessor">#include</span> <span class="org-string"><omp.h></span>
<span class="org-keyword">static</span> <span class="org-type">long</span> <span class="org-variable-name">num_steps</span> = 100000;
<span class="org-type">double</span> <span class="org-variable-name">step</span>;
<span class="org-preprocessor">#define</span> <span class="org-variable-name">NUM_THREADS</span> 2
<span class="org-type">int</span> <span class="org-function-name">main</span>()
{
<span class="org-type">int</span> <span class="org-variable-name">i</span>, <span class="org-variable-name">nthreads</span>; <span class="org-type">double</span> <span class="org-variable-name">pi</span>, <span class="org-variable-name">sum</span>[NUM_THREADS];
step = 1.0/(<span class="org-type">double</span>) num_steps;
omp_set_num_threads(NUM_THREADS);
<span class="org-preprocessor"> #pragma</span> omp parallel
{
<span class="org-type">int</span> <span class="org-variable-name">i</span>, <span class="org-variable-name">id</span>, <span class="org-variable-name">nthrds</span>;
<span class="org-type">double</span> <span class="org-variable-name">x</span>;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
<span class="org-keyword">if</span> (id == 0) nthreads = nthrds;
<span class="org-keyword">for</span> (i=id, sum[id]=0.0;i< num_steps; i=i+nthrds)
{
x = (i+0.5)*step;
sum[id] += 4.0/(1.0+x*x);
}
}
<span class="org-keyword">for</span>(i=0, pi=0.0;i<nthreads;i++)pi += sum[i] * step;
}
</pre>
</div>
<p>
A few of the tricks used here:
</p>
<div class="org-src-container">
<pre class="src src-C"><span class="org-type">int</span> <span class="org-variable-name">i</span>, <span class="org-variable-name">nthreads</span>; <span class="org-type">double</span> <span class="org-variable-name">pi</span>, <span class="org-variable-name">sum</span>[NUM_THREADS];
</pre>
</div>
<p>
Here the original scalar <code>sum</code> is promoted to an array with one element per thread, to avoid a race
</p>
<p>
(Each thread has its own <code>sum[id]</code>, so no two threads write to the same memory location; every thread updates only its own element of <code>sum</code>.)
</p>
<hr />
<div class="org-src-container">
<pre class="src src-C"><span class="org-keyword">for</span> (i = id, sum[id] = 0.0; i < num_steps; i = i + nthrds)
</pre>
</div>
<p>
<b><b>Cyclic loop distribution</b></b>: each thread starts at its own <code>id</code> and steps by <code>nthrds</code>, which guarantees that the threads handle disjoint loop iterations and therefore never collide on the same data
</p>
<p>
(a common trick in SPMD programs)
</p>
<hr />
<div class="org-src-container">
<pre class="src src-C"><span class="org-keyword">if</span> (id == 0) nthreads = nthrds;
</pre>
</div>
<p>
A fix for the global variable: <code>nthreads</code> is a global (shared) variable, and multiple threads writing it at the same time would be a race condition
</p>
<p>
So the code lets only one thread (thread 0) update <code>nthreads</code>, avoiding simultaneous writes to the same address
</p>
</div>
<div id="outline-container-orgf232fa9" class="outline-5">
<h5 id="orgf232fa9"><span class="section-number-5">3.1.5.1.</span> 什么是 <code>SPMD</code> ?</h5>
<div class="outline-text-5" id="text-3-1-5-1">
<p>
<code>SPMD (Single Program Multiple Data)</code> is a single-program, multiple-data algorithm design strategy
</p>
<p>
<b><b>Concepts</b></b>
</p>
<ul class="org-ul">
<li>Single program: in SPMD, all cores run the same program, which means the code is uniform and every processing element executes the same instructions</li>
<li>Multiple data: each processing element works on a different data set; by feeding them different input data, the processing elements can carry out their tasks independently</li>
</ul>
<p>
<b><b>Rank</b></b>
</p>
<p>
Each processing element has an identifier called its rank, usually an integer from 0 to P−1 (where P is the number of processing elements)
</p>
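<p>
A minimal SPMD skeleton in OpenMP (my own sketch; <code>do_work</code> and <code>spmd_loop</code> are hypothetical names): every thread runs the same code, reads its rank and the team size, and uses them to pick its own share of the iterations (the same cyclic distribution used in the pi program above).
</p>
<div class="org-src-container">
<pre class="src src-C">#include <omp.h>
void do_work(int i);                    // hypothetical: whatever one iteration does
void spmd_loop(int n)
{
  #pragma omp parallel
  {
    int rank = omp_get_thread_num();    // this processing element's ID: 0 .. P-1
    int P    = omp_get_num_threads();   // P = number of processing elements
    for (int i = rank; i < n; i += P)   // cyclic distribution of the iterations
      do_work(i);                       // same program, different data
  }
}
</pre>
</div>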
</div>
</div>
</div>
<div id="outline-container-orgc20baa6" class="outline-4">
<h4 id="orgc20baa6"><span class="section-number-4">3.1.6.</span> 虚假共享: 对定积分程序再次优化</h4>
<div class="outline-text-4" id="text-3-1-6">
<p>
False sharing happens when multiple threads write to different memory locations that happen to sit on the same cache line
</p>
<p>
Modern processors use caches to speed up data access; when one thread updates a value, the whole cache line is invalidated, forcing the other threads to reload that line from memory and causing a performance drop
</p>
<div id="org41b16fd" class="figure">
<p><img src="files/false.png" alt="false.png" />
</p>
</div>
<p>
So our earlier promotion of <code>sum</code> to an array introduced a serious problem!
</p>
<div class="org-src-container">
<pre class="src src-C"><span class="org-type">double</span> <span class="org-variable-name">sum</span>[NUM_THREADS];
</pre>
</div>
<p>
Array elements are stored contiguously in memory, which means they are likely to share a cache line (especially when the elements are small)
</p>
</div>
<div id="outline-container-orgd9666bd" class="outline-5">
<h5 id="orgd9666bd"><span class="section-number-5">3.1.6.1.</span> 那怎么办? 填充数组!</h5>
<div class="outline-text-5" id="text-3-1-6-1">
<p>
If we somehow leave enough space between the array elements so that each one sits on its own cache line,
</p>
<p>
then each thread can update its own data without disturbing the data of the other threads
</p>
<p>
Concretely, inserting extra padding elements into the array is enough (it really is that simple…)
</p>
<div class="org-src-container">
<pre class="src src-C"><span class="org-preprocessor">#include</span> <span class="org-string"><omp.h></span>
<span class="org-keyword">static</span> <span class="org-type">long</span> <span class="org-variable-name">num_steps</span> = 100000;
<span class="org-type">double</span> <span class="org-variable-name">step</span>;
<span class="org-preprocessor">#define</span> <span class="org-variable-name">NUM_THREADS</span> 2
<span class="org-preprocessor">#define</span> <span class="org-variable-name">PAD</span> 8 <span class="org-comment-delimiter">// </span><span class="org-comment">assume 64 byte L1 cache line size</span>
<span class="org-type">int</span> <span class="org-function-name">main</span>()
{
<span class="org-type">int</span> <span class="org-variable-name">i</span>, <span class="org-variable-name">nthreads</span>; <span class="org-type">double</span> <span class="org-variable-name">pi</span>, <span class="org-variable-name">sum</span>[NUM_THREADS][PAD];
step = 1.0/(<span class="org-type">double</span>) num_steps;
omp_set_num_threads(NUM_THREADS);
<span class="org-preprocessor"> #pragma</span> omp parallel
{
<span class="org-type">int</span> <span class="org-variable-name">i</span>, <span class="org-variable-name">id</span>, <span class="org-variable-name">nthrds</span>;
<span class="org-type">double</span> <span class="org-variable-name">x</span>;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
<span class="org-keyword">if</span> (id == 0) nthreads = nthrds;
<span class="org-keyword">for</span> (i=id, sum[id]=0.0;i< num_steps; i=i+nthrds)
{
x = (i+0.5)*step;
sum[id][0] += 4.0/(1.0+x*x);
}
}
<span class="org-keyword">for</span>(i=0, pi=0.0;i<nthreads;i++)pi += sum[i][0] * step;
}
</pre>
</div>
<p>
Problem solved… or is it?
</p>
<p>
Padding the array requires architecture-specific knowledge of the CPU, which is a real learning cost; worse, if the program is ported to a CPU with a different cache-line size, the optimization can turn into a pessimization…
</p>
</div>
</div>
</div>
</div>
<div id="outline-container-org40ab607" class="outline-3">
<h3 id="org40ab607"><span class="section-number-3">3.2.</span> 同步机制</h3>
<div class="outline-text-3" id="text-3-2">
<p>
As mentioned earlier in the OpenMP system overview:
</p>
<blockquote>
<p>
To control race conditions:
</p>
<ul class="org-ul">
<li>Use synchronization to protect against data conflicts</li>
</ul>
</blockquote>
<p>
So synchronization gives us a universal way to solve the problem we ran into earlier
</p>
</div>
<div id="outline-container-orge38ef89" class="outline-4">
<h4 id="orge38ef89"><span class="section-number-4">3.2.1.</span> 两种最常见的同步方式</h4>
<div class="outline-text-4" id="text-3-2-1">
<ul class="org-ul">
<li>Mutual Exclusion</li>
</ul>
<p class="verse">
When a thread wants to enter a block of code that accesses a shared resource, it must first acquire the lock<br />
Once the lock is held, no other thread can enter the critical section until the holder releases it<br />
</p>
<ul class="org-ul">
<li>Barrier</li>
</ul>
<p>
A barrier makes all participating threads stop at a particular point until every thread has reached that point
</p>
<div class="org-src-container">
<pre class="src src-text"> |
a = 1
|
|
omp_set_num_threads(4)
|--------|--------|--------|
foo(a,0) foo(a,1) foo(a,2) foo(a,3)
|--------|--------|--------|
| <-- Threads wait here for all threads to finish before proceeding (barrier)
return 0
|
</pre>
</div>
</div>
</div>
<div id="outline-container-orgcccf132" class="outline-4">
<h4 id="orgcccf132"><span class="section-number-4">3.2.2.</span> 其他的同步方式/一些例子</h4>
<div class="outline-text-4" id="text-3-2-2">
</div>
<div id="outline-container-orge20afc8" class="outline-5">
<h5 id="orge20afc8"><span class="section-number-5">3.2.2.1.</span> 高阶同步</h5>
<div class="outline-text-5" id="text-3-2-2-1">
<ul class="org-ul">
<li>critical</li>
<li>atomic</li>
<li>barrier</li>
<li>ordered</li>
</ul>
</div>
</div>
<div id="outline-container-org90587f3" class="outline-5">
<h5 id="org90587f3"><span class="section-number-5">3.2.2.2.</span> 底层同步</h5>
<div class="outline-text-5" id="text-3-2-2-2">
<ul class="org-ul">
<li>flush</li>
<li>locks</li>
</ul>
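<p>
The notes give no example for locks, so here is a small sketch of my own using the OpenMP runtime lock routines (<code>omp_init_lock</code>, <code>omp_set_lock</code>, <code>omp_unset_lock</code>, <code>omp_destroy_lock</code>); a lock behaves like an explicit, named version of <code>critical</code>:
</p>
<div class="org-src-container">
<pre class="src src-C">#include <omp.h>
#include <stdio.h>
int main()
{
  omp_lock_t lock;
  double total = 0.0;
  omp_init_lock(&lock);                  // create the lock
  #pragma omp parallel num_threads(4)
  {
    double partial = omp_get_thread_num() * 1.0;  // stand-in for real per-thread work
    omp_set_lock(&lock);                 // acquire: other threads wait here
    total += partial;                    // protected update of shared data
    omp_unset_lock(&lock);               // release so the next thread can enter
  }
  omp_destroy_lock(&lock);               // free the lock
  printf("total = %f\n", total);
  return 0;
}
</pre>
</div>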
</div>
</div>
<div id="outline-container-org0eabbd4" class="outline-5">
<h5 id="org0eabbd4"><span class="section-number-5">3.2.2.3.</span> Barrier</h5>
<div class="outline-text-5" id="text-3-2-2-3">
<div class="org-src-container">
<pre class="src src-C"><span class="org-preprocessor">#pragma</span> omp parallel
{
<span class="org-type">int</span> <span class="org-variable-name">id</span>=omp_get_thread_num();
A[id] = foo1(id);
<span class="org-preprocessor">#pragma</span> omp barrier <span class="org-comment-delimiter">// </span><span class="org-comment">stops here</span>
B[id] = foo2(id, A);
}
</pre>
</div>
</div>
</div>
<div id="outline-container-org600b4e7" class="outline-5">
<h5 id="org600b4e7"><span class="section-number-5">3.2.2.4.</span> Critical (Mutual Exclusion)</h5>
<div class="outline-text-5" id="text-3-2-2-4">
<div class="org-src-container">
<pre class="src src-C"><span class="org-type">float</span> <span class="org-variable-name">res</span>;
<span class="org-preprocessor">#pragma</span> omp parallel
{
<span class="org-type">float</span> <span class="org-variable-name">B</span>; <span class="org-type">int</span> <span class="org-variable-name">i</span>, <span class="org-variable-name">id</span>, <span class="org-variable-name">nthrds</span>;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
<span class="org-keyword">for</span> (i=id;i<niters;i+=nthrds)
{
B = big_job(i);
<span class="org-preprocessor">#pragma</span> omp critical <span class="org-comment-delimiter">// </span><span class="org-comment">Threads wait for their turns</span>
res += consume (B);
}
}
</pre>
</div>
</div>
</div>
<div id="outline-container-org0e01610" class="outline-5">
<h5 id="org0e01610"><span class="section-number-5">3.2.2.5.</span> Atomic (Mutual Exclusion)</h5>
<div class="outline-text-5" id="text-3-2-2-5">
<p>
Atomic is a special synchronization mechanism: when several threads need to update the same variable, atomic guarantees that each individual update is performed safely
</p>
<div class="org-src-container">
<pre class="src src-C"><span class="org-preprocessor">#pragma</span> omp parallel
{
<span class="org-type">double</span> <span class="org-variable-name">tmp</span>, <span class="org-variable-name">B</span>;
B = DOIT();
tmp = foo1(B);
<span class="org-preprocessor"> #pragma</span> omp atomic
X += tmp;
}
</pre>
</div>
<p>
In OpenMP, the statement governed by an atomic construct must take one of the following forms:
</p>
<ul class="org-ul">
<li><code>x binop= expr</code></li>
<li><code>x++</code> : post-increment</li>
<li><code>++x</code> : pre-increment</li>
<li><code>x--</code> : post-decrement</li>
<li><code>--x</code> : pre-decrement</li>
</ul>
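<p>
For example (my own sketch; <code>bar</code> and <code>update</code> are hypothetical names): the first statement below has the form <code>x binop= expr</code> and can be protected by atomic, while the second reads <code>x</code> inside the expression, matches none of the allowed forms, and has to fall back to critical.
</p>
<div class="org-src-container">
<pre class="src src-C">#include <omp.h>
double bar(double v);                // hypothetical helper
void update(double *x, double tmp)   // meant to be called from inside a parallel region
{
  #pragma omp atomic
  *x += tmp;                         // x binop= expr : allowed under atomic

  #pragma omp critical
  *x = bar(*x) + tmp;                // reads and rewrites x: not an atomic form, use critical
}
</pre>
</div>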
</div>
</div>
</div>
<div id="outline-container-orgef97f4a" class="outline-4">
<h4 id="orgef97f4a"><span class="section-number-4">3.2.3.</span> 利用同步机制优化定积分程序</h4>
<div class="outline-text-4" id="text-3-2-3">
</div>
<div id="outline-container-orgfb3e65b" class="outline-5">
<h5 id="orgfb3e65b"><span class="section-number-5">3.2.3.1.</span> Critical</h5>
<div class="outline-text-5" id="text-3-2-3-1">
<div class="org-src-container">
<pre class="src src-C"><span class="org-preprocessor">#include</span> <span class="org-string"><omp.h></span>
<span class="org-keyword">static</span> <span class="org-type">long</span> <span class="org-variable-name">num_steps</span> = 100000;
<span class="org-type">double</span> <span class="org-variable-name">step</span>;
<span class="org-preprocessor">#define</span> <span class="org-variable-name">NUM_THREADS</span> 2
<span class="org-type">int</span> <span class="org-function-name">main</span>()
{
<span class="org-type">int</span> <span class="org-variable-name">i</span>, <span class="org-variable-name">nthreads</span>; <span class="org-type">double</span> <span class="org-variable-name">pi</span>;
step = 1.0/(<span class="org-type">double</span>) num_steps;
omp_set_num_threads(NUM_THREADS);
<span class="org-preprocessor">#pragma</span> omp parallel
{
<span class="org-type">int</span> <span class="org-variable-name">i</span>, <span class="org-variable-name">id</span>, <span class="org-variable-name">nthrds</span>;
<span class="org-type">double</span> <span class="org-variable-name">x</span>, <span class="org-variable-name">sum</span>;
<span class="org-comment-delimiter">// </span><span class="org-comment">Create a scalar local to each thread to accumulate partial sums</span>
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
<span class="org-keyword">if</span> (id == 0) nthreads = nthrds;
<span class="org-keyword">for</span> (i=id, sum=0.0;i< num_steps; i=i+nthrds)
{
x = (i+0.5)*step;
sum += 4.0/(1.0+x*x);
}
<span class="org-preprocessor">#pragma</span> omp critical
pi += sum * step;
<span class="org-comment-delimiter">//</span><span class="org-comment">Must protect summation into pi in a critical region so updates don’t conflict</span>
}
}
</pre>
</div>
</div>
</div>
<div id="outline-container-org0276752" class="outline-5">
<h5 id="org0276752"><span class="section-number-5">3.2.3.2.</span> Atomic</h5>
<div class="outline-text-5" id="text-3-2-3-2">
<div class="org-src-container">
<pre class="src src-C"><span class="org-preprocessor">#include</span> <span class="org-string"><omp.h></span>
<span class="org-keyword">static</span> <span class="org-type">long</span> <span class="org-variable-name">num_steps</span> = 100000;
<span class="org-type">double</span> <span class="org-variable-name">step</span>;
<span class="org-preprocessor">#define</span> <span class="org-variable-name">NUM_THREADS</span> 2
<span class="org-type">int</span> <span class="org-function-name">main</span>()
{
<span class="org-type">int</span> <span class="org-variable-name">i</span>, <span class="org-variable-name">nthreads</span>; <span class="org-type">double</span> <span class="org-variable-name">pi</span>;
step = 1.0/(<span class="org-type">double</span>) num_steps;
omp_set_num_threads(NUM_THREADS);
<span class="org-preprocessor">#pragma</span> omp parallel
{
<span class="org-type">int</span> <span class="org-variable-name">i</span>, <span class="org-variable-name">id</span>, <span class="org-variable-name">nthrds</span>;
<span class="org-type">double</span> <span class="org-variable-name">x</span>, <span class="org-variable-name">sum</span>;
<span class="org-comment-delimiter">// </span><span class="org-comment">Create a scalar local to each thread to accumulate partial sums</span>
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
<span class="org-keyword">if</span> (id == 0) nthreads = nthrds;
<span class="org-keyword">for</span> (i=id, sum=0.0;i< num_steps; i=i+nthrds)
{
x = (i+0.5)*step;
sum += 4.0/(1.0+x*x);
}
sum = sum * step;
<span class="org-preprocessor">#pragma</span> omp critical
pi += sum;
<span class="org-comment-delimiter">//</span><span class="org-comment">Must protect summation into pi in a critical region so updates don’t conflict</span>
}
}
</pre>
</div>
<p>
When adding synchronization to a block of code, watch out for logic errors: in both versions above, moving the final accumulation into <code>pi</code> inside the loop would force a synchronized update on every iteration and effectively serialize the threads (use your brain)
</p>
</div>
</div>
</div>
<div id="outline-container-org45a30c3" class="outline-4">
<h4 id="org45a30c3"><span class="section-number-4">3.2.4.</span> SMPD vs. Worksharing</h4>
<div class="outline-text-4" id="text-3-2-4">
<p>
We introduced the SPMD (single program, multiple data) design style earlier, but there is another idea called worksharing: as the name suggests, the work is divided among the threads, which avoids having some threads overloaded while others sit idle
</p>
<div class="org-center">
<div id="org9aaccb8" class="figure">
<p><img src="files/core.jpg" alt="core.jpg" />
</p>
</div>
</div>
</div>
</div>
<div id="outline-container-orgfd6235e" class="outline-4">
<h4 id="orgfd6235e"><span class="section-number-4">3.2.5.</span> Worksharing constructs</h4>
<div class="outline-text-4" id="text-3-2-5">
<p>
For C/C++, OpenMP provides <code>for</code> as the loop construct name:
</p>
<div class="org-src-container">
<pre class="src src-C"><span class="org-preprocessor">#pragma</span> omp parallel
{
<span class="org-preprocessor">#pragma</span> omp <span class="org-keyword">for</span>
<span class="org-keyword">for</span> (I=0;I<N;I++)
{
foo(I);