Why do shaded edges with visible lines become jagged in CAD 2017? / How to draw a zigzag line in CAD

Posted: 2018-07-30 04:42

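&figure&&img src=&https://pic4.zhimg.com/v2-a3fcca337cbc199a539b6dca89a484bc_b.jpg& data-rawwidth=&1920& data-rawheight=&1080& class=&origin_image zh-lightbox-thumb& width=&1920& data-original=&https://pic4.zhimg.com/v2-a3fcca337cbc199a539b6dca89a484bc_r.jpg&&&/figure&&blockquote&Reposted from &a href=&https://zhuanlan.zhihu.com/p/& class=&internal&&&span class=&invisible&&https://&/span&&span class=&visible&&zhuanlan.zhihu.com/p/31&/span&&span class=&invisible&&558973&/span&&span class=&ellipsis&&&/span&&/a&&/blockquote&&p&Deep learning is hard on machines and hungry for resources. In this article I explain, for deep learning:&/p&&ul&&li&what counts as a “resource”&/li&&li&which resources each kind of operation consumes&/li&&li&how to make full use of limited resources&/li&&li&how to choose a graphics card sensibly&/li&&/ul&&p&I also correct several &b&misconceptions:&/b&&/p&&ul&&li&GPU memory and the GPU are equivalent, so using a GPU is mainly about GPU memory usage?&/li&&li&The larger the batch size, the faster the program, roughly in proportion?&/li&&li&The more GPU memory is occupied, the faster the program?&/li&&li&GPU memory usage is proportional to batch size?&/li&&/ul&&h2&&b&0 Prerequisites&/b&&/h2&&p&&code&nvidia-smi&/code& is NVIDIA's command-line management tool for graphics cards. It is built on the NVML library and is meant for managing and monitoring NVIDIA GPU devices.&/p&&figure&&img src=&https://pic1.zhimg.com/v2-df8d2d03efeefacc4cbd_b.jpg& data-size=&normal& data-rawwidth=&1380& data-rawheight=&494& class=&origin_image zh-lightbox-thumb& width=&1380& data-original=&https://pic1.zhimg.com/v2-df8d2d03efeefacc4cbd_r.jpg&&&figcaption&Output of nvidia-smi&/figcaption&&/figure&&p&This is the output of the nvidia-smi command. Its two most important metrics are:&/p&&ul&&li&GPU memory usage&/li&&li&GPU utilization&/li&&/ul&&p&GPU memory usage and GPU utilization are two different things. A graphics card consists of GPU compute units, GPU memory and other parts; the relationship between GPU memory and the GPU is roughly that between RAM and the CPU.&/p&&p&A handy little tool is worth recommending here: &code&gpustat&/code&. Install it with &code&pip install gpustat&/code&. gpustat is built on &code&nvidia-smi&/code& and gives a cleaner, more readable display. Combined with the watch command, it lets you &b&monitor GPU usage live&/b&.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&watch --color -n1 gpustat -cpu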
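&/code&&/pre&&/div&&figure&&img src=&https://pic3.zhimg.com/v2-84f91f3ed3e94ae5141ed_b.jpg& data-size=&normal& data-rawwidth=&1628& data-rawheight=&286& class=&origin_image zh-lightbox-thumb& width=&1628& data-original=&https://pic3.zhimg.com/v2-84f91f3ed3e94ae5141ed_r.jpg&&&figcaption&Output of gpustat&/figcaption&&/figure&&p&&b&GPU memory can be thought of as space, similar to RAM.&/b&&/p&&ul&&li&GPU memory holds the model and the data&/li&&li&The larger the GPU memory, the larger the network that can be run&/li&&/ul&&p&&b&GPU compute units&/b& are similar to CPU cores and perform the numerical computation. The unit for the amount of computation is the flop: &i&the number of floating-point multiplication-adds&/i&; one floating-point multiply followed by an add counts as one flop. The greater the compute capability, the higher the speed. The unit for compute capability is flops: the number of flop that can be executed per second.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&1*2+3                  1 flop
1*2 + 3*4 + 4*5        3 flop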
&/code&&/pre&&/div&&h2&&b&1. GPU memory analysis&/b&&/h2&&h2&1.1 Storage units&/h2&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&1Byte = 8 bit
1K = 1024 Byte
1M = 1024 K
1G = 1024 M
1T = 1024 G
10 K = 10*1024 Byte
&/code&&/pre&&/div&&p&Besides &code&K&/code&, &code&M&/code&, &code&G&/code& and &code&T&/code&, we also commonly use &code&KB&/code&, &code&MB&/code&, &code&GB&/code& and &code&TB&/code&. The two sets differ slightly.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&1Byte = 8 bit
1KB = 1000 Byte
1MB = 1000 KB
1GB = 1000 MB
1TB = 1000 GB
10 KB = 10000 Byte
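&/code&&/pre&&/div&&p&&code&K&/code&, &code&M&/code&, &code&G&/code& and &code&T&/code& are powers of 1024, while &code&KB&/code&, &code&MB&/code&, &code&GB&/code& and &code&TB&/code& are powers of 1000. In practice, though, we do not need to distinguish strictly between the two when estimating GPU memory.&/p&&p&Deep learning uses many numeric types. They are usually named &code&TypeNum&/code&, for example Int64, Float32, Double64.&/p&&ul&&li&Type: Int, Float, Double and so on&/li&&li&Num: usually 8, 16, 32, 64 or 128, the number of bits the type occupies&/li&&/ul&&p&The common numeric types are shown in the figure below (&i&strictly speaking, int64 corresponds to the long long type in C; on 32-bit machines long is equivalent to int32&/i&):&/p&&figure&&img src=&https://pic2.zhimg.com/v2-de841adb27c8932ccd0ee48_b.jpg& data-size=&normal& data-rawwidth=&1004& data-rawheight=&420& class=&origin_image zh-lightbox-thumb& width=&1004& data-original=&https://pic2.zhimg.com/v2-de841adb27c8932ccd0ee48_r.jpg&&&figcaption&Common numeric types&/figcaption&&/figure&&p&Float32 is the most commonly used numeric type in deep learning. It is called a single-precision floating-point number, and each one occupies 4 bytes of GPU memory.&/p&&p&For example, a 1000×1000 float32 matrix occupies roughly&/p&&blockquote&1000×1000×4 Byte = 4MB&/blockquote&&p&A four-dimensional 32x3x256x256 array (BxCxHxW) occupies 24M of GPU memory.&/p&&h2&&b&1.2 GPU memory usage of a neural network&/b&&/h2&&p&The GPU memory occupied by a neural network model includes:&/p&&ul&&li&the model's own parameters&/li&&li&the model's outputs&/li&&/ul&&p&For example, take the fully connected network shown below (ignoring the bias term b)&/p&&figure&&img src=&https://pic3.zhimg.com/v2-4c79bd422b7a_b.jpg& data-size=&normal& data-rawwidth=&1414& data-rawheight=&641& class=&origin_image zh-lightbox-thumb& width=&1414& data-original=&https://pic3.zhimg.com/v2-4c79bd422b7a_r.jpg&&&figcaption&Model inputs, outputs and parameters&/figcaption&&/figure&&p&The model's GPU memory usage includes:&/p&&ul&&li&Parameters: the two-dimensional array W&/li&&li&Model output: the two-dimensional array Y&/li&&/ul&&p&The input X can be seen as the output of the previous layer, so its memory is attributed to the previous layer.&/p&&p&&b&So GPU memory usage is just the two arrays W and Y?&/b&&/p&&p&&b&Not at all!&/b&&/p&&p&Let us go through it in detail.&/p&&h2&1.2.1 GPU memory used by parameters&/h2&&p&Only layers that have parameters occupy this kind of memory. This part is &b&independent of the input&/b&; it is occupied as soon as the model has been loaded.&/p&&p&&b&Layers with parameters mainly include:&/b&&/p&&ul&&li&Convolution&/li&&li&Fully connected&/li&&li&BatchNorm&/li&&li&Embedding layers&/li&&li&... ...&/li&&/ul&&p&&b&Layers without parameters&/b&:&/p&&ul&&li&Most activation layers (Sigmoid/ReLU)&/li&&li&Pooling layers&/li&&li&Dropout&/li&&li&... 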
...&/li&&/ul&&p&More specifically, the number of parameters per layer (ignoring the bias term b throughout) is:&/p&&ul&&li&Linear(M->N): number of parameters: M×N&/li&&li&Conv2d(Cin, Cout, K): number of parameters: Cin × Cout × K × K&/li&&li&BatchNorm(N): number of parameters: 2N&/li&&li&Embedding(N,W): number of parameters: N × W&/li&&/ul&&p&&b&GPU memory used by parameters = number of parameters × n&/b&&/p&&p&&i&n = 4 : float32&/i&&/p&&p&&i&n = 2 : float16&/i&&/p&&p&&i&n = 8 : double64&/i&&/p&&p&In PyTorch, the corresponding memory is occupied once you have run &code&model=MyGreatModel().cuda()&/code&. The amount is close to the estimate above (&i&slightly larger, because of other overhead&/i&).&/p&&h2&1.2.2 GPU memory used by gradients and momentum&/h2&&p&For example, if the optimizer is SGD:&/p&&p&&img src=&https://www.zhihu.com/equation?tex=W_%7Bt%2B1%7D+%3D+W_%7Bt%7D+-+%5Calpha+%5Cnabla+F%28W_t%29& alt=&W_{t+1} = W_{t} - \alpha \nabla F(W_t)& eeimg=&1&&&/p&&p&Besides W itself, the corresponding gradient &img src=&https://www.zhihu.com/equation?tex=%5Cnabla+F%28W%29& alt=&\nabla F(W)& eeimg=&1&& must also be stored, so the memory used is twice the memory used by the parameters.&/p&&p&If the optimizer is SGD with momentum:&/p&&p&&img src=&https://www.zhihu.com/equation?tex=v_%7Bt%2B1%7D+%3D+%5Crho+v_t+%2B+%5Cnabla+F%28W_t%29%5C%5C+W_%7Bt%2B1%7D+%3D+W_%7Bt%7D+-+%5Calpha+v_%7Bt%2B1%7D& alt=&v_{t+1} = \rho v_t + \nabla F(W_t)\\ W_{t+1} = W_{t} - \alpha v_{t+1}& eeimg=&1&&&/p&&p&the momentum must be stored as well, so the memory is x3.&/p&&p&With the Adam optimizer the momentum terms take even more memory, so the memory is x4.&/p&&p&To summarize, the &b&input-independent GPU memory usage&/b& of a model includes:&/p&&ul&&li&Parameters &b&W&/b&&/li&&li&Gradients &b&dW&/b& (usually the same size as the parameters)&/li&&li&The optimizer's &b&momentum&/b& (plain SGD has none, momentum SGD stores as much momentum as gradients, Adam stores twice as much momentum as gradients)&/li&&/ul&&h2&1.2.3 GPU memory used by inputs and outputs&/h2&&p&This part of the memory depends mainly on the shape of the output feature map.&/p&&figure&&img src=&https://pic3.zhimg.com/v2-f4d6bb1e12dad4f2cc56f9_b.jpg& data-size=&normal& data-rawwidth=&525& data-rawheight=&250& class=&origin_image zh-lightbox-thumb& width=&525& data-original=&https://pic3.zhimg.com/v2-f4d6bb1e12dad4f2cc56f9_r.jpg&&&figcaption&feature map&/figcaption&&/figure&&p&For example, the input and output of a convolution satisfy the following relation:&/p&&figure&&img src=&https://pic1.zhimg.com/v2-f05c13bb2fd8eecc0bd7b6e4d3d3148d_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&1523& data-rawheight=&636& class=&origin_image zh-lightbox-thumb& width=&1523& 
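data-original=&https://pic1.zhimg.com/v2-f05c13bb2fd8eecc0bd7b6e4d3d3148d_r.jpg&&&/figure&&p&From this you can compute the shape of each layer's output tensor, and from that the corresponding GPU memory usage.&/p&&p&&br&&/p&&p&The GPU memory used by model outputs can be summarized as follows:&/p&&ul&&li&The shape of each layer's feature map (the shape of a multi-dimensional array) has to be computed&/li&&li&The gradient of each output has to be stored for backpropagation (chain rule)&/li&&li&&b&This part of the memory usage is proportional to the batch size&/b&&/li&&li&Model outputs do not need any stored momentum.&/li&&/ul&&p&For the GPU memory usage of a neural network in deep learning we get the following formula:&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&GPU memory usage = model memory usage + batch_size × memory usage per sample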
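&/code&&/pre&&/div&&p&So total GPU memory is not simply proportional to batch size, especially when the model itself is large, for example with big fully connected layers or big Embedding layers.&/p&&p&Also note:&/p&&ul&&li&Inputs (data, images) usually do not need gradients&/li&&li&The input and output of every layer have to be kept for backpropagation, but in some special cases the input does not need to be kept. Take ReLU: in PyTorch, &code&nn.ReLU(inplace = True)&/code& writes the output of the ReLU activation directly over its input, which saves a good deal of GPU memory. Interested readers can think about how backpropagation works in that case (hint: y=relu(x) -> &i&dx = dy.copy(); dx[y<=0]=0&/i&)&/li&&/ul&&h2&&b&1.3 Ways to save GPU memory&/b&&/h2&&p&In deep learning, the outputs of layers such as convolutions usually take the most GPU memory. Model parameters take comparatively little and are hard to optimize.&/p&&p&The usual ways to save GPU memory are:&/p&&ul&&li&Reduce the batch size&/li&&li&Downsample (NCHW -> (1/4)*NCHW)&/li&&li&Use fewer fully connected layers (usually keep only the final one used for classification)&/li&&/ul&&h2&&b&2 Computation analysis&/b&&/h2&&p&The amount of computation was defined earlier. The more computation an operation needs, the longer it takes and the more time running the network costs.&/p&&h2&&b&2.1 Computation of common operations&/b&&/h2&&p&The computation of common operations is as follows:&/p&&ul&&li&Fully connected layer: BxMxN, where B is the batch size, M the input size and N the output size.&/li&&li&Convolution: &img src=&https://www.zhihu.com/equation?tex=BHWC_%7Bout%7DC_%7Bin%7DK%5E2& alt=&BHWC_{out}C_{in}K^2& eeimg=&1&&&/li&&/ul&&figure&&img src=&https://pic2.zhimg.com/v2-3fa53f85b4a3b9ef92fc_b.jpg& data-size=&normal& data-rawwidth=&1279& data-rawheight=&665& class=&origin_image zh-lightbox-thumb& width=&1279& data-original=&https://pic2.zhimg.com/v2-3fa53f85b4a3b9ef92fc_r.jpg&&&figcaption&Computation analysis of a convolution&/figcaption&&/figure&&ul&&li&BatchNorm: my own rough estimate is &img src=&https://www.zhihu.com/equation?tex=BHWC%5Ctimes+%5C%7B4%2C5%2C6%5C%7D& alt=&BHWC\times \{4,5,6\}& eeimg=&1&& ; corrections are welcome&/li&&li&Pooling: &img src=&https://www.zhihu.com/equation?tex=BHWCK%5E2& alt=&BHWCK^2& eeimg=&1&&&/li&&/ul&&figure&&img src=&https://pic4.zhimg.com/v2-885ceae992cab0e07c64c_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&1468& data-rawheight=&642& class=&origin_image zh-lightbox-thumb& width=&1468& data-original=&https://pic4.zhimg.com/v2-885ceae992cab0e07c64c_r.jpg&&&/figure&&ul&&li&ReLU: BHWC&/li&&/ul&&h2&&b&2.2 AlexNet analysis&/b&&/h2&&p&The analysis of AlexNet is shown in the figure below. On the left is the number of parameters of each layer (not the GPU memory usage); on the right is the compute consumed. 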
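Some of the numbers here may not match the formulas above, because the original AlexNet implementation is somewhat special (it was implemented across multiple GPUs).&/p&&figure&&img src=&https://pic4.zhimg.com/v2-65d204bce61edfa7c08f_b.jpg& data-size=&small& data-rawwidth=&607& data-rawheight=&758& class=&origin_image zh-lightbox-thumb& width=&607& data-original=&https://pic4.zhimg.com/v2-65d204bce61edfa7c08f_r.jpg&&&figcaption&AlexNet analysis&/figcaption&&/figure&&p&We can see that:&/p&&ul&&li&The fully connected layers hold the vast majority of the parameters&/li&&li&The convolution layers account for the most computation&/li&&/ul&&h2&&b&2.3 Reducing the computation of convolution layers&/b&&/h2&&p&MobileNet, which Google proposed this year, uses a technique called Depthwise Convolution that makes the network run much faster. The core idea is to split one convolution into a combination of two simpler operations. In the figure, the original convolution is on the left and the combination of two special, simpler convolutions is on the right (the upper one resembles pooling but has weights; the lower one resembles a fully connected operation).&/p&&figure&&img src=&https://pic1.zhimg.com/v2-7d27d0bcc2f848c9e36b73ecd42b4480_b.jpg& data-size=&normal& data-rawwidth=&1664& data-rawheight=&827& class=&origin_image zh-lightbox-thumb& width=&1664& data-original=&https://pic1.zhimg.com/v2-7d27d0bcc2f848c9e36b73ecd42b4480_r.jpg&&&figcaption&Depthwise Convolution&/figcaption&&/figure&&p&The effect of this is:&/p&&ul&&li&GPU memory usage goes up (the output of each step has to be stored)&/li&&li&Computation drops considerably, to ( &img src=&https://www.zhihu.com/equation?tex=%7B1%5Cover+C_%7Bout%7D+%7D+%2B+%5Cfrac+1+%7Bk%5E2%7D& alt=&{1\over C_{out} } + \frac 1 {k^2}& eeimg=&1&& ) of the original (typically 10-15% of the original)&/li&&/ul&&h2&&b&2.4 GPU memory / computational complexity / accuracy of common models&/b&&/h2&&p&A paper from last year (&u&&a href=&https://link.zhihu.com/?target=https%3A//arxiv.org/abs/& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://&/span&&span class=&visible&&arxiv.org/abs/&/span&&span class=&invisible&&8&/span&&span class=&ellipsis&&&/span&&/a&&/u&) summarized the metrics of the models in common use at the time. The horizontal axis is computational complexity (further right means slower), the vertical axis is accuracy (higher is better), and the area of each circle is the number of parameters (not the GPU memory usage); the more parameters, the larger the saved model file. I drew a small red circle in the top-left corner: that is the ideal model, fast, accurate and light on GPU memory.&/p&&figure&&img src=&https://pic1.zhimg.com/v2-b56bee7eed70_b.jpg& data-size=&normal& data-rawwidth=&1066& data-rawheight=&748& class=&origin_image zh-lightbox-thumb& width=&1066& data-original=&https://pic1.zhimg.com/v2-b56bee7eed70_r.jpg&&&figcaption&Computation / parameter count / accuracy of common models&/figcaption&&/figure&&h2&&b&3 Summary&/b&&/h2&&h2&&b&3.1 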
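Recommendations&/b&&/h2&&ul&&li&Time is the scarcer resource, so make the model as fast as you can (reduce flop)&/li&&li&GPU memory usage is not simply proportional to batch size; the model's own parameters and the data derived from them also take GPU memory&/li&&li&A larger batch size is not necessarily faster. Once the compute resources are fully used, increasing the batch size brings only a small speedup&/li&&/ul&&p&On batch size in particular, assuming the GPU's processing units are already fully used:&/p&&ul&&li&Increasing the batch size raises speed, but only a little (mainly through better parallelism)&/li&&li&Increasing the batch size dampens gradient oscillation and needs fewer optimization iterations to converge, but each iteration takes longer.&/li&&li&Increasing the batch size reduces the number of optimization steps per epoch, so convergence may slow down and take more time overall (for example when batch_size equals the full number of samples).&/li&&/ul&&h2&&b&3.2 Choosing a graphics card&/b&&/h2&&p&The specifications of the graphics cards commonly available today are as follows:&/p&&figure&&img src=&https://pic3.zhimg.com/v2-6ca96eb4b9b_b.jpg& data-size=&normal& data-rawwidth=&836& data-rawheight=&465& class=&origin_image zh-lightbox-thumb& width=&836& data-original=&https://pic3.zhimg.com/v2-6ca96eb4b9b_r.jpg&&&figcaption&Performance figures of common graphics cards (Base Core, ignoring tensorcore, Boost, etc.)&/figcaption&&/figure&&p&For more cards and more figures, see &i&&u&&a href=&https://link.zhihu.com/?target=https%3A//en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://&/span&&span class=&visible&&en.wikipedia.org/wiki/L&/span&&span class=&invisible&&ist_of_Nvidia_graphics_processing_units&/span&&span class=&ellipsis&&&/span&&/a&&/u&&/i&&/p&&p&The &b&GTX 1080TI&/b& clearly offers the best value: it is faster than the new Titan X, far cheaper, and has only 1 GB less memory (supposedly that 1 GB was cut on purpose, since outperforming the Titan X across the board would have angered Titan X buyers~).&/p&&ul&&li&The K80 is poor value (slow and very expensive)&/li&&li&Mind the difference between the GTX TITAN X and the Nvidia TITAN X&/li&&li&The performance of tensorcore cannot yet be fully exploited, so it is not considered here. 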
Other enterprise cards in the tesla series, such as the P100, are not listed; ordinary consumers will not buy them and they are poor value (a single DGX 1 costs over a million.....)&/li&&/ul&&p&I have also made a set of slides to go with this article: &a href=&https://link.zhihu.com/?target=https%3A//docs.google.com/presentation/d/e/2PACX-1vQVHMzd5MKrAbsYtCCsWDJ4eo9AUGGsC1tHtOY0agRfUbK0a9YCySvgNejuOLokB6tHbj0tLuohCaNP/pub%3Fstart%3Dfalse%26loop%3Dfalse%26delayms%3D3000%26slide%3Did.p& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Google Slides: Neural network performance analysis&/a&. Users in mainland China can &a href=&https://link.zhihu.com/?target=http%3A//misc-.cosbj.myqcloud.com/%25E7%25A5%259E%25E7%25BB%258F%25E7%25BD%%25BB%259C%25E5%E6%259E%2590.pptx& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&download the ppt here&/a&. The Google Slides version is formatted better; the ppt may not display quite right.&/p&
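The memory and computation formulas from sections 1 and 2 can be tried out in a few lines of Python. This is only a rough sketch: the helper names, the table of per-optimizer multipliers and the example layer sizes are assumptions made for illustration, and it ignores the framework overhead that real measurements include.

```python
# Rough estimates following the formulas in this article (illustrative only).
# All names below are hypothetical helpers, not part of any library.

BYTES_PER_ELEMENT = {"float16": 2, "float32": 4, "float64": 8}

# Copies kept per parameter: weights + gradients (+ optimizer state).
# SGD: W, dW -> 2; momentum SGD: W, dW, v -> 3; Adam: W, dW, two moments -> 4.
COPIES_PER_PARAM = {"sgd": 2, "momentum": 3, "adam": 4}


def linear_params(m, n):
    """Linear(M -> N) without bias: M * N parameters."""
    return m * n


def conv2d_params(c_in, c_out, k):
    """Conv2d(Cin, Cout, K) without bias: Cin * Cout * K * K parameters."""
    return c_in * c_out * k * k


def tensor_bytes(shape, dtype="float32"):
    """Memory of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BYTES_PER_ELEMENT[dtype]


def model_bytes(num_params, optimizer="sgd", dtype="float32"):
    """Input-independent memory: parameters, gradients, optimizer state."""
    return num_params * COPIES_PER_PARAM[optimizer] * BYTES_PER_ELEMENT[dtype]


def total_bytes(num_params, batch_size, per_sample_bytes,
                optimizer="sgd", dtype="float32"):
    """memory = model memory + batch_size * per-sample memory."""
    return model_bytes(num_params, optimizer, dtype) + batch_size * per_sample_bytes


def conv2d_flops(b, h, w, c_in, c_out, k):
    """Multiply-adds of a convolution: B * H * W * Cout * Cin * K^2."""
    return b * h * w * c_out * c_in * k * k


def depthwise_separable_flops(b, h, w, c_in, c_out, k):
    """Depthwise K x K step plus pointwise 1 x 1 step."""
    return b * h * w * c_in * k * k + b * h * w * c_in * c_out


if __name__ == "__main__":
    mib = 1024 * 1024
    # The 32x3x256x256 float32 example from section 1.1, in units of 1024*1024 bytes.
    print(tensor_bytes((32, 3, 256, 256)) / mib)
    # A hypothetical 4096 -> 4096 linear layer trained with Adam.
    print(model_bytes(linear_params(4096, 4096), "adam") / mib)
    # Depthwise-separable cost relative to a plain convolution (1/Cout + 1/K^2).
    print(depthwise_separable_flops(1, 56, 56, 64, 64, 3)
          / conv2d_flops(1, 56, 56, 64, 64, 3))
```

Keeping the input-independent part (`model_bytes`) separate from the per-sample part mirrors the formula in section 1.2.3 and makes it easy to see which term dominates when the batch size changes.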
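&figure&&img src=&https://pic1.zhimg.com/v2-657d4c369db90c1bfe3ea9dd06a173e0_b.jpg& data-rawwidth=&690& data-rawheight=&769& class=&origin_image zh-lightbox-thumb& width=&690& data-original=&https://pic1.zhimg.com/v2-657d4c369db90c1bfe3ea9dd06a173e0_r.jpg&&&/figure&&p&&/p&&p&Over the past few years, machine learning model training and the blockchain world (both the coin and the chain communities) have demanded ever more compute, so expectations of hardware speed keep rising. Naturally, GPU-based acceleration, long a staple of traditional scientific computing, has also drawn a great deal of attention: TensorFlow uses GPUs for its underlying computation, and many mining programs (e.g., ethminer) drive GPUs as hard as they can. In 2016 and 2017, the surge in coin prices and the resulting demand for GPUs for mining multiplied NVIDIA's share price several times over.&/p&&h2&&b&1 Requirements&/b&&/h2&&p&Unfortunately, GPU programming today relies mainly on two frameworks: OpenCL and CUDA. Both require developers to know a lot about the lower layers, mainly in that they must: &/p&&ol&&li&implement the kernel functions at the low level themselves&/li&&li&allocate and manage GPU memory themselves, and handle the transfers between host memory and GPU memory&/li&&li&hand-optimize the kernel implementations themselves, for example based on data types&/li&&/ol&&p&Because of this complexity, GPU applications (e.g., coin mining, machine learning) are very hard to implement for developers working in high-level languages, for example on the JVM (Java or other JVM languages), currently the most widely used programming platform. Developers can only wrap the GPU through their own JNI code and then call it from Java on top (which is not practical for the vast majority of programmers).&/p&&p&&br&&/p&&p&This article does not go deep into the technical details of J9's GPU support; it is written more as a high-level introduction. In 2014 and 2015, and again in 2017, I followed, took part in and discussed some of J9's investigation and work in this area. This included contact at PPPJ in 2015 with Dr. Akihiro, Dr. Kazuaki and others from Rice and IBM Tokyo RD, and I went through the code several times.&/p&&p&First, a few things need to be made clear:&/p&&ol&&li&The JVM Specification does not require a JVM to support GPUs. This feature is specific to IBM J9 Java 8. (I do not know about Hotspot; if anyone does, please speak up.)&/li&&li&J9 Java 8 was the first to support CUDA GPUs, at least as of 2015 when I looked; other vendors may support it today (to be verified).&/li&&li&This article also distinguishes J9's approach from another open-source Java and GPU project, &b&&a href=&https://link.zhihu.com/?target=https%3A//github.com/Syncleus/aparapi& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&aparapi&/a&&/b&. Aparapi supports GPUs from within the Java language, but developers have to work with kernel-like functions themselves and need a background in GPU development. J9 does not require the high-level code to know anything about CUDA/OpenCL underneath; the support sits directly in the runtime layer.&/li&&/ol&&p&Reading this article requires some background:&/p&&ul&&li&A basic idea of the JVM and bytecode&/li&&li&A basic idea of what GPUs and CUDA are, and why GPUs exist at all&/li&&li&Knowing what a JIT and a compiler are for&/li&&/ul&&h2&&b&2 How it roughly works&/b&&/h2&&p&J9's GPU support is limited in the following two respects&/p&&ol&&li&CUDA GPUs (strictly speaking, NVIDIA's CUDA framework).&/li&&li&Support is limited to the Stream API in Java 8.
For example: &/li&&/ol&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&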
LongStream.range(low, up).parallel().forEach(i -> <lambda>)
&/code&&/pre&&/div&&p&The Stream API in Java 8 abstracts parallelism out of the high-level application, while the GPU implements data parallelism in physical hardware as Single Instruction, Multiple Data (SIMD). Implementing the high-level parallel logic directly on the GPU is therefore a very natural idea.&/p&&p&&br&&/p&&p&The overall flow looks roughly like this:&/p&&figure&&img src=&https://pic1.zhimg.com/v2-57abb2c1b1faafd601b0ded_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&1000& data-rawheight=&245& class=&origin_image zh-lightbox-thumb& width=&1000& data-original=&https://pic1.zhimg.com/v2-57abb2c1b1faafd601b0ded_r.jpg&&&/figure&&p&1. While interpreting bytecode instructions, the J9 interpreter detects and recognizes the forEach (for loops) of the Stream API and the lambda closure (this involves invokedynamic). &/p&&p&2. The JIT compiler (TR in J9) then optimizes and produces two pieces of code: host machine code, and target GPU code (NVVM IR).&/p&&ul&&li&Host machine code (i.e., the CPU here): this part allocates memory on the GPU, copies data between GPU and CPU memory, calls the NVIDIA driver to compile, and launches the execution of the NVVM IR on the GPU. &/li&&li&NVVM IR: strictly speaking this corresponds to the lambda closure. It is eventually turned into Parallel Thread Execution (PTX) instructions, from which the NVIDIA compiler finally generates the actual GPU instructions. Taken together, the transformation is:&/li&&/ul&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&bytecode -> NVVM IR -> PTX instruction -> NVIDIA GPU instruction.
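&/code&&/pre&&/div&&h2&&b&3 Optimizations&/b&&/h2&&p&Using the GPU directly to implement parallelism in high-level Java can make the application run more efficiently. That holds only under fairly ideal conditions.&/p&&p&So when translating lambda closure -> NVVM, the JVM has to perform further optimizations. To raise compute throughput, or put differently to squeeze the most out of the existing hardware, one usually looks at two things: memory and instructions.&/p&&h2&3.1 Memory (Array aligning)&/h2&&p&Optimizing runtime performance through memory is a very large topic in itself: memory management, GC, cache, locality and so on. Covering it properly would take a book.&/p&&p&In J9's GPU support, memory optimization refers to the handling of arrays in GPU memory (i.e., device memory), not arrays in host memory. The original CUDA memory allocation scheme placed the array object (array header and array body) directly into one contiguous block of addresses (starting from 0). The new optimization re-places the array object so that the body starts at a multiple of 128 (e.g., index 31), as shown below:&/p&&figure&&img src=&https://pic4.zhimg.com/v2-f0023fae5dd4e0537cdfa_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&483& data-rawheight=&174& class=&origin_image zh-lightbox-thumb& width=&483& data-original=&https://pic4.zhimg.com/v2-f0023fae5dd4e0537cdfa_r.jpg&&&/figure&&p&There are two main reasons for doing this: a) memory operations are mostly reads and writes of the elements, and the array header is accessed far less often by comparison, so header accesses matter less; b) once aligned to the 128 boundary, a read or write completes in one GPU instruction cycle, whereas the former layout needs two cycles (first: 0-127, second: 128-384).&/p&&p&&br&&/p&&p&Besides array aligning, J9 optimizes memory in other ways as well (not covered in detail here): using the JIT to identify reads and writes of memory regions, re-ordering instructions to achieve a high memory cache hit rate on the GPU, removing unnecessary instructions that copy from GPU memory to host memory, and array header elimination. &/p&&h2&3.2 Instruction optimization (Lambda Closure Optimization)&/h2&&p&Instruction optimization is traditional JIT compiler territory, so the traditional JIT optimizations (e.g., dead code elimination) can mostly be reused as they are; after all, the inside of a lambda closure is bytecode too. This part needs no detailed discussion.&/p&&p&What is worth discussing is cross-lambda method calling. Strictly speaking, the caller is inside the lambda while the callee is outside the lambda closure. Take the following example:&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&public class Sample{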
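    public void myMethod(Object receiver, int arg) {}
    // The lambda body below calls two methods defined outside the lambda closure.
    public void caller(Sample obj) {
        // The range bounds are illustrative; the original elides the stream source.
        java.util.stream.IntStream.range(0, 8).parallel().forEach(i -> {
            myMethod(obj, i);        // cross-lambda call
            obj.anotherMethod();     // virtual call; receiver type resolved at runtime
        });
    }
    public void anotherMethod() {}
}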
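&/code&&/pre&&/div&&p&&br&&/p&&p&In this example, the calls to myMethod and anotherMethod both cross the lambda closure (put plainly: neither myMethod nor anotherMethod is a method inside the closure). This raises a problem, or rather two gaps: a) code that runs fast on the GPU calls code that runs slowly on the CPU -_-!!&/p&&p&b) the GPU also has to decide at runtime which implementation a method call resolves to (in Java, unless explicitly marked otherwise, a method is invoked via invokevirtual). =_=&/p&&p&To determine the concrete implementation of a method, the receiver of the method call and the virtual method table must be copied at runtime from host memory to device memory. Keeping the program correct across these two gaps would kill the GPU's efficiency.&/p&&p&To resolve both gaps, the JIT compiler in J9 applies an inline caching (IC) optimization directly (Akihiro considers it method inlining). In VMs, inline caching was used earlier in the SELF language (for details, see Dr. Urs Hölzle's PhD thesis [3]).&/p&&p&In short: when generating NVVM, the compiler first assumes a specific type for the receiver of the method call and inlines that type's implementation of the called method at the call site. To keep the generated code correct, it inserts a guard before the original call site to check the assumption. If the check fails, execution jumps to the original slow path: the GPU stops and asks the CPU to determine the receiver's actual type (the GPU could do this too, but only after copying from host memory to device memory) and to execute the callee's method. So obj.anotherMethod(); inside the lambda above becomes&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&if(receiver is someType){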
    // inlined body of SomeType's anotherMethod; executed inside the GPU kernel
} else {
    // guard failed: take the original slow path
    invoke receiver.anotherMethod(..)   // executed on the CPU
}
&/code&&/pre&&/div&&p&&br&&/p&&p&4 Notes:&/p&&ul&&li&For performance benchmark results, see reference [4]. &/li&&li&The figures in this article come from [4] and [5]&/li&&li&Please keep the original author's name when reposting&/li&&/ul&&p&&br&&/p&&p&[1] Project &a href=&https://link.zhihu.com/?target=https%3A//github.com/Syncleus/aparapi& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Syncleus/aparapi&/a&&/p&&p&[2] &a href=&https://link.zhihu.com/?target=https%3A//docs.nvidia.com/cuda/nvvm-ir-spec/index.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&CUDA Toolkit Documentation&/a&&/p&&p&[3] Urs Hölzle, Adaptive Optimization for Self: Reconciling High Performance with Exploratory Programming. Ph.D. thesis, Stanford, CA, USA, 1995, UMI Order No. GAX95-12396.&/p&&p&[4] Kazuaki Ishizaki, Akihiro Hayashi, Gita Koblents and Vivek Sarkar, Compiling and Optimizing Java 8 Programs for GPU Execution.&/p&&p&[5] Akihiro Hayashi, Kazuaki Ishizaki, and Gita Koblents, Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection. PPPJ 2015.&/p&&p&&/p&
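The guarded inlining described in section 3.2 can be illustrated with a small control-flow sketch. It is plain Python running on the host and only mimics the shape of the generated code; it is not J9's implementation, it touches neither a GPU nor a JVM, and every name in it (SomeType, OtherType, guarded_call, stats) is made up for illustration.

```python
# Control-flow sketch of a type guard placed in front of an inlined call.
# Hypothetical names throughout; nothing here is taken from J9.

class SomeType:
    def another_method(self, x):
        return x + 1


class OtherType(SomeType):
    def another_method(self, x):
        return x * 2


# Counts how often each path is taken.
stats = {"fast": 0, "slow": 0}


def guarded_call(receiver, x):
    # Guard: the compiler assumed the receiver is exactly SomeType.
    if type(receiver) is SomeType:
        stats["fast"] += 1
        return x + 1  # inlined body of SomeType.another_method
    # Guard failed: fall back to ordinary virtual dispatch (the slow path).
    stats["slow"] += 1
    return receiver.another_method(x)


if __name__ == "__main__":
    receivers = [SomeType(), SomeType(), OtherType()]
    print([guarded_call(r, 10) for r in receivers])
    print(stats)
```

The guard is an exact type check rather than an `isinstance` test because a subclass may override the method, in which case the inlined body would be wrong; that is why `OtherType` takes the slow path here.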
&figure&&img src=&https://pic3.zhimg.com/v2-e8126dbe267aeb_b.jpg& data-rawwidth=&2816& data-rawheight=&1437& class=&origin_image zh-lightbox-thumb& width=&2816& data-original=&https://pic3.zhimg.com/v2-e8126dbe267aeb_r.jpg&&&/figure&&p&&/p&&h2&&b&系列文章前言&/b&&/h2&&p&&br&&/p&&p&《GPU Gems》1~3 、《GPU Pro》1~7 以及《GPU Zen》组成的GPU精粹系列书籍，是游戏开发、计算机图形学和渲染领域的业界大牛们优秀经验的分享合辑汇编，是江湖各大武林门派绝学经典招式的精华荟萃，可谓游戏开发、图形学和渲染领域进阶知识精彩绝伦的饕餮盛宴。&/p&&p&&br&&/p&&p&这个系列书籍中所收录的文章不仅有奥斯卡特效大奖得主的成名之作，还有工业光魔等业界前沿的特效工作室带来的精彩分享，各种著名的游戏工作室、知名游戏引擎一线开发人员的肺腑之言，以及各式图形学大牛的经验之谈，可谓字字珠玑，干货无数。&/p&&p&&br&&/p&&p&而因为出版风格的相似性，都是出版当代前沿的图形学文章精粹的合辑汇编，我们可以将《GPU Gems》1~3 、《GPU Pro》1~7 以及《GPU Zen》组成的GPU系列书籍，目前共11本书，合称为“GPU精粹三部曲“。&/p&&p&&br&&/p&&p&可以毫不夸张地说，“GPU精粹三部曲“这11本书，是图形学和渲染爱好者站在巨人的肩膀上，了解图形学业界各种高阶知识和技法Trick，将自己的图形学与渲染能力进阶提升到下一个高度的捷径之一&i&。&/i&&/p&&p&&br&&/p&&p&而如果你要造轮子，自己开发3D引擎，书中的不少Trick，前人踩坑的经验总结，也会对你有帮助。&/p&&p&&br&&/p&&p&我们可以用一张图来了解“GPU精粹三部曲”目前11本书的面世顺序。&/p&&figure&&img src=&https://pic1.zhimg.com/v2-570c74c9f1a845f7042461dfa3000da5_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&1790& data-rawheight=&2665& class=&origin_image zh-lightbox-thumb& width=&1790& data-original=&https://pic1.zhimg.com/v2-570c74c9f1a845f7042461dfa3000da5_r.jpg&&&/figure&&p&图 GPU精粹三部曲&/p&&p&&br&&/p&&p&有趣的是，虽说是三部曲，但其中的每一部，分别由不同的出版社出版： &/p&&ul&&li&Addison-Wesley出版社的《GPU Gems 1~3》 &/li&&li&CRC Press出版社的《GPU Pro 1~7》 &/li&&li&Black Cat Publishing出版社的《GPU Zen》 &/li&&/ul&&p&&br&&/p&&p&另外值得注意的是，这三部曲其中仅有《GPU Gems 
1~3》有中文版，对应为人民邮电出版社出版的《GPU精粹1》，清华大学出版社出版的《GPU精粹2~3》。&/p&&p&&br&&/p&&p&总之，“GPU精粹三部曲”是图形学进阶学习的大宝藏，是图形学和渲染领域进阶知识的饕餮盛宴。如果你希望进阶地学习图形学、渲染以及Shader编程，仔细研读，认真实践，一定会受益匪浅。&/p&&p&&br&&/p&&p&&br&&/p&&h2&&b&系列文章写作风格说明&/b&&/h2&&p&&br&&/p&&p&这个系列，会总结和提炼“GPU精粹三部曲”11本书中总计200+篇关于游戏开发与渲染的核心内容，暂定每篇文章提炼“GPU精粹三部曲”中一本书的核心内容。&/p&&p&&br&&/p&&p&暂定对每本书中每章的提炼字数控制在500字以内，而每篇文章，将由20章左右的内容组成。具体的章节分布，可以在下文的核心内容列表一览中看到。&/p&&p&&br&&/p&&p&也因为是每篇文章需要提炼一整本书的内容，篇幅有限，只能是抓住重点和核心内容去总结，无法做到事无巨细，每章都交代足够的细节。如果你通过阅读这个系列文章，发现有些章节你很感兴趣，便可以找到原书中对应的原章节，进行进一步详细了解和研究。&/p&&p&&br&&/p&&p&而写作顺序方面，自然是按照书籍的出版时间正序进行，即第一篇正式文章，提炼总结《GPU&br&Gems 1》的内容，最后一篇正式文章，提炼《GPU Zen》的内容。&/p&&p&&br&&/p&&p&&br&&/p&&p&&br&&/p&&p&&br&&/p&&h2&&b&关于GPU精粹三部曲的有用链接&/b&&/h2&&p&&br&&/p&&ul&&li&《GPU Gem》1~3 的英文原文web版，已经由NVIDIA开源，链接在这里：&a href=&https://link.zhihu.com/?target=https%3A//developer.nvidia.com/gpugems/GPUGems/gpugems_pref01.html& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://&/span&&span class=&visible&&developer.nvidia.com/gp&/span&&span class=&invisible&&ugems/GPUGems/gpugems_pref01.html&/span&&span class=&ellipsis&&&/span&&/a&&/li&&li&《GPU Pro》 1~7和《GPU Zen》等8本书的主编都是图形大牛 Wolfgang Engel。这边是Wolfgang Engel的博客主页：&a href=&https://link.zhihu.com/?target=https%3A//www.blogger.com/profile/& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://www.&/span&&span class=&visible&&blogger.com/profile/110&/span&&span class=&invisible&&97662&/span&&span class=&ellipsis&&&/span&&/a&&/li&&li&以及Wolfgang Engel维护的《GPU Pro》系列书籍的博客地址：&a href=&https://link.zhihu.com/?target=http%3A//gpupro.blogspot.com/& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://&/span&&span class=&visible&&gpupro.blogspot.com/&/span&&span class=&invisible&&&/span&&/a&&/li&&li&还有Wolfgang Engel维护的《GPU Zen》系列书籍的博客地址：&a href=&https://link.zhihu.com/?target=https%3A//gpuzen.blogspot.com/& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span 
class=&invisible&&https://&/span&&span class=&visible&&gpuzen.blogspot.com/&/span&&span class=&invisible&&&/span&&/a&&/li&&/ul&&p&&br&&/p&&p&&br&&/p&&p&&br&&/p&&h2&&b&GPU精粹全系列核心内容列表一览&/b&&/h2&&p&&br&&/p&&p&前文已经提到，“GPU精粹三部曲”全系列共11本书，具体列举如下：&/p&&ul&&li&GPU Gems 1 [2004]&/li&&li&GPU Gems 2 [2005]&/li&&li&GPU Gems 3 [2006]&/li&&li&GPU Pro 1 [2010]&/li&&li&GPU Pro 2 [2011]&/li&&li&GPU Pro 3 [2012]&/li&&li&GPU Pro 4 [2013]&/li&&li&GPU Pro 5 [2014]&/li&&li&GPU Pro 6 [2015]&/li&&li&GPU Pro 7 [2016]&/li&&li&GPU Zen [2017]&/li&&/ul&&p&以下对每本书的核心章节分别进行目录式列举。&/p&&p&&br&&/p&&p&&br&&/p&&h2&&b&一、第一本书《GPU Gems 1（GPU精粹1）》&/b&&/h2&&figure&&img src=&https://pic1.zhimg.com/v2-acb3b843c545ef8ea86ea2fc_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&461& data-rawheight=&648& class=&origin_image zh-lightbox-thumb& width=&461& data-original=&https://pic1.zhimg.com/v2-acb3b843c545ef8ea86ea2fc_r.jpg&&&/figure&&p&&br&&/p&&h2&1.1 第一部分自然效果的渲染（Natural Effects）&/h2&&p&第1章用物理模型进行高效的水模拟（Effective Water Simulation from Physical Models）&/p&&p&第2章水焦散的渲染（Rendering Water Caustics）&/p&&p&第3章 Dawn Demo中的皮肤（Skin in the &Dawn& Demo）&/p&&p&第4章 Dawn Demo中的动画（Animation in the &Dawn& Demo）&/p&&p&第5章改良的Perlin噪声实现（Implementing Improved Perlin Noise）&/p&&p&第6章 Vulcan Demo中的火焰渲染（Fire in the &Vulcan& Demo）&/p&&p&第7章无尽波动草叶的渲染（Rendering Countless Blades of Waving Grass）&/p&&p&第8章衍射的模拟（Simulating Diffraction）&/p&&p&&br&&/p&&h2&1.2 第二部分光照和阴影（Lighting and Shadows）&/h2&&p&第9章有效的阴影体渲染（Efficient Shadow Volume Rendering）&/p&&p&第10章电影级光照（Cinematic Lighting）&/p&&p&第11章阴影贴图抗锯齿（Shadow Map Antialiasing）&/p&&p&第12章全方位阴影映射（Omnidirectional Shadow Mapping）&/p&&p&第13章使用遮挡区间映射产生模糊的阴影（Generating Soft Shadows Using Occlusion Interval Maps）&/p&&p&第14章透视阴影贴图（Perspective Shadow Maps: Care and Feeding）&/p&&p&第15章逐像素光照的可见性管理（Managing Visibility for Per-Pixel Lighting）&/p&&p&&br&&/p&&h2&1.3 第三部分材质（Materials）&/h2&&p&第16章次表面散射的实时近似（Real-Time Approximations to Subsurface Scattering）&/p&&p&第17章环境光遮蔽（Ambient 
Occlusion）&/p&&p&第18章空间BRDF（Spatial BRDFs）&/p&&p&第19章基于图像的光照（Image-Based Lighting）&/p&&p&第20章纹理爆炸（Texture Bombing）&/p&&p&&br&&/p&&h2&1.4 第四部分图像处理（Image Processing）&/h2&&p&第21章实时辉光（Real-Time Glow）&/p&&p&第22章颜色控制（Color Controls）&/p&&p&第23章景深（Depth of Field）&/p&&p&第24章高品质的图像滤波（High-Quality Filtering）&/p&&p&第25章用纹理贴图进行快速滤波宽度的计算（Fast Filter-Width Estimates with Texture Maps）&/p&&p&第26章 OpenEXR图像文件格式（The OpenEXR Image File Format）&/p&&p&第27章图像处理的框架（A Framework for Image Processing）&/p&&p&&br&&/p&&p&&br&&/p&&h2&&b&二、第二本书《GPU Gems 2（GPU精粹2）》&/b&&/h2&&figure&&img src=&https://pic3.zhimg.com/v2-16c887e380ae6a7a93d0dec1ddf15d84_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&468& data-rawheight=&645& class=&origin_image zh-lightbox-thumb& width=&468& data-original=&https://pic3.zhimg.com/v2-16c887e380ae6a7a93d0dec1ddf15d84_r.jpg&&&/figure&&p&&br&&/p&&h2&2.1 第一部分几何复杂性（Geometric Complexity）&/h2&&p&&br&&/p&&p&第1章　实现照片级真实感的虚拟植物（Toward Photorealism in Virtual Botany）&/p&&p&第2章　使用基于GPU几何体剪切图的地形渲染（Terrain Rendering Using GPU-Based Geometry Clipmaps）&/p&&p&第3章　几何体实例化的内幕（Inside Geometry Instancing）&/p&&p&第4章分段缓冲（Segment Buffering）&/p&&p&第5章　用多流优化资源管理（Optimizing Resource Management with Multistreaming）&/p&&p&第6章　让硬件遮挡查询发挥作用（Hardware Occlusion Queries Made Useful）&/p&&p&第7章　带有位移映射的细分表面自适应镶嵌（Adaptive Tessellation of Subdivision Surfaces with Displacement Mapping）&/p&&p&第8章　使用距离函数的逐像素位移（Per-Pixel Displacement Mapping with Distance Functions）&/p&&p&&br&&/p&&h2&2.2 第二部分着色、光照和阴影（Shading, Lighting, and Shadows）&/h2&&p&第9章　S.T.A.L.K.E.R.中的延迟着色（Deferred Shading in S.T.A.L.K.E.R.）&/p&&p&第10章　动态辐照度环境映射实时计算（Real-Time Computation of Dynamic Irradiance Environment Maps）&/p&&p&第11章　近似的双向纹理函数（Approximate Bidirectional Texture Functions）&/p&&p&第12章　基于贴面的纹理映射（Tile-Based Texture Mapping）&/p&&p&第13章　在GPU上实现mental images的phenomena渲染器（Implementing the mental images Phenomena Renderer on the GPU）&/p&&p&第14章　动态环境光遮蔽与间接光照（Dynamic Ambient Occlusion and Indirect Lighting）&/p&&p&第15章　蓝图渲染和草图绘制（Blueprint 
Rendering and &Sketchy Drawings&）&/p&&p&第16章　精确的大气散射（Accurate Atmospheric Scattering）&/p&&p&第17章　利用像素着色器分支的高效模糊边缘阴影（Efficient Soft-Edged Shadows Using Pixel Shader Branching）&/p&&p&第18章将顶点纹理位移用于水的真实感渲染（Using Vertex Texture Displacement for Realistic Water Rendering）&/p&&p&第19章通用的折射模拟（Generic Refraction）&/p&&p&&br&&/p&&h2&3.3 第三部分高质量渲染（High-Quality Rendering）&/h2&&p&第20章快速三阶纹理过滤（Fast Third-Order Texture Filtering）&/p&&p&第21章高质量反走样的光栅化（High-Quality Antialiased Rasterization）&/p&&p&第22章快速预过滤线条（Fast Prefiltered Lines）&/p&&p&第23章 Nalu Demo中的头发动画和渲染（Hair Animation and Rendering in the Nalu Demo）&/p&&p&第24章使用查找表加速颜色变换（Using Lookup Tables to Accelerate Color Transformations）&/p&&p&第25章 Apple Motion中的GPU图像处理（GPU Image Processing in Apple's Motion）&/p&&p&第26章改进的Perlin噪声（Implementing Improved Perlin Noise）&/p&&p&第27章　高级高质量过滤（Advanced High-Quality Filtering）&/p&&p&第28章 Mipmap级的测量（Mipmap-Level Measurement）&/p&&p&&br&&/p&&h2&3.4 第四部分 GPU的通用计算：初级读本（General-Purpose Computation on GPUS: A Primer）&/h2&&p&&br&&/p&&p&第29章　流式体系结构和技术趋势（Streaming Architectures and Technology Trends）&/p&&p&第30章　Geforce 6系列GPU的体系结构（The GeForce 6 Series GPU Architecture）&/p&&p&第31章　把计算概念映射到GPU（Mapping Computational Concepts to GPUs）&/p&&p&第32章　尝试GPU计算（Taking the Plunge into GPU Computing）&/p&&p&第33章　在GPU上实现高效的并行数据结构（Implementing Efficient Parallel Data Structures on GPUs）&/p&&p&第34章　GPU流程控制习惯用法（GPU Flow-Control Idioms）&/p&&p&第35章　GPU程序优化（GPU Program Optimization）&/p&&p&第36章　用于GPGPU应用程序的流式缩减操作（Stream Reduction Operations for GPGPU Applications）&/p&&p&&br&&/p&&h2&3.5 第五部分面向图像的计算（Image-Oriented Computing）&/h2&&p&第37章　GPU上的八叉树纹理（Octree Textures on the GPU）&/p&&p&第38章　使用光栅化的高质量全局照明渲染（High-Quality Global Illumination Rendering Using Rasterization）&/p&&p&第39章　使用逐步求精辐射度方法的全局照明（Global Illumination Using Progressive Refinement Radiosity）&/p&&p&第40章　GPU上的计算机视觉（Computer Vision on the GPU）&/p&&p&第41章　延迟过滤：困难数据格式的渲染（Deferred Filtering: Rendering from Difficult Data Formats）&/p&&p&第42章　保守光栅化（Conservative 
Rasterization）&/p&&p&&br&&/p&&p&&br&&/p&&h2&&b&三、第三本书《GPU Gems 3（GPU精粹3）》&/b&&/h2&&figure&&img src=&https://pic2.zhimg.com/v2-e673a90dd87cba8_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&450& data-rawheight=&648& class=&origin_image zh-lightbox-thumb& width=&450& data-original=&https://pic2.zhimg.com/v2-e673a90dd87cba8_r.jpg&&&/figure&&p&&br&&/p&&h2&3.1 第一部分几何体（Geometry）&/h2&&p&第1章使用GPU 生成复杂的程序化地形（Generating Complex Procedural Terrains Using the GPU）&/p&&p&第2章群体动画渲染（Animated Crowd Rendering）&/p&&p&第3章 DirectX 10 混合形状：打破限制（DirectX 10 Blend Shapes: Breaking the Limits）&/p&&p&第4章下一代SpeedTree 渲染（Next-Generation SpeedTree Rendering）&/p&&p&第5章普遍自适应的网格优化（Generic Adaptive Mesh Refinement）&/p&&p&第6章 GPU 生成的树的程序式风动画（GPU-Generated Procedural Wind Animations for Trees）&/p&&p&第7章 GPU 上基于点的变形球可视化（Point-Based Visualization of Metaballs on a GPU）&/p&&p&第8章区域求和的差值阴影贴图（Summed-Area Variance Shadow Maps）&/p&&p&&br&&/p&&h2&3.2 第二部分光照和阴影（Light and Shadows）&/h2&&p&第9章使用全局照明实现互动的电影级重光照（Interactive Cinematic Relighting with Global Illumination）&/p&&p&第10章在可编程GPU 中实现并行分割的阴影贴图（Parallel-Split Shadow Maps on Programmable GPUs）&/p&&p&第11章使用层次化的遮挡剔除和几何着色器得到高效鲁棒的阴影体（Efficient and Robust Shadow Volumes Using Hierarchical Occlusion Culling and Geometry Shaders）&/p&&p&第12章高质量的环境光遮蔽（High-Quality Ambient Occlusion）&/p&&p&第13章作为后处理的体积光照散射（Volumetric Light Scattering as a Post-Process）&/p&&p&&br&&/p&&h2&3.3 第三部分渲染（Rendering）&/h2&&p&第14章用于真实感实时皮肤渲染的高级技术（Advanced Techniques for Realistic Real-Time Skin Rendering）&/p&&p&第15章可播放的全方位捕捉（Playable Universal Capture）&/p&&p&第16章 Crysis 中植被的程序化动画和着色（Vegetation Procedural Animation and Shading in Crysis）&/p&&p&第17章鲁棒的多镜面反射和折射（Robust Multiple Specular Reflections and Refractions）&/p&&p&第18章用于浮雕映射的松散式锥形步进（Relaxed Cone Stepping for Relief Mapping）&/p&&p&第19章 Tabula Rasa 中的延迟着色（Deferred Shading in Tabula Rasa）&/p&&p&第20章基于GPU的重要性采样（GPU-Based Importance Sampling）&/p&&p&&br&&/p&&p&&br&&/p&&h2&3.4 第四部分图像效果（Image Effects）&/h2&&p&第21章真正的Impostor（True 
Impostors）&/p&&p&第22章在GPU上烘焙法线贴图（Baking Normal Maps on the GPU）&/p&&p&第23章高速的离屏粒子（High-Speed, Off-Screen Particles）&/p&&p&第24章保持线性的重要性（The Importance of Being Linear）&/p&&p&第25章在GPU 上渲染矢量图（Rendering Vector Art on the GPU）&/p&&p&第26章通过颜色进行对象探测：使用GPU 进行实时视频图像处理（Object Detection by Color: Using the GPU for Real-Time Video Image Processing）&/p&&p&第27章作为后处理效果的运动模糊（Motion Blur as a Post-Processing Effect）&/p&&p&第28章实用景深后期处理（Practical Post-Process Depth of Field）&/p&&p&&br&&/p&&p&&br&&/p&&h2&&b&四、第四本书《GPU Pro 1》&/b&&/h2&&figure&&img src=&https://pic3.zhimg.com/v2-dec93093b_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&531& data-rawheight=&648& class=&origin_image zh-lightbox-thumb& width=&531& data-original=&https://pic3.zhimg.com/v2-dec93093b_r.jpg&&&/figure&&p&&br&&/p&&h2&4.1 第一部分渲染技巧（Rendering Techniques）&/h2&&p&1. 结合高度混合的四叉树位移贴图（Quadtree Displacement Mapping with Height Blending）&/p&&p&2. 基于几何着色器的NPR效果（NPR Effects Using the Geometry Shader）&/p&&p&3. 作为后处理的alpha混合（Alpha Blending as a Post-Process）&/p&&p&4. 虚拟纹理映射101（Virtual Texture Mapping 101）&/p&&p&&br&&/p&&h2&4.2 第二部分全局光照（Global Illumination）&/h2&&p&1. 用于间接光照的快速，基于模板的多分辨率泼溅（Fast, Stencil-Based Multiresolution Splatting for Indirect Illumination）&/p&&p&2. 屏幕空间定向遮蔽（Screen-Space Directional Occlusion） &/p&&p&3. 使用几何Impostors的实时多级光线追踪（Real-Time Multi-Bounce Ray-Tracing with Geometry Impostors）&/p&&p&&br&&/p&&h2&4.3 第三部分图像空间（Image Space）&/h2&&p&1. GPU上的各项异性的Kuwahara滤波（Anisotropic Kuwahara Filtering on the GPU）&/p&&p&2. 边缘抗锯齿的后处理（Edge Anti-aliasing by Post-Processing）&/p&&p&3. 使用Floyd-Steinberg半色调的环境映射（Environment Mapping with Floyd-Steinberg&br&Halftoning）&/p&&p&4. 用于粒状遮挡剔除的分层项缓冲（Hierarchical Item Buffers for Granular Occlusion Culling）&/p&&p&5. 后期制作中的真实感景深（Realistic Depth of Field in Postproduction）&/p&&p&6. 实时屏幕空间的云层光照（Real-Time Screen Space Cloud Lighting）&/p&&p&7. 屏幕空间次表面散射（Screen-Space Subsurface Scattering）&/p&&p&&br&&/p&&h2&4.4 第四部分阴影（Shadows） &/h2&&p&1. 
Fast Conventional Shadow Filtering&/p&&p&2. Hybrid Min/Max Plane-Based Shadow Maps&/p&&p&3. Shadow Mapping for Omnidirectional Light Using Tetrahedron Mapping&/p&&p&4. Screen Space Soft Shadows&/p&&p&&br&&/p&&h2&4.5 Part V: 3D Engine Design&/h2&&p&&br&&/p&&p&1. Multi-Fragment Effects on the GPU Using Bucket Sort&/p&&p&2. Parallelized Light Pre-Pass Rendering with the Cell Broadband Engine&/p&&p&3. Porting Code between Direct3D9 and OpenGL 2.0&/p&&p&4. Practical Thread Rendering for DirectX 9&/p&&p&&br&&/p&&p&&br&&/p&&h2&4.6 Part VI: Game Postmortems&/h2&&p&1. Stylized Rendering in Spore&/p&&p&2. Rendering Techniques in Call of Juarez: Bound in Blood&/p&&p&3. Making it Large, Beautiful, Fast, and Consistent: Lessons Learned&/p&&p&4. Destructible Volumetric Terrain&/p&&p&&br&&/p&&p&&br&&/p&&h2&&b&5. Book Five: GPU Pro 2&/b&&/h2&&figure&&img src=&https://pic3.zhimg.com/v2-d6b636b6b5c6e1b892dc3_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&510& data-rawheight=&648& class=&origin_image zh-lightbox-thumb& width=&510& data-original=&https://pic3.zhimg.com/v2-d6b636b6b5c6e1b892dc3_r.jpg&&&/figure&&p&&br&&/p&&h2&5.1 Part I: Geometry Manipulation&/h2&&p&1. Terrain and Ocean Rendering with Hardware Tessellation&/p&&p&2. Practical and Realistic Facial Wrinkles Animation&/p&&p&3. Procedural Content Generation on the GPU&/p&&p&&br&&/p&&p&&br&&/p&&h2&5.2 Part II: Rendering&/h2&&p&1. Pre-Integrated Skin Shading&/p&&p&2. Implementing Fur Using Deferred Shading&/p&&p&3. Large-Scale Terrain Rendering for Outdoor Games&/p&&p&4. Practical Morphological Antialiasing&/p&&p&5. Volume Decals&/p&&p&&br&&/p&&p&&br&&/p&&h2&5.3 Part III: Global Illumination Effects&/h2&&p&1. 
Temporal Screen-Space Ambient Occlusion&/p&&p&2. Level-of-Detail and Streaming Optimized Irradiance Normal Mapping&/p&&p&3. Real-Time One-Bounce Indirect Illumination and Shadows using Ray Tracing&/p&&p&4. Real-Time Approximation of Light Transport in Translucent Homogenous Media&/p&&p&5. Diffuse Global Illumination with Temporally Coherent Light Propagation Volumes&/p&&p&&br&&/p&&p&&br&&/p&&h2&5.4 Part IV: Shadows&/h2&&p&1. Variance Shadow Maps Light-Bleeding Reduction Tricks&/p&&p&2. Fast Soft Shadows via Adaptive Shadow Maps&/p&&p&3. Adaptive Volumetric Shadow Maps&/p&&p&4. Fast Soft Shadows with Temporal Coherence&/p&&p&5. Mipmapped Screen-Space Soft Shadows&/p&&p&&br&&/p&&h2&5.5 Part V: Handheld Devices&/h2&&p&&br&&/p&&p&1. A Shader-Based eBook Renderer&/p&&p&2. Post-Processing Effects on Mobile Devices&/p&&p&3. Shader-Based Water Effects&/p&&p&&br&&/p&&h2&5.6 Part VI: 3D Engine Design&/h2&&p&1. Practical, Dynamic Visibility for Games&/p&&p&2. Shader Amortization using Pixel Quad Message Passing&/p&&p&3. A Rendering Pipeline for Real-Time Crowds&/p&&p&&br&&/p&&p&&br&&/p&&h2&&b&6. Book Six: GPU Pro 3&/b&&/h2&&figure&&img src=&https://pic1.zhimg.com/v2-9fdfa3ae7ad6_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&500& data-rawheight=&648& class=&origin_image zh-lightbox-thumb& width=&500& data-original=&https://pic1.zhimg.com/v2-9fdfa3ae7ad6_r.jpg&&&/figure&&p&&br&&/p&&h2&6.1 Part I: Geometry Manipulation&/h2&&p&1. Vertex Shader Tessellation&/p&&p&2. Real-Time Deformable Terrain Rendering with DirectX 11&/p&&p&3. Optimized Stadium Crowd Rendering&/p&&p&4. Geometric Antialiasing Methods&/p&&p&&br&&/p&&h2&6.2 Part II: Rendering&/h2&&p&1. 
Practical Elliptical Texture Filtering on the GPU&/p&&p&2. An Approximation to the Chapman Grazing-Incidence Function for Atmospheric Scattering&/p&&p&3. Volumetric Real-Time Water and Foam Rendering&/p&&p&4. CryENGINE 3: Three Years of Work in Review&/p&&p&5. Inexpensive Antialiasing of Simple Objects&/p&&p&&br&&/p&&h2&6.3 Part III: Global Illumination Effects&/h2&&p&1. Ray-Traced Approximate Reflections Using a Grid of Oriented Splats&/p&&p&2. Screen-Space Bent Cones: A Practical Approach&/p&&p&3. Real-Time Near-Field Global Illumination Based on a Voxel Model&/p&&p&&br&&/p&&h2&6.4 Part IV: Shadows&/h2&&p&1. Efficient Online Visibility for Shadow Maps&/p&&p&2. Depth Rejected Gobo Shadows&/p&&p&&br&&/p&&h2&6.5 Part V: 3D Engine Design&/h2&&p&1. Z3 Culling&/p&&p&2. A Quaternion-Based Rendering Pipeline&/p&&p&3. Implementing a Directionally Adaptive Edge AA Filter Using DirectX 11&/p&&p&4. Designing a Data-Driven Renderer&/p&&p&&br&&/p&&p&&br&&/p&&h2&&b&7. Book Seven: GPU Pro 4&/b&&/h2&&figure&&img src=&https://pic1.zhimg.com/v2-f9ccb9bcd4_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&525& data-rawheight=&648& class=&origin_image zh-lightbox-thumb& width=&525& data-original=&https://pic1.zhimg.com/v2-f9ccb9bcd4_r.jpg&&&/figure&&p&&br&&/p&&h2&7.1 Part I: Geometry Manipulation&/h2&&p&1. GPU Terrain Subdivision and Tessellation&/p&&p&2. Introducing the Programmable Vertex Pulling Rendering Pipeline&/p&&p&3. A WebGL Globe Rendering Pipeline&/p&&p&&br&&/p&&h2&7.2 Part II: Rendering&/h2&&p&1. Practical Planar Reflections Using Cubemaps and Image Proxies&/p&&p&2. Real-Time Ptex and Vector Displacement&/p&&p&3. 
Decoupled Deferred Shading on the GPU&/p&&p&4. Tiled Forward Shading&/p&&p&5. Forward+: A Step Toward Film-Style Shading in Real Time&/p&&p&6. Progressive Screen-Space Multichannel Surface Voxelization&/p&&p&7. Rasterized Voxel-Based Dynamic Global Illumination&/p&&p&&br&&/p&&h2&7.3 Part III: Image Space&/h2&&p&1. The Skylanders SWAP Force Depth-of-Field Shader&/p&&p&2. Simulating Partial Occlusion in Post-Processing Depth-of-Field Methods&/p&&p&3. Second-Depth Antialiasing&/p&&p&4. Practical Framebuffer Compression&/p&&p&5. Coherence-Enhancing Filtering on the GPU&/p&&p&&br&&/p&&h2&7.4 Part IV: Shadows&/h2&&p&1. Real-Time Deep Shadow Maps&/p&&p&&br&&/p&&h2&7.5 Part V: Game Engine Design&/h2&&p&&br&&/p&&p&1. An Aspect-Based Engine Architecture&/p&&p&2. Kinect Programming with Direct3D 11&/p&&p&3. A Pipeline for Authored Structural Damage&/p&&p&&br&&/p&&h2&&b&8. Book Eight: GPU Pro 5&/b&&/h2&&figure&&img src=&https://pic2.zhimg.com/v2-93ef0b8fa61ad4ed9bf94_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&527& data-rawheight=&648& class=&origin_image zh-lightbox-thumb& width=&527& data-original=&https://pic2.zhimg.com/v2-93ef0b8fa61ad4ed9bf94_r.jpg&&&/figure&&p&&br&&/p&&h2&8.1 Part I: Rendering&/h2&&p&1. Per-Pixel Lists for Single Pass A-Buffer&/p&&p&2. Reducing Texture Memory Usage by 2-Channel Color Encoding&/p&&p&3. Particle-Based Simulation of Material Aging&/p&&p&4. Simple Rasterization-Based Liquids&/p&&p&&br&&/p&&h2&8.2 Part II: Lighting and Shading&/h2&&p&1. Physically Based Area Lights&/p&&p&2. 
High Performance Outdoor Light Scattering Using Epipolar Sampling&/p&&p&3. Volumetric Light Effects in Killzone: Shadow Fall&/p&&p&4. Hi-Z Screen-Space Cone-Traced Reflections&/p&&p&5. TressFX: Advanced Real-Time Hair Rendering&/p&&p&6. Wire Antialiasing&/p&&p&&br&&/p&&h2&8.3 Part III: Image Space&/h2&&p&1. Screen-Space Grass&/p&&p&2. Screen-Space Deformable Meshes via CSG with Per-Pixel Linked Lists&/p&&p&3. Bokeh Effects on the SPU&/p&&p&&br&&/p&&h2&8.4 Part IV: Mobile Devices&/h2&&p&1. Realistic Real-Time Skin Rendering on Mobile&/p&&p&2. Deferred Rendering Techniques on Mobile Devices&/p&&p&3. Bandwidth Efficient Graphics with ARM(R) Mali(TM) GPUs&/p&&p&4. Efficient Morph Target Animation Using OpenGL ES 3.0&/p&&p&5. Tiled Deferred Blending&/p&&p&6. Adaptive Scalable Texture Compression&/p&&p&7. Optimizing OpenCL Kernels for the ARM(R) Mali(TM)-T600 GPUs&/p&&p&&br&&/p&&h2&8.5 Part V: 3D Engine Design&/h2&&p&1. Quaternions Revisited&/p&&p&2. glTF: Designing an Open-Standard Runtime Asset Format&/p&&p&3. Managing Transformations in Hierarchy&/p&&p&&br&&/p&&h2&8.6 Part VI: Compute&/h2&&p&1. Hair Simulation in TressFX&/p&&p&2. Object-Order Ray Tracing for Fully Dynamic Scenes&/p&&p&3. Quadtrees on the GPU&/p&&p&&br&&/p&&p&&br&&/p&&h2&&b&9. Book Nine: GPU Pro 6&/b&&/h2&&figure&&img src=&https://pic1.zhimg.com/v2-f629c12ea_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&480& data-rawheight=&648& class=&origin_image zh-lightbox-thumb& width=&480& data-original=&https://pic1.zhimg.com/v2-f629c12ea_r.jpg&&&/figure&&p&&br&&/p&&h2&9.1 Part I: Geometry Manipulation&/h2&&p&1. 
Dynamic GPU Terrain&/p&&p&2. Bandwidth-Efficient Procedural Meshes in the GPU via Tessellation&/p&&p&3. Real-Time Deformation of Subdivision Surfaces on Object Collisions&/p&&p&4.
Realistic Volumetric Explosions in Games&/p&&p&&br&&/p&&h2&9.2 Part II: Rendering&/h2&&p&1. Next-Generation Rendering in Thief&/p&&p&2. Grass Rendering and Simulation with LOD&/p&&p&3. Hybrid Reconstruction Antialiasing&/p&&p&4. Real-Time Rendering of Physically Based Clouds Using Precomputed Scattering&/p&&p&5. Sparse Procedural Volume Rendering&/p&&p&&br&&/p&&h2&9.3 Part III: Lighting&/h2&&p&1. Real-Time Lighting via Light Linked List&/p&&p&2. Deferred Normalized Irradiance Probes&/p&&p&3. Volumetric Fog and Lighting&/p&&p&4. Physically Based Light Probe Generation on GPU&/p&&p&5. Real-Time Global Illumination Using Slices&/p&&p&&br&&/p&&p&&br&&/p&&h2&9.4 Part IV: Shadows&/h2&&p&1. Practical Screen-Space Soft Shadows&/p&&p&2. Tile-Based Omnidirectional Shadows&/p&&p&3. Shadow Map Silhouette Revectorization&/p&&p&&br&&/p&&p&&br&&/p&&h2&9.5 Part V: Mobile Devices&/h2&&p&1. Hybrid Ray Tracing on a PowerVR GPU&/p&&p&2. Implementing a GPU-Only Particle-Collision System with ASTC 3D Textures and OpenGL ES 3.0&/p&&p&3. Animated Characters with Shell Fur for Mobile Devices&/p&&p&4. High Dynamic Range Computational Photography on Mobile GPUs&/p&&p&&br&&/p&&h2&9.6 Part VI: Compute&/h2&&p&1. Compute-Based Tiled Culling&/p&&p&2. Rendering Vector Displacement-Mapped Surfaces in a GPU Ray Tracer&/p&&p&3. Smooth Probabilistic Ambient Occlusion for Volume Rendering&/p&&p&&br&&/p&&h2&9.7 Part VII: 3D Engine Design&/h2&&p&1. Block-Wise Linear Binary Grids for Fast Ray-Casting Operations&/p&&p&2. Semantic-Based Shader Generation Using Shader Shaker&/p&&p&3. 
ANGLE: Bringing OpenGL ES to the Desktop&/p&&p&&br&&/p&&p&&br&&/p&&h2&&b&10. Book Ten: GPU Pro 7&/b&&/h2&&figure&&img src=&https://pic1.zhimg.com/v2-cc232d2af89afb3ae38da775_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&507& data-rawheight=&648& class=&origin_image zh-lightbox-thumb& width=&507& data-original=&https://pic1.zhimg.com/v2-cc232d2af89afb3ae38da775_r.jpg&&&/figure&&p&&br&&/p&&h2&10.1 Part I: Geometry Manipulation&/h2&&p&1. Deferred Snow Deformation in Rise of the Tomb Raider&/p&&p&2. Catmull-Clark Subdivision Surfaces&/p&&p&&br&&/p&&h2&10.2 Part II: Lighting&/h2&&p&1. Clustered Shading: Assigning Lights Using Conservative Rasterization in DirectX 12&/p&&p&2. Fine Pruned Tiled Light Lists&/p&&p&3. Deferred Attribute Interpolation Shading&/p&&p&4. Real-Time Volumetric Cloudscapes&/p&&p&&br&&/p&&h2&10.3 Part III: Rendering&/h2&&p&1. Adaptive Virtual Textures&/p&&p&2. Deferred Coarse Pixel Shading&/p&&p&3. Progressive Rendering Using Multi-frame Sampling&/p&&p&&br&&/p&&p&&br&&/p&&h2&10.4 Part IV: Mobile Devices&/h2&&p&1. Efficient Soft Shadows Based on Static Local Cubemap&/p&&p&2. Physically Based Deferred Shading on Mobile&/p&&p&&br&&/p&&p&&br&&/p&&h2&&b&11. Book Eleven: GPU Zen&/b&&/h2&&figure&&img src=&https://pic1.zhimg.com/v2-366cdf5edde_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&549& data-rawheight=&648& class=&origin_image zh-lightbox-thumb& width=&549& data-original=&https://pic1.zhimg.com/v2-366cdf5edde_r.jpg&&&/figure&&p&&br&&/p&&p&&br&&/p&&h2&11.1 Part I: Geometry Manipulation&/h2&&p&1. Attributed Vertex Clouds&/p&&p&2. Rendering Convex Occluders with Inner Conservative Rasterization&/p&&p&&br&&/p&&h2&11.2 Part II: Lighting&/h2&&p&1. Stable Indirect Illumination&/p&&p&2. 
Participating Media Using Extruded Light Volumes&/p&&p&&br&&/p&&h2&11.3 Part III: Rendering&/h2&&p&1. Deferred+: Next-Gen Culling and Rendering for the Dawn Engine&/p&&p&2. Programmable Per-pixel Sample Placement with Conservative Rasterizer&/p&&p&3. Mobile Toon Shading&/p&&p&4. High Quality GPU-efficient Image Detail Manipulation&/p&&p&5. Linear-Light Shading with Linearly Transformed Cosines&/p&&p&&br&&/p&&h2&11.4 Part IV: Screen Space&/h2&&p&1. Scalable Adaptive SSAO&/p&&p&2. Robust Screen Space Ambient Occlusion in 1 ms in 1080p on PS4&/p&&p&3. Practical Gather-based Bokeh&/p&&p&&br&&/p&&p&&br&&/p&&h2&&b&Conclusion&/b&&/h2&&p&&br&&/p&&p&As you can see, merely listing the 200+ core chapter titles of the “GPU Gems trilogy” already runs to several thousand words, which shows how rich and substantive the material is.&/p&&p&&br&&/p&&p&Through the 11 books of the “GPU Gems trilogy” and this new series of articles, I hope we can not only take our graphics and real-time rendering skills up a level, but also stand on the shoulders of giants and catch a glimpse of how real-time rendering and game development have transformed over these 10-plus years.&/p&&p&&br&&/p&&p&I hope I can do justice to the “GPU Gems trilogy”, a feast of advanced knowledge in game development, graphics, and rendering, when summarizing it.&/p&&p&&br&&/p&&p&I also hope this new series will be helpful to everyone who loves game development, graphics, and rendering.&/p&&p&&br&&/p&&p&Finally, here is a group photo of the whole “GPU Gems trilogy” to close this article.&/p&&figure&&img src=&https://pic2.zhimg.com/v2-ddc2625b3afd2092711e_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&2268& data-rawheight=&2401& class=&origin_image zh-lightbox-thumb& width=&2268& data-original=&https://pic2.zhimg.com/v2-ddc2625b3afd2092711e_r.jpg&&&/figure&&p&&br&&/p&&p&See you in the next article, a distilled summary of the core content of GPU Gems 1.&/p&&p&&br&&/p&&p&With best wishes.&/p&&p&&/p&
&figure&&img src=&https://pic2.zhimg.com/v2-c7dceeac17f_b.jpg& data-rawwidth=&1663& data-rawheight=&987& class=&origin_image zh-lightbox-thumb& width=&1663& data-original=&https://pic2.zhimg.com/v2-c7dceeac17f_r.jpg&&&/figure&&p&&b&update
at the bottom&/b&&/p&&hr&&p&&b&&a href=&http://link.zhihu.com/?target=http%3A//scratchapixel.com/& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Scratchapixel.com&/a& is a non-profit CG teaching website offering a series of tutorials that run from CG-related math and physics, through writing your own 3D renderer, to all kinds of optimizations and advanced tricks.&/b&&/p&&figure&&img src=&https://pic3.zhimg.com/v2-152d1ebbcd3cb481bd2ee6ae_b.jpg& data-size=&normal& data-rawwidth=&785& data-rawheight=&1018& class=&origin_image zh-lightbox-thumb& width=&785& data-original=&https://pic3.zhimg.com/v2-152d1ebbcd3cb481bd2ee6ae_r.jpg&&&figcaption&↑ Contents: math and physics, 3D rendering basics&/figcaption&&/figure&&figure&&img src=&https://pic1.zhimg.com/v2-5c5a6dc952e49ee239f44_b.jpg& data-size=&normal& data-rawwidth=&569& data-rawheight=&1189& class=&origin_image zh-lightbox-thumb& width=&569& data-original=&https://pic1.zhimg.com/v2-5c5a6dc952e49ee239f44_r.jpg&&&figcaption&↑ Contents: optimizations, advanced tricks, and other tutorials&/figcaption&&/figure&&figure&&img src=&https://pic2.zhimg.com/v2-ffcd0ed253c86a14364fd_b.jpg& data-size=&normal& data-rawwidth=&768& data-rawheight=&983& class=&origin_image zh-lightbox-thumb& width=&768& data-original=&https://pic2.zhimg.com/v2-ffcd0ed253c86a14364fd_r.jpg&&&figcaption&Article body: plenty of GIFs and illustrations to aid understanding&/figcaption&&/figure&&h2&&b&And all of it is completely free!&/b&&/h2&&h2&&b&The site is very clean, without a single ad!&/b&&/h2&&hr&&p&&b&Many excellent answers on Zhihu mention Scratchapixel, and those answers are how I learned about the site.&/b&&/p&&p&&a href=&https://www.zhihu.com/question//answer/& class=&internal&&How do I start writing a rasterization renderer in C++?&/a&
&a class=&member_mention& href=&http://www.zhihu.com/people/7c72e9cfdc727& data-hash=&7c72e9cfdc727& data-hovercard=&p$b$7c72e9cfdc727&&@UncP&/a& &/p&&p&&a href=&https://zhuanlan.zhihu.com/p/& class=&internal&&Pinhole Cameras and Insect Compound Eyes&/a&
&a class=&member_mention& href=&http://www.zhihu.com/people/2a6e20aafc5fd0cc85e4d3& data-hash=&2a6e20aafc5fd0cc85e4d3& data-hovercard=&p$b$2a6e20aafc5fd0cc85e4d3&&@章佳杰&/a& &/p&&p&&a href=&https://www.zhihu.com/question//answer/& class=&internal&&Zhihu user: As a second-year software engineering student, should I study computer graphics or do ACM?&/a&
&a class=&member_mention& href=&http://www.zhihu.com/people/7c06fb1fd50a2d5462d2d& data-hash=&7c06fb1fd50a2d5462d2d& data-hovercard=&p$b$7c06fb1fd50a2d5462d2d&&@杜子白&/a& &/p&&p&&a href=&https://www.zhihu.com/question//answer/& class=&internal&&Barney Zhao: Can anyone recommend good books on basic real-time rendering algorithms?&/a&
&a class=&member_mention& href=&http://www.zhihu.com/people/8c66cc6a2a75ad3d352bc& data-hash=&8c66cc6a2a75ad3d352bc& data-hovercard=&p$b$8c66cc6a2a75ad3d352bc&&@Barney Zhao&/a& &/p&&figure&&img src=&https://pic1.zhimg.com/v2-eb9fb531e2c62e0c3f5678_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&702& data-rawheight=&354& class=&origin_image zh-lightbox-thumb& width=&702& data-original=&https://pic1.zhimg.com/v2-eb9fb531e2c62e0c3f5678_r.jpg&&&/figure&&hr&&p&The site's founder is Jean-Colas Prunier, and he writes all of these tutorials by himself.&/p&&p&As you have probably noticed, the tutorials are far behind the schedule the author set for himself. I contacted him by email and learned that the site faces serious funding problems: a small team used to run it together, but now only the founder is left keeping it going.&/p&&p&&br&&/p&&figure&&img src=&https://pic1.zhimg.com/v2-cee2bf52373fcbd86ed6bcb767ee3e8c_b.jpg& data-size=&normal& data-rawwidth=&1001& data-rawheight=&353& class=&origin_image zh-lightbox-thumb& width=&1001& data-original=&https://pic1.zhimg.com/v2-cee2bf52373fcbd86ed6bcb767ee3e8c_r.jpg&&&figcaption&In the email, the author sounded very discouraged&/figcaption&&/figure&&p&In the previous version of the site the developer tried Google ads, but they brought in only 20 dollars a month, so he dropped them in the new version.&/p&&p&&br&&/p&&h2&The author's goal is to &b&provide high-quality content for free&/b&&/h2&&p&The site currently gets 30,000 visits a month. The developer says that &b&30,000 people interested in CG are a very well-targeted audience for advertisers,&/b& and he hopes to find a company to sponsor Scratchapixel.&/p&&p&&br&&/p&&p&Let me first @ some of the experts I follow, in the hope that an upvote brings some traffic so that more people see this.&/p&&p&&a class=&member_mention& href=&http://www.zhihu.com/people/ecc0ec035f& data-hash=&ecc0ec035f& data-hovercard=&p$b$ecc0ec035f&&@vczh&/a& &a class=&member_mention& href=&http://www.zhihu.com/people/1c7fdb6b3cf05e2b6eaac0a9& data-hash=&1c7fdb6b3cf05e2b6eaac0a9& data-hovercard=&p$b$1c7fdb6b3cf05e2b6eaac0a9&&@陈萌萌&/a& &a class=&member_mention& href=&http://www.zhihu.com/people/ec03b8e839a6fb763e1bdb& data-hash=&ec03b8e839a6fb763e1bdb& data-hovercard=&p$b$ec03b8e839a6fb763e1bdb&&@winter&/a& &a class=&member_mention& href=&http://www.zhihu.com/people/78e3b2ae1be4ab038a6e& data-hash=&78e3b2ae1be4ab038a6e& data-hovercard=&p$b$78e3b2ae1be4ab038a6e&&@赵劼&/a& &a class=&member_mention& href=&http://www.zhihu.com/people/9558cac1a8fe6b7b1a0f7b& data-hash=&9558cac1a8fe6b7b1a0f7b& 
data-hovercard=&p$b$9558cac1a8fe6b7b1a0f7b&&@白如冰&/a& &a class=&member_mention& href=&http://www.zhihu.com/people/d073f194bcabc1cec5ef69d0b534de99& data-hash=&d073f194bcabc1cec5ef69d0b534de99& data-hovercard=&p$b$d073f194bcabc1cec5ef69d0b534de99&&@空明流转&/a& &a class=&member_mention& href=&http://www.zhihu.com/people/1e2cccc3ce33& data-hash=&1e2cccc3ce33& data-hovercard=&p$b$1e2cccc3ce33&&@Milo Yip&/a& &a class=&member_mention& href=&http://www.zhihu.com/people/0b21747b1fec79ad8af7e68a2b1ff681& data-hash=&0b21747b1fec79ad8af7e68a2b1ff681& data-hovercard=&p$b$0b21747b1fec79ad8af7e68a2b1ff681&&@叛逆者&/a& &a class=&member_mention& href=&http://www.zhihu.com/people/745f68a74a02e455b1e1& data-hash=&745f68a74a02e455b1e1& data-hovercard=&p$b$745f68a74a02e455b1e1&&@蓝色&/a& &a class=&member_mention& href=&http://www.zhihu.com/people/13ba78a859eaf6b9a5b27c5c56ee8419& data-hash=&13ba78a859eaf6b9a5b27c5c56ee8419& data-hovercard=&p$b$13ba78a859eaf6b9a5b27c5c56ee8419&&@ze ran&/a& &a class=&member_mention& href=&http://www.zhihu.com/people/11ca7f85ad9fff60c29c9& data-hash=&11ca7f85ad9fff60c29c9& data-hovercard=&p$b$11ca7f85ad9fff60c29c9&&@Cat Chen&/}
