How did layers come about in AI?

Time: 2018-07-30 04:42

AI Prediction

&p&群论不简单么？一个集合和一个二元运算，并且满足群论四大公理。黑纸白字，没有一个符号、一个汉字是我不认识的。经过这么多年的数学训练，加上刷题，那是想证明就证明、想计算就计算，砍瓜切菜、手起刀落、猛虎下山、势如破竹。&/p&&p&但是！我很不爽，这种感觉好比有人叫你去砍人，你也不问问为什么，一言不合就出手把人砍翻在地，或者被人砍翻在地，这种行为我们一般把它成为脑残，你的身份就是别人的小弟。&/p&&p&我们不要做数学的小弟，刷题不能给我们自由，唯有思考可以。&/p&&p&下面就讲一下我对群论的一些思考。&/p&&p&&strong&1 集合&/strong&&/p&&p&讲群论先从集合讲起，集合简单来说就是把一堆东西放在一起（暂时就别提罗素悖论了）：&figure&&img src=&https://pic1.zhimg.com/50/v2-c7ab0c02df10e0bde75846e_b.jpg& data-rawwidth=&600& data-rawheight=&402& class=&origin_image zh-lightbox-thumb& width=&600& data-original=&https://pic1.zhimg.com/50/v2-c7ab0c02df10e0bde75846e_r.jpg&&&/figure&&/p&&p&可是这用处不大啊，东西之间得有相互作用才能更好的描述世界啊：&figure&&img src=&https://pic3.zhimg.com/50/v2-de1bf6e41108b6fae01ecad_b.jpg& data-rawwidth=&701& data-rawheight=&402& class=&origin_image zh-lightbox-thumb& width=&701& data-original=&https://pic3.zhimg.com/50/v2-de1bf6e41108b6fae01ecad_r.jpg&&&/figure&&/p&&p&东西我们把它称之为对象，对象之间的互相作用我们称之为操作或者运算。&/p&&p&自然数 &img src=&//www.zhihu.com/equation?tex=N& alt=&N& eeimg=&1&& 是一个集合，我们从自然数 &img src=&//www.zhihu.com/equation?tex=N& alt=&N& eeimg=&1&& 这个集合出发，通过运算可以创造越来越大的集合( &img src=&//www.zhihu.com/equation?tex=N& alt=&N& eeimg=&1&& 、 &img src=&//www.zhihu.com/equation?tex=Z& alt=&Z& eeimg=&1&& 、 &img src=&//www.zhihu.com/equation?tex=Q& alt=&Q& eeimg=&1&& 、 &img src=&//www.zhihu.com/equation?tex=R& alt=&R& eeimg=&1&& 、 &img src=&//www.zhihu.com/equation?tex=C& alt=&C& eeimg=&1&& 分别是自然数、整数、有理数、实数、复数)：&figure&&img src=&https://pic2.zhimg.com/50/v2-4a587f2776eac5a7fb10_b.jpg& data-rawwidth=&868& data-rawheight=&402& class=&origin_image zh-lightbox-thumb& width=&868& data-original=&https://pic2.zhimg.com/50/v2-4a587f2776eac5a7fb10_r.jpg&&&/figure&&/p&&p&运算不止加减乘除，数学学到后面就多了很多抽象运算。甚至从集合和运算的角度来看，学数学的过程很多时候就是在不断的扩大对集合和运算的认知。理解的集合和运算越多，相关领域的数学基本上也就理解了。&/p&&p&其中有种特殊的&b&集合+运算&/b&就是群。&/p&&p&&strong&2 群&/strong&&/p&&p&简单来说，群的作用是描述对称。&/p&&p&&strong&2.1 
什么叫对称？&/strong&&/p&&p&我们来看看：&/p&&ul&&li&&p&正方形对称吗?&/p&&/li&&li&&p&物理定律对称吗？&/p&&/li&&li&&p&多项式的根对称吗？&/p&&/li&&/ul&&p&上面的问题的答案都是：对称！&/p&&p&对称就是：&b&“某种操作下的不变性”&/b&，关键字是两个：&b&“操作”和“不变性”&/b&，要说明这点让我们通过上面的三个问题来理解。&/p&&p&&strong&2.1.1 正方形是否对称？&/strong&&/p&&p&先看看正方形，其实它对称是蛮明显的，符合我们日常的语义，可是我们也要把它放到数学的语境里来分析一下：&figure&&img src=&https://pic1.zhimg.com/50/v2-dec02ff2e20b841b8ccd51_b.jpg& data-rawwidth=&768& data-rawheight=&402& class=&origin_image zh-lightbox-thumb& width=&768& data-original=&https://pic1.zhimg.com/50/v2-dec02ff2e20b841b8ccd51_r.jpg&&&/figure&&/p&&p&围绕中心点旋转这个&b&操作&/b&，正方形所具有的&b&不变性&/b&就是对称。&/p&&p&我们换一种操作，正方形也可以对称：&figure&&img src=&https://pic4.zhimg.com/50/v2-fedebd3ed0_b.jpg& data-rawwidth=&768& data-rawheight=&402& class=&origin_image zh-lightbox-thumb& width=&768& data-original=&https://pic4.zhimg.com/50/v2-fedebd3ed0_r.jpg&&&/figure&&/p&&p&围绕中垂线这个&b&操作&/b&，正方形也具有&b&不变性&/b&，也是一种对称。但是因为操作变了，所以这种对称和上面的那种对称不是同一种对称，之后我会再说到这个问题。&/p&&p&假如刚才的正方形只是桌子的桌面，继续围绕中垂线翻转这个操作就不对称了：&figure&&img src=&https://pic4.zhimg.com/50/v2-58b48ff8d8a3feb456c2fbe_b.jpg& data-rawwidth=&920& data-rawheight=&402& class=&origin_image zh-lightbox-thumb& width=&920& data-original=&https://pic4.zhimg.com/50/v2-58b48ff8d8a3feb456c2fbe_r.jpg&&&/figure&&/p&&p&&strong&2.1.2 物理定律是否对称？&/strong&&/p&&p&这个听起来就有点奇怪了，但是从不变性的角度出发，相对于时间流逝这个操作，物理定律保持不变，我们可以说物理定律相对时间对称。相对于空间改变这个操作，物理定律保持不变，我们可以说物理定律相对空间对称：&figure&&img src=&https://pic3.zhimg.com/50/v2-91c8cb6ebb0fb0e27e9474890cfe22dd_b.jpg& data-rawwidth=&838& data-rawheight=&402& class=&origin_image zh-lightbox-thumb& width=&838& data-original=&https://pic3.zhimg.com/50/v2-91c8cb6ebb0fb0e27e9474890cfe22dd_r.jpg&&&/figure&&/p&&p&这听起来蛮哲学的，不是说数学学到后面都是哲学吗？&/p&&p&物理我属于民科水平，大家可以参看 &a href=&//link.zhihu.com/?target=https%3A//zh.wikipedia.org/wiki/%25E5%25AF%25B9%25E7%25A7%25B0%25E6%_%28%25E7%%25E7%E5%25AD%25A6%29& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&对称性----维基百科&/a& 。&/p&&p&&strong&2.1.3 多项式的根是否对称？&/strong&&/p&&p&说明下，多项式方程指的是形如 &img 
src=&//www.zhihu.com/equation?tex=x%5E+n%2Ba_1x%5E%7Bn-1%7D%2B%5Ccdots+%2Ba_+n%3D0& alt=&x^ n+a_1x^{n-1}+\cdots +a_ n=0& eeimg=&1&& 这样的方程。&/p&&p&群论就是从解多项式的根开始发展起来的，所以自然要谈一下为什么多项式的根具有对称性。&/p&&p&首先要从简单的一元二次方程说起：&figure&&img src=&https://pic1.zhimg.com/50/v2-1687fbe78ee6cc64749a44_b.jpg& data-rawwidth=&1133& data-rawheight=&402& class=&origin_image zh-lightbox-thumb& width=&1133& data-original=&https://pic1.zhimg.com/50/v2-1687fbe78ee6cc64749a44_r.jpg&&&/figure&&/p&&p&从上图中来看，相对于 &img src=&//www.zhihu.com/equation?tex=%2B%5Ctimes+& alt=&+\times & eeimg=&1&& 运算，多项式的根互换之后结果不变，针对这个运算它们是对称的。对于 &img src=&//www.zhihu.com/equation?tex=-%5Cdiv+& alt=&-\div & eeimg=&1&& 运算就没有对称性。&/p&&p&这个对称性有什么用？根据 &a href=&//link.zhihu.com/?target=https%3A//zh.wikipedia.org/wiki/%25E9%259F%25A6%25E8%25BE%25BE%25E5%25AE%259A%25E7%& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&韦达定理&/a& ，一元二次方程 &img src=&//www.zhihu.com/equation?tex=x%5E2%2Bax%2Bb%3D0& alt=&x^2+ax+b=0& eeimg=&1&& ，其中&img src=&//www.zhihu.com/equation?tex=a%3D-%28x_1%2Bx_2%29%2Cb%3Dx_1x_2& alt=&a=-(x_1+x_2),b=x_1x_2& eeimg=&1&& ，系数是已知的，实际上我可以联立这样的二元方程组求得方程的根。&/p&&p&所以顺便说一下，群论的发展过程是这样的：&figure&&img src=&https://pic4.zhimg.com/50/v2-eb8aee04f3f6_b.jpg& data-rawwidth=&1059& data-rawheight=&402& class=&origin_image zh-lightbox-thumb& width=&1059& data-original=&https://pic4.zhimg.com/50/v2-eb8aee04f3f6_r.jpg&&&/figure&&/p&&p&关于伽罗瓦与一元五次方程的问题，与群紧密相关，但是又涉及到更多别的知识，本文就不继续推下去了。&/p&&p&&strong&2.2 对称如何用数学表示？&/strong&&/p&&p&让我们从正方形开始解读如何来表示对称.&/p&&p&之前说过，对称最重要的是在&b&“某种操作下的不变性”&/b&，所以我们先讨论正方形围绕中心点旋转，总共有4种对称操作：&figure&&img src=&https://pic4.zhimg.com/50/v2-ead5217cdc2700_b.jpg& data-rawwidth=&444& data-rawheight=&402& class=&origin_image zh-lightbox-thumb& width=&444& data-original=&https://pic4.zhimg.com/50/v2-ead5217cdc2700_r.jpg&&&/figure&&/p&&p&或许你觉得应该不止4种操作，比如转两圈，这可以等价于“保持不动”，而转45°，这会导致不对称（因为你会明显发现变化）。&/p&&p&起始点是完全不用关心的：&figure&&img src=&https://pic2.zhimg.com/50/v2-ecd573c3ca5e16baeb87_b.jpg& data-rawwidth=&906& 
data-rawheight=&402& class=&origin_image zh-lightbox-thumb& width=&906& data-original=&https://pic2.zhimg.com/50/v2-ecd573c3ca5e16baeb87_r.jpg&&&/figure&&/p&&p&甚至是不是正方形也不重要：&figure&&img src=&https://pic4.zhimg.com/50/v2-5baba0a0e1f5d6bc2f699_b.jpg& data-rawwidth=&875& data-rawheight=&402& class=&origin_image zh-lightbox-thumb& width=&875& data-original=&https://pic4.zhimg.com/50/v2-5baba0a0e1f5d6bc2f699_r.jpg&&&/figure&&/p&&p&是的，群只关心对称最本质、最抽象的性质。所以我们只关心操作，只需要把操作放到集合里。&/p&&p&要放进去我们必须要把操作给数学化，也就是符号化，我们起码有两种符号化的选择，类比于加法或者乘法：&figure&&img src=&https://pic2.zhimg.com/50/v2-3e11fc2d2e_b.jpg& data-rawwidth=&1059& data-rawheight=&402& class=&origin_image zh-lightbox-thumb& width=&1059& data-original=&https://pic2.zhimg.com/50/v2-3e11fc2d2e_r.jpg&&&/figure&&/p&&p&稍微解释一下，什么叫做类比于加法？比如我们通过类比于加法得到 &img src=&//www.zhihu.com/equation?tex=%5C%7B+0%2Cr%2C2r%2C3r%5C%7D+& alt=&\{ 0,r,2r,3r\} & eeimg=&1&& ，“保持不变”映射为了0，“旋转90°”映射为了 &img src=&//www.zhihu.com/equation?tex=r& alt=&r& eeimg=&1&& ，而两个操作的依次进行映射为加法。所以“保持不变” + “旋转90°” ＝ &img src=&//www.zhihu.com/equation?tex=0%2Br%3Dr& alt=&0+r=r& eeimg=&1&& ＝ “旋转90°”，是合理。而“旋转90°” + “旋转90°” ＝ &img src=&//www.zhihu.com/equation?tex=r%2Br%3D2r& alt=&r+r=2r& eeimg=&1&& ＝ “旋转180°”，也是合理的。注意，运算不需要符合交换律。&/p&&p&还要说明的一点是，这里的加法和乘法是模加法、模乘法，类似于钟表，按照12小时制算， &img src=&//www.zhihu.com/equation?tex=3%2B11%3D2& alt=&3+11=2& eeimg=&1&& ， &img src=&//www.zhihu.com/equation?tex=3%5Ctimes+6%3D6& alt=&3\times 6=6& eeimg=&1&& 。&/p&&p&这样我们就得到了两个群，一个是 &img src=&//www.zhihu.com/equation?tex=%28G%2C%2B%29%3D%28%5C%7B+0%2Cr%2C2r%2C3r%5C%7D+%2C%2B%29& alt=&(G,+)=(\{ 0,r,2r,3r\} ,+)& eeimg=&1&& ，一个是&img src=&//www.zhihu.com/equation?tex=%28G%2C%5Ctimes+%29%3D%28%5C%7B+1%2Cr%2Cr%5E2%2Cr%5E3%5C%7D+%2C%5Ctimes+%29& alt=&(G,\times )=(\{ 1,r,r^2,r^3\} ,\times )& eeimg=&1&& 。但是我们明明知道它们应该是一样的啊，只是符号不一样，运算不一样，所以我们可以称之为&b&同构&/b&，就是结构相同的意思。&/p&&p&这里先用到群的解析式了，下面就要解释一下。&/p&&p&&strong&2.3 群的定义&/strong&&/p&&p&先祭出大杀器，群的标准定义：&/p&&blockquote&&b&群是一个集合 &img src=&//www.zhihu.com/equation?tex=G& 
alt=&G& eeimg=&1&& ，连同一个运算& &img src=&//www.zhihu.com/equation?tex=%5Ccdot+& alt=&\cdot & eeimg=&1&& &，它结合任何两个元素 &img src=&//www.zhihu.com/equation?tex=a& alt=&a& eeimg=&1&& 和 &img src=&//www.zhihu.com/equation?tex=b& alt=&b& eeimg=&1&& 而形成另一个元素，记为 &img src=&//www.zhihu.com/equation?tex=a%5Ccdot+b& alt=&a\cdot b& eeimg=&1&& 。符号& &img src=&//www.zhihu.com/equation?tex=%5Ccdot+& alt=&\cdot & eeimg=&1&& &是对具体给出的运算，比如整数加法的一般占位符。要具备成为群的资格，这个集合和运算 &img src=&//www.zhihu.com/equation?tex=%28G%2C%5Ccdot+%29& alt=&(G,\cdot )& eeimg=&1&& 必须满足叫做群公理的四个要求：&ul&&li&&p&封闭性：对于所有 &img src=&//www.zhihu.com/equation?tex=G& alt=&G& eeimg=&1&& 中 &img src=&//www.zhihu.com/equation?tex=a%2Cb& alt=&a,b& eeimg=&1&& ，运算 &img src=&//www.zhihu.com/equation?tex=a%5Ccdot+b& alt=&a\cdot b& eeimg=&1&& 的结果也在 &img src=&//www.zhihu.com/equation?tex=G& alt=&G& eeimg=&1&& 中。&/p&&/li&&li&&p&结合性：对于所有 &img src=&//www.zhihu.com/equation?tex=G& alt=&G& eeimg=&1&& 中的 &img src=&//www.zhihu.com/equation?tex=a%2Cb& alt=&a,b& eeimg=&1&& 和 &img src=&//www.zhihu.com/equation?tex=c& alt=&c& eeimg=&1&& ，等式 &img src=&//www.zhihu.com/equation?tex=%28a%5Ccdot+b%29%5Ccdot+c%3Da%5Ccdot+%28b%5Ccdot+c%29& alt=&(a\cdot b)\cdot c=a\cdot (b\cdot c)& eeimg=&1&& 成立。&/p&&/li&&li&&p&单位元：存在 &img src=&//www.zhihu.com/equation?tex=G& alt=&G& eeimg=&1&& 中的一个元素 &img src=&//www.zhihu.com/equation?tex=e& alt=&e& eeimg=&1&& ，使得对于所有 &img src=&//www.zhihu.com/equation?tex=G& alt=&G& eeimg=&1&& 中的元素 &img src=&//www.zhihu.com/equation?tex=a& alt=&a& eeimg=&1&& ，等式&img src=&//www.zhihu.com/equation?tex=e%5Ccdot+a%3Da%5Ccdot+e%3Da& alt=&e\cdot a=a\cdot e=a& eeimg=&1&& 成立。&/p&&/li&&li&&p&逆元：对于每个 &img src=&//www.zhihu.com/equation?tex=G& alt=&G& eeimg=&1&& 中的 &img src=&//www.zhihu.com/equation?tex=a& alt=&a& eeimg=&1&& ，存在 &img src=&//www.zhihu.com/equation?tex=G& alt=&G& eeimg=&1&& 中的一个元素 &img src=&//www.zhihu.com/equation?tex=b& alt=&b& eeimg=&1&& 使得 &img src=&//www.zhihu.com/equation?tex=a%5Ccdot+b%3Db%5Ccdot+a%3De& alt=&a\cdot b=b\cdot a=e& 
eeimg=&1&& ，这里的 &img src=&//www.zhihu.com/equation?tex=e& alt=&e& eeimg=&1&& 是单位元。&/p&&/li&&/ul&&/b&&p&维基百科&/p&&/blockquote&&p&数学是自然科学的语言，和日常的说话相比最大的优点是精确没有歧义，缺点就是晦涩不好理解。群的定义也是这样，下面我们用人话来解释群。&/p&&p&套用正方形的例子来解读群的定义，选 &img src=&//www.zhihu.com/equation?tex=%28G%2C%2B%29%3D%28%5C%7B+0%2Cr%2C2r%2C3r%5C%7D+%2C%2B%29& alt=&(G,+)=(\{ 0,r,2r,3r\} ,+)& eeimg=&1&& 这个群吧：&/p&&ul&&li&&p&集合里的对象：所有保证对称性的操作。&/p&&/li&&li&&p&二元运算：模加法。&/p&&/li&&li&&p&封闭性：操作相加还是在集合内，比如 &img src=&//www.zhihu.com/equation?tex=r%2B2r%3D3r& alt=&r+2r=3r& eeimg=&1&& 。&/p&&/li&&li&&p&结合性： &img src=&//www.zhihu.com/equation?tex=r%2B3r%2B2r%3Dr%2B%283r%2B2r%29& alt=&r+3r+2r=r+(3r+2r)& eeimg=&1&& 。&/p&&/li&&li&&p&单位元：保持不动就是单位元，映射为0，所以 &img src=&//www.zhihu.com/equation?tex=0%2Br%3Dr& alt=&0+r=r& eeimg=&1&& 。&/p&&/li&&li&&p&逆元：首先旋转正方形的操作是可逆的，所以 &img src=&//www.zhihu.com/equation?tex=r%2B%28-r%29%3D0& alt=&r+(-r)=0& eeimg=&1&& ，同时这还是一个循环的运算， &img src=&//www.zhihu.com/equation?tex=r%2B3r%3D0& alt=&r+3r=0& eeimg=&1&& ，都可以说是 &img src=&//www.zhihu.com/equation?tex=r& alt=&r& eeimg=&1&& 的逆元。&/p&&/li&&/ul&&p&其实吧，我可以再抽象一点， &img src=&//www.zhihu.com/equation?tex=%28G%2C%2B%29%3D%28%5C%7B+0%2Cr%2C2r%2C3r%5C%7D+%2C%2B%29%3D%28%5C%7B+0%2C1%2C2%2C3%5C%7D+%2C%2B%29& alt=&(G,+)=(\{ 0,r,2r,3r\} ,+)=(\{ 0,1,2,3\} ,+)& eeimg=&1&& ，这个群基本上已经没有原来正方形旋转的影子了。群比我们之前学的数学的抽象性更近了一步，要不怎么放在抽象代数课程里面呢？本文只是想稍微让群具体一点。&/p&&p&&strong&2.4 群的结构与同构&/strong&&/p&&p&之前说过，正方形围绕中垂线翻转是不一样的对称&figure&&img src=&https://pic2.zhimg.com/50/v2-c04e7f532aad020c5e9a04ae_b.jpg& data-rawwidth=&495& data-rawheight=&402& class=&origin_image zh-lightbox-thumb& width=&495& data-original=&https://pic2.zhimg.com/50/v2-c04e7f532aad020c5e9a04ae_r.jpg&&&/figure&&/p&&p&上图我把运算直接表示为& &img src=&//www.zhihu.com/equation?tex=%5Ccdot+& alt=&\cdot & eeimg=&1&& &。这个群很明显和正方形围绕中心点旋转的群不一样，所以对称也就不一样，用群的术语来说就是，这两种群结构不一样。&/p&&p&现实中，还有各种各样的对称，比如正方形和圆：&figure&&img src=&https://pic2.zhimg.com/50/v2-388a596cfeded89a6d9c41_b.jpg& data-rawwidth=&1059& data-rawheight=&402& class=&origin_image 
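zh-lightbox-thumb& width=&1059& data-original=&https://pic2.zhimg.com/50/v2-388a596cfeded89a6d9c41_r.jpg&&&/figure&&/p&&p&These two symmetries also differ in structure, so the corresponding groups differ as well. Group theory is the study of all these kinds of groups.&/p&&p&&strong&2.5 Further thoughts&/strong&&/p&&p&One more thought on isomorphism. A circle has infinitely many symmetry operations, and so do the physical laws mentioned earlier that are symmetric with respect to time (because time can flow on without limit). In some sense, are the two the same kind of symmetry, that is, isomorphic? If they are, then studying one group is enough to study both.&/p&&p&Thinking is where the greatest pleasure of mathematics lies.&/p&&p&Finally, a book recommendation: Visual Group Theory by &a class=& wrap external& href=&//link.zhihu.com/?target=https%3A//book.douban.com/search/Nathan%2520Carter& target=&_blank& rel=&nofollow noreferrer&&Nathan Carter&/a&, with thanks to &a data-hash=&f3b41fa8bb0f7c& href=&//www.zhihu.com/people/f3b41fa8bb0f7c& class=&member_mention& data-hovercard=&p$b$f3b41fa8bb0f7c&&@金凯&/a&. I have read it and it is quite good, but there is no Chinese edition and it is expensive.&/p&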
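The four group axioms from section 2.3 can be checked by brute force for the rotation group ({0, r, 2r, 3r}, +). The sketch below is not part of the original article. It assumes the rotations are encoded as the integers 0 to 3 (the number of quarter turns) combined by addition modulo 4, and the names `is_group`, `add_mod4` and `mul_mod4` are made up for illustration.

```python
from itertools import product


def is_group(elements, op):
    """Brute-force check of the four group axioms on a finite set."""
    elements = list(elements)

    # Closure: a . b must stay inside the set.
    if any(op(a, b) not in elements for a, b in product(elements, repeat=2)):
        return False

    # Associativity: (a . b) . c == a . (b . c) for all triples.
    if any(op(op(a, b), c) != op(a, op(b, c))
           for a, b, c in product(elements, repeat=3)):
        return False

    # Identity: some e with e . a == a . e == a for every a.
    identities = [e for e in elements
                  if all(op(e, a) == a and op(a, e) == a for a in elements)]
    if not identities:
        return False
    e = identities[0]

    # Inverses: every a needs some b with a . b == b . a == e.
    return all(any(op(a, b) == e and op(b, a) == e for b in elements)
               for a in elements)


def add_mod4(a, b):
    """Compose two rotations given as numbers of quarter turns."""
    return (a + b) % 4


def mul_mod4(a, b):
    """Multiplication modulo 4, used here as a counterexample."""
    return (a * b) % 4


rotations_ok = is_group(range(4), add_mod4)
counterexample_ok = is_group(range(4), mul_mod4)
print(rotations_ok)       # True
print(counterexample_ok)  # False: 0 has no inverse under multiplication
```

The counterexample uses the same four symbols with a different operation and fails the inverse axiom, which illustrates that a group is the pair of set and operation, not the set alone.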
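&figure&&img src=&https://pic3.zhimg.com/v2-8fc75d86d7c2b2dc0860_b.jpg& data-rawwidth=&888& data-rawheight=&439& class=&origin_image zh-lightbox-thumb& width=&888& data-original=&https://pic3.zhimg.com/v2-8fc75d86d7c2b2dc0860_r.jpg&&&/figure&&p&When getting started with deep learning, gradient descent, backpropagation, and activation functions are three concepts you cannot avoid. Without a good grasp of them you may never really get into deep learning, and without connecting them to one another, your understanding of deep neural networks is likely to stay muddled. Many articles online introduce these concepts, but usually one at a time, because each takes a lot of ink to explain. To understand them better, and to know not only what they are but why, I think it is worth tying them together.&/p&&p&Why call them the prelude to deep learning? Simply because they are the most basic of its basic concepts. As I see it, deep learning consists of two things:&/p&&ol&&li&Training deep neural networks better. A network with more than two hidden layers counts as a deep neural network (DNN). Training a three-layer NN is manageable, but what if the NN has many layers? Then problems such as vanishing and exploding gradients appear. To get good results when training DNNs, a set of techniques was developed, such as backpropagation, activation functions, batch normalization, and dropout. Gradient descent, in turn, is there to optimize the cost function; whether in machine learning or deep learning, a cost function (loss function) always has to be optimized.&/li&&li&Designing network architectures that extract features better. Adding hidden layers lets a network extract higher-level features, convolutional neural networks extract spatial features, recurrent neural networks extract features from time series, and so on. Hence the many architectures that have been invented, such as AlexNet, LeNet, GoogLeNet, the Inception series, and ResNet, as well as LSTM and others.&/li&&/ol&&p&However elegant an architecture is, it does not work if it cannot be trained to convergence. The techniques introduced today exist to train DNNs better; they are the foundation that makes a well-trained DNN possible, which is why I call them the prelude to deep learning!&/p&&hr&&h2&1. Gradient descent&/h2&&p&Suppose we have inputs &img src=&http://www.zhihu.com/equation?tex=%5C%7Bx_1%2Cx_2%2C...%2Cx_n%5C%7D& alt=&\{x_1,x_2,...,x_n\}& eeimg=&1&& with corresponding outputs &img src=&http://www.zhihu.com/equation?tex=%5C%7By_1%2Cy_2%2C..%2Cy_n%5C%7D& alt=&\{y_1,y_2,..,y_n\}& eeimg=&1&& . We want the network's output f(x_i) to fit the target output y_i for every training input x_i. To that end, we define a cost function:&/p&&p&&img src=&http://www.zhihu.com/equation?tex=C%28v_1%2Cv_2%29%5Cequiv+%5Cfrac%7B1%7D%7B2n%7D%5Csum_%7Bi%7D%5E%7Bn%7D%7B%28f%28x_i%29-y_i%29%5E2%7D& alt=&C(v_1,v_2)\equiv \frac{1}{2n}\sum_{i}^{n}{(f(x_i)-y_i)^2}& eeimg=&1&&&/p&&p&To find parameters (v1, v2) that minimize this cost function, one could use calculus to solve for the extremum of the right-hand side, that is, take the derivatives and set them to zero. That works when there are few parameters but becomes impractical once there are many. Neural networks in deep learning routinely have millions or tens of millions of parameters, so solving for the extremum directly from the derivatives is not feasible.&/p&&p&To get around this, let us use the example above to see how gradient descent finds the extremum.&/p&&figure&&img src=&https://pic1.zhimg.com/v2-2dbcbc05bb08_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&482& data-rawheight=&342& class=&origin_image zh-lightbox-thumb& width=&482& 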
data-original=&https://pic1.zhimg.com/v2-2dbcbc05bb08_r.jpg&&&/figure&&p&如上图所示，首先我们初始化v1,v2，假如现在（v1,v2）的取值如上图小球所在的位置，我们要做的就是寻找最佳v1,v2的取值使得代价函数最小，也就是使上图上的小球从山坡上移动到谷底。这里有两个方向v1,v2，也就是两个变量，想象一下小球分别往两个方向移动很小的量，即 &img src=&http://www.zhihu.com/equation?tex=%5CDelta+v_1%2C%5CDelta+v_2& alt=&\Delta v_1,\Delta v_2& eeimg=&1&& ,那么小球移动的大小将为：&/p&&p&&img src=&http://www.zhihu.com/equation?tex=%5CDelta+C%5Capprox+%5Cfrac%7B%5Cpartial+C%7D%7B%5Cpartial+v_1%7D%5CDelta+v_1+%2B+%5Cfrac%7B%5Cpartial+C%7D%7B%5Cpartial+v_2%7D%5CDelta+v_2& alt=&\Delta C\approx \frac{\partial C}{\partial v_1}\Delta v_1 + \frac{\partial C}{\partial v_2}\Delta v_2& eeimg=&1&&&/p&&p&&img src=&http://www.zhihu.com/equation?tex=%5Cfrac%7B%5Cpartial+C%7D%7B%5Cpartial+v_1%7D& alt=&\frac{\partial C}{\partial v_1}& eeimg=&1&& 表示函数C对变量v1的偏导数，也就是代价函数在v1上的变化速率，乘以变量的变化量就是代价函数自身的变化量了。&/p&&p&定义倒三角形C为梯度向量，即：&/p&&figure&&img src=&https://pic2.zhimg.com/v2-f2b6bd7c749_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&202& data-rawheight=&67& class=&content_image& width=&202&&&/figure&&p&那么小球移动的量可以表示为：&/p&&figure&&img src=&https://pic3.zhimg.com/v2-2445bfa0ff00e95eb42b22_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&151& data-rawheight=&32& class=&content_image& width=&151&&&/figure&&p&为了使得 &img src=&http://www.zhihu.com/equation?tex=%5CDelta+C& alt=&\Delta C& eeimg=&1&& 为负数，也就是C能够逐渐变小，我们可以取&/p&&figure&&img src=&https://pic1.zhimg.com/v2-9a1cda6addc_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&128& data-rawheight=&28& class=&content_image& width=&128&&&/figure&&p&这里 &img src=&http://www.zhihu.com/equation?tex=%5Ceta& alt=&\eta& eeimg=&1&& 称为学习率，一般是一个很小的正数，于是有了&/p&&figure&&img src=&https://pic3.zhimg.com/v2-ecbeeba8103a8a_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&257& data-rawheight=&30& class=&content_image& width=&257&&&/figure&&p&这样保证了 &img src=&http://www.zhihu.com/equation?tex=%5CDelta+C& alt=&\Delta C& eeimg=&1&& 为负数。因此我们得到了变量v的变动方式了：&/p&&figure&&img 
src=&https://pic2.zhimg.com/v2-aede29f195c1_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&171& data-rawheight=&30& class=&content_image& width=&171&&&/figure&&p&总结起来，梯度下降算法的工作方式就是重复计算梯度，然后沿着相反的方向移动，使得小球沿着山谷“滚动”。&/p&&p&由于考虑到梯度下降算法的性能各方面因素，后来又有了随机梯度下降算法等；但是我这里不是介绍优化算法有什么，而是告诉大家，&b&在优化代价函数的时候需要计算“梯度”，也就是代价函数必须可导，且代价函数在变量（参数）上的导数不能为0，不然就不能通过改变变量来优化代价函数了。&/b&&/p&&hr&&h2&2、反向传播&/h2&&p&回忆一下线性回归， &img src=&http://www.zhihu.com/equation?tex=y+%3D+f%28%5Ctheta%3Bx%29& alt=&y = f(\x)& eeimg=&1&& ，x是输入，y是输出， &img src=&http://www.zhihu.com/equation?tex=%5Ctheta& alt=&\theta& eeimg=&1&& 是参数，梯度下降算法就是用来求得最优参数 &img src=&http://www.zhihu.com/equation?tex=%5Ctheta& alt=&\theta& eeimg=&1&& 的。在这里最初始的输入x和最终的输出y是直接关联的，如果将线性回归看做一个神经网络的话，那这个网络就只有输入层和输出层，而没有隐藏层。在深度神经网络中，隐藏层可能有多层，那么它的初始输入和最终输出是如何关联的呢？怎么应用梯度下降到神经网络中呢？&/p&&figure&&img src=&https://pic1.zhimg.com/v2-df6e1af2cd0_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&499& data-rawheight=&281& class=&origin_image zh-lightbox-thumb& width=&499& data-original=&https://pic1.zhimg.com/v2-df6e1af2cd0_r.jpg&&&/figure&&p&&br&&/p&&p&如上图所示，该网络有两个隐藏层，每个隐藏层 &img src=&http://www.zhihu.com/equation?tex=layer_i& alt=&layer_i& eeimg=&1&& 的输入其实就是上一层 &img src=&http://www.zhihu.com/equation?tex=layer_%7Bi-1%7D& alt=&layer_{i-1}& eeimg=&1&& 的输出，而它的输出又是下一层 &img src=&http://www.zhihu.com/equation?tex=layer_%7Bi%2B1%7D& alt=&layer_{i+1}& eeimg=&1&& 的输入；反向传播的思想其实就是，对于每一个训练实例，将它传入神经网络，计算它的输出；然后测量网络的输出误差（即期望输出和实际输出之间的差异），并&b&计算出上一个隐藏层中各神经元为该输出结果贡献了多少的误差&/b&；反复一直从后一层计算到前一层，直到算法到达初始的输入层为止。此反向传递过程有效地测量网络中所有连接权重的误差梯度，最后&b&通过在每一个隐藏层中应用梯度下降算法来优化该层的参数&/b&（反向传播算法的名称也因此而来）。&/p&&p&上面中文看不懂没关系，可以参考一下英文“ for each training instance the backpropagation algorithm first makes a prediction (forward pass), measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally slightly tweaks the connection weights to reduce the error (Gradient Descent 
step).”&/p&&p&上面的描述仅仅是反向传播算法的思想，工作原理；那么具体反向传播算法是怎样计算出每个神经元的误差，并且将这个误差关联到“梯度”上的呢？有兴趣的可以看看下面的数学描述，没兴趣的可以略过……&/p&&p&引入一个中间量 &img src=&http://www.zhihu.com/equation?tex=%5Cdelta+_j%5E%7Bl%7D& alt=&\delta _j^{l}& eeimg=&1&& ，将其称为神经网络中在 &img src=&http://www.zhihu.com/equation?tex=l%5E%7Bth%7D& alt=&l^{th}& eeimg=&1&& 层第 &img src=&http://www.zhihu.com/equation?tex=j%5E%7Bth%7D& alt=&j^{th}& eeimg=&1&& 个神经元上的误差。如下图所示：&/p&&figure&&img src=&https://pic4.zhimg.com/v2-ecef0bd288b_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&609& data-rawheight=&306& class=&origin_image zh-lightbox-thumb& width=&609& data-original=&https://pic4.zhimg.com/v2-ecef0bd288b_r.jpg&&&/figure&&p&反向传播将给出计算误差 &img src=&http://www.zhihu.com/equation?tex=%5Cdelta+_j%5E%7Bl%7D& alt=&\delta _j^{l}& eeimg=&1&& 的流程，然后将其关联到“梯度”（ &img src=&http://www.zhihu.com/equation?tex=%5Cpartial+C%2F%5Cpartial+w_%7Bjk%7D%5El%2C%5Cpartial+C%2F%5Cpartial+b_%7Bj%7D%5El& alt=&\partial C/\partial w_{jk}^l,\partial C/\partial b_{j}^l& eeimg=&1&& ）的计算上。&/p&&p&想象一下，当每一层的输入进入到该层的每一个神经元，由于前后两层的神经元之间是由权重w连接的，而这个权重在最开始是由我们人为随机设置的，肯定不是最优的权重，所以必然会给上一层的输出带来一个误差，我们将这个误差记为 &img src=&http://www.zhihu.com/equation?tex=%5CDelta+z_j%5El& alt=&\Delta z_j^l& eeimg=&1&& ，它是的神经元的输出从 &img src=&http://www.zhihu.com/equation?tex=%5Csigma+%28z_j%5El%29& alt=&\sigma (z_j^l)& eeimg=&1&& 变成 &img src=&http://www.zhihu.com/equation?tex=%5Csigma%28%5CDelta+z_j%5El+%2Bz_j%5El%29& alt=&\sigma(\Delta z_j^l +z_j^l)& eeimg=&1&& ，这个变化会向网络后面的层进行传播，最终导致整个代价产生 &img src=&http://www.zhihu.com/equation?tex=%5Cfrac%7B%5Cpartial+C%7D%7B%5Cpartial+z_j%5El%7D%5CDelta+z_j%5El& alt=&\frac{\partial C}{\partial z_j^l}\Delta z_j^l& eeimg=&1&& 的改变。我们的梯度下降算法正是用来优化代价的，为了使得 &img src=&http://www.zhihu.com/equation?tex=%5CDelta+z_j%5El& alt=&\Delta z_j^l& eeimg=&1&& 更小，假设 &img src=&http://www.zhihu.com/equation?tex=%5Cfrac%7B%5Cpartial+C%7D%7B%5Cpartial+z_j%5El%7D& alt=&\frac{\partial C}{\partial z_j^l}& eeimg=&1&& 有一个很大的值（或正或负），那么梯度下降将会选择与 &img 
src=&http://www.zhihu.com/equation?tex=%5Cfrac%7B%5Cpartial+C%7D%7B%5Cpartial+z_j%5El%7D& alt=&\frac{\partial C}{\partial z_j^l}& eeimg=&1&& 符号相反的 &img src=&http://www.zhihu.com/equation?tex=%5CDelta+z_j%5El& alt=&\Delta z_j^l& eeimg=&1&& 来降低代价。而如果 &img src=&http://www.zhihu.com/equation?tex=%5Cfrac%7B%5Cpartial+C%7D%7B%5Cpartial+z_j%5El%7D& alt=&\frac{\partial C}{\partial z_j^l}& eeimg=&1&& 接近0，那么无论如何也不能优化代价函数了。因此，我们直觉的认为 &img src=&http://www.zhihu.com/equation?tex=%5Cfrac%7B%5Cpartial+C%7D%7B%5Cpartial+z_j%5El%7D& alt=&\frac{\partial C}{\partial z_j^l}& eeimg=&1&& 是神经元误差的度量。因此，我们有：&/p&&figure&&img src=&https://pic3.zhimg.com/v2-4d70af472e0e24cd13a3ba_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&112& data-rawheight=&68& class=&content_image& width=&112&&&/figure&&p& 我们使用 &img src=&http://www.zhihu.com/equation?tex=%5Cdelta+%5El& alt=&\delta ^l& eeimg=&1&& 表示关联于 l 层的误差向量。&/p&&p&由于我们每一层神经网络的输出和输入之间使用了激活函数 &img src=&http://www.zhihu.com/equation?tex=%5Csigma& alt=&\sigma& eeimg=&1&&（这里提到激活函数，先不管为什么有激活函数，后面再讲），我们使用 &img src=&http://www.zhihu.com/equation?tex=a_j+%5El& alt=&a_j ^l& eeimg=&1&& 表示 &img src=&http://www.zhihu.com/equation?tex=z_j%5El& alt=&z_j^l& eeimg=&1&& 的激活值，那么我们可以使用 &img src=&http://www.zhihu.com/equation?tex=%5Cfrac%7B%5Cpartial+C%7D%7B%5Cpartial+a_j%5El%7D& alt=&\frac{\partial C}{\partial a_j^l}& eeimg=&1&& 作为度量误差的方法。&/p&&p&有了以上认识之后，我们来描述反向传播算法：&/p&&p&&b&计算输出层误差的方程&/b&，δ^L: 每个元素定义如下:&/p&&figure&&img src=&https://pic1.zhimg.com/v2-6cbf465e93a1371bbba738_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&496& data-rawheight=&67& class=&origin_image zh-lightbox-thumb& width=&496& data-original=&https://pic1.zhimg.com/v2-6cbf465e93a1371bbba738_r.jpg&&&/figure&&p&右式第一个项 &img src=&http://www.zhihu.com/equation?tex=%5Cpartial+_%7Ba_j%5EL%7D& alt=&\partial _{a_j^L}& eeimg=&1&& 表示代价随着 &img src=&http://www.zhihu.com/equation?tex=j%5E%7Bth%7D& alt=&j^{th}& eeimg=&1&& 输出激活值的变化而变化的速度。假如 C 不太依赖一个特定的输出神经元 j，那么 δ_j^L 就会很小，这也是我们想要的效果。右式第二项 σ′(zjL) 
刻画了在 &img src=&http://www.zhihu.com/equation?tex=z_j%5EL& alt=&z_j^L& eeimg=&1&& 处激活函数 σ 变化的速度。&/p&&p&&b&使用下一层的误差 δ^{l+1} 来表示当前层的误差 δ^l:&/b& 特别地，&/p&&figure&&img src=&https://pic1.zhimg.com/v2-8dcc9d0fcbdfec4cfb286bb4_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&519& data-rawheight=&56& class=&origin_image zh-lightbox-thumb& width=&519& data-original=&https://pic1.zhimg.com/v2-8dcc9d0fcbdfec4cfb286bb4_r.jpg&&&/figure&&p&其中 (w^{l+1})^T 是 (l + 1)^{th} 层权重矩阵 w^{l+1} 的转置。这个公式看上去有些复杂，但每一个元素有很好的解释。假设我们知道 l + 1^{th} 层的误差 δ^{l+1}。当我们应用转置的权重矩阵 (w^{l+1})^T ，我们可以凭直觉地把它看作是在沿着网络反向移动误差，给了我们度量在 lth 层输出的误差方法。然后，我们进行 Hadamard 乘积运算 ⊙σ′(z^l)。这会让误差通过 l 层的激活函数反向传递回来并给出在第 l 层的带权输入的误差 δ。&br&通过组合 (BP1) 和 (BP2)，我们可以计算任何层的误差 δ^l。首先使用 (BP1) 计算 δ^L，然后应用方程 (BP2) 来计算 δ^{L-1}，然后再次用方程 (BP2) 来计算 δ^{L-2}，如此一步一步地反向传播完整个网络。&/p&&p&&b&代价函数关于网络中任意偏置的改变率:&/b&&/p&&figure&&img src=&https://pic1.zhimg.com/v2-b7f8fd04de05c5cf1635ab4_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&462& data-rawheight=&79& class=&origin_image zh-lightbox-thumb& width=&462& data-original=&https://pic1.zhimg.com/v2-b7f8fd04de05c5cf1635ab4_r.jpg&&&/figure&&p&&b&代价函数关于任何一个权重的改变率:&/b&&/p&&figure&&img src=&https://pic2.zhimg.com/v2-8efddaee9d75bcf4a9f33a3d891becb5_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&473& data-rawheight=&83& class=&origin_image zh-lightbox-thumb& width=&473& data-original=&https://pic2.zhimg.com/v2-8efddaee9d75bcf4a9f33a3d891becb5_r.jpg&&&/figure&&p&将上式简化可以写成如下形式：&/p&&figure&&img src=&https://pic1.zhimg.com/v2-4d10beb82b17a72dc7c731548eec880c_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&138& data-rawheight=&60& class=&content_image& width=&138&&&/figure&&p&其中 a_{in} 是输入给权重 w 的神经元的激活值，δ_{out} 是输出自权重 w 
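的神经元的误差。From the expression above it is easy to see that when the activation a_{in} is small, the gradient is also small, close to 0. We then say that the weight learns slowly, meaning it changes little during gradient descent. A related situation arises when a neuron has saturated, so that σ′(z) is close to 0: learning of its weights, including those in the final layer, slows down or stops. Slow learning of this kind is also described as the vanishing gradient problem, and it is very bad for training deep neural networks.&/p&&hr&&h2&3. Activation functions&/h2&&p&Many posts online explain what activation functions are for, so I will be brief. In short, an activation function makes the output of each layer nonlinear. Why does it need to be nonlinear? Because only then can the network fit arbitrary functions, nonlinear as well as linear (&a href=&https://www.zhihu.com/question/& class=&internal&&see here for details&/a&). Nonlinearity has another role as well: it keeps a usable “gradient” on the output of every layer, and only with a gradient can we apply backpropagation and gradient descent to optimize the cost function (as we saw above) and so train deeper networks.&/p&&p&As mentioned in the backpropagation section, the error has to be passed layer by layer from the output layer back to the input layer. Along the way the gradients often get smaller and smaller, until almost nothing is left (this is the &i&vanishing gradients&/i& problem). The opposite can also happen, for example in recurrent neural networks: the gradients grow larger and larger, the weights keep receiving large updates, and training never converges. This is the &i&exploding gradients&/i& problem.&/p&&p&The first function used as an activation function was the sigmoid, shown below:&/p&&figure&&img src=&https://pic1.zhimg.com/v2-d3da2ba0b260e23e3bee68_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&535& data-rawheight=&326& class=&origin_image zh-lightbox-thumb& width=&535& data-original=&https://pic1.zhimg.com/v2-d3da2ba0b260e23e3bee68_r.jpg&&&/figure&&p&The figure shows that when the input is large in magnitude (negative or positive), the derivative of the sigmoid approaches 0; the function saturates. When backpropagation kicks in, there is almost no gradient to propagate back through the network toward the input layer, and what little gradient exists keeps getting diluted as it passes down through the top layers. By the time it reaches the lower layers nothing is left, so their weights are not updated. To avoid vanishing or exploding gradients, some researchers argued that the variance of each layer's outputs must equal the variance of its inputs (“we need the variance of the outputs of each layer to be equal to the variance of its inputs”), and they devised a weight initialization technique for the sigmoid activation function, with the following formula:&/p&&figure&&img src=&https://pic4.zhimg.com/v2-b867ea93c85d9d_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&659& data-rawheight=&132& class=&origin_image zh-lightbox-thumb& width=&659& data-original=&https://pic4.zhimg.com/v2-b867ea93c85d9d_r.jpg&&&/figure&&p&This initialization technique is called
&i&Xavier initialization or Glorot initialization&/i&. &b&It speeds up training considerably and is one of the tricks behind the current success in training deep neural networks.&/b&&/p&&p&&b&For other activation functions, different papers give the corresponding weight initialization strategies:&/b&&/p&&figure&&img src=&https://pic1.zhimg.com/v2-e54a3fcdd6b8c7ceab6b70_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&560& data-rawheight=&243& class=&origin_image zh-lightbox-thumb& width=&560& data-original=&https://pic1.zhimg.com/v2-e54a3fcdd6b8c7ceab6b70_r.jpg&&&/figure&&p&In TensorFlow, the fully connected layer function fully_connected() uses &i&Xavier initialization&/i& by default. You can switch to He initialization manually with the variance_scaling_initializer() function (both live in tf.contrib, which exists only in TensorFlow 1.x):&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&n&&he_init&/span& &span class=&o&&=&/span& &span class=&n&&tf&/span&&span class=&o&&.&/span&&span class=&n&&contrib&/span&&span class=&o&&.&/span&&span class=&n&&layers&/span&&span class=&o&&.&/span&&span class=&n&&variance_scaling_initializer&/span&&span class=&p&&()&/span&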
&span class=&n&&hidden1&/span& &span class=&o&&=&/span& &span class=&n&&fully_connected&/span&&span class=&p&&(&/span&&span class=&n&&X&/span&&span class=&p&&,&/span& &span class=&n&&n_hidden1&/span&&span class=&p&&,&/span& &span class=&n&&weights_initializer&/span&&span class=&o&&=&/span&&span class=&n&&he_init&/span&&span class=&p&&,&/span& &span class=&n&&scope&/span&&span class=&o&&=&/span&&span class=&s2&&"h1"&/span&&span class=&p&&)&/span&
&/code&&/pre&&/div&&p&但是因为存在梯度饱和的问题，所以sigmod作为激活函数始终不怎么好，于是有了ReLU激活函数，它不存在饱和现象（当值为正数的时候），如下图红色虚线所示：&/p&&figure&&img src=&https://pic3.zhimg.com/v2-9c55ebd7d1fd06db5b176fe_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&341& data-rawheight=&240& class=&content_image& width=&341&&&/figure&&p&然而，当输入为负值的时候神经网络的经过ReLU激活之后输出就变成0了，这一定程度上加速了神经网络的计算速度，但是却也使得一大半的神经元“死了”，因为这些神经元不输出任何值（只有0），而当学习率比较大时，你会发现有更多的神经元“死了”，这不是我们想要的结果。为了解决这个问题，于是有了LeakyReLU激活函数：&/p&&figure&&img src=&https://pic3.zhimg.com/v2-b1cb8051d3cbbb7de13e362_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&257& data-rawheight=&31& class=&content_image& width=&257&&&/figure&&figure&&img src=&https://pic2.zhimg.com/v2-cee0c9c1fcee5_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&440& data-rawheight=&258& class=&origin_image zh-lightbox-thumb& width=&440& data-original=&https://pic2.zhimg.com/v2-cee0c9c1fcee5_r.jpg&&&/figure&&p&当输入z小于0时不再直接输出0，而是输出 &img src=&http://www.zhihu.com/equation?tex=%5Calpha+z& alt=&\alpha z& eeimg=&1&& ，这里 &img src=&http://www.zhihu.com/equation?tex=%5Calpha& alt=&\alpha& eeimg=&1&& 通常设置为0.01，这样神经元就不会“死”，而有机会“复苏”起来。有人发现使用这个变体激活函数往往比原激活函数效果更好，而设置 &img src=&http://www.zhihu.com/equation?tex=%5Calpha+%3D+0.2& alt=&\alpha = 0.2& eeimg=&1&& （大的leaky）效果比设置为0.01更好。&/p&&p&有时，也可以不将 &img src=&http://www.zhihu.com/equation?tex=%5Calpha& alt=&\alpha& eeimg=&1&& 设置为一个固定的值，而是在训练神经网络时将其作为一个参数进行学习，这在大数据集上表现良好，然而对于小数据集却容易过拟合。&/p&&p&&b&指数线性单元激活函数
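&i&exponential linear unit &/i&(ELU)&/b&&/p&&p&In 2015, Djork-Arné Clevert et al. proposed this activation function and reported that it outperformed ReLU and all its variants in their experiments. ELU is defined as follows:&/p&&figure&&img src=&https://pic3.zhimg.com/v2-70cd230bd578b22ac6f8f6_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&332& data-rawheight=&63& class=&content_image& width=&332&&&/figure&&figure&&img src=&https://pic4.zhimg.com/v2-746dfb0cf0d49ee84db83_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&393& data-rawheight=&249& class=&content_image& width=&393&&&/figure&&p&The main change is that for inputs z < 0 the output is replaced by an exponential function scaled by &img src=&http://www.zhihu.com/equation?tex=%5Calpha& alt=&\alpha& eeimg=&1&& . The hyperparameter &img src=&http://www.zhihu.com/equation?tex=%5Calpha& alt=&\alpha& eeimg=&1&& controls the value the function approaches when z is a large negative number; it is usually set to 1, though it can also be tuned during training.&/p&&p&The drawback of ELU is that it is slower to compute, but during training its faster convergence makes up for this.&/p&&p&When you need to choose an activation function for a neural network, you can generally go in this order of preference: ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic.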
&/p&&blockquote&If you care a lot about runtime performance, then you may prefer leaky ReLUs over ELUs. If you don’t want to tweak yet another hyperparameter, you may just use the default &i&α &/i&values suggested earlier (0.01 for the leaky ReLU, and 1 for ELU). If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, in particular RReLU if your network is overfitting, or PReLU if you have a huge training set.&/blockquote&&p&TensorFlow provides an ELU activation function, which can be used like this:&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&n&&hidden1&/span& &span class=&o&&=&/span& &span class=&n&&fully_connected&/span&&span class=&p&&(&/span&&span class=&n&&X&/span&&span class=&p&&,&/span& &span class=&n&&n_hidden1&/span&&span class=&p&&,&/span& &span class=&n&&activation_fn&/span&&span class=&o&&=&/span&&span class=&n&&tf&/span&&span class=&o&&.&/span&&span class=&n&&nn&/span&&span class=&o&&.&/span&&span class=&n&&elu&/span&&span class=&p&&)&/span&
&/code&&/pre&&/div&&p&You can also define your own leaky ReLU activation function:&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&k&&def&/span& &span class=&nf&&leaky_relu&/span&&span class=&p&&(&/span&&span class=&n&&z&/span&&span class=&p&&,&/span& &span class=&n&&name&/span&&span class=&o&&=&/span&&span class=&kc&&None&/span&&span class=&p&&):&/span&
    &span class=&k&&return&/span& &span class=&n&&tf&/span&&span class=&o&&.&/span&&span class=&n&&maximum&/span&&span class=&p&&(&/span&&span class=&mf&&0.01&/span& &span class=&o&&*&/span& &span class=&n&&z&/span&&span class=&p&&,&/span& &span class=&n&&z&/span&&span class=&p&&,&/span& &span class=&n&&name&/span&&span class=&o&&=&/span&&span class=&n&&name&/span&&span class=&p&&)&/span&
&span class=&n&&hidden1&/span& &span class=&o&&=&/span& &span class=&n&&fully_connected&/span&&span class=&p&&(&/span&&span class=&n&&X&/span&&span class=&p&&,&/span& &span class=&n&&n_hidden1&/span&&span class=&p&&,&/span& &span class=&n&&activation_fn&/span&&span class=&o&&=&/span&&span class=&n&&leaky_relu&/span&&span class=&p&&)&/span&
&/code&&/pre&&/div&&p&That is all for this article; corrections are welcome if you spot any mistakes. I also hope to finish write-ups of Dropout, Softmax, Batch Normalization, and other basic deep learning concepts before the Lunar New Year.&/p&&h2&4. References&/h2&&blockquote&1. Aurélien Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow&br&2. Michael Nielsen, Neural Networks and Deep Learning&br&3. Ian Goodfellow et al., Deep Learning&/blockquote&&p&Reproduction prohibited.&/p&
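The gradient descent update from section 1, v ← v − η∇C, can be tried out on a small problem. The sketch below is an illustration, not the author's code. It assumes a linear model f(x) = v1·x + v2 and synthetic data generated from y = 2x + 1, and the names `cost`, `grad`, `eta` and `history` are chosen for this example only. The cost is the article's C(v1, v2) = 1/(2n) Σ (f(x_i) − y_i)².

```python
import numpy as np

# Synthetic training data (an assumption for this sketch): y = 2x + 1 exactly.
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x + 1.0


def cost(v):
    """C(v1, v2) = 1/(2n) * sum((f(x_i) - y_i)^2) with f(x) = v1 * x + v2."""
    residual = v[0] * x + v[1] - y
    return 0.5 * np.mean(residual ** 2)


def grad(v):
    """Gradient of C with respect to (v1, v2)."""
    residual = v[0] * x + v[1] - y
    return np.array([np.mean(residual * x), np.mean(residual)])


eta = 0.1           # learning rate: a small positive number
v = np.zeros(2)     # initial guess for (v1, v2)
history = [cost(v)]

for _ in range(2000):
    v = v - eta * grad(v)   # move against the gradient
    history.append(cost(v))

print(v)                        # close to [2. 1.]
print(history[0], history[-1])  # the cost drops from its initial value to about 0
```

With a much larger `eta` the steps overshoot and the cost can grow instead of shrink, which is why the article describes the learning rate as a small positive number.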
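&figure&&img src=&https://pic3.zhimg.com/v2-da6e908fbff06e8e14c60d86d776d225_b.jpg& data-rawwidth=&768& data-rawheight=&576& class=&origin_image zh-lightbox-thumb& width=&768& data-original=&https://pic3.zhimg.com/v2-da6e908fbff06e8e14c60d86d776d225_r.jpg&&&/figure&&p&&/p&&figure&&img src=&https://pic1.zhimg.com/v2-abd42bbb61ee_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&558& data-rawheight=&315& class=&origin_image zh-lightbox-thumb& width=&558& data-original=&https://pic1.zhimg.com/v2-abd42bbb61ee_r.jpg&&&/figure&&p&As a veteran “driver” with plenty of time on set, I have long wanted to write a few articles on driving technique. This one introduces two basic driving skills built on generative adversarial networks (GANs):&br&&/p&&p&1) Removing the mosaics in (romantic) action films&/p&&p&2) Dressing (or undressing) the girls in (romantic) action films&/p&&p&&br&&/p&&h2&Generative models&/h2&&p&The previous post, “&a href=&https://zhuanlan.zhihu.com/p/& class=&internal&&A small example of generating 2D samples with a GAN&/a&”, already gave a brief introduction to GANs. This post briefly revisits generative models to fill in the background.&/p&&p&A generative model is a model that can produce data following a specified distribution. Common generative models start from a simple distribution that is easy to sample, for example a uniform distribution. Given the probability density function of the target distribution, one builds a transform that maps uniform samples to samples of the target distribution; this is arguably the simplest generative model. For example:&/p&&figure&&img src=&https://pic1.zhimg.com/v2-d11b5fb26d3cc8e942f841bafe010cd8_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&1021& data-rawheight=&443& class=&origin_image zh-lightbox-thumb& width=&1021& data-original=&https://pic1.zhimg.com/v2-d11b5fb26d3cc8e942f841bafe010cd8_r.jpg&&&/figure&&p&The left panel shows a custom probability density function; the right panel shows the histogram of 10,000 samples drawn from it. The code that defines the distribution and generates the samples is:&br&&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&kn&&from&/span& &span class=&nn&&functools&/span& &span class=&kn&&import&/span& &span class=&n&&partial&/span&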
&span class=&kn&&import&/span& &span class=&nn&&numpy&/span&
&span class=&kn&&from&/span& &span class=&nn&&matplotlib&/span& &span class=&kn&&import&/span& &span class=&n&&pyplot&/span&
&span class=&c1&&# Define a PDF&/span&
&span class=&n&&x_samples&/span& &span class=&o&&=&/span& &span class=&n&&numpy&/span&&span class=&o&&.&/span&&span class=&n&&arange&/span&&span class=&p&&(&/span&&span class=&o&&-&/span&&span class=&mi&&3&/span&&span class=&p&&,&/span& &span class=&mf&&3.01&/span&&span class=&p&&,&/span& &span class=&mf&&0.01&/span&&span class=&p&&)&/span&
&span class=&n&&PDF&/span& &span class=&o&&=&/span& &span class=&n&&numpy&/span&&span class=&o&&.&/span&&span class=&n&&empty&/span&&span class=&p&&(&/span&&span class=&n&&x_samples&/span&&span class=&o&&.&/span&&span class=&n&&shape&/span&&span class=&p&&)&/span&
&span class=&n&&PDF&/span&&span class=&p&&[&/span&&span class=&n&&x_samples&/span& &span class=&o&&&lt;&/span& &span class=&mi&&0&/span&&span class=&p&&]&/span& &span class=&o&&=&/span& &span class=&n&&numpy&/span&&span class=&o&&.&/span&&span class=&n&&round&/span&&span class=&p&&(&/span&&span class=&n&&x_samples&/span&&span class=&p&&[&/span&&span class=&n&&x_samples&/span& &span class=&o&&&lt;&/span& &span class=&mi&&0&/span&&span class=&p&&]&/span& &span class=&o&&+&/span& &span class=&mf&&3.5&/span&&span class=&p&&)&/span& &span class=&o&&/&/span& &span class=&mi&&3&/span&
&span class=&n&&PDF&/span&&span class=&p&&[&/span&&span class=&n&&x_samples&/span& &span class=&o&&&gt;=&/span& &span class=&mi&&0&/span&&span class=&p&&]&/span& &span class=&o&&=&/span& &span class=&mf&&0.5&/span& &span class=&o&&*&/span& &span class=&n&&numpy&/span&&span class=&o&&.&/span&&span class=&n&&cos&/span&&span class=&p&&(&/span&&span class=&n&&numpy&/span&&span class=&o&&.&/span&&span class=&n&&pi&/span& &span class=&o&&*&/span& &span class=&n&&x_samples&/span&&span class=&p&&[&/span&&span class=&n&&x_samples&/span& &span class=&o&&&gt;=&/span& &span class=&mi&&0&/span&&span class=&p&&])&/span& &span class=&o&&+&/span& &span class=&mf&&0.5&/span&
&span class=&n&&PDF&/span& &span class=&o&&/=&/span& &span class=&n&&numpy&/span&&span class=&o&&.&/span&&span class=&n&&sum&/span&&span class=&p&&(&/span&&span class=&n&&PDF&/span&&span class=&p&&)&/span&
&span class=&c1&&# Calculate approximated CDF&/span&
&span class=&n&&CDF&/span& &span class=&o&&=&/span& &span class=&n&&numpy&/span&&span class=&o&&.&/span&&span class=&n&&empty&/span&&span class=&p&&(&/span&&span class=&n&&PDF&/span&&span class=&o&&.&/span&&span class=&n&&shape&/span&&span class=&p&&)&/span&
&span class=&n&&cumulated&/span& &span class=&o&&=&/span& &span class=&mi&&0&/span&
&span class=&k&&for&/span& &span class=&n&&i&/span& &span class=&ow&&in&/span& &span class=&nb&&range&/span&&span class=&p&&(&/span&&span class=&n&&CDF&/span&&span class=&o&&.&/span&&span class=&n&&shape&/span&&span class=&p&&[&/span&&span class=&mi&&0&/span&&span class=&p&&]):&/span&
    &span class=&n&&cumulated&/span& &span class=&o&&+=&/span& &span class=&n&&PDF&/span&&span class=&p&&[&/span&&span class=&n&&i&/span&&span class=&p&&]&/span&
    &span class=&n&&CDF&/span&&span class=&p&&[&/span&&span class=&n&&i&/span&&span class=&p&&]&/span& &span class=&o&&=&/span& &span class=&n&&cumulated&/span&
&span class=&c1&&# Generate samples&/span&
&span class=&n&&generate&/span& &span class=&o&&=&/span& &span class=&n&&partial&/span&&span class=&p&&(&/span&&span class=&n&&numpy&/span&&span class=&o&&.&/span&&span class=&n&&interp&/span&&span class=&p&&,&/span& &span class=&n&&xp&/span&&span class=&o&&=&/span&&span class=&n&&CDF&/span&&span class=&p&&,&/span& &span class=&n&&fp&/span&&span class=&o&&=&/span&&span class=&n&&x_samples&/span&&span class=&p&&)&/span&
&span class=&n&&u_rv&/span& &span class=&o&&=&/span& &span class=&n&&numpy&/span&&span class=&o&&.&/span&&span class=&n&&random&/span&&span class=&o&&.&/span&&span class=&n&&random&/span&&span class=&p&&(&/span&&span class=&mi&&10000&/span&&span class=&p&&)&/span&
&span class=&n&&x&/span& &span class=&o&&=&/span& &span class=&n&&generate&/span&&span class=&p&&(&/span&&span class=&n&&u_rv&/span&&span class=&p&&)&/span&
&span class=&c1&&# Visualization&/span&
&span class=&n&&fig&/span&&span class=&p&&,&/span& &span class=&p&&(&/span&&span class=&n&&ax0&/span&&span class=&p&&,&/span& &span class=&n&&ax1&/span&&span class=&p&&)&/span& &span class=&o&&=&/span& &span class=&n&&pyplot&/span&&span class=&o&&.&/span&&span class=&n&&subplots&/span&&span class=&p&&(&/span&&span class=&n&&ncols&/span&&span class=&o&&=&/span&&span class=&mi&&2&/span&&span class=&p&&,&/span& &span class=&n&&figsize&/span&&span class=&o&&=&/span&&span class=&p&&(&/span&&span class=&mi&&9&/span&&span class=&p&&,&/span& &span class=&mi&&4&/span&&span class=&p&&))&/span&
&span class=&n&&ax0&/span&&span class=&o&&.&/span&&span class=&n&&plot&/span&&span class=&p&&(&/span&&span class=&n&&x_samples&/span&&span class=&p&&,&/span& &span class=&n&&PDF&/span&&span class=&p&&)&/span&
&span class=&n&&ax0&/span&&span class=&o&&.&/span&&span class=&n&&axis&/span&&span class=&p&&([&/span&&span class=&o&&-&/span&&span class=&mf&&3.5&/span&&span class=&p&&,&/span& &span class=&mf&&3.5&/span&&span class=&p&&,&/span& &span class=&mi&&0&/span&&span class=&p&&,&/span& &span class=&n&&numpy&/span&&span class=&o&&.&/span&&span class=&n&&max&/span&&span class=&p&&(&/span&&span class=&n&&PDF&/span&&span class=&p&&)&/span&&span class=&o&&*&/span&&span class=&mf&&1.1&/span&&span class=&p&&])&/span&
&span class=&n&&ax1&/span&&span class=&o&&.&/span&&span class=&n&&hist&/span&&span class=&p&&(&/span&&span class=&n&&x&/span&&span class=&p&&,&/span& &span class=&mi&&100&/span&&span class=&p&&)&/span&
&span class=&n&&pyplot&/span&&span class=&o&&.&/span&&span class=&n&&show&/span&&span class=&p&&()&/span&
&/code&&/pre&&/div&&p&In simple cases we assume that a known model already describes the distribution well and that only suitable parameters are missing. It is then natural to learn, from the observed samples, the parameters that maximize the likelihood of those samples. This is maximum likelihood estimation (&b&M&/b&aximum &b&L&/b&ikelihood &b&E&/b&stimation, MLE):&br&&/p&&p&&img src=&https://www.zhihu.com/equation?tex=%5Chat%7B%5Ctheta%7D%3D%5Coperatorname%2A%7Bargmax%7D_%7B%5Ctheta%7D+P%28%5Cbm%7Bx%7D%7C%5Ctheta%29+%3D+%5Coperatorname%2A%7Bargmax%7D_%7B%5Ctheta%7D+%5Cprod_%7Bi%3D1%7D%5E%7Bn%7DP%28x_i%7C%5Ctheta%29+& alt=&\hat{\theta}=\operatorname*{argmax}_{\theta} P(\bm{x}|\theta) = \operatorname*{argmax}_{\theta} \prod_{i=1}^{n}P(x_i|\theta) & eeimg=&1&&&/p&&p&MLE is the most basic approach. Also widely used in practice is the KL divergence (Kullback–Leibler divergence). If the true distribution is P and the distribution being fitted is Q, the KL divergence is:&/p&&p&&img src=&https://www.zhihu.com/equation?tex=D_%7BKL%7D%28P%7C%7CQ%29%3D%5Csum_%7Bx+%5Cin+%5COmega%7DP%28%7Bx%7D%29%5Clog%5Cfrac%7BP%28x%29%7D%7BQ%28x%29%7D+& alt=&D_{KL}(P||Q)=\sum_{x \in \Omega}P({x})\log\frac{P(x)}{Q(x)} & eeimg=&1&&&/p&&p&As the formula shows, the KL divergence measures how much two distributions differ. Put differently, making the generated samples close to the original distribution means shrinking this difference, so minimizing the KL divergence is equivalent to MLE. To see this from the formula itself, expand it:&/p&&p&&br&&/p&&p&&img src=&https://www.zhihu.com/equation?tex=%5Cbegin%7Balign%7D+D_%7BKL%7D%28P%7C%7CQ%29+%26%3D%5Csum_%7Bx+%5Cin+%5COmega%7DP%28%7Bx%7D%29%5Clog%5Cfrac%7BP%28x%29%7D%7BQ%28x%29%7D+%5C%5C+%26+%3D-%5Csum_%7Bx%5Cin%5COmega%7DP%28%7Bx%7D%29%5Clog%7BQ%28x%29%7D+%2B%5Csum_%7Bx%5Cin%5COmega%7DP%28%7Bx%7D%29%5Clog%7BP%28x%29%7D+%5C%5C+%26+%3D-%5Csum_%7Bx%5Cin%5COmega%7DP%28%7Bx%7D%29%5Clog%7BQ%28x%29%7D+%2BH%28P%29+%5Cend%7Balign%7D& alt=&\begin{align} D_{KL}(P||Q) &=\sum_{x \in \Omega}P({x})\log\frac{P(x)}{Q(x)} \\ & =-\sum_{x\in\Omega}P({x})\log{Q(x)} +\sum_{x\in\Omega}P({x})\log{P(x)} \\ & =-\sum_{x\in\Omega}P({x})\log{Q(x)} +H(P) \end{align}& eeimg=&1&&&/p&&p&The second term is the entropy term (strictly speaking, the negative entropy of P). It does not depend on Q, so set it aside for now and write it as H(P). Next comes a small trick: draw n samples &img src=&https://www.zhihu.com/equation?tex=%7Bx_1%2Cx_2%2C...%2Cx_n%7D& alt=&{x_1,x_2,...,x_n}& eeimg=&1&& from P and use them to estimate P(x) empirically (the empirical density function):&br&&/p&&p&&img 
src=&https://www.zhihu.com/equation?tex=%5Chat%7BP%7D%28x%29%3D%5Cfrac+1+n+%5Csum_%7Bi%3D1%7D%5En+%5Cdelta%28x_i-x%29& alt=&\hat{P}(x)=\frac 1 n \sum_{i=1}^n \delta(x_i-x)& eeimg=&1&&&/p&&p&where &img src=&https://www.zhihu.com/equation?tex=%5Cdelta%28%5Ccdot%29& alt=&\delta(\cdot)& eeimg=&1&& is the Dirac &img src=&https://www.zhihu.com/equation?tex=%5Cdelta& alt=&\delta& eeimg=&1&& function. Substituting this estimate for P(x) in the formula above gives:&/p&&p&&br&&/p&&p&&img src=&https://www.zhihu.com/equation?tex=%5Cbegin%7Balign%7D+D_%7BKL%7D%28P%7C%7CQ%29+%26%3D-%5Csum_%7Bx%5Cin%5COmega%7D%5Cfrac+1+n+%5Csum_%7Bi%3D1%7D%5En+%5Cdelta%28x_i-x%29%5Clog%7BQ%28x%29%7D+%2BH%28P%29+%5C%5C+%26+%3D-%5Cfrac+1+n+%5Csum_%7Bi%3D1%7D%5En+%5Csum_%7Bx%5Cin%5COmega%7D+%5Cdelta%28x_i-x%29%5Clog%7BQ%28x%29%7D+%2BH%28P%29+%5Cend%7Balign%7D& alt=&\begin{align} D_{KL}(P||Q) &=-\sum_{x\in\Omega}\frac 1 n \sum_{i=1}^n \delta(x_i-x)\log{Q(x)} +H(P) \\ & =-\frac 1 n \sum_{i=1}^n \sum_{x\in\Omega} \delta(x_i-x)\log{Q(x)} +H(P) \end{align}& eeimg=&1&&&/p&&p&Because the samples take discrete values, within &img src=&https://www.zhihu.com/equation?tex=%5Csum_%7Bx%5Cin%5COmega%7D+%5Cdelta%28x_i-x%29& alt=&\sum_{x\in\Omega} \delta(x_i-x)& eeimg=&1&& only &img src=&https://www.zhihu.com/equation?tex=x%3Dx_i& alt=&x=x_i& eeimg=&1&& makes the Dirac &img src=&https://www.zhihu.com/equation?tex=%5Cdelta& alt=&\delta& eeimg=&1&& function equal to 1 (in the discrete case it acts like a Kronecker delta), so at &img src=&https://www.zhihu.com/equation?tex=x%3Dx_i& alt=&x=x_i& eeimg=&1&& the factor reduces to 1 and the inner sum keeps only that term:&/p&&p&&br&&/p&&p&&img src=&https://www.zhihu.com/equation?tex=D_%7BKL%7D%28P%7C%7CQ%29+%3D-%5Cfrac+1+n%5Csum_%7Bi%3D1%7D%5En+%5Clog%7BQ%28x_i%29%7D+%2BH%28P%29& alt=&D_{KL}(P||Q) =-\frac 1 n\sum_{i=1}^n \log{Q(x_i)} +H(P)& eeimg=&1&&&/p&&p&The first term is exactly the negative log-likelihood, averaged over the n samples.&/p&&p&The formulas may seem like a detour, but the point is still the simple one: reducing the difference between two distributions makes one distribution approach the other. On reflection, this is exactly what the adversarial loss in a GAN does.&/p&&p&In many cases we face far more complex distributions, such as the example in the &a href=&https://zhuanlan.zhihu.com/p/& class=&internal&&previous post&/a&, or even more complex real-world cases such as generating images of different faces. Here a neural network, with its universal approximation property, looks like a good choice [1]:&br&&/p&&figure&&img 
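src=&https://pic4.zhimg.com/v2-6fee20494f50baae2c1dc5fc_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&1561& data-rawheight=&549& class=&origin_image zh-lightbox-thumb& width=&1561& data-original=&https://pic4.zhimg.com/v2-6fee20494f50baae2c1dc5fc_r.jpg&&&/figure&&p&So although a GAN contains both a generator network and a discriminator network, its purpose is still, in essence, generative modeling. From the generative-model point of view, Ian Goodfellow has summarized a “family tree” of generative methods related to neural networks [1]:&/p&&figure&&img src=&https://pic1.zhimg.com/v2-8c6f1d8ee39dfbb4fcfb2_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&771& data-rawheight=&498& class=&origin_image zh-lightbox-thumb& width=&771& data-original=&https://pic1.zhimg.com/v2-8c6f1d8ee39dfbb4fcfb2_r.jpg&&&/figure&&p&Of these, the most popular at the moment are the GAN and the &b&V&/b&ariational &b&A&/b&uto&b&E&/b&ncoder (VAE). A concise sketch of the two methods is shown below [3]:&/p&&figure&&img src=&https://pic1.zhimg.com/v2-380cde71a2f6ece28b4a97_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&568& data-rawheight=&274& class=&origin_image zh-lightbox-thumb& width=&568& data-original=&https://pic1.zhimg.com/v2-380cde71a2f6ece28b4a97_r.jpg&&&/figure&&p&This post does not explain VAEs in detail, but from the figure and from the “autoencoder” in the name one can tell that the generation loss in a VAE is based on reconstruction error. Image generation that relies only on reconstruction error tends to produce somewhat blurry images, because the error is usually measured over the whole image. For example, when a method based on MSE (Mean Squared Error) is used to generate super-resolution images, the following easily happens [4]:&/p&&p&&br&&/p&&p&&br&&/p&&figure&&img src=&https://pic1.zhimg.com/v2-78f53b142fab51b0c09a1_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&892& data-rawheight=&598& class=&origin_image zh-lightbox-thumb& width=&892& data-original=&https://pic1.zhimg.com/v2-78f53b142fab51b0c09a1_r.jpg&&&/figure&&p&In this two-dimensional sketch the real data lies on a U-shaped manifold, while MSE-style methods, because of the form of their loss, tend to land near the mean of the candidates (the blue box).&/p&&p&GANs do much better in this respect, because their target is a distribution on the manifold. So once training has roughly converged, the output is usually sharp even when you cannot tell what the generated image is supposed to be, and that is the ideal property for removing the mosaics on an actress!&/p&&p&&br&&/p&&h2&From mosaic to a clear image: the super-resolution (Super Resolution) problem&/h2&&p&After all that groundwork, we finally get to the point. First, to be clear, removing a mosaic is really an image super-resolution problem: how to obtain a higher-resolution image from a low-resolution one:&/p&&figure&&img src=&https://pic2.zhimg.com/v2-31c84b42ad_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&784& data-rawheight=&324& class=&origin_image zh-lightbox-thumb& width=&784& 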
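data-original=&https://pic2.zhimg.com/v2-31c84b42ad_r.jpg&&&/figure&&p&One common approach to super-resolution in video is to infer the high-resolution picture from the low-resolution pictures of several different frames. Readers interested in this idea can refer to an earlier answer of mine: &a href=&https://www.zhihu.com/question//answer/& class=&internal&&How can super-resolution reconstruction be done from multiple frames?&/a& &/p&&p&However, multi-frame methods are not well suited to the mosaics on an actress, so this post covers super-resolution from a single image.&/p&&h2&SRGAN&/h2&&p&Any discussion of GAN-based super-resolution has to mention SRGAN [4]: “Photo-Realistic Single Image &b&S&/b&uper-&b&R&/b&esolution Using a &b&G&/b&enerative &b&A&/b&dversarial &b&N&/b&etwork”. The idea of this work is that a pixel-wise MSE loss tends to give a result that is roughly correct but blurry in its high-frequency components. So it is enough to reconstruct the low-frequency image content and let the GAN fill in the high-frequency detail:&/p&&figure&&img src=&https://pic3.zhimg.com/v2-128029dfc7c470b07a4a1_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&446& data-rawheight=&131& class=&origin_image zh-lightbox-thumb& width=&446& data-original=&https://pic3.zhimg.com/v2-128029dfc7c470b07a4a1_r.jpg&&&/figure&&p&This idea is very similar to the earliest deep-network style transfer (interested readers can refer to the last part of my earlier post &a href=&https://zhuanlan.zhihu.com/p/& class=&internal&&Rambling about CNNs: solving for the input image by optimization&/a&). The content loss, which reconstructs the content, is the difference between the original image and the image generated from the low-resolution input, measured on the ReLU-layer activations of a VGG network:&/p&&p&&br&&/p&&figure&&img src=&https://pic3.zhimg.com/v2-331e02e394cfd04e7114a_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&529& data-rawheight=&150& class=&origin_image zh-lightbox-thumb& width=&529& data-original=&https://pic3.zhimg.com/v2-331e02e394cfd04e7114a_r.jpg&&&/figure&&p&The adversarial loss, which generates the detail, is the loss the GAN uses to tell an original image from a generated one:&/p&&figure&&img src=&https://pic1.zhimg.com/v2-fa5af2a10fe9a4dadfb04_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&394& data-rawheight=&89& class=&content_image& width=&394&&&/figure&&p&The two losses combined are called the perceptual loss. The network structure used for training is as follows:&/p&&figure&&img src=&https://pic1.zhimg.com/v2-17861edeb4bcfae4e9f369_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&780& data-rawheight=&400& class=&origin_image zh-lightbox-thumb& width=&780& 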
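data-original=&https://pic1.zhimg.com/v2-17861edeb4bcfae4e9f369_r.jpg&&&/figure&&p&This is exactly the C-GAN described in the previous post, with the low-resolution image as the condition C. Although SRGAN does not score best on traditional metrics that compare directly against the original image, such as PSNR, it beats other methods by a wide margin in visual quality, especially in detail. For example, below is the authors' comparison with bicubic interpolation and with ResNet-based reconstruction (SRResNet):&/p&&figure&&img src=&https://pic4.zhimg.com/v2-f3b4376938ffcbd23c42d_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&981& data-rawheight=&392& class=&origin_image zh-lightbox-thumb& width=&981& data-original=&https://pic4.zhimg.com/v2-f3b4376938ffcbd23c42d_r.jpg&&&/figure&&p&Although many details differ from the original picture, the result looks coherent, and the detail is far richer than with SRResNet. These lifelike details can be seen as “imagined” by the GAN from the distribution it has learned.&/p&&p&For super-resolution applications that mainly care about “looking good”, SRGAN is clearly a good fit. For applications that care more about reconstruction metrics, such as recovering the facial details of a suspect, SRGAN is not suitable.&/p&&h2&pix2pix&/h2&&p&Although a whole section was devoted to SRGAN, the method actually used in this post is pix2pix [5]. This work attracted considerable attention as soon as it appeared on arXiv; it cleverly uses the GAN framework to solve the general image-to-image translation problem. For example, without changing the resolution: turning a photo into an oil-painting style; turning a daytime photo into a night-time one; segmenting an image into color blocks, or the reverse; colorizing black-and-white photos; and so on. Each task has its own specialized methods and research, but taken together they are all mappings from pixels to pixels and can be treated as one problem. The elegance of the paper lies in proposing pix2pix as a single framework that handles all of them. A sketch of the method:&/p&&p&&br&&/p&&figure&&img src=&https://pic1.zhimg.com/v2-e2ea753b7b0d7f18abee3_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&485& data-rawheight=&437& class=&origin_image zh-lightbox-thumb& width=&485& data-original=&https://pic1.zhimg.com/v2-e2ea753b7b0d7f18abee3_r.jpg&&&/figure&&p&It is simply a conditional GAN in which the condition C is the input image. Beyond using a C-GAN directly, this work makes two improvements:&/p&&p&1) &b&Using a U-Net structure to generate images with better detail&/b&[6]&/p&&figure&&img src=&https://pic4.zhimg.com/v2-beb074bebbfa0db_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&907& data-rawheight=&612& class=&origin_image zh-lightbox-thumb& width=&907& data-original=&https://pic4.zhimg.com/v2-beb074bebbfa0db_r.jpg&&&/figure&&p&U-Net is a fully convolutional architecture proposed by the pattern recognition and image processing group at the University of Freiburg in Germany. Compared with the usual encoder-decoder network, which downsamples to a low dimension and then upsamples back to the original resolution, U-Net adds skip connections: each encoder feature map is concatenated along the channel dimension with the decoder feature map of the same size, which preserves pixel-level detail at different resolutions. U-Net improves detail very noticeably; below is a comparison given in the pix2pix paper:&/p&&p&&br&&/p&&figure&&img src=&https://pic4.zhimg.com/v2-2fb4ddb2fdc24eea31eea_b.jpg& data-caption=&& data-size=&normal& 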
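data-rawwidth=&563& data-rawheight=&307& class=&origin_image zh-lightbox-thumb& width=&563& data-original=&https://pic4.zhimg.com/v2-2fb4ddb2fdc24eea31eea_r.jpg&&&/figure&&p&As you can see, information at the various scales is largely preserved.&/p&&p&2) &b&Using a Markovian discriminator (PatchGAN)&br&&/b&&/p&&p&pix2pix and SRGAN share one idea: use reconstruction for the low-frequency components and the GAN for the high-frequency components. In pix2pix this idea shows up in two places. The first is the loss function, which adds an L1 loss to make the generated image as close as possible to the training target, while the high-frequency detail of the image is left to the GAN:&/p&&figure&&img src=&https://pic4.zhimg.com/v2-cb180ad03d8a72e7883285b_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&447& data-rawheight=&51& class=&origin_image zh-lightbox-thumb& width=&447& data-original=&https://pic4.zhimg.com/v2-cb180ad03d8a72e7883285b_r.jpg&&&/figure&&p&The second is &b&PatchGAN&/b&, the method the GAN uses to decide whether an image is generated. The idea of PatchGAN is that, since the GAN is only responsible for the high-frequency components, the discriminator does not need the whole image as input; it only needs to judge an NxN image patch. This is also why it is called a Markovian discriminator: everything outside a patch is assumed to be independent of that patch.&/p&&p&In the implementation, the authors use a small fully convolutional network with an NxN receptive field; in the last layer each pixel passes through a sigmoid to output the probability of being real, and a binary cross-entropy (BCE) loss gives the final loss. The benefit is that the input dimension is much lower, so there are fewer parameters, computation is faster than feeding in a whole image, and the discriminator can be applied to images of any size. The authors compared patches of different sizes: for 256x256 inputs, with a 70x70 patch the results are visually indistinguishable from those obtained when the whole image is the discriminator input:&/p&&figure&&img src=&https://pic1.zhimg.com/v2-5172ca51efb4ee3e453b15_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&842& data-rawheight=&107& class=&origin_image zh-lightbox-thumb& width=&842& data-original=&https://pic1.zhimg.com/v2-5172ca51efb4ee3e453b15_r.jpg&&&/figure&&h2&Generating training data with local mosaics&/h2&&p&With pix2pix, all that is needed to train a mosaic-removal model is a set of uncensored images and the corresponding censored ones. It is that simple. The question then is how to produce the images with mosaics.&/p&&p&With enough perseverance you can add the mosaics by hand, which is the most accurate. This section describes a method that is less accurate but better than random: automatic mosaic annotation using the activation regions of a classification model.&/p&&p&The basic idea is to take a classification model that can recognize images needing censoring, extract the CAM (&b&C&/b&lass &b&A&/b&ctivation &b&M&/b&ap) [7] of the corresponding class from the model, and cover the region with the highest response with a mosaic. A brief word on CAM: for a CNN whose last layer before the classifier is global pooling (average or max both work), the pooled feature maps are combined by a weighted sum to give each class's activation before the softmax. CAM applies those same weights, pixel by pixel, to the feature maps before pooling; the resulting single activation map carries some location information about what activates the class, which amounts to a form of weak supervision (classification → localization):&/p&&p&&br&&/p&&figure&&img src=&https://pic4.zhimg.com/v2-fd28f0b871bd_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&660& data-rawheight=&314& class=&origin_image 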
zh-lightbox-thumb& width=&660& data-original=&https://pic4.zhimg.com/v2-fd28f0b871bd_r.jpg&&&/figure&&p&The figure above illustrates a CAM: the CAM for the Australian terrier class, upsampled to the size of the original image, shows that the region containing the dog is roughly the region with the highest activation.&/p&&p&What is still missing is a model that can recognize XXX images. As it happens, a ready-made one is available online: Open NSFW (&b&N&/b&ot &b&S&/b&afe &b&F&/b&or &b&W&/b&ork), the open-source pornographic image recognition model released by Yahoo in 2016:&/p&&p&&a href=&https://link.zhihu.com/?target=https%3A//github.com/yahoo/open_nsfw& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&yahoo/open_nsfw&/a&&/p&&p&Implementing CAM is not hard. The code for automatic mosaicking with Open NSFW, together with usage notes, is here:&/p&&p&&a href=&https://link.zhihu.com/?target=https%3A//github.com/frombeijingwithlove/dlcv_for_beginners/tree/master/random_bonus/generate_mosaic_for_porno_images& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Generating mosaics for XX images&/a&&/p&&p&&br&&/p&&p&The result (when the mosaic lands correctly) looks roughly like this:&/p&&figure&&img src=&https://pic4.zhimg.com/v2-cbefa39dc983f2645dd8_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&768& data-rawheight=&256& class=&origin_image zh-lightbox-thumb& width=&768& data-original=&https://pic4.zhimg.com/v2-cbefa39dc983f2645dd8_r.jpg&&&/figure&&h2&Removing the mosaics in (romantic) action films&/h2&&p&There is not much to say here. Not a single line of code needs to change: prepare the data following the steps above, then train following the official pix2pix instructions:&/p&&p&Torch version of pix2pix: &a href=&https://link.zhihu.com/?target=https%3A//github.com/phillipi/pix2pix& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&phillipi/pix2pix&/a&&/p&&p&PyTorch version of pix2pix (combined with Cycle-GAN): &a href=&https://link.zhihu.com/?target=https%3A//github.com/junyanz/pytorch-CycleGAN-and-pix2pix& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&junyanz/pytorch-CycleGAN-and-pix2pix&/a&&/p&&p&I picked a few thousand images more or less at random from my D: drive, ran the automatic mosaicking and pix2pix training (default parameters) on them, and got the following:&/p&&p&&br&&/p&&figure&&img src=&https://pic2.zhimg.com/v2-9f52b17c0e1296767cbfbfafc290a5bd_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&814& data-rawheight=&691& class=&origin_image zh-lightbox-thumb& width=&814& data-original=&https://pic2.zhimg.com/v2-9f52b17c0e1296767cbfbfafc290a5bd_r.jpg&&&/figure&&p&What? You ask where the promised mosaic removal on actresses is? Where are the actress photos?&/p&&figure&&img 
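src=&https://pic4.zhimg.com/v2-480fb8a4dcfc7a4f92ec_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&75& data-rawheight=&60& class=&content_image& width=&75&&&/figure&&p&It is still worth saying that the results on photos of real people are better than on mushrooms and flowers.&/p&&h2&Dual Learning&/h2&&p&Mosaic removal is done; next comes dressing (or undressing) the girls. Before getting hands-on, some groundwork first: &b&dual learning&/b& and &b&Cycle-GAN&/b&.&/p&&p&Dual learning is a reinforcement learning method for machine translation proposed by MSRA in 2016 [8]. Its goal is to ease the problem of labeling huge amounts of paired data. Personally I regard it as a weakly supervised method (although most of the literature I have seen counts it as unsupervised). Taking machine translation as the example, the basic idea of dual learning is shown below [9]:&/p&&figure&&img src=&https://pic3.zhimg.com/v2-c4b1eeda364fb6c9bada02f3_b.jpg& data-caption=&& data-size=&normal& data-rawwidth=&866& data-rawheight=&399& class=&origin_image zh-lightbox-thumb& width=&866& data-original=&https://pic3.zhimg.com/v2-c4b1eeda364fb6c9bada02f3_r.jpg&&&/figure&&p&The man in gray on the left understands only English and the woman in black on the right understands only Chinese; the task is to learn how to translate English into Chinese. Dual learning approaches it this way: given a model &img src=&https://www.zhihu.com/equation?tex=f%3Ax%5Crightarrow+y& alt=&f:x\rightarrow y& eeimg=&1&&, at the outset there is no way to know whether f translates correctly. But if we also consider the dual problem of &img src=&https://www.zhihu.com/equation?tex=f& alt=&f& eeimg=&1&&, namely &img src=&https://www.zhihu.com/equation?tex=g%3Ay%5Crightarrow+x& alt=&g:y\rightarrow x& eeimg=&1&&, we can translate an English sentence into Chinese and then translate it back. For the round-trip result &img src=&https://www.zhihu.com/equation?tex=x%27%3Dg%28f%28x%29%29& alt=&x'=g(f(x))& eeimg=&1&&, the man in gray can use a metric (BLEU) to judge whether x' and x mean the same thing, and feed the consistency of the result back to both models so they can improve. Likewise, taking a sentence from the Chinese side and translating it around the loop lets both models get feedback from the woman in black and improve. This is in fact a reinforcement learning process: each translation is an action, and each action receives a reward from the environment (the man in gray or the woman in black) that is used to improve the models until they converge.&/p&&p&Some readers may find this very similar to Co-training, proposed in the last century. This has also been discussed on Zhihu:&/p&&p&&a href=&https://www.zhihu.com/question/& class=&internal&&How should we understand the Dual Learning proposed by Tie-Yan Liu's team at NIPS 2016?&/a&&/p&&p&Personally I think they are different. Co-training is a multi-view method: for example, an input x is viewed as two concatenated features &img src=&https://www.zhihu.com/equation?tex=x%3D%28x_1%2Cx_2%29& alt=&x=(x_1,x_2)& eeimg=&1&&, and it is assumed that &img src=&https://www.zhihu.com/equation?tex=x_1& alt=&x_1& eeimg=&1&& and &img src=&https://www.zhihu.com/equation?}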
