FAQs
Q: If the weight of a conv layer is zero, the gradient will also be zero, and the network will not learn anything. Why does "zero convolution" work?
A: This is wrong. Let us consider the very simple case
$$y = wx + b$$
and we have
$$\frac{\partial y}{\partial w} = x,\quad \frac{\partial y}{\partial x} = w,\quad \frac{\partial y}{\partial b} = 1$$
and if $w=0$ and $x \neq 0$, then
$$\frac{\partial y}{\partial w} \neq 0,\quad \frac{\partial y}{\partial x} = 0,\quad \frac{\partial y}{\partial b} \neq 0$$
which means as long as $x \neq 0$, one gradient descent iteration will make $w$ non-zero. Then
$$\frac{\partial y}{\partial x} \neq 0$$
so that the zero convolutions progressively become common conv layers with non-zero weights.
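
To make this concrete, here is a minimal PyTorch sketch (not part of the ControlNet training code; the layer sizes and learning rate are arbitrary choices for illustration). It zero-initializes a 1x1 conv layer and shows that, as long as the input is non-zero, the weight gradient is non-zero, so a single optimizer step already gives the layer non-zero weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# "Zero convolution": a conv layer whose weights and bias start at zero.
conv = nn.Conv2d(3, 3, kernel_size=1)
nn.init.zeros_(conv.weight)
nn.init.zeros_(conv.bias)

x = torch.randn(1, 3, 8, 8)          # non-zero input feature map
optimizer = torch.optim.SGD(conv.parameters(), lr=0.1)

y = conv(x)                          # output is all zeros at this point
loss = (y - torch.ones_like(y)).pow(2).mean()
loss.backward()

# dL/dw depends on the input x, not on w, so it is non-zero here.
print(conv.weight.grad.abs().sum())  # > 0

optimizer.step()
print(conv.weight.abs().sum())       # > 0: weights are no longer zero
```

After this first step the layer behaves like an ordinary conv layer, and its output gradient with respect to the input is no longer zero either.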