What Is Neural Network Compression?
A deep learning model typically contains millions or even tens of millions of parameters spread across dozens of network layers, and therefore demands a great deal of computation and storage. Neural network compression refers to reducing a network's parameter count or storage footprint by changing the network's structure or by applying quantization and approximation methods, so that computational cost and storage requirements drop without noticeably affecting the network's performance.
- Neural network compression, simply put, reduces the parameters and storage space of a network through related methods while having little effect on its performance. Compression methods fall roughly into three categories: approximation, quantization, and pruning. Approximation methods mainly apply matrix or tensor decomposition, reconstructing the original parameter matrix or parameter tensor from a small number of parameters in order to reduce storage overhead; because the full parameters must typically be reconstructed when the network runs, runtime cost is not effectively reduced. The second category is quantization. Its main idea is to map the possible values of the network parameters from the real number domain to a finite set, or to represent the parameters with fewer bits. Quantization constrains the infinitely many possible parameter values to a few values that are then reused, reducing storage overhead; by changing the parameters' data type, for example quantizing 64-bit floating-point values to integers or even Booleans, runtime cost can also be greatly reduced. The third category is network pruning. Compared with the first two, its main characteristic is that it directly changes the structure of the network. By granularity, pruning can be divided into layer-level, neuron-level, and connection-level pruning. Layer-level pruning removes entire network layers, yielding a shallower network.
- Neural network compression is not only necessary but also feasible. First, although deeper networks generally perform better, for a specific application scenario a network of appropriate depth and parameter count is often sufficient, and the marginal gains from blindly increasing depth and complexity are insignificant in many applications. Second, neural networks are often over-parameterized, with highly redundant neuron functions; even in performance-sensitive scenarios, most networks can be "safely" compressed without hurting performance. Compression can also help explain how neurons function and enables neural network models to be deployed on lightweight devices.
- Deep neural networks have achieved strong results in computer vision, speech recognition, robotics, and other fields, but their practical application is often limited by storage and computation scale. For example, the VGG-16 network contains approximately 140 million floating-point parameters; if each parameter is stored as a 32-bit float, the entire network requires more than 500 megabytes of storage (see the estimate below). Computation at this scale demands high-performance parallel hardware and still struggles to run in real time, and such hardware is bulky, power-hungry, and expensive, which rules it out in many settings. How to run neural networks on resource-constrained devices such as mobile phones, tablets, and various embedded and portable devices is therefore a key step in bringing deep learning into daily life, and one of the hot topics in academic and industrial research.
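A quick back-of-the-envelope check of the storage figure above, assuming roughly 140 million parameters at 4 bytes each:

```python
# Rough storage estimate for VGG-16 (~140 million 32-bit parameters).
num_params = 140_000_000
bytes_per_param = 4  # one 32-bit float

total_bytes = num_params * bytes_per_param
print(f"{total_bytes / 2**20:.0f} MiB")  # ~534 MiB, i.e. over 500 MB
```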
- Network compression based on tensor decomposition
- Tensors are a natural extension of vectors and matrices. A vector can be called a first-order tensor and a matrix a second-order tensor; matrices stacked into a "cube" form a third-order tensor. A grayscale image is represented in a computer as a matrix, i.e. a second-order tensor, while an RGB three-channel color image is stored as a third-order tensor. Third-order tensors can in turn be stacked into higher-order tensors. Tensor decomposition is an important part of tensor analysis. Its basic principle is to exploit the structural information in tensor data to decompose a tensor into a combination of several tensors of simpler form and smaller storage scale; typical methods include CP decomposition and Tucker decomposition. In a neural network, the parameters are usually stored as tensors. A fully connected layer transforms an input vector into an output vector through a weight matrix, so its parameters form a second-order tensor. For a convolutional layer, let the input be a third-order tensor with c channels; each convolution kernel is then also a third-order tensor with c channels, so the n kernels of the layer together form a fourth-order tensor of shape w × h × c × n (kernel width × kernel height × input channels × number of kernels). The basic idea of network compression based on tensor decomposition is to re-express the network's parameters as a combination of small tensors. The re-expressed tensors can generally approximate the original tensor to a chosen accuracy while occupying much less space, yielding the compression effect, as sketched below.
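A minimal sketch of the idea for a fully connected layer, using truncated SVD (the second-order special case of tensor decomposition); the layer sizes and rank are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a weight matrix that is approximately low-rank, as trained
# layers often are, so that truncation loses little information.
W = rng.standard_normal((1024, 64)) @ rng.standard_normal((64, 4096))
W += 0.01 * rng.standard_normal(W.shape)

# Truncated SVD: keep only the top `rank` components.
rank = 64
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * s[:rank]   # 1024 x 64
B = Vt[:rank, :]             # 64 x 4096

# The layer y = W @ x is replaced by y = A @ (B @ x).
x = rng.standard_normal(4096)
err = np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x)

print(f"params: {W.size} -> {A.size + B.size} "
      f"({W.size / (A.size + B.size):.1f}x smaller)")
print(f"relative output error: {err:.4f}")
```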
- Quantization-based network compression
- The second type of network compression method is based on quantization, which here carries two meanings. The first is to replace high-precision parameters with low-precision ones by truncating their numeric precision; this is in essence uniform quantization. The second is weight sharing, which restricts the set of values the network weights may take; the limited set of shared weights can then be encoded further, which is in essence non-uniform quantization. The limiting case of reduced precision is binarization: when the weights of a convolutional network are binarized to +1 and -1, operation speed improves greatly, storage consumption drops sharply, and the binarized network can potentially be implemented with hardware logic operations. Weight-sharing quantization maps the values of the network weights from the entire set of real numbers to a finite set [1]. These variants are sketched below.
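A minimal sketch of the quantization variants just described; the bit width, toy weights, and shared centroids are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(8).astype(np.float32)  # toy weight vector

# 1) Uniform quantization: round weights onto 2^bits evenly spaced levels.
bits = 4
scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
q = np.clip(np.round(w / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
w_uniform = q * scale               # dequantized approximation

# 2) Binarization, the limiting case: keep only the signs +1 / -1.
w_binary = np.where(w >= 0, 1.0, -1.0)

# 3) Weight sharing (non-uniform quantization): each weight stores only
#    an index into a small set of shared values (centroids assumed here;
#    in practice they are often found by clustering).
centroids = np.array([-1.0, -0.3, 0.3, 1.0], dtype=np.float32)
idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
w_shared = centroids[idx]

print(w, w_uniform, w_binary, w_shared, sep="\n")
```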
- Pruning-based network compression
- Compression methods based on tensor decomposition or quantization operate on the network's parameters, and the network topology remains unchanged during compression. In pruning-based compression, both the network topology and the inference procedure may change: pruning directly alters the structure of the network, and its essence is to eliminate the network's redundant parts.
- Depending on what is removed, pruning can be performed at several granularities: layer-level, neuron-level, and connection-level. Layer-level pruning removes entire network layers and is mainly suitable for models with many layers; the result is a "shallower" network. Removing several modules from a deep residual network, for instance, is layer-level pruning. Neuron-level pruning removes individual neurons or filters, making the network "thinner". Connection-level pruning removes individual connection weights, making the network "sparser". Once a neuron is pruned, all connection weights attached to it are pruned as well, so neuron-level pruning is in fact a special case of connection-level pruning. Both finer granularities are sketched below.
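A minimal sketch of magnitude-based pruning at the two finer granularities; the layer size and thresholds are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 8))   # weights of one toy layer

# Connection-level: zero out individual weights of small magnitude.
mask = np.abs(W) >= 0.5
W_sparse = W * mask               # the network becomes "sparser"

# Neuron-level: drop whole output neurons (rows) with small L2 norm;
# this removes every connection attached to those neurons at once.
norms = np.linalg.norm(W, axis=1)
keep = norms >= np.median(norms)
W_thin = W[keep]                  # the network becomes "thinner"

print(f"connections kept: {mask.mean():.0%}")
print(f"neurons kept: {keep.sum()} of {W.shape[0]}")
```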
- Layer-level pruning has a relatively coarse granularity and strongly affects the features expressed within a layer, so it has received comparatively little study. Connection-level pruning has the finest granularity and is among the most studied compression methods, but it has a side effect: preserving sparse neural connections requires sparse tensor storage and sparse arithmetic. Storing a sparse tensor requires extra overhead to record the positions of the data points, so the storage actually saved is less than the fraction of parameters pruned, and sparse tensor computation requires special methods that are not amenable to parallelization. In this sense, connection-level pruning undermines the "regularity" of the network, as the comparison below illustrates.
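A minimal sketch of the storage overhead, using SciPy's CSR sparse format; the matrix size and pruning threshold are illustrative assumptions:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 1000)).astype(np.float32)
W[np.abs(W) < 1.2] = 0.0   # prune roughly three quarters of the weights

# CSR must store column indices and row pointers alongside the
# surviving values, so the space saved trails the pruning ratio.
S = csr_matrix(W)
sparse_bytes = S.data.nbytes + S.indices.nbytes + S.indptr.nbytes

print(f"weights surviving: {S.nnz / W.size:.0%}")
print(f"dense storage:  {W.nbytes} bytes")
print(f"sparse storage: {sparse_bytes} bytes")  # well above the survival ratio
```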