What is Data Compression?

Data compression is a technique that reduces the amount of data in order to save storage space and improve the efficiency of transmission, storage, and processing without losing useful information, or that reorganizes data according to a particular algorithm so as to reduce redundancy and storage space. Data compression includes lossy and lossless compression.

In computer science and information theory, data compression or source coding is the process of representing information with fewer bits (or other information-bearing units) than the unencoded representation would use, according to a specific encoding scheme. For example, if we agree to encode the word "compression" as "comp", then this article can be represented with fewer bits. A popular example of compression is the ZIP file format used on many computers. It not only provides compression but also acts as an archive tool, storing many files within a single file.

Data compression summary

For any form of communication, compressed data can only be understood if the sender and the receiver share the same encoding scheme. For example, this article is meaningful only if the recipient knows it is to be interpreted as English characters. Likewise, the recipient can interpret compressed data only if the encoding method is known. Some compression algorithms exploit this property and encrypt the data during compression, for example with a password, so that only authorized parties can recover the data correctly.
Data compression is possible because most real-world data contains statistical redundancy. For example, the letter "e" is far more common in English than the letter "z", and it is very unlikely that the letter "q" will be followed by "z". Lossless compression algorithms exploit this statistical redundancy to give a more concise but still complete representation of the sender's data.
If a certain loss of fidelity is acceptable, further compression is possible. For example, when looking at a picture or a television image, a viewer may not notice that some details are missing. Similarly, two sequences of audio samples may sound identical even though they are not. Lossy compression algorithms use fewer bits to represent an image, video, or audio signal at the cost of such minor differences.
Compression is important because it reduces the consumption of expensive resources such as disk space and transmission bandwidth. However, compression itself consumes processing resources, which can also be costly. The design of a data compression scheme therefore involves a trade-off among compression ratio, distortion, the computational resources required, and other relevant factors.
Some schemes are reversible, so the original data can be restored exactly; these are called lossless data compression. Others accept a certain degree of data loss in exchange for a higher compression ratio; these are called lossy data compression.
However, there are files that no lossless compression algorithm can shrink. In fact, data that contains no discernible patterns cannot be compressed by any algorithm. Attempting to compress data that has already been compressed usually makes it larger, and attempting to compress encrypted data usually has the same result.
Lossy data compression also eventually reaches a point where it can go no further. Take an extreme example: an algorithm that "compresses" by repeatedly removing the last byte of a file. Once the file has been reduced to nothing, the algorithm cannot continue to compress.

Data compression classification

There are many data compression methods. Data with different characteristics call for different compression methods (that is, encoding methods). They can be classified from several perspectives. [1]
(1) Real-time compression and non-real-time compression
For example, when making an IP telephone call, the voice signal is converted into a digital signal, compressed at the same time, and then transmitted over the Internet; this compression happens in real time. Real-time compression is generally used when transmitting video and audio data and is usually implemented in dedicated hardware, such as compression cards.
Non-real-time compression is what computer users most often encounter. The compression is performed only when needed and has no immediacy requirement, for example when compressing a picture, an article, or a piece of music. Non-real-time compression generally requires no special equipment; installing and running the appropriate compression software on the computer is sufficient.
(2) Data compression and file compression
In fact, data compression includes file compression. "Data" originally refers to any digital information, including the various files used in computers, but sometimes it refers specifically to time-oriented data that is processed or transmitted immediately after being collected. File compression refers to compressing data that is to be stored on a physical medium such as a disk, for example the data of an article, a piece of music, or a program's code.
(3) Lossless compression and lossy compression
Lossless compression exploits the statistical redundancy of the data. Since the redundancy of typical data limits the achievable ratio to roughly 2:1 to 5:1, the compression ratio of lossless methods is generally modest. These methods are widely used for text, programs, and images in special applications where the data must be stored exactly. Lossy compression, by contrast, exploits the fact that human vision and hearing are insensitive to certain frequency components of images and sounds, and allows some information to be lost during compression. Although the original data cannot be fully recovered, the lost portion has little effect on how the original image or sound is understood, and in exchange a much larger compression ratio is obtained. Lossy compression is widely used for speech, image, and video data.

Data compression principle

Multimedia information in fact contains a great deal of data redundancy. For example, in an image the static background of a building, the blue sky, and the green grass contain many identical values; storing them point by point wastes a lot of space. This is called spatial redundancy. Similarly, in adjacent frames of television or animation only the moving objects change slightly, so only the differences need to be stored; this is called temporal redundancy. There are also structural redundancy and visual redundancy. All of these make data compression possible.
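The idea of storing only the differences can be sketched in Python as follows. This is a minimal illustration, using a one-dimensional sequence of sample values as a stand-in for successive frames; the function names are illustrative, not part of any particular codec.

```python
# A minimal sketch of difference (delta) coding: instead of storing each
# value outright, store the first value plus successive differences.
# When neighbouring values are similar, the differences are small and
# easier to compress afterwards.

def delta_encode(values):
    """Return the first value followed by the differences between neighbours."""
    if not values:
        return []
    encoded = [values[0]]
    for prev, cur in zip(values, values[1:]):
        encoded.append(cur - prev)
    return encoded

def delta_decode(encoded):
    """Rebuild the original sequence by accumulating the differences."""
    if not encoded:
        return []
    values = [encoded[0]]
    for diff in encoded[1:]:
        values.append(values[-1] + diff)
    return values

samples = [100, 101, 101, 102, 104, 104]
assert delta_decode(delta_encode(samples)) == samples
print(delta_encode(samples))  # [100, 1, 0, 1, 2, 0] -- mostly small numbers
```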
In short, the theoretical basis of compression is information theory. From this perspective, compression means removing redundancy from the information: removing what is certain or can be inferred while keeping what is uncertain, that is, replacing the original redundant description with one closer to the essence of the information. That essential content is the information itself.
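A small Python sketch of this information-theoretic limit: the Shannon entropy of a symbol distribution gives the minimum average number of bits per symbol that any lossless code can achieve for a memoryless source. The sample string is arbitrary.

```python
# Estimate the entropy of a text from its symbol frequencies.
from collections import Counter
from math import log2

def entropy_bits_per_symbol(text):
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * log2(c / total) for c in counts.values())

text = "abracadabra"
h = entropy_bits_per_symbol(text)
print(f"entropy: {h:.3f} bits/symbol vs 8 bits/symbol for plain ASCII")
```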

Data compression applications

A very simple compression method is run-length encoding, which replaces a run of identical data with the data value and the length of the run. This is an example of lossless data compression. It is often used on office computers to make better use of disk space, or in computer networks to make better use of bandwidth. For symbolic data such as spreadsheets, text, and executable files, losslessness is a critical requirement, because outside of a few limited cases, a change in even a single data bit is unacceptable.
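As a concrete illustration, here is a minimal run-length encoder and decoder in Python. The function names and the use of (count, symbol) pairs are illustrative choices, not a description of any particular tool's on-disk format.

```python
# Minimal run-length encoding: replace runs of identical symbols
# with (count, symbol) pairs, and expand them back on decode.

def rle_encode(data):
    encoded = []
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i]:
            run += 1
        encoded.append((run, data[i]))
        i += run
    return encoded

def rle_decode(pairs):
    return "".join(symbol * count for count, symbol in pairs)

line = "AAAAABBBCCCCCCCCD"
packed = rle_encode(line)
print(packed)                      # [(5, 'A'), (3, 'B'), (8, 'C'), (1, 'D')]
assert rle_decode(packed) == line  # lossless: the original is restored exactly
```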
For video and audio data, some loss of quality is acceptable as long as the important parts of the data are preserved. By exploiting the limitations of the human perceptual system, a great deal of storage space can be saved while the result remains barely distinguishable from the original. These lossy compression methods usually involve a trade-off among compression speed, compressed data size, and quality loss.
Lossy image compression is used in digital cameras, greatly increasing the number of images that can be stored while barely reducing image quality. The lossy MPEG-2 video codec used for DVDs achieves a similar effect.
In lossy audio compression, psychoacoustic methods are used to remove components of the signal that are inaudible or hard to hear. The compression of human speech often uses more specialized techniques, so "speech compression" or "speech coding" is sometimes treated as a research area separate from "audio compression". Various audio and speech compression standards fall under the category of audio codecs. For example, speech compression is used for Internet telephony, while audio compression is used for CD ripping, with the files decoded by MP3 players.

Data compression theory

The theoretical basis of compression is information theory (closely related to algorithmic information theory) and rate-distortion theory. Research in this area was essentially founded by Claude Shannon, who published the fundamental papers of the field in the late 1940s and early 1950s. Doyle and Carlson wrote in 2000 that data compression "has one of the simplest and most beautiful design theories in all areas of engineering." Cryptography and coding theory are closely related disciplines, and data compression has deep ties to the ideas of statistical inference.
Many lossless data compression systems can be viewed as four-stage models. Lossy data compression systems usually involve additional stages, such as prediction, frequency transformation, and quantization.
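One of those lossy stages, quantization, can be illustrated with a minimal Python sketch. The step size of 4 is an arbitrary choice for the example and is not tied to any particular codec; real codecs quantize transform coefficients rather than raw samples.

```python
# Uniform quantization: divide by a step size and round, so fine detail is
# discarded. Decoding multiplies back; the small rounding error that remains
# is the "quality loss" traded for a smaller set of values to store.

def quantize(samples, step=4):
    return [round(s / step) for s in samples]

def dequantize(indices, step=4):
    return [q * step for q in indices]

original = [12, 13, 15, 200, 203, 41]
coded = quantize(original)             # far fewer distinct values to encode
restored = dequantize(coded)
error = [o - r for o, r in zip(original, restored)]
print(coded)     # [3, 3, 4, 50, 51, 10]
print(restored)  # [12, 12, 16, 200, 204, 40]
print(error)     # small residual differences, never recovered
```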

Popular data compression algorithms

The Lempel-Ziv (LZ) compression methods are among the most popular lossless algorithms. DEFLATE is an LZ variant optimized for decompression speed and compression ratio; although its compression can be slow, it is used by PKZIP, gzip, and PNG. LZW (Lempel-Ziv-Welch) was patented by Unisys until the patent expired in June 2003; it is used for GIF images. Also worth mentioning is the LZR (LZ-Renau) method, which forms the basis of the Zip method. LZ methods use a table-based compression model in which repeated strings of data are replaced by references to table entries. For most LZ methods this table is generated dynamically from the earlier input data. The table is often itself encoded with Huffman coding (for example SHRI, LZX). A good LZ-based coding scheme is LZX, used in Microsoft's CAB format.
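The dictionary-building idea can be sketched as follows. This is a minimal LZW encoder in Python, assuming single-byte (extended ASCII) input and an initial table of 256 one-character strings; it shows the technique but is not the exact variant used by GIF or any particular tool.

```python
# Minimal LZW encoding: the table of strings is grown dynamically from the
# input itself, and each output code refers to a string already in the table.

def lzw_encode(data):
    table = {chr(i): i for i in range(256)}  # initial single-character entries
    next_code = 256
    current = ""
    output = []
    for ch in data:
        candidate = current + ch
        if candidate in table:
            current = candidate              # keep extending the match
        else:
            output.append(table[current])    # emit the longest known string
            table[candidate] = next_code     # add the new string to the table
            next_code += 1
            current = ch
    if current:
        output.append(table[current])
    return output

print(lzw_encode("TOBEORNOTTOBEORTOBEORNOT"))
# Codes >= 256 refer to multi-character strings discovered along the way.
```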

Data compression algorithm coding

The best compression tools use probabilistic models whose predictions are fed into arithmetic coding. Arithmetic coding was invented by Jorma Rissanen and turned into a practical method by Witten, Neal, and Cleary. It achieves better compression than the better-known Huffman algorithm and is particularly well suited to adaptive data compression, where the predictions depend strongly on context. Arithmetic coding is used in the bi-level image compression standard JBIG and the document compression format DjVu. The text-entry system Dasher works as an inverse arithmetic coder.
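As a rough sketch of the modelling side (not a full arithmetic coder), the following Python example shows how an adaptive order-0 model works: it updates its symbol counts as it goes, and an arithmetic coder would spend close to -log2(p) bits on a symbol the model predicts with probability p. The Laplace smoothing and the 256-symbol alphabet are illustrative assumptions.

```python
# Estimate how many bits an arithmetic coder driven by a simple adaptive
# order-0 model would need: each symbol ideally costs -log2(p) bits, where p
# is the model's probability for that symbol *before* seeing it.

from math import log2

def adaptive_model_cost(text, alphabet_size=256):
    counts = {}
    total_bits = 0.0
    seen = 0
    for ch in text:
        # Laplace-smoothed prediction from the symbols seen so far.
        p = (counts.get(ch, 0) + 1) / (seen + alphabet_size)
        total_bits += -log2(p)
        counts[ch] = counts.get(ch, 0) + 1   # update the model adaptively
        seen += 1
    return total_bits

msg = "aaaaabaaaaabaaaaab"
print(f"{adaptive_model_cost(msg):.1f} bits vs {8 * len(msg)} bits uncoded")
```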

Data compression type

Data compression can be divided into two types, one is called lossless compression and the other is called lossy compression.
Lossless compression means that when the compressed data is used for reconstruction (also called restoration or decompression), the reconstructed data is exactly the same as the original; it is used when the reconstructed signal must be identical to the original signal. A very common example is the compression of disk files: lossless algorithms can generally compress ordinary files to between 1/2 and 1/4 of their original size. Commonly used lossless compression algorithms include the Huffman algorithm and the LZW (Lempel-Ziv-Welch) algorithm.
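A minimal sketch of Huffman coding in Python, using the standard greedy construction with a heap: the two least frequent subtrees are merged repeatedly, so frequent symbols end up with shorter codes. The helper name and the test string are illustrative.

```python
# Build a Huffman code table from symbol frequencies.
import heapq
from collections import Counter

def huffman_codes(text):
    counts = Counter(text)
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:                       # degenerate single-symbol input
        return {sym: "0" for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

message = "this is an example for huffman coding"
codes = huffman_codes(message)
bits = sum(len(codes[ch]) for ch in message)
print(f"{bits} bits with Huffman vs {8 * len(message)} bits uncoded")
```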
Lossy compression means that the data reconstructed from the compressed data differs from the original, but the difference does not lead to a misunderstanding of the information the original data expressed. Lossy compression is appropriate when the reconstructed signal does not have to be identical to the original. For example, images and sound can be compressed lossily because they contain more data than our visual and auditory systems can perceive; discarding some of this data does not change the meaning of the sound or image, yet it greatly increases the compression ratio.
