What Happens When You Unzip a Document?

In short, a compressed file is a file produced by compression software. One basic principle of compression is to shorten runs of adjacent identical bits in the file's binary data. For example, the run 000000 can be recorded as "six zeros", written 60, which takes less space.

Another basic principle is to find byte sequences that repeat within the file, build a "dictionary" of them, and replace each occurrence with a short code that refers to its dictionary entry. This reduces the size of the file. [1]

Compressed file compression principle

Compression shortens runs of adjacent identical bits in the file's binary data. For example, the run 000000 can be written as "six 0s", that is 60, which takes less space.
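The run-shortening idea above is usually called run-length encoding. The following is a minimal sketch of it in Python; the function names `rle_encode` and `rle_decode` are made up for illustration and do not come from any compression program.

```python
from itertools import groupby

def rle_encode(symbols: str) -> list[tuple[int, str]]:
    """Collapse each run of identical symbols into a (count, symbol) pair."""
    return [(len(list(run)), symbol) for symbol, run in groupby(symbols)]

def rle_decode(pairs: list[tuple[int, str]]) -> str:
    """Expand (count, symbol) pairs back into the original string."""
    return "".join(symbol * count for count, symbol in pairs)

encoded = rle_encode("000000")
print(encoded)                  # [(6, '0')]
print(rle_encode("0001101"))    # [(3, '0'), (2, '1'), (1, '0'), (1, '1')]
assert rle_decode(encoded) == "000000"
```

As the second call suggests, runs of length one gain nothing, so this scheme only pays off on data with long runs.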
Because a computer represents all information as binary numbers, compression software marks repeated strings in the binary data with special codes. To help understand file compression, picture an image of blue sky and white clouds. For thousands of monotonously repeated blue pixels, instead of recording a long string of colors "blue, blue, blue ..." one by one, it is simpler to tell the computer: "starting at this position, store 1117 blue dots". This also saves a lot of storage space. It is a very simple example of image compression. In the final analysis, all computer files are stored as 1s and 0s. Just as with the blue dots, as long as a reasonable formula is used, a file can be shrunk considerably while its data stays lossless yet dense.
Generally speaking, compression is either lossy or lossless. If losing some individual pieces of data has little impact, they can simply be discarded; this is lossy compression. Lossy compression is widely used for video, sound and image files. Typical examples are the video format MPEG, the music format MP3 and the image format JPG. In many cases, however, the restored data must be exact, so lossless compression formats such as the common ZIP and RAR were designed.
Compression software is a tool that applies these principles to compress data. The file it produces is called an archive (or compressed package), and its size may be only a fraction of the original or even less. An archive is a different file format from the original, so to use the data inside it you must first restore the data with compression software. This process is called decompression. Common compression software includes WinZip and WinRAR. [1]
Computer data contains two forms of repetition, and zip compresses both. [1]
The first is phrase repetition, that is, a repeated sequence of three or more bytes. Zip represents such a repetition with two numbers: 1. the distance from the earlier occurrence to the current compression position; 2. the length of the repetition. Assuming each number occupies one byte, two bytes replace three or more, so the data shrinks. [1]
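The (distance, length) idea can be sketched as a toy LZ77-style coder. This is only an illustration under simplifying assumptions: the names `lz77_compress` and `lz77_decompress`, the 255-byte window, and the token representation are choices made for this example, and the real zip algorithm uses a larger window and a compact bit-level encoding.

```python
def lz77_compress(data: bytes, window: int = 255, min_len: int = 3, max_len: int = 255):
    """Return tokens: a literal byte (int) or a (distance, length) pair."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search the recent past for the longest match with the upcoming bytes.
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:
            tokens.append((best_dist, best_len))  # two numbers replace the phrase
            i += best_len
        else:
            tokens.append(data[i])                # too short to be worth a pair
            i += 1
    return tokens

def lz77_decompress(tokens) -> bytes:
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])  # byte by byte, so overlapping copies work
        else:
            out.append(token)
    return bytes(out)

sample = b"ask what you can do, ask what you can do"
tokens = lz77_compress(sample)
print(tokens[-1])  # (21, 19): copy 19 bytes from 21 bytes back
assert lz77_decompress(tokens) == sample
```

The brute-force search is quadratic and is kept only for readability; practical implementations index earlier phrases with hash tables.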
One byte has 256 possible values, from 0 to 255, so three bytes have 256 * 256 * 256, more than 16 million, possible values, and the number of possible values for longer phrases grows exponentially. It might seem that the probability of repetition is extremely low, but in fact it is not: all kinds of data tend to repeat. In a paper, a small number of terms appear again and again. In a novel, the names of people and places recur. In a gradient background image, pixels repeat in the horizontal direction. In program source files, syntax keywords recur (how many times do we copy and paste?). In uncompressed data measured in tens of kilobytes, phrase repetition tends to occur. After compression in the manner described above, the tendency toward phrase repetition is largely destroyed, so applying phrase compression a second time to the compressed result is generally ineffective. [1]
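The claim that a second pass gains little can be tried with Python's standard `zlib` module, which implements the same family of algorithm as zip. This is a rough experiment rather than a proof; the vocabulary and the random seed below are arbitrary choices for the example.

```python
import random
import zlib

random.seed(0)  # fixed seed so the run is repeatable
vocabulary = ["paper", "novel", "name", "place", "pixel", "keyword", "copy", "paste",
              "file", "data", "phrase", "byte", "zip", "repeat", "image", "source"]
# Text built from a small vocabulary, so phrases repeat often.
text = " ".join(random.choice(vocabulary) for _ in range(5000)).encode("ascii")

once = zlib.compress(text)    # first pass removes most of the repetition
twice = zlib.compress(once)   # second pass has little left to find

print(len(text), len(once), len(twice))
assert zlib.decompress(zlib.decompress(twice)) == text
```

On data like this the second size is expected to be about the same as the first, or slightly larger. Exceptionally repetitive input can behave differently.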
The second kind is single-byte repetition. A byte has only 256 possible values, so this kind of repetition is inevitable. Some byte values occur more often than others, so the distribution tends to be statistically uneven. For example, in an ASCII text file some symbols are rarely used while letters and digits are used more often, and the letters themselves differ in frequency (the letter e is said to be the most frequent). Many pictures are predominantly dark or light, so dark (or light) pixels dominate. (Incidentally, the PNG image format is lossless and its core algorithm is the zip algorithm. The main difference from a zip file is that, as an image format, it also stores information such as the image dimensions and the number of colors.) The output of the phrase compression described above has the same tendency: repetitions tend to lie close to the current compression position, and repetition lengths tend to be short (within 20 bytes). This makes further compression possible: re-encode the 256 byte values so that frequent values get codes shorter than eight bits and rare values get longer ones. When the values that become shorter outnumber those that become longer, the total length of the file decreases, and the more uneven the byte frequencies, the greater the compression ratio. [1]
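Re-encoding byte values with variable-length codes is commonly done with Huffman coding. The sketch below builds such a code for a short sample; the function name `huffman_code` is hypothetical, and the real zip format stores its code tables in a more compact way than the dictionary used here.

```python
import heapq
from collections import Counter

def huffman_code(data: bytes) -> dict[int, str]:
    """Build a prefix code in which frequent byte values get short bit strings."""
    freq = Counter(data)
    if len(freq) == 1:  # degenerate case: only one distinct byte value
        return {next(iter(freq)): "0"}
    # Heap items: (weight, tie_breaker, {byte_value: code_so_far})
    heap = [(weight, i, {value: ""}) for i, (value, weight) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie_breaker = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent groups, prefixing their codes with 0 and 1.
        w1, _, group1 = heapq.heappop(heap)
        w2, _, group2 = heapq.heappop(heap)
        merged = {value: "0" + code for value, code in group1.items()}
        merged.update({value: "1" + code for value, code in group2.items()})
        heapq.heappush(heap, (w1 + w2, tie_breaker, merged))
        tie_breaker += 1
    return heap[0][2]

sample = b"ask not what your country can do for you"
code = huffman_code(sample)
encoded_bits = sum(len(code[value]) for value in sample)
print(encoded_bits, "bits instead of", 8 * len(sample))
```

The space character is the most frequent value in this sample, so it receives one of the shortest codes, while letters that occur once receive the longest.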

Compressed file compression method

Install compression software
First install compression software. One of the more popular programs is WinRAR, "an efficient and fast file compression software (Chinese version)".
Create a compressed package
Select the file or folder you want to put into a compressed package. You can also select multiple files in the same way as in Explorer, that is, hold down Ctrl or Shift while selecting the files (folders).
Right-click the selection, choose "Add to compressed file", and open the General tab. Alternatively, click the "Compress" button on the toolbar. Here you can choose the compression format: RAR or ZIP. If you want a higher compression ratio, the RAR format is recommended.
Once the options are set, click the OK button to start creating the compressed package.
Split into volumes
To split the package into volumes, enter the volume size as a number of bytes (1M = 1024K, 1K = 1024 bytes).
Sometimes you need to upload a compressed package to a forum, the package is 3M, but the forum limits member uploads to 2M. What should you do?
The method is simple: when compressing the file, split it into several volumes and set the volume size to 2M. The result is two files, 123.part1.rar (2M) and 123.part2.rar (1M), which you can upload.
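The volume arithmetic in the forum example can be sketched as follows. This only shows how a 3M stream divides into 2M pieces; it does not produce real RAR volumes, and `split_into_volumes` is a name invented for this example.

```python
def split_into_volumes(data: bytes, volume_size: int) -> list[bytes]:
    """Cut data into consecutive pieces of at most volume_size bytes."""
    return [data[i:i + volume_size] for i in range(0, len(data), volume_size)]

MIB = 1024 * 1024                 # 1M = 1024K, 1K = 1024 bytes
archive = bytes(3 * MIB)          # stand-in for a 3M compressed package
volumes = split_into_volumes(archive, 2 * MIB)

print([len(volume) // MIB for volume in volumes])  # [2, 1]
# Joining every volume, in order, restores the original package.
assert b"".join(volumes) == archive
```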

Compressed file decompression method

When you download a compressed package that is split into volumes, how do you unpack it?
The specific method is as follows:
1. Download all the volumes.
2. Put all the volumes in the same folder.
3. Double-click any volume to decompress it.
Note: The volume sequence must be complete, with no part missing.
If a volume has not been downloaded, the software will prompt for the next volume during decompression.

Compressed file software introduction

Compressed file WinRAR

WinRAR, a popular Windows compression tool!
WinRAR is a powerful compressed file management tool. It can back up your data, reduce the size of e-mail attachments, decompress RAR, ZIP and other archive formats downloaded from the Internet, and create archives in RAR and ZIP formats. A trial version can be downloaded before purchase.
WinRAR is a popular compression tool with a friendly, easy-to-use interface, and it performs well in both compression ratio and speed. Version 3.x uses a more advanced compression algorithm, making RAR one of the formats with the highest compression ratios and fastest compression speeds. Version 3.3 added virus scanning of compressed files and decompression of "enhanced compression" ZIP files, and improved volume compression.

Main features of compressed files

1. Full support for RAR and ZIP;
2. Decompression of ARJ, CAB, LZH, ACE, TAR, GZ, UUE, BZ2, JAR and ISO files;
3. Multi-volume compression;
4. Creation of self-extracting files, which can serve as simple, easy-to-use installers;
5. A compressed file size of up to 8,589,934 TB;
6. Archive locking and a powerful data recovery record that protect data carefully, with the newer recovery volumes offering even stronger protection.

How compressed files work

Compression comes in two kinds: lossless and lossy.

Lossless compression of compressed files

If you have downloaded many programs and files from the Internet, you have probably come across ZIP files. This compression mechanism is a very convenient invention, especially for network users, because it reduces the total number of bits and bytes in a file, so the file transfers faster over a slow Internet connection and takes up less disk space. After downloading the file, you can use a program such as WinZip or StuffIt to expand it back to its original size. If all goes well, the expanded file is exactly the same as the original file before compression.
At first glance this seems mysterious: how can the number of bits and bytes be reduced and then restored intact? Once it is laid out, the basic idea behind the process turns out to be simple and clear. In this article, we discuss this method of significantly shrinking files by simple compression.
Most computer file types contain quite a bit of redundancy: they list some of the same information over and over. File compression programs are designed to eliminate this redundancy. Rather than listing a piece of information repeatedly, a file compression program lists it only once and then refers back to it wherever it appears in the original file.
Take as an example a type of information we are all familiar with: words.
John F. Kennedy made the following famous statement in his 1961 inaugural address:
"Ask not what your country can do for you -- ask what you can do for your country."
This passage has 17 words, made up of 61 letters, 16 spaces, one dash and one period. If each letter, space or punctuation mark occupies 1 memory unit, the total file size is 79 units. To reduce the file size, we need to find the redundant parts.
We immediately notice the following:
If you ignore the difference between upper and lower case letters, this sentence is almost half redundant. Nine words (ask, not, what, your, country, can, do, for, you) provide almost everything needed to compose the entire sentence. To construct the second half of the sentence, we just take words from the first half and add spaces and punctuation.
Most compression programs use a variation of the adaptive dictionary-based LZ algorithm to shrink files. "LZ" refers to Lempel and Ziv, the inventors of the algorithm. "Dictionary" refers to the method of cataloging pieces of data.
There are many mechanisms for arranging dictionaries, and it can be as simple as a numbered list. As we examine Kennedy's famous speech, we can pick out duplicate words and put them in the numbered index. We then write the number directly instead of writing the entire word.
So if our dictionary is:
1. ask
2. what
3. your
4. country
5. can
6. do
7. for
8. you
Our sentence should now look like this:
1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4
If you understand this mechanism, you can easily reconstruct the original sentence using only the dictionary and numbering scheme. This is what the decompressor on your computer does when you unpack a downloaded file. You may also come across compressed files that can decompress themselves. To create such a file, the programmer needs to set up a simple decompression program in the compressed file. After downloading, it can automatically reconstruct the original file.
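The numbered-dictionary scheme can be sketched in a few lines. This is a simplified model of the word example only, not how a real decompressor is written: it works on whole words, ignores case, and leaves out the final period; the function names are made up for the example.

```python
dictionary = {1: "ask", 2: "what", 3: "your", 4: "country",
              5: "can", 6: "do", 7: "for", 8: "you"}

def decompress(compressed: str, dictionary: dict[int, str]) -> str:
    """Replace every number with its dictionary word; keep other tokens as-is."""
    tokens = compressed.split(" ")
    return " ".join(dictionary[int(t)] if t.isdigit() else t for t in tokens)

def compress(text: str, dictionary: dict[int, str]) -> str:
    """Replace every dictionary word with its number; keep other tokens as-is."""
    index = {word: number for number, word in dictionary.items()}
    return " ".join(str(index[w]) if w in index else w for w in text.split(" "))

compressed = "1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4"
sentence = decompress(compressed, dictionary)
print(sentence)
# ask not what your country can do for you -- ask what you can do for your country
assert compress(sentence, dictionary) == compressed
```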
But how much space does this mechanism save? "1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4" is certainly shorter than "Ask not what your country can do for you -- ask what you can do for your country." The catch is that we need to save the dictionary along with the file.
In the actual compression scheme, calculating various file requirements is a fairly complicated process. Let's go back and consider the example above. Each character and space occupies 1 memory unit, and the entire original sentence occupies 79 units. The compressed sentence (including spaces) occupies 37 units, and the dictionary (words and numbers) also occupies 37 units. In other words, the file size is 74 units, so we have not reduced the file size much.
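The totals above can be checked with a little arithmetic. In this sketch the dictionary cost is computed (one unit per letter plus one unit per entry number), while the 79-unit and 37-unit sentence sizes are taken from the text as given rather than recomputed, since they depend on how the dash is counted.

```python
dictionary = ["ask", "what", "your", "country", "can", "do", "for", "you"]

# One unit per letter, plus one unit for each entry's number.
dictionary_units = sum(len(word) for word in dictionary) + len(dictionary)

original_units = 79   # figure quoted in the text
sentence_units = 37   # figure quoted in the text

total_units = sentence_units + dictionary_units
print(dictionary_units, total_units, original_units - total_units)  # 37 74 5
```

A saving of 5 units out of 79 is small, which is the point the paragraph makes.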
But this is only a single sentence! If we used the compression program on the rest of Kennedy's speech, we would find these and other words repeated many more times. And, as explained below, the dictionary can be rewritten to organize it as efficiently as possible.
In the previous example, we picked out all the repeated words and put them in a dictionary. To us, this is the most obvious way to write a dictionary. But a compression program sees things differently: it has no concept of words and only looks for patterns. To make the file as small as possible, it carefully selects the best patterns.
If we process the sentence from this perspective, we end up with a completely different dictionary.
If the compression program scans Kennedy's sentence, the first redundancy it encounters is only a couple of characters long. In "ask not what your", there is a repeating pattern, namely the letter t followed by a space, in "not" and "what". If the compressor writes this pattern to the dictionary, it writes a "1" every time a "t" is followed by a space. But in this short sentence the pattern does not occur often enough to keep it as a dictionary entry, so the program eventually overwrites it.
The next thing the program notices is "ou", which appears in both "your" and "country". In a long document, writing this pattern to the dictionary could save a lot of space, because "ou" is a very common letter combination in English. But after looking at the entire sentence, the compression program finds a better choice of dictionary entry: not only is "ou" repeated, but the whole words "your" and "country" are repeated, and they are in fact repeated together as the phrase "your country". In this case, the program overwrites the "ou" entry in the dictionary with a "your country" entry.
The phrase "can do for" is also repeated, once followed by "your" and once followed by "you", so "can do for you" is a repeating pattern as well. This lets us replace 15 characters (including spaces) with one number, whereas "your country" only lets us replace 13 characters (including spaces) with one number. The program therefore overwrites the "your country" entry with an "r country" entry and writes a separate entry for "can do for you". The program continues in this way, picking out all the repeated information and then calculating which patterns should be written to the dictionary. The "adaptive" part of the adaptive dictionary-based LZ algorithm refers to this ability to rewrite the dictionary. The way a program actually does this is fairly complicated.
No matter what method is used, this deep search mechanism can compress files more efficiently than just picking out words. If we use the pattern we extracted above and replace the space with "__", we will end up with the following larger dictionary:
1. ask__
2. what__
3. you
4. r__country
5. __can__do__for__you
The sentence becomes shorter:
"1not__2345__--__12354"
The sentence now occupies 18 memory units and the dictionary occupies 41 units. So we have compressed the total file size from 79 units to 59 units! This is just one way to compress the sentence, and it is not necessarily the most efficient. (Can you find a better way?)
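The pattern dictionary and the unit counts can be checked with a short sketch. It assumes that "__" stands for a single space, that each dictionary number costs one unit, and that the final period is left out, as in the example; the function name `expand` is made up for illustration.

```python
dictionary = {"1": "ask ", "2": "what ", "3": "you",
              "4": "r country", "5": " can do for you"}
compressed = "1not__2345__--__12354".replace("__", " ")  # "__" marks a space

def expand(compressed: str, dictionary: dict[str, str]) -> str:
    """Replace each digit with its dictionary pattern; copy other characters."""
    return "".join(dictionary.get(ch, ch) for ch in compressed)

print(expand(compressed, dictionary))
# ask not what your country can do for you -- ask what you can do for your country

sentence_units = len(compressed)
dictionary_units = sum(len(pattern) + 1 for pattern in dictionary.values())
print(sentence_units, dictionary_units, sentence_units + dictionary_units)  # 18 41 59
```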
File compression rates depend on a number of factors, including file type, file size, and compression scheme.
In most languages of the world, certain letters and words often appear together in the same patterns. Because of this high redundancy, text files compress very well. A reasonably sized text file can typically be reduced by 50% or more. Most programming languages are also very redundant, because they have relatively few commands and the commands often follow set patterns. Files that contain a lot of non-repeating information (such as images or MP3 files) cannot be compressed much with this mechanism, because they do not contain patterns that repeat many times.
If a file has a lot of repeating patterns, the compression ratio usually increases as the file size increases. This can be seen in our example: if our excerpt from Kennedy's speech were longer, the patterns in our dictionary would appear again and again, so each dictionary entry would save more space. In addition, larger files may reveal more general patterns, which can produce more efficient dictionaries.
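Both effects can be observed with Python's standard `zlib` module. This is a rough experiment, not a benchmark: `size_fraction` is a helper defined only for this example, and random bytes stand in for data without repeating patterns.

```python
import os
import zlib

def size_fraction(data: bytes) -> float:
    """Compressed size divided by original size (smaller means better compression)."""
    return len(zlib.compress(data, 9)) / len(data)

sentence = b"Ask not what your country can do for you -- ask what you can do for your country. "

short_text = sentence            # one copy: few repeats to exploit
long_text = sentence * 100       # many copies: the same patterns recur
no_patterns = os.urandom(len(long_text))  # random bytes: nothing repeats

print(f"short text : {size_fraction(short_text):.2f}")
print(f"long text  : {size_fraction(long_text):.2f}")
print(f"random data: {size_fraction(no_patterns):.2f}")
```

The long repetitive text should shrink to a small fraction of its size, while the random data should stay at roughly its original size or grow slightly.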
File compression efficiency also depends on the specific algorithm used by the compression program. Some programs are better at finding patterns in certain types of files, so they compress those types more efficiently. Other compression programs use a dictionary within the dictionary, which makes them perform well on large files but inefficiently on smaller ones. Although all compression programs in this category are based on the same basic idea, they perform differently. Program developers are always trying to build better compression mechanisms.

Lossy compression of compressed files

The type of compression we discussed above is called lossless compression because the file you recreate is exactly the same as the original file. All lossless compression is based on the idea of turning the file into a "smaller" form for transfer or storage, and recovering it when it is received by another party so that it can be reused.
Lossy compression is quite different. These programs directly remove "unnecessary" information and trim the file to make it smaller. This type of compression is widely used to reduce the file size of bitmap images, because bitmap images are often very large. To understand how lossy compression works, let's see how your computer compresses a scanned photo.
For such files, lossless compression programs usually do not achieve a high compression ratio. Although large parts of the picture look the same (for example, the entire sky is blue), most pixels differ slightly from one another. To make the picture smaller without reducing its resolution, the color values of some pixels must be changed. If the picture contains a lot of blue sky, the program picks one shade of blue that can be used for all of those pixels. The program then rewrites the file so that every sky pixel uses this value. If the compression scheme is chosen well, you will not notice any change, but the file size will be significantly reduced.
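The blue-sky idea can be imitated with a small experiment. This sketch is not how JPG or any real image codec works: it simply snaps slightly different blue values to one shared value and then applies lossless `zlib` compression to both versions. The pixel values, the seed and the step of 8 are arbitrary assumptions.

```python
import random
import zlib

random.seed(1)  # fixed seed so the run is repeatable
# 10,000 "sky" pixels: blue levels that differ slightly from one another.
sky = bytes(200 + random.randint(0, 7) for _ in range(10_000))

step = 8
# Lossy step: snap every value down to a multiple of `step` (here, always 200).
flattened = bytes((value // step) * step for value in sky)

lossless_size = len(zlib.compress(sky))
lossy_size = len(zlib.compress(flattened))
largest_change = max(abs(a - b) for a, b in zip(sky, flattened))

print(lossless_size, lossy_size, largest_change)
assert flattened != sky  # the original pixel values cannot be recovered
```

The flattened version compresses far better, at the cost of a small, permanent change to each pixel.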
Of course, with lossy compression, you can't restore the file to its original appearance after compression. You must accept a reinterpretation of the original file by the compression program. Therefore, if you need to fully reproduce the original content (such as software applications, databases, and presidential inaugural speeches), this compression format should not be used.
