How Do I Remove Duplicate Files?
Data deduplication is a data reduction technology commonly used in disk-based backup systems to reduce the amount of storage a backup consumes. It works by finding duplicate, variable-size data blocks stored in different locations and different files over a period of time; each duplicate block is replaced with a pointer to the single stored copy.
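To make the idea concrete, the following minimal sketch shows block-level deduplication in Python: data is split into blocks, each block is identified by its SHA-256 hash, only previously unseen blocks are stored, and each file is recorded as a list of hash references. The 4 KiB fixed block size, the `DedupStore` class, and the in-memory dictionaries are assumptions made for illustration, not a description of any particular product (real systems use variable-size blocks and persistent indexes).

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


class DedupStore:
    """Minimal in-memory block-level deduplication store (illustration only)."""

    def __init__(self):
        self.blocks = {}  # block hash -> block bytes, stored exactly once
        self.files = {}   # file name  -> list of block hashes (the "pointers")

    def put(self, name, data):
        refs = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # keep the block only if unseen
            refs.append(digest)
        self.files[name] = refs

    def get(self, name):
        # Rebuild the file by following its block references.
        return b"".join(self.blocks[h] for h in self.files[name])

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())


if __name__ == "__main__":
    store = DedupStore()
    payload = b"A" * 8192 + b"B" * 4096          # three blocks, two of them identical
    store.put("serverA/report.pdf", payload)
    store.put("serverB/report.pdf", payload)     # same file backed up from another host
    logical, physical = 2 * len(payload), store.stored_bytes()
    print(f"logical {logical} B, physical {physical} B, ratio {logical / physical:.1f}:1")
    assert store.get("serverB/report.pdf") == payload
```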
- When your backup software backs up the same file from the same directory multiple times over the network, or backs up the same file from multiple machines, duplicate data accumulates in the staging area. The amount of repeated data on most networks is staggering: it ranges from a PDF invitation to the holiday party saved by 56 users in their local directories to the 3 GB of Windows system files on the boot drive of every server.
- The solution to duplicate data in the staging area is data deduplication.
- Major vendors differ not only in their deduplication approaches but also in the physical architecture of their backup targets. Data Domain, ExaGrid, and
- Sub-file deduplication is not used only to save disk space in backup applications. Next-generation backup applications, including Televaulting by Asigra, Avamar Axion by EMC and
- Based on where it is deployed, deduplication can be divided into source-side and target-side deduplication. Source-side deduplication removes duplicate data before the data is transferred to the backup device; target-side deduplication transfers the data to the backup device first and removes duplicates as the data is stored.
- Based on the granularity of the duplicate-detection algorithm, deduplication can be divided into object/file-level and block-level deduplication. Object-level deduplication ensures that identical files are stored only once; block-level deduplication splits files into data blocks and compares the blocks themselves.
- Based on how the data blocks are segmented, deduplication can be divided into fixed-length chunking and variable-length chunking. With variable-length chunking, block boundaries are derived from the content, so block lengths vary; with fixed-length chunking, every block has the same size.
- Based on the application, deduplication can be divided into general-purpose and dedicated deduplication systems. A general-purpose system is a standalone deduplication product from a vendor, not tied to a particular virtual tape library or backup device. A dedicated system is bound to a specific virtual tape library or backup device and generally uses target-side deduplication.
- Deduplication can be implemented at the hardware or software level, or a combination of both. Similarly, deduplication can be performed on the data source side, on the backup destination side, or both.
- Source-side deduplication helps in applications where the data transmission link is slow. Deduplicating (and compressing) data at the source shrinks it before transmission, so transfers complete faster; a sketch of this exchange appears at the end of this list.
- Target-side deduplication operates on a backup target or a remote storage device, and its main purpose is to reduce storage costs. It reduces the amount of storage space actually consumed by eliminating duplicate data after it arrives. [2]
- During recovery, the data you need may not sit in contiguous disk blocks; this can happen even with non-deduplicated backups. As backup data expires and its space is reclaimed, storage fragmentation builds up and recovery times grow. Because deduplicated data and its pointers may be stored out of order, deleting deduplicated data also causes fragmentation, which degrades recovery performance.
- Some vendors provide deduplication
- Appliance vs. software: you need to understand the pros and cons of deploying a deduplication solution as a dedicated appliance versus running deduplication software on a server. Some software solutions are relatively inexpensive, but they may not scale well to meet growing capacity requirements, and their performance depends on the server they run on. Software solutions can seem less flexible, but they may work well for customers with the resources to take on integration, management, and monitoring themselves. If you choose the software approach, it is important to understand the processing power needed to run the "cleanup" tasks mentioned above and their impact on the server. Hardware appliances have their own space and power requirements, and sometimes consume a lot of power; they are usually self-managed, offer greater flexibility and simplicity, and benefit from hardware optimization. For customers looking for rapid deployment and easy integration into their current environment, hardware solutions are very popular.
- Usable capacity vs. raw capacity: usable capacity is the most direct and relevant specification for end users. It refers to the capacity available before any deduplication, and does not include space consumed by metadata,
- Data deduplication reduces the storage space required for backups, which makes faster and more frequent backups practical; this benefits
- Only a few primary storage arrays currently offer deduplication as an add-on feature of the product; fewer than 5% of disk arrays truly support inline deduplication and compression.
- Key figures related to deduplication (a short example of how these ratios translate into capacity follows this list):
- <5%: current market share of disk arrays that support inline deduplication
- 75%: predicted share of disk arrays that will support deduplication and compression within the next three years
- 6:1: typical average deduplication ratio
- 40:1: deduplication ratio achievable in VDI and file environments
- 10:1: overall space savings achievable when deduplication is combined with compression
- $1: approximate cost per GB of deduplicating storage built on ordinary hard drives
- $8 to $9: approximate cost per GB of deduplicating storage built on flash
- Not only can deduplication make better use of expensive flash capacity, it is also easier to implement there. Compared with most storage vendors,
- The last frontier for deduplication
- Deduplication technology generally cannot be added to existing storage arrays, even though, in theory, doing so would extend the service life of storage already in use.
- Flash vendors are vying for market share held by the traditional storage giants; to win that fight, giving away deduplication alone is not enough. [1]
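To illustrate why source-side deduplication helps over slow links (as noted earlier in this list), here is a minimal Python sketch of the exchange: the client chunks its data, offers the block hashes to the backup target, and transmits only the blocks the target has never seen. The block size, class name, and function names are illustrative assumptions, and the sketch omits the per-file list of references a real system would also record.

```python
import hashlib
import os
from typing import Dict, List, Set

BLOCK_SIZE = 4096  # assumed block size for this sketch


class BackupTarget:
    """Stand-in for the backup appliance on the far side of a slow link."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}

    def missing(self, digests: List[str]) -> Set[str]:
        """Return the hashes of blocks this target has never stored."""
        return {d for d in digests if d not in self.blocks}

    def store(self, blocks: Dict[str, bytes]) -> None:
        self.blocks.update(blocks)


def source_side_backup(target: BackupTarget, data: bytes) -> int:
    """Deduplicate at the source: offer hashes first, send only unknown blocks.

    Returns the number of payload bytes actually sent over the link."""
    chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    digests = [hashlib.sha256(c).hexdigest() for c in chunks]

    needed = target.missing(digests)          # tiny request/response over the link
    payload = {d: c for d, c in zip(digests, chunks) if d in needed}
    target.store(payload)                     # only previously unseen blocks travel
    return sum(len(c) for c in payload.values())


if __name__ == "__main__":
    target = BackupTarget()
    monday = os.urandom(40_960)                # first full backup: every block is new
    tuesday = monday + os.urandom(4_096)       # next run: only one new block appended
    print("Monday sent: ", source_side_backup(target, monday), "bytes")
    print("Tuesday sent:", source_side_backup(target, tuesday), "bytes")
```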
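The ratios quoted above translate directly into effective capacity and space savings. The short sketch below, using a made-up 100 TB of usable capacity, shows the arithmetic; the function names are purely illustrative.

```python
def effective_capacity(usable_tb: float, dedup_ratio: float) -> float:
    """Logical data that fits in `usable_tb` of physical space at a given ratio."""
    return usable_tb * dedup_ratio


def space_saved_percent(dedup_ratio: float) -> float:
    """Fraction of physical space saved, expressed as a percentage."""
    return (1 - 1 / dedup_ratio) * 100


if __name__ == "__main__":
    for ratio in (6, 10, 40):
        print(f"{ratio}:1 -> {effective_capacity(100, ratio):.0f} TB of logical data "
              f"on 100 TB usable, {space_saved_percent(ratio):.0f}% space saved")
```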
When to use deduplication
- Deduplication starts where data is created. Every operation that follows (backup, replication, archiving, and any transfer over the network) then benefits from the reduced data volume.
- However, applying deduplication to primary (master) data is hard for users to accept, because it alters the primary data set itself rather than a backup copy. Even without deduplication in the picture, disturbing production data is risky; moving deduplication into primary storage raises the stakes considerably, and you need to understand how the technology affects performance, reliability, and data integrity. [2]
Data Deduplication Ratio
- The space saved by data deduplication is considerable, but it depends on the type of data and on the chunk size used by the deduplication engine. In file and virtual desktop infrastructure (VDI) environments, which benefit from very high duplication rates, the reduction ratio can reach 40:1. Video can be compressed, but it deduplicates poorly. Storage vendors regard 6:1 as a realistic average deduplication ratio. Combined with compression of the stored blocks, data centers can readily achieve 10:1 storage space savings with these technologies; at 10:1, for example, 100 TB of logical data occupies roughly 10 TB of physical storage.
- Deduplication saves space and is very useful, but it is computationally intensive. On relatively less critical secondary storage this is generally not a problem, but on primary storage it can cause transient congestion.
- Deduplication can not only remove duplicate data inline; its algorithms also let vendors maximize the potential data reduction. Quantum's DXi series backup appliances, for example, use a deduplication algorithm with a variable block size, which is claimed to be more than three times as effective as a fixed-block-size approach. [2]
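To make the variable-block-size idea concrete, here is a minimal sketch of content-defined (variable-length) chunking in Python. The window size, mask, and minimum/maximum chunk sizes are arbitrary illustrative values, and the SHA-256-based fingerprint stands in for the fast rolling hashes real products use; this is not Quantum's or any other vendor's actual algorithm.

```python
import hashlib
import os
from typing import List

# Illustrative parameters, not any vendor's real settings.
WINDOW = 16            # bytes examined at each position
MASK = (1 << 11) - 1   # on average one boundary roughly every 2 KiB
MIN_CHUNK = 512
MAX_CHUNK = 8192


def variable_length_chunks(data: bytes) -> List[bytes]:
    """Content-defined chunking: cut where the content itself says to cut.

    Boundaries depend on a small window of bytes rather than on absolute
    offsets, so inserting data near the start of a file shifts only the
    chunks around the edit; later chunks keep their boundaries and still
    deduplicate against earlier backups."""
    chunks, start, pos = [], 0, 0
    while pos < len(data):
        pos += 1
        size = pos - start
        window = data[max(start, pos - WINDOW):pos]
        fingerprint = int.from_bytes(hashlib.sha256(window).digest()[:4], "big")
        if size >= MAX_CHUNK or (size >= MIN_CHUNK and (fingerprint & MASK) == 0):
            chunks.append(data[start:pos])
            start = pos
    if start < len(data):
        chunks.append(data[start:])
    return chunks


if __name__ == "__main__":
    base = os.urandom(64 * 1024)
    edited = b"a few inserted bytes" + base   # prepend a small edit
    old = {hashlib.sha256(c).hexdigest() for c in variable_length_chunks(base)}
    new = variable_length_chunks(edited)
    reused = sum(1 for c in new if hashlib.sha256(c).hexdigest() in old)
    print(f"{reused} of {len(new)} chunks unchanged after the edit")
```

Because a small insertion only shifts the chunks around the edit, variable-length chunking typically finds far more duplicate blocks across successive backups than fixed-length chunking, which is the kind of effect vendors cite when they claim it is several times more effective.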