What Is a Data Processing System?
Data is an expression of facts, concepts, or instructions that can be processed manually or by automated devices. Once data is interpreted and given meaning, it becomes information. Data processing covers the collection, storage, retrieval, processing, transformation, and transmission of data.
- Data processing is a technique that uses computers to collect, record, and process data in order to produce new forms of information. Here, data refers to collections of numbers, symbols, letters, and text of all kinds, and data processing covers far more than ordinary arithmetic operations.
- Computer data processing mainly involves eight activities:
- Data acquisition: collecting the required information.
- Data conversion: converting the information into a form the machine can accept.
- Data grouping: assigning codes and grouping records effectively by their relevant attributes.
- Data organization: arranging the data, or rearranging it, so it can be processed.
- Data calculation: performing arithmetic and logical operations to derive further information.
- Data storage: saving the original data or the computed results for later use.
- Data retrieval: finding useful information according to the user's requirements.
- Data sorting: putting the data in order according to given requirements.
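The eight activities above can be sketched as one small pipeline. This is a minimal illustration using hypothetical sales records; the field names and data are invented for the example.

```python
# A minimal sketch of the eight data-processing activities,
# applied to hypothetical (region, amount) sales records.
from collections import defaultdict

def process(raw_lines):
    # Acquisition: raw_lines stands in for collected input.
    # Conversion: parse each text line into a machine-usable pair.
    records = []
    for line in raw_lines:
        region, amount = line.split(",")
        records.append((region.strip(), float(amount)))

    # Grouping: key each record by its region code.
    groups = defaultdict(list)
    for region, amount in records:
        groups[region].append(amount)

    # Organization + calculation: total the amounts per region.
    totals = {region: sum(amounts) for region, amounts in groups.items()}

    # Storage: keep originals and results (here, simply returned).
    # Retrieval: a caller can look up totals[region].
    # Sorting: rank regions by descending total.
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return records, totals, ranked

raw = ["north, 120.0", "south, 80.5", "north, 30.0"]
records, totals, ranked = process(raw)
print(ranked)  # [('north', 150.0), ('south', 80.5)]
```

Each comment marks which of the eight activities the line stands for; a real system would of course split these across separate subsystems.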
- The data processing workflow falls roughly into three stages: data preparation, processing, and output. In the preparation stage, data is entered offline onto punched cards, punched paper tape, magnetic tape, or magnetic disk.
- Data processing is the extraction of valuable information from large amounts of raw data, that is, the conversion of data into information. It organizes and works over input data of every form, spanning the whole chain through which data evolves: collection, storage, processing, classification, merging, calculation, sorting, conversion, retrieval, and dissemination.
- Data management refers to the collection, organization, storage, maintenance, retrieval, and transmission of data. It is the basic link in the data processing business and a common part of every data processing workflow.
- In data processing, the calculations themselves are usually fairly simple, but they vary from business to business, so an application program must be written to suit each business's needs. Data management is more complicated: with the explosive growth and great variety of available data, data must not only be used but also effectively managed, which calls for universal, convenient, and efficient management software.
- Data processing and data management are closely related: the quality of data management technology directly affects the efficiency of data processing. Database technology is the branch of computer applications researched, developed, and refined to meet exactly this requirement.
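To illustrate delegating data management to general-purpose management software, here is a tiny sketch using SQLite from Python's standard library: the database engine handles storage, indexing, and retrieval, while the application only supplies the business query. The table and data are hypothetical.

```python
# Delegating data management to general-purpose software:
# SQLite stores, indexes, and retrieves; the application only queries.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 30.0), ("bob", 10.0), ("alice", 5.0)])

# The business-specific part is just this query; the management
# (file layout, indexes, concurrency) is the database's job.
total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE customer = ?", ("alice",)
).fetchone()[0]
print(total)  # 35.0
conn.close()
```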
- Big data processing. The data age brings three changes in mindset: use all the data rather than a sample, tolerate imprecision rather than insist on absolute accuracy, and look for correlation rather than causation. There are many big data processing methods, but from long practice Tianhu Data has distilled a basic big data processing flow that should help make sense of them. The whole flow can be summarized in four steps: collection, import and preprocessing, statistics and analysis, and mining.
- Collection
- The main feature of, and challenge in, collecting big data is high concurrency: thousands of users may be accessing and operating on the system at the same time. Train ticket sales sites and Taobao, for example, see peak concurrent visits in the millions, so a large number of databases must be deployed on the collection side to carry the load, and how to balance load and shard data across those databases requires careful thought and design.
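One common way to spread such high-concurrency traffic across many databases is hash-based sharding. The sketch below is a minimal illustration; the shard names are hypothetical, and real deployments add replication, rebalancing, and health checks on top.

```python
# Minimal hash-based sharding: a record's key is hashed so that writes
# spread evenly across databases and the same key always lands on the
# same shard. Shard names are placeholders.
import hashlib

SHARDS = ["db0", "db1", "db2", "db3"]

def shard_for(key: str) -> str:
    # A stable hash (not Python's randomized hash()) keeps the
    # key -> shard mapping consistent across processes and restarts.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# All traffic for one user hits one database; distinct users spread out.
print(shard_for("user:42"), shard_for("user:43"))
```

Modulo hashing is the simplest scheme; systems that must add shards without remapping most keys typically use consistent hashing instead.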
- Statistics and analysis
- Statistics and analysis mainly use distributed databases or distributed computing clusters to perform ordinary analysis, classification, and summarization of the large volumes of data stored in them, which satisfies most common analysis needs. For real-time requirements, EMC's Greenplum, Oracle's Exadata, or the MySQL-based columnar store Infobright may be used, while batch processing and semi-structured data can be handled with Hadoop. The main feature and challenge of this step is the sheer volume of data involved, which heavily taxes system resources, especially I/O.
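The "classify and summarize" pattern that a distributed database or Hadoop job applies at scale can be sketched in miniature: each node summarizes its own partition locally, then the partial results are merged. The event data below is hypothetical.

```python
# Classify-and-summarize in miniature: per-partition local aggregation
# followed by a merge, the same map/reduce shape a distributed cluster
# uses over far larger data.
from collections import Counter

def local_summary(partition):
    # Per-node step: count events by category within one partition.
    return Counter(category for category, _ in partition)

def merge(summaries):
    # Reduce step: combine the partial counts into a global summary.
    total = Counter()
    for s in summaries:
        total += s
    return total

node1 = [("click", 1), ("view", 1), ("click", 1)]
node2 = [("view", 1), ("click", 1)]
global_counts = merge([local_summary(node1), local_summary(node2)])
print(global_counts["click"], global_counts["view"])  # 3 2
```

Because only the small per-node summaries cross the network, this shape keeps the I/O pressure the text mentions on the local disks rather than on the interconnect.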
- Import and preprocessing
- Although many databases sit on the collection side, to analyze these large volumes of data effectively the data should be imported from the front end into a centralized, large distributed database or distributed storage cluster, where some simple cleaning and preprocessing can also be done. Some users perform streaming computation on the data during import with Twitter's Storm to meet real-time computing needs. The main feature and challenge of import and preprocessing is the volume of incoming data, which often reaches hundreds of megabytes or even gigabytes per second.
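The "simple cleaning and preprocessing" during import can be sketched as a streaming filter that drops malformed rows and normalizes fields before records reach central storage. The record format and field names here are hypothetical.

```python
# Cleaning records as they stream in during import: drop malformed rows,
# drop non-numeric measurements, and normalize names before storage.
def clean_stream(rows):
    for row in rows:
        parts = [p.strip() for p in row.split(",")]
        if len(parts) != 2:
            continue  # drop malformed rows
        name, value = parts
        if not value.replace(".", "", 1).isdigit():
            continue  # drop non-numeric measurements
        yield name.lower(), float(value)

incoming = ["Sensor-A, 21.5", "broken-row", "Sensor-B, n/a", "SENSOR-A, 22.0"]
print(list(clean_stream(incoming)))
# [('sensor-a', 21.5), ('sensor-a', 22.0)]
```

Writing the cleaner as a generator keeps memory use flat regardless of input volume, which matters at the import rates the text describes.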
- Mining
- Unlike the statistics and analysis step, data mining generally has no preset topic. It runs various algorithms over the existing data to make predictions and satisfy higher-level analysis needs. Typical algorithms include K-Means for clustering, SVM for statistical learning, and naive Bayes for classification, with Apache Mahout on Hadoop as a main tool. The features and challenges of this step are that the mining algorithms are complex and the data and computation volumes involved are large; moreover, common data mining algorithms are mostly single-threaded [2].
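To make the K-Means example concrete, here is a bare-bones sketch of its iterative assign/update loop in pure Python, on one-dimensional points with k=2. This is only an illustration of the algorithm's shape; production mining would use a library such as Mahout or scikit-learn on real, high-dimensional data.

```python
# Bare-bones K-Means (1-D, k=2): repeat two steps until stable --
# assign each point to its nearest center, then move each center
# to the mean of its assigned points.
def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins the nearest center's cluster.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to its cluster's mean
        # (an empty cluster keeps its old center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(kmeans_1d(data, centers=[0.0, 10.0]))  # centers converge near 1.0 and 9.0
```

Each iteration touches every point, which is why mining over big data is so computation-heavy, and the serial loop shows concretely what "single-threaded" means for these algorithms.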