What Is the Data Mining Process?
Data mining refers to the process of using algorithms to search for information hidden in large amounts of data.
- Chinese name: Data mining
- Foreign name: Data mining
- Aliases: Data exploration, data mining
- Subject: Computer science
- Application areas: Information retrieval, data analysis, pattern recognition, etc.
- Related fields: Artificial intelligence, databases
- Data mining is usually associated with computer science and achieves its goals through methods such as statistics, online analytical processing (OLAP), information retrieval, machine learning, expert systems (which rely on past rules of thumb), and pattern recognition. [1]
Introduction to data mining
- Necessity is the mother of invention. In recent years, data mining has attracted great attention from the information industry, mainly because large amounts of data are widely available and there is an urgent need to turn such data into useful information and knowledge. The information and knowledge gained can be applied widely, including in business management, production control, market analysis, engineering design, and scientific exploration. [2]
- Data mining is a hot topic in artificial intelligence and database research. It is the non-trivial process of revealing hidden, previously unknown, and potentially valuable information from large amounts of data in a database. Data mining is a decision-support process: based mainly on artificial intelligence, machine learning, pattern recognition, statistics, databases, and visualization technologies, it analyzes enterprise data with a high degree of automation, makes inductive inferences, and mines potential patterns from the data to help decision makers adjust market strategies, reduce risks, and make sound decisions. The knowledge discovery process consists of three phases: data preparation; data mining; and the expression and interpretation of results. Data mining can interact with users or with a knowledge base. [1]
- Data mining is a technique for analyzing each piece of data in a large data set and finding the regularities within it. There are three main steps: data preparation, regularity finding, and regularity presentation. Data preparation selects the required data from the relevant data sources and integrates it into a data set for mining; regularity finding discovers, by some method, the rules contained in the data set; regularity presentation displays the discovered rules in a way the user can understand as far as possible (for example, through visualization). Data mining tasks include association analysis, cluster analysis, classification analysis, anomaly analysis, specific-group analysis, and evolution analysis. [1]
- Data mining uses ideas from the following areas: sampling, estimation, and hypothesis testing from statistics; search algorithms, modeling techniques, and learning theory from artificial intelligence, pattern recognition, and machine learning. Data mining has also quickly absorbed ideas from other fields, including optimization, evolutionary computing, information theory, signal processing, visualization, and information retrieval. Still other areas play an important supporting role: in particular, database systems are needed to provide effective storage, indexing, and query-processing support, and techniques from high-performance (parallel) computing are often important for processing large data sets. Distributed techniques also help handle huge amounts of data, and they are even more important when the data cannot be processed in one place. [1]
Background of data mining
- In the 1990s, with the widespread application of database systems and the rapid development of network technology, database technology entered a completely new stage: from managing only simple data to managing graphics, images, audio, video, electronic archives, web pages, and other complex data types, with ever-increasing data volumes. While databases provide us with rich information, they also display the obvious characteristics of massive information. In the era of information explosion, massive information brings many negative effects, the most important being the difficulty of extracting effective information: too much useless information inevitably produces information distance (information state transition distance, a measure of the obstacle to the information state transition of a thing, abbreviated DIST or DIT) and the loss of useful knowledge. This is what John Naisbitt called the "information-rich but knowledge-poor" dilemma. People therefore urgently hope to analyze massive data in depth, and to find and extract the information hidden in it, in order to make better use of the data. However, the entry, query, and statistics functions of a database system alone cannot discover the relationships and rules that exist in the data, cannot predict future trends from existing data, and lack the means to mine the hidden knowledge behind the data. It is under these conditions that data mining technology came into being. [3]
Data mining objects
- The type of data can be structured, semi-structured, or even heterogeneous. The methods of discovering knowledge can be mathematical or non-mathematical, inductive or deductive. The discovered knowledge can be used for information management, query optimization, decision support, and maintenance of the data itself. [4]
- The object of data mining can be any type of data source: a relational database, which contains structured data; or a data warehouse, text, multimedia data, spatial data, time-series data, or web data, which contain semi-structured or even heterogeneous data. [4]
Data mining steps
- Before implementing data mining, it is necessary to decide what steps to take, what to do at each step, and what goals must be met. A good plan ensures that data mining is carried out in an orderly way and succeeds. Many software vendors and data mining consultants provide data mining process models to guide their users step by step through data mining, for example, 5A from SPSS and SEMMA from SAS. [3]
- The steps of a data mining process model mainly include defining the problem, establishing a data mining database, analyzing the data, preparing the data, building the model, evaluating the model, and implementation. Let us take a closer look at each step: [3]
- (1) Define the problem. The first and most important requirement before starting knowledge discovery is to understand the data and the business problem. There must be a clear definition of the goal, that is, a decision about what exactly you want to do. For example, when you want to improve the utilization of e-mail, what you want to do may be "improve user utilization" or "improve the value of a single use by a user"; the models built to solve these two problems are almost completely different, so a decision must be made.
- Figure 1 System model of data mining [3]
- (2) Establish a data mining database. Establishing a data mining database includes the following steps: data collection, data description, selection, data quality assessment and data cleaning, merging and integration, building metadata, loading the data mining database, and maintaining the data mining database. [3]
- (3) Analyze the data. The purpose of analysis is to find the data fields that have the greatest impact on the predicted output and to decide whether derived fields need to be defined. If the data set contains hundreds or thousands of fields, browsing and analyzing the data is a very time-consuming and tiring task; in that case you need a good interface and powerful tool software to assist you. [3]
- (4) Prepare the data. This is the last data preparation step before building the model. It can be divided into four parts: selecting variables, selecting records, creating new variables, and converting variables. [3]
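A minimal sketch of these four parts, assuming a hypothetical customer table in pandas (the table and column names such as `income` and `debt` are illustrative, not from the source):

```python
import pandas as pd

# Hypothetical customer data; the columns are illustrative only.
df = pd.DataFrame({
    "age":    [23, 45, 31, 52, 38],
    "income": [28000, 61000, 45000, 80000, 52000],
    "debt":   [5000, 12000, 7000, 30000, 9000],
    "label":  [0, 1, 0, 1, 1],
})

# (a) Select variables: keep only the fields relevant to the problem.
df = df[["age", "income", "debt", "label"]]

# (b) Select records: drop rows outside the population of interest.
df = df[df["age"] >= 18]

# (c) Create new (derived) variables: e.g. a debt-to-income ratio.
df["debt_ratio"] = df["debt"] / df["income"]

# (d) Convert variables: e.g. scale income to zero mean, unit variance.
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

print(df.head())
```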
- (5) Build the model. Modeling is an iterative process: you need to examine different models closely to determine which one is most useful for the business problem at hand. Training and testing a data mining model requires splitting the data into at least two parts, one to build the model and the remaining data to test and verify it. Sometimes a third data set, called the validation set, is needed, because the test set may be affected by the characteristics of the model, and a separate data set is required to verify the model's accuracy. [3]
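A minimal sketch of this splitting scheme with scikit-learn, using a synthetic data set in place of real enterprise data; the 60/20/20 split ratio and the model choice are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared data set.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Split off a test set first, then carve a validation set out of the rest,
# giving roughly 60% training, 20% validation, 20% test.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(max_depth=4)  # one candidate model among several
model.fit(X_train, y_train)                  # build on the training part only

print("validation accuracy:", model.score(X_val, y_val))
print("test accuracy:      ", model.score(X_test, y_test))
```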
- (6) Evaluate the model. After the model is built, the results must be evaluated and the value of the model explained. The accuracy obtained on the test set is only meaningful for the data used to build the model; in practical applications, it is necessary to further understand the types of errors and the costs associated with them. Experience shows that an effective model is not necessarily a correct one, the direct reason being the various assumptions implicit in building the model, so it is important to test the model directly in the real world: first apply it in a small area, obtain test data, and then roll it out to a larger area once the results are satisfactory. [3]
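A sketch of examining error types and their associated costs via a confusion matrix; the assumption that a false negative costs ten times as much as a false positive is purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()

# Hypothetical costs: a false negative (e.g. an approved bad loan) is
# assumed to cost ten times as much as a false positive (a rejected good one).
print("false positives:", fp, "false negatives:", fn)
print("estimated misclassification cost:", fp * 1.0 + fn * 10.0)
```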
- (7) Implementation. After the model is built and verified, it can be used in two main ways: the first is to provide a reference for analysts; the other is to apply the model to different data sets. [3]
Data mining analysis methods
- Data mining is divided into supervised data mining and unsupervised data mining. Supervised data mining uses the available data to build a model that describes one particular attribute; unsupervised data mining instead looks for some kind of relationship among all the attributes. Classification, estimation, and prediction below are supervised tasks; association rules and clustering are unsupervised (see the sketch after this list).
- 1. Classification. It first selects a training set that has already been classified from the data, uses data mining techniques on that training set to build a classification model, and then uses the model to classify the unclassified data. [5]
- 2. Estimation. Estimation is similar to classification, but its final output is a continuous value, and the quantity being estimated is not predetermined. Estimation can be used as a preparatory step for classification. [5]
- 3. Prediction. Prediction is carried out by classification or estimation: a model is obtained by training for classification or estimation, and if that model achieves a high accuracy rate on the test samples, it can be used to predict the unknown variables of new samples. [5]
- 4. Affinity grouping or association rules. The goal is to discover which things always happen together. [5]
- 5. Clustering. Clustering is a method of automatically finding and establishing grouping rules: it judges the similarity between samples and assigns similar samples to the same cluster. [5]
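The sketch below contrasts the two families on synthetic data: classification as a supervised task and clustering as an unsupervised one. The data set and model choices are illustrative, not from the source:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# Supervised (classification): a model is trained on labelled examples
# and then applied to unclassified data.
clf = KNeighborsClassifier().fit(X[:200], y[:200])
print("predicted classes:", clf.predict(X[200:205]))

# Unsupervised (clustering): samples are grouped by similarity alone,
# without any predefined labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster assignments:", km.labels_[:5])
```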
Data mining success stories
- 1. Data mining helps Credilogros Cía Financiera SA improve customer credit scores
- Credilogros Cía Financiera SA is the fifth-largest credit company in Argentina, with an estimated asset value of $95.7 million. For Credilogros, it is important to identify the potential risks associated with prospective customers in order to minimize the risk it assumes. [6]
- The company's first goal was to create a decision engine that interacted with the company's core system and two credit reporting company systems to process credit applications. At the same time, Credilogros was looking for a custom risk scoring tool for the low-income customer groups it serves. Beyond these, it needed a solution that could operate in real time at any of its 35 branch offices and more than 200 associated points of sale, including retail appliance chains and mobile phone retailers. [6]
- In the end, Credilogros chose SPSS Inc.'s data mining software PASW Modeler because it could be flexibly and easily integrated into Credilogros' core information system. By implementing PASW Modeler, Credilogros reduced the time needed to process credit data and deliver a final credit score to under 8 seconds, allowing the organization to approve or reject credit requests quickly. The decision engine also enables Credilogros to minimize the identification documents each customer must provide; in exceptional cases, only a single proof of identity is required to approve credit. In addition, the system provides monitoring functions. Credilogros currently processes an average of 35,000 applications per month with PASW Modeler, which helped it reduce loan payment defaults by 20% after just 3 months of implementation. [6]
- 2. Data mining helps DHL track container temperature in real time
- DHL is a global market leader in the international express and logistics industry, providing express delivery; land, water, and air transportation; contract logistics solutions; and international mail services. DHL's international network connects more than 220 countries and regions and employs more than 285,000 people. Under pressure from the U.S. FDA to ensure that the temperature of drug shipments meets standards throughout delivery, DHL's pharmaceutical customers strongly demanded more reliable and affordable options. This required DHL to track container temperatures in real time during all stages of delivery. [6]
- Although the information generated by the data logger method is accurate, it cannot be delivered in real time, so neither the customer nor DHL can take preventive or corrective action when temperature deviations occur. DHL's parent company Deutsche Post World Net (DPWN) therefore formulated a plan through its Technology and Innovation Management (TIM) group to use RFID technology to track shipment temperatures at different points in time, and drew up, with IBM Global Business Consulting Services, a process framework that determined the key functional parameters of the service. DHL benefited in two ways: for end customers, it enables pharmaceutical customers to respond in advance to problems that occur during shipping, and it comprehensively and effectively improves shipping reliability at a compellingly low cost; for DHL itself, it improves customer satisfaction and loyalty, lays a solid foundation for maintaining competitive differentiation, and becomes an important new source of revenue growth. [6]
- 3. Application in the telecommunications industry
- Price competition is unprecedentedly fierce and the growth of voice services is slowing; the fast-growing Chinese mobile communications market faces unprecedented pressure to survive. The acceleration of reform in China's telecommunications industry has created a new competitive situation, and the breadth and intensity of competition in the mobile market will further increase, particularly in the field of group customers. Mobile informatization and group customers have become the new engines with which operators cope with competition and achieve sustained growth. [6]
- With three-way full-service competition nationwide and the issuance of 3G licenses, providing integrated information solutions for group customers is the general trend for operators, and mobile informatization will become a leading force in the information services field. Traditional mobile operators therefore face the challenge of shifting from traditional personal business to simultaneously expanding information services for group customers. How to respond to internal and external challenges, and quickly use mobile information services as a competitive tool for converged services so as to expand the group customer market and remain invincible in emerging markets, is an urgent problem that traditional mobile operators need to solve. [6]
Classic data mining algorithms
- At present, the main data mining algorithms include the neural network method, the decision tree method, genetic algorithms, the rough set method, the fuzzy set method, and the association rule method. [4]
Neural network method
- The neural network method simulates the structure and function of the biological nervous system. It is a non-linear prediction model that learns through training. It treats each connection as a processing unit in an attempt to simulate the function of neurons in the human brain, and it can be used for classification, clustering, feature mining, and other data mining tasks.
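As an illustration of a non-linear prediction model learned through training, the following sketch fits a small multilayer network with scikit-learn; the data set and network sizes are arbitrary assumptions:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# A non-linearly separable toy data set.
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A small multilayer network trained iteratively; each connection weight
# acts as part of a processing unit adjusted during learning.
net = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```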
Decision tree method
- A decision tree classifies data according to differences in the values of the target variable; the process of classifying data through a series of rules resembles a tree-shaped flowchart. The most typical algorithm is ID3, proposed by J. R. Quinlan in 1986, on which the extremely popular C4.5 algorithm was later built. The advantages of the decision tree method are that the decision-making process is visible, it does not require a long construction process, its descriptions are simple and easy to understand, and classification is fast; the disadvantage is that it is difficult to find rules based on combinations of multiple variables. The decision tree method is good at processing non-numerical data and is especially suitable for large-scale data processing. Decision trees provide a way of showing rules of the form "under what conditions you get what value". For example, in a loan application, a tree can be used to judge the risk of the application, as in the sketch below. [4]
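A minimal sketch of the loan-risk example with scikit-learn's decision tree; the features, labels, and values are hypothetical, and `export_text` is used only to show how the learned rules remain directly readable:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical loan applications: [income (k), debt (k), years employed].
X = [[30, 10, 1], [80, 5, 10], [45, 20, 3], [90, 2, 12],
     [25, 15, 0], [60, 8, 6], [35, 18, 2], [75, 4, 9]]
y = [1, 0, 1, 0, 1, 0, 1, 0]   # 1 = risky application, 0 = safe

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# The learned rules are directly visible, e.g. "if income <= 52.5 then risky".
print(export_text(tree, feature_names=["income", "debt", "years_employed"]))
print(tree.predict([[50, 12, 4]]))   # classify a new application
```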
Genetic algorithm
- The genetic algorithm simulates reproduction, mating, and gene mutation as they occur in natural selection and heredity. It is a machine learning method based on evolutionary theory that uses operations such as genetic combination, genetic crossover and mutation, and natural selection to generate rules. Its basic principle is "survival of the fittest", and it has the properties of implicit parallelism and easy integration with other models. Its main advantages are that it can process many data types and can process various kinds of data in parallel; its disadvantages are that it requires many parameters, encoding problems is difficult, and the computation involved is usually heavy. Genetic algorithms are often used to optimize neural networks and can solve problems that other techniques find difficult. [4]
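A toy genetic algorithm, shown only to make the selection, crossover, and mutation loop concrete; the objective function, population size, and mutation rate are arbitrary choices:

```python
import math
import random

def fitness(x):
    """Objective to maximize: f(x) = x * sin(x) on the interval [0, 10]."""
    return x * math.sin(x)

POP, GENS, MUT = 30, 60, 0.1
population = [random.uniform(0, 10) for _ in range(POP)]

for _ in range(GENS):
    # Selection: keep the fitter half of the population ("survival of the fittest").
    population.sort(key=fitness, reverse=True)
    parents = population[:POP // 2]

    # Crossover and mutation: breed children from random parent pairs.
    children = []
    while len(children) < POP - len(parents):
        a, b = random.sample(parents, 2)
        child = (a + b) / 2                          # crossover: blend two parents
        if random.random() < MUT:                    # mutation: small random tweak
            child += random.gauss(0, 0.5)
        children.append(min(max(child, 0.0), 10.0))  # clamp to the domain
    population = parents + children

best = max(population, key=fitness)
print(f"best x = {best:.3f}, f(x) = {fitness(best):.3f}")
```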
Rough set method
- The rough set method, also called rough set theory, was proposed by the Polish mathematician Z. Pawlak in the early 1980s. It is a new mathematical tool for dealing with vague, imprecise, and incomplete problems, and it can handle data reduction, data correlation discovery, evaluation of data significance, and so on. Its advantages are that the algorithms are simple, no prior knowledge about the data is required during processing, and the inherent regularities of a problem can be found automatically; its disadvantage is that it is difficult to process continuous attributes directly, so attributes must first be discretized. The discretization of continuous attributes is therefore a difficulty restricting the practical application of rough set theory. Rough set theory is mainly applied to problems such as approximate reasoning, digital logic analysis and simplification, and the building of predictive models. [4]
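A small sketch of the core rough set idea, computing the lower and upper approximations of a target concept from the equivalence classes induced by attribute values; the objects and attributes are hypothetical:

```python
from collections import defaultdict

# Objects described by discrete attributes (values are illustrative).
objects = {
    "o1": ("high", "yes"), "o2": ("high", "yes"),
    "o3": ("low",  "no"),  "o4": ("low",  "yes"),
    "o5": ("low",  "yes"),
}
target = {"o1", "o2", "o4"}   # the concept we want to approximate

# Equivalence classes: objects indiscernible by their attribute values.
classes = defaultdict(set)
for name, attrs in objects.items():
    classes[attrs].add(name)

# Lower approximation: classes entirely inside the target (certain members).
lower = set().union(*(c for c in classes.values() if c <= target))
# Upper approximation: classes overlapping the target (possible members).
upper = set().union(*(c for c in classes.values() if c & target))

print("lower:", sorted(lower))   # ['o1', 'o2']
print("upper:", sorted(upper))   # ['o1', 'o2', 'o4', 'o5']
```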
Fuzzy set method
- The fuzzy set method uses fuzzy set theory to carry out fuzzy evaluation, fuzzy decision-making, fuzzy pattern recognition, and fuzzy cluster analysis. Fuzzy set theory describes the attributes of fuzzy things by degrees of membership, as in the sketch below. The higher the complexity of a system, the stronger its fuzziness. [4]
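A minimal sketch of describing an attribute by membership degree, using a triangular membership function for a fuzzy set such as "comfortable temperature"; the shape and breakpoints are illustrative assumptions:

```python
def triangular(x, a, b, c):
    """Membership degree of x in a triangular fuzzy set with support (a, c) and peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Degree to which each temperature belongs to the fuzzy set "comfortable",
# peaking at 22 degrees Celsius.
for t in (15, 20, 22, 26, 30):
    print(t, "->", round(triangular(t, 16, 22, 28), 2))
```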
Association rule method
- Association rules reflect the interdependence or correlation between things. The most famous algorithm is Apriori, proposed by R. Agrawal et al. The idea of the algorithm is to first find all frequent itemsets whose support is at least the predefined minimum support, and then generate strong association rules from those frequent itemsets. The minimum support and the minimum confidence are the two thresholds given for finding meaningful association rules. In this sense, the purpose of data mining is to mine, from the source database, association rules that satisfy the minimum support and minimum confidence, as in the sketch below. [4]
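A compact sketch of the Apriori idea on a toy transaction set: a level-wise search for frequent itemsets above a minimum support, followed by generation of rules above a minimum confidence. The transactions and thresholds are made up for illustration:

```python
from itertools import combinations

# Toy market-basket transactions (made up for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
MIN_SUPPORT, MIN_CONFIDENCE = 0.6, 0.8

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level-wise search: frequent k-itemsets are joined from (k-1)-itemsets.
frequent = []
current = [frozenset([i]) for i in {i for t in transactions for i in t}]
while current:
    current = [s for s in current if support(s) >= MIN_SUPPORT]
    frequent.extend(current)
    current = list({a | b for a in current for b in current
                    if len(a | b) == len(a) + 1})

# Strong rules: confidence(A -> B) = support(A and B) / support(A).
for itemset in (s for s in frequent if len(s) > 1):
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            confidence = support(itemset) / support(antecedent)
            if confidence >= MIN_CONFIDENCE:
                print(sorted(antecedent), "->", sorted(itemset - antecedent),
                      f"(confidence = {confidence:.2f})")
```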
Problems with data mining
- Data mining also raises privacy issues. For example, an employer could access medical records to screen out people with diabetes or severe heart disease with the intention of cutting insurance expenses. However, this approach can lead to ethical and legal problems. [6]
- The mining of government and business data may also involve issues such as national security or trade secrets, which is a big challenge for confidentiality. [6]
- There are many legitimate uses of data mining; for example, the relationship between a drug and its side effects can be found in the database of a patient group. Such a relationship may appear in only one in a thousand people, but pharmacology-related projects can use this method to reduce the number of patients who have adverse reactions to a drug, possibly saving lives; nonetheless, databases may still be abused. [6]
- Data mining makes it possible to discover information that cannot be obtained in other ways, but it must be regulated and should be used with appropriate guidance. [6]
- If the data is collected from a specific individual, then there are issues related to confidentiality, law, and ethics. [6]