What Is Data Classification?
Survey analysis is built on data, and data can be divided into continuous variables and categorical variables. Data classification groups data that share a common attribute or characteristic and distinguishes data by the attributes or characteristics of their category. In other words, information with the same content or nature, or information that requires unified management, is grouped together; information that differs and needs to be managed separately is set apart; and the relationships among the resulting groups are then determined to form a systematic classification system.
- The basic principles of data classification are as follows:
- According to different classification methods, statistical data can be divided into the following types:
Data classification by measurement level
- According to the level of measurement, statistical data can be divided into nominal (categorical) data, ordinal data, interval data, and ratio data.
- 1. Nominal (categorical) data. This is the lowest level of measurement. Data are sorted by category attribute, and the categories stand in an equal, parallel relationship. Such data carry no quantitative information, and the categories cannot be ranked. For example, a shopping mall records the clothing colors its customers prefer as red, white, and yellow; these colors are categorical data. Likewise, dividing people into male and female by sex yields categorical data. Although categorical data are expressed as categories, numbers or codes can stand in for the categories to simplify statistical processing, for example 1 for female and 2 for male. These numbers are only labels: they do not imply any order or magnitude, and arithmetic on them is meaningless. Whichever coding scheme is used, the information the data contain stays the same. The main numerical operation on categorical data is counting the frequency and relative frequency of each category. [3]
- 2. Ordinal data (ordered data). This is an intermediate level of measurement. Ordinal data not only divide observations into categories but also allow the categories to be ranked against one another. In other words, the main difference from categorical data is that ordinal categories can be compared by rank. A person's education level is an example. Numeric codes can again represent the categories: illiterate or semi-illiterate = 1, elementary school = 2, junior high school = 3, high school = 4, college = 5, master's degree = 6, doctorate = 7. Sorting by code shows the differences in education level clearly. The size of those differences cannot be measured from the gaps between codes, but their order can be determined; that is, inequality comparisons on the coded values are valid. [3]
- 3. Interval data (fixed-distance data). Interval data are actual measured values expressed in a unit, such as temperature in degrees Celsius or test scores. Here we can tell not only that two values differ but also exactly how far apart they are, by addition and subtraction. Interval data are therefore considerably more precise than categorical and ordinal data: they measure the actual distance between categories or ranks. For example, if A scores 80 points in English and B scores 85, B's score is 5 points higher than A's. [3]
- 4. Ratio data (fixed-ratio data). This is the highest level of measurement. Like interval (fixed-distance) data, ratio data are actual measured values. The only difference is that ratio data have an absolute zero, whereas the zero point of interval data is set by convention. Ratio data therefore support not only comparison, addition, and subtraction but also multiplication and division. [3]
- In statistical analysis, it is very important to distinguish the types of data. Different types of data support different operations and therefore play different roles in an analysis. [3]
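The four measurement levels can be illustrated with a minimal Python sketch that applies only the operations each level supports. The colors, education codes, and English scores follow the examples in the text; the sample lists, the `Education` enum name, and the weights used for the ratio case are hypothetical values added for illustration.

```python
from collections import Counter
from enum import IntEnum

# Nominal: categories are labels only, so counting is the meaningful operation.
favorite_colors = ["red", "white", "red", "yellow", "red", "white"]  # made-up sample
counts = Counter(favorite_colors)
total = sum(counts.values())
proportions = {color: n / total for color, n in counts.items()}

# Ordinal: codes carry rank, so comparison and sorting are valid,
# but the gaps between codes are not measured quantities.
class Education(IntEnum):  # hypothetical name; codes follow the text
    ILLITERATE_OR_SEMI_ILLITERATE = 1
    ELEMENTARY = 2
    JUNIOR_HIGH = 3
    HIGH_SCHOOL = 4
    COLLEGE = 5
    MASTER = 6
    DOCTORATE = 7

people = [Education.COLLEGE, Education.ELEMENTARY, Education.DOCTORATE]
ranked = sorted(people)
master_above_high_school = Education.MASTER > Education.HIGH_SCHOOL

# Interval: differences are meaningful (A scores 80, B scores 85).
score_a, score_b = 80, 85
score_gap = score_b - score_a

# Ratio: an absolute zero exists, so division is meaningful.
# Weights in kilograms are an assumed example, not taken from the text.
weight_a_kg, weight_b_kg = 30.0, 60.0
weight_ratio = weight_b_kg / weight_a_kg

print(counts, proportions)
print(ranked, master_above_high_school)
print(score_gap, weight_ratio)
```

Using `IntEnum` here is only a convenience for getting ordered comparisons. Python would also allow arithmetic on these codes, so keeping to rank comparisons for ordinal data is left to the analyst.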
Data classification by source
- There are two main sources of data. One is raw data obtained through direct surveys, generally called first-hand or direct statistical data. The other is data collected by others and published after processing and summarizing, often referred to as second-hand or indirect statistical data. [3]
Data classification by time
- 1. Time series data. Data collected at different points in time, reflecting how a phenomenon changes over time.
- 2. Cross-sectional data (sectional data). Data collected at the same or nearly the same point in time, describing the state of a phenomenon at that moment. [3]
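The distinction between the two can be sketched in Python by slicing one small set of observations in two ways. The regions, years, sales figures, and the function names `time_series` and `cross_section` are all hypothetical and serve only to illustrate the idea.

```python
# Hypothetical observations: (region, year, sales). All values are made up.
observations = [
    ("North", 2021, 120), ("North", 2022, 135), ("North", 2023, 150),
    ("South", 2021, 90), ("South", 2022, 95), ("South", 2023, 110),
]

def time_series(rows, region):
    """One unit followed across time, ordered by year."""
    return sorted((year, value) for r, year, value in rows if r == region)

def cross_section(rows, year):
    """Many units observed at the same point in time."""
    return {r: value for r, y, value in rows if y == year}

north_over_time = time_series(observations, "North")
snapshot_2022 = cross_section(observations, 2022)

print(north_over_time)
print(snapshot_2022)
```

The same records yield a time series when one unit is followed across years and a cross-section when all units are read at a single year.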