What Is Pattern Matching?
Pattern matching is a basic string operation in data structures: given a pattern string, find every substring of a target string that is identical to the pattern.
Pattern matching concept
- Suppose P is a given string and T is the string to be searched. The task is to find all substrings of T that are identical to P. P is called the pattern and T the target. If T contains one or more substrings equal to P, the match succeeds and the position of each such substring in T is reported; otherwise, the match fails. [1]
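- The definition above can be illustrated with a minimal C sketch that reports every position at which P occurs in T, including overlapping occurrences. It relies on the standard library function strstr; the function name find_all and its output-array interface are assumptions made for this illustration and do not come from the original text.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: stores the 0-based start index of every occurrence
 * of pattern p in target t into out[] (at most max_out entries) and
 * returns the total number of occurrences found. */
size_t find_all(const char *t, const char *p, size_t *out, size_t max_out)
{
    size_t count = 0;
    if (*p == '\0')
        return 0;                      /* assumption: an empty pattern never matches */
    const char *hit = strstr(t, p);
    while (hit != NULL) {
        if (count < max_out)
            out[count] = (size_t)(hit - t);
        count++;
        hit = strstr(hit + 1, p);      /* advance by one so overlaps are found */
    }
    return count;
}
```

- For T = "abababa" and P = "aba", find_all returns 3 and records the positions 0, 2, and 4. A return value of 0 corresponds to a failed match.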
Common pattern matching algorithms
Naive pattern matching algorithm
- Algorithm idea: compare the first character of the target string with the first character of the pattern string. If they are equal, continue with the following characters of both strings; otherwise, restart the comparison from the second character of the target string and the first character of the pattern string. Repeat until every character of the pattern string equals, in order, a contiguous run of characters in the target string, which is a successful match. If the target string is exhausted first, the match fails.
- If the pattern string has length m and the target string has length n, the worst case is that every attempt fails only at its last character. Each attempt then costs m comparisons and there are at most n-m+1 attempts, so the total number of comparisons is at most m(n-m+1) and the time complexity of the naive algorithm is O(mn). After each failed attempt the naive algorithm backtracks in the target string and compares again characters it has already examined, which hurts its efficiency. For this reason, practical applications mostly use matching algorithms that avoid this backtracking. The KMP algorithm and the BM algorithm were both designed to avoid this redundant work. [2]
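- The worst-case bound m(n-m+1) can be checked with a small instrumented version of the naive algorithm. The sketch below only counts character comparisons; the function name naive_comparisons is a hypothetical helper introduced for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: runs the naive algorithm on target t and pattern p,
 * stops at the first successful match, and returns the number of character
 * comparisons that were performed. */
long naive_comparisons(const char *t, const char *p)
{
    size_t n = strlen(t), m = strlen(p);
    long comparisons = 0;
    if (m == 0 || m > n)
        return 0;
    for (size_t k = 0; k + m <= n; k++) {   /* at most n - m + 1 attempts */
        size_t j = 0;
        while (j < m) {
            comparisons++;
            if (t[k + j] != p[j])
                break;                       /* mismatch: restart at k + 1 */
            j++;
        }
        if (j == m)
            break;                           /* successful match */
    }
    return comparisons;
}
```

- With T = "aaaaaaaaaa" (n = 10) and P = "aaab" (m = 4), every one of the 7 attempts fails on its fourth comparison, so the function returns 4 * 7 = 28, which equals m(n-m+1).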
KMP matching algorithm
- The Knuth-Morris-Pratt algorithm (KMP for short) is an improved algorithm proposed jointly by D. E. Knuth, J. H. Morris, and V. R. Pratt. It eliminates the backtracking of the naive pattern matching algorithm while still finding the pattern in the target string.
- Algorithm idea:
- Let the target string be s and the pattern string be t, and let i and j be indices into s and t, both initially 0. For j > 0, next[j] is the length of the longest proper prefix of t[0..j-1] that is also a suffix of it, and next[0] = -1.
- Compare s[i] and t[j]. If s[i] = t[j], increase both i and j by 1; otherwise, leave i unchanged and set j = next[j]. Equivalently, s stays in place and t slides to the right until t[next[j]] is aligned with s[i].
- Compare s[i] and t[j] again. If they are equal, increase both indices by 1; otherwise, set j = next[j] once more (the pattern string slides further to the right) and compare again.
- This continues until one of the following two cases occurs:
- 1) After some step j = next[j], s[i] = t[j]; both indices are increased by 1 and matching continues.
- 2) j falls back to -1. Both indices are then increased by 1, so the next comparison is between s[i+1] and t[0].
- Let the length of the pattern P be m and the length of the target T be n. The time complexity of the KMP matching algorithm can be analyzed as follows:
- The matching algorithm consists of two routines, Find() and GenKMPNext(). Find() contains one loop whose index j starts at 0 and increases by exactly 1 on each iteration; the loop ends when j equals n, so it executes n times. GenKMPNext() appears to contain two nested loops, which suggests a time complexity of O(m^2), but this is not the case. Its outer loop executes exactly m-1 times. In it, j starts at -1 and is increased by 1 once per outer iteration, while the inner loop only decreases j through the statement j = next[j] and never takes it below -1. The total number of executions of j = next[j] is therefore at most the number of times j is incremented in the outer loop, that is, at most m-1 over the whole run of GenKMPNext().
- In summary, for a pattern P of length m and a target T of length n, the time complexity of the KMP algorithm is O(m+n).
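- The counting argument for the table construction can be checked directly. The source does not list GenKMPNext(), so the loop below is an assumed reconstruction with the structure the analysis describes (an outer loop of m-1 iterations, j starting at -1, and an inner loop that executes j = next[j]). The name gen_next_counting and the fallback counter are hypothetical additions for this illustration.

```c
#include <string.h>

/* Hypothetical helper: fills next[0..m-1] for pattern p (with next[0] = -1)
 * and returns how many times the statement j = next[j] was executed. */
int gen_next_counting(const char *p, int *next)
{
    int m = (int)strlen(p);
    int fallbacks = 0;
    int j = -1;
    if (m == 0)
        return 0;
    next[0] = -1;
    for (int i = 1; i < m; i++) {            /* outer loop: exactly m - 1 iterations */
        /* on entry j == next[i - 1]; fall back until p[i - 1] extends a border */
        while (j >= 0 && p[i - 1] != p[j]) {
            j = next[j];                     /* j decreases but never drops below -1 */
            fallbacks++;
        }
        j++;                                 /* one increment per outer iteration */
        next[i] = j;
    }
    return fallbacks;
}
```

- For the pattern "aabaaab" (m = 7) this produces next = -1, 0, 1, 0, 1, 2, 2 and executes j = next[j] three times, which is within the bound m-1 = 6.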
BM matching algorithm
- The BM (Boyer-Moore) algorithm is an exact string matching algorithm (as opposed to fuzzy matching). It compares characters from right to left and uses two heuristic rules, the bad character rule and the good suffix rule, to decide how far to shift the pattern to the right. The basic flow of the BM algorithm is as follows. Let the text string be T and the pattern string be P. First align P with the left end of T, and then compare from right to left, as shown in the following figure: [3]
- Whenever a comparison finds a mismatch, the BM algorithm uses the two heuristic rules, the bad character rule and the good suffix rule, to compute the distance by which the pattern string shifts to the right. This repeats until the whole matching process ends.
- 1) Bad character rule:
- While the BM algorithm scans from right to left, suppose a text character x is found to mismatch. There are two cases:
- If x does not occur in the pattern P, then no alignment of P that covers this x can match, so P can be shifted entirely past x.
- If x occurs in P, shift P so that the rightmost occurrence of x in P is aligned with x in the text.
- As a formula, let Skip(x) be the distance by which P shifts to the right, m the length of the pattern string P, and max(x) the rightmost position of the character x in P (counted from 1). Then Skip(x) = m if x does not occur in P, and Skip(x) = m - max(x) if x occurs in P, where the distance is measured for a mismatch at the last character of P.
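- The Skip table can be built in one pass over the pattern. The sketch below assumes 8-bit characters and uses 0-based array indices, so the 1-based position max(x) corresponds to index i + 1. The names build_skip, bad_char_shift, and ALPHABET_SIZE are hypothetical. The adjustment in bad_char_shift for a mismatch that is not at the last pattern character is a common way to apply the table and is an assumption, since the text does not spell it out.

```c
#include <string.h>

#define ALPHABET_SIZE 256   /* assumption: one entry per unsigned char value */

/* Fills skip[c] = m for characters absent from p, and m - max(c) otherwise,
 * where max(c) is the 1-based rightmost position of c in p. */
void build_skip(const char *p, int skip[ALPHABET_SIZE])
{
    int m = (int)strlen(p);
    for (int c = 0; c < ALPHABET_SIZE; c++)
        skip[c] = m;
    for (int i = 0; i < m; i++)
        skip[(unsigned char)p[i]] = m - (i + 1);   /* later positions overwrite earlier ones */
}

/* Shift of the pattern when text character x mismatches pattern index j
 * (0-based). The m - 1 - j characters already matched are subtracted, and
 * the result is clamped so the pattern always moves at least one position. */
int bad_char_shift(const int skip[ALPHABET_SIZE], int m, int j, char x)
{
    int shift = skip[(unsigned char)x] - (m - 1 - j);
    return shift > 1 ? shift : 1;
}
```

- For P = "example" (m = 7), Skip('p') = 2, Skip('x') = 5, and Skip('z') = 7 because 'z' does not occur in P. A mismatch against 'p' at the last pattern character therefore shifts P by 2.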
- 2) Good suffix rule:
- If a mismatch is found after some characters have already been matched successfully, there are two cases:
- If the matched part P' at position t in P also occurs at another position t' in P, and the character preceding position t' differs from the character preceding position t, shift P to the right so that the occurrence at t' moves to where t was.
- If P' does not occur anywhere else in P, find the longest prefix x of P that equals a suffix P'' of P', and shift P to the right so that x is aligned with P''.
- As a formula, let Shift(j) be the distance by which P shifts to the right, where m is the length of the pattern string P, j is the position of the character currently being compared, and s is the distance between t' and t (first case above) or between x and P'' (second case).
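- Both cases of the good suffix rule can be expressed as one condition: a shift s is acceptable if every already matched character still lines up with an equal pattern character (or falls off the left end of P), and the character that moves under the mismatch position differs from the one that just failed. The sketch below finds the smallest such s by brute force for each mismatch index j (0-based). It is an illustration under these assumptions, not an efficient implementation: practical versions precompute the table in linear time. The names shift_is_consistent and build_good_suffix are hypothetical.

```c
#include <string.h>

/* Returns 1 if shifting pattern p (length m) right by s is consistent with
 * a mismatch at index j after the suffix p[j+1..m-1] has matched. */
static int shift_is_consistent(const char *p, int m, int j, int s)
{
    for (int k = j + 1; k < m; k++)
        if (k - s >= 0 && p[k - s] != p[k])
            return 0;      /* the matched suffix would no longer line up */
    if (j - s >= 0 && p[j - s] == p[j])
        return 0;          /* same preceding character: it would fail again */
    return 1;
}

/* shift[j] = smallest s in 1..m that is consistent for a mismatch at j.
 * s = m is always consistent, so the search terminates. */
void build_good_suffix(const char *p, int *shift)
{
    int m = (int)strlen(p);
    for (int j = 0; j < m; j++) {
        int s = 1;
        while (s < m && !shift_is_consistent(p, m, j, s))
            s++;
        shift[j] = s;
    }
}
```

- For P = "ANPANMAN", a mismatch at index 5 (after "AN" has matched) gives a shift of 3, which is the first case: "AN" occurs again three positions to the left, preceded by 'P' instead of 'M'. A mismatch at index 4 (after "MAN" has matched) gives a shift of 6, which is the second case: the prefix "AN" of P is aligned with the suffix "AN" of the matched part.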
Pattern matching code implementation
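- Naive pattern matching algorithm (C language)

    #include <stdio.h>

    /* Returns the index of the first occurrence of p in s, or -1 if none. */
    int Find(const char *s, const char *p);

    int main(void)
    {
        char s[20];
        char p[5];
        printf("Please input the target string: ");
        if (scanf("%19s", s) != 1)      /* width limit prevents buffer overflow */
            return 1;
        printf("Please input the pattern string: ");
        if (scanf("%4s", p) != 1)
            return 1;
        printf("The result of finding is: %d\n", Find(s, p));
        return 0;
    }

    int Find(const char *s, const char *p)
    {
        int j = 0, i = 0, k = 0;        /* k: start of the current attempt in s */
        int r = -1;                     /* result; -1 means not found */
        while (r == -1 && s[i] != '\0')
        {
            /* advance while the characters agree */
            while (p[j] == s[i] && p[j] != '\0')
            {
                i++;
                j++;
            }
            if (p[j] == '\0')
            {
                r = k;                  /* the whole pattern matched */
            }
            else
            {
                j = 0;                  /* backtrack: restart one position later */
                k++;
                i = k;
            }
        }
        return r;
    }
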
- KMP pattern matching algorithm (C language)

    #include <stdio.h>
    #include <string.h>

    char s1[200], s2[200];              /* s1: target string, s2: pattern string */
    int next[200];

    int max(int a, int b)
    {
        if (a > b) return a;
        return b;
    }

    /* Build the next table for s2, with next[0] = -1. */
    void getnext(void)
    {
        int i = -1, j = 0, len2 = (int)strlen(s2);
        memset(next, 0, sizeof(next));
        next[0] = -1;
        while (j < len2)
        {
            if (i == -1 || s2[i] == s2[j]) {
                i++; j++;
                next[j] = i;
            }
            else i = next[i];
        }
    }

    /* Return the index in s1 of the first occurrence of s2, or -1 if none. */
    int KMP(void)
    {
        int i = 0, j = 0, len1 = (int)strlen(s1), len2 = (int)strlen(s2);
        while ((i < len1) && (j < len2))
        {
            if (j == -1 || s1[i] == s2[j]) { j++; i++; }
            else j = next[j];
        }
        if (j == len2) return i - len2;
        else return -1;
    }

    /* Return the length of the longest prefix of s2 matched while scanning s1. */
    int index_KMP(void)
    {
        int i = 0, j = 0, len1 = (int)strlen(s1), len2 = (int)strlen(s2), re = 0;
        while (i < len1 && j < len2)
        {
            if (j == -1 || s1[i] == s2[j]) { i++; j++; }
            else j = next[j];
            re = max(re, j);
        }
        return re;
    }

    int main(void)
    {
        /* In C, fopen cannot appear in a file-scope initializer,
           so the files are opened here. */
        FILE *fin = fopen("test.in", "r");
        FILE *fout = fopen("test.out", "w");
        if (fin == NULL || fout == NULL) return 1;
        if (fscanf(fin, "%199s", s1) != 1) return 1;   /* target string */
        for (int i = 1; i <= 3; i++)                   /* three pattern strings */
        {
            if (fscanf(fin, "%199s", s2) != 1) break;
            getnext();
            fprintf(fout, "%d %d\n", KMP(), index_KMP());
        }
        fclose(fin);
        fclose(fout);
        return 0;
    }

- BM matching algorithm code implementation (C++)

    // BM pattern matching algorithm. This version uses only the bad character rule.
    #include <algorithm>
    #include <cstring>
    #include <iomanip>
    #include <iostream>
    #define MAX 200            // maximum string length, including the terminator
    #define ALPHABET 256       // one table entry per possible character value
    using namespace std;

    // dist[c] is the distance from the rightmost occurrence of c in t to the
    // end of t, or lenT if c does not occur in t.
    void get_dist(int *dist, const char *t, const int lenT)
    {
        int i;
        for (i = 0; i < ALPHABET; i++)
            dist[i] = lenT;
        for (i = 0; i < lenT; i++)
            dist[(unsigned char)t[i]] = lenT - i - 1;
    }

    // Returns the 1-based start position of the first match of t in s, or 0 if none.
    int BM(const char *s, const char *t, const int *dist, const int lenS, const int lenT)
    {
        int i, j, k;
        i = lenT - 1;                          // text index aligned with the last pattern character
        while (i < lenS)
        {
            j = lenT - 1;
            k = i;
            while (j >= 0 && s[k] == t[j])     // compare from right to left
            {
                j--;
                k--;
            }
            if (j < 0)
                return i + 2 - lenT;
            // Align the rightmost occurrence of s[k] in t with s[k]. The i - k
            // characters already matched are subtracted, and the shift is at
            // least 1 so that the search always advances.
            i += max(1, dist[(unsigned char)s[k]] - (i - k));
        }
        return 0;
    }

    int main()
    {
        int cases;
        char s[MAX], t[MAX];
        int dist[ALPHABET];
        cout << "Please enter the number of cases:";
        if (!(cin >> cases))
            return 1;
        while (cases--)
        {
            cout << "Please enter the main string:" << endl;
            if (!(cin >> setw(MAX) >> s))      // setw limits the input to the buffer size
                return 1;
            int lenS = (int)strlen(s);
            while (1)
            {
                cout << "Please enter the pattern string to be matched (enter 0 to end):" << endl;
                if (!(cin >> setw(MAX) >> t) || !strcmp(t, "0"))
                    break;
                int lenT = (int)strlen(t);
                get_dist(dist, t, lenT);
                int pos = BM(s, t, dist, lenS, lenT);
                if (pos == 0)
                    cout << "No matches!" << endl;
                else
                    cout << "The start of the match is:" << pos << endl;
            }
        }
        return 0;
    }