How Do Search Engines Work?
The basic working principle of a search engine involves three processes: first, finding and collecting webpage information on the Internet; next, extracting and organizing that information to build an index database; and finally, having the retriever quickly look up documents in the index database that match the query keywords entered by the user, evaluate the relevance between those documents and the query, sort the results to be output, and return the query results to the user.
- To return search results as fast as possible, a search engine usually searches an index database of webpages built in advance. An ordinary search cannot truly understand the content of a webpage; it can only mechanically match the text on the page. A search engine in the true sense generally refers to a full-text search engine, which collects anywhere from tens of millions to billions of web pages on the Internet and indexes every word (that is, every keyword) in those pages. When a user searches for a keyword, all web pages whose content contains that keyword are retrieved as search results. After sorting by a complex algorithm, the results are ranked according to their relevance to the search keywords. A typical search engine consists of three modules:
- (1) Information collection module
- An information collector is a program that can browse the web, commonly described as a "web crawler". It first opens a webpage, then uses the links on that page as the starting addresses for further browsing: it fetches the linked pages, extracts the links that appear on them, and decides which links to visit next according to some algorithm. At the same time, the collector stores the URLs it has already visited in its own webpage list and marks them as searched. An automatic indexing program checks each page, creates an index record for it, and adds the record to the overall lookup table. The collector keeps repeating this process, moving from webpage to hyperlink to webpage, until it finishes. During the search, the collector of a typical search engine only keeps pages whose link-length ratio (the ratio of the number of hyperlinks to the length of the document) is below a certain threshold, so that data is collected from content pages rather than directory pages. While collecting documents, it records the address, modification time, document length, and other status information of each document, which is used for monitoring site resources and updating the database. During collection, an appropriate heuristic strategy can be constructed to guide the collector's search path and scope and reduce the blindness of document collection.
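To make the crawling loop above concrete, here is a minimal breadth-first crawler sketch in Python using only the standard library. The seed URL, the page limit, and the `crawl` helper are illustrative assumptions rather than any particular engine's design; real crawlers add politeness rules, robots.txt handling, and heuristics like the link-length ratio described above.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found on one page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=50):
    """Breadth-first crawl: fetch a page, extract its links, enqueue unseen ones."""
    visited = set()            # URLs already fetched ("marked as searched")
    frontier = deque([seed_url])
    pages = {}                 # url -> raw HTML, handed to the indexer later

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue           # unreachable or unreadable page: skip it
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            frontier.append(urljoin(url, link))  # resolve relative links
    return pages
```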
- (2) Lookup table module
- The lookup table module is a full-text index database. It analyzes web pages, strips out markup symbols from languages such as HTML, extracts all the words that appear, and records the URL where each word occurs together with its position (for example, whether it appears in the page title, in the summary, or in the body text). Finally, the data is stored in a lookup table, which becomes the database that is searched directly on behalf of users.
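As a sketch of such a lookup table, the snippet below builds a toy inverted index in Python. The whitespace tokenizer and the title/body field labels are simplifying assumptions; a real engine must also handle markup stripping, word segmentation, and normalization.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into words; real engines do far more."""
    return re.findall(r"\w+", text.lower())

def build_index(pages):
    """pages maps url -> {"title": ..., "body": ...}.
    Returns word -> list of (url, field, position) postings."""
    index = defaultdict(list)
    for url, fields in pages.items():
        for field, text in fields.items():
            for position, word in enumerate(tokenize(text)):
                index[word].append((url, field, position))
    return index

pages = {
    "http://example.com/a": {"title": "search engines",
                             "body": "engines index the web"},
    "http://example.com/b": {"title": "web crawlers",
                             "body": "crawlers feed search engines"},
}
index = build_index(pages)
print(index["engines"])  # every page, field, and position where "engines" occurs
```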
- (3) Retrieval module
- The retrieval module is the program that implements the retrieval function. Its role is to split the retrieval expression entered by the user into words that carry retrieval significance, then access the lookup table through some matching algorithm and obtain the corresponding results. The returned results are generally ranked by a statistical model built from word frequencies and from the information reflected in web links, and are output in order of relevance from high to low. [1]
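The sketch below shows one such matching step in Python, over toy postings shaped like the index sketch above. Scoring by weighted term frequency, with title hits counting double, is an illustrative assumption standing in for the statistical models real engines use.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def search(index, query):
    """Split the query into words, look each one up in the postings,
    and rank pages by weighted term frequency."""
    scores = Counter()
    for word in tokenize(query):
        for url, field, position in index.get(word, []):
            scores[url] += 2 if field == "title" else 1  # title hits weigh more
    return [url for url, _ in scores.most_common()]

# Toy postings: word -> [(url, field, position), ...]
index = {
    "search":  [("http://example.com/a", "title", 0),
                ("http://example.com/b", "body", 2)],
    "engines": [("http://example.com/a", "title", 1),
                ("http://example.com/a", "body", 0),
                ("http://example.com/b", "body", 3)],
}
print(search(index, "search engines"))  # most relevant URLs first
```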
- The working mechanism of a search engine is to use an efficient spider program that starts from specified URLs, follows the hyperlinks on each webpage, traverses the Internet using a depth-first or breadth-first algorithm, and crawls webpage information into a local database. An indexer then indexes the important information units in the database, such as titles, keywords, and abstracts, or the full text, for query navigation. Finally, the retriever uses some retrieval technique to match the query request submitted by the user through the browser against the information in the index database, and returns the retrieval results to the user in some sorted order.
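The choice between depth-first and breadth-first traversal comes down to how the frontier of unvisited links is managed, as the short Python sketch below illustrates on a made-up link graph.

```python
from collections import deque

# A made-up link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}

def traverse(start, breadth_first=True):
    """Visit pages reachable from `start`; a queue (FIFO) yields breadth-first
    order, while a stack (LIFO) yields depth-first order."""
    frontier, visited, order = deque([start]), set(), []
    while frontier:
        page = frontier.popleft() if breadth_first else frontier.pop()
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        frontier.extend(links[page])
    return order

print(traverse("A", breadth_first=True))   # ['A', 'B', 'C', 'D', 'E']
print(traverse("A", breadth_first=False))  # ['A', 'C', 'E', 'B', 'D']
```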
- (1) Discovering and collecting webpage information on the Internet
- We touched on this in the search engine classification section.
- In general, we can measure the performance of a search engine from the following aspects:
- The recall rate is the ratio of the number of relevant documents in the search results provided by the search engine to the number of relevant documents that exist on the network. It is a true reflection of the search engine's coverage of network information.
- The precision rate is the degree to which the search results provided by the search engine match the user's information needs; it is the ratio of the number of relevant documents in the search results to the total number of documents the engine returns.
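In standard information-retrieval notation, writing R for the set of relevant documents that exist on the network and A for the set of documents the engine actually returns (symbols introduced here for illustration), the two measures are:

```latex
\text{recall} = \frac{|R \cap A|}{|R|},
\qquad
\text{precision} = \frac{|R \cap A|}{|A|}
```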
- Response time generally depends on two factors: network speed, which is related to bandwidth, and the speed of the search engine itself. Ideal retrieval speed can be guaranteed only when both have reliable technical support. For a search engine, recall and precision are difficult to optimize at the same time. The main factor affecting search engine performance is the information retrieval model, which covers the representation of documents and queries, the matching strategy used to evaluate the relevance between documents and user queries, the ranking method for query results, and the mechanism for user feedback.
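One classic example of such an information retrieval model is the vector space model, sketched below in Python. Representing documents as TF-IDF weight vectors and matching by cosine similarity is a common textbook choice, not a claim about any specific engine; the corpus and query here are made up for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_model(docs):
    """Representation: each document becomes a sparse term -> TF-IDF weight vector."""
    n = len(docs)
    counts = [Counter(tokenize(d)) for d in docs]
    df = Counter()                                    # document frequency of each term
    for c in counts:
        df.update(c.keys())
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # +1 keeps ubiquitous terms nonzero
    vectors = [{t: c[t] * idf[t] for t in c} for c in counts]
    return vectors, idf

def cosine(u, v):
    """Matching strategy: cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["search engines index the web",
        "crawlers feed search engines",
        "cats sleep all day"]
vectors, idf = build_model(docs)
query = {t: idf.get(t, 0.0) for t in tokenize("web search")}  # query in the same space
ranking = sorted(range(len(docs)), key=lambda i: cosine(query, vectors[i]), reverse=True)
print(ranking)  # document indices ordered from most to least relevant to the query
```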