How Do I Choose the Best Virtualization Strategy?

Data virtualization is an umbrella term for data management methods that allow applications to retrieve and manipulate data without needing the technical details of how and where that data is stored.

Data virtualization integrates data from multiple sources to expose more of its information value, and it typically draws on other technologies to do so. A complete data virtualization system should be able to create views and virtual tables, provide data services, optimize federated queries, cache data, and enforce fine-grained security, so that users can discover, retrieve, and access data across different data sources. Data virtualization greatly improves the flexibility and agility of data integration: users access data from different sources through a single access point, data services are available to all data consumers, and physical data transfer is avoided. Nevertheless, several problems and challenges still need to be addressed.
Integration of heterogeneous data sources
Data from different sources may use different structures: structured, semi-structured, and mixed. Some data sets use a relational data model, some use HTML/XML files, and some use log-format files. Unifying these heterogeneous sources is a major integration challenge. Researchers have approached it from different angles, such as improving and extending the query language, splitting a query request into multiple sub-queries, and merging metadata based on semantic similarity. Because data virtualization targets many different kinds of application, optimizing a separate query language for each application type would hurt the efficiency of the whole platform. Instead, a data virtualization platform provides multiple access interfaces for uniform access to different data sources. For example, relational sources such as SQL Server, Oracle, Access, and Excel can be accessed through SQL via an ODBC interface, while web applications can be accessed through REST or JSON interfaces; this hides the heterogeneity of the source data models and provides uniform data access. Resolving source heterogeneity is the basis for reliable data services, but given the diversity and complexity of data sources, more types of access interface may need to be developed and refined, so the issue should not be underestimated.
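As a minimal sketch of this unified-access idea, heterogeneous sources can be hidden behind one common interface. The adapter names (`RelationalAdapter`, `RestAdapter`) and the `fetch` method are illustrative assumptions, not the API of any real platform; a SQLite in-memory database stands in for an ODBC-reachable relational source, and a JSON string stands in for a REST payload.

```python
import json
import sqlite3

class RelationalAdapter:
    """Wraps a SQL-accessible source (stands in for an ODBC connection)."""
    def __init__(self, conn):
        self.conn = conn

    def fetch(self, query):
        # Return plain dicts so callers never see source-specific row types.
        return [dict(row) for row in self.conn.execute(query)]

class RestAdapter:
    """Wraps a JSON/REST-style source; here the payload is supplied directly."""
    def __init__(self, payload):
        self.records = json.loads(payload)

    def fetch(self, _query=None):
        # A real adapter would translate the query into a REST call.
        return self.records

# Build one relational and one "REST" source, then read both uniformly.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Alice')")

sources = [
    RelationalAdapter(conn),
    RestAdapter('[{"id": 2, "name": "Bob"}]'),
]
rows = [r for s in sources for r in s.fetch("SELECT id, name FROM customers")]
```

The consumer iterates over `sources` without knowing which is relational and which is REST-backed, which is exactly the heterogeneity-shielding the paragraph describes.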
Integration of heterogeneous data
For data heterogeneity, Sujansky analyzed multi-source data from four aspects: structural differences, naming differences, semantic differences, and content differences. For example, data may carry multiple date and timestamp formats, and the same data may be defined differently in different sources. Handled improperly, such heterogeneity leads to a sharp decline in the quality of the integrated data. Some solutions shield the user from differences in the storage format and semantics of the underlying data by virtualizing and abstracting it: a unified description language (such as XML) or code-generation technology is used to process the required data, create virtual tables, and automatically generate data services. This solves the problem to some extent, but there is still no universal data model. Because XML is easy to manipulate, easy to understand, and portable across platforms, many data virtualization platforms use it to describe data uniformly; building on this generality, an XML-based data information model (DIM) has been proposed that preserves the portability of the data model. Resolving data heterogeneity is a prerequisite for organizing metadata and for creating reusable views and virtual tables. Notably, because the underlying data models differ, ensuring the correctness, completeness, and consistency of data converted to a unified format, and hence the accuracy of data mapping, is a key issue that urgently needs to be addressed.
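A small sketch of the unified-description idea, using the standard library's XML support: two records that name their fields differently and format their dates differently (naming and content heterogeneity) are both mapped onto one canonical XML description. The field names and the `to_canonical` helper are invented for illustration.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

# Two sources describe the same entity with different field names and
# timestamp formats (naming and content differences).
source_a = {"cust_name": "Alice", "created": "2023/01/15"}
source_b = {"customerName": "Alice", "createdAt": "15-01-2023"}

def to_canonical(record, name_key, date_key, date_fmt):
    """Map a source-specific record onto one canonical XML description."""
    elem = ET.Element("customer")
    ET.SubElement(elem, "name").text = record[name_key]
    # Normalize every source's date format to ISO 8601.
    iso = datetime.strptime(record[date_key], date_fmt).date().isoformat()
    ET.SubElement(elem, "created").text = iso
    return ET.tostring(elem, encoding="unicode")

xml_a = to_canonical(source_a, "cust_name", "created", "%Y/%m/%d")
xml_b = to_canonical(source_b, "customerName", "createdAt", "%d-%m-%Y")
```

After conversion the two descriptions are byte-identical, which is what makes downstream mapping and virtual-table creation reliable; the open question the paragraph raises is proving this kind of conversion correct and complete for arbitrary source models.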
Data mapping
Data mapping is essential for queries to reach the correct source data. Multiple wrapper tables can be defined over one data source, and multiple virtual tables can be defined over one wrapper table. Because the underlying sources are complex and diverse, data duplication inevitably occurs between scattered sources, and data can also overlap between the wrapper tables created over them. For a query, the same virtual table may therefore produce multiple mappings, causing the same data in the underlying source to be queried repeatedly and reducing overall query efficiency. This problem concerns the wrapper tables and the mapping strategy: when defining wrapper tables and mappings, duplicate source data must be taken into account. Should wrapper tables with overlapping data created over different sources be discarded, or effectively merged? And for virtual tables defined top-down, how should mappings be defined to match query needs? Ensuring that data consumers can query the data they need while avoiding repeated queries of the same data is the central challenge of efficient and accurate data mapping.
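One simple way to avoid the repeated-query problem is to deduplicate mapping targets before executing a plan. This is a sketch under assumed names (the mapping tuples and `plan_queries` are illustrative, not a documented algorithm): each virtual table maps to `(source, wrapper_table)` pairs, and overlapping wrappers collapse to a single fetch.

```python
# Each mapping ties a virtual table to a (source, wrapper_table) pair; because
# wrapper tables can overlap, a naive plan would query the same target twice.
mappings = [
    ("orders_virtual", "warehouse_db", "orders_2023"),
    ("orders_virtual", "warehouse_db", "orders_2023"),   # overlapping wrapper
    ("orders_virtual", "archive_db", "orders_legacy"),
]

def plan_queries(mappings):
    """Collapse duplicate (source, table) targets so each is queried once."""
    seen, plan = set(), []
    for _virtual, source, table in mappings:
        key = (source, table)
        if key not in seen:
            seen.add(key)
            plan.append(key)
    return plan

plan = plan_queries(mappings)
```

Deduplicating targets handles exact overlap; the harder case the paragraph points at, partially overlapping wrappers across different sources, would require merging row sets rather than just pruning targets.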
Metadata Organization Model
As in a database management system, metadata is the core of a virtualization system's operation. Some current metadata organization models focus only on a specific application or service, without considering the data relationships between sources. As a result, users must analyze the data description and organization themselves when querying, and then locate the underlying source programmatically, which is too complicated for users who want to get the data resources they need quickly and easily. Based on the idea of data as a service, and to address the inadequacy of traditional data models for user queries, a conceptual information model called DEMODS (DEscription MOdel for DaaS) has been proposed: it hides automatic service lookup and provides pre-built tools for combining and analyzing data exchanged between sources, so that users need not care about the intermediate query operations. To address the heterogeneity of data models in the information transformation and data sharing of railway distributed systems, an XML-based three-dimensional metadata organization model has been proposed that describes the relationships between data in different systems and implements mappings between the different data models and the metadata organization model. For a data virtualization system, a proper metadata organization model is key: metadata should be analyzed and reclassified according to user query requirements and the relationships among source data, and a structured, relational, general-purpose metadata organization model should be established. This matters especially for timely, fast data delivery when the data service layer has no pre-defined virtual tables or data services. At present there is no unified standard for metadata organization models in virtualization systems, and because application requirements are flexible and diverse, studying a suitable metadata organization model is very important.
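The shape of such a catalog can be sketched in a few lines. Everything here is an illustrative assumption (the `SourceEntry`/`MetadataCatalog` names and fields are invented, not DEMODS or any standard model): the catalog relates a logical name users query by to the source entries that can serve it, so users never resolve sources themselves.

```python
from dataclasses import dataclass, field

@dataclass
class SourceEntry:
    """Describes one underlying data set and how to reach it."""
    source: str
    table: str
    interface: str          # e.g. "odbc" or "rest"

@dataclass
class MetadataCatalog:
    """A minimal catalog relating logical names to their source entries."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name, entry):
        # One logical name may be served by several sources.
        self.entries.setdefault(logical_name, []).append(entry)

    def resolve(self, logical_name):
        # Users query by logical name; the catalog finds the sources.
        return self.entries.get(logical_name, [])

catalog = MetadataCatalog()
catalog.register("customers", SourceEntry("crm_db", "clients", "odbc"))
catalog.register("customers", SourceEntry("web_api", "/customers", "rest"))
targets = catalog.resolve("customers")
```

What this toy catalog lacks is exactly what the paragraph calls for: relationships *between* entries (joins, overlaps, lineage), which a general-purpose metadata organization model would have to capture.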
Data services
The innovation of big data technology and the development of the DaaS model have promoted not only data services themselves but also research into composite data services, in which multiple basic data services are combined, through association, into composite services that meet business needs. Incompatible data structures and data privacy in DaaS composite services have been studied, and corresponding solutions proposed. Data virtualization systems also support combining data services. During combination, differences in data attributes, privacy requirements, and data structures across services lead to incompatible structures, reduced data quality, and differing access rights. Current data virtualization solutions have researched this area relatively little; most handle it with separate analysis and combination tools, which increases the burden on users. Since the main artifact a data virtualization system creates is the data service, future research could integrate the system with the DaaS model: use the services created by the virtualization system as basic services, and combine them through DaaS operations such as merging, deleting, sorting, and restructuring to form new data services, reducing user effort while ensuring the quality of the combined service. In addition, for non-relational databases and unstructured content, how to create a data service, and whether to expose queries over the various underlying sources as virtual tables, are important questions. To keep data services continuously available, the system must also update its virtual tables, which raises the question of how virtual tables are updated automatically and promptly as the underlying sources change.
For virtual tables generated by data services, ensuring the consistency and efficiency of these updates is an important open issue.
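A minimal sketch of composing basic services into a composite one, under assumed names (`service_profiles`, `service_balances`, and `composite_service` are invented for illustration): two basic services are joined on a shared key, restructured, and sorted, which are exactly the merge/sort/restructure operations the paragraph mentions.

```python
# Two basic data services return customer records with a shared key, "id".
def service_profiles():
    return [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

def service_balances():
    return [{"id": 2, "balance": 50}, {"id": 1, "balance": 120}]

def composite_service():
    """Join the two basic services on 'id' and sort the combined result."""
    balances = {r["id"]: r["balance"] for r in service_balances()}
    # Restructure: attach each balance to the matching profile record.
    merged = [{**p, "balance": balances.get(p["id"])} for p in service_profiles()]
    return sorted(merged, key=lambda r: r["id"])

result = composite_service()
```

The join works here only because both services happen to share a key and compatible types; the structural incompatibility and access-rights differences the paragraph raises are precisely what breaks this simple composition in practice.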
Query optimization
The goal of query optimization is to improve the efficiency with which users obtain the data resources they need, and it is a key issue in data virtualization systems. Some studies apply the middleware idea, building query optimization on top of the data model: by analyzing and mining the data, they extract the minimal metadata that can represent the data's attributes and inter-data relationships, then narrow the query scope by consulting that metadata, reducing response time. This also illustrates, from another angle, the importance of metadata quality and of the organization model. Other studies improve query efficiency by optimizing the performance of the query system itself; for example, one approach adds a rich set of hardware acceleration engines to the storage system to improve the parallel processing of stored data. Vendors such as Cisco and Composite use rule-based or cost-based optimizers to formulate the best plan for each query request, and apply scan multiplexing, constraint propagation, and parallel processing to make efficient use of network resources and databases, ensuring fast, timely delivery of the target data. Query optimization techniques currently used in data virtualization systems include query substitution, SQL pushdown, parallel processing, distributed joins, ship joins, and SQL override. Each targets particular application scenarios and has limitations in others: query substitution applies mainly to queries over nested virtual tables, while SQL pushdown cannot be used when the underlying source is a sequence file or an XML web service. Given the diversity of applications, no single query strategy fits all scenarios.
Beyond applying these optimization techniques, for some specific query applications it is sometimes necessary to translate query requests into another query language. From another perspective, caching greatly benefits the system's query performance. A data virtualization system can provide a flexible, scalable caching mechanism that caches relevant data from the underlying sources; queries can then be answered from the cache, speeding them up and reducing the load on the sources. To keep cached data consistent and fresh, however, the cache must also be updated in real time as the underlying data changes, which raises issues of data consistency and update efficiency.
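The caching mechanism can be sketched as a time-bounded query cache with explicit invalidation. The `QueryCache` class and its methods are illustrative assumptions, not a real platform's API: hits within the TTL are served from memory, misses fall through to the underlying source, and `invalidate` models the real-time update hook the paragraph mentions.

```python
import time

class QueryCache:
    """Time-bounded cache keyed by query text; stale entries are refetched."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}

    def get(self, query, fetch):
        entry = self.store.get(query)
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1], True          # cache hit: no source round-trip
        result = fetch(query)              # miss: go to the underlying source
        self.store[query] = (time.monotonic(), result)
        return result, False

    def invalidate(self, query):
        # Called when the underlying data changes, to keep the cache fresh.
        self.store.pop(query, None)

cache = QueryCache(ttl_seconds=60)
fetch = lambda q: ["row1", "row2"]         # stand-in for a real source query
first, hit1 = cache.get("SELECT *", fetch)
second, hit2 = cache.get("SELECT *", fetch)
```

The design choice here, TTL plus explicit invalidation, illustrates the trade-off in the text: a long TTL lowers source load but widens the window in which the cache can serve stale data unless every source change triggers `invalidate`.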
System scalability
As a platform, a data virtualization system will continually take on new data sources, application requests, and data structures, so it must scale well. New data owners will open their sources and register them with the platform, while previously open sources may be fully or partially deregistered, forcing the system to rebuild its metadata organization, virtual table definitions, data mappings, data caches, and so on; the system must therefore support online addition, modification, and update. Improving the system's extensibility and scalability is thus an important issue. Because users' query requirements are unpredictable, the performance challenges brought by growing system scale must be considered to guarantee query performance, especially when processing large volumes of data. In the early stages of designing and developing a data virtualization system, attention must be paid to the performance of query processing and the scalability of the chosen solutions. In addition, as data sources are constantly updated and the number of data consumers grows, keeping sources, wrapper tables, and virtual tables synchronized, maintaining data consistency, and preserving the quality of experience (QoE) for all data consumers are also important considerations.
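The register/deregister cycle and its knock-on rebuild work can be sketched as a small registry. The `SourceRegistry` name and its methods are invented for illustration: deregistering a source returns the dependent virtual tables, i.e. exactly the artifacts the paragraph says must be reconstructed online.

```python
class SourceRegistry:
    """Tracks open data sources and the virtual tables that depend on them."""
    def __init__(self):
        self.sources = {}            # source name -> set of dependent views

    def register(self, name):
        self.sources.setdefault(name, set())

    def bind_view(self, source, view):
        # Record that a virtual table is built over this source.
        self.sources[source].add(view)

    def deregister(self, name):
        # Returns the views that must now be rebuilt or retired.
        return self.sources.pop(name, set())

reg = SourceRegistry()
reg.register("sales_db")
reg.bind_view("sales_db", "v_orders")
affected = reg.deregister("sales_db")
```

Tracking dependencies at registration time is what makes online modification tractable: without the reverse mapping, a deregistration would force a full scan of every virtual table definition.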
Data Security
Data security includes authentication, authorization, and encryption. Authentication and authorization mainly concern users; encryption concerns the data itself. Only by sharing data on a secure basis can greater value be produced. Data virtualization systems implement different authentication and authorization mechanisms for the same data service depending on the application, which imposes new security requirements, such as secure communication between queries and data sources and secure cross-platform, cross-system access. When a user requests data access, the system checks the user's credentials (such as user ID and password). Different users can have different access rights to the same data element; for a given virtual table, some users may only have access to part of it. Note that the data virtualization system only checks data-consumer permissions to determine a user's access to data, while authorization for source data access is decided by the owner of the underlying data source; the two sets of permissions must be distinguished. Some underlying sources have their own secure access mechanisms, so a data consumer needs permissions at two levels, the virtual table and the underlying source; other underlying stores define no security mechanism, upper-level access is completely public, and users need only virtual-table access. The secure storage of user credentials, appropriate authorization mechanisms, and the performance of security checks are therefore all issues to consider. Because access rights on data services also limit the scope of the data a user can query, the corresponding security mechanism must be considered when defining a data service.
On the other hand, overly complicated security mechanisms degrade the virtualization system's query and processing performance, so striking a balance between security and performance when designing data services will be a challenge.
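The two-level permission check the section describes can be sketched directly. The grant tables and `can_query` are illustrative assumptions: the virtualization layer checks its own virtual-table grants, then defers to the source owner's grants, except for sources that define no security mechanism at all.

```python
# Two permission layers: the virtualization system grants access to virtual
# tables, while each underlying source's owner manages its own grants.
view_grants = {"alice": {"v_claims"}}            # virtualization-layer ACL
source_grants = {"claims_db": {"alice"}}         # owner-managed source ACL
open_sources = {"public_db"}                     # sources with no security layer

def can_query(user, view, source):
    """A user needs the view grant and, unless the source is open, its grant."""
    if view not in view_grants.get(user, set()):
        return False                             # first level: virtual table
    if source in open_sources:
        return True                              # source defines no mechanism
    return user in source_grants.get(source, set())  # second level: source

ok = can_query("alice", "v_claims", "claims_db")
denied = can_query("bob", "v_claims", "claims_db")
```

Note that `denied` fails at the first level even before the source is consulted, which is the distinction the text draws: the virtualization system decides consumer access, the source owner decides source access, and both must pass.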
System Management
A data virtualization system is, at heart, a data sharing platform that provides data services through a simpler, more agile architecture. It must also manage the entire data virtualization environment well, settling questions such as who is responsible for the shared infrastructure and who is responsible for the shared data services. The data virtualization management plane needs a variety of system management tools to manage and monitor system operation, for example by tracking query counts, query performance, system availability, cache usage, and cache update rates; all of these are problems that need to be considered.
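Such monitoring can be sketched as a thin wrapper around query execution. The `QueryMonitor` class is an invented illustration: it records per-query counts and latencies, two of the operational metrics the paragraph lists for the management plane.

```python
from collections import defaultdict
import time

class QueryMonitor:
    """Records per-query counts and latencies for the management plane."""
    def __init__(self):
        self.counts = defaultdict(int)
        self.latencies = defaultdict(list)

    def record(self, query, run):
        # Time the query execution and keep the measurement per query text.
        start = time.monotonic()
        result = run()
        self.counts[query] += 1
        self.latencies[query].append(time.monotonic() - start)
        return result

    def stats(self, query):
        times = self.latencies[query]
        return {"count": self.counts[query],
                "avg_seconds": sum(times) / len(times) if times else 0.0}

mon = QueryMonitor()
mon.record("SELECT 1", lambda: [1])   # lambdas stand in for real executions
mon.record("SELECT 1", lambda: [1])
report = mon.stats("SELECT 1")
```

Wrapping execution rather than instrumenting each source keeps monitoring in one place, which matches the management-plane framing: the platform, not the sources, answers who-ran-what-and-how-fast questions.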
What is data virtualization software?
1. Centralize responsibility for data virtualization. One of the main advantages of this is the ability to move quickly and to work on larger concepts, such as defining common specifications and implementing smart storage components.
2. Define and implement a common data model. This ensures consistent, high-quality data, gives business users more confidence in the data, and increases the flexibility and productivity of IT staff.
3. Determine a governance approach. This requires thinking about how to manage the data virtualization environment. The key questions are who is responsible for the shared infrastructure and who is responsible for the shared services.
4. Educate business users about the benefits of data virtualization. Take the time to communicate with business users, make sure they understand the data, and work day to day to make data virtualization acceptable to the rest of the organization.
5. Pay attention to performance tuning and scalability. In the early stages of development, tune performance and test the scalability of the solution. Consider introducing massively parallel processing to sustain query performance over large data volumes, and account for the fact that users' ad hoc queries and reports are unpredictable.
6. Implement data virtualization in stages. First abstract the data sources, then put business intelligence applications on top, and finally implement data virtualization's more advanced federation functions.
HealthNow is a Blue Shield insurance provider that supports approximately 800,000 members in western New York. The company has approximately 2,500 employees, including a 30-person data management team.
HealthNow's IT department runs IBM's DB2, Microsoft's SQL Server and Sybase database management software. The company also runs IBM's Cognos Business Intelligence (BI) and TriZetto's Facets health plan management software.
Myers said, "I joined HealthNow three years ago. At that time, the company was still using traditional ETL and had no concept of data services. To be honest, there was no SOA architecture and no reference architecture then."
All of this began to change about a year and a half ago, when Myers and his team started evaluating data federation and data virtualization technologies. One of the significant results is that they can now get usable information to business analysts faster than in the past.
At the time, Informatica was still reselling Composite's software under an OEM agreement, and had only just developed its own Informatica Data Services product. Myers said the roadmap for Informatica's data services looked promising, but the deciding factor was mainly Informatica's ability to apply data quality rules.
Myers said: "Informatica told us that we could not only federate data, we could also attach quality rules. We can configure the data and those rules so that they map correctly. Developers can write virtual data model mappings, and they can see and configure the data at every step."
Another important factor in HealthNow's choice of Informatica over IBM and Composite was Informatica's ability to provide quality maintenance and support. HealthNow's negotiated agreement with Informatica stipulates that all service calls go directly to known contacts.
The insurance company went live with Informatica Data Services about a year ago, and Myers says the main benefit of the software can be summed up in one word: speed.
The company can now create "virtualized data marts" that pull data from different sources and are easily accessed by business users. Myers said creating such a data mart may take only two days; in the past, collating data for business analysts could take months.
HealthNow faces data virtualization challenges
A recent Forrester research report pointed out that data virtualization software packages are often questioned by end-user organizations concerned about keeping sensitive information secure.
But security is not a major issue for HealthNow, thanks to the company's robust IT architecture. The company runs IBM's WebSphere DataPower in front of the Informatica data services, which adds a layer of security to the system.
"The people who can really reach the data service layer are other application developers. And if we let end users access it through Cognos, we have the Cognos security layer on it."
For Myers, the bigger challenge was convincing HealthNow's data integration team that Informatica's data services are a stable environment. He overcame this obstacle by carrying out some small data virtualization projects and producing results.
"This shows that the environment is stable and that data services are reusable components that have proven able to deliver the right data to the business side. These components can be reused in many new projects across the lines of business," he said.
Myers said that although HealthNow's lack of experience with data virtualization caused minor issues, the Informatica Data Services implementation was largely painless. The company installed and configured the system within two months and passed a proof-of-concept test.
Myers said, "Of course you could do it faster. But I think, as with any well-architected team, we spend a lot of time first assessing the best use of a technology and where we are heading."
