Data Virtualization Relies on Optimization

Up-to-the-minute data is a key business requirement.

Cisco Information Server’s query optimization algorithms and techniques deliver the timely data business requires, on demand, with high performance.

Cisco is the Data Virtualization Performance Leader

The Cisco Data Virtualization Platform includes the most technically advanced and mature federated query processing engine in the data virtualization market.

Based on a decade of pioneering research and more than 300 man years of development, CIS’ query optimization and data-streaming technologies return complex, federated results faster and more reliably than any other vendor’s data virtualization offering.

CIS uses sophisticated rule- and cost-based query-optimization strategies that examine the plan of each query, and then automatically creates a plan that optimizes processing and performance, with minimum overhead.

Query Optimization Techniques Ensure High Performance

The Cisco Data Virtualization Platform employs a number of query optimization strategies and techniques including:

  • SQL Pushdown – CIS offloads as much query processing as possible by pushing down select query operations such as string searches, comparisons, local joins, sorting, aggregating, grouping into the underlying relational data sources. This allows CIS to take advantage of native data source capabilities and limit the amount of intermediate data returned from each data source.
  • Parallel Processing – CIS optimizes query execution by employing parallel and asynchronous request processing. After building an optimized query plan, CIS executes data service calls asynchronously on separate threads, reducing idle time and data source response latency.
  • Distributed Joins – CIS detects when a query being executed involves data consumed from different data sources and tries to employ distributed query optimization techniques to improve overall performance and minimize the amount of data moved over the network. CIS optimizes queries and takes advantage of a variety of sort-merge, semi, hash and nested-loop joins depending on the nature of the query and types of data sources.
  • Caching – CIS can be configured to cache results for query, procedure and web service calls on a per view/per query basis. When enabled, CIS’ caching engine stores result sets either in a local file-based cache or in a relational data base. CIS will verify whether results of a query it is about to execute are already stored in the caching system and will use the cached data as appropriate. While caching can improve the overall execution time, its impact is most pronounced when used on frequently invoked queries that are executed against rarely changing data or high latency sources such as Web Service based data providers.
  • Advanced Query Optimization –  CIS provides a number of additional techniques and algorithms include data source grouping, join algorithm selection, join ordering, union-join inversion, predicate pooling and propagation, and projection pruning.
  • Optimize the Network and the Database – Network bandwidth is generally the scarcest resource in the query processing pipeline, so reducing the amount of data that needs to be transferred has a significant impact on the latency and overall performance.  CIS is also aware of the query processing capabilities of underlying data sources and uses this information to determine the optimal workloads for each source.

How Cisco Information Server’s Query Engine Works

Query processing within the Cisco Data Virtualization Platform occurs in several phases as seen below.

Data Virtualization High-Performance Query Processing Engine

Cisco Data Virtualization Query Processing Engine

  • Plan Generation and Optimization Phase – The plan generator and cost-based optimizer are responsible for producing one or more plans for each query. Based on rules and dynamic cost functions, the optimal plan is automatically selected. The cost function measures the time and computing effort required to execute each element of the query plan. Cost information comes from a service that pulls data from the repository about each data source in the query as well as other environmental information during plan-generation. For each data source, capabilities that are factored into plans include indexes, join support, and cardinality estimates. The plan generator also compensates for the fact that each data source might have different capabilities that need to be normalized across the distributed and disparate data sources to successfully resolve the query. It’s important to note that the cost-based optimizer can work with data sources with limited metadata (e.g., a Web service) as well as systems that have a significant volume of metadata (e.g., Oracle). Also, during plan generation, the server employs non-blocking join algorithms that use both memory-based and memory-disk algorithms to manage large result sets, and that have underlying data source awareness for pushing joins and predicates to underlying sources for execution..
  • Query Execution Phase – Proprietary algorithms enable CIS to stream results from massively complex queries while the user query is still running. This lets applications begin consuming data from the server before the query completes execution, and results in substantial performance gains. Queries are decomposed and executed in the target data sources, and the streaming results are assembled and pipelined directly to the user’s client software. The parallel multi-threaded architecture of the server is truly exploited in this module where there are multiple queries with different run times across multiple data sources all required to resolve a single CIS query. The execution engine also does dynamic plan re-writing based on actual results found at run time.
  • Query Plan Enhancement Phase – An out-of-the-box installation of the Cisco Data Virtualization Platform will do a very good job of optimizing queries. However you can also enhance these query plans to improve its optimization results. During plan generation and execution, extensive statistics are collected and logged for use in a powerful query plan visualization tool found within the CIS Studio. CIS developers use these insights to learn about the cardinality and distribution of data in the data source’s tables, which in turn allows query optimization to make better decisions about join algorithms and ordering. Join options let you manually specify some of the run-time behaviors of the join instead of letting query optimization automatically determine the behavior. For example, you can select the actual algorithm with which the join will be processed. Query hints can be included to help query optimization do its job. For example, you can provide the expected cardinality of either or both sides of a join. All available query hints are outlined and explained in the Cisco Information Server Reference Manual.