With the ever-increasing amount of data stored and processed, there is an ongoing need to test database management systems and, more broadly, data-intensive systems. In particular, emerging technologies such as Non-Volatile Memory impose new challenges (e.g., avoiding persistent memory leaks and partial writes), and novel system designs involving FPGAs, GPUs, and RDMA call for additional attention and sophistication.
Building on the success of the seven previous workshops, the goal of DBTest 2020 is to bring together researchers and practitioners from academia and industry to discuss key problems and ideas related to testing database systems and applications. The long-term objective is to reduce the cost and time required to test and tune data management and processing products, so that users and vendors can spend more time and energy on actual innovation.
Like most workshops associated with this year's SIGMOD (and like SIGMOD 2020 itself), DBTest will be a virtual workshop. We will use the Zoom video conferencing platform to stream the presentations and to hold interactive discussions about the papers and topics presented at the workshop.
The program for this year features three keynotes, five full papers, and two short papers, and is structured as follows (all times are PST):
| Title | Speakers / Authors | Video | Slides |
| --- | --- | --- | --- |
| Keynote 1: From HyPer to Hyper: Integrating an academic DBMS into a leading analytics and business intelligence platform | Tobias Mühlbauer and Jan Finis | https://youtu.be/iAlkUq411mc | Slides |
| SparkFuzz: Searching Correctness Regressions in Modern Query Engines | Bogdan Ghit, Nicolas Poggi, Josh Rosen, Reynold Xin and Peter Boncz | https://youtu.be/l90E5EUCKQA | Slides |
| On Another Level: How to Debug Compiling Query Engines | Timo Kersten and Thomas Neumann | https://youtu.be/vIcoM2cyLKs | Slides |
| Keynote 2: Benchmark(et)ing an LSM-Tree vs a B-Tree | Mark Callaghan | https://youtu.be/vyLZHx9aZQc | Slides |
| Automated System Performance Testing at MongoDB | Henrik Ingo and David Daly | https://youtu.be/FI5BrYkWNvg | Slides |
| CoreBigBench: Benchmarking Big Data Core Operations | Todor Ivanov, Ahmad Ghazal, Alain Crolotte, Pekka Kostamaa and Yoseph Ghazal | https://youtu.be/V8SdCOl7VyM | Slides |
| FacetE: Exploiting Web Tables for Domain-Specific Word Embedding Evaluation | Michael Günther, Paul Sikorski, Maik Thiele and Wolfgang Lehner | https://youtu.be/UE3ZW5HIGDw | Slides |
| Keynote 3: How to clear your backlog of failing tests and make your test suite All Green | Greg Law | https://youtu.be/WH5Gu9RC6to | |
| Testing Query Execution Engines with Mutations | Xinyue Chen, Chenglong Wang and Alvin Cheung | | Slides |
| Workload Merging Potential in SAP Hybris | Robin Rehrmann, Martin Keppner, Wolfgang Lehner, Carsten Binnig and Arne Schwarz | https://youtu.be/eUfnuvsy-Mw | Slides |
We are still in the process of collecting and uploading the remaining videos and slides that are not yet available above; please stay tuned.
With more than 1200 contributors, Apache Spark is one of the most actively developed open-source projects. At this scale and pace of development, mistakes are bound to happen. In this paper we present SparkFuzz, a toolkit we developed at Databricks for uncovering correctness errors in the Spark SQL engine. To guard the system against correctness errors, SparkFuzz takes a fuzzing approach to testing by generating random data and queries. SparkFuzz executes the generated queries on a reference database system such as PostgreSQL, which is then used as a test oracle to verify the results returned by Spark SQL. We explain the approach we take to data and query generation, and we analyze the coverage of SparkFuzz. We show that SparkFuzz achieves its current maximum coverage relatively fast by generating a small number of queries.
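The oracle-based comparison at the heart of this approach can be sketched in a few lines. This is not SparkFuzz itself; as a stand-in for Spark SQL and PostgreSQL, the sketch runs the same query on two independent SQLite connections and flags any divergence:

```python
import sqlite3

def run_query(conn, sql):
    """Execute a query and return its rows in a canonical (sorted) order."""
    return sorted(conn.execute(sql).fetchall())

def differential_check(sql, conn_a, conn_b):
    """Run the same query on two engines and compare results.

    In a real fuzzing setup, conn_a would be the system under test and
    conn_b the reference oracle; a mismatch signals a correctness bug.
    """
    return run_query(conn_a, sql) == run_query(conn_b, sql)

# Two in-memory databases stand in for the engine under test and the oracle.
a = sqlite3.connect(":memory:")
b = sqlite3.connect(":memory:")
for conn in (a, b):
    conn.execute("CREATE TABLE t (x INTEGER, y INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 2), (3, 4)])

assert differential_check("SELECT x, SUM(y) FROM t GROUP BY x", a, b)
```

Sorting the result rows makes the comparison insensitive to row ordering, which unordered SQL queries do not guarantee.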
Compilation-based query engines generate and compile code at runtime, which is then run to produce the query result. In this process, two levels of source code are involved: the code of the code generator itself and the code that is generated at runtime. This can make debugging quite indirect, as a fault in the generated code was caused by an error in the generator. To find the error, we have to look at both the generated code and the code that generated it. Current debugging technology is not equipped to handle this situation. For example, GNU's gdb only offers facilities to inspect one source level, but not multiple source levels. Also, current debuggers are not able to reconstruct additional program state for further source levels, so context is missing during debugging. In this paper, we show how to build a multi-level debugger for generated queries that solves these issues. We propose to use a time-travelling debugger to provide context information for compile time and runtime, thus providing full interactive debugging capabilities for every source level. We also present how to build such a debugger with low engineering effort by combining existing tool chains.
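To illustrate the two-level problem, a common single-level workaround (not the paper's multi-level, time-travelling approach) is to have the generator emit C `#line` directives so that a debugger maps generated statements back to the generator source. A minimal sketch, with the hypothetical generator file name `codegen.py` as an assumption:

```python
def emit(lines_with_origin):
    """Emit C code with #line directives pointing back at the generator.

    Each (origin_line, code) pair records where in the (hypothetical)
    generator source 'codegen.py' the fragment was produced, so a C
    debugger stepping through the generated code reports the generator
    position instead of the generated file.
    """
    out = []
    for origin_line, code in lines_with_origin:
        out.append(f'#line {origin_line} "codegen.py"')
        out.append(code)
    return "\n".join(out)

generated = emit([
    (10, "int sum = 0;"),
    (11, "for (int i = 0; i < n; i++) sum += col[i];"),
])
print(generated)
```

Note the limitation the paper addresses: with `#line` alone the debugger sees either the generator or the generated code, never both levels at once.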
Distributed Systems Infrastructure (DSI) is MongoDB's framework for running fully automated system performance tests in our Continuous Integration (CI) environment. To run in CI, it needs to automate everything end-to-end: provisioning and deploying multi-node clusters, executing tests, tuning the system for repeatable results, and collecting and analyzing the results. Today, DSI is MongoDB's most used and most useful performance testing tool. It runs almost 200 different benchmarks in daily CI, and we also use it for manual performance investigations. Because we can alert the responsible engineer in a timely fashion, all but one of the major regressions were fixed before the 4.2.0 release. We are also able to catch net-new improvements, of which DSI caught 17. We open-sourced DSI in March 2020.
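The alerting described above requires comparing each new benchmark result against recent history. The toy check below is only a threshold-based sketch, not DSI's actual analysis; the function name and the 5% threshold are assumptions for illustration:

```python
def flag_regression(history, latest, threshold=0.05):
    """Flag a regression when the latest throughput result falls more than
    `threshold` (as a fraction) below the mean of recent history.

    Hypothetical stand-in for a real CI performance check; numbers are
    ops/sec, so higher is better.
    """
    baseline = sum(history) / len(history)
    return latest < baseline * (1 - threshold)

assert flag_regression([100.0, 102.0, 98.0], 90.0)      # clear drop: alert
assert not flag_regression([100.0, 102.0, 98.0], 99.0)  # within noise: pass
```

A fixed threshold like this is noisy in practice, which is why mature systems analyze whole result series rather than single points.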
Significant effort has been put into big data benchmarking, with a focus on end-to-end applications. While such benchmarks cover basic functionalities implicitly, the individual contributions to overall performance are hidden. As a result, end-to-end benchmarks can be biased toward certain basic functions. Micro-benchmarks are more explicit at covering basic functionalities, but they are usually targeted at a few highly specialized functions. In this paper we present CoreBigBench, a benchmark that focuses on the most common functionalities of big data engines/platforms, such as scans, two-way joins, common UDF execution, and more. These common functionalities are benchmarked over relational and key-value data models, which cover the majority of data models. The benchmark consists of 22 queries applied to sales data and key-value web logs, covering the basic functionalities. We ran CoreBigBench on Hive as a proof of concept, verified that the benchmark is easy to deploy, and collected performance data. Finally, we believe that CoreBigBench is a good fit for performance testing of commercial big data engines focused on basic engine functionalities not covered by end-to-end benchmarks.
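The micro-benchmark idea of timing one core operation in isolation can be sketched as follows. These are not CoreBigBench's queries; SQLite, the table schema, and the repeat count are assumptions chosen to keep the sketch self-contained:

```python
import sqlite3
import time

def time_query(conn, sql, repeats=5):
    """Run one query several times and report the best wall-clock time,
    isolating a single core operation the way a micro-benchmark would."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        best = min(best, time.perf_counter() - start)
    return best

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(10000)])

scan = time_query(conn, "SELECT COUNT(*) FROM sales")    # full scan
agg = time_query(conn, "SELECT SUM(amount) FROM sales")  # scan + aggregation
```

Reporting per-operation times like `scan` and `agg` separately is exactly the visibility that an end-to-end number hides.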
Today's natural language processing and information retrieval systems depend heavily on word embedding techniques to represent text values. However, choosing a word embedding dataset for a specific task is not trivial. Current word embedding evaluation methods mostly provide only a one-dimensional quality measure, which does not express how knowledge from different domains is represented in the word embedding models. To overcome this limitation, we provide a new evaluation data set called FacetE, derived from 125M Web tables, which enables domain-sensitive evaluation. We show that FacetE can effectively be used to evaluate word embedding models. The evaluation of common general-purpose word embedding models suggests that there is currently no single best word embedding for every domain.
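A per-domain evaluation score, as opposed to a single global number, can be sketched with plain cosine similarity. The tiny 2-dimensional vectors and domain word pairs below are invented for illustration and have nothing to do with FacetE's actual data:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def domain_score(embeddings, pairs):
    """Mean similarity over word pairs that should be related in one domain.

    Scoring each domain separately exposes what a one-dimensional
    benchmark hides: a model can be strong in one domain and weak in another.
    """
    return sum(cosine(embeddings[a], embeddings[b]) for a, b in pairs) / len(pairs)

# Toy embeddings: hardware words cluster on one axis, finance on the other.
emb = {"ram": [1.0, 0.1], "cpu": [0.9, 0.2],
       "stock": [0.1, 1.0], "bond": [0.2, 0.9]}
hardware = domain_score(emb, [("ram", "cpu")])
finance = domain_score(emb, [("stock", "bond")])
assert hardware > 0.9 and finance > 0.9
```

Comparing such per-domain scores across embedding models is the kind of analysis a domain-sensitive benchmark enables.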
The query execution engine plays an important role in modern database systems. However, due to its complex nature, validating the correctness of a query execution engine is inherently challenging. In particular, the high cost of testing query execution engines often prevents developers from iterating quickly during development, which can lengthen the development cycle or lead to production-level bugs. To address this challenge, we propose MutaSQL, a tool that can quickly discover correctness bugs in SQL execution engines. MutaSQL generates test cases by mutating a query Q over a database D into a query Q′ that should evaluate to the same result as Q on D. MutaSQL then checks the execution results of Q′ and Q on the engine under test. We evaluated MutaSQL on previous SQLite versions with known bugs as well as on the newest SQLite release. MutaSQL effectively reproduced 34 bugs in previous versions and discovered a new bug in the current SQLite release.
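The Q-versus-Q′ check can be sketched concretely against SQLite. The two rewrite rules below (an always-true predicate and its double negation) are simple illustrative stand-ins, not MutaSQL's actual mutation rules:

```python
import sqlite3

def mutate(query):
    """Produce mutated queries that should preserve the result set.

    Illustrative result-preserving rewrites: appending an always-true
    predicate, and appending its double negation.
    """
    return [
        query + " WHERE 1 = 1",
        query + " WHERE NOT (1 <> 1)",
    ]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

q = "SELECT x FROM t"
expected = sorted(conn.execute(q).fetchall())
for mutant in mutate(q):
    # Any mismatch between Q and Q' would point at an engine correctness bug.
    assert sorted(conn.execute(mutant).fetchall()) == expected
```

The appeal of this oracle is that it needs no second reference system: the engine under test is checked against itself on provably equivalent queries.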
OLTP DBMSs in enterprise scenarios often face the challenge of dealing with workload peaks caused by events such as Cyber Monday or Black Friday. The traditional way to avoid running out of resources and thus cope with such workload peaks is significant over-provisioning of the underlying infrastructure. Another direction is to apply resource sharing. In recent work, we showed that merging read statements in OLTP scenarios offers the opportunity to maintain low latency for systems under heavy load without over-provisioning.
In this paper, we analyze a real enterprise OLTP workload --- SAP Hybris --- with respect to statement types, complexity, and hot-spot statements to find potential candidates for workload sharing in OLTP. We additionally apply work sharing to the Hybris workload in our system OLTPShare and report the resulting savings in CPU consumption. Another interesting effect we show is that with OLTPShare, we can increase SAP Hybris throughput by 20%.
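The core merging idea, executing one statement on behalf of many clients, can be sketched as follows. This is a toy illustration of statement merging in general, not OLTPShare's implementation; the batching interface and SQLite backend are assumptions:

```python
import sqlite3
from collections import defaultdict

def merge_and_run(conn, statements):
    """Execute each distinct read statement once and fan the rows out to
    every client that submitted it.

    Under heavy load, many concurrent clients issue identical SELECTs,
    so a single execution can serve the whole group.
    """
    groups = defaultdict(list)
    for client_id, sql in statements:
        groups[sql].append(client_id)
    results = {}
    for sql, clients in groups.items():
        rows = conn.execute(sql).fetchall()  # executed once per distinct SQL
        for client_id in clients:
            results[client_id] = rows
    return results, len(groups)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, price REAL)")
conn.execute("INSERT INTO products VALUES (1, 9.99)")

# Four clients submit the identical hot-spot statement.
stmts = [(c, "SELECT price FROM products WHERE id = 1") for c in range(4)]
results, executed = merge_and_run(conn, stmts)
assert executed == 1 and len(results) == 4
```

The CPU saving comes from the gap between submitted and executed statements; here four requests cost one execution.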