With the ever-increasing amount of data stored and processed, there is an ongoing need to test not only database management systems but also data-intensive systems in general. Specifically, emerging technologies such as Non-Volatile Memory impose new challenges (e.g., avoiding persistent memory leaks and partial writes), and novel system designs involving FPGAs, GPUs, and RDMA call for additional attention and sophistication.
Continuing the success of the eight previous workshops, the goal of DBTest 2022 is to bring together researchers and practitioners from academia and industry to discuss key problems and ideas related to testing database systems and applications. The long-term objective is to reduce the cost and time required to test and tune data management and processing products so that users and vendors can spend more time and energy on actual innovations.
The growing popularity of JSON as an exchange and storage format in business and analytical applications has led to its rapid dissemination, making timely storage and processing of JSON documents crucial for organizations. Consequently, specialized JSON document stores are ubiquitously used for diverse domain-specific workloads, yet a JSON-specific benchmark is still missing.
In this work, we specify DeepBench, an extensible, scalable benchmark that addresses nested JSON data as well as queries over JSON documents. DeepBench features configurable domain-independent scale levels (e.g., varying document sizes, concurrent users) and JSON-specific scale levels (e.g., object and array nesting). The evaluation of well-known document stores with a prototypical DeepBench implementation shows its versatility and gives new insights into potential weaknesses that were not found by existing, non-JSON benchmarks.
Treasure Data processes millions of distributed SQL queries every day on the cloud. Upgrading the query engine service at this scale is challenging because we need to migrate all of the customers' production queries to a new version while preserving the correctness and performance of the data processing pipelines. To ensure the quality of the query engines, we utilize our query logs to build customer-specific benchmarks and replay these queries with real customer data in a secure pre-production environment. To simulate millions of queries, we need effective minimization of test query sets and better reporting of the simulation results to proactively find incompatible changes and performance regressions in the new version. This paper describes the overall design of our system and shares various challenges in maintaining the quality of the query engine service on the cloud.
Dataframes have become a popular means to represent, transform, and analyze data. This approach has gained traction and a large user base among data science practitioners, resulting in a new wave of systems that implement a dataframe API but allow for performance, efficiency, and distributed/parallel extensions to systems such as R and pandas. However, unlike relational databases and NoSQL systems, which have a variety of benchmarking, testing, and workload generation suites, there is an acute lack of similar tools for dataframe-based systems. This paper presents fuzzydata, a first step toward an extensible workflow generation system that targets dataframe-based APIs. We present an abstract data processing workflow model, random table and workflow generators, and three clients implemented using our model. Using fuzzydata, we can encode a real-world workflow or randomly generate workflows using various parameters. These workflows can be scaled and replayed on multiple systems to provide stress testing, performance evaluation, and a breakdown of performance bottlenecks in popular dataframe systems.
Authors are invited to submit original, unpublished research papers that are not being considered for publication in any other forum.
Paper submission: March 11, 2022 / 11:59 PM US PST
Notification of acceptance: April 4, 2022 / 11:59 PM US PST
Camera-ready copy: April 17, 2022 / 11:59 PM US PST
DBTest will be held as a hybrid workshop this year. While there will be a physical presence, we will also use the Zoom video conferencing platform to stream the presentations and hold interactive discussions about the papers and topics presented at the workshop.
The program for this year features two keynotes, three full papers, one short presentation, and a panel discussion, and is structured as follows (all times are EST):
Start Time (EST) | Title | Presenter | Mode | Recording | Slides |
---|---|---|---|---|---|
9:30 AM | Welcome | Manuel Rigger and Pınar Tözün | Remote | YouTube | |
9:45 AM | Journey of Migrating Millions of Queries on The Cloud | T. Saito, N. Takezoe, Y. Okada, T. Shimamoto, D. Yu, S. Chandrashekharachar, K. Sasaki, S. Okumiya, Y. Wang, T. Kurihara, R. Kobayashi, K. Suzuki, Z. Yang, and M. Onizuka | Remote | YouTube | Slides |
10:15 AM | Benchbot: Benchmark as a Service for TiDB | Yuying Song and Huansheng Chen | Remote | YouTube | Slides |
10:30 AM | Break | | | | |
11:00 AM | Keynote 1: DuckDB Testing - Present and Future | Mark Raasveldt | Remote | YouTube | Slides |
12:00 PM | DeepBench – Benchmarking JSON Document Stores | Stefano Belloni, Daniel Ritter, Marco Schröder, Nils Rörup | Remote | YouTube | Slides |
12:30 PM | Lunch | | | | |
2:00 PM | Keynote 2: Tackling performance and correctness problems in database-backed web applications | Shan Lu | Remote | YouTube | Slides |
3:00 PM | FuzzyData: A Scalable Workload Generator for Testing Dataframe Workflow Systems | Mohammed Suhail Rehman, Aaron Elmore | In-person | YouTube | Slides |
3:30 PM | Break (Poster Session) | | | | |
4:30 PM | Panel Discussion | Greg Law, Allison Lee, Abdul Quamar, Yingjun Wu | Hybrid | YouTube | |
5:25 PM | Closing | Manuel Rigger and Pınar Tözün | Remote | | |
DBTest is supported by the following sponsors: