Over the last few years, we have seen a growing body of academic and industry work on benchmarking and performance evaluation in novel cloud settings, driven both by increasingly diverse workloads and by new hardware such as FPGAs and GPUs. Moreover, new classes of data-driven applications (e.g., machine learning and big data scenarios) now need to be considered, while at the same time SQL and NoSQL engines alike keep evolving and new engine concepts such as unified engines are emerging. Consequently, special testing efforts and rigor are crucial to ensure classical database strengths such as reliability, integrity, and performance in these novel system architectures and designs.
The goal of DBTest 2024 is to bring together researchers and practitioners from academia and industry to discuss challenges, approaches, and open problems around these issues.
Standardized benchmarks are crucial to ensure a fair comparison of performance across systems. While extremely valuable, these benchmarks all use a setup in which the workload is well-defined and known in advance. Unfortunately, this has led to over-tuning data management systems for particular benchmark workloads such as TPC-H or TPC-C. As a result, benchmarking results frequently do not reflect the behavior of these systems in real-world settings, since real workloads often differ significantly from the "known" benchmark workloads. To address this issue, we present surprise benchmarking, a complementary approach to current standardized benchmarking in which "unknown" queries are exercised during the evaluation.
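The abstract does not spell out how the "unknown" queries are produced, so the following is purely an illustrative sketch (not the authors' method): one simple way to realize the idea is to hold out part of a parameterized query pool, so that a system can only ever be tuned against the published part. All names below are assumptions.

```python
import random

def split_templates(templates, surprise_fraction=0.5, seed=None):
    """Publish only the 'known' part of a query-template pool ahead of time;
    evaluate (also) on the held-out 'surprise' part that the system under test
    could not have been tuned against."""
    rng = random.Random(seed)
    pool = list(templates)
    rng.shuffle(pool)
    cut = int(len(pool) * (1 - surprise_fraction))
    return pool[:cut], pool[cut:]   # (known, surprise)

# Hypothetical pool of 22 TPC-H-style template names.
known, surprise = split_templates([f"Q{i}" for i in range(1, 23)], seed=7)
```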
Key-value stores are at the core of several modern NoSQL-based data systems, and thus a comprehensive benchmark tool is of paramount importance in evaluating their performance under different workloads. Prior research reveals that real-world workloads have a diverse range of characteristics, such as the fraction of point queries that target non-existing keys, point and range deletes, as well as different distributions for queries and updates, all of which have very different performance implications. State-of-the-art key-value workload generators, such as YCSB and db_bench, fail to generate workloads that emulate these practical workloads, limiting the dimensions along which we can benchmark the systems' performance.
In this paper, we present KVBench, a novel synthetic workload generator that fills the gap between classical key-value workload generators and more complex real-life workloads. KVBench supports a wide range of operations, including point queries, range queries, inserts, updates, deletes, and range deletes; among these, inserts, queries, and updates can each follow customizable distributions. Compared to state-of-the-art key-value workload generators, KVBench offers a richer array of knobs, including the proportion of empty point queries, customized distributions for updates and queries, and range deletes with a specific selectivity, constituting a significantly more flexible framework that can better emulate real-world workloads.
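The abstract names KVBench's knobs (operation mix, fraction of empty point queries, skewed query/update distributions, range-delete selectivity) but not its interface. The sketch below is a hypothetical, heavily simplified Python generator in that spirit, not KVBench's actual API; all parameter names are illustrative assumptions.

```python
import random

def generate_workload(num_ops=1_000,
                      num_keys=10_000,
                      # insert, point query, update, range query, range delete
                      op_mix=(0.3, 0.3, 0.2, 0.1, 0.1),
                      empty_point_query_fraction=0.2,   # point queries on non-existing keys
                      range_selectivity=0.001,          # fraction of the key domain per range op
                      zipf_s=1.1,                       # skew of the query/update distribution
                      seed=0):
    """Emit a list of (operation, key[, end_key]) tuples over a pre-loaded,
    contiguous key domain [0, num_keys)."""
    rng = random.Random(seed)
    keys = list(range(num_keys))
    zipf_weights = [1.0 / (i + 1) ** zipf_s for i in range(num_keys)]
    kinds = ("insert", "point_query", "update", "range_query", "range_delete")
    span = max(1, int(range_selectivity * num_keys))
    next_new_key = num_keys
    ops = []
    for _ in range(num_ops):
        kind = rng.choices(kinds, weights=op_mix)[0]
        if kind == "insert":
            ops.append(("insert", next_new_key))
            next_new_key += 1
        elif kind == "point_query" and rng.random() < empty_point_query_fraction:
            ops.append(("point_query", -rng.randrange(1, num_keys)))  # negative key: guaranteed miss
        elif kind in ("point_query", "update"):
            ops.append((kind, rng.choices(keys, weights=zipf_weights)[0]))
        else:  # range_query or range_delete over a contiguous key span
            start = rng.randrange(num_keys - span)
            ops.append((kind, start, start + span))
    return ops

workload = generate_workload()
```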
Database systems use indexes on frequently accessed attributes to accelerate query and transaction processing. This requires paying the cost of maintaining and updating those indexes, which can be thought of as the process of adding structure (e.g., sort order) to an otherwise unstructured data collection. The indexing cost is redundant when data arrives pre-sorted, even if only to some degree. While recent work has studied how classical indexes like B+-trees fail to fully exploit near-sortedness during ingestion, such an exploration is missing for other index designs like read-optimized learned indexes or write-optimized LSM-trees.
In this paper, we bridge this gap by conducting the first-ever study of the behavior of learned indexes and LSM-trees when varying the data sortedness of an ingestion workload. Specifically, we build on prior work on benchmarking data sortedness in B+-trees and expand its scope to benchmark: (i) ALEX and LIPP, two updatable learned index designs; and (ii) the LSM-tree engine offered by RocksDB. We present our evaluation framework and detail key insights into the performance of these index designs under varying data sortedness. Our observations indicate that learned indexes exhibit unpredictable performance when ingesting differently sorted data, while LSM-trees can benefit from sortedness-aware optimizations. We highlight the potential headroom for improvement and lay the groundwork for further research in this domain.
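Prior B+-tree work parameterizes sortedness roughly by how many entries are out of place and how far they stray from their sorted position; the exact parameterization used here is not given in the abstract, so the following is only an assumed, simplified way to produce near-sorted ingestion streams for such an experiment.

```python
import random

def near_sorted_stream(n, frac_displaced=0.1, max_displacement=100, seed=0):
    """Start from fully sorted keys, then swap roughly frac_displaced * n entries
    with an entry up to max_displacement positions away. frac_displaced = 0 yields
    a fully sorted stream; larger fractions and displacements approach a fully
    scrambled stream."""
    rng = random.Random(seed)
    keys = list(range(n))
    for _ in range(int(frac_displaced * n)):
        i = rng.randrange(n)
        j = min(n - 1, max(0, i + rng.randint(-max_displacement, max_displacement)))
        keys[i], keys[j] = keys[j], keys[i]
    return keys

# Sweep the sortedness knob and ingest each stream into the index under test.
for frac in (0.0, 0.05, 0.25, 0.5):
    stream = near_sorted_stream(1_000_000, frac_displaced=frac)
    # ingest `stream` into the index under test (e.g., ALEX, LIPP, RocksDB) and time it
```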
The log-structured merge (LSM) tree is an ingestion-optimized data structure that is widely used in modern NoSQL key-value stores. To support a high write throughput, LSM-trees maintain an in-memory buffer that absorbs incoming entries before writing them to slower secondary storage. We point out that the choice of data structure and implementation for this memory buffer has a significant impact on the overall performance of LSM-based storage engines. In fact, even with the same buffer implementation, the performance of a storage engine can vary by up to several orders of magnitude when the input workload shifts.
In this paper, we benchmark the performance of LSM-based storage engines with different memory buffer implementations and under different workload characteristics. We experiment with four buffer implementations, namely, (i) vector, (ii) skip-list, (iii) hash skip-list, and (iv) hash linked-list, and for each implementation, we vary key design choices (such as the bucket count in a hash skip-list and the prefix length in a hash linked-list). We present a comprehensive performance benchmark for each buffer configuration, and highlight how the relative performance of the different buffer implementations varies as the input workload shifts. Lastly, we present a guideline for selecting the appropriate buffer implementation for a given workload and performance goal.
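In RocksDB the four variants are selected through the engine's memtable factory option; rather than reproduce that configuration, the stand-alone Python stand-ins below (illustrative assumptions, not RocksDB's actual memtable implementations) sketch why the choice matters: an append-only vector makes writes cheap and point reads expensive, an ordered buffer does the opposite, and hash partitioning by key prefix cheapens point operations at the cost of range scans.

```python
import bisect
from collections import defaultdict

class VectorBuffer:
    """Append-only vector: O(1) inserts, but a point lookup must scan the
    buffer (newest entry wins), so reads get slower as the buffer fills."""
    def __init__(self):
        self.entries = []
    def put(self, key, value):
        self.entries.append((key, value))
    def get(self, key):
        for k, v in reversed(self.entries):   # newest to oldest
            if k == key:
                return v
        return None

class SortedBuffer:
    """Ordered buffer (stand-in for a skip-list): point lookups are logarithmic,
    but every insert pays to keep the entries sorted."""
    def __init__(self):
        self.keys, self.values = [], []
    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value            # in-place update of an existing key
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)
    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

class HashPartitionedBuffer:
    """Prefix-hashed buffer (stand-in for a hash skip-list / hash linked-list):
    operations touch only one small bucket, but range scans must merge buckets."""
    def __init__(self, prefix_len=4):
        self.prefix_len = prefix_len
        self.buckets = defaultdict(SortedBuffer)
    def put(self, key, value):
        self.buckets[key[:self.prefix_len]].put(key, value)
    def get(self, key):
        return self.buckets[key[:self.prefix_len]].get(key)

buf = HashPartitionedBuffer(prefix_len=2)
buf.put("user42", 1)
assert buf.get("user42") == 1
```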
Performance benchmarking is a crucial tool to evaluate whether performance degradation occurs when a system is modified, allowing cloud providers to plan the provisioning of and changes to their resources. This is most commonly realized through the execution of standardized industry benchmarks. However, these benchmarks oftentimes do not capture all facets of actual customer workloads. To mitigate this issue, one can use an anonymized version of the customer's workload, produced through techniques such as hashing, k-anonymity, or differential privacy, allowing for targeted and customized performance benchmarking. In this paper, we focus on one of these techniques, namely differential privacy, and examine its impact on benchmarking, i.e., whether a differentially private workload maintains the same performance characteristics as the original workload. We discuss several challenges specific to differential privacy, such as the handling of unique column values, queries that span multiple tables, and continuous values, as well as how effectively it can be deployed. Our examination shows that differential privacy is a promising technique in this space but has practical limitations, such as scaling problems and, for some queries, a distortion of the performance characteristics.
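The abstract does not describe the exact mechanism used, so as a loose illustration only: the classic Laplace mechanism adds noise scaled to sensitivity/epsilon, for example to per-bucket counts of a column before those counts drive synthetic workload generation. The unique-value and multi-table challenges mentioned above are exactly where such a simple recipe starts to break down; all names and numbers below are placeholders.

```python
import numpy as np

def laplace_release(value, sensitivity, epsilon, rng):
    """Classic Laplace mechanism: epsilon-DP release of a numeric statistic whose
    sensitivity (maximum change caused by one individual) is `sensitivity`."""
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)

# Privatize the histogram of a continuous column (counting queries have
# sensitivity 1) before using it to generate a synthetic benchmark workload.
true_counts = np.array([120, 45, 3, 980, 210], dtype=float)
noisy_counts = np.array(
    [laplace_release(c, sensitivity=1.0, epsilon=0.5, rng=rng) for c in true_counts]
)
print(noisy_counts.round(1))
```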
The rate at which applications gather geospatial data today has turned data loading into a critical component of data analysis pipelines. However, users are confronted with multiple file formats for storing geospatial data and an array of systems for processing it. To shed light on how the choice of file format and system affects performance, this paper explores the performance of loading geospatial data stored in diverse file formats using different libraries. It aims to study the impact of different file formats, compare loading throughput across spatial libraries, and examine the microarchitectural behavior of geospatial data loading. Our findings show that GeoParquet files provide the highest loading throughput across all benchmarked libraries. Furthermore, we note that the more spatial features per byte a file format can store, the higher the data loading throughput. Our microarchitectural analysis reveals high instructions per cycle (IPC) during spatial data loading for most libraries and formats. Additionally, our experiments show that instruction misses dominate L1 cache misses, except for GeoParquet files, where data misses take over.
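As a small example of the kind of loading being measured (the paper's actual library set, datasets, and harness are not specified here; GeoPandas and the file paths below are assumptions), a minimal throughput comparison might look like this:

```python
import time
import geopandas as gpd

def time_load(path, reader):
    """Time a single end-to-end load of a geospatial file into a GeoDataFrame."""
    start = time.perf_counter()
    gdf = reader(path)
    return len(gdf), time.perf_counter() - start

# Columnar GeoParquet vs. a text-based format such as GeoJSON (paths are placeholders).
n, secs = time_load("buildings.parquet", gpd.read_parquet)
print(f"GeoParquet: {n} features in {secs:.2f}s ({n / secs:,.0f} features/s)")

n, secs = time_load("buildings.geojson", gpd.read_file)
print(f"GeoJSON:    {n} features in {secs:.2f}s ({n / secs:,.0f} features/s)")
```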
Ev has a passion for teaching computers to find bugs – ideally without making other developers too unhappy.
He has been on the Spanner Engineering Productivity team at Google since 2018, where he builds tooling to sustainably ensure the correctness and reliability of Spanner. He was an invited industry attendee at the 2023 Dagstuhl Seminar on Ensuring the Reliability and Robustness of Database Management Systems.
Prior to Google, he worked at Microsoft from 2014 to 2018 on scaling static analysis and security tooling. While there, he contributed to the SARIF standard for tooling results and spoke at BlueHat about building security tooling.
When he isn’t trying to teach computers to break software (usually other people’s), he collects fountain pens, mixes a decent manhattan, plays violin, and recently adopted a puppy.
He graduated from the University of Virginia with a BA in Computer Science and Mathematics in December 2013.
Jesús is a Principal Research SDE Manager at the Gray Systems Lab (GSL), the applied research group within Azure Data. His research focuses broadly on optimizing the performance and efficiency of data systems, in close collaboration with product and engineering teams.
Before joining Microsoft, Jesús held various engineering positions at Cloudera, where he worked on query processing and optimization in Cloudera’s SQL data warehouse engines. He has also been actively involved with open-source projects such as Apache Calcite and Apache Hive.
Jesús earned his PhD from the University of Paris-Sud and Inria, France, and holds a degree in Computer Science and Engineering from the University of Almería, Spain.
Authors are invited to submit original, unpublished research papers, or experience papers that describe the implementation of testing-related solutions in applications and products, pertaining to any of DBTest's topics of interest listed above. Submissions must not be under consideration for publication in any other forum.
Submissions are expected to be 4 to 6 pages long, excluding references and appendix.
Accepted research submissions will be published in the ACM DL.
Authors can also submit a talk proposal, either for previously published but relevant content or for new ideas within the scope of DBTest's topics of interest on which they would like feedback from the community.
The submission should consist of 1 page, including references and appendix. Talk proposals must be marked with the suffix [Talk Proposal] in the submission's title.
Accepted talk proposals will be listed on the DBTest homepage.
Authors are required to follow the current ACM Proceedings Format.
Submissions will be handled through HotCRP.
March 15, 2024 / 11:59PM US PST
April 11, 2024 / 11:59PM US PST
May 6, 2024 / 11:59PM US PST