DataManagementLab/PBench

PBench

A database workload synthesizer


Introduction

This repository contains the code for the VLDB'25 paper "PBench: Workload Synthesizer with Real Statistics for Cloud Analytics Benchmarking". The paper introduces the problem of workload synthesis with real statistics: generating synthetic workloads that closely approximate the execution statistics of real cloud workloads, including key performance metrics and operator distributions. To address this problem, we propose PBench, a novel workload synthesizer that constructs synthetic workloads by judiciously selecting and combining workload components (i.e., queries and databases) from existing benchmarks.

Environment

Python 3.10 is required to run PBench. To set up the environment, follow the steps below:

  1. Install Python 3.10

    sudo apt-get install python3.10
    
  2. Install required packages

    pip install -r requirements.txt
    

Workload

Snowset contains statistics (timing, I/O, resource usage, etc.) for roughly 70 million queries issued by all Snowflake customers over a 14-day period, from February 21 to March 7, 2018. PBench uses the statistics in Snowset to synthesize database workloads.
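As a small illustration of the kind of per-interval statistics a synthesizer like PBench consumes, the snippet below buckets a toy Snowset-style trace into 30-second windows with pandas. The column names (`timestamp`, `cpu_micros`) are illustrative, not the actual Snowset schema:

```python
# Sketch: aggregating Snowset-style per-query statistics into fixed intervals.
# Column names are assumptions for illustration, not the real Snowset schema.
import io

import pandas as pd

csv = io.StringIO(
    "timestamp,cpu_micros\n"
    "2018-02-21T00:00:05,1200\n"
    "2018-02-21T00:00:40,800\n"
    "2018-02-21T00:01:10,500\n"
)
df = pd.read_csv(csv, parse_dates=["timestamp"])

# Bucket queries into 30-second windows and summarize each window,
# mirroring the interval-based targets used during synthesis.
per_window = df.resample("30s", on="timestamp")["cpu_micros"].agg(["count", "sum"])
print(per_window)
```

Each row of `per_window` then gives a per-interval query count and resource total, which is the shape of target a per-window synthesizer tries to match.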

Usage

PBench can synthesize database workloads using different methods. The following sections describe how to use each method.

Configuration

Configuration can be set in PBench-tool/config/*.yml.

This document provides a brief overview of the configuration parameters specified in the YAML file. The setup is designed to generate and execute workloads efficiently.

  • Workload Path: ../../Workloads/Snowset/workload1h-5m-30s_1.csv
    Path to the original workload trace file.

  • Workload Name: workload1h-5m-30s_1
    A unique identifier for the workload, used for referencing and logging.

  • Count Limit: 1000
    The maximum number of queries to execute during the workload.

  • Time Limit: 270 seconds
    The total duration of the time window within which all operations must complete.

  • Use Operator: 1
    Whether the operator distribution is used as a synthesis target (1 enables it).

  • Interval: 30 seconds
    The interval between query execution cycles.

  • Query Types: [TPCH, TPCH, TPCH, TPCH, tpcds_all, tpcds_all, imdb, llm]
    The types of queries included in the workload, covering different datasets and benchmarks.

  • Database Names: [tpch500m, tpch1g, tpch5g, tpch9g, tpcds1g, tpcds2g, imdb, llm]
    The databases against which the queries are executed, each representing a different dataset or scale.

  • Operator Scale: 100
    The scaling factor applied to the operator target.

  • Initial Count: 10
    The initial target number of queries per time window.
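Putting the parameters above together, a file under PBench-tool/config/ might look like the following. This is a sketch assembled from the documented values; the exact key names in PBench's YAML schema are assumptions and may differ:

```yaml
# Illustrative config sketch; key names are assumptions, not PBench's exact schema.
workload_path: ../../Workloads/Snowset/workload1h-5m-30s_1.csv
workload_name: workload1h-5m-30s_1
count_limit: 1000    # maximum number of queries to execute
time_limit: 270      # seconds; duration of the time window
use_operator: 1      # 1 = use the operator distribution as a synthesis target
interval: 30         # seconds between execution cycles
query_types: [TPCH, TPCH, TPCH, TPCH, tpcds_all, tpcds_all, imdb, llm]
database_names: [tpch500m, tpch1g, tpch5g, tpch9g, tpcds1g, tpcds2g, imdb, llm]
operator_scale: 100
initial_count: 10
```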

Baseline experiments

The baseline tools include two widely known workload synthesizers: CAB and Stitcher. We provide our implementations and startup code in Baseline/do_baseline.py.

PBench

To synthesize database workloads by PBench, follow the steps below:

  1. Collect the statistics of queries

    python Collect_metrics/collect.py 
    
  2. Synthesize workload and replay

    python PBench-tool/run_pbench.py
    

Parameters of PBench can be set in PBench-tool/configs.
