Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores

Multicore scalability. VLDB 14'.


This paper explores the performance of seven (class of 2PL and Timestamp ordering) concurrent control algorithms when the scaling of cores grows up to 1024. The author introduces the motivation of the DBMS schemes on with thousands of cores and the mechanisms concurrent control algorithms concisely. And in the experiment part, this paper did a very thorough analysis of different concurrency control algorithms on different workloads (YCSB and TPC-C) and identified the bottleneck of current algorithms. And in the last part, the author concluded that none of current concurrency algorithms can scale well on chips with up to 1024 cores, even if they are state-of-the-art.


  • This paper gives a good introduction to the motivation and background of this paper. For example, it illustrates seven concurrency algorithms concisely.

  • The Author completed this experiment with great patience and cautiousness. They implemented all algorithms by themselves to avoid interference from internal database system.

  • The results of the performance on simulator and real hardware are compared, which makes the conclusion more convincing.


  • Although the author have stated the reason why they choose to build custom DBMS instead of commercial DBMS, it may be better if they can compare the results of custom DBMS and commercials ones.

  • Hybrid concurrency control schemes may be evaluated to make the paper more comprehensive.

Evaluation of scalability (up to 1k cores) of 7 classical concurrency control protocols for main memory OLTP with a prototype DBx1000. All of them failed to scale to 1k cores.


Bottlenecks for each CC scheme

  • DL_DETECT - Scales under low contention. Suffers from lock thrashing.

  • NO_WAIT - Has no centralized point of contention. Highly scalable. Very high abort rate

  • WAIT_DIE - Suffers from lock thrashing and timestamp bottleneck.

  • TIMESTAMP - High overhead from copying data locally. Non-blocking writes. Suffers from timestamp bottleneck.

  • OCC - High overhead for copying data locally. High abort cost. Suffers from timestamp bottleneck.

  • MVCC - Performs well with read-intensive workload. Non-blocking reads and writes. Suffers from timestamp bottleneck.

  • H-STORE - The best algorithm for partitioned workloads. Suffers from multi-partition transactions and timestamp bottleneck.

Optimizations in DBx1000 - mutex based allocation -> atomic instruction -> batch allocation, hardware counter, distributed clock


  • What are the pros and cons of timestamp ordering over two-phase locking? Can you think of other examples of using timestamps in other fields of CS?

    • Pros of T/O: simpler, no deadlocks

    • Cons of T/O: timestamp allocation bottleneck, storing read/write timestamps

    • Examples of timestamps: third-party authentication, sync/recovery in distributed systems, consensus, parallel compilation, cryptocurrency, network protocol, cache coherence, distributed file systems, transactional memory, vector-clock

  • What are the main pros and cons of a multi-version concurrency control (MVCC) protocol? How is MVCC related to HTAP (Hybrid transactional/analytical processing)?

    • Pros: good for rolling back, efficient writes, non-block reads

    • Cons: Memory copy, memory space, timestamp bottleneck

    • HTAP: old versions for OLAP and new versions for OLTP

  • Can you think of any hardware changes to a multicore CPU that can improve the performance/scalability of concurrency control?

    • NUMA management, clock synchronization, low-overhead locking, timestamp allocation, conflict detection in hardware, persistent memory for logging, locking prefetch



  • 2PL: scalable (DL_DETECT ~ NO_WAIT > WAIT_DIE ~ MVCC)

  • T/O: Timestamp allocation limits scalability.

  • OCC: Memory copy hurts performance

  • 2PL > T/O > OCC

Medium contention

  • 2PL

    • DL_DETECT does not scale due to deadlocks and thrashing

    • NO_WAIT/WAIT_DIE scales

  • 2PL > T/O > OCC > DL_DETECT

High Contention

  • Scaling stops at small core count

  • NO_WAIT has good performance until 1000 cores

  • OCC wins at 1000 cores

  • NO_WAIT, OCC > others > DL_DETECT


WAIT_DIE: A transaction T1 waits for another transaction T2 only if T1 has higher priority than T2 (e.g., T1 starts execution before T2).

  • Pros over NO_WAIT

    • Guaranteed forward progress (i.e., no starvation)

    • Fewer aborts

  • Cons over NO_WAIT

    • Locking logic is more complex

Solutions to Timestamp Allocation (downer are better)

  • Mutex based allocation

  • Atomic instruction

  • Batch allocation

  • Hardware Counter (~1000 million ts/s)

  • Distributed Clock (perfect scalability)

Calvin: Fast Distributed Transactions for Partitioned Database Systems

Deterministic DB

Distributed transactions - 2PC (two-phase commit)

High availability

  • Every tuple is mapped to one partition

  • A partition of data is unavailable if a server crashes (or more generally, the warehouse where the server located)

  • Replicate data across multiple servers so data is available if at least one partition is still alive

  • If the primary node fails, failure over to a secondary node. Recovery from log if all replicas fail.

  • RAID take more time to recover from disks (minutes), compared to MM above.

  • Problem for traditional distributed db

    • 2PC is expensive

    • Network can be a bottleneck for log shipping

Distributed Transactions

  • Decide the global execution order of transactions before executing them

  • All replicas follow same order to execute the transactions

  • Non-deterministic events are resolved and logged before dispatching the transactions (e.g. assign values for potential random value before dispatching)

  • Pros over traditional distributed db

    • Log batch of inputs -> No two-phase commit

    • Replicate inputs -> Less network traffic than log shipping

      • (extreme case of operational logging, which can happen on any level, can recover from inputs directly)



different replica may have different scheduler result, but the final result is the same

  • Sequencer

    • Distributed across all nodes

      • No single point of failure

      • High scalability (infinite horizontal scalability?)

    • Replicate transaction inputs asynchronously through Paxos (instead of log shipping)

    • 10ms batch epoch for batching

    • Batch the transaction inputs, determine their execution sequence, and dispatch them to the schedulers

  • Scheduler

    • All transactions have to declare all lock requests before the transaction execution starts (one con cause it's a too strong assumption)

    • Single thread issuing lock requests

  • Scheduler is a bottleneck for read-only workloads

  • Transaction Execution Phases

    1. Analysis all read/write sets

      • Passive participants (read-only partition)

      • Active participants (has write in partition)

    2. Perform local reads

    3. Serve remote reads

      • send data needed by remote ones.

    4. Collect remote read results

      • receive data from remote.

    5. execute transaction logic and apply writes

  • Exampleimage-20200207201556991

Dependent transactions

  • UPDATE table SET salary = 1.1 * salary WHERE salary < 1000

  • Need to perform reads to determine a transaction’s read/write set

  • How to compute the read/write set?

    • Modifying the client transaction code

    • Reconnaissance query (read-only query to discover sets, but do not need to update) to discover full read/write sets

    • If prediction is wrong (read/write set changes), repeat the process

Disk Based Storage

  • Fixed serial order leads to more blocking

    • T1 write(A), write(B)

    • T2 write(B), write(C)

    • T3 write(C), write(D)

  • Solution

    • Prefetch ( warmup ) request to relevant storage components

    • Add artificial delay – equals to I/O latency

    • Transaction would find all data items in memory


  • Logs before a checkpoint can be truncated

  • Checkpointing modes

    • Naïve synchronous mode:

      • Stop one replica, checkpoint, replay delayed transactions

    • Zig-Zag

      • Stores two copies of each record


  • Conventional distributed transactions

    • Partition -> 2PC (network messages and log writes)

    • Replication -> Log shipping (network traffic)

  • Deterministic transaction processing

    • Determine the serial order before execution

    • Replicate transaction inputs (less network traffic than log shipping)

    • No need to run 2PC


  • Is knowing read/write sets necessary for deterministic transactions? How does the protocol change if we remove this assumption?

    • With just partition, no

    • With partition and replications, it is possible to remove the assumption if they do not require remote data.

  • Can you think of other optimizations if the read/write sets are known before transaction execution?

    • No need to broadcast reads to all active participants

    • Use better deterministic ordering to improve performance

    • Enforce no conflicts within a batch -> no need to lock

    • Blind write optimization

  • For a batch of transactions, Calvin performs a single Paxos to replicate inputs. Is it possible to amortize 2PC overhead with batch execution but not using deterministic transactions?

    • Run 2PC in batches

    • Epoch-based concurrency control like Silo

A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics



Many previous works on GPU-bases database by both scholars in academic and industry claimed that their products can achieve great speedup, even higher than the bandwidth ratio between modern GPU and CPU. This paper explores the fundamental performance characteristics of GPU/CPU-based OLAP database and argues why the modal where GPUs act as coprocessors is inefficient, which many paper adopted. The author try to implement the state-of-the-art model for both of them and make the comparison more unbiased. Specifically, they first present Crystal to generate efficient query codes with composition of data processing primitives. Secondly, they implement state-of-the-art operators for both CPU and GPU with predictive performance models. The result shows that for individual query operators, speedup for selection, projection and sorts approximate to the bandwidth ration while join achieved less speedup. But surprisingly, full query performance on their implementation of GPU database achieved higher speedup than the bandwidth ratio, tested on simplified TPC-H. Additionally, the author compared the cost of GPU/CPU database and concluded that now the state-of-the art GPU database can achieve higher speedup than the bandwidth ratio (25x) with lower cost (4x effective). That means GPU can already be an ideal option when the dataset can fit into GPU memory.


S1. This paper present a very fair comparison between GPU/CPU based database. In order to achieve this, they implement state-of-the-art operators for both CPU and GPU along with other optimizations for each architecture. The result shows that many GPU databases have exaggerate their performance compared to traditional CPU database.

S2. The author argued why the model where GPU served as a coprocessor is less efficient compared to use GPU as the main execution engine. Since GPU coprocessor model is very popular in many GPU database, it seems that many of them have exaggerate their performance, although the other architecture not necessarily achieve better performance when the dataset is too large.

S3. The author implemented state-of-the-art individual query operators and established Tile-based execution model to optimize the performance. The results generally meet their modal of expected runtime.


W1. One main disadvantage of GPU-based database is the smaller memory capacity of GPU. To be more practical more GPU are needed in a single machine compared to CPU. So it's better for author to experiment with multiple GPUs in a single machine.

W2. The experiment assumed that both data is already stored in memory of CPU and GPU. But initially data shipping from CPU memory/disk to GPU memory also takes time. (However, if the dataset is not very large and can be fit in GPU memory directly, this single transferring cost seems not to be a bottleneck as indicated.)

W3. The scale of workload (simplified TPC-H) and GPU number seems can not necessarily prove that GPU-based database can scale very well.


C1. This paper concluded finally that GPU database can be a choice considering both the cost and the performance, especially in case that dataset is not very large.

C2. Some commercial products of GPU database are not open-source (e.g. SQream). Since they are commercial, I'm curious about the actual speedup they can achieve.

C3. It seems that in a long time GPU database will execute on heterogenous system with significant CPU memory. Potential optimization for this system to utilize both resource of GPU and CPU effectively will be a problem.

Some startup working on GPU DB: MapD (earlier & open-source), SQream DB (Israel & Ali Cloud), Zilliz (China)

Disk access vs. Memory access

  • Disk order access bandwidth: hundreds of M/S (1/100 of MM), random access (100,000x slower than MM), so the bottleneck is IO and techniques like Buffer Manager are used.

  • (Pure) In-memory DB: new bottlenecks are bandwidth of MM, CPU performance, L1 cache miss, branch prediction miss.


  • NVIDIA Tesla K80 FLOPS = 8.74TFLOPS,bandwidth = 500G/S

  • CPU: at most 50GFLOPS and 30G/s

  • GPU

    • Pros: better parallelism, higher FLOPS and higher bandwidth

    • Cons: Simple arch, limited flexibility, limited types of operations, fewer dependencies (cores of GPU can't execute independently today, less tolerance lead for divergent execution lead to the difficulty in parallelizing algorithms) , bad performance on logic control and branch prediction, higher latency, slower clock-speed

    • CPU: high-frequency, deep-pipelined, heavily-cached


  • Data parallelism: better parallelism, high requirement for MM

  • Model parallelism: lower requirement for MM, lower efficiency (idle of one part of the model, due to dependencies only one part is working)

  • Typically

    • When the modal is larger than the capacity of single MM or GPU, model parallelism are preferred.

    • Otherwise data parallelism is preferred due to faster speed.


  • Performance - GPU-only > CPU-only > co-processor

  • Crystal - tile based execution model

  • GPUs are 25x faster and 4x more cost effective


  • What is the advantages and disadvantages of executing transactions on GPUs?

    • Pros: More parallelism; higher memory bandwidth; good for read-only transactions

    • Cons: Limited memory; ACID over SIMT (e.g., logging latency, concurrency control scalability, etc.); scratchpad hard to use

  • Can you think of any solutions (either software or hardware) to overcome the problems of (1) limited PCIe bandwidth between CPU and GPU and (2) limited GPU memory capacity?

    • Avoiding PCIe bottleneck: better scheduler for data movement (only warm data in GPU, use GPU when results set if small); GPUDirect; compress data transfer;

    • Handling limited GPU memory: virtual memory for GPU; HBM to CPU; compress data in GPU; multi-GPU; Hybrid CPU-GPU system

  • What are the main opportunities and challenges of deploying a database on heterogeneous hardware?

    • Opportunities: Workload specific optimizations; minimize data transfer cost; massive parallelism

    • Challenges: complex scheduling, load balancing; hard to program; difficult failure handling; complex coordination among devices with difference architecture

Q100: The Architecture and Design of a Database Processing Unit

DB Accelerator


In this paper the author presents the design of Database Processing Units (DPU), which mainly include three parts: ISA, micro-architecture and hardware implementation. About the ISA, they implement standard relational operators then can manipulate primitives with good fit for acceleration and parallelism. Then they examined possible optimal configurations for hardware tiles from two aspects (per-tile sensitivity analysis and case-by-case experiment for the last 150 configurations). Apart from the above they also evaluated the communication needs, query scheduling algorithms and their performance. Finally in the evaluation part the author proves that Q100 can both perform better on both TPC-H and energy efficiency compared to traditional database system.


S1. The author presented a clear approach on how to design the micro-architecture for Q100 and explained why their configurations are potentially optimal using models.

S2. The author did a thorough evaluation for Q100 performance with different factors and explored their bottlenecks respectively.

S3. Different from query-specific circuits in previous works, in this paper domain-specific circuits are presented. Their evaluation shows that by doing this they can manage to achieve great better performance on TPC-H compared to traditional RDBMS.


W1. Further optimization for query scheduling algorithms may be applied. But indeed it's difficult to perform effective optimizations on ASIC chip.

W2. The author did not experiment with multiple DPUs within a network. So maybe it should be tested that what can we do to balance the communication and computation costs between DPUs within a network (also fault tolerance).


C1. Maybe DPU can also act as a coprocessor to accelerate computation-intensive analytics queries.

C2. Although Q100 DPUs are still not flexible enough, in areas like regular expression matching and non-relational DBMS, but it seems that they have a promising future in OLAP DBMS.

C3. I'm really not familiar with this area so it's quite difficult for me to go through this paper


  • Accelerator for analytical queries (not txns)

  • Hardware support for relational operators - join, aggregation, sort, select

  • Process data as streams

  • Combination of spatial and temporal instructions

Functional tiles

  • Functional

    • Aggregator: both group_byand aggregate columns are sorted

    • ALU

    • BoolGen: compare two columns and generate bit vector

    • ColFilter: select values from a column based on a bit vector

    • Joiner: Inner-equijoin (hash or merge join?)

    • Partitioner: range-partition input column•Sorter: bitonicsort for 1024 records

  • Auxiliary

    • Apend: Append two tables with the same schema

    • ColSelect: extract column from table

    • Concat: concatenate two columns

    • Stitch: produce a table based on multiple input columns


  • Q100 is an efficient domain-specific accelerator for analytical database workloads

  • ISA exploits parallelism and streaming efficiencies

  • At < 15% area and power of a Xeon core, a Q100 device gets exceptional performance and energy efficiency


  • Key to Q100’s high performance?

    • Streaming to reduce data movement, pipeline parallelism, optimized for TPC-H, efficient instructions, fewer memory writes

  • DPU vs. optimized CPU?

    • Q100 is better due to specialization (1)

    • DPU may be better than Opt-CPU on TPC-H (2)

    • Unclear (1)

  • Limitation of Q100?

    • Communication (between Q100 and CPU, between tiles) may be a bottleneck

    • Large overhead when table is big

    • Extra compiler and software development overhead•Cost of special hardware

    • UDF

  • Optimization goals of database accelerators?

    • Minimizing data movement, maximize memory bandwidth. Minimizing PCIe transfer,. Process in memory?

    • Data placement, scheduling, concurrency control, energy efficiency

Managing Non-Volatile Memory in Database Systems



Non-volatile memory is a new storage technology which is believed to have the potential to serve as a new layer in the storage hierarchy between DRAM and disks. Currently there are two major approaches to integrate NVM of modern DBMS, to use NVM exclusively as main memory or to use page-based DRAM cache in front of NVM. However, the first approach performs worse compared to in-memory DBMS while the second one fails to utilize the byte addressability of NVM and the cost of accessing cold data on NVM is still expensive. To solve this problem, the author presented a coherent and efficient 3-tier system design of buffer manager which includes three key techniques: cache-line-grained pages, mini pages and pointer swizzling. The lightweight buffer manager and storage engine they implemented performed well on YCSB and TPC-C. Overall, the author argues that conceptual simplicity is a major advantage.


  1. This 3-tier buffer manager which includes all of DRAM, NVM and SSD makes it possible to improve the accessing speed of hot data in the first two layers, and the capacity of storing cold data in the last two layers.

  2. The author combined three techniques coherently and effectively to increase the simplicity of overall system design. All of these techniques are either simple or already well-known.

  3. The author used a hybrid approach to accessing different size of data (full pages or mini pages) and completing specific operations (cache-line-grained or not), which improves the overall performance effectively.


  1. Cache-line-grained access is much more difficult to program compared to conventional full page-based approach. We need to know all data's residence explicitly before accessing it.

  2. When the operation is either very complex or infrequent, or the mini page overflows, the author uses full pages instead of mini pages. But in this case, how can we extract relatively hot data and avoid new overhead? Maybe there is still room to optimize in this situation.

  3. The author maybe can explain a little more about why they choose this specific size for mini pages. Will other sizes for mini pages help perform better overall?


Q1. NVM's latency of read and write operations is not symmetric. What are some other possible optimizations we can adopt to solve the problem of slow write of NVM? Maybe some write-efficient algorithms need to be developed for some complicate or write-intensive operations.

Q2. Will NVM be able to contribute to the further development of HTAP databases?

Q3. What's some other critical obstacles in the commercialization process of NVM?Can they be solved in the foreseeable future?

Q4. Is it possible to combine NVM and GPU/FPGA accelerator for DBMS?



  • SDD: SATA, SAS -> PCIe -> NVMe

Paper: Managing Non-Volatile Memory in Database Systems


  • Primary storage for relations and index structures

    • Can fully utilize the byte addressability of NVM

    • Slower than MM DBMS

    • Smaller capacity compared to SSD

  • Use a page-based DRAM cache in front of NVM

    • Accessing uncached pages become expensive

    • Smaller capacity compared to SSD

NVM's operation modes in a db

  • NVM Direct Mode (DRAM and NVM in different address space)

    • NVM as the primary storage

    • All changes persistent to NVM when a transaction commits

    • Logging can be simplified

    • etc

  • DRAM as a Cache (DRAM as a cache, managed by hardware)

    • NVM write throughput is lower than DRAM

    • Buffer management (BM) for DRAM data

    • Write-ahead logging

    • This paper was written before memory-mode was available

Buffer management in SSD/HDD vs NVM


    • Block storage

    • Load a full page at a time (e.g. 16kb)

  • NVM

    • Byte addressable

    • Waste of bandwidth if full pages are loaded

    • Loading a cache line at a time (64 B)

    • Cache-line-grained pages

      • Page initially empty, cache lines loaded as they are accessed

      • Overhead: each access checks/updates resident/dirty bits

    • Mini pages - a sparse representation of a page

      • Cache lines are sorted

      • Promote to full page when a mini-page is full

    • Pointer swizzling - Reduce overhead of page table lookup

      • Store pointer rather than pageIDif page is in main memory

      • Cannot swap out a page before its children



  • NVM: new device in the storage hierarchy

    • Byte-addressable

    • Non-volatile

  • Taking advantage of byte-addressability to improve performance

    • Cache-Line-Grained pages

    • Mini pages

    • Pointer swizzling

    • Three-tier buffer management


  • How does memory-mode affect the design?

    • Will be faster when data fits in DRAM

    • Need to take care of logging

    • Memory mode can ease programming

    • Just use existing main-memory DB without change

  • Advantage of app-direct mode over memory mode

    • Directly manage replacement policy

    • Larger aggregated memory space

    • Logging can potentially be simplified

    • Allows hot/cold data separation

  • How would you design NVM-DB differently?

    • Better recovery structures that use NVM

    • Minimize writes to NVM

    • Use memory-mode or the dual-mode

    • Replace SSD with NVM (cost?)

    • Build LSM-tree based storage system

Write-Behind Logging

NVM 2.


The arrival of new non-volatile memory brings new opportunities for re-design logging and recovery mechanisms for DBMS to exploit full potential of NVM. Traditionally, DBMS uses write-ahead logging method to do logging and recovery for DRAM and HDD/SSDs. Compared to WAL, the author developed WBL, which is specially designed for NVM and tracks what parts of the database have changed rather than how it was changed and flushing the changes to the database before recording them in log. After implementing this protocol in the Peloton and testing on YCSB, the author concluded that this protocol can improve a DBMS's transactional throughput by 1.3x, reduces the recovery time by two orders of magnitude and decrease the storage footprint of the DBMS on NVM by 1.5x.


  1. The author redesigned the structure of WBL Record to make it more efficient and more friendly to the limited life span of NMV.

  2. WBL exploited the special features of NVM greatly. Compared to WAL, WBL has better throughput and recovery speed due to the fast writing and random accessing speed of NVM, as well as the byte addressability.

  3. WBL log record contains all the needed information to recover from failure and there no need for DBMS to periodically construct WAL-style checkpoints to speed up recovery.


  1. Maybe it's better for the author to describe more about the replication schemes under WBL protocol.

  2. In the recovery process of WBL, the author chose not to delete tuples in (Cp, Cd) immediately and used background process to clean these dirty tuples asynchronously. This seems to bring new complexity and overhead.


C1. WBL is specially designed for a specific kind of DBMS with NVM. How to make this protocol more compatible to other potential storage hierarchies with NVM? In some cases WBL may even perform worse than WAL.

C2. NVM's limited life span still seems to be a big problem.

C3. New log replication method may be developed to exploit the full potential of new techs like NVM and RDMA.

Steal vs no steal

  • No steal: dirty pages stay in DRAM

    • Processor can directly update a page

    • Main memory database

  • Steal: dirty pages may overwrite pages on disk

    • Must flush UNDO log(before-image) to disk before writing to the page

Force vs no force

  • No Force: Modified pages may stay in main memory

    • Flush REDO log(after-image) to disk before committing the transaction

  • Force: All modified pages written back to disk before commit

    • Can commit transaction after all pages are forced to disk



  • NVM: new device in the storage hierarchy

    • Byte-addressable

    • Non-volatile

  • Taking advantage of both byte-addressability and non-volatility to improve performance of fault tolerance

    • Force + steal ---> UNDO only

    • Force + MVCC ---> No UNDO, No REDO


  • High availability and remote recovery

    • HA requires synchronous round trip

    • Network may become a bottleneck with HA

    • WBL cannot be simply extended to provide HA or remote recovery. Possible solutions (1) add REDO logging (2) use RDMA to directly update remote NVM

  • REDO vs. UNDO vs. WBL

    • WBL pros: instant recovery, small log size

    • WBL cons: requires multi-versioning, works only for NVM

    • UNDO: bounded log size (typically small)

  • WBL with three-tier architecture

    • Challenges: Pulling cold data from SSD, page vs. byte granularity

    • Log and hot data in NVM, cold data in SSD

Joins in a Heterogeneous Memory Hierarchy: Exploiting High-Bandwidth Memory



Where is HBM used?

  • Nvidia Pascal GPU, Nvidia Volta GPU

  • AMD Radeon GPU

  • Intel Xeon Phi CPU (abandoned), so HBM is mostly used in GPUs

Join algorithms

  • Symmetric Hash Join

  • No Partitioning Hash Join

  • Partitioned Hash Join

  • Parallel Radix Join

  • M-Way Sort-Merge Join

Lessons Learned

  • Flat mode better than cache mode

  • High amounts of threads can easily lead to memory bottlenecks

  • Do not place everything in HBM

  • Random memory access patterns do not saturate bandwidth in general, being therefore not ideal for HBM

  • High bandwidth improves highly parallel hash joins and sort-merge joins more or less equally

  • Uneven load balancing by skew as well as latches on partitions are not noticeably influenced by HBM


  • Xeon Phi: High bandwidth memory integrated with multicore CPU

  • HBM is more popular with GPU than CPU

  • Cache mode does not always lead to performance improvement


  • Why HBM more successful with GPU than CPU?

    • GPU has more computation to saturate HBM bandwidth

    • GPU workloads are throughput-bound, not latency bound

  • Future of storage hierarchy?

    • HBM becomes the new DRAM

    • Need a universal interface to control the hardware

    • Customizable storage solutions

    • Another layer: Smart memory

    • Some may disappear (e.g., HDD)

  • APU for database?

    • Depends on the price

    • Promising because the bandwidth between CPU and GPU increases

    • Maybe hard to program

Query Processing on Smart SSDs: Opportunities and Challenges

Smart SSD.


This paper talks about the opportunities and challenges for query processing on smart SSDs. Smart SSD is a new idea which package CPU processing and DRAM storage inside a single SSD, which makes it possible for us to run programs in the SSD itself. The author briefly introduced the architecture of modern SSDs and the mechanism of smart SSD. Then they implemented an initial prototype of Microsoft SQL Server which can run operations mainly like selection and aggregation. They tested with three synthetic tables from TPC-H and found that pushing down some query tasks to smart SSD can increase the overall performance and be more energy efficient, in the case when IO and limited embedded CPU processing capabilities have not become bottlenecks.


  1. There are already approaches that adopted by IBM and Oracle to push down computation, but different from above, in this paper the author chose to make Smart SSD to have in-built CPU inside the I/O device, rather than next to the IO subsystem.

  2. Running code inside Smart SSD need special development for programmers. And the author build a practical development environment hand-in-hand with Samsung for this project, which is highly beneficial to later database developers who are interested in Smart SSDs.

  3. The discussion part is very thought-provoking. The author identifies several possible bottlenecks and opportunities in the near future for Smart SSDs.


  1. Although embedded CPU in Smart SSD is much more energy-efficient, but these also bring new limitations. This kind of embedded CPU is not deep-cached and deep-pipelined like core CPU and has limited processing capabilities. Several optimized techniques are not realistic in Smart SSD, unless the manufacturers agree to develop Smart SSD with stronger build-in CPU one day. As the author indicated, this could become a bottleneck.

  2. The author may talk more detailed about how they programming for query processing in SSDs using such communication protocols and APIs.

  3. The experiment is not so thorough. For the operators, the author only implement selective and aggregate, which can just exploit Smart SSDs performance to the best. And the workload may be a little more simplified. But I totally understand it's difficult to program with Smart SSDs, which few people have done before.


  1. Is Smart SSDs becoming popular and commercialized now?

  2. As mentioned by the author, it's chicken-and-egg problem between SSD manufacturers and software vendors. Is is still a very common and critical conflict in the system area now?

  3. It seems that pushing down computation is a trend now, like Smart SSD, SmartNICs. This is especially useful in some specific tasks which need to be done repeatedly. But I think the degree of pushing down computation also have a limit. Since we already have the central powerful general-purpose CPU, maybe it could be over-designed for Smart SSD and SmartNICs to be too complicated and general in the future. We need to find a cost/energy balance between them.

Smart memory/storage - pushing computation to memory/storage

  • PIM, Smart SSD, Active Disk (CMU 1998), Intelligent Disk (Berkeley 1998), AWS S3 Select, ...

Query processing on Smart SSDs

  • Internal bandwidth larger than external bandwidth

  • In-SSD processor is less powerful and cheaper, Smart SSD may improve overall cost/performance

  • Reduce energy consumption


  • Gap between external and internal bandwidth determines the potential performance improvement

  • Smart SSD prototype used in the paper delivers 2.7x speedup


  • Fast IO/Network affect smart memory/storage?

    • Closes internal/external bandwidth gap => less gain from smart SSD

    • Cost and energy

  • Supporting complex operators

    • Join: Small table fits in Smart SSD memory; computation simple enough

    • Breakdown the complex operators

    • Not wise to push join entirely

    • Push some simple group-by

    • Data partitioning in Smart SSD

Database Processing-in-Memory: An Experimental Study



Data movement always affect task performance and energy efficiency significantly. In this paper the author tried to push computation from CPU to memory to reduce the cost of data movement. Firstly, brief introduction to current PIM architectures is presented and the author illustrated their motivation through the execution process of selection operator. Secondly, the author presented their implementation of a group of five SIMD query operators. Lastly, they tested their design on performance and energy consumption in each architecture thoroughly and found that a hybrid query scheme can bring significant performance increase and be more energy efficient.


  1. Although pushing down computation to reduce data movement is a classical idea, but it's still novel to discuss the trade-offs of query execution between PIM and x86 in database community. Few (or none) people have done a thorough study on this topic.

  2. The author completed a very thorough experiment. Compared to previous studies, they analyzed the query execution time and energy consumption on multiple datasets.

  3. Their prototype are more mature (compared to the Smart SSD paper we discussed last time). They implemented a group of five SIMD operators and discuss the potential of hybrid scheduling, which can better illustrate the opportunities for PIM.


  1. Maybe the author could talk a little more about the details of their implementation of SIMD query operators.

  2. Maybe the author could briefly compare PIM with other kinds of hardware for query acceleration. Or identify the similarity and differences on high-level.


  1. Are they any studies to compare the overall query execution performance on different methods like GPU, smart SSD and PIM? It's not realistic to use all them to accelerate query execution at the same time, and there should be a fair comparison between them to discuss the trade-offs. (although obviously this means a lot of work.)

  2. Is it also very difficult to program with PIM just like Smart SSD? Apart from the five SIMD query operators they implemented, what about UDF in PIM?

Previous NDP for Databases

  • Previous NDP-DB: Active disk, Intelligent disk, smart SSD

  • No commercial adoption of previous work

    • Limitations of hardware technology => HBM and HMC

    • Continuous growth in CPU performance => Moore’s law is slowing down

    • Lack of general programming interface => SIMD

PIM-256B Arch





  • How to improve group-by aggregation performance?

    • Store aggregation values in registers

    • Use buffers/caches to accelerate random accesses•Keep hot data in registers and swap to memory as needed

    • Sort data by groups in memory to convert random to sequential accesses

  • Smart SSD and PIM for transactions

    • Logging and garbage collection to Smart SSD

    • Compression and decompression of column store

    • Conflict detection

    • Generating TID using SmartSSD/PIM

  • Where will PIM most likely to succeed in the storage hierarchy?

    • NVM: persistency and byte-addressability

    • HBM, DRAM, NVM, or SSD (no benefits for SRAM)

    • Cloud storage obviously (e.g., PushdownDBJ) then move up the stack SSD/NVM and may stop at DRAM due to limited benefits and added complexity

    • Three Tiered DB: admission/eviction can be pushed down to NVM or SSD

The End of Slow Networks: It's Time for a Redesign



Currently, distributed DBMS architecture and distributed algorithms are specially designed assuming that network is the primary bottleneck to minimize the network communication cost. At present, we can already achieved comparable network bandwidth with RDMA and InfiniBand compared with bandwidth with one memory channel, not to say the potential development of RDMA in the future. In this paper, the author present the new Network-Attached Memory architecture and the alternative architectures for distributed DBMS based on this. By redesigning these architectures, the author shows that traditional 2PL schemes can scale well again. Also, the author implemented new JOIN and AGGREGATION operators for distributed OLAP workloads and achieved significant performance increase compared to traditional operators designed for distributed DBMS using locality-awareness or semi-join.


  1. Instead of using RDMA as an afterthought technique for existing DBMS, the author argued that it is necessary to redesign the architecture of current distributed DBMS to leverage the potential of DB with RDMA. This new design bring them significant performance increase.

  2. Traditionally distributed 2PC doesn't scale due to slow network. The author present a new protocol which is able to avoid the scalability limit (theoretically) and can leverage RDMA and NAM to the best.

  3. Their new design of distributed operators like JOIN and AGGREGATION provides performance increase, more robustness and can help better handling data skew.


  1. It shall be better if the author could talk more about their actual implementation of in-memory DBMS for RDMA.

  2. The evaluation part for RSI seems to be not very thorough.


  1. RDMA is of great potential and it's necessary to build more user-friendly programming environment.

  2. Will traditional optimizations for distributed operators such as locality-awareness become insignificant as RDMA become more and more popular?

Bandwidth and latency


InfiniBand and RDMA


Shared-Nothing vs. Shared-Memory



  • Snapshot isolation, generalized snapshot isolation

  • Cost of 2PC+GSI, RDMA+SI


  • Grace Hash Join

    1. Partitioning R and S using the join key

      1. Read data on the sender

      2. Transfer data over network

      3. Materialize data on the receiver

    2. Local join

  • Semi-Reduction using Bloom Filters

    1. Create bloom filters for R and S on the join key

    2. Filter R and S using the bloom filter and partition

    3. Local join (parallel radix join)

  • GHJ with RDMA - time analysis


  • RDMA for transaction execution phase

    • No need to partition or replicate indexes

    • Centralized locking and replication can be faster

    • Centralized logging service

    • Accessing and locking remote data using RDMA

    • Prefetch data into local memory

  • DB components significantly affected by a faster network

    • Two phase commit, Consensus, Replication

    • Less worry about locality

    • Breakdown program into multiple micro services

    • Data shuffling in the network

    • Distributed system becomes NUMA is network is sufficiently fast

  • Opportunities and challenges of memory disaggregation

    • Opportunities: Independent scaling of CPU and memory; simplifying scheduling; larger aggregated memory capacity; potentially faster fault tolerance

    • Challenges: independent failure of compute and memory; consistency and coherency; data placement and partitioning

Rethinking Database High Availability with RDMA Networks

High availability


RDMA based network can achieve a high bandwidth which is comparable to that of main memory. And current state-of-the-art replication algorithms are designed to minimize the network communication cost assuming that network is the main bottleneck. The author analyzed why conventional high available protocol is not able to leverage RDMA based network. Then the author proposed the Active-Memory Replication, which is designed to exploit the potential of RDMA and can achieve high availability. It is an undo-log based fault tolerance protocol that has advantages in both performance and maintains correctness. Evaluation shows the correctness of this protocol and it's 2x faster than the second best protocol on RDMA-based networks.


  1. With good illustrations(code, timeline, branches, figures), the author presented their new protocol thoroughly in a clear manner.

  2. This paper has a well-written background introduction part, which gives readers a concise guide to current replication algorithm and analyzed why they are on longer suitable for RDMA based network.

  3. The evaluation is relatively thorough. The experiment shows that how this new Active-Memory protocol can outperform current replication algorithms in broad scenarios (based on RDMA network) and maintains correctness at the same time.

  4. Don't use log shipping and thus avoid overhead on processing backups, which help this new protocol exploit full potential of RDMA and outperform FaRM.


  1. Would like to see comparison between more state-of-the-art replication algorithms to better illustrate the advantages of Active-Memory protocols in different scenarios.

  2. The scalability experiment may need more nodes to prove a definite advantage on scalability for Active-Memory protocol compared to other conventional replication algorithms.


  1. What are some other potential components in a distributed system that their advantages no longer hold on RDMA based network and need a redesign?

  2. How to extend to architectures other than shared-nothing?

Active-Passive, Active-Active Mode



  • RDMA shifts the bottleneck from network to CPU

  • Conventional HA protocols (i.e., active-passive and active-active) are optimized for reducing network demand and is thus are no longer optimal for RDMA

  • Active-memory is optimized to reduce CPU demand

  • Active-memory achieves 2x performance improvement


  • Active memory without in-order delivery?

    • Assign seq number to each packet and resemble at the receiving side

  • Active Memory vs.WriteBehind Logging?

    • Both use “force” instead of “no-force”

    • Can be combined (single-vs. multi-versioning)

    • Keep data in persistent memory in Active Memory

  • Other examples of increasing computation to reduce network overhead

    • Caching

    • Data centric computing (moving computation to data)

    • Compression and decompression

    • Directory-based cache coherence: unicast vs. multicast

Offloading Distributed Applications onto SmartNICs using iPipe

Smart NICs.


Currently, there is a gap between the increasing network bandwidth and the stagnating and limited CPU resources. And the emerging multicore SoC SmartNICs which have rich computing resources have the potential of offloading generic and complex workloads in datacenter. Compared to previous offloading a specific workload onto FPGA-based SmartNICs, this paper summarized the offloading performance implications from four perspective and built iPipe for offloading distributed applications onto SmartNICs and maximize resource utilization at the same time. Then the paper introduced its actor-based model, hybrid scheduler (FCFS + DRR) and tested with three complex distributed applications to show that SmartNICs have the potential to accelerate more generic and control-intensive workloads and fill the gap in the beginning.


  1. There are some research on offloading some networking workloads onto FPGA-based SmartNICs. But instead of accelerating a domain-specific workload, the paper focused on offloading a class of applications which may have complex logic and algorithms.

  2. The experiment tested three complex distributed applications and thoroughly displayed that iPipe has the potential of accelerating more generic, complex and control-intensive network workloads.

  3. The paper summarized the offloading performance implications from four perspective (traffic control, computing capability, onboard memory and host communication) which can help in identifying the bottlenecks and maximize resource utilization for complex and generic distributed applications.


  1. The background information part of this paper seems to be not very well written (lacks some concepts explanation compared to papers in VLDB) and thus is less friendly to noobs...

  2. It seems that there is still room to optimize for iPipe Actor Scheduler. Current scheduler policy already has good performance, but it's also beneficial to discuss potential optimization theoretically to make this policy more sound.


  1. Is pushing down/offloading control-intensive complex workloads to SmartNICs a promising path? Is it over designed or really necessary?

  2. Recent advances or applications of SmartNICs in data centers of giants like Microsoft, Amazon.

Pushing down computation to ...

  • Storage - Smart SSD

  • Network - Smart NIC

Smart NIC arch


On-path vs off-path


iPipe Framework

  • Actor model vs OOP

  • Scheduler

  • DMO - Distributed Memory Object

iPipe Applications

  • Replicated KV Store

    • Log-structured merge tree for durable storage

    • Replicaiton using Multi-Paxos

    • Actors - Consensus actor, LSM Memtable actor, LSM SSTable read actor, LSM compaction actor

  • Distributed Transactions

    • Phase: read/lock -> validation -> log by coordinator -> commit

    • Actors - coordinator, participant, logging actors

  • Real time analytics

    • Analytics over streaming data

    • Actors - filter, counter, ranker


  • SmartNIC vs. SmartSSD

    • Different application scenarios: one for storage, one for network

    • SATAvs.PCIe?

    • SmartNICs used for reducing CPU overhead; SmartSSD used for reducing data movement

    • SmartNIC seems more popular among hardware vendors

    • Computation in SmartNICis stronger than SmartSSD

  • Database operators pushed to SmartNIC

    • Common: encryption, caching

    • OLTP: filtering, aggregation, locking, indexing

    • OLAP: filtering, project, aggregation, compression

  • Benefits of putting smartness into the NIC

    • Packet processing, latency reduction

    • Effect of SmartSSDis limited due to caching; caching does not apply in SmartN IC

    • Isolate security checks from CPU

    • Collect run time statistics such as network usage and latencies

    • Reduces burden on PCIe

Distributed Join Algorithms on Thousands of Cores


Join operator is very important in both database engines and machine learning algorithms with the advent of big data. Traditionally, radix hash join and sort-merge join are two most popular join implementations. In this paper, the author tried to explore the implementation of distributed join algorithms with potentially thousands of cores and connected by a low-latency network. The author implements these two algorithms based on MPI and tested their performance under different circumstances thoroughly. The author identified some key factors in scaling these two algorithms such as balance between computation and communication capacity, communication inefficiencies, scheduling, etc.


  1. This paper gives reader a friendly and concise background introduction including RDMA, MPI and high level ideas of distributed hash/merge join algorithms.

  2. The author compared the actual performance of two distributed join with their theoretical bound and identified key factors which significantly affected the scalability of current distributed join algorithms. (balance between computation and communication capacity, communication inefficiencies, scheduling, etc. ).

  3. The implementation and experiment part is thorough. The author did a detailed analysis of implementation, actual performance, cost components and possible setup factors to give the final high level conclusion.


  1. This paper did a great job in investigating two most popular join algorithms and identified their bottleneck of scalability. But the author did not talk much about how potential optimization can be implemented, such as scheduling and data skew. Adding these parts will give the paper more novelty.


  1. Any progress in light-weight and effective network scheduling techniques?

  2. How will RDMA reshaped current distributed join algorithms?

  3. Simply adding more cores will break the balance between compute and communication and the performance could even degrade. How can we measure the potential benefit of adding more cores more quantitatively?


Algorithm designs


MPI one-sided operations

Example - analysis of Radix Hash Join and Sort-Merge Join

  • Process, number of passes, time analysis and performance models

  • image-20200612230546038

  • image-20200612230606180

Performance Model



  • SmartNIC for join

    • Filtering, hash table, indexing

    • Network traffic scheduling for shuffling (reduce the problem of bursty traffic)

    • Hash table in SmartNIC?

    • Sort in SmartNIC

    • Data partitioning

  • HW/SW techniques to improve performance of sort-merge join

    • Equivalent performance after removing bottlenecks? (Not necessarily)

    • Hardware acceleration for the sort and merge

  • Radix join to achieve theoretical maximum performance

    • Communication powered by SmartNICs/RDMA (network scheduling for shuffling)

    • Hash partitioning logic in SmartNIC

Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases

OLTP in cloud.


This paper introduced Amazon Aurora, a relational database for OLTP workloads. For a distributed/cloud database, the bottleneck of achieving high throughput has moved from compute and storage to network. Traditional distributed protocols such as 2PC are intolerant of failure and have high latency. To avoid this kind of questions, Aurora decoupled compute from storage and offload several functions such as redo logging, crash recovery and backup to the storage service. By only shipping logs between the database instance and storage nodes, and keeping doing redo in the background storage node asynchronously, Aurora was able to avoid the network bottleneck, reduce the respond time (both for customer and crash recovery) and achieve higher throughput.


  1. Great paper introduced the considerations of new architecture cloud databases designed for high throughput. Several components are all discussed in a clear manner.

  2. Offloading several functions such as redo logging, crash recovery and backup/restore to the storage tier gives many opportunities. The database instance only need to ship redo log to the storage tier, thus reduce the effect of the bottleneck of the network. Also, this makes quicker respond time for crash recovery due the asynchronous background recovery.

  3. This paper explored how to reason about cloud durability and design quorum systems which enabled the database to be resilient to failures. Combined with only redo log shipping between the database instance and storage nodes, this makes Aurora tolerant of failures and reduced the latency.


  1. Only support writes to only one replica and can't handle different write to multiple replicas.

  2. Seems not to support distributed transactions.


  1. What's Quorum's pros/cons over Paxos?

  2. Shared-disk vs. shared-nothing?

  3. Other possible components that could be offloaded from the database instance to the storage tier?


Cloud storage disaggregation

  • Storage disaggregation

    • Independent management and scaling of compute and storage

    • Cost reduction

  • Smartness in Storage - Storage nodes contain CPUs for computation

Pushdown to cloud storage

  • Concurrency control

  • Indexing

  • Buffer manager

  • Logging

Quorum-based voting protocol, 3-way, 6-way replication

  • Step 3 - identify gaps in the log

  • Step 4 - gossip with peers to fill gaps

  • Step 5 - coalesce log records into data pages

  • Step 6 - periodically stage log and pages to S3

  • Step 7, 8 - Periodically garbage collect old versions and validate CRC code on pages

Eviction, replication


  • Cloud storage vs. SmartSSD

    • Smart SSD for OLAP while Aurora for OLTP?

    • SmartSSD on the read path; Aurora on the write path

    • Computation in cloud storage more powerful than SmartSSD

    • Smart SSD serves one node while Cloud storage serves multiple nodes

    • Cloud storage has higher latency

  • Challenges of multi-master

    • Uniqueness of LSN and ordering guarantees

    • Concurrency control (locking, leader election)

    • Commit protocol (2PC, or 1PC as what Aurora uses)

    • Network overhead

  • Other applications benefit from cloud storage

    • Serverless application

    • Publish-subscribe system like Apache Kafka

    • Graph/Document Store

    • Machine learning

    • Big data analytics

Choosing A Cloud DBMS: Architectures and Tradeoffs

Cloud data warehousing


This paper explored the architectures and tradeoffs when choosing a cloud DBMS. There are always tradeoffs in database areas and people have to make choices many times due to their specific workloads. Common choices are like shared-nothing vs. shared-storage, column store vs. row store, benefits of horizontal scaling vs. vertical scaling etc. This paper chose common database such as Athena, Redshift, Vertica, etc. to perform a thorough experiment on common tradeoffs and possible architectures when choosing a cloud DBMS. They concluded with their key findings from the experiment and pointed out some opportunities for potential future work.


  1. This paper explored some important and common tradeoffs when choosing a cloud DBMS by carefully designed experiments. And their findings can serve as a good guidance for both customer, researcher and cloud service vendor.

  2. The author managed to experiment with a broader selection of cloud offerings and incorporate DBMS offerings, including OLAP DBMSs which not so many previous paper focused on.

  3. This paper is very practical both for customers and cloud service vendors. The product they chose are all popular commercial/open source products and the experiment is under the restricted framework with common choices for customers.


  1. This paper includes a broad selection of cloud databases and relevant offerings, talking about many different tradeoffs, so sometimes the organizing of this paper may not be easy for readers to follow.


  1. This paper is relatively easier for me to understand, but still sometimes I'm puzzled about the focus of this paper. But the key findings the paper concluded are all quite practical with some insights.

  2. Are those findings keep unchanged for all cloud service vendors such as Google, Alibaba, Azure etc.? These companies offer similar products, but their prices vary from time to time. You know for products with generally the same functionalities, the price differences between different vendors sometimes are greater than I thought.

System tested

  • Database-as-a-Service - Redshift, Athena

  • Query engines - Presto, Hive

  • OLTP - Vertica

Complexity of fair comparison

  • System setup, query optimization, data format, data types, data partitioning, query restriction, caching, spilling to disk, etc...


  • Caching benefits, query cost, storage cost, scaling out, scaling up


  • Optimal design that combines the advantages?

    • Athena with instances pre-running

    • Hybrid instance store and S3; decide caching based on the workload

    • High-quality code compilers

    • Heterogeneous system that combines all the existing systems together

  • Optimization opportunities for serverless databases?

    • Optimize resource sharing among users (e.g., cache, computation)

    • SW/HW codesign

    • Heterogeneous hardware and storage (e.g., different function on different hardware)

    • Scale computation and storage on demand

    • Keep instances pre-warmed to reduce cold starts

  • Cloud databases benefit from new hardware?

    • Using GPU, SmartSSD

    • RDMA and SmartNIC(e.g., shared cache in SSD, computation offloading)

    • Persistent memory to improve bandwidth and aid fast restarts

The Snowflake Elastic Data Warehouse



Snowflake probably represents the state-of-the-art cloud OLAP database design. Distributed computing and big data systems are becoming more and more popular in recent years. But for traditional data warehousing, they are designed for fixed resources and lacks elasticity. For big data systems like Hadoop, they still lack much of the efficiency and require significant engineering effort to roll out and use. The author realized these facts and thought that we need a fundamental redesign for cloud database. Thus they introduced Snowflake, a multi-tenant, shared-data, highly scalable and elastic system designed for semi-structured and schema-less data.


  1. Snowflake disaggregates storage and computing. Although this idea is becoming more and more popular nowadays, but at the time when Snowflake was designed (since 2012), this is a farseeing choice which can help the cloud databases achieve great scalability and elasticity.

  2. The design of execution engine is columnar, vectorized, and push-based, which enables Snowflake to have high performance on analytic workloads. This help snowflake to avoid many sources of overhead in traditional query processing and make Snowflake more scalable.

  3. This paper really has a far insight. At the time when Snowflake was designed, big data systems like Hadoop and Spark are more popular choices, but the author chose to implement with their own cloud database with a fundamentally redesign, and their new design contains developmental novelties.


  1. Choosing S3 as the storage layer for snowflake brought great efficiency and elasticity, but this also means the degrade of performance / support less features for specific workloads (for example less support for OLTP workloads).


  1. It seems that choosing the storage layer is very important. Snowflake chose Amazon S3 and can we achieve a higher performance/other valuable features with some other storage nodes?

  2. Difficulties for cloud database that supports HTAP?

  3. BigQuery vs. Snowflake?



  • Snowflake vs shared-nothing

    • Heterogeneous workload

    • Membership changes

  • Storage

    • Data format - PAX

    • Data horizontally partitioned into immutable files

    • Intermediate data spilling to S3

  • Virtual warehouse

    • Elasticity and Isolation

      • Created, destroyed, or resized at any point (may shutdown all VWs)

      • User may create multiple VWs for multiple queries

    • Local caching - S3 data can be cached in local memory or disk

    • Consistent hashing

    • File stealing to tolerate skew

    • Execution engine

      • Columnar - SIMD, compression

      • Vectorized

      • Push based

  • Cloud Services

    • Query optimization

    • Concurrency control - Isolation, S3+MVCC, versioned snapshots

    • Pruning

      • Snowflake has no index (same in Athena, Presto, Hive, etc)

      • Min-max based pruning: store min and max values for a data block

High availability and fault tolerance

  • Stateless cloud services, replicated metadata

  • One node failure in VW, whole AZ failures


  • How far away is Snowflake from the “optimal design”?

    • Auto-scaling

    • Better optimized storage layer (like Aurora)

    • Security and reliability

    • Code compilation

    • Caching can be improved (e.g., workload specific)

    • Data sharing across virtual warehouses

    • Opportunities to extend into providing HTAP solutions

    • Cloud service layer might be a bottleneck

  • Combine data warehousing and OLTP in cloud?

    • Master and slave nodes within a VW to support writes as well

    • Build snapshot isolation into storage (concurrency control)

    • Transaction log -> (intermedia storage) -> S3 -> data warehouse every Y hours

    • VW per transaction?

Starling: A Scalable Query Engine on Cloud Function Services



For bursty analytic database workloads, provisioning a cluster of fixed number nodes is not a wise choice because customers need to pay for idle computing resources. Now a popular solution is FaaS like AWS Lambda which enables customers to run small and fine-grained tasks to reduce the overall cost greatly. But specifically, building an analytic query system on cloud functions also comes with some challenges (managing numerous tiny workers which is resource-constrained, handling stragglers, shuffling data through cloud service, etc.) In this paper, the author introduced Starling, which is a scalable query engine which can mitigate all above challenges and reduced the latency as well as the cost.


  1. As is stated in the paper, Starling is powered up with three key features (does not require loading, pay by query and tunable performance) which all common systems on the market can't offer all of them simultaneously.

  2. Stragglers often cause significant performance degradation for analytic queries. The author based their optimization on the power of two choices to deal with straggler problem specifically. The experiment shows that using randomization and duplicate tasks can accelerate overall query execution greatly.

  3. The author completed a thorough experiment, with four different system (Redshift, Presto, Athena, Starling) on multiple factors (cost of operation, query latency, scalability, concurrency, etc). These statistics soundly demonstrated the good performance of Starling.


  1. Starling is optimized for low to moderate query volumes, there still exists some challenges (FaaS restriction and low-cost coordinator) that limit the throughput of Starling with high volume queries.

  2. There are still some common functions that Starling does not support.

  3. Starling does not use a query optimizer, it's might be better to consider the cost of query optimization.


  1. Can the idea of Starling extended to transactional workloads?

  2. How do other systems deal with stragglers?

Serverless = FaaS + BaaS

Opinion from a CIDR'19 paper

  • 1 step forward: auto scaling

  • 2 step back

    • Communicate only through S3, degrade performance

    • Not good for distributed computation: no intermediate communication

Opinion from Berkeley report

  • Using FaaS to implement a serverless database is a bad idea, should use BaaS

FaaS: Starling

BaaS: Athena, Snowflake, Aurora (all serverless)

Cloud analytics database


Starling vs Snowflake

  • Starling: more fine grained, communicate only through S3

  • Snowflake: coarse grained, fixed size virtual warehouse, communicate through local storage

Small scale cloud database lower the bar

Starling arch



  • Parallel reads

  • Read straggler mitigation (RSM)

    • If a read request times out, send duplicate request

  • Write straggler mitigation (WSM)

    • If a write request times out, send duplicate request

    • SingleTimer: allow only single time out

  • Double write - Producer writes two copies of an object; consumer reads the one ready first

  • Pipelining - Start the following stage before the previous stage finishes

  • Combining to reduce cost of shuffle

Starling vs Snowflake


Future of serverless computing

  • Opinion from Berkeley Report (Cloud Programming Simplified: A Berkeley View on Serverless Computing)

  • Challenges: Abstraction, System, Networking, Security, Architecture

  • Predictions: new BaaS, heterogeneous hardware, easy to program securely, cheaper, DB in BaaS, serverless replacing serverful

  • Opinion from a CIDR’19 Paper (Serverless computing: One step forward, two steps back.)

    • Fluid Code and Data Placement

    • Heterogeneous Hardware Support

    • Long-Running, Addressable Virtual Agents

    • Disorderly programming

    • Flexible Programming, Common IR

    • Service-level objectives & guarantees

    • Security concerns


  • FaaS vs. BaaS for databases

    • BaaS advantages: simplifies communication and state sharing, caching

    • BaaS disadvantages: potentially lower CPU and memory utilization

    • FaaS advantages: fine-granularity pricing model, auto-scaling

    • FaaS disadvantages: overhead of inter-function coordination, functions have limited resources and execution time, communication through S3, inherently designed for small functions

  • What can BaaS (e.g., Snowflake) borrow from FaaS?

    • Auto-scaling: Dynamically resource allocation and fine-grained pricing

  • Benefits and limiting factors of running OLTP on serverless computing?

    • Benefits: Elastic scaling based on demand, transactions are inherently short-lived

    • Limiting factors: S3 has no read-after-write consistency, concurrency control is hard due to lack of communication

HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots



Traditionally, people use two different kinds of databases for different workloads: OLTP for online transaction processing and OLAP for analytic workloads. But splitting the data into two parts meaning the loss of data freshness and will degrade the performance. To solve this problem, the author introduced HyPer, also known as (probably) the first HTAP database to handle both OLTP and OLAP simultaneously. HyPer is a main-memory database that use hardware-assisted mechanisms to cope with the two kinds of workloads. The experiment is based on a combination of TPC-C and TPC-H benchmark and shows good performance.


  1. HyPer is probably the first database to introduce the idea of using one single database to handle both OLTP and OLAP. This idea is now well known as HTAP and is becoming more and more popular among many database service vendors.

  2. The paper introduced their way of transaction isolation and durability in a clear and detailed manner.


  1. The concurrency control scheme in HyPer is not so mature. Using few threads to deal with transaction processing may still leverage multicore to some extent, but there still exists many restrictions to limit the concurrency of HyPer.

  2. It might be better to make more comparison between HyPer and other databases.

  3. Might be better to make more comparison between other databases.


  1. Other possible solutions to HTAP besides virtual memory snapshots?

  2. Commercialization situation of HTAP databases?

  3. Distributed + HTAP?

One ref



Virtual Memory Snapshots

  • Create consistent database snapshot for OLAP queries to read

  • Transactions run with copy-on-write to avoid polluting the snapshots

Fork-based virtual snapshots


Multiple OLAP session

Multi-thread OLTP processing

Logging and checkpointing


  • Challenges of applying VM-snapshot to a shared-memory OLTP system?

    • The additional burden in taking a consistent snapshot.

    • Consistent snapshot across machines is a hard distributed system problem.

    • Fork() becomes more expensive

  • How to reduce the cost of fork() when database is large?

    • Large pages and partitioning

    • Copy on write on the page table itself

    • Append child parameters to parents page table (one page table for all processes)

    • Share the page tables with OLTP while recording changed pages with a side data structure.

    • Fork() performs CoW at tuple granularity

  • Most promising architecture of HTAP? (single vs. separate systems, shared vs. separate data)

    • Single system, separate data: since optimal data formats (row vs column) for OLTP and OLAP.

    • Single system, shared data: different data formats (row/column) in different replicas; avoid data sync and update propagation

    • Separate system and separate data: no interference between OLTP and OLAP

    • sSeparate system and shared data: better scalability and high availability

Gorilla: A Fast, Scalable, In-Memory Time Series Database

Time series.


Facebook widely uses ODS TSDB as an older monitoring and alerting system. The workloads at Facebook raised may requirements for ODS such as write dominate high throughput, high availability, fault tolerance, etc. ODS is based on HBase and cannot scale very well and achieve high throughput, but it is already widely used to store more than 2PB data... Also current TSDBs on the market can't meet the expectation of Facebook. In order to deal with this problem, the author introduced Gorilla, a fast, scalable, in-memory TSDB. Gorilla is used as the write through cache of ODS and uses aggressive compression to make the data fit into the cache and speed up the query throughput and reduce the latency. This also make several new monitoring and debugging tools available.


  1. Aggressive compression techniques. Gorilla uses Delta-Of-Delta to compress timestamps and XOR to compress point value separately. This enables Gorilla to save up to 90% of the space and lay a solid foundation for in-memory design.

  2. In-memory based design. Current ODS based on HBase cannot meet the requirement of great workloads at Facebook. Because it's a monitoring system, this also raises higher standards for TSDB. Combined with aggressive compression, Gorilla make the time series data fit into the memory to increase the query throughput and reduce the latency.

  3. Fault tolerance (based on cross region cluster) and scalability (based on share nothing architecture).

  4. Shared valuable industry experience in section 6 and new monitoring and debugging tools in section 5.


  1. Hopes to see more comparison between other TSDBs and make more performance analysis.

  2. Tradeoff high throughput at the cost of tiny data loss.


  1. Recent advances of TSDB?

  2. Gorilla is used as the write through cache of ODS based on HBase. How would the design change if Facebook directly uses Gorilla as the ODS?

Time Series DB = A time series database is a software system that is optimized for storing and serving time series through associated pairs of time and value

DB popularity in April 2020


Why is TSDB Popular?

  • Monitoring software systems: Virtual machines, containers, services, applications

  • Monitoring physical systems: Equipment, machinery, connected devices, the environment, our homes, our bodies (Internet of Things)

  • Asset tracking applications: Vehicles, trucks, physical containers, pallets

  • Financial trading systems: Classic securities, newer cryptocurrencies

  • Eventing applications: Tracking user/customer interaction data

  • Business intelligence tools: Tracking key metrics and the overall health of the business

Gorilla = An in-memory cache for the slower TSDB on HBase

Time series compression

  • Data format - (timestamp, value)

    • 64-bit timestamp

    • Value - 64-bit doubles

  • Timestamp compression - delta-of-delta

  • Value compression - XOR

image-20200613050603024In-memory data structure


What’s new in TSDB compared to RDBMS?

  • Opportunities of data compression

  • Different access pattern: append intensive

  • Relaxed consistency model?

  • Different query pattern

Adaptive Concurrency Control: Despite the Looking Glass, One Concurrency Control Does Not Fit All

Dixin Tang's 1st paper on adaptive CC


  • Many protocols are optimized for specific workload characteristics (read-heavy, highly-skewed, etc.) and may have poor performance with mixed workloads. Thus we need adaptive/hybrid workloads.


  • Using hybrid CC: ACC automatically partitions the database into clusters and assigns CPU cores for each cluster according to its load statistics; for each cluster, ACC adaptively chooses a concurrency control protocol according to its workload characteristics. (Using cascading binary classifier trained offline.)

  • How to partition database and assign CPU? - Partition-Merge method.

    1. Partition the records into N (=CPU cores) cluster

    2. Merge clusters to ensure that

      • the percentage of transaction operations accessing a single cluster (denoted as utilization) should be higher than a threshold (i.e., maintaining load-balance)

      • the cost of cross-cluster access should be lower than a threshold (i.e., minimizing cross-cluster transactions).

    3. Assign each cluster with the number of cores proportional to their utilization.

  • How to model the characteristics of different workloads?

    • the ratio of read operations per transaction (ReadRatio)

    • the average number of operations issued by a transaction (TransLen)

    • 估计算符运算代价

    • the possibility of concurrent transactions reading/writing the same record (RecContention)

    • the cost of cross-cluster transactions (CrossCost / PartitionConflicts)

    • 1~3 are critical to PartCC, OCC, and 2PL

    • 4 is critical to PartCC. (Determines the applicability)

  • How to select different protocols?

    • Central coordinator: Each transaction worker concurrently collects workload statistics and reports them to a central coordinator periodically. The coordinator computes features of the current workload for each cluster and uses the predictive model to determine which protocol should be used for this cluster. (could be a bottleneck for latency)

    • Move (partial) computation to workers. Emulate a uniform and separate process for all workers using sampled transactions or operations.

      • Each worker samples operations and repeats (mark and detect phases) to check how many other workers have marked the records of the sampled operations to estimate the contention.

  • How to avoid overhead of coordinating conflicts across protocols (one record can be accessed by different protocols)?

    • Data-oriented mixed CC: enables a single protocol process all concurrent read/write operations to a portion of records (a cluster). (Cross-cluster transactions may use different CC when accessing different clusters.)

    • Figure 3: Mixing PartCC, 2PL, and OCC

    • Preprocess: acquire partition locks for PartCC

    • Execution: uses the protocol managing the record to process it.

      • OCC: reads the timestamp and the value of this record using the logic of OCC

      • 2PL: acquires a lock before reading or writing the record

    • Validation: validate the records managed by OCC

    • Commit: applies all write and release locks. (Release partition locks for PartCC, record locks for 2PL, and write locks for OCC.)


  • Env: Based on Doppel (open-src MM OLTP DB), PartCC of H-Store, OCC from Silo, and SS2PL No-Wait based on VLL.

  • Baseline: single protocol; hybrid of 2PL and OCC (uses locks to protect highly conflicted records and uses validation for the other records)

  • Test 1: mix CC under mix of partitionable (no cross-partition transactions) and non-partitionable (100% cross-partition transactions) workloads. For cluster with no cross-partition transactions, increase the percentage of cross-cluster transactions from 0% to 50% gradually.

  • Result1

    • ACC merges non-partitionable warehouses into a single cluster and makes each non-partitionable warehouses of a single cluster.

    • At first, ACC used PartCC to process well-partitionable workload and 2PL to process the rest of the highly conflicted and non-partitionable workload.

    • With more cross-cluster transactions, ACC adopts 2PL for the whole system.

    • For 2PL, lock contention across cores is reduced due to partitioning.

  • Test 2: evaluate ACC’s adaptivity in response to selecting a single protocol according to workload variation.

    • If PartCC is used, the whole store is partitioned such that each CPU core owns a partition (i.e., a cluster).

    • If OCC or 2PL is used, the partitioned stores are merged back into a single cluster (i.e., a shared store).

  • Result 2

    • (non-partitionable -> 100% cross-partition transactions + skew partition access, low-conflict -> read-only) workload -> OCC

    • (partitionable, high-conflict -> write-intensive) workload -> PartCC, where it needs to partition the whole store first. ACC continues to process transaction requests during the partitioning and switches to PartCC when partitioning is done.

    • non-partitionable workload -> more partition conflicts are introduced in the workload; thus, ACC merges the partitioned store and switches from PartCC to 2PL.

    • Overall: The dip in performance for ACC during workload shifts is due to lag while using the prior protocol and a short period where workers reorganize clusters and indices. The throughput improvement of ACC over PartCC, OCC, and 2PL can be up to 4.5x, 3.5x and 2.3x respectively, and ACC can achieve at least 95% throughput of the optimal protocol when it uses this protocol, demonstrating that the overhead of ACC is minimal.


  • ACC: data clustering + workload modeling + mixing protocols

Hybrid approach of 2PL and OCC (denoted as Hybrid), which uses locks to protect highly conflicted records and uses validation for the other records.

Toward Coordination-Free and Reconfigurable Mixed Concurrency Control

Dixin Tang's 2nd work on adaptive CC

Different CC throughput:

  1. Partitionable: PartCC > 2PL > OCC. A transaction is highly likely to access only one partition. So use coarse-grained (partition) locking for partitionable workloads.

  2. Non-partitionable + Read only (+ Skew): OCC (MVCC)> 2PL > PartCC

  3. (Non-partitionable + Write heavy + (non-skew): OCC ~ 2PL ? )

  4. Non-partitionable + Write heavy + Skew: 2PL > OCC ~ PartCC


  • Perform well under partitionable workloads

  • Cross-partition transactions hurt the performance

Elastricity instead of switching?

Two Challenges of Mixed Concurrency Control

  • Mixing: How to partition a workload and mix multiple concurrency control protocols efficiently

  • Switching: How to reconfigure a protocol when the workload changes

一些研究人员认为Timestamp处理读操作为主的事务有益,2PL对写操作多于读操作的事务有益,提出了 某些事务用时间戳排序,另一些事务用2PL。Bernstein. ,1987.

Different CC scalability

  1. Read only

    • 2PL > MVCC > Timestamp > OCC

    • 2PL: Scales. And has higher throughput

    • MVCC: Timestamp allocation limits scalability.

    • OCC: Memory copy hurts performance.

  2. Medium contention (50%:50%)

  3. High Contention

How to apply different CCs?

  1. Partition the database and use different CC to access different cluster.

    • You need to assign CPU initially or dynamically.

    • Different CPUs may need to switch to different CC due to workload variation. (New Overhead)

  2. Fixed CC for all data

  3. Fixed CC + Partition


Some common tradeoffs, pros & cons, challenges & sols

  • Device level, optimization level (and the level where it is applied)

  • Storage hierarchy of different levels

    • Registers

    • CPU cache: L1/L2/L3, data/instruction, SRAM

    • Main Memory: DRAM

    • Disk: HDD vs. SSD

    • Network: NAS, cloud

    • Potential: NVM, HBM, Smart SSD

  • Parallelism of different levels

    • Instruction: FPGA/ASIC acceleration, deep-pipelined, ILP(branch prediction), VLIW

    • Data: SIMD (SSE -> AVX), vectorization

    • Thread: concurrency, multicore, SIMT (GPU)

    • Device: hybrid (CPU + GPU / XPU / FPGA / ASIC, co-processor), NUMA

    • Distributed: RDMA, distributed ML training (parallel on data/model -> pipeline)

    • Task: scheduling, HTAP

  • Key bottleneck:

    • Bandwidth (IO) (internal & external)

      • CPU ~ Disk: SATA (HDD), SAS -> PCIe -> NVMe (SSD)

      • CPU ~ GPU: PCIe, scheduler (data balancing)

      • GPU ~ MM: HBM, NV Link, GPUDirect

    • Computation: accelerators (domain-specific processors)

    • Latency: direct access to different levels (RDMA, smart NIC)

    • Memory capacity: compression, multi-device, hybrid arch, in/shared-memory, virtualization -> containerization

  • Algorithm optimization

    • Operator optimization

      • Query optimization: compilers (static vs. dynamic analysis), DSL

      • ML training: matrix operator, computation graph optimization (TASO), TVM

    • Data models: different ways of storage

      • Structure: relational, document, graph

      • Encoding: compression and decompression (arch, weight), row store vs. column store, quantized (lower accuracy requirement), hybrid accuracy

      • Access: serial (disk) vs. random (DRAM, hash)

    • Scheduling (dynamic)

      • Task (minimize runtime): streaming vs. batching (asynchronous vs. synchronous), workload specific balancing (read/write intensive, high/low contention, minimize idle overlap), configurable isolation levels

      • Device: data transfer cost between CPU, MM, GPU, etc., dynamic device for different workloads (read/writing-intensive), concurrency control

      • Distributed: data, logging shipping, load balancing, synchronization (consistency)

    • Lazy

      • Lazy evaluation, loading, initializing

  • Principles

    • Performance: speed (see key bottleneck)

    • Scalability: multi-device/nodes

    • Reliability: fault-tolerance (networks, time, device)

    • Flexibility: end-to-end

    • Cost: short/long-term (considering tech development)

    • Maintainability: automatic failover, logging

    • ACID: atomic, consistency, Isolation, durable

  • Tradeoff

    • Flexibility vs. performance (vs. cost)

    • Different workloads: performance between read-intensive and write-intensive

    • IO vs. concurrency (vs. latency)

    • Capacity vs. performance: capacity vs. cache & materialization

    • Normalization vs. denormalization

    • Availability vs. consistency