Multicore scalability. VLDB '14.
This paper explores the performance of seven concurrency control algorithms (from the 2PL and timestamp-ordering families) as the number of cores scales up to 1024. The authors concisely introduce the motivation for studying DBMS schemes on chips with thousands of cores and the mechanisms of each concurrency control algorithm. In the experimental part, the paper performs a very thorough analysis of the algorithms on different workloads (YCSB and TPC-C) and identifies the bottlenecks of each. Finally, the authors conclude that none of the current concurrency control algorithms, even the state-of-the-art ones, scales well to 1024 cores.
The paper gives a good introduction to its own motivation and background; for example, it illustrates the seven concurrency control algorithms concisely.
The authors completed this experiment with great patience and care. They implemented all the algorithms themselves to avoid interference from the internals of an existing database system.
The results on the simulator and on real hardware are compared, which makes the conclusions more convincing.
Although the authors have stated why they chose to build a custom DBMS instead of using a commercial one, it would be better if they could compare the results of the custom DBMS against commercial ones.
Hybrid concurrency control schemes may be evaluated to make the paper more comprehensive.
Evaluation of the scalability (up to 1K cores) of 7 classical concurrency control protocols for main-memory OLTP with a prototype, DBx1000. All of them fail to scale to 1K cores.
Intro to 2PL (DL_DETECT, NO_WAIT, WAIT_DIE), Timestamp Ordering (TIMESTAMP, OCC, MVCC, H-STORE)
Bottlenecks for each CC scheme
DL_DETECT - Scales under low contention. Suffers from lock thrashing.
NO_WAIT - Has no centralized point of contention. Highly scalable. Very high abort rate
WAIT_DIE - Suffers from lock thrashing and timestamp bottleneck.
TIMESTAMP - High overhead from copying data locally. Non-blocking writes. Suffers from timestamp bottleneck.
OCC - High overhead for copying data locally. High abort cost. Suffers from timestamp bottleneck.
MVCC - Performs well with read-intensive workload. Non-blocking reads and writes. Suffers from timestamp bottleneck.
H-STORE - The best algorithm for partitioned workloads. Suffers from multi-partition transactions and timestamp bottleneck.
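As a quick illustration of why NO_WAIT from the list above has no deadlocks, here is a minimal lock-manager sketch in Python (hypothetical structure, not the DBx1000 implementation): any conflicting request aborts immediately instead of waiting, so a waits-for cycle can never form.

```python
class NoWaitLockManager:
    """Toy NO_WAIT 2PL lock table: conflicts abort, nobody ever waits."""

    def __init__(self):
        self.owners = {}  # record -> (mode, set of txn ids holding it)

    def acquire(self, txn, record, mode):
        """Return True if granted; False means the caller must abort."""
        held = self.owners.get(record)
        if held is None:
            self.owners[record] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == "S" and held_mode == "S":
            holders.add(txn)                 # shared locks are compatible
            return True
        if holders == {txn}:                 # sole holder may upgrade
            self.owners[record] = ("X", holders)
            return True
        return False                         # conflict: abort, don't wait

    def release_all(self, txn):
        """Release txn's locks at commit/abort (strict 2PL)."""
        for rec in list(self.owners):
            _, holders = self.owners[rec]
            holders.discard(txn)
            if not holders:
                del self.owners[rec]
```

The trade-off from the notes shows up directly: no waits-for graph to maintain, but every conflict costs an abort.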
Optimizations in DBx1000 - mutex based allocation -> atomic instruction -> batch allocation, hardware counter, distributed clock
What are the pros and cons of timestamp ordering over two-phase locking? Can you think of other examples of using timestamps in other fields of CS?
Pros of T/O: simpler, no deadlocks
Cons of T/O: timestamp allocation bottleneck, storing read/write timestamps
Examples of timestamps: third-party authentication, sync/recovery in distributed systems, consensus, parallel compilation, cryptocurrency, network protocol, cache coherence, distributed file systems, transactional memory, vector-clock
What are the main pros and cons of a multi-version concurrency control (MVCC) protocol? How is MVCC related to HTAP (Hybrid transactional/analytical processing)?
Pros: good for rolling back, efficient writes, non-block reads
Cons: Memory copy, memory space, timestamp bottleneck
HTAP: old versions for OLAP and new versions for OLTP
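A minimal multi-version read sketch (illustrative, single-threaded; not any paper's actual design): each key keeps a chain of (begin_ts, value) versions, and a reader at timestamp ts simply picks the newest version with begin_ts <= ts, which is why MVCC reads never block and why old versions can serve OLAP queries.

```python
class MVStore:
    """Toy single-threaded multi-version store: writes never block reads."""

    def __init__(self):
        self.versions = {}   # key -> list of (begin_ts, value), kept sorted

    def write(self, key, ts, value):
        chain = self.versions.setdefault(key, [])
        chain.append((ts, value))
        chain.sort()                       # keep versions ordered by begin_ts

    def read(self, key, ts):
        """Return the newest version visible at timestamp ts, else None."""
        visible = [v for begin, v in self.versions.get(key, []) if begin <= ts]
        return visible[-1] if visible else None
```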
Can you think of any hardware changes to a multicore CPU that can improve the performance/scalability of concurrency control?
NUMA management, clock synchronization, low-overhead locking, timestamp allocation, conflict detection in hardware, persistent memory for logging, locking prefetch
2PL: scalable (DL_DETECT ~ NO_WAIT > WAIT_DIE ~ MVCC)
T/O: Timestamp allocation limits scalability.
OCC: Memory copy hurts performance
2PL > T/O > OCC
DL_DETECT does not scale due to deadlocks and thrashing
2PL > T/O > OCC > DL_DETECT
Scaling stops at small core count
NO_WAIT has good performance until 1000 cores
OCC wins at 1000 cores
NO_WAIT, OCC > others > DL_DETECT
WAIT_DIE: A transaction T1 waits for another transaction T2 only if T1 has higher priority than T2 (e.g., T1 starts execution before T2).
Pros over NO_WAIT
Guaranteed forward progress (i.e., no starvation)
Cons over NO_WAIT
Locking logic is more complex
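The wait-die rule itself is tiny; a sketch using start timestamps as priorities (smaller = older = higher priority, per the definition above):

```python
def wait_die(requester_ts, holder_ts):
    """Wait-die: an older (smaller-ts, higher-priority) requester may wait;
    a younger requester aborts ("dies") instead of waiting, so no cycles
    can form and the oldest transaction always makes forward progress."""
    return "wait" if requester_ts < holder_ts else "abort"
```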
Solutions to Timestamp Allocation (later ones are better)
Mutex based allocation
Hardware Counter (~1000 million ts/s)
Distributed Clock (perfect scalability)
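The batch-allocation idea from the list above can be sketched as follows (illustrative only; the batch size and the lock standing in for an atomic fetch-and-add are assumptions): each thread claims a whole batch of timestamps with one contended increment instead of one per timestamp.

```python
import itertools
import threading

class BatchedTimestampAllocator:
    """Each thread claims a batch of timestamps with one locked increment,
    amortizing contention on the shared counter."""

    def __init__(self, batch_size=32):
        self.batch_size = batch_size
        self.counter = itertools.count(0, batch_size)  # start of each batch
        self.lock = threading.Lock()       # stands in for the atomic op
        self.local = threading.local()     # per-thread batch state

    def next_ts(self):
        loc = self.local
        if getattr(loc, "remaining", 0) == 0:
            with self.lock:                # one contended op per batch
                loc.next = next(self.counter)
            loc.remaining = self.batch_size
        ts = loc.next
        loc.next += 1
        loc.remaining -= 1
        return ts
```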
Distributed transactions - 2PC (two-phase commit)
Every tuple is mapped to one partition
A partition of data becomes unavailable if a server crashes (or, more generally, if the warehouse where the server is located fails)
Replicate data across multiple servers so the data stays available as long as at least one replica is still alive
If the primary node fails, fail over to a secondary node; recover from the log if all replicas fail
RAID takes more time (minutes) to recover from disks, compared to the main-memory approach above
Problem for traditional distributed db
2PC is expensive
Network can be a bottleneck for log shipping
Decide the global execution order of transactions before executing them
All replicas follow same order to execute the transactions
Non-deterministic events are resolved and logged before dispatching the transactions (e.g., assign values for any potential random numbers before dispatching)
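The rule above can be sketched like this (hypothetical structure, not any system's actual API): the sequencer resolves and logs any randomness before replication, and every replica replays the logged value instead of calling random() itself.

```python
import random

def sequence_txn(txn_input, rng=random):
    """Sequencer side: resolve randomness up front and attach the chosen
    value to the input, which is then replicated to all replicas."""
    resolved = dict(txn_input)
    if resolved.get("needs_random"):
        resolved["random_value"] = rng.randint(0, 2**31 - 1)
    return resolved

def replay(resolved):
    """Replica side: use the logged value; never call random() locally."""
    return resolved.get("random_value")
```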
Pros over traditional distributed db
Log batch of inputs -> No two-phase commit
Replicate inputs -> Less network traffic than log shipping
(extreme case of operational logging, which can happen at any level; can recover from the inputs directly)
Different replicas may produce different schedules, but the final result is the same
Distributed across all nodes
No single point of failure
High scalability (infinite horizontal scalability?)
Replicate transaction inputs asynchronously through Paxos (instead of log shipping)
10ms batch epoch for batching
Batch the transaction inputs, determine their execution sequence, and dispatch them to the schedulers
All transactions have to declare all lock requests before transaction execution starts (one con, since this is too strong an assumption)
Single thread issuing lock requests
Scheduler is a bottleneck for read-only workloads
Transaction Execution Phases
Analyze all read/write sets
Passive participants (read-only partition)
Active participants (has write in partition)
Perform local reads
Serve remote reads
send data needed by remote ones.
Collect remote read results
receive data from remote.
execute transaction logic and apply writes
UPDATE table SET salary = 1.1 * salary WHERE salary < 1000
Need to perform reads to determine a transaction’s read/write set
How to compute the read/write set?
Modifying the client transaction code
Reconnaissance query (a read-only query that discovers the sets but performs no updates) to discover the full read/write sets
If prediction is wrong (read/write set changes), repeat the process
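The reconnaissance loop above can be sketched as follows (hypothetical helper names, not Calvin's actual API): run the transaction logic in a read-only "dry run" to predict the sets, re-check the prediction before applying writes, and retry if it no longer holds.

```python
def discover_sets(txn_logic, db):
    """Dry-run the transaction to predict its read/write sets."""
    reads, writes = set(), set()
    txn_logic(db, reads, writes, dry_run=True)   # no updates applied
    return reads, writes

def execute_with_reconnaissance(txn_logic, db):
    while True:
        predicted_r, predicted_w = discover_sets(txn_logic, db)
        # ...locks on predicted_r | predicted_w would be requested here,
        #    in the globally agreed order...
        actual_r, actual_w = set(), set()
        txn_logic(db, actual_r, actual_w, dry_run=True)   # re-check under locks
        if actual_r <= predicted_r and actual_w <= predicted_w:
            txn_logic(db, set(), set(), dry_run=False)    # safe: apply writes
            return predicted_r, predicted_w
        # sets changed between reconnaissance and execution: repeat
```

With the UPDATE-salary example from the notes, the dry run touches the same rows the real execution will, so the prediction holds and the loop runs once.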
Disk Based Storage
Fixed serial order leads to more blocking
T1 write(A), write(B)
T2 write(B), write(C)
T3 write(C), write(D)
Prefetch ( warmup ) request to relevant storage components
Add artificial delay equal to the I/O latency
Transaction would find all data items in memory
Logs before a checkpoint can be truncated
Naïve synchronous mode:
Stop one replica, checkpoint, replay delayed transactions
Stores two copies of each record
Conventional distributed transactions
Partition -> 2PC (network messages and log writes)
Replication -> Log shipping (network traffic)
Deterministic transaction processing
Determine the serial order before execution
Replicate transaction inputs (less network traffic than log shipping)
No need to run 2PC
Is knowing read/write sets necessary for deterministic transactions? How does the protocol change if we remove this assumption?
With just partitioning, no
With partitioning and replication, it is possible to remove the assumption if transactions do not require remote data
Can you think of other optimizations if the read/write sets are known before transaction execution?
No need to broadcast reads to all active participants
Use better deterministic ordering to improve performance
Enforce no conflicts within a batch -> no need to lock
Blind write optimization
For a batch of transactions, Calvin performs a single Paxos to replicate inputs. Is it possible to amortize 2PC overhead with batch execution but not using deterministic transactions?
Run 2PC in batches
Epoch-based concurrency control like Silo
Many previous works on GPU-based databases, from both academia and industry, claimed that their products achieve great speedups, sometimes even higher than the bandwidth ratio between modern GPUs and CPUs. This paper explores the fundamental performance characteristics of GPU- and CPU-based OLAP databases and argues why the model where the GPU acts as a coprocessor, which many papers adopted, is inefficient. The authors try to implement the state-of-the-art model for both platforms to make the comparison as unbiased as possible. Specifically, they first present Crystal, which generates efficient query code by composing data-processing primitives. Second, they implement state-of-the-art operators for both CPU and GPU along with predictive performance models. The results show that for individual query operators, the speedups of selection, projection, and sort approach the bandwidth ratio, while join achieves a smaller speedup. But surprisingly, full-query performance of their GPU database implementation achieves a higher speedup than the bandwidth ratio, tested on a simplified TPC-H. Additionally, the authors compare the cost of GPU and CPU databases and conclude that a state-of-the-art GPU database can now achieve a speedup higher than the bandwidth ratio (25x) at lower cost (4x more cost-effective). This means the GPU can already be an ideal option when the dataset fits in GPU memory.
S1. This paper presents a very fair comparison between GPU- and CPU-based databases. To achieve this, the authors implement state-of-the-art operators for both CPU and GPU, along with other optimizations for each architecture. The results show that many GPU databases have exaggerated their performance relative to traditional CPU databases.
S2. The authors argue why the model where the GPU serves as a coprocessor is less efficient than using the GPU as the main execution engine. Since the coprocessor model is very popular among GPU databases, it seems that many of them have exaggerated their performance, although the alternative architecture does not necessarily perform better when the dataset is too large.
S3. The authors implemented state-of-the-art individual query operators and established a tile-based execution model to optimize performance. The results generally match their model of expected runtime.
W1. One main disadvantage of GPU-based databases is the smaller memory capacity of GPUs. To be practical, more GPUs are needed in a single machine than CPUs. So it would be better for the authors to experiment with multiple GPUs in a single machine.
W2. The experiments assume that the data is already stored in both CPU and GPU memory. But initially, shipping data from CPU memory or disk to GPU memory also takes time. (However, if the dataset is not very large and fits in GPU memory directly, this one-time transfer cost does not seem to be a bottleneck, as indicated.)
W3. The scale of the workload (simplified TPC-H) and the number of GPUs do not necessarily prove that GPU-based databases can scale very well.
C1. The paper concludes that a GPU database can be a reasonable choice considering both cost and performance, especially when the dataset is not very large.
C2. Some commercial GPU database products are not open-source (e.g., SQream). Since they are commercial, I'm curious about the actual speedups they can achieve.
C3. It seems that for a long time GPU databases will run on heterogeneous systems with significant CPU memory. How to optimize such a system to utilize the resources of both the GPU and the CPU effectively will be a problem.
Some startups working on GPU DBs: MapD (earlier & open-source), SQream DB (Israel & Ali Cloud), Zilliz (China)
Disk access vs. Memory access
Sequential disk access bandwidth: hundreds of MB/s (~1/100 of main memory); random access is ~100,000x slower than main memory, so the bottleneck is I/O, and techniques like the buffer manager are used.
(Pure) in-memory DB: the new bottlenecks are main-memory bandwidth, CPU performance, L1 cache misses, and branch mispredictions.
GPU vs. CPU
NVIDIA Tesla K80: 8.74 TFLOPS, bandwidth = 500 GB/s
CPU: at most ~50 GFLOPS and ~30 GB/s
Pros: better parallelism, higher FLOPS and higher bandwidth
Cons: simple architecture, limited flexibility, limited types of operations, fewer independent cores (GPU cores can't execute independently today; low tolerance for divergent execution makes parallelizing algorithms difficult), bad performance on control logic and branch prediction, higher latency, slower clock speed
CPU: high-frequency, deep-pipelined, heavily-cached
Data parallelism: better parallelism, high requirement for MM
Model parallelism: lower memory requirement, lower efficiency (parts of the model sit idle; due to dependencies, only one part is working at a time)
When the model is larger than the capacity of a single memory or GPU, model parallelism is preferred.
Otherwise data parallelism is preferred due to its faster speed.
Performance - GPU-only > CPU-only > co-processor
Crystal - tile based execution model
GPUs are 25x faster and 4x more cost effective
What are the advantages and disadvantages of executing transactions on GPUs?
Pros: More parallelism; higher memory bandwidth; good for read-only transactions
Cons: Limited memory; ACID over SIMT (e.g., logging latency, concurrency control scalability, etc.); scratchpad hard to use
Can you think of any solutions (either software or hardware) to overcome the problems of (1) limited PCIe bandwidth between CPU and GPU and (2) limited GPU memory capacity?
Avoiding the PCIe bottleneck: better scheduling of data movement (keep only warm data on the GPU; use the GPU when the result set is small); GPUDirect; compress data transfers
Handling limited GPU memory: virtual memory for GPU; HBM to CPU; compress data in GPU; multi-GPU; Hybrid CPU-GPU system
What are the main opportunities and challenges of deploying a database on heterogeneous hardware?
Opportunities: Workload specific optimizations; minimize data transfer cost; massive parallelism
Challenges: complex scheduling, load balancing; hard to program; difficult failure handling; complex coordination among devices with different architectures
In this paper the authors present the design of a Database Processing Unit (DPU), which mainly includes three parts: the ISA, the micro-architecture, and the hardware implementation. For the ISA, they implement standard relational operators that manipulate primitives well suited to acceleration and parallelism. They then examine possible optimal configurations of hardware tiles from two angles (a per-tile sensitivity analysis and case-by-case experiments on the final 150 configurations). Beyond that, they also evaluate the communication needs, the query scheduling algorithms, and their performance. Finally, in the evaluation, the authors show that Q100 both performs better on TPC-H and is more energy efficient than a traditional database system.
S1. The authors present a clear approach to designing the micro-architecture for Q100 and explain, using models, why their configurations are potentially optimal.
S2. The authors did a thorough evaluation of Q100's performance across different factors and explored the respective bottlenecks.
S3. Unlike the query-specific circuits of previous work, this paper presents domain-specific circuits. The evaluation shows that this achieves much better performance on TPC-H than a traditional RDBMS.
W1. Further optimization of the query scheduling algorithms could be applied, though it is indeed difficult to perform effective optimizations on an ASIC chip.
W2. The authors did not experiment with multiple DPUs in a network. It should be tested how to balance the communication and computation costs among networked DPUs (and how to handle fault tolerance).
C1. Maybe the DPU could also act as a coprocessor to accelerate computation-intensive analytical queries.
C2. Q100 DPUs are still not flexible enough in areas like regular-expression matching and non-relational DBMSs, but they seem to have a promising future in OLAP DBMSs.
C3. I'm really not familiar with this area, so it's quite difficult for me to go through this paper.
Accelerator for analytical queries (not txns)
Hardware support for relational operators - join, aggregation, sort, select
Process data as streams
Combination of spatial and temporal instructions
Aggregator: both the group_by and aggregate columns are sorted
BoolGen: compare two columns and generate bit vector
ColFilter: select values from a column based on a bit vector
Joiner: Inner-equijoin (hash or merge join?)
Partitioner: range-partition input column
Sorter: bitonic sort for 1024 records
Append: append two tables with the same schema
ColSelect: extract column from table
Concat: concatenate two columns
Stitch: produce a table based on multiple input columns
Q100 is an efficient domain-specific accelerator for analytical database workloads
ISA exploits parallelism and streaming efficiencies
At < 15% area and power of a Xeon core, a Q100 device gets exceptional performance and energy efficiency
Key to Q100’s high performance?
Streaming to reduce data movement, pipeline parallelism, optimized for TPC-H, efficient instructions, fewer memory writes
DPU vs. optimized CPU?
Q100 is better due to specialization (1)
DPU may be better than Opt-CPU on TPC-H (2)
Limitation of Q100?
Communication (between Q100 and CPU, between tiles) may be a bottleneck
Large overhead when table is big
Extra compiler and software development overhead
Cost of special hardware
Optimization goals of database accelerators?
Minimizing data movement, maximizing memory bandwidth, minimizing PCIe transfers. Processing in memory?
Data placement, scheduling, concurrency control, energy efficiency
Non-volatile memory is a new storage technology believed to have the potential to serve as a new layer in the storage hierarchy between DRAM and disks. Currently there are two major approaches to integrating NVM into a modern DBMS: use NVM exclusively as main memory, or place a page-based DRAM cache in front of NVM. However, the first approach performs worse than an in-memory DBMS, while the second fails to utilize the byte addressability of NVM, and accessing cold data on NVM remains expensive. To solve this problem, the authors present a coherent and efficient three-tier buffer manager design that includes three key techniques: cache-line-grained pages, mini pages, and pointer swizzling. The lightweight buffer manager and storage engine they implemented performed well on YCSB and TPC-C. Overall, the authors argue that conceptual simplicity is a major advantage.
This three-tier buffer manager, spanning DRAM, NVM, and SSD, improves the access speed of hot data in the first two layers and the capacity for cold data in the last two.
The authors combined three techniques coherently and effectively while keeping the overall system design simple. All of these techniques are either simple or already well known.
The authors use a hybrid approach: different representations for different data (full pages or mini pages) and cache-line-grained access only for specific operations, which improves overall performance effectively.
Cache-line-grained access is much more difficult to program than the conventional full-page approach: we need to know explicitly where all data resides before accessing it.
When an operation is very complex or infrequent, or when a mini page overflows, the authors fall back to full pages. But in this case, how can we identify the relatively hot data and avoid new overhead? There may still be room for optimization here.
The authors could explain a bit more about why they chose this specific size for mini pages. Would other mini-page sizes perform better overall?
Q1. NVM's read and write latencies are asymmetric. What other optimizations could address the slow writes of NVM? Maybe write-efficient algorithms need to be developed for complicated or write-intensive operations.
Q2. Will NVM be able to contribute to the further development of HTAP databases?
Q3. What are some other critical obstacles to the commercialization of NVM? Can they be solved in the foreseeable future?
Q4. Is it possible to combine NVM and GPU/FPGA accelerators for DBMSs?
SSD interfaces: SATA, SAS -> PCIe -> NVMe
Paper: Managing Non-Volatile Memory in Database Systems
Primary storage for relations and index structures
Can fully utilize the byte addressability of NVM
Slower than MM DBMS
Smaller capacity compared to SSD
Use a page-based DRAM cache in front of NVM
Accessing uncached pages becomes expensive
Smaller capacity compared to SSD
NVM's operation modes in a db
NVM Direct Mode (DRAM and NVM in different address space)
NVM as the primary storage
All changes are persisted to NVM when a transaction commits
Logging can be simplified
DRAM as a Cache (DRAM as a cache, managed by hardware)
NVM write throughput is lower than DRAM
Buffer management (BM) for DRAM data
This paper was written before memory-mode was available
Buffer management in SSD/HDD vs NVM
Load a full page at a time (e.g. 16kb)
Waste of bandwidth if full pages are loaded
Loading a cache line at a time (64 B)
Page initially empty, cache lines loaded as they are accessed
Overhead: each access checks/updates resident/dirty bits
Mini pages - a sparse representation of a page
Cache lines are sorted
Promote to full page when a mini-page is full
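The cache-line-grained mini-page idea above can be sketched as follows (hypothetical layout; the 16-slot capacity is an assumption, and reads are assumed not to cross a cache-line boundary): only the 64 B lines that are actually touched get copied from NVM, and the mini page signals promotion once its slots are exhausted.

```python
CACHE_LINE = 64     # bytes per cache line
MINI_SLOTS = 16     # assumed mini-page capacity, in cache lines

class MiniPage:
    """Toy mini page over a full page resident on 'NVM' (a bytes object)."""

    def __init__(self, nvm_page):
        self.nvm = nvm_page
        self.slots = {}        # cache-line number -> 64 B copy in DRAM

    def read(self, offset, length):
        line = offset // CACHE_LINE
        if line not in self.slots:
            if len(self.slots) == MINI_SLOTS:
                raise OverflowError("mini page full: promote to full page")
            start = line * CACHE_LINE
            self.slots[line] = self.nvm[start:start + CACHE_LINE]
        within = offset % CACHE_LINE
        return self.slots[line][within:within + length]
```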
Pointer swizzling - Reduce overhead of page table lookup
Store a pointer rather than a pageID if the page is in main memory
Cannot swap out a page before its children
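A minimal pointer-swizzling sketch (names are illustrative): a child reference stores either a pageID (unswizzled) or, once resolved, a direct object reference, so hot paths skip the page-table hash lookup entirely.

```python
class Ref:
    """A child reference that swizzles itself on first resolution."""

    def __init__(self, page_id):
        self.page_id = page_id
        self.ptr = None                  # direct pointer once swizzled

    def resolve(self, page_table, load_page):
        if self.ptr is None:             # unswizzled: one-time table lookup
            if self.page_id not in page_table:
                page_table[self.page_id] = load_page(self.page_id)
            self.ptr = page_table[self.page_id]
        return self.ptr                  # swizzled: plain pointer dereference
```

This is also where the constraint above comes from: while `ptr` is live, the target page cannot simply be swapped out, so a parent cannot be evicted before its swizzled children are unswizzled.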
NVM: new device in the storage hierarchy
Taking advantage of byte-addressability to improve performance
Three-tier buffer management
How does memory-mode affect the design?
Will be faster when data fits in DRAM
Need to take care of logging
Memory mode can ease programming
Just use existing main-memory DB without change
Advantage of app-direct mode over memory mode
Directly manage replacement policy
Larger aggregated memory space
Logging can potentially be simplified
Allows hot/cold data separation
How would you design NVM-DB differently?
Better recovery structures that use NVM
Minimize writes to NVM
Use memory-mode or the dual-mode
Replace SSD with NVM (cost?)
Build LSM-tree based storage system
The arrival of new non-volatile memory brings opportunities to redesign the logging and recovery mechanisms of DBMSs to exploit the full potential of NVM. Traditionally, a DBMS uses write-ahead logging (WAL) for logging and recovery with DRAM and HDDs/SSDs. In contrast to WAL, the authors developed WBL (write-behind logging), which is designed specifically for NVM: it tracks what parts of the database have changed rather than how they were changed, and it flushes the changes to the database before recording them in the log. After implementing this protocol in Peloton and testing it on YCSB, the authors conclude that it improves a DBMS's transactional throughput by 1.3x, reduces recovery time by two orders of magnitude, and decreases the DBMS's storage footprint on NVM by 1.5x.
The authors redesigned the structure of the WBL record to make it more efficient and friendlier to the limited lifespan of NVM.
WBL exploits the special features of NVM well. Compared to WAL, WBL has better throughput and recovery speed thanks to NVM's fast writes, fast random access, and byte addressability.
A WBL log record contains all the information needed to recover from a failure, and there is no need for the DBMS to periodically construct WAL-style checkpoints to speed up recovery.
It might be better for the authors to describe more about replication schemes under the WBL protocol.
In the recovery process of WBL, the authors chose not to delete tuples in (cp, cd) immediately, instead using a background process to clean these dirty tuples asynchronously. This seems to introduce new complexity and overhead.
C1. WBL is designed specifically for one kind of DBMS with NVM. How can the protocol be made compatible with other potential storage hierarchies involving NVM? In some cases WBL may even perform worse than WAL.
C2. NVM's limited lifespan still seems to be a big problem.
C3. New log replication methods may be developed to exploit the full potential of new technologies like NVM and RDMA.
Steal vs no steal
No steal: dirty pages stay in DRAM
Processor can directly update a page
Main memory database
Steal: dirty pages may overwrite pages on disk
Must flush the UNDO log (before-image) to disk before writing to the page
Force vs no force
No Force: Modified pages may stay in main memory
Flush the REDO log (after-image) to disk before committing the transaction
Force: All modified pages written back to disk before commit
Can commit transaction after all pages are forced to disk
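The steal/force matrix above boils down to two rules, restated in code for quick reference (standard textbook logic, not paper-specific):

```python
def logging_required(steal, force):
    """Which log types a buffer-management policy needs.

    steal: dirty pages may reach disk before commit  -> need UNDO info
    force: all dirty pages flushed before commit     -> no REDO needed
    """
    return {"undo": steal, "redo": not force}
```

This matches the WBL discussion that follows: a force policy (combined with multi-versioning) is what lets WBL drop both REDO and UNDO images from the log.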
NVM: new device in the storage hierarchy
Taking advantage of both byte-addressability and non-volatility to improve performance of fault tolerance
Force + steal ---> UNDO only
Force + MVCC ---> No UNDO, No REDO
High availability and remote recovery
HA requires synchronous round trip
Network may become a bottleneck with HA
WBL cannot be simply extended to provide HA or remote recovery. Possible solutions (1) add REDO logging (2) use RDMA to directly update remote NVM
REDO vs. UNDO vs. WBL
WBL pros: instant recovery, small log size
WBL cons: requires multi-versioning, works only for NVM
UNDO: bounded log size (typically small)
WBL with three-tier architecture
Challenges: Pulling cold data from SSD, page vs. byte granularity
Log and hot data in NVM, cold data in SSD
Where is HBM used?
Nvidia Pascal GPU, Nvidia Volta GPU
AMD Radeon GPU
Intel Xeon Phi CPU (abandoned), so HBM is mostly used in GPUs
Symmetric Hash Join
No Partitioning Hash Join
Partitioned Hash Join
Parallel Radix Join
M-Way Sort-Merge Join
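For reference, the no-partitioning variant from the list above is essentially just a shared build plus a probe; a single-threaded sketch (the probe loop is the random-access phase whose bandwidth sensitivity the HBM results measure):

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """No-partitioning hash join: one shared hash table, no radix pass."""
    table = {}
    for row in build_rows:                     # build on the smaller input
        table.setdefault(row[build_key], []).append(row)
    out = []
    for row in probe_rows:                     # probe: random accesses
        for match in table.get(row[probe_key], []):
            out.append({**match, **row})
    return out
```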
Flat mode better than cache mode
A high number of threads can easily lead to memory bottlenecks
Do not place everything in HBM
Random memory access patterns do not saturate bandwidth in general and are therefore not ideal for HBM
High bandwidth improves highly parallel hash joins and sort-merge joins more or less equally
Uneven load balancing caused by skew, as well as latches on partitions, is not noticeably influenced by HBM
Xeon Phi: High bandwidth memory integrated with multicore CPU
HBM is more popular with GPU than CPU
Cache mode does not always lead to performance improvement
Why HBM more successful with GPU than CPU?
GPU has more computation to saturate HBM bandwidth
GPU workloads are throughput-bound, not latency bound
Future of storage hierarchy?
HBM becomes the new DRAM
Need a universal interface to control the hardware
Customizable storage solutions
Another layer: Smart memory
Some may disappear (e.g., HDD)
APU for database?
Depends on the price
Promising because the bandwidth between CPU and GPU increases
Maybe hard to program
This paper discusses the opportunities and challenges of query processing on Smart SSDs. A Smart SSD is a new idea that packages CPU processing power and DRAM inside a single SSD, making it possible to run programs inside the SSD itself. The authors briefly introduce the architecture of modern SSDs and the mechanisms of Smart SSDs. They then implement an initial prototype in Microsoft SQL Server that can run operations such as selection and aggregation. They test it with three synthetic tables derived from TPC-H and find that pushing some query tasks down to the Smart SSD can increase overall performance and energy efficiency, as long as I/O and the limited processing capability of the embedded CPU do not become bottlenecks.
There are existing approaches adopted by IBM and Oracle to push down computation, but unlike those, this paper puts the CPU inside the I/O device itself rather than next to the I/O subsystem.
Running code inside a Smart SSD requires special development effort from programmers. The authors built a practical development environment hand-in-hand with Samsung for this project, which is highly beneficial to later database developers interested in Smart SSDs.
The discussion section is very thought-provoking. The authors identify several possible bottlenecks and near-future opportunities for Smart SSDs.
Although the embedded CPU in a Smart SSD is much more energy efficient, it also brings new limitations. Such an embedded CPU is not deeply cached and deeply pipelined like a server-class CPU and has limited processing capability. Several optimization techniques are therefore unrealistic inside a Smart SSD, unless manufacturers one day agree to build Smart SSDs with stronger built-in CPUs. As the authors indicate, this could become a bottleneck.
The authors could describe in more detail how they programmed query processing inside the SSD using these communication protocols and APIs.
The experiments are not so thorough: the authors implement only selection and aggregation, the operators that exploit Smart SSD performance best, and the workload is somewhat simplified. But I fully understand that it is difficult to program Smart SSDs, which few people have done before.
Are Smart SSDs becoming popular and commercialized now?
As mentioned by the authors, there is a chicken-and-egg problem between SSD manufacturers and software vendors. Is this still a common and critical conflict in the systems area now?
It seems that pushing down computation is a trend now, as with Smart SSDs and SmartNICs. This is especially useful for specific tasks that must be done repeatedly. But I think there is a limit to how far computation should be pushed down. Since we already have a powerful general-purpose central CPU, it could be over-design for Smart SSDs and SmartNICs to become too complicated and general in the future. We need to find a cost/energy balance between them.
Smart memory/storage - pushing computation to memory/storage
PIM, Smart SSD, Active Disk (CMU 1998), Intelligent Disk (Berkeley 1998), AWS S3 Select, ...
Query processing on Smart SSDs
Internal bandwidth larger than external bandwidth
The in-SSD processor is less powerful but cheaper; a Smart SSD may improve the overall cost/performance
Reduce energy consumption
Gap between external and internal bandwidth determines the potential performance improvement
Smart SSD prototype used in the paper delivers 2.7x speedup
How do fast I/O and networking affect smart memory/storage?
Closes internal/external bandwidth gap => less gain from smart SSD
Cost and energy
Supporting complex operators
Join: Small table fits in Smart SSD memory; computation simple enough
Break down the complex operators
It is not wise to push a join down entirely
Push some simple group-by
Data partitioning in Smart SSD
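The bandwidth argument behind pushdown can be shown in a toy simulation (hypothetical interface, nothing like the paper's actual API): when the filter runs "inside the device," the pages are scanned at internal bandwidth and only the qualifying rows cross the slower external link.

```python
def scan_with_pushdown(pages, predicate):
    """Simulate an in-device filter: count rows crossing the external link."""
    out, shipped = [], 0
    for page in pages:                 # scanned at internal (fast) bandwidth
        for row in page:
            if predicate(row):         # predicate evaluated in-device
                out.append(row)
                shipped += 1           # only matches leave the SSD
    return out, shipped
```

The gap between the total rows scanned and the rows shipped is exactly the "internal vs. external bandwidth" gap the notes above identify as the source of the gain.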
Data movement always affects task performance and energy efficiency significantly. In this paper the authors push computation from the CPU into memory to reduce the cost of data movement. First, they briefly introduce current PIM architectures and illustrate their motivation through the execution of a selection operator. Second, they present their implementation of a group of five SIMD query operators. Lastly, they thoroughly test the performance and energy consumption of their design on each architecture and find that a hybrid query scheme can bring significant performance gains while being more energy efficient.
Although pushing down computation to reduce data movement is a classical idea, but it's still novel to discuss the trade-offs of query execution between PIM and x86 in database community. Few (or none) people have done a thorough study on this topic.
The authors completed a very thorough experiment. Compared to previous studies, they analyzed both query execution time and energy consumption on multiple datasets.
Their prototype is more mature (compared to the Smart SSD paper we discussed last time). They implemented a group of five SIMD operators and discussed the potential of hybrid scheduling, which better illustrates the opportunities for PIM.
Maybe the authors could talk a little more about the details of their implementation of the SIMD query operators.
Maybe the authors could briefly compare PIM with other kinds of hardware for query acceleration, or identify the similarities and differences at a high level.
Are there any studies comparing overall query execution performance across methods like GPU, Smart SSD, and PIM? It's not realistic to use all of them to accelerate query execution at the same time, and there should be a fair comparison between them to discuss the trade-offs (although obviously this means a lot of work).
Is it also very difficult to program with PIM, just like Smart SSD? Apart from the five SIMD query operators they implemented, what about UDFs in PIM?
Previous NDP for Databases
Previous NDP-DB: Active disk, Intelligent disk, smart SSD
No commercial adoption of previous work
Limitations of hardware technology => HBM and HMC
Continuous growth in CPU performance => Moore’s law is slowing down
Lack of general programming interface => SIMD
How to improve group-by aggregation performance?
Store aggregation values in registers
Use buffers/caches to accelerate random accesses
Keep hot data in registers and swap to memory as needed
Sort data by groups in memory to convert random to sequential accesses
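The sort-based trick above can be sketched in a few lines (illustrative only): sorting by group key turns random hash-table probes into a single sequential aggregation pass over runs of equal keys.

```python
def groupby_sorted(pairs):
    """Sum values per group by sorting (key, value) pairs first, then
    aggregating in one sequential scan; each group is a contiguous run,
    so no random accesses are needed during aggregation."""
    pairs = sorted(pairs, key=lambda kv: kv[0])
    out, cur_key, cur_sum = [], None, 0
    for k, v in pairs:
        if k != cur_key:
            if cur_key is not None:
                out.append((cur_key, cur_sum))
            cur_key, cur_sum = k, 0
        cur_sum += v
    if cur_key is not None:
        out.append((cur_key, cur_sum))
    return out
```

Whether the sort pays off depends on the number of groups: with few groups the register/hash approach wins, with many groups the sequential scan does.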
Smart SSD and PIM for transactions
Logging and garbage collection to Smart SSD
Compression and decompression of column store
Generating TID using SmartSSD/PIM
Where is PIM most likely to succeed in the storage hierarchy?
NVM: persistency and byte-addressability
HBM, DRAM, NVM, or SSD (no benefits for SRAM)
Cloud storage obviously (e.g., PushdownDB), then move up the stack to SSD/NVM, and may stop at DRAM due to limited benefits and added complexity
Three Tiered DB: admission/eviction can be pushed down to NVM or SSD
RDMA for DB
Currently, distributed DBMS architectures and distributed algorithms are specially designed to minimize network communication cost, assuming that the network is the primary bottleneck. At present, with RDMA and InfiniBand we can already achieve network bandwidth comparable to that of one memory channel, not to mention the potential development of RDMA in the future. In this paper, the authors present the new Network-Attached Memory (NAM) architecture and alternative architectures for distributed DBMSs based on it. By redesigning these architectures, the authors show that traditional 2PL schemes can scale well again. Also, the authors implemented new JOIN and AGGREGATION operators for distributed OLAP workloads and achieved a significant performance increase compared to traditional operators designed for distributed DBMSs using locality-awareness or semi-joins.
Instead of using RDMA as an afterthought for existing DBMSs, the authors argued that it is necessary to redesign the architecture of current distributed DBMSs to leverage the full potential of RDMA. This new design brings them a significant performance increase.
Traditional distributed 2PC doesn't scale due to the slow network. The authors present a new protocol which (theoretically) avoids the scalability limit and can leverage RDMA and NAM to the fullest.
Their new designs of distributed operators like JOIN and AGGREGATION provide a performance increase and more robustness, and can better handle data skew.
It would be better if the authors could talk more about their actual implementation of an in-memory DBMS for RDMA.
The evaluation of RSI seems not very thorough.
RDMA has great potential, and it's necessary to build a more user-friendly programming environment for it.
Will traditional optimizations for distributed operators, such as locality-awareness, become insignificant as RDMA becomes more and more popular?
Bandwidth and latency
InfiniBand and RDMA
Shared-Nothing vs. Shared-Memory
RDMA for OLTP
Snapshot isolation, generalized snapshot isolation
Cost of 2PC+GSI, RDMA+SI
RDMA for OLAP
Grace Hash Join
Partitioning R and S using the join key
Read data on the sender
Transfer data over network
Materialize data on the receiver
Semi-Reduction using Bloom Filters
Create bloom filters for R and S on the join key
Filter R and S using the bloom filter and partition
Local join (parallel radix join)
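The semi-reduction steps above can be sketched in Python (toy Bloom filter and plain dict-based hash join, not the paper's MPI implementation): build a Bloom filter on S's join keys, filter R before shuffling, then join; false positives only cost extra traffic, never wrong results.

```python
import hashlib

class BloomFilter:
    """Tiny illustrative Bloom filter (fixed 1024 bits, 3 hashes)."""
    def __init__(self, nbits=1024, nhashes=3):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = 0
    def _positions(self, key):
        for i in range(self.nhashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits
    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p
    def might_contain(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))

def semi_reduced_join(R, S):
    """Filter R with a Bloom filter built on S's join keys (first tuple
    field), then do an ordinary hash join on the survivors."""
    bf = BloomFilter()
    for s in S:
        bf.add(s[0])
    R_filtered = [r for r in R if bf.might_contain(r[0])]  # semi-reduction
    table = {}
    for s in S:
        table.setdefault(s[0], []).append(s)
    return [(r, s) for r in R_filtered for s in table.get(r[0], [])]
```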
GHJ with RDMA - time analysis
RDMA for transaction execution phase
No need to partition or replicate indexes
Centralized locking and replication can be faster
Centralized logging service
Accessing and locking remote data using RDMA
Prefetch data into local memory
DB components significantly affected by a faster network
Two phase commit, Consensus, Replication
Less worry about locality
Break the program down into multiple microservices
Data shuffling in the network
A distributed system becomes a NUMA machine if the network is sufficiently fast
Opportunities and challenges of memory disaggregation
Opportunities: Independent scaling of CPU and memory; simplifying scheduling; larger aggregated memory capacity; potentially faster fault tolerance
Challenges: independent failure of compute and memory; consistency and coherency; data placement and partitioning
RDMA-based networks can achieve bandwidth comparable to that of main memory, while current state-of-the-art replication algorithms are designed to minimize network communication cost, assuming that the network is the main bottleneck. The authors analyzed why conventional high-availability protocols are not able to leverage RDMA-based networks. Then they proposed Active-Memory Replication, which is designed to exploit the potential of RDMA while achieving high availability. It is an undo-log-based fault-tolerance protocol that offers better performance while maintaining correctness. The evaluation demonstrates the correctness of this protocol and shows that it is 2x faster than the second-best protocol on RDMA-based networks.
With good illustrations (code, timelines, branches, figures), the authors present their new protocol thoroughly and in a clear manner.
This paper has a well-written background section, which gives readers a concise guide to current replication algorithms and analyzes why they are no longer suitable for RDMA-based networks.
The evaluation is relatively thorough. The experiments show how the new Active-Memory protocol outperforms current replication algorithms in broad scenarios (on RDMA-based networks) while maintaining correctness.
It doesn't use log shipping and thus avoids processing overhead on the backups, which helps this new protocol exploit the full potential of RDMA and outperform FaRM.
Would like to see comparisons with more state-of-the-art replication algorithms to better illustrate the advantages of the Active-Memory protocol in different scenarios.
The scalability experiment may need more nodes to prove a definite scalability advantage for the Active-Memory protocol over conventional replication algorithms.
What other components of a distributed system lose their advantages on RDMA-based networks and need a redesign?
How to extend to architectures other than shared-nothing?
Active-Passive, Active-Active Mode
RDMA shifts the bottleneck from network to CPU
Conventional HA protocols (i.e., active-passive and active-active) are optimized for reducing network demand and are thus no longer optimal for RDMA
Active-memory is optimized to reduce CPU demand
Active-memory achieves 2x performance improvement
Active memory without in-order delivery?
Assign a sequence number to each packet and reassemble at the receiving side
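The receiver-side reassembly idea can be sketched as follows (hypothetical class; a real Active-Memory variant would do this in NIC/RDMA logic, not application Python): buffer out-of-order packets in a heap and release them only in sequence order.

```python
import heapq

class Reassembler:
    """Deliver packets in sequence order even if the network reorders
    them: buffer anything ahead of the expected sequence number."""
    def __init__(self):
        self.next_seq = 0
        self.pending = []  # min-heap of (seq, payload)

    def receive(self, seq, payload):
        heapq.heappush(self.pending, (seq, payload))
        delivered = []
        # Flush every packet that is now contiguous with what we delivered.
        while self.pending and self.pending[0][0] == self.next_seq:
            delivered.append(heapq.heappop(self.pending)[1])
            self.next_seq += 1
        return delivered
```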
Active Memory vs. Write-Behind Logging?
Both use “force” instead of “no-force”
Can be combined (single- vs. multi-versioning)
Keep data in persistent memory in Active Memory
Other examples of increasing computation to reduce network overhead
Data centric computing (moving computation to data)
Compression and decompression
Directory-based cache coherence: unicast vs. multicast
Currently, there is a gap between the increasing network bandwidth and the stagnating, limited CPU resources. The emerging multicore-SoC SmartNICs, which have rich computing resources, have the potential to offload generic and complex workloads in the datacenter. Instead of offloading one specific workload onto FPGA-based SmartNICs as in previous work, this paper summarizes the offloading performance implications from four perspectives and builds iPipe for offloading distributed applications onto SmartNICs while maximizing resource utilization. The paper then introduces its actor-based model and hybrid scheduler (FCFS + DRR), and tests three complex distributed applications to show that SmartNICs have the potential to accelerate more generic and control-intensive workloads and close the gap mentioned at the beginning.
There has been some research on offloading networking workloads onto FPGA-based SmartNICs. But instead of accelerating a domain-specific workload, this paper focuses on offloading a class of applications which may have complex logic and algorithms.
The experiments tested three complex distributed applications and thoroughly showed that iPipe has the potential to accelerate more generic, complex, and control-intensive network workloads.
The paper summarizes the offloading performance implications from four perspectives (traffic control, computing capability, onboard memory, and host communication), which can help in identifying bottlenecks and maximizing resource utilization for complex and generic distributed applications.
The background part of this paper is not very well written (it lacks some concept explanations compared to papers in VLDB) and is thus less friendly to newcomers...
It seems there is still room to optimize the iPipe actor scheduler. The current scheduling policy already performs well, but it would also be beneficial to discuss potential optimizations theoretically to make the policy more sound.
Is pushing down/offloading control-intensive complex workloads to SmartNICs a promising path? Is it over designed or really necessary?
Recent advances or applications of SmartNICs in data centers of giants like Microsoft, Amazon.
Pushing down computation to ...
Storage - Smart SSD
Network - Smart NIC
Smart NIC arch
On-path vs off-path
Actor model vs OOP
DMO - Distributed Memory Object
Replicated KV Store
Log-structured merge tree for durable storage
Replication using Multi-Paxos
Actors - Consensus actor, LSM Memtable actor, LSM SSTable read actor, LSM compaction actor
Phase: read/lock -> validation -> log by coordinator -> commit
Actors - coordinator, participant, logging actors
Real time analytics
Analytics over streaming data
Actors - filter, counter, ranker
SmartNIC vs. SmartSSD
Different application scenarios: one for storage, one for network
SmartNICs used for reducing CPU overhead; SmartSSD used for reducing data movement
SmartNIC seems more popular among hardware vendors
Computation in a SmartNIC is stronger than in a SmartSSD
Database operators pushed to SmartNIC
Common: encryption, caching
OLTP: filtering, aggregation, locking, indexing
OLAP: filtering, project, aggregation, compression
Benefits of putting smartness into the NIC
Packet processing, latency reduction
Effect of SmartSSD is limited due to caching; caching does not apply in SmartNIC
Isolate security checks from CPU
Collect run time statistics such as network usage and latencies
Reduces burden on PCIe
RDMA for OLAP
The join operator is very important in both database engines and machine learning algorithms with the advent of big data. Traditionally, radix hash join and sort-merge join are the two most popular join implementations. In this paper, the authors explore the implementation of distributed join algorithms on potentially thousands of cores connected by a low-latency network. They implement these two algorithms based on MPI and test their performance thoroughly under different circumstances. They identify some key factors in scaling these two algorithms, such as the balance between computation and communication capacity, communication inefficiencies, scheduling, etc.
This paper gives the reader a friendly and concise background introduction, covering RDMA, MPI, and the high-level ideas of distributed hash/merge join algorithms.
The authors compared the actual performance of the two distributed joins with their theoretical bounds and identified key factors that significantly affect the scalability of current distributed join algorithms (balance between computation and communication capacity, communication inefficiencies, scheduling, etc.).
The implementation and experiment parts are thorough. The authors did a detailed analysis of the implementation, actual performance, cost components, and possible setup factors to reach the final high-level conclusions.
This paper did a great job investigating the two most popular join algorithms and identifying their scalability bottlenecks. But the authors did not talk much about how potential optimizations, such as scheduling and handling data skew, could be implemented. Adding these parts would give the paper more novelty.
Any progress in light-weight and effective network scheduling techniques?
How will RDMA reshape current distributed join algorithms?
Simply adding more cores will break the balance between compute and communication, and performance could even degrade. How can we measure the potential benefit of adding more cores more quantitatively?
MPI one-sided operations
Example - analysis of Radix Hash Join and Sort-Merge Join
Process, number of passes, time analysis and performance models
SmartNIC for join
Filtering, hash table, indexing
Network traffic scheduling for shuffling (reduce the problem of bursty traffic)
Hash table in SmartNIC?
Sort in SmartNIC
HW/SW techniques to improve performance of sort-merge join
Equivalent performance after removing bottlenecks? (Not necessarily)
Hardware acceleration for the sort and merge
Radix join to achieve theoretical maximum performance
Communication powered by SmartNICs/RDMA (network scheduling for shuffling)
Hash partitioning logic in SmartNIC
OLTP in cloud.
This paper introduces Amazon Aurora, a relational database for OLTP workloads. For a distributed/cloud database, the bottleneck to achieving high throughput has moved from compute and storage to the network. Traditional distributed protocols such as 2PC are intolerant of failures and have high latency. To avoid these problems, Aurora decouples compute from storage and offloads several functions, such as redo logging, crash recovery, and backup, to the storage service. By shipping only logs between the database instance and the storage nodes, and applying the redo continuously and asynchronously in the background on the storage nodes, Aurora avoids the network bottleneck, reduces response time (both for customers and for crash recovery), and achieves higher throughput.
A great paper introducing the design considerations of new cloud database architectures for high throughput. Each component is discussed in a clear manner.
Offloading several functions, such as redo logging, crash recovery, and backup/restore, to the storage tier brings many opportunities. The database instance only needs to ship the redo log to the storage tier, thus reducing the effect of the network bottleneck. This also gives faster crash recovery thanks to asynchronous background recovery.
This paper explored how to reason about cloud durability and design quorum systems which enable the database to be resilient to failures. Combined with shipping only the redo log between the database instance and storage nodes, this makes Aurora tolerant of failures and reduces latency.
Supports writes at only one replica; can't handle writes at multiple replicas.
Seems not to support distributed transactions.
What's Quorum's pros/cons over Paxos?
Shared-disk vs. shared-nothing?
Other possible components that could be offloaded from the database instance to the storage tier?
Cloud storage disaggregation
Independent management and scaling of compute and storage
Smartness in Storage - Storage nodes contain CPUs for computation
Pushdown to cloud storage
Quorum-based voting protocol, 3-way, 6-way replication
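The quorum arithmetic behind the 6-way scheme can be checked directly (Aurora uses V=6 copies with a 4/6 write quorum and 3/6 read quorum): a read quorum must intersect every write quorum, and any two write quorums must intersect.

```python
def quorum_ok(v, vw, vr):
    """Check the two classic quorum-intersection rules:
    reads see the latest write (vr + vw > v), and
    two writes cannot both succeed disjointly (2 * vw > v)."""
    return vr + vw > v and 2 * vw > v
```

With V=6/Vw=4/Vr=3, reads stay available after losing 3 copies (an AZ plus one more node) and writes after losing 2 (a whole AZ), which is exactly the failure model the paper designs for.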
Step 3 - identify gaps in the log
Step 4 - gossip with peers to fill gaps
Step 5 - coalesce log records into data pages
Step 6 - periodically stage log and pages to S3
Step 7, 8 - Periodically garbage collect old versions and validate CRC code on pages
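Step 3 above can be sketched as a simple gap scan over the LSNs a storage node has durably received (illustrative; `find_gaps` is an invented helper, not Aurora's code). The missing ranges are what the node would request from peers via gossip in step 4.

```python
def find_gaps(received_lsns):
    """Given the set of LSNs a storage node has received, return the
    missing (start, end) ranges between the lowest and highest LSN."""
    gaps = []
    lsns = sorted(received_lsns)
    for prev, cur in zip(lsns, lsns[1:]):
        if cur > prev + 1:
            gaps.append((prev + 1, cur - 1))
    return gaps
```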
Cloud storage vs. SmartSSD
Smart SSD for OLAP while Aurora for OLTP?
SmartSSD on the read path; Aurora on the write path
Computation in cloud storage more powerful than SmartSSD
Smart SSD serves one node while Cloud storage serves multiple nodes
Cloud storage has higher latency
Challenges of multi-master
Uniqueness of LSN and ordering guarantees
Concurrency control (locking, leader election)
Commit protocol (2PC, or 1PC as what Aurora uses)
Other applications benefit from cloud storage
Publish-subscribe system like Apache Kafka
Big data analytics
Cloud data warehousing
This paper explored the architectures and trade-offs involved in choosing a cloud DBMS. There are always trade-offs in the database area, and people have to make choices based on their specific workloads. Common choices include shared-nothing vs. shared-storage, column store vs. row store, horizontal vs. vertical scaling, etc. This paper chose common databases such as Athena, Redshift, and Vertica to perform a thorough experiment on common trade-offs and possible architectures when choosing a cloud DBMS. The authors concluded with their key findings from the experiments and pointed out some opportunities for future work.
This paper explored some important and common trade-offs in choosing a cloud DBMS through carefully designed experiments, and its findings can serve as good guidance for customers, researchers, and cloud service vendors.
The authors managed to experiment with a broader selection of cloud offerings and incorporated DBMS offerings, including OLAP DBMSs, which not many previous papers focused on.
This paper is very practical for both customers and cloud service vendors. The products they chose are all popular commercial/open-source products, and the experiments run under a restricted framework with common choices for customers.
This paper covers a broad selection of cloud databases and related offerings and discusses many different trade-offs, so the organization of the paper may sometimes not be easy for readers to follow.
This paper is relatively easy for me to understand, but I'm still sometimes puzzled about its focus. Still, the key findings the paper concludes with are all quite practical and insightful.
Do those findings hold for all cloud service vendors, such as Google, Alibaba, Azure, etc.? These companies offer similar products, but their prices vary from time to time. For products with generally the same functionality, the price differences between vendors are sometimes greater than I thought.
Database-as-a-Service - Redshift, Athena
Query engines - Presto, Hive
OLTP - Vertica
Complexity of fair comparison
System setup, query optimization, data format, data types, data partitioning, query restriction, caching, spilling to disk, etc...
Caching benefits, query cost, storage cost, scaling out, scaling up
Optimal design that combines the advantages?
Athena with instances pre-running
Hybrid instance store and S3; decide caching based on the workload
High-quality code compilers
Heterogeneous system that combines all the existing systems together
Optimization opportunities for serverless databases?
Optimize resource sharing among users (e.g., cache, computation)
Heterogeneous hardware and storage (e.g., different function on different hardware)
Scale computation and storage on demand
Keep instances pre-warmed to reduce cold starts
Cloud databases benefit from new hardware?
Using GPU, SmartSSD
RDMA and SmartNIC (e.g., shared cache in SSD, computation offloading)
Persistent memory to improve bandwidth and aid fast restarts
Snowflake probably represents the state of the art in cloud OLAP database design. Distributed computing and big data systems have become more and more popular in recent years. But traditional data warehouses are designed for fixed resources and lack elasticity, while big data systems like Hadoop still lack much efficiency and require significant engineering effort to roll out and use. The authors realized these facts and decided that a fundamental redesign was needed for the cloud database. Thus they introduced Snowflake, a multi-tenant, shared-data, highly scalable and elastic system designed for semi-structured and schema-less data.
Snowflake disaggregates storage and compute. Although this idea is becoming more and more popular nowadays, at the time Snowflake was designed (starting in 2012) it was a far-sighted choice which helps cloud databases achieve great scalability and elasticity.
The execution engine is columnar, vectorized, and push-based, which gives Snowflake high performance on analytic workloads. This avoids many sources of overhead in traditional query processing and makes Snowflake more scalable.
This paper really shows foresight. At the time Snowflake was designed, big data systems like Hadoop and Spark were the more popular choices, but the authors chose to build their own cloud database with a fundamental redesign, and that new design contains real novelties.
Choosing S3 as the storage layer brought Snowflake great efficiency and elasticity, but it also means degraded performance and fewer features for specific workloads (for example, less support for OLTP workloads).
It seems that choosing the storage layer is very important. Snowflake chose Amazon S3; could we achieve higher performance or other valuable features with a different storage backend?
Difficulties for cloud database that supports HTAP?
BigQuery vs. Snowflake?
Snowflake vs shared-nothing
Data format - PAX
Data horizontally partitioned into immutable files
Intermediate data spilling to S3
Elasticity and Isolation
Created, destroyed, or resized at any point (may shut down all VWs)
User may create multiple VWs for multiple queries
Local caching - S3 data can be cached in local memory or disk
File stealing to tolerate skew
Columnar - SIMD, compression
Concurrency control - Isolation, S3+MVCC, versioned snapshots
Snowflake has no indexes (same in Athena, Presto, Hive, etc.)
Min-max based pruning: store min and max values for a data block
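Min-max pruning can be sketched in a few lines (a toy zone map, not Snowflake's actual metadata format): store (min, max) per block, and skip any block whose range cannot overlap the query predicate.

```python
def build_zone_map(blocks):
    """One (min, max) pair per data block."""
    return [(min(b), max(b)) for b in blocks]

def prune(zone_map, lo, hi):
    """Return indices of blocks that may contain values in [lo, hi];
    every other block is skipped without being read from storage."""
    return [i for i, (mn, mx) in enumerate(zone_map)
            if not (mx < lo or mn > hi)]
```

This works well when data is naturally clustered (e.g., by load time), which is why it substitutes for indexes in these systems.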
High availability and fault tolerance
Stateless cloud services, replicated metadata
One node failure in VW, whole AZ failures
How far away is Snowflake from the “optimal design”?
Better optimized storage layer (like Aurora)
Security and reliability
Caching can be improved (e.g., workload specific)
Data sharing across virtual warehouses
Opportunities to extend into providing HTAP solutions
Cloud service layer might be a bottleneck
Combine data warehousing and OLTP in cloud?
Master and slave nodes within a VW to support writes as well
Build snapshot isolation into storage (concurrency control)
Transaction log -> (intermedia storage) -> S3 -> data warehouse every Y hours
VW per transaction?
For bursty analytic database workloads, provisioning a cluster with a fixed number of nodes is not a wise choice because customers have to pay for idle computing resources. A popular solution now is FaaS, like AWS Lambda, which lets customers run small, fine-grained tasks and greatly reduce the overall cost. But building an analytic query system on cloud functions comes with its own challenges (managing numerous tiny, resource-constrained workers, handling stragglers, shuffling data through a cloud service, etc.). In this paper, the authors introduce Starling, a scalable query engine which mitigates all of the above challenges and reduces both latency and cost.
As stated in the paper, Starling offers three key features (no loading required, pay-by-query, and tunable performance) that no common system on the market offers simultaneously.
Stragglers often cause significant performance degradation for analytic queries. The authors base their optimization on the power of two choices to deal with the straggler problem specifically. The experiments show that using randomization and duplicate tasks can greatly accelerate overall query execution.
The authors completed a thorough experiment with four different systems (Redshift, Presto, Athena, Starling) on multiple factors (cost of operation, query latency, scalability, concurrency, etc.). These statistics soundly demonstrate Starling's good performance.
Starling is optimized for low to moderate query volumes; there still exist some challenges (FaaS restrictions and the low-cost coordinator) that limit Starling's throughput under high query volumes.
There are still some common functions that Starling does not support.
Starling does not use a query optimizer; it might be better to consider the cost of query optimization.
Can the idea of Starling be extended to transactional workloads?
How do other systems deal with stragglers?
Serverless = FaaS + BaaS
Opinion from a CIDR'19 paper
1 step forward: auto scaling
2 steps back
Communicating only through S3 degrades performance
Not good for distributed computation: no intermediate communication
Opinion from Berkeley report
Using FaaS to implement a serverless database is a bad idea, should use BaaS
BaaS: Athena, Snowflake, Aurora (all serverless)
Cloud analytics database
Starling vs Snowflake
Starling: more fine grained, communicate only through S3
Snowflake: coarse grained, fixed size virtual warehouse, communicate through local storage
Small-scale cloud databases lower the bar
Read straggler mitigation (RSM)
If a read request times out, send duplicate request
Write straggler mitigation (WSM)
If a write request times out, send duplicate request
SingleTimer: allow only a single timeout
Double write - Producer writes two copies of an object; consumer reads the one ready first
Pipelining - Start the following stage before the previous stage finishes
Combining to reduce cost of shuffle
Starling vs Snowflake
Future of serverless computing
Opinion from Berkeley Report (Cloud Programming Simplified: A Berkeley View on Serverless Computing)
Challenges: Abstraction, System, Networking, Security, Architecture
Predictions: new BaaS, heterogeneous hardware, easy to program securely, cheaper, DB in BaaS, serverless replacing serverful
Opinion from a CIDR’19 Paper (Serverless computing: One step forward, two steps back.)
Fluid Code and Data Placement
Heterogeneous Hardware Support
Long-Running, Addressable Virtual Agents
Flexible Programming, Common IR
Service-level objectives & guarantees
FaaS vs. BaaS for databases
BaaS advantages: simplifies communication and state sharing, caching
BaaS disadvantages: potentially lower CPU and memory utilization
FaaS advantages: fine-granularity pricing model, auto-scaling
FaaS disadvantages: overhead of inter-function coordination, functions have limited resources and execution time, communication through S3, inherently designed for small functions
What can BaaS (e.g., Snowflake) borrow from FaaS?
Auto-scaling: Dynamically resource allocation and fine-grained pricing
Benefits and limiting factors of running OLTP on serverless computing?
Benefits: Elastic scaling based on demand, transactions are inherently short-lived
Limiting factors: S3 has no read-after-write consistency, concurrency control is hard due to lack of communication
Traditionally, people use two different kinds of databases for different workloads: OLTP for online transaction processing and OLAP for analytic workloads. But splitting the data into two parts means losing data freshness and degrades performance. To solve this problem, the authors introduced HyPer, (probably) the first HTAP database to handle both OLTP and OLAP simultaneously. HyPer is a main-memory database that uses hardware-assisted mechanisms to cope with the two kinds of workloads. The experiments are based on a combination of the TPC-C and TPC-H benchmarks and show good performance.
HyPer is probably the first database to introduce the idea of using one single database to handle both OLTP and OLAP. This idea is now well known as HTAP and is becoming more and more popular among database service vendors.
The paper introduces its approach to transaction isolation and durability in a clear and detailed manner.
The concurrency control scheme in HyPer is not very mature. Using a few threads for transaction processing may still leverage multicore to some extent, but there still exist many restrictions that limit HyPer's concurrency.
It might be better to make more comparisons between HyPer and other databases.
Other possible solutions to HTAP besides virtual memory snapshots?
Commercialization situation of HTAP databases?
Distributed + HTAP?
Virtual Memory Snapshots
Create consistent database snapshot for OLAP queries to read
Transactions run with copy-on-write to avoid polluting the snapshots
Fork-based virtual snapshots
Multiple OLAP session
Multi-thread OLTP processing
Logging and checkpointing
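The fork-based virtual snapshot can be demonstrated with a small POSIX sketch (Linux/macOS only; `snapshot_query` is an invented helper, not HyPer's API): after `fork()`, the child sees a copy-on-write snapshot of the in-memory state at fork time, while the parent keeps applying updates.

```python
import os

def snapshot_query(db, query):
    """Run `query` against a fork-based snapshot of `db` while the
    parent (the 'OLTP' side) keeps mutating its own copy. The child's
    result is shipped back through a pipe."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                       # child: OLAP query on the snapshot
        os.close(r)
        os.write(w, str(query(db)).encode())
        os._exit(0)
    os.close(w)                        # parent: OLTP continues concurrently
    db["x"] = db.get("x", 0) + 1       # this update is invisible to the child
    out = os.read(r, 4096).decode()    # blocks until the child answers
    os.waitpid(pid, 0)
    return out
```

The OS page table, not the database, does the versioning; that is exactly the "hardware-assisted" mechanism the paper relies on.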
Challenges of applying VM-snapshot to a shared-memory OLTP system?
The additional burden in taking a consistent snapshot.
Consistent snapshot across machines is a hard distributed system problem.
Fork() becomes more expensive
How to reduce the cost of fork() when database is large?
Large pages and partitioning
Copy on write on the page table itself
Append the child's entries to the parent's page table (one page table for all processes)
Share the page tables with OLTP while recording changed pages with a side data structure.
Fork() performs CoW at tuple granularity
Most promising architecture of HTAP? (single vs. separate systems, shared vs. separate data)
Single system, separate data: the optimal data formats (row vs. column) differ for OLTP and OLAP.
Single system, shared data: different data formats (row/column) in different replicas; avoid data sync and update propagation
Separate system and separate data: no interference between OLTP and OLAP
Separate system and shared data: better scalability and high availability
Facebook widely uses ODS, an older TSDB, as its monitoring and alerting system. The workloads at Facebook place many requirements on ODS, such as write-dominated high throughput, high availability, fault tolerance, etc. ODS is based on HBase and cannot scale well or achieve high throughput, yet it is already widely used to store more than 2PB of data... Current TSDBs on the market also can't meet Facebook's expectations. To deal with this problem, the authors introduced Gorilla, a fast, scalable, in-memory TSDB. Gorilla is used as a write-through cache for ODS and relies on aggressive compression to make the data fit in memory, raising query throughput and reducing latency. This also makes several new monitoring and debugging tools possible.
Aggressive compression techniques. Gorilla uses delta-of-delta to compress timestamps and XOR to compress point values. This enables Gorilla to save up to 90% of the space and lays a solid foundation for the in-memory design.
In-memory design. The current ODS based on HBase cannot meet the requirements of the heavy workloads at Facebook, and as a monitoring system it faces even higher standards. Combined with aggressive compression, Gorilla makes the time series data fit in memory, increasing query throughput and reducing latency.
Fault tolerance (based on cross-region clusters) and scalability (based on a shared-nothing architecture).
Shares valuable industry experience in section 6 and new monitoring and debugging tools in section 5.
Hopes to see more comparisons with other TSDBs and more performance analysis.
Tradeoff: high throughput at the cost of tiny data loss.
Recent advances of TSDB?
Gorilla is used as the write through cache of ODS based on HBase. How would the design change if Facebook directly uses Gorilla as the ODS?
Time Series DB = a software system optimized for storing and serving time series, i.e., associated pairs of time and value
DB popularity in April 2020
Why is TSDB Popular?
Monitoring software systems: Virtual machines, containers, services, applications
Monitoring physical systems: Equipment, machinery, connected devices, the environment, our homes, our bodies (Internet of Things)
Asset tracking applications: Vehicles, trucks, physical containers, pallets
Financial trading systems: Classic securities, newer cryptocurrencies
Eventing applications: Tracking user/customer interaction data
Business intelligence tools: Tracking key metrics and the overall health of the business
Gorilla = An in-memory cache for the slower TSDB on HBase
Time series compression
Data format - (timestamp, value)
Value - 64-bit doubles
Timestamp compression - delta-of-delta
Value compression - XOR
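The two compression steps above can be sketched as follows (a minimal illustration, not Gorilla's actual bit-packed encoding; function names are mine):

```python
import struct

def delta_of_delta(timestamps):
    # Store the first timestamp, then the delta of deltas; for regularly
    # spaced samples most entries are 0, which Gorilla packs into ~1 bit.
    out = [timestamps[0]]
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        out.append(delta - prev_delta)
        prev_delta = delta
    return out

def xor_values(values):
    # Consecutive float64 samples usually share sign, exponent, and the
    # top mantissa bits, so XOR-ing their bit patterns gives mostly-zero
    # words; Gorilla then stores only the meaningful middle bits.
    bits = [struct.unpack('<Q', struct.pack('<d', v))[0] for v in values]
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]

# Points arriving every 60s compress to a run of zeros after the header.
print(delta_of_delta([0, 60, 120, 180, 240]))  # [0, 60, 0, 0, 0]
```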
In-memory data structure
What’s new in TSDB compared to RDBMS?
Opportunities of data compression
Different access pattern: append intensive
Relaxed consistency model?
Different query pattern
Dixin Tang's 1st paper on adaptive CC
Many protocols are optimized for specific workload characteristics (read-heavy, highly skewed, etc.) and may perform poorly on mixed workloads. Thus we need adaptive/hybrid CC protocols.
Using hybrid CC: ACC automatically partitions the database into clusters and assigns CPU cores to each cluster according to its load statistics; for each cluster, ACC adaptively chooses a concurrency control protocol according to its workload characteristics (using a cascade of binary classifiers trained offline).
How to partition database and assign CPU? - Partition-Merge method.
Partition the records into N (= number of CPU cores) clusters
Merge clusters to ensure that
the percentage of transaction operations accessing a single cluster (denoted as utilization) should be higher than a threshold (i.e., maintaining load-balance)
the cost of cross-cluster access should be lower than a threshold (i.e., minimizing cross-cluster transactions).
Assign each cluster with the number of cores proportional to their utilization.
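A rough sketch of the Partition-Merge idea above, under simplifying assumptions (single-threaded; only the utilization criterion is modeled; the greedy pair choice is my own simplification, not ACC's actual algorithm):

```python
from collections import Counter

def utilization(clusters, txns):
    # Fraction of operations that fall in each transaction's dominant cluster.
    home = {r: i for i, c in enumerate(clusters) for r in c}
    single = total = 0
    for ops in txns:
        counts = Counter(home[r] for r in ops)
        single += max(counts.values())
        total += len(ops)
    return single / total

def merge_until(clusters, txns, threshold=0.9):
    # Greedily merge the pair of clusters shared by the most transactions
    # until utilization (load-balance proxy) exceeds the threshold.
    clusters = [set(c) for c in clusters]
    while len(clusters) > 1 and utilization(clusters, txns) < threshold:
        home = {r: i for i, c in enumerate(clusters) for r in c}
        pair_traffic = Counter()
        for ops in txns:
            touched = sorted({home[r] for r in ops})
            for i in range(len(touched)):
                for j in range(i + 1, len(touched)):
                    pair_traffic[(touched[i], touched[j])] += 1
        if not pair_traffic:
            break  # all transactions already single-cluster
        (a, b), _ = pair_traffic.most_common(1)[0]
        clusters[a] |= clusters[b]
        del clusters[b]
    return clusters
```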
How to model the characteristics of different workloads?
the ratio of read operations per transaction (ReadRatio)
the average number of operations issued by a transaction (TransLen)
the possibility of concurrent transactions reading/writing the same record (RecContention)
the cost of cross-cluster transactions (CrossCost / PartitionConflicts)
1~3 are critical to PartCC, OCC, and 2PL
4 is critical to PartCC. (Determines the applicability)
How to select different protocols?
Central coordinator: Each transaction worker concurrently collects workload statistics and reports them to a central coordinator periodically. The coordinator computes features of the current workload for each cluster and uses the predictive model to determine which protocol should be used for this cluster. (could be a bottleneck for latency)
Move (partial) computation to workers. Emulate a uniform and separate process for all workers using sampled transactions or operations.
Each worker samples operations and repeats (mark and detect phases) to check how many other workers have marked the records of the sampled operations to estimate the contention.
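The mark-and-detect sampling can be sketched as a single-shot toy (real ACC workers do this concurrently and periodically; all names here are assumptions):

```python
def estimate_contention(samples_by_worker):
    # samples_by_worker: worker id -> set of sampled record ids (mark phase)
    marks = {}
    for w, recs in samples_by_worker.items():
        for r in recs:
            marks.setdefault(r, set()).add(w)
    # Detect phase: for each worker, the fraction of its sampled records
    # that were also marked by some other worker estimates contention.
    est = {}
    for w, recs in samples_by_worker.items():
        if not recs:
            est[w] = 0.0
            continue
        contended = sum(1 for r in recs if len(marks[r]) > 1)
        est[w] = contended / len(recs)
    return est
```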
How to avoid overhead of coordinating conflicts across protocols (one record can be accessed by different protocols)?
Data-oriented mixed CC: lets a single protocol process all concurrent read/write operations to a portion of records (a cluster). (Cross-cluster transactions may use different CC protocols when accessing different clusters.)
Preprocess: acquire partition locks for PartCC
Execution: uses the protocol managing the record to process it.
OCC: reads the timestamp and the value of this record using the logic of OCC
2PL: acquires a lock before reading or writing the record
Validation: validate the records managed by OCC
Commit: applies all write and release locks. (Release partition locks for PartCC, record locks for 2PL, and write locks for OCC.)
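A toy sketch of the per-record dispatch in the execution/validation/commit steps above (2PL No-Wait reads and OCC reads only; class and field names are illustrative, not from the paper):

```python
class MixedCC:
    """Toy data-oriented mixed CC: each cluster owns one protocol, and a
    cross-cluster transaction switches logic per record it touches."""

    def __init__(self, cluster_of, protocol_of):
        self.cluster_of = cluster_of    # record id -> cluster id
        self.protocol_of = protocol_of  # cluster id -> '2PL' or 'OCC'
        self.locks = set()              # records currently locked (2PL)
        self.versions = {}              # record id -> last commit version

    def read(self, txn, rec):
        proto = self.protocol_of[self.cluster_of[rec]]
        if proto == '2PL':              # No-Wait: abort on conflict
            if rec in self.locks:
                raise RuntimeError('abort: lock conflict')
            self.locks.add(rec)
            txn.setdefault('locks', []).append(rec)
        else:                           # OCC: remember the version seen
            txn.setdefault('reads', {})[rec] = self.versions.get(rec, 0)

    def commit(self, txn):
        # Validation: OCC reads must still see the same version.
        for rec, seen in txn.get('reads', {}).items():
            if self.versions.get(rec, 0) != seen:
                raise RuntimeError('abort: OCC validation failed')
        # Commit: release 2PL locks.
        for rec in txn.get('locks', []):
            self.locks.discard(rec)

# A cross-cluster transaction reading one 2PL record and one OCC record:
cc = MixedCC({'a': 0, 'b': 1}, {0: '2PL', 1: 'OCC'})
txn = {}
cc.read(txn, 'a')   # locks 'a' under 2PL
cc.read(txn, 'b')   # records the version of 'b' under OCC
cc.commit(txn)      # validates 'b', releases the lock on 'a'
```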
Env: Based on Doppel (open-src MM OLTP DB), PartCC of H-Store, OCC from Silo, and SS2PL No-Wait based on VLL.
Baseline: single protocol; hybrid of 2PL and OCC (uses locks to protect highly conflicted records and uses validation for the other records)
Test 1: mixed CC under a mix of partitionable (no cross-partition transactions) and non-partitionable (100% cross-partition transactions) workloads. For clusters with no cross-partition transactions, gradually increase the percentage of cross-cluster transactions from 0% to 50%.
ACC merges the non-partitionable warehouses into a single cluster and makes each partitionable warehouse its own cluster.
At first, ACC used PartCC to process well-partitionable workload and 2PL to process the rest of the highly conflicted and non-partitionable workload.
With more cross-cluster transactions, ACC adopts 2PL for the whole system.
For 2PL, lock contention across cores is reduced due to partitioning.
Test 2: evaluate ACC’s adaptivity in response to selecting a single protocol according to workload variation.
If PartCC is used, the whole store is partitioned such that each CPU core owns a partition (i.e., a cluster).
If OCC or 2PL is used, the partitioned stores are merged back into a single cluster (i.e., a shared store).
(non-partitionable -> 100% cross-partition transactions + skew partition access, low-conflict -> read-only) workload -> OCC
(partitionable, high-conflict -> write-intensive) workload -> PartCC, where it needs to partition the whole store first. ACC continues to process transaction requests during the partitioning and switches to PartCC when partitioning is done.
non-partitionable workload -> more partition conflicts are introduced in the workload; thus, ACC merges the partitioned store and switches from PartCC to 2PL.
Overall: The dip in performance for ACC during workload shifts is due to lag while using the prior protocol and a short period where workers reorganize clusters and indices. The throughput improvement of ACC over PartCC, OCC, and 2PL can be up to 4.5x, 3.5x and 2.3x respectively, and ACC can achieve at least 95% throughput of the optimal protocol when it uses this protocol, demonstrating that the overhead of ACC is minimal.
ACC: data clustering + workload modeling + mixing protocols
Hybrid approach of 2PL and OCC (denoted as Hybrid), which uses locks to protect highly conflicted records and uses validation for the other records.
Dixin Tang's 2nd work on adaptive CC
Different CC throughput:
Partitionable: PartCC > 2PL > OCC. A transaction is highly likely to access only one partition. So use coarse-grained (partition) locking for partitionable workloads.
Non-partitionable + Read-only (+ skew): OCC (MVCC) > 2PL > PartCC
Non-partitionable + Write-heavy + Non-skew: OCC ~ 2PL (?)
Non-partitionable + Write-heavy + Skew: 2PL > OCC ~ PartCC
Perform well under partitionable workloads
Cross-partition transactions hurt the performance
Elasticity instead of switching?
Two Challenges of Mixed Concurrency Control
Mixing: How to partition a workload and mix multiple concurrency control protocols efficiently
Switching: How to reconfigure a protocol when the workload changes
Some researchers argued that timestamp ordering benefits read-dominated transactions while 2PL benefits transactions with more writes than reads, and proposed using timestamp ordering for some transactions and 2PL for others. Bernstein, 1987.
Different CC scalability
2PL > MVCC > Timestamp > OCC
2PL: scales and has higher throughput.
MVCC: Timestamp allocation limits scalability.
OCC: Memory copy hurts performance.
Medium contention (50%:50%)
How to apply different CCs?
Partition the database and use different CC protocols to access different clusters.
CPU cores need to be assigned initially or dynamically.
Different CPUs may need to switch to different CC due to workload variation. (New Overhead)
Fixed CC for all data
Fixed CC + Partition
Some common tradeoffs, pros & cons, challenges & sols
Device level, optimization level (and the level where it is applied)
Storage hierarchy of different levels
CPU cache: L1/L2/L3, data/instruction, SRAM
Main Memory: DRAM
Disk: HDD vs. SSD
Network: NAS, cloud
Potential: NVM, HBM, Smart SSD
Parallelism of different levels
Instruction: FPGA/ASIC acceleration, deep pipelining, ILP (branch prediction), VLIW
Data: SIMD (SSE -> AVX), vectorization
Thread: concurrency, multicore, SIMT (GPU)
Device: hybrid (CPU + GPU / XPU / FPGA / ASIC, co-processor), NUMA
Distributed: RDMA, distributed ML training (parallel on data/model -> pipeline)
Task: scheduling, HTAP
Bandwidth (IO) (internal & external)
CPU ~ Disk: SATA (HDD), SAS -> PCIe -> NVMe (SSD)
CPU ~ GPU: PCIe, scheduler (data balancing)
GPU ~ MM: HBM, NVLink, GPUDirect
Computation: accelerators (domain-specific processors)
Latency: direct access to different levels (RDMA, smart NIC)
Memory capacity: compression, multi-device, hybrid arch, in/shared-memory, virtualization -> containerization
Query optimization: compilers (static vs. dynamic analysis), DSL
ML training: matrix operator, computation graph optimization (TASO), TVM
Data models: different ways of storage
Structure: relational, document, graph
Encoding: compression and decompression (arch, weight), row store vs. column store, quantized (lower accuracy requirement), hybrid accuracy
Access: serial (disk) vs. random (DRAM, hash)
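The row-vs.-column tradeoff above in a few lines (illustrative toy data): a column layout keeps one attribute contiguous, which favors scans, while a row layout keeps one record contiguous, which favors point access.

```python
# Same table in two layouts.
rows = [(1, 'ssd', 120.0), (2, 'hdd', 45.0), (3, 'nvm', 300.0)]
cols = {
    'id':    [1, 2, 3],
    'name':  ['ssd', 'hdd', 'nvm'],
    'price': [120.0, 45.0, 300.0],
}

# An analytical scan of one attribute reads contiguous data column-wise...
total_col = sum(cols['price'])
# ...but must skip over the other attributes row-wise.
total_row = sum(r[2] for r in rows)
assert total_col == total_row == 465.0
```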
Task (minimize runtime): streaming vs. batching (asynchronous vs. synchronous), workload specific balancing (read/write intensive, high/low contention, minimize idle overlap), configurable isolation levels
Device: data transfer cost between CPU, MM, GPU, etc., dynamic device for different workloads (read/writing-intensive), concurrency control
Distributed: data, logging shipping, load balancing, synchronization (consistency)
Lazy evaluation, loading, initializing
Performance: speed (see key bottleneck)
Reliability: fault-tolerance (networks, time, device)
Cost: short/long-term (considering tech development)
Maintainability: automatic failover, logging
ACID: atomicity, consistency, isolation, durability
Flexibility vs. performance (vs. cost)
Different workloads: performance between read-intensive and write-intensive
IO vs. concurrency (vs. latency)
Capacity vs. performance: capacity vs. cache & materialization
Normalization vs. denormalization
Availability vs. consistency