1 - Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service

https://www.usenix.org/conference/atc22/presentation/elhemali

A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service

Amazon DynamoDB is a NoSQL cloud database service that supports fast and predictable performance at any scale.

DynamoDB provides:

  • For regular tables, the availability SLA is 99.99%, which allows roughly 52.6 minutes of downtime per year.
  • For global tables, the SLA is 99.999% (roughly 5.26 minutes per year).

    • Global tables replicate a table across multiple AWS Regions, providing a higher level of durability and availability.
  • 1 year = 365 days = 8760 hours
  • 99.9%: 8760 × 0.1% = 8760 × 0.001 = 8.76 hours of allowed downtime
  • 99.99%: 8760 × 0.0001 = 0.876 hours = 0.876 × 60 ≈ 52.6 minutes
  • 99.999%: 8760 × 0.00001 = 0.0876 hours = 0.0876 × 60 ≈ 5.26 minutes

Architecture

  • DynamoDB uses a table model: every item has a primary key composed of a partition key and an optional sort key, and secondary indexes are also supported.
  • To handle a table's throughput and storage requirements, a DynamoDB table is divided into multiple partitions. Each partition hosts a disjoint and contiguous part of the table's key range. Splitting a table into partitions makes resource management more efficient: each partition independently serves its portion of the data, which spreads requests during load peaks and improves overall performance and reliability. Because each partition stores and serves a specific key range, reads and writes against different key ranges can proceed in parallel, improving query latency and responsiveness.
  • When a table's provisioned throughput changes or a partition is split into child partitions, the throughput allocated to each partition is adjusted.

    • When a partition is split because of its size, the parent partition's allocated throughput is divided evenly among the child partitions.
    • When a partition is split because of throughput, the new child partitions are allocated throughput according to the table's provisioned throughput.

      For example, assume a single partition can accommodate at most 1000 WCUs (Write Capacity Units) or RCUs (Read Capacity Units).

      • When a table is created with 3200 WCUs, DynamoDB creates four partitions, each allocated 800 WCUs.
      • If the table's provisioned throughput is increased to 3600 WCUs, each partition's capacity grows to 900 WCUs.
      • If the provisioned throughput is increased to 6000 WCUs, the original partitions are split into eight child partitions, each allocated 750 WCUs.
      • Conversely, if the table's provisioned throughput is decreased to 5000 WCUs, each partition's capacity is reduced to 625 WCUs. (A rough sketch of this allocation arithmetic appears after this list.)

    Distributing throughput uniformly across partitions assumes that the application accesses keys in the table uniformly and that splitting a partition by size also splits its load evenly. In practice, however, application workloads are often non-uniform, both over time and across key ranges. How is this handled?

    • Bursting
    • Adaptive capacity
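
As a purely illustrative sketch of the uniform allocation rule from the WCU example above (the 1000-WCU per-partition cap and the even-split policy come from that example; this is not DynamoDB's actual admission-control code):

PARTITION_MAX_WCU = 1000  # assumed per-partition cap, taken from the example above

def initial_partitions(provisioned_wcu):
    """Create the smallest number of partitions that keeps each one under the cap."""
    n = -(-provisioned_wcu // PARTITION_MAX_WCU)  # ceiling division
    return [provisioned_wcu / n] * n

def on_throughput_change(partitions, new_wcu):
    """Re-divide the table's provisioned throughput evenly, splitting partitions if needed."""
    n = len(partitions)
    while new_wcu / n > PARTITION_MAX_WCU:  # split for throughput: double the partition count
        n *= 2
    return [new_wcu / n] * n

parts = initial_partitions(3200)           # 4 partitions x 800 WCUs
parts = on_throughput_change(parts, 3600)  # 4 partitions x 900 WCUs
parts = on_throughput_change(parts, 6000)  # 8 partitions x 750 WCUs
parts = on_throughput_change(parts, 5000)  # 8 partitions x 625 WCUs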

3 - Architecture of a Database System

1. Introduction

2. Process Models

3. Parallel Architecture: Process and Memory Coordination

4. Relational Query Processor

Query Parsing and Authorization

  1. check that the query is correctly specified,
  2. canonicalize table names into fully qualified names of the form server.database.schema.table (consulting the catalog manager),
  3. convert the query into the internal format used by the optimizer,
  4. verify that the user is authorized to execute the query

Query Rewrite

The rewriter in many commercial systems is a logical component whose actual implementation is in either the later phases of query parsing or the early phases of query optimization. In DB2, for example, the rewriter is a stand-alone component, whereas in SQL Server the query rewriting is done as an early phase of the Query Optimizer. Its main responsibilities are:

  1. View expansion
  2. Constant arithmetic evaluation, e.g., R.x < 10+2+R.y is rewritten as R.x < 12+R.y.
  3. Logical rewriting of predicates, e.g., NOT Emp.Salary > 1000000 is rewritten as Emp.Salary <= 1000000.
  4. Semantic optimization
SELECT Emp.name, Emp.salary FROM Emp, Dept
WHERE Emp.deptno = Dept.dno

Such queries can be rewritten to remove the Dept table (assuming Emp.deptno is constrained to be non-null), and hence the join. Again, such seemingly implausible scenarios often arise naturally via views.

  5. Subquery flattening and other heuristic rewrites

Predicate transitivity, for example, can allow predicates to be copied across subqueries [52].

Query Optimizer

Although Selinger’s paper is widely considered the “bible” of query optimization, it was preliminary research. All systems extend this work significantly in a number of dimensions. Among the main extensions are:

Plan space

The System R optimizer constrained its plan space somewhat by focusing only on “left-deep” query plans (where the right-hand input to a join must be a base table), and by “postponing Cartesian products” (ensuring that Cartesian products appear only after all joins in a dataflow). In commercial systems today, it is well known that “bushy” trees (with nested right-hand inputs) and early use of Cartesian products can be useful in some cases. Hence both options are considered under some circumstances by most systems.

Selectivity estimation

The selectivity estimation techniques in the Selinger paper are based on simple table and index cardinalities and are naïve by the standards of current generation systems. Most systems today analyze and summarize the distributions of values in attributes via histograms and other summary statistics.
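
As a toy sketch of histogram-based estimation (an equi-depth histogram plus a crude "fraction of rows below a value" estimate; the helper names are invented, and real systems keep much richer statistics):

import bisect

def equi_depth_histogram(values, num_buckets=10):
    """Build bucket boundaries so each bucket holds roughly the same number of rows."""
    ordered = sorted(values)
    step = len(ordered) / num_buckets
    return [ordered[int(i * step)] for i in range(1, num_buckets)], len(ordered)

def estimate_selectivity(histogram, predicate_value):
    """Rough estimate of the fraction of rows with value < predicate_value."""
    boundaries, _ = histogram
    full_buckets = bisect.bisect_left(boundaries, predicate_value)
    return full_buckets / (len(boundaries) + 1)

hist = equi_depth_histogram([1, 1, 2, 3, 5, 8, 13, 21, 34, 55] * 100)
print(estimate_selectivity(hist, 10))   # rough fraction of rows with value < 10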

Search Algorithms

Some commercial systems, notably those of Microsoft and Tandem, discard Selinger’s dynamic programming optimization approach in favor of a goal-directed “top-down” search scheme based on the techniques used in Cascades [25]. Top-down search can in some instances lower the number of plans considered by an optimizer [82], but can also have the negative effect of increasing optimizer memory consumption. If practical success is an indication of quality, then the choice between top-down search and dynamic programming is irrelevant. Each has been shown to work well in state-of-the-art optimizers, and both still have runtimes and memory requirements that are, unfortunately, exponential in the number of tables in a query.

An educational exercise is to examine the query “optimizer” of the open-source MySQL engine, which at last check was entirely heuristic and relied mostly on exploiting indexes and key/foreign-key constraints.

Parallelism

Auto-Tuning

A variety of ongoing industrial research efforts attempt to improve the ability of a DBMS to make tuning decisions automatically. Some of these techniques are based on collecting a query workload, and then using the optimizer to find the plan costs via various “what-if” analyses. What if, for example, other indexes had existed or the data had been laid out differently? An optimizer needs to be adjusted somewhat to support this activity efficiently, as described by Chaudhuri and Narasayya [12]. The Learning Optimizer (LEO) work of Markl et al. [57] is also in this vein.

Query Executor

Most modern query executors employ the iterator model that was used in the earliest relational systems. Iterators are most simply described in an object-oriented fashion.

class iterator {
    iterator &inputs[];
    void init();
    tuple get_next();
    void close();
}
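
To make the iterator model concrete, here is a toy Python sketch of a scan and a filter composed by chaining get_next() calls (the class and variable names are invented for illustration; real executors work with pre-allocated tuple descriptors and avoid per-row allocation):

class Scan:
    """Leaf iterator: produces rows from an in-memory table."""
    def __init__(self, rows):
        self.rows = rows
    def init(self):
        self.pos = 0
    def get_next(self):
        if self.pos >= len(self.rows):
            return None                      # end of stream
        row = self.rows[self.pos]
        self.pos += 1
        return row
    def close(self):
        pass

class Filter:
    """Applies a predicate to rows pulled from its input iterator."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def init(self):
        self.child.init()
    def get_next(self):
        while True:
            row = self.child.get_next()
            if row is None or self.predicate(row):
                return row
    def close(self):
        self.child.close()

plan = Filter(Scan([("alice", 120), ("bob", 80)]), lambda r: r[1] > 100)
plan.init()
while (row := plan.get_next()) is not None:
    print(row)                               # ('alice', 120)
plan.close()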

Where’s the Data?

In practice, each iterator is pre-allocated a fixed number of tuple descriptors, one for each of its inputs, and one for its output. A tuple descriptor is typically an array of column references, where each column reference is composed of a reference to a tuple somewhere else in memory, and a column offset in that tuple. The basic iterator superclass logic never allocates memory dynamically. This raises the question of where the actual tuples being referenced are stored in memory.

Access Methods

Access methods are the routines that manage access to the various disk-based data structures that the system supports. These typically include unordered files (“heaps”) and various kinds of indexes. All major commercial systems implement heaps and B+-tree indexes. Both Oracle and PostgreSQL support hash indexes for equality lookups. Some systems are beginning to introduce rudimentary support for multi-dimensional indexes such as R-trees [32]. PostgreSQL supports an extensible index called the Generalized Search Tree (GiST) [39], and currently uses it to implement R-trees for multi-dimensional data, and RD-trees for text data [40]. Systems that target read-mostly data warehousing workloads often include specialized bitmap variants of indexes as well [65].

Standard Practice

In the open source arena, PostgreSQL has a reasonably sophisticated query processor with a traditional cost-based optimizer, a broad set of execution algorithms, and a number of extensibility features not found in commercial products. MySQL’s query processor is far simpler, built around nested-loop joins over indices. The MySQL query optimizer focuses on analyzing queries to make sure that common operations are lightweight and efficient — particularly

  • key/foreign-key joins,
  • outer-join-to-join rewrites,
  • and queries that ask for only the first few rows of the result set.

It is instructive to read through the MySQL manual and query processing code and compare it to the more involved traditional designs, keeping in mind the high adoption rate of MySQL in practice, and the tasks at which it seems to excel.

5. Storage Manager

Two ways for the storage manager to interact with storage:

  1. Deal directly with the underlying block device (“raw” device access)
  2. Go through the file system provided by the OS

Spatial Control

Sequential bandwidth to and from disk is between 10 and 100 times faster than random access, and this ratio is increasing.

The best way for the DBMS to control spatial locality of its data is to store the data directly to the “raw” disk device and avoid the file system entirely. This works because raw device addresses typically correspond closely to physical proximity of storage locations.

An alternative to raw disk access is for the DBMS to create a very large file in the OS file system, and to manage positioning of data as offsets in that file. The file is essentially treated as a linear array of disk-resident pages. If the file system is being used rather than raw device access, special interfaces may be required to write pages of a different size than that of the file system; the POSIX mmap/msync calls, for example, provide such support.
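
As a small illustration of the "large file treated as a linear array of pages" idea, using Python's mmap module as a stand-in (the file name and page size here are arbitrary; a real DBMS would use its own buffer manager and I/O paths):

import mmap, os

PAGE_SIZE = 8192                      # assumed database page size
NUM_PAGES = 1024
PATH = "dbfile.dat"                   # illustrative file name

# Pre-allocate one large file and treat it as a linear array of pages.
with open(PATH, "wb") as f:
    f.truncate(PAGE_SIZE * NUM_PAGES)

fd = os.open(PATH, os.O_RDWR)
buf = mmap.mmap(fd, PAGE_SIZE * NUM_PAGES)

page_no = 7                            # "page 7" lives at a fixed offset in the file
buf[page_no * PAGE_SIZE:(page_no + 1) * PAGE_SIZE] = b"x" * PAGE_SIZE
buf.flush(page_no * PAGE_SIZE, PAGE_SIZE)   # msync-style flush of just that page
buf.close()
os.close(fd)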

Temporal Control: Buffering

The DBMS contains critical logic that reasons about when to write blocks to disk. Most OS file systems also provide built-in I/O buffering mechanisms to decide when to do reads and writes of file blocks. If the DBMS uses standard file system interfaces for writing, the OS buffering can confound the intention of the DBMS logic by silently postponing or reordering writes. This can cause major problems for the DBMS.

The second set of problems with OS buffering concern performance, but have no implications on correctness. Modern OS file systems typically have some built-in support for read-ahead (speculative reads) and write-behind (delayed, batched writes). These are often poorly suited to DBMS access patterns.

The final performance issues are “double buffering” and the high CPU overhead of memory copies. Copying data in memory can be a serious bottleneck. Copies contribute latency, consume CPU cycles, and can flood the CPU data cache.

This fact is often a surprise to people who have not operated or implemented a database system, and assume that main-memory operations are “free” compared to disk I/O. But in practice, throughput in a well-tuned transaction processing DBMS is typically not I/O-bound. This is achieved in high-end installations by purchasing sufficient disks and RAM so that repeated page requests are absorbed by the buffer pool, and disk I/Os are shared across the disk arms at a rate that can feed the data appetite of all the processors in the system. Once this kind of “system balance” is achieved, I/O latencies cease to be the primary system throughput bottleneck, and the remaining main-memory bottlenecks become the limiting factors in the system. Memory copies are becoming a dominant bottleneck in computer architectures: this is due to the gap in performance evolution between raw CPU cycles per second per dollar (which follows Moore’s law) and RAM access speed (which trails Moore’s law significantly) [67].

Buffer Management

In order to provide efficient access to database pages, every DBMS implements a large shared buffer pool in its own memory space. The buffer pool is organized as an array of frames, where each frame is a region of memory the size of a database disk block. Blocks are copied into the buffer pool from disk without format change, manipulated in memory in this native format, and later written back. This translation-free approach avoids CPU bottlenecks in “marshalling” and “unmarshalling” data to/from disk; perhaps more importantly, fixed-sized frames sidestep the memory-management complexities of external fragmentation and compaction that generic techniques cause.

For full table scan access patterns, the recency of reference is a poor predictor of the probability of future reference, so OS page replacement schemes like LRU and CLOCK were well-known to perform poorly for many database access patterns [86]. A variety of alternative schemes were proposed, including some that attempted to tune the replacement strategy via query execution plan information [15]. Today, most systems use simple enhancements to LRU schemes to account for the case of full table scans.

  • One that appears in the research literature and has been implemented in commercial systems is LRU-2 [64].
  • Another scheme used in commercial systems is to have the replacement policy depend on the page type: e.g., the root of a B+-tree might be replaced with a different strategy than a page in a heap file. This is reminiscent of Reiter’s Domain Separation scheme [15, 75].
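
As a rough sketch (not any particular system's code) of a fixed-frame buffer pool with a CLOCK-style replacement policy, where read_page/write_page stand in for the storage layer and pinning/latching are omitted:

class BufferPool:
    """Fixed array of frames with CLOCK ("second chance") replacement."""
    def __init__(self, num_frames, read_page, write_page):
        self.frames = [None] * num_frames     # each frame: (page_id, data, dirty, ref_bit)
        self.page_table = {}                  # page_id -> frame index
        self.hand = 0
        self.read_page, self.write_page = read_page, write_page

    def get_page(self, page_id):
        if page_id in self.page_table:        # hit: set the reference bit and return
            i = self.page_table[page_id]
            pid, data, dirty, _ = self.frames[i]
            self.frames[i] = (pid, data, dirty, True)
            return data
        i = self._find_victim()               # miss: pick a frame to reuse
        if self.frames[i] is not None:
            vpid, vdata, vdirty, _ = self.frames[i]
            if vdirty:
                self.write_page(vpid, vdata)  # write back a dirty page before eviction
            del self.page_table[vpid]
        data = self.read_page(page_id)
        self.frames[i] = (page_id, data, False, True)
        self.page_table[page_id] = i
        return data

    def _find_victim(self):
        while True:                           # sweep, clearing reference bits, until one is clear
            i, frame = self.hand, self.frames[self.hand]
            self.hand = (self.hand + 1) % len(self.frames)
            if frame is None:
                return i
            pid, data, dirty, ref = frame
            if not ref:
                return i
            self.frames[i] = (pid, data, dirty, False)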

6. Transaction: Concurrent Control and Recovery

ACID

Roughly speaking, modern DBMSs implement isolation via a locking protocol.

  • Durability is typically implemented via logging and recovery.
  • Isolation and Atomicity are guaranteed by a combination of locking (to prevent visibility of transient database states), and logging (to ensure correctness of data that is visible).
  • Consistency is managed by runtime checks in the query executor: if a transaction’s actions will violate a SQL integrity constraint, the transaction is aborted and an error code returned.

Serializability

Serializability is enforced by the DBMS concurrency control model. There are three broad techniques of concurrency control enforcement. These are well-described in textbooks and early survey papers [7], but we very briefly review them here:

  • Strict two-phase locking (2PL): Transactions acquire a shared lock on every data record before reading it, and an exclusive lock on every data item before writing it. All locks are held until the end of the transaction, at which time they are all released atomically. A transaction blocks on a wait-queue while waiting to acquire a lock.
  • Multi-Version Concurrency Control (MVCC): Transactions do not hold locks, but instead are guaranteed a consistent view of the database state at some time in the past, even if rows have changed since that fixed point in time.
  • Optimistic Concurrency Control (OCC): Multiple transactions are allowed to read and update an item without blocking. Instead, transactions maintain histories of their reads and writes, and before committing, a transaction checks those histories for isolation conflicts that may have occurred; if any are found, one of the conflicting transactions is rolled back.

Most commercial relational DBMSs implement full serializability via 2PL. The lock manager is the code module responsible for providing the facilities for 2PL. In order to reduce locking and lock conflicts some DBMSs support MVCC or OCC, typically as an add-on to 2PL. In an MVCC model, read locks are not needed, but this is often implemented at the expense of not providing full serializability, as we will discuss shortly in Section 4.2.1. To avoid blocking writes behind reads, the write is allowed to proceed after the previous version of the row is either saved, or guaranteed to be quickly obtainable otherwise. The in-flight read transactions continue to use the previous row value as though it were locked and prevented from being changed. In commercial MVCC implementations, this stable read value is defined to be either the value at the start of the read transaction or the value at the start of that transaction’s most recent SQL statement. While OCC avoids waiting on locks, it can result in higher penalties during true conflicts between transactions. In dealing with conflicts across transactions, OCC is like 2PL except that it converts what would be lock-waits in 2PL into transaction rollbacks. In scenarios where conflicts are uncommon, OCC performs very well, avoiding overly conservative wait time. With frequent conflicts, however, excessive rollbacks and retries negatively impact performance and make it a poor choice.
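
To illustrate the OCC idea of validating at commit time, a minimal single-threaded sketch (illustrative only; real implementations track read/write sets per version, detect write-write conflicts, and make validation plus installation atomic):

class OCCTransaction:
    """store: shared dict mapping key -> (version, value)."""
    def __init__(self, store):
        self.store = store
        self.read_set = {}       # key -> version observed when first read
        self.write_set = {}      # key -> new value (buffered until commit)

    def read(self, key):
        if key in self.write_set:
            return self.write_set[key]
        version, value = self.store[key]
        self.read_set.setdefault(key, version)
        return value

    def write(self, key, value):
        self.write_set[key] = value

    def commit(self):
        # Validation: if anything we read has changed since we read it, abort.
        for key, seen_version in self.read_set.items():
            if self.store[key][0] != seen_version:
                return False     # conflict -> rollback; the caller may retry
        # Write phase: install buffered writes with bumped versions.
        for key, value in self.write_set.items():
            version, _ = self.store.get(key, (0, None))
            self.store[key] = (version + 1, value)
        return True

store = {"x": (1, 10)}
t = OCCTransaction(store)
t.write("x", t.read("x") + 1)
print(t.commit())                # True unless some other transaction changed x in between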

Weak Consistency

ANSI SQL - Isolation level

Read Uncommitted

A transaction may read any version of data, committed or not. This is achieved in a locking implementation by read requests proceeding without acquiring any locks.

Read Committed

A transaction may read any committed version of data. Repeated reads of an object may result in different (committed) versions. This is achieved by read requests acquiring a read lock before accessing an object, and unlocking it immediately after access.

Repeatable Read

A transaction will read only one version of committed data; once the transaction reads an object, it will always read the same version of that object. This is achieved by read requests acquiring a read lock before accessing an object, and holding the lock until end-of-transaction.

Serializable

Full serializable access is guaranteed.

Vendor provided levels

Cursor Stability

This level is intended to solve the “lost update” problem of READ COMMITTED. A transaction in CURSOR STABILITY mode holds a lock on the most recently read item on a query cursor; the lock is automatically dropped when the cursor is moved (e.g., via another FETCH) or the transaction terminates. CURSOR STABILITY allows the transaction to do read–think–write sequences on individual items without intervening updates from other transactions.

Snapshot Isolation

A transaction running in SNAPSHOT ISOLATION mode operates on a version of the database as it existed at the time the transaction began; subsequent updates by other transactions are invisible to the transaction. This is one of the major uses of MVCC in production database systems. When the transaction starts, it gets a unique start-timestamp from a monotonically increasing counter; when it commits it gets a unique end-timestamp from the counter. The transaction commits only if no other transaction with an overlapping start/end-timestamp pair wrote data that this transaction also wrote (“first committer wins”). This isolation mode depends upon a multi-version concurrency implementation, rather than locking.
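
A minimal single-threaded sketch of snapshot reads plus the first-committer-wins check described above (timestamps come from one counter; this is not a production MVCC design):

class SnapshotDB:
    def __init__(self):
        self.clock = 0
        self.versions = {}       # key -> list of (commit_ts, value), in commit order
        self.commit_log = []     # list of (commit_ts, set of keys written)

    def begin(self):
        self.clock += 1
        return {"start_ts": self.clock, "writes": {}}

    def read(self, txn, key):
        if key in txn["writes"]:
            return txn["writes"][key]
        # See only versions committed before this transaction's snapshot.
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= txn["start_ts"]:
                return value
        return None

    def write(self, txn, key, value):
        txn["writes"][key] = value

    def commit(self, txn):
        # First committer wins: abort if a transaction that committed after our
        # snapshot wrote any key that we also wrote.
        for commit_ts, keys in self.commit_log:
            if commit_ts > txn["start_ts"] and keys & txn["writes"].keys():
                return False
        self.clock += 1
        for key, value in txn["writes"].items():
            self.versions.setdefault(key, []).append((self.clock, value))
        self.commit_log.append((self.clock, set(txn["writes"])))
        return True

db = SnapshotDB()
t1, t2 = db.begin(), db.begin()
db.write(t1, "x", 1)
db.write(t2, "x", 2)
print(db.commit(t1), db.commit(t2))   # True False: the second committer aborts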

Read Consistency

This is an MVCC scheme defined by Oracle; it is subtly different from SNAPSHOT ISOLATION. In the Oracle scheme, each SQL statement (of which there may be many in a single transaction) sees the most recently committed values as of the start of the statement. For statements that fetch from cursors, the cursor set is based on the values as of the time it is opened. This is implemented by maintaining multiple logical versions of individual tuples, with a single transaction possibly referencing multiple versions of a single tuple. Rather than storing each version that might be needed, Oracle stores only the most recent version. If an older version is needed, it produces the older version by taking the current version and “rolling it back” by applying undo log records as needed. Modifications are maintained via long-term write locks, so when two transactions want to write the same object, the first writer “wins” and the second writer must wait for the transaction completion of the first writer before its write proceeds. By contrast, in SNAPSHOT ISOLATION the first committer “wins” rather than the first writer.

4 - Write-Optimized and High-Performance Hashing Index Scheme for Persistent Memory

Abstract

Because of data-consistency requirements and the hardware limitations of NVM (non-volatile memory), traditional indexing techniques designed for DRAM are inefficient on NVM. To index data on NVM efficiently, the paper proposes a write-optimized, high-performance hashing index scheme called level hashing, which provides:

  • write-optimized
  • high-performance hashing
  • low-overhead consistency guarantee
  • and cost-efficient resizing.
  • To guarantee consistency with low overhead:

    • level hashing leverages log-free consistency schemes for insertion, deletion, and resizing operations, and an opportunistic log-free scheme for update operation.
  • To cost-efficiently resize this hash table,

    • level hashing leverages an in-place resizing scheme that only needs to rehash 1/3 of buckets instead of the entire table, thus significantly reducing the number of rehashed buckets and improving the resizing performance.
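
As a very rough, non-persistent sketch of the two-level bucket layout as I understand it (each key hashes to two top-level buckets, each of which shares a standby bottom-level bucket); slot management details, item movement, the in-place resizing, and all of the log-free consistency machinery are omitted:

import hashlib

SLOTS_PER_BUCKET = 4

def _h(key, seed, n):
    digest = hashlib.blake2b(repr((seed, key)).encode()).digest()
    return int.from_bytes(digest[:8], "little") % n

class TinyLevelHash:
    def __init__(self, top_buckets=8):
        self.top = [[] for _ in range(top_buckets)]          # top level: N buckets
        self.bottom = [[] for _ in range(top_buckets // 2)]  # bottom level: N/2 standby buckets

    def _candidates(self, key):
        n = len(self.top)
        t1, t2 = _h(key, 1, n), _h(key, 2, n)
        # Two top-level candidate buckets, each sharing a standby bottom-level bucket.
        return [self.top[t1], self.top[t2], self.bottom[t1 // 2], self.bottom[t2 // 2]]

    def insert(self, key, value):
        bucket = min(self._candidates(key), key=len)         # pick the least-loaded candidate
        if len(bucket) >= SLOTS_PER_BUCKET:
            return False     # the real scheme tries to move items, then resizes in place
        bucket.append((key, value))
        return True

    def lookup(self, key):
        for bucket in self._candidates(key):
            for k, v in bucket:
                if k == key:
                    return v
        return None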

2. Background and Motivation

2.1 Data Consistency in NVM

  • CPU cache lines
  • CPU and memory controller may reorder memory writes.
  • cache line flush instruction (CLFLUSH for short) / memory fence instruction (MFENCE for short), to ensure the ordering of memory writes
  • atomic memory write of NVM is 8 bytes, which is equal to the memory bus width for 64-bit CPUs
  • Existing techniques, such as logging and copy-on-write (CoW), are used to guarantee consistency of the data whose sizes are larger than an atomic-write size.

    • The logging technique first stores the old data (undo logging) or new data (redo logging) into a log and then updates the old data in place.
    • The CoW first creates a new copy of data and then performs updates on the copy. The pointers that point to the old data are finally modified.
    • Nevertheless, logging and CoW have to write twice for each updated data.
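
To make the "write twice" observation concrete, a toy Python sketch of the two patterns, where persist() is only a placeholder for the cache-line flush and fence (CLFLUSH/MFENCE) that real NVM code would issue:

def persist(obj):
    """Placeholder for a cache-line flush plus memory fence on real NVM."""
    pass

def update_with_undo_log(record, field, new_value, log):
    log.append((id(record), field, record[field]))  # 1) write the old value to the log
    persist(log)                                    # 2) make the log entry durable first
    record[field] = new_value                       # 3) then update the data in place
    persist(record)
    log.clear()                                     # 4) the log entry can now be discarded

def update_with_cow(container, key, field, new_value):
    new_copy = dict(container[key])                 # 1) copy the data
    new_copy[field] = new_value                     # 2) apply the update to the copy
    persist(new_copy)                               # 3) persist the copy
    container[key] = new_copy                       # 4) swing the pointer (an 8-byte atomic write)
    persist(container)

# Either way the update is written twice: old value plus new value (logging),
# or a full copy plus the pointer update (CoW).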

Code

RECIPE: high-performance, concurrent indexes for persistent memory (SOSP 2019)

6 - Evolution of Emacs Lisp

Emacs History

  • Any sufficiently complicated C or Fortran program contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.

Basic Language Design

  • Also like MacLisp, Emacs Lisp was (and still is) a Lisp-2 language [Steele and Gabriel 1993], which means that

    1. the namespaces for functions and ordinary values are separate
    2. and to call a function bound to a variable, a program must use funcall.
    3. Also, symbols can be used as function values.

Symbols and Dynamic Scoping

Like any Lisp, Emacs Lisp has always had a symbol data type: The expression 'emacs yields a symbol named emacs. One way to look at symbols is that they are immutable strings with a fast equality check and usually fast hashing. This makes them suitable for values representing enumerations or options. Another is that symbols can represent names in an Emacs Lisp program. Symbols are thus a key feature to make Emacs Lisp homoiconic, meaning that each Emacs Lisp form can be represented by a data structure called an S-expression that prints out the same as the form.

Richard Stallman chose dynamic scoping to meet extensibility requirements in Emacs [Stallman 1981]. The Emacs code base uses variables to hold configuration options. It would in principle be possible to structure the code base to pass configuration options explicitly, but this would make the code verbose, and require changing many function signatures once a new configuration option gets added to the system.

Using Lexical Binding

Even when lexical binding is enabled, certain variables will continue to be dynamically bound. These are called special variables. Every variable that has been defined with defvar, defcustom or defconst is a special variable (see Defining Variables). All other variables are subject to lexical binding.

Backquote

Quote is usually introduced via ' where the reader translates 'SEXP to (quote SEXP), and this expression evaluates to the value SEXP. For example:

'(a (b c d) c)

evaluates to a list whose first element is the symbol a, the second a list consisting of elements b, c, and d, and whose third element is c. It does so in constant time, by always returning the same value already present in the source code returned by the reader.

Hook

One important aspect of the extensibility Richard Stallman originally conceived for Emacs was the ability to make existing functions run additional code without having to change them, so as to extend their behavior. Emacs supports this at well-defined points called hooks.

Of course, authors do not always have the foresight to place hooks where users need them, so in 1992, the advice.el package was added to Emacs 19, providing a defadvice macro duplicating a design available in MacLisp and Lisp Machines.

(defadvice eval-region (around cl-read activate)
  "Use the reader::read instead of the original read if cl-read-active."
  (with-elisp-eval-region (not cl-read-active)
                          ad-do-it))

Furthermore, the way the defadvice macro gave access to function arguments did not work with lexical scoping. This was solved in late 2012: as a result of a discussion with a user asking how to put several functions into a single variable, Stefan Monnier developed a new package nadvice.el that can not only combine several functions into a single variable, but uses that to provide the same core features as defadvice but with a much simpler design.

For example, the old advice system had special features to control whether a piece of advice should be compiled whereas in the new system there is no need for that because a piece of advice is simply a normal function. Similarly, the old advice system has special primitives to get access to the function’s arguments, whereas in the new system, the function’s original arguments are passed as normal arguments to the piece of advice and can hence be accessed without any special construct. The nadvice.el package was released as part of Emacs 24.4, and has proved popular in the sense that it has largely replaced the old defadvice in new code, but rather few packages that used defadvice have been converted to the new system.

Interactive Functions

(defun forward-symbol (arg)
  "Move point to the next position that is the end of a symbol.
A symbol is any sequence of characters that are in either the word constituent or symbol constituent syntax class.
With prefix argument ARG, do it ARG times if positive, or move backwards ARG times if negative."
  (interactive "^p")
  (if (natnump arg)
      (re-search-forward "\\(\\sw\\|\\s_\\)+" nil 'move arg)
    (while (< arg 0)
      (if (re-search-backward "\\(\\sw\\|\\s_\\)+" nil 'move) (skip-syntax-backward "w_"))
      (setq arg (1+ arg)))))

In the above example, p means that the function accepts a prefix argument: the user can type C-u number prior to invoking the function, and the number will be passed to the function as an argument, in this case as arg. The ^ (a more recent addition) has to do with region selection with a pressed shift key; it may lead to the region being activated.

Buffer-local variables

Variables can be both buffer-local and dynamically bound at the same time:

(let ((buffer-file-name "/home/rms/.emacs"))
  (with-current-buffer "some-other-buffer"
    buffer-file-name))

This example will not return /home/rms/.emacs but the buffer-local value of buffer-file-name in the buffer some-other-buffer instead, because with-current-buffer temporarily changes which buffer is current.

String

Emacs Lisp included support for string objects from the beginning, of course. Originally, they were just byte arrays. In 1992, during the early development of Emacs 19, this basic type was extended by Joseph Arceneaux with support for text properties.

Each char of a string can be annotated with a set of properties, mapping property names to values, where property names can be any symbol. This can be used to carry information such as color and font to use when displaying the various parts of the string.

Along with buffer-local variables (Section 4.10), this is one of the cases where the core Emacs Lisp language has been extended with a feature that specifically caters to the needs of the Emacs text editor, to keep track of rendering information. It is more generally useful, of course, and turns strings into a much fancier datatype than in most other languages.

IO

Rather than follow the usual design based on file or stream objects and primitives like open/read/write/close, Emacs Lisp offers only coarser access to files via two primitive functions insert-file-contents and write-region that transfer file contents between a file and a buffer. So all file manipulation takes place by reading the file into a buffer, performing the desired manipulation in this buffer, and then writing the result back into the file.

Since this approach does not extend naturally to interaction with external processes or remote hosts, these are handled in a completely different way. Primitive functions spawn a sub-process or open up a connection to a remote host and return a so-called process object. These objects behave a bit like streams, with process-send-string corresponding to the traditional write, but where the traditional read is replaced by execution of a callback whenever data is received from the sub-process or the remote host.

Basic Language Implementation

Byte-code Interpreter

Emacs has two execution engines for Emacs Lisp:

  • One is a very simple interpreter written in C operating directly on the S-expression representation of the source code.
  • The other is a byte-code engine, implemented as a primitive byte-code function that interprets its string argument as a sequence of stack-based byte codes to execute. A compiler, written in Emacs Lisp, translates Emacs Lisp code to that byte-code language.

While Emacs Lisp is basically a "run of the mill" programming language, with some specific functions tailored to the particular use of a text editor, this byte-code language is much less standard since it includes many byte codes corresponding to Emacs Lisp primitives such as forward-char, insert, or current-column.

Tail-Call Optimization

Emacs Lisp does not implement proper tail calls [Clinger 1998]: Each function call consumes stack space.

Post-XEmacs period

This section discusses some notable evolutions of the design of Emacs Lisp after 2010 which have not found their way into XEmacs so far.

Lexical Scoping

Dynamic scoping had two main drawbacks in practice:

  • The lack of closures
  • The global visibility of variable names

Finally in 2010 Igor Kuzmin worked on a summer project under the direction of Stefan Monnier, in which he tried to add lexical scoping to the byte-code compiler differently: instead of directly adding support for lexical scoping and closures to the single-pass byte-code compiler code (which required significant changes to the code and was made more complex by the need to fit into a single pass), he implemented a separate pass (itself split into two passes) to perform a traditional closure conversion as a pre-processing step. This freed the closure conversion from the constraints imposed by the design of the single-pass byte-code compiler, making it much easier to implement, and it also significantly reduced the amount of changes needed in the byte-code compiler, thus reducing the risk of introducing regressions.

Two years later, Emacs 24.1 was released with support for lexical scoping based on Miles Bader’s lexbind branch combined with Igor’s closure conversion.

Pattern Matching

While working on the lexical-binding feature, Stefan Monnier grew increasingly frustrated with the shape of the code used to traverse the abstract syntax tree, littered with calls to car and cdr that carry too little information, compared to the kind of code he would write for that in statically-typed functional languages with algebraic datatypes.

So the pcase.el package was born, first released as part of Emacs 24.1, and used extensively in the part of the byte-code compiler providing support for lexical scoping.

CL-Lib

During the development of Emacs 24.3 the issue of better integration of the cl.el package came up again. The main point of pressure was the desire to use cl.el functions within packages bundled with Emacs. Richard Stallman still opposed it, but this time, a compromise was found: replace the cl.el package with a new package cl-lib.el that provides the same facilities, but with names that all use the cl- prefix. This way, the cl-lib.el package does not turn Emacs Lisp into Common Lisp, but instead provides Common Lisp facilities under its own namespace, leaving Emacs Lisp free to evolve in its own way.

7 - Disruptor

The complexities of Concurrency

Concurrent execution of code is about two things:

  • mutual exclusion
  • visibility of change

The costs of CAS

  • A CAS operation is a special machine-code instruction that allows a word in memory to be conditionally set as an atomic operation.
  • This CAS approach is significantly more efficient than locks because it does not require a context switch to the kernel for arbitration.
  • However, CAS operations are not free of cost. The processor must lock its instruction pipeline to ensure atomicity and employ a memory barrier to make the changes visible to other threads.
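
To make the retry pattern concrete, a toy sketch of a CAS-based increment; Python has no hardware CAS, so the lock inside compare_and_swap merely stands in for the atomicity the CPU instruction provides:

import threading

class AtomicLong:
    """Toy atomic register; the lock stands in for the hardware CAS guarantee."""
    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def get(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

def increment(counter):
    while True:                       # optimistic retry loop: no blocking, no kernel arbitration
        current = counter.get()
        if counter.compare_and_swap(current, current + 1):
            return

counter = AtomicLong()
increment(counter)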

8 - Optimizing Space Amplification in RocksDB

Background

MyRocks encoding

In MyRocks, each row of a MySQL table is stored as one RocksDB key-value pair:

  • The primary key (primary index) is encoded in the key, and all other row data is encoded in the value.

    • The primary key is encoded following MySQL's data-type rules in a byte-comparable (memcomparable) form, so that integer and string keys sort correctly under byte-wise comparison.
  • Secondary indexes, which are not necessarily unique, are stored as separate key-value pairs: the key is index ID + index column values + primary key values, and the value is left empty.

    • A secondary-index lookup is translated into a RocksDB range query.
    • The index ID distinguishes different indexes, ensuring that keys from different indexes never collide.

Every RocksDB key is prefixed with a 4-byte table ID or index ID, so that multiple tables and indexes can coexist in a single RocksDB key space. Finally, a global sequence number, incremented on every write operation, is stored with each key-value pair to support snapshots. Snapshots are used to implement multi-version concurrency control, which in turn allows ACID transactions to be implemented in RocksDB.
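
A rough sketch of how such a secondary-index key could be laid out; the helper names and the fixed-width big-endian integer encoding are illustrative assumptions, not MyRocks' exact format:

import struct

def encode_int(value):
    # Fixed-width big-endian encoding so byte-order comparison matches numeric order
    # (non-negative values assumed in this sketch).
    return struct.pack(">Q", value)

def secondary_index_key(index_id, index_columns, primary_key_columns):
    key = struct.pack(">I", index_id)                  # 4-byte index ID prefix
    for col in index_columns + primary_key_columns:    # index columns, then the PK for uniqueness
        key += encode_int(col) if isinstance(col, int) else col.encode()
    return key

key = secondary_index_key(42, [2024], [7])             # the value part is left empty

# A secondary-index lookup becomes a RocksDB range scan over the index-ID + column prefix:
prefix = struct.pack(">I", 42) + encode_int(2024)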

Techniques for reducing space amplification

Dynamic LSM level sizing

If each level is given a fixed size, then in practice the amount of data stored in the last level is unlikely to be exactly 10 times the target size of the previous level. In the worst case, the data in the last level is only slightly larger than the target size of the previous level, and space amplification is then greater than 2. However, if we dynamically adjust each level's target size to be 1/10 of the size of the data in the next level, space amplification drops below 1.111…. The level size multiplier is a tunable parameter of the LSM-tree; above, we assumed it is 10. The larger the multiplier, the lower the space and read amplification but the higher the write amplification, so this choice is a trade-off. Most RocksDB installations in Facebook's production environment use a level size multiplier of 10, although a few instances use 8.
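
A quick check of the 1.111 figure: if the last level holds essentially all of the data and each higher level is at most 1/k of the next one, the total size is a geometric series of the data size.

def space_amplification(multiplier, levels=7):
    # Last level holds the data; each higher level is at most 1/multiplier of the next one.
    return sum(multiplier ** -i for i in range(levels))

print(space_amplification(10))   # ~1.111 (limit 10/9)
print(space_amplification(8))    # ~1.143 (limit 8/7)
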
Compression

  • Prefix compression: prefix encoding is applied to keys by not writing the part of a key that duplicates the previous key's prefix. In practice we found this reduces space requirements by 3% to 17%, depending on the workload.
  • Sequence-ID garbage collection.
  • Data compression: compression is applied at the block level.
  • Dictionary compression, and tiered per-level compression: a strong compression algorithm (such as zlib or Zstandard) is typically used only on the last level, despite its higher CPU overhead, because the bulk of the data (close to 90%) resides there while only a small fraction of reads and writes touch it. Across a variety of use cases, applying strong compression to the last level saves an additional 15% to 30% of storage compared with using only lightweight compression. Conversely, levels 0-2 use no compression at all to allow lower read latency, even though this incurs higher space and write amplification, because these levels hold only a small fraction of total storage. Levels 3 through the next-to-last level use lightweight compression (such as LZ4 or Snappy), since its CPU overhead is acceptable while it still reduces space and write amplification. Reads of data in the first three levels are likely to hit (uncompressed) file blocks in the OS cache, because those blocks are accessed frequently; reads of data beyond level 2, however, must be decompressed regardless of whether the blocks are in the OS file cache (unless they are also in the RocksDB block cache).
  • 字典压缩,通常在最后一个级别使用强压缩算法(如zlib或Zstandard),尽管这会带来更高的CPU开销,因为大部分(接近90%)的数据位于该级别,而只有少量的读写操作会访问它。在各种使用场景中,对最后一级应用强压缩算法相比仅使用轻量级压缩,可以额外节省15%至30%的存储空间。 相反地,我们在0-2级不使用任何压缩,以允许更低的读取延迟,尽管这会带来更高的空间和写入放大,因为这些级别只占用总存储空间的一小部分。从第3级到最后一级使用轻量级压缩(如LZ4或Snappy),因为其CPU开销是可接受的,同时它减少了空间和写入放大。对位于前三个级别的数据的读取更可能位于操作系统缓存的(未压缩的)文件块中,因为这些块被频繁访问。然而,对第2级以上级别数据的读取将必须进行解压缩,无论它们是否位于操作系统文件缓存中(除非它们也位于RocksDB块缓存中)。