dali 内存数据库系统结构 [复制链接]

论坛徽章:: 0

21楼 [报告]

发表于 2003-07-01 22:49 |只看该作者

dali 内存数据库系统结构

4 FAULT TOLERANCE

This section presents techniques other than those provided directly by transaction management for fault-tolerant programming in Dalí. These techniques help cope with process failure scenarios. The first and second techniques help detect and recover from user programs with "stray pointers, " which might corrupt persistent data stored in shared memory. The third technique returns the system to a fully available state if a process dies with transactions in progress.

本章介绍DALI中由事务管理器直接提供的容错性程序设计技术。这些技术用于应付进程失败情况。第一个和第二个技术帮助检测和恢复有错误指针的用户程序，这种程序将会破坏共识内存中的持久数据。第三个技术在一个事务正在处理时如果进程死亡，那么要把系统带到一个完全可用的状态。

实战分享：从技术角度谈机器学习入门| 【大话IT】RadonDB低门槛向MySQL集群下战书 | ChinaUnix打赏功能已上线！ | 新一代分布式关系型数据库RadonDB知多少？

无双

腰缠万贯

论坛徽章:: 0

22楼 [报告]

发表于 2003-07-01 22:50 |只看该作者

dali 内存数据库系统结构

4．1 Protection from Bad Writes

The direct access principle of Dalí implies that at least some user application code will be linked directly with Dalí libraries to access the data stored in the database through shared memory. This scenario brings with it the inevitable[不可避免的, 必然的] possibility that an application error will lead to "bad writes" through uninitialized or corrupt[误用的] pointers, which can cause persistent data to become corrupted. Dalí provides two mechanismsmemory protection and code wordsto minimize the probability that such errors will cause this problem.
DALI的直接访问定理意味着一些用户程序代码将通过与DALI库相连接，来访问共享内存中数据库的数据。这种情况不可避免的使数据库可能会被错误的应用程序“错误写”，通过未初始化的指针或是误用的指针。这种情况会把一致的数据变成错误的。Dalí提供两种机制来减少这种错误出现的可能性，这两种方法是内存保护和关键字。

Memory protection. Applications that want to prevent corruption from stray pointers can map a database file in a special protected mode. Dalí uses the mprotect system call to disable the updates by a particular process to a protected database file. Before a page is updated, the page is unprotected. At the end of the transaction, all the unprotected pages are reprotected. An erroneous update would probably try to update a protected page, resulting in a protection violation. Although this scheme detects erroneous writes immediately and the writes can be traced to their source using a debugger, the system calls used by this scheme decrease performance significantly. As a result, this scheme is more beneficial for use in debugging applications and for read-only databases or databases with low update rates.
内存保护。为了防止被错误的指针破坏，应用程序可以把数据文件映射到一个特殊的保护状态。Dalí使用mprotect系统调用来禁止特殊进程对一个受保护的数据库的更新。在一个页面更新前，页面是未保护的，事务结束后，所有未保护的页面被重保护。当一个错误的更新企图更新一个受保护的页面时，会产生一个违反保护的结果。虽然这种机制可以直接检测到错误写，并且写的代码可以通过调试器检测到，但是它显著降低了数据库的性能。因此，这种机制对调试应用程序代码、只读性数据库、低写比率的数据库更为有利。

Code words. An alternate technique for bad write detection is to associate a logical parity word, or code word, with each page of data. Whenever data is updated using valid Dalí system calls, the code word is updated accordingly. An erroneous write will update only the physical data and not the code word. We then use the strategy of protecting the checkpoint image on disk. Before a page is written to a checkpoint, its contents are verified against the code word for that page. If a mismatch is detected, access to that database is restricted to read-only while the physical image is recovered from the last checkpoint (in-progress transactions are aborted).
代码字。一个替换错误写检测的技术与每个页面数据的逻辑奇偶字，也称为关键字相关。当数据使用合法的DALI系统调用更新数据时，关键字也会因此而更新。一个错误写将只会更新物理数据而不会更新关键字。因此我们可以使用策略保护磁盘上的检查点映象。在页面被写到磁盘上前，它包含的关键字将会被检验。如果不匹配，那么在数据库从上次检查点恢复完成前，对数据库的访问将会是只读的，正在进程的事务也会退出。

Our current implementation of code words, based on page-level latching, is as follows. Each page has an associated latch and code word. While a page is being updated, its latch is held in shared mode by the updater. At the end, the change to the page's code word is computed from the current contents of the updated region and the contents of the region before the update took place (which is determined from the physical undo log record for the update). A short-term exclusive latch on the code word table is then obtained[获得] to apply the computed change to the code word value for the page. The latch ensures that concurrent updaters to different regions on a page do not install an incorrect value for the code word. As a page is being checkpointed, the checkpointer obtains an exclusive latch on the page long enough to copy the page and the code word associated with it to a separate area. The code word for the copy is then computed and compared with the value from the table.

我们当前实现的关键字，基于页级别的锁，如下所示。每一个页面有一个相关的锁和关键字。当一个页被更新时，它的锁被更新者以S模式持有。结束后，页面的关键字将会从更新的区域和旧区域（可以从更新产生的物理UNDO日志计算）的关键字重新计算产生。在关键字表上的一个短期排它锁保持着该页的关键字值计算的改变。锁确定对一个页不同区域的并发更新不会产生一个错误的关键字值。在一个页面被检查点时，检查点执行者获得页面的排它锁，直到把页面和它的关键字复制到一个独立的区域。关键字被计算并与表中值计算。

实战分享：从技术角度谈机器学习入门| 【大话IT】RadonDB低门槛向MySQL集群下战书 | ChinaUnix打赏功能已上线！ | 新一代分布式关系型数据库RadonDB知多少？

无双

腰缠万贯

论坛徽章:: 0

23楼 [报告]

发表于 2003-07-01 22:50 |只看该作者

dali 内存数据库系统结构

4．2 Protection from Process Death

Process death may be caused if the process violates hardware protection, such as attempting to access invalid memory, or if a process is killed by an operator. A Dalí system process known as the cleanup server is primarily responsible for cleaning up dead processes.
When a dead process is uncovered, the cleanup process determines which low-level latches, if any, were held by the crashed process. Whenever a process acquires a latch, an entry is made for the process in an active process table. The cleanup process consults this table to determine if the process was holding a latch. Detecting a held latch is complicated[复杂的, 难解的] because in many machine architectures it is not possible to atomically acquire a spin-lock and register ownership. Dalí has implemented a solution for this problem, described in Bohannon et al.8 If the dead process held any system latches, the process may have left system structures in an inconsistent state because of partial updates. In these cases, the entire system is recovered as if a system crash had taken place. If the dead process did not hold any latches, the cleanup server scans the transaction table and aborts in-progress transactions owned by the dead process.

如果进程触发硬件保护的话，如企图访问非法内存，或是被操作员KILL，那么可能会产生进程死亡。因为cleanup server主要是清除死亡的进程，所以DALI的服务进程可以知道进程已死亡。

当一个死亡的进程是未保险时，清除进程要决定崩溃的进程拥有的任何低层锁。无论何时一个进程申请一个锁，在活动进程表中该进程表项都会有一个入口。清除进程检查这个表来确定该进程是不是保持一个锁。因为在许多机器的结构体系中，获得一个自旋锁并注册拥有者不可能是一个原子操作，所以检查一个持有的锁是复杂的，DALI为这个问题提供了一个解决方案，在 Bohannon et al.8中有描述。如果一个死去的进程持有任何系统锁，那么由于奇偶更新的原因，进程将会使系统结构处于一种不一致的状态。这个时候，整个系统就按系统崩溃的方法进行处理。如果死的进程没有持有任何锁，那么清除服务进程扫描事务表并死进程拥有的正在进行的事务。

实战分享：从技术角度谈机器学习入门| 【大话IT】RadonDB低门槛向MySQL集群下战书 | ChinaUnix打赏功能已上线！ | 新一代分布式关系型数据库RadonDB知多少？

无双

腰缠万贯

论坛徽章:: 0

24楼 [报告]

发表于 2003-07-01 22:50 |只看该作者

dali 内存数据库系统结构

5 COLLECTIONS AND INDEXES

The storage allocator provides a low-level interface for allocating and freeing data items. Dalí also provides higher-level interfaces for grouping related data items and performing scans, as well as associative access (via indexes) on data items in a group.

存储分配器为分支的分配和释放提供一个低层的接口。和有关联的访问组中分支数据（通过索引）一样，DALI同样为组相关数据分支和进行的扫描提供高层接口。

实战分享：从技术角度谈机器学习入门| 【大话IT】RadonDB低门槛向MySQL集群下战书 | ChinaUnix打赏功能已上线！ | 新一代分布式关系型数据库RadonDB知多少？

无双

腰缠万贯

论坛徽章:: 0

25楼 [报告]

发表于 2003-07-01 22:51 |只看该作者

dali 内存数据库系统结构

5．1 Heap File
The heap file is Dalí's abstraction for handling a large number of fixed-length data items. It is implemented as a thin layer on top of the power-of-two allocator. In addition to inserting and deleting data items, the heap file supports locking and an unordered scan of items. Item locks are obtained transparently when items are inserted, deleted, updated, or scanned. Scans are supported through a bit map stored in a segment header for the items in a segment.

堆文件是DALI处理大量定长数据分支的抽象。它的实现是在2 的乘方分配器上的一个薄层。除了插入和删除数据分支，堆文件支持锁和对分支进行未排序的扫描。当分支被插入、删除、更新、扫描时分支锁是显然拥有的。扫描通过分支所在段段头中的位图实现。

image025.jpg (17.11 KB, 下载次数: 75)

实战分享：从技术角度谈机器学习入门| 【大话IT】RadonDB低门槛向MySQL集群下战书 | ChinaUnix打赏功能已上线！ | 新一代分布式关系型数据库RadonDB知多少？

无双

腰缠万贯

论坛徽章:: 0

26楼 [报告]

发表于 2003-07-01 22:51 |只看该作者

dali 内存数据库系统结构

5．2 Extendible Hash

Dalí includes a variation on extendible hashing, as described in Fagin et al.9 Figure 5 shows an overview of our structure, which matches the structure shown in Fagin et al.,9 except that the bucket concept from standard extendible hashing is broken into a hash header and a list of key entries. Each directory entry points to a hash header, and multiple directory entries can point to a single hash header. A variable i is maintained with the hash index, and the first i bits of the computed hash value for a key are used to determine the directory entry for the key (the directory itself contains 2i entries). Allowing multiple directory entries to point to the same bucket prevents too many hash headers from being allocated for directory entries to which very few key values have hashed.
DALI包括由Fagin et al.9设计的扩展HASH的一个变种。图5显示了我们结构的概要，它符合Fagin et al.9设计的结构，除了桶的概念。传统的扩展HASH中桶分为HASH头和键列表入口。每个目录入口指向HASH头，多层目录入口可以指向单个HASH头。变量I保存HASH的索引，计算出的HASH值中前I位用于键的目录入口（目录自己包含有2i个入口）。允许多个目录入口指向同一个桶是为了阻止太多的HASH头，在很少有键值可以HASH的情况下。

Searching for a key value is a fairly straightforward process; it involves searching for the key value in the list of key entries pointed to by the directory entry for the key. Insertions, however, are more complex. They may involve splitting individual key entry lists and, in certain cases, doubling the entire directory. The algorithm attempts to keep lists below a specified size threshold[开始, 开端, 极限]. To prevent interference [冲突, 干涉] between searches and inserts, a lock is maintained with each hash header. The lock is obtained in shared mode by searches traversing the list of key entries for the header, and in exclusive mode by inserters inserting into the list. Although our scheme permits a high degree of concurrency, allowing searches and inserts with different hash headers to execute concurrently, it blocks both searches and splits while a directory is being resized.

查找一个键值很少是一个前向的过程，它包括通过键的目录入口，从键入口列表中查询指向的键值。然而插入更加复杂。他们可能包括分开独立的键入口列表，并且在通常情况下，加倍目录入口。算法尝试保持以下列表在一个尺寸的极限中。为了阻止查询和插入的冲突，每个HASH头都有一个锁。当查询者遍历头部的键入口列表时，锁的拥有模式为S，当向列表中插入时，拥有模式为X。虽然我们的机制保证高度的并发性，允许对不同HASH头的查询和插入并行的执行，但在目录大小改变时它阻止查询和分裂。

实战分享：从技术角度谈机器学习入门| 【大话IT】RadonDB低门槛向MySQL集群下战书 | ChinaUnix打赏功能已上线！ | 新一代分布式关系型数据库RadonDB知多少？

无双

腰缠万贯

论坛徽章:: 0

27楼 [报告]

发表于 2003-07-01 22:52 |只看该作者

dali 内存数据库系统结构

5．3  T-tree Indexes

Lehman and Carey10 proposed T-trees as a storage-efficient data structure for main memory databases. T-trees are based on AVL trees proposed by Adel'son Vel'skii and Landis.11 As in AVL trees, the height of left and right subtrees of a T-tree may differ by, at most, one. Unlike AVL trees, each node in a T-tree stores multiple key values in a sorted order rather than a single key value. To keep occupancy high, every internal node must contain a minimum number of key values (typically k&2, if k is the maximum number of keys that can be stored in a node).
Lehman and Carey10建议把T 树作为MMDB高效的存储数据结构。T树是基于Adel'son Vel'skii and Landis.11的AVL树。与在AVL树中一样，T树的左子树和右子树高多数情况下可以相差1。但和AVL树不同，T树的每个结点保存有序的多个键值而不是单个键值。为保证高利用率，每个内部节点必须包含最小数目的键值（典型的是k&2，如果K是一个节点能最大容纳的键值数）

Searching for a key value in a T-tree is a relatively straightforward process. For every node, a check is made to determine if the key value is bounded by the leftmost and the rightmost key values in the node; if this is the case, then the key value is returned if it is contained in the node. (If the key value is not returned, it is not contained in the tree.) If the key value is less than the leftmost key value, then the search moves to the left child; otherwise, the right child node is searched. The process is repeated until either the key is found or the node to be searched is null .
在T树中查找一个键值是一个相关的前向过程，对每个节点，要检查键值是不是在该节点中最左和最右的键值之间。如果是，并且包含该键值，那么直接返回键值。（如果键值没有返回，表示键值不包含在这棵树中）如果键值小于最左的键值，那么查询将对左子树进行，否则，将对右子树进行。这个过程将一直重复到键值疗找到或是查找的节点为空为止。

Insertions and deletions into the T-tree are a bit more complicated. In both cases, first a variant of the search described above is used to find the node that bounds the key value to be inserted/deleted. If there is room in the node for insertions, the key is placed in the node; otherwise, a new node is allocated for the key. For deletions, the key is deleted from the node, and the node itself is deallocated if:

向T树中插入和删除会复杂一点，在两种情况下，首先，使用与上面讨论不同的查找方法查询包含要插入/删除的键值取值范围的节点。如果在这个节点中有空间插入，那么键值写入这个节点中。否则，创建一个新的节点来保存这个键值。如果是删除，键从节点中删除，如果发生以下情况，节点自己将会被重分配空间：

The node becomes empty, or  节点变空
The number of keys in the node falls below the specified minimum and keys in the node can be accommodated in a different node.  该节点中键的数目低于键的下限值，这个节点中的键可以向其它节点调节。

Deletions from interior nodes may require keys to be moved up from lower nodes to ensure that occupancy requirements are met. In both insertions and deletions, allocation and deallocation of a node may unbalance the tree, making it necessary to perform rotations (RR, RL, LL, LR), as described in Lehman and Carey.10
删除内部节点时，为了保证利用率，可能会把键从低层的节点中向上移。在插入或者是删除时，节点的分配和再分配可能会导致树的不平衡，因此要进行适当的旋转(RR, RL, LL, LR)，这和Lehman and Carey.10讨论的一样。

Our T-tree implementation supports next-key locking12 to avoid the phantom phenomenon. It also provides high concurrency by requiring operations on the T-tree to obtain a T-tree latch only for the duration of the operation, not the transaction. Insertions and deletions obtain the latch in X mode, while search operations obtain the latch in S mode.

我们的平衡树支持下一键锁定，这样可以避免幻象。通过要示对T树进行的操作要操作（而不是事务）期间必须拥有一个T树锁，它同样提供高并发性，插入和删除拥有X模式的锁，查询操作拥有S模式的锁。

实战分享：从技术角度谈机器学习入门| 【大话IT】RadonDB低门槛向MySQL集群下战书 | ChinaUnix打赏功能已上线！ | 新一代分布式关系型数据库RadonDB知多少？

无双

腰缠万贯

论坛徽章:: 0

28楼 [报告]

发表于 2003-07-01 22:52 |只看该作者

dali 内存数据库系统结构

6 DALI RELATIONAL MANAGER

The Dalí relational manager is a C++ class library interface to a relational system. Access to data is through C++ language classes, corresponding to such tools as tables, iterators, and search criteria, with Structured Query Language (SQL) support limited to data definition statements声明. Schema information is stored in tables, and limited views (project select joins) are allowed. Indexes may be created on arbitrary subsets of the attributes in a table. The Dalí relational manager also supports referential integrity (foreign key constraints), null values, and navigation through interators over a single table. A conjunctive query may be specified for the iterator[重申, 重述], and automatic index selection is performed.
DALI关系管理器是C++类型的关系管理器库接口，访问数据通过C++语言类，相应的这些工具如表、迭代器，和带有SQL支持的对数据定义声明的查询标准。计划信息保存在表中，有限的视图（project select joins）也允许，索引可以在表属性的任意子集创建。DALI关系管理器同样支持参考完整性（外键约束），空值，还有通过对interator<迭代器？>;单表进行导航。一个连接的查询可以为一个迭代器指定，自动索引选择也是允许的。

The one extension to the relational model is its ability to store intertable joins in the schema. From one open iterator, a new iterator on the matching tuples in the other table may be opened easily. As an option, physical information (pointers) can be stored to represent the join relationship, leading to underlying pointer list structures similar to a network database. This last feature enables the relational manager to perform high-speed navigation, which competes with object-oriented models without requiring explicit pointer types.

关系模型的一个扩展是它可以保存一个加入计划折内部表。从一个开放的迭代器中，建立在其它表的tuples上的一个新的迭代器可以容易的打开。作为一个选项，物理信息（指针）可以为了表现加入关系而保存，这将导致与网络数据库一样的基本指针列表。最后一个特征使关系管理器能执行高速导航，与面向对象模型相比不需要直接的指针类型。

实战分享：从技术角度谈机器学习入门| 【大话IT】RadonDB低门槛向MySQL集群下战书 | ChinaUnix打赏功能已上线！ | 新一代分布式关系型数据库RadonDB知多少？

无双

腰缠万贯

论坛徽章:: 0

29楼 [报告]

发表于 2003-07-01 22:52 |只看该作者

dali 内存数据库系统结构

7 CONCLUSION

In this paper, we presented an overview of the architecture of the Dalí main memory storage manager. To our knowledge, Dalí contains the only explicit implementation of multilevel recovery for main memory, and one of very few for disk-based systems. We described Dalí's storage architecture, the implementation of the T-tree and extendible hash index structures, its extensive features for detecting bad writes by processes, and the procedure for recovering from process failure.
在本论文中，我们介绍一对DALI内存数据库存储管理器。据我们了解，DALI是唯一包含清楚的多层恢复机制的MMDB，这对磁盘数据库来说也是少数。我们讨论了DALI的存储管理器结构，还有T树和HASH索引结构的实现，以及它在进程的错误写检测、从进程失败过程中恢复的广大的特性。

We are working on a number of additional features for the Dalí storage manager in its product form as DataBlitz. These include support for hot spares that can take over processing if the primary server fails; nonblocking read transactions that use sophisticated version-based concurrency control schemes; and on-line schema evolution, which will enable attributes to be dropped/added to a relation as access to the relation continues.

我们同时也在通过使用DALI存储管理器产品DataBlitz的其它特性。这些包括对热备份的支持，热备份可以在主服务器失败时接管进程，对使用复杂的基于版本的并发箜制机制可以非阻塞的读事务。还有在线计划演变，这允许不中断关系的情况下增加/减少关系。

实战分享：从技术角度谈机器学习入门| 【大话IT】RadonDB低门槛向MySQL集群下战书 | ChinaUnix打赏功能已上线！ | 新一代分布式关系型数据库RadonDB知多少？

无双

腰缠万贯

论坛徽章:: 0

30楼 [报告]

发表于 2003-07-01 22:52 |只看该作者

dali 内存数据库系统结构

7．1 Acknowledgments

Dalí represents a substantial effort, during several years, by many people. We would like to thank H. V. Jagadish, of AT&T, for significant early contributions to Dalí. Dan Lieuwen of Bell Labs also made significant early contributions and has offered several valuable suggestions concerning process failure and the organization of the heap file. S. Seshadri, a consultant to Lucent Technologies when this work was performed and a member of the faculty at the Indian Institute of Technology in Bombay, helped develop the T-tree concurrency algorithm and the relational interface. Steve Coomer of Network Systems suggested the strategy of using code words to protect checkpoint images on disk. The design for the structure used in the coalescing allocator was suggested by Dennis Leinbaugh, also of Network Systems. We would also like to thank Jerry Baulier of Lucent's Advanced Technologies, project manager for DataBlitz, for his support of the project. Many other talented individuals contributed to the design and implementation of specific systems in Dalí during the last three years, including Sandeep Joshi, Mike Nemeth, and James Parker of Lucent Technologies, and Soumya Chakraborty, Ajay Deshpande, Sadanand Gogate, Chandra Gupta, Amit Khivesara, Sekhara Muddana, and Yogesh Wagle, current or former contractors at Lucent.

THE ARCHITECTURE OF THE DALI MAIN MEMORY STORAGE MANAGER
Figure 1.
Architecture of the Dalí system.

Figure 2.
Layers of abstraction in the Dalí system.

Figure 3.
Segments and chunks.

Figure 4.
Overview of recovery structures.

Figure 5.
Extendible hashing in the Dalí system.

实战分享：从技术角度谈机器学习入门| 【大话IT】RadonDB低门槛向MySQL集群下战书 | ChinaUnix打赏功能已上线！ | 新一代分布式关系型数据库RadonDB知多少？

1 234 / 4 页下一页

返回列表

Chinaunix › 论坛 › 程序设计 › 软件配置管理 › dali 内存数据库系统结构