1972 年 R. Bayer 和 E. McCreight 提出了 B 树。1979 年 Douglas Comer 在 The Ubiquitous B-Tree 提出了 B 树的一个变形——B+ 树。由于多路平衡树减少了磁盘读写次数,并且仍然保持 O(logN) 的插入/删除/查找的效率,被广泛应用于数据库和文件系统中。
定义(参考资料 [1, 3])
模仿 B 树的定义(参考资料 [3]),一棵 m 阶的 B+ 树可以这样定义:
- 每个节点最多可以有 m 个元素;
- 除了根节点外,每个节点最少有 (m/2) 个元素;
- 如果根节点不是叶节点,那么它最少有 2 个孩子节点;
- 所有的叶子节点都在同一层;
- 一个有 k 个孩子节点的非叶子节点有 (k-1) 个元素,按升序排列;
- 某个元素的左子树中的元素都比它小,右子树的元素都大于或等于它;
- 非叶子节点只存放关键字和指向下一个孩子节点的索引,记录只存放在叶子节点中;
- 相邻的叶子节点之间用指针相连。
例如一棵 3 阶的 B+ 树如下:
+----------------------+---+---+
| m | | |
+----------------------+---+---+
/ \
+--------+----------+--------+ +------+---+---+
| e | g | i | | t | | |
+--------+----------+--------+ +------+---+---+
/ | | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | f | |->| g | h | |->| k | l | |->| m | p | |->| t | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
这里主要记录了基本操作的过程,比较零散,具体的步骤归纳见参考资料 [1]。以下例子是本人根据参考资料 [1] 所写的程序输出,不一定正确,欢迎指出其中的错误。
查找
例如要从上面的节点中查找字符“h”。从根节点开始,因为“h”比“m”小,所以顺着“m”的左分支往下走;在“m”的左节点中,经过比较发现“h”比“g”大但是比“i”小,所以沿着“g”和“i”之间的分支往下;最后来到叶节点中,经过查找发现“h”的位置,查找成功。
再如查找字符“i”。从根节点开始,“i”比“m”小,沿着“m”的左分支往下;比较后发现“i”的位置,但是因为在非叶子节点中出现,所以不能确定“i”是否存在,应该继续沿着“i”的右分支往下走(因为这里规定某个元素的左分支都比它“小”,右分支的元素都是“不小于”它的元素);最后到达“k”和“l”所在的叶节点,经过查找后没发现“i”,查找失败。要注意的是在非叶节点中出现的元素不一定存在叶节点中,这是因为删除叶子节点的某些元素并不要求删除在非叶子节点中的相同元素,非叶子节点中的元素只作为索引存在。
插入
以 3 阶的 B+ 树为例,插入的字符按照字典序排列。
从一棵空树开始,依次插入字符“p”,“e”,“h”:
+---+---+---+ insert "p" +---+---+---+ insert "e" +---+---+---+ insert "h" +---+---+---+
| | | | => | p | | | => | e | p | | => | e | h | p |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
由于插入第三个字符“h”时根节点还没满,所以不需分裂。接下来插入字符“m”:
+------+---+---+ +------+---+---+ +------+---+---+
| | | | | | | | | m | | |
+------+---+---+ +------+---+---+ +------+---+---+
/ \ => / \ => / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| e | h | |->| p | | | | e | h | |->| m | p | | | e | h | |->| m | p | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
当往一个已经满了的叶节点中插入记录时,该节点直接分裂成两个节点,原来节点的记录平均分配到两个新节点中,根据需要调整原来的父节点及祖先节点,然后新的记录根据情况插入到其中一个节点。在分裂的时候可以根据新元素要插入的位置进行适当的分配调整,如果新元素插入的位置比 (m-1)/2 小,则可以把 [(m-1)/2, m] 的元素复制给右边的新节点;如果插入位置大于等于 (m-1)/2,说明新元素会插入到新分裂的右边节点,则把 [1+(m-1)/2, m] 位置上的元素复制给右边的新节点,这样尽量做到节点间的元素数量均匀,减少节点分裂次数。
因为插入字符“m”的时候叶节点已经满了,所以要进行分裂,原来的叶子节点分裂成两个新的节点,而由于原来的叶子节点是根节点,所以还需要一个新的非叶节点作为新的根节点,新分裂出来的两个叶节点使用指针连接起来。字符“m”要插入的位置是 2(下标从 0 开始,位置 2 就是旧节点中“p”的位置),这个位置在中点的右边,说明“m”将会被插入到分裂后右边的新节点,所以在节点分裂时右边节点的元素比左边的少一个,这样在插入“m”后两个节点的元素个数就一样了。
接着插入“t”,“x”,“b”:
+------+---+---+ +--------+----+---+
| m | | | | m | t | |
+------+---+---+ +--------+----+---+
/ \ => / | \ =>
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| e | h | |->| m | p | t | | e | h | |->| m | p | |->| t | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
+--------+----+---+
| m | t | |
+--------+----+---+
/ | \
+---+---+---+ +---+---+---+ +---+---+---+
| b | e | h |->| m | p | |->| t | x | |
+---+---+---+ +---+---+---+ +---+---+---+
然后插入“d”。因为“d”要插入的叶节点(最左边的节点)已经满了,所以这个节点要分裂成两个新的节点,并且把原来节点的元素均匀分配到两个新节点中,然后把“d”插入到相应的节点中。因为新增加了一个叶节点,所以需要把新节点最左边元素的关键字插入到原来节点的父节点中:
+---------------+------------+---+
| m | t | |
+---------------+------------+---+
/ | \ =>
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | h | |->| m | p | |->| t | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
+--------+----------+--------+
| e | m | t |
+--------+----------+--------+
/ | | \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | h | |->| m | p | |->| t | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
插入“f”后:
+--------+----------+--------+
| e | m | t |
+--------+----------+--------+
/ | | \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | f | h |->| m | p | |->| t | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
再插入“g”:
+--------+-------------------------+--------+
| e | m | t |
+--------+-------------------------+--------+
/ | | \ =>
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | f | |->| g | h | |->| m | p | |->| t | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
+------------------+---+---+
| m | | |
+------------------+---+---+
/ \
+----+--------+---+ +------+---+---+
| e | | | | t | | | =>
+----+--------+---+ +------+---+---+
/ | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | f | |->| g | h | |->| m | p | |->| t | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
+------------------+---+---+
| m | | |
+------------------+---+---+
/ \
+----+--------+---+ +------+---+---+
| e | g | | | t | | |
+----+--------+---+ +------+---+---+
/ | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | f | |->| g | h | |->| m | p | |->| t | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
将要插入“g”的叶节点(“e”,“f”和“h”所在的节点)已经满了,所以分裂成两个节点并且把“g”插入到相应的节点(参考上面插入“d”时节点的变化);接着应该把“g”复制一份然后插入到父节点中,但是父节点已经满了,所以父节点分裂成两个新的节点,然后把“g”插入到对应的节点中。因为此时的父节点是根节点,分裂后需要一个新的根节点来存储多出来的关键字。
接着插入“i”,“k”,“l”:
insert "i":
+------------------+---+---+
| m | | |
+------------------+---+---+
/ \
+----+--------+---+ +------+---+---+
| e | g | | | t | | |
+----+--------+---+ +------+---+---+
/ | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | f | |->| g | h | i |->| m | p | |->| t | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
insert "k":
+----------------------+---+---+
| m | | |
+----------------------+---+---+
/ \
+--------+----------+--------+ +------+---+---+
| e | g | i | | t | | |
+--------+----------+--------+ +------+---+---+
/ | | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | f | |->| g | h | |->| i | k | |->| m | p | |->| t | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
insert "l":
+----------------------+---+---+
| m | | |
+----------------------+---+---+
/ \
+--------+----------+--------+ +------+---+---+
| e | g | i | | t | | |
+--------+----------+--------+ +------+---+---+
/ | | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | f | |->| g | h | |->| i | k | l |->| m | p | |->| t | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
接着插入“j”:
+------------------------------------+---+---+
| m | | |
+------------------------------------+---+---+
/ \
+--------+----------+--------+ +------+---+---+
| e | g | i | | t | | | =>
+--------+----------+--------+ +------+---+---+
/ | | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | f | |->| g | h | |->| i | j | |->| k | l | |->| m | p | |->| t | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
+-------------------------+-----------------------------+---+
| g | m | |
+-------------------------+-----------------------------+---+
/ | \
+------+---+---+ +--------+--------+---+ +------+---+---+
| e | | | | i | k | | | t | | |
+------+---+---+ +--------+--------+---+ +------+---+---+
/ \ / | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | f | |->| g | h | |->| i | j | |->| k | l | |->| m | p | |->| t | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
非叶节点分裂和叶节点分裂不一样。叶子节点分裂时,要把新节点的最小值复制到父节点相应的位置;非叶子节点分裂时,把中间的值移到(不是复制)父节点中。
删除
B+树如下:
+----------------------+---+---+
| m | | |
+----------------------+---+---+
/ \
+--------+----------+--------+ +------+---+---+
| e | g | i | | t | | |
+--------+----------+--------+ +------+---+---+
/ | | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | f | |->| g | h | |->| i | k | l |->| m | p | |->| t | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
删除“i”之后:
+----------------------+---+---+
| m | | |
+----------------------+---+---+
/ \
+--------+----------+--------+ +------+---+---+
| e | g | i | | t | | |
+--------+----------+--------+ +------+---+---+
/ | | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | f | |->| g | h | |->| k | l | |->| m | p | |->| t | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
因为“i”所在的节点删除“i”后剩下的元素个数仍然大于 (m/2),所以只需把剩下的元素移动位置,不需要合并节点。虽然“i”已经从叶子节点中删除了,但是在父节点中的索引“i”仍然存在。
接着删除“g”:
+----------------------+---+---+
| m | | |
+----------------------+---+---+
/ \
+--------+----------+--------+ +------+---+---+
| e | g | i | | t | | | =>
+--------+----------+--------+ +------+---+---+
/ | | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | f | |->| | h | |->| k | l | |->| m | p | |->| t | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
+--------------+---+---+
| m | | |
+--------------+---+---+
/ \
+--------+--------+---+ +------+---+---+
| e | g | | | t | | |
+--------+--------+---+ +------+---+---+
/ | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | f | |->| h | k | l |->| m | p | |->| t | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
删除“g”后原来的叶节点只剩一个元素“h”,因此需要和其它的节点合并(这里是和右兄弟节点合并,当然也可以和左边的节点合并)。由于合并后的叶节点数少了一个,所以父节点也要相应地删除一个元素“i”。
接着删除元素“x”:
+--------------+---+---+
| m | | |
+--------------+---+---+
/ \
+--------+--------+---+ +------+---+---+
| e | g | | | t | | | =>
+--------+--------+---+ +------+---+---+
/ | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | f | |->| h | k | l |->| m | p | |->| t | | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
+------------------+---+---+
| m | | |
+------------------+---+---+
/ \
+--------+--------+---+ +------+---+---+
| e | g | | | | | | =>
+--------+--------+---+ +------+---+---+
/ | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | f | |->| h | k | l |->| m | p | t |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
+----------+---+---+
| g | | |
+----------+---+---+
/ \
+------+---+---+ +------+---+---+
| e | | | | m | | |
+------+---+---+ +------+---+---+
/ \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| e | f | |->| h | k | l |->| m | p | t |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
合并后叶节点少了一个,因此右边的父节点索引“t”要被删除,而父节点的父节点(“m”所在的节点)的索引“m”要移到“t”的位置上,然后从父节点左边的兄弟节点(“e”和“g”所在的节点)中把最大的索引(即“g”)移到“m”的位置上,以保持树的平衡。
接着删除“e”:
+----------+---+---+
| g | | |
+----------+---+---+
/ \
+------+---+---+ +------+---+---+
| e | | | | m | | | =>
+------+---+---+ +------+---+---+
/ \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | |->| | f | |->| h | k | l |->| m | p | t |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
+------------+--------+---+
| g | m | |
+------------+--------+---+
/ | \
+---+---+---+ +---+---+---+ +---+---+---+
| b | d | f |->| h | k | l |->| m | p | t |
+---+---+---+ +---+---+---+ +---+---+---+
合并后叶节点少了一个,因此父节点中的索引元素“e”要被删除,而且右边的兄弟节点只剩一个元素,因此把父节点与右兄弟节点合并,保持树的平衡性。
其它
对于插入/删除有一些优化的方法,如当节点元素个数小于 (m/2) 的时候并不是立即进行合并,而是等到小于某个给定值(如 20%)的时候再进行合并,以减少频繁插入/删除造成的节点频繁分裂/合并的情况;当一个节点满了之后不是立即分裂,而是先看看左右相邻节点是否还有空位置,如果有的话就把一些元素挪到相邻节点的空位上,或者把左,中,右三个节点的元素集中起来均匀分散到三个节点中,再进行插入(这种优化得到的树叫做 B* 树?)。
B+ 树和 B 树最大的不同点是:B 树的关键字和记录是放在一起的,叶子节点和非叶子节点中都同时包含关键字和记录;B+ 树的非叶子节点中只有关键字和指向下一个节点的索引,记录只放在叶子节点中。在 B 树中,越靠近根节点的记录查找时间越快,只要找到关键字即可确定记录的存在;而 B+ 树中每个记录的查找时间基本是一样的,都需要从根节点走到叶子节点,而且在叶子节点中还要再比较关键字。从这个角度看 B- 树的性能好像要比 B+ 树好,而在实际应用中却是 B+ 树的性能要好些。因为 B+ 树的非叶子节点不存放实际的数据,这样每个节点可容纳的元素个数比 B 树多,树高比 B 树小,这样带来的好处是减少磁盘访问次数。尽管 B+ 树找到一个记录所需的比较次数要比 B 树多,但是一次磁盘访问的时间相当于成百上千次内存比较的时间,因此实际中 B+ 树的性能可能还会好些,而且 B+ 树的叶子节点使用指针连接在一起,方便顺序遍历(例如查看一个目录下的所有文件,一个表中的所有记录等),这也是很多数据库和文件系统使用 B+ 树的缘故。
参考资料
[1] B+-Tree Indexes
[2] B+ tree in Wikipedia
[3] B tree definition in Wikipedia
删除x的时候,是否可以将e g m 同时合并成根节点?
可以,只要判断下这三个节点是否能够合并就行,不过也要考虑合并后再插入数据会造成新一轮的节点分裂。这篇博客只写了基本操作,生产环境中还有很多优化的地方。
讲得非常棒,感谢分享
那啥。。。删除操作那一部分,删元素x的时候,按数据库书上的那个算法,就把t左边那个指针和m都移到存e、g那个结点了,这样e、g、m三个元素就成了新的根,可以减少一层。前面插入操作有一步我按书上的算法出来的树结构也有点不同,但是B+树的性质都是满足了的…………
这是分裂/合并策略的问题……插入的时候节点满了就直接分裂,而没有检查是否可以把部分元素挪兄弟节点中;而删除的时候如果节点元素个数不满足(m/2)的条件会先检查兄弟节点,如果兄弟节点元素个数满足(m/2)就从兄弟节点借一个节点而不进行合并……
最后结果不一定完全相同吧,性质满足了就行
巧啊。。。我这两天正好在看原来那个数据库书……刚好看到B+树。。。