1972 年 R. Bayer 和 E. McCreight 的论文(参考资料 [1])提出了 B- 树。B- 树是一棵平衡树,与一般的平衡二叉树(AVL,红黑树等)不同的是,B- 树的每个节点最多可以拥有 m(m>=2)个元素,(m+1)个子节点,并且所有的叶子节点位于同一层。B- 树的查找和插入的时间复杂度和二叉树一样,都是 O(logn),但因为每个节点保存的元素比较多(一般是几十个到几百个之间),树的高度比一般的二叉树要小很多,访问硬盘次数更少,在数据不能全部加载到内存的时候比一般的二叉树效率要好。
定义
一棵最多有 2n+1 个元素(其中 n>= 1)的 B- 树的定义如下(参考资料 [2, 3]):
- 每个节点最多有 2n+1 个元素;
- 每个非叶子节点(根节点除外)至少有 n+1 棵子树;
- 如果根节点不是叶子节点,则至少有 2 棵子树;
- 一个有 k 棵子树的非叶子节点有 k-1 个元素,按升序排列;
- 所有叶子节点位于同一层;
- 某个元素的左子树中的元素都比它小,右子树的元素都大于它(不存在相等的情况)。
很多资料上都是 m 阶 B- 树(其中 m>=2,可以是奇数也可以是偶数)这样的说法,但是我更倾向于上面的定义,这样就规定了 m 一定是奇数,分裂/合并操作能统一起来,如果 m 是偶数则要考虑不平衡的情况,例如以前数据结构书上用的 2-3 树我就很不爽,而 2-3-4 树(就是上面定义中 n=1 时的情况)写起来就很舒服。虽然 m 的定义有些差别,但不影响理解。
例如下面是一棵每个节点最多有 3 个元素的 B- 树:
+-------------------------+-------------------------+---+
| i | p | |
+-------------------------+-------------------------+---+
/ | \
+---------+---------+---+ +-------+---+---+ +---------+---------+---+
| d | g | | | m | | | | t | w | |
+---------+---------+---+ +-------+---+---+ +---------+---------+---+
/ | \ / \ / | \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| a | b | c | | e | f | | | h | | | | j | k | l | | n | o | | | q | s | | | v | | | | x | y | z |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
查找
例如要从上面的节点中查找字符“h”。从根节点开始,因为“h”比“i”小,所以顺着“i”的左子树往下走;在“i”的左子树的根节点中,经过比较发现“h”比“g”大,所以沿着“g”的左子树继续往下查找;最后来到叶节点中,经过查找发现“h”的位置,查找成功。
再如查找字符“u”。从根节点开始,“u”比“p”大,沿着“p”的右子树往下找;在子树的根节点比较后发现比“t”大但是比“w”小,顺着“t”的右子树往下找;在叶节点中找不到“u”,说明字符“u”不在树中。
B- 树的 key 是唯一的。假设要查找字符“i”,经过比较在根节点就找到了“i”的位置,这时就可以直接把该元素返回,而不需要继续往下查找。
插入
以 2-3-4 树为例,依次插入字符“p”,“e”,“h”:
+---+---+---+ insert "p" +---+---+---+ insert "e" +---+---+---+ insert "h" +---+---+---+
| | | | => | p | | | => | e | p | | => | e | h | p |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
由于插入第三个字符“h”时根节点还没满,所以不需分裂。接下来插入字符“m”:
+------+---+---+ +------+---+---+ +------+---+---+
| | | | | | | | | h | | |
+------+---+---+ +------+---+---+ +------+---+---+
/ \ => / \ => / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| e | h | | | p | | | | e | h | | | m | p | | | e | | | | m | p | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
因为每个节点最多只能容纳 3 个元素,所以原来的节点先分裂成两个新的子节点,其中一个包含“e”和“h”,另一个包含“p”。在分裂的时候可以根据新元素要插入的位置进行适当的分配调整,如果新元素插入的位置比 (m-1)/2 小,则可以把 [(m-1)/2, m] 之间的元素复制给右边的新节点;如果插入位置大于等于 (m-1)/2,说明新元素会插入到新分裂的右边节点,则把 [1+(m-1)/2, m] 之间的元素复制给右边的新节点,这样可以做到节点间的元素数量均匀,减少节点分裂次数。
分裂后把“p”插入到相应的位置,然后从左节点中取出其中最大的元素“h”作为新的父节点中的元素(因为取的是左节点最大的元素,所以左节点的其它元素不需要移动;如果取的是右节点中最小的元素,则右节点中剩下的其它元素都要向左移动)。
接着插入“t”,“x”,“b”:
+------+---+---+ +------+---+---+ +------+---+---+
| h | | | | m | | | | m | | |
+------+---+---+ +------+---+---+ +------+---+---+
/ \ => / \ => / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| e | | | | m | p | t | | e | h | | | p | t | x | | b | e | h | | p | t | x |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
插入元素“x”时,因为要插入的节点已经满了,这时有两种方法:一种是将右节点分裂成两个新的节点,就像插入“m”时一样;另一种方法是把节点中的元素移动到相邻的节点中以便腾出空间给新插入的节点。这里采用的是第二种方法,即把元素“m”移动到根节点代替原来“h”的位置,把“h”移到左节点,这样右节点就空出来一个位置,这时就可以插入“x”了。这样能充分利用存储空间,不至于造成大量只有很少元素的节点。
接着插入元素“d”:
+------+---+---+ +---------------------+---+---+ +------------+------------+---+
| m | | | | e | m | | | e | m | |
+------+---+---+ +---------------------+---+---+ +------------+------------+---+
/ \ => / \ => / | \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | e | h | | p | t | x | | b | | | | h | | | | p | t | x | | b | d | | | h | | | | p | t | x |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
这时所有的叶节点都满了,因此将需要插入“d”的节点分裂。分裂的方法是,取出该节点中间的元素“e”作为父节点中新的元素,然后把“e”两边的元素分别分配到两个新的节点。分裂后新的两个节点都有了足够的空间,这时就可以把“d”插入到相应的位置。
接着插入元素“f”,“g”,“i”,“k”:
+------------+------------+---+ +------------+------------+---+
| e | m | | | e | m | |
+------------+------------+---+ +------------+------------+---+
/ | \ => / | \ =>
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | | | f | h | | | p | t | x | | b | d | | | f | g | h | | p | t | x |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
+------------+----------+------------+ +------------+----------+------------+
| e | g | m | | e | g | m |
+------------+----------+------------+ => +------------+----------+------------+
/ | | \ / | | \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | | | f | | | | h | i | | | p | t | x | | b | d | | | f | | | | h | i | k | | p | t | x |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
当插入元素“i”的时候采用了直接分裂节点的方法而不是把“f”移到左边的兄弟节点中。接着插入“l”:
+------------+----------+------------+
| e | g | m |
+------------+----------+------------+ insert "l"
/ | | \ =>
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | | | f | | | | h | i | k | | p | t | x |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
+------------+----------+---------------------------+
| e | g | m |
+------------+----------+---------------------------+ "i" is pending
/ | | \ =>
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | | | f | | | | h | | | | k | l | | | p | t | x |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
+------+---+---+ +------------+------------+---+
| e | | | | i | m | |
+------+---+---+ +------------+------------+---+ "g" is pending
/ \ / | \ =>
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | | | f | | | | h | | | | k | l | | | p | t | x |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
+----------+---+---+
| g | | |
+----------+---+---+
/ \
+------+---+---+ +------------+------------+---+
| e | | | | i | m | |
+------+---+---+ +------------+------------+---+
/ \ / | \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | | | f | | | | h | | | | k | l | | | p | t | x |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
因为要插入“l”的节点已经满了,因此将其分裂后再插入“l”;在把原来的中间元素“i”插入到父节点的时候发现父节点也满了,因此需要将父节点分裂后再插入;因为父节点是根节点,因此需要创建新的根节点,然后插入原来节点中间的元素“g”。
接着插入“n”,“o”,“q”,“s”:
insert "n":
+----------+---+---+
| g | | |
+----------+---+---+
/ \
+------+---+---+ +------------+--------------+------------+
| e | | | | i | m | t | =>
+------+---+---+ +------------+--------------+------------+
/ \ / | | \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | | | f | | | | h | | | | k | l | | | n | p | | | x | | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
insert "o":
+--------------+---+---+
| i | | |
+--------------+---+---+
/ \
+--------+------------+---+ +------------+------------+---+
| e | g | | | m | t | | =>
+--------+------------+---+ +------------+------------+---+
/ | \ / | \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | | | f | | | | h | | | | k | l | | | n | o | p | | x | | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
insert "q":
+--------------+---+---+
| i | | |
+--------------+---+---+
/ \
+--------+------------+---+ +------------+------------+---+
| e | g | | | m | p | | =>
+--------+------------+---+ +------------+------------+---+
/ | \ / | \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | | | f | | | | h | | | | k | l | | | n | o | | | q | x | t |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
insert "s":
+--------------+---+---+
| i | | |
+--------------+---+---+
/ \
+--------+------------+---+ +------------+--------------+------------+
| e | g | | | m | p | t |
+--------+------------+---+ +------------+--------------+------------+
/ | \ / | | \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | | | f | | | | h | | | | k | l | | | n | o | | | q | s | | | x | | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
插入“o”的时候,在查找过程中发现路径上某个非叶子节点已经满了,为了避免后续的插入导致连锁的分裂操作,因此把在查找路径上已经满了的节点进行分裂或把一部分元素挪到没满的兄弟节点。这里做的调整是:把路径上已经满了的节点(“i”所在节点)移动到根节点代替原来“g”的位置,“g”移到左子节点上,并且原来“i”的左子树作为“g”的右子树,这样仍然保持 B- 树的定义。然后继续往下查找并插入“o”。
在查找“q”的插入位置采用了同样的策略,即查找到要插入的节点(原来“n”,“o”,“p”所在的节点)已经满了,所以把元素“p”移动到根节点代替原来“p”的位置,然后把“p”移动到右子树上。但是因为移动后路径上的节点变了,重新比较后发现“q”比新的元素“p”要大,因此插入节点不是原来的左节点,而是新的右节点(……)。
接着插入“v”:
+-----------------------+---------------------------+---+
| i | p | |
+-----------------------+---------------------------+---+
/ | \
+--------+------------+---+ +------+---+---+ +------+---+---+
| e | g | | | m | | | | t | | |
+--------+------------+---+ +------+---+---+ +------+---+---+
/ | \ / \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | d | | | f | | | | h | | | | k | l | | | n | o | | | q | s | | | v | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
因为在查找过程中发现路径上的节点(原来“m”,“p”,“t”所在节点)已经满了,对其进行分裂,分裂后中间的元素“p”被提升到父节点中,原来的节点分裂成两个新的节点,之后继续往下查找“v”的插入位置,这样能保证最后插入“v”的时候即使要分裂节点也不会造成连锁的分裂操作。
删除
接着上面的操作,删除“d”:
+-----------------------+---------------------------+---+
| i | p | |
+-----------------------+---------------------------+---+
/ | \
+--------+------------+---+ +------+---+---+ +------+---+---+
| e | g | | | m | | | | t | | |
+--------+------------+---+ +------+---+---+ +------+---+---+
/ | \ / \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | | | | f | | | | h | | | | k | l | | | n | o | | | q | s | | | v | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
删除“d”后节点不为空,且满足树的定义,因此直接删除即可。接着删除“f”:
+-------------------+--------------------+---+
| i | p | |
+-------------------+--------------------+---+
/ | \
+------+---+---+ +------+---+---+ +------+---+---+
| g | | | | m | | | | t | | |
+------+---+---+ +------+---+---+ +------+---+---+
/ \ / \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | e | | | h | | | | k | l | | | n | o | | | q | s | | | v | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
删除“f”后节点为空,而左右兄弟节点都没有多余的节点可以补充,因此只能和兄弟节点合并。合并后少了一个节点,因此把父节点中的元素“e”下移到“b”所在的节点以保持树的平衡。
删除“h”:
+-------------------+--------------------+---+
| i | p | |
+-------------------+--------------------+---+
/ | \
+------+---+---+ +------+---+---+ +------+---+---+
| g | | | | m | | | | t | | | =>
+------+---+---+ +------+---+---+ +------+---+---+
/ \ / \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | e | | | h | | | | k | l | | | n | o | | | q | s | | | v | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
+-----------+---+---+
| p | | |
+-----------+---+---+
/ \
+------------+----------+------------+ +------+---+---+
| g | i | m | | t | | | =>
+------------+----------+------------+ +------+---+---+
/ | | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | e | | | h | | | | k | l | | | n | o | | | q | s | | | v | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
+-----------+---+---+
| p | | |
+-----------+---+---+
/ \
+------------+----------+------------+ +------+---+---+
| e | i | m | | t | | |
+------------+----------+------------+ +------+---+---+
/ | | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| b | | | | g | | | | k | l | | | n | o | | | q | s | | | v | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
在查找“h”的过程中,如果发现路径上的节点只有一个元素,那么在删除要元素后可能会发生合并操作。为了避免这种连锁的回溯,可以在查找的过程中预先把只有一个元素的节点和兄弟节点合并,或者是从兄弟节点中借一个节点,保证删除操作后不会造成连锁反应。例如在经过“g”所在的节点时,发现只有一个元素,而兄弟节点没有多余的元素,因此把父节点中的“i”下移,并且和“m”所在的兄弟节点进行合并,合并后仍然保持树的平衡,再继续往下找。最后删除“h”的时候,为了避免节点的合并,先把左子树中最大的节点“e”移到父节点中代替原来“g”的位置,再把“g”下移到“h”所在的位置。
删除“b”:
+--------------+---+---+
| p | | |
+--------------+---+---+
/ \
+--------+------------+---+ +------+---+---+
| i | m | | | t | | |
+--------+------------+---+ +------+---+---+
/ | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| e | g | | | k | l | | | n | o | | | q | s | | | v | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
删除“b”后节点为空且右兄弟节点没有多余的元素,因此只能合并节点。
删除“e”,“g”,“i”:
delete "e":
+--------------+---+---+
| p | | |
+--------------+---+---+
/ \
+--------+------------+---+ +------+---+---+
| i | m | | | t | | |
+--------+------------+---+ +------+---+---+
/ | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| g | | | | k | l | | | n | o | | | q | s | | | v | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
delete "g":
+--------------+---+---+
| p | | |
+--------------+---+---+
/ \
+--------+------------+---+ +------+---+---+
| k | m | | | t | | |
+--------+------------+---+ +------+---+---+
/ | \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| i | | | | l | | | | n | o | | | q | s | | | v | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
delete "i":
+----------+---+---+
| p | | |
+----------+---+---+
/ \
+------+---+---+ +------+---+---+
| m | | | | t | | |
+------+---+---+ +------+---+---+
/ \ / \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| k | l | | | n | o | | | q | s | | | v | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
然后删除“k”:
+------------+----------+------------+
| p | m | t |
+------------+----------+------------+ =>
/ | | \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| k | l | | | n | o | | | q | s | | | v | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
+------------+----------+------------+
| p | m | t |
+------------+----------+------------+
/ | | \
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
| l | | | | n | o | | | q | s | | | v | x | |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
经过原来“m”所在节点的时候发现只有一个元素,因此需要调整,但是父节点和兄弟节点都只有一个元素,因此将这三个节点合并后在继续往下走。
参考资料
[1] Organization and Maintenance of Large Ordered Indexes
[2] B-tree
[3] 算法导论(第二版),(美)Thomas H.Cormen, Charles E.Leiserson, Ronald L.Rivest, Clifford Stein 著,潘金贵,顾铁成,李成法,叶懋 译。