Redis内存分配器

发表于 2023-06-24 更新于 2023-11-04 分类于 redis 阅读次数： Waline：本文字数： 11k 阅读时长 ≈ 10 分钟

像大家原来使用纸币，平时用的纸币只有 1元 5元 10元 20元 50元 100元，为什么没有 1.5元 3元？
因为有了 3元人们使用不方便，也不方便管理。

(1) Redis内存分配器

Redis在内存使用时也是类似纸币，为了方便管理，Redis在分配内存时按照设计的大小分配。

(1.1) 为什么要设置内存分配器

举个简单的例子，你找妈妈要零花钱，你想要1.5元，但是她没有1.5元，给了个你5元。

内存分配器的目标主要有2个：
1.1 减少内存碎片，包括内部碎片和外部碎片:
内部碎片：分配出去的但没有使用到的内存，比如需要 32 字节，分配了 40 字节，多余的 8 字节就是内部碎片。
外部碎片：大小不合适导致无法分配出去的内存，比如一直申请 16 字节的内存，但是内存分配器中保存着部分 8 字节的内存，一直分配不出去。
提高性能:
1 单线程性能
2 多线程性能

(1.2) 内存分配器配置

Linux上默认内存分配器是jemalloc

Mac OS 系统Redis默认内存分配器是libc，可以通过 info memory 结果里的 mem_allocator指标查看；
如果需要使用 jemalloc 内存分配器，在执行make命令时通过 MALLOC=jemalloc 指定

jemalloc 和 tcmalloc 都是对 glibc 中的优化，目的也是为了减少内存碎片和提高性能。

(2) 内存分配的原理

类似于人民币有 1元、5元、10元、100元

jemalloc 将对象按大小分为3类，不同大小类别的分配算法不同:
small: 从对应 bin 管理的 run 中返回一个 region
large: 大小比 chunk 小，比 page 大，会单独返回一个 run
huge: 大小为 chunk 倍数，会分配 chunk

每个 size_class 代表 jemalloc 分配的内存大小。
共有 NSIZES(232)个小类（如果用户申请的大小位于两个小类之间，会取较大的，比如申请14字节，位于8和16字节之间，按16字节分配。

在 2MiB chunk，4KiB page 的64位系统上，size classes 如下

+---------+---------+--------------------------------------+
|Category | Spacing | Size                                 |
+---------+---------+--------------------------------------+
|         |       8 | [8]                                  |
|         +---------+--------------------------------------+
|         |      16 | [16, 32, 48, 64, 80, 96, 112, 128]   |
|         +---------+--------------------------------------+
|         |      32 | [160, 192, 224, 256]                 |
|         +---------+--------------------------------------+
|         |      64 | [320, 384, 448, 512]                 |
|         +---------+--------------------------------------+
|Small    |     128 | [640, 768, 896, 1024]                |
|         +---------+--------------------------------------+
|         |     256 | [1280, 1536, 1792, 2048]             |
|         +---------+--------------------------------------+
|         |     512 | [2560, 3072, 3584, 4096]             |
|         +---------+--------------------------------------+
|         |   1 KiB | [5 KiB, 6 KiB, 7 KiB, 8 KiB]         |
|         +---------+--------------------------------------+
|         |   2 KiB | [10 KiB, 12 KiB, 14 KiB]             |
+---------+---------+--------------------------------------+
|         |   2 KiB | [16 KiB]                             |
|         +---------+--------------------------------------+
|         |   4 KiB | [20 KiB, 24 KiB, 28 KiB, 32 KiB]     |
|         +---------+--------------------------------------+
|         |   8 KiB | [40 KiB, 48 KiB, 54 KiB, 64 KiB]     |
|         +---------+--------------------------------------+
|         |  16 KiB | [80 KiB, 96 KiB, 112 KiB, 128 KiB]   |
|Large    +---------+--------------------------------------+
|         |  32 KiB | [160 KiB, 192 KiB, 224 KiB, 256 KiB] |
|         +---------+--------------------------------------+
|         |  64 KiB | [320 KiB, 384 KiB, 448 KiB, 512 KiB] |
|         +---------+--------------------------------------+
|         | 128 KiB | [640 KiB, 768 KiB, 896 KiB, 1 MiB]   |
|         +---------+--------------------------------------+
|         | 256 KiB | [1280 KiB, 1536 KiB, 1792 KiB]       |
+---------+---------+--------------------------------------+
|         | 256 KiB | [2 MiB]                              |
|         +---------+--------------------------------------+
|         | 512 KiB | [2560 KiB, 3 MiB, 3584 KiB, 4 MiB]   |
|         +---------+--------------------------------------+
|         |   1 MiB | [5 MiB, 6 MiB, 7 MiB, 8 MiB]         |
|         +---------+--------------------------------------+
|Huge     |   2 MiB | [10 MiB, 12 MiB, 14 MiB, 16 MiB]     |
|         +---------+--------------------------------------+
|         |   4 MiB | [20 MiB, 24 MiB, 28 MiB, 32 MiB]     |
|         +---------+--------------------------------------+
|         |   8 MiB | [40 MiB, 48 MiB, 56 MiB, 64 MiB]     |
|         +---------+--------------------------------------+
|         |     ... | ...                                  |
+---------+---------+--------------------------------------+

(2.1) small分配流程

small 的分配流程如下：

查找对应 size classes 的 bin

从 bin 中获取 run:
2.1 bin->runcur
2.2 从 bin->runs 查找未满的 run
从 arena 中获取 run:
3.1 从 arena->avail_runs 中查找空闲 run
3.2 当没有合适 run 时，从 chunk 中分配 run:
3.2.1 arena->spare
3.2.2 arena->cached_tree
3.2.3 arena->retained_tree
3.2.4 调用 mmap() 新分配一块 chunk
从 run 中返回一个空闲 region

small 的释放流程如下：

将该 region 返回给对应的 run，即设置 bitmap 为空闲，增加 nfree

将 run 还给 bin:
2.1 如果 run->nfree == 1，则设置为 bin->runcur 或者插入到 bin->runs 中

如果 run->nfree == bin_info->nregs，则将该 run 与 bin 分离，再将 run 还给 arena:
3.1 尝试与相同 chunk 中前后相邻的空闲 run 进行合并，然后插入到 arena->avail_runs 中
3.2 若合并完后，整个 chunk 为空，则尝试与连续地址空间的空闲 chunk 进行合并，然后插入到 arena->cached_tree 中

(2.2) large

分配 large 和分配 small 类似:

先从 arena->avail_runs 中查找，因为 large object 不由 bin 管理，所以与 small object 相比，少了从 bin->runs 中查找的一步

分配 chunk，步骤和 small 一样，然后从 chunk 中分配需要的 run 大小，此时 run 的大小为单个 object 的大小，而 small run 的大小会从 bin_info[] 中获取
因为 large 大小是 page 的倍数，且会按照 page size 地址对齐，有可能造成 cache line 颠簸，所以会根据配置多分配一个 page，用于内存着色，防止 cache line 的颠簸
large 和 small 的 arena_chunk_map_misc_t 格式也不同，large 只在首个 page 设置 run 的大小。释放流程和 small 一样，只是缺少了 run 在 bin 中的处理，直接将 run 还给 arena。

(2.3) huge

huge object 大小比 chunk 大。分配策略和上面分配 chunk 一样:

从 arena 中分配 extent_node_t

从 arena 中分配 chunk:
2.1 从 arena->cached_tree 中分配 chunk
2.2 从 arena->retained_tree 中分配
2.3 调用 mmap() 新分配一块 chunk

将 chunk 和 node 插入到 chunks_rtree 中

插入到 arena->huge 链表中

释放和分配过程相反:

从 chunks_rtree 中获取 chunk 对应的 node，从而获取对应的 arena

移出 arena->huge

释放 chunk，插入到 arena->cached_tree 中

释放 node
huge 使用了线程间共享的 chunks_rtree 来保存信息，这会导致锁的竞争，但是应用程序很少会分配如此大的内存，所以带来的影响很小。

(3) 源码解读

由于redis要求满足跨平台性，而每个平台又会有自己的内存管理函数。
zmalloc.c和zmalloc.h主要功能就是对原有库里的内存分配函数进行封装，形成独立的一套内存管理函数。

根据系统的不同，使用不同的内存管理函数，例如jemalloc, tcmalloc, cmalloc。

(3.1) zmalloc.h

https://github.com/redis/redis/blob/6.0/src/zmalloc.h

// tcmalloc
#if defined(USE_TCMALLOC)
#define ZMALLOC_LIB ("tcmalloc-" __xstr(TC_VERSION_MAJOR) "." __xstr(TC_VERSION_MINOR))
#include <google/tcmalloc.h>
#if (TC_VERSION_MAJOR == 1 && TC_VERSION_MINOR >= 6) || (TC_VERSION_MAJOR > 1)
#define HAVE_MALLOC_SIZE 1
#define zmalloc_size(p) tc_malloc_size(p)
#else
#error "Newer version of tcmalloc required"
#endif

// jemalloc
#elif defined(USE_JEMALLOC)
#define ZMALLOC_LIB ("jemalloc-" __xstr(JEMALLOC_VERSION_MAJOR) "." __xstr(JEMALLOC_VERSION_MINOR) "." __xstr(JEMALLOC_VERSION_BUGFIX))
#include <jemalloc/jemalloc.h>
#if (JEMALLOC_VERSION_MAJOR == 2 && JEMALLOC_VERSION_MINOR >= 1) || (JEMALLOC_VERSION_MAJOR > 2)
#define HAVE_MALLOC_SIZE 1
#define zmalloc_size(p) je_malloc_usable_size(p)
#else
#error "Newer version of jemalloc required"
#endif


// 苹果系统 malloc
#elif defined(__APPLE__)
#include <malloc/malloc.h>
#define HAVE_MALLOC_SIZE 1
#define zmalloc_size(p) malloc_size(p)
#endif


// 其它情况 malloc
#ifndef ZMALLOC_LIB
#define ZMALLOC_LIB "libc"
#ifdef __GLIBC__
#include <malloc.h>
#define HAVE_MALLOC_SIZE 1
#define zmalloc_size(p) malloc_usable_size(p)
#endif
#endif

ZMALLOC_LIB：最终使用的内存库，是 jemalloc-5.1.0-0-g0、tcmalloc-版本号、libc 这样
HAVE_MALLOC_SIZE: 使用的内存分配库是否有自带的获取内存块大小的函数，jemalloc和tcmalloc都有
HAVE_DEFRAG: 是否支持内存碎片处理

// 如果支持支持内存碎片整理，会编译对应的函数
#ifdef HAVE_DEFRAG
void zfree_no_tcache(void *ptr);  // 内存释放
void *zmalloc_no_tcache(size_t size); // 内存分配
#endif

// 内存分配库是否有自带的获取内存块大小的函数
#ifndef HAVE_MALLOC_SIZE
size_t zmalloc_size(void *ptr);  // 获取内存长度
size_t zmalloc_usable(void *ptr);  // 
#else
#define zmalloc_usable(p) zmalloc_size(p)
#endif

(4.2) zmalloc.c

https://github.com/redis/redis/blob/6.0/src/zmalloc.c

/* Explicitly override malloc/free etc when using tcmalloc. */
#if defined(USE_TCMALLOC)  // 使用三方库 tcmalloc
#define malloc(size) tc_malloc(size)
#define calloc(count,size) tc_calloc(count,size)
#define realloc(ptr,size) tc_realloc(ptr,size)
#define free(ptr) tc_free(ptr)
#elif defined(USE_JEMALLOC)  // 使用三方库 jemalloc
#define malloc(size) je_malloc(size)
#define calloc(count,size) je_calloc(count,size)
#define realloc(ptr,size) je_realloc(ptr,size)
#define free(ptr) je_free(ptr)
#define mallocx(size,flags) je_mallocx(size,flags)
#define dallocx(ptr,flags) je_dallocx(ptr,flags)
#endif

功能	函数	tcmalloc	jemalloc
分配内存	malloc(size)	tc_malloc(size)	je_malloc(size)
分配内存空间	calloc(count,size)	tc_calloc(count,size)	je_calloc(count,size)
调整内存大小	realloc(ptr,size)	tc_realloc(ptr,size)	je_realloc(ptr,size)
释放内存	free(ptr)	tc_free(ptr)	je_free(ptr)
	mallocx(size,flags)	-	je_mallocx(size,flags)
	dallocx(ptr,flags)	-	je_dallocx(ptr,flags)

记录内存大小

// 分配/增加使用的内存数
#define update_zmalloc_stat_alloc(__n) do { \
    size_t _n = (__n); \
    if (_n&(sizeof(long)-1)) _n += sizeof(long)-(_n&(sizeof(long)-1)); \
    atomicIncr(used_memory,__n); \
} while(0)

// 释放/减少使用的内存数
#define update_zmalloc_stat_free(__n) do { \
    size_t _n = (__n); \
    if (_n&(sizeof(long)-1)) _n += sizeof(long)-(_n&(sizeof(long)-1)); \
    atomicDecr(used_memory,__n); \
} while(0)

// 已使用内存变量
static size_t used_memory = 0;
pthread_mutex_t used_memory_mutex = PTHREAD_MUTEX_INITIALIZER;

redis用一个静态原子变量used_memory来记录已使用的内存大小。

(4) 主动碎片整理

########################### ACTIVE DEFRAGMENTATION #######################
#
# What is active defragmentation?
# -------------------------------
#
# Active (online) defragmentation allows a Redis server to compact the
# spaces left between small allocations and deallocations of data in memory,
# thus allowing to reclaim back memory.
#
# Fragmentation is a natural process that happens with every allocator (but
# less so with Jemalloc, fortunately) and certain workloads. Normally a server
# restart is needed in order to lower the fragmentation, or at least to flush
# away all the data and create it again. However thanks to this feature
# implemented by Oran Agra for Redis 4.0 this process can happen at runtime
# in a "hot" way, while the server is running.
#
# Basically when the fragmentation is over a certain level (see the
# configuration options below) Redis will start to create new copies of the
# values in contiguous memory regions by exploiting certain specific Jemalloc
# features (in order to understand if an allocation is causing fragmentation
# and to allocate it in a better place), and at the same time, will release the
# old copies of the data. This process, repeated incrementally for all the keys
# will cause the fragmentation to drop back to normal values.
#
# Important things to understand:
#
# 1. This feature is disabled by default, and only works if you compiled Redis
#    to use the copy of Jemalloc we ship with the source code of Redis.
#    This is the default with Linux builds.
#
# 2. You never need to enable this feature if you don't have fragmentation
#    issues.
#
# 3. Once you experience fragmentation, you can enable this feature when
#    needed with the command "CONFIG SET activedefrag yes".
#
# The configuration parameters are able to fine tune the behavior of the
# defragmentation process. If you are not sure about what they mean it is
# a good idea to leave the defaults untouched.

# Active defragmentation is disabled by default
# activedefrag no

# Minimum amount of fragmentation waste to start active defrag
# active-defrag-ignore-bytes 100mb

# Minimum percentage of fragmentation to start active defrag
# active-defrag-threshold-lower 10

# Maximum percentage of fragmentation at which we use maximum effort
# active-defrag-threshold-upper 100

# Minimal effort for defrag in CPU percentage, to be used when the lower
# threshold is reached
# active-defrag-cycle-min 1

# Maximal effort for defrag in CPU percentage, to be used when the upper
# threshold is reached
# active-defrag-cycle-max 25

# Maximum number of set/hash/zset/list fields that will be processed from
# the main dictionary scan
# active-defrag-max-scan-fields 1000

# Jemalloc background thread for purging will be enabled by default
jemalloc-bg-thread yes

# It is possible to pin different threads and processes of Redis to specific
# CPUs in your system, in order to maximize the performances of the server.
# This is useful both in order to pin different Redis threads in different
# CPUs, but also in order to make sure that multiple Redis instances running
# in the same host will be pinned to different CPUs.
#
# Normally you can do this using the "taskset" command, however it is also
# possible to this via Redis configuration directly, both in Linux and FreeBSD.
#
# You can pin the server/IO threads, bio threads, aof rewrite child process, and
# the bgsave child process. The syntax to specify the cpu list is the same as
# the taskset command:
#
# Set redis server/io threads to cpu affinity 0,2,4,6:
# server_cpulist 0-7:2
#
# Set bio threads to cpu affinity 1,3:
# bio_cpulist 1,3
#
# Set aof rewrite child process to cpu affinity 8,9,10,11:
# aof_rewrite_cpulist 8-11
#
# Set bgsave child process to cpu affinity 1,10,11
# bgsave_cpulist 1,10-11

# In some cases redis will emit warnings and even refuse to start if it detects
# that the system is in bad state, it is possible to suppress these warnings
# by setting the following config which takes a space delimited list of warnings
# to suppress
#
# ignore-warnings ARM64-COW-BUG

参考资料

[1] ☆ jemalloc 源码分析
[2] Jemalloc 内存分配与优化实践
[3] ptmalloc、tcmalloc与jemalloc对比分析

[4] redis源码阅读——内存分配完全解析
[5] 关于redis源码的内存分配,jemalloc,tcmalloc,libc
[6] redis.conf
[7] 浅谈Redis源码—内存分配zmalloc
[8] redis源码阅读——内存分配完全解析
[9] Redis内存模型
[10] 深入学习Redis（1）：Redis内存模型