hardCore 进程管理

整体设计

整个hardCore的进程部分由uCore扩展而来，我们主要扩展的目标有以下几个：

实现对内核线程的支持，使得用户态可以使用多线程
扩展OS支持的调度算法：
1. 基于红黑树实现linux2.6版本中经典进程调度器CFS
2. stride调度器
3. RR调度器
增加用户态的信号量支持，搭配内核线程在用户态实现生产者消费者，读者写者等经典线程同步算法
提高OS进程的可靠性，在用户态因为爆栈等错误退出的时候不会导致OS崩溃
实现top等用户命令，扩展shell功能支持用户态进程后台执行

当前已经实现了内核态进程以及CFS调度器，下文详细的阐述了这两个部分的实现过程，下一阶段主要想实现的内容就是信号量以及用户态程序的支持，从而构建一个完整的OS。

用户态内核进程实现

linux 实现参考

How Stack or memory is allocated for threads under the same process in Linux

The current ‘thread’ concept in Linux is the NPTL one. NPTL usesclone(), which wrapssys_clone(). Allocating a stack for a new ‘thread’ is handled in the user space (ie. libc), not in kernel (ie. Linux). A library can allocate a stack using the allocation of choice (eg.malloc) and then callclone()passing this address as the stack (of course, needs to pass the top of the allocated region, since stacks grow downwards on most platforms):

Unlikefork(),clone()allows the child process to share parts of its execution context with the calling process, such as the memory space, the table of file descriptors, and the table of signal handlers….

The main use ofclone()is to implement threads: multiple threads of control in a program that run concurrently in a shared memory space.

When the child process is created withclone(), it executes the functionfn(arg)…

The child_stack argument specifies the location of the stack used by the child process …

If you want to learn more specific details, open the source of your distropthread_createimplementation and get reading.

For examplepthread_create.c:

int
__pthread_create_2_1 (newthread, attr, start_routine, arg)
  ...
  struct pthread *pd = NULL;
  int err = ALLOCATE_STACK (iattr, &pd);

andallocatestack.c:

# define ALLOCATE_STACK(attr, pd) allocate_stack (attr, pd, &stackaddr)
static int
allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
   ALLOCATE_STACK_PARMS)
...

You’ll see that stack allocation has some whistles and bells, like caching and reusing stack regions, guard pages, but in the end is just a memory region allocated in user space.

uCore 实现

在 linux 中，thread 通过 clone 调用实现，但是分配栈空间是用户空间关心的事，比如使用 malloc 申请之类的。

在 uCore 中并没有实现用户态的 malloc，所以暂时无法使用 malloc 在堆区为一个新的进程开辟一个栈空间，但这并非就是无法实现进程。

于是我们换了另外一种方式就是挪用调用 clone 进程的栈空间。linux 默认为每个进程分配 1M 的栈空间，一旦某个进程调用 clone，我们将该调用进程 1M 的栈空间等分为 16 份，取出一块空闲的栈空间分配给该子线程。针对可能出现递归调用 phthread_create 的情况，不断向上找到第一个调用 clone，然后栈被分割成 16 份的进程，从该进程处获取栈空间。

子线程退出时会向该进程归还所使用的栈空间，所以包含该第一个调用 clone 的进程在内（下文统一称为祖宗线程），同一时间进程里可以存在 16 个线程。

PCB 支持

在 PCB 中增加以下三个变量用于支持内核线程

Type	name	meaning
int	is_thread	标志该进程是否是一个子线程
int	stack_num	标志该子线程占用了父进程的哪一个栈帧，is_thread = 1 才有效
int [ ]	stack[MAX_THREAD]	每个主进程能够开启 16 个线程（包括主线程（自己）在内），每个块为 0 表示该块的栈没有被占用，不为 0 表示被占用，且值是该子线程的 pid

特别说明只有祖宗线程和普通进程 is_thread 为 0，其他子线程该值都是 1。

clone 调用

int do_clone(void *(*fn)(void *), void *arg, void (*exit)(int))
{
    int ret = -E_NO_FREE_PROC;
    struct proc_struct *proc;
    if (nr_process >= MAX_PROCESS)
        goto fork_out;
    ret = -E_NO_MEM;
    // 新建一个空进程描述符
    if ((proc = alloc_proc()) == NULL)
        goto fork_out;
    // 设置线程名称为 父线程名称-t
    int i = 0;
    for (i = 0; current->name[i] != '\0'; i++)
        proc->name[i] = current->name[i];
    proc->name[i] = '-';
    proc->name[i + 1] = 't';
    proc->name[i + 2] = '\0';
    proc->is_thread = 1; //标志该进程是一个子线程
    // 针对可能出现递归调用pthread_create的情况，找到不为线程的主进程
    struct proc_struct *father = current;
    while (father->is_thread)
        father = father->parent;
    // 如果不设置线程归属于调用clone的线程，直接指向主线程会导致子线程中没法调用join来等待
    proc->parent = current;
    // 在主线程里面找一块栈分配给该子线程
    proc->stack_num = 1;
    for (; proc->stack_num < MAX_THREAD; proc->stack_num++)
        if (father->stack[proc->stack_num] == 0)
        {
            father->stack[proc->stack_num] = 1;
            break;
        }
    assert(proc->stack_num != MAX_THREAD);
    assert(current->wait_state == 0);
    // 设置内核栈
    if (setup_kstack(proc) != 0)
        goto bad_fork_cleanup_proc;
    // 文件系统直接指向父进程
    if (copy_fs(CLONE_FS, proc) != 0)
        goto bad_fork_cleanup_kstack;
    // mm 直接指向父进程， 一个用户进程有1MB的栈空间
    // 256页，一个线程给16页，包括原有的主线程的话，能开16个线程
    if (copy_mm(CLONE_VM, proc) != 0)
        goto bad_fork_cleanup_fs;
    // !! 注意这个地址不是这个线程的地址，是上一个线程栈的栈底地址
    // 给线程分配栈,一个进程16页
    uint32_t thread_stack_top = USTACKTOP - (proc->stack_num) * 16 * PGSIZE;
    // 预先给两页给新分配的线程
    assert(pgdir_alloc_page(proc->mm->pgdir, thread_stack_top - PGSIZE, PTE_USER) != NULL);
    assert(pgdir_alloc_page(proc->mm->pgdir, thread_stack_top - 2 * PGSIZE, PTE_USER) != NULL);
    proc->tf = (struct trapframe *)(proc->kstack + KSTACKSIZE) - 1;
    struct trapframe *tf = proc->tf;
    memset(tf, 0, sizeof(struct trapframe));
    tf->tf_cs = USER_CS;
    tf->tf_ds = tf->tf_es = tf->tf_ss = USER_DS;
    // 把栈往上抬4个字节,才开始放东西，否则栈会越界
    // 先放exit的参数为0
    *(uint32_t *)(thread_stack_top - 1 * sizeof(uint32_t)) = (uint32_t)0;
    // 压入线程函数的参数地址
    *(uint32_t *)(thread_stack_top - 2 * sizeof(uint32_t)) = (uint32_t)arg;
    // 压入上一个函数的返回地址为 exit ，保证函数结束并没有显式调用 exit 时系统能帮助该线程退出。
    *(uint32_t *)(thread_stack_top - 3 * sizeof(uint32_t)) = exit;
    tf->tf_esp = thread_stack_top - 3 * sizeof(uint32_t);
    // 设置 eip 指向当前函数开始执行
    tf->tf_eip = fn;
    tf->tf_eflags = FL_IF;
    ret = 0;
    tf->tf_regs.reg_eax = 0;
    tf->tf_eflags |= FL_IF;
    // context 在调度的时候会弹出 tf 内的寄存器，恢复程序执行
    proc->context.eip = (uintptr_t)forkret;
    proc->context.esp = (uintptr_t)(proc->tf);
    bool intr_flag;
    local_intr_save(intr_flag);
    {
        proc->pid = get_pid();
        hash_proc(proc);
        set_links(proc);
    }
    local_intr_restore(intr_flag);
    wakeup_proc(proc);
    // 重新设置父线程的栈被该子线程占用
    father->stack[proc->stack_num] = proc->pid;
    ret = proc->pid;
fork_out:
    return ret;
bad_fork_cleanup_fs: //for LAB8
    put_fs(proc);
bad_fork_cleanup_kstack:
    put_kstack(proc);
bad_fork_cleanup_proc:
    kfree(proc);
    goto fork_out;
}

注意因为用户有可能不会在线程执行的函数结束返回时调用 exit，所以 clone 的一个任务就是帮助用户将线程栈栈底的返回地址设置为 sys_exit (不是直接 do_exit 因为该线程执行在用户态，只有通过系统调用才能执行内核态代码)
最后的线程栈栈帧如图所示

祖宗线程 exit 时守护

若祖宗线程没有调用 join 来等待子线程结束工作，祖宗线程会直接退出，虽然这样内存空间的引用计数不为 0，不会释放他们共用的内存空间，剩余线程还能继续执行，但是所有栈的分配信息都存在祖宗线程的 PCB 表中，如果祖宗线程一旦释放，子进程再想调用 phthread_create 新建子进程会导致失败。于是必须在祖宗线程调用 sys_exit 时守护所有子线程退出后祖宗线程才能退出。

static int
sys_exit(uint32_t arg[]) {
    int error_code = (int)arg[0];
    if (is_ancestral_thread(current)) //祖宗进程只有在所有子线程全部释放后才能退出
        while (current_have_kid())    //只要有儿子就等着
            do_wait(0, NULL);
    return do_exit(error_code);
}

线程一致性维护

linux 上的线程就是基于轻量级进程, 由用户态的 pthread 库实现的.使用 pthread 以后, 在用户看来, 每一个 task_struct 就对应一个线程, 而一组线程以及它们所共同引用的一组资源就是一个进程。但是, 一组线程并不仅仅是引用同一组资源就够了, 它们还必须被视为一个整体。当”进程”收到一个致命信号(比如由于段错误收到 SIGSEGV 信号), 对应的这一组 task_struct 将全部退出。这是 POSIX 对线程实现提出的要求，如果某个线程”挂”了, 整个进程还在若无其事地运行着, 可能会出现很多的不一致状态. 进程将不是一个整体, 而线程也不能称为线程。

在 uCore 中我们也实现了该线程的一致性

case T_PGFLT:  //page fault
    if ((ret = pgfault_handler(tf)) != 0) {
        print_trapframe(tf);
        if (current == NULL) {
            panic("handle pgfault failed. ret=%d\n", ret);
        }
        else {
            if (trap_in_kernel(tf)) {
                panic("handle pgfault failed in kernel mode. ret=%d\n", ret);
            }
            cprintf("killed by kernel.\n");
            // panic("handle user mode pgfault failed. ret=%d\n", ret);
            kill_all_thread();
        }
    }
    break;

比如有一个线程缺页异常的话，对应的这组进程都会被杀死。

CFS进程调度实现

其他两个调度器的实现比较简单，这里主要研究CFS调度器是如何实现的

CFS调度器介绍

cfs 定义了一种新的模型，它给 cfs_rq（cfs的run queue）中的每一个进程安排一个虚拟时钟，vruntime。如果一个进程得以执行，随着时间的增长（也就是一个个tick的到来），其vruntime将不断增大。没有得到执行的进程vruntime不变。而调度器总是选择vruntime跑得最慢的那个进程来执行。这就是所谓的“完全公平”。为了区别不同优先级的进程，优先级高的进程vruntime增长得慢，以至于它可能得到更多的运行机会。

如果分配给进程的运行时间不等于实际运行的时间时：CFS的思想就是让每个调度实体的vruntime增加速度不同，权重越大的增加的越慢，这样高优先级进程就能获得更多的cpu执行时间，而vruntime值较小者也得到执行。

每一个进程或者调度组都对应一个调度的实体，每一个进程都通过调度实体与CFS运行对列建立联系，每次进行CFS调度的时候都会在CFS运行对列红黑树中选择一个进程（vruntime值较小者）。cfs_rq代表CFS运行对列，它可以找到对应的红黑树。进程task_struct ，可以找到对应的调度实体。调度实体sched_entity对应运行对列红黑树上的一个节点。

调度器实现

运行队列中的进程/线程会频繁进行插入和删除。为了效率考虑，CFS调度队列我们需要使用红黑树来实现，每次选择最小vruntime的节点执行（即选择这颗红黑树上的最左节点）将其从这颗红黑树中删除，调度到CPU上执行。因为C语言中没有提供面向对象的编程方法，我们借用linux内核中提供的红黑树的基础库来实现我们自己的cfs红黑树。简单介绍一下linux内核里的红黑树。

Linux有很多地方用到了红黑树，比如高精度计时器使用红黑树树组织定时请求，EXT3文件系统也使用红黑树树来管理目录，虚拟存储管理系统也有用红黑树树进行VMAs（Virtual Memory Areas）的管理。Linux内核红黑树的实现与传统的实现方式有些不同，它对针对内核对速度的需要做了优化。每一个rb_node节点是嵌入在用RB树进行组织的数据结构中，而不是用rb_node指针进行数据结构的组织。对于CFS运行队列这颗红黑树的节点 cfs_node，我们只要在里面内嵌一个 rb_node 节点再加一个指向该进程PCB的指针，我们就可以利用linux提供的红黑树操作函数来封装出一个我们自己的CFS调度队列。

对于一个挂在CFS队列里面的进程，我们需要实现的方法有

#ifndef KERN_SCHEDULE_CFS_RBTREE_H
#define KERN_SCHEDULE_CFS_RBTREE_H
#include
struct cfs_node
{
    struct rb_node node;
    struct proc_struct *proc;
};
struct cfs_node *cfs_search(struct rb_root *root, struct proc_struct *proc);
int cfs_insert(struct rb_root *root, struct proc_struct *proc);
void cfs_node_free(struct cfs_node *node);
struct proc_struct * cfs_find_min(struct rb_root *root);
int compare_cfs_node(struct proc_struct *a, struct proc_struct *b);
#endif

一个节点的插入和删除，找最小（就是找最左子节点），还有节点值的大小比较。这里值得关注的函数就是节点值大小比较。在 linux 中的红黑树中，若两个节点的 value 大小相等会放弃插入，但是连续插入的两个 vruntime 很可能相同，针对这种情况，重新规定红黑树的排序规则

vruntime 不同时比较 vruntime ，vruntime大者大。vruntime 相同时比较 pid ，pid大者大. 因为树里面不可能有两个相同的进程,所以该排序可以保证线序关系。

下面简单的去实现这个红黑树节点的方法

int compare_cfs_node(struct proc_struct *a, struct proc_struct *b)
{
  if (a->vruntime != b->vruntime)
  {
    if (a->vruntime > b->vruntime)
      return 1;
    else
      return -1;
  }
  else
  {
    if (a->pid > b->pid)
      return 1;
    else
      return -1;
  }
  return 0;
}
struct cfs_node *cfs_search(struct rb_root *root, struct proc_struct *proc)
{
  struct rb_node *node = root->rb_node;
  while (node)
  {
    struct cfs_node *data = container_of(node, struct cfs_node, node);
    if (data->proc->pid == proc->pid)
      return data;
    if (compare_cfs_node(data->proc, proc))
      node = node->rb_left;
    else
      node = node->rb_right;
   }
  return NULL;
}
int cfs_insert(struct rb_root *root, struct proc_struct *proc)
{
  struct cfs_node *data = (struct cfs_node *)kmalloc(sizeof(struct cfs_node));
  data->proc = proc;
  struct rb_node **new = &(root->rb_node), *parent = NULL;
  /* Figure out where to put new node */ 
  while (new)
  {
    struct cfs_node this = container_of(new, struct cfs_node, node);
    int result = compare_cfs_node(data->proc, this->proc);
    parent = *new;
    if (result < 0)
      new = &((*new)->rb_left);
    else if (result > 0)
      new = &((*new)->rb_right);
    else
      return 0;
  }
  /* Add new node and rebalance tree. */
  rb_link_node(&data->node, parent, new);
  rb_insert_color(&data->node, root);
  return 1;
}
struct proc_struct *cfs_find_min(struct rb_root *root)
{
  struct rb_node *node = root->rb_node;
  if (node == NULL)
    return NULL;
  while (node->rb_left)
    node = node->rb_left;
  struct cfs_node *data = container_of(node, struct cfs_node, node);
  return data->proc;
}
void cfs_node_free(struct cfs_node *node)
{
  if (node != NULL)
  {
  node->proc = NULL;
  kfree(node);
  node = NULL;
  }
}

下面开始实现CFS调度器

首先是调度器初始化函数, 初始化红黑树根节点，并设置调度队列里面的进程数量为0

static void
cfs_init(struct run_queue *rq)
{
  list_init(&(rq->run_list));
  rq->cfs_rb_tree = RB_ROOT;
  rq->proc_num = 0;
}

之后是插入进程，将进程插入红黑树，并且增加该进程里面的进程数量

static void
cfs_enqueue(struct run_queue *rq, struct proc_struct *proc)
{
  cfs_insert(&(rq->cfs_rb_tree), proc);
  if (proc->time_slice == 0 || proc->time_slice > rq->max_time_slice)
    proc->time_slice = rq->max_time_slice;
  proc->rq = rq;
  rq->proc_num++;
}

然后是把一个待运行的进程从运行队列里面出队，根据PCB找到CFS节点数据结构的位置。销毁创建的CFS节点，减少进程计数。

static void
cfs_dequeue(struct run_queue *rq, struct proc_struct *proc)
{
  struct cfs_node *data = cfs_search(&(rq->cfs_rb_tree), proc);
  if (data)
  {
    rb_erase(&data->node, &(rq->cfs_rb_tree));
    cfs_node_free(data);
    rq->proc_num--;
  }
}

然后是每一次时钟中断的调用，这里是我们实现的时候与 linux 实现不一样的地方。在标准linux 的实现中，若当前运行的 proc 的 vruntime 已经不是队列里面最小的了（实际上会导致频繁调度，所以一般设置一个阈值）超过这个限制会直接触发调度，但是我们为了简单和一定程度上的效率考虑只是通过 vruntime 进行选择，而不是使用 vruntime 进行抢占调度。

static void
cfs_proc_tick(struct run_queue *rq, struct proc_struct *proc)
{
  // 实际上CFS的实现：若 proc->vruntime 已经不是队列里面最小的了（实际上会导致频繁调度，所以一般设置一个阈值）
  // 超过这个限制会直接触发调度
  // 此时的进程不在红黑树中，修改 proc->vruntime 不会破坏红黑树的性质。 但是还是需要完善上述 CFS
  // 仍有运行时间则减少其运行时间，计算虚拟运行时间
  assert(proc->cfs_prior != 0);
  proc->vruntime += proc->cfs_prior;
  if (proc->time_slice > 0)
    proc->time_slice--;
  // 时间片用完则标记该进程需要调度
  if (proc->time_slice == 0)
    proc->need_resched = 1;
}

对一些问题的研究

用户态进程是如何退出的

以 stackOverflow.c 这个程序为例, 在编译的时候总共涉及 gcc 的调用和 ld 的链接操作

1
2

gcc -Iuser/ -fno-builtin -fno-PIC -Wall -ggdb -m32 -gstabs -nostdinc  -fno-stack-protector -Ilibs/ -Iuser/include/ -Iuser/libs/ -c user/stackOverFlow.c -o obj/user/stackOverFlow.o
ld -m    elf_i386 -nostdlib -T tools/user.ld -o obj/__user_stackOverFlow.out  obj/user/libs/panic.o obj/user/libs/syscall.o obj/user/libs/ulib.o obj/user/libs/semaphore.o obj/user/libs/initcode.o obj/user/libs/file.o obj/user/libs/stdio.o obj/user/libs/dir.o obj/user/libs/umain.o obj/user/libs/pthread.o obj/user/libs/proc.o  obj/libs/rbtree.o obj/libs/string.o obj/libs/printfmt.o obj/libs/hash.o obj/libs/rand.o obj/user/stackOverFlow.o

之所以选这个程序是因为简单，使用 objdump 反汇编第一条 gcc 编译指令执行完毕生成的 stackOverFlow.o 文件的 text 段如下所示：
注意现在还没有到链接的阶段，所有 text 段代码的起始位置都是 0 地址处：

Disassembly of section .text:
00000000 <loop>:
   0:    f3 0f 1e fb             endbr32
   4:    55                      push   %ebp
   5:    89 e5                   mov    %esp,%ebp
   7:    83 ec 08                sub    $0x8,%esp
   a:    e8 fc ff ff ff          call   b <loop+0xb>
   f:    90                      nop
  10:    c9                      leave
  11:    c3                      ret
00000012 <main>:
  12:    f3 0f 1e fb             endbr32
  16:    55                      push   %ebp
  17:    89 e5                   mov    %esp,%ebp
  19:    83 e4 f0                and    $0xfffffff0,%esp
  1c:    e8 fc ff ff ff          call   1d <main+0xb>
  21:    b8 00 00 00 00          mov    $0x0,%eax
  26:    c9                      leave
  27:    c3                      ret

下一条链接指令会将 user/lib 下的所有 .o 文件与 stackOverFlow.o 一起进行链接操作
其中链接的 obj/user/libs/initcode.o 是一段汇编代码

.text
.globl _start
_start:
    # set ebp for backtrace
    movl $0x0, %ebp
    # load argc and argv
    movl (%esp), %ebx
    lea 0x4(%esp), %ecx

    # move down the esp register
    # since it may cause page fault in backtrace
    subl $0x20, %esp
    # save argc and argv on stack
    pushl %ecx
    pushl %ebx
    # call user-program function
    call umain
1:  jmp 1b

结合链接脚本中的ENTRY(_start)可知在链接完成后 initcode.o 是最先被执行的，该段汇编指令调用了 umain 函数，定义在 umain.o 里。
umain 打开了输入输出流两个文件描述符，并调用函数 main （注意：umain 上面的 main 函数只是声明，在链接的时候会把 stackOverFlow.o 里面的 main 函数实现给链接上，main 函数执行完成后 umain 调用系统调用 exit 退出）

int main(int argc, char *argv[]);
static int
initfd(int fd2, const char *path, uint32_t open_flags) {
    int fd1, ret;
    if ((fd1 = open(path, open_flags)) < 0) {
        return fd1;
    }
    if (fd1 != fd2) {
        close(fd2);
        ret = dup2(fd1, fd2);
        close(fd1);
    }
    return ret;
}
void
umain(int argc, char *argv[]) {
    int fd;
    if ((fd = initfd(0, "stdin:", O_RDONLY)) < 0) {
        warn("open <stdin> failed: %e.\n", fd);
    }
    if ((fd = initfd(1, "stdout:", O_WRONLY)) < 0) {
        warn("open <stdout> failed: %e.\n", fd);
    }
    int ret = main(argc, argv);
    exit(ret);
}

截取几段链接完成后的反汇编代码

Disassembly of section .text:
00800020 <__panic>:
#include <stdio.h>
#include <ulib.h>
#include <error.h>
void
__panic(const char *file, int line, const char *fmt, ...) {
  800020:    f3 0f 1e fb             endbr32
  800024:    55                      push   %ebp
  800025:    89 e5                   mov    %esp,%ebp
  800027:    83 ec 18                sub    $0x18,%esp

可以看到代码段真如 ld 脚本规定的从 0x00800020 地址处开始

0080066b <_start>:
.text
.globl _start
_start:
    # set ebp for backtrace
    movl $0x0, %ebp
  80066b:    bd 00 00 00 00          mov    $0x0,%ebp
    # load argc and argv
    movl (%esp), %ebx
  800670:    8b 1c 24                mov    (%esp),%ebx
    lea 0x4(%esp), %ecx
  800673:    8d 4c 24 04             lea    0x4(%esp),%ecx

最后操作系统在执行这段代码的时候会根据 elf 的格式将 eip 设置为 0x0080066b 从这开始执行。

为何代码不从 0 地址开始

The stack, which is usually quite small but could grow quite dramatically in some occasions. The stack grows down, and when the stack is full, we really want the process to predictably crash rather than overwriting some data. So there had to be a wide area for the stack, with, at the low end of that area, an unmapped page. And lo! There is an unmapped page at address zero, to catch null pointer dereferences. Hence it was defined that the stack would get the first 128 MB of address space, except for the first page. This means that the code had to go after those 128 MB, at an address similar to 0x080xxxxx.

As Michael points out, “losing” 128 MB of address space was no big deal because the address space was very large with regards to what could be actually used. At that time, the Linux kernel was limiting the address space for a single process to 1 GB, over a maximum of 4 GB allowed by the hardware, and that was not considered to be a big issue.

第一个用户进程 sh

第一个用户进程是从int pid = kernel_thread(user_main, NULL, 0);而来。

int
kernel_thread(int (*fn)(void *), void *arg, uint32_t clone_flags) {
    struct trapframe tf;
    memset(&tf, 0, sizeof(struct trapframe));
    tf.tf_cs = KERNEL_CS;
    tf.tf_ds = tf.tf_es = tf.tf_ss = KERNEL_DS;
    tf.tf_regs.reg_ebx = (uint32_t)fn;
    tf.tf_regs.reg_edx = (uint32_t)arg;
    tf.tf_eip = (uint32_t)kernel_thread_entry;
    return do_fork(clone_flags | CLONE_VM, 0, &tf);
}

送了这个即将执行的 user_main 函数一个内核栈帧和一个 kernel_thread_entry 函数（该函数定义 user_main 执行完成后会自动调用 do_wait）然后就把 user_main 进程给 fork 出来了（此时还在内核态）。
最后 user_main 进程执行了 user_main 函数，引发一个中断调用 exec，把这个进程掏空，该进程变成了进程 sh ，注意在从文件系统载入代码的 load_icode 函数中，在中断栈帧中把 user_main 的段子换成了用户态的段子，只要中断一返回，该进程回到了用户态

虽然上面定义了 user_main 执行完成后会自动调用 do_wait ，但一个 exec 调用过后， eip 指针的值已经被修改，sh 已经和 kernel_thread_entry 没关系了。所以最后 sh 若要退出需要通过中断调用（用户进程也没法直接执行 do_wait）, 具体如何进行的中断调用仍需要研究。

对 top 指令的结果分析

进程数量等于 os 内的全局变量 nr_process 这个变量包含了 idle 进程所以总共 4 个没问题。

init 进程从其创建，第一次执行 do_wait 的时候，就因为其有 user_main（该进程创造 sh 后结束）这个进程而陷入了 sleep ，只要其子进程（sh）不死（就是变 Zombie，sleep 都没用），永远不会唤醒该 init 进程来回收资源（其他父线程托孤的时候可能唤醒 init）。

while (do_wait(0, NULL) == 0) {
    schedule();
}
if (haskid) {
    current->state = PROC_SLEEPING;
    current->wait_state = WT_CHILD;
    schedule();
    if (current->flags & PF_EXITING) {
        do_exit(-E_KILLED);
    }
    goto repeat;
}

idle 的调度是被特判过的, 运行队列里面没有 idleproc ， OS 在没有进程可供调度的时候选择进程 idleproc，自然其 vruntime 不会被计算

static void
sched_class_proc_tick(struct proc_struct *proc) {
    if (proc != idleproc) {
        sched_class->proc_tick(rq, proc);
    }
    else {
        proc->need_resched = 1;
    }
}

sh 是在等待输入的函数dev_stdin_read主动调用schedule()放弃时间片，所以在间隔两次输入top指令的中间，sh 一直在等待输入，并没有被调度，所以两次间隔后 idle 的调度次数会巨多因为除了 idle ，ucore 没有其他进程可供调度，但 vruntime 是 0（因为特判），sh 的 vruntime 可能会有小幅提升因为调用了 top ，中间处理会给 sh 时间，但一旦 top 开始执行， sh 又会因为等 top 执行完毕被挂起，top 执行完成后又在等输入。。。。

对用户栈的研究

每个用户进程都设置了用户栈为，栈顶是 USTACKTOP - 1 （第一个地址不是），栈大小是 USTACKSIZE 。调用 mm_mmap 函数建立用户栈的 vma 结构，明确用户栈的位置在用户虚空间的顶端，大小为 256 个页，即 1MB。但这个只是设置了 vma 。下面调用 pgdir_alloc_page 分配了 4 页（16KB）的栈空间给用户。

vm_flags = VM_READ | VM_WRITE | VM_STACK;
if ((ret = mm_map(mm, USTACKTOP - USTACKSIZE, USTACKSIZE, vm_flags, NULL)) != 0) {
    goto bad_cleanup_mmap;
}
assert(pgdir_alloc_page(mm->pgdir, USTACKTOP-PGSIZE , PTE_USER) != NULL);
assert(pgdir_alloc_page(mm->pgdir, USTACKTOP-2*PGSIZE , PTE_USER) != NULL);
assert(pgdir_alloc_page(mm->pgdir, USTACKTOP-3*PGSIZE , PTE_USER) != NULL);
assert(pgdir_alloc_page(mm->pgdir, USTACKTOP-4*PGSIZE , PTE_USER) != NULL);

一旦这 4KB 的栈被用完了，那么就会引起缺页异常。

struct vma_struct *vma = find_vma(mm, addr);
if (vma == NULL || vma->vm_start > addr) {
        cprintf("not valid addr %x, and  can not find it in vma\n", addr);
        goto failed;
    }
ptep = get_pte(mm->pgdir, addr, 1))
pgdir_alloc_page(mm->pgdir, addr, perm)

因为 vma 里写的是 1MB 空间，实际只分配给了 16KB ，缺页的时候只要地址在 1MB 的范围内，那么通过缺页的地址是找得到对应的 vma 的，说明这个是个合法地址，os 会自动根据缺页地址新建页表项并分配 page 。当栈的使用超过 1MB，就会找不到 vma 直接报错，这个行为目前会直接崩掉内核，可用考虑后续的优化。

时钟中断

//设置时钟每秒中断100次
    outb(IO_TIMER1, TIMER_DIV(100) % 256);
    outb(IO_TIMER1, TIMER_DIV(100) / 256);
// 通过中断控制器使能时钟中断
    pic_enable(IRQ_TIMER);

每 1s 中断 100 次则一个时间片的时长为 10ms ， usleep(100) 睡眠 1s。