Me, Myself, and Taco: [Scheduler] Behind The Scene

專門用語：

使用打字機（Courier New）字體表示原始碼。
task/process即為Linux中的thread。
文中的Linux與Linux kernel指的皆為作業系統核心。
使用斜體字型標示Linux中的資料結構，macro，變數或函式名稱。
PID即為TID。

Kernel版本：

v2.6.24.3

本篇要討論的是scheduler背後幫助我們更容易了解scheduler如何運作的一些細節，主要以解釋下列兩個問題的方式進行：

Kernel以什麼為依據決定將task放runqueue¹中哪個位置？
在沒有preemption發生的時候kernel怎麼知道什麼時候該讓current task釋出CPU給其他task使用？

Kernel以什麼為依據決定將task放runqueue中哪個位置？
版本2.6.22之後的kernel在runqueue中以新增的cfs_rq紅黑樹結構存放schedule entity（SE），也就是紀錄與每個task執行相關的時間數據。Kernel在runqueue新增資料的時候以SE中的vruntime為比較依據，與樹中各SE的vruntime相比，帶有小vruntime的SE往左邊擺，反之帶有大vruntime的se則往右邊擺，直到新的SE本身成為最左邊或最右邊的node為止。SE中的vruntime的型態為64-bit integer代表的意義是在第vruntime個tick²之前該SE代表的task要取得CPU執行工作，是kernel用來確保schedule公平性的新做法。

當新task被產生出來以後task_new_fair函式先透過place_entity函式取得新task的vruntime。


place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
        u64 vruntime;

        vruntime = cfs_rq->min_vruntime;

        if (sched_feat(TREE_AVG)) {
                struct sched_entity *last = __pick_last_entity(cfs_rq);
                if (last) {
                        vruntime += last->vruntime;
                        vruntime >>= 1;
                }
        } else if (sched_feat(APPROX_AVG) && cfs_rq->nr_running)
                vruntime += sched_vslice(cfs_rq)/2;

        /*
         * The 'current' period is already promised to the current tasks,
         * however the extra weight of the new task will slow them down a
         * little, place the new task so that it fits in the slot that
         * stays open at the end.
         */
        if (initial && sched_feat(START_DEBIT))
                vruntime += sched_vslice_add(cfs_rq, se);

        if (!initial) {
                /* sleeps upto a single latency don't count. */
                if (sched_feat(NEW_FAIR_SLEEPERS))
                        vruntime -= sysctl_sched_latency;

                /* ensure we never gain time by being placed backwards. */
                vruntime = max_vruntime(se->vruntime, vruntime);
        }

        se->vruntime = vruntime;
}

cfs_rq->min_vruntime其實就是整個紅黑樹中最左邊schedule entity的vruntime，在kernel呼叫__update_curr函式時更新，而整個紅黑樹中每個schedule entity的vruntime會隨著時間而增加。

取得新task的vruntime之後task_new_fair函式再間接透過enqueue_task_fair函式呼叫__enqueue_entity函式將新task的schedule entity放入cfs_rq，vruntime正是在__enqueue_entity函式中被拿來當作決定新task該放在cfs_rq紅黑樹中的哪個位置。


static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
        struct rb_node *parent = NULL;
        struct sched_entity *entry;
        s64 key = entity_key(cfs_rq, se);
        int leftmost = 1;

        /*
         * Find the right place in the rbtree:
         */
        while (*link) {
                parent = *link;
                entry = rb_entry(parent, struct sched_entity, run_node);
                /*
                 * We dont care about collisions. Nodes with
                 * the same key stay together.
                 */
                if (key < entity_key(cfs_rq, entry)) {
                        link = &parent->rb_left;
                } else {
                        link = &parent->rb_right;
                        leftmost = 0;
                }
        }

        /*
         * Maintain a cache of leftmost tree entries (it is frequently
         * used):
         */
        if (leftmost)
                cfs_rq->rb_leftmost = &se->run_node;

        rb_link_node(&se->run_node, parent, link);
        rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
}

在已知kernel_init為kernel中第一支thread也是當runqueue中沒有其他user threads時所跑的thread的情況下，我們可以合理地推測kernel_init的vruntime應該相當大。

在沒有preemption發生的時候kernel怎麼知道什麼時候該讓current task釋出CPU給其他task使用？
timer interrupt會輾轉透過update_process_times函式呼叫scheduler_tick函式，在2.6.22以前的版本中 scheduler_tick函式會遞減current task的time slice直到time_slice為零或是達到單次允許執行的長度（利用TIMESLICE_GRANULARITY macro取得）時便將current task的flag加上表示該讓scheduler重新挑選task讓CPU執行的數值（TIF_NEED_RESCHED）。但是在2.6.22之後版本的scheduler_tick函數中卻沒有這些機制，這讓我合理的懷疑新版的kernel完全摒棄time slice機制，新版scheduler_tick函式的原始碼如下。

新版的scheduler_tick函式輾轉透過task_tick_fair函式（scheduler_tick函式的第18行，row 3493）與entity_tick函式呼叫check_preempt_tick函式判斷current task是否該將CPU讓出以執行其他task。在版本2.6.22之後的kernel中的time slice成為如vruntime一般的抽象的概念，單純透過比較數字大小決定是否要將TIF_NEED_RESCHED加入task flag，因此kernel也不再花力氣在task_struct中maintain time slice。

sched_slice函式會根據目前runqueue中task的數量決定每個task的最佳執行時間長度⁴與current task使用CPU時間³相比較，當current task使用CPU時間已超過它的最佳執行時間長度kernel便會將current task標示為need reschedule。current task的sum_exec_runtime則是在每次scheduler_tick函式被timer interrupt觸發的時候更新。

看過code之後發現與2.6.22以前的原始碼比較起來新版的scheduler用了比較少magic number，至於效能上改進多少我也就無法得知，留給有緣人去研究。

剛才不論是在討論task的vruntime還是單次執行時間長度，計算過程中都使用到schedule entity中的load欄位，至於這個load究竟是什麼東西又是從何而來，我們就留給下篇去討論吧。

1 正確來說應該是runqueue中的cfs_rq，雖然說將task放入runqueue其實是將代表task的SE放入runqueue的cfs_rq欄位。
2 tick用於紀錄timer interrupt signals的次數，是Linux中最小的時間單位，決定於define值HZ，此版本中的預設值為250。
3 從最後一次被scheduler挑選出來開始使用CPU的時間開始計算。
4 sched_slice函式中也用到了一些magic numbers。

Me, Myself, and Taco

Search Google

Wednesday, July 09, 2008

[Scheduler] Behind The Scene

No comments:

Online counter

Followers

Contact me @

小企鵝時鐘第二發

1k words

Labels

Blog Archive

About Me

Friendly Links

Google Analytics