Linux 2.4 核心內部: 行程與中斷管理
2.1 作業結構與行程表
(譯按: task 跟 process 有點難譯,若依意思時而行程,時而作業,相信會讓讀者更糊塗。 故本文將 task 一律譯為作業;而 process 譯成行程。)
在 Linux 下每個行程都動態地配置一個 struct task_struct 結構。
Linux 限制的最大的行程數目取決於實體記憶體的大小,如這個等式
(參見 kernel/fork.c:fork_init()):
/*
* 內定執行緒最大值要比較保險的值:
* 執行緒結構要能使用將近一半的記憶體。
*/
max_threads = mempages / (THREAD_SIZE/PAGE_SIZE) / 2;
在 IA32 架構,基本上就是 num_physpages/4。舉個例子,
在 512 Mb 的機器上,你應能建立 32K 的執行緒。對舊版以 4k-epsilon 限制
(2.2 及更舊) 的核心而來說,這是相當大的改進。
此外,也改變了執行時使用的 KERN_MAX_THREADS sysctl(2),
或使用簡單地 Proc 檔案系統介面讓核心可調整:
# cat /proc/sys/kernel/threads-max
32764
# echo 100000 > /proc/sys/kernel/threads-max
# cat /proc/sys/kernel/threads-max
100000
# gdb -q vmlinux /proc/kcore
核心由 `BOOT_IMAGE=240ac18 ro root=306 video=matrox:vesa:0x118' 產生。
#0 0x0 in ?? ()
(gdb) p max_threads
$1 = 100000
在 Linux 系統上行程位置可由 struct task_struct 結構取得,
有二種連結的方式:
- 是有雜湊表 (hashtable) 的,依行程代碼雜湊,及
- 迂迴的,雙連結表依
p->next_task及p->prev_task指標。
雜湊表稱為 pidhash[] ,它定義在 include/linux/sched.h:
/* 行程代碼 (PID) 雜湊。 (這應是動態的?) */
#define PIDHASH_SZ (4096 >> 2)
extern struct task_struct *pidhash[PIDHASH_SZ];
#define pid_hashfn(x) ((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))
作業的雜湊依它們的行程代碼值及上文的雜湊函式應可平均地分派資源到它們的領域
(0 to PID_MAX-1)。
雜湊表常能依給的行程代碼快速找到作業,使用 include/linux/sched.h
中的 find_task_pid() 式子:
static inline struct task_struct *find_task_by_pid(int pid)
{
struct task_struct *p, **htable = &pidhash[pid_hashfn(pid)];
for(p = *htable; p && p->pid != pid; p = p->pidhash_next)
;
return p;
}
這些作業在每個雜湊列 (如: 雜湊到相同的值) 用
p->pidhash_next/pidhash_pprev 連結,使用 hash_pid() 及 unhash_pid()
來增刪雜湊表中給定的行程。
這些都在讀寫的鎖定保護下以 tasklist_lock 來寫入。
環狀的倍連結列用 p->next_task/prev_task 讓所有的作業能簡單地維護。
這由在 include/linux/sched.h 中的 for_each_task() 巨集來實作:
#define for_each_task(p) \
for (p = &init_task ; (p = p->next_task) != &init_task ; )
使用者用 for_each_task() 能讀 tasklist_lock。
注意 for_each_task() 使用 init_task 標記列示,
這很安全,因為有個閒置的作業 (pid 0) 永不結束。
行程雜湊表的修改者與行程表連結,特別是 fork()、exit() 和 ptrace(),
要用 tasklist_lock 來寫入。
有趣的是寫入者必須要在本地機器 CPU 中取消中斷。
理由倒非不重要: send_sigio() 函式依作業表 (task list) 而取得唯讀的 tasklist_lock
,它再被中斷表 kill_fasync() 呼叫。
這就是為何寫入者在讀取者不需用時要將中斷取消的原因。
現在我們了解到 task_struct 結構如何連接在一起,
讓我們調查 task_struct 的組成。
它有點像 UNIX 中 'struct proc' 與 'struct user' 一起的組合。
其它版本的 UNIX 分一部份作業資訊到隨時都能使用的記憶體中, (稱之為 'struct proc',它包括了行程狀態、排程資訊等等。); 而另一部份只有在行程在執行時才需要 (稱之為 'u area' 包括了檔案描述表、磁碟配額資訊等等。)
只有一個原因會設計地如此拙劣,因為記憶體曾是非常珍貴的資源。 最新的作業系統 (嗯,當下只有 Linux 如此,諸如 FreeBSD 亦朝 Linux 這個方向改進中。) 不必如此分隔並隨時在核心記憶體定址資料結構上維護行程狀態。
task_struct 結構定義在 include/linux/sched.h 目前它的大小為 1680
位元組。
state 欄位如下:
volatile long state; /* -1 不可執行, 0 可執行, >0 停止 */
#define TASK_RUNNING 0
#define TASK_INTERRUPTIBLE 1
#define TASK_UNINTERRUPTIBLE 2
#define TASK_ZOMBIE 4
#define TASK_STOPPED 8
#define TASK_EXCLUSIVE 32
為何 TASK_EXCLUSIVE 定義為 32 而非 16? 因為 16 被 TASK_SWAPPING
使用了,當我去掉所有涉及 TASK_SWAPPING 處時,我也忘了轉換
TASK_EXCLUSIVE。
在 p->state 中的 volatile 宣告意思是它可以被非同步修改
(從中斷操作者上):
- TASK_RUNNING: 意謂這作業 "應該是" 在執行佇列上。
它可能不在執行佇列上的理由是因為:
將一個作業標示為
TASK_RUNNING, 且置放於執行佇列上的過程是非原子化的。 你必須保持runqueue_lock為適合讀取的讀寫旋鎖 (spinlock), 以便看顧執行佇列的運行。 如此一來你會看見每一個在執行佇列上的作業皆維持在TASK_RUNNING狀態。 然而相反論調便駁斥以上理由。 同樣地驅動程式可以將它們自己(更確切是它們執行的處理脈絡)標記為TASK_INTERRUPTIBLE(或者TASK_UNINTERRUPTIBLE) 而稱作schedule(),那將會把它從執行佇列上移除 (除非有未處理的訊號,此時它留在執行佇列中。) - TASK_INTERRUPTIBLE: 意謂作業在休眠,但可由訊號或是計時器到時喚醒。
- TASK_UNINTERRUPTIBLE: 差不多和
TASK_INTERRUPTIBLE相同,除了它不能被喚醒。 - TASK_ZOMBIE: 作業終止,但狀態並未被其父作業蒐集 (親生或領養的)
(或
wait())。 - TASK_STOPPED: 作業已結束,經工作控制訊號或經由 ptrace(2)。
- TASK_EXCLUSIVE: 這不是分割的狀態,
而應是
TASK_INTERRUPTIBLE或TASK_UNINTERRUPTIBLE二者之一。 這意謂著 當此作業與其它作業在等待佇列休眠時, 它將單獨被喚醒而不致造成所有等待者一起醒來的 "激突" (thundering herd) 問題。
作業旗標包含資訊對於行程狀態並不互斥:
unsigned long flags; /* 每個行程旗標,定義如下 */
/*
* 每個行程旗標
*/
#define PF_ALIGNWARN 0x00000001 /* 直行列示警告訊息 */
/* 尚未實作,只有在 486 機器上 */
#define PF_STARTING 0x00000002 /* 開始建立 */
#define PF_EXITING 0x00000004 /* 進行關機 */
#define PF_FORKNOEXEC 0x00000040 /* 開行程 (fork) 但不執行 */
#define PF_SUPERPRIV 0x00000100 /* 用超級使用員權限 */
#define PF_DUMPCORE 0x00000200 /* 傾印核心 (dumped core) */
#define PF_SIGNALED 0x00000400 /* 由訊號 (signal) 刪除 */
#define PF_MEMALLOC 0x00000800 /* 匹配記憶體 */
#define PF_VFORK 0x00001000 /* 用 mm_release 叫醒父行程*/
#define PF_USEDFPU 0x00100000 /* 行程使用該處理器的 FPU (多處理器機種) */
欄位 p->has_cpu、p->processor、p->counter、p->priority、p->policy 及
p->rt_priority 跟行程表有關,將在稍後再提。
欄位 p->mm 和 p->active_mm 依 mm_struct
結構指到各行程的定址空間描述,同時若行程非真的行程 (如核心執行緒)
則活化定址空間。
這有助於作業在排程外時,在切換位址空間上最小的 TLB 更新。
故,若我們排程進核心執行緒 (當沒有 p->mm) 則
next->active_mm 將被設到 prev->active_mm
;而作業被排程出時,
若 prev->mm != NULL 那將被設成 prev->mm。
若旗標 CLONE_VM 通過 clone(2) 系統呼叫或用
vfork(2) 時,位址空能被執行緒間分享。
欄位 p->exec_domain 及 p->personality
為作業的性格,換言之,某些系統呼叫行為用來模擬其它 UNIX 的性格。
欄位 p->fs 包含檔案系統資訊,它在 Linux 下有下列三種資訊:
- 根目錄的目錄進入點 (dentry, directory entry) 及掛載點,
- 替代根目錄的目錄進入點及掛載點,
- 目前工作目錄的目錄進入點及掛載點。
這個結構亦包含一個參考計數,
因為它能被相同作業共享,
當 CLONE_FS 旗標處理過 clone(2) 系統呼叫。
欄位 p->files 包含檔案描述表。
這的確可讓作業共享,假如 CLONE_FILES 是明確跟著 clone(2) 系統呼叫。
欄位 p->sig 包括了信號操作與能用
CLONE_SIGHAND 共享的複制作業。
2.2 作業的建立及終結與核心執行緒
不同的作業系統書籍對 "行程" 有著不同的定義, 開始是 "程式執行中的執行個體 (instance)" 與結束的 "經由 clone(2) 或 fork(2) 系統呼叫所產生的產物"。 在 Linux 下,有三種行程:
- 閒置執行緒,
- 核心執行緒,
- 使用者作業。
閒置執行緒由編譯時建立給首顆中央處理器;
它"手動地"建立給每個處理器依架構特性
fork_by_hand() 在 arch/i386/kernel/smpboot.c
為了第一顆中央處理器閒置執行緒建立在編譯時期;
它當時 "手動地" 為各不同架構規格中央處理器建立 fork_by_hand()
見於 arch/i386/kernel/smpboot.c,
(在一些 CPU 架構) 由 hand 展開 fork(2) 系統呼叫。
閒置作業共享依 init_task 結構但亦依私有的 TSS 結構,在每顆 CPU 陣列的 init_tss。
閒置作業全有 pid = 0 且沒有其它作業能分享 pid,
換言之使用 CLONE_PID 旗標至 clone(2)。
核心執行緒使用 kernel_thread() 函式建立,核心模式喚起
clone(2) 系統呼叫。
核心執行緒通常沒有使用者位址空間,即 p->mm = NULL,
因為它們明確地 exit_mm(),例如經由 daemonize() 函式。
核心執行緒總能直接存取核心位址空間。它們分派 pid 數字在低範圍。
在處理器的 ring 0 (x86 系列如此) 意謂核心執行緒享有全部 I/O 特權
且不能被排程清空。
使用者作業由 clone(2) 或 fork(2) 系統呼叫建立, 內部訴諸 kernel/fork.c:do_fork()。
讓我們了解當使用者行程建立一個 fork(2) 系統呼叫時發生什麼事。
僅管 fork(2) 與架構相關,由不同的路通過使用者堆疊與註冊 (register)
實際之下的 do_fork() 函式是可移植的,位於 kernel/fork.c。
依下面的步驟來做:
- Local variable
retvalis set to-ENOMEM, as this is the value whicherrnoshould be set to if fork(2) fails to allocate a new task structure. - If
CLONE_PIDis set inclone_flagsthen return an error (-EPERM), unless the caller is the idle thread (during boot only). So, normal user threads cannot passCLONE_PIDto clone(2) and expect it to succeed. For fork(2), this is irrelevant asclone_flagsis set toSIFCHLD- this is only relevant whendo_fork()is invoked fromsys_clone()which passes theclone_flagsfrom the value requested from userspace. -
current->vfork_semis initialised (it is later cleared in the child). This is used bysys_vfork()(vfork(2) system call, corresponds toclone_flags = CLONE_VFORK|CLONE_VM|SIGCHLD) to make the parent sleep until the child doesmm_release(), for example as a result ofexec()ing another program or exit(2)-ing. - A new task structure is allocated using arch-dependent
alloc_task_struct()macro. On x86 it is just a gfp atGFP_KERNELpriority. This is the first reason why fork(2) system call may sleep. If this allocation fails, we return-ENOMEM. - All the values from current process' task structure are copied into
the new one, using structure assignment
*p = *current. Perhaps this should be replaced by a memcpy? Later on, the fields that should not be inherited by the child are set to the correct values. - Big kernel lock is taken as the rest of the code would otherwise be non-reentrant.
- If the parent has user resources (a concept of UID, Linux is flexible
enough to make it a question rather than a fact), then verify if the
user exceeded
RLIMIT_NPROCsoft limit - if so, fail with-EAGAIN, if not, increment the count of processes by given uidp->user->count. - If the system-wide number of tasks exceeds the value of the tunable
max_threads, fail with
-EAGAIN. - If the binary being executed belongs to a modularised execution domain, increment the corresponding module's reference count.
- If the binary being executed belongs to a modularised binary format, increment the corresponding module's reference count.
- The child is marked as 'has not execed' (
p->did_exec = 0) - The child is marked as 'not-swappable' (
p->swappable = 0) - The child is put into 'uninterruptible sleep' state, i.e.
p->state = TASK_UNINTERRUPTIBLE(TODO: why is this done? I think it's not needed - get rid of it, Linus confirms it is not needed) - The child's
p->flagsare set according to the value of clone_flags; for plain fork(2), this will bep->flags = PF_FORKNOEXEC. - The child's pid
p->pidis set using the fast algorithm inkernel/fork.c:get_pid()(TODO:lastpid_lockspinlock can be made redundant sinceget_pid()is always called under big kernel lock fromdo_fork(), also remove flags argument ofget_pid(), patch sent to Alan on 20/06/2000 - followup later). - The rest of the code in
do_fork()initialises the rest of child's task structure. At the very end, the child's task structure is hashed into thepidhashhashtable and the child is woken up (TODO:wake_up_process(p)setsp->state = TASK_RUNNINGand adds the process to the runq, therefore we probably didn't need to setp->statetoTASK_RUNNINGearlier on indo_fork()). The interesting part is settingp->exit_signaltoclone_flags & CSIGNAL, which for fork(2) means justSIGCHLDand settingp->pdeath_signalto 0. Thepdeath_signalis used when a process 'forgets' the original parent (by dying) and can be set/get by means ofPR_GET/SET_PDEATHSIGcommands of prctl(2) system call (You might argue that the way the value ofpdeath_signalis returned via userspace pointer argument in prctl(2) is a bit silly - mea culpa, after Andries Brouwer updated the manpage it was too late to fix ;)
Thus tasks are created. There are several ways for tasks to terminate:
- by making exit(2) system call;
- by being delivered a signal with default disposition to die;
- by being forced to die under certain exceptions;
- by calling bdflush(2) with
func == 1(this is Linux-specific, for compatibility with old distributions that still had the 'update' line in/etc/inittab- nowadays the work of update is done by kernel threadkupdate).
Functions implementing system calls under Linux are prefixed with sys_,
but they are usually concerned only with argument checking or arch-specific
ways to pass some information and the actual work is done by do_ functions.
So it is with sys_exit() which calls do_exit() to do the work. Although,
other parts of the kernel sometimes invoke sys_exit() while they should really
call do_exit().
The function do_exit() is found in kernel/exit.c. The points to note about
do_exit():
- Uses global kernel lock (locks but doesn't unlock).
- Calls
schedule()at the end, which never returns. - Sets the task state to
TASK_ZOMBIE. - Notifies any child with
current->pdeath_signal, if not 0. - Notifies the parent with a
current->exit_signal, which is usually equal toSIGCHLD. - Releases resources allocated by fork, closes open files etc.
- On architectures that use lazy FPU switching (ia64, mips, mips64) (TODO: remove 'flags' argument of sparc, sparc64), do whatever the hardware requires to pass the FPU ownership (if owned by current) to "none".
2.3 Linux 排程
The job of a scheduler is to arbitrate access to the current CPU between
multiple processes. The scheduler is implemented in the 'main kernel file'
kernel/sched.c. The corresponding header file include/linux/sched.h is
included (either explicitly or indirectly) by virtually every kernel source
file.
The fields of task structure relevant to scheduler include:
-
p->need_resched: this field is set ifschedule()should be invoked at the 'next opportunity'. -
p->counter: number of clock ticks left to run in this scheduling slice, decremented by a timer. When this field becomes lower than or equal to zero, it is reset to 0 andp->need_reschedis set. This is also sometimes called 'dynamic priority' of a process because it can change by itself. -
p->priority: the process' static priority, only changed through well-known system calls like nice(2), POSIX.1b sched_setparam(2) or 4.4BSD/SVR4 setpriority(2). -
p->rt_priority: realtime priority -
p->policy: the scheduling policy, specifies which scheduling class the task belongs to. Tasks can change their scheduling class using the sched_setscheduler(2) system call. The valid values areSCHED_OTHER(traditional UNIX process),SCHED_FIFO(POSIX.1b FIFO realtime process) andSCHED_RR(POSIX round-robin realtime process). One can also ORSCHED_YIELDto any of these values to signify that the process decided to yield the CPU, for example by calling sched_yield(2) system call. A FIFO realtime process will run until either a) it blocks on I/O, b) it explicitly yields the CPU or c) it is preempted by another realtime process with a higherp->rt_priorityvalue.SCHED_RRis the same asSCHED_FIFO, except that when its timeslice expires it goes back to the end of the runqueue.
The scheduler's algorithm is simple, despite the great apparent complexity
of the schedule() function. The function is complex because it implements
three scheduling algorithms in one and also because of the subtle
SMP-specifics.
The apparently 'useless' gotos in schedule() are there for a purpose - to
generate the best optimised (for i386) code. Also, note that scheduler
(like most of the kernel) was completely rewritten for 2.4, therefore the
discussion below does not apply to 2.2 or earlier kernels.
Let us look at the function in detail:
- If
current->active_mm == NULLthen something is wrong. Current process, even a kernel thread (current->mm == NULL) must have a validp->active_mmat all times. - If there is something to do on the
tq_schedulertask queue, process it now. Task queues provide a kernel mechanism to schedule execution of functions at a later time. We shall look at it in details elsewhere. - Initialise local variables
prevandthis_cputo current task and current CPU respectively. - Check if
schedule()was invoked from interrupt handler (due to a bug) and panic if so. - Release the global kernel lock.
- If there is some work to do via softirq mechanism, do it now.
- Initialise local pointer
struct schedule_data *sched_datato point to per-CPU (cacheline-aligned to prevent cacheline ping-pong) scheduling data area, which contains the TSC value oflast_scheduleand the pointer to last scheduled task structure (TODO:sched_datais used on SMP only but why doesinit_idle()initialises it on UP as well?). -
runqueue_lockspinlock is taken. Note that we usespin_lock_irq()because inschedule()we guarantee that interrupts are enabled. Therefore, when we unlockrunqueue_lock, we can just re-enable them instead of saving/restoring eflags (spin_lock_irqsave/restorevariant). - task state machine: if the task is in
TASK_RUNNINGstate, it is left alone; if it is inTASK_INTERRUPTIBLEstate and a signal is pending, it is moved intoTASK_RUNNINGstate. In all other cases, it is deleted from the runqueue. -
next(best candidate to be scheduled) is set to the idle task of this cpu. However, the goodness of this candidate is set to a very low value (-1000), in hope that there is someone better than that. - If the
prev(current) task is inTASK_RUNNINGstate, then the current goodness is set to its goodness and it is marked as a better candidate to be scheduled than the idle task. - Now the runqueue is examined and a goodness of each process that can
be scheduled on this cpu is compared with current value; the
process with highest goodness wins. Now the concept of "can be
scheduled on this cpu" must be clarified: on UP, every process on
the runqueue is eligible to be scheduled; on SMP, only process not
already running on another cpu is eligible to be scheduled on this
cpu. The goodness is calculated by a function called
goodness(), which treats realtime processes by making their goodness very high (1000 + p->rt_priority), this being greater than 1000 guarantees that noSCHED_OTHERprocess can win; so they only contend with other realtime processes that may have a greaterp->rt_priority. The goodness function returns 0 if the process' time slice (p->counter) is over. For non-realtime processes, the initial value of goodness is set top->counter- this way, the process is less likely to get CPU if it already had it for a while, i.e. interactive processes are favoured more than CPU bound number crunchers. The arch-specific constantPROC_CHANGE_PENALTYattempts to implement "cpu affinity" (i.e. give advantage to a process on the same CPU). It also gives a slight advantage to processes with mm pointing to currentactive_mmor to processes with no (user) address space, i.e. kernel threads. - if the current value of goodness is 0 then the entire list of
processes (not just the ones on the runqueue!) is examined and their dynamic
priorities are recalculated using simple algorithm:
Note that the we drop the
recalculate: { struct task_struct *p; spin_unlock_irq(&runqueue_lock); read_lock(&tasklist_lock); for_each_task(p) p->counter = (p->counter >> 1) + p->priority; read_unlock(&tasklist_lock); spin_lock_irq(&runqueue_lock); }
runqueue_lockbefore we recalculate. The reason is that we go through entire set of processes; this can take a long time, during which theschedule()could be called on another CPU and select a process with goodness good enough for that CPU, whilst we on this CPU were forced to recalculate. Ok, admittedly this is somewhat inconsistent because while we (on this CPU) are selecting a process with the best goodness,schedule()running on another CPU could be recalculating dynamic priorities. - From this point on it is certain that
nextpoints to the task to be scheduled, so we initialisenext->has_cputo 1 andnext->processortothis_cpu. Therunqueue_lockcan now be unlocked. - If we are switching back to the same task (
next == prev) then we can simply reacquire the global kernel lock and return, i.e. skip all the hardware-level (registers, stack etc.) and VM-related (switch page directory, recalculateactive_mmetc.) stuff. - The macro
switch_to()is architecture specific. On i386, it is concerned with a) FPU handling, b) LDT handling, c) reloading segment registers, d) TSS handling and e) reloading debug registers.
2.4 Linux 連結列實作
Before we go on to examine implementation of wait queues, we must
acquaint ourselves with the Linux standard doubly-linked list implementation.
Wait queues (as well as everything else in Linux) make heavy use
of them and they are called in jargon "list.h implementation" because the
most relevant file is include/linux/list.h.
The fundamental data structure here is struct list_head:
struct list_head {
struct list_head *next, *prev;
};
#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) \
struct list_head name = LIST_HEAD_INIT(name)
#define INIT_LIST_HEAD(ptr) do { \
(ptr)->next = (ptr); (ptr)->prev = (ptr); \
} while (0)
#define list_entry(ptr, type, member) \
((type *)((char *)(ptr)-(unsigned long)(&((type *)0)->member)))
#define list_for_each(pos, head) \
for (pos = (head)->next; pos != (head); pos = pos->next)
The first three macros are for initialising an empty list by pointing both
next and prev pointers to itself. It is obvious from C syntactical
restrictions which ones should be used where - for example, LIST_HEAD_INIT()
can be used for structure's element initialisation in declaration, the second
can be used for static variable initialising declarations and the third can
be used inside a function.
The macro list_entry() gives access to individual list element, for example
(from fs/file_table.c:fs_may_remount_ro()):
struct super_block {
...
struct list_head s_files;
...
} *sb = &some_super_block;
struct file {
...
struct list_head f_list;
...
} *file;
struct list_head *p;
for (p = sb->s_files.next; p != &sb->s_files; p = p->next) {
struct file *file = list_entry(p, struct file, f_list);
do something to 'file'
}
A good example of the use of list_for_each() macro is in the scheduler where
we walk the runqueue looking for the process with highest goodness:
static LIST_HEAD(runqueue_head);
struct list_head *tmp;
struct task_struct *p;
list_for_each(tmp, &runqueue_head) {
p = list_entry(tmp, struct task_struct, run_list);
if (can_schedule(p)) {
int weight = goodness(p, this_cpu, prev->active_mm);
if (weight > c)
c = weight, next = p;
}
}
Here, p->run_list is declared as struct list_head run_list inside
task_struct structure and serves as anchor to the list. Removing an element
from the list and adding (to head or tail of the list) is done by
list_del()/list_add()/list_add_tail() macros. The examples below are adding
and removing a task from runqueue:
static inline void del_from_runqueue(struct task_struct * p)
{
nr_running--;
list_del(&p->run_list);
p->run_list.next = NULL;
}
static inline void add_to_runqueue(struct task_struct * p)
{
list_add(&p->run_list, &runqueue_head);
nr_running++;
}
static inline void move_last_runqueue(struct task_struct * p)
{
list_del(&p->run_list);
list_add_tail(&p->run_list, &runqueue_head);
}
static inline void move_first_runqueue(struct task_struct * p)
{
list_del(&p->run_list);
list_add(&p->run_list, &runqueue_head);
}
2.5 等待佇列
When a process requests the kernel to do something which is currently impossible but that may become possible later, the process is put to sleep and is woken up when the request is more likely to be satisfied. One of the kernel mechanisms used for this is called a 'wait queue'.
Linux implementation allows wake-on semantics using TASK_EXCLUSIVE flag.
With waitqueues, you can either use a well-known queue and then simply
sleep_on/sleep_on_timeout/interruptible_sleep_on/interruptible_sleep_on_timeout,
or you can define your own waitqueue and use add/remove_wait_queue to add and
remove yourself from it and wake_up/wake_up_interruptible to wake up
when needed.
An example of the first usage of waitqueues is interaction between the page
allocator (in mm/page_alloc.c:__alloc_pages()) and the kswapd kernel daemon (in
mm/vmscan.c:kswap()), by means of wait queue kswapd_wait, declared in
mm/vmscan.c; the kswapd daemon sleeps on this queue, and it is woken up
whenever the page allocator needs to free up some pages.
An example of autonomous waitqueue usage is interaction between
user process requesting data via read(2) system call and kernel running in
the interrupt context to supply the data. An interrupt handler might look
like (simplified drivers/char/rtc_interrupt()):
static DECLARE_WAIT_QUEUE_HEAD(rtc_wait);
void rtc_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
spin_lock(&rtc_lock);
rtc_irq_data = CMOS_READ(RTC_INTR_FLAGS);
spin_unlock(&rtc_lock);
wake_up_interruptible(&rtc_wait);
}
So, the interrupt handler obtains the data by reading from some
device-specific I/O port (CMOS_READ() macro turns into a couple outb/inb) and
then wakes up whoever is sleeping on the rtc_wait wait queue.
Now, the read(2) system call could be implemented as:
ssize_t rtc_read(struct file file, char *buf, size_t count, loff_t *ppos)
{
DECLARE_WAITQUEUE(wait, current);
unsigned long data;
ssize_t retval;
add_wait_queue(&rtc_wait, &wait);
current->state = TASK_INTERRUPTIBLE;
do {
spin_lock_irq(&rtc_lock);
data = rtc_irq_data;
rtc_irq_data = 0;
spin_unlock_irq(&rtc_lock);
if (data != 0)
break;
if (file->f_flags & O_NONBLOCK) {
retval = -EAGAIN;
goto out;
}
if (signal_pending(current)) {
retval = -ERESTARTSYS;
goto out;
}
schedule();
} while(1);
retval = put_user(data, (unsigned long *)buf);
if (!retval)
retval = sizeof(unsigned long);
out:
current->state = TASK_RUNNING;
remove_wait_queue(&rtc_wait, &wait);
return retval;
}
What happens in rtc_read() is this:
- We declare a wait queue element pointing to current process context.
- We add this element to the
rtc_waitwaitqueue. - We mark current context as
TASK_INTERRUPTIBLEwhich means it will not be rescheduled after the next time it sleeps. - We check if there is no data available; if there is we break out,
copy data to user buffer, mark ourselves as
TASK_RUNNING, remove ourselves from the wait queue and return - If there is no data yet, we check whether the user specified non-blocking I/O
and if so we fail with
EAGAIN(which is the same asEWOULDBLOCK) - We also check if a signal is pending and if so inform the "higher layers" to restart the system call if necessary. By "if necessary" I meant the details of signal disposition as specified in sigaction(2) system call.
- Then we "switch out", i.e. fall asleep, until woken up by the
interrupt handler. If we didn't mark ourselves as
TASK_INTERRUPTIBLEthen the scheduler could schedule us sooner than when the data is available, thus causing unneeded processing.
It is also worth pointing out that, using wait queues, it is rather easy to implement the poll(2) system call:
static unsigned int rtc_poll(struct file *file, poll_table *wait)
{
unsigned long l;
poll_wait(file, &rtc_wait, wait);
spin_lock_irq(&rtc_lock);
l = rtc_irq_data;
spin_unlock_irq(&rtc_lock);
if (l != 0)
return POLLIN | POLLRDNORM;
return 0;
}
All the work is done by the device-independent function poll_wait() which does
the necessary waitqueue manipulations; all we need to do is point it to the
waitqueue which is woken up by our device-specific interrupt handler.
2.6 核心計時器
Now let us turn our attention to kernel timers. Kernel timers are used to
dispatch execution of a particular function (called 'timer handler') at a
specified time in the future. The main data structure is struct timer_list
declared in include/linux/timer.h:
struct timer_list {
struct list_head list;
unsigned long expires;
unsigned long data;
void (*function)(unsigned long);
volatile int running;
};
The list field is for linking into the internal list, protected by the
timerlist_lock spinlock. The expires field is the value of jiffies when
the function handler should be invoked with data passed as a parameter.
The running field is used on SMP to test if the timer handler is currently
running on another CPU.
The functions add_timer() and del_timer() add and remove a given timer to the
list. When a timer expires, it is removed automatically. Before a timer is
used, it MUST be initialised by means of init_timer() function. And before it
is added, the fields function and expires must be set.
2.7 Bottom Halves
Sometimes it is reasonable to split the amount of work to be performed inside an interrupt handler into immediate work (e.g. acknowledging the interrupt, updating the stats etc.) and work which can be postponed until later, when interrupts are enabled (e.g. to do some postprocessing on data, wake up processes waiting for this data, etc).
Bottom halves are the oldest mechanism for deferred execution of kernel tasks and have been available since Linux 1.x. In Linux 2.0, a new mechanism was added, called 'task queues', which will be the subject of next section.
Bottom halves are serialised by the global_bh_lock spinlock, i.e.
there can only be one bottom half running on any CPU at a time. However,
when attempting to execute the handler, if global_bh_lock is not available,
the bottom half is marked (i.e. scheduled) for execution - so processing can
continue, as opposed to a busy loop on global_bh_lock.
There can only be 32 bottom halves registered in total. The functions required to manipulate bottom halves are as follows (all exported to modules):
-
void init_bh(int nr, void (*routine)(void)): installs a bottom half handler pointed to byroutineargument into slotnr. The slot ought to be enumerated ininclude/linux/interrupt.hin the formXXXX_BH, e.g.TIMER_BHorTQUEUE_BH. Typically, a subsystem's initialisation routine (init_module()for modules) installs the required bottom half using this function. -
void remove_bh(int nr): does the opposite ofinit_bh(), i.e. de-installs bottom half installed at slotnr. There is no error checking performed there, so, for exampleremove_bh(32)will panic/oops the system. Typically, a subsystem's cleanup routine (cleanup_module()for modules) uses this function to free up the slot that can later be reused by some other subsystem. (TODO: wouldn't it be nice to have/proc/bottom_halveslist all registered bottom halves on the system? That meansglobal_bh_lockmust be made read/write, obviously) -
void mark_bh(int nr): marks bottom half in slotnrfor execution. Typically, an interrupt handler will mark its bottom half (hence the name!) for execution at a "safer time".
Bottom halves are globally locked tasklets, so the question "when are bottom
half handlers executed?" is really "when are tasklets executed?". And the
answer is, in two places: a) on each schedule() and b) on each
interrupt/syscall return path in entry.S (TODO: therefore, the schedule()
case is really boring - it like adding yet another very very slow interrupt,
why not get rid of handle_softirq label from schedule() altogether?).
2.8 Task Queues
Task queues can be though of as a dynamic extension to old bottom halves. In fact, in the source code they are sometimes referred to as "new" bottom halves. More specifically, the old bottom halves discussed in previous section have these limitations:
- There are only a fixed number (32) of them.
- Each bottom half can only be associated with one handler function.
- Bottom halves are consumed with a spinlock held so they cannot block.
So, with task queues, arbitrary number of functions can be chained and
processed one after another at a later time. One creates a new task queue
using the DECLARE_TASK_QUEUE() macro and queues a task onto it using
the queue_task() function. The task queue then can be processed using
run_task_queue(). Instead of creating your own task queue (and
having to consume it manually) you can use one of Linux' predefined
task queues which are consumed at well-known points:
- tq_timer: the timer task queue, run on each timer interrupt
and when releasing a tty device (closing or releasing a half-opened
terminal device). Since the timer handler runs in interrupt context,
the
tq_timertasks also run in interrupt context and thus cannot block. - tq_scheduler: the scheduler task queue, consumed by the scheduler (and also
when closing tty devices, like
tq_timer). Since the scheduler executed in the context of the process being re-scheduled, thetq_schedulertasks can do anything they like, i.e. block, use process context data (but why would they want to), etc. - tq_immediate: this is really a bottom half
IMMEDIATE_BH, so drivers canqueue_task(task, &tq_immediate)and thenmark_bh(IMMEDIATE_BH)to be consumed in interrupt context. - tq_disk: used by low level block device access (and RAID) to start the actual requests. This task queue is exported to modules but shouldn't be used except for the special purposes which it was designed for.
Unless a driver uses its own task queues, it does not need to call
run_tasks_queues() to process the queue, except under circumstances explained
below.
The reason tq_timer/tq_scheduler task queues are consumed not only in the
usual places but elsewhere (closing tty device is but one example) becomes
clear if one remembers that the driver can schedule tasks on the queue, and these tasks
only make sense while a particular instance of the device is still valid
- which usually means until the application closes it. So, the driver may
need to call run_task_queue() to flush the tasks it (and anyone else) has
put on the queue, because allowing them to run at a later time may make no
sense - i.e. the relevant data structures may have been freed/reused by a
different instance. This is the reason you see run_task_queue() on tq_timer
and tq_scheduler in places other than timer interrupt and schedule()
respectively.
2.9 Tasklets
Not yet, will be in future revision.
2.10 Softirqs
Not yet, will be in future revision.
2.11 How System Calls Are Implemented on i386 Architecture?
There are two mechanisms under Linux for implementing system calls:
- lcall7/lcall27 call gates;
- int 0x80 software interrupt.
Native Linux programs use int 0x80 whilst binaries from foreign flavours of UNIX (Solaris, UnixWare 7 etc.) use the lcall7 mechanism. The name 'lcall7' is historically misleading because it also covers lcall27 (e.g. Solaris/x86), but the handler function is called lcall7_func.
When the system boots, the function arch/i386/kernel/traps.c:trap_init() is
called which sets up the IDT so that vector 0x80 (of type 15, dpl 3) points to
the address of system_call entry from arch/i386/kernel/entry.S.
When a userspace application makes a system call, the arguments are passed via registers
and the application executes 'int 0x80' instruction. This causes a trap into
kernel mode and processor jumps to system_call entry point in entry.S.
What this does is:
- Save registers.
- Set %ds and %es to KERNEL_DS, so that all data (and extra segment) references are made in kernel address space.
- If the value of %eax is greater than
NR_syscalls(currently 256), fail withENOSYSerror. - If the task is being ptraced (
tsk->ptrace & PF_TRACESYS), do special processing. This is to support programs like strace (analogue of SVR4 truss(1)) or debuggers. - Call
sys_call_table+4*(syscall_number from %eax). This table is initialised in the same file (arch/i386/kernel/entry.S) to point to individual system call handlers which under Linux are (usually) prefixed withsys_, e.g.sys_open,sys_exit, etc. These C system call handlers will find their arguments on the stack whereSAVE_ALLstored them. - Enter 'system call return path'. This is a separate label because it
is used not only by int 0x80 but also by lcall7, lcall27. This is
concerned with handling tasklets (including bottom halves), checking
if a
schedule()is needed (tsk->need_resched != 0), checking if there are signals pending and if so handling them.
Linux supports up to 6 arguments for system calls. They are passed in
%ebx, %ecx, %edx, %esi, %edi (and %ebp used temporarily, see _syscall6() in
asm-i386/unistd.h). The system call number is passed via %eax.
2.12 Atomic Operations
There are two types of atomic operations: bitmaps and atomic_t. Bitmaps are
very convenient for maintaining a concept of "allocated" or "free" units
from some large collection where each unit is identified by some number, for
example free inodes or free blocks. They are also widely used for simple
locking, for example to provide exclusive access to open a device. An example
of this can be found in arch/i386/kernel/microcode.c:
/*
* Bits in microcode_status. (31 bits of room for future expansion)
*/
#define MICROCODE_IS_OPEN 0 /* set if device is in use */
static unsigned long microcode_status;
There is no need to initialise microcode_status to 0 as BSS is zero-cleared
under Linux explicitly.
/*
* We enforce only one user at a time here with open/close.
*/
static int microcode_open(struct inode *inode, struct file *file)
{
if (!capable(CAP_SYS_RAWIO))
return -EPERM;
/* one at a time, please */
if (test_and_set_bit(MICROCODE_IS_OPEN, µcode_status))
return -EBUSY;
MOD_INC_USE_COUNT;
return 0;
}
The operations on bitmaps are:
- void set_bit(int nr, volatile void *addr): set bit
nrin the bitmap pointed to byaddr. - void clear_bit(int nr, volatile void *addr): clear bit
nrin the bitmap pointed to byaddr. - void change_bit(int nr, volatile void *addr): toggle bit
nr(if set clear, if clear set) in the bitmap pointed to byaddr. - int test_and_set_bit(int nr, volatile void *addr):
atomically set bit
nrand return the old bit value. - int test_and_clear_bit(int nr, volatile void *addr):
atomically clear bit
nrand return the old bit value. - int test_and_change_bit(int nr, volatile void *addr):
atomically toggle bit
nrand return the old bit value.
These operations use the LOCK_PREFIX macro, which on SMP kernels evaluates to
bus lock instruction prefix and to nothing on UP. This guarantees atomicity
of access in SMP environment.
Sometimes bit manipulations are not convenient, but instead we need to perform
arithmetic operations - add, subtract, increment decrement. The typical cases
are reference counts (e.g. for inodes). This facility is provided by the
atomic_t data type and the following operations:
- atomic_read(&v): read the value of
atomic_tvariablev. - atomic_set(&v, i): set the value of
atomic_tvariablevto integeri. - void atomic_add(int i, volatile atomic_t *v): add integer
ito the value of atomic variable pointed to byv. - void atomic_sub(int i, volatile atomic_t *v): subtract
integer
ifrom the value of atomic variable pointed to byv. - int atomic_sub_and_test(int i, volatile atomic_t *v):
subtract integer
ifrom the value of atomic variable pointed to byv; return 1 if the new value is 0, return 0 otherwise. - void atomic_inc(volatile atomic_t *v): increment the value by 1.
- void atomic_dec(volatile atomic_t *v): decrement the value by 1.
- int atomic_dec_and_test(volatile atomic_t *v): decrement the value; return 1 if the new value is 0, return 0 otherwise.
- int atomic_inc_and_test(volatile atomic_t *v): increment the value; return 1 if the new value is 0, return 0 otherwise.
- int atomic_add_negative(int i, volatile atomic_t *v): add
the value of
itovand return 1 if the result is negative. Return 0 if the result is greater than or equal to 0. This operation is used for implementing semaphores.
2.13 Spinlocks, Read-write Spinlocks and Big-Reader Spinlocks
Since the early days of Linux support (early 90s, this century), developers were faced with the classical problem of accessing shared data between different types of context (user process vs interrupt) and different instances of the same context from multiple cpus.
SMP support was added to Linux 1.3.42 on 15 Nov 1995 (the original patch was made to 1.3.37 in October the same year).
If the critical region of code may be executed by either process context
and interrupt context, then the way to protect it using cli/sti instructions
on UP is:
unsigned long flags;
save_flags(flags);
cli();
/* critical code */
restore_flags(flags);
While this is ok on UP, it obviously is of no use on SMP because the same
code sequence may be executed simultaneously on another cpu, and while cli()
provides protection against races with interrupt context on each CPU individually, it
provides no protection at all against races between contexts running on different
CPUs. This is where spinlocks are useful for.
There are three types of spinlocks: vanilla (basic), read-write and
big-reader spinlocks. Read-write spinlocks should be used when there is a
natural tendency of 'many readers and few writers'. Example of this is
access to the list of registered filesystems (see fs/super.c). The list is
guarded by the file_systems_lock read-write spinlock because one needs exclusive
access only when registering/unregistering a filesystem, but any process can
read the file /proc/filesystems or use the sysfs(2) system call to force a
read-only scan of the file_systems list. This makes it sensible to use
read-write spinlocks. With read-write spinlocks, one can have multiple
readers at a time but only one writer and there can be no readers while
there is
a writer. Btw, it would be nice if new readers would not get a lock while
there
is a writer trying to get a lock, i.e. if Linux could correctly deal with
the issue of potential writer starvation by multiple readers.
This would mean that readers must be blocked while there is a writer
attempting to get the lock. This is not
currently the case and it is not obvious whether this should be fixed - the
argument to the contrary is - readers usually take the lock for a very short
time so should they really be starved while the writer takes the lock for
potentially longer periods?
Big-reader spinlocks are a form of read-write spinlocks heavily optimised for very light read access, with a penalty for writes. There is a limited number of big-reader spinlocks - currently only two exist, of which one is used only on sparc64 (global irq) and the other is used for networking. In all other cases where the access pattern does not fit into any of these two scenarios, one should use basic spinlocks. You cannot block while holding any kind of spinlock.
Spinlocks come in three flavours: plain, _irq() and _bh().
- Plain
spin_lock()/spin_unlock(): if you know the interrupts are always disabled or if you do not race with interrupt context (e.g. from within interrupt handler), then you can use this one. It does not touch interrupt state on the current CPU. -
spin_lock_irq()/spin_unlock_irq(): if you know that interrupts are always enabled then you can use this version, which simply disables (on lock) and re-enables (on unlock) interrupts on the current CPU. For example,rtc_read()usesspin_lock_irq(&rtc_lock)(interrupts are always enabled insideread()) whilstrtc_interrupt()usesspin_lock(&rtc_lock)(interrupts are always disabled inside interrupt handler). Note thatrtc_read()usesspin_lock_irq()and not the more genericspin_lock_irqsave()because on entry to any system call interrupts are always enabled. -
spin_lock_irqsave()/spin_unlock_irqrestore(): the strongest form, to be used when the interrupt state is not known, but only if interrupts matter at all, i.e. there is no point in using it if our interrupt handlers don't execute any critical code.
The reason you cannot use plain spin_lock() if you race against interrupt handlers is because if you take it and then
an interrupt comes in on the same CPU, it will busy wait for the lock forever:
the lock holder, having been interrupted, will not continue until the
interrupt handler returns.
The most common usage of a spinlock is to access a data structure shared between user process context and interrupt handlers:
spinlock_t my_lock = SPIN_LOCK_UNLOCKED;
my_ioctl()
{
spin_lock_irq(&my_lock);
/* critical section */
spin_unlock_irq(&my_lock);
}
my_irq_handler()
{
spin_lock(&lock);
/* critical section */
spin_unlock(&lock);
}
There are a couple of things to note about this example:
- The process context, represented here as a typical driver method -
ioctl()(arguments and return values omitted for clarity), must usespin_lock_irq()because it knows that interrupts are always enabled while executing the deviceioctl()method. - Interrupt context, represented here by
my_irq_handler()(again arguments omitted for clarity) can use plainspin_lock()form because interrupts are disabled inside an interrupt handler.
2.14 Semaphores and read/write Semaphores
Sometimes, while accessing a shared data structure, one must perform operations that can block, for example copy data to userspace. The locking primitive available for such scenarios under Linux is called a semaphore. There are two types of semaphores: basic and read-write semaphores. Depending on the initial value of the semaphore, they can be used for either mutual exclusion (initial value of 1) or to provide more sophisticated type of access.
Read-write semaphores differ from basic semaphores in the same way as read-write spinlocks differ from basic spinlocks: one can have multiple readers at a time but only one writer and there can be no readers while there are writers - i.e. the writer blocks all readers and new readers block while a writer is waiting.
Also, basic semaphores can be interruptible - just use the operations
down/up_interruptible() instead of the plain down()/up() and check the
value returned from down_interruptible(): it will be non zero if the operation was
interrupted.
Using semaphores for mutual exclusion is ideal in situations where a critical code section may call by reference unknown functions registered by other subsystems/modules, i.e. the caller cannot know apriori whether the function blocks or not.
A simple example of semaphore usage is in kernel/sys.c, implementation of
gethostname(2)/sethostname(2) system calls.
asmlinkage long sys_sethostname(char *name, int len)
{
int errno;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
if (len < 0 || len > __NEW_UTS_LEN)
return -EINVAL;
down_write(&uts_sem);
errno = -EFAULT;
if (!copy_from_user(system_utsname.nodename, name, len)) {
system_utsname.nodename[len] = 0;
errno = 0;
}
up_write(&uts_sem);
return errno;
}
asmlinkage long sys_gethostname(char *name, int len)
{
int i, errno;
if (len < 0)
return -EINVAL;
down_read(&uts_sem);
i = 1 + strlen(system_utsname.nodename);
if (i > len)
i = len;
errno = 0;
if (copy_to_user(name, system_utsname.nodename, i))
errno = -EFAULT;
up_read(&uts_sem);
return errno;
}
The points to note about this example are:
- The functions may block while copying data from/to userspace in
copy_from_user()/copy_to_user(). Therefore they could not use any form of spinlock here. - The semaphore type chosen is read-write as opposed to basic because there may be lots of concurrent gethostname(2) requests which need not be mutually exclusive.
Although Linux implementation of semaphores and read-write semaphores is very sophisticated, there are possible scenarios one can think of which are not yet implemented, for example there is no concept of interruptible read-write semaphores. This is obviously because there are no real-world situations which require these exotic flavours of the primitives.
2.15 Kernel Support for Loading Modules
Linux is a monolithic operating system and despite all the modern hype about some "advantages" offered by operating systems based on micro-kernel design, the truth remains (quoting Linus Torvalds himself):
... message passing as the fundamental operation of the OS is just an
exercise in computer science masturbation. It may feel good, but you
don't actually get anything DONE.
Therefore, Linux is and will always be based on a monolithic design, which means that all subsystems run in the same privileged mode and share the same address space; communication between them is achieved by the usual C function call means.
However, although separating kernel functionality into separate "processes"
as is done in micro-kernels is definitely a bad idea, separating it into
dynamically loadable on demand kernel modules is desirable in some
circumstances (e.g. on machines with low memory or for installation kernels
which could otherwise contain ISA auto-probing device drivers that are
mutually exclusive). The decision whether to include support for loadable
modules is made at compile time and is determined by the CONFIG_MODULES
option. Support for module autoloading via request_module() mechanism is
a separate compilation option (CONFIG_KMOD).
The following functionality can be implemented as loadable modules under Linux:
- Character and block device drivers, including misc device drivers.
- Terminal line disciplines.
- Virtual (regular) files in
/procand in devfs (e.g./dev/cpu/microcodevs/dev/misc/microcode). - Binary file formats (e.g. ELF, aout, etc).
- Execution domains (e.g. Linux, UnixWare7, Solaris, etc).
- Filesystems.
- System V IPC.
There a few things that cannot be implemented as modules under Linux (probably because it makes no sense for them to be modularised):
- Scheduling algorithms.
- VM policies.
- Buffer cache, page cache and other caches.
Linux provides several system calls to assist in loading modules:
caddr_t create_module(const char *name, size_t size): allocatessizebytes usingvmalloc()and maps a module structure at the beginning thereof. This new module is then linked into the list headed by module_list. Only a process withCAP_SYS_MODULEcan invoke this system call, others will getEPERMreturned.long init_module(const char *name, struct module *image): loads the relocated module image and causes the module's initialisation routine to be invoked. Only a process withCAP_SYS_MODULEcan invoke this system call, others will getEPERMreturned.long delete_module(const char *name): attempts to unload the module. Ifname == NULL, attempt is made to unload all unused modules.long query_module(const char *name, int which, void *buf, size_t bufsize, size_t *ret): returns information about a module (or about all modules).
The command interface available to users consists of:
- insmod: insert a single module.
- modprobe: insert a module including all other modules it depends on.
- rmmod: remove a module.
- modinfo: print some information about a module, e.g. author, description, parameters the module accepts, etc.
Apart from being able to load a module manually using either insmod or modprobe,
it is also possible to have the module inserted automatically by the kernel
when a particular functionality is required. The kernel interface for this
is the function called request_module(name) which is exported to modules,
so that modules can load other modules as well. The request_module(name)
internally creates a kernel thread which execs the userspace command
modprobe -s -k module_name, using the standard exec_usermodehelper() kernel
interface (which is also exported to modules). The function returns 0 on
success, however it is usually not worth checking the return code from
request_module(). Instead, the programming idiom is:
if (check_some_feature() == NULL)
request_module(module);
if (check_some_feature() == NULL)
return -ENODEV;
For example, this is done by fs/block_dev.c:get_blkfops() to load a module
block-major-N when attempt is made to open a block device with major N.
Obviously, there is no such module called block-major-N (Linux developers
only chose sensible names for their modules) but it is mapped to a proper
module name using the file /etc/modules.conf. However, for most well-known
major numbers (and other kinds of modules) the modprobe/insmod commands
know which real module to load without needing an explicit alias statement
in /etc/modules.conf.
A good example of loading a module is inside the mount(2) system call. The
mount(2) system call accepts the filesystem type as a string which
fs/super.c:do_mount() then passes on to fs/super.c:get_fs_type():
static struct file_system_type *get_fs_type(const char *name)
{
struct file_system_type *fs;
read_lock(&file_systems_lock);
fs = *(find_filesystem(name));
if (fs && !try_inc_mod_count(fs->owner))
fs = NULL;
read_unlock(&file_systems_lock);
if (!fs && (request_module(name) == 0)) {
read_lock(&file_systems_lock);
fs = *(find_filesystem(name));
if (fs && !try_inc_mod_count(fs->owner))
fs = NULL;
read_unlock(&file_systems_lock);
}
return fs;
}
A few things to note in this function:
- First we attempt to find the filesystem with the given name amongst
those already registered. This is done under protection of
file_systems_locktaken for read (as we are not modifying the list of registered filesystems). - If such a filesystem is found then we attempt to get a new reference
to it by trying to increment its module's hold count. This always
returns 1 for statically linked filesystems or for modules not
presently being deleted. If
try_inc_mod_count()returned 0 then we consider it a failure - i.e. if the module is there but is being deleted, it is as good as if it were not there at all. - We drop the
file_systems_lockbecause what we are about to do next (request_module()) is a blocking operation, and therefore we can't hold a spinlock over it. Actually, in this specific case, we would have to dropfile_systems_lockanyway, even ifrequest_module()were guaranteed to be non-blocking and the module loading were executed in the same context atomically. The reason for this is that the module's initialisation function will try to callregister_filesystem(), which will take the samefile_systems_lockread-write spinlock for write. - If the attempt to load was successful, then we take the
file_systems_lockspinlock and try to locate the newly registered filesystem in the list. Note that this is slightly wrong because it is in principle possible for a bug in modprobe command to cause it to coredump after it successfully loaded the requested module, in which caserequest_module()will fail even though the new filesystem will be registered, and yetget_fs_type()won't find it. - If the filesystem is found and we are able to get a reference to it, we return it. Otherwise we return NULL.
When a module is loaded into the kernel, it can refer to any symbols that
are exported as public by the kernel using EXPORT_SYMBOL() macro or by
other currently loaded modules. If the module uses symbols from another
module, it is marked as depending on that module during dependency
recalculation, achieved by running depmod -a command on boot (e.g. after
installing a new kernel).
Usually, one must match the set of modules with the version of the kernel
interfaces they use, which under Linux simply means the "kernel version" as
there is no special kernel interface versioning mechanism in general.
However, there is a limited functionality called "module versioning" or
CONFIG_MODVERSIONS which allows to avoid recompiling modules when switching
to a new kernel. What happens here is that the kernel symbol table is treated
differently for internal access and for access from modules. The elements of
public (i.e. exported) part of the symbol table are built by 32bit
checksumming the C declaration. So, in order to resolve a symbol used by a
module during loading, the loader must match the full representation of the
symbol that includes the checksum; it will refuse to load the module if these
symbols differ. This
only happens when both the kernel and the module are compiled with module
versioning enabled. If either one of them uses the original symbol names,
the loader simply tries to match the kernel version declared by the module
and the one exported by the kernel and refuses to load if they differ.