OpenBSD/sgi on octane2 - mplock周りで落ちるバグについて#3

その後KERNEL_PROC_LOCK(p)/KERNEL_PROC_UNLOCK(p)を追ってみた所、kern/init_main.c:start_init()の以下の場所でlockされていないのにunlockしているからエラーが出るらしい事がわかった:

		if ((error = sys_execve(p, &args, retval)) == 0) {
			KERNEL_PROC_UNLOCK(p);
			return;
		}

たとえここのUNLOCKをコメントアウトしても、すぐ別の場所でやはりKERNEL_PROC_UNLOCK(p)が行われて同じように失敗する事もわかった。

このマクロは実際の所、__mp_lock(&kernel_lock)でジャイアントロックを行ってしまっているが、将来的にはプロセス構造体毎にロックを持つ前提で使われているはず。
であるから、あるstruct proc *pに対するKERNEL_PROC_UNLOCK(p)には、それ以前にKERNEL_PROC_LOCK(p)が行われていないと不自然だ。
が、カーネルへ入れたdebug printを見てもそのような形跡が見当たらない。

debug printを入れてみたの図:

Setting $netaddr to 192.168.3.2 (from server )
Obtaining bsd from server
5890992+492712 entry: 0xa800000020020000
ARCS64 Firmware Version 64.0
Found SGI-IP30, setting up.
Initial setup done, switching console.
cpu_id:0
Copyright (c) 1982, 1986, 1989, 1991, 1993
        The Regents of the University of California.  All rights reserved.
Copyright (c) 1995-2009 OpenBSD. All rights reserved.  http://www.OpenBSD.org

OpenBSD 4.5-current (OCTANE_RAMDISK_MP) #13: Mon May 11 15:27:07 JST 2009

real mem = 1073741824 (1024MB)
rsvd mem = 1064960 (1MB)
avail mem = 1013780480 (966MB)
mainbus0 at root
cpu0 at mainbus0: MIPS R12000 CPU rev 3.5 400 MHz with R10000 FPU rev 0.0
cpu0: cache L1-I 32KB D 32KB 2 way, L2 2048KB 2 way
clock0 at mainbus0: ticker on int5 using count register
xbow0 at mainbus0: XBow revision 5
xheart0 at xbow0 widget 8: Heart revision 6
onewire0 at xheart0
owserial0 at onewire0 family 0x0b sn 0000004d74fa
owserial0: "PM20400MHZ" p/n 030-1476-001, serial KWL381
"Odyssey" revision 2 at xbow0 widget 11 not configured
xbridge0 at xbow0 widget 15: Bridge revision 4
pci0 at xbridge0 bus 0
isp0 at pci0 dev 0 function 0 "QLogic ISP1020" rev 0x05: irq 0
isp0: invalid NVRAM header
scsibus0 at isp0: 16 targets, initiator 7
sd0 at scsibus0 targ 1 lun 0: <QUANTUM, ATLAS_V__9_SCA, 0238> SCSI3 0/direct fixed
sd0: 8682MB, 512 bytes/sec, 17781964 sec total
isp1 at pci0 dev 1 function 0 "QLogic ISP1020" rev 0x05: irq 1
isp1: invalid NVRAM header
scsibus1 at isp1: 16 targets, initiator 7
ioc0 at pci0 dev 2 function 0 "SGI IOC3" rev 0x01
onewire1 at ioc0
owmac0 at onewire1 family 0x09 sn 000002d29696
owmac0: Ethernet Address 08:00:69:13:67:82
owserial1 at onewire1 family 0x0b sn 0000004d4ee4
owserial1: "FP1" p/n 030-0891-003, serial LMR447
owserial2 at onewire1 family 0x0b sn 0000004e12fb
owserial2: "PWR.SPPLY.ER" p/n 060-0035-002, serial AAE0420107
ioc0: superio irq 4, ethernet irq 2
com0 at ioc0 base 0x00020178: ns16550a, 16 byte fifo
com0: console
com1 at ioc0 base 0x00020170: ns16550a, 16 byte fifo
"SGI Rad1" rev 0xc0 at pci0 dev 3 function 0 not configured
[../../../../kern/init_main.c:main:358] KERNEL_PROC_LOCK(0xa800000020609ae0)
rd0: fixed, 8192 blocks
boot device: sd0
root on rd0a swap on rd0b dump on rd0b
WARNING: No TOD clock, believing file system.
WARNING: CHECK AND RESET THE DATE!
[../../../../kern/kern_sched.c:sched_idle:112] KERNEL_PROC_UNLOCK(0xffffffffc910b4c0)
__mp_unlock(0x2060a370): releasing not locked lock
Stopped at      0xa80000002016529c:     jr      ra
0xa8000000201652a0:      nop
ddb{0}>

これって、Machine dependentなコードからKERNEL_PROC_LOCK()を呼んでいるって事なんでは、と思い、同じdebug printが入ったカーネルi386向けにコンパイルして走らせてみた。

すると、
[../../../../kern/kern_fork.c:proc_trampoline_mp:524] KERNEL_PROC_LOCK(0xXXXXXXXX)
というsgiでは出てこなかった表示が得られた。

ソースコードを見てみると、確かにKERNEL_PROC_LOCK(p)している:

#if defined(MULTIPROCESSOR)
/*
 * XXX This is a slight hack to get newly-formed processes to
 * XXX acquire the kernel lock as soon as they run.
 */
void
proc_trampoline_mp(void)
{
	struct proc *p;

	p = curproc;

	SCHED_ASSERT_LOCKED();
	__mp_unlock(&sched_lock);
	spl0();
	SCHED_ASSERT_UNLOCKED();
	KASSERT(__mp_lock_held(&kernel_lock) == 0);

	KERNEL_PROC_LOCK(p);
}
#endif

これを呼んでいるのは誰か、とgrepしてみると、arch/i386/i386/locore.sで以下のようなコードを発見。

NENTRY(proc_trampoline)
#ifdef MULTIPROCESSOR
	call	_C_LABEL(proc_trampoline_mp)
#endif
	movl	$IPL_NONE,CPL
	pushl	%ebx
	call	*%esi
	addl	$4,%esp
	INTRFASTEXIT
	/* NOTREACHED */

つまり、forkしたての所で呼ばれるって訳か。
こんなんNetBSDには無かったぞ・・・。同じBSD兄弟でもSMP実装は色々なんですな。

ああそうかい分かったよって事で、arch/mips64/mips64/exception.Sを以下のように書き換えてみた:

/*-------------------------------------------------------------- proc_trampoline
 *	Setup for and return to user.
 */
LEAF(proc_trampoline, 0)
#ifdef MULTIPROCESSOR
	jal	_C_LABEL(proc_trampoline_mp)
	nop
#endif

おし!これで直るはずだ!

結果:

Option? 5
Command Monitor.  Type "exit" to return to the menu.
>> bootp()192.168.3.1:bsd
Setting $netaddr to 192.168.3.2 (from server )
Obtaining bsd from server
5889920+492744 entry: 0xa800000020020000
ARCS64 Firmware Version 64.0
Found SGI-IP30, setting up.
Initial setup done, switching console.
cpu_id:0
Copyright (c) 1982, 1986, 1989, 1991, 1993
        The Regents of the University of California.  All rights reserved.
Copyright (c) 1995-2009 OpenBSD. All rights reserved.  http://www.OpenBSD.org

OpenBSD 4.5-current (OCTANE_RAMDISK_MP) #17: Mon May 11 22:06:09 JST 2009

real mem = 1073741824 (1024MB)
rsvd mem = 1064960 (1MB)
avail mem = 1013780480 (966MB)
mainbus0 at root
cpu0 at mainbus0: MIPS R12000 CPU rev 3.5 400 MHz with R10000 FPU rev 0.0
cpu0: cache L1-I 32KB D 32KB 2 way, L2 2048KB 2 way
clock0 at mainbus0: ticker on int5 using count register
xbow0 at mainbus0: XBow revision 5
xheart0 at xbow0 widget 8: Heart revision 6
onewire0 at xheart0
owserial0 at onewire0 family 0x0b sn 0000004d74fa
owserial0: "PM20400MHZ" p/n 030-1476-001, serial KWL381
"Odyssey" revision 2 at xbow0 widget 11 not configured
xbridge0 at xbow0 widget 15: Bridge revision 4
pci0 at xbridge0 bus 0
isp0 at pci0 dev 0 function 0 "QLogic ISP1020" rev 0x05: irq 0
isp0: invalid NVRAM header
scsibus0 at isp0: 16 targets, initiator 0
sd0 at scsibus0 targ 1 lun 0: <QUANTUM, ATLAS_V__9_SCA, 0238> SCSI3 0/direct fixed
sd0: 8682MB, 512 bytes/sec, 17781964 sec total
isp1 at pci0 dev 1 function 0 "QLogic ISP1020" rev 0x05: irq 1
isp1: invalid NVRAM header
scsibus1 at isp1: 16 targets, initiator 0
ioc0 at pci0 dev 2 function 0 "SGI IOC3" rev 0x01
onewire1 at ioc0
owmac0 at onewire1 family 0x09 sn 000002d29696
owmac0: Ethernet Address 08:00:69:13:67:82
owserial1 at onewire1 family 0x0b sn 0000004d4ee4
owserial1: "FP1" p/n 030-0891-003, serial LMR447
owserial2 at onewire1 family 0x0b sn 0000004e12fb
owserial2: "PWR.SPPLY.ER" p/n 060-0035-002, serial AAE0420107
ioc0: superio irq 4, ethernet irq 2
com0 at ioc0 base 0x00020178: ns16550a, 16 byte fifo
com0: console
com1 at ioc0 base 0x00020170: ns16550a, 16 byte fifo
iockbc at ioc0 not configured
ef at ioc0 base 0x000000f0 not configured
dsrtc at ioc0 not configured
"SGI Rad1" rev 0xc0 at pci0 dev 3 function 0 not configured
rd0: fixed, 8192 blocks
boot device: sd0
root on rd0a swap on rd0b dump on rd0b
WARNING: No TOD clock, believing file system.
WARNING: CHECK AND RESET THE DATE!
panic: kernel diagnostic assertion "__mp_lock_held(&kernel_lock) == 0" failed: file "../../../../kern/kern_sched.c", line 184
Stopped at      0xa80000002016501c:     jr      ra
0xa800000020165020:      nop
RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC!
IF RUNNING SMP, USE 'mach ddbcpu <#>' AND 'trace' ON OTHER PROCESSORS, TOO.
DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
ddb{0}>

(*´∀`)そうきたか。
(※今度はロックされていない筈なのにロックされてるよって言ってpanicしている)