OpenBSD/sgi on octane2 - mplock周りで落ちるバグについて#3
その後KERNEL_PROC_LOCK(p)/KERNEL_PROC_UNLOCK(p)を追ってみた所、kern/init_main.c:start_init()の以下の場所でlockされていないのにunlockしているからエラーが出るらしい事がわかった:
if ((error = sys_execve(p, &args, retval)) == 0) { KERNEL_PROC_UNLOCK(p); return; }
たとえここのUNLOCKをコメントアウトしても、すぐ別の場所でやはりKERNEL_PROC_UNLOCK(p)が行われて同じように失敗する事もわかった。
このマクロは実際の所、__mp_lock(&kernel_lock)でジャイアントロックを行ってしまっているが、将来的にはプロセス構造体毎にロックを持つ前提で使われているはず。
であるから、あるstruct proc *pに対するKERNEL_PROC_UNLOCK(p)には、それ以前にKERNEL_PROC_LOCK(p)が行われていないと不自然だ。
が、カーネルへ入れたdebug printを見てもそのような形跡が見当たらない。
debug printを入れてみたの図:
Setting $netaddr to 192.168.3.2 (from server ) Obtaining bsd from server 5890992+492712 entry: 0xa800000020020000 ARCS64 Firmware Version 64.0 Found SGI-IP30, setting up. Initial setup done, switching console. cpu_id:0 Copyright (c) 1982, 1986, 1989, 1991, 1993 The Regents of the University of California. All rights reserved. Copyright (c) 1995-2009 OpenBSD. All rights reserved. http://www.OpenBSD.org OpenBSD 4.5-current (OCTANE_RAMDISK_MP) #13: Mon May 11 15:27:07 JST 2009 real mem = 1073741824 (1024MB) rsvd mem = 1064960 (1MB) avail mem = 1013780480 (966MB) mainbus0 at root cpu0 at mainbus0: MIPS R12000 CPU rev 3.5 400 MHz with R10000 FPU rev 0.0 cpu0: cache L1-I 32KB D 32KB 2 way, L2 2048KB 2 way clock0 at mainbus0: ticker on int5 using count register xbow0 at mainbus0: XBow revision 5 xheart0 at xbow0 widget 8: Heart revision 6 onewire0 at xheart0 owserial0 at onewire0 family 0x0b sn 0000004d74fa owserial0: "PM20400MHZ" p/n 030-1476-001, serial KWL381 "Odyssey" revision 2 at xbow0 widget 11 not configured xbridge0 at xbow0 widget 15: Bridge revision 4 pci0 at xbridge0 bus 0 isp0 at pci0 dev 0 function 0 "QLogic ISP1020" rev 0x05: irq 0 isp0: invalid NVRAM header scsibus0 at isp0: 16 targets, initiator 7 sd0 at scsibus0 targ 1 lun 0: <QUANTUM, ATLAS_V__9_SCA, 0238> SCSI3 0/direct fixed sd0: 8682MB, 512 bytes/sec, 17781964 sec total isp1 at pci0 dev 1 function 0 "QLogic ISP1020" rev 0x05: irq 1 isp1: invalid NVRAM header scsibus1 at isp1: 16 targets, initiator 7 ioc0 at pci0 dev 2 function 0 "SGI IOC3" rev 0x01 onewire1 at ioc0 owmac0 at onewire1 family 0x09 sn 000002d29696 owmac0: Ethernet Address 08:00:69:13:67:82 owserial1 at onewire1 family 0x0b sn 0000004d4ee4 owserial1: "FP1" p/n 030-0891-003, serial LMR447 owserial2 at onewire1 family 0x0b sn 0000004e12fb owserial2: "PWR.SPPLY.ER" p/n 060-0035-002, serial AAE0420107 ioc0: superio irq 4, ethernet irq 2 com0 at ioc0 base 0x00020178: ns16550a, 16 byte fifo com0: console com1 at ioc0 base 0x00020170: ns16550a, 16 byte fifo "SGI Rad1" rev 0xc0 at pci0 dev 3 function 0 not configured [../../../../kern/init_main.c:main:358] KERNEL_PROC_LOCK(0xa800000020609ae0) rd0: fixed, 8192 blocks boot device: sd0 root on rd0a swap on rd0b dump on rd0b WARNING: No TOD clock, believing file system. WARNING: CHECK AND RESET THE DATE! [../../../../kern/kern_sched.c:sched_idle:112] KERNEL_PROC_UNLOCK(0xffffffffc910b4c0) __mp_unlock(0x2060a370): releasing not locked lock Stopped at 0xa80000002016529c: jr ra 0xa8000000201652a0: nop ddb{0}>
これって、Machine dependentなコードからKERNEL_PROC_LOCK()を呼んでいるって事なんでは、と思い、同じdebug printが入ったカーネルをi386向けにコンパイルして走らせてみた。
すると、
[../../../../kern/kern_fork.c:proc_trampoline_mp:524] KERNEL_PROC_LOCK(0xXXXXXXXX)
というsgiでは出てこなかった表示が得られた。
ソースコードを見てみると、確かにKERNEL_PROC_LOCK(p)している:
#if defined(MULTIPROCESSOR) /* * XXX This is a slight hack to get newly-formed processes to * XXX acquire the kernel lock as soon as they run. */ void proc_trampoline_mp(void) { struct proc *p; p = curproc; SCHED_ASSERT_LOCKED(); __mp_unlock(&sched_lock); spl0(); SCHED_ASSERT_UNLOCKED(); KASSERT(__mp_lock_held(&kernel_lock) == 0); KERNEL_PROC_LOCK(p); } #endif
これを呼んでいるのは誰か、とgrepしてみると、arch/i386/i386/locore.sで以下のようなコードを発見。
NENTRY(proc_trampoline) #ifdef MULTIPROCESSOR call _C_LABEL(proc_trampoline_mp) #endif movl $IPL_NONE,CPL pushl %ebx call *%esi addl $4,%esp INTRFASTEXIT /* NOTREACHED */
つまり、forkしたての所で呼ばれるって訳か。
こんなんNetBSDには無かったぞ・・・。同じBSD兄弟でもSMP実装は色々なんですな。
ああそうかい分かったよって事で、arch/mips64/mips64/exception.Sを以下のように書き換えてみた:
/*-------------------------------------------------------------- proc_trampoline * Setup for and return to user. */ LEAF(proc_trampoline, 0) #ifdef MULTIPROCESSOR jal _C_LABEL(proc_trampoline_mp) nop #endif
おし!これで直るはずだ!
結果:
Option? 5 Command Monitor. Type "exit" to return to the menu. >> bootp()192.168.3.1:bsd Setting $netaddr to 192.168.3.2 (from server ) Obtaining bsd from server 5889920+492744 entry: 0xa800000020020000 ARCS64 Firmware Version 64.0 Found SGI-IP30, setting up. Initial setup done, switching console. cpu_id:0 Copyright (c) 1982, 1986, 1989, 1991, 1993 The Regents of the University of California. All rights reserved. Copyright (c) 1995-2009 OpenBSD. All rights reserved. http://www.OpenBSD.org OpenBSD 4.5-current (OCTANE_RAMDISK_MP) #17: Mon May 11 22:06:09 JST 2009 real mem = 1073741824 (1024MB) rsvd mem = 1064960 (1MB) avail mem = 1013780480 (966MB) mainbus0 at root cpu0 at mainbus0: MIPS R12000 CPU rev 3.5 400 MHz with R10000 FPU rev 0.0 cpu0: cache L1-I 32KB D 32KB 2 way, L2 2048KB 2 way clock0 at mainbus0: ticker on int5 using count register xbow0 at mainbus0: XBow revision 5 xheart0 at xbow0 widget 8: Heart revision 6 onewire0 at xheart0 owserial0 at onewire0 family 0x0b sn 0000004d74fa owserial0: "PM20400MHZ" p/n 030-1476-001, serial KWL381 "Odyssey" revision 2 at xbow0 widget 11 not configured xbridge0 at xbow0 widget 15: Bridge revision 4 pci0 at xbridge0 bus 0 isp0 at pci0 dev 0 function 0 "QLogic ISP1020" rev 0x05: irq 0 isp0: invalid NVRAM header scsibus0 at isp0: 16 targets, initiator 0 sd0 at scsibus0 targ 1 lun 0: <QUANTUM, ATLAS_V__9_SCA, 0238> SCSI3 0/direct fixed sd0: 8682MB, 512 bytes/sec, 17781964 sec total isp1 at pci0 dev 1 function 0 "QLogic ISP1020" rev 0x05: irq 1 isp1: invalid NVRAM header scsibus1 at isp1: 16 targets, initiator 0 ioc0 at pci0 dev 2 function 0 "SGI IOC3" rev 0x01 onewire1 at ioc0 owmac0 at onewire1 family 0x09 sn 000002d29696 owmac0: Ethernet Address 08:00:69:13:67:82 owserial1 at onewire1 family 0x0b sn 0000004d4ee4 owserial1: "FP1" p/n 030-0891-003, serial LMR447 owserial2 at onewire1 family 0x0b sn 0000004e12fb owserial2: "PWR.SPPLY.ER" p/n 060-0035-002, serial AAE0420107 ioc0: superio irq 4, ethernet irq 2 com0 at ioc0 base 0x00020178: ns16550a, 16 byte fifo com0: console com1 at ioc0 base 0x00020170: ns16550a, 16 byte fifo iockbc at ioc0 not configured ef at ioc0 base 0x000000f0 not configured dsrtc at ioc0 not configured "SGI Rad1" rev 0xc0 at pci0 dev 3 function 0 not configured rd0: fixed, 8192 blocks boot device: sd0 root on rd0a swap on rd0b dump on rd0b WARNING: No TOD clock, believing file system. WARNING: CHECK AND RESET THE DATE! panic: kernel diagnostic assertion "__mp_lock_held(&kernel_lock) == 0" failed: file "../../../../kern/kern_sched.c", line 184 Stopped at 0xa80000002016501c: jr ra 0xa800000020165020: nop RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC! IF RUNNING SMP, USE 'mach ddbcpu <#>' AND 'trace' ON OTHER PROCESSORS, TOO. DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION! ddb{0}>
(*´∀`)そうきたか。
(※今度はロックされていない筈なのにロックされてるよって言ってpanicしている)