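- After the control card is replaced and the XSCF boots, the machine's serial number (SN) must be entered.
- The XSCF date and time must be reconfigured, which triggers a reboot.
- The XSCF reboots automatically once during the firmware upgrade.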
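Control card overview
The eXtended System Control Facility Unit (XSCFU) is a service processor that operates and manages both midrange servers. The XSCFU diagnoses and starts the entire server, configures domains, provides dynamic reconfiguration, and detects and reports various faults. It enables standard control and monitoring functions over the network, so the server can be started, configured, and managed from a remote location.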
Fault symptoms
Log alerts
XSCF> showhardconf (faulty units are automatically marked with an asterisk)
* XSCFU Status:Degraded,Active;Ver:0101h; Serial:xxxxxxx ;
+ FRU-Part-Number:CF00541-0481 04 /541-0481-04
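Since showhardconf marks faulty or degraded units with a leading asterisk, a saved copy of its output can be filtered for just those lines. The following is a minimal sketch, assuming the output was captured to a file on an admin workstation; the function name flagged_units is hypothetical and not part of XSCF.

```shell
# flagged_units: print only the lines that showhardconf marked with '*'.
# Reads captured showhardconf output on stdin.
flagged_units() {
    grep '^[[:space:]]*\*' | sed 's/^[[:space:]]*//'
}
```

For example, `flagged_units < showhardconf.txt` would list the degraded XSCFU line shown above (the file name is an assumption).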
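Machine appearance (amber warning LED on the front; the control card at the rear also shows an amber LED)
Shutting down the faulty host
Checking and collecting resource status
Because the control card has failed, the serial connection does not respond, so log in from the other node of the cluster to shut the host down.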
# hostname
LDTX-DB2
# date (check the machine's time; compare it with your own computer's clock, since it is needed later)
# hvdisp -a (all resources are Online)
Local System: ldtx-db2RMS
Configuration:/opt/SMAW/SMAWRrms/build/config.us
Resource Type HostName State StateDetails
-----------------------------------------------------------------------------
ldtx-db1RMS SysNode Online
ldtx-db2RMS SysNode Online
LDTX userApp Online
Machine001_LDTX andOp ldtx-db2RMS Online
Machine000_LDTX andOp ldtx-db1RMS
ManageProgram000_Cmdline0 gRes Online
Ipaddress000_Gls0 gRes Online
AllDiskClassesOk_Gds0 andOp Online
cdata1_Gds0 gRes Online
Cluster status on the faulty machine ldtx-db1
# rsh LDTX-DB1 (remote login)
Last login: Tue Aug 19 15:32:06 on console
Sun Microsystems Inc. SunOS 5.10 Generic January 2005
You have new mail.
# hostname
LDTX-DB1
##
# hvdisp -a (all resources on this node are Offline)
Local System: ldtx-db1RMS
Configuration:/opt/SMAW/SMAWRrms/build/config.us
Resource Type HostName State StateDetails
-----------------------------------------------------------------------------
ldtx-db2RMS SysNode Online
ldtx-db1RMS SysNode Online
LDTX userApp Standby
LDTX userApp ldtx-db2RMS Online
Machine001_LDTX andOp ldtx-db2RMS
Machine000_LDTX andOp ldtx-db1RMS Offline
ManageProgram000_Cmdline0 gRes Offline
Ipaddress000_Gls0 gRes Standby
AllDiskClassesOk_Gds0 andOp Offline
cdata1_Gds0 gRes Offline
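When comparing the two hvdisp listings above, it can help to print only the resources that are not Online. The sketch below assumes the column layout shown in the transcripts, where the HostName column may be empty, so it scans every field after the type column for a state word. The function name not_online is hypothetical, and only the states Offline, Standby, and Faulted are matched.

```shell
# not_online: read `hvdisp -a` output on stdin and print "resource state"
# for every row whose state is Offline, Standby, or Faulted.
not_online() {
    awk '
        /^-+$/ { body = 1; next }   # resource rows start after the dashed rule
        body {
            for (i = 3; i <= NF; i++)
                if ($i ~ /^(Offline|Standby|Faulted)$/) print $1, $i
        }
    '
}
```

On the cluster node this could be used as `hvdisp -a | not_online`; an empty result means no resource was reported in one of those three states.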
Shutdown
bash-3.00# shutdown -i5 -y -g0
Shutdown started. Wed Aug 27 15:31:34 CST 2014
Changing to init state 5 - please wait
Broadcast Message from root (pts/2) on LDTX-DB1 Wed Aug 27 15:31:34...
THE SYSTEM LDTX-DB1 IS BEING SHUT DOWN NOW! ! !
Log off now or risk your files being damaged
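The cluster commands can be used on the other node to verify that this host has gone offline.
Because no console output is visible on site, wait 5 to 10 minutes until the disk LEDs stop blinking, then force the power off.
If the console does produce output, power off with the following command instead:
XSCF> poweroff -a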
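Replacing the control card
Replace and power on
Unplug the power cables, swap the board, reconnect the power cables, and reconnect the control card cable.
The control board powers on and boots automatically; the console shows output similar to: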
SCF board boot factor = 4080
memory test ..
Memory compare test
................finish
DDR Real size: 256 MB
DDR: 224 MB
## Booting image at ff800000
Entering the machine's SN
(printed on a label on the front panel)
***** WARNING *****
XSCF initialization terminate because system data in XSCF/OPNL are mismatch.
Start procedure for system data selection.
Please select system data according to the instruction
Please input the chassis serial number : XXXXXXXX // enter the SN manually
1:PANEL
Please select the number : 1 // choose 1; after initialization and self-test the XSCF reboots automatically
Restoring data from PANEL to XSCF#0.
Please wait for several minutes ...
setdefaults : XSCF clear : start
setdefaults : XSCF clear : DBS start
setdefaults : XSCF clear : wait 20s for DBS initialization
setdefaults : XSCF clear : common database clear complete
The restoration of data has completed.
Please turn off the breaker.
unmount /hcp0/linux
unmount /hcp0/scfprog -- complete
unmount /hcp0/gendata -- complete
unmount /hcp0/remcscm -- complete
unmount /hcp1/linux
unmount /hcpcommon/scflog1 -- complete
unmount /hcpcommon/scflog2 -- complete
The system is going down NOW !!
Sending SIGTERM to all processes.
Sending SIGKILL to all processes.
Please stand by while rebooting the system.(15)
Restarting system.
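Login
After the reboot completes, log in.
There are two possible logins:
One is the user name and password ce/abc123; if that does not work, use the default account.
The default user name is default. After entering it, you are prompted to turn the panel mode switch (Locked or Service); turn the key as prompted and press Enter.
After login, a message reports a version mismatch, so an upgrade is required.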
XSCF Initialize complete.
Jan 1 08:12:46 xscf0-M5000-MachineSN-1 XSCF[106]: XSCF Initialize complete.
login: ce
Password:
XCP version of Panel EEPROM and XSCF FMEM mismatched,
Panel EEPROM=1080, XSCF FMEM=1115
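The mismatch banner above carries both version numbers, which are worth recording before the upgrade. A minimal sketch, assuming the login banner was saved to a text file; the function name xcp_versions is hypothetical.

```shell
# xcp_versions: read the login banner on stdin and print
# "<panel EEPROM version> <XSCF FMEM version>".
xcp_versions() {
    sed -n 's/.*Panel EEPROM=\([0-9][0-9]*\), *XSCF FMEM=\([0-9][0-9]*\).*/\1 \2/p'
}
```

For the banner shown here, the first number is the version recorded in the operator panel and the second is the version on the new card.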
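Checking the new card's version
Check the version before upgrading. From the earlier log, the original version was 1080, and the new card is at 1115.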
XSCF> version -c xcp
XSCF#0 (Active )
XCP0 (Current): 1115
XCP1 (Reserve): 1115
XSCF> version -c xcp -v
XSCF#0 (Active )
XCP0 (Current): 1115
OpenBoot PROM : 02.32.0000
XSCF : 01.11.0005
XCP1 (Reserve): 1115
OpenBoot PROM : 02.32.0000
XSCF : 01.11.0005
OpenBoot PROM BACKUP
#0: 02.11.0000
#1: 02.32.0000
Check whether the upgrade package is present on the card
XSCF> getflashimage -l
Existing versions:
Version Size Date
FFXCP1115.tar.gz 45791674 Mon Jan 01 08:27:57 CST 2001
Version upgrade
XSCF> flashupdate -c check -m xcp -s 1115 // check the version first
XCP update is possible with domains up
XSCF> flashupdate -c update -m xcp -s 1115 // perform the upgrade
The XSCF will be reset. Continue? [y|n] :y
XCP update is started (XCP version=1115:last version=1080)
OpenBoot PROM update is started (OpenBoot PROM version=02320000)
OpenBoot PROM update has been completed (OpenBoot PROM version=02320000)
XSCF update is started (XSCFU=0,bank=1,XCP version=1115:last version=1080)
XSCF download is started (XSCFU=0,bank=1,XCP version=1115:last version=1080, Firmware Element ID=00:version=01110004:last version=01110004)
The upgrade takes about 10 to 20 minutes. The XSCF reboots automatically, and the update continues after the reboot until it completes.
XSCF flashupdate[830]: XCP update has been completed (XCP version=1115)
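Because the XSCF reboots in the middle of the update, the completion message is the reliable sign that it finished. A minimal sketch, assuming the console session was logged to a file; the function name xcp_update_done is hypothetical.

```shell
# xcp_update_done: read a console log on stdin and print the XCP version
# from the "XCP update has been completed" message, if it is present.
xcp_update_done() {
    sed -n 's/.*XCP update has been completed (XCP version=\([0-9][0-9]*\)).*/\1/p'
}
```

An empty result would suggest the update has not finished yet, or that the message was not captured in the log.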
Logging in as default
login: default
Change the panel mode switch to Locked and press return... // follow the prompt
Leave it in that position for at least 5 seconds. Change the panel mode switch to Service, and press return... // follow the prompt
Tip:
+++++++++++++++++++++++++++++++++++++++
Because logging in as default is cumbersome, a new user can be created if necessary:
XSCF> adduser ce
XSCF> password ce
XSCF> setprivileges ce platadm (platform administrator privilege)
XSCF> showuser -l
+++++++++++++++++++++++++++++++++++++++++++++
Network configuration (omitted)
XSCF> shownetwork -a
xscf#0-lan#0
Link encap:Ethernet HWaddr 00:21:28:25:D4:D6
inet addr:209.56.7.120 Bcast:209.56.7.255 Mask:255.255.255.0
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Base address:0xe000
xscf#0-lan#1
Link encap:Ethernet HWaddr 00:21:28:25:D4:D7
inet addr:192.168.2.1 Bcast:192.168.2.255 Mask:255.255.255.0
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Base address:0xc000
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Check whether the current version has been updated.
XSCF> version -c xcp -v
XSCF#0 (Active )
XCP0(Reserve): 1115
OpenBoot PROM : 02.32.0000
XSCF : 01.11.0005
XCP1(Current): 1115
OpenBoot PROM : 02.32.0000
XSCF : 01.11.0005
OpenBoot PROM BACKUP
#0: 02.11.0000
#1: 02.32.0000
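To compare the bank versions before and after the upgrade, the `version -c xcp` output can be reduced to one line per bank. The sketch below assumes the output format shown in the transcripts (with or without a space before the parenthesis); the function name xcp_banks is hypothetical.

```shell
# xcp_banks: read `version -c xcp` (or `-v`) output on stdin and print
# one "bank role version" line per XCP bank, e.g. "XCP1 Current 1115".
xcp_banks() {
    sed -n 's/^[[:space:]]*\(XCP[0-9]\)[[:space:]]*(\([A-Za-z]*\))[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1 \2 \3/p'
}
```

In the listing above, the Current role has moved from XCP0 to XCP1 after the update, while both banks report 1115.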
Time configuration
XSCF> showtimezone -c tz // show the time zone
Asia/Shanghai
XSCF> poweron -d 0 // power on the machine (-d selects the domain)
DomainIDs to power on:00
Continue? [y|n] :y
Poweron canceled due to invalid system date and time. // the date and time are invalid
The time needs to be reconfigured.
XSCF> setdate -s 2014.08.27-17:04:23
Wed Aug 27 17:04:23 CST 2014
The XSCF will be reset. Continue? [y|n] :y // answer y; the XSCF then reboots automatically
Wed Aug 27 09:04:23 UTC 2014
XSCF> execute J00shutdown_start -- complete
execute K000end -- complete
Aug 27 17:04:25 xscf0-M5000-MachineSN-1 XSCF[106]: XSCF shutdown sequence start
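The setdate call above takes local time in the form yyyy.MM.DD-hh:mm:ss; the transcript shows 17:04:23 CST being echoed back as 09:04:23 UTC. To avoid typing the timestamp by hand, the argument could be generated on a workstation whose clock is known to be correct. This is a sketch only: it assumes the workstation uses the same time zone as the XSCF (Asia/Shanghai here), and the function name xscf_setdate_arg is hypothetical.

```shell
# xscf_setdate_arg: print the workstation's local time in the
# yyyy.MM.DD-hh:mm:ss form used with `setdate -s` in the transcript.
xscf_setdate_arg() {
    date '+%Y.%m.%d-%H:%M:%S'
}

# Print the command line to paste into the XSCF shell.
echo "setdate -s $(xscf_setdate_arg)"
```

Paste the printed line promptly, since the timestamp is fixed at the moment it is generated.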
Normal boot
Once the XSCF has rebooted normally, log in as ce again and check the hardware status.
XSCF> showhardconf
XSCF> showhardconf -u
After confirming that the hardware is healthy, power on the domain.
XSCF> poweron -d 0
DomainIDs to power on:00
Continue? [y|n] :y // answer y
00 :Powering on
*Note*
This command only issues the instruction to power-on.
The result of the instruction can be checked by the "showlogs power".
Entering the host console
XSCF> console -d 0
Console contents may be logged.
Connect to DomainID 0?[y|n] :y
The system stops at the ok prompt, where commands such as printenv, probe-scsi-all, setenv, and nvalias are available.
OK boot
SPARC Enterprise M5000 Server, using Domain console
Copyright (c) 1998, 2012, Oracle and/or its affiliates. All rights reserved.
Copyright (c) 2012, Oracle and/or its affiliates and Fujitsu Limited. All rights reserved.
OpenBoot 4.33.5.d, 65536 MB memory installed, Serial #xxxxxx.
Ethernet address 0:21:28:25:d4:d2, Host ID:xxxxxxx.
Aborting auto-boot sequence.
{0} ok boot
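Starting the cluster software
Log in to the system and check the cluster status. The cluster software did not start automatically, so it must be started manually.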
bash-3.00# hvdisp -a
hvdisp: RMS is not running
bash-3.00# hvcm -s ldtx-db1 // start RMS on the specified host, or use hvcm -a for all hosts
Starting Reliant Monitor Services now
bash-3.00# disAug 27 17:05:42 LDTX-DB1 : LOG3.014091303421080023 0 3 0 4.2 RMS (WRP, 34): ERROR: Cluster host ldtx-db2RMS is no longer in time sync with local node. Sane operation of RMS can no longer be guaranteed. Further out-of-sync messages will appear in the syslog.
bash-3.00# hvdisp -a
Local System: ldtx-db1RMS
Configuration:/opt/SMAW/SMAWRrms/build/config.us
Resource Type HostName State StateDetails
-----------------------------------------------------------------------------
ldtx-db2RMS SysNode Online
ldtx-db1RMS SysNode Online
LDTX userApp Standby
LDTX userApp ldtx-db2RMS Online
Machine001_LDTX andOp ldtx-db2RMS
Machine000_LDTX andOp ldtx-db1RMS Offline
ManageProgram000_Cmdline0 gRes Offline
Ipaddress000_Gls0 gRes Standby
AllDiskClassesOk_Gds0 andOp Offline
cdata1_Gds0 gRes Offline
bash-3.00# man hvcm
Reformatting page. Please Wait... done
Maintenance Commands hvcm(1M)
NAME
hvcm - start the Reliant Monitor configuration monitor
SYNOPSIS
hvcm {-a | -s SysNode } Format 1
hvcm -c config_file {-a | -s SysNode } [-h time] [-l level]
Format 2
hvcm -V Format 3
DESCRIPTION
The configuration monitor is the decision-making module of
Reliant Monitor. It controls the configuration and access
to all Reliant Monitor resources. If a resource fails, the
configuration monitor analyzes the failure and initiates the
appropriate action according to the specifications for the
resource in the nodes configuration file.
The hvcm command starts the configuration monitor and the
detectors for all monitored resources. In most cases, it is
not necessary to specify options to the hvcm command; the
default values are sufficient for most configurations.
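The RMS error in the transcript above reports that ldtx-db2RMS is no longer in time sync with the local node, which is why the clock was compared before the shutdown. A minimal sketch for quantifying the skew from two epoch timestamps follows; how the timestamps are collected on each node is left open (for instance with `perl -e 'print time'`, which is an assumption about the environment), and the function name clock_skew is hypothetical.

```shell
# clock_skew: print the absolute difference, in seconds, between two
# epoch timestamps (one taken on each cluster node).
clock_skew() {
    skew=$(( $1 - $2 ))
    if [ "$skew" -lt 0 ]; then
        skew=$(( -skew ))
    fi
    echo "$skew"
}
```

The source does not state what skew RMS tolerates, so the result should be read as a diagnostic value rather than checked against a fixed limit.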
Appendix: Fujitsu cluster software overview
http://www.fujitsu.com/cn/services/hardware/servers/software/index.html
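Cluster commands:
Display the resource status on the local host
hvdisp -a
Start RMS on all nodes
hvcm -a
Start RMS on a specific node
hvcm -s hostnode
Switch an application to a node
hvswitch app hostnode
Adjust the resource state, typically to clear a fault
hvutil -c userapp
Activate the application
hvutil -a userapp
Stop the resource (take it offline); hvswitch is needed to bring it back
hvutil -f userapp
Switch to the deactivated state
hvutil -d userapp
Switch the resource into or out of maintenance mode
hvutil -m off|on userapp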
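Heartbeat network interfaces
cfconfig -g
Display node information
cftool -n
Display heartbeat status
ciptool -n
Display the cluster shared resource tree
/etc/opt/FJSVcluster/bin/clgettree
Display NIC multipath status
/opt/FJSVhanet/usr/sbin/dsphanet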
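# hvsetenv HV_RCSTART
Check whether RMS starts automatically: 1 means it does, 0 means it does not.
# hvsetenv HV_AUTOSTARTUP
Check whether the userApplication starts automatically: 1 means it does, 0 means it does not.
# hvsetenv HV_RCSTART 0
Disable automatic RMS startup.
# hvsetenv HV_AUTOSTARTUP 0
Disable automatic userApplication startup.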
Dump the RMS resource manager configuration.
# hvdump -f /opt/pcl.`date`.Z
Note: exports the RMS configuration to the file /opt/pcl.`date`.Z.
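A bare `date` in a file name expands to a string containing spaces and colons, so an unquoted command line is split into several arguments. A compact, space-free timestamp avoids that. The sketch below is an illustration only; the function name stamped_name and the timestamp layout are assumptions, not part of the RMS tooling.

```shell
# stamped_name: print "<prefix>.<YYYYmmdd-HHMMSS>.<suffix>", a file name
# with no spaces or colons that is safe to pass as a single argument.
stamped_name() {
    printf '%s.%s.%s\n' "$1" "$(date '+%Y%m%d-%H%M%S')" "$2"
}
```

It could be used as `hvdump -f "$(stamped_name /opt/pcl Z)"`, keeping the path prefix from the command above.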
Collect this node's system information and logs.
# /opt/FJSVsnap/bin/fjsnap -a /tmp/fjsnap_`uname -n`.tar.gz
Note: collects this node's system configuration and log information and writes it to /tmp/fjsnap_`uname -n`.tar.gz.