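Key points:
  1. After the control card is replaced and the unit powers back on, you must enter the machine's serial number (SN).
  2. The XSCF clock must be set again, which triggers a reset.
  3. The XSCF resets automatically once during the firmware (FW) upgrade.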

About the control card

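The eXtended System Control Facility Unit (XSCFU) is a service processor that operates and manages both midrange servers. The XSCFU diagnoses and starts the entire server, configures domains, provides dynamic reconfiguration, and detects and reports various failures. It provides standard control and monitoring functions over the network, so the server can be started, configured, and managed from a remote location.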

 

 

Fault symptoms

Log alerts

XSCF> showhardconf    (faulty components are automatically marked with an asterisk in the output)

*  XSCFU Status:Degraded,Active;Ver:0101h; Serial:xxxxxxx ;

       + FRU-Part-Number:CF00541-0481 04   /541-0481-04   
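
When the showhardconf output is long, the asterisk-marked lines can be picked out with a few lines of script. The following is only a minimal sketch: it assumes the output has been saved as text, and the function name flagged_components is made up for this illustration.

```python
def flagged_components(showhardconf_text):
    """Return the lines that showhardconf marked with a leading asterisk."""
    flagged = []
    for line in showhardconf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("*"):
            # Drop the marker itself and the blanks around it.
            flagged.append(stripped.lstrip("*").strip())
    return flagged


# The two lines below are copied from the transcript above.
sample = (
    "*  XSCFU Status:Degraded,Active;Ver:0101h; Serial:xxxxxxx ;\n"
    "       + FRU-Part-Number:CF00541-0481 04   /541-0481-04\n"
)

for component in flagged_components(sample):
    print(component)
```

Only the XSCFU line is reported here; the indented FRU-Part-Number line carries no asterisk and is skipped.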

 

Physical appearance (amber warning LED on the front; the control card at the rear also shows an amber LED)

 

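Shutting down the faulty host

Checking and collecting resource status

Because the control card has failed, the serial connection does not respond, so log in through the other node of the cluster pair and shut the host down from there.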

 

# hostname

LDTX-DB2

# date    (check the machine's time, ideally against your own computer; it is needed later)

# hvdisp -a   (all resources are Online)

Local System:  ldtx-db2RMS

Configuration:/opt/SMAW/SMAWRrms/build/config.us

 

Resource            Type    HostName            State        StateDetails

-----------------------------------------------------------------------------

ldtx-db1RMS         SysNode                     Online      

ldtx-db2RMS         SysNode                     Online      

LDTX                userApp                     Online      

Machine001_LDTX     andOp  ldtx-db2RMS         Online      

Machine000_LDTX     andOp  ldtx-db1RMS

                    

ManageProgram000_Cmdline0 gRes                        Online      

Ipaddress000_Gls0   gRes                        Online      

AllDiskClassesOk_Gds0 andOp                       Online      

cdata1_Gds0         gRes                        Online      

 

Cluster status on the faulty node ldtx-db1

 

# rsh LDTX-DB1    (remote login)

Last login: Tue Aug 19 15:32:06 on console

Sun Microsystems Inc.   SunOS 5.10      Generic January 2005

You have new mail.

# hostname

LDTX-DB1

##

# hvdisp -a    (the resources on this node are Offline or Standby)

 

Local System:  ldtx-db1RMS

Configuration:/opt/SMAW/SMAWRrms/build/config.us

 

Resource            Type    HostName            State        StateDetails

-----------------------------------------------------------------------------

ldtx-db2RMS         SysNode                     Online      

ldtx-db1RMS         SysNode                     Online      

LDTX                userApp                     Standby     

LDTX                userApp ldtx-db2RMS         Online

Machine001_LDTX     andOp  ldtx-db2RMS                     

Machine000_LDTX     andOp  ldtx-db1RMS         Offline     

ManageProgram000_Cmdline0 gRes                        Offline     

Ipaddress000_Gls0   gRes                        Standby     

AllDiskClassesOk_Gds0 andOp                       Offline     

cdata1_Gds0         gRes                        Offline     
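
To compare the two nodes quickly, the hvdisp -a table can be turned into records. The sketch below is an illustration rather than a supported tool: the set of state names is an assumption based on the states visible in these transcripts (plus Faulted), the StateDetails column is assumed to be empty as it is here, and parse_hvdisp is a made-up name.

```python
# States seen in the transcripts, plus Faulted; treat this set as an assumption.
KNOWN_STATES = {"Online", "Offline", "Standby", "Faulted"}


def parse_hvdisp(text):
    """Parse the table printed by 'hvdisp -a' into a list of dicts."""
    rows = []
    in_table = False
    for line in text.splitlines():
        if line.startswith("-----"):
            in_table = True          # everything after the rule is table data
            continue
        fields = line.split()
        if not in_table or len(fields) < 2:
            continue
        name, rtype, rest = fields[0], fields[1], fields[2:]
        # HostName and State are both optional, so classify by value.
        state = rest[-1] if rest and rest[-1] in KNOWN_STATES else None
        host = rest[0] if rest and rest[0] not in KNOWN_STATES else None
        rows.append({"resource": name, "type": rtype, "host": host, "state": state})
    return rows


sample = """\
Local System:  ldtx-db1RMS
Resource            Type    HostName            State        StateDetails
-----------------------------------------------------------------------------
ldtx-db2RMS         SysNode                     Online
ldtx-db1RMS         SysNode                     Online
LDTX                userApp                     Standby
LDTX                userApp ldtx-db2RMS         Online
Machine001_LDTX     andOp  ldtx-db2RMS
Machine000_LDTX     andOp  ldtx-db1RMS         Offline
cdata1_Gds0         gRes                        Offline
"""

rows = parse_hvdisp(sample)
not_online = [r["resource"] for r in rows if r["state"] not in (None, "Online")]
print(not_online)
```

Rows without a state (such as Machine001_LDTX on the remote node) are kept with state None instead of being guessed.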

 

Shutting down

bash-3.00# shutdown -i5 -y -g0

Shutdown started.    Wed Aug 27 15:31:34 CST 2014

Changing to init state 5 - please wait

Broadcast Message from root (pts/2) on LDTX-DB1 Wed Aug 27 15:31:34...

THE SYSTEM LDTX-DB1 IS BEING SHUT DOWN NOW! ! !

Log off now or risk your files being damaged

 

You can use the cluster commands to verify that the other node has gone offline.

 

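Because no console output is visible on site, wait 5 to 10 minutes, confirm that the disk LEDs have stopped blinking, and then force the power off.

If the console does produce output, power off with a command instead: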

 

XSCF> poweroff -a

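Replacing the control card

Replace and power on

Unplug the power cords, swap the board, reconnect the power cords, and connect the control card cable.

The control board powers up and boots automatically, and the console shows output similar to the following: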

 

SCF board boot factor = 4080

memory test ..

Memory compare test

................finish

   DDR Real size: 256 MB

   DDR: 224 MB

 

## Booting image at ff800000

Entering the machine's SN

(The SN is on a label on the machine's front panel.)

 

***** WARNING *****

XSCF initialization terminate because system data in XSCF/OPNL are mismatch.

Start procedure for system data selection.

Please select system data according to the instruction

 

Please input the chassis serial number : XXXXXXXX    // type the SN manually

1:PANEL

Please select the number : 1      // choose 1 to initialize; after the self-test the XSCF restarts automatically

Restoring data from PANEL to XSCF#0.

Please wait for several minutes ...

setdefaults : XSCF clear : start

setdefaults : XSCF clear : DBS start

setdefaults : XSCF clear : wait 20s for DBS initialization

setdefaults : XSCF clear : common database clear complete

 

The restoration of data has completed.

Please turn off the breaker.

unmount /hcp0/linux

unmount /hcp0/scfprog  -- complete

unmount /hcp0/gendata  -- complete

unmount /hcp0/remcscm  -- complete

unmount /hcp1/linux

unmount /hcpcommon/scflog1  -- complete

unmount /hcpcommon/scflog2  -- complete

The system is going down NOW !!

Sending SIGTERM to all processes.

Sending SIGKILL to all processes.

Please stand by while rebooting the system.(15)

Restarting system.

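Login

After the reboot completes, log in.

There are two possible logins:

The first is the user name and password ce/abc123. If that does not work, use the default account.

The default user name is default. After you enter it, you are prompted to turn the panel mode switch (lock switch); turn the key as prompted and press Enter.

After login, a message reports that the versions do not match and an upgrade is needed.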

XSCF Initialize complete.

Jan 1 08:12:46 xscf0-M5000-MachineSN-1 XSCF[106]: XSCF Initialize complete.

 

login: ce

Password:

XCP version of Panel EEPROM and XSCF FMEM mismatched,

       Panel EEPROM=1080, XSCF FMEM=1115
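
If the login banner is captured in a session log, the two version numbers can be extracted and compared automatically. This is a small sketch that assumes the banner always uses the wording shown above; parse_xcp_mismatch is a hypothetical helper name.

```python
import re

# Matches the second line of the mismatch banner shown above.
MISMATCH_RE = re.compile(r"Panel EEPROM=(\d+),\s*XSCF FMEM=(\d+)")


def parse_xcp_mismatch(banner):
    """Return (panel_version, fmem_version) as ints, or None if absent."""
    match = MISMATCH_RE.search(banner)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


banner = "       Panel EEPROM=1080, XSCF FMEM=1115"
versions = parse_xcp_mismatch(banner)
if versions is not None and versions[0] != versions[1]:
    print("operator panel has XCP %d, new card has XCP %d" % versions)
```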

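Checking the version on the new card

Before upgrading, check the version. The log above shows that the original version was 1080 and the new card is at 1115.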

XSCF> version -c xcp

XSCF#0 (Active )

XCP0 (Current): 1115

XCP1 (Reserve): 1115

XSCF> version -c xcp -v

XSCF#0 (Active )

XCP0 (Current): 1115

OpenBoot PROM : 02.32.0000

XSCF          : 01.11.0005

XCP1 (Reserve): 1115

OpenBoot PROM : 02.32.0000

XSCF          : 01.11.0005

OpenBoot PROM BACKUP

#0: 02.11.0000

#1: 02.32.0000

 

Check whether the upgrade package is present on the XSCF

 

XSCF> getflashimage -l

Existing versions:

       Version                Size  Date

       FFXCP1115.tar.gz   45791674  Mon Jan 01 08:27:57 CST 2001

Upgrading the firmware

XSCF> flashupdate -c check -m xcp -s 1115  // check the version first

XCP update is possible with domains up

XSCF> flashupdate -c update -m xcp -s 1115    // run the upgrade

The XSCF will be reset. Continue? [y|n] :y

XCP update is started (XCP version=1115:last version=1080)

OpenBoot PROM update is started (OpenBoot PROM version=02320000)

OpenBoot PROM update has been completed (OpenBoot PROM version=02320000)

XSCF update is started (XSCFU=0,bank=1,XCP version=1115:last version=1080)

XSCF download is started (XSCFU=0,bank=1,XCP version=1115:last version=1080, Firmware Element ID=00:version=01110004:last version=01110004)

 

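The upgrade takes roughly 10 to 20 minutes. The XSCF resets automatically during it, and the update continues after the reset until it finishes.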

 

XSCF flashupdate[830]: XCP update has been completed (XCP version=1115)
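
Since the upgrade runs for a while and spans a reset, it can help to scan a saved console log for the completion message instead of watching the screen. The sketch below assumes the message text is exactly as shown in this transcript; completed_version is a made-up name.

```python
import re

# Completion message as it appears in the console log above.
DONE_RE = re.compile(r"XCP update has been completed \(XCP version=(\d+)\)")


def completed_version(log_text):
    """Return the XCP version from the completion message, or None."""
    match = DONE_RE.search(log_text)
    return match.group(1) if match else None


log = (
    "XCP update is started (XCP version=1115:last version=1080)\n"
    "XSCF flashupdate[830]: XCP update has been completed (XCP version=1115)\n"
)
print(completed_version(log))
```

A log that only contains the "is started" line yields None, which means the update is still in progress or was interrupted.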

 

Logging in as default

login: default

Change the panel mode switch to Locked and press return...  // follow the prompt

Leave it in that position for at least 5 seconds.  Change the panel mode switch to Service, and press return... // follow the prompt

 

Tip

+++++++++++++++++++++++++++++++++++++++

Logging in as default is cumbersome, so create a new user if needed.

XSCF> adduser ce

XSCF> password  ce

XSCF> setprivileges  ce platadm    (administrative privilege)

XSCF> showuser -l

+++++++++++++++++++++++++++++++++++++++++++++

 

Network configuration (omitted)

 

XSCF> shownetwork -a

xscf#0-lan#0

         Link encap:Ethernet  HWaddr 00:21:28:25:D4:D6

         inet addr:209.56.7.120  Bcast:209.56.7.255  Mask:255.255.255.0

         UP BROADCAST MULTICAST MTU:1500  Metric:1

         RX packets:0 errors:0 dropped:0 overruns:0 frame:0

         TX packets:0 errors:0 dropped:0 overruns:0 carrier:0

         collisions:0 txqueuelen:1000

          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

         Base address:0xe000

 

xscf#0-lan#1

         Link encap:Ethernet  HWaddr 00:21:28:25:D4:D7

         inet addr:192.168.2.1 Bcast:192.168.2.255 Mask:255.255.255.0

         UP BROADCAST MULTICAST  MTU:1500 Metric:1

         RX packets:0 errors:0 dropped:0 overruns:0 frame:0

         TX packets:0 errors:0 dropped:0 overruns:0 carrier:0

         collisions:0 txqueuelen:1000

         RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

         Base address:0xc000
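
As a sanity check on the shownetwork -a output, the address, broadcast, and mask on each inet line can be cross-checked with Python's standard ipaddress module. This is a sketch only; it assumes the ifconfig-style line format shown above, and check_inet_line is a hypothetical name.

```python
import ipaddress
import re

# Matches the "inet addr:... Bcast:... Mask:..." line of each interface.
INET_RE = re.compile(r"inet addr:(\S+)\s+Bcast:(\S+)\s+Mask:(\S+)")


def check_inet_line(line):
    """Return (interface, broadcast_matches_mask), or None if not an inet line."""
    match = INET_RE.search(line)
    if match is None:
        return None
    addr, bcast, mask = match.groups()
    iface = ipaddress.ip_interface(addr + "/" + mask)
    consistent = iface.network.broadcast_address == ipaddress.ip_address(bcast)
    return iface, consistent


lan0 = "inet addr:209.56.7.120  Bcast:209.56.7.255  Mask:255.255.255.0"
lan1 = "inet addr:192.168.2.1 Bcast:192.168.2.255 Mask:255.255.255.0"
for line in (lan0, lan1):
    iface, consistent = check_inet_line(line)
    print(iface.network, consistent)
```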

 

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++

 

Check whether the current version has been updated.

 

XSCF> version -c xcp -v

XSCF#0 (Active )

XCP0(Reserve): 1115

OpenBoot PROM : 02.32.0000

XSCF          : 01.11.0005

XCP1(Current): 1115

OpenBoot PROM : 02.32.0000

XSCF         : 01.11.0005

OpenBoot PROM BACKUP

#0: 02.11.0000

#1: 02.32.0000
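
To confirm from a saved transcript that both firmware banks ended up at the same XCP level, the bank lines of version -c xcp -v can be parsed. A minimal sketch, assuming the bank lines look like the ones above (with or without a space before the parenthesis); xcp_banks is a made-up helper.

```python
import re

# Bank lines appear both as "XCP0 (Current): 1115" and "XCP0(Reserve): 1115".
BANK_RE = re.compile(r"^(XCP\d)\s*\((Current|Reserve)\):\s*(\d+)", re.MULTILINE)


def xcp_banks(version_output):
    """Map each bank name to its (role, version) pair."""
    return {bank: (role, ver) for bank, role, ver in BANK_RE.findall(version_output)}


after_update = """\
XSCF#0 (Active )
XCP0(Reserve): 1115
OpenBoot PROM : 02.32.0000
XCP1(Current): 1115
OpenBoot PROM : 02.32.0000
"""

banks = xcp_banks(after_update)
same_level = len({ver for _, ver in banks.values()}) == 1
print(banks, same_level)
```

Note that the Current and Reserve roles have swapped banks compared with the output taken before the update, which this mapping makes easy to see.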

 

Time configuration

XSCF> showtimezone -c tz    // show the time zone

Asia/Shanghai

XSCF> poweron -d 0        // power on the machine (-d selects the domain)

DomainIDs to power on:00

Continue? [y|n] :y

Poweron canceled due to invalid system date and time.  // the time is reported as invalid

 

The time needs to be set again.

 

XSCF> setdate -s 2014.08.27-17:04:23

Wed Aug 27 17:04:23 CST 2014

The XSCF will be reset. Continue? [y|n] :y    // answer y; the XSCF then resets automatically

Wed Aug 27 09:04:23 UTC 2014

XSCF> execute J00shutdown_start  -- complete

execute K000end  -- complete

Aug 27 17:04:25 xscf0-M5000-MachineSN-1 XSCF[106]: XSCF shutdown sequence start
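
The setdate argument above uses the form yyyy.MM.DD-hh:mm:ss in local time, and the XSCF echoes the same instant in UTC (17:04:23 CST is 09:04:23 UTC). The sketch below builds that argument from a reference time and shows the conversion. It assumes CST here means China Standard Time (UTC+8), matching the Asia/Shanghai time zone shown earlier; setdate_argument is a made-up name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CST in this transcript is China Standard Time, UTC+8.
CST = timezone(timedelta(hours=8), "CST")


def setdate_argument(local_time):
    """Format a datetime the way it was passed to 'setdate -s' above."""
    return local_time.strftime("%Y.%m.%d-%H:%M:%S")


local = datetime(2014, 8, 27, 17, 4, 23, tzinfo=CST)
as_utc = local.astimezone(timezone.utc)

print("setdate -s " + setdate_argument(local))
print(as_utc.isoformat())
```

Taking the reference time from the healthy node (the date check made before the shutdown) keeps the two cluster nodes close enough to avoid a time-sync complaint later.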

Normal boot

After the XSCF has restarted normally, log in as ce again and check the hardware status.

XSCF> showhardconf

XSCF> showhardconf -u

 

Once the hardware checks out, power on the domain.

XSCF> poweron -d 0

DomainIDs to power on:00

Continue? [y|n] :y  // answer y

00 :Powering on

 

*Note*

 This command only issues the instruction to power-on.

 The result of the instruction can be checked by the "showlogs power".

 

Entering the host console

XSCF> console -d 0 

 

Console contents may be logged.

Connect to DomainID 0?[y|n] :y

 

The system stops at the ok prompt, where you can use printenv, probe-scsi-all, setenv, nvalias, and so on.

 

OK boot

 

 

SPARC Enterprise M5000 Server, using Domain console

Copyright (c) 1998, 2012, Oracle and/or its affiliates. All rights reserved.

Copyright (c) 2012, Oracle and/or its affiliates and Fujitsu Limited. All rights reserved.

OpenBoot 4.33.5.d, 65536 MB memory installed, Serial #xxxxxx.

Ethernet address 0:21:28:25:d4:d2, Host ID: xxxxxxx.

 

Aborting auto-boot sequence.

{0} ok boot

 

Starting the cluster software

Log in to the system and check the cluster status. RMS did not start automatically, so start it manually.

bash-3.00# hvdisp -a

hvdisp: RMS is not running

bash-3.00# hvcm -s ldtx-db1    // start on the named host; hvcm -a covers all hosts

Starting Reliant Monitor Services now

bash-3.00# disAug 27 17:05:42 LDTX-DB1  : LOG3.014091303421080023   0   3    0    4.2        RMS              (WRP, 34): ERROR: Cluster host ldtx-db2RMS is no longer in time sync with local node. Sane operation of RMS can no longer be guaranteed. Further out-of-sync messages will appear in the syslog.

 

bash-3.00# hvdisp -a

 

Local System:  ldtx-db1RMS

Configuration:/opt/SMAW/SMAWRrms/build/config.us

 

Resource            Type    HostName            State        StateDetails

-----------------------------------------------------------------------------

ldtx-db2RMS         SysNode                     Online      

ldtx-db1RMS         SysNode                     Online      

LDTX                userApp                     Standby     

LDTX                userApp ldtx-db2RMS         Online

Machine001_LDTX     andOp  ldtx-db2RMS                     

Machine000_LDTX     andOp  ldtx-db1RMS         Offline     

ManageProgram000_Cmdline0 gRes                        Offline     

Ipaddress000_Gls0   gRes                        Standby     

AllDiskClassesOk_Gds0 andOp                       Offline     

cdata1_Gds0         gRes                        Offline     

 

 

bash-3.00# man hvcm

Reformatting page.  Please Wait... done

 

Maintenance Commands                                     hvcm(1M)

 

NAME

    hvcm - start the Reliant Monitor configuration monitor

 

SYNOPSIS

     hvcm {-a | -s SysNode }                             Format 1

 

    hvcm -c config_file {-a | -s SysNode } [-h time] [-l level]

                                                        Format 2

 

    hvcm -V                                            Format 3

 

DESCRIPTION

    The configuration monitor is the decision-making  module of

    Reliant  Monitor.   It controls the configuration and access

    to all Reliant Monitor resources. If a resource fails,  the

    configuration monitor analyzes the failure and initiates the

    appropriate action according to the specifications  for the

    resource in the nodes configuration file.

 

    The hvcm command starts the configuration  monitor and  the

    detectors  for all monitored resources. In most cases, it is

    not necessary to specify options to the hvcm  command;  the

    default values are sufficient for most configurations.

 

Appendix: Introduction to the Fujitsu cluster software

http://www.fujitsu.com/cn/services/hardware/servers/software/index.html

 

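Cluster commands:


Show the resource status of the local host

hvdisp -a


Start RMS on all nodes

hvcm -a


Start RMS on a specific node

hvcm -s hostnode


Switch an application to a node

hvswitch app hostnode


Adjust the resource state, usually to clear a Faulted state

hvutil -c userapp


Switch to the Online state

hvutil -a userapp


Stop the resource; starting it again requires hvswitch

hvutil -f userapp


Switch to the deactivated state

hvutil -d userapp


Put the resource into or take it out of maintenance mode

hvutil -m off|on userapp


Heartbeat interfaces

cfconfig -g


Show node information

cftool -n


Show heartbeat status

ciptool -n


Show the cluster shared resource tree

/etc/opt/FJSVcluster/bin/clgettree


Show NIC multipath status information

/opt/FJSVhanet/usr/sbin/dsphanet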

 

# hvsetenv HV_RCSTART

Shows whether RMS starts automatically: 1 means it does, 0 means it does not.

 

# hvsetenv HV_AUTOSTARTUP

Shows whether the userApplication starts automatically: 1 means it does, 0 means it does not.

 

# hvsetenv HV_RCSTART 0

Disables automatic startup of RMS.

 

# hvsetenv HV_AUTOSTARTUP 0

Disables automatic startup of the userApplication.

 

Collect the RMS (resource manager) configuration information.

# hvdump -f /opt/pcl.`date`.Z

Note: this exports the RMS configuration information to the file /opt/pcl.`date`.Z.
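
One thing to watch with the file name above: the default output of date contains spaces (for example "Wed Aug 27 15:31:34 CST 2014", as seen in the shutdown transcript), so an unquoted backtick expansion is split into several words by the shell. The sketch below builds a name with a compact timestamp instead. The timestamp format and the helper names are assumptions for illustration; only the /opt/pcl prefix and the hvdump -f option come from the command above.

```python
from datetime import datetime


def dump_path(now, prefix="/opt/pcl"):
    """Build a dump file name with a whitespace-free timestamp."""
    stamp = now.strftime("%Y%m%d-%H%M%S")   # assumed format, e.g. 20140827-170542
    return prefix + "." + stamp + ".Z"


def hvdump_command(now):
    """Return the argument list for the hvdump call shown above."""
    return ["hvdump", "-f", dump_path(now)]


when = datetime(2014, 8, 27, 17, 5, 42)
print(" ".join(hvdump_command(when)))
```

Returning an argument list rather than a single string means the path is passed as one argument if the command is later run through subprocess.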

 

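Collect this node's system information and logs.

# /opt/FJSVsnap/bin/fjsnap -a /tmp/fjsnap_`uname -n`.tar.gz

Note: this collects the node's system configuration and log information and writes it to /tmp/fjsnap_`uname -n`.tar.gz.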

 

 

09-23 07:35