MSE 8.0 services stop unexpectedly and PI reports the MSE as Unreachable

JapanTAC_CSC · ‎2016-10-28

October 28, 2016 (first edition)

TAC SR Collection

Problem

Cause

Software defects:

CSCuw31838

MSE 8.0 goes down with archive logs filling up - sid env variable null

and

CSCuy20418

MSE 8.0.130.0 service goes down when Archive logs gets filled up

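With this problem, an MSE running release 8.0 operates normally for some time after startup and then its services stop without warning. Checking the MSE in PI shows its status as "Unreachable". PI also raises the following alarm.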

[Alarm on PI]
Database Log Cleanup Failure. Database archive logs clean-up failed 8 times. Please restart the MSE as soon as possible to remedy this situation. If this condition is not fixed soon, the MSE may stop functioning.

After the MSE has stopped, trying to restart the MSE service or checking its status with the getserverinfo command produces output like the following on the MSE.


[root@CISCOMSE ~]# service msed start
Starting MSE Platform
Flushing any pending data from Admin Process read and write pipe.
Starting Apache HTTPD Server
Apache Server is already running. Skipping restart.
Starting Health Monitor, Waiting to check the status.
Health Monitor successfully started
Starting Admin process...
Started Admin process.
Starting database ..................
Database started successfully. Starting framework and services ...............................................MSE startup not completed yet. Please check back later with getserverinfo command.

[root@CISCOMSE ~]# getserverinfo
Health Monitor is running
Retrieving MSE Services status.
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Checking status of service at port 7555
Failed to get the status for process at port 7555
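When this condition has to be detected from a script, the getserverinfo output above can be summarized mechanically. The following is a minimal sketch and not part of the original procedure: the function name summarize_getserverinfo is hypothetical, and it assumes the output lines use exactly the wording shown above.

```shell
# Hypothetical helper (not an MSE command): reads getserverinfo output on
# stdin, counts the port-check retries, and records whether the final
# "Failed to get the status" line was seen.
summarize_getserverinfo() {
    awk '
        /^Checking status of service at port/ { retries++ }
        /^Failed to get the status for process at port/ { failed = 1 }
        END { printf "retries=%d failed=%d\n", retries, failed }
    '
}
```

On the MSE it would be used as "getserverinfo | summarize_getserverinfo"; a result with failed=1 corresponds to the state described in this article.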

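Solution

If the problem has already occurred, perform the recovery procedure below and then upgrade to a release that contains the fix for CSCuy20418; after the upgrade the problem no longer occurs. The procedure restores the MSE so that its services can be restarted, but it makes changes directly to the database, so every command must be entered exactly as shown.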

There are two recovery procedures, depending on the version.
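The choice between the two procedures is a comparison of the installed 8.0 version against 8.0.130.0. As a rough sketch, it can be written as follows; the function name mse_recovery_procedure and the two labels it prints are hypothetical and exist only for illustration, and the version string is assumed to be four dot-separated numbers.

```shell
# Hypothetical helper: prints which recovery procedure applies to a given
# MSE 8.0 version string, by comparing it field by field with 8.0.130.0.
mse_recovery_procedure() {
    printf '%s\n' "$1" | awk -F. '{
        split("8.0.130.0", t, ".")
        for (i = 1; i <= 4; i++) {
            a = $i + 0      # field of the given version (0 if missing)
            b = t[i] + 0    # matching field of the threshold version
            if (a > b) { print "8.0.130.0-or-later"; exit }
            if (a < b) { print "before-8.0.130.0"; exit }
        }
        print "8.0.130.0-or-later"   # all four fields are equal
    }'
}
```

For example, "mse_recovery_procedure 8.0.130.0" selects the first procedure, while an earlier 8.0 build such as 8.0.110.0 selects the second.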

Recovery procedure

[Version 8.0.130.0 and later]

1. Stop the MSE service.
[root@CISCOMSE ~]# service msed stop

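2. Confirm that no Oracle processes are running.

[root@CISCOMSE ~]# ps -ef | grep oracle
If any processes other than the grep command itself are listed, stop them with the kill command. The syntax is "kill -9 PID", where PID is the first number on each output line (the second column). When all Oracle processes have stopped, the output looks like the following; confirm that this is the case. The hald-addon-storage line in this example matches only because "storage" contains the string "ora"; it is not an Oracle process and should be left alone.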

[root@CISCOMSE ~]# ps -ef | grep ora

root 3412 3388 0 Oct24 ? 00:05:44 hald-addon-storage: polling /dev/hdc

root 3790 32272 0 09:02 pts/1 00:00:00 grep ora

3. Restart the database by entering the following commands.

[root@CISCOMSE ~]# cd /opt/mse/framework/
[root@CISCOMSE framework]# oracleDBStartStop.sh start mseorcl mount
[root@CISCOMSE framework]# source /opt/mse/install/oracleenv

4. Connect to Recovery Manager (RMAN) and delete the archive logs with the following RMAN commands.

[root@CISCOMSE ~]# oraclesudo -n --command="rman target /" oracle

RMAN> crosscheck archivelog all;

RMAN> delete noprompt archivelog all;

RMAN> quit

5. Then start the MSE again.

service msed start
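For step 2 of this procedure, matching on the string "ora" also catches unrelated processes such as hald-addon-storage. A possible alternative, shown here only as a sketch, is to filter the ps -ef output by process owner. It assumes that the database processes run as the OS user "oracle" (the same user that RMAN is run as in this article); the function name list_oracle_pids is hypothetical.

```shell
# Hypothetical helper: reads `ps -ef` output on stdin and prints the PID
# (second column) of every process whose owner (first column) is "oracle".
# Assumption: the MSE database processes run as the OS user "oracle".
list_oracle_pids() {
    awk '$1 == "oracle" { print $2 }'
}
```

Running "ps -ef | list_oracle_pids" prints nothing once all such processes have stopped. The sketch only lists PIDs for review; stopping them with "kill -9 PID" remains a manual step as described in step 2.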

==================================================

[8.0 releases earlier than 8.0.130.0]

1. Stop the MSE service.
[root@CISCOMSE ~]# service msed stop

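2. Confirm that no Oracle processes are running.

[root@CISCOMSE ~]# ps -ef | grep oracle
If any processes other than the grep command itself are listed, stop them with the kill command. The syntax is "kill -9 PID", where PID is the first number on each output line (the second column). When all Oracle processes have stopped, the output looks like the following; confirm that this is the case. The hald-addon-storage line in this example matches only because "storage" contains the string "ora"; it is not an Oracle process and should be left alone.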

[root@CISCOMSE ~]# ps -ef | grep ora

root 3412 3388 0 Oct24 ? 00:05:44 hald-addon-storage: polling /dev/hdc

root 3790 32272 0 09:02 pts/1 00:00:00 grep ora

3. Restart the database by entering the following commands.

[root@CISCOMSE ~]# cd /opt/mse/framework/
[root@CISCOMSE framework]# oracleDBStartStop.sh start mseorcl mount
[root@CISCOMSE framework]# source /opt/mse/install/oracleenv

4. Run the following SQL statements in a SQL*Plus session.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> set numwidth 15;
SQL> select * from v$recovery_file_dest;

NAME
--------------------------------------------------------------------------------
SPACE_LIMIT SPACE_USED SPACE_RECLAIMABLE NUMBER_OF_FILES
--------------- --------------- ----------------- ---------------
/opt/data/flash_recovery_area
75161927680 119537664 109527040 12

SQL> quit

5. Connect to Recovery Manager (RMAN) and delete the archive logs with the following RMAN commands.

[root@CISCOMSE ~]# su --command "rman target /" oracle

RMAN> crosscheck archivelog all;

RMAN> delete noprompt archivelog all;

RMAN> quit

6. Then start the MSE again.

service msed start
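The v$recovery_file_dest query in step 4 of this procedure reports SPACE_LIMIT and SPACE_USED in bytes, which are hard to compare by eye. As an illustration only, the usage percentage can be computed from those two numbers as follows; the function name recovery_area_pct_used is hypothetical and the values are copied by hand from the query output.

```shell
# Hypothetical helper: prints the percentage of the recovery area in use.
#   $1 = SPACE_LIMIT in bytes, $2 = SPACE_USED in bytes
# (both taken from the v$recovery_file_dest output).
recovery_area_pct_used() {
    awk -v limit="$1" -v used="$2" 'BEGIN {
        if (limit <= 0) { print "n/a"; exit }   # avoid division by zero
        printf "%.2f\n", used * 100 / limit
    }'
}
```

With the sample values shown in step 4, "recovery_area_pct_used 75161927680 119537664" prints 0.16, that is, about 0.16% of the limit is in use.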

Note

These defects can also be looked up in the Bug Search Tool.