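Hello everyone, this is the WeChat official account DBA学习之路, where I share knowledge and experience picked up while learning domestic Chinese databases.

Preface

Today, while checking an Oracle RAC 12.2.0.1 database, I found that the cluster status command hung and never returned:

[Screenshot: the cluster status command hanging]

After some analysis the problem was solved. It turned out to be fairly simple, so here is how it was handled.

Problem Analysis

My first suspicion was that a cluster resource had gone down. Listing the cluster resources showed that ora.crsd was offline:

## Node 1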
[grid@lucifer1 ~]$ crsctl stat res -t -init
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       mesdb0                   Started,STABLE
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       mesdb0                   STABLE
ora.crf
      1        ONLINE  ONLINE       mesdb0                   STABLE
ora.crsd
      1        ONLINE  OFFLINE                               STABLE
ora.cssd
      1        ONLINE  ONLINE       mesdb0                   STABLE
ora.cssdmonitor
      1        ONLINE  ONLINE       mesdb0                   STABLE
ora.ctssd
      1        ONLINE  ONLINE       mesdb0                   OBSERVER,STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.evmd
      1        ONLINE  ONLINE       mesdb0                   STABLE
ora.gipcd
      1        ONLINE  ONLINE       mesdb0                   STABLE
ora.gpnpd
      1        ONLINE  ONLINE       mesdb0                   STABLE
ora.mdnsd
      1        ONLINE  ONLINE       mesdb0                   STABLE
ora.storage
      1        ONLINE  ONLINE       mesdb0                   STABLE
--------------------------------------------------------------------------------

## Node 2
[grid@lucifer2 ~]$ crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
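
In a listing this long, the one unhealthy resource is easy to overlook. As a minimal sketch, the tabular output can be filtered for resources whose target is ONLINE but whose state is not. The helper name `offline_resources` is made up for this post and is not part of Oracle's tooling; it assumes the two-line-per-resource layout shown above.

```shell
# Hypothetical helper: reads `crsctl stat res -t [-init]` output on stdin and
# prints the names of resources whose Target is ONLINE but State is not.
offline_resources() {
  awk '
    /^ora\./ { name = $1; next }        # resource name line, e.g. ora.crsd
    name != "" && $1 ~ /^[0-9]+$/ {     # instance line: id, Target, State, ...
      if ($2 == "ONLINE" && $3 != "ONLINE") print name
    }
  '
}
```

Run as `crsctl stat res -t -init | offline_resources`. Fed the node 1 listing above, it prints only `ora.crsd`, because `ora.diskmon` has an OFFLINE target and is therefore skipped.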

Check the CRS alert.log:

2025-02-24 06:11:42.105 [ORAROOTAGENT(29459)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 29459
2025-02-24 06:12:42.142 [ORAROOTAGENT(29459)]CRS-5818: Aborted command 'check' for resource 'ora.crsd'. Details at (:CRSAGF00113:) {0:15:2} in /oracle/app/grid/diag/crs/mesdb0/crs/trace/ohasd_orarootagent_root.trc.
2025-02-24 06:13:20.260 [CRSD(30357)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 30357
2025-02-24 06:13:22.541 [CRSD(30357)]CRS-1019: The OCR Service exited on host mesdb0. Details in /oracle/app/grid/diag/crs/mesdb0/crs/trace/crsd.trc
2025-02-24T06:13:22.563713+08:00
Errors in file /oracle/app/grid/diag/crs/mesdb0/crs/trace/crsd.trc  (incident=41):
CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /oracle/app/grid/diag/crs/mesdb0/crs/incident/incdir_41/crsd_i41.trc

2025-02-24 06:13:22.584 [CRSD(30357)]CRS-8505: Oracle Clusterware CRSD process with operating system process ID 30357 encountered internal error CRS-01019
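
The alert log mixes routine startup messages with the actual failure. One rough way to see which message codes dominate is to tally them. The sketch below assumes only that codes appear as `CRS-` followed by digits; the function name `crs_code_counts` is invented here, and leading zeros are stripped so that `CRS-01019` and `CRS-1019` count as the same code.

```shell
# Hypothetical helper: reads a Clusterware alert log on stdin and prints
# "<code> <count>" for every CRS-nnnn message code, sorted by code.
crs_code_counts() {
  grep -oE 'CRS-[0-9]+' |
    sed -E 's/^CRS-0+([0-9])/CRS-\1/' |   # CRS-01019 -> CRS-1019
    sort | uniq -c |
    awk '{ print $2, $1 }'                # reorder to "code count"
}
```

For example, `crs_code_counts < alert.log`, where `alert.log` is assumed to sit in the same trace directory that the messages above point to.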

Check the crsd log:

2025-02-24 06:13:22.514 :  OCRMSG:3187623680: prom_listen: Port str [a0f4-81a3-c06c-03aa]
2025-02-24 06:13:22.514 :  OCRSRV:3187623680: proath_listen: listening to remote requests at portstr [a0f4-81a3-c06c-03aa]
2025-02-24 06:13:22.518 :  OCRMSG:3168728832: prom_listen: Port str [ab1d-0688-2d30-7387]
2025-02-24 06:13:22.518 :  OCRSRV:3168728832: th_invalidate_cache: listening to cache_invalidation requests at portstr [ab1d-0688-2d30-7387]
2025-02-24 06:13:22.522 :  OCRMSG:3166627584: prom_listen: Port str [c71c-c1a3-dc88-994f]
2025-02-24 06:13:22.522 :  OCRSRV:3166627584: proath_listen: listening to remote rim requests at portstr [c71c-c1a3-dc88-994f]
2025-02-24 06:13:22.533 :  OCRMAS:3164526336: th_calc_av: Configured Active Patch Level [0]
2025-02-24 06:13:22.533 :  OCRMAS:3164526336: th_calc_av:5'': Return persisted APL [0]
  OCRMAS:3164526336: th_calc_av:5': Return persisted AV [203424000] [12.2.0.1.0]
2025-02-24 06:13:22.535 :  OCRMAS:3164526336: th_master_prereg: Persistent upgrade state retrieved from OCR is [0].
2025-02-24 06:13:22.537 :  OCRMAS:3164526336: th_master_prereg: Persistent upgrade toversion buffer retrieved from OCR is [12.2.0.1.0]. Setting toversion to [203424000].
2025-02-24 06:13:22.541 : CSSCLNT:3164526336: clssgsGroupJoin: member in use group(1/ocrlocal)
2025-02-24 06:13:22.541 : default:3164526336: procr_reg_localgrp: Error [14] from clssgsreglocalgrp(). Return [23].
2025-02-24 06:13:22.541 : default:3164526336: SLOS : [clsuSlosFormatDiag called with non-error slos.]

2025-02-24 06:13:22.541 :  OCRMAS:3164526336: th_master_register: Failed to register in OCRLOCAL group. Retval:[23]
2025-02-24 06:13:22.541 :  OCRAPI:3164526336: procr_ctx_set_invalid: ctx is in state [6].
2025-02-24 06:13:22.541 :  OCRAPI:3164526336: procr_ctx_set_invalid: ctx set to invalid
Trace file /oracle/app/grid/diag/crs/mesdb0/crs/trace/crsd.trc
Oracle Database 12c Clusterware Release 12.2.0.1.0 - Production Copyright 1996, 2016 Oracle. All rights reserved.
DDE: Flood control is not active
2025-02-24T06:13:22.564565+08:00
Incident 41 created, dump file: /oracle/app/grid/diag/crs/mesdb0/crs/incident/incdir_41/crsd_i41.trc
CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
2025-02-24 06:13:22.706 :  OCRAPI:3164526336: procr_ctx_set_invalid: Aborting...
Trace file /oracle/app/grid/diag/crs/mesdb0/crs/trace/crsd.trc
Oracle Database 12c Clusterware Release 12.2.0.1.0 - Production Copyright 1996, 2016 Oracle. All rights reserved.
 default:2552033344: 1: clskec:has:CLSU:910 4 args[CLSD00302][mod=clsdadr.c][loc=(:CLSD00302:)][msg=clsdAdrInit: Trace file size and number of segments fetched from environemnt variable: ORA_DAEMON_TRACE_FILE_OPTIONS filesize=26214400,numsegments=10]

    CLSB:2552033344: Argument count (argc) for this daemon is 2
    CLSB:2552033344: Argument 0 is: /oracle/app/12.2.0/grid/bin/crsd.bin
    CLSB:2552033344: Argument 1 is: reboot
2025-02-24 06:13:22.829 : CSSCLNT:2552033344: clsssinit: initialized context: (0x4edf930) flags 0x207
2025-02-24 06:13:22.829 : CRSMAIN:2552033344:  First attempt: init CSS context succeeded.
2025-02-24 06:13:22.829 : CRSMAIN:2552033344:  Start mode: normal
2025-02-24 06:13:22.831 :  CLSDMT:2343307008: PID for the Process [30402], connkey CRSD
2025-02-24 06:13:23.745 : CRSMAIN:2552033344:  CRS Daemon Starting
2025-02-24 06:13:23.745 : CRSMAIN:2343307008:  Process environment is not initialized yet!
2025-02-24 06:13:23.746 :    CRSD:2552033344:  Logging level for Module: clsdadr  0
2025-02-24 06:13:23.746 :    CRSD:2552033344:  Logging level for Module: clsdnreg  0
2025-02-24 06:13:23.746 :    CRSD:2552033344:  Logging level for Module: clsdynam  0
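
Most of this trace is routine startup noise; the lines that matter are the failed registration in the OCRLOCAL group and the abort that follows. As a sketch, a filter like the one below keeps only such lines. The patterns are taken from this particular excerpt and are an assumption, not a complete list of crsd failure signatures; `crsd_failure_lines` is a made-up name.

```shell
# Hypothetical helper: reads crsd.trc on stdin and keeps only lines that
# look like failures, based on the wording seen in this incident.
crsd_failure_lines() {
  grep -E 'Failed to |Error \[[0-9]+\]|ctx set to invalid|Aborting'
}
```

It could be run as `crsd_failure_lines < /oracle/app/grid/diag/crs/mesdb0/crs/trace/crsd.trc`, using the trace path reported in the alert log.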

Look at the incident trace dump:

----- Invocation Context Dump -----
Address: 0x7f1a9c024340
Phase: 3
flags: 0x10E0000
Incident ID: 41
Error Descriptor: CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
Error class: 0
Problem Key # of args: 0
Number of actions: 10
----- Incident Context Dump -----
Address: 0x7f1abc9d99d0
Incident ID: 41
Problem Key: CRS 1019
Error: CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
[00]: dbgePostErrorDirectVaList_int [diag_dde]
[01]: dbgePostErrorDirect [diag_dde]
[02]: clsdAdrPostError []
[03]: clsdadrpr_CreateIncidentCheck []
[04]: clsdadrprAlert []
[05]: clsd_alertprintft []
[06]: proath_master_exit_helper []<-- Signaling
[07]: proath_master_register []
[08]: proath_master []
[09]: start_thread []
MD [00]: 'Client ProcId'='crsd.bin@mesdb0.30357_139752810403584' (0x0)
Impact 0:
Impact 1:
Impact 2:
Impact 3:
Derived Impact:
----- END Incident Context Dump -----
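
When searching MOS, the function names in the call stack are usually the most useful keywords. The sketch below pulls them out of an incident dump, assuming frames are formatted as `[NN]: name [...]` like above; both helper names are made up for illustration.

```shell
# Hypothetical helpers: read an incident trace dump on stdin.

# Print every function name in the call stack, one per line, top frame first.
incident_stack() {
  awk '/^\[[0-9]+\]:/ { print $2 }'
}

# Print only the frame marked "<-- Signaling", i.e. where the error was raised.
signaling_frame() {
  awk '/^\[[0-9]+\]:/ && /<-- Signaling/ { print $2 }'
}
```

For the dump above, `signaling_frame < crsd_i41.trc` prints `proath_master_exit_helper`.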

This looked very much like a bug. A search on MOS turned up two closely matching documents:

  1. crsd.bin Fail With Error CRS-1019 When ohasd Restarted (Doc ID 2291799.1)
  2. Bug 24396050 - crsd.bin failed several times with error CRS-1019 (Doc ID 24396050.8)

Screenshots from MOS:

[Screenshots of the MOS documents]

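The MOS content matches the logs from this incident exactly, which confirms that this is the bug. It has to be fixed by applying a patch:

[Screenshot from the MOS document]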

Solution

Download the patch for the bug: Patch 24396050: LNX64-12.2-CRS: CRSD.BIN FAILED SEVERAL TIMES WITH ERROR CRS-1019

[Screenshot of the patch download]

Update OPatch

The patch README states: You must use the OPatch utility version 12.2.0.1.5 or later to apply this patch.

Check whether the current OPatch version meets this requirement:

[grid@mesdb0 ~]$ cd $ORACLE_HOME/OPatch/
[grid@mesdb0 OPatch]$ ./opatch version
OPatch Version: 12.2.0.1.6

OPatch succeeded.

Version 12.2.0.1.6 meets the requirement, so OPatch does not need to be updated.
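
The comparison is easy by eye here, but dotted versions do not compare correctly as plain strings (12.2.0.1.10 sorts before 12.2.0.1.5 lexically). A minimal sketch of a scripted check follows; it assumes GNU `sort -V` is available, as it is on typical Linux x86-64 systems, and both function names are invented for this example.

```shell
# Hypothetical helpers for checking the OPatch version requirement.

# Extract the version number from `opatch version` output read on stdin.
opatch_version_from_output() {
  awk -F': *' '/^OPatch Version/ { print $2; exit }'
}

# Succeed if version $1 is greater than or equal to version $2.
# Relies on GNU sort -V (version sort): the smaller version sorts first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

A usage along the lines of `version_ge "$(./opatch version | opatch_version_from_output)" 12.2.0.1.5 && echo "OPatch is new enough"` would cover the README requirement.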

Extract the patch

## Run as root
unzip -q /soft/p24396050_122010_Linux-x86-64.zip -d /soft/
chown -R oracle:oinstall /soft/24396050

Install the patch

## Run as root
export GI_HOME=/oracle/app/12.2.0/grid

## Pre-installation check
$GI_HOME/OPatch/opatchauto apply /soft/24396050 -analyze

## Apply the patch
$GI_HOME/OPatch/opatchauto apply /soft/24396050 -oh $GI_HOME

After the patch was installed, I rebooted the system and verified that the cluster was back to normal.
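
The post does not record which commands were used for this final verification. One plausible check, sketched under that assumption, is to confirm that the patch ID appears in the inventory and that the stack reports healthy. The helper `patch_listed` is hypothetical; it only searches text on stdin for the patch number as a whole number.

```shell
# Hypothetical helper: succeed if patch ID $1 appears as a standalone number
# in the inventory listing read on stdin (e.g. output of `opatch lspatches`).
patch_listed() {
  grep -Eq "(^|[^0-9])${1}([^0-9]|\$)"
}
```

As the grid user this could be run as `$ORACLE_HOME/OPatch/opatch lspatches | patch_listed 24396050 && echo "patch present"`, followed by `crsctl check crs` and `crsctl stat res -t` on each node to confirm that CRS responds again.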
