Renaming ASM Disks; Waiting for the Clouds to Part

Today the storage team wanted to test IBM SVC storage synchronization, which meant kicking all of the disks out of the server and adding them back. Since the disk names change in the process, the disk names registered in ASM all have to be updated as well. The procedure is actually quite simple:


1. Modify asm_diskstring, the parameter that controls the ASM instance's default disk discovery path, with the following command:

alter system set asm_diskstring='/dev/rhdisk*';
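
To confirm that ASM can still discover the disks after the change, a quick check from the ASM instance might look like this (a sketch; run as sysasm):

SQL> select path, header_status, group_number from v$asm_disk;

Disks that show group_number 0 have been discovered through the search path but do not yet belong to any disk group.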

2. Shut down the entire cluster, and after the disks have been removed and re-added, modify the attributes below. Mine is a RAC environment, so the following operations have to be executed on every node.

Change the disk owner and group:

[rac11g2@root]# chown grid:asmadmin /dev/rhdisk[2-4]

Change the disk permissions to 660:

[rac11g2@root]# chmod 660 /dev/rhdisk[2-4]

Change the disk reservation policy so the disks can be shared across nodes:

[rac11g2@root]# lsattr -El hdisk2|grep reserve_policy
reserve_policy  no_reserve    Reserve Policy    True

[rac11g2@root]# chdev -l hdisk2 -a reserve_policy=no_reserve
[rac11g2@root]# chdev -l hdisk3 -a reserve_policy=no_reserve
[rac11g2@root]# chdev -l hdisk4 -a reserve_policy=no_reserve
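
With more than a handful of disks it is less error-prone to loop over them; a minimal sketch for AIX, assuming the same rhdisk2-4 naming used here, run as root on every node:

# set owner, permissions and reservation policy for each ASM disk
for n in 2 3 4
do
  chown grid:asmadmin /dev/rhdisk$n
  chmod 660 /dev/rhdisk$n
  chdev -l hdisk$n -a reserve_policy=no_reserve
done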

3. The cluster can now be started:

[rac11g1@root]# crsctl start cluster -all
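
Once the start command returns, it is worth verifying that every resource actually came back online; the standard checks are:

[rac11g1@root]# crsctl check cluster -all
[rac11g1@root]# crsctl stat res -t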

Note: I once forgot to change the disk permissions to 660, and as a result the database would not start, throwing an ORA-00600 error in the alert log. That gave me a scare, but the log makes it fairly easy to see that it is a permission problem; after fixing the disk permissions a restart was all it took:

Sweep [inc][393409]: completed
Sweep [inc2][393409]: completed
NOTE: Loaded library: System
ORA-15025: could not open disk "/dev/rhdisk4"
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 13: Permission denied
Additional information: 11
SUCCESS: diskgroup DATA was mounted
Errors in file /soft/Oracle/diag/rdbms/nint/nint1/trace/nint1_ckpt_19136654.trc  (incident=409793):
ORA-00600: internal error code, arguments: [kfioTranslateIO03], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /soft/oracle/diag/rdbms/nint/nint1/incident/incdir_409793/nint1_ckpt_19136654_i409793.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
NOTE: dependency between database nint and diskgroup resource ora.DATA.dg is established
ERROR: unrecoverable error ORA-600 raised in ASM I/O path; terminating process 19136654
Dumping diagnostic data in directory=[cdmp_20120302172201], requested by (instance=1, osid=19136654 (CKPT)), summary=[incident=409793].
Fri Mar 02 17:22:01 2012
PMON (ospid: 14156014): terminating the instance due to error 469
System state dump requested by (instance=1, osid=14156014 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /soft/oracle/diag/rdbms/nint/nint1/trace/nint1_diag_21168306.trc
Fri Mar 02 17:22:02 2012
ORA-1092 : opitsk aborting process
Fri Mar 02 17:22:02 2012
License high water mark = 1
Instance terminated by PMON, pid = 14156014
USER (ospid: 15335672): terminating the instance
Instance terminated by USER, pid = 15335672
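
A small pre-flight check before starting the cluster would have caught this; a sketch, assuming the rhdisk2-4 naming used above:

# every ASM disk should show crw-rw---- grid asmadmin
ls -l /dev/rhdisk[2-4]
# and reserve_policy should be no_reserve on every node
lsattr -El hdisk2 -a reserve_policy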

[Excerpt]

First, you can try to check the OS drive ownership, permissions and reserve_policy attribute on all nodes, then restart the ASM instance.
  1) Make sure that the hdisk# is owned by the OS user who installed the ASM Oracle Home, and that the disk is mounted correctly (with the correct owner).
  2) Make sure that the permissions are set correctly at the disk level; 660 is normal, but if there are problems use 777 as a test.
  ls -l /dev/rhdisk3 output:
  For 10gR2/11gR1 like:  crw-rw----  oracle:oinstall /dev/rhdisk3
  For 11gR2 like:        crw-rw----  grid:asmadmin /dev/rhdisk3

  How to change the drive ownership and permissions?
  For 10gR2/11gR1:
    # chown -R oracle:oinstall /dev/rhdisk[3-10]
    # chmod -R 660 /dev/rhdisk[3-10]
  For 11gR2:
    # chown -R grid:asmadmin /dev/rhdisk[3-10]
    # chmod -R 660 /dev/rhdisk[3-10]

  3) Make sure that the reserve_policy attribute of the needed hdisk# is no_reserve (or no) on all nodes:
    chdev -l hdisk# -a reserve_policy=no_reserve

If the first step also fails, you may try to set the Oracle ASM parameter ASM_DISKSTRING to /dev/* or /dev/rhdisk*. The steps are:
1) Back up the ASM instance pfile (parameter file) or spfile (server parameter file).
   They usually live in $ORACLE_HOME/dbs; the pfile name looks like init+ASM1.ora, and you can back it up with cp and inspect it with vi.
   If you are using an spfile, create a pfile from it as the backup.
2) Set the ASM_DISKSTRING parameter.
  With a pfile:
    Add or edit the ASM_DISKSTRING line in the pfile to read *.ASM_DISKSTRING='/dev/rhdisk*', then start the ASM instance using that pfile.

  With an spfile:
    $ ORACLE_SID=+ASM1; export ORACLE_SID

    $ sqlplus "/ as sysdba"
    or
    $ sqlplus "/ as sysasm"

    SQL> startup
    SQL> alter system set asm_diskstring='/dev/rhdisk*';
    SQL> select group_number,disk_number,path from v$asm_disk;
        -- You should get the disk info back, and most disks' group_number should not be 0.
If ASM_DISKSTRING is NOT SET, then the following per-OS default search string is used:

    Operating System                Default Search String
    =====================================================
    Solaris (32/64 bit)             /dev/rdsk/*
    Windows NT/XP                   \\.\orcldisk*
    Linux (32/64 bit)               /dev/raw/*
    LINUX (ASMLIB)                  ORCL:*
    LINUX (ASMLIB)                  /dev/oracleasm/disks/*  (as a workaround)
    HPUX                            /dev/rdsk/*
    HP-UX (Tru64)                   /dev/rdisk/*
    AIX                             /dev/*

Original title: Waiting for the Clouds to Part: Analysis of Resolving an ASM Storage High-Availability Fault

Author | Jiang Jinsong (姜劲松), Oracle technology expert in the expert support department at 云和恩墨, holding Oracle OCP, MySQL OCP, RHCE and other certifications. He has served mobile-carrier customers for many years and specializes in Oracle performance optimization, fault diagnosis and special recovery, with 23 years of IT experience as a senior database and hardware/software integration expert. He has developed and operated marketing and billing systems with millions of users, led operations for a power-marketing system serving tens of millions of users across 11 provinces, and designed and delivered more than ten new-energy SaaS systems on the Alibaba Cloud platform, in roles including developer, project manager, technical manager, project director, operations lead and cloud platform architect.

Preface

Oracle ASM (Automatic Storage Management) is a feature introduced in Oracle 10g. It is a volume manager supplied by Oracle to replace the LVM provided by the operating system, and it supports both single-instance and multi-instance (RAC) configurations.

ASM greatly simplifies the DBA's work: it manages disk groups automatically and provides data redundancy and I/O optimization. It also offers rich management and disaster-recovery facilities; with the appropriate configuration, it can implement efficient storage-level protection at the database layer.

This article walks through the analysis and resolution of a case at a customer site where ASM storage failover did not deliver the expected result, as a study in how to approach problems that defy expectations.

01 Problem Summary

Background:

1. Oracle 12.2 RAC with ASM normal redundancy; the database sits on a dual-storage architecture so that the failure of a single array causes neither a service outage nor data loss.

2. The ASM disk group is designed with two failure groups (FGs): all disks of one FG live on array #1, and all disks of the other FG live on array #2 (a creation sketch follows this list).

3. The expectation is that if either array fails or loses power, the database instances are unaffected, no data is lost, and data resynchronizes automatically once the failed array comes back online.
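
For illustration, a disk group laid out as in point 2 would be created along the following lines (the disk group, failure group and disk names here are illustrative, not the ones from the customer site):

SQL> create diskgroup DG_DATA normal redundancy
       failgroup FG_STOR1 disk '/dev/rdisk/disk10', '/dev/rdisk/disk11'
       failgroup FG_STOR2 disk '/dev/rdisk/disk20', '/dev/rdisk/disk21';

With normal redundancy, ASM mirrors every extent across the two failure groups, so in theory either array can disappear without data loss.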

In the actual high-availability test, after pulling one of the two arrays, we observed the following:

1. The CRS cluster was unaffected; OCR and voting disks failed over automatically.

2. The database hit I/O errors on the controlfile and redo logs; after core processes such as LGWR and CKPT had been blocked for a long time, Oracle proactively restarted the database instance (one or both instances), after which the database returned to normal.

3. The data itself was intact, and resynchronization after the failed array came back online worked correctly.

02 Test Process

1) First Round of Tests

1. Storage cables pulled at 16:56:05.

2. Instances crashed between 16:57:37 and 16:57:39.

ASM alert log:

2018-08-01T16:57:41.712885+08:00
NOTE: ASM client node11:node1:node1-rac disconnected unexpectedly

DB alert log:

2018-08-01T16:57:36.704927+08:00
Errors in file /oracle/diag/rdbms/node1/node11/trace/node11_ckpt_10158.trc:
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '+DG_DATA_FAB/NODE1/CONTROLFILE/current.265.981318275'
ORA-15081: failed to submit an I/O operation to a disk
ORA-15081: failed to submit an I/O operation to a disk
ORA-15064: communication failure with ASM instance
2018-08-01T16:57:36.705340+08:00
Errors in file /oracle/diag/rdbms/node1/node11/trace/node11_ckpt_10158.trc:
ORA-00221: error on write to control file
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '+DG_DATA_FAB/NODE1/CONTROLFILE/current.265.981318275'
ORA-15081: failed to submit an I/O operation to a disk
ORA-15081: failed to submit an I/O operation to a disk
ORA-15064: communication failure with ASM instance
2018-08-01T16:57:45.214182+08:00
Instance terminated by USER, pid = 10158

The Oracle CKPT process was blocked by the controlfile I/O errors until Oracle proactively restarted the instance; in every test run, instance termination began after the same 70-second timeout.

We suspected that the ASM instance was simply too slow to offline the failed disks, and hoped to raise the threshold for how long CKPT may stay blocked, but found no corresponding parameter.

Since the controlfile was where the problem surfaced, could it be that the DATA disk group contains so many disks that offline detection takes too long?

We tried moving the controlfile to the REDO disk group, which has far fewer disks, but the test still failed at the controlfile:

From the system state dump file:

----- Beginning of Customized Incident Dump(s) -----

Process CKPT (ospid: 4693) is waiting for event 'control file sequential read'.

Process O009 (ospid: 5080) is the blocker of the wait chain.

===[ Wait Chain ]===

CKPT (ospid: 4693) waits for event 'control file sequential read'.

LGWR (ospid: 4691) waits for event 'KSV master wait'.

O009 (ospid: 5080) waits for event 'ASM file metadata operation'.

From node1_lgwr_4691.trc:

----- END DDE Actions Dump (total 0 csec) -----

ORA-15080: synchronous I/O operation failed to write block 1031 of disk 4 in disk group DG_REDO_MOD

ORA-27063: number of bytes read/written is incorrect

HPUX-ia64 Error: 11: Resource temporarily unavailable

Additional information: 4294967295

Additional information: 1024

NOTE: process _lgwr_node1 (4691) initiating offline of disk 4.4042263303 (DG_REDO_MOD_0004) with mask 0x7e in group 3 (DG_REDO_MOD) with client assisting
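
Wait chains like this one can also be inspected on a live (if sluggish) instance through the v$wait_chains view; a minimal query sketch:

SQL> select chain_id, sid, blocker_sid, wait_event_text
     from v$wait_chains
     order by chain_id;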

2) Second Round of Tests


We then tried multiplexing the controlfile instead:

1. Each array contributed one 10 GB LUN to the servers.

2. A separate disk group was created on each LUN, and the controlfile was multiplexed across the two disk groups, as sketched below.
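
One common way to do this, sketched here with the hypothetical disk group names DG_CTL1 and DG_CTL2 standing in for the two single-LUN groups, is to point control_files at both groups and restore the controlfile from the existing copy (the source path below is the one from the alert log above):

SQL> alter system set control_files='+DG_CTL1','+DG_CTL2' scope=spfile sid='*';
SQL> shutdown immediate
SQL> startup nomount
RMAN> restore controlfile from '+DG_DATA_FAB/NODE1/CONTROLFILE/current.265.981318275';
SQL> alter database mount;
SQL> alter database open;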

We then re-ran the single-array failure test, and the controlfile again became unreadable and unwritable, restarting the instance!

The Oracle documentation makes it clear that only ASM failure groups can provide high availability here, because every controlfile copy must stay online; losing any one of them brings the instance down directly:

Multiplex Control Files on Different Disks

Every Oracle Database should have at least two control files, each stored on a different physical disk. If a control file is damaged due to a disk failure, the associated instance must be shut down. Once the disk drive is repaired, the damaged control file can be restored using the intact copy of the control file from the other disk and the instance can be restarted. In this case, no media recovery is required.

The behavior of multiplexed control files is this:

The database writes to all filenames listed for the initialization parameter CONTROL_FILES in the database initialization parameter file.

The database reads only the first file listed in the CONTROL_FILES parameter during database operation.

If any of the control files become unavailable during database operation, the instance becomes inoperable and should be aborted.

Note:

Oracle strongly recommends that your database has a minimum of two control files and that they are located on separate physical disks.

So this multiplexing approach does nothing for controlfile high availability!

3) Third Round of Tests

We placed the controlfile on a single disk group carved from the RPT storage array, to avoid the blocking caused by controlfile synchronization.

Some test runs now succeeded, but others failed on REDO LOG reads or writes and restarted the database!

4) Fourth Round of Tests

We created two independent disk groups, one pointing at each array, and multiplexed the two members of every redo group across the two disk groups, along the lines of the sketch below.
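
In SQL the layout is roughly this (group numbers, sizes and the disk group names DG_REDO1/DG_REDO2 are illustrative):

SQL> alter database add logfile group 5 ('+DG_REDO1','+DG_REDO2') size 1g;
SQL> alter database add logfile group 6 ('+DG_REDO1','+DG_REDO2') size 1g;

Each redo group then has one member in each disk group, so losing an entire array costs at most one member per group and LGWR keeps writing to the surviving member.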

The failover test now succeeded: ASM dismounts the disk group on the failed array and the database is completely unaffected!

From all of the above tests, two patterns emerge:

1. ASM failure groups handle database datafiles without any problem; failover works as designed.

2. When the controlfile or redo logs live in a normal-redundancy disk group, the disk offline operation blocks abnormally long and the database instance restarts itself; after the restart everything runs normally and data integrity is not affected!

Because the problem appeared randomly across repeated tests, we strongly suspected an Oracle bug. On MOS we found a similar one: 'Bug 23179662 - ASM B-slave Process Blocking Fatal background Process like LGWR producing ORA-29771 (Doc ID 23179662.8)'. However, MOS states that the 20180417 PSU already fixed that bug, and its documented workaround behavior is exactly the instance restart we were seeing.

After a week without a resolution, the following temporary solution was adopted:
