Tuesday, October 28, 2014

Spare Disk Management



Always trigger an AutoSupport first:
Syntax: filer> options autosupport.doit "diskfailure"
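The AutoSupport subject is free text, so a more descriptive message can also be used; for example (the disk ID here is only an illustration):

Ex: options autosupport.doit "disk 0c.76 failed - replacement pending"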

 Verification of failed disk:


a. Confirm the failed disk ID on the filer:

Syntax: filer> aggr status -f

Ex:
Broken disks
RAID Disk       Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
---------       ------  ------------- ---- ---- ---- ----- --------------    --------------
failed          0c.76   0c    4   12  FC:A   -  FCAL 15000 272000/557056000  274845/562884296

b. Confirm the status of RAID reconstruction:

Syntax: filer> aggr status -r
  
Ensure that RAID reconstruction is able to start successfully; otherwise another disk failure in the same aggregate can put it into double-degraded mode, which can result in a filer shutdown or data corruption.
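An illustrative (not verbatim) snippet of aggr status -r output while a RAID group is rebuilding; the aggregate name and percentage are examples only:

Ex:
Aggregate aggr1 (online, raid_dp, reconstruct) (block checksums)
  Plex /aggr1/plex0 (online, normal, active)
    RAID group /aggr1/plex0/rg0 (reconstruction 23% completed)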

c. Create a ticket for disk replacement as per the DC support guidelines.

Verification of replacement disk:


Once NetApp support confirms that the failed disk has been physically replaced, verify that the replacement disk was installed successfully and is visible on the filer:

1. Run the command below to verify that the replacement disk is available as a spare disk:

Syntax: filer> aggr status -s

2. If the disk is not visible in the output above, run the command below to check whether the disk was replaced but is still unassigned to a controller:

Syntax: filer> disk show -n
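Illustrative output (the disk name and serial number are examples only); a replaced but unassigned disk shows up with no owner:

Ex:
  DISK         OWNER                  POOL   SERIAL NUMBER
  ------------ -------------          -----  -------------
  1b.33        Not Owned              NONE   3KS6BC2A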

3. Assign the disk; this will make it visible as a spare disk:

Syntax: filer> disk assign disk_id

Ex: disk assign 1b.33
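In an HA pair the disk can also be assigned to a specific controller with the -o option (the controller name below is only an example):

Ex: disk assign 1b.33 -o filer2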

4. Confirm with the command below that the disk is available as a spare:

Syntax: filer> aggr status -s
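Illustrative output; the newly assigned disk should now appear under the spare disks section, for example:

Ex:
Spare disks
RAID Disk       Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
---------       ------  ------------- ---- ---- ---- ----- --------------    --------------
spare           1b.33   1b    2    1  FC:B   -  FCAL 15000 272000/557056000  274845/562884296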


Spare Disk Recommendation:

A controller that is the primary owner of a set of drives can allocate up to the following number of spares (per drive type):

Number of drives per controller     Number of spares
14–27 drives                        1 spare
28–100 drives                       2 spares
Each additional 84 drives           1 additional spare


Example: A controller with 184 drives (one drive type) can have up to three spares.
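Working through the tiers cumulatively for that example: a controller with up to 27 drives calls for 1 spare, 28–100 drives raises it to 2 spares, and the additional 84 drives (101–184) add 1 more, giving 3 spares in total.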

Monday, October 13, 2014

How does the surviving partner get all LUN/volume information from the down filer? Where is this information saved, and how is it communicated to the surviving partner?

The information from the down filer is saved in the mailbox section of the NVLOG, which is part of the NVRAM. The cluster mailboxes are kept on the root volumes of both filers. If one filer fails, the surviving filer takes over the loop B side of the failed controller and thereby assumes control of the failed controller's root volume. Because the surviving controller's NVRAM holds two halves (one its own, one mirrored from the failed controller), the entire architecture continues to run.
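On a 7-mode HA pair you can see this takeover relationship with the cf command set; a minimal sketch (the output and partner name below are illustrative):

Syntax: filer> cf status
Cluster enabled, filer2 is up.

During a takeover, the same command on the surviving node reports that it has taken over its partner.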

How is host connectivity (Ethernet) maintained in case of a cluster failover?
Guys, you have to ensure several components are in place to maintain network connectivity:
 

a) Dual NIC ports, each going to a different Ethernet switch.
b) NIC teaming on the host is a must.
c) Multipathing for iSCSI is a must.
d) Switches should be configured with a trunking mechanism such as EtherChannel or link aggregation.
e) At the filer end, dual network connectivity to the network switches, with trunking set up (see the sketch after this list).
f) An even better architecture is quad network connectivity, i.e. 2 ports on the host side and 4 ports on the filer side, or 4 host ports and 4 filer ports.
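As an illustration of points (d) and (e), the filer-side piece of the trunking setup could look like the sketch below (Data ONTAP 7-mode 8.x ifgrp syntax; the interface names, IP address, and lacp mode are examples only, the matching EtherChannel/LACP configuration must also exist on the switch, and the commands need to be added to /etc/rc to persist across reboots):

filer> ifgrp create lacp ifgrp0 -b ip e0a e0b
filer> ifconfig ifgrp0 192.168.10.10 netmask 255.255.255.0 partner ifgrp0
filer> ifconfig ifgrp0 up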

WAFL Inconsistencies (wafliron vs. WAFL_check)



Wafliron is a Data ONTAP(R) tool that will check a WAFL(R) file system for inconsistencies and correct any inconsistencies found. It should be run under the direction of NetApp Technical Support. Wafliron can be run on a traditional volume or an aggregate. When run against an aggregate, it will check the aggregate and all associated FlexVol(R) volumes. It cannot be run on an individual FlexVol volume within an aggregate.
Wafliron can be run with the storage system online, provided that the root volume does not need to be checked. When run, wafliron performs the following actions on the traditional volume, or on the aggregate and its associated FlexVol volumes:

- Checks file and directory metadata
- Scans inodes
- Fixes file system inconsistencies
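For reference, and only under NetApp Technical Support direction, wafliron is started from the advanced privilege level against an aggregate or traditional volume (the aggregate name below is an example); progress is then reported in the console messages and in the aggr status output:

Syntax: filer> priv set advanced
        filer*> aggr wafliron start aggr1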

The time the storage system will spend in the first phase is difficult to estimate due to the number of contributing factors on the aggregate and FlexVol volumes.  These factors include:
  • The number of Snapshot copies
  • The number of files
  • The size of the aggregate/volumes
  • The RAID group size
  • The physical data layout
  • RAID reconstructions occurring
  • The number of LUNs in the root of the volume

What is the difference between wafliron and WAFL_check?
WAFL_check and wafliron are both diagnostic tools used to check WAFL file systems. Wafliron makes changes as it runs and records those changes in the storage system's messages file; the administrator has no choice over which changes wafliron will commit.
When WAFL_check is run, the administrator can choose whether or not to commit the changes.
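By contrast, WAFL_check cannot be run while the system is serving data: it is typically entered at the special boot menu (a hidden option), for example WAFL_check aggr1, and at the end of the run it prompts the administrator whether to commit the changes it found. The exact invocation and prompts vary by Data ONTAP release, so follow NetApp Support's instructions.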