Wednesday, June 24, 2015

NetApp Deswizzling

Fast-Path Vs Slow-Path
Data within a FlexVol has both a logical location within the volume and a physical location within the containing aggregate. When reading data from a FlexVol, the physical location of that data must be known to perform the read. Under normal circumstances, reads use a fast-path to translate a logical address into a physical address.

When SnapMirror transfers a FlexVol to a different aggregate, however, the relationship between logical and physical addresses is changed: the destination aggregate lays out the volume into different physical blocks. Immediately after the SnapMirror transfer, reads to the destination volume will use a different process to translate logical addresses into physical addresses. This slow-path is less efficient and may require an additional disk read to access a block of data. As soon as a FlexVol SnapMirror update finishes, the destination storage controller will launch a scanner to recreate the fast-path metadata. Once the scan finishes, reads to the SnapMirror destination will use the normal fast-path. This process is known as deswizzling.

Q. If you are using SnapMirror in asynchronous mode (transferring every 5 minutes) and you have good bandwidth, will it have any impact?
Ans. Yes, deswizzling. Because the deswizzle scanner restarts after every transfer, very frequent updates can keep it from ever finishing, leaving more reads on the slow-path.

To check whether the deswizzle scanner is running to re-create the fast-path:
filer> priv set advanced
filer*> wafl scan status

Example

DR-FILERA*> wafl scan status sm_myvolume

Volume sm_myvolume:

Scan id                   Type of scan     progress
5081724    container block reclamation     block 106 of 4909
5081725             volume deswizzling     snap 131, inode 97 of 32781. level 1 of normal files. Totals: Normal files: L1:0/14245 L2:0/38304 L3:0/38521 L4:0/38521   Inode file: L0:0/0 L1:0/0 L2:0/0 L3:0/0 L4:0/0

DR-FILERA*> snap status sm_myvolume
Volume sm_myvolume (cleaning summary map)
snapid  status     date           ownblks release fsRev name
------  ------     ------------   ------- ------- ----- --------
116 creating   May 02 15:54         0 8.0 21057 DR-FILERA(1574215916)_sm_myvolume.69540 (no map)
115 creating   May 02 15:51      1096 8.0 21057 DR-FILERA(157

Question: Does deswizzling take place on offline volumes?

Answer: No, deswizzling will not take place on offline volumes. Only online volumes will be deswizzled.

Note: Volume deswizzling (a deswizzle scan) can also occur after a vol copy operation, even if Volume SnapMirror is not used or licensed. Because 'vol copy' is a block-level transfer, the destination FlexVol requires a deswizzle scan to update the physical location of the blocks in WAFL.


Reference: https://kb.netapp.com/support/index?page=content&id=3011866


Friday, April 3, 2015

What are RTO and RPO?

They are both crucial elements of business continuity, and they sound quite similar and are often confused.

What is RTO?
BS 25999-2, a leading business continuity standard, defines RTO as “…target time set for resumption of product, service or activity delivery after an incident”.

RTO, or Recovery Time Objective, is the target time you set for the recovery of your IT and business activities after a disaster has struck. The goal here is to work out how quickly you need to recover. For example, if your RTO is 3 hours, then you need to invest quite a lot of money in disaster recovery, because you want to be able to achieve full recovery in only 3 hours. However, if your RTO is 1 week, then the required investment will be much lower, because you will have enough time to acquire resources after an incident has occurred. And in case you need RTO = 0, that means, dude, you need fully redundant infrastructure, and a lot of investment is required. So RTO is not about data loss.

What is RPO?
RPO, or Recovery Point Objective, is focused on data and your company’s tolerance for data loss. RPO is determined by looking at the time between data backups and the amount of data that could be lost in between backups. For example, if your RPO is 3 hours, then you need to perform a backup at least every 3 hours.
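
In NetApp terms, a minimal sketch of what that could look like, assuming 7-Mode volume SnapMirror and hypothetical filer/volume names (filerA, filerB, vol_data): an /etc/snapmirror.conf entry on the destination that transfers every 3 hours keeps the worst-case data loss just under 3 hours.

filerA:vol_data filerB:vol_data_dr - 0 0,3,6,9,12,15,18,21 * *

The last four fields are minute, hour, day-of-month and day-of-week, so this entry starts an update at the top of every third hour.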

Dude, What’s the difference?
The difference is in the purpose – RTO has a broader purpose because it sets the boundaries for your whole business continuity management, while RPO is focused solely on the issue of backup frequency. They are not directly related – you could have RTO of 24 hours and RPO of 1 hour, or RTO of 2 hours and RPO of 12 hours.

What is common between RTO and RPO ?
They are both vital for business impact analysis (BIA) and for business continuity management (BCM).

Sunday, March 15, 2015

Pre and Post Checks for NetApp ONTAP upgrade

Pre-Checks for ONTAP Upgrade

1. Generate a Config Advisor report and take corrective measures if any issues are found
2. Check whether a disk firmware upgrade is required; keep the firmware at the required location
3. Check whether a shelf firmware upgrade is required; keep the firmware at the required location
4. Check whether an SP/RLM firmware upgrade is required; keep the firmware at the required location
5. Check that the correct Data ONTAP version is kept at the required location (Filer1*> software list)
6. Check that you are able to log in to the SP and the system console on the required filer one day before
7. Check whether a monitoring blackout MSCT is opened; otherwise, open a blackout for monitoring
8. Check that the change initiator has involved all required teams (for example, Systems, Network, etc.; contact persons’ names should be shared one day before)
9. Make sure logs are collected at all times during your activity
10. Check whether cluster failover is enabled (cf status); if not, take corrective measures (ignore this step in the case of standalone filers)
11. Check whether storage shows any critical errors (storage show fault)
12. Check whether you observe any hardware failures (environment chassis all, environment chassis list-sensors); if yes, halt the change
13. Check whether any cloned LUN is mapped. Especially on secondary filers, there should not be any mapped LUNs (e.g. LUNs used in restorations). (LUNs will remain mapped in the case of HA pairs)
14. Check whether CIFS sessions are connected; if yes, gracefully terminate the sessions
15. Check whether NFS sessions are connected; if yes, gracefully terminate the sessions
16. Check whether any backup activity (tape backup, restore, etc.) is running (ndmpd status, backup status)
17. Check whether SnapVault transfers are running; if yes, gracefully halt the transfers
18. Check whether SnapMirror transfers are running; if yes, gracefully halt the transfers
19. Back up the /etc/hosts and /etc/rc files
20. Check that all volumes are online (vol status); if any volume is offline, take corrective measures
21. Check that all aggregates are online (aggr status); if any aggregate is offline, take corrective measures
22. Check whether there is any failed disk; take corrective measures
23. Check whether any disk is bypassed; if yes, take corrective measures
24. Take a performance baseline (perfstat); CPU utilization should not be more than 50%, otherwise take corrective measures
25. Check /etc/log/messages for any errors
26. Trigger an autosupport before starting the activity, e.g. options autosupport.doit "starting_NDUP8.1.2P4" (see the command sketch after this list)
27. As a standard there must be a change request in your tool; review this change request carefully
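
A minimal command sketch for a few of the checks above (7-Mode syntax; the autosupport text is just the example string from point 26):

filer> software list                  (point 5: confirm the target ONTAP image is staged)
filer> cf status                      (point 10: cluster failover must be enabled)
filer> storage show fault             (point 11: look for storage faults)
filer> vol status                     (point 20: all volumes should be online)
filer> aggr status                    (point 21: all aggregates should be online)
filer> options autosupport.doit "starting_NDUP8.1.2P4"   (point 26: pre-upgrade autosupport)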

Post checks (after Major Activities)

1. Check whether the disk firmware is upgraded to the required version (if required); also update MSCT with this information
2. Check whether the shelf firmware is upgraded to the required version (if required); also update MSCT with this information
3. Check whether the SP/RLM firmware is upgraded to the required version (if required); also update MSCT with this information
4. Check whether cluster failover is enabled (cf status); otherwise take corrective measures (ignore this step in the case of standalone filers)
5. Check that ONTAP is upgraded to the required version (version -b); also update MSCT with this information
6. Check /etc/log/messages for any critical error messages; take corrective measures
7. Trigger an autosupport after successful completion, e.g. options autosupport.doit "completed_NDUP8.1.2P4" (a short command sketch follows this list)
8. Perform a filer health check (environment chassis all, environment chassis list-sensors)
9. Check whether SnapVault transfers resume automatically; otherwise resume them (if required)
10. Check whether SnapMirror transfers resume automatically; otherwise resume them (if required)
11. Close the change in your tool (as mentioned in point 27 of the pre-checks)
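
A matching sketch for the post-checks (again 7-Mode syntax, example strings only):

filer> version -b                     (point 5: confirm the new ONTAP image)
filer> cf status                      (point 4: cluster failover enabled again)
filer> options autosupport.doit "completed_NDUP8.1.2P4"   (point 7: post-upgrade autosupport)
filer> snapvault status               (point 9: SnapVault transfers resumed)
filer> snapmirror status              (point 10: SnapMirror transfers resumed)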

NetApp volume move operation using NMC (NetApp Management Console)

 




Step 3. Select the aggregate from which you want to migrate a volume; for example, in the screenshot below the aggregate is 92% full. Select the aggregate, right-click, and select “Manage Space”



Step 4. Click Next

Step 5. Select the required volume and click on “migrate”

Step 6. Select the required option

Step 7. Click Next


Step 8. Selected volume should appear here


Step 9. Select the radio button, review, and add a comment

Step 10. Click Finish


Step 11. Go to “Jobs” to see the status of your job (a CLI alternative is sketched below)
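
If you prefer the command line over NMC, the same nondisruptive migration can be started with the 7-Mode vol move command (a sketch with hypothetical volume and aggregate names; the same command appears again in the aggregate checklist further down this page):

filer> vol move start vol_data aggr_new -r 10 -w 120
filer> vol move status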

Monday, January 26, 2015

What to do if you receive an alert (aggregate is nearly full)

For storage administrators: what to do if you receive an alert that an aggregate is nearly full.

Check List

1. Check for old snapshots and delete snapshots that are out of retention
2. Check for backup (Volrest) volumes (you can destroy them if no backup is happening)
3. Check for guarantee=none; all volumes should be thin provisioned, and the guarantee shouldn’t be “volume” (see the command sketch after this list)
4. Check the snap reserve (you can make it 0) and check the fractional reserve
5. Check whether aggregate-level snapshots are scheduled
Filer1*> snap sched -A aggr0
6. To delete snapshots at the aggregate level (if not required):
Filer1*> snap list -A aggr0
Filer2*> snap delete -A aggr0_sm hourly.3
7. If all of the above are fine, check for a spare disk on that controller; if there is no spare disk, check its partner’s spare disks, change the ownership, and assign it to the aggregate
8. Move a few volumes (nondisruptively) from one aggregate to another (less occupied) aggregate:
vol move start source_volume aggr_destination -r 10 -w 120
9. If all of the above are fine, go to Operations Manager and look for unusual activity on that volume (note the graph)
10. Check within your team whether somebody is already working on it
11. Finally, if the aggregate usage remains the same, don't wait; raise an alarm with your management or the vendor
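
A short command sketch for points 3 to 5 above (7-Mode syntax; my_vol is a hypothetical volume name):

filer> df -A aggr0                         (check aggregate usage)
filer> vol options my_vol guarantee none   (point 3: thin provision the volume)
filer> snap reserve my_vol 0               (point 4: set the volume snap reserve to 0)
filer> snap sched -A aggr0 0 0 0           (point 5: disable the aggregate snapshot schedule, if not required)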

Sunday, January 25, 2015

NetApp FlexCache Volumes



FlexCache volumes improve performance in your NFS environment. You can use FlexCache volumes to speed up access to data, or to offload traffic from heavily accessed volumes. FlexCache volumes help improve performance, especially when clients need to access the same data repeatedly, because the data can be served directly without having to access the source. Therefore, you can use FlexCache volumes to handle system workloads that are read-intensive.

Reads

Caches are populated as a host reads data from the source. On the first read of any data (1 in the figure below), the cache has to fetch the data from the original source (2). The data is returned to the cache (3), stored in the cache, and then passed back to the host (4). As reads pass through the cache, it fills up by storing the requested data.

 

Any subsequent access to data that is already stored in the cache can be served immediately back to the host (2) without spending time and resources accessing the original source of the data. This is the primary advantage of a cache: serving frequently accessed data directly to a host without having to fetch it from the original source. However, you may be thinking, “What if the data changed on the origin system? Does the FlexCache system still serve the data stored in its cache?” It is possible that the FlexCache system could be storing data that has changed at the origin system; this is called stale data. Although the FlexCache system may store stale data, policies exist that allow you to control and manage how it handles stale data. These policies are discussed under “Cache Consistency.”

 
Writes

In a FlexCache system, all writes from a host (1 in Figure 3) are passed directly through the cache volume to the origin volume (2). The origin volume responds to the FlexCache volume when it assumes responsibility for the new or changed data (3); only then does the FlexCache volume acknowledge the result of the write to the host (4). This is called a write-through cache.

A write-through cache is a cache that does not respond to the host until it receives a response from the next subsystem in the line. In other words, the FlexCache cache volume does not respond to the host until the origin volume acknowledges receipt of the data, thus helping to keep the data safe and sound.

This is in contrast to a write-back cache, which responds to the host immediately, before verifying that the data can be successfully passed to the next subsystem. Because a write-back cache accepts responsibility for data and responds to the host before the next subsystem has acknowledged receipt, it must protect the data until it is written to physical media (i.e., disk). Data in this state is called dirty data, and it must be protected from system failures such as power loss (in such a way that when power is restored, the dirty data is still accessible and ready to be stored on disk).

 


Cache Consistency

Three primary policies govern the freshness of data (the first two, being the most important, are explained below):
1.       Attribute cache time-outs
2.       Delegations
3.       Write operation proxy
ATTRIBUTE CACHE TIME-OUTS
As data is retrieved from an origin volume and stored in the cache volume, the file containing that data is considered fresh for a specified amount of time, called the attribute cache time-out. In other words, if a host requests data from a file in the cache volume and the attribute cache time-out has not expired, the cache volume serves the data directly to the host without having to communicate with the origin volume. If the attribute cache time-out has expired, the cache volume checks with the origin volume to compare the file’s attributes. If the attributes are the same on the cache volume and the origin volume, the file is fresh, and the data is served from the cache volume (and the attribute cache time-out restarts). If the attributes are not the same, the file is stale, is marked invalid, and is reread from the origin volume before serving the host (also resetting the attribute cache time-out).
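
As a hedged pointer (the option names are taken from 7-Mode FlexCache documentation and should be confirmed against the vol options man page for your release), the attribute cache time-outs are exposed as FlexCache volume options such as acregmax, acdirmax, acsymmax, and actimeo, which you can inspect on the cache volume (data_cache is a hypothetical name):

filer> vol options data_cache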

DELEGATIONS
A delegation is another mechanism used to manage cache consistency. A delegation is a contract between the cache volume and the origin volume which says that if the cache volume is granted a delegation for a specific file, the origin volume does not change that file without first notifying the cache volume. This means that the cache volume does not have to validate the file with the origin volume, even if the attribute cache time-out has expired. In fact, if a file has a delegation, the attribute cache time-out is ignored.
  





FlexCache FAQs


DOES FLEXCACHE SUPPORT FILE LOCKING? (how it handles multiple WRITE requests)
Yes, for applications that use the NFS feature for file locking, FlexCache uses a network lock manager (NLM) to pass locks through.



FlexCache volume limitations?
Some FlexVol volume features are not supported on FlexCache volumes, such as Snapshot copy creation, deduplication, compression, FlexClone volume creation, volume move, and volume copy.

How many FlexCache volumes can be made on a storage system?
You can have a maximum of 100 FlexCache volumes on a storage system.

WHAT CLIENT PROTOCOLS DOES FLEXCACHE SUPPORT?
FlexCache supports serving hosts and clients that use NFS v2 and v3. FlexCache does not support CIFS, NFS v4, or any block-based storage protocol (FC, iSCSI).
                                                                                           
DOES FLEXCACHE SUPPORT ANY NETWORK PROTOCOLS, SUCH AS HTTP, FTP, OR NNTP?
No.

WHAT IS THE PROTOCOL USED BETWEEN A FLEXCACHE SYSTEM AND THE ORIGIN SYSTEM?
The protocol used between a FlexCache system and the origin system is a NetApp proprietary protocol referred to as NRV. NRV offers many advantages in terms of performance, capability, and time to market for other features. NRV is high-performing, very NFSv4-like, and includes the use of delegations.

WHAT PORTS ARE REQUIRED?
FlexCache uses only port 2050 for all NRV communication between the cache storage system and the origin
storage system. NRV works only on TCP, not UDP.

CAN I THROTTLE BANDWIDTH ON THE NRV CONNECTION?
No, FlexCache does not have a mechanism to throttle the bandwidth usage of the NRV protocol.

HOW DOES FLEXCACHE INTERACT WITH MULTISTORE?
FlexCache volumes can be created only from vfiler0. FlexCache volumes can cache origin volumes on any vFiler unit. Origin volumes cannot be destroyed or moved between vFiler units (vfiler move, add, remove, or destroy) while a FlexCache volume mapping exists.

DOES FLEXCACHE WORK WITH METROCLUSTER?
Yes.

HOW DOES FLEXCACHE INTERACT WITH V-SERIES SYSTEMS?
A V-Series system can host origin volumes for FlexCache volumes on FAS systems.
A V-Series system can host FlexCache volumes for origin volumes on FAS systems.
A V-Series system can have origin volumes and FlexCache volumes on the same system (hierarchical storage management).

WHAT IS THE DIFFERENCE BETWEEN FLEXCACHE AND NETCACHE?
NetCache® caches network protocols such as HTTP, FTP, and NNTP.
FlexCache caches the NFS v2 and v3 storage protocols.

WHAT IS THE AUTOGROW FEATURE?
-       The autogrow feature automatically manages the size of FlexCache volumes. The size of a FlexCache volume increases automatically, providing more disk space for storing cached data (assuming that there is available space in the aggregate).
-       FlexCache volumes that are too small can negatively affect cache performance.
-       Automatically increasing the volume size minimizes the need to eject data from the cache.
-       If you have several FlexCache volumes sharing the same aggregate, the volumes that are getting the most data accesses also receive the most space.

How to create a FlexCache volume

Create the volume: vol create cache_vol aggr [size{k|m|g|t}] -S origin:source_vol (Note: for best performance, do not specify a size when you create a FlexCache volume; specifying a size disables the FlexCache autogrow capability.)
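
For example, a sketch of the documented syntax with hypothetical names (data_cache, cache_aggr, origin_filer and vol_data are not from this article), leaving the size out so that autogrow stays enabled:

cache_filer> vol create data_cache cache_aggr -S origin_filer:vol_data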

Methods to view FlexCache statistics

  • The flexcache stats command (client and server statistics; see the example after this list)
  • The nfsstat command (client statistics only)
  • The perfstat utility
  • The stats command
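
For example (assuming the 7-Mode flexcache stats flags -C for client statistics and -S for server statistics; verify against your release, and note that vol_data is a hypothetical origin volume):

cache_filer> flexcache stats -C
origin_filer> flexcache stats -S vol_data           (run on the origin system)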