VMware Business Continuity & Disaster Recovery (BCDR Design) Notes from Module 2

More notes from the "Using Decision Trees to Design Recovery Plans" section of the BCDR Design MyLearn course.

A DRP is a complex thing to create.  Decision trees help create DRPs.  Have the Business Impact Analysis (BIA) and risk assessment done, and have RPO and RTO ready.

Application Protection

Should the application be protected?  The BIA should decide.
If yes, is the data separate from the application (on a separate server or in separate files)?
If yes, protect both the servers and the data.
Should the app be combined with other apps?
If no, protect independently.
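
The questions above can be sketched as a small function. This is only an illustration of walking the tree: the `App` fields and the wording of the returned decisions are made up for the example, and the branches the notes do not spell out (data not separate, app combined with others) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    bia_says_protect: bool     # outcome of the Business Impact Analysis
    data_is_separate: bool     # data on a separate server or in separate files
    combine_with_others: bool  # protected together with other apps?


def protection_plan(app: App) -> list[str]:
    """Walk the application-protection questions and return the decisions."""
    if not app.bia_says_protect:
        return ["do not protect"]
    plan = []
    if app.data_is_separate:
        plan.append("protect both app server and data")
    else:
        # Assumed branch: the notes only cover the "separate" case.
        plan.append("protect app server (data included)")
    if app.combine_with_others:
        # Assumed branch: the notes only cover the "independent" case.
        plan.append("protect as part of a group")
    else:
        plan.append("protect independently")
    return plan


crm = App("crm", bia_says_protect=True, data_is_separate=True, combine_with_others=False)
print(protection_plan(crm))
```

Keeping each question as one `if` makes it easy to compare the code against the decision tree drawing.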

[Figure: sample application decision tree]

Restoration Priority

Is the app infrastructure-related (DHCP, AD), something other servers rely on?
If yes, restore first.
If no, BIA determines how critical it is to the business.
If critical, restore prior to other apps, but not before infrastructure.
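
A rough sketch of that ordering, assuming each app is tagged with two flags taken from the BIA (the dictionary keys and app names here are invented for illustration):

```python
def restore_order(apps):
    """Order apps: infrastructure first, then business-critical, then the rest."""
    def tier(app):
        if app["infrastructure"]:
            return 0  # DHCP, AD and similar: everything else depends on them
        if app["critical"]:
            return 1  # critical per the BIA, but after infrastructure
        return 2
    # sorted() is stable, so apps within the same tier keep their input order
    return [a["name"] for a in sorted(apps, key=tier)]


apps = [
    {"name": "intranet-wiki", "infrastructure": False, "critical": False},
    {"name": "erp", "infrastructure": False, "critical": True},
    {"name": "active-directory", "infrastructure": True, "critical": True},
    {"name": "dhcp", "infrastructure": True, "critical": False},
]
print(restore_order(apps))
```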

Recovery site selection and location

Single site is a dedicated recovery site; generally simplifies the DRP.
Is it extremely remote?
If yes, how will staff get/stay there?
Is it Hot/warm/cold? Hot = more expensive.  Warm = less expensive but longer to bring online.
Multi-site means regional, multi-purpose recovery sites.

Recovery site connection

Are you replicating data over WAN links?
If yes, are they redundant?
Yes = higher cost.  No = less reliable; if the link is down, data cannot replicate.
If no, longer recovery time while tapes/data are manually delivered.

Backup Design

Full or mixed (full, incremental, differential) – may differ from app to app
Diff + Full lowers complexity, but backups are slower than incremental
Full only = easiest restore, longer backups, more costly due to space
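
To see why the mix affects restore complexity, here is a minimal sketch that works out which backups a restore would need. It assumes a differential holds all changes since the last full and an incremental holds changes since the previous backup of any kind; real backup products differ in the details, and `restore_chain` is a made-up helper, not any product's API.

```python
def restore_chain(backups):
    """Return indexes of the backups needed to restore to the latest state.

    `backups` is a chronological list of "full", "diff" or "incr" strings
    and must contain at least one "full".
    """
    last_full = max(i for i, kind in enumerate(backups) if kind == "full")
    chain = [last_full]
    start = last_full
    # Only the most recent differential after the full is needed.
    diffs = [i for i in range(last_full + 1, len(backups)) if backups[i] == "diff"]
    if diffs:
        start = diffs[-1]
        chain.append(start)
    # Every incremental after that point has to be replayed in order.
    chain += [i for i in range(start + 1, len(backups)) if backups[i] == "incr"]
    return chain


print(restore_chain(["full", "full", "full"]))          # full only
print(restore_chain(["full", "diff", "diff", "diff"]))  # full + differential
print(restore_chain(["full", "incr", "incr", "incr"]))  # full + incremental
```

Full only needs one backup set, full + differential needs two, and full + incremental needs the full plus every incremental since.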

Storage Architecture

Will OS, apps + data be mixed on the same storage/LUNs?
If yes – makes DR harder, backups larger, replicating LUNs takes more time due to moving all files
Suggest moving swap files to a different datastore/LUN and keeping data-only VMDKs in a separate file, possibly on a separate LUN, but this increases the complexity of VM management

[Figure: sample storage design decision tree]

Replication Design
Will you replicate this LUN?
Non-replicated LUN will only be available if loaded from backup media.
Replicated LUNs: will they be frequently replicated?
If yes, more options for RPO, but more bandwidth needed.
Keep multiple snapshots?  If yes more storage space.

Power and facility design
Will you manage power?
If no, you have to provide power and cooling at all times.
If cold site, only need AC during operation.  Need to ensure power.
If yes, cost savings from power and cooling while systems are not in use.

DRP Testing
Can you test part of it without testing entire plan?
If no, will you test complete plan more than once a year?
If no, test once a year at minimum, generally need 3 test cycles before working properly
If yes, helps perfect plan but more expensive/time consuming.
If yes, does it interfere with production?
If yes, harder to test as interferes with business.

**Design the test plan modularly: harder to design, but easier to test**

Disaster Declaration

Should a disaster be declared?
If no, business continues normally, solve outages locally using normal procedures.
If yes, is it a single person to declare disaster?
Committees may need to approve.  If partial DRP can be implemented, easier to implement for only some systems
Single person can act quicker, but committee may have more reasoned judgement.

VMware Business Continuity & Disaster Recovery (BCDR Design) Notes from Module 1

Strictly a brain dump/note taking page.

Disaster – halts business; harms physical location but also digital/data loss.  If you lose your data, you are likely to go out of business – up to 90% within 1 year.

Services lost – web, email, finances, ERP apps unavailable, causing business shutdown.  Redundant app servers do not protect against disaster in the same data center.

What is a disaster – Natural (tornado, hurricane, etc), Man made (infrastructure failure, terrorism).  Disaster depends on how your company designs its network, sets SLAs, builds redundancy.

Disaster is declared by an officer of the company or group of officers, who then implement the DRP/BCP.

3 Categories

Catastrophe:  Typically natural event, entire data center is destroyed.  Affects geographic area rendering local support services unavailable.

Disaster:  Localized to datacenter, unavailable day or longer.  65% that lose their DC for more than a week go out of business within a year.

Non-disaster:  Service disruption caused by a specific failure

Not a disaster:  Failure of hardware component, temporary service interruption such as power outage.  Small, isolated failures not built into BCP/DRP.

Disaster Recovery Plan (DRP) Objectives:
Minimize downtime:  Streamline processes.
Run book:  Guide to implement DRP, manually created, maintained by hand.
Reduce risk:  DRP can’t fail, needs to be tested.  Requires additional hardware.

Reduce cost:  Control cost, fast and simple = expensive.  Potentially double cost for duplicate data center.

Physical DR process.  Data is replicated from production to recovery site via replication over WAN or moving via backups.

Complications:  Lots of data to identify and replicate, complex recovery process, inability to test.

RPO – Recovery Point Objective – how old can data be that is recovered
RTO – Recovery Time Objective – how long until back online
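
A small sketch of how the two objectives could be checked against an actual (or rehearsed) outage. The function name and the timestamps are invented for the example; the only idea taken from the notes is that RPO bounds the age of the recovered data and RTO bounds the time until systems are back online.

```python
from datetime import datetime, timedelta


def meets_objectives(last_good_copy, outage_start, service_restored, rpo, rto):
    """Compare one outage against the RPO and RTO targets."""
    data_loss_window = outage_start - last_good_copy  # age of recovered data
    downtime = service_restored - outage_start        # time until back online
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


result = meets_objectives(
    last_good_copy=datetime(2024, 1, 10, 2, 0),     # last replica or backup
    outage_start=datetime(2024, 1, 10, 9, 30),
    service_restored=datetime(2024, 1, 10, 13, 0),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=4),
)
print(result)
```

In this made-up case the nightly copy is 7.5 hours old when the outage hits, so a 4-hour RPO is missed even though the 3.5-hour recovery fits inside the RTO.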

Hot site – ready in minutes/hours
Warm site – ready in days
Cold site – several days to bring online

SRM supports several storage vendors.

Failover – switching operations from primary site to recovery site.
Failback – switch back from recovery site to primary site – days to months to failback.

DRP – process, policies and procedures for recovery.  DR focused on technology.  DRP are plans.  Exact step by step procedure to bring systems online, keep personnel safe, protect assets, switch systems to remote site.  Includes network config, how to verify failover.

Non virtual – install OS, apps, drivers, restore data
Virtual – install hypervisor, mount snapshots/replicated luns, power on VMs

If DRP is not tested, you don't have a DRP – testing reveals oversights.

Test that app data is protected.  There is often not enough time to keep documentation for configs etc. up to date.

Design DR in a modular fashion, test small parts before full test.  Tests should minimize disruption to production.

Once passes test, needs to be tested regularly.

Post disaster:  How to run at recovery site, less capacity, may be inconvenient for staff, plan for temporary housing.

Questions for company management and managers (shut down certain apps? when to fail back?) = BCP.

DRP – plan/procedures during chaos of disaster on safeguarding assets and personnel and is procedure oriented to get systems online.

BCP – process of keeping company running, day to day ops at recovery site.  BCP for IT is different than Finance.  Each department should have a BCP.

BCP Address 3 things:
– Run ops at recovery site
– Issues
– Failback
DR/BC is not a product; no single product or group of products can give instant DR/BC.

SRM helps, but it needs other products and technology, and it needs planning.

DRP relies on storage technology, backups need to be offsite, replication has to be timely (consider which LUNs are replicated), not all files need to be replicated (don't replicate OS, swap, temp files).  Make library of ISOs, apps.

Good backup for small scale outages.

DRP – 11 Step Process:

– Enable management buy in – management must agree (CEO, CFO, CTO) it's required and funded.  Management driven.  Ongoing significant expense.  Upper level management must allow staff time for testing.
– Business impact analysis:
    – Identify key assets (blueprints, IP, equipment, data centers, apps, data) and business functions, map functions to assets, identify interdependencies.
    – Determination of loss criteria – what if you lose asset X, or it is degraded.
    – Max tolerable downtime (MTD) for assets – assign value to assets based on loss criteria.  If down longer than MTD, significant loss of business.
    – MTD tiers: Critical = minutes, Urgent = 24 hrs, Important = 72 hrs, Normal = 7-14 days, Non-essential = 30 days

– Define RPO – Industry standard measurement, point in time to be recovered, amount of data loss.
– Define RTO – Industry standard measurement, how long can be down, includes fault detection and bringing app back online.
– Risk assessment – What are your risks – location based problems such as earthquakes, tornados.  Manmade problems – leak, auto accident.  Can’t protect against all, determine likelihood.
– Examine regulatory compliance – Some laws may require specific technology, RPO, RTO.  Check legal requirements.
– Develop DRPs – Create outlines using above info: what has to come online first including infrastructure, what order for all systems.  If not critical / not in DRP, does it need to be in BCP?
– Design DR systems – Select remote recovery sites, dedicated or non-dedicated, hot-warm-cold?, storage and replication tech, WAN links, communication for operations (phones, mobile).
– Create run books – specific set of procedures, detailed, step by step, app specific.  Rebuild system from OS, restore accounts, create infrastructure settings (AD, LDAP, VLAN), reload software, config software, restore data.  Site specific.  Run book for each asset, order runbook based on when systems need to be brought online.  Hard to create/maintain, capture config changes.
– Develop BCPs – What to do when the DRP is finished and you now operate the business.  What problems – access, fewer resources, remote access, physical problems (desks, phones, direct dial), plan for operations at remote site (backups, accounting).  Plan for failback – storage replication back.  Need failback procedures.
– Test DRPs and BCPs – You must test, problems will be found.
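
The maximum tolerable downtime tiers from the business impact analysis step can be sketched as a simple lookup. The notes only say "minutes" for Critical, so the 1-hour cutoff below is an assumption, as is treating anything beyond 14 days as Non-essential.

```python
from datetime import timedelta

# (upper bound on MTD, tier name), checked in order.
# The 1-hour bound for "Critical" is an assumption; the notes say "minutes".
TIERS = [
    (timedelta(hours=1), "Critical"),
    (timedelta(hours=24), "Urgent"),
    (timedelta(hours=72), "Important"),
    (timedelta(days=14), "Normal"),
]


def mtd_tier(mtd: timedelta) -> str:
    """Map a maximum tolerable downtime to its tier."""
    for upper_bound, tier in TIERS:
        if mtd <= upper_bound:
            return tier
    return "Non-essential"  # assumed: anything tolerating more than 14 days


for name, mtd in [("order-entry", timedelta(minutes=15)),
                  ("email", timedelta(hours=24)),
                  ("archive", timedelta(days=30))]:
    print(name, mtd_tier(mtd))
```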

Free self paced VMware Business Continuity and Disaster Recovery Design [v5.X]

http://mylearn.vmware.com/mgrreg/courses.cfm?ui=www_edu&a=one&id_subject=31255

Summary:

Self-Paced (4.5 Hours)
Overview:
This self-paced course covers the concept of disaster, recovery sites, disaster recovery or DR and business continuity or BC issues, and the planning process.
Objectives: After this course, you will be able to:

• Describe what a “disaster” is and what it is not.
• Describe the difference between DRP and BCP.
• Describe the importance of remote recovery sites in DRP.
• Describe the importance of storage architecture in DRP and BCP.
• Describe the issues involved in disaster recovery and business continuity.
• Use decision trees to design recovery plans.
• Use decision trees to develop BCPs.
• Discuss the features and functions of VMware products that map to tasks within the creation of DRPs and BCPs.

Intended Audience: SEs (VMware/Partner) seeking certification to become a VMware Technical Pre-Sales Professional.

Outline: This course consists of the following modules:

• Module 1: Introduction to Disaster Recovery and Business Continuity

• Module 2: Using Decision Trees to Design Recovery Plans

• Module 3: Using Decision Trees to Design Business Continuity Plans

• Module 4: Mapping VMware Product Features to DRPs and BCPs