Tag Archives: vcap-dcd

VMware Business Continuity & Disaster Recovery (BCDR Design) Notes from Module 2

More notes from the “Using Decision Trees to Create a Disaster Recovery Plan” section of the BCDR Design MyLearn course.

A DRP is a complex thing to create.  Decision trees help create DRPs.  Have the Business Impact Analysis and risk assessment done, and have RPO and RTO ready.

Application Protection

Should the application be protected:
If yes – BIA should decide if application is protected
Is the data separate from application, on separate server or files?
If yes protect both servers/data
Should app be combined with other apps?
If no, protect independently

Sample decision tree

Restoration Priority

Is the app infrastructure related (DHCP, AD) which other servers rely on.
If yes, restore first.
If no, BIA determines how critical it is to the business.
If critical, restore prior to other apps, but not before infrastructure.
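The restoration priority tree above can be sketched as a sort key. This is only an illustration of the notes, not something from the course: the `App` fields and the sample app names are made up.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    is_infrastructure: bool  # e.g. DHCP, AD: other servers rely on it
    bia_critical: bool       # outcome of the Business Impact Analysis


def restore_tier(app: App) -> int:
    """Lower tier = restored earlier."""
    if app.is_infrastructure:
        return 0  # infrastructure always comes first
    if app.bia_critical:
        return 1  # critical apps, but never before infrastructure
    return 2      # everything else


def restoration_order(apps):
    # sorted() is stable, so apps within a tier keep their input order
    return sorted(apps, key=restore_tier)


# Hypothetical inventory
apps = [
    App("intranet-wiki", False, False),
    App("erp", False, True),
    App("active-directory", True, False),
    App("dhcp", True, False),
]
order = [a.name for a in restoration_order(apps)]
print(order)
```

Infrastructure lands in tier 0 even when the BIA does not flag it as critical, which matches the "restore first" branch of the tree.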

Recovery site selection and location

A single site is a dedicated recovery site, which generally simplifies the DRP.
Is it extremely remote?
If yes, how will staff get/stay there?
Is it Hot/warm/cold? Hot = more expensive.  Warm = less expensive but longer to bring online.
Multi-site setups are regional, multi-purpose recovery sites.

Recovery site connection

Are you replicating data over WAN links?
If yes, are they redundant?
Yes = higher cost.  No = less reliable: if the link is down, data cannot replicate.
If no, longer recovery time while tape/data manually delivered

Backup Design

Full or mixed (full, incremental, differential) – may differ from app to app
Diff + Full lowers complexity, but backups are slower than incremental
Full only = easiest restore, longer backups, more costly due to space
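A rough way to see the restore-complexity trade-off is to count how many backup sets a restore needs. This sketch assumes one backup job per day and that "full" means a full backup every day; the strategy names are my own labels, not course terminology.

```python
def sets_needed_for_restore(strategy: str, days_since_full: int) -> int:
    """Number of backup sets that must be read to restore to the latest backup."""
    if days_since_full < 0:
        raise ValueError("days_since_full must be >= 0")
    if strategy == "full":
        return 1  # the latest full is all you need
    if strategy == "full+differential":
        # last full plus only the most recent differential
        return 1 if days_since_full == 0 else 2
    if strategy == "full+incremental":
        # last full plus every incremental taken since then
        return 1 + days_since_full
    raise ValueError(f"unknown strategy: {strategy}")


for s in ("full", "full+differential", "full+incremental"):
    print(s, sets_needed_for_restore(s, 5))
```

Five days after the last full, a differential scheme still needs only two sets while an incremental scheme needs six, which is the complexity the notes refer to.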

Storage Architecture

Will OS, apps + data be mixed onto the same storage/LUNs?
If yes – makes DR harder, backups larger, and replicating LUNs takes more time because all files are moved
Suggest moving swap files to diff datastore/LUNs, have VMDKs with data only on separate file, possibly LUN but increases complexity of VM management

Sample decision tree

Replication Design
Will you replicate this LUN?
Non-replicated LUN will only be available if loaded from backup media.
For replicated LUNs, will they be replicated frequently?
If yes, more options for RPO, but more bandwidth needed.
Keep multiple snapshots?  If yes more storage space.
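A back-of-the-envelope sketch of the bandwidth and snapshot-space trade-off. It assumes a fixed amount of changed data has to be shipped inside a given window, and ignores compression, deduplication and protocol overhead; the numbers are invented for illustration.

```python
def required_mbps(delta_gb: float, window_minutes: float) -> float:
    """Average link speed (megabits/s) needed to ship delta_gb within the window."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    megabits = delta_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    return megabits / (window_minutes * 60)


def snapshot_storage_gb(delta_gb: float, snapshots_kept: int) -> float:
    """Extra space if each retained snapshot holds roughly one delta."""
    return delta_gb * snapshots_kept


# Shipping the same 10 GB of changes in a tighter window needs a faster link
print(round(required_mbps(10, 60), 2))  # 60-minute window
print(round(required_mbps(10, 15), 2))  # 15-minute window
print(snapshot_storage_gb(10, 7))       # keeping 7 snapshots
```

Real change rates are bursty, so a design would size for peak deltas rather than the average.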

Power and facility design
Will you manage power?
If no, you have to provide power and cooling at all times.
If cold site, only need AC during operation.  Need to ensure power.
If yes, cost savings from power and cooling while systems are not in use.

DRP Testing
Can you test part of it without testing entire plan?
If no, will you test complete plan more than once a year?
If no, test once a year at minimum, generally need 3 test cycles before working properly
If yes, helps perfect plan but more expensive/time consuming.
If yes, does it interfere with production?
If yes, harder to test as interferes with business.

**Design the test plan modularly: harder to design, but testing is easier**

Disaster Declaration

Should a disaster be declared?
If no, business continues normally, solve outages locally using normal procedures.
If yes, is it a single person to declare disaster?
Committees may need to approve.  If partial DRP can be implemented, easier to implement for only some systems
Single person can act quicker, but committee may have more reasoned judgement.


VMware Business Continuity & Disaster Recovery (BCDR Design) Notes from Module 1

Strictly a brain dump/note taking page.

Disaster – halts business; harms physical location but also digital/data loss.  If you lose your data, you are likely to go out of business – up to 90% within 1 year.

Services lost – web, email, finances, ERP apps unavailable, causing business shutdown.  Redundant app servers do not protect against disaster in the same data center.

What is a disaster – Natural (tornado, hurricane, etc), Man made (infrastructure failure, terrorism).  Disaster depends on how your company designs its network, sets SLAs, builds redundancy.

Disaster is declared by an officer of the company or a group of officers, who then implement the DRP/BCP.

3 Categories

Catastrophe:  Typically natural event, entire data center is destroyed.  Affects geographic area rendering local support services unavailable.

Disaster:  Localized to datacenter, unavailable day or longer.  65% that lose their DC for more than a week go out of business within a year.

Non-disaster:  Service disruption caused by a specific failure

Not a disaster:  Failure of hardware component, temporary service interruption such as power outage.  Small, isolated failures not built into BCP/DRP.

Disaster Recovery Plan (DRP) Objectives:
Minimize downtime:  Streamline processes.
Run book:  Guide to implement DRP, manually created, maintained by hand.
Reduce risk:  DRP can’t fail, needs to be tested.  Requires additional hardware.

Reduce cost:  Control cost, fast and simple = expensive.  Potentially double cost for duplicate data center.

Physical DR process.  Data is replicated from production to recovery site via replication over WAN or moving via backups.

Complications:  Lots of data to identify and replicate, complex recovery process, inability to test.

RPO – Recovery Point Objective – how old can data be that is recovered
RTO – Recovery Time Objective – how long until back online
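A small sketch of how the two objectives are checked against an actual outage. The timestamps and targets are made up for illustration.

```python
from datetime import datetime, timedelta


def meets_rpo(last_good_copy: datetime, failure: datetime, rpo: timedelta) -> bool:
    """RPO: the recovered data may be at most `rpo` old at the moment of failure."""
    return failure - last_good_copy <= rpo


def meets_rto(failure: datetime, back_online: datetime, rto: timedelta) -> bool:
    """RTO: service must be back within `rto` of the failure."""
    return back_online - failure <= rto


# Hypothetical outage
failure = datetime(2013, 2, 26, 12, 0)
last_replica = datetime(2013, 2, 26, 11, 45)  # 15 minutes of data lost
restored = datetime(2013, 2, 26, 18, 0)       # 6 hours of downtime

rpo_ok = meets_rpo(last_replica, failure, timedelta(minutes=30))
rto_ok = meets_rto(failure, restored, timedelta(hours=4))
print(rpo_ok, rto_ok)
```

Here the 30-minute RPO is met but the 4-hour RTO is missed, which shows that the two objectives are measured independently.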

Hot site – ready in minutes/hours
Warm site – ready in days
Cold site – several days to bring online

SRM supports several storage vendors.

Failover – switching operations from primary site to recovery site.
Failback – switch back from recovery site to primary site – days to months to failback.

DRP – process, policies and procedures for recovery.  DR is focused on technology.  DRPs are plans.  Exact step by step procedure to bring systems online, keep personnel safe, protect assets, switch systems to remote site.  Includes network config, how to verify failover.

Non virtual – install OS, apps, drivers, restore data
Virtual – install hypervisor, mount snapshots/replicated luns, power on VMs

If the DRP is not tested, you don’t have a DRP – testing reveals oversights.

Test that app data is protected.  There is often not enough time to document configs etc. and keep that documentation up to date.

Design DR in a modular fashion, test small parts before full test.  Tests should minimize disruption to production.

Once passes test, needs to be tested regularly.

Post disaster:  How to run at recovery site, less capacity, may be inconvenient for staff, plan for temporary housing.

Questions for company management – which apps to shut down, when to fail back – belong in the BCP.

DRP – plan/procedures during chaos of disaster on safeguarding assets and personnel and is procedure oriented to get systems online.

BCP – process of keeping company running, day to day ops at recovery site.  BCP for IT is different than Finance.  Each department should have a BCP.

BCP addresses 3 things:
– Run ops at recovery site
– Issues
– Failback
DR/BC is not a product; no single product or group of products can give instant DR/BC.

SRM helps, but it needs other products and technology, and it needs planning.

DRP relies on storage technology, backups need to be offsite, replication has to be timely (consider which LUNs are replicated), not all files need to be replicated (don’t replicate OS, swap, temp files).  Make a library of ISOs and apps.

Good backup for small scale outages.

DRP – 11 Step Process:

– Enable management buy in – management (CEO, CFO, CTO) must agree it’s required and funded.  Management driven.  Ongoing significant expense.  Upper level management must allow staff time for testing.
– Business impact analysis –
  – Identify key assets (blueprints, IP, equipment, data centers, apps, data) and business functions, map functions to assets, identify interdependencies.
  – Determination of loss criteria – what if you lose asset X, or it is degraded.
  – Max tolerable downtime (MTD) for assets – assign value to assets based on loss criteria.  If down longer than MTD, significant loss of business.
    – Critical: minutes, Urgent: 24 hrs, Important: 72 hrs, Normal: 7-14 days, Non-essential: 30 days

– Define RPO – Industry standard measurement, point in time to be recovered, amount of data loss.
– Define RTO – Industry standard measurement, how long can be down, includes fault detection and bringing app back online.
– Risk assessment – What are your risks – location based problems such as earthquakes, tornados.  Manmade problems – leak, auto accident.  Can’t protect against all, determine likelihood.
– Examine regulatory compliance – Some laws may require specific technology, RPO, RTO.  Check legal requirements.
– Develop DRPs – Create outlines using above info: what has to come online first (including infrastructure), what order for all systems, what is not critical / not in DRP, does it need to be in BCP?
– Design DR systems – Select remote recovery sites, dedicated or non-dedicated, hot-warm-cold?, storage and replication tech, WAN links, communication for operations (phones, mobile).
– Create run books – specific set of procedures, detailed, step by step, app specific.  Rebuild system from OS, restore accounts, create infrastructure settings (AD, LDAP, VLAN), reload software, config software, restore data.  Site specific.  Run book for each asset, order runbook based on when systems need to be brought online.  Hard to create/maintain, capture config changes.
– Develop BCPs – What to do when DRP is finished and you are now operating the business.  What problems – access, fewer resources, remote access, physical problems (desks, phones, direct dial), plan for operations at remote site (backups, accounting).  Plan for failback – storage replication back.  Need failback procedures.
– Test DRPs and BCPs – You must test, problems will be found.
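The maximum tolerable downtime tiers from the business impact analysis step can be sketched as a lookup. The tier names and the 24 hr / 72 hr / 14 day boundaries come from the notes above; treating "minutes" as "under one hour" and everything beyond 14 days as non-essential are my own assumptions.

```python
def mtd_tier(mtd_hours: float) -> str:
    """Map an asset's maximum tolerable downtime (in hours) to a BIA tier."""
    if mtd_hours < 0:
        raise ValueError("mtd_hours must be >= 0")
    if mtd_hours < 1:        # assumption: "minutes" = less than one hour
        return "Critical"
    if mtd_hours <= 24:
        return "Urgent"
    if mtd_hours <= 72:
        return "Important"
    if mtd_hours <= 14 * 24:  # up to 14 days
        return "Normal"
    return "Non-essential"    # assumption: anything longer than 14 days


# Hypothetical assets and their MTDs in hours
assets = {"active-directory": 0.25, "email": 24, "erp": 48, "file-archive": 240, "test-lab": 720}
tiers = {name: mtd_tier(hours) for name, hours in assets.items()}
print(tiers)
```

The resulting tiers feed straight into the RPO/RTO definitions and the restore order in the DRP outline.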

VCAP-DCD practice design by @GreggRobertson5

Great sample company scenario by Gregg Robertson for those preparing for their VCAP-DCD.


As some people may know, I am currently preparing to re-take my VCAP5-DCD, and I have reached the point in my preparations where I am doing mock designs and going through the labs from the VMware Design Workshop. So I thought I would follow the same idea and create a mock customer design scenario, and also put down the same vein of questions I am being asked in the design workshop labs. Hopefully, if people are interested, they can use it to write down their design choices, the justifications for those choices and the impacts those choices create on the rest of the design, and hopefully everyone will learn from this. Below is a company profile that I made up; I also used some ideas from a scenario that Matt Mould, one of my Xtravirt colleagues, sent me a few months back:

Company Profile
•    Safe &…


VCAP-DCD scheduled, study notes updated.

Well I did it, I scheduled my VCAP-DCD for Tuesday February 26th.  I have a lot I want to cover between now and then and here is what I will be doing.

  • APAC #vBrownBag VCAP-DCD recordings (done but will likely listen to them a few more times)
  • Clustering Deep Dive (about half way done)
  • VMware Press books – storage and building a vDC
  • Mastering vSphere 5 (re-read)
  • DR/BCP course from VMware available on MyLearn

That doesn’t seem like nearly enough as I write it out.

Example Architectural Decision – Network I/O Control Shares/Limits for ESXi Host using IP Storage via Josh Odgers (@josh_odgers)

Another in a great series by Josh Odgers


Problem Statement

With 10Gb connections becoming the norm, ESXi hosts will generally have fewer physical connections than in the past, when 1Gb was generally used, but more bandwidth per connection (and in total) than a host with 1Gb NICs.

In this case, the hosts have only 2 x 10Gb NICs and the design needs to cater for all traffic (including IP storage) for the ESXi hosts.

The design needs to ensure every traffic type has sufficient burst and sustained bandwidth without significantly impacting the other traffic types.

How can this be achieved?


1. No additional network cards (1Gb or 10Gb) can be supported
2. vSphere 5.1
3. Multi-NIC vMotion is desired


1. Two (2) x 10Gb NICs


1. Ensure IP Storage (NFS) performance is optimal
2. Ensure vMotion activities (including a host entering maintenance mode) can be performed in a timely…

View original post 615 more words
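As a rough illustration of how shares behave under contention (my own sketch, not taken from Josh's post): when an uplink is saturated, bandwidth is divided among the traffic types that are actively sending, in proportion to their shares. The share values below are invented and are not the values from the original architectural decision.

```python
def split_by_shares(link_gbps: float, shares: dict, active: list) -> dict:
    """Divide one uplink's bandwidth among the actively sending traffic types."""
    total = sum(shares[t] for t in active)
    if total == 0:
        raise ValueError("active traffic types must have non-zero total shares")
    return {t: link_gbps * shares[t] / total for t in active}


# Hypothetical share values, for illustration only
SHARES = {"nfs": 100, "vmotion": 50, "vm": 100, "management": 25}

# Only NFS and vMotion are busy on a 10Gb uplink
busy = split_by_shares(10, SHARES, ["nfs", "vmotion"])
# Every traffic type is busy at once
worst_case = split_by_shares(10, SHARES, list(SHARES))
print(busy)
print(worst_case)
```

Shares only matter while the link is contended; an idle traffic type's portion is available to the others, which is why shares are usually preferred over hard limits for burst traffic such as vMotion.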

Back to VCAP-DCD Prep

I let myself lose focus last week, here is the plan:

1.  Watch APAC #vBrownBag recordings
2.  Finish Clustering Deep Dive
3.  Watch DR training on MyLearn
4.  Watch SRM training on VMware.com
5.  Read Storage Implementation from VMware Press
6.  Take notes (from slides) from the VCAP-DCD and VCDX #vBrownbags
7.  Re-read
8.  Review VCAP-DCD blueprint and study guides by Gregg Robertson (@GreggRobertson5) and Shane Williford (@coolsport00)

I also have a small startup looking for a basic office setup so I am going to “design” the hell out of it.

Example Architectural Decision – Virtual Machine swap file location via Josh Odgers (@josh_odgers)


Problem Statement

When using shared storage where deduplication is utilized along with an array level snapshot based backup solution, what can be done to minimize the wasted capacity of snapping transient files in backups and the CPU overhead on the storage controller having to attempt to deduplicate data which cannot be deduped?


1. Virtual machine memory reservations cannot be used to reduce the vswap file size


1. Reduce the snapshot size for backups without impacting the ability to backup and restore
2. Minimize the overhead on the storage controller for deduplication processing
3. Optimize the vSphere / Storage solution for maximum performance

Architectural Decision

1. Configure the HA swap file policy to store the swap file in a datastore specified by the host.
2. Create a new datastore per cluster which is hosted on Tier 1 storage and ensure deduplication is disabled on that volume
3. Configure all…
