Thanks for sharing Nick, I have a lot more EMC in my future (starting Tuesday!).
More note taking, from the Using Decision Tree to create a Disaster Recovery Plan section of the BCDR Design MyLearn course.
A DRP is a complex thing to create; decision trees help build it. Have the Business Impact Analysis and risk assessment done, and have RPO and RTO ready.
Should the application be protected:
If yes – BIA should decide if application is protected
Is the data separate from application, on separate server or files?
If yes protect both servers/data
Should app be combined with other apps?
If no, protect independently
Is the app infrastructure related (DHCP, AD), something other servers rely on?
If yes, restore first.
If no, BIA determines how critical it is to the business.
If critical, restore prior to other apps, but not before infrastructure.
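The restore-order part of this tree can be sketched as a small sort. This is only an illustration, not something from the course: the `App` fields and the sample application names are made up, and a real BIA would carry far more detail.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    protected: bool        # BIA outcome: should this app be protected at all?
    infrastructure: bool   # DHCP, AD and similar services other servers rely on
    critical: bool         # BIA rates the app as critical to the business


def restore_order(apps):
    """Return protected apps in restore order: infrastructure, critical, then the rest."""
    def tier(app):
        if app.infrastructure:
            return 0
        if app.critical:
            return 1
        return 2

    protected = [a for a in apps if a.protected]
    # sorted() is stable, so apps in the same tier keep their input order
    return sorted(protected, key=tier)


# Hypothetical inventory, for illustration only
apps = [
    App("intranet-wiki", protected=True, infrastructure=False, critical=False),
    App("erp", protected=True, infrastructure=False, critical=True),
    App("active-directory", protected=True, infrastructure=True, critical=True),
    App("test-lab", protected=False, infrastructure=False, critical=False),
]

for app in restore_order(apps):
    print(app.name)
```

Unprotected apps drop out entirely, infrastructure comes first, and critical apps follow before everything else.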
Recovery site selection and location
Single site is a dedicated recovery site, generally simplifies DRP.
Is it extremely remote?
If yes, how will staff get/stay there?
Is it Hot/warm/cold? Hot = more expensive. Warm = less expensive but longer to bring online.
Multi site means regional, multi-purpose recovery sites.
Recovery site connection
Are you replicating data over WAN links?
If yes, are they redundant?
Yes = higher cost. No = less reliable; if the link is down, data cannot replicate.
If no, longer recovery time while tape/data manually delivered
Full or mixed (full, incremental, differential) – may differ from app to app
Diff + Full lowers complexity, but slower backups than incremental
Full only = easiest restore, longer backups, more costly due to space
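To make the restore-complexity trade-off concrete, here is a rough sketch of which backup sets a restore needs under each scheme. The function name, the `(day, kind)` tuples and the sample week are my own assumptions for illustration, not anything from the course.

```python
def restore_chain(backups, strategy):
    """Return the backups needed to restore to the most recent point.

    backups: list of (day, kind) tuples in chronological order,
             where kind is "full", "diff" or "incr".
    strategy: "full", "full+diff" or "full+incr".
    """
    # Every strategy starts from the most recent full backup
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    chain = [backups[last_full]]
    later = backups[last_full + 1:]

    if strategy == "full+diff":
        # A differential holds all changes since the last full: only the newest is needed
        diffs = [b for b in later if b[1] == "diff"]
        if diffs:
            chain.append(diffs[-1])
    elif strategy == "full+incr":
        # Each incremental holds changes since the previous backup: all are needed
        chain.extend(b for b in later if b[1] == "incr")
    return chain


# Hypothetical backup week, for illustration only
week_diff = [("Sun", "full"), ("Mon", "diff"), ("Tue", "diff"), ("Wed", "diff")]
week_incr = [("Sun", "full"), ("Mon", "incr"), ("Tue", "incr"), ("Wed", "incr")]

print(restore_chain(week_diff, "full+diff"))
print(restore_chain(week_incr, "full+incr"))
```

Full plus differential restores from two sets no matter how many days have passed, while full plus incremental needs every set since the last full.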
Will OS, apps + data be mixed onto the same storage/LUNs?
If yes – makes DR harder, backups larger, replicating LUNs takes more time due to moving all files
Suggest moving swap files to diff datastore/LUNs, have VMDKs with data only on separate file, possibly LUN but increases complexity of VM management
Will you replicate this LUN?
Non-replicated LUN will only be available if loaded from backup media.
Replicated LUNs: will they be frequently replicated?
If yes, more options for RPO, but more bandwidth needed.
Keep multiple snapshots? If yes more storage space.
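One way to see why frequent replication needs more bandwidth: with a bursty change rate, a short interval has to keep up with the busiest hour, while a long interval can average the burst out. The sketch below is a back-of-the-envelope model under my own assumptions (each interval's changes must ship within one interval, no compression or dedupe, decimal units, made-up change profile).

```python
def peak_mbps(hourly_change_gb, interval_hours):
    """Link speed (megabits/s) needed so each interval's changes ship within one interval."""
    windows = [
        sum(hourly_change_gb[i:i + interval_hours])
        for i in range(0, len(hourly_change_gb), interval_hours)
    ]
    # 1 GB = 8000 megabits (decimal units); an interval lasts interval_hours * 3600 seconds
    return max(windows) * 8000 / (interval_hours * 3600)


# Hypothetical 24-hour change profile in GB per hour: quiet, with a mid-morning burst
hourly = [1] * 8 + [20, 30, 10] + [1] * 13

for hours in (1, 4, 24):
    print(f"replicate every {hours:>2} h: {peak_mbps(hourly, hours):.1f} Mbit/s")
```

With this sample profile, hourly replication needs roughly nine times the link speed of a once-a-day replication, in exchange for a much tighter RPO.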
Power and facility design
Will you manage power?
If no, you have to provide power and cooling at all times.
If cold site, only need AC during operation. Need to ensure power.
If yes, cost savings from power and cooling while systems are not in use.
Can you test part of it without testing entire plan?
If no, will you test complete plan more than once a year?
If no, test once a year at minimum, generally need 3 test cycles before working properly
If yes, helps perfect plan but more expensive/time consuming.
If yes, does it interfere with production?
If yes, harder to test as interferes with business.
**Design test plan modularly, harder but testing easier**
Should a disaster be declared?
If no, business continues normally, solve outages locally using normal procedures.
If yes, is it a single person to declare disaster?
Committees may need to approve. If partial DRP can be implemented, easier to implement for only some systems
Single person can act quicker, but committee may have more reasoned judgement.
Strictly a brain dump/note taking page.
Disaster – halts business; harms physical location but also digital/data loss. If you lose your data, you are likely to go out of business – up to 90% within 1 year.
Services lost – web, email, finances, ERP apps unavailable causing business shutdown. Redundant app servers do not protect against disaster in the same data center.
What is a disaster – Natural (tornado, hurricane, etc), Man made (infrastructure failure, terrorism). Disaster depends on how your company designs its network, sets SLAs, builds redundancy.
Disaster is declared by an officer of the company or group of officers, implement DRP/BCP
Catastrophe: Typically natural event, entire data center is destroyed. Affects geographic area rendering local support services unavailable.
Disaster: Localized to datacenter, unavailable day or longer. 65% that lose their DC for more than a week go out of business within a year.
Non-disaster: Service disruption caused by a specific failure
Not a disaster: Failure of hardware component, temporary service interruption such as power outage. Small, isolated failures not built into BCP/DRP.
Disaster Recovery Plan (DRP) Objectives:
Minimize downtime: Streamline processes.
Run book: Guide to implement DRP, manually created, maintained by hand.
Reduce risk: DRP can’t fail, needs to be tested. Requires additional hardware.
Reduce cost: Control cost, fast and simple = expensive. Potentially double cost for duplicate data center.
Physical DR process. Data is replicated from production to recovery site via replication over WAN or moving via backups.
Complications: Lots of data to identify and replicate, complex recovery process, inability to test.
RPO – Recovery Point Objective – how old can data be that is recovered
RTO – Recovery Time Objective – how long until back online
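As a quick worked example of the two definitions, the sketch below checks a single outage against an RPO and an RTO. The function name and the timestamps are hypothetical, chosen only to show the arithmetic.

```python
from datetime import datetime, timedelta


def check_objectives(last_recovery_point, failure_time, back_online, rpo, rto):
    """Compare one outage against the RPO (data age) and RTO (time offline)."""
    data_loss = failure_time - last_recovery_point   # how old the recovered data is
    downtime = back_online - failure_time            # how long until back online
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical outage: last replica at 09:45, failure at 10:00, back online at 13:30
result = check_objectives(
    last_recovery_point=datetime(2013, 2, 1, 9, 45),
    failure_time=datetime(2013, 2, 1, 10, 0),
    back_online=datetime(2013, 2, 1, 13, 30),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(result)
```

Here 15 minutes of data is lost against a 1 hour RPO, and the app is down 3.5 hours against a 4 hour RTO, so both objectives are met.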
Hot site – ready in minutes/hours
Warm site – ready in days
Cold site – several days to bring online
SRM supports several storage vendors.
Failover – switching operations from primary site to recovery site.
Failback – switch back from recovery site to primary site – days to months to failback.
DRP – process, policies and procedures for recovery. DR focused on technology. DRP are plans. Exact step by step procedure to bring systems online, keep personnel safe, protect assets, switch systems to remote site. Includes network config, how to verify failover.
Non virtual – install OS, apps, drivers, restore data
Virtual – install hypervisor, mount snapshots/replicated luns, power on VMs
If DRP is not tested, you don’t have a DRP – testing reveals oversights.
Test that app data is protected; there is often not enough time to keep documentation for configs etc. up to date.
Design DR in a modular fashion, test small parts before full test. Tests should minimize disruption to production.
Once passes test, needs to be tested regularly.
Post disaster: How to run at recovery site, less capacity, may be inconvenient for staff, plan for temporary housing.
Questions for company management – whether to shut down certain apps, when to fail back – belong in the BCP.
DRP – plan/procedures during chaos of disaster on safeguarding assets and personnel and is procedure oriented to get systems online.
BCP – process of keeping company running, day to day ops at recovery site. BCP for IT is different than Finance. Each department should have a BCP.
BCP addresses 3 things:
– Run ops at recovery site
DR/BC is not a product, no single product or group of products can give instant DR/BC
SRM helps, but it needs other products and technology, and it needs planning.
DRP relies on storage technology, backups need to be offsite, replication has to be timely (consider which LUNs are replicated), not all files need to be replicated (don’t replicate OS, swap, temp files). Make library of ISOs, apps.
Good backup for small scale outages.
DRP – 11 Step Process:
– Enable management buy in – management must agree (CEO, CFO, CTO) it’s required and funded. Management driven. Ongoing significant expense. Upper level management must allow staff time for testing.
– Business impact analysis –
– Identify key assets (blueprints, IP, equipment, data centers, apps, data) and business functions, map functions to assets, identify interdependencies.
– Determination of loss criteria – what if you lose asset X, or is degraded.
– Max tolerable downtime for assets – assign value to assets based on loss criteria. If down longer than MTD significant loss of business.
– Critical, minutes, Urgent 24 hrs, Important 72 hrs, Normal 7-14 days, Non-essential 30 days
– Define RPO – Industry standard measurement, point in time to be recovered, amount of data loss.
– Define RTO – Industry standard measurement, how long can be down, includes fault detection and bringing app back online.
– Risk assessment – What are your risks – location based problems such as earthquakes, tornados. Manmade problems – leak, auto accident. Can’t protect against all, determine likelihood.
– Examine regulatory compliance – Some laws may require specific technology, RPO, RTO. Check legal requirements.
– Develop DRPs – Create outlines using above info, what has to come online first including infrastructure, what order for all systems, not critical / not in DRP, does it need to be in BCP?
– Design DR systems – Select remote recovery sites, dedicated or non-dedicated, hot-warm-cold?, storage and replication tech, WAN links, communication for operations (phones, mobile).
– Create run books – specific set of procedures, detailed, step by step, app specific. Rebuild system from OS, restore accounts, create infrastructure settings (AD, LDAP, VLAN), reload software, config software, restore data. Site specific. Run book for each asset, order runbook based on when systems need to be brought online. Hard to create/maintain, capture config changes.
– Develop BCPs – What to do when DRP is finished and you now operate the business. What problems – access, fewer resources, remote access, physical problems (desks, phones, direct dial), plan for operations at remote site (backups, accounting). Plan for failback – storage replication back. Need failback procedures.
– Test DRPs and BCPs – You must test, problems will be found.
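The max tolerable downtime tiers from the business impact analysis step map neatly onto a lookup. A minimal sketch, with two assumptions of mine flagged in the comments: the notes only say "minutes" for Critical, so I picked a 1 hour ceiling, and I used the upper end of the 7-14 day range for Normal.

```python
from datetime import timedelta

# Tier ceilings from the notes above. Assumptions: "Critical" is capped at 1 hour
# (the notes only say "minutes") and "Normal" uses the upper end of 7-14 days.
TIERS = [
    ("Critical", timedelta(hours=1)),
    ("Urgent", timedelta(hours=24)),
    ("Important", timedelta(hours=72)),
    ("Normal", timedelta(days=14)),
    ("Non-essential", timedelta(days=30)),
]


def mtd_tier(mtd):
    """Return the first tier whose ceiling covers the given max tolerable downtime."""
    for name, ceiling in TIERS:
        if mtd <= ceiling:
            return name
    # Anything that can be down longer than 30 days is still non-essential
    return "Non-essential"


# Hypothetical assets, for illustration only
assets = {
    "order-processing": timedelta(minutes=15),
    "email": timedelta(hours=24),
    "file-shares": timedelta(hours=48),
    "reporting": timedelta(days=10),
    "archive": timedelta(days=45),
}

for name, mtd in assets.items():
    print(f"{name}: {mtd_tier(mtd)}")
```

Sorting assets by tier gives a first cut at the order the run books need to be executed in.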
Great sample company scenario by Gregg Robertson for those preparing for their VCAP-DCD.
As some people may know, I am currently preparing to re-take my VCAP5-DCD. I have reached the point in my preparations where I am doing mock designs and going through the labs from the VMware Design Workshop, so I thought I would follow the same idea: create a mock customer design scenario and put down the same vein of questions I am being asked in the design workshop labs. Hopefully, if people are interested, they can use it to write down their design choices, the justifications for these choices and the impacts these choices create on the rest of the design, and everyone will learn from this. Below is a company profile that I made up; I also used some ideas from a scenario that Matt Mould, one of my Xtravirt colleagues, sent me a few months back:
• Safe &…
Example Architectural Decision – HA Admission Control Policy with Software licensing constraints via @josh_odgers
High Availability Admission Control Setting & Policy with a Software Licensing Constraint
The customer has a requirement to virtualize “Application X” which is currently running on physical servers. The customer is licensed for a maximum of 32 cores and the software vendor has strict licensing restrictions which do not recognize the use of DRS rules to restrict virtual machines to a sub-set of hosts within a cluster.
The application is Tier 1, and requires maximum availability. A capacity planner assessment has been conducted and found 32 cores and 256GB RAM is sufficient to run all servers.
The servers’ requirements vary greatly from 1vCPU/2GB RAM to 8vCPU/64GB RAM, with the bulk of the VMs 2vCPU or less with varying RAM sizes.
What is the most suitable hardware configuration and HA admission control policy / setting that complies with the licensing restrictions while ensuring N+1 redundancy and minimizing the change…
A few months ago I wrote a post on how I use various social media sites, but realized I didn’t share how I arrived at those use cases/decisions so here it is.
First and foremost identify what you hope to get from your social media experience. If one of your goals isn’t to share knowledge then re-think why you are trying to use social media. I have heard, well read, that Twitter is one of the largest groups of people who love to share their knowledge and experiences. When I decided it was time to be more involved I had several reasons – becoming more involved in a community I was passionate about, learning from others, sharing my experiences and meeting others who shared similar interest were my key drivers for getting started with social media.
Next you will want to identify the key areas you want to focus on. For me this was an easy task as I had decided to veer off my (at the time) current career trajectory and re-focus on more hands on/engineering type roles that would hopefully revolve around virtualization, VMware and other related technologies.
Once you know what information you want, do some research and identify what networks are most active in that area. I find Twitter to be much more active with others in the VMware/engineering space versus Facebook so I started with Twitter and later expanded into other networks. While I use some platforms in a similar manner, I also have specific uses for others. For example Twitter is my primary social network where I will engage openly with everyone, and use Google+ in a similar fashion while leveraging circles to share more personal information such as pictures of my family with just a limited subset of circles. You will likely also find yourself involved in groups with no real tie to a particular social media site/tool such as local meet up groups, blogs or webinars.
Now, develop a plan. Once you have identified the reasons why you want to be involved, and which communities are most active your plan could be as simple as taking time to focus on one specific aspect and expanding into other areas as you become comfortable in others. For example, after several attempts and “figuring out” Twitter, I focused simply on using it as a tool to stay up to date on information from various vendors who I enjoyed working with. This made an easy transition into following others who were talking about the same topics, learning from and meeting those people.
Finally – push yourself. It is easy to stay in an area where you are comfortable – with people you know, or information and skills you may already excel at – but you won’t grow if you do not challenge yourself. One area that I just stumbled into was being more outgoing, as typically I am quiet and reserved and don’t go out of my way to talk to people. One night I happened to see someone I follow on Twitter, who lives out of state, ask where he could get dinner just a couple of towns away from me, as he was in the area for work. Several other people who already knew him suggested some places, and others were even going to meet him there, so I threw myself into the mix and had some great conversations over dinner with people I had never met – that was far outside my comfort zone. Of course, in a scenario like that, ask first; don’t just show up!
If you are struggling with how to get started, leverage social media as a means to get information and learn and you will likely find that over time you will start engaging and sharing more information naturally as others are already doing.
Well I did it, I scheduled my VCAP-DCD for Tuesday February 26th. I have a lot I want to cover between now and then and here is what I will be doing.
- APAC #vBrownBag VCAP-DCD recordings (done but will likely listen to them a few more times)
- Clustering Deep Dive (about half way done)
- VMware Press books – storage and building a vDC
- Mastering vSphere 5 (re-read)
- DR/BCP course from VMware available on MyLearn
That doesn’t seem like nearly enough as I write it out.
Thanks for sharing Brian, great notes!
Since the folks over at the VMware Go team liked my first post so much, I figured I’d oblige and write up an article about how to install ESXi through VMware Go since they tweeted about it before I actually wrote it 🙂
VMware Go, for those that missed the last post, is a cloud based service for small businesses and new VMware admins to help manage and set up their VMware environment. There are “two” ways to install ESXi from VMware Go – by converting an existing Windows server/machine or downloading the ISO and installing manually. The latter isn’t really installing “through” VMware Go but is certainly a viable path, and then you can simply add the host once your install is finished.
Once logged into VMware Go, click on the Virtual tab and select Install an ESXi Hypervisor from the drop down menu.
Click the Get Started button on the next screen and provide the IP address of the Windows server you wish to convert and click the Next button.
VMware Go will connect to the IP address of the computer to determine if it is compatible; when prompted, enter the username and password for that server. The machine I tried to install on failed the compatibility check as you can see here:
Thankfully I am doing this in a VM so one second while I go reconfigure that machine…and we are back and the machine passed the test this time as you can see:
I popped in my hostname and opted for DHCP config. Make sure you pay attention to the warning – Windows will be gone! Make sure you backed up your data, settings etc… if you need anything from this server and click next. You will see a summary of the actions to be taken. If you are ready to take the plunge, click the Start ESXi Hypervisor Installation!
After confirming you will blow away your Windows install, you will be prompted for the ESXi password you wish to set, and then need to enter the Windows credentials again. You can see the task/download/install progress at the top of the screen:
Cue on-hold music in your head….daaa na na naa naa na. Less than 20 minutes and a few reboots later I had an ESXi screen where my Windows login screen once appeared, and could see the task completed in VMware Go.
VMware Go was even nice enough to add it to its inventory for me! So that’s it, without ever touching an ISO I nuked my Windows server and turned it into a functional ESXi server!
Another in a great series by Josh Odgers
With 10Gb connections becoming the norm, ESXi hosts will generally have fewer physical connections than in the past where 1Gb was generally used, but more bandwidth per connection (and in total) than a host with 1Gb NICs.
In this case, the hosts have only 2 x 10Gb NICs and the design needs to cater for all traffic (including IP storage) for the ESXi hosts.
The design needs to ensure all types of traffic have sufficient burst and sustained bandwidth without significantly negatively impacting other types of traffic.
How can this be achieved?
1. No additional Network cards (1Gb or 10Gb) can be supported
2. vSphere 5.1
3. Multi-NIC vMotion is desired
1. Two (2) x 10Gb NICs
1. Ensure IP Storage (NFS) performance is optimal
2. Ensure vMotion activities (including a host entering maintenance mode) can be performed in a timely…