
VMware Business Continuity & Disaster Recovery (BCDR Design) Notes from Module 2

More notes from the “Using Decision Tree to create a Disaster Recovery Plan” section of the BCDR Design MyLearn course.

A DRP is a complex thing to create; decision trees help build it.  Have the Business Impact Analysis (BIA) and risk assessment done, and have RPO and RTO ready.

Application Protection

Should the application be protected?
The BIA should decide whether the application is protected.
If yes, is the data separate from the application, on a separate server or in separate files?
If yes, protect both servers/data.
Should the app be combined with other apps?
If no, protect independently.

Sample decision tree
[Figure: sample application decision tree]
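
The application protection questions above can be sketched as a small function. This is only an illustration of the decision tree in these notes, not anything from the course material: the `App` fields and the wording of the returned actions are made up here, and the "protect as part of an application group" branch is an assumption, since the notes only say what to do when the answer is no.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class App:
    name: str
    bia_says_protect: bool     # outcome of the Business Impact Analysis
    data_is_separate: bool     # data lives on its own server or files
    combine_with_others: bool  # protect together with related apps


def protection_plan(app: App) -> List[str]:
    """Walk the three questions in order and return the resulting actions."""
    if not app.bia_says_protect:
        return ["not protected (per BIA)"]
    actions = []
    if app.data_is_separate:
        actions.append("protect application server and data server/files")
    else:
        actions.append("protect application server")
    if app.combine_with_others:
        # Assumption: the notes do not spell out the "yes" branch.
        actions.append("protect as part of an application group")
    else:
        actions.append("protect independently")
    return actions


if __name__ == "__main__":
    crm = App("crm", bia_says_protect=True, data_is_separate=True,
              combine_with_others=False)
    for step in protection_plan(crm):
        print(step)
```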

Restoration Priority

Is the app infrastructure-related (DHCP, AD), i.e. something other servers rely on?
If yes, restore first.
If no, BIA determines how critical it is to the business.
If critical, restore prior to other apps, but not before infrastructure.
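
A rough sketch of the restoration priority rules above: infrastructure first, then business-critical apps, then everything else. The dictionary keys and the sample app names are hypothetical; in practice the `critical` flag would come out of the BIA.

```python
def restore_order(apps):
    """Return app names in restore order.

    apps: list of dicts with 'name', 'infrastructure' (bool) and
    'critical' (bool, as determined by the BIA).
    """
    def tier(app):
        if app["infrastructure"]:
            return 0  # DHCP, AD and similar: restore first
        if app["critical"]:
            return 1  # critical to the business, but after infrastructure
        return 2      # everything else

    # sorted() is stable, so apps within a tier keep their input order.
    return [app["name"] for app in sorted(apps, key=tier)]


if __name__ == "__main__":
    inventory = [
        {"name": "crm", "infrastructure": False, "critical": True},
        {"name": "wiki", "infrastructure": False, "critical": False},
        {"name": "AD", "infrastructure": True, "critical": True},
        {"name": "DHCP", "infrastructure": True, "critical": False},
    ]
    print(restore_order(inventory))
```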

Recovery site selection and location

Single site is a dedicated recovery site, generally simplifies the DRP.
Is it extremely remote?
If yes, how will staff get/stay there?
Is it hot/warm/cold?  Hot = more expensive.  Warm = less expensive but longer to bring online.
Multi-site means regional, multi-purpose recovery sites.

Recovery site connection

Are you replicating data over WAN links?
If yes, are they redundant?
Yes = higher cost; no = less reliable, since data cannot replicate if the link is down.
If no, longer recovery time while tape/data is manually delivered.

Backup Design

Full or mixed (full, incremental, differential) – may differ from app to app
Diff + Full lowers complexity, but backups are slower than incremental
Full only = easiest restore, longer backups, more costly due to space
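
To make the restore-side trade-off concrete, here is a small sketch that counts how many backup sets a restore needs under each strategy. It assumes one backup per day and the usual definitions (a differential holds changes since the last full, an incremental holds changes since the previous backup); the function name and strategy labels are just for illustration.

```python
def sets_needed_for_restore(strategy, days_since_full):
    """Number of backup sets to load for a restore to the latest backup.

    Assumes one backup per day; days_since_full is how many daily
    differential or incremental backups were taken after the last full.
    """
    if strategy == "full":
        return 1  # the latest full is all you need
    if strategy == "differential":
        # last full plus only the most recent differential
        return 1 + (1 if days_since_full > 0 else 0)
    if strategy == "incremental":
        # last full plus every incremental taken since
        return 1 + days_since_full
    raise ValueError("unknown strategy: %r" % strategy)


if __name__ == "__main__":
    for strategy in ("full", "differential", "incremental"):
        print(strategy, sets_needed_for_restore(strategy, 5))
```

Five days after the last full, that is 1 set for full only, 2 for differential and 6 for incremental, which is the restore complexity the notes are weighing against backup time and space.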

Storage Architecture

Will OS, apps and data be mixed on the same storage/LUNs?
If yes – makes DR harder, backups larger, replicating LUNs takes more time due to moving all files
Suggest moving swap files to a different datastore/LUN and keeping data-only VMDKs in separate files, possibly on a separate LUN, though that increases the complexity of VM management

Sample decision tree
[Figure: sample storage design tree]

Replication Design
Will you replicate this LUN?
Non-replicated LUN will only be available if loaded from backup media.
Replicated LUNs: will they be replicated frequently?
If yes, more options for RPO, but more bandwidth needed.
Keep multiple snapshots?  If yes more storage space.

Power and facility design
Will you manage power?
If no, you have to provide power and cooling at all times.
If cold site, only need AC during operation.  Need to ensure power.
If yes, cost savings from power and cooling while systems are not in use.

DRP Testing
Can you test part of it without testing entire plan?
If no, will you test complete plan more than once a year?
If no, test once a year at minimum, generally need 3 test cycles before working properly
If yes, helps perfect plan but more expensive/time consuming.
If yes, does it interfere with production?
If yes, harder to test as interferes with business.

**Design the test plan modularly: harder to design, but easier to test**

Disaster Declaration

Should a disaster be declared?
If no, business continues normally, solve outages locally using normal procedures.
If yes, does a single person declare the disaster?
Committees may need to approve.  If partial DRP can be implemented, easier to implement for only some systems
Single person can act quicker, but committee may have more reasoned judgement.

VMware Business Continuity & Disaster Recovery (BCDR Design) Notes from Module 1

Strictly a brain dump/note taking page.

Disaster – halts business; harms physical location but also digital/data loss.  If you lose your data, you are likely to go out of business – up to 90% within 1 year.

Services lost – web, email, finances, ERP apps unavailable, causing business shutdown.  Redundant app servers do not protect against disaster in the same data center.

What is a disaster – Natural (tornado, hurricane, etc), Man made (infrastructure failure, terrorism).  Disaster depends on how your company designs its network, sets SLAs, builds redundancy.

Disaster is declared by an officer of the company or group of officers, implement DRP/BCP

3 Categories

Catastrophe:  Typically natural event, entire data center is destroyed.  Affects geographic area rendering local support services unavailable.

Disaster:  Localized to datacenter, unavailable day or longer.  65% that lose their DC for more than a week go out of business within a year.

Non-disaster:  Service disruption caused by a specific failure

Not a disaster:  Failure of hardware component, temporary service interruption such as power outage.  Small, isolated failures not built into BCP/DRP.

Disaster Recovery Plan (DRP) Objectives:
Minimize downtime:  Streamline processes.
Run book:  Guide to implement DRP, manually created, maintained by hand.
Reduce risk:  DRP can’t fail, needs to be tested.  Requires additional hardware.

Reduce cost:  Control cost, fast and simple = expensive.  Potentially double cost for duplicate data center.

Physical DR process.  Data is replicated from production to recovery site via replication over WAN or moving via backups.

Complications:  Lots of data to identify and replicate, complex recovery process, inability to test.

RPO – Recovery Point Objective – how old can data be that is recovered
RTO – Recovery Time Objective – how long until back online
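
As a quick illustration of the two objectives, the sketch below takes the timestamps from a test or a real outage and checks them against the targets. The function and key names are made up for this example; the only logic is that data loss is measured back to the last good copy and downtime is measured from the start of the outage.

```python
from datetime import datetime, timedelta


def check_objectives(last_good_copy, outage_start, service_restored, rpo, rto):
    """Compare what actually happened against the RPO and RTO targets."""
    data_loss = outage_start - last_good_copy   # age of the recovered data
    downtime = service_restored - outage_start  # time until back online
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


if __name__ == "__main__":
    result = check_objectives(
        last_good_copy=datetime(2013, 1, 10, 2, 0),
        outage_start=datetime(2013, 1, 10, 2, 45),
        service_restored=datetime(2013, 1, 10, 6, 0),
        rpo=timedelta(hours=1),
        rto=timedelta(hours=4),
    )
    print(result)
```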

Hot site – ready in minutes/hours
Warm site – ready in days
Cold site – several days to bring online

SRM supports several storage vendors.

Failover – switching operations from primary site to recovery site.
Failback – switch back from recovery site to primary site – days to months to failback.

DRP – process, policies and procedures for recovery.  DR focused on technology.  DRP are plans.  Exact step by step procedure to bring systems online, keep personnel safe, protect assets, switch systems to remote site.  Includes network config, how to verify failover.

Non virtual – install OS, apps, drivers, restore data
Virtual – install hypervisor, mount snapshots/replicated LUNs, power on VMs

If the DRP is not tested, you don’t have a DRP – testing reveals oversights.

Test that app data is protected.  There is often not enough time to keep documentation of configs etc. up to date.

Design DR in a modular fashion, test small parts before full test.  Tests should minimize disruption to production.

Once the plan passes a test, it still needs to be tested regularly.

Post disaster:  How to run at recovery site, less capacity, may be inconvenient for staff, plan for temporary housing.

Questions for company management (shut down certain apps?  when to fail back?) belong in the BCP.

DRP – plan/procedures during chaos of disaster on safeguarding assets and personnel and is procedure oriented to get systems online.

BCP – process of keeping company running, day to day ops at recovery site.  BCP for IT is different than Finance.  Each department should have a BCP.

BCP Address 3 things:
– Run ops at recovery site
– Issues
– Failback
DR/BC is not a product; no single product or group of products can give instant DR/BC.

SRM helps, but it needs other products and technology, and it needs planning.

DRP relies on storage technology, backups need to be offsite, replication has to be timely (consider which LUNs are replicated), not all files need to be replicated (don’t replicate OS, swap, temp files).  Make a library of ISOs, apps.

Good backup for small scale outages.

DRP – 11 Step Process:

– Enable management buy in – management must agree (CEO, CFO, CTO) it’s required and funded.  Management driven.  Ongoing significant expense.  Upper level management must allow staff time for testing.
– Business impact analysis –

– Identify key assets (blueprints, IP, equipment, data centers, apps, data) and business functions, map functions to assets, identify interdependencies.

– Determination of loss criteria – what if you lose asset X, or is degraded.

– Max tolerable downtime for assets – assign value to assets based on loss criteria.  If down longer than MTD significant loss of business.

– Critical: minutes; Urgent: 24 hrs; Important: 72 hrs; Normal: 7-14 days; Non-essential: 30 days

– Define RPO – Industry standard measurement, point in time to be recovered, amount of data loss.
– Define RTO – Industry standard measurement, how long can be down, includes fault detection and bringing app back online.
– Risk assessment – What are your risks – location based problems such as earthquakes, tornados.  Manmade problems – leak, auto accident.  Can’t protect against all, determine likelihood.
– Examine regulatory compliance – Some laws may require specific technology, RPO, RTO.  Check legal requirements.
– Develop DRPs – Create outlines using above info, what has to come online first including infrastructure, what order for all systems, not critical / not in DRP, does it need to be in BCP?
– Design DR systems – Select remote recovery sites, dedicated or non-dedicated, hot-warm-cold?, storage and replication tech, WAN links, communication for operations (phones, mobile).
– Create run books – specific set of procedures, detailed, step by step, app specific.  Rebuild system from OS, restore accounts, create infrastructure settings (AD, LDAP, VLAN), reload software, config software, restore data.  Site specific.  Run book for each asset, order runbook based on when systems need to be brought online.  Hard to create/maintain, capture config changes.
– Develop BCPs – What to do when the DRP is finished and you now have to operate the business.  What problems – access, fewer resources, remote access, physical problems (desks, phones, direct dial), plan for operations at remote site (backups, accounting).  Plan for failback – storage replication back.  Need failback procedures.
– Test DRPs and BCPs – You must test, problems will be found.
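
The maximum tolerable downtime tiers from the business impact analysis step can be turned into a simple lookup. This is a sketch only: the notes give "minutes" for Critical, so the 1 hour cutoff below is an assumption, and the upper end of the Normal range (14 days) is used as that tier's limit.

```python
# (upper limit in hours, tier name); the 1 hour Critical limit is an assumption
TIERS = [
    (1, "Critical"),
    (24, "Urgent"),
    (72, "Important"),
    (14 * 24, "Normal"),
]


def mtd_tier(mtd_hours):
    """Map an asset's maximum tolerable downtime (in hours) to a tier."""
    for limit_hours, name in TIERS:
        if mtd_hours <= limit_hours:
            return name
    return "Non-essential"


if __name__ == "__main__":
    for hours in (0.5, 24, 48, 200, 720):
        print(hours, mtd_tier(hours))
```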

RSA SecurID Authentication Manager Unexpected Error searching Active Directory Identity Source

For some reason I can’t get Mr. Mackey out of my head on this one – “Quotes are bad…mmmmkay.”  I recently inherited a project to get SecurID working, and it seemed pretty straightforward.  I had set up SecurID at previous companies so I was sure it was something obvious.

After reviewing the config and the documentation from RSA (which is good; it doesn’t read as a “Step-by-step to setting up AD” but it works), I opened a support ticket with RSA (non-urgent) and they got back to me within just a couple of hours.  The documentation provided by RSA for both the Authentication Manager installation and configuration and the firewall configuration was spot on.

The problem was that when the identity source was originally set up in the RSA Operations Console, “quotes” were used around the user and user group base DN fields.  What was odd was that if I entered an OU that didn’t exist I would get an error, so it was seemingly reading the fields with the quotes, but when I went to search for users in the Security Console I would get an ‘unexpected’ error.  Removing the quotes around the user and user group base DN fields fixed this problem.
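
If base DNs get copied out of a spreadsheet or document before being pasted into the console, a small pre-check like the one below can catch stray wrapping quotes. This is just a hypothetical helper for cleaning the string, it does not talk to RSA Authentication Manager in any way, and the example DN is made up.

```python
# Opening quote -> matching closing quote (straight and curly)
QUOTE_PAIRS = {'"': '"', "'": "'", "\u201c": "\u201d", "\u2018": "\u2019"}


def strip_wrapping_quotes(value):
    """Remove quotes that wrap the whole value, e.g. '"OU=Users,DC=x"'."""
    cleaned = value.strip()
    while len(cleaned) >= 2 and QUOTE_PAIRS.get(cleaned[0]) == cleaned[-1]:
        cleaned = cleaned[1:-1].strip()
    return cleaned


if __name__ == "__main__":
    pasted = '"OU=Users,DC=example,DC=com"'  # hypothetical base DN
    print(strip_wrapping_quotes(pasted))
```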

Cisco WLAN Controller not passing traffic – resolved – but not sure why

I ran into a strange problem recently: a Cisco WLAN controller 5508 with 1142N APs (not sure the model and controller matter entirely, as I found the fix on a support forum thread for a 4000 series) would allow clients to connect and get an IP address but NOT pass any traffic other than ICMP.  I thought maybe the problem was Windows firewall related, but with the firewall disabled it still appeared.  I thought maybe it was a driver problem, but I tried several revs of the driver, and it also happened with different model cards.  A temporary workaround was to disable, then re-enable the wireless card.

DHCP is handled by a Windows 2008 server, not the access points or WLAN manager, and again – the client was actually DHCPing an address (as I type that I wonder if there is a problem with the DHCP server now, but it didn’t happen to wired clients or on a temporary access point we brought in…).  There was a thought it was a DNS problem since ping worked, but I could not access network resources via IP either, which ruled DNS out.  Yet another test was to isolate the WLAN controller and APs onto a separate switch.  This also eliminated what appears to be a known problem with 2960 switches where APs cannot register with the controller (which wasn’t our problem but worth isolating anyways).  I also removed all but 1 of the APs, but the problem persisted.
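
For anyone chasing a similar "ping works, nothing else does" symptom, a quick way to take DNS out of the picture is to open a TCP connection straight to an IP address. Here is a minimal sketch using Python's standard `socket` module; the address and ports in the example are placeholders, not anything from this network.

```python
import socket


def tcp_reachable(ip, port, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, etc.
        return False


if __name__ == "__main__":
    # Placeholder address and ports: swap in a server that answers ping.
    for port in (80, 445, 3389):
        print(port, tcp_reachable("192.0.2.10", port))
```

If ICMP gets replies but every TCP port on a known-good server comes back False, the problem sits somewhere between the client and the wired network rather than with name resolution.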

Now had I listened to my own Troubleshooting 101 post, I would have opened a support ticket, but this particular company let the support lapse and did not want to renew it.  This also meant I did not have access to download the latest software for the controller or APs.  So for those wondering, that’s why there was no call into Cisco TAC on this issue.

What led me to the fix that ultimately “fixed” the problem was an error I found in the logs: “APF-1-REGISTER_IPADD_ON_MSCB_FAILED: Could not RegisterIP Add on MSCB. MSCB still in init state.”  Now I am happy this is fixed, but I am not happy with what the “fix” was yet because I haven’t found good documentation that explains why this fixed our problem.  I had to enable DHCP Addr. Assignment Required in the advanced section of the WLAN config.  According to the documentation from Cisco:

DHCP Addr. Assignment Required setting, which disallows client static IP addresses. If DHCP Addr. Assignment Required is selected, clients must obtain an IP address via DHCP. Any client with a static IP address is not be allowed on the network. The controller monitors DHCP traffic because it acts as a DHCP proxy for the clients.

That’s good and all, but my clients WERE DHCPing addresses just fine and APs were broadcasting SSIDs just fine.  Oh and by the way this was all working swell through October, for several months actually, and just started to have problems in November.  If anyone has a better description/document that more deeply defines the DHCP Addr. Assignment Required option I would love to read it.

ReadyNAS Reboot Loop

I hate SOHO technology, generally lacking in support/features.  Recently I ran into a ReadyNAS stuck in a seemingly famous reboot loop.  As a Netgear partner I may have better support than most but this is a story of the process to try and get it back online.

First they had me try an OS reinstall; according to support this is non-destructive but does put all the settings back to the factory default.  To do this, have the IT gods’ favorite tool handy: a paper clip.

  1. Power the unit off
  2. Push the paperclip into the small reset button on the back of the unit under the USB ports (at least on a ReadyNAS200) and turn the unit on.
  3. After a few seconds a Boot Menu will appear.  Use the backup button to go to OS Reinstall and push the paperclip in the reset button again to select.

This step did not fix my problem, but now we didn’t have internet access.  The next step probably should come first in Netgear’s support steps, and that is to boot into TechSupport mode.  Since we had already reset the unit, it booted with the default IP address, which probably doesn’t work on anyone’s actual network, so if you find yourself in this situation, maybe suggest tech support mode first.

Since I didn’t have a spare switch handy and, since it was a Friday, wasn’t much in the mood to start making changes to the customer’s core switch, I needed to figure out how to connect.  Plugging directly in didn’t work as expected, however there are 2 ports on the back of the ReadyNAS so I plugged my laptop in directly and I could ping!  Great, now I will hop on the web GUI and reconfigure for the network…. not so much – no HTTP.  Okay no problem I will SSH …. nope but wait – TELNET!  Now to figure out the tech support username and password.  Google to the rescue, the root password for tech support mode is ‘infr8ntdebug’ – once logged in, it appears quite linuxy (great…**sarcasm**).  At this point, it appears you need to go through Netgear tech support in order to find/access anything to fix the problem.  All the forums suggest calling tech support and it is magically fixed.  I will update this post if/when I hear back from them.

Update:  2 hour wait so far for L3 support, just posted a message on their support forums after my original phone call.

Update 2:  Now almost 3 hours, called back and being told it could be as much as 2-3 days or longer before I get a response.  

Update 3:  22 hours later, on a Saturday morning, on hold waiting for our “partner” 24×7 technical support because no one has called me back or emailed me.

Update 4:  My partner support wouldn’t help me because of how I needed them to access the device (kind of can’t blame them) but thankfully someone from the community support forum had the directions on how to fix it.  Very easy and I recorded the session so I will post notes shortly.  Essentially it’s mount the boot volume, delete the file, sync the writes, and unmount all the volumes.

Troubleshooting 101

Disclaimer – These are not all my ideas, simply a collection of ideas from people I have met over the years including some very smart folks from Softscape (former company), ICI and others.

When faced with a problem, an error message, or basically anything not working, here are some tips to get you back up quickly.

  1. Make a list.  Scott Lowe just presented on the ProfessionalVMware.com #vBrownBag (http://professionalvmware.com/?p=2931) and focused on the importance of tracking tasks so you are not bouncing back and forth and losing focus.  If nothing else, the below points might make a good list for you to walk through to ensure you are getting the right things done in a timely manner.
  2. Collect as much information as you can about the problem.  Albert Einstein once said “If I had one hour to save the world, I would spend 55 minutes defining the problem and only five minutes finding the solution” and this should be a lesson for all of us in technology.  We need to understand the problem before we can solve it.  What is the operating system and patch level, the software application and patch level, when did the problem start, and what has changed since it started are all good data points to start with.
  3. Open a support ticket with the vendor.  This was a great bit of advice I received from Brad Maltz (@bmaltz), chances are you are likely going to sit on hold, or have to wait for a support engineer to respond to your request so once you have collected as much information as you can, get the support case rolling.  While you are spending time troubleshooting you are also moving up the support queue to someone who might be able to help.  Worst case scenario, by the time they get back to you, you just close the case because you figured it out.
  4. Give the customer (remember internal IT departments, you are an IT service provider, even if it’s just for one company) a specific time you will get back to them with an update.  Be clear about what the problem is, what system or systems are affected and that the next response may be nothing more than an update that you have no update.  Set a reminder in your calendar to send the update as well; you don’t want to promise an update at 10A and then not follow through.  I have worked through issues where I sent 3 or 4 emails stating that the system was still down, that we were working towards a resolution, and that we would respond again within the next hour (or some other time interval that is appropriate for the problem).  Give yourself enough time to hopefully find an answer, but not so short that you keep stopping your work and not so long that the customer is just left wondering.
  5. What do the logs say?  If something isn’t working, in theory there should be a log file with errors in it someplace (though I have certainly run into a few system crashes that happened so quickly the only error was that Windows didn’t shut down properly).  Make sure you have a good system to collect and analyze log files.
  6. What does the knowledge base say?  Just about every technology company publishes a knowledge base or KB where you can look up error messages, codes or type in symptoms.  Chances are you are not the first person in the world to run into this error/problem.  If the company’s KB doesn’t have anything, search the GKB (aka google.com).  In addition to the KB, customer support forums are also a great place to look.
  7. Check with your team.  The 5 steps above should have taken you no more than 10-15 minutes (collect info, open a ticket, review logs, search the KB); if you haven’t yet heard back from technical support, ask your team.  Remember you are not on an island alone; even if you are a single IT person shop there are others you can lean on.  Maybe there are people in the affected department who have been at the company longer than you; if it’s a system that’s been in place for a while they may have had this happen to them before and can tell you who helped them fix it the first time.
  8. Check in the social-media-o-sphere.  If you are in technology and not taking advantage of social media and the many smart, open and helpful people out there, then I hope this blog post will have you start looking into it.  LinkedIn and Twitter are my go-to places if I am just beyond stumped.
  9. Review the original project documentation.  There are certainly (many) times this may not exist, and if you find yourself thinking that YOU don’t have any documentation for your projects maybe it’s a good time to start.  Typically a consultant/vendor would provide a scope of work, project outline or similar documentation that should state what you are doing with this platform, why you are doing it and what is likely to go wrong in the environment given the design.
  10. Panic.  Just kidding…. well kind of… no – I am kidding (Still crack up over that comment Rob).  In all seriousness though, if by now you have not found a solution it’s probably time to start reaching out to local consultants/experts for the system or systems you are having problems with.  It may also be at this point that you realize a system restore is the only viable option and it’s time to break out the DR/backup and recovery plan so you can get the systems back online.
  11. Document and share it.  Thanks to Edward Henry (@NetworkN3rd) for the tip and reminder.  Once you have solved the problem, make sure to document the information you collected, the symptoms and how you resolved it.  If it happened once, it will probably happen again.  Share with your team, share via your online resources (customer forums, Twitter, blogs etc…) as others may see a problem, or similar problem in the future.

What do you think of the steps above?  Do you use similar steps?  What tips do you have?

The definitive guide to network integrations from acquisitions

Here are some rules to help your next acquisition and network integration.

First, there is no such guide. Every acquisition, every company is different. While you may learn some useful processes during any given project, don’t assume they will work for the next acquisition. If you find yourself saying “This is just like…” stop yourself, you will be trying to put a square peg in a round hole.

Second, check your ego at the door. I don’t care how many times you have done this before; as I said previously, this one is different and you will need help. Make sure your M&A team is bringing in someone from the enterprise technology group early so you can identify members of the new IT team that you can dedicate to the integration project.

Third, have a plan for your new colleagues. Your VP or CxO should speak with all the new team members as soon as they can legally know about the acquisition. Learn about what they do today, what they are capable of and what they want to do tomorrow. Even if you can’t give them all their dream job, give them a job, a title in your group and an understanding of how they will contribute after all the migration work is done. If part of the acquisition is letting some/all of the team go, be upfront and honest with clear time lines.

Fourth, identify some easy wins early and rip the band-aid off. One such migration that I am a fan of doing (almost) immediately is moving the level 1 helpdesk. Eventually you will be supporting the new company’s associates; what better way for your helpdesk to learn than by doing and gaining insights and best practices from the existing helpdesk team? If this is not a practical step (again every company is different) maybe something like intranets or public websites work better, but pick something and move on it quickly. In the best integration I worked on, an engineer showed up the day the deal was signed with a server under his arm, we brought up the VPN and had the domain trust in place within a day – that was an easy, necessary win that got the team working together immediately.

Fifth, have a plan, move quickly but don’t forget about testing. So you worked for months doing your due diligence – you know what they have for systems, how many users, how they use those systems, so it’s time to get your migration on, right? Wrong! Talking about how someone does something will never replace real testing.

Finally, understand the integration is about 1 thing and 1 thing only – the user experience. I don’t care if you move a petabyte of data from Sydney to Boston in 15 minutes (well I do so please blog about it), all the employees care about is how you affected their ability to do their job. All the technical acts of heroism don’t mean a thing if your users are not happy. We are all IT service providers so make sure your customers are happy.

Concerns over Microsoft in the enterprise

Microsoft has been, and for the most part still is, the standard for enterprise platforms and applications. For most of the last 2 decades, they maintained a consistent user experience, added new functionality to their platforms and improved the performance of those platforms. With Windows 8 and Server 2012 I feel as though they have become too focused on beating what Apple did in the consumer space by redesigning the user experience into something we don’t realize we want yet. In the consumer space, that is great/fine – NOT in the enterprise.

In addition to the UI, I also feel like Microsoft has not focused enough on continuous, predictable improvement (i.e. NT >> 2000 >> 2003 >> 2008) and instead gone for bulk change, sometimes (as is the case with Exchange) changing architecture too drastically. I liked Exchange 2003, 2007 was not bad and for the most part I really liked 2010 (the management GUI was meh but Powershell awesome). I also liked the segmentation between roles which, now with Exchange 2013, they have done away with.

It’s flux like Exchange, drastic overhauls to HyperV (which they might have got right in 2012 but who knows if they will keep it around) and lack of commitment in consumer products (Zune, Kin, soon to be Surface) that makes me reluctant to trust my enterprise platforms to Microsoft. Even license changes that seemed to be somewhat understandable the last two or three years are being changed again.

Is this just part of a knee jerk reaction to the Surface being not good or do these changes have you wondering about Microsoft’s direction as well?

Simulate 150 Cloud User Activities Using Open Source Tools

Great post, looking forward to seeing what Eucalyptus does, and also some handy load test tools (I have only used WebLoad in the past).

Kyo Lee

For the 3.2 release in this December, Eucalyptus is coming out with an intuitive, easy-to-use cloud user console, which aims to support the on-premise dev/test cloud adoption among IT organizations and enterprises.


This easy-to-use Eucalyptus User Console consists of two main components: a browser-side javascript application, written in JQuery, and a proxy server that utilizes Python Boto to relay requests to Eucalyptus Cloud, which is written in Python Tornado, an open source version of the scalable, non-blocking web server developed by Facebook.

The target scale for the initial version of the user console is set to handle 150 simultaneous user activities under a single user console proxy.

Now, the challenge is how to simulate these 150 users to ensure that the user consoles and the proxy are able to withstand the workload of 150 active cloud users; more importantly how to ensure that such workload is not jeopardizing the…
