With all the latest trends on the web such as “Social Sharing”, “distributed marketing campaigns” or just plain website tracking, people tend to forget about one tiny, simple fact:
If your website is your sales channel or even your only product, you must have the ambition to keep it running and fully functional 24/7, with 99.99% uptime per year. Everything below that is a significant loss of revenue. My employer's shop (www.otto.de) takes 2 orders per second on average. If it had 99.9% (instead of 99.99%) uptime per year, well, calculate yourself. A year has 31,536,000 seconds. Talking about an extra 0.09% of downtime a year thus means roughly 7.9 hours or 28,382.4 seconds of unavailability. Sounds fair. But 28,382.4 seconds of downtime at 2 orders per second, this would mean a loss of 56,764 orders!
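If you want to redo that back-of-the-envelope math with your own numbers (the 2 orders per second is just the average mentioned above), here it is as a tiny snippet:

<script>
  // Extra downtime between 99.99% and 99.9% uptime, and what it costs at a given order rate.
  var secondsPerYear  = 365 * 24 * 60 * 60;      // 31,536,000
  var extraDowntime   = secondsPerYear * 0.0009; // 0.09% -> 28,382.4 seconds (~7.9 hrs)
  var ordersPerSecond = 2;                       // plug in your own rate here
  var lostOrders      = extraDowntime * ordersPerSecond;
  console.log(Math.round(extraDowntime) + ' seconds of downtime, ~' + Math.round(lostOrders) + ' orders lost');
</script>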
Now we get closer to the point I want to make. Just 0.09% of our website being “not fully available” can mean a loss of thousands of orders and thus a lot of revenue. And that is only the loss you are aware of! Because red lights are popping up in your data centre and your ops are running around like mad chickens trying to get stuff back online again.
That is the really interesting part.
Downtimes caused by your own infrastructure are easy to spot and to measure. Your VP SiteOps may already generate a precise report for the C-level guys saying “we were up and running this month at 99.97%”.
Fine.
But after everyone has congratulated themselves for being so available, the marketing department requires a multi-channel tracking JavaScript from company XYZ, the CorpCom guys want some fancy new G+ share buttons and your business intelligence department requires a new tracking lib served from the Foo CDN. So you just stabilized your own infrastructure and are proud of the 99.97%, but introduced several new points of failure: third party content.
Let me give you three rules you need to be totally aware of:
1) Third party content is not within your control.
2) In general, servers will fail. Every server fails. There is no 100% uptime. And uptime at 98% CPU load is also uptime.
3) Thus third party servers will fail, and if you haven't done your homework, you will fail too. No matter how fancy your infrastructure failovers are. And then think of 1).
So let's do our homework and understand what will fail!
Let's assume we have two different types of third party code: JavaScript and CSS. We leave out backend stuff here, because it usually comes with good test coverage and failovers. If, for example, you want to use some marketing tracking stuff on your website, the “marketing tracking provider” usually asks you to put their <script> block right below the opening <html> tag in order to work properly.
Now we mix some ingredients together:
A very simple, abstracted code example would be:
<html>
  <head>
    <script src="http://www.thirdparty.com/tracking.js"></script>
  </head>
  <body>
    Your website's content
    <script> affiliate.trackAndGenerateMoney(); </script>
  </body>
</html>
Now, as mentioned above, let's imagine thirdparty.com or any of the magic between the client and the server of thirdparty.com is broken and the HTTP request does not succeed. This leads to the first script block loading for something like 30-60 seconds (the default browser timeout). Until the browser aborts the page load, the user gets to see a blank page with a loading indicator.
Repeat: just because you embedded third party JavaScript and THEY are down, YOUR users sit in front of a blank page.
Usually users wait an average of 10 seconds before they leave; the users that get to see such a failure simply abort.
The situation is not much different when including third party CSS:
<html>
<head>
<link rel="stylesheet" href="http://www.thirdparty.com/some_widget_magic.css">
</head>
<body>
Your website's content
<script> affiliate.trackAndGenerateMoney();</script>
</body>
</html>
In the common browsers, when loading a stylesheet from an unavailable server, the browser won't even start rendering the page at all (see e.g. http://www.phpied.com/rendering-styles/) until some browser timeout triggers (commonly 30 seconds). This gives the user the bad experience of a white screen. Again, you won't even notice, as your tracking relies on dom:ready and thus won't fire. An interesting question would be: what happens if a third party webfont is referenced from your own stylesheet? But that would be too much here.
Here is a tiny video I made of the very popular website www.smashingmagazine.com. It gives you a good visualization of the effect of a broken third party server.
[wpvideo 8v1h94L9]
On the left-hand side you see the page with everything working (your website and the third party webservers), on the right-hand side the situation where two third party servers (an affiliate partner and Twitter) are down and don't respond.
Got it?!
Not so nice, right?!
So how can we be safe with regards to SPOFs (Single Points of Failure) and third party fails?
1) Choose your third party providers wisely! Ask them whether their script snippet, CSS include or webfont loads *async* (a minimal sketch of what that means follows below this list). If the reply is something like “uhm, what?” or “well, this is not possible”, choose another partner. Seriously.
2) Think twice about embedding such code into your platform. Do you really need it? Can you provide the feature yourself? Could you at least host the assets in your own infrastructure?
3) Install a browser plugin such as SPOF-O-MATIC (https://chrome.google.com/webstore/detail/spof-o-matic/plikhggfbplemddobondkeogomgoodeg). You can easily see if your page has the potential to fail. And it is fun to browse around the web and see how blind website owners are. Even companies where the website is the only revenue channel.
4) Browse your website's code (locally, in your dev environment) for external references such as the above. Replace any occurrence of a third party reference with http://blackhole.webpagetest.org. That host never responds, so every request hangs for about 30 seconds and magically simulates a third party downtime.
5) Pro and advanced tip: change your /etc/hosts file and redirect requests to facebook, googleplus, twitter and urlofyour3rdpartyprovider.com to blackhole.webpagetest.org (see the hosts file sketch below this list). Honestly, while at work you shouldn't browse FB and G+ anyway, so why not work all day while simulating they are down?! You will be astonished how many websites appear to be broken or even down while we only simulate that FB and G+ are down.
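To illustrate #1: this is roughly what an async embed of the dummy tracking script from the example above could look like. It is a minimal sketch of the common dynamic-injection pattern, not your provider's real snippet:

<script>
  // Sketch of an async embed: the third party script is injected dynamically,
  // so a dead thirdparty.com can no longer block the rendering of your page.
  (function () {
    var s = document.createElement('script');
    s.src = 'http://www.thirdparty.com/tracking.js';
    s.async = true;
    var first = document.getElementsByTagName('script')[0];
    first.parentNode.insertBefore(s, first);
  })();
</script>

If thirdparty.com is down, the page still renders and only the tracking calls are lost, instead of the blank page shown above.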
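And for #5, keep in mind that /etc/hosts maps hostnames to IP addresses, not URLs, so you point the third party hostnames at the IP behind blackhole.webpagetest.org. The hostnames below are just examples, and you should resolve the blackhole IP yourself (e.g. with ping) before copying anything:

# /etc/hosts -- send third party hosts into a black hole to simulate their downtime.
# 72.66.115.13 was the IP of blackhole.webpagetest.org at the time of writing; double-check it.
72.66.115.13   www.facebook.com
72.66.115.13   connect.facebook.net
72.66.115.13   apis.google.com
72.66.115.13   platform.twitter.com
72.66.115.13   urlofyour3rdpartyprovider.com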
My personal favs are #1 and #5.
1) I want to work with awesome people. And if someone gives me code that could crash my site, they are not trustworthy.
5) It is ever so great to see the impact of a simulated SPOF, and if you browse your product/website frequently during the day, you will immediately spot SPOFs before they take down your site.
In the end, don't trust anyone but your own devs, ops and devops, and talk with your third party vendors about SPOFs.