
A few months back I was asked to give an account of my personal experience with cloud. I realised that I could trace this back to 2001 with a limited stretch of the imagination, so could spend five minutes on a potted history before getting into the stuff that cloud is enabling us to do now with scalable continuous integration, testing and fast-cycle deployments. Having thrown together a few slides I realised that it probably warranted one more, maybe final, blog entry on the subject of cloud. I have been involved in the deployment of applications on Microsoft Azure, Google Cloud and AWS in production environments, so have sufficient material (and experiential pain) to fill a couple of pages. Here I will discuss the lead-up to, and motivators for, the initial cloud adoptions I’ve been involved with, how we use cloud today, the various routes to cloud adoption, and some general heuristics for guidance and the avoidance of pitfalls. If you don’t care for history then skip forward to “The Marketplace at the end of 2019”.
Ancient History – how we ended up at cloud
Having checked dates with my co-conspirator from those days (http://linkedin.com/in/robert-kenny-griffiths) I present here a very brief summary of the pre-cloud part:
Day 0. The pre-grid days – *so* last century
Working with exotics, back when they were a tradeable asset rather than a one-way trip to pariah status in the banking fraternity, we generated risk reports on the trading desk. Lots of risk reports. It took a great effort to partition the tasks into chunks to run on separate multi-processor servers. Each chunk had to be tailored for the number of cores (which back then equated to processors) available on the destination server. This proved to be… tiresome.
Day 1. MatGrid – 2001
MatGrid was named after its author, who possessed sufficient smarts to write our first grid scheduler application, which took the pain of manual task allocation away. However, Mat didn’t see his career as The Grid Scheduler Guy, so was not overly keen on developing and maintaining MatGrid for the remainder of his career in financial services. Fortuitously, the market was starting to see a groundswell of support for scalable grid scheduling as a genuine long-term commercial product.
Day 2. Sun Grid Engine – 2003/2004
Sun had been beavering away producing their own grid scheduler (SGE). Since our grid was a burgeoning collection of Sun Solaris machines this seemed like a good fit and also enabled Mat to divest himself of the tedious obligation to monitor a batch.
Day 3. N1 grid – 2005
SGE was such a success that in true Sun style they decided to rename it “N1” and give it away for free. (At this point it was N1GE6, for the grid-nerds amongst you.) However, there were two problems. Firstly, Sun were on the road that would end with their being swallowed up by Oracle; secondly, Sun pizza boxes were becoming relatively expensive where bang-per-buck was concerned. Increasingly, localised failures could be tolerated because grid software enabled us to route around and isolate failed nodes. Hardware was becoming wholly commoditised, so why did I need to pay so much for it?
Day 4. Datasynapse GridServer – 2006/2007
Switching to Windows was the simple solution, and at this point there were two big players in the market: Datasynapse with their GridServer product, and Platform with Symphony. Naturally we POC’d both products, but we went with GridServer on the basis that Datasynapse had 80% of the rapidly growing financial services market and, most crucially, had their main office around the corner from ours, so I could immediately pay them a personal visit if anything bad happened. This proved to be a successful arrangement which led to us scaling up, and also out across our entire trade floor, where we scavenged all the desktops for a period of time with no complaints from users. Note that a session was immediately terminated if a user touched their mouse or keyboard, or the user’s CPU usage hit 30%, so other than a small new blinking icon in the system tray (spotted by exactly one trader) there was no change to the environment.
It’s worth pausing at this point to consider quite how much cash was being wasted in financial services at this time. I recall a conversation from a conference that went something like this:
Grid manager of major global bank brags: “We have more compute than Microsoft!”
Me: “Running at 15%-25% utilisation, and in divisional silos…”
(That may have been during a conference panel session, and I may have upset him. :-/ )
This led me to deduce the Big Corporation IT Truism rendered in quasi-maths (so this almost looks scientific):
As (PS × SC)^WM → ∞, U → 0.
Where WM and SC are often strongly correlated.
In English: “as (Poor Scheduling × Siloed Compute) raised to the power of Weak Management tends to infinity, Utilisation levels tend to zero”.
However, at this point we were running at between 65% and 85% utilisation over a six-day week, depending on how much float kit we had. Sunday was for maintenance and sleep rather than religious observance. Consequently we didn’t really have much unused capacity to fall back on or optimise into.
Day 5. Microsoft HPC – 2010/2011
Microsoft became aware of the growth of grid and launched their HPC Pack extension to Windows. I noted that if we could employ it we would get two clear benefits. Firstly, I could stop paying Datasynapse fees, since HPC Server came bundled with the Windows licence. Secondly, the opportunity to cloudburst to Microsoft’s cloud platform, then known as Windows Azure, was too good to turn down.

As background, we were attempting to plan annual budgets while almost blindfolded with respect to the growth potential over the next 12 months. To some extent we could predict trade population growth and factor in improvements to risk models that would ameliorate the pain, but there were always the anomalies, also known as Risk Manager’s Whims, often driven by regulatory change or some new risk metric (think latterly CVA, KVA, FVA, etc., but there have always been ‘new’ risks to measure). For a (totally apocryphal – honest!) example: a new risk report is demanded and a risk manager decides that 500K simulations are needed. So we size the hardware requirement, factor in additional switches, cabinets, power, etc., order the kit, take delivery, build it, deploy the software, test it, move it to production, and within the space of three short(!) months our grid is ready and the production risk is run. After a week or so the Risk Manager decides that actually 100K simulations are fine because the model converges across the portfolio better than expected. So we now have five times the kit we needed, with its capex amortising away over the next three or four years. Like I said, *apocryphal*. Naturally kit always gets used for something, but it’s an uncomfortable purchase to justify in retrospect. How much better it would have been to spend a couple of weeks of opex to sanity-check the requirement before committing all that cash to new kit. Here’s a thought: it might even be preferable to charge all of our compute to opex and let someone else run the datacenter – fundamentally we were a bank, not a datacenter provider!
It all seemed perfect, but wasn’t. It transpired that HPC Server was somewhat less functionally rich than what we’d become used to. Rather than give up and stay with GridServer we persisted. This was largely because Microsoft were so keen to support our efforts that they gave us direct access to their cloud CTO and their Shanghai and Seattle teams. They surmised that we would play nicely and help to develop their cloud offering, then known as “Windows Azure”, which had launched in February 2010. Our big idea was that if we ran HPC Server both on-premise and in-cloud we could cloudburst to Azure when our workload became spiky. This was ambitious. We worked with Microsoft through the early period looking at everything from job scheduling and prioritisation to the economic modelling of the resources. For example: you don’t want a grid utilisation profile that looks like camel humps due to daft scheduling, and economically we wanted to start paying for resources when we received them in a fully-usable state, not at the point we requested them (which in those days could have been 25 minutes prior). We also enlisted the help of The Great Dane (https://uk.linkedin.com/in/daniel-schiermer-16a73a7), who did much of the techy heavy lifting.
Ultimately Datasynapse were swallowed up by TIBCO and Platform were acquired by IBM, while we prepared to switch over to Microsoft.
Day 6. Cloud – 31st December 2012
The 31st of December may seem an odd day to go live with a new environment, but if you have already frozen the market data for the year, computed your end-of-year risk run, and the majority of your users are working towards a self-induced coma for the next couple of days, it’s about the lowest risk you’ll see. Our standard batch ran on-premise as usual, but the exotics VaR run ran entirely on Azure. Correctly. From that point on we scaled up to over 10K cores on Azure on a daily basis, replacing on-premise kit with cloud resourcing as the on-premise kit rolled off its capex cycle or died. While we started with Reserved Instances, eventually Azure capacity grew to the extent that we didn’t need them; there was always sufficient compute available on tap from the general supply.
Day 7. In stark contrast to Chapter 2 of the Book of Genesis, there was no time to rest. Compute elasticity was only the start of the benefits we sought to leverage from cloud.
Over the next few years I also used Google Cloud and AWS. Google was notably used for its Bigtable facility as part of a MiFID2 solution. This project was always going to be ‘difficult’, since MiFID2 proved to be an experience not dissimilar to putting your hand onto a preheated hob then finding that everyone else had their hand on top of yours. Google’s insistence on using a reseller to contract with a global megabank did not endear them, nor give the impression that they were business-focussed. In my current CTO role we use AWS extensively for almost all new development.
The Marketplace at the end of 2019
While there are lots of cloud wannabes, there are actually only four major players at the time of writing: AWS, Microsoft, Google and Alibaba.
In reverse order, Alibaba are huge.
In China.
Everywhere else they have suffered from what I’ll call the “Huawei Effect” and leave it at that. As a result their non-China growth has been somewhat stunted of late.
Google make all the right noises but always look uncomfortable in a (metaphorical) formal business suit. Note my comments earlier with respect to only contracting via resellers. They’re great to play around with but “corporate penetration” is always going to sound rather filthy to a dyed-in-the-wool, old-school Googler. Notwithstanding this standpoint I understand they are making moves to address the challenge, and the next year or so should see whether they actually mean Business.
Microsoft were a little late to the party but have been desperately trying to make up ground ever since. Under the stewardship of Steve “Monkey Dance” Ballmer they never seemed fully committed to technological innovation, but since Satya Nadella took over they’ve looked convincing. Adding cloud services was always a fairly straightforward extension to an existing Enterprise Agreement (EA), making the contractual onboarding painless, since pretty much everyone already uses Microsoft products to some degree and so has a Master Services Agreement of some description. If you run a Microsoft stack then moving to Azure is a straightforward lift. However, it’s worth noting that they are seeing significant growth in their Linux deployments, so have clearly accepted that Linux is way too big to ignore; and that’s notwithstanding the persistent chatter regarding Microsoft’s next OS having a Linux core.
AWS were first to market and made all the early running. They’re a little more expensive but are keeping in front by launching a new product every few minutes, or so it seems.
AWS and Microsoft have always had a business-centric view. With Microsoft, large firms added cloud services to their EA. AWS is really easy to get started with and was business-focussed from the start. Furthermore, they’ve invested in creating a cohort of developers with AWS cloud skills; their online training programmes are particularly clear, extensive and largely free. I understand that Google have been hiring more “business-folk” of late, but they have a way to go.
Each of AWS, Microsoft and Google host meet-ups and conferences so there are plenty of opportunities to pick up the requisite skill sets to leverage their platforms.
Private cloud
Private cloud was invented by people unable to understand security requirements, who failed to grasp the fact that their private cloud is probably less secure than it would be if run properly on a public platform. One argument is that it constitutes the brainchild of a server-hugging Head of IT afraid to let go of the tin. Or it’s the result of historical siloed server purchasing by divisions of a corporate who have been forced to play nicely and share; the resultant massive over-capacity is resold as “private cloud”. If there hadn’t been such profligate spend in the first place it might never have been necessary.
I will grudgingly grant that there are occasions where private cloud is appropriate. If your application is tailored for mass FPGA deployment or runs on a particularly niche architecture not supported by the public cloud vendors, then there is an argument in favour of a large-scale, shared, in-house compute facility. The justification should not, however, revolve around security. Unless you need something physically air-gapped, it’s a weak argument. Short of inter-continental ballistic missile command consoles it’s hard to come up with something that is better off segregated (and in the ICBM case they are only air-gapped because they run on kit so old it still uses 8” floppy disks).
Developing for public cloud in 2019
Our firm’s business mandate is to offer services to globally distributed clients that we previously offered only to co-located staff.
We run an AWS-hosted service that interacts with a React.js browser-based client. The client GUI acts as a portal for financial services firms (such as asset managers, hedge funds and family offices) to access our services. Through the GUI we offer up a variety of functions including trade creation, pricing, and portfolio risk management. We could have supported this by building out our own infrastructure, but that route brings a number of inhibitors to growth, scale and elasticity. For example, it would all be capex, and we would need to factor in a three-month delivery lag, maintenance, and so on. Additionally, it is monotonically scalable: we can’t send physical servers back if we no longer need them; they’re on our books right through the depreciation cycle. Furthermore, there would be significant steps in cost each time we needed to provision an additional set of switches, a cage or a datacenter. We would also need to manage our own failovers and carry appropriate over-provisioning to handle replication and spikes in demand.
As it stands, like almost every other firm, we operate a hybrid on/off-premise technology stack. There are relatively few cases where it is preferable to keep technology on-premise, and a proliferation of datacenters is costly (we have reduced from 7 to 2 over the past couple of years). However, some of our infrastructure remains on-premise, i.e. in rented sheds either side of London. Our experience is not unusual, as the plummeting cost of shed-space around London attests. Our on-premise kit is primarily for VDI hosting, plus some third-party applications. Typically these apps are the stickiest to move to cloud because you cannot do the lift autonomously: it only becomes tractable when the application vendor is prepared to port their application and figure out the charging structure. Even then the commercial arrangements may be far less attractive, because the vendor may be looking to claw back some of the migration costs rather than seeing the cloud port as essential to the long-term viability of their product. Worse still, they may perceive the cloud version as an opportunity to dress the product up as a new version at a premium price, notwithstanding the fact that the functionality uplift is minimal. The alternatives can be unpalatable: in-house development of an adequate replacement is likely to fall outside the commercial bounds and timelines appropriate for the business model, while implementing a competing product that is already available in-cloud is simpler but requires a POC, additional legal agreements, commercials, and potentially complex workflow re-engineering and integration work. Regulation, at least, is rarely the blocker: there are few jurisdictions where regulators are restrictive about where transaction data pertaining to their financial system is hosted or computed, provided reporting is both timely and complete.
Our solution uses containerised applications running on EC2 instances, with microservices (Lambdas) where appropriate. Storage is largely handled by S3 and RDS (Postgres) on separate subnets. Our portal is served up via Elastic Beanstalk. We use Terraform for configuration management, enabling us to operate a config-as-code model. Finally, we asked AWS to come in and conduct an extensive tyre-kicking exercise against their Well-Architected Framework to confirm our architecture measured up. (It did.)
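To make the config-as-code point concrete: we use Terraform (HCL) in practice, but the same declare-your-infrastructure-in-code idea can be sketched in Python with AWS’s CDK. This is a minimal, purely illustrative example; the stack and bucket names are invented:

```python
# pip install aws-cdk-lib constructs
# A toy config-as-code stack: one versioned, non-public S3 bucket.
from aws_cdk import App, Stack, aws_s3 as s3
from constructs import Construct

class PortalStorage(Stack):
    def __init__(self, scope: Construct, id_: str, **kwargs) -> None:
        super().__init__(scope, id_, **kwargs)
        # Declarative resource definition: versioned and locked down by default.
        s3.Bucket(
            self,
            "ResultsBucket",  # hypothetical logical name
            versioned=True,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
        )

app = App()
PortalStorage(app, "portal-storage")  # deploy with `cdk deploy portal-storage`
app.synth()
```

The payoff, whichever tool you pick, is the same: the environment definition lives in version control alongside the application, so it can be reviewed, diffed and replayed.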
This has enabled us to collapse our release cycle from every two weeks, as per the standard Scrum cycle, to every couple of days. Sometimes we’ll release twice in a day if it feels like the right thing to do. All commits to trunk initiate a full regression. This can run on a dedicated test platform, or on a test environment spun up purely for the unit/integration/regression test suite to exploit, then torn down straight afterwards (a sketch of that pattern follows below). Using cloud in this way increases release velocity. Ten years ago it was necessary to wait your turn to line up all the test environments to conduct an end-to-end environmental test; now they can be spun up in minutes.
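As a rough illustration of that spin-up/run/tear-down pattern, here is a minimal Python sketch using boto3 and CloudFormation. Our real pipeline is Terraform-based; the stack name, template path and pytest suite here are all hypothetical:

```python
# Spin up a throwaway test environment, run the regression suite, tear it down.
import subprocess
import boto3

cfn = boto3.client("cloudformation")
STACK = "regression-test-env"            # hypothetical stack name

with open("test_env.yaml") as f:         # hypothetical CloudFormation template
    cfn.create_stack(
        StackName=STACK,
        TemplateBody=f.read(),
        Capabilities=["CAPABILITY_IAM"],
    )
cfn.get_waiter("stack_create_complete").wait(StackName=STACK)

try:
    # Run the unit/integration/regression suite against the fresh environment.
    subprocess.run(["pytest", "tests/"], check=True)
finally:
    # Tear the environment down immediately so we stop paying for it.
    cfn.delete_stack(StackName=STACK)
    cfn.get_waiter("stack_delete_complete").wait(StackName=STACK)
```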
Cloud migration options
Stephen Orban (Head of Enterprise Strategy at AWS) has mapped out six strategies for migrating applications to the cloud that are as good a taxonomy as any for determining what kind of process you need to follow:
- Rehosting – “Lift and Shift”: If an application is containerised or has very few external dependencies then it may simply run as-is in cloud.
- Replatforming – “Lift-Tinker-Shift”: For example, you might shift to a database-as-a-service RDBMS but keep the guts of the application the same.
- Repurchasing – “Buy a different (cloud-based) product”: In the scenario we discussed earlier, where application vendors are slow or reluctant to move their product to cloud, it can be simpler to abandon the incumbent system. A case in point might be ditching an on-premise-hosted CRM for Salesforce.
- Refactoring/re-architecting – “The nuclear option”: If it won’t go as-is then rewrite the application. For some legacy applications this may be the only option, but it comes with a price tag and a potentially lengthy development timeline attached.
- Retire – “Actually, you can live without the product”: During Windows migrations (NT to 7, 7 to 10, etc.) we typically came across a plethora of applications that needed testing and, in some cases, effort to migrate. In a significant number of cases the users were surprisingly relaxed at the prospect of losing the product, particularly if we were able to associate a monetary cost with making the application available on the new OS, or could show the impact on other deliveries. It is worth periodically assessing the stack of applications you support to avoid cruft creep.
- Retain – “Kicking the can down the road”: Sometimes it can be worth waiting, focusing instead on managing data volumes and egress costs, until an application vendor is ready to migrate or a technology is sufficiently mature that the migration is markedly more straightforward.
Guidance and pitfall avoidance: how to spot ladders and avoid snakes in cloud adoption
Consider Gartner’s Hype Cycle. This shows an array of technologies moving through a process with notable phases:
- Innovation Trigger
- Peak of Inflated Expectations
- Trough of Disillusionment
- Slope of Enlightenment
- Plateau of Productivity
Essentially, someone has an idea; it gains traction with a couple of other people; many, many other people decide they’re experts and should sell their services to you at a grossly inflated price; you realise they’re picking it up as they go along on your dime, so you lay them off. Finally someone finds a use case for which the idea is a good fit, and the idea becomes useful.

Cloud never went into the Trough of Disillusionment. There are a number of reasons for this, primarily that viable use cases existed before cloud did. Secondly, many of those drivers had clear economic benefits, which ensured that early adopters were pushing against an open door when pitching for funding. Finally, in order to accelerate adoption the main vendors have kept a lid on charging; in fact cloud tends to get steadily cheaper. This model isn’t new – consider Uber, who price their service below that of the incumbent alternatives. Unlike Uber, however, the cloud firms are highly profitable. An example of cost reduction is AWS’ Fargate product, which offers container orchestration without the need to provide your own EC2 instances; essentially it spins them up on-the-fly as needed. In discussion with AWS I noted that we hadn’t adopted it due to cost – it was simply too expensive. Their response was that they had reduced the price by 50% a month or so earlier.

While not a zero-sum game, there are losers from widespread cloud adoption. “Shed sales” – aka the provisioning of datacenter space around London – have seen a significant cut in premium, maybe 50% compared with a few years ago. Global economics may be playing a part, but there has to be at least some inverse correlation with cloud adoption.
Here’s some advice worth considering:
- Spot instances: as long as you don’t go for the latest and greatest kit, you generally never get bumped, and you will see significant cost savings (see the spot-instance sketch after this list).
- Linux instances are substantially cheaper than Windows ones.
- Keeping microservices (aka Lambdas in AWS) hot so they’re always available to accept work immediately means you’re paying for compute you’re not always using. That’s not how they should be used.
- Minimise data egress. Data ingress is free, egress incurs a charge. So, do all your calculation and analysis in-cloud. Only pull back your result set, if you have to pull anything back.
- Shuffle data between AZs using the provider’s infrastructure; their pipes are almost certainly fatter than yours. Note that there are quirks in bandwidth between different cloud providers – some are faster than others between particular countries – but generally they’re all quick.
- Avoid anyone who offers to “Facilitate your journey into the cloud”. The marketplace is less awash with charlatans than it was two or three years ago but they still exist and are simply trying to sell you consultancy they’re not fit to deliver. Always find case studies and any other clients they’ve worked for who are prepared to vouch for them, even if it’s only over a coffee. To be fair the cloud consultancy business is relatively mature now and most of the aforementioned charlatans quickly ran off to rebrand themselves as Blockchain Consultants and are currently plying their meagre wares as Digital Transformation Consultants.
- You won’t always save money by deploying to cloud. But you will get the flexibility to spin up and tear down environments at will, so testing becomes much simpler. We’ve moved from fortnightly, to weekly, to an every-couple-of-days deployment schedule using a variant of Kanban. We have full testing each time a developer commits code or merges to trunk.
- Run a dedicated line to the cloud endpoint. If you are leveraging the in-cloud bus it would be foolish to rely on the public internet at the edge.
- Containerisation for cloud agnosticism is a double-edged sword. It gives you portability between cloud vendors, but you lose access to the added benefits: all the cloud-specific tooling, such as machine-learning services and proprietary implementations. Aurora is reputedly AWS’ fastest-growing product at the moment (they used to use Oracle but found it too weak, so built their own RDBMS – because they can), so the ability to leverage proprietary services may outweigh the benefits of cloud agnosticism.
- Exploit config-as-code to ensure your environments can be replicated quickly. This gives your teams the opportunity to build, test and deploy rapidly.
- Hire people that understand why “dev”, “sec” and “ops” should be juxtaposed.
- Remove people that push back without a verifiable and justifiable cause. Some folks won’t like losing their empire/hardware estate, or will forever want to touch the tin. They will keep looking for ways to subvert your plans to preserve their position.
- Encourage all your development and support teams to undertake online training on cloud platforms. There’s plenty of it, and it’s mostly free.
- Disabuse your senior management team should they posit the notion that because all the infrastructure will be in cloud you no longer need an infrastructure technology team. You will still have databases to configure and network topologies to secure – but the skillsets will be subtly evolved from the historical, hardware-oriented kind. Encourage your teams to retrain, it’s evolution, same as it ever was.
- Your keys are immensely important. Keep them very secure (see the credentials sketch after this list).
- Stop challenging staff when they ask to buy cloud resourcing; instead, make them fill in forms and write justifications for hardware purchases (if you don’t already). Essentially, do not replicate the internal bureaucracy associated with capex purchases where cloud resources are concerned – they are opex and the charges accrue gradually. Cloud is not always the right answer, but be clear that you require solid justification for any material expenditure. For example, prototyping and starting up a system on cloud before switching to on-premise compute may be appropriate if the target product stack needs to be sized before committing to a substantial capex outlay.
- Impose similarly rigorous challenges on projects that cannot easily be containerised or are tied to specific environments. There is a risk of accruing technical debt that you will eventually have to remediate.
- Encourage your existing Infrastructure team to re-skill and continue to be the compute resource gatekeepers. Coupled with this is good auditing and logging. Ensure that someone is responsible for cleaning up resources and handing them back when they are no longer required. Leaving resources enabled when they are no longer needed means you are throwing away one of the key benefits of cloud and wasting cash. Having someone policing this and attributing resources to individuals or teams will help.
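On the spot-instance advice above: with AWS, requesting spot capacity is little more than adding market options to a normal instance launch. A minimal boto3 sketch, where the AMI ID and price cap are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# Launch a single spot instance by adding market options to a normal launch.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI ID
    InstanceType="m5.large",           # deliberately not the latest kit
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "MaxPrice": "0.05",        # placeholder cap on the hourly USD price
            "SpotInstanceType": "one-time",
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```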
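And on keeping your keys secure: the simplest discipline is never to embed credentials in source at all. A minimal sketch; boto3 resolves credentials from the ambient environment (an IAM role, AWS_* environment variables, or the local credentials file), so nothing secret needs to live in the repo:

```python
import boto3

# No access keys in code: the session picks up credentials from the
# instance's IAM role, environment variables, or ~/.aws/credentials.
session = boto3.Session()
s3 = session.client("s3")

# Sanity check that the ambient credentials work.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
```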
Reasons why your firm can’t do cloud
There are still firms out there that consider their data too sensitive for cloud, and/or the systemic risk of being wholly dependent on a cloud provider too big a risk to take. Notwithstanding the fact that the US DoD, the Fed, the CIA and most Investment Banks (IBs) are on record as saying that they are using cloud and have plans to adopt it further, there remain naysayers insistent that while cloud is a great idea, it just won’t work for them. We are interviewed once or twice a fortnight by different fund managers, or by auditors on their behalf, conducting audits and completing their periodic Due Diligence Questionnaires (DDQs). The subject of cloud adoption does come up in the context of security. Fortunately the cloud providers make this easy for us. Our links to AWS are via dedicated lines with on-wire encryption. Within the cloud infrastructure AWS’ own accreditations apply. They are extensive and available to view online; pretty much every accreditation imaginable has been addressed, and the proof is transparently available. I challenge any IB datacenter manager to prove that their own infrastructure is as comprehensively secured. The point is that an IB is concerned with financial services; everything else is a service, ancillary to the main thrust of their business. IBs are facing increasing scrutiny of costs, and datacenters and their security are substantial sums on the wrong side of the balance sheet.
It is worth drawing attention to the FCA’s guidance on the matter (FG 16/5, last updated September 2019). The opening section states that “we want to support innovation and ensure that regulation unlocks these benefits, rather than blocks them” and goes on to state that “we see no fundamental reason why cloud services (including public cloud services) cannot be implemented, with appropriate consideration, in a manner that complies with our rules”.
Let us be clear: a cloud vendor is still a technology supplier and should be subject to all the same rigorous challenges you apply to any other vendor as part of your third-party due diligence process. However, all financial services businesses are already subject to systemic risks. Clearing houses are centralised single points of failure. Your trade lifecycle management system is key to your firm’s capability to manage its business; do not think that a simple escrow copy will save you if you can’t immediately hire a team of knowledgeable support staff and developers to maintain it. Without your settlement system you will also fail to conduct your business. The list goes on. All aspects of systems risk should be assessed and mitigated, but at some level your firm will risk-accept something as being too unlikely to warrant further mitigation. In reality, cloud adoption is likely to ameliorate your risk rather than increase it, since data security is the cloud suppliers’ number one focus.
Fundamentally in larger firms the main challenge to cloud adoption is likely to be monolithic incumbent systems that are too big to shift to cloud in their current form. Several IBs are grappling with this challenge following years of patching and making do. But that’s not a cloud problem, that’s an architectural monolith legacy issue, which is a whole different blog topic.
So cloud is old news. But don’t take my word for it; instead take the FCA’s comment from FG 16/5:
“Using third-party providers, including cloud providers, may bring benefits to firms such as cost efficiencies, increased security, and more flexible infrastructure capacity. These benefits can support more effective competition.”