Enterprise Cloud Security Best Practice Architecture

Enterprise organisations typically already have a hardened perimeter: high-bandwidth firewalls on high-speed internet uplinks, solid ingress and egress security policies with full application-level inspection and deep packet analysis, secure web gateways, and ITIL-aligned change processes. The last thing you should do is open that business up to public clouds, which can create internet outbound/inbound links at the click of a button and expose your internal company network.

The best ways to mitigate the risk of a public cloud data exposure:

  1. Route all administrative traffic for any and all SaaS and cloud environments through an enterprise-grade firewall.
  2. Block all public cloud internet access and any new network connections by default.
  3. Monitor any changes to these configurations.
  4. Minimise the use of highly privileged accounts, and permit them only via strict change control and key management.
  5. Set up a Direct Connect (or equivalent private link) into your internal corporate firewall and direct all internet traffic through your existing strict firewall policies and monitoring.
  6. Only allow access to the cloud via your corporate private IP subnet VPN.
  7. Enable IAM and MFA for cloud access, federated from corporate AD (e.g. AD Connect).
  8. Build a policy to eliminate shadow IT.
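As a sketch of how points 2 and 3 might be enforced on AWS, the following builds a Service Control Policy document that denies the creation of new internet-facing network paths. The action list is illustrative, not a complete guardrail; tailor it to your environment before attaching it via AWS Organizations.

```python
import json

# Sketch of a guardrail SCP: deny creating new internet-facing network
# paths in member accounts. The action list is an assumption for
# illustration -- extend or trim it for your own environment.
GUARDRAIL_SCP = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyNewInternetPaths",
            "Effect": "Deny",
            "Action": [
                "ec2:CreateInternetGateway",
                "ec2:AttachInternetGateway",
                "ec2:CreateNatGateway",
                "ec2:AllocateAddress",            # blocks new public Elastic IPs
                "ec2:CreateVpcPeeringConnection",
            ],
            "Resource": "*",
        }
    ],
}

if __name__ == "__main__":
    # Attach through AWS Organizations (console, CLI, or API); printed here only.
    print(json.dumps(GUARDRAIL_SCP, indent=2))
```

Pair a policy like this with configuration monitoring (point 3) so any attempt to create such resources is both blocked and alerted on.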

Digital Transformation (a study)


Digital Transformation, or DX for the cool kids, describes workplace modernisation: the move towards Software-as-a-Service platforms on pay-as-you-go monthly billing, delivered on more secure, modern platforms in line with the fourth industrial revolution. It integrates collaboration and cloud-based platforms to enable the collective "hive mind" and to support multi-channel communication with customers.

Workplace Transformation, Customer service and Digital Marketing integration

DX Business Value

  • Cost Savings
  • Staff productivity
  • Operational resilience
  • Business agility

Industry 4.0


Assessment and Strategy (ROI, Business Prioritisation)

SD-WAN Transformation

Workplace Transformation

Infrastructure Transformation

Application Transformation


Public Cloud (Wars) Hyperscalers Comparison

Cloud vs On-prem Security Controls


The Cost of Cloud, a Trillion Dollar Paradox


Public Cloud (Wars) Hyperscalers Comparison

There are only three main global public cloud vendors: AWS, Azure and Google Cloud. All three have very interesting competitive advantages in the global enterprise market; not just Pokémon Go 🙂

  • AWS
    • Advantage
      • First to the global market and the dominant leader in public cloud, with the most advanced, feature-rich platform, at least 10 years ahead of Azure and Google Cloud, though both are catching up quickly. The only option if you are building a global-scale app.
    • Disadvantage
      • Incredibly complex and expensive to run workloads and designs that are not AWS-optimised.
      • Lack of enterprise experience; Agile and DevOps are often just nice buzzwords in the corporate world, and the reality is very different.
      • Most enterprise workloads will require complete refactoring for migration, but VMware integration and NetApp Cloud Volumes will make enterprise workload migration a lot easier.
      • Lock-in architecture: once you build an AWS-native app, it will be very difficult to migrate out.
      • Not all services meet devil-is-in-the-details enterprise requirements. AWS WAF, for example, is a version of the open-source ModSecurity, but it is very difficult to customise and cannot compete with the feature set of an F5 WAF.
      • The AWS product suite has complex limitations that you won't discover until you migrate.
      • AWS people are expensive. (Like me.)
      • An AWS Availability Zone can span multiple data centres, and the customer is responsible for architecting resilience using multiple regions and availability zones and for backing up their services. The key factor is that the AWS SLA promise is made at region level, so it is vital to factor the AWS SLA metrics into your design and cost estimates.
  • Azure
    • Advantage
      • Every single enterprise customer already uses most Microsoft products: Microsoft Office 365, Microsoft Active Directory, Microsoft Windows operating systems, Microsoft Storage Server, Microsoft Azure Stack, Microsoft Azure AD SSO. (These technologies provide the stickiness for Azure.)
      • Microsoft Windows, Active Directory and Office 365 are used by almost every corporate customer in the world. As customers transition from on-premises to cloud and SaaS, they move workloads to Office 365 and Azure AD, and then set up a tenant on Azure, making it a very easy transition.
      • Microsoft also restricts some applications and operating systems, via licensing, on shared compute platforms other than its own Azure platform. E.g. Microsoft RDS and Windows 10 are only allowed on Azure. There are many other complex licensing issues that you will only discover while reading all the licensing legal terms. (I have a number of articles discussing this on this blog.)
      • Microsoft is also enabling on-premises Azure Stack, which will make it easy to deploy and transition from on-premises to Azure, including its own Microsoft Storage Server.
    • Disadvantage
      • The Microsoft console is not as feature rich, and the way features are rolled out can cause headaches if you are not experienced enough to understand them.
      • Microsoft technology takes a great deal of expertise to maintain and to get working.
      • Microsoft Azure Stack and Hyper-V are not as performant as VMware ESXi or AWS at a very low level.
  • Google Cloud
    • Advantage
      • Google ambient computing, PWAs, Google Chrome, Google Dart, Google Firebase. Software will all be SaaS and consumption based; there is no reason for any customer to buy software. In the future we will all consume software from marketplace SaaS providers and multi-cloud. Google is heading towards market dominance without the ego of building a monopoly: they work with competing partners to build outcomes for customers.
      • Google is looking far beyond the current market; they are working towards infinite reach. It is actually insane if you think about this company's achievements and future.
      • Google services run on massive infrastructure globally and, just like Amazon, their primary customer is themselves.
      • They are taking a different approach to gaining market share. As Google provides the most widely used browser, they are pushing PWAs for development. The whole Google Cloud platform is very accessible from a developer's IDE. It is very easy to start creating a multi-platform application using a Google framework such as Flutter or AngularDart and spin up services using Google Firebase. Connecting the developer's IDE directly to the Google Cloud platform makes it a very easy option for DevOps and for developing MVPs. That is the future: a developer can build a global-scale app with AI, big data, blockchain and whatever else straight from an IDE and have everything, like a WAF, CDN, etc., created automatically. Google is building utopia for software. (I want in, please.)
      • Google is actually cheaper than AWS.
    • Disadvantage
      • Late to the game; they need to move fast and differentiate from AWS and Azure in terms of feature releases.
      • Google is a search and advertising company; moving to cloud/DC infrastructure and enterprise applications is a giant leap. They will need to hire enterprise presales.
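Whichever vendor you pick, the SLA point made under AWS above is worth making concrete. Here is a small sketch (illustrative figures, not any provider's published SLA) of how an SLA percentage translates into allowed downtime, and how redundancy across independent zones changes the picture:

```python
# Sketch: turn an SLA percentage into allowed downtime per month, and
# estimate the composite availability of two independent redundant
# copies of a service. Figures are illustrative only.

HOURS_PER_MONTH = 730  # average month

def downtime_minutes_per_month(sla_percent: float) -> float:
    """Minutes per month a service at this SLA may be unavailable."""
    return HOURS_PER_MONTH * 60 * (1 - sla_percent / 100)

def parallel_availability(a: float, b: float) -> float:
    """Availability of two independent redundant copies (either may serve)."""
    return 1 - (1 - a) * (1 - b)

print(round(downtime_minutes_per_month(99.9), 1))     # ~43.8 minutes/month
print(round(parallel_availability(0.999, 0.999), 6))  # 0.999999
```

The second number is why architecting across multiple zones or regions matters: two independent 99.9% deployments give roughly "six nines" in theory, although correlated failures reduce that in practice.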

Update 10/11/19: based on recent research, Google Cloud is now far superior to AWS in several areas.

  • Google Cloud allows you to depart from predefined configurations and customise your instance's CPU and RAM resources to fit your workload. These are known as custom machine types. Other types include Google Cloud preemptible VMs.
  • GCP has higher performance for storage.
  • GCP is priced lower than, or competitively with, AWS.
    • Google Cloud Platform also launched per-second billing, and Google's pricing seems to be slightly lower.
    • Comparison of Google Cloud Committed Use Discounts vs AWS Reserved Instances.
    • Another really big cost-saving discount that Google Cloud offers is what they call Sustained Use Discounts. These are automatic discounts that Google Cloud Platform provides the longer you use an instance within the month, unlike AWS, where you have to reserve the instance for a long period to get a discount.
  • GCP free tier with no time limits attached.
    • Google Cloud offers a $300 credit which lasts for 12 months. And as of March 2017, they also have a free tier with no time limits attached. Here is an example of an instance you could run forever for free with GCP.
      • f1-micro instance with 0.2 virtual CPU, 0.60 GB of memory, backed by a shared physical core. (US regions only)
      • 30 GB disk with 5 GB cloud storage
  • GCP Network Tiers – With Network Service Tiers, GCP is the first major public cloud to offer a tiered cloud network – https://cloud.google.com/network-tiers/
    • Premium Tier delivers GCP traffic over Google’s well-provisioned, low latency, highly reliable global network. This network consists of an extensive global private fiber network with over 100 points of presence (POPs) across the globe. By this measure, Google’s network is the largest of any public cloud provider. See the Google Cloud network map. GCP customers benefit from the global features within Global Load Balancing, another Premium Tier feature. You not only get the management simplicity of a single anycast IPv4 or IPv6 Virtual IP (VIP), but can also expand seamlessly across regions, and overflow or fail over to other regions.
    • Redundancy is key: there are at least three independent paths (N+2 redundancy) between any two locations on the Google network, helping to ensure that traffic continues to flow between the locations even in the event of a disruption.
  • GCP has lower latency than AWS because Google runs its own backhaul fibre-optic network; traffic travels over Google's backbone, not over the internet.
    • The FASTER Cable System gives Google access to up to 10 Tbps of the cable's total 60 Tbps bandwidth between the US and Japan, which it uses for Google Cloud and Google App customers. The 9,000 km trans-Pacific cable is the highest-capacity undersea cable ever built; it lands in Oregon in the United States and at two landing points in Japan. Google is also one of six members with sole access to a pair of 100 Gb/s x 100 wavelength optical transmission strands between Oregon and Japan.
  • Google Cloud also has a unique feature in its ability to live-migrate virtual machines. Live migration lets Google's engineers address issues such as patching, repairing and updating software and hardware without you having to worry about machine reboots.
    • AWS instead provides Availability Zones and concepts that your design needs to adhere to, such as availability and durability.
    • Availability: the ability of a system or component to be operational and accessible when required (system uptime). The availability of a system or component can be increased by adding redundancy, so that in case of a failure the redundant parts prevent the failure of the entire system (e.g. a database cluster with several nodes).
    • Durability: the ability of a system to assure that data remains stored and consistent as long as it is not changed by legitimate access; data should not become corrupted or disappear because of a system malfunction.
  • GCP security has been built over 15 years protecting Google's own services such as Gmail, and security is implemented at the core via GCP identity services and other features.
  • Google Firebase integration – Google provides application development languages and frameworks such as Angular, Go, Dart and Flutter that enable developers to create high-performance multi-platform and native applications very quickly. Integration with Google Firebase means a developer can access the full capability of GCP from an IDE such as Visual Studio Code. The nirvana is the ability to develop a front-end and back-end app in the IDE, then connect to and manage the full capability of the GCP cloud via Firebase and the IDE/your application architecture.
  • GCP Infrastructure as Code has really good integration with HashiCorp tools and Ansible – https://cloud.google.com/blog/products/gcp/hashicorp-and-google-expand-collaboration-easing-secret-and-infrastructure-management
  • Google Kubernetes Engine has advantages over AWS container services for security, orchestration and management.
  • Google Cloud Platform has been Carbon neutral since 2017
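The Sustained Use Discounts mentioned above can be sketched numerically. The quarter-month tier rates below are the ones Google documented for general-purpose (N1) machine types at the time of writing; verify them against current GCP pricing before relying on the numbers:

```python
# Sketch of GCP sustained-use discounts for general-purpose (N1) VMs:
# each successive quarter of the month is billed at a lower rate
# (100%, 80%, 60%, 40% of list price). Documented N1 tiers at the time
# of writing -- an assumption to verify against current pricing.

N1_TIER_RATES = [1.00, 0.80, 0.60, 0.40]  # one rate per quarter-month

def effective_price_multiplier(usage_fraction: float) -> float:
    """Blended fraction of list price paid, given the fraction of the month used."""
    billed = 0.0
    remaining = min(max(usage_fraction, 0.0), 1.0)
    for rate in N1_TIER_RATES:
        chunk = min(remaining, 0.25)
        billed += chunk * rate
        remaining -= chunk
    return billed / usage_fraction if usage_fraction > 0 else 0.0

print(round(effective_price_multiplier(1.0), 2))  # 0.7 -> 30% discount for 24/7 use
print(round(effective_price_multiplier(0.5), 2))  # 0.9 -> 10% discount at half a month
```

The key design point is that the discount is automatic: there is nothing to reserve up front, unlike AWS Reserved Instances.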


There are still plenty of years left in traditional data centre technologies and the new emerging scale-out and management platforms. You can easily design a server infrastructure with the latest tech at 1/10 of the cost of AWS, and then sweat that asset for 10+ years. I worked on IBM NonStop servers and they are still going after 20+ years. That is pretty good ROI for static apps that don't need to scale out.

Enterprise architecture for digital transformation is required; a CIO saying everything needs to go to AWS is not the right move. You need a proper assessment of your business, future strategy and current workloads. Organisations need to build a 10+ year strategy and work on a gradual migration to cloud adoption and the latest data centre technologies. It is not a matter of blindly picking a cloud vendor and going. Depending on your workloads, you may be better off staying inside a secure data centre. I started selling consulting services for assessing workloads to transition to cloud, and not a single customer wanted it; selling a proper migration strategy into cloud is not something most organisations take seriously. The cost/risk/agility trade-off is not an easy exercise, but assessing your current workloads is actually very simple with the advancement in cloud migration tools. IMO, if you end up with a multi-DC, branch and multi-cloud environment, you have stuffed it up; you must have lost any ROI.

Service comparisons

| Service category    | Service                           | AWS                                                                        | Google Cloud                                   |
|---------------------|-----------------------------------|----------------------------------------------------------------------------|------------------------------------------------|
| Compute             | IaaS                              | Amazon Elastic Compute Cloud                                               | Compute Engine                                 |
|                     | PaaS                              | AWS Elastic Beanstalk                                                      | App Engine                                     |
|                     | FaaS                              | AWS Lambda                                                                 | Cloud Functions                                |
| Containers          | CaaS                              | Amazon Elastic Kubernetes Service, Amazon Elastic Container Service        | Google Kubernetes Engine                       |
|                     | Containers without infrastructure | AWS Fargate                                                                | Cloud Run                                      |
|                     | Container registry                | Amazon Elastic Container Registry                                          | Container Registry                             |
| Networking          | Virtual networks                  | Amazon Virtual Private Cloud                                               | Virtual Private Cloud                          |
|                     | Load balancer                     | Elastic Load Balancer                                                      | Cloud Load Balancing                           |
|                     | Dedicated interconnect            | AWS Direct Connect                                                         | Cloud Interconnect                             |
|                     | Domains and DNS                   | Amazon Route 53                                                            | Google Domains, Cloud DNS                      |
|                     | CDN                               | Amazon CloudFront                                                          | Cloud CDN                                      |
|                     | DDoS firewall                     | AWS Shield, AWS WAF                                                        | Google Cloud Armor                             |
| Storage             | Object storage                    | Amazon Simple Storage Service                                              | Cloud Storage                                  |
|                     | Block storage                     | Amazon Elastic Block Store                                                 | Persistent Disk                                |
|                     | Reduced-availability storage      | Amazon S3 Standard-Infrequent Access, Amazon S3 One Zone-Infrequent Access | Cloud Storage Nearline, Cloud Storage Coldline |
|                     | Archival storage                  | Amazon Glacier                                                             | Cloud Storage Archive                          |
|                     | File storage                      | Amazon Elastic File System                                                 | Filestore                                      |
|                     | In-memory data store              | Amazon ElastiCache for Redis                                               | Memorystore                                    |
| Database            | RDBMS                             | Amazon Relational Database Service, Amazon Aurora                          | Cloud SQL, Cloud Spanner                       |
|                     | NoSQL: key-value                  | Amazon DynamoDB                                                            | Firestore, Cloud Bigtable                      |
|                     | NoSQL: indexed                    | Amazon SimpleDB                                                            | Firestore                                      |
|                     | In-memory data store              | Amazon ElastiCache for Redis                                               | Memorystore                                    |
| Data analytics      | Data warehouse                    | Amazon Redshift                                                            | BigQuery                                       |
|                     | Query service                     | Amazon Athena                                                              | BigQuery                                       |
|                     | Messaging                         | Amazon Simple Notification Service, Amazon Simple Queueing Service         | Pub/Sub                                        |
|                     | Batch data processing             | Amazon Elastic MapReduce, AWS Batch                                        | Dataproc, Dataflow                             |
|                     | Stream data processing            | Amazon Kinesis                                                             | Dataflow                                       |
|                     | Stream data ingest                | Amazon Kinesis                                                             | Pub/Sub                                        |
|                     | Workflow orchestration            | Amazon Data Pipeline, AWS Glue                                             | Cloud Composer                                 |
| Management tools    | Deployment                        | AWS CloudFormation                                                         | Cloud Deployment Manager                       |
|                     | Cost management                   | AWS Budgets                                                                | Cost Management                                |
| Operations          | Monitoring                        | Amazon CloudWatch                                                          | Cloud Monitoring                               |
|                     | Logging                           | Amazon CloudWatch Logs                                                     | Cloud Logging                                  |
|                     | Audit logging                     | AWS CloudTrail                                                             | Cloud Audit Logs                               |
|                     | Debugging                         | AWS X-Ray                                                                  | Cloud Debugger                                 |
|                     | Performance tracing               | AWS X-Ray                                                                  | Cloud Trace                                    |
| Security & identity | IAM                               | AWS Identity and Access Management                                         | Cloud Identity and Access Management           |
|                     | Secret management                 | AWS Secrets Manager                                                        | Secret Manager                                 |
|                     | Encrypted keys                    | AWS Key Management Service                                                 | Cloud Key Management Service                   |
|                     | Resource monitoring               | AWS Config                                                                 | Cloud Asset Inventory                          |
|                     | Vulnerability scanning            | Amazon Inspector                                                           | Web Security Scanner                           |
|                     | Threat detection                  | Amazon GuardDuty                                                           | Event Threat Detection (beta)                  |
|                     | Microsoft Active Directory        | AWS Directory Service                                                      | Managed Service for Microsoft Active Directory |
| Machine learning    | Speech                            | Amazon Transcribe                                                          | Speech-to-Text                                 |
|                     | Vision                            | Amazon Rekognition                                                         | Cloud Vision                                   |
|                     | Natural language processing       | Amazon Comprehend                                                          | Cloud Natural Language API                     |
|                     | Translation                       | Amazon Translate                                                           | Cloud Translation                              |
|                     | Conversational interface          | Amazon Lex                                                                 | Dialogflow Enterprise Edition                  |
|                     | Video intelligence                | Amazon Rekognition Video                                                   | Video Intelligence API                         |
|                     | Auto-generated models             | Amazon SageMaker Autopilot                                                 | AutoML                                         |
|                     | Fully managed ML                  | Amazon SageMaker                                                           | AI Platform                                    |
| Internet of Things  | IoT services                      | Amazon IoT                                                                 | Cloud IoT                                      |

Australian Government – Digital Transformation Strategy


  • https://www.dta.gov.au/what-we-do/policies-and-programs/secure-cloud/
  • https://www.dta.gov.au/files/cloud-strategy/secure-cloud-strategy.pdf
  • Principle 1: Make risk-based decisions when applying cloud security
  • Principle 2: Design services for the cloud
  • Principle 3: Use public cloud services as the default
  • Principle 4: Use as much of the cloud as possible
  • Principle 5: Avoid customisation and use services ‘as they come’
  • Principle 6: Take full advantage of cloud automation practices
  • Principle 7: Monitor the health and usage of cloud services in real time
  • Initiative 1: Agencies must develop their own cloud strategy
  • Initiative 2: Implement a layered certification model
  • Initiative 3: Redevelop the Cloud Services Panel to align with the procurement recommendations for a new procurement pathway that better supports cloud commodity purchases
  • Initiative 4: Create a dashboard to show service status for adoption, compliance status and services panel status and pricing
  • Initiative 5: Create and publish cloud service qualities baseline and assessment capability
  • Initiative 6: Build a cloud responsibility model supported by a cloud contracts capability
  • Initiative 7: Establish a whole-of-government cloud knowledge exchange
  • Initiative 8: Expand the Building Digital Capability program to include cloud skills
  • Myth 1: The Cloud is not as secure as on premise services
  • Myth 2: Privacy reasons mean government data cannot reside offshore.
  • “Generally, no. The Privacy Act does not prevent an Australian Privacy Principle (APP) entity from engaging a cloud service provider to store or process personal information overseas. The APP entity must comply with the APPs in sending personal information to the overseas cloud service provider, just as they need to for any other overseas outsourcing arrangement. In addition, the Office of the Australian Information Commissioner’s Guide to securing personal information: ‘Reasonable steps’ to protect personal information discusses security considerations that may be relevant under APP 11 when using cloud computing.” https://www.oaic.gov.au/agencies-and-organisations/agency-resources/privacy-agency-resource-4-sending-personalinformation-overseas Additionally, APP 8 provides the criteria for cross-border disclosure of personal information, which ensures the right practices for data residing off-shore are in place. Our Australian privacy frameworks establish the accountabilities to ensure the appropriate privacy and security controls are in place to maintain confidence in our personal information in the cloud.

  • Myth 3: Information in the cloud is not managed properly and does not comply with record-keeping obligations


Cloud Migration Example

This is by far one of the best cloud migration examples; I am keeping a copy here for future reference. It uses ZeroTier and Consul. You could use VMware NSX and velocloud.com for the same function.



About 6 months ago (in a galaxy pretty close to our office) …

Our old hosting provider was having network issues… again. There had been a network split around 3:20 AM, which had caused a few of our worker servers to become disconnected from the rest of our network. The background jobs on those workers kept trying to reach our other services until their timeout was reached, and they gave up.

This had already been the second incident in that month. Earlier, a few of our servers had been rebooted without warning. We were lucky that these servers were part of a cluster that could handle suddenly failing workers gracefully. We had taken care that rebooted servers would start up all their services in the right order and would rejoin the cluster without manual intervention.

However, if we had been unlucky and, say, our main database server had been restarted without warning, then we would have had some downtime and, potentially, would have had to manually fail over to our secondary database server.

We kept joking about how the flakiness of our current hosting provider was a great “Chaos Monkey”-like service which forced us to make sure that we had proper retry-policies and service start-up sequences in place everywhere.

But there were also other issues: booting up new machines was a slow and manual process, with few possibilities for automation. The small maximum machine size also started to become an inconvenience, and, lastly, they only had datacenters in the Netherlands, while we kept growing internationally.

It was clear that we needed to do something about the situation.

Which cloud to go to?

Our requirements for a new hosting provider made it pretty clear that we would have to move to one of the three big cloud providers if we wanted to fulfill all of them. One of the important things for us was an improved DevOps experience that would allow us to move faster. We needed to be able to spin up new boxes with the right image in seconds. We needed a fully private network that we could configure dynamically. We needed to be flexible in both storage and compute options and be able to scale both of them up and down as necessary. Additional hosted services (e.g. log aggregation and alerting) would also be nice to have. But, most importantly, we needed to be able to control and automate all of this with a nice API.

We had already been using Google Cloud Storage (GCS) in the past and were very content with it. The main reason for us to go with GCS had been the possibility to configure it to be strongly consistent, which made things easier for us. Therefore, we had a slight bias towards Google Cloud Platform (GCP) from the start but still decided to evaluate AWS and Azure for our use case.

Azure fell out pretty quickly. It just seemed too rough around the edges and some of us had used it for other projects and could report that they had cut their fingers on one thing or another. With AWS, the case was different, since it has everything and the kitchen sink. A technical problem was the lack of true strong consistency for S3. While it does provide read-after-write consistency for new files, it only provides eventual consistency for overwrite PUTs and for DELETEs.

Another issue was the price-performance ratio: for our workload, it looked like AWS was going to be at least two times more expensive than GCP for the same performance. While there are a lot of tricks one can use to get a lower AWS bill, they are all rather complex and either require you to get into the business of speculating on spot instances or to commit to specific instances for a long time, both things we would rather avoid. With GCP, the pricing is very straightforward: you pay a certain base price per instance per month, and sustained use earns an automatic discount on that price. In practice, if you run an instance 24/7, you end up paying roughly 70% of the "regular" price.

Given that Google also offers great networking options, has a well-designed API with an accompanying command-line client, and has datacenters all over the world, the choice was simple: we would be moving to GCP.

How do we get there?

After the decision had been taken, the next task was to figure out how we would move all of our data and all of our services to GCP. This would be a major undertaking and would require careful preparation, testing, and execution. It was clear that the only viable path would be a gradual migration of one service after another. The "big bang" migration is something we had stopped doing a long time ago, after realizing that, even with only a handful of services and a lot of preparation and testing, it is very hard to get right. Additionally, there is often no easy rollback path once you have pulled the trigger, leading to frantic fire-fighting and stressed engineers.

The requirements for the migration were thus as follows:

  • as little downtime as possible
  • possibility to gradually move one service after the other
  • testing of individual services as well as integration tests of several services
  • clear rollback path for each service
  • continuous data migration

This daunting list had a few implications:

  • We would need to be able to securely communicate between the old and the new datacenter (let’s call them dc1 and dc2)
  • The latency and the throughput between the two would need to be good enough that we could serve frontend requests across datacenters
  • Internal DNS names needed to be resolved between datacenters (and there could be no DNS name clashes)
  • And, we would have to come up with a way to continuously sync data between the two until we were ready to pull the switch

A plan emerges

After mulling this over for a bit, we started to have a good idea how to go about it. One of the key ingredients would be a VPN that would span both datacenters. The other would be proper separation of services on the DNS level.

On the VPN side, we wanted to have one big logical network where every service could talk to every other service as if they were in the same datacenter. Additionally, it would be nice if we wouldn’t have to route all traffic through the VPN. If two servers were in the same datacenter, it would be better if they could talk to each other directly through the local network.

Given that we don’t usually spend all day configuring networks, we had to do some research first to find the best solution. We talked to another startup that was using a similar setup, and they were relying on heavy-duty networking hardware that had built-in VPN capabilities. While this was working really well for them, it was not really an option for us. We had always been renting all of our hardware and had no intention of changing that. We would have to go with a software solution.

The first thing we looked at was OpenVPN. It’s the most popular open-source VPN solution, and it has been around for a long time. We had even been using it for our office network for a while and had some experience with it. However, our experience had not been particularly great. It had been a pain to configure and getting new machines online was more of a hassle than it should have been. There were also some connectivity issues sometimes where we would have to restart the service to fix the problem.

We started looking for alternatives and quickly stumbled upon zerotier.com, a small startup that had set out to make using VPNs user-friendly and simple. We took their software for a test ride and came away impressed: it literally took 10 minutes to connect two machines, and it did not require us to juggle certificates ourselves. In fact, the software is open-source and they do provide signed DEB and RPM packages on their site.

The best part of ZeroTier, however, is its peer-to-peer architecture: nodes in the network talk directly to each other instead of through some central server, and we measured very low latencies and high throughput as a result. This had been another concern with OpenVPN, since the gateway server could have become a bottleneck between the two datacenters. The only caveat with ZeroTier is that it requires a central server for the initial connection to a new server; all traffic after that initial handshake is peer-to-peer.

With the VPN in-place, we needed to take care of the DNS and service discovery piece next. Fortunately, this one was easy: we had been using Hashicorp’s Consul pretty much from the beginning and knew that it had multi-datacenter capabilities. We only needed to find out how to combine the two.

The dream team: Consul and ZeroTier

Getting ZeroTier up and running was really easy:

  • First install the zerotier-one service via apt on each server (automate this with your tool of choice).
  • Then, issue sudo zerotier-cli join the_network_id once to join the VPN.
  • Finally, you have to authorize each server in the ZT web interface by checking a box (this step can also be automated via their API, but this was not worth the effort for us).

This will create a new virtual network interface on each server:

user@host ~ % ip addr
3: zt0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2800 qdisc pfifo_fast state UNKNOWN group default qlen 1000
    link/ether ff:11:ff:11:ff:11 brd ff:ff:ff:ff:ff:ff
    inet 10.144.x.x/16 scope global zt0

The IP address will be assigned automatically a few seconds after authorizing the server. Each server then has two network interfaces, the default one (e.g. ens4) and the ZT one, called zt0. They will be in different subnets, e.g. 10.132.x.x and 10.144.x.x, where the first one is the private network inside of the Google datacenter and the second is the virtual private network created by ZT, which spans across both dc1 and dc2. At this point, each server in dc1 is able to ping each server in dc2 on their ZT interface.

It would be possible to run all traffic over the ZT network, but, for two servers that are anyway in the same datacenter, this would be a bit wasteful due to the (small) overhead introduced by ZT. We, therefore, looked for a way to advertise a different IP address depending on who was asking. For cross-datacenter DNS requests, we wanted to resolve to the ZT IP address, and, for in-datacenter DNS requests, we wanted to resolve to the local network interface.
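Conceptually, the split-horizon behaviour we wanted looks like this (a toy sketch using the example subnets above, not how Consul implements it internally):

```python
import ipaddress

# Toy sketch of split-horizon address selection: requests from our own
# datacenter get the local NIC address, requests from the other
# datacenter get the ZeroTier address. Subnets follow the example in
# the text (10.132.x.x local, 10.144.x.x ZeroTier).
LOCAL_NET = ipaddress.ip_network("10.132.0.0/16")

def address_to_advertise(requester_ip: str, local_addr: str, zt_addr: str) -> str:
    """Return the address a node should advertise to this requester."""
    if ipaddress.ip_address(requester_ip) in LOCAL_NET:
        return local_addr  # same datacenter: use the faster local network
    return zt_addr         # cross-datacenter: go over the VPN

print(address_to_advertise("10.132.0.7", "10.132.0.2", "10.144.0.2"))  # 10.132.0.2
print(address_to_advertise("10.144.9.9", "10.132.0.2", "10.144.0.2"))  # 10.144.0.2
```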

The good news here is that Consul supports this out-of-the-box! Consul works with JSON configuration files for each node and service. An example of the config for a node is the following:

user@host:/etc/consul$ cat 00-config.json
{
  "dns_config": {
    "allow_stale": true,
    "max_stale": "10m",
    "service_ttl": {
      "*": "5s"
    }
  },
  "server": false,
  "bind_addr": "0.0.0.0",
  "datacenter": "dc2",
  "advertise_addr": "10.132.0.2",
  "advertise_addr_wan": "10.144.0.2",
  "translate_wan_addrs": true
}

Consul relies on the datacenter to be set correctly if it is used for both LAN and WAN requests. The other important flags here are:

  • advertise_addr: the address to advertise over LAN (the local one in our case)
  • advertise_addr_wan: the address to advertise over WAN (the ZT one in our case)
  • translate_wan_addrs: enable this to return the WAN address for nodes in a remote datacenter
  • bind_addr: make sure this is 0.0.0.0 (which is the default) so that Consul listens on all interfaces

After applying this setup to all nodes in each datacenter, you should now be able to reach each node and service across datacenters. You can test this by running e.g. dig node_name.node.dc1.consul once from a machine in dc1 and once from a machine in dc2: the former should return the local address and the latter the ZT address.

Given this setup, it is then possible to switch from a service in one datacenter to the same service in another datacenter simply by changing its DNS configuration.

Issues we ran into

As with all big projects like this, we ran into a few issues of course:

  • We encountered a Linux kernel bug that prevented ZT from working. It was easily fixed by upgrading to the latest kernel.
  • We are using Hashicorp’s Vault for secret management (see our other blogpost for a more in-depth explanation of how we use it). To make Vault work nicely with ZT, we needed to set its redirect_addr to the Consul hostname of the server it is running on, e.g. redirect_addr = "http://the_hostname.node.dc1.consul:8501". By default, Vault advertises its redirect address in its Consul service definition, and this defaults to the private IP of the datacenter it is running in. Setting redirect_addr to the Consul hostname ensures that it resolves to the right address from either datacenter. Debugging this issue was quite the journey and required diving into the source of both Consul and Vault.
  • Another issue we ran into was that Dnsmasq is not installed by default on GCE Ubuntu images. We rely on Dnsmasq to relay *.consul domain names to Consul. It can easily be installed via apt of course.
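For completeness, the Dnsmasq relay rule is a one-liner; 8600 is Consul’s default DNS port, and the file path below is just a typical drop-in location, not necessarily the one we used:

```
# /etc/dnsmasq.d/10-consul (typical drop-in location)
# Forward all *.consul queries to the local Consul agent's DNS endpoint
server=/consul/127.0.0.1#8600
```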

Moving the data

While a lot of our services are stateless and could therefore easily be moved, we naturally also need to store our data somewhere and, therefore, had to come up with a plan to migrate it to its new home.

Our main datastores are Postgres, HDFS, and Redis. Each one of these needed a different approach in order to minimize any potential downtime. The migration path for Postgres was straightforward: using pg_basebackup, we could simply add another hot-standby server in the new datacenter, which would continuously sync the data from the master until we were ready to flip the switch. Before the critical moment, we turned on synchronous_commit to make sure that there was no replication lag and then failed over using the trigger-file mechanism that Postgres provides. This technique is also convenient if you need to upgrade your DB server, or if you need to do some maintenance, e.g. apply security updates and reboot.
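A hot standby of this kind was, at the time, driven by a recovery.conf on the standby. A hedged sketch of the relevant fragments (hostnames, paths, and the trigger-file location are placeholders, and on Postgres 12+ these settings moved into postgresql.conf):

```
# Seed the standby in the new datacenter with something like:
#   pg_basebackup -h old-master.example -D /var/lib/postgresql/data -P

# recovery.conf on the standby (placeholder values)
standby_mode = 'on'
primary_conninfo = 'host=old-master.example user=replicator'
trigger_file = '/var/lib/postgresql/failover.trigger'

# postgresql.conf on the master, shortly before the switch,
# to ensure there is no replication lag:
synchronous_commit = on
```

Creating the trigger file on the standby then promotes it to the new master.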

For HDFS the approach was different: due to the nature of our application, we refresh all data on it at least every 24 hours. This made it possible to simply upload all of the data to both clusters in parallel and keep them in sync. Having the data on both the new and the old cluster allowed us to run a number of integration tests ensuring that the two systems returned the same results. For a while, we would submit the same jobs to both clusters and compare the results. The result from the new cluster would be discarded, but, if there was a difference, we would send an alert that would allow us to investigate and fix the problem. This kind of “A/B testing” was invaluable in ironing out unforeseen issues before switching over in production.
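The dual-submission check can be pictured as a small comparison step. The job runners here are stand-ins for submitting to the two HDFS clusters; the structure (keep the old result, alert on disagreement) is the part taken from the text.

```python
def run_on_old(job):   # stand-in for submitting to the old cluster
    return sorted(job["input"])

def run_on_new(job):   # stand-in for submitting to the new cluster
    return sorted(job["input"])

def submit_everywhere(job, alert):
    """Run the job on both clusters, keep the old cluster's result,
    and alert (rather than fail) when the new cluster disagrees."""
    old_result = run_on_old(job)
    new_result = run_on_new(job)
    if old_result != new_result:
        alert(f"cluster mismatch for job {job['name']}")
    return old_result  # the new cluster's result is discarded

alerts = []
result = submit_everywhere({"name": "daily-refresh", "input": [3, 1, 2]},
                           alerts.append)
print(result)   # [1, 2, 3]
print(alerts)   # [] -> both clusters agreed
```

Because mismatches only raise alerts, production traffic was never affected while the new cluster was being validated.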

We use Redis mainly for background jobs, and we have support for pausing jobs temporarily in Jobmachine, our job scheduling system. This made the Redis move easy: We could pause jobs, sync the Redis data to disk, scp the data over to the new server, run a few integrity tests, update DNS, and then resume processing jobs.
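The Redis move amounted to a short, strictly ordered runbook. Sketched with stubbed-out steps (each step function is a placeholder for the real command, e.g. redis-cli save or scp; Jobmachine is our own scheduler mentioned above):

```python
steps_run = []

def step(name):
    # Placeholder for the real action; here we only record the order.
    steps_run.append(name)

def migrate_redis():
    step("pause jobs in Jobmachine")
    step("sync Redis data to disk")       # e.g. redis-cli save
    step("copy dump to the new server")   # e.g. scp dump.rdb new-host:
    step("run integrity tests")
    step("update DNS to the new server")
    step("resume jobs")

migrate_redis()
print(steps_run[0], "->", steps_run[-1])
```

The ordering is the whole point: jobs stay paused until the data has been verified and DNS points at the new server.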

The key in migrating our data was again to handle each service individually: validate the data, test the services relying on it, and then switch over once we were sure everything was working correctly.


Conclusion

The issues and limitations of our old hosting provider made it necessary to look for an alternative. It was important for us that we could move all of our services and data gradually and could test and validate each step of the migration. We therefore chose to create a VPN that would span both of our datacenters using ZeroTier. In combination with Consul, this allowed us to have two instances of each service, which we could easily switch between using only a DNS update. For the data migration we made sure to duplicate all data continuously until we were sure everything was working as intended. If you are looking for an easy way to migrate from one datacenter to another, then we can highly recommend looking into both Consul and ZeroTier.

Gartner – Five Ways to Migrate Applications to the Cloud

Rehost, i.e. redeploy applications to a different hardware environment and change the application’s infrastructure configuration. Rehosting an application without making changes to its architecture can provide a fast cloud migration solution. However, the primary advantage of IaaS, that teams can migrate systems quickly without modifying their architecture, can also be its primary disadvantage: benefits from the cloud characteristics of the infrastructure, such as scalability, will be missed.

Refactor, i.e. run applications on a cloud provider’s infrastructure. The primary advantage is blending familiarity with innovation as “backward-compatible” PaaS means developers can reuse languages, frameworks, and containers they have invested in, thus leveraging code the organization considers strategic. Disadvantages include missing capabilities, transitive risk, and framework lock-in. At this early stage in the PaaS market, some of the capabilities developers depend on with existing platforms can be missing from PaaS offerings.

Revise, i.e. modify or extend the existing code base to support legacy modernization requirements, then use rehost or refactor options to deploy to cloud. This option allows organizations to optimize the application to leverage the cloud characteristics of providers’ infrastructure. The downside is that kicking off a (possibly major) development project will require upfront expenses to mobilize a development team. Depending on the scale of the revision, revise is the option likely to take most time to deliver its capabilities.

Rebuild, i.e. discard the code of an existing application, re-architect it, and rebuild the solution on PaaS. Although rebuilding means losing the familiarity of existing code and frameworks, the advantage is access to innovative features in the provider’s platform that improve developer productivity, such as tools that allow application templates and data models to be customized, metadata-driven engines, and communities that supply pre-built components. However, lock-in is the primary disadvantage: if the provider makes a pricing or technical change that the consumer cannot accept, breaches service-level agreements (SLAs), or fails, the consumer is forced to switch, potentially abandoning some or all of its application assets.

Replace, i.e. discard an existing application (or set of applications) and use commercial software delivered as a service. This option avoids investment in mobilizing a development team when requirements for a business function change quickly. Disadvantages can include inconsistent data semantics, data access issues, and vendor lock-in.