Data Access Implications For Storing Data In The Cloud

A Hidden Problem Just Became A Major One

SELECT CloudCustomerTable.Name
       CloudCustomerTable.TelNo
  FROM CloudCustomerTable
  JOIN LocalMailShotTable
    ON CloudCustomerTable.CustomerId = 
       LocalMailShotTable.CustomerId
 WHERE LocalMailShotTable.MailshotID = 
                          "Yesterdays"

The above is an imaginary query to get the telephone number for everyone in yesterday's mailshot, something that is so mainstream that it can't possibly be an issue.

Yet it can be hugely problematic in a mixed environment where some data is held on the cloud and some within a business.

Obviously this one example by itself is trivial and there are workarounds, but notice that I have already mentioned workarounds in what was supposed to be an ideal environment.

And one thing that we know about workarounds is that they take time, effort and resources and don't always work.

The Database Engine's Perspective On Linked And Local Tables

Joe in Marketing has a local SQL Server Express (the free edition) database that he uses to manage mailshots and he wants to get the contact details for yesterday's mailshot from the core business data which is in The Cloud.

To answer the query shown on the left the database server, the program that reads and writes to the database as in nearly all modern systems that is a separate program to your own application, has to

Create a list of customer ids from LocalMailShotTable that had the relevant mailshot.
Read the table CloudCustomerTable to get the telephone numbers for those customer ids.

When both tables are either in the same database or on the same server the list of customer ids can be held in RAM.

When the two are separated by one being in The Cloud and one on a computer within the business a step needs to be taken that is transparent to the user but is highly resource intensive.

Assuming that the request is being processed locally with the CloudCustomerTable in the cloud and the LocalMailShotTable is local to the business then the database server has no choice but to.

Download the list of ALL customers from The Cloud via the company's internet connection.
Bring this list into the database and index it, this is a quite resource intensive process as the data has to be written to the disc.

Modern servers deliver very high levels of performance by storing in RAM the data that is most often used, anything that reads and writes to the disc dramatically reduces that performance.
Create the list of customers who had the mailshot and find the matching customer details from the temporarily downloaded list of all customer.

The request could be processed by the database server in The Cloud where the options would be the other way around, uploading all the mailshot information instead.

What the database server is not capable of doing is recognising that it could reduce the problem by getting the customers locally, transferring this list to The Cloud, running a new query on The Cloud database server that uses the list uploaded and then pass back just the results to the local database server.

Which is good because the security consequences of allowing such behaviour would be unacceptable.

As this happens in the background "Joe in marketing" may not even be aware that he is downloading huge chunks of data, slowing the company's internet connection and incurring data transfer, CPU and disk usage costs.

If you give this query to IT and get a programmer to do it, he would do what I have just said that the server can't do as it is the most efficient solution to the problem.

This step of uploading a file is always going to be a bottleneck which gets worse as the queries become more complex, turning a trivial task into one that needs thinking about.

So the ad hoc nature of using a PC is heavily compromised by either needing to set up advanced queries or run very inefficient queries that used to take a couple of seconds and now take a minute or so and are charged for. So now added to the possible cloud server costs the employee is getting frustrated, "I could do this before upgrading to the cloud".

As this is a trivial example it is easy to find ways around it, for example at 7am you could make a local copy of the customer data. Then when things go wrong and you have to work with yesterday's data which means that the report xyz gives different total to report abc because they are using different customer databases what do you do?

In highly formalised companies this may be less of an issue, the financial controller and his department may not have the interest or the necessary security permissions to query the database and all queries go via IT anyway.

The real point is did you think about this when deciding to put your data into the cloud?

Simulate The Cloud With Linked Servers

SQL Server linking to itself pretty much simulates The Cloud.

Almost Identical Queries Different Execution Plans

Execution Plans, what are Execution Plans I don't bother checking to see what my queries are actually doing?

Although this may look techie it is aimed at a non-technical audience as a detailed explanation of why splitting databases between all in-house and in-house and the cloud can cause big problems.

Shown below are the estimated execution plans for the "same" query when both of the tables are on the same database server and when one is on a local server and one in The Cloud. In both examples the tables are in different databases.

An execution plan is the database explaining to the user what steps it is going to take to execute the database query.

Sometimes you need a decent understanding of database concepts to understand what a plan is telling you but in this case it is pretty clear that something is different.

To allow you to repeat the exercise we represent the cloud server as a link created from a local server to itself, SQL Server doesn't assume that it is safe to remove this "link" so it processes requests to the linked version of itself as if it was a remote server.

Unfortunately you can't rename the link it has to be the server name, so in the second example what is the remote server has to be called localhost!

Note how the only change between the queries is RemoteServer.dbo.Customers becomes LocalHost.RemoteServer.dbo.Customers.

The parts separated by the dot represent [Database server name].[Database name].[owner].[table name] Just to repeat the above, LocalHost is a connection to a remote SQL Server.

Execution Plan When All Tables Are On The Same Server

Because all of the data is on the same server the database server "owns the data" it can take responsibility for the data and produce the most efficient query possible.

In this example both the MailShotDetails table and the Customers table are on the same server but in different databases.

We can see that the query plan is showing two tasks taking up all the execution time, Clustered Index Scans/Seeks on the two tables.

A Clustered Index Scan/Seek is like going to the index in a book to find the pages that you want, a Seek is more specific than a scan though.

This execution plan is conceptually the best way to solve the problem and no amount of manual tuning could produce a better result.

SELECT Customers.Name,Customers.TelNo
FROM LocalServer.dbo.MailShotDetails
JOIN RemoteServer.dbo.Customers
ON Customers.Ref = MailShotDetails.CustomersRef
WHERE MailShotDetails.MailShotHeaderRef = 1

Execution Plan When Some Tables Are On A Cloud Server

As some of the data is on another server the database server "doesn't own that data" so it can't take responsibility for managing it so it has to produce a safe but slow query.

In this example the MailShotDetails table is on a database server within the building and the Customers table in is a remote database in The Cloud.

It is immediately obvious that the execution plan is more complex than the one above and it has to caused by the fact that the customer table has moved into The Cloud.

There is the new step Remote Query which now takes 61% of the execution time, which is described as Send A SQL Query to another than the current SQL Server and the query is select from customers with any restrictions. The list of customers wanted is not sent.

The 61% percent number needs to be taken with a large pinch of salt because it don't take into account the time taken to transmit the information across the internet or just how busy the remote server will be.

Nor can it take into account the size of the customer table in relation to the number of wanted rows in the mailshot table.

SELECT Customers.Name,Customers.TelNo
FROM LocalServer.dbo.MailShotDetails
JOIN LocalHost.RemoteServer.dbo.Customers
ON Customers.Ref = MailShotDetails.CustomersRef
WHERE MailShotDetails.MailShotHeaderRef = 1

SQL Server T-SQL

The only bit that isn't pretty standard SQL is

Ref INT IDENTITY (1,1) NOT NULL PRIMARY KEY,

IDENTITY (1,1) is a column that starts at one and goes up by one for each row added, usually.

SQL Server caches the value and after almost every server restart this caching leaves unused values.

SQL Script For The Above

Try it on your servers.

The full SQL script to allow you to create and play with the above is shown below, it is SQL Server T-SQL

CREATE DATABASE LocalServer;
CREATE DATABASE RemoteServer;

USE LocalServer;
CREATE TABLE Customers
(
Ref INT IDENTITY (1,1) NOT NULL PRIMARY KEY,
Name VARCHAR(64),
TelNo VARCHAR(32)
)
INSERT INTO Customers (Name, TelNo) VALUES ('Cust 1','01906 123456')
INSERT INTO Customers (Name, TelNo) VALUES ('Cust 2','01906 123457')
INSERT INTO Customers (Name, TelNo) VALUES ('Cust 3','01906 123458')

CREATE TABLE MailShotHeader
(
Ref INT IDENTITY (1,1) NOT NULL PRIMARY KEY,
Description VARCHAR(64) NOT NULL,
SentDate DATETIME NOT NULL
);
CREATE INDEX idxMailShotHeader_SentDate ON MailShotHeader (SentDate);
INSERT INTO MailShotHeader (Description,SentDate) VALUES ('Example Mailshot',GETDATE());

CREATE TABLE MailShotDetails
(
Ref INT IDENTITY (1,1) NOT NULL PRIMARY KEY,
MailShotHeaderRef INT NOT NULL,
CustomersRef INT NOT NULL
);
CREATE INDEX idx_MailShotDetails_MailShotHeaderRef ON MailShotDetails (MailShotHeaderRef);
CREATE INDEX idx_MailShotDetails_CustomersRef ON MailShotDetails (CustomersRef);
INSERT INTO MailShotDetails (MailShotHeaderRef,CustomersRef) VALUES (1,1);
INSERT INTO MailShotDetails (MailShotHeaderRef,CustomersRef) VALUES (1,2);
USE RemoteServer
CREATE TABLE Customers
(
Ref INT IDENTITY (1,1) NOT NULL PRIMARY KEY,
Name VARCHAR(64),
TelNo VARCHAR(32)
)
INSERT INTO Customers (Name, TelNo) VALUES ('Cust 1','01906 123456')
INSERT INTO Customers (Name, TelNo) VALUES ('Cust 2','01906 123457')
INSERT INTO Customers (Name, TelNo) VALUES ('Cust 3','01906 123458')

USE LocalServer
SELECT Customers.Name,Customers.TelNo
FROM LocalServer.dbo.MailShotDetails
JOIN RemoteServer.dbo.Customers
ON Customers.Ref = MailShotDetails.CustomersRef
WHERE MailShotDetails.MailShotHeaderRef = 1

SELECT Customers.Name,Customers.TelNo
FROM LocalServer.dbo.MailShotDetails
JOIN LocalHost.RemoteServer.dbo.Customers
ON Customers.Ref = MailShotDetails.CustomersRef
WHERE MailShotDetails.MailShotHeaderRef = 1

Cloud Servers, Linked Servers And Local Data, The Problem.