Unique identifier or INT as clustered index

Today I was asked what is best to use for the clustered index and primary key in a table created on e.g. SQL Server 2008 R2: a uniqueidentifier or an int?

As usual my answer was: It depends. But if we only look at index maintenance and other performance issues, then I would always prefer an int over a uniqueidentifier.

So what else can we do than test the different scenarios?

First we create a database.

 

use master
go

-- First let's see if the database
-- already exists and if we have to delete it.
if DATABASEPROPERTYEX('keyTest','Version') is not null
begin
    drop database keyTest
end

-- Lets create the database
Create database keyTest
go

use keyTest
go


The first step is to test whether the database keyTest already exists. If it does, we drop it and create it again. Then we switch context to the keyTest database.

Now, let's create two tables that hold the same data but have clustered indexes built on different datatypes. We implement the clustered index by using the primary key keyword (there is a big difference between a clustered index and a primary key, but when we create a table with the primary key keyword, a clustered index is also created).

-- Lets create the two tables
Create table IdentifierTBL
        (    ID        uniqueidentifier        primary key
             default newID()
            , Postalcode        char(4)
            , Name                char(50)
        )

Create table IntTBL
        (    ID        int    identity        primary key
            , Postalcode        char(4)
            , Name                char(50)
        )

The two tables have the same columns, but in the first table the ID is a uniqueidentifier, which is 16 bytes long, and in the second table it is an int, which is 4 bytes long.

Now let's try to insert 10,000 rows into each of the tables.

 

-- Let's insert some data
insert into IdentifierTBL values (newid(), '1234', 
Replicate('A', 50))
go 10000

insert into IntTBL values ('1234', Replicate('A', 50))
go 10000

That's it. We have two tables with 10,000 rows each. The only test we want to carry out now is to look at the fragmentation and how many pages the tables take up.

select    t1.object_id
        , t2.name
        , t1.avg_fragmentation_in_percent
        , index_level, page_count
        , avg_record_size_in_bytes
from sys.dm_db_index_physical_stats(DB_ID(), 
       null, null, null, 'Detailed') t1
    inner join sys.objects t2 on t1.object_id=t2.object_id
where t2.name in ('IdentifierTBL', 'IntTBL')


The result looks like this:

[image: sys.dm_db_index_physical_stats output for IdentifierTBL and IntTBL]

We can see that both tables consist of two levels, a leaf level and a root level. We can also see that when we use a uniqueidentifier, fragmentation is approx. 99%, and when we use an integer, it’s approx. 2%.

We can also see that the table with an int as the clustered key takes up 92 pages with an average record size of 65 bytes. The same figures for the table with the uniqueidentifier are 154 pages and an average record size of 77 bytes.
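If you want to see how much of the uniqueidentifier table's extra size comes from fragmentation rather than from the wider key, a quick sketch is to rebuild the clustered indexes and run the statistics query again; whatever difference remains is due to key width alone.

```sql
-- Rebuild all indexes on both tables to remove fragmentation,
-- then re-run sys.dm_db_index_physical_stats and compare page counts.
alter index all on dbo.IdentifierTBL rebuild
alter index all on dbo.IntTBL rebuild
```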

But it does not stop here. What if we create a nonclustered index on the postal code? What will happen to our physical structures?

USE [keyTest]
GO
CREATE NONCLUSTERED INDEX
        [Nonclustered] ON [dbo].[IntTBL]
(
    [Postalcode] ASC
)

CREATE NONCLUSTERED INDEX
        [Nonclustered] ON [dbo].[IdentifierTBL]
(
    [Postalcode] ASC
)

 

Let's look at the characteristics.

 

select    t1.object_id
        , t2.name
        , t3.name
        , t1.avg_fragmentation_in_percent
        , index_level, page_count
        , avg_record_size_in_bytes
from sys.dm_db_index_physical_stats(DB_ID(), null, 
null, null, 'Detailed') t1
    inner join sys.objects t2 on t1.object_id=t2.object_id
    inner join sys.indexes t3 on t1.object_id=t3.object_id 
and t1.index_id=t3.index_id
where t2.name in ('IdentifierTBL', 'IntTBL')
and t3.name='NonClustered'


 

The result is:

[image: nonclustered index statistics for IdentifierTBL and IntTBL]

What I am interested in showing you is the page count and the record size. As you can see, the index is much bigger with a uniqueidentifier: it uses 36 pages compared to 20 pages. That is because the record size is 24 bytes compared to the 12 bytes used in IntTBL. The difference is 12 bytes, which is exactly the difference between the 16 bytes of a uniqueidentifier and the 4 bytes of an int.

What we can conclude is that the clustered index key is added to every row of the nonclustered index: if the key is 16 bytes, then 16 bytes are added; if it is 4 bytes, then 4 bytes are added.

Why, then, should we ever use a uniqueidentifier, when I have now shown you how much better an integer performs as a clustered key?

Well
 If I, for example, had a database with disconnected users, e.g. a CRM system, then it could be a good idea to use a uniqueidentifier. But it is not within the scope of this blog to write about that.
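If you do end up needing uniqueidentifier keys, one compromise worth knowing about (a sketch only, not part of the tests above) is NEWSEQUENTIALID(), which generates ever-increasing GUIDs on a given machine and therefore avoids most of the fragmentation we measured. Note that it can only be used in a DEFAULT constraint, not called directly.

```sql
-- Same table as before, but with sequential GUIDs as the default.
Create table SeqIdentifierTBL
        (    ID        uniqueidentifier        primary key
             default newsequentialid()
            , Postalcode        char(4)
            , Name              char(50)
        )
```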

SĂžren

Two ways of improving your ETL with SQL 2012 #3 Change data capture

In an earlier post we saw how the new project deployment functionality makes it much easier to deploy and keep track of versions of an SSIS package.

The other cool feature I will talk about is change data capture. How do you transfer data from your data sources (e.g. an ERP system) to a staging database? Well, I see multiple ways, but the purpose is to transfer as few records as possible, and some of the methods are:

1) If a table has a “last changed” column, then keep track of the last changed date in the staging database, and then transfer all rows changed since the last transfer.

2) If a table has a row ID that is incremental, then transfer all new rows since the last transfer.

3) Transfer e.g. the last two days’ data and then work with them.

All methods have their ups and downs, but now we can utilize the Change Data Capture functionality, and I will show you how it works.
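To make method 1 concrete, it can be sketched like this; the table and column names (Watermark, LastChangedDate and so on) are purely illustrative:

```sql
-- Read the watermark saved after the last transfer.
declare @LastChanged datetime
select @LastChanged = LastTransferredDate from dbo.Watermark

-- Transfer only the rows changed since then.
insert into Staging.dbo.Orders
select * from ERP.dbo.Orders
where LastChangedDate > @LastChanged

-- Move the watermark to the newest row we transferred, so nothing
-- changed during the load is skipped next time.
update dbo.Watermark
set LastTransferredDate =
    (select max(LastChangedDate) from Staging.dbo.Orders)
```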

Basically, we must:

1) Add change data capture to a table.

2) Create the SSIS package.

3) Run the package an initial first time.

4) Set up the package to run continuously.
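Step 1 is done in T-SQL with the CDC system procedures. A minimal sketch, assuming a source table dbo.Customers (the table name is just an example; CDC also requires SQL Server Agent to be running for the capture jobs):

```sql
-- Enable change data capture on the database...
exec sys.sp_cdc_enable_db

-- ...and then on the table we want to track.
exec sys.sp_cdc_enable_table
      @source_schema = N'dbo'
    , @source_name   = N'Customers'
    , @role_name     = null
```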

 

It is very easy, and I will try to give you a short demo on that now.

[youtube=http://www.youtube.com/watch?v=vlOZvI9xmuY&w=448&h=252&hd=1]

 

<– Project Deployment

two ways of improving your ETL with SQL 2012 #2 – Project deployment

How did you deploy SSIS packages earlier? Well, I went into the project properties, created a deployment utility, and then ran it. It was a lot of work.

Now I:

  1. Set up an SSIS catalog in SQL Server Management Studio.
  2. Right-click in Data Tools (Visual Studio) and deploy.

Let’s see how that is done in SQL Server 2012. Please don’t mind the quality of the video; my son is sick today, so I had to take a lot of breaks.

[youtube=http://www.youtube.com/watch?v=SsjEftMLswU&w=448&h=252&hd=1]

 

<—Introduction   Change data capture –>

Two ways of improving your ETL with SQL 2012 #1

SQL 2012 has improved a lot when it comes to creating a better and more efficient ETL, and in this short intro I would like to show you some of the things I find useful.

In this series of blogs I will show you:

  1. Project deployment instead of package deployment.
  2. Using Change Data Capture to improve performance.

Project deployment

One of the things that has always irritated me was the need to deploy a package instead of a complete project. To be practical, I will show you how to deploy a complete project to a SQL Server 2012 instead of a single package. It is very easy and gives you a lot of benefits, and it makes it easier to configure values that must change depending on the installation. On a scale from 1 to 10, where 1 is no new value and 10 is great new value, I give this functionality a 9.

Change data capture

This is not new either; we have been able to set up change data capture since version 2008, but now even a newbie to ETL can set up change data capture and use it in Integration Services. We have some new components that help us. I would give this 10 out of 10.

This is simply brilliant, since it addresses one of the biggest problems in building a data warehouse with good performance: how do we update and insert only changed and new rows?

So let’s get started with project deployment.

Project deployment –>

Database mirroring–survival kit #4

This is the final part of database mirroring, and here I want to give you some nice tips and also some links to get you a little bit further.

First of all: showing you the videos is the easy part, because I am used to doing this. When you start working with it yourself it can be a challenge, so I suggest that you set up a virtual machine and take a snapshot once in a while.

When setting up database mirroring in a production environment, be aware of the following:

  1. Security. I just clicked next –> next –> next, but you must think about security, and even though it is irritating, make sure that you follow the principle of least privilege.
  2. Endpoints must be configured correctly. Do you need encryption (yes, of course)? What ports are you going to use and what credentials are you using?
  3. Make sure that the ports you are using are open in the firewalls.
  4. Do we need high safety or high performance?
  5. Do we need automatic failover or not?
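Point 2, configuring an endpoint explicitly instead of clicking through the wizard, could look like the following sketch; the endpoint name, port and encryption algorithm are just examples you must adapt to your environment:

```sql
-- One mirroring endpoint per instance; pick a port that is free
-- and open in the firewall.
create endpoint Mirroring
    state = started
    as tcp (listener_port = 5022)
    for database_mirroring (role = partner,
                            encryption = required algorithm aes)
```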

If you want to learn more about database mirroring, then I can recommend the following link:

http://msdn.microsoft.com/en-us/library/ms189852%28v=sql.105%29.aspx

If you get into trouble, I would suggest:

http://www.sqlskills.com

http://www.sqlservercentral.com

http://www.sqlauthority.com

 

That was the final step in this survival kit for Database Mirroring.

<—Go to step 3

Database mirroring–survival kit #3

Now it gets a little bit tricky, because how will you handle database mirroring and replication together?

Let’s imagine a scenario where we want to replicate a table from a mirrored database to another database. If we fail over, replication should continue to work. Is that possible?

Yes, and it is fairly easy to set up. What you need to do is:

  1. Set up the mirroring.
  2. Set up a distributor and publication on the current principal server (consider having the distributor on its own server).
  3. On the mirror server, set up distribution and use the same distributor as on the current principal server.
  4. Add both the distributor and the mirror server to the replication monitor.
  5. Set the -PublisherFailoverPartner property for the agents (in my demo that is the snapshot agent and the log reader agent).
[youtube=http://www.youtube.com/watch?v=JNb5dNqWj4c&w=448&h=252&hd=1]

<—Go to step 2      Go to step 4—>

Database mirroring–survival kit #2

In the previous article about database mirroring we saw how to set it up. In this short demo I will show you how to create a very small .NET application that utilizes database mirroring.

The problem is that the two instances of a database have two different IP addresses and two different names, and earlier a developer had to spend a lot of time configuring around that.

Now we can simply add a setting to the connection string to define the failover partner.

Normally a connection string to a SQL Server looks something like:

Data Source=myServerAddress;Initial Catalog=myDataBase;Integrated Security=True;

and now we can simply change it to:

Data Source=myServerAddress;Initial Catalog=myDataBase;Integrated Security=True;Failover Partner=myMirrorServerAddress;

And now database mirroring is working in the application.

Let me show you how it works.

[youtube=http://www.youtube.com/watch?v=nqyUHjV0c1E&w=448&h=252&hd=1]

<—Go to step 1      Go to step 3—>

Database mirroring–survival kit #1

This blog is intended to give you a short introduction to database mirroring. Database mirroring is a high availability feature that is easy to set up, reliable, and has two characteristics that I like:

  1. It makes it easy for developers to create software against a mirrored database (I will explain that later).
  2. It has automatic failover.

Basically the question is: What is database mirroring?

Imagine that you have a database that is critical for you. It could be a payment database used by a website to track customers’ payments. If the database is down, your customers can’t pay, and therefore you lose money!

Today a lot of important software and websites have a setup like this:

[image: website talking to a single database on one server]

 

You have a website and a single database running on one server. The problem is that if your paymentDB is offline for some reason, then your website does not work (and yes, I have seen that a lot of times, and when I talk to customers about it, they always say that “our database is never offline”
 Well! I can guarantee you that their database will be down sometimes).

 

So what we want is a setup that looks a little bit like this, where the web application communicates with the running database, but if that fails, the database that is “ready to take over” becomes the running database.

 

[image: mirrored setup with a running database and a standby database ready to take over]

That is the basics of database mirroring, so let’s see what to do.

First of all, you must have at least two instances of Microsoft SQL Server installed. They can be on the same server or on two different servers. There are a lot of benefits to having them on two servers: basically, you are better protected and have a better chance of good uptime.

In this demo, I just have some instances on the same machine.

So to get started, you must:

  1. Have at least two instances. If you want automatic failover from one database to the other (from one instance to the other), then you must also have a third instance that becomes the witness.
  2. Have a database that meets the requirements for database mirroring.
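Requirement 2 boils down to the database being in the full recovery model and being restored on the mirror WITH NORECOVERY before mirroring is started. A sketch, where the database name paymentDB and the backup paths are assumptions:

```sql
-- On the principal: full recovery model, then a full and a log backup.
alter database paymentDB set recovery full

backup database paymentDB to disk = N'C:\Backup\paymentDB.bak'
backup log      paymentDB to disk = N'C:\Backup\paymentDB.trn'

-- On the mirror: restore both WITH NORECOVERY so the database
-- stays in a restoring state and can receive log records.
restore database paymentDB from disk = N'C:\Backup\paymentDB.bak' with norecovery
restore log      paymentDB from disk = N'C:\Backup\paymentDB.trn' with norecovery
```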

So, let’s see how that works.

[youtube=http://www.youtube.com/watch?v=Q8Su4RvObDM&w=448&h=252&hd=1]

 

It is quite easy, but unfortunately there are some important things that you need to know, and the two most important are:

  1. What did I actually configure when I clicked Next –> Next –> Next? We talk about the endpoints.
  2. Operational modes. What do the three modes mean?

I have attached the video below, so that you can also get an understanding of that.

[youtube=http://www.youtube.com/watch?v=85zfWgDEpO0&w=448&h=252&hd=1]

In survival kit #4 we talk a little bit more about security, because when all your instances run on the same server it is very easy; it becomes complicated when you want a more complex setup.

Go to step 2 –>