I need gurus input on the pros and cons of Seeded(int) Identity colums vs. UniqueIdentifier(GUIDs) columns for my db design. As I understand it, merging the data of 2 independent db's both using IDENTITY columns would be very hairy because of the possibility
of overlap. GUIDs are much more likely to be unique across different servers. What I'd like to hear from others are the other pros and cons of the situation. I of course understand the performance hit suffered at the hands of the GUID. Thanks for your help!
A Favorite subject of mine :) The only performance hit when using GUIDs is the extra storage requirements as far as I know. Again, this is 16 bytes as compared to 4 bytes for an int... I have generally seen more fuss made over whether an inconsequential "Comments"
field should be 50 as opposed to 255 or 2000 (or whatever) bytes. However, this is somewhat offset (especially under high DML loads) by some of the lesser-known beneficial aspects of using uniqueidentifier for PK. For example, there is an issue known as a
"hotspot", where certain pages of data in a table are under relatively high currency contention. Basically, what happens is most of the traffic on a table (and hence page-level locks) occurs on a small area of the table, towards the end. New records will always
go to this hotspot, because IDENTITY is a sequential number generator. These inserts are troublesome because they require Exlusive page lock on the page they are added to (the hotspot). This effectively serializes all inserts to a table thanks to the page
locking mechanism. NewID() on the other hand does not suffer from hotspots. Values generated using the NewID() function are only sequential for short bursts of inserts (where the function is being called very quickly, such as during a multi-row insert), which
causes the inserted rows to spread randomly throughout the table's data pages instead of all at the end - thus eliminating a hotspot from inserts. Also, because the inserts are randomly distributed, the chance of page splits is greatly reduced. While a page
split here and there isnt too bad, the effects do add up quickly. With IDENTITY, page Fill Factor is pretty useless as a tuning mechanism and might as well be set to 100% - rows will never be inserted in any page but the last one. With NewID(), you can actually
make use of Fill Factor as a performance-enabling tool. You can set Fill Factor to a level that approximates estimated volume growth between index rebuilds, and then schedule the rebuilds during off-peak hours using dbcc reindex. This effectively delays the
performance hits of page splits until off-peak times. If you even *think* you might need to enable replication for the table in question - then you might as well make the PK a uniqueidentifier and flag the guid field as ROWGUIDCOL. Replication will require
a uniquely valued guid field with this attribute, and it will add one if none exists. If a suitable field exists, then it will just use the one thats there. Yet another huge benefit for using GUIDs for PKs is the fact that the value is indeed guaranteed unique
- not just among all values generated by *this* server, but all values generated by *all* computers - whether it be your db server, web server, app server, or client machine. Pretty much every modern language has the capability of generating a valid guid now
- in .NET you can use System.Guid.NewGuid. This is VERY handy when dealing with cached master-detail datasets in particular. You dont have to employ crazy temporary keying schemes just to relate your records together before they are committed. You just fetch
a perfectly valid new Guid from the operating system for each new record's permanent key value at the time the record is created. About the only thing I can think of that I consider a "pro" for the IDENTITY mechanism is the fact that as a human, its MUCH easier
to read and remember an integer as opposed to a Guid sequence. However, its also my opinion that USERS should NEVER see the real key values in any table... if they need a numerical reference, then use a surrogate key as an IDENTITY, but keep the real PK as
a Guid.
Krome, This thread is quite informative. 1. I didn't think deeply in this topic, but I got impression that using UniqueIdentifiers(GUIDs) is relatively more expensive with more benefits, while using Identity is less expensive with less benefits. Is that a true
statement? 2. If the statement above is true, can I say in general that using identity is enough, but in some scenarios, using UniqueIdentifiers(GUIDs) is preferred. Could you give me what are the scenarios that using UniqueIdentifiers(GUIDs) is a must? 3.
You talked about 'hotspot', I got one relative topic from MS TecNet at
Reducing Lock Contention in SQL Server. It seems this hotspot issue only happens for SQL 6.x and earlier. Is that true? I check SQL 6.5 book online, it said, this 'hotspot' area locking has been imptoved by Insert Row-level Locking (IRL). Could you give
me more explanation about this, thanks. 4. Finally, I have a puzzle here, that was discussed at nother thread
252169, but without a solution;
I am curious that although SQL is SET based, when we use SELECT without ORDER BY clause, it seems there is a certain order, or maybe say a default order, data records come out not in a random order. I have checked tables in NorthWind and pubs, I got 100%
output following the rule ORDER BY IDENTITY if there was an IDENTITY, otherwise, ORDER BY KEY with only one exception, the employee table in pubs. For the table without key, I didn't see a clear ORDER. However, this rule is not true for my working database.
Even I insert a test table into pubs 'test1' with an id and two other columns, it doesn't follow the rule, and I didn't find the rule, for example it doesn't ORDER BY, say, the UPDATE command.
The question is probably a little bit too much. But, I saw Krome's posts in several threads, those must be your Favorite subject. Hope I could get help from you. Thanks in advance.
MCAD.NET (C#)
MCP on: SQL Server 2k Design&Iimplementation, Web Development (VI 6.0) and VB 5.0
1. It really depends on what is most critical for the table in question. Only two cases where I would say that IDENTITY is a better choice - first would be when data storage size is absolutely critical, and every byte saved makes a difference (quite a rare
situation really), and second would be when you really have a good business reason to have primary key that is easy to read and remember (domain tables, i.e. "lookup" tables often qualify for this). For just about anything else, I prefer to use a GUID. Generally
speaking, I use uniqueidentifier as my default pkey field type, unless I find specific reason to go with int (more likely smallint/tinyint) in a particular situation. 2. Given the many nice features of GUID, and the fact that only an incredibly small fraction
of applications will have a real business reason to not use them, I would reword that to say "in general using uniqueidentifier is enough, but in some scenarios, using identity (int) is preferred". Some examples of where GUID is a must: a) The table is being
replicated using transactional replication (possibly other repl models too, but definately transactional). When you publish a table for replication, it will have a uniqueidentifier field added anyways.... so might as well use it from the start. b) The table
is assigned key values in another application tier, such as your DAL or fat-client. This is most often found in tables that participate in batched master-detail updates. Unique GUIDS can easily be generated at any application tier. c) The table is involved
in a "merge" process with other sources of the same data. This can be in the form of custom merge replication, or bulk file imports from another system. This can also be a result of system consolidation, where two companies using the same application merge
their business and need to now merge their databases... not very common, but I have seen it once before and having the GUID keys was a godsend. d) Here is another reason that I personally try to use GUID keys: When PK values are unique not only within a given
table, bot accross ALL tables in a database, this enables you to do something else thats quite nice... ***NOTE that this particular thing is NOT a good idea for anyone squeamish about large joins in queries***... you can create a generic "associations" table
that is capable of handling dynamic relational associations between ANY two entities in your system, using ANY defined role. The table generally looks like this:
CREATE TABLE DynamicAssociation(
AssociationID uniqueidentifier default(NewID()),
RoleName varchar(255) not null,
ParentTable varchar(255) not null,
DependantTable varchar(255) not null,
ParentID uniqueidentifier not null,
DependantID uniqueidentifier not null,
primary key (AssociationID)
)
You can also normalize out the RoleName, ParentTable, and DependantTable columns into another
AssociationRole table (where you can also add multiplicity rules etc). What this allows you to do is add new relationships to your logical data model, without causing DDL changes in the underlying database, which can really come in handy sometimes. For example,
if you suddenly decide that you now need a new relationship where Customers (ParentTable) have a "Secondary Contact" (RoleName) in Contacts (DependantTable). To establish the relationship, you dont have to change a single table structure, you just have to
add a new row in the DynamicAssociation table (and possibly AssociationRole) to descrube the relationship. Of course, the downside to this technique is that querying that information back out becomes more complicated, since you now have to join the Customers
and Contacts table "through" the DynamicAssociation table. The performance cost of this complex join is also not as bad as you might suspect, assuming good indexes are created on the association table. 3. IRL somewhat alleviates the hotspot, but does not solve
the core problem that all inserts are happening in the exact same page of the table storage. IRL only protects a page thats not being split... if it needs to be split, then page locks are aquired for the split process (splits are fairly rough btw - in high-volume
systems they need to be reduced as much as possible). Also, if you are inserting more than one row, the row lock can get escalated to a page lock anyways pretty quickly. And lastly, row locks arent exactly the "cat's meow"... just like any other lock in the
server, they require resources, force serialization of requests, and basically just slow things down a little. So basically, my opinion is that for most cases, I use the uniqueidentifier. The cost is really pretty minor in exchange for the many other benefits
possible. 4. I have not seen the post this came from, however.... from BOL: "An explicit ORDER BY clause for a SELECT statement is required to ensure any useful ordering of data. In addition, the exact results depend upon the collation being used. Expect different
results as compared to earlier versions of SQL Server. Add an explicit ORDER BY clause to all SELECT statements needing to produce ordered rows." In SQL Server 6.5 and prior, you would ALWAYS get rows (if not ORDER BY was specified) based on the physical ordering
in the table pages. In other words, the default order was by the Clustered index of the table (a clustered index defines the physical ordering on disk). I think this behavior may have even been stated in the 6.5 BOL. In SQL 7.0 and higher, this was changed.
It still *usually* returns records by default in the order specified in any Clustered index, however this is not at all guaranteed anymore. SQL 7 added the ability to have parallel threads reading a table to solve a query, and these threads read from different
segments, so without a defined output sorting order, they are returned in whatever order they happened to have been read. I think there also may be other issues in the new engine that can contribute to this too. Bottom line on this one: If you want your data
in any particular order, you MUST specify an ORDER BY clause (GROUP BY does not count either, I have seen that question come up before too).
I saw an article,
Pros and Cons of GUIDs as Primary Keys, which gives a good summary about this issue. Some topics are overlapped with krome's, but I think the most advantage in that article is the comparison is generally based on a programmer's view, instead of a DBA. In
the other words, it is much more understandable for programmer. I copied the
major points below (the numbers were added by myself): PROs for using GUID 1. We get a more object ID–like datatype because each primary key typically will be unique for the complete database. 2. @@IDENTITY has been a source of problems when INSERT
triggers have INSERTed rows into other tables with IDENTITYs. (You can use SCOPE_IDENTITY() in SQL Server 2000 to avoid this problem.) 3. You can move the creation processing of the values from the data tier to the business or consumer tiers. In the case of
.NET and ADO.NET, it's a big advantage to know the ID value at the time rows are added to the DataSet. Otherwise, for example, autogenerated INSERTs can't be used for master-detail relationships within the DataSet. 4.The span of values for GUIDs is enough
for most scenarios. INTEGERs are not big enough for some scenarios. 5. You can prepare for merge replication. When merge replication is to be used, each table to be replicated must have a GUID column (called ROWGUID in SQL Server). 6. IDENTITY can't be used
for tables in distributed partitioned views. 7. A GUID will not mean anything to the users. An INTEGER can be read, understood, and remembered. It's common to see INTEGER keys in the UI, and that might cause problems later. 8. You know what a nightmare it
can be when doing a manual merge between two tables with INTEGERs as primary keys.
CONs for using GUID 1. There is a huge overhead in size. As I said, a GUID is four times larger than an INT. This is very important when it comes to indexes. 2. Portability can be a problem because not every database has GUID as a native datatype. 3.
You can't use GUIDs with aggregates such as MIN. This is unfortunate when it comes to using WHILE plus an aggregate function, such as MIN or MAX for looping instead of a CURSOR. (Well, a workaround is to use a CONVERT or a CAST to CHAR(36) or BINARY(16) of
the GUID before it's used in the aggregate, but watch out for performance problems with large tables.) 4. It's detrimental to developer-friendliness because it's really hard to edit GUIDs manually, to remember them when you browse the database, and so on.
Where the Pros 5, 6, 8 and Cons 2 are quite DBA related, others are all programmer related issues.
MCAD.NET (C#)
MCP on: SQL Server 2k Design&Iimplementation, Web Development (VI 6.0) and VB 5.0
After further understanding for PROs and CONs of UniqueIdentifiers(GUIDs), as a programmer, I would stick back to the statement that
In general, using identity or natural key (something like Social Security Number, it is unique by its nature) is enough; but in some scenarios, using UniqueIdentifiers(GUIDs) is preferred. Supporting 1: Analysis of the PROs and CONs From the PROs and
CONs above, the disadvantages are obvious, the overhead in both programming and performance, see this article for some analysis regarding to performance
The Cost of GUIDs as Primary Keys. Let's see the PROs: As I said, PRO 5, 6 and 8 are DBA concerned. So, if DBA requested, then there would be a 'must' reason to use UniqueIdentifier. PRO 1 is an invisible benefit; PRO 2 could be well handled with care;
PRO 4 is not valid in most most cases where the row number is more than 9,223,372,036,854,775,807. PRO 7 is another invisible benefit. Then, only PRO 3 is a good reason to use UniqueIdentifier. But, it is not a must case, like the one DBA requested. You probably
need to balance the PROs and CONs, then make the decision. So, it seems, without request from DBA, one almost will never use the UniqueIdentifier with certain exceptions. Supporting 2: Statistics I have checked all my prvious used database, SQL 7/2000 default
database, Visual Studio 5, 6, .NET sample database, including the Starter Kits and this forums, more than 40 database (I use a procedure to check the key word 'UniqueIdentifier'), and I found there is no one database using 'UniqueIdentifier', except the system
database 'master' and 'msdb'. By the way, in SQL 6.5, there was no such 'UniqueIdentifier' at all.
Conclusion, UniqueIdentifier is somewhat system or DBA used or preferred stuff, instead of programmer.
MCAD.NET (C#)
MCP on: SQL Server 2k Design&Iimplementation, Web Development (VI 6.0) and VB 5.0
Interesting conclusions, and I did read the article you posted. It seemed like a fairly well-done performance test, however in my own experience I have not ever seen a 10x performance penalty for inserts using guids. The differences that I have seen have only
been maybe 20% when dealing with massive insert streams. I have never seen any real performance issues when querying guid keyed tables - but then again I always pay very close attention to index strategies and query plan optimizations. One thing that concerned
me about the article is the statement on the first page that played down the impact of cascaded updates. You can easily end up with a million rows being affected by an update to a single row this way... and it also plays havoc with "audit stamp" triggers that
may be in place on the tables. These issues were totally ignored by the article, although in my experience they are extremely important in database design (much more of a problem with natural keys than identity keys though). Also, in the real world, the indexes
on tables will be much more complex. Its not uncommon to see "covering" indexes that include several hundred bytes of fields. In these situations, the extra 12 bytes of a guid as opposed to an int in the index is insignificant. And from a programmer perspective
(I am actually more programmer than DBA too) - I am not too sure you are giving adequate weighting to PRO 3. When writing business-tier components, this feature is HUGE. Any time you deal with partially disconnected data of even modest complexity, the ability
to generate keys in high level tiers is a huge programming and design timesaver. In almost all of my designs, I start with Uniqueidentifier for PKs, and only switch to int or "natural" keys when there is a specific need demanding it.
About the paper, I agree it might over emphasize the overhead of UniqueIdentifier, especially when the author tried to introduce his own method. I'd rather believe his test, but one million inserts are rarely used in real world. In normal case, I believe what
you said, logically, otherwise, 10 times performance penalty would kill the way from very beginning. I am curious about your opinion about my
Supporting 2: Statistics. Besides your personal experience and preference, could you give some examples that use UniqueIdentifier? could you give some estimated statistics from your experience? That would be beneficial for us to see the issue with a
different or possibly a whole view.
MCAD.NET (C#)
MCP on: SQL Server 2k Design&Iimplementation, Web Development (VI 6.0) and VB 5.0
vengeance
Member
45 Points
9 Posts
Seeded(int) Identity columns vs. UniqueIdentifiers(GUIDs)
Jul 01, 2003 05:08 AM|LINK
krome
Member
190 Points
38 Posts
Re: Seeded(int) Identity columns vs. UniqueIdentifiers(GUIDs)
Jul 01, 2003 03:56 PM|LINK
MCSD, MCDBA, MCAD, CIC
ghan
Participant
1440 Points
288 Posts
Re: Seeded(int) Identity columns vs. UniqueIdentifiers(GUIDs)
Jul 02, 2003 02:23 PM|LINK
MCP on: SQL Server 2k Design&Iimplementation, Web Development (VI 6.0) and VB 5.0
krome
Member
190 Points
38 Posts
Re: Seeded(int) Identity columns vs. UniqueIdentifiers(GUIDs)
Jul 04, 2003 06:21 PM|LINK
MCSD, MCDBA, MCAD, CIC
ghan
Participant
1440 Points
288 Posts
Re: Seeded(int) Identity columns vs. UniqueIdentifiers(GUIDs)
Jul 07, 2003 02:43 PM|LINK
MCP on: SQL Server 2k Design&Iimplementation, Web Development (VI 6.0) and VB 5.0
ghan
Participant
1440 Points
288 Posts
Re: Seeded(int) Identity columns vs. UniqueIdentifiers(GUIDs)
Jul 14, 2003 08:28 PM|LINK
MCP on: SQL Server 2k Design&Iimplementation, Web Development (VI 6.0) and VB 5.0
ghan
Participant
1440 Points
288 Posts
Re: Seeded(int) Identity columns vs. UniqueIdentifiers(GUIDs)
Jul 15, 2003 01:07 AM|LINK
MCP on: SQL Server 2k Design&Iimplementation, Web Development (VI 6.0) and VB 5.0
ghan
Participant
1440 Points
288 Posts
Re: Seeded(int) Identity columns vs. UniqueIdentifiers(GUIDs)
Jul 15, 2003 03:34 AM|LINK
MCP on: SQL Server 2k Design&Iimplementation, Web Development (VI 6.0) and VB 5.0
krome
Member
190 Points
38 Posts
Re: Seeded(int) Identity columns vs. UniqueIdentifiers(GUIDs)
Jul 16, 2003 08:49 PM|LINK
MCSD, MCDBA, MCAD, CIC
ghan
Participant
1440 Points
288 Posts
Re: Seeded(int) Identity columns vs. UniqueIdentifiers(GUIDs)
Jul 17, 2003 02:17 AM|LINK
MCP on: SQL Server 2k Design&Iimplementation, Web Development (VI 6.0) and VB 5.0