database primary key problems

Last post 07-01-2005 8:48 AM by Caddre. 8 replies.

Sort Posts:

  • database primary key problems

    06-01-2005, 9:56 AM
    • Member
      507 point Member
    • kakusei
    • Member since 09-10-2003, 3:41 AM
    • Posts 193
    Hi all,
         i'm having problem in deciding on whether to use a natural key or a surrogate key as the primary key if i want to create the following 3 tables.  
    The tables are LoginInfo, ContactDetails, and ResumeInfo.
    The LoginInfo table will contain the following fields :- login id, First Name, Last Name, Email, and password.
    The Contact Details table will contain :- contact detail id, login id, First name, last name, address, suburb, phone number, state, country, and email
    The ResumeInfo table will contain: - resume id, login id,  resumeTitle, workexperience, skills, education, etc.....

    Now what i want to know is that should i use the natural key (email) as the primary key for the logininfo table or use a surrogate key(login id)?
    if i use  a natural key as the primary key then what natural key can i use for the contact details and resumeInfo tables?

    if i choose to use a surrogate key as the primary key for all three tables. then should i use an auto increment column or insert a random number of say 8 digits into the column?

    can somebody give me some guide on this thank you !!!

  • Re: database primary key problems

    06-01-2005, 1:22 PM
    • All-Star
      55,155 point All-Star
    • DarrellNorton
    • Member since 04-04-2003, 3:49 PM
    • VA, USA
    • Posts 6,628
    • Moderator
      TrustedFriends-MVPs
    My preference would be an auto increment column.  I wouldn't make a key based on email address since email addresses change, and even with SQL Server's cascade update feature that's a lot of database changes for something that will probably change often.  I would advise against the random number simply because I don't think it buys you anything here.
    Darrell Norton, MVP
    Darrell Norton's Blog


    Please mark this post as answered if it helped you!
  • Re: database primary key problems

    06-01-2005, 1:38 PM
    • All-Star
      26,551 point All-Star
    • Caddre
    • Member since 06-23-2003, 9:53 AM
    • Indy
    • Posts 5,308

    Surrogate keys are not for OLTP(Online Transaction Proccessing ) databases they are for OLAP (Oline Analiytical Processing) and OLAP databases are Denormalized and read only.  Try these links for the two options you have, I am assuming you know about Identity field and Foreign Key mapping.  Hope this helps.

    http://codebetter.com/blogs/sahil.malik/archive/2004/12/16/37360.aspx

    http://martinfowler.com/eaaCatalog/

    Kind regards,
    Gift Peddie
  • Re: database primary key problems

    07-01-2005, 1:28 AM
    • Member
      507 point Member
    • kakusei
    • Member since 09-10-2003, 3:41 AM
    • Posts 193

    Hi All,

       i want to know what would be the best datatype or format to use as a primary key for a very large user's table say expecting to be around 1,000,000 records or more? would an identity field be a good choice or a CHAR field of 8-10 random numbers be better?  if indentity field is the choice then what is the best seed to start with?  if a CHAR field is better then how should i generate the random numbers? and would 8-10 random numbers be enough for that much records and more??  OR is there other ways to generate the primary keys for that much records? And also i want to know what format of a primary key would give the best performance eg. a series sequential numbers like 1000001,1000002,100003, etc... or some random numbers?

    best regards 

  • Re: database primary key problems

    07-01-2005, 5:06 AM
    • Member
      10 point Member
    • thona
    • Member since 08-16-2002, 2:26 AM
    • Posts 2,923
    First a clarification:

    a very large user's table say expecting to be around 1,000,000 records

    10 years ago there was this joke that Access was no proper database because you could not deal with 1 million rows in a table. At this time, Foxpro did handle this amount without sweating.

    Today, one million records in a table is nt worth mentioning. One thing it clearly is not, though, and this is very large. It is not even large. It is non-trivial anymore (i.e. better know what a database is and read up what an index is), because at this size table scans will have nasty side effects. But you do notneed to do any of the professional stuff that you need to make to handle very large tables, like partitioning and replication. I would personally say that a very large table propably starts at around 1000 times this data (i.e. one billion rows), and large at possibly 10 times.

    Now, for a SMALL table that contains about one million rows, there is no technical need NOT to use an identity field, especially as the data is not heavy written. Users may get created, but not by hundreds of thousands an hour, so even if you get replicating, you can handle all inserts on one location (identity fields pretty much say goodbye if you get into replication).

    i want to know what would be the best datatype or format to use as a primary key

    I normally use a GUID for this type of data. It is easily generated, it is unique, it is realtively small (16 bytes ONLY) and has enough room for the users to seriously grow. Not taking a GUID, in the given case, I would assign it an int identity field - with the int being 32 bits signed you ahve enough room to grow a little more than just one million entries.

    And also i want to know what format of a primary key would give the best performance eg. a series sequential numbers like 1000001,1000002,100003, etc... or some random numbers?

    Empirical evidence (done to get a feasibility on the speed of the GUID as primary data type) suggests that random inserts (not updates) are lower by a 10:1 factor compared to linear inserts of identity fields (and only about 10% slower for read operations), due to the necessity to frequenty rebalance large parts of the corresponding index. Common sense, in most parts, will tell you that this is totally irrelevant for a user table. You will not insert hundreds of thousands of users per hour (after the initial load, maybe, if you transfer an old system). I would suggest that if your user table is like all other user tables I have seen so far, you will proapbly read it 1000 times more than you insert into it. As such, you have to stand the question whether you really care about insertion speed, or whether you basically fall into the typical developer trap of premature optimization, and try to optimize an item that is just irrelevant in the light of the complete application. Especially in the area of user management my stomach says that this will be totally irrelevant for the application performance.

    One thing yo have to be aware of when working with random inserts is the possibility of doubles. A random generator may generate the same number twice, which you have to handle. You can only ignore it if your number space is large enough that you basically are more lkely to win in the lottery than to get two doubles (like with the GUID, which IIRC has more numbers available than there are atoms in the universe or something like this). Too many double values and your app starts looking pretty bad, especially if you ignore the item and let the error propagate to the end user (which is likey unless you make mass tests with real amounts of data).

    What I would NOT take, if I had any choice, is a primary key that is a CHAR (or VarChar) field. Numbers are not strings. This is really the slowest of all possible solutions, takes up the most space and in general will make people have a good laugh in this case.
    Thomas Tomiczek
    PowerNodes ApS
    (Microsoft MVP C#/.NET)
    ---
    Building a Website? Try the PowerNodes CMS - http://www.powernodes.com/
  • Re: database primary key problems

    07-01-2005, 5:42 AM
    • Contributor
      3,290 point Contributor
    • Qnnn
    • Member since 12-02-2003, 6:05 AM
    • Estonia, Tallinn
    • Posts 658

    Check MS SQL Server data types and their lengths here:

    http://www.databasejournal.com/features/mssql/article.phpr/2212141

    Seems like int is more than enough for you as the maximum number it can hold is 2147483647 and you are talking about couple of million of rows. Even for a pretty long time period (considering users leaving and coming) the int should be enough for you.

    Gunnar Peipman

    -- General Protection Fault
    -- Illegal Page Fault
    -- Not My Fault
  • Re: database primary key problems

    07-01-2005, 6:22 AM
    • Member
      10 point Member
    • thona
    • Member since 08-16-2002, 2:26 AM
    • Posts 2,923
    Now what i want to know is that should i use the natural key (email) as the primary key

    And you loose this round.

    The email address of a user is not a natural key. Natural keys should have a certain stability, and this is definitly not given.

    Actually it is so bad that using the email address as login screen name may be a VERY bad idea. If a user looses access to an email address for wahtever reason (like provider going down), he can not even change his password as you would send the changed password to his old email address. You would have to deal with primary keys changing, which would or could result in cascading updates - some of which you may not be able to do, like into old log files (who was this user - I do not find him in the database?).

    This is the same reason, like, for cars the template is a bad primary keys (changes), the Social Security Number in the US is a bad key (they DO get reused, you know), phone numbers are bad natural keys.

    if i use  a natural key as the primary key then what natural key can i use for the contact details and resumeInfo tables?

    Besides me not liking natural keys and saying you should actually not use one here, I would let the user define a login name when he signs up. Preferrably one without a "@", so he can also log in with his registered email address. Then you could use this in the related tables.

    if i choose to use a surrogate key as the primary key for all three tables. then should i use an auto increment column or insert a random number of say 8 digits into the column?

    Let me formulate it like this: the later would show everyone you ahve not read the database documentation.

    I would go either with an autoincrement field, or - my favourite - a GUID field. GUIDÄs have disadvantages, but their main advantage is that they are UNIQUE over everything.
    Thomas Tomiczek
    PowerNodes ApS
    (Microsoft MVP C#/.NET)
    ---
    Building a Website? Try the PowerNodes CMS - http://www.powernodes.com/
  • Re: database primary key problems

    07-01-2005, 6:46 AM
    • Member
      507 point Member
    • kakusei
    • Member since 09-10-2003, 3:41 AM
    • Posts 193
    Hi Qnnn,
         Thanks for replying.  for my question if i want to have a fix length of primary key i would then need to set the seed to start at 1,000,000 right? since it would be impossible to grow up to 10,000,000 user records. So if i start my seed at 1,000,000 then all of the will start will 1000..... until i reach to 2,000,000.  So if with all these records starting with just 100....something , wouldn't that be inefficient to search throught the records??  do u or does anyone have any links to any sites or articles that's about the performance issues of the datatype formats of database columns?

    regards 
  • Smile [:)] Re: database primary key problems

    07-01-2005, 8:48 AM
    • All-Star
      26,551 point All-Star
    • Caddre
    • Member since 06-23-2003, 9:53 AM
    • Indy
    • Posts 5,308

    Try these links for some information about GUIDs and alternatives and how to create a fake Sequence which is implemented in Oracle but not SQL Server.  If you have any specific questions I can answer them, we were asleep in Texas when you posted.  Hope this helps.

    http://sqldev.net/xp/xpguid.htm#Known_limitations

    http://www.informit.com/articles/article.asp?p=25862&rl=1

    Kind regards,
    Gift Peddie
Page 1 of 1 (9 items)