Login

06-06-2009, 03:56 PM

[eluser]dcheslow[/eluser]
This is sort of a MySQL topic, but I am posting it here because many CI libraries and classes ASSUME that primary keys are always integers. Mine are not and it makes some libraries unusable for me.

Why I use UUIDs for primary/foreign keys

I don't mean to start a religious war, but I've noticed that many developers still use AutoIncrement int fields as primary/foreign keys. I find this curious because I stopped using integers as keys a couple years ago and have never regretted it. Perhaps you didn't even know that there is an alternative.

Rather than integer primary key fields I use UUIDs (universally unique IDs). A UUID is a 36 character hexadecimal string with 4 embedded dashes...which means that it represents 16^32 possible values (a very big number, by any measure). UUID generators use a special algorithm to ensure that it is extremely unlikely that a collision ever occurs. You can treat UUIDs as disposable... you'll never run out (certainly not in the lifetime of your application), and no two will ever be the same.

The easiest way to generate a UUID is to let MySQL do it for you:

SELECT UUID();

The main advantage to using UUIDs is that they are not just unique within a table (like AutoIncrement) but within a database... and even across databases. This is great whenever you combine data from different tables or databases. For example, when using UNION, merging two tables into one or restoring a partial backup.

Another reason you might want to use UUIDs is that, because they are strings, they are usable as file/folder names and javascript ids. I use this feature often when caching results, uploading files, generating temporary files or building forms. One little 'gotcha' that happened to me once... when passing UUIDs as parameters of javascript functions, be sure to quote them. If the UUID up to the first dash just happens to be all digits, then javascript will try to perform subtraction unless it's quoted. That's a pretty hard bug to find. You should quote all function string parameters anyway, so this is not really a burden, just a reminder not to get sloppy.

The main disadvantages to using UUIDs are performance and disk/memory space. If you're working with terabytes of data, then integers are definitely a better way to go. If you're building a batch processing system, then by all means use integers. But almost all of what I've been programming lately is transaction based, where an extra dozen milliseconds or 20 more bytes of disk space here or there isn't ever noticed.

There is really no reason that a UUID must be stored as a 36 character VARCHAR... it could be stored in a 12 byte unsigned integer - which is what it really is. To keep them useful as file/folder/id values, there needs to be an (automatic?) conversion to/from hexadecimal. Several people have posted PHP solutions like this, but it really needs to be in the database code to be efficient. Truly effective support for this would require the introduction of a new datatype in SQL which could happen but I'm not holding my breath.

One final note, you may see UUIDs referred to as GUIDs... for all intents and purposes, they are equivalent.

So... there you have it. I hope this makes you think about switching from integer to UUID primary key fields. I really like them. If you're not convinced to use them yourself, then I hope you will at least consider that not everyone uses integers as primary keys when you are programming your libraries.

=dave=

06-06-2009, 04:10 PM

[eluser]Dam1an[/eluser]
First up, I'm going to say I use auto increment ID's, not cause I don't know any better, but cause I prefer not to have either strings or UUIDs as a primary key
Lets say you've got any sort of system where you access 'stuff' by passing the ID in the URL, such as users/123, much nicer then users/3857ab365-3857ab365-3857ab365-3857ab365
(another alterntive in the above example would be using the email as the PK, definatly don't want that as a link)

Also, you say it's very unlikely to get a collision, and they're unique accross databases as well, surely there will be a clash eventually (and without exhausting all 16^32 possibilities)

And lastly, going back to your main point, people who write libraries are unlikely to cater for a small number of users that use UUIDs when they have a much, much larger portion of users using standard auto increment IDs

06-06-2009, 04:28 PM

[eluser]TheFuzzy0ne[/eluser]
I was under the impression that digits were quicker to index.

I agree with Dam1an. Having a simple integer ID is so much less error prone, especially when using auto increment. In my mind, integers are more natural, and I think that databases have been designed with having an integer ID. PHP offers mysql_insert_id() to grab the last insert ID, but there is no mysql_uuid() or equivalent. I can't see me switching any time soon. Smile

Thanks for posting.

06-06-2009, 05:01 PM

[eluser]n0xie[/eluser]
Random 5 reasons we use integers:
1. Performance (joins/ indexing)
2. Size (just how big are your tables?!?!)
3. Int is a native data type, therefor faster with any form of computation. It's much easier to validate/sanitize if it is passed from an URL making SQL injection much easier to prevent. Santizing strings is like a box of chocolates: you'll never know what you're getting.
4. Readability. Where id = 123 vs id = '{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}'. Also if the id is the primairy key and auto incremented it gives you as a developer an indication when it was inserted. Sort by id DESC to see the last few entries. Try that with GUID.
5. Usability: http://domain.tld/blog/123 vs http://domain.tld/blog/BAE7DF4-DDF-3RG-5TY3E3RF456AS10. 'Clever' users will know that the next blog item will be 124 and the previous blog item will be 122. As for GUID...

06-06-2009, 07:02 PM

[eluser]gtech[/eluser]
GUIDs as row ids... what an interesting idea, to be fair the decheslow has said that the big disadvantage is performance.

When I update a database row I usually keep a separate table of who/when and what, so I know when a rows been inserted and who insterted it.

On the usuability issue, not all applications pass ids through the url.. and to be fair mabee I dont want a user to be able to guess ids, urls do not really have to look 'friendly'

as to the argument about there being a clash eventually... I not sure with 16^32 possible combinations it is quite unlikely on a database with even a million ids.

I think the biggest benifit is if you were considering merging databases, or even distributing data.

Thanks for the heads up, I will probably use normal ids for now, but I might look into other projects that use GUIDS.

[url="http://databases.aspfaq.com/database/what-should-i-choose-for-my-primary-key.html"] interesting article [/url]

06-06-2009, 07:44 PM

[eluser]dcheslow[/eluser]
Great feedback, thanks!

OK.. so it seems the only disadvantage that I did NOT mention was the readability thing. I concede the point, but...

I'm, personally, against making URLs too editable/readable. If I, as the developer, want the user to be able to go forward or backward one record in the database, then I'll provide a way to do that. The only person who should be hacking URLs is me. A naughty person might be much more tempted to hack /user/123 than /user/BAE7DF4-DDF-3RG-5TY3E3RF456AS10.

Collisions are very (actually... very very) unlikely with a number this large... 16^32 is more than 10 with 37 zeroes after it (number of molecules on earth ~= 10^49). The UUID algorithm includes things like machine ID and current timestamp and some other stuff to ensure uniqueness. If you use a long integer as a key, then you have less than 3 with 6 zeroes after it possible values, if each table of each database of each server starts with 1, then the odds of a collision are pretty good...comparatively speaking.

We shouldn't hard code primary key values anywhere, so making primary keys readable doesn't impact coding at all.

As for library developers only targeting the majority audience... well, if we all coded to the most common browser, then we'd all be using IE <perish the thought>.

gtech raises another significant benefit. He (and I) maintain metadata separately about rows inserted/edited/deleted. I used to have insertedat, insertedby, etc. in every table... but UUIDs make it even simpler. I simply keep one table called 'metadata' which contains recordid,insertedat,insertedby,etc.... Since UUIDs are guaranteed to be unique, I can join to this table with confidence. In fact, UUIDs make all 1:1 relationships totally foolproof.... and 1:M relationships much less likely to screw up.

06-07-2009, 07:06 AM

[eluser]slowgary[/eluser]
I use integers as well, but have looked at UUIDs. Mainly the reason that I hadn't switched was that I feel comfortable with using integers already, but I hadn't really looked at pros and cons so this thread helps.

I can add to the argument of uniqueness though. UUIDs are unlikely duplicated, but that doesn't mean it's not possible. If you handle the possibility of duplication then you're okay. BUT, you can't really argue that they're more unique or less likely to collide as integer IDs, because integer IDs are auto-incremented, so there will NEVER be a collision, and you don't need to write the code to handle one.

I do see where UUIDs would make combining tables easier, but I've never had need to combine tables and if I did I'd probably just have to write a script to handle the IDs.

I think for simplicity I will stick with integers but if I had a need for a bit of extra security/obscurity of data in an app I will consider UUIDs.

06-07-2009, 07:11 AM

[eluser]TheFuzzy0ne[/eluser]
If I had to obfuscate a key, I'd probably just add an extra field for it, so you could query on the obfuscated UUID, and join on the primary key. I'm fairly sure that would be faster than joining on the UUID.

06-07-2009, 07:32 AM

[eluser]slowgary[/eluser]
That makes more sense to me. Then you can have a pretty ID and an ugly ID, and the pretty ID can choose to only be friends with other pretty IDs. Although sometimes, the pretty IDs will befriend an ugly ID just so that they look prettier by comparison. :lol:

06-07-2009, 09:11 AM

[eluser]jedd[/eluser]
Hi Dave,

An interesting thread - thank you.

A very competent programmer friend of mine also advocates the use of GUIDs everywhere. I'm not yet convinced, though, as they seem to require a bit more effort, make some things more complicated, and I don't see compelling advantages to offset that extra pain.

Quote:Another reason you might want to use UUIDs is that, because they are strings, they are usable as file/folder names and javascript ids.

Of course, tablename/id - is also unique for folder/filename use.

While I concur that for most of us, we should let the performance and storage penalty of working with GUIDs rather than integers, I'm also wary of fully adopting the 'let the hardware deal with my inability to optimise' approach. I don't think it's necessarily inelegant, but it does seem to be a lazy way of thinking.

Quote:I'm, personally, against making URLs too editable/readable.

I guess with a lot of this (web) stuff (programming) there's the expectation that the debugging phase of the project will last the entire lifetime of the project, so it's very handy to be able to edit the URL's forever, and expect some sanity and predictability while doing so.

I think if you're relying on a GUID's obfuscation for your security, then you're missing the point of security.

I also think any obfuscation should not replace a genuine security model - viz:

Quote:A naughty person might be much more tempted to hack /user/123 than /user/BAE7DF4-DDF-3RG-5TY3E3RF456AS10.

Further, I think that any obfuscation of user ID's should be done closer to the view, than the model, and not just for debugging reasons, but it just seems to be a neater solution to that problem.

Quote:Collisions are very (actually... very very) unlikely with a number this large...

I grant you that collisions with these are unlikely. However, SERIAL types in MySQL, for example, are one better - they are impossible. While your UNIQUE attribute of your PK will protect you, it means you need additional code wrapped around the INSERT to ensure this, even if 'just in case', and that's code that an auto-increment approach doesn't need.

Quote:We shouldn't hard code primary key values anywhere, so making primary keys readable doesn't impact coding at all.

I accept this claim, also, but suggest that it's common to want to type something in during debugging (again, not just during the development stage of the project) and it's much easier to play with users 5, 6, 7 and 8, than BAE7DF4-DDF-3RG-5TY3E3RF456AS10, BAE7DF4-DDF-ERG-5TY3E3RF456AS10, BAE7DF4-DDF-3RG-51Y3E3RF456AS10 and BAE7DF4-DFD-3RG-5TY3E3RF456AS10.

Quote:As for library developers only targeting the majority audience... well, if we all coded to the most common browser, then we'd all be using IE <perish the thought>.

OTOH, we'd all be coding PHP for Apache .. so it's a tricky call to make there.

Quote:gtech raises another significant benefit. He (and I) maintain metadata separately about rows inserted/edited/deleted. I used to have insertedat, insertedby, etc. in every table... but UUIDs make it even simpler. I simply keep one table called 'metadata' ...

I've been thinking about how to log changes to my DB, given the huge number of tables that this thing will be managing, and ended up (probably very logically) at this approach. However, I just assumed it'd be something like: ACTIVITY_LOG ( id SERIAL , table_name CHAR (60) , activity CHAR(300)) - or similar. In other words, I'm not convinced that tracking back a UUID - which would mean having to find which table that refers to - is easier than my approach here. Again, as cited elsewhere in this thread, there are significant advantages to being able to look at the ID and gain some insight into the relative age of that record.