SQL database question - Primary keying with an 'id' column

defenseman · Jun 12, 2011

I like to give my tables an 'id' field which serves as each row's unique identifier/primary key (no other foreign keys or indexes).

My question is, whats the best way for obtaining that row's 'id' field after adding the item with auto-increment (meaning I don't specify the new row's id when adding the row)?

I could try and search for that row via other columns right after I add it, but that won't always give me the row that was just added because two or more rows could have that same data.

I could also try and figure out the new row's id field before adding the item, but that's not perfect either if I have a large number of users adding data to the database as the 'same' time.

calebb · Jun 12, 2011

defenseman said:
I like to give my tables an 'id' field which serves as each row's unique identifier/primary key (no other foreign keys or indexes).

My question is, whats the best way for obtaining that row's 'id' field after adding the item with auto-increment (meaning I don't specify the new row's id when adding the row)?

I could try and search for that row via other columns right after I add it, but that won't always give me the row that was just added because two or more rows could have that same data.

I could also try and figure out the new row's id field before adding the item, but that's not perfect either if I have a large number of users adding data to the database as the 'same' time.

SELECT @@identity

Now... your modus operandi is not ideal in very many circumstances.

I'm sure you're familiar with the clustered index; This represents the physical order of the data on your disk. {insert 3 pages of discussion on the ramifications of sequentially ordered data, RAID controllers and disk cache}.

Let's talk about your other indexes. Let's say you have an Employee table with a nonclustered index on LastName. Your nonclustered index actually contains your clustering key as well. Would you want your LastName indexes to be filled with an autoincrement id (that probably isn't relevant to anything other than that table itself) or maybe something like EmployeeID?

To put this to the test, let's imagine you are looking for all Employees with a last name like 'B%' and you want all their corresponding addresses.

You might write a query like

SELECT Employee.LastName, Address.Address1
FROM Employee E
INNER JOIN Address A ON E.EmployeeID = A.EmployeeID
WHERE E.LastName like 'B%'

First let's consider what happens if you used an autoincrement id:
1. Seek on the LastName index to the "B" section.
1b. Start reading data; The index contains LastName and your clustered index, id.
1c. Seek on the clustered index to find the location of the rest of data in this record
2. (depending on join algorithm) probe the Address table for the EmployeeID for the first result
2a. If you have an EmployeeID index, this might or might not be used (depending on the cardinality of the Address table)
2b. The EmployeeID index (if it is just on EmployeeID) will have the Address "id" reference in it's leaf node.
2c. Seek on the clustered index to find the rest of the data in this record (LastName)

Now, contrast that with an EmployeeID clustered index on Employee table and an AddressID index on Address (which is probably an autoincrement as you normally use since an Employee can have multiple addresses).

1. Seek on the LastName index to the "B" section.
1b. Start reading data; The index contains LastName and the clustered index, EmployeeID.
1c. No need to perform a seek on the clustered index - we already have the EmployeeID so we're already ready to jump to the address table!
2. (depending on join algorithm) probe the Address table for the EmployeeID for the first result
2a. If you have an EmployeeID index, this might or might not be used (depending on the cardinality of the Address table)
2b. The EmployeeID index (if it is just on EmployeeID) will have the AddressID "id" reference in it's leaf node.
2c. Seek on the clustered index to find the rest of the data in this record (LastName)
3. Perhaps you want to further improve this? Consider a composite clustered index on EmployeeID, AddressID. Depending on your workload, this could cut out an extra seek when joining on EmployeeID.

As you can see, in many cases it makes sense to use a meaningful column for your PK/clustered index - you can save your db engine a lot of work!

It's VERY easy in SQL Server to see this in action - pretty graphs and all. In SSMS, press ctrl-m prior to executing your query. (Include Actual Execution Plan). In your results window, you'll have boxes representing each operation - seek, scan, join, etc. Hover over the operation and look for data like "number of rows" and "number of time performed."

krogen · Jun 12, 2011

After you insert you could determine the primary key of that said tuple with max, something like:

select max(myPrimaryKeyName) from MyTable where this=that (and so forth)

This might not be ideal either, since I guess theoretically it could give you a wrong result if a lot of entries are added at "the same time".

Is there no other field that you could add to uniquely identify each record? What are the columns, if I may ask?

Whatsisname · Jun 13, 2011

this is your real problem. Solve that first.

defenseman said:
I could try and search for that row via other columns right after I add it, but that won't always give me the row that was just added because two or more rows could have that same data.

calebb · Jun 13, 2011

Whatsisname said:
defenseman said:

I could try and search for that row via other columns right after I add it, but that won't always give me the row that was just added because two or more rows could have that same data.

Click to expand...

this is your real problem. Solve that first.

Good observation; A "uniquifying" field like an autoincrement id column can sometimes be a sign of a bad data model. This isn't always the case though... need more details from the OP to be sure.

msny · Jun 13, 2011

You will need to add other unique columns to the table.
Possibly adding a datetime and a username for instance.

Pwyl_The_Destroyer · Jun 13, 2011

You don't say what database you're using, but in SQL-Server, I would use scope identity:
http://msdn.microsoft.com/en-us/library/ms190315.aspx

FrEaKy · Jun 13, 2011

calebb said:
SELECT @@identity

Now... your modus operandi is not ideal in very many circumstances.

I'm sure you're familiar with the clustered index; This represents the physical order of the data on your disk. {insert 3 pages of discussion on the ramifications of sequentially ordered data, RAID controllers and disk cache}.

Let's talk about your other indexes. Let's say you have an Employee table with a nonclustered index on LastName. Your nonclustered index actually contains your clustering key as well. Would you want your LastName indexes to be filled with an autoincrement id (that probably isn't relevant to anything other than that table itself) or maybe something like EmployeeID?

To put this to the test, let's imagine you are looking for all Employees with a last name like 'B%' and you want all their corresponding addresses.

You might write a query like

SELECT Employee.LastName, Address.Address1
FROM Employee E
INNER JOIN Address A ON E.EmployeeID = A.EmployeeID
WHERE E.LastName like 'B%'

First let's consider what happens if you used an autoincrement id:
1. Seek on the LastName index to the "B" section.
1b. Start reading data; The index contains LastName and your clustered index, id.
1c. Seek on the clustered index to find the location of the rest of data in this record
2. (depending on join algorithm) probe the Address table for the EmployeeID for the first result
2a. If you have an EmployeeID index, this might or might not be used (depending on the cardinality of the Address table)
2b. The EmployeeID index (if it is just on EmployeeID) will have the Address "id" reference in it's leaf node.
2c. Seek on the clustered index to find the rest of the data in this record (LastName)

Now, contrast that with an EmployeeID clustered index on Employee table and an AddressID index on Address (which is probably an autoincrement as you normally use since an Employee can have multiple addresses).

1. Seek on the LastName index to the "B" section.
1b. Start reading data; The index contains LastName and the clustered index, EmployeeID.
1c. No need to perform a seek on the clustered index - we already have the EmployeeID so we're already ready to jump to the address table!
2. (depending on join algorithm) probe the Address table for the EmployeeID for the first result
2a. If you have an EmployeeID index, this might or might not be used (depending on the cardinality of the Address table)
2b. The EmployeeID index (if it is just on EmployeeID) will have the AddressID "id" reference in it's leaf node.
2c. Seek on the clustered index to find the rest of the data in this record (LastName)
3. Perhaps you want to further improve this? Consider a composite clustered index on EmployeeID, AddressID. Depending on your workload, this could cut out an extra seek when joining on EmployeeID.

As you can see, in many cases it makes sense to use a meaningful column for your PK/clustered index - you can save your db engine a lot of work!

It's VERY easy in SQL Server to see this in action - pretty graphs and all. In SSMS, press ctrl-m prior to executing your query. (Include Actual Execution Plan). In your results window, you'll have boxes representing each operation - seek, scan, join, etc. Hover over the operation and look for data like "number of rows" and "number of time performed."

Thats a pretty nice in-depth guide to do that. Good write up!

ameoba · Jun 14, 2011

If you're using MySQL (which I'm assuming from the 'autoincrement' terminology) there's a function to get the 'last insert id'. What, exactly, it's called varies depending on your DB library and language.

defenseman · Jun 14, 2011

calebb said:
SELECT @@identity

Now... your modus operandi is not ideal in very many circumstances.

I'm sure you're familiar with the clustered index; This represents the physical order of the data on your disk. {insert 3 pages of discussion on the ramifications of sequentially ordered data, RAID controllers and disk cache}.

Let's talk about your other indexes. Let's say you have an Employee table with a nonclustered index on LastName. Your nonclustered index actually contains your clustering key as well. Would you want your LastName indexes to be filled with an autoincrement id (that probably isn't relevant to anything other than that table itself) or maybe something like EmployeeID?

To put this to the test, let's imagine you are looking for all Employees with a last name like 'B%' and you want all their corresponding addresses.

You might write a query like

SELECT Employee.LastName, Address.Address1
FROM Employee E
INNER JOIN Address A ON E.EmployeeID = A.EmployeeID
WHERE E.LastName like 'B%'

First let's consider what happens if you used an autoincrement id:
1. Seek on the LastName index to the "B" section.
1b. Start reading data; The index contains LastName and your clustered index, id.
1c. Seek on the clustered index to find the location of the rest of data in this record
2. (depending on join algorithm) probe the Address table for the EmployeeID for the first result
2a. If you have an EmployeeID index, this might or might not be used (depending on the cardinality of the Address table)
2b. The EmployeeID index (if it is just on EmployeeID) will have the Address "id" reference in it's leaf node.
2c. Seek on the clustered index to find the rest of the data in this record (LastName)

Now, contrast that with an EmployeeID clustered index on Employee table and an AddressID index on Address (which is probably an autoincrement as you normally use since an Employee can have multiple addresses).

1. Seek on the LastName index to the "B" section.
1b. Start reading data; The index contains LastName and the clustered index, EmployeeID.
1c. No need to perform a seek on the clustered index - we already have the EmployeeID so we're already ready to jump to the address table!
2. (depending on join algorithm) probe the Address table for the EmployeeID for the first result
2a. If you have an EmployeeID index, this might or might not be used (depending on the cardinality of the Address table)
2b. The EmployeeID index (if it is just on EmployeeID) will have the AddressID "id" reference in it's leaf node.
2c. Seek on the clustered index to find the rest of the data in this record (LastName)
3. Perhaps you want to further improve this? Consider a composite clustered index on EmployeeID, AddressID. Depending on your workload, this could cut out an extra seek when joining on EmployeeID.

As you can see, in many cases it makes sense to use a meaningful column for your PK/clustered index - you can save your db engine a lot of work!

It's VERY easy in SQL Server to see this in action - pretty graphs and all. In SSMS, press ctrl-m prior to executing your query. (Include Actual Execution Plan). In your results window, you'll have boxes representing each operation - seek, scan, join, etc. Hover over the operation and look for data like "number of rows" and "number of time performed."

Despite speed issues (which I think you were eluding too), I fail to see how having an 'id' field is a bad or is part of a bad data model... other than the current issue that I brought up.

calebb · Jun 14, 2011

defenseman said:
Despite speed issues (which I think you were eluding too), I fail to see how having an 'id' field is a bad or is part of a bad data model... other than the current issue that I brought up.

I didn't say it's bad; However, after answering your original question, I did provide a concrete example where a meaningful ID is a much better choice than an autoincrement id.

Imagine returning 1,000,000 records from the hypothetical Employee table with an autoincrement id; With a simple nonclustered index on EmployeeID, you'll need to first get 1,000,000 ids, and then perform 1,000,000 more reads to get the rest of the columns you need. That's 1,000,000 extra reads.

(of course, this is a theoretical example; if you were stuck with an autoincrement id PK for an Employees table, you would probably include some inequality columns in your index to mitigate the extra reads; which would be a workaround for the inappropriate data model and would cost more work for every INSERT and UPDATE)

SQL database question - Primary keying with an 'id' column

defenseman

[H]ard|Gawd

calebb

[H]F Junkie

krogen

[H]ard|Gawd

Whatsisname

[H]F Junkie

calebb

[H]F Junkie

msny

2[H]4U

Pwyl_The_Destroyer

Limp Gawd

FrEaKy

[H] Movie and TV Show Review Guy

ameoba

Supreme [H]ardness

defenseman

[H]ard|Gawd

calebb

[H]F Junkie