What's a good place to start with SQL?

unhappy_mage · Jul 5, 2007

I'm a fairly experienced programmer in a variety of languages, but I haven't had much need to interact with (or at least not much exposure to) SQL databases as yet. I have a test application I'd like to use SQL for (basically a price tracker over time), and I'd like some input on how to go about implementing the database part of it.

1) What do I need to do about SQL injection?
I'm planning on using Perl and DBD:

g, with prepared statements. Does this prevent SQL injections, or is there something else I should be doing?

2) What should my table layout be like?
I want to store (for each item, at each time) [price, name, link-to-item, store-name]. This indicates to me that I want an "items" table with [item-id, name, link-to-item] rows, and a "data" table with [item-id, time, price] rows. Does this sound reasonable? What might a query to display a set of [price, name] tuplets for a given time look like? I think it'd be something like:
"SELECT data.price, items.name WHERE data.itemid = items.itemid AND '7/4/07' < data.time AND data.time < '7/5/07'"
What would a good way to add items be? I've done my magic, now I have a [price, name, link-to-item, store-name, time] tuplet. What do I do with it? link-to-item is unique, being a URL; should I determine the item-id from that? What kind of logic is good to create rows in the "items" table?

Sorry for the broad questions - I'm a little bit clueful about SQL syntax, but I don't want to dive in and later discover I've been going about things wrong. Thanks in advance for any assistance!

mikeblas · Jul 6, 2007

I guess #1 is a question about SQL, but it's really more to do about your database client language (and its libraries). #2 isn't about SQL, mostly; it's about relational modelling. If you've got no experience with other RDBMSes, then you should think about database modelling at least a little before diving into SQL.

1) Prepared statements should help with SQL injection, since they normally mean that you're forced to bind the values you want into the statement instead of adding them as literals. Adding mal-formed literals is what injection is all about. If you're binding you're values, then you're all set.

2) It's kind of unclear what your goals are. You're recording prices for a known set of items at particular times at particular stores, then? Why don't you have a known set of stores? What does "link-to-item" mean? Neither of your tables has store-name as a column, so how are you storing that?

Your SQL statement has a couple of problems -- as far as I can tell (that is, with the above questions withstanding). You can use the BETWEEN keyword instead of that conjoined conditional, but it's inconsequential for the database -- it's just a matter of preference. You'll also want to get used to doing proper joins with the JOIN keyword:

You need a FROM clause, though, showing which tables you're takin' rows from. And a JOIN clause, with a predicate showing how the reows in the two tables you're takin' data from are related.

How to identify an item ID has mostly to do with your users. Are they looking for a part number? An item number? Some description? Some fuzzy matching over those? Then, you'll want to store that data, and probably make some synthetic primary key over the attributes the user actually sees.

unhappy_mage · Jul 6, 2007

mikeblas said:
I guess #1 is a question about SQL, but it's really more to do about your database client language (and its libraries). #2 isn't about SQL, mostly; it's about relational modelling. If you've got no experience with other RDBMSes, then you should think about database modelling at least a little before diving into SQL.

Good idea. Do you have any suggestions for a good reference on this topic (book or website or other)? Table design is one of the many unknowns in this for me.

mikeblas said:
1) Prepared statements should help with SQL injection, since they normally mean that you're forced to bind the values you want into the statement instead of adding them as literals. Adding mal-formed literals is what injection is all about. If you're binding you're values, then you're all set.

Okay, that's good to hear. I guess this is probably implementation-dependent, but if I prepare a statement like "INSERT INTO table (foo, bar, baz) VALUES (?, ?, ?)" and then execute it with a call like "statement->execute(1, 2, 3)" that's using binding, correct? If the statement ended with "VALUES (1, 2, 3)" it would be inserting literals and Bad.

mikeblas said:
2) It's kind of unclear what your goals are. You're recording prices for a known set of items at particular times at particular stores, then? Why don't you have a known set of stores? What does "link-to-item" mean? Neither of your tables has store-name as a column, so how are you storing that?

A known (but possibly changing - if a new item comes out, it'd get added) set of items, yes. The set of stores is known, too - should it get its own table as well? "link-to-item" is the URL of the place one can buy the item from that store. And yes, store-name should be a column in the "items" table (as I envision it, anyways).

mikeblas said:
Your SQL statement has a couple of problems -- as far as I can tell (that is, with the above questions withstanding). You can use the BETWEEN keyword instead of that conjoined conditional, but it's inconsequential for the database -- it's just a matter of preference. You'll also want to get used to doing proper joins with the JOIN keyword:

You need a FROM clause, though, showing which tables you're takin' rows from. And a JOIN clause, with a predicate showing how the reows in the two tables you're takin' data from are related.

So if I've got the following tables:

Code:

items:
item-id  | description | link-to-item           | store
---------+-------------+------------------------+------
1        | foo         | http://www.google.com  | google
14       | bar         | http://www.yahoo.com   | yahoo
134      | baz         | http://perl.org        | baz

data
row-id | item-id | date    | price
-------+---------+---------+-------
1      | 1       | 7/7/07  | 12.34
2      | 1       | 7/8/07  | 12.11
4      | 1       | 7/9/07  | 12.15

and I want to get a row like this:

Code:

store   | name | date    | price | link-to-item
--------+------+---------+-------+-------------
 google | foo  | 7/7/07  | 12.34 | http://www.google.com

I could run a query like "SELECT items.store, items.name, data.date, data.price, items.link-to-item FROM data INNER JOIN items ON data.item-id = items.item-id WHERE data.date BETWEEN '7/6/07' AND '7/8/07'"?

mikeblas said:
How to identify an item ID has mostly to do with your users. Are they looking for a part number? An item number? Some description? Some fuzzy matching over those? Then, you'll want to store that data, and probably make some synthetic primary key over the attributes the user actually sees.

"item-id" was a poor name for that field. "description" probably would have been better, and that's what I changed it to above. I do realize there will need to be a synthetic primary key for both of these tables ("item-id" and "row-id" above); is the "CREATE SEQUENCE" command the right way to do that or am I missing something?

In conclusion, any suggestions on a reference on table design or SQL in general would be appreciated. I think I'll take a gander at the school library after work today.

mikeblas · Jul 6, 2007

unhappy_mage said:
Okay, that's good to hear. I guess this is probably implementation-dependent, but if I prepare a statement like "INSERT INTO table (foo, bar, baz) VALUES (?, ?, ?)" and then execute it with a call like "statement->execute(1, 2, 3)" that's using binding, correct? If the statement ended with "VALUES (1, 2, 3)" it would be inserting literals and Bad.

Right. Say you tried to build the INSERT statement from user input as a string, yourself. You might code something like:

Code:

  statement = "INSERT INTO Foo VALUES (" + userinput +")"

If userinput is "38", then that's great, you insert 38. If userinput is "0);DELETE FROM Foo;" then you're pwned.

Binding guarantees that you're strictly binding a parameter. With this same example, the bind fails because "0);DELETE FROM Foo;" isn't a number. Even if you were expecting to bind a string, you'd be fine because you'd strictly bind the content of the string and not subject the string to the SQL parser, where it can get translated in a malicious way.

unhappy_mage said:
The set of stores is known, too - should it get its own table as well?

I'd think so. Making a reference table of know lists has a couple of advantages. First, it helps you switch around the table to contain more data when your needs change. You've got the store name redundantly in your table again and again. What if you decide you want to also have the address, or a contact name, or something? You'd have to redundantly store that data, too.

Next, it assures data integrity. As you've defined it, I could shop at "google", or accidentally shop at "googl" or "googol" or "goog", or "xyzzy". With the reference talbe, you've got the opportunity to check that the values are completely correct.

And so on.

unhappy_mage said:
I could run a query like "

Looks about right.

unhappy_mage said:
In conclusion, any suggestions on a reference on table design or SQL in general would be appreciated.

I'm hard-pressed to provide one. I've been doing this for a long time, so I don't really have current books on my shelf. It's called "relational modelling" or "database modelling", and maybe that helps you search.

Tytalus · Jul 6, 2007

For reference most of the web references will be sufficient. There are two primary things to understand prior to looking at a single line of SQL:

1) On paper, databases are nothing more than a relational model expressed as a mathematical expression. Of course, over the years this has evolved and morphed, but you can find some history on the Wikipedia page for relational model. Skip the history stuff and go to the substance, making sure to glance over the differences between the relational model and modern SQL. As a quick footnote to this item I will also mention that relational algebra is the math behind queries, as you can see on the Wikipedia page for relational algebra. It's going to be a dense read if you didn't do an extensive amount of math work in school, but anyone who took an Advanced Algebra course and at least one Calculus course could pretty easily grasp it given a few examples and some time to work them out.

2) Also on paper, during the database design process you should frequently attempt to use a process known as database normalization. Again I refer you to the Wikipedia page for database normalization as a decent reference. Only look at 1NF through 3NF. For most modern electronic databases normalizing tables beyond 3NF is wasteful and counter-productive, but this statement is open to wide interpretation. For your needs 3NF should be sufficient.

Understanding the two items above independently of the existing implementations in MSSQL, Oracle, MySQL, etc. is highly recommended (by me) as you will find some differences between what is described above and what is implemented by each database management suite. Similar to writing good code, a proper database design leaves itself open for easy expansion and flexibility, while still retaining strong relationships between associated data fields.

Hopefully I didn't confuse you too much!

If you have any other questions just yell and I'll see what I can do to help you out. I have a few book references but there really is no substitute for busting out a sheet of paper and doodling it out until you get what you want. Also, based on the ideas I described above you could probably get a decent search engine to kick you out some syllabi for undergraduate-level coursework which will further refine your knowledge of the subject.

mikeblas · Jul 6, 2007

The relational algebra and relational calculi describe the operators that relational databases use over tables and how those operators interact. They don't do much to help a person model a real-world problem into a database design.

Learning about these can help glue together some concepts, but I'm not sure its necessary to be successful even with complicated database development projects. If you do want to go this deeply into studying relational databases, I'd recommend "Fundamentals of Database Systems" by Elmasri and Navathe (ISBN 0321369572). This book is very approachable, and well-written.

"An Introduction to Database Systems" (ISBN 0321197844), and "Database in Depth: Relational Theory for Practitioners" (ISBN 0596100124) are both by CJ Date, one of the leading researchers in the field.

An appreciable amount of the content of these books is only interseting to people who are implementing database systems -- that is, actually writing a DBMS -- rather than people using a DBMS to solve a problem.

Tytalus · Jul 6, 2007

mikeblas said:
The relational algebra and relational calculi describe the operators that relational databases use over tables and how those operators interact. They don't do much to help a person model a real-world problem into a database design.

I fully agree with this statement. For the sake of completeness (in the limited capacity that I described the relational model) I thought it best to at least mention it, for the interested reader. I suppose I could have been more descriptive regarding my intentions, in hindsight.

mikeblas said:
Learning about these can help glue together some concepts, but I'm not sure its necessary to be successful even with complicated database development projects.

Taking your statement as it is, I can agree with it. Success is a VERY relative term with databasing. For smaller database structures (few tables, few relationships, etc.) it's a LOT easier to get away with poor design (as it is with many computer science concepts.)

My background is focused primarily on the academic side of databasing and has since branched out in to business uses, so my "anchor" (starting point) for a database design always follows (at least in spirit) the principles I described in my previous post. I look at it as someone saying "You can learn C++ by opening Notepad, pasting in some code from the web or a book and running a compiler on the code to see what happens" vs. someone saying "You can learn C++ by opening a book and reading about the structure of a simple program and learning the basic data types available, and how to utilize them."

I do realize I probably went a bit off-course in my description here.... sorry.

mikeblas said:
If you do want to go this deeply into studying relational databases, I'd recommend "Fundamentals of Database Systems" by Elmasri and Navathe (ISBN 0321369572). This book is very approachable, and well-written.

"An Introduction to Database Systems" (ISBN 0321197844), and "Database in Depth: Relational Theory for Practitioners" (ISBN 0596100124) are both by CJ Date, one of the leading researchers in the field.

An appreciable amount of the content of these books is only interseting to people who are implementing database systems -- that is, actually writing a DBMS -- rather than people using a DBMS to solve a problem.

I currently have "Fundamentals of Database Systems" sitting on my bookshelf at home and I highly recommend it if anyone requires a paper text as a reference.

Date's book "Database in Depth: Relational Theory for Practitioners" looks decent for an academic text. I took a quick peek at it on Google Books and it follows the structure of most academic texts in relational database theory. It isn't a read for the light-hearted or mathematically-challenged. As Date says in the introduction to Chapter 1:

C. J. Date said:
...So this book is aimed at database professionals, especially commercial database practitioners, who have had some exposure to the relational model but don't know as much about it as they ought to. It's definitely not meant for beginners; however, it isn't a refresher course, either. To be more specific: I'm sure you know something about SQL, but--and I apologize if my tone is somewhat offensive here--if your knowledge of the relational model derives only from your knowledge of SQL, then I'm afraid you won't know the relational model as well as you should, and you'll probably know "some things that ain't so."

I think that quote sums it up nicely.

unhappy_mage · Jul 6, 2007

Thanks for the input, guys. I do plan to take a course on this in the spring, so theoretical material is welcome.

In the meantime, though, I have a practical question. I've created the tables:

Code:

CREATE TABLE stores (
	storeid integer PRIMARY KEY DEFAULT nextval('storeids'),
	name text
);
CREATE TABLE items (
	itemid integer PRIMARY KEY DEFAULT nextval('itemids'),
	description text,
	item_url text,
	storeid integer REFERENCES stores
);
CREATE TABLE data (
	dataid integer PRIMARY KEY DEFAULT nextval('dataids'),
	itemid integer REFERENCES items,
	date date,
	price integer
);

(those "DEFAULT nextval()" things also have SEQUENCEs created for them) and added an example store:

Code:

SELECT * FROM stores;
 storeid |  name  
---------+--------
       1 | Newegg

Next, I want to insert a row into the "items" table. How can I take my (description, item_url, store_name) tuple, and translate store_name into a storeid? Is it possible to do that in one step? I tried:
"INSERT INTO items VALUES (DEFAULT, ?, ?, SELECT storeid FROM stores WHERE name=?);"
and then I execute("foo", "bar", "StoreName"), but that doesn't work. Should I do this as two queries, or is there a way to do it as one query?

mikeblas · Jul 6, 2007

Something like this:

Code:

INSERT INTO Items (Description, Item_URL, StoreID)
SELECT ?, ?, StoreID
  FROM Stores
 WHERE Name = ?

should work.

unhappy_mage · Jul 7, 2007

That works just fine. Thanks.

I'm still looking for a clever way to create stores that don't exist, and items that don't exist, to automatically make foreign keys when needed. I could do a separate SELECT before I try to INSERT, but that strikes me as inefficient. The answer would get cached, I bet, but it's still an extra layer of abstraction. Writing a stored procedure that takes a completely joined row (with all the non-artificial data in it) and splits it into the appropriate tables might be a better idea, I guess. Any suggestions one way or the other?

mikeblas · Jul 7, 2007

I can't figure out what you mean by "non-artificial".

Selecting before the insert isn't so bad; it's more I/O, the database does some caching, and so on. Database management systems (well, most of them) are remarkably good at storing and retreiving information.

So what if you didn't do something? Where would the data come from?

What's a good place to start with SQL?

unhappy_mage

[H]ard|DCer of the Month - October 2005

mikeblas

[H]ard|DCer of the Month - May 2006

unhappy_mage

[H]ard|DCer of the Month - October 2005

mikeblas

[H]ard|DCer of the Month - May 2006

Tytalus

Supreme [H]ardness

mikeblas

[H]ard|DCer of the Month - May 2006

Tytalus

Supreme [H]ardness

unhappy_mage

[H]ard|DCer of the Month - October 2005

mikeblas

[H]ard|DCer of the Month - May 2006

unhappy_mage

[H]ard|DCer of the Month - October 2005

mikeblas

[H]ard|DCer of the Month - May 2006