Help Me Out With This Database Schema

GeForceX · Apr 28, 2009

Hey guys,

I have limited experience with databases but in order to learn more, I've been attempting at designing databases myself. I'm trying to create a news aggregation database for users to retrieve news from.

The basic idea of this database is that a user will login his/her university account and choose their preferences. The preferences are based on colleges only (i.e. Science College, Engineering College, Liberal Arts College, etc.). The preferences can also be drilled down further with keywords (i.e. technology, educational, fun, etc.). Then the user can receive news as they please.

The database looks like this:

The basic business rules are as follows:

-A user can choose many colleges and a college can be chosen by many users.
-Thus we have a composite table for "User_College"

-A college can publish many articles but an article can only belong to one college.
-One could argue that an article could be jointly written by authors from two different colleges but for simplicity's sake, let's say it's not going to happen.

-An article can have multiple keywords tagged to it and a keyword can be tagged onto many articles.
-Thus we have a composite table for "Article_Keyword"

-An article can be written by many authors and an author can write many articles.
-Thus we have a composite table for "Article_Author"

My concerns are the following:

1) Well, if articles can be written by many authors, then it begs the question of, "Will it come from different colleges too?" Well, I'm not sure what to do - should I make the relationship between College and Articles many-to-many?
2) I have to store the user's preferences somewhere and I don't know where to store them.
3) I need to aggregate news from somewhere, right? How exactly do I implement such a function in a ERD (diagram) like this?
4) What if users prefer to be e-mailed news rather than to login and read news? How do I implement the functionality of letting users be e-mailed news?
5) I also feel that this diagram is too "simple". Or maybe I'm wrong - it is supposed to be that simple.

Any feedback would be great!

Appreciated,
Jeff~

PTNL · Apr 29, 2009

Thoughts so far... Sounds like you've thought through some of the potential many-to-many relationships. One assumption I am making is that "college" refers to a specific academia group of a university (ex: College of Business).

One thing worth pointing out is that an author may work at different colleges in their career. While you seem to defer against this in a similar comparison, where a book may have multiple authors yet each author works at a different college, it brings up a higher level question of whether some data should be considered accurate if it is current or "moment in time".

Seeing that there are multiple situations that suggest "moment in time" (with the simplest example being "when the book was written"), I would store such moment in time information at the book's detail levels. You can still use parent-foreign key linking for this, just add some extra foreign key columns to the book details and/or link-tables for "author", "author's college", "date published", etc. This way your queries for getting the author's current information is separate from the author's information from when a specific book was written.

Getting into the questions now...

1 - My previous explanation provides my thoughts on this... (I wrote the above response before first reading the specific questions

).

2 - Perhaps a similar link table would help, such as the one you did for "Article_Keyword". In this case, it would be "User_ArticleType". This would give the article types that a specific user is interested in. Thinking further down the line, one expandion of this design would be to track which articles a user has received already, approved/rejected, and/or some sort of content "voting" (best comparison I could give is like a "stars" rating you would assign music within iTunes).

3 - This gets a little subjective, mostly because nothing's been said about how any of these articles are actually stored. Is it a single repository, multiple repositories, a mix of on- and off-site datastores, third party and/or proprietary solutions? What amount of data is even needed to lookup a single article: URL's, BLOB recomposition, etc.? Point is, I think you'll figure this one out on your own when you start answering the questions posed in here.

4 - This kinda depends on what you learn from #3. Start with that one, then you'll start seeing a pool of options (if any).

5 - I think you're on a good start with the diagram, and the questions being raised. Evaluate your needs against what I commented on about the "moment-in-time" tracking, adjust as needed. Figure out more about the articles themselves. Others may have some good details as well, so take a look at the high level questions that get posted.

MindChild · Apr 30, 2009

1) You assume everyone has ONE and ONLY ONE email address. While this might seem logical and enforcable, I promise from loads of experience, it won't be. Use a linking table and perhaps a flag marking an email address as "primary"

2) First and Last name fields are WAY to short. What will happen when someones name is "Yoda Buymeafriggindrinkoriamgoingtodie"?

PTNL · May 1, 2009

MindChild said:
1) You assume everyone has ONE and ONLY ONE email address. While this might seem logical and enforcable, I promise from loads of experience, it won't be. Use a linking table and perhaps a flag marking an email address as "primary"

2) First and Last name fields are WAY to short. What will happen when someones name is "Yoda Buymeafriggindrinkoriamgoingtodie"?

Number of fields and field sizes are good points to make. Fortunately, these are trivial to implement or adjust.

If multiple e-mail addresses becomes necessary, why not just add a second e-mail address column (or a third)? Name one (or at least consider one) of the e-mail fields as "primary". I don't see a reason to have a link table for e-mail addresses; your own e-mail address is not shared with someone else, so why bother implementing such a solution? It just adds unnecessary JOINs in the SQL queries and extra database overhead.

MindChild · May 1, 2009

PTNL said:
Number of fields and field sizes are good points to make. Fortunately, these are trivial to implement or adjust.

If multiple e-mail addresses becomes necessary, why not just add a second e-mail address column (or a third)? Name one (or at least consider one) of the e-mail fields as "primary". I don't see a reason to have a link table for e-mail addresses; your own e-mail address is not shared with someone else, so why bother implementing such a solution? It just adds unnecessary JOINs in the SQL queries and extra database overhead.

Actually, it DOESN'T cause anything unnecessary. It's very necessary to the point where it has been a subject of study for as long as relational databases existed... its called normalization. It keeps data rational and unnecessary data out of where it doesn't belong.

Having fields like PrimaryEmailAddress, EmailAddress1, EmailAddress2, etc would be a bloody nightmare. First of all, your queries will end up having convoluted logic. Lets say you are looking for the person with email address of [email protected]. It would look something like this:

Code:

Select *
From Users
Where PrimaryEmailAddress = '[email protected]' or EmailAddress1 = '[email protected]' or EmailAddress2 = '[email protected]' ...

You'd have to change the code AND the database anytime anyone went over the threshold of email address fields that you have. Also, your unused email fields would likely contain NULLs which will make string comparisons even more retarded, since you can't compare a string with NULL since NULL means "Has No Value". This is just a shortcut anyway to avoid linking (as in, an outer join will return NULL for the appropriate query fields where there is no link) and destroys the "relational" bit of "relational database".

Think something along the lines of:

Code:

Select Users.*
From (Users Inner Join Users_EmailAddresses on Users.UserID = Users_EmailAddresses.UserID) Inner Join EmailAddresses on Users_EmailAddresses.EmailAddressID = EmailAddresses.EmailAddressID
Where EmailAddresses.EmailAddress = '[email protected]'

Don't be afraid of joins, even if it is across 100 tables. This is what RDBMS's were DESIGNED to do.

PTNL · May 1, 2009

Link tables help (among other things) to resolve many-to-many relationships. E-mail addresses will share a one-to-many relationship with a user: a user can have several e-mail addresses, but a single e-mail address can (or better *should*) only belong to one person.

Normalization is designed to provide cleaner abstraction of data, minimize data duplication, etc. But your limited concern on just of the number of e-mail addresses allowed is not applicable for an application of these specs, scope, impact, tracking. And consider the few online sites that you have given contact information to; how many e-mail addresses, phone numbers, street addresses do they store? Some may allow groupings of a block of contact information, but each group is a fixed number of fields.

Now taking your argument further .... Why limit your design approach to only e-mail address? Why not abstract and expand further down normalization and apply it to all types of contact information (phone number, e-mail, address, etc.)? If normalization is really your focus, then why stop at just e-mail address?

The better approach you should take is to check with the OP about the number of e-mail addresses that have been stated as required/necessary. If this has not been set, then the OP should feel compelled to flesh out more details with the recipient of this application to make sure that either a few or an infinite number of e-mail addresses is realistic. The OP should be prepared to answer the recipient with the difference in cost and development time so that the recipient can make an informed decision.

MindChild · May 1, 2009

Link tables help (among other things) to resolve many-to-many relationships. E-mail addresses will share a one-to-many relationship with a user: a user can have several e-mail addresses, but a single e-mail address can (or better *should*) only belong to one person.

I completely agree. I have been on that end of the stick... WAY more times than I want to even remember. It seems most every organization has to tackle this problem on their own for some reason. There will always be someone that tries to use the same email address as someone else. There will always be someone who uses 40 email addresses. There will always be someone who doesn't have an email address at all. Regardless what the intentions and the scope of the project are, these are depressing facts that just can't be avoided.

Normalization is designed to provide cleaner abstraction of data, minimize data duplication, etc. But your limited concern on just of the number of e-mail addresses allowed is not applicable for an application of these specs, scope, impact, tracking. And consider the few online sites that you have given contact information to; how many e-mail addresses, phone numbers, street addresses do they store? Some may allow groupings of a block of contact information, but each group is a fixed number of fields.

Now taking your argument further .... Why limit your design approach to only e-mail address? Why not abstract and expand further down normalization and apply it to all types of contact information (phone number, e-mail, address, etc.)? If normalization is really your focus, then why stop at just e-mail address?

EXACTLY. For the same reasons above, the same SHOULD be implemented for addresses, phone numbers, vehicles, etc.

The better approach you should take is to check with the OP about the number of e-mail addresses that have been stated as required/necessary. If this has not been set, then the OP should feel compelled to flesh out more details with the recipient of this application to make sure that either a few or an infinite number of e-mail addresses is realistic. The OP should be prepared to answer the recipient with the difference in cost and development time so that the recipient can make an informed decision.

While this is a necessary step, counting on the client to be able to comprehend the whole scope of the problem would be naive at best. When was the last time you had a client that got you everything you needed before implementation? While everyone has the best intentions, a common rule I go by is, if someone ever says "We are never going to need ...", count on them eventually needing it

PTNL · May 1, 2009

MindChild said:
While this is a necessary step, counting on the client to be able to comprehend the whole scope of the problem would be naive at best. When was the last time you had a client that got you everything you needed before implementation? While everyone has the best intentions, a common rule I go by is, if someone ever says "We are never going to need ...", count on them eventually needing it

That's called a "change request", and gets documented as such. The goal is to hold the client accountable for the initial specs, and yet still be willing to make change requests that they authorize *after* you inform them of the impact on calendar time and development costs. Put the onus on the client for the final decision, that's what they are paying for. It's called project management, and needs to be treated as such.

For the whole "unlimited number of e-mail addresses" feature, I see no reason to force such a drastic design change that is typically congruent with an enterprise-level CRM system for an application of this scope.

GeForceX -- It's your call on this design approach. You have the opinions on both sides of the argument. Hopefully this discussion and the extra details that you have not posted about this project are enough to help make your decision.

mrgstiffler · May 1, 2009

You don't have any "Published/Deleted/Inactive" type fields in there. One extra field can save you a lot of hassle when you go to manage your content/users.

MindChild · May 1, 2009

If you want to mark a record as "deleted", move it to a different table. Else you risk fragmenting the table and its indexes needlessly.

Help Me Out With This Database Schema

GeForceX

Supreme [H]ardness

PTNL

Supreme [H]ardness

MindChild

Weaksauce

PTNL

Supreme [H]ardness

MindChild

Weaksauce

PTNL

Supreme [H]ardness

MindChild

Weaksauce

PTNL

Supreme [H]ardness

mrgstiffler

[H]F Junkie

MindChild

Weaksauce