Storing delimited data vs. relational databases (PHP/MySQL)

Sp4nk

I'm curious as to what the differences/consequences/benefits are between storing delimited data as a single field item in a row and having all that data as individual rows in a separate table.

For example, say I had a user named "Steve" and he voted on items 1, 2, and 5. Would having a single field in Steve's user data row with "1,2,5" be any more or less efficient than having three rows with those values in another table for the SOLE purpose of determining that Steve voted on those items? This delimited data would not be joined to another table in any way, nor attached to what the vote was, but merely used to determine whether an item was voted on or not.

In this situation, let's say all voting items are being listed. The delimited data would be a single variable, matched in a loop over each voting item. No queries within queries, no subqueries, no joins. Just a single while loop over one query and a string match between the voting item ID and the delimited data of Steve's voting record (so to speak).

This would by no means be a common method for matching data throughout an entire site; it's just for recording, more or less, the activity of users. It seems easier on the server to hit the database less by keeping this data in a field, but it may tax the server more to run through a massive string. I'm curious about other thoughts and opinions on the subject. Thanks!
 
It's bad database design because it breaks 1NF. Whether or not it's justified, I can't really comment because I haven't designed many databases outside of academia. :)
 
What happens when you make a new poll or decide that users should vote on more subjects? Unless you're absolutely certain that voting will only be done on a small, fixed number of issues, you're asking for problems down the road by going this route.

While there are cases where it may be desirable to denormalize your data for performance reasons, it's probably best to wait until you -know- you have a performance problem. Even then, you're probably better off reworking your queries and adding indices.
 
There are many implications:
If your row format is:
Code:
Name | Votes
-------------------
Steve | 1,3,5
Bill   | 2,3,5
It may be time-consuming to find all the people who voted on issue 6, since you can no longer index on the Votes column. Depending on how you define the Votes field, you may also be limiting the number of votes a person can cast. And if Steve casts another vote for a different project, then depending on the data type of Votes, the update may require "moving" the record to a different location if there is not enough space at its current storage location.
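As a rough illustration (table and column names invented here): finding everyone who voted on issue 6 against the delimited column means a full scan with a careful pattern match, while the row-per-vote layout is a plain indexed lookup.

Code:
-- Delimited column: scans every row, and the pattern has to be padded
-- with delimiters so 6 doesn't match inside 16, 60, 26, ...
SELECT Name FROM users WHERE CONCAT(',', Votes, ',') LIKE '%,6,%';

-- One row per vote: a simple indexed lookup.
SELECT Name FROM user_votes WHERE Vote = 6;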
 
If you go by your example of 4 columns (Steve, 1, 2, 5) with a small set of rows, then it's by no means any better or worse to use a delimited file over a relational db.

However, you may be better off having Steve and the voting items as unique rows..
Steve, 1
Steve, 2
Steve, 5
 
From what I gather from the original post, it looks like there is a table of users and a table of voting. I think it would be best to create a related voting_record table, which would contain the user table id and the voting table id. If indexes are used, it will greatly speed up the queries and reduce performance penalties.
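A minimal sketch of that junction table (the names are assumptions, since no schema was given):

Code:
CREATE TABLE voting_record (
  user_id INT NOT NULL,  -- id from the users table
  vote_id INT NOT NULL,  -- id from the voting table
  PRIMARY KEY (user_id, vote_id)  -- the composite key doubles as the index
);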
 
Regardless of the size of the example, the normalized solution is easier to write queries for. Some queries are even impossible when using a less capable database engine (such as MySQL). For example, given the denormalized form, we can't write a query that lists all the votes by number along with the count of people who have voted for them. I don't think it's possible in MySQL; in SQL Server, you'd have to create a table-valued function, then join against it. It's pretty tedious and doesn't execute optimally.

With a normalized approach:

Code:
NameID,Name
1,Steve
2,Bill

Vote,NameID
1,1
3,1
5,1
2,2
3,2
5,2

Vote
1
2

the query is trivially written.
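For instance, the votes-and-counts query mentioned above might look like this (table names are invented here, since the example leaves them unnamed):

Code:
-- Assumed names: VoteOptions(Vote) is the reference table,
-- VoteRecords(Vote, NameID) holds the votes cast.
SELECT o.Vote, COUNT(r.NameID) AS Voters
FROM VoteOptions o
LEFT JOIN VoteRecords r ON r.Vote = o.Vote
GROUP BY o.Vote;
-- The LEFT JOIN keeps options nobody voted on, and COUNT(r.NameID)
-- counts only matching rows, so those options report 0.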

The normalized approach may take more disk space. The denormalized approach will almost always do more I/O because it can't be indexed, as drizzt has pointed out. The normalized approach is easier to extend and modify, both within the domain of the data and outside it. The denormalized approach offers little hope of efficiently providing referential integrity.
 
I could be wrong (not like I haven't been before), but it sounded like the OP wanted a method of processing one small set of data.. users and their voting options.. If that's the case, and he has the logic already built (or an idea formulated), then I don't see the need for building a relational database when a simple program could do what he needs on that file.

Granted, if there's going to be a need for complex querying and/or joining with other data sets (ex. user address and phone number), a relational DB would be better suited for that. If a DB sounds preferable.. just using a single table would still be simpler given the small variation of data..

Name, Vote
------------------
Steve, 1
Steve, 2
Steve, 3
Bill, 1
Bill, 3
Bill, 9
Bill, 11
etc...

And you can index on both columns if necessary for query performance if the table grows unwieldy.
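For instance, assuming the table above were called user_votes:

Code:
CREATE INDEX idx_by_name ON user_votes (Name);
CREATE INDEX idx_by_vote ON user_votes (Vote);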

Normalization is often overused in cases where a denormalized table will suffice given the size of the data involved. You can normalize to the point where performance degrades rather than improves, as in mikeblas's last example. Not to say that it's not useful if there were way more than just two columns involved and tens of thousands of rows.
 
ZeroH said:
but it sounded like the OP wanted a method of processing one small set of data..
What I read Sp4nk asking for was what the "differences/consequences/benefits are between storing delimited data as a single field item in a row and having all that data as individual rows in a separate table". And that's what I've provided.

While a single table might be simpler, I'm not sure it's more efficient.

Which of my normalized examples degrades performance?
 
No offense, mikeblas.. your explanation is perfectly aligned with normalization concepts.

But if I had 1000 users that voted on an average of 5 items out of 10 possible.. that's 5000 rows in a single table.. whereas going by your "With a normalized approach:" example.. I would have:

NameID,Name
1000 rows

Vote,NameID
5000 rows

Vote
10

To identify what Steve voted on, I'd have to join 3 tables to get 5 rows... whereas I could just query one table and get 5 rows.. If there were no indexes and for whatever reason the database decided to do a full table scan, it's still only 5000 rows rather than 1000+5000+10.

It's very akin to a lot of "academic" examples using States and Zip codes... Just because one State may have many Zip codes doesn't mean you should normalize that into 2 tables. It's overkill.
 
I'm curious as to what the differences/consequences/benefits are between storing delimited data as a single field item in a row and having all that data as individual rows in a separate table.

Assuming a separate vote table is keyed on a user_id and an item_id, then finding out if Bob voted for item #2 is a constant-time operation. The primary key is indexed, so the query just looks up where to find that record in the index.
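Roughly, with that key (a sketch; the table name is an assumption):

Code:
-- Did Bob (user_id 1, say) vote for item #2? Resolved straight from the
-- primary key index, without touching any other rows.
SELECT 1 FROM vote WHERE user_id = 1 AND item_id = 2;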

Selecting against a delimited field is a much more complicated process involving string comparisons of varying lengths.

If the string comparisons are done by the application, then there's the additional overhead of transferring all the records that don't match from the database to the application and performing the comparison there.

If you're only talking about a handful of records, then this question is just academic. But I don't know how many times I've seen "oh, this will rarely be used" turn into "this is great...the whole department depends on it".
 
Thanks for the response everyone.

While I understand that having individual items in a separate table is beneficial for looking up who voted on what, my original question related only to matching a single user against which items they voted on, not what they voted for.

I'd definitely keep a record of all items voted on, including their votes, by each user for other purposes, such as tracking demographics, patterns of people vs. topics, etc... However, storing what a user voted on in a single field of the user's row (i.e. account data), as a means to show them what they have and haven't voted on, seemed less intensive on the server: string matching a single field as opposed to going through buttloads of records. This would be done on the user's end, not a matter of public record. I was curious if string matching ends up using just as many resources going through the delimited data as it would going through records.
 
ZeroH said:
To identify what Steve voted on, I'd have to join 3 tables to get 5 rows...
You don't need to join three tables for that query. You only need two; you only need the Vote reference table to find things not voted on. Or for referential integrity -- something the denormalized approach completely fails to provide.

Similarly, I'm not sure why we'd consider a case without indexes. Indexes over the normalized tables can assure that users only vote once, vote for one choice once, don't have duplicate names, and so on.

These guarantees are harder to establish in the non-normalized approach.
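As a sketch of one such guarantee (names follow the earlier illustrative example): a unique composite index makes a duplicate vote impossible to insert in the first place.

Code:
-- One row per (NameID, Vote) pair; a second INSERT of the same pair
-- fails instead of silently recording a duplicate vote.
CREATE UNIQUE INDEX uq_one_vote_each ON VoteRecords (NameID, Vote);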

If you're only talking about a handful of records, then this question is just academic. But I don't know how many times I've seen "oh, this will rarely be used" turn into "this is great...the whole department depends on it".

Precisely. This is why I said table size was largely irrelevant. Anyone with experience has worked on precious few systems that had static requirements. Another demand, more users, another idea, and so on -- you'll be asked for more eventually.

Fitting into the relational model buys a lot of flexibility, and I think we're showing that the perceived efficiency cost is between negligible and non-existent for this kind of application. In any non-academic application, the flexibility is usually more important than the efficiency.

sp4nk said:
This would be done on the user's end, not a matter of public record. I was curious if string matching ends up using just as many resources going through the delimited data as it would going through records.
It'll depend on your query pattern.
 
You don't need to join three tables for that query. You only need two; you only need the Vote reference table to find things not voted on. Or for referential integrity -- something the denormalized approach completely fails to provide.

Similarly, I'm not sure why we'd consider a case without indexes. Indexes over the normalized tables can assure that users only vote once, vote for one choice once, don't have duplicate names, and so on.
Yes, I was mistaken.. 2 tables, not 3.. still, I maintain that it's more overhead than with just 1.

And yes, indexes are an option too and with just one table you can have 1 index to accommodate what your 2 tables would do with 2 indexes.. again, more overhead.

In any non-academic application, the flexibility is usually more important than the efficiency.
Flexibility only becomes a concern when scaling is involved, which doesn't seem to be the case here. And while flexibility is important in any production environment, efficiency is always the driving factor. To the business, it's always a question of "How fast (and correctly) can we get this done for the customer?" not "How much larger can it get?"
 
Flexibility only becomes a concern when scaling is involved, which doesn't seem to be the case here.
So far. As I and others have pointed out, requirements change more often than not. I'm not sure how you've related flexibility and scaling; to me, they're orthogonal.

And while flexibility is important in any production environment, efficiency is always the driving factor. To the business, it's always a question of "How fast (and correctly) can we get this done for the customer?" not "How much larger can it get?"
Right: so, since the advanced queries are more efficient at runtime in the relational approach, and the storage penalty is not that great (and, in fact, is likely to be greater for the denormalized approach), and so on ... denormalization often turns out to be a false economy.

There are reasons to consider denormalization, but I don't think we've heard any of those from the OP.

If efficiency is really the driving factor, then denormalization is often the wrong answer. If I've finally got my math sorted out, the problem with your "efficiency" argument is that you're just counting rows, not row width. For the normalized approach, let's assume names have an average length of 14 and that integers are four bytes wide. We'll need an integer and the name in the Names table:

4*Names + 14*Names == 18*Names

Then, for the Votes table, we'll need a vote ID and a name ID for each vote, for each name:

Names * Votes * (4 + 4) == 8*Votes*Names

Our total storage in bytes is then 8*Votes*Names + 18*Names. With your estimate of 1000 names and 5 votes, and my guess at 14 characters per name, we're using

8 * 1000 * 5 + 18 * 1000 == 40,000 + 18,000 == 58,000 bytes

For the denormalized approach, our storage is considerably greater. We'll need each name stored once for each vote along with a vote number:

Names * Votes * (14 + 4) == 18 * Names * Votes

Using the same volume estimates, we've got

18 * 1000 * 5 == 90,000 bytes

If the names, on average, get longer, the denormalized approach gets proportionally worse because it's redundantly storing expensive data.

In the normalized approach, a clustered index gets us data integrity over duplicate votes, duplicate users, and misspelled user names for no additional storage cost. The denormalized approach, even with a clustered index, gets only duplicate vote checking. Misspelled user names, bogus vote IDs, and so on, all end up thrown to the wind.

To find a vote, you can scan a single larger table, but that's not much of a savings. Since the rows are wider, the scan takes longer on average.
 
Here's a clearer example of both methods I'm curious about. First off, when the page's PHP is first processed for a logged-in user, a single query grabs their account information: their user name, ID, password, profile details, and so on. Also included in that information is what they voted on (the delimited method). This information is stored as an array so that all the following scripts or includes that run throughout the rest of the page have access to it. So, in this example:

Code:
USERS
id user name password profile    votedon
-- --------- -------- ---------- -------------
 1     Steve k39X02i  I'm cool.  [1][4][5][7]
 2      Bill 2019X_2  I'm dope.  [1][3][9][11]

we have Steve and Bill's voting records stored in the "votedon" field. The voting item IDs are simply concatenated onto the field each time a user votes on something. The method to scan through their voted items would be a string match using strstr. I've added the brackets in this case so that IDs aren't mismatched (i.e. preventing a search for 10 from returning true because it matched inside "100").

Steve logs in. We have his info queried and ready for any script afterwards. Let's say he's attempting to vote on an item he's already voted on. As part of the processing script for voting, we check the single field for a string match to see if he's voted already. The ID of the item he's voting on is also passed to the processing script as part of the check.

Code:
if (strstr($userinfo['votedon'], "[" . $voteitemid . "]")) { echo "You already voted on this."; }

If this returns true, he voted on the item and we know it. If not, we'd let the processing script continue and record his vote.
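And if he hasn't voted, recording the vote under this scheme is just a string append; something like this, as a rough sketch (using the old mysql_* functions, with $voteitemid assumed to be a validated integer):

Code:
// Append the new bracketed item ID to the user's votedon field.
$voteitemid = (int) $voteitemid;  // force a plain integer, no stray brackets
mysql_query("UPDATE users SET votedon = CONCAT(votedon, '[" . $voteitemid . "]') WHERE id = " . (int) $userinfo['id']);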

The other method, checking a table full of votes by every member in the database, is executed by querying the item Steve is voting on and joining the table which has the votes:

Code:
VOTING
voteid question        answerA answerB etc...
------ --------------- ------- -------
     1 Is this cool?        Yes     No
     3 Are we hip?          Yes     No
     4 Do you smell?        Yes     No
     7 Favorite color?      Red   Blue
     9 Are you tall?        Yes     No
    11 Is the sky red?      Yes     No

VOTERESULTS
id voteitemid voted
-- ---------- -----
 1          1     A
 1          4     B
 1          5     C
 1          7     D
 2          1     B
 2          3     A
 2          9     A
 2         11     C

We simply match their user ID and the ID of the vote against the ID of the item they're voting on.

Code:
SELECT voting.voteid, voting.question, voteresults.voted FROM voting LEFT JOIN voteresults ON (voteresults.voteitemid = voting.voteid AND voteresults.id = $userinfo['id'])

In this situation we're querying the database again and if the joined result row spits out data from the voting results, we'll know that Steve voted on the item.
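In PHP, that check boils down to testing whether the joined column came back NULL; a minimal sketch, assuming $result holds the result of the query above:

Code:
$row = mysql_fetch_assoc($result);
// With a LEFT JOIN, voteresults.voted is NULL when Steve hasn't voted.
if ($row && $row['voted'] !== null) {
    echo "You already voted on this.";
}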

Now, granted, having the voting results in its own table gives us the ability to do different things with the data than just check to see if Steve voted using the delimited way. However, performance-wise, is one method any better than the other? Would string matching through 1,000 bracketed numbers in a single field be more or less work than going through a table of 50,000 votes from all the users?
 
Flexibility only becomes a concern when scaling is involved, which doesn't seem to be the case here. And while flexibility is important in any production environment, efficiency is always the driving factor. To the business, it's always a question of "How fast (and correctly) can we get this done for the customer?" not "How much larger can it get?"

It's really hard to justify breaking with convention in order to boost performance in an application where you're not concerned with scalability in the first place, especially when any performance gains you'd see are not even likely to appear.

...oh, and, in my experience, in actual production environments, the primary concern is figuring out how to get things back into a properly functioning state because the asshat before you didn't really know how to solve the problem in the first place. The second concern is getting things up & working again ASAP. Only then do you worry about performance - and, even then, it's probably a lower priority than putting out the next fire or implementing some bells & whistles that some suit has a hardon for..
 
Now, granted, having the voting results in its own table gives us the ability to do different things with the data than just check to see if Steve voted using the delimited way. However, performance-wise, is one method any better than the other? Would string matching through 1,000 bracketed numbers in a single field be more or less work than going through a table of 50,000 votes from all the users?
Sp4nk, I assume you're saying there are 50 users with 1000 votes each. In that respect, I don't think it would matter if you use a delimited file with scanning logic versus a database with a normalized OR denormalized table(s) - although a denormalized table will still perform better than a normalized one. The delimited file would still be the way to go to keep it simple.


*Ok.. big breath*

It's really hard to justify breaking with convention in order to boost performance in an application where you're not concerned with scalability in the first place, especially when any performance gains you'd see are not even likely to appear.

...oh, and, in my experience, in actual production environments, the primary concern is figuring out how to get things back into a properly functioning state because the asshat before you didn't really know how to solve the problem in the first place. The second concern is getting things up & working again ASAP. Only then do you worry about performance - and, even then, it's probably a lower priority than putting out the next fire or implementing some bells & whistles that some suit has a hardon for..
I disagree.. since scalability is NOT a factor and the request is so small and singular in scope, denormalizing is better suited than making it more complex than it has to be for the sake of scalability/flexibility..


There are reasons to consider denormalization, but I don't think we've heard any of those from the OP.
I thought it was crystal clear. Definitely clearer than the need for scaling and/or flexibility, which I didn't hear at all.

In the normalized approach, a clustered index gets us data integrity over duplicate votes, duplicate users, and misspelled user names for no additional storage cost. The denormalized approach, even with a clustered index, gets only duplicate vote checking. Misspelled user names, bogus vote IDs, and so on, all end up thrown to the wind.
Now you're really trying to make it seem more complex than it really is. Why don't we throw in an enterprise server and a couple of administrators to design/support this system? :rolleyes:

As for your math - although it's a bit incorrect.. it doesn't account for the overhead of joins, IO seeks/scans, multiple tables and indexes, and do I even need to mention the level of effort of installing, configuring, and building the database.. And you're talking total byte lengths, when I'm comparing a single table that would be 28 bytes wide (if you use the same length of 14 bytes for Votes, since you didn't account for that) versus multiple tables that are

NameTable + NameVotesTable + VotesTable =
NameID+Name + VoteID+NameID + VoteID+Vote =
4+14 + 4+4 + 4+14 = 44 bytes

And that doesn't even include all the indexes you would have, which in any case would obviate the need for counting total table size anyway.. Why is that even an issue here?

All I am keeping in mind is keeping it simple for the OP.

mikeblas, you sound like you have IT experience and so you should know the mantra that has the acronym "KISS"...

*big breath*

Oh, and I curse you for making me do math on a Sat morning.. :p


OP.. I think you have enough info to make an educated decision. It sounds like you already have the path of the delimited file defined. I don't know how comfortable you are in setting up a database, so that's another factor to consider.
 
Now you're really trying to make it seem more complex than it really is.
Sorting out a database that's been botched because of bad referential integrity is far more tedious than dealing with some trivial design complexity to avoid such problems in the first place. Getting RI correct helps assure the database contains correct and usable data and is hardly an overcomplication; it's really a minimum threshold for complexity.

Again, I'm just pointing out the reasons to use one over the other. Despite your zealotry, the reasons exist regardless of what decision is made and what decision criteria are used. As such, accusing me of making something seem "more complex" strikes me as off topic.

And you're talking total byte lengths, when I'm comparing a single table that would be 28 bytes wide (if you use the same length of 14 bytes for Votes, since you didn't account for that) versus multiple tables that are
The denormalized table I was thinking of would have a name and a vote number for each vote: 14 bytes for the name string, 4 bytes for the vote integer.

It looks like Sp4nk is using a scheme where there'd be a single column with the vote numbers in square brackets, as "[1][5][4]" to indicate votes on 1, 4, and 5.

NameTable + NameVotesTable + VotesTable =
NameID+Name + VoteID+NameID + VoteID+Vote =
4+14 + 4+4 + 4+14 = 44 bytes
Is the total width of all the tables meaningful, since each table grows at a different rate?

And that doesn't even include all the indexes you would have, which in any case would obviate the need for counting total table size anyway.. Why is that even an issue here?
You'd only need a clustered index on each, which is practically free, as far as storage is concerned.

All I am keeping in mind is keeping it simple for the OP.

mikeblas, you sound like you have IT experience and so you should know the mantra that has the acronym "KISS"...
In my 30 years working on software, I've found that arbitrary overcomplexity is indeed something to be avoided. Here, the additional complexity is minimal and the benefits are very tangible. Databases do JOINs like basketball players dribble. It's what the relational model is for; it's what the DBMS does for a living. If Sp4nk's application doesn't have a reason for it, that's fine by me -- but then it would seem like the use of an RDBMS is an overcomplication in the first place.

Oh, and I curse you for making me do math on a Sat morning.. :p
I was doing it on a Friday night. It was like riding a bicycle on a sandy beach.
 