Feasible to create a table per entry? (in order to emulate a 2D table)

dalearyous

hard to explain, but i basically need to create a 2D table. right now, the only way i can think to do it is to create a table for each entry that is submitted into the main table. problem is, doing so would create thousands of small tables. i don't really care, but i am questioning whether this is really the optimal solution.

am i missing something completely obvious?
 
Usually when I need to store that kind of data I just encode it into a BLOB and store that in a column, with additional indexing columns as needed. But you need to tell us more about the application.
 
i want to store the distance between 2 points for a lot of points

What RDBMS? What sorta query volume, number of users, expected rate of feature growth, etc.? You could store the points, create spatial indexes, and calculate the distance in your query:

http://technet.microsoft.com/en-us/library/bb964712.aspx
http://technet.microsoft.com/en-us/library/bb895270.aspx
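Very roughly, that approach could look something like this (just a sketch, assuming SQL Server's geometry type; the table, column, index, and parameter names are made up, and STDistance gives straight-line distance):

Code:
-- Store each point once as a geometry value, e.g. geometry::Point(@x, @y, 0).
CREATE TABLE Points (
    ID  INT IDENTITY PRIMARY KEY,
    Pos GEOMETRY NOT NULL
);

-- Geometry spatial indexes need a bounding box; adjust it to your coordinate range.
CREATE SPATIAL INDEX IX_Points_Pos ON Points (Pos)
    WITH (BOUNDING_BOX = (0, 0, 10000, 10000));

-- Straight-line distance between two stored points, computed at query time.
SELECT a.Pos.STDistance(b.Pos) AS Dist
FROM Points AS a, Points AS b
WHERE a.ID = @id1 AND b.ID = @id2;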

Or perhaps fact/dimension tables will be more scalable.

We definitely need a lot more data to help you make a good choice, but my gut feeling is that creating a new table for each record might not be optimal in many of the respects mentioned above.
 
the table could potentially grow to hundreds of thousands of points ... users probably a couple hundred ... it will grow quickly at first and then pretty much stop.

sorry i am being so vague, i didn't know quite what to ask.

*edit, your links are extremely helpful and it seems the second one is almost exactly what i want ... this is very new, will have to do some reading/researching/testing but seems to be going in the right direction
 
when i said distance between two points, i meant that i will be running a function, calculating the distance, and i just need to store it.
 
hundreds of thousands of rows is no big deal for competent databases.

We need more info on your problem.
 
i need to be able to pass two points (x & y coordinates) into the database and have it find me the distance between the two (and no, it's not a straight line)

so, normally i would do this:
3 columns: ID, xCoord, yCoord

BUT, if i give you two points (so any 2 rows in the entire table) i need to check distance between the two.

crude example:
basically from point 1 to point 2 is 5 (whatever) and so on
[attached image: tableexample.png, a grid showing the pairwise distances between the points]
 
Why? Why not just store X and Y and do the calculations at runtime (or querytime)?
 
Maybe I'm missing something, but I don't see what you'd do with thousands of tables here... Wouldn't the simplest structure just be one table with (ID1, ID2, Dist) for each pair?

Why? Why not just store X and Y and do the calculations at runtime (or querytime)?
Yeah, unless your distance function is slower than a database lookup, you're probably better off calculating it on the fly.
 
because the calculation is the most expensive part ... i need to limit how many times i calculate distance

"Maybe I'm missing something, but I don't see what you'd do with thousands of tables here... Wouldn't the simplest structure just be one table with (ID1, ID2, Dist) for each pair?" - i am not seeing this?
 
Is calculating the distance really a bottleneck? Have you profiled the performance? Having a "2D" table means that your database size increases with the square of the number of points, which sounds utterly unsustainable and considerably more expensive.
 
i am not seeing this?
You mean you don't understand what I'm talking about?

Say point1, point2 etc. are the primary keys (the "ID" column) in your coordinate table. Then your distance table, for the example you gave, would look like this:

(ID1, ID2, Dist) =
(point1, point1, 0),
(point1, point2, 5),
(point1, point3, 7),
(point1, point4, 9),
(point1, point5, 11),
(point2, point1, 5),
(point2, point2, 0),
...
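In SQL that could be as simple as something like this (just a sketch; the names are made up):

Code:
-- One row per ordered pair of points; Dist is whatever your function returns.
CREATE TABLE Distances (
    ID1  INT NOT NULL,    -- references the coordinate table's ID
    ID2  INT NOT NULL,
    Dist FLOAT NOT NULL,
    PRIMARY KEY (ID1, ID2)
);

-- Looking up a stored distance is then a single indexed read.
SELECT Dist FROM Distances WHERE ID1 = @point1 AND ID2 = @point2;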

because the calculation is the most expensive part ... i need to limit how many times i calculate distance
It may be the most expensive part, but that doesn't mean this will make it faster. A disk access is an eternity from a CPU's point of view. Just make sure it's actually helping before you commit to anything.
 
As everyone keeps mentioning, more specific details will be helpful.

Do the data points change over time? How often is the same distance between points calculated? Storing the data in a caching server would result in faster reads than storing it in SQL server, but (since it's a caching server), the storage is not durable. Depending on data set size and how static your data points are, storing it in your application's memory could be even faster. Does the underlying data (and computed distance between points) need to be stored reliably prior to presenting results to the user/consumer?

There are hundreds of business requirements (that none of us know about) that could change any of the advice provided in this thread...

For example, you could say:
Hypothetical Quote from dalearyous said:
I need to store the data points at every road intersection and destination for a delivery company. The data points will be stored with some associated reference information like house number or intersection. This data rarely changes; I will be computing the fastest route between two locations using this data, but I need to guarantee that full details for every computed route is stored to two database servers in two separate data centers before providing the route to a customer.

If that was your use case, then we could make some great recommendations backed with real world examples.

sorry i am being so vague, i didn't know quite what to ask.

No need to ask more questions; just explain what you're trying to do so that we can help you solve the problem. Be as specific, clear, and concise as possible, and try to include every requirement. Currently, we have no idea why you are considering creating a new table for every data point entry; with more details, the reasoning behind it (and a more appropriate solution) might become apparent.
 
No need to ask more questions; just explain what you're trying to do so that we can help you solve the problem. Be as specific, clear, and concise as possible, and try to include every requirement. Currently, we have no idea why you are considering creating a new table for every data point entry; with more details, the reasoning behind it (and a more appropriate solution) might become apparent.
This emphasizes the key point -- We don't know why you've already settled on the approach you are questioning. We don't have the back-story, the other options you've considered and thrown out, the special considerations that are being factored, etc.

We also don't know how this data would be used, or what actual examples there would be. To me, this sounds like a delivery/logistics application. So let's assume that X and Y are addresses. Would two sets of X and Y be, respectively, the origin and the final destination? Or would each set of X and Y be a hop on the trip, culminating in the final trip path? Or does each X and Y pair represent a unique location? This isn't completely apparent.
 
As others have pointed out, if you wanted to store this information in a database, you would need one table with a list of all your points. It would also have an ID column, which is simply a unique integer for each row (point) in the table. For example, { ID, X, Y, ... } (plus whatever other information defines your point beyond just X and Y).

Then you would need one more table to store your distances. The distances table would have 3 columns, {first_id, second_id, distance}, storing the unique IDs (from the points table) of the two points and the distance between them.

I'm not going to tell you that calculating distances on the CPU is faster than using a database, because I have no idea what you are doing. However, I will say that you should make sure it is worth the time it takes to read the value from disk. Unless the computation is somewhat complicated, it is probably not a good idea to store it in a database. Depending on your problem, you may be better off keeping a large lookup table in RAM, caching the most frequently requested results, and computing the ones you don't already know.

As others have mentioned, the number of rows in the distances table will grow in proportion to the square of the number of points. More specifically, it will be equal to the number of possible combinations, which is exactly 1/2 * N * (N - 1), where N is the number of points.
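If the distance really is the same in both directions, one way to keep the table at that 1/2 * N * (N - 1) size is to store each pair only once, in a canonical order. A rough sketch (assuming symmetric distances; column names as above):

Code:
-- Store each unordered pair exactly once by requiring first_id < second_id.
CREATE TABLE distances (
    first_id  INT   NOT NULL,
    second_id INT   NOT NULL,
    distance  FLOAT NOT NULL,
    PRIMARY KEY (first_id, second_id),
    CHECK (first_id < second_id)
);

-- Normalize the pair order when looking a distance up.
SELECT distance
FROM distances
WHERE first_id  = CASE WHEN @a < @b THEN @a ELSE @b END
  AND second_id = CASE WHEN @a < @b THEN @b ELSE @a END;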
 
You don't need the ID column, necessarily. You don't need another table for the distances unless there is somehow more than one distance per location pair. A table storing { X, Y, Distance } is fine, with a primary key over { X, Y }.

The estimate of 1/2 * N * (N - 1) only holds if the distance between two points is the same in both directions.
 
Assuming you actually need to store the distances (which I reject) you just need two tables....
TBL_Points
TBL_Distances

TBL_Points has the ID, X_Coord, Y_Coord
TBL_Distances has ID1, ID2, Distance

This is still way more than you need... why do you need to store a linear transformation of your data?
Why not just store the points and cross-join it with a formula in your query?
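Roughly like this (a sketch using the column names above; it assumes the distance is a simple formula of the coordinates, e.g. straight-line, which may not fit the OP's case):

Code:
-- Compute every pairwise straight-line distance at query time; nothing is stored.
SELECT p1.ID AS ID1,
       p2.ID AS ID2,
       SQRT(SQUARE(p1.X_Coord - p2.X_Coord) +
            SQUARE(p1.Y_Coord - p2.Y_Coord)) AS Distance
FROM TBL_Points AS p1
CROSS JOIN TBL_Points AS p2
WHERE p1.ID <> p2.ID;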
 
You don't need the ID column, necessarily. You don't need another table for the distances unless there is somehow more than one distance per location pair. A table storing { X, Y, Distance } is fine, with a primary key over { X, Y }.

The estimate of 1/2 * N * (N - 1) only holds if the distance between two points is the same in both directions.

I have to presume Displacement(a,b) is the same as Displacement(b,a)

Storing the distance is silly, as it's nothing more than a transformation of your point locations.
In that case it should just be a table of points (described by their locations).

This problem sounds like homework to me, and it sounds like OP doesn't really understand how databases work :(
 
The distance may or may not be the same in both directions. It's not a presumption I'd be willing to make without an assurance that it was true. Only the OP has that information.

The distance is a "transformation" of the point locations, but it may or may not be a trivial transform. Not storing it means that it can't be used to filter queries. Failing to store the value means it has to be computed for each row examined, and that's really expensive.

Let's say the table has a list of locations in the US. We have an algorithm to compute the actual driving distance between two locations (on roads, not just great-circle distance or some other trivial transformation). Maybe there's 16 million locations in the US, so the pair table would have on the order of 256 trillion rows, or half that if distance is symmetric. Since routed driving directions are usually directional, we probably have the full 256 trillion.

Say I start at my house (one of the 16 million locations) and want to find everything within 30 driving miles. To do this, I have to consider the 16 million other locations and re-calculate the route for each one. That's expensive for the CPU, and can't be helped by any indexing.

If I have stored the distance, then I can read the 16 million other locations and find the ones that have a stored distance of less than 30 miles. That's very fast -- doesn't require any CPU. And I can help the query along with an index.
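In rough terms (a sketch; the index and parameter names are made up, and the columns follow the TBL_Distances layout mentioned earlier in the thread):

Code:
-- With the distance materialized and indexed, "what can I reach within 30 miles?"
-- becomes an index seek instead of 16 million route computations.
CREATE INDEX IX_Distances_Origin_Dist
    ON TBL_Distances (ID1, Distance) INCLUDE (ID2);

SELECT ID2, Distance
FROM TBL_Distances
WHERE ID1 = @myHouse
  AND Distance < 30;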

I'm not sure it's homework, but it certainly doesn't sound like the OP knows what he wants or even understands the problem he's trying to solve.
 
The distance may or may not be the same in both directions. It's not a presumption I'd be willing to make without an assurance that it was true. Only the OP has that information.

Scrolling up (I should have RTFT), the OP indicates that the distance is not a straight line, so I'm inclined to agree with you that we shouldn't make that assumption.

Even so, it should still only require two tables at most.
 
lol i am really glad i got the "this sounds like homework so i don't want to help you or you should know more" guy to respond ... Zangmonkey, you were helpful. the solution was easy and obvious and i was overthinking it. i am limited in how many distances i can calculate per day, which is why i want to save them so i don't have to make the same calculation twice.

if the distance was a straight line then yes, i would never store it and would just make the calculation in the query. anyway, thank you, Zangmonkey!
 

<3

Good luck.
Remember to support StackOverflow
 