Database or 3D Array?

cuemasterfl

I have a rather daunting project to do at work, and I'm wondering what the best way to do this would be, so I've come to the people I trust ([H]). I'm going to write software in VB.net (2008) that dials into a computer, which could be in another state, retrieves data in the form of a CSV (customer ID, Name, Meter Reading, Signal Strength, and a bunch of other fields), and then stores the data - somewhere, for processing.

The data retrieved will likely contain several hundred to several thousand records. The data that is collected on each dial (which may happen anywhere from every hour to every 24 hours) must be added to prior data, so that a history can be seen, and then we must be able to query the data to find trends, show history, daily averages, etc. (the exact reports haven't been determined yet, but are likely to include those just mentioned).

Should I use a database system, like SQL (MySQL?), to store the data, or do you think I could get away with using a 3D Array? I'm not experienced in SQL, but I have used the basic commands in the past, so I have a basic understanding. Which is easier?

Also, as mentioned earlier, each data file I collect has to be added to the previous ones. What would be the best way to do that? Can I take a CSV file and dump it into an existing SQL database (or 3D Array)? If so, can you give me an idea how to do that?

I believe I can get this project done, but I'd like a pointer (or two) in the right direction. Thank you for your time!
 
Unless you want to keep the data only in temporary memory (and hope the power never ever goes out, and the app never ever crashes), you'll want a more permanent storage medium. A transactional database system comes to mind. Given that you're heading down the VB.Net route, I'll toss out SQL Server Standard or Express as the first options.

It sounds like a good amount of your UI and business communication rules/patterns has been thought through, so your next step is getting data from the CSV into the database. Some basic text parsing will get this done, and it is not difficult to write.
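
For a concrete starting point, here's a rough sketch of that step using the TextFieldParser class that ships with .NET. The Observation table, its columns, and the field order are placeholder assumptions here, not a final schema:

Code:
' Sketch only: the Observation table and its columns are assumed names,
' and the field indexes assume a matching CSV layout.
Imports Microsoft.VisualBasic.FileIO
Imports System.Data.SqlClient

Module CsvImport
    Sub ImportCsv(ByVal csvPath As String, ByVal connectionString As String)
        Using conn As New SqlConnection(connectionString)
            conn.Open()
            Using parser As New TextFieldParser(csvPath)
                parser.TextFieldType = FieldType.Delimited
                parser.SetDelimiters(",")
                parser.ReadLine()   ' skip the header row, if the file has one
                While Not parser.EndOfData
                    Dim fields() As String = parser.ReadFields()
                    Using cmd As New SqlCommand( _
                        "INSERT INTO Observation (MeterID, Name, Reading, SignalStrength) " & _
                        "VALUES (@MeterID, @Name, @Reading, @SignalStrength)", conn)
                        cmd.Parameters.AddWithValue("@MeterID", Convert.ToInt64(fields(0)))
                        cmd.Parameters.AddWithValue("@Name", fields(1))
                        cmd.Parameters.AddWithValue("@Reading", Convert.ToDecimal(fields(2)))
                        cmd.Parameters.AddWithValue("@SignalStrength", Convert.ToInt32(fields(3)))
                        cmd.ExecuteNonQuery()
                    End Using
                End While
            End Using
        End Using
    End Sub
End Module

Using parameterized commands also keeps names containing apostrophes or commas from breaking the INSERT.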

Fetching data will be done with SQL statements, dumping the results into a destination object -- either custom-defined objects (perhaps even flexing List(Of T) generics for grouping), or the DataTable or DataSet types. You can then bind those objects to grids for templated list presentation, or iterate/parse as needed to pull the info you want.
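
As a sketch of the DataTable route (the query, table name, and method shape are illustrative only):

Code:
' Sketch only: fills a DataTable from a query and binds it to a grid.
Imports System.Data
Imports System.Data.SqlClient
Imports System.Windows.Forms

Module ReadingsView
    Sub LoadReadings(ByVal connectionString As String, ByVal grid As DataGridView)
        Dim dt As New DataTable()
        Using conn As New SqlConnection(connectionString)
            Using da As New SqlDataAdapter( _
                "SELECT MeterID, Name, Reading, SignalStrength FROM Observation", conn)
                da.Fill(dt)   ' Fill opens and closes the connection itself
            End Using
        End Using
        grid.DataSource = dt   ' bound grid gives the templated list presentation
    End Sub
End Module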

Plus, I really think any of the above objects will lead to more maintainable code than a three dimensional array.
 
Thank you for your reply :)

The data collected by the Gateway (the machine I will dial into) is going to be stored on an SD card until I retrieve it (then discarded or kept there for a while - unsure yet), so we don't have to worry about the power going out there, etc.

We will end up with multiple Gateways (in multiple states), and the data for each has to be kept separate, so I assume I will need multiple tables?

This is the first time we've done anything like this. We have people who will be building the Gateways (custom circuit boards), while I write the software. I've done quite a bit of programming before, but I've really never messed with Databases or Dialing into remote machines (not too worried about the dialing thing right now).
 
You don't need multiple tables. You can store a column in the database that tells you where the data came from. If you store multiple tables, you'll have to do any system-wide aggregation over multiple tables, and that's cumbersome.


Instead of this:

Code:
CREATE TABLE Pennsylvania
(CustomerID int, Name varchar(50), MeterReading decimal(9,2), SignalStrength int)
CREATE TABLE Idaho
(CustomerID int, Name varchar(50), MeterReading decimal(9,2), SignalStrength int)

you're probably better off with this:
Code:
CREATE TABLE Observations
(StateCode char(2), CustomerID int, Name varchar(50), MeterReading decimal(9,2), SignalStrength int)
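
The payoff shows up when you report across the whole system; a single GROUP BY replaces one query per state table. A sketch (the aggregate picked here is just for illustration):

Code:
' With everything in one Observations table, a system-wide rollup is a
' single query instead of one per state. (Sketch: aggregate is illustrative.)
Dim sql As String = _
    "SELECT StateCode, COUNT(*) AS Readings, AVG(MeterReading) AS AvgReading " & _
    "FROM Observations " & _
    "GROUP BY StateCode"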
 
There could be several gateways for Idaho, for example. Several cities may have their own gateway.

Any kind of aggregation or reporting will only have to be done on that particular gateway. There will not be a need to aggregate data from Boise, Idaho with Pittsburgh, Pennsylvania, etc.

Also, not sure if I made it clear in my first post, but the data for each reading has to be distinguishable from, say, the data collected 12 hours ago. This allows us to see a history/trend in readings. Each reading isn't merged with the other readings.

I'll post some sample data, showing how we envision the data will look once retrieved, in a few moments...
 
Off-hand, I think just adding a few extra columns to mikeblas's sample above should suffice. Perhaps adding a link table, depending on how things normalize. But I'll hold off until you can post some data.
 
10/21/09 03:00
MeterID Name Latitude Longitude Reading SignalStrength Tamper Leak
1458785963 Jack Sparrow 29.20765 -82.23327 21.25 97 0 0
1223698574 Davey Jones 29.25249 -82.21594 28.59 97 0 0
5896547841 Elizabeth Swan 29.15670 -82.14496 15.69 92 1 0
1452369874 Han Solo 29.19985 -82.12514 36.25 96 0 1
1234567890 Luke Skywalker 29.21521 -82.12178 23.78 77 0 0
8965231475 Princess Leia 29.21668 -82.14692 29.85 68 0 0
8956231474 Chew Bacca 29.22277 -82.11424 4.25 49 1 0
5623897451 Spongebob Squarepants 29.24422 -82.14776 19.25 88 0 0
5689741235 Patrick Star 29.18878 -82.14487 31.05 96 0 0
1793486255 Squid Ward 29.19267 -82.14284 36.98 18 0 0

10/21/09 07:00
MeterID Name Latitude Longitude Reading SignalStrength Tamper Leak
1458785963 Jack Sparrow 29.20765 -82.23327 21.25 97 0 0
1223698574 Davey Jones 29.25249 -82.21594 28.59 97 0 0
5896547841 Elizabeth Swan 29.15670 -82.14496 15.69 92 1 0
1452369874 Han Solo 29.19985 -82.12514 36.25 96 0 1
1234567890 Luke Skywalker 29.21521 -82.12178 23.78 77 0 0
8965231475 Princess Leia 29.21668 -82.14692 29.85 68 0 0
8956231474 Chew Bacca 29.22277 -82.11424 4.25 49 1 0
5623897451 Spongebob Squarepants 29.24422 -82.14776 19.25 88 0 0
5689741235 Patrick Star 29.18878 -82.14487 31.05 96 0 0
1793486255 Squid Ward 29.19267 -82.14284 36.98 18 0 0

Note the different times the data was retrieved. Each block of data will need to stay separate. I know the data is the same but I couldn't be bothered to make up new data - imagine block 2 having different values.

The data was copied/pasted from an Excel spreadsheet, hence the lack of commas.
 
A quick sample...
Code:
CREATE TABLE Observation (MeterID bigint, ExaminerName varchar(50), Latitude decimal(8,5), Longitude decimal(8,5), Reading decimal(9,2), SignalStrength int, Tamper bit, Leak bit)

CREATE TABLE Location (MeterID bigint, MunicipalityName varchar(50), StateCode char(2))

Observation would link to Location by the MeterID column, where a Location could have zero-to-many Observation records; but a single Observation record can only belong to one Location.

Queries would be done against a single table or joining on the above relationship. Filtering and analysis could be done with WHERE and lots of other SQL syntax.
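
For example, a per-state report could join the two tables like this (a sketch; the names follow the tables above, and @StateCode would be supplied as a SqlParameter):

Code:
' Joining Observation to Location through MeterID (sketch only;
' @StateCode is supplied as a SqlParameter at run time).
Dim sql As String = _
    "SELECT l.StateCode, l.MunicipalityName, o.MeterID, o.Reading " & _
    "FROM Observation o " & _
    "INNER JOIN Location l ON l.MeterID = o.MeterID " & _
    "WHERE l.StateCode = @StateCode " & _
    "ORDER BY l.MunicipalityName"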

You could expand this out through further normalization, too. Some examples:
1) Linking to a list of ExaminerNames stored in another table/location
2) Maintaining a distinct list of municipalities (e.g., counties, cities, townships)
3) Using a unique number/key for the relationship (instead of a value that the user can directly add/update).



And if this doesn't seem clear, then it's really worth your time to dive into the general concepts of database design, basic SQL statements, parent/child relationships, and data normalization. Don't worry so much about the database vendor; focus more on the concepts for now.
 
I guess we could do this?

GatewayID,Date,Time,MeterID,Name,Latitude,Longitude,Reading,SignalStrength,Tamper,Leak
1001,10/21/2009,3:00,1458785963,Jack Sparrow,29.20765,-82.23327,21.25,97,0,0
1001,10/21/2009,3:00,1223698574,Davey Jones,29.25249,-82.21594,28.59,97,0,0
1001,10/21/2009,3:00,5896547841,Elizabeth Swan,29.15670,-82.14496,15.69,92,1,0
1001,10/21/2009,3:00,1452369874,Han Solo,29.19985,-82.12514,36.25,96,0,1
1001,10/21/2009,3:00,1234567890,Luke Skywalker,29.21521,-82.12178,23.78,77,0,0
1001,10/21/2009,3:00,8965231475,Princess Leia,29.21668,-82.14692,29.85,68,0,0
1001,10/21/2009,3:00,8956231474,Chew Bacca,29.22277,-82.11424,4.25,49,1,0
1001,10/21/2009,3:00,5623897451,Spongebob Squarepants,29.24422,-82.14776,19.25,88,0,0
1001,10/21/2009,3:00,5689741235,Patrick Star,29.18878,-82.14487,31.05,96,0,0
1001,10/21/2009,3:00,1793486255,Squid Ward,29.19267,-82.14284,36.98,18,0,0
1001,10/21/2009,7:00,1458785963,Jack Sparrow,29.20765,-82.23327,21.25,97,0,0
1001,10/21/2009,7:00,1223698574,Davey Jones,29.25249,-82.21594,28.59,97,0,0
1001,10/21/2009,7:00,5896547841,Elizabeth Swan,29.15670,-82.14496,15.69,92,1,0
1001,10/21/2009,7:00,1452369874,Han Solo,29.19985,-82.12514,36.25,96,0,1
1001,10/21/2009,7:00,1234567890,Luke Skywalker,29.21521,-82.12178,23.78,77,0,0
1001,10/21/2009,7:00,8965231475,Princess Leia,29.21668,-82.14692,29.85,68,0,0
1001,10/21/2009,7:00,8956231474,Chew Bacca,29.22277,-82.11424,4.25,49,1,0
1001,10/21/2009,7:00,5623897451,Spongebob Squarepants,29.24422,-82.14776,19.25,88,0,0
1001,10/21/2009,7:00,5689741235,Patrick Star,29.18878,-82.14487,31.05,96,0,0
1001,10/21/2009,7:00,1793486255,Squid Ward,29.19267,-82.14284,36.98,18,0,0

And so use:
Code:
CREATE TABLE Observation (GatewayID int, [Date] date, [Time] time, MeterID bigint, Name varchar(50), Latitude decimal(8,5), Longitude decimal(8,5), Reading decimal(9,2), SignalStrength int, Tamper bit, Leak bit)

This way we could assign an ID to each Gateway and run queries based on that, and other fields.
 
Looks ok so far. Though "Date" and "Time" would actually exist as a single column of datatype "DateTime".
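
A sketch of both halves of that, assuming the combined column is named ObservationTime (the name and the CSV field indexes are assumptions): build the DateTime while importing, then filter on a range.

Code:
' During import: combine the CSV's separate date and time fields into one
' DateTime. (Sketch: fields() comes from the CSV parser, and the indexes
' assume the GatewayID,Date,Time,... layout above.)
Dim whenRead As DateTime = _
    DateTime.ParseExact(fields(1) & " " & fields(2), "MM/dd/yyyy H:mm", _
                        System.Globalization.CultureInfo.InvariantCulture)

' A single DateTime column (ObservationTime here, an assumed name) keeps
' range queries simple:
Dim sql As String = _
    "SELECT MeterID, Reading FROM Observation " & _
    "WHERE ObservationTime BETWEEN @StartTime AND @EndTime"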
 
I figured that having the date and time in separate columns would make it easier to query the data, but you might be right.
 
By the way, many thanks to you both PTNL and mikeblas for helping me with this. I really appreciate it ;)
 
If you don't like the complexity of setting up a database, I would consider using SQLite for this task. It's essentially the same as using a text or CSV file, except you can use SQL to access the data.

Another option would be to store this in XML format on your SD card, which would allow you to load it into a DOM tree and manipulate it later without needing to think up and implement a data structure for your data. Instead, you create an XML schema, and the rest of the work is done for you.

Another option is to build your own tree-based structure from the CSV file. Your code will be much easier to debug and maintain than it would be with a raw 3D array.

There are too many good options to end up storing this in a huge memory cube. Debugging that thing is going to be a nightmare later down the line.
 
One thing I didn't mention, if it matters, is that the history will be kept for maybe a year. Imagine each gateway reading 5,000 meters every hour (could be anything from every hour to every 24 hours), for a year. That's the potential size of the database.
 
One option would be to create an "archive" table, and push all records older than a certain age into that table. You may need to adjust that logic depending on your needs for queries, report generation, etc. Some people with more of a DBA background may have additional suggestions.
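
A sketch of that idea (ObservationArchive and ObservationTime are assumed names; adjust to the real schema), doing the copy and the delete inside one transaction so a mid-run failure can't lose rows:

Code:
' Move rows older than a cutoff into an archive table, atomically.
' (Sketch only: table and column names are assumed.)
Imports System.Data.SqlClient

Module Archiver
    Sub ArchiveOldRows(ByVal connectionString As String)
        Dim cutoff As DateTime = DateTime.Now.AddYears(-1)
        Using conn As New SqlConnection(connectionString)
            conn.Open()
            Using tx As SqlTransaction = conn.BeginTransaction()
                Using copyCmd As New SqlCommand( _
                    "INSERT INTO ObservationArchive " & _
                    "SELECT * FROM Observation WHERE ObservationTime < @Cutoff", conn, tx)
                    copyCmd.Parameters.AddWithValue("@Cutoff", cutoff)
                    copyCmd.ExecuteNonQuery()
                End Using
                Using deleteCmd As New SqlCommand( _
                    "DELETE FROM Observation WHERE ObservationTime < @Cutoff", conn, tx)
                    deleteCmd.Parameters.AddWithValue("@Cutoff", cutoff)
                    deleteCmd.ExecuteNonQuery()
                End Using
                tx.Commit()   ' disposing without Commit rolls both statements back
            End Using
        End Using
    End Sub
End Module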
 
nameless: thank you! I will also consider your suggestions, which sound good. Of course, at this point, I have no real way of knowing what will work best, but it's good to have multiple options ;)
 
XML might be useful sometimes for data interchange, but it doesn't scale for various reasons.

cuemasterfl said:
"One thing I didn't mention, if it matters, is that the history will be kept for maybe a year. Imagine each gateway reading 5,000 meters every hour (could be anything from every hour to every 24 hours), for a year. That's the potential size of the database."

You should keep working on this until you understand your database size; it's important for estimating the hardware you'll need and framing the expectations for reporting and storage.

If you have 5000 meters, and read each every hour for a whole year, you only have 43 million rows. Unless each meter reports a ton of data, you've really got a pretty trivial database. You don't specify how many gateways there are. That multiplier can help you reach larger numbers.

Given information about how much data each reading provides, and how many gateways you have (or will have), you can cook up a better estimate of the storage requirements you've got.

Simultaneously, you'll really want to get to a good, solid data model.
 
Each line will be 32 bytes of data, but we will then append to that, so it could be, say, 64 bytes for each line by the time we're done. We just had a meeting about this, and each city could have up to 20-30 gateways. Right now there are 0 gateways in existence, because we're still building the first gateway, but it could expand pretty quickly. I just did some math on the calculator, and there aren't enough digits....lol. So, we will need whichever system is best for handling large amounts of data and for querying to produce reports. SQL seems to be the best answer.
 
I'm not sure what you plugged into your calculator. You've told us that each city could have 20 to 30 gateways, but you haven't said how many cities you'll populate. Is 5,000 meters per gateway a valid number? Is it an average or a maximum?

You've used a new term, "line", which you haven't defined. What does that term mean?

What does "append" mean? That you'll eventually have meters which report more parameters? That you'll record two readings per meter each time you call? For every meter, or only for some? Is that predictable, per meter? Do meters then have types? Perhaps one type reports less data per measurement, and another reports more. Is the type known in advance?
 
There's no way of knowing how many cities we will populate. If the system works well enough, it could be many. 5,000 is an estimate. When I say line, I mean row/record in the DB. The gateway will report 32 bytes of data for each record, and then we will add other information to that. There are a few different meter types (the type is also part of the data that will be transmitted to us), but 32 bytes will be the maximum transmitted to us per record.
 
You don't need to know exactly, but certainly you have an estimate. If you didn't have an estimate, you shouldn't be in business and spending money -- because in that case, not knowing means you don't know when you'll turn a profit, how much you'll spend, or how much you'll charge.

Thirty gateways (one large city, by your numbers) reading 5,000 meters each, once per hour, gives about 1.3 billion rows per year. At 64 bytes per row, that's roughly 84 GB, plus overhead for storage and indexing. It's really not a very big database, though you're into the scale where you need to have a very good model and operational policies to avoid problems. While it's not very big, it certainly isn't a casual system.

Depending on your goals for uptime and reporting, you might want to investigate partitioning.
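
For a flavor of what that looks like in SQL Server 2005/2008 (where table partitioning requires Enterprise Edition), here's a sketch; the names, boundary dates, and the everything-on-[PRIMARY] mapping are placeholders:

Code:
' Sketch: create a partition function/scheme and a table partitioned by
' month on its DateTime column. All names and dates below are placeholders.
Imports System.Data.SqlClient

Module PartitionSetup
    Sub CreatePartitionedTable(ByVal connectionString As String)
        Dim ddl() As String = New String() { _
            "CREATE PARTITION FUNCTION pfByMonth (datetime) " & _
            "AS RANGE RIGHT FOR VALUES ('2009-11-01', '2009-12-01', '2010-01-01')", _
            "CREATE PARTITION SCHEME psByMonth " & _
            "AS PARTITION pfByMonth ALL TO ([PRIMARY])", _
            "CREATE TABLE ObservationPartitioned " & _
            "(GatewayID int, MeterID bigint, ObservationTime datetime, " & _
            "Reading decimal(9,2)) ON psByMonth(ObservationTime)"}
        Using conn As New SqlConnection(connectionString)
            conn.Open()
            For Each statement As String In ddl
                Using cmd As New SqlCommand(statement, conn)
                    cmd.ExecuteNonQuery()
                End Using
            Next
        End Using
    End Sub
End Module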
 