Processing a large data set with a PHP script - memory/resource concerns?

KevySaysBeNice

[H]ard|Gawd
Joined
Dec 7, 2001
Messages
1,452
Hi all! Thanks in advance for any pro-tips!

I've got around, let's say, 3000 records (rows) in a database that I have to process. This will most likely mean making 3000-6000 database calls, looping through stuff, using switch statements, comparing strings, etc.

Basically, a lot of data massaging, and a lot of DB traffic.

Now, maybe 3000 records and 6000 database calls isn't a problem, but my concern is whether doing this all with a single function call is a bad idea, and if so, how I should approach it instead.

I basically have a function,
Code:
$this->execute();

which kicks everything off. It creates instances of objects, which create other instances of objects, and loops through stuff, and compares stuff, etc.
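For a rough idea of what I mean by "kicks everything off", here's the shape I have in mind - just a sketch, assuming PDO, and the table/class names (import_rows, Page) are placeholders for my real ones:
Code:
public function execute()
{
    // Process the import rows in chunks so we never hold all 3000+ records in memory at once.
    $offset    = 0;
    $chunkSize = 200;

    do {
        $stmt = $this->pdo->prepare(
            'SELECT * FROM import_rows ORDER BY id LIMIT :limit OFFSET :offset'
        );
        $stmt->bindValue(':limit', $chunkSize, PDO::PARAM_INT);
        $stmt->bindValue(':offset', $offset, PDO::PARAM_INT);
        $stmt->execute();
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

        foreach ($rows as $row) {
            $page = new Page($row); // hypothetical class that massages a single row
            $page->process();
        }

        $offset += $chunkSize;
    } while (count($rows) === $chunkSize);
}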

Is this OK? Will I have problems? Is there a better way to be doing this (assuming all of the processing and data massaging is necessary, of course)?
 
The other questions you need to ask are:
- How fragile would having a single method call be?
- How easily could you add tracing/debugging output?
- How would you handle the various (known/possible) exception types that could be thrown at various points?
- How do you validate the import file/data, and pass preliminary validation errors back to the user if some aspects are invalid?
- Are partial imports allowed, or is it an "all or none" import?

Personally, I would split the work up into more manageable chunks. One "god" method quickly becomes unwieldy to develop and fragile to modify in the future.
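Purely as an illustration (the step names here are hypothetical, not your actual design), the top level can stay thin and just sequence the chunks, so each one can log and fail on its own:
Code:
public function runImport()
{
    $errors = [];

    // Each step is its own method; later steps depend on earlier ones succeeding.
    foreach (['buildMapping', 'generateHtml', 'publish'] as $step) {
        try {
            $this->$step();
        } catch (Exception $e) {
            $errors[] = $step . ' failed: ' . $e->getMessage();
            break; // stop the pipeline, but keep the error for reporting
        }
    }

    return $errors;
}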
 
Thanks for the quick reply!

Just to be clear, this won't all be in a single method, but rather a single method that kicks everything off, so things will be broken up into logical "chunks". I'm essentially generating a bunch of pages of static HTML, but the pages are nested within a tree, each page consists of data pulled from several different rows in the database, and depending on various attributes in a row, that piece of the page needs different HTML rendered. So for the sake of my own sanity, and code readability and maintainability, I'll be breaking the code into logical parts (for instance, a Page class that contains the logic and data it needs to render a single page, log any errors, etc.).

Long story short, maintainability and readability aren't what I'm worried about, but I AM worried about the fragility of having a single originating call. What would the preferred way be to handle this, so that if one piece fails/crashes, the other bits continue?
 
I disagree, and feel that maintainability and readability are concerns to keep in mind, both for you and for anyone else who needs to look at or modify this in the future. Where you split things can be decided by (at least) two factors: which loops/parses can be skipped when a record is invalid or an exception is thrown, and which pieces are required versus optional as you loop and aggregate. Without knowing some of the business and validation rules, more granular suggestions are just speculative.
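In general terms, though, per-record isolation usually looks something like this (sketch only - renderPage and the error collection are placeholders):
Code:
foreach ($pages as $page) {
    try {
        $this->renderPage($page); // required work for this page only
    } catch (Exception $e) {
        // Record the failure and move on; one bad page shouldn't stop the rest.
        $this->errors[$page->id] = $e->getMessage();
        continue;
    }
}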
 
So, a nightly process, which I have no control over, will populate a table with three different types of rows: level 1 pages, level 2 pages (nested under level 1 in the tree), or page elements which make up the content of a level 2 page. These are all in the same table, but various attributes distinguish one type from another.

So, here is a basic breakdown of the logical parts:

1) A mapping is created - I go through all of the level 1 and level 2 page rows in the database to see if each row represents a NEW page. If it does, I create a new row in my mapping table and set the page's parent relationship; here I might also call a CMS-specific function which actually generates a new page, then take the new page ID and add it to the mapping table. If it's not a new page, it's an update, in which case I mark it as such, because all subpages will need to be reprocessed (parent metadata trickles down to the subpages).
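In rough code, the mapping pass would look something like this (method and column names are placeholders, not my real ones):
Code:
foreach ($this->fetchLevel1And2Rows() as $row) {
    $mapping = $this->findMapping($row['source_id']);

    if ($mapping === null) {
        // NEW page: have the CMS create it and remember the generated ID.
        $cmsPageId = $this->cms->createPage($row['title'], $row['parent_source_id']);
        $this->insertMapping($row['source_id'], $cmsPageId, $row['parent_source_id']);
    } else {
        // Existing page: flag it so the page and its subpages get reprocessed.
        $this->markForUpdate($mapping['cms_page_id']);
    }
}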

2) The next step is to go through the mapping I created and generate HTML for pages that need to be updated, new pages, etc. This involves getting each level 2 page ID from the mapping table and fetching all of its page elements (rows); each of these rows is then looped through, and static HTML is generated based on its content. All of this static content is put together, along with some additional metadata from the parent level 2 and level 1 pages, and stored in a final "pre-production" field. Then we move on to the next page in the mapping.
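Again roughly (placeholder names), the per-page loop would be:
Code:
foreach ($this->pagesNeedingRebuild() as $mapping) {
    $blocks = [];

    // Each page element row becomes its own chunk of static HTML.
    foreach ($this->fetchPageElements($mapping['cms_page_id']) as $element) {
        $blocks[] = $this->renderElement($element);
    }

    // Combine the blocks with parent metadata and stash the result for review.
    $html = $this->wrapWithMetadata($mapping, implode("\n", $blocks));
    $this->savePreProduction($mapping['cms_page_id'], $html);
}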

3) Finally, once all new/updated page HTML has been generated, and assuming no errors have been encountered, we'll go through and actually insert the HTML content into the CMS pages that were generated previously (step 1). If any critical errors have occurred, they will have been logged by this point and no live HTML will actually be updated; somebody will be notified, and we'll have to look at the data to see what sort of error occurred and how to fix it.
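So the last step boils down to a gate plus a loop, something like this (sketch, names made up):
Code:
if ($this->errorLog->hasCriticalErrors()) {
    // Leave the live pages untouched and let a human investigate.
    $this->notifyAdmins($this->errorLog->all());
    return;
}

foreach ($this->pagesNeedingRebuild() as $mapping) {
    $this->cms->updatePageContent(
        $mapping['cms_page_id'],
        $this->loadPreProduction($mapping['cms_page_id'])
    );
}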


Also, I think you may have misunderstood me - I'm not saying that maintainability and readability aren't things to be concerned with in general, but rather that I intend to have classes/methods with a clear purpose, and NOT a single huge method with all the logic shoved into it. That said, as of now, there would be one originating function that kicks all of this off and decides when to move on to step 2, step 3, etc.

Thanks again for the time and any further feedback!
 
Thanks for the clarifications on code intent. Here are a few questions I have, grouped by each of your outlined steps...

Step #1:
- How are mapping deletes handled? I'm wondering whether orphaned records would be recognized as "new" content additions.
- How are content re-organizations handled? Something along the line of moving a piece of content from having one set of Level 1-or-2 parents to a different set of Level 1-or-2 parents.

Step #2:
- Are you writing HTML to the disk (ie: creating physical files), or storing the markup in a database record?

Step #3:
- If a content add/update fails, then it makes sense to not obstruct the rest of the content adds and updates. However, what would happen if a user browsed to a page corresponding to a corrupted tuple?
 
No, thank YOU for the thoughtful questions!

Step #1:
- How are mapping deletes handled? I'm wondering whether orphaned records would be recognized as "new" content additions.
- How are content re-organizations handled? Something along the line of moving a piece of content from having one set of Level 1-or-2 parents to a different set of Level 1-or-2 parents.

Deletes are actually going to be handled in the first step. I thought about this more, and there will be either NEW, UPDATED, or DELETED actions. DELETED means the page will be removed from the CMS, and all children of the page will automatically be deleted (or hidden) as well.

(I honestly didn't think much about moves before you mentioned them, so thank you!) Moves work basically the same way: if a level 1, 2, or 3 row is marked as updated, the page/element will be regenerated and the parent page will be checked and set every time, regardless of whether it was actually moved (it's done by default just to make sure the parent is correct).
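So each row basically funnels into a small switch (placeholder method names, just to show the idea):
Code:
switch ($row['action']) {
    case 'NEW':
        $this->createPageAndMapping($row);
        break;
    case 'UPDATED':
        // Regenerating also re-checks/re-sets the parent, which covers moves.
        $this->regeneratePage($row);
        break;
    case 'DELETED':
        // Children are deleted (or hidden) along with the parent page.
        $this->deletePageAndChildren($row);
        break;
}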

Step #2:
- Are you writing HTML to the disk (ie: creating physical files), or storing the markup in a database record?

Nope, the HTML is stored in the DB. Also, previously I said I would take all the level 3 rows that belong to a level 2 page and generate one static block of HTML. Instead, I'm now going to create a separate content block (a row in our CMS content table) for each level 3 element, and store the collection of content blocks on the level 2 page, as opposed to my original plan of generating one massive static HTML block. The reason is that it creates fewer dependencies, and if a single level 3 content block is updated, I don't have to regenerate the entire page.
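So instead of one big write per page, it becomes one upsert per level 3 element - roughly like this, assuming MySQL, a unique key on source_id, and placeholder table/column names:
Code:
$stmt = $this->pdo->prepare(
    'INSERT INTO cms_content_blocks (page_id, source_id, html)
     VALUES (:page_id, :source_id, :html)
     ON DUPLICATE KEY UPDATE html = VALUES(html)'
);

foreach ($levelThreeRows as $element) {
    // One CMS content row per element, keyed back to its level 2 page,
    // so updating one element doesn't touch the rest of the page.
    $stmt->execute([
        ':page_id'   => $element['parent_page_id'],
        ':source_id' => $element['id'],
        ':html'      => $this->renderElement($element),
    ]);
}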

Step #3:
- If a content add/update fails, then it makes sense to not obstruct the rest of the content adds and updates. However, what would happen if a user browsed to a page corresponding to a corrupted tuple?

Well, in theory nothing should be written to the actual final tables until the data validates. Also, all of the data we're getting comes from a very "clean" source, and it's passed through a filter on top of that so it's formatted exactly to my requirements, so I wouldn't expect many bad records (not that it matters; I of course want to plan for any situation).
 
Regardless of the source, you'll still want to have your own defensive validation in code. Assume dirty until proven innocent ;)
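Even a basic sanity check goes a long way - something like this, with the field names just guessed from your description:
Code:
private function validateRow(array $row)
{
    $errors = [];

    if (!in_array($row['action'] ?? '', ['NEW', 'UPDATED', 'DELETED'], true)) {
        $errors[] = 'Unknown action: ' . ($row['action'] ?? '(none)');
    }
    if (($row['level'] ?? 0) < 1 || ($row['level'] ?? 0) > 3) {
        $errors[] = 'Unexpected level: ' . ($row['level'] ?? '(none)');
    }
    if (trim((string) ($row['title'] ?? '')) === '') {
        $errors[] = 'Missing title';
    }

    return $errors; // an empty array means the row looks sane
}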

At this point, I think you've got a handle on what needs to be done and some potential vulnerabilities to consider and plan for. I'll also assume that some verbose and useful logging will be implemented as well. Drop a line after you get things coded and tested; I'd like to hear of any other discoveries, remediations, and outside factors that went into the solution.
 