What system of automation would you use?

Coldblackice

This is a simple project, but I feel that my methods are inefficient and kludgy:

(EDIT: I'm also open to suggestions for other languages; it doesn't have to be just PHP.)

(EDIT pt II: To anyone anxious to cry foul over the scraping and suggest a weather-service API instead -- the weather providers' data is irrelevant; these are local data-collection posts run by schools in the town, and they are specifically what matters. I also have full permission from everyone involved to collect the data, and it was decided at the project's start that scraping would be the easiest approach, since the framework for the schools/posts to continuously publish temperature data on their respective websites is already in place and running.)

-A town's current weather stats are continuously updated on a few different websites

-I've written a local PHP script that resides on my desktop and is run in a browser; it scrapes the stats off these remote sites, then dumps them into a local MySQL database, also on my desktop.
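A single scrape-and-store pass might look roughly like this (a minimal sketch only; the station URLs, the regex, and the database/table/column names are all placeholders, not the real ones):

[CODE]
<?php
// scraper.php -- hypothetical example of one scrape-and-store pass
$stations = [
    'north_school' => 'http://example.com/north/weather.html',
    'south_school' => 'http://example.com/south/weather.html',
];

$db = new PDO('mysql:host=localhost;dbname=weather', 'user', 'password');
$insert = $db->prepare(
    'INSERT INTO readings (station, temperature, recorded_at) VALUES (?, ?, NOW())'
);

foreach ($stations as $station => $url) {
    $html = file_get_contents($url);          // fetch the station's page
    if ($html === false) {
        continue;                             // skip a station that's unreachable
    }
    // Pull the temperature out of the markup -- the pattern is purely illustrative.
    if (preg_match('/Temperature:\s*([-\d.]+)/', $html, $m)) {
        $insert->execute([$station, (float) $m[1]]);
    }
}
[/CODE]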

-To keep this process looping, I've altered my main PHP configuration, raising the max script execution time to 10 minutes, and in the head section of the HTML container that wraps this PHP script I've set <meta http-equiv="refresh" content="348;url=test.php">. So the PHP script loops for a certain time (divided up with sleep() calls), while the HTML refreshes at a separate interval.
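In other words, roughly this arrangement (a sketch of the setup described above, not the actual code; scrape_and_store() stands in for the scraping/database work):

[CODE]
<?php
// test.php -- illustration of the current browser-driven loop
set_time_limit(600);   // max execution time raised to 10 minutes

// The browser re-requests the page every 348 seconds:
echo '<meta http-equiv="refresh" content="348;url=test.php">';

for ($pass = 0; $pass < 3; $pass++) {
    scrape_and_store();   // placeholder for one scraping pass
    sleep(100);           // spread the passes out within one execution
}
[/CODE]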

(Besides the poor coordination and timing of these two separate "timers", this feels like a really kludgy and half-cocked approach to me, but I don't know what a more veteran programmer would do. The end result works fine even so, but for my own development as a programmer, I'd like to see how more seasoned programming minds would approach this.)

-The scraped stats are stored in a local MySQL server

-(Not yet implemented) Now the stored SQL data needs to be displayed on an external web server, in table format. Initially, I planned to simply point the local PHP script at a remote SQL server so it would store the scraped stats remotely. Then, any time the main index page on that server is accessed, it would query its SQL server for the stats and render them as a table in the browser for the user, on request.
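The display side on the web host could be as simple as something like this (again just a sketch; the table and column names are assumptions carried over from the example above):

[CODE]
<?php
// index.php on the web host -- hypothetical table/column names
$db = new PDO('mysql:host=localhost;dbname=weather', 'user', 'password');
$rows = $db->query(
    'SELECT station, temperature, recorded_at
       FROM readings
      ORDER BY recorded_at DESC
      LIMIT 50'
);

echo '<table><tr><th>Station</th><th>Temperature</th><th>Recorded</th></tr>';
foreach ($rows as $row) {
    echo '<tr>'
       . '<td>' . htmlspecialchars($row['station']) . '</td>'
       . '<td>' . htmlspecialchars($row['temperature']) . '</td>'
       . '<td>' . htmlspecialchars($row['recorded_at']) . '</td>'
       . '</tr>';
}
echo '</table>';
[/CODE]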


It seems a bit half-cocked to have this program split in "two", with the scraper running locally while the database is remote; however, from my understanding, it wouldn't really be possible to have the scraper running continuously on a remote hosting server -- unless some local machine/client were continuously refreshing a PHP file on the server. Any help/clarity/insight on this?


A more specific summary of the two hiccups I need veteran insight on:

1. Without some manner of special access/permissions, is it typically possible to have a personal program/script running continuously on a hosting company's server? Or can a script/PHP file only be run when a browser manually "triggers" it by accessing it?

2. What workflow would you implement in this situation -- specifically, which parts would you do locally, and which remotely?

e.g., local scraper / local database / remote publishing (the local scraper stores into a local SQL database, then copies its local DB to the server's DB, and the server retrieves and tabulates the data upon request)?

Would you have a local scraper bypass any local databasing, and instead update a remote database directly?

Would a process that needs continual running/looping be best done on a local machine (with full access/control), or would it be better (or even possible) to do it on a hosting account somehow? I'm not aware how one could have something continually running on a host without some kind of unadvertised special permissions or access -- maybe through an SSH session?
 
Some web hosts don't allow processes to run like that, or will boot your account if they detect it. Maybe look into getting a VPS?
 
Instead of stealing from the weather providers through screenscraping, why not choose a vendor that gives you a reliable data structure to fetch and store? I know that weather.com advertises an API that can return XML or JSON, and several competitors (such as AccuWeather) have something similar.

I can't speak to the quality of the various products, but these are certainly legal ways that would likely make your programmatic implementation much easier.
 
Why are you scraping websites as opposed to just using a weather API like PTNL suggested? Do you live somewhere extremely obscure that has no government weather service you can get data from? If you switch to using a public API, you could easily put this script up on a VPS or other hosting account, but I wouldn't do that if you're scraping sites directly lest you find yourself banned by your host or blocked by the sites.
 
The proper way to do this is:

1) Write a PHP script that uses one of the many weather sites' APIs (as already suggested) to retrieve the results as JSON/XML.
2) Set up a cron job to run the PHP script at a set interval (the cron job can call wget or similar, or better yet, invoke the script directly with php-cli).
 
[QUOTE]The proper way to do this is:

1) Write a PHP script that uses one of the many weather sites' APIs (as already suggested) to retrieve the results as JSON/XML.
2) Set up a cron job to run the PHP script at a set interval (the cron job can call wget or similar, or better yet, invoke the script directly with php-cli).[/QUOTE]

OP: you are correct, setting the max execution time to 10 minutes is rather kludgy. Take the above advice and set it to run as a cron job.
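Concretely, the crontab entry could be as simple as the line below (the interval, PHP binary location, and script path are just placeholders -- adjust to taste):

[CODE]
# run one scraping pass every 6 minutes via php-cli, appending output to a log
*/6 * * * * /usr/bin/php /path/to/scraper.php >> /path/to/scrape.log 2>&1
[/CODE]

With cron handling the scheduling, the script just does one pass and exits -- no sleep() loop, no raised max_execution_time, and no meta refresh.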
 
[QUOTE]Instead of stealing from the weather providers through screenscraping, why not choose a vendor that gives you a reliable data structure to fetch and store? I know that weather.com advertises an API that can return XML or JSON, and several competitors (such as AccuWeather) have something similar.

I can't speak to the quality of the various products, but these are certainly legal ways that would likely make your programmatic implementation much easier.[/QUOTE]

[QUOTE]Why are you scraping websites as opposed to just using a weather API like PTNL suggested? Do you live somewhere extremely obscure that has no government weather service you can get data from? If you switch to using a public API, you could easily put this script up on a VPS or other hosting account, but I wouldn't do that if you're scraping sites directly lest you find yourself banned by your host or blocked by the sites.[/QUOTE]

[QUOTE]The proper way to do this is:

1) Write a PHP script that uses one of the many weather sites' APIs (as already suggested) to retrieve the results as JSON/XML.
2) Set up a cron job to run the PHP script at a set interval (the cron job can call wget or similar, or better yet, invoke the script directly with php-cli).[/QUOTE]

A. No stealing involved. Full permissions to scrape. Apologies for forgetting this disclaimer in the OP.

B. The information isn't coming from the weather providers, nor would they even be helpful. It's coming from local posts at various spots within the town. Weather services' data (and APIs) would be irrelevant to the project.

[QUOTE]OP: you are correct, setting the max execution time to 10 minutes is rather kludgy. Take the above advice and set it to run as a cron job.[/QUOTE]

Thanks, I'll read up on cron-job'ing.

For reference, what if you were to do this entirely within a single, self-contained program (disregarding any need to have it run on a server) -- what would you do? What language might you use?

Obviously, PHP isn't really meant to be continuously run (as far as I understand). So would you use something like Java? Python?
 
[QUOTE]Obviously, PHP isn't really meant to be continuously run (as far as I understand). So would you use something like Java? Python?[/QUOTE]

You could use Mechanize with Perl or Ruby, or YQL, which was suggested in a similar thread.
 
[QUOTE]A. No stealing involved. Full permissions to scrape. Apologies for forgetting this disclaimer in the OP.

B. The information isn't coming from the weather providers, nor would they even be helpful. It's coming from local posts at various spots within the town. Weather services' data (and APIs) would be irrelevant to the project.[/QUOTE]
This clears things up. However, it does pose the same challenges - the data you are ripping through is user-created.

If you are coordinating with the content publishers on the data they enter, then you can mitigate many of the "raw string parsing" concerns. Not bullet-proof, but at least traceable and understood by those involved. You could even have another background watcher that periodically audits the DB, looking for fields that could give you problems in parsing -- something else that could be spun off as a cron job.
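Such an audit pass might look something like this (purely a sketch; the table/column names and the "suspect value" checks are assumptions, not anything from your schema):

[CODE]
<?php
// audit_readings.php -- run from cron; flags rows the parser may have mangled.
$db = new PDO('mysql:host=localhost;dbname=weather', 'user', 'password');

// Hypothetical sanity checks: missing values, or temperatures outside a plausible range.
$suspect = $db->query(
    'SELECT id, station, temperature, recorded_at
       FROM readings
      WHERE temperature IS NULL
         OR temperature NOT BETWEEN -60 AND 60'
);

foreach ($suspect as $row) {
    // Log the suspect rows; this could just as easily email a summary
    // or write to a separate audit table.
    error_log(sprintf(
        'Suspect reading #%d from %s at %s: %s',
        $row['id'],
        $row['station'],
        $row['recorded_at'],
        var_export($row['temperature'], true)
    ));
}
[/CODE]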
 