Need "programmable" program recommendation?

petreza

n00b
Joined
Feb 16, 2006
Messages
52
(...talk about unclear title....)

I have a tough task given to me at work:

I have a database dump with over 100,000 records of client's contact information (email, name, company) and what services they subscribe to. Because each service has its own signup there are many duplicates in the data.

My job consists of:

- cleanup of the data - find all unique individuals (eliminate duplicates) - while email address is the best clue, it should not be the only criteria - people use multiple email addresses. The cleanup should be manual, that is, I make the decisions of who is who, but I need to build a system that automatically searches through the data and makes intelligent recommendations to me - I don't want to scroll all day in notepad :)

- keep track of what gets changed in the data (and why) and keep track of where the change is in the original database so that it might be referenced later

--------------------------------

I tried working in Excel but it was too limited in power - 100,000 clients X 60 services = 6,000,000 cells - Excel freezes for a few seconds when I make any change in the worksheet. In addition, I don't think there is an easy way to make a "work-flow" automation system inside Excel.

Then I decided to put the data in MySQL database - it is fast but I don't know any type of scripting/programming (learning Perl as fast as I can). I plan to setup an additional database that will keep track of the changes I make in the data and why I make them.

---------------------------------

QUESTION:

Since I don't know much programming, so I cannot make a program for myself, could you recommend me any kind of program/platform where I can setup tables, buttons, shortcuts, and macros driving them, so that I can have an automated system searching through the database, giving me the closest matches and I make the decision which contacts are the same person. The program, obviously, should be able to read/write to MySQL.

Sorry for the long post.

Any help will be greatly appreciated!

Thanks!
 
If you're not a programmer, why were you given this task, and more importantly, why didn't you tell whoever gave you the job "I have no idea how to do this"?
 
Actually my job is to learn to program but that is in general and this project cannot wait until then.

I found one thing - AutoIt - it allows me to build GUIs and interact with other programs. Someone even wrote functions to interact with MySQL.

What do you think?
 
edit: pricess frosty's advice is much better than mine, I knew that most databasing software these days would let you do cool scripted stuff but I wasn't sure, and while you could run through it with some little bash or powershell scripted utility, running through your database with the databases own scripts is probably a better idea.

You might want to become familiar with powershell at this point, it'l be very handy if you want to further your career as an IT pro (or so I've been told --that's how my college sold me on this course), but understand you should look at it like a new language. Its basically Microsoft's version of bash, and I cannot think of a tool that's more fitted to this job than bash + awk... or bash + vim if you want to get nuts.

edit: no command line ever lets you built a GUI, that's kind've the whole point; the term GUI originated as a antethesis to "Character User Interface", aka CUI, aka command-line.

but If you really want to build a GUI to do these kind of things (which you shouldn't, again, Powershell's sole purpose is to do the kind of thing you're trying to do here) then you probably want to look at Visual Basic and .net.
 
Last edited:
You might try looking into CRM software - it's not a problem space I generally deal with but it seems like the sort of thing that, if anything off-the-shelf could do it, would have the answer for you.

I'd probably start writing a script in Python/Perl/Ruby to slurp. You might want/need to use a DB.

I feel sorry for you - you're kind of out of your depth here. You'll probably be stuck supporting whatever you write for as long as you continue working there.
 
Take a backup of the database as it is at the start before you do anything else.

Write SQL stored proceedures to loop through the data and make decisions based on logic about how to clean the clients. Everytime an action is taken update an action log (another table) of what action was taken and then dump the result of the action into a new database.

Once you've looped all of the clients through the proceedure you should have your backup database which is the old dataset, a log of what was changed and why, and the final data set.

You should look at building composite keys to identify individuals, consider formatting the names and removing ambiguities out first to maximize number of matches, such as

remove mr/ms/miss/dr/prof/sir/madame from names and put lower case
remove ltd, inc, gmbh, country name etc from company names and put lower case

Add these fields together to identify individuals, so "Mr Mark Smith" at "Bits & Bobs Inc" becomes "marksmithbitsandbobs"

That way if you have a "Mark SMith" at "Bits and Bobs ltd" you'd still match the record. If you have Mark Smith at Hardocp and Co then the key doesn't match

In the database I look after, we identify unique company site with a cleaned company name + a cleaned version of the zip code, and we idenify contact with a unique conact name + their companies company key.
 
Back
Top