[H]ard|Forum  

  #1  
Old 07-19-2006, 01:16 PM
mikeblas [H]ard|DCer of the Month - May 2006, 6.2 Years
 
C++ and Perl bakeoff

Would anyone like to have a bakeoff to see if C++ or Perl is better at string-intensive processing? It would be fun, something like a programming contest (without prizes). A benchmark-athon.

I like to download the Wikipedia content dumps and play with them. There's lots of interesting things to do. Of course, they're all formatted for MySQL, and I don't run MySQL. So I need to convert some of the scripts for SQL Server.

One file is a bunch of INSERT statements (using some MySQL specific syntax, of course) that loads a single table full of information about every article. I've written a program to rip through that file and make a CSV out of it. Then, I can use another tool to load that CSV file into SQL Server very rapidly.

There's lots of parsing going on:

INSERT (a1,'b1','c1'),(a2,'b2','c2'),(a3,'b3','c3')

becomes

a1,"b1","c1"
a2,"b2","c2"
a3,"b3","c3"

and so on. There's escapements, too; the strings have single ticks escaped with a backslash, and a backslash escaped, too. (So \' becomes ', and \\ becomes \.)
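A minimal Python sketch of those two rules, for illustration only. The function name `unescape_mysql` is made up here, and it assumes the only escapes that matter are the two listed above; any other backslash pair is simply reduced to its second character.

```python
def unescape_mysql(s: str) -> str:
    """Undo backslash escaping in a single left-to-right pass."""
    out = []
    i = 0
    while i < len(s):
        ch = s[i]
        if ch == "\\" and i + 1 < len(s):
            # Keep the escaped character literally: \' -> ' and \\ -> \
            out.append(s[i + 1])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(unescape_mysql(r"AmeriKKKa\'s"))  # AmeriKKKa's
```

A single pass is used rather than two chained `replace` calls because chained replacements can misread a sequence such as an escaped backslash followed directly by a tick.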

There's also hex escapements, it seems: \xe0 becomes whatever single character 0xE0 is. I haven't figured out if that's also for four-digit values (for Unicode), like \x2240 becoming whatever single Unicode character 0x2240 is.

I think I've got a very fast solution in C#, and I want to write a C++ version for comparison. Who can come up with a Perl version? String processing should be Perl's home court, right?

Last edited by mikeblas; 07-19-2006 at 01:31 PM..
  #2  
Old 07-19-2006, 08:40 PM
bassman [H]ard|Gawd, 5.2 Years
 
Sounds interesting. Since I don't know Perl well, someone else could get you something more efficient (and sooner) than I. But you've given me a good idea for learning more Perl...
  #3  
Old 07-19-2006, 09:06 PM
TheDude05 Limp Gawd, 5.6 Years
 
I heard Python whoops ass at strings too but I dont have any proof. Anyone know for sure?
  #4  
Old 07-19-2006, 09:19 PM
mikeblas [H]ard|DCer of the Month - May 2006, 6.2 Years
 
We can put a Python horse in the race, too.
  #5  
Old 07-19-2006, 09:39 PM
BillLeeLee [H]ardForum Junkie, 7.2 Years
 
I am game, but probably won't be able to dream up a solution until lunch hour tomorrow or nighttime.

Should be fun with my dirty way of programming Perl. Someone else can take the Python way, or maybe someone will use sed/awk.
__________________
This is genmay, not livejournal.

  #6  
Old 07-19-2006, 10:04 PM
fluxion Gawd, 5.3 Years
 
im game to give it a shot at least. i wont be able to get into it till this weekend though, but feel free to go ahead and post up the full criteria (or is this geared toward people who already know what conversions need to be done?).

in any case i seriously doubt i'd outdo mikeblas in anything related to programming but maybe perl can pick up my slack
__________________
gaming rig (vista 64): E8400 @ 3.6ghz, Gskill PC6400 (2x2GB), ECS 8800GT, Abit IP35 Pro, Corsair VX450
workstation (ubuntu intrepid 64): Q9300 @ stock, XMS2 PC6400 (4x2GB), evga 8500gt 512mb, P35-DS3R, Corsair VX450
laptop0 (slackware 10.1): Dell Latitude CPi 13.1", P2 266mhz, 128MB, MagicGraph 128XD
laptop1 (ubuntu intrepid + vista): Thinkpad T61 14.1" WS, C2D (T7300) 2.0ghz, Corsair (2x1GB), Quadro NVS 140M
work laptop (ubuntu intrepid + xp): Thinkpad T61 14.1" WS, C2D (T7300) 2.0ghz, 2x1GB, OCZ Vertex 30GB, Intel GM965, Advanced Dock with 120GB ultrabay drive
  #7  
Old 07-19-2006, 11:36 PM
mikeblas [H]ard|DCer of the Month - May 2006, 6.2 Years
 
Nah; we have to specify the translation, otherwise it's not a precise comparison and we don't learn anything about it.

The file I'm using is page.sql.gz from http://download.wikimedia.org/enwiki/20060702/. It's 149 megs to download, and expands to 412 megs uncompressed. The start of the file is some DDL to drop a table, then create it and some indexes. Finally, we get to the INSERT statement. The first two tuples are here:

Code:
INSERT INTO `page` VALUES
 (1,0,'AaA','',8,1,0,0.116338664774167,'20060401120725',46448774,70),
 (5,0,'AlgeriA','',0,1,0,0.553221851171201,'20060301005610',18063769,41),
...
and, of course, the file has more than three million such tuples -- each one represents a page (not an article) on the English Wikipedia site. I added the newlines above for clarity; in the file, each INSERT statement and all of its tuples sit on a single line. The file does contain newlines, but there's about 15000 records per line. The last tuple on a line ends with a semicolon, and there's a new INSERT INTO statement on the next line.
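Given that layout, one way to get at the tuple data is to walk the file line by line and keep only the INSERT lines. This is a rough Python sketch, not anyone's actual entry; `insert_payloads` and `PREFIX` are names invented for the example, and it assumes every data line starts with exactly the statement text shown above.

```python
PREFIX = "INSERT INTO `page` VALUES"


def insert_payloads(lines):
    """Yield the tuple list of each INSERT line, minus prefix and ';'."""
    for line in lines:
        if not line.startswith(PREFIX):
            continue  # skip DDL, comments, and blank lines
        payload = line[len(PREFIX):].strip()
        if payload.endswith(";"):
            payload = payload[:-1]  # drop the statement terminator
        yield payload


sample = [
    "DROP TABLE IF EXISTS `page`;\n",
    "INSERT INTO `page` VALUES (1,0,'AaA'),(5,0,'AlgeriA');\n",
]
for payload in insert_payloads(sample):
    print(payload)
```

Any iterable of lines works as input, so an open file handle can be passed in directly and the file is never held in memory as a whole.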

The program I'm writing actually reads the file and cleans up the records and puts them into a structure that I then bulk insert directly into SQL Server.

Since I'm not using INSERT statements (that's the slow way), I want raw data without escapes. So I translate strings like "'AmeriKKKa\'s_Most_Wanted'" (note single ticks inside quotes) into "AmeriKKKa's Most Wanted" (no single ticks) and insert that directly.

That's all I've got so far, so I'm a little premature about asking if anyone wants to play with comparisons because I don't have a "spec" for all the translations yet.

I was off in my first post; the MySQL Documentation explains there's no \x## hex translations. So when I find this in the file:

Code:
'Broken/Jo\\xc3\\xa2\\xe2\\x80\\x9e\\xef\\xbf\\xbd'
it's just about the backslashes and it translates to this:

Code:
'Broken/Jo\xc3\xa2\xe2\x80\x9e\xef\xbf\xbd'
So I guess the programs should translate with the rules on the MySQL page to get to tab-separated output. The program, then, should strip all the other junk out of the file and just translate the records. Let's translate this

Code:
 (1,0,'AaA','',8,1,0,0.116338664774167,'20060401120725',46448774,70),
 (2,0,'AmeriKKKa\'s_Most_Wanted','',8,1,0,0.116338664774167,'20060401120725',46448774,70),
into this, where "<tab>" is an individual tab character (0x09):

Code:
1<tab>0<tab>AaA<tab><tab>8<tab>1<tab>0<tab>0.116338664774167<tab>20060401120725<tab>46448774<tab>70<newline>
2<tab>0<tab>AmeriKKKa's Most Wanted<tab><tab>8<tab>1<tab>0<tab>0.116338664774167<tab>20060401120725<tab>46448774<tab>70<newline>
What do you think? Clear enough? What did I leave out?
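For reference, the translation described in this post can be sketched in Python as a small state machine. Treat it as one possible reading of the spec rather than a reference answer: `tuples_to_tsv` is a hypothetical name, records end with a bare LF, only backslash escapes are undone, and underscores are left alone because the stated rules don't mention them (the sample output in the post shows spaces instead).

```python
def tuples_to_tsv(payload: str):
    """Yield one tab-separated line per (...) tuple in an INSERT payload."""
    fields, cur = [], []
    in_string = in_tuple = False
    i, n = 0, len(payload)
    while i < n:
        ch = payload[i]
        if in_string:
            if ch == "\\" and i + 1 < n:
                cur.append(payload[i + 1])  # \' -> ' and \\ -> \
                i += 1
            elif ch == "'":
                in_string = False  # closing tick, not copied to output
            else:
                cur.append(ch)
        elif ch == "'":
            in_string = True  # opening tick, not copied to output
        elif ch == "(" and not in_tuple:
            in_tuple = True
        elif ch == "," and in_tuple:
            fields.append("".join(cur))
            cur = []
        elif ch == ")" and in_tuple:
            fields.append("".join(cur))
            cur = []
            yield "\t".join(fields) + "\n"
            fields = []
            in_tuple = False
        elif in_tuple:
            cur.append(ch)  # unquoted value such as a number
        i += 1


for row in tuples_to_tsv("(1,0,'AaA','',8),(2,0,'AmeriKKKa\\'s','',8)"):
    print(repr(row))
```

Tracking whether the scanner is inside a quoted string is what keeps a title containing `),(` or a comma from being split into extra fields.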

Everybody uses ActiveState Perl for Windows, right? I think we should run each other's programs and average the results. I run my own C# and C++ code, and you run it too; I run your Perl and Python programs and we see how the scores go. That'll help eliminate machines and disk setups and so on.
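A small timing helper along these lines could be shared so everyone measures the same way. This is only a sketch: `average_runtime` is a hypothetical name, it averages repeated runs on one machine (averaging across machines would still be done by hand), and it measures wall-clock time including process startup.

```python
import subprocess
import sys
import time


def average_runtime(cmd, runs=3):
    """Run cmd `runs` times and return the mean wall-clock seconds."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        # Discard stdout so console speed doesn't skew the measurement.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        total += time.perf_counter() - start
    return total / runs


# Placeholder command; substitute the program under test and its arguments.
print(average_runtime([sys.executable, "-c", "pass"]))
```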

Last edited by mikeblas; 07-20-2006 at 04:54 PM..
  #8  
Old 07-20-2006, 12:32 AM
fluxion Gawd, 5.3 Years
 
hmm, i actually use perl for linux. but i have activestate installed, never really used it but i'll try to make sure it works in windows. i dont see why it wouldnt though, i'll probably just be using core functions.

but yah sounds like a pretty sweet little challenge. seems pretty clear, i'm assuming the records are separated by newlines? ill try to get something together as soon as i have time.
  #9  
Old 07-20-2006, 08:15 AM
mikeblas [H]ard|DCer of the Month - May 2006, 6.2 Years
 
Was my previous post not clear enough about newlines?

If you're using Unix, that will make it a bit more difficult to compare timings. I could be wrong, but I don't think Perl supports Unicode in its core functions.
  #10  
Old 07-20-2006, 08:43 AM
BillLeeLee [H]ardForum Junkie, 7.2 Years
 
Perl does have some built-in unicode support as of 5.6 (and 5.8.x has improved it a lot) - I know you can at least print unicode, but I haven't dealt with it all that much.

@fluxion

I use ActiveState Perl at work for scripting - runs just as fine as when I just compiled my own Perl 5.8 binaries for windows. (The current ActiveState Perl distribution is just 5.8.8 with some more modules, like the Win32 modules, built in.)

  #11  
Old 07-20-2006, 08:59 AM
mikeblas [H]ard|DCer of the Month - May 2006, 6.2 Years
 
There's 4,752,747 records in the file, and the input file is 412,064,906 bytes long. About 87 bytes per record, then.
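The per-record figure is just the file size divided by the record count, which is quick to check in Python:

```python
records = 4_752_747
size_bytes = 412_064_906

# Average input bytes per record, including quotes, commas, and parentheses.
bytes_per_record = size_bytes / records
print(f"{bytes_per_record:.1f}")  # 86.7
```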
  #12  
Old 07-20-2006, 04:22 PM
Whatsisname [H]ardness Supreme, 9.8 Years
 
depends which you consider more valuable, the computer hardware or the computer programmer.
__________________
Führer of the Grammar Nazi Association™

Banned -Frg
  #13  
Old 07-20-2006, 04:52 PM
mikeblas [H]ard|DCer of the Month - May 2006, 6.2 Years
 
Huh?
  #14  
Old 07-20-2006, 05:22 PM
drizzt81 [H]ardForum Junkie, 6.6 Years
 
Quote:
Originally Posted by mikeblas
Huh?
I think he is trying to say that it may be cheaper to buy more/ "better" hardware and throw it at the problem than pay some programmers "extremely high" salary. Since the hardware is a fixed cost, which in the long run is zero... blah blah blah.
__________________
German Rep. of the Grammar Nazi Association™ ; Anti-HDCP Alliance; AMD[H]unter is dating his sister
Spelling: ridiculous, equivalent, weather (sunny, raining) != whether, threw != through, peak (top of mountain) != peek (quick glance), sentence, (kitchen)sink != sync(hronized), visibility, duel != dual, eligible, identical, definite, they're (They Are)/ it's (it is) != their/ its (Ownership) != there (Place), quiet (not loud) != quite (very), still (not moving) != steal (take stuff from others), since != sense, board (a piece of wood) != bored (not busy), comparable, performance, compatible, inconsistent, lane, herd (like a bunch of cows) != heard, compare, The past tense of lead (guide) is led (guided) not lead (a soft metal), tendency, den != then, sight, sustenance, truth, past tense of "to cost" is "cost", perpendicular, would of != would've, complacent, loose (not tight) != lose, should have
  #15  
Old 07-20-2006, 05:44 PM
mikeblas [H]ard|DCer of the Month - May 2006, 6.2 Years
 
The Moore's Law argument? That's broken; Moore's Law stopped in 2003. Before that, people called code written with this excuse "bloatware". For tools, it can be viable -- but it's hard to justify for client-side apps that ship.

Last edited by mikeblas; 07-20-2006 at 05:50 PM..
  #16  
Old 07-20-2006, 05:54 PM
generelz Limp Gawd, 5.3 Years
 
Can I do it in Java? What is the memory limit for this program? I don't have anything to do this weekend :-). What is the timeframe?
  #17  
Old 07-20-2006, 07:24 PM
mikeblas [H]ard|DCer of the Month - May 2006, 6.2 Years
 
Quote:
Originally Posted by generelz
Can I do it in Java? What is the memory limit for this program? I don't have anything to do this weekend :-). What is the timeframe?
I don't think there's a memory limit. The machine I dev on at home, and will run the tests on, has four gigs of RAM. But I don't see a reason to have much stuff in memory.

Whatever language you'd like, sure. I don't run Unix, though; and if it's not something I use every day, you'll have to provide instructions on getting the tools or downloads I'll need to time it.

I think the "specs" in post #7 are pretty much everything I can think of, but I'm happy to answer questions if you think something is missing. Otherwise, feel free to get started. It'll take you a while to download the input file, so that's a great thing to get going right away.

I guess there's no real timeframe, but there's no exclusivity to participation, either. That is, don't expect to be the only guy who will implement a tested solution in a language. The more the merrier. Also, I'll give everybody credit, but I'd like to post a web page with the solutions and the comparisons and analysis and so on.
  #18  
Old 07-20-2006, 09:45 PM
generelz Limp Gawd, 5.3 Years
 
Quote:
Originally Posted by mikeblas
I don't think there's a memory limit. The machine I dev on at home, and will run the tests on, has four gigs of RAM. But I don't see a reason to have much stuff in memory.
Well theoretically I could read the whole file into RAM in one operation, then manipulate all the text in memory rather than having to read each line. Either way that was not my current plan but it would be interesting to see what sort of performance it would yield.

Quote:
Originally Posted by mikeblas
Whatever language you'd like, sure. I don't run Unix, though; and if it's not something I use every day, you'll have to provide instructions on getting the tools or downloads I'll need to time it.
Speaking to Java, as long as you have the JRE installed (which I expect you might if you run any applets in your web browser) then you should be able to run the class file I generate. There is one question, though, and that is what JRE version you would be running. I might write it to take advantage of some of the newest java features (java 5), and that is generally the platform I compile for. However it is possible to compile a backwards-compatible version, but it might not perform as well because it would not be able to take advantage of the improvements made to the JVM for the latest version.

Quote:
Originally Posted by mikeblas
I think the "specs" in post #7 are pretty much everything I can think of, but I'm happy to answer questions if you think something is missing. Otherwise, feel free to get started. It'll take you a while to download the input file, so that's a great thing to get going right away.
Definitely, I will not hesitate to ask if I run across anything unexpected.

Quote:
Originally Posted by mikeblas
I guess there's no real timeframe, but there's no exclusivity to participation, either. That is, don't expect to be the only guy who will implement a tested solution in a language. The more the merrier. Also, I'll give everybody credit, but I'd like to post a web page with the solutions and the comparisons and analysis and so on.
Yeah that would be great, I would be happy to release the source code and provide any explanation about why I did what I did.

One last thing: you mentioned importing into SQL Server is super fast using a comma/tab delimited file. However, that is not faster than restoring from a BAK, correct? The reason I ask is because I deal with client databases on a regular basis, similar to this example (i.e. a client has a mysql database dump they send us, and it is in this same format wikipedia has provided), and I was considering if there was an easy translation from that dump to a MS SQL BAK file. I guess this is probably the best way of importing a mysql dump?
  #19  
Old 07-20-2006, 10:13 PM
mikeblas [H]ard|DCer of the Month - May 2006, 6.2 Years
 
Quote:
Originally Posted by generelz
Well theoretically I could read the whole file into RAM in one operation, then manipulate all the text in memory rather than having to read each line. Either way that was not my current plan but it would be interesting to see what sort of performance it would yield.
I think you'll find that it's slower. If you read a bit at a time, you can process what you've read while waiting for the next chunk to come into memory. Same for writing.
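As a rough illustration of reading a bit at a time, here is a Python sketch of a chunked reader. The name `read_chunks` and the 1 MB default are arbitrary choices for the example; whether processing really overlaps with I/O depends on the OS read-ahead and the runtime, which this sketch does not control.

```python
import io


def read_chunks(stream, chunk_size=1 << 20):
    """Yield successive blocks of at most chunk_size bytes from stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of input
        yield chunk


# In-memory stand-in for a file opened with open(path, "rb").
for chunk in read_chunks(io.BytesIO(b"abcdefg"), chunk_size=3):
    print(chunk)
```

A parser fed this way has to carry its state across chunk boundaries, since a record can be split between two reads.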

Quote:
Originally Posted by generelz
One last thing: you mentioned importing into SQL Server is super fast using a comma/tab delimited file. However, that is not faster than restoring from a BAK, correct?
Sure. But there's no way to write foreign data into a database backup file. That is, it's a backup mechanism, not an import mechanism.

Quote:
Originally Posted by generelz
I guess this is probably the best way of importing a mysql dump?
MySQL dump files aren't natively readable by SQL Server. They're mostly SQL in plaintext, but MySQL uses a nonstandard format for INSERT statements and also for many other features. So you'll have to find a tool, or start writing 'em, just like me. One other approach is to set up a MySQL instance just for loading. Run the dump to load that server, then use SSIS to move the data directly from the MySQL server to SQL Server.

If you have further questions about SQL Server, I'm happy to help, but I'd thank you to start your own thread with them.
  #20  
Old 07-20-2006, 11:52 PM
fluxion Gawd, 5.3 Years
 
Quote:
Originally Posted by mikeblas
Was my previous post not clear enough about newlines?

If you're using Unix, that will make it a bit more difficult to compare timings. I could be wrong, but I don't think Perl supports Unicode in its core functions.
i meant for the output, i'm assuming tab-delimited fields, newline-delimited records (CR+LF), but i just wanted to make sure.

All times are GMT -5. The time now is 08:33 PM.


Copyright ©2000 - 2010, Jelsoft Enterprises Ltd.
Copyright 2000 - 2010 KB Networks, Inc.