C++ and Perl bakeoff

mikeblas

[H]ard|DCer of the Month - May 2006
Would anyone like to have a bakeoff to see if C++ or Perl is better at string-intensive processing? It would be fun, something like a programming contest (without prizes). A benchmark-athon.

I like to download the Wikipedia content dumps and play with them. There's lots of interesting things to do. Of course, they're all formatted for MySQL, and I don't run MySQL. So I need to convert some of the scripts for SQL Server.

One file is a bunch of INSERT statements (using some MySQL specific syntax, of course) that loads a single table full of information about every article. I've written a program to rip through that file and make a CSV out of it. Then, I can use another tool to load that CSV file into SQL Server very rapidly.

There's a lot of parsing going on:

INSERT (a1,'b1','c1'),(a2,'b2','c2'),(a3,'b3','c3')

becomes

a1,"b1","c1"
a2,"b2","c2"
a3,"b3","c3"

and so on. There are escapes, too; the strings have single ticks escaped with a backslash, and backslashes escaped as well. (So \' becomes ', and \\ becomes \.)

There also seem to be hex escapes: \xe0 becomes whatever single character 0xE0 is. I haven't figured out if that's also used for four-byte characters (for Unicode), like \xe2240 becoming whatever single Unicode character 0x2240 is.
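(The two backslash rules are simple enough to sketch. Here's an illustrative C++ helper that applies just the \' and \\ translations described above; the name and structure are hypothetical, not anyone's posted solution, and it deliberately ignores any other escape.)

```cpp
#include <string>

// Undo the two MySQL backslash escapes described above: \' becomes '
// and \\ becomes \. Anything else after a backslash is left untouched;
// this is an illustrative helper, not a complete MySQL unescaper.
std::string unescape(const std::string &s)
{
    std::string out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        if (s[i] == '\\' && i + 1 < s.size() &&
            (s[i + 1] == '\'' || s[i + 1] == '\\'))
            out += s[++i];  // drop the backslash, keep the escaped character
        else
            out += s[i];
    }
    return out;
}
```

So "AmeriKKKa\'s" comes out as "AmeriKKKa's", and a doubled backslash collapses to a single one.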

I think I've got a very fast solution in C#, and I want to write a C++ version for comparison. Who can come up with a Perl version? String processing should be Perl's home court, right?
 
Sounds interesting. Since I don't know Perl well, someone else could get you something more efficient (and sooner) than I. But you've given me a good idea for learning more Perl...
 
I heard Python whoops ass at strings too, but I don't have any proof. Anyone know for sure?
 
I am game, but probably won't be able to dream up a solution until lunch hour tomorrow or nighttime.

Should be fun with my dirty way of programming Perl. :p Someone else can take the Python way, or maybe someone will use sed/awk.
 
im game to give it a shot at least. i wont be able to get into it till this weekend though, but feel free to go ahead and post up the full criteria (or is this geared toward people who already know what conversions need to be done?).

in any case i seriously doubt i'd outdo mikeblas in anything related to programming :p but maybe perl can pick up my slack
 
Nah; we have to specify the translation, otherwise it's not a precise comparison and we don't learn anything from it.

The file I'm using is page.sql.gz from http://download.wikimedia.org/enwiki/20060702/. It's 149 megs to download, and expands to 412 megs uncompressed. The start of the file is some DDL to drop the table, then recreate it and some indexes. Finally, we get to the INSERT statements. The first two tuples are here:

Code:
INSERT INTO `page` VALUES
 (1,0,'AaA','',8,1,0,0.116338664774167,'20060401120725',46448774,70),
 (5,0,'AlgeriA','',0,1,0,0.553221851171201,'20060301005610',18063769,41),
...

and, of course, the file has more than three million such tuples -- each one represents a page (not an article) on the English Wikipedia site. They're all on one line; I added all the newlines above for clarity. The file does contain newlines; there's about 15000 records per line. The last tuple on a line ends with a semicolon, and there's a new INSERT INTO statement on the next line.

The program I'm writing actually reads the file and cleans up the records and puts them into a structure that I then bulk insert directly into SQL Server.

Since I'm not using INSERT statements (that's the slow way), I want raw data without escapes. So I translate strings like "'AmeriKKKa\'s_Most_Wanted'" (note single ticks inside quotes) into "AmeriKKKa's Most Wanted" (no single ticks, and underscores turned into spaces) and insert that directly.

That's all I've got so far, so I'm a little premature about asking if anyone wants to play with comparisons because I don't have a "spec" for all the translations yet.

I was off in my first post; the MySQL documentation explains there are no \x## hex translations. So when I find this in the file:

Code:
'Broken/Jo\\xc3\\xa2\\xe2\\x80\\x9e\\xef\\xbf\\xbd'

it's just about the backslashes and it translates to this:

Code:
'Broken/Jo\xc3\xa2\xe2\x80\x9e\xef\xbf\xbd'

So I guess the programs should translate with the rules on the MySQL page to get to tab-separated output. The program, then, should strip all the other junk out of the file and just translate the records. Let's translate this

Code:
 (1,0,'AaA','',8,1,0,0.116338664774167,'20060401120725',46448774,70),
 (2,0,'AmeriKKKa\'s_Most_Wanted','',8,1,0,0.116338664774167,'20060401120725',46448774,70),

into this, where "<tab>" is an individual tab character (0x09):

Code:
1<tab>0<tab>AaA<tab><tab>8<tab>1<tab>0<tab>0.116338664774167<tab>20060401120725<tab>46448774<tab>70<newline>
2<tab>0<tab>AmeriKKKa's Most Wanted<tab><tab>8<tab>1<tab>0<tab>0.116338664774167<tab>20060401120725<tab>46448774<tab>70<newline>

What do you think? Clear enough? What did I leave out?
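(The spec above boils down to a single pass over one tuple's text: outside of strings, commas become tabs; inside strings, the quotes are dropped, \' and \\ are unescaped, and underscores become spaces. An illustrative C++ sketch of that pass, not any entrant's code; the function name is made up.)

```cpp
#include <string>

// Translate the text between the parentheses of one tuple, e.g.
//   2,0,'AmeriKKKa\'s_Most_Wanted','',8,...
// into a tab-separated record per the spec above. Illustrative only.
std::string translate_tuple(const std::string &tuple)
{
    std::string out;
    bool in_string = false;
    for (std::size_t i = 0; i < tuple.size(); ++i)
    {
        char c = tuple[i];
        if (!in_string)
        {
            if (c == '\'')
                in_string = true;       // opening quote, dropped
            else if (c == ',')
                out += '\t';            // field separator becomes a tab
            else
                out += c;
        }
        else
        {
            if (c == '\\' && i + 1 < tuple.size() &&
                (tuple[i + 1] == '\'' || tuple[i + 1] == '\\'))
                out += tuple[++i];      // unescape \' and doubled backslash
            else if (c == '\'')
                in_string = false;      // closing quote, dropped
            else if (c == '_')
                out += ' ';             // underscores become spaces
            else
                out += c;
        }
    }
    return out;
}
```

Feeding it the first sample tuple yields the tab-separated line shown above, with the empty '' string becoming an empty field.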

Everybody uses ActiveState Perl for Windows, right? I think we should run each other's programs and average the results. I run my own C# and C++ code, and you run it too; I run your Perl and Python programs and we see how the scores go. That'll help eliminate machines and disk setups and so on.
 
hmm, i actually use perl for linux. but i have activestate installed, never really used it but i'll try to make sure it works in windows. i dont see why it wouldnt though, i'll probably just be using core functions.

but yah sounds like a pretty sweet little challenge. seems pretty clear, i'm assuming the records are separated by newlines? ill try to get something together as soon as i have time.
 
Was my previous post not clear enough about newlines?

If you're using Unix, that will make it a bit more difficult to compare timings. I could be wrong, but I don't think Perl supports Unicode in its core functions.
 
Perl does have some built-in unicode support as of 5.6 (and 5.8.x has improved it a lot) - I know you can at least print unicode, but I haven't dealt with it all that much.

@fluxion

I use ActiveState Perl at work for scripting - it runs just as well as the Perl 5.8 binaries I compiled myself for Windows. (The current ActiveState Perl distribution is just 5.8.8 with some more modules, like the Win32 modules, built in.)
 
There's 4,752,747 records in the file, and the input file is 412,064,906 bytes long. About 86 bytes per record, then.
 
depends which you consider more valuable, the computer hardware or the computer programmer.
 
drizzt81 said:
I think he is trying to say that it may be cheaper to buy more/ "better" hardware and throw it at the problem than pay some programmers "extremely high" salary. Since the hardware is a fixed cost, which in the long run is zero... blah blah blah.
 
The Moore's Law argument? That's broken; Moore's Law stopped in 2003. Before that, people called code written with this excuse "bloatware". For tools, it can be viable -- but it's hard to justify for client-side apps that ship.
 
Can I do it in Java? What is the memory limit for this program? I don't have anything to do this weekend :). What is the timeframe?
 
generelz said:
Can I do it in Java? What is the memory limit for this program? I don't have anything to do this weekend :). What is the timeframe?

I don't think there's a memory limit. The machine I dev on at home, and will run the tests on, has four gigs of RAM. But I don't see a reason to have much stuff in memory.

Whatever language you'd like, sure. I don't run Unix, though; and if it's not something I use every day, you'll have to provide instructions on getting the tools or downloads I'll need to time it.

I think the "specs" in post #7 are pretty much everything I can think of, but I'm happy to answer questions if you think something is missing. Otherwise, feel free to get started. It'll take you a while to download the input file, so that's a great thing to get going right away.

I guess there's no real timeframe, but there's no exclusivity to participation, either. That is, don't expect to be the only guy who will implement a tested solution in a language. The more the merrier. Also, I'll give everybody credit, but I'd like to post a web page with the solutions and the comparisons and analysis and so on.
 
mikeblas said:
I don't think there's a memory limit. The machine I dev on at home, and will run the tests on, has four gigs of RAM. But I don't see a reason to have much stuff in memory.

Well theoretically I could read the whole file into RAM in one operation, then manipulate all the text in memory rather than having to read each line. Either way that was not my current plan but it would be interesting to see what sort of performance it would yield.

mikeblas said:
Whatever language you'd like, sure. I don't run Unix, though; and if it's not something I use every day, you'll have to provide instructions on getting the tools or downloads I'll need to time it.

Speaking to Java, as long as you have the JRE installed (which I expect you might if you run any applets in your web browser) then you should be able to run the class file I generate. There is one question, though, and that is what JRE version you would be running. I might write it to take advantage of some of the newest Java features (Java 5), and that is generally the platform I compile for. It is possible to compile a backwards-compatible version, but it might not perform as well because it would not be able to take advantage of the improvements made to the JVM in the latest version.

mikeblas said:
I think the "specs" in post #7 are pretty much everything I can think of, but I'm happy to answer questions if you think something is missing. Otherwise, feel free to get started. It'll take you a while to download the input file, so that's a great thing to get going right away.

Definitely, I will not hesitate to ask if I run across anything unexpected.

mikeblas said:
I guess there's no real timeframe, but there's no exclusivity to participation, either. That is, don't expect to be the only guy who will implement at tested solution in a language. The more the merrier. Also, I'll give everybody credit, but I'd like to post a web page with the solutions and the comparisons and analysis and so on.

Yeah that would be great, I would be happy to release the source code and provide any explanation about why I did what I did.

One last thing: you mentioned importing into SQL Server is super fast using a comma/tab delimited file. However that is not faster than restoring from a BAK correct? The reason I ask is because I deal with client databases on a regular basis, similar to this example (i.e. a client has a mysql database dump they send us, and it is in this same format wikipedia has provided), and I was considering if there was an easy translation from that dump to a MS SQL BAK file. I guess this is probably the best way of importing a mysql dump?
 
generelz said:
Well theoretically I could read the whole file into RAM in one operation, then manipulate all the text in memory rather than having to read each line. Either way that was not my current plan but it would be interesting to see what sort of performance it would yield.
I think you'll find that it's slower. If you read a bit at a time, you can process what you've read while waiting for the next chunk to come into memory. Same for writing.
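(A minimal sketch of the chunk-at-a-time idea, in C++: read fixed-size blocks and process each one while the OS read-ahead fetches the next. The function here just counts '(' characters as a stand-in for real parsing; a real converter would have to carry partial tuples across chunk boundaries. Names are illustrative, not from anyone's entry.)

```cpp
#include <cstdio>
#include <vector>

// Read a file one 1 MB chunk at a time and process each chunk as it
// arrives, instead of slurping the whole 412 MB file into memory.
// "Processing" here is just counting open parentheses.
std::size_t count_open_parens(const char *path)
{
    std::vector<char> buf(1 << 20);   // 1 MB chunk buffer
    std::size_t count = 0;
    if (FILE *f = std::fopen(path, "rb"))
    {
        std::size_t n;
        while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0)
            for (std::size_t i = 0; i < n; ++i)
                if (buf[i] == '(')
                    ++count;
        std::fclose(f);
    }
    return count;
}
```

With buffered or overlapped I/O, the parse of chunk N runs while chunk N+1 is being fetched, which is exactly the overlap a whole-file read gives up.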

generelz said:
One last thing: you mentioned importing from into SQL Server is super fast using a comma/tab delimited file. However that is not faster than restoring from a BAK correct?
Sure. But there's no way to write foreign data into a database backup file. That is, it's a backup mechanism, not an import mechanism.

generelz said:
I guess this is probably the best way of importing a mysql dump?
MySQL dump files aren't natively readable by SQL Server. They're mostly SQL in plaintext, but MySQL uses a nonstandard format for INSERT statements and also for many other features. So you'll have to find a tool, or start writing 'em, just like me. One other approach is to set up a MySQL instance just for loading. Run the dump to load that server, then use SSIS to move the data directly from the MySQL server to SQL Server.

If you have further questions about SQL Server, I'm happy to help, but I'd thank you to start your own thread with them.
 
mikeblas said:
Was my previous post not clear enough about newlines?

If you're using Unix, that will make it a bit more difficult to compare timings. I could be wrong, but I don't think Perl supports Unicode in its core functions.

i meant for the output, i'm assuming tab-delimited fields, newline-delimited records (CR+LF), but i just wanted to make sure.
 
fluxion said:
i meant for the output, i'm assuming tab-delimited fields, newline-delimited records (CR+LF), but i just wanted to make sure.
Oh, I see. Yep, newlines between output records, where a newline is "\r\n".
 
C was a language written to make best use of expensive computer hardware.
Perl was a language written to make best use of expensive computer programmers.

If you have a problem to solve, what is your goal? Does it need to run as fast as possible on a given system, or do you want to get a working solution as fast as possible.

And often it is indeed cheaper to get more / better hardware and call it a day.

drizzt81 said:
I think he is trying to say that it may be cheaper to buy more/ "better" hardware and throw it at the problem than pay some programmers "extremely high" salary. Since the hardware is a fixed cost, which in the long run is zero... blah blah blah.
 
Perl was designed for string processing. C or C++ will lose terribly.
 
doh said:
Perl was designed for string processing. C or C++ will lose terribly.
What's your wager?

Whatisname said:
Perl was a language written to make best use of expensive computer programmers.
Perhaps. But, if so, why do I not yet have a Perl solution in my hands?
 
Because computers don't program themselves, and nobody is willing to work for free? Maybe if you offered a prize things would change.

mikeblas said:
Perhaps. But, if so, why do I not yet have a Perl solution in my hands?
 
Whatsisname said:
Because computers don't program themselves, and nobody is willing to work for free? Maybe if you offered a prize things would change.
People work for free all the time--I'm an example. Meanwhile, if Perl is so much about saving the efforts of the programmer, why is solving such a simple problem "work", anyway?

I'd thought of offering a prize or reward, but that would imply choosing the best, and that would require scoring, which would make it more complicated and touchy. Flammability around here is high enough without pouring gas on the fire.

And a prize might not solve the problem, anyway. I've offered sizable prizes in the past and received zero responses.

I would've thought that discovering something and studying it would've been motivation enough, but perhaps I've overestimated the altruism and drive of my audience.
 
well, i came up with a perl version, i'm a little hesitant to stick my neck out since it was a pretty obvious and unclever approach. i think i got all the formatting right...i'll post up the times and output, took about 2.5 minutes on a P4 2.4ghz with 512mb. if that's in the ballpark then ill try to optimize it a bit more and make sure itll run ok on windows.

fluxion@purity:~/scripts/perl_bakeoff$ time ./bakeoff.pl enwiki-20060702-page.sql

real 2m21.237s
user 1m11.860s
sys 0m2.475s

fluxion@purity:~/scripts/perl_bakeoff$ tail -4 output
5803515 0 "Nambiyur" "" 0 0 1 0.495551754218 "20060702231114" 61757971 1394
5803516 6 "CVCOG logo.png" "" 0 0 1 0.723697823784 "20060702231120" 61757981 68
5803517 0 "Phalacrocorax nigrogularis" "" 0 1 1 0.955231507353 "20060702231129" 61758005 31
5803518 6 "Uniontown PA.jpg" "" 0 0 1 0.111196242907 "20060702231130" 61758009 89

i guess ill go ahead and post the script as it stands:

Code:
#!/usr/bin/perl

use strict;
open(FILE, '>output');

my $rec_start;

while (<>) {
  #skip all uninteresting lines
  next unless (/INSERT INTO `page` VALUES/o);
  chop; chop; chop; #remove the last ); ....ugly as shit 

  $rec_start = 1;
  
  #split each line up into separate records using '),(' as a delimiter
  #iterate through the records and clean as we go
  
  foreach my $record ( split(/\),\(/, $_) ) {
    #shave the insert info off the beginning of the first record
    if ($rec_start) {
      $record = substr($record, 27);
      $rec_start = 0;
    }

    #replace commas with tabs
    $record =~ s/,/\t/g;

    #replace underscores with spaces
    $record =~ s/_/ /g;

    #replace non-escaped single-quotes with double-quotes
    $record =~ s/(?<!\\)'/"/g;

    #trim an escape off any sequence of escapes
    $record =~ s/(?<!\\)(\\+)\\(?!\\)/$1/g;
    
    print FILE "$record\r\n";
  }
}

close FILE;

edit: blah, unicode.....ill have to get to that later
 
That looks pretty good, fluxion. I can read Perl good 'nuff, but I'm certainly not good enough at it to suggest optimizations.

You've run it against the enwiki file, then? I'll try it when I get home on my own machine for the sameness, but 2.5 seconds should be in contention.

How does it deal with the end of one INSERT statement, and the start of the next? Does "foreach my $record ( split(/\),\(/, $_) ) {" end the loop when the regex returns no more matching records? Do you correctly get the last record, which ends with ");" instead of "),(" ?
 
mikeblas said:
What's your wager?

Perhaps. But, if so, why do I not yet have a Perl solution in my hands?

I learned my lesson in General Mayhem and don't do work for others that seems like homework.
 
mikeblas said:
That looks pretty good, fluxion. I can read Perl good 'nuff, but I'm certainly not good at it enough to suggest optimizations.

You've run it against the enwiki file, then? I'll try it when I get home on my own machine for the sameness, but 2.5 seconds should be in contention.

How does it deal with the end of one INSERT statement, and the start of the next? Does "foreach my $record ( split(/\),\(/, $_) ) {" end the loop when the regex returns no more matching records? Do you correctly get the last record, which ends with ");" instead of "),(" ?

yup 2.5 min. to read the file and generate the output file

and yah the split function generates a list and the foreach iterates through till it reaches the end, then it goes back into the while loop which reads in the next line. the ");" was dealt with by that "chop; chop; chop;" nastiness at the beginning, which just trims the last 3 characters off the line, removing ); and a newline i think

i'm not too sure where to look for optimizations either, maybe see about combining or simplifying some of the regex operations. i'll also run it with activestate on my amd box later on and see what happens
 
Good work fluxion, looks like you're the Perl benchmark for me to beat. :)

Downloading the massive text file as we speak...I hope to have a solution within the next 2 hours or so.
 
haha, i doubt you'll have too much trouble. im looking forward to seeing some other solutions
 
I was wondering, are 'we' stripping off the INSERT INTO 'page' ... statements too?
How about the DDL drop tables statements?

looking at that perl code, we are not.
 
Hmm, from mikeblas' explanation, I thought all we wanted was the data in the tuples, with all the DDL info and INSERT statements stripped away.

On a sidenote: this file is kicking my system's butt. I have 3 GB of RAM, but it crashed Visual Studio 2005, Metapad, Notepad2, and JEdit.

The only things I've managed to view this thing with are less and Wordpad, and less doesn't exactly format it very well, while Wordpad is staying at a constant 100% CPU time whenever it's in focus. Yikes...
 
doh said:
I learned my lesson in General Mayhem and don't do work for others that seems like homework.
I was hoping you'd teach me how Perl (which is itself implemented in C/C++) could possibly provide a solution faster than a C or C++ solution, to demonstrate the verity of your assertion.

BillLeeLee said:
Hmm, from mikeblas' explanation, I thought all we wanted are the data in the tuples, and all the DDL info and INSERT statements are stripped away.
Yeah: all tuples, no SQL.

BillLeeLee said:
On a sidenote: this file is kicking my system's butt.
Yeah, it's a monster. I've got some command-line tools that I can use to browse it if I don't go too deep into the file. My next "homework" assignment might be a file browser that can cope with these big files.

For this particular file, I think the issue is more about the line length. Each INSERT statement is about a megabyte long, IIRC.
 
I am having the hardest time reading ANYTHING from that file... :/
it stinks, I'm 'giving' up for now.
 
mikeblas said:
I was hoping you teach me how Perl (which is implemented in C/C++) would possibly provide a solution that was faster than a C or C++ solution in itself to demonstrate the verity of your assertion.

Perl is a high-level language, and as a high-level language its strengths lie in its ability to rapidly create a solution, which may not be as quick as it would be in, say, assembly or machine code.
 
fluxion said:
and yah the split function generates a list and the foreach iterates through till it reaches the end, then it goes back into the while loop which reads in the next line. the ");" was dealt with with that "chop; chop; chop;" nastyness at the beginning, which just trims the last 3 characters off the line, removing ); and a newline i think

Are there no pages with "),(" in the name? Looks like you got lucky!
 
mikeblas said:
Are there no pages with "),(" in the name? Looks like you got lucky!

well it wouldnt be hard to tighten up the matches, but yah i just kinda assumed that wouldnt happen. i live my life on the edge :p
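(For reference, one way to drop the "),(" assumption is to find tuple boundaries with a scan that tracks whether the current character sits inside a quoted string, honoring backslash escapes. A hypothetical C++ sketch, not from any posted solution; a title containing "),(" then splits correctly.)

```cpp
#include <string>
#include <vector>

// Split one INSERT line into its tuple bodies. Parentheses only count
// as tuple boundaries when we're not inside a single-quoted string;
// backslash escapes inside strings are kept verbatim for later
// unescaping. Assumes a well-formed line; illustrative only.
std::vector<std::string> split_tuples(const std::string &line)
{
    std::vector<std::string> tuples;
    std::string cur;
    bool in_string = false;
    int depth = 0;
    for (std::size_t i = 0; i < line.size(); ++i)
    {
        char c = line[i];
        if (in_string)
        {
            cur += c;
            if (c == '\\' && i + 1 < line.size())
                cur += line[++i];            // keep escape pair intact
            else if (c == '\'')
                in_string = false;           // closing quote
        }
        else if (c == '\'') { in_string = true; cur += c; }
        else if (c == '(')  { if (depth++ > 0) cur += c; }
        else if (c == ')')
        {
            if (--depth == 0) { tuples.push_back(cur); cur.clear(); }
            else cur += c;
        }
        else if (depth > 0) cur += c;
        // SQL text, commas, and the ';' at depth 0 are skipped
    }
    return tuples;
}
```

It's a bit more work than a one-line split, but it can't be fooled by page titles that happen to contain the delimiter.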
 