#1
C++ and Perl bakeoff
Would anyone like to have a bakeoff to see if C++ or Perl is better at string-intensive processing? It would be fun, something like a programming contest (without prizes). A benchmark-athon.
I like to download the Wikipedia content dumps and play with them. There are lots of interesting things to do. Of course, they're all formatted for MySQL, and I don't run MySQL. So I need to convert some of the scripts for SQL Server. One file is a bunch of INSERT statements (using some MySQL-specific syntax, of course) that loads a single table full of information about every article. I've written a program to rip through that file and make a CSV out of it. Then, I can use another tool to load that CSV file into SQL Server very rapidly.
There's lots of parsing going on; INSERT (a1,'b1','c1'),(a2,'b2','c2'),(a3,'b3','c3') becomes
a1,"b1","c1"
a2,"b2","c2"
a3,"b3","c3"
and so on. There are escapes, too; the strings have single ticks escaped with a backslash, and backslashes are escaped, too. (So \' becomes ', and \\ becomes \.) There are also hex escapes, it seems: \xe0 becomes whatever single character 0xE0 is. I haven't figured out if that also applies to four-digit codes (for Unicode), like \x2240 becoming whatever single Unicode character 0x2240 is.
I think I've got a very fast solution in C#, and I want to write a C++ version for comparison. Who can come up with a Perl version? String processing should be Perl's home court, right?
Last edited by mikeblas; 07-19-2006 at 01:31 PM..
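For anyone who wants a concrete starting point, here is a minimal Python sketch of the two backslash escapes and the CSV quoting described in this post. The names `unescape` and `csv_row` are made up for illustration. Leaving every other backslash sequence untouched, and doubling embedded double quotes in the CSV output, are assumptions rather than anything stated in the post.

```python
def unescape(s):
    """Turn \\' into ' and \\\\ into \\ ; leave every other sequence alone (assumption)."""
    out = []
    i = 0
    while i < len(s):
        # Only the two escapes named in the post are translated.
        if s[i] == "\\" and i + 1 < len(s) and s[i + 1] in "'\\":
            out.append(s[i + 1])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


def csv_row(fields):
    """Build one CSV line from (text, was_quoted) pairs.

    Values that were quoted in the SQL are unescaped and wrapped in double
    quotes; bare values (numbers) are passed through unchanged.
    """
    cells = []
    for text, was_quoted in fields:
        if was_quoted:
            # Doubling embedded double quotes is the usual CSV convention (assumption).
            cells.append('"' + unescape(text).replace('"', '""') + '"')
        else:
            cells.append(text)
    return ",".join(cells)
```

Calling `csv_row([("a1", False), ("b1", True), ("c1", True)])` gives the first output line from the example above. Splitting the raw INSERT text into fields is a separate step that this sketch does not cover.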
#2
Sounds interesting. Since I don't know Perl well, someone else could get you something more efficient (and sooner) than I. But you've given me a good idea for learning more Perl...
#3
I heard Python whups ass at strings too, but I don't have any proof. Anyone know for sure?
#4
We can put a Python horse in the race, too.
#5
I am game, but probably won't be able to dream up a solution until lunch hour tomorrow or nighttime.
Should be fun with my dirty way of programming Perl. Someone else can take the Python way, or maybe someone will use sed/awk.
#6
i'm game to give it a shot at least. i won't be able to get into it till this weekend though, but feel free to go ahead and post up the full criteria (or is this geared toward people who already know what conversions need to be done?).
in any case i seriously doubt i'd outdo mikeblas in anything related to programming but maybe perl can pick up my slack
#7
Nah; we have to specify the translation, otherwise it's not a precise comparison and we don't learn anything about it.
The file I'm using is page.sql.gz from http://download.wikimedia.org/enwiki/20060702/. It's a 149 meg download, and expands to 412 megs uncompressed. The start of the file is some DDL to drop a table, then create it and some indexes. Finally, we get to the INSERT statement. The first two inserts are here:
Code:
INSERT INTO `page` VALUES (1,0,'AaA','',8,1,0,0.116338664774167,'20060401120725',46448774,70), (5,0,'AlgeriA','',0,1,0,0.553221851171201,'20060301005610',18063769,41), ...
The program I'm writing actually reads the file and cleans up the records and puts them into a structure that I then bulk insert directly into SQL Server. Since I'm not using INSERT statements (that's the slow way), I want raw data without escapes. So I translate strings like "'AmeriKKKa\'s_Most_Wanted'" (note single ticks inside quotes) into "AmeriKKKa's Most Wanted" (no single ticks) and insert that directly.
That's all I've got so far, so I'm a little premature in asking if anyone wants to play with comparisons, because I don't have a "spec" for all the translations yet.
I was off in my first post; the MySQL documentation explains there are no \x## hex translations. So when I find this in the file:
Code:
'Broken/Jo\\xc3\\xa2\\xe2\\x80\\x9e\\xef\\xbf\\xbd'
Code:
'Broken/Jo\xc3\xa2\xe2\x80\x9e\xef\xbf\xbd'
Code:
(1,0,'AaA','',8,1,0,0.116338664774167,'20060401120725',46448774,70),
(2,0,'AmeriKKKa\'s_Most_Wanted','',8,1,0,0.116338664774167,'20060401120725',46448774,70),
Code:
1<tab>0<tab>AaA<tab><tab>8<tab>1<tab>0<tab>0.116338664774167<tab>20060401120725<tab>46448774<tab>70<newline>
2<tab>0<tab>AmeriKKKa's Most Wanted<tab><tab>8<tab>1<tab>0<tab>0.116338664774167<tab>20060401120725<tab>46448774<tab>70<newline>
Everybody uses ActiveState Perl for Windows, right? I think we should run each other's programs and average the results. I run my own C# and C++ code, and you run it too; I run your Perl and Python programs and we see how the scores go. That'll help eliminate machines and disk setups and so on.
Last edited by mikeblas; 07-20-2006 at 04:54 PM..
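As a rough illustration of the translation in this post, here is a small Python sketch that walks the `(...)` tuples of an INSERT line and emits tab-delimited records. It is not anyone's contest entry, and the names `parse_values` and `to_tsv` are hypothetical. Two things are assumptions inferred from the samples rather than stated rules: underscores inside quoted strings become spaces (as in the AmeriKKKa example), and any backslash sequence other than \' and \\ is copied through unchanged. It also expects to be fed only INSERT lines, since the parentheses in the DDL at the top of the file would confuse it.

```python
def parse_values(text, underscores_to_spaces=True):
    """Yield one list of field strings per (...) tuple found in text."""
    i, n = 0, len(text)
    while i < n:
        if text[i] != "(":          # skip everything between tuples
            i += 1
            continue
        i += 1
        fields, buf, quoted = [], [], False
        while i < n:
            ch = text[i]
            if ch == "'":           # start of a quoted string
                quoted = True
                i += 1
                while i < n and text[i] != "'":
                    if text[i] == "\\" and i + 1 < n and text[i + 1] in "'\\":
                        buf.append(text[i + 1])   # \' -> '   and   \\ -> \
                        i += 2
                    else:
                        buf.append(text[i])
                        i += 1
                i += 1              # step past the closing quote
            elif ch in ",)":        # end of a field, or of the whole tuple
                value = "".join(buf)
                if quoted and underscores_to_spaces:
                    value = value.replace("_", " ")   # inferred from the sample output
                fields.append(value)
                buf, quoted = [], False
                i += 1
                if ch == ")":
                    break
            elif ch.isspace():      # ignore whitespace outside quotes
                i += 1
            else:
                buf.append(ch)      # part of an unquoted value such as a number
                i += 1
        yield fields


def to_tsv(text):
    """Join each record's fields with tabs, one record per line."""
    return "".join("\t".join(fields) + "\n" for fields in parse_values(text))
```

Passing the two sample tuples above through `to_tsv` reproduces the two tab-delimited lines shown, and the doubled backslashes in the 'Broken/Jo...' title collapse to single ones. A character-at-a-time loop like this is written for clarity, not speed, so it says nothing about how fast a tuned Python entry could be.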
#8
hmm, i actually use perl for linux. but i have activestate installed, never really used it but i'll try to make sure it works in windows. i don't see why it wouldn't though, i'll probably just be using core functions.
but yeah, sounds like a pretty sweet little challenge. seems pretty clear, i'm assuming the records are separated by newlines? i'll try to get something together as soon as i have time.
#9
Was my previous post not clear enough about newlines?
If you're using Unix, that will make it a bit more difficult to compare timings. I could be wrong, but I don't think Perl supports Unicode in its core functions.
#10
Perl does have some built-in Unicode support as of 5.6 (and 5.8.x has improved it a lot) - I know you can at least print Unicode, but I haven't dealt with it all that much.
@fluxion I use ActiveState Perl at work for scripting - it runs just as well as the Perl 5.8 binaries I compiled myself for Windows. The current ActiveState Perl distribution is just 5.8.8 with some more modules (like the Win32 modules) built in.
#11
There are 4,752,747 records in the file, and the input file is 412,064,906 bytes long. About 86.7 bytes per record, then.
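The per-record figure is just the file size divided by the record count, which a couple of lines of Python can confirm. The variable names are illustrative, and the result slightly overstates the true record size because the file's byte count also includes the DDL at the top.

```python
# Figures quoted in the post.
file_bytes = 412_064_906
record_count = 4_752_747

# Average input bytes per record, including SQL syntax and the DDL preamble.
bytes_per_record = file_bytes / record_count
print(f"{bytes_per_record:.1f} bytes per record")
```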
#12
depends which you consider more valuable, the computer hardware or the computer programmer.
#13
Huh?
#14
Quote:
#15
The Moore's Law argument? That's broken; Moore's Law stopped in 2003. Before that, people called code written with this excuse "bloatware". For tools, it can be viable -- but it's hard to justify for client-side apps that ship.
Last edited by mikeblas; 07-20-2006 at 05:50 PM..
#16
Can I do it in Java? What is the memory limit for this program? I don't have anything to do this weekend :-). What is the timeframe?
#17
Quote:
Whatever language you'd like, sure. I don't run Unix, though; and if it's not something I use every day, you'll have to provide instructions on getting the tools or downloads I'll need to time it. I think the "specs" in post #7 are pretty much everything I can think of, but I'm happy to answer questions if you think something is missing. Otherwise, feel free to get started. It'll take you a while to download the input file, so that's a great thing to get going right away. I guess there's no real timeframe, but there's no exclusivity to participation, either. That is, don't expect to be the only guy who will implement a tested solution in a language. The more the merrier. Also, I'll give everybody credit, but I'd like to post a web page with the solutions and the comparisons and analysis and so on.
#18
Quote:
Quote:
Quote:
Quote:
One last thing: you mentioned importing into SQL Server is super fast using a comma/tab delimited file. However, that is not faster than restoring from a BAK, correct? The reason I ask is because I deal with client databases on a regular basis, similar to this example (i.e. a client has a MySQL database dump they send us, and it is in the same format Wikipedia has provided), and I was wondering if there was an easy translation from that dump to an MS SQL BAK file. I guess this is probably the best way of importing a MySQL dump?
#19
Quote:
Quote:
Quote:
If you have further questions about SQL Server, I'm happy to help, but I'd thank you to start your own thread with them.
#20
Quote: