C++ and Perl bakeoff

fluxion said:
well it wouldnt be hard to tighten up the matches, but yah i just kinda assumed that wouldnt happen. i live my life on the edge :p
Looks like you're writing quotes, and also you're writing \r\r\n at the end of each line instead of \r\n.

I wrote my C++ program on my laptop; I'll tear the SQL out of my C# program when it's cool enough to go upstairs and do some timings.
 
I am not sure this is a problem that can be solved 100% correctly with just the use of some simple regular expressions. Doing something as simple as "replace all commas with tabs" is not going to produce the desired output in some cases - what happens when one of the string literals has a comma in it? Is it possible to write a regular expression that says "replace all commas with tabs except commas inside single quotes"? That sounds a little more complex to me.

I tried writing something in Java and it seems to be a dismal failure at this point. I basically went for simplicity and wrote a finite state machine using a switch statement that looked at every single character in a source array, applied some simple logic and then wrote characters to a destination array. Needless to say doing this character by character is terribly inefficient, especially with the overhead of copying every single character.

I am unsure how hard-coded this particular solution could be. For example, each tuple contains 11 fields: the first and second are always numbers which need no conversion; the third is some sort of title, which is where the escape codes might possibly need conversion; the fourth appears to be a string literal which is almost always empty; the 5th, 6th, and 7th are also numbers which need no conversion; the 8th is a decimal number between 0 and 1 exclusive, always beginning with "0." followed by 15 digits; the 9th appears to be a datestamp of some sort which just needs to have the single quotes around it removed; and the last two are numbers. So theoretically I could write a long regex to handle an entire tuple if I wanted. Is that legal? ;)
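Just to make that concrete, here is a rough sketch of what a one-regex-per-tuple approach might look like in Perl. This is only my guess at how it could go - the field pattern follows the layout I described above, the sample record is made up, and the conversion of the escapes and underscores is left out:

Code:
#!/usr/bin/perl
# Sketch of a one-regex-per-tuple approach, assuming the 11-field layout
# described above. The sample record is invented; converting the escapes
# and underscores is left for separate passes.
use strict;
use warnings;

my $tuple = qr/
    (\d+),(\d+),                # fields 1-2: plain integers
    '((?:[^'\\]|\\.)*)',        # field 3: title, may contain escaped characters
    '((?:[^'\\]|\\.)*)',        # field 4: string literal, usually empty
    (\d+),(\d+),(\d+),          # fields 5-7: plain integers
    (0\.\d+),                   # field 8: fraction between 0 and 1
    '(\d+)',                    # field 9: timestamp, quotes dropped on output
    (\d+),(\d+)                 # fields 10-11: plain integers
/x;

my $rec = q{1,0,'AaA','',8,1,0,0.116338664774167,'20060401120725',46448774,70};
if ($rec =~ /^$tuple$/) {
    print join("\t", $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11), "\r\n";
}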
 
What's legal is getting the correct output as fast as you can, generelz.

Hard coded is fine; 11 fields, knowing which one is which, sure.

I think one thing that this experiment might reveal is that some languages might be great for string handling, like using regular expressions. But regular expressions aren't the panacea they might seem. They're slow, they get complicated fast, and they're not as applicable as they might seem.

I think that you can use multiple passes with regular expressions, like fluxion is doing. I'd worry about each field, first, before thinking about the content of each field. For most fields, you can get to the next one by finding the next comma in the string. For two of the fields, you have to worry about single ticks and backwhacks and so on.

Is copying character by character really as bad as you think it is? (I'm asking -- you should measure if you don't know how bad it is for sure. In C/C++ it's not bad, because you own and have allocated the target memory. In other languages, you're not always allowed to think directly about memory and might have to reallocate and copy and concatenate repeatedly, and that certainly is expensive!)
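Here's roughly what I mean by walking from field to field, sketched in Perl. (This is only an illustration: the sample record is invented, and the escape handling is just my guess at the dump's format.)

Code:
#!/usr/bin/perl
# Walk a record field by field: plain fields run to the next comma;
# quoted fields get a small scan that skips backslash escapes.
# Assumes well-formed input (the closing tick is always present).
use strict;
use warnings;

my $rec = q{1,0,'Albert III, Elector of Saxony','',8,1,0,0.5,'20060401120725',1,2};
my @fields;
my $pos = 0;
while ($pos < length $rec) {
    my $end;
    if (substr($rec, $pos, 1) eq "'") {
        # quoted field: find the closing tick, skipping \' and \\
        $end = $pos + 1;
        while (substr($rec, $end, 1) ne "'") {
            $end += (substr($rec, $end, 1) eq '\\') ? 2 : 1;
        }
        push @fields, substr($rec, $pos + 1, $end - $pos - 1);
        $end++;                                # step past the closing tick
    } else {
        # plain field: runs to the next comma (or to the end of the record)
        $end = index($rec, ',', $pos);
        $end = length $rec if $end < 0;
        push @fields, substr($rec, $pos, $end - $pos);
    }
    $pos = $end + 1;                           # skip the comma
}
print join("\t", @fields), "\n";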
 
I'm trying something like that: mmap the source file, then open the output file, ftruncate it to the same size, and mmap it too. Iterate through the source file, filtering and copying the interesting parts. Finally, ftruncate the output file down to the number of bytes I ended up copying.
I happily assume that the VFS layer will bunch the writes to the output file together in a reasonably effective way, given the right flags. (I use MAP_NOSYNC, so disk writes can be delayed until it's convenient.)

Almost all the filtering is still missing, but right now the limiting factor seems to be the slow disk IO of my laptop. Besides, I have no idea how this will perform in windows or if everything I do is supported. Time will tell, I guess. :)

edit: This works in FreeBSD, I'll be porting it to windows shortly: bench.c.
On my laptop, the time to convert the file depends very much on file access and caching. I've seen times from 17 to 40 seconds.

edit: I'm not going to bother learning enough Win32 for a real port just yet.
Removing the MAP_NOSYNC is enough to make it compile with cygwin. I'll see if I can manage something slightly more elegant.

edit again: Cygwin works, but requires you to have cygwin installed to run it. MinGW does not work, because it doesn't have mmap. It compiles with SFU (interix) + gcc, but much like cygwin, you'll have to install it to run the executable.
I don't really want to port the thing to Win32 (MapViewOfFileEx?), so ...
 
generelz said:
Is it possible to write a regular expression that says "replace all commas with tabs except commas inside single quotes"? That sounds a little more complex to me.

most definitely. A regular expression can be written to match just about any conceivable pattern
 
mikeblas said:
Looks like you're writing quotes, and also you're writing \r\r\n at the end of each line instead of \r\n.

I wrote my C++ program on my laptop; I'll tear the SQL out of my C# program when it's cool enough to go upstairs and do some timings.


hmm, i'm looking at my output file and there's definitely only 2 whitespace characters at the end of each line. i'm not sure how to go about seeing exactly what they are, but through inference i'd assume it's the "\r\n" i explicitly added to each record. i think a likely explanation would be that ActiveState rewrites "\n" as "\r\n" so that unix scripts will work transparently on windows, which might explain the \r\r\n you're getting. i still havent had a chance to try it myself.

and whoever mentioned that straight replacement of commas with tabs could lead to problems, you're right, that was a major oversight on my part, and i've already seen at least a few cases where that would be an issue. i have several ideas on how to fix that, but it's turning out to be a real challenge to address the issue without hurting performance too much. i'm gonna be extremely busy this weekend, but i hope to have most of the kinks worked out sometime next week.

ahh yes, generelz

generelz said:
Is it possible to write a regular expression that says "replace all commas with tabs except commas inside single quotes"? That sounds a little more complex to me.

it is, you can do look-ahead and look-behind checks. the issue is that if the comma is in the middle of a string, you'd have to do a variable-length look-behind to match the "quote followed by non-quotes" part, and variable-length look-behinds arent yet implemented in perl's regex engine. that wouldnt be an issue if it wasnt for the fact that i'm looking to replace commas which DONT fall into that category, so a regular string matching sequence (plus capture) following the look-behind wouldnt work in that case. unless i can negate it....i dont know...

but i'm not very good with regex.

you can generally assume that regex can handle pretty much everything you throw at it. i utilize maybe 10% of its capabilities in my scripts. it's a matter of not knowing its more subtle features that really define what is and isnt possible for a particular person at a particular time.
 
Fluxion:
Code:
nord# time ./bench.pl enwiki-20060702-page.sql
137.011u 7.501s 3:20.09 72.2%   10+4715k 4816+3112io 38pf+0w
nord# time ./bench enwiki-20060702-page.sql tmpfile
412064906 -> 378689587 characters converted.
7.012u 4.041s 0:29.24 37.7%     5+179k 13+23io 6296pf+0w

My C > your perl. ;)
 
haha, very nice. but keep in mind an optimal perl solution has yet to be posted :)

edit: on another note, i played with it a little more, and it's definitely not the multiple regex substitutions that are hurting me (if i switch them all off i still get about the same time), so it's the way im handling the file that's killing me. i suspect split()'ing each line from the original file is creating quite a bit of overhead. ill definitely look for an alternative before i continue on with this
 
Indeed, we're still looking for a correct Perl solution.

I've got my C++ solution working. On my machine, it finishes in 38 seconds. Fluxion's Perl code runs in 120 seconds.

I haven't done any profiling on the C++ code; maybe I can squeak out some more perf. 38 seconds means 21 megs/second, though, and that seems pretty good.
 
HHunt, mikeblas, and anybody else running benchmarks on this:

How do you handle the effects of file caching? NT and Linux (and probably FreeBSD) will cache files fairly aggressively (depending on configuration), so the order of running the tests could have a profound effect on total runtime.

I think that a cold benchmark (no part of the input file cached) would provide the most interesting results, though that would require a utility that cleared the file cache (or a reboot before running each test). Without such a utility, I'd run each version repeatedly (10 times), discard the high and low results, and average the runtimes.
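Something like this quick Perl harness would handle the repeat-and-average part. (The command line below is just a placeholder; substitute whatever program is being timed.)

Code:
#!/usr/bin/perl
# Run a conversion program ten times, drop the fastest and slowest runs,
# and average the rest. The command is a placeholder.
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my @cmd = ('./stripwiki', 'enwiki-20060702-page.sql', 'out.txt');
my @times;
for (1 .. 10) {
    my $t0 = [gettimeofday];
    system(@cmd) == 0 or die "run failed: $?";
    push @times, tv_interval($t0);
}
@times = sort { $a <=> $b } @times;
shift @times;                          # drop the fastest run
pop @times;                            # drop the slowest run
my $avg = 0;
$avg += $_ for @times;
$avg /= @times;
printf "average of %d runs: %.2f s\n", scalar @times, $avg;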

Disk IO will probably dominate the best implementations, so if I were to implement this, that's where I'd begin. A language without access to memory-mapped files or overlapped IO will find itself at a real disadvantage in a test like this.
 
Memory-mapped files aren't a panacea. They actually cause worse I/O, not better I/O. And they certainly don't cause any less I/O.

I'm thinking that the best comparison numbers will involve writing to NULL and reading from a RAM disk.

There are a few tools that claim to flush the Windows file cache, but I've never used one that I believed worked. Rebooting is the only way to reset the cache.

I don't think the effects are as dramatic as you note:

Just run it. I've run it before, several times, this session. But I just rebuilt both debug and release. 31 seconds.
Run it again. 29 seconds.
Reboot. Run it as soon as the machine settles down. 32 seconds.
Run it again. 29 seconds.
The hot times are consistent, but cold and warm times aren't so far away.

I think in this case, the large file sizes prevent the file from being cached in any meaningful way between runs.

I too was thinking that disk I/O would dominate. In fact, so much so that the test was meaningless. If we can get 40 megs/second from a desktop drive while reading the file, and we're running on a 3.2 gig machine, we've only got 80 clock cycles per byte as it's being read. To do even this simple parsing, that's challenging -- it's less than 30 instructions per byte, assuming absolutely zero cache misses!

Why do you think the cold cache case is the most interesting?
 
Hmm. A neat trick for flushing the cache might be to make two copies of the file, and establish two targets. Then, running them back to back assures that the current one is cold. I think that works for this test, since the file sizes are so large -- for smaller files, it wouldn't be too interesting.

So, input1 is the same content as input2.

stripwiki input1 output1 // input1 in cache
stripwiki input2 output2 // input2 in cache, bumping input1
stripwiki input1 output1 // input1 bumps input2
stripwiki input2 output2 // input2 bumps input1 again

Note that it is advantageous to use FILE_FLAG_NO_BUFFERING while reading, which skips the Windows cache manager. Since the program's individual run doesn't need to reread the same data from the same file again, why cache it? While this doesn't address the issue of timing for the benchmark, it does provide a useful optimization for applications.
 
doh said:
Perl is a high level language and as a high level language its strengths lay in its ability to rapidly create a solution which may not be as quick as it would be in, say, assembly or machine code.

doh, please stay with the topic of the thread. There's no need to troll aight'? :)

--KK
 
KingKaeru said:
doh, please stay with the topic of the thread. There's no need to troll aight'? :)

--KK

I'm not trolling. He asked for clarification and I gave it.
 
doh said:
I'm not trolling. He asked for clarification and I gave it.

doh said:
Perl was designed for string processing. C or C++ will lose terribly.

doh said:
I learned my lesson in General Mayhem and don't do work for others that seems like homework.

This thread is a friendly contest/exercise that could turn out to be fun and educational for the community. If you wish to contribute positively with mike's challenge then by all means please do so. But if all you're going to do is drop one liners such as what you've posted, please take it someplace else.

Feel free to PM me if you have further questions regarding this. I'm more than happy to continue our conversation via PM so as not to further disrupt the original topic.


--KK
 
My dev machine at home is named ENDURO. It has 4 gigs of memory and runs an Asus P5AD2 Deluxe motherboard and a 3.2 GHz Pentium 4. The C: drive is a Raptor 150, and the F: drive is a RAID0 array of two Hitachi 300 gig drives.

The wiki dump files are on F:. When I time them, the programs read from F: and write to C:.

Fluxion's Perl script from his earlier post runs in about 78 seconds. The version I have doesn't produce correct output. Its output file is 412,052,301 bytes, nearly 9% larger than my C++ program's.

My C++ program is here. It runs in about 17 seconds. It generates 378,689,587 bytes.

A faster version of my program is here. It runs in about 9 seconds. It also generates 378,689,587 bytes.

My C# program is here. (Note that it is a *.TXT file, but really a *.CS file. My server doesn't like to serve *.CS files because it thinks they're executable.) It runs in about 34 seconds. It generates 378,689,587 bytes, and its output compares perfectly to the C++ program's output. I can't apply some of the optimizations I've done to the C++ program to the C# version (obviously), but I'm thinking that I can try a couple of other things so I might get the runtime down.
 
mikeblas said:
Why do you think the cold cache case is the most interesting?
Outside of synthetic benchmarks, I've never felt the need to convert the same file twice in a row :).
 
MonkeyShave said:
Outside of synthetic benchmarks, I've never felt the need to convert the same file twice in a row :).
I'm not sure the term "synthetic benchmark" can be applied to this exercise, particularly given the negative connotation it has in these forums.

The idea of the exercise is to try to solve the same problem in a few different languages. Assuming we can provide well-tuned implementations in each language, what language is inherently faster at this task? Since it is impossible to find a single application that reflects everything anyone might do with a language, I just picked something I'm doing a lot of personally, and thought Perl would be a good foil because it's billed as a great language for string processing. (Which is what we're doing here.)

To that end, I'm more interested in the application and the libraries or platforms it rides on than the ability of the OS to service a recently used, very large file out of cache -- or to do large-block sequential I/O. I think that's teased out of the numbers if I/O doesn't dominate the runtime, not if I/O is forced to dominate the runtime as much as possible.
 
The SQL::Translator module for Perl would handle this if it was feasible to load a 412MB .sql file into memory.

# sqlt -f MySQL -t SQLServer enwiki-20060702-page.sql > enwiki-20060702-page.sqlserver
 
Where do I get the SQL::Translator module? How do I set it up, and so on? It produces a tab-delimited file, or a T-SQL INSERT script?
 
mikeblas said:
My dev machine at home is named ENDURO. It has 4 gigs of memory and runs an Asus P5AD2 Deluxe motherboard and a 3.2 GHz Pentium 4. The C: drive is a Raptor 150, and the F: drive is a RAID0 array of two Hitachi 300 gig drives.

The wiki dump files are on F:. When I time them, the programs read from F: and write to C:.

Fluxion's Perl script from his earlier post runs in about 78 seconds. The version I have doesn't produce correct output. Its output file is 412,052,301 bytes, nearly 9% larger than my C++ program's.

My C++ program is here. It runs in about 28 seconds. It generates 378,689,587 bytes.

i dont know if you saw my reply earlier, but i'm thinking that ActiveState replaces "\n" with "\r\n" so that unix scripts will work correctly on windows. since i wrote "\r\n" explicitly it would be adding an extra character on each line. that's my theory anyway, all i know is that my output doesnt have the extra character you're getting.

of course, my file is 407,xxx,xxx bytes so there's definitely some lingering issues that need addressing.

i did manage to reduce the runtime by 40% by getting rid of that first split(), and instead using perl's readline() with the delimiter set to "),(". i wont post till i know everything is working properly though
 
doh said:
The SQL::Translator module for Perl would handle this if it was feasible to load a 412MB .sql file into memory.

Thanks; I don't see anything about installation here, tho.

fluxion said:
since i wrote "\r\n" explicitly it would be adding an extra character on each line. that's my theory anyway, all i know is that my output doesnt have the extra character you're getting.
I think your diagnosis is probably right. But that doesn't take care of the double quotes around each of the strings in the file. I'm getting this output:

Code:
1       0       "AaA"   ""      8       1       0       0.116338664774167       "20060401120725"...

when I expect this:

Code:
1       0       AaA         8       1       0       0.116338664774167       20060401120725...

If you have an updated version that's faster or fixed, let me know and I'll be happy to run it.
 
I have updated post 56 with runtime for my C# program. A link to the source is up there, too.
 
OHH, wow sorry about that i must've misinterpreted. i misread this line:

"'AmeriKKKa\'s_Most_Wanted'" (note single ticks inside quotes) into "AmeriKKKa's Most Wanted" (no single ticks) and insert that directly.

so i was explicitly replacing non-escaped single-quotes with double-quotes. my bad. that's good, cuz i couldnt figure out what the heck could possibly be adding 30mb worth of characters.

and that'll make it much easier to deal with the "comma inside title" case when i start replacing commas with tabs

so yah, ill try to get everything up to spec, maybe look at more ways to optimize it, and get a new version up as soon as i have a chance
 
No worries; sorry it wasn't more obvious. It's hard to be perfectly clear with the limited expression the forum software offers. "" could be tick tick tick tick, or tick tick quote, or whatever. And not being able to show a table, and so on ... it really cramps my style.
 
Okay, my version still has bugs and it misses some things, but as it stands it's doing my processing of all the lines in roughly

Code:
billylee@canti ~ $ time ./bakeoff.perl

real    0m9.942s
user    0m7.939s
sys     0m1.999s

But you can see from my output that it's not completely correct:

The tail:
Code:
5803517 0       'Phalacrocorax nigrogularis'    ''      0       1       1     0.955231507353 '20060702231129'    61758005 31
5803518 6       'Uniontown PA.jpg'      ''      0       0       1       0.111196242907 '20060702231130'        61758009     89

The head:
Code:
1       0       'AaA'   ''      8       1       0       0.116338664774167       '20060401120725'        46448774    70
5       0       'AlgeriA'       ''      0       1       0       0.553221851171201       '20060301005610'        18063769     41
6       0       'AmericanSamoa' ''      0       1       0       0.387834867823715       '20060301005610'        18063795     48

Here is the code as it stands - I borrowed a regex from fluxion for the time being. :)

Code:
#!/usr/bin/perl

# Author: William Lee
# Perl submission for the bakeoff. Mmmmm, baking.
#
# Not fully complete.


use strict;

open my $SQLFILE, '<', 'enwiki-20060702-page.sql' or die 'Can\'t open input file';
open my $OUTFILE, '>', 'out.txt' or die 'Can\'t open file for writing';

# must be set differently for windows! - to ;\r\n
#$/ = ';\n'; # set new chomp target

#my @fields;
#$#fields = 17000;

while(my $line = <$SQLFILE>)
{
        next unless ($line =~ /INSERT INTO /o);
        chop $line; chop $line; chop $line;
        #truncate and remove INSERT INTO 'page' VALUES (
        $line = substr($line, 27);
        
        # substitute \t for a ',', a ' ' for an '_'
        # attempt 1 - this will also make ),( become )\t(
        $line =~ tr/,_/\t /;
        
        # the escape sequence \'
        $line =~ s/\\'/'/g;
        
        # borrowed from fluxion
        $line =~ s/(?<!\\)(\\+)\\(?!\\)/$1/g;
        
        # replace all )\t( with newline
        $line =~ s/\)\t\(/\n/g;
        
        print $OUTFILE $line;
} # end while

close $SQLFILE;
close $OUTFILE;

As you can see I don't deal with unicode just yet (have to learn about that right now).

I do several things differently than fluxion however.

1 - I do not bother splitting up the line across "),(" because Perl works well on one giant string as a glob. Splitting introduces a lot of overhead since the Perl program has to allocate and constantly extend/reallocate arrays if you split/readline.

2 - fluxion uses s/// for most things, but I use transliteration tr///. For single-character replacements this is an improvement because tr/// does not go to the regular expression engine at all (there's a small Benchmark sketch after this list).

3 - I only hit the write operation once per iteration through the loop; fluxion's code uses a foreach and so prints on every pass through the foreach loop (thousands of times).
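Here's the sort of quick Benchmark check I have in mind for point 2. The sample line is made up, and the numbers will obviously vary by machine:

Code:
#!/usr/bin/perl
# Quick comparison of tr/// vs. s///g for a single-character replacement,
# using the standard Benchmark module. The sample data is invented.
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $sample = "(1,0,'Some_page_title','',8,1,0,0.5,'20060401120725',1,2)," x 1000;

cmpthese(-2, {
    'tr///' => sub { my $copy = $sample; $copy =~ tr/_/ /;  },
    's///g' => sub { my $copy = $sample; $copy =~ s/_/ /g;  },
});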

EXTRA NOTE: the output is borked because for a while I actually couldn't get my code to run, so I started fiddling around with the lines.

Okay, so it's not perfect - I will work on it. I really hope the 'correct' solution doesn't add on 18 more seconds in processing time...heh heh heh. :p

Sorry I didn't get it in sooner. I kind of got distracted for the last 24 hours... I hope to have an actual 'correct' solution in short order. Perhaps fluxion and I could combine our powers; I hope some of the stuff I do can help fluxion, and some of the stuff he does help me. :)

Requisite system info:

Pentium 4 HT 3.0 Ghz / 1000 Mhz FSB
3 GB RAM (200 Mhz)
The disk the program runs and reads/writes with is a Seagate 7200.8 120 GB
 
wow, 9 seconds???

*deletes script and never returns*

very nice. the foreach split() was definitely killing me, current version is a lot faster but your approach still seems more efficient. i'm still reading in each record separately

getting rid of the quotes from the output should be easy. note that every quote inside a title is escaped in the original file, so you can just remove all unescaped quotes before you get rid of the escapes.
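something like this is all i mean, roughly (title borrowed from mikeblas's example earlier; the important part is just the order of the passes):

Code:
# strip unescaped quotes first, then collapse the escapes
my $t = q{'AmeriKKKa\'s_Most_Wanted'};
$t =~ s/(?<!\\)'//g;     # drop the surrounding (unescaped) quotes
$t =~ s/\\'/'/g;         # turn \' into '
$t =~ tr/_/ /;           # underscores to spaces
print "$t\n";            # AmeriKKKa's Most Wanted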

there's also cases where commas appear in the title field so you cant just replace all the commas with tabs, i havent figured that one out yet ...if only perl did variable-length look behinds you could do something simple like s/(?<!'.+),(?=.*')/\t/g;

nice job though :cool:
 
fluxion said:
wow, 9 seconds???

*deletes script and never returns*

very nice. the foreach split() was definitely killing me, current version is a lot faster but your approach still seems more efficient. i'm still reading in each record separately

getting rid of the quotes from the output should be easy. note that every quote inside a title is escaped in the original file, so you can just remove all unescaped quotes before you get rid of the escapes.

there's also cases where commas appear in the title field so you cant just replace all the commas with tabs, i havent figured that one out yet ...if only perl did variable-length look behinds you could do something simple like s/(?<!'.+),(?=.*')/\t/g;

nice job though :cool:

I looked to your script for inspiration originally (that's what I do mostly, look at existing code and attempt to improve upon it).

If you do it the split way, a possible optimization I would suggest is to pre-extend your array.

Code:
my @fields_array;
$#fields_array = 16000; # or something

@fields_array = ... #split/readline, whatever

This will save you some overhead because, by default, Perl doesn't allocate much space for the array (let alone 10000+ elements). But splitting will still incur a pretty beefy performance penalty on each pass, and operating on the entire line at a time is better. Perl tends to be pretty good with long lines, but I don't want to slurp up 400 MB of text in one go (which will probably lead to performance degradation).

But it's a little too early to celebrate - for all I know the extra stuff to make it correct could balloon the runtime to minutes. :D
 
i actually got rid of the split() completely. currently i just iterate through the lines with readline, and when i hit a pertinent line i set $/ = '),(', and do the passes through each record (then set $/ back to default and iterate line by line again)
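in case anyone hasnt played with $/ before, here's a toy example of what that looks like (in-memory data; the INSERT prefix and the trailing ); still need their own passes):

Code:
#!/usr/bin/perl
# Toy demo of the $/ (input record separator) trick: with $/ set to '),('
# each read returns one record instead of one huge line. The leftover
# INSERT prefix on the first record and the ); on the last are left alone.
use strict;
use warnings;

my $data = "INSERT INTO `page` VALUES (1,0,'AaA',70),(2,0,'Foo',71),(3,0,'Bar',72);\n";
open my $fh, '<', \$data or die $!;    # read from an in-memory "file"

$/ = '),(';
while (defined(my $rec = <$fh>)) {
    chomp $rec;                        # chomp removes the '),(' separator
    print "record: $rec\n";
}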

i was actually worried that reading in a whole line would hurt performance, im not very privy to perl's performance aspects (obviously), but now that i think of it i suppose constant disk writes/reads would be an obvious nono. my processor certainly didnt seem to be working too hard heh

seems like you just need a couple more s/tr passes to get it up to spec, and from what i've seen those dont affect performance much at all (i commented out ALL my substitutions in my first script and the total time was only about 10 seconds faster than the initial 2.5min time) :cool:
 
BillLeeLee said:
But it's a little too early to celebrate - for all I know the extra stuff to make it correct could balloon the runtime to minutes. :D
Yeah, it might. Right now, you're around 15 seconds on my machine. But your results are not correct because your replacements are sloppy. Essentially, you've ignored the quotes and replaced things within the quotes that shouldn't be replaced -- commas and parens, most notably. An article named "Albert III, Elector of Saxony" becomes "Albert III\t Elector of Saxony", for example.

I think it's a good idea, tho, to try to treat as much of the line all at once.

Oh, and should I expect Java and Python versions from anyone? Any other languages?
 
Well I have what I believe to be a 100% correct Java version. The downside is it runs in just under 2 minutes, which is nowhere close to any of the previous submissions! :p

Here is the code:

Code:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

/**
 * SQLConvert.java: Submission for the C++/Perl bakeoff.
 * 
 * The objective is to read in a MySQL dump file containing table information
 * and convert the INSERT statements to a tab delimited format for import by
 * MS SQL Server.
 * 
 * This implementation relies on Java's regular expression utilities.
 * 
 * @author Zach Bailey
 * @created 21 July 2006
 */
public class SQLConvert 
{
    public static void main(String[] args) throws Exception
    {
        long startTime = System.currentTimeMillis();
        
        //set up input/output files
        File file = new File("enwiki-20060702-page.sql");
        BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(file)), 6*1024*1024);
        
        File outputFile = new File("output.txt");
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile)));
        
        while(in.ready())
        {
            String line = in.readLine();
            if(line.length() < 10 || !line.startsWith("IN")) continue;
            line = line.replaceFirst("INSERT INTO `page` VALUES ", "");
            line = line.replaceFirst("\\);$", "),");
            line = line.replaceAll("\\((\\d+),(\\d+),'(.*?)','(.*?)',(\\d+),(\\d+),(\\d+),(0.\\d+),'(\\d+)',(\\d+),(\\d+)\\),","$1\t$2\t$3\t$4\t$5\t$6\t$7\t$8\t$9\t$10\t$11\r\n");
            //replace all "\'" with just "'" and all "\\" with just "\"
            line = line.replaceAll("\\\\('|\\\\)", "'$1");
            out.write(line, 0, line.length());
        }
        
        out.close();
        
        long delta = System.currentTimeMillis() - startTime;
        System.out.println("Completed in: " + delta/1000 + "." + delta%1000 + "s");
        
        System.exit(0);
    }
}

If you have the JDK installed on your machine (which you could test by typing "javac" at a command prompt), you could compile this and run it by doing "javac SQLConvert.java" followed by "java SQLConvert".

On my machine (AMD X2 3800+, 2GB of RAM, Raptor 150) this runs in around 118s. I suspect there is little if any advantage given because of my dual core processor.

Right now I believe the main bottleneck is the new String object created every time I do a "replaceAll". I'm trying to figure out how to eke out some more performance from this...if possible. I'll let you all know what I come up with.
 
Mikeblas said:
My C++ program is here. It runs in about 28 seconds. It generates 378,689,587 bytes.

Hmm looking at my output it is 378,790,844 bytes. I wonder what the discrepancy is. Did I miss something?
 
mikeblas said:
Yeah, it might. Right now, you're around 15 seconds on my machine. But your results are not correct because your replacements are sloppy. Essentially, you've ignored the quotes and replaced things within the quotes that shouldn't be replaced -- commas and parens, most notably. An article named "Albert III, Elector of Saxony" becomes "Albert III\t Elector of Saxony", for example.

I think it's a good idea, tho, to try to treat as much of the line all at once.

Oh, and should I expect Java and Python versions from anyone? Any other languages?

Ah, that is quite the bummer - I didn't know Wikipedia allowed commas within the actual page names. Now I can't do the magic global transliteration for commas, at least, since you can't use patterns with tr (which would completely negate the performance increase tr/// affords over s/// for trivial replacements).

It'd be much easier to know some of the more subtle formatting things if I could actually read the file without it blowing up my programs. ;)

Back to the drawing board.

If I eventually succeed in making one in Perl, I might attempt it in python if no one else takes up that path. I'd also be interested in digging out my clisp and sml knowledge and building programs in either of those languages...but that'd be off in the future.
 
BillLeeLee said:
Ah, that is quite the bummer - I didn't know Wikipedia allowed commas within the actual page names. Now I can't do the magic global transliteration for commas, at least, since you can't use patterns with tr (which would completely negate the performance increase tr/// affords over s/// for trivial replacements).

It'd be much easier to know some of the more subtle formatting things if I could actually read the file without it blowing up my programs. ;)

Back to the drawing board.

If I eventually succeed in making one in Perl, I might attempt it in python if no one else takes up that path. I'd also be interested in digging out my clisp and sml knowledge and building programs in either of those languages...but that'd be off in the future.

Bill,

Take a look at my regexes and the order in which I do them. I believe that will get you 99.9% of the way there - I would love to see how fast those regexes go on Perl. I just can't figure out where these extra chars are coming from...:confused:
 
BillLeeLee said:
Ah, that is quite the bummer - I didn't know Wikipedia allowed commas within the actual page names. Now I can't do the magic global transliteration for commas, at least, since you can't use patterns with tr (which would completely negate the performance increase tr/// affords over s/// for trivial replacements).

It'd be much easier to know some of the more subtle formatting things if I could actually read the file without it blowing up my programs. ;)

Back to the drawing board.

If I eventually succeed in making one in Perl, I might attempt it in python if no one else takes up that path. I'd also be interested in digging out my clisp and sml knowledge and building programs in either of those languages...but that'd be off in the future.

holy hell man....this is what i came up with for the comma-in-title case. brace yourself...

Code:
#replace commas in titles with <z>
$records =~ s/(\(\d+,\d+,'[^']+?),(?!')/$1<z>/g;

#replace commas with tabs
$records =~ s/,/\t/g;

#replace <z> with commas again
$records =~ s/<z>/,/g;

2 passes for that particular case.....nasty regex....big capture+rewrite....unsurprisingly, it totally kills performance. i couldnt for the life of me figure out how to just ignore the commas-in-titles flat out. about an extra 30s on my machine with what is essentially your script. i havent given up on it, a better method most certainly needs to be found. but it works....kinda

Code:
fluxion@purity:~/scripts/perl_bakeoff$ head -3 output
1       0       AaA             8       1       0       0.116338664774167       20060401120725  46448774        70
5       0       AlgeriA         0       1       0       0.553221851171201       20060301005610  18063769        41
6       0       AmericanSamoa           0       1       0       0.387834867823715       20060301005610  18063795        48
fluxion@purity:~/scripts/perl_bakeoff$ tail -3 output
5803516 6       CVCOG logo.png          0       0       1       0.723697823784  20060702231120  61757981        68
5803517 0       Phalacrocorax nigrogularis              0       1       1       0.955231507353  20060702231129  61758005        31
5803518 6       Uniontown PA.jpg                0       0       1       0.111196242907  20060702231130  61758009        89

i get a discrepancy on the file output. 374,030,735 bytes instead of the 378,689,587 bytes mikeblas got. so i dont know what's up, everything seems right

here's the full script if you wanna check it against your output mikeblas. replace the \r\n with \n and i think you should get the proper output this time

Code:
#!/usr/bin/perl

use strict;
open(FILE, '>output');

while (my $records = <>) {
  #skip all uninteresting lines
  next unless ($records =~ /INSERT INTO `page` VALUES/o);

  #remove sql info from beginning, remove ); from the end
  $records = substr($records, 27, -3);

  #replace commas in titles with <z>
  $records =~ s/(\(\d+,\d+,'[^']+?),(?!')/$1<z>/g;

  #replace ),( with newlines
  $records =~ s/\),\(/\n/g;

  #remove non-escaped quotes 
  $records =~ s/(?<!\\)'//g;

  #commas with tabs
  $records =~ s/,/\t/g;

  #put embedded commas back
  $records =~ s/<z>/,/g;

  #underscores with spaces
  $records =~ s/_/ /g;

  #remove an escape off any sequence of escapes
  $records =~ s/(?<!\\)(\\+)\\(?!\\)/$1/g;

  print FILE "$records\r\n";
}

close FILE;

that's enough for one night
 
I don't really know why I'm doing this [1], but there's a few specific optimisations I just have to try on my code.
First, using madvise() to mark the mmaped input file for sequential IO. The main effect of this is that when a page is faulted in, the priority of the previous one is reduced. This should allow more memory to be used for prereading.
Secondly, use write() instead of mmap for the output file. Extending an mmaped file and then letting the VM system fill it out can supposedly generate some seriously fragmented files, which isn't good for IO. I might even test async_write, or a separate writer thread.

Now, where to start ...


[1] Given that mikeblas won't be able to test it without installing cygwin or SFU (or another OS), and I have no idea how it would perform on windows, anyway.
 
mikeblas said:
Thanks; I don't see anything about installation here, tho.


Same as any CPAN module. Go to a CPAN prompt and do install SQL::Translator.
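From the shell, that would be something along the lines of:

Code:
perl -MCPAN -e 'install "SQL::Translator"'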
 
MonkeyShave said:
HHunt, mikeblas, and anybody else running benchmarks on this:

How do you handle the effects of file caching? NT and Linux (and probably FreeBSD) will cache files fairly aggressively (depending on configuration), so the order of running the tests could have a profound effect on total runtime.

I think that a cold benchmark (no part of the input file cached) would provide the most interesting results, though that would require a utility that cleared the file cache (or a reboot before running each test). Without such a utility, I'd run each version repeatedly (10 times) discard the high and low results and average the runtimes.

Disk IO will probably dominate the best implementations, so if I were to implement this, that's where I'd begin. A language without access to memory-mapped files or overlapped IO will find itself at a real disadvantage in a test like this.

As of now, I don't. I run it a few times to get an impression of how the run time varies, but it's quite possible that it helps my results.

Regarding memory mapped IO, it is indeed not faster than careful use of read(), but neither should it be much worse. Given the way mmap interacts with the system, a linear read through a file should cause about the same IO either way. (Most of the file reads will probably be prefetching into buffers.)
 
Oy. Regarding varying run times, I tested mikeblas's C++ code against mine, and got some wildly varying run times.
I use a (single) P4 2.8 xeon, 1GB ram, FreeBSD 7-CURRENT. Two ATA100 disks.
Code:
Compiled like this:
xeon# c++ -O2 -march=pentium4 -ansi -Wall StripWiki.cpp -o mike
xeon# gcc -O2 -march=pentium4 -ansi -Wall bench.c -o hhunt
I got these results when run in this order:
hhunt: 0:07.44
mike: 0:29.54
hhunt: 0:27.30
mike: 0:23.28
hhunt: 0:24.36
hhunt: 0:05.42
mike: 0:23.29
mike: 0:19.81

edit :
I have done a few things to my code, and thought it would be interesting to see how it performs with a few repetitions (so as to keep the source data in memory).
Code:
xeon# gcc -O2 -march=pentium4 -ansi -Wall bench.c -o hhunt
xeon# time ./hhunt enwiki-20060702-page.sql /newdisk/tmpfil
2.675u 1.878s 0:12.96 35.0%     5+8206k 14+16io 4849pf+0w
xeon# time ./hhunt enwiki-20060702-page.sql /newdisk/tmpfil
2.428u 0.723s 0:03.66 85.7%     5+8379k 0+18io 0pf+0w
xeon# time ./hhunt enwiki-20060702-page.sql /newdisk/tmpfil
2.437u 0.727s 0:03.67 85.8%     5+8258k 0+13io 0pf+0w
xeon# time ./hhunt enwiki-20060702-page.sql /newdisk/tmpfil
2.486u 0.657s 0:03.66 85.5%     5+8310k 0+13io 0pf+0w
Code:
xeon# c++ -O2 -march=pentium4 -ansi -Wall StripWiki.cpp -o mike
xeon# time ./mike enwiki-20060702-page.sql /newdisk/tmpfil
11.827u 2.203s 0:16.69 84.0%    5+8171k 2+2889io 2pf+0w
xeon# time ./mike enwiki-20060702-page.sql /newdisk/tmpfil
11.682u 2.395s 0:16.94 83.0%    5+8286k 4+2889io 0pf+0w
xeon# time ./mike enwiki-20060702-page.sql /newdisk/tmpfil
11.574u 2.465s 0:16.86 83.2%    5+8372k 7+2889io 0pf+0w
xeon# time ./mike enwiki-20060702-page.sql /newdisk/tmpfil
11.718u 2.330s 0:16.76 83.7%    5+8284k 0+2889io 0pf+0w
The key to the output from time is:
Usermode time, Kernel time , Wall time, CPU% = (user+kernel)/walltime,
Shared text space + avg. unshared data/stack space, Input operations + Output operations, pages brought in from disk + number of times swapped out.
Note how mine suffers a slew of hard page faults on the first run, but none on the later ones (when the entire file is buffered), and does almost no IO.
Mikeblas's does almost no reading (because it's already buffered [1]), a lot of output, and of course practically no page faults.

In mike's defense, I cheat: my output file doesn't touch the physical disk until the syncer feels like it. Thus, my run time includes no disk writes except what's necessary to create/update the output file (which explains the very low number of output operations in the time statistics). Those 3.7 seconds are spent shuffling bytes in memory, while the disk IO has either already happened or will be handled later.

My code : bench2.c.
A version of mikeblas's code that compiles in FreeBSD: StripWiki2.c.
Apologies for reformatting your code, btw. I just prefer reading it in this style. :)
Other than that all I had to do was to remove some unnecessary headers and change _tmain into a more standard main().

[1] This can be verified: Look at the input operations for these two consecutive runs.
Code:
xeon# time ./mike enwiki-20060702-page.sql /newdisk/tmpfil
11.528u 2.809s 0:26.73 53.5%    5+8247k 6425+2889io 6pf+0w
xeon# time ./mike enwiki-20060702-page.sql /newdisk/tmpfil
11.717u 2.320s 0:16.92 82.9%    5+8232k 6+2889io 9pf+0w
 