C++ and Perl bakeoff

generelz said:
If you have the JDK installed on your machine (which you could test by typing "javac" at a command prompt), you could compile this and run it by doing "javac SQLConvert.java" followed by "java SQLConvert".
I don't. How do I install it?

generelz said:
Hmm looking at my output it is 378,790,844 bytes. I wonder what the discrepancy is. Did I miss something?
It's entirely possible I've got a bug, too. If I can run yours, then compare, then I can figure it out. Or, you can compile and run the code I've posted and compare it locally.
 
HHunt said:
[1] Given that mikeblas won't be able to test it without installing cygwin or SFU (or another OS), and I have no idea how it would perform on windows, anyway.
I'm happy to give it a whack if you can provide instructions and answer my questions. (And not just a link to a web page with no comment.) I can probably help with porting the implementation to "real" Windows memory-mapped files.
 
mikeblas said:
I don't. How do I install it?

It's entirely possible I've got a bug, too. If I can run yours, then compare, then I can figure it out. Or, you can compile and run the code I've posted and compare it locally.

Instructions.
To download, get the JDK 5 update 7 from here, or grab the one with netbeans if you also want an IDE.

Basically: Grab installer file, install. Add the right directory to your path.
(Where "the right directory" means the directory that contains java.exe and javac.exe. IIRC this is the "bin" subdirectory of where you installed it.)
 
mikeblas said:
I'm happy to give it a whack if you can provide instructions and answer my questions. (And not just a link to a web page with no comment.) I can probably help with porting the implementation to "real" Windows memory-mapped files.

Like I just did with the java files? ;)
Give me a few minutes and I'll see if SFU is as easy to install as I seem to remember.

edit: Ok, it's been a while since I did this, but I seem to remember that it was fairly easy.
Grab this little package and install it.

When you installed SFU, it made a minimal unix-ish filesystem tree in a folder somewhere. (It probably prompted you for the location, or you can take a look at the properties of the shortcut for starting one of the shells.)
To put the file you want to compile somewhere you'll find it, locate your home folder. It should be in home\yourname within the install folder, so something like c:\sfu\home\mikeblas\ . Copy the C file there.

In the start menu group ("Services for Unix", I think), there are two shells you can start. Either will work, I use csh. When you start it you'll get a rather unix-ish shell. Type "ls" to see if the file is indeed there.

To compile it, type "gcc -O2 bench.c -o bench.exe".

Do note that my current version probably won't compile, since it uses a few slightly nonstandard tricks. I'll strip down a version for you in a moment.
 
HHunt said:
A version of mikeblas's code that compiles in FreeBSD: StripWiki2.c.
StripWiki is a C++ program, not a C program.


HHunt said:
Note how mine suffers a slew of hard page faults the first run, but none on the later (when the entire file is buffered) and does almost no IO.
If your first run caused 4849 page faults, and you were mapping a file that's 412 megabytes in size, how big is a page? Or does a "page fault" count multiple pages? Does your OS prefetch consecutive pages to resolve faults? Sometimes, it would suck if it did; sometimes (like for this app) you're lucky that it does. So does it do so always, or only sometimes?

HHunt said:
it is indeed not faster than careful use of read(), but neither should it be much worse.
I would think that it is.

Say you map this huge file, and you get a pointer back: char* = 0x1000000, and that your page size is 0x1000 bytes.

You read at *char, and you hit 0x1000000. That causes a fault, so you read 4k with one I/O operation. But then you process the other 4095 bytes. While you're doing so, you're preempted. The other program that runs does its own page fault, and moves the disk head someplace else.

You resume, and char* is now 0x1001000. So you touch it and it faults, and read 4K with one I/O. And get suspended again. And the other proc moves the disk head, and so on.

So now, every time you do a disk I/O, you're doing a seek even though you thought it would be linear. The point is that your I/Os are never guaranteed to be truly sequential. Either they're only sequential for one page, or only sequential for the chunk of pages the memory manager prefetches for you. (But you don't want the memory manager to prefetch lots of pages for you; what if your access to the file was actually random? Then, prefetching would be sucky!)

If you read from the file directly, particularly if you can give the OS hints like "don't buffer for me" and "I'm only going sequential", you'll end up doing much better than memory mapping.
 
doh said:
Same as any CPAN module. Go to a CPAN prompt and do install SQL::Translator.
Sorry, but I've never installed a CPAN module; I don't know what a CPAN prompt is.
 
BillLeeLee said:
It'd be much easier to know some of the more subtle formatting things if I could actually read the file without it blowing up my programs.

Sorry; I thought it was clear enough. If you have questions about assertions you'd like to make, I'd be happy to answer them.

This is also the smallest file that I'm involved with at the moment. I've downloaded and used the full content of all the wikipedia text, in all namespaces, with history. The compressed file is about 6 gigs, and the XML file (fools!) it coughs up is 692,686,106,434 bytes. Yeah: almost 700 gigs.

Anyway, why not add a debug mode to your program that stops after processing the first 10000 lines? Or use HEAD to get the first 2 or 3 lines (only) of the input file, so you can load that in your editor and look?

When commercial tools won't do (e.g., for most of the things I've done in my Wikipedia work) then you have to start writing your own. Most of the off-the-shelf ones just aren't going to work. Yours don't need to be commercial grade; the program I'm using to compare output versions is about 20 lines long and I babysit it in the debugger. The DIFF I use normally is hopeless against such large files.

The quoted strings are string literals following the rules that MySQL uses for string literals in its implementation of SQL. So you could have "),(" in a title, too. Anyone can edit wikipedia, so you should assume you're dealing with a list of titles created by about 2 million monkeys with keyboards. Because, essentially, -- well, anyway, I think you should assume you can find any printable character inside one, even Unicode characters as UTF-8. Some characters are escaped per the MySQL string literal syntax rules that I linked to previously.

It looks like there are 237049 records which have a comma in the title, but
Code:
select * from pages where page_title like '%),(%'
returns no rows. (Until I go create "User:Mikeblas/fluxion),(busted" and put an end to that shortcut!)

Here are a few interesting records:

778499: very low page_random value (7.254072E-06); it is represented as scientific notation in the dump file, so under some threshold, you'll see "E" and "-" in that field.

2334064: page_title is the Roman numeral Ⅰ in Unicode, which is not the letter I.

2334177: page_title is the Roman numeral Ⅱ in Unicode, which is not the two letters II and is just one Unicode character.

1387443: page_title is a backtick only, so I expect you get '`' in the input file.

2790108: page_title is the hiragana character の, Unicode U+306E.

720054: page_title is backwhack only, so you see '\\' in the dump file

236967: page_title is an apostrophe only, so you see '\'' in the dump file

5232291: page_title is "", so you see '\"\"' in the dump file
 
I use editpad lite and it could handle the file quite nicely (I was wondering why it was so short until I noticed how long my horizontal slider was.)
I'm too lazy to write a parser though. ;o)
 
mikeblas said:
StripWiki is a C++ program, not a C program.
Gcc doesn't seem to care either way, and as far as I can see it's actually legal C. Still, I did mean .cpp until I momentarily stopped thinking.


If your first run caused 4849 page faults, and you were mapping a file that's 412 megabytes in size, how big is a page? Or does a "page fault" count multiple pages? Does your OS prefetch consecutive pages to resolve faults? Sometimes, it would suck if it did; sometimes (like for this app) you're lucky that it does. So does it do so always, or only sometimes?
The page size is 4K, so the entire thing occupies 105472 pages. When I started that run, a bit of the file was still in memory.
What happens, I think, is that when what time calls a "hard page fault" (requiring a page to be read from disk) happens, it triggers some prefetching; I'm not sure if it returns before it's done with that. Whether you want to count that as one or several page faults is up to you. :)

I would think that it is.

Say you map this huge file, and you get a pointer back: char* = 0x1000000, and that your page size is 0x1000 bytes.

You read at *char, and you hit 0x1000000. That causes a fault, so you read 4k with one I/O operation. But then you process the other 4095 bytes. While you're doing so, you're preempted. The other program that runs does its own page fault, and moves the disk head someplace else.

You resume, and char* is now 0x1001000. So you touch it and it faults, and read 4K with one I/O. And get suspended again. And the other proc moves the disk head, and so on.

So now, every time you do a disk I/O, you're doing a seek even though you thought it would be linear. The point is that your I/Os are never guaranteed to be truly sequential. Either they're only sequential for one page, or only sequential for the chunk of pages the memory manager prefetches for you. (But you don't want the memory manager to prefetch lots of pages for you; what if your access to the file was actually random? Then, prefetching would be sucky!)

If you read from the file directly, particularly if you can give the OS hints like "don't buffer for me" and "I'm only going sequential", you'll end up doing much better than memory mapping.

Ah, but you can give hints like that when using mmap. That's what madvise is for.
As for your example, wouldn't a very similar thing happen with normal read()s? Read a chunk, operate on it, get preempted, read another chunk (and have to seek), etc.

Both methods do end up reading a chunk, processing it, reading another, etc. The difference, as I see it, is that mmap has a chance of doing some of the IO in the background (because the prefetching might happen while you're working), but with the slightly higher cost of a page fault instead of a read whenever something isn't ready.
 
generelz said:
Hmm looking at my output it is 378,790,844 bytes. I wonder what the discrepancy is. Did I miss something?
You're not replacing \' with '. When I run your program and look at the output, I expect to see this on line 404:

Code:
865     0       AmeriKKKa's Most Wanted         294     0       0       0.734338696798625       20060629054538  61149644        9492

and I actually get
Code:
865     0       AmeriKKKa\'s Most Wanted                294     0       0       0.734338696798625   20060629054538      61149644	9492
 
Lord of Shadows said:
I use editpad lite and it could handle the file quite nicely (I was wondering why it was so short until I noticed how long my horizontal slider was.)
I'm too lazy to write a parser though. ;o)
I've tried that one in the past. Is there a way to make the UI look less playskool? Maybe I'll give it another whack ...
 
By the way, I checked if StripWiki compiled as a C program.
After including <stdbool.h> and moving two declarations outside a for statement, it did.

(What happened is that I have a local StripWiki.cpp, and thought I'd rename it StripWiki2.cpp before uploading, so it wouldn't clash with your original. I then forgot two characters.)
 
HHunt said:
I'm not sure if it returns before it's done with that. Whether you want to count that as one or several page faults is up to you.

Certainly, the documentation for the time tool tells us whether it's counting page fault operations or faulted pages in total.

Say it returns before it is done -- what if it doesn't finish successfully? Then, it has to find the error handling code in the device driver (which probably got paged out), load and execute that, then return the error to the application (which might also fault some code back in response) and so on. It's doable, but it's hard to test and quite a house of cards to handle page faults asynchronously.

HHunt said:
Gcc doesn't seem to care either way, and as far as I can see it's actually legal C.
A couple of the things you changed (local initializers) were because it was legal C++ but not legal C.

HHunt said:
As for your example, wouldn't a very similar thing happen with normal read()s? Read a chunk, operate on it, get preempted, read another chunk (and have to seek), etc.
Sure. But I'd read more than 4096 bytes at a time -- probably 128K bytes, since that's a likely RAID0 stripe size. If I know I'm running on big, fancy disk systems, I'll read megabytes at a time.
 
mikeblas said:
Certainly, the documentation for the time tool tells us whether it's counting page fault operations or faulted pages in total.
It's a shell builtin, and all the documentation says is "%F The number of major page faults (page needed to be brought from disk)."
I'm not sure where I got the "hard" from, since it's obviously "major". Outside that, I still don't know if this counts pages brought in as a side effect of handling a page fault or just the number of pages that triggered a fault. Also, if there's any form of prefetching that's somewhat independent of the page fault handler, I doubt it's counted.

Say it returns before it is done -- what if it doesn't finish successfully? Then, it has to find the error handling code in the device driver (which probably got paged out), load and execute that, then return the error to the application (which might also fault some code back in response) and so on. It's doable, but it's hard to test and quite a house of cards to handle page faults asynchronously.
I was thinking more of "handle the page fault, notify some system that it would be nice if a few more pages after the current one are brought in, return", so it would be two separate calls.
I'm only speculating at this point, so I'll almost have to go dig in the kernel code (or on the 'net) to get an idea of what actually happens before I continue.

A couple of the things you changed (local initializers) were because it was legal C++ but not legal C.
Yeh, I noticed. The above post explains matters.

Sure. But I'd read more than 4096 bytes at a time -- probably 128K bytes, since that's a likely RAID0 stripe size. If I know I'm running on big, fancy disk systems, I'll read megabytes at a time.

True, that helps.
 
HHunt said:
(Where "the right directory" means the directory that contains java.exe and javac.exe. IIRC this is the "bin" subdirectory of where you installed it.)
Which one do I want?

LATER: Oh, I see -- JAVA.EXE and JAVAC.EXE.
 
Trust me on this: You had more than one JDK installed before getting the most current one.

I think C:\Program Files\Java\jre1.5.0_07\bin looks about right.
edit: Ah, yes. That's somewhat important.
(Java alone is enough to run a compiled classfile, but you need javac to compile anything.)
 
I might've had other JREs, but certainly no JDKs. I can compile (I guess), but I can't execute:

Code:
C:\>javac SQLConvert.java

C:\>java SQLConvert
Exception in thread "main" java.lang.NoClassDefFoundError: SQLConvert
 
mikeblas said:
I might've had other JREs, but certainly no JDKs. I can compile (I guess), but I can't execute:

Code:
C:\>javac SQLConvert.java

C:\>java SQLConvert
Exception in thread "main" java.lang.NoClassDefFoundError: SQLConvert

Mike,

Try "java -cp . SQLConvert". That will tell the java runtime to include the current directory in your classpath (where the classloader looks to load class files from).
 
generelz said:
Mike,

Try "java -cp . SQLConvert". That will tell the java runtime to include the current directory in your classpath (where the classloader looks to load class files from).

Code:
C:\>java -cp . SQLConvert
Exception in thread "main" java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 77
\((\d+),(\d+),'(.*?)','(.*?)',(\d+),(\d+),(\d+),(0.\d+),'(\d+)',(\d+),(\d+)\  ),
                                                                             ^
        at java.util.regex.Pattern.error(Unknown Source)
        at java.util.regex.Pattern.compile(Unknown Source)
        at java.util.regex.Pattern.<init>(Unknown Source)
        at java.util.regex.Pattern.compile(Unknown Source)
        at java.lang.String.replaceAll(Unknown Source)
        at SQLConvert.main(SQLConvert.java:40)

hmmm ... looks like that was because of the spurious spaces the forum is always inserting. After correcting the code, it actually runs with the -cp command line.
 
mikeblas said:
Code:
C:\>java -cp . SQLConvert
Exception in thread "main" java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 77
\((\d+),(\d+),'(.*?)','(.*?)',(\d+),(\d+),(\d+),(0.\d+),'(\d+)',(\d+),(\d+)\  ),
                                                                             ^
        at java.util.regex.Pattern.error(Unknown Source)
        at java.util.regex.Pattern.compile(Unknown Source)
        at java.util.regex.Pattern.<init>(Unknown Source)
        at java.util.regex.Pattern.compile(Unknown Source)
        at java.lang.String.replaceAll(Unknown Source)
        at SQLConvert.main(SQLConvert.java:40)

Yeah, oops, how did that get in there? Try this updated code (also lets you specify the files on the cmd line):

Code:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

/**
 * SQLConvert.java: Submission for the C++/Perl bakeoff.
 * 
 * The objective is to read in a MySQL dump file containing table information
 * and convert the INSERT statements to a tab delimited format for import by
 * MS SQL Server.
 * 
 * This implementation relies on Java's regular expression utilities.
 * 
 * @author Zach Bailey
 * @created 21 July 2006
 */
public class SQLConvert 
{
    public static void main(String[] args) throws Exception
    {
        if(args.length < 2)
        {
            System.out.println("Usage: SQLConvert [infile] [outfile]");
            System.exit(1);
        }
        long startTime = System.currentTimeMillis();
        
        //set up input/output files
        File file = new File(args[0]);
        BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(file)), 6*1024*1024);
        
        File outputFile = new File(args[1]);
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile)))  ;
        
        while(in.ready())
        {
            String line = in.readLine();
            if(line.length() < 10 || !line.startsWith("IN")) continue;
            line = line.replaceFirst("INSERT INTO `page` VALUES ", "");
            line = line.replaceFirst("\\);$", "),");
            line = line.replaceAll("\\((\\d+),(\\d+),'(.*?)','(.*?)',(\\d+),(\\d+),(\\d+),(0.\\d+),'(\\d+)',(\\d+),(\\d+)\\),","$1\t$2\t$3\t$4\t$5\t$6\t$7\t$8  \t$9\t$10\t$11\r\n");
            //replace all "\'" with just "'" and all "\\" with just "\"
            line = line.replaceAll("\\\\('|\\\\)", "$1");
            out.write(line, 0, line.length());
        }
        
        out.close();
        
        long delta = System.currentTimeMillis() - startTime;
        System.out.println("Completed in: " + delta/1000 + "." + delta%1000 + "s");
        
        System.exit(0);
    }
}

OK it seems for some reason the forum software is putting those extra spaces in there - I can't figure out how to take them out. :confused:

Here we go:

http://geminesis.net/code/SQLConvert.java
 
I've been writing and testing a mini-benchmark for sequential IO with read() or mmap, and I think I've got some results here.
There's two code paths chosen with a command line parameter that both sum all bytes of the file.

For read() the main code looks like this:
Code:
char *buf = malloc(sizeof(char) * BSIZE);
fd = open(argv[1], O_RDONLY);
while ((read_bytes = read(fd, buf, BSIZE)) > 0) {
	total_read += read_bytes;
	for (i=0; i<read_bytes; i++) sum += buf[i];
}
printf("Read %i bytes with read(). Sum: %lli\n", total_read, sum);

The mmap code looks like this:
Code:
fd = open(argv[1], O_RDONLY);
stat(argv[1], &fd_stat);
infile = (char*) mmap(0, fd_stat.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
madvise(infile, fd_stat.st_size, MADV_SEQUENTIAL);
		
for(i=0; i<fd_stat.st_size; i++) {
	sum += infile[i];
}
printf("Read %i bytes with mmap. Sum: %lli\n", i, sum);

I've got four identical 400MB files that I use in turn, to reduce the cache effects.
With a block size of 8KB (which seems to be ideal), a typical run with read() looks like this:
Code:
xeon# time ./rtest bigfile n ; time ./rtest bigfile2 n ; time ./rtest bigfile3 n ; time ./rtest bigfile4 n
Read 419430400 bytes with read(). Sum: 5026263243
2.641u 0.780s [b]0:10.61[/b] 32.2%     5+7973k 3216+0io 0pf+0w
Read 419430400 bytes with read(). Sum: 5026263243
2.593u 0.855s [b]0:10.62[/b] 32.3%     5+7787k 3216+0io 0pf+0w
Read 419430400 bytes with read(). Sum: 5026263243
2.502u 0.949s [b]0:10.46[/b] 32.8%     5+8535k 3216+0io 0pf+0w
Read 419430400 bytes with read(). Sum: 5026263243
2.478u 0.979s [b]0:10.51[/b] 32.7%     6+8334k 3216+0io 0pf+0w
Fair enough.
Also note that the IO numbers seem to suggest a 128KB transaction size (400M/3216 ≈ 128KB) in an underlying system.

The following mmap run was done immediately afterwards:
Code:
xeon# time ./rtest bigfile mmap ; time ./rtest bigfile2 mmap ; time ./rtest bigfile3 mmap ; time ./rtest bigfile4 mmap
Read 419430400 bytes with mmap. Sum: 5026263243
2.791u 0.765s [b]0:11.13[/b] 31.8%     5+176k 14+0io 6401pf+0w
Read 419430400 bytes with mmap. Sum: 5026263243
2.748u 0.769s [b]0:10.59[/b] 33.0%     5+219k 11+0io 5880pf+0w
Read 419430400 bytes with mmap. Sum: 5026263243
2.749u 0.302s [b]0:03.55[/b] 85.6%     5+318k 0+0io 0pf+0w
Read 419430400 bytes with mmap. Sum: 5026263243
2.851u 0.197s [b]0:03.56[/b] 85.3%     5+296k 0+0io 0pf+0w

I wave madvise with MADV_SEQUENTIAL over the mapped memory, which means that when a page is accessed, the ones before it are marked as unimportant and are likely to get paged out fairly fast. This seems to reduce the amount of buffers used, since it's apparent that the buffering from the normal run has survived even though I've scanned through 800M of data before I got to bigfile3 again.

If I do two normal runs after each other, all four reads in the second run are roughly identical, at about 11s.
If I do two mmap-runs after each other, they get close to identical results.

edit: I might as well put in the best cases, as well. The same file, four times:
Code:
xeon# time ./rtest bigfile n ; time ./rtest bigfile n ; time ./rtest bigfile n ; time ./rtest bigfile n
Read 419430400 bytes with read(). Sum: 5026263243
2.547u 0.912s [b]0:10.51[/b] 32.8%     5+8545k 3201+0io 0pf+0w
Read 419430400 bytes with read(). Sum: 5026263243
2.466u 0.558s [b]0:03.51[/b] 85.7%     5+8284k 0+0io 0pf+0w
Read 419430400 bytes with read(). Sum: 5026263243
2.510u 0.516s [b]0:03.50[/b] 86.2%     5+8118k 0+0io 0pf+0w
Read 419430400 bytes with read(). Sum: 5026263243
2.457u 0.566s [b]0:03.47[/b] 86.7%     5+8145k 0+0io 0pf+0w
Code:
xeon# time ./rtest bigfile mmap ; time ./rtest bigfile mmap ; time ./rtest bigfile mmap ; time ./rtest bigfile mmap
Read 419430400 bytes with mmap. Sum: 5026263243
2.809u 0.238s [b]0:03.56[/b] 85.1%     5+337k 0+0io 0pf+0w
Read 419430400 bytes with mmap. Sum: 5026263243
2.784u 0.262s [b]0:03.53[/b] 86.1%     5+316k 0+0io 0pf+0w
Read 419430400 bytes with mmap. Sum: 5026263243
2.828u 0.216s [b]0:03.53[/b] 85.8%     5+312k 0+0io 0pf+0w
Read 419430400 bytes with mmap. Sum: 5026263243
2.803u 0.240s [b]0:03.54[/b] 85.8%     5+296k 0+0io 0pf+0w

So, what to conclude?
Buffer state being equal, read() is slightly faster, about 5% in these cases.
Read() will pull everything that passes through it into the buffers, while mmap with the settings I use is close to the opposite. (I'll try removing MADV_SEQUENTIAL and see if it acts more like read.)

Which one is best depends very much on the cache usage patterns. In this specific and carefully constructed case, the light touch of mmap yields faster average results.

Source here.
 
Yeah, the software is really bad about injecting spaces. I think the automatic smiley code is firing, even if you have "disable smilies" marked, and replacing what might've been a smiley with the text that might've been a smiley and a space.

Anyway:

Code:
C:\>java -cp . SQLConvert
Usage: SQLConvert [infile] [outfile]

C:\>java -cp . SQLConvert f:\links\enwiki-20060702-page.sql java_output.txt
Completed in: 133.312s

As you wondered a couple of posts ago, I think the problem is immutable strings. C# has the same issue, and I have to use a StringBuilder class to get around it. It's hard to think "that way" sometimes.

Anyway, your output isn't correct; you need to replace "_" with " " in the strings. Also, you've got Unicode problems:

Code:
1478    0       &#9500;ülfheim                258     0       0       0.827191366653136       20060701130138  61536387        12399

1478    0       &#9500;?lfheim                258     0       0       0.827191366653136       20060701130138  61536387        12399

The page name here is "Álfheim", and "Á" is 0x00C1. In UTF-8, that's represented as 0xC3 0x81, but you change it to 0xC3 0x3F.

You also fail to handle \" into ":

Code:
Line 2694
3375    0       "Love and Theft"                97      0       0       0.0622300403772202      20060629143033  61200299        2285
1

3375    0       \"Love_and_Theft\"              97      0       0       0.0622300403772202      20060629143033  61200299        2285
1

In full-blown Unicode, we'd use
Code:
Álfheim	0xC1006C0066006800650069006D00

but that would make the file twice the size (with all the 00's) instead of about 3% larger (with escapes into lead bytes only when we need them).
 
I added some logging to my program. Here is the output:

Code:
It took 190840456ns to read.
It took 7543ns to read.
It took 10615ns to read.
It took 59225ns to read.
It took 12850ns to read.
It took 5867ns to read.
It took 6984ns to read.
It took 13968ns to read.
It took 6425ns to read.
It took 5867ns to read.
It took 9777ns to read.
It took 8661ns to read.
It took 13130ns to read.
It took 12571ns to read.
It took 13130ns to read.
It took 11175ns to read.
It took 13969ns to read.
It took 14248ns to read.
It took 13688ns to read.
It took 17600ns to read.
It took 13410ns to read.
It took 13131ns to read.
It took 12851ns to read.
It took 9498ns to read.
It took 13689ns to read.
It took 10616ns to read.
It took 10058ns to read.
It took 7822ns to read.
It took 9498ns to read.
It took 12013ns to read.
It took 5867ns to read.
It took 5867ns to read.
It took 10337ns to read.
It took 5866ns to read.
It took 5587ns to read.
It took 5587ns to read.
It took 9220ns to read.


It took 27700169ns to read.
It took 362396617ns to transform.
It took 38302913ns to write.

It took 12037563ns to read.
It took 308716611ns to transform.
It took 29135267ns to write.

It took 10544636ns to read.
It took 298030895ns to transform.
It took 29268245ns to write.

It took 10538490ns to read.
It took 302467213ns to transform.
It took 28946417ns to write.

It took 10490719ns to read.
It took 295799327ns to transform.
It took 29502633ns to write.

It took 10513627ns to read.
It took 304095633ns to transform.
It took 31726658ns to write.

It took 197092927ns to read.
It took 302789041ns to transform.
It took 28694988ns to write.

It took 10434287ns to read.
It took 298847200ns to transform.
It took 28486022ns to write.

It took 10437919ns to read.
It took 308301753ns to transform.
It took 28447470ns to write.

It took 10446579ns to read.
It took 306460737ns to transform.
It took 28830759ns to write.
Completed in: 3.921s

In this version of my program it is only doing the first 10 INSERT lines. As we can see, the transformation step (regular expressions) is taking an order of magnitude more time than either the read or the write, around three tenths of a second.
 
I just tried removing the madvise from my mmap code, and it acts more or less the same. Strange.
edit: Tagging all of it with MADV_FREE ("this memory can and should be used for something better") didn't have much of an effect either, so it seems like something about the combination of mmap and madvise has some odd effects.
 
mikeblas said:
As you wondered a couple of posts ago, I think the problem is immutable strings. C# has the same issue, and I have to use a StringBuilder class to get around it. It's hard to think "that way" sometimes.

Yeah amazingly enough it seems that when I use a StringBuilder, it does not help at all. It must be something with the regex engine itself.

mikeblas said:
Anyway, your output isn't correct; you need to replace "_" with " " in the strings.

Ok, thanks for the heads up. I've fixed that.

mikeblas said:
Also, you've got Unicode problems:

Code:
1478    0       &#9500;ülfheim                258     0       0       0.827191366653136       20060701130138  61536387        12399

1478    0       &#9500;?lfheim                258     0       0       0.827191366653136       20060701130138  61536387        12399

The page name here is "Álfheim", and "Á" is 0x00C1. In UTF-8, that's represented as 0xC3 0x81, but you change it to 0xC3 0x3F.

Strange. I will have to look into this a little. Is there some conversion that needs to be done? I figured it was UTF-8 in, UTF-8 out, correct?

mikeblas said:
You also fail to handle \" into "

OK, fixed...

Kinda bummed I can't seem to figure out how to speed this up more :(
 
mikeblas said:
Yeah, the software is really bad about injecting spaces. I think the automatic smiley code is firing, even if you have "disable smilies" marked, and replacing what might've been a smiley with the text that might've been a smiley and a space.
Nope... Vbulletin has a "word size limit" to prevent someone entering "AAAAGG....HHH" and breaking the tables. It automatically inserts a space after an admin-defined number of consecutive characters. We may be able to request something larger (150? 200?); it won't break the forum, and it will allow long code strings to work.
 
Ahh, never mind my last submission, mikeblas. I made more modifications, ran your C++ code, and diffed the output files, and still found a bunch of mismatches. The times do seem comparable, though (considering the 30 seconds that gigantic regex added, which I'll be redoing since it doesn't even work on all cases). I think the final Perl version will be pretty efficient.
 
generelz said:
Strange. I will have to look into this a little. Is there some conversion that needs to be done? I figured it was UTF-8 in, UTF-8 out, correct?
I couldn't tell you about Java. I'd expect the stream reader and stream writer objects to be told what encoding to use.

Is there a way to have the regex be compiled? I'd imagine you're re-parsing it every time you use it.

Tawnos said:
Nope... Vbulletin has a "word size limit" to prevent someone entering "AAAAGG....HHH" and breaking the tables. It automatically spaces after an admin-defined number of sequential characters. We may be able to request something larger (150? 200?), it won't break the forum, but will allow long code strings to function.
Why would that apply to CODE-tagged blocks, which get their own internal scroll bar?
 
mikeblas said:
I couldn't tell you about Java. I'd expect the stream reader and stream writer objects to be told what encoding to use.

You are correct, I have now specified those.

mikeblas said:
Is there a way to have the regex be compiled? I'd imagine you're re-parsing it every time you use it.

Yeah, I am trying that right now, but it seems that is not the main time waster. The major expense seems to come from the way the regex engine iterates over a String during replaceAll: it does not modify the string in place; instead it copies everything into a new buffer every time.
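One general way to attack that copying (sketched here by me, not generelz's actual code) is to fold the separate replaceAll() passes into a single alternation and one manual appendReplacement() loop, so each line is copied into a fresh buffer once rather than once per pattern. The \" and _ rewrites discussed earlier in the thread serve as the example patterns:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OnePassReplace {
    // One alternation covers both rewrites, so a line is scanned and
    // copied once instead of once per replaceAll() call.
    private static final Pattern REWRITES = Pattern.compile("\\\\\"|_");

    static String rewrite(String line) {
        Matcher m = REWRITES.matcher(line);
        StringBuffer sb = new StringBuffer(line.length());
        while (m.find()) {
            // \" becomes a plain quote; _ becomes a space.
            String rep = m.group().equals("\\\"") ? "\"" : " ";
            m.appendReplacement(sb, Matcher.quoteReplacement(rep));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

The allocation per line is still there, but it no longer scales with the number of patterns.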
 
mikeblas said:
Anyway, your output isn't correct; you need to replace "_" with " " in the strings.

Huh, I forgot that, too. New version here.
edit: Aah, good. According to diff, your code and mine now produce identical files.
(This was the first time I checked with diff, or I'd have caught it earlier.)
 
mikeblas said:
Why would that apply to CODE-tagged blocks, which get their own internal scroll bar?
I believe it has to do with the way vBulletin handles the tags. Code tags can be enabled or disabled, so if text inside the blocks were exempt from the length limits that apply to the rest of the forum, a user could wrap forum-breaking text in code tags on any forum that has code blocks disabled.

I could take a look through the vBulletin code and see if there's a way to fix this, with hooks or hacks, but I'd rather not ;). Instead, I may (depending on how long house cleaning takes ...) try a Perl program that takes a different approach than the current regexp attempts. It may or may not be faster; we'll see if I ever get it done :).
 
For those who are interested I posted updated code here:

http://geminesis.net/code/SQLConvert.java

The program now takes three arguments (infile, outfile, debug), where debug is true/false.

If debug is true you will get information about how long everything takes and it will only convert the first 10 INSERT lines. If debug is false you will only get information about how long the entire thing took.

It seems that now this code is taking even longer to run, at least on my machine. Perhaps it is because I added the fixes for the underscore and \" escape.

Later: OK I updated the code once again. I am no longer using a regex to remove the insert statement or the semicolon at the end and that seems to have gotten the times back down under 2m.
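The non-regex stripping can be as cheap as indexOf plus substring. A hedged sketch (my own helper, assuming one INSERT statement per line ending in ");", which may not match the real dump exactly):

```java
public class StripInsert {
    // Drop the "INSERT INTO ... VALUES (" prefix and the trailing ");"
    // with plain string operations instead of a regex. Lines that
    // don't look like an INSERT are passed through untouched.
    static String strip(String line) {
        int open = line.indexOf('(');
        if (open < 0 || !line.endsWith(");")) {
            return line;
        }
        return line.substring(open + 1, line.length() - 2);
    }
}
```

Because the first '(' on an INSERT line belongs to the VALUES list, a single indexOf is enough; no backtracking, no per-line pattern matching.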
 
Hey mikeblas, is there a deadline for the final submission? If we could let this run for about a week before the final timings are posted, I think that'll give everyone a chance to work out all the kinks and have some pretty decent submissions for each language. I won't have much time to work on it for the next few days.
 
Whatsisname said:
why are you compiling and running as root?

It's a practically single-user, seldom-used toy installation, and for the things I was doing it was convenient.
 
Side note: if you guys are interested, I can give you FTP access to my webserver. It may make code exchanges easier. It's on dreamhosters, which is 'ok' for FTP.
 
I made a change to my C++ code that makes it run in about 17 seconds. The change was pretty simple: I'd been keeping track of the lengths of the fields, but also storing a '\0' at the end of each field. I now store a tab instead, and increase the stored length by one. The store I was doing is no longer useless, and I can eliminate 10 single-character fwrite() calls per record written.

I read 412 million bytes and write 378 million, so that's 790 million bytes. If I run in 17 seconds, I'm processing around 46 megs/second.
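The same trick is language-neutral. A Java sketch of the idea (mine, not the actual C++): keep the tab inside each field's buffer and count it in the stored length, so every field costs one write call and the per-record single-byte separator writes go away.

```java
import java.io.IOException;
import java.io.OutputStream;

public class FieldWriter {
    // Each field buffer already ends with its tab separator, and the
    // stored length counts that tab, so writing a record is one
    // write() per field with no extra single-byte writes for the
    // delimiter.
    static void writeRecord(OutputStream out, byte[][] fields, int[] lengths)
            throws IOException {
        for (int i = 0; i < fields.length; i++) {
            out.write(fields[i], 0, lengths[i]);
        }
    }
}
```

With 10 fields per record, that halves the number of write calls in the inner loop.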
 
Lord of Shadows said:
I use EditPad Lite and it could handle the file quite nicely (I was wondering why it was so short until I noticed how long my horizontal slider was.)
I'm too lazy to write a parser though. ;o)
I figured out why I don't like this one. I think I tried an older version, and this one has a less garish UI. But what happens when loading a large file is a real deal breaker:

1) Open large file with lots of lines (like the output files we're generating here)
2) As the file loads, the screen repaints and flickers.
3) If you press [END] to get to the end of the file, you go to the end of what was loaded at the moment.
4) Pressing [END] gets you to the new end of the file.

3+4 seems OK, but there's no indication of progress: has the file been completely loaded? How much more is there? Other than, of course, that it stops flickering.

Changing the text encoding is a really nice thing to be able to do... but the large file support still isn't quite there.
 