Would anyone like to have a bakeoff to see if C++ or Perl is better at string-intensive processing? It would be fun, something like a programming contest (without prizes). A benchmark-athon.
I like to download the Wikipedia content dumps and play with them. There are lots of interesting things to do with them. Of course, they're all formatted for MySQL, and I don't run MySQL. So I need to convert some of the scripts for SQL Server.
One file is a bunch of INSERT statements (using some MySQL-specific syntax, of course) that load a single table full of information about every article. I've written a program to rip through that file and make a CSV out of it. Then I can use another tool to load that CSV file into SQL Server very rapidly.
There's lots of parsing going on:
INSERT (a1,'b1','c1'),(a2,'b2','c2'),(a3,'b3','c3')
becomes
a1,"b1","c1"
a2,"b2","c2"
a3,"b3","c3"
and so on. There are escapes, too: inside the strings, a single tick is escaped with a backslash, and a backslash is itself escaped with a backslash. (So \' becomes ', and \\ becomes \.)
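To make the rules above concrete, here is a minimal sketch of the conversion as a small state machine. It is written in Python purely for illustration and is not the C# program mentioned in this post. The function name insert_to_csv is made up, and it assumes that the first "(" in the statement opens the first tuple, that only the \' and \\ escapes occur (any other backslash sequence just passes the next character through), and that a double quote inside a string should be doubled in the CSV output, which the post does not specify.

```python
def insert_to_csv(statement):
    """Turn one INSERT statement into a list of CSV lines, one per (...) tuple."""
    lines = []         # finished CSV lines
    fields = None      # fields of the current tuple; None while outside a tuple
    buf = []           # characters of the field being read
    quoted = False     # True if the current field came from a '...' string
    in_string = False  # True while inside a '...' string
    i, n = 0, len(statement)
    while i < n:
        ch = statement[i]
        if in_string:
            if ch == "\\" and i + 1 < n:
                # Backslash escape: keep the next character literally,
                # so \' becomes ' and \\ becomes \.
                buf.append(statement[i + 1])
                i += 2
                continue
            if ch == "'":
                in_string = False
            else:
                buf.append(ch)
        elif fields is None:
            # Outside any tuple: skip everything until a tuple opens.
            if ch == "(":
                fields, buf, quoted = [], [], False
        elif ch == "'":
            in_string = quoted = True
        elif ch in ",)":
            # End of a field; a ")" also ends the tuple.
            text = "".join(buf)
            if quoted:
                text = '"' + text.replace('"', '""') + '"'
            fields.append(text)
            buf, quoted = [], False
            if ch == ")":
                lines.append(",".join(fields))
                fields = None
        else:
            buf.append(ch)
        i += 1
    return lines


if __name__ == "__main__":
    sample = "INSERT (a1,'b1','c1'),(a2,'b2','c2'),(a3,'b3','c3')"
    for line in insert_to_csv(sample):
        print(line)
```

Walking the input one character at a time keeps commas and parentheses inside strings from being mistaken for separators, which a plain split on "," or ")," would get wrong.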
There are also hex escapes, it seems: \xe0 becomes whatever single character 0xE0 is. I haven't figured out whether the same form is also used for wider Unicode characters, like \x2240 becoming the single Unicode character 0x2240.
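As a rough sketch of the unescaping step on its own, the following Python function handles the three forms described so far. It assumes a hex escape is always exactly two hex digits and maps straight to the code point with that value. Both are guesses, since the wider form is still an open question here. The name unescape is made up for this example.

```python
import re

# A backslash followed by either "x" plus two hex digits, or any one character.
_ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{2}|.)", re.DOTALL)


def unescape(text):
    """Decode \\', \\\\ and two-digit \\xNN escapes in the body of a string."""
    def replace(match):
        body = match.group(1)
        if len(body) == 3:
            # "x" plus two hex digits: assumed to be a single code point.
            return chr(int(body[1:], 16))
        # Any other escaped character stands for itself.
        return body
    return _ESCAPE.sub(replace, text)


if __name__ == "__main__":
    print(unescape(r"caf\xe9 isn\'t a\\b"))
```

Scanning left to right in a single pass matters: in the input \\xe0 the first backslash escapes the second, so the result is a literal backslash followed by xe0 rather than a decoded character.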
I think I've got a very fast solution in C#, and I want to write a C++ version for comparison. Who can come up with a Perl version? String processing should be Perl's home court, right?