Everyone seems to have a strong opinion on the recent Conroe gameplay evaluation/review and the ensuing shitstorm that is still taking place. I thought it would be interesting to do a formal analysis of what differences actually exist between "canned" and "real-world" performance testing (largely in the context of the FS-HOCP debacle).
In a rather unusual turn of events, I'm trying to keep out my usual biting sarcasm and subtle jabs. They're totally unprofessional and immediately detract from the credibility of an article as we have learnt. (oops, that won't happen again)
[u][b]Methodology[/b][/u]
I have taken tests run by HardOCP in their Intel Core 2 Gaming Performance article and compared them to "canned" benchmarks run on settings and hardware as similar as possible (single video card, RAM, resolution, game settings, etc.). I have specifically looked for performance differentials when the CPU is varied, as this was the focus of the HardOCP article, and certainly the most important factor when upgrading a CPU. Currently there are just Oblivion and HL2 benchmark comparisons here. I would like to do more comparisons, so if people can point me to suitable candidates that would be great. Since no two sites ran the same suite of games, it will probably be necessary to pull the tests from multiple sites.
[u][b]Oblivion[/b][/u]
HardOCP Results @ 1280*1024
6800 - MIN: 27, MAX: 66, AVG: 39.8
6700 - MIN: 26, MAX: 65, AVG: 39.5
FX-62 - MIN: 26, MAX: 63, AVG: 39.6
Shadows had to be set to a lower quality on the FX. Otherwise the performance and quality differences were reportedly negligible.
"Canned" Results @ 1600*1200, courtesy of Firingsquad
[url]http://firingsquad.com/hardware/intel_core_2_performance/images/city1600.gif[/url] [url]http://firingsquad.com/hardware/intel_core_2_performance/images/obl1600.gif[/url]
Firingsquad lacks minimum and maximum figures, as well as a histogram. The results are the same though: the performance difference between CPUs is [b]essentially non-existent[/b].
[u][b]Half-Life 2[/b][/u]
HardOCP tested Episode One, while Firingsquad tested Lost Coast. If someone can point me to a "canned" Episode One test I will gladly include it in this section. I believe Lost Coast and Episode One share essentially the same engine (i.e. HL2 + HDR, etc.), so this is a pretty reasonable comparison. In any case, HardOCP and Firingsquad agree in their results.
[url=http://enthusiast.hardocp.com/article.html?art=MTEwOCw1LCxobmV3cw==]HardOCP Results[/url] @ 1600*1200
6800 - MIN: 34, MAX: 96, AVG: [b]59.9[/b]
6700 - MIN: 32, MAX: 96, AVG: [b]59.9[/b]
FX-62 - MIN: 34, MAX: 95, AVG: [b]59.9[/b]
HardOCP stated, [i]"There is absolutely [b]no way in-game to tell any difference[/b] between the CPUs [...]"[/i].
[url=http://firingsquad.com/hardware/intel_core_2_performance/page13.asp]"Canned" Results[/url] @ 1600*1200, courtesy of Firingsquad
[url]http://firingsquad.com/hardware/intel_core_2_performance/images/lc1600.gif[/url]
Again, Firingsquad lacks some very useful information like max and min fps, but the results are the same: There is no performance benefit to the Core 2 CPUs.
[u][b]Summary[/b][/u]
Although this is a pretty limited comparison (about 1/3 of the HardOCP tests), it is pretty clear that Firingsquad's canned benchmarks get the same results as HardOCP's "real-world" runs. The conclusion to be drawn from both reviews is that there is a negligible difference in performance when going from an FX-62 to a Core 2 at those particular settings. If Firingsquad is lying about something, then so is HardOCP, because they both have the same results.
[u][b]My Own Thoughts / Stop Reading Here[/b][/u]
I'm not yet convinced that timedemos aren't useful tools for benchmarking. If anything, I'm more convinced that irreproducible "real-world" playthroughs are the unreliable method. You could not get away with testing like this in the scientific or engineering communities. That said, in all the reviews I've read, the "real-world" testing agrees with the "canned" testing, so practically speaking it doesn't make a difference which method is used.
More importantly, it is very useful to provide a wide variety of tests for a wide variety of readers. I'm of the opinion that more information is better than less information. It is vital for the reader to understand the data s/he is reading and make decisions about what is relevant to them and what is not.
We are perfectly capable of coming to the conclusion that "those 1600*1200 benchmarks indicate I won't see a difference right now, but those 800*600 benchmarks suggest the CPU will scale better when I get a G80 in December". Do not underestimate our intelligence. It is insulting.
Also, here's some constructive criticism for everyone that I'm sure no one will disagree with: it would be really useful if reviewers started including standard deviation figures with their tests. Standard deviation provides excellent insight into the stability of frame rates, and as far as I know no one is doing this currently.
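To make the suggestion concrete, here's a minimal Python sketch (using made-up frame-time numbers, not data from either review) of why standard deviation matters: two runs with an identical average fps can have wildly different stability, and only the standard deviation exposes it.

```python
import statistics

def framerate_stats(frame_times_ms):
    """Summarize a per-frame timing log (milliseconds per frame)."""
    fps = [1000.0 / t for t in frame_times_ms]  # instantaneous fps of each frame
    return {
        "avg": 1000.0 * len(frame_times_ms) / sum(frame_times_ms),  # frames / total time
        "min": min(fps),
        "max": max(fps),
        "stdev": statistics.stdev(fps),  # low = smooth, high = stuttery
    }

# Two hypothetical 100-frame runs, both averaging exactly 50 fps:
smooth = [20.0] * 100           # every frame takes 20 ms
stuttery = [10.0, 30.0] * 50    # alternates between 100 fps and ~33 fps

# framerate_stats reports the same 50 fps average for both, but a stdev
# of 0 for the smooth run versus roughly 33 for the stuttery one.
```

A min/max pair hints at the same thing, but only reflects the single worst and best frames; the standard deviation quantifies how much of the run actually strayed from the average.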
Discuss, contribute, point out why it's incorrect, etc.