Everyone seems to have a strong opinion on the recent Conroe gameplay evaluation/review and the ensuing shitstorm that is still taking place. I thought it would be interesting to do a formal analysis of what differences actually exist between "canned" and "real-world" performance testing (largely in the context of the FS-HOCP debacle).
In a rather unusual turn of events, I'm trying to keep out my usual biting sarcasm and subtle jabs. They're totally unprofessional and immediately detract from the credibility of an article as we have learnt. (oops, that won't happen again)
[u][b]Methodology[/b][/u]
I have taken tests run by HardOCP in their Intel Core 2 Gaming Performance article and compared them to "canned" benchmarks run on settings and hardware as similar as possible (single video card, RAM, resolution, game settings, etc.). I have specifically looked for performance differentials when the CPU is varied, as this was the focus of the HardOCP article, and certainly the most important factor when upgrading a CPU. Currently there are just Oblivion and HL2 benchmark comparisons here. I would like to do more comparisons, so if people can point me to suitable candidates that would be great. Since no two sites ran the same suite of games, it will probably be necessary to pull the tests from multiple sites.
[u][b]Oblivion[/b][/u]
HardOCP Results @ 1280*1024
6800 - MIN: 27, MAX: 66, AVG: 39.8
6700 - MIN: 26, MAX: 65, AVG: 39.5
FX-62 - MIN: 26, MAX: 63, AVG: 39.6
Shadows had to be set to a lower quality on the FX. Otherwise the performance and quality differences were reportedly negligible.
"Canned" Results @ 1600*1200, courtesy of Firingsquad
[url]http://firingsquad.com/hardware/intel_core_2_performance/images/city1600.gif[/url] [url]http://firingsquad.com/hardware/intel_core_2_performance/images/obl1600.gif[/url]
Firingsquad lacks minimum and maximum figures, as well as a histogram. The results are the same though: the performance difference between CPUs is [b]essentially non-existent[/b].
[u][b]Half-Life 2[/b][/u]
HardOCP tested Episode One, while Firingsquad tested Lost Coast. If someone can point me to a "canned" Episode One test I will gladly include it in this section. I believe Lost Coast and Episode One share essentially the same engine (i.e. HL2 + HDR, etc.), so this is a pretty reasonable comparison. In any case, HardOCP and Firingsquad agree in their results.
[url=http://enthusiast.hardocp.com/article.html?art=MTEwOCw1LCxobmV3cw==]HardOCP Results[/url] @ 1600*1200
6800 - MIN: 34, MAX: 96, AVG: [b]59.9[/b]
6700 - MIN: 32, MAX: 96, AVG: [b]59.9[/b]
FX-62 - MIN: 34, MAX: 95, AVG: [b]59.9[/b]
HardOCP stated, [i]"There is absolutely [b]no way in-game to tell any difference[/b] between the CPUs [...]"[/i].
[url=http://firingsquad.com/hardware/intel_core_2_performance/page13.asp]"Canned" Results[/url] @ 1600*1200, courtesy of Firingsquad
[url]http://firingsquad.com/hardware/intel_core_2_performance/images/lc1600.gif[/url]
Again, Firingsquad lacks some very useful information like max and min fps, but the results are the same: There is no performance benefit to the Core 2 CPUs.
[u][b]Summary[/b][/u]
Although this is a pretty limited comparison (about 1/3 of the HardOCP tests), it is pretty clear that Firingsquad's canned benchmarks get the same results as HardOCP's "real-world" runs. The conclusion to be drawn from both reviews is that there is a negligible difference in performance when going from an FX-62 to a Core 2 at those particular settings. If Firingsquad is lying about something, then so is HardOCP, because they both have the same results.
[u][b]My Own Thoughts / Stop Reading Here[/b][/u]
I'm not yet convinced that timedemos aren't useful tools for benchmarking. If anything, I'm more convinced that irreproducible "real-world" playthroughs are the unreliable method. You could not get away with testing like this in the scientific or engineering communities. That said, in all the reviews I've read, the "real-world" testing agrees with the "canned" testing, so practically speaking it doesn't make a difference which method is used.
More importantly, it is very useful to provide a wide variety of tests for a wide variety of readers. I'm of the opinion that more information is better than less information. It is vital for the reader to understand the data s/he is reading and make decisions about what is relevant to them and what is not.
We are perfectly capable of coming to the conclusion that "those 1600*1200 benchmarks indicate I won't see a difference right now, but those 800*600 benchmarks suggest the CPU will scale better when I get a G80 in December". Do not underestimate our intelligence. It is insulting.
Also, here's some constructive criticism for everyone that I'm sure no one will disagree with: it would be really useful if reviewers started including standard deviation figures with their tests. Standard deviation provides excellent insight into the stability of frame rates, and as far as I know no one is doing this currently.
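To make the suggestion concrete, here's a minimal Python sketch (using made-up frame-time numbers, not data from either review) of why standard deviation matters: two runs with an identical average fps can have wildly different stability, and only the standard deviation exposes it.

```python
import statistics

def framerate_stats(frame_times_ms):
    """Summarize a per-frame timing log (milliseconds per frame)."""
    fps = [1000.0 / t for t in frame_times_ms]  # instantaneous fps of each frame
    return {
        "avg": 1000.0 * len(frame_times_ms) / sum(frame_times_ms),  # frames / total time
        "min": min(fps),
        "max": max(fps),
        "stdev": statistics.stdev(fps),  # low = smooth, high = stuttery
    }

# Two hypothetical 100-frame runs, both averaging exactly 50 fps:
smooth = [20.0] * 100           # every frame takes 20 ms
stuttery = [10.0, 30.0] * 50    # alternates between 100 fps and ~33 fps

# framerate_stats reports the same 50 fps average for both, but a stdev
# of 0 for the smooth run versus roughly 33 for the stuttery one.
```

A min/max pair hints at the same thing, but only reflects the single worst and best frames; the standard deviation quantifies how much of the run actually strayed from the average.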
Discuss, contribute, point out why it's incorrect, etc.