Benchmarking the Benchmarks @ [H]

This article was wonderful and just proves what you guys have been saying all along. I too am one that makes sure I read your reviews before I go out and make a major purchase. I recently made some recommendations to a guild member of mine about video cards and pointed to several of your reviews for proof that he could save a few bucks and still get great performance. Keep up the great work. (Still sad that the system evaluations had to go. Those were the best also.)
 
That would defy the whole point of actual gaming performance, wouldn't it? If you make a map that's too demanding, it'll be unlike anything you'll actually experience when playing the game. If it's not demanding enough, same thing. It would allow you to compare 2 video cards when rendering the same scene, but it would not be a scene found in the game...therefore it wouldn't give you an idea of how the game would perform for you.

I dunno, I just figured that, along the same lines as when people on the forum tell others requesting tests at specific resolutions to extrapolate the results at that resolution from the data for other resolutions, the same thinking would hold for a custom, extra-stressful scenario (or even a custom, not-too-demanding one): the card that wins there would win at other settings as well. I mean, sure, it's not found in the shipping game, but it's still a scene made with the same engine, and its results would apply just as well as a scene from the shipping game would.

Not to mention it really shouldn't be that big a deal whether a scene is in the game or not. I mean, a scene is just that, a scene, and the rest of the game isn't that same scene, yet we extrapolate how the rest of the game will play from the results of a playthrough of that particular scene. So I'd say that, by this logic, a custom scene would be just as good as a scene in the official game.
 
I agree with most of the concepts behind your benchmarking and evaluations.

The only part that I find will still remain subjective is exactly what minimum sustained FPS provides the user a smooth gaming experience. Crysis has shown that if you toss in a little motion blur, 20 fps is enough to get by.

Unfortunately for me, I find 20 fps lacking. I've found that I've been able to make use of as much as 85 fps (back in the old Quake 3/CRT days) when playing FPS games. So reading your evaluations has really helped me over the years make sure that when I purchase cards, I get the highest sustained FPS for the resolution I play at, and I'm thankful for that.

Keep up the good work!
 
Excellent read. Thank you.

One quick question if I may. You mentioned in the article that you put a tremendous amount of time and effort into ensuring that each lap in the game is as comparable as possible. Do you run any statistical analyses (such as Analysis of Variance/pairwise comparisons) to ensure there are no significant differences among your sample means?
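For context, a check like that could look something like the sketch below (invented per-second FPS samples, not [H]'s actual data), using SciPy's one-way ANOVA followed by pairwise Welch's t-tests as one reasonable way to test whether the run-through means differ significantly:

```python
# Hypothetical per-second FPS samples from three run-throughs ("laps") of the
# same gameplay section. Illustrative numbers only, not [H]'s data.
from itertools import combinations
from scipy import stats

runs = {
    "run1": [42, 45, 44, 47, 43, 46, 45, 44],
    "run2": [43, 44, 46, 45, 44, 45, 43, 46],
    "run3": [41, 44, 45, 46, 44, 43, 45, 44],
}

# One-way ANOVA: are the mean FPS of the three runs significantly different?
f_stat, p_value = stats.f_oneway(*runs.values())
print(f"ANOVA: F={f_stat:.2f}, p={p_value:.3f}")

# Pairwise comparisons (Welch's t-test) as a follow-up check.
for (name_a, a), (name_b, b) in combinations(runs.items(), 2):
    t_stat, p = stats.ttest_ind(a, b, equal_var=False)
    print(f"{name_a} vs {name_b}: t={t_stat:.2f}, p={p:.3f}")
```

A high p-value across the board would support the claim that the laps are statistically comparable.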
 
So I'd say that, by this logic, a custom scene would be just as good as a scene in the official game.
It would be, but only if it were comparable in terms of detail (geometry, textures, effects, etc.) to some levels in the game. If I took Unreal Engine 3 - used for great-looking games like GoW and UT3 - and ported an Unreal 1 level into it, I'd be using the same engine as UT3, but my fps would be completely different. That's what's important here.

Frosteh said:
They clearly have no intention of addressing issues people have with their review method [...] until they provide benchmarks at higher frame rates, which competitive gamers demand, the reviews will remain useless to a decent portion of the gaming community.

Let's not start that in this thread as well. I locked it for a reason. - Kyle
 
"The last thing we want our readers to experience is installing that brand new extremely expensive video card and then being let down or disappointed in its performance."

Some of us appreciate that a lot.

I purchased an HD 3870 X2 based on all the reviews I saw that refuted [H]ard's findings. I was "disappointed and let down". I'll pay closer attention to your opinions in the future.

FPS is nice, but I am more concerned with real gameplay too. For instance: card A gets 60 FPS, card B gets 75 FPS, but B stutters horribly while moving about in actual play. I'd much rather have card A.

I want to know about IQ in the game, and smoothness of play.

When I get a card, I do just like Kyle does: feel around for the best balance of IQ and performance.

BTW good read and I think folks should finally understand where you are coming from.
 
Great article and not surprising to me in the least, but hopefully it will silence many haters. Now if you guys would only review games as well I would have to stop visiting other sites altogether!
 
Stay above it and let the mortals sling mud.... ;)

Oh, I'm not even going to post there.

I am a [H] Fanboi to the end :D


But I do have the popcorn and am just sitting through the previews :D



Posts about other posts are lame. If we identify ANY of our forum members causing issues on Anandtech's forum, you will be permanently banned. This is your one warning. - Kyle
 
Thank you for the work you guys do, excellent read as always.
 
Kyle and Brent have been on this method since the GeForce 5800 Ultra days because, if I remember correctly, wasn't it NVIDIA who got caught optimizing their code to run benchmarks faster, thereby rendering those tests invalid? Kyle, I do remember you saying that if GPU makers would play fair, you wouldn't have had to implement the "real world gameplay" model that everyone seems to be so bent out of shape over.

I've purchased a lot of equipment over the years, and it's because of sites like yours that I keep up my interest in new technology. Even though I'm like 58 years old, I can totally relate! Kudos!!!
 
[H] FTW!

As a side note, I do read Anandtech's 'reviews,' and while I appreciate their in-depth analysis of certain things, I've always taken their benchies with a grain of salt since I know they're using canned demos to produce their results.
 
And here I was, a self-proclaimed [H]ardOCP testing-method expert. LOL, this article did more than show how irrelevant canned timedemo results are when you want to know which card is best for experiencing the hot new games; it also took us deeper into the [H] way of testing. Honestly, I did not know it was that deep. Damn, you guys play through the entire game and find the most stressful points to benchmark.

These are usually the points that matter most, guys.

Remember Oblivion?

Outdoors, in sunlight, with HDR (still shivering at the memory of that lagfest) :confused:

Thanks guys for working MUCH harder than the others to give us what we enjoy for FREE :D

don't go getting any ideas now ;)
 
Wow, those differences were HUGE!

You guys said that even you were surprised. I think you said all you needed to say in this article, but it would be interesting to see this type of comparison in more games. I'm willing to bet we would see more of the same.

Fantastic work. Your integrity and willingness to prove yourselves makes HardOCP really stand out in this community.
 
damn you guys and your good reviews.

Even though I never truly get the real-world experience, since you guys use much more powerful processors than I will ever have :(
 
[H] is the only site I give a shit about nowadays when it comes to video card reviews (ever since bit-tech went back to the canned method, anyway). I understand the perspective of some other sites that just want to show the pure performance of one card compared to the others on the market, and I think there's a place for both types of performance metrics. What I don't get is the rabid fanboyism for canned benchmarks, especially on that "other" forum ;). I've seen the same type of thing when I've debated people over certain issues, where people confronted by logic will basically just go "nyah, nyah, I can't hear you" and then start insulting you, because they're so convinced of their own beliefs that they refuse to even consider that there might be something better out there. If only people could be a little more open-minded, then maybe they could consider that there are benefits to both methods, and they shouldn't just exclude the [H] method because it goes contrary to their preconceptions.

It really makes me glad that you guys have the balls to stand up to all the pressure and defend your point, because you are right, and the fact that you continue to use your method makes you the only relevant video card reviewing site on the internet, at least in my opinion. Hopefully the naysayers will start to actually use their heads, and start paying attention to what they read in [H] reviews instead of being pissed off that they aren't presented with a bunch of neat columns filled with numbers definitively declaring the "winning product".
 
Great work! Kudos to the HardOCP guys!! Finally, someone who actually speaks up!

I hope the guys at Anandtech see the light with this one... or they are gonna hate you forever, Kyle :p
 
Great work, and Dugg!

I'd be really interested to see what Derek Wilson has to say in response to this. So far he has just seemingly brushed the article off. I think the fact that you've shown a 3870 X2 performs better than the competition in benchmarks but worse in actual gameplay indicates a pretty big problem in the industry right now. AMD and NVIDIA are creating drivers to win benchmarking pissing matches while their actual products are underperforming or have serious problems (the Crossfire clock-throttling bug in Crysis, for example).

I buy video cards to play games, and [H] guides my purchasing decisions.

Keep up the good work, Kyle.
 
Nice article.
Personally, I would do away with the graphs in the reviews, but I guess people still want them.
 
Good article, and one that is hard to argue with (although I'm sure some will).

I have agreed with your real-world testing methods since you first started using them. It makes sense to me that if I'm buying a card to play games with, then the testing used to review that card should show how well it is going to perform its main function: actually playing the games.

I don't play timed demos. I don't know why other people and other sites don't get that.
 
First, let me assure people who recognize my name that I am the same person who posts under this name on the other forums. I have not as of yet seen anyone "hijack" this name on a forum.

Second, I do value the reviews on this site. It is one of the four that I check daily.

I do understand why [H]ard reviews the way they do. Frankly, I like it. Using ACTUAL gameplay to judge a card? I wonder why more people don't do it. I do have one issue, however. When people judge cards, the human doing the "judging" can influence the results. Unless these tests are done "double blind," I can't see how you can claim to keep the human element out of it. Swear all you want, I still apply a human "judgment factor" to everything I read from this site.

I'm not sure what pissing contest you've gotten into with Anand. Frankly, I don't care. Both sites provide valuable input into what is better than what. In my opinion, only by looking at both sides of the coin can you really know WTF is going on.

So says the voice of reason.
 
I was introduced to this site by a friend a few years ago; I used to just skim over sites and average the FPS numbers between cards in order to make decisions. I remember, when first reading through your articles, that the "apples to apples" and real-world gameplay comparisons drew my attention immediately. The point I'm getting at is that I was always interested in exactly what you did. This article was very informative as far as your methods go, and I am even more impressed than before. Thanks for being so scrupulous and tenacious with your methods.
 
The "real world" tests are great for a person buying their very first video card, where they have no point of reference. The average [H] reader already has a pretty good idea of how well their current system works. Time Demo based benchmarks provide a method to say "If I upgraded my 6800GT to the 8800GTX I can expect a performance increase of 'x.'"

Your argument that the premade benchmarks do not provide a "real world" assessment of performance may be accurate, but they still provide the best method for comparison.
 
As someone once told me, "Just because a car drives really fast doesn't mean it can take a corner worth a damn."
 
I think the in-game testing is the way to go, but:

Just to be the devil's advocate here... there is still the question of why, when you run the timedemo like everyone else, the X2 loses to the GTX. I understand that FRAPS and the engine-reported FPS may not agree; that's not really in question. The fact is, now you've run the test the way other sites have, and you still have a totally different result/conclusion than they do.

So this begs the question: where does THIS discrepancy spawn from?
 
The "real world" tests are great for a person buying their very first video card, where they have no point of reference. The average [H] reader already has a pretty good idea of how well their current system works. Time Demo based benchmarks provide a method to say "If I upgraded my 6800GT to the 8800GTX I can expect a performance increase of 'x.'"

Your argument that the premade benchmarks do not provide a "real world" assessment of performance may be accurate, but they still provide the best method for comparison.

Well, I don't agree with that 100%, but honestly, comparing 8800s to 6800s is less important than comparing competing products. I do think that having an extra person handle the hardware side while another did the actual evaluation, without really knowing which card they were testing at the time, would add to [H]'s review quality, though.
 
Well, I don't agree with that 100%, but honestly, comparing 8800s to 6800s is less important than comparing competing products. I do think that having an extra person handle the hardware side while another did the actual evaluation, without really knowing which card they were testing at the time, would add to [H]'s review quality, though.

Actually since they use Fraps to measure frame-rate, the best way to increase the reliability would be to have whoever is analyzing the Fraps output be "blind." I love your idea of having a blind hardware test, but that would be very time consuming and/or require multiple rigs, which of course creates another level of possible inaccuracy.

I still think the best review is the one that has both forms of testing. The inclusion of "Best Playable Settings" from [H] is a great idea, now throw in a few time demo benchmarks into the equation and it's golden.
 
I know this might have already been talked about, but why doesn't [H] simply add the "canned" benchmarks and timedemos to the reviews? It would add at most an hour to the process and deal cards to everyone at the table. And yeah, I have been coming here for a long-ass time; I just have never been a die-hard [H] person, as I go to tons of other sites for other information. Don't get me wrong, I love the way they do it now; I'm just curious as to why they stopped doing it altogether. Anyone can reply, it doesn't have to be a staff member :D
 
This is interesting.
So let us analyse this article. The only way to determine whether this is a good article or just a piece of used toilet paper is to examine the conclusion and the premises leading to that conclusion.

So your argument is:
Conclusion: The "canned" built-in GPU Crysis test is flawed (and, by implication, all other canned benchmarks are flawed).
Premise: You (HardOCP) get lower FPS when you measure it using FRAPS.

And that is all the conclusion is based on.
But you have made a lot of assumptions.
In order to make your article less laughable, you should at least prove the following:
1. The program FRAPS reports the most accurate FPS.
2. The GPU benchmarking tool from the game engine maker is inferior to a third-party recording tool.
3. Your timing when pressing the benchmarking hotkey to start and stop FRAPS is flawless.

All your "real world" benchmarking rests on one assumption: that FRAPS is infallible.
If you have a little background in programming, you should know that FRAPS is nothing more than a counter tallying SwapBuffers or D3D Present (flip) calls, or WM_PAINT messages, per second.

It is a tool to APPROXIMATE the performance of games that don't have a built-in benchmarking tool; it is by no means a REPLACEMENT for one.

FRAPS can only give you rough numbers. It can be off by anywhere from 1% to 30% (the human factor). Your mistake is taking it too seriously, and that is why you have turned this site into a joke. What makes it even funnier is that you seem to be very proud of that "real world" thing. :D
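Conceptually, the counting mechanism described above boils down to something like the sketch below: count how many frame presents land in each wall-clock second. This is a Python illustration with simulated frame timestamps, not FRAPS's actual code.

```python
# Conceptual sketch of counting "presents per second." The timestamps below
# are simulated stand-ins for the moments a frame is presented to the screen.
import random

def simulate_present_times(duration_s=5.0, target_fps=45.0, jitter=0.004):
    """Generate fake frame-present timestamps with a little frame-time jitter."""
    t, times = 0.0, []
    while t < duration_s:
        t += 1.0 / target_fps + random.uniform(-jitter, jitter)
        times.append(t)
    return times

def fps_per_second(present_times):
    """Bucket present timestamps into whole seconds and count frames in each."""
    counts = {}
    for t in present_times:
        counts[int(t)] = counts.get(int(t), 0) + 1
    # Only report full seconds; the trailing partial second is dropped.
    return [counts.get(s, 0) for s in range(int(max(present_times)))]

samples = fps_per_second(simulate_present_times())
print("per-second FPS:", samples)
print("average FPS   :", sum(samples) / len(samples))
```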
 
Good article. Great technical content and explanations of why [H] does "evaluations" the way you do them. However, it would have been better if it weren't as hard on the Anandtech guy (showing his stupid quote twice in the article was harsh). It's better for a website of [H]'s stature to take the high road than to rub it in, even though I agree that the Anandtech guy was completely wrong.

BTW I've read the posts so far on the Anandtech forum and most of the posts seem to be a blind defense of "their" site with no real rebuttals of the article based on fact... not that I can see any technical holes in the article to criticize.

PS I'm a reader of both sites.
 
This is interesting.
So let us analyse this article. [...] All your "real world" benchmarking rests on one assumption: that FRAPS is infallible. [...] FRAPS can only give you rough numbers. It can be off by anywhere from 1% to 30% (the human factor). Your mistake is taking it too seriously, and that is why you have turned this site into a joke. :D

Actually, you're quite wrong. FRAPS is more accurate than the Crysis in-game benchmarks when used carefully (and Kyle made a point of noting that they used it very carefully). I've seen multi-GPU cases where the in-game counter shows 90 FPS with three GTXs while the game is actually running at 20 fps or less, extremely choppy. The number of D3D presents triggered is an extremely reliable way to measure how many frames were actually rendered by the graphics card.

You seem to have forgotten to refute anything with that technical point you bring up... you say it's "nothing more than a counter counting SwapBuffers or D3D Present (flip) calls," but you yourself are now implying that this is somehow inaccurate. So FRAPS is a counter of present calls... and??

Granted, where you start and stop it is going to vary slightly from run to run, so if you do three runs and average the results you get a pretty damn representative result. I've tested this and can say that you can get FRAPS within 1-5% accuracy.

Now, I agree this article doesn't answer the question I asked: how does the GTX beating the X2 in your in-game benchmark runs show us the error of other sites' ways? Granted, the framerates are different, in-game vs. FRAPS, but the fact is you still show the X2 below the GTX, regardless of whether you use the flyby or the built-in method.

To summarize, I agree that in-game measurement with FRAPS, done carefully, is the best way to measure, but we still haven't figured out why you even get different *canned* benchmark results than other sites do.
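As a rough picture of the "three runs and average the results" approach, here is a small sketch with invented FRAPS-style per-second logs (illustrative numbers only, not real review data); it averages the runs and reports the per-run spread so you can judge how tightly they agree.

```python
# Hypothetical FRAPS-style per-second FPS logs from three manual runs of the
# same gameplay section. Invented numbers, purely to illustrate averaging.
from statistics import mean, pstdev

runs = [
    [44, 46, 43, 45, 47, 42, 44, 45],
    [45, 44, 44, 46, 45, 43, 44, 46],
    [43, 45, 44, 44, 46, 44, 45, 44],
]

run_means = [mean(r) for r in runs]
run_mins = [min(r) for r in runs]

print("per-run average FPS:", [round(m, 1) for m in run_means])
print("per-run minimum FPS:", run_mins)
print("averaged result    : %.1f FPS (spread +/- %.1f)"
      % (mean(run_means), pstdev(run_means)))
print("averaged minimum   : %.1f FPS" % mean(run_mins))
```

If the spread stays within a few percent of the averaged result, the manual start/stop timing is not swinging the numbers much.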
 
This is interesting.
So let us analyse this article. [...] All your "real world" benchmarking rests on one assumption: that FRAPS is infallible. [...] FRAPS can only give you rough numbers. It can be off by anywhere from 1% to 30% (the human factor). Your mistake is taking it too seriously, and that is why you have turned this site into a joke. :D
Fact is, Crysis seriously overestimates the number of frames per second it's rendering. From years of gaming, I know the difference between 20 and 30 and between 30 and 50, and oftentimes what Crysis' r_displayinfo command shows as "50" is a lot less. Sometimes it overestimates by as much as 50%. FRAPS backs me up on this.

What's more interesting to me than the discrepancy between FRAPS and Crysis' built-in performance-measuring tools is the fact that Kyle has shown that the numbers in these canned benchmarks in no way translate to real-world gaming scenarios, especially when the 3870 X2 tears up the timedemos but lags when it comes to actual gameplay.

It's sad how skewed these companies' priorities have gotten, when they care more about optimizing their drivers for benchmarks than for actual gameplay. It's also sad how people cling to a flawed but "scientific" method of performance measuring, when it's actually responsible for misleading consumers. The 7950GX2 comes to mind. Canned benchmarks don't even begin to tell what a piece of crap that card was.

If more sites evaluated cards the way [H] does, the entire PC gaming industry would be changed for the better, as companies would no longer be able to get away with releasing half-assed drivers that optimize for benchmarks while treating gameplay as a second-rung concern.
 
Great article! Much like what happened with the 3870 X2 review, the HD 2900 XT review also drew some harsh criticism, so this article should at last prove that the [H] methodology is the best when it comes to evaluating new graphics cards.

Still, I'm sure there will be those that will say "Why would [H] prove themselves wrong ?". To those I say "Buy the card that other sites are praising and then regret your purchase. Trial and error, but a very expensive one".
 
What, NO 3DMARKS!!!!! BLASPHEMY!!!!! :p

Seriously, can people just come to terms with the fact that [H] data is quite valuable and trustworthy? Everything I've seen in the last two years of testing my own hardware has always fallen in line with what these guys show on their site.
 
Great article, and one of the best I've read in a while. It pretty much answered any questions that popped into my head while reading it, and by the end I had a much clearer picture of where [H] was coming from. Even though I've been following this site for a few years now (and have always agreed with your testing methods), it's great to see that you can stand behind your methods and explain your stance clearly with a good deal of solid data to back you up. I agree with what other [H] members have stated: I buy hardware to play games, not timedemos. [H] is one of the only sites that I know won't steer me wrong, which is why it's always my first stop when looking for hardware reviews.

I think this article will quiet some of the detractors, but I think it opened up a whole new can of worms by bringing up Anandtech... LOL... though I don't think you had any malicious intent behind it; you were just pointing them out as an example.



*Dons his anti-flame suit and grabs a big mug o' coffee*

This should be interesting. Time to get comfortable and watch some fireworks! :cool:
 