Benchmarking the Benchmarks @ [H]

The article went pretty much as I expected: it used the worst possible game for benching, and I saw it coming a mile away. Sorry, I do appreciate [H] because I love reading the forums, but a single game hardly proves custom timedemos are not an accurate and reliable way to measure performance, especially with CrossFire support for Crysis still in its infancy. I look at [H]'s numbers as a worst-case scenario, and I think that's a pretty good way to look at it.
 
I agree with the real-world testing and have loved the [H] evaluations; however, there are times I wish there were more comparable results between a series of cards. Many times the test systems change or share only one title in common. I have adapted by comparing these evaluations against the expectations I've gathered from reading your articles, but sometimes that is not enough (if the performance demands are high and the budget is tight) and I must use FPS benchmarks from other sites to make sure my relative expectations are within a margin I feel comfortable passing on.

If something looks fishy, I dive back into the real-world benchmarks before making a recommendation or purchase. Real-world testing often makes it difficult to compare a series of cards when you plan on using a lower resolution with high settings and demand high FPS (45-60).


That said, I do trust your opinions and the data collected when performing real-world testing. I remember when the Oblivion 1.2 patch was released: I did nearly 100 run-throughs using 3 routes (1 on horseback) while testing various playable settings, and finally compared 3 NVIDIA drivers that were recommended for performance. I did these 3-5 minute runs without sound, and surprisingly 90% of the runs were within +/- 1.5 avg FPS and +/- 3 max FPS despite the numerous variables. I even tested for variations in results based on temperature, including first runs after cooldowns.

It was extensive testing, but I was able to determine the best graphical settings, INI tweaks, and driver to use for my system (it actually came down to one older driver that performed best overall, another that performed best in daylight [HDR sky effects], and a final one that was just a very good all-around driver with higher minimum FPS). Not to mention I got extremely good at hitting every 'checkpoint' within 0.5 seconds every time.
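For anyone curious how that kind of run-to-run consistency could be checked, here is a minimal sketch. All of the numbers and route names below are made up for illustration; in practice each run's average FPS would come from a FRAPS log of one 3-5 minute playthrough of the same route.

```python
# Hypothetical sketch: checking run-to-run consistency of manual benchmark passes.

def run_stats(avg_fps_per_run):
    """Return (mean avg fps, max deviation from the mean) for a series of runs."""
    mean = sum(avg_fps_per_run) / len(avg_fps_per_run)
    max_dev = max(abs(fps - mean) for fps in avg_fps_per_run)
    return mean, max_dev

# Imaginary routes, several runs each (average fps per run).
routes = {
    "town": [41.2, 40.1, 41.8, 40.6],
    "forest": [37.9, 38.4, 37.2, 38.8],
    "horseback": [44.0, 45.1, 43.7, 44.6],
}

for name, runs in routes.items():
    mean, dev = run_stats(runs)
    # The +/- 1.5 avg fps window described in the post above.
    verdict = "consistent" if dev <= 1.5 else "noisy"
    print(f"{name}: mean {mean:.1f} fps, max deviation {dev:.1f} -> {verdict}")
```

If manual runs stay inside a window like this, the "human factor" noise is small compared to the differences between cards or drivers being measured.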
 
It's been said before:

Real world game play rules, canned benchmarks belong at...Cannedtech? :cool:
 
Ever since shaders came in and the whole "the same, only faster" methodology for designing video cards became obsolete, fixed timedemo benchmarks have become increasingly irrelevant.

When a new card comes out I want to know what it will do for me, not an abstraction of its performance relative to other cards. You may as well just make a tool that flickers the screen between black and white, where the card that can do the most flickers in 10 seconds wins; it's effectively meaningless.

I've supported the [H] position on this since day one.
 
The article went pretty much as I expected: it used the worst possible game for benching, and I saw it coming a mile away. Sorry, I do appreciate [H] because I love reading the forums, but a single game hardly proves custom timedemos are not an accurate and reliable way to measure performance, especially with CrossFire support for Crysis still in its infancy. I look at [H]'s numbers as a worst-case scenario, and I think that's a pretty good way to look at it.

There are plenty of other reviews (appraisals) if you need more proof.
I'm satisfied with their testing and I trust them to report the truth.
 
There are plenty of other reviews (appraisals) if you need more proof.
I'm satisfied with their testing and I trust them to report the truth.

First, I didn't say they didn't report the truth. I am saying a single game doesn't prove much; it would be nice to see more games. I am pretty tired of this topic, so I am just going to agree to disagree and leave it at that.
 
Hopefully this will help people who get all caught up in overthinking testing methodology to realize what's going on.

What you guys do really is as "simple" as you stated: you play the freaking game instead of starting a canned benchmark and going AFK to watch pr0n. Why would I want to read a review if it's not even by someone who plays the games?

Edit: the Crysis GPU Benchmark still sucks. Don't get me wrong here, I agree with [H]'s testing methodology, but why can't all these sites using canned benchmarks at least make a good canned benchmark of Crysis? The Crysis GPU Benchmark is worthless. Period.
 
Real-time, non-canned testing is the only way to benchmark in this age, where in 99% of games one graphics card company is working with the developer in-house from alpha to retail and beyond, but not the other.
 
I'm glad you elaborated a bit more on your setup; it's certainly a bonus in a review. But can you say that they (AT) are wrong?
You investigate the difference between the FRAPS and timedemo numbers, but what about the difference between your timedemo and theirs?
If you are trying to prove your method more scientific, you should attempt to get a run with no difference from theirs to use as a control.
There is the possibility you've only proved that there is a significant difference between your testing platform and AT's (AT used 32-bit; I'm not sure if they used DX10, which makes this all the harder to verify).

A good read anyhow. Thank you for the article, Kyle.

PS: All you "2 GB is all any program uses" people need to get out more. The 3 GB > 4 GB people even more so.
If you've never had a BSOD from a program running out of user space, you can ignore the above statement. I wish I could.
 
This is interesting.
So let us analyse this article. The only way to determine whether this is a good article or just a piece of used toilet paper is to examine the conclusion and the premises leading to that conclusion.

So your argument is:
Conclusion: The "canned" built-in Crysis GPU test is flawed (implying all other canned benchmarks are flawed).
Premise: You (HardOCP) get lower FPS when you measure it using FRAPS.

And that is all the conclusion is based on.
But you have made a lot of assumptions.
In order to make your article less laughable, you should at least prove the following:
1. The program FRAPS reflects the most accurate FPS.
2. The GPU benchmark from the game engine's maker is inferior to a third-party recording tool.
3. Your timing in pressing the hotkey to start and end FRAPS is flawless.

All your "real world" benchmarks are done on one assumption: that FRAPS is infallible.
If you have a little background in programming, you should know that FRAPS is nothing more than a counter counting the number of SwapBuffers or D3D Present (flip) calls, or WM_PAINT messages, per second.

It is a tool to APPROXIMATE the performance of games which don't have a built-in benchmark tool; it is by no means a REPLACEMENT for one.

FRAPS can only give you rough numbers. It can be off by anywhere from 1% to 30% (the human factor). Your mistake is to take it too seriously, and that is why you have turned this site into a joke. What makes it even funnier is that you seem to be very proud of that "real world" thing. :D
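For what it's worth, the frame-counting idea described above can be sketched in a few lines. This is a conceptual illustration only: a fake render loop stands in for the hooked Direct3D Present / SwapBuffers call that a real overlay tool like FRAPS intercepts.

```python
# Conceptual sketch of how an overlay FPS counter works: count how many frame
# "presents" complete per wall-clock second. Real tools hook the graphics API's
# present call; here `render_frame` is a stand-in for one rendered frame.

import time

def measure_fps(render_frame, seconds=1.0):
    """Count completed frames over a wall-clock interval and return frames/sec."""
    frames = 0
    start = time.perf_counter()
    while time.perf_counter() - start < seconds:
        render_frame()   # stand-in for D3D Present / SwapBuffers
        frames += 1
    return frames / (time.perf_counter() - start)

# Fake "frame" that takes ~10 ms, so the measured rate should come out
# somewhere near 100 fps (sleep overhead will pull it a bit lower).
fps = measure_fps(lambda: time.sleep(0.010), seconds=0.5)
print(f"~{fps:.0f} fps")
```

Note that such a counter reports how often frames were presented, which is exactly what a player experiences; whether that matches an engine's internal timedemo number is the whole argument of this thread.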

Pretty bold statements, but this is also getting old and tired. I'm serious!! Both camps need to STFU when it comes to FRAPS. Whenever it doesn't give the answer YOU want, you say "FRAPS isn't that accurate."

FRAPS was developed by programmers. Coding is all those bastards do, so from my years of reading and being in the know about this stuff, I have come to realize that they do it well. If it were inaccurate, don't you think they would have fixed it by version 10 of their revisions?

Show me and the others in this thread a single link from a reputable source saying that FRAPS at this point in time is anything but accurate to within 1-5%, or your above post is what you accused HardOCP's site of currently being... a joke ;)
 
The article went pretty much as I expected: it used the worst possible game for benching, and I saw it coming a mile away. Sorry, I do appreciate [H] because I love reading the forums, but a single game hardly proves custom timedemos are not an accurate and reliable way to measure performance, especially with CrossFire support for Crysis still in its infancy. I look at [H]'s numbers as a worst-case scenario, and I think that's a pretty good way to look at it.

CrossFire works fine for DX10 mode in Crysis (though obviously performance should increase with driver updates); it's DX9 mode (on Vista) that's broken. Performance can go from 50+ FPS to 10 FPS just by changing your view slightly on some levels.
Here's an example I grabbed:

http://homepage.ntlworld.com/gerald.marley/cry/crysis64 2008-02-05 14-58-40-00.jpg

http://homepage.ntlworld.com/gerald.marley/cry/crysis64 2008-02-05 14-58-45-16.jpg

In the second screenshot I moved my mouse maybe a quarter of an inch to the left; look at the hit, and that framerate pretty much stays that way for the rest of the level. This happens in 5 of Crysis's 11 levels, and the only review to pick up on it was from a German site, which I only found out about a week or so after I discovered it myself.
 
Nice article, as usual. Makes me happy I found [H]. I rarely upgrade (maybe once every 2 or 3 years), but when I do I'd like to have these kinds of reviews... ones where you're actually, you know, doing stuff? Especially when I'm putting $1000 +/- into a system.
 
The article went pretty much as I expected: it used the worst possible game for benching, and I saw it coming a mile away. Sorry, I do appreciate [H] because I love reading the forums, but a single game hardly proves custom timedemos are not an accurate and reliable way to measure performance, especially with CrossFire support for Crysis still in its infancy. I look at [H]'s numbers as a worst-case scenario, and I think that's a pretty good way to look at it.

First of all, this article is meant to show the generic differences between timedemos and real-world gameplay numbers. It's NOT meant to cover every game with a built-in benchmark tool.
Second, Crysis is the perfect example, since it's the most graphically demanding game out there. What were you expecting? That they used CoD4, which is a new game but runs smooth as butter on my aging 7800 GTX 256? That would make no sense.
 
First, I didn't say they didn't report the truth. I am saying a single game doesn't prove much; it would be nice to see more games. I am pretty tired of this topic, so I am just going to agree to disagree and leave it at that.

No matter how you spin it, real-world gameplay = what you will be doing in front of the PC.
You can test 1 game... or 5,000... it won't change a bloody thing. I am also tired of this topic, but from a different POV than you.

I am tired of people thinking they can compare canned timedemos with real gameplay... you CAN'T... get over it!
I am tired of people who think 3DMark-style numbers... or benchmark numbers... mean anything.
I am tired of canned-timedemo fans trying to "convert" HardOCP.

Get over it, go play a game...
 
To put it plainly, it was a painful gaming experience.

I think that wraps it up. If the game is not playable/enjoyable with the hardware at certain settings, then who cares what the "benchmark" says?
 
The site Muropaketti chose sides, and they chose [H]. They took Anandtech to task because in their tests, timedemos and cut-scene benchmarks ran faster on both cards, but AMD's cards benefited much more from this.

They tested the HD 3870 X2 against the 8800 Ultra and got pretty similar results to [H]. They actually tested Anandtech's quality settings in Crysis, and in actual gameplay they got MUCH worse results: the HD 3870 X2 lost to the 8800 Ultra. In the built-in benchmark, the HD 3870 X2 did much better, winning.
 
Funny, after the 3870 X2 review, the thread was up to page 15 by this point, with people screaming at [H].
The relative silence this time around is deafening... not much to say when the proof is in the pudding and you get served a whole [H]ard spoonful.
 
I support the real-world gaming evals too! It's like a car being tested on rollers, with someone then speaking from those results about how well it handles the road. I want someone to actually get out there and drive the thing through rain, dirt, rocks, curves, etc.!
Jason
 
To those who're complaining that [H] only used one game, just take notice of this section in "The Bottom Line":

There is also no doubt that there are some games out there that benchmark perfectly in relation to their real world gameplay. We just don’t know what they are, and quite frankly we don’t care.

Even though there *may* be some games with benchmark tools that are accurate in relation to real-world gameplay, there's really not much point in going out to find them. The best we can hope for is more games with accurate benchmark tools, so that at least the "canned" benchmarks out there will have some sense of meaning.
 
Muropaketti, what!!! I think Anandtech lost; I mean, Muro is on [H]'s side :D What else do you want?
 
I don't think the silence is really speaking volumes. I think people have just accepted that this is the way Kyle will do benchmarks. It's not a bad thing; it just distinguishes this site from the rest. I don't agree with every area here, nor do I elsewhere. I have been in this business a VERY long time, and I take an overview of several review sites to get an idea of how a card will perform in relation to another.
 
I think the in-game testing is the way to go, but:

Just to be the devil's advocate here... there is still the question of why, when you run the timedemo like everyone else, the X2 loses to the GTX. I understand that FRAPS and the engine-reported FPS may not agree; that's not really in question. The fact is, now you've run the test the way other sites have, and you still have a totally different result/conclusion than they do.

So this begs the question: where does THIS discrepancy spawn from?

OK, so it's super late and I'm super tired and I think I'm missing something too, cuz I *get* the article and everything, but I must be reading the #s backwards, because it seems like the ATI card does worse regardless, as the guy I quoted points out. Shouldn't it be doing better, as the words in the article suggest?
 
I think this really exposes the flaws in Crysis's display of its framerate. I and many others have wondered where these magical "high and fast" frame rates are coming from, when our own experiences on fairly high-end rigs have shown that the FPS is far from it.

This is why HardOCP's benchmarks are so valuable to the community. They tell it how the player will experience it; they go through what a typical player will go through when PLAYING the game. I'm not saying Anandtech has their testing wrong; it's just that the tools they've been given to test with have been rigged before they used them. Between NVIDIA, AMD, and Crytek, each of them is doing something that will somehow always put their products in a more favorable light than what the real experience will be.

Let them slug it out over there in that forum. The truth of the matter is that if there weren't a problem, there'd be no need to argue, and the results could stand on their own merit.
 
Muropaketti, what!!! I think Anandtech lost; I mean, Muro is on [H]'s side :D What else do you want?
aabf18_sarcasm_detector.jpg

Hmm...

They have a reliable and professional staff there. Their site would be more popular outside Finland if they used English :D.
 
All I need do is look at 3DMark06, which says "hello, benchmarker, you have an AMD processor, and even though this is supposed to be a graphical benchmark, we are going to artificially nuke your score far beyond any real-world difference (especially in GPU-bottlenecked games)."
 
All I need do is look at 3DMark06, which says "hello, benchmarker, you have an AMD processor, and even though this is supposed to be a graphical benchmark, we are going to artificially nuke your score far beyond any real-world difference (especially in GPU-bottlenecked games)."
Well, 3DMark06 is also bad without that CPU thingy. I mean, the HD 2900 XT scored well in it while it sucked in games.

The thing is that at least Muropaketti noted around that time that the R600 architecture works very well in static benchmarks where every situation is pre-determined. And RV670 still uses the R600 architecture! Yet Anandtech still doesn't seem to realize this and claims that there's no need for real gameplay tests :D!
 
Funny, after the 3870 x2 review, the thread was up to page 15 by now with people screaming at [H].
The relative silence this time around is deafening... Not much to say when the proof is in the pudding, and you get served a whole [H]ard spoonful.

It being released late Sunday I'm sure had nothing to do with it. :rolleyes:
 
So they've gone from cooking the drivers (the NVIDIA GeForce 5 days) to cooking the timedemos (Crysis). Epic fail... At least there's one site honest enough to do more than just run the built-in timedemos and call it a day.
 
Well, 3DMark06 is also bad without that CPU thingy. I mean, the HD 2900 XT scored well in it while it sucked in games.

The thing is that at least Muropaketti noted around that time that the R600 architecture works very well in static benchmarks where every situation is pre-determined. And RV670 still uses the R600 architecture! Yet Anandtech still doesn't seem to realize this and claims that there's no need for real gameplay tests :D!

I think Jakalwarrior was being sarcastic, though he didn't use a sarcasm smiley face :)
 
This is more of a collection of suggestions that this thread caused me to come up with, rather than any sort of argument over canned vs. real world testing.

I think what turns a lot of people off when reading [H]'s video card reviews is that all they see is the FPS (min, avg, max) and then a line graph for 1-2 video cards. This makes comparing performance between cards awkward and cumbersome; several friends of mine have mentioned this to me specifically when I've linked them to [H]'s reviews. If you guys were to toss that same real-world gaming data into a bar graph that compared the performance of several cards at once, I think the reviews would appeal to a much broader audience. You don't even have to get rid of the current ways the data is displayed, just add another (making a bar graph should only take a few minutes).

HardOCP's video card reviews are my first choice as far as data goes, but I don't really like the way the data is presented. What draws people to Anand's (and similar websites') reviews is the simplistic presentation of the data, which allows them to quickly compare one card's performance to another's (how the data is obtained is unfortunately a secondary concern to many people). People want their info as quickly and simply as possible, which is probably why they are willing to settle for canned benchmarks. That is, if they even bother to investigate what methodology is being used, or understand it when they do.

As far as the line graphs go, they don't really tell me much, since I have no idea what each second in that line graph actually represents in the gameplay. It would be more helpful if you guys could provide a video of the actual gameplay used to get the data for each card, with FRAPS and a timer running. This would also help ease concerns over inconsistency in non-canned benchmarking, since one could verify with one's own eyes that the gameplay runs were similar enough to draw conclusions from. Being able to see exactly where the slowdown in the game occurs and what in-game effects caused it would be really interesting and helpful when making the decision to buy a video card.

With continued real-world testing and an improved presentation, [H] would have the best video card reviews, hands down.
 
Nice work on the article. I was pretty surprised to see the difference in frame rate between benchmark demos and the same run using FRAPS. I will definitely keep this in mind for the future. :)
 
Damn good article pointing out exactly what you've been saying all along. And what most people using logic and critical thinking have understood for a long time.

I generally don't care a whole lot about video cards. I never get the high-end card, because I won't spend that kind of money when it could be more useful elsewhere. By the time I get a card, it's usually known whether or not it's a good card. I got my 7600 GT a little over a year ago; by that time, everyone knew it was basically the best midrange card for the money. I should have an 8800 GT by the end of the week with the rest of my new system. Again, this is obviously a card well known to be great for performance, especially for the money. Hell, this will probably be the highest-end card I've ever owned.

When I was researching both cards, I hit up the [H] reviews because they play the damn games and tell me the results along with all the settings used in testing. In many cases they run resolutions I won't end up hitting for one reason or another, but I know what the cards are doing. I can base my gameplay expectations on the results [H] gets, because they tell me whether the gameplay is smooth or not. I have yet to be able to do that with any damn canned benchmarks or timedemos. Those things don't tell me shit. They give me a damn FPS number or something like that, but they don't tell me whether the game is actually playable. They don't tell me if there is some type of corruption or mis-rendering. They don't tell me if the scene just flat out looks like shit.

I remember not long after I got my Radeon 8500, a friend of mine picked up a Radeon 7500 AIW. He was having trouble with it, and I tossed it into my system to check it out. It turned out the card was just fine and the problem was with his system. Anyway, the performance of the 7500 was basically the same as my 8500's, if not a little faster in terms of FPS, in my favorite game at the time, RTCW. But since the cards were from different generations and had different features, the 7500's output looked completely different from my 8500's. Sure, it obviously "scored" well in terms of FPS, but it was not as visually appealing.

While that example is not the best comparison since the cards were different generations, it does illustrate what [H] has been doing. There are differences between cards of the same generation in performance and image quality. Image quality has been pushed more and more in video cards for years now. It's the higher image quality with playable performance that's important. Canned benchmarks and timedemos do not show this.

Both NVIDIA and ATI have been caught optimizing for benchmarks and timedemos. Because of this, how can anyone believe canned benchmarks and timedemos are an accurate measurement of performance? Following the scientific method does not make a test accurate if there are flaws in the testing tools to begin with. At least in [H]'s case, the tools are accepted by most to be accurate (the games themselves, as well as FRAPS). The only real complaint I've ever seen regarding the scientific method and [H]'s approach is that their results can't be reproduced exactly.

Now we're at a bit of a sticking point. Which is more accurate: testing with accurate tools in situations where the results can't be "exactly" reproduced, or testing with known-flawed tools whose results can be reproduced exactly?

Those are the questions you have to ask yourself.

 
HardOCP's video card reviews are my first choice as far as data goes, but I don't really like the way the data is presented. What draws people to Anand's (and similar websites') reviews is the simplistic presentation of the data, which allows them to quickly compare one card's performance to another's. People want their info as quickly and simply as possible, which is probably why they are willing to settle for canned benchmarks.

When you say "they settle for canned benchmarks", it seems like you're assuming that the people who settle for canned benchmarks actually understand that they are not getting the most accurate info, but that's not true.
People who dispute [H]'s methodology do NOT understand it. For these people, if the numbers in the graphs on site 1 do not match the numbers on site 2, then the "favorite company" reflex kicks in, and the site with the lower numbers for the favorite company's card gets trashed, no matter the methodology.

You are right that people prefer fast and simple graphs, but assuming they understand the methodology is wrong. They don't.
 
When you say "they settle for canned benchmarks", it seems like you're assuming that the people who settle for canned benchmarks actually understand that they are not getting the most accurate info, but that's not true.
People who dispute [H]'s methodology do NOT understand it. For these people, if the numbers in the graphs on site 1 do not match the numbers on site 2, then the "favorite company" reflex kicks in, and the site with the lower numbers for the favorite company's card gets trashed, no matter the methodology.

You are right that people prefer fast and simple graphs, but assuming they understand the methodology is wrong. They don't.

I totally agree with you. I actually edited my post to reflect that sentiment a bit (I probably should have been clearer about it). I think the methods used to obtain the data are a secondary concern for many people, while the presentation is most important. There are definitely people who don't even bother to find out what methodology is being used (or don't understand it when they do).
 