Today's Phenom News

It just shows that as a benchmark for SRI and Crossbar it is not objective and therefore pretty meaningless in a scientific sense, and does nothing to help prove outlandish claims about SRI and Crossbar on either the K8 or K10 platform.

If something *completely* unrelated to SRI and Crossbar, and even HTT, skews the results, then what good is the program for only benchmarking SRI and Crossbar and HTT? You can dodge with "but your hardware sucks", which you have done several times, but that doesn't answer the question.

Isn't it pretty obvious? Your hardware *does* suck. Truth may hurt, but it's still true.
A properly configured system would have 16x the AGP performance that your system has, and as such, would have negligible overhead on the total results. A factor of 16 is an awful lot. Your system is simply outside of the targeted configurations of this benchmark. A properly configured X300 card has no effect on the CPU results. And that's pretty much one of the slowest videocards available today.
That is the answer, deal with it.

It is not an objective benchmark, period. It is dependent on external factors other than the ones you are trying to prove its validity for. If it were a valid, objective benchmark, it could be installed on *any* system with more than one physical or logical core, regardless of memory or video hardware, and do the same thing. It's silly to think that this is a meaningful measurement of anything other than "it draws fire this fast on this hardware with these exact settings".

It's meaningless for my hardware (which is currently pissing me off. Stupid ATI). More importantly, it's meaningless for big iron... you know, MP systems with the silly 8MB video adapters, which interestingly enough have FAR more interesting crossbars and HTT links than boring-ass desktop boards. Imagine that: a supposed SRI and Crossbar benchmark that can't run on some of the most hardcore SRIs and crossbars created to date!

First of all, it's not meant to be the perfect benchmark for crossbar/HT performance. It's just one of the side-effects one of the users ran into. The benchmark's purpose is to get the best possible performance out of various architectures available today (with this particular algorithm). In order to get the best out of the architectures, we need to study and understand these architectures. So if increasing the HT-bus speed means our inter-core communication will improve, and therefore the multithreaded parts of the algorithm will be more efficient, then that is valuable information.
But anyway, because of its purpose, namely benchmarking itself, it's implicitly valid on any system you run it on. In your case it proved that your configuration has such poor AGP performance that the graphics becomes the bottleneck. Valuable information.

Secondly, you're the one making outlandish claims here. Your system is running in an extremely crippled state. That's what makes benchmarks (yes, plural) useless... Not the benchmarks themselves.

Thirdly, servers with crappy 8 mb video adapters are outside of the scope of my application's usage. This is meant for visualization stations, some of which have more GPUs than they do CPUs. Not all multiprocessor systems are web/database servers, you know.
Pathetic that you need to have a go at me just because you can't properly configure your own system, and don't even have the slightest clue what my benchmark is about.
 
Does it matter? :p I think it's settled that the crossbar speed is independent of HT speed

Yes, it matters. And I don't consider it settled. I have yet to see an official document from AMD that explains exactly where the >= 2000 MHz comes from.
It's just too vague a term to get any kind of idea of how it works.

and that there is some truth to the sarcastic comment earlier in the thread that HT bandwidth wasn't a problem right now.

Well, not too sure...
The tests show that basically you need to get it to at least 3x to get 'full' performance... or as near as makes no difference, anyway.
But that's with just two cores.
I can imagine that by adding another two cores, you'd need twice that (worst case), so you're already up to 6x, which is above the current limit of 5x.
And when you go with e.g. 4x4, using two quad-core processors, you'd need more still, etc.

So I think the quadcores may already get a *slight* benefit from the extra HT-speed, and anything over 4 cores certainly would benefit.
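
Spelled out, the back-of-the-envelope I'm doing looks like this (the linear worst-case scaling is purely my assumption, not something I've measured):

// Back-of-the-envelope only: assumes (my assumption, not a measurement) that the
// HT multiplier needed for 'full' performance scales linearly with the number of
// cores hanging off the link, starting from the observed ~3x for two cores.
#include <cstdio>

int main()
{
    const double perTwoCores = 3.0;                      // observed: ~3x for a dual-core
    for (int cores = 2; cores <= 8; cores *= 2) {
        double needed = perTwoCores * cores / 2.0;       // worst-case linear scaling
        std::printf("%d cores -> ~%.0fx HT multiplier (current limit: 5x)\n",
                    cores, needed);
    }
    return 0;
}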
 

Stop posting links that don't contain any relevant data.
Now, if you want to prove something, don't post a ton of useless links... Instead, *quote* the piece of text that actually contains the info we're looking for, then post ONE link, to the source of that quote, and also refer to the exact page/location of this quote.
 
Stop posting links that don't contain any relevant data.
Now, if you want to prove something, don't post a ton of useless links... Instead, *quote* the piece of text that actually contains the info we're looking for, then post ONE link, to the source of that quote, and also refer to the exact page/location of this quote.

If you can't take the time to look at it, then I'm sorry but I can't help you... All of the links I provided have said the same thing. Plus a lot of other good information that is worth looking over anyhow. Most of them contain data about topics that we discuss daily...

I'm not your teacher... Google exists for a reason, and your computer has its own mouse....
 
If you can't take the time to look at it, then I'm sorry but I can't help you... All of the links I provided have said the same thing. Plus a lot of other good information that is worth looking over anyhow. Most of them contain data about topics that we discuss daily...

I'm not your teacher... Google exists for a reason, and your computer has its own mouse....

I took the time to look at it; the info I require isn't there.
If it were, you would be able to quote it here.
It's the same as those 386-benchmarks you mentioned.
Weren't there. Could only find benchmarks that proved my point, not yours. You're a liar.
I don't understand why people tolerate you on this forum.
 
The tests show that basically you need to get it to at least 3x to get 'full' performance...
A 3x HT multiplier is lower than that of any CPU AMD has shipped (S754 = 4x, S939/AM2/F/S1 = 5x), so it's not a problem.

Maybe you missed the couple of times I mentioned it, but I have tested inter core communication with Sandra with HT set from 1x-5x. It makes no difference so it's pretty much settled that HT and crossbar speed are independent of each other.

You're just getting a bit silly trying to extrapolate these unrelated items to the K10. I'm done with this sub-discussion.
 
Dude, you're calling me a liar again..... Wow, just wow.....

I offer some advice to help you troubleshoot your bottleneck, you tell me that I'm stupid, and you're not taking my advice... I offer link after link of data to prove that your ill-informed claims are wrong.

Then, in the face of industry data, you call me a liar... Wow....
 
A 3x HT multiplier is lower than that of any CPU AMD has shipped (S754 = 4x, S939/AM2/F/S1 = 5x), so it's not a problem.

Read my post again, you missed the point.

Maybe you missed the couple of times I mentioned it, but I have tested inter core communication with Sandra with HT set from 1x-5x. It makes no difference so it's pretty much settled that HT and crossbar speed are independent of each other.

Yes, but I know nothing of this Sandra test... In fact, I can't even find such a test in my copy of Sandra. This may be because I run an Intel-system, I don't know.
Besides, it's not relevant to me. I don't develop Sandra, I develop my own code, and clearly HT-speed does *something* there. But since you didn't post single-threaded results, we don't know how much of this relates to inter-core communication, and how much of it is regular system communication.

You're just getting a bit silly trying to extrapolate these unrelated items to the K10. I'm done with this sub-discussion.

Good, because you were getting very unfriendly right there. Oh, and you're pretty silly when you think that everyone who doesn't agree with you is 'silly'.
I don't see how everyone here seems to get unfriendly, we're just benchmarking something and trying to explain the results. At least, I am.
 
Then, in the face of industry data, you call me a liar... Wow....

Industry data that you fail to produce, liar.
Who's ill-informed anyway? I don't see you writing any software to test multithreading performance, Mr. "386DX-40 is way faster than any Intel 486, just google benchmarks!".
 
In order of authenticity.......

There's this guy here, who may or may not know what he is talking about....
With AMD’s dual-core processors, the two cores are connected to each other via a system request bus, which operates at core frequency. This in turn is connected to a crossbar, which again runs at core frequency and acts as the interface to the memory controller and Hyper Transport link to peripheral devices

Then there is this one here according to Dell......
Memory bandwidth can be more important than core clock speed when properly applied, with processors coupled closely through crossbar switching and a memory controller that runs at the processor’s speed, as in AMD Opteron processors.

And Finally this one here according to AMD.....
AMD64 designed from the ground up for multiple cores
– True dual-core, with two CPU cores on one die
– Inter-core communication at processor speed
– Direct Connect Architecture allows access via
crossbar to memory controller and
HyperTransport™ technology link

Note that I didn't quote it for YOU. If you can call me a liar, then I guess I can call you lazy.

Along with many, many other sources that are a simple google search away...
 
There's this guy here, who may or may not know what he is talking about....

Probably not, like most :)

Then there is this one here according to Dell......

Doesn't literally say that the crossbar itself runs at core frequency (just because you try to make it appear that way with your bold emphasis doesn't mean it HAS to be read that way). Only says that literally about the memory controller... which we already knew.

And Finally this one here according to AMD.....

Which isn't official technical documentation, but just some marketing mumbo-jumbo, and already contradicts itself because it says crossbar runs at >= 2000 MHz, while clockspeed on certain models is below 2000 MHz.
So does this mean the crossbar can run faster than the core clock?
In that case it means it could also run slower than the core clock on systems of more than 2000 MHz. At any rate, there won't be a hard link between core clock and crossbar clock.

Note that I didn't quote it for YOU. If you can call me a liar, then I guess I can call you lazy.

I'm far from lazy, but there's nothing trustworthy out there. Blame AMD, their technical documentation is appalling.
 
Probably not, like most :)



Doesn't literally say that the crossbar itself runs at core frequency (just because you try to make it appear that way with your bold emphasis doesn't mean it HAS to be read that way). Only says that literally about the memory controller... which we already knew.



Which isn't official technical documentation, but just some marketing mumbo-jumbo, and already contradicts itself because it says crossbar runs at >= 2000 MHz, while clockspeed on certain models is below 2000 MHz.
So does this mean the crossbar can run faster than the core clock?
In that case it means it could also run slower than the core clock on systems of more than 2000 MHz. At any rate, there won't be a hard link between core clock and crossbar clock.



I'm far from lazy, but there's nothing trustworthy out there. Blame AMD, their technical documentation is appalling.

I tell ya.... You are the king of wiggle.
 
I tell ya.... You are the king of wiggle.

And you are constantly spreading lies and misinformation.
At least I write my own software, so I obviously know a thing or two about computer technology, and I have some things to investigate my hypotheses. Given the results presented in this and other threads, these hypotheses are not all that crazy.
Nobody has been able to prove them wrong anyway.

If AMD wants us to believe their inter-core communication is that great, then explain to me why I get the same results on a Pentium D, and why a Core2 Duo runs circles around the Athlon X2, when it comes to inter-core communication?
My code doesn't lie, anyone can run it on any system and analyze, and see... So how do we place AMD's claims in perspective?
 
Doesn't literally say that the crossbar itself runs at core frequency (just because you try to make it appear that way with your bold emphasis doesn't mean it HAS to be read that way). Only says that literally about the memory controller... which we already knew.

How else can you interpret it? There's only one thing doing inter-core communication... the system request interface. If the inter-core communication is at CPU speed, the SRI is as well, because the two are the same.
 
How else can you interpret it? There's only one thing doing inter-core communication... the system request interface. If the inter-core communication is at CPU speed, the SRI is as well, because the two are the same.

I'm talking about this:
"Memory bandwidth can be more important than core clock speed when properly applied, with processors coupled closely through crossbar switching and a memory controller that runs at the processor’s speed, as in AMD Opteron processors."

Now, duby had cleverly written this in bold:
"crossbar switching and a memory controller that runs at the processor’s speed"

This way it appears like you should read it like this:
"with processors coupled closely through (crossbar switching and a memory controller) that runs at the processor’s speed"

However, one can also read it like this:
"with processors coupled closely through (crossbar switching) and (a memory controller that runs at the processor’s speed)"

Considering that they write "runs" instead of "run", I think the second one is actually the proper interpretation of the sentence.
In which case it doesn't say that the crossbar switch runs at the processor's speed, it merely mentions that there is one.

So it doesn't prove anything.
 
i wasn't talking about that one, i'm talking about this one

AMD64 designed from the ground up for multiple cores
– True dual-core, with two CPU cores on one die
– Inter-core communication at processor speed
– Direct Connect Architecture allows access via
crossbar to memory controller and
HyperTransport™ technology link
 
As I said, that presentation contradicts itself on the next page. What am I to make of that? And why isn't that data in their tech docs? Where in the world is the *real* info!?

And I repeat myself with a far more pressing issue than whatever AMD puts in some obscure presentation:
Scali said:
If AMD wants us to believe their inter-core communication is that great, then explain to me why I get the same results on a Pentium D, and why a Core2 Duo runs circles around the Athlon X2, when it comes to inter-core communication?
My code doesn't lie, anyone can run it on any system and analyze, and see... So how do we place AMD's claims in perspective?
 
I think the AFX-22 looks nice, mmm unlocked multiplier for around $400 in AM2. I believe the most expensive AFX will be $499 a pair because of the fact that 4x4 is there.

http://www.fudzilla.com/index.php?option=com_content&task=view&id=631&Itemid=40

One hell of an OCer too with these ES's. Just think what more volts would get us. Conroe speeds... :eek: It's only set at 1.3v. That's insane, the wonders 1.4v would do. 4GHz isn't impossible, with 1GHz+ OCs from a minor voltage bump seen in the 1st OC attempt.
 
And you are constantly spreading lies and misinformation.
Duby certainly doesn't do this any more than you.

At least I write my own software, so I obviously know a thing or two about computer technology, and I have some things to investigate my hypotheses.
The ability to write some test code doesn't make as large an impact on others' opinions of you as you seem to think it does. However, I agree it is good to see that you are making an effort to shed some light on the issue.

Given the results presented in this and other threads, these hypotheses are not all that crazy. Nobody has been able to prove them wrong anyway.
You have not yet proven yourself correct either. Further, considering that your claims are based on a small piece of test software you wrote and are in direct contradiction to AMD's published claims, I think it is you who are under the burden of proof; not the other way around.

While your application is a representation of a particular load, it is in no way indicative of every or even most loads CPUs are likely to see. The fact that AMD's chips do not perform well with your one particular multi-threaded test is quite a different thing from your implied claim that AMD CPUs do not offer on-chip full-speed inter-core communication. In fact, I don't know whether to consider your huge leap in conclusion to be funny or shocking.

If AMD wants us to believe their inter-core communication is that great, then explain to me why I get the same results on a Pentium D, and why a Core2 Duo runs circles around the Athlon X2, when it comes to inter-core communication?
My code doesn't lie, anyone can run it on any system and analyze, and see... So how do we place AMD's claims in perspective?
I can't complain here. If you want to make qualitative claims or observations with some sample code, go right ahead. However, you seem to think your little snippet of code proves that AMD chips don't have full-clock links between cores. This simply isn't true. If you really are an academic, I encourage you to find and read the Software Optimization Guide for the K8, which spells out (in a slightly odd way) the K8's IPC links.

So, stop implying the layout of the hardware based on the performance of your sample app. Stick to the qualitative claims.
 
And you are constantly spreading lies and misinformation.
At least I write my own software, so I obviously know a thing or two about computer technology, and I have some things to investigate my hypotheses. Given the results presented in this and other threads, these hypotheses are not all that crazy.
Nobody has been able to prove them wrong anyway.

If AMD wants us to believe their inter-core communication is that great, then explain to me why I get the same results on a Pentium D, and why a Core2 Duo runs circles around the Athlon X2, when it comes to inter-core communication?
My code doesn't lie, anyone can run it on any system and analyze, and see... So how do we place AMD's claims in perspective?

And yet none of us can see the code.... How's that for perspective?
 
Isn't it pretty obvious? Your hardware *does* suck. Truth may hurt, but it's still true.
A properly configured system would have 16x the AGP performance that your system has, and as such, would have negligible overhead on the total results. A factor of 16 is an awful lot. Your system is simply outside of the targeted configurations of this benchmark. A properly configured X300 card has no effect on the CPU results. And that's pretty much one of the slowest videocards available today.
That is the answer, deal with it.



First of all, it's not meant to be the perfect benchmark for crossbar/HT performance. It's just one of the side-effects one of the users ran into. The benchmark's purpose is to get the best possible performance out of various architectures available today (with this particular algorithm). In order to get the best out of the architectures, we need to study and understand these architectures. So if increasing the HT-bus speed means our inter-core communication will improve, and therefore the multithreaded parts of the algorithm will be more efficient, then that is valuable information.
But anyway, because of its purpose, namely benchmarking itself, it's implicitly valid on any system you run it on. In your case it proved that your configuration has such poor AGP performance that the graphics becomes the bottleneck. Valuable information.

Secondly, you're the one making outlandish claims here. Your system is running in an extremely crippled state. That's what makes benchmarks (yes, plural) useless... Not the benchmarks themselves.

Thirdly, servers with crappy 8 mb video adapters are outside of the scope of my application's usage. This is meant for visualization stations, some of which have more GPUs than they do CPUs. Not all multiprocessor systems are web/database servers, you know.
Pathetic that you need to have a go at me just because you can't properly configure your own system, and don't even have the slightest clue what my benchmark is about.

My claims about my hardware have all been nullified. No one's even talking about them anymore, but you seem to need to go back and point out that my hardware sucks. Which, again, has nothing to do with the invalidity of your benchmark.

You still haven't answered ANYONE's questions here. No source code, no legitimate numbers proving any point you're trying to make with any logical analysis. Excuses handed out left and right for these deficiencies, but no real matter added to the conversation. Flames left and right, and self-victimization are rampant, though. I'm still looking for some numbers to back up your outlandish claims.

Duby's more than one-upped you in that he's done a simple google search, I'm guessing on SRI or Crossbar, and it seems to have added more to the discussion than anything you've posted in this thread. Maybe the search was on "valid scientific data" or "meaningful statistical analysis". Clearly both are google searches you need to do, as flaming and a once-over qualitative analysis aren't cutting it for you here, and God knows they won't cut it in academia or a corporate environment. Plenty of people can write software. I know very few that can actually do it well and actually know what they're writing.

You have proven nothing, and no one has proven you not wrong. We're still waiting. Gesticulation and pontification are useless here. If you want to have attention paid to you, provide legitimate, indisputable numbers. Unfortunately, since your application utterly fails at being an objective benchmark for SRI and Crossbar, you'll have to go back to the drawing board for that. We'll keep waiting.
 
I think it is pretty obvious by now that the HT link, as long as it is not run at an incredibly slow 200MHz, is not a bottleneck, since results are virtually the same at anything 400MHz and above. We already know the cores don't use HT, but the crossbar, and there is enough evidence and other facts that have already been posted that suggest the crossbar is not directly related to HT speed. The issue is over IMO. HT@2GHz has plenty of bandwidth, and HT 3 is not going to give any tangible benefits on a single-socket machine, regardless of how many cores. You can continue to complain and say "but you didn't test it this way" over and over again, but there are PLENTY of single-threaded benchmarks out there that you can easily test, and if multithreaded benchmarks are not showing any difference, I don't see why a single-threaded one would.
 
You have not yet proven yourself correct either. Further, considering that your claims are based on a small piece of test software you wrote and are in direct contradiction to AMD's published claims, I think it is you who are under the burden of proof; not the other way around.

How so? You are getting this backwards. I never wrote this application to investigate this issue. Instead, one of the guys on another forum decided to experiment with the HT-link, and he came up with the whole hypothesis of HT-link affecting multithreading performance.
So there aren't any 'claims' as such. Instead, there are benchmarks that show a certain behaviour, and this is what we're trying to explain. The proof that it happens is already in the benchmarks.
I don't get why people hang on to what AMD says when the benchmarks are right in their face: HT-link speed *does* affect the results. I didn't make that up.
Now if you think that it's not the inter-core communication, but instead something else, then why don't you come up with a plausible hypothesis for that, so we can investigate.

While your application is a representation of a particular load, it is in no way indicative of every or even most loads CPUs are likely to see. The fact that AMD's chips do not perform well with your one particular multi-threaded test is quite a different thing from your implied claim that AMD CPUs do not offer on-chip full-speed inter-core communication. In fact, I don't know whether to consider your huge leap in conclusion to be funny or shocking.

Since in certain cases the Athlon pretty much gets beaten by a Pentium D in terms of gains from a single thread to multiple threads, I find it impossible to believe that AMD's chips do run at full speed. After all, how can they lose to an 800 MHz FSB, which has to be shared with memory and videocard access?
If you have an explanation, I'm dying to hear it...

I can't complain here. If you want to make qualitative claims or observations with some sample code, go right ahead. However, you seem to think your little snippet of code proves that AMD chips don't have full-clock links between cores. This simply isn't true. If you really are an academic, I encourage you to find and read the Software Optimization Guide for the K8, which spells out (in a slightly odd way) the K8's IPC links.

Well, give me some code that proves this. My code certainly indicates otherwise, and I wouldn't know of any other way to make the cores communicate. Obviously I've already studied the optimization guide, but it didn't give me any good explanations as to why my code runs the way it does... But I'm used to that; it was the same with my K7.
It's pretty simple... If AMD writes something, and code running on actual hardware displays otherwise, then the writing must not correspond to the actual hardware. There is no other explanation. We shouldn't go by what AMD writes, but by what the hardware is doing in practice.

So, stop taking stabs at my application. Either you explain what's happening, or you accept my explanation of it. All this "But AMD's docs say this-and-that" doesn't jibe with my results. Accept the facts; you can't deny the results.

Speaking of which... another issue that I cannot explain is that my code doesn't run faster in 64-bit: http://www.hardforum.com/showthread.php?t=1095839
At least, not on K8; it runs faster on Pentium D and Core2 Duo.
Explain that as well, while you're at it.
 
The issue is over IMO.

But it isn't. My code still performs extremely poorly on Athlon64, as poorly as on a Pentium D.
I need an explanation for that. If you think it's not the HT-speed, then it's gotta be something else, because the test results don't lie.

HT@2GHz has plenty of bandwidth, and HT 3 is not going to give any tangible benefits on a single-socket machine, regardless of how many cores. You can continue to complain and say "but you didn't test it this way" over and over again, but there are PLENTY of single-threaded benchmarks out there that you can easily test, and if multithreaded benchmarks are not showing any difference, I don't see why a single-threaded one would.

Quite simple: The only single-threaded benchmarks I've seen here are Everest and Doom3. They show no difference with HT-link.
If I am to accept that my application in single-threaded mode would do the same, without any actual tests, then my conclusion has got to be correct, since obviously we have quite a few results where the multithreaded version takes a hit from lower HT-speeds.
Now, if we test my application in single-threaded mode we can be 100% sure that the data flow from CPU to videocard is equal in both cases, so that part can't be affecting the HT-link's impact.

Which still leaves us without an explanation as to why single-threaded applications don't take a hit, while multithreaded applications *with shared read/write access memory* take a hit from a lower HT-speed.
The shared read/write access to memory is not a very common approach, so most multithreading applications won't be using it. On a multi-socket system, it's murder. I originally wrote that particular implementation mainly for Core2 Duo, because its shared cache would allow efficient sharing of memory.
Now in theory, if AMD's claims are correct, then Athlon's sharing should also be reasonably efficient, but as it turns out, it's just as inefficient as a Pentium D in practice.
Now, with the Pentium D I was expecting this, but with Athlon I was expecting something reasonably close to the Core2 Duo...
However, I now suspect that a system with two single-core Opterons would perform about the same as the dualcores do (as is the case with Pentium D)... Some test results on that would be nice.

If I'm right, then that would indicate that the Athlon is really no more 'native' than the original Pentium D: just a two-CPU system glued together on a single die, no specific advantage in terms of performance over a dual-CPU system.
Now I can understand that this may be unpleasant for the avid AMD-fan, but there is a possibility that this is the truth. Let's see those benchmarks, shall we?
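
To make concrete what I mean by 'inter-core communication' through shared read/write memory, here's a toy sketch anyone can compile (it has nothing to do with my actual engine code, it's just an illustration): two threads bounce a flag through a single shared variable, so every iteration forces a cache line to move from one core to the other, and the time per round trip gives a rough feel for how expensive that communication is.

// Toy sketch only, NOT the benchmark's code: two threads ping-pong a flag
// through shared memory, so each iteration is one inter-core round trip.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

int main()
{
    const int kRounds = 1000000;
    std::atomic<int> turn(0);   // shared read/write state, deliberately contended

    std::thread other([&] {
        for (int i = 0; i < kRounds; ++i) {
            while (turn.load(std::memory_order_acquire) != 1) { }  // wait for main
            turn.store(0, std::memory_order_release);              // hand it back
        }
    });

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kRounds; ++i) {
        turn.store(1, std::memory_order_release);                  // hand off
        while (turn.load(std::memory_order_acquire) != 0) { }      // wait for reply
    }
    auto t1 = std::chrono::steady_clock::now();
    other.join();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / kRounds;
    std::printf("approx. %.0f ns per inter-core round trip\n", ns);
    return 0;
}

On a shared-cache chip like the Core2 Duo that round trip stays on-die; on a Pentium D it has to go over the FSB. That's exactly the kind of difference I keep talking about.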
 
Your application running poorly on Athlon64s has nothing to do with this topic. This topic has to do with HT speeds and its effect on multi-core systems, and then evolved to HT link speed and crossbar link speed. That issue is over IMO and I'm not the only one who thinks so. Your app running poorly is another topic altogether.

We've done what you suggested with your app, and the results showed that an HT link speed of 400MHz or better makes no difference, because at that speed and above it is not a bottleneck. Everything I've seen here, combined with what I already know about the Athlon64's architecture and how it works, combined with the results I've seen with my own PC through experimental overclocking, is more than enough for me to draw valid conclusions about the topic. Of course if you cripple HT down to 1x you will see a performance hit; it is, after all, responsible for all the data that travels to and from the socket. Again, the fact that it runs poorly on A64s is, IMO, a separate issue.
 
And yet none of us can see the code.... How's that for perspective?

What code would you want to see then?
You realize that this application is far more than what you can see here?
It contains a complete 3d-engine with tons of features, code for exporting and importing 3dsmax scenes etc, all of which isn't used in this particular application, but rather hard to remove without breaking the rendering backend (the options screen at the start should indicate that there's a whole lot more features than this particular test application has use for).
I could give some snippets of certain parts of the algo, but I can't just give out the whole thing.
So specify what parts you'd like to see, then perhaps we can work something out.

However, since the algo itself is reasonably well-known and well-documented (as I said, MarchingCubes), it shouldn't be too hard to figure out what it does even without seeing the code, especially after my explanation earlier on.

So, be specific about *what* part of the code you want to see, and *why*.
If you don't even seem to grasp the MarchingCubes algo in the first place, I see little reason to give out any of my code, because you wouldn't be able to understand much of it anyway. My code just juggles with some tables and some comparisons. It doesn't make a whole lot of sense unless you know the underlying algo.
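
Just to give an idea of what I mean by 'juggling with tables', here's a toy illustration of the Marching Cubes core (generic textbook stuff, not my engine code): each grid cell samples its 8 corners, packs the inside/outside tests into an 8-bit case index, and that index selects a row in the standard 256-entry edge/triangle tables, which tell you which triangles to emit. The tables themselves are omitted here.

// Toy illustration only (not my engine code) of the table-driven core of
// Marching Cubes: build the 0..255 case index for one cell from its corner
// densities; that index is the row number into the standard edge/tri tables.
#include <cstdio>

int cubeCaseIndex(const float corner[8], float isoLevel)
{
    int index = 0;
    for (int i = 0; i < 8; ++i)
        if (corner[i] < isoLevel)
            index |= (1 << i);          // one bit per corner below the surface
    return index;                       // selects a row in edgeTable/triTable
}

int main()
{
    // A cell with exactly one corner below the iso-level: a single-triangle case.
    float corner[8] = { 0.2f, 0.9f, 0.9f, 0.9f, 0.9f, 0.9f, 0.9f, 0.9f };
    std::printf("case index = %d\n", cubeCaseIndex(corner, 0.5f));  // prints 1
    return 0;
}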
 
Everything I've seen here, combined with what I already know about the Athlon64's architecture and how it works, combined with the results I've seen with my own PC through experimental overclocking, is more than enough for me to draw valid conclusions about the topic.

Probably not... Re-read my previous post... It's the *read/write* part of the shared memory that's quite unique. Multi-core systems have traditionally also been multi-CPU systems, where sharing memory this way is a big no-no, so I doubt there's much, if any, multithreaded code in existence that does this, unlike mine.

Again, the fact that it runs poorly on A64s is, IMO, a separate issue.

While I don't share your opinion, even so, it's a bit late to open a separate topic for it, since the discussion has been going on for quite some time now.
Perhaps a moderator would be kind enough to split the thread, but I wouldn't blame them if they didn't bother.
 
[. . .] I see little reason to give out any of my code, because you wouldn't be able to understand much of it anyway.
I want to start by saying that your ability to code is not unique, and I would not be surprised if there are many coders on this forum (including me). It can be hard to be truly offended here on the forums, but you just hit it on the head for me. I code for a living, as do many other people here. Your assumption that you are the only one is stupid and quite offensive. I'm going to try to respond in a civil manner, but you sure don't make it easy with such comments.

What code would you want to see then?
You realize that this application is far more than what you can see here?
It contains a complete 3d-engine with tons of features, code for exporting and importing 3dsmax scenes etc, all of which isn't used in this particular application, but rather hard to remove without breaking the rendering backend (the options screen at the start should indicate that there's a whole lot more features than this particular test application has use for).
I could give some snippets of certain parts of the algo, but I can't just give out the whole thing.
So specify what parts you'd like to see, then perhaps we can work something out.
To start with, if you want to use your code for a purpose even resembling real benchmarking on this forum, you will either need to be one of the big names or you will need to release code. Note that many of the big names even release the code, so you have a lot of work to do.

First, profile the code with all inlining disabled but with all other optimizations enabled. This will cause a speed hit, but is about as good as you can get while ensuring that all the symbols are in the executable and the separation of the code segments is preserved. Then, post the results with per-symbol percentage time. If you cannot do this with real optimized code, then you are wasting everyone's time. If you need software support, I recommend Linux and OProfile. If Windows can't do this, too bad.

Now that you have a profile built, I want to see the full source of any function with over 5% relative CPU time, and some display showing the distribution of arguments actually used to call the function. With this, one could actually check many of your claims if one was so inclined.

Before you complain that this is too hard or too much, I should say it isn't a big deal. It would take less time than it would take you to defend all the posts you made, and is something you should be doing anyway. All work-related projects go through a profiling stage where I look at much more than just run time. Typically I profile for time, L2 cache misses, and L1D and L1I cache misses, at a minimum.

However, since the algo itself is reasonably well-known and well-documented (as I said, MarchingCubes), it shouldn't be too hard to figure out what it does even without seeing the code, especially after my explanation earlier on.
I'm not saying this to be mean in any way, but this sounds like something a math-based CS student would say. Many people from this background tend to focus on theory. This is a very important aspect of CS, so I don't want to belittle it, but at the same time, it is important to realize that three different people implementing the same algorithm will typically produce code that performs very differently. If you meant to imply that knowing you used "Marching Cubes" could tell us anything about your implementation, you are very mistaken.

Probably not... Re-read my previous post... It's the *read/write* part of the shared memory that's quite unique. Multi-core systems have traditionally also been multi-CPU systems, where sharing memory this way is a big no-no, so I doubt there's much, if any, multithreaded code in existence that does this, unlike mine.
Yes, the memory access patterns you describe (reading and writing from two or more threads on a single memory segment) are typically not recommended, especially for K8. Again, you can find this in the Software Optimization Guide for K8 (many of the same recommendations apply for K10 as well). If you are interested, you can see this in Intel documentation with regard to HyperThreading as well; Intel recommends threads use different data segments to get the most out of HT.

That said, most naively written multi-threaded software does exactly what you seem to describe, as guaranteeing threads have their own memory to work with takes extra effort and complexity. Simply reading and writing from the same shared memory pool on each thread is typically simpler to implement. Again, I'm not trying to bash your code so much as to let you know that read/write loops to a single shared memory are common in MT code. Your "unlike mine" comment sounds silly when read by anyone with multi-threaded coding experience.
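
For what it's worth, the usual advice boils down to giving each thread its own padded chunk of data and only combining results at the end, along these lines (a generic sketch, not anyone's real code; the 64-byte cache line size is an assumption):

// Generic illustration: each worker accumulates into its own cache-line-padded
// slot, so the threads never write to the same cache line, and the partial
// results are combined once at the end.
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

struct PaddedCounter {
    long value = 0;
    char pad[64 - sizeof(long)];   // pad to one cache line (assumed 64 bytes)
};

int main()
{
    const unsigned numThreads = std::max(2u, std::thread::hardware_concurrency());
    const long perThread = 10000000;

    std::vector<PaddedCounter> partial(numThreads);   // one padded slot per thread
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < numThreads; ++t)
        workers.emplace_back([&partial, t, perThread] {
            for (long i = 0; i < perThread; ++i)
                partial[t].value += 1;                // private slot: no ping-pong
        });

    for (auto& w : workers) w.join();

    long total = 0;
    for (const auto& p : partial) total += p.value;   // combine once, afterwards
    std::printf("total = %ld\n", total);
    return 0;
}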
 
Scali,

I also had a couple of comments on a more personal level. I'm truly sorry about the way everyone has been treating your code. I know it can be quite frustrating to have so many objections to a piece of code you have faith in. I've been in the exact same situation before. I'm sorry that your test isn't getting the "that's cool" factor it probably deserves.

However, I've learned from past mistakes. You'll notice that the last few times I tried to use code to demonstrate something with CPUs, I posted source. It wasn't the best I've written, and it may have even been messy or a bit lame. The point is that, my coding pride aside, as long as I provide source, my claims can be verified or thrown out. From where many of your opponents sit, there is simply no way to verify any of your claims. Considering this, you should expect the harsh treatment you have received.

Why don't you release the source? Someone may use your code in some obscure project you'll probably never see or work with. This will probably never impact your life or financial situation in any meaningful way. While I understand your concern and right to hold on to your work, I see only gain for you by giving the source out for a couple of days here on the forums. Just as I have done in the past, posting a link to source and making it available for a couple of days exposes you to little real risk.
 
I want to start by saying that your ability to code is not unique, and I would not be surprised if there are many coders on this forum (including me). It can be hard to be truly offended here on the forums, but you just hit it on the head for me. I code for a living, as do many other people here. Your assumption that you are the only one is stupid and quite offensive. I'm going to try to respond in a civil manner, but you sure don't make it easy with such comments.

First of all, I wasn't referring to you, but to duby. Secondly, there's a huge difference between one 'coder' and another. I was specifically referring to the MarchingCubes algo. Unless you studied that first, my code won't make a lot of sense.
That's not offensive, it's just a simple truth. I wouldn't understand much of certain algos either if I didn't first study the theory behind them.
I don't assume I'm the only coder here, but I think not a lot of coders here have implemented the MarchingCubes algo before.

If you meant to imply that knowing you used "Marching Cubes" could tell us anything about your implementation, you are very mistaken.

No, I mean to imply that you need to first know the algo, then you can ask specific questions about my implementation, which I can answer and provide sourcecode for.

Yes, the memory access patterns you describe (reading and writing from two or more threads on a single memory segment) are typically not recommended, especially for K8. Again, you can find this in the Software Optimization Guide for K8 (many of the same recommendations apply for K10 as well).

Obviously... then again, why not? If it has fast inter-core communication, it shouldn't be an issue. See the contradiction?
At any rate, I never meant it to be optimal for K8; it was an experiment on my Core2 Duo to see how fast its shared cache was. Didn't expect a K8 to be *slower* than a single-threaded algo... or (relatively) slower than a Pentium D for that matter.

If you are interested, you can see this in Intel documentation with regard to HyperThreading as well; Intel recommends threads use different data segments to get the most out of HT.

Not true. This particular code runs pretty well on HT, because HT has a shared cache. Ironically enough it runs better than on a Pentium D.

That said, most naively written multi-threaded software does exactly what you seem to describe, as guaranteeing threads have their own memory to work with takes extra effort and complexity. Simply reading and writing from the same shared memory pool on each thread is typically simpler to implement. Again, I'm not trying to bash your code so much as to let you know that read/write loops to a single shared memory are common in MT code. Your "unlike mine" comment sounds silly when read by anyone with multi-threaded coding experience.

My code does it a whole lot more though. My code *relies* on that fast inter-core communication. It isn't practical on K8 or Pentium D, so I doubt that any code uses it... It makes it slower.
 
Why don't you release the source? Someone may use your code in some obscure project you'll probably never see or work with. This will probably never impact your life or financial situation in any meaningful way. While I understand your concern and right to hold on to your work, I see only gain for you by giving the source out for a couple of days here on the forums. Just as I have done in the past, posting a link to source and making it available for a couple of days exposes you to little real risk.

I already explained it. This is a HUGE 3d engine. I can't release that. It also serves no purpose for this particular benchmark. But I can't just remove the code, because the algo is deeply rooted in the whole object structure of the engine.
Besides, I don't mean to be arrogant, but my experience is that hardly anyone here would even understand most of my code anyway, so why bother giving it out?
It's not exactly trivial code.
Now if someone would prove that they understand the MarchingCubes algo, and have some specific questions about my particular implementation, then I may be persuaded to show that person the relevant parts of the code.
As it stands, I simply believe nobody would be able to do anything with the code anyway. Go ahead and prove me wrong.
 
Since in certain cases the Athlon pretty much gets beaten by a Pentium D in terms of gains from a single thread to multiple threads, I find it impossible to believe that AMD's chips do run at full speed. After all, how can they lose to an 800 MHz FSB, which has to be shared with memory and videocard access?
If you have an explanation, I'm dying to hear it...

Ah, thank you. I don't particularly enjoy the mild flames and arguing about side issues, and I really appreciate it when you are able to come out and directly make the claim I have a problem with. Throw all other complaints I have ever made about your posts out the window: this is the one I have a real problem with.

Now, this may be too personal, and it may get me banned from the site. Frankly, I don't really care; this simply needs to be said. I think you of all people, especially after claiming things such as, "Of course it also helps that I am an experienced assembly and Direct3D programmer, so I know a lot about what goes on inside CPUs and videocards [. . .]," would understand that the performance of an on-chip bus is very different from the presence of such a bus.

I hate to make car analogies, but here goes. You seem to be suggesting that if a Honda were to lose to a Toyota in a race, it would suggest that the Honda has a smaller engine. Note that the brand names are not important, only the situation. The fallacy should be immediately obvious. Everyone knows the size of the Honda engine is just as large as the size of the Toyota engine, just as everyone (but you) knows the K8 has an on-die bus. The point is that there are many other design considerations: tire size/design, gas octane/design, transmission, etc, etc, etc. All of these "secondary" things impact performance in a very real way.

Do you see what I'm getting at here? You make an astronomical leap in logic when you claim performance == hardware specs. Again, post all day that AMD sucks and the K8 has slow and inefficient core <-> core links: you'll see silence from me. That is because such claims are qualitative, subjective and you have near infinite leeway here. However, your hardware claims are simply untrue; this is exactly the kind of game that gets you in my sig (yes, I realize you don't care). You simply state things that are false and then fight aggressively to defend them with poor logic.

Take some formal logic classes; it would help you. On the very first day I think you'll learn that given the simple assumption that P -> Q, there is no way to logically arrive at the conclusion that Q -> P in this context. These operators are not always (or even typically) reversible. A bit of a rant here, but this is always my problem with people, in the US, hell the whole world. P -> Q does NOT mean Q -> P, and it is 100% maddening that so many seemingly intelligent people base their whole foundation of meaning on such blatant misunderstandings of logic.

----

As for your source code, I told you exactly what I wanted. I want the code for all functions with over 5% CPU time; hell, 10% would make me happy. This is not an unreasonable request given the fact that you offered to release portions upon request. Further, my understanding of Marching Cubes is irrelevant for the purposes of understanding the runtime behavior of your app and certainly irrelevant for verifying that the code is written in a manner that is not specifically designed to be unfair.

That aside, Marching Cubes is not a difficult algorithm. It is quite simple and anyone worth their workstation should be able to code a base implementation in a day or less. As always the bitch is in the details, though I'm not sure I would even go so far as to call adaptive subdivision a bitch. This is certainly not more complicated than an implementation of an octree for a display graph or physics simulation. Considering that you don't provide a list of all the attention to detail your algorithm and implementation is based upon past the name of the primary algorithm, I find your assertion that no one could understand it laughable, as anyone with a BS or higher in CS should be able to handle marching cubes without significant problems. Hell, anyone with a math degree should be able to get it. It is even quoted as being "a relatively obvious solution to the surface-generation problem."

So, tell me again. Tell me right to my motherfscking face that you are a god and that none other than you are able to understand marching cubes. Tell me right to my face that you think I'm unable to understand it and that your code would be gibberish to me. Come on, what's holding you up? You've made the claim in general, so as to insult everyone here. Why not stop being a pansy bitch? Why not actually make the claim to me directly, as you already have, indirectly, several times? Fscking prick.
 
I hate to make car analogies, but here goes. You seem to be suggesting that if a Honda were to lose to a Toyota in a race, it would suggest that the Honda has a smaller engine. Note that the brand names are not important, only the situation. The fallacy should be immediately obvious. Everyone knows the size of the Honda engine is just as large as the size of the Toyota engine, just as everyone (but you) knows the K8 has an on-die bus. The point is that there are many other design considerations: tire size/design, gas octane/design, transmission, etc, etc, etc. All of these "secondary" things impact performance in a very real way.

Do you see what I'm getting at here? You make an astronomical leap in logic when you claim performance == hardware specs. Again, post all day that AMD sucks and the K8 has slow and inefficient core <-> core links: you'll see silence from me. That is because such claims are qualitative, subjective and you have near infinite leeway here. However, your hardware claims are simply untrue; this is exactly the kind of game that gets you in my sig (yes, I realize you don't care). You simply state things that are false and then fight aggressively to defend them with poor logic.

Whatever, that's just grasping at straws.
The real point I've repeatedly made is that the inter-core communication is not efficient, and apparently selecting different HT-speeds has some kind of impact on the whole thing.
Now I can be perfectly happy with the claim "AMD has a full-speed crossbar, but it just is implemented so poorly that it doesn't perform better than the Pentium D's FSB solution".
That's basically no different than what I'm saying.

I'll ignore the rest, you can thank me for that.
 
Whatever, that's just grasping at straws.
The real point I've repeatedly made is that the inter-core communication is not efficient, and apparently selecting different HT-speeds has some kind of impact on the whole thing.
Now I can be perfectly happy with the claim "AMD has a full-speed crossbar, but it just is implemented so poorly that it doesn't perform better than the Pentium D's FSB solution".
Good, I'm glad to see you came around and found value in making factual, accurate claims. I really appreciate it, thank you. (Though, I would add "in certain cases" to your quoted claim, but I'm happy I got any change at all).
That's basically no different than what I'm saying.
No, it is very different than what you have been saying for quite some time now. Again, I appreciate the fact that you have rephrased your claim to better align with reality.

I'll ignore the rest, you can thank me for that.
Eh, as it eases my uncomfortable anger, I will thank you: thanks. However, I still feel my ranting and response is justified given your general stuck-up attitude.
 