Why is the i7 so much faster?

ZenDragon

[H]ard|Gawd
Joined
Oct 22, 2000
Messages
1,698
So I've been out of the loop for a while. Still running a pre-AM2 Athlon 64 X2, in fact. I've been reading reviews and such on both Intel's new i7 platform as well as AMD's platform. And to be frank, I'm a little confused as to why the i7 is SO much faster than its competition! I hear everybody touting the on-chip memory controller and such, but if I'm not mistaken the Phenom II has an on-chip memory controller as well. Or maybe I'm just not understanding the distinction between the two? Pardon my ignorance here, but I just don't understand the technical reasons why the i7 is in some cases over twice as fast as AMD.

Just to clarify, I'm not an AMD fanboy or anything; I've actually been seriously considering a Core 2 soon as an upgrade over my existing hardware, but even the Core 2 is completely trampled by the i7. I did a little searching around the net to try to get a comparison of technologies and not just benchmarks, but good, clear information (without all the marketing hype) seems to be difficult to find. Though it's possible I'm just not using the right search terms. If somebody could point me towards some good side-by-side comparisons and info regarding this, I would greatly appreciate it. I might very well be convinced to wait a couple more months for i7 prices to come down a bit before upgrading at all.

Thanks!!!
 
Basically, the memory architecture on the new Intel design is head and shoulders above anything else. An integrated memory controller, with more cache where it's important, means there's more bandwidth and less latency.
 
I did a little searching around the net to try to get a comparison, but good, clear information (without all the marketing hype) seems to be difficult to find. Though it's possible I'm just not using the right search terms.
That's hard to understand considering how much information there is, unless you're using some very strange search criteria. I suggest you try something sensible like "intel i7 review" or similar.

If somebody could point me towards some good side-by-side comparisons and info regarding this, I would greatly appreciate it.
Almost every single hit from the aforementioned search will provide you with a review of the Intel i7. Almost every decent review will have lots of side-by-side comparisons. Not going to cut and paste the links for you, as you simply need to run Google yourself.
 
^^ exactly

And a new core architecture. AMD has been behind since the C2 started, and the PII is only now coming on par with the C2Q.

Intel doesn't want another A64 deal where they get caught with their pants down.

I figure in a year or so AMD may be back up to competing directly with Intel, with the new Abu Dhabi investors.
 
Basically, the memory architecture on the new Intel design is head and shoulders above anything else. An integrated memory controller, with more cache where it's important, means there's more bandwidth and less latency.

So is it more of a "platform" advantage? Meaning, is it the combination of the i7 processor and the new capabilities of the X58 (such as the triple-channel memory) that is really allowing it to perform so much better, or is it just the processor that is providing the primary advantage?
 
That's hard to understand considering how much information there is, unless you're using some very strange search criteria. I suggest you try something sensible like "intel i7 review" or similar.


Almost every single hit from the aforementioned search will provide you with a review of the Intel i7. Almost every decent review will have lots of side-by-side comparisons. Not going to cut and paste the links for you, as you simply need to run Google yourself.

Well, most of the info I can find falls into one of two categories: benchmark comparisons (which I don't really need to see, as the i7 obviously dominates) or marketing garbage. I have looked at all the specifications of each, and I understand that, but what I'm trying to understand is exactly which "new feature" is giving it such an enormous advantage. Which is what I think OldScotch was touching on, though I wouldn't mind a little more detail.
 
I have the answer-- but let's start with a little history to establish context.


The Pentium III and Athlon K8 micro-architectures could each decode 3 program instructions in a single clock cycle, and the back-end units execute these instructions out-of-order depending on which execution units are available.


The Pentium 4 was a different beast in that it stored already-decoded instructions in its trace cache, so even with just 1 instruction decoder it could still execute multiple instructions per clock. The Pentium 4, however, had few execution units-- two integer units and one floating-point unit. Compared to the 3 integer units and 3 floating-point units of the K8, the P4 back-end seemed ridiculously slow, but it made up for it with higher clock rates.


The Intel Core micro-architecture abandoned the trace cache and reverted to the instruction decoder of the Pentium III, then added a 4th decoder port. However, just as on the Pentium III, the new Core micro-architecture was limited to decoding 1 complex instruction per clock-- the remaining decoders could handle only simple instructions. As program software is a mixture of simple and complex instructions, this kept the overall number of instructions that could be processed well below the theoretical maximum-- only highly optimized software could approach that threshold.



So Nehalem takes the instruction decoder from Core 2 and adds a special pre-decode buffer. This little unit can identify certain complex instructions ahead of time and reduce them to simple instructions before they enter the main instruction decoding pipeline. This greatly improves the overall number of instructions that can be decoded in a single clock cycle. Intel calls this 'macro-op fusion'.
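
As a rough illustration (mine, not from the post above), here's the kind of compare-and-branch pattern that macro-op fusion targets. The function name is made up for the example, and whether the pairs actually fuse depends on the compiler's output and the exact CPU:

```c
/* Hypothetical illustration of macro-op fusion candidates. A compiler
 * typically turns each loop/branch test below into a cmp (or test)
 * followed by a conditional jump; a fusion-capable decoder can treat
 * such a pair as a single macro-op, freeing up a decoder slot. */
#include <stdio.h>

static int count_below(const int *a, int n, int limit)
{
    int count = 0;
    for (int i = 0; i < n; i++) {   /* cmp i,n + jl: a fusable pair */
        if (a[i] < limit)           /* cmp a[i],limit + jge: another */
            count++;
    }
    return count;
}

int main(void)
{
    int a[] = { 3, 9, 1, 7, 5 };
    printf("%d\n", count_below(a, 5, 6));   /* prints 3 */
    return 0;
}
```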



Instruction fetch-decode bandwidth is one of the primary factors in performance-- think of the instruction decoders as cash registers at a fast food chain. The order is taken and sent to the back-end, the execution units, which would be analogous to the kitchen. And in this case Intel added another execution unit-- Port 5-- to handle the increased bandwidth. This execution port does have certain limitations-- namely, it cannot handle floating-point or SSE multiply/divide instructions.


There you have it; Nehalem can process more instructions per clock cycle. And-- AMD is only one instruction decoder away from being competitive :)
 
I have the answer-- but let's start with a little history to establish context.


The Pentium III and Athlon K8 micro-architectures could each decode 3 program instructions in a single clock cycle, and the back-end units execute these instructions out-of-order depending on which execution units are available.


The Pentium 4 was a different beast in that it stored already-decoded instructions in its trace cache, so even with just 1 instruction decoder it could still execute multiple instructions per clock. The Pentium 4, however, had few execution units-- two integer units and one floating-point unit. Compared to the 3 integer units and 3 floating-point units of the K8, the P4 back-end seemed ridiculously slow, but it made up for it with higher clock rates.


The Intel Core micro-architecture abandoned the trace cache and reverted to the instruction decoder of the Pentium III, then added a 4th decoder port. However, just as on the Pentium III, the new Core micro-architecture was limited to decoding 1 complex instruction per clock-- the remaining decoders could handle only simple instructions. As program software is a mixture of simple and complex instructions, this kept the overall number of instructions that could be processed well below the theoretical maximum-- only highly optimized software could approach that threshold.



So Nehalem takes the instruction decoder from Core 2 and adds a special pre-decode buffer. This little unit can identify certain complex instructions ahead of time and reduce them to simple instructions before they enter the main instruction decoding pipeline. This greatly improves the overall number of instructions that can be decoded in a single clock cycle. Intel calls this 'macro-op fusion'.



Instruction fetch-decode bandwidth is one of the primary factors in performance-- think of the instruction decoders as cash registers at a fast food chain. The order is taken and sent to the back-end, the execution units, which would be analogous to the kitchen. And in this case Intel added another execution unit-- Port 5-- to handle the increased bandwidth. This execution port does have certain limitations-- namely, it cannot handle floating-point or SSE multiply/divide instructions.


There you have it; Nehalem can process more instructions per clock cycle. And-- AMD is only one instruction decoder away from being competitive :)


Very interesting :) Thank you very much, that is precisely the explanation I was looking for.
 
It's not only decoding. Basically the i7 takes the Core uarch and tweaks it in every way possible to squeeze out the last ounce of performance. On top of that, Intel improved the uncore part tremendously: an IMC and extremely fast serial point-to-point links. They created a very well balanced, good all-around performer.

RWT is a must-read if you want to understand CPU and system uarch.

http://www.realworldtech.com/page.cfm?ArticleID=RWT040208182719
 
That is not the reason. AMD has proven this by kicking the Pentium 4's ass with the Athlon 64.

Well, back then it wasn't that simple. AMD got really lucky there, because Intel managed to piss off so many DEC engineers into leaving after buying it, so AMD jumped on the opportunity to pick up the best engineers. Dirk Meyer, a former DEC engineer, worked on the K7 architecture, and another former DEC engineer, Jim Keller, worked on the K8 arch. I don't really know the state of AMD's engineering talent pool now, but judging from the Phenom CPUs it's probably not in tip-top shape.
 
Well, back then it wasn't that simple. AMD got really lucky there, because Intel managed to piss off so many DEC engineers into leaving after buying it, so AMD jumped on the opportunity to pick up the best engineers. Dirk Meyer, a former DEC engineer, worked on the K7 architecture, and another former DEC engineer, Jim Keller, worked on the K8 arch. I don't really know the state of AMD's engineering talent pool now, but judging from the Phenom CPUs it's probably not in tip-top shape.

Well, that proves my point. It is about the talent working for you more than it is the sheer money of your company. However, deeper pockets mean a larger talent pool and access to newer emerging technologies that your competitor may not be able to purchase, license, or develop.
 
I have the answer-- but let's start with a little history to establish context.


The Pentium III and Athlon K8 micro-architectures could each decode 3 program instructions in a single clock cycle, and the back-end units execute these instructions out-of-order depending on which execution units are available.


The Pentium 4 was a different beast in that it stored already-decoded instructions in its trace cache, so even with just 1 instruction decoder it could still execute multiple instructions per clock. The Pentium 4, however, had few execution units-- two integer units and one floating-point unit. Compared to the 3 integer units and 3 floating-point units of the K8, the P4 back-end seemed ridiculously slow, but it made up for it with higher clock rates.


The Intel Core micro-architecture abandoned the trace cache and reverted to the instruction decoder of the Pentium III, then added a 4th decoder port. However, just as on the Pentium III, the new Core micro-architecture was limited to decoding 1 complex instruction per clock-- the remaining decoders could handle only simple instructions. As program software is a mixture of simple and complex instructions, this kept the overall number of instructions that could be processed well below the theoretical maximum-- only highly optimized software could approach that threshold.



So Nehalem takes the instruction decoder from Core 2 and adds a special pre-decode buffer. This little unit can identify certain complex instructions ahead of time and reduce them to simple instructions before they enter the main instruction decoding pipeline. This greatly improves the overall number of instructions that can be decoded in a single clock cycle. Intel calls this 'macro-op fusion'.



Instruction fetch-decode bandwidth is one of the primary factors in performance-- think of the instruction decoders as cash registers at a fast food chain. The order is taken and sent to the back-end, the execution units, which would be analogous to the kitchen. And in this case Intel added another execution unit-- Port 5-- to handle the increased bandwidth. This execution port does have certain limitations-- namely, it cannot handle floating-point or SSE multiply/divide instructions.


There you have it; Nehalem can process more instructions per clock cycle. And-- AMD is only one instruction decoder away from being competitive :)


You do know that the K8s were the first-generation A64s, right? You can't compare them to a P3 in any way. The K7s (Durons, Athlon/Athlon XP) were the P3/P4 combatants until the K8 stepped in with the x86-64 arch we know and love today.

The K7 did up to 6 instructions per clock. That was the whole XP selling point: do more with a slower clock speed, where Intel was trying to do less per clock at a higher clock speed. The K8s continued this fashion.
 
The K7 did up to 6 instructions per clock. That was the whole XP selling point: do more with a slower clock speed, where Intel was trying to do less per clock at a higher clock speed. The K8s continued this fashion.


Not exactly; the K8 only executed 3 instructions per clock because there were only 3 instruction decoders. It had 6 execution ports-- 3 integer/ALU and 3 floating-point/FPU ports-- so yes, it could execute a combination of up to 6 instructions on the back-end if the reservation stations got clogged up, but the maximum sustained throughput is in fact just 3 instructions per clock. Same as the Pentium III and K7, and even the Phenom II.
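
If you want to poke at that ceiling yourself, here's a crude sketch of a timestamp-counter measurement. It assumes an x86 machine with GCC or Clang, and the numbers will bounce around with turbo, power states, and whatever the compiler does with the loop-- it's illustrative, not a proper benchmark:

```c
/* Crude IPC sketch: independent adds timed with the timestamp counter.
 * Illustrative only; build with -O1 and inspect the asm, or the
 * compiler may fold the whole loop away. */
#include <stdio.h>
#include <x86intrin.h>              /* __rdtsc(), GCC/Clang on x86 */

#define ITERS 100000000UL

int main(void)
{
    unsigned long a = 0, b = 0, c = 0;  /* three independent chains */

    unsigned long long start = __rdtsc();
    for (unsigned long i = 0; i < ITERS; i++) {
        a += 1;     /* these three adds don't depend on each other, */
        b += 2;     /* so a 3-wide core can handle them in parallel */
        c += 3;
    }
    unsigned long long cycles = __rdtsc() - start;

    /* roughly 5 instructions per iteration: 3 adds + loop inc/branch */
    printf("approx IPC: %.2f (a=%lu b=%lu c=%lu)\n",
           5.0 * ITERS / (double)cycles, a, b, c);
    return 0;
}
```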



Interestingly enough, AMD's got their groove back with clock speed on the Phenom II. If they would just bother to add a 4th instruction decoder, they'd be competitive with Core!
 
That is not the reason. AMD has proven this by kicking the Pentium 4's ass with the Athlon 64.

Outperforming a NetBurst processor is kinda like winning the Special Olympics... somewhere less than noteworthy.
 
Outperforming one that was almost 1GHz faster is something :D considering the speeds the P4 arch hit vs AMD... and they still gave 'em a good whooping~
 
It's not only decoding. Basically the i7 takes the Core uarch and tweaks it in every way possible to squeeze out the last ounce of performance. On top of that, Intel improved the uncore part tremendously: an IMC and extremely fast serial point-to-point links. They created a very well balanced, good all-around performer.

RWT is a must-read if you want to understand CPU and system uarch.

http://www.realworldtech.com/page.cfm?ArticleID=RWT040208182719

Good read, thanks! :) Though 80% of that stuff is way over my head! haha
 
Outperforming one that was almost 1GHz faster is something :D considering the speeds the P4 arch hit vs AMD... and they still gave 'em a good whooping~

I agree. Kicking the Pentium 4's ass was noteworthy.
 
I agree. Kicking the Pentium 4's ass was noteworthy.
Why? Here we have a processor whose performance is abysmally poor. Considering a processor that exceeds that low level of performance to be "noteworthy" is setting the bar a little low, don't you think?
Maybe I'm unfairly biased, but I thought the entire NetBurst processor line, from beginning to end, was so unimpressive that I kept right on using my PIII-S @ 1.6GHz until the Core 2 Duo line came out.
 
Why? Here we have a processor whose performance is abysmally poor. Considering a processor that exceeds that low level of performance to be "noteworthy" is setting the bar a little low, don't you think?
Maybe I'm unfairly biased, but I thought the entire NetBurst processor line, from beginning to end, was so unimpressive that I kept right on using my PIII-S @ 1.6GHz until the Core 2 Duo line came out.

It really is a matter of perspective. The Pentium 4 line wasn't that bad; it was that the Athlon 64 was just that good. That's the way I see it. The early Socket 423 Pentium 4s weren't all that impressive, but they were faster than the Pentium III due to clock speed alone. That is why they were better. The later Socket 478 Northwood processors were great, and the Athlon 64 was better.
 
It really is a matter of perspective. The Pentium 4 line wasn't that bad; it was that the Athlon 64 was just that good. That's the way I see it. The early Socket 423 Pentium 4s weren't all that impressive, but they were faster than the Pentium III due to clock speed alone. That is why they were better. The later Socket 478 Northwood processors were great, and the Athlon 64 was better.

I just wonder what the world would have been like had they released Pentium 3s at 3.6GHz :)
 
I just wonder what the world would have been like had they released Pentium 3s at 3.6GHz :)

Given its IPC, I think it would have been more than a match for the Athlon 64 with that clock speed advantage.
 
Given its IPC, I think it would have been more than a match for the Athlon 64 with that clock speed advantage.

I strongly believe it would have kicked the Athlon 64's ass! haha..

Maybe AMD would have come up with a better A64 to match it.
 
They did (with a little overclocking). They called it "Core 2"

Well, yeah I mean, back then =)

We wouldn't have had this P4 mess, and a lot of time would have been saved. :)
We might actually be like a generation ahead.
 
PIIIs at the time of the P4's introduction were at 180nm. The PIII-Ts (after the P4's introduction) were 130nm. They couldn't get the PIII to clock to 1.5GHz, much less 3.6, on the fab process they were using. We got Core 2s starting at what? 65nm? Slight difference.
 
So Nehalem takes the instruction decoder from Core 2 and adds a special pre-decode buffer.

Weren't you the guy trying to argue with me that it's not a pre-decode buffer? :)

Anyway, your answer isn't entirely correct.

Micro-op and macro-op fusion are also present in the Core2 (and Core Solo/Duo; the Pentium M has micro-op fusion only).
Core i7 does improve it a bit, because on Core2, macro-op fusion only worked in 32-bit mode. Now it works in 64-bit mode as well. But since most benchmarks are done in 32-bit mode, that cannot be an explanation for Core i7's performance.
It also adds a few extra instruction pairs that it can 'fuse', that may account for a few % improvement here and there.

What Core i7 also has is a 'loop stream' feature. Basically it can detect loops in the code and store the decoded instructions, much like the Pentium 4 did with its trace cache.
This means loops (which are usually the bottleneck in a program) can be executed more efficiently; a sketch of the kind of loop that could benefit is below. But this mainly makes sense because of the point that follows.
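
As a hypothetical illustration (mine, not from the post): a tiny, hot inner loop like this is the sort of code such a feature can serve from already-decoded instructions. Whether a real loop qualifies depends on its size in the machine code, not the C source:

```c
/* Loop-stream-friendly shape: a small, hot body run many times.
 * A loop-stream style feature can replay the already-decoded uops
 * instead of re-fetching and re-decoding the same few instructions
 * on every iteration. */
float dot(const float *x, const float *y, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```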

The main things that make Core i7 so fast are the cache and onboard controller.
The execution core is nearly the same as the Core2. This already was a very efficient execution unit. It may have been tweaked a bit here and there, like with the above features (and also better branch prediction and such)... But those only affect the performance marginally.
The biggest reason is that the cache and onboard controller are now able to feed the execution units much better (data is far more important than instructions, especially when SSE is involved). With the Core2, you would rarely use more than two ALUs at a time, even though it had four onboard. Core i7 is able to make better use of those ALUs.
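
Here's a small sketch of that data-feeding point, using SSE intrinsics (the function itself is made up for illustration). Every instruction in the loop moves 16 bytes, so it lives or dies on how fast the caches and memory controller can deliver operands, not on how many ALUs the core has:

```c
/* Each SSE instruction below consumes or produces 16 bytes, so the
 * bottleneck is operand delivery (caches, memory controller), not
 * the number of execution units. n is assumed to be a multiple of 4
 * for brevity. */
#include <xmmintrin.h>   /* SSE intrinsics */

void add_arrays(float *dst, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);              /* 16 bytes in */
        __m128 vb = _mm_loadu_ps(&b[i]);              /* 16 bytes in */
        _mm_storeu_ps(&dst[i], _mm_add_ps(va, vb));   /* 16 bytes out */
    }
}
```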

But even still, the very nature of x86 code makes it very dependent, and therefore it's hard to keep all execution units busy.
This is why HyperThreading was introduced. Two threads are two separate streams of instructions that are independent of each other by default. By feeding two threads into a single core, you have a much larger selection of independent instructions to execute, allowing you to make more efficient use of your execution units.
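
A minimal sketch of the "two independent streams" idea, using plain POSIX threads (nothing i7-specific; build with -pthread):

```c
/* Two threads are two independent instruction streams: an SMT core
 * can pick ready work from either one. Each worker below is a long
 * dependent chain, so a single thread alone leaves execution units
 * idle while each add waits on the previous one. */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    unsigned long sum = 0;
    for (unsigned long i = 0; i < 100000000UL; i++)
        sum += i;                    /* dependent chain */
    *(unsigned long *)arg = sum;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    unsigned long r1 = 0, r2 = 0;

    /* with SMT, both streams can share one core's execution units
     * instead of leaving them half idle */
    pthread_create(&t1, NULL, worker, &r1);
    pthread_create(&t2, NULL, worker, &r2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("%lu %lu\n", r1, r2);
    return 0;
}
```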

The thing about Pentium 4 isn't entirely correct either. Pentium III also only had two ALUs and a single FPU execution port, so it's no different for Intel.
Athlon had three ALUs and its FPU was cut into three different sub-units, which allowed more parallelism.
Pentium 4 did have 'double-pumped' ALUs. This meant that a certain number of instructions could be executed in only half a clock cycle, after which the CPU could feed another such instruction. Effectively this allowed the Pentium 4 to run 4 instructions in 1 cycle, with only two ALUs.
 
PIIIs at the time of the P4's introduction were at 180nm. The PIII-Ts (after the P4's introduction) were 130nm. They couldn't get the PIII to clock to 1.5GHz, much less 3.6, on the fab process they were using. We got Core 2s starting at what? 65nm? Slight difference.

The more I look into this, the more I really think that Intel, rather than AMD, has done things right. They got it right first with the P3, learned a valuable lesson with the P4, made a good choice with Core/Core 2, and then finally got the uncore right with the i7.

AMD, on the other hand, has gotten some things right with the K8 and all, but I don't know where they can go from here.

It could be a simple marketing ploy to entice simple people like me, but looking at Intel's roadmap, I pretty much know what they will do in 2010 or 2012. With AMD, I don't have that kind of... foresight with the company. It's nice and all to surprise us with releases like the HD4000s, and that they're aiming for the price/performance bracket of the market, which I completely agree is a good thing for the majority of buyers, but it feels like they will never be a "pioneering" company anymore.

I repeat myself again, but with Intel, whether they have released the right or wrong chips, each successor is always a lesson learned from the previous one.
 
The K7 did up to 6 instructions per clock. That was the whole XP selling point: do more with a slower clock speed, where Intel was trying to do less per clock at a higher clock speed. The K8s continued this fashion.

That's not entirely true.
The K7's bottleneck was its decoder. It could decode a maximum of 3 simple x86 instructions per clk, and was much slower when complex instructions were involved.
As such, the K7 could never feed its back-end more than 3 instructions per clk sustained.

What it could do was retire up to 6 instructions per clk... But this could only happen if some of those instructions took more than a single cycle, and as such got 'overtaken' by other instructions.

Intel's Pentium III and Pentium 4 were also capable of 3 instructions per clk at maximum (or technically the Pentium 4 could do 6 instructions per 2 clks, because part of its reorder/retire logic processed only every other cycle).
Pentium III was stronger than the Athlon when complex instructions were involved (or general branch prediction and such), but the Athlon had an extra ALU and a more parallel design for the FPU, allowing it to be faster in integer-heavy and/or FPU-heavy code.
In practice the Pentium III and Athlon were never far apart in terms of IPC, with the Athlon having a slight edge most of the time.
Pentium 4 usually didn't get as close to its theoretical maximum because it had a much larger pipeline, which meant that any kind of stall/misprediction took a much larger penalty... and it had to run code that was usually optimized for Pentium III-style architectures, which suited the Athlon just fine, but could be disastrous for Pentium 4 in some cases.

Having said that, neither Pentium III nor Athlon ever got to 3 instructions per clk sustained in practice. Most of the time you were lucky if you got more than 2 instructions per clk on average, because of all the other limitations in the x86 architecture and with branch prediction, caching and memory access in general.
 
The more I look into this, the more I really think that Intel, rather than AMD, has done things right. They got it right first with the P3, learned a valuable lesson with the P4, made a good choice with Core/Core 2, and then finally got the uncore right with the i7.

That's right.
I remember back when the Core2 specs were surfacing, a few months before the official introduction, that people were surprised at the pipeline length.

Intel chose not to take the same pipeline length as Core or Athlon64, but actually make it a few stages longer.
Some people would get flashbacks of Pentium 4's uber-long pipeline and thought it was a bad thing.

But in the end it showed that Intel didn't just get scared after Pentium 4 and go the 'safe' route by copying from earlier architectures that worked out okay... but they apparently actually put some thought into the whole design.
The longer pipeline allows the architecture to scale further... And that's why Core2 turned out to be a legendary overclocker, and Intel never even had to tap into the huge clockspeed potential that the architecture offers... Intel managed to get better IPC than AMD while using a longer pipeline, which in theory would be less efficient. But Intel just got it right.
What's even more amazing is that Intel did this without an integrated memory controller. So the Core2 core is VERY efficient. And Core i7 now takes away the limit of the FSB.
People like us, who overclock these chips, are very well aware of how easily the architecture goes past 3 GHz. It's not quite a Pentium 4, but it sure scales nicely past the 3 GHz mark with little effort.

As for the triple-channel controller: I wonder if it would even make a difference on a Phenom II processor. Unlike the Core i7, AMD's execution core is probably not efficient enough to handle more than dual-channel bandwidth.
That's another thing you have to realize: anyone can strap on three or more channels, but on the Core i7 it actually works.
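
For a rough sense of the numbers, here's a back-of-the-envelope peak-bandwidth calculation; the DDR3-1066 speed is just an assumed example:

```c
/* Peak bandwidth = channels x transfer rate x 8 bytes per 64-bit
 * channel. The DDR3-1066 figure is only an example speed. */
#include <stdio.h>

int main(void)
{
    double mt_per_s = 1066e6;          /* DDR3-1066: 1066 MT/s (assumed) */
    double bytes_per_transfer = 8.0;   /* one 64-bit channel */
    double dual   = 2 * mt_per_s * bytes_per_transfer / 1e9;
    double triple = 3 * mt_per_s * bytes_per_transfer / 1e9;
    printf("dual-channel:   %.1f GB/s peak\n", dual);    /* ~17.1 GB/s */
    printf("triple-channel: %.1f GB/s peak\n", triple);  /* ~25.6 GB/s */
    return 0;
}
```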
 
The more I look into this, the more I really think that Intel, rather than AMD, has done things right. They got it right first with the P3, learned a valuable lesson with the P4, made a good choice with Core/Core 2, and then finally got the uncore right with the i7.

AMD, on the other hand, has gotten some things right with the K8 and all, but I don't know where they can go from here.

It could be a simple marketing ploy to entice simple people like me, but looking at Intel's roadmap, I pretty much know what they will do in 2010 or 2012. With AMD, I don't have that kind of... foresight with the company. It's nice and all to surprise us with releases like the HD4000s, and that they're aiming for the price/performance bracket of the market, which I completely agree is a good thing for the majority of buyers, but it feels like they will never be a "pioneering" company anymore.

I repeat myself again, but with Intel, whether they have released the right or wrong chips, each successor is always a lesson learned from the previous one.

Indeed. The Pentium 4, like the Phenom (hopefully), isn't worthless, in the sense that some of the technology may prove promising for another CPU generation down the line. Though it wasn't nearly as competitive as it should have been at the time it was released.

The CPU market is a chess game. Move, counter-move, and move again. It repeats until a company throws in the towel. Then, hopefully, a new player steps up. Cyrix had one golden period of relative success after their moderately successful FasMath coprocessors. The 6x86 days were good to them. Though not the equal of the Pentium in some areas, they were a nice alternative for the price. They floundered with the 6x86MX and MII processors and never recovered. Eventually they threw in the towel, more or less, and have been bought and sold a couple of times. They are currently owned by VIA and are unlikely to take on anyone in the near future unless their parent company really wants to step up into that arena.

AMD has had successes but hasn't learned from their failures. Worse yet, they've gravely miscalculated and misjudged their opponent. You have to give AMD props for holding out and fighting Intel better and longer than anyone else. Cyrix, NexGen, and others all failed to compete and are all gone: Cyrix to VIA, and NexGen to AMD. I'm not one of the "doom and gloom" guys, but AMD needs to get themselves into a better market position in order to survive. That said, even if the Phenom II doesn't do very well, I know they'll hang on for a while yet. Can they survive until 2011, the projected launch time frame for Bulldozer? I'm not sure, but I think so.
 
AMD has had successes but hasn't learned from their failures. Worse yet, they've gravely miscalculated and misjudged their opponent. You have to give AMD props for holding out and fighting Intel better and longer than anyone else. Cyrix, NexGen, and others all failed to compete and are all gone: Cyrix to VIA, and NexGen to AMD. I'm not one of the "doom and gloom" guys, but AMD needs to get themselves into a better market position in order to survive. That said, even if the Phenom II doesn't do very well, I know they'll hang on for a while yet. Can they survive until 2011, the projected launch time frame for Bulldozer? I'm not sure, but I think so.

Blame or flame me if I'm horribly wrong, but I've always given Intel credit for AMD's successes, actually. I do remember in the 386/486 era when Intel accused AMD of "stealing" Intel's chip design. Now, I also agree that this kind of lawsuit runs rampant, especially in the tech industry, but I also strongly remember the feeling that AMD really did take advantage of the second-source licensing that Intel offered. It seemed to me at that point that AMD based their chip design on Intel's blueprints, and that is why they were really successful with the Athlons, compared to Cyrix and others. I may be wrong, but I also remember that Cyrix didn't have as much access to Intel's chip technology as AMD did.

Don't get me wrong; as many have pointed out already, I really hope AMD steps up and not only catches up to but surpasses Intel. Right now, however, even to an average man like me, it seems like an almost impossible feat for AMD. What do they have in the future, really, other than settling for price/performance? I hope they become a pioneer again.
 
The CPU market is a chess game. Move, counter-move, and move again.

Intel's strategy for Pentium 4 was pretty obvious at the time:
Intel has a competitor named AMD, which is getting stronger at every generation.
What are Intel's strongest points?
Manufacturing. Intel can build chips at a smaller process, allowing more transistors and higher clockspeeds than anyone else in the world.
So what weapon shall Intel use to fight off AMD?
Exactly.
So Intel came up with a very large and complicated chip, introducing lots of rather revolutionary technology, such as trace cache, double-pumped ALUs and HyperThreading. Then they attach it to a large cache and very high-bandwidth memory (Rambus, remember?), and try to reach insane clockspeeds.

Most of it made perfect sense at the time; it's just that nobody knew that transistor leakage would increase more than exponentially as processes got smaller and clock speeds got higher.
The 90 nm Pentium 4 was estimated to leak over 25% of its total power consumption. That alone made it very power-inefficient.
Somebody had to hit that brick wall eventually. Intel was just the first on the scene.
All that time, AMD wasn't even ready for an architecture like this. Their K7 and K8 are essentially just knockoffs of Intel's Pentium Pro philosophy, and AMD hadn't yet had time to even move to a Pentium 4 philosophy.
Who knows... If Intel hadn't hit that brick wall there and then, AMD might have just copied Intel's design philosophy like they always had.
Hindsight is always 20/20, right?
 
I do remember in the 386/486 era when Intel accused AMD of "stealing" Intel's chip design.

AMD actually DID steal the 386/486 designs.
Their CPUs were virtually identical down to the transistor level (which you can see if you X-ray them). The performance of an AMD 386 and an Intel 386 was exactly 1:1 for every instruction, every application, everything.
Same with the 486.
AMD competed by adding heatsinks and later fans, allowing their chips to clock higher, while keeping prices low. The 386DX40 is legendary. Intel's fastest 386 was 33 MHz. AMD's 40 MHz model filled the void between the fastest Intel 386 and the slowest 486 at a bargain price (the 486 had already been on the market for a few years when AMD launched its 386).

After that, AMD no longer copied Intel 1:1. The K5 and later chips were all their own designs, although you can clearly see how they borrowed quite a few architectural ideas from Intel. The K5/K6 are very similar to the classic Pentium, and the K7/K8/K10 are very similar to the Pentium Pro and its derivatives (PII/PIII/Pentium M, Core).
But note that AMD didn't always design everything from scratch. They bought other companies and used their designs. The K6, for example, was originally developed by a company called NexGen.
The K7 used quite a lot of technology that AMD acquired when the DEC Alpha technology was being sold off.
 