The article discusses Moore's law as applied to many different aspects of microprocessor design and implementation and its implications for the future of the field.
We can tell from current industry trends that transistor counts, transistor density, and processor clock speed (MHz) will increase in accordance with Moore's law for the foreseeable future, doubling roughly every 18 months until certain molecular limitations are reached. Microprocessor designs will move more and more towards "system on a chip" designs. Pad size will remain roughly where it is. Voltage will continue to decrease, but by smaller and smaller amounts over time.
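To make the 18-month doubling concrete for myself, here is a minimal Python sketch. The function name and the 10-million-transistor starting point are my own illustrative assumptions, not figures from the article.

```python
def projected_count(initial_count, months, doubling_period_months=18.0):
    """Project a count forward, assuming one doubling per period."""
    return initial_count * 2 ** (months / doubling_period_months)


# Hypothetical starting point (assumption): a 10-million-transistor chip.
start = 10_000_000
for years in (0, 3, 6, 9):
    count = projected_count(start, years * 12)
    print(f"after {years} years: about {round(count):,} transistors")
```

Three years is two doubling periods, so the count quadruples; nine years is six periods, a 64-fold increase.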
The combination of increasing transistor densities and leveling voltage implies that power consumption will increase at nearly the rate of transistor density. The increase in clock frequency also means that the noise floor will be lowered, making noise more of a design consideration.
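A rough way to check this reasoning is the usual first-order estimate of CMOS switching power, P = a * C * V^2 * f. The sketch below is my own simplification, not something taken from the article: it assumes that switched capacitance grows in proportion to transistor count, and the function name is hypothetical.

```python
def relative_power(transistor_ratio, voltage_ratio, frequency_ratio=1.0):
    """Relative switching power under P ~ N * V^2 * f.

    Assumption: switched capacitance scales with transistor count N,
    which ignores the smaller capacitance of smaller transistors.
    """
    return transistor_ratio * voltage_ratio ** 2 * frequency_ratio


# Density doubles while voltage stays level: power doubles too.
print(relative_power(2.0, 1.0))
# Density doubles and voltage still drops 10%: power grows less than 2x.
print(round(relative_power(2.0, 0.9), 2))
```

Under these assumptions, once voltage stops falling there is nothing left to offset the growth in transistor count, and any increase in frequency only adds to it.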
The rapidly increasing power consumption and noise problems mean that the costs of dealing with heat and noise reduction will eventually dwarf the savings that we get from higher transistor counts. Therefore, there appears to be a practical limitation to Moore's law.
In the past, the microprocessor market has been driven by the home PC and its need for ever faster and more powerful general purpose processors. Now, with the ubiquity of the internet, the importance of CPU power is diminishing in favor of network connectivity. Processors seem to have pretty much all the power that is needed to drive standard consumer applications; the bottleneck is in the network.
As computers continue to progress from large, general purpose machines to portable, personal, targeted, network driven forms, microprocessor design will have to concentrate more on reduced power consumption, portability, security, connectivity, user interface, application compatibility, universal data access, and low cost. The importance of quick times to market will drive modular designs and open, standards-based interfaces.
I enjoyed this article very much, and Herring puts forth a very good argument. It has also given me another angle supporting my own personal theory (which I bore my friends with whenever the opportunity arises) that personal computers as we know them are on their way out.
My argument usually rests on three things. The first is interface considerations: I just don't think it is possible to make an easy-to-learn interface for a general purpose computer that doesn't seriously hamper advanced users. After years of trying, no one has really bridged the gap between point-and-drool interfaces, which don't intimidate the unwashed masses but merely annoy power users, and UNIX command lines, which are extremely powerful but have a steep learning curve. The second is current trends: consider the enjoyment/cost ratio of a video game console versus a computer put together with nice specs and expensive graphics and sound cards. A PlayStation and a TV can be had for a few hundred dollars and will give roughly the same quality of gameplay as a multi-thousand dollar computer system. The third is the observation that most people who own computers use them for a very limited range of tasks (email, web browsing, word processing, and video games) and would be better served by separate devices with their own specific interfaces.
Basically, I believe, as Herring seems to, that in a matter of time computer appliances will replace general purpose computers for all but programmers. I had never really thought about it in terms of demands on microprocessor design, though.
The IA-64 uses many aggressive techniques to achieve large amounts of instruction level parallelism (ILP).
The IA-64 has a 64-bit address space, 128 65-bit general purpose registers (64 bits of data or memory address plus 1 NaT bit), 128 82-bit floating point registers, space for up to 128 64-bit special purpose registers, 8 64-bit branch registers, and 64 1-bit predicate registers. Instructions are grouped into 128-bit bundles, each holding three 41-bit instructions plus a 5-bit template.
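As far as I understand the IA-64 encoding, a 128-bit bundle is laid out as a 5-bit template in the lowest bits followed by three 41-bit instruction slots (5 + 3 * 41 = 128). The Python sketch below only packs and unpacks those fields; the function names are hypothetical and the values used are arbitrary, not real instruction encodings.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def make_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one integer."""
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instruction slots")
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for index, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def split_bundle(bundle):
    """Unpack a 128-bit integer into (template, [slot0, slot1, slot2])."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(3)
    ]
    return template, slots


# Arbitrary field values, chosen only to exercise the packing.
packed = make_bundle(0x10, [1, 2, 3])
print(split_bundle(packed))
```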
It provides parallel compare instructions that allow compound And and Or conditions to be computed in parallel. Multiway branches allow several normal branches to be grouped together and executed in a single instruction. The hardware uses a combination of compiler provided information and histories of run-time behavior to get as much parallelism as possible.
Predication lets the compiler expose a larger pool of instructions to the hardware from which it can extract parallelism. It does this by executing instructions from multiple conditional paths at the same time and eliminating branches that could have been mispredicted.
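To convince myself of how predication removes a branch, I wrote the small Python model below. It is only a software analogy with hypothetical function names: a compare produces a predicate and its complement, both arms are computed, and each result is committed only if its predicate is true.

```python
def abs_diff_branching(a, b):
    """Ordinary version: the hardware must predict which way the branch goes."""
    if a > b:
        return a - b
    return b - a


def abs_diff_predicated(a, b):
    """Model of the if-converted version: no control-flow decision up front."""
    # One compare writes a predicate and its complement.
    p1 = a > b
    p2 = not p1
    x = None
    # Both arms are computed; a result is committed only when its
    # predicate is true, otherwise it is discarded.
    for predicate, result in ((p1, a - b), (p2, b - a)):
        if predicate:
            x = result
    return x


print(abs_diff_branching(3, 10), abs_diff_predicated(3, 10))
```

The predicated version does more work in total, which is the trade-off: wasted instructions in exchange for no chance of a mispredicted branch.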
The compiler can predict which instructions' results will and will not be used more efficiently than the hardware can. IA-64 helps out by providing speculative load operations which can safely be scheduled ahead of earlier branches. It then uses exception handling in a manner similar to Java's try/throw/catch mechanism to ensure program correctness. Basically, it works by assuming that its prediction is correct but double-checking just to be sure, and accepting a larger performance hit in the situations where it has predicted incorrectly. Speculation allows the compiler to aggressively rearrange code to exploit parallelism.
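Here is how I picture the speculative load and its later check, as a minimal Python model. The names are hypothetical, a dictionary stands in for memory, and a sentinel object stands in for a register whose NaT bit is set; the real mechanism uses a speculative load followed by a check instruction that jumps to recovery code.

```python
NAT = object()  # stands in for a register whose NaT bit is set


class LoadFault(Exception):
    """Raised when a non-speculative load touches a bad address."""


def speculative_load(memory, address):
    """A faulting access does not trap; it just marks the result as NaT."""
    return memory.get(address, NAT)


def normal_load(memory, address):
    if address not in memory:
        raise LoadFault(f"bad address: {address!r}")
    return memory[address]


def increment_through_pointer(memory, pointer):
    # The load has been hoisted above the null check by the compiler.
    value = speculative_load(memory, pointer)
    if pointer is None:
        return 0  # the loaded value is never used, so no fault is reported
    if value is NAT:
        # The check failed: recovery code redoes the load for real,
        # which raises the exception at the point the program expects.
        value = normal_load(memory, pointer)
    return value + 1


memory = {8: 41}
print(increment_through_pointer(memory, 8))
print(increment_through_pointer(memory, None))
```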
Of the 128 general registers that IA-64 makes visible to the compiler, 32 are static and the other 96 are stacked (they can be renamed under software control) and are fresh upon procedure entry. A Register Stack Engine (RSE) saves the registers of previous procedures to memory in parallel with the execution of the current procedure. In addition, an asynchronous RSE can spill registers to memory and fill them back in before they are actually needed.
Special Loop Count (LC) registers, loop type branch instructions, and rotating predicate registers all combine to reduce the overhead of software pipelining and make it practical and efficient.
My knowledge of microprocessor architecture isn't yet deep enough to allow me to make much in the way of intelligent commentary on this article. It certainly sounds like a very impressive design with many useful advances over older processors. I would be curious to see some real-world type benchmarks to see how it actually performs with existing compilers and typical code.
After acquiring Digital Semiconductor as the result of a patent infringement suit, Intel discovered that it had also inherited a new network processor that was secretly under development. Recognizing the potential of the new processor and the importance of entering the communications/networking market in a big way, it continued its development.
The IXP 1200 was specially designed to perform the types of tasks that are related to routing internet packets. A guiding principle of its design seems to have been future expansion and interlinking capabilities.
It consists of a central StrongARM processor that runs the supervisory software, which handles less critical tasks such as managing the router's lookup tables. The StrongARM was chosen because it's fast, consumes very little power, occupies a small amount of silicon, and is well supported by development tools. For the packet-processing cores, however, rather than reusing the StrongARM core, the designers created an entirely new 32-bit architecture with a data-oriented instruction set. They added several new instructions for more efficient data throughput, byte manipulation, and pattern matching while removing other, less used instructions.
The bulk of the processing is offloaded onto the 7 independent RISC cores, or "microengines". Each has its own physical register file (with 128 general-purpose registers and 128 transfer registers, all 32 bits wide and single ported), a 32-bit ALU with a basic five-stage pipeline, a single-cycle shifter, an instruction-control store (4K of SRAM), a microengine controller, and four program counters.
The parallel architecture of 7 microengines (each capable of running 4 hardware threads synchronously) makes it ideal for processing large amounts of constantly streaming data, as is needed for reading internet packets, looking up their destinations in a table, and sending them out to the right address. In addition, there are read/write queueing optimizations and hashing instructions which make common router tasks extremely efficient.
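The read, look up, and forward loop described above is essentially a longest-prefix match against a routing table. The Python sketch below shows the idea using the standard ipaddress module; the table contents and port numbers are made up, and the linear scan is my own simplification, not how the IXP 1200 does it (the article mentions hashing instructions for this).

```python
import ipaddress


def build_table(routes):
    """Turn (prefix, port) pairs into a table sorted longest prefix first."""
    table = [(ipaddress.ip_network(prefix), port) for prefix, port in routes]
    table.sort(key=lambda entry: entry[0].prefixlen, reverse=True)
    return table


def route(table, destination):
    """Return the output port of the most specific matching prefix."""
    address = ipaddress.ip_address(destination)
    for network, port in table:
        if address in network:
            return port
    return None  # no route: a real router would drop or use a default


# Hypothetical routing table (assumption): prefix -> output port.
table = build_table([
    ("0.0.0.0/0", 0),
    ("10.0.0.0/8", 1),
    ("10.1.0.0/16", 2),
])
for destination in ("10.1.2.3", "10.2.0.1", "192.168.1.1"):
    print(destination, "-> port", route(table, destination))
```

Each packet's lookup is independent of every other packet's, which is why the work spreads so naturally across many small cores and threads.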
The microengines are all functionally identical, which should make it easier to scale the architecture up in the future. Furthermore, the onboard shared scratchpad memory and the innovative, nonblocking high-speed bus designs are all optimized for the simplest possible integration between any number of microengines or entire processors. Vendors are already working on 180-chip chassis, and Intel says that it can scale the architecture to dozens of cores and peripherals on a single die.
Intel says that a conservatively clocked (166 MHz) IXP 1200 can perform layer-3 routing for 2.5 million 64-byte packets per second. The die is only 126 mm², with 6.5 million transistors, in an enhanced BGA package with 432 pins. The IXP 1200 will be manufactured on a 0.28-micron process with three metal layers (hardly cutting edge). With a 2 V core and 3.3 V I/O, the chip consumes only 5 W of power. The architecture is obviously designed to grow as it migrates to smaller geometries and higher clock frequencies.
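To get a feel for these numbers, I worked out what they imply per packet. The Python sketch below uses only the figures quoted above (166 MHz, 2.5 million 64-byte packets per second, 7 microengines); the variable names are mine, and the per-engine budget assumes the work is spread evenly across the microengines, which is my own simplification.

```python
CLOCK_HZ = 166e6            # quoted core frequency
PACKETS_PER_SECOND = 2.5e6  # quoted layer-3 routing rate
PACKET_BYTES = 64           # quoted packet size
MICROENGINES = 7            # count as given in this summary

# Raw data rate implied by the packet rate.
bits_per_second = PACKETS_PER_SECOND * PACKET_BYTES * 8

# Clock cycles that elapse between packet arrivals.
cycles_per_packet = CLOCK_HZ / PACKETS_PER_SECOND

# Total engine cycles available per packet if the load is spread evenly
# across all microengines (assumption).
engine_cycles_per_packet = cycles_per_packet * MICROENGINES

print(f"{bits_per_second / 1e9:.2f} Gbit/s of packet data")
print(f"{cycles_per_packet:.1f} clock cycles between packets")
print(f"{engine_cycles_per_packet:.1f} engine cycles available per packet")
```

That works out to roughly 1.28 Gbit/s, with a new packet arriving about every 66 clock cycles, so a single core would have very little time per packet; the parallel microengines are what make the budget workable.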
Again, I'm probably not qualified to analyze the article very much. It appears to be an excellent example of the idea that a processor designed for a specific purpose rather than for general use can be made extremely efficient. It also reflects well on the potential for parallel and modular architectures. I'm just curious why there are 7 microengines. Usually, things are done in powers of 2.