Pipelining the Switch Microarchitecture

Performance can be enhanced by pipelining the switch microarchitecture. Pipelined processing of packets in a switch has similarities with pipelined execution of instructions in a vector processor. In a vector pipeline, a single instruction indicates what operation to apply to all the vector elements, which are executed in a pipelined way.

Similarly, in a switch pipeline, a single packet header indicates how to process all of the internal data path physical transfer units (or phits) of a packet, which are executed in a pipelined fashion.

Also, because packets at different input ports are independent of each other, they can be processed in parallel, similar to the way multiple independent instructions or threads of pipelined instructions can be executed in parallel.

After the header portion of the packet is received in the first stage, the routing information is computed in the second stage. Concurrent with this, subsequent portions of the packet are received and buffered in the input port queue at the first stage. Arbitration is performed in the third stage. The crossbar is configured to allocate the granted output port for the packet in the fourth stage, and the packet header is buffered in the switch output port and is ready for transmission over the external link in the fifth stage.

The notation in the figure is as follows: IB is the input link control and buffer stage, RC is the route computation stage, SA is the crossbar switch arbitration stage, ST is the crossbar switch traversal stage, and OB is the output buffer and link control stage.

Packet fragments (flits) coming after the header remain in the IB stage until the header is processed and the crossbar switch resources are provided. A virtual channel switch usually requires an additional stage for virtual channel allocation. Moreover, arbitration is required for every flit before transmission through the crossbar. Finally, depending on the complexity of the routing and arbitration algorithms, several clock cycles may be required for these operations.
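The stage-by-stage behavior described above can be illustrated with a small timing sketch. It is not taken from the text: it assumes one clock cycle per stage, one flit arriving per cycle, no contention, no virtual channel allocation stage, and a single arbitration for the whole packet. The names `pipeline_schedule` and `packet_latency` are hypothetical.

```python
# Illustrative timing model of the five-stage switch pipeline.
# Assumptions (not from the text): every stage takes exactly one cycle,
# one flit arrives per cycle, and there is no contention for the output.
STAGES = ("IB", "RC", "SA", "ST", "OB")


def pipeline_schedule(num_flits):
    """Return one dict per flit mapping clock cycle -> stage occupied."""
    schedule = []
    for i in range(num_flits):
        arrival = i + 1  # flit i enters the IB stage at cycle i + 1
        if i == 0:
            # The header flit visits all five stages in order.
            stages = {arrival + s: name for s, name in enumerate(STAGES)}
        else:
            # Payload flits wait in IB until the header has been routed
            # and granted the crossbar, then follow it one cycle apart.
            st_cycle = i + 4  # header traverses the crossbar at cycle 4
            stages = {cycle: "IB" for cycle in range(arrival, st_cycle)}
            stages[st_cycle] = "ST"
            stages[st_cycle + 1] = "OB"
        schedule.append(stages)
    return schedule


def packet_latency(num_flits):
    """Cycle at which the last flit of the packet reaches the OB stage."""
    return max(max(stages) for stages in pipeline_schedule(num_flits))


if __name__ == "__main__":
    for flit, stages in enumerate(pipeline_schedule(3)):
        row = " ".join(f"{cycle}:{stage}" for cycle, stage in sorted(stages.items()))
        print(f"flit {flit}: {row}")
```

Under these assumptions the header occupies IB, RC, SA, ST, and OB in cycles 1 through 5, and each later flit spends three cycles in IB before traversing the crossbar. A virtual channel switch, per-flit arbitration, or multi-cycle routing would lengthen this schedule.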

Other Switch Microarchitecture Enhancements

As mentioned earlier, internal switch speedup is sometimes implemented to increase switch output port utilization. One way of doing this is by increasing the number of crossbar input ports.

When implementing several physical queues per input port, this can be achieved by devoting a separate crossbar port to each input queue. Another way of implementing parallel data paths between input and output ports is to move the buffers to the crossbar crosspoints.

This switch architecture is usually referred to as a buffered crossbar switch. A buffered crossbar provides independent data paths from each input port to the different output ports, thus making it possible to send up to k packets at a time from a given input port to k different output ports. By implementing independent crosspoint memories for each input-output port pair, HOL blocking is eliminated at the switch level.

Moreover, arbitration is significantly simpler than in other switch architectures. Effectively, each output port can receive packets from only a disjoint subset of the crosspoint memories.

Thus, a completely independent arbiter can be implemented at each switch output port, each of those arbiters being very simple. A buffered crossbar would be the ideal switch architecture if it were not so expensive. The number of crosspoint memories increases quadratically with the number of switch ports, dramatically increasing its cost and reducing its scalability with respect to the basic switch architecture. In addition, each crosspoint memory must be large enough to efficiently implement link-level flow control.
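The following sketch models the buffered crossbar described above: one crosspoint memory per input-output pair and an independent arbiter at each output port. It makes several assumptions that are not in the text: the arbiters use round-robin selection, the crosspoint memories are unbounded queues (so link-level flow control is ignored), and each output forwards at most one packet per cycle. The class name `BufferedCrossbar` and its methods are hypothetical.

```python
from collections import deque


class BufferedCrossbar:
    """Toy model of a k x k buffered crossbar (illustrative only)."""

    def __init__(self, num_ports):
        self.k = num_ports
        # xp[i][j] is the crosspoint memory for input i -> output j.
        self.xp = [[deque() for _ in range(num_ports)] for _ in range(num_ports)]
        # Assumed round-robin pointer, one per output arbiter.
        self.next_input = [0] * num_ports

    def crosspoint_count(self):
        """Number of crosspoint memories; grows quadratically with k."""
        return self.k * self.k

    def enqueue(self, inp, out, packet):
        """Place a packet from input `inp` into the memory for output `out`."""
        self.xp[inp][out].append(packet)

    def cycle(self):
        """Run every output arbiter once; return {output: packet} delivered."""
        delivered = {}
        for out in range(self.k):
            # Each arbiter inspects only its own column of crosspoint
            # memories, so it needs no coordination with other outputs.
            start = self.next_input[out]
            for offset in range(self.k):
                inp = (start + offset) % self.k
                if self.xp[inp][out]:
                    delivered[out] = self.xp[inp][out].popleft()
                    self.next_input[out] = (inp + 1) % self.k
                    break
        return delivered


if __name__ == "__main__":
    xbar = BufferedCrossbar(4)
    xbar.enqueue(0, 1, "a")  # input 0 -> output 1
    xbar.enqueue(0, 2, "b")  # input 0 -> output 2, same input port
    xbar.enqueue(3, 1, "c")  # input 3 -> output 1, competes with "a"
    print(xbar.crosspoint_count())
    print(xbar.cycle())
    print(xbar.cycle())
```

In this example, input 0 delivers packets to two different outputs in the same cycle, and the packet from input 3 is served on the following cycle without blocking anything else. Doubling the port count from 4 to 8 raises the number of crosspoint memories from 16 to 64, which is the quadratic cost the text refers to.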

To reduce cost, most designers prefer input-buffered or combined input-output-buffered switches enhanced with some of the mechanisms described previously. We mention a few of these below. The protocols must target the largest network size and handle the types of anomalous systemwide events that might occur.

Should it support cache coherence? If the operating system must get involved for every network transaction, the sending and receiving overhead becomes quite large. This is the case for the Cray XT3 SeaStar, Intel Thunder Tiger 4 QsNetII, and many other supercomputer and cluster networks. To support coherence, the sender may have to flush the cache before each send, and the receiver may have to flush its cache before each receive to prevent the stale-data problem.

Such flushes further increase sending and receiving overhead, often causing the network interface to become the network bottleneck. It also has a chip-to-chip switched network to interconnect multiple chips in a multiprocessor configuration. Two of the on-chip networks are switched networks: One is used for operand transport and the other is used for on-chip memory communication.

The others are essentially fan-out trees or recombination dedicated networks used for status and control.

The portion of chip area allocated to the interconnect is substantial, with five of the seven metal layers used for global network wiring.

Standardization: Cross-Company Interoperability

Standards are useful in many places in computer design, including interconnection networks. Advantages of successful standards include low cost and stability. A standard makes the success of the interconnection independent of the stability of a single company.

Finally, a standard allows many companies to build products with interfaces to the standard, so the customer does not have to wait for a single company to develop interfaces to all the products of interest.

One drawback of standards is the time it takes for committees and special-interest groups to agree on the definition of standards, which is a problem when technology is changing rapidly.



