************************************************************************ ATM Forum Document Number: ATM Forum/95-0178R1 ************************************************************************ Title: A Sample Switch Algorithm ************************************************************************ Abstract: Describes the switch algorithm developed at OSU. The algorithm is fully compatible with the latest source and switch requirements and has been thoroughly tested. The text to be included in the TM document appendix is presented. ************************************************************************ Source: Raj Jain, Shiv Kalyanaraman, Ram Viswanathan, and Rohit Goyal The Ohio State University Department of CIS Columbus, OH 43210-1277 Phone: 614-292-3989, Fax: 614-292-2911, Email: Jain@ACM.Org The presentation of this contribution at the ATM Forum is sponsored by NIST. ************************************************************************ Date: February 6-10, 1995 ************************************************************************ Distribution: ATM Forum Technical Working Group Members (Traffic Management) ************************************************************************ Notice: This contribution has been prepared to assist the ATM Forum. It is offered to the Forum as a basis for discussion and is not a binding proposal on the part of any of the contributing organizations. The statements are subject to change in form and content after further study. Specifically, the contributors reserve the right to add to, amend or modify the statements contained herein. ************************************************************************ It is proposed that the following text be added as an appendix to the TM document. Appendix: A Sample Switch Algorithm The traffic management requirements for switches allow considerable flexibility to vendors in deciding what feedback to give to the sources. This appendix describes a sample switch feedback algorithm that is compatible with the current source, switch, and destination requirements. It is designed for explicit rate switches. It has been tested thoroughly in simulations. It is included here only as an example. Design Goals: The switch algorithm is named Explicit Rate Indication for Congestion Avoidance (ERICA) algorithm. The design goals of this algorithm were to provide the following features: 1. Congestion Avoidance: The algorithm tries to keep cell queue lengths small while keeping the link utilization high. Small queue lengths results in small cell transfer delay (CTD) and small cell transfer delay variation (CTDV). 2. Max-Min Optimality: The ERICA algorithm converges to the max- min optimal. Variations of this algorithm for other optimality criteria can be easily developed. 3. Fast Response: Small queue lengths result in fast feedback to the source. Under overload, the rates come down fast. During underload, the rates go up fast, thereby, optimizing the transient performance. 4. Small Number of Parameters: Switches implementing ERICA have only two parameters: target utilization and load averaging interval. The utilization parameter should be set close to 95%. A higher value would result in higher queue length under overload. The load averaging interval is specified in number of cells. For 155 Mbps links, the default value is 30 cells. 5. Insensitivity to Parameters: If the user sets the parameter incorrectly, the performance degrades gracefully. For example, if the target link utilization is set to 80%, the link still operates but the maximum utilization under steady state is 80%. Of course, during overloads the link may be fully utilized. 6. Easy to set parameters: Setting the switch parameters is easy. Most network system managers are familiar with the term link utilization and can set the parameter correctly. 7. High-Start Feasible: One side benefit of fast response is that the rates (and hence the queue lengths) decrease rapidly during overload. The sources can start at any rate including the peak cell rate. The queues may build up but are reduced quickly along with the allowed cell rates. Although, high-start may not be recommended under many circumstances, the scheme works even if the VCs start at PCR. Starting at PCR may be desirable in the LAN environment. OVERVIEW OF THE SCHEME: The scheme consists of switches monitoring the load on each outgoing link and determining the overload factor and the number of active VCs. The switch measures the time until the Nth cell arrives. If this time is T seconds, the input rate is N/T cells per second. If the rated capacity of the link is C cells per second and the desired target utilization is U, the overload factor is computed as follows: Input Rate = N/T cps Target Cell Rate = U*C cps Overload Factor = Input Rate/Target Cell Rate = N/(T*U*C) During the next measurement interval of N cells, the switch may simply ask all VCs to change (increase or decrease) their load by the overload factor. Based on loading considerations only, the explicit rate for a VC with current cell rate of CCR is given by: ER based on load = ER1 = CCR/Overload Factor Fairness is achieved by ensuring that every VC gets at least the Fair_Share, which is computed as follows: Fair Share = Target Cell Rate/Number of Active VCs The number of active VCs is the number of distinct VCs that were seen during the last measurement interval (of N cells). Combining the fair share with the explicit rate based on load we get: ER based on load and fairness = ER2 = Max{Fair Share, ER based on load} If this computed value is lower than the explicit rate field in the RM cell, the switch replaces the explicit rate field in the cell with the computed value. ER in Cell = Min{ER in Cell, ER based on load and fairness} There are a few fine details that should be pointed out: 1. The input rate, number of active VCs, and overload is measured in the forward direction while the feedback is sent in RM cells going in the reverse direction. The CCR value in the RM cells going in the forward direction is more current and has a direct relationship with the overload. The CCR value in RM cells returning in the reverse direction may be out of date. It is therefore important to use the latest value of CCR in computing ER1. The switches note this value in the VC table for all forward-going RM cells and use it when computing explicit rate indication for reverse-going RM cells. 2. The switches measure load over one measurement interval of N cells and use it over the next measurement interval of N cells. If several reverse-going RM cells from a VC are seen during one interval, they should all carry the same explicit rate indication to the VC. The switches, therefore, store ER2 in the VC table and also keep an indication that a reverse-RM cell has been seen for the VC in that measurement interval. Not following this rule results in oscillatory behaviour due to confusing feedback given to the source. 3. In computing the explicit rate based on load and fairness, ERICA algorithm allows the *maximum* of the Fair_Share and the explicit rate based on load alone. This prevents VCs from being unnecessarily damped and allows fast response time. Not following this rule may result in low rate VCs not being able to come up. 4. The switch generated BECN option can be optionally used during the first round-trip on new VCs. This is particularly helpful in WAN configurations where the round-trip feedback delay may be large. Although the standard allows BECNs after the first round- trip, there is little advantage in using it. PSEUDO CODE: 1. Parameters: For each output port (or link), the system manager sets the the following parameters: int Measurement_Interval_In_Cells; /* Measurement Interval in Cells */ real Target_Utilization; /* Target Utilization between 0 and 1 */ real Link_Cell_Rate; /* Link capacity in cells per second */ The default suggested values of the first two parameters are 30 cells and 0.95, respectively. For a 155 Mbps link, the link cell rate is 155,000,000/(8*53) or 365,566 cells per second. 2. Data Structure: The switch maintains the following variables: A. For each output port (or link): int In_Count; /* Number of cells received for that link */ real Target_Cell_Rate; /* Target cell rate in cps */ int Number_Of_VCs; /* Total number of VCs on this link */ int Number_Of_Active_VCs; /* VCs seen in the last interval */ real Measurement_Interval_Start_Time; B. For each VC: boolean Cells_Seen; /* One or more cells were seen in this measurement interval*/ real Last_Seen_CCR; /* CCR last seen in the forward direction */ boolean Reverse_RM_Cells_Seen_In_This_Interval; real Last_ER; /* Last explicit rate feedback given for this VC */ 3. Initialization: /* For all ports, initialize the number of cells received */ for(i=0; i= Output_Port[i]. Measurement_Interval_In_Cells) End_Of_Interval_Calculations(i); } /* ------------------ RM cell processing ----------------------------*/ if( RM cell is received){ if( cell->DIR == forward) { /* forward path */ /* i th output port */ /* Note the latest rate seen on the forward path */ Output_Port[i].Last_Seen_CCR[cell->VC_Number] = cell->CCR; /* Include the BECN option here if it is being implemented */ } else if(cell->DIR == backward) { /* reverse path */ /* Port corresponding to the forward link */ i = Forward_Link_Port; /* New feedback only if an RM cell of this VC has not yet been seen in this interval */ if( Output_Port[i]. Reverse_RM_Cell_Seen_In_This_Interval[cell->VC_Number] != 1) { /* core algorithm */ ER1 = (Output_Port[i].Last_Seen_CCR[cell->VC_Number]/ Output_Port[i].Overload); ER2 = Max(ER1, Output_Port[i].Fair_Share); New_ER = Min( Output_Port[i].Target_Cell_Rate, ER2); if(New_ER < cell->ER) cell->ER = New_ER; /* Switches compute only one feedback per measurement interval */ /* Same feedback is given if other RM cells follow */ Output_Port[i]. Reverse_RM_Cells_Seen_In_This_interval[cell->VC_Number] = 1; /* Switches dont inhibit other switches' feedback if They have shorter intervals */ /* They simply advertise one rate per VC throughout The interval */ Output_Port[i].Last_ER[cell->VC_Number] = New_ER; } /* end:First reverse RM cell in this interval */ else{ /* More than one reverse RM cell seen in this interval */ /* Give out the same ER for all RMs of this VC */ New_ER = Output_Port[i].Last_ER[cell->VC_Number] ; if(New_ER < cell->ER) cell->ER = New_ER; } /* end : Second RM cell seen in the same interval */ } /* end : reverse direction */ New_VC = 0; } /* end : RM cell */ /* ------------- end of interval processing ------------------*/ End_Of_Interval_Calculations(int i) { Interval_Length = now - Output_Port[i].Measurement_Interval_Start_Time; Output_Port[i].Measurement_Interval_Start_Time = now; /* overload computation */ Output_Port[i].Overload = ( Output_Port[i].In_Count / (Output_Port[i].Target_Cell_Rate*Interval_Length) ); Output_Port[i].Number_Of_Active_VCs = Get_Number_Of_Active_VCs(i); Output_Port[i].Fair_Share = ( Output_Port[i].Target_Cell_Rate/ Output_Port[i].Number_Of_Active_VCs ); Output_Port[i].In_Count = 0; /* Init RM cells */ For(j=0; j < Output_Port[i].Number_Of_VCs; j++) Output_Port[i].Reverse_RM_Cells_Seen_In_This_IntervalCell[j]=0; } Get_Number_Of_Active_VCs(int i) { int count = 0; int j; for ( j =0; j< Output_Port[i].Number_Of_VCs; j++){ if(Output_Port[i].Cells_Seen[j] ==1) count = count + 1; Output_Port[i].Cells_Seen[j] = 0; } }