Studio 0: Racing Arrays

Welcome to your first CSE247 studio! In this session, you'll empirically explore the running times of simple programs and consider different ways of building a list whose size grows dynamically as needed.

For this and all future studios, your group will complete a writeup during the studio session. The writeup contains numbered "checkpoint" questions that you should fill in as you go. You can find a blank writeup for each studio in the studiowriteups folder of your repository. At the end of studio, your TA will review your group's writeup as part of your checkout procedure.

This studio has a pre-studio section. You are expected to complete the writeup of the pre-studio (the questions labeled "PS") individually, before coming to studio on Thursday. Be ready to discuss your answers with your group and TA. Parts A-D of the studio should be completed in the Thursday session.

Pre-Studio: Time and Ticks

We'll spend much of our time in this course reasoning about the time taken to execute an algorithm. How can we measure this time? Here are two reasonable approaches:

  • Time. We could measure the actual time in seconds needed to execute an algorithm, realized as a piece of code. While this seems like an "obvious" approach, it is not so easy in practice. Modern computers are so fast that they can execute simple programs in a tiny amount of time, which is difficult to measure. Hence, as we'll see, it may be hard to extrapolate empirically observed times from small to large inputs.

  • Ticks. We can count the number of operations performed by a program. In this studio, we'll introduce a counter facility (a ticker) that you can use for this purpose. You'll use tickers to empirically investigate your code's behavior throughout the course.

Measuring Time and Ticks

Let's try measuring the running time of a simple program.

  • In the coursesupport source folder, find and open the timing.examples package, and find and open the Linear class within this package.

  • Take a look at the run() method in this class:

public void run() {
   for (int i = 0; i < n; ++i) {
      this.value = this.value + i;
      ticker.tick();
   } 
} 

The statement inside the loop performs a single addition, storing the result in the value instance variable. The call to ticker.tick() advances the program's tick counter by one.

Now try running this program in Eclipse. You'll find that it executes the above function using values for n that range from 10000 to 90000 in steps of 10000.

The particular values of n that are used in the run() method are generated in the main method by the line: GenSizes sizes = GenSizes.arithmetic(10000, 100000, 10000); which generates an arithmetic progression of values for n from 10000 up to (but not including) 100000 in steps of 10000. The result is then used by the line

  ExecuteAlgorithm.timeAlgorithm(
		"linear",
		"timing.examples.Linear"
		new IntArrayGenerator(),
		sizes
	);

The output in the console window prints, among other things, the size, ticks, and time in milliseconds (thousandths of a second) taken for each sample of the size variable. But it's more convenient to look at the output left in the outputs folder in your repo.

Eclipse doesn't automatically notice when your program writes new data to the outputs folder. To see the latest contents of that filter, right-click (or control-click) on the folder and select "Refresh".

  • double-click on the file linear-ticks0.csv, which should open the file in Excel. What you see might look like this:

linear

These values were generated by your run of Linear.

  • Select the data, including the column headings, and use Excel to create a scatter plot.

To make a scatter plot in Excel, follow these steps:

  1. Highlight all the data, including the labels.
  2. Go to the Insert tab on the Excel toolbar.
  3. In the Charts header, click on Scatter Plot and select the option that shows markers only.
  • You should see a plot that looks like this:

linear

The plot you see shows how many times ticker.tick() was called in the run of your Linear program for each value of n. Not surprisingly, the plot shows a linear relationship between the input parameter (number of loop iterations) and the number of times ticker.tick() was called.

  • Examine the code and convince yourself that number of times the line this.value = this.value + i; executes is indeed a linear function of n.

Now let's compare the tick counts to actual measurements of running time.

  • Open the file linear-time0.csv in the output directory and plot the values the same way you plotted the tick counts. What do you see?

The time to execute run() is so small, even for tens of thousands of iterations, that the program likely completed in a millisecond or less; hence, the reported running time is 0 ms (or close to it). Remember, your computer can execute billions of machine instructions per second, so it takes millions of such instructions to add up to a millisecond of running time!

  • Now think of a way to increase the run time of the program, so that it is actually measurable. Note that each time you run the program, you get a new set of .csv files in your output directory; refresh to see these files.

You can modify the number of loop iterations n by increasing the values passed to Gensizes.arithmetic. Try at least 10x larger values, if not more. If you are unsure of what each parameter means, mouse over the arithmetic method name in eclipse, and the JavaDoc will be shown to you.

You might notice from the console output that the Linear program is run several times with each value of n. The ticker count for each of these runs should be identical, but the time may vary because of other programs competing for time on your computer. We run the program several times and take the minimum reported time as the best indicator of your program's actual running time without external interference.

Now let's try the same measurement technique using a toy program with very different behavior.

  • Find and open the Quadratic class in the timing.examples package in the coursesupport source folder.

  • Run the program and generate plots of the data in the output files quadratic-ticks0.csv and quadratic-time0.csv.

Question PS1: What do you see in these plots? Are there any differences between the them? What could account for those differences?

Question PS2: Why do the iteration counts provided for Quadratic produce a nicely shaped plot of time, while the same counts provided for Linear did not? (Note: this question is asking about time, not ticks.)

  • Look at the run method of Quadratic. We call ticker.tick() in just one spot -- inside the innermost loop of the function.
public void run() {
  for (int i=0; i < n; ++i) {
    for (int j=0; j < n; ++j) {
      this.value = this.value + i;
      ticker.tick();
    }
  }	
}

Each ticker.tick() call models the cost of one operation. If you wanted to count every operation in the program, including the loop initializations, tests, and increments, you might arrive at the following code, which adds a ticker for each fundamental operation in the program:

public void run() {
  ticker.tick();                   // i <- 0
  for (int i=0; i < n; ++i) {
    ticker.tick();                 // test i < n
    ticker.tick();                 // j <- 0
    for (int j=0; j < n; ++j) {
      ticker.tick();               // test j < n
      ticker.tick();               // addition
      this.value = this.value + i;
      ticker.tick();               // ++j
     }
    ticker.tick();                 // ++i
  }	
}
  • Experiment with adding ticks at different places in the nested loop (inside the inner loop, inside the outer loop, and at the top level). Rerun Quadratic after each change and see what happens to the running time plot generated from the tick counts.

Question PS3: From the runs you have tried so far, how does the placement of ticker.tick() calls affect the plots you see? In particular, do the changes affect the shape of the curve, the values plotted, or both?

Question PS4: In terms of n, how would you characterize, in the most simple terms, the shape of the time and ticks curves that have been generated for Quadratic?

Question PS5: What would happen if you deleted any ticker.tick() calls in the inner loop, while leaving calls in the outer loop?

  • Finally, think about the following questions and be ready to discuss them with your TA at the beginning of studio.

    1. Throughout this course, you will be asked to place ticker.tick() calls in your code to measure its running time. You could put in a tick for every operation, as in the example above, but are all these ticks needed to get a good idea of the program's running time?

    2. Where should you put ticks to get a good estimate (i.e. the right curve shape) for the running time while doing as little work as possible?

    3. In this class, we will care mostly about figuring out the "right curve shape" for a program's running time, without actually coding it up and measuring it. In particular, we'll think about asymptotic running time, for which we don't care about constant factors or terms that do not dominate a program's running time (e.g. extra linear work when the dominant work is quadratic in n).
      How is the notion of asymptotic running time related to parsimonious tick placement?

Part A: Some statements take constant time; some don't.

When we look at a line of Java code such as

this.value = this.value + i;

we want to reason about how much time it takes to execute that line of code. Our use of ticker.tick() above assumes that it takes one tick, or one basic operation, to execute one line. But is this true for all statements? Surely not! For example, the single line of code:

Arrays.sort(array);

would sort an array of values, which surely takes more than one operation. In fact, for an array of n values, it would take at least n operations simply to print the array, much less to sort it.

Let's look at the cost of allocating an array of integers in Java.

  • In the studios source folder, open the AddsTwoNumbers class, which is in the studio0.allocates package.

  • Run AddsTwoNumbers, refresh the outputs folder, and open and plot the time and ticks recorded in the output files for this program.

Question A1: What do you see? How do the curves reflect the code inside the run function of AddsTwoNumbers? Does the value of n matter here in terms of the time it takes to perform the operation? Why or why not?

  • Now open and run the Allocates class, which generates ticks and time data for allocating space for an array of n integers, with n varying from 0 to 10000. Again, plot the time and ticks this program as a function of n.

Question A2: What do the data tell you about the time it takes to allocate an array of n integers? Is it reasonable to say that the line of code

this.array = new int[this.n]

takes a constant amount of time, independent of the value of this.n?

Question A3: Do the ticks agree in shape with the time we measured in running the Allocates code? Why or why not?

Instead of having a ticker tick just once, you can have it tick n times in one call by call ticker.tick(n).

  • Modify the call to ticker in Allocates so that it registers this.n ticks for the new int[this.n] allocation.

  • Now rerun Allocates and refresh the outputs folder and generate plots for the resulting tick counts.

Question A4: Are the plots more similar to each other than before? What does this tell you about how much time it takes to allocate an array of n integers? What do you think allocation is doing that takes the amount of time you observe?

When you reason about an algorithm's behavior, you must consider how much time each operation or statement takes, and how many times it runs, as a function of its input size (in this case n). In particular, when you use ticks to help model an algorithm's running time (or when you reason about such time abstractly), you must set the tick counter so as to account properly for the asymptotic cost of each operation!

Efficiency as Applied to Representing Numbers

Perhaps the most important message of this course is the following:

How you compute a given result -— your choice of algorithm -- can have a profound impact on the time necessary to obtain that result.

To see this lesson in practice, do the following exercise with your group.

  1. One person should think of a positive integer x. There is no limit to how large x can be, but it should be fairly large -- at least 50.

  2. Divide the rest of the group into two teams.

  3. When the person who chooses x announces it, one team writes down the decimal representation of the number. The other team must write the number using tally marks. For example, to represent the number 5, the team would write

linear

Question A5: Which group do you expect to finish first? Can you give a formula, in terms of the number x, for the number of separate marks each group must make to write down x? (Consider a digit to be one mark.) For what value of x does decimal notation start to require fewer marks than tally marking?

Now suppose the second group is instead allowed to write their number down in binary (i.e. using both zeros and ones) instead of using tallies (a.k.a. unary).

Question A6: Can you give a formula for the number of marks needed to write x in binary? What is the ratio of the numbers of marks needed in binary vs. decimal notation, as a function of x?

Question A7: Which is more important for writing large numbers efficiently: choosing between binary and decimal, or choosing either option vs. unary? Why?

Part B: Growing a List

Let's apply what we have learned to growing an array-based list.

  • In your repository, go to the studios folder and open the Rarrays class in the studio0.growinglist package.

    This class adds repeatedly adds data to the end of an array. Initially, the reset() method allocates an array of size 2. Each call to the run() method then tries to add an element to the end of the array. If the array is full, replaceArrayWithBiggerOne() grows the array to a larger size so that the addition of the new element will succeed.

    Rarrays is an abstract class because the method int getNewSize(), which gives the new size of the array when we need to grow it, is missing. We'll experiment with different choices for this size and look at their performance implications.

    We provide two extensions of Rarrays that fill in the getNewSize() function:

    • AddOne: causes the new array to be one element larger than the old array. This is the least we can do to make room for one new element in an already full array.

    • Doubling: causes the new array to be twice the old array's size. This may seem wasteful -- a full array of size 1024 would grow to a new size of 2048, even though we are adding just one new element!

Before we can experiment, we have a small problem: the method replaceArrayWithBiggerOne() has not been fully implemented. In particular, we must allocate a new array of size newSize, copy the data from existing array member to the new array, and finally replace the old array with the new one.

  • Write the missing code for the replaceArrayWithBiggerOne() method to implement the functionality described above. Your method must pass the three unit tests in class TestRarrays, which will result in the red bar you see when running these tests turning green.

    Of the three test cases in TestRarrays,

    • testInit should work on the code as given to you.

    • testGrowPreservesData ensures that you retain information from the old array as you make the newer, larger one.

    • testGrowSufficientTicks ensures that you account properly for the ticks taken for allocating and filling the newer, larger array.

    Before you write the missing code, running TestRarrays should produce output like this:

    linear

    As you write your code, be sure to add ticks as directed in the comments so that you can correctly analyze the running time of the new code!

Part C: How much does it cost to grow the array?

Let's study how the growth rate of the array affects the overall cost of Rarrays.

  • Run AddOne, refresh your view of the outputs folder, and open the growbyone-ticks0.csv file. Plot the number ticks vs n.

Question C1: How would you describe the curve you see? What polynomial best fits the growth rate of this curve?

  • Now repeat the above exercise, but this time using the Doubling implementation.

Question C2: The plot you see is not a nice, straight line but rather exhibit several discontinuities. What do these discontinuities reflect?

  • Using any straight edge you can find nearby, see if you can find a straight line that forms an upper bound on the number of ticks as a function of the number of elements added. Try constructing some larger arrays with the Doubling implementation and see if the slope of your line stays the same.

Question C3: If you had to give a simple approximate description of the cost of the Doubling implementation as a function of the the number of elements added, what might you say? Is this enough information to choose between this implementation and the AddOne implementation?

Part D: Mathematical Analysis

We shouldn't have to write a bunch of code to decide whether we like AddOne or Doubling better. We can analyze mathematically how much time is spent growing the list to a given size $n$ in each implementation. Being able to analyze running times approximately, without needing to count ticks, is an important skill that we'll develop over the next several weeks.

There are two kinds of cost to account for here: the cost to write each new element at the end of a non-full list, and the cost whenever we resize the list. The first cost is clearly at most $n$ operations for $n$ additions, so we'll focus on the second cost.

AddOne

Each time we add an element to the array, AddOne must copy the entire array and then add the new cell. How much time does this take? When we add the $n$th element, we must initialize $n$ new cells and then copy $n-1$ existing elements. Since we're just trying to get the functional form of the cost, we'll say that that we need $n$ operations to add the $n$th element, ignoring the fact that the actual number of ops is more like $2n$.

To add $n$ elements (ignoring the fact that the first couple of copies don't require resizing), the total time $T(n)$ is therefore given by the sum \[ T(n) = 1 + 2 + \ldots + n. \]

Question D1: What is a closed-form expression for the sum $T(n)$? Does your result agree with what you have seen empirically?

If you're not sure how to compute this sum, look for a formula on the Wikipedia summations page, or look at the appendices of your textbook for a hint. You will need to know the relevant summation formula by heart at many points later in the course.

Doubling

Suppose the Doubling implementation grows an array up to size $n = 2^k$. This requires $k$ doubling operations. The $i$th such doubling must initialize $2^i$ elements and copy $2^{i-1}$. Again, we'll ignore the factor of two and estimate the overall cost by the sum \[ T(2^k) = 1 + 2 + 4 + \ldots + 2^k. \]

Question D2: What is a closed-form expression for this sum in terms of $k$? In terms of $n = 2^k$? Does your result agree with what you have seen empirically?

Again, feel free to use any resources you want to look up how to compute this sum.