Homework / CS 294-105 / Empirical Analysis / Fall 2014

CS 294-105: Homework #1 - Due 9PM Wed Sep 3

The first 3 parts of this assignment should be straightforward and not take much time. Part D has a lot of elements, with some of the later ones being quite challenging. Give each one some thought, and put in some decent effort exploring them as far as you can - but don't kill yourself trying to do all of them if you find you're at a loss or it's taking too much time. Since this is a new assignment, I don't have a sense of whether it gets too hard too fast, or only at the end of part D.

Complete the assignment by by 9PM Wed Sep 3 if possible. If you can't finish it by that time, you should still strive to have it done prior to our next class meeting at 5PM Thu Sep 4, as you will get a lot more out of the accompanying lecture if you've done so.

Turn in your answers (only for part D) via email to vern@cs.berkeley.edu with the term Homework in the Subject.

Update: this is Version 3 of the assignment: Diffs versus Version 2. (Diffs of Version 2 versus Version 1.)

Complete the student survey by 3PM on Friday Aug 29.
If needed, (re)familiarize yourself with:
1. How to construct histograms and cumulative distribution functions (CDFs).
2. The (very) basics of (homogeneous) Poisson processes. This means the gist of what they capture (counting the arrivals of a memoryless arrival process) and their most basic properties (independent arrivals; exponentially-distributed interarrivals).
3. The workings of traceroute, a tool very widely used to measure the routes that individual messages ("packets") take through the Internet. (Update: my best guess now is that we won't get to talking about traceroute during the next class meeting, so if you're unfamiliar with it, you could postpone learning about it.)
For all of these, an understanding at the level of the "easy parts" of the corresponding Wikipedia page (or Unix man page, for traceroute) will suffice.
Download the keystroke timing dataset. This dataset is derived (with one exception, as discussed below) from actual measurements. Each row in the dataset represents activity seen from a given user during a remote login session. For the most part, such activity represents individual keystrokes that the user types and which are then sent to a remote server. (Some of the entries instead represent network protocol messages used by the login service instead, though these are a minority.) The first column is a timestamp in seconds, and the second is an anonymized user ID.
In your homework writeup, briefly answer as much of the following as you're able to:
1. Summarize the most basic properties of the dataset. (I'm leaving this vague so you can reflect on just what constitutes those basic properties.)
2. Do you have any comments about the data's quality? If not, simply state that you don't.
3. How would you characterize the per-user duration of activity and number of keystrokes ?
4. There's a large body of theoretical modeling premised on the notion that the "arrival" of new users (i.e., the times at when they first show up) looks like a Poisson process. To the degree that you can, do you believe that this data supports that premise? Explain how you arrived at your conclusion. Recall that Poisson processes have exponentially-distributed interarrivals, i.e., if you compute the time difference between consecutive arrivals, it should be exponentially distributed. Note that you'll have to do a bit of processing of the data to extract the interarrivals between new users.
5. Extract from the data the interarrivals of each user's individual keystrokes. Analyze these collectively (pooling them across all users). For example, if in total the data looked like:
  1.3 user5 1.6 user7 1.9 user5 2.1 user5 2.8 user7
  then you would extract the values 0.6 (the time between user5's 1st and 2nd keystrokes), 0.2 (the time between their 2nd and 3rd keystrokes), and 1.2 (the time between user7's 1st and 2nd keystrokes). The pool of aggregate data you'd analyze would then be these three values (in no particular order): 0.2, 0.6, and 1.2.
6. The aggregate set of interarrivals that you extracted has more than one qualitatively different type of behavior. See to what degree you can identify these, and explain how you did so.
7. Similar to the above regarding the arrival of users, some theories presume that the arrival of an individual user's keystrokes likewise looks like a Poisson process. Based on the aggregate interarrivals of user keystrokes that you extracted above, explore (to the degree that you can) whether you believe that this data supports that premise. Explain how you arrived at your conclusion.
8. I've added an artificial artifact to the data that manifests when analyzing the interarrivals of individual user keystrokes. (This is based on a true quirk that occurs in other such keystroke datasets, though didn't happen to do so for this one.) See if you can spot it. If so, explain how you found it, and whether you also find any evidence that it's synthetic rather than arising from actual user activity.
9. Three crucial tips:
  - If after a bit of exploration and thinking about it, you're at a loss for what a given question is asking, or how to go about it, ask on Piazza.
  - When replying to such queries on Piazza, be careful not to "telegraph" the answer, but instead just provide strategic guidance.
  - This assignment is not graded in any sense other than I'd like to ensure that you've put effort/thought into it. If you have done so, and you come up empty for some / most of it, just say so.
10. How much time did you spend on this assignment?
11. What additional data or meta-data did you find you wished this dataset had included (if any)?
12. (Optional:) Any comments on the assignment? Was it illuminating/frustrating/fun/boring? What would improve it?