Spam Traffic Characteristics Dataset
Tu Ouyang, Soumya Ray, Michael Rabinovich, Mark Allman
March 2011

The following describes the data used in the experiments presented in
the following paper:

    Tu Ouyang, Soumya Ray, Michael Rabinovich, Mark Allman.  Can
    Network Characteristics Detect Spam Effectively in a Stand-Alone
    Enterprise?  Passive and Active Measurement Conference, March 2011.

Each month in our dataset is treated independently and therefore has
its own directory of the form "2009MM" where "MM" is the numeric month.

As described in the above paper, we constructed 10 folds for each
month.  While the folds were initially chosen at random, we then had to
derive some of the features across the messages in each fold.  So, we
have retained the precise folds we used in this dataset.  Each ARFF
filename indicates the fold number and whether the given file contains
training data ("trn" files) or testing data ("tst" files).

The following is a description of the features included in the data.

@ATTRIBUTE geoDistance NUMERIC
    The geographical distance in miles between the sender and ICSI,
    based on the MaxMind GeoIP database.
@ATTRIBUTE senderHour NUMERIC
    The hour of the message arrival in sender's timezone (determined
    from the geographic location established by the MaxMind GeoIP
    database).
@ATTRIBUTE AS2Ratio NUMERIC
    Number of spams from an AS divided by total messages from that AS
    in the training set.
@ATTRIBUTE AverageIPNeighborDistance NUMERIC
    Average numerical distance from sender's IP to the nearest 20 IPs
    of other senders.
@ATTRIBUTE pkts_sunk/pkts_sourced NUMERIC
    Ratio of the number of packets sent by the local host to the
    number of packets received from the remote host.
@ATTRIBUTE rxmt_sourced NUMERIC
    Approximate number of retransmissions sent by the remote host.
@ATTRIBUTE rxmt_sunk NUMERIC
    Number of retransmissions sent by the local mail server.
@ATTRIBUTE rsts_sourced NUMERIC
    Number of segments with "RST" bit set received from remote host.
@ATTRIBUTE rsts_sunk NUMERIC
    Number of segments with "RST" bit set sent by the local mail
    server.
@ATTRIBUTE fins_sourced NUMERIC
    Number of TCP segments with "FIN" bit set received from the remote
    host.
@ATTRIBUTE fins_sunk NUMERIC
    Number of TCP segments with "FIN" bit set sent by the local mail
    server.
@ATTRIBUTE idle NUMERIC
    Maximum time between two successive packet arrivals from remote
    host.
@ATTRIBUTE 3whs NUMERIC
    Time between the arrival of the SYN from the remote host and
    arrival of the ACK of the SYN/ACK sent by the local host.
@ATTRIBUTE jvar NUMERIC
    The variance of the inter-packet arrival times from the remote
    host.
@ATTRIBUTE rttv NUMERIC
    Variance of RTT from local mail server to remote host.
@ATTRIBUTE fngr_wss(K) NUMERIC
    Advertised window size from SYN received from remote host.
@ATTRIBUTE fngr_ttl NUMERIC
    IP TTL field from SYN received from remote host.
@ATTRIBUTE OS {Windows,Solaris,Linux,UNKNOWN,FreeBSD,Others}
    OS of remote host as determined by the p0f tool from SYN packet.
@ATTRIBUTE bytecount_sourced NUMERIC
    Number of (non-retransmitted) bytes received from the remote host.
@ATTRIBUTE bytecount_sourced/tdur NUMERIC
    bytecount_sourced divided by the connection duration.
@ATTRIBUTE class {0,1}
    0 == message judged as ham
    1 == message judged as spam
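
Because the folds are distributed as ARFF files whose final "class"
attribute marks ham (0) and spam (1), a quick sanity check on a fold is
to read it and compute its spam fraction.  The following is a minimal
sketch in Python using only the standard library, not the tooling used
for the paper.  It assumes plain dense ARFF (comma-separated rows after
@DATA, attribute names without spaces); the helper names parse_arff and
spam_fraction and the three-attribute SAMPLE with made-up values are
illustrative only and are not part of the dataset.  Established ARFF
readers such as Weka or scipy.io.arff.loadarff should work equally well.

```python
import csv


def parse_arff(text):
    """Parse a simple dense ARFF document into (attribute_names, rows).

    Each row is a dict mapping attribute name -> raw string value.
    Assumes attribute names contain no whitespace (hypothetical helper).
    """
    names = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                # The second whitespace-separated token is the name.
                names.append(line.split()[1])
            elif lower.startswith("@data"):
                in_data = True
            continue
        values = next(csv.reader([line]))
        rows.append(dict(zip(names, (v.strip() for v in values))))
    return names, rows


def spam_fraction(rows):
    """Fraction of rows whose class label is "1" (judged as spam)."""
    labels = [row["class"] for row in rows]
    return labels.count("1") / len(labels)


# Toy input with invented values; real files carry all attributes above.
SAMPLE = """\
@RELATION toy
@ATTRIBUTE geoDistance NUMERIC
@ATTRIBUTE OS {Windows,Solaris,Linux,UNKNOWN,FreeBSD,Others}
@ATTRIBUTE class {0,1}
@DATA
5400.0,Windows,1
12.5,Linux,0
7300.2,UNKNOWN,1
310.0,FreeBSD,0
"""

if __name__ == "__main__":
    names, rows = parse_arff(SAMPLE)
    print(names)
    print(spam_fraction(rows))
```

To apply this to a real fold, read the file's contents into a string
and pass it to parse_arff in place of SAMPLE.  Values are kept as
strings, so numeric attributes need an explicit float() conversion.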