Tu Ouyang, Soumya Ray, Mark Allman, Michael Rabinovich. A Large-Scale Empirical Analysis of Email Spam Detection Through Network Characteristics in a Stand-Alone Enterprise, Computer Networks, 59, February 2014.
Abstract:
Spam is a never-ending issue that constantly consumes
resources to no useful end. In this paper, we envision spam
filtering as a pipeline consisting of DNS blacklists, filters
based on SYN packet features, filters based on traffic
characteristics and filters based on message content. Each
stage of the pipeline examines more information in the message
but is more computationally expensive. A message is rejected
as spam once any layer is sufficiently confident. We analyze
this pipeline, focusing on the first three layers, from a
single-enterprise perspective. To do this we use a large email
dataset collected over two years. We devise a novel ground
truth determination system to allow us to label this large
dataset accurately. Using two machine learning algorithms, we
study (i) how the different pipeline layers interact with each
other and the value added by each layer, (ii) the utility of
individual features in each layer, (iii) stability of the
layers across time and network events and (iv) an operational
use case investigating whether this architecture can be
practically useful. We find that (i) the pipeline architecture
is generally useful in terms of accuracy as well as in an
operational setting, (ii) it generally ages gracefully across
long time periods and (iii) in some cases, later layers can
compensate for poor performance in the earlier layers. Among
the caveats we find are that (i) the utility of network
features is not as high in the single enterprise viewpoint as
reported in other prior work, (ii) major network events can
sharply affect the detection rate, and (iii) the operational
(computational) benefit of the pipeline may depend on the
efficiency of the final content filter.
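
The early-exit behavior described in the abstract (cheaper layers run first, and a message is rejected once any layer is sufficiently confident) can be illustrated with a minimal sketch. This is not the authors' implementation: the stage names follow the abstract, but the `Stage` class, the `run_pipeline` function, the relative costs, the thresholds, and the message fields (`listed`, `syn_score`, `traffic_score`, `content_score`) are all hypothetical placeholders standing in for real classifiers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Message = Dict[str, float]  # hypothetical: precomputed per-layer scores


@dataclass
class Stage:
    name: str
    cost: float                        # hypothetical relative cost of running this layer
    score: Callable[[Message], float]  # estimated spam confidence in [0, 1]
    threshold: float                   # reject when score >= threshold


def run_pipeline(msg: Message, stages: List[Stage]) -> Tuple[str, str, float]:
    """Run stages in order; stop at the first one confident enough to reject.

    Returns (verdict, name of rejecting stage or "", total cost spent).
    """
    spent = 0.0
    for stage in stages:
        spent += stage.cost
        if stage.score(msg) >= stage.threshold:
            return ("reject", stage.name, spent)
    return ("accept", "", spent)


# Toy stages ordered from cheapest to most expensive (all values assumed).
STAGES = [
    Stage("dnsbl", 1.0, lambda m: m.get("listed", 0.0), 0.9),
    Stage("syn", 2.0, lambda m: m.get("syn_score", 0.0), 0.9),
    Stage("traffic", 5.0, lambda m: m.get("traffic_score", 0.0), 0.9),
    Stage("content", 50.0, lambda m: m.get("content_score", 0.0), 0.5),
]

if __name__ == "__main__":
    blacklisted = {"listed": 1.0}
    content_only = {"content_score": 0.7}
    print(run_pipeline(blacklisted, STAGES))   # rejected at the first layer
    print(run_pipeline(content_only, STAGES))  # falls through to the content filter
```

In this toy setup, a message caught by the first layer costs far less than one that reaches the content filter, which mirrors the abstract's caveat that the operational benefit depends on how expensive the final content filter is.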
BibTeX:
@article{ORAR14,
author = "Tu Ouyang and Soumya Ray and Mark Allman and Michael Rabinovich",
title = "{A Large-Scale Empirical Analysis of Email Spam Detection Through Network Characteristics in a Stand-Alone Enterprise}",
journal = "Computer Networks",
year = 2014,
volume = 59,
month = feb,
pages = "100--121",
}