Exploring the Long Tail of (Malicious) Software Downloads

Babak Rahbarinia∗, Marco Balduzzi†, Roberto Perdisci‡
∗Dept. of Math and Computer Science, Auburn University at Montgomery, Montgomery, AL
†Trend Micro, USA
‡Dept. of Computer Science, University of Georgia, Athens, GA
[email protected], marco balduzzi(at)trendmicro.com,
[email protected]

Abstract—In this paper, we present a large-scale study of global trends in software download events, with an analysis of both benign and malicious downloads, and a categorization of events for which no ground truth is currently available. Our measurement study is based on a unique, real-world dataset collected at Trend Micro containing more than 3 million in-the-wild web-based software download events involving hundreds of thousands of Internet machines, collected over a period of seven months.
Somewhat surprisingly, we found that despite our best efforts and the use of multiple sources of ground truth, more than 83% of all downloaded software files remain unknown, i.e., cannot be classified as benign or malicious, even two years after they were first observed.

This dataset contains detailed (anonymized) information about 3 million in-the-wild web-based software download events involving over a million Internet machines, collected over a period of seven months. Each download event includes information such as a unique (anonymous) global machine identifier, detailed information about the downloaded file, which process initiated the download, and the URL from which the file was downloaded. To label benign and malicious software download events and study their properties, we make use of multiple sources of ground truth, including information