
1 ISOT Forgery Dataset Description

An important issue in achieving a robust continuous authentication (CA) system is assessing and strengthening the approach against forgeries. Stylometric analysis can be the target of automated attacks, also referred to as generative attacks, in which high-quality forgeries are generated automatically from a small set of genuine samples. An adversary with access to a user's writing samples may be able to reproduce many of the existing stylometric features effectively. To assess the robustness of our proposed stylometry analysis approach against forgery attempts, we collected a novel forgery dataset.

We organized an experiment in which volunteers were invited to generate forgeries against sample tweets selected randomly from datasets. The participants were 10 volunteers (7 male and 3 female), aged 23 to 50 years, with different backgrounds. We randomly selected sample tweets from 10 authors in the Twitter Dataset, who were treated as legal users. Impostor samples were collected through a simple form (Figure 1). The top section of the form provides a list of tweets from a specific legal user, used in simulating the forgery attack. The lower section has two fields: one for participants to enter their name, and another for them to write three or four tweets that try to reproduce the writing style of the legal user. We implemented our survey on the Google Forms platform. The only restriction imposed was a minimum size of 350 characters per sample, spread over 3 to 4 tweets.

Each work day, we sent each volunteer a new form with a different legal user's information. All volunteers were instructed to provide one sample per day. The data was collected over a period of 30 days. We had no control over the way volunteers wrote their tweets. The collected data consisted of an average of 4,253 characters per volunteer, spread over 10 attacks. The file “forgeryDataset.zip” contains our dataset in plain text.
Each directory belongs to a specific author, and each file inside the directory is a forgery session generated by a volunteer. The profiles of the users targeted in the forgery attacks are as follows:

1. ; screen name = MayorofLondon; id = 14700117;

2. ; screen name = campbellclaret; id = 19644592;

Figure 1: Screenshot of a form with tweets from an author in the forgery attack experiment.

3. Susiebubble; screen name = susiebubble; id = 16935734;

4. Martha Lane Fox; screen name = Marthalanefox; id = 22239898;

5. AndrewSparrow; screen name = AndrewSparrow; id = 15778426;

6. Imogen Heap; screen name = imogenheap; id = 14523801;

7. Robert Peston; screen name = Peston; id = 14157134;

8. Simon Pegg; screen name = simonpegg; id = 18713254;

9. John Rentoul; screen name = JohnRentoul; id = 14085096;

10. Tim Minchin; screen name = timminchin; id = 18980276.
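
The directory layout described above (one directory per targeted author, one file per forgery session) can be read with a short script. The following is a minimal Python sketch, not part of the dataset distribution. The function names `load_forgery_sessions` and `short_sessions` are hypothetical, and the sketch assumes that each session file sits directly inside its author directory, that the text is UTF-8 encoded, and that the 350-character minimum counts every character in the file, including whitespace. None of these assumptions is confirmed by the description.

```python
import io
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

MIN_CHARS = 350  # minimum sample size stated in the dataset description


def load_forgery_sessions(archive):
    """Map each author directory to a {file name: text} dictionary.

    `archive` is a path or file-like object for the zip file. The layout
    (parent directory = author, file = one forgery session) is assumed.
    """
    sessions = defaultdict(dict)
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            path = PurePosixPath(info.filename)
            if len(path.parts) < 2:
                continue  # skip files that are not inside an author directory
            author = path.parent.name
            # The encoding is an assumption; undecodable bytes are replaced.
            text = zf.read(info).decode("utf-8", errors="replace")
            sessions[author][path.name] = text
    return dict(sessions)


def short_sessions(sessions, min_chars=MIN_CHARS):
    """Return (author, file name, length) for sessions below min_chars."""
    return [
        (author, name, len(text))
        for author, files in sessions.items()
        for name, text in files.items()
        if len(text) < min_chars
    ]
```

Calling `load_forgery_sessions("forgeryDataset.zip")` and passing the result to `short_sessions` would list any session that falls under the stated minimum, which may be a useful sanity check before feature extraction.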

To reference this dataset, use:

Marcelo Luiz Brocardo, Issa Traore. “Continuous Authentication using Micro-Messages”, Twelfth Annual International Conference on Privacy, Security and Trust (PST 2014), Toronto, Canada, July 23-24, 2014.
