
–Independent Work Report, Fall 2013 (Revised)–

File System Usage Patterns

Abstract

This paper studies the types of data that users store on their file systems and their patterns of interaction with that data. I compare the metrics gathered for this paper with those of previous studies to demonstrate the evolution of file system usage patterns and to predict future developments. In addition to the comparison, I summarize previous research on users’ desires for file system functionality. Finally, I suggest a hybrid cloud-local file system that will improve the file system user experience. The data presented in this paper and previous ones demonstrates that the hypothesized system is better suited to users’ present and future usage patterns than current file systems and that a prototype should be built for further testing.

Contents

1 Introduction
2 Data Collection
   2.1 Sample Selection
   2.2 Infrastructure
3 Research On Local Storage File Systems
   3.1 File Sizes
   3.2 File Ages
   3.3 File Types
   3.4 File Count
   3.5 File System Sizes
4 Previous Research on Cloud Storage
   4.1 Consumer Cloud Usage
   4.2 Enterprise Cloud Usage
5 The Hybrid Cloud-Local File System
   5.1 File System Specifics
   5.2 Do Users Want It?
6 Conclusion

1 Introduction

Over previous decades, file system usage pattern research guided the development of file systems. Previous studies examined various aspects of file systems, including rates of data storage and retrieval, the types of data stored, and the amount stored. [5, 22, p. 1, p. 93] Designers of industry-standard file systems such as Microsoft’s NTFS used this data to make feature inclusion and implementation decisions. [22, p. 103-4] However, these papers are not the final word in file system research. They are old, and user activity patterns may have changed since their publication. Additionally, the researchers limited their analyses to file systems which keep all data within the physical boundary of a single computer’s case. File systems currently under development, particularly ones that utilize decentralized storage technologies, require guidance from research that does not have these limitations.

This paper addresses the age issue by combining previously existing data with a new study to analyze the development of file system usage patterns over the past 15 years. Computers have gotten faster processors and larger hard drives. Internet speeds have increased about 50% per year over the last decade. [16] In response to these changes, users may be storing different types of media on their systems. Or, they may be keeping all of their data on servers that they connect to over the Internet. The data in my study shows current tendencies for data storage.

This paper also offers a prediction of future file systems. I suggest an improved file system that utilizes both hard drive and Internet storage, a “hybrid cloud-local file system”. Such a system will improve the user experience by optimizing for the file access patterns most commonly found in file system usage data. For the rest of this paper, I shall refer to a collection of remote servers as “the cloud” and storage solutions built on these computers as “cloud-based storage”.

This paper has four components:

1. The first section explains the data collection procedures. This will allow future researchers to collect data more easily and to reproduce my results.
2. The second section summarizes previous research on local storage file systems and compares their results to my data. This will demonstrate the evolution and current state of file system usage patterns.
3. The third section reviews research on consumer sentiment toward cloud storage. This research will show users’ opinions of non-local file system technologies.
4. The final section analyzes this and other papers’ data in order to evaluate the validity of the hypothesized hybrid cloud-local file system.

2 Data Collection

I collected the following data from the file systems of 24 subjects: total capacity, free space, and distribution of file sizes. Additionally, for each file type that consumed at least 0.1% of the used space on a file system, I collected the average size, total size, and average time since last access of files of that type.
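The per-type aggregation can be sketched roughly as follows. This is an illustrative reconstruction using the standard Java file-walking APIs, not the profiler's actual code; the class name, output format, and the use of last-access time are assumptions made for the example.

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the per-type statistics described above.
// Walks the tree rooted at args[0], accumulates per-extension count,
// total size, and age, then reports types above the 0.1% threshold.
public class TypeStats {
    static class Stats { long count, totalBytes, totalAgeMillis; }

    public static void main(String[] args) throws IOException {
        Map<String, Stats> byExt = new HashMap<>();
        long[] usedBytes = {0};
        long now = System.currentTimeMillis();

        Files.walkFileTree(Paths.get(args[0]), new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                String name = file.getFileName().toString();
                int dot = name.lastIndexOf('.');
                String ext = dot >= 0 ? name.substring(dot + 1).toLowerCase() : "(none)";
                Stats s = byExt.computeIfAbsent(ext, k -> new Stats());
                s.count++;
                s.totalBytes += attrs.size();
                s.totalAgeMillis += now - attrs.lastAccessTime().toMillis();
                usedBytes[0] += attrs.size();
                return FileVisitResult.CONTINUE;
            }
            @Override
            public FileVisitResult visitFileFailed(Path file, IOException e) {
                return FileVisitResult.CONTINUE; // skip unreadable files
            }
        });

        // Report only types that consume at least 0.1% of the used space.
        byExt.forEach((ext, s) -> {
            if (s.totalBytes >= usedBytes[0] / 1000) {
                System.out.printf("%s avgSize=%d totalSize=%d avgAgeDays=%d%n",
                        ext, s.totalBytes / s.count, s.totalBytes,
                        s.totalAgeMillis / s.count / 86_400_000L);
            }
        });
    }
}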

2.1 Sample Selection

I attempted to select a population that was representative of all computer users. Previous studies had focused on limited populations such as Microsoft employees. [1, p. 1] In order to get a sufficiently broad sample, I started by randomly selecting students from the entire undergraduate population of a university as of December 2013. The data that resulted from these subjects may be biased due to a low survey response rate of 2.93%: only 11 students participated out of the 375 that were invited. It is plausible that only users with particular usage patterns would be willing to participate in an experiment that runs a program which collects data from their computers and reports the results to a central server. While this low response rate is troubling, there are no superior methods for data collection; one cannot ethically collect information from subjects who are unwilling to participate. Nevertheless, future researchers should attempt to achieve a higher response rate, as I was forced to select additional subjects using inferior, less random methods.

I directly recruited other users in order to gather enough data to draw meaningful conclusions, inviting 13 friends and family members who were not part of the random sample. Their data may be biased because my network of acquaintances may not represent the average user. However, the recruited users come from a diverse set of age groups, from 20 to 70 years old, and occupations, such as student, engineer, and lawyer. The resulting data also appears to be representative of the broader universe of computer users: it includes all major operating systems (Windows, Linux, and Macintosh) and, as seen below, seems plausible when compared to previous papers. The 24-user sample is not optimal, but it is large enough and diverse enough to support meaningful conclusions about the average computer user.

2.2 Infrastructure

Figure 1: Data Collection Infrastructure

As shown in Figure 1, the experiment has two main technical components: a client and a server. The client runs on the users’ computers and collects information about each user’s file system. The server runs a program called Splunk which compiles the clients’ data.

In order to collect data from a sample that is representative of average computer users, as many people and computers as possible must be able to run the client. The profiler is easy to use so that less technical users can provide data. Additionally, the profiler runs on all major operating systems: Windows, Linux, and Macintosh. The multiplatform requirement presents difficulties because the profiler must collect OS-specific information, such as the file system format. I made several language and library choices to accomplish this combination of ease-of-use, OS independence, and OS dependence. The profiler uses the Java programming language in order to be able to run on any computer. Additionally, it uses the JWrapper program and the Sigar libraries. JWrapper converts Java jar files into native executables for Windows, Linux, and Macintosh computers. [19] Since the Macintosh executable did not function properly, those users ran the jar using a bash script. Nontechnical users can run these executables and scripts with only a single click. The Sigar library provides Java with APIs for examining the file system. These libraries are natively compiled for all targeted operating systems and are accessed through Java functions that provide OS-specific information while allowing the programmer to write OS-agnostic code. [21] Using these libraries, I wrote a profiler that easily downloads, executes, and reports back to a central server while running on any platform and requiring minimal user interaction.

I created a data collection server using the Splunk software running on an Amazon EC2 instance. EC2 instances are easy to maintain because Amazon handles the issues of server uptime and router configuration. The Splunk software automatically listens for incoming packets on a particular port and then provides a report of the data in those packets. I used this capability to track each client’s report and then print out a file containing all of the data. The server and profiler pair provided the necessary infrastructure for my experiment.
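For concreteness, the reporting step could look roughly like the sketch below. It assumes Splunk is configured with a raw TCP data input and that Sigar's FileSystemUsage API supplies capacity and free space; the host name and port are placeholders, not the study's actual configuration.

import java.io.PrintWriter;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

import org.hyperic.sigar.FileSystemUsage;
import org.hyperic.sigar.Sigar;

// Illustrative sketch: gather capacity and free space with Sigar and push
// one summary statistic per line to the Splunk server's TCP data input.
public class Reporter {
    public static void main(String[] args) throws Exception {
        Sigar sigar = new Sigar();
        // Total and free space of the root file system (Sigar reports sizes in KB).
        FileSystemUsage usage = sigar.getFileSystemUsage("/");

        List<String> lines = new ArrayList<>();
        lines.add("totalKB=" + usage.getTotal());
        lines.add("freeKB=" + usage.getFree());

        // Host and port are placeholders for the collection server's TCP input.
        try (Socket socket = new Socket("collector.example.com", 5000);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            for (String line : lines) {
                out.println(line);
            }
        }
    }
}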

3 Research On Local Storage File Systems

In this section, I will analyze current statistics on file sizes, file ages, file types, file counts, and file system sizes and the evolution of these values over the last 15 years. In order to do this, I will compare my data with that of four academic papers from the 1990s and 2000s. I will first examine A Five-Year Study of File-System Metadata (referred to as the Five-Year Study paper). This paper analyzes the computers of Microsoft employees from 2000 to 2004, including file size, age, and type. File System Usage in Windows NT 4.0 (referred to as the NT paper) is a 1998 paper that measures the types and sizes of files stored in NTFS file systems as well as the file open, read, and write patterns of the Windows NT 4.0 operating system. A Large-Scale Study of File-System Contents (referred to as the Large Scale paper) is another 1998 study of Microsoft employees’ computers. This paper investigates file and directory properties and how they differ depending on the occupation of the computer user. Finally, A Study of Irregularities in File-Size Distributions (referred to as the Irregularities paper) records the types of files used by Windows, Linux, and Macintosh users at Harvey Mudd College in 2001. The paper focuses on the effect of media files on file size distributions. 1

3.1 File Sizes

This section will focus on the distribution of file sizes, the average file size, and the development of these values over the last 15 years. 2 The two earliest papers find that most files are in the 1 to 10 KB range. The Large Scale paper shows a log-normal distribution of file sizes with a median of around 4 KB; 1.7% of all files are empty. [5, p. 3] The NT paper finds a similar distribution of file sizes: 40% of operations are to files smaller than 2 KB. [22, p. 100]

1 The data from these papers may provide a biased representation of the average computer user. It may overweight the usage patterns of highly technical users. The papers that use Microsoft employees may show more programming files than the average user. Additionally, the Irregularities paper largely uses students who go to an engineering school and stay on campus during the summer. [6] They are probably more technical than the average person.
2 Please note that, as discussed above, conclusions based on my data may suffer due to my experiment’s small sample size and imprecise values. Additionally, I use the NT paper’s data even though it doesn’t measure the same quantity as the other papers. It reports the sizes of files opened rather than those of all files stored on disk.

Figure 2: This Paper’s Findings For File Size

Figure 3: Previous Papers’ Findings For File Size

Figure 4: Large Scale Paper, p. 3
Figure 5: NT Paper, p. 8

Figure 6: Five-Year Paper, p. 4
Figure 7: Five-Year Paper, p. 4
Figure 8: Irregularities Paper, p. 5

The two later papers find that files are larger but have a similar distribution compared to the previous papers. The Five-Year Study paper finds a log-normal distribution centered between 2KB and 32KB and that 1 to 1.5% of files are empty. Over the five years, the mean file size grows by roughly 15% per year from 108 KB in 2000 to 189 KB in 2004. [1, p.4] The Irregularities paper finds a log-normal distribution like that in the Five-Year paper. However, it reports a larger mean file size for Windows users, 720 KB. [6, p. 4-5]

My data follows that from the Five-Year Study. I find a mean file size of 316 KB and a roughly log-normal distribution in the above histogram. Additionally, the line in the above graph shows the CDF of this distribution. The CDF is included for easy comparison with the Five-Year Study paper. My CDF again shows a similar curve to previous work but with less granularity. Both CDFs show that roughly 20% of files are below 1 KB in size and that most files are also below 1 MB in size. [1, p. 4]

There are two major shortcomings with my data: small sample size and imprecise values. I have fewer subjects than other papers, as the Large Scale paper had over 4000 participants and the Five-Year Study had over 60000, compared to my 24. [5, 1, p. 2, p. 1] Additionally, my file size data is less precise than that of other papers. When recording a user’s file sizes, I grouped files into bins starting in the range 0 to 1 KB and then grew the bins by a factor of 10 until 1 GB; all files greater than 1 GB went into a single bin. This hides the distribution of files in dense regions such as below 1 KB and between 1 and 10 KB. These errors affect my average file size and the log-normal distribution. My mean file size of 316 KB is half of the 600-700 KB predicted by extending the Five-Year Study’s 15% annual growth rate from 189 KB in 2004 to 2013 (see the worked projection at the end of this subsection). Since I have so few samples and since that paper’s results most closely mirror my own, I accept the difference between its results and mine as a random deviation. This difference may be due to my small sample size and, if I had more subjects, my mean file size could increase. The distribution of my files is log-normal if you accept the abrupt peak at 1-10 KB and the fact that the left tail is bunched together in one bin. Better data would divide this bin so that the left tail trails off more smoothly.

I conclude that file sizes are generally log-normally distributed around 6-100 KB with a significant number of empty files. The average file size is currently in the range of 300 to 700 KB. Finally, these values have increased over time and probably will continue to increase. With the exception of the Irregularities paper, each data set shows larger file sizes than the previous ones. The two later studies, the Five-Year and the Irregularities, differed in average file size by a factor of 7. This difference is most likely due to the different data samples. The Irregularities paper used Harvey

Mudd students while the Five-Year Study paper used Microsoft employees. Since my data more closely models that of the Five-Year Study, I suggest that the Irregularities paper is an outlier and not representative of the general population.
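For reference, the 600-700 KB prediction cited above comes from compounding the Five-Year Study's observed 15% annual growth over the nine years from 2004 to 2013:

\[ 189\ \mathrm{KB} \times 1.15^{9} \approx 189\ \mathrm{KB} \times 3.52 \approx 665\ \mathrm{KB} \]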

3.2 File Ages

Figure 9: This Paper’s Findings For File Ages

Figure 10: Previous Papers’ Findings For File Ages

Figure 11: Five-Year Paper, p. 5
Figure 12: Five-Year Paper, p. 5

Figure 13: Large Scale Paper, p. 7

This section will focus on the distribution of file ages, the average file age, and the evolution of these values over the last 15 years. I define file age as the time since a file’s last modification.

Two of the previously examined papers measure file age. The earlier paper, the Large Scale study, finds that a file’s average age is 48 days. The distribution shown in the paper demonstrates that this value may be too low to accurately describe all file ages: the paper’s distribution of ages shows that many files are older than 97 days, including a significant number between 1 and 4 years old. [5, p. 7] The Five-Year Study paper finds that file ages are distributed around 100 days old. The median is between 80 and 160 days, and their distribution shows a significant number of files that have a lifetime of greater than a year. The CDF in the paper shows that about 20% of files have a lifetime of greater than 330 days. Finally, the distribution of file ages remains relatively constant between 2000 and 2004. However, they do find that the deviation of ages decreases over time. The number of old and young files decreases relative to those between 90 and 500 days old. [1, p. 5]

My data follows that of the Five-Year Study paper. The histogram shows that a large group of files continue to be between 100 and 300 days old. Additionally, there is a long right tail of files older than 500 days. The major shortcoming with my data is that I did not individually record the age of every file. Instead, I measured the average age of files of each type on each computer. Then, I reported all file types that made up at least 0.1% of the data on the file system. Therefore, the above distribution is a histogram of averages. Additionally, these averages are not weighted by the number of files that made up each average. Therefore, if one average is based on many young files and another one is based on one old file, the two values are given the same weight. This potentially overrepresents very old files. I hypothesize that this is the reason that my data shows significantly more older files than the other papers.

I conclude that file ages are generally distributed around 100 to 300 days. I estimate that the average age is in that range and that file ages remain static over time. However, I lack sufficient data points to state with certainty that those last two claims are correct.
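To make the weighting issue precise: if file type $i$ contributes $n_i$ files with per-type average age $a_i$, the histogram above treats every $a_i$ equally, whereas a count-weighted average of the form below would prevent types containing only a few very old files from being overrepresented:

\[ \bar{a}_{\mathrm{weighted}} = \frac{\sum_i n_i a_i}{\sum_i n_i} \]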

Figure 14: This Paper’s Findings for Space Used By Popular File Extensions

Figure 15: This Paper’s Findings For Most Common File Extensions

Figure 16: Previous Papers’ Findings For File Types

Figure 17: Large Scale Paper, p. 8
Figure 18: Five-Year Paper, p. 6
Figure 19: Five-Year Paper, p. 6

3.3 File Types

This section will focus on the most common file types, the types that consume the most space, and the evolution of these values over the last 15 years. Since the data collected by my study and the previous ones is in different formats, they are not directly comparable and my conclusions include

some speculation.

The earliest paper, the Large Scale study from 1998, finds that gif files were the most common, followed by h, htm, dll, c, and exe files. Dll and exe files had the largest average size. However, they are of the same magnitude as other file type sizes. The Five-Year Study paper initially displays a similar distribution to the Large Scale one. Gifs are the most common type in 2000, followed by dll and h. Dll files consume the most space on user hard drives, followed by other program files and mp3 music files. [1, p. 6] By 2004, the only change in the distribution of file frequency is a decreasing number of gif files. While wma music and vhd virtual hard drive image types are never among the most common, they consume significant amounts of space in 2003 and 2004. 3 [1, p. 6]

My data shows that the most common files are program files, followed by text and image files. The program files are nib, m, and plist, Objective-C files used for programs on Apple laptops, tablets, and cellphones. [3] Tfm is a TeX font file and png files contain images. The files that consume the most space are videos, mkv and avi; programs, dll and exe; and images, jpg.

I conclude that programs and media files are the most common and consume the most space. Over the past 15 years, the ratio between these two groups of file types has remained constant, although the types of media files have changed from images to songs and videos. In each of the data sets from around 2000, the image type gif is most common. It is closely followed by the dll and h types. With some speculation, it is apparent that these types also consumed the most space on disk. The data from the Five-Year paper directly shows this. The data from the Large Scale paper requires some speculation. In that paper, the average sizes of all the types are of the same magnitude. Therefore, the most common file, gif, probably appeared enough to consume the most space on disk even if its average file size is not the largest. The later data from 2004 and 2013 shows the same mix of program and media files. However, there is a decrease in the number of image files as there are fewer gifs. Wma files first appear in 2003, showing the rise of the music file. By 2013, mkv and avi video files consume large portions of users’ disks. Thus, the file type distribution remains static except that image files are replaced by music and video ones.

3 I hypothesize that the vhd files appeared in 2003 because that year Microsoft acquired a company that used the file type. [12] The Five-Year Study paper is an examination of Microsoft employees, and so they probably started using vhd files at the time of the acquisition.

3.4 File Count

Figure 20: This Paper’s Findings For File Count

Figure 21: Large Scale Paper, p. 9
Figure 22: Five-Year Paper, p. 4

This section will focus on the distribution of file counts and the evolution of this distribution over the last 15 years. Since my sample size is significantly smaller than that of other papers and only two of the previous papers have data on file counts, my conclusions require some speculation. The Large Scale paper from 1998 finds that 61% of the sampled file systems have fewer than 16000 files and that the mean number of files per user is 31835. Additionally, the distribution shows that most systems have around 8000 files and that most are between 2000 and 32000. [5, p. 9] The Five-Year Study finds that the average number of files per file system is 30000 in 2000 and increases to 90000 in 2004. Also, the study finds that the median number of files per system increases from 18000 in 2000 to 52000 in 2004. [1, p. 3] Finally, the CDF presented in the Five-Year paper shows that very few file systems are empty and that 60% of file systems have between 8000 and 64000 files. [1, p. 4]

My data shows that the average number of files per file system is 353752 and the median number is 292439. The distribution of files per file system is bimodal, with one peak at 100000 to 300000 files and another at 500000 to 600000 files. There is one file system with fewer than 100000 files, and the vast majority are between 100000 and 600000 files. The CDF shows that 60% of users have between 100000 and 500000 files.

The major shortcomings with the data for this section are a small sample size and few previous examples. The small sample size may be responsible for the bimodal distribution. Each of the other two papers finds a unimodal distribution. [5, 1, p. 9, p. 4] The fact that I only have 24 subjects means that there are fewer data points to produce a smooth histogram. If I had more data points, I may have found a unimodal distribution. Below, I suggest that the number of files per file system is increasing at an increasing rate. I use three points, the three papers with data on file count, to draw this conclusion. This is the minimum number of data sets necessary to state the acceleration of a value. This hypothesis would be more conclusive if the other two papers also had data on file counts.

I conclude that the mean file count grew from 30000 files to around 300000 over the past 15 years and that this growth occurred at an increasing rate. Each paper finds a larger number of files per file system than the one before it. Additionally, this difference increases over time, suggesting that the second derivative of the number of files per file system is positive. The mean number of files per system increases by 60000 files between 1998 and 2004 and by roughly 260000 files from 2004 to 2013. This is a four times larger increase in only one and a half times the time, 9 years versus 6 years. It is possible that some of this difference is due to different samples. However, since the previous sections’ analyses show that my data is similar to that of the previous papers in other categories, I think that the rate is increasing and that this is not an aberration.
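The growth-rate claim can be checked directly from the three data points above: the mean rose by about 60000 files over the six years from 1998 to 2004, versus roughly 260000 files over the nine years from 2004 to 2013, so the average annual increase roughly tripled:

\[ \frac{60000}{6} = 10000\ \text{files/year} \qquad \text{versus} \qquad \frac{260000}{9} \approx 29000\ \text{files/year} \]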

3.5 File System Sizes

This section will focus on the distributions of file system size and percent fullness and the evolution of these distributions over the last 15 years.

Figure 23: This Paper’s Findings for File System Percent Fullness and Size

Figure 24: Previous Papers’ Findings for File System Percent Fullness and Size

Figure 25: Large Scale Paper, p. 10
Figure 26: Five-Year Paper, p. 11
Figure 27: Five-Year Paper, p. 11

The Large Scale study finds that the most common percent usage for a file system in 1998 was between 0% and 5%. Starting at 5%, this number sharply decreases. Then, there is a slow linear increase in the number of file systems as the percent usage increases from 5% to 100%. [5, p. 10] The Five-Year Study paper also finds a linear increase in percent fullness from 5% to 100%. The paper finds few empty file systems. Additionally, most file systems in the study are between 1 GB and 64 GB in size. [1, p. 11]

My data shows a bimodal distribution of file system percent fullness and a unimodal distribution of file system size with a few outliers. The distribution of percent fullness has two peaks, one between 10% and 20% and another between 80% and 90%. Most file systems are between 100 and 500 GB, and a few are above 1000 GB. The main shortcoming with my data, the small sample size, may be responsible for the unusual distribution of percent fullness. If I had more subjects, I may have found more values in the 80% full bin, leading to a more recognizable single peak over the bins from 70% to 90%.

Despite this shortcoming, I conclude that file systems are increasing in total size and are becoming more divided in terms of fullness percentage, being either nearly full or nearly empty. My

data shows larger file system sizes than the previous Five-Year Study. 4 For percent fullness, the two earlier papers show some nearly empty and some nearly full file systems. However, there are also some file systems near 50%. I find very few file systems that are near 50% full. The gap seems large enough that it probably is not the result of the small sample size. Rather, file systems are becoming more divided over time with regard to percent fullness.

4 I speculate that hard drive size standardization is the cause of the outliers in my file system size graph. Few manufacturers produce hard drives between 500 GB and 1000 GB. As of December 20th, 2013, Newegg.com has 720 hard drives for sale and 24 of them are between 600 and 900 GB. [14, 15] Thus, users can only purchase hard drives smaller than 500 GB or larger than 1000 GB.

4 Previous Research on Cloud Storage

This section addresses the following questions about potential consumer and enterprise users: desired use cases, concerns with the technology, and current adoption rates.

4.1 Consumer Cloud Usage

Consumers demonstrate their future desired use cases for cloud storage through their current activities: sharing content with others, accessing data from multiple devices, and backing up files. A 2012 report found that 90% of respondents use the cloud for photo sharing, 42% use it for sharing other types of data, 51% use it for accessing data from multiple devices, and 25% use it for backing up data. [7, p. 7] Since these are the most common use cases today, they provide a crude prediction of the capabilities that users will want from future cloud storage systems.

Data privacy is one of the most significant concerns. Consumers do not trust cloud storage providers and are willing to pay for a service that protects their data. Subjects in a 2011 experiment “almost unanimously believed that it would be ’easy’ ... for a hacker to get their data from the cloud”. [9, p. 6-7] Additionally, 79% of those surveyed would choose to pay for cloud storage rather than receive it for free if the paid plan guaranteed data privacy and the free one did not. [9, p. 9] Cloud storage services must ensure data privacy in order to achieve market adoption.

Consumers most often use cloud technologies on tablets and cellphones. In one study, 58% of subjects use the technology on their cell phone, 54% on their tablet, and 49% on their desktop. [7, p. 9] Therefore, new cloud technologies should run on mobile devices in order to reach consumers on the platforms where they are most comfortable.

4.2 Enterprise Cloud Usage

Companies can use cloud computing to replace three levels of infrastructure: end-user applications, software development tools, and hardware. [11] Each portion allows businesses to simplify their infrastructure. Companies can change computer capacity and costs without adding or removing hardware. They just pay more or less to the cloud provider. Software as a Service (SaaS) cloud

technologies replace end-user, externally developed programs such as Microsoft Word. They provide enterprise users with the ability to upgrade and manage access to applications without physically accessing their employees’ hardware. [18] Platform as a Service (PaaS) cloud technologies provide software developing, testing, and distributing abilities for a company’s proprietary programs. This is more generalized than SaaS, as enterprise users can run their own applications in the cloud; but, they must write and maintain their own code. [18] Finally, Infrastructure as a Service (IaaS) cloud technologies provide the most general cloud abstraction: computation and storage over the Internet without any extra software. [18] Users choose their own operating system and software packages without additional support from the cloud provider.

One of enterprise users’ main concerns is the cloud’s reliability. A 2013 survey found that 74% of companies that use IaaS have multiple providers. [13] This redundancy allows for easier recovery in the event that one provider fails. However, it also suggests that enterprise users do not trust the cloud, as otherwise fewer companies would feel the need to have multiple providers. Any technology, such as the hypothesized hybrid cloud-local file system, will need to address the issue of reliability before companies fully adopt it.

Businesses already have adopted and will continue to adopt cloud technologies. A 2013 study found that 75% of the companies surveyed are using cloud technologies and that this increased from 67% the year before. Additionally, the survey predicted that the market for cloud technologies would grow by 126.5% from 2011 to 2014. [17] Thus, despite their reliability concerns, enterprise users will adopt technologies that utilize the cloud, such as Internet-based storage.

5 The Hybrid Cloud-Local File System

I hypothesize that a file system consisting of both local and Internet storage components will be an improvement over current, local-only storage systems. First, I will explain the details of the proposed file system. Next, I will show how the file system satisfies both the usage patterns found in section 3 and the user desires identified in section 4. The file system accomplishes this by providing the desired features while making acceptable trade-offs when compared to local storage file systems.

5.1 File System Specifics

Figure 28: User’s Perspective

Figure 29: Implementation

The file system will create the illusion of infinite local storage that can be shared between different computers and users. Figure 28 shows the users’ perspective; each computer appears to locally store all the data that a user has permission to access on the device. Figure 29 shows the implementation; all data is on the server and local computers only keep a portion. The system will store each file in blocks of a fixed size such as 4KB. Each file will have permissions for which computers and users can access it. The cloud will store the master copy of this data. Local hard drives will act as caches for cloud storage. The system files necessary to boot will remain on each computer’s hard drive. The other files will be downloaded and deleted as necessary. If a file is needed, the file system will

download it and pause all access attempts until the download is finished. Fixed-size blocks mean that files can be partially downloaded so that the relevant parts are on disk in less time.

The file system will handle Internet connectivity issues through an update log. All changes to data blocks will be made on the local disk first and then written to a log. This log will be pushed to the cloud when the computer is connected to the Internet. The computer will be able to function even when it is offline, as it will merge its changes with the online, master copy of the data once a connection is reestablished.

The hard drive is not a necessary portion of the file system, but it is a useful component. An alternative would be to use just RAM, ROM, and cloud storage. It is possible to store a bootloader on ROM, download the OS over the Internet on boot, store it in RAM, and keep all persistent data in the cloud. [10] The hard drive is a better choice for local storage than RAM due to the issues of cost, number of uncached data reads, and reliability of persistent storage. RAM is faster than a hard drive, but RAM is also significantly more expensive. [20, p. 26] It may be prohibitively expensive to buy enough RAM to simultaneously store an entire operating system and all relevant data. A RAM solution would also require all files to be redownloaded after a restart. Hard drives can keep files in cache when the system is off, and redownloading a file is much slower than keeping it on disk. [4, 8, p. 4] Finally, the diskless system could not shut down without an Internet connection: when the computer shuts down, all changes to persistent data must be uploaded to the cloud if there is no hard drive. With a hard drive, the log of changes to the file system can be uploaded after the computer restarts. This would allow the computer to function even if there is no Internet access.
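A minimal sketch of the update-log idea follows. The class names, on-disk log format, and cloud interface are illustrative assumptions, not a specification of the proposed system: block writes are recorded locally and drained to the cloud master copy whenever a connection is available.

import java.io.IOException;
import java.nio.file.*;

// Illustrative sketch of the update log described above. Entries accumulate
// on the local disk while offline and are pushed to the cloud once connected.
public class UpdateLog {
    private final Path logFile;

    public UpdateLog(Path logFile) { this.logFile = logFile; }

    // Record a write to one fixed-size block of a file (one log entry per line).
    // A real system would also log the block contents; here we log only which block changed.
    public void recordBlockWrite(String fileId, long blockIndex) throws IOException {
        String entry = fileId + " " + blockIndex + System.lineSeparator();
        Files.write(logFile, entry.getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Push all pending entries to the cloud master copy, then start a fresh log.
    public void flushIfOnline(CloudClient cloud) throws IOException {
        if (!cloud.isReachable() || Files.notExists(logFile)) {
            return; // offline, or nothing to push yet
        }
        for (String entry : Files.readAllLines(logFile)) {
            cloud.uploadLogEntry(entry);
        }
        Files.delete(logFile); // all changes merged into the master copy
    }

    // Placeholder for whatever cloud storage API the real system would use.
    public interface CloudClient {
        boolean isReachable();
        void uploadLogEntry(String entry) throws IOException;
    }
}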

5.2 Do Users Want It?

Section 4 made a crude prediction of users’ desires. Consumers want the ability to share, back up, and remotely access data. Businesses want to simplify their infrastructure. Additionally, section 3.5 shows that users need more space. That section shows a bimodal distribution of file system percent fullness. A significant number of subjects currently use between 80% and 90% of the space on their drive. Additionally, the data from earlier papers suggests that this percentage is increasing over time

even as the size of hard drives increases. Therefore, users in the future will come even closer to using their entire hard drive and will benefit from having extra storage space in the cloud.

Will users adopt the hybrid cloud-local file system? First, the file system must provide users with their desired functionality. Data backups, storage space flexibility, and simplicity of infrastructure result from the cloud storage component. First, the provider must ensure data integrity so that backups are available in the event of a computer failure. Amazon S3 provides cloud data storage with 99.999999999% durability. [2] That should be sufficiently durable for most backup use cases. Additionally, cloud storage is flexible and simple due to the provider’s web interface. Users can automatically increase or decrease the size of their drives using their provider’s interface. As of December 2013, Amazon charges less than $0.01 per GB. [2] Changes in capacity require no hardware changes such as buying storage servers. Therefore, users can easily adjust the file system’s storage capacity according to their needs while trusting that their data is safe.

The file system will allow users to share data and access it remotely from multiple devices. It will implement this feature by supporting multiple, simultaneous accessors to a file. These accessors could be several devices owned by one person or many people each using a different device. The system will accomplish this by having a master copy of each file that belongs to the original device and user that created the file. The master device may allow other devices to read and copy the file. Additionally, other devices that share the same owner may update the original copy if the master device does not have a lock on it. Other users’ devices may make their own copy of the file. However, to make changes to the original file, they must request the owner’s permission. The file system will attempt to merge changes automatically if it can recognize the file format, such as .txt. In other cases, the file system will warn the file’s owner of the potential issues and recommend that the owner reject the other user’s changes. Thus, the file system can support multiple devices and data sharing with other individuals.

The file system must also provide this data quickly enough that users do not mind the download times. Data downloads from the cloud more slowly than it can be read from disk. [4, 8, p. 4] Users that become frustrated with download times may choose to buy extra local storage and ignore

the cloud. I can optimize the solution so that it minimizes the users’ download times. One approach is to guess which files will be accessed least frequently, such as those with the least recent activity; a sketch of such an eviction pass appears at the end of this subsection. The findings from section 3.2 show that files are on average 100-300 days old. This means that the average file on disk is not touched for more than 3 months. Any file that has not been accessed in a long time can probably be deleted from the cache without causing a significant slowdown. Since the file is old, it probably won’t be touched for a long time and the delay in downloading it is unlikely to be noticed. Another approach is streaming files of certain types before the entire download finishes. Programs can use the beginning portions of songs and videos while the rest downloads. The findings from section 3.3 show that movie file types are 2 of the 3 most common when weighting by space consumed. Mp3 and mp4 files also appear fairly high on the list. This means that a significant fraction of the bytes on disk can be streamed. The file system will always be slower than having all data on a local disk, but these optimizations may make it fast enough that users do not mind the speed difference. Thus, I have designed a file system that may satisfy users’ needs and desires for cloud storage while also being fast enough that they will not prefer local-only storage systems.
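The eviction idea above can be sketched as follows. This is an illustrative example only: the cache directory layout, the idle-time threshold, and the class name are assumptions rather than part of the proposed system. It simply lists cached files whose last access time is older than a cutoff; the file system could then delete them locally, since the master copies remain in the cloud.

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative age-based eviction pass over the local cache directory.
public class CacheEvictor {
    public static List<Path> evictionCandidates(Path cacheDir, int maxIdleDays) throws IOException {
        Instant cutoff = Instant.now().minus(maxIdleDays, ChronoUnit.DAYS);
        try (Stream<Path> files = Files.walk(cacheDir)) {
            return files
                    .filter(Files::isRegularFile)
                    .filter(p -> lastAccess(p).isBefore(cutoff))
                    .collect(Collectors.toList());
        }
    }

    private static Instant lastAccess(Path p) {
        try {
            BasicFileAttributes attrs = Files.readAttributes(p, BasicFileAttributes.class);
            return attrs.lastAccessTime().toInstant();
        } catch (IOException e) {
            return Instant.now(); // unreadable: treat as recently used, do not evict
        }
    }
}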

6 Conclusion

I collected my own data and compared it with that of previous papers to describe historical trends in and the current state of file system usage patterns. These data sets show that files are typically 6-100 KB in size and 100-300 days old. Additionally, I found that most file systems have between 100000 and 500000 files, are between 100 and 500 GB in size, and mostly hold media and program files. I presented a file system that stores data on a local drive and in the cloud to enable data backup and sharing. Finally, I used my research to analyze the hypothesized file system. My research suggests that the hybrid cloud-local file system may satisfy users’ needs, but the technology needs more research and a prototype before I can make a more definite comment.

There are additional issues with the hybrid file system that are beyond the scope of this paper. Users may be concerned with the privacy of information stored in the cloud. It is conceivable that cloud providers would sell or give away this information against the wishes of the users. Additionally, not all cloud providers may build systems as reliable as Amazon’s S3 service. I used S3’s data durability to justify the cloud as a backup utility.

Future studies must have three components to address the above issues and take a definitive stance on the file system: better data, analysis of cloud storage providers, and a prototype file system. Superior data collection will solve the problems found in section 3. Specifically, the data will have more subjects and will more precisely record file sizes and ages. This will probably be done by reporting data in a less aggregated form to the central server. For this study, I generated statistics summarizing each computer, reported these to the Splunk server, and then analyzed the statistics. The next study should individually report the size, type, and age of every file to the central server and do all processing once data collection is finished. Analysis of cloud storage providers will check the assumptions made in section 5. In that section, I assume that cloud storage is cheap and reliable enough to replace disk storage. The next study should survey cloud storage providers to check those assumptions. Finally, a prototype file system will provide real-world data on the hypothesized system. Releasing a hybrid cloud-local file system to end users will enable researchers to study

how well the idea performs in reality. They will be able to ask users what they think of the product. Statistics from the file system will show whether users are using the extra space and multi-device features. In conclusion, I hope that this paper provides data for future researchers and a new, multi-tiered approach to file system functionality.

References

[1] N. Agrawal et al., “A five-year study of file-system metadata,” Trans. Storage, vol. 3, no. 3, Oct. 2007. Available: http://doi.acm.org/10.1145/1288783.1288788
[2] Amazon.com, Inc., “Amazon S3, cloud computing storage for files, images, videos,” 2013. Available: http://aws.amazon.com/s3/
[3] Apple Inc., “Property list programming guide: Quick start for property lists,” March 2010. Available: https://developer.apple.com/library/mac/documentation/cocoa/conceptual/PropertyLists/QuickStartPlist/QuickStartPlist.html
[4] Bestofmedia Group, “Charts, benchmarks HDD charts 2013, [01] read throughput average: h2benchw 3.16,” 2013. Available: http://www.tomshardware.com/charts/hdd-charts-2013/-01-Read-Throughput-Average-h2benchw-3.16,2901.html
[5] J. R. Douceur and W. J. Bolosky, “A large-scale study of file-system contents,” SIGMETRICS Perform. Eval. Rev., vol. 27, no. 1, pp. 59–70, May 1999. Available: http://doi.acm.org/10.1145/301464.301480
[6] K. M. Evans and G. H. Kuenning, “A study of irregularities in file-size distributions,” in International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS ’02), 2002.
[7] “Personal cloud services emerge to orchestrate our mobile computing lives,” Forrester Consulting, July 2012. Available: http://www.sugarsync.com/media/sugarsync-forrester-report.pdf
[8] H. Hussain et al., “The cost of connectivity 2013,” New America Foundation, Tech. Rep., October 2013. Available: http://newamerica.net/sites/newamerica.net/files/policydocs/The_Cost_of_Connectivity_2013_Data_Release.pdf
[9] I. Ion et al., “Home is safer than the cloud!: Privacy concerns for consumer cloud storage,” in Proceedings of the Seventh Symposium on Usable Privacy and Security, ser. SOUPS ’11. New York, NY, USA: ACM, 2011, pp. 13:1–13:20. Available: http://doi.acm.org/10.1145/2078827.2078845
[10] iPXE project, “iPXE - open source boot firmware [start],” April 2013. Available: http://ipxe.org/
[11] S. Koons, “How to use cloud computing to benefit from big data,” Penn State News, December 2013.
[12] Microsoft Corp., “Microsoft acquires Connectix virtual machine technology,” Press Release, 2003. Available: http://www.microsoft.com/en-us/news/press/2003/feb03/02-19partitionpr.aspx
[13] Neovise, LLC. and Virtustream Inc., “Neovise and Virtustream release results from research on public, private and hybrid cloud use by U.S. organizations,” Press Release, April 2013. Available: http://www.virtustream.com/sites/www-dev.virtustream.com/files/Neovise-Cloud-Research-PressRelease4-16-13_0.pdf
[14] Newegg Inc., “Newegg.com - computer hardware, internal hard drives, all desktop hard drives,” 2013. Available: http://www.newegg.com/Product/ProductList.aspx?Submit=Property&Subcategory=14&N=100007603&IsNodeId=1&IsPowerSearch=1
[15] Newegg Inc., “Newegg.com - computer hardware, internal hard drives, all desktop hard drives,” 2013. Available: http://www.newegg.com/Product/ProductList.aspx?Submit=Property&Subcategory=14&N=100007603%20600003306%20600003312%20600237350&IsNodeId=1&IsPowerSearch=1
[16] J. Nielsen, “Nielsen’s law of Internet bandwidth,” 2013. Available: http://www.nngroup.com/articles/law-of-bandwidth/
[17] North Bridge Venture Partners, “2013 future of cloud computing survey reveals business driving cloud adoption in everything as a service era; IT investing heavily to catch up and support consumers graduating from BYOD to BYOC,” Press Release, 2013. Available: http://www.northbridge.com/2013-future-cloud-computing-survey-reveals-business-driving-cloud-adoption-everything-service-era-it
[18] Rackspace Support, “Understanding the cloud computing stack: SaaS, PaaS, IaaS,” Rackspace US, Inc., Tech. Rep., October 2013. Available: http://www.rackspace.com/knowledge_center/whitepaper/understanding-the-cloud-computing-stack-saas-paas-iaas
[19] SimpleHelp Ltd, “JWrapper - overview,” October 2013. Available: http://www.jwrapper.com/
[20] A. S. Tanenbaum, Modern Operating Systems, 3rd ed. Upper Saddle River, NJ, USA: Prentice Hall Press, 2007.
[21] VMware Inc., “Sigar API (system information gatherer and reporter) | Hyperic,” 2012. Available: http://www.hyperic.com/products/sigar
[22] W. Vogels, “File system usage in Windows NT 4.0,” SIGOPS Oper. Syst. Rev., vol. 33, no. 5, pp. 93–109, Dec. 1999. Available: http://doi.acm.org/10.1145/319344.319158
