Scale and Performance in a Distributed File System

John H. Howard et al., ACM Transactions on Computer Systems, 1988

Presented by Gangwon Jo, Sangkuk Kim

1 Andrew File System

. Andrew
  • Distributed computing environment for Carnegie Mellon University
  • 5,000 – 10,000 Andrew workstations at CMU
. Andrew File System
  • Distributed file system for Andrew
  • Files are distributed across multiple servers
  • Presents a homogeneous file name space to all the client workstations

2 Andrew File System (contd.)

[Figure: system architecture. Each server runs Vice on top of the Unix kernel with attached disks; each client runs user programs and Venus on top of the Unix kernel with a local disk; clients and servers communicate over the network]

3 Andrew File System (contd.)

. Design goal: Scalability
  • As much work as possible is performed by Venus
. Solution: Caching
  • Venus caches files from Vice
  • Venus contacts Vice only when a file is opened or closed
  • Reading and writing are performed directly on the cached copy

[Figure: one client's Venus communicating with one server's Vice over the network; the server has disks, and the client has a local disk for the cache]

4-15 Andrew File System (contd.)

. Example: opening, modifying, and closing a file A
  • open(A): Venus fetches A from Vice into the local file cache
  • read/write: the user program operates directly on the cached copy, producing a modified copy A'
  • close(A): Venus stores A' back to Vice, which updates the server's copy

[Figure: the client/server diagram animated across these slides, showing A moving from the server disk to the client cache on open(A), local reads and writes on the cached copy, and A' moving back to the server on close(A)]
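
The caching flow above can be summarized in a small sketch. This is illustrative only, not the real Venus code: VenusCache, the vice stub, and its fetch/store calls are hypothetical names, and cache validity checking (covered later) is omitted.

```python
# Illustrative sketch of whole-file caching in Venus (hypothetical names, not
# the real AFS code): fetch the whole file on open, do all reads and writes on
# the local cached copy, and ship the (possibly modified) copy back on close.
import os
import shutil

class VenusCache:
    def __init__(self, cache_dir, vice):
        self.cache_dir = cache_dir            # directory on the client's local disk
        self.vice = vice                      # stub object that talks to the Vice server

    def _local_path(self, path):
        return os.path.join(self.cache_dir, path.replace("/", "#"))

    def open_file(self, path):
        local = self._local_path(path)
        if not os.path.exists(local):         # cache miss: fetch the entire file
            with open(local, "wb") as dst:
                shutil.copyfileobj(self.vice.fetch(path), dst)
        return open(local, "r+b")             # reads and writes hit the local disk only

    def close_file(self, path, f):
        f.close()
        with open(self._local_path(path), "rb") as modified:
            self.vice.store(path, modified)   # Vice replaces its copy of the file
```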

15 Outline

. Building a prototype
  • Qualitative Observation
  • Performance Evaluation
. Changes for performance
  • Performance Evaluation
. Comparison with a Remote-Open File System
. Change for operability
. Conclusion

17 The Prototype

. Preserve directory hierarchy
  • Each server contained a directory hierarchy mirroring the structure of the Vice files

[Figure: the Vice name space (a/, a1, a2, b/, c/, c1/, ...) mirrored as a directory hierarchy on the server disk, with an .admin/ directory and with b/ and c1/ pointing to Server 2 and Server 3; the client disk holds Venus's file cache and status cache]

18 The Prototype (contd.)

. Preserve directory hierarchy
  • Each server contained a directory hierarchy mirroring the structure of the Vice files
  • .admin directories: contain Vice file status information
  • Stub directories: represent portions of the name space located on other servers

[Figure: the same mirrored hierarchy on the server disk, highlighting the .admin/ directory and the stub directories that point to Server 2 and Server 3]

19 The Prototype (contd.)

. Preserve directory hierarchy
  • The Vice-Venus interface names files by their full pathname (e.g., a/a1)

[Figure: Venus sends the full pathname a/a1 to Vice, which resolves it within the mirrored hierarchy on the server disk]

20 The Prototype (contd.)

. Dedicated processes
  • One process for each client

[Figure: the same server/client diagram as on the previous slides]

21 The Prototype (contd.)

. Use two caches
  • One for files, and the other for status information about files

[Figure: the client disk holds both a file cache and a status cache]

22 The Prototype (contd.)

. Verify the cached timestamp on each open
  • Before using a cached file, Venus verifies its timestamp against the one on the server

[Figure: Venus asks Vice whether a/a1 with cached timestamp 5 is still current; Vice answers OK]
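
A rough sketch of this per-open check, using hypothetical helper names (file_cache, get_timestamp, fetch_and_open) rather than the prototype's actual Vice calls:

```python
# Sketch of the prototype's validity check on open (hypothetical helper names):
# Venus compares the cached copy's timestamp with the server's before using it.
def open_with_validation(venus, path):
    entry = venus.file_cache.get(path)        # (local_path, cached_timestamp) or None
    if entry is not None:
        local_path, cached_ts = entry
        if venus.vice.get_timestamp(path) == cached_ts:   # one server round trip per open
            return open(local_path, "r+b")    # cached copy is still current
    return venus.fetch_and_open(path)         # miss or stale copy: fetch the whole file
```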

23 Qualitative Observation

. stat primitive
  • Testing the presence of files, obtaining status information, ...
  • Programs using stat run much slower than the authors expected
  • Each stat involves a cache validity check
. Dedicated processes
  • Excessive context switching overhead
  • High virtual memory paging demands
. File location
  • Difficult to move users' directories between servers

24 Performance Evaluation

. Experience: the prototype was used at CMU
  • The authors + 400 other users
  • 100 workstations and 6 servers
. Benchmark
  • A command script that operates on a collection of source files
  • MakeDir → Copy → ScanDir → ReadAll → Make
  • Multiple clients (load units) run the benchmark simultaneously
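
The benchmark's shape can be sketched as follows; the paths, phase bodies, and build step are placeholders, not the actual Andrew benchmark script.

```python
# Illustrative driver for the benchmark's five phases and for "load units"
# (placeholder paths and build step; this is not the actual Andrew benchmark).
import os
import shutil
import subprocess
import threading

SOURCE_TREE = "bench-src"                 # hypothetical source tree the benchmark copies

def make_dir(ws):                         # MakeDir: create the target directory skeleton
    os.makedirs(ws, exist_ok=True)

def copy(ws):                             # Copy: copy the source files into the target tree
    shutil.copytree(SOURCE_TREE, os.path.join(ws, "src"))

def scan_dir(ws):                         # ScanDir: examine the status of every file
    for root, _, files in os.walk(ws):
        for name in files:
            os.stat(os.path.join(root, name))

def read_all(ws):                         # ReadAll: read every byte of every file
    for root, _, files in os.walk(ws):
        for name in files:
            with open(os.path.join(root, name), "rb") as f:
                f.read()

def make(ws):                             # Make: compile and link the copied sources
    subprocess.run(["make"], cwd=os.path.join(ws, "src"), check=False)

def run_one_load_unit(ws):
    for phase in (make_dir, copy, scan_dir, read_all, make):
        phase(ws)

def run_benchmark(load_units):            # N load units = N clients running concurrently
    threads = [threading.Thread(target=run_one_load_unit, args=(f"bench-ws-{i}",))
               for i in range(load_units)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```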

25 Performance Evaluation (contd.)

. Cache hit ratio
  • File cache: 81%
  • Status cache: 82%

26 Performance Evaluation (contd.)

. Distribution of Vice calls in the prototype (average)

Call           Distribution (%)
TestAuth             61.7
GetFileStat          26.8
Fetch                 4.0
Store                 2.1
SetFileStat           1.8
ListDir               1.8
All others            1.7

[Figure: pie chart of the same call distribution]

27 Performance Evaluation (contd.)

. Server usage
  • CPU utilizations are up to about 40%
  • Disk utilizations are less than 15%
  • Server loads are imbalanced

Utilization (%)
Server      CPU    Disk 1   Disk 2
cluster0    37.8    12.0     6.8
cluster1    12.6     4.1     4.4
cmu-0        7.0     2.5
cmu-1       43.2    13.9    15.1

28 Performance Evaluation (contd.)

. Benchmark performance
  • Time per TestAuth call rises rapidly beyond a load of 5 units

[Figure: normalized overall benchmark time and normalized time per TestAuth call versus load units (1, 2, 5, 8, 10); the per-call TestAuth time climbs steeply beyond 5 load units]

29 Performance Evaluation (contd.)

. Caches work well!
. We need to:
  • Reduce the frequency of cache validity checks
  • Reduce the number of server processes
  • Require workstations, rather than the servers, to do pathname traversals
  • Balance server usage by reassigning users

30 Outline

. Building a prototype
  • Qualitative Observation
  • Performance Evaluation
. Changes for performance
  • Performance Evaluation
. Comparison with a Remote-Open File System
. Change for operability
. Conclusion

31 Changes for Performance

. Cache management: use callbacks
  • Vice notifies Venus if a cached file or directory is modified by another workstation
  • Cache entries are assumed valid unless otherwise notified
    − Per-open verification is no longer needed
  • Vice and Venus each maintain callback state information
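
A minimal sketch of the server-side callback bookkeeping, assuming hypothetical data structures and a client stub with an invalidate call (the real Vice code differs):

```python
from collections import defaultdict

class ViceCallbacks:
    """Hypothetical callback bookkeeping on the server (not the real Vice code)."""

    def __init__(self):
        self.holders = defaultdict(set)       # file id -> clients holding a callback

    def register(self, fid, client):
        self.holders[fid].add(client)         # granted when the client fetches fid

    def break_callbacks(self, fid, updating_client):
        # The file identified by fid was updated: tell every other holder that
        # its cached copy is no longer guaranteed to be valid.
        for client in self.holders[fid] - {updating_client}:
            client.invalidate(fid)
        self.holders[fid] = {updating_client}
```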

32 Changes for Performance (contd.)

. Name resolution and storage representation
  • CPU overhead is caused by the namei routine
    − Maps a pathname to an inode
  • Identify files by fids instead of pathnames
    − A volume is a collection of files located on one server
    − A volume contains multiple vnodes, which identify the files in the volume
    − The uniquifier allows reuse of vnode numbers

Fid = < Volume number (32 bits) | Vnode number (32 bits) | Uniquifier (32 bits) >

33 Changes for Performance (contd.)

. Name resolution and storage representation

[Figure: a client presents a fid <volume number, vnode number, uniquifier>; the volume location database, held on the servers, maps the volume number to a server (e.g., volume 0 → server 1, volume 1 → server 4, volume 2 → server 2); on that server, a vnode lookup table maps the fid to a vnode and then to the underlying inode]

36 Changes for Performance (contd.)

. Name resolution and storage representation
  • Identify files by fids instead of pathnames
  • Each entry in a directory maps a component of a pathname to a fid
    − Venus performs the logical equivalent of a namei operation
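
A sketch of fid-based resolution with hypothetical data structures: the client maps the volume number to a server through the volume location database, and the server maps the fid to an inode through its vnode lookup table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fid:
    volume: int       # 32-bit volume number
    vnode: int        # 32-bit vnode number within the volume
    uniquifier: int   # 32-bit uniquifier: lets vnode numbers be reused safely

def locate_server(volume_location_db, fid):
    """Client side: which server currently stores the volume?"""
    return volume_location_db[fid.volume]

def resolve_inode(vnode_table, fid):
    """Server side: map (volume, vnode) to the underlying inode."""
    vnode = vnode_table[(fid.volume, fid.vnode)]
    if vnode.uniquifier != fid.uniquifier:
        raise FileNotFoundError("stale fid: the vnode number has been reused")
    return vnode.inode
```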

37 Changes for Performance (contd.)

. Server process structure
  • Use lightweight processes (LWPs) instead of heavyweight processes
  • An LWP is not dedicated to a single client
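
A sketch of the idea using a thread pool as a stand-in for LWPs (illustrative only; the paper's LWPs are lightweight threads inside one server process):

```python
# Sketch of the LWP-style server organization: a small pool of workers serves
# requests from any client, instead of one dedicated process per client.
import queue
import threading

def start_worker_pool(handle_request, num_workers=5):
    requests = queue.Queue()                     # requests arriving from all clients

    def worker():
        while True:
            client, request = requests.get()     # any worker may serve any client
            handle_request(client, request)
            requests.task_done()

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
    return requests                              # callers enqueue (client, request) pairs
```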

38 Performance Evaluation

. Scalability

[Figure: normalized benchmark time versus load units (1–20) for the prototype and the new system; the prototype's time rises steeply with load, while the new system's stays nearly flat]

39 Performance Evaluation (contd.)

. Server utilization during the benchmark

[Figure: server CPU and disk utilization (%) versus load units (0–20)]

40 Outline

. Building a prototype
  • Qualitative Observation
  • Performance Evaluation
. Changes for performance
  • Performance Evaluation
. Comparison with a Remote-Open File System
. Change for operability
. Conclusion

41 Comparison with a Remote-Open File System

. Caching in the Andrew File System
  • Locality makes caching attractive
  • The whole-file transfer approach contacts servers only on opens and closes
  • Most files in a 4.2BSD environment are read in their entirety
  • Disk caches retain their entries across reboots
  • Caching entire files simplifies cache management

Comparison with a Remote-Open File System

. Caching in the Andrew File System – drawbacks
  • Requires local disks
  • Files larger than the local disk cache cannot be handled
  • Strict emulation of 4.2BSD concurrent read/write semantics is impossible

Comparison with a Remote-Open File System

. Remote open
  • The data in a file are not fetched en masse
  • Instead, the remote site potentially participates in each individual read and write operation
  • The file is actually opened on the remote site rather than the local site
  • Example: NFS
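
For contrast with the whole-file sketch earlier, a remote-open client might look like the following (a hypothetical stub, not NFS's actual protocol): every read and write becomes a server interaction.

```python
# Illustrative remote-open client (hypothetical server stub): the server
# participates in each individual read and write, not just open and close.
class RemoteOpenFile:
    def __init__(self, server, handle):
        self.server = server          # stub for the remote file server
        self.handle = handle          # handle returned by the remote open

    def read(self, offset, length):
        return self.server.read(self.handle, offset, length)   # one request per read

    def write(self, offset, data):
        return self.server.write(self.handle, offset, data)    # one request per write

def remote_open(server, path):
    return RemoteOpenFile(server, server.open(path))            # the file is opened remotely
```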

Comparison with a Remote-Open File System

[Figure: overall benchmark time in seconds (roughly 400–1,400 s) versus load units (1, 2, 5, 7, 10, 15, 18) for Andrew with a cold cache, Andrew with a warm cache, and NFS]

Comparison with a Remote-Open File System

. Serious functional problems with NFS at high loads

Network traffic for Andrew and NFS

                                 Andrew      NFS
Total packets                     3,824   10,225
Packets from server to client     2,003    6,490
Packets from client to server     1,818    3,735

Comparison with a Remote-Open File System

[Figure: two panels showing percent CPU utilization and percent disk utilization versus load units (1, 2, 5, 7, 10, 15, 18) for Andrew with a cold cache, Andrew with a warm cache, and NFS (NFS disk 1 and disk 2 plotted separately in the disk panel)]

Comparison with a Remote-Open File System

. Advantage of a remote-open file system
  • Low latency

Latency of Andrew and NFS

Time (milliseconds)
File size    Andrew        Andrew
(bytes)      Cold Cache    Warm Cache     NFS    Stand-alone
3              160.0          16.1        15.7        5.1
1,113          148.0
4,334          202.9
10,278         310.0
24,576         515.0          15.9

Outline

. Building a prototype
  • Qualitative Observation
  • Performance Evaluation
. Changes for performance
  • Performance Evaluation
. Comparison with a Remote-Open File System
. Change for operability
. Conclusion

49 Change for Operability

. Volume
  • A collection of files forming a partial subtree of the Vice name space
  • Volumes are glued together at mount points
  • Operational transparency

[Figure: servers holding Volume 1, Volume 2, and Volume 3, with one volume mounted within another]

Change for Operability

. Volume movement
. Quotas
. Read-Only Replication
. Backup

[Figure: clients map a volume number to a server through the volume location database (e.g., volume 0 → server 1, volume 1 → server 4, volume 2 → server 2) held on the servers]

Change for Operability

. Volume movement

[Figure: Volume 1 is cloned from Server 3 to Server 4, and its entry in the volume location database is updated from Server 3 to Server 4]
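
A rough sketch of volume movement with hypothetical helper methods; the real mechanism is more involved (it must also capture updates made while the volume is being shipped), but the overall shape is: clone, ship, catch up, then repoint the volume location database.

```python
# Rough sketch of moving a volume between servers (hypothetical helpers).
def move_volume(volume, old_server, new_server, location_db):
    clone = old_server.clone(volume)           # cheap copy-on-write snapshot
    new_server.install(volume, clone)          # bulk transfer; users keep working
    delta = old_server.changes_since(volume, clone)
    new_server.apply(volume, delta)            # re-apply updates made during the move
    location_db[volume] = new_server           # clients now find the volume here
    old_server.remove(volume)
```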

Outline

. Building a prototype
  • Qualitative Observation
  • Performance Evaluation
. Changes for performance
  • Performance Evaluation
. Comparison with a Remote-Open File System
. Change for operability
. Conclusion

53 Conclusion

. Scale impacts Andrew in areas besides performance and operability
. Future goals
  • Performance optimization
  • Administration features

Thank you.
