<<

How to Think Like a Computer Scientist: Learning with Python 3 previous | next | index

13. Files

1 3.1. Reading and writing files

While a program is running, its is stored in rando m access m e m ory (RAM). RAM is fast and inexpensive, but it is also volatile, which means that when the program ends, or the computer shuts down, data in RAM disappears. To make data available the next time you turn on your computer and start your program, you have to it to a n o n - volatile storage medium, such a hard drive, usb drive, or CD-RW.

Data on non-volatile storage media is stored in named locations on the media called files. By reading and writing files, programs can save information between program runs.

Working with files is a lot like working with a notebook. To use a notebook, you have to it. When youre done, you have to it. While the notebook is open, you can either write in it or from it. In either case, you know where you are in the notebook. You can read the whole notebook in its natural order or you can skip around.

All of this applies to files as well. To open a file, you specify its name and indicate whether you want to read or write.

Opening a file creates what we call a file h a n d l e. In this example, the variable myfile refers to the new handle object. Our program calls methods on the handle, and this makes changes to the actual file which is usually located on our disk.

myfile = open('test.dat', 'w')

The open function takes two arguments. The first is the name of the file, and the second is the m o d e. Mode 'w' means that we are opening the file for writing.

If there is no file named test.dat on the disk, it will be created. If there already is one, it will be replaced by the file we are writing.

To put data in the file we invoke the write method on the handle:

myfile.write("Now is the time") myfile.write("to close the file")

Closing the file handle tells the system that we are done writing and makes the disk file available for reading by other programs (or by ourselves):

myfile.close()

Now we can open the file again, this time for reading, and read the contents into a string. This time, the mode argument is 'r' for reading:

>>> mynewhandle = open('test.dat', 'r')

If we try to open a file that doesnt exist, we get an error:

>>> mynewhandle = open('test.cat', 'r') IOError: [Errno 2] No such file or : 'test.cat'

Not surprisingly, the read method reads data from the file. With no arguments, it reads the entire contents of the file into a single string:

>>> text = mynewhandle.read() >>> p r i n t(text) Now is the timeto close the file

There is no space between time and to because we did not write a space between the strings.

read can also take an argument that indicates how many characters to read:

>>> myfile = open('test.dat', 'r') >>> p r i n t(myfile.read(5)) Now i

If not enough characters are left in the file, read returns the remaining characters. When we get to the end of the file, read returns the empty string: >>> p r i n t(myfile.read(1000006)) s the timeto close the file >>> p r i n t(myfile.read())

>>>

The following function copies a file, reading and writing up to fifty characters at a time. The first argument is the name of the original file; the second is the name of the new file:

de f copy_file(oldfile, newfile): h_infile = open(oldfile, 'r') h_outfile = open(newfile, 'w') wh i l e True: text = h_infile.read(50) i f text == "": b r eak h_outfile.write(text) h_infile.close() h_outfile.close()

This functions continues looping, reading 50 characters from infile and writing the same 50 characters to outfile until the end of infile is reached, at which point text is empty and the break statement is executed.

A han dle is so mewhat like a TV re m ote control

Were all familiar with a remote control for a TV. You perform operations on the remote control switch channels, change the volume, etc. But the real action happens on the TV. So, by simple analogy, wed call the remote control your hand l e to the underlying TV.

Sometimes we want to emphasize the difference the file handle is not the same as the file, and the remote control is not the same as the TV it controls. But at other times we prefer to treat them as a single mental chunk, or abstraction, and well just say close the file, or flip the TV channel.

13.2. Text files

A te x t file is a file that contains printable characters and whitespace, organized into lines separated by characters. One of the Python design goals was to provide methods that made processing easy.

Notice the subtle difference in abstraction here: in the previous section, we simply regarded a file as containing many characters, and could read them one at a time, many at a time, or all at once. In this section, particularly for reading data, were interested in files that are organized into lines, and we will process them line-at-a-time.

To demonstrate, well create a text file with three lines of text separated by :

>>> h_outfile = open("test.dat","w") >>> h_outfile.write("line one \ nline two \ nline three \ n") >>> h_outfile.close()

The readline method reads all the characters up to and including the next newline character:

>>> h_infile = open("test.dat","r") >>> p r i n t(h_infile.readline()) line one

>>> readlines returns all of the remaining lines as a list of strings:

>>> p r i n t(h_infile.readlines()) ['line two\n', 'line three\n']

In this case, the output is in list format, which means that the strings appear with quotation marks and the newline character appears at the end of each.

At the end of the file, readline returns the empty string and readlines returns the empty list:

>>> p r i n t(h_infile.readline())

>>> p r i n t(h_infile.readlines()) []

The following is an example of a line-processing program. filter makes a copy of oldfile, omitting any lines that begin with #:

1 de f filter(oldfile, newfile): 2 infile = open(oldfile, 'r') 3 outfile = open(newfile, 'w') 4 wh i l e True: 5 text = infile.readline() 6 i f text == "": 7 b r eak 8 i f text[0] == '#': 9 con t i nue 10 outfile.write(text) 11 infile.close() 12 outfile.close()

The contin ue state ment ends the current iteration of the loop, but continues looping. The flow of execution moves to the top of the loop, checks the condition, and proceeds accordingly.

Thus, if text is the empty string, the loop exits. If the first character of text is a hash mark, the flow of execution goes to the top of the loop. Only if both conditions fail do we copy text into the new file.

Lets consider one case: suppose your original file contained empty lines. At line 6 above, would this program not find the first empty line in the file, and terminate immediately? No! Recall that readline always includes the newline character in the string it returns, so even an empty line in your file would arrive in the text variable on line 5 containing its newline character. It is only when we try to read b e y ond the end of the file that we we get back the empty string.

1 3.3. Directories

Files on non-volatile storage media are organized by a set of rules known as a . File systems are made up of files and directories, which are containers for both files and other directories.

When you create a new file by opening it and writing, the new file goes in the current directory (wherever you were when you ran the program). Similarly, when you open a file for reading, Python looks for it in the current directory.

If you want to open a file somewhere else, you have to specify the p a t h to the file, which is the name of the directory (or folder) where the file is located:

>>> wordsfile = open('/usr/share/dict/words', 'r') >>> wordlist = wordsfile.readlines() >>> p r i n t(wordlist[:6]) ['\n', 'A\n', "A's\n", 'AOL\n', "AOL's\n", 'Aachen\n']

This (unix) example opens a file named words that resides in a directory named dict, which resides in share, which resides in usr, which resides in the top-level directory of the system, called /. It then reads in each line into a list using readlines, and prints out the first 5 elements from that list.

A Windows might be "c:/temp/words.txt" or "c:\\temp\\words.txt". Because backslashes are used to escape things like newlines and tabs, you need to write two backslashes in a literal string to get one! So the length of these two strings is the same!

You cannot use / or \ as part of a ; they are reserved as a deli m iter between directory and .

The file /usr/share/dict/words should exist on unix-based systems, and contains a list of words in alphabetical order.

13.4. What about fetching something from the web?

The Python libraries are pretty messy in places. But here is a very simple example that copies the contents at some web URL to a local file.

1 i mpo r t u r l l i b . r eques t 2 3 url = 'http://xml.resource.org/public/rfc/txt/rfc793.txt' 4 destination_filename = 'rfc793.txt' 5 6 urllib.request.urlretrieve(url, destination_filename)

The urlretrieve function just one call could be used to download any kind of content from the Internet.

Well need to get a few things right before this works: The resource were trying to fetch must exist! Check this using a browser. Well need permission to write to the destination filename, and the file will be created in the current directory- i.e. the same folder that the Python program is saved in. If we are behind a proxy server that requires authentication, (as some students are), this may require some more special handling to work around our proxy. Use a local resource for the purpose of this demonstration!

Here is a slightly different example. Rather than save the web resource to our local disk, we read it directly into a string, and return it:

1 i mpo r t u r l l i b . r eques t 2 3 de f retrieve_page(url): 4 ' ' ' Re t r i e v e t h e c on t e n t s o f a we b p a g e . 5 Th e c on t e n t s i s c on v e r t e d t o a s t r i ng b e f o r e r e t u r n i ng i t . 6 ' ' ' 7 my_socket = urllib.request.urlopen(url) 8 dta = str(my_socket.readall()) 9 my_socket.close() 10 r e t u r n dta 11 12 the_text = retrieve_page("http://xml.resource.org/public/rfc/txt/rfc793.txt") 13 p r i n t(the_text)

Opening the remote url returns what we call a socket. This is a handle to our end of the connection between our program and the remote web server. We can call read, write, and close methods on the socket object in much the same way as we can work with a file handle.

13.5. Counting Letters

The ord function returns the integer representation of a character:

>>> ord('a') 97 >>> ord('A') 65 >>>

This example explains why 'Apple' < 'apple' evaluates to T r ue.

The chr function is the inverse of ord. It takes an integer as an argument and returns its character representation:

>>> f o r i i n range(65, 71): . . . p r i n t(chr(i)) . . . A B C D E F >>>

The following program counts the number of times each character occurs in the book Alice in Wonderland:

1 de f count_letters(text): 2 ' ' ' Pe r f o r m f r e qu e n c y c oun t o f how ma n y t i me s 3 e a c h ASC I I c h a r a c t e r o c c u r s i n s ome t x t . 4 ' ' ' 5 counts = 128 * [0] 6 f o r letter i n text: 7 counts[ord(letter)] += 1 8 r e t u r n counts 9 10 de f display(i): 11 ' ' ' Sub s t i t u t e s ome n ame s o f non - p r i n t a b l e c h a r a c t e r s ' ' ' 12 i f i == 10: r e t u r n 'LF' 13 i f i == 13: r e t u r n 'CR' 14 i f i == 32: r e t u r n 'SPACE' 15 r e t u r n chr(i) 16 17 infile = open('alice_in_wonderland.txt', 'r') 18 text = infile.read() 19 infile.close() 20 21 freq_counts = count_letters(text) 22 23 layout = "{0:>12} {1:>5} \ n" 24 outfile = open('alice_counts.txt', 'w') 25 26 outfile.write(layout.format("Character", "Count")) 27 outfile.write("======\ n") 28 29 f o r (i, v) i n enumerate(freq_counts): 30 i f v > 0: 31 outfile.write(layout.format(display(i), v)) 32 33 outfile.close()

Run this program and look at the output file it generates using a . You will be asked to analyze the program in the exercises below.

1 3.6. Glossary deli m iter A sequence of one or more characters used to specify the boundary between separate parts of text. d irect o ry A named collection of files, also called a folder. Directories can contain files and other directories, which are refered to as su b directories of the directory that contains them. f ile A named entity, usually stored on a hard drive, floppy disk, or CD-ROM, that contains a stream of characters. file syste m A method for naming, accessing, and organizing files and the data they contain. han dle An object in our program that is connected to an underlying resource (e.g. a file). The file handle lets our program manipulate / read/ write / close the actual file that is on our disk. fully q ualified name A name that is prefixed by some namespace identifier and the dot operator, or by an instance object, e.g. math.sqrt or tess.forward(10). m o d e A distinct method of operation within a computer program. Files in Python can be openned in one of three modes: read ('r'), write ('w'), and append ('a'). n o n - volatile m e m ory Memory that can maintain its state without power. Hard drives, flash drives, and rewritable compact disks (CD-RW) are each examples of non-volatile memory. p at h A sequence of directory names that specifies the exact location of a file. te x t f ile A file that contains printable characters organized into lines separated by newline characters. soc k e t One end of a connection allowing one to read and write information to or from another computer. volatile m e m ory Memory which requires an electrical current to maintain state. The m ain m e m ory or RAM of a computer is volatile. Information stored in RAM is lost when the computer is turned off.

1 3.7. Exercises

1. unsorted_fruits.txt contains a list of 26 fruits, each one with a name that begins with a different letter of the alphabet. Write a program named sort_fruits.py that reads in the fruits from unsorted_fruits.txt and writes them out in alphabetical order to a file named sorted_fruits.txt.

2. Answer the following questions about countletters.py:

a. Explain in detail what the three lines do:

infile = open('alice_in_wonderland.txt', 'r') text = infile.read() infile.close()

What would type(text) return after these lines have been executed? b. What does the expression 128 * [0] evaluate to? Read about ASCII in Wikipedia and explain why you think the variable, counts is assigned to 128 * [0] in light of what you read. c. What does

f o r letter i n text: counts[ord(letter)] += 1

do to counts? d. Explain the purpose of the display function. Why does it check for values 10, 13, and 32? What is special about those values?

e. Describe in detail what the lines

layout = "{0:>9} {1:>5}\n" outfile = open('alice_counts.dat', 'w') outfile.write(layout.format("Character", "Count")) outfile.write("======\n") do. What will be in alice_counts.dat when they finish executing? f. Finally, explain in detail what

f o r (i, v) i n enumerate(freq_counts): i f v > 0: outfile.write(layout.format(display(i), v))

does.

How to Think Like a Computer Scientist: Learning with Python 3 previous | next | index

Copyright 2011, Peter Wentworth, Jeffrey Elkner, Allen B. Downey and Chris Meyers. Created using Sphinx 1.0.1.