COMPUTING WITH STRINGS

CMSC 120: Visualizing Information, File I/O

One critical part of computing is the ability to store and retrieve data stored in files. There are two basic types of data files: binary and plain text. Binary files contain data that have been encoded in binary for storage and processing purposes; i.e., the data are stored as raw sequences of bytes. Plain text files contain data that have been stored as text. In this case, all data are converted to text and then encoded using a mapping (e.g., ASCII or Unicode) from each character to a unique numerical code. This is why, if you open a binary file in a text editor, what you usually see is an unintelligible display of characters: the editor treats each sequence of eight bits in the binary file as a character code and translates it accordingly.

Many binary files contain headers, or metadata, that provide a set of instructions for how a program should interpret the stored data (e.g., the format and layout of the text in this document). This metadata is often proprietary and is what allows for the creation of, and distinction among, special file formats (e.g., Microsoft Office documents, JPEGs, and MP3s). In many cases, only the program that created the original binary file is capable of interpreting the metadata and decoding the stored data. This can be problematic if we have information that needs to be manipulated by multiple programs. While this difficulty is in part overcome by the recent movement to make file formats readable by multiple programs, such as the interchange between MS Office and Adobe programs, most of the time we have to fall back on the old standard and store data in plain text files to make the data accessible to other software.
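The difference between the two kinds of storage can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the variable names are made up for the example, the standard struct module stands in for "some program's binary format", and the 4-byte little-endian layout and the latin-1 decoding are arbitrary choices, not anything prescribed by this handout. The calls to print use parentheses so that the sketch should run under either Python 2 or Python 3.

```python
import struct

value = 1234567  # hypothetical piece of data to store

# Plain text storage: each digit becomes one character code (ASCII here).
text_bytes = str(value).encode('ascii')

# Binary storage: the whole integer packed into 4 bytes (little-endian).
binary_bytes = struct.pack('<i', value)

print(len(text_bytes))    # one byte per digit
print(len(binary_bytes))  # always four bytes for this layout

# Treating the binary bytes as character codes, as a text editor would,
# produces characters that have nothing to do with the digits 1234567.
garbled = binary_bytes.decode('latin-1')
print(repr(garbled))

# Each form is only readable with the matching interpretation.
from_text = int(text_bytes.decode('ascii'))
from_binary = struct.unpack('<i', binary_bytes)[0]
print(from_text == from_binary)
```

Both forms hold the same number, but only the text form can be read without knowing the packing rule, which is why plain text serves as common ground between programs.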
This issue is of special concern to individuals interested in manipulating and analyzing sets of data, a process that may involve the use of multiple programs: e.g., specialized or personally written computer programs for processing, recording, or producing data; a spreadsheet for entering and organizing the data; statistical software for analysis; and graphical display software for visualization. Thus, it has become a standard in data analysis to save data in text files to facilitate the transfer of information from one program to the next. In this chapter, we will take a first look at the problem of plain text file input and output, which, at its most basic, is just a form of string processing.

Do this: Look up ASCII and Unicode on Wikipedia. What are the differences between the two encodings?

STRING PROCESSING

String processing, or the translation and manipulation of information stored as text, is a very important programming task. To understand what is involved in this task, we first need to review a bit about strings.

Strings

We have already spent quite a bit of time using and manipulating strings. In programming, strings are a special data type that describes any information stored as a sequence of characters. A character is a unit of information (a piece of data) that corresponds to a symbol in the written form of a natural language. In Python, a string literal is formed by enclosing zero or more characters in either single or double quotation marks, e.g.:

    >>> str1 = 'spam'
    >>> type(str1)
    <type 'str'>

We have already done a bit of string manipulation. We know how to print strings, and in the last chapter we had a bit of experience getting string-valued input from the user (in the Rock, Paper, Scissors game). We've also learned that strings are Python sequences and so can be manipulated in the same way we manipulate any other sequence.
For example, we can perform the following operations:

    >>> 'spam' + 'eggs'
    'spameggs'
    >>> 3 * 'spam'
    'spamspamspam'
    >>> (3 * 'spam') + (5 * 'eggs')
    'spamspamspameggseggseggseggseggs'
    >>> len('spam')
    4
    >>> for ch in 'Spam!':
    ...     print ch
    ...
    S
    p
    a
    m
    !

The basic sequence operations that can be performed on strings are summarized in the table below:

    Operator                  Meaning
    +                         Concatenation
    *                         Repetition
    <string>[]                Indexing
    <string>[:]               Slice
    len(<string>)             Length
    for <var> in <string>     Iteration through characters

Strings as Objects

However, strings are more than just your typical Python sequence. Because many basic problems in computing involve the manipulation of sequences of characters, there are a lot of string-specific operations that are frequently performed. For this reason, Python strings are in fact objects, and Python provides a set of special operations, called methods, that can only be performed on strings. For example:

    >>> s = 'Spam and Eggs'
    >>> s.title()                 # capitalize the words
    'Spam And Eggs'
    >>> s.upper()                 # make all upper case
    'SPAM AND EGGS'
    >>> s.replace('and', 'or')    # replace one part with another
    'Spam or Eggs'
    >>> s.count('a')              # count number of times a is in the string
    2
    >>> s.split()                 # split string into words
    ['Spam', 'and', 'Eggs']

The most commonly used string methods are summarized below:

    Method                      Meaning
    .capitalize()               capitalize only first letter
    .title()                    capitalize first letter of all words
    .lower()                    make whole string lower case
    .upper()                    make whole string upper case
    .count(substring)           count number of times substring occurs
    .find(substring)            find first position where substring occurs (-1 if not found)
    .rfind(substring)           like find, but searches from the right
    .join(list)                 concatenates a list of strings, using the string as the separator
    .replace(oldsub, newsub)    replaces old substring with new one
    .split(character)           splits the string everywhere character appears; if no
                                character is specified, splits at whitespace
    .ljust(width)               left justify string in field of given width
    .center(width)              center string in field of given width
    .rjust(width)               right justify string in field of given width

Do This: Enter the following at the IDLE prompt:

    >>> sentence = 'The quick brown fox jumped over the lazy dogs.'

Use the string methods and sequence operations introduced in the preceding two sections to manipulate sentence in various ways. Try each one out until you are sure you understand how it works.

INPUT/OUTPUT AS STRING MANIPULATION

Before we begin to manipulate strings, we first need to obtain textual data from some source. Text may be obtained through direct interaction with the user (e.g., through the keyboard) or may be loaded into a program by reading a stored file. Let's start by looking at text obtained interactively through the keyboard.

Raw Input

In the last chapter, we experienced getting textual data from the user first hand when we asked the user to provide a choice ('Rock', 'Paper', 'Scissors') in the Rock, Paper, Scissors! game. This was done with the following command:

    yourChoice = input('Please enter Rock, Paper, or Scissors: ')

Executing this command results in the following:

    >>> yourChoice = input('Please enter Rock, Paper, or Scissors: ')
    Please enter Rock, Paper, or Scissors:|

Here Python relinquishes control of the program to the user and waits to proceed until some input is entered at the prompt (after the text string). But what do we enter?
Many of you may have been frustrated to discover that entering Rock directly produces an error:

    >>> yourChoice = input('Please enter Rock, Paper, or Scissors: ')
    Please enter Rock, Paper, or Scissors: Rock
    Traceback (most recent call last):
      File "<pyshell#1>", line 1, in <module>
        yourChoice = input('Please enter Rock, Paper, or Scissors: ')
      File "<string>", line 1, in <module>
    NameError: name 'Rock' is not defined

The only way to avoid this was to type quotes around the input so that Python evaluates it as a string literal, as follows:

    >>> yourChoice = input('Please enter Rock, Paper, or Scissors: ')
    Please enter Rock, Paper, or Scissors: 'Rock'

Do this: Enter the following command in IDLE.

    >>> answer = input('Enter a mathematical expression: ')

When prompted for input, enter the expression 2 + 9. Then print the value stored in the variable answer. What happened?

The answer to that question explains why entering Rock as opposed to 'Rock' in our Rock, Paper, Scissors example caused an error in the first place. The answer is simply that an input statement is a delayed expression: whatever the user types is evaluated as a Python expression at the moment it is entered. Thus, when we enter the name Rock, this is the same as executing the following assignment statement:

    >>> yourChoice = Rock

The statement reads: look up the value stored in the variable Rock and then store that value in the variable yourChoice. Since Rock was never defined, Python cannot find the name and raises a NameError.

Putting quotes around the entered string is one way around this error, but it is tedious to require that a user enter quotes each time s/he wants to input a string value. A more effective way is to tell Python not to evaluate the input. We can do this using an alternative function, raw_input, that is exactly like input, except that it does not evaluate the expression that the user types. The expression is instead handed to the program as a string of text.
For example, the following:

    >>> yourChoice = raw_input('Please enter Rock, Paper, or Scissors: ')
    Please enter Rock, Paper, or Scissors: Rock

no longer results in an error. Moreover, if we try to access the value saved in yourChoice, we see that it is a string:

    >>> type(yourChoice)
    <type 'str'>
    >>> yourChoice
    'Rock'

Do This: We've seen that computers manipulate numbers by storing them in binary notation (sequences of zeros and ones) and that the computer contains circuitry designed to do arithmetic on those sequences. How, then, does a computer manipulate strings? Pretty much in the same way; i.e., manipulating textual information is no different from manipulating numerical information.
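To make that last point concrete, here is a minimal sketch of a string viewed as a sequence of numbers. It relies on the built-in functions ord, chr, and format, which have not been introduced in this handout, and the variable names are purely illustrative. The calls to print use parentheses so that the sketch should run under either Python 2 or Python 3.

```python
word = 'Spam'

# ord() gives the numeric code of a single character.
codes = [ord(ch) for ch in word]
print(codes)

# The same codes written out as 8-bit binary numbers.
bits = [format(code, '08b') for code in codes]
print(bits)

# chr() converts a numeric code back into its character.
rebuilt = ''.join(chr(code) for code in codes)
print(rebuilt)

# Because characters are stored as numbers, arithmetic applies to them:
# adding 1 to each code shifts every letter to the next one.
shifted = ''.join(chr(ord(ch) + 1) for ch in word)
print(shifted)
```

The last step shows ordinary arithmetic applied to text: each character is converted to its code, incremented, and converted back.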
