UTBM Strsep Exercise
strsep exercises

Introduction

The library function strsep enables a C programmer to parse or decompose a string into substrings, each terminated by a specified character. The goals of this document and its exercises are to establish terminology and to program strsep.

C strings

Arrays of char

A string is a sequence of characters, treated as a unit of data. In C language programming, a string literal such as "hello, world" may be used to specify a string value explicitly in a program. C strings are normally stored in memory as a null-terminated array of characters. We will now review this representation, using the terminology we will employ in the remainder of this document.

The char type in C is capable of holding a single byte of integer data, which is typically a numerical code for a single character. The ASCII coding system provides codes for standard keyboard characters that can be expressed within a single byte. For example, the ASCII code for the character A is 65; B has code 66; C has code 67; etc. We will use box diagrams to illustrate this memory usage: one box illustrates the ASCII code for A stored within one char location.

Note: There are other character encodings besides ASCII. In particular, Unicode is capable of expressing many more characters, suitable for thousands of languages. However, Unicode encodings such as UTF-8 may use several bytes per character, and we will use ASCII for this document as a simplification.

A null byte is a byte value of zero. In diagrams, we will use the symbol ∅ to represent a null byte.

By a null-terminated array of characters we mean an array of char storing ASCII codes for characters, finishing with a null byte. For example, the string literal "hello" is stored as the five characters h, e, l, l, o followed by ∅.

Each memory location has a numerical identifier called the address of that memory location. The addresses of an array are consecutive numbers, since the elements of an array occupy consecutive memory locations.
For Intel Pentium and many other computer architectures, each char memory location has its own address. For example, if the memory location containing the first character of the string above has address 4200, then the memory location containing e has address 4201, the first l has address 4202, etc. We will sometimes annotate diagrams with the address of one location, and we can illustrate all of the addresses in an array of char in the same way.

Pointer variables

A string variable is often represented as a pointer variable, that is, a memory location that contains an address. For example, consider a C declaration that initializes a variable named str of type char* with the string literal "Hello". In the corresponding diagram, str holds the number 4200, the address of the first character of the string.

(Note: The number 4200 is unlikely to be the actual address value in practice, since addresses are typically 32-bit or 64-bit integers. We will invent example numerical addresses as needed for our illustrations.)

The type of the memory location named str is char*, meaning that str is capable of holding one address of a char memory location. Of course, str is a memory location, too, and therefore has its own memory address. In our illustrations we use the number 8800 for the address of str.

The address 8800 is the address of a char* memory location. We can store that number, too! A variable ptr can hold the address of str. The type of ptr is char**, because it holds the address of a char* variable. This memory configuration would be created by C code that declares ptr with type char** and initializes it with &str, the address of str.

If a programming situation required it, we could create char*** variables, char**** variables, etc. However, we will need only the char** type for our exercises.

Explicit memory address numbers describe how a computer system identifies and manages its memory locations. But we will use arrows to make it easier for human programmers to recognize address relationships. We will draw an arrow from one memory location P to another memory location Q when P holds the address of Q.
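The relationships among the string, str, and ptr can be checked with a short sketch. The function names follow_twice and print_configuration are hypothetical, chosen only for this illustration, and the addresses printed will be real ones rather than the invented 4200 and 8800.

```c
#include <stdio.h>

/* Return the character reached by following ptr twice:
   ptr -> a char* variable -> a char location.
   (follow_twice is a hypothetical helper for this illustration.) */
char follow_twice(char **ptr)
{
    return **ptr;
}

/* Print the addresses involved. The values differ from the invented
   numbers 4200 and 8800 and may change from one run to the next. */
void print_configuration(void)
{
    char *str = "Hello";   /* str holds the address of the character H */
    char **ptr = &str;     /* ptr holds the address of str             */

    printf("address held in str: %p\n", (void *)str);
    printf("address of str:      %p\n", (void *)&str);
    printf("address held in ptr: %p\n", (void *)ptr);
    printf("char reached via ptr: %c\n", follow_twice(ptr));
}
```

Calling print_configuration() from main should show the same address on the second and third lines, since ptr holds the address of str.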
The same memory configuration as above can be drawn using arrows instead of explicit address numbers: one arrow points from ptr to str, and another points from str to the first character of the string. Here, one arrow represents storage of the address 8800 of the memory location str, and the other arrow represents storage of the address 4200 of the string "Hello". This allows us to represent the most important address relationships in a diagram while omitting all the (invented) explicit address numbers.

A pointer variable (such as str or ptr) can be modified using assignment. For example, starting from the memory configuration above, an assignment that adds 1 to str (for example, str = str + 1) causes the variable str to hold the address of the second char location in the string "Hello". Thereafter, str will have the value "ello". In an arrow diagram, the arrow from str then points to the character e. In a diagram using address numbers, a "cross out" indicates that the value stored in the memory location str was changed from 4200 to 4201, which is the address of the character e.

Memory allocation strategies

Memory allocation for string literals

C-language code must be transformed into an executable program using a compiler. We refer to this process as compilation. Often a string literal such as "Hello" in C code will produce a string in memory that cannot be modified later in the program. In that case, an assignment that stores a new character into the string of the example memory configuration above (for example, str[1] = 'x') would typically compile without complaint but fail when the program runs, because modifying a string literal has undefined behavior in C. Compiler configuration options and the target platform determine whether or not a string literal produces a string that can be modified. Therefore, we will allocate memory explicitly for our strings in many of our examples, in order to ensure that we can modify the corresponding memory locations later. Allocation means setting aside memory for use by a program.

Compile-time memory allocation

One option is to allocate modifiable memory at compile time, that is, during compilation.
(Note that a compiler allocates memory for a string literal such as "Hello" at compile time, although that memory allocation may not be modifiable.) An array of char declared with a fixed size produces a modifiable compile-time allocation; a string can then be copied into the array, and the character x assigned to the second location of the string. Here we use the standard library function strcpy, which may require the header file string.h on your system.

Note: In a memory diagram, we indicate the assignment str[1] = 'x' by crossing out the character e and writing the new value x nearby. This means that the value in the second char location has changed. (The allocation of that second char location remains unchanged; only the value stored in that location was changed.)

Dynamic memory allocation

It is possible to allocate memory dynamically, that is, while a program is running (executing), using the standard library function malloc. Note that malloc may require inclusion of the header file stdlib.h for successful compilation. Dynamic allocation is especially useful when a programmer cannot know how much memory to allocate at compile time. We will not need dynamic memory allocation for our exercises, but it may optionally be used, and it is worth knowing about for further applications.

More terminology: function calls; guards

A function call is a programming-language expression specifying that a particular function should be applied using particular values. For example, the expression printf("The value of str is %s\n", str) is a call of the function printf, using the two values "The value of str is %s\n" and str. The values provided for a function call (if any) are called arguments or parameters.

We will use the term guard to refer to a (Boolean) expression that determines whether conditional execution should occur. For example, in an if statement that begins if (x < 0), the expression x < 0 is the guard. Likewise, in a for statement whose loop condition is i < max, the expression i < max is the guard.
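The two allocation strategies described above can be sketched as follows. This is a minimal illustration rather than the handout's own listing: the names copy_to_heap and demo_allocation, the array name arr, and the array size 10 are assumptions made for the example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return a dynamically allocated, modifiable copy of src, or NULL if
   malloc fails. The caller must release the copy with free().
   (copy_to_heap is a hypothetical helper for this illustration.) */
char *copy_to_heap(const char *src)
{
    char *copy = malloc(strlen(src) + 1);   /* +1 for the null byte */
    if (copy != NULL) {                     /* guard: copy != NULL  */
        strcpy(copy, src);
    }
    return copy;
}

/* Modify the second character of a string under both strategies. */
void demo_allocation(void)
{
    char arr[10];            /* fixed-size array: modifiable storage */
    strcpy(arr, "Hello");
    arr[1] = 'x';            /* allowed, because arr is modifiable   */
    printf("%s\n", arr);     /* prints Hxllo */

    char *str = copy_to_heap("Hello");
    if (str != NULL) {
        str[1] = 'x';        /* allowed, because malloc'd memory is modifiable */
        printf("%s\n", str); /* prints Hxllo */
        free(str);
    }
}
```

Checking the result of malloc before using it is a design choice worth keeping even in small programs, since malloc returns NULL when no memory is available.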

Introduction to strsep()

The library function strsep (header file string.h) makes it possible to break an initial substring off of a larger string. Note that strsep is not part of ISO standard C; it is provided by BSD-derived and GNU C libraries. For example, if str holds the string "Hello, world!", the call strsep(&str, " ") returns the value "Hello," and changes str to have the value "world!".

Note: The call to strsep above includes an address operator &, because the first argument of strsep must be the address of a char* variable (such as str). Passing the address of a char* variable makes it possible to modify that char* variable.

The second argument of strsep specifies the delimiter character(s) to use for breaking a string. In the example above, a space character is the delimiter. If a comma is used as the delimiter instead, the call of strsep returns "Hello" (without a comma), and the string variable str is changed to hold " world!" (observe the initial space).

Multiple delimiter characters may be specified in the second argument. For example, a second argument consisting of a comma, a space, and the letter l indicates that any of those three characters can serve as delimiter. The return value from that call of strsep is "He" and the variable str becomes "lo, world!", since the first delimiter character appearing in str is the first letter l.

We may describe the library function strsep using the following function specification. Here, arg1 and arg2 refer to the two arguments for strsep.

strsep
• 2 arguments: the address of a string variable (type char**); and a string (type char*).
• State change: the first matching occurrence of a character in arg2 among the characters of *arg1 is replaced by a null byte; and *arg1 is replaced by the address of the character just after that matching character.
• Return: the original value of *arg1.

Exercise 1

Write a C program that reads one line of input, then uses strsep to print the words of that input, one per line.
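Before attempting the exercise, the three example calls described in the introduction to strsep can be checked with a small sketch. The helper split_once, its parameter names, and the 64-byte buffer size are assumptions made for this illustration; the feature-test macro on the first line is likewise an assumption about what some glibc configurations need in order to declare strsep.

```c
#define _DEFAULT_SOURCE   /* assumption: lets glibc declare strsep in strict ISO C mode */
#include <string.h>

/* Break text at the first character that appears in delims.
   The part before that character is copied to head, the part after it
   to tail. head and tail must each have room for 64 chars.
   (split_once is a hypothetical helper for this illustration.) */
void split_once(const char *text, const char *delims, char *head, char *tail)
{
    char buf[64];                        /* modifiable copy: strsep writes a null byte */
    strncpy(buf, text, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';

    char *rest = buf;                    /* plays the role of str in the text */
    char *first = strsep(&rest, delims); /* first is the original value of rest */

    strcpy(head, first);
    /* If no delimiter was found, strsep sets rest to NULL. */
    strcpy(tail, rest != NULL ? rest : "");
}
```

For instance, after split_once("Hello, world!", " ", head, tail), head holds "Hello," and tail holds "world!", matching the first example. The helper works on a copy because strsep modifies the string it is given, which would not be safe for a string literal.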