
Character Arrays (aka “cstrings”)

------Section #1: Character Arrays (p.455) ------

As you know already, an array is a contiguous block of elements, all of the same data type. Let’s review the syntax for creating an array:

dataType arrayName[dimension] = {initialization list};

This allows us to create arrays of ints, doubles, longs, shorts, etc. One special and important use for arrays is to create arrays of chars. Think of it – with an array of chars, we would be able to store a sequence of chars, which is exactly what a string is. Let’s try it out!

What does this program do? Take a few moments to study the code, perhaps even draw a picture of memory, and what will be written to stdout:

// version #1
#include <iostream>
using namespace std;

int main() {
    char charBuf[12] = {'H', 'e', 'l', 'l', 'o', ' ', 't', 'h', 'e', 'r', 'e', '!'};
    int index;

    for (index = 0; index < 12; ++index) {
        cout << charBuf[index];
    }

    return 0;
}

Got it? If you guessed you’ll see “Hello there!” on stdout, you’re correct! All that’s happening is that we have an array charBuf where each array element is of type char, and it has a dimension of 12. The initialization list is setting each array element to a character using a list of character constants. So naturally, when the loop is entered and a zero-based index is used to subscript into the array, you’ll see each character appear, one at a time. In memory it would look something like this:

charBuf
[0x210]
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 'H' | 'e' | 'l' | 'l' | 'o' | ' ' | 't' | 'h' | 'e' | 'r' | 'e' | '!' |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
  [0]   [1]   [2]   [3]   [4]   [5]   [6]   [7]   [8]   [9]  [10]  [11]

In the picture above, the array charBuf landed at memory address 0x210 and occupies a contiguous block of chars, with each char initialized to what’s in the initialization list. You can see how a loop using a zero-based index would walk through that array and display each char, right?

How about this version:

// version #2
#include <iostream>
using namespace std;

int main() {
    char charBuf[] = {'H', 'e', 'l', 'l', 'o', ' ', 't', 'h', 'e', 'r', 'e', '!'};
    int index;

    for (index = 0; index < 12; ++index) {
        cout << charBuf[index];
    }

    return 0;
}

If you guessed that you’ll get exactly the same result, then you’d be exactly right! The only difference is that the dimension was left out of the array declaration, but as you know, that’s allowed if you provide a complete initialization list: the compiler counts the initializers and uses that count as the dimension. So everything works the same.

------Section #2: cstrings (p.455) ------

But you know, it’s a real hassle to write out that initialization list for charBuf. There are so many single quotes and commas that they almost overwhelm what we’re trying to do. There has to be a better way… And there is! Here’s another version:

// version #3
#include <iostream>
using namespace std;

int main() {
    char charBuf[] = "Hello there!";
    int index;

    for (index = 0; index < 12; ++index) {
        cout << charBuf[index];
    }

    return 0;
}

What’s going on here? A string literal is being used to initialize the array? How can that work?

It works because a string literal is a character array! Every time you’ve used a string literal to write to stdout, what you’re really writing out to the screen is a character array, because a string literal is a character array, so it’s an appropriate initializer for the charBuf character array. So let’s try putting the dimension back into the array declaration:

// version #4
#include <iostream>
using namespace std;

int main() {
    char charBuf[12] = "Hello there!";
    int index;

    for (index = 0; index < 12; ++index) {
        cout << charBuf[index];
    }

    return 0;
}

And what do we see this time? Well, if you try to compile that code you’ll see a compiler error! But why? The charBuf array is declared with a dimension of 12, and if you count the characters inside the string literal, you’ll see that there are exactly 12 characters inside that string literal. However, there’s one character inside the string literal that you don’t see, it’s at the very end of the string, just after that exclamation point. It has an ASCII integer value of zero, and it’s used to mark the end of the character array. If we were to see what the string literal looks like in memory, it would be something like this:

[0x480]
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 'H' | 'e' | 'l' | 'l' | 'o' | ' ' | 't' | 'h' | 'e' | 'r' | 'e' | '!' |'\0' |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
  [0]   [1]   [2]   [3]   [4]   [5]   [6]   [7]   [8]   [9]  [10]  [11]  [12]

Do you see that character at the very end? That’s what is called the null character. As I just mentioned, the null character has an ASCII integer value of zero, and if you’re working with a char array, the integer value zero has special meaning in that context: its purpose is to mark the end of the string. Notice that it’s written as an escape sequence, '\0', because if you simply wrote '0' the compiler would interpret that as the character zero, which has an ASCII integer value of 48 (see for yourself on the ASCII table on p.1038 in your textbook).

Why is there a null character at the end of the array above? Because, every time you write a string literal in your code (which is a sequence of chars enclosed in a pair of double quotes) what you’re really writing is a null-terminated character array. In other words, a string literal is a null-terminated character array!

So can you now see why version #4 above won’t compile? The variable charBuf was declared with a dimension of 12, which would allocate twelve chars in a solid block. However, the string literal being used as an initialization list actually contains 13 chars – the 12 you can see, and the null that you cannot see (but it’s there!). So the code fails to compile because the storage requirements of the string literal exceed the declared dimension of charBuf. This also explains why it will compile if you don’t explicitly specify the dimension, because without an explicit dimension the compiler will simply allocate enough memory to store the string literal, null character included (take another look at version #3).

It’s important that you remember that the character zero isn’t the same thing as the null character (which has a value of zero). The character zero is the displayable character that you would see on the screen, but the null character is not a displayable character; it’s just used to indicate the end of a string. C++ expects strings to be marked with the null character at the end, and this kind of string goes by a special name: it’s what we call a cstring. What’s a cstring?

A cstring is a null-terminated character array.

In other words, a cstring is just a character array that’s terminated with the integer value zero, because a null character has an integer value of zero. Since a character array is used to store human readable text as a string, the null character plays a special role in the representation of that string, it allows us to know where the end of the string is. This is true only of character arrays – an array of any other data type is not assumed to be terminated with a null character, only cstrings are.

Okay, back to version #3 of the code above. That version works nicely because we’re now able to initialize charBuf by leaving out the dimension and simply using the string literal as the initialization list. However, look at the boolean expression controlling the loop:

// version #3
#include <iostream>
using namespace std;

int main() {
    char charBuf[] = "Hello there!";
    int index;

    for (index = 0; index < 12; ++index) {
        cout << charBuf[index];
    }

    return 0;
}

Ugh! There’s still that pesky 12 in there. In other words, although the string literal saves us from declaring a dimension, we still have to count the number of characters inside the string literal to know how many times to loop. Or do we?

The answer is “no!” Why? Because the string literal is a cstring, the initialization list contains all of the characters you see inside the double quotes, as well as that null character that you don’t see. So when charBuf is allocated and initialized by the compiler, it will look like this:

charBuf
[0x210]
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 'H' | 'e' | 'l' | 'l' | 'o' | ' ' | 't' | 'h' | 'e' | 'r' | 'e' | '!' |'\0' |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
  [0]   [1]   [2]   [3]   [4]   [5]   [6]   [7]   [8]   [9]  [10]  [11]  [12]

That means that charBuf contains a cstring, and since a cstring is null-terminated, that means that charBuf is also null-terminated. Now, we can exploit that and simplify our loop:

// version #5
#include <iostream>
using namespace std;

int main() {
    char charBuf[] = "Hello there!";
    int index;

    for (index = 0; charBuf[index] != '\0'; ++index) {
        cout << charBuf[index];
    }

    return 0;
}

With this version, we no longer have to count the characters inside the string literal, instead we just rely on the fact that cstrings are null-terminated. So the loop keeps running as long as the character at charBuf[index] is not a null character. Very cool!

Okay, I can tell you’re getting excited about this stuff, but there’s more… How about another way to display the string to stdout:

// version #6
#include <iostream>
using namespace std;

int main() {
    char charBuf[] = "Hello there!";

    cout << charBuf;
    return 0;
}

Huh? How can that possibly work? Well, recall that an array identifier with no subscript operator evaluates to its base address. In this cout statement, charBuf is being used without an index:

cout << charBuf; // provide the base address of charBuf for the insertion operator

Take a look at the picture I drew above of charBuf showing its layout in memory. You’ll notice that it has a base address of 0x210, so conceptually it’s as if the cout statement were written like this:

cout << 0x210; // the base address of charBuf

What will happen is that the compiler will see that we’re providing the base address of an array, and that it’s an array of chars, so it will choose the version of the insertion operator that treats that address as the start of a cstring. The insertion operator will jump to the base address of the array and display the characters, one at a time, until the terminating null character is reached. The null character itself is not displayed.

Remember that a string literal is a cstring, it’s sitting in memory somewhere as an unnamed, null-terminated array of chars. Not only that, but when you write one, it will evaluate to the base address of that unnamed array. So every time you’ve ever written a string literal to stdout, what you’ve really been doing is providing the base address of an unnamed, null-terminated character array. Think back to the very first program we wrote:

#include <iostream>
using namespace std;

int main() {
    cout << "Hello world!" << endl;
    return 0;
}

What’s really happening is that the string literal “Hello world!” is sitting in memory as a cstring, and the compiler is providing the base address of that unnamed array to the insertion operator. The insertion operator goes to the location of that cstring and displays the characters until the terminating null is found.

------Review Questions “Character Arrays and cstrings” ------

1. What is a cstring?

2. What is the null character used for?

3. What is the ASCII integer value of the null character?

4. What is the ASCII integer value of the character zero?

5. How is the character zero written as a character constant?

6. How is the null character written as a character constant?

7. What is a string literal?