Characters, Character Strings, and String-Manipulation Functions in C

See Kernighan & Ritchie, Section 1.9 and Appendix B3.

Characters

Printable characters (and some non-printable ones) are represented as 8-bit numeric values.
They are stored in variables of type char.
The 7-bit ASCII character code is used on nearly all C systems. It is compatible with the Latin-1 and UTF-8 encodings, both of which keep ASCII as their first 128 codes.

Character   ASCII decimal   ASCII hexadecimal   ASCII binary
A           65              0x41                0100 0001
a           97              0x61                0110 0001
B           66              0x42                0100 0010
b           98              0x62                0110 0010

Character Strings (Text)

C does not have a "string" type!
C has arrays of variables of its data types, including characters.
Arrays must have a constant, fixed length.
Text is kept in character arrays.
The arrays must be as big as or bigger than the text strings. In practice, they are almost always bigger.

Strings of Characters in Character Arrays

C character strings are held in memory as ASCII values, with an ASCII 0, or null, at the end.
These are "null-terminated strings", also called ASCIIZ.
Character arrays must be big enough to include the null character as well as the printable characters.
Any extra elements in the array may be filled with more nulls, with garbage, or with remnants of previous strings.

Putting a String in a Character Array

Initialize an array when you declare it:

    char bfrA[10] = "abcdefg";   // 7 characters plus the null; last two elements unused
    char bfrB[] = "hijklm";      // array is just big enough (6 characters plus the null)

Assign characters one-by-one:

    char bfrA[10];
    bfrA[0] = 'a';
    bfrA[1] = 'b';
    bfrA[2] = 'c';
    ⁝
    bfrA[7] = '\0';

We'll see a better way shortly.

String Input

scanf(): use "%s" for the format specifier, and supply a character array.
The amount of input can be limited with a field width: "%20s" reads at most 20 characters, so the array needs at least 21 elements to leave room for the terminating null.
fgets() expects a character array. It also expects the array size, to limit the amount of input text.
gets() also expects a character array, but it cannot limit the amount of input, so don't use it; always use fgets() instead.
The result is an array of characters that are valid up to the terminating null.

String Output

printf(), puts(), fputs(): all expect a null-terminated character string.
They will keep printing characters until they see a null.
So if you give them a character array that doesn't contain a null, they'll keep going: off the end of the array, and until they happen to run into a 0 byte somewhere in memory (or run out of legal memory).

Example: Caesar Cipher

Also known as "rot-N" for "rotate N characters"; rot-13 is a common one.
Simply done with character arithmetic.
Use Boolean variables.
Repeated inputs.
Rotation count as command-line input.

Solution

Sometimes side-effects are good, even necessary...

Character-String Functions in C

Functions That Work With Character Strings

Find these in <string.h> …

strlen(char *src)
    reports the number of "meaningful" characters
    stops counting at the first null character

strcmp(char *dest, char *src)
    compares two strings
    returns 0 if they match each other
    returns a negative value if dest comes before (is "less than") src
    returns a positive value if dest comes after src
    note: "dest == src" compares addresses, so it only tests whether both names refer to the same array in memory

Finding a Character In a String

char *strchr(char *s, int c)
    Returns the location of the first occurrence of character c in the string s, or NULL if c does not occur. The result is a pointer, not an index/offset.

char *strrchr(char *s, int c)
    Returns the location of the last occurrence of character c in the string s.

char *strstr(char *haystack, char *needle)
    Returns the location of the first occurrence of the substring needle within the (larger) string haystack.

Example: the strchr() function [the slide's example is not reproduced in this text]. In that example, strrchr() finds "am!".

Making New Strings

strncpy(char *dest, char *src, size_t n)
    size_t is an unsigned integer type, typically unsigned or long unsigned
    copies a text string from the src array into the dest array
    copies at most n characters
    if the src string has n or more characters, the copied result will not be null-terminated!
    set n <= the dest array's length to prevent buffer overflow

strncat(char *dest, char *src, size_t n)
    appends at most n characters from the src array to the end of the string in the dest array
    unlike strncpy(), it always writes a terminating null after the appended characters, so dest needs room for its current text, up to n more characters, and the null

Putting a String in a Character Array

Assign characters to the array:

    char bfrA[10];
    /* Ugh:
    bfrA[0] = 'a';
    bfrA[1] = 'b';
    bfrA[2] = 'c';
    ⁝
    bfrA[7] = '\0';
    */
    strncpy(bfrA, "abcdefg", 10);   // better than character-by-character

Avoiding Buffer Overflow

Recognize, but don't use these...
Older versions of strncpy(), strncat():

strcpy(char *dest, char *src)
    copies a text string from one array into another
    appends a null character to the end
    buffer overflow occurs if src is longer than dest!

strcat(char *dest, char *src)
    appends a text string to the end of another
    also appends a null character
    can also overflow the dest

Implementing String Functions

Functions that operate on strings actually work on character arrays.
A function must step through each character of the arrays, checking for the null character.
A typical function uses a loop to step through the array.

An Implementation of "strlen()"

Prototype:
    unsigned strlen(char *src);
Equivalent to:
    unsigned strlen(char src[]);
(The library version in <string.h> returns size_t rather than unsigned.)

A simple counting loop works:

    unsigned strlen(char *src)
    {
        unsigned i;
        for (i = 0; src[i] != '\0'; i++)
            ;
        return i;
    }

An Implementation of "strncpy()"

Prototype:
    char *strncpy(char *dest, char *src, unsigned maxlen);
Equivalent to:
    char *strncpy(char dest[], char src[], unsigned maxlen);

Use a loop to copy (see "man strncpy"):

    unsigned i;
    for (i = 0; i < maxlen && src[i] != '\0'; i++)
        dest[i] = src[i];
    for ( ; i < maxlen; i++)    // pad with nulls, if possible
        dest[i] = '\0';
    return dest;
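
The strncpy() warning can be handled by always writing the final null yourself, and the pointer returned by strchr() can be turned into an index by subtracting the start of the string. The following is a minimal sketch under those assumptions; copy_text and index_of are hypothetical helper names chosen for illustration and are not part of <string.h> or of the original slides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* copy_text is a hypothetical wrapper, not a standard library function.
 * It copies src into dest (an array of size elements) and always leaves
 * dest null-terminated, truncating src if it does not fit. */
char *copy_text(char *dest, const char *src, size_t size)
{
    assert(dest != NULL && src != NULL && size > 0);
    strncpy(dest, src, size - 1);   /* copy at most size-1 characters */
    dest[size - 1] = '\0';          /* strncpy may not have written a null */
    return dest;
}

/* index_of is also hypothetical. strchr() returns a pointer, so the
 * index is the distance from the start of the string.
 * Returns -1 if c does not occur in s. */
long index_of(const char *s, int c)
{
    const char *p = strchr(s, c);
    return (p == NULL) ? -1L : (long)(p - s);
}
```

A caller would typically pass sizeof on the destination array, for example copy_text(bfrA, "abcdefg", sizeof bfrA). Reserving the last element for the null trades one character of capacity for a result that is always safe to hand to printf() or strlen().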
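
The Caesar cipher example names its ingredients (character arithmetic and a rotation count), but the solution code did not survive in this text. The following is a rough sketch of the central step only, not the original solution. It assumes ASCII, where the letters of each case are contiguous, and uses a hypothetical function name, rot_n.

```c
#include <assert.h>
#include <stddef.h>

/* rot_n is a hypothetical helper, not taken from the original slides.
 * It rotates each ASCII letter in the null-terminated string text by n
 * positions, in place, and leaves every other character unchanged.
 * Modifying the caller's array is a deliberate side effect. */
char *rot_n(char *text, int n)
{
    size_t i;

    assert(text != NULL);
    n = ((n % 26) + 26) % 26;       /* normalize so that 0 <= n < 26 */

    for (i = 0; text[i] != '\0'; i++) {
        char c = text[i];
        if (c >= 'A' && c <= 'Z')
            text[i] = (char)('A' + (c - 'A' + n) % 26);
        else if (c >= 'a' && c <= 'z')
            text[i] = (char)('a' + (c - 'a' + n) % 26);
    }
    return text;
}
```

A complete program along the lines of the slide would read the rotation count from the command line (for example with strtol() on argv[1]), then loop on fgets(), calling rot_n on each line and printing the result. Applying rot-13 twice returns the original text, which makes it a convenient self-check.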