
Learn About Encodings in Python With Data From How ISIS Uses Twitter Dataset (2016)

© 2019 SAGE Publications, Ltd. All Rights Reserved. This PDF has been generated from SAGE Research Methods Datasets.

Student Guide

Introduction

This dataset example introduces researchers to the concept of encodings for text data. Everything stored in a computer is stored in the form of binary numbers, that is, sequences of 0s and 1s; thus, technically, computers do not store letters, numbers, or other characters directly. To handle text data in computers, therefore, we need a mapping between the characters used by humans and the binary numbers "understood" by computers. Such mappings are called encodings. For example, the letter "A" is normally encoded as 65 in decimal or 01000001 in binary.
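As a quick illustration, here is a minimal Python sketch of ours (not part of the original dataset or guide) that looks up the code point of the letter "A" and its binary and byte representations:

# Inspect how the letter "A" maps to a number, a binary string, and a byte.
char = "A"

code_point = ord(char)               # the character's numeric ID (code point)
binary = format(code_point, "08b")   # the same number written in binary
raw_bytes = char.encode("ascii")     # the byte actually stored under ASCII

print(code_point)   # 65
print(binary)       # 01000001
print(raw_bytes)    # b'A'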

Encodings are not text-analysis methods per se and hence are often overlooked by researchers and teachers. However, they are a foundationally important concept, and encoding problems can cause serious trouble in analyses conducted far downstream from the text–binary interface, especially when the analysis involves languages other than plain English. For example, if you downloaded a text file in Chinese from the Internet and did not open it with the correct encoding, you would see a lot of question marks, weird symbols, or anything but the Chinese text you would like to analyze.

This example describes the concept of encodings and discusses several popular encodings in use. We illustrate encoding processes using a subset of data from the 2016 How ISIS Uses Twitter dataset (https://www.kaggle.com/fifthtribe/how-isis-uses-twitter/home). The data are collected by a digital agency, Fifth Tribe, and are released under the CC0: Public Domain license through the platform Kaggle. Specifically, we demonstrate how to load this text file with different encodings.

What Is an Encoding?

Character Set

Before discussing encodings, we need to introduce a related concept called a character set, which is the set of objects to be encoded. Take plain English, for example. We probably want to encode the letters A–Z (in both upper and lower cases); we then need the numbers 0–9; and we probably also need symbols such as "," and ".". All these objects together constitute a character set. The standard character set for plain English is the so-called ASCII set. It contains all the objects we just mentioned as well as some other special symbols (dollar and pound signs, some mathematical operators, etc.). Each object in a character set has a unique ID called its code point.
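A small Python sketch of ours can make the ASCII character set and its code points concrete; the standard library's string module lists the printable characters:

# The printable ASCII characters, as exposed by Python's string module.
import string

print(string.ascii_uppercase)   # ABCDEFGHIJKLMNOPQRSTUVWXYZ
print(string.ascii_lowercase)   # abcdefghijklmnopqrstuvwxyz
print(string.digits)            # 0123456789
print(string.punctuation)       # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

# Each character has a unique code point, and the mapping works both ways.
print(ord(","))   # 44
print(chr(44))    # ,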

However, English is not the only language in the world, and almost any other language needs some additional or alternative symbols or objects: Arabic, Chinese, French, German, Hindi, Japanese, Russian, Swedish, etc. For a while, almost every language had its own character set, and the world was accumulating a wealth of character sets and corresponding encodings. Beyond their sheer number and variety, another problem with having so many distinct character sets and encodings is that it made working with text files containing multiple languages very inconvenient. Eventually, Unicode was developed to redress this situation. Unicode contains virtually all the characters and symbols in all the languages on earth; it is currently the largest character set in use.
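To see how one character set covers many scripts, here is a brief Python sketch (our illustration) printing the Unicode code points of characters from several languages:

# Unicode code points for characters from different scripts.
for char in ["A", "é", "م", "中"]:
    print(char, ord(char), hex(ord(char)))

# Output:
# A 65 0x41
# é 233 0xe9
# م 1605 0x645
# 中 20013 0x4e2d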

Encoding

Now we turn to encodings. ASCII is also an encoding scheme itself. That is to say, each character in ASCII is mapped to a binary sequence directly by converting its code point (i.e., its numeric ID) to a binary number. For example, the code point for the letter "A" is 65, and it is encoded as 65 in binary: 01000001. For Unicode, the scheme had to become more complicated, which creates some confusion for many people about encodings. First, Unicode, despite the word code in its root, is not actually an encoding scheme; it is a character set. Second, there are not one but three standard encoding schemes that map a code point in the Unicode set to a binary sequence in the computer: UTF-32, UTF-16, and UTF-8, with UTF-8 being the most popular encoding. For example, most webpages on the Internet are encoded in UTF-8. For most users of text analysis, it is not necessary to know the details of the mappings. Nonetheless, we discuss them briefly in the following paragraphs for the curious.

It might seem straightforward to map a code point in Unicode to a binary sequence directly by converting the code point into binary, just as we do for ASCII. The huge size of Unicode, however, complicates this. Given how many objects it contains (consider all the Asiatic, Arabic, and Cyrillic symbols in addition to the Roman ones in ASCII), the largest code point in Unicode would take 32 bits, that is, 32 zeros or ones, in binary. We can certainly encode every code point as a 32-bit binary sequence, and that is exactly what UTF-32 does, but a lot of space is wasted. Recall that the code point for "A" is 65, which is 01000001 in ASCII encoding. The code point for "A" is also 65 in Unicode, but if we encode it as a 32-bit binary sequence, then it becomes 00000000000000000000000001000001. Although this encoding scheme is simple, it is not economical, as most of the space is wasted on the 0s padded in to reach 32 digits. Variable-length encodings such as UTF-8 and UTF-16 were developed to address this issue. If a code point can be encoded with a single byte (8 bits), then UTF-8 will encode it with a single byte (e.g., the letter "A"); if a single byte is not enough for some code point, UTF-8 will use two bytes (16 bits); if two are not enough, UTF-8 will try three, and so on.

UTF-16 sits between UTF-8 and UTF-32: it uses at least two bytes and four bytes when necessary. In the variable-length encodings, several bits are used to signal whether the currently encoded character is single-byte or multi-byte and, if multi-byte, how many bytes are in use.
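The byte counts are easy to check in Python. The following sketch of ours encodes a few characters with all three schemes; the little-endian codec variants are used only so that no byte-order mark inflates the counts:

# Compare how many bytes each encoding uses for the same characters.
# The "-le" (little-endian) codecs are used so Python does not prepend
# a byte-order mark, which would otherwise add to the byte count.
for char in ["A", "م", "中"]:
    sizes = {enc: len(char.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(char, sizes)

# A {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# م {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
# 中 {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}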

Detecting Encodings

Since there are multiple encodings, a common question comes up when opening a text file: Which encoding does this file use? (Technically, the process of opening a text file and displaying it on the screen is decoding, but if we know the encoding, then we know how to decode, so we will use the two terms interchangeably in this guide.) This is an important question because if you open a text file with the wrong encoding scheme, you might see weird symbols or might fail to open the file altogether. Unfortunately, there is no foolproof way to tell which encoding a file uses by merely looking at its filename or content (a pile of 0s and 1s, remember?). If there are no metadata about the file's encoding, then the best one can do is trial and error: for example, start with ASCII, and if the text does not display correctly, try UTF-8, and so on.
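A minimal Python sketch of this trial-and-error approach might look like the following; the filename "tweets.csv" is a placeholder, and note that some single-byte encodings (such as latin-1) accept any byte sequence, so a successful decode does not guarantee the text is meaningful. Third-party libraries such as chardet can also guess an encoding from the raw bytes.

# Trial-and-error decoding: try candidate encodings until one succeeds.
# "tweets.csv" is a placeholder; substitute the path to your own file.
candidates = ["ascii", "utf-8", "utf-16", "latin-1"]

with open("tweets.csv", "rb") as f:   # read raw bytes, no decoding yet
    raw = f.read()

for encoding in candidates:
    try:
        text = raw.decode(encoding)
        print(f"Decoded successfully with {encoding}")
        break
    except UnicodeDecodeError:
        print(f"{encoding} failed, trying the next candidate")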

Illustrative Example: Encodings for Arabic Tweets

This example shows how to read a text file using different encodings, with data on roughly 17,000 ISIS-related tweets from more than 100 Twitter users from all over the world. This dataset has been very helpful in developing effective counter-messaging measures against terrorism worldwide. A necessary precondition for any useful analysis of these data, however, is loading the text content correctly.

The Data

This example uses a subset of data from the 2016 How ISIS Uses Twitter dataset (https://www.kaggle.com/fifthtribe/how-isis-uses-twitter/home). The data are collected by a digital agency, Fifth Tribe, and are released under the CC0: Public Domain license through the platform Kaggle. The variable we examine is

• tweet: the text content of each tweet.

There are 17,410 tweets (rows) in the dataset, posted between January 6, 2015, and May 13, 2016, that is, before and after the November 2015 Paris Attacks. At least two languages, English and Arabic, are used in the tweets, making these data appropriate for demonstrating encodings.

Analyzing the Data

The data file is encoded with UTF-8. Hence, if we open it with the UTF-8 encoding, it will display correctly on the screen. If we open it with ASCII, some of the text will not be decoded correctly and will be shown as seemingly random symbols, or the file may fail to open at all, depending on which system you are using.
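In Python, this can be reproduced with a short sketch of ours (the path "tweets.csv" is again a placeholder for the locally saved data file):

# Reading the file with the correct encoding works as expected.
with open("tweets.csv", encoding="utf-8") as f:
    text = f.read()
print(text[:200])   # Arabic and English text read correctly

# Reading the same file as ASCII fails, because the Arabic characters
# are stored as bytes outside the ASCII range.
try:
    with open("tweets.csv", encoding="ascii") as f:
        f.read()
except UnicodeDecodeError as err:
    print("ASCII cannot decode this file:", err)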

For example, using UTF-8, the Arabic characters in the tweets are shown correctly. Using ASCII or another incompatible encoding, the same bytes are interpreted by the computer as seemingly "random" symbols.

Depending on what system you use, you might see different symbols, but they will not be shown as the Arabic characters. Note that the encoding not only influences how the text is displayed on computers; some software may not even be able to open the file if given a wrong encoding, which of course makes any subsequent analysis impossible. Therefore, when you collect texts such as tweets, it is good practice to always store them with the same encoding and to save the encoding information as metadata for the files.
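The effect of an incompatible encoding can be demonstrated with a small sketch of our own (the Arabic word below is an illustrative string, not a tweet from the dataset):

# The same UTF-8 bytes, interpreted with different encodings.
arabic = "السلام"                 # an example Arabic word
raw = arabic.encode("utf-8")      # how the text is actually stored on disk

print(raw.decode("utf-8"))        # السلام (correct)
print(raw.decode("cp1252"))       # Ø§Ù„Ø³Ù„Ø§Ù… (seemingly random symbols)
print(raw.decode("ascii", errors="replace"))   # twelve replacement characters (�)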

Because this guide is an illustration of encodings, it contains no actual text analysis. For further information, please see the SAGE Research Methods Dataset on Basics in Text Analysis. Again, it cannot be stressed too much how important understanding encodings is as a precondition for working with text data.

Review

An encoding is a mapping between the characters used by humans and the binary sequences (i.e., strings of 0s and 1s) that computers can "understand." Encodings are important because they determine how text data are stored and processed on computers, especially when the text data involve languages other than plain English. Many encodings are available, such as ASCII and UTF-8.

You should know:

• What are encodings.
• What is a character set and its relationship with an encoding.
• Some commonly used character sets and encodings.
• How to load data with different encodings.

Your Turn

You can download this sample dataset along with a guide showing how to load text data with different encodings using statistical software. See whether you can reproduce the results presented here. Then, try loading the data with another encoding, such as UTF-32.
