Modern Text Hiding, Text Steganalysis, and Applications: A Comparative Analysis

entropy

Review

Milad Taleby Ahvanooey 1,* , Qianmu Li 1,2,*, Jun Hou 1, Ahmed Raza Rajput 1 and Yini Chen 1

1 School of Computer Science and Engineering, Nanjing University of Science and Technology, P.O. Box 210094, Nanjing, China; [email protected] (J.H.); [email protected] (A.R.R.); [email protected] (Y.C.)
2 Intelligent Manufacturing Department, Wuyi University, P.O. Box 529020, Jiangmen, China
* Correspondence: [email protected] (M.T.A.); [email protected] (Q.L.); Tel.: +86-02584315982 (Q.L.)

Received: 9 March 2019; Accepted: 27 March 2019; Published: 1 April 2019

Abstract: Modern text hiding is an intelligent programming technique which embeds a secret message/watermark into a cover text message/file in a hidden way to protect confidential information. Recently, text hiding in the form of watermarking and steganography has found broad applications in, for instance, covert communication, copyright protection, and content authentication. In contrast to text hiding, text steganalysis is the process and science of identifying whether a given carrier text file/message has hidden information in it and, if possible, extracting/detecting the embedded hidden information. This paper presents an overview of the state of the art in text hiding and provides a comparative analysis of recent techniques, especially those focused on marking structural characteristics of digital text messages/files to hide secret bits. We also discuss different types of attacks and their effects to highlight the pros and cons of the recently introduced approaches. Finally, we recommend some directions and guidelines for future work.

Keywords: modern text hiding; text steganography; text steganalysis; covert communication

Entropy 2019, 21, 355; doi:10.3390/e21040355; www.mdpi.com/journal/entropy

1. Introduction

Reflecting the new trends and rapid progress in the field of information technology in the form of smart gadgets, communications, and digital content, an extensive environment with the capability to transfer, copy, duplicate, and share information over the Internet has been built. However, this revolution in the digital world and the online distribution of digital media also implies that such information is vulnerable to malicious attacks, unauthorized access, forgery, plagiarism, etc. Moreover, digital texts in the form of text messages/files are used in many applications, such as password authentication, chatting, mobile banking, online news, commerce, and so on. When we send a text message via short message service (SMS), email, social media, and so on, the information included in the message is transmitted as plain text, exposing it to attacks. In some cases, this information may be sensitive/confidential, such as passwords or banking credentials, and sending it via SMS or other unsecured communication channels is a significant drawback, as such channels provide no security before transmission. Meanwhile, hackers regularly try to break the safety of communication channels (e.g., network protocols, SMS, etc.) to access sensitive information during data transmission. Therefore, demand is growing for intelligence and multimedia security studies that involve not only encryption, but also covert communication whose essence lies in concealing data [1–19].

Recently, information hiding or data hiding in digital texts, known as text hiding, has drawn considerable attention due to its extensive usage and potential applications in the cybersecurity and network communication industries [20–127]. Text hiding is the process of embedding secret data through a cover text or supportable technologies such as network protocols, SMS, etc. so that the existence of the data is invisible/undetectable for adversaries or casual viewers [1,6,8]. It has been widely considered as an attractive technology to improve the use of conventional cryptography algorithms in the area of multimedia security by concealing a secret message/watermark into a cover text file/message to protect confidential information.

As depicted in Figure 1, the various categories of information security systems that are utilized to protect sensitive data from crackers, deceivers, hackers, and spies are divided into cryptography and information hiding [3]. Cryptography scrambles a plain text (secret data) into cipher text to prevent unauthorized access to its content. On the other hand, information hiding conceals a secret message in a cover medium (e.g., text, image, audio, or video) so that the trace of the embedded hidden data is unnoticeable/undetectable. Cryptography and information hiding are similar in that both are employed to protect confidential/sensitive information. Nonetheless, invisibility is the difference between the two systems, i.e., information hiding concerns how to conceal information so that it is not noticeable. In practice, information hiding can be classified into watermarking and steganography. The goal of watermarking is to provide proof of ownership for the cover media against malicious attacks such as tampering, forgery, and plagiarism (e.g., the embedded watermark indicates the original owner). In contrast, the aim of steganography is the invisible transmission of confidential information so that no one (except an intended recipient) can discover/decode it, i.e., steganography concerns concealing the fact that a medium contains secret data, which is invisible/indiscernible [1,3,41].

Figure 1. Various categories of information security systems [3,19,20].

During the last two decades, many text hiding algorithms have been introduced in terms of text steganography and text watermarking for covert communication [1,6,8–14,20,31,36,39,51,91], copyright protection [3–5,7,18,20–29,44,49–68,72–75,87–92,98–109], and copy control and authentication [31,57,60,74,78,93–98].

The main contributions of this paper are summarized as follows:

• We provide a brief review of existing literature on text hiding schema, attacks, text steganalysis, applications, and fundamental criteria.
• We summarize some of the recently proposed text hiding techniques which are focused on altering the structure of the cover text message/file to conceal secret information.
• We present a comparative analysis of the structural based algorithms and evaluate their efficiency with respect to common criteria.
The rest of the paper is organized as follows: Section 2 presents some background literature and related studies on the information hiding area. Section 3 explains various types of text hiding approaches, along with their limitations. In Section 4, we evaluate some of the recently proposed structure-based algorithms and highlight their pros and cons. In Section 5, we give some suggestions for future works. Finally, Section 6 concludes the paper with a summary of contributions.

2. Literature Review

In what follows, we present the existing literature on the text hiding area consisting of the schema, fundamental criteria, the Unicode standard, and text steganalysis.

2.1. Text Hiding Schema

The basic scenario of a cryptography covert channel is Simmons’ prisoner problem [108]. Alice and Bob are locked up in two separated cells but are permitted to communicate under the watch of Eve, the prison warden. If Eve discovers the existence of hidden information in a transmitted message, she stops their communication and punishes them. Eve is an active warden if she makes noise to make Alice and Bob’s task more difficult. She is a passive warden if she merely detects and investigates the transmitted data [12]. From the digital data hiding point of view, text steganography/watermarking is a different scenario, which is based on the practice of hiding a secret message (SM) in a cover message/file (CM) by marking invisible symbols, such that the trace of embedding the SM is invisible/undetectable by human vision systems. In theory, the Modern Text Hiding schema (MTH) can be considered as a form of communication. Figure 2 demonstrates the modern text hiding schema, which is represented as MTHS [3,9,10,12,47,48,76,77,79], where

$$\mathrm{MTHS} = \langle \{CM, SM, K, CM_{HM}\},\ \{Att(), CM_{HM}, CM'_{HM}\},\ \{Emb(), Ext()\} \rangle$$

Figure 2. Modern text hiding schema.

As depicted in Figure 2, the modern text hiding scenario consists of two main phases and a third-party phase, namely Embedding “Emb()”, Extraction “Ext()”, and Attacks “Att()”.

Algorithm 1: Pseudocode of Emb()

Input: a cover text (CM), a secret message (SM), a secret key (K)
Output: a carrier text message or stego-object (CMHM) which consists of CM and HM
1. SM ← Secret Message (e.g., confidential information such as password, banking credentials, etc.);
2. CM ← Cover Message (e.g., an innocent text message such as prank, joke, etc.);
3. K ← Secret Key (e.g., a symmetric or asymmetric key algorithm such as One-Time-Pad, AES, DES, etc.);
4. for each ci ∈ SM = {c1, c2, ..., cn} do
5.     SMbits ← SMbits + Convert each SM[ci] to an 8-bit string based on the ASCII code;
6. end for
7. Encrypted_SMbits ← Encrypt the SMbits based on K using a special encryption function;
8. HM ← Convert the Encrypted_SMbits to invisible symbols such as space between words, text color, etc.;
9. CMHM ← Embed the HM into the CM, where the attacks may not detect/remove it easily;
10. Return CMHM;
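The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of any surveyed technique: the key is assumed to be a list of one-time-pad bits applied with XOR, the only invisible symbol is the inter-word gap (two spaces for '1', one space for '0', following the double-space example given for Emb()), and the names `text_to_bits`, `random_key`, and `emb` are hypothetical.

```python
import secrets


def text_to_bits(message: str) -> list[int]:
    """Steps 4-6: turn each ASCII character into 8 bits, most significant bit first."""
    data = message.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def random_key(n_bits: int) -> list[int]:
    """Assumed key format: n_bits random one-time-pad bits."""
    return [secrets.randbits(1) for _ in range(n_bits)]


def emb(cover: str, secret: str, key_bits: list[int]) -> str:
    """Hide `secret` in the gaps between the words of `cover`."""
    bits = text_to_bits(secret)
    if len(key_bits) < len(bits):
        raise ValueError("key must provide at least one bit per message bit")
    # Step 7: encrypt the message bits (XOR with the pad; an assumption).
    encrypted = [b ^ k for b, k in zip(bits, key_bits)]

    words = cover.split()  # also normalises any existing whitespace
    if len(encrypted) > len(words) - 1:
        raise ValueError("cover text has too few word gaps for this message")

    # Steps 8-9: one gap per bit; two spaces mark '1', one space marks '0'.
    pieces = [words[0]]
    for index, word in enumerate(words[1:]):
        is_one = index < len(encrypted) and encrypted[index] == 1
        pieces.append(("  " if is_one else " ") + word)
    return "".join(pieces)
```

Gaps left over after the last message bit keep a single space and therefore read as '0' bits, so a receiver would need the message length or an end marker by some agreed convention; the text does not specify one, and this sketch leaves it open.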


(1) Embedding (Emb()): Alice employs this function to hide an SM in the CM, in three stages. In the first stage, the embedding function converts the letters of the SM into a binary string (SMbits). In the second stage, it encrypts the SMbits with an encryption algorithm (e.g., One-Time-Pad, AES, DES, etc.) based on an optional key (K) to secure its content, producing the encrypted SMbits. In the third stage, it converts the encrypted SMbits to a hidden message (HM) by marking/embedding invisible symbols through the CM. For example, to mark a bit ‘1’, Emb() adds two spaces between words, while a single space represents a bit ‘0’. Finally, it generates a carrier message (CMHM). Algorithm 1 depicts the sequence of the Emb() in more detail [1,10,12].

(2) Attack (Att()): During the communication process, attackers may attempt to break the security of the CMHM by decoding or manipulating the HM using steganalysis techniques. This process may alter or remove the HM, yielding a compromised carrier message (CM’HM). It is assumed that the attackers do not have any clue about the encoding function, the secret key, or Emb(). In some cases, attackers employ conventional approaches to guess the invisible/hidden symbols which are statistically distinguishable, and to extract/decode the original message; in practice, however, this is infeasible for attackers if the text hiding algorithm utilizes an encryption function during the embedding/extraction process. Algorithm 2 explains the sequence of the Att() in more detail [1,9,10,12].

Algorithm 2: Pseudocode of Att()

Input: a carrier message (CMHM), an estimated secret key (EK)
Output: a compromised carrier message (CM’HM), an estimated secret message (ESM)
1. HS ← Estimate the hidden/invisible symbols from the CMHM;
2. for each ci ∈ HS = {c1, c2, ..., cn} do
3.     Estimated_SMbits ← Estimated_SMbits + Guess the binary string of each symbol based on the HS[ci];
4.     EKbits ← EKbits + Guess the secret key according to the HS[ci] using the conventional approaches;
5. end for
6. SMbits ← Try to decrypt the Estimated_SMbits based on the EK;
7. ESM ← If it is possible, estimate/decode the SMbits using conventional approaches;
8. CM’HM ← Manipulate the CMHM in order to remove the HM;
9. Return CM’HM, ESM;
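A rough Python sketch of the first and last stages of Algorithm 2 is given below. It assumes the attacker only looks for two kinds of marks, multi-space gaps and a handful of zero-width characters, and that a gap of two or more spaces encodes '1'; real steganalysis is considerably more involved. The function names and the `ZERO_WIDTH` set are hypothetical choices for illustration.

```python
import re

# Assumed set of invisible marks an attacker might look for.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u200e", "\u200f", "\ufeff"}


def suspicious_marks(carrier: str) -> dict[str, int]:
    """Step 1: count candidate hidden symbols in the carrier message."""
    return {
        "multi_space_gaps": len(re.findall(r" {2,}", carrier)),
        "zero_width_chars": sum(ch in ZERO_WIDTH for ch in carrier),
    }


def estimate_bits(carrier: str) -> list[int]:
    """Steps 2-5 (bits only): guess one bit per gap, 2+ spaces -> 1, 1 space -> 0."""
    gaps = re.findall(r" +", carrier.strip())
    return [1 if len(gap) >= 2 else 0 for gap in gaps]


def sanitize(carrier: str) -> str:
    """Step 8: an active attack that removes the marks without reading them."""
    without_zero_width = "".join(ch for ch in carrier if ch not in ZERO_WIDTH)
    return re.sub(r" {2,}", " ", without_zero_width)
```

If the embedded bits were encrypted, `estimate_bits` recovers only ciphertext bits, which is consistent with the point above that guessing the symbols does not by itself reveal the SM. `sanitize`, by contrast, needs no key at all, which is why removal attacks matter for watermarking.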

(3) Extraction (Ext()): Bob utilizes this function to extract/discover the original SM from the CM’HM. Since the CM’HM is transmitted via communication channels, the HM may be exposed to attacks, so it is necessary to verify the original SM using the same encryption function that was used during the embedding process, i.e., Alice has already shared the key with Bob, or he has knowledge about the special symbols of the key through the CM’HM. Two different terms are employed for this function, namely “detection” and “extraction”. Although researchers often treat both as the same function in the literature, we distinguish them as follows: extraction (Ext()) discovers/extracts the SM from the CM’HM and authenticates its integrity, while detection only verifies the existence of the SM in the CM’HM. Algorithm 3 outlines the sequence of the Ext() in more detail [1,9,10,12].

Algorithm 3: Pseudocode of Ext()

Input: an affected carrier message (CM’HM), a secret key (K)
Output: a secret message (SM’)
1. HS ← Discover the existing hidden marks/symbols from the CM’HM;
2. K ← Secret Key (e.g., the symmetric or asymmetric key algorithm such as One-Time-Pad, AES, DES, etc.);
3. for each ci ∈ HS = {c1, c2, ..., cn} do
4.     Encrypted_SMbits ← Encrypted_SMbits + Detect the binary string of each invisible symbol from HS[ci];
5.     Kbits ← Kbits + Utilize a shared key from Alice or extract the secret key from the CM’HM;
6. end for
7. SMbits ← Decrypt the Encrypted_SMbits based on Kbits using the corresponding decryption function;
8. SM’ ← Extract the original SM characters from the SMbits based on their ASCII codes;
9. Return SM’;
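Algorithm 3 can be sketched in the same spirit. The snippet below assumes the inter-word coding from the Emb() example (two or more spaces read as '1', a single space as '0'), an XOR one-time pad as the decryption function, and a message length `n_chars` agreed out of band; none of these choices is prescribed by the text, and `ext` is a hypothetical name.

```python
import re


def ext(carrier: str, key_bits: list[int], n_chars: int) -> str:
    """Recover `n_chars` ASCII characters hidden in the word gaps of `carrier`."""
    # Step 1: collect the hidden marks, one per gap between words.
    gaps = re.findall(r" +", carrier.strip())
    marks = [1 if len(gap) >= 2 else 0 for gap in gaps]

    needed = 8 * n_chars
    if len(marks) < needed or len(key_bits) < needed:
        raise ValueError("carrier or key is too short for the expected message")

    # Step 7: decrypt (XOR with the shared pad; an assumption of this sketch).
    plain_bits = [m ^ k for m, k in zip(marks[:needed], key_bits)]

    # Step 8: regroup the bits into 8-bit ASCII codes, most significant bit first.
    chars = []
    for start in range(0, needed, 8):
        value = 0
        for bit in plain_bits[start:start + 8]:
            value = (value << 1) | bit
        chars.append(chr(value))
    return "".join(chars)
```

For instance, with an all-zero key the carrier "the quick  brown fox jumps over the lazy  dog again today" carries the bits 01000001, so `ext(carrier, [0] * 8, 1)` gives "A". After whitespace normalisation by an attacker, every gap reads as '0' and the message is lost, which illustrates why Ext() is also expected to authenticate the integrity of the SM.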

2.2. Information Theoretic and Modern Text Hiding

This subsection discusses an ideal text hiding system in which the CM and CMHM (the cover message without and with the hidden information, respectively) are statistically indistinguishable, i.e., the CM and CMHM have the same probability distribution. We employ the stego-system models presented in [10,127] to clarify this requirement. As depicted in Figure 2, Alice and Bob can exchange messages of a certain kind (called cover messages/files) over a public/private channel which is accessible to Eve. Alice wishes to transmit an SM under cover of the CM to Bob so that Eve cannot observe whether there exists an HM in the CMHM. The entropy of information theory (H) is a popular metric for information measurement introduced by Shannon [128]. It quantifies the randomness existing in a message. Equation (1) is commonly utilized to compute Shannon’s entropy [129–131]. Let us assume that the CM consists of n unique symbols (or characters), i.e., CM = {c1, c2, c3, ..., cn}. Herein, ci is the i-th symbol, which occurs with probability 0 < P(ci) < 1, where the probabilities P(c1), ..., P(cn) sum to 1, i.e., P(ci) is the probability of occurrence of the i-th element ci. Thus, the entropy of CM can be calculated as follows:

$$H_{CM} = -\sum_{i=1}^{n} P(c_i)\,\log_2 P(c_i) \qquad (1)$$
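Equation (1) can be evaluated directly from character frequencies. The sketch below assumes that P(ci) is estimated as the relative frequency of each character in the message, which is one common empirical choice rather than something fixed by the text; `shannon_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Empirical Shannon entropy of `text`, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # P(c_i) is estimated as count(c_i) / total, then plugged into Equation (1).
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(shannon_entropy("aabb"))  # two equally likely symbols: 1.0 bit
print(shannon_entropy("abcd"))  # four equally likely symbols: 2.0 bits
```

Comparing `shannon_entropy(cm)` with `shannon_entropy(cm_hm)` gives a first, coarse indication of how much an embedding changes the symbol statistics of a cover text.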

Let us suppose that Eve does not try to disrupt communication between Alice and Bob, but only attempts to determine whether hidden information is being transmitted. In [10], Cachin presented the first formal analysis of the stego-system, in which the probability distributions of the CM and CMHM are assumed to be known and both texts (CM and CMHM) are statistically close. Later, in [127], Ryabko and Ryabko considered a setting in which the CM and CMHM are statistically indistinguishable. They assumed that Alice has access to an oracle which generates independent and identically distributed cover texts according to some fixed but unknown distribution µ. The CM/CMHM consists of symbols that belong to some (possibly infinite) alphabet A. Alice wishes to employ this source as cover to transmit hidden messages. An HM is a sequence of symbols from B = {0, 1} produced independently with equal probabilities of ‘0’ and ‘1’. Also, it is assumed that Alice encrypts SMs using a key shared only with Bob, i.e., similar to a common cryptosystem scenario. If Alice utilizes the Vernam cipher, then the encrypted SMs are certainly distributed according to the Bernoulli (1/2) distribution, while if Alice employs modern block or stream ciphers, the encoded sequence “looks like” a sequence of random Bernoulli (1/2) trials. Herein, “looks like” means that it is indistinguishable in polynomial time, or that the resemblance has been confirmed experimentally by statistical tests for all broadly utilized ciphers [132,133]. Eve or a third party monitors all messages transmitted from Alice to Bob and attempts to detect whether SMs are being passed in the CM or not. In the best-case scenario, if the text hiding technique does not change the statistics of the cover by embedding the SM, the CM and CMHM have the same probability distribution (µ); hence, it is impossible to distinguish the presence of the HM in the CMHM.

In [127], the authors confirmed that if the alphabet A is finite, then the average number of invisible/hidden symbols per character, L_n, goes to Shannon’s entropy H(µ) for the source µ as n goes to infinity, where

$$H(\mu) = -\sum_{a \in A} \mu(a)\,\log_2 \mu(a)$$

Since some existing text hiding techniques embed invisible symbols into the CM for marking the SMbits, the trace of embedding in the CMHM is visually imperceptible; in practice, however, the CM and CMHM are statistically distinguishable, and their variation rate can be calculated by Equation (2), i.e., a Jaro similarity function [29,125,126].
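To make "statistically close" concrete, one option is the relative entropy (Kullback-Leibler divergence) between the symbol distributions of the CM and CMHM, the measure commonly associated with Cachin's model [10]. The passage above does not name a measure, so this choice, the character-level estimate, and the add-one smoothing are all assumptions of the sketch below; it is not the Jaro-based Equation (2). The names `distribution` and `relative_entropy` are hypothetical.

```python
import math
from collections import Counter


def distribution(text: str, alphabet: set[str]) -> dict[str, float]:
    """Character distribution over `alphabet` with add-one smoothing (assumed)."""
    counts = Counter(text)
    total = len(text) + len(alphabet)
    return {symbol: (counts[symbol] + 1) / total for symbol in alphabet}


def relative_entropy(cm: str, cm_hm: str) -> float:
    """D(P_CM || P_CMHM) in bits; 0.0 means the two estimates coincide."""
    alphabet = set(cm) | set(cm_hm)
    p = distribution(cm, alphabet)
    q = distribution(cm_hm, alphabet)
    return sum(p[s] * math.log2(p[s] / q[s]) for s in alphabet)
```

A value of zero corresponds to the ideal case described above, where embedding leaves the distribution µ unchanged. Extra spaces or zero-width characters shift the estimated distribution and give a positive value, which is the kind of statistical trace that a detector can exploit even when nothing is visible on screen.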

2.3. The Unicode Standard Unicode is a universal standard which has been introduced for the processing, encoding, and handling of the digital texts expressed in most of the world’s writing systems from 1987 until now [100–104]. In other words, the Unicode standard is an encoding system which designed to support the worldwide display, processing, and interchange of the texts with different languages Entropy 2018, 20, x FOR PEER REVIEW 6 of 30 EntropyEntropy2019, 212018, 355,, 20,, xx FORFOR PEERPEER REVIEWREVIEW 6 of6 of 31 30 technical disciplines. Moreover, it also supports classical and historical characters of many languages. technical disciplines. Moreover, it also supports classical and historical characters of many languages. Necessarily, Unicode is required by the various Internet protocols (e.g., TCP/IP, SMTP, FTP, and and technicalNecessarily, disciplines. Unicode is Moreover, required by it alsothe various supports Internet classical protocols and historical (e.g., TCP/IP, characters SMTP, of FTP, many and HTTP, etc.) and implemented in all operating systems (e.g., Android, Windows, iOS, and BlackBerry) languages.HTTP, etc.) Necessarily, and implemented Unicode isin requiredall operating by thesystems various (e.g., Internet Android, protocols Windows, (e.g., iOS, TCP/IP, and BlackBerry) SMTP, and programming languages for processing and displaying digital texts. This standard consists of FTP,and programming HTTP, etc.) and languages implemented for processing in all operating and displaying systems digital (e.g., Android, texts. This Windows, standard iOS,consists and of three different encoding forms, UTF-8, UTF-16, and UTF-32, for which Unicode provides 17 planes, BlackBerry)three different and programming encoding forms, languages UTF-8,for UTF-16, processing and UTF-32, and displaying for which digital Unicode texts. provides This standard 17 planes, each with “65,536” possible letters (or ‘code points’). 
Therefore, it affords a total of 1,114,112 possible consistseach of with three “65,536” different possible encoding letters forms, (or ‘code UTF-8, points’) UTF-16,. Therefore, and UTF-32, it affords for whicha total Unicodeof 1,114,112 provides possible symbols/characters in various formats such as numbers, letters, emoticons, and a vast number of 17 planes,symbols/characters each with “65,536” in various possible formats letters such (oras nu ‘codembers, points’). letters, Therefore,emoticons, itand affords a vast a number total of of current characters in different languages, i.e., the UTF-8 presents one byte for any ASCII character, 1,114,112current possible characters symbols/characters in different languages, in various i.e., the formats UTF-8 such presents as numbers, one byte letters, for any emoticons, ASCII character, and which have the same code values in both ASCII and UTF-8, and up to four bytes for other symbols a vastwhich number have of the current same code characters values in both different ASCII languages, and UTF-8, i.e., and the up UTF-8 to four presents bytes for oneother byte symbols for [1–7]. In the Unicode, there are special zero-width characters (ZWC) which are employed to provide any ASCII[1–7]. In character, the Unicode, which there have are thespecial same zero-width code values characters in both (ZWC) ASCII which and UTF-8, are employed and up to to provide four specific entities such as Zero Width Joiner (ZWJ), e.g., ZWJ joins two supportable characters together bytesspecific for other entities symbols such [ 1as– 7Zero]. In Width the Unicode, Joiner (ZWJ), there aree.g., special ZWJ joins zero-width two supportable characters characters (ZWC) whichtogether in particular languages, POP directional, and Zero Width Non-Joiner (ZWNJ), etc. Practically, the are employedin particular to languages, provide specific POP directional, entities such and as Zero Zero Width Width Non-Joiner Joiner (ZWJ), (ZWNJ), e.g., etc. 
ZWJ Practically, joins two the ZWC characters do not have traces, widths or written symbol in digital texts [1–8,11,15,18,25– supportableZWC characters characters do together not have in particulartraces, widths languages, or written POP directional,symbol in anddigital Zero texts Width [1–8,11,15,18,25 Non-Joiner – 28,33,34,41–43,50–63,64–68,86–100]. Recently, many text hiding techniques that utilize social media, (ZWNJ),28,33,34,41 etc. Practically,–43,50–63,64 the– ZWC68,86– characters100]. Recently, do not many have text traces, hiding widths techniques or written that utilize symbol social in digital media, email, SMS, as communication channels have been introduced [1,6,8,11,20,36,37]. In a particular social textsemail, [1–8,11 SMS,,15,18 as,25 communication–28,33,34,41–43 channe,50–68ls,86 have–100 been]. Recently, introduced many [1,6,8,11,20,36,37]. text hiding techniques In a particular that utilize social media platform, if it employs the Unicode standard toto processprocess digitaldigital textstexts inin differentdifferent languages,languages, socialmedia media, platform, email, SMS, if it employs as communication the Unicode channels standard have to process been introduced digital texts [1 ,in6, 8different,11,20,36 ,languages,37]. In a thenthen thethe ZWCsZWCs representrepresent invisibleinvisible writtenwritten symbols.symbols. Otherwise,Otherwise, theythey mightmight justjust showshow somesome unusualunusual particularthen the social ZWCs media represent platform, invisible if it employs written the symbols. Unicode Otherwise, standard tothey process might digital just show texts some in different unusual symbols.symbols. AsAs listedlisted inin TableTable 1,1, WeWe havehave collectedcollected alalll ofof thethe utilizedutilized characterscharacters fromfrom thethe literatureliterature andand languages,symbols. then As thelisted ZWCs in Table represent 1, We have invisible collected written all of symbols. 
Otherwise, they might just show some unusual symbols. As listed in Table 1, we have collected all of the utilized characters from the literature and tested them by Java programming in .txt, MS .docx, and HTML files, i.e., the ZWCs have no trace with respect to the written symbol. In practice, when ZWCs/special spaces are employed for embedding secret data in the cover text, the default encoding used must be one of the Unicode encodings like UTF-8, UTF-16, or UTF-32. In case of attack, if a malicious user copies a target text which contains some ZWCs into a new host file that uses a Unicode encoding, then these characters are treated as Unicode characters and keep an invisible text trace. Otherwise, they display some unsupported characters and raise suspicions about the existence of secret information [1,3,6,7].

Table 1. The most utilized special Unicode characters in recently introduced techniques.

| Algorithm | Name | Hex Code | Decimal Code | Written Symbol |
|---|---|---|---|---|
| [1,27,28,33,42,55,58,91] | Zero-Width-Non-Joiner | U+200C | 8204 | No symbol and width |
| [1,4] | POP Directional | U+202C | 8236 | No symbol and width |
| [1,4] | Left-To-Right Override | U+202D | 8237 | No symbol and width |
| [1,28,33,42] | Left-To-Right Mark | U+200E | 8206 | No symbol and width |
| [4] | Right-To-Left Override | U+202E | 8238 | No symbol and width |
| [5,6,53,54,91] | Narrow No-Break Space | U+202F | 8239 | No symbol and width |
| [55,56] | Left-to-right embedding | U+202A | 8234 | No symbol and width |
| [55,56] | Right-to-left embedding | U+202B | 8235 | No symbol and width |
| [7,55,56] | Mongolian-vowel separator | U+180E | 6158 | No symbol and width |
| [28,33] | Right-To-Left Mark | U+200F | 8207 | No symbol and width |
| [28,33,42,55,56] | Zero-Width-Joiner | U+200D | 8205 | No symbol and width |
| [42,55,56,58] | Zero-Width-Space | U+200B | 8203 | No symbol and width |
| [55,56] | Zero-Width-Non-Break | U+FEFF | 65279 | No symbol and width |
| [5–7,27,34,53,54,58] | Hair Space | U+200A | 8202 | “ ” |
| [5–7,27,34,54] | Six-Per-Em Space | U+2006 | 8198 | “ ” |
| [5–7,27,34,54] | Figure Space | U+2007 | 8199 | “ ” |
| [5–7,27,34,54] | Punctuation Space | U+2008 | 8200 | “ ” |
| [5–7,34,54,58] | Thin Space | U+2009 | 8201 | “ ” |
| [5–7,34,54] | En Quad | U+2000 | 8192 | “ ” |
| [5–7,34,54] | Three-Per-Em Space | U+2004 | 8196 | “ ” |
| [5–7,34,54] | Four-Per-Em Space | U+2005 | 8197 | “ ” |
| [5–7,27,34,100] | Normal Space | U+0020 | 32 | “ ” |
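Because the characters of Table 1 leave no written symbol, they can only be found by processing the text programmatically. The following is a minimal sketch of such a scan, not the test harness used for the experiments above; the function name `find_invisible` and the choice to include only the rows listed as "No symbol and width" are assumptions made for illustration.

```python
import unicodedata

# Code points of the Table 1 rows listed as "No symbol and width".
# (Assumption: the visible-width special spaces are left out of this scan.)
INVISIBLE_CODE_POINTS = {
    0x200B, 0x200C, 0x200D, 0x200E, 0x200F,
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E, 0x202F,
    0x180E, 0xFEFF,
}


def find_invisible(text):
    """Return (index, 'U+XXXX', Unicode name) for each listed character in text."""
    hits = []
    for index, ch in enumerate(text):
        if ord(ch) in INVISIBLE_CODE_POINTS:
            name = unicodedata.name(ch, "UNKNOWN")
            hits.append((index, f"U+{ord(ch):04X}", name))
    return hits


if __name__ == "__main__":
    sample = "a\u200cb\u200d"
    for hit in find_invisible(sample):
        print(hit)
```

A scan of this kind works on any Unicode-encoded copy of the text, which is consistent with the observation that the hidden characters survive only while the host file keeps a Unicode encoding.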

Based on our experiments, Gmail blocked the “U+200B” character, and Apple iOS does not allow one to transmit the “U+200D” character. Moreover, we highlighted the special Unicode spaces between double quotation marks and changed the font color to show their width, but they are transparent in practice.

These days, social media play a vital role in the digital world; end users rely on them to keep in touch with friends or to make new ones. Sometimes, to exhibit confidence, they post/share their latest accomplishments with friends. Everyone utilizes social media differently, according to their priorities and awareness. Further, these tools are handy for online advertisements, payments, and business systems. In their early stages, social media were not widely used, but now people can use them for almost anything in their daily life, and they have affected people’s cultures more than anything else in recent years. Large media companies are not expected to go away overnight, nor will the demand to communicate by smartphone or meet people in person, but social media provide one more means of engaging with users worldwide and, if employed effectively, could give everyone a more desirable option in how to live and communicate with each other in the digital world. Since text messages in the form of SMS, chat, email, and so on, have become a popular and easy form of communication, concerns about data leakage attacks, such as hacking, hijacking, and phishing, have emerged [1,6,8,11]. Table 2 lists the text character limitations of social media and messenger apps which support the Unicode standard to process digital texts in different languages (except for ‘Twitter’ and ‘Telegram’).

Table 2. Text Character Limitation of Social Media and Messenger apps [1,6].

| Number | Social Media or Messenger Name | Message/Post | Text Limits: Number of ASCII Characters | Text Limits: Number of UTF-8 Characters |
|---|---|---|---|---|
| 1 | SMS | Message | 2048 | 1024 |
| 2 | Facebook | Wall Post | 63,206 | 31,603 |
| 3 | LinkedIn | Post | 52,286 | 29,718 |
| 4 | Twitter | Tweet | 280 | 140 (Exclusive encoding) |
| 5 | Google+ | Post | 100,000 | 50,000 |
| 6 | Instagram | Pic Caption | 2200 | 1100 |
| 7 | Pinterest | Pin Description | 500 | 250 |
| 8 | YouTube | Video Description | 5000 | 2500 |
| 9 | WhatsApp | Message | 30,000 | 30,000 |
| 10 | Gmail | Mail Text | 35,000,000 | 35,000,000 |
| 11 | WeChat | Message | 16,207 | 16,207 |
| 12 | Imo | Message | Virtually Unlimited | Virtually Unlimited |
| 13 | Hangouts | Message | Virtually Unlimited | Virtually Unlimited |
| 14 | Telegram | Message | 4096 (Exclusive encoding) | 4096 (Exclusive encoding) |
| 15 | Line | Message | 10,000 | 10,000 |
| 16 | Tango | Message | 520 | 520 |
| 17 | QQ | Message | 16,207 | 16,207 |

2.4. Text Hiding Applications

Text steganography algorithms are applicable in many settings. The most significant applications are described below.

2.4.1. Hidden Communication

Text hiding could be utilized to communicate hidden information over public networks such as the Internet. One may embed secret bits into an unnoticeable text message/file which is routinely transmitted over such networks: a greeting, joke, story, etc. Since text messages/files are sent using unsecured communication channels such as SMS, social media and so on, they are exposed to attacks. Users of such techniques may include intelligence personnel or people who are subject to censorship, such as detectives, journalists, judges, and so on [1,6,10–12].

2.4.2. Network Covert Channels

Text hiding can be used to make covert channels that provide unexpected stealthy communication over the networks. Recently, covert channels were employed by cyber-attacks, i.e., to permit a covert transmission of malware data. Nevertheless, they could also be applied for legitimate goals, such as transmitting illicit information under Internet censorship [14,98,107].

2.4.3. Unauthorized Access Detection

Text hiding could also be employed to detect unauthorized access to sensitive documents over private networks. For example, sensitive/confidential documents in a governmental or commercial organization can be marked with identifiers that are difficult to detect. The aim is to trace unauthorized access/use of a sensitive document to a specific user who may have obtained a copy of the marked document. The receiver of such documents should not be aware of the existence of the identifiers [12,40,64].

2.5. Text Hiding Criteria

There are many things to be considered when programmers design a text hiding algorithm. However, the fundamental criteria can be easily found in recently introduced algorithms: invisibility, embedding capacity, robustness, and security [1]. The communication channel over which the CMHM is transmitted can be noisy or noiseless, for the case of an active or a passive warden, respectively. Also, the steganographer capability to select the CM is often restricted if not altogether non-existent [12]. In a network (private or public) application, the CM is produced by a steganographer (in a public channel) or a content provider (in a private channel), i.e., for the private network application, the authority responsible for document security. Moreover, for the covert channel application, the CM is created by the computer, not by the infringer. Depending on these applications, a trade-off must be sought for satisfying the criteria on any point inside the magic triangle as depicted in Figure 3 [1,7,10,12].

Figure 3. Evaluation criteria of text hiding algorithms.

2.5.1. Invisibility

Quantifying an attacker or Eve’s capability to discover/detect the existence of HM is called invisibility (or imperceptibility/detectability/transparency), i.e., the embedding trace of an HM in the CMHM must be invisible and avoid raising the suspicions of human vision systems. In other words, invisibility refers to how many perceptual modifications are made in the CMHM after embedding an HM. Practically, it cannot be measured numerically. The best way of analyzing the degree of invisibility is to compare the variation of CM and CMHM, i.e., with and without the HM [1,7,10,12]. In some literature, researchers utilized the Jaro–Winkler Distance (or Jaro Similarity) for analyzing the similarity of the original CM and CMHM. It can be defined as follows:


The Jaro distance $d_j$ of two given strings $s_1$ = CM and $s_2$ = CMHM, with lengths $|s_1|$ = length(CM) and $|s_2|$ = length(CMHM), is:

$$ d_j = \begin{cases} 0 & \text{if } m = 0 \\ \dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise} \end{cases} \tag{2} $$

where $m$ is the number of matching characters, and $t$ is half the number of transpositions. Two letters from CM and CMHM, respectively, are considered matching only if they are equal and their positions are no more than $\lfloor \max(|s_1|, |s_2|)/2 \rfloor - 1$ apart. Each letter of CM is compared with all the matching characters in CMHM. The number of matching letters that appear in a different sequence order, divided by 2, gives the number of transpositions. If $d_j$ is “0”, then the CM and CMHM are not similar, and “1” means both are exactly the same. A $d_j$ close to 1 indicates that the CM and CMHM are closely similar [29,125,126]. However, this measure does not capture the similarity achieved by structural techniques, because they do not modify the characters of the CM to hide the SMbits.
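To make Equation (2) concrete, the following is a minimal sketch of the Jaro similarity computed directly from the definition above. It is an illustrative implementation, not code from any of the cited works; the function name `jaro_similarity` and the greedy left-to-right matching strategy are assumptions.

```python
def jaro_similarity(s1, s2):
    """Jaro similarity of two strings, following Equation (2)."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0  # no characters can match, so m = 0

    # Matching characters must be equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # t is half the number of matched characters that are out of order.
    seq1 = [c for c, flag in zip(s1, matched1) if flag]
    seq2 = [c for c, flag in zip(s2, matched2) if flag]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2

    return (m / len1 + m / len2 + (m - t) / m) / 3


if __name__ == "__main__":
    print(jaro_similarity("MARTHA", "MARHTA"))
    cover = "hello world"
    stego = "hello\u200c\u200d world"  # two zero-width characters inserted
    print(jaro_similarity(cover, stego))
```

In the second call the two strings look identical on screen, yet the character-level score falls below 1, which illustrates why a purely character-based measure is of limited use for judging the perceptual invisibility of structural techniques.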

2.5.2. Embedding Capacity (EC)

The number of secret bits which can be embedded in the CM is called embedding capacity (or payload). This feature can be measured numerically in units of bits-per-location (BPL) or characters-per-location (CPL). A location is a changeable feature (or character/word) which can be considered an embeddable location (EL) in the CM, such as between words, after special characters, etc. Nevertheless, even if a text steganography algorithm provides a large EC, it is not efficient if it modifies the CM profoundly [1,7,10,12]:

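$$ \mathrm{EC}_{CM} = \mathrm{BPL} \times \mathrm{EL}_{CM} \quad \text{or} \quad \mathrm{EC}_{CM} = \mathrm{CPL} \times \mathrm{EL}_{CM} \tag{3} $$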
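As a worked illustration of Equation (3), the sketch below counts the gaps between words as embeddable locations and multiplies by an assumed number of bits per location. Both the choice of inter-word gaps as the only ELs and the function names are assumptions for illustration; actual techniques define their own locations.

```python
def embeddable_locations(cover_message):
    """Count inter-word gaps, used here as the embeddable locations (EL)."""
    words = cover_message.split()
    return max(len(words) - 1, 0)


def embedding_capacity(cover_message, bits_per_location):
    """EC = BPL x EL, as in Equation (3)."""
    return bits_per_location * embeddable_locations(cover_message)


if __name__ == "__main__":
    cm = "the quick brown fox"
    # Three gaps between four words, two bits hidden per gap.
    print(embedding_capacity(cm, 2))
```

With four words there are three gaps, so a scheme hiding two bits per gap would carry six bits in this cover message.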

2.5.3. Distortion Robustness (DR)

Multiple attacks may occur on the CMHM while it is transmitted over channels where it may be exposed to hazards that could destroy the HM. Moreover, attackers may try to manipulate the HM rather than remove it. Therefore, any type of distortion might occur deliberately or even unintentionally on the CMHM. A robust text hiding algorithm makes the HM extremely difficult to alter or destroy. Robustness can also be measured numerically based on the loss (or removal) probability P(L), i.e., the proportion of the hidden symbols that is lost from the CMHM. Let us suppose that the number of ELs in the CM is NL and the length of the CM is TC. Thus, P(L) = NL/TC, and P(DR) can be computed as follows [1,3]:

$$ P(\mathrm{DR}_{HM}) = 1 - P(L); \quad 1 < NL < TC,\ NL \in \mathbb{N},\ TC \in \mathbb{N}. \tag{4} $$
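The arithmetic of Equation (4) can be checked with a few lines of code. This is only a sketch of the formula as stated, with P(L) = NL/TC; the function name and the example values are hypothetical.

```python
def distortion_robustness(nl, tc):
    """P(DR) = 1 - P(L), with P(L) = NL / TC and 1 < NL < TC (Equation (4))."""
    if not (1 < nl < tc):
        raise ValueError("Equation (4) requires 1 < NL < TC")
    p_loss = nl / tc
    return 1.0 - p_loss


if __name__ == "__main__":
    # Hypothetical cover message: 100 characters with 25 embeddable locations.
    print(distortion_robustness(25, 100))
```

Under this definition, a technique that uses fewer embeddable locations relative to the length of the CM scores a higher robustness value.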

2.5.4. Security

There is a certain level of safety that prevents attackers from detecting the HM visually or from removing it from the CMHM (i.e., quantifying decoding reliability in the presence of channel noise when Eve is an active warden). This measure depends on three other criteria: invisibility, embedding capacity, and distortion robustness. An efficient steganography algorithm must provide an optimum trade-off among these criteria. If a method affords a large EC, the embedding trace of HM is invisible, and robustness is high, then the security of the algorithm can be calculated using Equation (4). In modern text hiding techniques, a cryptosystem can be utilized to protect secret bits against decoding attacks. In practice, the encryption function is employed to secure the SMbits before embedding them into the CM, and alters the sequence of the secret bits such that they can only be extracted by the corresponding decryption function [1,12]. Decoding Probability (DP) is the probability of decoding the original SMbits by guessing attacks. Let us suppose that an attacker suspects that a message may contain an HM (i.e., he/she does not have any clue about the approach that was utilized to conceal the SM). The attacker may try to decode the SM using conventional approaches or by guessing the SMbits (using probability distribution analysis) from the invisible symbols or features. Since an encryption function is used to secure the SMbits based on a secret key (K), it is impossible to decode the original SM from the encrypted SMbits without having the secret key and the corresponding decryption function. If NS is the length of the SM binary, the P(DP) for guessing a correct encrypted binary string of the SM can be calculated as follows:

$$ P(DP) = \sum_{i = i|k}^{NS} \left(\frac{1}{2^{i}}\right)^{\frac{NS}{i}}, \quad \exists\, i : k \in \mathbb{N} \mid i \times k = NS,\ i \in [1, \ldots, NS],\ i \in \mathbb{N} \tag{5} $$

2.5.5. Computational Complexity

The computational cost or complexity is the least significant measure for next-generation smart devices such as computers, smartphones, tablets, etc. Nevertheless, there could be many pages in some text files; thus, it is preferable that steganography/watermarking techniques be computationally less complex. Long text files evidently need more hardware or software resources, that is, they incur a higher computational cost. Generally, the less complex approaches are employed for resource-limited systems such as embedded microprocessors and mobile devices. Let us assume that NS is the length of the SM and LC is the length of the CM; then, the minimum computational cost for Emb()/Ext() is O(NS×LC), due to the need to search LC times to find the embeddable locations for marking each letter of the SM (or SMbits). However, the additional costs of components such as the encryption function, the dictionary of words, etc. must be considered in those techniques that utilize them during the embedding/extraction process [3,46,49].

2.6. Modern Text Hiding & Kerckhoffs’s Principle

Since modern steganography/watermarking is a key-based algorithm similar to cryptography, the question of adherence to Kerckhoffs’s principle may emerge [1,17]. Kerckhoffs was the first to introduce the prudent tradition known as “Kerckhoffs’s principle” for cryptology, according to which an ideal crypto-system should be secure even if everything about the system, except the secret key, is known to the public [104]. Therefore, an ideal text hiding algorithm should guarantee that it adheres to Kerckhoffs’s principle: even if the attacker identifies how the stego-system works, it should not be possible to recover the hidden message without the secret key. As depicted in Figure 2, the CMHM is just like the CM, and the original CM is not sent to the recipient in the transmission process; thus, no receiver can compare the CMHM with the original CM. Therefore, the original SM is only extractable with the key, since it is encrypted using a specific algorithm, so without knowing the original secret key, no one could break a modern text hiding algorithm [10,12,17,104].

2.7. Text Steganalysis and Attacks

In contrast to text steganography (or watermarking), text steganalysis is the process and science of estimating whether a given text message/file has hidden information in it, and, if possible, extracting/recovering the secret message. The term is analogous to cryptanalysis in cryptography. In practice, text steganalysis is a complicated task because of the wide variety of digital text characteristics, the extensive variation of embedding approaches, and, usually, the low embedding distortion. In some cases, text steganalysis is possible because data embedding modifies the statistics of the cover message/file. In other words, the existence of embedded symbols (e.g., in those techniques which modify the CM in order to hide the secret bits) still makes an original CM and its corresponding CMHM different in some aspects, though this is often imperceptible to the human vision system. Concerning the application, steganalysis methods are typically classified into two categories: specific and universal. While the former attempt to break a particular watermarking/steganography algorithm, the latter aim to thwart all watermarking/steganographic algorithms. In practice, specific techniques achieve higher detection accuracy than universal ones because they use prior knowledge of how the particular target algorithm works. However, universal steganalysis is more attractive in practical applications since it can operate independently of the embedding method and even generalize to unknown steganography/watermarking approaches [16,17,105,106]. From a steganalysis point of view, we can classify the possible attacks into three types: visual attacks, structural attacks, and statistical/probabilistic attacks.

2.7.1. Visual Attacks

Visual attacks, or Manipulation by Readers (MBR), refer to a human factor, often a viewer who can perceptually (visually) observe the modifications in the CMHM or stego object. These modifications may consist of syntactic, semantic paraphrasing, lexical, rhetorical changes, and so on. Let us assume that an attacker has complete access to the CMHM; if he suspects that there exist some unconventional modifications in the CMHM, then he might manipulate it (i.e., by an intentional deletion, insertion, or re-ordering of words/characters). In practice, any type of manipulation of the CMHM may destroy the HM [1,3,17,23,111].

2.7.2. Structural Attacks

This attack involves modifying the layout of the CMHM. In some cases, attackers may change the formatting (e.g., the font, or by copying the CMHM to a new host file) or the encoding (e.g., ASCII, UTF-8, UTF-16, etc.) of the CMHM, which may destroy the HM [1,3,17].
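The effect of an encoding change can be illustrated with a short sketch. Here the attack is simulated by re-encoding the text as ASCII and dropping every character that ASCII cannot represent; this is only one possible way a host might handle unsupported characters, and the helper name `reencode_as_ascii` and the sample text are hypothetical.

```python
ZWNJ = "\u200c"  # Zero-Width-Non-Joiner (Table 1)
ZWJ = "\u200d"   # Zero-Width-Joiner (Table 1)


def reencode_as_ascii(text):
    """Simulate an ASCII-only host: characters outside ASCII are dropped."""
    return text.encode("ascii", errors="ignore").decode("ascii")


# A carrier text with three hidden zero-width symbols after the first word.
stego = "Meet" + ZWNJ + ZWJ + ZWNJ + " me at noon"
attacked = reencode_as_ascii(stego)

if __name__ == "__main__":
    print(len(stego), len(attacked))
    print(attacked)
```

After the re-encoding, the visible text is unchanged but the three hidden symbols are gone, so an extraction algorithm would find nothing to decode.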

2.7.3. Statistical Attacks

This attack works based on the possibilities of guessing a correct SM, in which the adversary can discover hidden symbols from the CMHM by considering the number of words, spaces, and so on. Basically, this attack utilizes the knowledge of existing approaches to decode/guess the original SM using probability distribution functions [10]. When the CMHM does not show any visible alterations, the adversary processes the characters/letters of the CMHM to analyze the statistical variations, i.e., it may happen during the data transmission using MITM attacks [1,31,110]. Let us suppose that a CMHM contains NC characters and NH hidden symbols (spaces, zero-width characters, etc.). If the length of the SM is NS, then there are 2^NS possible secret messages which can occur. Thus, the number of possible solutions (NP) for guessing the SM can be obtained as follows:

$$ NP = k \times 2^{NS}, \quad SM = \{c_1, c_2, \ldots, c_{NS}\}. \tag{6} $$

Moreover, the number of ways of guessing the positions of the NH symbols in the CMHM can be computed using Equation (7):

$$ P(NH, NC) = \binom{NC}{NH} = \frac{NC!}{(NC - NH)! \times NH!}, \quad NH \le NC. \tag{7} $$

Therefore, the probability of guessing a correct SM (i.e., cracking probability) from the CMHM can be calculated as follows:

$$ P(SM) = \frac{1}{NP} \times \frac{1}{P(NH, NC)} = \frac{1}{2^{NS} \times \frac{NC!}{(NC - NH)! \times NH!}}. \tag{8} $$

If a text hiding algorithm utilizes an encryption function to secure the SMbits using a secret key, then the P(SM) is equal to zero (i.e., it is impossible to break) [10].
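For the unencrypted case, Equations (6)–(8) can be evaluated numerically. The sketch below is an illustration under stated assumptions: the factor k of Equation (6) is not defined in the text, so it is exposed as a parameter defaulting to 1, which makes the result agree with Equation (8) as printed. The function name and the example values are hypothetical.

```python
from fractions import Fraction
from math import comb


def cracking_probability(ns, nh, nc, k=1):
    """P(SM) from Equations (6)-(8), returned as an exact fraction.

    ns: length of the secret message in bits (NS)
    nh: number of hidden symbols in the carrier (NH)
    nc: number of characters in the carrier (NC)
    k:  factor of Equation (6); assumed to be 1 here
    """
    if nh > nc:
        raise ValueError("Equation (7) requires NH <= NC")
    possible_messages = k * 2 ** ns   # NP, Equation (6)
    placements = comb(nc, nh)         # P(NH, NC), Equation (7)
    return Fraction(1, possible_messages * placements)  # Equation (8)


if __name__ == "__main__":
    # Hypothetical carrier: 10 characters, 2 hidden symbols, 8 secret bits.
    print(cracking_probability(ns=8, nh=2, nc=10))
```

Even for this very small example the probability is already below one in ten thousand, and it shrinks exponentially as NS grows.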

3. Various Types of Text Hiding Techniques

Technically, there are various algorithms employed for information hiding in the form of text steganography and text watermarking in the literature [3,19,46,49]. In practice, these two terms differ in the goal of embedding hidden data into a cover text message/file: where the concern is the protection of cover text content, it is called “text watermarking,” and where the concern is the hidden transmission of the secret information, it is called “text steganography”. We can classify the existing text hiding techniques into one of the categories in Figure 4, namely, structural, linguistic, and random and statistics [2,3,20,29,49].

Figure 4. Various types of text hiding techniques.

3.1. Structural Techniques

Structural or format-based algorithms involve modifying the layout features or format of the CM to mark/hide the SMbits, i.e., based on the Unicode or the ASCII encoding without altering the sentences or words. These features consist of word spacing, line spacing, font style, text color, and so on [1–8,11,20,34,41,54,65,66,100,112–114]. Herein, we classify the structural-based techniques into five categories: open space, line/word shift, zero-width, feature/format, and emoticons.

3.1.1. Open Space

The open space (or white space)-based techniques utilize special Unicode spaces to mark/embed secret bits into the CM, for example, between words, at the end of sentences, and so on. Many approaches have been introduced using the idea of open space during the last two decades. In practice, these techniques provide high invisibility, low embedding capacity and modest robustness against visual attacks. Moreover, they can be applied in multilingual digital texts [6,7,15,27,34,41,54,65,66,100].

3.1.2. Line/Word Shift

Line/Word shift-based techniques involve shifting lines vertically or words horizontally to hide the SMbits through the cover text file. In other words, these techniques evaluate the scanned images of the printed documents to extract or reveal the watermark. In practice, they are not applicable in digital texts because if someone copies the carrier text to a new host file, the extraction algorithm cannot discover the hidden information. From the criteria point of view, these techniques typically provide low embedding capacity, high invisibility, and low robustness against structural attacks [112–114].

3.1.3. Zero-Width

The zero-width-based techniques employ the ZWC Unicode characters to embed/mark the SMbits into the cover text. From the text processing point of view, the ZWCs have no text trace (written symbols) and can be embedded in different locations through the CM, but they can be processed by programming analysis of the CMHM. These approaches can be utilized in multilingual texts and various text processing platforms such as social media, email, SMS, etc. For example, a zero-width steganography technique called AITSteg was proposed in [1], which utilizes the ZWCs to embed a long SMbits in front of a short CM. Since the ZWCs have invisible text traces through the CM, they can be embedded using the maximum number of letters in the channel (e.g., SMS, Facebook, etc.). In practice, the zero-width-based approaches provide high invisibility, high embedding capacity and higher robustness against structural attacks [1,4,25–28,33,55,56,91,115].
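The general idea of zero-width embedding can be sketched in a few lines. The code below is a generic illustration and not the AITSteg algorithm of [1]: the mapping of bit 0 to U+200C and bit 1 to U+200D, the absence of encryption, and the function names are all assumptions. It places the hidden symbols in front of the cover message, as described above.

```python
ZWNJ = "\u200c"  # assumed to encode bit 0
ZWJ = "\u200d"   # assumed to encode bit 1


def embed(cover, secret):
    """Prefix the cover message with zero-width characters encoding the secret."""
    bits = "".join(f"{byte:08b}" for byte in secret.encode("utf-8"))
    hidden = "".join(ZWJ if bit == "1" else ZWNJ for bit in bits)
    return hidden + cover


def extract(stego):
    """Recover the secret by reading back the zero-width characters."""
    bits = "".join("1" if ch == ZWJ else "0" for ch in stego if ch in (ZWNJ, ZWJ))
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


if __name__ == "__main__":
    stego = embed("Hello", "hi")
    print(len("Hello"), len(stego))  # the carrier is longer but looks the same
    print(extract(stego))
```

Each secret byte costs eight invisible characters in this naive mapping, so the payload is limited by the character limit of the channel rather than by the length of the visible cover text.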

3.1.4. Feature or Format

The feature/format-based methods involve modifying some features of the cover text, such as font size, style, color, etc., that could be altered to conceal secret bits [18,21,24]. For instance, the dotting feature of the Arabic texts can be used for marking the SMbits by displacing letter points and diacritics [116–119]. Since the structure of the Arabic language is similar to the Persian and Urdu languages, these languages use the same point letters. Several techniques have utilized point letters to mark/embed secret bits by displacing the position of a point slightly upward with respect to the standard point position through the CM [15,88,90,92]. In practice, these techniques provide high invisibility (except for color-based ones), higher embedding capacity, and low distortion robustness against structural attacks. Color-based algorithms are also vulnerable to visual attacks [111].

3.1.5. Emoticons or Emoji

Emoticon or emoji-based approaches utilize emoji symbols to embed the SMbits through the CM. These days, end users employ emoticons or emoji symbols in daily conversations instead of typing their feelings. Recently, several algorithms have been introduced that use emoticons as the cover to mark secret bits through the CM. For instance, the techniques presented in [8,120–122] generate a random text consisting of some words as a CM and convert the letters of the SM into emoticons based on a predefined pattern (e.g., A = “😣”, B = “😢”, C = “😃”, and so on). They then embed the produced emoticons between words through the CM. Although these approaches have high embedding capacity, they suffer from visible transparency (low invisibility) and low distortion robustness against visual attacks.
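The letter-to-emoticon substitution described above can be illustrated with a minimal sketch. Only the three pairs A, B, and C come from the example in the text; the function names `embed` and `extract`, the one-emoji-per-word placement, and the sample cover words are assumptions made for illustration and do not reproduce any specific method in [8,120–122].

```python
# Hypothetical letter-to-emoji pattern; only A, B and C follow the example in the text.
PATTERN = {"A": "😣", "B": "😢", "C": "😃"}
REVERSE = {emoji: letter for letter, emoji in PATTERN.items()}


def embed(secret: str, cover_words: list) -> str:
    """Place one emoji per secret letter after successive cover words."""
    if len(secret) > len(cover_words):
        raise ValueError("cover text has too few words for this secret")
    out = []
    for i, word in enumerate(cover_words):
        out.append(word)
        if i < len(secret):
            out.append(PATTERN[secret[i]])  # KeyError for letters outside the pattern
    return " ".join(out)


def extract(stego: str) -> str:
    """Recover the secret by reading the emoji tokens in order."""
    return "".join(REVERSE[tok] for tok in stego.split() if tok in REVERSE)
```

Because every secret letter becomes a visible symbol, the sketch also makes the low invisibility of this category apparent: the hidden message is plainly present in the stego text, only its meaning is obscured.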

3.2. Linguistic Techniques

Linguistic or natural language processing-based algorithms alter the syntactic and semantic characteristics of the text content. A text typically consists of several words, sentences, verbs, nouns, adverbs, adjectives, and so on. Several linguistic-based approaches have used characteristics such as synonyms, abbreviations, the similarity of words, and so on, to embed secret bits into a CM [17,62,70,71,80–85,106,109]. In general, we can classify the linguistic-based approaches into two types: syntactic and semantic.

3.2.1. Semantic

Semantic methods work based on specific language characteristics by modifying the semantic attributes of the CM to mark/embed the SMbits. These attributes include the spelling of words, abbreviations, synonyms, acronyms, and so on [62,70,71,75,82,84]. The advantage of the semantic-based methods is that they protect the HM against retyping attacks or the use of OCR software [111]. Moreover, these methods provide low embedding capacity, high invisibility, and high robustness against structural attacks, but they modify the original meaning of the CM.

3.2.2. Syntactic

Syntactic approaches involve modifying the CM without significantly changing the meaning or tone of the text content. Different languages have syntactical compositions in their text structures, which are specified by the language and its specific conventions [3,20,81–83]. For instance, the method presented in [123] utilizes the similar forms of the word “La” in Arabic/Persian text. In this approach, the primary form of “La” (“ﻟـﺎ”) is employed for hiding a bit “0,” and the specific form of the word “La” (“ﻻ”) is employed for concealing a bit “1” through the CM. In practice, the syntactic-based techniques have low embedding capacity, high invisibility, and high robustness against structural attacks. They are also vulnerable to visual attacks.
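To make the synonym-based (semantic) embedding of Section 3.2.1 concrete, the following is a minimal sketch in which the choice between two synonyms carries one bit. The synonym table, the function names, and the convention that the receiver knows how many embeddable words carry bits are all assumptions for illustration; real schemes such as those in [62,70,71] rely on much larger dictionaries and on context-aware substitution.

```python
# Hypothetical synonym pairs: position 0 encodes bit "0", position 1 encodes bit "1".
SYNONYMS = [("love", "like"), ("big", "large"), ("quick", "fast")]
LOOKUP = {w: (pair, bit) for pair in SYNONYMS for bit, w in enumerate(pair)}


def embed(cover: str, bits: str) -> str:
    """Rewrite each embeddable word so that its synonym choice carries one bit."""
    remaining = list(bits)
    out = []
    for word in cover.split():
        entry = LOOKUP.get(word.lower())
        if entry and remaining:
            pair, _ = entry
            out.append(pair[int(remaining.pop(0))])
        else:
            out.append(word)
    if remaining:
        raise ValueError("cover text has too few embeddable words")
    return " ".join(out)


def extract(stego: str) -> str:
    """Read one bit from every word that appears in the synonym table."""
    return "".join(
        str(LOOKUP[w.lower()][1]) for w in stego.split() if w.lower() in LOOKUP
    )
```

With this table, a cover sentence with two embeddable words carries two bits, which mirrors the low embedding capacity (about one bit per synonym) noted for linguistic methods, and the substitution of “love” by “like” shows how the meaning of the CM can shift.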

3.3. Random and Statistics Techniques

The random and statistics generation algorithms employ the statistical features of the SM to generate the CM automatically. In other words, these techniques do not require an existing CM; instead, they utilize the structures and properties of a particular language (e.g., the past-tense form of a verb, how sentences are constructed, etc.) [21,23,24,29,34,35,39,47,51,124]. In general, these methods have higher computational complexity, consuming more time and space to generate a CM.

3.3.1. Compression

The compression-based methods utilize a lossless compression algorithm such as Huffman coding, Lempel–Ziv–Welch (LZW), arithmetic coding, etc. to hide the SMbits in the CM [21,24,34,35,39]. For example, an LZW compression-based steganography algorithm presented in [39] embeds the SMbits in e-mail addresses. This method considers the statistical distance for each letter of the SM such that a dependent ‘distance’ of the same letter in the cover text is computed. Therefore, a ‘distance vector’ is derived for the SM and a ‘distance matrix’ is produced for each CM. The text that gives the highest frequency of the distance values is finally selected from the text base as the CM as well as the stego key. Moreover, the LZW code is computed for this distance matrix, and the produced bits are divided into blocks of 12 bits, each split into a 9-bit and a 3-bit segment. These segments are employed to choose the domain name and the user-name from the available options to make a valid e-mail address. In practice, the compression-based algorithms require high computational complexity, and they are not efficient for hiding the SM in short cover texts. However, they provide high invisibility, optimum capacity, and low robustness against structural attacks.
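The last step of the scheme above, turning 12-bit blocks into e-mail addresses, can be sketched as follows. This is only an illustration of the block split: the LZW coding and the distance matrix of [39] are not implemented, the option lists are invented, and the assignment of the 9-bit segment to the user-name and the 3-bit segment to the domain name is an assumption, since the text does not state which segment selects which part.

```python
# Hypothetical option lists: 2**9 user-names and 2**3 domain names.
USER_NAMES = [f"user{i:03d}" for i in range(512)]
DOMAINS = ["a.example", "b.example", "c.example", "d.example",
           "e.example", "f.example", "g.example", "h.example"]


def block_to_address(block: str) -> str:
    """Map one 12-bit block to an address (assumed: 9 bits -> user-name, 3 bits -> domain)."""
    if len(block) != 12 or set(block) - {"0", "1"}:
        raise ValueError("expected a 12-bit binary string")
    user = USER_NAMES[int(block[:9], 2)]
    domain = DOMAINS[int(block[9:], 2)]
    return f"{user}@{domain}"


def address_to_block(address: str) -> str:
    """Invert block_to_address by looking up both indices."""
    user, domain = address.split("@")
    return f"{USER_NAMES.index(user):09b}{DOMAINS.index(domain):03b}"


def bits_to_addresses(bits: str) -> list:
    """Zero-pad the bit string to a multiple of 12 and convert every block."""
    padded = bits + "0" * (-len(bits) % 12)
    return [block_to_address(padded[i:i + 12]) for i in range(0, len(padded), 12)]
```

Each address carries exactly 12 bits regardless of its length, which illustrates why such schemes look innocuous but need a list of many addresses for even a short SM.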

3.3.2. Random Cover

The random cover-based techniques work by generating a cover according to the SM letters. Initially, the Emb() must generate a CM based on the SM letters, and then embed/mark the SMbits inside the CM [23,47,51,124]. For instance, a random cover generation technique called AH4S, introduced in [51], employs the structure of the omega network to conceal the SMbits in a generated CM. This method picks a character from the SM and utilizes the omega network to generate two related letters based on the picked character. It then searches a predefined dictionary for an appropriate English cover word to hide the two generated characters, and repeats the same process for all characters of the SM. This approach generates a long unknown text for a short SM, which raises the suspicion of readers/attackers. Practically, the random cover-based techniques provide perceptible transparency (low invisibility), low capacity, and high robustness. Moreover, they have high computational complexity for generating the CM during the embedding/extraction process.

3.4. An Empirical Comparison

To demonstrate the variations between the types of text hiding techniques, we summarized an example embedding method for each category, as depicted in Figure 5. Let us assume that the Emb() of each approach hides an SM (or SMbits) through the CM and that each one produces a CMHM that differs from the others. We can then observe some pros & cons for each category, as listed in Table 3. We rated each type empirically based on the following criteria: invisibility (Imperceptible, Perceptible), EC (Low, Modest, and High), and DR (Low, Medium, and High).


Figure 5. An empirical comparison between linguistic, structural, and random & statistics algorithms. Panels: Linguistic (Topkara et al., 2006 [71]); Structural (Taleby Ahvanooey et al., 2018 [1]); Random & Statistics (Hamdan and Hamarsheh, 2016 [51]).

Table 3. Highlighted pros & cons of various types of text hiding techniques concerning criteria.

Type Name: Linguistic [17,62,70,71,80–85,106,109]
Invisibility: Imperceptible. EC: Low. DR: Medium. Language Coverage: Exclusive.
Pros & Cons:
Having high complexity due to using an additional dictionary to replace the words/characters in the CM.
Altering the meaning of the original CM after embedding an SM.
Depending on an exclusive language (e.g., English, Persian/Arabic, etc.).
Providing high invisibility, low EC (e.g., 1 bit per synonym), and medium robustness against visual attacks.

Type Name: Structural [1–8,11,20,34,41,54,65,66,100,112–114]
Invisibility: Imperceptible. EC: High. DR: High. Language Coverage: Multilingual.
Pros & Cons:
Having no perceptible changes on the original CM after embedding an SM.
Increasing the length of the CM by embedding additional Unicode invisible symbols.
Depending on the encoding features of the CM (e.g., not the CM content, or language).
Providing high invisibility (except color-based methods), higher EC (e.g., n-bit per location), and high robustness against structural and visual attacks.

Type Name: Random & Statistics [21,23,24,29,34,35,39,47,51,124]
Invisibility: Perceptible. EC: Modest. DR: High. Language Coverage: Exclusive.
Pros & Cons:
Having high complexity due to employing an extra compression algorithm to encode the SMbits.
High robustness against visual attacks.
Depending on the language of the CM.
Providing perceptible transparency (low invisibility), modest EC, and high robustness against visual attacks.

As listed in Table 4, we summarized some highlights and limitations for each category separately by considering their characteristics and their applications.

Table 4. Highlights & Limitations of various types of text hiding techniques.

Type: Linguistic
Unauthorized Access Detection: ×
Highlights and Limitations: The linguistic-based methods are not applicable to unauthorized access detection due to altering the original meaning of the CM during the embedding of an SM. For employing in covert channels, they need a long CM, and can only be used in a CM with exclusive language. For utilizing in hidden transmission, they are not enforceable in limited communication channels.

Type: Structural
Highlights and Limitations: The structural-based approaches can provide all three applications. For utilizing in hidden transmission, they are not applicable in limited communication channels. Due to employing language-independent features of the CM to embed the SM, these methods could be used in multilingual texts.

Type: Random & Statistics
Unauthorized Access Detection: ×
Highlights and Limitations: The random cover-based algorithms are not applicable to unauthorized access detection due to generating an unknown CM. For applying in hidden transmission, the generated CM raises suspicions for attackers. Due to generating a CM based on the SM, these approaches could only be applied to secure an SM with exclusive language.

4. Efficiency Analysis of Recent Structural Techniques

During the last decade, many structural-based text hiding algorithms have been introduced, whereas only a few methods have been proposed in the linguistic-based and random and statistics-based categories. There are two likely reasons for this. The main one is that limitations such as low EC, altering the meaning of the CM, generating an unknown CM, etc. make these two categories inefficient for some applications. The second is that both rely on features of the language of the CM/SM to hide the SM, which requires additional resources such as a predefined dictionary, dataset, etc. In what follows, we summarize the recent structural-based techniques that can be applied in multilingual texts and various applications.

Por et al. [7] proposed a text-based data hiding technique called UniSpaCh, which generates a binary string of the SM and divides it into 2-bit groups (i.e., “10, 01, 00, and 11”). It then substitutes each 2-bit group with a special space (e.g., Thin, Hair, Six-Per-Em, and Punctuation). Finally, it embeds the additional spaces into predefined locations (inter-word, inter-sentence, end-of-line, and inter-paragraph) in the MS Word file. This technique gives high invisibility and high robustness against structural and visual attacks, but it has a low EC rate (two bits per space) and is not applicable for embedding a long SMbits into a short CM.

Odeh et al. [33] suggested a novel text steganography algorithm called ZW_4B, which uses ZWCs to hide the SMbits inside an MS Word file. As depicted in Table 5, this algorithm employs four ZWCs to mark four bits of the SMbits between letters in the CM file: the particular combination of ZWCs inserted after a letter in the CM represents one 4-bit code (e.g., “0000”, “0001”, and so on). In practice, this technique provides high invisibility and higher embedding capacity, and it can be applied in multilingual texts. However, it suffers from low robustness since the only embeddable location is between letters. Nevertheless, this method can preserve the embedded bits against structural attacks.
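The 2-bit substitution used by UniSpaCh [7], described earlier in this section, can be sketched as follows. This is a simplified illustration rather than the published algorithm: the pairing of each 2-bit group with a particular Unicode space is assumed, only the inter-word location is used, and the function names are hypothetical.

```python
# Assumed assignment of 2-bit groups to Unicode spaces (the pairing is illustrative).
SPACE_FOR = {
    "00": "\u2009",  # THIN SPACE
    "01": "\u200A",  # HAIR SPACE
    "10": "\u2006",  # SIX-PER-EM SPACE
    "11": "\u2008",  # PUNCTUATION SPACE
}
BITS_FOR = {space: bits for bits, space in SPACE_FOR.items()}


def embed(cover: str, bits: str) -> str:
    """Append one special space after each normal inter-word space, two bits at a time."""
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    groups = [bits[i:i + 2] for i in range(0, len(bits), 2)]
    words = cover.split(" ")
    if len(groups) > len(words) - 1:
        raise ValueError("not enough inter-word gaps in the cover text")
    out = [words[0]]
    for i, word in enumerate(words[1:]):
        gap = " " + (SPACE_FOR[groups[i]] if i < len(groups) else "")
        out.append(gap + word)
    return "".join(out)


def extract(stego: str) -> str:
    """Collect the 2-bit groups encoded by the special spaces, in order."""
    return "".join(BITS_FOR[ch] for ch in stego if ch in BITS_FOR)
```

In this simplified form a cover of n words can hold at most 2(n − 1) bits, which illustrates why a two-bits-per-space rate is too low for hiding a long SMbits in a short CM.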

Table 5. Sample of Hidden Bits by using Word Symbols in [33].

Right to Left Mark    Left to Right Mark    ZWJ    ZWNJ    SMbits
×                     ×                     ×      ×       0000
×                     ×                     ×      -       0001
×                     ×                     -      ×       0010
×                     ×                     -      -       0011
...

Naqvi et al. [29] presented a multi-layer text steganography scheme called MHST using homomorphic encryption, which replaces the characters of the SM with the letters of the CM to hide it. In the experimental results, the authors claimed that this algorithm provides high embedding capacity, imperceptible transparency, and high robustness against structural attacks, but it is vulnerable to visual or MBR attacks; i.e., if an attacker manipulates a portion of the CMHM, the extraction of the SM might fail because some characters of the SM may have been removed from the CM.

Odeh and Elleithy [90] introduced a text steganography method called ZWBSP that embeds the SMbits by adding a ZWC (U+200B) beside the normal space (U+0020) between words in the MS Word file. This algorithm places the ZWC before or after the standard space between words based on a predefined pattern, as outlined in Table 6. In practice, this method gives high invisibility, low EC, and medium robustness. Moreover, it is applicable in different languages, and protects the embedded SMbits against structural and visual attacks.

Table 6. Predefined pattern of embedding location in [90].

2-Bit    Embeddable Location
‘00’     No ‘ZWC’ + “U+0020”
‘01’     “U+0020” + No ‘ZWC’
‘10’     “U+200B” + “U+0020”
‘11’     “U+0020” + “U+200B”

Rizzo et al. [5] provided a text watermarking approach called TWSM, which can embed a password-based watermark in a Latin-based CM. This approach utilizes homoglyph Unicode characters and special spaces for marking the watermark/SMbits in the CM. The researchers claimed that this approach could conceal a 64-bit watermark in a short CM of only 46 letters and that it provides high invisibility and high capacity. However, it is vulnerable to structural attacks (e.g., modifying the font type of the CMHM causes the SMbits to be lost) and visual attacks. Due to its use of homoglyph characters, this method can only be applied to Latin-based cover texts. Later on, Rizzo et al. [6] used the same algorithm [5] to mark/embed a watermark in social media platforms.

In [58], Alotaibi and Elrefaei proposed two watermarking techniques based on modifying the cover text using ZWCs and Unicode spaces. In the first algorithm, the dotting attribute of the Arabic language applied in [15] is utilized to enhance the capacity of the previous work. Moreover, the ZWNJ is embedded before and after the normal space depending on whether the letter is pointed or unpointed. In the second algorithm, herein called 4-SpaCh, four Unicode characters (ZWNJ, Thin, Hair, and ZW) are added next to the normal space. Every four bits of the SMbits are marked/embedded by the corresponding Unicode characters in order: the 1st bit is denoted by the ZWNJ, the 2nd bit by the Thin space, the 3rd bit by the Hair space, and the 4th bit by the ZW space. Hence, if the algorithm embeds all four spaces, then it represents a ‘1’, otherwise a ‘0’. In practice, the second algorithm can be utilized for embedding in multilingual texts because it employs Unicode characters to mark the SMbits in the CMHM. This technique has higher EC, high imperceptibility, and low DR against visual attacks; i.e., if an attacker manipulates a portion of the CMHM (consisting of some spaces), extraction by the corresponding Ext() fails for the whole of the SM.

Shu et al. [11] presented a text steganography algorithm called WS_EL, which employs a combination of white-space and extended-line embedding to provide secure communication on social media [23]. This approach generates a binary SM string and embeds an additional white space between words, at the end of a line, and at the end of a paragraph to mark the SMbits. In the experimental results, the authors claimed that this approach gives optimum EC and high invisibility, but it has low DR against visual attacks.

Taleby Ahvanooey et al. [1] proposed an innovative text steganography algorithm called AITSteg, which can hide a long SM through a short CM for sending via social media. This method generates an SM binary string by the “Gödel” function and encodes the SMbits by a dynamic random key generation algorithm. It then converts the encoded SMbits to ZWCs based on a predefined pattern, as outlined in Table 7, and embeds them in front of the CM. In this work, the authors evaluated AITSteg on fifteen social media (or messenger apps), and pointed out that only two of them, Twitter and Telegram, do not support the employed ZWCs. From the experimental results, it can be concluded that AITSteg provides high invisibility, high EC, and high DR against visual and structural attacks.

Table 7. Unicode ZWCs 2-bit classification pattern in [1].

2-Bit Classification    Hex Code
00                      U+200C
01                      U+202C
10                      U+202D
11                      U+200E
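The 2-bit pattern of Table 7 can be exercised with a short sketch. Only the mapping of bit pairs to code points is taken from Table 7; the “Gödel” function and the dynamic random key generation of AITSteg [1] are not reproduced here, and plain UTF-8 bytes are used in their place as a loudly stated simplification. The function names are hypothetical.

```python
# 2-bit classification pattern from Table 7.
ZWC_FOR = {"00": "\u200C", "01": "\u202C", "10": "\u202D", "11": "\u200E"}
BITS_FOR = {zwc: bits for bits, zwc in ZWC_FOR.items()}


def embed(cover: str, secret: str) -> str:
    """Prefix the cover with ZWCs encoding the secret (assumed: UTF-8, no key layer)."""
    bits = "".join(f"{byte:08b}" for byte in secret.encode("utf-8"))
    hidden = "".join(ZWC_FOR[bits[i:i + 2]] for i in range(0, len(bits), 2))
    return hidden + cover


def extract(stego: str) -> str:
    """Rebuild the secret from the ZWCs found in the stego text."""
    bits = "".join(BITS_FOR[ch] for ch in stego if ch in BITS_FOR)
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")
```

Each hidden byte costs four invisible characters placed before the visible text, so the visible CM stays unchanged while the message length grows; the sketch assumes the cover itself contains none of the four code points.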

Kumar et al. [34] suggested a text steganography scheme called 4&3SpaCh, which extends UniSpaCh [7] by employing the Unicode characters more efficiently. This scheme conceals the SMbits in the MS Word file by considering the embeddable locations, including inter-sentence, inter-word, end-of-line, and inter-paragraph spaces. As listed in Tables 8 and 9, the authors utilized two different patterns to mark the SMbits through the CM. This scheme provides high imperceptibility, higher EC than UniSpaCh, and high DR against structural attacks. However, it generates some unconventional gaps between words in the CMHM, which increases its exposure to visual attacks.

Table 8. Mapping Pattern of SMbits for marking the inter-word and inter-sentence locations in [34].

Spaces Pattern                      4-bit Classification
Normal Space                        0000
Normal Space + Three-Per-Em         0001
Three-Per-Em + Normal Space         0010
Normal Space + Four-Per-Em          0011
Four-Per-Em + Normal Space          0100
Normal Space + Six-Per-Em           0101
Six-Per-Em + Normal Space           0110
Normal Space + Figure               0111
Figure + Normal Space               1000
Normal Space + Thin                 1001
Thin + Normal Space                 1010
Normal Space + Hair                 1011
Hair + Normal Space                 1100
Normal Space + Punctuation          1101
Punctuation + Normal Space          1110
Normal Space + Narrow No-Break      1111
Narrow No-Break + Normal Space      1111

Table 9. Mapping Pattern of SMbits for marking the inter-paragraph and end of line locations in [34].

Spaces Pattern             3-bit Classification
Three-Per-Em Space         000
Four-Per-Em Space          001
Six-Per-Em Space           010
Figure Space               011
Punctuation Space          100
Thin Space                 101
Hair Space                 110
Narrow No-Break Space      111

Patiburn et al. [13] developed an emoticons-based text steganography scheme called EM_ST which generates a random text consisting of some words as a CM. Moreover, it converts all the SM characters into emoticons based on a particular pattern (e.g., A = “😣”, B = “😢”, C = “😃”, and so on) and then embeds the emoticons between words through the CM. Practically, this scheme presents high EC and visible transparency (low invisibility), and it suffers from low DR against visual attacks.

To demonstrate the embedding trace and invisibility of the explained algorithms, we implemented them on some cover text examples. Herein, the implementation means the evaluation of selected algorithms based on their corresponding Emb()/Ext() approaches. To ensure a fair comparison between existing structural algorithms, we considered those which could be applied in multilingual cover texts. Let us suppose that we wish to hide an SM = Ab = SMbits “01000010 + 01100010”; then, after implementing the aforementioned approaches on highlighted cover text examples, the embedding trace of each method is highlighted as depicted in Table 10. To show the trace of spaces (width or length) in the CMHM, we have highlighted them, but they are transparent in practice.

To evaluate the efficiency of the selected techniques, we implemented them on a simulated dataset. This dataset is generated by randomly copying some proverbs from referenced websites, as outlined in Tables 11 and 12.

3.2.2.3.2.2.3.2.2. Syntactic Syntactic Syntactic SyntacticSyntacticSyntactic approaches approaches approaches involve involve involve modifying modifying modifying the the the CM CM CM without without without significantly significantly significantly changing changing changing the the the meaning meaning meaning or or or tonetonetone of of of the the the text text text content. content. content. In In In different different different languages, languages, languages, th th thereereere are are are some some some syntactical syntactical syntactical compositions compositions compositions in in in their their their text text text structures,structures,structures, which which which are are are specified specified specified by by by the the the language language language and and and its its its specific specific specific conventions conventions conventions [3,20,81 [3,20,81 [3,20,81––83].–83].83]. For For For instance, instance, instance, aa methoda method method presented presented presented in in in [123], [123], [123], which which which utilizes utilizes utilizes the the the similarity similarity similarity of of of La La La word word word in in in the the the Arabic/Persian Arabic/Persian Arabic/Persian text. text. text. In In In

Entropy 2018, 20, x FOR PEER REVIEW 19 of 29

Table 9. Mapping Pattern of SMbits for marking the inter-paragraph and end of line locations in [34].

Spaces Pattern 3-bit Classification Entropy 2018, 20, x FOR PEER REVIEWThree -Per-Em Space 000 19 of 30 Four-Per-Em Space 001

Table 9. Mapping Pattern of SMSixbits- forPer marking-Em Space the inter-paragraph010 and end of line locations in [34]. Figure Space 011 Spaces Pattern 3-bit Classification Punctuation Space 100 Three-Per-Em Space 000 FourThin-Per Space-Em Space 101001 SixHair-Per Space-Em Space 110010 Narrow NoFigure-Break Space Space 111011 Punctuation Space 100 Patiburn et al. in [13] developed Thinan emoticons Space -based text101 steganography scheme called EM_ST Hair Space 110 which generates a random text consisting of some words as a CM. Moreover, it converts all the SM Narrow No-Break Space 111 characters into emoticons based on a particular pattern (e.g., A=“ ”, B=“ ”, C=“ ”, and so on.) and, thus embedsPatiburn theet al. emoticons in [13] developed between an wordsemoticons through-based thetext CMsteganography. Practically, scheme this schemecalled E M_presentsST high EC,which and generates visible transparency a random text (low consisting invisibility), of some and words it suffers as a CM from. Moreover, low DR it againstconverts visual all the attacks. SM Tocharacters demonstrate into emoticons the embedding based on a tracparticulare and pattern invisibility (e.g., A= of“ the”, B= explai“ ”, C=ned“ ”, algorithms, and so on.) we implementedand, thus them embeds on thesome emoticons cover text between examples words. Herein, through thethe implementationCM. Practically, this means scheme the presents evaluation high EC, and visible transparency (low invisibility), and it suffers from low DR against visual attacks. Entropy 2019, 21, 355 of selected algorithms based on their corresponding Emb()/Ext() approaches. 20 of 31 To demonstrate the embedding trace and invisibility of the explained algorithms, we implemented them on some cover text examples. Herein, the implementation means the evaluation Table 10. Implementation of selected structural approaches on the highlight examples. of selectedTable algorithms 10. 
Implementation based on their of selected corresponding structural Emb()/Ext() approaches approaches on the highlight. examples. Embedded AlgorithmAlgorithm Table CM10. ImplementationCM of selected structural approaches on the CM highlightHM examples. Embedded SM HM SMbits bits AITSteg [1]AITSteg [1]The onlyThe sourceonly source of knowledge of knowledge is experience. is experience. The only source of knowledge is experience. Embedded12 12 Algorithm CM CMHM ZW_4B [33ZW_4B] [33]The onlyThe only source source of knowledge of knowledge is experience. is experience. The only source of knowledge is experience. SMbits 16 16 MHST [29]MHST [29AITSteg] The only[1]The sourceonlyThe source only of knowledgesource of knowledge of knowledge is experience. is experience. is experience. TheThe only only source of of knowledge knowledge is experience.is experience. 12 0 0 ZWBSP [90ZWBSP] [ZW_4B90]The [33] onlyThe sourceonlyThe sourceonly of knowledgesource of knowledge of knowledge is experience. is experience.is experience. TThehe onlyonly source of of knowledge knowledge is experience.is experience. 16 12 12 TWSM [5,6TWSM] [5,6]MHSTThe [29 onlyThe] sourceonlyThe source only of knowledgesource of knowledge of knowledge is experience. is experience. is experience. TheThe only only source ofof knowledge knowledge is experience.is experience. 0 16 16 4-SpaCh [584-]SpaCh ZWBSP[58The] [ only90The] sourceonlyThe source only of knowledgesource of knowledge of knowledge is experience. is experience. is experience. TheThe onlyonly source of of knowledge knowledge is experience.is experience. 12 16 16 WS_EL [11WS_EL] [11]TWSMThe [5,6] onlyThe sourceonlyThe source only of knowledgesource of knowledge of knowledge is experience. is experience. is experience. TheThe only only source of of knowledge knowledge is isexperience. experience. 16 6 6 4-SpaCh [58] The only source of knowledge is experience. The only source of knowledge is experience. 
16 4&3SpaCh4&3SpaCh [34] [34]The onlyThe sourceonly source of knowledge of knowledge is experience. is experience. The only source of knowledge is experience. 16 16 WS_EL [11] The only source of knowledge is experience. The only source of knowledge is experience. 6 UniSpaChUniSpaCh [7] [7]The onlyThe sourceonly source of knowledge of knowledge is experience. is experience. The only source of knowledge is experience. 16 16 4&3SpaCh [34] The only source of knowledge is experience. The only source of knowledge is experience. 16 EM_ST [13] The only source of knowledge is experience. 16 EM_ST UniSpaCh[13] The[7] onlyThe source only source of knowledge of knowledge is experience. is experience. The The only only sourcesource of ofknowledge knowledge is experience. is experience. 16 16 EM_ST [13] The only source of knowledge is experience. The only source of knowledge is experience. 16 Table 11. Dataset: cover message examples. TableTable 11. 11. Dataset: Dataset: cover cover message message examples. . Name Text Content Reference Name Name TextText ContentContent ReferenceReference Science without religion is lame, religion without science https://www.brainyquote.com CM.1Science without religion is lame, religion without science is CM.1 CM.1blind. is blind. https://www.brainyquote.comhttps://www.brainyquote.com

君子之行，静以修身，俭以养德，非澹泊无以明志，非宁 https://www.ﬂuentu.com/ 君子之行，静以修身，俭以养德，非澹泊无以明志，非宁静无 CM.2CM.2 静无以致远。《诫子书》 https://www.fluentu.com/ CM.2 以致远。《诫子书》 https://www.fluentu.com/ Die größte Gefahr für die meisten von uns ist nicht, dass https://www.germanpod101.com CM.3CM.3Die größtewir hohe Gefahr Ziele füranstreben die meisten und sie vonverfehlen, uns ist sondern nicht, dassdass wirhttps://www.germanpod101.com CM.3 hohe Zielewir uns anstreben zu niedrige und setzen sie undverfehlen, sie erreichen. sondern dass wir uns https://www.germanpod101.com zu niedrige setzen und sie erreichen. http://www.bartarinha.ir/ جهان سوم جایی است که هر کس بخواهد مملکتش را آباد کند، خانه اش CM.4CM.4 http://www.bartarinha.ir/ جهان سوم خراب جامییی شود است و که هر هر کس کس بخواهد بخواهد خانه اش رامملکتش راآباد کندآباد باید کند، در ویخانه رانیاش خراب /http://www.bartarinha.ir مملکتش بکوشد. CM.4 می شود و هر کس بخواهد خانه اش را آباد کند باید در ویرانی مملکتش بکوشد. Chi vuol andar salvo per lo mondo, bisogna aver occhio di http://oaks.nvg.org/ CM.5Chi vuolfalcone, andar orecchio salvo d ’asin per o, lo viso mondo, di scimia, bisogna bocca di aver porcello, occhio di CM.5 http://oaks.nvg.org/ CM.5 falcone,spalle orecchio di camello, d’asin è o,gambe viso di cervo.scimia, bocca di porcello, spalle http://oaks.nvg.org/ di camello, è gambe di cervo.
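To make the 3-bit space mapping of Table 9 concrete, the following minimal sketch appends one Unicode space per 3-bit group to the end of a cover line, using CM.1 from Table 11 as the cover. It is an illustration under stated assumptions rather than the algorithm of [34]: the function names `embed_end_of_line` and `extract_end_of_line` and the sample bit string are hypothetical, and only the end-of-line location is handled.

```python
# Hypothetical sketch of a Table 9 style mapping (not the exact method of [34]).
# Each 3-bit group is represented by one Unicode space character.
SPACE_FOR_BITS = {
    "000": "\u2004",  # Three-Per-Em Space
    "001": "\u2005",  # Four-Per-Em Space
    "010": "\u2006",  # Six-Per-Em Space
    "011": "\u2007",  # Figure Space
    "100": "\u2008",  # Punctuation Space
    "101": "\u2009",  # Thin Space
    "110": "\u200a",  # Hair Space
    "111": "\u202f",  # Narrow No-Break Space
}
BITS_FOR_SPACE = {space: bits for bits, space in SPACE_FOR_BITS.items()}


def embed_end_of_line(cover_line: str, bits: str) -> str:
    """Append one mapped space per 3-bit group to the end of the cover line."""
    if len(bits) % 3 != 0:
        raise ValueError("bit string length must be a multiple of 3")
    groups = [bits[i:i + 3] for i in range(0, len(bits), 3)]
    return cover_line + "".join(SPACE_FOR_BITS[group] for group in groups)


def extract_end_of_line(stego_line: str) -> str:
    """Read the trailing run of mapped spaces back into a bit string."""
    groups = []
    for ch in reversed(stego_line):
        if ch not in BITS_FOR_SPACE:
            break  # reached the visible end of the cover text
        groups.append(BITS_FOR_SPACE[ch])
    return "".join(reversed(groups))


cover = "Science without religion is lame, religion without science is blind."
secret_bits = "101110000"  # arbitrary 9-bit example payload
stego = embed_end_of_line(cover, secret_bits)
recovered = extract_end_of_line(stego)
```

Because the appended spaces follow the last visible character, the line looks unchanged in most renderers, while any tool that trims trailing whitespace would remove the payload.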


Table 12. The detailed structures of sample cover texts.

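| Cover Name | Characters | Spaces | Words | Sentences | Lines | Language |
|---|---|---|---|---|---|---|
| CM.1 | 68 | 9 | 10 | 1 | 2 | English |
| CM.2 | 36 | 0 | 36 | 1 | 2 | Chinese |
| CM.3 | 160 | 27 | 28 | 1 | 4 | German |
| CM.4 | 137 | 30 | 31 | 1 | 3 | Persian |
| CM.5 | 156 | 26 | 27 | 1 | 4 | Italian |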

Let us assume that we wish to hide an SM = “original” (64 bits) in the sample cover messages depicted in Table 11. To evaluate the invisibility rate of the selected algorithms, we analyzed them using Equation (2), considering the differences between the CM and the CMHM for each method; the obtained results are listed in Table 13.

Table 13. Invisibility (%) Analysis of evaluated methods using Jaro Distance based on the examples.

| Algorithm | CM.1 | CM.2 | CM.3 | CM.4 | CM.5 | Average Invisibility (%) |
|---|---|---|---|---|---|---|
| AITSteg [1] | 89.3 | 84.3 | 94.4 | 89.3 | 95.1 | ≅90 |
| UniSpaCh [7] | 83.8 | 0 | 80.8 | 79.9 | 80.4 | ≅81 |
| ZW_4B [33] | 62.5 | 47.2 | 94.0 | 0 | 93.4 | ≅74 |
| MHST [29] | 100 | 0 | 100 | 100 | 100 | ≅100 |
| ZWBSP [90] | 96.1 | 0 | 95.1 | 80.1 | 95 | ≅92 |
| TWSM [5,6] | 85.7 | 0 | 81.8 | 79.3 | 80.7 | ≅82 |
| 4-SpaCh [58] | 82.9 | 0 | 84 | 84.1 | 96.5 | ≅87 |
| WS_EL [11] | 83.4 | 0 | 81.1 | 80.3 | 80.6 | ≅81 |
| 4&3SpaCh [34] | 84.9 | 0 | 87 | 87.5 | 84.6 | ≅86 |
| EM_ST [13] | 83.2 | 0 | 81.1 | 80.1 | 80.1 | ≅81 |
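Table 13 reports invisibility as a Jaro-based score between the CM and the CMHM. Equation (2) is defined earlier in the paper, so the sketch below only assumes the standard Jaro similarity scaled to a percentage; the function names and the thin-space embedding trace are hypothetical and are not taken from any of the evaluated methods.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Two equal characters count as a match only within this distance.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        stop = min(len(s2), i + window + 1)
        for j in range(start, stop):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    j = 0
    for i in range(len(s1)):
        if used1[i]:
            while not used2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    transpositions = out_of_order // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def invisibility_percent(cm: str, cm_hm: str) -> float:
    """Assumed scaling: Jaro similarity between CM and CMHM times 100."""
    return 100.0 * jaro_similarity(cm, cm_hm)


cm = "The only source of knowledge is experience."
# Hypothetical embedding trace: every normal space replaced by a thin space.
cm_hm = cm.replace(" ", "\u2009")
score = invisibility_percent(cm, cm_hm)
```

Under this assumption, an untouched cover scores 100, and every character that the embedding replaces or inserts lowers the score.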

Since the majority of the selected approaches embed the SMbits into the CM based on bit-level marking (except MHST [29] and EM_ST [13]), we normalize the EC of each approach by considering an 8-bit binary code for each character of the SM. Moreover, we evaluate the embedding capacity of the selected algorithms based on the number of embeddable locations required to embed the SM in the CM. Table 14 summarizes the EC rates offered by the evaluated approaches after analyzing them on the highlight samples (e.g., SM and CM). Assuming that a malicious user tampers with a word or a letter of the CMHM, can the SMbits still be extracted from the CM’HM by the extraction algorithm? To answer this question, we evaluated the approximate DR rate of each approach based on the embedding locations and the cover messages in Table 12 using Equation (4) separately. The DR results are listed in Table 15, and Figure 6 illustrates the average invisibility, EC, and DR of the evaluated techniques.

Table 14. EC (Bit & %) results of structural approaches on the highlight samples.

| Algorithm | Type of Embedding | CM.1 | CM.2 | CM.3 | CM.4 | CM.5 | Average EC/64 (%) |
|---|---|---|---|---|---|---|---|
| AITSteg [1] | Bit-level | 64 | 64 | 64 | 64 | 64 | ≅64 => 100 |
| UniSpaCh [7] | Bit-level | 22 | 4 | 62 | 64 | 60 | ≅42 => 65 |
| ZW_4B [33] | Bit-level | 64 | 64 | 64 | 0 | 64 | ≅51 => 80 |
| MHST [29] | Character-level | 8 × 8 = 64 | 0 | 8 × 8 = 64 | 0 | 8 × 8 = 64 | ≅64 => 100 |
| ZWBSP [90] | Bit-level | 18 | 0 | 56 | 60 | 52 | ≅46 => 72 |
| TWSM [5,6] | Bit-level | 47 | 0 | 64 | 64 | 64 | ≅60 => 93 |
| 4-SpaCh [58] | Bit-level | 36 | 0 | 64 | 64 | 64 | ≅57 => 89 |
| WS_EL [11] | Bit-level | 11 | 2 | 31 | 33 | 31 | ≅22 => 33 |
| 4&3SpaCh [34] | Bit-level | 45 | 9 | 64 | 64 | 64 | ≅59 => 92 |
| EM_ST [13] | Character-level | 8 × 8 = 64 | 0 | 8 × 8 = 64 | 8 × 8 = 64 | 8 × 8 = 64 | ≅64 => 100 |
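The 8-bit normalization behind Table 14 can be illustrated with a short sketch. It assumes single-byte characters in the SM and expresses EC as the share of the 64 message bits that a cover could carry; `message_to_bits` and `capacity_percent` are illustrative names, and the sketch does not reproduce how the averages in the table were rounded.

```python
def message_to_bits(message: str) -> str:
    """Encode each character of the SM as 8 bits (single-byte characters assumed)."""
    return "".join(format(byte, "08b") for byte in message.encode("latin-1"))


def capacity_percent(embedded_bits: int, message_bits: int) -> float:
    """Share of the secret message bits that were embedded, in percent."""
    return 100.0 * embedded_bits / message_bits


secret_bits = message_to_bits("original")  # 8 characters -> 64 bits
full_capacity = capacity_percent(64, len(secret_bits))
half_capacity = capacity_percent(32, len(secret_bits))
```

A character-level method that hides all eight characters corresponds to 8 × 8 = 64 bits under the same normalization, which is how the character-level rows of Table 14 are expressed.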

Table 15. Approximate DR (%) results of evaluated approaches on the highlight samples.

| Algorithm | CM.1 | CM.2 | CM.3 | CM.4 | CM.5 | Average DR (%) |
|---|---|---|---|---|---|---|
| AITSteg [1] | 98.5 | 97.2 | 99.3 | 99.2 | 99.3 | ≅99 |
| UniSpaCh [7] | 83.8 | 88.8 | 80.6 | 75.9 | 80.7 | ≅82 |
| ZW_4B [33] | 76.4 | 55.5 | 90 | 88.3 | 89.7 | ≅80 |
| MHST [29] | 88.2 | 0 | 95 | 0 | 94.8 | ≅93 |
| ZWBSP [90] | 86.7 | 0 | 83.1 | 78.1 | 83.3 | ≅83 |
| TWSM [5,6] | 57.3 | 0 | 66.8 | 78.1 | 51.9 | ≅64 |
| 4-SpaCh [58] | 86.7 | 0 | 83.1 | 78.1 | 83.3 | ≅83 |
| WS_EL [11] | 83.8 | 95 | 80.6 | 75.9 | 80.1 | ≅83 |
| 4&3SpaCh [34] | 82.3 | 91.6 | 80 | 75.1 | 80.1 | ≅82 |
| EM_ST [13] | 86.7 | 0 | 83.1 | 78.1 | 83.3 | ≅83 |

Figure 6. The overlap between the average invisibility, EC, and DR results (%).

Table 16 depicts a comparative analysis of the selected structural approaches in terms of criteria and language coverage, along with their limitations. To demonstrate the efficiency of the evaluated algorithms, we rated them according to their results for invisibility, EC, and DR: for example, invisible and visible for the invisibility; low, medium, and high for the EC; and low, modest, and high for the DR.


Table 16. Comparative analysis of structural approaches in terms of criteria and language coverage.

| Algorithm | EC | DR | Invisibility | Language Coverage | Limitations |
|---|---|---|---|---|---|
| AITSteg [1] | High | High | Imperceptible | Multilingual | Embeds additional ZWCs in front of the CM |
| UniSpaCh [7] | Low | Medium | Imperceptible | Multilingual | Depends on the spaces between words |
| ZW_4B [33] | Modest | Medium | Imperceptible | Exclusive (Latin) | Embeds four ZWCs after each letter |
| MHST [29] | High | High | Imperceptible | Exclusive (Latin) | Depends on using an exclusive language in the SM |
| ZWBSP [90] | Low | Medium | Imperceptible | Multilingual | Depends on the spaces between words |
| TWSM [5,6] | High | Low | Imperceptible | Exclusive (Latin) | Depends on the spaces and font style of the CM |
| 4-SpaCh [58] | Modest | Medium | Imperceptible | Multilingual | Depends on the spaces between words |
| WS_EL [11] | Low | Medium | Imperceptible | Multilingual | Embeds two spaces between words |
| 4&3SpaCh [34] | High | Medium | Imperceptible | Multilingual | Depends on the spaces between words |
| EM_ST [13] | High | Medium | Visible | Multilingual | Embeds additional emoticons between words |

In practice, all the approaches that work by modifying the spaces between words cannot be applied to Chinese texts, because this language has no spaces between words. To demonstrate the pros and cons, we considered four types of effective attacks for assessing their limitations: visual (tampering), structural (formatting), statistical (decoding), and retyping attacks. Let us suppose that a malicious user copies a portion (or all) of the CMHM, which includes the SMbits, into a new host text message/file and randomly modifies it according to the mentioned attacks. In this case, if even one bit or character of the SM is altered, the extraction of the SM by the corresponding Ext() fails. Table 17 depicts the evaluated results conducted on the CMHM examples.

Table 17. A comparison analysis of evaluated techniques against the stated attacks.

Having Robustness Against Attack (Visual, Structural, Statistical, Retyping): Yes (✓) and No (×)

| Algorithm | No (×) marks across the four attacks | Security Limitations |
|---|---|---|
| AITSteg [1] | × | Optimum safety (3) |
| UniSpaCh [7] | × | Optimum safety (3) |
| ZW_4B [33] | × × | Medium safety (2) |
| MHST [29] | × × | Medium safety (2) |
| ZWBSP [90] | × | Optimum safety (3) |
| TWSM [5,6] | × × × | Easy to lose (1) |
| 4-SpaCh [58] | × | Optimum safety (3) |
| WS_EL [11] | × | Optimum safety (3) |
| 4&3SpaCh [34] | × | Optimum safety (3) |
| EM_ST [13] | × × | Medium safety (2) |
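The attack scenario described above can be sketched with a toy zero-width-character (ZWC) scheme that hides bits in front of the cover text. This is a simplified illustration, not the Emb()/Ext() of any evaluated algorithm: the bit-to-character mapping, the function names, and the two attack models are assumptions made for the example.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, assumed to stand for bit 0
ZWJ = "\u200d"   # zero-width joiner, assumed to stand for bit 1


def emb(cover: str, bits: str) -> str:
    """Toy Emb(): place one zero-width character per bit in front of the cover."""
    return "".join(ZWJ if bit == "1" else ZWNJ for bit in bits) + cover


def ext(stego: str) -> str:
    """Toy Ext(): read the leading run of zero-width characters back into bits."""
    bits = []
    for ch in stego:
        if ch == ZWNJ:
            bits.append("0")
        elif ch == ZWJ:
            bits.append("1")
        else:
            break  # first visible character ends the hidden prefix
    return "".join(bits)


def visual_attack(stego: str) -> str:
    """Tamper with one visible word while leaving invisible characters alone."""
    return stego.replace("experience", "experiment", 1)


def retyping_attack(stego: str) -> str:
    """Keep only what a person retyping the visible text would reproduce."""
    return "".join(ch for ch in stego if ch not in (ZWNJ, ZWJ))


cover = "The only source of knowledge is experience."
secret_bits = "0110111101110010"  # 16-bit example payload
stego = emb(cover, secret_bits)
after_visual = ext(visual_attack(stego))
after_retyping = ext(retyping_attack(stego))
```

In this toy setting the payload survives a change to a visible word but is lost entirely when the text is retyped, which mirrors the kind of per-attack comparison summarized in Table 17.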

As shown in Table 17, almost all the evaluated algorithms have some limitations; however, some of them provide better safety than others. In practice, programmers must take into account the priority of the criteria (fragile or robust) and choose a proper approach, based on the security limitations, that gives more safety in the particular application.

5. Suggestions for Future Works

Text hiding is a flexible and potent technique that could be employed in different ways to keep sensitive information safe in various areas such as covert communication, copyright protection, authentication, etc. Although the efficiency of text hiding algorithms has drawn much attention from cybersecurity researchers, the field still lacks a precise analysis model that takes the fundamental criteria into account during the efficiency analysis. As we already explained, there are four evaluation criteria for efficiency analysis, which rely on the way of embedding. In other words, the embedding method generally specifies how to evaluate the efficiency of a particular algorithm. Therefore, to assess the effectiveness of a specific algorithm, it is necessary to compare it with previous works within the same category (e.g., linguistic, structural, and random and statistics). We have also summarized the various limitations of the three major types of text hiding techniques in Table 3, which provides a better understanding of the state of the art and hopefully can guide the development of future works.

Since most recent research has concerned the structural-based techniques, which afford better efficacy (only a few algorithms have been proposed in the other categories), we have tried to highlight the recently proposed algorithms in this paper. As we have pointed out in Section 3, the linguistic and random and statistics-based approaches have more limitations compared to structural-based methods. Due to the use of extra dictionaries and high computational complexity, only a few researchers have focused on linguistic and random and statistics-based approaches in recent years. Over the last decade, many structural-based algorithms have been introduced to improve the efficiency of text hiding by considering the optimum trade-off between criteria, as depicted in Tables 16 and 17. However, their embedding capacity and robustness against various attacks still need to be improved with regard to security requirements.

In what follows, we recommend some guidelines aimed at instructing cybersecurity researchers on the best options for applying the structural-based algorithms, depending on the characteristics of the applications. Nevertheless, we have to declare that these recommendations are general and empirically derived rules of thumb; these directions should not be considered rigidly or dogmatically.

• Since most authentication systems utilize SMS to verify the authenticity of users, the structural-based technique can be employed as the best option to provide covert communication against unpredictable network attacks such as MITM, brute-force, and guessing attacks.
• Where the primary concern is the invisible transmission of secret information over public networks, structural-based steganography algorithms could be utilized to meet that requirement.
• In the case of unauthorized access tracking, a combination of machine learning algorithms and ZWC-based methods can be employed to mark sensitive documents over private networks. For instance, confidential documents in a governmental organization could be marked with identifiers such as an invisible signature, which is difficult to detect.
• Because social media have become a significant part of end users’ daily communications, a combination of unsupervised learning algorithms and structural-based text hiding can be used for intelligent information analysis during the resharing/reproduction of data to protect valuable information against malicious attacks.
• Lossless compression algorithms such as Huffman coding, LZW, arithmetic coding, and so on could be utilized during the encoding stage of structural-based methods to improve the embedding capacity criterion. An efficient text hiding algorithm should provide an optimum trade-off among the three fundamental criteria to gain a certain level of security.
• To sum up, which type of text hiding algorithm provides better efficiency? We cannot give an accurate and unique answer to this question. Cybersecurity researchers must take into account many things, such as the various pros and cons of text hiding algorithms, together with the recommendations that we have outlined. Also, they should ponder whether the text hiding techniques would be relevant or not for the particular application. When the researcher concludes that some of the merits of a specific algorithm could properly benefit the exact needs of the application at issue, it should probably be given a try.
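The recommendation on lossless compression can be illustrated with a brief sketch. It uses DEFLATE from Python's standard `zlib` module (which combines LZ77 with Huffman coding) as a stand-in for the Huffman or LZW coders named in the list; the sample message is hypothetical, and the gain shown here depends on the message being long and repetitive enough to compress.

```python
import zlib


def to_bits(data: bytes) -> str:
    """Render bytes as a bit string, 8 bits per byte."""
    return "".join(format(byte, "08b") for byte in data)


# Hypothetical secret message with enough repetition to compress well.
message = ("transfer approved; " * 8).encode("utf-8")

raw_bits = to_bits(message)                 # payload without compression
packed = zlib.compress(message, 9)          # DEFLATE at the highest level
packed_bits = to_bits(packed)               # payload that would be embedded
restored = zlib.decompress(packed)          # what the receiver recovers
```

Fewer payload bits mean fewer embedding locations are needed in the cover, which is the sense in which compression can raise the effective embedding capacity. For very short messages the compressed form can be longer than the original, so the step is best treated as optional.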

6. Conclusions

This case study presents a comparative analysis of existing text hiding techniques, especially those focused on modifying the structural characteristics of digital text messages/files. We overviewed a range of fundamental criteria, applications, and attacks covering the text hiding area to explain the current security challenges in the cybersecurity industry. Also, we summarized three major categories of text hiding techniques based on how they process cover text messages/files to embed the secret bits, namely, structural, linguistic, and random and statistics. We then outlined the limitations and characteristics of each category to show their efficiency in various applications. Moreover, we evaluated the recently proposed approaches with respect to the fundamental criteria to highlight their pros and cons. Finally, we have recommended some guidelines and directions that merit further attention in future works.

Author Contributions: Conceptualization, writing—original draft, software, validation, and methodology, M.T.A.; Ph.D. dissertation supervision, project administration, funding acquisition, Q.L.; formal analysis, J.H.; review, A.R.; investigation, C.Y.

Funding: This research was funded in part by the Nanjing Municipal Government Scholarship (NMG), Jiangsu province of China, [grant number 2016050328], in part by the Project of ZTE Cooperation Research [2016ZTE04-11], Jiangsu province key research and development program: Social development project [BE2017739], Jiangsu province key research and development program: Industry outlook and common key technology projects [BE2017100], 2018 Jiangsu Province Major Technical Research Project “Information Security Simulation System”.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Ahvanooey, M.T.; Li, Q.; Hou, J.; Mazraeh, H.D.; Zhang, J. AITSteg: An Innovative Text Steganography Technique for Hidden Transmission of Text Message via Social Media. IEEE Access 2018, 6, 65981–65995. [CrossRef] 2. Kamaruddin, N.S.; Kamsin, A.; Por, L.Y.; Rahman, H. A Review of Text Watermarking: Theory, Methods, and Applications. IEEE Access 2018, 6, 8011–8028. [CrossRef] 3. TAhvanooey, M.T.; Li, Q.; Shim, H.J.; Huang, Y. A Comparative Analysis of Information Hiding Techniques for Copyright Protection of Text Documents. Secur. Commun. Netw. 2018, 2018, 5325040. 4. Ahvanooey, M.T.; Mazraeh, H.D.; Tabasi, S.H. An innovative technique for web text watermarking (AITW). Inf. Secur. J. Glob. Perspect. 2016, 25, 191–196. [CrossRef] 5. Rizzo, S.G.; Bertini, F.; Montesi, D. Content-preserving Text Watermarking through Unicode Homoglyph Substitution. In Proceedings of the 20th International Database Engineering & Applications Symposium (IDEAS ’16), Montreal, QC, Canada, 11–13 July 2016; pp. 97–104. 6. Rizzo, S.G.; Bertini, F.; Montesi, D.; Stomeo, C. Text Watermarking in Social Media. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Sydney, Australia, 31 July–3 August 2017. 7. Por, L.Y.; Wong, K.; Chee, K.O. UniSpaCh: A text-based data hiding method using Unicode space characters. J. Syst. Softw. 2012, 85, 1075–1082. [CrossRef] 8. Patiburn, S.A.; Iranmanesh, V.; Teh, P.L. Text Steganography using Daily Emotions Monitoring. Int. J. Educ. Manag. Eng. 2017, 7, 1–14. [CrossRef] 9. Zhou, X.; Wang, Z.; Zhao, W.; Yu, J. Attack Model of Text Watermarking Based on Communications. In Proceedings of the 2009 International Conference on Information Management, Innovation Management and Industrial Engineering, Xi’an, China, 26–27 December 2009. 10. Cachin, C. An information-theoretic model for steganography. Inf. Comput. 2004, 192, 41–56. [CrossRef] 11. 
Shiu, H.J.; Lin, B.S.; Lin, B.S.; Huang, P.Y.; Huang, C.H.; Lei, C.L. Data Hiding on Social Media Communications Using Text Steganography. In Proceedings of the International Conference on Risks and Security of Internet and Systems, Dinard, France, 19–21 September 2017; pp. 217–224. 12. Wang, Y.; Moulin, P. Perfectly Secure Steganography: Capacity, Error Exponents, and Code Constructions. IEEE Trans. Inf. 2008, 54, 2706–2722. [CrossRef] Entropy 2019, 21, 355 26 of 31

13. Wendzel, S.; Caviglione, L.; Mazurczyk, W.; Lalande, J.-F. Network Information Hiding and Science 2.0: Can it be a Match? Int. J. Electron. Telecommun. 2017, 63, 217–222. [CrossRef] 14. Zseby, T.; Vazquez, F.I.; Bernhardt, V.; Frkat, D.; Annessi, R. A Network Steganography Lab on Detecting TCP/IP Covert Channels. IEEE Trans. Educ. 2016, 59, 224–232. [CrossRef] 15. Alotaibi, R.A.; Elrefaei, L.A. Utilizing Word Space with Pointed and Un-pointed Letters for Arabic Text Watermarking. In Proceedings of the 2016 UKSim-AMSS 18th International Conference on Computer Modelling and Simulation (UKSim), Cambridge, UK, 6–8 April 2016; pp. 111–116. 16. Yu, Y.; Min, L.; JianFeng, W.; Bohuai, L.; Yang, Y.; Lei, M.; Wang, J.; Liu, B. A SVM based text steganalysis algorithm for spacing coding. China Commun. 2014, 11, 108–113. 17. Banik, B.G.; Bandyopadhyay, S.K. Novel Text Steganography Using Natural Language Processing and Part-of-Speech Tagging. IETE J. Res. 2018, 1–12. [CrossRef] 18. Ramakrishnan, B.K.; Thandra, P.K.; Srinivasula, A.V.S.M. Text steganography: A novel character-level embedding algorithm using font attribute. Secur. Commun. Netw. 2016, 9, 6066–6079. [CrossRef] 19. Petitcolas, F.; Anderson, R.; Kuhn, M. Information hiding-a survey. Proc. IEEE 1999, 87, 1062–1078. [CrossRef] 20. Fateh, M.; Rezvani, M. An email-based high capacity text steganography using repeating characters. Int. J. Comput. Appl. 2018, 1–7. [CrossRef] 21. Malik, A.; Sikka, G.; Verma, H.K. A high capacity text steganography scheme based on LZW compression and color coding. Eng. Sci. Technol. Int. J. 2017, 20, 72–79. [CrossRef] 22. Mahato, S.; Khan, D.A.; Yadav, D.K. A modified approach to data hiding in Microsoft Word documents by change-tracking technique. J. King Saud Univ. Comput. Inf. Sci. 2017.[CrossRef] 23. Jalil, Z.; Mirza, A.M. A robust zero-watermarking algorithm for copyright protection of text documents. J. Chin. Inst. Eng. 2013, 36, 180–189. [CrossRef] 24. 
Malik, A.; Sikka, G.; Verma, H.K. A high capacity text steganography scheme based on huffman compression and color coding. J. Inf. Optim. Sci. 2017, 38, 647–664. [CrossRef] 25. Rahman, M.S.; Khalil, I.; Yi, X.; Dong, H. Highly imperceptible and reversible text steganography using invisible character based codeword. In Proceedings of the PACIS 2017: Twenty First Pacific Asia Conference on Information Systems, Langkawi, Malaysia, 19 July 2017; pp. 1–13. 26. Rahma, A.M.S.; Bhaya, W.S.; Al-Nasrawi, D.A. Text steganography based on Unicode of characters in multilingual. Int. J. Eng. Res. Appl. (IJERA) 2013, 3, 1153–1165. 27. Aman, M.; Khan, A.; Ahmad, B.; Kouser, S. A hybrid text steganography approach utilizing Unicode space characters and zero-width character. Int. J. Inf. Technol. Secur. 2017, 9, 85–100. 28. Odeh, A.; Elleithy, K.; Faezipour, M.; Abdelfattah, E. Highly efficient novel text steganography algorithms. In Proceedings of the 2015 Long Island Systems, Applications and Technology, Farmingdale, NY, USA, 1 May 2015; pp. 1–7. 29. Naqvi, N.; Abbasi, A.T.; Hussain, R.; Khan, M.A.; Ahmad, B. Multilayer Partially Homomorphic Encryption Text Steganography (MLPHE-TS): A Zero Steganography Approach. Wirel. Pers. Commun. 2018, 103, 1563–1585. [CrossRef] 30. Maram, B.; Gnanasekar, J.M.; Manogaran, G.; BalaAnand, M. Intelligent security algorithm for UNICODE data privacy and security in IOT. Serv. Comput. Appl. 2018, 13, 1–13. [CrossRef] 31. Rahman, M.S.; Khalil, I.; Yi, X. A lossless DNA data hiding approach for data authenticity in mobile cloud based healthcare systems. Int. J. Inf. Manag. 2019, 45, 276–288. [CrossRef] 32. Liu, Y.; Zhu, Y.; Xin, G. A zero-watermarking algorithm based on merging features of sentences for Chinese text. J. Chin. Inst. Eng. 2014, 38, 391–398. [CrossRef] 33. Odeh, A.; Elleithy, K.; Faezipour, M. Steganography in text by using MS word symbols. 
In Proceedings of the 2014 Zone 1 Conference of the American Society for Engineering Education, Bridgeport, CT, USA, 3–5 April 2014; pp. 1–5. 34. Kumar, R.; Chand, S.; Singh, S. An efficient text steganography scheme using Unicode Space Characters. Int. J. Comput. Sci. 2015, 10, 8–14. [CrossRef] 35. Satir, E.; Işık, H. A Huffman compression based text steganography method. Multimed. Tools Appl. 2012, 70, 2085–2110. [CrossRef] 36. Kumar, R.; Malik, A.; Singh, S.; Chand, S. A high capacity email based text steganography scheme using Huffman compression. In Proceedings of the 2016 3rd International Conference on Signal Processing and Integrated Networks (SPIN), Noida, India, 11–12 February 2016; pp. 53–56.

37. Tutuncu, K.; Hassan, A.A. New Approach in E-mail Based Text Steganography. Int. J. Intell. Syst. Appl. Eng. 2015, 3, 54. [CrossRef] 38. Abdullah, A.H. Data Security Algorithm Using Two-Way Encryption and Hiding in Multimedia Files. Int. J. Sci. Eng. Res. 2014, 5, 471–475. 39. Satir, E.; Işık, H. A compression-based text steganography method. J. Syst. Softw. 2012, 85, 2385–2394. [CrossRef] 40. Stojanov, I.; Mileva, A.; Stojanovic, I. A new property coding in text steganography of Microsoft Word documents. In Proceedings of the Eighth International Conference on Emerging Security Information, Systems and Technologies, Lisbon, Portugal, 16–20 November 2014; pp. 25–30. 41. Rafat, K.F.; Sher, M. Secure Digital Steganography for ASCII Text Documents. Arab. J. Sci. Eng. 2013, 38, 2079–2094. [CrossRef] 42. Baawi, S.S.; Mokhtar, M.R.; Sulaiman, R. Enhancement of Text Steganography Technique Using Lempel-Ziv-Welch Algorithm and Two-Letter Word Technique. In Proceedings of the International Conference of Reliable Information and Communication Technology, Kuala Lumpur, Malaysia, 23–24 July 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 525–537. 43. Balajee, K.; Gnanasekar, J. Unicode Text Security Using Dynamic and Key-Dependent 16x16 S-Box (January 4, 2016). Aust. J. Basic Appl. Sci. 2016, 10, 26–36. 44. Qadir, M.; Ahmad, I. Digital Text Watermarking: Secure Content Delivery and Data Hiding in Digital Documents. IEEE Aerosp. Electron. Syst. Mag. 2006, 21, 18–21. [CrossRef] 45. Al-Maweri, N.A.A.S.; Ali, R.; Adnan, W.A.W.; Ramli, A.R.; Rahman, S.M.S.A.A. State-of-the-Art in Techniques of Text Digital Watermarking: Challenges and Limitations. J. Comput. Sci. 2016, 12, 62–80. [CrossRef] 46. Singh, P.; Chadha, R.S. A Survey of Digital Watermarking Techniques, Applications and Attacks. Int. J. Eng. Innov. Technol. 2013, 2, 165–175. 47. Agarwal, M. Text Steganographic Approaches: A Comparison. Int. J. Netw. Secur. Its Appl. 2013, 5, 9–25. [CrossRef] 48. 
Guru, J.; Damecha, H. Digital Watermarking Classification: A Survey. Int. J. Comput. Sci. Trends Technol. 2014, 2, 122–124. 49. Alkawaz, M.H.; Sulong, G.; Saba, T.; Almazyad, A.S.; Rehman, A. Concise analysis of current text automation and watermarking approaches. Secur. Commun. Netw. 2016, 9, 6365–6378. [CrossRef] 50. Alhusban, A.M.; Alnihoud, J.Q.O. A Meliorated Kashida Based Approach for Arabic Text Steganography. Int. J. Comput. Sci. Inf. Technol. 2017, 9, 99–109. 51. Hamdan, A.M.; Hamarsheh, A. AH4S: An algorithm of text in text steganography using the structure of omega network. Secur. Commun. Netw. 2016, 9, 6004–6016. [CrossRef] 52. Sumathi, C.P.; Santanam, T.; Umamaheswari, G. A Study of Various Steganographic Techniques Used for Information Hiding. Int. J. Comput. Sci. Eng. Surv. 2013, 4, 9–25. 53. Mir, N. Copyright for web content using invisible text watermarking. Comput. Hum. Behav. 2014, 30, 648–653. [CrossRef] 54. Sruthi, E.; Scaria, A.; Ambikadevi, A.T. Lossless Data Hiding Method Using Multiplication Property for HTML File. Int. J. Innov. Res. Sci. Technol. 2015, 1, 420–425. 55. Ahvanooey, M.T.; Tabasi, S.H. A new method for copyright protection in digital text documents by adding hidden Unicode characters in Persian/English texts. Int. J. Curr. Life Sci. 2014, 8, 4895–4900. 56. Ahvanooey, M.T.; Tabasi, S.H.; Rahmany, S. A Novel Approach for text watermarking in digital documents by Zero-Width Inter-Word Distance Changes. DAV Int. J. Sci. 2015, 4, 550–558. 57. Bashardoost, M.; Rahim, M.S.M.; Hadipour, N. A novel zero-watermarking scheme for text document authentication. J. Teknol. 2015, 75, 49–56. [CrossRef] 58. Alotaibi, R.A.; Elrefaei, L.A. Improved capacity Arabic text watermarking methods based on open word space. J. King Saud Univ. Comput. Inf. Sci. 2018, 30, 236–248. [CrossRef] 59. Alginahi, Y.M.; Kabir, M.; Tayan, O. An enhanced Kashida-based watermarking approach for Arabic text-documents. 
In Proceedings of the 2013 International Conference on Electronics, Computer and Computation (ICECCO), Ankara, Turkey, 7–9 November 2013; pp. 301–304.

60. Alginahi, Y.M.; Kabir, M.N.; Tayan, O. An Enhanced Kashida-Based Watermarking Approach for Increased Protection in Arabic Text-Documents Based on Frequency Recurrence of Characters. Int. J. Comput. Electr. Eng. 2014, 6, 381–392. [CrossRef] 61. Preda, M.D.; Pasqua, M. Software Watermarking: A Semantics-based Approach. Electron. Notes Theor. Comput. Sci. 2017, 331, 71–85. [CrossRef] 62. Gu, J.; Cheng, Y. A watermarking scheme for natural language documents. In Proceedings of the 2010 2nd IEEE International Conference on Information Management and Engineering (ICICES 2010), Chengdu, China, 16–18 April 2010. 63. Jaiswal, R.; Patil, N.N. Implementation of a new technique for web document protection using unicode. In Proceedings of the 2013 International Conference on Information Communication and Embedded Systems (ICICES 2013), Chennai, India, 21–22 February 2013; pp. 69–72. 64. Liu, T.-Y.; Tsai, W.-H. A New Steganographic Method for Data Hiding in Microsoft Word Documents by a Change Tracking Technique. IEEE Trans. Inf. Forensics Secur. 2007, 2, 24–30. [CrossRef] 65. Mohamed, A. An improved algorithm for information hiding based on features of Arabic text: A Unicode approach. Egypt. Inform. J. 2014, 15, 79–87. [CrossRef] 66. Al-maweri, N.S.; Adnan, W.W.; Ramli, A.R.; Samsudin, K.; Rahman, S.M.S.A.A. Robust Digital Text Watermarking Algorithm based on Unicode Extended Characters. Indian J. Sci. Technol. 2016, 9, 1–14. [CrossRef] 67. Zhang, Y.; Qin, H.; Kong, T. A novel robust text watermarking for word document. In Proceedings of the 3rd International Congress on Image and Signal Processing (CISP2010), Yantai, China, 16–18 October 2010. 68. Kaur, M.; Mahajan, K. An Existential Review on Text Watermarking Techniques. Int. J. Comput. Appl. 2015, 120, 29–32. [CrossRef] 69. Kim, M.Y. Text watermarking by syntactic analysis. 
In Proceedings of the 12th WSEAS International Conference on Computers (ICC ’08), World Scientific and Engineering Academy and Society, Heraklion, Greece, 24–26 August 2008; pp. 904–909. 70. Topkara, M.; Topkara, U.; Atallah, M.J. Words are not enough: Sentence level natural language watermarking. In Proceedings of the 4th ACM International Workshop on Contents Protection and Security, Xi’an, China, 30 May 2006. 71. Topkara, U.; Topkara, M.; Atallah, M.J. The Hiding Virtues of Ambiguity: Quantifiably Resilient Watermarking of Natural Language Text through Synonym Substitutions. In Proceedings of the 8th Workshop on Multimedia and Security (MM&Sec ’06), Geneva, Switzerland, 26–27 September 2006; pp. 167–174. 72. Bender, W.; Gruhl, D.; Morimoto, N.; Lu, A. Techniques for data hiding. IBM Syst. J. 1996, 35, 313–336. [CrossRef] 73. Brassil, J.; Low, S.; Maxemchuk, N. Copyright protection for the electronic distribution of text documents. Proc. IEEE 1999, 87, 1181–1196. [CrossRef] 74. Petrovic, R.; Tehranchi, B.; Winograd, J.M. Security of Copy-Control Watermarks. In Proceedings of the 8th International Conference on Telecommunications in Modern Satellite, Cable and Broadcasting Services (TELSIKS 2007), Nis, Serbia, 26–28 September 2007; pp. 117–126. 75. Vybornova, O.; Macq, B. Natural Language Watermarking and Robust Hashing Based on Presuppositional Analysis. In Proceedings of the IEEE International Conference on Information Reuse and Integration, Las Vegas, NV, USA, 13–15 August 2007; pp. 177–182. 76. Jalil, Z.; Mirza, A.M.; Iqbal, T. A zero-watermarking algorithm for text documents based on structural components. In Proceedings of the IEEE International Conference on Information and Emerging Technologies, Karachi, Pakistan, 14–16 June 2010; pp. 1–5. 77. Bashardoost, M.; Rahim, M.S.M.; Saba, T.; Rehman, A. Replacement Attack: A New Zero Text Watermarking Attack. 3D Res. 2017, 8, 2–9. [CrossRef] 78. Ba-Alwi, F.M.; Ghilan, M.M.; Al-Wesabi, F.N. 
Content Authentication of English Text via Internet using Zero Watermarking Technique and Markov Model. Int. J. Appl. Inf. Syst. 2014, 7, 25–36. 79. Tanha, M.; Torshizi, S.D.S.; Abdullah, M.T.; Hashim, F. An overview of attacks against digital watermarking and their respective countermeasures. In Proceedings of the IEEE International Conference on Cyber Security, Cyber Warfare and Digital Forensic (CyberSec), Kuala Lumpur, Malaysia, 26–28 June 2012; pp. 265–270.

80. Meral, H.M.; Sevinç, E.; Unkar, E.; Sankur, B.; Özsoy, A.S.; Güngör, T. Natural language watermarking via morphosyntactic alterations. In Proceedings of the SPIE 6505, Security, Steganography, and Watermarking of Multimedia Contents, San Jose, CA, USA, 28 January 2007. [CrossRef] 81. Meral, H.M.; Sankur, B.; Özsoy, A.S.; Güngör, T.; Sevinç, E. Natural language watermarking via morphosyntactic alterations. Comput. Lang. 2009, 23, 107–125. [CrossRef] 82. Kim, M.-Y.; Zaiane, O.R.; Goebel, R. Natural Language Watermarking Based on Syntactic Displacement and Morphological Division. In Proceedings of the Computer Software and Applications Conference Workshops (IEEE COMPSACW), Seoul, South Korea, 19–23 July 2010. 83. Halvani, O.; Steinebach, M.; Wolf, P.; Zimmermann, R. Natural language watermarking for german texts. In Proceedings of the 1st ACM Workshop on Information Hiding and Multimedia Security, Montpellier, France, 17–19 June 2013; pp. 193–202. 84. Mali, M.L.; Patil, N.N.; Patil, J.B. Implementation of Text Watermarking Technique Using Natural Language Watermarks. In Proceedings of the IEEE International Conference on Communication Systems and Network Technologies, Gwalior, India, 6–8 April 2013; pp. 482–486. 85. Lu, H.; Guangping, M.; Dingyi, F.; Xiaolin, G. Resilient natural language watermarking based on pragmatics. In Proceedings of the IEEE Youth Conference on Information, Computing and Telecommunication (YC-ICT ’09), Beijing, China, 20–21 September 2009. 86. Lee, I.S.; Tsai, W.H. Secret communication through web pages using special space codes in HTML files. Int. J. Appl. Sci. Eng. 2008, 6, 141–149. 87. Cheng, W.; Feng, H.; Yang, C. A robust text digital watermarking algorithm based on fragments regrouping strategy. In Proceedings of the IEEE International Conference on Information Theory and Information Security (ICITIS), Beijing, China, 17–19 December 2010; pp. 600–603. 88. Gutub, A.A.A.; Ghouti, L.; Amin, A.A.; Alkharobi, T.M.; Ibrahim, M. 
Utilizing extension character ‘Kashida’ with pointed letters for Arabic text digital watermarking. In Proceedings of the SECRYPT 2007, Barcelona, Spain, 28–31 July 2007; pp. 329–332. 89. Chou, Y.-C.; Huang, C.-Y.; Liao, H.-C. A Reversible Data Hiding Scheme Using Cartesian Product for HTML File. In Proceedings of the Sixth International Conference on Genetic and Evolutionary Computing (ICGEC), Kitakyushu, Japan, 25–28 August 2012; pp. 153–156. 90. Odeh, A.; Elleithy, K. Steganography in Text by Merge ZWC and Space Character. In Proceedings of the 28th International Conference on Computers and Their Applications (CATA-2013), Honolulu, HI, USA, 4–6 March 2013; pp. 1–7. 91. Shirali-Shahreza, M. Pseudo-space Persian/Arabic text steganography. In Proceedings of the IEEE Symposium on Computers and Communications ISCC, Marrakech, Morocco, 6–9 July 2008; pp. 864–868. 92. Gutub, A.A.A.; Fattani, M.M. A Novel Arabic Text Steganography Method Using Letter Points and Extensions. Int. J. Comput. Electr. Autom. Control Inf. Eng. 2007, 1, 502–505. 93. Gutub, A.A.A.; Al-Nazer, A.A. High Capacity Steganography Tool for Arabic Text Using ‘Kashida’. ISC Int. J. Inf. Secur. 2010, 2, 107–118. 94. Gutub, A.A.A.; Al-Alwani, W.; Mahfoodh, A.B. Improved Method of Arabic Text Steganography Using the Extension ‘Kashida’ Character. Bahria Univ. J. Inf. Commun. Technol. 2010, 3, 68–72. 95. Al-Nazer, A.; Gutub, A. Exploit Kashida Adding to Arabic e-Text for High Capacity Steganography. In Proceedings of the 2009 Third International Conference on Network and System Security, Gold Coast, QLD, Australia, 19–21 October 2009; pp. 447–451. 96. Al-Nofaie, S.M.; Fattani, M.M.; Gutub, A.A.A. Capacity Improved Arabic Text Steganography Technique Utilizing ‘Kashida’ with Whitespaces. In Proceedings of the 3rd International Conference on Mathematical Sciences and Computer Engineering (ICMSCE 2016), Langkawi, Malaysia, 4–5 February 2016; pp. 38–44. 97. Al-Nofaie, S.M.; Fattani, M.M.; Gutub, A.A.-A. 
Merging Two Steganography Techniques Adjusted to Improve Arabic Text Data Security. J. Comput. Sci. Comput. Math. 2016, 6, 59–65. [CrossRef] 98. Keidel, R.; Wendzel, S.; Zillien, S.; Conner, E.S.; Haas, G. WoDiCoF-A Testbed for the Evaluation of (Parallel) Covert Channel Detection Algorithms. J. Univers. Comput. Sci. 2018, 24, 556–576. 99. Gu, Y.X.; Wyseur, B.; Preneel, B. Software-Based Protection Is Moving to the Mainstream. IEEE Comput. Soc. 2011, 28, 56–59. 100. Por, L.Y.; Ang, T.F.; Delina, B. Whitesteg: A new scheme in information hiding using text steganography. Wseas Trans. Comput. 2008, 7, 735–745.

101. The Unicode Standard. December 2018. Available online: http://www.unicode.org/standard/standard.html (accessed on 30 March 2019). 102. Unicode. Wikipedia (the Free Encyclopedia), December 2018. Available online: https://en.wikipedia.org/wiki/Unicode (accessed on 30 March 2019). 103. Unicode Control Characters. March 2019. Available online: http://www.fileformat.info/info/unicode/char/search.htm (accessed on 30 March 2019). 104. Kerckhoffs, A. La cryptographie militaire. J. Sci. Mil. 1883, IX, 161–191. 105. Din, R.; Tuan Muda, T.Z.; Lertkrai, P.; Omar, M.N.; Amphawan, A.; Aziz, F.A. Text steganalysis using evolution algorithm approach. In Proceedings of the 11th WSEAS International Conference on Information Security and Privacy (ISP’12), Prague, Czech Republic, 24–26 September 2012. 106. Din, R.; Samsudin, A.; Lertkrai, P. A Framework Components for Natural Language Steganalysis. Int. J. Comput. Eng. 2012, 641–645. [CrossRef] 107. Mazurczyk, W.; Wendzel, S.; Cabaj, K. Towards Deriving Insights into Data Hiding Methods Using Pattern-based Approach. In Proceedings of the 13th International Conference on Availability, Reliability and Security, Hamburg, Germany, 27–30 August 2018; p. 10. 108. Simmons, G.J. The prisoner’s problem and the subliminal channel. In Advances in Cryptology; Plenum Press: New York, NY, USA, 1984; pp. 51–67. 109. Khosravi, B.; Khosravi, B.; Khosravi, B.; Nazarkardeh, K. A new method for pdf steganography in justified texts. JISA. 2019, 145, 61–70. 110. Ahvanooey, M.T.; Li, Q.; Rabbani, M.; Rajput, A.R. A Survey on Smartphones Security: Software Vulnerabilities, Malware, and Attacks. Int. J. Adv. Comput. Sci. Appl. 2017, 8, 30–45. 111. Khairullah, M. A novel steganography method using transliteration of Bengali text. J. King Saud Univ. Comput. Inf. Sci. 2018. [CrossRef] 112. Kim, Y.-W.; Moon, K.-A.; Oh, I.-S. A text watermarking algorithm based on word classification and inter-word space statistics. 
In Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR ’03), Washington, DC, USA, 27 June–2 July 2003; Volume 2, p. 775. 113. Alattar, A.M.; Alattar, O.M. Watermarking electronic text documents containing justified paragraphs and irregular line spacing. Electron. Imaging 2004, 5306, 685–695. 114. Low, S.; Maxemchuk, N.; Brassil, J.; O’Gorman, L. Document marking and identification using both line and word shifting. In Proceedings of the Fourteenth Annual Joint Conference of the IEEE Computer and Communications Societies, Bringing Information to People (INFOCOM ’95), Boston, MA, USA, 2–6 April 1995; Volume 2, pp. 853–860. 115. Memon, M.Q.; Yu, H.; Rana, K.G.; Azeem, M.; Yongquan, C.; Ditta, A. Information hiding: Arabic text steganography by using Unicode characters to hide secret data. Int. J. Electron. Secur. Digit. Forensics 2018, 10, 61–78. [CrossRef] 116. Shirali-Shahreza, M. A New Approach to Persian/Arabic Text Steganography. In Proceedings of the 5th IEEE/ACIS International Conference on Computer and Information Science and 1st IEEE/ACIS International Workshop on Component-Based Software Engineering, Software Architecture and Reuse (ICIS-COMSAR’06), Honolulu, HI, USA, 10–12 July 2006; pp. 310–315. 117. Aabed, M.A.; Awaideh, S.M.; Elshafei, A.-R.M.; Gutub, A.A. Arabic Diacritics based Steganography. In Proceedings of the 2007 IEEE International Conference on Signal Processing and Communications, United Arab Emirates, 24–27 November 2007; pp. 756–759. 118. Gutub, A.; Elarian, Y.; Awaideh, S.; Alvi, A. Arabic text steganography using multiple diacritics. In Proceedings of the 5th IEEE International Workshop on Signal Processing and its Applications (WoSPA08), University of Sharjah, Sharjah, UAE, 18–20 March 2008. 119. Memon, J.A.; Khowaja, K.; Kazi, H. Evaluation of steganography for urdu/arabic text. J. Theor. Appl. Inf. Technol. 2005, 4, 232–237. 120. Nagarhalli, T.P. 
A new approach to SMS text steganography using emoticons. In Proceedings of the International Journal of Computer Applications (0975–8887) National Conference on Role of Engineers in Nation Building (NCRENB-14), VIVA Institute of Technology, Maharashtra, India, 6–7 March 2014; pp. 1–3. 121. Ahmad, T.; Sukanto, G.; Studiawan, H.; Wibisono, W.; Ijtihadie, R.M. Emoticon-based steganography for securing sensitive data. In Proceedings of the 2014 6th International Conference on Information Technology and Electrical Engineering (ICITEE), Yogyakarta, Indonesia, 7–8 October 2014; pp. 1–6.

122. Iranmanesh, V.; Wei, H.J.; Dao-Ming, S.L.; Arigbabu, O.A. On using emoticons and lingoes for hiding data in SMS. In Proceedings of the 2015 International Symposium on Technology Management and Emerging Technologies (ISTMET), Melaka, Malaysia, 25–27 August 2015; pp. 103–107. 123. Shirali-Shahreza, M. A New Persian/Arabic Text Steganography Using “La” Word. In Advances in Computer and Information Sciences and Engineering; Springer: Berlin/Heidelberg, Germany, 2008; pp. 339–342. 124. Bhattacharyya, S.; Indu, P.; Sanyal, G. Hiding Data in Text using ASCII Mapping Technology (AMT). Int. J. Comput. Appl. 2013, 70, 29–37. [CrossRef] 125. Kingslin, S.; Kavitha, N. Evaluative Approach towards Text Steganographic Techniques. J. Sci. Technol. 2015, 8. [CrossRef] 126. Thamaraiselvan, R.; Saradha, A. A Novel approach of Hybrid Method of Hiding the Text Information Using Stegnography. Int. J. Comput. Eng. Res. 2012, 1405–1409. 127. Ryabko, B.; Ryabko, D. Information-theoretic approach to steganographic systems. In Proceedings of the 2007 IEEE International Symposium on Information Theory, Nice, France, 24–29 June 2007; pp. 2461–2464. 128. Chen, R.X. A Brief Introduction on Shannon’s Information Theory. arXiv, 2016; arXiv:1612.09316. 129. Verdú, S. Fifty years of Shannon theory. IEEE Trans. Inf. Theory 1998, 44, 2057–2078. [CrossRef] 130. Yamano, T. A possible extension of Shannon’s information theory. Entropy 2001, 3, 280–292. [CrossRef] 131. Rico-Larmer, S.M. Cover Text Steganography: N-gram and Entropy-based Approach. In Proceedings of the 2016 KSU Conference on Cybersecurity Education, Research and Practice, Kennesaw State University, Kennesaw, GA, USA, 4 October 2016. Available online: https://digitalcommons.kennesaw.edu/ccerp/2016/Student/16 (accessed on 30 March 2019). 132. Menezes, A.; van Oorschot, P.; Vanstone, S. Handbook of Applied Cryptography; CRC Press: Boca Raton, FL, USA, 1996. 133. Ryabko, B.; Fionov, A. 
Basics of Contemporary Cryptography for IT Practitioners; World Scientific Pub. Co. Pte Ltd.: Hackensack, NJ, USA, 2005.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).