Remove Unicode Characters from Text Files - sed, Other Bash/Shell Methods

Question

How do I remove Unicode characters from a bunch of text files on the terminal? I've tried this, but it didn't work:

    sed 'g/\u'U+200E'//' -i *.txt

I need to remove these characters from the text files:

  • U+0091 - a sort of weird "control" space
  • U+0092 - the same sort of weird "control" space
  • U+00A0 - no-break space
  • U+200E - left-to-right mark

Tags: bash, unicode, sed, text-files, spaces

asked Dec 19 '11 at 13:55 by alvas

Comments

What encoding are your text files in? – unwind Dec 19 '11 at 14:08

5 Answers

Answer 1

If you want to remove ONLY particular characters and you have Python, you can:

    # Build the set of characters to delete (Python 2 syntax)
    CHARS=$(python -c 'print u"\u0091\u0092\u00a0\u200E".encode("utf8")')
    sed 's/['"$CHARS"']//g' < /tmp/utf8_input.txt > /tmp/ascii_output.txt

answered Dec 19 '11 at 14:19 by Michał Šrajer

Comments

Maybe not the prettiest, but it worked very well for me. Constructing the CHARS variable makes the sed command easier to read, and the variable is easy to maintain. Choroba's answer also works, so I guess it's a matter of taste (and of whether you have Python handy). – Paulb Feb 17 '14 at 13:03

Here is an alternative for the Python part:

    python -c 'print "".join(map(unichr, range(0x80, 0xa0) + range(0x2000, 0x200f))).encode("utf-8")'

– ENDOH takanao Mar 17 at 4:15

Answer 2

Clear all non-ASCII characters from file.txt:

    $ iconv -c -f utf-8 -t ascii file.txt
    $ strings file.txt

answered Dec 19 '11 at 14:12 by kev

Comments

I want to keep the Unicode encoding. Sorry, so iconv is not the solution. – alvas Dec 19 '11 at 14:40

Why can't you just run it in reverse?

    tempf=$(mktemp)
    iconv -c -f utf-8 -t ascii file.txt > "$tempf"
    iconv -f ascii -t utf-8 "$tempf" > file.txt

– David Gladfelter Feb 21 '14 at 16:32

ASCII is a valid subset of UTF-8, so the reverse transformation leaves the file unchanged. – Eric Bréchemier Sep 8 '14 at 9:13

You have just changed my life, kev! You're The Man. Thanks! – Krzysztof Jabłoński Oct 3 '14 at 15:17

Answer 3

Use iconv:

    iconv -f utf8 -t ascii//TRANSLIT < /tmp/utf8_input.txt > /tmp/ascii_output.txt

This will translate characters like "Š" into "S" (the most similar-looking ones).

answered Dec 19 '11 at 14:05 by Michał Šrajer

Comments

They are not ASCII. I want to keep them in UTF-8, but I want to replace these weird spaces with the empty string "". – alvas Dec 19 '11 at 14:09

See my other answer. – Michał Šrajer Dec 19 '11 at 14:20

Not what the OP wanted, but I needed to convert a Unicode line separator (U+2028) into a newline. I would have preferred to use iconv, but I couldn't figure out how to do it. Is there a way? – Chris Quenelle Oct 1 '13 at 18:05

The -c flag is useful for discarding characters that cannot be transliterated, which avoids a fatal error. – Eric Bréchemier Sep 8 '14 at 9:10

As an alternative to -c, --unicode-subst lets you specify a substitution pattern for the character instead of removing it completely. For example, --unicode-subst='?' replaces non-identifiable characters with a question mark. – Eric Bréchemier Sep 8 '14 at 10:31

Answer 4

For the UTF-8 encoding of Unicode, you can use this regular expression with sed:

    sed 's/\xc2\x91\|\xc2\x92\|\xc2\xa0\|\xe2\x80\x8e//g'

answered Dec 19 '11 at 14:26 by choroba

Comments

How do I find the mapping from U+... to \xc2\... ? – alvas Dec 19 '11 at 14:37

    echo -ne '\u0091' | xxd

– kev Dec 19 '11 at 14:52

This could be a good start: utf8-chartable.de – jaypal singh Dec 20 '11 at 2:21

Answer 5

Convert Swift files from UTF-8 to ASCII:

    for file in *.swift; do
        iconv -f utf-8 -t ascii "$file" > "$file".tmp
        mv -f "$file".tmp "$file"
    done

Related: swift auto completion not working in Xcode6-Beta

answered Jul 12 '14 at 13:56 by MattDiPasquale
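
For anyone without GNU sed at hand, the two ideas in this thread (mapping U+... code points to their UTF-8 bytes, and deleting only those code points while keeping the file in UTF-8) can also be sketched in Python 3. This is a minimal sketch, not taken from any of the answers: it assumes the files really are UTF-8, and the names `UNWANTED`, `utf8_escape`, `strip_code_points`, and `clean_file` are hypothetical helpers made up for illustration.

```python
# Code points listed in the question; everything else in the text is kept.
UNWANTED = [0x0091, 0x0092, 0x00A0, 0x200E]


def utf8_escape(code_point):
    """Return the UTF-8 bytes of a code point as a sed-style \\xNN string."""
    encoded = chr(code_point).encode("utf-8")
    return "".join("\\x{:02x}".format(byte) for byte in encoded)


def strip_code_points(text, code_points=UNWANTED):
    """Delete every occurrence of the given code points from text."""
    # str.translate removes characters whose ordinal maps to None.
    return text.translate({cp: None for cp in code_points})


def clean_file(path, code_points=UNWANTED):
    """Rewrite a UTF-8 file in place without the unwanted code points."""
    # newline="" keeps the original line endings untouched (assumption:
    # the file is small enough to read into memory in one go).
    with open(path, encoding="utf-8", newline="") as handle:
        text = handle.read()
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(strip_code_points(text, code_points))


if __name__ == "__main__":
    # Print the U+... to \x.. mapping asked about in the comments.
    for cp in UNWANTED:
        print("U+{:04X} -> {}".format(cp, utf8_escape(cp)))
```

Run directly, the script prints the byte sequences used in the sed expression from the fourth answer (for example, U+200E maps to \xe2\x80\x8e). Calling `clean_file("some.txt")` on each file would play the role of `sed -i`, with the difference that characters such as "Š" stay untouched because nothing is converted to ASCII.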
