
Remove unicode characters from textfiles - sed, other bash/shell methods

How do I remove unicode characters from a bunch of text files on the terminal? I've tried this but it didn't work:

sed 'g/\u'U+200E'//' -i *.txt

I need to remove these Unicode characters from the text files:

U+0091 - sort of weird "control" space
U+0092 - same sort of weird "control" space
A0 - non-space break
U+200E - left to right mark

bash unicode sed text-files spaces

asked Dec 19 '11 13:55 alvas 7,826 12 62 155

What encoding are your text files in? – unwind Dec 19 '11 at 14:08

5 Answers

If you want to remove ONLY particular characters and you have python, you can:

CHARS=$(python -c 'print u"\u0091\u0092\u00a0\u200E".encode("utf8")')
sed 's/['"$CHARS"']//g' < /tmp/utf8_input.txt > /tmp/ascii_output.txt
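The one-liner above uses Python 2 syntax (print statement, u"" literal). A minimal sketch of the same idea for Python 3 follows, assuming python3 is on the PATH and that sed runs in a UTF-8 locale; the function names unicode_chars and strip_chars are made up for illustration.

```shell
# Print the four characters from the question as UTF-8, with no trailing newline.
# PYTHONIOENCODING forces UTF-8 output regardless of the current locale.
unicode_chars() {
  PYTHONIOENCODING=utf-8 python3 -c 'print("\u0091\u0092\u00a0\u200e", end="")'
}

# Delete every character of that set from stdin.
# Assumption: a UTF-8 locale, so that sed treats each multi-byte character
# as a single member of the bracket expression.
strip_chars() {
  sed "s/[$(unicode_chars)]//g"
}

# Usage (paths taken from the answer above):
#   strip_chars < /tmp/utf8_input.txt > /tmp/ascii_output.txt
```

Keeping the character list in one place means adding another code point only touches the Python string.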

answered Dec 19 '11 at 14:19 Michał Šrajer 11.3k 2 21 47

Maybe not the prettiest. But it worked very well for me. By constructing the CHARS variable, it made the sed easier to read, and CHARS variable can be easily maintained. Choroba's answer also works, so I guess it's a matter of taste (and if you have Python handy). – Paulb Feb 17 '14 at 13:03

Here is an alternative for the Python part: python -c 'print "".join(map(unichr, range(0x80, 0xa0) + range(0x2000, 0x200f))).encode("utf-8")' – ENDOH takanao Mar 17 at 4:15

Remove all non-ASCII characters from file.txt:

$ iconv -c -f utf-8 -t ascii file.txt
$ file.txt

answered Dec 19 '11 at 14:12 kev 52.6k 10 100 144

I want to keep the Unicode encoding, so sorry, iconv is not the solution. – alvas Dec 19 '11 at 14:40

Why can't you just run it in reverse? tempf=$(mktemp); iconv -c -f utf-8 -t ascii file.txt > "$tempf"; iconv -f ascii -t utf-8 "$tempf" > file.txt – David Gladfelter Feb 21 '14 at 16:32

ASCII is a valid subset of UTF-8, so the reverse transformation leaves the file unchanged. – Eric Bréchemier Sep 8 '14 at 9:13
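Since every ASCII byte is also the UTF-8 encoding of the same character, converting a pure-ASCII file to UTF-8 should be a no-op. A small sketch to check this on a given file, assuming iconv and cmp are available; the helper name ascii_to_utf8_is_noop is hypothetical.

```shell
# Succeeds (exit status 0) when converting FILE from ASCII to UTF-8
# produces exactly the same bytes as FILE itself.
ascii_to_utf8_is_noop() {
  # cmp -s compares the converted stream on stdin ("-") with the file, silently.
  iconv -f ascii -t utf-8 "$1" | cmp -s - "$1"
}

# Usage:
#   ascii_to_utf8_is_noop file.txt && echo "second iconv step changes nothing"
```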

You have just changed my life, kev! You're The Man. Thanks! – Krzysztof Jabłoński Oct 3 '14 at 15:17

Use iconv:

iconv -f utf8 -t ascii//TRANSLIT < /tmp/utf8_input.txt > /tmp/ascii_output.txt

This will translate characters like "Š" into "S" (the most similar-looking ones).

answered Dec 19 '11 at 14:05 Michał Šrajer 11.3k 2 21 47

They are not ASCII. I want to keep them in UTF-8, but I want to replace these weird spaces with the empty string "". – alvas Dec 19 '11 at 14:09

See my other answer – Michał Šrajer Dec 19 '11 at 14:20

Not what the OP wanted, but I had a need to convert a Unicode line separator (U+2028) into a newline. I would have preferred to use iconv, but I couldn't figure out how to do it. Is there a way? – Chris Quenelle Oct 1 '13 at 18:05

the -c flag is useful to discard characters that cannot be transliterated, avoiding a fatal error. – Eric Bréchemier Sep 8 '14 at 9:10

As an alternative to -c, --unicode-subst lets you specify a pattern to substitute for the character instead of removing it completely. For example, --unicode-subst='?' replaces non-identifiable characters with a question mark. – Eric Bréchemier Sep 8 '14 at 10:31

For utf-8 encoding of unicode, you can use this regular expression for sed:

sed 's/\xc2\x91\|\xc2\x92\|\xc2\xa0\|\xe2\x80\x8e//g'
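The \xHH escapes and \| alternation in that expression are GNU sed extensions. A sketch of the same byte-sequence approach that should also work with other sed implementations builds the bytes with printf octal escapes and runs sed in the C locale so that it matches raw bytes. The function name strip_marks is hypothetical.

```shell
# Remove U+0091, U+0092, U+00A0 and U+200E from the named files (or stdin).
strip_marks() {
  # UTF-8 byte sequences, written as octal escapes for POSIX printf.
  c91=$(printf '\302\221')      # U+0091 = c2 91
  c92=$(printf '\302\222')      # U+0092 = c2 92
  nbsp=$(printf '\302\240')     # U+00A0 = c2 a0
  lrm=$(printf '\342\200\216')  # U+200E = e2 80 8e
  # LC_ALL=C makes sed work on bytes; other UTF-8 characters pass through untouched.
  LC_ALL=C sed -e "s/$c91//g" -e "s/$c92//g" -e "s/$nbsp//g" -e "s/$lrm//g" "$@"
}

# Usage:
#   strip_marks < input.txt > output.txt
```

Because UTF-8 is self-synchronizing, these byte sequences can only match the characters themselves, never parts of neighbouring characters.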

answered Dec 19 '11 at 14:26 choroba 71.2k 6 41 95

How do I do the mapping from U+... to \xc2\...? – alvas Dec 19 '11 at 14:37

echo -ne '\u0091' | xxd – kev Dec 19 '11 at 14:52

This could be a good start - utf8-chartable.de – jaypal singh Dec 20 '11 at 2:21
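The mapping from a code point to its UTF-8 bytes can also be computed with plain shell arithmetic, without relying on the shell's \u support or on a lookup table. This is only a sketch: the function name utf8_escape is made up, and it does no validation (for example, it does not reject surrogates or values above U+10FFFF).

```shell
# Print the UTF-8 encoding of a code point as \xHH escapes.
# Accepts "200E" or "U+200E".
utf8_escape() {
  hex=${1#U+}          # drop an optional "U+" prefix
  cp=$((0x$hex))       # hex string -> integer
  if [ "$cp" -lt 128 ]; then            # 1 byte:  0xxxxxxx
    printf '\\x%02x\n' "$cp"
  elif [ "$cp" -lt 2048 ]; then         # 2 bytes: 110xxxxx 10xxxxxx
    printf '\\x%02x\\x%02x\n' \
      $((0xC0 | (cp >> 6))) $((0x80 | (cp & 0x3F)))
  elif [ "$cp" -lt 65536 ]; then        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    printf '\\x%02x\\x%02x\\x%02x\n' \
      $((0xE0 | (cp >> 12))) $((0x80 | ((cp >> 6) & 0x3F))) $((0x80 | (cp & 0x3F)))
  else                                  # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    printf '\\x%02x\\x%02x\\x%02x\\x%02x\n' \
      $((0xF0 | (cp >> 18))) $((0x80 | ((cp >> 12) & 0x3F))) \
      $((0x80 | ((cp >> 6) & 0x3F))) $((0x80 | (cp & 0x3F)))
  fi
}

# Usage:
#   utf8_escape U+0091   # prints \xc2\x91
#   utf8_escape U+200E   # prints \xe2\x80\x8e
```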

Convert Swift files from utf-8 to ascii:

for file in *.swift; do
    iconv -f utf-8 -t ascii "$file" > "$file".tmp
    mv -f "$file".tmp "$file"
done

Related: swift auto completion not working in Xcode6-Beta
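One caveat with a loop like this: without -c, iconv typically stops with an error at the first character it cannot convert, and replacing the original unconditionally would then leave a truncated file. A defensive sketch that replaces the file only when the filter succeeds; the helper name rewrite_in_place is hypothetical, and mktemp is assumed to be available.

```shell
# Run a filter command over FILE and overwrite FILE only if the command succeeds.
# Note: plain sh has no local variables, so "file" and "tmp" are global here.
rewrite_in_place() {
  file=$1
  shift
  tmp=$(mktemp) || return 1
  if "$@" < "$file" > "$tmp"; then
    # Copy the contents back instead of moving, so the original file
    # keeps its permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Usage:
#   for file in *.swift; do
#       rewrite_in_place "$file" iconv -f utf-8 -t ascii || echo "skipped: $file" >&2
#   done
```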

answered Jul 12 '14 at 13:56 MattDiPasquale 30.4k 53 237 374