Walking in Light with Christ - Faith, Computing, Diary Articles & tips and tricks on GNU/Linux, FreeBSD, Windows, mobile phone articles, religious related texts http://www.pc-freak.net/blog

Linux: Recursively convert file content from WINDOWS-CP1251 to Unicode UTF-8 with recode and iconv

Author : admin

Some time ago I wrote a short article explaining how the HTML or text content of a file can be converted with iconv.

Just recently, I mirrored a whole website, with its directory structure, using wget. The mirrored website was encoded in the Windows-1251 charset (now somewhat obsolete and not recommended), whereas the Apache web server I mirrored it to is configured by default to deliver file content (.html, .txt, .js, .css ...) in the newer, standard UTF-8 encoding, which covers Cyrillic as well. Thus, when opened in a browser from my website, the pages were declared as UTF-8 while the file content itself was still encoded in Windows-1251, and I ended up seeing a lot of unreadable garbage characters instead of Slavonic letters. To deal with the inconvenience, I used a one-liner script that converts all Windows-1251 files to UTF-8. This prompted me to write this little post, hoping it might be useful to others in a similar situation:

1. Mass file charset / encoding conversion with recode

On most Linux hosts, recode is probably not installed by default. If you're on Debian / Ubuntu Linux, install it with apt:

apt-get install -- recode

It is also installable from default repositories on Fedora, RHEL, CentOS with:

yum -y install recode

Here is the recode description taken from its man page:


NAME recode - converts files between character sets

To convert all .html files under the current directory recursively, run:

find . -name "*.html" -exec recode WINDOWS-1251..UTF-8 {} \;
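Before and after a mass conversion it can help to see which files still fail UTF-8 validation. The following is only a minimal sketch, not part of the original one-liner: the helper name list_non_utf8 and the sample files are hypothetical, and it assumes iconv is installed. It relies on iconv exiting with an error when asked to read a file as UTF-8 that is not valid UTF-8.

```shell
# list_non_utf8 DIR: hypothetical helper, prints .html files that are not valid UTF-8.
list_non_utf8() {
    find "$1" -type f -name '*.html' -exec sh -c '
        for f do
            # iconv fails on byte sequences that are illegal in UTF-8.
            iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1 || printf "%s\n" "$f"
        done
    ' sh {} +
}

# Demo on a throwaway directory (sample data, not from the article).
check_dir=$(mktemp -d)
printf 'plain ascii\n' > "$check_dir/ok.html"
# The word "Привет" encoded in Windows-1251 (bytes CF F0 E8 E2 E5 F2).
printf '\317\360\350\342\345\362\n' > "$check_dir/legacy.html"

pending=$(list_non_utf8 "$check_dir")
echo "$pending"
```

Pure ASCII files pass the check, which is fine since ASCII text is byte-for-byte identical in Windows-1251 and UTF-8. Files listed here are candidates for conversion; files not listed should not be converted a second time, as converting already converted text garbles it.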

If you have several file extensions whose character encoding needs to be converted, let's say .html, .htm and .php, use:

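find . \( -name "*.html" -o -name '*.htm' -o -name '*.php' \) -exec recode WINDOWS-1251..UTF-8 {} \;

By the way, I only recently learned how to search for several file extensions in a find one-liner: each additional extension is passed as -o -name '*.file-extension'. As you can see from the example, you can look for as many different file extensions as you like with a single find command. Note the escaped parentheses \( ... \) around the -name tests: without them, -exec would apply only to the last test (here '*.php'), because find's implicit "and" binds more tightly than -o.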
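How find combines several -o -name tests with an action such as -exec or -print is easy to get wrong, so here is a small sketch that shows the difference on a throwaway directory. The directory and file names are made up for the demonstration, and -print stands in for -exec so that nothing is modified.

```shell
# Build a temporary tree with three web files and one unrelated file (sample names).
demo_dir=$(mktemp -d)
touch "$demo_dir/a.html" "$demo_dir/b.htm" "$demo_dir/c.php" "$demo_dir/d.txt"

# Without grouping, the action binds only to the last -name test ('*.php').
ungrouped=$(find "$demo_dir" -name '*.html' -o -name '*.htm' -o -name '*.php' -print | wc -l | tr -d ' ')

# With \( ... \), the action applies to a file matching any of the three tests.
grouped=$(find "$demo_dir" \( -name '*.html' -o -name '*.htm' -o -name '*.php' \) -print | wc -l | tr -d ' ')

echo "ungrouped matches: $ungrouped, grouped matches: $grouped"
rm -rf "$demo_dir"
```

The same rule applies when -print is replaced by -exec recode ...: only the grouped form hands every matching extension to recode.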

After completing the conversion, I remembered that earlier I had also used iconv on a couple of occasions to convert from Cyrillic CP-1251 to UTF-8. So, for those who prefer to do the conversion with iconv, here is an alternative, slightly longer method using a for loop and iconv.

2. Mass file conversion with iconv

for i in $(find . -name "*.html" -print); do iconv -f WINDOWS-1251 -t UTF-8 "$i" > "$i.utf-8" && mv "$i" "$i.bak" && mv "$i.utf-8" "$i"; done

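As you can see in the above line of code, there are two occurrences of the mv command: the first backs up each original .html file as .bak, and the second replaces the original with the file in the converted encoding. For files other than .html, just change find . -name "*.html" in the command to whatever file extension you need. Note that this for loop splits the find output on whitespace, so it does not work for file names containing spaces.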
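For trees where file names may contain spaces, the same backup-then-replace idea can be sketched with find -exec sh -c instead of a for loop over the find output. This is only one possible variant, not the method used above: the function name convert_tree and the sample file are hypothetical, and the original is replaced only if iconv succeeds.

```shell
# convert_tree DIR: hypothetical helper, converts *.html under DIR from Windows-1251 to UTF-8.
convert_tree() {
    find "$1" -type f -name '*.html' -exec sh -c '
        for f do
            # Write the converted text to a temporary file first.
            if iconv -f WINDOWS-1251 -t UTF-8 "$f" > "$f.utf-8"; then
                # Keep the original as .bak, then move the converted file into place.
                mv "$f" "$f.bak" && mv "$f.utf-8" "$f"
            else
                # Conversion failed: leave the original untouched.
                rm -f "$f.utf-8"
            fi
        done
    ' sh {} +
}

# Demo on a throwaway directory; the file name deliberately contains a space.
site_dir=$(mktemp -d)
# The word "Привет" encoded in Windows-1251 (bytes CF F0 E8 E2 E5 F2).
printf '\317\360\350\342\345\362\n' > "$site_dir/my page.html"
convert_tree "$site_dir"
ls "$site_dir"
```

Because find passes each path to the inner shell as a separate argument, spaces in file names are preserved. As with the for loop, running it twice on the same tree would convert already converted files again, so it should be run only once per tree.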
