The Unicode specifications are continually revised and updated to add new languages and symbols. Based on the official python documentation, Unicode (Universal Coded Character Set)is a specification that aims to list every character used by human languages and give each character its own unique code. Let’s start with a simple explanation of Unicode. You might be asking why we need to encode and decode characters. Most of the time, such errors are not informative enough unless you are a veteran in this field. Imagine the frustration when you encounter errors in encoding or decoding such as: UnicodeEncodeError: 'mbcs' codec can't decode characters in position UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position Handling Unicode files as a natural language processing practitioner is a nightmare, especially if you are using Windows operating system. This article is a must-read for those that often handle Unicode files (applicable to other encodings as well ) in their daily work. The official logo of the Unicode Consortium
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |