UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0xa5 in position 0: invalid start byte

What is UnicodeDecodeError ‘utf8’ codec can’t decode byte?

A UnicodeDecodeError normally occurs when decoding a byte string from a particular encoding. Since an encoding maps only a limited set of byte sequences to Unicode characters, an illegal byte sequence (for example, non-ASCII bytes passed to the ASCII codec) causes that codec's decode() to fail.

When importing and reading a CSV file, Python tries to convert a byte array (bytes that it assumes to be a UTF-8-encoded string) into a Unicode string (str). This is a decoding process that follows UTF-8 rules. During decoding, Python encounters a byte sequence that is not allowed in UTF-8-encoded data (in the error message above, the byte 0xa5 at position 0) and raises the exception.

Example

import pandas as pd
a = pd.read_csv("filename.csv")

Output

Traceback (most recent call last):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 2: invalid start byte
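
The same failure can be reproduced without any file. The following is a minimal sketch that assumes the data was written as Windows-1252 (cp1252), where the en dash is stored as the single byte 0x96; the sample text is made up for illustration.

```python
# Hypothetical sample: text saved as cp1252, then wrongly read as UTF-8.
raw = "a \u2013 b".encode("cp1252")  # the en dash becomes the single byte 0x96

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc
    # The exception records which codec failed, at which offset, and why.
    print(exc.encoding, hex(raw[exc.start]), exc.start, exc.reason)

# Decoding with the encoding that was actually used succeeds.
text = raw.decode("cp1252")
```

The attributes start and reason on the exception are often enough to identify the offending byte before guessing at an encoding.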

There are several ways to resolve this issue, and the right one depends on the use case. Let's look at the most common situations and the solution for each.

Solution for Importing and Reading CSV files using Pandas

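If you are using pandas to import and read CSV files, pass the file's actual encoding (for example, cp1252 or latin-1, both common for files exported on Windows) through the encoding parameter. Setting it to unicode_escape, as shown below, also avoids the UnicodeDecodeError, but it is a workaround: bytes above 0x7F are read as Latin-1, and backslashes in the data are treated as escape sequences.

import pandas as pd
data = pd.read_csv(r"C:\Employess.csv", encoding="unicode_escape")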
print(data.head())
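
When the encoding is unknown, one option is to try a few likely candidates on the raw bytes and pass the first one that decodes cleanly to the encoding parameter of pd.read_csv. This is only a sketch: the helper name first_working_encoding and the candidate list are assumptions, and a clean decode does not prove the guess is right (latin-1 accepts any byte sequence).

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes raw without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next
        return name
    return None  # none of the candidates worked


# Hypothetical file content written by a tool that uses cp1252.
sample = "Zo\u00eb \u2013 10 \u20ac".encode("cp1252")
guess = first_working_encoding(sample)
print(guess)
```

In practice the bytes would come from open(path, "rb").read(), and the result should still be checked by eye on a few rows.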

Additional points :

  • UTF-8 properties –
    • Can handle any Unicode code point.
    • A string of ASCII text is also valid UTF-8 text.
    • UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes. This avoids the byte-ordering issues that can occur with integer and word oriented encodings, like UTF-16 and UTF-32, where the sequence of bytes varies depending on the hardware on which the string was encoded.
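
The byte-order point can be checked directly. In this small sketch (the sample string is arbitrary), UTF-16 yields two different byte sequences depending on the byte order, while UTF-8 has only one form and leaves ASCII text unchanged.

```python
text = "A\u20ac"  # the letter A followed by the euro sign

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")  # little-endian: low byte first
utf16_be = text.encode("utf-16-be")  # big-endian: high byte first

print(utf8, utf16_le, utf16_be)

# Pure ASCII text has the same bytes in ASCII and in UTF-8.
ascii_same = "plain ASCII".encode("utf-8") == "plain ASCII".encode("ascii")
```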

    Solution for Loading and Parsing JSON files

    If you get a UnicodeDecodeError while reading and parsing a JSON file, the file is probably not encoded in UTF-8. A common alternative is ISO-8859-1 (Latin-1), so try decoding the content with that encoding before parsing it. In Python 3, where the unicode() built-in no longer exists, decode the bytes explicitly:

    json.loads(opener.open(…).read().decode("ISO-8859-1"))
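
A self-contained way to see the effect, assuming a made-up payload that was stored as ISO-8859-1: passing the bytes straight to json.loads fails because the json module expects UTF-8 (or UTF-16/UTF-32) for bytes input, while decoding first works.

```python
import json

# Hypothetical payload, stored as ISO-8859-1 rather than UTF-8.
raw = '{"city": "M\u00e1laga"}'.encode("iso-8859-1")

try:
    json.loads(raw)  # bytes input is assumed to be UTF-8 here
    parsed_directly = True
except UnicodeDecodeError:
    parsed_directly = False

# Decode with the real encoding first, then parse the resulting str.
data = json.loads(raw.decode("iso-8859-1"))
print(parsed_directly, sorted(data))
```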

    Solution for Loading and Parsing any other file formats

    For other file formats, such as logs, you can open the file in binary mode and then continue reading it. If you specify only read mode ('r'), Python opens the file in text mode and decodes its content to a string using a default encoding that depends on the platform, which fails when the file uses a different encoding. In binary mode ('rb') no decoding takes place and you get the raw bytes.

    The same approach works for CSV, log, txt, or Excel files.

    with open(path, 'rb') as f:
        text = f.read()  # bytes, not yet decoded

    Alternatively, you can call decode() on the file content and specify errors='replace', so that undecodable bytes are replaced with the U+FFFD replacement character instead of raising UnicodeDecodeError:

    with open(path, 'rb') as f:
        text = f.read().decode(errors='replace')
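
To compare what the different errors values do when decoding, here is a small sketch. The temporary file and its content are made up for illustration; the byte 0xe9 is Latin-1 for an accented e and is not valid UTF-8 in this position.

```python
import os
import tempfile

# Hypothetical log line written as Latin-1.
with tempfile.NamedTemporaryFile(delete=False, suffix=".log") as tmp:
    tmp.write(b"caf\xe9 opened\n")
    path = tmp.name

with open(path, "rb") as f:
    raw = f.read()

replaced = raw.decode("utf-8", errors="replace")          # bad byte -> U+FFFD
ignored = raw.decode("utf-8", errors="ignore")            # bad byte dropped
escaped = raw.decode("utf-8", errors="backslashreplace")  # bad byte -> \xe9

# Text mode accepts the same errors argument, so binary mode is optional.
with open(path, encoding="utf-8", errors="replace") as f:
    from_text_mode = f.read()

os.remove(path)
print(ascii(replaced), ascii(ignored), ascii(escaped))
```

All three variants lose or alter information, so they suit logs and diagnostics better than data that must round-trip exactly.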

    When you call .decode() on a Unicode string, Python 2 tries to be helpful and first encodes the Unicode string back to bytes (using the default encoding), so that there is something it can actually decode. This implicit encoding step doesn't use errors='replace', so if the Unicode string contains any characters that aren't in the default encoding (probably ASCII), you'll get a UnicodeEncodeError.

    (Python 3 no longer does this as it is terribly confusing.)

    Check the type of message and, assuming it is indeed Unicode, work back from there to find where it was decoded (possibly implicitly), then replace that with the correct decoding.

    Error handling

    The errors parameter accepts several values, some of which are listed below:

    | Error type | Behavior |
    | --- | --- |
    | strict | The default behavior, which raises an exception on failure (UnicodeEncodeError when encoding, UnicodeDecodeError when decoding). |
    | ignore | Drops characters that cannot be encoded from the result. |
    | replace | Replaces every character that cannot be encoded with a question mark (?). |
    | backslashreplace | Inserts a backslash escape sequence (such as \uNNNN) in place of characters that cannot be encoded. |

    Let's look at these concepts with a simple example. We take an input string in which not all characters can be encoded in ASCII (for example, ö):

    a = 'This is a bit möre cömplex sentence.'

    print('Original string:', a)

    print('Encoding with errors=ignore:', a.encode(encoding='ascii', errors='ignore'))
    print('Encoding with errors=replace:', a.encode(encoding='ascii', errors='replace'))

    Output

    Original string: This is a bit möre cömplex sentence.
    Encoding with errors=ignore: b'This is a bit mre cmplex sentence.'
    Encoding with errors=replace: b'This is a bit m?re c?mplex sentence.'

    Python and character encodings

    Python 2 doesn’t give a damn what your strings are encoded as. Latin 1, Latin 2, Shift-JIS, everything is fine. Doesn’t keep track of them, either, that’s up to you!

    Python 2 also has a special Unicode string, where ‘Cat’ would be the normal string and u’Cat’ would be the Unicode version.

    In Python 3, every string is a Unicode string by default, and UTF-8 is the default encoding for source code and for encode() and decode(). This doesn't seem like that big of a change, but it makes a lot of things Just Work that used to be problematic.

    Compare and Contrast

    I put this together as two IPython notebooks, too: Python 2, Python 3.

    Let’s compare what happens if you run the following code in an IPython notebook with Python 2 and Python 3.

    | Command | Python 3 | Python 2 |
    | --- | --- | --- |
    | print('hello world') | hello world | hello world |
    | 'hello world' | 'hello world' | 'hello world' |
    | print('你好世界') | 你好世界 | 你好世界 |
    | '你好世界' | '你好世界' | '\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c' |
    | requests.get("http://djchina.org").text | 资源 | \u8d44\u6e90 |
    | import pandas as pd; utf8_df.to_csv("../output.csv") | Works fine | UnicodeEncodeError: 'ascii' codec can't encode characters |
    | Opening a UTF-8 file with accented characters | Works fine | Horrible errors |

    So more or less, Python 3 does everything right. It opens, saves, and looks at Unicode/UTF-8 perfectly, while Python 2 keeps forgetting it doesn’t care about what your strings are and tries to treat them as ASCII (and throws an error in the process).
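
The difference between characters and bytes in Python 3 can be seen with the same four-character string used in the comparison above. This is a minimal sketch; it only relies on built-in str and bytes behavior.

```python
text = "你好世界"                 # a str: four Unicode code points
encoded = text.encode("utf-8")   # bytes: three bytes per character here

print(len(text), len(encoded))
print(encoded)                   # the escaped bytes Python 2 displayed for the plain string

# Decoding with the same encoding restores the original string.
round_trip = encoded.decode("utf-8")
```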

    Conclusion

    In this tutorial, we learned about Unicode and the unicodedata module, which defines the properties of Unicode characters.

    Solution for decoding the string contents efficiently

    If you encounter UnicodeDecodeError while working with a string variable, you can explicitly encode it to UTF-8, which in turn resolves the error (here text stands for your string variable):

    text.encode('utf-8').strip()
    What does character encoding mean in Python?

    A string is a sequence of Unicode codepoints. These codepoints are converted into a sequence of bytes for efficient storage. This process is called character encoding.

    There are many encodings, such as UTF-8, UTF-16, and ASCII.

    By default, Python 3 uses UTF-8: it is the default encoding for source files and for str.encode() and bytes.decode().
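
As an illustration of how the same code points become different byte sequences, the sketch below encodes one arbitrary sample word with several encodings and also reads the interpreter's default from sys.getdefaultencoding().

```python
import sys

word = "\u00f6range"  # 'örange'

as_utf8 = word.encode("utf-8")       # ö takes two bytes
as_utf16 = word.encode("utf-16-le")  # every character takes two bytes here
as_latin1 = word.encode("latin-1")   # ö takes one byte

try:
    word.encode("ascii")             # ASCII has no code for ö
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

default_encoding = sys.getdefaultencoding()
print(as_utf8, as_latin1, len(as_utf16), ascii_ok, default_encoding)
```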

    The unicodedata module to work with Unicode in Python

    The unicodedata module gives us access to the Unicode Character Database (UCD), which defines the properties of all Unicode characters.

    Let’s look at all the functions defined within the module with a simple example to explain their functionality. We can efficiently use Unicode in Python with the use of the following functions.

    1. unicodedata.lookup(name)

    This function looks up a character by name. If the character is found, it is returned. If it is not found, a KeyError is raised.

    import unicodedata

    print(unicodedata.lookup('LEFT CURLY BRACKET'))
    print(unicodedata.lookup('RIGHT SQUARE BRACKET'))
    print(unicodedata.lookup('ASTERISK'))
    print(unicodedata.lookup('EXCLAMATION MARK'))

    Output:

    {
    ]
    *
    !
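
Because lookup() raises KeyError for unknown names, it can help to wrap it. The sketch below uses a hypothetical helper named safe_lookup and pairs it with unicodedata.name(), which performs the reverse mapping from character to name.

```python
import unicodedata


def safe_lookup(name, default=None):
    """Return the character called name, or default if no such name exists."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return default


brace = safe_lookup("LEFT CURLY BRACKET")
missing = safe_lookup("NO SUCH CHARACTER NAME")
round_trip = unicodedata.name(brace)  # character back to its official name
print(brace, missing, round_trip)
```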

    6. unicodedata.category(chr)

    This function returns the general category assigned to the character chr as a two-letter string. The first letter gives the major class (for example, 'L' for letter) and the second gives the subclass ('u' for uppercase, 'l' for lowercase).

    import unicodedata

    print(unicodedata.category('P'))
    print(unicodedata.category('p'))

    Output:

    Lu
    Ll
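
A possible use of category() is to summarize what kinds of characters a string contains. The sample string below is arbitrary; the category codes in the comments are the standard Unicode general categories.

```python
import unicodedata
from collections import Counter

sample = "Python 3!"

# Lu = uppercase letter, Ll = lowercase letter, Zs = space separator,
# Nd = decimal digit, Po = other punctuation.
counts = Counter(unicodedata.category(ch) for ch in sample)
print(dict(counts))
```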

    7. unicodedata.bidirectional(chr)

    This function returns the bidirectional class assigned to the character chr as a string. An empty string is returned by this function if no such value is defined.

    AL denotes Arabic letter, AN denotes Arabic number and L denotes left to right etc.

    import unicodedata

    print(unicodedata.bidirectional('\u0760'))

    print(unicodedata.bidirectional('\u0560'))

    print(unicodedata.bidirectional('\u0660'))

    Output:

    AL
    L
    AN

    8. unicodedata.combining(chr)

    This function returns the canonical combining class assigned to the given character chr as an integer. It returns 0 if no combining class is defined.

    import unicodedata

    print(unicodedata.combining('\u0317'))

    Output:

    220

    9. unicodedata.mirrored(chr)

    This function returns the mirrored property assigned to the given character chr as an integer. It returns 1 if the character is identified as a 'mirrored' character in bidirectional text; otherwise it returns 0.

    import unicodedata

    print(unicodedata.mirrored('\u0028'))
    print(unicodedata.mirrored('\u0578'))

    Output:

    1
    0

    10. unicodedata.normalize(form, unistr)

    This function returns the normal form of the Unicode string unistr. The valid values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

    from unicodedata import normalize

    print('%r' % normalize('NFD', '\u00C6'))
    print('%r' % normalize('NFC', 'C\u0367'))
    print('%r' % normalize('NFKD', '\u2760'))

    Output:

    'Æ'
    'Cͧ'
    '❠'
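
Normalization matters most when comparing strings that look identical but are built from different code points. The following sketch shows that case and, as one possible application, a hypothetical strip_accents helper that combines normalize() with combining().

```python
import unicodedata

composed = "\u00e9"     # e with acute accent as a single code point
decomposed = "e\u0301"  # plain e followed by a combining acute accent

same_before = composed == decomposed
same_after = unicodedata.normalize("NFC", decomposed) == composed


def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    parts = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in parts if unicodedata.combining(ch) == 0)


plain = strip_accents("c\u00f6mplex")
ligature = unicodedata.normalize("NFKD", "\ufb01")  # compatibility form of the fi ligature
print(same_before, same_after, plain, ligature)
```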

    Solution 3 – using requests.response content property

    So far, the code in this article used r.text, which contains the response body as a string. We can skip the encoding step altogether by using r.content instead, as this property already contains the response body in bytes. We then simply call decode() on r.content:

    decoded_data = r.content.decode('utf-8-sig')
    data = json.loads(decoded_data)

    Working with Python 2

    If you can’t use Python 3, you can try to make things work with Python 2. You have two main saviors in Python 2:

    1. The codecs library

    codecs allows you to specify an encoding when opening files for reading and writing. You can open a UTF-8 file for reading like so:

    import codecs
    opened = codecs.open("filename.txt", "r", "utf-8")
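
For comparison, in Python 3 the built-in open() accepts an encoding argument directly, so codecs.open() is not needed for this; on Python 2.6 and later, io.open() offers the same interface. A minimal sketch, using a temporary directory and made-up content:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "filename.txt")

    # Write and read with an explicit encoding instead of the platform default.
    with open(path, "w", encoding="utf-8") as f:
        f.write("m\u00f6re c\u00f6mplex\n")

    with open(path, "r", encoding="utf-8") as f:
        content = f.read()

print(ascii(content))
```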

    2. Hacking sys

    The following code forces Python 2 to use UTF-8. It’s very discouraged, but it totally works better than anything else.

    import sys
    reload(sys)
    sys.setdefaultencoding("utf-8")

    The big issue that comes up is that you can’t use print from IPython Notebook any more (it prints to the command line, not to your notebook). There are other issues, but

    4. .encode and .decode

    Come on, don’t do this to yourself. Just move to Python 3! You could also try to read this summary if you are especially masochistic.

    Decoding a stream of bytes

    Just as we encode a string, we can decode a stream of bytes into a string object using the decode() function.

    Format:

    encoded = input_string.encode()
    # Using decode()
    decoded = encoded.decode(decoding, errors)

    Since encode() converts a string to bytes, decode() simply does the reverse.

    byte_seq = b'Hello'
    decoded_string = byte_seq.decode()
    print(type(decoded_string))
    print(decoded_string)

    Output

    <class 'str'>
    Hello

    This shows that decode() converts bytes to a Python string.

    As with the parameters of encode(), the decoding parameter specifies the encoding from which the byte sequence is decoded. The errors parameter determines the behavior when decoding fails and accepts the same values as in encode().
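
Decoding with the wrong encoding does not always raise an error; sometimes it silently produces garbled text. The sketch below (sample word chosen arbitrarily) decodes UTF-8 bytes as Latin-1 and then shows one way to undo that specific mistake.

```python
encoded = "\u00f6range".encode("utf-8")  # b'\xc3\xb6range'

right = encoded.decode("utf-8")          # the original word
wrong = encoded.decode("latin-1")        # no exception, but two wrong characters

# Reversing the wrong step recovers the bytes, which can then be decoded properly.
repaired = wrong.encode("latin-1").decode("utf-8")
print(ascii(right), ascii(wrong), ascii(repaired))
```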

    What is UTF-8 Encoding?

    UTF-8 is the most popular and commonly used character encoding. UTF stands for Unicode Transformation Format, and '8' means that 8-bit values are used in the encoding.

    It replaced ASCII (American Standard Code for Information Interchange) because it covers far more characters and can be used for languages around the world, whereas ASCII is limited to 128 characters: unaccented English letters, digits, punctuation, and control codes.

    The first 128 Unicode code points are encoded in UTF-8 exactly as in ASCII. A character in UTF-8 can be from 1 to 4 bytes long.
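
The 1-to-4-byte range can be checked with one character from each length class. The four sample characters below are arbitrary picks.

```python
# One sample per length class: A, e-acute, euro sign, grinning-face emoji.
chars = ["A", "\u00e9", "\u20ac", "\U0001F600"]

lengths = [len(ch.encode("utf-8")) for ch in chars]
print(lengths)

# For code points below 128 the UTF-8 bytes equal the ASCII bytes.
ascii_compatible = all(
    chr(i).encode("utf-8") == chr(i).encode("ascii") for i in range(128)
)
```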

    1. Encode a string to UTF-8 encoding

    string = 'örange'
    print('The string is:', string)
    string_utf = string.encode()
    print('The encoded string is:', string_utf)

    Output:

    The string is: örange
    The encoded string is: b'\xc3\xb6range'

    2. Encoding with error parameter

    Let us encode the German word weiß, which means white.

    string = 'weiß'

    x = string.encode(encoding='ascii', errors='backslashreplace')
    print(x)

    x = string.encode(encoding='ascii', errors='ignore')
    print(x)

    x = string.encode(encoding='ascii', errors='namereplace')
    print(x)

    x = string.encode(encoding='ascii', errors='replace')
    print(x)

    x = string.encode(encoding='ascii', errors='xmlcharrefreplace')
    print(x)

    x = string.encode(encoding='UTF-8', errors='strict')
    print(x)

    Output:

    b'wei\\xdf'
    b'wei'
    b'wei\\N{LATIN SMALL LETTER SHARP S}'
    b'wei?'
    b'wei&#223;'
    b'wei\xc3\x9f'

    What is utf-8-sig?

    The utf-8-sig is a Python variant of UTF-8, in which, when used in encoding, the BOM value will be written before anything else, while when used during decoding, it will skip the UTF-8 BOM character if it exists and this is exactly what I needed.

    So the solution is simple. We just need to decode the data using utf-8-sig encoding, which will get rid of the BOM value. There are several ways to accomplish that.
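
The effect of utf-8-sig can be shown on a small made-up payload that starts with a BOM. In this sketch, decoding as plain UTF-8 keeps the BOM as the character U+FEFF, which json.loads rejects for string input, while utf-8-sig strips it.

```python
import codecs
import json

# Hypothetical response body: a UTF-8 BOM followed by JSON text.
payload = codecs.BOM_UTF8 + b'{"ok": true}'

with_bom = payload.decode("utf-8")         # BOM survives as '\ufeff'
without_bom = payload.decode("utf-8-sig")  # BOM is skipped

try:
    json.loads(with_bom)
    bom_accepted = True
except json.JSONDecodeError:
    bom_accepted = False

data = json.loads(without_bom)
print(bom_accepted, data)
```

When encoding, the same codec writes the BOM first, so "x".encode("utf-8-sig") starts with codecs.BOM_UTF8.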

    How to interpret ASCII and Unicode in Python?

    Python provides a string module that contains various constants and tools for working with strings. The constants shown below cover the ASCII character set.

    import string

    print(string.ascii_lowercase)
    print(string.ascii_uppercase)
    print(string.ascii_letters)
    print(string.digits)
    print(string.hexdigits)
    print(string.octdigits)
    print(string.whitespace)
    print(string.punctuation)

    Output:

    abcdefghijklmnopqrstuvwxyz
    ABCDEFGHIJKLMNOPQRSTUVWXYZ
    abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
    0123456789
    0123456789abcdefABCDEF
    01234567

    !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

    We can create one-character Unicode strings with the built-in chr() function. It takes a single integer as its argument and returns the character with that code point.

    Similarly, ord() is a built-in function that takes a one-character Unicode string as input and returns its code point value.

    chr(57344)
    ord('\ue000')

    Output:

    '\ue000'
    57344
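
chr() and ord() are inverses of each other over the valid code point range, which ends at 0x10FFFF. A short sketch of the round trip and of the error raised beyond that limit:

```python
code_point = ord("\u00f6")   # code point of the letter o with diaeresis
character = chr(code_point)  # back to the one-character string

last_valid = chr(0x10FFFF)   # highest code point Unicode defines

try:
    chr(0x110000)            # one past the end of the Unicode range
    out_of_range = False
except ValueError:
    out_of_range = True

print(code_point, ascii(character), out_of_range)
```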

    Sources

    • https://itsmycode.com/unicodedecodeerror-utf8-codec-cant-decode-byte-0xa5-in-position-0-invalid-start-byte/
    • https://gankrin.org/how-to-enable-utf-8-in-python/
    • https://qna.habr.com/q/341813
    • https://pythononline.ru/osnovy/encode-decode
    • https://www.jonathansoma.com/tutorials/international-data/python-and-utf-8/
    • https://www.askpython.com/python-modules/unicode-in-python-unicodedata
    • https://www.howtosolutions.net/2019/04/python-fixing-unexpected-utf-8-bom-error-when-loading-json-data/