
UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0xa5 in position 0: invalid start byte


What is UnicodeDecodeError ‘utf8’ codec can’t decode byte?

A UnicodeDecodeError normally occurs when a byte string is decoded from a particular encoding. Because each encoding maps only a limited set of byte sequences to Unicode characters, an illegal byte sequence causes that encoding's decode() to fail.

When importing and reading a CSV file, Python tries to convert a byte array (bytes, which it assumes to hold UTF-8-encoded text) to a Unicode string (str). This is a decoding step that follows the UTF-8 rules. The error is raised when the decoder meets a byte sequence that is not allowed in UTF-8-encoded text (in the error from the title, the byte 0xa5 at position 0).

Example

import pandas as pd
a = pd.read_csv("filename.csv")

Output

Traceback (most recent call last):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 2: invalid start byte
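The failure can also be reproduced without any file. The following is a minimal sketch, assuming the offending byte is 0xa5 as in the title; the sample bytes, the variable names, and the choice of cp1252 as the real encoding are illustrative assumptions.

```python
# 0xa5 is a UTF-8 continuation byte, so it cannot begin a character.
raw = b"\xa5 100"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # The exception records which codec failed, where, and why.
    error_info = (exc.encoding, exc.start, exc.reason)
    print(error_info)

# The same bytes decode cleanly once the real encoding is named.
# cp1252 is an assumption here; in that code page 0xa5 is the yen sign.
text = raw.decode("cp1252")
print(text)
```

Reading the fields of the exception is often the quickest way to find out which byte is responsible before choosing a fix.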

There are multiple solutions to resolve this issue, and it depends on the different use cases. Let’s look at the most common occurrences, and the solution to each of these use cases.

Solution for Importing and Reading CSV files using Pandas

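If you are using pandas to import and read CSV files, pass the encoding the file was actually saved in (for files produced on Windows this is often cp1252 or latin-1). Setting the encoding to unicode_escape, as shown below, also avoids the UnicodeDecodeError, but it treats the bytes as Latin-1 and interprets backslash sequences as escapes, so prefer the real encoding when you know it.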

import pandas as pd
data = pd.read_csv(r"C:\Employess.csv", encoding="unicode_escape")
print(data.head())
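When the real encoding is unknown, one option is to try a short list of candidates on the raw bytes first. This is a rough sketch using only the standard library; the helper name first_working_encoding, the candidate list, and the sample bytes are assumptions made for illustration.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes raw without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Bytes as they might appear in a CSV saved by a Windows tool:
# 0x96 is an en dash in cp1252 and is not valid UTF-8 on its own.
sample = b"id,range\n1,10\x9620\n"
encoding = first_working_encoding(sample)
print(encoding)
```

The returned name can then be passed on, for example as pd.read_csv(path, encoding=encoding). A successful decode does not prove the guess is right (latin-1 accepts any byte), so it is worth checking a few decoded values by eye.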


Decoding a Stream of Bytes

Similar to encoding a string, we can decode a stream of bytes to a string object, using the decode() function.

Format:

encoded = input_string.encode()
# Using decode()
decoded = encoded.decode(decoding, errors)

Since encode() converts a string to bytes, decode() simply does the reverse.

byte_seq = b'Hello'
decoded_string = byte_seq.decode()
print(type(decoded_string))
print(decoded_string)

Output

<class 'str'>
Hello

This shows that decode() converts bytes to a Python string.

As with encode(), the decoding parameter (named encoding in Python's own signature) selects the encoding from which the byte sequence is decoded. The errors parameter sets the behavior when decoding fails and accepts the same core values as it does for encode(): strict, ignore, replace, and backslashreplace.
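To make the effect of the errors parameter concrete, here is a small sketch that decodes the same invalid bytes with several handlers. The sample bytes and variable names are assumptions chosen for illustration.

```python
# "café" encoded as Latin-1; the trailing 0xe9 byte is not valid UTF-8.
data = b"caf\xe9"

results = {}
for handler in ("ignore", "replace", "backslashreplace"):
    # Each handler decides what happens to the undecodable byte.
    results[handler] = data.decode("utf-8", errors=handler)
    print(handler, "->", ascii(results[handler]))

try:
    data.decode("utf-8")  # errors="strict" is the default
except UnicodeDecodeError:
    strict_failed = True
else:
    strict_failed = False
print("strict raised:", strict_failed)
```

When decoding, replace substitutes the Unicode replacement character U+FFFD rather than a question mark.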


Solution for Loading and Parsing JSON files

If you get a UnicodeDecodeError while reading and parsing a JSON file, the file is not encoded in UTF-8. Most likely it is encoded in ISO-8859-1, so try that encoding when loading the JSON content. The call below is written for Python 2, where unicode() decodes the raw bytes that were read.

json.loads(unicode(opener.open(…).read(), "ISO-8859-1"))
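In Python 3 there is no unicode() function; the usual equivalent is to name the encoding when opening the file. A minimal sketch, assuming the JSON is stored in a local file encoded as ISO-8859-1; the temporary file and its content are made up for the example.

```python
import json
import os
import tempfile

# Write a small ISO-8859-1 encoded JSON file to stand in for the real data.
payload = '{"city": "Málaga"}'
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "wb") as f:
    f.write(payload.encode("iso-8859-1"))

# Reading it as UTF-8 fails, because 0xe1 does not form a valid sequence.
try:
    with open(path, encoding="utf-8") as f:
        json.load(f)
except UnicodeDecodeError:
    utf8_failed = True
else:
    utf8_failed = False

# Naming the real encoding lets json.load() receive properly decoded text.
with open(path, encoding="iso-8859-1") as f:
    data = json.load(f)
print(data["city"])
```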

Solution for Loading and Parsing any other file formats

For other file formats, such as logs, you can open the file in binary mode and then continue with the read operation. In plain read mode ('r'), Python opens the file as text and decodes its content with the default encoding, which fails when the file uses a different one. In binary mode ('rb') no decoding takes place and the content is returned as bytes.

The same approach works for CSV, log, txt, and Excel files.

with open(path, 'rb') as f:
    text = f.read()

Alternatively, call decode() on the file content and pass errors='replace' so that undecodable bytes are replaced instead of raising a UnicodeDecodeError.

with open(path, 'rb') as f:
    text = f.read().decode(errors='replace')
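The same error handling is available directly in text mode, because the built-in open() accepts both encoding and errors. A short sketch, assuming a log file that contains one stray non-UTF-8 byte; the file name and content are invented for the example.

```python
import os
import tempfile

# Create a log file containing a byte (0x96) that is not valid UTF-8.
path = os.path.join(tempfile.mkdtemp(), "app.log")
with open(path, "wb") as f:
    f.write(b"status: ok \x96 done\n")

# Text mode with an explicit encoding; bad bytes become U+FFFD
# instead of raising UnicodeDecodeError.
with open(path, encoding="utf-8", errors="replace") as f:
    text = f.read()
print(ascii(text))
```

This keeps the rest of the code working with str while making the lossy substitution explicit at the point where the file is opened.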

When you call .decode() on a unicode string, Python 2 tries to be helpful and first encodes the Unicode string back to bytes (using the default encoding) so that there is something to decode. This implicit encoding step does not use errors='replace', so if the Unicode string contains characters that are not in the default encoding (probably ASCII), you get a UnicodeEncodeError.

(Python 3 no longer does this, as it is terribly confusing.)

Check the type of message (the value being decoded). If it is indeed already Unicode, work back from there to find where it was decoded (possibly implicitly) and apply the correct decoding at that point.

Solution for decoding the string contents efficiently

If you encounter a UnicodeDecodeError while working with a string variable, you can encode it explicitly as UTF-8 with the encode() method, which in turn resolves the error. In the line below, str stands for your string variable.

str.encode('utf-8').strip()

File system and encoding

When working with files, always specify the encoding explicitly, because the default differs between operating systems.

[code]

import locale

# Default encoding that open() uses when none is given
def_coding = locale.getpreferredencoding()
print(def_coding)

# Create a file and write text to it
file = open('buhlo.txt', 'w')
file.write('party time')
file.close()
print(type(file))

# Specify the encoding explicitly when working with the file
with open('buhlo.txt', encoding='utf-8') as file:
    for text in file:
        print(text, end='')

[/code]



Conversion (decode, encode)

Almost every program works with the network or the file system in one way or another, so working with bytes is unavoidable. The encode and decode methods convert strings to bytes (str -> bytes) and bytes to strings (bytes -> str).

[code]

# From string to bytes (encode)
alco = 'Виски'
alco_in_bytes = alco.encode('utf-8')
print(alco_in_bytes)

# Simple decoding with decode
alco_bytes = b'\x57\x68\x69\x73\x6b\x79'
alco_in_str = alco_bytes.decode('utf-8')
print(alco_in_str)

# encode called through the str class (pass the string to encode and the encoding)
alcostr = 'Виски'
alco_encode = str.encode(alcostr, encoding='utf-8')
print(alco_encode)

# decode called through the bytes class (the encoding is passed as a keyword argument)
bytesstart = b'\x57\x68\x69\x73\x6b\x79'
bytesend = bytes.decode(bytesstart, encoding='utf-8')
print(bytesend)

[/code]

Result:

b'\xd0\x92\xd0\xb8\xd1\x81\xd0\xba\xd0\xb8'
Whisky
b'\xd0\x92\xd0\xb8\xd1\x81\xd0\xba\xd0\xb8'
Whisky

The encode method belongs to str and the decode method belongs to bytes. Each takes the name of the encoding as its argument; when the method is called through the class, as in str.encode(...), the value to convert is passed first.



Error handling

The errors parameter accepts several values, some of which are listed below:

Error type          Behavior
------------------  --------------------------------------------------------
strict              Default behavior; raises UnicodeEncodeError (or UnicodeDecodeError when decoding) on failure.
ignore              Drops the characters that cannot be encoded from the result.
replace             Replaces every character that cannot be encoded with a question mark (?).
backslashreplace    Inserts a backslash escape sequence (\uNNNN) in place of the characters that cannot be encoded.

Let's look at these concepts in a simple example. We take an input string in which not every character can be encoded in ASCII (for example, ö).

a = 'This is a bit möre cömplex sentence.'

print('Original string:', a)

print('Encoding with errors=ignore:', a.encode(encoding='ascii', errors='ignore'))
print('Encoding with errors=replace:', a.encode(encoding='ascii', errors='replace'))

Output

Original string: This is a bit möre cömplex sentence.
Encoding with errors=ignore: b'This is a bit mre cmplex sentence.'
Encoding with errors=replace: b'This is a bit m?re c?mplex sentence.'
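The remaining handlers for encode() can be compared side by side in the same way. This is a small sketch with a shortened, illustrative input string; the dictionary name encoded is an assumption.

```python
a = "möre"

handlers = ("ignore", "replace", "xmlcharrefreplace",
            "backslashreplace", "namereplace")

# Encode the same text to ASCII once per handler and keep the results.
encoded = {name: a.encode("ascii", errors=name) for name in handlers}

for name, value in encoded.items():
    print(name, "->", value)
```

xmlcharrefreplace and namereplace are available only when encoding, while the others also work with decode().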

Examples to remove Unicode characters

Here we discuss different ways to remove non-ASCII Unicode characters from a string:

1. Using encode() and decode() method

In this example, we use the encode() and decode() methods to remove the Unicode characters from the string. encode() converts the string to 'ASCII' with the error handler set to 'ignore', which drops every character that has no ASCII representation. decode() then converts the resulting bytes back into a string. Let us look at the example to understand the concept in detail.

#input string
text = "This is Python \u500cPool"

#encode() method
strencode = text.encode("ascii", "ignore")

#decode() method
strdecode = strencode.decode()

#output
print("Output after removing Unicode characters : ", strdecode)

Output:

Output after removing Unicode characters :  This is Python Pool

Explanation:

  • First, we take an input string in the variable named text.
  • Then, we apply the encode() method, which encodes the string into 'ASCII' with the error handler 'ignore' to remove the Unicode characters.
  • After that, we apply the decode() method, which converts the byte string back into the normal string format.
  • Finally, we print the output.
  • The output string no longer contains the non-ASCII characters.
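The encode/decode steps above can be wrapped in a helper. The sketch below also adds an optional Unicode normalization step (NFKD via the standard unicodedata module), which is a swapped-in extra not used in the example above: it keeps the base letter of accented characters instead of dropping them. The function name to_ascii and its parameter are hypothetical.

```python
import unicodedata


def to_ascii(text, keep_base_letters=False):
    """Drop non-ASCII characters; optionally keep unaccented base letters."""
    if keep_base_letters:
        # NFKD splits "ö" into "o" plus a combining mark, so only the
        # mark is discarded by the ASCII encoding step below.
        text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")


plain = to_ascii("This is Python \u500cPool")
dropped = to_ascii("möre cömplex")
kept = to_ascii("möre cömplex", keep_base_letters=True)
print(plain, dropped, kept, sep="\n")
```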

2. Using replace() method to remove Unicode characters

In this example, we use the replace() method to remove a Unicode character from the string. When you need to remove one particular Unicode character, the replace() method of the string removes every occurrence of it. Let us look at the example to understand the concept in detail.

#input string
text = "This is Python \u300cPool"

#replace() method
strreplaced = text.replace('\u300c', '')

#output
print("Output after removing Unicode characters : ", strreplaced)

Output:

Output after removing Unicode characters :  This is Python Pool

Explanation:

  • First, we take an input string in the variable named text.
  • Then, we apply the replace() method, in which we replace the particular Unicode character with an empty string.
  • Finally, we print the output.
  • The output string no longer contains that Unicode character.

Importance of encoding

Since encoding and decoding an input string depends on the format, we must be careful when encoding/decoding. If we use the wrong format, it will result in the wrong output and can give rise to errors.

The below snippet shows the importance of encoding and decoding.

The first decoding is incorrect, because it uses ASCII to decode bytes that were encoded in UTF-8. The second one is correct, since the encoding and decoding formats are the same.

a = 'This is a bit möre cömplex sentence.'

print('Original string:', a)

# Encoding in UTF-8
encoded_bytes = a.encode('utf-8', 'replace')

# Trying to decode via ASCII, which is incorrect
decoded_incorrect = encoded_bytes.decode('ascii', 'replace')
decoded_correct = encoded_bytes.decode('utf-8', 'replace')

print('Incorrectly Decoded string:', decoded_incorrect)
print('Correctly Decoded string:', decoded_correct)

Output

Original string: This is a bit möre cömplex sentence.
Incorrectly Decoded string: This is a bit m��re c��mplex sentence.
Correctly Decoded string: This is a bit möre cömplex sentence.
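A wrong codec does not always raise an error or produce replacement characters; some codecs accept any byte and silently return the wrong text. The sketch below assumes Latin-1 as the mistaken codec and shows that, in this particular case, the damage can be reversed; the variable names are illustrative.

```python
original = "möre"
utf8_bytes = original.encode("utf-8")

# Latin-1 maps every byte to a character, so no error is raised,
# but the two bytes of "ö" turn into two unrelated characters.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)

# Reversing the wrong step recovers the original bytes and text.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)
```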

The subprocess module

The subprocess module is responsible for spawning new processes, connecting to their standard input, standard output and standard error streams, and obtaining their return codes. As an example we use ping, a standard command from cmd.

In this example subprocess returns each output line as bytes in the cp866 encoding (the console code page on Russian-language Windows). Each line is decoded into a string, re-encoded as UTF-8 bytes, and decoded once more so that it can be handled as a string. The first decode already yields a str; the remaining two steps are a round trip through UTF-8.

The full sequence of steps:

  1. cp866 bytes -> string (str).
  2. String (str) -> UTF-8 bytes.
  3. UTF-8 bytes -> string (str).

[code]

import subprocess

args = ['ping', 'vk.com']
ping = subprocess.Popen(args, stdout=subprocess.PIPE)
for result in ping.stdout:
    # cp866 bytes -> str -> UTF-8 bytes
    result = result.decode('cp866').encode('utf-8')
    # UTF-8 bytes -> str
    print(result.decode('utf-8'))

[/code]
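In Python 3.6 and later, subprocess can do the decoding itself through its encoding argument. The following is a hedged sketch rather than a drop-in replacement: a child Python process stands in for ping so that the example runs on any system, and the child code and variable names are assumptions.

```python
import subprocess
import sys

# The child writes the word "пинг" to stdout as cp866 bytes.
# Escapes keep the command line ASCII-only; the child interprets them.
child_code = (
    r"import sys; "
    r"sys.stdout.buffer.write('\u043f\u0438\u043d\u0433'.encode('cp866'))"
)

completed = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    encoding="cp866",  # decode the child's output while reading it
)

# stdout is already a str, so no manual decode() call is needed.
print(completed.stdout)
```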


Conclusion

In this article, we learned how to use the encode() and decode() methods to encode an input string and decode an encoded byte sequence.

We also learned how errors in encoding and decoding are handled via the errors parameter. Keep in mind that encoding is not encryption: it only changes the byte representation of text, so it offers no protection for sensitive data such as passwords.

What are Unicode characters?

Unicode is an international character encoding standard that is widely adopted around the world. It covers many languages and scripts and assigns each letter, digit, or symbol a unique numeric value (a code point) that stays the same across platforms and programs.

The UTF-8 encoding

To send data over a network, text has to be converted to bytes. That is the job of UTF-8, one of the encodings defined for Unicode. It is a variable-length encoding: instead of always using one byte, UTF-8 uses from 1 to 4 bytes per character.
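The variable length is easy to observe by encoding single characters and counting the resulting bytes. The sample characters below are chosen for illustration only.

```python
# One character from each UTF-8 length class: 1, 2, 3 and 4 bytes.
samples = ["A", "é", "€", "😀"]

# Map each character to the number of bytes its UTF-8 form occupies.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, size in lengths.items():
    print(repr(ch), "->", size, "byte(s)")
```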


Fix – UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0':

This error is quite common when dealing with Unicode characters, for example when you fetch or crawl data from web pages on different sites.

Let’s understand why this problem is happening –


  • When you use Python's string functions or print text, Python applies the default character encoding.
    • If you check the sys.stdout.encoding value, it is sometimes None.
    • On Linux the default is stored in /etc/default/locale.
    • It is defined by the variables LANG, LC_ALL and LC_CTYPE.
    • Check which values are set for these variables.
    • For example, if the default is UTF-8, these would be LANG="en_US.UTF-8", LC_ALL="en_US.UTF-8", LC_CTYPE="en_US.UTF-8".
  • Now assume the default encoding is "XYZ". Python then tries to encode the input text using this encoding.
  • Assume some of that text contains characters outside the ASCII range.
  • If the default character encoding cannot represent those characters, the error is raised.
  • To handle this, you have to tell Python the right encoding so it knows how to handle them.
  • A standard choice is "UTF-8", which works in most cases.
  • There are also other ways to work around or ignore the error, shown further on.
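The situation described in the list above can be reproduced directly by encoding text that contains a non-breaking space (u'\xa0') with the ASCII codec. The sample string and variable names are assumptions for illustration.

```python
# "\xa0" is a non-breaking space, a character that often arrives
# with text scraped from web pages.
text = "price:\xa0100"

try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # The exception names the codec, the failing position and the reason.
    failure = (exc.encoding, exc.start, exc.reason)
    print(failure)

# UTF-8 can represent the character, so the same call succeeds.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```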

The constants in Python's string module cover the following ASCII characters:

    whitespace = ' \t\n\r\v\f'
    ascii_lowercase = 'abcdefghijklmnopqrstuvwxyz'
    ascii_uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    ascii_letters = ascii_lowercase + ascii_uppercase
    digits = '0123456789'
    hexdigits = digits + 'abcdef' + 'ABCDEF'
    octdigits = '01234567'
    punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
    printable = digits + ascii_letters + punctuation + whitespace

Fix –

  • Set the Python I/O encoding to UTF-8. This applies the fix for the current shell session.

$ export PYTHONIOENCODING=utf8


  • Set the environment variables correctly in /etc/default/locale. This sets the system's default locale encoding to UTF-8.

LANG="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"

Or use the command line:

export LANG="en_US.UTF-8"
export LC_ALL="en_US.UTF-8"
export LC_CTYPE="en_US.UTF-8"

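  • Set the encoding at code level.

str1 = 'abcd\xa0é'  # illustrative sample text containing non-ASCII characters
str2 = str1.encode('utf-8')
print(str1.encode('utf-8'))
print(str2)

str1 = 'abcd\xa0é'  # illustrative sample text containing non-ASCII characters
str2 = str1.encode('utf-8', 'ignore').decode('utf-8')
print(str2)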

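  • Set the encoding using sys (Python 2 only; reload() is no longer a builtin and sys.setdefaultencoding() does not exist in Python 3)

# encoding=utf8
from __future__ import unicode_literals
import sys
reload(sys)
sys.setdefaultencoding('utf8')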
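Python 3 has no sys.setdefaultencoding(), but since Python 3.7 a text stream can be switched to another encoding with its reconfigure() method, which is also available on sys.stdout when it is a regular text stream. The sketch below uses an in-memory stream as a stand-in for stdout so that it behaves the same everywhere; that substitution and the variable names are assumptions.

```python
import io

# An in-memory text stream that, like a misconfigured stdout,
# can only encode ASCII.
stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")

try:
    stream.write("caf\xe9")
except UnicodeEncodeError:
    ascii_failed = True
else:
    ascii_failed = False

# Switch the stream to UTF-8 (Python 3.7+) and write again.
stream.reconfigure(encoding="utf-8")
stream.write("caf\xe9")
stream.flush()

written = stream.buffer.getvalue()
print(ascii_failed, written)
```

For the real standard output the corresponding call would be sys.stdout.reconfigure(encoding="utf-8"), assuming sys.stdout has not been replaced by an object without that method.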

  • Set the encoding using locale

import os
import locale
# PYTHONIOENCODING is read at interpreter startup, so this affects child processes
os.environ["PYTHONIOENCODING"] = "utf-8"
scriptLocale = locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")

  • Declare the source file encoding with an Emacs-style coding comment

#!/usr/bin/env python
# -*- coding: utf-8 -*-
u = 'abcdé'
print(ord(u[-1]))

Either of the following headers can be used:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

#!/usr/bin/env python
# coding: utf8

  • If you can safely drop the non-ASCII characters because you do not need them, you can also use the option below. In this example, str2 no longer contains any non-ASCII characters (they are ignored and dropped).

str2 = str1.encode('ascii', 'ignore').decode('ascii')
print(str2)


  • Use codecs for file operations: codecs.open() with an encoding argument reads and writes files to and from Unicode. The encoding can be utf-8, utf-16, utf-32, and so on.

import codecs
opened = codecs.open("inputfile.txt", "r", "utf-8")
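In Python 3 the built-in open() accepts the same encoding argument, so codecs.open() is not strictly needed. A minimal sketch, assuming a throwaway file in a temporary directory; the file name and text are made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "inputfile.txt")
original = "naïve café"

# Write the text as UTF-16; the encoding is named explicitly.
with open(path, "w", encoding="utf-16") as f:
    f.write(original)

# Read it back with the same encoding to get the same text.
with open(path, "r", encoding="utf-16") as f:
    round_trip = f.read()

print(round_trip)
```

Naming the same encoding on both the writing and the reading side is what makes the round trip safe.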

Additional points:

  • In Python 3, UTF-8 is the default source encoding.
  • The encode() method converts a Unicode string to bytes (it returns a bytes representation of the string). Various encode() options:
    • encode('ascii', 'ignore')
    • encode('ascii', 'replace')
    • encode('ascii', 'xmlcharrefreplace')
    • encode('ascii', 'backslashreplace')
    • encode('ascii', 'namereplace')
  • The decode() method converts bytes to a string. It takes an encoding argument, such as UTF-8, and optionally an errors argument. The errors argument (e.g. "ignore") specifies what happens when the bytes cannot be converted with the given encoding. Various decode() options:
    • decode("utf-8", "strict")
    • decode("utf-8", "replace")
    • decode("utf-8", "backslashreplace")
    • decode("utf-8", "ignore")
  • UTF-8 properties:
    • It can handle any Unicode code point.
    • A string of ASCII text is also valid UTF-8 text.
    • UTF-8 is a byte-oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes. This avoids the byte-ordering issues that can occur with integer- and word-oriented encodings, like UTF-16 and UTF-32, where the sequence of bytes varies depending on the hardware on which the string was encoded.
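The byte-ordering point can be seen by encoding one character with the little-endian and big-endian variants of UTF-16 and comparing the result with UTF-8. The variable names are illustrative.

```python
# The same character in both UTF-16 byte orders and in UTF-8.
little = "A".encode("utf-16-le")
big = "A".encode("utf-16-be")
utf8 = "A".encode("utf-8")

print(little, big, utf8)

# UTF-16 yields two different byte sequences for one character,
# while UTF-8 has a single byte sequence regardless of hardware.
order_matters = little != big
```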

Hope this helps to solve the issue.
