encode: converts str to bytes
decode: converts bytes to str

Both operations need an encoding,
and the same str object is converted to a different bytes object depending on the encoding used.

Note that equal str objects always have the same Unicode code points.
The bytes object they map to, however, depends on the encoding.
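
As a small sketch of this point, the same one-character str can be encoded three ways. The sample character "가" and the variable names are chosen here only for illustration and do not come from the original post.

```python
# Illustrative sample character: U+AC00 (HANGUL SYLLABLE GA).
text = "\uac00"

# The code point belongs to the str and does not depend on any encoding.
code_point = ord(text)
print(hex(code_point))  # 0xac00

# The resulting bytes depend on the encoding that is chosen.
utf8_bytes = text.encode("utf-8")
cp949_bytes = text.encode("cp949")
utf16_bytes = text.encode("utf-16-le")

print(utf8_bytes)   # b'\xea\xb0\x80'
print(cp949_bytes)  # b'\xb0\xa1'
print(utf16_bytes)  # b'\x00\xac'
```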

Note: what encoding means in Unicode

In Unicode, a string is a sequence of code points.
To store it, transmit it, or hold it in memory, the code points must be converted into code units, which are ultimately binary data.

In other words, Unicode encoding means
converting a sequence of code points
into the corresponding bytes object.

Unicode defines several encodings, such as UTF-8, UTF-16, and UTF-32. (Of these, Python uses utf-8 as the default for encode and decode.)
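
The idea can be sketched as follows: a str is inspected as code points, then turned into bytes with three Unicode encodings. The sample string and variable names are assumptions made for this illustration.

```python
import sys

text = "A\u2603"  # 'A' followed by the snowman character

# A str is a sequence of code points.
code_points = [hex(ord(ch)) for ch in text]
print(code_points)  # ['0x41', '0x2603']

# Each Unicode encoding turns those code points into a different byte sequence.
utf8_bytes = text.encode("utf-8")       # 1 byte for 'A', 3 bytes for the snowman
utf16_bytes = text.encode("utf-16-le")  # 2 bytes per character in this string
utf32_bytes = text.encode("utf-32-le")  # always 4 bytes per character
print(len(utf8_bytes), len(utf16_bytes), len(utf32_bytes))  # 4 4 8

# Default encoding used by str.encode() and bytes.decode().
default_encoding = sys.getdefaultencoding()
print(default_encoding)  # utf-8
```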

encode

The following takes a str object holding the snowman character and obtains a bytes object through the encode method.

The encoding is passed as an argument (if omitted, the default utf-8 is used).

snowman_char = "\u2603"
print(snowman_char)       # ☃
print(type(snowman_char)) # <class 'str'>
len(snowman_char)         # 1

code_bytes = snowman_char.encode('utf8')
print(type(code_bytes)) # <class 'bytes'>
code_bytes              # b'\xe2\x98\x83'

Many other encodings can be applied, such as cp949, ascii, cp1252, and latin-1.
However, if the str object contains a character that the encoding does not support, UnicodeEncodeError is raised.

The snowman character above is not defined in ASCII, so it cannot be encoded as ASCII.

code_bytes = snowman_char.encode('ascii')
print(type(code_bytes))
code_bytes

This raises the following UnicodeEncodeError.

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-54-2c76e369edaf> in <cell line: 1>()
----> 1 code_bytes = snowman_char.encode('ascii')
      2 print(type(code_bytes))
      3 code_bytes

UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 0: ordinal not in range(128)
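
When the error should be handled rather than shown as a traceback, it can be caught. The sketch below is one possible way to do so; the variable names are hypothetical, while encoding, object, start, end, and reason are standard attributes of UnicodeEncodeError.

```python
snowman_char = "\u2603"

try:
    snowman_char.encode("ascii")
except UnicodeEncodeError as exc:
    failed_encoding = exc.encoding                # codec that rejected the text
    failed_slice = exc.object[exc.start:exc.end]  # the offending character(s)
    reason = exc.reason
    # ascii() keeps the output printable on any terminal.
    print(failed_encoding, ascii(failed_slice), reason)
```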

The errors parameter of encode can be used to keep this error from being raised.

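strict : the default. Raises UnicodeEncodeError when a character is not supported by the encoding.
ignore : drops characters the encoding does not support, so they do not appear in the output at all.
replace : replaces unsupported characters with ?.
backslashreplace : replaces unsupported characters with a backslash escape sequence: \xNN, \uNNNN, or \UNNNNNNNN depending on the code point (N is a hexadecimal digit).
xmlcharrefreplace : replaces unsupported characters with the corresponding XML/HTML numeric character reference (&#N;, where N is the decimal code point). This form is used in HTML documents, so a web browser can render the character correctly as long as a font provides it.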

See the following examples.

Example: ignore

code_bytes = snowman_char.encode('ascii',errors='ignore')
print(type(code_bytes))
code_bytes

The result is as follows.

<class 'bytes'>
b''

Example: replace

code_bytes = snowman_char.encode('ascii', errors='replace')
print(type(code_bytes))
code_bytes

The result is as follows.

<class 'bytes'>
b'?'

Example: backslashreplace

code_bytes = snowman_char.encode('ascii', errors='backslashreplace')
print(type(code_bytes))
code_bytes

The result is as follows.

<class 'bytes'>
b'\\u2603'

Example: xmlcharrefreplace

# Use a numeric character reference to produce ASCII-only bytes that are safe to embed in HTML.
code_bytes = snowman_char.encode('ascii', errors='xmlcharrefreplace')
print(type(code_bytes))
code_bytes

The result is as follows.

<class 'bytes'>
b'&#9731;'
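
To compare the handlers side by side, the sketch below applies each one to a string that mixes ASCII letters with the snowman. The sample string and the two extra characters (U+00E9 and U+1F600) are assumptions picked here to show the different escape lengths of backslashreplace.

```python
text = "snow\u2603man"  # illustrative string: ASCII letters plus the snowman

handlers = ["ignore", "replace", "backslashreplace", "xmlcharrefreplace"]
results = {name: text.encode("ascii", errors=name) for name in handlers}
for name, encoded in results.items():
    print(f"{name:18} {encoded!r}")

# backslashreplace picks the escape form from the size of the code point.
short_escape = "\u00e9".encode("ascii", errors="backslashreplace")     # U+00E9
long_escape = "\U0001F600".encode("ascii", errors="backslashreplace")  # U+1F600
print(short_escape, long_escape)
```

Only the unsupported character is affected; the ASCII letters around it are encoded as usual.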

decode

The counterpart of encode, provided as a method of bytes objects.

It produces a str object from a bytes object.
Like encode, it accepts the encoding as an argument.

snowman_char = "\u2603"
code_bytes = snowman_char.encode('utf-8')
print(type(code_bytes)) #bytes
code_bytes

decoded_str = code_bytes.decode('utf8')
print(type(decoded_str)) #str
decoded_str

code_bytes was encoded with utf8, so decoding returns the corresponding str object without problems.

The result is as follows.

<class 'str'>
☃

However, if a different encoding is used for decoding and the bytes contain values that encoding does not support at all, UnicodeDecodeError is raised.

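In the code below, the bytes 0xE2, 0x98, and 0x83 that represent U+2603 in utf-8 all lie outside the ASCII range (0 to 127), so an error is raised.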

code_bytes.decode('ascii')

As with encode,
passing ignore as the errors argument prevents this error (the undecodable bytes are then dropped).
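
A minimal sketch of both behaviors, assuming the same snowman bytes as above: the first part catches the UnicodeDecodeError, and the second part shows what three of the errors handlers return. Variable names are hypothetical.

```python
code_bytes = "\u2603".encode("utf-8")  # b'\xe2\x98\x83'

try:
    code_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    bad_byte = exc.object[exc.start:exc.end]  # first byte ASCII cannot decode
    print(exc.encoding, bad_byte, exc.reason)

ignored = code_bytes.decode("ascii", errors="ignore")            # bytes dropped
replaced = code_bytes.decode("ascii", errors="replace")          # U+FFFD per bad byte
escaped = code_bytes.decode("ascii", errors="backslashreplace")  # \xNN escapes
print(ascii(ignored), ascii(replaced), ascii(escaped))
```

None of these handlers recovers the snowman; they only avoid the exception.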

Even when UnicodeDecodeError is not raised, decoding with a different encoding can produce the wrong characters (the other encoding interprets the same bytes differently).

See the following example.

code_bytes.decode('cp1252') # 'windows-1252'

In this case the result is as follows.

â˜ƒ

Characters other than the original one came out.

In short, always use the same encoding for encode and decode.
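
The mismatch and the matching round trip can be sketched as follows. The recovery step is an illustration added here, not something the original post covers, and it works only when the wrong decode happened to be lossless for the bytes involved.

```python
original = "\u2603"
utf8_bytes = original.encode("utf-8")

# Decoding with the wrong encoding raises no error here, but the text is garbled.
garbled = utf8_bytes.decode("cp1252")
print(ascii(garbled))  # '\xe2\u02dc\u0192'

# Undo the wrong step, then decode with the encoding that was actually used.
recovered = garbled.encode("cp1252").decode("utf-8")
print(recovered == original)  # True

# Using the same encoding on both sides round-trips the text.
round_trip = original.encode("utf-16").decode("utf-16")
print(round_trip == original)  # True
```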

Further reading

Unicode HOWTO: https://docs.python.org/3/howto/unicode.html

Codes for Characters (BME228): https://dsaint31.me/mkdocs_site/CE/ch01/code_for_character/


ds31x

[Python] Unicode and Python: encode and decode
