urllib.parse.quote, urllib.parse.urlencode


https://x.com/python_tip/status/1459082270455783425

urllib.parse.quote converts Korean (and other non-ASCII) characters and special characters in a URL into their percent-encoded form.

If you are not familiar with URLs, see the following:

https://dsaint31.tistory.com/entry/CE-URL-URI-and-UNC

[CE] URL, URI and UNC


Basic syntax

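urllib.parse.quote(string, safe='/', encoding=None, errors=None)

string: the string to encode (str or bytes)
safe: additional characters that should not be encoded (default: '/'); letters, digits, and the characters _.-~ are never encoded
encoding: character encoding applied to str input (default: None, which is treated as 'utf-8')
errors: how to handle characters that cannot be encoded (default: None, which is treated as 'strict')
- 'strict': raise an exception (UnicodeEncodeError).
- 'ignore': drop the characters that cannot be encoded.
- 'replace': replace the offending character with ?.
- 'xmlcharrefreplace': replace it with an XML character reference of the form &#number;
- 'backslashreplace': replace it with a backslash escape such as \xNN or \uXXXX (X: hexadecimal digit)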

This mirrors the error handling of str.encode:
2024.01.16 - [Python] - [Python] Unicode and Python: encode and decode
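
The error handlers listed above can be tried out with a small sketch. It assumes encoding='ascii' purely for the demonstration: with the default UTF-8 almost every character is encodable, so errors rarely comes into play. Whatever the handler substitutes is itself percent-encoded afterwards.

```python
from urllib.parse import quote

text = "café"  # 'é' cannot be represented in ASCII

# Handlers that substitute something instead of failing
results = {}
for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace"):
    results[handler] = quote(text, encoding="ascii", errors=handler)
    print(f"{handler:>18}: {results[handler]}")

# 'strict' raises instead of substituting
try:
    quote(text, encoding="ascii", errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(f"strict raised UnicodeEncodeError: {strict_failed}")
```

The '?' produced by 'replace' shows up as %3F, because '?' is not in the safe set.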

Key features and usage tips

1. quote vs quote_plus

quote(): encodes a space as %20
quote_plus(): encodes a space as + (suited to HTML form data)
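
Two further differences are easy to check for yourself: quote_plus() defaults to safe='' (so '/' is encoded as well), and each function has a matching decoder. A minimal sketch with a made-up sample string:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

sample = "a/b c"

q = quote(sample)        # '/' is kept (safe='/'), space becomes %20
qp = quote_plus(sample)  # '/' is encoded (safe=''), space becomes +
print(q, qp)

# unquote_plus() turns '+' back into a space; unquote() leaves '+' alone
print(unquote_plus(qp))
print(unquote("a+b"))
```

Pairing quote_plus() with unquote() (or the reverse) is a common source of stray '+' characters, so it may help to keep the pairs together.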

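2. Common use cases

Encoding file or folder names in a URL path
Encoding search query parameters (the part after ?)
Handling Korean text or special characters in API requests
Building URLs for web scraping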
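
The path and query cases above can be combined in one sketch. The build_url helper, the host, the folder names, and the parameter name are all hypothetical; the point is that each path segment is quoted separately with safe='' so that a '/' inside a name is not mistaken for a separator.

```python
from urllib.parse import quote, unquote, urlencode

def build_url(base, segments, params=None):
    """Hypothetical helper: join quoted path segments and an encoded query onto base."""
    path = "/".join(quote(seg, safe="") for seg in segments)
    url = f"{base.rstrip('/')}/{path}"
    if params:
        url += "?" + urlencode(params)
    return url

segments = ["내 문서", "report 2024/Q1.pdf"]  # second name contains a literal '/'
url = build_url("https://files.example.com", segments, {"dl": "1"})
print(url)

# Splitting on '/' and decoding recovers the original segment names
path_part = url.split("?")[0].removeprefix("https://files.example.com/")
decoded = [unquote(part) for part in path_part.split("/")]
print(decoded)
```

str.removeprefix needs Python 3.9 or later; on older versions slicing off the base length does the same job.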

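3. Cautions

Encoding an already-encoded string encodes it twice (double encoding)
The safe parameter must be set appropriately, or the URL structure will break
Where possible, encode only the parts that need it rather than the whole URL.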
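
A short sketch of the first and third cautions, using made-up inputs. The decode-then-encode step is only one possible guard and assumes the input never contains a literal '%' that is meant to be kept as data.

```python
from urllib.parse import quote, unquote

once = quote("file name.txt")
twice = quote(once)  # the '%' itself is encoded as %25
print(once, twice)

# If the input may already be encoded, decoding first gives a stable result
normalized = quote(unquote(once))
print(normalized)

# Quoting a whole URL mangles the scheme and query delimiters
whole = quote("https://example.com/a b?q=1")
print(whole)
```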

Example code

from urllib.parse import quote

# ------------
# Basic usage
text = "안녕하세요 world!"
encoded = quote(text)
print(f"Original: {text}")
print(f"Encoded: {encoded}")
# Encoded value: %EC%95%88%EB%85%95%ED%95%98%EC%84%B8%EC%9A%94%20world%21

# ------------
# Encoding spaces and special characters
text_with_spaces = "hello world & test"
encoded_spaces = quote(text_with_spaces)
print(f"\nWith spaces: {text_with_spaces}")
print(f"Encoded: {encoded_spaces}")
# Encoded value: hello%20world%20%26%20test

# ------------
# Using the safe parameter (listed characters are left unencoded)
url_path = "/path/to/file name.txt"
encoded_safe = quote(url_path, safe='/')
print(f"\nWith safe='/': {url_path}")
print(f"Encoded: {encoded_safe}")
# Encoded value: /path/to/file%20name.txt

# ------------
# safe='' (the slash is encoded as well)
encoded_all = quote(url_path, safe='')
print(f"\nWith safe='': {url_path}")
print(f"Encoded: {encoded_all}")

# Encoded value: %2Fpath%2Fto%2Ffile%20name.txt

# ------------
# Using quote() for a URL query parameter
# safe='' so that a '/' in the search text would also be encoded
search_query = "파이썬 프로그래밍"
url = f"https://example.com/search?q={quote(search_query, safe='')}"
print(f"\nGenerated URL: {url}")

# Output: https://example.com/search?q=%ED%8C%8C%EC%9D%B4%EC%8D%AC%20%ED%94%84%EB%A1%9C%EA%B7%B8%EB%9E%98%EB%B0%8D

# ------------
# Encoding an email address ('@' kept as-is, '+' encoded)
email = "user+test@example.com"
encoded_email = quote(email, safe='@')
print(f"\nEmail: {email}")
print(f"Encoded: {encoded_email}")

# Encoded value: user%2Btest@example.com

# ------------
# Building a real URL
base_url = "https://api.example.com"
endpoint = "/users/search"
params = {
    "name": "홍길동",
    "city": "서울시",
    "age": "25+"
}

# Encode both keys and values; safe='' so '/' cannot leak into the query unencoded
query_string = "&".join(
    f"{quote(key, safe='')}={quote(str(value), safe='')}" for key, value in params.items()
)
full_url = f"{base_url}{endpoint}?{query_string}"
print(f"\nFinished URL: {full_url}")

# Result: https://api.example.com/users/search?name=%ED%99%8D%EA%B8%B8%EB%8F%99&city=%EC%84%9C%EC%9A%B8%EC%8B%9C&age=25%2B

# ------------
# Comparison with quote_plus
from urllib.parse import quote_plus

text_with_plus = "hello world+test"
print(f"\nOriginal: {text_with_plus}")
print(f"quote(): {quote(text_with_plus)}")
print(f"quote_plus(): {quote_plus(text_with_plus)}")

# Output
# Original: hello world+test
# quote(): hello%20world%2Btest
# quote_plus(): hello+world%2Btest

`urllib.parse.urlencode` and `requests.get(..., params=dict)`

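In practice, if you use the requests module, passing a dict as the params argument of get() handles the encoding for you.
Alternatively, you can use urlencode from the urllib.parse module.

Both are simpler than calling quote() on each value by hand.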

from urllib.parse import urlencode
import requests  # third-party package (pip install requests)

# Method 1: use urlencode() (spaces become '+', as with quote_plus)
data = {'q': '파이썬 프로그래밍', 'category': 'IT'}
query_string = urlencode(data)
url = f"https://example.com/search?{query_string}"
# Result: https://example.com/search?q=%ED%8C%8C%EC%9D%B4%EC%8D%AC+%ED%94%84%EB%A1%9C%EA%B7%B8%EB%9E%98%EB%B0%8D&category=IT

# Method 2: let requests.get() handle it (simpler!)
response = requests.get('https://example.com/search', params=data)
# The query parameters are encoded automatically and the GET request is sent.
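
Two urlencode() options are worth knowing about in addition to the above: doseq=True expands a list value into repeated keys, and quote_via=quote switches the space encoding from + to %20. A minimal standard-library sketch with made-up parameter names:

```python
from urllib.parse import quote, urlencode

data = {"q": "hello world", "tag": ["py", "url"]}

# Default quoting (quote_plus): spaces become '+'
default = urlencode(data, doseq=True)
print(default)

# quote_via=quote: spaces become %20 instead
percent20 = urlencode(data, doseq=True, quote_via=quote)
print(percent20)
```

Which form a server expects depends on the server, so this is something to verify against the API being called.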

Further reading

2023.12.06 - [Python] - [Etc] Token and Tokenizer


https://dsaint31.me/mkdocs_site/CE/ch01/Encoding_for_external_rep/?h=url#url-encoding-percent-encoding
