I'm trying to use the Google Translate API to translate text that is in Kannada (and hence encoded utf-16) to English. When I manually enter the URL with my Google API key plugged in, https://www.googleapis.com/language/translate/v2?key=key#&q=ಚಿಂಚೋಳಿ&source=kn&target=en, I get the translation I want.

The problem is, however, that this URL is utf16 encoded. When I try to open the URL with urllib, I get the error message below. Any advice on how to proceed, or an alternative approach, would be appreciated.

EDIT: I believe the problem can be solved by calling urllib.parse.quote_plus(text) where text is the utf16 text, and replacing the utf16 text with the return value from that function.

Traceback (most recent call last):
  File "<pyshell#19>", line 1, in <module>
    urllib.request.urlopen(url)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 469, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 487, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 447, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 1283, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/http/client.py", line 1061, in request
    self._send_request(method, url, body, headers)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/http/client.py", line 1089, in _send_request
    self.putrequest(method, url, **skips)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/http/client.py", line 953, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 73-79: ordinal not in range(128)
jfs
  • Can you include the output of `print(repr(url))` in the original question? – Aya May 18 '13 at 13:14
  • quote_plus or quote seem to be an option to me. – User May 18 '13 at 18:23
  • look at [how this answer constructs an url from Unicode input](http://stackoverflow.com/a/4546813/4279) i.e., use utf-16 only once when you decode input bytes into Unicode, after that it doesn't matter what the input encoding is. – jfs Dec 21 '13 at 19:15

1 Answer

The problem is, however, that this url is utf16 encoded

UTF-16 doesn't mean what you think it means. It is an encoding of Unicode characters to bytes used internally by the string types of some systems such as the Win32 API. UTF-16 is almost never used on the web because it is not ASCII-compatible.
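
A quick way to see the ASCII incompatibility is to compare the bytes each encoding produces. This is a minimal sketch using only Python's built-in `str.encode`; the sample strings are chosen purely for illustration.

```python
# Compare the bytes that UTF-8 and UTF-16 produce for the same text.
ascii_text = 'kn'
kannada_char = 'ಚ'  # U+0C9A, the first character of the query text

# UTF-8 keeps ASCII characters as single ASCII bytes.
utf8_ascii = ascii_text.encode('utf-8')
# UTF-16 (little-endian, no BOM) uses two bytes per character here,
# so even plain ASCII text picks up NUL bytes.
utf16_ascii = ascii_text.encode('utf-16-le')

print(utf8_ascii)    # b'kn'
print(utf16_ascii)   # b'k\x00n\x00'
print(kannada_char.encode('utf-8'))      # b'\xe0\xb2\x9a'
print(kannada_char.encode('utf-16-le'))  # b'\x9a\x0c'
```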

https://www.googleapis.com/language/translate/v2?key=key#&q=ಚಿಂಚೋಳಿ&source=kn&target=en

This is not a URI - URIs may contain only ASCII characters. It is an IRI, which can contain other Unicode characters.

However, urllib does not support IRIs. Some Python libraries support IRIs directly; alternatively, you can convert any IRI into a corresponding URI that urllib will be happy with. This is done by encoding any non-ASCII characters in the hostname using the IDNA algorithm, and encoding any non-ASCII characters in other parts of the address (including the query parameters) using URL-encoding on the UTF-8 representation of the characters. That gives you this:

https://www.googleapis.com/language/translate/v2?key=key#&q=%E0%B2%9A%E0%B2%BF%E0%B2%82%E0%B2%9A%E0%B3%8B%E0%B2%B3%E0%B2%BF&source=kn&target=en
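
The conversion described above can be sketched with the standard library. `urllib.parse.quote` percent-encodes the UTF-8 bytes of a string by default, and Python's built-in `idna` codec handles hostnames. The hostname `bücher.example` is a made-up example, since the Google hostname is already ASCII.

```python
import urllib.parse

text = 'ಚಿಂಚೋಳಿ'

# Query and path parts: percent-encode the UTF-8 bytes (quote defaults to UTF-8).
encoded_q = urllib.parse.quote(text)
print(encoded_q)

# Hostname part: IDNA. 'bücher.example' is a hypothetical host for illustration.
encoded_host = 'bücher.example'.encode('idna').decode('ascii')
print(encoded_host)
```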

However, the use of # here doesn't look right: it starts the URL fragment, a client-side mechanism that is handled by the browser and never sent to the server, so the parameters after it won't reach the API in a server request.
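
To see what the `#` does, one can split the URL with `urllib.parse.urlsplit`. In this sketch everything after `#` lands in the fragment, which an HTTP client does not send, so the server would only see `key=key`. The `q` value is shortened here for readability.

```python
from urllib.parse import urlsplit

url = ('https://www.googleapis.com/language/translate/v2'
       '?key=key#&q=%E0%B2%9A&source=kn&target=en')  # q shortened for readability

parts = urlsplit(url)
print(parts.query)     # key=key
print(parts.fragment)  # &q=%E0%B2%9A&source=kn&target=en
```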

Usually you'd do something like:

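import urllib.parse

key = 'YOUR_API_KEY'  # placeholder: substitute your own API key
baseurl = 'https://www.googleapis.com/language/translate/v2'
text = 'ಚಿಂಚೋಳಿ'
# urlencode percent-encodes every value, so the resulting URL is pure ASCII
url = baseurl + '?' + urllib.parse.urlencode(dict(
    source='kn', target='en',
    q=text.encode('utf-8'),
    key=key
))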
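
As a variation for Python 3, which the traceback in the question indicates, `urllib.parse.urlencode` accepts `str` values directly and percent-encodes them as UTF-8. The sketch below wraps this in a hypothetical helper, `build_translate_url`, and checks the result by parsing the query string back. `'YOUR_API_KEY'` is a placeholder, and the request itself is left commented out because it needs a valid key.

```python
import urllib.parse
import urllib.request

BASE_URL = 'https://www.googleapis.com/language/translate/v2'

def build_translate_url(text, key, source='kn', target='en'):
    """Hypothetical helper: return an ASCII-only URL for the given text."""
    query = urllib.parse.urlencode({
        'key': key, 'source': source, 'target': target, 'q': text,
    })
    return BASE_URL + '?' + query

url = build_translate_url('ಚಿಂಚೋಳಿ', 'YOUR_API_KEY')
url.encode('ascii')  # no UnicodeEncodeError: the URL is plain ASCII now

# Round-trip check: decoding the query recovers the original text.
decoded = urllib.parse.parse_qs(urllib.parse.urlsplit(url).query)
print(decoded['q'])

# With a real key, the request could then be sent like this:
# with urllib.request.urlopen(url) as response:
#     body = response.read().decode('utf-8')
```
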
bobince
  • there is no need to encode Unicode string in Python 3.3 `urlencode({'q':u'ಚಿಂಚೋಳಿ'})` works just fine. – jfs Dec 21 '13 at 19:17