Discussion:
[Tutor] Retain UTF-8 Character in Python List
Boy Sandy Gladies Arriezona
2015-06-01 02:39:03 UTC
Hi, it's my first time here. I hope you don't mind if I go straight to the
question.
I do some work in Python 2, and my job is to collect some queries and then
send them to a Java program via JSON. We're doing batch updates in Apache
Phoenix, which is why I collect those queries beforehand.

My question is:
*Can we retain UTF-8 characters in a list without them being changed into
\xXX or \u00XX form?* The reason is that the Java program inserts the data
directly, "as is", without iterating over the list. So my query will look
the same as when we print the list directly.

Example:
c = 'sffs © fafd'
l = list()

l.append(c)

print l
['sffs \xc2\xa9 fafd'] # this will be inserted, not ['sffs © fafd']


---
Best regards,
Boy Sandy Gladies Arriezona
_______________________________________________
Tutor maillist - ***@python.org
To unsubscribe or change subscription options:
http
Peter Otten
2015-06-01 09:55:56 UTC
Post by Boy Sandy Gladies Arriezona
Hi, it's my first time here.
Welcome!
Post by Boy Sandy Gladies Arriezona
I hope you don't mind if I go straight to the
question.
I do some work in Python 2, and my job is to collect some queries and then
send them to a Java program via JSON. We're doing batch updates in Apache
Phoenix, which is why I collect those queries beforehand.
*Can we retain UTF-8 characters in a list without them being changed into
\xXX or \u00XX form?* The reason is that the Java program inserts the data
directly, "as is", without iterating over the list. So my query will look
the same as when we print the list directly.
c = 'sffs © fafd'
l = list()
l.append(c)
print l
['sffs \xc2\xa9 fafd'] # this will be inserted, not ['sffs © fafd']
import json
items = [u"sffs © fafd"] # unicode preferable over str for non-ascii data
print json.dumps(items, ensure_ascii=False)
["sffs © fafd"]
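
For readers on Python 3, here is a minimal sketch of the same idea. It assumes Python 3 (where every str literal is already Unicode) and that the receiving side expects UTF-8 bytes; neither assumption comes from the thread. It compares the default output of json.dumps with the ensure_ascii=False output and then encodes the result explicitly.

```python
import json

items = ["sffs © fafd"]  # Python 3 str literals are Unicode; no u prefix needed

# Default behaviour (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(items)

# ensure_ascii=False keeps the character itself in the JSON text.
literal = json.dumps(items, ensure_ascii=False)

# Assumption: the consumer reads UTF-8, so encode explicitly before sending.
payload = literal.encode("utf-8")

print(escaped)   # ["sffs \u00a9 fafd"]
print(payload)   # b'["sffs \xc2\xa9 fafd"]'
```

Both forms are valid JSON for the same list; the difference is only in how the text is written out.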


Steven D'Aprano
2015-06-01 11:22:22 UTC
Post by Boy Sandy Gladies Arriezona
Hi, it's my first time here. I hope you don't mind if I go straight to the
question.
I do some work in Python 2, and my job is to collect some queries and then
send them to a Java program via JSON. We're doing batch updates in Apache
Phoenix, which is why I collect those queries beforehand.
In Python 2, regular strings "" are actually byte strings, not Unicode
text strings. If you put a non-ASCII character in one, you'll get whatever
bytes your terminal or source encoding produces. That is platform
dependent: it may be UTF-8, but could be something else.

So for example:

py> s = "a©b" # Not a Unicode string
py> len(s) # Expecting 3.
4
py> for c in s: print c, repr(c)
...
a 'a'
� '\xc2'
� '\xa9'
b 'b'


Not what you want! Instead, you have to use Unicode strings, u"".

py> s = u"a©b" # Unicode string
py> len(s)
3
py> for c in s: print c, repr(c)
...
a u'a'
© u'\xa9'
b u'b'
py> print s
a©b


Remember, the u is not part of the string, it is part of the delimiter:

Byte string uses delimiters " " or ' '

Unicode string uses delimiters u" " or u' '
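
The byte-versus-character difference above can be sketched as follows. This assumes Python 3, where the two types are spelled str and bytes, and it assumes UTF-8 as the encoding; in Python 2 the plain "" literal would hold the bytes shown here only if the environment happened to use UTF-8.

```python
text = "a©b"                 # str: a sequence of Unicode code points
data = text.encode("utf-8")  # bytes: the UTF-8 encoding of that text

print(len(text))  # 3
print(len(data))  # 4
print(data)       # b'a\xc2\xa9b'

# Decoding with the same encoding gives back the original text.
assert data.decode("utf-8") == text
```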
Post by Boy Sandy Gladies Arriezona
*Can we retain UTF-8 characters in a list without them being changed into
\xXX or \u00XX form?* The reason is that the Java program inserts the data
directly, "as is", without iterating over the list. So my query will look
the same as when we print the list directly.
What do you mean, the Java program inserts it directly? Inserts it into
what?
Post by Boy Sandy Gladies Arriezona
c = 'sffs © fafd'
l = list()
l.append(c)
print l
['sffs \xc2\xa9 fafd'] # this will be inserted, not ['sffs © fafd']
Change the string 'sffs...' to a Unicode string u'sffs...' and your
example will work.

*However*, don't be fooled by Python's list display:

py> mylist = [u'a©b']
py> print mylist
[u'a\xa9b']

"Oh no!", you might think, "Python has messed up my string and converted
the © into \xa9 which is exactly what I don't want!"

That's just the list's default display: printing a list shows the repr
of each item, and the repr shows anything which is not ASCII as an escape
sequence. The string itself is still exactly what you want. If you print
the string directly, you will see it as you intended:

py> print mylist[0] # just the string inside the list
a©b
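
A small sketch of the same point, that the escaped display does not change the stored string. It assumes Python 3 and uses the built-in ascii() to produce an escaped display similar in spirit to what Python 2 prints for a list; it is an illustration, not the Python 2 session above.

```python
mylist = ["a©b"]

# ascii() returns a repr with non-ASCII characters escaped.
shown = ascii(mylist)
print(shown)           # ['a\xa9b']

# The element itself is untouched: still three characters, the middle one ©.
item = mylist[0]
print(len(item))       # 3
print(item[1] == "©")  # True
```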



By the way, if you can use Python 3 instead of Python 2, you may find
the Unicode handling is a bit simpler and less confusing. For example,
in Python 3, the way lists are printed is a bit more sensible:

py> mylist = ['a©b'] # No need for the u' delimiter in Python 3.
py> print(mylist)
['a©b']

Python 2.7 will do the job, if you must, but it will be a bit harder and
more confusing. Python 3.3 or higher is a better choice for Unicode.
--
Steve
Boy Sandy Gladies Arriezona
2015-06-01 16:02:11 UTC
On Mon, Jun 1, 2015 at 6:22 PM, Steven D'Aprano <***@pearwood.info> wrote:
What do you mean, the Java program inserts it directly? Inserts it into
what?

I mean that the Java program takes the list and sends it directly to Apache
Phoenix to do a "batch upsert".
Because the list was not displayed the way I had set it, I thought it
would send the wrong data to Apache Phoenix. I still haven't found the
problem, and I don't know which program is at fault. Anyway, your
explanation is really good, thank you.
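
One way to rule out the Python side is a round-trip check: serialize the list to JSON, parse it back, and compare with the input. The sketch below assumes Python 3, and the helper name survives_json is made up for illustration. It says nothing about what the Java program or Phoenix does with the data afterwards.

```python
import json

def survives_json(items, ensure_ascii):
    """Serialize to JSON text, parse it back, and compare with the input."""
    wire = json.dumps(items, ensure_ascii=ensure_ascii)
    return json.loads(wire) == items

items = ["sffs © fafd"]

# Escaped (\u00a9) and literal (©) JSON both decode to the same list.
escaped_ok = survives_json(items, ensure_ascii=True)
literal_ok = survives_json(items, ensure_ascii=False)
print(escaped_ok, literal_ok)  # True True
```

If both checks pass, the JSON produced by Python carries the right characters in either form, and the remaining suspects are the transport encoding and the consumer.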


On Mon, Jun 1, 2015 at 4:55 PM, Peter Otten <***@web.de> wrote:
import json
items = [u"sffs © fafd"] # unicode preferable over str for non-ascii data
print json.dumps(items, ensure_ascii=False)
["sffs © fafd"]

I haven't tried that, using unicode strings before appending them to the
list. Awesome, I'll try it out later when I get to the office. Thank you
very much.
---
Best regards,
Boy Sandy Gladies Arriezona