unicodeobject
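Since Python 3, str objects are represented internally as Unicode, which is why the CPython source calls them Unicode objects. CPython distinguishes four forms of string (compact ASCII, compact, legacy not ready, and legacy ready) and implements them with three structs: PyASCIIObject, PyCompactUnicodeObject, and PyUnicodeObject. Apart from PyObject, PyASCIIObject is the root of the hierarchy: PyCompactUnicodeObject extends PyASCIIObject, and PyUnicodeObject extends PyCompactUnicodeObject. Since C has no inheritance, each struct simply embeds its base as its first member.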
PyASCIIObject
PyASCIIObject is the most basic of the three structs.
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* Number of code points in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
    struct {
        unsigned int interned:2;
        unsigned int kind:3;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
        unsigned int :24;
    } state;
    wchar_t *wstr;
} PyASCIIObject;
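- PyObject_HEAD is a macro,
#define PyObject_HEAD PyObject ob_base;
which makes a PyObject the header of the object.
- length: the length of the string, in code points.
- state: the status flags described below.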
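- interned: the interning state, one of three values: `SSTATE_NOT_INTERNED (0)`, `SSTATE_INTERNED_MORTAL (1)`, `SSTATE_INTERNED_IMMORTAL (2)`. Interning means that when a new string object is created and an object with the same value already exists, a reference to the existing object is returned instead of the newly created one. Python maintains a key-value structure named interned whose keys are the string values. Not every string is interned, however: CPython automatically interns string constants that look like identifiers, that is, strings made up only of letters, digits, and underscores. An interned string is either mortal or immortal: a mortal one can be reclaimed, while an immortal one is never reclaimed and lives as long as the interpreter. In the interactive interpreter, where each statement is compiled separately:
>>> a = 'abc'
>>> b = 'abc'
>>> id(a) == id(b)
True
>>> c = '!abc'
>>> d = '!abc'
>>> id(c) == id(d)
False
In a script, identical constants within one code object may be merged, so the last comparison can return True there.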
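As a small illustration of interning, the sketch below builds two equal strings at run time, so that they start out as separate objects, and then interns them explicitly with sys.intern. The variable names are made up for the example, and the identity of un-interned strings is a CPython implementation detail rather than a language guarantee.

```python
import sys

# Build the strings at run time so the compiler cannot merge them as constants.
a = ''.join(['!', 'abc'])
b = ''.join(['!', 'abc'])

print(a == b)   # the values are equal
print(a is b)   # but in CPython these are two separate objects

# sys.intern returns the single canonical object for a given value.
ia = sys.intern(a)
ib = sys.intern(b)
print(ia is ib)  # both names now refer to the same object
```

Explicit interning is mainly useful for strings that are compared or used as dictionary keys very often, since identical interned strings can be compared by pointer.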
- kind: the storage kind, held in an enum; it tells how many bytes each character of the string occupies.
enum PyUnicode_Kind {
    /* String contains only wstr byte characters. This is only possible
       when the string was created with a legacy API and _PyUnicode_Ready()
       has not been called yet. */
    PyUnicode_WCHAR_KIND = 0,
    /* Return values of the PyUnicode_KIND() macro: */
    PyUnicode_1BYTE_KIND = 1,
    PyUnicode_2BYTE_KIND = 2,
    PyUnicode_4BYTE_KIND = 4
};
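- PyUnicode_1BYTE_KIND: used when every code point is in the range U+0000 to U+00FF.
- PyUnicode_2BYTE_KIND: used when the largest code point is in the range U+0100 to U+FFFF.
- PyUnicode_4BYTE_KIND: used when the largest code point is in the range U+10000 to U+10FFFF, the upper limit of Unicode.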
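The rule behind the three kinds can be modelled in a few lines of Python. This is only a sketch of the selection logic: unicode_kind is a hypothetical helper written for this example, not a CPython function.

```python
def unicode_kind(s: str) -> int:
    """Return the bytes per character CPython would pick for s (1, 2 or 4)."""
    # The kind depends only on the largest code point in the string.
    largest = max(map(ord, s), default=0)  # treat the empty string as ASCII
    if largest < 0x100:
        return 1   # PyUnicode_1BYTE_KIND
    if largest < 0x10000:
        return 2   # PyUnicode_2BYTE_KIND
    return 4       # PyUnicode_4BYTE_KIND


for text in ('unicode', 'caf\u00e9', 'unicode字符串', 'a\U0001F600'):
    print(repr(text), unicode_kind(text))
```

Note that a single wide character is enough to widen the whole string: 'a\U0001F600' needs 4 bytes for the 'a' as well.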
>>> import sys
>>> s = 'unicode字符串unicode'
>>> sys.getsizeof(s[:2]) - sys.getsizeof(s[:1])
1
>>> sys.getsizeof(s[:9]) - sys.getsizeof(s[:8])
2
>>> sys.getsizeof(s[:12]) - sys.getsizeof(s[:11])
2
>>> sys.getsizeof(s[:14]) - sys.getsizeof(s[:13])
2
>>> s = '字符串unicode'
>>> sys.getsizeof(s[:9]) - sys.getsizeof(s[:8])
2
>>> s = 'unicode'
>>> sys.getsizeof(s)
56
>>> s = ''
>>> sys.getsizeof(s)
49
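- Each slice is a new string with its own kind: a slice that contains only ASCII characters uses 1 byte per character, while a slice that includes a Chinese character uses 2 bytes for every character, the ASCII ones included.
- Accordingly, when the string starts with Chinese characters, every slice taken from its start uses 2 bytes per character.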
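- On a 64-bit machine the PyASCIIObject struct is 48 bytes, and the character data that follows it ends with a \0 byte. A pure-ASCII string of length n therefore takes
48 + n + 1
bytes, which matches the 49 and 56 reported above for '' and 'unicode'.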
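The size formula can be checked from Python. The sketch below measures the header size instead of hard-coding 48, on the assumption that the header differs between CPython versions and platforms; the loop values are arbitrary.

```python
import sys

# The empty string occupies the header plus one byte for the trailing \0.
header = sys.getsizeof('') - 1

for n in (1, 7, 100):
    s = 'a' * n
    # Pure-ASCII string: header + n one-byte characters + trailing \0.
    assert sys.getsizeof(s) == header + n + 1
    print(n, sys.getsizeof(s))
```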
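- compact: whether the object is compact. If it is, Python uses a single memory block for the object; the characters are stored immediately after the struct.
- ascii: whether the text is pure ASCII.
- ready: whether the object's layout has been fully initialized. A value of 1 means that either the object is compact or its data pointer has been filled in.
- unsigned int :24; is an unnamed bit-field that pads the state struct for memory alignment.
- wstr: pointer to the wchar_t representation of the string. For a legacy string created through the old API this is where the value initially lives; for a compact string the characters follow the struct instead, and wstr serves only as a cache for APIs that request wchar_t data. CPython 3.12 removed this field.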
PyCompactUnicodeObject
typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;     /* Number of bytes in utf8, excluding the
                                 * terminating \0. */
    char *utf8;                 /* UTF-8 representation (null-terminated) */
    Py_ssize_t wstr_length;     /* Number of code points in wstr, possible
                                 * surrogates count as two code points. */
} PyCompactUnicodeObject;
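- utf8_length: the number of bytes in the UTF-8 representation, excluding the terminating \0.
- utf8: the UTF-8 encoded form of the text, cached to avoid encoding it repeatedly.
- wstr_length: the number of code points in wstr; a surrogate pair counts as two.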
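The difference between length (code points) and utf8_length (bytes) is easy to see from Python. This sketch only computes the two quantities for a sample string; it does not inspect the cached utf8 buffer itself.

```python
s = '字符串unicode'

code_points = len(s)                  # what the length field counts
utf8_bytes = len(s.encode('utf-8'))   # what utf8_length would count

# Each of the three Chinese characters takes 3 bytes in UTF-8,
# each ASCII letter takes 1.
print(code_points, utf8_bytes)
```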
PyUnicodeObject
typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;                     /* Canonical, smallest-form Unicode buffer */
} PyUnicodeObject;
- data: pointer to the character buffer, stored in the canonical, smallest form that fits the string.
PyUnicode_Type
PyTypeObject PyUnicode_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "str",                        /* tp_name */
    sizeof(PyUnicodeObject),      /* tp_basicsize */
    ···
    unicode_new,                  /* tp_new */
    PyObject_Del,                 /* tp_free */
};
- Note the unicode_new pointer in the tp_new slot: it points to the function that creates string objects.
unicode_new
unicode_new ultimately calls unicode_new_impl to do the actual work.
static PyObject *
unicode_new(PyTypeObject *type, PyObject *args, PyObject *kwargs)
{
    PyObject *return_value = NULL;
    ... /* argument validation */
    PyObject *x = NULL;
    const char *encoding = NULL;
    const char *errors = NULL;
skip_optional_pos:
    return_value = unicode_new_impl(type, x, encoding, errors);
exit:
    return return_value;
}
- After a series of checks and calls, most paths through unicode_new end up in PyUnicode_New, for example:
unicode_new_impl->PyUnicode_FromEncodedObject->PyUnicode_DecodeUTF8Stateful->unicode_decode_utf8->PyUnicode_New
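Seen from Python, the x, encoding, and errors variables in unicode_new correspond to the arguments of the str constructor. The sketch below exercises the decoding path with a few sample inputs chosen for this example.

```python
raw = '字符串'.encode('utf-8')

# str(object, encoding, errors) supplies the x, encoding and errors
# values that unicode_new hands to unicode_new_impl.
decoded = str(raw, 'utf-8')                          # decode valid UTF-8 bytes
replaced = str(b'\xff', 'utf-8', errors='replace')   # invalid byte -> U+FFFD
empty = str()                                        # no argument: empty string

print(decoded, repr(replaced), repr(empty))
```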
PyUnicode_New
PyObject *
PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)
{
    /* Optimization for empty strings */
    if (size == 0) {
        return unicode_new_empty();
    }
    PyObject *obj;
    PyCompactUnicodeObject *unicode;
    void *data;
    enum PyUnicode_Kind kind;
    int is_sharing, is_ascii;
    Py_ssize_t char_size;
    Py_ssize_t struct_size;
    is_ascii = 0;
    is_sharing = 0;
    struct_size = sizeof(PyCompactUnicodeObject);
    if (maxchar < 128) {
        kind = PyUnicode_1BYTE_KIND;
        char_size = 1;
        is_ascii = 1;
        struct_size = sizeof(PyASCIIObject);
    }
    else if (maxchar < 256) {
        kind = PyUnicode_1BYTE_KIND;
        char_size = 1;
    }
    else if (maxchar < 65536) {
        kind = PyUnicode_2BYTE_KIND;
        char_size = 2;
        if (sizeof(wchar_t) == 2)
            is_sharing = 1;
    }
    else {
        if (maxchar > MAX_UNICODE) {
            PyErr_SetString(PyExc_SystemError,
                            "invalid maximum character passed to PyUnicode_New");
            return NULL;
        }
        kind = PyUnicode_4BYTE_KIND;
        char_size = 4;
        if (sizeof(wchar_t) == 4)
            is_sharing = 1;
    }
    /* Ensure we won't overflow the size. */
    if (size < 0) {
        PyErr_SetString(PyExc_SystemError,
                        "Negative size passed to PyUnicode_New");
        return NULL;
    }
    if (size > ((PY_SSIZE_T_MAX - struct_size) / char_size - 1))
        return PyErr_NoMemory();
    /* Duplicated allocation code from _PyObject_New() instead of a call to
     * PyObject_New() so we are able to allocate space for the object and
     * it's data buffer.
     */
    ...
    ...
    ...
#ifdef Py_DEBUG
    unicode_fill_invalid((PyObject*)unicode, 0);
#endif
    assert(_PyUnicode_CheckConsistency((PyObject*)unicode, 0));
    return obj;
}
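- If the largest character is below 128, the string is pure ASCII: 1 byte per character, and only the smaller PyASCIIObject header is allocated.
- If 128 <= largest character < 256, 1 byte per character is still used, with the PyCompactUnicodeObject header.
- If 256 <= largest character < 65536, 2 bytes per character are used.
- Otherwise 4 bytes per character are used, up to MAX_UNICODE; a larger maxchar raises SystemError.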
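The thresholds in PyUnicode_New can be observed from Python by measuring how much one more copy of a character adds to a string. bytes_per_char is a hypothetical helper written for this sketch, and the measurement relies on sys.getsizeof reporting CPython's actual allocation.

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Measure how many bytes one extra copy of ch adds to a string of ch."""
    # Both strings share the same kind and header, so the difference
    # is exactly the width of one character.
    return sys.getsizeof(ch * 3) - sys.getsizeof(ch * 2)


# Code points on either side of each threshold, plus the Unicode maximum.
for cp in (127, 128, 255, 256, 65535, 65536, 0x10FFFF):
    print(hex(cp), bytes_per_char(chr(cp)))
```

Code points 127 and 128 both report 1 byte per character; they differ only in which header struct is used, which this measurement deliberately cancels out.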