[PHP-dev 1383] Fw: [PHP-DEV] mbstring: missing support for hex numeric entities &xHHHH;

Seiji Masugata s.masugata @ digicom.dnp.co.jp
Thu Aug 9 17:47:48 JST 2007


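Hello, this is Masugata.

While I'm at it:

This is from quite a while ago, but the mail below was sent to the upstream (php.net) mailing list.
I kept meaning to forward it and then left it sitting. Sorry about that.

Incidentally, nobody replied to this mail on that list.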

Forwarded by Seiji Masugata
--------------------- Original Message Start -----------------------
From:    Umberto Salsi <salsi @ icosaedro.it>
To:      <internals @ lists.php.net>
Date:    Wed, 23 May 2007 17:40:57 CEST
Subject: [PHP-DEV] mbstring: missing support for hex numeric entities &xHHHH;
----

mbstring does not support hexadecimal numeric entities in HTML code. For example:

echo urlencode( mb_convert_encoding("&#x0415;", "UTF-8", "HTML-ENTITIES") );

displays %F2%AF%B8%9F rather than the expected %D0%95.
This bug was detected by Nick Wedd <nick @ maproom.co.uk> and reported in the
newsgroup comp.lang.php, Message-ID: <EU9zOoNGJAVGFAaa @ maproom.demon.co.uk>.

I found the bug in the file ext/mbstring/libmbfl/filters/mbfilter_htmlent.c
and added these features:

- decode hex entities &#xHHHH;
- detect invalid digits
- detect entities with no digits at all
- detect values outside the range 0-0xffff

Invalid values are returned verbatim.

Apparently the right place for this patch should be
http://cvs.sourceforge.jp/cgi-bin/viewcvs.cgi/php-i18n/
but the project is no longer hosted there.

The patch for ext/mbstring/libmbfl/filters/mbfilter_htmlent.c follows:

173a174,217
> static int mbfl_decode_numeric_entity(char *s, int s_len)
> /*
> 	s = numeric entity "ddd" or "xhhhh"
> 	return: numeric value or -1 if not inside [0,0xffff] or invalid digits
> */
> {
> 	int ent, pos, c, d;
> 
> 	ent = 0;
> 
> 	if (*s == 'x' || *s == 'X') {
> 		/* hexadecimal base */
> 		if ( s_len < 2 )
> 			return -1;  /* no digits found */
> 		for (pos=1; pos<s_len; pos++) {
> 			c = s[pos];
> 			if (isdigit(c))
> 				d = c - '0';
> 			else if (isxdigit(c))
> 				d = tolower(c) - 'a' + 10;
> 			else
> 				return -1;  /* invalid hex digit */
> 			ent = (ent << 4) + d;
> 			if (ent > 0xffff)
> 				return -1;  /* too big */
> 		}
> 
> 	} else {
> 		/* decimal base */
> 		if ( s_len < 1 )
> 			return -1;  /* no digits found */
> 		for (pos=0; pos<s_len; pos++) {
> 			c = s[pos];
> 			if (! isdigit(c) )
> 				return -1;  /* invalid dec char */
> 			ent = ent*10 + (c - '0');
> 			if (ent > 0xffff)
> 				return -1;  /* too big */
> 		}
> 	}
> 
> 	return ent;
> }
> 
192,193c236,246
< 				for (pos=2; pos<filter->status; pos++) {
< 					ent = ent*10 + (buffer[pos] - '0');
---
> 				ent = mbfl_decode_numeric_entity(&buffer[2], filter->status - 2);
> 				if( ent >= 0 ){
> 					CK((*filter->output_function)(ent, filter->data));
> 					filter->status = 0;
> 					/*php_error_docref("ref.mbstring" TSRMLS_CC, E_NOTICE, "mbstring decoded '%s'=%d", buffer, ent);*/
> 				} else {
> 					/* failure */
> 					buffer[filter->status++] = ';';
> 					buffer[filter->status] = 0;
> 					/* php_error_docref("ref.mbstring" TSRMLS_CC, E_WARNING, "mbstring cannot decode '%s'", buffer); */
> 					mbfl_filt_conv_html_dec_flush(filter);
195,197d247
< 				CK((*filter->output_function)(ent, filter->data));
< 				filter->status = 0;
< 				/*php_error_docref("ref.mbstring" TSRMLS_CC, E_NOTICE, "mbstring decoded '%s'=%d", buffer, ent);*/


Best regards,
 ___ 
/_|_\  Umberto Salsi
\/_\/  www.icosaedro.it

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


--------------------- Original Message End -------------------------

-- 
Seiji Masugata <s.masugata @ digicom.dnp.co.jp>


