This article shows you how to multilingualize your gem with Ruby 1.9.
Six months ago, Ruby 1.9.1 was released. Some gems support Ruby 1.9, others do not. At RubyKaigi 2009, I said that you should port your library to Ruby 1.9 now.
Well, what is "Ruby 1.9-ready"? It does not merely mean that the library can be built with Ruby 1.9. The most important point in supporting Ruby 1.9 is M17N -- multilingualization.
M17N
In Ruby 1.9, strings, symbols, regular expressions and IOs have encodings. An IO has two encodings (external and internal) because it sits on the boundary between the inside and the outside of the Ruby process.
You must assign a correct encoding to every object you create. For example, when you are writing an HTTP client library, you must:
- read the Content-Type header and its charset subfield,
- assign that encoding to the response body.
JEG2's Understanding M17N is a good introduction to this topic.
Extension libraries
Actually, it is not hard to multilingualize a pure-Ruby library, because Ruby helps you assign encodings; there is not much you have to do by hand. But extension libraries are harder to multilingualize.
There are two problems:
- rb_str_new is not sufficient.
- You must understand some new concepts.
How to create a string
In Ruby 1.8, rb_str_new is the most common function for creating a new String object. Ruby 1.9 adds several new functions that create a string with an associated character encoding:
VALUE rb_external_str_new(const char *str, long len);
VALUE rb_external_str_new_cstr(const char *str);
VALUE rb_locale_str_new(const char *str, long len);
VALUE rb_locale_str_new_cstr(const char *str);
VALUE rb_usascii_str_new(const char *str, long len);
VALUE rb_usascii_str_new_cstr(const char *str);
VALUE rb_enc_str_new(const char *str, long len, rb_encoding *encoding);
VALUE rb_enc_vsprintf(rb_encoding *encoding, const char *format, va_list args);
The rb_external_XXX functions create a String with the default external encoding (Encoding.default_external). The rb_locale_XXX functions do the same for the locale encoding, and the rb_usascii_XXX functions for US-ASCII. More generally, the rb_enc_XXX functions take a pointer to an rb_encoding as an argument.
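For example, here is a minimal sketch of an extension function that returns a UTF-8 string; the function name my_utf8_hello is made up for illustration:

#include <ruby.h>
#include <ruby/encoding.h>
#include <string.h>

/* A sketch: build a String tagged with UTF-8 instead of ASCII-8BIT. */
static VALUE
my_utf8_hello(VALUE self)
{
    const char *s = "hello";
    return rb_enc_str_new(s, strlen(s), rb_utf8_encoding());
}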
You must use these new functions instead of rb_str_new, because rb_str_new creates a String with the ASCII-8BIT encoding, which is probably not the encoding you want.
Or you can use rb_enc_associate to share the same source between 1.8 and 1.9:

VALUE str = rb_str_new2("foo");  /* rb_str_new2 exists in both 1.8 and 1.9 */
#ifdef HAVE_RUBY_ENCODING_H
rb_enc_associate(str, rb_usascii_encoding());
#endif
rb_enc_associate also takes a pointer to an rb_encoding as an argument.
rb_encoding
rb_encoding is one of the new concepts you should understand. It is the internal representation of an Encoding object. You do not need to know the internals of rb_encoding; its members might change in a future version of Ruby. There are several functions for getting an rb_encoding:
rb_encoding *rb_ascii8bit_encoding(void);
rb_encoding *rb_utf8_encoding(void);
rb_encoding *rb_usascii_encoding(void);
rb_encoding *rb_locale_encoding(void);
rb_encoding *rb_filesystem_encoding(void);
rb_encoding *rb_default_external_encoding(void);
rb_encoding *rb_default_internal_encoding(void);
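For instance, a common pattern is to prefer the default internal encoding and fall back to the default external one; a minimal sketch (rb_default_internal_encoding returns NULL when Encoding.default_internal is nil):

rb_encoding *enc = rb_default_internal_encoding();
if (enc == NULL) {
    enc = rb_default_external_encoding();
}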
More generally, you can get an rb_encoding by name with

rb_encoding *rb_enc_find(const char *name);

or by "index" with

rb_encoding *rb_enc_from_index(int idx);
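For example, a minimal sketch of a lookup by name; the encoding name here is just an illustration:

rb_encoding *enc = rb_enc_find("Windows-31J");  /* lookup by canonical name or alias */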
Hmm, what is the "index"?
Index
"index" is another new concept to understand. It is a unique little integer for an Encoding.
It is not a pointer so it is easy to copy and store.
It is a little integer so it can be stored in RBasic::flags
.
This is the same way as String handles its encoding efficiently.
#define ENCODING_SET_INLINED(obj,i) do {\
    RBASIC(obj)->flags &= ~ENCODING_MASK;\
    RBASIC(obj)->flags |= (VALUE)(i) << ENCODING_SHIFT;\
} while (0)
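For example, a minimal sketch of converting between an rb_encoding and its index with the public API:

rb_encoding *utf8 = rb_utf8_encoding();
int idx = rb_enc_to_index(utf8);              /* encoding -> index */
rb_encoding *again = rb_enc_from_index(idx);  /* index -> encoding; again == utf8 */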
Case Study
The pg gem, a PostgreSQL database adapter, did not support Encodings. So I wrote a patch.
Dispatching
Your Ruby has ruby/encoding.h when it supports M17N, so you can test HAVE_RUBY_ENCODING_H in an extension library (mkmf defines it when your extconf.rb checks for the header with have_header("ruby/encoding.h")).
#if defined(HAVE_RUBY_ENCODING_H) && HAVE_RUBY_ENCODING_H
# define M17N_SUPPORTED
#endif
Associating index
I did not want to use rb_enc_str_new, in order to keep the modification minimal.
 static VALUE
 pgresult_res_status(VALUE self, VALUE status)
 {
-    return rb_tainted_str_new2(PQresStatus(NUM2INT(status)));
+    VALUE ret = rb_tainted_str_new2(PQresStatus(NUM2INT(status)));
+    ASSOCIATE_INDEX(ret, self);
+    return ret;
 }
The patched version of pgresult_res_status overwrites the encoding of ret after creating ret as an ASCII-8BIT string.
The ASSOCIATE_INDEX macro is a wrapper for rb_enc_associate_index.
#ifdef M17N_SUPPORTED
# define ASSOCIATE_INDEX(obj, index_holder) rb_enc_associate_index((obj), enc_get_index((index_holder)))
static rb_encoding * pgconn_get_client_encoding_as_rb_encoding(PGconn* conn);
static int enc_get_index(VALUE val);
#else
# define ASSOCIATE_INDEX(obj, index_holder) /* nothing */
#endif
For 1.8, ASSOCIATE_INDEX does nothing. For 1.9, it extracts the associated encoding index from index_holder and associates obj with that encoding.
enc_get_index is a specialized version of rb_enc_get_index for pg. It extracts an encoding index from a PGconn object.
static int
enc_get_index(VALUE val)
{
    int i = ENCODING_GET_INLINED(val);
    if (i == ENCODING_INLINE_MAX) {
        VALUE iv = rb_ivar_get(val, s_id_index);
        i = NUM2INT(iv);
    }
    return i;
}
You don't have to implement a function like enc_get_index in your library; you can use the rb_enc_get_index API with Ruby 1.9.2. But Ruby 1.9.1's rb_enc_get_index has a bug. ((I had no plan to fix the bug in 1.9.1, but now I feel that decision might be wrong. Do you want to use rb_enc_get_index in your library with Ruby 1.9.1? To backport or not to backport.)) So I reimplemented it as enc_get_index.
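If you can require Ruby 1.9.2, a minimal sketch with the public API; str and index_holder are placeholder variables:

int idx = rb_enc_get_index(index_holder);  /* read the encoding index */
rb_enc_associate_index(str, idx);          /* copy it onto the new string */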
Mapping encodings
The mapping between external information and Ruby's encodings is sometimes difficult; it can be non-trivial.
PostgreSQL supports many character encodings. Which encoding in Ruby corresponds to which encoding in PostgreSQL? I had to decide on a mapping. Here is the mapping I wrote for pg.
#ifdef M17N_SUPPORTED
/**
 * The mapping from canonical encoding names in PostgreSQL to ones in Ruby.
 */
static const char * const (enc_pg2ruby_mapping[][2]) = {
    {"BIG5",          "Big5"        },
    {"EUC_CN",        "GB2312"      },
    {"EUC_JP",        "EUC-JP"      },
    {"EUC_JIS_2004",  "EUC-JP"      },
    {"EUC_KR",        "EUC-KR"      },
    {"EUC_TW",        "EUC-TW"      },
    {"GB18030",       "GB18030"     },
    {"GBK",           "GBK"         },
    {"ISO_8859_5",    "ISO-8859-5"  },
    {"ISO_8859_6",    "ISO-8859-6"  },
    {"ISO_8859_7",    "ISO-8859-7"  },
    {"ISO_8859_8",    "ISO-8859-8"  },
    /* {"JOHAB",      "JOHAB"       }, dummy */
    {"KOI8",          "KOI8-U"      },
    {"LATIN1",        "ISO-8859-1"  },
    {"LATIN2",        "ISO-8859-2"  },
    {"LATIN3",        "ISO-8859-3"  },
    {"LATIN4",        "ISO-8859-4"  },
    {"LATIN5",        "ISO-8859-5"  },
    {"LATIN6",        "ISO-8859-6"  },
    {"LATIN7",        "ISO-8859-7"  },
    {"LATIN8",        "ISO-8859-8"  },
    {"LATIN9",        "ISO-8859-9"  },
    {"LATIN10",       "ISO-8859-10" },
    {"MULE_INTERNAL", "Emacs-Mule"  },
    {"SJIS",          "Windows-31J" },
    {"SHIFT_JIS_2004","Windows-31J" },
    /*{"SQL_ASCII",   NULL          }, special case*/
    {"UHC",           "CP949"       },
    {"UTF8",          "UTF-8"       },
    {"WIN866",        "IBM866"      },
    {"WIN874",        "Windows-874" },
    {"WIN1250",       "Windows-1250"},
    {"WIN1251",       "Windows-1251"},
    {"WIN1252",       "Windows-1252"},
    {"WIN1253",       "Windows-1253"},
    {"WIN1254",       "Windows-1254"},
    {"WIN1255",       "Windows-1255"},
    {"WIN1256",       "Windows-1256"},
    {"WIN1257",       "Windows-1257"},
    {"WIN1258",       "Windows-1258"}
};
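Here is a sketch of how such a table might be consulted; find_ruby_encoding is a hypothetical helper, not part of the actual patch, and it belongs inside the same M17N_SUPPORTED guard as the table:

#include <string.h>

/* Hypothetical helper: map a PostgreSQL encoding name to an rb_encoding. */
static rb_encoding *
find_ruby_encoding(const char *pg_name)
{
    size_t i;
    for (i = 0; i < sizeof(enc_pg2ruby_mapping)/sizeof(enc_pg2ruby_mapping[0]); ++i) {
        if (strcmp(enc_pg2ruby_mapping[i][0], pg_name) == 0)
            return rb_enc_find(enc_pg2ruby_mapping[i][1]);
    }
    return NULL;  /* unknown, or a special case such as SQL_ASCII */
}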
"SJIS" in PostgreSQL is Encoding::SJIS
in Ruby? NO ! According to the documentation, "SJIS" in PostgreSQL is "Mskanji". So it is Encoding::CP932
in Ruby.
I decided on the mapping with help from naruse, an M17N specialist on the Ruby core team.
See CJKV Information Processing for more information about East Asian encodings.
What I want to say here is:
- mapping encodings is difficult,
- but the above mapping table might help you.
Dummy encoding
Sometimes you have to define a dummy encoding in your library.
Ruby 1.9.1 does not support the JOHAB encoding, but PostgreSQL does. What should I do?
You can implement a new encoding in your library. But it is very hard. *1
So I defined JOHAB as a dummy encoding. A dummy encoding is an encoding that Ruby does not support but whose name Ruby knows. Defining a dummy encoding is easy: just call rb_define_dummy_encoding.
Here is a function I wrote for JOHAB.
static rb_encoding *
find_or_create_johab(void)
{
    static const char * const aliases[] = { "JOHAB", "Windows-1361", "CP1361" };
    int enc_index;
    int i;

    for (i = 0; i < sizeof(aliases)/sizeof(aliases[0]); ++i) {
        enc_index = rb_enc_find_index(aliases[i]);
        if (enc_index > 0) return rb_enc_from_index(enc_index);
    }

    enc_index = rb_define_dummy_encoding(aliases[0]);
    for (i = 1; i < sizeof(aliases)/sizeof(aliases[0]); ++i) {
        rb_enc_alias(aliases[i], aliases[0]);
    }
    return rb_enc_from_index(enc_index);
}
First, the function looks for an existing JOHAB so that it can pick up a built-in JOHAB in a future version of Ruby. If Ruby does not have JOHAB, the function defines JOHAB as a dummy encoding and registers its aliases.
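For example, a sketch of tagging a result string with it; str is a placeholder for a string received from a JOHAB-encoded database:

rb_enc_associate(str, find_or_create_johab());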
Conclusion
Supporting Ruby 1.9 in your gem means multilingualizing it.
After you read JEG2's article, multilingualizing pure-Ruby libraries is not so difficult. But multilingualizing extension libraries is much more difficult.
Understand what rb_encoding is, and what an encoding index is.
Understand the complexity of character encodings. The Ruby core team might help you deal with that complexity, as naruse helped me.