Context: Transcoding a string from an external source for saving in the database
A gem gives me a string s with Latin-1 (ISO-8859-1) encoded content, which I want to store in a Rails model.
r = MyRecord.new(mystring: s)
# ...
r.save
Because my PostgreSQL database uses UTF-8 encoding, saving the model after setting its string field to the string causes an error when that string contains certain non-ASCII characters:
ActiveRecord::StatementInvalid: PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0xdf 0x65
...
I can solve this easily by transcoding the string:
r = MyRecord.new(mystring: s.encode(Encoding::UTF_8, Encoding::ISO_8859_1))
# ...
r.save
(Because s.encoding returns #<Encoding:ASCII-8BIT> instead of #<Encoding:ISO-8859-1>, I pass the source encoding as the second argument. The gem that produced s is probably unaware that the file it read the string from is Latin-1 encoded.)
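To illustrate why the explicit source encoding matters, here is a minimal sketch. The sample bytes are my own stand-in for what the gem returns (an assumption, not the real data); they contain the same 0xdf 0x65 sequence the error message complains about.

```ruby
# Hypothetical sample: the Latin-1 bytes for "Straße", tagged as binary,
# standing in for the string the gem hands back.
s = "Stra\xDFe".b

source_encoding = s.encoding # ASCII-8BIT, i.e. Ruby knows nothing about Latin-1 here

# Merely relabelling the bytes as UTF-8 leaves an invalid sequence (0xdf 0x65).
relabelled_ok = s.dup.force_encoding(Encoding::UTF_8).valid_encoding?

# Transcoding with an explicit source encoding converts the bytes themselves.
utf8 = s.encode(Encoding::UTF_8, Encoding::ISO_8859_1)

puts source_encoding       # the binary encoding
puts relabelled_ok         # false
puts utf8.valid_encoding?  # true
```

With the second argument, encode interprets the bytes as ISO-8859-1 regardless of how the string is tagged, so the result is a valid UTF-8 string.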
Challenge: Avoid hard-coding the destination encoding
It occurred to me that knowledge of the database's string encoding does not belong in the part of the code that does the persisting, and thus the transcoding.
I can ask the model's class for the database's encoding:
MyRecord.connection.encoding
This doesn't return a Ruby Encoding object, though; it returns a string containing the encoding's name. Fortunately, the Encoding class can be queried by name (and by some aliases) to look up encodings:
Encoding.find 'UTF-8' # returns #<Encoding:UTF-8>, the value of Encoding::UTF_8
Unfortunately, the two use different naming conventions: MyRecord.connection.encoding returns 'UTF8' (no hyphen), while Encoding.find(...) needs to be passed 'UTF-8' (with hyphen) or 'CP65001' if we want it to return #<Encoding:UTF-8>.
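The mismatch can be reproduced without a database. In this sketch, pg_name is a variable I made up to hold the value MyRecord.connection.encoding returns in my setup:

```ruby
# Assumption: this is the name the PostgreSQL adapter reports for the database.
pg_name = 'UTF8'

# Both the canonical name and the alias resolve to the same Encoding object.
by_name  = Encoding.find('UTF-8')
by_alias = Encoding.find('CP65001')

# The PostgreSQL spelling is neither a name nor an alias Ruby knows.
lookup_error =
  begin
    Encoding.find(pg_name)
    nil
  rescue ArgumentError => e
    e
  end

puts by_name.equal?(by_alias) # true
puts lookup_error.class       # ArgumentError
```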
Sooooo close.
Question: Is there a clean and/or recommended way
to avoid hard-coding the destination encoding and instead dynamically determine and use the database's encoding?
Discarded ideas
I don't feel that doing string manipulation or pattern matching on the result of MyRecord.connection.encoding or on the contents of Encoding.aliases() would be any better than just leaving the hard-coded values in the code.
Modifying Encoding.aliases()'s return value doesn't have any effect:
Encoding.aliases['UTF8'] = 'UTF-8'
Encoding.find 'UTF8' # ArgumentError: unknown encoding name - UTF8
(and doesn't feel right either, anyway), nor does modifying the return value of #names:
Encoding::UTF_8.names.push('UTF8')
Encoding.find 'UTF8' # ArgumentError: unknown encoding name - UTF8
I guess both only return dynamically generated collections or copies of the underlying collections, and for a good reason.
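A quick check seems to support that guess. This is only a sketch of what I observe in MRI, not a statement about documented guarantees:

```ruby
# Each call appears to build a fresh Hash, so mutating one copy changes nothing.
first  = Encoding.aliases
second = Encoding.aliases
same_hash = first.equal?(second)

first['UTF8'] = 'UTF-8'
leaked_to_second = second.key?('UTF8')
leaked_to_fresh  = Encoding.aliases.key?('UTF8')

# Encoding#names likewise hands out a new Array on every call.
same_names_array = Encoding::UTF_8.names.equal?(Encoding::UTF_8.names)

puts same_hash         # false
puts leaked_to_second  # false
puts leaked_to_fresh   # false
puts same_names_array  # false
```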