A mind that is stretched by a new experience can never go back to its old dimensions.

Ruby: incompatible character encodings: US-ASCII and UTF-8

February 29th, 2012 Posted in from the road, geek out

As OPower goes international later this year, I’m working on some scripts that validate the translated output from our translation contractor. The scripts are written in Ruby.

I’m seeing this exception thrown while parsing a file returned by the contractors:

incompatible character encodings: US-ASCII and UTF-8

It is raised from infile.gets().

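There are two problems here. First, the translators’ editor put a byte order mark (BOM, code point U+FEFF) at the very start of the file. In UTF-16 the BOM tells a reader the “endian”-ness of the file; in UTF-8 it only marks the file as UTF-8. Second, my script wasn’t reading the file as UTF-8, so it can’t handle this. The first line of the file is a comment for a Java application.properties file: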

  # =============== (The # indicates comment, the rest is for aesthetics)
An octal dump of the file shows there is something else in front of the comment:
 $ od -c form-messages_en_AU.properties | head -n 2
 0000000 357 273 277 # = = = = = = = = = = =
 0000020 = = = = = = = = = = = = = = = =

357 273 277: that isn’t a printing character! Those three octal bytes are EF BB BF in hex, which is the UTF-8 encoding of U+FEFF.
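The same check can be done from Ruby instead of od. This is only a rough sketch: the helper name starts_with_bom? and the sample string are made up for illustration and are not part of my validation scripts.

```ruby
# Octal 357 273 277 is hex EF BB BF, the UTF-8 encoding of U+FEFF.
BOM_BYTES = [0xEF, 0xBB, 0xBF].freeze

# Hypothetical helper: true when the byte array starts with a UTF-8 BOM.
def starts_with_bom?(bytes)
  bytes.first(3) == BOM_BYTES
end

# Stand-in for the first line of the properties file.
sample = "\uFEFF# ===============\n"

# Print the first three bytes in octal, the way od shows them.
puts sample.bytes.first(3).map { |b| b.to_s(8) }.join(" ")
puts starts_with_bom?(sample.bytes)
```

For a real file, File.binread(filename, 3).bytes should give the first three bytes to pass to the helper.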

That’s where my scripts were choking. The solution is twofold:
1. Read the file in the right encoding
2. Learn to ignore the character in question.

Part 1: Set the encoding to UTF-8 when opening the file with File.new:
 infile = File.new(filename, "r", encoding: Encoding::UTF_8)
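An alternative I did not use in my scripts: Ruby’s file mode string accepts a "BOM|UTF-8" encoding, which reads the file as UTF-8 and drops a leading BOM if there is one. A minimal sketch, where first_line_without_bom and the temp file are invented for the demo:

```ruby
require "tempfile"

# Hypothetical helper: returns the first line of the file, with Ruby
# consuming a leading UTF-8 BOM (if present) while opening it.
def first_line_without_bom(path)
  File.open(path, "r:BOM|UTF-8") { |f| f.gets }
end

# Write a small file that starts with the BOM bytes, like the one I received.
Tempfile.create("bom_demo") do |tmp|
  tmp.binmode
  tmp.write("\xEF\xBB\xBF# ===============\nkey=value\n".b)
  tmp.flush

  line = first_line_without_bom(tmp.path)
  puts line.start_with?("#") # the BOM is gone, so a plain "#" check works
  puts line.encoding
end
```

With this mode the BOM never reaches the parsing code, so Part 2 would probably not be needed at all.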
Part 2: Learn to ignore the character in question:
 # Skip comment lines, including a first-line comment preceded by the BOM.
 if line.start_with?("#") || line.start_with?("\uFEFF#")
   logger.debug "skipping line #{line}"
 else
   @messages << Message.new(line)
 end

Et voilà, the exception goes away as a result of Part 1, and I can safely ignore the comment on the first line. If your file doesn’t start with a comment, which is the more common case, you can use string manipulation to throw away \uFEFF once you’ve identified it.
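Throwing away \uFEFF could look something like this. It is a sketch that assumes the line was already read as UTF-8 (Part 1); strip_bom is a made-up name, not something from my scripts.

```ruby
# Hypothetical helper: removes one BOM from the very start of the line.
# Lines without a leading BOM are returned unchanged.
def strip_bom(line)
  line.sub(/\A\uFEFF/, "")
end

puts strip_bom("\uFEFFgreeting=G'day").inspect
puts strip_bom("greeting=G'day").inspect
```

Anchoring the pattern with \A means only a BOM at the start of the string is removed; anything later in the line is left alone.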
