Sunday, June 9, 2013

Unicode

One of the requirements in my current project is the use of a few scientific symbols. They're Unicode characters, such as the degree sign (°), or Māori macrons for place names. Simple, right?

Not simple. A JSP application has two places where it needs to declare that the content is UTF-8. First, you need to add this page directive to the top of every JSP, which sets the Content-Type in the HTTP response header:

<%@ page contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>
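
(The pageEncoding attribute tells the container how the .jsp source file itself is encoded; the contentType value is what goes out in the response header.)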

Then you need to add Spring's CharacterEncodingFilter to web.xml, which sets the character encoding on each incoming request (and, with forceEncoding set, on the response as well):

 <filter>
     <filter-name>encodingFilter</filter-name>
     <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
     <init-param>
         <param-name>encoding</param-name>
         <param-value>UTF-8</param-value>
     </init-param>
     <init-param>
         <param-name>forceEncoding</param-name>
         <param-value>true</param-value>
     </init-param>
 </filter>
 <filter-mapping>
     <filter-name>encodingFilter</filter-name>
     <url-pattern>/*</url-pattern>
 </filter-mapping>
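
The <filter-mapping> matters: without it the filter never runs, and it has to run before anything reads request parameters, because setCharacterEncoding() is ignored once getParameter() has been called. So declare it ahead of your other filters.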

So far so good; POST data seems to arrive at the servlet properly encoded. Then I sent the results to the Oracle database, and lo! The database does not like them. It accepts them, but returns garbage ASCII characters when the data is fetched again. Oracle is infuriating; we all know that, but it is infuriating to the point that to change the database character set, we need to completely re-install Oracle (!!). That's not going to happen. Possible things to try include:

  • Use NVARCHAR (not recommended), and call pstmt.setFormOfUse(2, OraclePreparedStatement.FORM_NCHAR) for every NCHAR-family parameter. setFormOfUse(...) is an Oracle extension to JDBC. Alternatively, JDBC 4.0 has setNString()/getNString() methods. (There's a sketch of this after the list.)
  • There's a JDBC system property (-Doracle.jdbc....) somewhere for the Oracle thin drivers. The OCI drivers might work better. Newer driver versions might work better. I decided this was too ridiculous to investigate.
  • Don't use Strings, but rather use BLOBs or raw data. Seriously; I was tempted.
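
For reference, here's roughly what the NVARCHAR route looks like. This is a sketch only: the COMMENTS table and its columns are made up, and it assumes Oracle's ojdbc driver is on the classpath.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import oracle.jdbc.OraclePreparedStatement;

    public class NcharExample {
        // Insert a Unicode string into a hypothetical COMMENTS(ID, BODY NVARCHAR2) table.
        public static void insert(Connection conn, long id, String body) throws Exception {
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO comments (id, body) VALUES (?, ?)");
            try {
                ps.setLong(1, id);
                // Oracle extension: mark parameter 2 as NCHAR-form, otherwise the
                // driver converts the string through the database character set
                // and non-ASCII characters get mangled.
                ((OraclePreparedStatement) ps).setFormOfUse(2, OraclePreparedStatement.FORM_NCHAR);
                ps.setString(2, body);
                ps.executeUpdate();
            } finally {
                ps.close();
            }
        }
    }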

Or... just store HTML in the database. You can encode Unicode characters using org.apache.commons.lang.StringEscapeUtils.escapeHtml(), which turns them into ASCII entity references, then either use unescapeHtml() to restore them, or send the stored text straight to the browser and let the browser do the decoding.
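
A minimal sketch of that round trip, assuming Commons Lang 2.x (the sample string is my own):

    import org.apache.commons.lang.StringEscapeUtils;

    public class EscapeDemo {
        public static void main(String[] args) {
            String raw = "Taupō is 25°C";                        // non-ASCII input
            String ascii = StringEscapeUtils.escapeHtml(raw);    // "Taup&#333; is 25&deg;C" — 7-bit clean
            String back = StringEscapeUtils.unescapeHtml(ascii); // back to "Taupō is 25°C"
            System.out.println(raw.equals(back));                // true
        }
    }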

Except for newline characters. Oh how I hate them. Sometimes they're one character, sometimes they're two, and browsers ignore them, except inside a <textarea> or <pre>. So currently I'm converting newline sequences into <br />, and back again.
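
The conversion is roughly this; a sketch, with method names of my own invention:

    public class Newlines {
        // Normalise CRLF and lone CR to LF, then swap LF for <br /> before storing.
        static String toBr(String s) {
            return s.replaceAll("\r\n|\r", "\n").replace("\n", "<br />");
        }

        // Reverse the conversion when putting text back into a <textarea>.
        static String fromBr(String s) {
            return s.replaceAll("<br\\s*/?>", "\n");
        }
    }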

This leads to yet another problem: you can add a "maxlength" attribute to a textarea, but it counts characters before they're converted into 7-bit-clean HTML. By the time they hit the database, chances are they'll be longer than the field size, which means you need to return a recoverable error nicely to the user, and also provision about twice as much space as you think you'll need for these fields.
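
In other words, validate the length as stored, not as typed. A sketch, with a made-up column size:

    import org.apache.commons.lang.StringEscapeUtils;

    public class LengthCheck {
        static final int COLUMN_SIZE = 400; // hypothetical VARCHAR2 width

        // "°" becomes "&deg;" (one character becomes five), so check the
        // escaped-and-converted form against the column size, not the raw input.
        static boolean fitsInColumn(String userInput) {
            String stored = StringEscapeUtils.escapeHtml(userInput)
                    .replaceAll("\r\n|\r", "\n")
                    .replace("\n", "<br />");
            return stored.length() <= COLUMN_SIZE;
        }
    }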

Why can't it all "just work"? Why can't everything be UTF-8, by default, and just work? Why couldn't HTML pick one newline character that always shows up in the output, rather than using plain newlines and <br /> everywhere? People can use word-wrapping text editors, surely. Why couldn't they make text editors smarter, and handle HTML the way the rest of the browser does?