Ethiopic Text in (X)HTML Documents
ASCII and Unicode are the two most widely used character encoding standards. ASCII is much older, needs only one byte per character, and is commonly used in system files and source code. Unicode is newer, covers a far wider range of characters (including the Ethiopic script), and is widely used on the web and in desktop applications with international support.
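The size difference mentioned above can be seen in a small Python sketch. It assumes UTF-8 as the Unicode encoding, and the sample characters and variable names are chosen only for illustration.

```python
# Compare how many bytes an ASCII letter and an Ethiopic letter need in UTF-8.
latin = "A"      # U+0041, inside the ASCII range
ethiopic = "ሀ"   # U+1200, ETHIOPIC SYLLABLE HA

latin_bytes = latin.encode("utf-8")
ethiopic_bytes = ethiopic.encode("utf-8")

print(len(latin_bytes))     # 1
print(len(ethiopic_bytes))  # 3

# ASCII has no slot for Ethiopic characters, so a direct conversion fails.
try:
    ethiopic.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(ascii_ok)  # False
```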
If you decide to use Ethiopic characters directly in your (X)HTML markup, you have to make sure that the document stays UTF-8 encoded. This should not be a problem with modern text editors (MS Notepad, TextEdit, Notepad++…) and IDEs (Eclipse, NetBeans > 6.9) because they provide native support for Unicode source files.
However, some text editors, such as Vim and web-based file editors, may not handle UTF-8 documents correctly without special configuration. If you open your UTF-8 source file with an editor that lacks Unicode support, you are likely to lose the encoded characters when you save it. This can be disastrous, especially for documents containing Ethiopic content.
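One way to catch this kind of damage early is to check a file after editing it. The Python sketch below is only an illustration: the function names are hypothetical, and it assumes that counting characters in the main Ethiopic block (U+1200 to U+137F) is a good enough signal for your content.

```python
from pathlib import Path

# Main Ethiopic Unicode block; the supplement and extended blocks are ignored here.
ETHIOPIC_START, ETHIOPIC_END = 0x1200, 0x137F


def ethiopic_count(raw: bytes) -> int:
    """Decode raw bytes strictly as UTF-8 and count the Ethiopic characters.

    Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    """
    text = raw.decode("utf-8")
    return sum(1 for ch in text if ETHIOPIC_START <= ord(ch) <= ETHIOPIC_END)


def ethiopic_count_in_file(path: str) -> int:
    """Same check for a file on disk (hypothetical helper)."""
    return ethiopic_count(Path(path).read_bytes())


original = "ሰላም".encode("utf-8")
# Simulate an editor without Unicode support that saves unknown characters as "?".
damaged = "ሰላም".encode("ascii", "replace")

print(ethiopic_count(original))  # 3
print(ethiopic_count(damaged))   # 0
```

Running such a check before and after an edit shows whether the Ethiopic content survived the save.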
Here are some steps you can take to work around this problem:
- Store content in a database: The separation of content and presentation is a core idea of the MVC design pattern. Even if you do not follow this pattern in your project, storing your raw content in a database provides additional data protection and saves you the headache of mixing content and markup. Many content management systems (CMSs), such as WordPress, Joomla and Drupal, dynamically generate UTF-8 compatible (X)HTML output with the proper HTTP headers. The data is usually stored on a database server with full UTF-8 support, so any changes to the encoding of the source files will not affect your data.
- Use a flat-file CMS: If it is impractical to use a full-blown CMS, you may opt for a lightweight XML-based or flat-file content management system. These do not require database privileges, so they work on almost any web server. The pages are created dynamically, so you will need a web server that supports web applications (PHP, ASP.NET, JSP, Python, CGI…), which most web hosts provide. GetSimpleCMS is a great customizable CMS for this purpose.
- Convert Unicode to HTML Entities: Numeric HTML entities (numeric character references) represent a character by the decimal value of its Unicode code point; for example, ሀ (U+1200) becomes &#4608;. If you cannot use a database and cannot run dynamic web pages on your server, you can still prevent data loss by converting your Unicode characters into their ASCII HTML entity equivalents. ASCII covers only a small subset of the Unicode character range, so any attempt to convert directly from Unicode to ASCII will lead to data loss. The decimal representations work around this problem because they are written entirely in ASCII characters. The disadvantages are a larger file size (due to the additional bytes of the HTML entities) and poor code readability (you see decimals in place of the characters).
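The entity conversion described in the last item can be sketched in Python with the standard `xmlcharrefreplace` error handler. The function name `to_numeric_references` and the sample markup are illustrative, not part of any particular tool.

```python
import html


def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal reference such as &#4608;.

    ASCII characters, including the markup itself, are left untouched.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


source = "<p>ሰላም</p>"
encoded = to_numeric_references(source)
print(encoded)  # <p>&#4656;&#4619;&#4637;</p>

# The conversion is lossless: unescaping restores the original characters.
restored = html.unescape(encoded)
print(restored == source)  # True
```

The encoded string contains only ASCII bytes, so it survives editors that cannot handle UTF-8, at the cost of the readability and size issues noted above.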
Tips
- Don't forget the encoding declarations:
  XHTML:
    <?xml version="1.0" encoding="UTF-8"?>
    <meta http-equiv="content-type" content="application/xhtml+xml; charset=UTF-8"/>
  HTML 4.01:
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  HTML5:
    <meta charset="utf-8">
- Use your platform's built-in Unicode to HTML Entities converter.