Automated Content Migration to WordPress

I have often had to migrate content from a static platform (static content in code, an older CMS…) to WordPress. Some of the recurring annoyances were:

  1. Unsanitized HTML, and extracting content by hand or from the shell
  2. Adding template tags and breadcrumbs
  3. Manual copy/paste labor

The Solution

  1. Use JavaScript to sanitize the DOM
  2. Write RegExp rules
  3. Scrape the pages via Node.js and import directly into WP using the JSON API

The Stack

  • Node 6+ for cleaner JS
  • JSDOM for headless scraping (npm module)
  • WordPress JSON API (WP plugin)
  • WordPress Client (npm module)


  1. Content Inventory

    List out the URLs of the existing pages you want to migrate. If there is significant variance in the DOM structure between these pages, group them by template so you can process them more easily. If you would like to add additional data (a page title override, custom fields…), include it.

    Export the list to a CSV

    You should have a source.csv with something like:

    url, title, description, template, category, meta1...
    Baldwin Street\, Dunedin, A short suburban road in Dunedin, New Zealand, reputedly the world's steepest street., Asia and Oceania
  2. Get WP Ready

    1. On your WP installation, install and activate the WP REST API plugin.
    2. Upload and unzip the Basic Auth plugin; it is not in the plugin repo at the time of this writing.
    3. Since we use Basic Auth, create a temporary user/pass that can be discarded after import.
    4. Test that the API works: navigate to {baseurl}/wp-json . You should see a JSON response with your site’s info.
    5. Add the following to .htaccess to enable Basic Auth:

      RewriteRule ^index\.php$ - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]

    6. Verify:

        curl -u username:password -i -H 'Accept:application/json' {baseurl}/wp-json/wp/v2/users/me

      It should display your user information.

Set up Node

  1. Get Node/npm and the basics running in your environment
  2. Install the npm modules we will use: async, fast-csv, jsdom and wpapi

The App Logic

  1. Initialize
    1. Connect with WP
  2. Batch Process
    1. Read data
      1. Process CSV
      2. Convert to an array of objects with useful data ('url', 'title', 'description'…)
      3. Populate master list
      4. (optional) Separate into sublists for categories, templates…
    2. For each URL in master
      1. Scrape webpage (run JSDOM + jQuery)
      2. Get raw HTML
      3. Process: extract title, description, custom values into JSON
      4. Extract body
      5. Sanitize HTML, remove unnecessary DOM elements
      6. Import: insert a new post/page into WordPress
      7. Store output
  3. Verify in WP

The Code Structure

├── data
│   └── source.csv
├── env.js            // Password
├── index.js          // Entry
├── lib
│   ├── api.js        // All WP interactions
│   ├── read.js       // File read / CSV parse
│   └── scrape.js     // JSDOM Scraper
├── node_modules
│   ├── async
│   ├── fast-csv
│   ├── jsdom
│   └── wpapi
└── package.json

The Code

env.js - Environment Config

For the sake of simplicity, I am not using OAuth but Basic HTTP Authentication. This method transmits credentials in plain text over the wire, so use a temporary account and ensure HTTPS is enabled. DO NOT check this in with working WP credentials.

// These data shouldn't be checked in.
module.exports = {
    'WP_URL': '',
    'WP_USERNAME': 'test',
    'WP_PASSWORD': 'test'
};

data/source.csv - List of URLs

I will use a single-column CSV; you can pass metadata by adding more columns. For example purposes, I will scrape WikiSource.


lib/read.js - Reads CSV Files

For most use cases, this crude CSV parser will suffice:

const fs = require('fs');

try {
    let raw = fs.readFileSync('./data/source.csv', 'utf8');
    let parsed = raw.split("\n")                      // Rows
                    .map(r => r.split(",")            // Fields
                               .map(f => f.trim()));  // Trailing \r\n chars
} catch (e) {
    console.error(e);
}
But synchronous processing does not scale well (say, 1,000,000 rows), so we’ll use streams, which are more robust. The fast-csv module has built-in support for Node streams. The following code is a starter for a scalable solution:

const csv = require('fast-csv'), // will handle parsing/escaping in the full version
    fs = require('fs');

class List {
    constructor(filePath, limit = 500) {
        this.filePath = filePath || null;
        this.limit = limit; = []; = null;
    }

    read() {
        return new Promise((resolve, reject) => {
            if (!(this.filePath && fs.existsSync(this.filePath))) {
                return reject('File does not exist');
            }
            // TODO: implement scalable streaming through the fast-csv parser.
            let raw = '';
   = fs.createReadStream(this.filePath);
  "data", (chunk) => {
                raw += chunk;
            }).on("end", () => {
       = raw.split("\n")
                               .map(r => r.split(",").map(f => f.trim()));
                if ( > this.limit) {
                    console.log("Read", "Limit exceeded");
           =, this.limit);
                }
            }).on("error", reject);
        });
    }
}

module.exports = { List };
testRead.js - Test that the CSV is read

const { List } = require('./lib/read');

let file = new List('./data/source.csv');, console.error);

Run testRead.js; you should see a 2D array of your CSV.

lib/api.js - WP API Wrapper

This file wraps the wpapi npm module to handle authentication and expose only the functions we need: new post and new page.

/**
 *  Wrapper around WP-API
 */
const env = require('../env');
const WPAPI = require('wpapi');

class API {
    constructor() {
        this.wp = null;
        this.user = null;
    }

    addPost(title, content, category, meta, type = 'posts', status = 'draft') {
        return new Promise((resolve, reject) => {
            this.wp.posts().create({
                title: title,
                content: content,
                status: status
            }).then(function (response) {
            }, reject);
        });
    }

    addPage(title, content, category, meta, type = 'pages', status = 'draft') {
        return new Promise((resolve, reject) => {
            this.wp.pages().create({
                title: title,
                content: content,
                status: status
            }).then(function (response) {
            }, reject);
        });
    }

    initialize() {
        return new Promise((resolve, reject) => {
            if (!this.wp) {
                let config = {
                    endpoint: `${env.WP_URL}/wp-json`,
                    username: env.WP_USERNAME,
                    password: env.WP_PASSWORD,
                    auth: true
                };

                this.wp = new WPAPI(config);

                // Verify that it authenticated
                this.wp.users().me().then((user) => {
                    this.user = user;
                    console.log('API', 'authenticated as',;
                }, (error) => reject(error));
            } else {
                reject("API already initialized");
            }
        });
    }
}

module.exports = { API };
testAPI.js - Test that WP works

const { API } = require('./lib/api');

let api = new API();
api.initialize().then(console.log, console.error);

Run testAPI.js, you should see a JSON with your user details.

lib/scrape.js - Headless Webpage Scraper

This wraps JSDOM for convenience. The fnProcess argument to the constructor accepts a function that takes a window object and returns parsed JSON. Since it is executed within the fake DOM, jQuery functions are available in the context.

const jsdom = require('jsdom');

class Scrape {
    constructor(url, fnProcess = null, libs = []) {
        this.url = url || null;
        // Scripts (e.g. jQuery) to inject into the fake DOM
        this.libs = [...[""], ...libs];
        this.fnProcess = (typeof fnProcess === 'function') ? fnProcess : function (window) {
            return window.document.body.innerHTML;
        };
        this.output = null;
    }

    scrape() {
        return new Promise((resolve, reject) => {
            jsdom.env(this.url, this.libs, (err, window) => {
                if (err) {
                    return reject(err);
                }
                this.output = this.fnProcess(window);
                resolve(this.output);
            });
        });
    }
}

module.exports = { Scrape };

testScrape.js - Should return a JSON

const { Scrape } = require('./lib/scrape');

let page = new Scrape('', function (window) {
    return { title: window.document.title, body: window.jQuery('p').text() };
page.scrape().then(console.log, console.error);

Run testScrape.js; you should see the scraped title and body as a JSON object.

index.js - The Glue

Now that we’ve tested these components individually, it is time to glue them together. Async is a popular library for managing control flow in Node applications. This is the code version of the logic outlined above.

The Scrape Function

Scrape the fields we want from WikiSource:

// Scrape function to be executed in DOM
const fnScrape = function (window) {

    // From
    //   The Bhagavad Gita (Arnold translation)/Chapter 1
    // To
    //   Chapter 1

    let $ = window.jQuery;
    let title = $('#header_section_text').text().replace(/["()]/g, ""),
        body = $('.poem').text();
    return { title, body };
};

I tested and fine-tuned this in Chrome DevTools. You should run it against your source URLs to make sure you account for page variations.

The entire file:

const async = require('async');
const { List } = require('./lib/read'),
      { Scrape } = require('./lib/scrape'),
      { API } = require('./lib/api');

const csvFilePath = './data/source.csv',
      LIMIT_PARALLEL = 5; // max concurrent scrapes

// Step 1 - Init WP
let api = new API();

// Step 2 - Read CSV
const readTheFile = function () {
    let file = new List(csvFilePath);
    console.log('Reading file...');
    return;

// Step 3 - Process multiple URLs
const processPages = function (data) {
    data.shift(); // CSV header
    console.log('Processing', data.length, 'pages');
    async.forEachLimit(data, LIMIT_PARALLEL, processSingle, (err) => {
        if (err)
            return console.error(err);
        console.log('Done');

// Step 4 - Get a JSON version of a URL
const scrapePage = function (url) {
    return new Promise((resolve, reject) => {
        if (url.indexOf('http') !== 0) {
            return reject('Invalid URL');
        let page = new Scrape(url, fnScrape);
        page.scrape().then((data) => {
            console.log(">> >> Scraped data", data.body.length);
        }, reject);

// Scrape function to be executed in DOM
const fnScrape = function (window) {

    // From
    //   The Bhagavad Gita (Arnold translation)/Chapter 1
    // To
    //   Chapter 1

    let $ = window.jQuery;
    let title = $('#header_section_text').text().replace(/["()]/g, ""),
        body = $('.poem').text();
    return { title, body };

// Process a single row
const processSingle = function (data, cb) {
    let [url] = data;
    console.log(">> Processing ", url);
    scrapePage(url).then((data) => {
        // Step 5 - Add page to WordPress
        api.addPage(data.title, data.body).then((wpId) => {
            console.log(">> Processed ", wpId);
        }, cb);
    }, cb);

// Kick start the process
console.log('WP Auth...');
    .then(readTheFile, console.error)
    .then(processPages, console.error);

Sample output:


>> Processed  140
>> Processing
>> >> Scraped data 12634
>> Processed  141
>> Processing
>> Processed  142
>> Processing
>> Processed  143
>> Processing
>> >> Scraped data 3005
>> >> Scraped data 3706
>> >> Scraped data 5297
>> >> Scraped data 4039
>> Processed  144
>> Processing
>> >> Scraped data 3835
>> Processed  145
>> Processing
>> >> Scraped data 3781
>> Processed  146
>> Processing
>> >> Scraped data 11816
>> Processed  147
>> Processed  148
>> Processed  149
>> Processed  150
>> Processed  151

Check your WP for the new content.



Asynchronous Recursive Functions in JS

I dislike dealing with deeply nested JSON structures because traversing all of the values is difficult. When you start with an unknown level of nesting, recursive modification makes sense. Modifying nested JS hashes is fairly straightforward with synchronous operations, but it can get messy when the modifier operation is asynchronous.

This snippet lets you apply asynchronous operations to a nested JSON object using the Async library.
It works on both client and server with ES6 support.

SPA - Server Side Injection into Static HTML

If you decide to generate all your client-side assets and ship them using a CDN, it can be tricky to inject server-side generated dynamic content.

Why would you do that?

There are some parameters that are available to the server but cannot be reliably retrieved through AJAX calls. HTTP headers such as referrer, user agent, and client IP, along with other fields transmitted by the browser when it requests index.html, can be used to alter the behavior of your SPA before any of your libraries are loaded.

Examples include: environment variables, tracking codes, api codes, geo-location based redirecting, device based shipping of assets…

If this information is needed before page load, and needs to be synchronously retrieved (not through AJAX), you can use the following JS generation hack to provide it.

This example assumes you are using Express.JS in Node.JS. The same concept can be applied to any dynamic language that lets you manipulate HTTP response headers.

Your index.html

<script src="env.js"></script>
<script src="jquery.js"></script>
<script src="bootstrap.js"></script>

The generated env.js file

if (!(window.env && typeof window.env === "object"))
    window.env = {};
window.env.API_KEY = 'XXXXXXX'; // API Key
window.env.GA_ID = 'YZYZYZYZ'; // Google Analytics ID
window.env.API_SERVER = ''; // In case of zero-downtime toggling

Normally this file would be hardcoded and served using the Express static file server. This is fine until you need to make a change, requiring you to regenerate the file and re-deploy it. A quick alternative is to generate the JS file on the fly and trick the browser into thinking it's a static JS asset.


// JSON response
app.get('/data', (req, res) => {
    res.json(getData());

// JS file
app.get('/data.js', (req, res) => {
    res.type('application/javascript');
    res.send(`window.GLOBAL = ${JSON.stringify(getData())}`);

The additional delay caused by this should be minimal. If desired, you can use ETag cache headers so the browser never re-fetches the file when it has not changed.


[Benchmark comparing the three delivery methods: Static HTML, AJAX Fetch, and Static JS.
Server: difference between server time and data ready. Took: total time to render.]

Simple DB Failover in WordPress

If you have a replicated back-up DB instance running and don’t want to use something like HyperDB, you can override wp-config.php with this snippet:

define('DB_NAME', 'wordpress');
define('DB_USER', 'user');
define('DB_PASSWORD', 'password');

// Try each host in order; the first one that accepts a connection wins.
$db_hosts = array("hostdb1", "hostdb-backup");
foreach ($db_hosts as $host) {
    $l = @mysql_connect($host, DB_USER, DB_PASSWORD);
    if ($l !== false) {
        define('DB_HOST', $host);
        break;
    }
}

WordPress As An Application Framework

I want to use WordPress as an application framework. I like the user-friendly admin, easy install, low maintenance, and rich set of plugins. I can use it for content-heavy applications that need some sort of human data entry on the back-end, and expose the data in WordPress via an API to a consumer application (mobile app, client-side web app).

But it has some limitations:

  • Custom meta fields are serialized in one table (post_meta)
  • Slow performance because of PHP load time and the WordPress bootstrap on every request
  • Inefficient SQL queries
  • Flat data structure

My solution:

  • Add a MongoDB interface to WordPress to store custom meta fields in key/value pair format
  • Create Node.JS/MongoDB middleware between the client and WordPress for caching
  • Skip WP bootstrap calls on every request
  • Create a UI for creating nested JSON structures; map custom fields back to JSON upon data retrieval

The first step would be to create a developer UI that provides an easy interface for generating models as custom posts. It will look like most custom post type plugins, where form controls can be dragged and dropped into metaboxes. The developer will specify the JSON data type for each of these fields, set up validation rules, and create a mapping to a JSON file structure. Then the developer will configure which RESTful interfaces/endpoints will be available for the model and any authentication rules.
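For the step of mapping custom fields back to JSON, here is a minimal sketch of un-flattening post_meta key/value pairs into a nested object; the dot-delimited key convention and the `inflateMeta` name are my own assumptions, not part of any WordPress API:

```javascript
// Rebuild a nested JSON object from flat key/value meta pairs,
// treating dots in the key as path separators (assumed convention).
function inflateMeta(meta) {
    const out = {};
    for (const [key, value] of Object.entries(meta)) {
        const parts = key.split('.');
        let node = out;
        // Walk/create intermediate objects for all but the last segment
        parts.slice(0, -1).forEach(p => {
            node = node[p] = node[p] || {};
        });
        node[parts[parts.length - 1]] = value;
    }
    return out;
}

console.log(inflateMeta({ 'address.street': 'Baldwin', '': 'Dunedin' }));
// → { address: { street: 'Baldwin', city: 'Dunedin' } }
```

The inverse (flattening nested JSON into meta rows on save) follows the same convention, which is what makes a round trip through the flat post_meta table lossless.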

After creating the custom post type, the developer can set up notification services to update the cache. The cache will selectively update and version changes to the models.

A queue of add/remove/update operations will be maintained by the immediate database layer. When transactions are committed, they will be relayed to WordPress via the JSON API.
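A toy version of that commit queue might look like the following; `OpQueue` and `relay` are hypothetical names, and in practice `relay` would call the WP JSON API (e.g. via the wpapi client):

```javascript
// Minimal in-memory commit queue: add/remove/update operations buffer
// locally and are relayed to WordPress in order on commit.
class OpQueue {
    constructor(relay) {
        this.ops = [];      // pending operations
        this.relay = relay; // async fn, e.g. op => wp.posts().create(op.payload)
    }

    push(type, payload) {
        this.ops.push({ type, payload });
    }

    commit() {
        // Drain the buffer, then chain relays sequentially so the
        // backing WordPress store sees operations in order.
        const pending = this.ops;
        this.ops = [];
        return pending.reduce(
            (chain, op) => chain.then(() => this.relay(op)),
            Promise.resolve()
        );
    }
}
```

A production version would also need retry and failure handling, but the shape is the same: the middleware owns the queue, and WordPress only sees committed transactions.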

This is not ideal for applications that use real time data. It is great for content heavy applications like place directories, news feeds, information apps, sports apps…


There is no easy way to manage application content right now. I have looked at Cloud CMS, Contentful, and others, but they do not offer the convenience and intuitive interface that WordPress provides.

If I make any progress on this, I will post updates.


Ethiopic Web Fonts at EthiopicType

I just re-launched EthiopicType. It is intended to serve as an archive of web typographic resources for the Ethiopic script. It currently has a simple text-to-PNG image generator, a dynamic image generation API, a collection of free web fonts with their respective CDN embed codes, and links to tutorials, input methods, and other resources.

In this article, I will discuss the problems of Web Typography and the details of the new site.


If you see rectangular boxes on a webpage instead of Ethiopic Text, it is because the browser you are using does not have the necessary fonts installed.


You don’t see this problem often nowadays because Microsoft bundled the Nyala font with newer versions of Windows (Vista, 7, 8). Regardless, you should make sure the text on your webpage displays correctly on the many platforms that do not have native support for Ethiopic text. These include Windows XP, Mac OS X (< 10.7.3), iOS, Android (<= 4.04), Windows Mobile, Blackberry OS, Samsung Bada, and some flavors of Linux.


Most operating systems come pre-installed with their own unique bundle of fonts, so it is hard to maintain consistent typography across browsers. The CSS1 specification attempted to tackle this problem with the ‘font-style’ and ‘font-family’ properties, but these only told the browser to display close matches found on the system. For example, when font-family is set to sans-serif, it may render Tahoma on Windows, Helvetica on Macintosh, or FreeSans on Linux. Their similarity was adequate for plain textual content, but it did not suffice for decorative headings and other design elements that are crucial to the aesthetics of web design.


This has historically created significant difficulties for web designers and developers because it made it impossible to predict the exact rendering of their typography on every browser. You can find detailed information on the history of web fonts at Wikipedia: Web Typography, but here is a brief overview of several attempts that have been made to solve this problem:


Image Replacement


This is an old technique which involves rendering the text into the desired font and embedding it in the web page as an image.


Static Images

In this scenario, the designer renders the text as a rasterized bitmap image with the desired font and places the image as an <img> tag in the webpage. This was popular in the early days of the internet (late 90s - early 2000s) when server and client-side processing resources were limited. It is still in use today for logos and headings that require fancy styles. Designers also place the text in the alt attribute so it can be read by screen readers and clients that do not support images. The site's image generator currently uses this method; I plan to make it fully dynamic in the future.


Dynamic Images

Dynamic images are generated on the server side and inserted into the webpage, usually via an AJAX request or a CSS media query, based on client-side detection of features such as CSS3 web font support or native font support (more details below).


Here is a PHP tool you can try out: P+C DTR

There are also some implementations that use client-side vector generation with Flash. This is more efficient than static images but only works on browsers that support Adobe Flash (which automatically excludes iOS, Android, and other mobile clients). However, it is a great tool for desktop clients, given the 99% Flash penetration rate. You can read more about this method here: Scalable Inman Flash Replacement.


Here are some great plugins for WordPress and Drupal that automatically generate images based on browser support.

WordPress > Facelift Image Replacement (FLIR)

Drupal > Facelift Image Replacement Integration


I have tested the WordPress plugin with Amharic fonts with no success. Please let me know your experience if you decide to try it out.


CSS Image Replacement

This is a hybrid of the techniques mentioned above. It uses the CSS background-image property, which works similarly to the first method but does not display the text image if the browser does not support it. Instead, it displays the raw text, which can still be formatted using other CSS properties, giving it an advantage over the traditional HTML img tag.


You can read more about it here: Fahrner Image Replacement


If you want to learn more about specific implementations and compare the pros and cons of each method you may find this useful: A List Apart Article


Vector Text

This is closely related to web fonts but slightly different: the fonts are converted to vector glyphs (using Bézier curves) and do not support text selection or live typing.

It is not as popular as image replacement due to browser compatibility issues with the vector formats (Internet Explorer supports only the deprecated VML; support for newer versions of SVG is still limited).


Web Fonts


Prior to the emergence of web fonts, browsers relied only on the fonts supplied by the operating system to render textual content in the HTML. WebFonts have made it possible to render almost any font on multiple browsers. The CSS3 specification includes the @font-face element which provides support for Web Fonts on newer versions of most browsers.


It works by having the browser download the desired font (or a subset of the glyph set) and apply it dynamically to the specified HTML elements. This method can be very inefficient on slower connections and older clients because it requires heavy client-side processing and consumes large amounts of bandwidth.


You can find detailed information about Web Fonts here: Wikipedia: Web Typography


The fonts are available for download or can be embedded directly from the CDN. Most of the fonts come from SenaMiirmir, which I highly recommend if you’re interested in learning more about the Ethiopic script. The fonts are currently hosted on Amazon EC2, but I plan to migrate them to Amazon S3 for better performance and availability. If you have a server closer to Ethiopia and would like to act as a mirror, I will give you the web service for generating the embed codes automatically. In the future, I plan to include a WYSIWYG (what you see is what you get) editor for the image generator and the ability to generate a wider range of output (JPG, SVG, PDF…).


The fonts vary in size depending on their format and the breadth of glyphs covered. Currently, the smallest is GeezNewB (7 KB WOFF) and the largest is FreeSerif (3,167 KB SVG).


Bandwidth is always an issue when downloading resources, which makes web fonts impractical on slow connections. Samuel Teklu (creator of jGeez) has addressed this by checking for native support before downloading the fonts. I recommend his solution, especially if you don’t need a specific font for styling.


Here is a link to his blog post: Detecting if there is a Geez unicode installed on client browsers


Meanwhile, I will try to remove the non-Ethiopic Glyphs from some of the fonts and add them as special cases so a smaller file can be used in cases that require only Ethiopic text.

Please let me know your experience with these fonts (I have not tested all of them) and send me an email at: [email protected] if you would like to contribute resources.


Thank you for your time

Ethiopic Text in (X)HTML Documents

ASCII and Unicode are the two most widely used text encoding standards. The first is much older, occupies less byte space, and is commonly used in system files and source code. The latter is newer, encompasses a wider range of characters, and is widely used on the web and in desktop applications with international support.

If you decide to use Ethiopic characters in the markup of your (X)HTML, you have to make sure that the document remains UTF-8 encoded. This should not be a problem with modern text editors (MS Notepad, TextEdit, Notepad++…) and IDEs (Eclipse, NetBeans > 6.9) because they provide native support for unicode source files.

However, some text editors such as Vim and web based file editors do not support UTF-8 documents without special configuration. If you open your UTF-8 source file with an editor that lacks unicode support, you are bound to lose the encoded data upon saving it. This can be disastrous, especially in the case of documents containing Ethiopic content.

Here are some steps you can take to work around this problem:

  • Store content in a database: The separation of content and presentation is a core component of the MVC design architecture. Even if you do not subscribe to this pattern in your project, storing your raw content in a database provides additional data protection and saves you the headache of mixing content and markup. There are many CMSs (Content Management System) such as WordPress, Joomla and Drupal that dynamically generate UTF-8 compatible (X)HTML output with the proper HTTP headers. The data is usually stored on a database server with full UTF-8 support so any changes to the encoding of the source files will not affect your data.
  • Use a flat file CMS: If it is impractical to use a full blown CMS, you may opt for lite XML or flat-file based content management systems. These do not require database privileges so they work on almost any web server. The pages are created dynamically so you will need a web server that supports web applications (PHP, ASP.NET, JSP, Python, CGI…), which can be found on most web hosts. GetSimpleCMS is a great customizable CMS for this purpose.
  • Convert Unicode to HTML Entities: HTML Entities are decimal representations of the position of characters in the Unicode set. If you cannot use a database and cannot run dynamic web pages on your server, you can still prevent data loss by converting your unicode characters into their ASCII HTML entity equivalent. In comparison with Unicode, ASCII covers a smaller subset of the character range, therefore any attempts to directly convert from Unicode to ASCII will lead to data loss. However, you can work around this problem by using the decimal representations of the unicode characters. This has some disadvantages such as a larger file size (due to the additional bytes of HTML entities) and poor code readability (you see decimals in place of the characters).
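As a sketch of that conversion (the `toEntities` function name is mine; most platforms also ship a built-in equivalent):

```javascript
// Replace every non-ASCII character with its decimal HTML entity,
// leaving plain ASCII untouched so the resulting file is ASCII-safe.
function toEntities(str) {
    return Array.from(str)
        .map(ch => ch.codePointAt(0) > 127 ? `&#${ch.codePointAt(0)};` : ch)
        .join('');
}

console.log(toEntities('ሀ ሁ')); // → "&#4608; &#4609;"
```

A browser receiving `&#4608;` renders the Ethiopic syllable ሀ (U+1200) regardless of the file's own encoding, which is exactly the data-loss protection described above.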


Don’t forget the declarations:

XHTML:

    <?xml version="1.0" encoding="UTF-8"?>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />

HTML 4.1:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

HTML5:

    <meta charset="utf-8">

You can also use your platform’s built-in Unicode to HTML Entities converter.