Automated Content Migration to WordPress

I have had to migrate content from static platforms (static content in code, older CMSes…) to WordPress several times. Some of the annoying issues were:

  1. Unsanitized HTML, extracting content from the shell
  2. Adding template tags and breadcrumbs
  3. Manual copy/paste labor

The Solution

  1. Use JavaScript to sanitize the DOM
  2. Write RegExp rules (see the sketch below)
  3. Scrape the pages via Node.js and import directly into WP using the JSON API
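
To make points 1 and 2 concrete, here is a minimal sketch of the kind of cleanup rules I mean. The selectors and RegExp patterns are hypothetical placeholders; yours will depend on the source markup:

// Hypothetical cleanup: prune noise via the DOM, then normalize
// the remaining markup with RegExp rules.
function sanitize($, $content) {
    // Example selectors for elements we never want to migrate
    $content.find('script, style, .ads, .breadcrumbs').remove();
    return $content.html()
        .replace(/\sstyle="[^"]*"/g, '')  // strip inline styles
        .replace(/<\/?font[^>]*>/g, '')   // strip legacy tags
        .replace(/\n{3,}/g, '\n\n');      // collapse extra blank lines
}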

The Stack

  • Node 6+ for cleaner JS
  • JSDOM for headless scraping (npm module)
  • WordPress JSON API (WP plugin)
  • WordPress Client (npm module)

Steps

  1. Content Inventory

    List out the URLs of the existing pages you want to migrate. If there is significant variance in the DOM structure between these pages, group them by template so you can process them more easily. If you would like to add additional data (a page title override, custom fields…), include it as extra columns.

    Export the list to a CSV

    You should have a source.csv with something like:

    url, title, description, template, category, meta1...
    https://en.wikipedia.org/wiki/Baldwin_Street, Baldwin Street\, Dunedin, A short suburban road in Dunedin, New Zealand, reputedly the world's steepest street., Asia and Oceania
    ...
    
  2. Get WP Ready

    1. On your WP installation, install and activate the [WP REST API](https://wordpress.org/plugins/rest-api/) plugin.
    2. Upload and unzip the [Basic Auth plugin](https://github.com/eventespresso/Basic-Auth); it is not in the plugin repo at the time of this writing.
    3. Since we use Basic Auth, create a temporary user/pass that can be discarded after import.
    4. Test that the API works: navigate to {baseurl}/wp-json. You should see a JSON response with your site’s info.
    5. Add the following to .htaccess to enable Basic Auth:

      RewriteRule ^index\.php$ - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]

    6. Verify:

        curl -u username:password -i -H 'Accept:application/json' {baseurl}/wp-json/wp/v2/users/me

      It should display your user information.

  3. Set up Node

    Get NPM/Node and the basics running in your environment, then install the project dependencies (see the commands below).
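
A minimal setup might look like this; the package list mirrors the node_modules tree shown later, and jsdom is pinned to a 9.x release because the jsdom.env API used in this article was removed in jsdom 10:

npm init -y
npm install async fast-csv jsdom@9 wpapi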

The App Logic

  1. Initialize

    1. Connect with WP
  2. Batch Process

    1. Read data

      1. Process CSV
      2. Convert to an array of objects with useful data ('url', 'title', 'description')
      3. Populate master list
    2. (optional) Separate into sublists for categories, templates…
    3. For each URL in the master list

      1. Scrape the webpage (run JSDOM + jQuery)
      2. Get the raw HTML
      3. Extract title, description and custom values into JSON
      4. Extract the body
      5. Sanitize the HTML, removing unnecessary DOM elements
      6. Import: insert a new post/page into WordPress
      7. Store the output
  3. Verify in WP

The Code Structure

├── data
│   └── source.csv
├── env.js            // Password
├── index.js          // Entry
├── lib
│   ├── api.js        // All WP interactions
│   ├── read.js       // File read / CSV parse
│   └── scrape.js     // JSDOM Scraper
├── node_modules
│   ├── async
│   ├── fast-csv
│   ├── jsdom
│   └── wpapi
└── package.json

The Code

env.js - Environment Config

For the sake of simplicity, I am not using OAuth but Basic HTTP Authentication. This method transmits credentials in plain text over the wire, so use a temporary account and ensure HTTPS is enabled. DO NOT check this file in with working WP credentials (add env.js to your .gitignore).

// These data shouldn't be checked in.

module.exports = {
    'WP_URL': 'https://domain.com',
    'WP_USERNAME': 'test',
    'WP_PASSWORD': 'test'
}

 

data/source.csv - List of URLs

I will use a single-column CSV; you can pass metadata by adding more columns. For example purposes, I will scrape WikiSource.

url
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_1
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_2
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_3
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_4
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_5
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_6
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_7
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_8

lib/read.js - Reads CSV Files

For most use cases, this crude CSV parser will suffice:

const fs = require('fs');
try {
    let raw = fs.readFileSync('./data/source.csv', 'utf8');
    let parsed = raw.split("\n")                    // Rows
                    .map(r => r.split(",")          // Fields
                              .map(f => f.trim())); // Trim stray \r and spaces
} catch (e) {
    console.error(e);
}

But synchronous processing does not scale well (say, 1,000,000 rows), so we’ll use Streams, which are more robust. The fast-csv module has built-in support for Node streams. The following code is a starter for a scalable solution:

const csv = require('fast-csv'),
    fs = require('fs');
class List {
    constructor(filePath, limit = 500) {
        this.filePath = filePath || null;
        this.limit = limit;
        this.data = [];
        this.stream = null;
    }
    read() {
        return new Promise((resolve, reject) => {
            if (!(this.filePath && fs.existsSync(this.filePath))) {
                return reject('File does not exist');
            }
            // TODO: implement backpressure handling for very large files.
            this.stream = fs.createReadStream(this.filePath);
            this.stream.pipe(csv()).on("data", (raw) => {
                if (this.data.length > this.limit) {
                    console.log("Read", "Limit exceeded");
                    this.stream.destroy();
                    // 'end' will not fire after destroy(), so resolve here
                    return resolve(this.data);
                }
                this.data.push(raw);
            }).on("end", () => {
                resolve(this.data)
            });
        })
    }
}
module.exports = {
    List
};

testRead.js - Test that CSV is read

const { List } = require('./lib/read');
let file = new List('./data/source.csv');
file.read().then(console.log, console.error);

Run testRead.js; you should see a 2D array of your CSV.

lib/api.js - WP API Wrapper

This file wraps the wpapi npm module to handle authentication and expose only the functions we need: new post and new page.

/* 
 *  Wrapper around WP-API
 */
const env = require('../env');
const WPAPI = require('wpapi');

class API {
    constructor () {
        this.wp = null;
        this.user = null;
    }

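    // Note: category, meta and type are accepted as placeholders for future
    // use; only title, content and status are sent to WordPress for now.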
    addPost(title, content, category, meta, type='posts', status='draft') {  
        return new Promise((resolve, reject) => {
            this.wp.posts().create({
                title,
                content,
                status
            }).then(function( response ) {
                resolve(response.id);
            }, reject);
        });
    }

    addPage(title, content, category, meta, type='pages', status='draft') {
        return new Promise((resolve, reject) => {
            this.wp.pages().create({
                title,
                content,
                status
            }).then(function( response ) {
                resolve(response.id);
            }, reject);
        });
    }

    initialize() {
        return new Promise((resolve, reject) => {
            if (!this.wp)
            {
                let config = {
                    endpoint: `${env.WP_URL}/wp-json`,
                    username: env.WP_USERNAME,
                    password: env.WP_PASSWORD,
                    auth: true
                }

                this.wp = new WPAPI(config)

                // Verify that it authenticated
                this.wp.users().me().then((user) => {
                    this.user = user;
                    console.log('API', 'authenticated as', user.name);
                    resolve(user);
                }, (error) => reject(error))
            }
            else
            {
                reject ("API already initialized");
            }
        });
    }
}

module.exports = { API };

testAPI.js - Test that WP works

const { API } = require('./lib/api');
let api = new API();
api.initialize().then(console.log, console.error);

Run testAPI.js; you should see a JSON with your user details.

lib/scrape.js - Headless Webpage Scraper

This wraps JSDOM for convenience. The fnProcess argument to the constructor accepts a function that receives a window object and returns parsed JSON. Since it is executed within the fake DOM, jQuery functions are available in the context.

const jsdom = require('jsdom');
class Scrape {
    constructor(url, fnProcess = null, libs = []) {
        this.url = url || null;
        this.libs = ["http://code.jquery.com/jquery.js", ...libs];
        this.fnProcess = (typeof fnProcess === 'function') ? fnProcess : function(window) {
            return window.document.body.innerHTML;
        }
        this.output = null;
    }
    scrape() {
        return new Promise((resolve, reject) => {
            // jsdom.env is the pre-v10 jsdom API; this.libs injects jQuery
            jsdom.env(this.url, this.libs, (err, window) => {
                if (err) {
                    return reject(err);
                }
                this.output = this.fnProcess(window);
                resolve(this.output);
            });
        });
    }
}
module.exports = {
    Scrape
}

testScrape.js - example.org should return a JSON

const { Scrape } = require('./lib/scrape');
let page = new Scrape('http://example.org/', function (window) {
    return { title: window.document.title, body: window.jQuery('p').text() };
});
page.scrape().then(console.log, console.error);

Run testScrape.js; you should see a JSON with the page title and paragraph text.

index.js - The Glue

Now that we’ve tested these components individually, it is time to glue them together. Async is a popular library for managing control flow in Node applications. What follows is the code version of the logic outlined above.

The Scrape Function

Scrape the fields we want from WikiSource:

// Scrape function to be executed in DOM
const fnScrape = function(window) {

    // From
    //   The Bhagavad Gita (Arnold translation)/Chapter 1 
    // To
    //   Chapter 1

    let $ = window.jQuery;

    let title = $('#header_section_text').text().replace(/["()]/g, ""),
        body = $('.poem').text();
    return {
        title, 
        body
    };
}

I tested and fine-tuned this in Chrome DevTools. You should run it against several of your source URLs to make sure you account for page variations.
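
For example, paste fnScrape into the DevTools console on one of the source pages and call it against the live window (MediaWiki sites such as WikiSource already expose window.jQuery):

// In the DevTools console on a source page, after pasting fnScrape:
fnScrape(window);
// Expect an object shaped like { title: "Chapter 1", body: "..." }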

The entire file:

const async = require('async');
const { List } = require('./lib/read'), 
      { Scrape } = require('./lib/scrape'), 
      { API } = require('./lib/api');

const csvFilePath = './data/source.csv',
    LIMIT_PARALLEL = 5;

// Step 1 - Init WP
let api = new API();

// Step 2 - Read CSV
const readTheFile = function() {
    let file = new List(csvFilePath);
    console.log('Reading file...');
    return file.read();
};

// Step 3 - Process multiple URLs
const processPages = function(data) {
    data.shift(); // CSV header
    console.log('Processing', data.length, 'pages');
    async.forEachLimit(data, LIMIT_PARALLEL, processSingle, (err)=>{
        if (err)
        {
            return console.error(err);
        }
        console.log("Done!");
    });
};

// Step 4 - Get a JSON version of a URL
const scrapePage = function(url) {
    return new Promise((resolve, reject) => {
        if (url.indexOf('http') !== 0) {
            return reject('Invalid URL');
        }
        let page = new Scrape(url, fnScrape);
        page.scrape().then((data) => {
            console.log(">> >> Scraped data", data.body.length);
            resolve(data);
        }, reject);
    });
};

// Scrape function to be executed in DOM
const fnScrape = function(window) {

    // From
    //   The Bhagavad Gita (Arnold translation)/Chapter 1 
    // To
    //   Chapter 1

    let $ = window.jQuery;
    let title = $('#header_section_text').text().replace(/["()]/g, ""),
        body = $('.poem').text()
    return {
        title, 
        body
    };
}

// Steps 4 & 5 - Process a single CSV row: scrape it, then import it
const processSingle = function(data, cb) {
    let [url] = data;
    console.log(">> Processing ", url);
    scrapePage(url).then((data) => {
        // Step 5 - Add page to WordPress
        api.addPage(data.title, data.body).then((wpId) => {
            console.log(">> Processed ", wpId);
            cb();
        }, cb)
    }, cb);
}

// Kick start the process
api.initialize()
    .then(readTheFile, console.error)
    .then(processPages, console.error);
console.log('WP Auth...');

Test

...
>> Processed  140
>> Processing  https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_12
>> >> Scraped data 12634
>> Processed  141
>> Processing  https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_13
>> Processed  142
>> Processing  https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_14
>> Processed  143
>> Processing  https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_15
>> >> Scraped data 3005
>> >> Scraped data 3706
>> >> Scraped data 5297
>> >> Scraped data 4039
>> Processed  144
>> Processing  https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_16
>> >> Scraped data 3835
>> Processed  145
>> Processing  https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_17
>> >> Scraped data 3781
>> Processed  146
>> Processing  https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_18
>> >> Scraped data 11816
>> Processed  147
>> Processed  148
>> Processed  149
>> Processed  150
>> Processed  151
Done!

Check your WP for the new content.
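
If you would rather verify from the command line, here is a small sketch using the same API wrapper; status() and perPage() are standard wpapi collection parameters, and the import created everything with status 'draft':

// verifyImport.js - list the most recent draft pages created by the import
const { API } = require('./lib/api');

let api = new API();
api.initialize().then(() => {
    // Listing drafts requires authentication
    return api.wp.pages().status('draft').perPage(10);
}).then((pages) => {
    pages.forEach((p) => console.log(p.id, p.title.rendered));
}, console.error);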

GitHub:

Scrape to WP