Automated Content Migration to WordPress
I have had to migrate content from static platforms (static content in code, an older CMS…) to WordPress several times. Some of the annoying issues were:
- Unsanitized HTML, and extracting the content from the surrounding page shell
- Adding template tags and breadcrumbs
- Manual copy/paste labor
The Solution
- Use JavaScript to sanitize the DOM
- Write RegExp rules for string-level cleanup (a sketch of both follows this list)
- Scrape the pages via Node.js and import them directly into WP using the JSON API
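To give a flavor of what those rules can look like, here is a minimal sketch. The selectors (`.ad`, `.breadcrumb`, `#content`) are placeholders for whatever your source markup uses; the function expects a jQuery instance like the one our scraper provides later.

```
// Hypothetical cleanup rules: DOM-level via jQuery, string-level via RegExp.
const cleanBody = function ($) {
  $('script, style, .ad, .breadcrumb').remove();   // drop unwanted elements
  $('[style]').removeAttr('style');                // strip inline styling
  return $('#content').html()
    .replace(/<!--[\s\S]*?-->/g, '')               // RegExp: remove HTML comments
    .replace(/\s+/g, ' ')                          // RegExp: collapse whitespace
    .trim();
};
```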
The Stack
- Node 6+ for cleaner JS
- JSDOM for headless scraping (npm module)
- WordPress JSON API (WP plugin)
- WordPress Client (npm module)
Steps
Content Inventory
List out the URLs of the existing pages you want to migrate. If there is significant variance in the DOM structure between these pages, group them by template so you can process them more easily. If you would like to carry over additional data (a page title override, custom fields…), include it as extra columns.
Export the list to a CSV
You should have a source.csv with something like:
```
url, title, description, template, category, meta1...
https://en.wikipedia.org/wiki/Baldwin_Street, Baldwin Street\, Dunedin, A short suburban road in Dunedin, New Zealand, reputedly the world's steepest street., Asia and Oceania
...
```
Get WP Ready
1. On your WP installation, install and activate the [WP REST API](https://wordpress.org/plugins/rest-api/) plugin.
2. Upload and unzip the [Basic Auth plugin](https://github.com/eventespresso/Basic-Auth); it is not in the plugin repo at the time of this writing.
3. Since we use Basic Auth, create a temporary user/pass that can be discarded after import.
4. Test that the API works: navigate to `{baseurl}/wp-json`. You should see a JSON response with your site's info.
5. Add the following to .htaccess to enable Basic Auth:
```
RewriteRule ^index\.php$ - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
```
Verify:

```
curl -u username:password -i -H 'Accept:application/json' {baseurl}/wp-json/wp/v2/users/me
```

It should display your user information.
Set up Node
- Get NPM/Node and the basics running in your environment
- Install the npm modules this guide uses: `npm install async fast-csv jsdom wpapi`
The App Logic
Initialize
- Connect with WP

Batch Process
1. Read data
    1. Process CSV
    2. Convert to an array of objects with useful data ('url', 'title', 'description'…)
    3. Populate master list
    4. (optional) Separate into sublists for categories, templates…
2. For each URL in master
    1. Scrape webpage: run JSDOM + jQuery, get the raw HTML
    2. Process
        - Extract title, description, custom values into JSON
        - Extract body
        - Sanitize HTML, remove unnecessary DOM elements
    3. Import
        - Insert a new post/page to WordPress
        - Store output
3. Verify in WP
The Code Structure
```
├── data
│   └── source.csv
├── env.js        // Password
├── index.js      // Entry
├── lib
│   ├── api.js    // All WP interactions
│   ├── read.js   // File read / CSV parse
│   └── scrape.js // JSDOM Scraper
├── node_modules
│   ├── async
│   ├── fast-csv
│   ├── jsdom
│   └── wpapi
└── package.json
```
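A matching package.json might look like the following. The project name and version ranges are illustrative, not prescriptive; note that the Scrape class below relies on jsdom.env, which exists only in jsdom releases before v10.

```
{
  "name": "wp-content-migration",
  "version": "1.0.0",
  "main": "index.js",
  "dependencies": {
    "async": "^2.1.0",
    "fast-csv": "^2.3.0",
    "jsdom": "^9.9.0",
    "wpapi": "^1.1.0"
  }
}
```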
The Code
env.js - Environment Config
For the sake of simplicity, I am not using OAuth but Basic HTTP Authentication. This method transmits credentials in plain text over the wire, so use a temporary account and ensure HTTPS is enabled. DO NOT check this file in with working WP credentials; add env.js to your .gitignore.
```
// These data shouldn't be checked in.
module.exports = {
  'WP_URL': 'https://domain.com',
  'WP_USERNAME': 'test',
  'WP_PASSWORD': 'test'
}
```
data/source.csv - List of URLs
I will use a single-column CSV; you can pass metadata by adding more columns. For demonstration purposes, I will scrape Wikisource.
```
url
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_1
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_2
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_3
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_4
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_5
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_6
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_7
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_8
```
lib/read.js - Reads CSV Files
For most use cases, this crude CSV parser will suffice:
```
const fs = require('fs');

try {
  let raw = fs.readFileSync('./data/source.csv', 'utf8');
  let parsed = raw.split("\n")      // Rows
    .map(r => r.split(",")          // Fields
      .map(f => f.trim()));         // Trailing \r\n and spaces
} catch (e) {
  console.error(e);
}
```
But synchronous processing does not scale well (say, to 1,000,000 rows), so we'll use streams, which are more robust. The fast-csv module has built-in support for Node streams. The following code is a starter for a scalable solution:
```
const csv = require('fast-csv'),
      fs = require('fs');

class List {
  constructor(filePath, limit = 500) {
    this.filePath = filePath || null;
    this.limit = limit;
    this.data = [];
    this.stream = null;
  }

  read() {
    return new Promise((resolve, reject) => {
      if (!(this.filePath && fs.existsSync(this.filePath))) {
        return reject('File does not exist');
      }
      // TODO: implement scalable streaming.
      this.stream = fs.createReadStream(this.filePath);
      this.stream.pipe(csv())
        .on("data", (raw) => {
          if (this.data.length >= this.limit) {
            console.log("Read", "limit exceeded; stopping early");
            this.stream.destroy();
            // 'end' will not fire after destroy(), so resolve here.
            return resolve(this.data);
          }
          this.data.push(raw);
        })
        .on("end", () => resolve(this.data));
    });
  }
}

module.exports = { List };
```
testRead.js - Test that CSV is read
```
const { List } = require('./lib/read');

let file = new List('./data/source.csv');
file.read().then(console.log, console.error);
```
Run testRead.js; you should see a 2D array of your CSV.
lib/api.js - WP API Wrapper
This file wraps the wpapi npm module to handle authentication and expose only the functions we need: new post and new page.
```
/*
 * Wrapper around WP-API
 */
const env = require('../env');
const WPAPI = require('wpapi');

class API {
  constructor() {
    this.wp = null;
    this.user = null;
  }

  addPost(title, content, category, meta, type = 'posts', status = 'draft') {
    return new Promise((resolve, reject) => {
      this.wp.posts().create({ title, content, status }).then(function (response) {
        resolve(response.id);
      }, reject);
    });
  }

  addPage(title, content, category, meta, type = 'pages', status = 'draft') {
    return new Promise((resolve, reject) => {
      this.wp.pages().create({ title, content, status }).then(function (response) {
        resolve(response.id);
      }, reject);
    });
  }

  initialize() {
    return new Promise((resolve, reject) => {
      if (!this.wp) {
        let config = {
          endpoint: `${env.WP_URL}/wp-json`,
          username: env.WP_USERNAME,
          password: env.WP_PASSWORD,
          auth: true
        };
        this.wp = new WPAPI(config);
        // Verify that it authenticated
        this.wp.users().me().then((user) => {
          this.user = user;
          console.log('API', 'authenticated as', user.name);
          resolve(user);
        }, (error) => reject(error));
      } else {
        reject("API already initialized");
      }
    });
  }
}

module.exports = { API };
```
testAPI.js - Test that WP works
```
const { API } = require('./lib/api');

let api = new API();
api.initialize().then(console.log, console.error);
```
Run testAPI.js; you should see a JSON with your user details.
lib/scrape.js - Headless Webpage Scraper
This wraps JSDOM for convenience. The fnProcess constructor argument takes a function that receives a window object and returns parsed JSON. Since it is executed within the fake DOM, jQuery functions are available in its context.
```
const jsdom = require('jsdom');

class Scrape {
  constructor(url, fnProcess = null, libs = []) {
    this.url = url || null;
    // jQuery is always injected; callers can pass extra libraries.
    this.libs = ["http://code.jquery.com/jquery.js", ...libs];
    this.fnProcess = (typeof fnProcess === 'function') ? fnProcess :
      function (window) {
        return window.document.body.innerHTML;
      };
    this.output = null;
  }

  scrape() {
    return new Promise((resolve, reject) => {
      jsdom.env(this.url, this.libs, (err, window) => {
        if (err) {
          return reject(err);
        }
        this.output = this.fnProcess(window);
        resolve(this.output);
      });
    });
  }
}

module.exports = { Scrape };
```
testScrape.js - example.org should return a JSON
```
const { Scrape } = require('./lib/scrape');

let page = new Scrape('http://example.org/', function (window) {
  return { title: window.document.title, body: window.jQuery('p').text() };
});
page.scrape().then(console.log, console.error);
```
Run testScrape.js; you should see a JSON with the page's title and paragraph text.
index.js - The Glue
Now that we've tested these components individually, it is time to glue them together. Async is a popular library for managing control flow in Node applications. The file below is the code version of the logic outlined above.
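If you have not used async before, the one helper we need is forEachLimit, which iterates over a list while capping how many items are in flight at once. A minimal, standalone illustration:

```
const async = require('async');

// Visit four items, never more than two at a time.
async.forEachLimit([1, 2, 3, 4], 2, (item, done) => {
  setTimeout(() => {
    console.log('finished item', item);
    done(); // pass an error here to abort the whole run
  }, 100);
}, (err) => {
  if (err) return console.error(err);
  console.log('all items processed');
});
```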
The Scrape Function
Scrape the fields we want from Wikisource:
```
// Scrape function to be executed in DOM
const fnScrape = function (window) {
  // From: The Bhagavad Gita (Arnold translation)/Chapter 1
  // To:   Chapter 1
  let $ = window.jQuery;
  let title = $('#header_section_text').text().replace(/["()]/g, ""),
      body = $('.poem').text();
  return { title, body };
};
```
I tested and fine-tuned this in Chrome DevTools. Run it against a sample of your source URLs to make sure you account for page variations.
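One way to do that, assuming the page already loads jQuery (MediaWiki sites such as Wikisource do): open a chapter page and paste the body of fnScrape into the DevTools console as a self-invoking function.

```
// Paste into the DevTools console on a source page to preview the extraction.
(function (window) {
  let $ = window.jQuery;
  let title = $('#header_section_text').text().replace(/["()]/g, ""),
      body = $('.poem').text();
  console.log({ title, bodyLength: body.length });
})(window);
```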
The entire file:
```
const async = require('async');
const { List } = require('./lib/read'),
      { Scrape } = require('./lib/scrape'),
      { API } = require('./lib/api');

const csvFilePath = './data/source.csv',
      LIMIT_PARALLEL = 5;

// Step 1 - Init WP
let api = new API();

// Step 2 - Read CSV
const readTheFile = function () {
  let file = new List(csvFilePath);
  console.log('Reading file...');
  return file.read();
};

// Step 3 - Process multiple URLs
const processPages = function (data) {
  data.shift(); // CSV header
  console.log('Processing', data.length, 'pages');
  async.forEachLimit(data, LIMIT_PARALLEL, processSingle, (err) => {
    if (err) {
      return console.error(err);
    }
    console.log("Done!");
  });
};

// Step 4 - Get a JSON version of a URL
const scrapePage = function (url) {
  return new Promise((resolve, reject) => {
    if (url.indexOf('http') !== 0) {
      return reject('Invalid URL');
    }
    let page = new Scrape(url, fnScrape);
    page.scrape().then((data) => {
      console.log(">> >> Scraped data", data.body.length);
      resolve(data);
    }, reject);
  });
};

// Scrape function to be executed in DOM
const fnScrape = function (window) {
  // From: The Bhagavad Gita (Arnold translation)/Chapter 1
  // To:   Chapter 1
  let $ = window.jQuery;
  let title = $('#header_section_text').text().replace(/["()]/g, ""),
      body = $('.poem').text();
  return { title, body };
};

// Process a single URL (steps 4 and 5 for each page)
const processSingle = function (data, cb) {
  let [url] = data;
  console.log(">> Processing ", url);
  scrapePage(url).then((data) => {
    // Step 5 - Add page to WordPress
    api.addPage(data.title, data.body).then((wpId) => {
      console.log(">> Processed ", wpId);
      cb();
    }, cb);
  }, cb);
};

// Kick start the process
api.initialize()
  .then(readTheFile, console.error)
  .then(processPages, console.error);
console.log('WP Auth...');
```
Test
```
...
>> Processed  140
>> Processing  https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_12
>> >> Scraped data 12634
>> Processed  141
>> Processing  https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_13
>> Processed  142
>> Processing  https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_14
>> Processed  143
>> Processing  https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_15
>> >> Scraped data 3005
>> >> Scraped data 3706
>> >> Scraped data 5297
>> >> Scraped data 4039
>> Processed  144
>> Processing  https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_16
>> >> Scraped data 3835
>> Processed  145
>> Processing  https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_17
>> >> Scraped data 3781
>> Processed  146
>> Processing  https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_18
>> >> Scraped data 11816
>> Processed  147
>> Processed  148
>> Processed  149
>> Processed  150
>> Processed  151
Done!
```
Check your WP for the new content.
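If you prefer the terminal, the same REST API can list what was just created; drafts require authentication to view:

```
curl -u username:password '{baseurl}/wp-json/wp/v2/pages?status=draft&per_page=5'
```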