Four Kitchens
Migrating old HTML files into Drupal

The Internet in the 90’s – a much simpler place.

We’ve done several migrations for clients who need their old, legacy content imported into Drupal from a collection of static HTML files. In this post I’ll outline the procedure we use to migrate and provide some solutions to common problems related to encoding, line endings, and parsing HTML with QueryPath. Code snippets are provided inline, and the complete source code is available as a GitHub gist.

1. Set up a migration source

Let’s work with a hypothetical example site, which has the following directory listing:

 $ ls /mnt/html
 about.html
 news.html
 h1001.html
 h1002.html
 ...
 h1133.html
 contact.html

The main workhorse is the extremely well-architected Migrate module. If you haven’t yet discovered the wonders of its elegant abstraction and flexibility, my suggestion is to watch the presentations on the project page to gain a firm understanding of how it works. I won’t go into the basics here, as I’m going to cover only tips related to importing static HTML.

To use the Migrate module, you start by defining your migration. This is done by extending the Migration base class and providing a constructor that injects the required dependencies into the object. The dependencies are a migration source, a destination, field mappings, etc. The migration source is what we’re interested in right now.

To assemble the source object we’ll need to create two things. The first is a MigrateListFiles, an object that provides a list of filenames. These filenames are used as ids in the Migrate module. You just need to create the MigrateListFiles with some parameters that direct it to the files you want:

// Escape the dot and anchor the pattern so only names like h1001.html match.
$regex = '/^h[0-9]+\.html$/';
$list_files = new MigrateListFiles(array('/mnt/html'), '/mnt/html', $regex);

The first parameter is an array of directories. In our case we only have one, but there could be several source directories. Next is the base directory, which is the part of the path that you’d like stripped when ids are created. Finally comes the regular expression that filters the files in the directory down to just the ones we care about. In our case, we keep any .html file whose name starts with the letter ‘h’ followed by some digits. This means we won’t be migrating the individual about.html, contact.html, etc. pages.
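It is worth checking a file mask against a sample listing before running a migration. The sketch below is plain PHP, not Migrate module code: `filterSourceFiles()` is a hypothetical helper, and it assumes the mask is applied to each bare filename with `preg_match()`. It compares a loosely written mask with an anchored one that escapes the dot.

```php
<?php
/**
 * Returns the filenames that match a PCRE mask.
 *
 * Hypothetical helper for experimenting with masks; not part of Migrate.
 */
function filterSourceFiles(array $filenames, string $mask): array {
  return array_values(array_filter($filenames, function ($name) use ($mask) {
    return preg_match($mask, $name) === 1;
  }));
}

$files = array(
  'about.html',
  'news.html',
  'h1001.html',
  'h1002.html',
  'contact.html',
  'oh1001xhtml', // A stray file that should not be migrated.
);

// Unescaped dot, no anchors: the stray file slips through.
$loose = filterSourceFiles($files, '/h[0-9]+.html/');

// Escaped dot, anchored at both ends: only the h####.html files remain.
$strict = filterSourceFiles($files, '/^h[0-9]+\.html$/');

print implode(', ', $strict) . "\n";
```

The anchored form costs nothing and protects against backup copies or oddly named files that happen to contain a matching substring.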

We then create an object that provides a method of turning an id (filename) into a migrate-able chunk of data, which in this case means doing a file_get_contents() and returning the file contents to you:

$item_file = new MigrateItemFile('/mnt/html');

We then create an array which explicitly states the fields we’re going to be providing, and finally create the actual migration source, passing it the two objects from above:

$fields = array('title' => t('Title'), 'body' => t('Body'));
$this->source = new MigrateSourceList($list_files, $item_file, $fields);

To reiterate, the MigrateSource is going to be a MigrateSourceList, which is a basic source that provides the ability to iterate over a MigrateListFiles and get data from a MigrateItemFile.

If having all these objects seems unnecessarily complex, remember that the Migrate module is extremely flexible – there’s a lot of code reuse going on here, so the tradeoff is worth it.

The rest of the source setup is beyond the scope of this article, and the GitHub gist contains a full example.

2. Fixing HTML file encodings

Old static sites typically get modified over the years by multiple people using many different platforms and editors, which results in a line ending and character encoding disaster. Any broken characters normally don’t show in the browser, even when the web server headers and HTML tags suggest an encoding that conflicts with the actual encoding. This is because browsers are really, really good at sorting out the mess, which is something that they probably should not be doing – developers should be correcting problems in the source. But anyway.

The first step, then, is to use a good text editor to browse through a handful of files one by one, checking them for inconsistencies. You may find that several are detected as ISO-8859-1 (Latin-1) but are actually Windows-1252, so they appear to contain broken characters where there should be curly quotes and other Windows niceties. Assuming this is the case, you can use a snippet like this to convert the file contents to UTF-8 before doing anything else with it:

$enc = mb_detect_encoding($html, 'UTF-8', TRUE);
if (!$enc) {
  $html = mb_convert_encoding($html, 'UTF-8', 'WINDOWS-1252');
}

It’s pretty simple – if the content is not valid UTF-8 (which PHP detects very reliably, unlike WINDOWS-1252), assume it’s WINDOWS-1252 and convert. If some of your content is in ISO-8859-1 this shouldn’t cause a problem, since WINDOWS-1252 matches ISO-8859-1 for every printable character and differs only in the 128-159 range, which ISO-8859-1 reserves for control characters.

Modify the above snippet according to the encodings you find in your source data.
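As a quick sanity check, the conversion can be wrapped in a small function and fed a few known bytes. This is a minimal sketch, assuming the mbstring extension is available; `toUtf8()` is a hypothetical name, and it uses `mb_check_encoding()` to test for valid UTF-8 instead of the `mb_detect_encoding()` call above.

```php
<?php
/**
 * Converts a string to UTF-8 unless it is already valid UTF-8.
 *
 * Hypothetical helper; the fallback encoding is an assumption you should
 * adjust to whatever you actually find in your source files.
 */
function toUtf8(string $html, string $fallback = 'Windows-1252'): string {
  if (mb_check_encoding($html, 'UTF-8')) {
    // Already valid UTF-8 (plain ASCII counts too), so leave it alone.
    return $html;
  }
  return mb_convert_encoding($html, 'UTF-8', $fallback);
}

// Byte 0x99 is the trademark sign in Windows-1252.
$converted = toUtf8("Acme\x99");

// Bytes 0x93 and 0x94 are the curly double quotes.
$quoted = toUtf8("\x93Hi\x94");

// A string that is already UTF-8 passes through unchanged.
$untouched = toUtf8("Acme\u{2122}");

print $converted . "\n";
```

Checking validity first matters: running the Windows-1252 conversion over a file that is already UTF-8 would double-encode every non-ASCII character.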

3. Fixing line endings

There’s Unix, Windows, Classic Mac… and I’ve even discovered a crazy hybrid CRCRLF which I assume was created by a buggy editor. Best to convert everything to Unix (LF). Depending on your source, one way to do that is to simply blast the CR characters away:

$html = str_replace(chr(13), '', $html);

It’s worth noting that getting rid of the CR characters is a necessary step if you want to use QueryPath, as described below.
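One caveat: if any of your files use Classic Mac (CR-only) line endings, deleting every CR joins all of their lines together. A hedged alternative is to convert every style of line ending to LF instead. The function name below is hypothetical, and the pattern assumes the only variants present are LF, CRLF, CR and the CRCRLF hybrid.

```php
<?php
/**
 * Converts CRLF, CRCRLF and bare CR line endings to LF.
 *
 * Hypothetical helper; adjust the pattern if your files contain other
 * combinations.
 */
function normalizeLineEndings(string $html): string {
  // "\r+\n" covers CRLF and CRCRLF; the lone "\r" covers Classic Mac.
  return preg_replace('/\r+\n|\r/', "\n", $html);
}

$mixed = "a\r\nb\rc\r\r\nd\ne";
$normalized = normalizeLineEndings($mixed);

// Show the result with visible escapes.
print json_encode($normalized) . "\n";
```

After this step the string contains no CR characters at all, so it satisfies the same requirement as the simpler `str_replace()` approach.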

4. Fixing code point HTML entities

The correct entities to use in HTML for symbols are the “named” entities – for example, the ™ sign should be &trade;. However, browsers also accept &#153; for ™, which is a reference to a code point in the extended ASCII (Windows-1252) table. This would be fine, except when the document is UTF-8, because &#153; does not refer to a printable Unicode character. The browser just cheats and assumes you mean the extended ASCII code point for ™, even though the correct Unicode code point for ™ is &#8482;. PHP isn’t so lenient, and will reject invalid values. Try it yourself:

print html_entity_decode('&trade;', ENT_COMPAT | ENT_HTML401, 'UTF-8');
// ™
print html_entity_decode('&#153;', ENT_COMPAT | ENT_HTML401, 'UTF-8');
// No ™ here: PHP does not map this reference to the trademark sign.
print html_entity_decode('&#8482;', ENT_COMPAT | ENT_HTML401, 'UTF-8');
// ™

So, if your source content has numeric HTML entities in the extended ASCII range 127-159, you’ll have to replace them with named entities yourself before parsing the content with QueryPath. QueryPath relies on PHP’s default decoding functions, so it will otherwise fail to generate the characters you want.

Luckily, this isn’t too hard:

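function convertEntities($html) {
  // Map Windows-1252 numeric references to their named equivalents.
  $entities = array(
    '&#150;' => '&ndash;',
    '&#151;' => '&mdash;',
    '&#152;' => '&tilde;',
    '&#153;' => '&trade;',
    // … get them all!
  );
  $html = str_replace(array_keys($entities), array_values($entities), $html);
  return $html;
}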

Here is the full list of ASCII characters.
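If typing out every named entity is tedious, one alternative is to rewrite each reference in the 128-159 range to the numeric reference of the matching Unicode code point, which PHP decodes happily. The sketch below is not the approach used above: the constant and function names are hypothetical, the table follows the standard Windows-1252 layout, and only decimal references (not hexadecimal ones like `&#x99;`) are handled.

```php
<?php
// Unicode code points for the Windows-1252 bytes 128-159. The five bytes that
// Windows-1252 leaves undefined (129, 141, 143, 144, 157) are omitted.
const CP1252_TO_UNICODE = array(
  128 => 0x20AC, 130 => 0x201A, 131 => 0x0192, 132 => 0x201E,
  133 => 0x2026, 134 => 0x2020, 135 => 0x2021, 136 => 0x02C6,
  137 => 0x2030, 138 => 0x0160, 139 => 0x2039, 140 => 0x0152,
  142 => 0x017D, 145 => 0x2018, 146 => 0x2019, 147 => 0x201C,
  148 => 0x201D, 149 => 0x2022, 150 => 0x2013, 151 => 0x2014,
  152 => 0x02DC, 153 => 0x2122, 154 => 0x0161, 155 => 0x203A,
  156 => 0x0153, 158 => 0x017E, 159 => 0x0178,
);

/**
 * Rewrites decimal references such as &#153; to their Unicode equivalents
 * (&#8482; in that case). Hypothetical helper.
 */
function convertCp1252Entities(string $html): string {
  $map = array();
  foreach (CP1252_TO_UNICODE as $byte => $codePoint) {
    $map['&#' . $byte . ';'] = '&#' . $codePoint . ';';
  }
  // strtr() replaces all keys in one pass and never rescans its own output.
  return strtr($html, $map);
}

$fixed = convertCp1252Entities('Acme&#153; 1990&#150;1995 &#169;');
print $fixed . "\n";
```

References outside the range, such as `&#169;`, are left untouched because they already point at the right Unicode characters.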

5. Parsing HTML with QueryPath

You’re probably going to want to process and transform the source in some way, and for that, QueryPath is an excellent choice. It uses PHP’s native DOM handling abilities to load HTML and execute jQuery-like chainable commands. In fact, it provides most of the jQuery functions for your usage in PHP.

Getting the HTML into a QueryPath object can be tricky. Depending on how the source HTML is set up, each individual file may have a complete page with headers and footers, or it could include the sidebar and other components using Apache’s Server-Side Includes (SSI) and just contain the main content area (a saner approach).

In the case where you just have the body content in each file, you’ll need to pad it with the complete structure before loading it into QueryPath. To do that, use a function like this:

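function wrapHTML($body) {
  // We add surrounding <html> and <head> tags.
  $html = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">';
  $html .= '<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>';
  $html .= $body;
  $html .= '</body></html>';
  return $html;
}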

I found that I always have to add this surrounding HTML in exactly this way in order for QueryPath to correctly use UTF-8. We can now create our QueryPath object, combining all the above tips in the correct order:

$html = str_replace(chr(13), '', $html);
$enc = mb_detect_encoding($html, 'UTF-8', TRUE);
if (!$enc) {
  $html = mb_convert_encoding($html, 'UTF-8', 'WINDOWS-1252');
}
$html = wrapHTML($html);
$html = convertEntities($html);
$qp_options = array(
  'convert_to_encoding' => 'utf-8',
  'convert_from_encoding' => 'utf-8',
  'strip_low_ascii' => FALSE,
);
$qp = htmlqp($html, NULL, $qp_options);

You can now perform your transformations on the $qp object. For example, you may want to process all anchors, so that you can change the ‘href’ on internal links to point at a new path:

// For all anchor links.
$anchors = $qp->top('a');
foreach($anchors as $a) {
  $href = trim($a->attr('href'));
  $href = getNewHref($href);
  // Set the new href.
  $a->attr('href', $href);
}
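The `getNewHref()` helper is not shown here, and what it does depends entirely on your site. As a purely hypothetical illustration, the sketch below assumes the old h####.html pages should map to paths like /articles/####; both the path scheme and the implementation are assumptions. It leaves external links, in-page anchors and unrecognised files untouched.

```php
<?php
/**
 * Maps an old static href to a new site path.
 *
 * Hypothetical implementation: the /articles/ prefix is an assumption made
 * for illustration only.
 */
function getNewHref(string $href): string {
  // Leave empty values, absolute URLs (http:, mailto:, ...),
  // protocol-relative URLs and in-page anchors alone.
  if ($href === '' || preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $href)) {
    return $href;
  }
  // Rewrite h1001.html or /h1001.html, keeping any #fragment.
  if (preg_match('~^/?h([0-9]+)\.html(#.*)?$~', $href, $m)) {
    return '/articles/' . $m[1] . ($m[2] ?? '');
  }
  // Anything else (about.html, images, ...) is returned unchanged.
  return $href;
}

print getNewHref('h1001.html') . "\n";
```

In a real migration you would more likely look up the destination path from the Migrate map table than derive it from the filename.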

What about stripping out all those pesky Dreamweaver HTML comments? Easy:

foreach ($qp->top()->xpath('//comment()')->get() as $comment) {
  $comment->parentNode->removeChild($comment);
}
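The same XPath technique also works without QueryPath, using PHP’s built-in DOM extension directly. This is a minimal sketch, assuming the DOM extension is enabled and the input is a UTF-8 body fragment; `stripHtmlComments()` is a hypothetical name.

```php
<?php
/**
 * Removes all HTML comments from a body fragment using DOMXPath.
 *
 * Hypothetical helper built on PHP's native DOM classes, not on QueryPath.
 */
function stripHtmlComments(string $fragment): string {
  $doc = new DOMDocument();
  // Old markup is rarely valid, so collect parser warnings quietly.
  $previous = libxml_use_internal_errors(true);
  $doc->loadHTML(
    '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>'
    . $fragment . '</body></html>'
  );
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $xpath = new DOMXPath($doc);
  // Copy the node list to an array before removing nodes from the tree.
  foreach (iterator_to_array($xpath->query('//comment()')) as $comment) {
    $comment->parentNode->removeChild($comment);
  }

  // Serialize only the children of <body>, dropping the wrapper again.
  $out = '';
  foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $out .= $doc->saveHTML($child);
  }
  return $out;
}

$clean = stripHtmlComments('<p>Hi<!-- InstanceBeginEditable name="main" --> there</p>');
print $clean . "\n";
```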

To get your body content out again, use the innerHTML function:

$body = $qp->top('body')->innerHTML();

Conclusion

These are some fairly tough problems to diagnose the first time you run into them, especially when dealing with a large body of content, so I hope this post helps you to get your clients safely into a Unicode-based Drupal world.