Migrate and Populate the Content
An oft-overlooked step in a website project is actually getting the content into the CMS. This might be an automated or manual project, depending on where the content is now, and how it might need to change.
Does This Apply?
Unless you’re creating a brand new site with all new content or throwing away all your old content and starting over, then you will be migrating content. Your specific project will dictate how much is automated or manual, but know that this workload will exist.
It’s time for a new house.
You spend a bunch of time looking at home plans and talking to home builders. You pick out finishes, buy an empty lot, and engage with an architect. You spend months visiting the construction site, keeping careful track of how it’s progressing.
You realize that your new house is going to drive other changes in your life. You register your kids in a new school. You decide to buy a more fuel-efficient car since your new house is fifteen miles further from the old one. You buy a new file cabinet to store all the paperwork, from builder invoices to instructions for the new oven.
The house nears completion. You call your utilities and arrange for transfer of service. You file change of address cards with all your different subscriptions. You tell Trevor, your landlord, that you’re moving out at the end of the month. You even plan a housewarming party for fifty of your friends.
The end of the month comes, and the house is done. The builder gives you the keys. You sign the mortgage papers. It’s all yours, congratulations.
You visit your new house. You walk inside … and sit on the floor. There’s no furniture. You go to make dinner, but there are no dishes in the sink. Your phone rings. It’s Trevor, your old landlord.
“Hey, when are you planning to come get all your stuff?”
[record scratch sound effect]
This is the story of content migrations. In our excitement and rush to build something new, we sometimes forget that we have a ton of stuff that we’ll need to move when this is all done.
In this chapter, we’re going to explain how to avoid this.
“Migration” can be a very dangerous word.
Some people will say “site migration” to mean the entire process of moving from one website to another. To them, the entire thing is a migration, from when you start talking about it, to when the new site goes live. Planning, designing, development, and moving content are all “migration.”
To other people, “migration” is moving content from one platform (your old CMS) to a new platform (your new CMS). The migration is one part of a much larger project.
We’ll revisit this point a couple times below, but let’s just get something out now: sometimes, the content migration is The Thing. Many projects are very small development efforts with monumental migration efforts behind them. We’ve seen projects where almost half the budget was dedicated to migration.
When someone says “migration,” do they mean the entire project, or just moving the content? Be relentless in baselining the usage of this term, because mistakes can be disastrous.
The Three Questions of a Content Migration
When planning a content migration, there are three different questions you need to answer. Sadly, these aren’t yes or no answers – these are more like three complicated problems you need to have solutions for.
- The Editorial Question: What content is moving, and how does it need to change on the way over?
- The Functional Question: How will functional or logical aspects of the content work in the new CMS?
- The Procedural Question: How will the actual bytes move from one disk to another, and what does the timing of that look like?
We’ll discuss each question below, but understand that some of this retreads subjects we’ve covered in the past. You will likely have done some of the work already, and that’s a great thing. Now you’re just going to reconsider that work in the context of your content migration.
The Editorial Question
We need to know what we’re going to move. We need a list of everything that’s going over, so we can move it and check it off our list.
Thankfully, this is the easiest question for us to answer by this point in the process, because a lot of this work will have been done already.
If you’re moving from your apartment to a new house, the scope of what you need to move is pretty simple – you need to get everything out of the apartment. Where that stuff goes is up to you, of course. Most of it will make it to the new house, but some of it may be donated or even thrown away. Still, it’s easy to know when you have everything accounted for, because in the end your apartment needs to be empty. A simple visual sweep can tell you if you got everything or not.
The same isn’t quite true of websites. While your apartment has a standard set of doors that you pass every day, your website has thousands of rooms – some of which you haven’t seen in years, and some that might not even have doors. It is very easy to forget something, or not plan adequately for it. Then you turn off the old website and that content simply goes away (insofar as your website visitors are concerned).
First, let’s look back to Chapter 7: Know Your Content and pull out that content inventory. This encompasses the scope of content that’s going to be on the new website. You also have a site map that defines content organization for the new site, including any new content to be written.
Essentially, you need to get from the current content inventory to the future site map. We can ignore new content at this point, because creating new content is an operational/editorial concern, and not really part of a migration.
But for content that already exists, we need to define within the inventory whether each page is a one-off or part of a highly structured group:
- Structured Groups of Content Objects: All news releases. All technical documentation. All forum posts. These are large groups of similar content that can be grouped together in a “bucket” that all have to go over to the new website. You do not need to list every news article, so long as everyone can agree on what a news article is, and that they’re all migrating.
Make It Easier on Yourself
It’s worth saying at this moment that you often don’t need to move all your content. Now might be a handy time to get rid of stuff.
Remember the apartment analogy above where some of the stuff in your apartment went to the trash dump? Same deal with digital content. Here’s a statement that you should repeat to yourself five times every single morning while you’re doing migration planning.
The easiest content to migrate is content that you throw away.
Seriously. We’re not even being flippant. Want to make your migration easier? Throw stuff away.
If you’ve performed a detailed content audit, you’ll have already reviewed this content from a few angles. Here are some factors you could take into account when deciding whether to keep or discard something:
- Traffic: Check your analytics. How often has this content been consumed?
- Centrality: How often is this content linked from other content? If it has 300 inbound links, that can be a problem. But if it’s not linked from anywhere – if it’s one of those rooms with no doors – maybe there’s a reason?
- Age: We’re not saying that old content is always bad, but consider how current or relevant the content is to your current content consumers. Are they likely to get any value out of it?
- Keywords: Some organizations are much more concerned about SEO than others, but consider if any of your target keywords appear in the content. If the answer is no, then evaluate whether or not it’s serving a purpose for you.
By using a set of rules, you might be able to identify a huge subgroup of irrelevant content in a larger group. Yes, all news articles are going over, but before you migrate them, can you perhaps cut down the “all” there? Delete what you can, then migrate the rest.
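As a rough illustration of rule-based pruning, the sketch below scores one page against the four factors above. The `Page` fields, the `triage` function, and every threshold are hypothetical placeholders; your own analytics and content audit would supply the real inputs and cutoffs.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Page:
    url: str
    views_last_year: int      # traffic, from analytics
    inbound_links: int        # centrality
    last_updated: date        # age
    has_target_keyword: bool  # keywords


def triage(page: Page, today: date,
           min_views: int = 50, max_age_days: int = 3 * 365) -> str:
    """Return 'keep', 'review', or 'discard' for one page.

    The thresholds are illustrative placeholders, not recommendations.
    """
    stale = (today - page.last_updated).days > max_age_days
    low_traffic = page.views_last_year < min_views
    orphaned = page.inbound_links == 0

    # Count how many of the four signals argue against keeping the page.
    strikes = sum([stale, low_traffic, orphaned, not page.has_target_keyword])
    if strikes == 4:
        return "discard"
    if strikes >= 2:
        return "review"  # a human should look before anything is deleted
    return "keep"
```

Running something like this over an inventory export produces a shortlist for people to review. It does not replace editorial judgment.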
The goal of The Editorial Question is to identify a list of all the content or groups of content that are migrating. Once this is done, then we move onto the next question, where we figure out how all your content is going to work in the new CMS.
The Functional Question
There are certain features and common page models that are relatively standard, regardless of the CMS.
For instance, news releases display a title, publish date, and a body. You can likely count on this working reasonably well in most any CMS. The simple display of structured content is a well-handled pattern.
However, some content-based functionality can be more in-depth and unique from CMS to CMS. For example, what about the page that lists all of those news releases? Let’s consider all the functionality that might exist here.
- News releases are listed in reverse chronological order by date published.
- Certain news releases are hidden unless the user is logged in with the correct user permissions.
- News releases can be filtered or categorized.
- The user can perform a full-text search on all news releases, and this search includes functionality like type-ahead suggestions, stemming, fuzzy matching, and scoring.
- The user can subscribe to a particular category to get email updates when a news release is added.
And so on.
This is essentially a software application, embedded as a subset of the larger website. Looking over that list, are you sure your new CMS will be able to do all of that?
If not, you’re going to need to make some decisions.
- What of those requirements are you willing to abandon?
- Do you have an integration partner that can find creative solutions to work around limitations in your new CMS?
- Is your new CMS the right choice? I’m not saying you should bail out at the smallest problem, but if you keep evaluating functionality and your new CMS keeps coming up short, then you might need to take a beat and think through this some more.
And understand that this is just one page on your website. A typical website has a dozen or more of these content-based applications scattered around. You need to account for all of them and make sure there are plans to make this all work.
Remember: functionality has to migrate, too. And even if that functionality is migrated, there’s a very real chance your new CMS handles it completely differently, and it cannot be migrated so much as it has to be reconstructed.
Navigation is a good example. Many systems (especially web-focused systems) use a tree of content to organize their pages, as we discussed in Chapter 20: Implement the Back-End Functionality. If you’re moving from a content tree-based system to another content tree-based system – for instance, from Sitecore to Optimizely – then you’re in pretty good shape. But what if you’re moving to a system that doesn’t use a tree? Say, Drupal? Or a headless system, most of which have eschewed content trees altogether? If your entire navigation logic depends on parent-child relationships, then you’re going to have to find a new way to program this.
Finding all this programming is not easy. Some of it will be reflected in source code, and the developers who implemented the system can point this out. But other functionality will simply be inherent to the CMS, and might be so ingrained that you can inadvertently extrapolate that functionality onto all CMSs, assuming your new one is going to handle it in the same way.
Sometimes, the best way to catalog this functionality is to simply walk through your current website with your implementation team. Look at every unique page, and make sure they have an answer for every piece of logic and functionality that goes beyond the simple display of a single content object.
Coming out of this process, you should have a list of all the content that’s moving (from The Editorial Question section above), and a list of all functionality that’s going to need to be replicated from one system to another.
The Procedural Question
By this point we know what content is migrating, and we’re aware of all the content functionality that’s going to have to be reconstructed (or reconsidered). A lot of that functionality was identified as development tasks and is being incorporated into our new website.
Now we have to figure out exactly how we’re going to migrate this. Our migration is basically a separate but related project that runs alongside the main website development. Our plan for this subproject is a roll-up of a number of different problems that need solutions:
- Extraction: How are we going to get content out of our old system?
- Transformation: How does the content need to be cleaned up?
- Import: How are we going to get the content into the new system?
- Timing: What does the order and spacing of this process look like?
We’ll look at each one individually. But first, do you need robots or humans?
Automated or Manual?
Here’s your first crucial decision: are you going to try to automate this process at all?
Make no mistake: you can do a migration manually. There’s nothing stopping you from buying pizza for all the interns and stuffing them in a conference room for a week to copy and paste content from your old CMS to your new one.
In fact, for some projects, this is exactly the right decision. Here’s when this makes sense:
- Low Volume: If you’re migrating less than 100 pages, then there’s little point in automating the process. You’ll spend more budget to configure the automation than you would save over doing it manually.
- High Transformation: Some content needs to be transformed considerably on the way over. For example, if you’re taking a bunch of content out of PDF files and putting this in HTML, that’s probably something you’re going to have to do manually.
- Low-Cost Labor: I wasn’t kidding about buying pizza for the interns. This happens all the time. Universities are legendary for using work-study students for exactly this kind of work.
Copy-and-paste is low-tech and tedious, but reliable and low-overhead. A manual migration takes no prep to get started, and sometimes you can brute force a migration faster than getting clever with it.
If you decide to do a manual migration, then very little of what’s below will apply to you. You will extract, transform, and import all in one step.
Also, know that any automated migration project will in reality include some manual migration. There’s always content that is not automatically migratable. These pages are usually made up of a mixture of content from a single object, design components that display other content, and aggregations that bring in more content. The complexity of these pages in relation to their relative scarcity means these will need to be reconstructed, rather than migrated.
There are a number of different options for getting content out of a CMS. Some are well-supported and helpful. Others are less so.
- Export: Some systems actually have an export system. However, it’s getting less common, and even if it exists, it’s important to know in what format the content will be exported, and if all the detail you need will be included.
- API: Almost all systems have an API that you can call from scripts. It’s not hard to use this to export content into some neutral format that you design. This generally gives you the most granular way to control exactly what gets exported and in what format it ends up.
- Database: If a system doesn’t offer either of the above options, you might be able to go around the CMS entirely and manually search its database. In many situations, a developer can hook directly into the database and retrieve content. The danger here is that the content in the database might not be exactly representative of what makes it to a browser.
- HTTP: The final option would be to simply make an HTTP call to the URL where the content is located. If you can get a list of URLs where the content is the operative content object (see Implement the Back-End Functionality), then you can very easily request it just like a browser would, then parse the HTML to break it up into the attributes you need. There are some drawbacks here – non-public content can be difficult, as can non-displayed attributes – but this is occasionally your last resort.
The goal of extraction is to get the content out into some neutral form where it can be easily accessed and manipulated, usually as a text markup format like JSON or XML, though occasionally you might deposit it directly into a simple database or other storage mechanism.
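To make the HTTP option concrete, here is a minimal sketch that parses one page of HTML into a neutral JSON-ready record using only Python's standard library. It assumes, purely for illustration, that the old site renders the title in an `<h1>` and the article text in a `<div class="body">`. The names `ArticleParser` and `extract` are hypothetical, and real templates will need their own rules.

```python
import json
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collect the <h1> text and the text inside <div class="body">."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self._in_title = False
        self._body_depth = 0  # > 0 while inside the body div

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_title = True
        elif tag == "div":
            if self._body_depth > 0:
                self._body_depth += 1      # nested div inside the body
            elif ("class", "body") in attrs:
                self._body_depth = 1       # entering the body div

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_title = False
        elif tag == "div" and self._body_depth > 0:
            self._body_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._body_depth > 0:
            self.body_parts.append(data)


def extract(url: str, html: str) -> dict:
    """Turn one fetched page into a neutral record, keeping the old URL."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return {
        "old_url": url,
        "title": "".join(parser.title_parts).strip(),
        # Collapse runs of whitespace left over from the markup.
        "body": " ".join(" ".join(parser.body_parts).split()),
    }


sample = ('<html><body><h1>Office Opens</h1><div class="body">'
          '<p>We opened a new office.</p><p>More.</p></div>'
          '<div class="footer">Ignore me</div></body></html>')
record = extract("/news/office-opens", sample)
print(json.dumps(record, indent=2))
```

In a real run the HTML would come from an HTTP request (for example `urllib.request.urlopen`) rather than a literal string, and each record would be written to its own JSON file.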
Once you have the content out of the old CMS, it often needs to be transformed or adjusted before being imported into your new CMS. Occasionally the content was modeled poorly in the old CMS, and it would be inefficient to move it as-is.
For example, consider some scenarios of content in your old CMS.
- The author’s name was represented in a single field. You want to be able to order authors by last name, so this is going to need to be split into first and last names.
- The model for committee meetings had a set number of fields (say, six) for attendees. This becomes a problem when more than six people attend, so the plan is to break this out to a relational attribute in the new system. Each meeting attendee will need to form a separate content object, and be linked into the meeting object where they used to appear.
- Comments were stored in a separate database table. This content needs to be extracted alongside the CMS content with a reference to the correct article content object from the old CMS.
Remember, when you import to your new CMS, it has no idea where the content came from, which means the rules you used on your old site to dictate design mean nothing.
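The first two scenarios can be sketched as small transformation functions that run over the extracted records. The field names (`attendee_1` through `attendee_6`, `id`, `meeting_id`) and the rule of splitting a name on its last space are assumptions made for illustration. Real names are messier, and some records will need manual cleanup.

```python
def split_author(full_name: str) -> dict:
    """Split on the last space; everything before it becomes the first name."""
    first, _, last = full_name.strip().rpartition(" ")
    return {"first_name": first, "last_name": last}


def split_attendees(meeting: dict, slots: int = 6) -> tuple[dict, list[dict]]:
    """Turn fixed attendee_1..attendee_N fields into separate attendee records.

    Returns the meeting without the old fields, plus one record per attendee
    that points back at the meeting it belongs to.
    """
    cleaned = {k: v for k, v in meeting.items() if not k.startswith("attendee_")}
    attendees = []
    for i in range(1, slots + 1):
        name = (meeting.get(f"attendee_{i}") or "").strip()
        if name:  # skip empty slots
            attendees.append({"name": name, "meeting_id": meeting["id"]})
    return cleaned, attendees


meeting = {"id": "m1", "title": "Budget",
           "attendee_1": "Ann Lee", "attendee_2": "", "attendee_3": "Bo Chan"}
cleaned, attendees = split_attendees(meeting)
```

Because these functions only read and write plain dictionaries, they can be run and re-run against the extracted files long before the new CMS is chosen.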
Many times, you need to “scrub” rich text. Your old CMS may have generated HTML from rich text editors – the body of an article, for instance – and this HTML is of varying quality. Sometimes it’s clean and can be imported without changes, but other times, you’ll need to write scripts to comb through this HTML and fix problems, like embedded scripts, character encodings not compatible with the new CMS, or deprecated HTML tags like FONT.
These operations can be tedious. If you’re lucky, the HTML is consistently bad, meaning you can fix these things at the global script level. Usually, the HTML is inconsistently bad, meaning you have to pick through it and fix them manually.
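For the “consistently bad” case, a scrubbing pass might look like the sketch below. It uses Python's standard `html.parser` to drop `<script>` elements entirely and to unwrap `<font>` tags while keeping their text. The tag sets and the `Scrubber` and `scrub` names are illustrative assumptions. Comments and doctypes are silently dropped, and encoding repair is not attempted here.

```python
from html import escape
from html.parser import HTMLParser

DROP_WITH_CONTENT = {"script"}  # remove the tag and everything inside it
UNWRAP = {"font"}               # remove the tag, keep what it wraps
VOID = {"br", "hr", "img"}      # elements that never get a closing tag


class Scrubber(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self._skip_depth += 1
        elif self._skip_depth == 0 and tag not in UNWRAP:
            rendered = "".join(
                f" {name}" if value is None else f' {name}="{escape(value)}"'
                for name, value in attrs
            )
            self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0 and tag not in UNWRAP and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Entities were decoded by the parser, so re-escape the text.
            self.out.append(escape(data, quote=False))


def scrub(html: str) -> str:
    scrubber = Scrubber()
    scrubber.feed(html)
    scrubber.close()  # flush any buffered trailing text
    return "".join(scrubber.out)


dirty = ('<p><font color="red">Sale</font> ends <b>soon</b> &amp; fast'
         '<script>alert(1)</script></p>')
print(scrub(dirty))
```

A script like this only helps with problems that follow a pattern. The inconsistent cases still end up in front of a person.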
Finally, what goes out must come back in again, and eventually you’re going to have to take that content you’ve gotten out of your old CMS and move it into your new CMS.
There are fewer options for this scenario. Most CMSs don’t have an “import content” function.
Usually, an import needs to be performed from code, by a developer, working against the new CMS’s API. It would seem fairly simple, but there are some nuances.
- You will need to keep a “manifest” to tie content from your old site to its corresponding new content.
- You will need to make your script “update aware,” so you can run it and re-run it over and over without creating new objects every time.
- Working with referential attributes – where an attribute on one content object points to another – can be tricky. If you’re importing 10,000 articles and each one is linked to one of 300 authors, then you need to make sure the authors exist first. You can’t link an article to something that isn’t there.
- You will need to do a link resolution process, where you search all HTML for embedded links to other content and then point them at the correct content.
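The nuances above can be sketched together in one small import script. Everything here is hypothetical. `FakeCms` stands in for whatever client your new CMS's API provides, and the `/old/<id>` and `/content/<id>` link formats are invented for the example. In practice the manifest would be saved to disk between runs.

```python
import re


class FakeCms:
    """In-memory stand-in for the new CMS's API (an assumption, not a real client)."""

    def __init__(self):
        self.objects = {}
        self._next_id = 1

    def create(self, fields: dict) -> str:
        new_id = f"new-{self._next_id}"
        self._next_id += 1
        self.objects[new_id] = dict(fields)
        return new_id

    def update(self, new_id: str, fields: dict) -> None:
        self.objects[new_id].update(fields)


def upsert(cms, manifest: dict, old_id: str, fields: dict) -> str:
    """Update-aware write: create on the first run, update on every re-run."""
    if old_id in manifest:
        cms.update(manifest[old_id], fields)
    else:
        manifest[old_id] = cms.create(fields)  # manifest ties old ID to new ID
    return manifest[old_id]


OLD_LINK = re.compile(r'href="/old/([\w-]+)"')


def run_import(cms, manifest: dict, authors: list, articles: list) -> None:
    # Pass 1: authors first, so articles have something to point at.
    for author in authors:
        upsert(cms, manifest, author["old_id"], {"name": author["name"]})

    # Pass 2: articles, with the author reference translated via the manifest.
    for art in articles:
        upsert(cms, manifest, art["old_id"], {
            "title": art["title"],
            "body": art["body"],
            "author": manifest[art["author_old_id"]],
        })

    # Pass 3: link resolution, now that every migrated target has a new ID.
    def resolve(match):
        old_id = match.group(1)
        if old_id in manifest:
            return f'href="/content/{manifest[old_id]}"'
        return match.group(0)  # target was not migrated; leave it for QA

    for art in articles:
        cms.update(manifest[art["old_id"]],
                   {"body": OLD_LINK.sub(resolve, art["body"])})


authors = [{"old_id": "a1", "name": "Ann Lee"}]
articles = [
    {"old_id": "n1", "title": "First", "author_old_id": "a1",
     "body": '<a href="/old/n2">next</a>'},
    {"old_id": "n2", "title": "Second", "author_old_id": "a1",
     "body": '<a href="/old/gone">older</a>'},
]
cms, manifest = FakeCms(), {}
run_import(cms, manifest, authors, articles)
run_import(cms, manifest, authors, articles)  # re-run creates no duplicates
```

Links whose targets were discarded are left untouched here, so that they surface during testing instead of quietly pointing somewhere wrong.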
Of course, these are all going to depend on the specifics of your migration. In reality, the best way to illustrate the complexity of a hypothetical migration project is … to provide an example of a hypothetical migration project.
Timing and Iterations: A Hypothetical Migration Project
We’ve been talking about a “migration” like it’s a singular thing that happens at a point in time, but that’s not how it works. A migration is a process that occurs alongside your development project.
Many parts of a migration are iterative, meaning you’ll try them, realize you did something wrong, delete the results, fix your problem, and do it again. This happens on extraction, transformation, and import. These things go in cycles, continually being refined, closer and closer to an ideal.
Remember too that migration can start early. You should start inventorying as soon as you decide to move your website. You can start extraction shortly thereafter. And you can do a lot of transformation without knowing the final destination for the content. Don’t sit around and wait for the new website to be done. Start working on your migration as early as you can.
The best way to illustrate this is to give a hypothetical timeline of events. Here’s how the entire migration process might work for a large project with around 50,000 content objects (pages, components, relationships, etc.):
- Your organization decides to move to a new CMS. Even before development has started – indeed, even before a new CMS is selected – the content team includes The Editorial Question in their content strategy. They start developing a list of what content is migrating and what isn’t as a part of the site map.
- The development team starts looking at ways to get content out of the old CMS. They determine they can do a rough export. Remember that the new CMS still hasn’t been selected, but the developers know they’re going to extract the content no matter what, and getting it out isn’t affected by what it’s going into.
- They don’t know exactly what content they’ll need, so they just export every last bit of content they can find. This is an iterative process – they export, review, evaluate what they missed and what’s not ideal, then delete the result, tweak their script, and export again. This goes on for a couple weeks. The developers eventually get the content out into thousands of JSON files.
- The developers get busy cleaning up the old content. It’s really old and messy. They still don’t know the CMS it’s going into, but they know it’s going to need to be fixed, regardless of destination.
- Finally, a new CMS is selected.
- In an epic working session on The Functional Question, the content strategy, design, and development teams figure out where content is going to go in the new system, and how it’s going to have to change to work under the new system. Much of this work begins in content strategy and information architecture, but this is when the concept becomes action. These teams also create a plan for content that needs to be created or manually migrated.
- The developers spend the next few weeks preparing the exported content in the JSON files. They write scripts to transform them, review the result, tweak their scripts, then transform again. This goes on until the JSON files are as perfectly clean and importable as possible.
- The migration QA team comes up with a plan to import content in stages, then QA the result before moving on. The content is divided into groups and a schedule is put in place.
- Import begins. As an example, the developers run a script to bring in 16,000 news articles. They take a quick look and realize something went wrong, so they delete them, fix their script, and re-import. Testing starts, and this time the testers realize that none of the images were imported. So, testing stops and scripts are tweaked.
- Up to this point, testers have just been recording issues, but not fixing them, in case everything needs to be re-imported. However, as QA continues, it appears that the last import is solid and they can commit it. A decision is made that an automated mass import will not be done again for the news articles, so manual editors and testers are free to start correcting issues.
- The developers will never run the news article import script again. They move on to the next group of content.
- Testing continues. Issues are fixed where possible, but occasionally there are issues that need further review. For example, one news article links to an older one that was discarded. Testers open tickets for these issues and assign them to the relevant staff to ensure the issues aren’t missed. The content strategy and editorial teams are busy analyzing problems and adapting the content to fit.
- This cycle of import, re-import, commit, and testing continues until all content is in. This process might take weeks. During this time, the content team has been creating new content to fill in the gaps. The new website slowly starts to come together.
- Once a content group is imported and the teams commit to that import, the content team acknowledges that this content is “frozen.” This means that it will not be re-imported, so if they want to change the content in the old system (which is still running the public website), they will need to wait on that change until after launch, or duplicate that change in the new system.
- All testing completes, all new content is completed, and the new site is fully populated and ready to launch. From this moment, the new site begins to get “stale.” Remember, the old CMS with all the original content is running the public website. And the new CMS with all the imported content is just sitting there. So, either no content can change in either system, or any change to the public website (which would be on the old CMS) must be duplicated on the new website-in-waiting.
- Thankfully, this period is short. The new website is cleared for launch. Once it launches, the other website is hidden from public access. It’s left online for a period of weeks in case the team realizes they missed something, or need to refer to old content or configuration.
- The old CMS and website are eventually taken offline and archived.
Now, look back through the above narrative, and acknowledge one thing: we didn’t talk at all about building the new website. That was all migration. Throughout that narrative, it’s assumed there is another project and another group of developers actually building the new site.
Can you imagine if that team got all done building the new website, and then said:
Okay, now let’s figure out how to migrate all this. I don’t know, maybe we should start with a content inventory or something?
Do you see that dot disappearing over the horizon? That’s your launch date.
Users and URLs
There are two specific situations that come up in enough migrations to be worth discussing separately.
Clearly, you’ll have to retrain your editors and get them new accounts on your new CMS. But what if you have users who log into your website? You might have user accounts for your customers or the public that they use to access content.
Many times, these user accounts are stored in your CMS. If this is the case, these accounts are going to have to move to the new CMS. The problem is you likely don’t have access to the users’ passwords. This is by design.
Passwords are not normally stored in clear text. Rather, they’re hashed (one-way encrypted) so that you can’t view them in their original form. This is a good security practice, but it means you can’t seamlessly create a new account for your users on your new CMS – you don’t know what their password was.
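A short sketch shows why the original password cannot be recovered. It uses PBKDF2 from Python's standard library as one example of a one-way scheme. The scheme, salt handling, and iteration count your old CMS actually uses are unknown here, so treat every parameter as an assumption.

```python
import hashlib
import hmac
import os


def hash_password(password: str, salt: bytes, iterations: int = 100_000) -> bytes:
    """Derive a fixed-length digest; there is no function that reverses this."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)


def verify(password: str, salt: bytes, stored: bytes) -> bool:
    """Login check: hash the attempt the same way and compare digests."""
    return hmac.compare_digest(hash_password(password, salt), stored)


salt = os.urandom(16)                     # stored alongside the digest
stored = hash_password("hunter2", salt)   # this is all the old CMS keeps
```

The old system can confirm a login attempt by hashing it again, but the stored digest never yields the password itself.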
Often, this means that moving all these user accounts will require your users to create new passwords, or even entirely new accounts, which isn’t as much of a technical lift as it is a communication lift. You’ll need to develop a plan to address this change – not only to clearly explain what’s necessary, but also to convince them that this isn’t a phishing attempt, and that you genuinely do need them to reset their passwords.
If your website has been online for any period of time, you have published URLs that have crept “outside the walls.” Search engines have indexed them. Users have shared them on social media. They have been bookmarked.
Which means people will click on them … and they will get … nothing?
In a perfect world, your URLs won’t change. But this isn’t common. Different CMSs have different ways of forming URLs. You can override this on some of them to mock up your old URL structure, but sometimes this creates more problems than it solves, and it can be an overbearing solution to a problem solved through other means.
The most straightforward way to manage redirection is to store the old URL with the new content. So every one of the 16,000 news articles you imported will have an attribute for “Old URL.” When a request comes into the new website and generates a 404 Not Found (since the old URL doesn’t exist anymore), you can do a lookup to figure out what they were looking for, then redirect them.
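The lookup described above can be sketched as follows. The `old_url` and `new_url` field names and the normalization rule (drop the host, query string, and trailing slash) are assumptions for illustration. A real implementation would hook into your CMS's or web server's 404 handling.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def normalize(url_or_path: str) -> str:
    """Reduce a URL to its path, without query string or trailing slash."""
    path = urlsplit(url_or_path).path.rstrip("/")
    return path or "/"


def build_redirect_map(content: list) -> dict:
    """Index imported content by the 'Old URL' attribute stored with it."""
    return {normalize(c["old_url"]): c["new_url"]
            for c in content if c.get("old_url")}


def handle_not_found(requested: str, redirects: dict) -> Tuple[int, Optional[str]]:
    """Called when the new site has no page at the requested URL."""
    target = redirects.get(normalize(requested))
    if target:
        return 301, target  # permanent redirect to the content's new home
    return 404, None        # genuinely gone


content = [{"old_url": "http://www.example.com/news/2019/office-opens/",
            "new_url": "/newsroom/office-opens"}]
redirects = build_redirect_map(content)
```

A permanent (301) redirect is used in this sketch because it signals to browsers and search engines that the move is final.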
Many times, new sites have been launched without a URL redirection plan. They promptly fell out of every search engine, and 404 Not Found requests went through the roof.
Migrations are … unpleasant. No one wants to think about this stuff until it’s too late. Consequently, migrations are chronically overlooked, in terms of both budget and schedule.
A universal rule: plan more time or budget than you think you will need to migrate your content. You will absolutely use it.
Know too that migrations tend to be a little less planned and smooth than the development of the new website. Even if you run the most perfect development project in the world, your migration project might get a little rougher the closer you get to launch. Problems will be found, quick fixes will be hacked into place, and the entire thing will be looked at as disposable – you need to do “just enough” to get it done.
Most migrations come skidding across the finish line backwards, on their roof, and on fire. But the checkered flag tends to make all those problems seem insignificant.
Inputs and Outputs
The inputs and outputs are hard here, because this is not something that happens within the larger project, but rather alongside it. So there are many inputs and outputs, happening all throughout the main project.
At minimum, some of the tasks performed during content planning – a content inventory, audit, and site map – will inform the next steps.
The Big Picture
You need to start your migration early. Like, at the very, very beginning of all this – back when you started talking about goals and plans. You should have been inventorying content back then. And, as our extended narrative above explained, your migration runs alongside the main strategy and development project. At your very first meeting about this project, you should start talking about migration. No time is too early.
This is something that everyone will be involved with, except maybe designers (they might do a bit of QA, but that’s about it). You’ll need the full cooperation of the content, development, and management teams to pull a migration off.