migrate_spreadsheet - Migrating from Excel sheets to Drupal

Upgrading websites off of legacy platforms and onto Drupal is a pretty common request. One of the most helpful contributor developed modules, Migrate, is now in Drupal 8 core, which really solidifies Drupal as a great choice when having to make the decision of which CMS to go with when upgrading.

A major problem that can come up early in planning migrations for older sites is the source data format. There are some fairly well-known contrib modules that help with this already; Migrate JSON , XML, and CSV. XML and CSV source plugins are particularly useful with older sites. For instance, legacy Oracle content management systems used XML for menu and page structure.

CSV source data is a common occurrence for older sites. It can be useful to have a CSV export when it's not easy to gain quick access to the source data, or in some cases, when the source data comes from data types that aren't able to be read by Drupal's migrate module. One of those unreadable data sources, which comes up more often than expected, is Microsoft Access. Access provides two tools: databases and reporting.

Until recently, the easiest solution to migrate Access data into Drupal using the migrate module was to export data and reports into CSV files and use migrate_source_csv to read and import the data into Drupal. While this is possible to get working it often requires more time to finish than it probably should.

A number of issues can come up when dealing with data in CSV files. If the CSV file is created on a Windows-based machine, line endings and file encoding affect how each row and value is being read by the importer. If values are not encapsulated correctly, it can cause the importer to read single values as separate columns and cause all kinds of issues for the imported data. Some of which can not always be caught right away. These problems are particularly difficult to figure out when you're dealing with large sets of data across multiple CSV files.

One of the nice features of Microsoft Access is that it provides functionality to export databases and reports to Excel files. Excel (xls, xlsx, ods, etc.) files are usually larger than CSVs and inherently more complicated in structure, but what they provide is data that can be read the same way across multiple spreadsheet applications in different operating systems. If only there was a way we could read data from spreadsheet files rather than having to use multiple CSVs.

In comes the Migrate Spreadsheet module. As of writing this post it hasn't even been on drupal.org for a full month, but it has already proved its worth. I had the unfortunate problem of being stuck with an xlsx file as source data for a migration. I tried converting it to CSV but encountered the issues that can potentially come with csv data listed above.

This module leverages the PhpSpreadsheet plugin by PHPOffice, which makes it possible to read data from ods, xlsx, xls, xml, html, sylk, and csv files. We are only concerned with xlsx in this case. Both migrate_spreadsheet and PhpSpreadsheet can be installed via Composer by following the installation instructions, so this is ready to go for modern Drupal 8 projects.

Setting up the migrate_plus configuration file is pretty straightforward and similar to how CSV migrations are set up. The major differences are that you need to specify the worksheet you're migrating from in the xlsx file. This is actually very useful so that you can use one source data file for multiple migrations. You can then set the origin, which is the starting point for data in the worksheet, by cell ID (ex: A2). Following that, you can set the row number which contains your header information (if you have one).

The list of columns is slightly different than in a csv migration since you only need to list the columns as they appear in the excel sheet. Unlike the csv migration configuration, you don't need to set key or index for each column name. The only other difference being that when setting the key identifier for the source data, you need to tell the migration what type of data it is (ex: integer). Below is an example of an excel sheet source migration configuration file.

Yaml code:

id: example_nodes
label: Example Nodes
source:
  plugin: example_node
  # The source file. The path can be either relative to Drupal root or it can
  # be a also an absolute reference such as a stream wrapper.
  file: modules/custom/example_xlsx_migrate/example_nodes.xlsx
  # The worksheet to be read.
  worksheet: worksheet1
  # It should use a spreadsheet representation (B2, A2, ...). The data area does
  # not include the header. If it's missed, the assumption is that the first row
  # contains the table header and the data origin is the first cell of the
  # second row. That is A2. In this example the data area starts from the second
  # column of the third row.
  origin: A2
  # The row where the header is placed, if any. If missed, there's no table
  # header for the data being inserted and the spreadsheet columns (A, B, ...)
  # the table header. If the table row is the first row, this should be 1. The
  # will be used as column names. The value of 2 means that the table header is
  # on the second row.
  header_row: 1
  # Columns to be returned, basically a list of table header cell values.
  columns:
- Title
- 'ExternalLink'
- Body
- Tag
- ID
- 'Bylineauthors'
  # The name of the column with the row index. If this setting is specified, the
  # source will return also a column with this name containing the row index.
  # In this example 'Row no.' can be used later in keys: list to make this
  # column a primary key column.
  # row_index_column: ID


  # The primary key as list of keys. Is a list of source columns that are
  # composing the primary key. The list is keyed by column name and has the
  # field schema as value. If the table have a header (i.e. 'header_row:' is set)
  # the keys will be set ad the name of header cells acting as primary index.
  # Otherwise the column letters (A, B, C, ...) can be used. If no keys are
  # defined, the current row position will be returned as primary key.
  keys:
ID:
type: integer
destination:
  plugin: entity:node
process:
  type:
plugin: default_value
default_value: example_node
  # Most fields can be mapped directly - we just specify the destination (D8)
  # field and the corresponding field name from above, and the values will be
  # copied in.
  title: Title
  field_link: 'ExternalLink'
  field_byline_author: 'Bylineauthor'
  body: Body
  tags: Tag
  field_legacy_id: ID


migration_dependencies: {}

The migration class itself doesn't differ much from other migration classes. Simply include a use statement for Drupal\migrate_spreadsheet\Plugin\migrate\source\Spreadsheet; make sure your class extends the Spreadsheet class, and that's it!

Additional Resources
Registering Migrations in Drupal 8 | Blog
Top Drupal 8 Modules | Blog
Intro to Drupal 8 Vocabulary | Blog