3 min read

Warning - this post contains almost everything you shouldn't do. Parsing XML with regular expressions, interfacing with binary files and expecting that they won't change, using code outside of Excel instead of using any of the in-built functions, making assumptions about data formats, and using Perl without comments.

I was recently given an Excel file which contained thousands of one-column rows and they wanted the text splitting across two columns. The second part of the text was in bold and this was the part to be entered into a different column.

Ideally the data could have been exported into CSV with <b> tags, but the CSV output has all formatting stripped, so instead I took the text from the .xslx file itself. Note that I am not familiar with the Excel XML standard, so it's possible I've made some bad assumptions.

An xslx file is basically a zip archive. If you rename it to .zip you can extract it, and inside there are several files. One of them, xl/sharedStrings.xml, contains XML of every row (or at least, did in this case). It's all on one line, but each line was contained within a <si>...</si> stanza, so for example:

<si><r><t xml:space="preserve">10) </t></r><r><rPr><sz val="7"/><color theme="1"/><rFont val="Times New Roman"/><family val="1"/></rPr><t>Column part 1 </t></r><r><rPr><b/><sz val="7"/><color theme="1"/><rFont val="Times New Roman"/><family val="1"/></rPr><t>bolded part here!</t></r></si><si><r><t xml:space="preserve">10) </t></r><r><rPr><sz val="7"/><color theme="1"/><rFont val="Times New Roman"/><family val="1"/></rPr><t>Column part 1 </t></r><r><rPr><b/><sz val="7"/><color theme="1"/><rFont val="Times New Roman"/><family val="1"/></rPr><t>bolded part here!</t></r></si>

Inside there, if you look carefully, before the bolded section of the row there is a </b>. I can't see an 'unbold' in there, so it appears that this will 'toggle' bold.

So my crude steps were to:

  • Split the text into lines by inserting a new line after every </si>
  • Split the columns on bold by replacing </b> with a tab (I chose a tab because the data contained commas, and occasionally a row had more than one </b> so this was easier than trying to determine the number of fields and wrapping them with quotes
  • Remove any other XML - I just removed the text inside tags
  • Print the new row as a .tsv file ready to be reimported to Excel

It's horrible, but it worked as a quick fix.

The code to achieve this was written in Perl and is presented below without comments, apologies, or shame. It takes the name of the xlsx file as an argument, and creates output.tsv. Example usage:

perl split_on_bold.pl file.tsv

#!/usr/bin/perl

use warnings;
use strict;

use Archive::Zip qw(:ERROR_CODES);
use Getopt::Std;
use Cwd;

my $zip = Archive::Zip->new();
my $zipName = shift(@ARGV);
my $status = $zip->read($zipName);
my $target_full = "xl/sharedStrings.xml";
my $target = "sharedStrings.xml";
my $output_file = "output.tsv";

die "Read of $zipName failed\n" if $status != AZ_OK;
$status = $zip->extractMemberWithoutPaths($target_full);
die "Extracting $target_full from $zipName failed\n" if $status != AZ_OK;
	
open my $file, "<", $target;
open my $output, ">", $output_file;
my $data;
{
    local $/;
    $data = <$file>;
}
$data =~ s/<\/si>/<\/si>\n/g;
my @lines = split /^/, $data;

foreach my $line (@lines) {
    next unless $line =~ /<si>/;
    $line =~ s/<b\/>/\t/g;
    $line =~ s/<.*?>//g;
    print $output $line;
}
my $path = getcwd;
print "Check ${path}/${output_file} for a tsv\n";
unlink($target);
James Lawrie

James Lawrie

James has over a decade of experience working for companies such as Percona, UKFast, and Bytemark. In his spare time he rides his motorbike, lifts weights, and learns Polish.