A question about querying office spreadsheets, on the Mulberry XSL list.
1 2 3 4 5 6 7 | ... All I need to do is retrieve the string value of every cell of column "A" whose string length is greater than 20 characters in length. Anyone know how to use XPath to retrieve this from an Excel 2007 file? ... |
- XQuery your office documents
- XQuery your Excel spreadsheets
- Yes! You can finally query your office documents!
- Using XQuery to join two Excel spreadsheets? Why not?
Back to the question, looks like a fairly simple question, isn’t it? Let’s try… Let’s start from a concrete OOXML example, a worksheet might look as follows.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | <worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"> ... <sheetData> <row r="1" spans="1:27"> <c r="A1" t="s"> <v>0</v> </c> <c r="AA1" t="s"> <v>0</v> </c> </row> <row r="2" spans="1:27"> <c r="A2"> <v>1.23</v> </c> </row> <row r="3" spans="1:27"> <c r="A3" t="s"> <v>1</v> </c> </row> <row r="4" spans="1:27"> <c r="A4" t="str"> <f>CONCATENATE("Hello ", "world", "!")</f> <v>Hello world!</v> </c> <c r="AA4" t="s"> <v>2</v> </c> </row> ... </sheetData> ... </worksheet> |
- For cell A1, the t=”s” attribute, indicates it is a string. As such the 0 value in the
element is an index in the shared string table. - AA1 has the same value as A1
- A2 has a numeric value 1.23
- A4 has a string-based formula, as such the value is stored in the cell itself
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 | <sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="5" uniqueCount="3"> <si> <t>Hello world!</t> </si> <si> <r> <t xml:space="preserve">Hello </t> </r> <r> <rPr> <b/> <sz val="11"/> <color rgb="FF00B050"/> <rFont val="Calibri"/> <family val="2"/> <scheme val="minor"/> </rPr> <t>wonderful</t> </r> <r> <rPr> <sz val="11"/> <color theme="1"/> <rFont val="Calibri"/> <family val="2"/> <scheme val="minor"/> </rPr> <t xml:space="preserve"> </t> </r> <r> <rPr> <i/> <sz val="11"/> <color rgb="FFFF0000"/> <rFont val="Calibri"/> <family val="2"/> <scheme val="minor"/> </rPr> <t>world</t> </r> <r> <rPr> <sz val="11"/> <color theme="1"/> <rFont val="Calibri"/> <family val="2"/> <scheme val="minor"/> </rPr> <t>!</t> </r> </si> <si> <t>Hello wonderful world!</t> </si> </sst> |
1 2 3 4 5 6 7 8 9 10 11 12 | declare namespace ssml = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"; declare variable $sheet := doc("jar:file:///C:/example.xlsx!/xl/worksheets/sheet1.xml"); declare variable $sharedStrings := doc("jar:file:///C:/example.xlsx!/xl/sharedStrings.xml"); for $c in $sheet/ssml:worksheet/ssml:sheetData/ssml:row/ssml:c[matches(@r,"^A[0-9]+")] let $str := if ($c/@t = "s") then string-join(($sharedStrings/ssml:sst/ssml:si)[xs:integer($c/ssml:v)+1]//ssml:t,"") else string($c/ssml:v) where string-length($str)>20 return <cell id="{$c/@r}">{$str}</cell> |
1. Declares the SpreadsheetML namespace
2. Specifies the XML document representing the first sheet in the spreadsheet
3. Specifies the XML document containing the shared strings table
4. Get all cells in the A column, we use a regular expression to filter all cells starting with an A, followed by a number of alpha numeric characters
5.
6. Test if the cell contains a string value or not
7. Lookup the string value in the shared strings table. As the index in spreadsheetML is zero-based we need to increment the index by one before accessing the n-th
8.
9. Get the value contained in the cell itself
10. As specified in the original question, only get the values of more than 20 characters
11.
12. Return the cell row/column number and the actual value
After all this query is longer than one might have thought initially, given the rather simple question. However, all complexity is there due to the Open Office XML standard. Can you imagine to write the equivalent in for example Java?
Out of scope for this post, but the solution can easily be enhanced, for example to query all sheets in the spreadsheet.
Someone wants to try and write the same for spreadsheets saved in OpenDocument Format?

