
I have an XML file:

<?xml version="1.0" encoding="UTF-8"?>
<doc>
  <!-- A comment -->
  <a xmlns="http://www.tei-c.org/ns/1.0">
    <w>word
    </w>
    <w>wording
    </w>
  </a>
</doc>

I would like to return nodes containing "word" but not "wording".

library(XML) # I have nothing against using library(xml2) or library(xml2r) instead
doc <- xmlParse("file.xml", encoding="UTF-8")  # named `doc` to match the queries below
x <- c(x="http://www.tei-c.org/ns/1.0")

# starts-with seems to find the words just fine
test1 <- getNodeSet(doc, "//x:w[starts-with(., 'word')]", x)
# but R doesn't seem to allow "matches" to be included
# in the xpath query, hence none of the following work:
test1 <- getNodeSet(doc, "//x:w[[matches(., 'word')]]", x)
test1 <- getNodeSet(doc, "//x:w[@*[matches(., 'word')]]", x)
test1 <- getNodeSet(doc, "//x:w[matches(., '^word$')]", x)
test1 <- getNodeSet(doc, "//x:w[@*[matches(., '^word$')]]", x)

Update: If I use the term matches with any combination I get the following error and an empty list as result.

xmlXPathCompOpEval: function matches not found
XPath error : Unregistered function
XPath error : Invalid expression
XPath error : Stack usage error
Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces,  : 
  error evaluating xpath expression //x:w[matches(., '^word$')]

If I look for "//x:w[@*[contains(., '^word$')]]" based on advice below, I get the following warning and empty list as result:

Warning message:
In xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces,  :
  the XPath query has no namespace, but the target document has a default namespace. 
 This is often an error and may explain why you obtained no results

I imagine I am just using the wrong commands. What should I change to make it work? Thanks!

puslet88
  • `getNodeSet(doc, "//x:w[starts-with(., 'word') and not(starts-with(.,'wording'))]", x)` (if you need to use `starts-with`) or `getNodeSet(doc, "//x:w[contains(., 'word') and not(contains(.,'wording'))]", x)` (if you really want "contains") – hrbrmstr Nov 02 '15 at 10:58
  • Thanks. I'm looking for a more general solution, mainly I'm hoping to be able to use the word boundary markers as here `^word$`. I tried it with "contains", but it doesn't seem to parse them either. "Matches" does the trick outside of R environment (in BaseX). – puslet88 Nov 02 '15 at 12:40
  • When you say something doesn't "work," please be more specific. What was the actual result, and how did it differ from what you expected? E.g. did `test1` end up with the wrong value? What value? Was there an error message? What did it say? – LarsH Nov 03 '15 at 14:36
  • See http://stackoverflow.com/a/25309592/423105 regarding R, XPath 2.0, and regular expressions. – LarsH Nov 04 '15 at 02:45
  • Thanks for your help @LarsH , this post you refer to may just help me figure it out. I haven't yet had the time to read it in depth, but will do so soon. I'm also just now learning that there are different versions of XPath to pay attention to. – puslet88 Nov 05 '15 at 12:15

1 Answer


Thanks for updating your question to include the error message. It's like going to a doctor and asking for treatment to solve your problem -- you definitely want to let him know what specific symptoms you've noticed!

And this error message confirms that the matches() function is missing. That indicates that R (at least, the version you're using) uses XPath 1.0, which does not have matches() or other regular expression features; the XML package is built on libxml2, which implements only XPath 1.0. BaseX, on the other hand, supports XPath 2.0 (in fact it supports XPath 3.0, IIRC), so it can handle matches().
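Since XPath 1.0 has no regular expressions, one workaround is to select the candidate nodes with a plain XPath query and apply the regex on the R side. The following is only a minimal sketch under some assumptions: the sample document is inlined as a string (`xml_text`) rather than read from `file.xml`, and the names `ns`, `w_nodes`, `w_text` and `matched` are made up for illustration.

```r
library(XML)

# Inline copy of the sample document from the question (assumption: the
# real data would be read from a file instead)
xml_text <- '<?xml version="1.0" encoding="UTF-8"?>
<doc>
  <a xmlns="http://www.tei-c.org/ns/1.0">
    <w>word
    </w>
    <w>wording
    </w>
  </a>
</doc>'

doc <- xmlParse(xml_text, asText = TRUE)
ns <- c(x = "http://www.tei-c.org/ns/1.0")

# Step 1: plain XPath 1.0 query that selects every <w> element
w_nodes <- getNodeSet(doc, "//x:w", namespaces = ns)

# Step 2: extract the text, trim surrounding whitespace, and let R's
# own regex engine do the matching that matches() would do in XPath 2.0
w_text <- trimws(sapply(w_nodes, xmlValue))
matched <- w_nodes[grepl("^word$", w_text)]
```

Here `matched` holds only the nodes whose trimmed text satisfies the regex, so any pattern that `grepl()` accepts can be used in place of `^word$`.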

Regarding how to do what you want in XPath 1.0, it's not entirely clear what you'd like to do. You mentioned using word boundary markers, so you could try something like

getNodeSet(doc, "//x:w[contains(concat(' ', normalize-space(.), ' '),
                                ' word ')]", x)

This will select <w> elements whose content includes word at the beginning and/or end of the text, or preceded/followed by whitespace. If you want to treat certain non-whitespace characters as word boundaries, you could translate them to whitespace using translate().
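To illustrate the translate() idea, here is a hedged sketch that treats a handful of punctuation characters as word boundaries. The sample document, the punctuation set in `punct`, and the variable names are assumptions chosen for the example. Note that normalize-space() is applied before the text is padded with spaces, because it strips leading and trailing whitespace.

```r
library(XML)

# Hypothetical sample data with punctuation attached to the words
xml_text <- '<doc><a xmlns="http://www.tei-c.org/ns/1.0">
  <w>word,</w>
  <w>wording.</w>
  <w>a word here</w>
  <w>swordfish</w>
</a></doc>'

doc <- xmlParse(xml_text, asText = TRUE)
ns <- c(x = "http://www.tei-c.org/ns/1.0")

# Characters to treat as word boundaries (an assumption; extend as needed).
# translate() maps each of them to a space, so the replacement string
# needs one space per character.
punct <- ",.;:!?"
spaces <- strrep(" ", nchar(punct))

# Replace punctuation with spaces, collapse whitespace, pad with one space
# on each side, then look for the space-delimited word
xpath <- sprintf(
  "//x:w[contains(concat(' ', normalize-space(translate(., '%s', '%s')), ' '), ' word ')]",
  punct, spaces
)

hits <- getNodeSet(doc, xpath, namespaces = ns)
hit_text <- sapply(hits, xmlValue)
```

With this data, `hit_text` should contain the values of the first and third `<w>` elements, while `wording.` and `swordfish` are skipped. Characters that are special inside the XPath string literal (such as a single quote) would need extra care and are left out of `punct` here.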

LarsH