I am writing a Porter stemmer in XQuery, and as the first step I need to match consonant and vowel patterns. In the Perl example I'm using as a basis, the consonant pattern is (?:[^aiueoy]|(?:(?<=[aiueo])y)|\by) and the vowel pattern is (?:[aiueo]|(?:(?<![aiueo])y)). I need to expand these to also include the letter aesc (æ), and this is what I have for my XQuery regexes:

let $v := element {"vowels"} {matches($f,"(?:([^aiueoy])|(?:(?:[aiueo]\1)y))")}
let $c := element {"consonants"} {matches($f,"(?:([aiueo])|(?:(?<![aiueo]\1)y))")}

A sample of the type of XML I am looking for is as follows:

<entry ref="173">
        <headword>abǒve</headword>
        <headword>abǒven</headword>
        <variant>abufe</variant>
        <variant>abufen</variant>
        <variant>abuue</variant>
        <variant>abuuen</variant>
        <variant>abowve</variant>
        <variant>obove</variant>
        <variant>oboven</variant>
        <variant>obufe</variant>
        <variant>obufen</variant>
        <variant>abof</variant>
        <variant>obof</variant>
        <variant>aboyf</variant>
        <variant>aboun</variant>
        <variant>aboune</variant>
        <variant>abown</variant>
        <variant>abowne</variant>
        <variant>aboon</variant>
        <variant>oboun</variant>
        <variant>oboune</variant>
        <variant>abow</variant>
        <variant>aboʒe</variant>
        <part_of_speech> adv. </part_of_speech>
    </entry>

Running this in Saxon, however, I get the following error: "Query failed with dynamic error: Syntax error at char 17 in regular expression: No expression before quantifier". I'm pretty sure my issue is that I'm not building the positive lookbehind properly, having changed it from <= to \1, but I'm not sure how to build that part in a way that works with XQuery. Any suggestions would be much appreciated.

medievalmatt
  • I don't think XQuery supports either non-capturing groups or lookbehinds. I'm confused by what you're trying to refer to with the `\1` backreference; could you maybe add the expected XML output? Also, I think you inverted vowels and consonants in your XQuery code; otherwise I don't understand why you'd want to match `[aiueo]` as consonants and `[^aiueo]` as vowels – Aaron Sep 28 '18 at 15:27
  • If possible, I would suggest using a language other than XQuery for this work: its regex support is limited, and it looks like the bulk of your work is text processing, with the XML processing secondary. – Aaron Sep 28 '18 at 15:33

1 Answer

The XQuery 3.1 spec's regular expression support is described at https://www.w3.org/TR/xpath-functions-31/#regex-syntax, which notes that XPath and XQuery support several additions to the regular expression syntax defined in the XML Schema Datatypes specification at https://www.w3.org/TR/xmlschema-2/#regexs. Unfortunately, lookbehind support is not part of the specification.

However, since you note that you're using Saxon, Saxon has an extension that allows you to enable native Java regex if you supply the j flag, as documented at https://www.saxonica.com/html/documentation/functions/fn/matches.html. This should give you access to Java's support for positive lookbehind expressions.
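
To see what the Java engine does with these patterns, here is a minimal sketch in plain Java rather than XQuery. It uses the two Perl patterns from the question with æ added to the vowel class. The class name, the findAll helper, and the choice to treat æ as an ordinary vowel are my assumptions for illustration, not anything taken from your code or from Saxon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PorterClasses {
    // Vowel letters; \u00e6 is æ (assumption: it behaves like a plain vowel).
    static final String V = "aiueo\u00e6";

    // Consonant: any character outside the vowel set and y,
    // or y directly after a vowel, or y at a word boundary.
    static final Pattern CONSONANT = Pattern.compile(
        "(?:[^" + V + "y]|(?<=[" + V + "])y|\\by)");

    // Vowel: a vowel letter, or y that is not directly after a vowel.
    static final Pattern VOWEL = Pattern.compile(
        "(?:[" + V + "]|(?<![" + V + "])y)");

    // Hypothetical helper: collect every match of a pattern in a word.
    static List<String> findAll(Pattern p, String word) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(word);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // "aboyf" is one of the variants in the sample entry.
        System.out.println(findAll(CONSONANT, "aboyf")); // [b, y, f]
        System.out.println(findAll(VOWEL, "aboyf"));     // [a, o]
    }
}
```

In "aboyf" the y follows a vowel, so the positive lookbehind puts it with the consonants and the negative lookbehind keeps it out of the vowels. Note that a word-initial y satisfies both patterns as written, so they do not strictly partition a word.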

(This j flag is becoming a sort of extension convention among other XQuery implementations. BaseX follows Saxon, as noted at http://docs.basex.org/wiki/XQuery_Extensions#Regular_Expressions. eXist will likely adopt this convention too: https://github.com/eXist-db/exist/issues/846.)

Joe Wicentowski
  • Thanks for pointing this out. I had actually gone to the Saxon page you mentioned (which is why I was pretty sure non-capturing groups were ok now), but I missed that flag. I'm going to try that out. – medievalmatt Sep 28 '18 at 23:36