In a Python script using lxml, I use the following XPath expression to find elements that have certain text content and whose attribute does not have a certain value:

xpath('//el[text()="something" or text()="something else" or text()="this other thing" and @attrib!="A"]')

I also tried:

xpath('//el[text()="something" or text()="something else" or text()="this other thing" and not(@attrib="A")]')

This is part of a loop like this:

for element in root.xpath('//el[text()="something" or text()="something else" or text()="this other thing" and not(@attrib="A")]'):

    element.get('attrib')

In the results I get lots of 'A' values. I don't understand what I'm doing wrong. This is not supposed to happen. I explicitly included 'not(@attrib="A")' as one of the conditions.

========= addition ============

for el in root_element.xpath('//tok[text()="altra" or text()="altres" or text()="altr" and not(@lemma="altre")]'): 
        wrong_lemma = el.get('lemma')

Below is part of a document containing an element that should not be matched but IS matched. I get 'altre' as the value of the variable 'wrong_lemma' in the output.

<tok id="w-1264" ord="5" lemma="altre" xpos="DI0CP0">altres</tok>
<tok id="w-1265" ord="6" lemma="insigne" xpos="AQ0CP00">insignes</tok>
<tok id="w-1266" ord="7" lemma="cavaller" xpos="NCMP000">cavallers</tok>

The following do not work either:

for el in root_element.xpath('//tok[text()="altra" or text()="altres" or text()="altr" and @lemma!="altre"]'):
    wrong_lemma = el.get('lemma')

for el in root_element.xpath('//tok[text()="altra" or text()="altres" or text()="altr" and not(contains(@lemma!="altre"))]'):
    wrong_lemma = el.get('lemma')
jfontana
  • I believe this may help you. https://stackoverflow.com/questions/1550981/how-to-use-not-in-xpath – user6846 Dec 25 '22 at 01:15
  • pls add HTML or URL so this can be reproduced. – simpleApp Dec 25 '22 at 01:36
  • Thanks for the pointer but it doesn't help. I had actually seen that question and its answers when I was trying to solve the problem. As suggested by @simpleApp I edited my question to add the specific code I'm using and the specific part of the XML document that contains the element that should not be matched but that unfortunately it IS matched. – jfontana Dec 25 '22 at 01:52
  • the xml you have shared marked as `root`, then `for el in root.xpath('//tok[@lemma!="altre"]'): print(el.get('lemma'))` gives me `insigne cavaller` , i think i am unable to reproduce it ? – simpleApp Dec 25 '22 at 02:22
  • Thanks for looking into this @simpleApp. Yes, I had tried with not(@attr="something") by itself before and it did work. The problem is when I add more conditions to the filter: `//tok[text()="something" or text()="somethingdiff" and @lemma!="altre"]`. I guess the phrasing of the question focuses on the negation but the problem is perhaps the 'and'? I know that searches for //tok[text()="something"] also work properly when not combined with something else via 'and'. It was difficult to think of a question that could summarize the problem well. – jfontana Dec 25 '22 at 10:43
  • Glad you figured out that was missing the grouping of conditions. Awesome!! – simpleApp Dec 25 '22 at 12:11

1 Answer

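The problem was with how the conditions in the XPath predicate were grouped, not with the negation.

In XPath, 'and' binds more tightly than 'or'. Without parentheses, the attribute test was attached only to the last text() comparison, so any element matching one of the first two comparisons was returned regardless of its attribute. What I needed to do was put parentheses around the 'or' alternatives. This is what works: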

xpath('//el[(text()="something" or text()="something else" or text()="this other thing") and @attrib!="A"]')

With the alternatives grouped, the elements that were wrongly matched before (those with attrib="A") are now excluded.
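
To see the difference on the data from the question, here is a small sketch that runs both versions of the expression with lxml. The first two tok elements are copied from the question; the wrapping <s> element and the third token (id "w-9001", with lemma "altra") are made up for illustration, so that there is one element that really should be found.

```python
from lxml import etree

# The first two <tok> elements come from the question. The <s> wrapper and
# the token with id "w-9001" are hypothetical, added only for this demo.
xml = """<s>
<tok id="w-1264" ord="5" lemma="altre" xpos="DI0CP0">altres</tok>
<tok id="w-1265" ord="6" lemma="insigne" xpos="AQ0CP00">insignes</tok>
<tok id="w-9001" ord="8" lemma="altra" xpos="DI0CP0">altres</tok>
</s>"""
root = etree.fromstring(xml)

# Without parentheses, this is read as: A or B or (C and not(...)),
# so the attribute test only guards the last text() comparison.
ungrouped = ('//tok[text()="altra" or text()="altres" or '
             'text()="altr" and not(@lemma="altre")]')

# With parentheses, the attribute test applies to every alternative.
grouped = ('//tok[(text()="altra" or text()="altres" or '
           'text()="altr") and not(@lemma="altre")]')

ungrouped_ids = [el.get("id") for el in root.xpath(ungrouped)]
grouped_ids = [el.get("id") for el in root.xpath(grouped)]

print("ungrouped:", ungrouped_ids)  # includes w-1264, whose lemma is "altre"
print("grouped:  ", grouped_ids)    # only the token with the wrong lemma
```

Note that @lemma!="altre" and not(@lemma="altre") are not quite the same thing: the first requires the attribute to be present, while the second also matches elements that have no lemma attribute at all. For the data above both give the same result.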

Azhar Khan
jfontana
  • You might like to edit your headline question so it's actually relevant to your real problem. – Michael Kay Dec 27 '22 at 11:13
  • @Michael Kay you are totally right. My apologies for this but I'm having a hard time figuring out what the right question would be. When I asked it, I thought the problem was with the 'not' function and so I formulated the whole question focusing on that. It was only later, after reading the responses, that I saw the problem was not with the not() function but rather with the way I had grouped the conditions. I am really at a loss as to what to do to fix this problem. – jfontana Dec 27 '22 at 18:00