1

I am looking to do the subtraction of a list from another list but by respecting repetitions:

>>> a = ['a', 'b', 'c','c', 'c', 'c', 'd', 'e', 'e']
>>> b = ['a', 'c', 'e', 'f','c']
>>> a - b
['b', 'c','c', 'd', 'e']

Order of elements does not matter.

There is a question with answers here but it ignores the repetitions. Solutions there would give:

>>> a - b
['b', 'd']

One solution considers duplicates but it alters one of the original list:

[i for i in a if not i in b or b.remove(i)]

I wrote this solution:

a_sub_b = list(a)
b_sub_a = list(b)
for e in a:
    if e in b_sub_a:
        a_sub_b.remove(e)
        b_sub_a.remove(e)

print a_sub_b # a - b 
print b_sub_a # b - a

That works for me , but is there a better solution , simpler or more efficient ?

Nic
  • 12,220
  • 20
  • 77
  • 105
Assem
  • 11,574
  • 5
  • 59
  • 97
  • Does this also have to respect repetitions in `b`? In other words, if `b = ['a', 'c', 'e', 'f', 'c']`, is the result `['b', 'c', 'd', 'e']`? – abarnert May 24 '15 at 04:55
  • 1
    Also, in your example, the elements in `b` are in the same order as in `a`. Is that guaranteed/required, or is the order actually meaningless? (And, if the order of `b` _is_ meaningless, is the order in `a` meaningful, or can we just remove any of the `'c'` elements instead of just the first?) – abarnert May 24 '15 at 04:56
  • yes repitions from both 3 x 'c' - 2 x 'c' should gives 1 x 'c' – Assem May 24 '15 at 04:58
  • not considering order – Assem May 24 '15 at 04:58
  • The order of _both_ is meaningless, or just of `b`? – abarnert May 24 '15 at 05:09
  • order of all doesn't matter `a`, `b`, `a - b` – Assem May 24 '15 at 05:10
  • Then why are you using lists in the first place? – abarnert May 24 '15 at 05:11
  • They were strings, any suggestion? – Assem May 24 '15 at 05:14
  • 1
    As I said in my answer, if you just want a multiset, where order doesn't matter, use a type designed to act like a multiset, like `collections.Counter`, not a `list`. If you did that, the answer to this problem would be obvious (or, rather, the problem wouldn't come up in the first place)—and the same for all of the other multiset operations you're going to ask for later (like "how do I intersect these two lists"). – abarnert May 24 '15 at 05:16
  • They were strings so the problem will be the same: subtraction of strings. I didn't know Counters had the `substraction` operation. Thank you for clarification. – Assem May 24 '15 at 05:23
  • If they're strings, `Counter('abcab')` will give you exactly what you'd expect, without having to go through a `list` first. – abarnert May 24 '15 at 05:30

3 Answers3

4

If order doesn't matter, use collections.Counter:

c = list((Counter(a) - Counter(b)).elements())

Counter(a) - Counter(b) builds a Counter with the count of an element x equal to the number of times x appears in a minus the number of times x appears in b. elements() creates an iterator that yields each element a number of times equal to its count, and list turns that into a list. The whole thing takes O(len(a)+len(b)) time.

Note that depending on what you're doing, it might be best to not work in terms of lists and just keep a, b, and c represented as Counters.

user2357112
  • 260,549
  • 28
  • 431
  • 505
2

This is going to search every element of b for each element of a. It's also going to do a linear remove on each list for each element that matches. So, your algorithm takes quadratic time—O(max(N, M)^2) where N is the length of a and M is the length of b.

If you just copy b into a set instead of a list, that solves the problem. Now you're just doing a constant-time set lookup for each element in a, and a constant-time set remove instead of a list remove. But you've still got the problem with the linear-time and incorrect removing from the a copy. And you can't just copy a into a set, because that loses duplicates.


On top of that, a_sub_b.remove(e) removes an element matching e. That isn't necessarily the same element as the element you just looked up. It's going to be an equal element, and if identity doesn't matter at all, that's fine… but if it does, then remove may do the wrong thing.

At any rate, performance is already a good enough reason not to use remove. Once you've solved the problems above, this is the only thing making your algorithm quadratic instead of linear.

The easiest way to solve this problem is to build up a new list, rather than copying the list and removing from it.

Solving both problems, you have O(2N+M) time, which is linear.


So, putting the two together:

b_set = set(b)
new_a = []
for element in a:
    if a in b_set:
        b_set.remove(element)
    else:
        new_a.append(element)

However, this still may have a problem. You haven't stated things very clearly, so it's hard to be sure, but can b contain duplicates, and, if so, does that mean the duplicated elements should be removed from a multiple times? If so, you need a multi-set, not a set. The easiest way to do that in Python is with a Counter:

from collections import Counter

b_counts = Counter(b)
new_a = []
for element in a:
    if b_counts[element]:
        b_counts[element] -= 1
    else:
        new_a.append(element)

On the other hand, if the order of neither a nor b matters, this just reduces to multiset difference, which makes it even easier:

new_a = list((Counter(a) - Counter(b)).elements())

But really, if the order of both is meaningless, you probably should have been using a Counter or other multiset representation in the first place, not a list…

abarnert
  • 354,177
  • 51
  • 601
  • 671
  • I am iterating throug `a`, but removing from `a_sub_b` – Assem May 24 '15 at 05:05
  • 1
    @bigOTHER: Right, sorry, so the correctness problem is more subtle; let me edit. Anyway, it's still quadratic time rather than linear, and possibly wrong. – abarnert May 24 '15 at 05:06
1

The following uses standard library only:

a = ['a', 'b', 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e']
b = ['a', 'c', 'e', 'f','c']
a_set = set(a)
b_set = set(b)

only_in_a = list(a_set - b_set)
diff_list = list()

for _o in only_in_a:
    tmp = a.count(_o) * _o
    diff_list.extend(tmp)

for _b in b_set:
    tmp = (a.count(_b) - b.count(_b)) * _b
    diff_list.extend(tmp)

print diff_list

And gives:

['b', 'b', 'd', 'd', 'd', 'c', 'c', 'e']

as expected.

boardrider
  • 5,882
  • 7
  • 49
  • 86