My whole application is a little sitemap scraper. I feed it the root link; from there it scans the site for more links, and then scrapes those pages for more links too, kind of like a sitemap generator, just more verbose. The bigger picture is that some sites contain links to YouTube, Facebook, Google, etc. Following those can go on forever and send my app into an endless chain, so I decided to feed it a blocklist so we can skip those bigger websites.
I have a file called blocked_sites.txt, in which I have:
facebook
youtube
And I have a set, urls, in which I have:
'facebook.com', 'youtube.com', 'gold'
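For reference, a minimal sketch of how these two collections might be set up, assuming blocked_sites.txt holds one entry per line. The helper name `load_blocked` and the temporary demo file are hypothetical and only there to make the example self-contained:

```python
from pathlib import Path
import tempfile


def load_blocked(path):
    """Return the non-empty, stripped lines of a text file as a set."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


# Demo: write a temporary stand-in for blocked_sites.txt and load it.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "blocked_sites.txt"
    demo_file.write_text("facebook\nyoutube\n", encoding="utf-8")
    blocked = load_blocked(demo_file)

urls = {'facebook.com', 'youtube.com', 'gold'}
print(sorted(blocked))  # ['facebook', 'youtube']
```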
So, what I want to do is:
- Compare the items of both collections to one another
- Check if a urls item CONTAINS a blocked_sites item
- Remove that item if it contains a BLOCKED item
I got points 1 and 2 done, but the third one is a gotcha. This is what I tried first:
    # For every url in urls
    for url in urls:
        # For every blocker inside blocked
        for blocker in blocked:
            # If URL contains BLOCKER
            if blocker in url:
                # Remove THAT URL
                urls.remove(url)
                print('removed: ' + url)
    print(urls)
The problem is that I can't really modify a set while iterating through it at the same time. So what are my options?
Here's what I thought:
- Take each URL that DOESN'T contain a blocker and copy it to another set
  -- This seems a bit bulky. We would then have to deal with urls, blocker, and new_urls, and it doesn't seem like a good idea, especially if I am constantly feeding more and more links into the old list. It doesn't seem very memory efficient.
- Let's try converting them into lists!
  -- Hey! It worked! But only for, like, 3 items?
  -- On further look, isn't a set already a list? Yet I got an error when I was using { 'item' } as my set as opposed to [ 'item' ].
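The first option above can be sketched with a set comprehension. This is only an illustration under the assumption that "contains" means a plain substring match; the function name `filter_urls` is hypothetical:

```python
def filter_urls(urls, blocked):
    """Return a new set holding every URL that contains no blocker substring."""
    return {url for url in urls
            if not any(blocker in url for blocker in blocked)}


urls = {'facebook.com', 'youtube.com', 'gold'}
blocked = {'facebook', 'youtube'}

# Rebind the name, so there is no separate new_urls to keep track of.
urls = filter_urls(urls, blocked)
print(urls)  # {'gold'}
```

On the memory concern: the new set stores references to the same string objects rather than copies of them, and the old set can be garbage-collected once `urls` is rebound. Note also that a set (`{...}`) and a list (`[...]`) are different types; a set is unordered and holds no duplicates.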
Okay, so take these sets first:
    urls = {'facebook.com', 'youtube.com', 'gold'}
    blocked = {'facebook'}
>>> RuntimeError: Set changed size during iteration
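A small sketch reproducing that error, followed by one possible way around it: loop over a snapshot of the set and remove from the original. The variable names `error_message` and `urls_fixed` are made up for this illustration:

```python
urls = {'facebook.com', 'youtube.com', 'gold'}
blocked = {'facebook'}

# Removing from the set that is being iterated makes the next
# iteration step raise RuntimeError.
error_message = None
try:
    for url in urls:
        for blocker in blocked:
            if blocker in url:
                urls.remove(url)
except RuntimeError as exc:
    error_message = str(exc)
print(error_message)  # Set changed size during iteration

# Possible workaround: iterate over a snapshot, mutate the original.
urls_fixed = {'facebook.com', 'youtube.com', 'gold'}
for url in list(urls_fixed):
    if any(blocker in url for blocker in blocked):
        urls_fixed.discard(url)
print(sorted(urls_fixed))  # ['gold', 'youtube.com']
```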
Alrighty, let's do it this way:
    urls = ['facebook.com', 'youtube.com', 'gold']
    blocked = ['facebook']
>>> Removed: facebook
Yay, it worked!
What if we add more blockers, like so:
    urls = ['facebook.com', 'youtube.com', 'gold']
    blocked = ['facebook', 'youtube']
>>> Removed: facebook
['youtube.com', 'gold']
That's strange! For some reason, it can only take off one blocker?
How do I get to the gold?
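A sketch of what seems to be happening with the list version: removing an item while looping shifts the remaining items left, so the loop's internal index jumps over the next one. Here 'youtube.com' is never even visited. The names `visited` and `fixed_urls` are only for illustration, and the fix shown (rebuilding the list in place with a comprehension) is one option among several:

```python
urls = ['facebook.com', 'youtube.com', 'gold']
blocked = ['facebook', 'youtube']

# Same loop as in the question, but recording which items get visited.
visited = []
for url in urls:
    visited.append(url)
    for blocker in blocked:
        if blocker in url:
            urls.remove(url)

print(visited)  # ['facebook.com', 'gold']
print(urls)     # ['youtube.com', 'gold']

# One possible fix: build the filtered list first, then assign it back
# into the same list object with slice assignment.
fixed_urls = ['facebook.com', 'youtube.com', 'gold']
fixed_urls[:] = [u for u in fixed_urls
                 if not any(b in u for b in blocked)]
print(fixed_urls)  # ['gold']
```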