8

Is it possible to get only specific URLs?

Like:

<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>

The output should contain only the URLs from http://www.iwashere.com/, like this:

http://www.iwashere.com/washere.html
http://www.iwashere.com/wasnot.html

I did it with string manipulation. Is there a direct way to do this with BeautifulSoup?

Martijn Pieters
Zero

3 Answers

17

You can match on multiple aspects at once, including a regular expression for an attribute value:

import re
soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/'))

which matches (for your example):

[<a href="http://www.iwashere.com/washere.html">next</a>, <a href="http://www.iwashere.com/wasnot.html">next</a>]

That is, any <a> tag whose href attribute value starts with the string http://www.iwashere.com/.
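Beautiful Soup tests a compiled pattern with its search() method, so a pattern only restricts the match to the start of the value when it begins with ^. The sketch below uses only the re module to show the difference; the `tricky` value is a made-up href for illustration, not one from the question.

```python
import re

prefix = 'http://www.iwashere.com/'

# Same prefix, without and with a start-of-string anchor.
unanchored = re.compile(re.escape(prefix))
anchored = re.compile('^' + re.escape(prefix))

# Hypothetical href that merely *contains* the prefix further along.
tricky = 'http://www.heelo.com/redirect?to=http://www.iwashere.com/washere.html'
wanted = 'http://www.iwashere.com/washere.html'

unanchored_hits_tricky = bool(unanchored.search(tricky))
anchored_hits_tricky = bool(anchored.search(tricky))
anchored_hits_wanted = bool(anchored.search(wanted))

print(unanchored_hits_tricky, anchored_hits_tricky, anchored_hits_wanted)
```

Without the anchor, the foreign URL would slip through because the prefix appears inside its query string.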

You can loop over the results and pick out just the href attribute:

>>> for elem in soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/')):
...     print(elem['href'])
... 
http://www.iwashere.com/washere.html
http://www.iwashere.com/wasnot.html

To match all relative paths instead, use a negative look-ahead assertion that rejects any value starting with a scheme (e.g. http: or mailto:) or with a double slash (//hostname/path); any value that starts with neither must be a relative path:

soup.find_all('a', href=re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))'))
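To see what that pattern accepts, here is a small check that uses only the re module. The `samples` list and the `RELATIVE_HREF` name are assumptions made up for illustration; only the pattern itself comes from the answer.

```python
import re

# Same pattern as above: reject values that start with a scheme or with "//".
RELATIVE_HREF = re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))')

# Hypothetical href values, chosen only to exercise each branch.
samples = [
    'washere.html',
    '/docs/index.html',
    '../up.html',
    '#top',
    'http://www.iwashere.com/washere.html',
    'mailto:someone@example.com',
    '//cdn.example.com/lib.js',
]

# Keep only the values the pattern treats as relative.
relative = [href for href in samples if RELATIVE_HREF.search(href)]
print(relative)
```

Note that fragment-only values such as `#top` also count as relative here, so you may want to filter those out separately if you only care about paths.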
Martijn Pieters
  • It worked perfectly. For anyone unfamiliar with the libraries: you need `from bs4 import BeautifulSoup` and `import re`. – Zero Mar 09 '13 at 17:20
  • I have one more question. We can extract links perfectly when they are in the `http://www.iwashere.com/xyz...abc.html` format. But what if the links are local, say like `[next, next]`? How can I extract the underlying link? In the HTML source, each link is hyperlinked to the proper location. Is there any way to extract such links? – Zero Mar 09 '13 at 20:39
  • @searcoding: You'd need to match anything that doesn't start with a scheme or double slash; any `href` value that does *not* start with those is a relative URL instead. Use `href=re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))')` (that's a negative look-ahead to test for a scheme or double slash, anything that has those does *not* match). – Martijn Pieters Mar 09 '13 at 23:05
7

If you're using BeautifulSoup 4.0.0 or greater:

soup.select('a[href^="http://www.iwashere.com/"]')
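For completeness, a minimal end-to-end sketch of the selector approach, assuming the bs4 package is installed and using the sample markup from the question. `[href^="..."]` is the CSS "attribute value starts with" selector; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """\
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
"""

# html.parser is the parser bundled with Python, so no extra dependency is needed.
soup = BeautifulSoup(html, 'html.parser')

# Select the matching <a> tags, then pull out just the href values.
urls = [a['href'] for a in soup.select('a[href^="http://www.iwashere.com/"]')]
print(urls)
```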
yurisich
0

You could solve this with partial matching in gazpacho:

Input:

html = """\
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
"""

Code:

from gazpacho import Soup

soup = Soup(html)
links = soup.find('a', {'href': "http://www.iwashere.com/"}, partial=True)
[link.attrs['href'] for link in links]

Which will output:

# ['http://www.iwashere.com/washere.html', 'http://www.iwashere.com/wasnot.html']
emehex