This is the part of the HTML that I am scraping from the platform. It contains the value I want to get: the href attribute of the tag with the class "booktitle".

</div>
<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a class="leftAlignedImage" href="/book/show/2784.Ways_of_Seeing" title="Ways of Seeing"><img alt="Ways of Seeing" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1464018308l/2784._SY75_.jpg"/></a>
<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>
<br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/29919.John_Berger" itemprop="url"><span itemprop="name">John Berger</span></a>
</div>

After logging in with the mechanize library, I use the code below to try to extract it, but it returns the name of the book, since that is what the code asks for. I have tried several ways to get only the href value, but none has worked so far.

from bs4 import BeautifulSoup as bs4
from requests import Session
from lxml import html
import Downloader as dw
import requests

def getGenders(browser : mc.Browser, url: str, name: str) -> None:
    res =  browser.open(url)
    aux = res.read()
    html2 = bs4(aux, 'html.parser')
    with open(name, "w", encoding='utf-8') as file2:
        file2.write( str( html2 ) )

getGenders(br, "https://www.goodreads.com/shelf/show/art", "gendersBooks.html")

with open("gendersBooks.html", "r", encoding='utf8') as file:
    contents = file.read()

    
    bsObj = bs4(contents, "lxml")

    aux = open("books.text", "w", encoding='utf8')

    officials  = bsObj.find_all('a', {'class' : 'booktitle'})

    for text in officials:
        print(text.get_text())
        aux.write(text.get_text().format())


    aux.close()
    file.close()
  • `find_all` is case sensitive - try `bsObj.find_all('a', {'class' : 'bookTitle'})`. – metatoaster Jun 24 '20 at 04:14
  • Does this answer your question? [retrieve links from web page using python and BeautifulSoup](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) – Humayun Ahmad Rajib Jun 24 '20 at 05:14
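
The case-sensitivity point in the first comment can be checked with a minimal sketch. It assumes only the one-line anchor from the question's HTML; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Single anchor copied from the question's HTML snippet.
html_code = '<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>'
soup = BeautifulSoup(html_code, "html.parser")

# Class values are compared exactly, so the lowercase spelling finds nothing.
lower_matches = soup.find_all("a", {"class": "booktitle"})
exact_matches = soup.find_all("a", {"class": "bookTitle"})

print(len(lower_matches))
print(len(exact_matches))

# With the right spelling, the href is available on each matched tag.
hrefs = [tag["href"] for tag in exact_matches]
print(hrefs)
```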

2 Answers

Can you try this? (Sorry if it doesn't work; I am not at a PC with Python right now.)

for text in officials:
    print(text['href'])

IkerG

BeautifulSoup works fine with the HTML you provided. To get the text of a tag, use ".text"; to get the href, use ".get('href')", or, if you are sure the tag has an href attribute, "['href']".

Here is a simple, easy-to-understand example using your HTML snippet.

from bs4 import BeautifulSoup 

html_code = '''

</div>
<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a class="leftAlignedImage" href="/book/show/2784.Ways_of_Seeing" title="Ways of Seeing"><img alt="Ways of Seeing" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1464018308l/2784._SY75_.jpg"/></a>
<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>
<br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/29919.John_Berger" itemprop="url"><span itemprop="name">John Berger</span></a>
</div>

'''

soup = BeautifulSoup(html_code, 'html.parser')
tag = soup.find('a', {'class':'bookTitle'})

# - Book Title -
title = tag.text 
print(title)

# - Href Link -
href = tag.get('href')
print(href) 
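
The difference between ".get('href')" and "['href']" only shows up when the attribute is missing. Here is a small sketch of that case; the second anchor, which has no href, is a hypothetical tag added purely for illustration and does not appear on the real page.

```python
from bs4 import BeautifulSoup

# First anchor is from the question; the second (no href) is hypothetical.
html_code = (
    '<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>'
    '<a class="bookTitle">No link here</a>'
)
soup = BeautifulSoup(html_code, "html.parser")
with_href, without_href = soup.find_all("a", {"class": "bookTitle"})

# .get() returns None when the attribute is absent.
present_value = with_href.get("href")
safe_value = without_href.get("href")

# Square brackets behave like a dict lookup and raise KeyError instead.
try:
    without_href["href"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(present_value, safe_value, raised_key_error)
```

So ".get('href')" is the safer choice inside a loop over many tags, while "['href']" fails loudly if the page structure is not what you expected.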

I don't know why you downloaded the HTML, saved it to disk, and then opened it again. If you just want some tag values, that round trip is unnecessary: you can keep the HTML in a variable and pass that variable to BeautifulSoup.

I also see that you imported the requests library but used mechanize instead. As far as I know, requests is the easiest and most modern library for fetching web pages in Python. You also imported "Session" from requests; a session is not necessary unless you want to make multiple requests and keep the connection to the server open so that subsequent requests are faster, or you need cookies (for example from a login) to persist between requests.
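
For the case where a session does make sense, here is a minimal sketch. The helper name fetch_pages and its timeout default are my own choices for illustration, not part of requests; it only relies on the session object having a get() method, as requests.Session does.

```python
import requests


def fetch_pages(session, urls, timeout=10):
    """Fetch every URL through the same session and return {url: html_text}.

    Reusing one session lets requests keep the connection open and carry
    cookies from one request to the next.
    """
    pages = {}
    for url in urls:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # fail early on 4xx/5xx responses
        pages[url] = response.text
    return pages
```

It could be called as: with requests.Session() as session: pages = fetch_pages(session, ["https://www.goodreads.com/shelf/show/art"]). Whether the shelf page returns the same content without logging in is something I have not checked.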

Also, opening a file with the "with" statement uses a Python context manager, which closes the file for you, so you don't have to close it yourself at the end.

So a simplified version of your code, without saving the downloaded HTML to disk, would look like this.
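
A quick sketch to confirm this: the file object reports itself as closed as soon as the "with" block ends. The temporary directory and the file name books.txt are only there so the example does not touch your working directory.

```python
import os
import tempfile

# Write into a throwaway directory (illustrative path, not from the question).
path = os.path.join(tempfile.mkdtemp(), "books.txt")

with open(path, "w", encoding="utf-8") as out_file:
    out_file.write("/book/show/2784.Ways_of_Seeing\n")
    closed_inside = out_file.closed  # still open inside the block

closed_after = out_file.closed  # closed automatically on leaving the block

print(closed_inside, closed_after)
```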

from bs4 import BeautifulSoup
import requests

# Shelf URL from the question (there it was fetched after logging in with mechanize)
url = 'https://www.goodreads.com/shelf/show/art'

html_source = requests.get(url).content

soup = BeautifulSoup(html_source, 'html.parser')

# - To get the tag that we want (class names are case sensitive) -
tag = soup.find('a', {'class' : 'bookTitle'})

# - Extract Book Title -
title = tag.text

# - Extract href from Tag -
href = tag.get('href')

Now, if there are multiple "a" tags with the same class name, ('a', {'class' : 'bookTitle'}), do it like this.

Get all the "a" tags first:

a_tags = soup.find_all('a', {'class' : 'bookTitle'})

Then scrape the info from each book tag and append it to a books list.

books = []
for a in a_tags:
    title = a.text
    href = a.get('href')  # None if the tag has no href attribute
    books.append({'title':title, 'href':href})  #<-- add each book dict to books list
    print(title)
    print(href)
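
Note that the href values in your snippet are relative ("/book/show/..."). If you need full links, one option is to join them with the page URL using urljoin from the standard library. This is only a sketch: it uses the single anchor from your snippet, and BASE_URL is assumed to be the shelf URL from your question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base: the shelf page the HTML was fetched from.
BASE_URL = "https://www.goodreads.com/shelf/show/art"

html_code = '<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>'
soup = BeautifulSoup(html_code, "html.parser")

books = []
for a in soup.find_all("a", {"class": "bookTitle"}):
    href = a.get("href")
    if href is None:
        continue  # skip anchors without a link
    books.append({
        "title": a.get_text(strip=True),
        "url": urljoin(BASE_URL, href),  # resolve the relative path against the page URL
    })

print(books)
```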

To understand your code better, I advise you to read these related links:

BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

requests: https://requests.readthedocs.io/en/master/

Python Context Manager: https://book.pythontips.com/en/latest/context_managers.html

https://effbot.org/zone/python-with-statement.htm

Diego Suarez