
I am trying to export my data as a .txt file.

from bs4 import BeautifulSoup
import requests
import os

os.getcwd()
'/home/folder'
os.mkdir("Probeersel6")
os.chdir("Probeersel6")
os.getcwd()
'/home/Desktop/folder'
os.mkdir("img")  # now inside the new folder

url = "http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html"
r  = requests.get(url)
soup = BeautifulSoup(r.content)
data = soup.find_all("article", {"class": "article"})

with open(""%s".txt", "wb" %(url)) as file:
    for item in data:
        print item.contents[0].find_all("time", {"datetime": "2016-03-16T09:50:30+0100"})[0].text 
        print item.contents[0].find_all("a", {"class": "link-grey"})[0].text
        print "\n"
        print item.contents[0].find_all("img", {"class": "media-full"})[0]
        print "\n"
        print item.contents[1].find_all("div", {"class": "article_textwrap"})[0].text
        file.write()

What should I pass to

file.write()

to make it work?

I am also trying to give the .txt file the same name as the URL. Should I do that with a string, like this?

with open(""%s".txt", "wb" %(url)) as file:


url = "http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html"
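As an illustration only (not the asker's code), a minimal sketch of how the `%s` placeholder belongs inside the string passed to `open`, rather than in extra quotes around it:

```python
# Hypothetical sketch: build the filename first, then pass it to open().
url = "http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html"

name = url.rsplit("/", 1)[-1]       # last path segment of the URL
name = name.rsplit(".", 1)[0]       # strip the .html extension
filename = "%s.txt" % name          # the %s placeholder goes inside the string
print(filename)
```

A raw URL contains characters such as `/` and `:` that are not valid in filenames, which is why only the last path segment is used here.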
Danisk

2 Answers


You should put your content inside file.write. I'd probably do something like:

#!/usr/bin/python3

from bs4 import BeautifulSoup
import requests

url = 'http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html'
# take the last path segment of the URL and strip the .html extension
file_name = url.rsplit('/', 1)[1].rsplit('.')[0]

r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
data = soup.find_all('article', {'class': 'article'})

content = ''.join('{}\n{}\n\n{}\n{}'.format(
    item.contents[0].find_all('time', {'datetime': '2016-03-16T09:50:30+0100'})[0].text,
    item.contents[0].find_all('a', {'class': 'link-grey'})[0].text,
    item.contents[0].find_all('img', {'class': 'media-full'})[0],
    item.contents[1].find_all('div', {'class': 'article_textwrap'})[0].text,
) for item in data)

with open('./{}.txt'.format(file_name), mode='wt', encoding='utf-8') as file:
    file.write(content)
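As an alternative sketch for the filename step, the standard library's `urllib.parse` and `os.path` can do the same splitting; the variable name below mirrors the answer's, but this is just an illustration:

```python
from os.path import basename, splitext
from urllib.parse import urlparse

url = 'http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html'

# urlparse(url).path isolates the path, basename takes its last segment,
# and splitext drops the .html extension
file_name = splitext(basename(urlparse(url).path))[0]
print(file_name)
```

This has the advantage of ignoring any query string or fragment that might be attached to the URL.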
  • I am getting a "UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 130: ordinal not in range(128)" – Danisk Mar 16 '16 at 15:46
  • I actually don't have access to Python right now, and it seems that you are using Python 2 (I guess). You should read a little about encodings; your problem is very common! –  Mar 16 '16 at 15:49
  • Not at all, but encoding works somewhat differently in Python 3, and it has been a long time since I last used Python 2. I'll take a look at your problem, but maybe you were right to use the binary mode `wb` –  Mar 16 '16 at 16:01
  • I guess it's not the `wb`; there is an error at the last expression, `item.contents[1].find_all("div", {"class": "article_textwrap"})[0].text`, inside the `content` concatenation – Danisk Mar 16 '16 at 16:11
  • On Python 2 you can keep the text mode as I said, and just use `...text.decode('utf-8','replace')` to decode the content. Note that the character encoding (`utf-8`) may not always be the same; you can probably parse it from the HTML content. I think your script has some other small issues, but I don't really have time to debug it. –  Mar 16 '16 at 17:48
  • http://stackoverflow.com/questions/36044653/beautifulsoup-error-in-file-saving-txt I couldn't post any code here, so I've solved the Unicode problem (I think), but now it isn't saving anything – Danisk Mar 16 '16 at 19:08
  • @Danisk if you have access to python3, I just made an edit to my post, and it seems to work fine since your website is in `utf-8` –  Mar 16 '16 at 19:55

I was working on a webscraping project, and this issue gave me tons of problems. I tried almost every solution out there that dealt with Python encoding (convert to UTF using string.encode(), convert to ASCII, convert using the 'unicodedata' module, use .decode() and then .encode(), blood sacrifice to Tim Peters, etc etc).

None of the solutions worked all the time, which struck me as very un-Pythonic.

So what I ended up using was the following:

html = bs.prettify()  # bs is your BeautifulSoup object
with open("out.txt", "w") as out:
    for i in range(0, len(html)):
        try:
            out.write(html[i])
        except Exception:
            1+1  # skip characters that can't be encoded

It's not perfect, but it gave me the best results. When I opened it in a browser, it was able to parse the page properly almost every time.
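For comparison, a minimal sketch of the same idea using `open()`'s `errors` parameter instead of a character-by-character loop; the sample string stands in for `bs.prettify()` output and is an assumption for illustration:

```python
import os
import tempfile

# Stand-in for bs.prettify() output; \xa0 is a non-breaking space
# that a plain ASCII encoder cannot handle.
html = "voorbeeld\xa0tekst"

path = os.path.join(tempfile.gettempdir(), "out.txt")

# errors="replace" substitutes unencodable characters with "?"
# instead of raising UnicodeEncodeError.
with open(path, "w", encoding="ascii", errors="replace") as out:
    out.write(html)

with open(path, encoding="ascii") as f:
    print(f.read())  # voorbeeld?tekst
```

Opening the file with `encoding="utf-8"` instead would keep the character intact; `errors="replace"` is only a fallback when the target encoding genuinely cannot represent the text.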

Abhishek Divekar
  • Your solutions didn't work all the time because you are not properly encoding and decoding your inputs and outputs. Normally you should read the encoding of the HTML document. Luckily for everyone, almost all content is in `utf-8` or a Latin encoding :) –  Mar 16 '16 at 17:55
  • @rsm Yes, but the problem is not in parsing the HTML page; it always happens when writing to file. OSes often default to simple encodings, and converting from UTF-8 to ASCII is a pain. It's possible to set the file encoding to UTF-8 by opening the file via `open("out.txt", "w", encoding="utf-8")`, but practically that has not worked for me often. I'm simply speaking from practical experience, and I've found the above solution to be the one that "just works". – Abhishek Divekar Mar 16 '16 at 18:48
  • The UTF problem is "solved", I think: http://stackoverflow.com/questions/36044653/beautifulsoup-error-in-file-saving-txt?noredirect=1#comment59738275_36044653 Now the .txt file issue still needs solving. Or am I wrong? – Danisk Mar 16 '16 at 19:19
  • Well, I'm sorry guys if I don't want to discuss it further, but there is plenty of documentation out there. Encoding is a really important part of a program and sometimes a pain, but for me Python 3 does the job. If you want to know the charset of an HTML file, just take a look at the meta charset tag (`<meta charset="utf-8">`). –  Mar 16 '16 at 19:37
  • @abhidivekar by the way, I just took a good look at your code and I don't recommend anyone use it. There's no reason to have the `for` loop, the `range`, the `Exception`, or the `1+1`. You could simply use something like `out.write(html.encode('utf-8','replace'))` –  Mar 16 '16 at 20:13