I am using BeautifulSoup in a Django project to extract all the data from HTML tables. My code pulls the table out of the page and saves its contents as a list of lists:

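import bs4 as bs

# html_source is the page markup as a string
soup = bs.BeautifulSoup(html_source, 'lxml')
table = soup.find('table', {'id': 'detail'})
rows = table.find_all('tr')

# Collect the text fragments of every cell, then join each cell into one string
data = [[td.find_all(string=True) for td in tr.find_all(['th', 'td'])] for tr in rows]
data = [[u"".join(d).strip() for d in l] for l in data]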

This code has worked well so far, but for the HTML table below it does not capture all the data: it returns only the thead rows. I cannot figure out why.

<table class="table_type1" data-tdborder="" id="detail">
   <colgroup>
      <col width="38">
      <col>
      <col>
      <col width="140">
   </colgroup>
   <thead>
      <tr>
         <th>No.</th>
         <th>Status</th>
         <th>Location</th>
         <th>Event Date</th>
      </tr>
   </thead>
   <tbody>
      <tr>
         <td style="text-align:center;">1</td>
         <td class="multi_row" style="line-height:15px;">Empty Container Release to Shipper</td>
         <td class="multi_row" style="line-height:15px;">SHANGHAI, SHANGHAI ,CHINA<br>  <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CNSHA49')" title="GREATING FORTUNE (SHANGHAI) CONTAIN">GREATING FORTUNE (SHANGHAI) CONTAIN</a></td>
         <td class="ico_a">2017-10-09 10:51</td>
      </tr>
      <tr>
         <td style="text-align:center;">2</td>
         <td class="multi_row" style="line-height:15px;">Gate In to Outbound Terminal</td>
         <td class="multi_row" style="line-height:15px;">SHANGHAI, SHANGHAI ,CHINA<br>  <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CNSHA10')" title="SHANGHAI SHENDONG INTERNATIONAL CON (DXYS)">SHANGHAI SHENDONG INTERNATIONAL CON (DXYS)</a></td>
         <td class="ico_a">2017-10-10 04:43</td>
      </tr>
      <tr>
         <td style="text-align:center;">3</td>
         <td class="multi_row" style="line-height:15px;">Loaded on 'NYK LYNX 2610E' at Port of Loading<br> <a href="JavaScript:void(0);" style="line-height:15px;" title="NYK LYNX" data-click="vesselPop" data-cd="YNXT0260E">NYK LYNX 2610E</a></td>
         <td class="multi_row" style="line-height:15px;">SHANGHAI, SHANGHAI ,CHINA<br>  <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CNSHA10')" title="SHANGHAI SHENDONG INTERNATIONAL CON (DXYS)">SHANGHAI SHENDONG INTERNATIONAL CON (DXYS)</a></td>
         <td class="ico_a">2017-10-11 22:58</td>
      </tr>
      <tr>
         <td style="text-align:center;">4</td>
         <td class="multi_row" style="line-height:15px;">'NYK LYNX 2610E' Departure from Port of Loading<br> <a href="JavaScript:void(0);" style="line-height:15px;" title="NYK LYNX" data-click="vesselPop" data-cd="YNXT0260E">NYK LYNX 2610E</a></td>
         <td class="multi_row" style="line-height:15px;">SHANGHAI, SHANGHAI ,CHINA<br>  <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CNSHA10')" title="SHANGHAI SHENDONG INTERNATIONAL CON (DXYS)">SHANGHAI SHENDONG INTERNATIONAL CON (DXYS)</a></td>
         <td class="ico_a">2017-10-12 05:00</td>
      </tr>
      <tr>
         <td style="text-align:center;">5</td>
         <td class="multi_row" style="line-height:15px;">'NYK LYNX 2610E' Arrival at Port of Discharging<br> <a href="JavaScript:void(0);" style="line-height:15px;" title="NYK LYNX" data-click="vesselPop" data-cd="YNXT0260E">NYK LYNX 2610E</a></td>
         <td class="multi_row" style="line-height:15px;">VALPARAISO ,CHILE<br>  <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CLVAP01')" title="TERMINAL PACIFICO SUR">TERMINAL PACIFICO SUR</a></td>
         <td class="ico_e">2017-11-14 21:00</td>
      </tr>
      <tr>
         <td style="text-align:center;">6</td>
         <td class="multi_row" style="line-height:15px;">'NYK LYNX 2610E' POD Berthing Destination<br> <a href="JavaScript:void(0);" style="line-height:15px;" title="NYK LYNX" data-click="vesselPop" data-cd="YNXT0260E">NYK LYNX 2610E</a></td>
         <td class="multi_row" style="line-height:15px;">VALPARAISO ,CHILE<br>  <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CLVAP01')" title="TERMINAL PACIFICO SUR">TERMINAL PACIFICO SUR</a></td>
         <td class="ico_e">2017-11-14 22:00</td>
      </tr>
      <tr>
         <td style="text-align:center;">7</td>
         <td class="multi_row" style="line-height:15px;">Unloaded from 'NYK LYNX 2610E' at Port of Discharging<br> <a href="JavaScript:void(0);" style="line-height:15px;" title="NYK LYNX" data-click="vesselPop" data-cd="YNXT0260E">NYK LYNX 2610E</a></td>
         <td class="multi_row" style="line-height:15px;">VALPARAISO ,CHILE<br>  <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CLVAP01')" title="TERMINAL PACIFICO SUR">TERMINAL PACIFICO SUR</a></td>
         <td class="ico_e">2017-11-14 23:30</td>
      </tr>
      <tr>
         <td style="text-align:center;">8</td>
         <td class="multi_row" style="line-height:15px;">Gate Out from Inbound Terminal for Delivery to Consignee (or Port Shuttle)</td>
         <td class="multi_row" style="line-height:15px;">VALPARAISO ,CHILE<br>  <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CLVAP01')" title="TERMINAL PACIFICO SUR">TERMINAL PACIFICO SUR</a></td>
         <td class="ico_e">2017-11-15 04:00</td>
      </tr>
      <tr>
         <td style="text-align:center;">9</td>
         <td class="multi_row" style="line-height:15px;">Empty Container Returned from Customer</td>
         <td class="multi_row" style="line-height:15px;">VALPARAISO ,CHILE<br> </td>
         <td class="ico_e">2017-11-15 10:00</td>
      </tr>
   </tbody>
</table>

Edit

I printed the soup object and went through all the HTML. Surprisingly, it contains only the thead of the table and not the tbody. Is this a bug in BeautifulSoup? This is the only part of the table that beautifulsoup4 captures:

 <table class="table_type1" data-tdborder="" id="detail">
    <colgroup>
       <col width="38"/>
       <col/>
       <col/>
       <col width="140"/>
    </colgroup>
    <thead>
       <tr>
          <th>No.</th>
          <th>Status</th>
          <th>Location</th>
          <th>Event Date</th>
       </tr>
    </thead>
 </table>
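
One way to tell a parser problem from incomplete input is to count the rows in the raw markup before it reaches BeautifulSoup. The following is only a rough sketch using the standard library's `html.parser`; the names `SectionRowCounter` and `count_section_rows` are made up for illustration, and `html_source` is assumed to be the same string that is passed to BeautifulSoup above.

```python
from html.parser import HTMLParser


class SectionRowCounter(HTMLParser):
    """Counts <tr> tags per table section (thead, tbody, tfoot)."""

    SECTIONS = ("thead", "tbody", "tfoot")

    def __init__(self):
        super().__init__()
        self.current = None  # section we are currently inside, if any
        self.counts = {name: 0 for name in self.SECTIONS}

    def handle_starttag(self, tag, attrs):
        if tag in self.SECTIONS:
            self.current = tag
        elif tag == "tr" and self.current is not None:
            self.counts[self.current] += 1

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None


def count_section_rows(html_source):
    """Return a dict with the number of rows found in each table section."""
    counter = SectionRowCounter()
    counter.feed(html_source)
    return counter.counts
```

Calling `count_section_rows(html_source)` right before building the soup shows what the markup really contains. If the `tbody` count is already zero at that point, the rows were never in the string, so the parser is probably not at fault. Note that this sketch counts rows across every table in the page, not only `#detail`.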
  • Is your HTML coming from a third-party site? If yes, it might be dynamically generated. If no, which version of BeautifulSoup are you using? I'm using [bs4 4.5.1](https://pypi.python.org/pypi/beautifulsoup4/4.5.1) with the HTML you gave and it appears to work fine – Wondercricket Oct 31 '17 at 20:53
  • I am looking up a container number on https://www.nykline.com/ and trying to get the result data. When I right-click the page, it shows the HTML I posted, which I copied from the browser. I am using bs4 4.6.0 – Ibo Oct 31 '17 at 20:57
  • @Wondercricket it seems you were right. I added a 5-second delay (`time.sleep(5)`) after the click event, and this time the `soup` object had all the data. Is there a way to tell bs4 to proceed as soon as the whole page is ready? – Ibo Oct 31 '17 at 21:04
  • For that, you might need to use [selenium](https://stackoverflow.com/questions/7781792/selenium-waitforelement) to wait for the element to exist – Wondercricket Oct 31 '17 at 21:08
  • I thought of that, I am actually using selenium to mimic the data entry and clicking the button. – Ibo Oct 31 '17 at 21:11
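
As a rough illustration of the idea in the comments above, the fixed `time.sleep(5)` could be replaced by polling until the table body shows up. This is a minimal sketch, not a definitive solution: `wait_for_html`, `get_html` and `is_ready` are hypothetical names, and the readiness check is left to the caller.

```python
import time


def wait_for_html(get_html, is_ready, timeout=10.0, poll=0.25):
    """Call get_html() repeatedly until is_ready(html) is true.

    Returns the first markup that passes the check, or raises
    TimeoutError once `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        html_source = get_html()
        if is_ready(html_source):
            return html_source
        if time.monotonic() >= deadline:
            raise TimeoutError("page was not ready within %.1f seconds" % timeout)
        time.sleep(poll)  # short pause instead of one long fixed delay
```

With Selenium this might be used as `html_source = wait_for_html(lambda: driver.page_source, lambda html: "<tbody" in html)`, where `driver` is assumed to be the WebDriver instance that performed the click; the result can then be handed to BeautifulSoup as before. Selenium also ships its own explicit waits (`WebDriverWait` together with `expected_conditions`), which cover the same need without a hand-written loop.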
