Web scraping with Python
Web scraping is an automated, programmatic process through which data can be constantly 'scraped' off webpages. Also known as screen scraping or web harvesting, web scraping can provide instant data from any publicly accessible webpage.

Web Scraping with Python Code Samples
These code samples are for the book Web Scraping with Python, 2nd Edition. If you're looking for the first edition code files, they can be found in the v1 directory. Most code for the second edition is contained in Jupyter notebooks.
import urllib2
from bs4 import BeautifulSoup
# http://segfault.in/2010/07/parsing-html-table-in-python-with-beautifulsoup/

f = open('cricket-data.txt', 'w')
linksFile = open('linksSource.txt')
lines = list(linksFile.readlines())

for i in lines[12:108]:  # 12:108
    url = 'http://www.gunnercricket.com/' + i.strip()  # strip the trailing newline from each link
    try:
        page = urllib2.urlopen(url)
    except:
        continue
    soup = BeautifulSoup(page)
    title = soup.title
    date = title.string[:4] + ','  # take first 4 characters from title
    try:
        table = soup.find('table')
        rows = table.findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            text_data = []
            for td in cols:
                text = ''.join(td.findAll(text=True))  # all text inside the cell
                utftext = str(text.encode('utf-8'))
                text_data.append(utftext)  # EDIT
            text = date + ','.join(text_data)
            f.write(text + '\n')
    except:
        pass

f.close()
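The snippet above is Python 2 only (urllib2 does not exist on Python 3). A rough Python 3 translation is sketched below; it assumes the same linksSource.txt input and the same 12:108 slice, and the site and its table layout are unverified.

# Rough Python 3 sketch of the scraper above (not tested against the live site):
# urllib.request replaces urllib2, and files are opened with explicit encodings.
from urllib.request import urlopen
from urllib.error import URLError
from bs4 import BeautifulSoup

with open('linksSource.txt') as linksFile, open('cricket-data.txt', 'w', encoding='utf-8') as f:
    lines = linksFile.readlines()
    for line in lines[12:108]:
        url = 'http://www.gunnercricket.com/' + line.strip()
        try:
            page = urlopen(url)
        except URLError:
            continue
        soup = BeautifulSoup(page, 'html.parser')
        date = soup.title.string[:4] + ','     # first 4 characters of the title
        table = soup.find('table')
        if table is None:
            continue
        for tr in table.find_all('tr'):
            cells = [td.get_text(strip=True) for td in tr.find_all('td')]
            if cells:
                f.write(date + ','.join(cells) + '\n')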
The data is in the text content of the response, which is response.text, and it is HTML. We can parse it with Beautiful Soup's html.parser, which saves a lot of time when web scraping in Python. Parsing transforms the HTML document into a BeautifulSoup object, a tree of Python objects. Beautiful Soup is a Python HTML parser that is well suited to screen scraping, and its documentation includes a tutorial on parsing an HTML document.
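A minimal illustration of that flow, assuming requests and beautifulsoup4 are installed and using example.com as a stand-in URL:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')       # placeholder URL
soup = BeautifulSoup(response.text, 'html.parser')   # parse the HTML held in response.text

print(soup.title.string)          # text of the <title> tag
for link in soup.find_all('a'):
    print(link.get('href'))       # every href on the page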
commented Jan 15, 2018
import pandas as pd
from bs4 import BeautifulSoup
import requests
import lxml

url = 'http://espn.go.com/college-football/bcs/_/year/2013 '
result = requests.get(url)
c = result.content
soup.prettify()
summary = soup.find('table', attrs={'class': 'tablehead'})
#tables = summary.fins_all('td' /'tr')
data = []
rows = tables[0].findAll('tr')
list_of_rows = []
for row in table.findAll('tr')[0:]:
    outfile = open('./Rankings.csv', 'wb')

Can you please help me with this code? I am using Python 3.5.
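A possible cleaned-up version of that snippet, offered as a sketch rather than a verified answer: it assumes the ESPN page still serves a table with class 'tablehead' and that writing the rows to Rankings.csv is the goal.

# Sketch of a working version of the snippet above. The URL and the 'tablehead'
# class come from the original post and may have changed since.
import csv
import requests
from bs4 import BeautifulSoup

url = 'http://espn.go.com/college-football/bcs/_/year/2013'
result = requests.get(url)
soup = BeautifulSoup(result.content, 'lxml')   # build the soup before using it

summary = soup.find('table', attrs={'class': 'tablehead'})
if summary is not None:
    with open('Rankings.csv', 'w', newline='', encoding='utf-8') as outfile:
        writer = csv.writer(outfile)
        for row in summary.find_all('tr'):
            cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
            if cells:
                writer.writerow(cells)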
beautifulsoup4 4.6.3
certifi 2018.10.15
chardet 3.0.4
idna 2.7
lxml 4.2.5
requests 2.20.1
selenium 3.141.0
urllib3 1.24.1
# answer to https://stackoverflow.com/q/53475578/890242
import requests
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool, Pool
from bs4 import BeautifulSoup
from selenium import webdriver
import threading

def get_links(link):
    # collect the question URLs from the tag listing page
    res = requests.get(link)
    soup = BeautifulSoup(res.text, 'lxml')
    titles = [urljoin(url, items.get('href')) for items in soup.select('.summary .question-hyperlink')]
    return titles

threadLocal = threading.local()

def get_driver():
    # one headless Chrome per thread, kept in thread-local storage
    driver = getattr(threadLocal, 'driver', None)
    if driver is None:
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument('--headless')
        driver = webdriver.Chrome(chrome_options=chromeOptions)
        setattr(threadLocal, 'driver', driver)
    return driver

def get_title(url):
    driver = get_driver()
    driver.get(url)
    sauce = BeautifulSoup(driver.page_source, 'lxml')
    item = sauce.select_one('h1 a').text
    print(item)

if __name__ == '__main__':
    url = 'https://stackoverflow.com/questions/tagged/web-scraping'
    ThreadPool(5).map(get_title, get_links(url))
commented Jun 9, 2020
Hey mate, I had a question regarding the code. While using the map function to get the titles, the code calls get_title 50 times, and I was wondering whether that would open 50 browsers. If so, what would be the best practice for reducing memory usage while maintaining the speed of parallel processing? Thanks
commented Mar 16, 2021
It doesn't work if ThreadPool is changed to Pool, though... I get a pickle error.
commented Mar 19, 2021
This starts a ThreadPool of 5 threads; each thread will have only one web browser open at any one time.
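One way to keep memory in check, sketched here on the assumption that the thread-local get_driver pattern and imports from the snippet above are kept: reuse the per-thread driver for every URL a thread handles, then quit all drivers once the pool is done. The names all_drivers and quit_drivers are illustrative, not part of the original answer.

# Hypothetical extension of the snippet above: remember every driver created in
# get_driver() so they can all be closed once the pool has finished.
all_drivers = []

def get_driver():
    driver = getattr(threadLocal, 'driver', None)
    if driver is None:
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument('--headless')
        driver = webdriver.Chrome(chrome_options=chromeOptions)
        setattr(threadLocal, 'driver', driver)
        all_drivers.append(driver)   # remember it so it can be quit later
    return driver

def quit_drivers():
    for driver in all_drivers:
        driver.quit()                # frees the Chrome process and its memory

if __name__ == '__main__':
    url = 'https://stackoverflow.com/questions/tagged/web-scraping'
    try:
        ThreadPool(5).map(get_title, get_links(url))
    finally:
        quit_drivers()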
commented Mar 19, 2021
That would indicate that the get_links function returns a list of unpicklable objects. It should be easy to fix.
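If that is the suspicion, a quick way to check is to try pickling what get_links returns before handing it to Pool. This is only a diagnostic sketch, assuming get_links and url from the snippet above are in scope; in that code get_links already returns plain strings, which pickle fine.

# Diagnostic sketch: check whether the items returned by get_links() can be
# pickled, which multiprocessing.Pool requires for arguments and results.
import pickle

links = get_links(url)
for item in links:
    try:
        pickle.dumps(item)
    except Exception as exc:
        print('not picklable:', repr(item), exc)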