GitHub Web Scraping With Python



Scraping data from a web table using Python and Beautiful Soup

Web scraping is an automated, programmatic process through which data can be constantly 'scraped' off webpages. Also known as screen scraping or web harvesting, web scraping can provide instant data from any publicly accessible webpage.

These code samples are for the book Web Scraping with Python, 2nd Edition. If you're looking for the first edition code files, they can be found in the v1 directory. Most code for the second edition is contained in Jupyter notebooks.

Cricket data.py
import urllib2
from bs4 import BeautifulSoup
# http://segfault.in/2010/07/parsing-html-table-in-python-with-beautifulsoup/

f = open('cricket-data.txt', 'w')
linksFile = open('linksSource.txt')
lines = list(linksFile.readlines())

for i in lines[12:108]:  # 12:108
    url = 'http://www.gunnercricket.com/' + i.strip()  # strip the trailing newline from each link
    try:
        page = urllib2.urlopen(url)
    except:
        continue
    soup = BeautifulSoup(page)

    title = soup.title
    date = title.string[:4] + ','  # take first 4 characters from title

    try:
        table = soup.find('table')
        rows = table.findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            text_data = []
            for td in cols:
                text = ''.join(td.findAll(text=True))  # join all text nodes in the cell
                utftext = str(text.encode('utf-8'))
                text_data.append(utftext)  # EDIT
            text = date + ','.join(text_data)
            f.write(text + '\n')
    except:
        pass

f.close()

The data is in the text content of the response, which is response.text, and is HTML. We can parse it with Beautiful Soup using the built-in html.parser, which saves a lot of time when web scraping in Python. This turns the HTML document into a BeautifulSoup object, a tree of Python objects. Beautiful Soup is a Python HTML parser that is well suited to screen scraping; in particular, see their tutorial on parsing an HTML document.
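As a minimal sketch of that idea (the URL and the tags selected here are placeholders, not taken from the gists on this page), fetching a page with requests and parsing it with the built-in html.parser looks roughly like this:

import requests
from bs4 import BeautifulSoup

# Placeholder URL; substitute any publicly accessible page.
response = requests.get('https://example.com')

# response.text is the HTML of the page; 'html.parser' is Python's built-in parser.
soup = BeautifulSoup(response.text, 'html.parser')

# The BeautifulSoup object is a tree of Python objects that can be searched.
print(soup.title.string)            # text of the <title> tag
for link in soup.find_all('a'):     # every anchor on the page
    print(link.get('href'))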

commented Jan 15, 2018

import pandas as pd
from pandas import Series, DataFrame

from bs4 import BeautifulSoup
import json
import csv

import requests

import lxml

url = 'http://espn.go.com/college-football/bcs/_/year/2013'

result = requests.get(url)

c = result.content
soup = BeautifulSoup(c, 'lxml')

soup.prettify()

summary = soup.find('table', attrs={'class': 'tablehead'})
tables = summary.find_all('table')

# tables = summary.find_all('td' / 'tr')

data = []

rows = tables[0].findAll('tr')
'''
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = td.find(text=True)
        print(text),
        data.append(text)
'''
soup = BeautifulSoup(c, 'lxml')
table = soup.find('table', attrs={'class': 'tablehead'})

list_of_rows = []

for row in table.findAll('tr')[0:]:
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

# csv.writer on Python 3 expects a text-mode file opened with newline=''
outfile = open('./Rankings.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerows(list_of_rows)

Can you please help me with this code? I am using Python 3.5.

Python multiprocess parallel selenium web scraping with improved performance
requirements.txt
beautifulsoup4==4.6.3
certifi==2018.10.15
chardet==3.0.4
idna==2.7
lxml==4.2.5
requests==2.20.1
selenium==3.141.0
urllib3==1.24.1
scraper.py
# answer to https://stackoverflow.com/q/53475578/890242
import requests
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool, Pool
from bs4 import BeautifulSoup
from selenium import webdriver
import threading

def get_links(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, 'lxml')
    titles = [urljoin(url, items.get('href')) for items in soup.select('.summary .question-hyperlink')]
    return titles

threadLocal = threading.local()

def get_driver():
    driver = getattr(threadLocal, 'driver', None)
    if driver is None:
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument('--headless')
        driver = webdriver.Chrome(chrome_options=chromeOptions)
        setattr(threadLocal, 'driver', driver)
    return driver

def get_title(url):
    driver = get_driver()
    driver.get(url)
    sauce = BeautifulSoup(driver.page_source, 'lxml')
    item = sauce.select_one('h1 a').text
    print(item)

if __name__ == '__main__':
    url = 'https://stackoverflow.com/questions/tagged/web-scraping'
    ThreadPool(5).map(get_title, get_links(url))

commented Jun 9, 2020

Hey mate, I had a question regarding the code.

While using the map function to get the titles, the code calls the get_title function 50 times and I was wondering if it would open 50 browsers?

If yes, what would be the best practice for reducing the memory usage while maintaining the speed of parallel processing?

Thanks

commented Mar 16, 2021

Doesn't work if ThreadPool is changed to Pool tho... Getting pickle error

commented Mar 19, 2021


Hey mate, I had a question regarding the code.

While using the map function to get the titles, the code calls the get_title function 50 times and I was wondering if it would open 50 browsers?

If yes, what would be the best practice for reducing the memory usage while maintaining the speed of parallel processing?

Thanks

This starts a ThreadPool of 5 threads; each thread will have only one web browser open at any one time.
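If memory is still a concern, one option, sketched below against scraper.py above (the all_drivers list, the lock, and the pool size of 3 are illustrative additions, not part of the original gist), is to use a smaller pool and to track every driver that gets created so they can all be quit once the pool has finished:

import threading
from multiprocessing.pool import ThreadPool

all_drivers = []                      # illustrative: remember every driver so it can be closed later
all_drivers_lock = threading.Lock()

def get_driver():
    driver = getattr(threadLocal, 'driver', None)
    if driver is None:
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument('--headless')
        driver = webdriver.Chrome(chrome_options=chromeOptions)
        setattr(threadLocal, 'driver', driver)
        with all_drivers_lock:
            all_drivers.append(driver)
    return driver

if __name__ == '__main__':
    url = 'https://stackoverflow.com/questions/tagged/web-scraping'
    pool = ThreadPool(3)              # fewer threads means fewer simultaneous browsers
    pool.map(get_title, get_links(url))
    pool.close()
    pool.join()
    for driver in all_drivers:        # quit every browser once all the work is done
        driver.quit()

Fewer threads trade some speed for memory; the per-thread driver reuse from the gist is unchanged.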

commented Mar 19, 2021

Python Scrapy Github

Doesn't work if ThreadPool is changed to Pool tho... Getting pickle error

That would indicate that the get_links function returns a list of unpicklable objects. Should be easy to fix.
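One quick way to check that diagnosis (purely illustrative, not from the original thread) is to try pickling what get_links returns before handing it to a process Pool:

import pickle

# If this raises, the items returned by get_links cannot cross process
# boundaries and need to be converted to plain picklable types (e.g. str).
links = get_links('https://stackoverflow.com/questions/tagged/web-scraping')
pickle.dumps(links)

# With multiprocessing.pool.Pool, both the worker function and its arguments
# must be picklable; plain strings and module-level functions qualify.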
