Learn Web Scraping with Python - Beginner-Friendly Guide
What You Need:
- Python 3.6+
- requests library: pip install requests
- beautifulsoup4: pip install beautifulsoup4
Setup:
pip install requests beautifulsoup4
Example 1: Scrape a webpage title and links
import requests
from bs4 import BeautifulSoup
url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
print("Title:", soup.title.string)
for link in soup.find_all("a"):
    print(link.get("href"))
Example 2: Scrape table data
import requests
from bs4 import BeautifulSoup
url = "https://example.com/table"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find("table")
for row in table.find_all("tr"):
    cells = row.find_all("td")
    data = [cell.text.strip() for cell in cells]
    print(data)
Example 3: Save data to CSV
import requests, csv
from bs4 import BeautifulSoup
url = "https://example.com/data"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
items = soup.find_all("div", class_="item")
results = []
for item in items:
    results.append({
        "name": item.find("h3").text,
        "price": item.find("span").text,
    })
with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(results)
print("Data saved to output.csv")
Useful Tips:
1. Always add headers: requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
2. Handle errors: use try/except for network issues
3. Add delays: import time; time.sleep(2) between requests
4. Respect robots.txt: check target website rules
5. Use sessions for multiple requests to same site
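Tips 1, 2, 3, and 5 can be combined into one small helper. This is only a sketch: the name polite_get, the 2-second default delay, and the User-Agent string are illustrative choices, not requirements of any site or library.

```python
import time
import requests

# Tip 5: a Session reuses the connection for multiple requests to the same site
session = requests.Session()
# Tip 1: send a browser-like User-Agent header with every request
session.headers.update({"User-Agent": "Mozilla/5.0"})

def polite_get(url, delay=2.0):
    """Fetch url after a short pause, returning None on network errors."""
    time.sleep(delay)  # Tip 3: wait between requests
    try:  # Tip 2: handle network issues instead of crashing
        response = session.get(url, timeout=10)
        response.raise_for_status()  # treat 4xx/5xx status codes as errors too
        return response
    except requests.RequestException as exc:
        print(f"Request failed for {url}: {exc}")
        return None
```

For example, polite_get("https://example.com") waits two seconds, then fetches the page through the shared session with the headers already set.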
Common HTTP Status Codes:
- 200: OK, success
- 403: Forbidden, need headers
- 404: Page not found
- 429: Too many requests, slow down
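As a sketch, the codes above can be turned into readable messages with a small lookup helper (the function name describe_status and the dictionary are made up for illustration):

```python
# Hypothetical helper: map the common status codes listed above to short notes.
STATUS_MESSAGES = {
    200: "OK, success",
    403: "Forbidden, try adding browser-like headers",
    404: "Page not found",
    429: "Too many requests, slow down",
}

def describe_status(code):
    """Return a human-readable note for an HTTP status code."""
    return STATUS_MESSAGES.get(code, f"Unexpected status: {code}")

# Typical usage after a request (assuming requests is installed):
# response = requests.get("https://example.com")
# print(describe_status(response.status_code))
```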
That's it! Web scraping is powerful. Use it responsibly. Happy coding!