Web scraping is the process of extracting data from web pages in a structured format. It is one of the most efficient ways to collect data from websites, especially when you need to feed that data into an application or another website.
Web scraping, also known as data scraping, has numerous applications, including comparing prices across multiple websites, gathering market research data, monitoring products, and general research. As a data scientist, I find it most useful for getting data that is not available through APIs. As a beginner or a pro, you may find it useful for comparing prices or collecting data from the web.
In this write-up, I will introduce you to two methods for web scraping. The first is a beginner-friendly way to scrape data using a ready-to-use solution. The second is a programmer-friendly way to scrape data using Scrapy, which supports powerful scraping when configured well. Let's check out both of them.
If you are not a developer or not familiar with Python, here is the easy solution for you. There are numerous tools on the market for scraping the web with little to no programming. Some web scrapers even let you scrape the web entirely through an intuitive interface.
Octoparse is one such web scraping tool, which lets you easily scrape the web. As I will detail later in this post, Octoparse lets you perform scraping in three simple steps. Fortunately, it offers a free plan via its app for scraping data locally, so you can handle small scraping tasks with zero investment.
Moreover, Octoparse has many advanced features for hands-on web scraping without programming. I find its templates especially interesting, as they let you scrape data from popular websites without any configuration. For example, you can choose a prebuilt template to scrape product data from Amazon or eBay.
Octoparse is powerful yet handy because it can detect and extract data from web pages automatically. Although I found this works best with a list or table of data, it is the fastest way to get started. For this example, we will extract the price from the following eBay product page:
https://www.ebay.com/itm/Amazon-Echo-Dot-4th-Gen-With-Clock-2020-Smart-Speaker-Alexa-All-Colors-NEW/363219888368
Sometimes, Octoparse's automatic data extraction may not be enough for you. Maybe the web page you are trying to extract data from is complex or dynamic. Whatever the case, you are in good hands with Octoparse, since it also allows you to manually select the data to extract.
Alright, we have created the task to extract pricing information from eBay via Octoparse. However, it is not yet automated; for now, you must run it manually. Fortunately, Octoparse also lets you schedule the task to extract data periodically.
Once scheduled, your task will run automatically in Octoparse Cloud per your configured schedule. You can view the data by opening the Dashboard, clicking the More button for your task, and selecting View data > Cloud data.
If you have tried the first method and want more control, or you are a programmer who wants a programmatic way to scrape web pages, try this method. We will use Scrapy to build the solution. I assume you have hands-on knowledge of HTML, CSS, and Python.
Scrapy is an open-source framework for extracting data from websites and a popular scraping tool among data scientists. In my experience, it works great for both small and big projects, but you may need to configure it carefully and bring in third-party tools to make it effective for large scraping projects.
Do you know about CSS selectors? On any web page, selectors help identify and select specific elements; you can read about them on W3Schools. For example, if you want to find all top-level headings on a web page, you can use h1 as the selector. In Google Chrome, you can right-click an element and choose Inspect to see it in the developer tools and work out a selector for it.
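As a quick reference, here are a few common selector patterns of the kind used below (the id and class names are only illustrative):

```css
h1            /* every top-level heading */
#item-price   /* the element whose id is "item-price" */
.price        /* every element with the class "price" */
div p         /* every paragraph nested inside a div */
```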
If you are trying to scrape a web page not discussed in this post, or you want to scrape more data from these pages, you will need to find the right selectors and use them. That said, let's get started. In the example below, I am going to scrape the price of the Amazon Echo Dot 4th Gen from eBay using Scrapy:
First, create a project directory and switch into it (e.g., cd scrape-web-regularly), then run the following to install Scrapy:

```shell
pip install scrapy
```

If you get an error, read the message carefully to find the mitigation steps. For example, if the error says “error: Microsoft Visual C++ 14.0 or greater is required. Get it with Microsoft C++ Build Tools: https://ift.tt/2Mj3miN“, you should download and install the build tools from the given link, then try again. Alternatively, in such cases you can use Anaconda to download prebuilt packages.
Save the following spider code in a file named try-one.py:

```python
import scrapy

class EBaySpider(scrapy.Spider):
    # name of the scraper
    name = 'ebay_spider'

    # link or URL to scrape from
    link1 = 'https://www.ebay.com/itm/Amazon-Echo-Dot-4th-Gen-With-Clock-2020-Smart-Speaker-Alexa-All-Colors-NEW/363219888368'

    # links or URLs to scrape data from
    start_urls = [link1]

    def parse(self, response, **kwargs):
        # select the element to scrape data from
        for title in response.css('#prcIum'):
            # extract the text data from the element
            yield {'price': title.css('::text').get()}
```
Then run the following command to perform the scraping:

```shell
scrapy runspider try-one.py
```

If you are writing a scraper for some other page or site, you may need a few runs to land on the right selector. A successful run logs output like this:

```
[scrapy.core.engine] INFO: Spider opened
[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ebay.com/itm/Amazon-Echo-Dot-4th-Gen-With-Clock-2020-Smart-Speaker-Alexa-All-Colors-NEW/363219888368> (referer: None)
[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ebay.com/itm/Amazon-Echo-Dot-4th-Gen-With-Clock-2020-Smart-Speaker-Alexa-All-Colors-NEW/363219888368>
{'price': 'US $59.99'}
[scrapy.core.engine] INFO: Closing spider (finished)
```
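Note that the scraped price arrives as a display string such as 'US $59.99'. If you need a numeric value for comparisons, a small helper can normalize it; parse_price below is my own sketch, not part of Scrapy:

```python
import re

def parse_price(text):
    """Extract the numeric amount from a display string like 'US $59.99'."""
    # drop thousands separators, then grab the first number
    match = re.search(r'\d+(?:\.\d{1,2})?', text.replace(',', ''))
    return float(match.group()) if match else None

print(parse_price('US $59.99'))  # 59.99
```

You could call this on the yielded value inside parse() to store numbers instead of strings.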
On success, the log contains DEBUG: Scraped from followed by a line containing {'price': ...}; this is the scraped data. However, if you do not see these messages, check whether the output contains DEBUG: Crawled (200). If it does not, Scrapy was not able to crawl or fetch the web page. There can be multiple reasons, so here are some troubleshooting tips:
For example, try opening the URL stored in link1 in a web browser to confirm the page actually loads, and double-check that link1 in your code matches that URL exactly.

ScrapingBee is a web scraping service for getting around anti-scraping technologies and scraping the web without getting blocked. It provides a simple API that scrapes using headless browsers and rotating proxies, letting you bypass scraping protections while still working with Scrapy.
For example, I started out testing the scraper against eBay without success. eBay detects and blocks requests unless they come from a real user in a real web browser, so plain Scrapy does not work for eBay. You may also find that big or popular websites serve some Scrapy requests and then start blocking them too. That is where ScrapingBee comes in super handy.
I liked that ScrapingBee offers a free trial that includes 1,000 free API calls, letting you test the service or work on a small scraping project. That said, let's get started with using the ScrapingBee API in our Scrapy project:
Then, update link1 in the code, so link1 = 'https://www.ebay... becomes link1 = 'https://app.scrap...'.
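In other words, the original page URL is passed to ScrapingBee's API as a query parameter. A minimal sketch of building that URL with the standard library follows; the v1 endpoint and parameter names reflect ScrapingBee's documented API, but verify them against the current docs, and the api_key value is a placeholder:

```python
from urllib.parse import urlencode

def scrapingbee_url(target_url, api_key):
    # Wrap the target page URL in a ScrapingBee API call; ScrapingBee
    # fetches the page server-side and returns its HTML.
    params = {'api_key': api_key, 'url': target_url}
    return 'https://app.scrapingbee.com/api/v1/?' + urlencode(params)

link1 = scrapingbee_url(
    'https://www.ebay.com/itm/Amazon-Echo-Dot-4th-Gen-With-Clock-2020-Smart-Speaker-Alexa-All-Colors-NEW/363219888368',
    api_key='YOUR_API_KEY',  # placeholder: substitute your own key
)
```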
You have now completed building the scraper, i.e., the logic to scrape the data. However, it does not yet run at regular intervals; you have to run it manually, which is not the goal of this tutorial. Hence, let's automate our custom scraper so that it runs automatically at scheduled intervals.
In Linux OSs like Ubuntu and Linux Mint, you can use a cron job to run your scraper periodically. You can read this cron guide, then run crontab -e to edit the user's cron file, add a line of the following form, and save the file to schedule your cron job:

```shell
<CRON_SCHEDULE> cd <PROJECT_DIR> && scrapy runspider try-one.py
```

For <CRON_SCHEDULE>, * * * * * means run it every minute, 0 * * * * means run it once every hour, and 0 0 * * * means run it once every day.

Now, your scraper will run periodically at the scheduled time using cron in Linux OSs. If you are using Windows 10, you can use Task Scheduler to run your scraping task periodically instead; read my guide to automating repetitive tasks.
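If neither cron nor Task Scheduler suits you, a minimal cross-platform fallback is a Python loop that shells out to Scrapy at a fixed interval. run_periodically below is my own sketch, not part of Scrapy or cron:

```python
import subprocess
import time

def run_periodically(command, interval_seconds, max_runs=None):
    # Run a shell command repeatedly, sleeping between runs.
    # Returns the number of completed runs (handy for testing).
    runs = 0
    while max_runs is None or runs < max_runs:
        subprocess.run(command, shell=True)
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(interval_seconds)
    return runs

# For example, to run the spider once every hour:
# run_periodically('scrapy runspider try-one.py', 3600)
```

Unlike cron, this only runs while the script itself is running, so it is best for quick experiments rather than production scheduling.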
That is all about scraping a website using a ready-to-use platform like Octoparse and a custom-built program using Scrapy and ScrapingBee.
The post How to Scrape Webpages at Regular Intervals (Automatically) appeared first on Hongkiat.