Webscrape + ETL

In 5 minutes, build an automated web scraper and ETL pipeline

Wait, but why?

Just about every data scientist or general web hacker out there has dreamed of a use case for web scraping.

  • Scrape pricing data from your competitors for up to date pricing intelligence
  • Scrape product pricing data from suppliers to automate inventory management
  • Scrape prospect profiles to find customer leads, potential employees, or service providers
  • Scrape houses from Zillow or apartments from Craigslist to optimize your search
  • Many, many more…

If you’re technically skilled enough, maybe you’ve built your own: a scraper written with BeautifulSoup or Selenium, prototyped in a Jupyter notebook, and deployed as a serverless function on AWS Lambda or Google Cloud Functions. One week later, after learning way too much about cloud infrastructure, you developed yet another function to pipe that messy data to your data warehouse so you could clean it up. To clean it up, you maybe used Airflow, dbt, or another modeling/orchestration tool to parse HTML, rip apart JSONs, and generate structured data so you could finally do analysis.

Wait, but the job’s not done yet. Because you’re building a data product or a business operation that relies on data being up to date, you either need to wake up at 7 am every day to manually run the pipeline, or develop an automation to kick off the scraping job, maintain state of what you’re scraping, and persist that data in your warehouse.

If this ^ sounds familiar — this post is for you.

Patterns is a general-purpose data science tool that abstracts away the messy bits of deploying infrastructure and hacking together tooling. In this post, we will demonstrate how you can build a robust, scalable, and automated web scraping and data pipeline application in 5 minutes. To do this, you will need:

  1. Webscraper.io for running proxied, parallelized, automated web scraping jobs
  2. Patterns for managing the scraper and for ingesting, storing, and pipelining your data

Your final result will look like this (click to clone it!)


Step-by-step instructions

  1. Create a new app and set up a webhook node in Patterns to receive job events from webscraper.io
  2. Create a scraping job in webscraper.io
  3. Add a stream node to store job events and pass them downstream
  4. Use a Python node to retrieve the actual scraped data and structure it (see the first sketch after this list)
  5. Add a table node to store the web scraping results, enabling whatever downstream analysis your use case requires (see the second sketch below)
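
Step 4 is where most of the structuring work happens. Below is a minimal, framework-agnostic sketch of what that Python node might do, assuming Webscraper.io's JSON export endpoint (/scraping-job/{id}/json, returning newline-delimited JSON) and a sitemap that captures hypothetical title and price fields. The scraping job id would come from the webhook event; inside Patterns you would read it from the upstream stream node and write the structured records to the downstream table node rather than using local variables.

```python
import json
import requests

# Hypothetical values: substitute your own API token, and take the job id
# from the event Webscraper.io posts to the webhook node when a job finishes.
API_TOKEN = "your-webscraper-io-api-token"
SCRAPING_JOB_ID = 12345  # e.g. event["scrapingjob_id"] from the webhook payload

# Assumed JSON export endpoint -- confirm the exact path in the Webscraper.io API docs.
url = f"https://api.webscraper.io/api/v1/scraping-job/{SCRAPING_JOB_ID}/json"
resp = requests.get(url, params={"api_token": API_TOKEN}, timeout=30)
resp.raise_for_status()

# The export is newline-delimited JSON: one scraped record per line.
records = [json.loads(line) for line in resp.text.splitlines() if line.strip()]

# Light structuring: keep only the fields this example cares about and normalize types.
# "title" and "price" are placeholders for whatever selectors your sitemap defines.
structured = [
    {
        "title": r.get("title"),
        "price": float(str(r.get("price", "0")).replace("$", "").replace(",", "") or 0),
        "source_url": r.get("web-scraper-start-url"),
    }
    for r in records
]
```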
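
Once those structured records land in the table node (step 5), any standard analysis applies on top of them. As a small illustration of the pricing-intelligence use case, here is a pandas sketch over the structured list from the previous example; in practice you would query the Patterns table instead.

```python
import pandas as pd

# Assumes `structured` from the previous sketch, or an equivalent query
# against the table node that stores the scraped results.
df = pd.DataFrame(structured)

# Example downstream analysis: average, minimum, and count of prices per title.
summary = (
    df.groupby("title")["price"]
      .agg(["mean", "min", "count"])
      .sort_values("mean", ascending=False)
)
print(summary.head(10))
```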

Tutorial Video