Monday, June 8, 2020

Data mining vs screen scraping


Data mining is not screen scraping. I know some people in the room may disagree with that statement, but the two are almost entirely different concepts.

Put simply, screen scraping lets you get information, while data mining lets you analyze it. That's a pretty big simplification, so I'll elaborate a bit.

The term "screen scraping" comes from the old days of the mainframe terminal, when people worked on computers with green-and-black screens that contained only text. Screen scraping was used to extract characters from those screens so they could be analyzed. Fast forward to today's web world, and screen scraping now more commonly refers to extracting information from websites. That is, computer programs can "crawl" or "scrape" through websites, extracting data. People often do this to build things like shopping comparison engines, archive web pages, or simply download text into a spreadsheet so it can be filtered and analyzed.
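
To make the "getting information" half concrete, here is a minimal sketch of web-style screen scraping in Python, using only the standard library's html.parser module. The page content, the "name" and "price" class names, and the ProductScraper and scrape_products helpers are all made up for illustration; a real scraper would first download the page and would have to match whatever markup the target site actually uses.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this first.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">24.99</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">31.50</span></li>
</ul>
"""


class ProductScraper(HTMLParser):
    """Collects (name, price) rows from <span> tags with class name/price."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None   # field held by the <span> we are inside, if any
        self._current = {}   # fields gathered so far for the current <li>

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        # Text outside a tracked <span> (e.g. whitespace) is ignored.
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            # End of one product: emit a row and start fresh.
            row = (self._current["name"], float(self._current["price"]))
            self.rows.append(row)
            self._current = {}


def scrape_products(html):
    """Return a list of (name, price) tuples found in the given HTML."""
    parser = ProductScraper()
    parser.feed(html)
    return parser.rows


if __name__ == "__main__":
    for name, price in scrape_products(SAMPLE_HTML):
        print(name, price)
```

Note that the output is just rows of raw data, ready to drop into a spreadsheet. Nothing here analyzes anything.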

Data mining, on the other hand, is defined by Wikipedia as the "practice of automatically searching large data stores for patterns." In other words, you already have the data, and now you're analyzing it to learn useful things about it. Data mining often involves many complex algorithms based on statistical methods. It has nothing to do with how you got the data in the first place. In data mining you only care about analyzing what is already there.
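
For contrast, here is a toy sketch of the "analyzing" half: looking for a pattern in data you already have. It counts which pairs of items show up together most often in a handful of shopping baskets. The BASKETS data and the frequent_pairs helper are invented for illustration, and pair counting is only one very simple stand-in for the statistical methods real data mining tools use.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, already-collected data: each set is one shopping basket.
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
    {"bread", "butter", "milk"},
]


def frequent_pairs(baskets, min_support):
    """Return item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives each pair one canonical (a, b) ordering with a < b.
        counts.update(combinations(sorted(basket), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    for pair, n in frequent_pairs(BASKETS, min_support=2).items():
        print(pair, n)
```

Notice that the function never asks where the baskets came from. They could have been scraped, exported from a database, or typed in by hand.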

The difficulty is that people who don't know the term "screen scraping" will search Google for anything that sounds like it. We include some of these terms on our website to help such people; for example, we created pages titled Text Data Mining, Automated Data Collection, Website Data Extraction and even Website Ripper (I suppose "scrape" is something like "rip"). This leaves us with a small dilemma: we don't want to perpetuate a misconception (i.e. screen scraping = data mining), but we also have to use the terminology that people actually search for.
