I’m using PHP to scrape a website and collect some data. It’s all done without using regex. I’m using php’s explode() method to find particular HTML tags instead.
It is possible that if the structure of the website changes (CSS, HTML), then wrong data may be collected by the scraper. So the question is – how do I know if the HTML structure has changed? How to identify this before storing any data to my database to avoid wrong data being stored.
I think you don’t have any clean solutions if you are scraping a page where content changes.
I have developed several python scrapers and I know how can be frustrating when site just makes a subtle change on its layout.
You could try a solution a la mechanize (don’t know the php counterpart) and if you are lucky you could isolate the content you need to extract (links?).
Another possibile approach would be to code some constraints and check them before store to db.
For example, if you are scraping Urls, you will need to verify that what scraper has parsed is formally a valid Url; same for integer ID or whatever you want to scrape that can be recognized as valid.
If you are scraping plain text, it will be more difficult to check.
Depends on the site but you could count the number of page elements in the scraped page like div, class & style tags then by comparing these totals against those of later scrapes detect if the page structure has been changed.
A similiar process could be used for the CSS file where the names of each each class or id could be extracted using simple regex, stored and checked as needed. If this list has new additions then the page structure has almost certainly changed somewhere on the site being scraped.
If you want to know changes with respect to structure, I think the best way is to store the DOM structure of your first page and then compare it with new one.
There are lot of way you can do it:-
I have a small blog which will give some pointers to what I mean
or you can use http://en.wikipedia.org/wiki/Simple_API_for_XML or DOm Utility parser.
Speaking out of my ass here, but its possible you might want to look at some Document Object Model PHP methods.
If my very, very limited understanding of DOM is correct, a change in HTML site structure would change the Document Object Model, but a simple content change within a fixed structure wouldn’t. So, if you could capture the DOM state, and then compare it at each scrape, couldn’t you in theory determine that such a change has been made?
(By the way, the way I did this when I was trying to get an email notification when the bar exam results were posted on a particular page was just compare file_get_contents() values. Surprisingly, worked flawlessly: No false positives, and emailed me as soon as the site posted the content.)
First, in some cases you may want to compare hashes of the original to the new html. MD5 and SHA1 are two popular hashes. This may or may not be valid in all circumstances but is something you should be familiar with. This will tell you if something has changed – content, tags, or anything.
To understand if the structure has changed you would need to capture a histogram of the tag occurrences and then compare those. If you care about tags being out of order then you would have to capture a tree of the tags and do a comparison to see if the tags occur in the same order. This is going to be very specific to what you want to achieve.
PHP Simple HTML DOM Parser is a tool which will help you parse the HTML.
Explode() is not an HTML parser, but you want to know about changes in the HTML structure. That’s going to be tricky. Try using an HTML parser. Nothing else will be able to do this properly.