Hacker News

I've done a ton of scraping (mostly legal: on behalf of end users of an app on sites they have legit access to). This article misses something that affects several sites: JavaScript-driven content. Faking headers and even setting cookies doesn't get around this. It is, of course, easy to get around using something like PhantomJS or Selenium. Selenium is great because, unlike all the whiz-bang scraping techniques, you're driving a real browser and your requests look real (if you make 10000 requests to index.php and never pull down a single image, you might look a bit suspicious). There's a bit more overhead, but micro instances on EC2 can easily run 2 or 3 Selenium sessions at the same time, and at 0.3 cents per hour for spot instances, you can have 200-300 browsers going for 30-50 cents/hour.


Regarding the JavaScript problem, I'd suggest checking out the mobile versions of the sites first before you hop to a weighty solution like Selenium. Could be a very simple solution to the problem :)

I recently built a twitter bot that did some scraping and posting. I beat my head against a wall for a couple of hours trying to find a good tool to deal with all the JavaScript-driven stuff. I happened to get an update on my phone when it dawned on me that the mobile.twitter site is, for the most part, simple HTML. Once I realized that, I was able to programmatically log into my account with no problems, and the rest of twitter was unlocked for me. I could scrape and post to (almost) my heart's content.

However, there were a few very big problems, which make me feel that scraping is not the way to go about things. I certainly wouldn't build a service based around scraping a particular site's data.

When I had my twitter bot operational, I would get blocked from twitter for hours at a time. It seems anytime I hit their servers too hard, or crossed some threshold, I would be locked out. I'm assuming it was some kind of IP level ban, because I wasn't even able to access the site from an actual browser.

I was able to deal with the setback by setting up a script to repeatedly check its access to the site and relaunch the scraper once access returned, but the solution was just a band-aid. That would translate to significant downtime if I were running a service that counted on access to their data. The ban-hammer is too easily laid down.

Finally, just as a word of caution, I'd warn prospective scrapers to be careful of just who you scrape. I've inadvertently "DDoS'd" a site when a multiprocessed script got away from me. It spawned 1000+ instances of this particular request, all of which were doing their best to beat the bejesus out of this small website's servers. The site ended up going down for a couple of hours; I assume because of a bandwidth cap or something.
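
One way to keep a script like that from running away is to put a hard cap on how many requests can be in flight at once. A minimal sketch, assuming a caller-supplied fetch(url) function; the names fetch_all and max_workers are illustrative, not from the script described above:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Call fetch(url) for every URL, never running more than
    max_workers calls at the same time.

    `fetch` is a hypothetical callable supplied by the caller.
    Results come back in the same order as `urls`.
    """
    # The pool size is the hard ceiling on concurrent requests:
    # extra URLs wait in the queue instead of spawning more workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

With a fixed pool, a bug that queues 1000 URLs still only produces max_workers simultaneous requests, so the target sees a slow trickle rather than a flood.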

So, my point being, scraping is cool, but (1) I'm unsure if I agree with relying on it over a proper API, and (2) with great power comes great responsibility! Be nice to the smaller guys, and don't punish their servers too badly.


For my twitter bot I extracted the xauth keys from twitter's official Mac client(s) (Tweetie 1 and 2 have different keys) and used those to access the API. To twitter the bot looked like the official client and they couldn't ban it without banning their official clients.

And XAuth made account creation and log in a breeze as there was no need for OAuth tokens - username/password was enough.

But you're not always as lucky as that and many websites are heavily JS driven. For Reddit I had to resort to selenium.


Is Reddit's JSON API (just add .json to any URL) not good enough for you?
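
For anyone who hasn't used the ".json" trick, here is a rough sketch of building such a URL with only the standard library. The helper name to_json_url is made up for illustration, and the example only constructs the URL; it does not fetch anything or say which fields the response contains:

```python
from urllib.parse import urlsplit, urlunsplit


def to_json_url(url):
    """Return `url` with ".json" appended to its path.

    Hypothetical helper: keeps the query string and fragment,
    drops a trailing slash, and leaves URLs that already end
    in ".json" unchanged.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )
```

For example, to_json_url("https://www.reddit.com/r/programming/") gives "https://www.reddit.com/r/programming.json".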


Another proof that spending 15 minutes on research can save you days in development and production.


And your comment is another proof that people tend to assume everyone else is an idiot.


I think his comment was quite appropriate, and did not feel there was an implication that he thought anyone was dumb for not having known about the alternative approach to getting Reddit data.

Oftentimes programmers and the managers who drive them are way too quick to get going, building or solving something with brute force, when they could just be patient and stop for a moment. Spending even a mere 30 minutes extra doing your homework on a problem can save hours or days in dev time.


Saying that this was proof that spending 15 minutes researching Reddit's API would have saved him time implied that kybernetyk didn't do that research.

But kybernetyk already said he did the research before and that Reddit's API is not good enough for his requirement.

So this is not a case where 15 minutes of research would have saved time. And his comment meant he assumed kybernetyk didn't do the research, i.e. that he was dumb for not searching.


Let me tell you what I thought when I wrote it...

I did not assume that kybernetyk was dumb or anything. I simply chuckled and thought to myself: ouch, haven't I made a similar mistake before?! Please don't assume the worst when reading someone's comment.


Seems like a situation where the language would be insulting in one cultural context, but in another is merely a literal statement.


No, I didn't mean it that way. It's an all-too-common mistake to make.


Obviously not - since I would have used it if it was?


Well, what exactly was actually crucially-missing from the json one?


Reddit's API doesn't give you access to child comments past a certain number, so that could have been it.


My scraping is part of a transactional B2B service, not a high traffic social or B2C thing, so it's a different set of problems than those who want their hands on Twitter data. These are Fortune 100's, so if I can bring down their site, they have bigger problems. :-)


I wouldn't make that assumption. Do check return codes and load times, and back off if you see issues. If these sites are business partners/suppliers you have a lot to lose if things go wrong. It's worth developing your relationship with the business owners of the services you're touching in the corresponding organizations. And do set a User-Agent string that declares who you are and provides a link for information; if you are doing business with them, it should be on a basis of honesty.
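
A rough sketch of what that advice could look like in code. Everything named here is an assumption for illustration: the User-Agent string and its URL are placeholders, fetch(url, headers) is a hypothetical callable returning (status, body), and the thresholds are arbitrary:

```python
import time

# Placeholder identity: replace with your real name and info page.
USER_AGENT = "example-partner-bot/0.1 (+https://example.com/bot-info)"


def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=5):
    """Yield exponentially growing delays in seconds, capped at `cap`."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def polite_fetch(fetch, url, slow_after=5.0, sleep=time.sleep):
    """Fetch `url`, watching return codes and load times.

    `fetch(url, headers)` is a hypothetical callable that returns
    (status_code, body). Returns the body, or None on failure.
    """
    headers = {"User-Agent": USER_AGENT}
    for delay in backoff_delays():
        started = time.monotonic()
        status, body = fetch(url, headers)
        elapsed = time.monotonic() - started
        if status == 200:
            if elapsed > slow_after:
                # Slow answer: the server may be strained, so pause
                # before the caller sends its next request.
                sleep(delay)
            return body
        if status == 429 or status >= 500:
            # Throttled or server trouble: back off, then retry.
            sleep(delay)
            continue
        return None  # other errors: retrying will not help
    return None  # still failing after all attempts
```

Passing fetch and sleep in as parameters keeps the retry logic independent of any particular HTTP library and makes it easy to exercise without touching a real site.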


If you're using ruby, I've found watir (http://watir.com/) to be very nice to use. There might be better alternatives now but it made my life easier when I had to scrape a bunch of our supplier's crappy B2B sites that required JavaScript.


+1. Thanks for the hints about Selenium.

My 2c about scraping: when you try to obtain data from large websites, always go for the JavaScript content. Sites like Newegg or Amazon * may change their HTML structure very often, even without a single visible change for the end user, and even your smartest regex can have a brain fart. In contrast, even when a site gets a major overhaul, the old JavaScript will most likely be left in place with all its up-to-date variables, because the engineers will be wary of removing that code and breaking some functionality.

* given you have the rights to scrape.

Not that there are no tools to debug the site, but I've found that websites like the ones mentioned, plus YouTube and a bunch of others, just don't fiddle too much with their JS.
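
To illustrate going for the JavaScript content, here is a small sketch that pulls an object assigned to a JavaScript variable out of the raw page source. It rests on two loud assumptions: the value happens to be valid JSON (real JS object literals often are not), and the first "{" after the variable name starts the object. The function and variable names are invented for the example:

```python
import json


def extract_js_object(page_source, variable_name):
    """Return the JSON object assigned to `variable_name`, or None.

    Assumes the page contains something like
        var productData = {"sku": "A1", "price": 9.99};
    and that the assigned value is valid JSON.
    """
    start = page_source.find(variable_name)
    if start == -1:
        return None
    # Assumption: the first "{" after the name opens the object.
    brace = page_source.find("{", start + len(variable_name))
    if brace == -1:
        return None
    try:
        # raw_decode parses one JSON value starting at `brace`
        # and ignores whatever follows it (the ";" and the rest).
        value, _end = json.JSONDecoder().raw_decode(page_source, brace)
    except json.JSONDecodeError:
        return None
    return value
```

Because the lookup keys on the variable name rather than on surrounding markup, it does not care where in the page the script block ends up.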


> even your smartest regex can have a brain fart

If you're using regex to solve this sort of problem, your code deserves to break, I'm sorry.


I've found that regex is very brittle when you don't control what comes across. DOM traversal is far more reliable.
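
A small sketch of the idea, using only the standard library's html.parser rather than any of the libraries named in this thread. The class name "price" and both sample layouts are made up; the point is only that the lookup targets a local feature of one element and not the shape of the whole page:

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given CSS class,
    wherever that element sits in the document tree."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while inside a matching element
        self.chunks = []    # text pieces of the current match
        self.results = []   # finished matches

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.wanted_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self.chunks).strip())

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def texts_with_class(html, wanted_class):
    """Return the text of all elements that have `wanted_class`."""
    collector = ClassTextCollector(wanted_class)
    collector.feed(html)
    collector.close()
    return collector.results
```

The same call finds the value in '<div><span class="price">$10</span></div>' and in '<table><tr><td><b class="price big">$10</b></td></tr></table>', whereas a regex written against the first layout would need rewriting for the second.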


Agree... basically search methods that specify a branch or leaf locally rather than the entire tree structure can more often resist layout changes.

Regex for HTML is a bad idea ... http://stackoverflow.com/questions/590747/using-regular-expr...


Parsing arbitrary HTML is not the same as scraping a page for data -- that link isn't really that relevant.


Good point. I simply avoided regex for HTML for this reason and it wasn't justified (although a good choice).


This. You usually traverse the DOM. Either you use some XQuery/XPath magic or a library like Beautiful Soup.


Sizzle for life.


Not quite ready for prime time, but I am working on a project that makes it really easy to grab content from any site using a point-and-click interface: no XPaths, selectors, or regex.

You enter the URL you want to capture data from, it gets loaded in an iframe, you click on the text you need, and you set a schedule for updates and how to receive them (email/twitter DM). That's it.

It supports javascript driven content and can handle practically any website.

http://www.followwww.com


In my experience, you seldom need a full browser to extract data from JavaScript-heavy sites. You can often make your way with a little bit of reverse engineering, starting from a traffic capture and looking for the parameters you don't understand in the HTML/JS code. Usually, there is nothing hidden. When they are actually trying to make your life harder with JS, though, it is easily solved by feeding the offending algorithm to a JS interpreter (like python-spidermonkey).

Depending on your use case, headless may be simpler, but it also has many drawbacks that don't show at first, the main one being that headless browsers are not simple to drive from remote processes as queue-consuming devices.

The article suggests BeautifulSoup as a parsing library for Python. If I'm not mistaken, BeautifulSoup is not actively maintained anymore, and other cleaner and faster solutions exist, like lxml.html. Ian Bicking wrote a good article on that topic: http://blog.ianbicking.org/2008/03/30/python-html-parser-per...


BeautifulSoup is in fact still actively maintained. “The current release is Beautiful Soup 4.1.3 (August 20, 2012).”

http://www.crummy.com/software/BeautifulSoup/

I hear it recommended the most among Pythonistas, and it's plenty clean and fast for my use. But if you're skeptical, I'd still look for a more up to date benchmark (or run your own) rather than rely on results from >4 years ago.


Looks like things have changed since the last time I checked. Thank you for pointing this out. Next time I'll check my facts twice before posting.

Still, since lxml is basically a binding to libxml2, the performance comparison of the two libs should still hold. I heard lxml recommended too, in a Python talk about scraping 1 or 2 years ago (at most).

BeautifulSoup may still be better for parsing broken documents, though I never had problems with lxml while using it on a very large variety of sites.


You can use BeautifulSoup with lxml if you like, although I just use the HTMLParser in lxml these days and don't use BeautifulSoup any more. It seems to work a little better, at least for my uses.

http://lxml.de/elementsoup.html



