Web scraper click button
1/3/2024

When you think you have a crawl script ironed out, be sure to toss it at a tool like the EFF Panopticlick (and others) to look for obvious fingerprinting you may not have plugged.

You'd be amazed at how many big sites now start throwing you trash data that looks correct at a glance. Have some controls for measuring data quality over time.

Keep the UA string consistent per IP. Meaning, if you pull IP #1, go off to hit a site, and randomly grab a UA string to roll with it, make sure that the next time you show that IP you come with the same UA string.

If you only have around 100 IPs to work with, don't touch a target more than once per hour with the same IP to start, unless you plan to treat it like a "smash and grab".

Try to form a personal relationship with an IP provider. Yes, you can use the publicly available providers you mention above, and there are tons, but none of those will scale to the numbers I needed to hit in any reasonably economic way. Easier said than done, but ask around and exhaust friends in SEO and marketing fields.

Some more tips from my view, having done this for so long (without a single legal issue): the premise of crawling/scraping is not complex, but I can assure you that sustaining it for 7-8 years on end and returning data in a timely fashion to paying customers is not easy at all.

I'd say this is a very generalized statement that should be cleared up for future readers bumping into your post. Not being disrespectful, but saying something like that without some real meat could get a new person in trouble. But trust me: if you find yourself scraping 5-10 million jobs a day, it quickly becomes "not simple".

Well… is it correct to assume the internet is also prohibited in these countries? Please take into account that no search engine can work without web scraping. And I don't see the web without search these days.
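The per-IP advice above (show the same UA string every time an IP reappears, and touch a target at most once per hour per IP) can be sketched roughly as follows. This is a minimal illustration, not anyone's production setup: the `IdentityPool` class, its method names, and the one-hour default are all assumptions made for the example, and the actual HTTP request and proxy wiring are left out.

```python
import random
import time


class IdentityPool:
    """Hypothetical helper: sticky UA per IP plus a per-(IP, target) cooldown."""

    def __init__(self, ips, user_agents, min_interval=3600.0,
                 clock=time.monotonic, rng=None):
        self._ips = list(ips)
        self._user_agents = list(user_agents)
        self._min_interval = min_interval  # seconds between hits on one target per IP
        self._clock = clock                # injectable so the cooldown can be tested
        self._rng = rng or random.Random()
        self._ua_by_ip = {}                # ip -> UA string, chosen once and then reused
        self._last_hit = {}                # (ip, target) -> time of the last request

    def user_agent_for(self, ip):
        # Pick a random UA the first time an IP is used, then always return the same one.
        if ip not in self._ua_by_ip:
            self._ua_by_ip[ip] = self._rng.choice(self._user_agents)
        return self._ua_by_ip[ip]

    def acquire(self, target):
        """Return (ip, user_agent) for an IP that is off cooldown for this target.

        Returns None when every IP has touched the target within min_interval.
        """
        now = self._clock()
        for ip in self._ips:
            last = self._last_hit.get((ip, target))
            if last is None or now - last >= self._min_interval:
                self._last_hit[(ip, target)] = now
                return ip, self.user_agent_for(ip)
        return None
```

A caller would ask `pool.acquire("example.com")` before each request and skip or delay the request when it returns `None`. Tracking the cooldown per (IP, target) pair rather than per IP lets the same address stay busy on other sites while it rests on one.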