So the Python Scrapy library adheres to robots.txt directives, but what can you do when you also want it to skip links marked rel="nofollow"?
The solution is elusive but easy: a CrawlSpider Rule accepts a process_links callback that fires after a response is parsed and before the extracted links are sent to the queue. It receives the list of extracted links and returns a list of the same kind, so you can filter it however you like.
Items in the list look like this:
Link(url='http://www.sciautonics.com/compare-quadcopter-drones/', text='Compare Quadcopter Drones', fragment='', nofollow=False)
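If you want to inspect these Link objects yourself, a quick way (assuming you're in scrapy shell with a fetched response in scope) is to run a link extractor by hand:

from scrapy.linkextractors import LinkExtractor

# Extract all links from the current response and show which ones
# carry the rel="nofollow" attribute
for link in LinkExtractor().extract_links(response):
    print(link.url, link.nofollow)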
And this is how we tap into that process and eliminate the links marked "nofollow" so we don't crawl them:
rules = (
    # Extract all pages, follow links, call method 'parse_page' for the response
    # callback, and run the extracted links through 'links_processor' first
    Rule(LinkExtractor(), callback='parse_page', follow=True, process_links='links_processor'),
)

def links_processor(self, links):
    # A hook into link processing, done so we don't follow "nofollow" links
    ret_links = list()
    for link in links:
        if not link.nofollow: ret_links.append(link)
    return ret_links
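To see it in context, here's a minimal self-contained spider sketch. The spider name, domain, start URL, and the body of parse_page are placeholders for illustration, not part of the original recipe:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class NofollowAwareSpider(CrawlSpider):
    name = 'nofollow_aware'            # placeholder name
    allowed_domains = ['example.com']  # placeholder domain
    start_urls = ['http://example.com/']

    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True,
             process_links='links_processor'),
    )

    def links_processor(self, links):
        # Keep only the links that are not marked rel="nofollow"
        return [link for link in links if not link.nofollow]

    def parse_page(self, response):
        # Placeholder callback: just record each visited URL
        yield {'url': response.url}

Note that process_links can be given either as a string (the name of a spider method, as above) or as a callable, and a list comprehension does the same filtering job as the explicit loop.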
It took me a few hours to get there; hopefully this helps someone in the future.