Getting Scrapy (Python) to not follow rel="nofollow" links

June 25, 2015 Itamar Gero

The Python Scrapy library adheres to robots.txt directives, but what can you do when you also want it to not follow a “nofollow” link?

The solution is elusive but easy: there’s a callback that runs after the response is done and before the found links are sent to the queue. It gets a list of links and returns a list of links.

Items in the list look like this:


Link(url='http://www.sciautonics.com/compare-quadcopter-drones/', text='Compare Quadcopter Drones', fragment='', nofollow=False)

And this is how we tap into the process and eliminate the ones with “nofollow” so we don’t process them.


[python]
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# Inside the CrawlSpider subclass:
rules = (
    # Extract all pages, follow links, call method 'parse_page' for the response callback;
    # before processing links, call method 'links_processor'
    Rule(LinkExtractor(allow=('', '/')), follow=True, callback='parse_page', process_links='links_processor'),
)
[/python]


[python]
def links_processor(self, links):
    # A hook into the links processing from an existing page,
    # done in order to not follow "nofollow" links
    ret_links = list()
    if links:
        for link in links:
            if not link.nofollow:
                ret_links.append(link)
    return ret_links
[/python]
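
To try the filtering logic on its own, here is a minimal sketch that runs without Scrapy. The `Link` namedtuple is only a stand-in I made up to mimic the fields shown above (it is not Scrapy's own class), and the example.com URLs are placeholders.

```python
from collections import namedtuple

# Stand-in for the link objects Scrapy passes to process_links (an assumption
# for illustration only; it just mirrors the fields shown in the post).
Link = namedtuple("Link", ["url", "text", "fragment", "nofollow"])


def links_processor(links):
    """Return only the links that are not marked rel="nofollow"."""
    return [link for link in (links or []) if not link.nofollow]


# Placeholder links: the second one is flagged as nofollow.
links = [
    Link("http://example.com/a", "A", "", False),
    Link("http://example.com/b", "B", "", True),
    Link("http://example.com/c", "C", "", False),
]

kept = links_processor(links)
print([link.url for link in kept])
```

In the spider itself the same function would be a method taking `self`, as in the block above; the list comprehension is just a shorter way of writing the same loop.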

It took me a few hours to get there; hopefully this helps someone in the future.