
excel – Problems to extract links with webscraping

Posted by: admin May 14, 2020

Questions:

I want to extract the links of the toys listed in this webpage:
https://cebra.com.ar/category/73/Juego-de-Construccion.html

I have a complete procedure (I am not copying it here because it is very long and complicated), part of which contains the following code, which doesn’t work:

 Cells(erow, 1) = html.getElementsByTagName("a").href

Any ideas on how to solve this?

Thanks a lot!

How to & Answers:

getElementsByTagName returns a collection and indeed you would need to index into it to get a particular element.

However, you don’t want all a tags. Retrieving them all is inefficient, and you would need an additional test to limit the results to those of interest. You specifically want the product links, so use a CSS attribute selector with the starts-with operator (^=) to get only those:

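Dim links As Object, i As Long
' Select every element whose href attribute starts with "product"
Set links = html.querySelectorAll("[href^=product]")

' Write one link per row, starting at row erow
For i = 0 To links.Length - 1
    ActiveSheet.Cells(erow + i, 1).Value = links.item(i).href
Next i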

This:

[href^=product]

matches elements whose href attribute value starts with (^=) the string product.

If you look at the page HTML, you can see that each of your target links begins with that substring.


Answer:

The getElementsByTagName() method of the HTMLDocument object returns a collection, but you’re trying to read the .href property on that collection as if it were a single element.

You should replace this:

Cells(erow, 1) = html.getElementsByTagName("a").href

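with this:

Cells(erow, 1) = html.getElementsByTagName("a")(yourIndex).href

… where yourIndex is a number giving the zero-based position of the element in the collection (0, 1, …, Length - 1). Note that VBA indexes a collection with parentheses, not square brackets.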

Of course, you’ll have to find the correct rule to pick out the right a elements, since getting every element in the document with tag a retrieves 278 elements on your page (including the page headers, footers and other things I don’t really think you need).
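As a minimal sketch of one such rule, assuming html is an HTMLDocument that has already been loaded with the page (as in the question), the hypothetical helper below loops over every a element by index and keeps only those whose href contains product. The name WriteProductLinks and its parameters are illustrative and not part of the original procedure, and the substring test may need tightening if other links on the page also contain product.

```vba
' Hypothetical helper (not from the original procedure).
' Assumes html is an already loaded HTMLDocument.
' Writes the href of each matching link to column 1, starting at startRow.
Sub WriteProductLinks(ByVal html As Object, ByVal startRow As Long)
    Dim anchors As Object
    Dim i As Long
    Dim r As Long
    Dim hrefValue As String

    Set anchors = html.getElementsByTagName("a")
    r = startRow

    ' The collection is zero-based, so the last valid index is Length - 1
    For i = 0 To anchors.Length - 1
        hrefValue = anchors(i).href

        ' Assumed rule: keep only links whose href contains "product"
        If InStr(1, hrefValue, "product", vbTextCompare) > 0 Then
            ActiveSheet.Cells(r, 1).Value = hrefValue
            r = r + 1
        End If
    Next i
End Sub
```

From the existing procedure it could be called as WriteProductLinks html, erow. A separate row counter is used so that skipped links do not leave blank rows in the sheet.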
