XPath Cheat Sheet for Web Scraping

This Cheat Sheet covers only the basics of how to use XPath to locate elements from the HTML markup

Published

Saturday, July 30, 2022

#XPath#code#features#web-scraping

About this Cheat Sheet

👉 This Cheat Sheet covers only the basics of how to use XPath to locate elements in HTML markup.

👉 All the XPath expressions I'm going to cover in this Cheat Sheet will be applied to the HTML page below.

HTML web page

<!DOCTYPE html>
<html lang="en">
   <head>
      <title>XPath and CSS Selectors</title>
   </head>
   <body>
      <h1>XPath expressions simplified</h1>
      <div class="intro">
         <p>
            I'm paragraph within a div with a class set to
            intro
            <span id="location">I'm a span with ID set to
            location and i'm within a paragraph</span>
         </p>
         <p id="outside">I'm a paragraph with ID set to
            outside and i'm within a div with a class set to intro
         </p>
      </div>
      <p>Hi i'm placed immediately after a div with a class
         set to intro
      </p>
      <span class='intro'>Div with a class attribute set to
      intro
      </span>
      <ul id="items">
         <li data-identifier="7">Item 1</li>
         <li>Item 2</li>
         <li>Item 3</li>
         <li>Item 4</li>
      </ul>
      <a href="https://www.google.com">Google</a>
      <a href="http://www.google.fr">Google France</a>
      <p class='bold italic'>Hi, I have two classes</p>
      <p class='bold'>Hi i'm bold</p>
   </body>
</html>

BASICS

An element is a tag in the HTML markup.

Example:

The 'p' tag, aka paragraph, is called an element. To select any element from an HTML web page we simply use the following syntax:

//elementName

Example:

To select all p elements we can use the following XPath selector

//p

Although this approach works perfectly fine, it's not recommended, because it matches every "p" element on the page. If, for example, we only want the "p" elements that are inside the first div with a class attribute equal to "intro", a bare //p selects too much. This is why we always prefer to target elements by their class attribute, id, or position, so we can limit the scope of the XPath expression.
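To see the difference in scope, here is a minimal sketch using Python's built-in xml.etree.ElementTree. It assumes a trimmed, well-formed copy of the sample page (real-world HTML usually needs a proper HTML parser), and note that ElementTree only supports a subset of XPath and expects relative paths, so //p is written as .//p.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the sample page (assumption: only the <p> elements matter here).
PAGE = """
<html>
  <body>
    <div class="intro">
      <p>I'm a paragraph within a div</p>
      <p id="outside">I'm a paragraph with ID set to outside</p>
    </div>
    <p>Hi, I'm placed immediately after the div</p>
    <p class="bold italic">Hi, I have two classes</p>
    <p class="bold">Hi, I'm bold</p>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# ElementTree paths are relative to the element they are called on,
# so the XPath //p becomes .//p here.
all_paragraphs = root.findall(".//p")

# Scoped version: only the <p> children of the div whose class is "intro".
intro_paragraphs = root.findall(".//div[@class='intro']/p")

print(len(all_paragraphs))    # 5
print(len(intro_paragraphs))  # 2
```

The unscoped expression returns every paragraph on the page, while the scoped one returns only the two inside the div.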

CLASS & ID

To select any element by the value of one of its attributes, such as class or id, we use the following syntax:

//elementName[@attributeName='value']

Example:

Let's say we want to select the "p" elements inside the "div" with a class attribute equal to "intro". In this case we use the following XPath expression:

//div[@class='intro']/p

If we want to select the "p" element with "id" equal to "outside" we can use the following XPath expression:

//p[@id='outside']

REMEMBER:

_Please note that the same class attribute value can be assigned to more than one element, whereas an id should be assigned to one and only one element._

Sometimes we also want to select elements based on a custom attribute that isn't part of the standard HTML attributes. For example, to select the "li" element with the attribute "data-identifier" equal to 7 we use the following XPath expression:

//li[@data-identifier="7"]
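As a quick check, the attribute predicates can be tried out with Python's built-in xml.etree.ElementTree. This is only a sketch on a trimmed, well-formed fragment of the sample page (the shortened texts are my own), and the paths start with a dot because ElementTree expects relative paths.

```python
import xml.etree.ElementTree as ET

# Trimmed fragment of the sample page (assumption: shortened texts for readability).
PAGE = """
<body>
  <div class="intro">
    <p>First paragraph</p>
    <p id="outside">Second paragraph</p>
  </div>
  <ul id="items">
    <li data-identifier="7">Item 1</li>
    <li>Item 2</li>
  </ul>
</body>
"""

root = ET.fromstring(PAGE)

# Select by id: equivalent to //p[@id='outside']
by_id = root.find(".//p[@id='outside']")

# Select by a custom attribute: equivalent to //li[@data-identifier='7']
by_data = root.find(".//li[@data-identifier='7']")

print(by_id.text)    # Second paragraph
print(by_data.text)  # Item 1
```

The same [@attributeName='value'] predicate works for any attribute, standard or not.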

Sometimes the element we want to select has two classes. For example, to select the "p" element with a class attribute equal to "bold italic" we use the following XPath expression:

//p[@class='bold italic']

This compares the whole attribute value, so the classes must appear exactly as written in the markup.

OR:

Because the element has two classes, we can instead search for a substring within the class attribute value by using the contains function:

//p[contains(@class, 'italic')]

REMEMBER:

The contains function takes two arguments:

The first one is where to search, whether in the class attribute value, the id, or anything else. The second argument is the value you're looking for. The value you search for is case-sensitive, so be careful!
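The sketch below shows why an exact match on class can surprise you. It uses Python's built-in xml.etree.ElementTree, which supports only a subset of XPath and has no contains() function, so the substring check is done in plain Python as a stand-in. With a full XPath 1.0 engine you would use the contains expression directly.

```python
import xml.etree.ElementTree as ET

# The two paragraphs from the end of the sample page.
PAGE = """
<body>
  <p class="bold italic">Hi, I have two classes</p>
  <p class="bold">Hi, I'm bold</p>
</body>
"""

root = ET.fromstring(PAGE)

# @class='bold' compares the whole attribute string,
# so the "bold italic" paragraph is NOT matched.
exact_bold = root.findall(".//p[@class='bold']")

# Stand-in for contains(@class, 'italic'): a case-sensitive substring test in Python.
with_italic = [p for p in root.findall(".//p") if "italic" in p.get("class", "")]

print([p.text for p in exact_bold])   # ["Hi, I'm bold"]
print([p.text for p in with_italic])  # ['Hi, I have two classes']
```

Keep in mind that a substring test also matches longer class names that happen to contain the searched text.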

Value lookup

Let's say you want to select all the "a" elements whose "href" attribute value starts with "https" and not "http". In this case, we can use the following XPath expression:

//a[starts-with(@href, 'https')]

So to search for text at the beginning of a value we use the "starts-with" function, which takes the same arguments as the contains function.

Now if you want to search for a value at the end, there is the "ends-with" function. However, this function is not supported in XPath version 1.0, which is the version implemented by the majority of browsers and by lxml.
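One common way around the missing ends-with is to select a broader set with XPath and then filter the results in your own code. The sketch below does that with Python's built-in xml.etree.ElementTree and ordinary string methods. It is a stand-in technique, not an XPath feature. (In pure XPath 1.0, the usual idiom is substring(@href, string-length(@href) - string-length('.fr') + 1) = '.fr'.)

```python
import xml.etree.ElementTree as ET

# The two links from the sample page.
PAGE = """
<body>
  <a href="https://www.google.com">Google</a>
  <a href="http://www.google.fr">Google France</a>
</body>
"""

root = ET.fromstring(PAGE)

# Select every <a> element that has an href attribute.
links = root.findall(".//a[@href]")

# Filter in Python instead of relying on ends-with / starts-with support.
fr_links = [a for a in links if a.get("href").endswith(".fr")]
https_links = [a for a in links if a.get("href").startswith("https")]

print([a.text for a in fr_links])     # ['Google France']
print([a.text for a in https_links])  # ['Google']
```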

Finally, if we want to search for a particular value in between we use the contains function as explained before.

If you want to get the text of a particular element you can use text(). For example, to get the text of the "p" element with id equal to "outside" we use the following XPath expression:

//p[@id="outside"]/text()

The position

Okay, let's say you want to get the second "li" element from the "ul" element with "id" equal to "items". In this case, you can use the following XPath expression:

//ul[@id="items"]/li[2]

However, if you want to select the second list item but also want to make sure that its text is "Item 2", you can use the following XPath expression:

//ul[@id="items"]/li[position() = 2 and text() = "Item 2"]

Notice that in this case I used the position() function, text(), and the "and" logical operator.

Besides the "and" logical operator, we also have the "or" logical operator.

REMEMBER:

In XPath everything we write within [] is known as a predicate.
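Here is a small sketch of positional predicates using Python's built-in xml.etree.ElementTree. ElementTree does not support position() or the "and" operator, so as an assumption-laden substitute the sketch chains two predicates, li[2][.='Item 2'], which for this markup selects the same element as the combined predicate above.

```python
import xml.etree.ElementTree as ET

# The list from the sample page, wrapped in <body> so the <ul> can be searched for.
PAGE = """
<body>
  <ul id="items">
    <li data-identifier="7">Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
  </ul>
</body>
"""

root = ET.fromstring(PAGE)

# Positions are 1-based: li[2] is the second <li>.
second = root.find(".//ul[@id='items']/li[2]")

# Chained predicates: second <li> AND its text content equals "Item 2".
checked = root.findall(".//ul[@id='items']/li[2][.='Item 2']")

# Same position, but a text that does not match: nothing is selected.
mismatch = root.findall(".//ul[@id='items']/li[2][.='Item 3']")

print(second.text)    # Item 2
print(len(checked))   # 1
print(len(mismatch))  # 0
```

Note that XPath positions start at 1, not 0 as in most programming languages.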

XPath axes

In XPath, an axis is used to search for an element based on its relationship with another element. There are several axes we can use to navigate up and down the HTML markup.

All axes in XPath use the following syntax:

axisName::elementName

XPath axes (GOING UP)

The parent

The parent axis is used to get the parent of a specific element. For example, to get the parent of the "p" element with id equal to "outside" we use the following XPath expression:

//p[@id="outside"]/parent::node()

The node() test in XPath matches any node, no matter what its type is.

The ancestor

The ancestor axis is used to get all the ancestors of a specific element. For example, to get the ancestors (parent, grandparent, ...) of the "p" element with id equal to "outside" we use the following XPath expression:

//p[@id="outside"]/ancestor::node()
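To make "going up" concrete, here is a sketch using Python's built-in xml.etree.ElementTree. It supports ".." (the XPath abbreviation for parent::node()) but has no ancestor axis, so the ancestors are collected by walking a child-to-parent map. That map is my own helper for illustration, not something XPath provides.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample page around the "outside" paragraph.
PAGE = """
<html>
  <body>
    <div class="intro">
      <p id="outside">I'm a paragraph with ID set to outside</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# ".." is the abbreviated form of parent::node().
parent = root.find(".//p[@id='outside']/..")
print(parent.tag, parent.get("class"))  # div intro

# Stand-in for the ancestor axis: map every child to its parent, then walk up.
parent_of = {child: node for node in root.iter() for child in node}

ancestors = []
current = root.find(".//p[@id='outside']")
while current in parent_of:
    current = parent_of[current]
    ancestors.append(current.tag)

print(ancestors)  # ['div', 'body', 'html'] (nearest ancestor first)
```

A full XPath engine also counts the document root node as an ancestor when you use ancestor::node().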

The preceding

In XPath, the preceding axis will get all the elements that precede an element excluding its ancestors.

Preceding sibling

In XPath, the preceding-sibling axis gets all the siblings that precede an element. In other words, it returns the elements that share the same parent and come before a specific element.

XPath axes (GOING DOWN)

To go down the HTML markup we also have four axes:

The child axis, which gets the children of a specific element.

The following axis, which returns all the elements that come after the closing tag of a specific element.

The following-sibling axis, which returns the elements that come after the closing tag of an element and share the same parent.

The descendant axis, which gets the descendants (children, grandchildren, ...) of a particular element.
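The sketch below illustrates three of these relationships with Python's built-in xml.etree.ElementTree. It has no named axes, so ./* stands in for child::*, .//* for descendant::*, and the following siblings are computed from the parent's list of children. The fragment is a trimmed version of the sample page with shortened texts.

```python
import xml.etree.ElementTree as ET

# Trimmed fragment of the sample page (assumption: shortened texts).
PAGE = """
<body>
  <div class="intro">
    <p>First <span id="location">nested span</span></p>
    <p id="outside">Second</p>
  </div>
  <p>After the div</p>
  <ul id="items"><li>Item 1</li></ul>
</body>
"""

root = ET.fromstring(PAGE)  # root is the <body> element
div = root.find("./div[@class='intro']")

# child::* : direct children only.
children = [el.tag for el in div.findall("./*")]

# descendant::* : children, grandchildren, and so on, in document order.
descendants = [el.tag for el in div.findall(".//*")]

# following-sibling::* : elements after the div that share its parent.
siblings = list(root)
following_siblings = [el.tag for el in siblings[siblings.index(div) + 1:]]

print(children)            # ['p', 'p']
print(descendants)         # ['p', 'span', 'p']
print(following_siblings)  # ['p', 'ul']
```

The span shows up only in the descendants list because it is a grandchild of the div, not a direct child.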

Thank you.