XPath Cheat Sheet for Web Scraping
This Cheat Sheet covers only the basics of how to use XPath to locate elements from the HTML markup
Published
Saturday, July 30, 2022
About this Cheat Sheet
This cheat sheet covers only the basics of how to use XPath to locate elements in HTML markup.
All the XPath expressions I'm going to cover in this cheat sheet are applied to the HTML page below.
HTML web page
<!DOCTYPE html>
<html lang="en">
<head>
<title>XPath and CSS Selectors</title>
</head>
<body>
<h1>XPath expressions simplified</h1>
<div class="intro">
<p>
I'm paragraph within a div with a class set to
intro
<span id="location">I'm a span with ID set to
location and i'm within a paragraph</span>
</p>
<p id="outside">I'm a paragraph with ID set to
outside and i'm within a div with a class set to intro
</p>
</div>
<p>Hi i'm placed immediately after a div with a class
set to intro
</p>
<span class='intro'>Div with a class attribute set to
intro
</span>
<ul id="items">
<li data-identifier="7">Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
<li>Item 4</li>
</ul>
<a href="https://www.google.com">Google</a>
<a href="http://www.google.fr">Google France</a>
<p class='bold italic'>Hi, I have two classes</p>
<p class='bold'>Hi i'm bold</p>
</body>
</html>
BASICS
An element is a tag in the HTML markup.
Example:
The 'p' tag, aka paragraph, is an element. To select any element from an HTML web page we simply use the following syntax:
//elementName
Example:
To select all p elements we can use the following XPath selector
//p
Although this approach works perfectly fine, it's not recommended, because it matches every 'p' element on the page. If, for example, we only want the 'p' elements inside the div with a class attribute equal to 'intro', //p is too broad. This is why we prefer to target elements by their class attribute, id, or position, so we can limit the scope of the XPath expression.
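To make the difference in scope concrete, here is a minimal sketch in Python. It assumes the lxml library is installed, and the PAGE string is a trimmed-down stand-in for the sample page rather than the full markup.

```python
from lxml import html

# Trimmed-down stand-in for the sample page (illustrative only).
PAGE = """
<html><body>
  <div class="intro">
    <p>first</p>
    <p id="outside">second</p>
  </div>
  <p>third</p>
</body></html>
"""

doc = html.fromstring(PAGE)

# //p matches every paragraph on the page.
all_paragraphs = doc.xpath("//p")

# Scoping by the parent's class keeps only the paragraphs inside the div.
scoped_paragraphs = doc.xpath("//div[@class='intro']/p")

print(len(all_paragraphs))     # 3
print(len(scoped_paragraphs))  # 2
```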
CLASS & ID
So to select any element by its class attribute value we use the following syntax:
//elementName[@attributeName='value']
Example:
Let's say we want to select the 'p' elements inside the 'div' with a class attribute equal to 'intro'. In this case we use the following XPath expression:
//div[@class='intro']/p
If we want to select the 'p' element with 'id' equal to 'outside' we can use the following XPath expression:
//p[@id='outside']
REMEMBER:
_Please note that the same class attribute value can be assigned to more than one element, whereas an id should be assigned to one and only one element._
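The class versus id distinction can be checked with a short sketch, again assuming Python with lxml installed and a reduced copy of the sample page. The `*` wildcard used here is standard XPath for "any element name"; it is not part of the expressions above.

```python
from lxml import html

# Reduced copy of the sample page (illustrative only).
PAGE = """
<html><body>
  <div class="intro">
    <p>inside the div</p>
    <p id="outside">also inside the div</p>
  </div>
  <span class="intro">a span sharing the same class</span>
</body></html>
"""

doc = html.fromstring(PAGE)

# A class value can be shared, so this matches both the div and the span.
shared_class = [el.tag for el in doc.xpath("//*[@class='intro']")]

# An id is meant to be unique, so this matches a single element.
by_id = doc.xpath("//p[@id='outside']")

print(shared_class)   # ['div', 'span']
print(len(by_id))     # 1
print(by_id[0].text)  # also inside the div
```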
Sometimes we also want to select elements based on a custom attribute that is not one of the predefined HTML attributes, such as a data-* attribute. For example, to select the 'li' element whose 'data-identifier' attribute equals 7, we use the following XPath expression:
//li[@data-identifier='7']
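As a quick check, the sketch below (Python with lxml, assumed installed) runs that expression against a shortened version of the list. The second query, which only tests that the attribute exists, is an extra illustration and not something covered above.

```python
from lxml import html

# Shortened version of the sample list (illustrative only).
PAGE = """
<html><body>
  <ul id="items">
    <li data-identifier="7">Item 1</li>
    <li>Item 2</li>
  </ul>
</body></html>
"""

doc = html.fromstring(PAGE)

# Match on the attribute value.
by_value = doc.xpath("//li[@data-identifier='7']")

# Match on the mere presence of the attribute, whatever its value.
by_presence = doc.xpath("//li[@data-identifier]")

print([li.text for li in by_value])        # ['Item 1']
print(by_value[0].get("data-identifier"))  # 7
print(len(by_presence))                    # 1
```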
Sometimes the element we want to select has two classes. For example, to select the 'p' element with a class attribute equal to 'bold italic', we use the following XPath expression, which compares against the whole attribute value:
//p[@class='bold italic']
OR:
Because the class attribute value is just a string, we can also search for a substring within it by using the contains function.
//p[contains(@class, 'italic')]
REMEMBER:
The contains function takes two arguments:
The first one is where to search, whether in the class attribute value, the id, or anything else. The second argument is the value you're looking for. The value you search for is case-sensitive, so be careful!
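The difference between an exact match and contains, and the case sensitivity mentioned above, can be seen in this small sketch (Python with lxml, assumed installed; the two paragraphs are copied from the sample page).

```python
from lxml import html

# The two classed paragraphs from the sample page.
PAGE = """
<html><body>
  <p class="bold italic">Hi, I have two classes</p>
  <p class="bold">Hi i'm bold</p>
</body></html>
"""

doc = html.fromstring(PAGE)

# Exact comparison against the whole attribute value.
exact = doc.xpath("//p[@class='bold']")

# Substring search inside the attribute value.
partial = doc.xpath("//p[contains(@class, 'bold')]")

# contains() is case-sensitive, so this finds nothing.
wrong_case = doc.xpath("//p[contains(@class, 'BOLD')]")

print(len(exact))       # 1
print(len(partial))     # 2
print(len(wrong_case))  # 0
```

Keep in mind that contains does a plain substring search, so 'bold' would also match a class such as 'bolder'.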
Value lookup
Let's say you want to select all the 'a' elements whose 'href' attribute value starts with 'https' and not 'http'. In this case, we can use the following XPath expression:
//a[starts-with(@href, 'https')]
So to search for text at the beginning of a value we use the 'starts-with' function, which takes the same arguments as the contains function.
Now if you want to search for a value at the end, there is the 'ends-with' function. However, it was introduced in XPath 2.0 and is not supported in XPath 1.0, which is the version used by the majority of browsers and by lxml.
Finally, if we want to search for a particular value in between we use the contains function as explained before.
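The three lookups can be sketched as follows with Python and lxml (assumed installed). Since ends-with is unavailable in XPath 1.0, the middle query emulates it with substring() and string-length(); that workaround is a commonly used idiom and an addition of this sketch, not something from the text above.

```python
from lxml import html

# The two links from the sample page.
PAGE = """
<html><body>
  <a href="https://www.google.com">Google</a>
  <a href="http://www.google.fr">Google France</a>
</body></html>
"""

doc = html.fromstring(PAGE)

# Value at the beginning: starts-with() is available in XPath 1.0.
secure = doc.xpath("//a[starts-with(@href, 'https')]")

# Value at the end: compare the last characters of the attribute,
# which is what ends-with() would do in XPath 2.0.
ends_with_fr = doc.xpath(
    "//a[substring(@href, string-length(@href) - string-length('.fr') + 1) = '.fr']"
)

# Value anywhere in between: contains().
has_google = doc.xpath("//a[contains(@href, 'google')]")

print([a.text for a in secure])        # ['Google']
print([a.text for a in ends_with_fr])  # ['Google France']
print(len(has_google))                 # 2
```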
If you want to get the text of a particular element you can use the text() node test. For example, to get the text of the 'p' element with id equal to 'outside' we use the following XPath expression:
//p[@id='outside']/text()
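Here is a minimal sketch of text() in Python with lxml (assumed installed), using shortened paragraph text. It also shows a detail worth knowing: text() returns only the text that sits directly inside the element, not the text of nested elements such as the span.

```python
from lxml import html

# Shortened version of the intro div (illustrative text).
PAGE = """
<html><body>
  <div class="intro">
    <p>Text before the span <span id="location">text inside the span</span></p>
    <p id="outside">I'm a paragraph with ID set to outside</p>
  </div>
</body></html>
"""

doc = html.fromstring(PAGE)

# lxml returns text() results as a list of strings.
outside_text = doc.xpath("//p[@id='outside']/text()")

# Only the paragraph's own text is returned; the span's text is skipped.
first_p_text = [t.strip() for t in doc.xpath("//div[@class='intro']/p[1]/text()")]

print(outside_text)  # ["I'm a paragraph with ID set to outside"]
print(first_p_text)  # ['Text before the span']
```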
The position
Okay, let's say you want to get the second 'li' element from the 'ul' element with 'id' equal to 'items'. In this case, you can use the following XPath expression:
//ul[@id='items']/li[2]
However, if you want to select the second list item but also want to make sure that its text is 'Item 2', you can use the following XPath expression:
//ul[@id='items']/li[position() = 2 and text() = 'Item 2']
Notice that in this case I used the position() function, the text() node test, and the 'and' logical operator.
In contrast to the 'and' logical operator, we also have the 'or' logical operator.
REMEMBER:
In XPath everything we write within [] is known as a predicate.
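The positional predicates and both logical operators can be tried out with this sketch (Python with lxml, assumed installed). The 'or' query is an example made up for illustration.

```python
from lxml import html

# The list from the sample page.
PAGE = """
<html><body>
  <ul id="items">
    <li data-identifier="7">Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
  </ul>
</body></html>
"""

doc = html.fromstring(PAGE)

# Positions are 1-based: li[2] is the second list item.
second = doc.xpath("//ul[@id='items']/li[2]")

# 'and': both conditions must hold.
both = doc.xpath("//ul[@id='items']/li[position() = 2 and text() = 'Item 2']")
mismatch = doc.xpath("//ul[@id='items']/li[position() = 2 and text() = 'Item 3']")

# 'or': either condition is enough.
either = doc.xpath("//ul[@id='items']/li[position() = 1 or text() = 'Item 4']")

print([li.text for li in second])  # ['Item 2']
print([li.text for li in both])    # ['Item 2']
print(mismatch)                    # []
print([li.text for li in either])  # ['Item 1', 'Item 4']
```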
XPath axes
In XPath an axis is used to search for an element based on its relationship with another element. Several axes let us navigate up and down the HTML markup.
All axes in XPath use the following syntax:
axisName::nodeTest
XPath axes (GOING UP)
The parent
The parent axis is used to get the parent of a specific element. For example, to get the parent of the 'p' element with id equal to 'outside' we use the following XPath expression:
//p[@id="outside"]/parent::node()
node() in XPath is a node test that matches any node, no matter what its type is (element, text, comment, ...).
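A short sketch of the parent axis in Python with lxml (assumed installed) follows. Besides the expression above, it shows two equivalent spellings that are standard XPath but not covered in this cheat sheet: naming the expected parent tag, and the `..` abbreviation.

```python
from lxml import html

# Reduced copy of the intro div (illustrative text).
PAGE = """
<html><body>
  <div class="intro">
    <p id="outside">second paragraph</p>
  </div>
</body></html>
"""

doc = html.fromstring(PAGE)

# The parent, whatever kind of node it is.
parent = doc.xpath("//p[@id='outside']/parent::node()")

# The parent, but only if it is a div.
named_parent = doc.xpath("//p[@id='outside']/parent::div")

# '..' is the abbreviated form of parent::node().
shorthand = doc.xpath("//p[@id='outside']/..")

print(parent[0].tag, parent[0].get("class"))  # div intro
print([el.tag for el in named_parent])        # ['div']
print([el.tag for el in shorthand])           # ['div']
```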
The ancestor
The ancestor axis can be used to get all the ancestors of a specific element. For example, to get the ancestors (parent, grandparent, ...) of the 'p' element with id equal to 'outside' we use the following XPath expression:
//p[@id="outside"]/ancestor::node()
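The sketch below tries the ancestor axis in Python with lxml (assumed installed). It uses `*` instead of node() as the node test so that only element ancestors are returned; that substitution is a choice made for this example. The tags are sorted before printing so the output does not depend on the order in which the library returns them.

```python
from lxml import html

# Reduced copy of the intro div (illustrative text).
PAGE = """
<html><body>
  <div class="intro">
    <p id="outside">second paragraph</p>
  </div>
</body></html>
"""

doc = html.fromstring(PAGE)

# Every element that contains the paragraph, up to the root.
ancestors = doc.xpath("//p[@id='outside']/ancestor::*")

# Only the ancestors that are div elements.
div_ancestors = doc.xpath("//p[@id='outside']/ancestor::div")

print(sorted(el.tag for el in ancestors))  # ['body', 'div', 'html']
print(div_ancestors[0].get("class"))       # intro
```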
The preceding
In XPath, the preceding axis will get all the elements that precede an element excluding its ancestors.
Preceding sibling
In XPath the preceding-sibling axis will get all the siblings that precede an element. In other words, it returns the brothers that sit above a specific element and share the same parent with it.
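To compare the two axes side by side, here is a sketch in Python with lxml (assumed installed), run on a reduced copy of the sample page. The node test `*` limits the results to elements, and the tags are sorted before printing so the output does not depend on the order in which the library returns them.

```python
from lxml import html

# Reduced copy of the sample page, including the head.
PAGE = """
<html>
<head><title>XPath and CSS Selectors</title></head>
<body>
  <h1>XPath expressions simplified</h1>
  <div class="intro">
    <p>first <span id="location">span</span></p>
    <p id="outside">second</p>
  </div>
</body>
</html>
"""

doc = html.fromstring(PAGE)

# Everything that closes before the paragraph opens; the ancestors
# (div, body, html) are excluded.
preceding = doc.xpath("//p[@id='outside']/preceding::*")

# Only the earlier elements that share the paragraph's parent.
preceding_siblings = doc.xpath("//p[@id='outside']/preceding-sibling::*")

print(sorted(el.tag for el in preceding))           # ['h1', 'head', 'p', 'span', 'title']
print(sorted(el.tag for el in preceding_siblings))  # ['p']
```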
XPath axes (GOING DOWN)
To go down in the HTML markup we also have four axes:
The child axis, which gets the children of a specific element.
The following axis, which returns all the elements that come after the closing tag of a specific element.
The following-sibling axis, which returns the elements that come after the closing tag of an element and share the same parent with it.
The descendant axis, which gets the descendants (children, grandchildren, ...) of a particular element.
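The four axes above can be compared with this sketch in Python with lxml (assumed installed). Each query starts from the div with class 'intro' in a reduced copy of the sample page; the `tags` helper is a made-up convenience function for this example.

```python
from lxml import html

# Reduced copy of the sample page (illustrative text).
PAGE = """
<html><body>
  <div class="intro">
    <p>first <span id="location">span</span></p>
    <p id="outside">second</p>
  </div>
  <p>after the div</p>
  <ul id="items"><li>Item 1</li></ul>
</body></html>
"""

doc = html.fromstring(PAGE)


def tags(axis):
    """Return the tag names selected by the given axis, starting at the intro div."""
    return [el.tag for el in doc.xpath(f"//div[@class='intro']/{axis}::*")]


print(tags("child"))              # ['p', 'p']
print(tags("descendant"))         # ['p', 'span', 'p']
print(tags("following-sibling"))  # ['p', 'ul']
print(tags("following"))          # ['p', 'ul', 'li']
```

Note how following also reaches the 'li' nested inside the list, while following-sibling stops at the elements that share the div's parent.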
Thank you.