Find all anchor tags in a page with PHP and the Simple HTML DOM Parser
Posted November 16th, 2009 in PHP
This post shows how to download a web page and find all the anchor (link) tags in it using PHP and the Simple HTML DOM Parser, which has a jQuery-like selector syntax.
PHP Simple HTML DOM Parser
The PHP Simple HTML DOM Parser makes it easy to find particular elements within an HTML page in a similar way to jQuery. It can be downloaded from http://simplehtmldom.sourceforge.net/ where there are also several examples.
Finding the <a> tags from a web page
First, include the Simple HTML DOM Parser using include, require, include_once or require_once:
require_once('/path/to/simple_html_dom.php');
Then load the web page into the DOM using either the file_get_html() or str_get_html() helper function. The argument passed to file_get_html() can be either the URL of a web page or the name of a local file; str_get_html() takes a string of HTML instead.
// Load from a URL or a local file...
$dom = file_get_html('http://www.google.com/');
// ...or parse HTML that is already in a string
$dom = str_get_html('... some html string ...');
Now call find() on the DOM for 'a' tags, as in the following example, which echoes each non-empty "href" attribute followed by a line break:
foreach($dom->find('a') as $a) {
    if($a->href) {
        echo $a->href . "\n";
    }
}
Using www.google.com as an example, the above would output this:
http://images.google.co.nz/imghp?hl=en&tab=wi
http://maps.google.co.nz/maps?hl=en&tab=wl
http://news.google.co.nz/nwshp?hl=en&tab=wn
http://groups.google.co.nz/grphp?hl=en&tab=wg
http://books.google.co.nz/bkshp?hl=en&tab=wp
http://mail.google.com/mail/?hl=en&tab=wm
http://www.google.co.nz/intl/en/options/
http://scholar.google.co.nz/schhp?hl=en&tab=ws
http://blogsearch.google.co.nz/?hl=en&tab=wb
http://translate.google.co.nz/?hl=en&tab=wT
http://www.youtube.com/?hl=en&tab=w1&gl=NZ
http://www.google.com/calendar/render?hl=en&tab=wc
http://docs.google.com/?hl=en&tab=wo
http://www.google.co.nz/reader/view/?hl=en&tab=wy
http://sites.google.com/?hl=en&tab=w3
http://www.google.co.nz/intl/en/options/
/url?sa=p&pref=ig&pval=3&q=http://www.google.co.nz/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNGi5EQv2pmx9Kd5MdCX46heegpxAw
/preferences?hl=en
https://www.google.com/accounts/Login?hl=en&continue=http://www.google.co.nz/
/advanced_search?hl=en
/language_tools?hl=en
http://www.google.co.nz/setprefs?sig=0_Va9MAZW7LCKUpGRFXj4-Xh78Tkc=&hl=mi
/intl/en/ads/
/services/
/intl/en/about.html
http://www.google.com/ncr
/intl/en/privacy.html
Notice that these are the hrefs as they appear in the HTML source, so some are relative to the current document/domain and some are absolute containing a full http:// path.
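Because the list mixes both forms (and repeats http://www.google.co.nz/intl/en/options/), it can help to de-duplicate the hrefs and separate the absolute ones from the relative ones before doing anything else with them. The following is only a minimal sketch and is not part of the Simple HTML DOM Parser: the function names is_absolute_href() and split_hrefs() are hypothetical, and the $hrefs array is a hand-copied sample of the output above, standing in for the values you would collect from $dom->find('a').

```php
<?php
// Hypothetical helper: true when the href starts with a scheme such as "http:".
// Scheme-less forms like "/path" or "//host/path" count as relative here.
function is_absolute_href($href)
{
    return preg_match('#^[a-z][a-z0-9+.-]*:#i', $href) === 1;
}

// Hypothetical helper: de-duplicate the hrefs and group them by type.
function split_hrefs(array $hrefs)
{
    $result = array('absolute' => array(), 'relative' => array());
    foreach (array_unique($hrefs) as $href) {
        $type = is_absolute_href($href) ? 'absolute' : 'relative';
        $result[$type][] = $href;
    }
    return $result;
}

// Sample values copied from the Google output (one of them twice).
$hrefs = array(
    'http://maps.google.co.nz/maps?hl=en&tab=wl',
    'http://www.google.co.nz/intl/en/options/',
    'http://www.google.co.nz/intl/en/options/',
    '/preferences?hl=en',
    '/intl/en/privacy.html',
);

$groups = split_hrefs($hrefs);
echo count($groups['absolute']) . " absolute, "
    . count($groups['relative']) . " relative\n";
// prints "2 absolute, 2 relative"
```

The scheme check is a plain regular expression rather than parse_url(), so an href such as /url?...&q=http://... is still treated as relative even though it contains "http://" in its query string.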
Resolving the paths
I've posted how to resolve the paths to full http:// URLs using the url_to_absolute library from Nadeau Software Consulting in my earlier post titled "Extract images from a web page with PHP and the Simple HTML DOM Parser".
I will write a standalone post about how to do this later this week. It will also deal with a slight issue with the returned URLs: by default they are partially encoded using rawurlencode(), which is not really ideal. That post will show the modification needed to resolve this, along with some additional examples.
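In the meantime, here is a rough sketch of the simplest case only. It is not the url_to_absolute library and does not try to replace it: the hypothetical function resolve_root_relative() handles only hrefs that begin with a single "/" (like /preferences?hl=en in the output above) by prefixing the scheme and host of a base URL. The base URL http://www.google.co.nz/ is an assumption based on the domain that appears in the sample output.

```php
<?php
// Hypothetical helper, NOT the url_to_absolute library.
// Resolves only root-relative hrefs ("/path") against a base URL.
// Returns null for anything it does not handle.
function resolve_root_relative($base, $href)
{
    // Already absolute: starts with a scheme such as "http:" or "https:".
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href) === 1) {
        return $href;
    }

    // "page.html", "../x" and "//host/path" need a full resolver.
    if ($href === '' || $href[0] !== '/' || substr($href, 0, 2) === '//') {
        return null;
    }

    // Build "scheme://host[:port]" from the base URL.
    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    $origin = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    return $origin . $href;
}

echo resolve_root_relative('http://www.google.co.nz/', '/preferences?hl=en') . "\n";
// prints "http://www.google.co.nz/preferences?hl=en"
```

Returning null for the unhandled forms is a deliberate choice in this sketch, so that a caller can tell "could not resolve" apart from a resolved URL instead of silently getting a wrong one.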
Related posts:
- Resolving relative URLs to absolute in PHP (Thursday, November 26th 2009)
- Extract domain, path etc from a full url with PHP (Monday, September 28th 2009)
- Extract images from a web page with PHP and the Simple HTML DOM Parser (Monday, September 14th 2009)
- Get meta tags from an HTML file with PHP (Thursday, September 10th 2009)
- Change the user agent string in PHP (Thursday, October 2nd 2008)

