Find all anchor tags in a page with PHP and the Simple HTML DOM Parser
Posted November 16th, 2009 in PHP
This post shows how to download a web page and find all the anchor (link) tags in it using PHP and the Simple HTML DOM Parser, which has a jQuery-like selector syntax.
PHP Simple HTML DOM Parser
The PHP Simple HTML DOM Parser makes it easy to find particular elements within an HTML page in a similar way to jQuery. It can be downloaded from http://simplehtmldom.sourceforge.net/ where there are also several examples.
Finding the <a> tags from a web page
First of all include the Simple HTML DOM Parser using either include, require, include_once or require_once:
require_once('/path/to/simple_html_dom.php');
And then load the webpage into the DOM using either the file_get_html() or str_get_html() helper functions. The filename passed to file_get_html() can either be the URL to the web page or the filename of a local file. str_get_html() takes a string instead of a filename.
// Load from a URL or a local file...
$dom = file_get_html('http://www.google.com/');
// ...or parse HTML that is already in a string
$dom = str_get_html('... some html string ...');
Now call find() on the DOM object with the 'a' selector. The following example echoes the "href" attribute of each anchor, one per line, skipping any anchor that has no href:
foreach($dom->find('a') as $a) {
    if($a->href) {
        echo $a->href . "\n";
    }
}
Using www.google.com as an example, the above would output this:
http://images.google.co.nz/imghp?hl=en&tab=wi
http://maps.google.co.nz/maps?hl=en&tab=wl
http://news.google.co.nz/nwshp?hl=en&tab=wn
http://groups.google.co.nz/grphp?hl=en&tab=wg
http://books.google.co.nz/bkshp?hl=en&tab=wp
http://mail.google.com/mail/?hl=en&tab=wm
http://www.google.co.nz/intl/en/options/
http://scholar.google.co.nz/schhp?hl=en&tab=ws
http://blogsearch.google.co.nz/?hl=en&tab=wb
http://translate.google.co.nz/?hl=en&tab=wT
http://www.youtube.com/?hl=en&tab=w1&gl=NZ
http://www.google.com/calendar/render?hl=en&tab=wc
http://docs.google.com/?hl=en&tab=wo
http://www.google.co.nz/reader/view/?hl=en&tab=wy
http://sites.google.com/?hl=en&tab=w3
http://www.google.co.nz/intl/en/options/
/url?sa=p&pref=ig&pval=3&q=http://www.google.co.nz/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNGi5EQv2pmx9Kd5MdCX46heegpxAw
/preferences?hl=en
https://www.google.com/accounts/Login?hl=en&continue=http://www.google.co.nz/
/advanced_search?hl=en
/language_tools?hl=en
http://www.google.co.nz/setprefs?sig=0_Va9MAZW7LCKUpGRFXj4-Xh78Tkc=&hl=mi
/intl/en/ads/
/services/
/intl/en/about.html
http://www.google.com/ncr
/intl/en/privacy.html
Notice that these are the hrefs as they appear in the HTML source, so some are relative to the current document/domain and some are absolute containing a full http:// path.
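To tell the two kinds of href apart in code, one option is PHP's built-in parse_url(). The following is only a rough sketch: is_absolute_url() is a hypothetical helper name made up for this example (it is not part of the Simple HTML DOM Parser), and it treats an href as absolute only when it has both a scheme and a host.

```php
<?php
// Hypothetical helper for illustration; not part of Simple HTML DOM.
// An href counts as absolute here only if parse_url() finds both a
// scheme (e.g. http) and a host (e.g. www.google.com) in it.
function is_absolute_url(string $href): bool
{
    $parts = parse_url($href);
    // parse_url() returns false for seriously malformed URLs.
    return is_array($parts) && isset($parts['scheme'], $parts['host']);
}

// A few hrefs as they appear in the HTML source of the example page.
$hrefs = [
    'http://maps.google.co.nz/maps?hl=en&tab=wl',
    '/preferences?hl=en',
    '/intl/en/about.html',
    'http://www.google.com/ncr',
];

foreach ($hrefs as $href) {
    echo (is_absolute_url($href) ? 'absolute' : 'relative') . ': ' . $href . "\n";
}
```

Anything reported as relative still needs to be combined with the address of the page it came from before it can be fetched.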
Resolving the paths
I've posted how to resolve the paths to full http:// URLs, using the url_to_absolute library from Nadeau Software Consulting, in my earlier post titled "Extract images from a web page with PHP and the Simple HTML DOM Parser".
I will write a standalone post about how to do this later this week. It will also deal with a slight issue with the URLs returned: by default they are partially encoded using rawurlencode(), which is not really ideal. That post will show the modification needed to resolve this, along with some additional examples.
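To give a rough idea of what resolving involves, here is a deliberately simplified sketch using only built-in PHP functions. It is not the url_to_absolute library and is no substitute for it: resolve_href() is a hypothetical name made up for this example, and it assumes the base URL has a scheme and a host. It does not normalise "./" or "../" segments, and it does not handle hrefs that consist only of a query string or a fragment.

```php
<?php
// Hypothetical helper for illustration only; NOT the url_to_absolute
// library. Assumes $base is a full URL with a scheme and a host.
function resolve_href(string $base, string $href): string
{
    // Already absolute: the href has its own scheme (http:, https:, ...).
    if (preg_match('#^[a-z][a-z0-9+.\-]*:#i', $href)) {
        return $href;
    }

    $b = parse_url($base);
    $root = $b['scheme'] . '://' . $b['host']
          . (isset($b['port']) ? ':' . $b['port'] : '');

    // Protocol-relative (//host/path): reuse the scheme of the base URL.
    if (strpos($href, '//') === 0) {
        return $b['scheme'] . ':' . $href;
    }

    // Root-relative (/path): append to the scheme and host of the base.
    if (strpos($href, '/') === 0) {
        return $root . $href;
    }

    // Document-relative (path): append to the directory of the base path.
    $path = $b['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}

echo resolve_href('http://www.google.com/', '/preferences?hl=en') . "\n";
echo resolve_href('http://www.google.com/', 'http://www.google.com/ncr') . "\n";
```

In the anchor loop, the same call could be applied to each $a->href, with the URL that was passed to file_get_html() as the base.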
Related posts:
- Resolving relative URLs to absolute in PHP (Thursday, November 26th 2009)
- Extract domain, path etc from a full url with PHP (Monday, September 28th 2009)
- Extract images from a web page with PHP and the Simple HTML DOM Parser (Monday, September 14th 2009)
- Get meta tags from an HTML file with PHP (Thursday, September 10th 2009)
- Change the user agent string in PHP (Thursday, October 2nd 2008)
