Extract the first paragraph text from a web page with PHPExtract the first paragraph text from a web page with PHP

Posted February 8th, 2010 in PHP

This post looks at how to extract the first paragraph from an HTML page using PHP's strpos and substr functions to find the location of the first <p> and </p> tags and get the content between them.

Using strpos and substr

Assuming the content to extract the paragraph from is in the variable $html (which may have come from a file, database, template or downloaded from an external website), use the following code to work out the position of the first <p> tag, the first </p> tag after that tag, and then get all the HTML between them including the opening and closing tags:

$start = strpos($html, '<p>');
$end = strpos($html, '</p>', $start);
$paragraph = substr($html, $start, $end-$start+4);

Line 1 gets the position of the first opening <p> tag

Line 2 gets the position of the first </p> after the first opening <p>

Line 3 then uses substr to get the HTML. The third parameter is the number of characters to copy and is calculated by subtracting $start from $end and adding on the length of "</p>" so it is included in the extracted HTML.

Converting to plain text

If the extracted paragraph needs to be in plain text rather than HTML, use the following to remove the HTML tags and convert HTML entities into normal plain text:

$paragraph = html_entity_decode(strip_tags($paragraph));

Related posts:

Share or Bookmark

Share or Bookmark this page using the following services. You will need to have an account with the selected service in order to post links or bookmark this page.

Subscribe or Follow

Subscribe via RSS or email, or follow me on Facebook or Twitter below. The RSS icon takes you through to Feedburner where you can select the service or application to use.

Comments

blog comments powered by Disqus