PHP: get keywords from search engine referer url - version 2


I posted how to get the keywords from a search engine referer URL a while back. Following a number of useful comments on that post, I have completely revised the function, which now also supports the query string when it is sent as part of a URL fragment. I've also added more information about how it works and about the HTTP_REFERER string.

A note about HTTP_REFERER

You cannot ever guarantee that the $_SERVER['HTTP_REFERER'] variable is passed along by the browser and is available to your PHP script.

There are a variety of reasons why it may not be set, such as browser configuration settings, local proxy software that blocks it, clicking a link that moves you from HTTPS to HTTP, right-click and opening in a new browser tab/window, etc.

Having said that, it is available some of the time and you can then capture this information using PHP's parse_url and parse_str functions.
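As a minimal sketch of the defensive check this implies, the referer can be read with a fallback so later code never touches an undefined index. The helper name get_referer is hypothetical and is not part of the function presented in this post; it takes the server array as a parameter only so that both cases can be shown.

```php
<?php
// Hypothetical helper: returns the referer, or '' when the browser did not send one.
function get_referer(array $server) {
    return isset($server['HTTP_REFERER']) ? $server['HTTP_REFERER'] : '';
}

// In a real page you would pass $_SERVER; plain arrays are used here for illustration.
var_dump(get_referer(array()));   // string(0) ""
echo get_referer(array('HTTP_REFERER' => 'http://www.bing.com/search?q=php')), "\n";
```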

Referer URL examples

Here are some examples that will be used with the example function. The first two are from Google, one with a regular query string and the second where it's passed as a #fragment. The other two are from Bing and Yahoo.

http://www.google.co.nz/url?sa=t&source=web&cd=3&sqi=2&ved=0CCkQFjAC&url=http%3A%2F%2Fwww.electrictoolbox.com%2Fusing-settimeout-javascript%2F&rct=j&q=javascript%20settimeout&ei=IijsTIzYAYLCcfeB2fYO&usg=AFQjCNFJ5Fn8pm2lVcZCt46Jn6A7v_S4TQ
http://www.google.com/#hl=en&biw=1440&bih=688&q=javascript+settimeout&aq=f&aqi=g10&aql=&oq=&gs_rfai=&fp=1b219014ca3fb4b2
http://www.bing.com/search?q=javascript+date+to+timestamp&src=IE-SearchBox&FORM=IE8SRC
http://us.yhs.search.yahoo.com/avg/search?fr=yhs-avg-chrome&type=yahoo_avg_hs2-tb-web_chrome_us&p=concatenation+in+mysql

You can see from looking at the URLs that Bing and Google pass the keywords in the "q" variable and Yahoo uses "p".

The code

Here's the PHP code to extract the keywords entered from the above examples. I will explain it on a line by line basis below.

function search_engine_query_string($url = false) {

    if(!$url && !$url = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : false) {
        return '';
    }

    $parts_url = parse_url($url);
    $query = isset($parts_url['query']) ? $parts_url['query'] : (isset($parts_url['fragment']) ? $parts_url['fragment'] : '');
    if(!$query) {
        return '';
    }
    parse_str($query, $parts_query);
    return isset($parts_query['q']) ? $parts_query['q'] : (isset($parts_query['p']) ? $parts_query['p'] : '');

}

How it works

1. Optionally passing in a url, or getting it from HTTP_REFERER

The full url is optionally passed to the function. If it does not contain a value the first few lines get it from $_SERVER['HTTP_REFERER'] as shown below. At this stage if nothing is available it returns an empty string.

    if(!$url && !$url = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : false) {
        return '';
    }

My original post took a few extra lines to do the above, so I have consolidated it here into fewer lines. One commenter suggested doing the return as part of the assignment (e.g. like this: $url = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : return '';) but that results in a parse error, because return is a statement and cannot be used as an expression inside the ternary operator.
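If the condensed condition is hard to read, the same logic can be spelled out in separate steps. This is only an illustrative expansion, not the code used in the function; the name resolve_url is hypothetical.

```php
<?php
// Hypothetical helper showing the condensed check written out step by step.
function resolve_url($url = false) {
    // No URL passed in: fall back to the referer if the browser sent one.
    if (!$url) {
        $url = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : false;
    }
    // Still nothing available: return an empty string.
    return $url ? $url : '';
}

echo resolve_url('http://www.bing.com/search?q=php'), "\n";
```

The behaviour should match the two-in-one condition above: an explicit URL wins, then the referer, then an empty string.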

2. Using parse_url to get the parts from the URL

The next line of code uses the parse_url function. This extracts the various parts of the URL into an associative array, which is assigned to the $parts_url variable.

    $parts_url = parse_url($url);

If print_r($parts_url) was done using the first URL in my examples it would output this:

Array
(
    [scheme] => http
    [host] => www.google.co.nz
    [path] => /url
    [query] => sa=t&source=web&cd=3&sqi=2&ved=0CCkQFjAC&url=http%3A%2F%2Fwww.electrictoolbox.com%2Fusing-settimeout-javascript%2F&rct=j&q=javascript%20settimeout&ei=IijsTIzYAYLCcfeB2fYO&usg=AFQjCNFJ5Fn8pm2lVcZCt46Jn6A7v_S4TQ
)

You can see the array item we want to use is the "query" one. In the case of the 2nd URL example which has Google sending through the query in the #fragment in the URL it would look like this:

Array
(
    [scheme] => http
    [host] => www.google.com
    [path] => /
    [fragment] => hl=en&biw=1440&bih=688&q=javascript+settimeout&aq=f&aqi=g10&aql=&oq=&gs_rfai=&fp=1b219014ca3fb4b2
)
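To see both cases side by side, here is a small sketch using shortened versions of the example URLs (the shortened URLs are my own, for illustration only). It shows that parse_url only includes the keys that are actually present in the URL, which is why the isset checks in the next step are needed.

```php
<?php
// Shortened, illustrative URLs: one with a query string, one with a #fragment.
$with_query    = parse_url('http://www.bing.com/search?q=javascript+date+to+timestamp');
$with_fragment = parse_url('http://www.google.com/#hl=en&q=javascript+settimeout');

var_dump(isset($with_query['query']));      // bool(true)
var_dump(isset($with_query['fragment']));   // bool(false)
var_dump(isset($with_fragment['query']));   // bool(false)
echo $with_fragment['fragment'], "\n";      // hl=en&q=javascript+settimeout
```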

3. Getting the query string or fragment

Because the query string can effectively be in either the [query] or [fragment] the next line of code works out which one it is in and assigns it to the $query variable:

    $query = isset($parts_url['query']) ? $parts_url['query'] : (isset($parts_url['fragment']) ? $parts_url['fragment'] : '');

If $query is empty at this stage then the URL contains neither a query string nor a fragment, so return an empty string:

    if(!$query) {
        return '';
    }

4. Use parse_str to explode the query string

The next line uses parse_str to explode the query string into an associative array and store it in the $parts_query array:

    parse_str($query, $parts_query);

Using the Google example again, doing print_r($parts_query) would output this:

Array
(
    [sa] => t
    [source] => web
    [cd] => 3
    [sqi] => 2
    [ved] => 0CCkQFjAC
    [url] => http://www.electrictoolbox.com/using-settimeout-javascript/
    [rct] => j
    [q] => javascript settimeout
    [ei] => IijsTIzYAYLCcfeB2fYO
    [usg] => AFQjCNFJ5Fn8pm2lVcZCt46Jn6A7v_S4TQ
)

You can see that the search terms entered at Google are in the [q] element of the array.
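Note that parse_str also URL-decodes the values, which is why the first Google example (with %20) and the second (with +) both end up with a plain space. A quick sketch using cut-down query strings of my own:

```php
<?php
// parse_str URL-decodes values: both %20 and + become a space.
parse_str('q=javascript%20settimeout', $a);
parse_str('q=javascript+settimeout', $b);

var_dump($a['q'] === $b['q']);   // bool(true)
echo $a['q'], "\n";              // javascript settimeout
```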

5. Return the search engine query

The final line in the function checks [q] and then [p] in the $parts_query array and returns whichever is set, or an empty string if neither is set. You can easily add additional isset and value clauses if a different search engine sends through the query in a different variable.

    return isset($parts_query['q']) ? $parts_query['q'] : (isset($parts_query['p']) ? $parts_query['p'] : '');
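If the list of variable names grows, nesting more ternaries gets hard to read. One possible alternative, sketched here under my own assumptions, is to loop over a list of names. The helper first_query_param is hypothetical, and the 'query' name is a made-up extra entry used only to show how the list could be extended; only 'q' and 'p' come from the examples in this post.

```php
<?php
// Hypothetical helper: return the first parameter from $names that is set in $params.
function first_query_param(array $params, array $names) {
    foreach ($names as $name) {
        if (isset($params[$name])) {
            return $params[$name];
        }
    }
    return '';
}

// 'q' and 'p' are from the examples above; 'query' is an invented extra name.
$names = array('q', 'p', 'query');

parse_str('fr=yhs-avg-chrome&p=concatenation+in+mysql', $parts_query);
echo first_query_param($parts_query, $names), "\n";   // concatenation in mysql
```

The order of $names matters in the same way as the order of the ternaries: the first name that is set wins.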

Example output

Using the examples at the top of this post, here's some example output:

echo search_engine_query_string('http://www.google.co.nz/url?sa=t&source=web&cd=3&sqi=2&ved=0CCkQFjAC&url=http%3A%2F%2Fwww.electrictoolbox.com%2Fusing-settimeout-javascript%2F&rct=j&q=javascript%20settimeout&ei=IijsTIzYAYLCcfeB2fYO&usg=AFQjCNFJ5Fn8pm2lVcZCt46Jn6A7v_S4TQ');

Result: javascript settimeout

echo search_engine_query_string('http://www.google.com/#hl=en&biw=1440&bih=688&q=javascript+settimeout&aq=f&aqi=g10&aql=&oq=&gs_rfai=&fp=1b219014ca3fb4b2');

Result: javascript settimeout

echo search_engine_query_string('http://www.bing.com/search?q=javascript+date+to+timestamp&src=IE-SearchBox&FORM=IE8SRC');

Result: javascript date to timestamp

echo search_engine_query_string('http://us.yhs.search.yahoo.com/avg/search?fr=yhs-avg-chrome&type=yahoo_avg_hs2-tb-web_chrome_us&p=concatenation+in+mysql');

Result: concatenation in mysql

A note on $_SERVER['QUERY_STRING']

There is also a $_SERVER['QUERY_STRING'] variable, but it holds the query string of the page currently being requested, not that of the referring URL. It therefore cannot stand in for parse_url on the referer, and it would not let us check for a #fragment either.


