How to extract URLs from Delicious: Web Scraping or Data Extraction Techniques using PHP

There are numerous techniques for extracting data from the web, and one of the more powerful tools available is Webharvest. It's a Java- and XML-based extraction system: you write XML configuration files that say exactly what you want to extract and how. But sometimes you just wish these people had written more documentation. I'll probably write about it some other time in some other post. For now, I'll discuss how I often extract data from search engines and sites like Delicious using PHP.

In the example here, I extract the link and title of each bookmark on Delicious tagged with some keyword, say, all the links tagged with “PHP”. I started this project in an attempt to pick random (interesting) links from Delicious, Google, etc. and then display their content after removing the HTML tags.

I’ll go step by step here:

Step 1: This function fetches the complete Delicious page of links tagged with some keyword.


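function getDelicious($keyword, $page_num){

$string1 = "";

// $odpurl must hold the Delicious URL for $keyword and $page_num.
// The line that built it is missing from this listing, so set it here.
//$odpurl = urlencode($odpurl);

// file() opens the URL by itself, so no separate fopen() handle is needed.
$lines = @file($odpurl);
if ($lines === false) { return ""; }
$string1 = join("", $lines);

// Hand the raw HTML to the parser from Step 2.
$result = parseDelicious($string1);

if (($result == "") || ($result == NULL)) { return ""; } else { return $result; }

}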


In the function above, I fetch the complete Delicious page and store it in $string1. This way I can pull the pages tagged with any keyword, say PHP, Digg, Bush, or Humour. Now I can move on to extracting the URLs from the fetched page.
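Since the fetch depends on a URL built from the keyword and page number, here is a minimal sketch of that step. The helper names buildTagUrl and fetchPage are hypothetical, and the example.com template is only a placeholder, not the real Delicious URL scheme. fetchPage also assumes that allow_url_fopen is enabled.

```php
<?php
// Hypothetical helper: fills a URL template with an encoded tag and page number.
// The template passed in below is a placeholder, not the real Delicious scheme.
function buildTagUrl($template, $keyword, $page_num) {
    return strtr($template, array(
        '{tag}'  => rawurlencode($keyword),      // "php tips" becomes "php%20tips"
        '{page}' => (string) (int) $page_num,    // force a plain integer
    ));
}

// Hypothetical helper: returns the page HTML, or "" if the request fails.
// Needs allow_url_fopen to be enabled for http:// URLs.
function fetchPage($url) {
    $html = @file_get_contents($url);
    return ($html === false) ? "" : $html;
}

$url = buildTagUrl('http://example.com/tag/{tag}?page={page}', 'php tips', 2);
echo $url, "\n"; // prints http://example.com/tag/php%20tips?page=2
```

Encoding the keyword before it goes into the URL keeps tags with spaces or punctuation from breaking the request.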


Step 2: This function extracts the URLs from the fetched Delicious page.


function parseDelicious($string1){

$listUrl = "";

// Strings that mark internal Delicious URLs and advertisements.
// The three original values are missing from this listing; add them here.
$skip = array();

// Convert the HTML page (fetched from the web) into an array by splitting it on spaces.
$ArrayText = explode(" ", $string1);
$total = count($ArrayText);

// Iterate through each word in the HTML page.
for($i=0; $i<$total; $i++){

// Ignore words containing a skip string and consider only ones starting with href="http://
$ignore = false;
foreach ($skip as $marker) { if (strpos($ArrayText[$i], $marker) !== false) { $ignore = true; } }

if (!$ignore && strpos($ArrayText[$i], "href=\"http://") === 0) {

// Drop the 6 characters of href=" and cut at the closing quote.
$piece = substr($ArrayText[$i], 6);
$piece1 = substr($piece, 0, strpos($piece, "\""));
// $piece1 now has the URL extracted. Now, we shall extract the title also.

$url_title = "";
$end_found = false;
$j = 1;
while (!$end_found && ($i+$j) < $total){
$end_found = (strpos($ArrayText[$i+$j], "</a>") !== false);

if ($j == 1){
// Skip the 15 characters before the title (the length of rel="nofollow">, for example).
$url_title_rel = substr($ArrayText[$i+1], 15);
}else {
$url_title_rel = $ArrayText[$i+$j];
}
if ($end_found){
$url_title .= substr($url_title_rel, 0, strpos($url_title_rel, "<"));
}else {
$url_title .= $url_title_rel." ";
}
$j++;
}

//$url_title now has the title of the URL

$entry = $piece1."----".$url_title."<br> ";
echo $entry; // printing extracted URL and title from delicious
$listUrl .= $entry;

}
}

return $listUrl;

}



In the function above, I explode (or simply, split) the fetched Delicious HTML page on spaces and store the pieces in an array. Once that's done, I can move through each word to find the relevant information, which in this case is the URL and the title.
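To make the splitting concrete, here is a small sketch that runs the same idea on one line of markup. The sample HTML is made up for illustration and is not real Delicious output. As a small deviation from the function above, it finds the start of the title by looking for ">" instead of using a fixed offset of 15.

```php
<?php
// Made-up sample markup shaped like one bookmark link (an assumption).
$html = '<h4><a href="http://www.php.net/" rel="nofollow">PHP: Hypertext Preprocessor</a></h4>';

$tokens = explode(" ", $html);
// $tokens now holds:
// [0] <h4><a
// [1] href="http://www.php.net/"
// [2] rel="nofollow">PHP:
// [3] Hypertext
// [4] Preprocessor</a></h4>

$url = "";
$title = "";
foreach ($tokens as $i => $token) {
    if (strpos($token, 'href="http://') !== 0) { continue; }

    // Drop the 6 characters of href=" and cut at the closing quote.
    $rest = substr($token, 6);
    $url = substr($rest, 0, strpos($rest, '"'));

    // Collect the following tokens until the one that contains </a>.
    $words = array();
    for ($j = $i + 1; $j < count($tokens); $j++) {
        $word = $tokens[$j];
        if ($j == $i + 1) {
            // The first token still carries the tail of the tag; keep what follows ">".
            $gt = strpos($word, '>');
            if ($gt !== false) { $word = substr($word, $gt + 1); }
        }
        $end = strpos($word, '</a>');
        if ($end !== false) { $words[] = substr($word, 0, $end); break; }
        $words[] = $word;
    }
    $title = implode(" ", $words);
    break; // one link is enough for the demonstration
}

echo $url, " ---- ", $title, "\n";
// prints http://www.php.net/ ---- PHP: Hypertext Preprocessor
```

Splitting on spaces works because the href attribute and the title words land in separate array entries, but it relies on the page keeping exactly this layout.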

Step 3: Thus, if I want to extract all links from delicious that are tagged with PHP, I can simply iterate as follows:


$i = 1;

while (true){

$cont = getDelicious('php', $i);

// Stop at the first page that yields no links.
if ($cont == "") { break; }

// Move on to the next page.
$i++;

}

Finally, the output for all URLs from Delicious tagged with PHP on the first page is:

– 40 PHP Tutorials
PHP: Hypertext Preprocessor
PHP Help: PHP Freaks!
symfony – open-source PHP5 web framework
, the best resource for PHP tutorials, templates, PHP manuals, content management systems, scripts, classes and more.
CakePHP : the rapid development php framework
PHPit – Totally PHP » Taking a look at ten different PHP frameworks
SAJAX – Simple Ajax Toolkit by ModernMethod – XMLHTTPRequest Toolkit for PHP
PHP: PHP Manual – Manual
CakePHP : the rapid development php framework
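The loop in Step 3 keeps requesting pages until one comes back empty. A variation, sketched here under my own assumptions, adds an upper limit on the number of pages and collects the results in an array. The names collectPages and $fakeFetch are hypothetical. The stand-in fetcher only simulates two pages, so the sketch runs without touching the network.

```php
<?php
// Hypothetical helper: calls $fetch(keyword, page) for page 1, 2, ... until it
// returns "" or $max_pages is reached, and returns the non-empty results.
function collectPages($fetch, $keyword, $max_pages) {
    $pages = array();
    for ($i = 1; $i <= $max_pages; $i++) {
        $cont = call_user_func($fetch, $keyword, $i);
        if ($cont == "") { break; }   // same stop condition as the loop in Step 3
        $pages[] = $cont;
    }
    return $pages;
}

// Stand-in fetcher for the demonstration: pretends only two pages exist.
$fakeFetch = function ($keyword, $page_num) {
    return ($page_num <= 2) ? "$keyword page $page_num" : "";
};

$pages = collectPages($fakeFetch, 'php', 10);
echo count($pages), "\n"; // prints 2
```

In real use the first argument would be 'getDelicious' in place of the stand-in. The page limit guards against looping forever if the site never returns an empty page.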





