PHP Programming - October 16, 2010

PHP Web Page Scraping Tutorial

Web scraping, also known as web harvesting or web data extraction, is the process of extracting data from a given web site or web page. In this tutorial I will go over a way for you to extract the title of a page, as well as the meta keywords, meta description, and links. With some basic knowledge of PHP and regular expressions you can accomplish this with ease.

First, let's go over the regular expression metacharacters we will be using in this tutorial.
(.*)

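The dot (.) stands for any single character (except a newline, by default), while the asterisk (*) means the preceding item may repeat 0 or more times. Combined and wrapped in parentheses as (.*), they tell the system that you are looking for any run of characters with a length of 0 or more, and that you want the matched text captured so you can read it back afterwards.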
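Here is a small sketch of how (.*) behaves in practice. The sample strings are made up for illustration and are not part of the tutorial's HTML file. It shows that (.*) is greedy, that (.*?) is the lazy variant, and that the dot only crosses line breaks when the s modifier is added.

```php
<?php
// Hypothetical sample strings, not taken from the tutorial's HTML file.
$one_line = '<b>first</b> and <b>second</b>';

// Greedy: (.*) grabs as much as possible, so it runs to the LAST </b>.
preg_match('/<b>(.*)<\/b>/', $one_line, $greedy);
echo $greedy[1], "\n"; // first</b> and <b>second

// Lazy: (.*?) stops at the first place the rest of the pattern can match.
preg_match('/<b>(.*?)<\/b>/', $one_line, $lazy);
echo $lazy[1], "\n"; // first

// By default the dot does not match a newline, so this finds nothing.
$two_lines = "<b>split\nacross lines</b>";
$found_default = preg_match('/<b>(.*)<\/b>/', $two_lines, $m1);

// The s modifier lets the dot match newlines as well.
$found_dotall = preg_match('/<b>(.*)<\/b>/s', $two_lines, $m2);
```

If a page puts several tags on one line, or spreads one tag over several lines, these two details are usually the reason a pattern returns more or less than expected.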

As for our PHP, we will be using 3 functions to extract our data. The first is file_get_contents(), which reads the desired page and returns all of its contents, HTML included, as a string. The second is preg_match(), which returns the first match for a given regular expression. The last is preg_match_all(), which works the same way as preg_match() except that it returns every match rather than just the first.
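To make the difference between the two matching functions concrete, here is a minimal sketch. The input string is a made-up example and the variable names are only illustrative.

```php
<?php
// Hypothetical input string used only for this illustration.
$text = 'ids: 10, 20, 30';

// preg_match() stops after the first match and returns 1 (or 0 if none).
$count_one = preg_match('/\d+/', $text, $first);
// $first[0] is '10'

// preg_match_all() keeps going and returns the number of matches found.
$count_all = preg_match_all('/\d+/', $text, $all);
// $all[0] is array('10', '20', '30')

echo $count_one, ' ', $first[0], "\n";             // 1 10
echo $count_all, ' ', implode(',', $all[0]), "\n"; // 3 10,20,30
```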

For this tutorial I have included 1 HTML page that contains our Title Tag, Meta Description, Meta Keywords and Some Links. We will be using that file for our scraping purposes.

Let's start by setting up the variable that will contain our string of HTML from the external file.

<?php
$file_string = file_get_contents('page_to_scrape.html');
?>

What we did above is simply get all of the contents of our file page_to_scrape.html and store them in a string. Now that we have our string we can proceed to the extraction itself.

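* Hint: You can replace page_to_scrape.html with any page or link you may want to scrape. To fetch a remote URL this way, the allow_url_fopen setting must be enabled in your PHP configuration. Some sites may have terms against scraping, so be sure to read the terms of use before you decide to scrape a site.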
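When the page cannot be read, file_get_contents() returns false rather than a string, so it can help to check for that before running any regular expressions. The following is a minimal sketch; fetch_page() is a hypothetical helper name introduced here, not something from the tutorial's download.

```php
<?php
/**
 * Hypothetical helper: returns the page contents as a string,
 * or null when the file or URL could not be read.
 */
function fetch_page($source)
{
    // file_get_contents() returns false on failure and raises a warning;
    // the @ operator silences the warning so we can handle the failure ourselves.
    $contents = @file_get_contents($source);
    return $contents === false ? null : $contents;
}

$file_string = fetch_page('page_to_scrape.html');
if ($file_string === null) {
    echo "Could not read the page.\n";
}
```

The same check applies whether the source is a local file or a remote link.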

Let's start by extracting the text within our <title></title> tags. To accomplish this we need to use our preg_match() function. It takes 3 parameters and fills an array with our result. The first parameter is our regular expression, the second is our variable containing the HTML content, and the third is the output array that will contain our results.

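<?php
$file_string = file_get_contents('page_to_scrape.html');
// preg_match() returns 1 when the pattern matches and fills $title with the results.
if (preg_match('/<title>(.*)<\/title>/i', $file_string, $title)) {
    $title_out = $title[1]; // $title[0] is the full match, $title[1] the captured text
} else {
    $title_out = ''; // no <title> tag found
}
?>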

Let me explain what I did in the above code. First, we know that we want the text from within the title tags <title></title>, so we insert (.*) between the tags to capture any characters that may be within them. When using a regular expression in the preg_match() function we need to enclose it within delimiters. Here I use two forward slashes, but you could use other characters such as {} and more. I append a lowercase i after the closing delimiter to make the search case insensitive. Because the forward slash is our delimiter, we also need to escape the forward slash in the closing title tag so that it is not mistaken for the end of the pattern.

For our second parameter I passed our variable $file_string, which we defined earlier to contain our HTML content. Lastly, our third parameter is the array that receives the results. Element 0 of that array holds the whole match and element 1 holds the text captured by (.*), so I assigned element 1 to the variable $title_out for later use.
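As an illustration of the delimiter choice, the sketch below runs the same title pattern twice, once with forward slashes and once with # as the delimiter. The HTML string is a made-up stand-in for the scraped page.

```php
<?php
// Hypothetical one-line HTML snippet standing in for the scraped page.
$html = '<html><head><TITLE>My Sample Page</TITLE></head></html>';

// Forward-slash delimiters: the / in </title> has to be escaped.
preg_match('/<title>(.*)<\/title>/i', $html, $with_slashes);

// Hash delimiters: the / is an ordinary character, so no escaping is needed.
preg_match('#<title>(.*)</title>#i', $html, $with_hashes);

echo $with_slashes[1], "\n"; // My Sample Page
echo $with_hashes[1], "\n";  // My Sample Page
```

Both calls capture the same text. The i modifier is what lets the lowercase pattern match the uppercase TITLE tag.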

Next we need to get the Meta Description and Meta Keywords. We will just do the same as what we did above and just change the HTML and output names as follows.

// Each pattern captures the value of the content attribute.
preg_match('/<meta name="keywords" content="(.*)" \/>/i', $file_string, $keywords);
$keywords_out = isset($keywords[1]) ? $keywords[1] : '';
preg_match('/<meta name="description" content="(.*)" \/>/i', $file_string, $description);
$description_out = isset($description[1]) ? $description[1] : '';

Finally we need to retrieve our list of links on the page. In my sample HTML document I have my links enclosed within <li></li> tags. I will use this in conjunction with the <a></a> tags to extract my data. For this we will need to use our preg_match_all() function so that we can get back more than 1 result. We pass it the same parameters as we did with the preg_match() function.

preg_match_all('/<li><a href="(.*)">(.*)<\/a><\/li>/i', $file_string, $links);

With the above code we now have an array assigned to $links with all of the results. Notice that I used the metacharacters (.*) more than once this time. Since the data will not be the same from one link to the next, we need to let the script know that any set of characters may appear in each of those places. Our $links array contains the data within each href="" in $links[1] and the data between the <a></a> tags in $links[2], while $links[0] holds the full matches.
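To see the shape of the array that preg_match_all() fills, here is a minimal sketch. The list markup is an assumption for illustration, and the real page_to_scrape.html may differ.

```php
<?php
// Hypothetical list markup; the tutorial's page_to_scrape.html may differ.
$html = "<ul>\n"
      . "<li><a href=\"http://example.com/one\">One</a></li>\n"
      . "<li><a href=\"http://example.com/two\">Two</a></li>\n"
      . "</ul>";

preg_match_all('/<li><a href="(.*)">(.*)<\/a><\/li>/i', $html, $links);

// $links[0] holds the full matches, $links[1] the first capture group (href),
// and $links[2] the second capture group (link text).
print_r($links[1]);
print_r($links[2]);
```

Note that this pattern relies on each list item sitting on its own line. Because (.*) is greedy, several links on one line would be captured as a single match.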

Now that we have all of the data we want to collect, we can simply print it out as follows:

<p><strong>Title:</strong> <?php echo $title_out; ?></p>
<p><strong>Keywords:</strong> <?php echo $keywords_out; ?></p>
<p><strong>Description:</strong> <?php echo $description_out; ?></p>
<p><strong>Links:</strong> <em>(Name - Link)</em></p>
<?php
// $links[1] holds the href values and $links[2] the link text, index-aligned.
echo '<ol>';
for ($i = 0; $i < count($links[1]); $i++) {
    echo '<li>' . $links[2][$i] . ' - ' . $links[1][$i] . '</li>';
}
echo '</ol>';
?>

Attached are the files used in this tutorial. Let me know if you have any questions below.

Download This PHP Website Scraper Script


36 Comments

Opt
Friday, July 22, 2011

nice its working fine. Here same tutorial on

Ahmad Shahwaiz
Thursday, September 29, 2011

Awesome tutorials , thanks a lot!

hank
Wednesday, October 19, 2011

Thanks, will use this on my website.

Does the link output example work in a table with for example 10 rows, how can i define such a loop for 10 times? to scrape only first 10 results from a page and put into my 10 row table.

kind regards hank

Frank Perez
Wednesday, October 19, 2011

You could just make it so the for loop stops after 10 records. I would do an if check though to make sure you have 10 or more records before looping to the static 10 value. You could also remove the <li> reference from the preg_match_all function to ensure you get all links on a page.
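The suggestion in this reply could be sketched roughly as follows. The $links array here is a hypothetical stand-in for the one produced by preg_match_all() in the tutorial, and the table markup is only one possible layout.

```php
<?php
// Hypothetical stand-in for the $links array produced by preg_match_all().
$links = array(
    1 => array('/a', '/b', '/c'),
    2 => array('A', 'B', 'C'),
);

// Loop at most 10 times, or fewer when the page has fewer links.
$limit = min(10, count($links[1]));
$rows = array();
for ($i = 0; $i < $limit; $i++) {
    $rows[] = '<tr><td>' . $links[2][$i] . '</td><td>' . $links[1][$i] . '</td></tr>';
}
echo "<table>\n" . implode("\n", $rows) . "\n</table>\n";
```

Using min() covers the "10 or more records" check in one step, since the loop never runs past the number of links actually found.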
    PHP scraper – regular expressions
    Thursday, February 23, 2012

    [...] trying to follow a tutorial for web scraping with [...]

    sohel
    Tuesday, March 6, 2012

    Nice!! Tutorial.

    Marin
    Monday, March 26, 2012

    Nice tutorial. I’m offering a free PHP web scraper for download here: http://code.google.com/p/universal-web-scraper/

    Maybe people reading your tutorial will be interested in looking at its source code and learning how to write more advanced stuff.

    Rupert
    Friday, April 27, 2012

    Thanks buddy for this.
    I am planning to create my own data scrapper from PHP and I think this will be my starting point!

    Regards,
    Rupert

    Noah John
    Monday, April 30, 2012

    Do you have some script that can scrape all backlinks from a website?

    ROCKESH RONITH
    Thursday, August 16, 2012

    hello !!
    Can any one help me….i trying to using this above code..but i unable to extract data from another website…..????

    Frank Perez
    Thursday, August 16, 2012

    What issues are you having? Can you provide some more detail?

    Rockesh Ronith
    Saturday, August 18, 2012

    hello! actually i need a data from other website url…like name,email,phone,address

    so what should i do????

    how can i write a script for scraping a data…..plz tell me………

    Yan Da
    Sunday, August 19, 2012

    I cant seem to get the retrieving link to work
    I think its because there are //s in it

    Anonymous
    Wednesday, November 14, 2012

    [...] redes sociales Busco un modulo de redes sociales parecido al que se encuentra en esta pagina PHP Web Page Scraping Tutorial | DevBlog.co ojala puedan [...]

    sumit sharma
    Saturday, November 17, 2012

    How to store html hierarchy in db when scraping using simple html dom

    Abc
    xyz

    Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

    Abc-Again
    xyz-Again

    Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

    Suppose i scraped this html and i want to store this in db according to hierarchy.
    Store in db At first and second postion Abc and xyz anchor tags and at third postion image at fourth postion text and at fifth and sixth postion Abc-Again and xyz-Again at seventh postion image at eigth postion text.

    Wadeski
    Saturday, December 1, 2012

    Hi there. Great tutorial. I’m trying to scrape a set of links within a target div. How am I able to do that. I can get the div, but not the ‘a’ tag inside the div.

    Here is an example of what I’m trying to get.

    Div Heading
    Link

    From that, I would like to extract the info between the href=”".

    Your help is appreciated.

    Cheers!

    ibnu
    Monday, December 24, 2012

    thank’s a lot, it works

    Grant
    Monday, January 21, 2013

    It doesn’t work on http://im.storm8.com/home.php …why?

    Frank Perez
    Thursday, January 31, 2013

    When I attempt to load the link you provided, it comes up blank in a browser.

    Scraping data from various online stores
    Tuesday, March 19, 2013

    [...] tried a PHP scrape like [...]

    Thomas
    Monday, May 6, 2013

    Hi!
    Just a little sidenote:
    HTML is not regular hence you shouldn’t attempt to parse it using regular expressions.
    Taking your example , let’s assume you do not have a -tag but rather a more sophisticated one, like … and within this you have equally interesting tags with tones of additional attributes e.g. data-store_for_js .
    So at this point the RegEx-Approach will not work anymore or will become very complex and unmanagable.
    I would recommend DOMDocument or a XMLParser.
    sincerely

    bhavesh
    Saturday, June 29, 2013

    hi dear,

    i have a some scrapping example but i want to scrap perticular part of website so what can i do ? please help me.
    i have used example
    http://lessonsone.blogspot.in/2013/06/how-to-web-scrapping-in-php.html

    james
    Thursday, July 11, 2013

    Excellent PHP web scraping ebook here:

    https://leanpub.com/web-scraping

    Ali Baig
    Thursday, August 8, 2013

    Excellent..

    Click4Joomla
    Friday, September 6, 2013

    Very good post….
    For web scraping implementation contact http://www.click4joomla.com

    Dan
    Friday, September 6, 2013

    Good tutorial, simple and informative, thank you

    Grosir
    Friday, December 20, 2013

    Very good post….

    http://www.formspring.me/
    Monday, December 23, 2013

    I am truly happy to read this blog posts which includes
    plenty of valuable information, thanks for providing these kinds of data.

    ARIES
    Monday, February 24, 2014

    what ever

    jane
    Monday, February 24, 2014

    it’s not working hahaha

    vijay
    Thursday, March 20, 2014

    I read your post completely
    i am trying in my hosting but not working for external urls
    its working for the file u provided. i want to know how to extract data from external website

    vijay
    Thursday, March 20, 2014

    I read your post completely
    i am trying in my hosting but not working for external urls
    its working for the file u provided. i want to know how to extract data from external website
    i tried in my site http://www.lic24.in

    kisan
    Saturday, April 12, 2014

    i have used this sites for data scraping using dom curl.

    Kishan Rathod
    Saturday, April 12, 2014

    Hello, Word this is my second comments…

    Kishan Rathod
    Saturday, April 12, 2014

    **********///////////*********/

    Web Scraping
    Monday, July 28, 2014

    very nice tutorial….i m gonna try this….thanks for sharing…
