
PHP Programming - October 16, 2010

PHP Web Page Scraping Tutorial

Web scraping, also known as web harvesting or web data extraction, is the process of extracting data from a web site or web page. In this tutorial I will show you one way to extract the title of a page, as well as its meta keywords, meta description, and links. With some basic knowledge of PHP and regular expressions you can accomplish this with ease.

First, let's go over the regular expression metacharacters we will be using in this tutorial.
(.*)

The dot (.) stands for any single character (except a newline, by default), while the asterisk (*) means zero or more repetitions of whatever precedes it. Combined and wrapped in parentheses, (.*) tells the regular expression engine to capture any run of characters of length zero or more.
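One detail worth seeing in action is that (.*) is greedy: it grabs as much text as it can. The sketch below uses a made-up $html string (not the tutorial's sample file) to compare it with the lazy form (.*?), which grabs as little as it can.

```php
<?php
// Hypothetical one-line input with two tags on the same line.
$html = '<b>first</b> and <b>second</b>';

// Greedy: (.*) takes as much as possible, so it runs to the LAST </b>.
preg_match('/<b>(.*)<\/b>/', $html, $greedy);

// Lazy: (.*?) takes as little as possible, so it stops at the FIRST </b>.
preg_match('/<b>(.*?)<\/b>/', $html, $lazy);

echo $greedy[1], "\n"; // first</b> and <b>second
echo $lazy[1], "\n";   // first
```

On a page where each tag sits on its own line the difference rarely shows, because the dot does not cross line breaks by default.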

As for our PHP, we will be using three functions to extract our data. The first is file_get_contents(), which reads the desired page and returns all of its contents, HTML included, as a string. The second is preg_match(), which searches a string with a regular expression and stops at the first match. The third is preg_match_all(), which works the same way as preg_match() except that it returns every match rather than just the first.
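To make the difference between the two matching functions concrete, here is a small sketch. The $text string and the pattern are invented for illustration and are not part of the tutorial files.

```php
<?php
// Hypothetical sample string; not the tutorial's HTML file.
$text = 'a1 b2 c3';

// preg_match() stops at the first match and returns 1 (found) or 0 (not found).
$found = preg_match('/[a-z](\d)/', $text, $first);

// preg_match_all() keeps going and returns how many matches it found.
$count = preg_match_all('/[a-z](\d)/', $text, $all);

var_dump($found);  // int(1)
var_dump($first);  // full match 'a1' at [0], captured digit '1' at [1]
var_dump($count);  // int(3)
var_dump($all[1]); // the captured digits: '1', '2', '3'
```

In both cases the matches land in the array passed as the third argument, while the return value only tells you whether, or how often, the pattern matched.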

For this tutorial I have included one HTML page that contains a title tag, a meta description, meta keywords, and some links. We will be using that file for our scraping purposes.

Let's start by setting up the variable that will hold the string of HTML from the external file.

<?php
$file_string = file_get_contents('page_to_scrape.html');
?>

What we did above is simply read all of the contents of our file page_to_scrape.html and store them in a string. Now that we have our string, we can proceed to the extraction.

* Hint: You can replace page_to_scrape.html with any page or link you may want to scrape. Some sites may have terms against scraping, so be sure to read the terms of use before you decide to scrape a site.
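When the source is a remote URL rather than a local file, the read can fail, and file_get_contents() then returns false. Reading URLs this way also requires the allow_url_fopen setting to be enabled in your PHP configuration. The sketch below wraps the call in a helper; fetch_page() and the user agent string are made-up names for illustration, not part of the original script.

```php
<?php
// fetch_page() is a hypothetical helper, not part of the original script.
function fetch_page(string $source): ?string
{
    // These HTTP options only apply when $source is an http(s) URL.
    $context = stream_context_create([
        'http' => [
            'timeout'    => 10,                   // seconds
            'user_agent' => 'ExampleScraper/1.0', // placeholder value
        ],
    ]);

    // file_get_contents() returns false on failure; @ hides the warning
    // so we can handle the error ourselves.
    $contents = @file_get_contents($source, false, $context);

    return $contents === false ? null : $contents;
}

$file_string = fetch_page('page_to_scrape.html');
if ($file_string === null) {
    echo "Could not read the page.\n";
}
```

Checking for null before running any regular expressions keeps a failed download from turning into a confusing "no matches" result later on.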

Let's start by extracting the text within our <title></title> tags. To do this we use the preg_match() function with three parameters. The first parameter is our regular expression, the second is the variable containing the HTML content, and the third is the output array, which preg_match() fills with our results.

<?php
$file_string = file_get_contents('page_to_scrape.html');
preg_match('/<title>(.*)<\/title>/i', $file_string, $title);
$title_out = $title[1];
?>

Let me explain the code above. We want the text inside the title tags <title></title>, so we insert (.*) between them to capture whatever characters they contain. A regular expression passed to preg_match() must be wrapped in delimiters; here we use two forward slashes, although other characters, such as {} or #, work as well. I append a lowercase i after the closing delimiter to make the search case insensitive. Because the forward slash is our delimiter, we also need to escape the forward slash in the closing title tag with a backslash so that the pattern does not end there. For the second parameter I passed $file_string, which we defined earlier to hold our HTML content. Lastly, the third parameter receives an array of results: element 0 holds the entire match and element 1 holds the text captured by (.*). I assign element 1 to the variable $title_out for later use.
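As a variation on the same idea, the sketch below uses # as the delimiter and checks the return value of preg_match() before reading the result. The inline HTML string is a stand-in for page_to_scrape.html and is assumed for illustration only.

```php
<?php
// Hypothetical inline HTML standing in for page_to_scrape.html.
$file_string = '<html><head><TITLE>My Page</TITLE></head></html>';

// Using # as the delimiter means the / in </title> needs no escaping.
// The i modifier still makes the match case insensitive.
if (preg_match('#<title>(.*)</title>#i', $file_string, $title)) {
    $title_out = trim($title[1]);
} else {
    // No match: avoid reading $title[1], which would not exist.
    $title_out = '';
}

echo $title_out, "\n"; // My Page
```

The else branch matters on pages without a title tag, where reading $title[1] directly would raise an undefined index notice or warning.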

Next we need to get the meta description and meta keywords. We do the same as above, changing only the HTML in the pattern and the output variable names, as follows.

preg_match('/<meta name="keywords" content="(.*)" \/>/i', $file_string, $keywords);
$keywords_out = $keywords[1];
preg_match('/<meta name="description" content="(.*)" \/>/i', $file_string, $description);
$description_out = $description[1];
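These patterns only match when the tag is written exactly as in the sample file. Here is a slightly more tolerant sketch, assuming the name attribute still comes before content; meta_content() is a hypothetical helper name, and the sample markup is invented.

```php
<?php
// meta_content() is a hypothetical helper, not part of the original script.
function meta_content(string $html, string $name): ?string
{
    // [^"]* stops at the closing quote; \s*\/?> accepts both "/>" and ">".
    $pattern = '/<meta\s+name="' . preg_quote($name, '/')
             . '"\s+content="([^"]*)"\s*\/?>/i';

    return preg_match($pattern, $html, $m) ? $m[1] : null;
}

// Hypothetical sample markup with both closing styles.
$html = '<meta name="keywords" content="php, scraping">' . "\n"
      . '<meta name="description" content="A sample page" />';

echo meta_content($html, 'keywords'), "\n";    // php, scraping
echo meta_content($html, 'description'), "\n"; // A sample page
```

The helper returns null when the tag is missing, so the caller can tell "not found" apart from an empty content attribute.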

Finally, we need to retrieve the list of links on the page. In my sample HTML document the links are enclosed in <li></li> tags, so I will use these together with the <a></a> tags to extract the data. For this we need preg_match_all(), so that we get back more than one result. We pass it the same parameters as we did with preg_match().

preg_match_all('/<li><a href="(.*)">(.*)<\/a><\/li>/i', $file_string, $links);

With the above code, $links now holds all of the results. Notice that I used the metacharacters (.*) more than once this time. Since the data differs from link to link, we need to tell the script that any set of characters may appear in those places. The $links array contains the data within each href="" attribute as well as the text between the <a></a> tags.
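To see how the results are laid out, here is a sketch run against two made-up list items (one per line, as an assumption about the sample file's layout). It also shows the optional PREG_SET_ORDER flag, which groups the captures per link.

```php
<?php
// Hypothetical sample markup, one link per line.
$file_string = "<li><a href=\"/home\">Home</a></li>\n"
             . "<li><a href=\"/about\">About</a></li>";

$pattern = '/<li><a href="(.*)">(.*)<\/a><\/li>/i';

preg_match_all($pattern, $file_string, $links);

// Default (PREG_PATTERN_ORDER) layout:
// $links[0] = full matches, $links[1] = href values, $links[2] = link text.
print_r($links[1]); // /home, /about
print_r($links[2]); // Home, About

// PREG_SET_ORDER groups the captures per link instead.
preg_match_all($pattern, $file_string, $sets, PREG_SET_ORDER);
foreach ($sets as $set) {
    echo $set[2], ' - ', $set[1], "\n";
}
```

With the default layout you index by capture group first and by match second, which is why the print loop in this tutorial reads $links[2][$i] and $links[1][$i].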

Now that we have all of the data we wanted to collect, we can simply print it out as follows:

<p><strong>Title:</strong> <?php echo $title_out; ?></p>
<p><strong>Keywords:</strong> <?php echo $keywords_out; ?></p>
<p><strong>Description:</strong> <?php echo $description_out; ?></p>
<p><strong>Links:</strong> <em>(Name - Link)</em><br />
<?php
echo '<ol>';
for($i = 0; $i < count($links[1]); $i++) {
echo '<li>' . $links[2][$i] . ' - ' . $links[1][$i] . '</li>';
}
echo '</ol>';
?>
</p>
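Scraped text can itself contain HTML, so it is usually safer to escape it before echoing it into your own page. A minimal sketch, using a made-up $links array in place of real scraped data:

```php
<?php
// Hypothetical scraped values; the second entry contains markup on purpose.
$links = [
    1 => ['/home', '/search?q=a&b=1'],
    2 => ['Home', '<b>Search</b>'],
];

$items = [];
foreach ($links[1] as $i => $href) {
    // htmlspecialchars() turns <, >, & and quotes into HTML entities.
    $name = htmlspecialchars($links[2][$i], ENT_QUOTES, 'UTF-8');
    $href = htmlspecialchars($href, ENT_QUOTES, 'UTF-8');
    $items[] = "<li>{$name} - {$href}</li>";
}

echo "<ol>\n", implode("\n", $items), "\n</ol>\n";
```

Without the escaping step, the <b> tag in the second entry would be rendered by the browser instead of being shown as text.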

Attached are the files used in this tutorial. Let me know if you have any questions below.

Download This PHP Website Scraper Script


56 Comments

Opt
Friday, July 22, 2011

nice its working fine. Here same tutorial on

Ahmad Shahwaiz
Thursday, September 29, 2011

Awesome tutorials , thanks a lot!

hank
Wednesday, October 19, 2011

Thanks, will use this on my website.

Does the link output example work in a table with for example 10 rows, how can i define such a loop for 10 times? to scrape only first 10 results from a page and put into my 10 row table.

kind regards hank

Frank Perez
Wednesday, October 19, 2011

You could just make it so the for loop stops after 10 records. I would do an if check though to make sure you have 10 or more records before looping to the static 10 value. You could also remove the <li> reference from the preg_match_all function to assure you get all links on a page.
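One way to sketch the suggestion above is to cap the loop with min(), so it never runs past the number of links actually found. The $links array here is a made-up stand-in for the one built by preg_match_all(), and the table markup is only an assumption about the desired output.

```php
<?php
// Hypothetical stand-in for the $links array built by preg_match_all().
$links = [
    1 => ['/a', '/b', '/c'],
    2 => ['A', 'B', 'C'],
];

// Never loop past the number of links actually found.
$limit = min(10, count($links[1]));

$rows = [];
for ($i = 0; $i < $limit; $i++) {
    $rows[] = '<tr><td>' . $links[2][$i] . '</td><td>' . $links[1][$i] . '</td></tr>';
}

echo "<table>\n", implode("\n", $rows), "\n</table>\n";
```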
    PHP scraper – regular expressions
    Thursday, February 23, 2012

    [...] trying to follow a tutorial for web scraping with [...]

    sohel
    Tuesday, March 6, 2012

    Nice!! Tutorial.

    Marin
    Monday, March 26, 2012

    Nice tutorial. I’m offering a free PHP web scraper for download here: http://code.google.com/p/universal-web-scraper/

    Maybe people reading your tutorial will be interested in looking at its source code and learning how to write more advanced stuff.

    Rupert
    Friday, April 27, 2012

    Thanks buddy for this.
    I am planning to create my own data scrapper from PHP and I think this will be my starting point!

    Regards,
    Rupert

    Noah John
    Monday, April 30, 2012

    Do you have some script that can scrape all backlinks from a website?

    ROCKESH RONITH
    Thursday, August 16, 2012

    hello !!
    Can any one help me….i trying to using this above code..but i unable to extract data from another website…..????

    Frank Perez
    Thursday, August 16, 2012

    What issues are you having? Can you provide some more detail?

    Rockesh Ronith
    Saturday, August 18, 2012

    hello! actually i need a data from other website url…like name,email,phone,address

    so what should i do????

    how can i write a script for scraping a data…..plz tell me………

    Yan Da
    Sunday, August 19, 2012

    I cant seem to get the retrieving link to work
    I think its because there are //s in it

    Anonymous
    Wednesday, November 14, 2012

    [...] social networks I am looking for a social network module similar to the one found on this page PHP Web Page Scraping Tutorial | DevBlog.co hopefully you can [...]

    sumit sharma
    Saturday, November 17, 2012

    How to store html hierarchy in db when scraping using simple html dom

    Abc
    xyz

    Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

    Abc-Again
    xyz-Again

    Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

    Suppose i scraped this html and i want to store this in db according to hierarchy.
    Store in db At first and second postion Abc and xyz anchor tags and at third postion image at fourth postion text and at fifth and sixth postion Abc-Again and xyz-Again at seventh postion image at eigth postion text.

    Wadeski
    Saturday, December 1, 2012

    Hi there. Great tutorial. I’m trying to scrape a set of links within a target div. How am I able to do that. I can get the div, but not the ‘a’ tag inside the div.

    Here is an example of what I’m trying to get.

    Div Heading
    Link

    From that, I would like to extract the info between the href=”".

    Your help is appreciated.

    Cheers!

    ibnu
    Monday, December 24, 2012

    thank’s a lot, it works

    Grant
    Monday, January 21, 2013

    It doesn’t work on http://im.storm8.com/home.php …why?

    Frank Perez
    Thursday, January 31, 2013

    When I attempt to load the link you provided, it comes up blank in a browser.

    Scraping data from various online stores
    Tuesday, March 19, 2013

    [...] tried a PHP scrape like [...]

    Thomas
    Monday, May 6, 2013

    Hi!
    Just a little sidenote:
    HTML is not regular hence you shouldn’t attempt to parse it using regular expressions.
    Taking your example , let’s assume you do not have a -tag but rather a more sophisticated one, like … and within this you have equally interesting tags with tones of additional attributes e.g. data-store_for_js .
    So at this point the RegEx-Approach will not work anymore or will become very complex and unmanagable.
    I would recommend DOMDocument or a XMLParser.
    sincerely

    bhavesh
    Saturday, June 29, 2013

    hi dear,

    i have a some scrapping example but i want to scrap perticular part of website so what can i do ? please help me.
    i have used example
    http://lessonsone.blogspot.in/2013/06/how-to-web-scrapping-in-php.html

    james
    Thursday, July 11, 2013

    Excellent PHP web scraping ebook here:

    https://leanpub.com/web-scraping

    Ali Baig
    Thursday, August 8, 2013

    Excellent..

    Click4Joomla
    Friday, September 6, 2013

    Very good post….
    For web scraping implementation contact http://www.click4joomla.com

    Dan
    Friday, September 6, 2013

    Good tutorial, simple and informative, thank you

    Grosir
    Friday, December 20, 2013

    Very good post….

    http://www.formspring.me/
    Monday, December 23, 2013

    I am truly happy to read this blog posts which includes
    plenty of valuable information, thanks for providing these kinds of data.

    ARIES
    Monday, February 24, 2014

    what ever

    jane
    Monday, February 24, 2014

    it’s not working hahaha

    vijay
    Thursday, March 20, 2014

    I read your post completely
    i am trying in my hosting but not working for external urls
    its working for the file u provided. i want to know how to extract data from external website
    i tried in my site http://www.lic24.in

    kisan
    Saturday, April 12, 2014

    i have used this sites for data scraping using dom curl.

    Kishan Rathod
    Saturday, April 12, 2014

    Hello, Word this is my second comments…

    Kishan Rathod
    Saturday, April 12, 2014

    **********///////////*********/

    Web Scraping
    Monday, July 28, 2014

    very nice tutorial….i m gonna try this….thanks for sharing…

    Ashish Sharma
    Friday, September 5, 2014

    Thanks its really cool but i want to scrape data from the text box and select menu can you suggest me some regular expressions

    jual batik modern
    Sunday, September 7, 2014

    It’s an amazing article designed for all the online viewers; they will take advantage from
    it I am sure.

    http://inthegarlic.com
    Thursday, September 25, 2014

    Hey there I am so glad I found your weblog, I really found you by mistake,
    while I was looking on Aol for something else,
    Anyways I am here now and would jut like to say thank
    you for a remarkable post and a all round thrilling blog (I
    also love the theme/design), I don’t have time to go through
    it all at the minute but I have saved it and also added in your RSS
    feeds, so when I have time I will be back to read a
    great deal more, Please do keep up the great jo.

    Leadership Development
    Tuesday, October 21, 2014

    Wow, this article is good, myy sister is analyzing these kinds
    of things, thus I am going to lett know her.

    cookies premium
    Thursday, October 23, 2014

    It’s a pity you don’t have a donate button! I’d certainly donate too this excellent blog!
    I sulpose for now i’ll settle for book-marking and adding your RSS ferd to my Google account.
    I loook forward to new updates and will talk about this site with my Facebook group.
    Chat soon!

    Jonathan
    Tuesday, March 10, 2015

    This is a great tutorial for getting started, but I learned the hard way today that you are not supposed to parse HTML with Regex. I’m only saying this so nobody else ends up wasting a day of work and then faces senseless ridicule after asking a peer for help when they’ve finally gotten frustrated with trying to parse more complicated elements (like with specific attributes).

    If you’re going to parse HTML, use something like XPath or DOMDocument in PHP.
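For readers who want to try that route, here is a minimal sketch that pulls the same four pieces of data with DOMDocument and DOMXPath, both part of PHP's DOM extension. The sample markup and variable names are assumptions for illustration; this is not the tutorial author's code.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<html><head><title>Sample</title>'
      . '<meta name="description" content="A sample page"></head>'
      . '<body><ul><li><a href="/home" class="nav">Home</a></li>'
      . '<li><a href="/about">About</a></li></ul></body></html>';

$doc = new DOMDocument();
// Real pages often contain invalid markup; collect the parser warnings
// internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// string(...) makes evaluate() return plain strings.
$title = $xpath->evaluate('string(//title)');
$description = $xpath->evaluate('string(//meta[@name="description"]/@content)');

// Extra attributes such as class="nav" do not affect the query.
$links = [];
foreach ($xpath->query('//li/a[@href]') as $a) {
    $links[] = [$a->getAttribute('href'), trim($a->textContent)];
}

echo $title, "\n";       // Sample
echo $description, "\n"; // A sample page
print_r($links);
```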

    karnisze
    Wednesday, April 1, 2015

    Thanks for the marvelous posting! I actually enjoyed reading it, you can be a great
    author.I will always bookmark your blog and will often come back very soon. I want to encourage continue your great job,
    have a nice holiday weekend!

    Kerem
    Wednesday, April 1, 2015

    Thanks for posting. Here is a video tutorial http://goo.gl/PJScH6

    Mark S
    Monday, May 11, 2015

    Great tutorial to get one started…

    But as Thomas mentioned in May 2013, not all web pages are so clean to allow for easy parsing. Some projects may be much more complicated due to the formatting of the output. Good luck to all.

    mashhood
    Thursday, August 27, 2015

    Hello. i need some help with it, Any body contact me on mail, thanks

    PHP scraper – regular expressions – Tech Magazine
    Sunday, October 25, 2015

    […] trying to follow a tutorial for web scraping with […]

    Amir shah
    Thursday, December 17, 2015

    i try best to get the data but script does not work. i need your help
    Thanks

    coc hack app
    Wednesday, December 28, 2016

    I’m really enjoying the theme/design of your site. Do you
    ever run into any internet browser compatibility issues?
    A few of my blog audience have complained about my site not operating
    correctly in Explorer but looks great in Safari. Do you
    have any advice to help fix this problem?

    pk
    Sunday, May 28, 2017

    some sites such as stackoverflow have complex and long tags, i doubt your code would work. I need to extract heading tags from external websites, any help?

    مهرجانات 2017
    Monday, May 29, 2017

    Hi,I check your new stuff named “PHP Web Page Scraping Tutorial | DevBlog.co” like every week.Your story-telling style is witty, keep up the good work! And you can look our website about مهرجانات 2017.

    xps production line manufacturer
    Monday, June 26, 2017

    An outstanding share! I have just forwarded this onto a coworker who had been conducting a little
    homework on this. And he actually bought me dinner simply because I discovered it for
    him… lol. So allow me to reword this…. Thanks for the meal!!
    But yeah, thanx for spending some time to talk about this issue here on your blog.
