PHP Programming - October 16, 2010

PHP Web Page Scraping Tutorial

Web scraping, also known as web harvesting or web data extraction, is the process of extracting data from a web site or web page. In this tutorial I will go over a way to extract the title of a page, as well as its meta keywords, meta description, and links. With some basic knowledge of PHP and regular expressions you can accomplish this with ease.

First, let's go over the regular expression metacharacters we will be using in this tutorial.
(.*)

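The dot (.) stands for any single character (except a newline, by default), while the asterisk (*) means zero or more repetitions of whatever precedes it. Combined as (.*), they tell the regex engine to match any run of characters of length 0 or more. The surrounding parentheses capture whatever was matched so that we can read it back later.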
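As a minimal sketch of how (.*) behaves in practice, the snippet below compares it with the lazy form (.*?) and with the s modifier. The sample strings and variable names are made up for illustration; they are not part of the tutorial's files.

```php
<?php
// Illustrative input only; not taken from page_to_scrape.html.
$sample = '<b>one</b> and <b>two</b>';

// Greedy: (.*) grabs as much as it can, so it runs to the LAST </b>.
preg_match('/<b>(.*)<\/b>/', $sample, $greedy);

// Lazy: (.*?) grabs as little as it can, so it stops at the FIRST </b>.
preg_match('/<b>(.*?)<\/b>/', $sample, $lazy);

echo $greedy[1], "\n"; // one</b> and <b>two
echo $lazy[1], "\n";   // one

// By default the dot does not match a newline, so this pattern fails
// on content that spans lines unless the s modifier is added.
$multiline = "<p>line one\nline two</p>";
$without_s = preg_match('/<p>(.*)<\/p>/', $multiline);  // 0 (no match)
$with_s    = preg_match('/<p>(.*)<\/p>/s', $multiline); // 1 (match)
```

The greedy form is fine when a tag appears once on its own line, as in the sample page, but the lazy form is usually the safer choice when several tags can share a line.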

As for our PHP, we will be using 3 functions to extract our data. The first is file_get_contents(), which reads the desired page and returns all of its contents, HTML included, as a string. The second is preg_match(), which stops at the first match for a given regular expression. The last is preg_match_all(), which works like preg_match() except that it returns every match rather than just the first.
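To make the difference between the two matching functions concrete, here is a small sketch run against a made-up HTML string. Both functions return a count and fill the array passed as the third argument.

```php
<?php
// Illustrative input only.
$html = '<p>first</p><p>second</p>';

// preg_match() stops after the first match and returns 1 (or 0 if none).
$found = preg_match('/<p>(.*?)<\/p>/', $html, $one);
// $one[0] is the full match, $one[1] is the first captured group.
echo $found, ' ', $one[1], "\n"; // 1 first

// preg_match_all() keeps going and returns the number of matches.
$count = preg_match_all('/<p>(.*?)<\/p>/', $html, $all);
// $all[1] is the list of everything captured by the first group.
echo $count, ' ', implode(', ', $all[1]), "\n"; // 2 first, second
```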

For this tutorial I have included 1 HTML page that contains our Title Tag, Meta Description, Meta Keywords and Some Links. We will be using that file for our scraping purposes.

Let's start by setting up the variable that will contain the string of HTML from the external file.

<?php $file_string = file_get_contents('page_to_scrape.html'); ?>

What we did above is simply get all of the contents from our file page_to_scrape.html and store it to a string. Now that we have our string we can then proceed to the next portion of extraction.

* Hint: You can replace page_to_scrape.html with any page or link you want to scrape. Some sites may have terms against scraping, so be sure to read the terms of use before you decide to scrape a site.
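When the source is a remote URL rather than a local file, the request can fail, and file_get_contents() then returns false. The sketch below wraps the call in a hypothetical helper, fetch_page(), that returns null on failure. The helper name, the timeout, and the user agent string are assumptions made for this example. Fetching URLs this way also requires the allow_url_fopen setting to be enabled in php.ini.

```php
<?php
/**
 * Hypothetical helper (not part of the tutorial's download):
 * returns the contents of a local file or URL, or null on failure.
 */
function fetch_page(string $source): ?string
{
    // These HTTP options only apply to http:// and https:// sources.
    $context = stream_context_create([
        'http' => [
            'timeout'    => 10,                   // seconds; arbitrary choice
            'user_agent' => 'ExampleScraper/1.0', // made-up identifier
        ],
    ]);

    // @ silences the warning so the caller can handle failure itself.
    $contents = @file_get_contents($source, false, $context);

    return $contents === false ? null : $contents;
}

$file_string = fetch_page('page_to_scrape.html');
if ($file_string === null) {
    echo "Could not load the page.\n";
}
```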

Let's start by extracting the text within our <title></title> tags. To accomplish this we use the preg_match() function. Given 3 parameters, preg_match() fills an array with our result. The first parameter is our regular expression, the second is the variable containing the HTML content, and the third is the output array that will contain our results.

<?php
$file_string = file_get_contents('page_to_scrape.html');
preg_match('/<title>(.*)<\/title>/i', $file_string, $title);
$title_out = $title[1];
?>

Let me explain what I did in the above code. We know that we want the text from within the title tags <title></title>, so we insert (.*) between the tags to capture whatever characters they contain. When using a regular expression in the preg_match() function we need to enclose it within delimiters, here two forward slashes. You could use other delimiters, such as # or a pair of curly braces {}, but for this example we will use forward slashes. I append a lowercase i after the closing delimiter to make the search case insensitive. We also need to escape the forward slash in the closing title tag so that it is not mistaken for the end of the pattern. For the second parameter I passed our variable $file_string, which we defined earlier to contain our HTML content. Lastly, the third parameter receives an array of results: element 0 holds the full match and element 1 holds the text captured by (.*). I assigned that element to the variable $title_out for later use.
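The code above assumes the title is always found. As a slightly more defensive sketch, the hypothetical helper extract_title() below checks the return value of preg_match() before reading the array. It also uses # as the delimiter, so the forward slash in the closing tag no longer needs escaping. The helper name and the extra modifiers are my own choices, not something the original script uses.

```php
<?php
/**
 * Hypothetical helper: returns the page title, or null if none is found.
 */
function extract_title(string $html): ?string
{
    // i: case-insensitive; s: let the dot span newlines;
    // [^>]* tolerates attributes on the opening tag.
    if (preg_match('#<title[^>]*>(.*?)</title>#is', $html, $match) === 1) {
        return trim($match[1]);
    }

    return null;
}

echo extract_title("<TITLE>\n My Page \n</TITLE>"), "\n"; // My Page
```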

Next we need to get the Meta Description and Meta Keywords. We will just do the same as what we did above and just change the HTML and output names as follows.

preg_match('/<meta name="keywords" content="(.*)" \/>/i', $file_string, $keywords);
$keywords_out = $keywords[1];
preg_match('/<meta name="description" content="(.*)" \/>/i', $file_string, $description);
$description_out = $description[1];
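Since the two meta lookups differ only in the name attribute, they can be folded into one function. The sketch below uses a hypothetical helper, extract_meta(), that accepts both the self-closing form and the plain ">" form of the tag. It still assumes that name comes before content and that both use double quotes, which real pages do not always respect.

```php
<?php
/**
 * Hypothetical helper: returns the content of <meta name="$name">,
 * or null if the tag is not found.
 */
function extract_meta(string $html, string $name): ?string
{
    // preg_quote() escapes any regex metacharacters in $name.
    $pattern = '#<meta\s+name="' . preg_quote($name, '#')
             . '"\s+content="(.*?)"\s*/?>#is';

    if (preg_match($pattern, $html, $match) === 1) {
        return $match[1];
    }

    return null;
}

// Illustrative input only.
$head = '<meta name="keywords" content="php, scraping" />'
      . '<meta name="description" content="A sample page">';

echo extract_meta($head, 'keywords'), "\n";    // php, scraping
echo extract_meta($head, 'description'), "\n"; // A sample page
```

PHP also ships a built-in get_meta_tags() function that reads meta tags directly from a file or URL, which may be simpler when the HTML is not already held in a string.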

Finally, we need to retrieve the list of links on the page. In my sample HTML document the links are enclosed within <li></li> tags. I will use these together with the <a></a> tags to extract the data. For this we need preg_match_all(), so that we can get back more than 1 result. We pass the parameters just as we did with preg_match().

preg_match_all('/<li><a href="(.*)">(.*)<\/a><\/li>/i', $file_string, $links);

With the above code we now have an array assigned to $links with all of the results. Notice that I used the metacharacters (.*) more than once this time. Because the data will not be the same from one link to the next, we need to let the script know that any set of characters may appear in each place. In the $links array, $links[1] holds the values found within href="" and $links[2] holds the text between the <a></a> tags.
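One caveat worth illustrating: (.*) is greedy, so the pattern above relies on each link sitting on its own line. If several links share a line, the lazy form (.*?) keeps them apart. The sketch below shows the difference on a made-up one-line list, and also shows the optional PREG_SET_ORDER flag, which groups the results per link instead of per capture group.

```php
<?php
// Illustrative input only: two links on a single line.
$html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>';

// Greedy version: both links collapse into a single, wrong match.
$greedy_count = preg_match_all('#<li><a href="(.*)">(.*)</a></li>#i', $html, $greedy);

// Lazy version: one match per link.
$lazy_count = preg_match_all('#<li><a href="(.*?)">(.*?)</a></li>#i', $html, $links);

echo $greedy_count, "\n"; // 1
echo $lazy_count, "\n";   // 2

// Default order: $links[1] = all hrefs, $links[2] = all link texts.
// PREG_SET_ORDER: one sub-array per link instead.
preg_match_all('#<li><a href="(.*?)">(.*?)</a></li>#i', $html, $sets, PREG_SET_ORDER);
foreach ($sets as $set) {
    echo $set[2], ' - ', $set[1], "\n";
}
```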

Now that we have all of the data we want to collect, we can simply print it out as follows:

Title: <?php echo $title_out; ?>
Keywords: <?php echo $keywords_out; ?>
Description: <?php echo $description_out; ?>
Links: (Name - Link)
<?php
echo '<ol>';
for ($i = 0; $i < count($links[1]); $i++) {
    echo '<li>' . $links[2][$i] . ' - ' . $links[1][$i] . '</li>';
}
echo '</ol>';
?>
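Scraped text comes from someone else's page, so it is generally a good idea to escape it before echoing it into your own HTML. Here is a minimal sketch, assuming $links has the shape produced by preg_match_all() above. The function name render_links() is hypothetical.

```php
<?php
/**
 * Hypothetical helper: builds the ordered list of "Name - Link" items,
 * escaping every scraped value before it is placed in the markup.
 */
function render_links(array $links): string
{
    $out = '<ol>';
    foreach ($links[1] as $i => $href) {
        $name = $links[2][$i];
        $out .= '<li>'
              . htmlspecialchars($name, ENT_QUOTES, 'UTF-8')
              . ' - '
              . htmlspecialchars($href, ENT_QUOTES, 'UTF-8')
              . '</li>';
    }

    return $out . '</ol>';
}

// Illustrative data in the same layout preg_match_all() returns.
$example = [[], ['/a?x=1&y=2'], ['A <b>']];
echo render_links($example), "\n";
// <ol><li>A &lt;b&gt; - /a?x=1&amp;y=2</li></ol>
```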

Attached are the files used in this tutorial. Let me know if you have any questions below.

Download This PHP Website Scraper Script

Post By: Frank Perez

56 Comments

Opt
Friday, July 22, 2011

nice its working fine. Here same tutorial on

Ahmad Shahwaiz
Thursday, September 29, 2011

Awesome tutorials , thanks a lot!

hank
Wednesday, October 19, 2011

Thanks, will use this on my website.

Does the link output example work in a table with for example 10 rows, how can i define such a loop for 10 times? to scrape only first 10 results from a page and put into my 10 row table.

kind regards hank

Frank Perez
Wednesday, October 19, 2011

You could just make the for loop stop after 10 records. I would add an if check, though, to make sure you have 10 or more records before looping to the static value of 10. You could also remove the <li> reference from the preg_match_all function to ensure you get all links on a page.
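A minimal sketch of that suggestion, assuming $links has the shape produced by preg_match_all() in the tutorial. Using min() caps the loop at 10 without needing a separate if check; the function name first_links() is hypothetical.

```php
<?php
/**
 * Hypothetical helper: returns at most $limit links as name/href rows,
 * ready to be printed into table rows.
 */
function first_links(array $links, int $limit = 10): array
{
    $rows  = [];
    // Never loop past the number of links actually found.
    $total = min($limit, count($links[1]));

    for ($i = 0; $i < $total; $i++) {
        $rows[] = ['name' => $links[2][$i], 'href' => $links[1][$i]];
    }

    return $rows;
}

// Illustrative data: three links, but only the first two are wanted.
$example = [[], ['/a', '/b', '/c'], ['A', 'B', 'C']];
foreach (first_links($example, 2) as $row) {
    echo '<tr><td>', $row['name'], '</td><td>', $row['href'], "</td></tr>\n";
}
```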

PHP scraper – regular expressions
Thursday, February 23, 2012

[...] trying to follow a tutorial for web scraping with [...]

sohel
Tuesday, March 6, 2012

Nice!! Tutorial.

Marin
Monday, March 26, 2012

Nice tutorial. I’m offering a free PHP web scraper for download here: http://code.google.com/p/universal-web-scraper/

Maybe people reading your tutorial will be interested in looking at its source code and learning how to write more advanced stuff.

Rupert
Friday, April 27, 2012

Thanks buddy for this.
I am planning to create my own data scrapper from PHP and I think this will be my starting point!

Regards,
Rupert

Noah John
Monday, April 30, 2012

Do you have some script that can scrape all backlinks from a website?

ROCKESH RONITH
Thursday, August 16, 2012

hello !!
Can any one help me….i trying to using this above code..but i unable to extract data from another website…..????

Frank Perez
Thursday, August 16, 2012

What issues are you having? Can you provide some more detail?

Rockesh Ronith
Saturday, August 18, 2012

hello! actually i need a data from other website url…like name,email,phone,address

so what should i do????

how can i write a script for scraping a data…..plz tell me………

Yan Da
Sunday, August 19, 2012

I cant seem to get the retrieving link to work
I think its because there are //s in it

Anonymous
Wednesday, November 14, 2012

[...] redes sociales Busco un modulo de redes sociales parecido al que se encuentra en esta pagina PHP Web Page Scraping Tutorial | DevBlog.co ojala puedan [...]

sumit sharma
Saturday, November 17, 2012

How to store html hierarchy in db when scraping using simple html dom

Abc
xyz

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

Abc-Again
xyz-Again

Suppose I scraped this HTML and I want to store it in the db according to its hierarchy.
Store in the db, at the first and second positions, the Abc and xyz anchor tags; at the third position, the image; at the fourth position, the text; at the fifth and sixth positions, the Abc-Again and xyz-Again anchor tags; at the seventh position, the image; and at the eighth position, the text.

Wadeski
Saturday, December 1, 2012

Hi there. Great tutorial. I’m trying to scrape a set of links within a target div. How am I able to do that. I can get the div, but not the ‘a’ tag inside the div.

Here is an example of what I’m trying to get.

Div Heading
Link

From that, I would like to extract the info between the href="".

Your help is appreciated.

Cheers!

ibnu
Monday, December 24, 2012

thank’s a lot, it works

Grant
Monday, January 21, 2013

It doesn’t work on http://im.storm8.com/home.php …why?

Frank Perez
Thursday, January 31, 2013

When I attempt to load the link you provided, it comes up blank in a browser.

Scraping data from various online stores
Tuesday, March 19, 2013

[...] tried a PHP scrape like [...]

Thomas
Monday, May 6, 2013

Hi!
Just a little sidenote:
HTML is not regular, hence you shouldn’t attempt to parse it using regular expressions.
Taking your example, let’s assume you do not have a -tag but rather a more sophisticated one, like … and within this you have equally interesting tags with tons of additional attributes, e.g. data-store_for_js.
At this point the RegEx approach will not work anymore or will become very complex and unmanageable.
I would recommend DOMDocument or an XML parser.
sincerely

bhavesh
Saturday, June 29, 2013

hi dear,

i have a some scrapping example but i want to scrap perticular part of website so what can i do ? please help me.
i have used example
http://lessonsone.blogspot.in/2013/06/how-to-web-scrapping-in-php.html

james
Thursday, July 11, 2013

Excellent PHP web scraping ebook here:

https://leanpub.com/web-scraping

Ali Baig
Thursday, August 8, 2013

Excellent..

Click4Joomla
Friday, September 6, 2013

Very good post….
For web scraping implementation contact http://www.click4joomla.com

Dan
Friday, September 6, 2013

Good tutorial, simple and informative, thank you

Grosir
Friday, December 20, 2013

Very good post….

ARIES
Monday, February 24, 2014

what ever

jane
Monday, February 24, 2014

it’s not working hahaha

vijay
Thursday, March 20, 2014

I read your post completely
i am trying in my hosting but not working for external urls
its working for the file u provided. i want to know how to extract data from external website
i tried in my site http://www.lic24.in

kisan
Saturday, April 12, 2014

i have used this sites for data scraping using dom curl.

Web Scraping
Monday, July 28, 2014

very nice tutorial….i m gonna try this….thanks for sharing…

Ashish Sharma
Friday, September 5, 2014

Thanks its really cool but i want to scrape data from the text box and select menu can you suggest me some regular expressions

Muhammad Mahad Azad
Monday, December 8, 2014

try using:
https://github.com/mahadazad/page-scrapper

Jonathan
Tuesday, March 10, 2015

This is a great tutorial for getting started, but I learned the hard way today that you are not supposed to parse HTML with Regex. I’m only saying this so nobody else ends up wasting a day of work and then faces senseless ridicule after asking a peer for help when they’ve finally gotten frustrated with trying to parse more complicated elements (like with specific attributes).

If you’re going to parse HTML, use something like XPath or DOMDocument in PHP.
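For readers who want to try that route, here is a minimal sketch using PHP's bundled DOMDocument and DOMXPath classes (it needs the dom extension, which is usually enabled). The HTML string and the "nav" id are made up for illustration.

```php
<?php
// Illustrative input only; the id "nav" is an assumption for this sketch.
$html = '<html><head><title>Sample</title>'
      . '<meta name="keywords" content="php, dom"></head>'
      . '<body><div id="nav">'
      . '<a href="/a" class="x">First</a><a href="/b">Second</a>'
      . '</div></body></html>';

// Real-world HTML often triggers parser warnings; collect them quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// string(...) makes evaluate() return plain text instead of a node list.
$title    = $xpath->evaluate('string(//title)');
$keywords = $xpath->evaluate('string(//meta[@name="keywords"]/@content)');

// Every <a> inside the target div, regardless of its other attributes.
$hrefs = [];
foreach ($xpath->query('//div[@id="nav"]//a') as $anchor) {
    $hrefs[] = $anchor->getAttribute('href');
}

echo $title, "\n";              // Sample
echo $keywords, "\n";           // php, dom
echo implode(', ', $hrefs), "\n"; // /a, /b
```

Because the parser understands tags and attributes, extra attributes or a different attribute order do not break the lookup the way they can with a fixed regular expression.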

Kerem
Wednesday, April 1, 2015

Thanks for posting. Here is a video tutorial http://goo.gl/PJScH6

Mark S
Monday, May 11, 2015

Great tutorial to get one started…

But as Thomas mentioned in May 2013, not all web pages are so clean to allow for easy parsing. Some projects may be much more complicated due to the formatting of the output. Good luck to all.

mashhood
Thursday, August 27, 2015

Hello. i need some help with it, Any body contact me on mail, thanks

PHP scraper – regular expressions – Tech Magazine
Sunday, October 25, 2015

[…] trying to follow a tutorial for web scraping with […]

Amir shah
Thursday, December 17, 2015

i try best to get the data but script does not work. i need your help
Thanks

pk
Sunday, May 28, 2017

some sites such as stackoverflow have complex and long tags, i doubt your code would work. I need to extract heading tags from external websites, any help?
