dilipwanave
Get page listing of the site by using website URL.
Jan 25, 2013 06:53 AM
Hi All,
I want to fetch all the pages of a site whose URL I enter.
For example, if I enter http://www.microsoft.com, I should get all the pages from that site, each as a full URL, e.g. http://www.microsoft.com/en-in/default.aspx.
I want to do this in ASP.NET code-behind. If anyone knows how to do that, please help me.
Thanks in advance.
Dilip Wanave
gerrylowry
Re: Get page listing of the site by using website URL.
Jan 25, 2013 07:34 AM
@dilipwanave, welcome to forums.asp.net
TIMTOWTDI = there is more than one way to do it
Effectively, what you would be doing is crawling the pages. Google, Bing, and other search engines do this to mine URLs; spammers also do this to mine e-mail addresses.
What you are asking is a complex task.
"ALL" can be a huge number of pages!
Many of those pages would be duplicates or near duplicates.
Example:
MSDN Library: http://msdn.microsoft.com
You would have nearly identical pages for .NET Framework 4.5, 4.0, 3.0, ...
The same goes for VS2012, VS2010, VS2008, ...
You would get all of the MSDN pages mentioned above, plus TechNet, plus social.microsoft.com, et cetera:
probably millions of pages.
http://amazon.com
Search for just "country" in the category "books" and you should get over 240 thousand results.
Here is a partial count of over 10 million book titles at amazon.com (I've omitted categories with fewer than 500 thousand titles):
Business & Investing (796,063)
Children's Books (1,094,377)
Education & Reference (1,110,954)
History (868,538)
Literature & Fiction (1,667,806)
Politics & Social Sciences (1,198,974)
Professional & Technical (2,127,087)
Science & Math (1,223,781)
French Books (Livres français) (1,546,914)
For each of the above titles you would get a page URL. Example: http://www.amazon.com/Ping-ebook/dp/B0058UW9H4
MORE INFORMATION
dilipwanave, do you really want all of the pages from a website? You'd be getting too much noise. Also, you cannot get all of the pages from a website: many of those pages may require authentication, other pages might have URLs that are known only to the employees of a company, and others may be located in hidden directories, et cetera.
To be successful you need to be aware of the DOM (Document Object Model); see, for example, http://en.wikipedia.org/wiki/Document_Object_Model.
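As a rough illustration of pulling links out of a page and turning them into full URLs, here is a minimal sketch. It is written in Python with only the standard library to keep it short; the thread asks for ASP.NET, so treat it as an outline of the idea rather than drop-in code. The names `LinkCollector` and `extract_links` are made up for this example, and it assumes links of interest are plain `<a href="...">` tags in the served HTML (links built by JavaScript would be missed).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag as an absolute http(s) URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop "#fragment".
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Skip mailto:, javascript: and other non-web schemes.
                if urlparse(absolute).scheme in ("http", "https"):
                    self.links.append(absolute)


def extract_links(base_url, html):
    """Return the absolute URLs linked from the given HTML text."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


sample = '<a href="/en-in/default.aspx">Home</a> <a href="about.aspx#team">About</a>'
print(extract_links("http://www.microsoft.com/en-in/", sample))
```

The sample HTML is invented; on a real page you would pass in the downloaded markup and the URL it came from.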
You can find many interesting articles here: http://lmgtfy.com/?q=how+to+write+a+web+site+crawler+in+c%23 "how to write a web site crawler in c#", Google search. Example: http://www.codeproject.com/Articles/13486/A-Simple-Crawler-Using-C-Sockets "A Simple Crawler Using C# Sockets".
g.
Paul Linton
Re: Get page listing of the site by using website URL.
Jan 27, 2013 04:22 AM
You can't do it.
It is quite reasonable for a web site to have a page called www.domain.com/NoLinksToThisPage which does not have any links to it from any other page on any website on the internet. The only way to get to this page would be to type in the full URL.
Having said that, I'm guessing you don't care. The simple answer is to have two lists: the first list, called unvisited, is of the pages that you have not yet visited but that you know exist, and the second list, called visited, is of pages that you have visited. You initialise the first list with the page that you are interested in (http://www.microsoft.com in your example). Then:
1 - Retrieve the first page on the unvisited list and move it to the visited list - if the unvisited list is empty then you are done
2 - Find all the links on that page; if they do not appear on either list and they are for this domain then add them to the unvisited list
3 - Go to 1
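The steps above can be sketched roughly as follows. This is Python rather than ASP.NET code-behind, purely to keep the sketch short. The function name `crawl`, the `get_links` callback (which is expected to download a page and return the absolute URLs it links to), and the `max_pages` safety cap are all assumptions added for this example; they are not part of the description above.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start_url, get_links, max_pages=100):
    """Crawl one host starting at start_url and return the pages found.

    get_links(url) is a caller-supplied (hypothetical) function that
    returns the absolute URLs linked from that page.
    """
    host = urlparse(start_url).netloc
    unvisited = deque([start_url])  # pages we know about but have not fetched
    visited = set()                 # pages we have already fetched

    # Step 1: take the first unvisited page; stop when none are left
    # (or when the safety cap is reached).
    while unvisited and len(visited) < max_pages:
        url = unvisited.popleft()
        if url in visited:
            continue
        visited.add(url)

        # Step 2: queue links that are on the same host and not yet seen.
        for link in get_links(url):
            same_site = urlparse(link).netloc == host
            if same_site and link not in visited and link not in unvisited:
                unvisited.append(link)
        # Step 3: loop back to step 1.
    return sorted(visited)


# A tiny in-memory "site" standing in for real HTTP requests.
fake_site = {
    "http://example.com/": ["http://example.com/a", "http://other.org/"],
    "http://example.com/a": ["http://example.com/", "http://example.com/b"],
    "http://example.com/b": [],
}
pages = crawl("http://example.com/", lambda url: fake_site.get(url, []))
print(pages)
```

Passing the link-fetching step in as a function keeps the two-list logic separate from downloading and parsing, so the loop can be tried out without touching the network. On a real site the cap matters: without it the loop may run for a very long time.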
You should read the robots.txt file and obey any requests in there.
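One way to check robots.txt rules, sketched in Python with the standard library's `urllib.robotparser` (again, not ASP.NET; the idea carries over). The helper name `build_robot_rules`, the user-agent string "MyCrawler", and the sample rules are invented for illustration. The sketch parses rules from a string so it runs offline; against a live site you would normally call `set_url(...)` and `read()` on the parser instead.

```python
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


# Invented sample rules: everything is allowed except /private/.
sample_robots = """\
User-agent: *
Disallow: /private/
"""

rules = build_robot_rules(sample_robots)
# Ask before fetching each URL; skip the page when the answer is False.
print(rules.can_fetch("MyCrawler", "http://example.com/index.html"))
print(rules.can_fetch("MyCrawler", "http://example.com/private/page.html"))
```

In a crawler, the `can_fetch` check would sit just before a URL is added to the unvisited list or just before it is downloaded.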
But, even though simple to state, it is a huge job. I will ask the question that I often pose: "What are you really trying to do?" It seems unlikely in the extreme that you really need to take a copy of every Office, Xbox, MSDN, Server, Exchange, etc. page on the Microsoft site.
Don't forget to visit the web site again in one hour's time, as many pages may have changed!