Get all URLs from a web page – Simple Web Crawler

Here I’m going to post a class that extracts all valid URLs from a web page. The example can be treated as a basic web crawler. My class uses the “URLConnectionReader” class provided by the Sun tutorial.

The class defines two constructors:

  1. The default one returns a vector containing only the text/html URL objects from the page.
  2. The other lets you specify the type of URLs you want from a page. This is helpful when you want all the image, video, or other media URLs.
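To make the two-constructor idea concrete, here is a minimal sketch of how such a filter could be laid out. It is not the original class: the name `UrlFinderSketch`, its methods, and the regex-based link matching are all assumptions made for illustration, and it returns a `List` rather than a vector.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the original URLFinder.
class UrlFinderSketch {
    // Naive href matcher; a real HTML parser would be more robust.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    private final String contentType;

    // Default: keep only links that are served as text/html.
    UrlFinderSketch() {
        this("text/html");
    }

    // Keep only links whose Content-Type starts with the given value, e.g. "image/".
    UrlFinderSketch(String contentType) {
        this.contentType = contentType.toLowerCase();
    }

    // Pull the raw href values out of HTML that has already been downloaded.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1).trim());
        }
        return hrefs;
    }

    // True when a header such as "text/html; charset=UTF-8" matches the filter.
    boolean accepts(String contentTypeHeader) {
        return contentTypeHeader != null
                && contentTypeHeader.toLowerCase().startsWith(contentType);
    }

    // Opens a connection to each absolute URL and keeps those with a
    // matching content type. This part needs network access.
    List<String> filter(List<String> absoluteUrls) {
        List<String> kept = new ArrayList<>();
        for (String u : absoluteUrls) {
            try {
                URLConnection conn = URI.create(u).toURL().openConnection();
                if (accepts(conn.getContentType())) {
                    kept.add(u);
                }
            } catch (IOException | IllegalArgumentException e) {
                // Skip links that cannot be parsed or reached.
            }
        }
        return kept;
    }
}
```

With this layout, `new UrlFinderSketch()` keeps only HTML pages, while something like `new UrlFinderSketch("image/")` would keep only images.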

The class also handles relative URLs: it returns them with the scheme (http) and host name prefixed.
E.g., if the page contains a URL like “/about.php”, the class will return “http://hostname.domain/about.php”.
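One way to get this behaviour in Java is to resolve each link against the URL of the page it was found on, using `java.net.URI`. The snippet below is only a sketch of that idea; the class and method names (`RelativeUrlDemo`, `toAbsolute`) are made up for illustration and the original class may do this differently.

```java
import java.net.URI;

// Hypothetical helper, not part of the original class.
class RelativeUrlDemo {
    // Resolve an href found on a page against the URL of that page.
    // Absolute hrefs come back unchanged; relative ones get the page's
    // scheme and host.
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        System.out.println(toAbsolute("http://hostname.domain/index.php", "/about.php"));
    }
}
```

Note that `URI.create` throws `IllegalArgumentException` for malformed input, so a crawler would probably want to catch that and skip the offending link.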

The URLFinder

Usage

This will get you all the URLs from any web page, so it’s pretty simple to come up with your own basic version of a web crawler. I am sure you will be able to build something more on top of this.
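As a starting point for that, a crawler loop can be as small as a queue of pages to visit plus a set of URLs already seen. The sketch below is an assumption about how you might wire it up: `CrawlSketch` and `crawl` are hypothetical names, and `linksOf` stands in for whatever returns the URLs found on a page.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical crawl loop, not part of the original post.
class CrawlSketch {
    // Breadth-first crawl: visit each URL at most once, up to maxPages pages.
    // linksOf is a stand-in for the code that extracts URLs from a page.
    static List<String> crawl(String start, int maxPages,
                              Function<String, List<String>> linksOf) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String page = queue.poll();
            visited.add(page);
            for (String link : linksOf.apply(page)) {
                // Set.add returns false for URLs already queued or visited.
                if (seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return visited;
    }
}
```

The `maxPages` limit is there because real sites link to each other endlessly; without some bound the loop may never finish.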

Happy Sharing!!
