
Webcrawler

Challenge

In a language of your choice, implement a simple web crawler that takes a news website as input (e.g. http://www.spiegel.de) and crawls the HTML content of up to 100 pages of that site with a breadth-first approach. The downloaded pages should be stored as HTML in a folder in the file system. The crawler needs to be able to work with up to 50 parallel processes. The number of processes can be passed as a parameter. If no input is given, the default value shall be 5 processes.
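
For orientation, here is a minimal single-process sketch of such a breadth-first traversal. The FIFO queue ensures pages closer to the start URL are fetched first; the variable names and the naive link-extraction regex are illustrative, not taken from this repository:

    <?php
    // Minimal BFS sketch (illustrative only, not this repo's code).
    $startUrl = 'http://www.spiegel.de';
    $maxPages = 100;
    $queue    = [$startUrl]; // FIFO frontier
    $visited  = [];          // URLs already fetched
    $outDir   = 'pages';

    if (!is_dir($outDir)) {
        mkdir($outDir, 0777, true);
    }

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue); // dequeue from the front: breadth-first order
        if (isset($visited[$url])) {
            continue;
        }
        $html = @file_get_contents($url);
        if ($html === false) {
            continue;
        }
        $visited[$url] = true;
        file_put_contents($outDir . '/' . md5($url) . '.html', $html);

        // Enqueue same-site links found on this page (naive extraction).
        if (preg_match_all('/href="(http[^"#]+)"/i', $html, $m)) {
            $host = parse_url($startUrl, PHP_URL_HOST);
            foreach ($m[1] as $link) {
                if (parse_url($link, PHP_URL_HOST) === $host && !isset($visited[$link])) {
                    $queue[] = $link;
                }
            }
        }
    }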

Language Used

  • PHP

Solution Installation

  • Clone this repository: git clone https://github.com/Oyelamin/Webcrawler.git
  • Install the dependencies: composer install

Now you can run the code.

Solution Usage

  1. You can either run the solution from the index.php file or use it inside your own PHP application; as long as the Composer dependencies are installed, you are good to go.

  2. To initialise the Crawl class, you first need to import it into your file or controller class, e.g.:

    use WebCrawler\Crawl;

  3. Declare your basic inputs, e.g.:

    $websiteUrl = 'http://www.spiegel.de'; // Any URL of your choice - Required
    $maxPages = 10; // Maximum number of pages to crawl - Optional
    $maxProcesses = 5; // Number of parallel processes (see the note after these steps) - Optional
    $folderName = "MyCustompages"; // Output folder name - Optional
    $fileExtension = "html"; // txt, htm, css, etc. - Optional

  4. Pass your declared inputs to the Crawl class on initialisation:

    $crawl = new Crawl($websiteUrl, $maxPages, $maxProcesses, $folderName, $fileExtension);

  5. Execute the program:

    return $crawl->execute(); // run { php index.php }
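
A note on the $maxProcesses parameter above: the challenge calls for up to 50 parallel processes. A common way to cap concurrent workers in PHP is pcntl_fork; the sketch below illustrates that general technique under the assumption of a prepared URL list, and is not a description of this repository's internals (it also requires the pcntl extension, which is only available in the CLI SAPI):

    <?php
    // Sketch: cap concurrent fetch workers at $maxProcesses using pcntl_fork.
    // Illustrative only; not necessarily how this repository does it.
    $maxProcesses = 5;
    $urls = ['http://www.spiegel.de/a', 'http://www.spiegel.de/b']; // hypothetical frontier
    $children = 0;

    foreach ($urls as $url) {
        if ($children >= $maxProcesses) {
            pcntl_wait($status); // block until one worker finishes
            $children--;
        }
        $pid = pcntl_fork();
        if ($pid === -1) {
            continue;            // fork failed; skip this URL
        }
        if ($pid === 0) {        // child process: fetch one page, then exit
            @file_get_contents($url);
            exit(0);
        }
        $children++;             // parent: count the new worker
    }

    while ($children-- > 0) {    // reap any remaining workers
        pcntl_wait($status);
    }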

Example

This is an example of how you can run it:

    <?php

    require 'vendor/autoload.php';

    use WebCrawler\Crawl;

    $websiteUrl = 'http://www.spiegel.de'; // Any URL of your choice - Required
    $maxPages = 10; // Maximum number of pages to crawl - Optional
    $maxProcesses = 5; // Number of parallel processes - Optional
    $folderName = "MyCustompages"; // Output folder name - Optional
    $fileExtension = "html"; // txt, htm, css, etc. - Optional
    $crawl = new Crawl($websiteUrl, $maxPages, $maxProcesses, $folderName, $fileExtension);

    return $crawl->execute(); // run { php index.php } to execute

    // THANK YOU and I hope you enjoyed the code ❤ 🤗!
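
Since everything except $websiteUrl is optional, a minimal invocation presumably also works; per the challenge statement, the process count should then default to 5. The behaviour of the remaining defaults is an assumption here, so check the Crawl class for the actual values:

    <?php

    require 'vendor/autoload.php';

    use WebCrawler\Crawl;

    // Minimal invocation: only the URL is required. The challenge specifies
    // a default of 5 processes; the other defaults are assumptions.
    $crawl = new Crawl('http://www.spiegel.de');
    return $crawl->execute();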
