This repository has been archived by the owner on Mar 21, 2020. It is now read-only.

Bash script that creates a URL seed list with URLs included in a generic file

giuseppetotaro/create_seed

create_seed

This repository contains bash scripts that extract URLs from files or web pages by pattern matching and write them to a seed list. The following scripts are provided:

  • create_seed.sh creates a URL seed list from the URLs found in a generic file. It uses Apache Tika (specifically, the tika-app jar file) to extract the textual content of the input file, and the Unix grep command to search that text for URLs matching the specified pattern. The script requires the tika-app jar file; the latest stable release can be downloaded online.

  • create_backpage_seed.sh creates a URL seed list from the (US) URLs extracted from the backpage.com web page. Each URL is combined with the keywords listed in the given file. This script uses the Unix command-line tools sed and grep.

All of these scripts produce a list of URLs (one URL per line) that can be used as a seed list for Apache Nutch.

Getting Started

create_seed.sh is a bash script that extracts URLs from the text of a generic file, using Tika to obtain the text and grep to find the URLs.

To launch create_seed.sh, use the following command:

./create_seed.sh -i /path/to/input -o /path/to/output

By default, the script looks for the tika-app jar file in the current folder. Optionally, a different path to tika-app can be specified with the -p command-line option:

./create_seed.sh -i /path/to/input -o /path/to/output -p /path/to/tika-app
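The flow described above (parse the options, extract text with Tika, filter URLs with grep) could look roughly like the sketch below. It is not the repository's actual script: the function names extract_urls and make_seed, the default jar name ./tika-app.jar, and the deduplication with sort -u are assumptions made for illustration. The --text option of tika-app prints the extracted plain text to standard output.

```shell
#!/usr/bin/env bash
# Sketch only: not the repository's actual create_seed.sh.

# URL pattern quoted from the "Pattern matching" section of this README.
URL_REGEX='\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=_|!:,.;]*[-A-Za-z0-9+&@#/%=_|]'

# Read plain text on stdin and print each distinct URL once, one per line.
# (Deduplicating with sort -u is an addition of this sketch.)
extract_urls() {
  grep -ioE "$URL_REGEX" | sort -u
}

# Parse -i/-o/-p, run Tika in text mode, and write the seed list.
make_seed() {
  local input="" output="" opt
  local tika="./tika-app.jar"   # assumed default jar name in the current folder
  local OPTIND=1
  while getopts "i:o:p:" opt; do
    case "$opt" in
      i) input=$OPTARG ;;
      o) output=$OPTARG ;;
      p) tika=$OPTARG ;;
      *) echo "usage: make_seed -i input -o output [-p tika-app.jar]" >&2
         return 2 ;;
    esac
  done
  if [ -z "$input" ] || [ -z "$output" ]; then
    echo "missing -i or -o" >&2
    return 2
  fi
  # --text makes tika-app print the extracted plain text to stdout.
  java -jar "$tika" --text "$input" | extract_urls > "$output"
}
```

With these functions loaded, `make_seed -i /path/to/input -o /path/to/output` would mirror the command shown above, and `extract_urls` can be tried on its own by piping any text into it.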

create_backpage_seed.sh is a bash script that extracts URLs from the backpage.com web page, combining US addresses with a list of keywords.

To launch create_backpage_seed.sh, use the following command:

./create_backpage_seed.sh /path/to/keywords /path/to/output
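The combination step can be pictured as pairing every extracted URL with every keyword. The sketch below illustrates only that pairing: the function name combine_urls, the two-file interface, and the way a keyword is joined to a URL (appended as a path segment) are all hypothetical, since this README does not specify the format the real script produces.

```shell
#!/usr/bin/env bash
# Sketch only: the join format below is an assumption,
# not the format used by create_backpage_seed.sh.

# combine_urls BASE_FILE KEYWORD_FILE
# Prints one line per (base URL, keyword) pair.
combine_urls() {
  local bases=$1 keywords=$2 base kw
  while IFS= read -r base; do
    [ -n "$base" ] || continue          # skip blank lines
    while IFS= read -r kw; do
      [ -n "$kw" ] || continue
      # Hypothetical format: drop a trailing slash from the base URL,
      # then append the keyword as a path segment.
      printf '%s/%s\n' "${base%/}" "$kw"
    done < "$keywords"
  done < "$bases"
}
```

Under these assumptions, `combine_urls urls.txt keywords.txt > seed.txt` would write one combined URL per line, which matches the one-URL-per-line layout of a seed list.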

Pattern matching

Pattern matching is performed using the Unix grep command. Specifically, URLs are detected using the following regular expression:

grep -ioE '\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=_|!:,.;]*[-A-Za-z0-9+&@#/%=_|]'

Searching for different URL patterns is possible by changing the regular expression.
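As an illustration of changing the pattern, the sketch below runs the original expression and a narrowed variant over the same sample text. The sample sentence and its URLs are made up, and the variant (keeping only http and https) is just one possible change. Note that \b is an extension supported by GNU grep and may not be available in every grep implementation.

```shell
#!/usr/bin/env bash
# Sample text; the URLs are made up for illustration.
sample='Docs at https://example.org/docs, mirror at ftp://example.org/pub and file:///tmp/notes.txt.'

# Original pattern from this README: matches http, https, ftp and file URLs.
all_urls=$(printf '%s\n' "$sample" |
  grep -ioE '\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=_|!:,.;]*[-A-Za-z0-9+&@#/%=_|]')

# Variant: only the scheme alternation is changed, so just web URLs are kept.
web_urls=$(printf '%s\n' "$sample" |
  grep -ioE '\bhttps?://[-A-Za-z0-9+&@#/%?=_|!:,.;]*[-A-Za-z0-9+&@#/%=_|]')

printf 'all:\n%s\n\nweb only:\n%s\n' "$all_urls" "$web_urls"
```

The final character class in both patterns excludes punctuation such as commas and periods, so a URL that ends a clause or sentence is matched without the trailing punctuation mark.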
