This program does the following:
- Imports the necessary modules: requests to send HTTP requests, urljoin from urllib.parse to join URLs, BeautifulSoup to parse HTML, time to delay execution, csv to write data to a CSV file, os to read environment variables, and dotenv to load settings from a .env file.
- Loads settings from the .env file with load_dotenv().
- Sets the last_page variable to 91, the number of listing pages to process.
- Sets the base URL and creates an empty firms_data list to store the firm data.
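Based on those steps, the setup might look like the sketch below. The base URL is a placeholder, since the real listing URL isn't given in the description.

```python
import csv
import os
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv

load_dotenv()  # pull settings from the .env file into the environment

last_page = 91  # number of listing pages to process
base_url = "https://example.com/companies"  # placeholder; the real URL isn't shown
firms_data = []  # accumulates one dict per firm
```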
- Gets the proxy server credentials from environment variables.
- Specifies the proxy server settings, including host, port, and credentials.
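A sketch of that proxy configuration; the environment variable names are assumptions, since the actual .env keys aren't shown.

```python
import os

# Hypothetical .env keys; substitute whatever names the real script uses.
proxy_user = os.getenv("PROXY_USER")
proxy_pass = os.getenv("PROXY_PASS")
proxy_host = os.getenv("PROXY_HOST")
proxy_port = os.getenv("PROXY_PORT")

# Standard requests proxy mapping with the credentials embedded in the URL.
proxies = {
    "http": f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}",
    "https": f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}",
}
```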
- Creates a requests.Session(), configures it to route requests through the proxy server, and disables SSL certificate verification.
- Disables warnings about insecure requests with requests.packages.urllib3.disable_warnings().
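The session setup could look like this; session.proxies and session.verify are the standard requests attributes for these two settings.

```python
import requests

session = requests.Session()
session.proxies.update(proxies)  # route every session request through the proxy
session.verify = False           # skip SSL certificate verification

# Suppress the InsecureRequestWarning that verify=False would otherwise emit.
requests.packages.urllib3.disable_warnings()
```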
- Creates a global counter variable to track the number of firms processed.
- Starts a loop that iterates through the pages up to last_page.
- Builds the URL of the current page and sends a GET request for its HTML using requests.get().
- Pauses for 5 seconds between requests; requests returns the full response before continuing, so the pause acts as a rate limit rather than a wait for the page to load.
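Continuing the setup sketch above, the paging loop might look like the following, assuming pages are selected with a ?page= query parameter (the real URL scheme isn't shown):

```python
import time

import requests

counter = 0  # firms processed so far

for page in range(1, last_page + 1):
    page_url = f"{base_url}?page={page}"  # assumed pagination scheme
    response = requests.get(page_url)     # per the description, the listing
                                          # pages use requests.get(); the firm
                                          # pages below go through the session
    time.sleep(5)                         # throttle between requests
    ...
```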
- Uses BeautifulSoup to parse the HTML and extract the links to the firms.
- Starts an inner loop that iterates over the firm links.
- Loads each firm's page with session.get() and pauses for another 5 seconds.
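Link extraction depends entirely on the page's markup; the selector below is a guess that shows the shape of the code, not the real class name.

```python
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")

# "a.company-link" is a hypothetical selector; the actual markup isn't shown.
firm_links = [urljoin(base_url, a["href"]) for a in soup.select("a.company-link")]

for firm_url in firm_links:
    firm_response = session.get(firm_url)  # firm pages go through the proxy session
    time.sleep(5)                          # throttle between requests
    ...
```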
- Uses BeautifulSoup to extract the firm's data, such as its name, description, website, and LinkedIn link.
- Appends the firm's data to the firms_data list.
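The extraction itself is selector-specific; every selector here is an assumption standing in for whatever the real script uses.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(firm_response.text, "html.parser")

def first_text(selector):
    """Return the stripped text of the first matching element, or None."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else None

website = soup.select_one("a.website-link")            # hypothetical selector
linkedin = soup.select_one("a[href*='linkedin.com']")  # any link to linkedin.com

firms_data.append({
    "name": first_text("h1"),                   # assumed to hold the firm name
    "description": first_text(".description"),  # hypothetical selector
    "website": website["href"] if website else None,
    "linkedin": linkedin["href"] if linkedin else None,
})
```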
- Displays progress information.
- Increments the counter.
- Checks the loop's exit condition.
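One plausible reading of these three steps, with the exit check hedged because the description doesn't spell it out:

```python
counter += 1
print(f"Processed firm {counter}")  # simple progress output

# The exact exit condition isn't described; a natural one is stopping once
# every page up to last_page has been handled, which the for loop already does.
```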
- Opens the dealroom_data.csv file to write the results in CSV format.
- Creates a csv.DictWriter object to write the data.
- Writes the header row (the column names) with writer.writeheader().
- Writes each firm's data to the CSV file in a for loop.
- Closes the file.
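The CSV stage might look like the following; the column names are assumptions matching the fields sketched earlier.

```python
import csv

fieldnames = ["name", "description", "website", "linkedin"]  # assumed columns

f = open("dealroom_data.csv", "w", newline="", encoding="utf-8")
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()       # header row with the column names
for firm in firms_data:
    writer.writerow(firm)  # one row per firm
f.close()                  # explicit close, matching the final step
```

A with-statement (`with open(...) as f:`) would close the file automatically and is the more idiomatic choice.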