Add SDK support for web scraping #30
Conversation
Caution: Review failed. The pull request is closed.

Walkthrough

This update introduces significant enhancements to URL and sitemap handling in the catalog system and adds new test scripts for scraping.
Actionable comments posted: 0
Outside diff range, codebase verification and nitpick comments (4)
scraping.test.ts (4)
8-8: Add a comment explaining the choice of the value 4.

It would be helpful to add a comment explaining why the value 4 is chosen for `expectedSitemapUrls`.

```diff
+ // The value 4 is chosen because <explanation>
  const expectedSitemapUrls = 4;
```
10-49: Add more detailed logging for debugging purposes.

The test for scraping a single URL is correctly implemented, but adding more detailed logging would help in debugging.

```diff
+ console.log(`Configured catalog: ${catalogName}`);
+ console.log(`Upserted document: ${docs[0].url}`);
  while (!docsFound) {
    const docCount = await catalog.documentCount();
    if (docCount === 1) {
      docsFound = true;
    } else {
      console.log("no docs found. sleeping...");
      await sleep(5000);
    }
  }
+ console.log(`Document count after upsert: ${docCount}`);
```
51-87: Add more detailed logging for debugging purposes.

The test for scraping a sitemap is correctly implemented, but adding more detailed logging would help in debugging.

```diff
+ console.log(`Configured catalog: ${catalogName}`);
+ console.log(`Upserted sitemap document: ${docs[0].sitemapUrl}`);
  while (!docsFound) {
    const docCount = await catalog.documentCount();
    if (docCount === expectedSitemapUrls) {
      docsFound = true;
    } else {
      console.log(`${docCount} docs found. sleeping...`);
      await sleep(5000);
    }
  }
+ console.log(`Document count after upsert: ${docCount}`);
```
89-151: Add more detailed logging for debugging purposes.

The test for isolation of scraping multiple catalogs at once is correctly implemented, but adding more detailed logging would help in debugging.

```diff
+ console.log(`Configured catalogs: ${catalogName1}, ${catalogName2}, ${catalogName3}, ${catalogName4}, ${catalogName5}`);
+ console.log(`Upserted sitemap document: ${docs[0].sitemapUrl}`);
  while (!docsFound) {
    const catalog1Count = await catalog1.documentCount();
    const catalog2Count = await catalog2.documentCount();
    const catalog3Count = await catalog3.documentCount();
    const catalog4Count = await catalog4.documentCount();
    const catalog5Count = await catalog5.documentCount();
    if (
      [
        catalog1Count,
        catalog2Count,
        catalog3Count,
        catalog4Count,
        catalog5Count,
      ].every((e) => e === 4)
    ) {
      docsFound = true;
    } else {
      await sleep(5000);
    }
  }
+ console.log(`Document counts after upsert: ${catalog1Count}, ${catalog2Count}, ${catalog3Count}, ${catalog4Count}, ${catalog5Count}`);
```
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (5)
- .npmignore (1 hunks)
- catalog.ts (5 hunks)
- document.ts (2 hunks)
- package.json (1 hunks)
- scraping.test.ts (1 hunks)
Files skipped from review due to trivial changes (1)
- .npmignore
Additional comments not posted (17)
package.json (3)

38-38: LGTM! The script for running scraping tests in the development environment is correctly configured.

39-39: LGTM! The script for running scraping tests in the production environment is correctly configured.

37-37: LGTM! The script for running indexer tests in the development environment is correctly configured.
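The scripts themselves are not reproduced in this excerpt. Purely as a hypothetical illustration of env-gated test scripts of the kind described (every script name, env-var name, and runner invocation below is an assumption, not the repo's actual configuration):

```json
{
  "scripts": {
    "test:indexer:dev": "RUN_INDEXER_TESTS=true node --test",
    "test:scraping:dev": "RUN_SCRAPER_TESTS=true node --test scraping.test.ts",
    "test:scraping:prod": "RUN_SCRAPER_TESTS=true node --test scraping.test.ts"
  }
}
```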
document.ts (6)

4-4: LGTM! The type definition for `UrlContentType` is correct.

5-5: LGTM! The type definition for `SitemapContentType` is correct.

10-15: LGTM! The type definition for `ContentType` is correctly extended to include `UrlContentType` and `SitemapContentType`.

17-20: LGTM! The type definition for `UrlDocument` is correct.

22-25: LGTM! The type definition for `SitemapDocument` is correct.

51-56: LGTM! The type definition for `DocumentBatch` is correctly extended to include arrays of `UrlDocument` and `SitemapDocument`.

scraping.test.ts (2)
6-6: LGTM! The variable `runScraperTests` is correctly set based on the environment variable `RUN_SCRAPER_TESTS`.

154-158: LGTM! The utility function `sleep` is correctly implemented.

catalog.ts (6)
139-140: LGTM! The variables `hasUrl` and `hasSitemapUrl` are correctly added to the `Catalog` class.

153-155: LGTM! The case for the `url` content type is correctly added to the batch processing logic.

156-158: LGTM! The case for the `sitemap-url` content type is correctly added to the batch processing logic.

167-172: LGTM! The conditional for mixed content types is correctly updated to include the `url` and `sitemap-url` content types.

195-195: LGTM! The conditional for upsert success is correctly updated to check whether the response status is greater than 202.
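The status check described in that last comment can be illustrated with a small sketch. The helper name is hypothetical; `catalog.ts` itself is not shown here:

```typescript
// Hypothetical helper illustrating the check the review describes:
// any status above 202 is treated as an upsert failure, so 200 (OK),
// 201 (Created), and 202 (Accepted) all count as success.
function upsertSucceeded(status: number): boolean {
  return !(status > 202);
}

console.log(upsertSucceeded(202)); // true
console.log(upsertSucceeded(404)); // false
```

This treats the async 202 Accepted response (natural for scraping, which completes later) the same as a synchronous success.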
292-303: LGTM! The function `mapBatch` is correctly updated to handle the `url` and `sitemap-url` content types.
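As a rough illustration of what routing the two new content types might look like: the `url`/`sitemapUrl` fields follow the review comments, but the payload shape and function body below are assumptions, not the SDK's actual `mapBatch` implementation:

```typescript
// Illustrative only: route documents to payloads by content type.
// UrlDocument/SitemapDocument mirror the types discussed for document.ts;
// the returned record shape is a guess, not the real wire format.
type UrlDocument = { contentType: "url"; url: string };
type SitemapDocument = { contentType: "sitemap-url"; sitemapUrl: string };
type ScrapeDocument = UrlDocument | SitemapDocument;

function mapDocument(doc: ScrapeDocument): Record<string, string> {
  // The discriminated union lets TypeScript narrow each branch,
  // so doc.url / doc.sitemapUrl are only accessible in the right case.
  switch (doc.contentType) {
    case "url":
      return { contentType: "url", url: doc.url };
    case "sitemap-url":
      return { contentType: "sitemap-url", sitemapUrl: doc.sitemapUrl };
  }
}

console.log(mapDocument({ contentType: "url", url: "https://example.com" }));
```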
LGTM, just left a comment / question.
```diff
@@ -1,11 +1,28 @@
 import { CortexApiClient } from "./api-client.js";
 import { Catalog } from "./catalog.js";

+export type UrlContentType = "url";
```
Would it be better for discoverability / intellisense experience to use ContentTypeXYZ naming pattern?
In hindsight yes. Probably a change we should make at some point, but for now I'm just pattern matching against the existing content types.
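For reference, the `ContentTypeXYZ` pattern the reviewer suggests would look roughly like this (hypothetical names, not the SDK's actual exports):

```typescript
// Hypothetical: a shared prefix groups related aliases in editor
// autocomplete, so typing "ContentType" surfaces every alias at once.
export type ContentTypeUrl = "url";
export type ContentTypeSitemapUrl = "sitemap-url";

const t: ContentTypeUrl = "url";
console.log(t); // prints: url
```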
Nothing fancy, just what it says on the box. I did add e2e tests, but left them turned off by default. Given that scraping is async, it could take multiple minutes for scraping to actually happen if there are a lot of documents already in the queue (like we might see in production). Running these tests against the API dev stack takes less than a minute.
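Because scraping is async, the e2e tests have to poll rather than assert immediately. A sketch of the gating and polling pattern described in the review (`documentCount` and the 5-second sleep come from the review comments; the `CatalogLike` interface and `waitForDocs` helper are illustrative assumptions):

```typescript
// Env-var gate the review mentions: tests are off by default.
const runScraperTests = process.env.RUN_SCRAPER_TESTS === "true";

// Simple sleep utility, as in scraping.test.ts.
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Minimal shape assumed for a catalog; the real Catalog class has more.
interface CatalogLike {
  documentCount(): Promise<number>;
}

// Poll until the catalog reports the expected number of scraped documents.
async function waitForDocs(
  catalog: CatalogLike,
  expected: number,
  intervalMs = 5000,
): Promise<void> {
  for (;;) {
    const count = await catalog.documentCount();
    if (count === expected) return;
    console.log(`${count} docs found. sleeping...`);
    await sleep(intervalMs);
  }
}
```

Note there is no timeout here; a real test would bound the loop, since a busy scrape queue (as the author notes can happen in production) could make this wait for minutes.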
Summary by CodeRabbit

New Features

Bug Fixes

Chores

- Updated `.npmignore` to exclude the `images` and `.github` directories.
- `package.json`