MongooseMiner is a search system that pushes LLM-based code generation beyond average human performance. Most LLMs for code generation write code like humans:
- They use the most common packages rather than the most appropriate and powerful ones,
- They use the most common functions instead of the most appropriate and powerful ones,
- They do not memorize all available function arguments, so they perform many operations where one would suffice.

This is because LLMs learn from the average developer's code, while MongooseMiner learns from the experts who wrote the libraries and documentation.
By evaluating the documentation strings of the most common PyPI projects and retrieving them as needed to guide LLM autocompletion, MongooseMiner can deliver the most appropriate and performant code.
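
To make the idea of retrieving documentation strings concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not MongooseMiner's actual pipeline: the helper names `collect_docstrings` and `retrieve` are hypothetical, the standard-library `json` module stands in for a PyPI package, and a simple word-overlap score stands in for whatever retrieval method the real system uses.

```python
import inspect
import json
import re


def collect_docstrings(module):
    """Map each public function in `module` to its docstring."""
    docs = {}
    for name, obj in inspect.getmembers(module, inspect.isfunction):
        doc = inspect.getdoc(obj)
        if doc and not name.startswith("_"):
            docs[f"{module.__name__}.{name}"] = doc
    return docs


def tokenize(text):
    """Lowercase the text and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, docs, top_k=2):
    """Rank docstrings by word overlap with the query.

    Assumption: word overlap is only a stand-in for a real search backend.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        docs.items(),
        key=lambda item: len(query_tokens & tokenize(item[1])),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Index the docstrings of one module, then look up the best matches
# for a natural-language description of the task.
docs = collect_docstrings(json)
hits = retrieve("serialize obj to a JSON formatted str", docs)
```

The retrieved docstrings could then be placed in the LLM prompt so that the completion is guided by the library authors' own description of each function.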
To enable MongooseMiner, we needed both PyPI and GitHub data. BigQuery hosts both:
- PyPI downloads
  - `distribution_metadata` table contains the columns we need to fetch:
    - `name` mapped to `pypi_name`
    - `version` mapped to `pypi_version`
    - `summary` & `description` combined into a single `pypi_description` string
    - `home_page` string & `download_url` string & `project_urls` array of strings, where we can find the source code links and check whether any of them leads to GitHub; exported to `github_url`
    - `requires` for dependencies
  - `file_downloads` table contains columns:
    - `project` like `a8`
- GitHub activity
  - `sample_repos` table contains:
    - `repo_name` string like `FreeCodeCamp/FreeCodeCamp`
    - `watch_count` integer for the number of people watching the repo
  - `languages` table contains:
    - `repo_name` string like `FreeCodeCamp/FreeCodeCamp`
    - `language.name` string like `C`
    - `language.bytes` integer containing the amount of code written in that language
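
The column mapping for `distribution_metadata` can be sketched in Python as follows. This is only an illustration: the function `to_record` and the sample row are hypothetical, joining `summary` and `description` with a newline is an assumption, and each URL field is assumed to hold a plain string.

```python
def to_record(row):
    """Map one distribution_metadata-style row (a dict) to the output fields."""
    # Gather every candidate link; missing fields fall back to empty values.
    links = [row.get("home_page") or "", row.get("download_url") or ""]
    links += row.get("project_urls") or []
    # Keep the first link that points at GitHub, if there is one.
    github_url = next((link for link in links if "github.com" in link), None)
    # Assumption: summary and description are joined with a newline.
    description = f"{row.get('summary') or ''}\n{row.get('description') or ''}"
    return {
        "pypi_name": row["name"],
        "pypi_version": row["version"],
        "pypi_description": description.strip(),
        "github_url": github_url,
        "requires": row.get("requires") or [],
    }


# Hypothetical row; the package and its URLs are made up for illustration.
sample_row = {
    "name": "example-pkg",
    "version": "1.0.0",
    "summary": "An example package.",
    "description": "Longer description text.",
    "home_page": "https://example.org",
    "download_url": None,
    "project_urls": ["https://github.com/example-org/example-pkg"],
    "requires": ["requests"],
}
record = to_record(sample_row)
```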
We use that data to aggregate information into one table:
- Sample the mentioned columns from the PyPI table
- Check if any of the links leads to GitHub
- Extract the repo name from the GitHub URL
- Join it with the watch count from the GitHub table
For details and the code, see `bigquery.sql`.
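
The repo-name extraction and the join with the watch count can also be illustrated outside SQL. The Python sketch below is not the query from `bigquery.sql`: the regular expression, the helper names, the sample values, and the choice to keep packages without a matching repo (a left join) are all assumptions made for illustration.

```python
import re

# Assumption: a GitHub repo URL contains "github.com/<owner>/<repo>".
GITHUB_REPO = re.compile(r"github\.com/([\w.-]+)/([\w.-]+)")


def repo_name_from_url(url):
    """Return "owner/repo" for a GitHub URL, or None if it is not one."""
    match = GITHUB_REPO.search(url or "")
    if match is None:
        return None
    owner, repo = match.groups()
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return f"{owner}/{repo}"


def join_watch_count(packages, watch_counts):
    """Attach repo_name and watch_count to each package (left join)."""
    joined = []
    for package in packages:
        repo = repo_name_from_url(package.get("github_url"))
        joined.append(
            {**package, "repo_name": repo, "watch_count": watch_counts.get(repo)}
        )
    return joined


# Hypothetical data; the names and the count are made up.
packages = [
    {"pypi_name": "example-pkg",
     "github_url": "https://github.com/example-org/example-pkg"},
    {"pypi_name": "no-repo-pkg", "github_url": None},
]
watch_counts = {"example-org/example-pkg": 42}
rows = join_watch_count(packages, watch_counts)
```

Packages without a GitHub link keep `None` in both new fields, which makes it easy to filter them out later if only matched repos are wanted.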