Code Compass is a contextual search engine for software packages developed at Nokia Bell Labs. It supercharges code reuse by recommending the best possible software libraries for your specific software project. See for yourself:

[Image: Code Compass showcase]

Code Compass is available as a website, a REST API and an IDE plug-in for Visual Studio Code.

We index packages hosted on NPM for JavaScript, PyPI for Python and Maven Central for Java.

If you're looking for the similarly named code comprehension tool from Ericsson to explore large codebases, look here. Apart from the name, there is no relationship (formal or informal) between that project and this one.

Why?

Modern software development is founded on code reuse through open source libraries and frameworks. These libraries are published in software package repositories, which are growing at an exponential rate. By building better software package search tools we aim to stimulate more code reuse and make software packages in the "long tail" more discoverable.

A gentle introduction to the why, what and how of Code Compass can be found in this blog post.

What?

Code Compass is a contextual search engine for software packages.

Code Compass differs from other package search engines in that you can "seed" the search with names of libraries that you already know or use. We call these "context libraries". Code Compass then uses these context libraries to "anchor" the search in those technology stacks that are most relevant to your code.
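To make the idea of "anchoring" a search concrete, here is a minimal sketch in Python. It assumes that context libraries are combined by averaging their vectors and that candidates are ranked by cosine similarity to that average; this README does not say how the Code Compass service actually combines context, so treat this as one plausible approach only. The function names and the toy vectors are hypothetical and the numbers are invented for illustration.

```python
import math

# Toy 3-dimensional "library vectors". The values are invented for this
# example and are NOT taken from the trained Code Compass embeddings.
LIBRARY_VECTORS = {
    "flask": [0.9, 0.1, 0.0],
    "jinja2": [0.8, 0.2, 0.1],
    "requests": [0.6, 0.3, 0.2],
    "numpy": [0.1, 0.9, 0.1],
    "pandas": [0.2, 0.8, 0.2],
    "scipy": [0.1, 0.9, 0.3],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_context(context, vectors, top_k=3):
    """Rank libraries by similarity to the average of the context vectors.

    Averaging is an assumption made for this sketch, not a documented
    detail of the Code Compass search service.
    """
    known = [vectors[name] for name in context if name in vectors]
    if not known:
        return []
    dim = len(known[0])
    centroid = [sum(v[i] for v in known) / len(known) for i in range(dim)]
    candidates = [
        (name, cosine(centroid, vec))
        for name, vec in vectors.items()
        if name not in context  # do not recommend what the user already has
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_k]


if __name__ == "__main__":
    for name, score in rank_by_context(["numpy"], LIBRARY_VECTORS, top_k=2):
        print(f"{name}: {score:.3f}")
```

With these toy vectors, seeding the search with a numerical library pulls the other numerical libraries to the top, while seeding it with a web framework favors web-related ones.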

When using the Visual Studio Code IDE extension there is no need to manually enter context libraries: Code Compass will automatically extract the import dependencies of the active source file to anchor its search.

Note that Code Compass will never send your code to the server. Only the names of third-party modules imported in your code are sent.
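As an illustration of what "only the names of imported modules" can mean in practice, the following sketch extracts top-level module names from Python source using the standard `ast` module. It is not the extension's actual code; the function names are hypothetical, and filtering out the standard library via `sys.stdlib_module_names` (available in Python 3.10 and later) is an assumption about how "third-party" might be decided.

```python
import ast
import sys


def imported_module_names(source: str) -> set:
    """Return the top-level names of all absolutely imported modules."""
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 marks a relative import such as "from . import x",
            # which refers to project-local code and is skipped here.
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    return names


def third_party_only(names):
    """Drop standard-library modules (needs Python 3.10+ to have any effect)."""
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return {name for name in names if name not in stdlib}


if __name__ == "__main__":
    example = (
        "import os\n"
        "import numpy as np\n"
        "from flask.views import View\n"
        "from . import helpers\n"
    )
    # Only module names are collected; no other part of the source is kept.
    print(sorted(third_party_only(imported_module_names(example))))
```

Only the resulting set of names would need to leave the machine; the source text itself is parsed locally and discarded.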

How?

Code Compass uses unsupervised machine learning to cluster similar software packages by their context of use, that is, by how libraries are imported alongside other libraries in large open source codebases.

Software packages are represented as vectors which we call "library vectors" by analogy with word vectors. Just like word2vec turns words into vectors by analyzing how words co-occur in large text corpora, our "import2vec" turns libraries into vectors by analyzing how import statements co-occur in large codebases.
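The word2vec analogy can be sketched as follows: where word2vec counts words that appear near each other, an import-based model can count libraries that are imported by the same source file. The snippet below builds such co-occurrence counts from a list of per-file import lists. It is a simplified illustration under that assumption, not the pipeline from the paper; the function name and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(files):
    """Count how often each pair of libraries is imported by the same file.

    `files` is an iterable of import lists, one list per source file.
    Pairs are stored in sorted order so (a, b) and (b, a) are the same key.
    """
    counts = Counter()
    for imports in files:
        # De-duplicate within a file, then enumerate every unordered pair.
        for a, b in combinations(sorted(set(imports)), 2):
            counts[(a, b)] += 1
    return counts


if __name__ == "__main__":
    # Invented sample data: each inner list is the imports of one file.
    sample_files = [
        ["numpy", "pandas"],
        ["pandas", "numpy", "scipy"],
        ["flask"],  # a single import yields no pairs
    ]
    for pair, count in sorted(cooccurrence_counts(sample_files).items()):
        print(pair, count)
```

Pairs with high counts play the role that frequently co-occurring words play in word2vec: an embedding model trained on them would be expected to place those libraries close together in the vector space.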

You can read the details in our MSR 2019 paper. Supplementary material including trained library embeddings for Java, JavaScript and Python is available on Zenodo.

As an example, for Java we looked at a large number of open source projects on GitHub and libraries on Maven Central and studied how libraries are imported across these projects. We identified large clusters of projects related to web frameworks, cloud computing, network services and big data analytics. Well-known projects such as Apache Hadoop, Spark and Kafka were all clustered into the same region because they are commonly used together to support big data analytics.

Below is a 3D visualization (a t-SNE plot) of the learned vector space for Java. Each dot represents a Java library and the various colored clusters correspond to different niche areas that were discovered in the data. We highlighted the names of Apache projects.

[Figure: 3D t-SNE visualization of the learned vector space for Java libraries]

What's in this repo?

  • docs/: REST API docs for the Code Compass search service
  • plugins/vscode/: Visual Studio Code extension to integrate Code Compass into the IDE
  • scripts/: data extraction scripts to generate library import co-occurrences from source code
  • nbs/: Jupyter notebooks with TensorFlow models to train library embeddings from import co-occurrence data

Team

Code Compass is developed by a research team in the Application Platforms and Software Systems Lab of Nokia Bell Labs.

See CONTRIBUTORS for an alphabetical list of contributors to Code Compass.

Contributing

If you would like to train embeddings for other languages, have a look at the scripts under import2vec to get an idea of what data is required.

If you have suggestions for improvement, user feedback or want to report a bug, please open an issue in this repository.

License

BSD 3-Clause