Source Code Semantic Search

When programming, a developer often has to search for examples that demonstrate how to use existing code, especially code from a library or a framework. This might be done on the Web, but for internal codebases it is done in the IDE.

Searching for such code examples is inherently a human-oriented and exploratory task, since it involves fuzzy queries and unstructured information, i.e., natural language text, such as identifiers and comments.

IDE search tools can be quite comprehensive:

[Image: Search 2]

Yet, because source code search faces challenges analogous to those of unstructured information management, namely fuzzy user queries and natural language text, keyword search is still the preferred interface.

Moreover, complex search UIs are still very limited in the kinds of complex queries they can express.

If we had an Open Source XML Fragments Search Engine available, it would be possible to index source code such as:

class QueryContextFactory {
   QueryContext createQueryContext(String query) {
      return new QueryContext(query);
   }
}

as an XML document:

  <classname>query context factory</classname>
  <method>
    <methodname>create query context</methodname>
    <returns>query context</returns>
    <constructor>query context</constructor>
  </method>

which gets indexed and can then be searched using XML fragments.
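The mapping from source code to the XML document above can be sketched in a few lines of Java. This is only an illustration under assumptions: the class IdentifierIndexer and its methods are hypothetical names, not part of any existing search engine, and the sketch handles just one class with one method. It splits camel-case identifiers into lowercase words and emits the same elements as the example.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of any real search engine API.
class IdentifierIndexer {

    // Split a camelCase or PascalCase identifier into lowercase words,
    // e.g. "QueryContextFactory" becomes "query context factory".
    static String splitIdentifier(String identifier) {
        String spaced = identifier
                // break between a lowercase letter or digit and an uppercase letter
                .replaceAll("([a-z0-9])([A-Z])", "$1 $2")
                // break between an acronym and the capitalized word that follows it
                .replaceAll("([A-Z]+)([A-Z][a-z])", "$1 $2");
        return spaced.toLowerCase(Locale.ROOT);
    }

    // Build the XML fragment for a class with a single method (assumed shape).
    static String toXml(String className, String methodName,
                        String returnType, String constructedType) {
        return "<classname>" + splitIdentifier(className) + "</classname>\n"
             + "<method>"
             + "<methodname>" + splitIdentifier(methodName) + "</methodname>"
             + "<returns>" + splitIdentifier(returnType) + "</returns>"
             + "<constructor>" + splitIdentifier(constructedType) + "</constructor>"
             + "</method>";
    }

    public static void main(String[] args) {
        System.out.println(toXml("QueryContextFactory", "createQueryContext",
                                 "QueryContext", "QueryContext"));
    }
}
```

Splitting identifiers into words is what lets a keyword such as 'factory' match the class name QueryContextFactory at query time.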

Then, if we were programming and needed a factory method for query contexts, we could query:


This query will fetch all classes with methods that return a query context. If this is too general, we can restrict it to methods that return a query context and also call the query context constructor:


Alternatively, if this is still not enough, we can search based on the name of the class:


The main advantage of using semantic search over source code is that it enables cross-boundary searches. The above search could be simplified as:


This would have matched ‘query context’ somewhere (including the class name) in a class whose name contains the word ‘factory’.
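The matching behaviour described above can be approximated with the JDK's built-in DOM parser. The following is a minimal sketch, not an XML Fragments engine and not its query syntax: FragmentMatcher, the synthetic doc root element, and the matches(tag, words) form are all assumptions made for illustration. A constraint holds when some element with the given tag contains all of the given words, and using the root as the tag matches the words anywhere in the class.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration of fragment-style matching; not a real engine API.
class FragmentMatcher {

    // Wrap the indexed fragment in a synthetic <doc> root so it parses as XML.
    static Document parse(String fragment) {
        try {
            byte[] xml = ("<doc>" + fragment + "</doc>").getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
        } catch (Exception e) {
            throw new IllegalArgumentException("not a well-formed fragment", e);
        }
    }

    // Collect every whitespace-separated word found in the text below a node.
    static void collectWords(Node node, Set<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            for (String word : node.getNodeValue().trim().split("\\s+")) {
                if (!word.isEmpty()) out.add(word);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectWords(children.item(i), out);
        }
    }

    // True if at least one <tag> element contains all of the given words.
    // Using the synthetic root "doc" as the tag matches words anywhere in the class.
    static boolean matches(Document doc, String tag, String words) {
        Set<String> wanted = new HashSet<>(Arrays.asList(words.trim().split("\\s+")));
        NodeList candidates = doc.getElementsByTagName(tag);
        for (int i = 0; i < candidates.getLength(); i++) {
            Set<String> present = new HashSet<>();
            collectWords(candidates.item(i), present);
            if (present.containsAll(wanted)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<classname>query context factory</classname>"
              + "<method><methodname>create query context</methodname>"
              + "<returns>query context</returns>"
              + "<constructor>query context</constructor></method>");

        // A method that returns a query context.
        System.out.println(matches(doc, "returns", "query context"));
        // Returns a query context and also calls its constructor.
        System.out.println(matches(doc, "returns", "query context")
                && matches(doc, "constructor", "query context"));
        // 'query context' anywhere, in a class whose name contains 'factory'.
        System.out.println(matches(doc, "doc", "query context")
                && matches(doc, "classname", "factory"));
    }
}
```

One simplification to keep in mind: combining two constraints with && does not require them to hold inside the same method element, which a real fragment query with nested elements would enforce.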

Note that using a search engine means the retrieved results are not "just all the places where the query matches" (i.e., a Boolean query result such as the ones produced by an IDE's complex search); they are also ordered by advanced ranking functions. That is to say, when we search for 'factory' and 'QueryContext', if 'factory' is rarer than 'QueryContext', results that match 'factory' will appear higher in the list of results. Advanced query expansion with synonyms is also possible.
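To make the ranking point concrete, here is a small sketch that weights each query term by its inverse document frequency (IDF), one common ingredient of search engine ranking functions. It is an assumption that the engine in question ranks this way; the class IdfRanker, the smoothing used in the formula, and the toy documents are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical ranking sketch; real engines use more elaborate scoring.
class IdfRanker {

    // Inverse document frequency with add-one smoothing: rarer terms weigh more.
    static double idf(String term, List<Set<String>> docs) {
        long df = docs.stream().filter(d -> d.contains(term)).count();
        return Math.log((1.0 + docs.size()) / (1.0 + df));
    }

    // Score of a document: sum of the IDF of every query term it contains.
    static double score(Set<String> doc, List<String> query, List<Set<String>> docs) {
        double total = 0.0;
        for (String term : query) {
            if (doc.contains(term)) total += idf(term, docs);
        }
        return total;
    }

    // Document indices ordered from highest to lowest score (ties keep index order).
    static List<Integer> rank(List<String> query, List<Set<String>> docs) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) order.add(i);
        order.sort(Comparator.comparingDouble(
                (Integer i) -> -score(docs.get(i), query, docs)));
        return order;
    }

    public static void main(String[] args) {
        // Toy index: 'querycontext' is common, 'factory' is rare.
        List<Set<String>> docs = List.of(
                Set.of("querycontext", "parser"),
                Set.of("querycontext", "builder"),
                Set.of("factory", "widget"),
                Set.of("querycontext", "reader"),
                Set.of("querycontext", "factory"));
        System.out.println(rank(List.of("factory", "querycontext"), docs));
    }
}
```

With this toy index, the document that matches both terms comes first, the one that matches only the rarer 'factory' comes second, and the documents that match only 'querycontext' follow.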