Keyvan Nayyeri is a Ph.D. student in Computer Science and previously earned a B.Sc. degree in Applied Mathematics. He was born in Kermanshah, Kurdistan, Iran in 1984, and currently lives in San Antonio, TX. His primary research interests are Programming Languages & Compilers and Software Engineering. He is also a software architect and developer with a focus on Microsoft development technologies as well as Open Source platforms. Keyvan is an avid community leader and contributor who has written four books for Wiley/Wrox and several articles for prominent community websites. He has also contributed to many Open Source projects. As a result of his long-time contributions to the community, he has received several recognitions and awards from Microsoft, its partners, and community websites. Keyvan is a continuous learner who loves to study, learn, and discover new technologies every day, and is enthusiastic about serving humankind through his research and contributions. For a long time he has been blogging about various topics on his blog, which has become a rich resource for software developers. His blog is available at Keyvan has posted 36 posts at DZone.

Exemplar - Code Search Engine for Finding Highly Relevant Applications

After our first meeting to discuss ICSE 2010 papers, where I presented on Software Traceability with Topic Modeling, yesterday we had our second presentation, on another paper entitled "A Search Engine for Finding Highly Relevant Applications". The idea introduced in this paper (which I’ll describe shortly in this post) has been implemented and is available on the web as the Exemplar code search engine.

As you probably already know, Code Reusability is one of the main principles of Software Development and an important aspect of Object-Oriented Programming. Software developers try to reuse components or pieces of code in their programs in order to speed up development and reduce costs. Code reusability can also help improve the quality of code by encouraging better design and implementation of smaller components.

As a common part of their daily programming work, industrial Software Developers search for relevant components, libraries, or code snippets to use in their projects. They often do so on code search engines like SourceForge, Google Code, Koders, CodePlex, and many other services.

Most of these code search engines rely heavily on textual values entered by project coordinators on the websites, such as the title, description, category, tags, or other attributes.

However, these search engines share a common problem, the relevance of search results, which depends on two major factors: the careful selection of keywords by the person searching and the richness of the textual attributes entered by project owners. The first factor can be addressed by better training of users, but the second is harder to fix. However rich the description entered for a project, some parts of the codebase may still be missing from it, especially for bigger projects that consist of various components.
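To make the second problem concrete, here is a minimal sketch of a search that looks only at owner-written descriptions. The project names, descriptions, and the word-overlap scoring are all made up for illustration; none of the search engines mentioned above is claimed to work exactly this way.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Hypothetical repository entries: only the owner-written description is indexed.
PROJECTS = {
    "MailKitchen": "A desktop email client with spam filtering",
    "DocuVault": "Document management system for small offices",
}


def search_descriptions(query, projects=PROJECTS):
    """Rank projects by how many query words appear in their description."""
    query_words = tokenize(query)
    scores = {}
    for name, description in projects.items():
        overlap = len(query_words & tokenize(description))
        if overlap:
            scores[name] = overlap
    # Highest overlap first; ties are broken alphabetically.
    return sorted(scores, key=lambda name: (-scores[name], name))
```

With this data, `search_descriptions("email client")` finds MailKitchen, but `search_descriptions("pdf export")` returns nothing, even if the hypothetical DocuVault contained a PDF export component, because its description never mentions it.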

There have been some attempts to solve this issue with different techniques. The paper that we discussed, recently published at ICSE 2010, tries to provide an improvement in this area. Its technique searches not only the textual properties of a project on a repository, but also the relationships between the project's APIs, based on the help documents written for those APIs.

In this paper, the authors apply this idea using two approaches: a plain search in the help documents for project APIs, and an advanced search in the API documents based on Data Flow analysis of the API.
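The two approaches can be sketched roughly as follows. This is only an illustration of the idea and not Exemplar's actual ranking algorithm: the API names, help texts, project names, data-flow links, and the additive scoring are all assumptions invented for the example.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Hypothetical help-document excerpts, keyed by API call.
API_DOCS = {
    "PdfWriter.write": "Writes the document to a PDF file",
    "ZipOutput.putEntry": "Adds a compressed entry to a zip archive",
    "FileReader.read": "Reads characters from a file",
}

# Hypothetical: API calls found in each project's source code.
PROJECT_API_CALLS = {
    "DocuVault": ["FileReader.read", "PdfWriter.write"],
    "PackRat": ["FileReader.read", "ZipOutput.putEntry"],
}

# Hypothetical data-flow links: (producer, consumer) pairs where the result
# of one API call is passed on to another inside the project.
PROJECT_DATA_FLOW = {
    "DocuVault": {("FileReader.read", "PdfWriter.write")},
    "PackRat": set(),
}


def match_api_calls(query):
    """Map each API call to the number of query words found in its help text."""
    query_words = tokenize(query)
    matches = {}
    for call, help_text in API_DOCS.items():
        overlap = len(query_words & tokenize(help_text))
        if overlap:
            matches[call] = overlap
    return matches


def search_by_api_docs(query, use_data_flow=False):
    """Rank projects by the matched API calls they use.

    With use_data_flow=True, a project gets one extra point for every
    data-flow link that connects two matched API calls.
    """
    matches = match_api_calls(query)
    scores = {}
    for project, calls in PROJECT_API_CALLS.items():
        score = sum(matches.get(call, 0) for call in calls)
        if score == 0:
            continue
        if use_data_flow:
            links = PROJECT_DATA_FLOW.get(project, set())
            score += sum(1 for a, b in links if a in matches and b in matches)
        scores[project] = score
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda name: (-scores[name], name))
```

Here a query such as `search_by_api_docs("write pdf file")` ranks DocuVault first even though no project description is consulted at all, because the help text of an API call it uses mentions the query words. The `use_data_flow` flag only adds a bonus on top of that score.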

In order to implement this idea, the authors collected around 30,000 Java projects from SourceForge, processed their APIs with the above-mentioned approaches, and published the resulting code search engine, called Exemplar, on the web. Then they asked a group of 39 Java developers with different levels of experience to search for some common programming tasks using this search engine under a time limit. In the next step they asked the developers to evaluate the results and rate their relevance, as well as their own confidence in their answers.

The experiment was analyzed using statistical methods, and the authors report results showing that using the API descriptions improves the relevance of search results, but that the use of Data Flow analysis doesn’t have a big impact.

However, it appears that not enough work has been done on the Data Flow analysis, and its implementation is weak and superficial. The authors seem to agree, because they describe future work on a stronger implementation of Data Flow analysis to improve the relevance of search results.

All in all, I think that this new approach has good potential to improve the results of code search engines, but a more sophisticated implementation of Data Flow analysis would be costly, and much work will be needed to fine-tune the search engine in this area.

Published at DZone with permission of its author, Keyvan Nayyeri. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)


Sindy Loreal replied on Sat, 2012/02/25 - 7:18am

Very Nice idea, hmm is there any ranking or value based algorithm which can re order search result? This means that system suggests search result based on popularity, download, community support and some others metrics.
