The Curious Case of Code Duplication in GitHub
‹Programming› Keynote
Previous studies have shown that there is a non-trivial amount of duplication in source code. We analyzed a corpus of 2.6 million non-fork projects hosted on GitHub, representing over 258 million files written in Java, C++, Python, and JavaScript, and found far more duplication than we anticipated. This finding has made us much more careful about using open source repositories to draw statistical conclusions, especially now, in the age of machine learning. In this talk, I will present our GitHub study and briefly cover some of our most recent work on extending duplicate detection to the machine learning models themselves.
Cristina (Crista) Lopes is a Professor in the School of Computer Sciences at the University of California, Irvine, with research interests in Programming Languages, Software Engineering, and Distributed Virtual Environments. She is an IEEE Fellow, an ACM Distinguished Scientist, a twice-elected member of the SIGPLAN Executive Committee, and Editor in Chief of The Art, Science, and Engineering of Programming. She is the recipient of the 2016 Pizzigati Prize for Software in the Public Interest for her work on the OpenSimulator virtual world platform. She’s also co-founder of Midspace, a virtual conference platform.
Wed 23 Mar (displayed time zone: Lisbon)
09:00 - 10:00 (60m) | Keynote: The Curious Case of Code Duplication in GitHub (‹Programming› Keynotes)