A good example of how data should influence the way we develop software is Microsoft’s “Windows Error Reporting” tool. Talk about Big Data: Microsoft receives more than 100 million crash reports per day. A fascinating article from 2009 by Kirk Glerum and others at Microsoft discusses how these reports are used. Not surprisingly, the major benefit appears to be prioritizing defects. More to the point, the article discusses the attempt to map each crash report to a single defect “bucket.” When the article was written, this was done using hand-written functions comprising “100,000 lines of code implementing some 500 bucketing heuristics.” This is testimony to the difficulty of the task, but I would be disappointed if, by now, Microsoft Research had not applied machine learning to the problem. The article also discusses the use of massive data to find root causes, although the implication is that this is (or was) based on manual searches.
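
To make the idea of “bucketing” concrete, here is a minimal sketch of what a couple of hand-written heuristics might look like. This is my own illustration, not Microsoft’s code: the `CrashReport` fields, the two rules, and the bucket names are all hypothetical, and a real system would need hundreds of such rules.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CrashReport:
    # Hypothetical fields; real crash reports carry far more detail.
    app: str
    module: str
    exception: str
    offset: int


def bucket_for(report: CrashReport) -> str:
    """Map one crash report to a single defect bucket (illustrative rules only)."""
    # Assumed heuristic 1: memory exhaustion is grouped regardless of module.
    if report.exception == "OutOfMemory":
        return "resource/out-of-memory"
    # Assumed heuristic 2: crashes in the same module, with the same exception,
    # within the same 4 KiB page of code are treated as the same defect.
    page = report.offset // 0x1000
    return f"{report.module}/{report.exception}/page-{page:#x}"


def prioritize(reports):
    """Return (bucket, count) pairs, most frequent bucket first."""
    return Counter(bucket_for(r) for r in reports).most_common()
```

Sorting buckets by count is the “prioritizing defects” step: the bucket with the most reports is the defect that is hurting the most users.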

In addition to analytics about the crashes that occur at runtime, what about the errors that are seen at compile time? I was recently talking with a colleague about the potential for (and the privacy objections to) compiler and IDE vendors analyzing compiler errors. It’s probably safe to assume that missing semicolons and miscounted closing braces are going to lead the field, but wouldn’t it be interesting to know which library functions are most frequently called with the wrong parameter types, or which are correlated with defect-prone functions?
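
As a rough sketch of what such an analysis could start from, the snippet below tallies error messages from compiler output. It assumes a GCC-style `file:line:col: error: message` format with plain single quotes around identifiers; the function names are my own. Replacing quoted identifiers with a placeholder both groups similar errors together and keeps user-chosen names out of the tally, which speaks (a little) to the privacy concern.

```python
import re
from collections import Counter

# Assumed log format: "<file>:<line>:<col>: error: <message>"
ERROR_LINE = re.compile(
    r"^(?P<file>[^:]+):(?P<line>\d+):(?P<col>\d+): error: (?P<msg>.+)$"
)
QUOTED = re.compile(r"'[^']*'")


def normalize(message: str) -> str:
    """Replace quoted tokens so errors differing only in names share a key."""
    return QUOTED.sub("'_'", message)


def tally_errors(log_lines):
    """Count normalized error messages; warnings and other lines are ignored."""
    counts = Counter()
    for line in log_lines:
        match = ERROR_LINE.match(line.strip())
        if match:
            counts[normalize(match.group("msg"))] += 1
    return counts
```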

Although I argued previously that developers do not have to worry about software taking their jobs, I think it’s very possible that developers could benefit greatly from machine-written code offered as a candidate solution. At the SPLASH 2011 conference, Markus Püschel gave a fascinating keynote on “automatic performance programming” in which he described the use of generative techniques to create fine-tuned implementations of the fast Fourier transform. The paper is unfortunately behind a paywall, but you can find some information at the Spiral project homepage.

Finally, there’s something fundamentally wasteful about writing unit tests and then writing code to satisfy the constraints you’ve just specified. I like to imagine that a day will come when, for certain programming tasks, I can ask my computer to present me with some suggestions. A developer asking for a proposed function that satisfies defined constraints is a vastly simpler scenario than generalized software writing, and although it’s certainly still a research-level problem, I think it’s something we might see within several years.
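
To show the shape of the problem, here is a toy version: a brute-force search over compositions of a few primitive operations, returning the first pipeline that passes every input/output example. This is plain enumeration, not machine learning, and the `PRIMITIVES` table and function names are invented for the illustration; anything realistic would need a far richer search space and a smarter way to explore it.

```python
from itertools import product

# Hypothetical building blocks the search may combine.
PRIMITIVES = {
    "inc": lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
    "negate": lambda x: -x,
}


def compose(names):
    """Build a function that applies the named primitives left to right."""
    def pipeline(x):
        for name in names:
            x = PRIMITIVES[name](x)
        return x
    return pipeline


def synthesize(examples, max_depth=3):
    """Return the shortest tuple of primitive names passing all examples, or None."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            candidate = compose(names)
            if all(candidate(given) == expected for given, expected in examples):
                return names
    return None
```

Given the “unit tests” `[(1, 4), (2, 6), (3, 8)]`, the search proposes `("inc", "double")`, that is, `(x + 1) * 2`. The developer still has to decide whether the suggestion is what was really meant, since a handful of examples rarely pins down a single function.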

None of the scenarios I’ve proposed is possible without considerable pre-processing of the data into a usable form, as well as some amount of symbolic processing to create the patterns for neural nets to work upon. But software developers have for too long been “the cobbler’s children who go shoeless” and deserve to reap the benefits of this technology. The data sets are there and the deep-learning libraries are open source; all that is needed is a few insightful developers to put it all together.