The Tale of Ted Dunning: The Apache Incubator's ever-curious Big Data scientist

Published: May 28th, 2015

- Rob Marvin

Ted Dunning, the newly appointed vice president of the Apache Incubator, is a Big Data scientist in a world of coders.

Currently the chief application architect at Hadoop distribution company MapR, the longtime Apache Software Foundation contributor and project mentor took over as the ASF’s vice president of incubation in April. Tasked with keeping Apache Incubator projects in accordance with open-source standards and with fostering new communities, Dunning will play an important role in nurturing the software and Big Data technologies the nonprofit organization supports over the next several years.

Dunning’s varied 40-year career sprouted from what he called a “compulsion to compute,” driven by a lifelong fascination with data: processing it, analyzing it and drawing insights from it back before it became very “Big.”

As vice president of incubation, Dunning said he sees his role as that of an open-source cheerleader.

“Apache doesn’t produce software; it doesn’t select projects,” he said. “Software comes to Apache and projects self-select. Apache is about building community first—one of the mottos is building community over code. We need to foster good projects that can build into good communities. I want Apache to be a very open and welcoming place, and the Incubator is the gateway.”

Dunning has been involved with open source since the mid-1970s on projects such as the XPL0 programming language and the Apex operating system. Over the past several decades he also got a Ph.D. in computing science, worked on advanced computing research projects for DARPA (the U.S. government’s Defense Advanced Research Projects Agency), and joined or founded nearly half a dozen startups spanning behaviorally targeted ads, financial risk management insights, identity theft detection, and online streaming and recommendations for music, movies and TV.

In the late 2000s, Dunning began interacting with the Apache Software Foundation community, ultimately committing to and mentoring a plethora of projects along with joining MapR.

On its face Dunning’s career seems like a random assortment of research positions, business ventures and technologies. Yet underlying every professional decision, open-source contribution and new idea is the theme of identifying larger patterns. Whether in examining user behaviors or gleaning Big Data insights to optimize a larger process, Dunning comes at programming from an exploratory scientific perspective, always with a sense of wonder.

“I’m a geek who’s suddenly fashionable; never would’ve guessed,” he said. “The ability to go out and actually try to find these patterns is so exciting. A friend of mine used to talk about squeezing the brain of Mother Nature. Whether it’s astronomy, genomics, biology, commerce, how people speak and communicate, or how machines and networks communicate: These are all examples of how these patterns exhibit in the real world. I’m just stunned when people are unmoved by that.”

Photo credit: Ellen Friedman

Follow the data
The turning point toward Big Data for Dunning came in 1984 when he joined New Mexico State University’s Computing Research Laboratory to work on large-scale projects for DARPA. He experimented with projects on statistical symbol and genomic analysis, machine translation, and forays into computer vision and robotics.

The lab started as one of five centers of excellence funded by the state, but “Within a few years, we were one of the few human language technology (HLT) contractors for DARPA,” said Dunning. “That’s where a lot of the techniques came from that I’ve been able to apply in many different situations.”

In the mid-1990s, startup culture lured Dunning to California. He left New Mexico State in 1996 to work at Aptex, a startup spun off from HNC Software. There Dunning helped build the first behaviorally targeted advertising system primarily for the company’s biggest customer, the InfoSeek search engine, using what he called context vector technology to transform raw user data into ad insights.

“The ability to target ads based on what people did and what they clicked on was a very interesting opportunity,” said Dunning. “That work was based quite literally on research I’d done on sequences in symbols. I’d previously thought of the sequences as language, either human or genomic, but it could be applied to sequences representing things you typed into a query engine, places you visit and the content of websites.”

When HNC Software bought back Aptex in late 1999, Dunning continued his symbol sequencing work at Musicmatch. He applied the same data principles to build some of the first commercially viable music recommendation engines around early Internet radio, integrating the recommendations into streaming protocols. Dunning’s name can be found on several of the first patents around the technology.

When Yahoo bought Musicmatch in 2004, Simon Ferrett, a systems administrator at Slacker Radio who worked with Dunning at Musicmatch, followed as Dunning cofounded Veoh Networks, a user-generated video content platform that served as a precursor of sorts to YouTube. Veoh built video recommendation engines and generated behavioral analytics around the modern notion of multi-modal recommendation—looking at multiple kinds of behavior integrated into a coherent view of what causes people to act.

“If you can retain the full nuance of [users’] actions, whether it’s scrolling on a website, looking at reviews or playing a video, you can make much better-informed recommendations because they are talking to you, telling you what they like and don’t like,” said Dunning. “We also used that same behavioral knowledge to predict what was going to be popular, allowing us to populate a peer-to-peer network that acted almost as a self-organized content-delivery network to substantially decrease streaming costs.”

Ferrett spoke about how Dunning brought the same data-informed problem-solving perspective to Veoh, and also how the way he approached code then and now makes Dunning a good fit for his current role in the Apache Incubator.

“Whenever I had some issues with the code I was writing and wasn’t sure if I was attacking it the right way, Ted had a great way of looking at it,” said Ferrett. “Some of the code Ted writes hews a bit closer to the way a professor would write it—assuming an infinitely perfect computer with an infinite drive, etc.—but the concepts were sound. For him to be reviewing these sorts of startup projects seems like the best combination of applying that theoretical and analytical perspective to other folks’ code, mentoring to make sure it’s done in the correct manner.”

Dunning left Veoh in 2007, but between then and the beginning of his work with the ASF, he founded one more startup: ID Analytics. The company offered consumer risk-management software with real-time behavioral insights to identify credit and financial identity fraud. LifeLock, an identity theft protection company, bought ID Analytics in 2012.

“On the face of it, music recommendations, identity fraud, Internet advertising and genomics look very different, but at their deep heart they have a lot of similarities: The ways you find order and structure in these domains,” said Dunning. “We pioneered special kinds of database technologies around this idea of graph theoretical anomalies so we could find synthetic identities and run-of-the-mill identity fraudsters. I think we were the first to prove the existence of the synthetic identity industry.”

The open-source philosopher
Throughout Dunning’s research days and career progression through a string of startups, he stayed involved in the open-source community. Dunning’s work in open source began as an undergrad in electrical engineering at the University of Colorado in 1975, when he joined the 6502 Interest Group, one of the oldest computer clubs in the United States. Every Tuesday night they met at the Colorado School of Mines, which birthed XPL0, Apex OS, the FOCAL language and other breakthroughs, all hand-assembled and coded into a mainframe.

Dunning later earned his M.S. in computer science from New Mexico State University, and graduated with a Ph.D. in computing science from the U.K.’s University of Sheffield in 1999.

“If we go back to the dark ages of sorts, I’ve been involved in open-source software for a very, very long time and open-source has changed a lot. We have an Internet now,” said Dunning. “Open source used to be folks getting together and swapping floppies. Now worldwide you have these global communities, and the capabilities are just earth-shaking.”

In today’s world of GitHub and mainstream open source, the free software veteran brings a more measured open-source philosophy to mentoring ASF projects. Dunning began working with Hadoop in 2007 and 2008, participating at first on the mailing list and then as a committer for Apache Mahout, followed by committing to and mentoring projects like Apache Storm, Lucene, Flink, Kylin, Drill and Myriad.

Taylor Goetz, the project management committee (PMC) chair of the Apache Storm project and a technical staff member at Hadoop development company Hortonworks, said that as a mentor on the Storm project, Dunning helped steer debate about Storm’s initial incubation. According to Goetz, Dunning’s presence was important in guiding the Storm committee through what it meant to be an Apache project.

“When [Storm] first started in the incubator, none of us [PMC members] had any experience operating as an Apache project,” said Goetz. “So when we were accepted into the Incubator and it was kind of like ‘Finding Nemo’ when all the fish escape into the ocean in bags and one fish just says ‘Now what?’ Ted was really instrumental in helping us navigate those waters, figuring out all those processes and procedures around release licenses that can be pretty daunting.”

Photo credit: Philip Kademan

Though Hortonworks and MapR are competing in the enterprise Hadoop market, Goetz drew attention to a recent YouTube video where, when talking about his new role, Dunning symbolically took off his MapR hat. Goetz said it speaks to Dunning’s personality that he’s approaching this role from a vendor-neutral stance.

“That meant a lot to me. You have to learn to be an Apache person and embrace the Apache way, because our contributor licensing agreements are with Apache, not our employer,” said Goetz. “Ted is very reasonable and empathetic, which are two extremely important traits when contributing to open-source communities. He understands the Apache philosophy and the organizational dynamics at play. He doesn’t look at projects through rose-colored glasses. He sees places where improvements can be made in helping fledgling projects become successful.”

Dunning talked about the interplay between the “great leaps” indicative of modern open-source development and the slow-and-steady progress of taking continuous steps to improve a piece of software. In ultimately getting an open-source project ready for enterprise adoption, he stressed an exacting emphasis on adherence to standards and licenses.

“Part of Apache’s core mission is making software that’s safe for restricted business environments, which means we as an organization pay huge attention to licensing hygiene,” Dunning said. “That’s often the furthest thing from an excited developer’s mind. It’s really important for building a community around commercial adoption. One of the key risks of open source is not knowing where that code came from. With Apache there’s traceability of every line of code and whose responsibility that piece of code was, and it’s a concern that needs to be met for projects that want to be in the big time.”

Incubating Big Data’s future
Dunning said his role in the incubator is one of facilitation, not control. He stressed that the ASF serves primarily as an open-source charity rather than a corporation. The ASF organizational structure rotates as well, so he knows his position heading up the incubator is not a permanent gig.

“It’s an opportunity to contribute in a new kind of way,” said Dunning. “It’s been a long and interesting ride, and it’s exciting to see how [open source] has progressed. Admittedly it’s much, much easier now because of acceptance and the communication we have now to work on open source.”

Going forward, Dunning has many ideas about where the ASF can expand and grow through the Apache Incubator. One of his most active goals is extending the open-source Big Data communities farther into Europe and Asia.

“Kylin is a project that started in China’s eBay facilities and it’s now in Apache. There’s a cultural gap to be sure, but there’s huge enthusiasm around embracing open source,” said Dunning.

“The SINGA project originally out of Singapore University that deals with neural networks and deep learning was pushed into Apache and is now a very competitive machine learning project. Tajo is another project out of an Asian development group showing the same trend. It makes the world a lot bigger.”

Dunning also drew attention to the Incubator’s growing focus on integration-oriented projects such as Apache Zeppelin, which he said is breaking ground in providing visualization across different modes of computation. Finally, he mentioned a collection of science-related U.S. government projects including research from NASA’s Jet Propulsion Laboratory (JPL) coming into Apache as open-source projects.

As Dunning explained his vision for the different paths the open-source community around Apache projects might take, that sense of wide-eyed excitement rang throughout. Dunning said his scientific fascination with patterns led him into Big Data, and it’s what motivates him to keep up.

“There are always funny contradictions in life,” said Dunning. “Because things change so quickly, no one really has more than five years of applicable experience in the Big Data world. But on the other hand some old fogeys—which I try not to be—complain about reinvention. Supercomputer guys complain about microprocessor people reinventing optimization techniques. Database people complain about Big Data people. I don’t see it so much as reinvention as this fascinating and joyful realization of patterns occurring over time and across domains. I find the fact that the world does exhibit order; exhibit patterns, as just wondrous.”

Article Tags

Apache Incubator, Apache Software Foundation, ASF, Big Data, Drill, Flink, Hadoop, Kylin, Mahout, MapR, Musicmatch, open source, Storm, Tajo, Ted Dunning, Veoh

About Rob Marvin

Rob Marvin has covered the software development and technology industry as Online & Social Media Editor at SD Times since July 2013. He is a 2013 graduate of the S.I. Newhouse School of Public Communications at Syracuse University with dual degrees in Magazine Journalism and Psychology. Rob enjoys writing about everything from features, entertainment, news and culture to his current work covering the software development industry. Reach him on Twitter at @rjmarvin1.


