David Blei is a Professor of Statistics and Computer Science at Columbia University. His research is in statistical machine learning, involving probabilistic topic models, Bayesian nonparametric methods, and approximate posterior inference. He works on a variety of applications including text, images, music, social networks, user behavior, and scientific data.
David earned his Bachelor's degree in Computer Science and Mathematics from Brown University (1997) and his PhD in Computer Science from the University of California, Berkeley (2004). Before arriving at Columbia, he was an Associate Professor of Computer Science at Princeton University (2006-2014). He has received several awards for his research, including a Sloan Fellowship (2010), Office of Naval Research Young Investigator Award (2011), Presidential Early Career Award for Scientists and Engineers (2011), Blavatnik Faculty Award (2013), and ACM-Infosys Foundation Award (2013).
PARTIAL EXTRACTS AND QUOTES FROM THE EXTENSIVE DISCUSSIONS:
When did you hear of this extraordinary honour, being the recipient of the ACM Infosys Foundation Award in 2014? How did you feel at the time and what was the reaction from your colleagues and your family?
"….My colleagues were very happy for me, as was my family. One colleague of mine, Sanjeev Arora, won the same award a couple of years ago so it was quite exciting for us…."
What were the drivers behind your early interest in computer science and mathematics?
"….I've been interested in computer science since I was a kid. I grew up in the 1980s and my parents came home at some point with a TI-99/4A — a Texas Instruments computer that we hooked up to the TV. I started programming in Basic when I was something like seven years old…."
Can you talk about your approach to analyzing large collections of data using innovative statistical methods?
"….What topic modeling does is use new statistical tools to uncover what the hidden schematic patterns are in these document collections, and then annotate those documents according to those patterns. Then we can use those annotations that were derived from the algorithm to visualize, explore, predict and so on, whatever we want to do with the documents…."
Do you see any ability to meld some of the work you've done with Tom Mitchell with his Never Ending Language Learner (NELL) at Carnegie Mellon, or Andrew Ng at Stanford with regards to his deep learning projects?
"….If you think about topic modeling which operates on text, then what does it mean to have a topic model where text is coming at you in a never-ending way — this is similar to what Tom Mitchell's project is about, something I'm interested in too. It's a challenge for this field. Deep learning is a separate, also very interesting idea and everybody has a different perspective on it. The perspective I like about it is that deep learning is about feature learning. It's about understanding the hidden structure, the hidden features of your data that are important for describing how it's similar to each other or how it might predict some other variable….Another challenge I think for machine learning in general is to connect these two sides, these two styles of approach. One style is this revival of Convolutional Neural networks which is the new field of deep learning or the revival field of deep learning, and the other is Applied Probabilistic Modeling where we're drawing more from statistics and probability models to describe data. Bringing these together has not yet really been done although there have been some first attempts at it…."
There is Judea Pearl's work with Bayesian networks and counterfactuals and causal calculus where he's created a sort of mathematical model for causation and so on — do you see any kind of melding between his work and yours?
"….Judea Pearl — in 1988 his seminal book is one of the pieces of work that launched Bayesian networks or graphical models which are ways of representing probability models as graphs and connect intimately to how we compute about those graphs….Now it's statistical machine learning, topic models is one example of this. We are developing probability models of high dimensional data, understanding hidden structure from them and developing new algorithms for doing that; it's been going on for twenty years. Now Pearl comes out with another book and this book is about causality, so what's the challenge? We know how to build complex probability models of high dimensional data. We know how to infer what the hidden structure embedded in data under the assumptions of that model is, but the question that remains is what does that mean about my data?….I think a challenge for probabilistic modeling and for statistical machine learning is to try to understand what a technique like graphical model or topic model applied to data can tell us about the data themselves….So what kinds of truth live in those data is a very difficult question to answer and that gets at what Pearl's work on causality is trying to understand….We know now how to define complicated models and compute with them and the next question we need to ask in statistical machine learning is what do these models mean, especially when we apply them to massive observational dataset? Something I’m very interested in…."
You've already talked about this, but let's explore this a little further. What are the applications of your work and the different scientific domains that it influences?
"….Analyzing text directly has a lot of applications; I mentioned the digital humanities as one where some people are doing some very exciting work using topic models to stimulate insights about their archives of literature and other documents (historical documents). Other applications are to technology companies like recommendation systems, mail analysis helping you detect spam or doing more complicated things like group your email or even suggest who should be on a recipient list of an email automatically….Models like topic models have been developed independently in fact to analyze human genomes. This kind of analysis called population genetics is very important for doing things like correcting for ancestry when trying to understand how genes and diseases are related to each other….Another application is social network analysis…."
How are your roles changing for 2014 and why?
"….In July I will be moving to Columbia. It's a new role for me and I will be in both the Statistics Department and the Computer Science Department as well as in this new Institute for Data Science and that's a lot of new roles…."
What are the big research questions with topic models, Bayesian methods, and approximate posterior inference, and what are the ways and processes for answers?
"….Something that my group is very interested in right now is working on the combination of text and user behavior data, reader data….Some of the bigger research challenges I've mentioned, one which is to think about causality and truth in observational data….Other research challenges involve posterior inference, Bayesian inference that you mentioned; this is the algorithmic problem of taking a model that has hidden variables and basically filling in what those hidden variables are given a big dataset. We need to make Bayesian inference scalable, though with massive datasets we need to be able to do Bayesian inference and as I said this earlier when you mentioned Tom Mitchell's Never Ending Learning — we need to figure out how to apply Bayesian inference to streaming data. Related to the problem of causality in observational data is to understand the exploratory data analysis a bit more rigorously….Another research challenge is to make exploratory data analysis more of a first class activity, and to treat it equal to prediction and understand what are some of the principles that can guide exploratory data analysis general principles (if there are any)…."
We have covered this throughout the interview so far, but are there any additional applications of your work, including within text, images, music, social networks, user behavior and scientific data?
"….I guess things which work with the data directly that take things like fMRI data and other kinds of neural science data and build probability models and discover components in recurring patterns in these high dimensional neural science datasets…."
Do you see any controversies in your field and why?
"….One might be the controversy around causality and observational data. Some scholars believe that it's hopeless, that it's just not possible to make causal claims from observational data and others (I'm defining my field broadly to include machine learning, computer science, statisticians and social scientists), who think about these kinds of issues and some think that it IS possible….There is a distinction between machine learners who use neural networks in the sense of 'define an objective function and then optimize that objective function to create an answer to a question about data', and those that prefer the probabilistic modeling approach where I'm going to define a model that has assumptions, then I'm going to use an inference algorithm to use those assumptions to compute with my data, and then use those inferences to solve my problem. I wouldn't exactly call it a controversy, but there are these different styles to computation….I guess the final controversy I should mention because it's so famous is the controversy between Bayesian and Frequentists. I think in the old days this amounted to a controversy about understanding truth in data. When is an inference real? There are different perspectives about that and Frequentists and Bayesians came up with different perspectives…."
What are the practical applications of your work in 10 to 15 years' time?
"….I do think that broadly speaking my work is about organizing digital information automatically, and right now we are bombarded with digital information through many devices and many channels and each of us has developed a survival mechanism for dealing with it. Thinking 10 to 15 years ahead, all of that will be organized for us and it will be much more digestible, much less overwhelming and much more personalized. We will have trusted sources of information about news, weather, our friends and our work and this problem of filtering all of these many things (that right now we go and look for), I think we will make a lot of progress on that problem using the kinds of methods that my colleagues and I have been developing…."
Looking further into the future do you see your research interests changing?
"….I started out working primarily in text and digital media organizing that and in recent years started to look more seriously at scientific data, collaborating with scientists trying to bring what we learn from extracting patterns and texts and exploiting those patterns to extracting patterns in high dimensional scientific data. I see myself continuing down that route, continuing to collaborate with scientists and work with more raw scientific data that helps them understand what patterns are in there….I'm excited about what's happening in the Digital Humanities in which I have been using these kinds of techniques to analyze texts. Looking ahead I've already become interested in probabilistic modeling more broadly and developing probabilistic modeling into a usable language (I mean that broadly – not necessarily a computer language)….Another place where I see my research going to (and this relates to the problems of exploration and causation), is to start to sink my teeth into model criticism and model revision…."
It sounds like this work aligns or can integrate with some of Judea Pearl's work on external validity?
"….The work on model criticism and model validation is intimately tied to the questions of causality and model validity and it's vital. So when you think about inferring causality from observational data what that means is that you need to have a valid model of how the world is working. Statisticians have thought about this problem so a lot of what I like to work on is to uncover and recover old ideas from statistics that when put through modern algorithmic and machine learning machinery, they are new fruit…."
When you come across really difficult challenges in your research what kind of general, valuable lessons can you share with the audience in terms of dealing with these challenges?
"….One, it's okay that not every path you go down bears fruit and two, to stick with it and adapt to whatever comes your way as you embark on the project and enjoy that process, otherwise it's not really worth it….Pay attention to the history, the scholarship, and always on the lookout for papers that you missed from the last decade. Read them slowly to understand what they are saying, to dust off old ideas that might have been forgotten or that might not seem as relevant anymore to see what insights you might glean from them…."
David, what are the greater burning challenges and research problems for today's youth to solve to inspire them to go into computing?
"….The challenge I mentioned earlier, to me is the motivating challenge. I know there are others, which is the challenge of information inundation — that we have been anointed with access to everything, constantly and that is a burden, and we need to solve that problem and that's a computing problem…."
Can you name some people who have inspired you?
"….I'm very lucky I've had several great mentors over the years and I've also been inspired by great scholars and their writing….George Box and John Tukey….I. J. Good….David Freedman, a Berkeley statistician….my advisor Mike Jordan….Brad Efron the Presidential Medal of Science winner….The list goes on and on…."
Can you quantify what you think are the qualities that make for these great people (who inspire you) such as your advisor?
"….I don't know, I wish I did….One thing is as a faculty member, when you take on a PhD student or postdoc for that matter, it's a big decision for the faculty member because you are making a commitment to help another person somehow find who they are, who they want to become, and help them find a path that they are excited about that's satisfying to them and it's a big commitment. I think that Mike respects that commitment a lot and I try to do that too with my students and postdocs…."
How has the ACM and its resources supported your research? Which ACM assets (SIGS, conferences, publications, digital library, peer network, and so forth) are the most valuable to you?
"….The ACM is an amazing organization and the way that the ACM has really helped me is through their publication, Communications of the ACM (CACM)…."
Over your distinguished career, what are some broader lessons that you can share with the audience?
"….Read deeply into the literature and look at older papers on computer science and statistics and other fields that touch your field….Be open to new ideas….Another lesson for the faculty members is to take your student's success and happiness seriously and really commit to it….Another lesson that I've learned over the years (and it also involves and relates to this research problem of being inundated with information), is to be protective about your time…."
From your extensive speaking, travels and work, do you have any stories you can share (perhaps something amusing, surprising, unexpected or amazing)?
"….There is this big conference called the Joint Statistical Meeting (JSM), a giant Statistics conference and Machine Learning conference. It's hard to know what's going on at any one time, anywhere at that JSM and as a consequence some great speakers and famous intellectuals will give a talk and very few people will be there….I went to this room and there was this great probabilist statistician and he gave a beautiful talk, so clear it was incredible. I think in the room was me and another guy and the speaker's wife and son. This might have been the entire audience, which was a shock….Afterwards I walked up to him and shook his hand and said, 'I'm David Blei and love your work and he looked at me and looked at my name tag and said, 'Ah David Blei, I read your papers, I don't understand them'….And that was it…."
Do you have broader life goals that you want to achieve?
"….You know I don't think too far ahead to be honest with you. I like being a scientist and a researcher and I also like teaching and mentoring my students and growing my group of student alumni…."
Outside of computer science, statistics and the areas that you are currently researching, do you see any top challenges facing us today that you want to comment on?
"….I have an old friend from college, Tapan Parikh, who works on computer science for the developing world and he's been a real inspiration to me in the sense of somebody I really admire. What his work is about is what can we do with our technology that can help people in the developing world, and not in the way that you might immediately think to solve that problem….What Tap does is tries to figure out how can we meet their needs with our technology in a way that makes sense for them and it's an incredible challenge. Respecting the context, cultural parameters, and the financial and resource limitations, how can all of this amazing technology that we're building here in the first world in our privileged society, how can they help others in this real way?…."
If you were conducting this interview, what question would you ask, and then what would be your answer?
"….'Is there anything else you want to say?'….You asked me about my mentors and I've been focusing in the last hour on people like my advisor and other scholars who have inspired me, but I should add over the last eight years or so I have had an amazing group of PhD students and postdocs come through my lab and they have been inspiring and really this award is largely due to our collaborative work that they drove…."
David, with your demanding schedule, we are indeed fortunate to have you come in to do this interview. Thank you for sharing your substantial wisdom with our audience.