TRAINING STUDENTS TO EXTRACT VALUE FROM BIG DATA
Summary of a Workshop
Maureen Mellody, Rapporteur
Committee on Applied and Theoretical Statistics
Board on Mathematical Sciences and Their Applications
Division on Engineering and Physical Sciences
NATIONAL RESEARCH COUNCIL
OF THE NATIONAL ACADEMIES
THE NATIONAL ACADEMIES PRESS
Washington, D.C.
THE NATIONAL ACADEMIES PRESS, 500 Fifth Street, NW, Washington, DC 20001
NOTICE: The project that is the subject of this report was approved by the Governing Board of the National Research Council, whose members are drawn from the councils of the National Academy of Sciences, the National Academy of Engineering, and the Institute of Medicine.
This study was supported by Grant DMS-1332693 between the National Academy of Sciences and the National Science Foundation. Any opinions, findings, or conclusions expressed in this publication are those of the author and do not necessarily reflect the views of the organizations or agencies that provided support for the project.
International Standard Book Number-13: 978-0-309-31437-4
International Standard Book Number-10: 0-309-31437-2
This report is available in limited quantities from:
Board on Mathematical Sciences and Their Applications
500 Fifth Street, NW
Washington, DC 20001
bmsa@nas.edu
http://www.nas.edu/bmsa
Additional copies of this workshop summary are available for sale from the National Academies Press, 500 Fifth Street, NW, Keck 360, Washington, DC 20001; (800) 624-6242 or (202) 334-3313; http://www.nap.edu/.
Copyright 2014 by the National Academy of Sciences. All rights reserved.
Printed in the United States of America
THE NATIONAL ACADEMIES
Advisers to the Nation on Science, Engineering, and Medicine
The National Academy of Sciences is a private, nonprofit, self-perpetuating society of distinguished scholars engaged in scientific and engineering research, dedicated to the furtherance of science and technology and to their use for the general welfare. Upon the authority of the charter granted to it by the Congress in 1863, the Academy has a mandate that requires it to advise the federal government on scientific and technical matters. Dr. Ralph J. Cicerone is president of the National Academy of Sciences.
The National Academy of Engineering was established in 1964, under the charter of the National Academy of Sciences, as a parallel organization of outstanding engineers. It is autonomous in its administration and in the selection of its members, sharing with the National Academy of Sciences the responsibility for advising the federal government. The National Academy of Engineering also sponsors engineering programs aimed at meeting national needs, encourages education and research, and recognizes the superior achievements of engineers. Dr. C. D. Mote, Jr., is president of the National Academy of Engineering.
The Institute of Medicine was established in 1970 by the National Academy of Sciences to secure the services of eminent members of appropriate professions in the examination of policy matters pertaining to the health of the public. The Institute acts under the responsibility given to the National Academy of Sciences by its congressional charter to be an adviser to the federal government and, upon its own initiative, to identify issues of medical care, research, and education. Dr. Victor J. Dzau is president of the Institute of Medicine.
The National Research Council was organized by the National Academy of Sciences in 1916 to associate the broad community of science and technology with the Academy’s purposes of furthering knowledge and advising the federal government. Functioning in accordance with general policies determined by the Academy, the Council has become the principal operating agency of both the National Academy of Sciences and the National Academy of Engineering in providing services to the government, the public, and the scientific and engineering communities. The Council is administered jointly by both Academies and the Institute of Medicine. Dr. Ralph J. Cicerone and Dr. C. D. Mote, Jr., are chair and vice chair, respectively, of the National Research Council.
PLANNING COMMITTEE ON TRAINING STUDENTS TO EXTRACT VALUE FROM BIG DATA: A WORKSHOP
JOHN LAFFERTY, University of Chicago, Co-Chair
RAGHU RAMAKRISHNAN, Microsoft Corporation, Co-Chair
DEEPAK AGARWAL, LinkedIn Corporation
CORINNA CORTES, Google, Inc.
JEFF DOZIER, University of California, Santa Barbara
ANNA GILBERT, University of Michigan
PATRICK HANRAHAN, Stanford University
RAFAEL IRIZARRI, Harvard University
ROBERT KASS, Carnegie Mellon University
PRABHAKAR RAGHAVAN, Google, Inc.
NATHANIEL SCHENKER, Centers for Disease Control and Prevention
ION STOICA, University of California, Berkeley
Staff
NEAL GLASSMAN, Senior Program Officer
SCOTT T. WEIDMAN, Board Director
MICHELLE K. SCHWALBE, Program Officer
RODNEY N. HOWARD, Administrative Assistant
COMMITTEE ON APPLIED AND THEORETICAL STATISTICS
CONSTANTINE GATSONIS, Brown University, Chair
MONTSERRAT (MONTSE) FUENTES, North Carolina State University
ALFRED O. HERO III, University of Michigan
DAVID M. HIGDON, Los Alamos National Laboratory
IAIN JOHNSTONE, Stanford University
ROBERT KASS, Carnegie Mellon University
JOHN LAFFERTY, University of Chicago
XIHONG LIN, Harvard University
SHARON-LISE T. NORMAND, Harvard University
GIOVANNI PARMIGIANI, Harvard University
RAGHU RAMAKRISHNAN, Microsoft Corporation
ERNEST SEGLIE, Office of the Secretary of Defense (retired)
LANCE WALLER, Emory University
EUGENE WONG, University of California, Berkeley
Staff
MICHELLE K. SCHWALBE, Director
RODNEY N. HOWARD, Administrative Assistant
BOARD ON MATHEMATICAL SCIENCES AND THEIR APPLICATIONS
DONALD SAARI, University of California, Irvine, Chair
DOUGLAS N. ARNOLD, University of Minnesota
GERALD G. BROWN, Naval Postgraduate School
L. ANTHONY COX, JR., Cox Associates, Inc.
CONSTANTINE GATSONIS, Brown University
MARK L. GREEN, University of California, Los Angeles
DARRYLL HENDRICKS, UBS Investment Bank
BRYNA KRA, Northwestern University
ANDREW W. LO, Massachusetts Institute of Technology
DAVID MAIER, Portland State University
WILLIAM A. MASSEY, Princeton University
JUAN C. MEZA, University of California, Merced
JOHN W. MORGAN, Stony Brook University
CLAUDIA NEUHAUSER, University of Minnesota
FRED S. ROBERTS, Rutgers University
CARL P. SIMON, University of Michigan
KATEPALLI SREENIVASAN, New York University
EVA TARDOS, Cornell University
Staff
SCOTT T. WEIDMAN, Board Director
NEAL GLASSMAN, Senior Program Officer
MICHELLE K. SCHWALBE, Program Officer
RODNEY N. HOWARD, Administrative Assistant
BETH DOLAN, Financial Associate
Acknowledgment of Reviewers
This report has been reviewed in draft form by persons chosen for their diverse perspectives and technical expertise in accordance with procedures approved by the National Research Council’s Report Review Committee. The purpose of this independent review is to provide candid and critical comments that will assist the institution in making its published report as sound as possible and to ensure that the report meets institutional standards of objectivity, evidence, and responsiveness to the study charge. The review comments and draft manuscript remain confidential to protect the integrity of the deliberative process. We thank the following individuals for their review of this report:
Michael Franklin, University of California, Berkeley,
Johannes Gehrke, Cornell University,
Claudia Perlich, Dstillery, and
Duncan Temple Lang, University of California, Davis.
Although the reviewers listed above have provided many constructive comments and suggestions, they were not asked to endorse the views presented at the workshop, nor did they see the final draft of the workshop summary before its release. The review of this workshop summary was overseen by Anthony Tyson, University of California, Davis. Appointed by the National Research Council, he was responsible for making certain that an independent examination of the summary was carried out in accordance with institutional procedures and that all review comments were carefully considered. Responsibility for the final content of this summary rests entirely with the author and the institution.
Contents
2 THE NEED FOR TRAINING: EXPERIENCES AND CASE STUDIES
Training Students to Do Good with Big Data
The Need for Training in Big Data: Experiences and Case Studies
3 PRINCIPLES FOR WORKING WITH BIG DATA
Big Data Machine Learning—Principles for Industry
Principles for the Data Science Process
Principles for Working with Big Data
4 COURSES, CURRICULA, AND INTERDISCIPLINARY PROGRAMS
Computational Training and Data Literacy for Domain Scientists
Experience with a First Massive Online Open Course on Data Science
Can Knowledge Bases Help Accelerate Science?
Divide and Recombine for Large, Complex Data
Yahoo’s Webscope Data Sharing Program
Whom to Teach: Types of Students to Target in Teaching Big Data
How to Teach: The Structure of Teaching Big Data
What to Teach: Content in Teaching Big Data
Parallels in Other Disciplines