D
Skills for Data Science Mastery
The National Academies of Sciences, Engineering, and Medicine 2018 report Data Science for Undergraduates: Opportunities and Options (The National Academies Press, Washington, DC) emphasized that a critical task in the education of future data scientists is to instill basic data acumen. This requires exposure to key concepts in data science, real-world data and problems that can reinforce the limitations of tools, and ethical considerations that permeate many applications. The following are key concepts involved in developing basic data science acumen.
- Mathematical foundations. Key mathematical concepts/skills that would be important for all students in their data science programs and critical for their success in the workforce are the following:
- Set theory and basic logic,
- Multivariate thinking via functions and graphical displays,
- Basic probability theory and randomness,
- Matrices and basic linear algebra,
- Networks and graph theory, and
- Optimization.
Some data scientists and programs require a deeper understanding of mathematical underpinnings. This might include the following:
- Partial derivatives (to understand interactions in a model),
- Advanced linear algebra (i.e., properties of matrices, eigenvalues, decompositions),
- “Big O” notation and analysis of algorithms, and
- Numerical methods (e.g., approximation and interpolation).
- Computational foundations. While it would be ideal for all data scientists to have extensive coursework in computer science, new pathways may be needed to establish appropriate depth in algorithmic thinking and abstraction in a streamlined manner. This might include the following:
- Basic abstractions,
- Algorithmic thinking,
- Programming concepts,
- Data structures, and
- Simulations.
- Statistical foundations. Important statistical foundations might include the following:
- Variability, uncertainty, sampling error, and inference;
- Multivariate thinking;
- Non-sampling error, design, experiments (e.g., A/B testing), biases, confounding, and causal inference;
- Exploratory data analysis;
- Statistical modeling and model assessment; and
- Simulations and experiments.
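As one illustration of how simulation can make variability and sampling error concrete, the following minimal sketch draws repeated random samples from a fixed population and compares how much the sample mean varies at two sample sizes. The population, the sample sizes, and the helper name `sampling_sd_of_mean` are assumptions chosen for illustration; they do not come from the report.

```python
import random
import statistics


def sampling_sd_of_mean(population, sample_size, n_repeats, rng):
    """Standard deviation of the sample mean across repeated random samples."""
    means = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_repeats)
    ]
    return statistics.stdev(means)


rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Hypothetical population: 10,000 values from a skewed (exponential) distribution.
population = [rng.expovariate(1.0) for _ in range(10_000)]

# Spread of the sample mean for small and large samples (500 repeats each).
se_small = sampling_sd_of_mean(population, 25, 500, rng)
se_large = sampling_sd_of_mean(population, 400, 500, rng)

print(f"n=25:  spread of sample mean = {se_small:.3f}")
print(f"n=400: spread of sample mean = {se_large:.3f}")
```

Under these assumptions, the spread of the sample mean should shrink roughly in proportion to one over the square root of the sample size, so the larger samples give noticeably more stable estimates.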
- Data management and curation. Key data management and curation concepts/skills that would be important for all students in their data science programs and critical for their success in the workforce are the following:
- Data provenance;
- Data preparation, especially data cleansing and data transformation;
- Data management (of a variety of data types);
- Record retention policies;
- Data subject privacy;
- Missing and conflicting data; and
- Modern databases.
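To make the "missing and conflicting data" item concrete, here is a small sketch that scans records from two sources for missing values and for fields on which the sources disagree. The records, the field names, and the helpers `find_missing` and `find_conflicts` are all hypothetical; a real curation pipeline would also need rules for deciding which source to trust.

```python
from collections import defaultdict

# Hypothetical records from two sources; None marks a missing value.
records = [
    {"id": 1, "source": "clinic", "age": 34, "city": "Austin"},
    {"id": 1, "source": "survey", "age": 43, "city": "Austin"},
    {"id": 2, "source": "clinic", "age": None, "city": "Boston"},
    {"id": 2, "source": "survey", "age": 51, "city": "Boston"},
]


def find_missing(records, fields):
    """Return (id, source, field) for every missing value."""
    return [
        (r["id"], r["source"], f)
        for r in records
        for f in fields
        if r.get(f) is None
    ]


def find_conflicts(records, fields):
    """Return {(id, field): values} where sources report different non-missing values."""
    seen = defaultdict(set)
    for r in records:
        for f in fields:
            if r.get(f) is not None:
                seen[(r["id"], f)].add(r[f])
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}


missing = find_missing(records, ["age", "city"])
conflicts = find_conflicts(records, ["age", "city"])

print("missing:", missing)
print("conflicts:", conflicts)
```

In this made-up data set, record 2 lacks an age in one source and record 1 has two different ages, which is the kind of discrepancy that has to be resolved or documented before analysis.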
- Data description and visualization. Key data description and visualization concepts/skills that would be important for all students in their data science programs and critical for their success in the workforce are the following:
- Data consistency checking,
- Exploratory data analysis,
- Grammar of graphics,
- Attractive and sound static and dynamic visualizations, and
- Dashboards.
- Data modeling and assessment. Key data modeling and assessment concepts/skills that would be important for all students in their data science programs and critical for their success in the workforce are the following:
- Machine learning (e.g., supervised, unsupervised, and deep learning),
- Multivariate modeling and supervised learning,
- Dimension reduction techniques and unsupervised learning,
- Deep learning,
- Model assessment and sensitivity analysis, and
- Model interpretation (particularly for “black box” models).
- Workflow and reproducibility. Key workflow and reproducibility concepts/skills that would be important for all students in their data science programs and critical for their success in the workforce are the following:
- Workflows and workflow systems,
- Documentation and code standards,
- Source code (version) control systems,
- Reproducible analysis, and
- Collaboration.
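One small ingredient of reproducible analysis is making any randomness explicit and controlled. The sketch below, offered only as an illustration, passes a seed into a resampling routine so that rerunning the analysis yields identical results. The bootstrap interval, the data values, and the function name `bootstrap_mean_interval` are assumptions introduced here, not methods prescribed by the report.

```python
import random
import statistics


def bootstrap_mean_interval(data, seed, n_boot=1000):
    """Approximate 95% percentile bootstrap interval for the mean."""
    rng = random.Random(seed)  # local generator: no hidden global state
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_boot)
    )
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot) - 1]


data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]  # made-up measurements

# Two runs with the same seed should agree exactly.
first = bootstrap_mean_interval(data, seed=42)
second = bootstrap_mean_interval(data, seed=42)

print("identical across runs:", first == second)
```

Recording the seed alongside the code version and input data is one way, among several, to let a collaborator regenerate the same numbers later.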
- Communication and teamwork. Key communication and teamwork concepts/skills that would be important for all students in their data science programs and critical for their success in the workforce are the following:
- Ability to understand client needs,
- Clear and comprehensive reporting,
- Conflict resolution skills,
- Well-structured technical writing without jargon, and
- Effective presentation skills.
- Domain-specific considerations. Effective application of data science to a domain requires knowledge of that domain. Grounding data science instruction in substantive contextual examples (which will require the development of judgment and background in those areas) will help ensure that data scientists develop the capacity to pose and answer questions with data. Reinforcing skills and capacities developed in data science courses in the context of a specific domain will help students see the entire data science process.
- Ethical problem solving. Key aspects of ethics needed for all data scientists (and for that matter, all educated citizens) include the following:
- Ethical precepts for data science and codes of conduct,
- Privacy and confidentiality (both in the spirit and letter of the law),
- Responsible conduct of research (e.g., human subjects),
- Ability to identify “junk” science, and
- Ability to detect algorithmic and human bias.