You have undoubtedly seen many articles on the web and in the popular press that explore the question “What is Data Science?” Data Science is hard to pin down, for several reasons:
- First of all, it is interdisciplinary. People have varying opinions about what activities or subject areas make up Data Science. Some even question whether Data Science has an identity separate from its component parts.
- Second, it is relatively new as a distinct field of practice.
- Third, Data Science can be practiced in many areas that can superficially look very different: Business, Life Science, Physical Science, Social Science, Government, Medicine, and more. Even within an area it can be practiced in many different ways.
- Fourth, frankly, there is hype. Data Science is hot. And while Data Science is a “real” profession, it is perhaps currently riding high on the hype cycle. The good news is that there is much work to be done that is important and satisfying.
Here is my “elevator pitch” take on the question:
What is Data Science?
Technology Advances Set the Stage
Cumulative advances across a broad front of technologies have had a synergistic effect. These technologies include:
- Ubiquitous, distributed, computer-mediated business and social processes
- Databases, including Big Data and Unstructured/Semi-structured data
- Statistics/Analytics/Machine Learning
- Data-Driven Decision Making in the context of Business Strategy and Performance Management
This Synergy Produces a New Field
Data Science is an interdisciplinary combination of:
- Data Programming
- Analytics
- Visualization
- Insight
- Soft Skills
This spells DAVIS!
Data Programming is the ability to extract and prepare data using a combination of scripting languages, high-level programming languages, query languages like SQL, and ETL (Extract-Transform-Load) software. The data can be “Big Data” or “Regular Data,” and it can be anywhere: traditional databases, flat files, NoSQL databases, NewSQL databases, or streaming event/sensor data.
Analytics builds on a foundation of basic statistics to support advanced methods like classification, clustering, regression, forecasting, and machine learning. This is what many consider to be the “sexy” part of Data Science, but analytics does not stand on its own.
Visualization uses best practices based on human factors and perception research, plus an explosion of new creative visual formats, to show complex data, relationships, and conclusions. The ability to tell a story is essential.
Insight means you have more than academic knowledge or the ability to drive a software package; you effectively bring all of yourself to bear on tough problems. Insight also means you have real-world knowledge of the problem area (often called “domain knowledge” or “substantive expertise”). It also means that you think about problems and their associated resources and constraints more deeply than a client may express in an initial problem statement, so that you can clarify their real needs and provide the most value.
Soft Skills enable you to relate effectively to all kinds of people, from fellow team members, to gatekeepers of data, to executive decision-makers and stakeholders. This recognizes that Data Science in practice is fundamentally a team-driven, culturally embedded decision support service, and that what matters are actionable insights that lead to real-world outcomes.
There are obviously many other ways to distill Data Science into a small number of categories, including a popular 3-category Venn diagram that has been used in a Data Science training program where I work, and a 2-category pithy saying. These ways of looking at Data Science complement rather than compete with each other. One way to think of this is as a clustering problem. If you are familiar with K-Means Clustering, you know that you have to specify the number of clusters in advance, so the number of categories is a choice you make rather than something the data dictates. Perhaps someday a computer will analyze a mathematical representation of the natural language semantics and give us its take on what Data Science is!
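To make the K-Means point concrete, here is a minimal sketch in plain Python. It is only an illustration: the `kmeans` function, the `points` data, and the deterministic farthest-point seeding are my own assumptions, chosen to keep the example small and reproducible (standard K-Means usually starts from random centroids). The thing to notice is that `k` is an input, so the same data can be split into two groups or three depending on what you ask for.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iterations=10):
    """Assign each 2-D point to one of k clusters; the caller must choose k."""
    # Assumption for reproducibility: seed with the first point, then
    # repeatedly add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels


# Hypothetical data: three small, well-separated groups of points.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]

for k in (2, 3):
    print(k, kmeans(points, k))
```

Asking for three clusters recovers the three groups, while asking for two forces two of them to merge. Neither answer is wrong; each reflects the number of categories you chose to look for.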