Data Scientist Foundations: The Hard and Human Skills You Need
Some companies prefer data scientists to have advanced degrees (M.S. or Ph.D.) in a quantitative field.
A master’s degree is typically required for positions as a computer and information research scientist, a category that includes data scientists, according to the Bureau of Labor Statistics (BLS).
Assembling a strong set of data skills is useful, but there are two kinds of skills to consider:
- Hard (i.e., technical)
- Human (i.e., interpersonal)
Companies typically look for both.
HARD SKILLS
Mathematics (Other Than Statistics)
Basic undergraduate courses typically cover calculus and linear algebra. Once you have those “basics” under your belt, try digging into matrix computation, diffusion geometry, and similar topics in applied mathematics.
“Most data mining applications use matrix computations as their fundamental algorithms, so a strong understanding of them is essential.”
Many of these courses cross over into …
Statistics
A verifiable background in statistical analysis (through work or education) is typically a prerequisite for most data science jobs.
“Understanding correlation, multivariate regression and all aspects of massaging data together to look at it from different angles for use in predictive and prescriptive modeling is the backbone knowledge that’s really step one of revealing intelligence.”
That means interviewers will be looking for core competencies in statistical tools such as:
- R
- SAS
- SPSS
- SciPy
- Stata
Initially written by two statisticians at the University of Auckland in New Zealand, R is free and widely used for statistical analysis.
Check out the R-Project for more.
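Correlation and regression, the concepts named in the quote above, can be tried out without any special tooling. Here's a minimal sketch in plain Python (rather than R or SAS) that computes a Pearson correlation and fits a one-variable least-squares line. It is a simplification of the multivariate case, and the function names and the ad-spend data are made up purely for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: ad spend vs. weekly sales (illustration only)
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(ad_spend, sales)
intercept, slope = fit_line(ad_spend, sales)
print(f"r = {r:.3f}, sales ~ {intercept:.2f} + {slope:.2f} * ad_spend")
```

In practice you would likely reach for R, SciPy or a similar package for this, but writing it out once shows what those tools are doing under the hood.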
Programming/Scripting Languages
These are some of the programming languages data scientists may use day-to-day:
- Python
- C/C++
- Java
- Ruby
- Perl
- MATLAB
- Pig
Python is a good one to have in your tool belt. In his coverage of a 2013 Predictive Analytics World conference, Derrick Harris shared this quote from Sameer Chopra, Orbitz VP of Advanced Analytics:
“If you were to leave today and ask: ‘What specific skills should I learn?’ Python.”
Relational Databases
A solid understanding of SQL-based systems is useful. Try learning the fundamentals of database design and management – get a handle on primary and foreign keys, indexing, querying, normalization, constraints and other basic features.
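To make those terms concrete, here's a small sketch using Python's built-in sqlite3 module. The customers and orders tables, their columns, and the sample rows are all hypothetical. The point is simply to show a primary key, a foreign key, a constraint, an index and a join query working together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL CHECK (amount >= 0)
);
-- Index the foreign key column to speed up lookups and joins
CREATE INDEX idx_orders_customer ON orders(customer_id);
""")

conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5)])

# Query: total order amount per customer
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY c.name
""").fetchall()
print(totals)

# The foreign key stops an order from pointing at a customer who doesn't exist
try:
    conn.execute("INSERT INTO orders VALUES (13, 99, 5.0)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print("orphan order rejected:", fk_rejected)
```

Splitting customers and orders into separate tables linked by a key is a small taste of normalization: each customer's name is stored once, no matter how many orders they place.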
NewSQL aims to create scalable, horizontally distributed systems like Cloudera Impala or VoltDB. The approach combines the scalability of NoSQL for big, messy data with the rigorous and reliable structure of traditional relational databases.
Distributed Computing Systems
You may want to seek hands-on experience with:
- Hadoop
- HBase
- Cassandra
- MapReduce
- Hive
- and all the new packages and approaches that continue to proliferate
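If MapReduce is new to you, the underlying pattern is easy to try on a single machine before touching a cluster. The sketch below is plain Python, not the Hadoop API: the function names (map_phase, shuffle, reduce_phase) and the sample documents are illustrative assumptions, and a real framework would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input: each string stands in for one document
documents = ["big data big ideas", "data beats opinions"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Because each document is mapped independently and each key is reduced independently, the work can be spread across machines, which is the whole appeal of the approach.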
Data Mining
Data miners dig into big data sets to discover new and interesting patterns using techniques such as cluster analysis, anomaly detection, and dependency (association rule) mining.
Interested? Find a class or work mentor. Most data mining syllabi will touch on key concepts, practical applications and common algorithms.
Above all, stay open to possibilities. Data mining is interdisciplinary by its very nature, drawing from AI and machine learning, statistics and database systems.
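For a small taste of one technique mentioned above, anomaly detection, here's a minimal sketch that flags values sitting unusually far from the mean. It uses a simple z-score rule, which is just one of many possible approaches, and the daily_orders figures and the threshold of 2.0 are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` sample standard
    deviations away from the mean (a simple z-score rule)."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical daily order counts with one suspicious spike
daily_orders = [102, 98, 101, 97, 103, 99, 100, 250, 101, 99]

outliers = find_anomalies(daily_orders)
print(outliers)
```

One caveat: a large outlier inflates the standard deviation it is being measured against, so on small samples a strict threshold can miss it. Real projects often use more robust measures, such as the median and interquartile range.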
Data Modeling
Even if you’re not creating the models yourself, it’s useful to know how to read them, present them to higher-ups, and judge whether you can work with them.
“Knowing the difference between a fact table that is put together well and one that is faulty with semi-structured unconstrained keys makes all the difference in how easily you can trust and massage the data you’re trying to capture.”
For those getting into the game, Roe recommends starting with data modeling tools, techniques and methodologies like:
- ERWin
- Agile Data Modeling
- ORM Diagrams
- UML class diagrams
- CRC cards
- Conceptual/logical/physical schema
- DDL
- Bachman diagrams
- Zachman Framework
Predictive Modeling
Predictive modeling is central to the job. Harris goes so far as to class it as one of a data scientist’s four core competencies (along with SQL, statistics and programming).
“If you don’t have at least a grounding in these skills, you’re probably not getting through the door, in part because they form a common language that lets people from different backgrounds talk to each other.”
Want to make back some of your education costs?
Test your skills against the best on Kaggle, a crowdsourced platform for data predictions. Companies and organizations regularly award prizes for the best solutions to their predictive-modeling needs.
Machine Learning
Machine learning is variously defined as the:
- Ability of a machine to improve its own performance through artificial intelligence
- Use of computers to develop and improve algorithms
- Science of getting computers to act without being explicitly programmed
Machine learning, formerly the province of science fiction, is now making a regular appearance in lists of data science job requirements.
You don’t necessarily have to pay to learn it. Andrew Ng’s free Machine Learning course on Coursera has produced a number of distinguished alumni, including Kaggle winners like Xavier Conort.
Visualization
You can search, scrub and mine data to your heart’s content, but in the end, it all comes down to showcasing your findings in a way that business users will understand.
This can be achieved with visualization tools such as:
- Flare
- HighCharts
- AmCharts
- D3.js
- Processing
- Google Visualization API
- Raphael.js
- Tableau
It’s the all-important end step. Always keep in mind: the clearer your findings, the easier the decision, the quicker the outcome, and the higher the praise for all your hours of hard work.
HUMAN SKILLS
Domain Expertise
Domain expertise usually means having a deep and abiding interest in your field (e.g., medicine, government, retail, manufacturing, etc.) and a thorough understanding of your organization’s data.
How can you cultivate those two desirables?
- Ask questions.
- Become familiar with the systems.
- Explore the products.
- Learn how the data is collected and how it’s being used.
- Get to know the people who are involved in each step of that collection and use.
As Bill Franks, Teradata Chief Analytics Officer, points out:
“I’ve never heard anyone discuss a data science profile without talking about understanding the business. Again, it’s critical to have the person running the analysis fully understand – and be interested in – why this question is being asked, what the business person would do given the results, and why they would make that decision.”
Creativity and Curiosity
Data scientists look at incomplete and messy data, inadequate analytics, faulty methods and models, and seemingly insoluble business problems, and they say:
“I got this!”
Creative data scientists are curious. They aren’t afraid of playing around in unstructured environments, of proceeding by trial and error, of following the white rabbit down the hole.
Creative data scientists experiment. They blend CRM transaction records with traffic reports; they entangle multiple systems and data sets; they hack across a dizzying array of incompatible data sources.
Above all, creative data scientists get hired.
Need proof? As part of its hiring process, Netflix asks each of its data science applicants to come up with a framework that solves a problem of the interviewer’s choice.
Storytelling
Once you’ve interpreted the data, the next step is typically communicating your findings.
It’s not always easy:
- You may be presenting your findings to a room full of people who have absolutely no clue what it is that you do.
- You may be fighting to persuade them that what you’re suggesting will actually have long-term benefits.
- You may be struggling to speak to them in a language they’ll understand at all.
Try weaving a narrative around your data. Show your audience (using those outstanding visualization skills) where they’ve come from, where they are now and where they could be in the future.
If you’re an introvert by nature, you may want to consider a communication course.
Ethics
Whatever the project, you may want to consider what effect your work will have on the lives of customers and clients. Considerations may include:
- Examining a web user’s data can invade their privacy
- Monitoring communications has legal implications
- Placing sensors in household objects raises red flags
- Hoarding data indiscriminately makes the FTC very unhappy
- Predictive models can have disastrous results for unlucky individuals
Indeed, as Edith Ramirez, then Chairwoman of the FTC, warned at the Technology Policy Institute Aspen Forum in 2013, in this brave new world of data science:
“Individuals may be judged not because of what they’ve done, or what they will do in the future, but because inferences or correlations drawn by algorithms suggest they may behave in ways that make them poor credit or insurance risks, unsuitable candidates for employment or admission to schools or other institutions, or unlikely to carry out certain functions.”
These risks may be difficult for your company to ignore.
The Elevator Speech
In the end, it may come down to a few simple must-haves.
Harris explains what Chris Pouliot, Director of Algorithms and Analytics at Netflix, is really looking for in candidates:
“An advanced degree in a quantitative field; hands-on experience hacking data (ideally using Hive, Pig, SQL or Python); good exploratory analysis skills; the ability to work with engineering teams; and the ability to generate and create algorithms and models rather than relying on out-of-the-box ones.”
Last updated: June 2020