On What Data Science Is

I wrote this as notes while preparing my thoughts for a student group presentation I’m scheduled to do this week.

Introduction

Data science is a poorly defined field with an unclear future. Go on Quora, Reddit, or Twitter, or surf the business news and the scientific journals, and you will find opinions ranging from those proclaiming data science the glorious destiny of mankind to those despising it as an unproductive pipe dream. From personal experience, I can say that the field has definite and proven potential, but I agree that there is plenty still to be decided about its future.
So what is data science? Written naively from my own opinion, I would say that data science is the discipline of those ‘responsible for generating value from all the information of the digital age.’ It sits alongside several neighboring roles: data engineers (IT types largely responsible for building the infrastructure that gathers and processes the data), business analysts (people who usually rely on business knowledge rather than advanced computer tools to get value from data), and a collection of PhDs, statisticians, and operations research types (the super-technical chaps who only indirectly touch the business and handle very complicated problems and research in general). A final niche but rapidly growing group of careers is data cataloging and data architecture, whose people try to organize the mess that data systems usually are into a coherent picture.
Both data scientists and business intelligence analysts or engineers face the complex task of trying to be an IT specialist, a business specialist, and a statistician all at once. To me, the primary difference between the two specializations is that business intelligence personnel tend to handle the less technical, day-to-day tasks, usually involving lots of visuals and reports, while a data scientist takes on larger and more sophisticated projects. But this is just my opinion; from company to company and region to region, the shape of the job varies tremendously.
The stereotypical image of a data scientist, right now, is that of a machine learning engineer: someone who customizes existing self-teaching computer models to automatically pull insights from data and directly generate answers.
A classic example of this in the real world would be a company buying a large chunk of data on existing and potential customers. The company wants to target these customers with a marketing campaign, but it is far too expensive to target all potential customers, so the company wants a priority list of those who are most likely to respond. A data scientist takes the company’s historical data and trains a magical machine learning model that figures out which factors in the past were most strongly linked with responsive customers and good sales. The data scientist then applies this model to the new data to generate a list of the predicted most valuable potential contacts, thus increasing the company’s return on the marketing campaign by a massive amount. True story, happens every time.
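The workflow in that example can be sketched in a few lines of Python. This is only a rough illustration, not how any particular company does it: the feature names (past_purchases, opened_last_email), the tiny made-up history, the contact names, and the hand-rolled logistic regression are all my assumptions for the sake of the sketch. In practice you would most likely reach for an existing machine learning library rather than writing the training loop yourself.

```python
import math

def predict(features, weights, bias):
    """Return the modeled probability (0 to 1) that a contact responds."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train(history, learning_rate=0.1, epochs=500):
    """Fit a tiny logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(history[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, responded in history:
            error = predict(features, weights, bias) - responded
            bias -= learning_rate * error
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
    return weights, bias

# Hypothetical history: ([past_purchases scaled 0-1, opened_last_email], responded)
history = [
    ([0.9, 1], 1), ([0.8, 1], 1), ([0.7, 0], 1), ([0.5, 1], 1),
    ([0.6, 1], 0), ([0.3, 1], 0), ([0.2, 0], 0), ([0.1, 0], 0),
]

# Hypothetical new contacts the company just bought data on.
new_contacts = {
    "contact_a": [0.95, 1],
    "contact_b": [0.40, 1],
    "contact_c": [0.10, 0],
}

weights, bias = train(history)

# Priority list: contacts sorted by predicted response probability, best first.
priority_list = sorted(
    ((name, predict(features, weights, bias))
     for name, features in new_contacts.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, probability in priority_list:
    print(f"{name}: {probability:.2f}")
```

The marketing team would then work down the list from the top until the campaign budget runs out.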

The Mindset

A topic of some debate is why data scientists are called scientists when they (excluding academic PhD data scientists) don’t really go around publishing papers and experimenting on rats, at least not most of the time.
First order of consideration: being IT. At first glance, it would seem logical that Information Technology would exactly fit data scientists, and indeed in a corporate setting data scientists often live under IT. But see, IT is largely concerned with big business-critical systems: massive databases, sales transaction systems, websites, and so on, while data scientists use, but don’t usually build, such systems. IT spends most of its time concerned with edge cases, freaking out about you or anyone else breaking their systems. Data scientists are chill: we build margins of error into our models and focus first and foremost on valuable insights, not just on wrangling ones and zeroes around in circles.
Second order of consideration: being an engineer. Data scientists are far more like engineers in that they play around with technology in order to build new systems. Generally, engineers make good data scientists, and vice versa. But that said, data science has much less of a focus on limitations – we don’t have to worry about our data bridges collapsing and killing lots of people – and more a focus on creativity. In general, at least at the moment, data scientists are expected to be able to work in any industry with any tool, and are faced with an overwhelming array of possible solutions in a way that engineers usually aren’t.
Finally, the scientist. There is no recipe for success in data science. As a data scientist, you use data that no one has ever used before, for tasks no one has ever considered doing before. The end result of that is constant experimentation: dreaming up new ideas, testing them, and seeing if they work. And just like in science, all of the burden of proof rests on you. By and large, data scientists are expected to be capable of explaining everything they do and why they do it, continually justifying themselves in a way most careers don’t have to. Above all, data science is the cutting edge, and knows it. Everything is evolving in this field, and it requires adaptability, creativity, and initiative. You’ve got to love complex thinking, juggling ideas, different lines of work, and the chaos that is information from the real world, and then you’ll love being a data scientist.
Biggest challenge: working with business people. Honestly, machine learning is pretty much magic to an outsider, and even a bit to an insider. Getting a good dialogue going and trust established is hard, but critical. Probably less of a problem in Silicon Valley and other tech hubs.
Biggest surprise: how important understanding systems is. Building a cool model is relatively easy; getting it into production is hard. Understanding databases, cloud systems, and so on is very useful, unless you want to wait six months for IT to have time to productionize things for you.
Biggest value: clever data manipulation. Figuring out clever ways to manipulate variables and finding new data sources that tell your model information in useful ways is the number one way to get better results. AutoML (where a computer chooses the best algorithm designs for you) is a thing that I use a lot; old school data science people had to spend lots of time trying different models. The hard part is finding surprising little things that together can tell a system about all the complexities of the real world, in a repeatable and usable way.
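To make "manipulating variables" a little more concrete, here is a minimal sketch of turning raw purchase records into features a model can actually use. The record layout, the engineer_features helper, and the three derived features are hypothetical choices of mine for illustration; which features are worth deriving depends entirely on the problem.

```python
from datetime import date

def engineer_features(purchases, as_of):
    """Summarize raw (purchase_date, amount) records into model-ready features."""
    amounts = [amount for _, amount in purchases]
    last_purchase = max(day for day, _ in purchases)
    # date.weekday() returns 5 for Saturday and 6 for Sunday.
    weekend_orders = sum(1 for day, _ in purchases if day.weekday() >= 5)
    return {
        "days_since_last_purchase": (as_of - last_purchase).days,
        "average_order_value": sum(amounts) / len(amounts),
        "weekend_share": weekend_orders / len(purchases),
    }

# Hypothetical raw purchase history for one customer.
raw_purchases = [
    (date(2024, 1, 6), 40.0),   # Saturday
    (date(2024, 1, 10), 20.0),  # Wednesday
    (date(2024, 1, 21), 60.0),  # Sunday
    (date(2024, 2, 1), 80.0),   # Thursday
]

features = engineer_features(raw_purchases, as_of=date(2024, 2, 15))
print(features)
```

A raw list of dates and amounts tells a model very little on its own, while "bought two weeks ago, mostly on weekends" is something it can work with.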
Education Recommendations:

  • Get a graduate degree. Not necessary, but strongly recommended for data scientists.
  • Know Statistics 101. You don’t need advanced math most of the time, however.
  • Know at least one of these languages: R or Python. Also basic SQL querying. There are lots of other skills and tools you may or may not need, but R, Python, and SQL are the basics.
  • Play around with some AWS/Azure/Google Cloud products using their free trials: fire up a VM, set up a simple database, try some machine learning tools, and try to integrate the pieces together.
  • Use GitHub. It’s not critical to have a public repository of cool work. But it is very useful to know the basics of using git for source control and management.
  • Try Tableau, Power BI, or another major visualization tool. There’s a lot of skill required to make genuinely good graphics, and good graphics make your work look far cooler than anything else you can do, at least to a business person.
  • Be able to document your work, write executive summaries, and share and present knowledge to non-technical audiences.
  • Play around with a Raspberry Pi, use Linux, research how to keep it cybersecure, maybe try Docker containers, and set up a live data science system of some kind. For example, I’ve got a live weather/sensor data stream with a Pi from my family’s farm. I never used this in interviews, but I should have. Having built your own end-to-end data science system is a great thing to do.
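
For a sense of what "basic SQL querying" from Python looks like, here is a small sketch using the sqlite3 module from Python’s standard library with an in-memory database. The readings table, its columns, and the sensor values are made up for illustration; they are not my actual farm data.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, temperature_c REAL)")

# Hypothetical sensor readings.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [
        ("barn", 10.0),
        ("barn", 14.0),
        ("field", 4.0),
        ("field", 6.0),
        ("field", 8.0),
    ],
)

# Basic SQL: count the readings and average the temperature per sensor.
rows = conn.execute(
    "SELECT sensor, COUNT(*), AVG(temperature_c) "
    "FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
conn.close()

for sensor, reading_count, average_temperature in rows:
    print(sensor, reading_count, average_temperature)
```

SELECT, GROUP BY, and a join or two will cover a surprising share of day-to-day data work.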
