Don’t get that Data Science PhD



For 90% of the people reading this, my goal in this post is to persuade you, or help you persuade a friend, to reconsider getting a PhD in Data Science as a form of career advancement. If you are in the 10% who are pursuing the degree to advance human knowledge, or who plan to work in academic or industry research labs, please ignore this article.


I get it: demand for Data Scientists is through the roof. From 2016 to 2019, it was ranked the #1 job in America according to Glassdoor. It fell to #3 in 2020. Yet committing 5-6 years of your career, while delaying full-time real-world experience and forgoing a professional salary, is a very significant investment.

Technology Advancements


Coding in the 90’s

Nowadays, technology moves at lightning speed. I began studying computer programming 25 years ago. My first real programming language was C, and then C++. I distinctly remember the hell I went through dealing with memory management in C. Invalid pointers, buffer overruns, invalid array indexes and heap errors kept me in the computer lab for countless hours. It was a struggle, but I did eventually learn. After graduating college, I programmed in C/C++ for a few more years and gradually moved to other languages like Python. As I look back today, I can’t remember the last time I cared about memory management. Perhaps I just wasn’t using recursion enough. But my point is that much of the complexity surrounding memory management got automated away by higher-level programming languages, their associated ecosystems and the distributed platforms they are deployed on. Most of the time it is more productive to add machines than to optimize the code. The blood, sweat and tears I spent in my early years wrestling with memory seem to have added little to my career.


Scaling My First Startup

I co-founded my first startup, Visual Revenue, back in 2010. Visual Revenue was a big data and machine learning solution that powered a recommendation system for editorial teams at large publishers like Forbes, Comcast and Weather.com. To power our analytics, the infrastructure had to ingest all the web traffic hitting these top publishing sites. By the time we were acquired a few years later, we were processing 12 billion requests per month.


Initially, we built a stream processing system from scratch with a 10-person engineering team. Though we were more or less meeting the business demand to scale the system up to 2 billion requests per month, it failed weekly, sometimes multiple times a week. Every time our system failed, I would get a text. The text tone I used back then still induces an immediate stress response in me today.


Towards the end of 2012, we came to the conclusion that the situation couldn’t go on. We dedicated 4 months to moving our system over to a couple of robust open source frameworks: Kafka + Storm. The result was striking. Without adding headcount, we were able to scale our system 6x to 12 billion requests/month with 99.9% uptime. If I had to guess today, a talented engineer could probably build the same system with off-the-shelf AWS components in a month. It would have higher scalability and better uptime.

Stream processing evolution, 2010 to today

My Takeaways

Technologies that are hard and valuable today will likely get commoditized over time. We are seeing this in action today in machine learning and AI. In the chart below, I mapped out the high-level development timelines of TensorFlow and AWS SageMaker over the past few years as illustrations. Challenging tasks like hyperparameter search, model monitoring, model evaluation and version control have all gotten significantly easier, if not highly automated, in just a few years. The area is maturing fast.


What’s in a Data Science PhD?

Different schools offer different programs, and courses are not the only component of a PhD program. For discussion purposes, though, I pulled the list of courses in the Stanford Data Science PhD program. Not surprisingly, the courses are heavily tilted toward statistics. From a career and employment perspective, are these the most important skills to spend 5 years acquiring?


The Future of Data Scientists

My prediction is that in the coming years, 90% of data science jobs will follow the same path as full stack engineering jobs. We will see demand shift toward full stack data scientists. As tools and frameworks mature, the focus will be on delivering end-to-end value with less and less specialization. Companies will place a premium on a data scientist’s ability to:

  • Understand customer problems, define and prioritize requirements

  • Collaborate across teams, write documentation, get buy-in, share results

  • Understand and visualize data, run A/B tests and draw inferences

  • Apply core machine learning, including experimentation, implementation, and metrics

  • Transfer and transform data from point A to B

  • Follow production code practices including unit tests, docs, logging

  • Show basic proficiency in containerization and cloud computing


Hence, I would argue that spending the next 5 to 6 years working as a data scientist or data engineer solving real world problems would be a superior choice to a PhD degree. My 2 cents ;)


Finally…


At NorthShore.ai, we take a modern knowledge management approach to support scaling Customer Experience and Product teams. Hit us up to see how we can help your team improve productivity and deliver higher customer satisfaction through knowledge intelligence.