Data Science in 2022

A decade ago, Doug Laney defined the three dimensions of data growth as increasing volume, velocity, and variety. In 2016, IBM defined the “Four V’s of Big Data”, adding veracity to the mix and stating that both data quality and availability were key to business analytics. At the same time, it found that one in three business leaders “don’t trust the information they use to make decisions”.[1] Veracity remains an issue today, with over 50% of business leaders saying they don’t fully trust their data assets, according to the 2021 Experian study.[2] This increase in mistrust over five years likely reflects the complexity of today’s data lake[3] and perhaps a lack of data governance oversight overall, rather than a problem with data science itself.

“it’s estimated that more than 90 percent of the total data created by humans has been generated in just the last two years.”


The art of Business Intelligence and its core support areas of Data Engineering, Data Analytics and Data Science has grown tremendously over the past decade, with exponential growth during the pandemic as more businesses rely on the collection of big data to make important decisions. A recent study reported gains for big data usage with an “8% increase in revenues and a 10% reduction in costs”.[4] Knowledge and insight may be the only way a company survives and thrives today, especially with the supply chain crisis.

The data scientist’s toolset in 2022 involves many specialties: statistics, calculus, data pattern recognition, machine learning (ML) algorithms and data visualization. Data scientists must know how to use visualization tools like Excel, Tableau and Power BI along with cloud-based analytical tools such as Looker and SageMaker, while having a solid grounding in SQL reporting. Combining these talents with a solid grasp of data transfer formats, from XML, JSON and Avro to REBOL and Parquet, is important. Factor in the use of Application Programming Interface (API) tooling like MuleSoft, Apigee or Swagger, plus the management of data repositories, and it seems we have a jack of all trades. Even this needs to be rounded out with equal proficiency in languages like Python, Scala, Java, and R. Dedicating time to each area of the job description while setting aside time to keep learning these specialties is a struggle for many. Because staying on top of new advancements in the data science field takes continuous effort, online learning programs are an effective and sensible way to improve knowledge in this area quickly.
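To make the point about data transfer formats concrete, here is a minimal sketch that reads the same record from JSON and from XML using only the Python standard library. The record, its field names (`patient_id`, `readmitted`, `length_of_stay_days`) and the helper functions are hypothetical and chosen purely for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical payloads describing the same record in two formats.
json_payload = '{"patient_id": 17, "readmitted": false, "length_of_stay_days": 4}'
xml_payload = (
    "<record>"
    "<patient_id>17</patient_id>"
    "<readmitted>false</readmitted>"
    "<length_of_stay_days>4</length_of_stay_days>"
    "</record>"
)


def from_json(text):
    """JSON carries basic types (numbers, booleans), so parsing is direct."""
    return json.loads(text)


def from_xml(text):
    """XML element text is always a string, so each field is converted by hand."""
    root = ET.fromstring(text)
    return {
        "patient_id": int(root.findtext("patient_id")),
        "readmitted": root.findtext("readmitted") == "true",
        "length_of_stay_days": int(root.findtext("length_of_stay_days")),
    }


record_a = from_json(json_payload)
record_b = from_xml(xml_payload)
print(record_a == record_b)
```

The design point is simply that each format pushes a different amount of type handling onto the person consuming it, which is part of why familiarity with several formats matters.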

The number of jobs in this field is rising fast, “with Data Engineering and MLOps taking precedence in 2022”.[5] A 2020 study showed “a whopping 84 companies of the Fortune 500 don’t seem to have a single data scientist on their payroll”.[6] As shown in the figure below, the highest concentration of actual data scientists still falls in the technology and financial fields. Candidates with the right skillsets are not coming directly to the workforce from colleges or universities; they often come from existing positions within the firms.

Data Scientist Distribution across Fortune 500


This has led many Fortune 500 companies to spend significant money retraining existing employees to fill these needs and, when it comes to new hires, has led human resources departments to create dedicated training programs for faster integration. Five years ago, only 15% of hospitals employed data science and predictive analytics to prevent hospital readmissions and improve other areas of patient care.[8] In 2017, it was 31%, and healthcare has seen incredible growth in the data science field over the last two years.[9] Data scientists in this field now collect data not only within the hospital but also from external patient care through mobile devices.

Knowledge of the types of data being collected today goes beyond formats and languages and leads to two specific categories the data scientist must recognise: quantitative and qualitative data. Finding the quantitative data is the starting point, and it determines the sampling methods and rates for what are typically numeric datasets. Once the data is collected, exploratory data analysis (EDA) can be undertaken, essentially looking for motivations, opinions, and reasons that answer the hypothetical questions being asked. Interpreting these results, researching actionable plans, and finding trends that may not be evident are all in the job description. Scientific experimentation techniques that apply ML models to current and new data, the creation of training and test sets, and even data preparation and cleansing, known as munging, come into play not only with on-premises tools but, more likely today, in multi-cloud environments. This processing time comes at a cost though. Anaconda’s August 2021 State of Data Science survey showed data scientists spend “39% of their time on data prep and data cleansing, which is more than the time spent on model training, model selection, and deploying models combined.”[10]
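The data preparation, cleansing and train/test steps described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a recommended pipeline: the readings are made up, the outlier rule (median absolute deviation with a threshold of 3) is one assumption among many possible ones, and the function names are hypothetical.

```python
import random
import statistics

# Hypothetical raw readings: some are missing, malformed, or implausible.
raw_readings = ["12.5", "13.1", "", "n/a", "12.9", "250.0", "13.4", "12.7", None, "13.0"]


def clean(values):
    """Keep only the entries that can be parsed as floats."""
    parsed = []
    for value in values:
        try:
            parsed.append(float(value))
        except (TypeError, ValueError):
            continue  # skip missing or malformed entries
    return parsed


def drop_outliers(values, k=3.0):
    """Drop values further than k median absolute deviations from the median."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)
    return [v for v in values if abs(v - med) <= k * mad]


def train_test_split(values, test_fraction=0.25, seed=0):
    """Shuffle reproducibly, then hold out a fraction of the data for testing."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


cleaned = clean(raw_readings)
filtered = drop_outliers(cleaned)
train_set, test_set = train_test_split(filtered)
print(len(raw_readings), len(cleaned), len(filtered), len(train_set), len(test_set))
```

Even in this toy case, most of the code deals with preparation rather than modelling, which is consistent with the survey figure quoted above.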

According to Google’s Director of Research Dr Benjamin Obi Tayo’s 2021 Data Science Preliminaries study, there are three levels of data science competency required of data scientists today: level 1 (basic), level 2 (intermediate) and level 3 (advanced).[11] As shown below, these three levels account for many of the technologies and tools needed.

Three levels of data science competency

A younger workforce that is more adept at technology, along with the drive to cut costs and improve the performance of systems, is leading to more as-a-Service (PaaS, SaaS, AIaaS) cloud implementations, which are expected to form the basis of 95% of companies’ digital transformation projects by 2025, compared with 40% in 2021.[12] What does this mean for a data scientist in today’s workplace?

Data Science in 2022 means an uptake of specialized tools to move, prepare and analyse data,[13] and that translates to more specific training requirements too. Additionally, being adept in these areas entails writing customized code to get the job done, resulting in expanded training for languages like Python and its supporting libraries and tools, like pandas, NumPy, SciPy, scikit-learn and Notebooks.[14]

DataOps and MLOps are moving further into the mainstream in 2022, with new approaches like Data Fabric[15] and Data Mesh[16] emerging to help speed up the processing and management of data. Interest in one-stop cloud solutions like Databricks’ Lakehouse, Snowflake’s Snowgrid and Explorium’s augmented data discovery tools[17] is helping businesses manage costs and meet security and regulatory requirements effectively.[18] The slippery slope of privacy will continue to be at the forefront of the data scientist’s mind going forward, and companies need to ensure compliance through proper master data management and data governance programs.

Growth during the pandemic in computer vision tools like DarwinAI,[19] used in medical diagnoses for Covid-19, has many companies implementing best practices for Artificial Intelligence (AI) around automating analysis. AI also brings specialized, computer-assisted solutions to other areas that benefit businesses, such as cybersecurity, web search, e-commerce, advertising, smart homes, and infrastructure. “It seems as though every week companies are finding new uses for algorithms that adapt as they encounter new data”.[20] The application of ML, a subset of AI that makes predictions on new datasets, has skyrocketed, covering everything from speech recognition, fraud detection, spam and malware filtering, and image recognition to self-driving vehicles. Anaconda’s 2021 State of Data Science Survey Results showed the biggest problem to tackle in the AI/ML area today was the social impact caused by bias in data and models.[21] In the same study, 55% of respondents hoped to see more automation and AutoML in data science.[22]

2022 sees ML expanding beyond tabular data sources and simplified training models to the use of off-the-shelf prebuilt models from various sources. AWS Marketplace for SageMaker was one of the first out of the gate with this approach, and it has been highly successful for data scientists, with one testimonial from OneCup AI saying it “reduced our training time from several hours at best and days at worst down to 15 minutes, giving us a massive competitive advantage.”[23]

Making better, more confident decisions with higher quality data and a plethora of tools improves business insight and helps businesses meet needs quicker, adjust faster to change and develop solutions by managing their data lakes more efficiently. Competitive advantage has always been a desire of businesses; in 2022 it comes with the additional skillsets afforded to the data scientist.

CEO and President of Claremont Communications and 3xd, has spent over 30 years in the technology and training industry focusing on helping start-ups grow.

Kevin Clements