Top 10 Essential Data Engineering Skills for 2022

Kevin Clements
Feb 19, 2022

A year ago, in the midst of the global pandemic, we looked at the top skills required by the data engineer in 2021 to meet growing data lake demands in our digital society. At the time, we identified a huge growth in cloud implementations to meet the demands of data modernization, cost, and security.[1] That adoption has since split further into what is now billed as multi-cloud[2] environments, with Gartner predicting that “more than 85% of organizations will embrace a cloud-first principle by 2025.”[3]

In 2022, the number of jobs in the data science domain will continue to rise, with Data Engineering and MLOps taking precedence.[4] Certified skillsets are still required, with an excess of new technology tools in the market, both open source and paid, on-premises or cloud-based. Let’s look at the skill requirements that make sense for a data engineer now.

10. Scripting. Yes, skills in scripting are still required. Linux Bash, PowerShell, TypeScript, JavaScript, and Python are all still here, and if anything we’re dealing with even more data formats in the data pipeline (text-based formats such as CSV, TSV, JSON, and XML, plus binary formats such as Avro, Parquet, and ORC) that require additional knowledge of ETL / ELT techniques and tools. See more on this under Data Pipelines below.
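
As a small illustration of the kind of format juggling described here, the following Python sketch converts delimited text (CSV or TSV) into newline-delimited JSON using only the standard library. The function name csv_to_json_lines and the sample data are hypothetical, chosen for illustration; real pipelines would typically read from files or streams and handle types and encodings more carefully.

```python
import csv
import io
import json


def csv_to_json_lines(text, delimiter=","):
    """Convert delimited text (CSV or TSV) into newline-delimited JSON."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    # Each row becomes one JSON object; all values stay strings here.
    return "\n".join(json.dumps(row) for row in reader)


# Hypothetical sample data, for illustration only.
sample_csv = "id,name\n1,Ada\n2,Grace\n"
json_lines = csv_to_json_lines(sample_csv)
print(json_lines)
```

Passing delimiter="\t" handles TSV input with the same function.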

9. Programming. The move to the cloud has changed the required languages little in the last year, with Java, C#, and C++ still important on-premises.[5] The more prevalent cloud languages are Go, Ruby, and Rust, and especially Python and Scala with the Apache Spark processing engine and its cloud implementations such as AWS Glue and Databricks. Working with streaming real-time data such as social media, NLP, email, and controls on cloud-based systems[6] is only going to increase in the coming years.

8. DevOps. A year ago, we treated this key foundational piece of the data engineer’s knowledge as part of programming. This year it is broken out into its own multi-part area. It includes the Software Development Life Cycle (SDLC) and Continuous Integration (CI) and Continuous Delivery (CD) techniques and tools such as Jenkins, Git, and GitLab. The process, especially when tied into DataOps[7] and Data Governance, leads to higher data quality practices and more accurate results.[8]

7. SQL. Can’t get away from those schemas and their infamous joining syntax yet! In fact, more cloud-based systems are adding SQL-like interfaces that allow the use of SQL, for instance Google’s Looker or Amazon’s Athena and QuickSight combination. Relational Database Management Systems (RDBMS) are still key to data discovery and reporting, no matter where they reside.
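
For readers who want to see that joining syntax in action, here is a minimal sketch using Python’s built-in sqlite3 module with an in-memory database. The customers and orders tables and their contents are hypothetical, made up for illustration; the same JOIN and GROUP BY pattern carries over to most SQL engines, though dialect details differ.

```python
import sqlite3

# In-memory database with two hypothetical tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0);
""")

# Join each order to its customer and total the amounts per customer.
rows = conn.execute("""
SELECT c.name, SUM(o.amount) AS total
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY c.name
""").fetchall()

print(rows)
conn.close()
```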

6. NoSQL. I keep hearing organizations say Hadoop is not important because they are moving to the cloud. Let’s set the record straight here: Google Bigtable, Amazon S3, and Azure Files and Blob Storage are all related and manage hierarchical file data much like the open-source Hadoop ecosystem. The cloud is full of unstructured or semi-structured (lacking a SQL schema) data stores, in fact over 225.[9] NoSQL databases, whether open-source Apache projects such as Cassandra or products such as MongoDB, are all the rage in 2022. Knowing how to manipulate key-value pairs and data formats like JSON, Avro, or Parquet is still a necessity for these.
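
To make the idea of manipulating key-value pairs and JSON documents concrete, here is a minimal Python sketch using only the standard library. The document contents and the get_path helper are hypothetical, invented for illustration; they are not part of any particular NoSQL product’s API.

```python
import json

# Hypothetical record, similar in shape to a document in a document store.
raw = '{"user": "ada", "tags": ["etl", "sql"], "profile": {"city": "London"}}'
doc = json.loads(raw)


def get_path(document, path, default=None):
    """Walk a dotted path such as 'profile.city' through nested dicts."""
    current = document
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


city = get_path(doc, "profile.city")
missing = get_path(doc, "profile.zip", default="unknown")

# Schemaless data can be extended in place, then serialized again.
doc["tags"].append("nosql")
updated = json.dumps(doc, sort_keys=True)
print(city, missing, updated)
```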

5. Data Pipelines. Disparate data lakes keep getting new names like Databricks Lakehouse and Snowflake’s Data Cloud: same thing, new year. Operating with real-time streams, data warehouse queries, JSON, CSV, and raw data is a daily occurrence. How and where data engineers set up storage may change which skillsets and tools are required for ETL / ELT ingestion. This is one area that is getting more complex and skewed depending on the source and resource used.
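
As a rough sketch of what an ETL step looks like at its smallest, the Python example below extracts rows from CSV text, transforms them (trimming whitespace, dropping rows with missing readings, casting to numbers), and loads them into an in-memory SQLite table. The data, table name, and function names are hypothetical, and SQLite stands in here for whatever warehouse or lake storage a real pipeline would target.

```python
import csv
import io
import sqlite3

# Hypothetical raw input with a stray space and a missing value.
RAW_CSV = "sensor,reading\na, 1.5\nb,\na,2.5\n"


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, drop missing readings, cast to float."""
    clean = []
    for row in rows:
        value = (row["reading"] or "").strip()
        if not value:
            continue  # skip rows with no reading
        clean.append((row["sensor"].strip(), float(value)))
    return clean


def load(conn, records):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (sensor TEXT, reading REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT sensor, reading FROM readings ORDER BY reading"
).fetchall()
print(loaded)
conn.close()
```

In an ELT variant, the raw rows would be loaded first and the cleanup would run afterwards inside the target system, usually as SQL.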

4. Hyper Automation. Value-added tasks, such as running jobs, schedules, and events, are now part of the data engineer’s skillset requirements in 2022. The last 10 years show this trend becoming more prominent, with specialized Scripting and Data Pipelines tasks required to successfully move data to the cloud. Gartner states that “the most successful hyper-automation teams focus on three key priorities: improving the quality of work, accelerating business processes and increasing decision-making agility.”[10]

3. Visualization. Exploratory Data Analysis (EDA) appears again, now as part of the data engineer’s talents, to ensure the ETL / ELT work mentioned earlier is successful. Working with tools like SSRS, Excel, Power BI, Tableau, Google Looker, and Azure Synapse is a must. The quality of the resulting data is crucial as the data engineer processes and visualizes datasets.

2. Machine Learning and AI. Last year we listed these subjects at the same position, and knowledge of terminology and familiarity with algorithms remain an important part of the data engineer’s skillset. At a minimum, familiarity with Python’s libraries NumPy, SciPy, pandas, and scikit-learn and some actual experience with notebooks (Jupyter or online cloud) is vital. This can be taken to the next level with cloud-based tools like Amazon SageMaker, Microsoft’s HDInsight, or Google’s Datalab. This field’s toolsets are getting more complex every year.

1. Multi-cloud computing. Still number one for a second year, but just add the word multi in front for good measure. No longer content to be tied to a single cloud vendor, companies are opting for multi-cloud: instead of deciding which cloud technology to choose, 76% of enterprises[11] have already chosen a couple. Cloud spending in 2022 will reach $482 billion.[12] A data engineer still needs a good understanding of the underlying technologies that make up cloud computing and, in particular, knowledge of IaaS, PaaS, and SaaS implementations.[13]

Data engineers can’t afford to make any of the five common mistakes, which include overly complex data, inaccurate data, not clarifying usage requirements, and not communicating issues.[14] Trying to gain knowledge on your own, without proper guidance and insight, generally takes a long time. A proper certified training program that plans out your schedule, is adaptable, uses real-world labs, and allows you to study with an experienced instructor is key to your success.

[1] https://www2.deloitte.com/us/en/insights/industry/technology/why-organizations-are-moving-to-the-cloud.html

[2] https://www.citrix.com/solutions/app-delivery-and-security/what-is-multi-cloud.html

[3] https://www.gartner.com/en/newsroom/press-releases/2021-11-10-gartner-says-cloud-will-be-the-centerpiece-of-new-digital-experiences

[4] https://www.analyticsvidhya.com/blog/2021/12/a-review-of-2021-and-trends-in-2022-a-technical-overview-of-the-data-industry/

[5] https://www.techrepublic.com/article/the-best-programming-languages-to-learn-in-2022/

[6] https://www.ibm.com/cloud/blog/top-7-most-common-uses-of-cloud-computing

[7] https://jdp491bprdv1ar3uk2puw37i-wpengine.netdna-ssl.com/wp-content/uploads/2019/11/102519_Ultimate_Guide_To_Data_Ops_Tamr.pdf

[8] https://www.oss-group.co.nz/blog/data-governance-key-elements-to-consider

[9] https://hostingdata.co.uk/nosql-database/

[10] https://www.gartner.com/en/information-technology/insights/top-technology-trends

[11] https://www.computerweekly.com/news/252505227/Multicloud-adoption-on-the-rise

[12] https://www.gartner.com/en/newsroom/press-releases/2021-08-02-gartner-says-four-trends-are-shaping-the-future-of-public-cloud

[13] https://www.bigcommerce.com/blog/saas-vs-paas-vs-iaas/#the-key-differences-between-on-premise-saas-paas-iaas

[14] https://learnsql.com/blog/data-engineering-mistakes/

Kevin Clements

Director. Former CEO of Claremont Communications and 3xd, Kevin has spent over 30 years in the technology and training industry, focusing on helping start-ups grow.