Top 10 Essential Data Engineering Skills for 2021

Kevin Clements
4 min read · Feb 15, 2021

It is 2021, and the “Data Lake”, the disparate datasets of an organization, is still evolving, with a shift away from on-premises servers to the cloud at an astonishing pace. Recent studies from Deloitte show that the drivers for this are data modernization, cost, and security.[1] Adoption is ongoing across organizations, whether governments or businesses large and small. According to Gartner, in 2020 “83% of the companies will be using cloud platforms and in 83%, 41% will prefer to use public cloud platforms.”[2] In a recent Salesforce webinar, presenters detailed how “over 60% of Canadian businesses were forced to accelerate their technology plans”[3] around cloud technologies because of the pandemic.

What does this shift mean for the Data Engineer whose responsibility is this “Data Lake”?

The answer is to keep their certified skillsets up to date through learning and by embracing the new technology tools available, whether on-premises or cloud-based. But which skills, and for what purpose? To answer that, a review of the skill requirements makes sense.

10. Scripting. Ten years ago, scripting meant writing in the Linux Bash shell or Perl, using a Linux cron job to schedule the script run, and collecting information from logs to see if there were any issues. Today, a Data Engineer needs to know many more techniques and languages beyond shell scripting. For instance, PowerShell, TypeScript, JavaScript, and Python are common.

9. Programming. Java and C++ have been integral languages in the Data Engineering field for over a decade, serving as an important interface to the disparate data in an organization’s systems. Many newer systems require additional integration, and programming languages such as Python, C#, Scala, and Go are increasingly prevalent. Knowing these languages is a must for working with real-time data from sources like social media, email, controls, or cloud-based systems. Additionally, ELT (Extract, Load, and Transform) methods need to be in place for other data sources like CSV files and databases. Programming also means using source control systems like Git. Data Engineers should also know the Software Development Life Cycle (SDLC) and the Continuous Integration (CI) and Continuous Delivery (CD) techniques and tools used in DevOps, like Jenkins and GitLab.
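As a minimal sketch of the ELT pattern mentioned above, the following Python example loads raw CSV rows into a staging table unchanged and only then transforms them inside the database with SQL. The table name, column names, and sample data are hypothetical, and the in-memory SQLite database is just a stand-in for whatever database or warehouse an organization actually uses.

```python
import csv
import io
import sqlite3

# Hypothetical raw CSV extract; in practice this would come from a file or an API.
RAW_CSV = """order_id,customer,amount
1,acme,100.50
2,globex,20.00
3,acme,49.50
"""


def run_elt(raw_csv):
    """Extract rows from CSV, load them as-is, then transform them with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        # Load: the staging table keeps every value as text, exactly as extracted.
        conn.execute(
            "CREATE TABLE staging_orders (order_id TEXT, customer TEXT, amount TEXT)"
        )
        reader = csv.DictReader(io.StringIO(raw_csv))
        rows = [(r["order_id"], r["customer"], r["amount"]) for r in reader]
        conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", rows)

        # Transform: cast and aggregate inside the database, after loading.
        cursor = conn.execute(
            """
            SELECT customer, SUM(CAST(amount AS REAL)) AS total
            FROM staging_orders
            GROUP BY customer
            ORDER BY customer
            """
        )
        return cursor.fetchall()
    finally:
        conn.close()


totals = run_elt(RAW_CSV)
print(totals)  # [('acme', 150.0), ('globex', 20.0)]
```

Keeping the staging table untyped means a malformed value does not block the load step; the cost is that type problems surface later, during the transform.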

8. SQL. Structured Query Language is 45 years young in 2021 and still a must-have for any Data Engineer. Knowledge of Relational Database Management Systems (RDBMS) is still key in this role.
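To illustrate the kind of relational query this skill covers, here is a small sketch using Python’s built-in sqlite3 module. The customers and orders tables and their contents are hypothetical; the point is the join and aggregation, which look much the same in other RDBMS products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    );
    INSERT INTO customers (id, name) VALUES (1, 'Acme'), (2, 'Globex'), (3, 'Initech');
    INSERT INTO orders (customer_id, amount) VALUES (1, 120.0), (1, 30.0), (2, 75.0);
    """
)

# LEFT JOIN keeps customers with no orders; COALESCE turns their NULL sum into 0.
query = """
    SELECT c.name, COUNT(o.id) AS order_count, COALESCE(SUM(o.amount), 0) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY total DESC, c.name
"""
report = conn.execute(query).fetchall()
conn.close()

for name, order_count, total in report:
    print(name, order_count, total)
```

An inner join would silently drop the customer with no orders, which is a common source of missing rows in reports.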

7. NoSQL. NoSQL, short for “Not only SQL”, refers to data stores that hold data in unstructured or semi-structured (lacking a fixed schema) ways. NoSQL systems often address data by key-value pairs and run in clustered environments, with many machines working in parallel. The open-source systems Apache Hadoop, HBase, Redis, MongoDB, and Cassandra are all the rage in 2021. Knowing how to manipulate key-value pairs and data formats like JSON, Avro, or Parquet is a necessity for these.
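The key-value idea can be sketched in a few lines of Python. This is only an illustration: a plain dictionary stands in for a real store such as Redis or MongoDB, and the put and get helpers, the keys, and the documents are all hypothetical.

```python
import json
from typing import Dict, Optional

# A plain dict stands in for a key-value store; values are JSON strings,
# much like documents serialized before being written to a real store.
store: Dict[str, str] = {}


def put(key: str, document: dict) -> None:
    """Serialize a document to JSON and store it under the given key."""
    store[key] = json.dumps(document)


def get(key: str) -> Optional[dict]:
    """Return the decoded document, or None if the key is absent."""
    raw = store.get(key)
    return json.loads(raw) if raw is not None else None


# Semi-structured data: the two documents do not share the same fields.
put("user:1", {"name": "Ada", "email": "ada@example.com"})
put("user:2", {"name": "Grace", "tags": ["admin"]})

for key in ("user:1", "user:2"):
    doc = get(key)
    # Readers must tolerate missing fields because no schema enforces them.
    print(key, doc["name"], doc.get("tags", []))
```

Because nothing enforces a schema, the reading code carries the burden of handling absent fields, which is the trade-off for the flexibility on the write side.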

6. Data Pipelines. Processing data and ensuring the efficient movement of that disparate “Data Lake” data for future analysis and visualization is another key knowledge area. Working with real-time streams, data warehouse queries, JSON, CSV, and raw data is a daily occurrence. Understanding which tools to use, like Apache Kafka, Storm, and Flume for ingesting data, or the Amazon Web Services (AWS) Cloud Development Kit (CDK) for on-premises to cloud, is a must-have skill.

5. Automation. Scripts and Data Pipelines need to run as their own jobs, either scheduled or invoked, to perform the tasks required to successfully move data. Beyond cron jobs, the Data Engineer must know the integrated tools that many server environments provide to achieve this.

4. Analysis. Exploratory Data Analysis (EDA) has traditionally been the realm of the Data Scientist. Today, Data Engineers must also acquire this knowledge to ensure that the ETL or ELT work mentioned earlier is successful. Knowledge of terminology and data manipulation is key here, as is use of the Apache Spark engine with PySpark or Scala.
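A first pass at EDA often amounts to profiling a column: counting missing values and comparing a few summary statistics. The sketch below uses only Python’s standard library rather than Spark, and the records, the profile helper, and the field names are hypothetical.

```python
import statistics

# Hypothetical records as they might arrive from a pipeline;
# None marks a missing value.
records = [
    {"sensor": "a", "reading": 10.0},
    {"sensor": "a", "reading": 12.0},
    {"sensor": "b", "reading": None},
    {"sensor": "b", "reading": 14.0},
    {"sensor": "b", "reading": 100.0},
]


def profile(rows, column):
    """Summarize one numeric column, ignoring missing values."""
    values = [row[column] for row in rows if row[column] is not None]
    return {
        "rows": len(rows),
        "missing": len(rows) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


summary = profile(records, "reading")
print(summary)
```

Here the mean (34.0) sits far above the median (13.0), which hints at an outlier worth checking before the data moves further down the pipeline.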

3. Visualization. Understanding visualization techniques is a key success factor for Data Engineers now. Working with tools like SSRS, Excel, Power BI, Tableau, and Amazon QuickSight is a must. Data Engineers need to ensure data integrity throughout the ETL process and know how to visualize the resultant data.

2. Machine Learning and AI. Knowledge of terminology and familiarity with algorithms is becoming a more important part of the Data Engineer’s skillset. Today, knowing and using Python’s NumPy, pandas, and scikit-learn libraries, and even cloud-based tools like Amazon SageMaker, Microsoft’s HDInsight, or Google’s Datalab, should be part of the known toolset.

1. Cloud computing. As mentioned at the top of this article, the growth in cloud computing today is astronomical. Herein lies an issue, though: which cloud technology to choose? According to Flexera, 76% of public cloud adoption in 2020 was AWS-based, with Microsoft slightly behind at 69% and Google a distant 34%.[4] Does that mean recommending only the top 3? Absolutely not! A Data Engineer needs to have a good understanding of the underlying technologies that make up cloud computing and, in particular, knowledge of IaaS, PaaS, and SaaS implementations.[5]

How can you, the data engineer, be successful in all these areas when “studies show that 73% of digital transformation efforts fail”?[6] Gaining knowledge generally takes a long time, especially when trying to do it all on your own. A proper certified training program that plans out your schedule, is adaptable, uses real-world labs, and allows you to study with an experienced instructor is key to your success.

[1] https://www2.deloitte.com/us/en/insights/industry/technology/why-organizations-are-moving-to-the-cloud.html

[2] https://idego-group.com/cloud-technology-in-2020-benefits-of-moving-your-data-to-the-cloud-server/

[3] https://www.salesforce.com/ca/form/events/webinars/form-rss/2960551

[4] https://www.flexera.com/blog/industry-trends/trend-of-cloud-computing-2020/

[5] https://www.bigcommerce.com/blog/saas-vs-paas-vs-iaas/#the-key-differences-between-on-premise-saas-paas-iaas

[6] https://www.salesforce.com/ca/form/events/webinars/form-rss/2960551


Kevin Clements

Director. Former CEO of Claremont Communications and 3xd, with over 30 years in the technology and training industry focused on helping start-ups grow.