Here are some of the most significant themes we see as we look toward 2021. Some of these are emerging topics and others are developments on existing concepts, but all of them will inform our thinking in the coming year.
MLOps FTW
MLOps attempts to bridge the gap between Machine Learning (ML) applications and the CI/CD pipelines that have become standard practice. ML presents a problem for CI/CD for several reasons: the data that powers ML applications is as important as the code, making version control difficult; outputs are probabilistic rather than deterministic, making testing difficult; and training a model is processor-intensive and time-consuming, making rapid build/deploy cycles difficult. None of these problems is unsolvable, but developing solutions will require substantial effort over the coming years.
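Two of the problems above, versioning data and testing probabilistic outputs, can be illustrated with a minimal sketch. It is not a description of any particular MLOps tool: the function names, the toy `noisy_classifier` standing in for a trained model, and the 0.8 threshold are all assumptions made for illustration. The idea is to fingerprint the dataset so a pipeline can record which data a model was built from, and to gate a build on aggregate accuracy rather than on exact outputs.

```python
import hashlib
import random


def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest that changes whenever the data changes."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def noisy_classifier(x, rng):
    """Hypothetical stand-in for a trained model: right about 90% of the time."""
    truth = x >= 0.5
    return truth if rng.random() < 0.9 else not truth


def accuracy(model, examples, rng):
    """Fraction of examples on which the model's output matches the label."""
    correct = sum(1 for x, label in examples if model(x, rng) == label)
    return correct / len(examples)


def check_model(model, examples, threshold, seed=0):
    """CI-style gate: pass if aggregate accuracy clears the threshold."""
    rng = random.Random(seed)  # fixed seed keeps the check repeatable
    return accuracy(model, examples, rng) >= threshold


# Made-up evaluation set: inputs in [0, 1), labeled True when x >= 0.5.
data_rng = random.Random(42)
examples = [(x, x >= 0.5) for x in (data_rng.random() for _ in range(2000))]

fingerprint = dataset_fingerprint(examples)  # store alongside the code commit
passed = check_model(noisy_classifier, examples, threshold=0.8)
```

A real pipeline would hash files rather than in-memory rows and would likely track more than one metric, but the shape is the same: the data gets an identifier that can be versioned, and the test asserts a statistical property instead of a single deterministic answer.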
The Time Is Now to Adopt Responsible Machine Learning
The era in which tech companies had a regulatory “free ride” has come to an end. Data use is no longer a “wild west” in which anything goes; there are legal and reputational consequences for using data improperly. Responsible Machine Learning (ML) is a movement to make AI systems accountable for the results they produce. Responsible ML includes explainable AI (systems that can explain why a decision was made), human-centered machine learning, regulatory compliance, ethics, interpretability, fairness, and building secure AI. Until now, corporate adoption of responsible ML has been lukewarm and reactive at best. In the next year, increased regulation (such as GDPR and CCPA), antitrust action, and other legal pressures will force companies to adopt responsible ML practices.
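To make one item on that list concrete, here is a small sketch of a fairness check. It assumes a single definition of fairness, demographic parity (comparing the rate of positive decisions across groups), which is only one of several definitions in use. The decisions and group labels are made-up data, and the function names are hypothetical.

```python
def selection_rate(decisions):
    """Fraction of decisions that are positive (1)."""
    return sum(decisions) / len(decisions)


def demographic_parity_difference(decisions, groups):
    """Return (gap, rates): the largest difference in selection rate
    between any two groups, plus the per-group rates."""
    by_group = {}
    for decision, group in zip(decisions, groups):
        by_group.setdefault(group, []).append(decision)
    rates = {g: selection_rate(ds) for g, ds in by_group.items()}
    return max(rates.values()) - min(rates.values()), rates


# Made-up model decisions (1 = approved) for two hypothetical groups.
decisions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

gap, rates = demographic_parity_difference(decisions, groups)
# rates is {"a": 0.75, "b": 0.25}, so gap is 0.5
```

A large gap does not prove that a system is unfair, and a small one does not prove that it is fair, but a number like this gives teams something to monitor and to explain.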
The Right Solution for Your Data: Cloud Data Lakes and Data Lakehouses
Data lakes, and cloud data lakes in particular, have experienced a fairly robust resurgence over the last few years. With more businesses migrating their data infrastructure to the cloud, and a growing number of open source projects driving innovation in cloud data lakes, these will remain on the radar in 2021. Similarly, the data lakehouse, an architecture that combines attributes of the data lake and the data warehouse, gained traction in 2020 and will continue to grow in prominence in 2021. Cloud data warehouse engineering is also emerging as a particular focus as database solutions move more and more to the cloud.
A Wave of Cloud-Native, Distributed Data Frameworks
Data science grew up with Hadoop and its vast ecosystem. Hadoop is now last decade’s news, and momentum has shifted to Spark, which now dominates the way Hadoop once did. But there are new challengers. Newer distributed computing frameworks like Ray and Dask are more flexible and are cloud-native: they make it very simple to move workloads to the cloud. Both are seeing strong growth. What’s the next platform on the horizon? We’ll see in the coming year.
Natural Language Processing Advances Significantly
This year, the biggest story in AI was GPT-3 and its ability to generate almost human-sounding prose. What will that lead to in 2021? There are many possibilities, ranging from interactive assistants and automated customer service to automated fake news. A closer look at GPT-3 raises several questions you should be asking. GPT-3 is being delivered via an API, not by incorporating the model directly into applications. Is “Language-as-a-service” the future? GPT-3 is great at creating English text, but has no concept of common sense or even facts; for example, it has recommended suicide as a cure for depression. Can more sophisticated language models overcome those limitations? GPT-3 reflects the biases and prejudices that are built into languages. How are those to be overcome, and is that the responsibility of the model or of the application developers? GPT-3 is the most exciting development to appear during the last year; in 2021, our attention will remain focused on it and its successors. We can’t help but be excited (and maybe a little scared) by the prospect of GPT-4.