Brain implants, commonly seen in sci-fi films, demonstrate the power of augmenting humans' physical or mental capabilities with technological alteration. (Image from the article "Brain implant delivers drugs directly through head" [1].)

"An algorithm must be seen to be believed."
Background
Nowadays it is recognized by industry practitioners that engineering plays an increasingly significant role in data science. Looking at the tasks a data scientist takes on, the primary goal is not merely demonstrating the value of a statistical or machine learning algorithm in solving a well-conditioned technical problem; it is also about establishing a full-fledged, reliable system that works in a sophisticated context constrained by both technical and business requirements.
Software engineering is hence important for transforming data science output into applicable software products. While many companies have dedicated software engineering roles (often titled "machine learning engineer") that handle the productionization of data science or machine learning models, the skill sets that data scientists should have grow broader than ever before. More importantly, a mindset shift that converges the development practices of data science models toward productionization becomes vital.
This post does not intend to suggest a superiority of "software engineering" over "data science". Instead, a combination of the two skill sets tallies with the contemporary progress of the industry, and that is why the word "implanting" is used in the title. For a data scientist, picking up engineering skills is not trivial, and the ramp-up is made even harder by the fast pace of technological advancement. It is hence important to keep learning and practicing with the most recent technological trends.
Implanting software engineering into data science
Based on my personal experience, as well as learning materials shared by experts, there are multiple keys to "implanting" the software engineering mindset into data science practices. Quite a few principles and good practices widely applied in software development have value that may not yet be well acknowledged by data scientists.
Coding conventions
I personally felt reluctant to follow verbose coding guidelines - it seemed not worthwhile to pay particular attention to whether snake case or camel case is used in the code. In fact, the benefits of developing standardized code become apparent when a group of software engineers, data scientists, researchers, etc., work on the same project. Not following the same conventions may make the same code base look as if it were developed in different languages even though it is not.
As the most popular programming language used in data science, Python has explicit guidelines for programming "pythonic" code; see PEP8. Companies sometimes have their own styles, but in general they are mostly the same. For example, Google has its own guidelines for Python (see here). Linting or formatting tools like black, autopep8, etc., help improve Python code style.
Programming "pythonic" code enhances readability, and thus greatly smooths the process of transforming the "experimental" code that data scientists develop into production-ready code to deploy onto servers.
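As a minimal illustration (the function names and logic here are made up for the example), the same routine written against and with PEP8 conventions:

```python
# A hypothetical helper written without PEP8 conventions: camel-case naming,
# index-based looping, and no docstring.
def GetSquaredValues(inputList):
    Result = []
    for i in range(len(inputList)):
        Result.append(inputList[i] ** 2)
    return Result


# The same logic in a "pythonic", PEP8-conformant style: snake-case naming,
# a docstring, and a list comprehension.
def get_squared_values(values):
    """Return the square of each value in the input sequence."""
    return [value ** 2 for value in values]
```

Both functions behave identically; the second is simply easier to read, review, and maintain alongside other conforming code.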
Design pattern
Contemporary software is largely developed using object-oriented programming (OOP) languages, and the characteristics of OOP make it possible to develop reusable and modular patterns (anti-patterns apparently co-exist) that excel at resolving particular types of problems. There are formalized design patterns that are generically used in software engineering tasks [2]. It has recently been discussed in the industry that, in both data science and machine learning tasks, well-developed design patterns can be applied to enhance the reusability of certain recurring solutions.
The Google Cloud team proposed machine learning design patterns in the book Machine Learning Design Patterns. With accompanying code, it covers basic best practices for handling common tasks in building machine learning systems, such as data representation, problem representation, etc.
In the repository ml-design-patterns, the authors proposed design patterns that are commonly seen in machine learning development. For example, the pattern used in the well-known scikit-learn package, called the "learning pattern" by the authors, represents a typical way to wrap a machine learning algorithm that may involve supervised training and then inference operations.
class Model:
    def __init__(self):
        self.model = nn_model()
        self.loss_function = square_loss  # e.g., subtraction, square loss, or L1

    def fit(self, data, labels):
        # 1. Compute forward pass
        output = self.model(data)
        # 2. Get loss
        loss = self.loss_function(output, labels)
        # 3. Update model
        self.update(loss)

    def update(self, loss):
        # Compute gradients with autograd and update the weights
        self.model.weights = ...

    def predict(self, data):
        # Run inference with the trained model
        return self.model(data)
The Model object can also be extended, by inheriting the attributes and methods of the base class with modifications, to support various application scenarios.
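To make the fit/predict interface concrete, below is a minimal, runnable sketch of the same pattern. The MeanRegressor class is hypothetical (it is not part of scikit-learn); it "trains" by memorizing the mean of the training targets and predicts that constant for every input row:

```python
class MeanRegressor:
    """A toy estimator following the scikit-learn-style fit/predict pattern."""

    def __init__(self):
        self.mean_ = None

    def fit(self, X, y):
        # "Training" step: compute and store the mean of the targets.
        self.mean_ = sum(y) / len(y)
        # Returning self allows chaining, e.g. model.fit(X, y).predict(X).
        return self

    def predict(self, X):
        # Inference step: predict the stored constant for each input row.
        if self.mean_ is None:
            raise RuntimeError("Call fit() before predict().")
        return [self.mean_ for _ in X]


model = MeanRegressor().fit([[1.0], [2.0]], [2.0, 4.0])
predictions = model.predict([[5.0], [6.0]])  # [3.0, 3.0]
```

Because the training logic lives entirely behind fit() and the inference logic behind predict(), a more sophisticated algorithm can replace the internals without changing any calling code.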
It is also discussed in a recent post by Eugene Yan that some design patterns used in popular packages like pytorch, gensim, huggingface, etc., support convenient data loading, and this idea can be applied in other realms as well. The key takeaways from the examples discussed in the post are that reusable design patterns, as code templates, greatly help data scientists and software developers code for production, and that the patterns are extensible and flexible enough to support various domains (e.g., conventional machine learning, natural language processing, image processing, etc.) with high quality and well-defined structure.
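As an illustration of such a data-loading pattern (the class and helper names below are hypothetical, though the __len__/__getitem__ protocol mirrors the one pytorch-style datasets use), a dataset can expose a uniform interface regardless of where the records come from:

```python
class InMemoryDataset:
    """A minimal dataset following the pytorch-style __len__/__getitem__ protocol.

    Downstream code (batching, shuffling, iteration) depends only on this
    interface, so a file-backed or database-backed dataset can be swapped in
    without changing the consumers.
    """

    def __init__(self, records):
        self._records = list(records)

    def __len__(self):
        return len(self._records)

    def __getitem__(self, index):
        return self._records[index]


def iter_batches(dataset, batch_size):
    """Yield consecutive fixed-size batches from any dataset-like object."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


dataset = InMemoryDataset(range(5))
batches = list(iter_batches(dataset, batch_size=2))  # [[0, 1], [2, 3], [4]]
```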
In general, keeping a library of useful patterns is important, especially when the patterns are industry- or domain-specific, so that they can be shared and reused as enablers of other data scientists' work within the same organization.
Performance enhancement
Viewed from a computer architecture perspective, most of the tools that data scientists use, e.g., Python, R, etc., sit above the compiler layer and do not require compilation before being executed. As modern data science applications progressively scale in complexity and data volume, the data science or machine learning engines that back these applications need to meet pressing engineering specifications. Data science and machine learning models developed in pure Python or R code may not suffice. The core components that run the critical jobs, e.g., model inference, need to be compiled into a form that is close to machine code to achieve maximum computational efficiency.
Although this may not fall into the portfolio of most data scientists, it is still worth gaining some knowledge about the low-level representation of code to fill the gap between experimentation and production. In reality, most machine learning packages implement the key model parts in a compiled language; e.g., tensorflow, pytorch, etc., implement the underlying key components in C++. In addition, there are handy tools and frameworks that help refactor vanilla Python code into a compiled version to achieve faster speed.
tvm is a compiler framework that accelerates the execution of deep learning models on a given hardware target. Conveniently, instead of rewriting or refactoring the models that data scientists experiment with, the Python API of tvm can be leveraged to perform low-level compilation and achieve a production-ready implementation. See the below edited example from tvm's documentation.
from tvm.driver import tvmc

# Load an ONNX-formatted model object, which is saved after model training
# with a deep learning framework and converted to the ONNX format.
model = tvmc.load("my_model.onnx")

# Compile the model object to a low-level representation. Multiple targets can
# be specified. Here, the "llvm" target is used to produce the compiled
# version of the model object.
package = tvmc.compile(model, target="llvm")

# The compiled object can then be run on a target type of hardware, e.g., CPU.
result = tvmc.run(package, device="cpu")
Similar tools, e.g., DLIR, MLIR, etc., try to tackle computational efficiency issues by transforming high-level code into low-level representations.
Another commonly used package for compiling Python code in general is numba. numba translates Python code into a low-level intermediate representation via LLVM, such that the original Python code, after compilation, runs as fast as C or Fortran code on the target hardware platform. It is convenient for data scientists during development because refactoring conventional Python code into numba-compatible code merely requires adding a decorator. See the below example from the numba official documentation.
from numba import njit
import random

@njit
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples
The first run of the above code triggers a compilation, which allows subsequent runs to use the compiled version directly; thus, the overall execution efficiency is improved.
jax is a tool developed by Google for high-performance machine learning research and development. It uses Autograd and XLA for differentiating and compiling numpy and Python code. It has a programming interface similar to numpy, which makes it fairly easy for data scientists or researchers to program jax-compatible models that achieve higher efficiency than numpy. For example, from the documentation, creating a numpy-like array in jax can be done by
import jax
import jax.numpy as jnp

x = jnp.arange(10)
print(x)  # [0 1 2 3 4 5 6 7 8 9]
The numpy-like array generated above can then leverage different backend hardware, e.g., CPU, GPU, or TPU, for high-performance operations.
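Beyond the numpy-like interface, the Autograd lineage shows up in jax.grad, which differentiates plain numpy-style Python functions. The scalar loss function below is a made-up example for illustration:

```python
import jax
import jax.numpy as jnp

def loss(w):
    # A made-up scalar loss: sum of squared residuals of 2 * w - 1.
    return jnp.sum((2.0 * w - 1.0) ** 2)

# jax.grad returns a new function that computes d(loss)/dw.
grad_fn = jax.grad(loss)

w = jnp.array([1.0, 2.0])
g = grad_fn(w)  # analytically, 4 * (2 * w - 1) = [4.0, 12.0]
```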
MLOps
MLOps is more of a philosophy and practice for properly managing the life cycle of deployed machine learning models. Data scientists' participation in MLOps practices is significantly important, though quite often this importance is neglected.
Ideally, the holistic MLOps platform or framework is established by machine learning architects or engineers, while the data scientists experiment with and develop the models using pre-defined interfaces, pipelines, and conventions. For example, upon releasing a model, the data scientists are supposed to provide
- Unit tests with a pre-determined coverage and pass rate to guarantee the reliability and robustness of model-related code against various corner cases. Conducting Test-Driven Development (TDD) [3] may be necessary in some circumstances when data scientists develop critical components of the machine learning system.
- Smoke tests with an end-to-end run of model training and inference.
- Properly developed model performance evaluation metrics, and integration testing code to calculate the metrics against sample data.
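As a small sketch of the first item (the metric and test names here are hypothetical, written in plain assert style so that a runner such as pytest could collect them):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length.")
    if not y_true:
        raise ValueError("Cannot compute accuracy on empty inputs.")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def test_accuracy_perfect_predictions():
    assert accuracy([1, 0, 1], [1, 0, 1]) == 1.0


def test_accuracy_partial_predictions():
    assert accuracy([1, 0, 1, 0], [1, 1, 1, 0]) == 0.75


def test_accuracy_rejects_mismatched_lengths():
    # Corner case: mismatched input lengths should fail loudly.
    try:
        accuracy([1, 0], [1])
    except ValueError:
        pass
    else:
        raise AssertionError("Expected a ValueError for mismatched lengths.")
```

Tests like these are cheap to write next to the metric itself, and they give the CI/CD pipeline an objective gate before a model release.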
The evaluation process enables a Continuous Integration and Continuous Delivery (CI/CD) practice on the MLOps platform, making sure that the entire journey of exploration, experimentation, development, and deployment is organically connected to yield high-performance productionization with efficiency.
References
- Engineering and Technology Editorial Staff, "Brain implant delivers drugs directly through head", url: https://eandt.theiet.org/content/articles/2019/02/brain-implant-delivers-drugs-directly-through-head/, 2019.
- Gamma, Erich; Helm, Richard; Johnson, Ralph; Vlissides, John (1995). Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley.
- Beck, Kent (2002). Test-Driven Development: By Example. Addison-Wesley.