Gather together those things that change for the same reason, and separate those things that change for different reasons.
Introduction
Single responsibility is a well-known principle in developing computer software. The principle originates from Robert Martin, and it is a general principle that applies to any software design. The core idea of single responsibility is that a module should be responsible to one, and only one, actor. There are a few ways to interpret this idea. For example, Martin explained the principle as “a class should have only one reason to change”.
There are many benefits of following the single responsibility principle in software development.
- It enhances modularization.
- It facilitates unit testing.
- It improves code readability.
- It mitigates ambiguity and confusion.
Practicing the single responsibility principle is vital to developing data science and machine learning software. The single responsibility principle can be illustrated in three aspects.
“Different problems have different objects”
A “problem” may be defined with different scopes, but in general, different problems need to be implemented in different objects, however big or small they are. For example, one object can be created to deal with database-related operations, and another object can be created to handle model training and model scoring. “Database-related operations” and “machine learning model-related operations” are considered to be two different sets of problems, so they require different objects.
For example, scikit-learn applies single responsibility in implementing the base `Estimator` or `Transformer`. That is, these base classes have a standardized structure with methods such as `fit`, `predict`, `score`, etc., which are the most relevant “roles” for an `Estimator` or `Transformer`. The classes or functions for other atomic operations are implemented as separate entities.
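As a minimal sketch of this convention (not scikit-learn's actual code), a hypothetical estimator following the `fit`/`predict`/`score` structure might look like:

```python
class MeanEstimator:
    """A hypothetical estimator following the scikit-learn-style interface.

    Its only responsibility is the estimator "role": fit, predict, score.
    Anything else (I/O, plotting, etc.) belongs to separate entities.
    """

    def fit(self, X, y):
        # Learn a single parameter: the mean of the targets.
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        # Predict the learned mean for every sample.
        return [self.mean_ for _ in X]

    def score(self, X, y):
        # Negative mean absolute error as a simple quality measure.
        predictions = self.predict(X)
        return -sum(abs(p - t) for p, t in zip(predictions, y)) / len(y)


model = MeanEstimator().fit([[1], [2], [3]], [2.0, 4.0, 6.0])
```

Because the estimator exposes only these three methods, swapping it for any other estimator with the same interface requires no change to the surrounding code.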
A similar idea can be seen in deep learning packages like torch. The `nn.Module` implements the neural network topology, but it does not incorporate the training component. This is because a separate module takes the “responsibility” for optimizing the loss of a model defined in the `nn.Module`.
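The separation can be sketched in plain Python (an analogy, not torch's actual API): the model object only defines the forward computation, while a separate trainer owns the optimization loop.

```python
class LinearModel:
    """Defines only the model topology/forward computation (like nn.Module)."""

    def __init__(self, weight=0.0):
        self.weight = weight

    def forward(self, x):
        return self.weight * x


class Trainer:
    """A separate object responsible for optimizing the model's loss."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate

    def step(self, model, x, target):
        # One gradient-descent step on the squared-error loss.
        prediction = model.forward(x)
        gradient = 2 * (prediction - target) * x
        model.weight -= self.learning_rate * gradient


model = LinearModel()
trainer = Trainer(learning_rate=0.1)
for _ in range(100):
    trainer.step(model, x=1.0, target=3.0)
```

Changing the optimization scheme only touches `Trainer`; changing the topology only touches `LinearModel`.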
“Focus on one if there are multiple types of sub-problems to resolve”
Quite often an object becomes obscure when functionalities are added to it that exceed its pre-defined scope. If an object is supposed to handle multiple different sub-problems, it may be reasonable to split the scope of the object and dedicate each split to one sub-problem.
A very representative example of this principle is the `write` functions used for saving a pandas dataframe. Though the write operation can be considered one problem, it has sub-problems of writing the dataframe to different output formats, e.g., `csv`, `parquet`, etc. Therefore, there are dedicated methods of the `pandas.DataFrame` object to handle these sub-problems: for saving to `csv`, `to_csv` is used, and for saving to `parquet`, `to_parquet` is used.
“Add only the necessary functionalities to the object”
It happens frequently that unnecessary functionalities are added to an object, which makes the object less maintainable. Consider a class where unnecessary methods are added: the cost is not merely the implementation of the additional methods but also the unit tests and integration tests that may apply.
Sometimes, it is indeed hard to tell whether a functionality is required or not. The usual way of dealing with such a situation is that the responsibility can be “propagated” to auxiliary objects. This idea can be seen in many commonly used design patterns such as “strategy”, which implements a “context” that controls the actual behavior of the object. The unnecessary components of a “strategy” can then be put into the “context”, so that whenever there is a change of the context, the actual strategy is also affected. As a result, the context and the strategies can be implemented and maintained separately.
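A minimal sketch of the strategy pattern (all names here are illustrative): the context holds the shared machinery, while interchangeable strategies carry only their own behavior.

```python
class Context:
    """Holds the shared components and delegates to a strategy."""

    def __init__(self, strategy):
        self.strategy = strategy

    def execute(self, data):
        # Shared, strategy-independent work lives in the context...
        cleaned = [x for x in data if x is not None]
        # ...while the variable behavior is delegated to the strategy.
        return self.strategy(cleaned)


# Strategies implement only their own behavior.
def sum_strategy(data):
    return sum(data)


def max_strategy(data):
    return max(data)


result = Context(sum_strategy).execute([1, None, 2, 3])
```

A change to the shared cleaning step touches only `Context`; a change to a strategy touches only that strategy.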
Case study: recommender system
The following demonstrates the use of the “single-responsibility” principle in designing a recommender system.
Recall and rerank
In a recommender system, a commonly seen architecture pattern is “recall-rerank”: the former does large-scale retrieval of relevant items, and the latter reranks the relevant items with detailed contextual information for recommending to users. A possible design of such a system is a `Recommender` class, where the “recall” stage and the “rerank” stage are added as methods for performing the respective tasks. A high-level code example is shown below.
```python
class Recommender():
    def recall(self):
        print("Do recall")

    def rerank(self):
        print("Do rerank")
```
Apparently, if there are details to add into the `Recommender` class, it will look “bloated”. For example, the recall stage requires the input of `users`, `items`, and the user-item interactions, i.e., `user_item_interactions`, to get the similar items that a user has interacted with from the `items` pool. With the input data, the actual “recall” operation to generate the relevant items may be powered by a model trained on the interaction data with an algorithm, which generates a quantitative measure that decides the output of recall. That is, if the measure is higher than a threshold, the item is considered relevant. The parameter `item_per_user` determines how many items need to be recalled for each user.
The above detailed design leads to a possible implementation of `recall` in an actual `Recommender` class that inherits the abstract one. The code can be found below.
```python
def recall(
    self,
    users,
    items,
    user_item_interactions,
    threshold,
    item_per_user,
    **algorithm_parameters
):
    """Recall method

    Args:
        users: a list of users to generate recall items for.
        items: a list of candidate items.
        user_item_interactions: user item interaction data.
        threshold: threshold for generating relevant items.
        item_per_user: number of recalled items for each user.
        algorithm_parameters: parameters of the recall algorithm.
    """
    # Train the recall model.
    model = Algorithm(**algorithm_parameters).fit(user_item_interactions)
    # Generate the relevant items.
    recalled_items = model.predict(users, items, threshold, item_per_user)
    return recalled_items


class Algorithm():
    """Implementation of the algorithm that builds the recall model."""

    def fit(self, data):
        ...

    def predict(self, data, *args, **kwargs):
        ...
```
It is obvious that the above implementation breaks the single responsibility principle for the `Recommender` class. This is because

- the workflow in either `recall` or `rerank` is only applicable to itself, and
- anything changed in either `recall` or `rerank` leads to a change in the entire `Recommender` object.
An advisable approach to mitigate the issue is to separate the stages and implement them as separate classes. That is:
```python
class Recall():
    model = None

    def build_model(
        self,
        user_item_interactions,
        **algorithm_parameters
    ):
        """Build the model to perform the recall operation."""
        self.model = Algorithm(**algorithm_parameters).fit(user_item_interactions)

    def generate_items(
        self,
        users,
        items,
        threshold,
        item_per_user,
    ):
        """Use the pre-built model to generate the recalled items."""
        if self.model is None:
            raise ValueError("The model should be built in the first place")
        # Generate the relevant items.
        recalled_items = self.model.predict(users, items, threshold, item_per_user)
        return recalled_items
```
Similarly, `Rerank` can be implemented as a different class with detailed attributes and methods that are specific to the rerank stage. The `Recommender` class encapsulates the high-level steps of recall and rerank by taking `Recall` and `Rerank` objects, propagating the “responsibility” of each to an upper level; the `Recommender` object then orchestrates the run of the recall stage or the rerank stage. Any logic change in either recall or rerank is not reflected in the code of `Recommender`.
```python
class Recommender():
    def __init__(self, recall: Recall, rerank: Rerank):
        self.recall = recall
        self.rerank = rerank
```
And the recall and rerank operations will be run as
```python
# Initialize a recommender.
recommender = Recommender(Recall(), Rerank())
# Build the recall model.
recommender.recall.build_model(user_item_interactions, **algorithm_parameters)
# Do recall.
recall_items = recommender.recall.generate_items(users, items, threshold, item_per_user)
```
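For completeness, a `Rerank` class can mirror the shape of `Recall`. The sketch below is hypothetical (the original text does not specify the rerank logic); it simply reorders recalled items by a per-item contextual score.

```python
class Rerank:
    """A minimal, hypothetical rerank stage mirroring the Recall class.

    The scoring logic is illustrative only: it reorders the recalled
    items by a per-item contextual score.
    """

    def generate_items(self, recalled_items, context_scores, top_k):
        """Rerank recalled items with contextual scores.

        Args:
            recalled_items: candidate item ids from the recall stage.
            context_scores: a dict mapping item id -> contextual score.
            top_k: number of items to keep after reranking.
        """
        # Sort candidates by their contextual score, highest first.
        ranked = sorted(
            recalled_items,
            key=lambda item: context_scores.get(item, 0.0),
            reverse=True,
        )
        return ranked[:top_k]


rerank = Rerank()
result = rerank.generate_items(
    ["item_a", "item_b", "item_c"],
    {"item_a": 0.2, "item_b": 0.9, "item_c": 0.5},
    top_k=2,
)
```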
Data preservation
It is common that the items generated from one of the stages are cached into a database for later reference. A tempting implementation is to add a method to the `Recall` or `Rerank` class for preserving data.
```python
def write_items(self, items, connection_string: str):
    """Write the items to a database.

    Args:
        items: the items to write to the database.
        connection_string: the connection string for the database to write results.
    """
    # build_db_connection is a function that returns the generator to perform
    # the write operation under the context managed by the connection.
    with build_db_connection(connection_string) as conn:
        write_data(items, conn)
```
After adding the `write_items` method, the `Recall` class has multiple types of sub-problems, i.e., 1) building a recall model and generating the recall items, and 2) saving the items to the database. This breaks the single responsibility of the class: the two types of problems have minimal overlap in terms of functionality and implementation, but changing one of the two leads to a change of the entire class.
A better choice is to have a separate class, `RecommenderDatabaseManager`, with methods to support data read and write. The `RecommenderDatabaseManager` may be used as a generic object at the `Recommender` class level to handle the data read/write related operations.
```python
class RecommenderDatabaseManager:
    def __init__(self, connection_string):
        self.connection_string = connection_string

    def write_items(self, items):
        """Write items to the database.

        Args:
            items: the items to write.

        Notes:
            The items are assumed to be written to a default table.
        """
        with build_db_connection(self.connection_string) as conn:
            write_data(items, conn)

    def read_items(self):
        """Read items from the database.

        Notes:
            The items are read from the default table.
        """
        with build_db_connection(self.connection_string) as conn:
            items = read_data(conn)
        return items
```
In actual use, it becomes
```python
data_manager = RecommenderDatabaseManager(connection_string)
# Write recall items.
data_manager.write_items(recall_items)
# Read recall items.
data_manager.read_items()
```
To support the rerank data read/write operations, a `table` parameter may be added to the methods of `RecommenderDatabaseManager` to select the target table in the database.
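As a sketch of this idea, the following hypothetical variant takes a `table` parameter; the standard-library `sqlite3` is used here as a stand-in for the unspecified database helpers in the earlier code.

```python
import sqlite3


class RecommenderDatabaseManager:
    """A hypothetical, table-aware variant, using sqlite3 as a stand-in."""

    def __init__(self, connection_string):
        # One connection per manager; ":memory:" gives an in-memory database.
        self.conn = sqlite3.connect(connection_string)

    def write_items(self, items, table="recall_items"):
        """Write items to the given table, creating it if needed."""
        self.conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (item TEXT)")
        self.conn.executemany(
            f"INSERT INTO {table} VALUES (?)", [(i,) for i in items]
        )
        self.conn.commit()

    def read_items(self, table="recall_items"):
        """Read all items from the given table."""
        rows = self.conn.execute(f"SELECT item FROM {table}").fetchall()
        return [r[0] for r in rows]


# Usage: recall and rerank items go to separate tables via the same manager.
manager = RecommenderDatabaseManager(":memory:")
manager.write_items(["item_a", "item_b"], table="recall_items")
manager.write_items(["item_b"], table="rerank_items")
```

The data-persistence responsibility stays in one place, and each stage only chooses which table to target.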
Making the holistic pipeline
Usually, before either the recall or the rerank stage, there may be data pre-processing steps, e.g., feature engineering; after generating the results, the items sometimes need to be post-processed, e.g., business rules may apply to the items before they are recommended. An anti-pattern approach is to add `pre_process` and `post_process` methods to the `Recall` class for preprocessing and post-processing, respectively.
However, these two methods do not directly help the existing methods `build_model` and `generate_items`. This is because the methods in the `Recall` class are already self-consistent in terms of functionality, given the input data and parameters. In this case, the `pre_process` and `post_process` methods do not need to be added to the `Recall` class. A better choice is to implement the two as standalone classes or functions that serve the purposes of preprocessing and post-processing, respectively.
The following shows an example of the function implementation and its workflow in combination with the `Recall` class.
```python
def pre_process(user_item_interactions):
    """Preprocess function for user-item interactions."""
    # Do some preprocessing.
    return user_item_interactions


# Preprocess data.
user_item_interactions = pre_process(user_item_interactions)

# Build the recall model.
recall = Recall()
recall.build_model(user_item_interactions, **parameters)
```
By doing this, the functionalities of preprocessing and the actual recall are separated for single responsibilities.
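A standalone `post_process` function can follow the same shape. The business rule below (dropping items a user has already interacted with) is a hypothetical example, not part of the original design.

```python
def post_process(recalled_items, already_seen):
    """Hypothetical post-processing: drop items the user has already seen.

    Args:
        recalled_items: item ids produced by the recall stage.
        already_seen: set of item ids to exclude, e.g., past interactions.
    """
    # Keep only the items the user has not interacted with yet.
    return [item for item in recalled_items if item not in already_seen]


filtered = post_process(["item_a", "item_b", "item_c"], {"item_b"})
```

Because the rule lives outside `Recall`, business-rule changes never force a change to the recall logic itself.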
References
- Martin, Robert C. (2005). “The Principles of OOD”. butunclebob.com.
- Andreas Argyriou, Miguel Gonzalez-Fierro, and Le Zhang, “Microsoft Recommenders: Best Practices for Production-Ready Recommender Systems”.
- Pandas Development Team, “Pandas”, url: https://doi.org/10.5281/zenodo.3509134, Zenodo, 2020
- Refactoring guru, “Strategy in Python”.