Tutorial: Continuous Mastering
This tutorial will cover using the Python client to keep a Mastering project up-to-date. This includes carrying new data through to the end of the project and using any new labels to update the machine-learning model.
While this is intended to propagate changes such as pair labeling that may be applied in the Tamr user interface, at no point during this tutorial is it necessary to interact with the user interface in any way.
Prerequisites
To complete this tutorial you will need:
tamr-unify-client installed
access to a Tamr instance, specifically:
a username and password that allow you to log in to Tamr
the socket address of the instance
an existing Mastering project in the following state:
the schema mapping between the attributes of the source datasets and the unified dataset has been defined
the blocking model has been defined
labels have been applied to pairs
It is recommended that you first complete the tutorial here. Alternatively, a different Mastering project can be used as long as the above conditions are met.
Steps
1. Configure the Session and Instance
Use your username and password to create an instance of tamr_client.UsernamePasswordAuth. Use the function tamr_client.session.from_auth to create a Session.
from getpass import getpass
import tamr_client as tc
username = input("Tamr Username:")
password = getpass("Tamr Password:")
auth = tc.UsernamePasswordAuth(username, password)
session = tc.session.from_auth(auth)
Create an Instance using the protocol, host, and port of your Tamr instance. Replace the example values below with the corresponding values for your Tamr instance.
protocol = "http"
host = "localhost"
port = 9100
instance = tc.Instance(protocol=protocol, host=host, port=port)
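For scheduled, unattended runs it can be convenient to keep connection settings out of the script. The following is a minimal sketch, assuming hypothetical environment variables TAMR_PROTOCOL, TAMR_HOST, and TAMR_PORT (these names are illustrative and are not read by the client itself); it falls back to the values used in this tutorial.

```python
import os

# Hypothetical variable names chosen for this sketch; tamr_client does not read them.
DEFAULTS = {"protocol": "http", "host": "localhost", "port": "9100"}


def instance_settings(env=None):
    """Return protocol, host, and port, preferring environment overrides."""
    env = os.environ if env is None else env
    protocol = env.get("TAMR_PROTOCOL", DEFAULTS["protocol"])
    host = env.get("TAMR_HOST", DEFAULTS["host"])
    port = int(env.get("TAMR_PORT", DEFAULTS["port"]))  # the port is passed as an integer
    return {"protocol": protocol, "host": host, "port": port}


# Example with an explicit mapping instead of the real environment.
settings = instance_settings({"TAMR_HOST": "tamr.example.com"})
print(settings)
```

The resulting dictionary has the same keys as the keyword arguments shown above, so it could be passed on as tc.Instance(**settings).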
2. Get the Tamr Mastering project to be updated
Use the function tc.project.by_name to retrieve the project information from the server by its name.
project = tc.project.by_name(session, instance, "MasteringTutorial")
Ensure that the retrieved project is a Mastering project by checking its type:
if not isinstance(project, tc.MasteringProject):
    raise RuntimeError(f"{project.name} is not a mastering project.")
3. Update the unified dataset
To update the unified dataset, use the function tc.mastering.update_unified_dataset. This function:
Applies the attribute mapping configuration
Applies any transformations
Updates the unified dataset with updated source data
operation_1 = tc.mastering.update_unified_dataset(session, project)
tc.operation.check(session, operation_1)
This function and all others in this tutorial are synchronous: they do not return until the job in Tamr has resolved, whether successfully or unsuccessfully. The function tc.operation.check raises an exception and halts the script if the job started in Tamr fails for any reason.
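Because each step blocks until its job resolves and the check raises on failure, a small wrapper can record which step failed before the exception halts the script. The following is a minimal sketch, assuming a hypothetical helper named run_checked; its start and check arguments stand in for calls such as tc.mastering.update_unified_dataset and tc.operation.check, and are passed as plain callables so the sketch runs without a Tamr instance.

```python
import logging

logger = logging.getLogger("continuous_mastering")


def run_checked(label, start, check):
    """Start one job, verify it, and log the outcome.

    `start` is a zero-argument callable that returns an operation;
    `check` is a one-argument callable that raises if the operation failed.
    Both are placeholders for the tamr_client calls used in this tutorial.
    """
    logger.info("starting: %s", label)
    operation = start()
    try:
        check(operation)
    except Exception:
        logger.error("failed: %s", label)
        raise  # re-raise the original exception so the script still halts
    logger.info("succeeded: %s", label)
    return operation


# Stand-in callables for demonstration only.
result = run_checked("demo step", lambda: "operation", lambda op: None)
```

In the tutorial's script this could be called as run_checked("update unified dataset", lambda: tc.mastering.update_unified_dataset(session, project), lambda op: tc.operation.check(session, op)).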
4. Generate pairs
To generate pairs according to the configured pair filter rules, use the function tc.mastering.generate_pairs.
operation_2 = tc.mastering.generate_pairs(session, project)
tc.operation.check(session, operation_2)
5. Train the model with new labels
Running all of the functions in this section and in the “Apply the model” section that follows is equivalent to initiating “Apply feedback and update results” in the Tamr user interface.
To update the machine-learning model with newly applied labels, use the function tc.mastering.apply_feedback.
operation_3 = tc.mastering.apply_feedback(session, project)
tc.operation.check(session, operation_3)
6. Apply the model
Running all of the functions in the previous “Train the model with new labels” section and in this section is equivalent to initiating “Apply feedback and update results” in the Tamr user interface.
Running the functions in this section alone is equivalent to initiating “Update results only” in the Tamr user interface.
Applying the trained machine-learning model requires three functions.
To update the pair prediction results, use the function tc.mastering.update_pair_results.
operation_4 = tc.mastering.update_pair_results(session, project)
tc.operation.check(session, operation_4)
To update the list of high-impact pairs, use the function tc.mastering.update_high_impact_pairs.
operation_5 = tc.mastering.update_high_impact_pairs(session, project)
tc.operation.check(session, operation_5)
To update the clustering results, use the function tc.mastering.update_cluster_results.
operation_6 = tc.mastering.update_cluster_results(session, project)
tc.operation.check(session, operation_6)
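The relationship between sections 5 and 6 and the two user-interface actions can be sketched as a small lookup. This is an illustrative helper, not part of tamr_client: the function name model_steps and its retrain parameter are assumptions, while the returned strings are the tc.mastering function names used in this tutorial.

```python
def model_steps(retrain=True):
    """Return the tc.mastering function names to run, in order.

    retrain=True  mirrors "Apply feedback and update results".
    retrain=False mirrors "Update results only".
    """
    apply_model = [
        "update_pair_results",
        "update_high_impact_pairs",
        "update_cluster_results",
    ]
    # Training comes first when requested; applying the model always follows.
    return (["apply_feedback"] if retrain else []) + apply_model


print(model_steps(retrain=False))
```

Keeping the choice in one place makes it harder to run the training step by accident when only refreshed results are wanted.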
7. Publish the clusters
To publish the record clusters, use the function tc.mastering.publish_clusters.
operation_7 = tc.mastering.publish_clusters(session, project)
tc.operation.check(session, operation_7)
All of the above steps can be combined into the following script, continuous_mastering.py:
from getpass import getpass
import tamr_client as tc
username = input("Tamr Username:")
password = getpass("Tamr Password:")
auth = tc.UsernamePasswordAuth(username, password)
session = tc.session.from_auth(auth)
protocol = "http"
host = "localhost"
port = 9100
instance = tc.Instance(protocol=protocol, host=host, port=port)
project = tc.project.by_name(session, instance, "MasteringTutorial")
if not isinstance(project, tc.MasteringProject):
    raise RuntimeError(f"{project.name} is not a mastering project.")
operation_1 = tc.mastering.update_unified_dataset(session, project)
tc.operation.check(session, operation_1)
operation_2 = tc.mastering.generate_pairs(session, project)
tc.operation.check(session, operation_2)
operation_3 = tc.mastering.apply_feedback(session, project)
tc.operation.check(session, operation_3)
operation_4 = tc.mastering.update_pair_results(session, project)
tc.operation.check(session, operation_4)
operation_5 = tc.mastering.update_high_impact_pairs(session, project)
tc.operation.check(session, operation_5)
operation_6 = tc.mastering.update_cluster_results(session, project)
tc.operation.check(session, operation_6)
operation_7 = tc.mastering.publish_clusters(session, project)
tc.operation.check(session, operation_7)
To run the script via command line:
TAMR_CLIENT_BETA=1 python continuous_mastering.py
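The repeated start-then-check pattern in continuous_mastering.py can also be written as a loop. The sketch below assumes a hypothetical function run_pipeline that receives the imported tamr_client module as its client argument, so the same code can be exercised with a stub object; the step names are the tc.mastering functions from this tutorial.

```python
STEPS = (
    "update_unified_dataset",
    "generate_pairs",
    "apply_feedback",
    "update_pair_results",
    "update_high_impact_pairs",
    "update_cluster_results",
    "publish_clusters",
)


def run_pipeline(client, session, project, steps=STEPS):
    """Run each mastering step in order, checking every operation.

    `client` is expected to expose `mastering.<step>(session, project)` and
    `operation.check(session, operation)`, as used throughout this tutorial.
    """
    completed = []
    for name in steps:
        start = getattr(client.mastering, name)
        operation = start(session, project)
        client.operation.check(session, operation)  # expected to raise if the job failed
        completed.append(name)
    return completed
```

With the objects created in the script, this would be invoked as run_pipeline(tc, session, project). Because the first failed check raises, later steps never run against stale results.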
To continue learning, see other tutorials and examples.