Tutorial: Continuous Mastering

This tutorial covers using the Python client to keep a Mastering project up to date. This includes carrying new data through to the end of the project and using any new labels to update the machine-learning model.

The steps propagate changes made in the Tamr user interface, such as pair labeling, but the tutorial itself never requires you to interact with the user interface.

Prerequisites

To complete this tutorial you will need:

  • tamr-unify-client installed

  • access to a Tamr instance, specifically:

    • a username and password that allow you to log in to Tamr

    • the socket address of the instance

  • an existing Mastering project in the following state

    • the schema mapping between the attributes of the source datasets and the unified dataset has been defined

    • the blocking model has been defined

    • labels have been applied to pairs

It is recommended that you first complete the tutorial here. Alternatively, a different Mastering project can be used as long as the above conditions are met.

Steps

1. Configure the Session and Instance

  • Use your username and password to create an instance of tamr_client.UsernamePasswordAuth.

  • Use the function tamr_client.session.from_auth to create a Session.

from getpass import getpass

import tamr_client as tc

username = input("Tamr Username:")
password = getpass("Tamr Password:")

auth = tc.UsernamePasswordAuth(username, password)
session = tc.session.from_auth(auth)

  • Create an Instance using the protocol, host, and port of your Tamr instance. Replace the values below with the corresponding values for your Tamr instance.

protocol = "http"
host = "localhost"
port = 9100

instance = tc.Instance(protocol=protocol, host=host, port=port)
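For unattended runs, such as from a scheduler, prompting for credentials may not be practical. The following is a minimal sketch, not part of tamr_client, that reads connection settings from environment variables and splits a socket address into host and port. The variable names TAMR_USERNAME, TAMR_PASSWORD, and TAMR_SOCKET_ADDRESS, and the helpers ConnectionSettings, split_socket_address, and load_settings, are hypothetical. The sketch assumes the socket address has the form host:port.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple

# Hypothetical environment variable names; choose whatever suits your deployment.
REQUIRED_VARS = ("TAMR_USERNAME", "TAMR_PASSWORD", "TAMR_SOCKET_ADDRESS")


@dataclass(frozen=True)
class ConnectionSettings:
    username: str
    password: str
    host: str
    port: int


def split_socket_address(address: str) -> Tuple[str, int]:
    """Split a "host:port" string into (host, port)."""
    host, sep, port = address.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"Expected 'host:port', got {address!r}")
    return host, int(port)


def load_settings(env: Mapping[str, str] = os.environ) -> ConnectionSettings:
    """Read connection settings from a mapping of environment variables."""
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    host, port = split_socket_address(env["TAMR_SOCKET_ADDRESS"])
    return ConnectionSettings(
        username=env["TAMR_USERNAME"],
        password=env["TAMR_PASSWORD"],
        host=host,
        port=port,
    )
```

The resulting username and password could then be passed to tc.UsernamePasswordAuth, and the host and port to tc.Instance, in place of the prompted and hard-coded values above.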

2. Get the Tamr Mastering project to be updated

Use the function tc.project.by_name to retrieve the project information from the server by its name.

project = tc.project.by_name(session, instance, "MasteringTutorial")

Ensure that the retrieved project is a Mastering project by checking its type:

if not isinstance(project, tc.MasteringProject):
    raise RuntimeError(f"{project.name} is not a mastering project.")

3. Update the unified dataset

To update the unified dataset with the latest source data, use the function tc.mastering.update_unified_dataset.

operation_1 = tc.mastering.update_unified_dataset(session, project)
tc.operation.check(session, operation_1)

This function and all others in this tutorial are synchronous, meaning that they will not return until the job in Tamr has resolved, either successfully or unsuccessfully. The function tc.operation.check will raise an exception and halt the script if the job started in Tamr fails for any reason.
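Because each call blocks until its job has resolved and the check raises on failure, the steps can be driven by a simple loop that stops at the first failed job. The sketch below illustrates only that control flow, using stand-in callables rather than real Tamr calls. The names run_steps and fake_check are hypothetical, and the dictionary used to represent an operation is an assumption made for this example, not the client's actual operation type.

```python
from typing import Any, Callable, List, Sequence, Tuple

# A step is a label plus a zero-argument callable that starts a job and
# returns only once the job has resolved (mirroring the synchronous calls).
Step = Tuple[str, Callable[[], Any]]


def run_steps(steps: Sequence[Step], check: Callable[[Any], None]) -> List[str]:
    """Run steps in order; stop at the first one whose check raises."""
    completed: List[str] = []
    for name, start in steps:
        operation = start()
        try:
            check(operation)
        except Exception as exc:
            raise RuntimeError(
                f"Step {name!r} failed; later steps were not run"
            ) from exc
        completed.append(name)
    return completed


def fake_check(operation: dict) -> None:
    """Stand-in for a real check: raise unless the fake job succeeded."""
    if operation.get("state") != "SUCCEEDED":
        raise RuntimeError(f"Job ended in state {operation.get('state')!r}")
```

With the real client, a step could be written as ("update unified dataset", lambda: tc.mastering.update_unified_dataset(session, project)), with lambda op: tc.operation.check(session, op) passed as the check.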

4. Generate pairs

To generate pairs according to the configured pair filter rules, use the function tc.mastering.generate_pairs.

operation_2 = tc.mastering.generate_pairs(session, project)
tc.operation.check(session, operation_2)

5. Train the model with new labels

Running all of the functions in this section and in the “Apply the model” section that follows is equivalent to initiating “Apply feedback and update results” in the Tamr user interface.

To update the machine-learning model with newly applied labels, use the function tc.mastering.apply_feedback.

operation_3 = tc.mastering.apply_feedback(session, project)
tc.operation.check(session, operation_3)

6. Apply the model

Running all of the functions in the previous “Train the model with new labels” section and in this section is equivalent to initiating “Apply feedback and update results” in the Tamr user interface.

Running the functions in this section alone is equivalent to initiating “Update results only” in the Tamr user interface.

Applying the trained machine-learning model requires three functions.

  • To update the pair prediction results, use the function tc.mastering.update_pair_results.

operation_4 = tc.mastering.update_pair_results(session, project)
tc.operation.check(session, operation_4)

  • To update the list of high-impact pairs, use the function tc.mastering.update_high_impact_pairs.

operation_5 = tc.mastering.update_high_impact_pairs(session, project)
tc.operation.check(session, operation_5)

  • To update the clustering results, use the function tc.mastering.update_cluster_results.

operation_6 = tc.mastering.update_cluster_results(session, project)
tc.operation.check(session, operation_6)
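The correspondence between these functions and the two user-interface actions can be summarized as a lookup. The sketch below only builds tuples of function names taken from this tutorial and does not call Tamr. The helper steps_for and the mode strings are hypothetical.

```python
from typing import Tuple

# Function names from tc.mastering, as used in this tutorial.
TRAIN_STEPS: Tuple[str, ...] = ("apply_feedback",)
APPLY_STEPS: Tuple[str, ...] = (
    "update_pair_results",
    "update_high_impact_pairs",
    "update_cluster_results",
)


def steps_for(mode: str) -> Tuple[str, ...]:
    """Return the tc.mastering function names matching a UI action.

    The mode strings are hypothetical labels for the two UI actions:
    "apply_feedback_and_update_results" and "update_results_only".
    """
    if mode == "apply_feedback_and_update_results":
        return TRAIN_STEPS + APPLY_STEPS
    if mode == "update_results_only":
        return APPLY_STEPS
    raise ValueError(f"Unknown mode: {mode!r}")
```

A script could then look up each name with getattr(tc.mastering, name), call it with session and project, and pass the result to tc.operation.check, which keeps the choice between retraining and only updating results in a single argument.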

7. Publish the clusters

To publish the record clusters, use the function tc.mastering.publish_clusters.

operation_7 = tc.mastering.publish_clusters(session, project)
tc.operation.check(session, operation_7)

All of the above steps can be combined into the following script, continuous_mastering.py:

from getpass import getpass

import tamr_client as tc

username = input("Tamr Username:")
password = getpass("Tamr Password:")

auth = tc.UsernamePasswordAuth(username, password)
session = tc.session.from_auth(auth)

protocol = "http"
host = "localhost"
port = 9100

instance = tc.Instance(protocol=protocol, host=host, port=port)

project = tc.project.by_name(session, instance, "MasteringTutorial")

if not isinstance(project, tc.MasteringProject):
    raise RuntimeError(f"{project.name} is not a mastering project.")

operation_1 = tc.mastering.update_unified_dataset(session, project)
tc.operation.check(session, operation_1)

operation_2 = tc.mastering.generate_pairs(session, project)
tc.operation.check(session, operation_2)

operation_3 = tc.mastering.apply_feedback(session, project)
tc.operation.check(session, operation_3)

operation_4 = tc.mastering.update_pair_results(session, project)
tc.operation.check(session, operation_4)

operation_5 = tc.mastering.update_high_impact_pairs(session, project)
tc.operation.check(session, operation_5)

operation_6 = tc.mastering.update_cluster_results(session, project)
tc.operation.check(session, operation_6)

operation_7 = tc.mastering.publish_clusters(session, project)
tc.operation.check(session, operation_7)

To run the script from the command line:

TAMR_CLIENT_BETA=1 python continuous_mastering.py

To continue learning, see other tutorials and examples.