LinkTransformer – AI-powered dataframe operations in 2 lines of code

Link data frames, deduplicate, cluster, and perform multilingual merges with the full power of Deep Learning.

What is LinkTransformer?

A Unified Python Package for Record Linkage with Transformer Models

Easy Record Linkage with Deep Learning Models

With the help of state-of-the-art deep learning models, LinkTransformer brings the advantages of artificial intelligence to standard data frame manipulation tasks like deduplication, clustering, and merges. Using most deep-learning libraries and pipelines for these tasks is out of reach for the average data analyst. LinkTransformer makes using these models in a standard data-wrangling workflow easy with just a few lines of code. Training your own models is also as easy as one line of code with most of the heavy lifting done behind the scenes. The API is designed to be as simple as possible and very familiar to practitioners coming from other environments like R and Stata.
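To make the deduplication idea concrete, here is a minimal sketch of similarity-based deduplication. It is not LinkTransformer's implementation: `embed` is a toy stand-in (character trigram counts) for a language-model embedding, and the names `embed`, `cosine`, `dedup`, and the 0.6 threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy stand-in for a model embedding: character n-gram counts."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dedup(values, threshold=0.6):
    """Link values whose similarity reaches `threshold`, then keep the
    first value of each linked group (single-linkage via union-find)."""
    vecs = [embed(v) for v in values]
    parent = list(range(len(values)))

    def find(i):
        # Follow parent pointers to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                ri, rj = find(i), find(j)
                # The smaller index becomes the root, so the root is
                # always the first occurrence within its group.
                parent[max(ri, rj)] = min(ri, rj)

    return [v for i, v in enumerate(values) if find(i) == i]


names = ["Acme Corp", "ACME Corp.", "Globex Inc", "Acme Corporation", "Globex Inc."]
unique = dedup(names)
print(unique)
```

A real pipeline would replace `embed` with a Transformer model, which is what lets matches survive abbreviations, translations, and paraphrases that share few characters.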

[Figure: Example of a familiar API on data frame manipulation tasks]

Merge with language models like you would in Pandas!
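Conceptually, a merge of this kind pairs each row on the left with the most similar row on the right instead of requiring exact key equality. The sketch below illustrates that with plain Python and a toy embedding (character trigram counts). It is an assumption-laden illustration, not LinkTransformer's API or implementation: `embed`, `cosine`, and `nearest_neighbor_merge` are hypothetical names, and the `_x`/`_y` suffixes simply imitate the Pandas convention.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy stand-in for a language-model embedding: character n-gram counts."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbor_merge(left, right, on):
    """Attach to each left row the right row whose `on` value is most similar."""
    right_vecs = [embed(row[on]) for row in right]
    merged = []
    for row in left:
        vec = embed(row[on])
        scores = [cosine(vec, rv) for rv in right_vecs]
        best = max(range(len(right)), key=scores.__getitem__)
        # Suffix the columns the way a Pandas merge would.
        out = {f"{k}_x": v for k, v in row.items()}
        out.update({f"{k}_y": v for k, v in right[best].items()})
        out["score"] = scores[best]
        merged.append(out)
    return merged


left = [{"company": "Intl. Business Machines"}, {"company": "Toyota Motor Corp"}]
right = [
    {"company": "Toyota Motor Corporation", "country": "JP"},
    {"company": "International Business Machines", "country": "US"},
]
merged = nearest_neighbor_merge(left, right, on="company")
for row in merged:
    print(row["company_x"], "->", row["company_y"], round(row["score"], 2))
```

Keeping the similarity score in the output is a deliberate choice: it lets you inspect or filter weak matches after the merge rather than trusting every nearest neighbor.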

Easy Text Classification with Deep Learning Models

LinkTransformer gives you access to Transformer language models through a simple API that classifies the rows of a data frame in one line of code. It supports both OpenAI and Hugging Face models for the task. We recommend binary classification for beginners, but multi-class classification is also supported, as is a weighted training loss for unbalanced training classes. To classify with a pre-trained model (like ChatGPT), all you need is a data frame of the text you want to classify, row by row. To train your own model, all you need is a data frame of text and class labels; training takes minutes and fits within a Colab session. Check out this Tutorial!

Classify Text using Transformer language models as if it were a feature of Pandas!
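The weighted training loss mentioned above is easiest to understand with numbers. The source does not say how LinkTransformer derives its class weights, so the sketch below uses one common heuristic as an assumption: inverse-frequency ("balanced") weights, n_samples / (n_classes * class_count). The function names are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so rare classes count more in the loss."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def weighted_mean_loss(losses, labels, weights):
    """Average per-example losses, each scaled by its class weight."""
    total_weight = sum(weights[y] for y in labels)
    return sum(weights[y] * loss for loss, y in zip(losses, labels)) / total_weight


# An unbalanced binary dataset: eight negatives, two positives.
labels = [0] * 8 + [1] * 2
weights = inverse_frequency_weights(labels)
print(weights)

# Suppose the model is wrong only on the two rare positives.
losses = [0.0] * 8 + [1.0] * 2
plain = sum(losses) / len(losses)
weighted = weighted_mean_loss(losses, labels, weights)
print(plain, weighted)
```

With these weights, both classes contribute the same total mass to the loss, so mistakes on the minority class are no longer drowned out by the majority.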

All models on Hugging Face and OpenAI embeddings are supported

LinkTransformer supports all models on the Hugging Face Hub (we recommend those trained for sentence similarity tasks) and OpenAI embedding models. To accommodate heterogeneous use cases, we have also trained our own collection of open-source language models on different datasets for record linkage. A guide to selecting models is included in our Introductory Notebook. Currently, over 20 models trained on diverse datasets can be downloaded from Hugging Face and are discoverable at this link.

Language      Model Name

Company Alias linkage
Spanish       dell-research-harvard/lt-wikidata-comp-es
French        dell-research-harvard/lt-wikidata-comp-fr
Japanese      dell-research-harvard/lt-wikidata-comp-ja
Chinese       dell-research-harvard/lt-wikidata-comp-zh
German        dell-research-harvard/lt-wikidata-comp-de
English       dell-research-harvard/lt-wikidata-comp-en
Multilingual  dell-research-harvard/lt-wikidata-comp-multi
Japanese      dell-research-harvard/lt-wikidata-comp-prod-ind-ja

Product Linkage
English       dell-research-harvard/lt-un-data-fine-fine-en
Spanish       dell-research-harvard/lt-un-data-fine-fine-es
French        dell-research-harvard/lt-un-data-fine-fine-fr
Multilingual  dell-research-harvard/lt-un-data-fine-fine-multi

Product to Industry Linkage
English       dell-research-harvard/lt-un-data-fine-industry-en
Spanish       dell-research-harvard/lt-un-data-fine-industry-es
French        dell-research-harvard/lt-un-data-fine-industry-fr
Multilingual  dell-research-harvard/lt-un-data-fine-industry-multi

Product Aggregation
English       dell-research-harvard/lt-un-data-fine-coarse-en
Spanish       dell-research-harvard/lt-un-data-fine-coarse-es
French        dell-research-harvard/lt-un-data-fine-coarse-fr
Multilingual  dell-research-harvard/lt-un-data-fine-coarse-multi

As base models for various languages, we recommend the following:

Language      Base Model
English       sentence-transformers/multi-qa-mpnet-base-dot-v1
Japanese      oshizo/sbert-jsnli-luke-japanese-base-lite
French        dangvantuan/sentence-camembert-large
Chinese       DMetaSoul/sbert-chinese-qmc-domain-v1
Spanish       hiiamsid/sentence_similarity_spanish_es
German        Sahajtomar/German-semantic
Multilingual  sentence-transformers/paraphrase-multilingual-mpnet-base-v2
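If you select models programmatically, the tables above translate naturally into a lookup with a multilingual fallback. The following is only a sketch: the model names are copied from the tables, while the language codes used as keys and the `pick_model` helper are assumptions made for illustration and are not part of the package.

```python
# Fine-tuned company alias linkage models, keyed by a hypothetical language code.
COMPANY_ALIAS_MODELS = {
    "es": "dell-research-harvard/lt-wikidata-comp-es",
    "fr": "dell-research-harvard/lt-wikidata-comp-fr",
    "ja": "dell-research-harvard/lt-wikidata-comp-ja",
    "zh": "dell-research-harvard/lt-wikidata-comp-zh",
    "de": "dell-research-harvard/lt-wikidata-comp-de",
    "en": "dell-research-harvard/lt-wikidata-comp-en",
    "multi": "dell-research-harvard/lt-wikidata-comp-multi",
}

# Recommended base models for the same languages.
BASE_MODELS = {
    "en": "sentence-transformers/multi-qa-mpnet-base-dot-v1",
    "ja": "oshizo/sbert-jsnli-luke-japanese-base-lite",
    "fr": "dangvantuan/sentence-camembert-large",
    "zh": "DMetaSoul/sbert-chinese-qmc-domain-v1",
    "es": "hiiamsid/sentence_similarity_spanish_es",
    "de": "Sahajtomar/German-semantic",
    "multi": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
}


def pick_model(language, fine_tuned=True):
    """Return a model name for `language`, falling back to the
    multilingual model when the language has no dedicated entry."""
    table = COMPANY_ALIAS_MODELS if fine_tuned else BASE_MODELS
    return table.get(language, table["multi"])


print(pick_model("de"))
print(pick_model("ko"))  # no Korean entry, so the multilingual model is used
print(pick_model("en", fine_tuned=False))
```

Falling back to the multilingual model keeps the lookup total, so a language missing from the tables never raises an error.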

Easy to Train Your Own Record Linkage models with one line of code

LinkTransformer also comes with full support for training customized record linkage models on your own dataset. This lets you improve prediction accuracy on your own data and can simplify your pipeline. You can train your own model or fine-tune any pretrained model from Hugging Face! Learn more at our repo.

[Figure: Training illustration for LT]

Easy to Train Your Own Transformer models for Text Classification with one line of code - for any base model on Hugging Face!

LinkTransformer also supports training classification models with a single line of code, given a data frame of text and labels!

Sharing Trained models on the Hugging Face Hub

LinkTransformer aims to create a community for deep record linkage, as deep learning has so far made few inroads into record linkage. Making open-source, pre-trained models available is particularly important for democratizing the benefits of deep learning. LinkTransformer is well integrated with the Hugging Face Hub, and all models can be uploaded or downloaded with a single line of code. Our own models are shared on the Hub, and we encourage the community to share their models as well. If you use the library to upload LinkTransformer models, they automatically get a “tag” on Hugging Face and become discoverable at this link. By streamlining the distribution of record linkage models and promoting the reusability of these deep learning-based pipelines, LinkTransformer can transform how social scientists build their record linkage workflows.

Illustration of the sharing Platform

Coming Soon!

  • Hard-negative mining for efficient training
  • FAISS GPU, cuDF, cuML and cuGraph integration
  • Convenience wrapper to use our models (trained on UN products and Wikidata)
  • Integration of other modalities in this framework (Vision/Multimodal models)

Tell us what other features you would like - raise an issue on our repo or help us develop it!

Team

Join us!

Collaborate with us on the package - check out our repo and contribute!

Get started!

Learn about LinkTransformer via a collection of carefully curated tutorials on this website.