New patent text similarity methods with a comprehensive understanding of SAO semantics
| Published in | World patent information Vol. 83; p. 102403 |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | Elsevier Ltd, 01.12.2025 |
| ISSN | 0172-2190 |
| DOI | 10.1016/j.wpi.2025.102403 |
| Summary: | Patent text similarity measurement is recognized as a critical component of semantic search, due diligence, infringement detection, and litigation in intellectual property management. With the continued growth in global patent filings, conventional keyword-, citation-, and classification-based approaches have been shown to capture the contextual semantics of patent documents inadequately. Subject–Action–Object (SAO) structures provide a promising semantic representation; however, their effectiveness has been limited by the scarcity of specialized Semantic Text Similarity (STS) datasets and the lack of comprehensive evaluations. In this study, a novel and comprehensive framework for patent text similarity leveraging SAO semantics is proposed. Specialized patent STS datasets were constructed from USPTO examination decisions and PTAB appeal documents, comprising a 2-point scale similarity dataset and a ranking dataset for retrieval evaluation, the first openly available benchmarks of this kind. The framework integrates multiple SAO extraction techniques, novel weighting strategies including clustering-based methods, and a variety of similarity computation approaches ranging from lexical to deep learning models. Experimental evaluations show that the proposed SAO-based framework improves retrieval accuracy by 43 % over keyword-based baselines and by 26 % over standard document embedding methods. Vector-based similarity algorithms incorporating K-means clustering weights achieved a 32 % improvement over unweighted baselines, while the knowledge-based similarity threshold of 0.4–0.6 achieved the maximum distinction between similar and dissimilar patents.
A systematic ablation analysis identified the optimal configuration as combining SAO embeddings derived from pre-trained patent vectors with clustering-based weighting, similarity thresholds, and semantic knowledge extensions. This configuration yielded superior performance in litigation support, infringement detection, and patent retrieval, reducing the average ranking position of relevant patents from 5.7 to 2.7 and achieving top-3 retrieval in all test cases.
• Patent semantic text similarity (STS) benchmarks are proposed. • Segment ratios, thresholds, and clustering strategies affect semantic similarity results. • Patent pre-trained models and classifications improve similarity efficiency. • Deep matching offers few clear advantages over direct similarity methods. • An optimally configured framework is proposed for better patent retrieval. |
|---|---|
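
The summary describes weighted, thresholded SAO similarity and a rank-based retrieval evaluation without giving formulas. The following is a minimal sketch of how such a pipeline might look, not the authors' implementation. Everything in it is an assumption made for illustration: each patent is a list of `(sao_vector, weight)` pairs, the weights are taken as given (they could, for example, come from K-means cluster statistics), the best-match aggregation is a choice of this sketch, and the 0.5 cutoff is simply a value inside the 0.4–0.6 range the summary reports for knowledge-based similarity, applied here to cosine similarity. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def sao_similarity(query, candidate, threshold=0.5):
    """Weighted average of best-match SAO similarities (assumed aggregation).

    query, candidate: lists of (vector, weight) pairs, one pair per SAO.
    A query SAO contributes only if its best match reaches the threshold.
    """
    total_weight = sum(w for _, w in query)
    if total_weight == 0 or not candidate:
        return 0.0
    score = 0.0
    for q_vec, q_weight in query:
        best = max(cosine(q_vec, c_vec) for c_vec, _ in candidate)
        if best >= threshold:
            score += q_weight * best
    return score / total_weight

def rank_of_relevant(query, corpus, relevant_id, threshold=0.5):
    """1-based position of the relevant patent when the corpus is sorted by score."""
    ranked = sorted(
        corpus,
        key=lambda pid: sao_similarity(query, corpus[pid], threshold),
        reverse=True,
    )
    return ranked.index(relevant_id) + 1

def mean_rank(ranks):
    """Average ranking position of the relevant patents (lower is better)."""
    return sum(ranks) / len(ranks)

def top_k_rate(ranks, k=3):
    """Fraction of queries whose relevant patent appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

# Toy data: 2-dimensional "SAO embeddings" with hand-picked weights.
query = [([1, 0], 2.0), ([0, 1], 1.0)]
corpus = {
    "A": [([1, 0], 1.0), ([0, 1], 1.0)],
    "B": [([1, 1], 1.0)],
    "C": [([0, 1], 1.0)],
}
```

Under these assumptions, `mean_rank` and `top_k_rate` correspond to the two retrieval figures quoted in the summary (average ranking position and top-3 retrieval); a lower mean rank and a higher top-k rate indicate better retrieval.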