New patent text similarity methods with a comprehensive understanding of SAO semantics

Bibliographic Details
Published in	World patent information Vol. 83; p. 102403
Main Authors	Wang, Nan; Wan, Ziyi; Zhao, Hongyu; Hu, Yingtong
Format	Journal Article
Language	English
Published	Elsevier Ltd 01.12.2025
Subjects	Knowledge-based algorithms; Patent retrieval; Pre-trained language models (PLM); Semantic text similarity (STS); Similarity datasets; Subject-action-object (SAO); Weighted strategy; O32; O34; 68U15
ISSN	0172-2190
DOI	10.1016/j.wpi.2025.102403

Summary:	Patent text similarity measurement is a critical component of semantic search, due diligence, infringement detection, and litigation in intellectual property management. As global patent filings continue to grow, conventional keyword-, citation-, and classification-based approaches have been shown to capture the contextual semantics of patent documents inadequately. Subject–Action–Object (SAO) structures provide a promising semantic representation; however, their effectiveness has been limited by the scarcity of specialized Semantic Text Similarity (STS) datasets and the lack of comprehensive evaluations. This study proposes a comprehensive framework for patent text similarity that leverages SAO semantics. Specialized patent STS datasets were constructed from USPTO examination decisions and PTAB appeal documents, comprising a 2-point scale similarity dataset and a ranking dataset for retrieval evaluation, described as the first openly available benchmarks of this kind. The framework integrates multiple SAO extraction techniques, novel weighting strategies including clustering-based methods, and a variety of similarity computation approaches ranging from lexical to deep learning models. Experimental evaluations show that the proposed SAO-based framework improves retrieval accuracy by 43% over keyword-based baselines and by 26% over standard document embedding methods. Vector-based similarity algorithms incorporating K-means clustering weights achieved a 32% improvement over unweighted baselines, while a knowledge-based similarity threshold of 0.4–0.6 achieved the maximum distinction between similar and dissimilar patents.
A systematic ablation analysis identified the optimal configuration as combining SAO embeddings derived from pre-trained patent vectors with clustering-based weighting, similarity thresholds, and semantic knowledge extensions. This configuration yielded superior performance in litigation support, infringement detection, and patent retrieval, reducing the average ranking position of relevant patents from 5.7 to 2.7 and achieving top-3 retrieval in all test cases.
• Patent semantic text similarity (STS) benchmarks are proposed.
• Segment ratios, thresholds, and clustering strategies affect semantic similarity.
• Patent pre-trained models and classifications improve similarity efficiency.
• Deep matching has few outstanding advantages over direct similarity methods.
• An optimally selected framework is proposed for better patent retrieval.
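The summary names three ingredients: SAO vectors, clustering-based weights, and a similarity threshold. It also reports the average ranking position of relevant patents. The Python sketch below shows one way these pieces could fit together. It is not the authors' implementation. The function names, the rule that weights an SAO by its cluster's share of the document, the best-match aggregation, and the default threshold of 0.5 (picked from inside the reported 0.4–0.6 range) are all assumptions made for illustration. Cluster labels are taken as given, for example from a K-means run over the SAO embeddings.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_weights(labels):
    """Assumed weighting rule: each SAO gets the share of SAOs in its cluster."""
    counts = Counter(labels)
    return [counts[label] / len(labels) for label in labels]


def weighted_sao_similarity(vecs_a, weights_a, vecs_b, threshold=0.5):
    """Weighted average of each SAO's best cosine match in the other patent.

    Matches below `threshold` contribute zero (hypothetical cutoff rule).
    """
    total_weight = sum(weights_a)
    if not vecs_b or total_weight == 0:
        return 0.0
    score = 0.0
    for vec, weight in zip(vecs_a, weights_a):
        best = max(cosine(vec, other) for other in vecs_b)
        if best >= threshold:
            score += weight * best
    return score / total_weight


def mean_rank(rankings, relevant):
    """Average 1-based position of the relevant patent in each ranked list."""
    positions = [ranking.index(rel) + 1 for ranking, rel in zip(rankings, relevant)]
    return sum(positions) / len(positions)


# Toy data: three SAO vectors for patent A (two in cluster 0), one for patent B.
sao_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
labels_a = [0, 1, 0]
sao_b = [[1.0, 0.0]]

weights_a = cluster_weights(labels_a)
similarity = weighted_sao_similarity(sao_a, weights_a, sao_b)
average_position = mean_rank([["p2", "p1", "p3"], ["p1", "p4"]], ["p1", "p1"])
```

Under this assumed rule, SAOs from a document's dominant cluster count more toward the score, and the threshold discards weak matches before they are averaged in. A lower `mean_rank` value means relevant patents appear nearer the top of the result list, which is the direction of the 5.7 to 2.7 change reported in the summary.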