12th April 2020
Testing for accuracy using independently created Gold Standards
Data science and machine learning helps us better manage and shape our portfolio; and operate more efficiently and at scale so that we can execute on our patent strategy.
Mike Lee, Director, Head of Patents, Google
Cipher’s ML technology is custom built for classifying patents, not a standard text system that’s been repurposed. We have dedicated processing for the data that makes patents unique, in order to get the most accurate classification.
One of the great things about ML algorithms is that their accuracy is easy to test scientifically. To do that, we took test data generated by a third party (Tony Trippe, Patinformatics), which was split into two parts: inside the topic, and outside the topic (but still relevant).
We then trained Cipher’s classifiers on a portion of the data and tested them against the remainder, using the process described in our paper, “Construction and evaluation of gold standards for patent classification”, published in World Patent Information.
The two topics we’ve tested Cipher on are Quantum Computing Q-bit Generation and Cannabinoid Edibles. Results are shown for three training set sizes (the training set being the patents used to train the classifier): small at 150 families, medium at 250 families, and large at 350 families. The results of the tests, averaged over 100 runs, are:
| Training set size | Small | Medium | Large |
|---|---|---|---|
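The train-and-test procedure described above can be sketched roughly as follows. This is a minimal illustration, not Cipher’s implementation: the stand-in classifier (`train_threshold`), the synthetic one-feature data (`make_synthetic`), and the choice of accuracy, precision and recall as metrics are all assumptions made for the example. Only the training set sizes (150, 250, 350) and the 100 repeated runs come from the text.

```python
import random
from statistics import mean


def make_synthetic(n_pos, n_neg, seed=0):
    """Hypothetical data: one numeric feature per example, plus a label."""
    rng = random.Random(seed)
    data = [(rng.gauss(1.0, 1.0), True) for _ in range(n_pos)]
    data += [(rng.gauss(-1.0, 1.0), False) for _ in range(n_neg)]
    return data


def train_threshold(train):
    """Stand-in classifier (an assumption): midpoint between the class means.

    Assumes both classes appear in the training sample.
    """
    pos = [x for x, y in train if y]
    neg = [x for x, y in train if not y]
    threshold = (mean(pos) + mean(neg)) / 2
    return lambda x: x > threshold


def score(predict, test):
    """Accuracy, precision and recall of `predict` on held-out examples."""
    tp = sum(1 for x, y in test if y and predict(x))
    fp = sum(1 for x, y in test if not y and predict(x))
    fn = sum(1 for x, y in test if y and not predict(x))
    tn = len(test) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(test),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def evaluate(examples, train_size, runs=100, seed=0):
    """Train on a random sample of `train_size`, test on the remainder,
    and average the metrics over `runs` independent shuffles."""
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:train_size], shuffled[train_size:]
        results.append(score(train_threshold(train), test))
    return {key: mean(r[key] for r in results) for key in results[0]}


if __name__ == "__main__":
    data = make_synthetic(1000, 1000)
    for name, size in [("small", 150), ("medium", 250), ("large", 350)]:
        print(name, size, evaluate(data, size))
```

Averaging over many random splits reduces the chance that a single lucky or unlucky choice of training examples dominates the reported figure.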
The definitions of the technologies are:
- Quantum Computing Q-bit Generation: Qubit Generation for Quantum Computing refers to patents that discuss the various means of generating qubits for use in a quantum-mechanics-based computing system. Types of qubits include superconducting loops, topological, quantum-dot-based and ion-trap methods, as well as others. The excluded technologies are applications, algorithms and other auxiliary aspects of quantum computing that do not mention a hardware component, and hardware for other quantum phenomena outside of qubit generation. The test data consists of 2,282 positive example patents and 2,801 negative examples (from adjacent technologies).
- Cannabinoid Edibles: The positive collection covers edible items, which can include lozenges, beverages, or powders containing a cannabinoid substance that can be used directly by oral absorption, or formulated into a foodstuff for oral consumption. Cannabinoid substances include products from Cannabis sativa, ruderalis, or indica, as well as products from the processing of hemp, including hemp seeds, fibers, or oils. All of the records in the negative collection mention an edible item of one sort or another, and specifically a foodstuff. The test data consists of 1,603 positive example patents and 9,191 negative examples (from adjacent technologies).
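Because the two collections have different proportions of positive and negative examples, it can help to keep class balance in mind when reading any accuracy figure. The short sketch below computes, from the counts quoted above, the accuracy a trivial classifier would reach by always predicting the larger class. The function name `majority_baseline` is made up for this illustration, and the comparison itself is an assumption about how one might read the results, not part of the published evaluation.

```python
def majority_baseline(n_pos, n_neg):
    """Accuracy of always predicting the more frequent class."""
    return max(n_pos, n_neg) / (n_pos + n_neg)


# Positive / negative example counts as stated in the topic definitions.
collections = {
    "Quantum Computing Q-bit Generation": (2282, 2801),
    "Cannabinoid Edibles": (1603, 9191),
}

for topic, (n_pos, n_neg) in collections.items():
    print(f"{topic}: {majority_baseline(n_pos, n_neg):.1%}")
```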
Cipher is a pioneer in using supervised machine learning for the binary classification of patents, and our confidence comes not only from customer feedback, but also from peer-reviewed scientific evidence.