Abstract
MDC engages extensively with patent data and we keep up with developments in both commercial and open sources. This survey, carried out in October 2021, gives an update on the open patent chemistry submitted to PubChem. The four largest sources are SureChEMBL, Google Patents, PATENTSCOPE (WIPO), and IBM, with compound (CID) counts of 21.5, 17.9, 17.7, and 10.7 million, respectively. SureChEMBL is the largest and the most recently updated (Aug 2021); Google Patents has not updated in over a year, WIPO last updated in Jan 2021, and IBM ceased updating in 2017. A PubChem query combining the three sources above with a legacy source (SCRIPDB) of 3.9 million and a 1.8 million chemical synthesis set from NextMove Software adds up to just under 40 million CIDs out of the total of 111 million.
The “junk yard” aspects arise from the following caveats. Automated extraction is lower in quality than expert curation and includes many erroneous structures produced when IUPAC names are split by poor document OCR. The “treasure trove” of exemplified structures with SAR amounts to only ~5 million, which calls into question the IP and scientific value of the remaining 35 million “junk yard” structures. Common chemistry is extensively over-mapped to thousands of documents (e.g. aspirin, CID 2244, has 410,666 patent numbers in PubChem). In addition, the major sources are highly discordant in the chemistry they extract from nominally the same document corpus.
This updated survey of PubChem highlights both the “treasure trove” and the “junk yard” aspects, as well as the challenge of discriminating between the two.
References:
Opening up connectivity between documents, structures and bioactivity. PMID: 32280387
Examples of SAR-centric patent mining using open resources. https://www.research.ed.ac.uk/en/publications/examples-of-sar-centric-patent-mining-using-open-resources
Expanding opportunities for mining bioactive chemistry from patents. PMID: 26194581
20 years of compound-to-target output from literature and patents. PMID: 24204758