Use Curation datasets to improve your machine learning models, even if you’re new to AI. We are releasing a series of datasets, available both as open source and under commercial license, that will empower your data science teams to create best-in-class NLP applications.
We pride ourselves on delivering the news in a bite-size, digestible format. That’s why we have a professional team of writers hand-crafting summaries of the latest news stories for our clients, focussing just on the news that matters to you.
Curation has released 40,000 of these summaries for free, along with links to the reference articles, so that the AI community can build and deploy their own solutions using Curation data.
When creating a machine learning model that summarises one or many news articles, documents, or reports, you'll need to feed it lots of training data. Existing public datasets are often tricky to come by or dubiously licensed and, crucially, do not contain human-written, paraphrased summaries intended to be read on their own as cogent abstracts. Producing your own data is expensive and requires infrastructure to scale.
With Curation data, you can fine-tune your NLP models using a dataset designed for the task from the ground up. Check out how our release stacks up against existing publicly available datasets below.
| Dataset | Documents | License | Avg. summary length (words) | Avg. document length (words) | Avg. summary length (sentences) | Avg. document length (sentences) | Summary type |
|---|---|---|---|---|---|---|---|
| Curation Base | 40,000 | CC-BY | 82.6 | 527.9 | 4.9 | 27.4 | Professionally written and edited standalone summary intended to be understood by itself |
| CNN | 90,266 | N/A | 45.7 | 760.5 | 3.59 | 34 | Implied by “summary” box |
| DailyMail | 196,961 | N/A | 54.7 | 653.3 | 3.86 | 29.3 | Implied by bullets below headline |
| Xsum | 276,711 | N/A | 23.3 | 431 | 1 | 19.7 | Single sentence answering “what is this article about?” |
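If you'd like to run the same kind of comparison on your own data, here is a minimal Python sketch of how average word and sentence lengths could be computed for a set of document and summary pairs. It is an illustration only: the function names, the whitespace word count, the naive sentence splitter, and the toy examples are all assumptions, and the figures in the table may have been produced with a different tokenizer.

```python
import re
from statistics import mean

# Naive sentence splitter: breaks after ".", "!" or "?" followed by whitespace.
# This is an assumption for illustration; abbreviations such as "U.S." will
# be split incorrectly, and the table's figures may use another method.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def count_words(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())


def count_sentences(text: str) -> int:
    """Count sentences using the naive splitter above."""
    stripped = text.strip()
    if not stripped:
        return 0
    return len([s for s in _SENTENCE_END.split(stripped) if s])


def length_stats(pairs):
    """Return average lengths for a list of (document, summary) string pairs."""
    docs = [doc for doc, _ in pairs]
    summaries = [summary for _, summary in pairs]
    return {
        "avg_summary_words": mean(count_words(s) for s in summaries),
        "avg_document_words": mean(count_words(d) for d in docs),
        "avg_summary_sentences": mean(count_sentences(s) for s in summaries),
        "avg_document_sentences": mean(count_sentences(d) for d in docs),
    }


# Hypothetical toy data, not taken from any real dataset.
example_pairs = [
    ("The council met on Monday. It approved the budget. Work starts in May.",
     "The council approved the budget."),
    ("Shares rose sharply today. Analysts were surprised!",
     "Shares rose. Analysts were surprised."),
]
stats = length_stats(example_pairs)
```

Dividing the average summary length by the average document length gives a rough compression ratio, which is a handy way to compare how aggressively each dataset's summaries condense their source articles.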
Our open source dataset should get you off to a good start, but we have much more data to offer you! Curation Corpus Large is many times the size of our open-source offering, and growing every day. It's also what powers our internal summarisation technology.
We're looking to partner with industry leaders to offer commercial licenses for our full corpus. If you'd like to learn more, please enter your email below.