Supercharge your AI with our unique datasets

Use Curation datasets to improve your machine learning models, even if you’re new to AI. We are releasing a series of datasets, available both as open source and commercially, that will empower your data science teams to create best-in-class NLP applications.

Abstractive summaries

We pride ourselves on delivering the news in a bite-size, digestible format. That’s why we have a professional team of writers hand-crafting summaries of the latest news stories for our clients, focussing on just the news that matters to you.

Curation has released 40,000 of these summaries for free, along with links to the reference articles, so that the AI community can build and deploy their own solutions using Curation data.

Why use Curation data?

When creating a machine learning model that summarises one or many news articles, documents, or reports, you'll need to feed it lots of training data. Existing public datasets are often tricky to come by and dubiously licensed. Crucially, they do not contain human-written, paraphrased summaries intended to be read on their own as cogent abstracts. Producing your own data is expensive and requires infrastructure to scale.

With Curation data, you can fine-tune your NLP models using a dataset designed for the task from the ground up. Check out how our release stacks up against existing publicly available datasets below.

| Dataset | Documents | License | Avg. summary length (words) | Avg. document length (words) | Avg. summary length (sentences) | Avg. document length (sentences) | Type |
|---|---|---|---|---|---|---|---|
| Curation Base | 40,000 | CC-BY | 82.6 | 527.9 | 4.9 | 27.4 | Professionally written and edited standalone summary intended to be understood by itself |
| CNN | 90,266 | N/A | 45.7 | 760.5 | 3.59 | 34 | Implied by “summary” box |
| DailyMail | 196,961 | N/A | 54.7 | 653.3 | 3.86 | 29.3 | Implied by bullets below headline |
| NYT | 110,540 | Non-commercial | 45.5 | 800 | 2.44 | 35.6 | Abstractive summary |
| XSum | 276,711 | N/A | 23.3 | 431 | 1 | 19.7 | Single sentence answering “what is this article about?” |
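The word and sentence averages in the comparison above are simple to reproduce on your own data. The snippet below is a minimal sketch, not the method used to produce the published figures: the function names are illustrative, words are counted by splitting on whitespace, and sentences are counted with a naive punctuation rule, so results may differ slightly from a proper tokenizer.

```python
import re
from statistics import mean


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (an approximation of word count)."""
    return len(text.split())


def sentence_count(text: str) -> int:
    """Naively count sentences by splitting after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def length_stats(texts):
    """Return average length in words and in sentences for a list of texts."""
    return {
        "avg_words": mean(word_count(t) for t in texts),
        "avg_sentences": mean(sentence_count(t) for t in texts),
    }


# Made-up example summaries, not taken from the dataset.
summaries = [
    "Markets rose on Monday. Analysts credited strong earnings.",
    "The council approved the new budget.",
]
stats = length_stats(summaries)
print(stats)
```

Running the same function over both the summaries and the source articles gives the four averages reported per dataset.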
Get started
Step 1
Step 2
Download Curation Corpus

(28 MB CSV, also available as JSON or binary; contains 40,000 abstracts and URLs)

Step 3
Follow the instructions in the README, run the article downloader, and get building!
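Once you have the CSV, reading it could look something like the sketch below. The column names `url` and `summary` are assumptions made for illustration, so check the header row of the downloaded file and the README for the real ones. The sample data here is an in-memory stand-in, not actual dataset content.

```python
import csv
import io

# Hypothetical column names; replace with the headers in the real CSV.
URL_COLUMN = "url"
SUMMARY_COLUMN = "summary"


def read_summaries(csv_file):
    """Read (url, summary) pairs from an open CSV file, skipping incomplete rows."""
    reader = csv.DictReader(csv_file)
    rows = []
    for row in reader:
        url = (row.get(URL_COLUMN) or "").strip()
        summary = (row.get(SUMMARY_COLUMN) or "").strip()
        if url and summary:
            rows.append({"url": url, "summary": summary})
    return rows


# In-memory stand-in for the downloaded file.
sample = io.StringIO(
    "url,summary\n"
    "https://example.com/a,First summary.\n"
    "https://example.com/b,\n"
)
records = read_summaries(sample)
print(len(records), "usable rows")
```

For the real file, pass a handle opened with `open(path, newline="", encoding="utf-8")` instead of the `StringIO` sample. The article text itself is not in the CSV; it comes from running the article downloader against the URLs, as described in the README.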
Want more?

Our open source dataset should get you off to a good start, but we have much more data to offer you! Curation Corpus Large is many times the size of our open-source offering, and growing every day. It's also what powers our internal summarisation technology.

We're looking to partner with industry leaders to offer commercial licenses for our full corpus. If you'd like to learn more, please enter your email below.