Source Attribution: Recovering the Press Releases Behind Health Science News
We explore the task of intrinsic source attribution: inferring which portions of a derived document were adapted from an unobserved source document. Specifically, we model the relationship between news articles and their press release sources using a dataset of 64,784 health science news articles and 23,068 press releases. We approach the problem at the sentence level and work with science journalism professors to develop a four-point Likert scale describing the extent to which a news article sentence is derived from the content of the corresponding press release. Because manually annotating pairs of news articles and press releases is time-consuming, we label our dataset with a mix of expert, non-expert, and heuristic-based annotation. A small pilot study found that humans who can view only the text of the news article struggle to identify which content is derived. We then compare four sentence regression models on the task. We find that modeling a sentence's context within the entire document is important: the best-performing model, a sequence regression model with BERT token representations, achieves a Spearman's ρ of 0.49 and an NDCG@1 of 0.60 on the expert-labeled test set. Examining the model's predictions, we find that it successfully identifies copied or closely paraphrased sentences in articles with a mix of derived and original content, but struggles to differentiate between loosely paraphrased and original sentences in articles with mostly original writing.
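As a rough illustration of the two reported metrics, the following sketch computes Spearman's ρ and NDCG@1 over the sentences of a single article. The 0-3 numeric coding of the four-point scale, the linear gain in NDCG, the per-article computation, and the example labels and scores are all assumptions made for illustration; none of them is taken from the evaluation code behind the reported numbers.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(labels, scores):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(labels), average_ranks(scores)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / math.sqrt(var_x * var_y)


def ndcg_at_k(labels, scores, k=1):
    """NDCG@k with linear gain (an assumption) and a log2 position discount."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

    # Gold labels reordered by descending model score.
    by_score = [labels[i] for i in sorted(range(len(scores)), key=lambda i: -scores[i])]
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(by_score) / ideal if ideal > 0 else 0.0


# Hypothetical article: one gold label (0-3) and one model score per sentence.
labels = [3, 0, 1, 2, 0]
scores = [2.1, 0.4, 1.7, 2.6, 0.2]
rho = spearman_rho(labels, scores)
ndcg1 = ndcg_at_k(labels, scores, k=1)
```

In this toy case the model's top-scored sentence has a gold label of 2 while the most derived sentence has a label of 3, so NDCG@1 is 2/3 even though the overall rank correlation is high. This shows why the two metrics can diverge: Spearman's ρ rewards agreement across the whole ordering, whereas NDCG@1 looks only at the top-ranked sentence.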