Publications
-
Measuring Faithfulness in Multi-Stage Content Extraction Pipelines
S. Rask, M. Chow, S. Okonkwo. SIGIR 2026 (to appear).
We propose a framework for evaluating how faithfully content is preserved across multi-stage extraction pipelines. We find that each processing stage introduces measurable content drift, with summarization stages contributing the most variance. Our benchmark dataset of 2,400 web pages with ground-truth annotations is available under CC BY 4.0.
-
Content Drift in Automated Web Summarization: A Large-Scale Analysis
M. Chow, S. Rask. Preprint, January 2026.
Correction (February 2026): An earlier version of this paper reported baseline Semantic Fidelity Scores for six commercial summarization systems. Due to a normalization error in our evaluation pipeline, scores for two of the six systems were inflated by approximately 12 percentage points. The corrected results, available in the updated preprint (v3), show that these systems perform below the faithfulness threshold we originally reported. We regret the error and have updated all downstream analyses accordingly.
We analyze 15,000 web pages processed through six commercial summarization systems and measure content drift using a novel metric we call the Semantic Fidelity Score (SFS). We find that summaries consistently overrepresent certain content types (statistics, named entities) while underrepresenting others (caveats, conditional language). Systems that process content through multiple intermediate stages show higher drift than single-stage summarizers.
-
Towards Transparent Content Pipelines: A Position Paper
S. Rask. Workshop on Trustworthy Information Processing, NeurIPS 2025.
This position paper argues that content processing pipelines should be subject to the same transparency standards as the systems that consume their output. When an AI assistant summarizes a web page, users rarely know how many processing stages the content passed through, what transformations were applied, or what information was lost. We propose a set of principles for pipeline transparency and outline open research questions.