Lisa Poggel. ‘Provenance-by-Design for Humanities Databases’. In ‘Data’, ed. Dîlan Canan Çakir. Articulations (August 2025): https://articulations.temporal-communities.de.

Abstract

This Insight uses gender data as a case study to examine practices of representing provenance information in digital humanities databases. It argues that, in the development of FAIR-compliant database systems, the emphasis on reusability and interoperability should not overshadow interpretability. This Insight suggests that a “provenance-by-design” approach should guide database design in the digital humanities. This principle, recently proposed by Luc Moreau, advocates modelling provenance from the outset, making it the foundation of a system’s data model and logic rather than incorporating it retrospectively as an afterthought.

Laura Mandell begins their much-cited essay ‘Gender and Cultural Analytics: Finding or Making Stereotypes?’ by contrasting the critical potential of the emerging field of cultural analytics—and its primary platform, the Journal of Cultural Analytics—with ‘data miners’, who uncritically conflate sex and gender in their research, thereby (re)producing, rather than discovering, stereotypes in their data Full reference in Zotero Library. Despite this critique, Mandell cautions against disengagement from the field, proposing instead that gender analysis be treated as a ‘test case for cultural analytics, allowing the field to show how self-critical and historically reflexive it can be’ Full reference in Zotero Library. Taking Mandell’s argument at face value, articles on gender in the Journal of Cultural Analytics provide a productive starting point for examining how critically gender data and its provenance are addressed in the digital humanities.

In the nearly ten years since its inception, the Journal of Cultural Analytics has published nineteen articles engaging in critical analyses of gender (as of February 2025, based on a keyword search for “gender”). Five of these articles reuse datasets or integrate data, including gender-related data, from external databases. These five articles demonstrate an interesting and puzzling pattern: while they engage in detailed, critical discussions of gender as a concept, a data category, and a dimension shaped by researchers’ positionality, they provide little to no discussion of the provenance of the gender data they reuse. Only one article specifies that, in one of the source databases, gender information was inferred from names and titles. Beyond this brief mention, provenance tends to be limited to identifying the database from which the data was drawn and describing the technical workflow for data acquisition and processing. How gender was originally recorded in these databases is left unexamined.

As such, the available information on provenance reveals nothing about the nature of the “gender” recorded in the original databases. Was gender assumed based on names or pronouns? Was it obtained from historical documents that may have recorded “sex” rather than gender? Or does it represent the actual, self-identified gender of the recorded individual? These questions remain unanswered. While all five articles ostensibly engage with “gender”, the nature of what is actually being represented varies substantially. One article, for example, deals with sex as defined in a nineteenth-century source Full reference in Zotero Library; another combines gendered names and data from an external authority file with heterogeneous and unclear provenance Full reference in Zotero Library; a third analysis uses gender statements from Wikidata whose provenance also varies Full reference in Zotero Library. The remaining two articles integrate gender data from databases with gated access, where provenance is ambiguous or unavailable (Full reference in Zotero Library; Full reference in Zotero Library).

This lack of critical reflection on gender data contrasts sharply with the authors’ otherwise nuanced engagement with the subject of gender. However, an examination of the databases from which gender data is sourced suggests that this oversight may be structural rather than individual. In many cases, information related to gender data is not transparently documented: it is buried deep within documentation pages, ambiguously worded, difficult to query, or described only in general terms. Rather than arguing that cultural analytics has once again failed, or reproaching individual authors for their apparent oversight, this Insight argues that if even those committed to examining gender critically can overlook the provenance of gender data, then there must be something fundamentally flawed in the design and documentation of the databases and, more broadly, the database systems we use and build in the digital humanities.¹

The why and how of humanities data provenance

How can the problem be properly identified? Practices of database design, the representation of research data provenance, and the modelling of gender have all been widely debated in the field of digital humanities. Numerous approaches have been proposed for developing critical, contextualised representations of both gender data and data provenance (for example, Full reference in Zotero Library, Full reference in Zotero Library; Full reference in Zotero Library; Full reference in Zotero Library, Full reference in Zotero Library, on gender; Full reference in Zotero Library; Full reference in Zotero Library; Full reference in Zotero Library; Full reference in Zotero Library on provenance). Critical interventions have also cautioned against visualisation, database application, and interface design practices that may mislead interpretation, obscure context, or amplify bias (Full reference in Zotero Library; Full reference in Zotero Library; Full reference in Zotero Library; Full reference in Zotero Library, Full reference in Zotero Library; Full reference in Zotero Library; Full reference in Zotero Library).

Many databases do, of course, record some sort of provenance for gender data, indicating, for example, that specific data originates from an authority file such as the Gemeinsame Normdatei (GND) or Wikidata; that spellings have been normalised; that gender has been inferred from names; or that a particular reference supports a given entry for gender. However, the databases from which gender data was reused in the contributions to the Journal of Cultural Analytics highlight two issues. First, despite varying provenance and methods of collection, these databases typically employ a single, undifferentiated category “gender”. Second, the provenance described in the research articles, as well as that recorded in the source databases, often does not provide the information necessary for interpreting gender data meaningfully. The recorded provenance tends to be insufficiently granular: while it may outline technical aspects of the data collection workflow, it rarely clarifies what the “gender” category signifies, whether it was self-identified, assumed, inferred from context, or recorded by a historical or contemporary agent. The issue, then, is not simply that databases fail to record provenance; rather, it is that they often record the wrong kind of provenance for the purpose of interpreting gender data.

Provenance information serves various purposes, and the type of information recorded directly reflects the purpose for which it is collected. The term “provenance” originates in art history, where it refers to ‘a record of the history of ownership of a piece of art’ Full reference in Zotero Library. Over the past few decades, the term “data provenance” has emerged across numerous disciplines to denote provenance information related to digital data, albeit with varying meanings. In database theory, data provenance is defined narrowly as a ‘record of how the data item was derived from other data items by a set of transformations’ Full reference in Zotero Library, explaining ‘how the result of an operation was derived from its inputs’ Full reference in Zotero Library. In data science and machine learning, data provenance typically refers to records that describe the origin and processing of data—often the origins of training datasets for language models—and is seen as a precondition for responsible (in other words, fair, accountable, transparent, and explainable) AI systems (Full reference in Zotero Library; Full reference in Zotero Library; Full reference in Zotero Library; Full reference in Zotero Library). In the experimental sciences, data provenance denotes ‘metadata recording the process of experiment workflows, annotations, and notes about experiments’, thereby supporting the reproducibility and trustworthiness of experimentation results Full reference in Zotero Library. In research data management, the term is understood more broadly as a record of the origin of research data and the processes and methods used to produce it, answering the where, when, why, by whom, and how of research data creation Full reference in Zotero Library.

Recording provenance thus serves distinct purposes across different fields, and the type and granularity of provenance information recorded depend directly on the intended functions of the data. For instance, to explain the outcome of an SQL query, information about how the result was derived from tuples in the database is required; to assess the trustworthiness of data, a record of access and modification has to be consulted; and to ensure the reproducibility of research results, detailed documentation of the workflows and the software tools is necessary.

In an attempt to bridge disciplinary perspectives on provenance, Melanie Herschel, Ralf Diestelkämper, and Houssem Ben Lahmar define a “provenance hierarchy” that distinguishes types of provenance by granularity: “provenance” refers to ‘any information describing the production process of an end product, which can be anything from a piece of digital data to a physical object’; “provenance meta-data” describes ‘an arbitrary production process using an arbitrary data model and model of computation’; “information system provenance” is meta-data collected for ‘processes producing digital data within information systems’, referring to processes ‘that can be computed based on the input, the output, and the parameters of the process’; “workflow provenance” ‘specializes information system provenance by further restricting the type of production processes to so-called workflows’; and “data provenance” tracks ‘the processing of individual data items (e.g., tuples) at the “highest resolution”, i.e., the provenance itself is at the level of individual data items (and the operations they undergo)’ Full reference in Zotero Library. The authors identify three key purposes for recording provenance information: assessing data quality, ensuring understandability, and facilitating reusability of data Full reference in Zotero Library.

Herschel, Diestelkämper, and Ben Lahmar’s provenance hierarchy is useful for probing digital humanities databases and institutions for their understanding of provenance and its purpose. Organisations that set standards for research data management in the digital humanities—such as the FAIR data consortium and, in Germany, the national research infrastructure association NFDI—describe interoperability and reusability as some of the fundamental purposes for recording research data provenance.² The FAIR data consortium subsumes data provenance under the reusability aspect and describes it as a precondition for reusability: FAIR data principle R1.2, ‘(Meta)data are associated with detailed provenance’, states that ‘[f]or others to reuse your data, they should know where the data came from’, and advises collecting workflow provenance ‘in machine-readable format’ to ensure reusability Full reference in Zotero Library. Similarly, the ‘Charta of the Cookbooks, Guidance and Best Practices Working Group’ within the Sektion (Meta)daten, Terminologien, Provenienz of the NFDI argues that a ‘common understanding of (meta)data, terminology, provenance and related sub-concepts’ is needed to ‘create or reuse subject- and application-specific metadata that is at the same time semantically rich, machine-actionable and interoperable, and to interlink data (i.e. FAIR data)’, again highlighting the reusability and interoperability aspects of FAIR data Full reference in Zotero Library.

A similar emphasis on reproducibility appears in the working definition for provenance on the web by the Provenance Incubator Group. The group, made up of members of the World Wide Web Consortium, initiated the standardisation process for provenance in Semantic Web technologies that ultimately led to the development of the PROV data model, a W3C standard for representing provenance on the web. The group defines the provenance of a resource on the web as ‘a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility’ Full reference in Zotero Library.

Perhaps this focus on reusability and interoperability is natural for institutions dealing with research infrastructure and a working group dedicated to linking data on the web. It is also conceivable that the interdisciplinary nature of these institutions explains why they prioritise this understanding of provenance over that of the experimental sciences. Regardless of the reason, the focus on workflow provenance, interoperability, and reusability in the sense of reproducibility seems to have found its way into database design in the digital humanities. Interoperability and reusability are increasingly encouraged by research funding organisations Full reference in Zotero Library. Consequently, to conform to FAIR data standards and best practices, many databases are designed for reusability and interoperability rather than understandability and interpretability Full reference in Zotero Library.

The focus on reusability shifts the priority toward workflow provenance rather than other types of provenance. Ensuring reproducibility necessitates the documentation of workflow provenance, and many database systems do record data transformation workflows as well as data sources. However, as the examples in the Journal of Cultural Analytics demonstrate, this type of provenance does not necessarily facilitate a clear understanding or interpretation of the data it describes. One article, for instance, reuses gender data from Wikidata and details how it was retrieved using the Wikidata SPARQL query service Full reference in Zotero Library. However, provenance information in Wikidata is highly heterogeneous and incomplete: references are used to document data provenance, but the references attached to claims about the gender of an entity may point, somewhat arbitrarily, to a source in which the recorded individual explicitly states their gender identity, or to a source in which gender was inferred from names; sometimes, no reference is provided at all. In such contexts, workflow provenance (for example, that gender was imported from Wikidata using the SPARQL query service) is insufficient, and provenance meta-data (such as whether gender is self-identified or was recorded by a US government body in the nineteenth century) is preferable if gender data sourced from Wikidata is to be appropriately interpreted.
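The gap between workflow provenance and provenance meta-data can be made concrete. The sketch below (Python) shows a SPARQL query against Wikidata’s statement model, in which references attach to a gender claim via prov:wasDerivedFrom, together with a small helper that separates referenced from unreferenced claims. The result rows are hypothetical stand-ins for a live response from the query service; the point is that even a referenced claim reveals nothing about *what kind* of provenance the source provides.

```python
# Sketch: retrieving gender claims from Wikidata together with their
# reference nodes. Wikidata exposes each claim as a statement node
# (p:P21 -> ps:P21), and references attach to that node via
# prov:wasDerivedFrom. The query is illustrative; the rows below are
# hypothetical stand-ins for a live SPARQL response.

GENDER_PROVENANCE_QUERY = """
SELECT ?person ?gender ?ref WHERE {
  ?person p:P21 ?stmt .
  ?stmt ps:P21 ?gender .
  OPTIONAL { ?stmt prov:wasDerivedFrom ?ref . }
}
"""

def split_by_reference(rows):
    """Partition gender claims into referenced and unreferenced ones.

    Even for the referenced claims, the reference node alone does not
    reveal whether the source records self-identification, an inference
    from a name, or an entry made by an official body.
    """
    referenced, unreferenced = [], []
    for row in rows:
        (referenced if row.get("ref") else unreferenced).append(row)
    return {"referenced": referenced, "unreferenced": unreferenced}

# Hypothetical rows mimicking the shape of SPARQL JSON bindings
# (Q6581072 and Q6581097 are Wikidata's items for "female" and "male").
sample_rows = [
    {"person": "wd:Q1", "gender": "wd:Q6581072", "ref": "node:ref1"},
    {"person": "wd:Q2", "gender": "wd:Q6581097", "ref": None},
]
buckets = split_by_reference(sample_rows)
print(len(buckets["referenced"]), len(buckets["unreferenced"]))  # 1 1
```

The helper deliberately stops at the level the data itself supports: it can say *that* a claim is referenced, but not what the reference means, which is precisely the interpretative gap described above.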

The desire to follow best practices and comply with FAIR standards is, of course, not the only consideration influencing database design choices. Database systems may themselves impose constraints on knowledge organisation, reflecting the uses and priorities for which they were originally developed Full reference in Zotero Library. Moreover, many projects lack the resources to build or adapt systems to their specific needs. Nonetheless, the unsuitability of some commonly recorded forms of provenance information for interpreting gender data appears, at least in part, to stem from the fact that reusability is often prioritised in FAIR database designs. The prioritisation of interoperability, meanwhile, introduces an additional, related issue.

Simplicity is not an option

Interoperability is often achievable only if data is ‘simplified before it is aggregated’ Full reference in Zotero Library. When gender information is integrated from various databases, it may prove challenging to map different gender representations, and frequently, the lowest common denominator (typically “male”, “female”, “other”) is selected to ensure interoperability with external databases that offer limited expressiveness in terms of gender representation (for example, when mapping gender categories from the Homosaurus linked data vocabulary of LGBTQ+ terms to the GND data model). Technical, temporal, and other constraints may further incentivise the adoption of reductive representations of gender, so as not to complicate the data model or increase maintenance overhead. However, achieving interoperability through simplified representations is a problematic practice because it encourages reductive interpretation and implies conceptual or ontological stability where critical deconstruction is required Full reference in Zotero Library.
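The lossiness of lowest-common-denominator mapping can be sketched in a few lines. The source labels below are illustrative stand-ins, not actual Homosaurus or GND terms; the point is that distinct source categories become indistinguishable once collapsed into a three-value target model.

```python
# Sketch of lowest-common-denominator mapping: a richer source vocabulary
# is collapsed into a three-value target model. The labels are
# illustrative, not actual Homosaurus or GND identifiers.

SOURCE_TO_TARGET = {
    "cisgender woman": "female",
    "transgender woman": "female",   # the distinction is lost in the target
    "nonbinary person": "other",
    "genderfluid person": "other",   # collapsed onto the same residual bin
}

def map_gender(source_term):
    """Map a source term onto the target model, defaulting to 'other'."""
    return SOURCE_TO_TARGET.get(source_term, "other")

# Conceptually distinct source terms become indistinguishable after mapping,
# and the mapping cannot be reversed:
assert map_gender("nonbinary person") == map_gender("genderfluid person")
print(sorted(set(SOURCE_TO_TARGET.values())))  # ['female', 'other']
```

Because the mapping is many-to-one, no downstream consumer of the target model can recover which source category a value came from; this is exactly the kind of simplification the paragraph above describes as a barrier to critical deconstruction.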

In addition, while such practices make data syntactically interoperable, they do not achieve full interoperability, which must encompass syntactic, structural, and semantic dimensions alike Full reference in Zotero Library. Similarly, limiting the interpretability of data ultimately impedes its reusability Full reference in Zotero Library. Implementing interoperability and reusability on a technical level therefore does not guarantee that the data is simultaneously interoperable ‘on an epistemic level’ Full reference in Zotero Library. This raises an important question: is (syntactic) interoperability so crucial that it justifies sacrificing complexity and, as a result, interpretability of data, or should our databases instead be designed to foster critical engagement with data? Rather than promoting broad reuse by forcing heterogeneous data from multiple sources into uniform categories, we might instead prioritise systems that encourage nuanced interpretation, thereby eschewing practices of integration that are ‘held together by a thin red thread of potential narrative cohesion’ Full reference in Zotero Library.

In many fields, software development is guided by the principle that elements deemed critical to a system should be embedded in its design from the outset and implemented as the default configuration for users, rather than incorporated retrospectively. The ‘security by default’ principle in information security and the ‘privacy by default and by design’ principle in the General Data Protection Regulation (GDPR) are prominent examples (Full reference in Zotero Library; Full reference in Zotero Library, on security by design). More recently, data scientists have called for ‘explainability by design’ in designing AI systems Full reference in Zotero Library. This principle has also found its way into provenance research, notably in a paper by Luc Moreau for the ACM Web Conference 2023, which argues that environmental management systems should be built according to a ‘provenance-by-design’ approach. This approach to designing data-centric applications, such as Environmental Management Systems (EMS), ‘makes provenance an integral part of an application’s datamodel design, and uses a toolkit (ProvToolbox) Full reference in Zotero Library to generate code automatically from the design’ Full reference in Zotero Library. Moreau describes provenance-by-design as a ‘provenance first’ approach because ‘provenance is designed first, before the business logic’ Full reference in Zotero Library. Designing systems with privacy, security, or provenance at the forefront signals that certain requirements (for example, ensuring user data is handled according to GDPR regulations) are too crucial to be relegated to afterthoughts.

Provenance-by-design is a useful framework for rethinking priorities in the development of humanities databases and database systems. Its potential can be illustrated by returning to the contributions to the Journal of Cultural Analytics. One article, for example, reuses gender data from a database that infers gender from names and titles Full reference in Zotero Library. In some cases, inferring gender from names may be the only available method for making gender, as a social category of difference, visible in a dataset. This Insight does not argue that it is always better to exclude such gender data. However, there is a fundamental difference in how such data may be interpreted in ‘provenance-first’ databases, as opposed to what may be called ‘provenance-last’ databases.

In a ‘provenance-last’ database, data provenance is external to the core data model. In the database cited, provenance is provided as part of the database documentation, which is clearly not part of the data model. A database can also be ‘provenance-last’ in a less obvious way: for example, Wikidata models gender using the infamous property P21 “sex or gender” and allows references to be provided for each claim about an entity’s gender, yet such provenance information merely allows verification that person X really “has” gender Y.³ If, instead of providing references, the same database provided workflow provenance that described which tools were used to infer gender from names, it might enable users to assess how reliable and trustworthy the assumption that person X “has” gender Y is. And yet, this would still constitute a ‘provenance-last’ approach. Such workflow provenance merely validates that the process for creating a simple entry “woman” was technically sound and reproducible; it does not necessarily support nuanced interpretation.

In this sample database, both data provenance and workflow provenance reify gender as a category by representing workflows as steps that, in the end, all lead to the same kind of data: “gender” data. The generic use of the “gender” category suggests interoperability with other databases that also employ the same category, even when the underlying meanings differ. In a ‘provenance-last’ database, differing provenance of gender data seemingly has little impact on interoperability, because provenance only serves to assess the factual accuracy of the claim that person X “has” gender Y. Yet this assertion is itself problematic: gender is not a simple category that is reducible to “person X is of type Y”, or “person X has gender Y”. It is an inherently relational and contextual category Full reference in Zotero Library. Some sources, therefore, simply cannot substantiate the claim that “person X actually has gender Y”. Inferring gender from names can only ever say that the name of person X may be gendered in a given historical and cultural context. Similarly, inferring gender from the category “sex” recorded in a historical source presupposes that sex and gender are always congruent and that the categories recorded cover the full diversity of possible gender categories. Yet few would argue that non-binary people suddenly cease to exist just because a government (here, US agencies) no longer records a non-binary gender Full reference in Zotero Library.

The issue is not only political; it calls into question the epistemological foundations of the humanities. Adding a “gender” category to the data model first, and then adding sources or details of a workflow later to substantiate that somebody probably has gender Y, is comparable to a teleological approach to history, in which historical records are searched for modern categories rather than analysed in their context to identify source categories relevant to a particular historical period. This approach would not be deemed acceptable in a humanities research paper, so why should it be acceptable in humanities databases?

Provenance is foundational to humanities data, and it cannot be appended to a humanities database as an afterthought, whether through external documentation or references marginal to the data model. The meaning of a category is constituted by its provenance. Accordingly, the type of provenance recorded in a humanities database should be selected on the basis of its ability to aid interpretation first, and interoperability second. This type of provenance should have a central place in the data model. Provenance-by-design makes this its central claim: provenance information is too fundamental to be treated as an afterthought.

Toward provenance-driven database design

What might a ‘provenance-by-design’ database look like in the humanities? Moreau outlines a concrete development workflow for provenance-by-design EMS applications: PROV templates are used to define provenance requirements, and the software toolkit ProvToolbox is used to convert these templates and generate Java code for the core functionality of tracking and managing provenance information: ‘i) the storage of data and provenance in the backend, ii) the querying of provenance, iii) the navigating of data according to provenance, and iv) the building of simple user interfaces for developers’ Full reference in Zotero Library. The business logic can thereby be integrated into the provenance logic, and not the other way around. While this specific workflow may not be adaptable to many humanities database systems, provenance-by-design can be adopted as a more general approach to designing databases and database systems in which the provenance model and all provenance-related logic are designed from the outset. As an approach, provenance-by-design can take multiple forms. It starts when researchers, database architects, and research software developers collaboratively conceptualise a data model. From the outset, data should be modelled in a way that allows querying and retrieving it by its provenance, and the database system should support doing so efficiently. Views of the data should always include provenance by default rather than by request, and the user interface should encourage researchers to explore, analyse, and visualise data through its provenance.
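The general idea of a provenance-first data model can be sketched without Moreau’s specific PROV/ProvToolbox toolchain. In the following plain-Python sketch (all names, such as ProvenanceKind and GenderStatement, are hypothetical), a gender value cannot be created without a statement of how it came about, data can be queried through its provenance, and the default view always includes provenance.

```python
# Sketch of a "provenance-first" data model in plain Python: no bare
# gender values exist, querying by provenance is a first-class operation,
# and views include provenance by default. All names are hypothetical;
# this is not the ProvToolbox workflow Moreau describes, only the idea.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ProvenanceKind(Enum):
    SELF_IDENTIFIED = "self-identified"
    INFERRED_FROM_NAME = "inferred from name"
    RECORDED_IN_SOURCE = "recorded in historical source"

@dataclass(frozen=True)
class GenderStatement:
    value: str                  # e.g. "female", "nonbinary"
    provenance: ProvenanceKind  # required: the field cannot be omitted
    source: Optional[str] = None  # citation or tool that produced the value

@dataclass(frozen=True)
class Person:
    name: str
    gender: GenderStatement

def by_provenance(people, kind):
    """Query the data *through* its provenance rather than around it."""
    return [p for p in people if p.gender.provenance is kind]

def default_view(person):
    """Views include provenance by default, not on request."""
    g = person.gender
    return f"{person.name}: {g.value} ({g.provenance.value})"

people = [
    Person("A", GenderStatement("female", ProvenanceKind.INFERRED_FROM_NAME,
                                "name-based inference tool")),
    Person("B", GenderStatement("nonbinary", ProvenanceKind.SELF_IDENTIFIED,
                                "interview, 2021")),
]
print(default_view(people[0]))  # A: female (inferred from name)
```

Because the provenance field is required by the type itself, the business logic is forced to go through the provenance model rather than around it, which is the inversion the provenance-first approach calls for.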

Vancisin et al. Full reference in Zotero Library illustrate what a provenance-driven interface may look like. In larger projects, provenance-by-design can mean developing a database system as a “living archive” following the principles of the ResearchSpace project (outlined in Full reference in Zotero Library; for similar ideas, see Full reference in Zotero Library; Full reference in Zotero Library). The ResearchSpace platform uses a linked data representation of CRMinf, a specialisation of the CIDOC-CRM ontology, to model dynamic, user-generated interpretations of the data as well as the provenance of data retrieved from external sources, thus shifting the perspective from facts to arguments, narratives, and assumptions. A modelling strategy specifically tailored to gender and other identities has been proposed by the Orlando Project, which uses a combination of CIDOC-CRM classes to model identities as cultural forms (see Full reference in Zotero Library; Full reference in Zotero Library). Other data models that allow embedding (gender) data in its interpretative context are the Historical Context Ontology Full reference in Zotero Library, the Persons in Context Model Full reference in Zotero Library and the Records in Contexts Ontology (ICA EGAD 2023). The former two borrow from the PROV ontology. These modelling strategies can, however, become quite verbose, which may not always be practical.
Projects using linked data technologies may alternatively choose a vocabulary with greater expressiveness, such as the GSSO ontology, which allows differentiating between gender identity, assumed gender and recorded gender (see GSSO 2022), or adopt alternative terms such as “gender marker” to refer to gender stated on official documentation (for example, Homosaurus v4, gender marker). In low-resource settings, a simple linguistic marker in the data model and on the interface of a database application (for example, differentiating between assumed gender, gender according to name, recorded gender, or self-identified gender) is already enough to make it difficult for users to overlook the provenance of gender data and to discourage simplification when the data is integrated with other datasets.⁴
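Note 4 points to a sample implementation in an SQL database; what follows is not that implementation but an independent, minimal sketch using Python’s built-in sqlite3 module. The schema itself refuses an undifferentiated “gender” column: every value must be qualified with a provenance kind (enforced by a CHECK constraint) and may carry a source.

```python
# Minimal sketch (sqlite3, standard library): every gender value is
# qualified with its provenance kind and source, so no undifferentiated
# "gender" column exists. Column and table names are illustrative, not
# taken from the implementation cited in note 4.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person_gender (
        person_id   INTEGER NOT NULL,
        gender      TEXT    NOT NULL,
        provenance  TEXT    NOT NULL CHECK (provenance IN
            ('self-identified', 'assumed', 'inferred-from-name', 'recorded')),
        source      TEXT              -- citation, tool, or document
    )
""")
conn.executemany(
    "INSERT INTO person_gender VALUES (?, ?, ?, ?)",
    [
        (1, "female", "inferred-from-name", "name-matching tool"),
        (2, "male", "recorded", "1850 US census"),
        (3, "nonbinary", "self-identified", "interview, 2021"),
    ],
)

# Querying by provenance is a first-class operation:
rows = conn.execute(
    "SELECT person_id, gender, source FROM person_gender "
    "WHERE provenance = 'self-identified'"
).fetchall()
print(rows)  # [(3, 'nonbinary', 'interview, 2021')]
```

The CHECK constraint plays the role of the “simple linguistic marker” described above: a record with an unqualified or unknown provenance kind is rejected at insert time, so the distinction cannot silently disappear when the data is exported or integrated.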

Dividing a unified “gender” category into subcategories such as “assumed”, “recorded” and “self-identified” fundamentally changes interpretative possibilities. Provenance becomes integrated into the core data model and no longer serves merely to assess the probability of a statement being factually accurate. Instead, it guides interpretation by indicating that a database that records assumed gender and one that records self-identified gender represent conceptually distinct kinds of information.

The adoption of the GDPR’s ‘privacy by design’ and Moreau’s ‘provenance-by-design’ in EMS applications is motivated by legal requirements. A database containing humanities data may not face legal pressure to “do things right” when it comes to representing gender data. However, if we intend to use our databases for any form of cultural analytics, provenance-by-design should not be considered optional. If provenance determines the meaning of a category, then the representation of humanities data must be built upon a provenance model, and not the other way around.

Notes

  • 1
    Database is defined here as a logically coherent collection of related data, created for a certain purpose, and intended to represent a certain perspective on a part of the real world. This definition follows Full reference in Zotero Library. A database system, in the narrow sense, refers to a database together with a database management system (DBMS) such as MySQL or MongoDB that supports the definition, construction, manipulation, and sharing of the database Full reference in Zotero Library. However, in the humanities, databases are embedded in a wide variety of systems—from large-scale museum catalogues with complex search and retrieval functionality, to local MySQL instances, to individual JSON or CSV files served to a static website hosted on GitHub. In the absence of a more suitable umbrella term to describe systems of such varying scope and complexity, I employ the term “database system” in a broader sense to refer to a database along with some form of DBMS or data-handling logic, possibly complemented by application and business logic (for example, implementing access control and user interface). The term data model is likewise used in a broader sense than in formal database theory to denote the logical structure or representation of data used within a system.
  • 2
    FAIR is an acronym that stands for Findable, Accessible, Interoperable, Re-usable. It was introduced in 2016 in a contribution to the journal Scientific Data to denote a set of principles designed to guide scientific data management and promote machine-actionability Full reference in Zotero Library. The principles have since been widely endorsed by many scientific communities.
  • 3
    The many problems associated with Wikidata’s gender model are beyond the scope of this Insight. For a detailed account see, among others, Full reference in Zotero Library.
  • 4
    For a sample implementation in an SQL database, see Full reference in Zotero Library; for an implementation in a Wikibase context, see Full reference in Zotero Library; for an XML-TEI tag set that allows qualifying gender values via attributes, see Full reference in Zotero Library.

Selected Bibliography
