2.4 Adopting a Pay-as-you-go Approach
As discussed in Section 1, potential users of big data will not always have access to substantial budgets or teams of skilled data scientists to support manual data wrangling. As such, rather than depending upon a continuous, labor-intensive wrangling effort, we propose an incremental, pay-as-you-go approach, in which the “payment” can take different forms, so that resources can be deployed on data wrangling in a targeted and flexible way.
Providing a pay-as-you-go approach, with flexible kinds of payment, means automating all steps in the wrangling process, and allowing feedback in whatever form the user chooses. This requires a flexible architecture in which feedback is combined with other sources of evidence (see Section 2.3) to enable the best possible decisions to be made. Feedback of one type should be able to inform many different steps in the wrangling process – for example, the identification of several correct (or incorrect) results may inform both source selection and mapping generation. Although there has been significant work on incremental, pay-as-you-go approaches to data management, building on the dataspaces vision, typically this has used one or a few types of feedback to inform a single activity. As such, there is significant work to be done to provide a more integrated approach in which feedback can inform all steps of the wrangling process.
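As an illustration of one type of feedback informing more than one step, the following minimal sketch propagates a single correct/incorrect annotation on a result to both the source and the mapping that produced it. The class names, the provenance fields and the smoothed scoring rule are assumptions made for this sketch; they are not part of any existing system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """A wrangled result with hypothetical provenance fields."""
    result_id: str
    source: str   # data source the value came from
    mapping: str  # mapping that produced the result


class FeedbackStore:
    """Shares correct/incorrect annotations across wrangling steps."""

    def __init__(self):
        # [correct, incorrect] counts, keyed by (kind, name).
        self.counts = defaultdict(lambda: [0, 0])

    def record(self, result: Result, correct: bool) -> None:
        idx = 0 if correct else 1
        # One annotation informs both source selection and mapping generation.
        self.counts[("source", result.source)][idx] += 1
        self.counts[("mapping", result.mapping)][idx] += 1

    def score(self, kind: str, name: str) -> float:
        # Smoothed fraction of correct results (assumed scoring rule);
        # an artefact with no feedback scores 0.5.
        correct, incorrect = self.counts[(kind, name)]
        return (correct + 1) / (correct + incorrect + 2)


store = FeedbackStore()
store.record(Result("r1", "shopA", "m1"), correct=True)
store.record(Result("r2", "shopA", "m2"), correct=False)
store.record(Result("r3", "shopB", "m1"), correct=True)
```

Here three annotations on results change the scores of two sources and two mappings at once, so a source selection step and a mapping generation step could both consult the same store.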
2 http://www.productontology.org/
Example 5 (e-Commerce Pay-as-you-go). In Example 1, automated approaches to data wrangling can be used to select sources of product data, and to fuse the values from such sources to provide reports on the pricing of different products. These reports are studied by the data scientists of the e-Commerce company who are reviewing the pricing of competitors, who can annotate the data values in the report, for example, to identify which are correct or incorrect, along with their relevance to decision-making. Such feedback can trigger the data wrangling system to revise the way in which such reports are produced, for example by prioritising results from different data sources. The provision of domain-expert feedback from the data scientists is a form of payment, as staff effort is required to provide it. However, it should also be possible to use crowdsourcing, with direct financial payment of crowd workers, for example to identify duplicates, and thereby to refine the automatically generated rules that determine when two records represent the same real-world object. It is of paramount importance that these feedback-induced “reactions” do not trigger a re-processing of all datasets involved in the computation but rather limit the processing to the strictly necessary data.
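The last requirement of Example 5, limiting re-processing to what is strictly necessary, can be sketched at the granularity of whole artefacts by following dependencies downstream from whatever the feedback touched. The dependency graph and the artefact names below are hypothetical, and a real system would probably need finer-grained (record-level) tracking.

```python
from collections import deque


def affected(dependents: dict[str, list[str]], changed: str) -> set[str]:
    """Return every artefact downstream of `changed` (breadth-first)."""
    dirty, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in dirty:
                dirty.add(child)
                queue.append(child)
    return dirty


# Hypothetical dependency graph: artefact -> artefacts derived from it.
DEPENDENTS = {
    "source:shopA": ["extract:shopA"],
    "source:shopB": ["extract:shopB"],
    "extract:shopA": ["fused:prices"],
    "extract:shopB": ["fused:prices"],
    "fused:prices": ["report:pricing"],
}

# Feedback that changes how shopA is treated leaves shopB's extraction alone.
to_recompute = affected(DEPENDENTS, "source:shopA")
```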
3. DATA WRANGLING – RELATED WORK
As discussed in Section 2, cost-effective data wrangling is expected to involve best-effort approaches, in which multiple sources of evidence are combined by automated techniques, the results of which can be refined following a pay-as-you-go approach. Space precludes a comprehensive review of potentially relevant results, so in this section we focus on three areas with overlapping requirements and approaches, pointing out existing results on which data wrangling can build, but also areas in which these results need to be extended.
3.1 Knowledge Base Construction
In knowledge base construction (KBC) the objective is to automatically create structured representations of data, typically using the web as a source of facts for inclusion in the knowledge base. Prominent examples include YAGO, Elementary and Google’s Knowledge Vault, all of which combine candidate facts from web data sources to create or extend descriptions of entities. Such proposals are relevant to data wrangling, in providing large scale, automatically generated representations of structured data extracted from diverse sources, taking account of the associated uncertainties.
These techniques have produced impressive results but they tend to have a single, implicit user context, with a focus on consolidating slowly-changing, common sense knowledge that leans heavily on the assumption that correct facts occur frequently (instance-based redundancy). For data wrangling, the need to support diverse user contexts and highly transient information (e.g., pricing) means that user requirements need to be made explicit and to inform decision-making throughout automated processes. In addition, the focus on fully automated KBC at web-scale, without systematic support for incremental improvement in a pay-as-you-go manner, tends to require expert input, for example through the writing of rules. As such, KBC proposals share requirements with data wrangling, but have different emphases.
3.2 Pay-as-you-go Data Management
Pay-as-you-go data management, as represented by the dataspaces vision, involves the combination of an automated bootstrapping phase, followed by incremental improvement. There have been numerous results on different aspects of pay-as-you-go data management, across several activities of relevance to data wrangling, such as data extraction, matching, mapping and entity resolution. We note that in these proposals a single type of feedback is used to support a single data management task. The opportunities presented by crowdsourcing have provided a recent boost to this area, in which, typically, paid micro-tasks are submitted to public crowds as a source of feedback for pay-as-you-go activities. This has included work that refines different steps within an activity (e.g. both blocking and fine-grained matching within entity resolution), and the investigation of systematic approaches for relating uncertain feedback to other sources of evidence. However, the state of the art is that techniques have been developed in which individual types of feedback are used to influence specific data management tasks, and there seems to be significant scope for feedback to be integrated into all activities that compose a data wrangling pipeline, with reuse of feedback to inform multiple activities. Highly automated wrangling processes require formalised feedback (e.g., in terms of rules or facts to be added to or removed from the process) so that it can be used by suitable reasoning processes to automatically adapt the wrangling workflows.
Data Tamer provides a substantially automated pipeline involving schema integration and entity resolution, where components obtain feedback to refine the results of automated analyses. Although Data Tamer moves a significant way from classical, largely manually specified ETL techniques, user feedback is obtained for and applied to specific steps (and not shared across components), and there is no user context to inform where compromises should be made and efforts focused.
3.3 Context Awareness
There has been significant prior work on context in computing systems, with a particular emphasis on mobile devices and users, in which the objective is to provide data or services that meet the evolving, situational needs of users. In information management, the emphasis has been on identifying the portion of the available information that is relevant in specific ambient conditions. For data wrangling, classical notions of context such as location and time will sometimes be relevant, but we anticipate that: (i) there may be many additional features that characterise the user and data contexts, for individual users, groups of users and tasks; and (ii) the information about context will need to inform a wide range of data management tasks in addition to the selection of the most relevant results.
4. DATA WRANGLING – VISION
In the light of the scene-setting from the previous sections, Figure 1 outlines potential components and relationships in a data wrangling architecture. To the left of the figure, several (potentially many) Data Sources provide the data that is required for the application. A Data Extraction component provides wrappers for the potentially heterogeneous sources (files, databases, documents, web pages), providing syntactically consistent representations that can then be brought together by the Data Integration component, to yield Wrangled Data that is then available for exploration and analysis.
However, in our vision, these extraction and integration components both use all the available data and adopt a pay-as-you-go approach. In Figure 1, this is represented by a collection of Working Data, which contains not only results and metadata from the Data Extraction and Data Integration components, but also:
1. all relevant Auxiliary Data, which would include the user context, and whatever additional information can represent the data context, such as reference data, master data and domain ontologies;
2. the results of all Quality analyses that have been carried out, which may apply to individual data sources, the results of different extractions and components of relevance to integration such as matches or mappings; and
3. the feedback that has been obtained from users or crowds, on any aspect of the wrangling process, including the extractions (e.g. users could indicate if a wrapper has extracted what they would have expected), or the results of integrations (e.g. crowds could identify duplicates).
Figure 1: Abstract Wrangling Architecture.
To support this, data wrangling needs substantial advances in data extraction, integration and cleaning, as well as the co-design of the components in Figure 1 to support a much closer interaction in a context-aware, pay-as-you-go setting.
4.1 Research Challenges for Components
This section makes the case that meeting the vision will require changes of substance to existing data management functionalities, such as Data Extraction and Data Integration.
To respond fully to the proposed architecture, Data Extraction must make effective use of all the available data. Consider web data extraction, in which wrappers are generated that enable deep web resources to be treated as structured data sets (e.g., [12, 19]). The lack of context and incrementality in data extraction has long been identified as a weakness, and research is required to make extraction components responsive to quality analyses, insights from integration and user feedback. As an example, existing knowledge bases and intermediate products of data cleaning and integration processes can be used to improve the quality of wrapper induction.
Along the same lines, Data Integration must make effective use of all the available data in ways that take account of the user context. As data integration acts on a variety of constructs (sources, matches, mappings, instances), each of which may be associated with its own uncertainties, automated functionalities such as those for identifying matches and generating mappings need to be revised to support multi-criteria decision making in the context of uncertainty. For example, the selection of which mappings to use must take into account information from the user context, such as the number of results required, the budget for accessing sources, and quality requirements. Supporting the complete data wrangling process involves generalising from a range of point solutions into an approach in which all components can take account of a range of different sources of evolving evidence.
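To make the mapping selection example concrete, the sketch below chooses mappings using the three user-context criteria named above: the number of results required, the budget for accessing sources, and a quality requirement. The field names and the greedy strategy are assumptions for illustration only. The strategy is a simple heuristic rather than an optimal multi-criteria method, and it treats the quality, cost and result estimates as certain, which the architecture proposed here would not.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    name: str
    est_quality: float  # estimated precision in [0, 1]
    access_cost: float  # cost of accessing the underlying sources
    est_results: int    # estimated number of results returned


@dataclass
class UserContext:
    min_quality: float
    budget: float
    results_needed: int


def select_mappings(candidates, ctx):
    """Greedy heuristic: highest quality first, within budget,
    stopping once enough results are expected."""
    chosen, spent, results = [], 0.0, 0
    eligible = [m for m in candidates if m.est_quality >= ctx.min_quality]
    for m in sorted(eligible, key=lambda m: m.est_quality, reverse=True):
        if results >= ctx.results_needed:
            break
        if spent + m.access_cost <= ctx.budget:
            chosen.append(m)
            spent += m.access_cost
            results += m.est_results
    return chosen


candidates = [
    Mapping("m1", 0.90, 5.0, 100),
    Mapping("m2", 0.80, 4.0, 100),
    Mapping("m3", 0.50, 1.0, 500),   # fails the quality requirement
    Mapping("m4", 0.85, 10.0, 200),  # too expensive once m1 is chosen
]
ctx = UserContext(min_quality=0.7, budget=10.0, results_needed=200)
selected = [m.name for m in select_mappings(candidates, ctx)]
```

Changing only the user context (for example, raising the budget) changes which mappings are selected, without any change to the mappings themselves.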
4.2 Research Challenges for Architectures
This section makes the case that meeting the vision will require changes of substance to existing data management architectures, and in particular a paradigm shift for ETL.
Traditional ETL operates on manually-specified data manipulation workflows that extract data from structured data sources, integrating, cleaning, and eventually storing them in aggregated form into data warehouses. In Figure 1 there is no explicit control flow specified, but we note that the requirements of automation, refined on a pay-as-you-go basis taking into account the user context, are at odds with a hard-wired, user-specified data manipulation workflow. In the abstract architecture, the pay-as-you-go approach is achieved by storing intermediate results of the ETL process for on-demand recombination, depending on the user context and the potentially continually evolving data context. As such, the user context must provide a declarative specification of the user’s requirements and priorities, both functional (data) and non-functional (such as quality and cost trade-offs), so that the components in Figure 1 can be automatically and flexibly composed. This requires an autonomic approach to data wrangling, in which self-configuration is more central to the architecture than in self-managing databases.
The resulting architecture must not only be autonomic, it must also take account of the inherent uncertainty associated with much of the Working Data in Figure 1. Uncertainty comes from: (i) Data Sources in the form of unreliable and inconsistent data; (ii) the wrangling components, for example in the form of tentative extraction rules or mappings; (iii) the auxiliary data, for example in the form of ontologies that do not quite represent the user’s conceptualisation of the domain; and (iv) the feedback which may be unreliable or out of line with the user’s requirements or preferences. With this complex environment, it is important that uncertainty is represented explicitly and reasoned with systematically, so that well informed decisions can build on a sound understanding of the available evidence.
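One simple way to represent uncertainty explicitly is to attach a probability to each candidate assertion (for example, that two records match) and to update it as evidence arrives. The sketch below does this by adding log likelihood ratios, which assumes that the pieces of evidence are independent; unreliable feedback is modelled with a hypothetical symmetric error rate. This is only one of many possible formalisms and is not proposed here as the representation the architecture should adopt.

```python
import math


def combine_evidence(prior: float, likelihood_ratios: list[float]) -> float:
    """Posterior probability from a prior and likelihood ratios,
    assuming the pieces of evidence are independent."""
    log_odds = math.log(prior / (1 - prior))
    for lr in likelihood_ratios:
        log_odds += math.log(lr)
    return 1 / (1 + math.exp(-log_odds))


def feedback_ratio(says_true: bool, reliability: float) -> float:
    """Likelihood ratio for one annotation from a user or crowd worker
    who is right with probability `reliability` (assumed symmetric errors)."""
    ratio = reliability / (1 - reliability)
    return ratio if says_true else 1 / ratio


# A candidate match with an uninformative prior and one fairly reliable vote.
one_vote = combine_evidence(0.5, [feedback_ratio(True, 0.8)])

# Two equally reliable annotators who disagree leave the prior unchanged.
conflict = combine_evidence(
    0.5, [feedback_ratio(True, 0.8), feedback_ratio(False, 0.8)]
)
```

Likelihood ratios derived from quality analyses or from automated matchers could be passed in the same list, which is the sense in which feedback is treated as one source of evidence among several.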
This raises an additional research question, on how best to represent and reason in a principled and scalable way with the working data and associated workflows; there is a need for a uniform representation for the results of the different components in Figure 1, which are as diverse as domain ontologies, matches, data extraction and transformation rules, schema mappings, user feedback and provenance information, along with their associated quality annotations and uncertainties.
In addition, the ways in which different types of user engage with the wrangling process are also worthy of further research. In Wrangler, now commercialised by Trifacta, data scientists clean and transform data sets using an interactive interface in which, among other things, the system can suggest generic transformations from user edits. In this approach, users provide feedback on the changes to the selected data they would like to have made, and select from proposed transformations. Additional research could investigate where such interactions could be used to inform upstream aspects of the wrangling process, such as source selection or mapping generation, and to understand how other kinds of feedback, or the results of other analyses, could inform what is offered to the user in tools such as Wrangler.
4.3 Research Challenges in Scalability
In this paper we have proposed responding to the Volume aspect of big data principally in the form of the number of sources that may be available, where we propose that automation and incrementality are key approaches. In this section we discuss some additional challenges in data wrangling that result from scale.
The most direct impact of scale in big data results from the sheer volume of data that may be present in the sources. ETL vendors have responded to this challenge by compiling ETL workflows into big data platforms, such as map/reduce. In the architecture of Figure 1, it will be necessary for extraction, integration and data querying tasks to be able to be executed using such platforms. However, there are also fundamental problems to be addressed. For example, many quality analyses are intractable, and evaluating even standard queries of the sort used in mappings may require substantial changes to classical assumptions when faced with huge data sets. Among these challenges are understanding the requirement for query scalability that can be provided in terms of access and indexing information, and developing static techniques for query approximation (i.e., without looking at the data), as has been initiated for conjunctive queries. For the architecture of Figure 1 there is the additional requirement to reason with uncertainty over potentially numerous sources of evidence; this is a serious issue since even in the classical settings data uncertainty often leads to intractability of the most basic data processing tasks [1, 23]. We also observe that knowledge base construction has itself given rise to novel reasoning techniques, and additional research may be required to inform decision-making for data wrangling at scale.
5. CONCLUSIONS
Data wrangling is a problem and an opportunity:
• A problem because the 4 V’s of big data may all be present together, undermining manual approaches to ETL.
• An opportunity because if we can make data wrangling much more cost effective, all sorts of hitherto impractical tasks come into reach.
This vision paper aims to raise the profile of data wrangling as a research area within the data management community, where there is a lot of work on relevant functionalities, but where these have not been refined or integrated as is required to support data wrangling. The paper has identified research challenges that emerge from data wrangling, around the need to make compromises that reflect the user’s requirements, the ability to make use of all the available evidence, and the development of pay-as-you-go techniques that enable diverse forms of payment at convenient times. We have also presented an abstract architecture for data wrangling, and outlined how that architecture departs from traditional approaches to ETL, through increased use of automation, which flexibly accounts for diverse user and data contexts. We have argued that this architecture will require changes of substance to established data management components, as well as to the way in which they work together. For example, the proposed architecture will require support for representing and reasoning with the diverse and uncertain working data that is of relevance to the data wrangling process. Thus we encourage the data management research community to direct its attention to novel approaches to data wrangling, as a prerequisite for the cost-effective exploitation of big data.
Acknowledgments
This research is supported by the VADA Programme Grant from the UK Engineering and Physical Sciences Research Council, whose support we are pleased to acknowledge. We are also grateful to our colleagues in VADA for their contributions to discussions on data wrangling: Peter Buneman, Wenfei Fan, Alvaro Fernandes, John Keane, Thomas Lukasiewicz, Sebastian Maneth and Dan Olteanu.