Case Project 3-1: Evading Firewalls and the IDPS (600 words)
You have been assigned to use Nmap to develop testing procedures for how attackers might try to evade detection at your organization’s firewalls and IDPS. Prepare a report that explains which Nmap options can be used to evade firewalls and IDPSs, and how these options function.
Case Project 4-1: Tunneling IPv6 (2 pages)
You have been assigned to report to your network administrators on the use of Teredo. Prepare a two-page memo that outlines why Teredo was developed, how it is implemented in Windows operating systems, for which types of networks it is appropriate, and how long it should be implemented.
Case Project 4-2: Creating ACLs
You have been assigned to create access control lists to filter specific traffic on a Cisco router. Provide the commands needed to filter the appropriate traffic in each of the following ACLs.
Allow Telnet connections to the 192.168.1.0 network from the host 10.3.4.7.
Allow established connections from the 172.16.0.0 network to anywhere.
Permit all other access.
Prevent Telnet connections from the 192.168.1.0 network to the 172.16.0.0 network.
Prevent reserved addresses from accessing any network.
Deny spoofing from the broadcast address.
Permit all other access.
Data Wrangling for Big Data: Challenges and Opportunities
Tim Furche Dept. of Computer Science
Oxford University Oxford OX1 3QD, UK
Georg Gottlob Dept. of Computer Science
Oxford University Oxford OX1 3QD, UK
Leonid Libkin School of Informatics
University of Edinburgh Edinburgh EH8 9AB, UK
Giorgio Orsi School of Computer Science
University of Birmingham Birmingham, B15 2TT, UK
Norman W. Paton School of Computer Science
University of Manchester Manchester M13 9PL, UK
ABSTRACT
Data wrangling is the process by which the data required by an application is identified, extracted, cleaned and integrated, to yield a data set that is suitable for exploration and analysis. Although there are widely used Extract, Transform and Load (ETL) techniques and platforms, they often require manual work from technical and domain experts at different stages of the process. When confronted with the 4 V’s of big data (volume, velocity, variety and veracity), manual intervention may make ETL prohibitively expensive. This paper argues that providing cost-effective, highly-automated approaches to data wrangling involves significant research challenges, requiring fundamental changes to established areas such as data extraction, integration and cleaning, and to the ways in which these areas are brought together. Specifically, the paper discusses the importance of comprehensive support for context awareness within data wrangling, and the need for adaptive, pay-as-you-go solutions that automatically tune the wrangling process to the requirements and resources of the specific application.
1. INTRODUCTION
Data wrangling has been recognised as a recurring feature of big data life cycles. Data wrangling has been defined as:
a process of iterative data exploration and transformation that enables analysis. ()
In some cases, definitions capture the assumption that there is significant manual effort in the process:
the process of manually converting or mapping data from one “raw” form into another format that allows for more convenient consumption of the data with the help of semi-automated tools. ()
©2016, Copyright is with the authors. Published in Proc. 19th International Conference on Extending Database Technology (EDBT), March 15-18, 2016 – Bordeaux, France: ISBN 978-3-89318-070-7, on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.
The general requirement to reorganise data for analysis is nothing new, with both database vendors and data integration companies providing Extract, Transform and Load (ETL) products. ETL platforms typically provide components for wrapping data sources, transforming and combining data from different sources, and for loading the resulting data into data warehouses, along with some means of orchestrating the components, such as a workflow language. Such platforms are clearly useful, but in being developed principally for enterprise settings, they tend to limit their scope to supporting the specification of wrangling workflows by expert developers.
Does big data make a difference to what is needed for ETL? Although there are many different flavors of big data applications, the 4 V’s of big data¹ refer to some recurring characteristics: Volume represents scale either in terms of the size or number of data sources; Velocity represents either data arrival rates or the rate at which sources or their contents may change; Variety captures the diversity of sources of data, including sensors, databases, files and the deep web; and Veracity represents the uncertainty that is inevitable in such a complex environment. When all 4 V’s are present, the use of ETL processes involving manual intervention at some stage may lead to the sacrifice of one or more of the V’s to comply with resource and budget constraints. Currently,
data scientists spend from 50 percent to 80 percent of their time collecting and preparing unruly digital data. ()
and only a fraction of an expert’s time may be dedicated to value-added exploration and analysis.
In addition to the technical case for research in data wrangling, there is also a significant business case; for example, vendor revenue from big data hardware, software and services was valued at $13B in 2013, with an annual growth rate of 60%. However, just as significant is the nature of the associated activities. The UK Government’s Information Economy Strategy states:
the overwhelming majority of information economy businesses – 95% of the 120,000 enterprises in the sector – employ fewer than 10 people. ()
As such, many of the organisations that stand to benefit from big data will not be able to devote substantial resources to value-added data analyses unless massive automation of wrangling processes is achieved, e.g., by limiting manual intervention to high-level feedback and to the specification of exceptions.
¹ http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Example 1 (e-Commerce Price Intelligence). When running an e-Commerce site, it is necessary to understand pricing trends among competitors. This may involve getting to grips with: Volume – thousands of sites; Velocity – sites, site descriptions and contents that are continually changing; Variety – in format, content, targeted community, etc.; and Veracity – unavailability, inconsistent descriptions, unavailable offers, etc. Manual data wrangling is likely to be expensive, partial, unreliable and poorly targeted.
As a result, there is a need for research into how to make data wrangling more cost effective. The contribution of this vision paper is to characterise research challenges emerging from data wrangling for the 4 V’s (Section 2), to identify what existing work seems to be relevant and where it needs to be further developed (Section 3), and to provide a vision for a new research direction that is a prerequisite for widespread cost-effective exploitation of big data (Section 4).
2. DATA WRANGLING – RESEARCH CHALLENGES
As discussed in the introduction, there is a need for cost-effective data wrangling; the 4 V’s of big data are likely to make the manual production of a comprehensive data wrangling process prohibitively expensive for many users. In practice this means that data wrangling for big data involves: (i) making compromises – as the perfect solution is not likely to be achievable, it is necessary to understand and capture the priorities of the users and to use these to target resources in a cost-effective manner; (ii) extending boundaries – as relevant data may be spread across many organisations and be of many types; (iii) making use of all the available information – applications differ not only in the nature of the relevant data sources, but also in existing resources that could inform the wrangling process, and full use needs to be made of existing evidence; and (iv) adopting an incremental, pay-as-you-go approach – users need to be able to contribute effort to the wrangling process in whatever form they choose and at whatever moment they choose.
The remainder of this section expands on these features, pointing out the challenges that they present to researchers.
2.1 Making Compromises
Faced with an application exhibiting the 4 V’s of big data, data scientists may feel overwhelmed by the scale and difficulty of the wrangling task. It will often be impossible to produce a comprehensive solution, so one challenge is to make well informed compromises.
The user context of an application specifies functional and non-functional requirements of the users, and the trade-offs between them.
Example 2 (e-Commerce User Contexts). In price intelligence, following on from Example 1, there may be different user contexts. For example, routine price comparison may be able to work with a subset of high quality sources, and thus the user may prefer features such as accuracy and timeliness to completeness. In contrast, where sales of a popular item have been falling, the associated issue investigation may require a more complete picture for the product in question, at the risk of presenting the user with more incorrect or out-of-date data.
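One way to picture how a user context could steer source selection is to treat it as a set of weights over quality criteria. The following is a minimal sketch under that assumption, not a method taken from the paper: the `Source` class, the two sources, their scores and both weight sets are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    accuracy: float      # assumed fraction of correct records, in [0, 1]
    timeliness: float    # assumed freshness score, in [0, 1]
    completeness: float  # assumed coverage of the products of interest, in [0, 1]


def rank_sources(sources, weights):
    """Rank sources by a weighted sum of their quality scores."""
    def score(s):
        return (weights["accuracy"] * s.accuracy
                + weights["timeliness"] * s.timeliness
                + weights["completeness"] * s.completeness)
    return sorted(sources, key=score, reverse=True)


# Hypothetical sources and scores, invented for illustration only.
sources = [
    Source("curated_feed", accuracy=0.95, timeliness=0.90, completeness=0.40),
    Source("broad_crawl", accuracy=0.70, timeliness=0.60, completeness=0.95),
]

# Two user contexts from Example 2, expressed as hypothetical weightings.
routine_comparison = {"accuracy": 0.5, "timeliness": 0.4, "completeness": 0.1}
issue_investigation = {"accuracy": 0.2, "timeliness": 0.1, "completeness": 0.7}

best_routine = rank_sources(sources, routine_comparison)[0].name
best_investigation = rank_sources(sources, issue_investigation)[0].name
```

With these illustrative numbers, the routine comparison context ranks the small, accurate source first, whereas the issue investigation context prefers the broader but noisier one. A weighted sum is only one possible aggregation; the point is that the same sources are ranked differently once the user context is made explicit.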
Thus a single application may have different user contexts, and any approach to data wrangling that hard-wires a process for selecting and integrating data risks the production of data sets that are not always fit for purpose. Making well informed compromises involves: (i) capturing and making explicit the requirements and priorities of users; and (ii) enabling these requirements to permeate the wrangling process. There has been significant work on decision-support, for example in relation to multi-criteria decision making, that provides both languages for capturing requirements and algorithms for exploring the space of possible solutions in ways that take the requirements into account. For example, in the widely used Analytic Hierarchy Process, users compare criteria (such as timeliness or completeness) in terms of their relative importance, which can be taken into account when making decisions (such as which mappings to use in data integration).
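To make the pairwise comparison step concrete, the sketch below derives criterion weights from a comparison matrix. It uses the row geometric mean method, a common approximation to the principal eigenvector used in the Analytic Hierarchy Process; the criteria and the comparison values are hypothetical and are not taken from the paper.

```python
import math


def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric mean method."""
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


criteria = ["accuracy", "timeliness", "completeness"]

# pairwise[i][j] states how much more important criterion i is than
# criterion j; the matrix is reciprocal, so pairwise[j][i] == 1 / pairwise[i][j].
# Values are hypothetical judgements made for illustration.
pairwise = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]

weights = dict(zip(criteria, ahp_weights(pairwise)))
```

Because this example matrix is perfectly consistent, the resulting weights are in the ratio 4 : 2 : 1. Such weights could then inform a downstream decision, for instance which mappings to prefer in data integration; real judgements are rarely this consistent, and a fuller treatment would also check the consistency of the matrix.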
Although data management researchers have investigated techniques that apply specific user criteria to inform decisions (e.g. for selecting sources based on their anticipated financial value) and have sometimes traded off alternative objectives (e.g. precision and recall for mapping selection and refinement), such results have tended to address specific steps within wrangling in isolation, often leading to bespoke solutions. Together with high automation, adaptivity and multi-criteria optimisation are of paramount importance for cost-effective wrangling processes.
2.2 Extending the Boundaries
ETL processes traditionally operate on data lying within the boundaries of an organisation or across a network of partners. As soon as companies started to leverage big data and data science, it became clear that data outside the boundaries of the organisation represent both new business opportunities and a means to optimize existing business processes.
Data wrangling solutions have recently started to offer connectors to external data sources but, for now, these are mostly limited to open government data and established social networks (e.g., Twitter) via formalised APIs. This makes wrangling processes dependent on the availability of APIs from third parties, thus limiting the availability of data and the scope of the wrangling processes.
Recent advances in web data extraction [19, 30] have shown that fully-automated, large scale collection of long-tail, business-related data, e.g., products, jobs or locations, is possible. The challenge for data wrangling processes is now to make proper use of this wealth of “wild” data by coordinating extraction, integration and cleaning processes.
Example 3 (Business Locations). Many social networks offer the ability for users to check in to places, e.g., restaurants, offices, cinemas, via their mobile apps. This gives social networks the ability to maintain a database of businesses, their locations, and profiles of users interacting with them that is immensely valuable for advertising purposes. On the other hand, this way of acquiring data is prone to data quality problems, e.g., wrong geo-locations, misspelled or fantasy places. A popular way to address these problems is to acquire a curated database of geo-located business locations. This is usually expensive and does not always guarantee that the data is really clean, as its quality depends on the quality of the (usually unknown) data acquisition and curation process. Another way is to define a wrangling process that collects this information directly from the website of the business of interest, e.g., by wrapping the target data source directly. The extraction process can in this case be “informed” by existing integrated data, e.g., the business URL and a database of already known addresses, to identify previously unknown locations and correct erroneous ones.
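The reconciliation step described in Example 3 can be sketched as a comparison between already known addresses and addresses extracted from business websites, keyed by business URL. This is an illustrative sketch only: the function name, the normalisation rule, the example URLs and addresses are all hypothetical, and it assumes that the business's own website is the more trustworthy source when the two disagree.

```python
def reconcile_locations(known, extracted):
    """Compare extracted addresses with known ones, keyed by business URL."""
    def norm(address):
        # Crude normalisation: lower-case, drop commas, collapse whitespace.
        return " ".join(address.lower().replace(",", " ").split())

    report = {"confirmed": [], "corrected": [], "new": []}
    for url, address in extracted.items():
        if url not in known:
            # Previously unknown location.
            report["new"].append((url, address))
        elif norm(known[url]) == norm(address):
            report["confirmed"].append((url, address))
        else:
            # Assumption: the website address supersedes the known one.
            report["corrected"].append((url, known[url], address))
    return report


# Hypothetical data, invented for illustration.
known = {
    "https://cafe.example": "12 High St, Oxford",
    "https://cinema.example": "5 Park Rd, Leeds",
}
extracted = {
    "https://cafe.example": "12 high st oxford",
    "https://cinema.example": "7 Park Rd, Leeds",
    "https://office.example": "1 Main St, Bath",
}

report = reconcile_locations(known, extracted)
```

In practice the comparison would need proper address parsing and geo-coding rather than string normalisation, and the choice of which source to trust would itself depend on the user and data contexts.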
2.3 Using All the Available Information
Cost-effective data wrangling will need to make extensive use of automation for the different steps in the wrangling process. Automated processes must take advantage of all available information both when generating proposals and for comparing alternative proposals in the light of the user context.
The data context of an application consists of the sources that may provide data for wrangling, and other information that may inform the wrangling process.
Example 4 (e-Commerce Data Context). In price intelligence, following on from Example 1, the data context includes the catalogs of the many online retailers that sell overlapping sets of products to overlapping markets. However, there are additional data resources that can inform the process. For example, the e-Commerce company has a product catalog that can be considered as master data by the wrangling process; the company is interested in price comparison only for the products it sells. In addition, for this domain there are standard formats, for example in schema.org, for describing products and offers, and there are ontologies that describe products, such as The Product Types Ontology².
Thus applications have different data contexts, which include not only the data that the application seeks to use, but also local and third party sources that provide additional information about the domain or the data therein. To be cost-effective, automated techniques must be able to bring together all the available information. For example, a product types ontology could be used to inform the selection of sources based on their relevance, as an input to the matching of sources that supplements syntactic matching, and as a guide to the fusion of property values from records that have been obtained from different sources. To do this, automated processes must make well founded decisions, integrating evidence of different types. In data management, there are results of relevance to data wrangling that assimilate evidence to reach decisions (e.g. ), but work to date tends to be focused on small numbers of types of evidence, and individual data management tasks. Cost effective data wrangling requires more pervasive approaches.
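As an illustration of ontology evidence supplementing syntactic matching, the sketch below combines a string similarity score with agreement on product class. It is a minimal sketch under stated assumptions: the `PRODUCT_CLASS` dictionary is a hypothetical stand-in for a product types ontology, and the equal weighting of the two kinds of evidence is an arbitrary choice, not a recommendation from the paper.

```python
from difflib import SequenceMatcher

# Hypothetical fragment of a product types ontology: term -> product class.
PRODUCT_CLASS = {
    "laptop": "PortableComputer",
    "notebook computer": "PortableComputer",
    "laptop bag": "Luggage",
}


def syntactic_similarity(a, b):
    """String similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(a, b, ontology=PRODUCT_CLASS, ontology_weight=0.5):
    """Blend syntactic similarity with ontology agreement when available."""
    syntactic = syntactic_similarity(a, b)
    class_a = ontology.get(a.lower())
    class_b = ontology.get(b.lower())
    if class_a is None or class_b is None:
        # No ontology evidence: fall back to syntactic evidence alone.
        return syntactic
    semantic = 1.0 if class_a == class_b else 0.0
    return (1 - ontology_weight) * syntactic + ontology_weight * semantic


same_product = match_score("laptop", "notebook computer")
different_product = match_score("laptop", "laptop bag")
```

Here "laptop" and "laptop bag" are syntactically close but belong to different classes, while "laptop" and "notebook computer" share little text but the same class, so the combined score ranks the second pair higher. A more principled approach would weigh each type of evidence by its reliability rather than using a fixed blend.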