Data extraction

Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration). The import into the intermediate extracting system is thus usually followed by data transformation and possibly the addition of metadata prior to export to another stage in the data workflow.^[1]

Usually, the term data extraction is applied when (experimental) data is first imported into a computer from primary sources, like measuring or recording devices. Today's electronic devices will usually present an electrical connector (e.g. USB) through which 'raw data' can be streamed into a personal computer.

Typical unstructured data sources include web pages, emails, documents, PDFs, scanned text, mainframe reports, and spool files. Extracting data from these sources has grown into a considerable technical challenge: whereas data extraction historically had to deal with changes in physical hardware formats, most current data extraction deals with unstructured data sources and with differing software formats. The growing practice of extracting data^[2] from the web is referred to as web scraping.
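As a rough illustration of extracting data from a web page, the following sketch uses Python's standard html.parser module to pull link targets and link text out of an HTML fragment. The class name LinkExtractor, the function extract_links and the sample page are assumptions made for this example only; a real scraper would also need to fetch pages over the network and cope with malformed markup.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return the list of (href, text) pairs found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page fragment, used only for illustration.
SAMPLE_HTML = """
<html><body>
  <h1>Reports</h1>
  <ul>
    <li><a href="/reports/2015-11.csv">November 2015</a></li>
    <li><a href="/reports/2015-10.csv">October <b>2015</b></a></li>
  </ul>
</body></html>
"""

links = extract_links(SAMPLE_HTML)
for href, text in links:
    print(href, "|", text)
```

The parser turns loosely structured markup into a list of tuples, which can then be passed on to a transformation or storage step.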

The act of adding structure to unstructured data takes a number of forms:

Using text pattern matching, such as regular expressions, to identify small or large-scale structure, e.g. records in a report and their associated data from headers and footers;
Using a table-based approach to identify common sections within a limited domain, e.g. in emailed resumes, identifying skills, previous work experience, qualifications etc. from a standard set of commonly used headings (these differ from language to language; for example, education might be found under Education, Qualification or Courses);
Using text analytics to attempt to understand the text and link it to other information.
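The first two approaches can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the report layout, the regular expressions, the heading table and the function names parse_report and split_sections are all assumptions invented for the example.

```python
import re

# Hypothetical report layout: one header line, data records, one footer line.
REPORT = """\
SALES REPORT  Region: North  Date: 2015-11-20
1001  Widget        12   4.50
1002  Gadget         3  19.99
TOTAL RECORDS: 2
"""

HEADER_RE = re.compile(r"Region:\s*(?P<region>\w+)\s+Date:\s*(?P<date>\d{4}-\d{2}-\d{2})")
RECORD_RE = re.compile(r"^(?P<id>\d+)\s+(?P<item>\w+)\s+(?P<qty>\d+)\s+(?P<price>\d+\.\d{2})$")
FOOTER_RE = re.compile(r"^TOTAL RECORDS:\s*(?P<count>\d+)$")


def parse_report(text):
    """Pattern matching: attach header data to each record, check the footer count."""
    header, records, expected = {}, [], None
    for line in text.splitlines():
        line = line.strip()
        m = HEADER_RE.search(line)
        if m:
            header = m.groupdict()
            continue
        m = RECORD_RE.match(line)
        if m:
            records.append({"id": int(m["id"]), "item": m["item"],
                            "qty": int(m["qty"]), "price": float(m["price"]),
                            **header})
            continue
        m = FOOTER_RE.match(line)
        if m:
            expected = int(m["count"])
    if expected is not None and expected != len(records):
        raise ValueError(f"footer says {expected} records, found {len(records)}")
    return records


# Table-based approach: map commonly used headings to one canonical section name.
# The table is an assumption and would differ from language to language.
HEADING_TABLE = {
    "education": "education", "qualifications": "education", "courses": "education",
    "skills": "skills",
    "work experience": "experience", "employment history": "experience",
}

# Hypothetical resume text, used only for illustration.
RESUME = """\
Jane Doe
Qualifications:
BSc Computer Science
Skills
Python, SQL
Employment history
Data analyst, 2012-2015
"""


def split_sections(text):
    """Group non-empty lines under the canonical section of the preceding heading."""
    sections, current = {}, None
    for line in text.splitlines():
        key = line.strip().rstrip(":").lower()
        if key in HEADING_TABLE:
            current = HEADING_TABLE[key]
            sections.setdefault(current, [])
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections


records = parse_report(REPORT)
sections = split_sections(RESUME)
print(records)
print(sections)
```

In this sketch, lines that appear before the first recognised heading (such as the name at the top of the resume) are ignored; whether that is appropriate depends on the domain.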

Notes

External links

Data Extraction as a part of the ETL process in a Data Warehousing environment

This article is issued from Wikipedia - version of the Friday, November 20, 2015. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.