Thursday, 8 July 2010

Data Processing: DB ETL (extract, transform, load), SOA, ESB

Extract, transform, and load (ETL) is a process in database usage, and especially in data warehousing, that involves: 1) extracting data from outside sources; 2) transforming it to fit operational needs (which can include quality levels); and 3) loading it into the end target (database or data warehouse).
Extract: The first part of an ETL process involves extracting the data from the source systems. Most data warehousing projects consolidate (combine into a single, more coherent whole) data from different source systems. Each separate system may also use a different data organization/format. Common data-source formats are relational databases and flat files, but sources may also include non-relational database structures such as Information Management System (IMS), other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even data fetched from outside sources through web spidering or screen scraping. (Screen scraping is normally associated with collecting visual data from a source, rather than parsing data as in web scraping. Originally, screen scraping referred to the practice of reading text data from a computer display terminal's screen, generally done by reading the terminal's memory through its auxiliary port, or by connecting the terminal output port of one computer system to an input port on another. Web scraping, by contrast, works on web pages, which are built with text-based mark-up languages (HTML and XHTML) and frequently contain a wealth of useful data in text form; because most web pages are designed for human end-users and not for ease of automated use, tool kits that scrape web content were created. A web scraper bears little in common with a screen scraper and is nothing more than an API to extract data from a web site.) Streaming the extracted data from the source and loading it on-the-fly into the destination database is another way of performing ETL when no intermediate data storage is required. In general, the goal of the extraction phase is to convert the data into a single format appropriate for transformation processing. An intrinsic part of the extraction involves parsing the extracted data and checking whether the data meets an expected pattern or structure; if not, the data may be rejected entirely or in part.
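A minimal sketch of the extract step in Python, assuming two hypothetical sources (a CSV file and a SQLite table named employees) and a hypothetical expected pattern for the records; rows that fail the pattern check are rejected rather than passed on:

# Minimal extract sketch: read rows from a flat file and a relational source,
# parse them into one common record format, and reject records that do not
# match the expected pattern (here: a numeric employee_id and an ISO date).
import csv
import re
import sqlite3

EXPECTED_DATE = re.compile(r"\d{4}-\d{2}-\d{2}$")

def extract_from_csv(path):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {"employee_id": row["employee_id"],
                   "hire_date": row["hire_date"],
                   "salary": row["salary"]}

def extract_from_db(conn):
    for emp_id, hire_date, salary in conn.execute(
            "SELECT employee_id, hire_date, salary FROM employees"):
        yield {"employee_id": str(emp_id),
               "hire_date": hire_date,
               "salary": str(salary)}

def validate(record):
    # An intrinsic part of extraction: check the expected pattern/structure.
    return record["employee_id"].isdigit() and EXPECTED_DATE.match(record["hire_date"])

def extract(sources):
    accepted, rejected = [], []
    for source in sources:
        for record in source:
            (accepted if validate(record) else rejected).append(record)
    return accepted, rejected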
Transform: The transform stage applies a series of rules or functions to the data extracted from the source to derive the data for loading into the end target. Some data sources will require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database:
* Selecting only certain columns to load (or selecting null columns not to load). For example, if the source data has three columns (also called attributes), say roll_no, age and salary, the extraction may take only roll_no and salary. Similarly, the extraction mechanism may ignore all records where salary is not present (salary = null).
* Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female); this calls for automated data cleansing, as no manual cleansing occurs during ETL.
* Encoding free-form values (e.g., mapping "Male" to "1" and "Mr" to "M").
* Deriving a new calculated value (e.g., sale_amount = qty * unit_price).
* Filtering.
* Sorting.
* Joining data from multiple sources (e.g., lookup, merge).
* Aggregation (for example, rollup: summarizing multiple rows of data, such as total sales for each store and for each region).
* Generating surrogate-key values.
* Transposing or pivoting (turning multiple columns into multiple rows or vice versa).
* Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column into individual values in different columns).
* Disaggregation of repeating columns into a separate detail table (e.g., moving a series of addresses in one record into single addresses in a set of records in a linked address table).
* Looking up and validating the relevant data from tables or referential files for slowly changing dimensions.
* Applying any form of simple or complex data validation. If validation fails, it may result in a full, partial or no rejection of the data, and thus none, some or all of the data is handed over to the next step, depending on the rule design and exception handling.
Many of the above transformations may result in exceptions, for example when a code translation parses an unknown code in the extracted data.
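A small Python sketch of a few of the transformation types above (column selection, coded-value translation, a derived value, filtering and sorting); the column names and the code table are hypothetical:

# Minimal transform sketch applied to records represented as dictionaries.
GENDER_CODES = {"1": "M", "2": "F"}          # translate coded values

def transform(records):
    out = []
    for r in records:
        if r.get("salary") in (None, ""):     # filter: ignore records with no salary
            continue
        out.append({
            "roll_no": r["roll_no"],           # select only certain columns
            "gender": GENDER_CODES.get(r.get("gender"), "U"),
            "sale_amount": float(r["qty"]) * float(r["unit_price"]),  # derived value
        })
    return sorted(out, key=lambda r: r["roll_no"])   # sorting

rows = [{"roll_no": 2, "gender": "2", "salary": "1000", "qty": "3", "unit_price": "9.5"},
        {"roll_no": 1, "gender": "1", "salary": "",     "qty": "1", "unit_price": "20"}]
print(transform(rows))   # only the first row survives the salary filter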
Load: The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information, with the extracted data refreshed on a daily, weekly or monthly basis, while other DWs (or even other parts of the same DW) may add new data in a historicized form, for example hourly. To understand this, consider a DW that is required to maintain sales records for only the last year: the DW will overwrite any data older than a year with newer data, but entries within the one-year window are kept in a historicized manner. The timing and scope of replacing or appending are strategic design choices dependent on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded in the DW. As the load phase interacts with a database, the constraints defined in the database schema — as well as in triggers activated upon data load — apply (for example, uniqueness, referential integrity, mandatory fields), and these also contribute to the overall data-quality performance of the ETL process.
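A minimal load sketch in Python, using SQLite as a stand-in for the warehouse; the table name, the one-year retention window and the load_date column are assumptions for illustration. The schema constraints (primary key, NOT NULL) are enforced by the database as the data is loaded:

# Minimal load sketch: enforce schema constraints on load and keep a
# historicized load_date column, overwriting data older than one year.
import sqlite3
from datetime import date

def load(records):
    dw = sqlite3.connect("warehouse.db")
    dw.execute("""CREATE TABLE IF NOT EXISTS sales (
                      roll_no     INTEGER NOT NULL,
                      sale_amount REAL    NOT NULL,
                      load_date   TEXT    NOT NULL,
                      PRIMARY KEY (roll_no, load_date))""")
    # Purge anything outside the one-year window kept by this warehouse.
    dw.execute("DELETE FROM sales WHERE load_date < date('now', '-1 year')")
    dw.executemany(
        "INSERT INTO sales (roll_no, sale_amount, load_date) VALUES (?, ?, ?)",
        [(r["roll_no"], r["sale_amount"], date.today().isoformat()) for r in records])
    dw.commit()
    dw.close()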
Examples:
* A financial institution might have information on a customer in several departments, and each department might have that customer's information listed in a different way. The membership department might list the customer by name, whereas the accounting department might list the customer by number. ETL can bundle all this data and consolidate it into a uniform presentation, such as for storing in a database or data warehouse.
* Another way that companies use ETL is to move information to another application permanently. For instance, word-processing data might be translated into numbers and letters, which are easier to track in a spreadsheet or database program. This is particularly useful for backing up information as companies transition to new software altogether.
Challenges: ETL processes can involve considerable complexity, and significant operational problems can occur with improperly designed ETL systems. The range of data values or data quality in an operational system may exceed the expectations of designers at the time validation and transformation rules are specified. Data profiling of a source during data analysis can identify the data conditions that will need to be managed by transform rule specifications, leading to an amendment of the validation rules explicitly and implicitly implemented in the ETL process. (Data profiling is the process of examining the data available in an existing data source, e.g. a database or a file, and collecting statistics and information about that data. The purpose of these statistics may be to: 1. find out whether existing data can easily be used for other purposes; 2. improve the ability to search the data by tagging it with keywords, descriptions, or assigning it to a category; 3. give metrics on data quality, including whether the data conforms to particular standards or patterns; 4. assess the risk involved in integrating data for new applications, including the challenges of joins; 5. assess whether metadata accurately describes the actual values in the source database; 6. understand data challenges early in any data-intensive project, so that late project surprises are avoided, since finding data problems late in the project can lead to delays and cost overruns; 7. have an enterprise view of all data, for uses such as Master Data Management, where key data is needed, or data governance for improving data quality.) Data warehouses are typically assembled from a variety of data sources with different formats and purposes. As such, ETL is a key process to bring all the data together in a standard, homogeneous environment. Design analysts should establish the scalability of an ETL system across the lifetime of its usage. This includes understanding the volumes of data that will have to be processed within service level agreements. The time available to extract from source systems may change, which may mean the same amount of data has to be processed in less time. Some ETL systems have to scale to process terabytes of data to update data warehouses with tens of terabytes of data. Increasing volumes of data may require designs that can scale from daily batch to multiple-day micro-batch to integration with message queues or real-time change-data capture for continuous transformation and update.
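A minimal data profiling sketch in Python: it collects simple per-column statistics (row count, null count, distinct count, min/max) from a list of record dictionaries, so that validation and transformation rules can be based on observed data conditions; the input format is an assumption for illustration:

# Minimal data profiling sketch over records represented as dictionaries.
from collections import defaultdict

def profile(rows):
    stats = defaultdict(lambda: {"nulls": 0, "values": set()})
    for row in rows:
        for col, value in row.items():
            if value in (None, ""):
                stats[col]["nulls"] += 1      # count missing values per column
            else:
                stats[col]["values"].add(value)
    return {col: {"rows": len(rows),
                  "nulls": s["nulls"],
                  "distinct": len(s["values"]),
                  "min": min(s["values"], default=None),
                  "max": max(s["values"], default=None)}
            for col, s in stats.items()}

print(profile([{"salary": "1000", "gender": "1"},
               {"salary": "",     "gender": "9"}]))   # the unknown code 9 shows up in the profile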
Non-compete agreement: A non-compete clause or covenant not to compete (CNC), also called a non-compete agreement, non-competitive agreement or anti-competitive agreement, is a term used in contract law under which one party (usually an employee) agrees not to pursue a similar profession or trade in competition against another party (usually the employer). As a contract provision, a CNC is bound by traditional contract requirements, including the consideration doctrine. The use of such clauses is premised on the possibility that upon termination or resignation, an employee might begin working for a competitor or start a business, and gain a competitive advantage by abusing confidential information about the former employer's operations or trade secrets, or sensitive information such as customer/client lists, business practices, upcoming products, and marketing plans. Conversely, a business might abuse a non-compete covenant to prevent an employee from working elsewhere at all.
Internet bots, also known as web robots, WWW robots  or simply bots, are software applications that run automated tasks over the Internet. Typically, bots perform tasks that are both simple and structurally repetitive, at a much higher rate than would be possible for a human alone. The largest use of bots is in web spidering, in which an automated script fetches, analyzes and files information from web servers at many times the speed of a human. Each server can have a file called robots.txt, containing rules for the spidering of that server that the bot is supposed to obey.
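A minimal sketch of a well-behaved bot consulting robots.txt before fetching, using Python's standard-library urllib.robotparser; the URLs and the user-agent string are placeholders:

# Minimal sketch: obey the server's robots.txt rules before spidering a page.
from urllib import robotparser, request

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                                    # fetch and parse the server's rules

url = "https://example.com/some/page.html"
if rp.can_fetch("MyBot/1.0", url):           # check the rules for this user agent
    html = request.urlopen(url).read()       # the bot fetches and files the page
else:
    print("robots.txt disallows fetching", url)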
The more classical enterprise application integration (EAI): The hub-and-spoke distribution paradigm (or model, or network) is a system of connections arranged like a chariot wheel, in which all traffic moves along spokes connected to the hub at the center. The model is commonly used in industry, particularly in transport, telecommunications and freight, as well as in distributed computing.
Enterprise service bus (ESB): In computing, an enterprise service bus (ESB) is a software architecture construct that provides fundamental services for complex architectures via an event-driven and standards-based messaging engine (the bus). Developers typically implement an ESB using technologies found in a category of middleware infrastructure products, usually based on recognized standards. An ESB generally provides an abstraction layer on top of an implementation of an enterprise messaging system, which allows integration architects to exploit the value of messaging without writing code. Unlike the more classical enterprise application integration (EAI) approach of a monolithic stack in a hub-and-spoke architecture, an enterprise service bus builds on base functions broken up into their constituent parts, with distributed deployment where needed, working in harmony as necessary. An ESB does not itself implement a service-oriented architecture (SOA), but it provides the features with which one may be implemented. An ESB should be built on standards and provide flexibility, supporting many transport media capable of implementing both traditional SOA patterns and SOA 2.0-enriched business architectures. ESBs attempt to isolate the coupling between the service called and the transport medium. Most ESB providers incorporate SOA principles and allow for independent message formats.
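A toy, in-memory Python sketch of the bus idea only: services subscribe to logical destinations, senders address the bus rather than a service or a transport, and the bus can adapt the message format in between. This illustrates the concept under made-up names (the "billing" destination and its handler), not any particular ESB product:

# Toy illustration of bus-style decoupling between caller, service and format.
import json

class Bus:
    def __init__(self):
        self.routes = {}          # logical destination -> (handler, transformer)

    def subscribe(self, destination, handler, transformer=None):
        self.routes[destination] = (handler, transformer)

    def send(self, destination, message):
        handler, transformer = self.routes[destination]
        if transformer:
            message = transformer(message)    # adapt the message format in transit
        return handler(message)

bus = Bus()
# A hypothetical billing service that expects a JSON string.
bus.subscribe("billing",
              lambda msg: "billed " + json.loads(msg)["customer"],
              transformer=json.dumps)
# The caller knows only the logical destination, not the service or transport.
print(bus.send("billing", {"customer": "ACME", "amount": 100}))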

Service-oriented architecture (SOA): In computing, a service-oriented architecture (SOA) is a flexible set of design principles used during the phases of systems development and integration. A deployed SOA-based architecture provides a loosely integrated suite of services that can be used within multiple business domains. SOA also generally provides a way for consumers of services, such as web-based applications, to become aware of available SOA-based services. For example, several disparate departments within a company may develop and deploy SOA services in different implementation languages; their respective clients will benefit from a well-understood, well-defined interface for accessing them. XML is commonly used for interfacing with SOA services, though this is not required. SOA defines how to integrate widely disparate applications for a world that is Web-based and uses multiple implementation platforms. Rather than defining an API, SOA defines the interface in terms of protocols and functionality. An endpoint is the entry point for such an SOA implementation. Service-orientation requires loose coupling of services with operating systems and other technologies that underlie applications. SOA separates functions into distinct units, or services, which developers make accessible over a network so that users can combine and reuse them in the production of applications. These services and their corresponding consumers communicate with each other by passing data in a well-defined, shared format, or by coordinating an activity between two or more services.
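A minimal Python sketch of the SOA point that the contract is the endpoint plus a well-defined message format (XML here) rather than a language-specific API; the getCustomer operation, its field names and the in-process stand-in for the remote endpoint are all hypothetical:

# Minimal sketch: consumer and service share only an XML message contract.
import xml.etree.ElementTree as ET

def build_request(customer_id):
    req = ET.Element("getCustomerRequest")
    ET.SubElement(req, "customerId").text = str(customer_id)
    return ET.tostring(req)

def parse_response(xml_bytes):
    root = ET.fromstring(xml_bytes)
    return {"id": root.findtext("customerId"), "name": root.findtext("name")}

# Stand-in for the remote endpoint; in practice this would be a network call,
# and the service could be implemented in any language behind the endpoint.
def fake_endpoint(request_bytes):
    cid = ET.fromstring(request_bytes).findtext("customerId")
    resp = ET.Element("getCustomerResponse")
    ET.SubElement(resp, "customerId").text = cid
    ET.SubElement(resp, "name").text = "ACME Corp"
    return ET.tostring(resp)

print(parse_response(fake_endpoint(build_request(42))))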