60-510 Literature Review and Survey Winter 2008
XML Schema Metrics
Instructor: Dr. Richard Frost
Yanyin Zhang 102537020 zhang13c@uwindsor.ca
ABSTRACT
With the increasing use of XML as a data format in different communities, XML Schema is becoming a central component of software construction projects. Schemas serve as interface definitions and protocol specifications, and as input to code generators for software components. This wide use has created a need for research into schema metrics for the software engineering process. In particular, schema metrics should be developed to quantify schema size, complexity, quality and other properties for controlling software processes. Before XML Schema, DTD was the data-exchange standard in XML storage, XML publishing, and the optimization of XML queries. However, the shortcomings of DTD limited its further application, and it was eventually superseded by XML Schema. Metric studies of DTDs are therefore insightful for XML Schema metrics research. In addition, because an XML Schema is an artifact of a software component, software component metrics, a field that started over three decades ago, can provide meaningful guidance for XML Schema. This paper presents a comprehensive survey of research into XML Schema metrics, DTD metrics and software component metrics.
Keywords: XML; Schema metrics; DTDs; software components
1 INTRODUCTION
XML schemas provide a consistent way to validate XML documents. The XML Schema grammar specifies a language that constrains and documents corresponding XML. The greatest strength of an XML schema is that each schema defines a “contract”. This contract provides the foundation for an application to accept or reject XML data before operating on that data. XML provides a grammar for parsing a particular file or stream format. XML schemas provide a mechanism for specifying more extensive grammar constraints. An XML Schema specifies valid elements and attributes in an XML instance. Furthermore, a schema specifies the exact element hierarchy [Binstock et al. 2002]. Also, a schema specifies options and various other constraints placed on the XML instance. In addition, a schema specifies a range for the value of an element or attribute. Suppose you have the XML instance shown in Figure 1. It consists of a product element that has two children (number and size) and an attribute (effDate). Figure 2 shows a schema that describes the instance. It contains element and attribute declarations that assign data types and element type names to elements and attributes.
<product effDate="2001-04-02">
  <number>557</number>
  <size>10</size>
</product>
Figure 1 Product instance [Binstock et al. 2002, 21]
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="product" type="ProductType"/>
  <xsd:complexType name="ProductType">
    <xsd:sequence>
      <xsd:element name="number" type="xsd:integer"/>
      <xsd:element name="size" type="SizeType"/>
    </xsd:sequence>
    <xsd:attribute name="effDate" type="xsd:date"/>
  </xsd:complexType>
  <xsd:simpleType name="SizeType">
    <xsd:restriction base="xsd:integer">
      <xsd:minInclusive value="2"/>
      <xsd:maxInclusive value="18"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
Figure 2 Product schema [Binstock et al. 2002, 21]
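To make the idea of a schema as a "contract" concrete, the following is a minimal sketch that checks the product instance by hand against the constraints described for Figure 2 (the element order, the date attribute, and the SizeType range of 2 to 18). It is not a schema validator: the function names and error messages are hypothetical, and the integer and date checks are only rough stand-ins for xsd:integer and xsd:date.

```python
import xml.etree.ElementTree as ET
from datetime import date

INSTANCE = """<product effDate="2001-04-02">
  <number>557</number>
  <size>10</size>
</product>"""


def _is_integer(text):
    """Return True if text parses as an integer (rough stand-in for xsd:integer)."""
    try:
        int((text or "").strip())
        return True
    except ValueError:
        return False


def check_product(xml_text):
    """Return a list of contract violations; an empty list means 'accept'."""
    root = ET.fromstring(xml_text)
    if root.tag != "product":
        return ["root element must be product"]
    errors = []
    # effDate is declared as xsd:date; plain ISO format is assumed here.
    eff_date = root.get("effDate")
    if eff_date is not None:
        try:
            date.fromisoformat(eff_date)
        except ValueError:
            errors.append("effDate must be a date")
    # The element hierarchy: number followed by size.
    if [child.tag for child in root] != ["number", "size"]:
        errors.append("children must be number followed by size")
        return errors
    number, size = list(root)
    if not _is_integer(number.text):
        errors.append("number must be an integer")
    if not _is_integer(size.text):
        errors.append("size must be an integer")
    elif not 2 <= int(size.text) <= 18:
        # SizeType restricts the value to the range 2..18.
        errors.append("size must be between 2 and 18")
    return errors
```

An application would call check_product before operating on the data and reject the document if the returned list is not empty. In practice a real XML Schema validator would replace this hand-written check.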
2 GENERAL DISCUSSIONS
2.1 The purpose of XML schemas
XML has developed rapidly beyond its original purpose as an electronic data interchange standard. In Web Services, a newer type of distributed application that uses XML documents for data representation, XML Schema has become a convenient means of defining interface specifications, data models, protocol specifications and so on. XML Schema is also a powerful tool for specifying contracts between business partners in the e-business industry. A schema is useful for verifying that an XML document is valid according to a defined set of rules; serving as a contract with trading partners; documenting the data in an XML instance; augmenting the instance by inserting default and fixed values for elements and attributes; normalizing whitespace according to the data type; and supplying the application with additional information about the data when it processes a particular type of document. These advantages give XML Schema an important role in the software development process, and its properties need to be quantified so that schemas remain maintainable as they are developed and used.
2.2 Why XML Schema metrics?
All XML Schema applications related to XML documents need to be properly designed so that they can be easily maintained and so that XML data can be used effectively and correctly by various applications. Schema metrics must therefore be developed to quantify schema size, complexity, quality and other properties. In software engineering, XML and XML Schema documents also have a great impact on the overall quality of the software. Metrics on these documents are thus important for predicting the quality and complexity of the software development process.
3 SURVEY RESEARCH
Little research deals directly with schema quality and complexity metrics, whereas software component metrics and object-oriented metrics have been studied thoroughly. Since XML Schemas can also be regarded as software components, this survey starts with XML and XML Schema metrics and then surveys software component metrics.
3.1 Metrics for XML document collections
This part of the survey presents metrics for XML documents; two papers focus on this subject. There is little research on metrics for XML documents. The most relevant work was done by [Klettke et al. 2002], who developed a set of five metrics for Document Type Definitions (DTDs) to measure the complexity of XML documents. It was an important first step in XML metrics research, but it focused mainly on complexity rather than quality. The authors did not develop a new quality model; instead they used the ISO 9126 quality model [58] and concentrated on the characteristics usability and maintainability. The first metric proposed in the paper is size. The software
metric of LOC (lines of code) was adapted to count the number of elements, attributes, entities and notations in a DTD. The second metric evaluates structural complexity: the McCabe complexity [McCabe 1976] was adopted, with additional edges added to the DTD graph to derive a new metric. The third metric is structure depth. A DTD can be represented as a graph in which the depth of a leaf node is 0, the depth of any other node is the maximal depth of its child nodes plus 1, and the depth of the whole graph is the depth of the root node. The fourth metric is fan-in. In contrast to depth, it determines the “width” of each element declaration, that is, how many child elements and attributes it has. This value correlates with the number of child nodes in the DTD graph. The formula for this metric is:
\[ \text{Fan-In}(n) = |\{\, n_i \mid n_i \text{ is a child node of } n \,\}| \]
The fifth metric, fan-out, is the counterpart of fan-in. It is derived from the DTD graph and counts how often an element occurs in the DTD declarations, that is, how often an element declaration is reused. The formula for it is:
\[ \text{Fan-Out}(n) = |\{\, n_i \mid n_i \text{ is a parent node of } n \,\}| \]
Both fan-in and fan-out can be calculated using an adjacency matrix. Of these five metrics, size and complexity apply to complete DTDs, whereas depth, fan-in and fan-out can be determined for each element and attribute of a DTD and for the complete graph.
In [Mlynkova et al. 2006], the authors thoroughly investigate existing XML data and its real complexity and structure. Unlike the formal notion of complexity in computer programming, such as O(n) or O(n^2), complexity here means counting the components of each kind in a file, each with its related weight value.
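The depth, fan-in and fan-out metrics of [Klettke et al. 2002] described above can be sketched on a small graph. This is only an illustration under stated assumptions: the DTD graph below is hypothetical, it is stored as a dictionary from each element name to its child nodes, it is assumed to be acyclic, and fan-in and fan-out follow the definitions exactly as given in this survey (children and parents respectively).

```python
# Hypothetical DTD graph: element name -> child nodes in its content model.
DTD_GRAPH = {
    "order": ["customer", "item"],
    "customer": ["name"],
    "item": ["name", "price"],
    "name": [],
    "price": [],
}


def depth(graph, node):
    """Depth of a node: 0 for a leaf, else 1 + the maximal depth of its children.

    Assumes the graph is acyclic; a recursive DTD would need cycle handling.
    """
    children = graph[node]
    if not children:
        return 0
    return 1 + max(depth(graph, child) for child in children)


def fan_in(graph, node):
    """Number of distinct child nodes of the given node (survey's Fan-In)."""
    return len(set(graph[node]))


def fan_out(graph, node):
    """Number of distinct parent nodes of the given node (survey's Fan-Out)."""
    return sum(1 for children in graph.values() if node in children)


if __name__ == "__main__":
    for element in DTD_GRAPH:
        print(element, depth(DTD_GRAPH, element),
              fan_in(DTD_GRAPH, element), fan_out(DTD_GRAPH, element))
```

The depth of the whole graph is the depth of the root, here depth(DTD_GRAPH, "order"). The element name has the highest fan-out because its declaration is reused by both customer and item.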
The authors gathered more than 20 GB of real XML collections in total and studied the relationship between schemas and their instances. The collection contains 16534 XML documents in 133 XML collections. The maximum size of a document is 1.3 MB and the median size is 10 KB. Documents with DTDs account for 74.6% of the collection, and documents with XSDs for 38.2%. The collection is categorized into data-centric documents, document-centric documents, documents for data exchange, reports, research documents and semantic web documents. The authors use 10 logical
categories to make the analysis easier to follow. Their global statistics consider overall properties of XML data such as the number of elements of various types, the number of attributes, paths, depths and the portion of text in documents. The authors claim that most documents are constructed simply, using only a small number of distinct element and attribute names (usually fewer than 150). The maximum depth exceeds 20 only for a few specific, recursive instances. The maximum depth of the corresponding schemas is higher, with a maximum of around 80. The depth statistics show that the typical depth is always less than 10, which confirms previous work [Choi 2002, Bex et al. 2004] on the maximum depth of XML data. The level statistics focus on the distribution of elements, attributes, text nodes and mixed content at each level of XML documents. The numbers show that the highest counts of analyzed nodes are always at the first levels and that the number of occurrences then decreases rapidly. The steep exponential decrease ends around level 20, after which the drop is much slower and shows more fluctuation. The fan-out statistics show that the characteristics of the graph are similar at each level, but that the graph gets thinner with growing depth. The fan-in statistics are just an “inverse” of the fan-out characteristic; the highest fan-in values naturally occur in the doc and ex categories, which have more complicated schema definitions than the rest. The recursion statistics deal with the types and complexity of recursion. The results indicate that XML Schemas cannot be taken as the only source of information for benchmarking, although they are reliable. The mixed-content statistics analyze the structure and complexity of mixed content in more depth, focusing on the average and maximum depth of mixed content and the percentage of simple mixed-content elements.
The authors claim that the structure of mixed-content elements is not complex: the average depth is less than 5, and most of them are simple types that consist only of trivial subelements. Another category is relational statistics. Relational patterns can be easily processed in a relational database or with relational approaches, because they are typically the product of database export routines. The final category is schema statistics, which analyze XML Schema-specific constructs. From this analysis, the authors conclude that many patterns are used in less complex ways than expected. In conclusion, the authors suggest two ways of producing XML documents: one is by flexible schema-driven
database mapping methods and the other is to focus on an auto-configurable XML processing system.
3.2 Metrics for DTDs
The XML Schema language is more capable than DTD of describing the vocabularies of XML documents, and it has a good chance of becoming the XML schema language of the future. The study of DTDs can nevertheless provide a better understanding of XML Schema design. This subsection presents metrics studies on DTDs. Two papers deal with this problem.
3.2.1 DTD Metrics
In [Choi 2002], the author surveyed 60 DTDs collected from the web and provided statistics with respect to a variety of criteria. The statistical study of DTDs was divided into local and global properties. The local properties concern the structure and complexity of the content model, together with properties that affect the parsing of a DTD or its ambiguity. The global properties are those that might be important when mapping XML into some database format. For the local properties, DTDs were grouped into app, data and meta categories based on their function. The first metric was syntactic complexity, for which a depth function was defined as a rough measure. Then the determinism of DTDs was discussed, and this criterion was applied to find out how many of the DTDs were non-deterministic. The ambiguity of DTDs was also explained: a DTD is ambiguous when the mapping is not unique. For the global properties, the reachability of elements was discussed first, which means that for each node n, there is exactly one node reachable from x along the path n. The author claims that separating the unreachable parts of a DTD into smaller DTDs appears to be a better design. Secondly, recursion in DTDs was defined: a DTD is recursive if it has an element that can be reached from itself. A DTD is linear recursive if it is recursive and, for any reachable element, this element occurs only once in a content model and the occurrence is not
enclosed in “*” or “+”. Next, the simple path, simple cycle, chain of stars and hub were introduced. Simple paths occur in non-recursive DTDs and simple cycles occur in recursive DTDs. A chain of stars means that consecutive stars appear in the DTD content model. A hub is an element name with a large fan-in value. The author collected the fan-in values for data DTDs and meta DTDs and presented them graphically. The analysis in this paper provides a good reference for further research on XML.
3.2.2 DTD vs. XML Schema
In [Bex et al. 2004], the authors claimed that DTDs have some shortcomings and inspected a number of DTDs and XSDs to answer two questions. (1) Which of the extra features and expressiveness of XML Schema not allowed by DTDs are effectively used in practice? (2) How sophisticated do regular expressions tend to be in real-world DTDs and XSDs? To answer these two questions, the authors first discussed the expressiveness of XML Schema. They investigated whether the expressive power of single-type SDTDs was used in real-world XSDs. The result showed that only about 15% of the 30 XSDs investigated are true single-type SDTDs. One reason is that expressiveness beyond local tree languages is simply rarely needed. The other is that, because XML Schema is relatively new and its definition is complicated, most users have no clear view of what can be expressed. Secondly, they discussed the derived types of XSDs. XML Schema provides two kinds of types, simple types and complex types, and new types can be created by two mechanisms: extension and restriction. The authors made a statistical analysis of how these two mechanisms are applied to simple and complex types. Thirdly, they discussed additional features of XML Schema. One is the &-operator. Another is the use of ID attributes, which are referred to by IDREF or IDREFS, for elements in an XML document.
In addition, references to elements can be expressed by key/keyref pairs, and the use of namespaces for modularity is a feature in which XML Schema differs from DTDs. The last feature is the ability to redefine types and groups. The fourth topic is the characterization of regular expressions. It answers the second question posed at the beginning of the paper, namely how sophisticated regular expressions tend to be in real-world DTDs and XSDs. The simple
expressions were widely used, which indicates how simple the content models expressed by DTDs and XSDs are. The fifth topic is schema ambiguity. The authors apply the notion of one-unambiguity to check whether the DTDs and XSDs in the study satisfy this requirement. Almost all of them do; only a few do not. In total, the authors collected 109 DTDs and 93 XSDs for the study and applied each metric and criterion introduced in the paper to these samples. The results show that neither schema language has an absolute advantage or disadvantage, which helps the reader to better understand what DTDs and XSDs are. The paper concludes that many features defined in the XML Schema specification are not widely used, especially those related to object-oriented data modeling, such as derivation of complex types by extension. More importantly, almost all XSDs turn out to be local tree grammars, so the expressive power encountered in real-world XSDs is mostly equivalent to that of DTDs. The paper also concludes that the data types of XML Schema overcome a shortcoming of DTDs: the format and type of the text of an element can be specified by restricting simple types. These findings suggest that software engineers can avoid complex types when developing new XML implementations.
3.3 Metrics for XML Schema
This subsection presents studies of metrics for XML Schema. Five papers deal with this subject.
3.3.1 Structural Metrics
With the development of software engineering, metrics for XML schemas are needed to quantify schema size, complexity, quality and other properties, which is instrumental in controlling the processes in which the schemas are involved. The first definition of a suite of metrics for the XML Schema language was provided by [Lammel et al.
2005]. The metrics proposed in that paper range from simple counters of various types of schema nodes to more involved metrics such as McCabe, depth, and breadth.
Table 1. Overview of XML Schema metrics defined by Lammel et al. [Visser 2006, 2]
The authors collected 63 schema projects from different IT sectors and applied each metric to the schemas. This work introduced the fundamental metrics for the XSD language and identified the feature model of the XSD language at a basic level. At a more problem-specific level, it looked into problems related to the so-called impedance mismatch in data-model mapping, which is relevant for XML data binding.
3.3.2 XML Schema usage
The metrics proposed in [Lammel et al. 2005] mainly focus on size. As an extension of that paper, [Visser 2006] proposed a number of more advanced schema metrics that measure properties other than size. These metrics are all defined over graph representations of schema structure and are therefore called structure metrics. They are listed in the following table.
Table 2. Metrics defined in paper [Visser 2006, 8]
Some formulas used in this paper are:
Impurity:
\[ \frac{2(e - n + 1)}{(n - 1)(n - 2)} \cdot 100\% \]
where e denotes the number of edges and n the number of nodes;
Instability based on fan:
\[ \frac{\text{fan-out}}{\text{fan-in} + \text{fan-out}} \cdot 100\% \]
Instability based on coupling:
\[ \frac{C_e}{C_e + C_a} \cdot 100\% \]
Coherence:
\[ C_h = \frac{C_i}{C_i + C_a + C_e} \cdot 100\% \]
In the paper, the authors collected two megabyte file and applied the prototype tool of XsdMetz to measure the metrics proposed in the paper. It gave some intuitions behind the metrics and their potential use.
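As a worked illustration of two of the formulas listed above for [Visser 2006], the sketch below computes tree impurity and fan-based instability from plain counts. The function names are hypothetical and this is not the XsdMetz implementation; it only restates the formulas, with guards for inputs where they are undefined.

```python
def tree_impurity(edges, nodes):
    """Tree impurity in percent: 2(e - n + 1) / ((n - 1)(n - 2)) * 100.

    Yields 0 for a tree (e = n - 1) and 100 for a complete graph.
    """
    if nodes < 3:
        raise ValueError("impurity is undefined for fewer than 3 nodes")
    return 2 * (edges - nodes + 1) / ((nodes - 1) * (nodes - 2)) * 100


def fan_instability(fan_in, fan_out):
    """Instability in percent: fan-out / (fan-in + fan-out) * 100."""
    total = fan_in + fan_out
    if total == 0:
        raise ValueError("instability is undefined when fan-in + fan-out is 0")
    return fan_out / total * 100


if __name__ == "__main__":
    print(tree_impurity(edges=4, nodes=5))   # a tree with 5 nodes
    print(tree_impurity(edges=10, nodes=5))  # a complete graph with 5 nodes
    print(fan_instability(fan_in=3, fan_out=1))
```

A schema graph that is a pure tree has impurity 0, and every extra edge (for example a shared type reference) raises the value toward 100.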
3.3.3 XML Schema Complexity and Quality index
In [McDowell 2004], the authors focused on measuring the quality of XML Schemas and the complexity of conforming XML documents. They proposed metrics based on those enumerated in [Klettke et al. 2002]. They include:
Number of Complex Type Declarations:
  - Number of Text Complex Type Declaration
  - Number of Element Complex Type Declaration
  - Number of Mixed Complex Type Declaration
Number of Simple Type Declarations
Number of Annotations
Number of Derived Complex types
Average Number of Attributes per Complex Type
Number of Global Type Declaration
Number of Global Type Reference
Number of Unglobal Element
Number of Unbounded Element Multiplicity
Average Bounded Element Multiplicity
Average Number of Restrictions per Simple Type
Element Fanning
Table 3. Metrics defined and experimented in [McDowell 2004, 543]
Based on these metrics, two indices for measuring quality and complexity are proposed. The formulae for calculating them are:
Quality index = (Ratio of simple to complex type declarations) * 5
  + (Percentage of annotations over total number of elements) * 4
  + (Average restrictions per simple type declaration) * 4
  + (Percentage of derived complex type declarations over total number of complex type declarations) * 3
  - (Average bounded element multiplicity size) * 2
  - (Average attributes per complex type declaration) * 2
Complexity index = (Number of unbounded elements) * 5
  + (Element fanning) * 3
  + (Number of complex type declarations)
  + (Number of simple type declarations)
  + (Number of attributes per complex type declaration)
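The two indices are weighted sums, so they are easy to compute once the underlying measurements are available. The sketch below restates the formulae as functions; the parameter names are hypothetical, and the scale of the percentage inputs (fractions here rather than values from 0 to 100) is an assumption, since the survey does not state it.

```python
def quality_index(simple_to_complex_ratio, annotation_pct,
                  avg_restrictions_per_simple, derived_complex_pct,
                  avg_bounded_multiplicity, avg_attrs_per_complex):
    """Weighted sum for the quality index as stated in the survey.

    Percentages are passed as fractions (an assumption of this sketch).
    """
    return (simple_to_complex_ratio * 5
            + annotation_pct * 4
            + avg_restrictions_per_simple * 4
            + derived_complex_pct * 3
            - avg_bounded_multiplicity * 2
            - avg_attrs_per_complex * 2)


def complexity_index(unbounded_elements, element_fanning,
                     complex_types, simple_types, attrs_per_complex):
    """Weighted sum for the complexity index as stated in the survey."""
    return (unbounded_elements * 5
            + element_fanning * 3
            + complex_types
            + simple_types
            + attrs_per_complex)


if __name__ == "__main__":
    # Made-up measurements for a small schema, for illustration only.
    print(quality_index(2.0, 0.5, 1.0, 0.25, 3.0, 1.5))
    print(complexity_index(4, 2.5, 10, 6, 1.5))
```

Note that the two subtracted terms mean the quality index can be negative for schemas with many attributes or large bounded multiplicities.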
The quality index is intended to indicate the quality of Schema documents. The complexity index is used for XML documents that are validated by the XML Schema. Both values are compared with the average index calculated by the analysis tool, which gives a relative indication of quality and complexity.
3.3.4 A Quality model for XML Schema evaluation
Based on the XML metrics above, [Sumak et al. 2007] propose an XML Schema quality model for the purpose of assuring XML Schema quality. The model is composed of the following factors.
Figure 3 XML Schema Quality Model [Sumak et al. 2007, 787]
These factors give programmers a basic framework for evaluating quality when designing and analyzing XML Schema documents.
3.3.5 Metrics for XML Complexity
In [Qureshi et al. 2005], the authors focused on different ways of determining the complexity of XML documents, based on different syntactic and structural aspects, in order to lower the complexity of XML documents and improve their reusability and maintainability. Methods proposed by others for measuring complexity have the weakness that they sometimes cannot distinguish between two different documents that have the same expressions. To overcome this problem, the authors proposed the Weight Allocation (WA) algorithm, which assigns weights to the elements of an XML tree according to their distance from the root node (element). An XML document can be represented as a tree by applying the DOM (Document Object Model), and the document can be recovered through a recursive pre-order traversal of that tree. The formula used to decide the complexity of a document is:
\[ \mathrm{weight}(i) = \begin{cases} 1, & i = \mathrm{root}(\mathrm{elem}(D)) \\ \mathrm{weight}(\mathrm{parent}(i)) + 1, & i \in \mathrm{elem}(D) \setminus \{\mathrm{root}(\mathrm{elem}(D))\} \end{cases} \]
where D denotes a given document and elem(D) is the collection of all elements contained in that document. During the calculation, the root element has weight 1, all immediate successors of the root element have weight 2, and so on. Weights are assigned to each element in elem(D) on the basis of its distance from the root element. This algorithm offers a way of gauging the quality and comprehensibility of XML documents.
3.3.6 Metrics for XML Schema complexity
In [Basci et al. 2007], a new complexity metric for XML Schema documents was proposed. The metric is based on the internal architecture of XSD components. Unlike methods that measure complexity by counting schema components, the authors consider the design architecture of the XSD. Each component is assigned a weight value, called its complexity degree, which reflects the complexity of that component.
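Returning briefly to the Weight Allocation scheme of [Qureshi et al. 2005] from Section 3.3.5, the weight rule can be sketched with a recursive pre-order traversal. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, and Python's ElementTree is used here in place of a DOM API.

```python
import xml.etree.ElementTree as ET


def allocate_weights(xml_text):
    """Return (tag, weight) pairs in pre-order.

    The root element gets weight 1 and every other element gets the
    weight of its parent plus 1.
    """
    root = ET.fromstring(xml_text)
    weights = []

    def visit(element, weight):
        weights.append((element.tag, weight))
        for child in element:
            visit(child, weight + 1)

    visit(root, 1)
    return weights


if __name__ == "__main__":
    document = "<product><number>557</number><size>10</size></product>"
    for tag, weight in allocate_weights(document):
        print(tag, weight)
```

Two documents with the same set of element names but different nesting produce different weight lists, which is the distinction the algorithm is meant to capture.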
The total complexity is obtained by summing the weight values of the components. For an XSD, the formula for total complexity is:
\[ C(XSD) = C(V_g) + C(G_g) + C(T_g) \]
Here, C(V_g) is the total complexity value of all unreferenced global elements and attributes, whose weight values are given by their type complexity values; C(G_g) is the total complexity value of unreferenced global element and attribute groups; and C(T_g) is the total complexity value of unreferenced
global complex and simple type definitions of the XSD. Here, “unreferenced” means components that are not referenced within any component definition of the current schema. In calculating each term, the weight value is determined by the type of group definition, and different types may have different complexity weight values. The authors illustrated the metric by applying it to a WSDL file and calculating its complexity. The results show that this metric gives a better indication than metrics that measure complexity by counting each kind of schema component.
3.3.7 Summary of Research on XML Schema and DTD
Authors | Conference/Journal | Paper Title | Main Contributions
--- | --- | --- | ---
Arnaud Sahuguet | Proceedings of WebDB 2000 | Everything You Ever Wanted to Know About DTDs, But Were Afraid to Ask | It explores some shortcomings of DTDs and how people actually (mis)understand them. It also gives some replacements for this problem.
Byron Choi | Proceedings of WebDB 2002 | What Are Real DTDs Like | It analyzes the structures of XML and provides statistics on some structures of real DTDs.
Meike Klettke, Lars Schneider, and Andreas Heuer | Workshops on EDBT 2002 | Metrics for XML Document Collections | Five metrics are proposed to evaluate the schema of XML documents.
Geert Jan Bex, Frank Neven, Jan Van den Bussche | Workshop on the Web and Databases 2004 | DTDs versus XML Schema: A Practical Study | It explores the features of XML Schema not allowed in DTDs, and the structural properties of the two forms.
Andrew McDowell, Chris Schmidt, Kwok-Bun Yue | Proceedings of CSREA Press 2004 | Analysis and Metrics of XML Schema | It proposes eleven metrics to measure the quality and complexity of XML Schema.
Ralf Lammel, Stan Kitsis, Dave Remy | Proceeding of XML 2005 Conference | Analysis of XML schema usage | It introduces essential concepts of XML schema analysis based on software code metrics. It analyzes the quantitative and qualitative information for XML Schema.
Mustafa H. Qureshi, M. H. Samadzadeh | Proceedings of ITCC 2005 | Determining the Complexity of XML Documents | It focuses on determining the complexity of XML documents based on different syntactic and structural characteristics.
Zi Lin, Bingsheng He, Byron Choi | Proceedings of ER 2006 | A Quantitative Summary of XML Structures | It proposes some metrics for major structural properties of XML, especially the nestings of entities and one-to-many relationships.
Irena Mlynkova, Kamil Toman, and Jaroslav Pokorny | Conference on Management of Data, 2006 | Statistical Analysis of Real XML Data Collections | It analyzes real XML data, their structure and their real complexity.
Joost Visser | Proceedings of XATA 2006 | Structure Metrics for XML Schema | It proposes a suite of metrics for XML Schema to measure the structural properties.
Bostjan Sumak, Marjan Hericko, Maja Pusnik | Proceedings of the ITI conference, 2007 | Towards a Framework for Quality XML Schema Evaluation | It analyzes XML metrics and proposes a quality framework for XML Schema.
Dilek Basci, Sanjay Misra | OOPSLA 2007 | Complexity Metric for XML Schema Documents | It proposes a metric based on the internal architecture of the XSD components and considers the complexities of its building components.
3.4 Metrics for Software components
XML Schemas are software artifacts that are claiming an increasingly central role in software construction projects. Schemas have come into use as interface definitions, data models, protocol specifications and more. With increased reliance on schemas comes the necessity of properly embedding them in the software engineering process. In software engineering, the design of XML Schemas plays an important role in the software development process. A great deal of research has been done on software metrics, and the study of software metrics may offer some inspiration for XML Schema metrics. In the following part, some research on software component metrics is surveyed.
3.4.1 Component metrics to measure component quality
Component-based software development (CBD) requires a considerably different approach from OO methods. OO methods develop systems by defining functional and object models; by comparison, CBD methods utilize commonality and variability (C&V) analysis, components, component interfaces and relationships among components. OO metrics focus only on objects or classes, whereas a component consists of one or more classes as well as interfaces. Therefore, the various metrics developed for OO programming cannot be applied equally to CBD [Cho et al. 2001]. In [Cho et al. 2001], the authors proposed metrics for measuring the complexity, customizability and reusability of software components. For measuring complexity, the authors proposed a new metric that draws on cyclomatic complexity [McCabe 1976], called the Component Complexity Metric (CCM). The CCM is classified into four kinds of complexity metrics: component plain complexity (CPC), component static complexity (CSC), component dynamic complexity (CDC) and component cyclomatic complexity (CCC). The formulas for calculating these metrics are as follows. For CPC,
\[ CPC(C) = CmpC + \sum_{i=1}^{m} CC_i + \sum_{j=1}^{n} MC_j \]
where CmpC is calculated by counting classes, abstract classes, interfaces and methods; CC_i is the complexity of each class; and MC_j is the complexity of each method.
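The CPC formula above is a plain sum, which the following sketch makes explicit. The function names and the sample numbers are hypothetical; how the per-class and per-method complexities are obtained is left open here, as the survey does not detail it at this point.

```python
def component_count(classes, abstract_classes, interfaces, methods):
    """CmpC: the count of classes, abstract classes, interfaces and methods."""
    return classes + abstract_classes + interfaces + methods


def component_plain_complexity(cmp_c, class_complexities, method_complexities):
    """CPC(C) = CmpC + sum of class complexities + sum of method complexities."""
    return cmp_c + sum(class_complexities) + sum(method_complexities)


if __name__ == "__main__":
    # Made-up component: 3 classes, 1 abstract class, 2 interfaces, 8 methods.
    cmp_c = component_count(3, 1, 2, 8)
    print(component_plain_complexity(cmp_c, [2, 3, 1], [1, 1, 2, 4]))
```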
Component Static Complexity focuses on how complex the component’s internal structure is. The formula is defined as:
\[ CSC = \sum_{i=1}^{m} \left( Count(R_i) \times W(R_i) \right) \]
where Count(R_i) is the count of each relationship between classes, and W(R_i) is the weight value of each relationship. The Component Dynamic Complexity measures the complexity of internal messages in a component from a dynamic view. The formula is defined as:
\[ CDC = \sum_{i=1}^{m} DC(IM_i) \]
where DC(IM_i) represents the complexity of each interface method.
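As a small illustration of the CSC formula, the sketch below multiplies the count of each relationship type by a weight and sums the results. The relationship names and weight values are purely hypothetical placeholders, since this survey does not list the weights used by [Cho et al. 2001].

```python
# Hypothetical weights per relationship type (not taken from the paper).
RELATIONSHIP_WEIGHTS = {
    "dependency": 1,
    "association": 2,
    "generalization": 3,
    "composition": 4,
}


def component_static_complexity(relationship_counts, weights=RELATIONSHIP_WEIGHTS):
    """CSC: the sum over relationship types of Count(R_i) * W(R_i)."""
    return sum(count * weights[kind]
               for kind, count in relationship_counts.items())


if __name__ == "__main__":
    # Made-up component with 3 associations and 2 compositions.
    print(component_static_complexity({"association": 3, "composition": 2}))
```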
The Component Cyclomatic Complexity is used after the component implementation is finished. The formula is defined as:
\[ CCC = CmpC + \sum_{i=1}^{m} CC_i + \sum_{j=1}^{n} MC_j + \sum_{i=1}^{m} \sum_{k=1}^{o} CCM_k \]
where CmpC is the sum of classes, interfaces and interface methods; the sum over CC_i is the sum of the complexity of each class contained in a component; and the sum over MC_j is the sum of the complexity of each interface method. The cyclomatic complexity of a method, CCM_k, is defined as the number of edges minus the number of nodes plus 2. For measuring reusability, two approaches were proposed. One is a metric that measures how reusable a component is. The other is a metric that measures how much a component is reused in a particular application. For the former approach, the formula is defined as:
\[ CR = \frac{\sum_{i=1}^{n} Count(CCM_i)}{\sum_{j=1}^{m} Count(CIM_j)} \]
where Count(CCM_i) is the count of each interface method that provides common functions among several applications in a domain, and Count(CIM_j) is the count of methods declared in interfaces provided by a component. The latter approach is a metric to measure a particular component’s reuse level per application in CBSD. The formula is defined as:
\[ CRL_{LOCs}(C) = \frac{Reuse(C)}{Size(C)} \times 100\% \]
where Reuse(C) is the number of lines of code of the component that are reused in an application, and Size(C) is the total number of lines of code delivered in the application. The authors applied the metrics to several projects in the banking domain for evaluation. The results show that the proposed metrics help to evaluate the development and testing effort. They also provide valuable information to component developers, component assemblers and application developers.
3.4.2 Software component reusability metric
In [Washizaki et al. 2005], a metrics suite for measuring the reusability of software components was proposed. The novelty of this suite compared to [Barnard et al. 1998] and [Cho et al. 2001] is that the metrics are defined with confidence intervals, which were set by statistical analysis of a number of JavaBeans components. The authors also combined the proposed metrics into a reusability model for identifying black-box components with high reusability. The metrics defined are:
1) Existence of Meta-Information (EMI) measures whether the BeanInfo class corresponding to the target component is provided.
2) Rate of Component Observability (RCO) is the percentage of readable properties among all fields implemented within the Façade class of a component. It indicates the component’s degree of observability for users of the component.
3) Rate of Component Customizability (RCC) is the percentage of writable properties among all fields implemented within the Façade class of a component. It indicates the component’s degree of customizability for users of the component.
4) Self-Completeness of Component’s Return Value (SCCr) is the percentage of business methods without any return value among all business methods implemented within a component. It indicates the component’s degree of self-completeness and its low degree of external dependency for users of the component.
5) Self-Completeness of Component’s Parameter (SCCp) is the percentage of business methods without any parameters among all business methods implemented within a component. It indicates the component’s degree of self-completeness and low degree of external dependency for users of the component.
Finally, a reusability metric named Component Overall Reusability (COR) is defined by combining
Slide 21: all of the above five metrics based on the reusability model. It indicates the component’s degree of reusability for users of the component.
COR (c) = 1.76 VEMI (c) + VRCC (c) + VSCCr (c) − 1.13 3
The component c is reusable if the value of COR(c) is larger than 0. COR can be used effectively for component selection in terms of reusability when several black-box components implement the same specification. In [Rotaru et al. 2005], the authors proposed two reusability metrics for software components: compose-ability and adaptability. Both metrics were initially designed for database components. The degree of compose-ability of a component quantifies how easily the component can be combined with others. This metric is qualitatively defined by studying the parameters and return values of the component's interface methods. A software component whose interface consists only of methods with no parameters and no return values has the largest compose-ability degree because it does not have any external data dependencies. The adaptability metric measures the ability of a component to accommodate changes in the environment and is evaluated by the complexity of its interface.
3.4.3 Usability metrics for software component
Similar work was done in [Bertoa et al. 2004], where the authors presented a suite of usability metrics for software components based on the ISO 9126 Quality Model. The authors pointed out that one deficiency of other software component metrics is that they do not consider attributes, nor do they associate any quality characteristic with the metrics themselves. The metrics defined in this paper therefore address these two issues. The measurable concepts related to usability are quality of documentation, complexity of the problem and complexity of the solution, and metrics are presented for each of these three attributes. The advantage of proposing metrics for attributes is that the metrics can measure the attributes directly.
Slide 22: As an extension of [Bertoa et al. 2004], in [Bertoa et al. 2006] the authors present a set of measures to evaluate the usability of software components and show how to validate these measures. The paper also shows that appropriate combinations of measures can evaluate the usability of a component better than any individual measure. The metrics used are those proposed in [Bertoa et al. 2004], but the authors conducted experiments measuring the understandability, learnability and operability of a set of software components. The results showed that understandability depends on the structure and organization of the manual as well as on the simplicity of the methods’ signatures. Learnability depends on both the quality of the manuals and the complexity of the component’s design. Operability mainly depends on the amount of information available about the component's configurable parameters. The future work identified in the paper is to conduct more experiments and to develop an analysis model that defines appropriate quality indicators for usability.
3.4.4 Components Interface metrics for reusability
In [Boxall et al. 2004], a set of metrics for measuring the understandability of component interfaces is proposed. For measuring the reusability of components, public and static properties are easier to use than properties such as efficiency and safety, which require a runtime environment for measurement. The authors collected 12 component interfaces written in C and C++ and measured them thoroughly. The metrics proposed include:
1) Interface size: Arguments Per Procedure (APP) measures the mean size of procedure declarations of an interface. It is defined as $APP = n_a / n_p$, where $n_p$ is the total count of procedures that are publicly declared by an interface and $n_a$ is the total count of arguments of the publicly declared procedures.
2) Distinct Argument Count (DAC) measures the consistency of the naming and typing of arguments. It is defined as $DAC = |A|$, where $A$ is the set of name-type pairs used as
Slide 23: arguments in an interface. Interfaces with lower DAC will have arguments that are declared more consistently.
3) Argument Repetition Scale (ARS) also measures the consistency of the naming and typing of arguments. It is an alternative to DAC.
$$ARS = \frac{\sum_{a \in A} |a|^2}{n_a}$$
where $A$ is the set of name-type pairs, $|a|$ is the count of procedures in which argument name-type $a$ is used, and $n_a$ is the total count of arguments. ARS lies in the range $1 \le ARS \le n$. Interfaces with higher ARS tend to be dominated by fewer distinct arguments which are repeated more often.
4) Mean String Commonality (MSC) measures the commonality of a set of identifiers. It is defined as:
$$MSC_A = \frac{1}{n} \sum_{(x,y) \in comb(A)} \frac{lcs(x,y)}{\max(|x|,|y|)}$$
where $A$ is a set of identifiers, $n$ is the count of identifiers, $comb(A)$ computes the set of combinations of two elements from the set $A$, and $lcs(x,y)$ computes the length of the longest common subsequence of $x$ and $y$. If the set of identifiers has zero or one elements, MSC is undefined. Sets of identifiers with higher MSC have more commonality, which suggests that the identifiers have been named consistently and that a naming convention has been adopted.
5) Mean Identifier Length (MIL) is the weighted mean length of identifiers occurring in the interface. Median Identifier Length (MeIL) is the weighted median length of identifiers. It is believed that longer identifiers tend to contain more information and be more self-documenting. The results in the paper showed that MeIL appeared to be better than MIL because it is resistant to outliers.
6) Reference Argument Density (RAD) measures the occurrence of reference arguments in an interface. It is defined as:
Slide 24:
$$RAD = \frac{n_r}{n_a}$$
where $n_r$ is the count of pass-by-reference arguments and $n_a$ is the total argument count. A higher RAD indicates that the interface tends to be more difficult to understand. Interface metrics can improve the understanding of the reusability of components. They provide accurate and efficient information for reuse analysis because interfaces can be accessed conveniently even when other source information is unavailable.
3.4.5 Metrics for the integration of software components
Much research has focused on the reusability of software components. In [Narasimhan et al. 2004], the authors take a practical starting point and propose a suite of metrics to measure complexity and criticality for the integration of software components. The complexity metrics consist of component packing density (CPD) metrics and component interaction density (CID) metrics. CPD is defined as the ratio of the number of constituents to the number of components. It is used to identify the density of integrated components. The formula is:
$$CPD_{constituent\_type} = \frac{\#\langle constituent \rangle}{\#components}$$
where #<constituent> is the number of lines of code, operations, classes, and/or modules in the related components. Which constituent is counted depends on the information available from the definition of the component, such as the number of classes, lines of code (LOC), modules, or operations of each component. Interaction density metrics measure the interactions that happen at a component’s interface or through consuming other components' events. An interaction also happens when a component submits an event and other components receive it. The interaction density metric proposed in this paper is called Average Interaction Density (AID). To measure AID, the Interaction Density of a Component (IDC) is first introduced as the ratio of the actual number of interactions to the available interactions in a component. AID is then defined as:
Slide 25:
$$AID = \frac{IDC_1 + IDC_2 + \dots + IDC_n}{\#components}$$
where $IDC_1$ to $IDC_n$ are the interaction densities of components 1 to n, and #components is the number of existing components in the actual system. For these metrics, a component with a high density value indicates the need for highly qualified professionals to do the design. The paper also proposes criticality metrics, which identify the critical components in the software. In a system, a component is called critical if it has link criticality, bridge criticality, inheritance criticality or size criticality. The metrics are defined as follows. Link criticality is the sum of the number of links connected to a component. Bridge criticality can be obtained from the incidence matrix by identifying which components function as bridges. Inheritance criticality can be obtained from the inheritance tree. Size criticality is the number of lines of code, modules, operations or classes. These critical numbers give a better understanding of which components face higher risk than others and indicate the risk associated with each component.
As an extension of [Narasimhan et al. 2004], in [Narasimhan et al. 2006] the authors present two suites of metrics. The first suite consists of static metrics, which measure the complexity and criticality of component assembly. Complexity is measured using the Component Packing Density and Component Interaction Density metrics. Further, the complexity and criticality metrics can be combined to form a Triangular Metric. The complexity metrics fall into two groups: one deals with component packing density (CPD) and the other with component interaction density (CID). They are defined as $CPD_{constituent\_type} = \#constituent / \#components$ and $CID = \#I / \#I_{max}$, respectively, where #constituent is the number of LOC, objects/classes, operations or modules, #I is the number of actual interactions, and $\#I_{max}$ is the number of maximum available interactions.
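The density ratios described above are simple enough to sketch in a few lines of Python. The function names and the sample counts below are hypothetical and serve only to make the definitions of CPD, IDC/CID and AID concrete.

```python
def packing_density(constituent_counts):
    """CPD: total number of constituents (e.g. LOC) over the number of components."""
    return sum(constituent_counts) / len(constituent_counts)


def interaction_density(actual, available):
    """IDC for one component (or CID for an assembly): #I / #I_max."""
    return actual / available


def average_interaction_density(idcs):
    """AID: mean of the per-component interaction densities."""
    return sum(idcs) / len(idcs)


# Hypothetical assembly of three components.
loc_per_component = [120, 300, 180]            # lines of code per component
cpd_loc = packing_density(loc_per_component)   # (120 + 300 + 180) / 3

# (actual interactions, available interactions) per component
interactions = [(2, 4), (3, 4), (1, 4)]
idcs = [interaction_density(a, m) for a, m in interactions]
aid = average_interaction_density(idcs)
```

In this made-up assembly the second component has the highest interaction density (0.75), so under the authors' reading it would be the one that deserves the most design attention.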
For the criticality metrics, four metrics are presented: Link Criticality, Bridge Criticality, Inheritance Criticality and Size Criticality. The authors define the Link Criticality metric as the number of components, the Bridge Criticality metric as the number of bridge components, the Inheritance Criticality metric as the number of root components that have inheritance, and the Size Criticality metric as the number of components that exceed a given critical value. The final
Slide 26: Criticality Metric is the sum of the above four metrics. The other suite consists of dynamic metrics, which are collected during the runtime of a complete application. The dynamic metrics defined in the paper are as follows. The first is the Number of Cycles (NC) metric. The second is the Average Number of Active Components, defined as the number of active components divided by the time to execute the application. The third is the Active Component Density (ACD), defined as the number of active components divided by the number of available components. The fourth is the Average Active Component Density (AACD), defined as $\sum_n ACD_n / T_e$, where $\sum_n ACD_n$ is the sum of ACD and $T_e$ is the time to execute a function. The fifth is the Peak Number of Active Components, defined as $PNA_{\Delta t} = \max\{AC_1, \dots, AC_n\}$, where $AC_n$ is the number of active components at time n and $\Delta t$ is the time interval in seconds. The dynamic metrics help to identify super-components and to evaluate the utilization of components. As a further extension of [Narasimhan et al. 2006], in [Narasimhan et al. 2007] the authors illustrate in detail the static and dynamic metrics proposed in [Narasimhan et al. 2006] and validate them using Weyuker’s properties. Most of these metrics fulfill Weyuker’s property criteria. The authors also examined the impact of these metrics in the context of McCall’s Quality Model and concluded that the metrics can help component-based developers identify complexity and criticality in an integrated system.
3.4.6 Summary of Research on software component metrics
Authors | Conference/Journal | Paper Title | Main Contributions
Eun Sook Cho, Min Sun Kim, Soo Dong Kim | Conference of APSEC 2001 | Component Metrics to Measure Component Quality | It proposes metrics for measuring the complexity, customizability and reusability of components
Manuel F. Bertoa, Antonio Vallecillo | Workshop of QAOOSE, 2002 | Usability metrics for software components | A suite of software usability metrics is defined based on the ISO 9126 Quality Model
Slide 27:
H Washizaki, H Yamamoto, and Y Fukazawa | Proceedings of METRICS 2003 | A metrics suite for measuring reusability of software components | It proposes 5 metrics to measure a component’s understandability, adaptability, and portability with confidence interval
MAS Boxall, Saeed Araban | Conference of ASWEC 2004 | Interface Metrics for Reusability Analysis of Components | It presents a set of metrics for measuring the understandability and reusability
VL Narasimhan, B Hendradjaya | Workshop of WOSSA 2004 | A new suite of metrics for the integration of software components | It presents two sets of metrics to measure complexity and criticality of software systems
Geert Jan Bex, Frank Neven, Jan Van den Bussche | Workshop on the Web and Databases 2004 | DTDs versus XML Schema: A Practical Study | It explores the features of XML Schema not allowed in DTDs, and the structural properties of the two forms
OP Rotaru, Marian Dobre | Conference on Computer Systems and Applications, 2005 | Reusability Metrics for Software Components | It proposes metrics and a mathematical model for measuring the adaptability and compose-ability
Manuel F. Bertoa, Jose M. Troya, Antonio Vallecillo | The Journal of Systems and Software, 2006 | Measuring the usability of software components | It presents a set of measures to assess the usability of software components for building a quality model
VL Narasimhan, B Hendradjaya | Advances in Systems, Computing Sciences and Software Engineering, 2006 | Detailed theoretical considerations for a suite of metrics for integration of software components | It presents two sets of metrics based on static and dynamic aspects of component assembly
VL Narasimhan, B Hendradjaya | Information Sciences, 2007 | Some theoretical considerations for a suite of metrics for the integration of software components | It presents two sets of metrics based on static and dynamic aspects of component assembly and evaluates them using Weyuker’s set of properties
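As a concrete illustration of the interface metrics of Section 3.4.4 (APP, DAC, ARS and RAD), the following Python sketch computes them for a small interface. The interface itself is invented, and the sketch assumes that a name-type pair appears at most once per procedure, so that counting occurrences of a pair equals counting the procedures that use it.

```python
from collections import Counter

# Hypothetical interface: procedure -> list of (arg name, arg type, by reference?)
interface = {
    "open":  [("path", "const char*", True), ("mode", "int", False)],
    "read":  [("fd", "int", False), ("buf", "char*", True), ("len", "size_t", False)],
    "write": [("fd", "int", False), ("buf", "char*", True), ("len", "size_t", False)],
    "close": [("fd", "int", False)],
}


def interface_metrics(procedures):
    """Compute APP, DAC, ARS and RAD for a mapping of procedures to arguments."""
    args = [arg for arg_list in procedures.values() for arg in arg_list]
    n_p = len(procedures)          # publicly declared procedures
    n_a = len(args)                # total arguments
    # |a| for each distinct name-type pair a (assumes one use per procedure)
    pair_counts = Counter((name, typ) for name, typ, _ in args)
    n_r = sum(1 for _, _, by_ref in args if by_ref)
    return {
        "APP": n_a / n_p,
        "DAC": len(pair_counts),
        "ARS": sum(c * c for c in pair_counts.values()) / n_a,
        "RAD": n_r / n_a,
    }


metrics = interface_metrics(interface)
```

For this example the nine arguments collapse to five distinct name-type pairs, so DAC is 5 and ARS is 19/9, which reflects the repeated `fd`, `buf` and `len` arguments.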
Slide 28: 3.5 Metrics for Object oriented design
Two papers in this survey deal with OO metrics for software. In [Chidamber et al. 1994], the authors proposed a new suite of theoretically and mathematically based metrics for OO design. The proposed metrics include:
a) Weighted Methods per Class (WMC), $WMC = \sum_{i=1}^{n} c_i$, where $c_i$ is the complexity of method i in the class.
b) Depth of Inheritance Tree (DIT).
c) Number of Children (NOC), the number of immediate subclasses subordinated to a class in the class hierarchy.
d) Coupling Between Object classes (CBO), a count of the number of other classes to which a class is coupled.
e) Response For a Class (RFC). The value of this metric is the size of the response set for the class, where the response set is the set of methods that can potentially be executed in response to a message received by an object of that class.
f) Lack of Cohesion in Methods (LCOM). The LCOM is the count of method pairs whose similarity is zero minus the count of method pairs whose similarity is not zero. It measures the inter-relatedness between portions of a program. A high LCOM value indicates disparateness in the functionality provided by the class.
To evaluate the metrics, the authors applied the criteria proposed by Weyuker to each metric. The work in this paper laid a solid foundation for later software metrics analysis and for the design of software systems. In [Barnard et al. 1998], the authors defined a new reusability metric that can be used to give values for the reusability of OO code. Before the metric was developed, the authors considered all factors that might affect the reusability of OO software. These factors can be grouped into four categories: classes, attributes, methods and input parameters. All of these factors are listed below:
Metrics for class
Slide 29: Table 4 metrics for class defined in [Barnard et al. 1998, 37]
Metrics for Attribute
Table 5 metrics for Attribute defined in [Barnard et al. 1998, 37]
Metrics for method
Table 6. Metrics for Method defined in [Barnard et al. 1998, 37]
Slide 30: Method input parameter metrics
Table 7 metrics for Method input parameter defined in [Barnard et al. 1998, 38]
However, the experiments in the paper produced results that conflict with some published works which claim that the size and complexity of a program are relevant to reusability. The paper showed that only certain factors are relevant to reusability. For example, reusable code does not necessarily have low code complexity. The authors then suggested a “black box” approach in which only the interface is important for reusability. The new model is as follows:
For class
  Low coupling: Low number of calls to foreign classes (should be zero)
  Low depth of inheritance
  Good documentation: Meaningful class name and descriptions, including methods and attributes
For attributes
  Simple type: few sub-types
  Meaningful name and description
For methods
  Method must perform one function only
  Low number of calls to foreign classes (should be zero)
  Method must be robust: full coverage for all input parameters
  Meaningful name and description
For input parameters
  Simple type: few sub-types
  Meaningful name and description
Table 8 The new model for reusability metric defined in [Barnard et al. 1998, 46]
The reusability metric is then defined as:
$$Reusability_c = R_c \times \left( \frac{\sum_{i=1}^{cm} R_{m(i)}}{cm} + \frac{\sum_{j=1}^{ca} R_{a(j)}}{ca} \right)$$
where
Slide 31:
$$R_c = \frac{MD_c + MN_c}{CBO + DIT + 1}$$
$$R_{m(i)} = \frac{MN_{(m)i} + MC_i + \left( \sum_{k=1}^{p} R_{p(k)} \right) / p}{NF_i + NFC_i}$$
$$R_{p(k)} = \frac{MN_p}{PC_{i(k)} + 1}, \quad \text{or } R_{p(k)} = 5 \text{ if } p = 0$$
$$R_{a(j)} = \frac{MN_{(a)j}}{AC_j + 1}$$
cm: the number of methods in the class; ca: the number of attributes in the class; p: the number of input parameters of method i; CBO: the number of classes this class is coupled to; MNc: meaningful name; MDc: meaningful description; MC: method coverage; PC: parameter complexity; NF: number of functions performed; NFC: number of calls to foreign classes. This metric helps programmers write more reusable code when working on software design.
3.5.1 Summary of Research on Object Oriented metrics
Authors | Conference/Journal | Paper Title | Main Contributions
Shyam R. Chidamber, Chris F. Kemerer | IEEE Transactions on Software Engineering, 1994 | A metrics Suite for Object Oriented Design | This is a milestone paper. It first proposes six metrics for OO design
Judith Barnard | Software Quality Journal, 1998 | A new reusability metric for object-oriented software | It identifies the factors affecting the reusability and proposes a reusability metric for OO software
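Two of the Chidamber and Kemerer metrics from Section 3.5, WMC and LCOM, can be sketched in Python as follows. The class data is hypothetical, two methods are treated as similar when they share at least one instance variable, and the result is floored at zero as in Chidamber and Kemerer's original definition (a detail the summary above does not spell out).

```python
from itertools import combinations


def wmc(method_complexities):
    """Weighted Methods per Class: sum of the complexities c_i of the methods."""
    return sum(method_complexities)


def lcom(method_attributes):
    """Lack of Cohesion in Methods.

    method_attributes maps each method name to the set of instance
    variables it uses. Pairs with an empty intersection have similarity
    zero; the metric is (#disjoint pairs - #sharing pairs), floored at 0.
    """
    disjoint = sharing = 0
    for used_a, used_b in combinations(method_attributes.values(), 2):
        if used_a & used_b:
            sharing += 1
        else:
            disjoint += 1
    return max(disjoint - sharing, 0)


# Hypothetical class with four methods and three instance variables.
usage = {
    "deposit": {"balance"},
    "withdraw": {"balance"},
    "rename": {"owner"},
    "audit": {"log"},
}
lcom_value = lcom(usage)          # 5 disjoint pairs, 1 sharing pair
wmc_value = wmc([1, 2, 1, 3])     # assumed per-method complexities
```

Only `deposit` and `withdraw` share a variable here, so five of the six method pairs are disjoint and the class scores an LCOM of 4, hinting that `rename` and `audit` may belong elsewhere.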
3.6 Metrics for Semantic Web Schemas
The latest study of semantic web schemas is [Theoharis et al. 2008], in which the authors analyze the graph features of semantic web (SW) schemas (RDF/S). The main finding of this paper is that the majority of SW schemas with a significant number of properties approximate a power-law distribution for total-degree. To investigate whether the
Slide 32: property exhibits a power-law degree distribution, the authors analyzed power laws on two different functions of a discrete random variable (DRV) X. The first is the Complementary Cumulative Density Function (CCDF), which measures the frequencies of X values. The second, denoted Value versus Rank (VR), measures the relationship between the i-th biggest X value and its rank. The results showed that both CCDF and VR exhibit a power-law distribution. This motivates investigating whether XML schemas have the same power-law property.
3.6.1 Summary of Research on Semantic Web metrics
Authors | Conference/Journal | Paper Title | Main Contributions
Y. Theoharis, Y. Tzitzikas, D. Kotzinos and V. Christophides | IEEE Transactions on Knowledge and Data Engineering, 2008 | On graph features of semantic web schemas | It is the first work analyzing the graph properties of semantic web schemas
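The two functions used in the power-law analysis of Section 3.6 can be computed from a list of degree values as in the Python sketch below. The degree list is made up, and the CCDF is taken here as P(X >= x), which is one common convention and an assumption on my part; a power law would show up as an approximately straight line when either result is plotted on log-log axes.

```python
from collections import Counter


def ccdf(values):
    """Empirical CCDF: maps each distinct x to the fraction of values >= x."""
    n = len(values)
    counts = Counter(values)
    result = {}
    remaining = n
    for x in sorted(counts):
        result[x] = remaining / n
        remaining -= counts[x]
    return result


def value_vs_rank(values):
    """Value versus Rank: (rank, value) pairs, rank 1 being the largest value."""
    ordered = sorted(values, reverse=True)
    return [(rank, value) for rank, value in enumerate(ordered, start=1)]


# Hypothetical total-degree values of eight schema nodes.
degrees = [1, 1, 1, 1, 2, 2, 3, 8]
degree_ccdf = ccdf(degrees)
degree_vr = value_vs_rank(degrees)
```

The same two functions could be applied to element or type degrees extracted from an XML Schema graph to check for a similar distribution.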
4 CONCLUSIONS
In this survey I have reviewed the current status of research on XML Schema metrics and on metrics for software components. Different claims have been made about metrics for XML and XML Schema documents, and this research derives from the metrics for software components, a field which has reached a high level of maturity. Studies of XML Schema, which mainly focus on document size, structure, complexity and reusability analysis, need further improvement. Developing specialized metrics that evaluate XML documents is one future task [Klettke et al. 2002]. Schema analysis is a key factor in software development, schema-based data management and schema-aware XML programming. Designing a suite of metrics that analyzes a schema's features, styles, and other properties is becoming an integral part of the XML schema development process, given appropriate tool support. This helps future software and schema developers express intentions about complexity, features and styles when designing schemas or software components [Lammel et al. 2005]. Combining this with the latest study on semantic web schemas, such as RDF/RDFS and OWL, to see whether the
Slide 33: XML Schema documents have the same power-law distribution property will be a subject of my continuing research in the following semesters.
5 ACKNOWLEDGEMENTS
I gratefully thank Dr. Richard Frost for the invaluable guidance and support I received this semester. He taught me a great deal about how to conduct a literature review and how to compose a survey. I also sincerely thank my supervisor, Dr. Jianguo Lu, for his suggestions and help on my research topic.
Slide 34: REFERENCES
1 Klettke, M., Schneider, L., and Heuer A. Metrics for XML Document Collections. Proceedings of the Workshops XMLDM, MDDE, and YRWS on XML-Based Data Management and Multimedia Engineering. Lecture Notes in Computer Science, 2490:15-28, 2002.
2 Sahuguet A. Everything You Ever Wanted to Know About DTDs, But Were Afraid to Ask. In Third International Workshop WEBDB2000. Lecture Notes in Computer Science. 1997: 171–183, 2000.
3 McDowell A., Schmidt C., Yue K. Analysis and Metrics of XML Schema. Proceedings of the International Conference on Software Engineering Research and Practice, CSREA Press (2004), 538-544, 2004.
4 Martens W., Neven F., Schwentick T., Bex GJ. Expressiveness and complexity of XML Schema. ACM Transactions on Database Systems (TODS). 31(3), 770-813, 2006.
5 Lammel R., Kitsis S., Remy D. Analysis of XML schema usage. Conference Proceedings XML, Atlanta, Georgia, U.S.A, 2005. Available as: http://www.idealliance.org/proceedings/xml05/ship/49/paper.PDF
6 Bex GJ., Neven F., Bussche JV. DTDs versus XML Schema: A Practical Study. In Proceedings of the Seventh International Workshop on the Web and Databases, WebDB Paris, France, 2004, 79-84, 2004.
7 Choi B. What are real DTDs like? In Proceedings WebDB, Madison, Wisconsin, USA 2002, 43-48, 2002.
8 Mlynkova I., Toman K., Pokorny J. Statistical Analysis of Real XML Data Collections. Proceedings of the 13th International Conference on Management of Data,
Slide 35: Delhi, India, 2006. Available as: http://www.cs.cas.cz/semweb/download.php?file=0605-mlynkova-toman-pokorny&type=pdf
9 Mustafa H. Qureshi, M. H. Samadzadeh. Determining the Complexity of XML Documents. Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'05). 2, 416-421, 2005.
10 Basci D. and Misra S. Complexity Metric for XML Schema Documents. Proceedings of Object-Oriented Programming, Systems, Languages, and Applications, 2007. Available as: http://boss.bekk.no/oopsla2007/papers/p5.pdf
11 Visser J. Structure Metrics for XML Schema. Proceedings of XATA, 2006. Available as: http://www.di.uminho.pt/~joostvisser/publications/StructureMetricsForXMLSchema.pdf
12 Binstock, C., Peterson D., Smith, M., Wooding, M., Dix, C., Galtenberg, C. The XML Schema Complete Reference. Addison Wesley Professional Publishers, 2002.
13 Chidamber, SR., Kemerer CF. A metrics suite for object oriented design. Software Engineering, IEEE Transactions on. 20(6), 476-493, 1994.
14 Washizaki H., Yamamoto H., Fukazawa Y. A Metrics Suite for Measuring Reusability of Software Components. Proceedings of the 9th International Symposium on Software Metrics, 2003, 211-223, 2003.
15 Felix Michel. Representation of XML Schema Components. Master Thesis, School of Information University of California, Berkeley, 2007.
16 Goulão M., Abreu FB. Software Components Evaluation: an Overview. 5ª Conferência da APSI, 2004. Available as: http://ctp.di.fct.unl.pt/QUASAR/Resources/Papers/2004/goulaoCAPSI2004Final.pdf
Slide 36: 17 Cho ES., Kim MS., Kim SD. Component metrics to measure component quality. Software Engineering Conference on, 2001, Eighth Asia-Pacific, Macau, China, APSEC 2001, 419-426, 2001.
18 Hilbert DM., Redmiles DF. Extracting usability information from user interface events. ACM Computing Surveys (CSUR), 32(4), 384-421, 2000.
19 Kafura, D., Reddy, GR. The Use of Software Complexity Metrics in Software Maintenance. Software Engineering, IEEE Transactions on, SE-13(3), 335-343, 1987.
20 Lanubile, F., Visaggio, G. Maintainability via structure models and software metrics. Software Engineering and Knowledge Engineering, 1992. Proceedings., Fourth International Conference on, 590-599, 1992.
21 Lin, Z., He, BS., Choi, B. A Quantitative Summary of XML Structures. Lecture Notes in Computer Science, Conceptual Modeling – ER, 2006, 4215, 228-240.
22 Mayer, T., Halla, T. Critical Analysis of Current OO Design Metrics. Software Quality Journal, 8, 97-110, 1999.
23 Narasimhan, VL., Hendradjaya, B. Theoretical Considerations for Software Component Metrics. Proceedings of World Academy of Science, Engineering and Technology, 10, 165-170, 2005.
24 Narasimhan, VL., Hendradjaya, B. Detailed Theoretical Considerations for a Suite of Metrics for Integration of Software Components. Advances in Systems, Computing Sciences and Software Engineering, Springer Netherlands, 257-264, 2006.
Slide 37: 25 Allen, EB., Gottipati, S., Govindarajan, R. Measuring size, complexity, and coupling of hypergraph abstractions of software: An information-theory approach. Software Quality Journal, 15(2), 179-212, 2007.
26 Jeffrey S. Poulin. Measuring software reusability. Software Reuse: Advances in Software Reusability, 1994. Proceedings., Third International Conference on, 126-138, 1994.
27 Rotaru, OP., Dobre, M. Reusability metrics for software components. Proceedings of the ACS/IEEE 2005 International Conference on Computer Systems and Applications, 24-31, 2005.
28 Narasimhan, VL., Hendradjaya, B. Some theoretical considerations for a suite of metrics for the integration of software components. Information Sciences 177, 844-864, 2007.
29 Jianguo Lu, Ju Wang, Shengrui Wang. XML Schema Matching. International Journal of Software Engineering and Knowledge Engineering, 17(5), 575-597, 2007.
30 Sumak, Bostjan, Hericko, Marjan, Pusnik, Maja. Towards a Framework for Quality XML Schema Evaluation. Information Technology Interfaces, 2007. ITI 2007. 29th International Conference on, 25-28 June 2007, 783-788, 2007.
31 Mignet, L., Barbosa, D. and Veltri, P. The XML web: a first study. In Proc. of International World Wide Web Conf, Budapest, Hungary, May 2003, 500-510, 2003.
32 Bernauer, M., Kappel, G., Kramler, G. Approaches to implementing active semantics with XML schema. Database and Expert Systems Applications, 2003. Proceedings. 14th International Workshop on, 1-5 Sept. 2003, 559-565, 2003.
33 Yijun Yu, Jianguo Lu, Jinghao Xue, Yi Zhang, Weiwei Sun. Localizing XML Documents through XSLT. Applied Informatics, 1059-1064, 2003.
Slide 38: 34 Kotsakis, E., Bohm, K. XML Schema Directory: a data structure for XML data processing. Web Information Systems Engineering, 2000. Proceedings of the First International Conference on, June 2000, 1(19-21), 62-69, 2000.
35 Torabi, T, Dillon, Th, Rahayu, W. XML schema for software process framework. 21st IASTED International Multi-Conference on Applied Informatics, Innsbruck, Austria, 10-13 Feb. 2003, 948-954, 2003.
36 Ramanath, M. Schema-based Statistics and Storage for XML. A Doctoral Thesis, Indian Institute of Science, April, 2006.
37 Jianguo Lu, Yijun Yu, John Mylopoulos. A Lightweight Approach to Semantic Web Service Synthesis. ICDE Workshop, International Workshop on Challenges in Web Information Retrieval and Integration, Tokyo, 240-247, 2005.
38 Farooq, A., Braungarten, R., Dumke, RR. An Empirical Analysis of Object-Oriented Metrics for Java Technologies. 9th International Multitopic Conference, IEEE INMIC 2005, Dec. 2005, 1-6, 2005.
39 Narasimhan, VL., Hendradjaya, B. A New Suite of Metrics for the Integration of Software Components. The First International Workshop on Object Systems and Software Architectures (WOSSA'2004), Victor Harbor, South Australia, 2004. Available as: http://www.cs.adelaide.edu.au/~wossa2004/HTML/09-bayu-paper.pdf
40 Washizaki, H., Yamamoto, Y., Fukazawa, Y. Software Component Metrics and It's Experimental Evaluation. Proc. of International Symposium on Empirical Software Engineering (ISESE2002), Vol.II, 2002.
41 Fenton, NE., Neil, M. Software metrics: roadmap. International Conference on Software Engineering, 2000, 357-370, 2000.
Slide 39: 42 Basili, V.R., Briand, LC., Melo, WL.. A validation of object-oriented design metrics as quality indicators. Transactions on Software Engineering, 22(10), 751-761, 1996. 43 Coleman, D., Ash, D., Lowther, B., Oman, P.. Using Metrics to Evaluate Software System Maintainability. Computer, 27(8), pages: 44-49, 1994. 44 Yu, X., Lamb, DA.. Metrics applicable to software design. Annals of Software Engineering, 1995 – Springer, 1(1), 23-41, 1995. 45 Weyuker, E.J.. Evaluating software complexity measures. IEEE Transactions on Software Engineering, 14(9), 1357-1365, 1988. 46 McCabe, T.J.. A Complexity Measure. IEEE Transactions on Software Engineering, Volume: SE-2(4), 308- 320, 1976. 47 Boxall, M., and Araban, S.. Interface Metrics for Reusability Analysis of Components. Proceedings of the 2004 Australian Software Engineering Conference (ASWEC'04), 4051, 2004. 48 Power, JF., Malloy, BA.. A metrics suite for grammar-based software. Journal of Software Maintenance and Evolution, John Wiley & Sons, Inc., 16(6), 405-426, 2004. 49 Purao, S., and Vaishnavi, V.. Product metrics for object-oriented systems. ACM Comput. Surv., 35, 191-221, 2003. 50 Evanco, WM.. The confounding effect of class size on the validity of object-oriented metrics. Software Engineering, IEEE Transactions on, 29(7), 670 - 672, 2003.
Slide 40: 51 Etzkorn, L., Delugach, H.. Towards a semantic metrics suite for object-oriented design. Technology of Object-Oriented Languages and Systems, 2000. TOOLS 34. Proceedings. 34th International Conference on, 30 July-4 Aug. 2000. 71 – 80, 2000. 52 Barnard, J.. A new reusability metric for object-oriented software. Software Quality Control, 7, 35-50, 1998. 53 Alkadi, G., Carver, D.L.. Application of metrics to object-oriented designs. Aerospace Conference, 1998. Proceedings., IEEE; Volume 4, 21-28 March 1998, 159 – 163, 1998. 54 Gowda, R.G., Winslow, LE.. An approach for deriving object-oriented metrics. Aerospace and Electronics Conference, 1994. NAECON 1994., Proceedings of the IEEE 1994 National, 23-27 May 1994. 897 – 904, 1994. 55 Xenos, M., Stavrinoudis, D., Zikouli, K., Christodoulakis, D.. Object-Oriented Metrics – A Survey. Proceedings of the FESMA 2000. Available as: http://edu.eap.gr/pli/pli10/info/old-xenos/2000_p04.pdf 56 Frakes, W., Terry, C.. Software reuse: metrics and models. ACM Computing Surveys (CSUR), 28(2), 415 – 435, 1996. 57 Bertoa, MF., Troya, JM., and Vallecillo, A.. Measuring the Usability of Software Components. Journal of Systems and Software, 79(3), 427-439, 2006. 58 http://www.sqa.net/iso9126.html
Slide 41: APPENDIX ANNOTATED BIBLIOGRAPHY A.1 [Visser 2006] Problem to be solved: A suite of metrics is developed to measure the structural properties of XML Schemas. Previous work referred to: Metrics developed by other authors quantify schema size, complexity, quality and other properties. The metrics in this paper extend the work of [Lammel et al. Analysis of XML Schema Usage], which proposed 11 size metrics, and focus on properties of schemas other than size. New idea: The authors define a graph to represent the schema structure. Because the components of an XML Schema depend on one another, a successor graph of the schema structure can be derived. In this graph, the nodes represent components (such as elements, types, groups and attributes) and the edges are successor relations between two components. Nodes of the successor graph that are strongly connected can be combined into a single component, which serves as one specific notion of module. The resulting graph is called CG. Alternatively, namespaces, schema documents, or global declarations/definitions can each be used to group the nodes of a successor graph into modules. In this paper, global declarations/definitions are used to build the GG graph. The metrics proposed in this paper include: 1) Tree impurity: the extent to which a given graph deviates from a tree structure with the same number of nodes. 2) Efferent coupling: the number of edges from nodes inside the module to nodes outside the module. 3) Afferent coupling: the number of edges from nodes outside the module to nodes inside the module. 4) Fan-in: the number of incoming edges of a node. 5) Fan-out: the number of outgoing edges of a node. 6) Instability: the fan-out as a fraction of the total fan (fan-in plus fan-out), which may be interpreted in terms of resistance to change. 
7) Coherence: the degree to which the internal nodes of a module are connected with each other. 8) Normalized count of
Slide 42: modules: the ratio of the module count to the total number of nodes in the graph. A value of 100% means that no grouping has occurred; correspondingly, a low normalized count of modules indicates a higher level of encapsulation. 9) Count of nodes per module: the number of nodes in each module. Experiments/analysis carried out: A prototype tool, XsdMetz, was implemented in a functional programming language with functional graph representations and algorithms. A series of freely available XML Schemas of up to two megabytes in ASCII file size were collected for analysis, and the results of all metrics were recorded for each file. The statistics show that, for all schemas, the normalized count of modules for the strong components graph is close to 100%, which indicates a very low degree of recursiveness. In addition, none of the schemas has more than 3 groups of mutually dependent nodes, and the schema for XML Schema has a very large strong component. The normalized count of modules for the global declarations/definitions graph is considerably lower, which, by the definition above, indicates a higher level of encapsulation. The tree impurity metric was analyzed on each kind of graph. The other metrics, such as fan, coupling, coherence and instability, are applied only to individual nodes or modules. Claims/conclusions made: A suite of eleven structure metrics for XML Schema has been defined based on graph representations of schema structures. The metrics suite was implemented, and the intuitions behind the metrics and their potential use in schema assessment were described. Future work will focus on validating these metrics and on determining the correlations between them. Papers that refer to (cite) this paper: This paper has been cited 4 times. In this survey, it is cited by: Dilek Basci and Sanjay Misra. Complexity Metric for XML Schema Documents. OOPSLA 2007. A.2 [Mustafa et al. 2005]
Slide 43: Problem to be solved: This paper proposes a new method for determining the complexity of XML documents based on their structural and syntactic aspects. The aim is to lower the complexity of XML documents and thereby improve their reusability and maintainability. Previous work referred to: Software metrics range from simple measures, such as the number of source lines of code, to more sophisticated ones, such as the number of variable definition/usage associations. Better-known approaches to measuring software complexity include Crawford’s size metrics, McCabe’s complexity, structural decomposition metrics and the residual complexity metric. For XML documents, DTDs can be expressed as regular expressions, either by mapping DTDs to regular expressions directly or by applying analysis tools. By assigning a value to each symbol and alphabet letter in the expression, the complexity of the regular expression can be evaluated. New algorithm: Previous complexity measures have the weakness that they sometimes cannot distinguish two different documents that have the same expression. To overcome this problem, the authors propose the Weight Allocation (WA) algorithm, which assigns weights to the elements of an XML tree according to their distance from the root element. An XML document can be represented as a tree by applying the DOM (Document Object Model), and a recursive pre-order traversal of the XML tree yields the corresponding XML document. The weights used to determine the complexity of a document are defined as: weight(i) = 1 when i = root(elem(D)), and weight(i) = weight(parent(i)) + 1 when i belongs to elem(D) - {root(elem(D))}, where D denotes a given document and elem(D) is the collection of all elements contained within that document. 
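The weight rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Java implementation: the function names `element_weights` and `document_complexity` and the sample document are hypothetical, and aggregating the weights by summation is an assumption, since the summary above gives the per-element weight rule but not the aggregation step.

```python
import xml.etree.ElementTree as ET


def element_weights(root):
    """Return (tag, weight) pairs in pre-order (document order).

    weight(root) = 1 and weight(child) = weight(parent) + 1,
    following the rule quoted in the summary above.
    """
    weights = []
    stack = [(root, 1)]
    while stack:
        node, weight = stack.pop()
        weights.append((node.tag, weight))
        # Push children in reverse so they are popped in document order.
        for child in reversed(list(node)):
            stack.append((child, weight + 1))
    return weights


def document_complexity(xml_text):
    """Sum the element weights of a document.

    ASSUMPTION: summation is used here for illustration only; the
    summarized paper may aggregate the weights differently.
    """
    root = ET.fromstring(xml_text)
    return sum(weight for _, weight in element_weights(root))


# Hypothetical sample document, not taken from the paper's test suite.
SAMPLE = "<product><number>557</number><size><unit>cm</unit></size></product>"

weights = element_weights(ET.fromstring(SAMPLE))
total = document_complexity(SAMPLE)
```

Under this sketch, two documents with the same set of element names but different nesting receive different totals, which is the property the WA algorithm is meant to provide.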
Experiments/analysis carried out: The algorithm was implemented as a Java program and tested on a suite of 90 different XML documents of various sizes, areas and nesting levels.
Slide 44: Results obtained: The experiments report the total number of elements, the number of distinct elements, the document size, the tree height and the complexity value. The results are presented graphically. Claims/conclusions made: The authors state that the WA algorithm still needs to be improved, that more sophisticated methods for determining the complexity of regular expressions need to be explored, and that an in-depth analysis of XML Schemas, which overcome the shortcomings of DTDs, should be carried out. Papers that refer to (cite) this paper: In this survey, it is cited by: Complexity Metric for XML Schema Documents; Dilek Basci and Sanjay Misra; OOPSLA 2007. A.3 [McDowell et al. 2004] Problem to be solved: In this paper, the authors extend the work of other authors and do not address a substantially new problem. Previous work referred to: [Klettke et al., 2002] developed a set of metrics for measuring DTD documents. Their work focuses only on the complexity of documents, which limits its use for evaluating XML documents. New idea/algorithm/architecture etc.: As an extension of the work done by [Klettke et al.], the authors consider the quality of XML schemas and propose eleven metrics for evaluating both the complexity and quality of XML documents. The eleven metrics are: number of complex type declarations, number of simple type declarations, number of annotations, number of derived complex types, average number of attributes per complex type declaration, number of global type declarations, number of global type references, number of unbounded elements, average bounded element multiplicity size, average number of restrictions per simple type declaration, element
Slide 45: fanning. In addition, two formulas were developed to calculate the complexity index and the quality index of an XML Schema document by combining all metrics, with a different weight allocated to each metric. Experiments/analysis carried out: The authors did not conduct an experiment to analyze the metrics. One comprehensive example is used to illustrate each metric. A Metric Analyzer tool was also developed to process XML Schemas according to the user’s needs; its output is the metrics and the corresponding indices. Claims/conclusions made: The metrics developed here are not sufficient for interpreting XML Schemas. A more systematic approach and classification need to be developed for the different situations in which XML Schemas are used. Papers that refer to (cite) this paper: This paper has been cited 3 times. In this survey, it is cited by: Mlynkova I., Toman K., Pokorny J.. Statistical Analysis of Real XML Data Collections. Proceedings of the 13th International Conference on Management of Data, 2006. Basci D. and Misra S.. Complexity Metric for XML Schema Documents. Proceedings of Object-Oriented Programming, Systems, Languages, and Applications, 2007. A.4 [Lammel et al. 2005] Problem addressed: The authors argue that XML schema analysis derives from software analysis and software code metrics, since an XML Schema can itself be viewed as a software component. Starting from this view, the paper introduces some essential concepts of XML schema analysis and then applies them to real-world XML schema usage for a better understanding. Previous work referred to: Earlier XML schema metrics focused only on collecting structural statistics, such as the size or number of elements,
Slide 46: complexity and number of types, etc. In [A. McDowell et al 2004], a detailed description of XML Schema was presented. In [D. Basci et al 2007], complexity metrics for XML schema documents were investigated; that paper considers how element and attribute definitions/declarations, as well as their group definitions/declarations, contribute to the complexity of an XSD. The metrics drawn on in this paper are McCabe complexity [T. McCabe et al 1976] and a metrics suite for grammar-based software [J. Power et al 2004]. New algorithm/Analysis: XML Schema analysis aims to extract quantitative and qualitative information from actual XML schemas. In this paper, the schema usage analysis covers feature counts, idiosyncrasy counts, size metrics, complexity metrics, and XML schema styles. For XSD metrics, the first metric proposed is XML-agnostic schema size, which measures the file size of all XSD files included in a schema. The second is lines of code (LOC), which is sensitive to line-break style and to element-closing style. Neither metric can be expected to reveal any XSD-specific insight; they only provide baseline measures for understanding the overall size category of a given schema project. The third metric is XSD-agnostic schema size. It includes the number of all XML nodes (both attributes and elements) and the number of all XML nodes used for annotation (including documentation and appinfos). The fourth is XSD-aware counts, which count the global building blocks of a schema: the number of global element declarations, the number of global complex-type definitions, the number of global simple-type definitions, the number of global model-group definitions, the number of global attribute-group declarations, the number of global attribute declarations, and the sum of all of the above. 
In practice, the sum of the number of global element declarations and the number of global complex-type definitions is often used in informal conversations. The fifth metric is the McCabe complexity (MCC) for XSD. MCC is a structural complexity metric, and its typical use is to approximate the psychological complexity of code. Typically, MCC measures the number of linearly independent paths through the control-flow graph of a program module. For XML schemas, it can be adapted to schema grammars, on the assumption that the effort of understanding a schema can be approximated by the number of decisions. The important decision nodes in an XML schema include: choices, occurrence constraints,
Slide 47: element references to substitution groups, type references to types that are extended or restricted, the multiplicity of root element declarations, and nillable elements. Each of these factors contributes its own value to the total MCC. The sixth metric is the code-oriented breadth, which measures the number of parties in a content model. Here, the content model does not necessarily include attributes. The parties include local element declarations, element references, model-group references, base-type references, local attribute declarations, attribute references and attribute-group references. The seventh metric is the instance-oriented breadth. It tries to approximate the number of children in actual XML trees, so it must dereference model-group and attribute-group references. It measures the number of children and selectively includes the attributes. The eighth metric is the code-oriented depth. It investigates the level of nesting in content models and instances thereof; in other words, it is the depth of the content model. The ninth metric is the instance-oriented depth. It measures the height of XML trees and is defined as the greatest lower bound on the depth of XML trees that are derivable from a given root-element declaration. New Experiment/Analysis: This article is a survey of XML schema usage. The authors collected 63 schema projects from different IT sectors. Some of them come from Microsoft Webdata/XML’s internal suite of business-critical schemas; the others were freely obtained from the internet. All analysis was based on these schemas. Result obtained: Each metric was applied to the schemas. The authors list the results for each metric in a table and give a brief analysis. The first and second metrics only provide basic measures for understanding the overall size category of a given schema project. 
For the third metric, the amount of documentation varies greatly, while appinfos are used only very sporadically. For the XSD-aware counts, the number of highs and lows for each measure was counted. For the #CT measure, a #CT-based categorization table was produced, since #CT is biased towards the XML paradigm just as much as the “number of classes” is biased towards the OO paradigm; it also measures the number of hierarchically structured “concepts” in a schema. For the MCC metric, the graph showed that between MCC
Slide 48: and #NODE, there is only a weak correlation, which indicates that MCC contributes additional information. For code-oriented breadth, the results showed that half of the schemas have at least one content model with more than 25 parties. The difference between “parties including attributes” and “parties excluding attributes” is visually striking in at least 10 cases, and there are 11 schemas with a number of parties above 100. For the instance-oriented breadth metric, the graph is very similar to that of code-oriented breadth. For the code-oriented depth metric, the pure nesting of element declarations is limited to 2 for the majority of content models. There is a single schema that exercises element nesting up to level 14, and the full description depth is systematically greater. For the instance-oriented depth metric, the results showed that approximately half of all schemas define a maximum depth greater than 7. With an early cut-off, there are still a good number of schemas with a maximum depth greater than 5. Much smaller depth values are obtained if the glb-depth algorithm is applied. Conclusions made: This paper introduced the necessary concepts of XML schema analysis and carried out a detailed empirical study of schema usage. At the basic conceptual level, it introduced the fundamental metrics for the XSD language and identified the basic feature model of the XSD language. At a more problem-specific level, it looked into problems related to the so-called impedance mismatch in data-model mapping, as it is relevant for XML data binding. The authors envisage that future schema developers will be able to observe trends in schema metrics easily during development. At the same time, the schema developer will be able to express intentions about complexity, features and styles so that the schema editor can produce warnings when they are violated. This paper has been cited 7 times. In this survey, it is cited by: Visser J.. 
Structure Metrics for XML Schema. Proceedings of XATA, 2006. Felix Michel. Representation of XML Schema Components. Master Thesis, School of Information University of California, Berkeley, 2007.
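As a rough illustration of two of the size metrics summarized in A.4 (the XSD-agnostic node count and the XSD-aware counts of global building blocks), the following Python sketch counts them for a small schema. It is not the tooling used by Lammel et al.: the function names `node_count` and `xsd_aware_counts`, the dictionary keys, and the sample schema are all assumptions made for illustration. Namespace declarations (`xmlns:...`) are not counted as attributes, because the standard-library parser does not expose them as such.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Direct children of xs:schema that count as global building blocks.
# The key names on the right are hypothetical labels for this sketch.
GLOBAL_KINDS = {
    XS + "element": "global_elements",
    XS + "complexType": "global_complex_types",
    XS + "simpleType": "global_simple_types",
    XS + "group": "global_model_groups",
    XS + "attributeGroup": "global_attribute_groups",
    XS + "attribute": "global_attributes",
}


def node_count(schema_root):
    """Count all element nodes plus all attribute nodes in the schema."""
    return sum(1 + len(el.attrib) for el in schema_root.iter())


def xsd_aware_counts(schema_root):
    """Count global declarations/definitions and their sum."""
    counts = {label: 0 for label in GLOBAL_KINDS.values()}
    for child in schema_root:
        label = GLOBAL_KINDS.get(child.tag)
        if label is not None:
            counts[label] += 1
    counts["all"] = sum(counts.values())
    return counts


# Hypothetical sample schema for a product with a number, a size
# and an effDate attribute.
SAMPLE_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="product" type="ProductType"/>
  <xs:complexType name="ProductType">
    <xs:sequence>
      <xs:element name="number" type="xs:integer"/>
      <xs:element name="size" type="SizeType"/>
    </xs:sequence>
    <xs:attribute name="effDate" type="xs:date"/>
  </xs:complexType>
  <xs:simpleType name="SizeType">
    <xs:restriction base="xs:integer"/>
  </xs:simpleType>
</xs:schema>
"""

root = ET.fromstring(SAMPLE_XSD)
nodes = node_count(root)
counts = xsd_aware_counts(root)
```

Only direct children of the schema root are counted as global, so the two local element declarations inside `ProductType` contribute to the node count but not to the XSD-aware counts.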
Slide 49: A.5 [Basci et al. 2007] Problem to be solved: In this paper, the authors propose a new method for measuring the complexity of an XML schema based on the internal architecture of the XSD components rather than on counting the number of each kind of schema component. Previous work referred to: The authors refer to procedural metrics for evaluating the complexity of DTDs, such as LOC, McCabe, Fan-in and Fan-out, proposed by [Klettke et al.]. [Mustafa et al.] showed that XML documents with higher nesting levels have higher complexity than documents with lower nesting levels. In addition, [Ralf et al.] measure the structural complexity of XSDs with the metrics of Tree impurity, Efferent and Afferent Coupling, Instability, Cohesion, Normalized Count of Modules and so on. New idea: The authors consider the complexity of each independent component, on the grounds that the internal complexity of each component is closely related to that of the whole XML document. A weight value, called the complexity degree, is therefore assigned to each component, and the complexity of the schema document is evaluated by summing the weight values of its components. Depending on the design style, a schema may have various numbers of components declared locally or globally, elements and attributes may be defined or declared in groups, and simple and complex type definitions may be user-defined or built-in. All of these have an impact on the complexity of an XSD. A formula for calculating the complexity has been developed, which assigns a weight to each element and attribute according to its definition type. An element’s weight value is 1 if the element is declared by using the < any > element and anyType. An attribute’s weight value is 1 if the attribute is declared by the < anyAttribute > element. The weight value of a simple type is defined to be 1 if it is a built-in simple type.
Slide 50: Analysis carried out: The authors demonstrate their claim by analyzing a WSDL document and its XSD file. No other experiments are conducted to support the performance of this method. Claims/conclusions made: The proposed metric is claimed to work better than methods based on counting the number of components, since the internal architecture of XSD documents affects the overall complexity of the document. For the metric to be applied reliably, the schema quantification should be validated; the lack of validation for the metric is one shortcoming of this paper. Papers that refer to (cite) this paper: N/A A.6 [Lin et al. 2006] Problem to be solved: In relational databases, statistical summaries mainly focus on the distribution of data values. While this technique is applicable to the data values in XML documents, new techniques are required to summarize the structures of XML documents. In this paper, the authors propose a suite of metrics for the major structures of XML documents, namely the nesting of entities and one-to-many relationships. Previous work referred to: Some work has been done on summarizing structures. Some of the approaches are graph-based, and some are designed for estimating the selectivity of a query workload. One method counts the number of simple paths in XML documents and determines the correlation between paths in order to estimate the selectivity of a given query workload. Another method proposes statistical synopses for XML for path query selectivity estimation. Some properties of real-world DTDs and XML Schemas are surveyed in [G.J.Bex et al. 2004] and [B.choi et al. 2002]. New Metrics: The metrics proposed for XML structures in this paper are the following. 
The number of paths, the length of a path, the number of star edges in the prefix tree, the number of star edges of a path (which indicates the number of nested one-to-many relationships in a tree), the number of recursive/non-recursive element types (measures
Slide 51: the number of recursive and non-recursive element types in a document), the number of recursive elements of a path (quantifies the recursiveness of an XML document), the number of a particular kind of star edges of a node (measures the multiplicity of each kind of star edge of a node) and the skewness of the number of a particular kind of star edges of a node. Experiments/analysis carried out: There are no experiments in this paper, but the authors apply the metrics to a few popular XML datasets: XML benchmark datasets, NASA, SP and the synthetic datasets XMARK and XBENCH. The XMARK datasets contain synthetic auction transactions, and users can vary the size of the generated dataset by providing a scaling factor. The XBENCH dataset has four types: TC/SD, TC/MD, DC/SD and DC/MD. Here, TC, DC, SD and MD stand for text-centric, data-centric, single document and multiple documents respectively. Results obtained: The results of the survey are listed as tables in the paper. The authors apply different metrics to different datasets. They summarize the results by stating that the majority of the surveyed XML datasets are mild generalizations of relations and are not “tree-like”. Further, these datasets are highly skewed. All the datasets are hardly hierarchical and contain a small number of recursions. Claims/conclusions made: The metrics proposed in this paper are based on deriving statistics from a prefix tree of XML structures, using simple paths and star edges. These metrics answer the question of whether an XML structure is tree-like or relation-like. The analysis results indicate that the structures of these XML documents are highly skewed, non-hierarchical and mostly non-recursive, which means the datasets are relation-like. Papers that refer to (cite) this paper: 1 A.7 [Bex et al. 2004]
Slide 52: Problem to be addressed: The authors note that DTDs have some shortcomings and that XML Schema is an extension of DTDs with a restricted form of specialization. In this paper, the authors inspect a number of DTDs and XSDs to answer two questions: (1) which of the extra features/expressiveness of XML Schema not allowed by DTDs are effectively used in practice; and (2) how sophisticated are the structural properties of the two formalisms. Previous work related to: This paper is an extension of [Choi 2002], in which three types of schemas were proposed to identify features that are characteristic of DTDs: application, data and meta-data related. The content model analysis from that paper is also used here to consider XSDs alongside DTDs. In addition, from a structural point of view, DTDs and XSDs are defined using tree languages. The specialized DTD (SDTD) as well as the single-type SDTD are defined in the paper using tree languages. New algorithm/analysis: The first problem discussed by the authors is the expressiveness of XML Schema. They investigate whether the expressive power of single-type SDTDs is used in real-world XSDs. The results show that only about 15% of the 30 XSDs investigated are true single-type SDTDs. One reason is that expressiveness beyond local tree languages is simply rarely needed. Another is that, because of the relatively recent introduction of XML Schema and its complicated definition, most users have no clear view of what can be expressed. The second problem discussed concerns the derived types of XSDs. XML Schema provides two kinds of types: simple types and complex types. New types can be created by two mechanisms: extension and restriction. The authors compile statistics on the application of these two mechanisms to the simple and complex types of XML Schema. The third problem discussed in the paper is the additional features that XML Schema possesses. 
One feature of XML Schema is the &-operator. Another is the use of ID attributes, which are referred to by IDREF or IDREFS, for elements in an XML document. In addition, references to elements can be expressed by key/keyref pairs, and the use of namespaces for modularity is a feature in which XML Schema differs from DTDs. The last feature is the ability to
Slide 53: redefine types and groups. The fourth question discussed is the characterization of regular expressions. It also answers the second question introduced at the beginning of the paper, namely how sophisticated regular expressions tend to be in real-world DTDs and XSDs. Simple expressions are widely used, which indicates the simplicity of what the DTDs and XSDs express. The fifth question discussed in the paper concerns schemas and ambiguity. The authors apply the notion of one-unambiguity to check whether the DTDs and XSDs used in the paper respect this requirement. Almost all of them meet the requirement; only a few do not. Experiment carried out: The authors did not conduct an experiment. A total of 109 DTDs and 93 XSDs were collected for study. The authors applied each metric and criterion introduced in the paper to these samples and give detailed statistics for each sample. Result obtained: In each section, the authors list the results of the comparison between DTDs and XSDs. Neither schema language shows an absolute advantage or disadvantage; the comparison helps the reader to better understand what DTDs and XSDs are. Conclusion made: The analysis showed that many features defined in the XML Schema specification are not widely used, especially those related to object-oriented data modeling, such as the derivation of complex types by extension. More importantly, it turns out that almost all XSDs are local tree grammars. The expressive power encountered in real-world XSDs is mostly equivalent to that of DTDs. In the future, as the technology develops, the level of sophistication offered by XML Schema may find wider application. The data types of XML Schema overcome a shortcoming of DTDs: through the restriction of simple types, they make it possible to specify the format and type of the text of an element. The content models in both samples tend to be simple types. 
This may suggest to software engineers that complex types can be avoided when developing new XML implementations.
Slide 54: This paper has been cited 43 times. In this survey, it is cited by: Martens W., Neven F., Schwentick T., Bex GJ.. Expressiveness and complexity of XML Schema. ACM Transactions on Database Systems (TODS). Volume 31, Issue 3, Pages: 770 - 813, 2006. Mlynkova I., Toman K., Pokorny J.. Statistical Analysis of Real XML Data Collections. Proceedings of the 13th International Conference on Management of Data, 2006. Lin, Z., He, BS., Choi, B.. A Quantitative Summary of XML Structures. Lecture Notes in Computer Science, Conceptual Modeling - ER, 2006.
A.8 [Choi 2002] Problem to be solved: Previous research on DTDs assumed that recursion and nondeterminism were absent. These assumptions need to be justified against DTDs in the real world. For this reason, the author gathered a number of DTD collections and surveyed them to provide statistics with respect to a variety of criteria commonly discussed in XML research. Previous work referred to: DTDs are widely applied in implementing efficient storage systems for XML and in typechecking programming and query languages for XML, and this work has been studied in the literature. The DTDs in this paper were processed with “DTD Inquisitor”, a program developed earlier by the author. This program can read a DTD, identify problems, and compute a number of its graph-theoretic properties. New analysis: The statistical study of DTDs was divided into local and global properties. For local properties, it focused on the structure and complexity of the content model, as well as on properties that affect the parsing of a DTD or its ambiguity. For global properties, it focused on properties that might be important in mapping XML into some database format. In local properties,
Slide 55: DTDs were grouped into app, data and meta categories based on their function. The first metric was syntactic complexity, for which a depth function was defined as a rough measure. Then the determinism of DTDs was discussed; this criterion was applied to the DTDs to find out how many were non-deterministic. The ambiguity of DTDs was also explained: a DTD is ambiguous when the mapping is not unique. For global properties, the reachability of elements was discussed first. Reachability means that for each node Qi, there is exactly one node reachable from x along the path Qi. The author claims that separating the unreachable parts of a DTD into smaller DTDs appears to be a better design. Secondly, recursion in DTDs was defined. A DTD is recursive if it has an element that can be reached from itself. A DTD is linear recursive if it is recursive and, for any reachable element, this element occurs only once in a content model and the occurrence is not enclosed in “*” or “+”. Next, the simple path, simple cycle, chain of stars and hub were introduced. Simple paths occur in non-recursive DTDs and simple cycles occur in recursive DTDs. A chain of stars means that consecutive stars appear in the DTD content model. A hub is an element name with a large fan-in value. The author collected the fan-in values for data DTDs and meta DTDs and plotted them on a graph. Results obtained: No experiment is carried out in this paper. The author analyzed a collection of 60 DTDs in total and applied the metrics and criteria presented in the paper to them. Of the 60 DTDs, 7 were app, 13 were data and 40 were meta. For each metric, the author performed a statistical analysis and provided a table or a graph to give an intuitive understanding and a comparison. Conclusion made: The author provided statistics on some structures of real DTDs. 
These are structures that may be assumed in XML research, so the analysis provides a good reference for others pursuing research in XML. This paper has been cited 67 times. In this survey, it is cited by:
Slide 56: Bex GJ., Neven F., Bussche JV.. DTDs versus XML Schema: A Practical Study. In Proceedings of the Seventh International Workshop on the Web and Databases, WebDB 2004. pages 79-84, 2004. Mlynkova I., Toman K., Pokorny J.. Statistical Analysis of Real XML Data Collections. Proceedings of the 13th International Conference on Management of Data, 2006. Lin, Z., He, BS., Choi, B.. A Quantitative Summary of XML Structures. Lecture Notes in Computer Science, Conceptual Modeling - ER, 2006. A.9 [Sahuguet 2000] Problems to be solved: With the increasingly popular use of XML as a data format, the authors survey some preliminary results that explore how XML DTDs are actually being used and how people are actually (mis)using DTDs. Some shortcomings of DTDs and requirements for a possible replacement are described. Previous work referred to: DTDs were invented for SGML. According to the ISO definition, their purpose is to determine whether the mark-up for an individual document is correct and also to supply mark-up that is missing because it can be inferred unambiguously from other mark-up present. The historical functions of DTDs have been parsing and validation. DTDs have also been studied from a database perspective, since they can be used to trigger optimizations for XML query languages [S. Abiteboul et al. 1999]. DTDs were also applied to storage and compression, because the structure information of XML can be learned in advance. Besides, some studies used DTD information to create a language binding from XML to a programming language such as C++ or Java. New analysis: This paper is a survey of DTDs. No new metrics or architectures are presented.
Slide 57: Analysis/experiment carried out: The authors collected DTDs from online repositories. DTDs that were missing some declarations were cleaned, and the DTDs were then normalized by expanding parameter entities and translating the DTD structure into a convenient data structure. The last step of the analysis was reporting and visualization. Results obtained: The authors analyzed the properties of the DTDs in terms of structure, size and complexity. Specific aspects such as the use of mixed content were also analyzed, and graphs were drawn to gain some insight. First, the results show that most published DTDs are not correct: they have missing elements, wrong syntax or incompatible attribute declarations. Secondly, a DTD is not always a connected graph. The third result concerns the encoding of tuples, where different operators are used to create unordered sequences. Fourthly, inheritance via entity references is purely syntactic and sometimes leads to cascading mistakes. Finally, most of the features of DTDs, such as notations and fancy attribute types, are not being used. Conclusions made: (1) DTDs come in all sorts of shapes and sizes and are used for many diverse purposes; (2) DTD features are not properly understood; (3) people tend to use only one solution to apply DTDs; (4) people use hacks to solve DTD shortcomings, sometimes for better, sometimes for worse. Besides, since DTDs are often a misleading approximation of the intended structure, it is not a wise strategy to rely on DTDs for storage, compression and optimization. This paper has been cited 49 times. In this survey, it is cited by: Bex GJ., Neven F., Bussche JV.. DTDs versus XML Schema: A Practical Study. In Proceedings of the Seventh International Workshop on the Web and Databases, WebDB 2004. pages 79-84, 2004.
Slide 58: Martens W., Neven F., Schwentick T., Bex GJ.. Expressiveness and complexity of XML Schema. ACM Transactions on Database Systems (TODS). Volume 31, Issue 3, Pages: 770 - 813, 2006. Klettke, M., Schneider, L., and Heuer A.. Metrics for XML Document Collections. Proceedings of the Workshops XMLDM, MDDE, and YRWS on XML-Based Data Management and Multimedia Engineering. Lecture Notes In Computer Science, 2490:15 – 28, 2002. Mlynkova I., Toman K., Pokorny J.. Statistical Analysis of Real XML Data Collections. Proceedings of the 13th International Conference on Management of Data, 2006. A.10 [Washizaki et al. 2003] Problems to be solved: In component-based software development, it is difficult to use conventional metrics because they require analysis of source code, and the source code of components cannot be seen. The authors propose a metrics suite for measuring the reusability of such black-box components, based on the limited information available from outside a component without any source code. Previous work referred to: Some metrics measure the reusability of OO software, such as those proposed in [S. Chidamber et al. 1994] and [W. Frakes et al. 1996], but they require analysis of source code and cannot be applied to black-box components. Previous works also gave different versions of the definition of a software component. New metrics: A suite of five metrics for black-box components was proposed. The confidence intervals for these metrics were set using the rating scores from JARS.COM. The metrics are: EMI (Existence of Meta-Information), RCO (Rate of Component Observability), RCC (Rate of Component Customizability), SCCr (Self-Completeness of Component’s Return Value) and SCCp (Self-Completeness of Component’s Parameter). They measure the existence of meta-information, observability, customizability, and external dependency of a black-box JavaBeans component.
Slide 59: Analysis carried out: The authors define a new reusability model for black-box components based on McCall’s Factor-Criteria-Metrics issued by ISO. This model is structured top-down in three levels: the management level, the design level and the product level. The metrics are used at the product level. Besides, a component analysis tool for black-box JavaBeans components in the Java language has been developed. It automatically measures primitive values and calculates the values of the five metrics. A total of 125 JavaBeans components were used to calculate the confidence intervals. Results obtained: The results of the analysis indicate that components with too high an observability and customizability are not reusable. Across all samples, 86% of the components have SCCr values within the confidence interval. This means that it is preferable for a user of a component to obtain the result of a business method’s invocation by invoking the read methods of readable properties, rather than by capturing the return value of the business method. For SCCp, only 22% of the components have values within the confidence interval, which means it is difficult to judge reusability using the value of SCCp alone. The authors also analyze the correlation between different metrics. The result shows that RCO and RCC are independent of SCCr and SCCp. If the understandability of a component is high, then its adaptability will also be high. Compared with SCCr, SCCp cannot reflect the portability of a component, so the user should use only SCCr for measuring portability. Conclusion made: The authors in this paper propose a metrics suite for evaluating the component’s reusability. 
Using the metrics and the confidence interval proposed in this paper, it is easy for users to select the components with higher reusability. This approach can help and promote the activity of development with the reuse of existing black-box components.
Slide 60: This paper has been cited 47 times. In this survey, it is cited by: Narasimhan, VL., Hendradjaya, B.. A New Suite of Metrics for the Integration of Software Components. The First International Workshop on Object Systems and Software Architectures (WOSSA'2004), 2004. Bertoa, MF., Troya, JM., and Vallecillo, A.. Measuring the Usability of Software Components. Journal of Systems and Software, vol. 79, no. 3, pages: 427-439, 2006. Narasimhan, VL., Hendradjaya, B.. Some theoretical considerations for a suite of metrics for the integration of software components. Information Sciences 177, pages: 844-864, 2007. A.11 [Chidamber et al. 1994] Problem to be solved: The authors focus on the problem of managing the process of software design and development. A new suite of theoretically and mathematically based metrics for OO design was proposed. Previous work referred to: Much work related to object-oriented design metrics had been done. [Moreau et al.] proposed three metrics for OO graphical information systems, but did not give formal definitions. [Lieberherr et al.] presented a more formalized method for defining the rules of OO programming style, and coupling and cohesion were used in traditional programming. [Pfleeger] counted the number of objects and methods to develop a cost estimation model for OO development. Other methods are also available for OO design metrics, but their weakness is a lack of theoretical foundation and empirical support. New idea: A suite of metrics is proposed, comprising Weighted Methods per Class, Depth of Inheritance Tree, Number of Children, Coupling Between Object Classes, Response For a Class and Lack of Cohesion in Methods. Unlike other methods, the authors evaluated each metric against a set of criteria proposed by Weyuker. These properties include notions of monotonicity, interaction, noncoarseness, nonuniqueness and permutation. ………..
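Three of the metrics named above can be illustrated directly on a small class hierarchy. The following is a minimal sketch using Python introspection, not the tool used in the paper. The sample classes are hypothetical, DIT is taken as the longest path to the root, and WMC is computed with unit weights (every method counts as 1), which is an assumption of this sketch.

```python
import inspect


# Hypothetical sample hierarchy used only for illustration.
class Shape:
    def area(self): ...
    def name(self): ...


class Polygon(Shape):
    def sides(self): ...


class Circle(Shape):
    def area(self): ...


class Square(Polygon):
    def area(self): ...
    def sides(self): ...


def dit(cls):
    """Depth of Inheritance Tree: longest path from cls to the root.

    The implicit Python base class `object` is not counted.
    """
    bases = [b for b in cls.__bases__ if b is not object]
    return 0 if not bases else 1 + max(dit(b) for b in bases)


def noc(cls):
    """Number of Children: count of immediate subclasses."""
    return len(cls.__subclasses__())


def wmc(cls):
    """Weighted Methods per Class with unit weights.

    Counts only the methods defined in the class body itself.
    """
    return sum(1 for member in vars(cls).values() if inspect.isfunction(member))


for c in (Shape, Polygon, Circle, Square):
    print(c.__name__, "DIT =", dit(c), "NOC =", noc(c), "WMC =", wmc(c))
```

With real complexity weights per method, `wmc` would sum those weights instead of counting 1 for each method.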
Slide 61: Experiments/analysis carried out: To test the metrics, a tool was developed and used to collect data from two different organizations, a software vendor and a semiconductor manufacturer. Both use OOD in their development work. The former has a collection of 634 C++ classes from different class libraries, and the latter programs in Smalltalk and has 1459 classes. Results obtained: All the metrics satisfy most of the properties. The only exception is property 6 (interaction increases complexity), which none of them satisfies. This implies that the complexity metric can increase when a class is divided into more classes, which was confirmed by the engineers and software designers who participated in the study. It indicates that satisfying this property is not an essential feature for OO software design. In addition, DIT and LCOM do not satisfy property 4 (monotonicity) when two classes are in a parent-descendant relationship. This is to be expected, because the distance of a parent from the root cannot be greater than that of one of its descendants. Claims/conclusions made: The metrics suite developed in this paper gives software designers and managers an indication of the design details of an application. It helps them solve architectural and structural consistency problems. In addition, by using the metrics, flaws and other leverage points in the design can be identified earlier. The suite adds insight into the trade-offs between conflicting requirements and ease of testing. It is the first empirically validated proposal for formal metrics for OOD. Papers that refer to (cite) this paper: this paper has been cited 1520 times in total. In this survey, it is cited by: Frakes, W., Terry, C.. Software reuse: metrics and models. ACM Computing Surveys (CSUR), Volume 28, Issue 2, Pages: 415 – 435, 1996. Sahuguet A.. Everything You Ever Wanted to Know About DTDs, But Were Afraid to Ask. 
In Third International Workshop WEBDB2000. Lecture Notes in Computer Science. 1997: 171–183, 2000.
Slide 62: Washizaki H., Yamamoto H., Fukazawa Y.. A Metrics Suite for Measuring Reusability of Software Components. Proceedings of the 9th International Symposium on Software Metrics, 2003. Cho ES., Min Sun Kim MS., Kim SD.. Component metrics to measure component quality. Software Engineering Conference on, 2001. APSEC 2001. Eighth Asia-Pacific. pages: 419-426, 2001. Boxall, M., and Araban, S.. Interface Metrics for Reusability Analysis of Components. Proceedings of the 2004 Australian Software Engineering Conference (ASWEC'04), 2004. Narasimhan, VL., Hendradjaya, B.. A New Suite of Metrics for the Integration of Software Components. The First International Workshop on Object Systems and Software Architectures (WOSSA'2004), 2004. Barnard, J.. A new reusability metric for object-oriented software. Software Quality Control, vol. 7, pages: 35-50, 1998. A.12 [Barnard et al. 1998] Problem addressed: The author claims that both managers of a reuse library and developers need a way of measuring the reusability of a component. Only with a reusability metric can the usefulness of a component be determined. For this reason, the author defines a reusability metric that can be used to give values for the reusability of OO code. Previous work referred to: This paper is based on a project and an experiment in which the author was involved, which produced the results used to analyze the developed metrics. The previous work referred to in this paper mainly focuses on the reuse of components and on some measurements. Some of the literature summarizes the factors that affect software reusability. In general, these include low module complexity, good documentation, few external dependencies, proven module reliability and so on.
Slide 63: New algorithm/Metrics suite: The new metrics developed in this paper are grouped into four categories: classes, attributes, methods and input parameters. For each category, metrics were presented for measurement. For classes, low coupling and good documentation were included. For attributes, a simple type with few sub-types and a meaningful name and description should be considered. For methods, a low number of calls to foreign classes and full coverage of all input parameters should be considered. The author also developed a formula for calculating the reusability value. Experiment carried out: A two-phase experiment was carried out. Experiment 1 was simple: some programmers were asked to write a piece of specified object-oriented code following the requirements. The programmers were separated into two groups. One group was asked to write the code in a very unreusable manner, while the members of the other group were competent in OO and wrote the most reusable code they could. Experiment 2 was much more professional. It analyzed a wide selection of reusable classes (30 classes with over 200 methods) taken from well-used and accepted programming libraries. Result obtained: The result of experiment 1 does not provide much information about classes and documentation. Only a few factors, such as Coupling Between Objects (CBO) and Lack of Cohesion in Methods (LCOM), differ in the manner expected: highly reusable code has low CBO and LCOM, while unreusable code has high CBO and LCOM values. The attributes and method input parameters show a general trend: they need to be simple, well-named types in highly reusable code. The method factors showed that highly reusable code has methods which perform only one function. For experiment 2, the author listed, from the graph, the classes that showed a constant trend and those that did not. 
It appears that calls to library classes do not follow the pattern of having low values for highly reusable classes. A highly reusable class is highly reusable only within its own library or language domain, not in other domains.
Slide 64: Conclusion made: The reusability metric presented in this paper is based on empirical evidence from classes that were assumed to be highly reusable. The experiment suggests that a class can be regarded as a black box: its contents are irrelevant to its reusability, and the reuser need only be concerned with its documentation and interface. This metric helps developers write more reusable code. This paper has been cited 8 times. A.13 [Narasimhan et al. 2004] Problem addressed: The modern approach to software reuse is Component Based Software Engineering (CBSE). Many applications and systems are created by assembling independent components developed by various developers. It is necessary to develop metrics for measuring the efficiency and effectiveness of such applications, including predicting the cost, effort and time of development and testing. In this paper, the authors propose a suite of metrics for measuring component integration. Previous work referred to: Previous work related to this paper includes research on using software components to build new systems and the challenges this poses, and on white-box and black-box reuse of components for system integration. In the literature, lines of code were used as a metric of software size, and the reusability of components was also used as a metric. In addition, other authors investigated process and measurement frameworks for evaluating software components. New algorithm/Analysis: The metrics proposed in this paper fall into three parts: complexity metrics for components, criticality metrics for components, and triangular metrics. Among the complexity metrics, the Component Packing Density (CPD) and the interaction density metric for component integration (IDC) were proposed. The CPD is defined as the ratio of the number of constituents to the number of components. 
The IDC is the ratio of the actual number of interactions to the available interactions in a
Slide 65: component. IDC=#I/#Imax, where #I is the number of actual interactions and #Imax is the maximum number of available interactions. Similarly, metrics for incoming and outgoing interaction densities can be defined. The Incoming Interaction Density of a Component (IIDC) is the ratio of the number of incoming interactions used to the number of incoming interactions available. The Outgoing Interaction Density of a Component (OIDC) is the ratio of the number of outgoing interactions used to the number of outgoing interactions available. For critical components, link criticality, bridge criticality, inheritance criticality and size criticality have been proposed. The triangular metrics combine the CPD, AID and criticality metrics into a two-axis diagram of density and a three-axis diagram that represent the relationships among them. Experiment carried out: No experiment was carried out in the paper. The authors only used an example of Personal Digital Assistant software to illustrate the metrics. It has three main components with corresponding functions. In addition, the authors proposed an experimental design and gave its expected outcome, but the experiment was not executed. Result obtained: Each metric was applied to the example and the corresponding value was calculated. The result showed the relationship between the metrics and their functions in component integration. Conclusion made: The metrics proposed in this paper are based on measurement theory. They will help component-based developers identify the level of complexity and criticality in an integrated software system. An experimental design was also proposed to validate the metrics theoretically and empirically. Further study of component metrics is planned. This paper has been cited 8 times. A.14 [Alkadi et al. 1998]
Slide 66: Problem addressed: The authors claim that existing OO design methodologies do not generally require adherence to design metrics, and that there is a lack of available methods to measure the quality of the various components during the software design phase. The method proposed in this paper is intended for this purpose. Previous work referred to: For object-oriented design, four key elements are considered important, namely abstraction, encapsulation, hierarchy and modularity. These factors have been investigated and corresponding metrics have been presented by other people. [S. Chidamber et al. 1994] proposed six software metrics related to software maintenance. In this paper, the authors perform design evaluation by measuring the class structure of object-oriented designs using the above metrics criteria. New Metrics/Formulas: In this paper, a volume metric is defined to help characterize the activity in a class hierarchy based on the NOC (number of children) metric. The volume metric represents the complexity of an object based on the number of its methods and attributes. The volume of an object, V0, is defined as the sum of the number of inherited (NOI), overridden (NOO) and local (NOL) “internal or hidden” methods in the object. In a class hierarchy, the volume of all the children “objects”, V*, is defined as the sum of the total number of methods in each object. The volume of a class, Vc, is defined as the sum of the number of attributes NOA and V*. In addition, two other formulas were defined to compute the average of local and overridden/inherited methods (AIM) and the percentage of overridden/inherited methods (POM). Based on the above formulas, a heuristic was proposed for restructuring the software design. Experiment carried out: No experiment was carried out to evaluate the performance of the metrics proposed in this paper. The authors only analyzed an illustration of the properties of each metric. 
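The volume formulas summarized above reduce to simple sums. The following is a minimal sketch that follows the wording of this summary only: V0 = NOI + NOO + NOL for one object, V* as the sum of V0 over the child objects, and Vc = NOA + V*. The class and function names and the sample counts are hypothetical, and AIM and POM are left out because their exact formulas are not given here.

```python
from dataclasses import dataclass


@dataclass
class ObjectCounts:
    """Method counts for one object (names follow the summary above)."""
    noi: int  # number of inherited methods
    noo: int  # number of overridden methods
    nol: int  # number of local methods


def object_volume(obj: ObjectCounts) -> int:
    """V0: sum of inherited, overridden and local methods of an object."""
    return obj.noi + obj.noo + obj.nol


def children_volume(children: list[ObjectCounts]) -> int:
    """V*: total number of methods over all child objects."""
    return sum(object_volume(child) for child in children)


def class_volume(noa: int, children: list[ObjectCounts]) -> int:
    """Vc: number of attributes (NOA) plus V*."""
    return noa + children_volume(children)


# Hypothetical hierarchy with two child objects and five attributes.
children = [ObjectCounts(noi=3, noo=1, nol=2), ObjectCounts(noi=3, noo=0, nol=4)]
print("V0 per child:", [object_volume(c) for c in children])
print("V* =", children_volume(children))
print("Vc =", class_volume(noa=5, children=children))
```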
Conclusion made: To evaluate OO designs, OO metrics should be formulated. This paper introduced a method to evaluate NOC, measure the volume of an object and measure the volume of the class. The proposed formulas compute the ratio of local and overridden methods to inherited methods.
Slide 67: The heuristic was introduced to indicate when object and class hierarchies need to be reexamined, and is supported with experimental data. This paper has been cited 6 times. A.15 [Cho et al. 2001] Problem addressed: Component-based software development requires a quite different approach from OO methods, because it applies commonality and variability analysis, components and component interfaces in developing the system. Various metrics developed for OO programming cannot be equally applied to the CBD process. The authors present component metrics that can be applied efficiently in the CBD process. Previous work referred to: Many different metrics were proposed for OO systems and applied to the concepts of classes, coupling and inheritance. They include: Weighted Methods per Class (WMC), Response for a Class (RFC), Lack of Cohesion (LCOM), Coupling Between Object Classes (CBO), Depth of Inheritance Tree (DIT) and Number of Children (NOC). These metrics were studied by others and have some limitations when applied to the CBD process. New Metrics/Analysis: The new metrics proposed in the paper measure complexity, customizability and reusability. They are categorized into design metrics and implementation metrics. The metrics proposed for measuring component complexity are based on MCC. They are divided into four kinds: component plain complexity (CPC), component static complexity (CSC), component dynamic complexity (CDC) and component cyclomatic complexity (CCC). For measuring customizability, the metric of Component Variability (CV) was defined, with weight values assigned to behaviour customization methods and workflow methods in the calculation. For measuring the Reusability, the first metric is
Slide 68: component reusability (CR), which is defined by dividing the number of interface methods that provide commonality functions in a domain by the total number of interface methods. The second metric is Component Reuse Level (CRL), which comes in two kinds. The first is measured using lines of code (LOC), and the second by dividing the functionality that a component supports by the functionality required in an application. Experiments carried out: No experiment was carried out in this paper. The authors applied the metrics to several projects in the banking domain and illustrated the results. Results obtained: The value of each metric proposed in the paper was calculated. Applying the metrics to component design and implementation showed that the number of factors of the metrics applied at design time is smaller than the number of factors of the metrics applied at implementation time. Thus, measurements of complexity or reusability using CCC and CRL are more accurate than those using CPC, CSC, CDC and CR. Conclusion made: The proposed metrics were tested, and it was found that the complexity of a component may help to estimate the component’s size. Also, reusability and customizability have a great effect on software development. In addition, the lines of code of components are suitable for measuring reusability in CBSD (Component-based Software Development). This paper has been cited 33 times. It is cited by Washizaki H., Yamamoto H., Fukazawa Y.. A Metrics Suite for Measuring Reusability of Software Components. Proceedings of the 9th International Symposium on Software Metrics, 2003. Narasimhan, VL., Hendradjaya, B.. Detailed Theoretical Considerations for a Suite of Metrics for Integration of Software Components. Advances in Systems, Computing Sciences and Software Engineering, Springer Netherlands, pages: 257-264, 2006
Slide 69: Narasimhan, VL., Hendradjaya, B.. Some theoretical considerations for a suite of metrics for the integration of software components. Information Sciences 177, pages: 844-864, 2007. A.16 [Boxall et al. 2004] Problem addressed: Component-based software development relies on reusable components to improve the quality and flexibility of products. Assessing the reusability of a component before reusing it would be very helpful to reusers. Of the two static sources of information about components, component interfaces are easier to measure, while external documentation is hard to measure. Thus, the authors present a set of metrics for measuring the properties of understandability and reusability. Previous work referred to: The way interfaces affect component reusability has been studied in the literature. [S. Atkinson 1997] studied how interfaces affect knowledge transferability. Other papers indicate that understandable and well-documented component interfaces are essential for users to reuse components. In addition, internal properties can affect reusability, and the interface may be more important than the software modules. New Metrics: No algorithm is proposed in this paper. The metrics presented include: (1) Arguments Per Procedure (APP), it measures the mean size of procedure declarations of an interface and is defined as
$n_a / n_p$, where $n_a$ is the total count of arguments of the publicly declared procedures and $n_p$ is the total count of procedures that are declared by an interface. (2) Distinct Argument Count (DAC), it measures the consistency of the naming and typing of arguments. It is defined as $\mathrm{DAC} = |A|$, where $A$ is the set of name-type pairs used as arguments in an interface. (3) Argument Repetition Scale (ARS), it is an alternative to DAC and is defined as $\mathrm{ARS} = \frac{\sum_{a \in A} |a|^2}{n_a}$, where $|a|$ is the count of procedures in which argument name-type $a$ is used and $n_a$ is the Argument Count
Slide 70: of the interface. ARS will be in the range between 1 and n. (4) Mean String Commonality (MSC), it measures the commonality of a set of identifiers and is defined as $\mathrm{MSC}_A = \frac{\sum_{(x,y) \in \mathrm{comb}(A)} \frac{\mathrm{lcs}(x,y)}{\max(|x|,|y|)}}{{}^{n}C_{2}}$, where A is a set of identifiers, n is the count of identifiers, comb(A) computes the set of combinations of two elements from the set A, and lcs(x,y) computes the longest common subsequence of x and y. (5) Identifier Length. It includes the Mean Identifier Length (MIL) and the Median Identifier Length (MeIL). MIL is the weighted mean length of identifiers occurring in the interface and MeIL is the weighted median length of identifiers. (6) Reference Argument Density (RAD), it measures the occurrence of reference arguments in an interface and is defined as $n_r / n_a$, where $n_r$ is the count of pass-by-reference arguments and $n_a$ is the total Argument Count. Experiment/Analysis carried out: The interfaces of 12 components were chosen for measurement to provide empirical data for metrics analysis. They are the Apple Frameworks (including the components CoreAudio, CoreMIDI, DrawSprocket, SystemConfiguration and veclib), the Financial Software Components (including the components Core, DB, GUI, MemDB and XML), the MFC component and the MLibrary component. Besides, three command-line programs were developed to yield the values of the metrics. Result obtained: For each section, the result was obtained and analyzed, considering whether the interfaces are relevant to the reusability of the components. For some metrics, a linear relationship exists, or there is the possibility that they are correlated with interface size. In addition, MFC has a large interface, which makes it an outlier. The detailed analysis of the results is available in the paper and is not reproduced here because of the space and scope of this survey. Conclusion made: The metrics developed in the paper can provide useful information about the reusability of a component and can give a better understanding of the properties of a component’s interface. The theory used in the analysis was consistent with expert
Slide 71: knowledge. The metrics in this paper are not complexity measures, so evaluating complexity still needs to be considered. More controlled and extensive testing of the metrics is also desirable. This will be the authors’ future work. This paper has been cited 12 times. A.17 [Theoharis et al. 2008] Problems to be solved: Many semantic web (SW) schemas are expressed in either RDF/S or OWL, and they have been widely analyzed. Unlike previous analyses, the authors study the features of individual SW schema graphs and find that these graphs have power-law degree distributions. In addition, they analyze the morphology of each schema. Previous work referred to: An SW schema can be viewed as a directed labelled graph where nodes are classes or literal types and arcs are properties. The degree distribution is a main feature of the graph. [R. Gil et al. 2004] essentially analyzes the aggregation of all DAML library ontologies, which motivated the authors of this paper to analyze the features of individual SW schemas in detail. In that paper, the CCDF of the total degree approximates a power-law distribution. In [L. Ding et al. 2005] and [H. Halpin et al. 2006], the authors collected a voluminous set of online FOAF documents and analyzed the resulting RDF instance graph. A power law was observed for both the in-degree and out-degree distributions. Besides, the internet topology, the WWW and the call graph also exhibit power-law distributions for their in and out degrees. New analysis: In this paper, the authors analyze the property graph and the subsumption graph. They investigate power laws on two different functions of a Discrete Random Variable (DRV) X. The first, called the Complementary Cumulative Density Function (CCDF), is P(X>x) and measures the frequencies of X values. The second, denoted Value versus Rank (VR), measures the relationship between the ith biggest X value and its rank i. 
Also the authors use the findings in this paper to generate synthetic SW
Slide 72: schemas. The authors developed an algorithm that uses a reduction to Linear Programming to generate SW graphs in polynomial time. Analysis carried out: The authors conducted experiments on 250 schemas collected from the RDFsuite, SchemaWeb and Swoogle schema registries.
Based on the properties and the transitive relationship, the graph of an SW schema can be divided into a property graph and a subsumption graph. The corpus contains the biggest schemas published on the WWW until May 2006. According to schema size, the corpus was categorized into 2 groups: one consists of 83 schemas with more than 100 classes, and the other of 58 schemas with more than 100 properties. The smaller schemas among the 250 are not analyzed separately. Result obtained: The results show that the majority of SW schemas, 94.8% for VR and 67.2% for CCDF, approximate a power law for property total-degree functions. For the subsumed classes, the VR function approximates a power law for 87.9% of the schemas, while the CCDF does so for 60.2%. A sketch of the abstract morphology of SW schemas shows that the classes with big degrees in the property graph are usually located at the higher levels of the subsumption graph. In addition, the class subsumption hierarchies are mostly unbalanced, with large branches and many leaves. Conclusions made: This paper is the first work to analyze the property graph of individual SW schemas. The majority of schemas with a significant number of classes and properties approximate a power law for class and property total-degree distributions. It can be concluded that there exist a few “focal” classes that form the conceptual backbone of the defined schema and many “peripheral” ones that are used to describe the former in further detail. Papers that refer to (cite) this paper: N/A
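As an illustration of the two functions used in [Theoharis et al. 2008], the following Python sketch computes the CCDF, P(X>x), and the Value versus Rank pairs for a list of node degrees. The function names and the sample degree list are assumptions made for this example; they do not come from the paper.

```python
from collections import Counter


def ccdf(values):
    """Return a dict mapping each distinct x to P(X > x)."""
    n = len(values)
    counts = Counter(values)
    remaining = n  # number of observations not yet passed
    result = {}
    for x in sorted(counts):
        remaining -= counts[x]      # observations strictly greater than x
        result[x] = remaining / n
    return result


def value_vs_rank(values):
    """Return (rank, value) pairs, where rank 1 is the biggest value."""
    ordered = sorted(values, reverse=True)
    return [(rank, value) for rank, value in enumerate(ordered, start=1)]


# Hypothetical total degrees of six classes in a small schema graph.
degrees = [1, 1, 1, 2, 2, 5]
print(ccdf(degrees))
print(value_vs_rank(degrees))
```

When a degree sequence follows a power law, both functions appear approximately as straight lines on log-log axes, which is the usual way such a fit is inspected.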
Slide 73: A.18 [Rotaru et al. 2005] Problems to be solved: In component-based software development (CBSD), some metrics designed for object-oriented software (OOS) cannot be used to measure the reusability of components. The authors propose methods to address this problem and study the adaptability and compose-ability of software components. They present a mathematical model for these characteristics and for evaluating non-functional parameters of a software component. Previous work referred to: The most widely used software metric is McCabe Cyclomatic Complexity (MCC) [McCabe 1976]. It measures the number of linearly independent paths through a program module and gives a count of the number of test cases. It is calculated from a connected graph of the program module that shows the topology of control flow and is defined as CC = E - N + p, where E is the number of edges of the graph, N is the number of nodes and p is the number of connected components. Other complexity metrics include: Halstead Complexity (counting operators and operands), Bowles metrics (evaluating the coupling via parameters and global variables), Ligier metrics (modularity of the structure chart), Troy and Zweben metrics (coupling, complexity of structure, calls-to and called-by), and Henry and Kafura metrics (fan-in and fan-out). The main OO metrics are: WMC (Weighted Methods per Class), DIT (Depth of Inheritance Tree), NOC (Number of Children), CBO (Coupling Between Object Classes), LCOM (Lack of Cohesion in Methods), and RFC (Response for a Class). New Metrics: The first metric proposed in this paper is the Compose-Ability metric. It is defined qualitatively by studying the parameters and return values of a component's interface methods, and it is inversely proportional to the multiplicity. The multiplicity of an interface method of a software component is the sum of its return and signature multiplicities.
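A minimal Python sketch of this sum is given below. It assumes that the return multiplicity is 0 for a method that returns nothing and 1 otherwise, and that the multiplicity of each parameter is supplied as an input. The function name and the sample values are hypothetical and are not taken from [Rotaru et al. 2005].

```python
def method_multiplicity(has_return_value, parameter_multiplicities):
    """Multiplicity of one interface method: return part plus signature part."""
    # Assumption: the return multiplicity is 1 if the method returns a value, else 0.
    return_multiplicity = 1 if has_return_value else 0
    # The signature multiplicity is the sum over the method's parameters.
    signature_multiplicity = sum(parameter_multiplicities)
    return return_multiplicity + signature_multiplicity


# Hypothetical method with a return value and two parameters
# whose multiplicities are 1 and 2.
print(method_multiplicity(True, [1, 2]))
```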
Here, the return multiplicity is zero when the method doesn’t return anything and one when the method has a return value. The signature multiplicity is the sum of the multiplicities of the method’s parameters. Also, the authors analyzed the
Slide 74: component adaptability and its relationship to complexity. A model for calculating adaptability was developed based on the formulas derived earlier. Analysis carried out: No experiment was carried out. The authors further analyzed how component adaptability is calculated. The formula they developed has a circular dependency; the solution is to replace the interface complexity with the complexity of the problem that the component solves. The authors followed this solution and gave a concrete analysis. Result obtained: The result of the above analysis is described in the previous paragraph. Conclusion made: The authors address the market need for uniform metrics and methods for assessing adaptability and compose-ability. They propose a mathematical model for the practical assessment of software components. Component complexity is assessed from the perspective of both the problem to be solved and the interface characterization. An implementation quality metric is also defined; it arose from the need to break the circular dependency between the adaptability of a component and its complexity. Papers that refer to (cite) this paper: 4 A.19 [Kafura et al. 1987] Problem addressed: The authors claim that the high cost of software maintenance is a major concern for software engineers. One cause may be that maintenance lacks a cohesive method comparable to those for requirement analysis, design and testing of a software system. In this paper, the authors explore the relationships between different software complexity metrics and propose a way to use this quantitative information to form a complete maintenance method.
Slide 75: Previous work referred to: This paper was written in 1987, by which time software reuse had been studied for some time. The metrics from the literature applied in this paper are: McCabe’s Cyclomatic Complexity number, Halstead’s Effort metric E, Lines of Code, Henry and Kafura’s Information Flow Metric, McClure’s Control Flow Metric, Woodfield’s Syntactic Interconnection Measure, and Yau and Collofello’s Logical Stability Metric. The first three are classed as code metrics and the other four as structure metrics. All are applied in the experiments in this paper. New Algorithm/Analysis: No new metrics are proposed in this paper. Experiments carried out: The system studied is a database management system called Mini Data Base (MDB). It is based on the relational model and runs under the VMS operating system on a VAX 11/780. MDB is a medium-size system with 16,000 lines of FORTRAN code that was developed over several years. There were 4 versions in total before the system was analyzed in this paper. The analysis of complexity changes was based on 4 versions. Result obtained: The analysis and comparison of the changes across the system’s three versions were made at two levels. The first is the system level, at which the total complexity of the system under each of the seven metrics was computed and compared for each version. The second is the procedure level, at which the percent change in complexity from one version to the next was computed for each procedure. The resulting graph showed an obvious increase in complexity in each later version of the system. This reflects the enhancements made and the new commands and capabilities added to the system. It also showed the improvement in maintenance, and most of the maintenance performed in Version was error correction.
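The procedure-level comparison described above can be sketched in Python as follows. The function names, the procedure names and the complexity values are hypothetical and serve only to illustrate the percent-change calculation; they are not data from [Kafura et al. 1987].

```python
def percent_change(old, new):
    """Percent change in a complexity value from one version to the next."""
    return (new - old) / old * 100.0


def procedure_changes(version_a, version_b):
    """Percent change per procedure, for procedures present in both versions.

    Each argument maps a procedure name to its complexity under one metric.
    Procedures with zero complexity in the older version are skipped,
    because the percent change is undefined for them.
    """
    return {
        name: percent_change(version_a[name], version_b[name])
        for name in version_a
        if name in version_b and version_a[name] != 0
    }


# Hypothetical complexity values of two procedures in two successive versions.
v1 = {"parse": 10, "scan": 4}
v2 = {"parse": 15, "scan": 4}
print(procedure_changes(v1, v2))
```

Procedures with unusually large percent increases would be the candidates for closer inspection.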
These results confirmed that the metrics reflect the growth in complexity of the system introduced as a result of maintenance activities. At the second level, the percent increases in complexity of the procedures were recorded. From the result, it was observed that the metrics can be used to identify improper integration of enhancements, and that some complex procedures can lead to future improper structuring of the system, since maintainers would avoid using such complex procedures when making enhancements. The other interesting observation
Slide 76: is that the results reflect the distinction between code and structure metrics: when the code metrics increase, the structure metrics do not necessarily increase correspondingly. Conclusion made: Seven software metrics were analyzed across three different versions of a single system to assess and control software maintenance activities. The study confirmed the results obtained in previous works. It also confirmed that it is useful to examine carefully the components that are “outliers” on one or more metrics. This paper provided good experience in building a software metric analysis tool. This paper has been cited 65 times. A.20 [Narasimhan et al. 2006]
Problems to be solved: In Component-Based Software Engineering (CBSE), component-based metrics have been proposed by several researchers. Some aim at reusability, while others focus on the process and measurement framework for developing software components. In this paper, the authors attempt to build a suite of software metrics for software assembly. Previous work referred to: Previous work on component-based metrics used Lines of Code (LOC) to count the size of software and to estimate the number of LOC early in the software life-cycle. The shortcoming of this method is that the size of the software is hard to predict before implementation. On integration-specific issues, [Sedigh Ali et al] discussed the complexity of interfaces and their integration as quality metrics. [Cho et al.] define metrics for complexity, customisability and reusability. Their complexity metrics are calculated from a combination of the number of classes, interfaces and relationships among classes. The cyclomatic complexity is calculated by summing the classes and interfaces. New Metrics: Two kinds of metrics are proposed in this paper: static and dynamic. The static metrics measure the complexity and criticality of a component assembly. Here, the complexity is measured using Component Packing Density and
Slide 77: Component Interaction Density metrics. The criticality metrics are defined as Link, Bridge, Inheritance and Size criticalities. The complexity and criticality metrics are combined into a Triangular Metric. The dynamic metrics characterize the dynamic behaviour of a software application by recognizing component activities at run time. Experiments/analysis carried out: A stock-broker system is used to evaluate the proposed metrics. In this system, a StockDistributor (SD) component monitors a stock database. If some values change in this database, the component generates an event via an event source to the corresponding event sinks. Several StockBroker (SB) components can respond with their event sinks. The authors then apply the framework proposed by Weyuker, which identifies 9 properties, to validate the proposed metrics. Results obtained: For the stock-broker system, the authors calculate all static and dynamic metrics. The triangular metric is then analyzed, and it is concluded that the StockQuoter system is a simple application. The number-of-cycles metric cannot be shown, since there is no actual cycle in the graph representation. In the analysis under Weyuker’s framework, most metrics fulfil Weyuker’s property criteria, while a few do not. The results show that the metrics presented in this paper can be used at both the design level and the implementation level. Claims/conclusions made: The proposed metrics can help developers reason about how complex a system is and locate critical areas in a component assembly. They can also help to identify new super-components and components with a high extent of use. Papers that refer to (cite) this paper: 0 A.21 [Bertoa et al. 2004]
Slide 78: Problems to be solved: ISO 9126 parts 2 and 4 define a set of metrics to measure the quality characteristics and sub-characteristics that constitute a Quality Model. But these metrics are mostly indirect metrics, and they are defined without any reference to the attributes they measure. This situation does not help improve the quality of subsequent metrics for software components. Even worse, metrics are defined with no indication of either the attributes they measure or the quality characteristics they try to assess. Previous work referred to: The ISO 9126 Quality Model is a benchmark for software design. Before this work, the authors had adapted this quality model to software components. The metrics developed on the basis of this model, plus the information collected from software component vendors, were not adequate and were even ill-defined. The purpose of developing the metrics is to assess the usability of software components. Usability has been widely studied in the literature, and its definitions vary. According to ISO 9126, usability comprises understandability, learnability, operability, attractiveness and usability compliance. [B. W. Boehm et al. 1978] and [N. Fenton et al. 1998] also give definitions of usability, which form a basis for later refinements of the definition. New metrics: There are three measurable concepts related to software component usability: quality of documentation, complexity of the problem and complexity of the solution. The authors developed a set of metrics for each of them. The tables are listed in the paper. Analysis carried out: No detailed experiments are carried out in this paper. The authors only analyze the metrics they developed and their relation to the attributes they measure. Result obtained: Based on their own understanding and experience, the authors define the links between the attributes measured by the metrics and the usability sub-characteristics.
The links defined connect the quality of documentation and the complexity of design to understandability, learnability and
Slide 79: operability. Although there is no unique direct relationship between a metric and a quality subcharacteristic, every metric is related to each subcharacteristic to a different degree. The authors concluded that the quality of documentation mainly affects understandability and learnability, while the complexity of design affects operability. Conclusion made: Some of the metrics proposed in this paper are subjective and difficult to automate, which limits the process. The degree of influence that component attributes have over component subcharacteristics is analyzed and listed in a table in the paper. Some problems in this research are left for the authors’ future work. Papers that refer to (cite) this paper: 3