Pattern Recognition Letters 67 (2015) 109–112


Editorial

Data science, big data and granular mining

Introduction

With the evolution of various modern technologies, a huge amount of data is being constantly generated and collected around us. We are in the midst of what is popularly called the information revolution and are living in a so-called world of knowledge. Intentionally and/or accidentally, the generation of these data is inevitable. As a result, large data, broadly characterised by three Vs – large volume, velocity and variety (popularly known as "big data") [1], has become a buzzword, and the analysis, access and storage of these data are now central to scientific innovation, public health and welfare, public security, and so on. Moreover, big data are highly complex in nature and mining them is not straightforward. Most of the information is heterogeneous, time varying, redundant, uncertain and imprecise. Reasoning about, understanding and mining useful knowledge from these data is becoming a great challenge. It is also true that large integrated data sets can potentially provide a much deeper understanding of both nature and society, and open up new avenues for research activities.

Long before the data space, there have been histories of physical and social spaces to describe various phenomena in nature and human civilization, respectively. These two spaces eventually led to the natural and social sciences, where research activities related to understanding processes and concepts, such as discovering various phenomena in the physical world and understanding human interactions, are carried out. In the recent past, digitalization of various observable facts of these spaces has produced a huge amount of data.

With the over-flooding of data called "big data", people have started realizing the existence of a data space along with the natural and social spaces, and have shown remarkable interest in exploring it. Once the data are generated, they will not evolve accordingly if no special mechanism is arranged. The data may have a powerful effect on the real world even if they are fabricated, e.g., rumours spread via mobile phones or through social media. Big data has promising usefulness in many fields such as commerce and business, biology, medicine, public administration, material science, and cognition in the human brain, just to name a few. The objective of big data, as it stands, is to develop complex procedures running over large-scale data repositories for extracting useful knowledge hidden therein and delivering accurate predictions of various kinds within a reasonable time period. Scientists from academia, industry and the open-source community have been striving for improved reasoning about and understanding of big data, and to provide better scope for solving several social and natural problems. It may be mentioned here that while analytics over big data is playing a leading role, there has been a shortage of deep analytical talent globally.

Mining of big data can be made effective with methodologies that can deal with their characteristics, such as heterogeneity, dynamism or time variation, redundancy, uncertainty and impreciseness. Heterogeneity in data comes from the characterization of information in different ways, such as using numerical, categorical, text, image or audio/video data. Dynamism in data arises because the mechanism that generates related data changes at different times or under different circumstances, which adds new uncertainty and difficulties for analysis. High dimensionality is due to the inappropriate collection of data for a task: although a large candidate set of attributes is provided, most of them are irrelevant or redundant. These superfluous attributes deteriorate the learning performance of decision-making algorithms.

In the recent past, research interest has grown in a relatively new area called granular computing (GrC) [2,3,7], due to the needs and challenges of various application domains, such as document analysis, financial gaming, organization and retrieval of huge databases of multimedia, medical data, remote sensing, and biometrics. Several aspects of GrC [6] play important roles in the design of different decision-making models with acceptable performance.

GrC, based on technologies like fuzzy sets [9], rough sets [8], computing with words, etc., provides powerful tools for multiple-granularity and multiple-view data analysis, which is of vital importance for understanding the complexity in big data. Different users may require understanding of big data at different granularity levels and need to analyse the data from different viewpoints. GrC has exhibited some capability and advantages in intelligent data analysis, pattern recognition, and uncertain reasoning for data of noticeable size. However, very few research activities have so far gone deep into the analysis of heterogeneous, complex and big data.

Granular computing: components, characteristics and features

GrC by its name is not new, but its use in various domains has gained popularity recently. It is a computing paradigm of information processing that works with the process of information granulation/abstraction. GrC is an umbrella term that accommodates theories, methods, techniques and tools of information granulation. In processing large-scale information, GrC plays an important role in finding simple, cost-effective approximate solutions and provides an improved description of real-world intelligent systems. In the present scenario, the synergetic integration of GrC and computational intelligence has become a hot area of research for developing and experimenting with various efficient decision-making models for complex problems.

Broadly, the prime motivation in the development of GrC-based methodologies is three-fold: (i) GrC does not go for excessive precision of the solution, which is in fact an inherent characteristic of the human reasoning process, (ii) the structural representation of the problem in the work-flow of GrC makes the solution process more efficient, and (iii) the computing method provides transparency in the information processing steps. The human reasoning process normally follows the principle of information granulation and performs computation and operations on information granules. Most often, modelling and monitoring complex systems do not require a high level of precision, which is in fact often expensive and unnecessary. Further, a task with incomplete and imprecise information poses problems in differentiating distinct elements. This motivates one to go for a granular representation of information and processes. As a result, the GrC framework provides efficient and intelligent information processing systems for various real-life decision-making applications. The said framework can be modelled with principles of neural networks, interval analysis, fuzzy sets and rough sets, both in isolation and in integration, among other theories.

One of the key advantages in computing with granules is that granular processing mimics the human way of understanding a particular problem, where different levels of information are required. GrC works in a similar fashion, dealing with the structured representation of real-world problems. In fact, this form of characterization is inherent in any task of the universe, and the relationships among the different levels of structures and multiple levels of understanding provide ample scope to understand and interpret them. In many cases, including large data sets, a problem may be difficult to solve at one level, but subsequent derivations of the same problem at other levels may lead to a partial or full solution. This concept of hierarchical problem abstraction is useful for knowledge discovery where the data set in question is large with many attributes. The paradigm of GrC offers ample opportunities for efficient methodologies of knowledge discovery, where it can be combined with fuzzy sets and rough sets (for example), thereby developing a systematic modelling framework with a focus on the overall interpretation of the system. Such transparency in the working process provides improved interaction between expert knowledge and the system design, leading to better system performance.

The components of GrC that drive the complete process are: granules, granulation, granular relationship and computing with granules. A granule is considered a building block, which plays a significant role in the process of GrC. Constructing appropriate granules for a particular task is crucial, because the different sizes and shapes of granules decide the success rate of the GrC models. A granule is a collection of entities drawn together by indistinguishability, similarity, proximity or functionality. The significance of a granule in GrC is similar to that of any object, subset, class, or cluster in a universe. Depending on the size and shape, and with a certain level of granularity, granules may characterise a specific aspect of the problem. Granules with different levels of granularity represent the model differently and regulate the decision-making process accordingly. In the formation of different granules, a subset of objects having similarity among them, in terms of some relation or attribute, is considered, which is a kind of clustering approach. Granules can be formed using fuzzy sets, rough sets, random sets, and interval sets.

The process of formation and representation of granules is called granulation. Granulation is performed in two ways: integrating and dispersing the granular structure. Integration involves developing larger and/or higher level granules from smaller and/or lower level granules, whereas dispersion involves decomposing larger and/or higher level granules into smaller and/or lower level granules. These processes of integration and dispersion are also known as bottom-up and top-down approaches, respectively, in the development of granules. Although these two processes work in opposite directions, broadly they are highly correlated with each other. At a higher level, a granule represents a more abstract concept by ignoring irrelevant details, while at a lower level it is a more specific one that reveals more detailed information. A challenging problem in GrC is to coordinate and/or swap between different levels of granularity. Several attempts have been made to develop machine learning models using these approaches.

To improve the decision-making process using GrC, one may have to represent and interpret the granular relationships, such as inter- and intra-relationships, meticulously. The intra-relationship among the elements of a granule and the inter-relationship among the elements of two different granules have a strong influence on GrC-based models. These relationships provide the essential information on whether to go for integration or dispersion of granules.

It is worthwhile mentioning the importance of methods for reasoning (judgment) about properties of computations over granules. These properties are constructed (induced) using a higher level granulation over computations on granules.

Granular computing: fuzzy and rough sets

Among the different technologies required for performing GrC, the ones based on fuzzy and rough sets are the most successful and considered to be the best choice for decision-making processes. In fact, the concept of granulation is inherent in those theories. Fuzzy set theory starts with the definition of a membership function and granulates the features, thereby producing a fuzzy granulation of the feature space. The fuzziness in granules and their values characterise the ways in which human concepts of granulation are formed, organised and manipulated. In fact, fuzzy information granulation does not refer to a single fuzzy granule; rather it is about a collection of fuzzy granules which result from granulating a crisp or fuzzy object. Depending on the problem and on whether the granules and the process are fuzzy or crisp, one may have operations like granular fuzzy computing or fuzzy granular computing. The number of concepts formed through fuzzy granulation determines whether the corresponding granulation is relatively fine or coarse, and the choice of this number is an application-specific optimization problem. Rough set theory provides an effective model to acquire knowledge in an information system, with upper approximation and lower approximation as its core concepts, and to make decisions according to the definition of the indistinguishability (indiscernibility) relation and attribute reducts. Different variants of the conventional rough sets are available, obtained mainly by redefining the indistinguishability relation and the approximation operators. The rough set approach (RS) can be used to granulate a set of objects into information granules (IGs). The size of the information granules is determined, e.g., by how many attributes, and how many discrete values each attribute takes, in the subset of the whole attribute set that is selected to do the granulation. Generally, the larger the number of attributes and the more values each attribute takes, the finer are the resulting IGs. From the perspective of knowledge transformation, the task of analysing data and solving problems with fuzzy sets or rough sets is actually to find a mapping from the information represented by the original finest-grained data to the knowledge hidden behind a set of optimised, coarser and more abstract information granules. These theories are also integrated synergistically within themselves and with other knowledge acquisition models, which yield, for example, rough neural computing [5], neural fuzzy computing [11] and rough fuzzy computing [4,10]. While the former two enable forming various knowledge-based granular neural networks for improved learning and performance, rough-fuzzy computing provides a paradigm, stronger than fuzzy sets or rough sets alone, that can handle uncertainty arising both from the overlapping characteristics of concepts/classes/regions and from granularity in the domain of discourse.
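To make the rough set notions above concrete, the following minimal Python sketch (purely illustrative and not taken from the editorial or the cited works; the objects, attributes and target concept are hypothetical) granulates a toy information system with respect to a chosen attribute subset, shows how adding attributes yields finer information granules, and brackets a concept by its lower and upper approximations.

    from itertools import groupby

    # Toy information system: objects described by categorical attributes
    # (all names and values here are invented for illustration only).
    objects = {
        "x1": {"colour": "red",  "size": "big"},
        "x2": {"colour": "red",  "size": "big"},
        "x3": {"colour": "red",  "size": "small"},
        "x4": {"colour": "blue", "size": "small"},
        "x5": {"colour": "blue", "size": "small"},
    }

    def granulate(attrs):
        """Partition the universe into indiscernibility classes (information
        granules) with respect to the chosen subset of attributes."""
        key = lambda obj: tuple(objects[obj][a] for a in attrs)
        ordered = sorted(objects, key=key)
        return [set(group) for _, group in groupby(ordered, key=key)]

    def approximations(granules, concept):
        """Lower approximation: union of granules fully contained in the concept.
        Upper approximation: union of granules that overlap the concept at all."""
        lower = set().union(*(g for g in granules if g <= concept))
        upper = set().union(*(g for g in granules if g & concept))
        return lower, upper

    # Fewer attributes give coarser granules; adding attributes refines them.
    print(granulate(["colour"]))          # two coarse granules: {x4, x5} and {x1, x2, x3}
    print(granulate(["colour", "size"]))  # three finer granules: {x4, x5}, {x1, x2}, {x3}

    # A target concept, e.g. the objects carrying a particular decision label.
    concept = {"x1", "x2", "x4"}
    lower, upper = approximations(granulate(["colour", "size"]), concept)
    print(lower)          # {x1, x2}: certainly in the concept
    print(upper)          # {x1, x2, x4, x5}: possibly in the concept
    print(upper - lower)  # {x4, x5}: the boundary (rough) region

The boundary region (upper minus lower approximation) is precisely where the chosen granularity leaves membership undecided, which is the kind of uncertainty due to limited discernibility that rough sets are meant to capture.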

Big data: challenges and relevance of granular mining

Various aspects of our day-to-day activities have been influenced and regularised by the presence of big data. It has not only revolutionised individuals but also affected science, and the planning and policies of governments. Although accomplishment in this domain is at an initial stage, several technical challenges are posed that need to be addressed to fully realise the potential of big data. Generally, realising the usefulness of big data involves multiple levels of operational steps, such as acquisition, information extraction and cleaning, data integration, modelling and analysis, and interpretation and deployment. Fuzzy sets, rough sets, neural networks, interval analysis and their synergistic integrations in the granular computing framework have been found to be successful in most of these tasks. Research challenges around big data arise from various aspects and issues, such as heterogeneity, inconsistency and incompleteness, timeliness, privacy, visualization and collaboration.

One may note that managing uncertainty in decision-making and prediction is very crucial for mining any kind of data, no matter how small or big. While fuzzy logic (FL) is well known for modelling uncertainty arising from vague, ill-defined or overlapping concepts/regions, rough sets (RS) model uncertainty due to granularity (or limited discernibility) in the domain of discourse. Their effectiveness, both individually and in combination, has been established worldwide for mining audio, video, image and text patterns, as present in the big data generated by different sectors. Fuzzy sets and rough sets can be further coupled, if required, with (probabilistic) uncertainty arising from randomness in the occurrence of events, in order to result in a much stronger framework for handling real-life ambiguous applications. In the case of big data, the problem becomes more acute because of the manifold characteristics of some of the Vs, like high variety, dynamism, streaming, time variation, variability and incompleteness. This possibly demands the judicious integration of the three aforesaid theories for efficient handling.

In this issue

This special issue on "granular mining and knowledge discovery" presents some novel approaches and methodologies reflecting the state of the art of granular mining and knowledge discovery. After a rigorous review process of several submissions from all over the world, only six articles spanning a segment of the research in granular mining and its applications were selected for publication in this issue. The concepts behind the models described here would be useful to big data researchers and practitioners.

A brief description of the six contributions is given below, in the order they appear in the issue. In their article "Data granulation by the principles of uncertainty", L. Livi and A. Sadeghian present a data granulation framework that elaborates on the principles of uncertainty. It is based on the assumption that a procedure of information granulation is effective if the uncertainty conveyed by the synthesised information granule is in a monotonically increasing relationship with the uncertainty of the input data. The proposed framework aims to offer a guideline for the synthesis of information granules and to build the groundwork to compare, and quantitatively judge over, different granulation procedures. The authors have also provided suitable case studies to introduce a new data granulation technique based on the minimum sum of distances, which is designed to generate type-2 fuzzy sets. The method, with automatic membership function elicitation, is based entirely on the dissimilarity values of the input data, which makes it widely applicable.

The article "Clustering in augmented space of granular constraints: a study in knowledge-based clustering" by W. Pedrycz, A. Gacek and X. Wang provides a study on fuzzy clustering in an augmented space of granular constraints. The authors describe the constraints as a collection of Cartesian products of fuzzy sets. The role of information granules using order-2 fuzzy sets is underlined with regard to the results of clustering produced in the transformed space. A generalization of the proposed approach, with detailed algorithmic development, is also discussed for clustered data with non-numeric information.

C. Sengoz and S. Ramanna, in their article "Learning relational facts from the web: a tolerance rough set approach", have proposed a granular model that structures categorical noun phrase instances as well as semantically related noun phrase pairs from a given corpus. In addition, they put forward a semi-supervised tolerant pattern learning algorithm that labels categorical instances as well as relations. The authors have tried to address the issue of labelling a large number of unlabelled data with a small number of available labelled data. In this study, the model treats the noun phrases as described by sets of their co-occurring contextual patterns. For this, they have used the ontological information from the never ending language learner (NELL) system.

In the article "A new method for constructing granular neural networks based on rule extraction and extreme learning machine", the authors X. Xu, G. Wang, S. Ding, X. Jiang and Z. Zhao introduce a framework of granular neural networks named rough rule granular extreme learning machine (RRGELM), and develop its comprehensive design process. The proposed model is based on rough decision rules that are extracted from the training samples. In the granular neural network, the linked weights between the input neurons and granular-neurons are determined by the confidences of the rough decision rules, while the linked weights between the output neurons and granular-neurons can be initialised as the contributions of the rough rules to the classification. The network is then trained with the extreme learning machine algorithm and its efficacy is demonstrated on several data sets.

The next article, authored by S. Kundu and S. K. Pal, concerns mining social networks. In their article titled "Fuzzy-rough community in social networks", a novel algorithm for community detection in a social network is described. The method identifies fuzzy-rough communities where a node can be a part of many groups with different associated membership values. The algorithm runs on a new framework of social network representation based on fuzzy granulation theory. An index, called normalised fuzzy …, is defined to quantify the goodness of the detected communities. It may be mentioned here that the data available from online social networking sites are dynamic, large scale, diverse and complex, with all the characteristics of big data in terms of velocity, volume, and variety.

A. Zhu, G. Wang and Y. Dong describe a robust system in "Detecting natural scenes text via auto image partition, two-stage grouping and two-layer classification" to detect natural scene text according to text region appearances. The framework of this system includes three parts: auto image partition, two-stage grouping and two-layer classification. The first part partitions the images into unconstrained sub-images through the statistical distribution of sampling points. The designed two-stage grouping method performs grouping in each sub-image in the first stage and connects different partitioned image regions in the second stage to group connected components (CCs) into text regions. Then a two-layer classification mechanism is designed for classifying candidate text regions.

These articles provide a wide range of methods with various characteristic features and different applications of granular computing-based mining and knowledge discovery. We hope that the publication of this issue will encourage further research activities and motivate the development of novel approaches for uncertainty handling and for addressing the challenging issues encountered in real-life problems, including those in big data.

In conclusion, we would like to thank all the authors who submitted research manuscripts to this issue, as well as the anonymous reviewers for the time spent evaluating manuscripts and providing constructive comments and suggestions. We would like to extend a special note of appreciation to Dr. Gabriella Sanniti di Baja, Editor-in-Chief of Pattern Recognition Letters, and to Yanhong Zhai and Jefeery Alex, the publication supporting team, for their constant support.

References

[1] E.E.S. (Eds.), Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, John Wiley and Sons, N.Y., 2015.
[2] T.Y. Lin, Granular computing: practices, theories, and future directions, in: R.A. Meyers (Ed.), Encyclopedia of Complexity and Systems Science, Springer, Heidelberg, 2009, pp. 4339–4355.
[3] W. Pedrycz, A. Skowron, V. Kreinovich (Eds.), Handbook of Granular Computing, John Wiley and Sons, 2008.
[4] S.K. Pal, A. Skowron (Eds.), Rough-Fuzzy Hybridization: A New Trend in Decision Making, Springer-Verlag, Singapore, 1999.
[5] S.K. Pal, L. Polkowski, A. Skowron (Eds.), Rough-Neural Computing: Techniques for Computing with Words, Springer-Verlag, Berlin, 2004.
[6] S.K. Pal, S.K. Meher, Natural computing: a problem solving paradigm with granular information processing, Appl. Soft Comput. 13 (2013) 3944–3955.
[7] J.T. Yao, A.V. Vasilakos, W. Pedrycz, Granular computing: perspectives and challenges, IEEE Trans. Cybern. 43 (2013) 1977–1989.
[8] Z. Pawlak, Rough sets, Int. J. Comput. Inf. Sci. 11 (1982) 341–356.
[9] L.A. Zadeh, Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets Syst. 90 (1997) 111–127.
[10] P. Maji, S.K. Pal, Rough-Fuzzy Pattern Recognition: Applications in Bioinformatics and Medical Imaging, Wiley-IEEE CS Press, N.Y., 2012.
[11] S.K. Pal, Granular mining and rough-fuzzy pattern recognition: a way to natural computation (feature article), IEEE Intell. Inf. Bull. 13 (1) (2012) 3–13.

Sankar K. Pal∗
Center for Soft Computing Research, Indian Statistical Institute, 203 Barrackpore Trunk Road, Kolkata 700108, India

Saroj K. Meher
Systems Science and Informatics Unit, Indian Statistical Institute, Bangalore Centre, Bangalore 560059, India

Andrzej Skowron
Institute of Mathematics, Warsaw University, Banacha 2, 02-097 Warsaw, Poland

∗Corresponding author.
E-mail addresses: [email protected] (S.K. Pal), [email protected] (S.K. Meher)