Document Type



This item is available under a Creative Commons License for non-commercial use only


Computer Sciences

Publication Details

Successfully submitted in partial fulfilment of the requirements of the Dublin Institute of Technology for the award of M.Sc. in Computing (Data Analytics), March, 2015.


One of the greatest challenges in the modern information world is storing data and extracting usable knowledge from it, so that enterprises can gain insight that can be leveraged to succeed in a competitive environment. Data is constantly being generated and collected at a rapid pace from users and customers via many sources, such as intranets, the internet and smart devices. To use the collected data effectively, knowledge must be extracted from it to allow effective and efficient business planning. While the core challenge of data analytics is to extract hidden, non-trivial information from large datasets, there is the more immediate concern of how to index information effectively so that both the data and the derived knowledge can be recalled quickly and accurately. This thesis examines the theoretical challenges in reading semi-structured data (e.g. a website), converting it to an effective storable format, and searching the data from an index. With the knowledge gained from the theoretical challenges, an investigation follows to determine whether it is possible to use only open source components to build a framework that will in part solve the challenges of enterprise indexing. Finally, the experiment yields a usable framework for a federated search system, based on a file-based index core and a relational-database-based core, that is available for further refinement. The experiment and evaluation reveal that, using the proposed open source frameworks and architecture, it is possible to combine the file-based index and the relational database index to form a federated search framework with the best attributes of both, while also solving the major challenges around scalability and systems integration in an enterprise environment. Although a relational database index has slower processing times than a file-based index, each has advantages and disadvantages from the performance and maintenance perspectives.
Rather than picking one technology over the other, the experiment shows that the two can work together in a non-intrusive way to index the same dataset within one system.