PhD DATABASE

Title:  
Partial Query Evaluation for Vertically Partitioned Signature Files in Very Large Unformatted Databases
Abstract:  
Signature file approach is a well-known indexing technique. The Bit Sliced Signature File (BSSF) method provides an efficient retrieval environment by only accessing on-bits of query signatures. However, its response time increases for increasing number of query terms. For BSSF we define a stopping condition that tries to avoid the processing of all on-bits of query signatures through partial evaluation. The aim of the stopping condition is to reduce the expected number of false drops to the level that also provides the lowest response time within the framework of BSSF. We propose the Partially evaluated Bit- Sliced Signature File (P-BSSF) method that employs the partial evaluation strategy and minimizes the response time in a multi-term query environment by considering the submission probabilities of the queries with different number of terms. Experiments show that P-BSSF provides 85% improvement in response time over BSSF depending on space overhead and the number of query terms. To provide better optimization of the signature file parameters in multi-term query environments, we propose Multi-Fragmented Signature File (MFSF) method as an extension of P-BSSF. In MFSF, a signature file is divided into variable sized vertical fragments with different on-bit densities to optimize the response time using a similar query evaluation methodology. In query evaluation the query signature on-bits of the lower on-bit density fragments are used first. As the number of query terms increases, the number of query signature on-bits in the lower on-bit density fragments increases and the query stopping condition is reached in fewer bit slice evaluation steps. Therefore, in MFSF, the response time decreases for an increasing number of query terms. The analysis shows that, with no space overhead, MFSF outperforms the P-BSSF and generalized frame-sliced signature file organizations. Due to hashing and superimposition operations used in obtaining signatures, the signature of an irrelevant record may match the query signature, i.e., it is possible to have false drops. In signature file processing the accurate estimation of the number of false drops is essential to obtain system parameters that will yield a desirable response time. We propose a more accurate false drop estimation method for databases with varying record lengths instead of using an average number of distinct terms per record. In this method, the records of a database are conceptually partitioned according to the number of distinct terms they contain and the number of false drops of each group is estimated separately. Experiments with real data show that depending on the space overhead, the proposed method obtains up to 33%, 25%, and 20% response time improvements for the sequential, generalized frame-sliced, and MFSF methods, respectively. For very large databases even one bit slice of MFSF may occupy several disk blocks. We propose the Compressed Multi-Fragmented Signature File (C-MFSF) method that stores the bit slices of MFSF in a compact form which provides a better response time. To minimize the number of disk accesses, the signature size and the disk block size can be adjusted such that most bit slices fit into a single disk block after compression. In such environments, C-MFSF evaluates the queries with more than two terms with only one disk access per query term rather than two disk accesses of the inverted file method which are respectively for the pointer of the query term posting list and the list itself. Our projection based on real data shows that for a database of one million records C-MFSF provides a response time of 0.85 seconds.
URL:  
Area of Science:  
Computer Science
PhD Student:  
Seyit Kocberber
E-mail:  
Scientific Adviser:  
Fazli Can
E-mail:  
University:  
Bilkent University
City:  
Ankara
Country:  
Turkey