E0 261 Sample Questions (Final)

Instructions:
 



QUESTIONS:

  1. Consider association-rule mining of market-basket databases as in the Apriori paper. Define the mapping $t(X)$ as the set of transactions in the database which all contain the item-set $X$, and the mapping $i(S)$ as the set of items that are present in all the transactions of the transaction-set $S$. For example, given items $A,B,C,D$ and transactions $T_1 = (A,B); \: T_2 = (B,C); \: T_3 = (B,C,D)$, then $t(BC) = \{T_2, T_3\}$ and $i(T_1,T_2) = \{B\}$.

    With these definitions, an item-set $X$ is said to be ``complete'' if and only if $X = i ( t (X))$. Your project partner claims that if he were ``magically'' given the identities and supports of only the complete frequent item-sets in the item-set lattice, then he would be able to establish the identities and supports of all frequent item-sets without having to scan the database, and thereby discover the association-rules trivially.

    1. Do you believe the claim w.r.t. establishing all frequent item-set identities? Explain. (5 marks)

    2. For the frequent itemsets that are identifiable (if any), do you believe the claim w.r.t. establishing their supports? If yes, describe an algorithm for computing the supports. If no, explain why and show a sample scenario of the failure. (5 marks)



  2. For the data warehousing environment shown in Figure 1, and given a space budget of 12.5M, list the views recommended for materialization by the greedy algorithm. (4 marks)

    Figure 1: Views on TPC-D database
    \begin{figure}
\begin{center}
\hspace*{1.0in} \centerline{\psfig{file=index.ps,height=8.0in,width=4.0in}}
\vspace*{-7.5in}
\end{center}
\end{figure}


  3. Your instructor claims, with reference to the View Selection paper, that the reason for the greedy algorithm working so well (performance guarantee of 0.63) is that the benefit function possesses a monotonicity property that relates the impact of including new views in the result set on the remaining unselected views in the lattice. Identify and prove this monotonicity property, and explain how it helps in providing this strong guarantee. (4 marks)


  4. With reference to the Database Architecture paper:

    1. The hashing function uses B bits of the hash-value, progressively going from left-to-right in this bit segment. Do you think it is equally acceptable to go from right-to-left in the bit segment? (2 marks)

    2. Recall that in the Linear Hashing paper, we had discussed the high-quality randomizing hash function: [(A k) mod w] where $A = 6125423371$, $w = 2^{32}$, and $A$ is prime wrt $w$. Your project partner suggests that after incorporating a ``mod $H_p$'' factor, this function can be used during the initial clustering phase in memory-resident joins to improve the distribution of data values across the clusters. Explain how you would respond to this suggestion. (2 marks)

  5. A system has 32 KB L1 cache with 1024 lines of 32 bytes each and 4 MB of L2 cache consisting of 32768 lines of 128 bytes each. The cost of a L1 miss is 5 cycles and the cost of a L2 miss is 55 cycles. You have to design a CSB tree for this system with a goal of minimizing the miss cost during a search. Assume that the key size is 4 bytes and a pointer size is 4 bytes. There is an argument between you and your project partner on whether to set the CSB node size to be the L1 cache line size or the L2 cache line size. To settle the dispute:

    1. For both these options, calculate the number of L1 and L2 cache misses during a search and compute the aggregate cost of cache misses in terms of cpu cycles. Show which one is more efficient. (3 marks)
    2. How do these options compare in terms of the node split cost? (3 marks)

  6. The tuple reconstruction paper efficiently handles queries with multiple selection/projection predicates in columnar databases. Further, it claims that - ``Potentially, many operators can exploit the clustering information in the maps, e.g., a max can consider only the last piece of a map, or a join can be performed in a partitioned like way exploiting disjoint ranges in the input maps''.

    Do you agree with the above claim? If yes, validate it by providing your most efficient join algorithm that exploits the existing partition information to implement the binary join operation for a predicate of the form A.x = B.y. If no, explain why. (5 marks)