sample_final

E0 261 Sample Questions (Final)

Instructions:

YOU SHOULD NOT DISCUSS THE QUESTIONS OR ANSWERS WITH ANYBODY ELSE UNDER ANY CIRCUMSTANCES - such activities will immediately attract an F grade. Clarifications, if necessary, should be sought directly from the instructor or the TAs.
Pick up one question paper, one stapled answer booklet, and a few loose sheets. The loose sheets should be used for your rough work and need not be returned. Your ``fair'' answers should be in the stapled answer booklet with your name and SRNo on the front page. Write the answer to each question on a different page.
Spend the first half-hour reading all the questions carefully and fully. If a question is not clear, ask the instructor or the TAs for clarifications.
Write to-the-point answers with proper justification for each of your claims. For questions that ask you to say YES or NO to a suggestion, clearly write whether your answer is YES or NO, before giving the explanation for your answer.
Since this is an open-time exam, it is expected that you will have carefully checked your answers (i.e. no ``silly mistakes'').

QUESTIONS:

Your classmate made the following ``one-phase commit protocol'' (1PC) suggestion during the class on distributed commit processing: Let each cohort pro-actively force a prepared log record to the local disk and enter the prepared state at the time of sending the WORKDONE message itself. That is, the WORKDONE message doubles up as a READY vote.
For the 2PC and 1PC commit protocols, fill up the following table, where ERT refers to ``Earliest Release Time'' w.r.t. the protocol's communication and logging action sequence (e.g. Immediately after receiving StartWork message).

Commit E R T E R T

Protocol Read Locks Write Locks

1PC

2PC

Assume that
- A Strict-2PL-like protocol is used to ensure serializability and no cascading aborts.
- The table entries are for the cohorts of committed transactions. (4 marks)
Consider association-rule mining of market-basket databases as in the Apriori paper. Define the mapping as the set of transactions in the database which all contain the item-set , and the mapping as the set of items that are present in all the transactions of the transaction-set . For example, given items and transactions , then and .
With these definitions, an item-set is said to be ``complete'' if and only if . Your project partner claims that if he were ``magically'' given the identities and supports of only the complete frequent item-sets in the item-set lattice, then he would be able to establish the identities and supports of all frequent item-sets without having to scan the database, and thereby discover the association-rules trivially.
1. Do you believe the claim w.r.t. establishing all frequent item-set identities? Explain. (5 marks)
2. For the frequent itemsets that are identifiable (if any), do you believe the claim w.r.t. establishing their supports? If yes, describe an algorithm for computing the supports. If no, explain why and show a sample scenario of the failure. (5 marks)
For the data warehousing environment shown in Figure 1, and given a space budget of 12.5M, list the views recommended for materialization by the greedy algorithm. (4 marks)

Figure 1: Views on TPC-D database
$\begin{figure} \begin{center} \hspace*{1.0in} \centerline{\psfig{file=index.ps,height=8.0in,width=4.0in}} \vspace*{-7.5in} \end{center} \end{figure}$
Your instructor claims, with reference to the View Selection paper, that the reason for the greedy algorithm working so well (performance guarantee of 0.63) is that the benefit function possesses a monotonicity property that relates the impact of including new views in the result set on the remaining unselected views in the lattice. Identify and prove this monotonicity property, and explain how it helps in providing this strong guarantee. (4 marks)
With reference to the Database Architecture paper:
1. The hashing function uses B bits of the hash-value, progressively going from left-to-right in this bit segment. Do you think it is equally acceptable to go from right-to-left in the bit segment? (2 marks)
2. Recall that in the Linear Hashing paper, we had discussed the high-quality randomizing hash function: [(A k) mod w] where , $w = 2^{32}$ , and is prime wrt . Your project partner suggests that after incorporating a ``mod '' factor, this function can be used during the initial clustering phase in memory-resident joins to improve the distribution of data values across the clusters. Explain how you would respond to this suggestion. (2 marks)
A system has 32 KB L1 cache with 1024 lines of 32 bytes each and 4 MB of L2 cache consisting of 32768 lines of 128 bytes each. The cost of a L1 miss is 5 cycles and the cost of a L2 miss is 55 cycles. You have to design a CSB tree for this system with a goal of minimizing the miss cost during a search. Assume that the key size is 4 bytes and a pointer size is 4 bytes. There is an argument between you and your project partner on whether to set the CSB node size to be the L1 cache line size or the L2 cache line size. To settle the dispute:
1. For both these options, calculate the number of L1 and L2 cache misses during a search and compute the aggregate cost of cache misses in terms of cpu cycles. Show which one is more efficient. (3 marks)
2. How do these options compare in terms of the node split cost? (3 marks)
The tuple reconstruction paper efficiently handles queries with multiple selection/projection predicates in columnar databases. Further, it claims that - ``Potentially, many operators can exploit the clustering information in the maps, e.g., a max can consider only the last piece of a map, or a join can be performed in a partitioned like way exploiting disjoint ranges in the input maps''.
Do you agree with the above claim? If yes, validate it by providing your most efficient join algorithm that exploits the existing partition information to implement the binary join operation for a predicate of the form A.x = B.y. If no, explain why. (5 marks)

Commit	E R T	E R T
Protocol	Read Locks	Write Locks
1PC
2PC