- Your classmate made the following ``one-phase commit protocol'' (1PC)
suggestion during the class on distributed commit processing: Let
each cohort pro-actively force a prepared log record to the local
disk and enter the prepared state at the time of sending the WORKDONE message itself. That is, the WORKDONE message doubles
up as a READY vote.
For the 2PC and 1PC commit protocols, fill in the following table, where
ERT refers to ``Earliest Release Time'' w.r.t. the protocol's
communication and logging action sequence (e.g., ``immediately after receiving the StartWork message'').
Commit Protocol | ERT: Read Locks | ERT: Write Locks
----------------|-----------------|------------------
1PC             |                 |
2PC             |                 |
Assume that
- A Strict-2PL-like protocol is used to ensure serializability
and no cascading aborts.
- The table entries are for the cohorts of committed transactions.
(4 marks)
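For reference while filling in the table, here is a minimal sketch of the cohort-side action sequences (our own summary, using the message and log-record names from class and from the question; the lock-release points are deliberately omitted, since they are exactly what the question asks for):

    # Cohort-side action sequences (sketch); WORKDONE/READY/PREPARE
    # terminology follows the question.
    TWO_PC_COHORT = [
        "execute assigned work",
        "send WORKDONE to coordinator",
        "receive PREPARE",
        "force prepared log record to local disk",  # enters prepared state
        "send READY vote",
        "receive COMMIT",
        "write commit log record",
        "send ACK",
    ]

    ONE_PC_COHORT = [
        "execute assigned work",
        "force prepared log record to local disk",  # proactive: prepared now
        "send WORKDONE (doubles up as READY vote)",
        "receive COMMIT",
        "write commit log record",
        "send ACK",
    ]

    if __name__ == "__main__":
        for name, seq in (("2PC", TWO_PC_COHORT), ("1PC", ONE_PC_COHORT)):
            print(name)
            for i, step in enumerate(seq, 1):
                print(f"  {i}. {step}")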
- Consider association-rule mining of market-basket databases as in
the Apriori paper. Define the mapping $g(X)$ as the
set of transactions in the database which all contain
the item-set $X$, and the mapping $f(Y)$ as the set of items
that are present in all the transactions of the transaction-set $Y$.
For example, given items $\{a, b, c\}$ and transactions
$t_1 = \{a, b\}$ and $t_2 = \{a, b, c\}$, then
$g(\{a, b\}) = \{t_1, t_2\}$ and $f(\{t_1, t_2\}) = \{a, b\}$.
With these definitions, an item-set $X$ is said to be ``complete''
if and only if $f(g(X)) = X$. Your project partner claims that if
he were ``magically'' given the identities and supports of only
the complete frequent item-sets in the item-set lattice, then he would
be able to establish the identities and supports of all frequent
item-sets without having to scan the database, and thereby
discover the association-rules trivially.
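To make the $f$ and $g$ mappings concrete, here is a minimal sketch of the two mappings and the completeness test; the three-transaction database is hypothetical, chosen to extend the example above:

    # Sketch of g (item-set -> transaction-set) and f (transaction-set ->
    # item-set) from the definitions above; the tiny database is hypothetical.
    DB = {
        "t1": {"a", "b"},
        "t2": {"a", "b", "c"},
        "t3": {"b", "c"},
    }

    def g(itemset):
        """Transactions that contain every item of `itemset`."""
        return {t for t, items in DB.items() if itemset <= items}

    def f(transet):
        """Items present in every transaction of `transet`."""
        sets = [DB[t] for t in transet]
        return set.intersection(*sets) if sets else set()

    def is_complete(itemset):
        """An item-set X is ``complete'' iff f(g(X)) == X."""
        return f(g(itemset)) == itemset

    print(g({"a", "b"}))            # {'t1', 't2'}
    print(f({"t1", "t2"}))          # {'a', 'b'}
    print(is_complete({"a", "b"}))  # True
    print(is_complete({"a"}))       # False: f(g({'a'})) == {'a', 'b'}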
- Do you believe the claim w.r.t. establishing all frequent item-set
identities? Explain.
(5 marks)
- For the frequent item-sets that are identifiable (if any), do you
believe the claim w.r.t. establishing their supports? If yes,
describe an algorithm for computing the supports.
If no, explain why not and show a sample scenario of the failure.
(5 marks)
- For the data warehousing environment shown in Figure 1,
and given a space budget of 12.5M, list the views recommended for
materialization by the greedy algorithm.
(4 marks)
Figure 1: Views on TPC-D database
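As a warm-up for applying the algorithm to Figure 1, the sketch below runs an HRU-style greedy benefit computation on a hypothetical lattice; the views, sizes (in millions of rows), dependencies, and budget are made up for illustration and are not the TPC-D views of the figure:

    # Greedy view selection in the HRU style (sketch). The lattice, view
    # sizes, and budget below are HYPOTHETICAL, not Figure 1's values.
    SIZE = {"top": 50.0, "a": 12.0, "b": 8.0, "c": 1.0}
    # DESC[v] = views (including v itself) whose queries v can answer.
    DESC = {"top": {"top", "a", "b", "c"},
            "a": {"a", "c"},
            "b": {"b", "c"},
            "c": {"c"}}

    def cost(view, selected):
        # Cost of answering `view`: size of its smallest selected ancestor.
        return min(SIZE[v] for v in selected if view in DESC[v])

    def benefit(v, selected):
        # Total query-cost reduction if v were added to `selected`.
        return sum(max(cost(w, selected) - SIZE[v], 0) for w in DESC[v])

    def greedy(budget):
        selected, used = {"top"}, SIZE["top"]  # top view is always present
        while True:
            cands = [v for v in SIZE
                     if v not in selected and used + SIZE[v] <= budget]
            if not cands:
                return selected
            # Under a space budget, pick the best benefit per unit space.
            best = max(cands, key=lambda v: benefit(v, selected) / SIZE[v])
            if benefit(best, selected) <= 0:
                return selected
            selected.add(best)
            used += SIZE[best]

    print(greedy(budget=65.0))  # {'top', 'c', 'b'} for these numbers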
- Your instructor claims, with reference to the View Selection paper,
that the reason the greedy algorithm works so well (performance
guarantee of 0.63) is that the benefit function possesses a monotonicity
property relating the impact of including new views in the result
set to the benefits of the remaining unselected views in the lattice.
Identify and prove this monotonicity property, and explain how it helps
in providing this strong guarantee.
(4 marks)
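For concreteness, one way to write down a property of this shape (a paraphrase in our own notation, where $B(v, S)$ denotes the benefit of view $v$ given an already-selected set $S$) is:

    B(v, S \cup \{u\}) \;\le\; B(v, S)
        \qquad \text{for every view } u \text{ and every unselected view } v,

i.e., enlarging the selected set can never increase the benefit of a remaining candidate view.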
- With reference to the Database Architecture paper:
- The hashing function uses B bits of the hash-value, progressively
going from left-to-right in this bit segment. Do you think
it is equally acceptable to go from right-to-left in the bit
segment? (2 marks)
- Recall that in the Linear Hashing paper, we had discussed the
high-quality randomizing hash function $h(k) = (A \cdot k) \bmod w$,
where $w$ is the word size (a power of two), $0 < A < w$, and $A$ is
prime w.r.t. $w$.
Your project partner suggests that after incorporating a ``mod $n$''
factor (with $n$ the number of clusters), this function can be used
during the initial clustering phase in memory-resident joins to improve
the distribution of data values across the clusters. Explain how you
would respond to this suggestion.
(2 marks)
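A minimal sketch of the suggested scheme; the constants $A$, $w$, and the cluster count $n$ are hypothetical choices, not values from either paper ($A$ is odd, hence prime w.r.t. $w$):

    # Sketch of the suggested clustering scheme; A, W, and N are
    # HYPOTHETICAL choices (A odd, hence gcd(A, W) = 1).
    W = 2 ** 32           # w: hash range, a power of two
    A = 2_654_435_761     # a well-known odd multiplier
    N = 8                 # n: number of in-memory clusters

    def h(k: int) -> int:
        return (A * k) % W

    def cluster_of(k: int) -> int:
        return h(k) % N   # the suggested ``mod n'' factor

    # With N a power of two, ``mod n'' keeps only the LOW-order bits of
    # (A*k mod w). For odd A these bits are a fixed permutation of
    # k mod N; for this A, A mod N == 1, so the cluster is exactly
    # k mod N -- a point worth weighing when responding:
    print(all(cluster_of(k) == k % N for k in range(10_000)))  # True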
- A system has a 32 KB L1 cache with 1024 lines of 32 bytes each and a
4 MB L2 cache consisting of 32768 lines of 128 bytes each. The cost of an
L1 miss is 5 cycles and the cost of an L2 miss is 55 cycles. You have to
design a CSB+-tree for this system with the goal of minimizing the miss
cost during a search. Assume that the key size is 4 bytes and the pointer
size is 4 bytes. There is an argument between you and your project partner
on whether to set the CSB+-tree node size to the L1 cache line size or the
L2 cache line size. To settle the dispute (a node-layout sketch follows
this question):
- For both these options, calculate the number of L1 and L2 cache
misses during a search and compute the aggregate cost of cache misses
in terms of CPU cycles. Show which option is more efficient.
(3 marks)
- How do these options compare in terms of the node split cost?
(3 marks)
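The node-layout arithmetic for the two candidate node sizes can be set up as follows (a sketch only: it assumes a CSB+-tree internal node stores its keys plus a single pointer to a contiguous child group, and the indexed key count N is hypothetical; the cache parameters are from the question):

    # Node-layout arithmetic for the two proposed CSB+-tree node sizes.
    import math

    KEY_B, PTR_B = 4, 4
    L1_LINE_B, L2_LINE_B = 32, 128
    L1_MISS_CYC, L2_MISS_CYC = 5, 55

    N = 10_000_000  # hypothetical number of indexed keys

    for node_b in (L1_LINE_B, L2_LINE_B):
        keys = (node_b - PTR_B) // KEY_B      # keys per node
        fanout = keys + 1
        height = math.ceil(math.log(N, fanout))  # nodes on root-to-leaf path
        l1_per_node = node_b // L1_LINE_B        # L1 lines touched per node
        l2_per_node = max(node_b // L2_LINE_B, 1)  # L2 lines touched per node
        print(f"{node_b:3d}B node: {keys} keys, fanout {fanout}, "
              f"height ~{height}, {l1_per_node} L1 line(s)/node, "
              f"{l2_per_node} L2 line(s)/node")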
- The tuple reconstruction paper efficiently handles queries with multiple
selection/projection predicates in columnar databases. Further, it
claims that ``Potentially, many operators can exploit the clustering
information in the maps, e.g., a max can consider only the last piece of
a map, or a join can be performed in a partitioned like way exploiting
disjoint ranges in the input maps''.
Do you agree with the above claim? If yes, validate it by providing your most efficient join
algorithm that exploits the existing partition information to implement
the binary join operation for a predicate of the form A.x = B.y. If no,
explain why. (5 marks)
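If one sides with the claim, a sketch along the following lines (our own construction under stated assumptions, not an algorithm from the paper) shows how disjoint partition ranges on both inputs could drive the join; it assumes each input arrives as pieces with disjoint, sorted value ranges on the join attribute, so only pieces with overlapping ranges can match:

    # Partition-wise hash join for A.x = B.y (sketch). pieces_*: lists of
    # (lo, hi, rows) with disjoint, sorted ranges; rows are
    # (join_key, payload) pairs. The data layout is hypothetical.
    def partitioned_join(pieces_a, pieces_b):
        i = j = 0
        while i < len(pieces_a) and j < len(pieces_b):
            lo_a, hi_a, rows_a = pieces_a[i]
            lo_b, hi_b, rows_b = pieces_b[j]
            if hi_a < lo_b:        # A's piece lies entirely below B's: skip
                i += 1
            elif hi_b < lo_a:      # B's piece lies entirely below A's: skip
                j += 1
            else:                  # ranges overlap: hash-join just this pair
                table = {}
                for key, pay in rows_a:
                    table.setdefault(key, []).append(pay)
                for key, pay_b in rows_b:
                    for pay_a in table.get(key, ()):
                        yield (key, pay_a, pay_b)
                if hi_a <= hi_b:   # advance whichever piece ends first
                    i += 1
                else:
                    j += 1

    A = [(0, 9, [(3, "a1"), (7, "a2")]), (10, 19, [(12, "a3")])]
    B = [(0, 9, [(7, "b1")]), (10, 19, [(12, "b2"), (15, "b3")])]
    print(list(partitioned_join(A, B)))
    # [(7, 'a2', 'b1'), (12, 'a3', 'b2')]

Because each piece pair is joined independently, only one piece's hash table need be memory-resident at a time, which is the main payoff of exploiting the disjoint ranges.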