(E0 261) Database Management Systems







Sample Questions (Midterm) - Answers

Question 1:
 

  1. Let us first calculate the selectivities of the two predicates.

    Selectivity of `emp.salary $ >$ 40000' $ = \frac{100000-40000}{100000-20001+1} = 0.75$

    Selectivity of `emp.age $ =$ 30' $ = \frac{1}{ICARD(I_2)} = \frac{1}{40}$

    Tuple cardinality is thus $ 100000*0.75*\frac{1}{40} = 1875$

  2. The optimizer will consider 3 access paths: (a) Segment scan, (b) Index scan on $ I_1$ , and (c) Index scan on $ I_2$ .

    Cost of segment scan $ = \frac{TCARD}{P} = \frac{1000}{0.8} = 1250$

    Cost of index scan on $ I_1$ $ = 0.75*(NINDX(I_1) + TCARD) = 0.75(500 + 1000) = 1125$ since $ I_1$ is a clustered index.

    $ I_2$ is a unclustered index. The number of page accesses $ = \frac{1}{40}*(NINDX(I_2) + NCARD) = \frac{1}{40}*(200+100000) = 2505$ .
    However, note that the number of distinct pages of EMP is only 1000, which can completely fit in main memory. So even in the worst case -- that is, if the 40-year old employees, who are estimated to be 2500 in number, are spread sequentially across all these pages, with a new page accessed each time, there can be at most 1000 disk accesses.

    Therefore, the maximum number of disk IOs with $ I_2$ is $ \frac{1}{40}*200 + 1000 = 1005$

    Overall, the index scan on $ I_2$ is the best option and the optimizer will choose that option.

Question 2:
 

(a) For relation $ S$ , the partition of the relation is based on the values of $ S.name$ .

(b) and (c) The following algorithm can be used to generate samples and compute the value of RandomSample():

  1. Lookup, using the index on $ SRNo$ , a random tuple $ t$ with key value $ k$ ;
  2. Lookup, using the index on $ name$ , the set of tuples $ S_t = \{ t_i \mid t_i.A = t.A\}$ ;
  3. Compute $ P_t = \Pi_{name,hostel}(S_t)$
  4. Return $ \displaystyle value = \frac{\mid P_t \mid}{\mid S_t \mid}$

For example, let the STUDENT relation have tuples

$ (1,n_1,a_1,h_1,g_1), (2,n_1,a_2,h_1,g_2), (3,n_1,a_3,h_1,g_3), (4,n_1,a_4,h_2,g_4), (5,n_2,a_5,h_1,g_5)$
Then, the function will return (for random tuple $ t$ with $ k=1$ ):
$ \frac{\mid (n_1,h_1; \; n_1,h_2) \mid}{\mid (n_1,h_1; \; n_1,h_1; \; n_1,h_1; \; n_1,h_2) \mid }$ = $ \frac{2}{4} = 0.5$ .

Finally, in the last computation of $ ns/m$ , the value of $ n$ is simply $ \mid R \mid$ .

It can be worked out to see that the above solution also provides quality guarantees similar to one provided in the sampling paper.

Question 3:
 

YES, the claim is correct.

Given the file construction parameters of $ N,K =1$ , it means that the file grows as follows, measured in quanta sizes: $ 1, 1, 2, 4, 8, 16, \ldots,
2^p$ . This means that the currently assigned segment size ($ 2^p$ ) will be always equal to the sum of the sizes of the previous assignments. Since the while loop in A2 goes backwards through the segments, and assuming a uniformly randomizing hash function, it means that the probability of finding the required address in 1 iteration is 0.5, in 2 iterations is 0.25, in 3 is 0.125, and so on. Therefore the expected number of iterations is approximately $ \sum_{i=1}^{i=p+1} \frac{i}{2^i}$ , which is a convergent series bounded above by 2.

Question 4:
 

The closest point in P4 that covers ALL points in P1 is 40. Therefore, $ \lambda$ is 40/10 - 1 = 3.
The closest point in P4 that covers ALL points in P2 is 80. Therefore, $ \lambda$ is 80/16 - 1 = 4.
The closest point in P4 that covers ALL points in P3 is 70. Therefore, $ \lambda$ is 70/30 - 1 = 1.33.
Therefore, the lowest overall $ \lambda$ value is max (3,4,1.33) = 4.

Question 5:
 

(a) The minimum number of components $ m$ corresponds to the smallest space-optimal index that meets the budget requirement. For a space-optimal range-encoded index, the number of bitmaps is $ n(b-2)+r$ where $ b = \lceil C^{\frac{1}{n}} \rceil $ and $ r$ is the smallest integer such $ b^r {(b-1)}^{n-r} \geq C$ .

In the given problem, $ C=1000$ since the domain is 101 ...1100. Therefore, the minimum number of components corresponds to $ m=3$ since $ b=10$ , $ r=3$ and the bitmap consumption is $ 3 * (10-2) + 3$ = 27 which is less than 40. (With two components, $ b=32$ , $ r=1$ and consumption is $ 2 * (32-2) + 1 = 61$ , which is more than 40.)

The specific space-optimal index is $ < 10,10,10 >$ .

(b) For the time-optimal index, the minimum number of components is $ m = 6$ since this corresonds to an index $ < 2, 2, 2, 2, 2, 32>$ , whose space consumption is 36. The time metric for a time-optimal index is $ m + \frac{1}{3}(1 - \frac{4}{b_1})$ leading to $ 6 + \frac{1}{3} (1 - \frac{4}{32})$ = 6.3 bitmap scans.

Question 6:
 
  1. The analysis phase will start a transaction table containing $ T_1$ and the dirty page table being empty. It will proceed from the begin-checkpoint record at LSN 10. The transaction table at the end of analysis is shown below:
    Transaction lastLSN status undoNxtLSN
    $ T_2$ 100 aborted 100
    $ T_3$ 70 running -
    The dirty page table (DPT) is shown below. Note that $ P_1$ will not be dirty since the DPT written in the checkpoint record is empty indicating update to $ P_1$ made at LSN 00 was written to disk before the checkpoint started. $ P_2$ , which is dirty, didn't get written to the DPT since it is a fuzzy checkpoint and DPT accutaely captures the state only upto begin_checkpoint.
    Page Id recLSN
    $ P_2$ 20
    $ P_3$ 40
    $ P_4$ 50
    $ P_5$ 70

  2. Since $ T_2$ and $ T_3$ were active at the time of the crash, their changes should be undone. The first two CLRs correspond to the updates made at LSN 90 and LSN 70 (Undos are done in the reverse order of the LSNs). The CLRs are shown below:
    LSN LOG undoNextLSN
    110 CLR: Undo T2 LSN 90 40
    120 CLR: Undo T3 LSN 70 50

  3. Once a CLR is written for an update action $ A$ , the undoNextLSN of that CLR points to the previous update $ P$ for that transaction (the next to be undone). Thus on a future restart, during the UNDO phase, the update action $ A$ will be skipped and undo will proceed directly to $ P$ . This implies that only one CLR can be written for each update action to be undone over any number of restarts. In this case, there are 4 update actions (2 for $ T_2$ and 2 for $ T_3$ ) that need to be undone. So a maximum of 4 CLRs will be written before the restart is successful.