E0 261 Sample Questions (MidTerm)

Instructions:
 



QUESTIONS:

  1. Consider the following SQL query:

    select * from EMP where EMP.salary $>$ 40000 AND EMP.age $=$ 30

    Assume that there is a clustered B+ tree index ($I_1$) on EMP.salary and an unclustered B+ tree index ($I_2$) on EMP.age. The employee salaries range from 20001 till 100000 and the ages range from 21 till 60. The statistics are as follows:
    NCARD = 100000 tuples, TCARD = 1000 pages, P = 0.8,
    ICARD($I_1$) = 1000 values, NINDX($I_1$) = 500 pages, ICARD($I_2$) = 40 values, NINDX($I_2$) = 200 pages.
    Assume that we are concerned only with the IO cost and that the buffer pool allocated to this query can hold 1200 pages.

    1. Calculate the tuple cardinality estimated by System R for the query's result. (2 marks)
    2. What access paths would be considered by the optimizer for the relation EMP? Which path is chosen? Show the cost calculations. (4 marks)

  2. The Dean of IISc asks you to estimate the cardinality of the query

    select distinct (S.name, S.hostel) from STUDENT S

    where the STUDENT relation has the schema [SRNo, name, hostel, gpa] and individual $B^+$-tree indexes are available on the SRNo and name attributes, respectively.

    Assume that you use, for this estimation, the strategy outlined in Figure 1 of the Sampling paper discussed in class. Within this framework, describe your: (a) partitioning approach, (b) random sampling strategy, and (c) computation technique for the RandomSample() function.

    [Hint: Consider the possibility of result partitions with fractional sizes.] (6 marks)

  3. In the Linear Hashing paper, it is claimed that the the address computation algorithm A2 is extremely efficient since the while loop is on average executed at most twice. Do you agree? If yes, prove the claim. If no, indicate your assessment of the average number of iterations. (For simplicity, assume that the file construction parameters, $N$ and $K$, are both set to 1.) (4 marks)

  4. Consider the plan diagram $PD$ shown in the following figure, annotated with the corner costs for each of the optimality regions. For this diagram, is it possible to identify the minimum $\lambda$ at which the reduced plan diagram (using Cost-Bounding approach) is guaranteed to feature only a single plan? If yes, identify (with justification) this minimum value, and also mention the associated plan. If not, explain why it is infeasible.

    Image final

    [Note: Assume plan cost monotonicity (PCM) is strictly adhered to by all plan cost functions.] (4 marks)

  5. You are given the task of designing a range-encoded bitmap index on an integer attribute Weight that records the weight of a package to be mailed. The minimum possible weight is 101 grams and the maximum allowable weight is 1100 grams. Disk space is very scarce and you have only enough to store 40 bitmaps overall.

    1. What is the minimum number of components required in the index? Give a set of base values that satisfy the space constraints. (3 marks)

    2. What is the minimum number of components required to produce a time-optimal index that satisfies the space constraints? Give a set of base values for this index and calculate the expected time consumption incurred by the index. (3 marks)

  6. Consider the ARIES recovery algorithm on the execution log shown below. The master record points to the checkpoint at LSN 10. Assume that the dirty page table written in the checkpoint record (at LSN 20) at the beginning is empty.

    LSN LOG
    00 update: T1 writes P1
    10 begin-checkpoint
    20 update: T1 writes P2
    30 end-checkpoint Transaction Table: T1,
      Dirty Page Table:
    40 update: T2 writes P3
    50 update: T3 writes P4
    60 T1 commit
    70 update T3 writes P5
    80 T1 end
    90 update T2 writes P3
    100 T2 abort
      CRASH, RESTART

    1. At the end of the analysis phase, what are the contents of the transaction table and the dirty page table? Show the lastLSN field for the active transactions and the recLSN field for the dirty pages. (2 marks)
    2. Show the first two CLRs written during the UNDO phase. Show the undoNextLSNs. (2 marks)
    3. Due to some glitch in the power supply, the system keeps rebooting repeatedly and the recovery algorithm runs on each restart. Specify the maximum number of log records that may be cumulatively written during recovery over all the restarts. (2 marks)