(E0 261) Database Management Systems




Background Assignment - Answers




  1. No changes would have to be made since the DBMS provides independence between the application layer and the physical layer.

  2. Refer to Figure 1 for the solution.
    Figure 1: ER Diagram
    [figure: figs/er-0.eps]

  3. The relations are identical, since they have the same attributes and contain the same information. (A relation is a set of tuples: neither its rows nor its columns are ordered.)
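    This set semantics is easy to demonstrate in code. The sketch below (with made-up attribute names) models each tuple as a frozenset of (attribute, value) pairs, so that neither row order nor column order affects equality:

```python
# Two hypothetical instances of the same relation, with rows and columns
# listed in different orders.
r1 = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
r2 = [{"name": "b", "id": 2}, {"name": "a", "id": 1}]  # reordered

def as_relation(rows):
    """Abstract away row and column order: a relation is a set of tuples,
    and each tuple is an unordered set of (attribute, value) pairs."""
    return {frozenset(row.items()) for row in rows}

print(as_relation(r1) == as_relation(r2))  # -> True
```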

  4. SELECT cname
    FROM COMPANY
    WHERE (type = 'private') AND cname NOT IN
    ( ( SELECT p.cname
        FROM PROJECT p
        GROUP BY p.cname
        HAVING ( ( COUNT (DISTINCT pstate) > 1) AND
                 ( COUNT (DISTINCT pstate) <> COUNT (pname) ) ) )
    
       UNION
    
       ( SELECT h.cname
         FROM HIRE h, PROJECT p, WORKER w
         WHERE (h.cname = p.cname) AND
               (h.pname = p.pname) AND
               (h.EBno = w.EBno) AND
               ( ( p.pstate <> w.wstate ) OR
                 ( ( SELECT COUNT (*)
                     FROM EXPERIENCE e
                     WHERE (w.EBno = e.EBno)) <> 3 ) ) ) )
    ORDER BY cname ASC
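     As a sanity check, the query can be run on a toy instance with Python's sqlite3 module. The table contents below are invented for illustration, and the schemas (including an assumed skill column for EXPERIENCE) are guesses from the query text; string literals are written with single quotes for SQLite:

```python
import sqlite3

# 'Acme' is private with no offending projects; 'Gov' is public; 'Bad' is
# private but has three projects spread over only two states, so the first
# subquery's HAVING clause excludes it.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE COMPANY (cname TEXT, type TEXT);
CREATE TABLE PROJECT (cname TEXT, pname TEXT, pstate TEXT);
CREATE TABLE HIRE (cname TEXT, pname TEXT, EBno INT);
CREATE TABLE WORKER (EBno INT, wstate TEXT);
CREATE TABLE EXPERIENCE (EBno INT, skill TEXT);
INSERT INTO COMPANY VALUES ('Acme','private'), ('Gov','public'), ('Bad','private');
INSERT INTO PROJECT VALUES ('Bad','p1','KA'), ('Bad','p2','TN'), ('Bad','p3','KA');
""")
rows = con.execute("""
SELECT cname FROM COMPANY
WHERE type = 'private' AND cname NOT IN
  ( SELECT p.cname FROM PROJECT p
    GROUP BY p.cname
    HAVING COUNT(DISTINCT pstate) > 1
       AND COUNT(DISTINCT pstate) <> COUNT(pname)
    UNION
    SELECT h.cname FROM HIRE h, PROJECT p, WORKER w
    WHERE h.cname = p.cname AND h.pname = p.pname AND h.EBno = w.EBno
      AND ( p.pstate <> w.wstate OR
            (SELECT COUNT(*) FROM EXPERIENCE e WHERE w.EBno = e.EBno) <> 3 ) )
ORDER BY cname ASC
""").fetchall()
print(rows)  # -> [('Acme',)]
```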
    

    1. The first step is to check whether the decomposition is lossless-join. This can be done using the tableau ("Table") algorithm of Ullman (Algorithm 7.2). Applying this algorithm makes the row corresponding to relation scheme DGRS become "all a", indicating a lossless join.
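      The tableau test generalises easily to code. The sketch below is a generic chase routine; since the assignment's FD set F is not restated in this answer, it is demonstrated on a hypothetical textbook case: R(A,B,C) decomposed into AB and BC with B -> C (lossless), versus AB and AC (lossy).

```python
def lossless(attrs, decomposition, fds):
    """Tableau ("chase") test for a lossless join."""
    # One row per scheme: 'a' where the attribute is in the scheme,
    # otherwise a distinguished 'b' symbol unique to that row/column.
    rows = [{A: "a" if A in Ri else f"b{i}{A}" for A in attrs}
            for i, Ri in enumerate(decomposition)]
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            for r1 in rows:
                for r2 in rows:
                    if r1 is not r2 and all(r1[A] == r2[A] for A in lhs):
                        for A in rhs:
                            if r1[A] != r2[A]:
                                # Equate the two symbols, preferring 'a'.
                                v = "a" if "a" in (r1[A], r2[A]) else r1[A]
                                r1[A] = r2[A] = v
                                changed = True
    # Lossless iff some row has become "all a".
    return any(all(r[A] == "a" for A in attrs) for r in rows)

print(lossless("ABC", ["AB", "BC"], [("B", "C")]))  # -> True (lossless)
print(lossless("ABC", ["AB", "AC"], [("B", "C")]))  # -> False (lossy)
```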

    2. The second step is to evaluate the BCNF claim. The two-attribute relations (CP, RT, AL) are in BCNF by definition (Lemma 7.7 in Ullman). For CDR, a candidate key is DR. Therefore, the only possible violations of BCNF in $F^+$ are $C \rightarrow R$, $C \rightarrow D$, $D \rightarrow C$, and $R \rightarrow C$. By taking attribute closures of the left-hand sides, we can show that none of these f.d.s is in $F^+$; therefore, CDR is in BCNF. For LSX, a candidate key is S, so the only possible violations of BCNF in $F^+$ are $L \rightarrow X$ and $X \rightarrow L$. Again, closures of the left-hand sides show that neither is in $F^+$. Finally, for DGRS, a candidate key is GR. The possible violations are $G \rightarrow D$, $G \rightarrow S$, $R \rightarrow D$, $R \rightarrow S$, $DS \rightarrow G$, $DS \rightarrow R$, and $D \rightarrow S$. None of these is in $F^+$.

      Therefore, the BCNF claim is correct.
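      The closure computations above can be mechanised. The routine below is generic; since the assignment's F is not restated here, it is shown on a hypothetical FD set F = {A -> B, B -> C}. An f.d. X -> Y lies in $F^+$ exactly when Y is a subset of the closure X+.

```python
def closure(X, fds):
    """Attribute closure of X under a list of (lhs, rhs) dependencies."""
    X = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= X and not set(rhs) <= X:
                X |= set(rhs)
                changed = True
    return X

F = [("A", "B"), ("B", "C")]            # hypothetical FD set
print(closure("A", F))                  # -> {'A', 'B', 'C'}
print(set("C") <= closure("B", F))      # -> True:  B -> C is in F+
print(set("A") <= closure("C", F))      # -> False: C -> A is not in F+
```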

    3. Dependency-preservation claim: Consider the f.d. $DPT \rightarrow R$ in F and apply Algorithm 7.3 in Ullman. We find that R does not appear in the result of this procedure; therefore, the decomposition is not dependency-preserving. Other dependencies that are not preserved are $DST \rightarrow R$ and $CS \rightarrow G$.
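      Algorithm 7.3 is short enough to code directly. The version below is generic and, since the assignment's F is not restated, is run on a hypothetical example where a dependency is similarly lost: R(A,B,C) with F = {A -> B, B -> C}, decomposed into AC and BC.

```python
def closure(X, fds):
    """Attribute closure of X under (lhs, rhs) dependencies."""
    X = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= X and not set(rhs) <= X:
                X |= set(rhs)
                changed = True
    return X

def preserved(lhs, rhs, decomposition, fds):
    """Ullman's Algorithm 7.3: is lhs -> rhs preserved by the decomposition?"""
    Z, changed = set(lhs), True
    while changed:
        changed = False
        for Ri in decomposition:
            # Grow Z using the projection of F onto Ri.
            add = closure(Z & set(Ri), fds) & set(Ri)
            if not add <= Z:
                Z |= add
                changed = True
    return set(rhs) <= Z

F = [("A", "B"), ("B", "C")]                  # hypothetical FD set
print(preserved("A", "B", ["AC", "BC"], F))   # -> False: A -> B is lost
print(preserved("B", "C", ["AC", "BC"], F))   # -> True:  B -> C survives
```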

  5. Test C, the BCNF definition test, is guaranteed to identify which decomposition is which, since the 3NF scheme is certain to fail it. Test B will succeed for both the BCNF and the 3NF decompositions. Test A may or may not succeed for either decomposition. Test D will always succeed for both decompositions, since every BCNF schema is also in 3NF (BCNF is a strictly stronger condition).

  6. Refer to Figure 2 for the solution.
    Figure 2: Index Tree Structure
    [figure: figs/bt-1.eps]

  7. Refer to Figure 3 for the solution.
    Figure 3: Extendible Hash Structure
    [figure: figs/eh-0.eps]

    1. Using simple iteration, for each tuple in R we must perform an access for every tuple in S. This requires $20000 * 45000 = 9 * 10^8$ accesses, plus 20000 accesses to read R itself. Therefore, the total is 900,020,000 accesses.

      (Note: R must be the outer relation since it is smaller than S.)

    2. Block Iteration: Here, for each block in R, we must perform an access for every block in S. The number of blocks of R is $\frac{20000}{25} = 800$ and the number of blocks of S is $\frac{45000}{30} = 1500$. Therefore, $800 * 1500 = 1,200,000$ accesses are required. In addition, the 800 blocks of R have to be read, giving a total of 1,200,800 accesses.

      (Note: R must be the outer relation since it has fewer blocks than S.)
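      The arithmetic for these two iteration strategies can be double-checked with a few lines of Python:

```python
# n_r, n_s are tuple counts; b_r, b_s are block counts
# (25 and 30 tuples per block respectively).
n_r, n_s = 20000, 45000
b_r, b_s = n_r // 25, n_s // 30      # 800 and 1500 blocks

simple = n_r * n_s + n_r             # tuple-level iteration, R outer
block = b_r * b_s + b_r              # block-level iteration, R outer
print(simple, block)  # -> 900020000 1200800
```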

    3. Merge-Join: Using merge-join, we assume the relations are already sorted on the join attribute, so each block of R and S needs to be read only once. Therefore, only $800 + 1500 = 2300$ accesses are required.

      If the input relations are not already sorted on the join attribute, the cost of external merge-sort of the two relations should be added to the total cost.

      Cost of external merge-sort $= b_r(2 \lceil \log_{m-1}(b_r/m) \rceil + 1)$,
      where $b_r$ is the number of pages in the relation to be sorted and $m$ is the number of memory pages available. (Here $m = 3$, so the merge fan-in is $m - 1 = 2$.)

      Cost of sorting relation R: $800(2 \lceil \log_2(800/3) \rceil + 1) = 15200$; cost of writing the final sorted relation: $800$.
      Cost of sorting relation S: $1500(2 \lceil \log_2(1500/3) \rceil + 1) = 28500$; cost of writing the final sorted relation: $1500$.
      So the total cost is $16000 + 30000 + 2300 = 48300$.
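      The sort-cost formula can likewise be checked mechanically (assuming m = 3 memory pages, which matches the base-2 logarithms used above):

```python
from math import ceil, log

def sort_cost(b, m=3):
    """External merge-sort cost: b * (2 * ceil(log_{m-1}(b/m)) + 1)."""
    return b * (2 * ceil(log(b / m, m - 1)) + 1)

b_r, b_s = 800, 1500
total = (sort_cost(b_r) + b_r) + (sort_cost(b_s) + b_s) + (b_r + b_s)
print(sort_cost(b_r), sort_cost(b_s), total)  # -> 15200 28500 48300
```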

    4. Index Join:

      A $B^+$-tree index exists on relation R and no index exists on relation S; hence R will form the inner relation and S the outer relation. In the worst case, where the index nodes are half-full (fanout 10), the height of the index is $\lceil \log_{10} 20000 \rceil = 5$ (case 1). In the best case, where the nodes are full (fanout 20), the height is $\lceil \log_{20} 20000 \rceil = 4$ (case 2).

      Assuming the join attribute is a key of R, for each tuple in each block fetched from relation S, the corresponding tuple of relation R is retrieved by descending the index and then accessing the data tuple.

      Fetching relation S requires 1500 block accesses. Fetching the corresponding tuples from relation R would require 45000*(5+1) (case 1) or 45000*(4+1) (case 2) block accesses.

      Thus, in all, the join would require 271500 (case 1) or 226500 (case 2) block accesses.
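      The two index-join cases can be reproduced as follows (the fanouts 10 and 20 are inferred from the logarithm bases above):

```python
from math import ceil, log

# For each of the 45000 tuples of the outer relation S, descend the
# B+-tree on R (height h) and fetch the data tuple (+1).
n_r, n_s, b_s = 20000, 45000, 1500
h_worst = ceil(log(n_r, 10))  # half-full nodes, fanout 10
h_best = ceil(log(n_r, 20))   # full nodes, fanout 20

worst = b_s + n_s * (h_worst + 1)
best = b_s + n_s * (h_best + 1)
print(h_worst, h_best, worst, best)  # -> 5 4 271500 226500
```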

    5. Hash Join:

      Here, we must first read both relations in order to create the hash buckets, which requires 800 + 1500 = 2300 accesses; writing out the hash buckets costs another 2300 accesses. Then reading the hash buckets pair-wise to compute the join costs a further 2300 accesses. So, the total cost is 3 * 2300 = 6900 accesses.

      If the hash buckets fit into memory, then the cost is only the initial read, which is 2300 accesses.

      (Note that if only the join values and RIDs are maintained in the hash buckets, then the matching tuples have to be retrieved from disk. This can incur a cost of 20000+45000 = 65000 accesses since hashing results in random seeks.)

  8. The original relational algebra expression is
    $\pi_{T.cname} ( \sigma_{T.credits > S.credits} ( \sigma_{S.dept='CSA'} ( \rho_T(COURSE) \times \rho_S(COURSE) ) ) )$

    A better expression would be
    $\pi_{T.cname} ( \sigma_{T.credits > S.credits} ( \pi_{T.cname,\,T.credits}(\rho_T(COURSE)) \times \pi_{S.credits}(\sigma_{S.dept='CSA'}(\rho_S(COURSE))) ) )$

    This expression reduces COURSE (S) to just the credit values of courses in CSA, which should be a small set, and eliminates the unneeded attributes from COURSE (T). The cartesian product is therefore performed on the smallest amount of data possible.
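    The effect of the rewrite can be illustrated on a toy COURSE instance (data invented for illustration): pushing the selection and the projections shrinks both inputs of the cartesian product without changing the answer.

```python
from itertools import product

# A toy COURSE instance: (cname, dept, credits).
COURSE = [("db", "CSA", 4), ("os", "CSA", 3), ("ml", "EE", 4),
          ("na", "MA", 2), ("qm", "PH", 3)]

# Naive plan: cartesian product of the full relation with itself.
naive_pairs = len(COURSE) * len(COURSE)

# Rewritten plan: S is reduced to the credit values of CSA courses,
# T to (cname, credits) pairs, before the product is taken.
s_credits = {c for (_, dept, c) in COURSE if dept == "CSA"}
t_proj = [(name, c) for (name, _, c) in COURSE]
better_pairs = len(t_proj) * len(s_credits)

# Both plans answer the same question: courses with more credits
# than some CSA course.
result = {name for (name, tc), sc in product(t_proj, s_credits) if tc > sc}
print(naive_pairs, better_pairs)  # -> 25 10
print(sorted(result))             # -> ['db', 'ml']
```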

    1. Yes, there is a relational algebra equivalent for LOJ:

      $R \; LOJ \; S = (R \Join S) \cup ( ( R - {\pi}_{R} (R \Join S) ) \times \{(null, \ldots, null)\} )$

      where the constant relation $\{(null, \ldots, null)\}$ is on the schema $S - R$.

    2. The basic LOJ is not associative. For example, consider the schemas $R_1(A,B)$, $R_2(A,C)$, and $R_3(A,D)$ with contents $R_1=\{(x,1)\}$, $R_2=\{(y,2)\}$, and $R_3=\{(x,3)\}$. Then $R_1 \; LOJ \; ( R_2 \; LOJ \; R_3 )$ outputs $\{(x,1,null,null)\}$, while $( R_1 \; LOJ \; R_2 ) \; LOJ \; R_3$ gives $\{(x,1,null,3)\}$.

    3. No, LOJ is not commutative: $R \; LOJ \; S$ preserves every tuple of $R$ but not of $S$, whereas $S \; LOJ \; R$ does the reverse (the embedded $-$ is not symmetric in its operands).

    4. Since every tuple of $R$ has to appear in the output, the minimum cardinality of $R \; LOJ \; S$ is $\mid R \mid$.

    5. With minor modifications all these algorithms can be used to compute LOJ.
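      As a sanity check, parts 1, 2, and 4 above can be exercised with a small implementation (a simplification: the join is taken on the first column, standing in for the shared attribute A):

```python
def loj(R, S):
    """Left outer join of lists of tuples, joining on the first column."""
    out = []
    width = len(S[0]) - 1 if S else 1  # columns S adds to each result tuple
    for r in R:
        matches = [s for s in S if s[0] == r[0]]
        if matches:
            out.extend(r + s[1:] for s in matches)
        else:
            out.append(r + (None,) * width)  # dangling tuple, null-padded
    return out

R1, R2, R3 = [("x", 1)], [("y", 2)], [("x", 3)]
left = loj(loj(R1, R2), R3)   # (R1 LOJ R2) LOJ R3
right = loj(R1, loj(R2, R3))  # R1 LOJ (R2 LOJ R3)
print(left)   # -> [('x', 1, None, 3)]
print(right)  # -> [('x', 1, None, None)]
print(len(loj(R1, R2)) >= len(R1))  # minimum cardinality |R| holds here
```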

  9. System failures are not the only source of transaction aborts. Aborts may also arise from concurrency control (e.g., deadlock resolution in 2PL) or from user actions (e.g., typing Control-C). Therefore, the recovery manager is required to undo such aborts in the immediate-update case. For delayed updates, however, the recovery manager is not required, since updates do not reach disk prior to transaction commit and aborted transactions can therefore simply be ignored.

  10. Since log records are always flushed before the corresponding disk data updates, the protocol provides recoverability. During recovery, we need to undo the aborted transactions since the system allows immediate updates. However, no redo is required since all committed transactions are guaranteed to have their updates on disk. In short, this is an UNDO / NO-REDO protocol.

  11. The lock compatibility matrix is

         R   S   D   W
     R   Y   N   N   N
     S   N   N   N   N
     D   N   N   Y   Y
     W   N   N   Y   Y

    The R and S locks are identical to the Read and Write locks discussed in class, and the various RS combinations are therefore obvious. Since the D and W locks alter the value of the balance, they conflict with the R and S locks. However, they don't conflict with themselves or with each other because they are commutative operations (order of deposits or withdrawals is immaterial).
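    The matrix translates directly into a lock-manager lookup; the sketch below grants a requested mode only if it is compatible with every mode currently held:

```python
# The compatibility matrix above as a lookup table:
# COMPAT[held][wanted] is True when the two modes can coexist.
COMPAT = {
    "R": {"R": True,  "S": False, "D": False, "W": False},
    "S": {"R": False, "S": False, "D": False, "W": False},
    "D": {"R": False, "S": False, "D": True,  "W": True},
    "W": {"R": False, "S": False, "D": True,  "W": True},
}

def grantable(held_modes, wanted):
    """Grant 'wanted' only if it is compatible with every held mode."""
    return all(COMPAT[h][wanted] for h in held_modes)

print(grantable(["D", "W"], "D"))  # -> True: deposits/withdrawals commute
print(grantable(["R"], "D"))       # -> False: D changes the balance a reader sees
```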

  12. The schedule is


    $T_1$ : R(a) W(a) R(d) W(d)
    $T_2$ : R(d) R(a) W(a) R(c)
    $T_3$ : R(b) R(c) W(c) R(d)


    The schedule could not have been produced by 2PL because $T_2$ is allowed to access item 'a' before $T_1$, which has already accessed 'a', has completed its growing phase.

    The schedule could not have been produced by TO since $T_1$'s write of data item 'd' after $T_3$ has read it violates the pre-determined serial order $T_1, T_2, T_3$.

    The schedule could have been produced by VT since the protocol is non-blocking.