0

I'm helping someone do homework for their databases class, but the syllabus has changed since I took it. Specifically there is one problem about selectivity that I have no background knowledge for and relentless Google-ing isn't turning up proper results. If anyone could look at the problem and point me towards an online resource that covers this type of problem I would be really grateful!

"Let R be a relation instance with schema(NAME, GENDER, AGE, INCOME). The selectivity of δGENDER='F'(R) is 0.4, the selectivity of δAGE<=30(R) is 0.3 and the selectivity of δINCOME>90,000(R) is 0.6. What is the selectivity of δGENDER='F' AND (AGE<=30 OR INCOME<=90,000)(R)"

To me this looks like a probabilities problem (Which meshes with what little I can find of selectivity, it's just the ratio of things that survive a filter) so my first instinct is to say that the answer to this is 0.4 * (0.3 + 0.6) but I have no idea if I'm in the right ballpark.

  • You're right that there's a lot of annoying/vague/waffly definitions of database/index selectivity, but this one's ok and specific to SQL: https://www.programmerinterview.com/database-sql/cardinality-versus-selectivity/. Then it results from taking all distinct values in a column, not from some specific value `'F'` or range of values `<=30`. Your 0.4 for `'F'` looks like it's saying 40% of the table content are Female(?) If 'selectivity' means the latter, it's no help in combining queries with `AND/OR`: we don't know if all the Females are under 30 or over 30 or evenly spread through the ages. – AntC Dec 18 '19 at 20:24
  • Per the definition of Selectivity would tell you the clustering vs discriminatory power of an index on a particular column: don't use `GENDER` in your search optimisation, it'll only roughly halve the rows to search. Do you use the column with the lowest Selectivity, providing you're searching for a specific value rather than a wide range: that'll quickly narrow down to a few rows. – AntC Dec 18 '19 at 20:28
  • Frankly, the question as posed (with presumably a different definition for Selectivity) looks like bunkum: a (0.3 Selectivity `OR` 0.6 Selectivity) would give a combined Selectivity somewhere between 0.3 (all AGE <= 30 are also INCOME <= 90k) and 0.9 (all AGE <=30 have INCOME >90k, all INCOME <= 90k are AGE > 30). Combined Selectivity across `AND` would be the minimum of the conjuncts. – AntC Dec 18 '19 at 20:36
  • Possibly related, and has links to other q's about Selectivity: https://stackoverflow.com/questions/4185378/indexing-alternatives-for-low-selectivity-columns. AFAICT 'Selectivity' is defined per the link in my first comment, and not per the question as posed. Does the instructor/teaching material know what they're talking about? Most don't: teaching on database theory is usually from academics who've never worked in the industry and regard SQL/databases with distaste. – AntC Dec 18 '19 at 20:45
  • Uh-oh different definition of Selectivity here (read down quite a way) https://blogs.msdn.microsoft.com/bartd/2011/01/25/query-tuning-fundamentals-density-predicates-selectivity-and-cardinality/, more in line with the q as posed. And the article agrees that combining Selectivities is more or less useless "Note that the optimizer's guessed selectivity is wrong". – AntC Dec 18 '19 at 21:12
  • @AntC thank you! I honestly don't know if the new professor has any clue what they're talking about, and this definitely seems like a strange problem. I posted this same question on Reddit to see if anyone there had an idea and someone who seems to know what they're talking about confirmed that I treat it like a probability, so I just went with that... And thank you for that second article, that's one I hadn't yet found! – Nathan Egan Dec 19 '19 at 01:17
  • What is their textbook name & edition? Is their class using a particular DBMS? – philipxy Dec 19 '19 at 04:32
  • Thanks @Nathan. I've just noticed the q as posed asks for INCOME <= 90k whose Selectivity is 0.4: complement of the 0.6 from INCOME > 90k. Treating each column's selectivity as a probability, and `AND/OR` as multiplying/adding probabilities only makes sense if each column's values are independent and values equally spread across each other. But we know in the real world of employment being Female and young will correlate with being low-paid. The probability treatment seems hopelessly simplistic even for a Professor, or perhaps the idea is to prompt students to critique it? – AntC Dec 19 '19 at 10:30

0 Answers0