Preface (P121)
One of the limitations of replicated join is that one of the join tables has to be small enough to fit in memory. Even with the usual asymmetry of size in the input sources, the smaller one may still not be small enough. You can often solve this problem by rearranging the processing steps to make them more efficient. For example, if you're looking for the order history of all customers in the 415 area code, it's correct but inefficient to join the Orders and Customers tables first and only then keep the records where the customer is in the 415 area code. Both Orders and Customers may be too big for replicated join, forcing you back to the inefficient reduce-side join. A better approach is to first filter the Customers table down to the customers in the 415 area code and store the result in a temporary file called Customers415. We arrive at the same end result by joining Orders with Customers415, but now Customers415 is small enough for a replicated join. There is some overhead in creating and distributing the Customers415 file, but it's often outweighed by the overall gain in efficiency.
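The filter-first idea can be sketched in plain Python (not actual MapReduce code). The table contents and the `area_code()` helper are hypothetical, standing in for the Customers and Orders data sets:

```python
# A minimal sketch of "filter first, then replicated join".
# The records and the area_code() helper are made up for illustration.

def area_code(phone):
    """Extract the area code from a phone number like '415-555-0101'."""
    return phone.split("-")[0]

customers = [
    (1, "Stephanie Leung", "555-555-0101"),
    (2, "Edward Kim", "415-555-0102"),
    (3, "Jose Madriz", "415-555-0103"),
]
orders = [
    (3, "A", 12.95),
    (1, "B", 88.25),
    (2, "C", 32.00),
]

# Step 1: filter Customers down to the 415 area code ("Customers415").
customers415 = {cid: (name, phone) for cid, name, phone in customers
                if area_code(phone) == "415"}

# Step 2: replicated join -- Customers415 is small enough to hold in
# memory, so each "mapper" joins Orders against it directly.
joined = [(cid, customers415[cid][0], item, price)
          for cid, item, price in orders if cid in customers415]

print(joined)  # only orders from 415-area-code customers survive
```

The join over the full Customers table would produce the same rows for these customers; filtering first simply shrinks the in-memory side of the join.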
Sometimes you have so much data that replicated join isn't possible no matter how you rearrange your processing steps. Don't worry. There are still ways to make reduce-side joining more efficient. Recall that the main problem with reduce-side join is that the mapper only tags the data, all of which is shuffled across the network even though most of it is then ignored in the reducer. The inefficiency is greatly reduced if the mapper can prefilter and eliminate most, or even all, of the unnecessary data before it's shuffled across the network. What we need is to build this filtering mechanism.
Semijoin: reduce-side join with map-side filtering
Continuing our example of joining Customers415 with Orders, the join key is Customer ID, and we would like our mappers to drop any customer not in the 415 area code rather than send those records to the reducers. We create a data set CustomerID415 that stores only the Customer IDs of customers in the 415 area code. CustomerID415 is smaller than Customers415 because it has just one data field. Assuming CustomerID415 fits in memory, we can improve reduce-side join by using the distributed cache to disseminate CustomerID415 to all the mappers. When processing records from Customers and Orders, a mapper drops any record whose key is not in the set CustomerID415. This is sometimes called a semijoin, borrowing terminology from the database world.
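The map-side filtering step can be sketched as follows. In a real Hadoop job the key set would be shipped via the distributed cache and loaded in each mapper's setup; here a plain Python set stands in for it, and the record layouts are hypothetical:

```python
# Sketch of the semijoin's map-side filter. The set below stands in for
# CustomerID415 loaded from the distributed cache; record layouts are
# made up for illustration.

customer_id_415 = {2, 3}  # Customer IDs of customers in the 415 area code

def map_with_filter(record, source):
    """Tag a record for reduce-side join, but drop keys outside the set."""
    key = record[0]
    if key not in customer_id_415:
        return None  # filtered in the mapper; never shuffled across the network
    return (key, (source, record))

# A customer outside the 415 area code never leaves the mapper:
print(map_with_filter((1, "Stephanie Leung"), "Customers"))  # None
# A matching order is tagged with its source and emitted as usual:
print(map_with_filter((3, "A", 12.95), "Orders"))
```

The reducer logic is unchanged from an ordinary reduce-side join; the only difference is how much less data reaches it.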
Last but not least, what if the file CustomerID415 is still too big to fit in memory? Or perhaps CustomerID415 does fit in memory, but its size makes replicating it across all the mappers inefficient. This situation calls for a data structure called a Bloom filter. A Bloom filter is a compact representation of a set that supports only membership queries ("Does this set contain this element?"). The answers are not completely accurate, but they are guaranteed to have no false negatives and only a small probability of false positives; this slight inaccuracy is the trade-off for the data structure's compactness. Using a Bloom filter representation of CustomerID415, the mappers still pass through every customer in the 415 area code, so the correctness of the data join algorithm is preserved. The Bloom filter will also pass a small fraction of customers not in the 415 area code to the reduce phase, but that's fine because those records are ignored there, and we still improve performance by dramatically reducing the amount of data shuffled across the network. Using Bloom filters is in fact a standard technique for joining in distributed databases, and it's used in commercial products such as Oracle 11g. We'll describe the Bloom filter and its other applications in more detail in the next section.
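A toy Bloom filter makes the guarantees concrete: members are never missed, and a non-member is rejected except with a small probability. This is a minimal sketch; production implementations (such as Hadoop's own Bloom filter classes) use tuned sizes and better hash families than the double-hashing scheme shown here:

```python
# A toy Bloom filter illustrating "no false negatives, rare false
# positives". Sizes and hashing here are for illustration only.
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k          # m bits, k hash functions
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        h = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
for cid in (2, 3):      # stand-in for the CustomerID415 key set
    bf.add(cid)

print(2 in bf)  # True: members are never missed (no false negatives)
print(1 in bf)  # almost certainly False; a rare True would only mean one
                # extra record reaching the reduce phase, not a wrong join
```

Note that the filter stores 1,024 bits regardless of how many keys it holds, which is exactly why it can be replicated to every mapper even when the key set itself cannot.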
Supplement
* Chaining MapReduce jobs (Part 1)
* Joining data from different sources (Part 2)
* Replicated joins using DistributedCache (Part 3)
* Reduce-side join with map-side filtering (Part 4)
* Creating a Bloom filter (Part 5)