COMP 330 Assignment #5
1 Description
In this assignment, you will be implementing a regularized logistic regression to classify text documents. The implementation will be in Python, on top of Spark. To handle the large data set that we will be
giving you, it is necessary to use Amazon AWS.
You will be asked to perform three subtasks: (1) data preparation, (2) learning (which will be done via
gradient descent) and (3) evaluation of the learned model.
Note: It is important to complete HW 5 and Lab 5 before you really get going on this assignment. HW
5 will give you an opportunity to try out gradient descent for learning a model, and Lab 5 will give you
some experience with writing efficient NumPy code, both of which will be important for making your A5
experience less challenging!
2 Data
You will be dealing with a data set that consists of around 170,000 text documents and a test/evaluation
data set that consists of 18,700 text documents. All but around 6,000 of these text documents are Wikipedia
pages; the remaining documents are descriptions of Australian court cases and rulings. At the highest level,
your task is to build a classifier that can automatically figure out whether a text document is an Australian
court case.
We have prepared three data sets for your use.
1. The Training Data Set (1.9 GB of text). This is the set you will use to train your logistic regression
model:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/TrainingDataOneLinePerDoc.txt
or as direct S3 address, so you can use it in a Spark job:
s3://chrisjermainebucket/comp330 A5/TrainingDataOneLinePerDoc.txt
2. The Testing Data Set (200 MB of text). This is the set you will use to evaluate your model:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/TestingDataOneLinePerDoc.txt
or as direct S3 address, so you can use it in a Spark job:
s3://chrisjermainebucket/comp330 A5/TestingDataOneLinePerDoc.txt
3. The Small Data Set (37.5 MB of text). This is for you to use for training and testing of your model on
a smaller data set:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/SmallTrainingDataOneLinePerDoc.txt
Some Data Details to Be Aware Of. You should download and look at the SmallTrainingDataOneLinePerDoc.txt
file before you begin. You'll see that the contents are sort of a pseudo-XML, where each text document
begins with a <doc id = ... > tag, and ends with </doc>. All documents are contained on a single
line of text.
Note that all of the Australian legal cases begin with something like <doc id = "AU1222" ...>;
that is, the doc id for an Australian legal case always starts with AU. You will be trying to figure out whether a
document is an Australian legal case by looking only at the contents of the document.
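For concreteness, here is a minimal PySpark sketch (under stated assumptions, not the required solution) of how each pseudo-XML line could be parsed into a (docID, label, text) triple; the id is used only to construct the training label. It assumes an existing SparkContext named sc, and the regular expression is just one plausible way to pull the id out of the tag:

import re

def parse_line(line):
    # Pull out the value of the id attribute; the exact quoting in the file
    # may differ, so adjust the pattern after inspecting the small data set.
    m = re.search(r'id\s*=\s*"([^"]+)"', line)
    doc_id = m.group(1) if m else ''
    # Drop the surrounding <doc ...> and </doc> tags, keeping only the body text.
    text = re.sub(r'</?doc[^>]*>', ' ', line)
    label = 1 if doc_id.startswith('AU') else 0   # 1 = Australian court case
    return (doc_id, label, text)

corpus = sc.textFile('SmallTrainingDataOneLinePerDoc.txt').map(parse_line)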
3 The Tasks
There are three separate tasks that you need to complete to finish the assignment. As usual, it makes
sense to implement these and run them on the small data set before moving to the larger one.
3.1 Task 1
First, you need to write Spark code that builds a dictionary that includes the 20,000 most frequent words
in the training corpus. This dictionary is essentially an RDD that has the word as the key, and the relative
frequency position of the word as the value. For example, the value is zero for the most frequent word, and
19,999 for the least frequent word in the dictionary.
To get credit for this task, give us the frequency position of the words “applicant”, “and”, “attack”,
“protein”, and “car”. These should be values from 0 to 19,999, or -1 if the word is not in the dictionary
because it is not in the top 20,000.
Note that accomplishing this will require you to use a variant of your A4 solution. If you do not trust
your A4 solution and would like mine, you can post a private request on Piazza.
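As a starting point, the sketch below shows one way Task 1 could look in PySpark. It assumes the corpus RDD of (docID, label, text) triples from the earlier parsing sketch and a simple lower-cased alphabetic tokenizer; it is a sketch under those assumptions, not the required solution:

import re

def tokenize(text):
    return re.findall(r'[a-z]+', text.lower())

word_counts = (corpus
    .flatMap(lambda x: tokenize(x[2]))
    .map(lambda w: (w, 1))
    .reduceByKey(lambda a, b: a + b))

top_words = word_counts.top(20000, key=lambda wc: wc[1])   # (word, count), most frequent first
dictionary = (sc.parallelize(top_words)
    .zipWithIndex()                                        # ((word, count), position)
    .map(lambda p: (p[0][0], p[1])))                       # (word, position)

lookup = dict(dictionary.collect())
for w in ['applicant', 'and', 'attack', 'protein', 'car']:
    print(w, lookup.get(w, -1))                            # -1 if not in the top 20,000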
3.2 Task 2
Next, you will convert each of the documents in the training set to a TF-IDF vector. You will then use
a gradient descent algorithm to learn a logistic regression model that can decide whether a document is
describing an Australian court case or not. Your model should use l2 regularization; you can play with
things a bit to determine the parameter controlling the extent of the regularization. With this much data,
you may find that the regularization is not too important (that is, you may get good
results with a very small weight given to the regularization constant).
I am going to ask that you not just look up the gradient descent algorithm on the Internet and implement
it. Start with the LLH function from class, and then derive your own gradient descent algorithm. We can
help with this if you get stuck.
At the end of each iteration, compute the LLH of your model. You should run your gradient descent
until the change in LLH across iterations is very small.
Once you have completed this task, you will get credit by (a) writing up your gradient update formula,
and (b) giving us the fifty words with the largest regression coefficients; that is, the fifty words that are
most strongly associated with an Australian court case.
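To make the shape of the computation concrete, here is a minimal NumPy sketch of a batch gradient-ascent loop on the regularized LLH. The names X (an n x d matrix of TF-IDF vectors), y (a vector of 0/1 labels), lam, learning_rate, and tol are illustrative, not fixed by the assignment, and the gradient written here is the standard one for l2-regularized logistic regression; you are still expected to derive your own update from the LLH covered in class:

import numpy as np

def train(X, y, lam=1e-4, learning_rate=1e-3, tol=1e-6, max_iters=500):
    n, d = X.shape
    theta = np.zeros(d)
    old_llh = -np.inf
    for it in range(max_iters):
        scores = X.dot(theta)
        probs = 1.0 / (1.0 + np.exp(-scores))              # sigmoid
        # l2-regularized log-likelihood of the current model (see Section 4,
        # point 3, for a more overflow-resistant way to compute it).
        llh = np.sum(y * scores - np.log(1.0 + np.exp(scores))) - lam * theta.dot(theta)
        # Gradient of the regularized LLH; step uphill, since we are maximizing.
        grad = X.T.dot(y - probs) - 2.0 * lam * theta
        theta = theta + learning_rate * grad
        if abs(llh - old_llh) < tol:                       # change in LLH is very small
            break
        old_llh = llh
    return theta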
3.3 Task 3
Now that you have trained your model, it is time to evaluate it. Here, you will use your model to predict
whether or not each of the testing points corresponds to an Australian court case. To get credit for this task,
you need to compute for us the F1 score obtained by your classifier—we will use the F1 score obtained as
one of the ways in which we grade your Task 3 submission.
Also, I am going to ask you to actually look at the text for three of the false positives that your model
produced (that is, Wikipedia articles that your model thought were Australian court cases). Write a paragraph
describing why you think your model was fooled. Were the bad documents about Australia? The
legal system?
If you don’t have three false positives, just use the ones that you had (if any).
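One straightforward way to compute the F1 score is sketched below with illustrative names (y_true and y_pred as NumPy arrays of 0/1 labels, doc_ids as a parallel list of document ids, none of which are fixed by the assignment); it also collects a few false positives for the write-up:

import numpy as np

def f1_score(y_true, y_pred):
    # True positives, false positives, and false negatives for the
    # "Australian court case" class (label 1).
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0

# Grab up to three false positives (Wikipedia pages flagged as court cases) to read.
false_positive_ids = [doc_ids[i] for i in range(len(doc_ids))
                      if y_pred[i] == 1 and y_true[i] == 0][:3]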
4 Important Considerations
Some notes regarding training and implementation. As you implement and evaluate your gradient descent algorithm, here are a few things to keep in mind.
1. To get good accuracy, you will need to center and normalize your data: transform your data so
that the mean of each dimension is zero and the standard deviation is one. That is, subtract the mean
vector from each data point, and then divide the result by the vector of standard deviations computed
over the data set.
2. When classifying new data, a data point whose dot product with the set of regression coefs is positive
is a “yes” and a negative one is a “no” (see slide 15 in the GLM lecture). You will be trying to maximize the
F1 of your classifier, and you can often increase the F1 by choosing a cutoff between “yes”
and “no” other than zero. Another thing that you can do is to add another dimension whose value is
one in each data point (we discussed this in class). The learning process will then choose a regression
coef for this special dimension that tends to balance the “yes” and “no” nicely at a cutoff of zero.
However, some students in the past have reported that this can increase the training time.
3. Students sometimes face overflow problems, both when computing the LLH and when computing the
gradient update. Some things that you can do to avoid this are: (1) use np.exp(), which seems to
be quite robust, and (2) transform your data so that the standard deviation is smaller than one; if you
have problems with a standard deviation of one, you might try 10^-2 or even 10^-5. You may need to
experiment a bit. Such are the wonderful aspects of implementing data science algorithms in the real
world! (A sketch of a more overflow-resistant way to compute the LLH appears after this list.)
4. If you find that your training takes more than a few hours to run to convergence on the largest data set,
it likely means that you are doing something that is inherently slow that you can speed up by looking
at your code carefully. One thing: there is no problem with first training your model on a small sample
of the large data set (say, 10% of the documents), then using the result as an initialization and continuing
to train on the full data set. This can speed up the process of reaching convergence.
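The following sketch illustrates points 1 and 3 above, assuming the TF-IDF features have been gathered into a NumPy matrix X (or, equivalently, that the mean and standard-deviation vectors have been computed over the RDD); target_std is just an illustrative knob for shrinking the standard deviation:

import numpy as np

def center_and_scale(X, target_std=1.0):
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                  # avoid dividing by zero on constant dimensions
    # Setting target_std below one (e.g. 1e-2 or 1e-5) is the rescaling trick from point 3.
    return (X - mean) / std * target_std

def stable_llh(X, y, theta, lam):
    scores = X.dot(theta)
    # np.logaddexp(0, s) computes log(1 + exp(s)) without overflowing for large s.
    return np.sum(y * scores - np.logaddexp(0.0, scores)) - lam * theta.dot(theta)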
Big data, small data, and grading. The first two tasks are worth three points, and the last is worth four. Since it
can be challenging to run everything on a large data set, we'll offer you a small data option. If you train your
model on TestingDataOneLinePerDoc.txt, and then test it on SmallTrainingDataOneLinePerDoc.txt,
we'll take off 0.5 points on Task 2 and 0.5 points on Task 3. This means you can still get an A, and
you don't have to deal with the big data set. To be eligible for full credit, you can train
your model on the quite large TrainingDataOneLinePerDoc.txt data set, and then test it
on TestingDataOneLinePerDoc.txt.
4.1 Machines to Use
If you decide to try for full credit on the big data set, you will need to run your Spark jobs on three to five
machines as workers, each having around 8 cores. If you are not trying for the full credit, you can likely
get away with running on a smaller cluster. Remember, the costs WILL ADD UP QUICKLY IF YOU
FORGET TO SHUT OFF YOUR MACHINES. Be very careful, and shut down your cluster as soon as
you are done working. You can always create a new one easily when you begin your work again.
4.2 Turnin
Create a single document that has results for all three tasks. Make sure to be very clear whether you
tried the big data or small data option. Turn in this document as well as all of your code. Please zip up all
of your code and your document (use .gz or .zip only, please!), or else attach each piece of code as well as
your document to your submission individually. Do NOT turn in anything other than your Python code and
your document.