Date: 2024-12-18 09:36

DSCI 510: Principles of Programming for Data Science

Final Project Guidelines

In the final project for this class, you will have the opportunity to apply the knowledge and

programming skills you have learned to a real-world problem. Your project should focus on

web scraping (or collecting data through APIs), data cleaning, analysis, and visualization using

Python.

Final Project Due Date: December 19th, 2024 at 4pm PT

Final grade submission via the Grading and Roster System (GRS) for Fall 2024 is the week after

December 19th, and we need to have graded every project by then. We need to set some time aside

to grade your projects, so we have to be strict about this deadline.

Please refer to the Academic Calendar for the specific dates.

Final Project Submission via GitHub Classroom

In order to submit your final project assignment you will need to accept the assignment on our

GitHub Classroom (similar to the lab assignments). With the final assignment repository you

will get a template where you can upload all of your files. To get started,

Project Proposal

You may send a one page proposal document (in a PDF format) describing your final project.

This proposal should include the following:

1. Name of your final project and a short synopsis/description (1 paragraph max).

2. What problem are you trying to solve, which question(s) are you trying to answer?

3. How do you intend to collect the data and where on the web is it coming from?

4. What type of data cleaning and/or analysis are you going to perform on the data?

5. What kind of visualizations are you going to use to illustrate your findings?

There is no official due date for the proposal, but the sooner you send it to us the sooner you will

get feedback on it. We will provide feedback and suggest changes if required. This is usually to

test the feasibility of the project and give you a sense of whether you need to scale back because

it is too ambitious or if you need to do more work in order to improve your grade. Please upload

the original proposal in the same repository with the other files of your final project.

Note: For faster processing, you can send Gleb ([email protected]), Mia ([email protected])

or Zhivar ([email protected]) an email with the subject “DSCI 510: Final Project Proposal”.

Please also upload your proposal document to the final project GitHub repository. The

email should contain a link to your GitHub repository or the proposal.pdf file itself.

Project Goals and Steps

1. Data Collection (20%)

You should identify websites or web resources from which you will get raw data for your

project. You can either web-scrape data or collect data using publicly available APIs.

This could include news articles, e-commerce websites, social media posts, weather data,

or any other publicly available web content. This step should be sophisticated enough

to demonstrate the techniques you have learned in the class. Use multiple data sources

to compare different data in your analysis. Using Python libraries like BeautifulSoup and

requests, you should be able to write scripts to scrape data from the chosen websites. This

step includes making HTTP requests, handling HTML parsing, and extracting relevant

information.
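
As a rough illustration of the request, parse, and extract steps described above, here is a minimal sketch. It uses only the Python standard library (urllib and html.parser) as a stand-in for requests and BeautifulSoup so that it runs without installing anything. The page layout (headlines as links inside h2 elements), the function names, and the User-Agent string are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its text (defined here, not called below)."""
    # The User-Agent value is a made-up placeholder.
    request = Request(url, headers={"User-Agent": "dsci510-final-project"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class HeadlineParser(HTMLParser):
    """Collect the text and href of every <a> found inside an <h2>."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.current_href = None
        self.text_parts = []
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.in_h2:
            self.current_href = dict(attrs).get("href")
            self.text_parts = []

    def handle_data(self, data):
        # Only keep text while we are inside a headline link.
        if self.current_href is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            title = "".join(self.text_parts).strip()
            self.headlines.append({"title": title, "url": self.current_href})
            self.current_href = None
        elif tag == "h2":
            self.in_h2 = False


def extract_headlines(html_text):
    """Return a list of {"title", "url"} dicts parsed from html_text."""
    parser = HeadlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.headlines


# Hypothetical page content, used instead of a live request.
SAMPLE_HTML = (
    '<h2><a href="/a">First story</a></h2>'
    '<p><a href="/about">About</a></p>'
    '<h2><a href="/b">Second story</a></h2>'
)
print(extract_headlines(SAMPLE_HTML))
```

Keeping the download and the parsing in separate functions makes it possible to save the raw HTML first and re-run the parsing later without hitting the website again.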

Please note that if you need to collect data that changes over time, you might want to

set up a script that runs every day and collects the data at a certain time of the day. That

way you can collect enough data to run your analysis for the final project later.
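
One possible way to organize such a daily collection is to write each run into its own dated file, so that repeated runs build up a time series. The sketch below assumes the collected data fits in a JSON document and that files go to a data/raw folder. The function name and file naming scheme are illustrative, and the scheduling itself (for example with cron or the Windows Task Scheduler) is left outside the script.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_snapshot(payload, raw_dir="data/raw", when=None):
    """Write payload to raw_dir/<UTC date>.json and return the file path."""
    when = when or datetime.now(timezone.utc)
    target_dir = Path(raw_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # One file per day; a second run on the same day overwrites the first.
    target = target_dir / f"{when:%Y-%m-%d}.json"
    record = {"collected_at": when.isoformat(), "data": payload}
    target.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return target
```

A scheduled job would then call save_snapshot once per day with whatever the collection step returned.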

We recommend that you scrape data from static websites, or use publicly available APIs.

If you scrape data from dynamically generated pages, you might run into issues as certain

websites are not keen on giving away their data (think sites like Google, Amazon, etc.).

Please note that some APIs are not free and you need to pay to use them - you should

try to avoid those, because when we are grading your final project we should be able to replicate

your code without paying for an API.

2. Data Cleaning (20%)

Once your data collection is complete, you will need to clean the data in order to be able

to process it. This will involve handling missing values, cleaning HTML tags, removing

duplicates, and converting data into a structured format for analysis in Python. If your

raw data is not in English, you should attempt to translate the data into English as part

of this step.
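
The cleaning operations listed above could look roughly like the following sketch. It uses only the standard library (a pandas version would be equally valid), and the field names title and price, the sample rows, and the rule that duplicates are detected on the required fields are assumptions for illustration.

```python
import csv
import html
import io
import re

TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_html(text):
    """Remove tags, decode entities such as &amp;, and collapse whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(html.unescape(without_tags).split())


def clean_records(records, required=("title", "price")):
    """Drop rows with a missing required field and duplicate rows."""
    seen = set()
    cleaned = []
    for record in records:
        row = {
            key: "" if value is None else strip_html(str(value))
            for key, value in record.items()
        }
        if any(not row.get(field) for field in required):
            continue  # a required value is missing
        fingerprint = tuple(row[field] for field in required)
        if fingerprint in seen:
            continue  # same required values already kept
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


def to_csv(rows, fieldnames):
    """Serialize cleaned rows to CSV text (structured output)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up raw rows showing each problem the cleaning step handles.
RAW_ROWS = [
    {"title": "<b>Blue &amp; Green Mug</b>", "price": "12.50"},
    {"title": "Blue &amp; Green Mug", "price": "12.50"},
    {"title": "Red Mug", "price": None},
    {"title": "  Tea   Pot ", "price": "30"},
]
print(to_csv(clean_records(RAW_ROWS), ["title", "price"]))
```

A regular expression is a crude way to strip tags and is only reasonable for small fragments; for full pages, an HTML parser is the safer choice.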

Depending on the size of your data you can upload both raw and preprocessed data to the

data folder in the repository of your final project.

3. Data Analysis (20%)

In this step, you will perform an analysis on the scraped data to gain insights or answer

specific questions. You should perform statistical analyses and generate descriptive statistics

using libraries such as Pandas or NumPy (or any other library you prefer to use). You

should add a detailed description of this step and your specific methods of analysis in the

final report at the end.
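
As a small example of what descriptive statistics can mean in practice, the sketch below summarizes one numeric column with the standard library statistics module. Pandas offers the same kind of summary through DataFrame.describe(). The function name and the sample numbers are made up for the example.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a sequence of numbers."""
    numbers = [float(value) for value in values]
    return {
        "count": len(numbers),
        "mean": statistics.fmean(numbers),
        "median": statistics.median(numbers),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(numbers) if len(numbers) > 1 else 0.0,
        "min": min(numbers),
        "max": max(numbers),
    }


# Hypothetical measurements from one data source.
summary = describe([2, 4, 4, 4, 5, 5, 7, 9])
print(summary)
```

Running the same summary on each data source gives a simple first comparison before any deeper analysis.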

4. Data Visualization (20%)

Last but not least, you should create plots, graphs, or charts using Matplotlib, Seaborn,

D3.js, ECharts or any other data visualization library, to effectively communicate your

findings. Visualizations created in this step can be static or interactive. If they are

interactive, you need to describe this interaction and its added value in the final report.

Our team should be able to replicate your interactive visualizations when we are grading

your final projects.

5. Final Report (20%)

Finally, you will submit a final report describing your project and the problem you are trying

to solve or the questions that you are trying to answer. What data did you collect, and

how was it collected? What type of data processing/cleaning did you perform? You

will also need to explain your analysis and visualizations. See the Final Report section for

more information.

The percentages used for grading here are a general guideline, but they can be adjusted

based on your project. If your data collection is trivial but the analysis is fairly complicated,

you could score more points in the data analysis step to compensate. Similarly, complexity of

the final data visualizations could be used to get additional points if you decide to make your

visualizations more interactive and engaging to the end users.

Project Deliverables

GitHub Repository

We will create an assignment for the final project. You will need to accept the assignment and

commit your code and any additional files (e.g. raw data or processed data) to the repository.

Here is a generic structure of the repository:

github_repository/
    .gitignore
    README.md
    requirements.txt
    data/
        raw/
        processed/
    proposal.pdf
    results/
        images/
        final_report.pdf
    src/
        get_data.py
        clean_data.py
        analyze_data.py
        visualize_results.py
        utils/
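
If you ever need to recreate this layout by hand (the GitHub Classroom template should normally provide it), a short script along these lines would do it. This is only a sketch: the helper name is hypothetical, and proposal.pdf and final_report.pdf are left out on purpose because they are documents you write rather than empty placeholders.

```python
from pathlib import Path

# Folder and file names taken from the structure listed above.
DIRECTORIES = ["data/raw", "data/processed", "results/images", "src/utils"]
FILES = [
    ".gitignore",
    "README.md",
    "requirements.txt",
    "src/get_data.py",
    "src/clean_data.py",
    "src/analyze_data.py",
    "src/visualize_results.py",
]


def create_skeleton(root):
    """Create the folders and empty files under root, return what exists."""
    root = Path(root)
    for directory in DIRECTORIES:
        (root / directory).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch(exist_ok=True)  # keeps existing content intact
    return sorted(path.relative_to(root).as_posix() for path in root.rglob("*"))
```

Note that git does not track empty folders, so a folder such as data/raw only shows up in the repository once it contains at least one file.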

And here is a description of what each of the folders/files could contain:

1. proposal.pdf

The project proposal file (PDF). This is what you can send us in advance to see if your

project meets the minimum requirements or if the scope is too large and if you need to

scale it back. See the section: Project Proposal.

2. requirements.txt

This file lists all of the external libraries you have used in your project and the specific

version of each library that you used (e.g. pandas, requests, etc.). You can create this file

manually or run the following command in your virtual (conda) environment:

pip freeze >> requirements.txt

To install all of the required libraries based on this requirements file, run this command:

pip install -r requirements.txt
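
pip freeze lists every package installed in the environment. If you would rather pin only the libraries your code imports directly, a sketch like the following can build the lines for you. It relies on importlib.metadata from the standard library, and the function name and the package list passed to it are assumptions for the example.

```python
from importlib import metadata


def pinned_requirements(packages):
    """Return 'name==version' lines for the packages that are installed."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            continue  # not installed in this environment, so nothing to pin
    return lines


# Example package names; replace them with what your project imports.
print("\n".join(pinned_requirements(["pandas", "requests", "matplotlib"])))
```

The printed lines can be pasted into requirements.txt. Packages that are not installed are silently skipped, so check that the output lists everything you expect.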

3. README.md

This file typically contains installation instructions, or the documentation on how to install

the requirements and ultimately run your project. Here you can explain how to run your

code, how to get the data, how to clean the data, how to run the analysis code and finally

how to produce the visualizations. We have created sections in the README.md file for

you to fill in. Make sure you fill in all of the sections.

Please note that this file is the most important to us as we will try to reproduce your results

on our end to verify that everything is working. If there is anything that is tricky about

the installation of your project, you want to mention it here to make it easier for us to run

your project.

4. data/ directory

Simply put, this folder contains the data that you used in this project.

(a) The raw data folder will have the raw files you downloaded/scraped from the web. It

could contain (not exhaustive) html, csv, xml or json files. If your raw data happens

to be too large to upload to GitHub (i.e. larger than 25 MB) then please upload your

data to the USC Google Drive and provide a link to the data in your README.md

file.

(b) The processed data folder will contain your structured files after data cleaning. For

example, you could clean the data and convert them to JSON or CSV files. Your

analysis and visualization code should perform operations on the files in this folder.

Note: Make sure your individual files are less than 25 MB in size. You can use

USC Google Drive if the files are larger than 25 MB. In that case, please provide

a link for us to get to the data in your README.md file.
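
Before committing, it can help to check the data folder for files over the size limit. The sketch below is one way to do that. It assumes the limit means 25 * 1024 * 1024 bytes and that the data lives in a folder named data, and the function name is illustrative.

```python
from pathlib import Path

# Assumption: the 25 MB limit is read as 25 * 1024 * 1024 bytes.
LIMIT_BYTES = 25 * 1024 * 1024


def oversized_files(root="data", limit_bytes=LIMIT_BYTES):
    """Return the relative paths of files under root larger than the limit."""
    root = Path(root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.stat().st_size > limit_bytes
    )
```

Any path this returns is a candidate for uploading to USC Google Drive and linking from README.md instead of committing it.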

5. results/ directory

This folder will contain your final project report and any other files you might have as part

of your project. For example, if you choose to create a Jupyter Notebook for your data

visualizations, this notebook ffle should be in this results folder. If you have any static

images of the data visualizations, those images should go in this folder as well.

6. src/ directory

This folder contains the source code for your project.

(a) get_data.py will download, web-scrape or fetch the data from an API and store it in

the data/raw folder.

(b) clean_data.py will clean the data, transform the data and store structured data files

in the data/processed folder, for example as csv or json files.

(c) analyze_data.py will contain methods used to analyze the data to answer the

project-specific questions.

(d) visualize_results.py will create any data visualizations using matplotlib or any other

library to conclude the analysis you performed.

(e) utils/ folder should contain any utility functions that you need in order to process

your code, this could be something generic such as regular expressions used to clean

the data or to parse and lowercase otherwise case-sensitive information.

7. .gitignore

Last but not least, the .gitignore file is here to help keep certain meta-data or otherwise

unnecessary files from being added to the repository. This includes files that were used

in development or were created as a by-product but are not necessary for you to run the

project (for example, cached files added by using various IDEs like VS Code or PyCharm).

Please note that this project structure is only a suggestion, feel free to add more files or change

the names of files and folders as you prefer. That being said, please take into account that we

will be looking for the specific files to get the data, clean the data, analyze data, etc. You can

change this structure or create more files in this repository as you like, but please describe

where everything is in your README.md file.

Final Report

You’ll find an empty template for the final report document (pdf) in the GitHub repository once

you accept our final project assignment. At the very least, your final report should have the

following sections:

1. What is the name of your project?

(a) Please write it as a research question and provide a short synopsis/description.

(b) What is/are the research question(s) that you are trying to answer?

2. What type of data did you collect?

(a) Specify exactly where the data is coming from.

(b) Describe the approach that you used for data collection.

(c) How many different data sources did you use?

(d) How much data did you collect in total? How many samples?

(e) Describe what changed from your original plan (if anything changed) as well as the

challenges that you encountered and resolved.

3. What kind of analysis and visualizations did you do?

(a) Which analysis techniques did you use, and what are your findings?

(b) Describe the type of data visualizations that you made.

(c) Explain the setup and meaning of each element.

(d) Describe your observations and conclusion.

(e) Describe the impact of your findings.

4. Future Work

(a) Given more time, what would you do in order to further improve your project?

(b) Would you use the same data sources next time? Why or why not?

Your final project report should be no less than 2 and no more than 5 pages including any images

(e.g. of data visualizations) that you want to embed in the report. Please spend a decent amount

of time on the report. Your report is the first file we will read. We will not know how great your

project is if you don’t explain it clearly and in detail.
